The parallelization of the computational workload outlined in the previous paragraphs requires synchronization between the radiative transfer tasks and the NLTE tasks, since the radiation field and the operator must be passed between them. In addition, our standard model calculations use 50 radial grid points and as the number of nodes increases, so too does the communication and loop overhead, therefore, pure data parallelism does not deliver good scalability. We found good speedup up to about 5 nodes for a typical calculation, with the speedup close to the theoretical maximum. However, for 5 nodes the communication and loop overheads begin to become significant and it is not economical to use more than 10 nodes (depending on the machine and the model calculation, it might be necessary to use more nodes to fit the data in the memory available on a single node).
Since the number of wavelength points in a calculation is very large and the CPU time scales linearly with the number of wavelength points, parallelization with respect to the wavelength points can lead to large speedups and to the ability to use very large numbers of nodes available on massively parallel supercomputers. This poses no difficulties for static configuration, but the coupling of the wavelengths points in expanding atmospheres makes the wavelength parallelization much more complex.
We have developed a wavelength parallelization based on a toroidal topology that uses the concept of wavelength ``clusters'' to distribute a set of wavelength points (for the solution of the wavelength dependent radiative transfer) onto a different set of nodes, see Fig. 3 . In order to achieve optimal load balance and, more importantly, in order to minimize the memory requirements, each cluster (a column of nodes indicated in Fig. 3) works on a single wavelength point at any given time. Each cluster can consist of a number of ``worker'' nodes where the worker node group uses parallelization methods discussed above (see also Ref. ). In order to avoid communication overhead, we use symmetric wavelength clusters: each ``row'' of worker nodes in Fig. 3 performs identical tasks but on a different set of wavelength points for each cluster. We thus arrange the total number of nodes N in a rectangular matrix with n columns and m rows, where n is the number of clusters and m is the number of workers for each cluster, such that N=n * m. Another way of visualizing this parallelization technique is to consider each wavelength cluster as a single entity (although not a single node or CPU) that performs a variety of different tasks at each wavelength point. The entity (cluster) itself is then further subdivided into individual nodes or CPUs each of which perform a given set of tasks at a particular wavelength point. This topology can be implemented very efficiently in the context of the MPI library, see  for details.
For a static model atmosphere, all wavelengths and thus wavelength clusters are completely independent and execute in parallel with no immediate communication or synchronization along the rows of Fig. 3. Communication is only necessary after the calculation is complete for all wavelengths points on all nodes to collect, e.g., the rates and rate operators. Therefore, the speedup is close (80%) to the theoretical maximum, limited only by to combined IO bandwidth of the machine used.
In order to parallelize the spectrum calculations for a model atmosphere with a global velocity field, such as the expanding atmospheres of novae, supernovae or stellar winds, we need to take the mathematical character of the RTE into account. For monotonic velocity fields, the RTE is an initial value problem in wavelength (with the initial condition at the smallest wavelength for expanding atmospheres and at the largest wavelength for contracting atmospheres). This initial value problem must be discretized fully implicitly to ensure stability. In the simplest case of a first order discretization, the solution of the RTE for wavelength point i depends only on the results of the point i-1. In order to parallelize the spectrum calculations, the wavelength cluster ni computing the solution for wavelength point i must know the specific intensities from the cluster ni-1 computing the solution for point i-1. This suggests a ``pipeline'' solution to the wavelength parallelization, transforming the ``matrix'' arrangement of nodes into a ``torus'' arrangement where data are sent along the torus' circumference. Note that only the solution of the RTE is affected by this, the calculation of the opacities and rates remains independent between different wavelength clusters. In this case, the wavelength parallelization works as follows: Each cluster can independently compute the opacities and start the RT calculations (e.g., the calculations, hereafter called the pre-processing phase), it then waits until it receives the specific intensities for the previous wavelength point, then it finishes the solution of the RTE and immediately sends the results to the wavelength cluster calculating the next wavelength point (to minimize waiting time, this is done with non-blocking send/receives), then proceeds to calculate the rates etc. (hereafter called the post-processing phase and the new opacities for its next wavelength point and so on.
The important point in this scheme is that each wavelength cluster can execute the post-processing phase of its current wavelength point and pre-processing phase of its next wavelength point independently and in parallel with all other clusters. This means that the majority of the total computational work can be done in parallel, leading to a substantial reduction in wall-clock time per model. Ideal load balancing can be obtained by dynamically allocating wavelength points to wavelength clusters. This requires only primitive logic with no measurable overhead, however it requires also communication and an arbitration/synchronization process to avoid deadlocks. Typically, the number of clusters n (4-64) is much smaller than the number of wavelength points, , so that at any given time the work required for each wavelength point is roughly the same for each cluster (the work changes as the number of overlapping lines changes, for example). Therefore, a simple round robin allocation of wavelength points to clusters (cluster i calculates wavelength points i, n+i, 2n+i and so on) can be used which will result in nearly optimal performance if the condition is fulfilled. However, once the pipeline is full, adding further wavelength clusters cannot decrease the time required for the calculations, setting a limit for the efficient ``radius'' of the torus topology. However, this limit can be increased somewhat by increasing the number of worker nodes per wavelength cluster.