The parallelization of the computational workload outlined in the previous
paragraphs requires synchronization between the radiative transfer tasks and
the NLTE tasks, since the radiation field and the operator must be passed
between them. In addition, our standard model calculations use 50 radial
grid points, and as the number of nodes increases, so does the communication
and loop overhead; therefore, pure data parallelism does not deliver good
scalability. We found good speedup, close to the theoretical maximum, for up
to about 5 nodes in a typical calculation. Beyond about 5 nodes, however, the
communication and loop overheads begin to become significant, and it is not
economical to use more than 10 nodes (depending on the machine and the model
calculation, it might be *necessary* to use more nodes to fit the data in the
memory available on a single node).

Since the number of wavelength points in a calculation is very large and the CPU time scales linearly with the number of wavelength points, parallelization with respect to the wavelength points can lead to large speedups and to the ability to use the very large numbers of nodes available on massively parallel supercomputers. This poses no difficulties for static configurations, but the coupling of the wavelength points in expanding atmospheres makes the wavelength parallelization much more complex.

We have developed a wavelength parallelization based on a toroidal
topology that uses
the concept of wavelength ``clusters'' to distribute a set of wavelength
points (for the solution of the wavelength dependent radiative transfer)
onto a different set of nodes, see Fig. 3 [9].
In order to achieve optimal load balance and, more importantly, in order
to minimize the memory requirements, each cluster (a column of
nodes indicated in Fig. 3) works on a single
wavelength point at any given time. Each cluster can consist of a *number*
of ``worker'' nodes, where the worker nodes use the parallelization methods
discussed above (see also Ref. [8]). In order to avoid
communication
overhead, we use *symmetric* wavelength clusters:
each ``row'' of worker nodes in Fig. 3 performs identical
tasks but on a different set of wavelength points for each cluster. We thus
arrange the total number of nodes *N* in a rectangular matrix with *n*
columns and *m* rows, where *n* is the number of clusters and *m* is
the number of workers for each cluster, such that *N* = *n* × *m*. Another
way of visualizing this parallelization technique is to consider each
wavelength cluster as a single entity (although not a single node or CPU)
that performs a variety of different tasks at each wavelength
point. The entity (cluster) itself is then further subdivided into individual
nodes or CPUs, each of which performs a given set of tasks *at a
particular wavelength point*. This topology can be implemented
very efficiently in the context of the `MPI` library, see [8]
for details.
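
As an illustration, a minimal sketch of such a node matrix in `MPI` might split `MPI_COMM_WORLD` into one communicator per wavelength cluster (a column in Fig. 3) and one per row of workers. The cluster count and all variable names below are illustrative assumptions, not the actual implementation of Ref. [8]:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative choice of n clusters; assumes size is divisible by n. */
    int n_clusters = 4;                  /* n: number of wavelength clusters */
    int n_workers  = size / n_clusters;  /* m: workers per cluster, N = n*m  */
    (void)n_workers;

    int cluster_id = rank % n_clusters;  /* column index in Fig. 3 */
    int worker_id  = rank / n_clusters;  /* row index in Fig. 3    */

    /* All workers of one wavelength cluster share a "column" communicator. */
    MPI_Comm cluster_comm;
    MPI_Comm_split(MPI_COMM_WORLD, cluster_id, worker_id, &cluster_comm);

    /* Nodes performing identical tasks share a "row" communicator.         */
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, worker_id, cluster_id, &row_comm);

    /* ... radiative transfer and NLTE work for this node's tasks ...       */

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&cluster_comm);
    MPI_Finalize();
    return 0;
}
```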

For a static model atmosphere, all wavelengths and thus wavelength
clusters are completely independent and execute in parallel with
*no* immediate communication or synchronization along the rows
of Fig. 3. Communication is only necessary *after*
the calculation is complete for all wavelength points on all nodes, in order
to collect, e.g., the rates and rate operators. Therefore, the speedup is
close (80%) to the theoretical maximum, limited only by the combined I/O
bandwidth of the machine used.
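
The collection of such wavelength-integrated quantities maps naturally onto a single reduction operation. A minimal sketch, assuming the communicators of the previous example and an illustrative array size, could look as follows:

```c
#include <mpi.h>

#define NRATES 1024   /* illustrative size of the rate array */

/* Sum the partial rates accumulated by each wavelength cluster into the
 * wavelength-integrated totals needed by the NLTE tasks; row_comm is the
 * "row" communicator of the previous sketch. */
void collect_rates(double *rates_local, double *rates_total, MPI_Comm row_comm)
{
    MPI_Allreduce(rates_local, rates_total, NRATES, MPI_DOUBLE,
                  MPI_SUM, row_comm);
}
```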

In order to parallelize the spectrum calculations for a model atmosphere with a
*global velocity field*, such as the expanding atmospheres of novae,
supernovae or stellar winds, we need to take the mathematical character of the
RTE into account. For monotonic velocity fields, the RTE is an initial value
problem in wavelength (with the initial condition at the smallest wavelength
for expanding atmospheres and at the largest wavelength for contracting
atmospheres). This initial value problem must be discretized fully implicitly
to ensure stability. In the simplest case of a first order discretization, the
solution of the RTE for wavelength point *i* depends only on the results of the
point *i*-1. In order to parallelize the spectrum calculations, the wavelength
cluster *n*_{i} computing the solution for wavelength point *i* must know the
specific intensities from the cluster *n*_{i-1} computing the solution for
point *i*-1. This suggests a ``pipeline'' solution to the wavelength
parallelization, transforming the ``matrix'' arrangement of nodes into a
``torus'' arrangement where data are sent along the torus' circumference. Note
that only the solution of the RTE is affected by this, the calculation of the
opacities and rates remains independent between different wavelength clusters.
In this case, the wavelength parallelization works as follows: Each cluster can
independently compute the opacities and start the RT calculations
(hereafter called the *pre-processing phase*); it
then waits until it receives the specific intensities for the previous
wavelength point, then it finishes the solution of the RTE and
*immediately* sends the results to the wavelength cluster calculating the next
wavelength point (to minimize waiting time, this is done with non-blocking
send/receives); it then proceeds to calculate the rates etc. (hereafter called
the *post-processing phase*) and the new opacities for its *next*
wavelength point, and so on.
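
A hedged sketch of one such pipeline step, using non-blocking `MPI` point-to-point calls between neighbouring clusters on the torus, is given below; the function and variable names are illustrative placeholders, not the actual routines of our code:

```c
#include <mpi.h>

#define NDEPTH 50   /* illustrative size of one set of specific intensities */

/* One pipeline step for wavelength point i on this cluster.  prev_cluster
 * and next_cluster are the torus neighbours handling points i-1 and i+1;
 * first_point flags the initial condition at the smallest wavelength. */
void rte_pipeline_step(double *I_prev, double *I_cur,
                       int prev_cluster, int next_cluster,
                       int first_point, MPI_Comm torus_comm)
{
    MPI_Request recv_req, send_req;

    /* Post the receive for the intensities of point i-1 early, ...          */
    if (!first_point)
        MPI_Irecv(I_prev, NDEPTH, MPI_DOUBLE, prev_cluster, 0,
                  torus_comm, &recv_req);

    /* ... so the pre-processing phase (opacities, start of the RT
     * calculation) overlaps with the communication.                         */
    /* compute_opacities(); start_rt_solution();   -- placeholders           */

    if (!first_point)
        MPI_Wait(&recv_req, MPI_STATUS_IGNORE);

    /* Finish the formal solution of the RTE using I_prev, giving I_cur.     */
    /* finish_rte(I_prev, I_cur);                  -- placeholder            */

    /* Immediately hand I_cur to the cluster working on point i+1; the
     * non-blocking send lets the post-processing phase start right away.    */
    MPI_Isend(I_cur, NDEPTH, MPI_DOUBLE, next_cluster, 0,
              torus_comm, &send_req);

    /* Post-processing phase: rates, rate operators, and the opacities for
     * this cluster's next wavelength point.                                 */
    /* compute_rates_and_next_opacities();         -- placeholder            */

    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
}
```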

The important point in this scheme is that each wavelength cluster can
execute the post-processing phase of its current wavelength point and
pre-processing phase of its next wavelength point *independently
and in parallel with all other clusters*. This means that the majority
of the total computational work can be done in parallel, leading to a
substantial reduction in wall-clock time per model. Ideal load balancing
can be obtained by dynamically allocating wavelength points to wavelength
clusters. This requires only primitive logic with no measurable overhead;
however, it also requires communication and an arbitration/synchronization
process to avoid deadlocks. Typically, the number of clusters *n*
(4-64) is much smaller than the number of wavelength points, so that at any
given time the work required for each
wavelength point is roughly the same for each cluster (the work changes
as the number of overlapping lines changes, for example). Therefore,
a simple *round robin* allocation of wavelength points to clusters
(cluster *i* calculates wavelength points *i*, *n*+*i*, 2*n*+*i* and so
on) can be used, which will result in nearly optimal performance if this
condition is fulfilled. However, once the pipeline
is full, adding further wavelength clusters cannot decrease the
time required for the calculations, setting a limit for the efficient
``radius'' of the torus topology. This limit can be increased
somewhat by increasing the number of worker nodes per wavelength cluster.
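
A minimal sketch of this *round robin* distribution, with purely illustrative names, is simply:

```c
/* Cluster cluster_id processes wavelength points cluster_id, n+cluster_id,
 * 2n+cluster_id, ... for n = n_clusters; all names are illustrative. */
void process_my_wavelengths(int cluster_id, int n_clusters, int n_wavelengths)
{
    for (int k = cluster_id; k < n_wavelengths; k += n_clusters) {
        /* pre-processing, pipelined RTE solution, and post-processing
         * for wavelength point k (see the previous sketch) */
    }
}
```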