Our method for iteratively solving the NLTE radiative transfer and rate equations with an operator splitting method is discussed in detail in previous work; therefore, we present here only a brief summary. The method uses a ``rate-operator'' formalism that extends earlier approaches to the general case of multi-level NLTE calculations with overlapping lines and continua. We use an ``approximate rate-operator'' that is constructed from the exact elements of the discretized Λ-matrix (these are computed in the radiative transfer calculation for every wavelength point). This approximate rate-operator can be either diagonal or tri-diagonal. The method converges well for a wide range of applications and is very stable. It has the additional advantage that it can handle very large model atoms; for example, we use a 617-level Fe II model atom in regular model calculations.
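To make the operator-splitting idea concrete, the following is a minimal serial sketch (not the PHOENIX implementation) of an accelerated lambda iteration for a toy two-level atom, where only the diagonal of the Λ-matrix is used as the approximate operator. The matrix, source terms, and ε value are invented purely for illustration.

```python
def ali_iteration(lam, B, eps, n_iter=500, tol=1e-10):
    """Solve S = (1 - eps) * Lambda(S) + eps * B by operator splitting,
    using only the diagonal of Lambda as the approximate operator."""
    n = len(B)
    S = list(B)  # start from the thermal source function
    for _ in range(n_iter):
        # formal solution: J = Lambda * S (full matrix-vector product)
        J = [sum(lam[i][j] * S[j] for j in range(n)) for i in range(n)]
        S_new = []
        for i in range(n):
            # split off the diagonal: the lam[i][i]*S_new[i] contribution
            # is moved to the left-hand side and solved for locally
            rhs = eps * B[i] + (1.0 - eps) * (J[i] - lam[i][i] * S[i])
            S_new.append(rhs / (1.0 - (1.0 - eps) * lam[i][i]))
        converged = max(abs(a - b) for a, b in zip(S_new, S)) < tol
        S = S_new
        if converged:
            break
    return S

# toy 4-layer Lambda matrix (rows sum to < 1, diagonally dominant)
lam = [[0.60, 0.20, 0.10, 0.05],
       [0.20, 0.50, 0.20, 0.05],
       [0.05, 0.20, 0.50, 0.20],
       [0.05, 0.10, 0.20, 0.60]]
B = [1.0, 1.0, 1.0, 1.0]
S = ali_iteration(lam, B, eps=0.01)
```

Because the approximate operator is diagonal, each depth point is solved independently at every iteration, which is precisely what makes the distribution over nodes described below communication-free.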

Parallelizing the NLTE calculations involves parallelizing three different sections: the NLTE opacities, the rates, and the solution of the rate equations. In the following discussion, we consider only a diagonal approximate rate-operator for simplicity. In this case, the computation of the rates and the solution of the rate equations as well as the NLTE opacity calculations can simply be distributed onto a set of nodes without any communication (besides the gathering of the data at the end of the iteration). This provides a very simple way of achieving parallelism and minimizes the total communication overhead. The generalization to a tridiagonal approximate rate-operator is in principle straightforward, and involves only communication at the boundaries between two adjacent nodes.
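The distribution scheme for the diagonal case can be sketched as follows; this is an illustrative serial simulation with invented function names, not PHOENIX code. Each "node" works on a contiguous chunk of radial layers, and the only communication is the final gather.

```python
def partition(n_layers, n_nodes):
    """Contiguous, near-even split of layer indices among nodes."""
    base, extra = divmod(n_layers, n_nodes)
    chunks, start = [], 0
    for rank in range(n_nodes):
        size = base + (1 if rank < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def solve_layer(layer):
    # stand-in for the per-layer NLTE work (opacities, rates, rate
    # equations); with a diagonal approximate rate-operator this needs
    # no data from any other layer
    return layer * layer  # dummy result

def run(n_layers, n_nodes):
    results = [None] * n_layers
    for rank, chunk in enumerate(partition(n_layers, n_nodes)):
        for layer in chunk:          # each node works on its own chunk
            results[layer] = solve_layer(layer)
    return results                   # the only communication: a final gather
```

In the tridiagonal case, each chunk would additionally exchange its boundary layers with the two neighboring chunks, which is the "communication at the boundaries" mentioned above.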

It would be possible to use the other two methods that we discussed in the section on the LTE line opacities, namely distributing either local or global sets of lines to different nodes. However, both would involve an enormous amount of communication between the NLTE nodes because each NLTE transition can be coupled to any other NLTE transition. These couplings require that a node working on any NLTE transition have the data needed to incorporate the coupling correctly. Although this only applies to the nodes that work on the NLTE rates, it would require both communication among the NLTE opacity tasks (to prepare the necessary data) and communication from the NLTE opacity nodes to the NLTE rate nodes. This could mean that several MB of data would have to be transferred between nodes at each wavelength point, which is prohibitive with current communication devices.

Another way to parallelize the NLTE calculation exploits the fact that
our numerical method allows the grouping of NLTE species (elements) into
separate entities. These groups do not need to communicate with each
other until the end of an iteration. Thus distributing groups onto
different nodes will allow very effective parallelization with little or
no communication overhead. This approach can be combined with all of
the parallelization approaches discussed previously, leading to better
speed-ups. In addition, it would significantly reduce the memory
requirements for each node, thus allowing even larger problems to be
solved. We will implement this method in later releases of `PHOENIX` and
report the results elsewhere.
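The grouping idea reduces to a load-balancing problem: assign species to nodes so that the estimated work per node is even. A minimal sketch (illustrative only; the work estimate here is simply the number of levels in each model atom, and the species list is invented apart from the Fe II atom mentioned above) using a greedy longest-processing-time heuristic:

```python
def assign_groups(species_levels, n_nodes):
    """Greedily assign NLTE species to nodes, largest model atoms first,
    always placing the next species on the least-loaded node."""
    loads = [0] * n_nodes
    assignment = {}
    for name, levels in sorted(species_levels.items(),
                               key=lambda kv: kv[1], reverse=True):
        rank = loads.index(min(loads))   # least-loaded node so far
        assignment[name] = rank
        loads[rank] += levels
    return assignment, loads

# hypothetical set of NLTE species with their model-atom sizes
species = {"Fe II": 617, "H I": 30, "He I": 19, "Ca II": 8, "Mg II": 18}
assignment, loads = assign_groups(species, 2)
```

Since the groups only exchange data at the end of an iteration, the quality of this static assignment directly determines how long the faster nodes sit idle.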

We have implemented the parallelization of the
NLTE calculations by distributing the set of radial points over different
nodes. In order to minimize communication, we also `pair' NLTE nodes
so that each node works on NLTE opacities, rates and rate equations
for a given set of radial points. This means that the communication at
every wavelength point involves only gathering the NLTE opacity
data to the radiative transfer nodes and the broadcast of the result of the
radiative transfer calculations (i.e., the ALO and the *J*'s) to the nodes
computing the NLTE rates. The overhead for these operations is negligible
but these operations require synchronization (the rates can only be computed after
the results of the radiative transfer are known; similarly, the radiative transfer nodes have to
wait until the NLTE opacities have been computed). Therefore, a good
balance between the radiative transfer tasks and the NLTE tasks is important
to minimize waiting. The rate equation calculations parallelize
trivially over the layers and involve no communication if the diagonal
approximate rate-operator is used. After the new solution
has been computed, the data must be gathered and broadcast to *all*
nodes; the time for this operation is negligible because it is
required only once per model iteration.
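The per-wavelength gather/broadcast pattern described above can be sketched as a serial simulation (all function names and numerical values are invented stand-ins; in the real code these steps run on separate MPI nodes):

```python
def nlte_opacity(layers):
    # runs on a paired NLTE node: opacities for its own layers only
    return {l: 1.0 / (1 + l) for l in layers}        # dummy opacities

def radiative_transfer(opacity):
    # runs on the radiative transfer node for the current wavelength
    j_mean = {l: 0.5 * chi for l, chi in opacity.items()}  # dummy J
    alo = {l: 0.1 for l in opacity}                        # dummy ALO
    return j_mean, alo

def accumulate_rates(rates, j_mean, alo, layers):
    for l in layers:
        rates[l] += j_mean[l] * (1 - alo[l])         # dummy rate integrand

def wavelength_loop(chunks, n_wavelengths):
    rates = {l: 0.0 for c in chunks for l in c}
    for _ in range(n_wavelengths):
        # 1. each NLTE node computes opacities; "gather" to the RT node
        opacity = {}
        for chunk in chunks:
            opacity.update(nlte_opacity(chunk))
        # 2. RT node solves; "broadcast" J and the ALO back
        j_mean, alo = radiative_transfer(opacity)
        # 3. paired NLTE nodes update rates for their own layers only
        for chunk in chunks:
            accumulate_rates(rates, j_mean, alo, chunk)
    return rates
```

The synchronization points are steps 1 and 2: the rate update cannot start before the broadcast, and the transfer solve cannot start before the gather, which is why balancing the radiative transfer and NLTE tasks matters.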

In the figure and in Table 3 we show performance results for a simplified test model with solar abundances. The background LTE line opacities have been omitted, and the radiative transfer code has been run in plane-parallel mode on a single node in order to focus the results on the NLTE calculation. The speed-ups are acceptable, although they do not reach the theoretical maximum. This has several causes. First, the NLTE opacity calculations involve the computation of the bound-free cross-sections, which has not been parallelized in this version of the code. These calculations are negligible in serial mode, but they become comparatively more costly in parallel mode. The solution is to parallelize them as well, which involves significant communication at each wavelength point; this will be investigated in future work.

An important problem arises from the fact that the time spent in
the NLTE routines is not dominated by the floating point operations
(both the number and placement of floating point operations were
optimized in the serial version of the code) but by *memory access*,
in particular for the NLTE rate construction. Although the parallel
version accesses a much smaller number of storage cells (which naturally
reduces the wall-clock time), effects like cache and TLB misses and page
faults contribute significantly to the total wall-clock time. All of
these can be reduced by using Fortran-90 specific constructs in the
following way: We have replaced the static allocation of the arrays that
store the profiles, rates etc. used in the NLTE calculations (done with
`COMMON` blocks) with a Fortran-90 module and explicit allocation
(using `ALLOCATE` and `SAVE`) of these arrays at the start of
the model run. This allows us to tailor the size of the arrays to fit
exactly the number of radial points handled by each individual node.
This reduces both the storage requirements of the code on every node
and, more importantly, it minimizes cache/TLB misses and page faults. In
addition, it allows much larger calculations to be performed because the
RAM, virtual memory and scratch disk space of every node can be fully
utilized, thus effectively increasing the size of the possible
calculation by the number of nodes.
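The storage argument can be made quantitative with a back-of-the-envelope sketch (the array shapes and counts are invented for illustration; only the 617-level Fe II atom comes from the text). Sizing the layer-resolved arrays to the local rather than the global number of radial points shrinks each node's footprint by roughly the number of nodes:

```python
BYTES = 8  # double precision

def footprint(n_layers, n_levels, n_profiles):
    """Rough per-node storage for layer-resolved NLTE arrays, in bytes
    (hypothetical shapes: one rate matrix and a set of profiles per layer)."""
    return n_layers * (n_levels ** 2 + n_profiles) * BYTES

n_layers, n_nodes = 128, 16
n_levels, n_profiles = 617, 10_000     # e.g. the Fe II model atom

static_common = footprint(n_layers, n_levels, n_profiles)        # all layers
tailored = footprint(n_layers // n_nodes, n_levels, n_profiles)  # local only
print(static_common / tailored)   # -> 16.0
```

The same factor applies to the working set, which is why cache/TLB misses and page faults drop along with the raw memory requirement.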

We find that the use of adapted array sizes significantly improves the
overall performance, even for a small number of processes. The scaling
with more nodes has also improved considerably. However, the fixed
overhead of the loops themselves cannot be reduced, and thus the speed-up
cannot increase significantly as more nodes are used. We have verified with a simple
test program which only included one of the loops important for the
radiative rate computation that this is indeed the case. Therefore, we
conclude that further improvements cannot be obtained at the Fortran-90
source level but would require either re-coding of routines in assembly
language (which is not practical) or improvements of the
compiler/linker/library system. We note that these performance changes
are very system dependent. For example, the serial ```COMMON` block''
version of the code runs significantly (50%) faster than the serial
``Fortran-90 module'' version on both Crays and SGIs, whereas on
IBM `RS/6000`s we found no noticeable difference in speed. This
stresses that raw CPU performance is of little value unless it is
supplemented by adequate compilers.

We also ran tests on the `HP J200` machine (see Table 3). We use
the public domain `MPICH` implementation of `MPI` compiled on this
machine with its default compiler options. There is no full-F90
compiler available from Hewlett-Packard for this machine, so we
had to use the ```COMMON` block'' version of the code; consequently,
the speed-ups cannot be expected to be optimal. However, a total
speed-up of about 1.5 is acceptable. The speed-up for the opacity part of
the NLTE calculation is a factor of 1.7, somewhat better than the factor
of 1.4 achieved for the NLTE rate calculations. This is probably because
the rate computation is far more dominated by memory access than the more
floating-point intensive NLTE opacity calculations.