Our method for iteratively solving the NLTE radiative transfer and rate equations with an operator splitting method are discussed in detail in and , therefore, we present here only a brief summary. The method uses a ``rate-operator'' formalism that extends the approach of to the general case of multi-level NLTE calculations with overlapping lines and continua. We use an ``approximate rate-operator'' that is constructed using the exact elements of the discretized -matrix (these are constructed in the radiative transfer calculation for every wavelength point). This approximate rate-operator can be either diagonal or tri-diagonal. The method gives good convergence for a wide range of applications and is very stable. It has the additional advantage that it can handle very large model atoms, e.g., we use a 617 level Fe II model atom in regular model calculations .
Parallelizing the NLTE calculations involves parallelizing three different sections: the NLTE opacities, the rates, and the solution of the rate equations. In the following discussion, we consider only a diagonal approximate rate-operator for simplicity. In this case, the computation of the rates and the solution of the rate equations as well as the NLTE opacity calculations can simply be distributed onto a set of nodes without any communication (besides the gathering of the data at the end of the iteration). This provides a very simple way of achieving parallelism and minimizes the total communication overhead. The generalization to a tridiagonal approximate rate-operator is in principle straightforward, and involves only communication at the boundaries between two adjacent nodes.
It would be possible to use the other two methods that we discussed in the section on the LTE line opacities, namely local and global set of lines distributed to different nodes. However, both would involve an enormous amount of communication between the NLTE nodes because each NLTE transition can be coupled to any other NLTE transition. These couplings require that a node working on any NLTE transition have the required data to incorporate the coupling correctly. Although this only applies to the nodes that work on the NLTE rates, it would require both communication of each NLTE opacity task with each other (to prepare the necessary data) and communication from the NLTE opacity nodes to the NLTE rate nodes. This could mean that several MB data would have to be transferred between nodes at each wavelength point, which is prohibitive with current communication devices.
Another way to parallelize the NLTE calculation exploits the fact that our numerical method allows the grouping of NLTE species (elements) into separate entities. These groups do not need to communicate with each other until the end of an iteration. Thus distributing groups onto different nodes will allow very effective parallelization with little or no communication overhead. This approach can be combined with the all the parallelization approaches discussed previously, leading to better speed-ups. In addition, it would significantly reduce the memory requirements for each node, thus allowing even larger problems to be solved. We will implement this method in later releases of PHOENIX and report the results elsewhere.
We have implemented the parallelization of the NLTE calculations by distributing the set of radial points on different nodes. In order to minimize communication, we also `pair' NLTE nodes so that each node works on NLTE opacities, rates and rate equations for a given set of radial points. This means that the communication at every wavelength point involves only gathering the NLTE opacity data to the radiative transfer nodes and the broadcast of the result of the radiative transfer calculations (i.e., the ALO and the J's) to the nodes computing the NLTE rates. The overhead for these operations is negligible but it involves synchronization (the rates can only be computed after the results of the radiative transfer are known, similarly, the radiative transfer nodes have to wait until the NLTE opacities have been computed). Therefore, a good balance between the radiative transfer tasks and the NLTE tasks is important to minimize waiting. The rate equation calculations parallelize trivially over the layers and involve no communication if the diagonal approximate rate-operator is used. After the new solution has been computed, the data must be gathered and broadcast to all nodes, the time for this operation is negligible because it is required only once per model iteration.
In Fig. and Table 3 we show performance results for a simplified test model with the following parameters: ,, and solar abundances. The background LTE line opacities have been omitted and the radiative transfer code has been run in plane parallel mode on a single node to concentrate the results on the NLTE calculation. The speed-ups are acceptable although they do not reach the theoretical maximum. This is caused by several effects. First, the NLTE opacity calculations involve the computation of the b-f cross-section, which has not been parallelized in this version of the code. These calculations are negligible in serial mode, but they become comparatively more costly in parallel mode. The solution to this will be their parallelization, which involves significant communication at each wavelength point. This will be investigated in future work.
An important problem arises from the fact that the time spent in the NLTE routines is not dominated by the floating point operations (both the number and placement of floating point operation were optimized in the serial version of the code) but by memory access, in particular for the NLTE rate construction. Although the parallel version accesses a much smaller number of storage cells (which naturally reduces the wall-clock time), effects like cache and TLB misses and page faults contribute significantly to the total wall-clock time. All of these can be reduced by using Fortran-90 specific constructs in the following way: We have replaced the static allocation of the arrays that store the profiles, rates etc. used in the NLTE calculations (done with COMMON blocks) with a Fortran-90 module and explicit allocation (using ALLOCATE and SAVE) of these arrays at the start of the model run. This allows us to tailor the size of the arrays to fit exactly the number of radial points handled by each individual node. This reduces both the storage requirements of the code on every node and, more importantly, it minimizes cache/TLB misses and page faults. In addition, it allows much larger calculations to be performed because the RAM, virtual memory and scratch disk space of every node can be fully utilized, thus effectively increasing the size of the possible calculation by the number of nodes.
We find that the use of adapted array sizes significantly improves the overall performance, even for a small number of processes. The scaling with more nodes has also improved considerably. However, the overhead of loops cannot be reduced and thus the speed-up cannot increase significantly if more nodes are used. We have verified with a simple test program which only included one of the loops important for the radiative rate computation that this is indeed the case. Therefore, we conclude that further improvements cannot be obtained at the Fortran-90 source level but would require either re-coding of routines in assembly language (which is not practical) or improvements of the compiler/linker/library system. We note that these performance changes are very system dependent. For example, the serial ``COMMON block'' version of the code runs significantly (50%) faster than the serial ``Fortran-90 module'' version on both Cray's or SGI's, whereas on IBM RS/6000's we found no noticeable difference in speed. This stresses that raw CPU performance is irrelevant as long as it is not supplemented by adequate compilers.
We also ran tests on the HP J200 machine, see Table 3. We use the public domain MPICH implementation of MPI compiled on this machine with its default compiler options. There is no full-F90 compiler available from Hewlett-Packard for this machine, therefore, we had to use the ``COMMON block'' version of the code. Therefore, the speed-ups cannot be expected to be optimal. However, a total speed-up of about 1.5 is acceptable. The speed-up for the opacity part of the NLTE calculation is a factor of 1.7, somewhat better than the factor of 1.4 achieved for the NLTE rate calculations. This is probably due to the fact that the rate computation is much more memory access dominated than the more floating point intensive NLTE opacity calculations.