We have implemented a hierarchical parallelization scheme on
distributed memory machines using the MPI framework similar to the
scheme discussed in Baron & Hauschildt (1998). The most efficient
parallelization opportunities in the problem are the solid angle and
wavelength sub-spaces. The total number of solid angles in the test
models is at least 4096; the number of wavelength points is 32 in the
tests presented here, but it will be much larger in large-scale
applications. Thus, even in the simple tests presented here, the
calculations could theoretically be run on $4096 \times 32 = 131072$
processors. The work required for each solid angle is roughly constant
(the number of points that need to be calculated depends on the angle)
and during the formal solution the solid angles are independent of
each other. The mean intensities (and other solid angle
integrated quantities) are only needed after the formal solution for
all solid angles is complete, so a single collective MPI operation is
needed to finish the computation of the mean intensities at each
wavelength. Similarly, the wavelength-integrated mean intensities
are needed only after the formal solutions are completed for all
wavelengths (and solid angles). Therefore, the different wavelength
points can also be computed in parallel, with the only communication
occurring as collective MPI operations after all wavelength points have
been computed. We thus divide the total number of processes into a
number of `wavelength clusters' (each working on a different set of
wavelength points), each of which has a number of `worker' processes
that work on a different set of solid angle points for any given wavelength. In
the simplest case, each wavelength cluster has the same number of
worker processes so that
\begin{equation}
  N_{\rm proc} = N_{\rm c}\, N_{\rm w},
\end{equation}
where $N_{\rm proc}$ is the total number of MPI processes, $N_{\rm c}$
is the number of wavelength clusters and $N_{\rm w}$ is the number of
worker processes for each wavelength cluster. For our tests we could
use a maximum number of 128 CPUs on the HLRN IBM Regatta (Power4 CPUs)
system.
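To illustrate this two-level layout, the following minimal C/MPI sketch
(not taken from the actual code; the cluster count, the variable names
and the placeholder work loop are assumptions made only for this
example) splits the global communicator into wavelength clusters of
workers, finishes the solid angle integration within each cluster with
a single collective operation, and combines the wavelength-integrated
quantities across clusters with a second collective:
\begin{verbatim}
#include <mpi.h>
#include <stdio.h>

#define N_CLUSTERS 8   /* hypothetical number of wavelength clusters */

int main(int argc, char **argv) {
    int world_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* color = index of the wavelength cluster this process belongs to */
    int cluster_id  = world_rank % N_CLUSTERS;
    int worker_slot = world_rank / N_CLUSTERS;

    /* communicator of the workers within one wavelength cluster */
    MPI_Comm cluster_comm;
    MPI_Comm_split(MPI_COMM_WORLD, cluster_id, worker_slot, &cluster_comm);

    /* each worker computes its subset of solid angles for the cluster's
       wavelength points (placeholder for the formal solution) */
    double local_sum = 0.0;
    /* ... accumulate this worker's solid-angle contributions ... */

    /* one collective finishes the mean intensities for this cluster's
       wavelengths once all solid angles are done */
    double mean_intensity = 0.0;
    MPI_Allreduce(&local_sum, &mean_intensity, 1, MPI_DOUBLE, MPI_SUM,
                  cluster_comm);

    /* wavelength-integrated quantities are needed only after all
       wavelength points are complete: one more collective across the
       corresponding workers of all clusters */
    MPI_Comm lambda_comm;
    MPI_Comm_split(MPI_COMM_WORLD, worker_slot, cluster_id, &lambda_comm);
    double wavelength_integrated = 0.0;
    MPI_Allreduce(&mean_intensity, &wavelength_integrated, 1, MPI_DOUBLE,
                  MPI_SUM, lambda_comm);

    if (world_rank == 0)
        printf("wavelength-integrated result: %g\n", wavelength_integrated);

    MPI_Comm_free(&lambda_comm);
    MPI_Comm_free(&cluster_comm);
    MPI_Finalize();
    return 0;
}
\end{verbatim}
The sketch only indicates where the two collective operations enter;
the actual assignment of wavelength and solid angle points to clusters
and workers follows the scheme described above.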
In Table 1 we show the results for the 3 combinations that we could
run (due to computer time limitations) for a line transfer test case
with 32 wavelength points and 4096 solid angle points on the spatial
grid of the test models. For example, the third row in the table is
for a configuration with $N_{\rm c}$ wavelength clusters, each of them
using $N_{\rm w} = 128/N_{\rm c}$ CPUs working in parallel on
different solid angles, for a total of 128
CPUs. The 3rd and 4th columns give the time (in seconds) for a full
formal solution, the construction of the $\Lambda^*$ operator and an OS
step (the first iteration) and the time for a formal solution and an
OS step (the second iteration), respectively. As the $\Lambda^*$ has to
be constructed only in the first iteration, the overall time per
iteration drops in subsequent iterations. Similarly to the 1D case,
the construction of the $\Lambda^*$ is roughly equivalent to one formal
solution.
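In terms of timings, if $t_{\rm FS}$ denotes the time for one formal
solution and $t_{\rm OS}$ the time for one OS step (both symbols are
introduced here only for illustration), this behaviour corresponds
roughly to
\begin{equation}
  t_{\rm iter} \approx
  \begin{cases}
    2\,t_{\rm FS} + t_{\rm OS} & \text{first iteration (including the $\Lambda^*$ construction)},\\
    t_{\rm FS} + t_{\rm OS}    & \text{subsequent iterations},
  \end{cases}
\end{equation}
using the approximation that the $\Lambda^*$ construction costs about
one formal solution.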
We have verified that all parallel configurations lead to
identical results. The table shows that configurations with more
clusters are slightly more efficient, mostly due to better load
balancing. However, the differences are small enough that in practical
applications the exact choice of the setup is not important. This also
means that the code can easily scale to much larger numbers of
processors, since realistic applications will require far more than the
32 wavelength points used in the test calculations and thus allow many
more wavelength clusters. Note that the MPI parallelization can be
combined with shared memory parallelization (e.g., using OpenMP) in
order to utilize modern multi-core processors with shared caches more
efficiently. Although this is implemented in the current version of the
3D code, we do not have access to a machine with such an architecture,
and using OpenMP across multiple single-core processors was not
efficient.
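As an illustration of such a hybrid setup (a sketch only; the routine
formal_solution_for_angle and the loop structure are hypothetical and
do not reflect the actual code), each MPI worker could thread its
local loop over solid angles with OpenMP:
\begin{verbatim}
#include <omp.h>

/* Hypothetical per-angle work routine; stands in for the formal
   solution along one solid angle at the current wavelength point. */
double formal_solution_for_angle(int angle_index);

/* Threaded loop over the solid angles assigned to this MPI worker.
   The OpenMP threads share the caches of a multi-core node, while
   the MPI layer distributes the work across nodes as before. */
double solve_local_angles(int n_local_angles)
{
    double local_sum = 0.0;
#pragma omp parallel for reduction(+:local_sum)
    for (int ia = 0; ia < n_local_angles; ia++) {
        local_sum += formal_solution_for_angle(ia);
    }
    return local_sum;   /* fed into the collective reduction shown above */
}
\end{verbatim}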