In a previous paper (hereafter Paper I) we described our
method for parallelizing three separate modules: (1) The radiative
transfer calculation itself, where we divide up the characteristic
rays among nodes and use an `MPI_REDUCE` to send the mean intensities $J_\nu$ to all
the radiative transfer and NLTE rate computation tasks; (2) the line
opacity, which requires the calculation of about 10,000 Voigt profiles
per wavelength point at each radial grid point and for which we split the
work among the processors both by radial grid point and by dividing up
the individual lines to be calculated; and (3) the
NLTE calculations. The NLTE calculations involve three separate parts:
the calculation of the NLTE opacities, the calculation of the rates at
each wavelength point, and the solution of the NLTE rate equations. In
Paper I we performed all these parallelizations by distribution of the
radial grid points among the different nodes or by distributing sets of
spectral lines onto different nodes. In addition, each task computing the
NLTE rates is paired on the same node with the corresponding task computing
NLTE opacities and emissivities to reduce communication overhead. The
solution of the rate equations parallelizes
trivially with the use of a diagonal rate operator.
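
The ray distribution in module (1) can be illustrated with a minimal sketch. It is a serial emulation, not the PHOENIX implementation: the function names (`split_indices`, `partial_mean_intensity`, `reduce_sum`), the toy intensities, and the equal quadrature weights are all assumptions made for illustration. Each emulated rank sums the quadrature contributions of its own characteristic rays, and an element-wise sum over ranks stands in for the reduction step.

```python
import math

def split_indices(n_items, n_ranks):
    """Contiguous index ranges, one per rank, with sizes differing by at most 1."""
    base, extra = divmod(n_items, n_ranks)
    ranges, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

def partial_mean_intensity(intensity, weights, ray_ids):
    """Quadrature contribution of the given rays to J at every radial point."""
    n_radial = len(intensity[0])
    partial = [0.0] * n_radial
    for k in ray_ids:
        for i in range(n_radial):
            partial[i] += weights[k] * intensity[k][i]
    return partial

def reduce_sum(partials):
    """Stand-in for a sum reduction: element-wise sum over all ranks."""
    return [sum(column) for column in zip(*partials)]

# Hypothetical toy problem: 10 rays, 4 radial points, 3 emulated ranks.
n_rays, n_radial, n_ranks = 10, 4, 3
weights = [1.0 / n_rays] * n_rays
intensity = [[1.0 + 0.1 * k + 0.01 * i for i in range(n_radial)]
             for k in range(n_rays)]

ray_ranges = split_indices(n_rays, n_ranks)
partials = [partial_mean_intensity(intensity, weights, r) for r in ray_ranges]
j_parallel = reduce_sum(partials)
j_serial = partial_mean_intensity(intensity, weights, range(n_rays))
```

Under these assumptions the reduced result agrees with the serial sum to rounding error, which is the property that makes the ray decomposition safe. The same `split_indices` helper could equally describe a distribution of radial grid points or of spectral lines.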

In the latest version of our code, `PHOENIX 8.1`, we have incorporated
the additional strategy of distributing each NLTE species (the total
number of ionization stages of a particular element treated in NLTE) on
separate nodes. Since different species have different numbers of levels
treated in NLTE (e.g., Fe II [singly ionized iron] has 617 NLTE levels,
whereas H I has 30 levels), care is needed to balance the number of
levels and NLTE transitions treated among the nodes to avoid unnecessary
synchronization problems.
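
One simple way to think about this balancing problem is a greedy heuristic that assigns the heaviest species first, each to the currently least-loaded node. The sketch below is an illustration only and is not claimed to be the scheme used in `PHOENIX`. It assumes that the cost of a species is proportional to its number of NLTE levels, which ignores the number of transitions. Only the Fe II and H I level counts come from the text; the entries `X1` to `X4` are hypothetical placeholders.

```python
import heapq

def balance_species(level_counts, n_nodes):
    """Greedy assignment: heaviest species first, each to the least-loaded node."""
    heap = [(0, node) for node in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    # Sort by descending level count; break ties by name for reproducibility.
    ordered = sorted(level_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for species, n_levels in ordered:
        load, node = heapq.heappop(heap)
        assignment[node].append(species)
        heapq.heappush(heap, (load + n_levels, node))
    loads = {node: sum(level_counts[s] for s in members)
             for node, members in assignment.items()}
    return assignment, loads

level_counts = {
    "Fe II": 617,  # from the text
    "H I": 30,     # from the text
    "X1": 300,     # hypothetical placeholder species
    "X2": 250,     # hypothetical
    "X3": 100,     # hypothetical
    "X4": 50,      # hypothetical
}
assignment, loads = balance_species(level_counts, 2)
```

With these illustrative numbers the two nodes end up with 667 and 680 levels, so neither waits long for the other at a synchronization point. A more faithful cost model would weight each species by its number of transitions as well.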

In addition to the data parallelism discussed above, the version of
`PHOENIX` described in Paper I also uses simultaneous task parallelism
by allocating different tasks to different nodes. This can result in
further speed-up and better scalability but requires a careful analysis
of the workload of the different tasks (the workload is also a function
of wavelength, e.g., the number of lines that overlap differs at each
wavelength point) to obtain optimal load balancing.
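
As a rough illustration of such a workload analysis, the sketch below allocates a fixed number of nodes to tasks in proportion to their estimated cost. It is a sketch under stated assumptions, not the procedure used in `PHOENIX`: the function `allocate_nodes`, the task names, and the relative costs are hypothetical, and the costs are treated as averages over wavelength even though the real workload varies from one wavelength point to the next.

```python
def allocate_nodes(task_costs, n_nodes):
    """Give each task one node, then share the rest in proportion to cost."""
    if n_nodes < len(task_costs):
        raise ValueError("need at least one node per task")
    spare = n_nodes - len(task_costs)
    total = sum(task_costs.values())
    # Ideal (fractional) share of the spare nodes for each task.
    shares = {t: spare * c / total for t, c in task_costs.items()}
    alloc = {t: 1 + int(shares[t]) for t in task_costs}
    # Hand out nodes lost to rounding, largest fractional remainder first.
    leftover = n_nodes - sum(alloc.values())
    by_remainder = sorted(task_costs,
                          key=lambda t: shares[t] - int(shares[t]),
                          reverse=True)
    for t in by_remainder[:leftover]:
        alloc[t] += 1
    return alloc

# Hypothetical relative costs per wavelength point (arbitrary units).
task_costs = {"radiative transfer": 2.0, "line opacity": 5.0, "NLTE rates": 3.0}
alloc = allocate_nodes(task_costs, 8)
```

In practice the costs would have to be measured, and because they change with wavelength a static allocation of this kind can only balance the load on average.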
