
Parallelization

Solving the above set of coupled non-linear equations for large numbers of NLTE species requires large amounts of memory to store the rates for each level in all the model atoms at each radial grid point, and large amounts of CPU time because many wavelength points are required in order to resolve the line profiles in the co-moving frame. In order to minimize both CPU and memory requirements we have parallelized the separate Fortran 90 modules which make up the PHOENIX code. Our experience indicates that only the simultaneous use of data and task parallelism can deliver reasonable parallel speedups [8]. This involves:

1. The radiative transfer calculation itself, where we divide up the characteristic rays among nodes and use a "reduce" operation to collect the mean intensities $J_{\nu}$ and send them to all the radiative transfer and NLTE rate computation tasks (data parallelism);
2. the line opacity, which requires the calculation of up to 50,000 Voigt profiles per wavelength point at each radial grid point. Here we split the work among the processors both by radial grid points and by dividing up the individual lines to be calculated (combined data and task parallelism); and
3. the NLTE calculations, which involve three separate parts: the calculation of the NLTE opacities, the calculation of the rates at each wavelength point, and the solution of the NLTE rate and statistical equilibrium equations. To avoid communication overhead, each task computing the NLTE rates is forced to be on the same node as the corresponding task computing the NLTE opacities and emissivities (combined data and task parallelism). The solution of the rate equations parallelizes trivially with the use of a diagonal approximate rate operator.
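The ray distribution and "reduce" step of item 1 can be illustrated with a minimal serial sketch. It is not taken from PHOENIX: the function names, the round-robin ray assignment, and the toy intensities and quadrature weights are all assumptions made for illustration, and the message passing is only mimicked by ordinary Python lists.

```python
def split_rays(n_rays, n_nodes):
    """Assign ray indices to nodes round-robin (assumed distribution)."""
    return [list(range(node, n_rays, n_nodes)) for node in range(n_nodes)]


def partial_mean_intensity(ray_ids, intensities, weights, n_layers):
    """Partial quadrature sum of J over the rays owned by one node."""
    partial = [0.0] * n_layers
    for r in ray_ids:
        for layer in range(n_layers):
            partial[layer] += weights[r] * intensities[r][layer]
    return partial


def all_reduce_sum(partials):
    """Mimic a sum-reduce: add the per-node partial sums layer by layer.

    In a message-passing code the result would also be sent back to
    every task; here the single returned list stands in for that."""
    return [sum(column) for column in zip(*partials)]


# Toy problem (hypothetical numbers): 8 rays, 3 radial layers, 4 nodes.
n_rays, n_layers, n_nodes = 8, 3, 4
intensities = [[float(r + layer) for layer in range(n_layers)]
               for r in range(n_rays)]
weights = [1.0 / n_rays] * n_rays

partials = [partial_mean_intensity(ids, intensities, weights, n_layers)
            for ids in split_rays(n_rays, n_nodes)]
J = all_reduce_sum(partials)

# Serial reference: one node owning every ray.
J_serial = partial_mean_intensity(range(n_rays), intensities, weights, n_layers)
```

Because the quadrature over rays is a plain sum, the reduced result agrees with the serial one up to floating-point rounding, which is why this part of the calculation suits data parallelism.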

In the latest version of our code, PHOENIX 9.1, we have incorporated the additional strategy of distributing each NLTE species (the total number of ionization stages of a particular element treated in NLTE) on separate nodes. Since different species have different numbers of levels treated in NLTE (e.g., Fe II [singly ionized iron] has 617 NLTE levels, whereas H I has 30 levels), care is taken to balance the number of levels and NLTE transitions treated on each node to avoid unnecessary synchronization and communication problems. We have also parallelized the selection of background atomic and molecular LTE lines. This is a significant amount of work: our combined line lists currently include about 400 million lines, and we expect line lists with about 1 billion lines in the near future. At first glance the line selection seems to be an inherently serial process, since a file of selected lines sorted in wavelength must be written to disk. We nevertheless obtain reasonable speedups by employing a client-server model: client nodes read pieces (blocks) of the line list files and carry out the actual selection on each block of lines, while a server line-selection task receives the selected lines and writes them to disk.
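One simple way to balance species of very different size across nodes is a greedy "largest first" assignment, sketched below. This is an illustration only, not the scheme used in PHOENIX: it weighs each species by its number of levels alone (the text above also balances NLTE transitions), the function name `balance_species` is hypothetical, and apart from Fe II and H I the species and level counts are made-up placeholders.

```python
import heapq


def balance_species(levels_per_species, n_nodes):
    """Greedy assignment: largest species first, always to the least-loaded node."""
    # Heap of (current load in levels, node index); the smallest load is popped first.
    heap = [(0, node) for node in range(n_nodes)]
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    ordered = sorted(levels_per_species.items(), key=lambda kv: -kv[1])
    for name, n_levels in ordered:
        load, node = heapq.heappop(heap)
        assignment[node].append(name)
        heapq.heappush(heap, (load + n_levels, node))
    return assignment


# Fe II and H I level counts are from the text; the other entries are
# hypothetical placeholders added only to make the example non-trivial.
levels = {
    "Fe II": 617,
    "H I": 30,
    "species A": 300,
    "species B": 250,
    "species C": 100,
}

assignment = balance_species(levels, n_nodes=2)
loads = {node: sum(levels[name] for name in names)
         for node, names in assignment.items()}
```

With these toy numbers the large Fe II model atom shares a node only with H I, while the three mid-sized placeholder species share the other node, so the two loads differ by only a few levels.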

In addition to the combined data and task parallelism discussed above, PHOENIX also uses simultaneous explicit task parallelism by allocating different tasks (e.g., atomic line opacity, molecular line opacity, radiative transfer) to different nodes. This can result in further speedup and better scalability, but it requires a careful analysis of the workload of the different tasks to obtain optimal load balancing. The workload is also a function of wavelength, because, for example, the number of lines that overlap differs from one wavelength point to the next.
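The kind of workload analysis this calls for can be sketched as follows, under strong simplifying assumptions: each task's cost is a single fixed number (in reality it varies with wavelength), every task gets at least one node, and a task's run time scales inversely with its node count. The function `allocate_nodes` and the relative workloads are hypothetical, not measured PHOENIX values.

```python
def allocate_nodes(workload, n_nodes):
    """Split n_nodes among tasks so the work per node is as even as possible.

    Every task starts with one node; each remaining node goes to the task
    that currently has the highest workload per node."""
    if n_nodes < len(workload):
        raise ValueError("need at least one node per task")
    nodes = {task: 1 for task in workload}
    for _ in range(n_nodes - len(workload)):
        busiest = max(workload, key=lambda task: workload[task] / nodes[task])
        nodes[busiest] += 1
    return nodes


# Hypothetical relative costs for the three tasks named in the text.
workload = {
    "atomic line opacity": 6.0,
    "molecular line opacity": 3.0,
    "radiative transfer": 1.0,
}

nodes = allocate_nodes(workload, n_nodes=10)
per_node = {task: workload[task] / nodes[task] for task in workload}
```

When the per-node workloads come out roughly equal, no group of nodes sits idle waiting for another. A real allocation would have to be revisited as the relative costs change with wavelength.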



Peter H. Hauschildt
8/20/1998