Solving the above set of coupled non-linear equations for a large number of NLTE species is expensive in two ways. It requires a large amount of memory, because the rates for each level in every model atom must be stored at each radial grid point, and a large amount of CPU time, because many wavelength points are needed to resolve the line profiles in the co-moving frame. To reduce both the CPU and the memory requirements per node, we have parallelized the separate Fortran 90 modules that make up the PHOENIX code. Our experience indicates that only the simultaneous use of data and task parallelism can deliver reasonable parallel speedups [8]. This involves:
In the latest version of our code, PHOENIX 9.1, we have incorporated the additional strategy of distributing the NLTE species (the total number of ionization stages of a particular element treated in NLTE) across separate nodes. Since different species have different numbers of levels treated in NLTE (e.g., Fe II [singly ionized iron] has 617 NLTE levels, whereas H I has 30 levels), care is taken to balance the number of levels and NLTE transitions treated on each node in order to avoid unnecessary synchronization and communication overhead. We have also parallelized the selection of background atomic and molecular LTE lines. This is a significant amount of work: our combined line lists currently include about 400 million lines, and we expect line lists with about 1 billion lines in the near future. At first glance the line selection seems to be an inherently serial process, since a wavelength-sorted file of the selected lines must be written to disk. We nevertheless obtain reasonable speedups by employing a client-server model. The client nodes read pieces (blocks) of the line list files and carry out the actual selection on each block of lines, while a server line-selection task receives the selected lines and writes them to disk.
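The client-server line selection described above can be illustrated with a minimal Python sketch. It is not the PHOENIX implementation (which is written in Fortran 90); threads stand in for nodes, and the function names, the (wavelength, strength) line format, and the strength threshold used as the selection criterion are all assumptions made for illustration. The sketch also assumes that blocks may overlap in wavelength, so the server merges the per-block results rather than simply concatenating them.

```python
import heapq
import io
from concurrent.futures import ThreadPoolExecutor


def select_block(block, threshold):
    """Client work: keep the lines of one block that pass a strength cut.

    Each line is a (wavelength, strength) tuple. The cut is a hypothetical
    stand-in for the real selection criterion. The result is sorted by
    wavelength so that the server can merge blocks cheaply.
    """
    return sorted(line for line in block if line[1] >= threshold)


def run_line_selection(blocks, threshold, out, n_clients=4):
    """Server work: gather per-block selections, write one sorted stream.

    `out` is any object with a write() method (a file or io.StringIO).
    Returns the number of lines written.
    """
    # Clients process blocks concurrently; threads stand in for nodes here.
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        selected = list(pool.map(lambda b: select_block(b, threshold), blocks))

    n_written = 0
    # heapq.merge does a k-way merge of the already sorted block results,
    # so only the server touches the output and it stays wavelength-sorted.
    for wavelength, strength in heapq.merge(*selected):
        out.write(f"{wavelength:.3f} {strength:.2e}\n")
        n_written += 1
    return n_written


if __name__ == "__main__":
    blocks = [
        [(500.0, 1.0), (400.0, 0.1)],   # block read by one client
        [(450.0, 2.0), (600.0, 0.2)],   # block read by another client
    ]
    buffer = io.StringIO()
    count = run_line_selection(blocks, threshold=0.5, out=buffer)
    print(count, "lines selected")
    print(buffer.getvalue(), end="")
```

The design point the sketch tries to capture is that the expensive part (reading and filtering blocks) is spread over the clients, while the serial part (the ordered write) is confined to a single server task.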
In addition to the combined data and task parallelism discussed above, PHOENIX also uses simultaneous explicit task parallelism by allocating different tasks (e.g., atomic line opacity, molecular line opacity, radiative transfer) to different nodes. This can result in further speedup and better scalability, but optimal load balancing requires a careful analysis of how the workload is divided among the tasks. The workload is also a function of wavelength, since, for example, the number of lines that overlap differs from one wavelength point to the next.
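One simple way to turn such a workload analysis into an allocation is to assign nodes to tasks in proportion to their estimated cost. The Python sketch below does this with the largest-remainder method. It is an illustration only, not the scheme used in PHOENIX: the function name, the task labels, and the cost figures are hypothetical, and the costs are treated as fixed averages even though the real workload varies with wavelength.

```python
def allocate_nodes(workloads, n_nodes):
    """Split n_nodes among tasks in proportion to their estimated cost.

    `workloads` maps a task name to a non-negative relative cost.
    Every task gets at least one node; the remaining nodes are shared
    out proportionally using the largest-remainder method.
    """
    if n_nodes < len(workloads):
        raise ValueError("need at least one node per task")
    total = sum(workloads.values())
    if total <= 0:
        raise ValueError("total workload must be positive")

    spare = n_nodes - len(workloads)   # nodes left after one per task
    shares = {task: spare * cost / total for task, cost in workloads.items()}

    # Start from one node per task plus the integer part of each share.
    alloc = {task: 1 + int(share) for task, share in shares.items()}

    # Hand any nodes still unassigned to the largest fractional parts.
    leftover = n_nodes - sum(alloc.values())
    by_fraction = sorted(
        shares, key=lambda task: shares[task] - int(shares[task]), reverse=True
    )
    for task in by_fraction[:leftover]:
        alloc[task] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical relative costs, averaged over wavelength.
    costs = {
        "atomic_line_opacity": 6.0,
        "molecular_line_opacity": 3.0,
        "radiative_transfer": 1.0,
    }
    print(allocate_nodes(costs, n_nodes=13))
```

Because the cost of each task changes with wavelength, a static split like this can only match the average workload; it is a starting point for the analysis rather than a substitute for it.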