
Parallelization of the line opacity calculation

The contribution of spectral lines from both atoms and molecules is calculated in PHOENIX by direct summation of all contributing lines. Each line profile is computed individually at every radial point for each line within a search window (typically, a few hundred to a few thousand Doppler widths, or about $1000\hbox{\AA}$). This method is more accurate than methods that rely on pre-computed line opacity tables (the ``opacity sampling method'') or methods based on distribution functions (the ``opacity distribution function (ODF) method''). Both the ODF and the opacity sampling method neglect the details (e.g., the depth dependence) of the individual line profiles. This introduces systematic errors in the line opacity because the pressures at the top and the bottom of the line forming region are vastly different (several orders of magnitude in cool dwarf stars), which causes very different pressure damping and therefore differing line widths over the line forming region. In addition, the pre-computed tables require a specified and fixed wavelength grid, which is too restrictive for NLTE calculations that include important background line opacities.
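To illustrate the direct-summation approach, the following minimal sketch (in Python rather than the Fortran used by PHOENIX; all function and variable names are hypothetical) evaluates a normalized Voigt profile with a depth-dependent damping parameter via the Faddeeva function:

\begin{verbatim}
# Minimal sketch (Python, not PHOENIX's Fortran): a normalized Voigt profile
# evaluated with a depth-dependent damping parameter, standing in for the
# per-radial-point line profiles described above. All names are illustrative.
import numpy as np
from scipy.special import wofz  # Faddeeva function w(z)

def voigt_profile(delta_nu, doppler_width, damping_width):
    """phi(delta_nu) = H(a, v) / (sqrt(pi) * doppler_width)."""
    a = damping_width / doppler_width
    v = delta_nu / doppler_width
    return np.real(wofz(v + 1j * a)) / (np.sqrt(np.pi) * doppler_width)

# The pressure damping grows by orders of magnitude toward the bottom of the
# line forming region, so the profile is re-evaluated at every radial point:
depth   = np.linspace(0.0, 1.0, 5)              # dummy radial grid
doppler = np.full_like(depth, 1.0)              # Doppler widths (arbitrary units)
damping = 10.0 ** (-3.0 + 4.0 * depth)          # strongly depth-dependent damping
offsets = np.linspace(-500.0, 500.0, 2001)      # search window in Doppler widths
profiles = [voigt_profile(offsets, d, g) for d, g in zip(doppler, damping)]
\end{verbatim}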

In typical model calculations we find that about 1,000 to 10,000 spectral lines (e.g., in M dwarf model calculations) contribute to the line opacity at any wavelength point. Therefore, we need to calculate a large number of individual Voigt profiles at every wavelength point. The subroutines for these computations can easily be vectorized, and we use a block algorithm with caches and direct access scratch files for the line data to minimize storage requirements. A block is the number of lines stored in active memory, and the cache is the total number of blocks stored in memory. When the memory size is exceeded, the blocks are written to direct access files on disk. Thus the number of lines that can be included is not limited by RAM, but rather by disk space and the cost of I/O. This approach is computationally efficient because it provides high data and code locality and thus minimizes cache/TLB misses and page faults. It has proven to be so effective that model calculations with more than 15 million lines can be performed on medium-sized workstations (e.g., an IBM RS/6000-530 with 64MB RAM and 360MB scratch disk space). The CPU time for a single iteration of an LTE model is dominated by the calculation of the line opacity (50 to 90%, depending on the model parameters); therefore, both the LTE atomic and molecular line opacities are good candidates for parallelization.
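The following sketch (Python; the file name, record layout, and block size are assumptions, not the actual PHOENIX implementation) illustrates the idea of the blocking algorithm: only a small, fixed number of line data blocks is kept in RAM, and all other blocks live in a direct access scratch file and are read back on demand.

\begin{verbatim}
# Sketch of a block-and-cache scheme for line data (names and sizes assumed).
import numpy as np
from collections import OrderedDict

BLOCK_SIZE      = 30000            # lines per block (example value, see below)
FLOATS_PER_LINE = 4                # e.g. wavelength, gf, damping, excitation
RECORD_BYTES    = BLOCK_SIZE * FLOATS_PER_LINE * 8
SCRATCH_FILE    = "lines.scratch"  # hypothetical direct access scratch file

def write_blocks(blocks):
    """Line selection step: write all blocks sequentially to the scratch file."""
    with open(SCRATCH_FILE, "wb") as f:
        for block in blocks:
            f.write(np.asarray(block, dtype=np.float64).tobytes())

class LineBlockCache:
    """Keep at most `cache_blocks` blocks in RAM; read the rest from disk."""
    def __init__(self, cache_blocks=2):
        self.cache_blocks = cache_blocks
        self.cache = OrderedDict()               # block index -> ndarray

    def read_block(self, iblock):
        if iblock in self.cache:                 # block already in RAM
            self.cache.move_to_end(iblock)
            return self.cache[iblock]
        with open(SCRATCH_FILE, "rb") as f:      # direct access read from disk
            f.seek(iblock * RECORD_BYTES)
            raw = f.read(RECORD_BYTES)
        block = np.frombuffer(raw, dtype=np.float64).reshape(-1, FLOATS_PER_LINE)
        if len(self.cache) >= self.cache_blocks: # evict least recently used block
            self.cache.popitem(last=False)
        self.cache[iblock] = block
        return block
\end{verbatim}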

There are several obvious ways to parallelize the line opacity calculations. The first method is to let each node compute the opacity at $N_{\rm r}/N_{\rm node}$ radial points; the second is to let each node work on a different subset of the spectral lines within each search window. A third way, related to the second method, is to use completely different sets of lines for each node (i.e., a global split of the workload between the nodes, in contrast to the local split of the second method). In the following, we will discuss the advantages and disadvantages of these three methods. All three methods require only a very small amount of communication, namely a gather of all results to the master node with an MPI_REDUCE library call.
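To make the communication pattern concrete, here is a minimal mpi4py sketch (the array size and the dummy workload are made up) of that single reduction step: each node produces a partial opacity array and the master node receives the sum.

\begin{verbatim}
# Minimal mpi4py sketch of the single communication step: each node computes
# a partial line opacity and the master obtains the sum via one reduce call
# (the MPI_REDUCE of the text). Array size and dummy work are assumptions.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_radial = 50                                  # hypothetical radial grid size
partial_opacity = np.zeros(n_radial)
partial_opacity += rank + 1.0                  # dummy work for this node's lines

total_opacity = np.zeros(n_radial) if rank == 0 else None
comm.Reduce(partial_opacity, total_opacity, op=MPI.SUM, root=0)

if rank == 0:
    print("summed line opacity (first values):", total_opacity[:3])
\end{verbatim}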

The first and second methods, distributing sets of radial points or sets of lines within the local (wavelength dependent) search window over the nodes, respectively, are very simple to implement. They can easily be combined to optimize their performance: if, for any wavelength point, the number of depth points is larger than the number of lines within the local search window, it is more effective to parallelize this wavelength point with respect to the radial points; otherwise it is better to parallelize with respect to the lines within the local window. It is trivial to add logic that selects the optimum method for every wavelength point individually, which optimizes overall performance with negligible overhead. The speed-up for this method of parallelization is very close to the optimum value if the total number of blocks of the blocking algorithm is relatively small ($\le 3\ldots 5$).
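A sketch of that per-wavelength-point decision (Python; function and argument names are hypothetical) could look as follows:

\begin{verbatim}
# Sketch of the per-wavelength-point choice described above: distribute the
# radial points when they outnumber the lines in the local search window,
# otherwise distribute the lines. Names are illustrative only.
def choose_parallel_mode(n_radial_points, n_lines_in_window):
    if n_radial_points > n_lines_in_window:
        return "radial"   # split the radial grid over the nodes
    return "lines"        # split the lines in the local search window

print(choose_parallel_mode(n_radial_points=50, n_lines_in_window=12))    # radial
print(choose_parallel_mode(n_radial_points=50, n_lines_in_window=4000))  # lines
\end{verbatim}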

However, if the total number of line blocks is larger (typically, about 10 to 20 blocks are used), the overhead due to the read operations on the block scratch files becomes noticeable and can reach 20% or more of the total wall-clock time. This is because the `local parallelization' requires that every node working on the line opacities read every line block, which increases the IO time and the load on the IO subsystem by a factor equal to the number of nodes.

An implementation of the third method, i.e., distributing the global set of lines over the nodes, will therefore be more effective if the number of lines is large (the usual case in practical applications). There are a number of ways to implement this approach; however, many of them would require either significant administration and communication during the line selection procedure or a large number of individual scratch files (one for each node). By far the easiest and fastest way to implement the global distribution on the IBM SP2 is to use the Parallel IO Filesystem (PIOFS). The PIOFS has the advantages that (a) the code changes required are simple and easily reversible (for compatibility with other machines), (b) the PIOFS software allows the creation of a single file on the line selection nodes (the line selection is a process that is inherently serial and can be parallelized only by separating the atomic and molecular line selections, which, although simple, has not yet been implemented), (c) it allows different nodes to access distinct portions of a single file as their own `subfiles', and (d) the IO load is distributed over all IO nodes that are PIOFS servers.

Points (b) and (c) make the implementation very simple. We chose a single line as the atomic unit of the PIOFS file that is used to store the line blocks and create the PIOFS file with $N_{\rm node}$ subfiles (where $N_{\rm node}$ is the number of nodes that will later work on the LTE line opacity). The line selection routines then set the PIOFS file view to the equivalent of a single direct access file with the appropriate block size. After the line selection, each of the $N_{\rm node}$ nodes sets its local view of the PIOFS file such that it `sees' one of the $1 \ldots N_{\rm node}$ subfiles, so that node $n$ reads lines $n$, $n+N_{\rm node}$, $n+2N_{\rm node}$, and so forth. This is equivalent to the `global line distribution' method with minimal programming effort. The advantages are not only that each node has to read only $1/N_{\rm node}$ of the total number of lines, but also that the IO itself is distributed both over the available PIOFS fileservers and over time (because the different line sets cause the block IO operations to happen at different times). The latter is very useful and enhances the parallel performance and scalability. In addition, the process is completely transparent to the program; only 2 explicit PIOFS subroutine calls had to be inserted to set the `view' of the PIOFS file to the correct value.
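In effect, the subfile view produces a simple strided assignment of lines to nodes; the following sketch (Python, with 1-based line numbering as in the text) shows the assignment it is equivalent to:

\begin{verbatim}
# Sketch of the line assignment produced by the PIOFS subfile view: with
# N_node subfiles and one line per stripe unit, node n (1-based) reads
# lines n, n+N_node, n+2*N_node, ...  Shown here as a plain strided slice.
def lines_for_node(node, n_nodes, n_lines_total):
    return list(range(node, n_lines_total + 1, n_nodes))

n_nodes, n_lines = 5, 23
for node in range(1, n_nodes + 1):
    print(node, lines_for_node(node, n_nodes, n_lines))
# Each node touches only ~1/N_node of the lines, and the nodes reach their
# block boundaries at different times, which spreads the IO load over time.
\end{verbatim}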

In Fig. 1 and Table 2 we show the performance and speed-up as a function of the number of nodes for a simplified model of a very low mass star. We use an M dwarf star model with the parameters $T_{\rm eff}=2700\,{\rm K}$, $\log(g)=5.0$, and solar abundances, appropriate for the late M dwarf VB10. The test model includes 226,000 atomic lines (of these, 29,000 are calculated using Voigt profiles) and 11.2 million molecular lines (with 3.6 million calculated with Voigt profiles) and uses a wavelength grid with 13,000 points. We have set the line search windows to values larger than required in order to simulate ``real'' model calculations (which typically use twice as many wavelength points). In addition, we set the blocksizes for the line data storage to 30,000 lines for both molecular and atomic lines and used a line data cache size of 2 (that is, we store 2 blocks of 30,000 lines each in RAM). This setting is optimal for the atomic lines; in production models the blocksizes are set to larger values (about 100,000) for the molecular lines. However, the cache sizes were large enough to prevent cache thrashing.

The speed-up that we find is very good, and the scaling is close to the theoretical maximum for the atomic lines (a factor of 4.5 for 5 nodes). For the molecular lines the speed-up is 4.1 for 5 nodes, a little lower than for the atomic lines. This is caused by the additional IO time required to read the line data blocks from the scratch file. Ignoring the IO time, the molecular line speed-up is 4.6, very close to the value found for the atomic lines. These results were obtained using the parallel IO system; if we instead rely on a standard file, the IO time for the molecular lines increases by a factor of 3.2 and the speed-up for the molecular lines drops to a factor of only 2.2 (for 5 nodes). In production runs on machines that do not have a PIOFS filesystem, we would of course use larger blocksizes for the line data arrays to minimize the IO time, as we usually do on serial machines.
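For reference, the quoted speed-ups correspond to the following parallel efficiencies (speed-up divided by the number of nodes):

\begin{verbatim}
# Parallel efficiency = speed-up / number of nodes, for the numbers above.
n_nodes = 5
for label, speedup in [("atomic lines", 4.5),
                       ("molecular lines (with IO)", 4.1),
                       ("molecular lines (IO excluded)", 4.6),
                       ("molecular lines, standard file", 2.2)]:
    print(f"{label:32s}: {speedup / n_nodes:.0%}")
# -> 90%, 82%, 92%, and 44%, respectively
\end{verbatim}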


