The contribution of spectral lines by both atoms and molecules is
calculated in PHOENIX by direct summation of all contributing lines.
Each line profile is computed individually at every radial point for
each line within a search window (typically, a few hundred to a few
thousand Doppler widths). This method is more
accurate than methods that rely on pre-computed line opacity tables
(the ``opacity sampling method'') or methods based on distribution
functions (the ``opacity distribution function (ODF) method''). Both
ODF and the opacity sampling method
neglect the details (e.g., depth dependence) of the
individual line profiles. This introduces systematic errors in the
line opacity because the pressures at the top and at the bottom of the
line-forming region are vastly different (by several orders of magnitude
in cool dwarf stars), which causes very different pressure damping and
therefore different line widths across the line-forming region. In
addition, the pre-computed tables require a specified and fixed
wavelength grid, which is too restrictive for NLTE calculations that
include important background line opacities.
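As a schematic illustration of this direct summation (not PHOENIX source; the data structure and the Lorentzian standing in for the Voigt profile are assumptions made only for brevity), consider the following C fragment:
\begin{verbatim}
/* Illustrative sketch of direct line summation: at a given radial
   point, the opacity at wavelength `lambda' is the sum over all lines
   in the search window of an individually evaluated profile whose
   damping, and hence width, reflects the local conditions at that
   depth.  A Lorentzian stands in for the Voigt profile to keep the
   example short. */

typedef struct {
    double lambda0;    /* line centre wavelength   */
    double strength;   /* integrated line strength */
} Line;

/* placeholder profile with depth-dependent damping width `gamma' */
static double profile(double dlambda, double gamma)
{
    return gamma / (3.14159265358979 * (dlambda*dlambda + gamma*gamma));
}

/* sum the contributions of all lines in the local search window;
   gamma_at_depth[i] is the damping of line i at this radial point */
double line_opacity(double lambda, const Line *lines, int n_lines,
                    const double *gamma_at_depth)
{
    double kappa = 0.0;
    for (int i = 0; i < n_lines; ++i)
        kappa += lines[i].strength *
                 profile(lambda - lines[i].lambda0, gamma_at_depth[i]);
    return kappa;
}
\end{verbatim}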
In typical model calculations we find that about 1,000 to 10,000 spectral lines contribute to the line opacity at any wavelength point (e.g., in M dwarf model calculations). Therefore, we need to calculate a large number of individual Voigt profiles at every wavelength point. The subroutines for these computations can easily be vectorized, and we use a block algorithm with caches and direct access scratch files for the line data to minimize storage requirements. A block is the number of lines stored in active memory, and the cache is the total number of blocks stored in memory. When the memory size is exceeded, the blocks are written to direct access files on disk. Thus the number of lines that can be included is not limited by RAM, but rather by disk space and the cost of I/O. This approach is computationally efficient because it provides high data and code locality and thus minimizes cache/TLB misses and page faults. It has proven to be so effective that model calculations with more than 15 million lines can be performed on medium-sized workstations (e.g., an IBM RS/6000-530 with 64MB RAM and 360MB scratch disk space). The CPU time for a single iteration of an LTE model is dominated by the calculation of the line opacity (50 to 90%, depending on the model parameters); therefore, both the LTE atomic and molecular line opacities are good candidates for parallelization.
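The following C fragment gives a minimal sketch of such a block cache backed by a direct access scratch file; the record size and the direct-mapped replacement policy are assumptions made for illustration and do not reproduce the actual PHOENIX storage layout:
\begin{verbatim}
/* Minimal sketch of a block cache for line data backed by a direct
   access scratch file.  LINES_PER_BLOCK and CACHE_BLOCKS follow the
   test setup described below; REC_BYTES is an assumed record size. */
#include <stdio.h>

#define LINES_PER_BLOCK 30000
#define CACHE_BLOCKS    2
#define REC_BYTES       64

typedef struct {
    int  block_id;                           /* -1 = slot empty     */
    char data[LINES_PER_BLOCK * REC_BYTES];  /* packed line records */
} Block;

static Block cache[CACHE_BLOCKS] = { { .block_id = -1 },
                                     { .block_id = -1 } };

/* Return the block holding line `iline', reading it from the scratch
   file only if it is not already cached (direct-mapped replacement). */
char *get_block(FILE *scratch, long iline)
{
    int id   = (int)(iline / LINES_PER_BLOCK);
    int slot = id % CACHE_BLOCKS;

    if (cache[slot].block_id != id) {
        fseek(scratch, (long)id * sizeof cache[slot].data, SEEK_SET);
        if (fread(cache[slot].data, 1, sizeof cache[slot].data, scratch)
            != sizeof cache[slot].data) {
            /* a production code would handle the IO error here */
        }
        cache[slot].block_id = id;
    }
    return cache[slot].data;
}
\end{verbatim}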
There are several obvious ways to parallelize the line opacity
calculations. The first method is to let each node compute the opacity
for a different subset of the radial points; the second is to let each node
work on a different subset of spectral lines within each search window.
A third way, related to the second method, is to use completely
different sets of lines for each node (i.e., use a global workload split
between the nodes in contrast to a local split in the second method).
In the following, we will discuss the advantages and disadvantages of
these three methods. All three methods require only a very small
amount of communication, namely a gather of all results to the
master node with an MPI_REDUCE library call.
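A minimal sketch of this single communication step, assuming the per-node contributions are stored in a plain array of doubles (the routine and variable names are ours, not those used in PHOENIX):
\begin{verbatim}
#include <mpi.h>

/* Sum the per-node line opacity contributions onto the master node
   (rank 0).  kappa_local holds this node's partial opacities for the
   n_values points it worked on; the element-wise sum over all nodes
   ends up in kappa_total on the master. */
void gather_line_opacity(double *kappa_local, double *kappa_total,
                         int n_values)
{
    MPI_Reduce(kappa_local, kappa_total, n_values,
               MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
}
\end{verbatim}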
The first and second methods, distributing sets of radial points or
sets of lines within the local (wavelength dependent) search window
over the nodes, respectively, are very simple to implement. They can
easily be combined to optimize their performance: if for any
wavelength point the number of depth points is larger than the number
of lines within the local search window, then it is more effective to
run this wavelength point in parallel with respect to the radial points;
otherwise it is better to parallelize with respect to the lines within
the local window. It is trivial to add logic to decide the optimum
method for every wavelength point individually. This optimizes overall
performance with negligible overhead. The speed-up for this method of
parallelization is very close to the optimum value if the total number
of blocks in the blocking algorithm is relatively small.
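A sketch of this per-wavelength decision and the resulting local work split could look as follows (the contiguous chunking and all names are illustrative assumptions):
\begin{verbatim}
/* For one wavelength point, give node `rank' of `n_nodes' either a
   chunk of the radial (depth) points or a chunk of the lines in the
   local search window, whichever set is larger.  The other index
   range is left untouched so each node covers it completely. */
void local_split(int rank, int n_nodes, int n_depth, int n_lines,
                 int *d_lo, int *d_hi, int *l_lo, int *l_hi)
{
    *d_lo = 0; *d_hi = n_depth;   /* default: full depth range */
    *l_lo = 0; *l_hi = n_lines;   /* default: full line range  */

    if (n_depth > n_lines) {      /* parallelize over depth points */
        int chunk = (n_depth + n_nodes - 1) / n_nodes;
        *d_lo = rank * chunk;
        *d_hi = (*d_lo + chunk < n_depth) ? *d_lo + chunk : n_depth;
    } else {                      /* parallelize over local lines  */
        int chunk = (n_lines + n_nodes - 1) / n_nodes;
        *l_lo = rank * chunk;
        *l_hi = (*l_lo + chunk < n_lines) ? *l_lo + chunk : n_lines;
    }
}
\end{verbatim}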
However, if the total number of line blocks is larger (typically, about 10 to 20 blocks are used), the overhead due to the read operations for the block scratch files becomes noticeable and can reach 20% or more of the total wall-clock time. This is because the `local parallelization' requires each node working on the line opacities to read every line block, which multiplies the IO time and the load on the IO subsystem by the number of nodes.
An implementation of the third method, i.e., distributing the global set of lines onto the nodes, will therefore be more effective if the number of lines is large (the usual case in practical applications). There are a number of ways to implement this approach; however, many of them would require either significant administration and communication during the line selection procedure or a large number of individual scratch files (one for each node). By far the easiest and fastest way to implement the global distribution on the IBM SP2 is to use the Parallel IO Filesystem (PIOFS). The PIOFS has the advantages that (a) the code changes required are simple and easily reversible (for compatibility with other machines), (b) the PIOFS software allows the creation of a single file on the line selection nodes (the line selection is a process that is inherently serial and can be parallelized only by separating the atomic and molecular line selections, which, although simple, has not yet been implemented), (c) it allows different nodes to access distinct portions of a single file as their own `subfiles', and (d) the IO load is distributed over all IO nodes that are PIOFS servers.
Points (b) and (c) make the implementation very simple. We chose
a single line
as the atomic size of the PIOFS file that is used to store the
line blocks and create the PIOFS file with $n$ sub-files (where $n$
is the number of nodes that
will later work on the LTE line opacity). The line selection
routines then set the PIOFS file view to the equivalent of a
single direct access file with the appropriate block size. After the
line selection, each of the $n$ nodes sets the local
view of the PIOFS file such that it `sees' only its own
subfile, so that node $N$ reads lines $N$,
$N+n$,
$N+2n$,
and so forth. This is equivalent to the `global line distribution'
method with a minimal programming effort. The advantages are
not only that each node has to read only $1/n$
of the total
lines, but also that the IO itself is distributed both over the
available PIOFS fileservers and over time (because the different
line sets will cause the block IO operations to happen at different
times). The latter is very useful and enhances the parallel performance
and scalability. In addition, the process is completely transparent
to the program; only two explicit PIOFS subroutine calls
had to be inserted to set the `view' of the PIOFS file to the
correct value.
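PIOFS itself is specific to the IBM SP2. As an illustration of the same idea on other systems, the following sketch builds an equivalent strided `subfile' view with standard MPI-IO calls; the record size and routine name are assumptions, and this is not the code used in PHOENIX, which calls PIOFS directly:
\begin{verbatim}
#include <mpi.h>

#define REC_BYTES 64   /* assumed size of one packed line record */

/* Give node `rank' of `n_nodes' a file view that exposes only line
   records rank, rank+n_nodes, rank+2*n_nodes, ... of the shared
   line-block file, mimicking the PIOFS subfile mechanism. */
void set_line_view(MPI_File fh, int rank, int n_nodes)
{
    MPI_Datatype rec, strided;
    MPI_Aint lb, extent;

    MPI_Type_contiguous(REC_BYTES, MPI_BYTE, &rec);
    MPI_Type_commit(&rec);

    /* tile one record every n_nodes records */
    MPI_Type_get_extent(rec, &lb, &extent);
    MPI_Type_create_resized(rec, 0, (MPI_Aint)n_nodes * extent, &strided);
    MPI_Type_commit(&strided);

    /* start at record `rank'; subsequent reads on fh then return only
       the records belonging to this node */
    MPI_File_set_view(fh, (MPI_Offset)rank * REC_BYTES,
                      rec, strided, "native", MPI_INFO_NULL);

    MPI_Type_free(&rec);
    MPI_Type_free(&strided);
}
\end{verbatim}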
In Fig. 1 and Table 2
we show the performance and speed-up
as a function of the number of nodes for a simplified model
of a very low mass star.
We use an M dwarf model with solar abundances and stellar
parameters appropriate for the late
M dwarf VB10. The test model includes 226,000 atomic
lines (of these, 29,000 are calculated using Voigt profiles)
and 11.2 million molecular lines (with 3.6 million calculated
with Voigt profiles) and uses a wavelength grid with 13,000 points.
We have set the line search windows to values larger than
required in order to simulate ``real'' model calculations
(which typically use twice as many wavelength points). In
addition, we set the blocksizes
for the line data storage to 30,000 lines for both molecular
and atomic lines and used a line data cache size of 2 (that is, we
store in RAM 2 blocks of 30,000 lines each). This setting
is optimal for the atomic lines but the blocksizes are
set to larger values (about 100,000) for the molecular lines
in production models. However, the cache sizes were large
enough to prevent cache thrashing.
The speed-up that we find is very good, and the scaling is close to the theoretical maximum for the atomic lines (a factor of 4.5 for 5 nodes). For the molecular lines the speed-up is 4.1 for 5 nodes, a little lower than for the atomic lines. This is caused by the additional IO time required to read the line data blocks from the scratch file. Ignoring the IO time, the molecular line speed-up is 4.6, very close to the value found for the atomic lines. These results were obtained using the parallel IO system; if we instead rely on a standard file, the IO time for the molecular lines increases by a factor of 3.2 and the speed-up for the molecular lines decreases to a factor of only 2.2 (again for 5 nodes). In production runs on machines that do not have a PIOFS filesystem, we would of course use larger blocksizes for the line data arrays to minimize the IO time, as we usually do on serial machines.