In Fig. 4 we show the performance results for a nova model
atmosphere calculation (solar abundances) using various
configurations running on the same IBM SP2.
The model includes 1775 NLTE levels with 32056 primary NLTE lines, about
1.3 million background LTE lines and about 90000 secondary NLTE lines
(dynamically selected). The calculation was performed on a grid of about
175000 wavelength points. This model is somewhat smaller than our typical
nova models; it was chosen because it is small enough to run in serial
mode on the IBM SP2 that we used for the tests. The behavior of the
parallel performance and scalability is essentially as expected. For a
small number of nodes, the speedup obtained by using several wavelength
clusters is smaller than the speedup obtained by using one wavelength
cluster with several worker nodes. As the number of nodes increases, it
becomes more effective to add wavelength clusters than to add workers.
However, once the number of clusters exceeds a limit (about 8 clusters
for this model), increasing the number of clusters further (with a fixed
number of workers) yields no additional speedup. The optimum load
distribution is thus a combination of all parallelization methods,
depending not only on the machine but also on the workload distribution
of the model calculation itself.
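To make this tradeoff concrete, the interplay between the two parallelization levels can be sketched with a toy performance model. All constants below (saturation limits, scaling exponents) are illustrative assumptions chosen only to reproduce the qualitative behavior described above; they are not measured values from our tests:

```python
# Toy model of two-level parallel speedup: wavelength clusters form a
# pipeline that saturates at a fixed depth, while worker nodes within a
# cluster parallelize the spatial work with diminishing returns.
# All constants are illustrative assumptions, not measured values.

CLUSTER_SAT = 8    # clusters beyond this add no speedup (pipeline full)
WORKER_SAT = 16    # assumed limit on useful workers per cluster

def speedup(n_clusters: int, n_workers: int) -> float:
    """Estimated speedup over a serial run for one configuration."""
    cluster_gain = min(n_clusters, CLUSTER_SAT) ** 0.8  # assumed sublinear
    worker_gain = min(n_workers, WORKER_SAT) ** 0.9     # assumed sublinear
    return cluster_gain * worker_gain

def best_config(total_nodes: int):
    """Search all (clusters, workers) splits of a fixed node budget."""
    configs = [(c, total_nodes // c) for c in range(1, total_nodes + 1)
               if total_nodes % c == 0]
    return max(configs, key=lambda cw: speedup(*cw))

if __name__ == "__main__":
    for nodes in (4, 16, 64):
        c, w = best_config(nodes)
        print(f"{nodes} nodes -> {c} clusters x {w} workers, "
              f"speedup ~ {speedup(c, w):.1f}")
```

Under these assumptions, a small node budget favors a single cluster with several workers, while a large budget favors a mix of clusters and workers, mirroring the qualitative behavior described above.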
For a very large supernova calculation, we examine both the
scaling and performance tradeoff of spatial versus wavelength
parallelization. Figure 5 presents the results of our timing
tests for one iteration of a Type Ic supernova model
atmosphere, with a prescribed model temperature and observed luminosity,
a characteristic velocity $v_0 = 10000$ km s$^{-1}$,
4666 NLTE levels, 163812 NLTE Gauss lines, 211680
LTE Gauss lines, inhomogeneous abundances, and 260630 wavelength
points. This is among the largest calculations we run and hence it has
the highest potential for synchronization, I/O waiting, and swapping to
reduce performance. It is, however, characteristic of the level of detail
needed to model supernovae accurately. This calculation has also been
designed to barely fit into the memory of a single node. The behavior of
the speedup is very similar to the results reported for the nova
test case. The fact that the turnover occurs at a lower number of
processing elements is almost certainly due to the higher I/O and memory
bandwidth required by the supernova calculation, which is much larger
than the nova calculation.
The ``saturation point'' at which the wavelength pipeline fills, so that no further speedup can be obtained by adding wavelength clusters, lies at about 5 to 8 clusters for the machines used here. As expected, more clusters will not lead to larger speedups. Larger speedups can, however, be obtained by using more worker nodes per cluster, which also drastically reduces the amount of memory required on each node.
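The memory reduction from adding workers can be illustrated with an equally simple estimate. If the divisible spatial data of a calculation that barely fits into one node is split among the workers of a cluster, the per-node requirement drops roughly as the inverse of the worker count plus a fixed replicated share. The figures below are hypothetical, not measurements:

```python
# Illustrative estimate of per-node memory when worker nodes divide the
# spatial data of one wavelength cluster. Both constants are assumptions
# for illustration, not measured requirements.

REPLICATED_GB = 0.5   # assumed data every node must hold (line lists, etc.)
SPATIAL_GB = 1.5      # assumed spatial data divisible among workers

def mem_per_node(n_workers: int) -> float:
    """Approximate memory (GB) each worker node needs."""
    return REPLICATED_GB + SPATIAL_GB / n_workers

if __name__ == "__main__":
    for w in (1, 2, 4, 8):
        print(f"{w} workers: {mem_per_node(w):.3f} GB per node")
```

In this sketch a calculation that needs 2 GB on a single node fits comfortably on nodes with less than 1 GB once four or more workers share the spatial data, which is the effect exploited for the largest supernova models.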