In Fig. 4 we show the performance results for a nova model atmosphere calculation (, , , solar abundances) using various configurations running on the same IBM SP2. The model includes 1775 NLTE levels with 32056 primary NLTE lines, about 1.3 million background LTE lines, and about 90000 secondary NLTE lines (dynamically selected). The calculation was performed on a grid of about 175000 wavelength points. This model is somewhat smaller than our typical nova models; it was used because it is small enough to run in serial mode on the IBM SP2 that we used for the tests. The parallel performance and scalability behave essentially as expected. For a small number of nodes, the speedup obtained by using multiple wavelength clusters is smaller than the speedup obtained by using one wavelength cluster with several worker nodes. As the number of nodes increases, it becomes more effective to add wavelength clusters than to add workers. However, once the number of clusters exceeds a limit (about 8 clusters in this model), adding further clusters at a fixed number of workers no longer increases the speedup. The optimum load distribution is thus a combination of all parallelization methods, and it depends not only on the machine but also on the workload distribution of the model calculation itself.
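The trade-off described above can be explored by tabulating every way of splitting a fixed number of nodes into wavelength clusters and workers per cluster. The following Python sketch is illustrative only: the function names and the layout of the `timings` dictionary are assumptions made here, not part of the code used for the tests, and the wall-clock times have to be supplied from actual measurements.

```python
def configurations(n_nodes):
    """Return all (clusters, workers_per_cluster) pairs whose product is n_nodes."""
    return [(c, n_nodes // c) for c in range(1, n_nodes + 1) if n_nodes % c == 0]


def speedup(t_serial, t_parallel):
    """Speedup relative to the serial run (both times in the same units)."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("wall-clock times must be positive")
    return t_serial / t_parallel


def best_configuration(t_serial, timings):
    """Pick the measured configuration with the largest speedup.

    timings maps (clusters, workers_per_cluster) -> measured wall-clock time
    (hypothetical layout, chosen for this sketch).
    Returns ((clusters, workers), speedup, parallel efficiency).
    """
    # The shortest wall-clock time gives the largest speedup.
    config, t_best = min(timings.items(), key=lambda item: item[1])
    s = speedup(t_serial, t_best)
    n_nodes = config[0] * config[1]
    return config, s, s / n_nodes
```

For example, `configurations(8)` lists the four layouts (1, 8), (2, 4), (4, 2), and (8, 1) that would be compared for an 8-node run.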
For a very large supernova calculation, we examine both the scaling and the performance tradeoff of spatial versus wavelength parallelization. Figure 5 presents the results of our timing tests for one iteration of a Type Ic supernova model atmosphere, with a model temperature K (the observed luminosity is given by ), characteristic velocity $v_0 = 10000$ km s$^{-1}$, 4666 NLTE levels, 163812 NLTE Gauss lines, 211680 LTE Gauss lines, non-homogeneous abundances, and 260630 wavelength points. This is among the largest calculations we run, and hence it has the highest potential for synchronization, I/O waiting, and swapping to reduce performance. It is, however, characteristic of the level of detail needed to model supernovae accurately. This calculation has also been designed to barely fit into the memory of a single node. The behavior of the speedup is very similar to the results reported for the nova test case. That the turnover occurs at a lower number of processor elements is almost certainly due to the higher I/O and memory bandwidth required by the supernova calculation, which is larger than the nova calculation.
The ``saturation point'', at which the wavelength pipeline fills and no further speedup can be obtained by using more wavelength clusters, lies at about 5 to 8 clusters for the machines used here. Beyond this point, additional clusters do not lead to larger speedups, as expected. Larger speedups can be obtained by using more worker nodes per cluster, which also drastically reduces the amount of memory required on each node.
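The saturation can be made plausible with a toy model of the pipeline. Assume, purely for illustration, that the work per wavelength point splits into a part of duration `t_indep` that can overlap with other wavelength points and a part of duration `t_dep` that must wait for the same part of the previous wavelength point, and that the workers within a cluster divide both parts evenly. Under these assumptions the steady-state time per wavelength point with `n_clusters` clusters is the larger of `(t_indep + t_dep) / n_clusters` and `t_dep`. The model and all names in it are hypothetical simplifications; they ignore communication, I/O, and load imbalance, and are not fitted to the timing tests.

```python
def saturation_clusters(t_indep, t_dep):
    """Number of clusters beyond which the toy pipeline stops speeding up."""
    if t_indep < 0 or t_dep <= 0:
        raise ValueError("need t_indep >= 0 and t_dep > 0")
    return (t_indep + t_dep) / t_dep


def pipeline_speedup(n_clusters, n_workers, t_indep, t_dep):
    """Steady-state speedup of the toy wavelength pipeline over a serial run.

    Assumption (not from the measurements): workers scale ideally, so each
    part of the work per wavelength point shrinks by a factor n_workers.
    """
    if n_clusters < 1 or n_workers < 1:
        raise ValueError("need at least one cluster and one worker")
    if t_indep < 0 or t_dep <= 0:
        raise ValueError("need t_indep >= 0 and t_dep > 0")
    serial = t_indep + t_dep
    per_point = serial / n_workers   # total work per wavelength point in one cluster
    dependent = t_dep / n_workers    # part that cannot overlap between clusters
    # Clusters overlap the independent part; the dependent part is a hard floor.
    step = max(per_point / n_clusters, dependent)
    return serial / step
```

With `t_indep = 5` and `t_dep = 1` (arbitrary units chosen only for illustration) the model saturates at 6 clusters: going from 8 to 16 clusters leaves the speedup unchanged, whereas doubling the number of workers per cluster doubles it.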