
Wavelength Parallelization

The division of labor outlined in the previous section requires synchronization between the radiative transfer tasks and the NLTE tasks, since the radiation field and the ``approximate $\Lambda$ operator'' must be passed between them. In addition, our standard model calculations use 50 radial grid points, and as the number of nodes increases, so does the communication and loop overhead. We found good speedup, close to the theoretical maximum, up to about 5 nodes for a typical supernova calculation. Beyond about 5 nodes, however, the communication and loop overheads begin to become significant, and it is not economical to use more than 10 nodes (depending on the machine and the model calculation, it might nevertheless be necessary to use more nodes to fit the data in the memory available on a single node).

Since the number of wavelength points in a calculation is very large and the CPU time scales linearly with the number of wavelength points, a further distribution of labor by wavelength points potentially leads to large speedups and to the ability to use the very large numbers of nodes available on massively parallel supercomputers. We have therefore developed the concept of wavelength ``clusters'' to distribute sets of wavelength points (for the solution of the frequency dependent radiative transfer) onto different sets of nodes (see Fig. 1). In order to achieve optimal load balance and, more importantly, to minimize the memory requirements, each cluster works on a single wavelength point at any given time, but it may consist of a number of ``worker'' nodes, which use the parallelization methods discussed in paper I. In order to avoid communication overhead, the workers of each wavelength cluster are symmetric: the corresponding workers on the different wavelength clusters perform identical tasks, but on a different set of wavelengths for each cluster. We thus arrange the total number of nodes N in a rectangular matrix with n columns and m rows, where n is the number of clusters and m is the number of workers per cluster, such that N = n*m.
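As a minimal illustration of this layout (a sketch in C with names of our own choosing, not taken from the actual code), the position of a node in the n x m matrix of Fig. 1 follows directly from its global node number if the nodes of a cluster are numbered consecutively:

    /* Sketch of the node layout of Fig. 1 (0-based indices): nodes
       0..m-1 form cluster 0, nodes m..2m-1 form cluster 1, and so on,
       for a total of N = n*m nodes.                                   */
    void node_position(int node, int m, int *cluster, int *worker)
    {
        *cluster = node / m;   /* column of Fig. 1: which wavelength cluster      */
        *worker  = node % m;   /* row of Fig. 1: which worker within that cluster */
    }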

This design allows us to make use of communicator contexts, a concept built into MPI. The nodes of a given wavelength cluster are assigned to a single MPI_GROUP (a vertical column in Fig. 1), so that the m nodes of each cluster form their own MPI_GROUP and have their own MPI communicator to pass messages within the cluster. If m is larger than one, we use the task and data parallelism introduced in paper I within each individual cluster. In addition to the n MPI_GROUPs, each containing the m workers of one cluster, we also use m MPI_GROUPs, each containing the corresponding workers of all n clusters, with their corresponding communicators. These groups can pass messages within an individual row of Fig. 1, thus allowing the flow of information between wavelength points; this is important for the solution of the co-moving frame RTE, as discussed below. The code has been designed so that the number of wavelength clusters n, the number of workers per wavelength cluster m, and the task distribution within a wavelength cluster are arbitrary and can be specified dynamically at run time.
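One way to realize these two families of groups and communicators is via MPI_Comm_split; the sketch below (in C, with names of our own choosing) illustrates the idea rather than the actual implementation, which may construct the MPI groups explicitly. With these split keys, the rank of a node within its cluster communicator equals its worker index, and its rank within its row communicator equals its cluster index.

    #include <mpi.h>

    /* Build the intra-cluster (column) and intra-row communicators of
       Fig. 1 from MPI_COMM_WORLD, assuming m = n_workers nodes per
       wavelength cluster and consecutive numbering as sketched above.  */
    void build_communicators(int n_workers,
                             MPI_Comm *cluster_comm,  /* the m workers of one cluster */
                             MPI_Comm *row_comm)      /* corresponding workers of all */
    {                                                 /* n clusters                   */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int cluster = rank / n_workers;   /* column index */
        int worker  = rank % n_workers;   /* row index    */

        /* All nodes sharing the same cluster index form one communicator
           (a vertical column of Fig. 1); ranks inside it are ordered by
           worker index.                                                  */
        MPI_Comm_split(MPI_COMM_WORLD, cluster, worker, cluster_comm);

        /* All nodes sharing the same worker index form one communicator
           (a row of Fig. 1); ranks inside it are ordered by cluster index.
           This communicator carries the specific intensities between
           wavelength points.                                             */
        MPI_Comm_split(MPI_COMM_WORLD, worker, cluster, row_comm);
    }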

For a static model atmosphere, all wavelengths and thus all wavelength clusters are completely independent and execute in parallel with no communication or synchronization along the rows of Fig. 1. The basic pre- and post-processing required are illustrated in the pseudo-code in Fig. 2. However, in order to parallelize the spectrum calculations for a model atmosphere with a global velocity field, such as the expanding atmospheres of novae, supernovae or stellar winds, we need to take the mathematical character of the RTE into account. For monotonic velocity fields, the RTE is an initial value problem in wavelength (with the initial condition at the smallest wavelength for expanding atmospheres and at the largest wavelength for contracting atmospheres). This initial value problem must be discretized fully implicitly to ensure stability. In the simplest case of a first order discretization, the solution of the RTE for wavelength point i depends only on the results for point i-1. In order to parallelize the spectrum calculations, the wavelength cluster $n_i$ computing the solution for wavelength point i must therefore receive the specific intensities from the cluster $n_{i-1}$ computing the solution for point i-1. This suggests a ``pipeline'' solution to the wavelength parallelization. Note that only the solution of the RTE is affected by this; the calculation of the opacities and rates remains independent between different wavelength clusters and remains fully parallelized. In this case, the wavelength parallelization works as follows. Each cluster independently computes the opacities and starts the RT calculations (hereafter called the pre-processing phase). It then waits until it receives the specific intensities for the previous wavelength point, finishes the solution of the RTE, and immediately sends the results to the wavelength cluster calculating the next wavelength point (to minimize waiting time, this is done with non-blocking send/receives). It then proceeds to calculate the rates etc. (hereafter called the post-processing phase) and the new opacities for its next wavelength point, and so on.
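The following sketch (in C) illustrates this pipeline for one worker of one cluster; the routine names (compute_opacities, rte_preprocess, rte_solve, compute_rates) and buffer sizes are placeholders of our own, not the actual PHOENIX interfaces, and indices are 0-based. It uses the row communicator built above, in which the rank of a node equals its cluster index, together with the round-robin allocation of wavelength points discussed below.

    #include <mpi.h>
    #include <stdlib.h>

    /* Placeholders for the actual opacity, radiative transfer and rate
       routines (assumptions, not the real interfaces).                 */
    void compute_opacities(int iwl);
    void rte_preprocess(int iwl);
    void rte_solve(int iwl, const double *I_prev, double *I_out);
    void compute_rates(int iwl);

    /* Pipeline loop of one worker of cluster 'cluster' (0-based).  The
       buffers hold this worker's share of the specific intensities.    */
    void wavelength_loop(MPI_Comm row_comm, int n_clusters, int cluster,
                         int n_wl, int n_intensities)
    {
        double *I_prev = malloc(n_intensities * sizeof(double));
        double *I_mine = malloc(n_intensities * sizeof(double));
        int prev = (cluster - 1 + n_clusters) % n_clusters; /* owns point i-1 */
        int next = (cluster + 1) % n_clusters;              /* owns point i+1 */
        MPI_Request send_req = MPI_REQUEST_NULL;

        for (int i = cluster; i < n_wl; i += n_clusters) {  /* round robin    */
            MPI_Request recv_req = MPI_REQUEST_NULL;
            if (i > 0)  /* post the receive early so it overlaps the pre-processing */
                MPI_Irecv(I_prev, n_intensities, MPI_DOUBLE, prev, i - 1,
                          row_comm, &recv_req);

            compute_opacities(i);   /* pre-processing: independent of other clusters */
            rte_preprocess(i);

            MPI_Wait(&recv_req, MPI_STATUS_IGNORE); /* intensities of point i-1      */
            MPI_Wait(&send_req, MPI_STATUS_IGNORE); /* I_mine may now be overwritten */
            rte_solve(i, I_prev, I_mine);           /* finish the RTE for point i    */

            if (i + 1 < n_wl)  /* hand the result to the cluster working on i+1      */
                MPI_Isend(I_mine, n_intensities, MPI_DOUBLE, next, i,
                          row_comm, &send_req);

            compute_rates(i);  /* post-processing: independent of other clusters     */
        }
        MPI_Wait(&send_req, MPI_STATUS_IGNORE);
        free(I_prev);
        free(I_mine);
    }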

The important point in this scheme is that each wavelength cluster can execute the post-processing phase of its current wavelength point and the pre-processing phase of its next wavelength point independently of and in parallel with all other clusters. This means that the majority of the total computational work can be done in parallel, leading to a substantial reduction in wall-clock time per model. Ideal load balancing can be obtained by dynamically allocating wavelength points to wavelength clusters. This requires only primitive logic with no measurable overhead; however, it also requires communication and an arbitration/synchronization process to avoid deadlocks. Typically, the number of clusters n (4-64) is much smaller than the number of wavelength points, $n_{\rm wl}\approx 300,000$, so that at any given time the work required for each wavelength point is roughly the same for each cluster (the work changes as the number of overlapping lines changes, for example). Therefore, a simple round robin allocation of wavelength points to clusters (cluster i calculates wavelength points i, n+i, 2n+i, and so on) can be used, which will result in nearly optimal performance if the condition $n \ll n_{\rm wl}$ is fulfilled.
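In this static scheme the ownership of a wavelength point reduces to a pure index calculation (a trivial sketch with 0-based indices and names of our own):

    /* Round-robin ownership: cluster c handles points c, c+n, c+2n, ...       */
    int owner_cluster(int i, int n) { return i % n; }      /* cluster of point i      */
    int kth_point(int c, int k, int n) { return c + k*n; } /* k-th point of cluster c */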

As an example, we consider the simple case of 16 nodes (with one CPU each), distributed over 2 wavelength clusters with 8 worker nodes each, i.e., n=2, m=8. Using the round-robin scheme, clusters 1 and 2 are allocated wavelength points 1 and 2, respectively. Within each cluster the work is divided using the combined task and data parallelism discussed in paper I, where the work topology is identical for each cluster. All clusters begin immediately by executing the various pre-processing steps required. Since cluster 1 begins with wavelength point 1, it sets the initial condition for the co-moving frame RT. It then solves the RTE and immediately sends the specific intensities off to cluster 2 (which is already working on wavelength point 2) using a non-blocking MPI send. Cluster 1 then continues to calculate the rates and various other post-processing at this wavelength point and immediately proceeds to the pre-processing phase of its next wavelength point, number 3 in this case. At the same time, cluster 2 has finished calculating the opacities at its wavelength point number 2 and has done the pre-processing for solving the RTE; it must then wait until cluster 1 has sent it the specific intensities for the previous wavelength point. It can then solve the RTE and immediately send its specific intensities on to cluster 1. Since cluster 1 was busy doing post-processing for wavelength point 1 and pre-processing for wavelength point 3, it may in fact receive the specific intensities from cluster 2 just in time when it needs them to continue with the solution of the RTE for wavelength point 3, so minimal waiting may be required, and the process proceeds in a round robin fashion. Because we employ more than one worker per wavelength cluster, the combined task and data parallel method described in paper I is used within each wavelength cluster; since the workers are symmetric, data needs to be sent only between identical workers on each wavelength cluster, as depicted in Fig. 1, thus minimizing the inter-cluster communication.
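Putting the sketches above together for this 16-node example gives roughly the following driver (again an illustration with hypothetical names and sizes, not the actual code). Here cluster_comm would carry the intra-cluster task and data parallel work of paper I, while row_comm carries the intensities between the two clusters.

    #include <mpi.h>

    /* The sketches from above. */
    void node_position(int node, int m, int *cluster, int *worker);
    void build_communicators(int n_workers, MPI_Comm *cluster_comm, MPI_Comm *row_comm);
    void wavelength_loop(MPI_Comm row_comm, int n_clusters, int cluster,
                         int n_wl, int n_intensities);

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        const int n_clusters = 2, n_workers = 8;  /* n = 2, m = 8, N = 16 nodes */
        int rank, cluster, worker;
        MPI_Comm cluster_comm, row_comm;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        node_position(rank, n_workers, &cluster, &worker);
        build_communicators(n_workers, &cluster_comm, &row_comm);
        (void)worker;  /* worker index and cluster_comm would be used for the
                          intra-cluster parallelization of paper I             */

        /* Cluster 0 starts with the first wavelength point and sets the
           initial condition; cluster 1 starts with the second point and
           waits for the intensities of the first before finishing its RTE
           solve.  The buffer size 1024 is an arbitrary placeholder.       */
        wavelength_loop(row_comm, n_clusters, cluster,
                        /* n_wl = */ 300000, /* n_intensities = */ 1024);

        MPI_Finalize();
        return 0;
    }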

This scheme has some predictable properties, similar to the performance behavior of classical serial vector machines. First, for a very small number of clusters (e.g., two), the speedup will be small because the clusters spend a significant amount of time waiting for the results from the previous cluster (the ``pipeline'' has no time to fill). Second, the speedup will level off for a very large number of clusters, when the clusters have to wait because some of the clusters working on previous wavelength points have not yet finished their RTE solution; this limits the minimum theoretical time for the spectrum calculation to roughly the time required to solve the RTE for all wavelength points together (the ``pipeline'' is completely filled). This means that there is a ``sweet spot'' for which the ratio of speedup to the number of wavelength clusters is optimal. This ratio can be further optimized by using the optimal number of worker nodes per cluster, thus obtaining an optimal total number of nodes. The optimum will depend on the model atmosphere parameters, the speed of each node itself and the communication speed, as well as the quality of the compilers and libraries.
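A rough, back-of-the-envelope model (our notation, not a result from the measurements) makes this behavior explicit. Let $t_{\rm pre}$, $t_{\rm RTE}$ and $t_{\rm post}$ be the average per-point times for the pre-processing, the RTE solution and the post-processing. Since the RTE solutions must proceed sequentially through the wavelength points while everything else overlaps, the wall-clock time behaves roughly like $T(n) \approx n_{\rm wl}\,\max\left(t_{\rm RTE},\;(t_{\rm pre}+t_{\rm RTE}+t_{\rm post})/n\right)$, i.e., it first drops like $1/n$ and then levels off at $n_{\rm wl}\,t_{\rm RTE}$ once $n$ exceeds roughly $(t_{\rm pre}+t_{\rm RTE}+t_{\rm post})/t_{\rm RTE}$; the ``sweet spot'' lies near this crossover.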

The wavelength parallelization has the drawback that it does not reduce the memory requirement per node compared to runs with a single wavelength cluster. Increasing the number of worker nodes per cluster, however, drastically decreases the memory requirements per node, so that both parallelization methods can be combined to execute large simulations on nodes with limited memory. On a shared-memory machine with distributed physical memory (such as the SGI Origin 2000), this scheme can also be used to minimize memory access latency.

