
Parallelization

We have implemented a hierarchical parallelization scheme on distributed memory machines using the MPI framework, similar to the scheme discussed in Baron & Hauschildt (1998). The most efficient parallelization opportunities in the problem are the solid angle and wavelength sub-spaces. The total number of solid angles in the test models is at least 4096, and the number of wavelength points is 32 in the tests presented here but will be much larger in large scale applications. Thus, even in the simple tests presented here, the calculations could theoretically be run on 131072 processors. The work required for each solid angle is roughly constant (the number of points that need to be calculated depends on the angle point), and during the formal solution the solid angles are independent of each other. The mean intensities (and other solid angle integrated quantities) are only needed after the formal solution for all solid angles is complete, so a single collective MPI operation suffices to finish the computation of the mean intensities at each wavelength. Similarly, the wavelength integrated mean intensities ${\bar J}$ are needed only after the formal solutions are completed for all wavelengths (and solid angles). Therefore, the different wavelength points can also be computed in parallel, with the only communication occurring as collective MPI operations after all wavelength points have been computed.

We thus divide the total number of processes into a number of `wavelength clusters' (each working on a different set of wavelength points), each of which has a number of `worker' processes that work on a different set of solid angle points for any wavelength. In the simplest case, each wavelength cluster has the same number of worker processes, so that $N_{\rm tot} = N_{\rm cluster}\times N_{\rm worker}$, where $N_{\rm tot}$ is the total number of MPI processes, $N_{\rm cluster}$ is the number of wavelength clusters and $N_{\rm worker}$ is the number of worker processes per wavelength cluster. A minimal sketch of how such a decomposition can be expressed with MPI communicators is given at the end of this section.

For our tests we could use a maximum of 128 CPUs on the HLRN IBM Regatta (Power4 CPUs) system. In Table 1 we show the results for the 3 combinations that we could run (due to computer time limitations) for an $\epsilon_l=10^{-4}$ line transfer test case with 32 wavelength points, $n_x=n_y=n_z=2*64+1$ spatial points and $n_\theta=n_\phi=64$ solid angle points. For example, the third row of the table is for a configuration with 4 wavelength clusters, each of them using 32 CPUs working in parallel on different solid angles, for a total of 128 CPUs. The 3rd and 4th columns give the time (in seconds) for a full formal solution, the construction of the $\Lambda^*$ operator and an OS step (the first iteration), and for a formal solution and an OS step (the second iteration), respectively. As the $\Lambda^*$ has to be constructed only in the first iteration, the overall time per iteration drops in subsequent iterations. As in the 1D case, the construction of the $\Lambda^*$ is roughly as expensive as one formal solution. We have verified that all parallel configurations lead to identical results. The table shows that configurations with more clusters are slightly more efficient, mostly due to better load balancing. However, the differences are not significant in practical applications, so the exact choice of the setup is not important. This also means that the code can easily scale to much larger numbers of processors, since realistic applications will require many more than the 32 wavelength points used in the test calculations. Note that the MPI parallelization can be combined with shared memory parallelization (e.g., using OpenMP) to make more efficient use of modern multi-core processors with shared caches. Although this is implemented in the current version of the 3D code, we do not have access to a machine with such an architecture, and using OpenMP across multiple single-core processors was not efficient.
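The following sketch illustrates, under stated assumptions, how the wavelength-cluster/worker decomposition and the two collective operations described above could be set up with standard MPI communicators. It is not the actual 3D code: the constants (N_CLUSTER, N_SPATIAL), buffer names and the omitted formal solution are illustrative placeholders, and only a plain sum is used where the real code applies angle and wavelength quadrature weights.

/* Two-level MPI decomposition (illustrative sketch, not the production code):
 * N_tot processes are split into N_CLUSTER wavelength clusters, each with
 * N_tot/N_CLUSTER worker processes handling a subset of the solid angles.   */
#include <mpi.h>
#include <stdlib.h>

#define N_CLUSTER 4         /* number of wavelength clusters (assumption)    */
#define N_SPATIAL 1000      /* size of the spatial grid (placeholder)        */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* N_tot = N_cluster * N_worker; assumes size is divisible by N_CLUSTER. */
    int n_worker = size / N_CLUSTER;
    int cluster  = rank / n_worker;     /* which wavelength set this rank owns */
    int worker   = rank % n_worker;     /* which solid-angle set it owns       */

    /* Intra-cluster communicator: all workers of one wavelength cluster.     */
    MPI_Comm cluster_comm;
    MPI_Comm_split(MPI_COMM_WORLD, cluster, worker, &cluster_comm);

    /* Inter-cluster communicator: same worker index across all clusters.     */
    MPI_Comm lambda_comm;
    MPI_Comm_split(MPI_COMM_WORLD, worker, cluster, &lambda_comm);

    double *J_local = calloc(N_SPATIAL, sizeof(double)); /* this worker's angle sums */
    double *J       = calloc(N_SPATIAL, sizeof(double)); /* mean intensity per wavelength */
    double *Jbar    = calloc(N_SPATIAL, sizeof(double)); /* wavelength-integrated J-bar   */

    /* ... each worker performs the formal solution for its subset of solid
       angles at the wavelengths owned by its cluster, accumulating J_local ... */

    /* One collective per wavelength completes the mean intensities after all
       solid angles are done.                                                  */
    MPI_Allreduce(J_local, J, N_SPATIAL, MPI_DOUBLE, MPI_SUM, cluster_comm);

    /* A second collective combines the wavelength clusters into J-bar once
       all wavelength points have been computed.                               */
    MPI_Allreduce(J, Jbar, N_SPATIAL, MPI_DOUBLE, MPI_SUM, lambda_comm);

    free(J_local); free(J); free(Jbar);
    MPI_Comm_free(&cluster_comm);
    MPI_Comm_free(&lambda_comm);
    MPI_Finalize();
    return 0;
}

Because the only communication consists of these two reductions, per-iteration communication cost is independent of the number of solid angles per worker, which is consistent with the observed insensitivity of the timings to the exact cluster/worker split.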

