
## Strategy and implementation

We solve the special relativistic radiative transfer equation (RTE) numerically at every wavelength point with an iterative scheme based on the operator splitting approach. The RTE is written in its characteristic form, and the formal solution along the characteristics is computed with a piecewise parabolic (PPM) integration. We use the exact band-matrix subset of the discretized Λ-operator as the 'approximate Λ-operator' (ALO) in the operator splitting iteration scheme. This gives very good convergence and a high speed-up compared to diagonal ALOs.
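For orientation, one step of an operator splitting iteration can be written in its standard textbook form. The source function $S = (1-\epsilon)J + \epsilon B$, with thermalization parameter $\epsilon$ and Planck function $B$, is an assumption made here for illustration and is not spelled out in this section; $\Lambda^{*}$ denotes the ALO and $J_{\rm fs}$ the mean intensity obtained from a formal solution.

```latex
J_{\rm fs} = \Lambda\,[S_{\rm old}], \qquad
\left[\,1 - \Lambda^{*}(1-\epsilon)\,\right] J_{\rm new}
  = J_{\rm fs} - \Lambda^{*}(1-\epsilon)\, J_{\rm old}
```

This follows from $J_{\rm new} = \Lambda^{*}[S_{\rm new}] + (\Lambda - \Lambda^{*})[S_{\rm old}]$ after inserting the assumed form of $S$. With a band-matrix $\Lambda^{*}$, the left-hand side is a banded linear system for $J_{\rm new}$.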

The serial radiative transfer code has been optimized for superscalar and vector computers and is numerically very efficient. It is therefore crucial to optimize the ratio of communication to computation in the parallel implementation of the radiative transfer method. In terms of CPU time, the most costly parts of the radiative transfer code are the setup of the PPM interpolation coefficients and the formal solutions (which have to be performed in every iteration). The construction of a tri-diagonal ALO requires about the same CPU time as a single formal solution of the RTE and is thus not a very important contributor to the total CPU time required to solve the RTE at every given wavelength point.

In principle, the computation of the PPM coefficients does not require any communication and thus could be distributed arbitrarily between the nodes. However, the formal solution is recursive along each characteristic. Within the formal solution, communication is only required during the computation of the mean intensities J, as they involve integrals over the angle at every radial point. Thus, a straightforward and efficient way to parallelize the radiative transfer code is to distribute sets of characteristics onto different nodes. This will minimize the communication during the iterations, and thus optimize the performance. Within one iteration step, the current values of the mean intensities need to be broadcast to all radiative transfer nodes and the new contributions of every radiative transfer node to the mean intensities at every radius must be sent to the master node. The master radiative transfer node then computes and broadcasts an updated J vector using the operator splitting scheme, the next iteration begins, and the process continues until the solution is converged to the required accuracy. The setup, i.e., the computation of the PPM interpolation coefficients and the construction of the ALO, can be parallelized using the same method and node distribution. The communication overhead for the setup is roughly equal to the communication required for a single iteration.
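The communication pattern described above can be illustrated with a small serial simulation. This is a minimal sketch, not the authors' code: the toy Λ-operator, the source function `S = (1 - eps) * J + eps * B`, the round-robin distribution, and all function names are hypothetical, and a diagonal ALO stands in for the band-matrix ALO to keep the example short. In a message-passing code the marked steps would typically map onto collective operations such as `MPI_Bcast` and `MPI_Reduce`; that mapping is also an assumption.

```python
def build_pieces(n_r, n_char):
    """Toy per-characteristic pieces lam[k][i][j]; their sum plays the role of Lambda."""
    pieces = []
    for k in range(n_char):
        a = 0.3 + 0.4 * k / n_char             # hypothetical coupling range
        norm = (1.0 - a) / (1.0 + a) / n_char  # keeps the row sums of Lambda below 1
        pieces.append([[norm * a ** abs(i - j) for j in range(n_r)]
                       for i in range(n_r)])
    return pieces


def worker_contribution(my_pieces, source):
    """One node's share of J_fs = Lambda[S], summed over its own characteristics."""
    n_r = len(source)
    part = [0.0] * n_r
    for lam_k in my_pieces:
        for i in range(n_r):
            part[i] += sum(lam_k[i][j] * source[j] for j in range(n_r))
    return part


def solve(n_nodes, n_r=20, n_char=8, eps=0.1, tol=1e-10, max_iter=1000):
    pieces = build_pieces(n_r, n_char)
    # Round-robin distribution of characteristics (load balancing is ignored here).
    owned = [pieces[rank::n_nodes] for rank in range(n_nodes)]
    # Setup: the diagonal of Lambda, used as a stand-in ALO.
    alo = [sum(p[i][i] for p in pieces) for i in range(n_r)]
    planck = [1.0] * n_r
    j_mean = [0.0] * n_r
    for iteration in range(1, max_iter + 1):
        # "Broadcast": every node sees the current J and forms the source function.
        source = [(1.0 - eps) * j + eps * b for j, b in zip(j_mean, planck)]
        # Each node computes its contribution; the master sums them ("reduce").
        parts = [worker_contribution(mine, source) for mine in owned]
        j_fs = [sum(col) for col in zip(*parts)]
        # Master: operator splitting update with the diagonal ALO.
        j_new = [(fs - d * (1.0 - eps) * j) / (1.0 - d * (1.0 - eps))
                 for fs, d, j in zip(j_fs, alo, j_mean)]
        change = max(abs(n - o) / n for n, o in zip(j_new, j_mean))
        j_mean = j_new
        if change < tol:
            break
    return j_mean, iteration


j_mean, n_iter = solve(n_nodes=3)
print(n_iter, "iterations; J at the first and middle grid point:", j_mean[0], j_mean[10])
```

Per iteration, only two vectors of length `n_r` cross node boundaries (the current J and each node's partial sum), which is why distributing whole characteristics keeps the communication small relative to the formal solution work.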

An important point to consider is the load balancing between the radiative transfer nodes. The workload to compute the formal solution along each characteristic is proportional to the number of intersection points of the characteristic with the concentric spherical shells of the radial grid (the number of 'points' along each characteristic). Therefore, the optimum solution would be to let each radiative transfer node work on an equal share of the points, that is, the total number of points divided by the number of radiative transfer nodes. This optimum cannot, in general, be reached exactly because it would require splitting characteristics between nodes (which involves both communication and synchronization). A simple load distribution that instead gives every node the same number of characteristics (the total number of characteristics divided by the number of nodes) is far from optimal because the characteristics do not have the same number of intersection points (consider tangential characteristics with different impact parameters). We therefore chose a compromise: the characteristics are distributed to the radiative transfer nodes so that the total number of points calculated by each node is roughly the same and every node works on a different set of characteristics.
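The difference between the two distributions can be illustrated with a small example. This is a sketch under stated assumptions: the point count per tangential characteristic (`points_on_tangent_ray`) is a simplified hypothetical model, and the greedy largest-first assignment is one common heuristic chosen here for illustration; the text does not state which algorithm was actually used.

```python
import heapq


def points_on_tangent_ray(k, n_shells):
    """Hypothetical model: a ray tangent to shell k (0 = innermost) crosses every
    shell outside it twice and touches shell k once."""
    return 2 * (n_shells - 1 - k) + 1


def balance(points, n_nodes):
    """Greedy heuristic: give the next-largest characteristic to the least-loaded node."""
    heap = [(0, rank) for rank in range(n_nodes)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_nodes)]
    for idx in sorted(range(len(points)), key=lambda i: -points[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)  # whole characteristics only, never split
        heapq.heappush(heap, (load + points[idx], rank))
    loads = [sum(points[i] for i in chars) for chars in assignment]
    return assignment, loads


def equal_count_split(points, n_nodes):
    """Naive split: contiguous blocks with (nearly) equal numbers of characteristics."""
    n = len(points)
    bounds = [n * r // n_nodes for r in range(n_nodes + 1)]
    return [sum(points[bounds[r]:bounds[r + 1]]) for r in range(n_nodes)]


n_shells, n_nodes = 50, 4
points = [points_on_tangent_ray(k, n_shells) for k in range(n_shells)]
assignment, balanced = balance(points, n_nodes)
naive = equal_count_split(points, n_nodes)
print("ideal share per node:", sum(points) / n_nodes)
print("balanced loads:      ", balanced)
print("equal-count loads:   ", naive)
```

Because each characteristic stays on a single node, the balanced loads only approximate the ideal share, but they are much closer to it than the equal-count split, whose first block contains all of the longest characteristics.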

Peter H. Hauschildt
4/27/1999