Performance and scalability


Following the outline given in the previous paragraphs, we have adapted the serial version of our radiative transfer code using the MPI libraries. The changes required to implement the parallel version were relatively small and consisted only of added MPI subroutine calls. The radiative transfer code contains 220 MPI statements, which is very small (about 2%) compared to its total of 10,700 statements.

We test the performance and scalability of the parallel radiative transfer code with a simple single-wavelength radiative transfer model on an IBM SP2. The test model uses 128 radial points, and we have chosen the following test parameters: the radial extension is a factor of 10, the total optical depth is 100, the ratio of absorptive to total opacity is $\epsilon=10^{-4}$, and we require as convergence criterion that changes in the mean intensities are less than $10^{-8}$ between consecutive iterations (which requires 17 iterations starting with $J=B$). For simplicity, a static model is used; the tests of the full PHOENIX code described below will employ test models with velocity fields. In Table 1a we list the results of test runs with different numbers of nodes.
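The convergence criterion can be written as a short check. This is a minimal sketch, not the code's actual routine: it assumes that "changes" means the largest relative change in the mean intensity over all radial points, and the function name `converged` and its arguments are hypothetical.

```python
def converged(j_old, j_new, tol=1e-8):
    """Return True if the mean intensities have converged.

    j_old, j_new: mean intensities J at each radial point from two
    consecutive iterations (assumed strictly positive).
    tol: convergence threshold (the test above uses 1e-8).
    Assumption: the criterion is the maximum relative change.
    """
    max_change = max(abs(new - old) / abs(new)
                     for old, new in zip(j_old, j_new))
    return max_change < tol
```

An iteration loop would call this check after each update of $J$ and stop once it returns true.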

The scaling of the radiative transfer code is acceptable as the number of nodes rises to about 8. More nodes do not decrease the wall-clock time significantly because of the rapidly increasing communication overhead (the $J$ must be broadcast to all nodes and the results must be gathered from all nodes in every radiative transfer iteration) as well as the worsening of the load balance (the latter becomes more acute as the number of nodes increases to $\ge 16$). We have verified that the individual modules of the radiative transfer code (i.e., PPM coefficients, ALO computation, and formal solution) indeed scale with the number of nodes, within the limits of the load balancing described above. In test calculations we found that neither the gathering of the results of the formal solution (using MPI_REDUCE) nor the broadcast of the updated $J$ (using MPI_BCAST) individually requires a significant amount of time. However, the combination of these two communications uses much more time than would be predicted by the sum of the times for the individual operations. This is probably partly due to the required synchronization but also partly due to limitations of the MPI implementation and the communication hardware of the IBM SP2.
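The saturation of the speed-up with increasing node count can be illustrated with a toy cost model. This is a minimal sketch and not a fit to the measurements in Table 1a: it assumes that the time per iteration is a perfectly divisible compute part plus a communication part that grows linearly with the number of nodes. The function names and the parameter values (`t_compute`, `t_comm_per_node`) are hypothetical.

```python
def wall_clock(n_nodes, t_compute=1.0, t_comm_per_node=0.01):
    """Toy model of the wall-clock time per iteration on n_nodes nodes.

    t_compute: single-node compute time (hypothetical, arbitrary units).
    t_comm_per_node: communication cost added per node (hypothetical).
    A single node is assumed to need no communication.
    """
    comm = 0.0 if n_nodes == 1 else t_comm_per_node * n_nodes
    return t_compute / n_nodes + comm


def model_speedup(n_nodes, **params):
    """Speed-up of the toy model relative to its single-node time."""
    return wall_clock(1, **params) / wall_clock(n_nodes, **params)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:3d} nodes: model speed-up {model_speedup(n):5.2f}")
```

In this model the time is minimized near $\sqrt{t_{\rm compute}/t_{\rm comm}}$ nodes (10 for the illustrative values), beyond which adding nodes makes the run slower. A worsening load balance, which the model ignores, would lower the curve further.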

Table 1a also shows, for comparison, the results obtained on the GC/PP for the same test case. A single CPU of the GC/PP is about a factor of 4.5 slower than a single CPU of the IBM SP2, which is consistent with their LINPACK results. The scaling of the GC/PP with more nodes is about the same as for the IBM SP2, indicating that the slower communication speed of the GC/PP does not produce performance problems if the number of CPUs is small. In Table 1a we also include the results for the HP J200. The speed-up for 2 CPUs is only 30%, which is due to the slower communication caused by a non-optimized MPI library.
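To put the last figure in context, the sketch below uses the standard definitions of speed-up, $S = T_1/T_n$, and parallel efficiency, $E = S/n$. It assumes that a "30% speed-up" means $S = 1.3$; the function names and the example times are illustrative and are not values from Table 1a.

```python
def speedup(t_single, t_parallel):
    """Speed-up S = T_1 / T_n from wall-clock times."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, n_cpus):
    """Parallel efficiency E = S / n (1.0 means ideal scaling)."""
    return speedup(t_single, t_parallel) / n_cpus


if __name__ == "__main__":
    # Illustrative times in arbitrary units, chosen so that S = 1.3 on 2 CPUs.
    t1, t2 = 1.3, 1.0
    print(f"speed-up = {speedup(t1, t2):.2f}, "
          f"efficiency = {efficiency(t1, t2, 2):.0%}")
```

Under that reading, the second CPU is used at about 65% efficiency, compared with 100% for ideal scaling.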

The radiative transfer test problem with 128 radial grid-points was too large to fit into the memory of a single transputer node of the GCel without major changes in the code (which would be very time consuming because of the lack of a Fortran 90 compiler for both Parsytec systems). Therefore, we have also run a smaller test case with 50 radial grid-points. The results are listed in Table 1b. The performance ratio of the results on a single CPU is, as in the previous case, comparable to the ratio of the LINPACK results: the code runs about a factor of 31 faster on a single PPC 601 CPU than on the T805 transputer. However, there are now major differences in the scalability between the two systems. Whereas the scaling of the GCel results is slightly better than that obtained for the IBM SP2 with the large test case, the wall-clock times do not scale well for either the IBM SP2 or the GC/PP, in contrast to their behavior in the large test case. The reason is the relatively slower communication (compared to raw processor speed) of the IBM SP2 and the GC/PP: in the large test case the floating point operations dominate over the communication, which is no longer true for the small test case. For the GCel, with lower floating point performance but faster communication than the other two machines, the scaling to more processors is much better and comparable to that of the large test case on the IBM SP2 and GC/PP. This demonstrates that flexibility of the load distribution is very important in order to obtain good performance on a number of different machines.

Peter H. Hauschildt
4/27/1999