In Table 4 we give the wall-clock times for one iteration on the SP2. We used a ``small'' code configuration with blocksizes appropriate for machines with about 128\,MB RAM per node (although the test machine had up to 300\,MB of paging space per node), and we used very large search windows for the atomic, molecular, and NLTE lines in order to obtain a ``worst case'' scenario. Table 4 shows that the calculation is dominated by the LTE atomic and molecular line opacities, whereas the NLTE opacities and rates contribute only at second order to the total time per iteration. The scaling of the calculation is therefore very good up to the largest configuration that we have tested. We could not run the test model on a single SP2 CPU due to both wall-clock time and memory restrictions, which demonstrates the importance of parallelization for practical applications.
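The scaling statement above can be made quantitative with the usual definitions of speedup and parallel efficiency. The following sketch uses made-up wall-clock times, not values from Table 4; the function names are our own:

```python
# Illustrative only: speedup and parallel efficiency relative to a
# reference run on a small node count (numbers are hypothetical,
# not taken from Table 4).

def speedup(t_ref, t_n):
    """Speedup of a run with wall-clock time t_n over a reference run t_ref."""
    return t_ref / t_n

def efficiency(t_ref, n_ref, t_n, n):
    """Parallel efficiency on n nodes, relative to the n_ref-node reference."""
    return speedup(t_ref, t_n) * n_ref / n

# Hypothetical example: 5 nodes take 1000 s per iteration, 20 nodes take 270 s.
s = speedup(1000.0, 270.0)            # ~3.7x over the 5-node run
e = efficiency(1000.0, 5, 270.0, 20)  # ~0.93, i.e. ~93% efficiency
```

An efficiency close to unity on the largest configuration is what "very good scaling" means in practice; values well below 1 would indicate serial bottlenecks or load imbalance.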
The wall-clock time can be reduced further by, e.g., using larger blocksizes and a specially tuned load distribution. The last SP2 entry in Table 4 shows that an alternative load distribution can easily improve the overall speed, even though some of the sub-tasks then require more wall-clock time.
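As a purely hypothetical illustration of what a tuned static load distribution might look like (this is not the scheme actually used in the code), one can weight each work block by an estimated cost and assign blocks greedily to the currently least-loaded node:

```python
import heapq

def distribute(block_costs, n_nodes):
    """Assign blocks to nodes with a greedy longest-processing-time rule:
    sort blocks by decreasing estimated cost, then always give the next
    block to the node with the smallest accumulated load.
    Returns a list of block-index lists, one per node."""
    heap = [(0.0, node) for node in range(n_nodes)]  # (load, node) pairs
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_nodes)]
    for block, cost in sorted(enumerate(block_costs), key=lambda bc: -bc[1]):
        load, node = heapq.heappop(heap)
        assignment[node].append(block)
        heapq.heappush(heap, (load + cost, node))
    return assignment
```

Such a distribution can lengthen individual sub-tasks on some nodes while still shortening the iteration as a whole, which is exactly the trade-off visible in the last SP2 entry of Table 4.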
We also include the timing results of the test run obtained on a single processor of a Cray C90 (CPU times). The Table shows that the C90 is about as fast as 5 nodes of the SP2, which is roughly the relative performance ratio of a single SP2 node to a single C90 processor. The wall-clock time on the C90 was much worse than on the SP2, due to the time-sharing operation of the C90 CPUs.