IBM SP system

We ran the same (small) test on an IBM SP for comparison. The tests were run on a non-dedicated production system, and thus the timings are representative of standard operating conditions rather than optimum values. The global files were stored on IBM's General Parallel File System (GPFS), which is installed on a number of system-dedicated I/O nodes and replaces the NFS fileserver used on the PPro/Solaris system. GPFS access is facilitated through the same ``switch'' architecture that also carries MPI messages on the IBM SP. The results for the line selection code are shown in Fig. 3. For the small test the results are markedly different from those for the PPro/Solaris system. The LTF algorithm performs significantly better for all tested configurations; however, its scaling is very limited. The GTF code does not scale well at all for this small test on the IBM SP. This is a consequence of the small problem size: the processing is so fast (nearly 100 times faster than on the PPro/Solaris system) that overheads such as latencies, together with the actual line selection calculations, dominate the timing. The IBM SP has a very fast switched communications network that can easily handle the higher message traffic created by the LTF code. This explains why the LTF line selection executes faster and scales better for this small test on the IBM SP than on the PPro/Solaris system.
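To make the communication pattern behind this comparison concrete, the following minimal sketch shows one way an LTF-style distribution could be organized (in C with MPI): a single reader PE ships blocks of the master line list to the other PEs, which store their share in node-local temporary files. The routine names, block size, scratch-file path, round-robin assignment, and reader/worker split are illustrative assumptions and are not taken from the actual code.

/* Hypothetical LTF-style distribution sketch (assumes at least two PEs):
 * one reader PE sends line-list blocks over the switch; every other PE
 * appends its blocks to a node-local temporary file. */
#include <mpi.h>
#include <stdio.h>

#define LINES_PER_BLOCK 1024    /* illustrative block size */

void ltf_distribute(FILE *master_list, MPI_Comm comm)
{
    int rank, npes;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &npes);

    double block[LINES_PER_BLOCK];

    if (rank == 0) {                             /* reader PE */
        size_t n;
        int dest = 1;
        while ((n = fread(block, sizeof(double), LINES_PER_BLOCK,
                          master_list)) > 0) {
            /* every block becomes one MPI message -- the extra traffic
             * that the fast SP switch absorbs easily */
            MPI_Send(block, (int)n, MPI_DOUBLE, dest, 0, comm);
            dest = dest % (npes - 1) + 1;        /* round-robin over workers */
        }
        for (int p = 1; p < npes; p++)           /* zero-length block = done */
            MPI_Send(block, 0, MPI_DOUBLE, p, 0, comm);
    } else {                                     /* worker PE */
        FILE *ltf = fopen("/tmp/lines.ltf", "wb"); /* node-local scratch disk */
        MPI_Status st;
        int n;
        do {
            MPI_Recv(block, LINES_PER_BLOCK, MPI_DOUBLE, 0, 0, comm, &st);
            MPI_Get_count(&st, MPI_DOUBLE, &n);
            if (n > 0)
                fwrite(block, sizeof(double), (size_t)n, ltf);
        } while (n > 0);
        fclose(ltf);
    }
}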

The line opacity part of the test, shown in Fig. 4, behaves distinctly differently on the IBM SP than on the PPro/Solaris system (cf. Fig. 2). In contrast to the latter, the IBM SP delivers better performance for the GTF algorithm than for the LTF code. The scaling of the GTF code is also significantly better than that of the LTF approach. This surprising result is a consequence of the high I/O bandwidth of the GPFS running on many I/O nodes: the I/O bandwidth available to GPFS is significantly higher than that of the local disks (including all filesystem overheads, etc.). The I/O nodes of the GPFS can also cache blocks in their memory, which can eliminate physical I/O to disk and replace it with data exchange over the IBM ``switch''. Note that the test was designed and run with parameters set to maximize actual I/O operations in order to explicitly test this property of the algorithms.
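A GTF-style temporary file, by contrast, can be pictured as every PE writing its slice of the scratch data into one shared file on the global filesystem at a rank-dependent offset, so the load falls on GPFS and its I/O-node caches rather than on MPI messages or local disks. The sketch below uses MPI-IO purely for illustration; the file name, record layout, and the choice of MPI-IO are assumptions, not details of the actual implementation.

/* Hypothetical GTF-style scratch write: each PE writes a contiguous slice
 * of the global temporary file; the GPFS I/O nodes can cache these blocks
 * and serve them over the switch instead of touching the disks. */
#include <mpi.h>

void gtf_write_block(const double *block, int nvals, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_File fh;
    MPI_File_open(comm, "gtf_scratch.tmp",          /* file on GPFS */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* rank-dependent offset: no MPI messages needed for the data itself */
    MPI_Offset offset = (MPI_Offset)rank * nvals * sizeof(double);
    MPI_File_write_at(fh, offset, block, nvals, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}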

The results of the large test case, for which the input file is about 16 times bigger, are very different for the line selection, cf. Fig. 5. Now the GTF algorithm executes much faster (by a factor of 3) than the LTF code. This is probably caused by the higher I/O performance of the GPFS, which can easily deliver the data to all nodes, and by the smaller number of messages that need to be exchanged in the GTF algorithm. The drop in performance at 32 PEs in the GTF line selection run could have been caused by a temporary overload of the I/O subsystem (these tests were run on a non-dedicated machine). In contrast to the previous tests, the LTF approach does not scale in this case. This is likely caused by the large number of relatively small messages that are exchanged by the PEs (the line list master database is the same as for the GTF approach, so it is read through the GPFS as in the GTF case). This could be improved, e.g., by choosing larger block sizes for data sent via MPI messages; however, this has the drawback of higher memory usage, and larger messages are more likely to block than small messages, which can be buffered within the communication hardware (or driver) itself.
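The block-size remedy mentioned above amounts to packing many small line records into one buffer and flushing it as a single, larger MPI message. A minimal sketch of such a scheme is given below; the buffer size, record layout, message tag, and destination are illustrative assumptions only.

/* Hypothetical message-aggregation sketch: small records are queued in a
 * larger send buffer and shipped in one MPI message when the buffer fills. */
#include <mpi.h>

#define BLOCK_RECORDS 4096       /* larger block -> fewer, bigger messages */

typedef struct { double wavelength, gf, chi; } LineRecord;

static LineRecord sendbuf[BLOCK_RECORDS];
static int nbuf = 0;

/* Flush the accumulated records in one send.  A large message exceeds the
 * eager limit and may block until the receiver posts a matching receive;
 * this is the memory/blocking trade-off discussed in the text. */
void flush_block(int dest, MPI_Comm comm)
{
    if (nbuf == 0) return;
    MPI_Send(sendbuf, nbuf * (int)sizeof(LineRecord), MPI_BYTE,
             dest, /* tag = */ 42, comm);
    nbuf = 0;
}

/* Queue one record; a message is sent only when the block is full. */
void queue_record(const LineRecord *rec, int dest, MPI_Comm comm)
{
    sendbuf[nbuf++] = *rec;
    if (nbuf == BLOCK_RECORDS)
        flush_block(dest, comm);
}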

The situation for the line opacities is shown in Fig. 6. The scaling is somewhat worse due to increased physical I/O (the temporary files are about 16 times larger than in the small test case). This is more problematic for the LTF approach, which scales very poorly for larger numbers of PEs because the maximum local I/O bandwidth is reached far earlier than for the GTF approach. This is rather surprising, as conventional wisdom would favor local disk I/O over global filesystem I/O on parallel machines. Although this is certainly true for farms of workstations or PCs, it is evidently not true on high-performance parallel computers with parallel I/O subsystems.

