
Introduction

Much of the astrophysical information that we possess has been obtained via spectroscopy. Spectroscopy allows us to ascertain velocities, temperatures, abundances, sizes, and luminosities of astrophysical objects. Much can be learned from some objects just by examining line ratios and redshifts, but understanding an observed spectrum in detail often requires fits to synthetic spectra. Detailed synthetic spectra also serve to test theoretical models of various astrophysical objects. Since many objects of interest feature relativistic flows (e.g., supernovae, novae, and accretion disks), low electron densities (e.g., supernovae and novae), and/or molecular formation (e.g., cool stars, brown dwarfs, and giant planets), detailed models must include special relativistic effects and a very large number of ions and molecules in the equation of state in order to make quantitative predictions about, e.g., abundances. In addition, deviations from local thermodynamic equilibrium (LTE) must be considered to describe the transfer of radiation and the emergent spectrum correctly.

We have developed the spherically symmetric, special relativistic, non-LTE generalized radiative transfer and stellar atmosphere code PHOENIX, which can handle very large model atoms as well as line blanketing by millions of atomic and molecular lines. The code is designed to be very flexible: it is used to compute model atmospheres and synthetic spectra for, e.g., novae, supernovae, M and brown dwarfs, white dwarfs, and accretion disks in Active Galactic Nuclei (AGN), and it is highly portable. When we include a large number of line transitions, the line profiles must be resolved in the co-moving (Lagrangian) frame. This requires many wavelength points (we typically use 50,000 to 150,000). Since the CPU time scales linearly with the number of wavelength points, the CPU time requirements of such a calculation are large. In addition, each NLTE radiative rate, for both line and continuum transitions, must be calculated and stored at every spatial grid point, which requires large amounts of storage and can cause significant performance degradation if the corresponding routines are not optimally coded.
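To make these scaling arguments concrete, the following back-of-the-envelope sketch in C estimates the CPU time and the NLTE rate storage; all grid sizes and per-point timings in it are illustrative assumptions of ours, not numbers taken from PHOENIX.

    #include <stdio.h>

    /* Illustrative scaling estimate -- the sizes below are assumptions
     * chosen only to show the linear dependences described in the text,
     * not values taken from PHOENIX itself. */
    int main(void)
    {
        const long n_wavelength = 150000;  /* upper end of the quoted range          */
        const long n_layers     = 50;      /* assumed number of radial grid points   */
        const long n_nlte_rates = 100000;  /* assumed number of NLTE transitions     */
        const double sec_per_wl = 0.05;    /* assumed CPU seconds per wavelength point */

        /* CPU time scales linearly with the number of wavelength points. */
        double cpu_hours = n_wavelength * sec_per_wl / 3600.0;

        /* Each NLTE rate is stored at every spatial grid point (8-byte reals). */
        double rate_mem_mb = (double)n_nlte_rates * n_layers * sizeof(double)
                             / (1024.0 * 1024.0);

        printf("estimated CPU time : %.1f hours\n", cpu_hours);
        printf("NLTE rate storage  : %.0f MB\n", rate_mem_mb);
        return 0;
    }

With these assumed numbers the wavelength loop alone costs about two CPU hours and the rate storage reaches roughly 40 MB per model, which is why both the CPU and the memory load are worth distributing over many nodes.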

In order to take advantage of the enormous computing power and vast memory of modern parallel supercomputers, which potentially allow both much faster model construction and more detailed models, we have implemented a parallel version of PHOENIX. Since the code uses a modular design, we have implemented different parallelization strategies for different modules in order to maximize the total parallel speed-up of the code. In addition, our implementation allows us to change the distribution of the load onto different nodes both via input files and dynamically during a model run, which gives a high degree of flexibility for optimizing the performance on a number of different machines and for a number of different model parameters.
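As an illustration of how nodes can be assigned to different modules, the following minimal MPI sketch in C splits the available processes into two worker groups; the group roles, the environment-variable knob, and the default split are assumptions made purely for this example and do not reflect the actual PHOENIX interface.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical illustration: split the available nodes into two worker
     * groups (e.g., radiative transfer vs. NLTE rates).  The split point is
     * read from an environment variable here; PHOENIX itself controls the
     * load distribution through its input files and at run time. */
    int main(int argc, char **argv)
    {
        int rank, size, n_rt;
        MPI_Comm module_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const char *env = getenv("N_RT_NODES");   /* assumed knob, for illustration */
        n_rt = env ? atoi(env) : size / 2;        /* default: half the nodes        */

        /* color 0 = radiative transfer workers, color 1 = NLTE rate workers */
        int color = (rank < n_rt) ? 0 : 1;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &module_comm);

        int grp_rank, grp_size;
        MPI_Comm_rank(module_comm, &grp_rank);
        MPI_Comm_size(module_comm, &grp_size);
        printf("global rank %d: group %d, local rank %d of %d\n",
               rank, color, grp_rank, grp_size);

        MPI_Comm_free(&module_comm);
        MPI_Finalize();
        return 0;
    }

Changing the split at run time then only requires creating new communicators, which keeps the reconfiguration cost low.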

Since we have both large CPU and memory requirements, we initially implemented the parallel version of the code on the IBM SP2 using the MPI message passing library. Since the single-processor speed of the SP2 is high, even a small number of additional processors can lead to a significant speed-up. We have chosen to work with the MPI message passing interface because it is portable (running on dedicated parallel machines as well as on heterogeneous workstation clusters) and available for both distributed and shared memory architectures. For our application, the distributed memory model is in fact easier to code than a shared memory model: we do not have to worry about locks, synchronization, etc. on small scales, and we retain full control over interprocess communication. This is especially clear once one realizes that it is fine to execute the same code on many nodes as long as it is not too CPU intensive and avoids costly communication.
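The following simplified MPI sketch in C (our own example, not PHOENIX code) shows this pattern: every node redundantly executes the cheap setup, the expensive wavelength loop is strided across the nodes, and a single collective call combines the partial results.

    #include <mpi.h>
    #include <stdio.h>

    #define NWL 100000              /* illustrative number of wavelength points */

    /* Placeholders for the cheap setup and the expensive per-wavelength work. */
    static void   cheap_setup(double *grid)     { for (int i = 0; i < NWL; i++) grid[i] = 1.0 + 1e-5 * i; }
    static double expensive_work(double lambda) { return lambda * lambda; }

    int main(int argc, char **argv)
    {
        static double grid[NWL];
        int rank, size;
        double local = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Cheap, communication-free work is simply replicated on every node. */
        cheap_setup(grid);

        /* The expensive wavelength loop is strided across the nodes. */
        for (int i = rank; i < NWL; i += size)
            local += expensive_work(grid[i]);

        /* A single collective call combines the partial results. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("combined result: %g\n", total);

        MPI_Finalize();
        return 0;
    }

Here the only explicit communication is the final MPI_Reduce call; everything else runs independently on each node.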

An alternative to an implementation with MPI is an implementation using High Performance Fortran (HPF) directives (in fact, both can co-exist to improve performance). However, automatic parallelization guided by HPF directives does not yet generate optimal results because the compiler technology is still very new. In addition, HPF compilers are not yet widely available, and they are currently not available for heterogeneous workstation clusters. HPF is also better suited for problems that are purely data-parallel (SIMD problems) and would not benefit much from a MIMD approach. An optimal HPF implementation of PHOENIX would also require a significant number of code changes in order to explicitly instruct the compiler not to generate too many communication requests, which would slow down the code significantly. The MPI implementation requires only the addition of a few explicit communication requests, which can be done with a small number of library calls.


