Parallel Supercomputing In Stellar Atmosphere Simulations

Peter H. Hauschildt
Department of Physics & Astronomy
and
Center for Simulational Physics
The University of Georgia
Athens, GA 30602

in collaboration with
E. Baron (Univ. of OK)
F. Allard (WSU & CNRS/Lyon)
D. Lowenthal (UGA)

Overview

1. Motivation
2. The Computational Problem
3. Solution through parallelization
4. Parallel algorithms
5. Conclusions & The Future

Motivation
• astrophysical information through spectroscopy:
  • velocities
  • temperatures
  • densities
  • element abundances
  • luminosities
• detailed modeling required to extract physical data from observed spectra:
  • follow radiative transfer of photons of all wavelengths through the atmosphere in great detail
  • synthesize spectrum to compare to observations

• basic physical model:
  • spherical shell
  • static (stars) or expanding (novae, winds, SNe)
  • hydrostatic or hydrodynamical equilibrium
  • central source provides energy
• Constraint equations (standard examples follow this list):
  • energy conservation → temperature structure
  • momentum conservation → pressure & velocity structure
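
• as a concrete illustration (standard textbook forms, not necessarily the exact constraints used here): for a static, radiative atmosphere, energy conservation reduces to radiative equilibrium and momentum conservation to hydrostatic equilibrium,

\int_0^\infty \chi_\lambda \left( J_\lambda - S_\lambda \right) \, d\lambda = 0,
\qquad
\frac{dP}{dr} = -\rho \, \frac{G M_r}{r^2}

the first condition fixes the temperature structure, the second the pressure structure.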

• "Auxiliary" equations:
  • radiative transfer equation: special relativistic form (treated in detail below)
  • equation of state relation:
    • high temperatures (hot stars, Supernovae, novae): need to include many ions
    • low temperatures (Brown dwarfs, Jupiter-like planets): need to include 100's of molecules & dust species

• statistical equilibrium equations:
  relate the population of each energy level to the non-local radiation field and to collisional processes
• radiative transfer & statistical equilibrium equations are intricately coupled
  → must be solved together
• the statistical equations are inherently non-local

• lists of spectral lines for all species:
  account for blocking of photons & synthesize spectral features → spectral analysis

Assumptions:
• spherical symmetry,
• time independence (∂/∂t = 0),
• full special relativistic treatment in the Lagrangian frame.

Spherically symmetric, special relativistic equation of radiative transfer

• partial integro-differential equation,
• telegrapher's equation: boundary value problem in r and initial value problem in λ (certain restrictions apply)

\gamma(\mu+\beta)\,\frac{\partial I_\lambda}{\partial r}
+ \frac{\partial}{\partial\mu}\left[a_\mu\, I_\lambda\right]
+ \frac{\partial}{\partial\lambda}\left[a_\lambda\, \lambda\, I_\lambda\right]
+ a_I\, I_\lambda
= \eta_\lambda - \chi_\lambda I_\lambda

with

a_\mu = \gamma(1-\mu^2)\left[\frac{1+\beta\mu}{r}
      - \gamma^2(\mu+\beta)\frac{\partial\beta}{\partial r}\right],
\qquad
a_\lambda = \gamma\left[\frac{\beta(1-\mu^2)}{r}
      + \gamma^2\mu(\mu+\beta)\frac{\partial\beta}{\partial r}\right]

and

a_I = \gamma\left[\frac{2\mu+\beta(3-\mu^2)}{r}
      + \gamma^2\left(1+\mu^2+2\beta\mu\right)\frac{\partial\beta}{\partial r}\right]

• I_λ: specific intensity scaled by r²,
• μ: cosine of the direction angle,
• v: velocity, β = v/c, γ = 1/√(1−β²),
• χ_λ: extinction coefficient,
• η_λ: emissivity.

Example for the emissivity η_λ:

\eta_\lambda = \kappa_\lambda\, B_\lambda(T) + \sigma_\lambda\, J_\lambda + \eta_\lambda^{\rm lines}

with
• κ_λ B_λ(T): thermal emission
• σ_λ J_λ: electron scattering
• η_λ^lines: spectral line emissivity

Numerical solution:
• Basic idea: discretize and treat the boundary value problem for each wavelength individually,
• Operator splitting (OS) method
• solve along characteristics of the RTE
• iterative method:
• piecewise parabolic ansatz to calculate I for given J
• iterate to self-consistent solution for J
• eigenvalues of the iteration matrix are close to unity
→ use operator splitting to reduce the eigenvalues of the amplification matrix (a sketch follows this list)
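
• a minimal sketch of the operator-splitting step for the mean intensity J (standard Λ* notation; the two-level source function S = (1−ε)J + εB is an illustrative assumption, not taken from these slides):

\Lambda = \Lambda^{*} + (\Lambda - \Lambda^{*}), \qquad
J_{\rm new} = \Lambda^{*}[S_{\rm new}] + (\Lambda - \Lambda^{*})[S_{\rm old}]

which for S = (1-\epsilon)J + \epsilon B gives

\left[ 1 - (1-\epsilon)\Lambda^{*} \right] J_{\rm new}
  = J_{\rm FS} - (1-\epsilon)\Lambda^{*} J_{\rm old},
\qquad J_{\rm FS} = \Lambda[S_{\rm old}]

a good choice of Λ* makes the eigenvalues of the amplification matrix of this scheme much smaller than those of the classical Λ-iteration (Λ* = 0).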

Statistical Equilibrium Equations

Line and continuum scattering prevents the use of the Λ-iteration for the solution of the rate equations!

Solution of the statistical equilibrium equations:
Operator Splitting method:
• define a "rate operator" [R_ij] in analogy to the Λ-operator:

R_{ij} = \left[R_{ij}\right]\,[n]

• define an "approximate rate operator" [R*_ij] and write the iteration scheme in the form:

R_{ij} = \left[R^{*}_{ij}\right]\,[n_{\rm new}]
       + \left(\left[R_{ij}\right] - \left[R^{*}_{ij}\right]\right)\,[n_{\rm old}]

where [n_new] are the updated level populations and [n_old] those of the previous iteration.

The Computational Problem

1. input data size
• number of atomic/ionic spectral lines: list size of order GB
• number of diatomic molecular lines: list size of order GB
• number of hot water vapor lines
2. line list growth & data handling
• line-list data volume has grown steadily (before 1994 → 1994 → 1997 → expected 1998) and is now measured in GB
• all lines need to be accessed in a line selection procedure
• line selection creates sub-lists that can be as large as the original list
• poses a significant data handling problem!

3. memory/IO requirements
• line lists too large for memory → scratch files & block algorithm (a sketch follows this list)
• individual energy levels: storage of order MB
• individual transitions: storage of order MB
• EOS data storage: of order MB
• auxiliary storage: of order MB
• total memory requirement: of order MB
• the number of individual energy levels and transitions will increase dramatically → memory requirements of order GB
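
• a minimal sketch of the block algorithm for a line list that does not fit into memory (Fortran 90; the file name, record layout, and routine names are illustrative, not PHOENIX source):

! hypothetical sketch: process a huge line list in fixed-size blocks
program blocked_line_list
  implicit none
  integer, parameter :: block_size = 100000       ! lines held in memory at once
  double precision :: wl(block_size), gf(block_size)
  integer :: nread, ios

  open(10, file='lines.dat', form='unformatted', status='old', &
       action='read', iostat=ios)
  if (ios /= 0) stop 'cannot open line list'

  do
     call read_block(10, wl, gf, block_size, nread)
     if (nread == 0) exit
     ! ... add the opacity contribution of wl(1:nread), gf(1:nread) here ...
  end do
  close(10)

contains

  subroutine read_block(unit, wl, gf, nmax, n)
    integer, intent(in) :: unit, nmax
    double precision, intent(out) :: wl(nmax), gf(nmax)
    integer, intent(out) :: n
    integer :: ios
    n = 0
    do while (n < nmax)
       read(unit, iostat=ios) wl(n+1), gf(n+1)    ! one line record per read
       if (ios /= 0) exit
       n = n + 1
    end do
  end subroutine read_block

end program blocked_line_list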

4. (serial) CPU time
• small for each individual point on the wavelength grid: of order msec
• number of wavelength points for radiative transfer: 30,000-300,000 (can be > 10^6)
• up to 30,000 sec to "sweep" once through all wavelength points
• typically, several iterations (sweeps) are required to obtain an equilibrium model
→ total serial CPU time of order days (a rough estimate follows this list)
• there are, literally, 100's of models in a typical grid ...
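
• a rough consistency check with assumed numbers (the per-point cost and the number of sweeps are not preserved in this text; 100 msec per point and 10 sweeps are illustrative assumptions):

3\times 10^{5}\ \text{points} \times 0.1\,\mathrm{s} \approx 3\times 10^{4}\,\mathrm{s}\ \text{per sweep},
\qquad
10\ \text{sweeps} \times 3\times 10^{4}\,\mathrm{s} \approx 3\times 10^{5}\,\mathrm{s} \approx 3.5\ \text{CPU days per model}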

Solution through parallelization
1. portability issues
• large number of simulations
• complex code (verification on several different architectures)
• need to be able to run efficiently on different parallel supercomputers
• Fortran 90 & MPI
• available on all major platforms
• public domain implementation MPICH

2. memory issues
• MPI available on distributed memory systems
• large aggregate memory of distributed-memory machines
• allows reduction of memory requirements per node
→ larger model calculations possible!

3. IO issues
• parallel filesystems allow file partitioning
• results in higher IO throughput
• parallel code: each node needs to read only part of the data
→ IO reduction on a per-node basis
• but sometimes better scalability means more IO

4. scalability issues
• allows more efficient usage of multiple CPUs
• reduces wall-clock time for typical simulations
• depends very often on the type of model:
some simulations (stars) allow algorithms that scale very well, but others (novae, SNe) do not
• implement several algorithms that can be selected at run-time to obtain the "best" overall performance while making simulations feasible!

Parallel algorithms
1. The PHOENIX code
• multi-purpose stellar atmosphere code
• implements detailed micro-physics
• in development for years
• portable
• about 180,000 lines of Fortran 77/90 code
• applied successfully to a large variety of problems

• parallelization to allow larger and more detailed simulations in a reasonable timeframe
• independent physical & logical program modules, plus data parallelism within each module
• problem: very different types of simulations require different algorithms

2. Spectral line selection
• serial version can take 3 h for the large line lists!
• 3 separate line lists → task parallelism
• data parallelism: use a client-server model (a minimal MPI sketch follows this list)
• clients: select lines in assigned chunks of the line list files
• server: collects data and creates sorted sub-list files
• potentially reduces the time in proportion to the number of client nodes
• creates large message traffic
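
• a minimal MPI sketch of the client-server pattern (program structure only; routine and variable names are illustrative, not PHOENIX source):

! hypothetical sketch of the client-server line selection
! clients scan their chunk of a line list; the server (rank 0) collects the hits
program line_selection_sketch
  use mpi
  implicit none
  integer, parameter :: max_hits = 100000
  integer :: ierr, rank, nproc, nhits, src
  integer :: status(MPI_STATUS_SIZE)
  double precision :: hits(max_hits)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)

  if (rank == 0) then
     ! server: receive selected lines from every client, then sort/write the sub-list
     do src = 1, nproc - 1
        call MPI_Recv(hits, max_hits, MPI_DOUBLE_PRECISION, src, 1, &
                      MPI_COMM_WORLD, status, ierr)
        call MPI_Get_count(status, MPI_DOUBLE_PRECISION, nhits, ierr)
        ! ... merge hits(1:nhits) into the sorted sub-list file ...
     end do
  else
     ! client: scan the chunk of the line list assigned to this rank
     nhits = 0
     ! do i = first_line(rank), last_line(rank)         ! hypothetical helpers
     !    if (line_is_strong_enough(i)) then
     !       nhits = nhits + 1; hits(nhits) = ...
     !    end if
     ! end do
     call MPI_Send(hits, nhits, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, ierr)
  end if

  call MPI_Finalize(ierr)
end program line_selection_sketch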

3. Spectral line opacity calculations
• 5 separate sub-lists possible → task parallelism
• data parallelism: spatial points
• problem: typically only 50 spatial points
→ low scalability (5-10 nodes)
• data parallelism: individual spectral lines
• up to 50,000 spectral lines per wavelength point
→ good scalability

• parallel filesystem:
automatically partitions the files so that each node reads only its fraction of the lists
→ no code changes required
• problem: a significant fraction of wavelength points has only a few (< 50) contributing spectral lines
• implement dynamic switching between the parallel algorithms at runtime (a sketch follows this list)
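
• a minimal sketch of such a runtime switch (the threshold and all values are illustrative assumptions, not the criterion actually used):

! hypothetical sketch: pick the data-parallel strategy for one wavelength point at runtime
program choose_strategy
  implicit none
  integer, parameter :: nlayers = 50     ! spatial points (typical value from the slides)
  integer, parameter :: nproc   = 32     ! worker nodes available
  integer :: nlines                      ! lines contributing at this wavelength point

  nlines = 40000                         ! illustrative; taken from the sub-list in practice
  if (nlines >= nproc * 100) then
     ! many lines: distributing the lines over the nodes scales well
     print *, 'parallelize over the', nlines, 'spectral lines'
  else
     ! few lines: fall back to distributing the ~50 spatial points
     print *, 'parallelize over the', nlayers, 'spatial points'
  end if
end program choose_strategy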

4. Radiative transfer (formal solution)
• "characteristic rays" are independent
→ parallelize the "formal solution" part of the iteration
• problem: the number of mesh points along each characteristic is different
• need to balance the load by distributing sets of characteristics to the nodes (a sketch follows this list)
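
• a minimal sketch of one way to balance the characteristics (a greedy longest-first assignment; ray counts and the strategy are illustrative, not necessarily the scheme actually used):

! hypothetical sketch: assign characteristic rays to nodes, longest rays first,
! always giving the next ray to the node with the smallest accumulated work
program balance_rays
  implicit none
  integer, parameter :: nrays = 100, nnodes = 8
  integer :: npts(nrays), owner(nrays), load(nnodes)
  integer :: i, j, imax, jmin
  logical :: done(nrays)

  ! illustrative ray lengths: core-intersecting rays have more mesh points than tangent rays
  do i = 1, nrays
     npts(i) = 10 + mod(7*i, 90)
  end do

  load = 0
  done = .false.
  do
     ! pick the longest unassigned ray
     imax = 0
     do i = 1, nrays
        if (.not. done(i)) then
           if (imax == 0) imax = i
           if (npts(i) > npts(imax)) imax = i
        end if
     end do
     if (imax == 0) exit
     ! give it to the node with the smallest accumulated work
     jmin = 1
     do j = 2, nnodes
        if (load(j) < load(jmin)) jmin = j
     end do
     owner(imax) = jmin
     load(jmin) = load(jmin) + npts(imax)
     done(imax) = .true.
  end do

  print *, 'per-node work:', load
  print *, 'first 10 assignments:', owner(1:10)
end program balance_rays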

5. Parallelizing the wavelength loop
• "longest" loop in the whole code: the number of wavelength points!
• ideal for parallelization
• works extremely well for static configurations:
each wavelength point can be done in parallel, with no communication until each node has completed its sweep
• does not reduce memory requirements per node
→ combine with other task/data parallel algorithms
→ concept of "wavelength clusters" with a set of "worker nodes" each

• expanding atmospheres: radiative transfer is an initial value problem in wavelength
• wavelength point l depends on the previous point l-1
• use a "pipeline" approach to parallelization:
the cluster working on point l-1 sends its data to the cluster working on point l (a minimal MPI sketch follows)
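
• a minimal MPI sketch of the wavelength-pipeline idea, simplified to one node per "cluster" (structure only; buffer contents and names are illustrative, not PHOENIX source):

! hypothetical sketch: cluster c computes wavelength point l, receives the l-1
! result from cluster c-1, and passes its own result on to cluster c+1
program wavelength_pipeline
  use mpi
  implicit none
  integer :: ierr, rank, nproc, l, nlam
  integer :: status(MPI_STATUS_SIZE)
  double precision :: wl_data(1000)      ! wavelength-dependent state passed along the pipe

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  if (nproc < 2) stop 'run with at least 2 MPI ranks'
  nlam = 10 * nproc
  wl_data = 0.0d0

  do l = rank + 1, nlam, nproc           ! round-robin: this cluster handles these points
     ! pre-processing (opacities) is independent of other wavelength points
     ! receive the l-1 result from the cluster that computed it (except for l = 1)
     if (l > 1) call MPI_Recv(wl_data, 1000, MPI_DOUBLE_PRECISION, &
                    mod(rank + nproc - 1, nproc), l - 1, MPI_COMM_WORLD, status, ierr)
     ! ... radiative transfer at wavelength point l, using wl_data ...
     ! pass the result for l on to the cluster that handles l+1
     if (l < nlam) call MPI_Send(wl_data, 1000, MPI_DOUBLE_PRECISION, &
                    mod(rank + 1, nproc), l, MPI_COMM_WORLD, ierr)
  end do

  call MPI_Finalize(ierr)
end program wavelength_pipeline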

• problem separates into pre- and post-processing phases:
for l := 1 to NUMWAVELENGTHS
pre_processing: {
...
atomicLineOpacity(...)
molecularLineOpacity(...)
nlteOpacity(...)
}
post_processing: {
...
nlteUpdateRates(...)
}
end for

• properties similar to vector pipeline
• limited scalability
• combination of clusters and workers can be used to increase performance on any given architecture

Conclusions & The Future
• parallelization of PHOENIX allows physically more detailed models
• decrease in wall-clock time per model is substantial for many types of simulations
• coding effort to implement MPI calls relatively small (about 7400 lines)
• logic for algorithm selection and load balancing fairly complex
• parallel version of PHOENIX is regularly used in production
• depending on the simulation type, we use between 4 and 64 (single-CPU) nodes

In the Future
• parallel asynchronous IO (MPI-2)
• radiative transfer for arbitrary velocity fields (requires 70-700 GB of memory/disk space)
• time dependence
• 3D RT in the Lagrangian frame for optically thick moving configurations