%==========================================================================
%==========================================================================

Code Description

A. General description:

SMG98 is a parallel semicoarsening multigrid solver for the linear
systems arising from finite difference, finite volume, or finite
element discretizations of the diffusion equation,

   \grad \cdot ( D \grad u ) + \sigma u = f

on logically rectangular grids.  The code solves both 2D and 3D
problems with discretization stencils of up to 9-point in 2D and up to
27-point in 3D.  See the following paper for details on the algorithm
and its parallel implementation/performance:

   P. N. Brown, R. D. Falgout, and J. E. Jones,
   "Semicoarsening multigrid on distributed memory machines".
   To appear in the SIAM Journal on Scientific Computing special
   issue on the Fifth Copper Mountain Conference on Iterative Methods.
   Also available as LLNL technical report UCRL-JC-130720.

The driver provided with SMG98 builds linear systems for the special
case of the above equation,

   - cx u_xx - cy u_yy - cz u_zz = (1/h)^2 ,  (in 3D)
   - cx u_xx - cy u_yy           = (1/h)^2 ,  (in 2D)
   - cx u_xx                     = (1/h)^2 ,  (in 1D)

with Dirichlet boundary conditions of u = 0, where h is the mesh
spacing in each direction.  Standard finite differences are used to
discretize the equations, yielding 3-pt., 5-pt., and 7-pt. stencils in
1D, 2D, and 3D, respectively.
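
For example, with the standard central difference approximation

   u_xx ~ ( u(i-1,j,k) - 2 u(i,j,k) + u(i+1,j,k) ) / h^2

(and similarly in y and z), the 3D equation above yields a 7-point
stencil whose entries are, up to the overall scaling chosen by the
driver (this is a sketch of the discretization, not the driver's
exact code):

   center:        2 ( cx + cy + cz ) / h^2
   x-neighbors:  -cx / h^2
   y-neighbors:  -cy / h^2
   z-neighbors:  -cz / h^2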

To determine when the solver has converged, the driver currently uses
the relative-residual stopping criterion,

   ||r_k||_2 / ||b||_2 < tol

with tol = 10^-6.  Note that in 1D, SMG98 reduces to a direct method
(cyclic reduction), so the exact solution is obtained in just one
V-cycle (iteration).
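
As a minimal illustration of that stopping test (a hedged sketch in C,
not the actual driver code), the check amounts to:

   #include <math.h>

   /* 2-norm of a vector of length n */
   double norm2(const double *v, int n)
   {
      double sum = 0.0;
      int    i;
      for (i = 0; i < n; i++)
         sum += v[i] * v[i];
      return sqrt(sum);
   }

   /* returns 1 when ||r||_2 / ||b||_2 < tol, e.g. tol = 1.0e-06 */
   int converged(const double *r, const double *b, int n, double tol)
   {
      return (norm2(r, n) / norm2(b, n) < tol);
   }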

This solver can serve as a key component for achieving scalability in
radiation diffusion simulations.

B. Coding:

SMG98 is written in ISO-C.  It is an SPMD code which uses either MPI
alone or MPI together with POSIX threads.  Parallelism is achieved by
data decomposition.  The driver provided with SMG98 achieves this
decomposition by simply subdividing the grid into logical P x Q x R
(in 3D) chunks of equal size.
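
The sketch below (hypothetical code, not the actual driver) shows one
way such a decomposition can be computed: each MPI rank is mapped to a
logical position (p, q, r) in the P x Q x R topology and owns an
nx x ny x nz chunk of the global (P*nx) x (Q*ny) x (R*nz) grid.

   /* Map a rank to its chunk of the global grid (illustrative only). */
   void chunk_of_rank(int rank, int P, int Q, int R,
                      int nx, int ny, int nz,
                      int ilower[3], int iupper[3])
   {
      int p = rank % P;            /* position in the x direction */
      int q = (rank / P) % Q;      /* position in the y direction */
      int r = rank / (P * Q);      /* position in the z direction */

      ilower[0] = p * nx;  iupper[0] = ilower[0] + nx - 1;
      ilower[1] = q * ny;  iupper[1] = ilower[1] + ny - 1;
      ilower[2] = r * nz;  iupper[2] = ilower[2] + nz - 1;
   }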

C. Parallelism:

SMG98 is a highly synchronous code.  Its communication and
computation patterns exhibit the surface-to-volume relationship
common to many parallel scientific codes.  Hence, parallel efficiency
is largely determined by the size of the data "chunks" mentioned
above and by the speed of communication and computation on the
machine.  SMG98 is also memory-access bound, doing only about 1-2
computations per memory access, so memory-access speeds also have
a large impact on performance.
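
To make the surface-to-volume point concrete (an illustrative
calculation, not a measured figure): for a 40 x 40 x 40 chunk, roughly

   6 * 40^2 = 9,600  of the  40^3 = 64,000  points (about 15%)

lie on the chunk surface and must be exchanged with neighboring
processors, and this fraction grows as the chunks get smaller.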

%==========================================================================
%==========================================================================

Files in this Distribution

NOTE: Not all of these files are actually needed by SMG98.  This code
is part of a larger linear solver library that is being developed in
the Center for Applied Scientific Computing (CASC) at LLNL.

In the linear_solvers directory the following files are included:

   Makefile.in
   configure

The following subdirectories are also included:

   docs
   struct_linear_solvers
   struct_matrix_vector
   test
   utilities

In the docs directory the following files are included:

   smg98.readme

In the utilities directory the following files are included:

In the struct_matrix_vector directory the following files are included:

In the struct_linear_solvers directory the following files are included:

%==========================================================================
%==========================================================================

Building the Code

SMG98 uses GNU Autoconf to generate machine-specific makefiles for
building the code.  To generate the makefiles, type

   configure

in the top-level directory.  The configure script is a portable script
generated by GNU Autoconf.  It runs a series of tests to determine
characteristics of the machine on which it is running, and it uses the
results of these tests to produce the machine-specific makefiles,
called `Makefile', from template files called `Makefile.in' in each
directory.  The configure script produces a file called `config.cache'
which stores some of its results.  If you wish to run configure again
in a way that will get different results, you should remove this file.

Once the makefiles are produced, type

   make

Other available targets are

   make clean      (deletes .o files)
   make veryclean  (deletes .o files, libraries, and executables)

The configure script primarily does the following things:

   1. selects a C compiler
   2. provides either optimization or debugging options for the C compiler
   3. finds the headers and libraries for MPI

The configure script has some command-line options that give you some
control over the choices it makes.  You can type

   configure --help

to see the full list of command-line options to configure, but the
most significant options appear at the bottom of the list, after the
line that reads

   --enable and --with options recognized:

Here are the current options:

--with-CC=ARG

This option allows you to choose the C compiler you wish to use.  The
default compiler that configure uses is the MPI compiler.

--enable-opt-debug=ARG

Choose whether you want the C compiler to have optimization or
debugging flags.  For debugging, replace `ARG' with `debug'.  For
optimization, replace `ARG' with `opt'.  If you want both sets of
flags, replace `ARG' with `both'.  The default is optimization.

--without-MPI

This flag suppresses the use of MPI.

--with-mpi-include=DIR
--with-mpi-libs=LIBS
--with-mpi-lib-dirs=DIRS

These three flags are to be used if you want to override the automatic
search for MPI.  If you use one of these flags, you must use all
three.  Replace `DIR' with the path of the directory that contains
mpi.h, replace `LIBS' with a list of the stub names of all the
libraries needed for MPI, and replace `DIRS' with a list of the
directory paths containing the libraries specified by `LIBS'.  NOTE:
The lists `LIBS' and `DIRS' should be space-separated and contained in
quotes, e.g.

   --with-mpi-libs="nsl socket mpi"
   --with-mpi-lib-dirs="/usr/lib /usr/local/mpi/lib"
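
Put together, a manual MPI specification might look like the following
(the paths here are only hypothetical examples):

   configure --with-mpi-include=/usr/local/mpi/include \
             --with-mpi-libs="mpi" \
             --with-mpi-lib-dirs="/usr/local/mpi/lib"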

--with-mpi-flags=FLAGS

Sometimes other compiler flags are needed for certain MPI
implementations to work.  Replace `FLAGS' with a space-separated list
of whatever flags are necessary.  This option does not override the
automatic search for MPI.  It can be used to add to the results of the
automatic search, or it can be used along with the three previous
flags.

--with-MPICC=ARG

The automatic search for MPI is based on a compiler wrapper such as
mpicc that configure finds in your $PATH.  If there is a particular
MPI implementation that you wish to use, you can replace `ARG' with
the name of that implementation's C compiler wrapper, if it has one.
(MPICH has mpicc, IBM MPI has mpcc, and other MPIs use other names.)
configure will then automatically find the necessary libraries and
headers.

--with-pthreads

This option should be used if POSIX threads are to be used in SMG98.
The default is to use MPI only, without POSIX threads.

%==========================================================================
%==========================================================================

Optimization and Improvement Challenges

This code is memory-access bound.  We believe it would be very
difficult to obtain "good" cache reuse with an optimized version of
the code.

%==========================================================================
%==========================================================================

Parallelism and Scalability Expectations

SMG98 has been run on the following platforms:

   Blue-Pacific (ID)    - up to 64 procs
   Blue-Pacific (TR)    - up to 256 procs
   Red                  - up to 1000 procs
   DEC cluster          - up to 8 procs
   Sun sparc 5's, 10's  - up to 4 machines
   Pentium PC           - up to 1 proc

Consider increasing both problem size and number of processors in
tandem.  On scalable architectures, time-to-solution for SMG98 will
initially increase, then level off at a modest number of processors,
remaining roughly constant for larger numbers of processors.
Iteration counts will also increase slightly for small to modest-sized
problems, then level off at a roughly constant number for larger
problem sizes.

For example, we get the following results for a 3D problem with
cx = 0.1, cy = 1.0, and cz = 10.0, for a problem distributed on
a logical P x Q x R processor topology, with fixed local problem
size per processor given as 40x40x40:

   P x Q x R      P    iters   solve time   solve mflops
   -----------------------------------------------------
   1x1x1          1      6      23.255241       6.464325
   2x2x2          8      6      32.262907      37.030568
   3x3x3         27      7      41.341892     111.707595
   4x4x4         64      7      46.672215     236.775982
   5x5x5        125      7      50.051737     433.673948
   8x8x8        512      7      54.094806    1631.579065
   10x10x10    1000      8      62.725305    3136.280769

These results were obtained on ASCI Red.

%==========================================================================
%==========================================================================

Running the Code

The driver for SMG98 is called `struct_linear_solvers', and is located
in the linear_solvers/test subdirectory.  Type

   mpirun -np 1 struct_linear_solvers -help

to get usage information.  This prints out the following:

   Usage: .../linear_solvers/test/struct_linear_solvers [<options>]

     -n <nx> <ny> <nz>    : problem size per block
     -P <Px> <Py> <Pz>    : processor topology
     -b <bx> <by> <bz>    : blocking per processor
     -c <cx> <cy> <cz>    : diffusion coefficients
     -v <n_pre> <n_post>  : number of pre and post relaxations
     -d <dim>             : problem dimension (2 or 3)
     -solver <ID>         : solver ID

All of the arguments are optional.  The most important options for the
SMG98 compact application are the `-n' and `-P' options.  The `-n'
option specifies the local problem size per processor, and the `-P'
option specifies the processor topology to run on.  The global problem
size will be <Px>*<nx> by <Py>*<ny> by <Pz>*<nz>.
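
For example (a hypothetical run; the launcher name and syntax depend on
your MPI installation), the command

   mpirun -np 8 struct_linear_solvers -n 40 40 40 -P 2 2 2

runs on a 2 x 2 x 2 processor topology (8 MPI processes) with a local
40x40x40 problem per processor, giving a global problem size of
80 x 80 x 80.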

%==========================================================================
%==========================================================================

Timing Issues

The whole code is timed using the MPI timers.  Timing results are
printed to standard output, and are divided into "Setup Phase" times
and "Solve Phase" times.  Timings for a few individual routines are
also printed out.
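
As a minimal sketch of how such a phase can be timed with the MPI
timers (illustrative only, not the actual SMG98 timing code):

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char *argv[])
   {
      double t_start, t_stop;
      int    myid;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);

      t_start = MPI_Wtime();
      /* ... setup or solve phase would run here ... */
      t_stop = MPI_Wtime();

      if (myid == 0)
         printf("wall clock time = %f seconds\n", t_stop - t_start);

      MPI_Finalize();
      return 0;
   }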

%==========================================================================
%==========================================================================

Memory Needed

SMG98 is a memory-intensive code, and its memory needs are somewhat
complicated to describe.  For the 3D problems discussed in this
document, memory requirements are roughly 54 times the local problem
size times the size of a double, plus some overhead for storing ghost
points, etc. in the code.  Unfortunately, the overhead required by
this version of the SMG code grows with problem size, and can be
quite substantial for very large runs.
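
As a rough illustration of the estimate above (ignoring the overhead
terms): a 40x40x40 local problem has 64,000 grid points, so

   54 * 64,000 points * 8 bytes/double  =  about 27.6 MB per processor.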

%==========================================================================
%==========================================================================

About the Data

%==========================================================================
%==========================================================================

Expected Results

Consider the following run:

   mpirun -np 1 struct_linear_solvers -n 12 12 12 -c 2.0 3.0 40

This is what SMG98 prints out:

Running with these driver parameters:
  (nx, ny, nz)    = (12, 12, 12)
  (Px, Py, Pz)    = (1, 1, 1)
  (bx, by, bz)    = (1, 1, 1)
  (cx, cy, cz)    = (2.000000, 3.000000, 40.000000)
  (n_pre, n_post) = (1, 1)
  dim             = 3
  solver ID       = 0
=============================================
Setup phase times:
=============================================
SMG Setup:
  wall clock time = 0.383954 seconds
  wall MFLOPS     = 0.825474
  cpu clock time  = 0.380000 seconds
  cpu MFLOPS      = 0.834063
SMG:
  wall clock time = 0.050722 seconds
  wall MFLOPS     = 4.637869
  cpu clock time  = 0.050000 seconds
  cpu MFLOPS      = 4.704840
SMGRelax:
  wall clock time = 0.072742 seconds
  wall MFLOPS     = 4.357098
  cpu clock time  = 0.080000 seconds
  cpu MFLOPS      = 3.961800
SMGResidual:
  wall clock time = 0.024079 seconds
  wall MFLOPS     = 5.973338
  cpu clock time  = 0.040000 seconds
  cpu MFLOPS      = 3.595800
CyclicReduction:
  wall clock time = 0.039484 seconds
  wall MFLOPS     = 3.662243
  cpu clock time  = 0.040000 seconds
  cpu MFLOPS      = 3.615000
SMGIntAdd:
  wall clock time = 0.002812 seconds
  wall MFLOPS     = 5.633001
  cpu clock time  = 0.000000 seconds
  cpu MFLOPS      = 0.000000
SMGRestrict:
  wall clock time = 0.001255 seconds
  wall MFLOPS     = 10.097211
  cpu clock time  = 0.000000 seconds
  cpu MFLOPS      = 0.000000
=============================================
Solve phase times:
=============================================
SMG Solve:
  wall clock time = 0.526451 seconds
  wall MFLOPS     = 4.864426
  cpu clock time  = 0.530000 seconds
  cpu MFLOPS      = 4.831853
SMG:
  wall clock time = 0.526433 seconds
  wall MFLOPS     = 4.864592
  cpu clock time  = 0.530000 seconds
  cpu MFLOPS      = 4.831853
SMGRelax:
  wall clock time = 0.496075 seconds
  wall MFLOPS     = 4.651397
  cpu clock time  = 0.510000 seconds
  cpu MFLOPS      = 4.524396
SMGResidual:
  wall clock time = 0.216037 seconds
  wall MFLOPS     = 6.290034
  cpu clock time  = 0.230000 seconds
  cpu MFLOPS      = 5.908174
CyclicReduction:
  wall clock time = 0.235903 seconds
  wall MFLOPS     = 3.644964
  cpu clock time  = 0.260000 seconds
  cpu MFLOPS      = 3.307146
SMGIntAdd:
  wall clock time = 0.027868 seconds
  wall MFLOPS     = 6.407349
  cpu clock time  = 0.020000 seconds
  cpu MFLOPS      = 8.928000
SMGRestrict:
  wall clock time = 0.015325 seconds
  wall MFLOPS     = 9.321240
  cpu clock time  = 0.000000 seconds
  cpu MFLOPS      = 0.000000

Iterations = 4
Final Relative Residual Norm = 8.972097e-07

The relative residual norm may differ slightly from machine to machine
or from compiler to compiler, but only slightly (say, in the 6th or
7th decimal place).  Also, the code should generate nearly identical
results for a given problem, independent of the data distribution.
The only part of the code that does not guarantee bitwise identical
results is the inner product used to compute norms.  In practice, the
above residual norm has remained the same.

%==========================================================================
%==========================================================================

Release and Modification Record