%==========================================================================
%==========================================================================
Code Description
A. General description:
SMG98 is a parallel semicoarsening multigrid solver for the linear
systems arising from finite difference, finite volume, or finite
element discretizations of the diffusion equation,
\nabla \cdot ( D \nabla u ) + \sigma u = f
on logically rectangular grids. The code solves both 2D and 3D
problems with discretization stencils of up to 9-point in 2D and up to
27-point in 3D. See the following paper for details on the algorithm
and its parallel implementation/performance:
P. N. Brown, R. D. Falgout, and J. E. Jones,
"Semicoarsening multigrid on distributed memory machines".
To appear in the SIAM Journal on Scientific Computing special
issue on the Fifth Copper Mountain Conference on Iterative Methods.
Also available as LLNL technical report UCRL-JC-130720.
The driver provided with SMG98 builds linear systems for the special
case of the above equation,
- cx u_xx - cy u_yy - cz u_zz = (1/h)^2 , (in 3D)
- cx u_xx - cy u_yy = (1/h)^2 , (in 2D)
- cx u_xx = (1/h)^2 , (in 1D)
with Dirichlet boundary conditions of u = 0, where h is the mesh
spacing in each direction. Standard finite differences are used to
discretize the equations, yielding 3-pt., 5-pt., and 7-pt. stencils in
1D, 2D, and 3D, respectively.
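As a concrete illustration, the following sketch (in C, with illustrative
names; it is not taken from the driver source) shows the constant-coefficient
7-point stencil that this discretization produces in 3D, after multiplying
the equation through by h^2 so that the right-hand side becomes 1:
   /* A minimal sketch (not the driver's actual code) of the 7-point
    * stencil that standard central differencing gives for the 3D
    * problem above, after multiplying through by h^2 so that the
    * right-hand side becomes 1.  Entry ordering is illustrative. */
   void stencil_7pt(double cx, double cy, double cz, double s[7])
   {
      s[0] = 2.0*(cx + cy + cz);   /* center ( 0, 0, 0) */
      s[1] = -cx;                  /* west   (-1, 0, 0) */
      s[2] = -cx;                  /* east   ( 1, 0, 0) */
      s[3] = -cy;                  /* south  ( 0,-1, 0) */
      s[4] = -cy;                  /* north  ( 0, 1, 0) */
      s[5] = -cz;                  /* down   ( 0, 0,-1) */
      s[6] = -cz;                  /* up     ( 0, 0, 1) */
   }
With cx = cy = cz = 1, this reduces to the familiar 7-point Laplacian
stencil with 6 on the diagonal and -1 on each off-diagonal.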
To determine when the solver has converged, the driver currently uses
the relative-residual stopping criteria,
||r_k||_2 / ||b||_2 < tol
with tol = 10^-6. Note that in 1D, SMG98 reduces to a direct method
(cyclic reduction), so that the exact solution is obtained in just one
V-cycle (iteration).
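In C, a minimal sketch of this stopping test (illustrative only; the
library's actual convergence check is implemented differently) is:
   #include <math.h>

   /* Return nonzero when ||r||_2 / ||b||_2 < tol for vectors of length n. */
   int has_converged(const double *r, const double *b, int n, double tol)
   {
      double rnorm = 0.0, bnorm = 0.0;
      int i;

      for (i = 0; i < n; i++)
      {
         rnorm += r[i]*r[i];
         bnorm += b[i]*b[i];
      }
      return (sqrt(rnorm) < tol*sqrt(bnorm));
   }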
This solver can serve as a key component for achieving scalability in
radiation diffusion simulations.
B. Coding:
SMG98 is written in ISO-C. It is an SPMD code that uses either MPI alone
or both MPI and POSIX threads. Parallelism is achieved by data
decomposition. The driver provided with SMG98 achieves this
decomposition by simply subdividing the grid into logical P x Q x R
(in 3D) chunks of equal size.
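For illustration only, the following sketch shows one way such a
decomposition might assign a block to each MPI rank, assuming each
processor owns an nx x ny x nz block and ranks are ordered with the P
direction varying fastest (the actual driver may order ranks differently):
   /* Map an MPI rank to its (p, q, r) position in the logical P x Q x R
    * topology and compute the index extents of its block.  Names and
    * rank ordering are assumptions for illustration; R is implied by
    * the total number of ranks. */
   void local_extents(int rank, int P, int Q,
                      int nx, int ny, int nz,
                      int ilower[3], int iupper[3])
   {
      int p = rank % P;
      int q = (rank / P) % Q;
      int r = rank / (P*Q);

      ilower[0] = p*nx;   iupper[0] = ilower[0] + nx - 1;
      ilower[1] = q*ny;   iupper[1] = ilower[1] + ny - 1;
      ilower[2] = r*nz;   iupper[2] = ilower[2] + nz - 1;
   }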
C. Parallelism:
SMG98 is a highly synchronous code. Its communication and
computation patterns exhibit the surface-to-volume relationship
common to many parallel scientific codes. Hence, parallel efficiency
is largely determined by the size of the data "chunks" mentioned
above, and the speed of communications and computations on the
machine. SMG98 is also memory-access bound, doing only about 1-2
computations per memory access, so memory-access speeds will also have
a large impact on performance.
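As a rough illustration of the surface-to-volume effect, a 40x40x40 chunk
(the local problem size used in the scaling study below) contains 64,000
grid points while its six faces hold only about 9,600, so the data
exchanged with neighboring processors is a small fraction of the data
computed on, provided the chunks remain reasonably large.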
%==========================================================================
%==========================================================================
Files in this Distribution
NOTE: Not all of these files are actually needed by SMG98. This code
is part of a larger linear solver library that is being developed in
the Center for Applied Scientific Computing (CASC) at LLNL.
In the linear_solvers directory the following files are included:
Makefile.in
configure
The following subdirectories are also included:
docs
struct_linear_solvers
struct_matrix_vector
test
utilities
In the docs directory the following files are included:
smg98.readme
In the utilities directory the following files are included:
In the struct_matrix_vector directory the following files are included:
In the struct_linear_solvers directory the following files are included:
%==========================================================================
%==========================================================================
Building the Code
SMG98 uses GNU Autoconf to generate machine-specific makefiles for
building the code. To generate the makefiles, type
configure
in the top-level directory. The configure script is a portable script
generated by GNU Autoconf. It runs a series of tests to determine
characteristics of the machine on which it is running, and it uses the
results of these tests to produce the machine-specific makefiles,
called `Makefile', from template files called `Makefile.in' in each
directory. The configure script produces a file called `config.cache'
which stores some of its results. If you wish to run configure again
in a way that will get different results, you should remove this file.
Once the makefiles are produced, type
make
Other available targets are
make clean (deletes .o files)
make veryclean (deletes .o files, libraries, and executables)
This configure script primarily does the following things:
1. selects a C compiler
2. provides either optimization or debugging options for the C compiler
3. finds the headers and libraries for MPI
The configure script has some command-line options that can give you
some control over the choices it will make. You can type
configure --help
to see the list of all of the command-line options to configure, but
the most significant options are at the bottom of the list, after the
line that reads
--enable and --with options recognized:
Here are the current options:
--with-CC=ARG
This option allows you to choose the C compiler you wish to use. The
default compiler that configure uses is the MPI compiler.
--enable-opt-debug=ARG
Choose whether you want the C compiler to have optimization or
debugging flags. For debugging, replace `ARG' with `debug'. For
optimization, replace `ARG' with `opt'. If you want both sets of
flags, replace `ARG' with `both'. The default is optimization.
--without-MPI
This flag suppresses the use of MPI.
--with-mpi-include=DIR
--with-mpi-libs=LIBS
--with-mpi-lib-dirs=DIRS
These three flags are to be used if you want to override the automatic
search for MPI. If you use one of these flags, you must use all
three. Replace `DIR' with the path of the directory that contains
mpi.h, replace `LIBS' with a list of the stub names of all the
libraries needed for MPI, and replace `DIRS' with a list of the
directory paths containing the libraries specified by `LIBS'. NOTE:
The lists `LIBS' and `DIRS' should be space-separated and contained in
quotes, e.g.
--with-mpi-libs="nslsocket mpi"
--with-mpi-lib-dirs="/usr/lib /usr/local/mpi/lib"
--with-mpi-flags=FLAGS
Sometimes other compiler flags are needed for certain MPI
implementations to work. Replace `FLAGS' with a space-separated list
of whatever flags are necessary. This option does not override the
automatic search for MPI. It can be used to add to the results of the
automatic search, or it can be used along with the three previous
flags.
--with-MPICC=ARG
The automatic search for MPI is based on a compiler wrapper such as
mpicc that configure finds in your $PATH. If there is a particular
implementation of MPI that you wish to use, you can replace `ARG' with
the name of that implementation's C compiler wrapper, if it has one.
(MPICH has mpicc, IBM MPI has mpcc, other MPIs use other names.)
configure will then automatically find the necessary libraries and
headers.
--with-pthreads
Use this option to build SMG98 with POSIX threads. The default is to
use MPI only, without POSIX threads.
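A complete configure invocation combining several of the options above
(an illustrative example only) might look like
   configure --enable-opt-debug=opt --with-MPICC=mpicc --with-pthreads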
%==========================================================================
%==========================================================================
Optimization and Improvement Challenges
This code is memory-access bound. We believe it would be very
difficult to obtain "good" cache reuse with an optimized version of
the code.
%==========================================================================
%==========================================================================
Parallelism and Scalability Expectations
SMG98 has been run on the following platforms:
Blue-Pacific (ID) - up to 64 procs
Blue-Pacific (TR) - up to 256 procs
Red - up to 1000 procs
DEC cluster - up to 8 procs
Sun SPARC 5s and 10s - up to 4 machines
Pentium PC - up to 1 proc
Consider increasing both problem size and number of processors in tandem.
On scalable architectures, time-to-solution for SMG98 will initially
increase, then level off at a modest number of processors, remaining
roughly constant for larger numbers of processors. Iteration counts
will also increase slightly for small to modest-sized problems,
then level off at a roughly constant number for larger problem sizes.
For example, we get the following results for a 3D problem with
cx = 0.1, cy = 1.0, and cz = 10.0, distributed on a logical P x Q x R
processor topology with a fixed local problem size of 40x40x40 per
processor:
"P x Q x R" P "iters" "solve time" "solve mflops"
1x1x1 1 6 23.255241 6.464325
2x2x2 8 6 32.262907 37.030568
3x3x3 27 7 41.341892 111.707595
4x4x4 64 7 46.672215 236.775982
5x5x5 125 7 50.051737 433.673948
8x8x8 512 7 54.094806 1631.579065
10x10x10 1000 8 62.725305 3136.280769
These results were obtained on ASCI Red.
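Since the local problem size is held fixed at 40x40x40 (64,000 unknowns
per processor), these are weak-scaling runs; the 10x10x10 case, for
instance, corresponds to a 400x400x400 global grid, or 64 million
unknowns.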
%==========================================================================
%==========================================================================
Running the Code
The driver for SMG98 is called `struct_linear_solvers', and is located
in the linear_solvers/test subdirectory. Type
mpirun -np 1 struct_linear_solvers -help
to get usage information. This prints out the following:
Usage: .../linear_solvers/test/struct_linear_solvers [<options>]
-n <nx> <ny> <nz> : problem size per block
-P <Px> <Py> <Pz> : processor topology
-b <bx> <by> <bz> : blocking per processor
-c <cx> <cy> <cz> : diffusion coefficients
-v <n_pre> <n_post> : number of pre and post relaxations
-d <dim> : problem dimension (2 or 3)
-solver <ID> : solver ID
All of the arguments are optional. The most important options for the
SMG98 compact application are the `-n' and `-P' options. The `-n'
option allows one to specify the local problem size per processor, and
the `-P' option specifies the processor topology to run on. The
global problem size will be <Px>*<nx> by <Py>*<ny> by <Pz>*<nz>.
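For example, the command
   mpirun -np 8 struct_linear_solvers -n 40 40 40 -P 2 2 2
runs on 8 processors arranged in a 2 x 2 x 2 logical topology, with a
40x40x40 local problem on each processor and therefore an 80x80x80
global problem.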
%==========================================================================
%==========================================================================
Timing Issues
The whole code is timed using the MPI timers. Timing results are
printed to standard out, and are divided into "Setup Phase" times and
"Solve Phase" times. Timings for a few individual routines are also
printed out.
%==========================================================================
%==========================================================================
Memory Needed
SMG98 is a memory-intensive code, and its memory needs are somewhat
complicated to describe. For the 3D problems discussed in this
document, memory requirements are roughly 54 times the local problem
size times the size of a double plus some overhead for storing ghost
points, etc. in the code. Unfortunately, the overhead required by
this version of the SMG code grows with problem size, and can be
quite substantial for very large runs.
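For example, a 40x40x40 local problem (64,000 grid points) needs
roughly 54 * 64,000 * 8 bytes, or about 28 MBytes, per processor
before counting that overhead.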
%==========================================================================
%==========================================================================
About the Data
%==========================================================================
%==========================================================================
Expected Results
Consider the following run:
mpirun -np 1 struct_linear_solvers -n 12 12 12 -c 2.0 3.0 40
This is what SMG98 prints out:
Running with these driver parameters:
(nx, ny, nz) = (12, 12, 12)
(Px, Py, Pz) = (1, 1, 1)
(bx, by, bz) = (1, 1, 1)
(cx, cy, cz) = (2.000000, 3.000000, 40.000000)
(n_pre, n_post) = (1, 1)
dim = 3
solver ID = 0
=============================================
Setup phase times:
=============================================
SMG Setup:
wall clock time = 0.383954 seconds
wall MFLOPS = 0.825474
cpu clock time = 0.380000 seconds
cpu MFLOPS = 0.834063
SMG:
wall clock time = 0.050722 seconds
wall MFLOPS = 4.637869
cpu clock time = 0.050000 seconds
cpu MFLOPS = 4.704840
SMGRelax:
wall clock time = 0.072742 seconds
wall MFLOPS = 4.357098
cpu clock time = 0.080000 seconds
cpu MFLOPS = 3.961800
SMGResidual:
wall clock time = 0.024079 seconds
wall MFLOPS = 5.973338
cpu clock time = 0.040000 seconds
cpu MFLOPS = 3.595800
CyclicReduction:
wall clock time = 0.039484 seconds
wall MFLOPS = 3.662243
cpu clock time = 0.040000 seconds
cpu MFLOPS = 3.615000
SMGIntAdd:
wall clock time = 0.002812 seconds
wall MFLOPS = 5.633001
cpu clock time = 0.000000 seconds
cpu MFLOPS = 0.000000
SMGRestrict:
wall clock time = 0.001255 seconds
wall MFLOPS = 10.097211
cpu clock time = 0.000000 seconds
cpu MFLOPS = 0.000000
=============================================
Solve phase times:
=============================================
SMG Solve:
wall clock time = 0.526451 seconds
wall MFLOPS = 4.864426
cpu clock time = 0.530000 seconds
cpu MFLOPS = 4.831853
SMG:
wall clock time = 0.526433 seconds
wall MFLOPS = 4.864592
cpu clock time = 0.530000 seconds
cpu MFLOPS = 4.831853
SMGRelax:
wall clock time = 0.496075 seconds
wall MFLOPS = 4.651397
cpu clock time = 0.510000 seconds
cpu MFLOPS = 4.524396
SMGResidual:
wall clock time = 0.216037 seconds
wall MFLOPS = 6.290034
cpu clock time = 0.230000 seconds
cpu MFLOPS = 5.908174
CyclicReduction:
wall clock time = 0.235903 seconds
wall MFLOPS = 3.644964
cpu clock time = 0.260000 seconds
cpu MFLOPS = 3.307146
SMGIntAdd:
wall clock time = 0.027868 seconds
wall MFLOPS = 6.407349
cpu clock time = 0.020000 seconds
cpu MFLOPS = 8.928000
SMGRestrict:
wall clock time = 0.015325 seconds
wall MFLOPS = 9.321240
cpu clock time = 0.000000 seconds
cpu MFLOPS = 0.000000
Iterations = 4
Final Relative Residual Norm = 8.972097e-07
The relative residual norm may differ from machine to machine or
compiler to compiler, but only very slightly (say, in the 6th or 7th
decimal place). Also, the code should generate nearly
identical results for a given problem, independent of the data
distribution. The only part of the code that does not guarantee
bitwise identical results is the inner product used to compute norms.
In practice, the above residual norm has remained the same.
%==========================================================================
%==========================================================================
Release and Modification Record