%==========================================================================
%==========================================================================
Code Description
A. General description:
SMG98 is a parallel semicoarsening multigrid solver for the linear
systems arising from finite difference, finite volume, or finite
element discretizations of the diffusion equation,
\nabla \cdot ( D \nabla u ) + \sigma u = f
on logically rectangular grids. The code solves both 2D and 3D
problems with discretization stencils of up to 9-point in 2D and up to
27-point in 3D. See the following paper for details on the algorithm
and its parallel implementation/performance:
P. N. Brown, R. D. Falgout, and J. E. Jones,
"Semicoarsening multigrid on distributed memory machines".
To appear in the SIAM Journal on Scientific Computing special
issue on the Fifth Copper Mountain Conference on Iterative Methods.
Also available as LLNL technical report UCRL-JC-130720.
The driver provided with SMG98 builds linear systems for the special
case of the above equation,
- cx u_xx - cy u_yy - cz u_zz = (1/h)^2 , (in 3D)
- cx u_xx - cy u_yy = (1/h)^2 , (in 2D)
- cx u_xx = (1/h)^2 , (in 1D)
with Dirichlet boundary conditions of u = 0, where h is the mesh
spacing in each direction. Standard finite differences are used to
discretize the equations, yielding 3-pt., 5-pt., and 7-pt. stencils in
1D, 2D, and 3D, respectively.
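As a concrete illustration, the following sketch (in C, with illustrative
names; it is not taken from the driver source) shows the constant-coefficient
7-point stencil that this discretization produces in 3D, after multiplying
the equation through by h^2 so that the right-hand side becomes 1:
   /* A minimal sketch (not the driver's actual code) of the 7-point
    * stencil that standard central differencing gives for the 3D
    * problem above, after multiplying through by h^2 so that the
    * right-hand side becomes 1.  Entry ordering is illustrative. */
   void stencil_7pt(double cx, double cy, double cz, double s[7])
   {
      s[0] = 2.0*(cx + cy + cz);   /* center ( 0, 0, 0) */
      s[1] = -cx;                  /* west   (-1, 0, 0) */
      s[2] = -cx;                  /* east   ( 1, 0, 0) */
      s[3] = -cy;                  /* south  ( 0,-1, 0) */
      s[4] = -cy;                  /* north  ( 0, 1, 0) */
      s[5] = -cz;                  /* down   ( 0, 0,-1) */
      s[6] = -cz;                  /* up     ( 0, 0, 1) */
   }
With cx = cy = cz = 1, this reduces to the familiar 7-point Laplacian
stencil with 6 on the diagonal and -1 on each off-diagonal.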
To determine when the solver has converged, the driver currently uses
the relative-residual stopping criteria,
||r_k||_2 / ||b||_2 < tol
with tol = 10^-6. Note that in 1D, SMG98 reduces to a direct method
(cyclic reduction), so that the exact solution is obtained in just one
V-cycle (iteration).
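In C, a minimal sketch of this stopping test (illustrative only; the
library's actual convergence check is implemented differently) is:
   #include <math.h>

   /* Return nonzero when ||r||_2 / ||b||_2 < tol for vectors of length n. */
   int has_converged(const double *r, const double *b, int n, double tol)
   {
      double rnorm = 0.0, bnorm = 0.0;
      int i;

      for (i = 0; i < n; i++)
      {
         rnorm += r[i]*r[i];
         bnorm += b[i]*b[i];
      }
      return (sqrt(rnorm) < tol*sqrt(bnorm));
   }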
This solver can serve as a key component for achieving scalability in
radiation diffusion simulations.
B. Coding:
SMG98 is written in ISO-C. It is an SPMD code that uses either MPI alone
or both MPI and POSIX threads. Parallelism is achieved by data
decomposition. The driver provided with SMG98 achieves this
decomposition by simply subdividing the grid into logical P x Q x R
(in 3D) chunks of equal size.
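For illustration only, the following sketch shows one way such a
decomposition might assign a block to each MPI rank, assuming each
processor owns an nx x ny x nz block and ranks are ordered with the P
direction varying fastest (the actual driver may order ranks differently):
   /* Map an MPI rank to its (p, q, r) position in the logical P x Q x R
    * topology and compute the index extents of its block.  Names and
    * rank ordering are assumptions for illustration; R is implied by
    * the total number of ranks. */
   void local_extents(int rank, int P, int Q,
                      int nx, int ny, int nz,
                      int ilower[3], int iupper[3])
   {
      int p = rank % P;
      int q = (rank / P) % Q;
      int r = rank / (P*Q);

      ilower[0] = p*nx;   iupper[0] = ilower[0] + nx - 1;
      ilower[1] = q*ny;   iupper[1] = ilower[1] + ny - 1;
      ilower[2] = r*nz;   iupper[2] = ilower[2] + nz - 1;
   }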
C. Parallelism:
SMG98 is a highly synchronous code. Its communication and
computation patterns exhibit the surface-to-volume relationship
common to many parallel scientific codes. Hence, parallel efficiency
is largely determined by the size of the data "chunks" mentioned
above, and the speed of communications and computations on the
machine. SMG98 is also memory-access bound, doing only about 1-2
computations per memory access, so memory-access speeds will also have
a large impact on performance.
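As a rough illustration of the surface-to-volume effect, a 40x40x40 chunk
(the local problem size used in the scaling study below) contains 64,000
grid points while its six faces hold only about 9,600, so the data
exchanged with neighboring processors is a small fraction of the data
computed on, provided the chunks remain reasonably large.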
%==========================================================================
%==========================================================================
Files in this Distribution
NOTE: Not all of these files are actually needed by SMG98. This code
is part of a larger linear solver library that is being developed in
the Center for Applied Scientific Computing (CASC) at LLNL.
In the linear_solvers directory the following files are included:
Makefile.in
configure
The following subdirectories are also included:
docs
struct_linear_solvers
struct_matrix_vector
test
utilities
In the docs directory the following files are included:
smg98.readme
In the utilities directory the following files are included:
In the struct_matrix_vector directory the following files are included:
In the struct_linear_solvers directory the following files are included:
%==========================================================================
%==========================================================================
Building the Code
SMG98 uses GNU Autoconf to generate machine-specific makefiles for
building the code. To generate the makefiles, type
configure
in the top-level directory. The configure script is a portable script
generated by GNU Autoconf. It runs a series of tests to determine
characteristics of the machine on which it is running, and it uses the
results of these tests to produce the machine-specific makefiles,
called `Makefile', from template files called `Makefile.in' in each
directory. The configure script produces a file called `config.cache'
which stores some of its results. If you wish to run configure again
in a way that will get different results, you should remove this file.
Once the makefiles are produced, type
make
Other available targets are
make clean (deletes .o files)
make veryclean (deletes .o files, libraries, and executables)
This configure script primarily does the following things:
1. selects a C compiler
2. provides either optimization or debugging options for the C compiler
3. finds the headers and libraries for MPI
The configure script has some command-line options that can give you
some control over the choices it will make. You can type
configure --help
to see the list of all of the command-line options to configure, but
the most significant options are at the bottom of the list, after the
line that reads
--enable and --with options recognized:
Here are the current options:
--with-CC=ARG
This option allows you to choose the C compiler you wish to use. The
default compiler that configure uses is the MPI compiler.
--enable-opt-debug=ARG
Choose whether you want the C compiler to have optimization or
debugging flags. For debugging, replace `ARG' with `debug'. For
optimization, replace `ARG' with `opt'. If you want both sets of
flags, replace `ARG' with `both'. The default is optimization.
--without-MPI
This flag suppresses the use of MPI.
--with-mpi-include=DIR
--with-mpi-libs=LIBS
--with-mpi-lib-dirs=DIRS
These three flags are to be used if you want to override the automatic
search for MPI. If you use one of these flags, you must use all
three. Replace `DIR' with the path of the directory that contains
mpi.h, replace `LIBS' with a list of the stub names of all the
libraries needed for MPI, and replace `DIRS' with a list of the
directory paths containing the libraries specified by `LIBS'. NOTE:
The lists `LIBS' and `DIRS' should be space-separated and contained in
quotes, e.g.
--with-mpi-libs="nslsocket mpi"
--with-mpi-lib-dirs="/usr/lib /usr/local/mpi/lib"
--with-mpi-flags=FLAGS
Sometimes other compiler flags are needed for certain MPI
implementations to work. Replace `FLAGS' with a space-separated list
of whatever flags are necessary. This option does not override the
automatic search for MPI. It can be used to add to the results of the
automatic search, or it can be used along with the three previous
flags.
--with-MPICC=ARG
The automatic search for MPI is based on a compiler wrapper such as
mpicc that configure finds in your $PATH. If there is a particular
implementation of MPI that you wish to use, you can replace `ARG' with
the name of that implementation's C compiler wrapper, if it has one.
(MPICH has mpicc, IBM MPI has mpcc, other MPIs use other names.)
configure will then automatically find the necessary libraries and
headers.
--with-pthreads
Use this option to build SMG98 with POSIX threads. The default is to
use MPI only, without POSIX threads.
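A complete configure invocation combining several of the options above
(an illustrative example only) might look like
   configure --enable-opt-debug=opt --with-MPICC=mpicc --with-pthreads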
%==========================================================================
%==========================================================================
Optimization and Improvement Challenges
This code is memory-access bound. We believe it would be very
difficult to obtain "good" cache reuse with an optimized version of
the code.
%==========================================================================
%==========================================================================
Parallelism and Scalability Expectations
SMG98 has been run on the following platforms:
Blue-Pacific (ID) - up to 64 procs
Blue-Pacific (TR) - up to 256 procs
Red - up to 1000 procs
DEC cluster - up to 8 procs
Sun SPARC 5s and 10s - up to 4 machines
Pentium PC - up to 1 proc
Consider increasing both problem size and number of processors in tandem.
On scalable architectures, time-to-solution for SMG98 will initially
increase, then level off at a modest number of processors, remaining
roughly constant for larger numbers of processors. Iteration counts
will also increase slightly for small to modest-sized problems,
then level off at a roughly constant number for larger problem sizes.
For example, we get the following results for a 3D problem with
cx = 0.1, cy = 1.0, and cz = 10.0, distributed on a logical P x Q x R
processor topology with a fixed local problem size of 40x40x40 per
processor:
"P x Q x R" P "iters" "solve time" "solve mflops"
1x1x1 1 6 23.255241 6.464325
2x2x2 8 6 32.262907 37.030568
3x3x3 27 7 41.341892 111.707595
4x4x4 64 7 46.672215 236.775982
5x5x5 125 7 50.051737 433.673948
8x8x8 512 7 54.094806 1631.579065
10x10x10 1000 8 62.725305 3136.280769
These results were obtained on ASCI Red.
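Since the local problem size is held fixed at 40x40x40 (64,000 unknowns
per processor), these are weak-scaling runs; the 10x10x10 case, for
instance, corresponds to a 400x400x400 global grid, or 64 million
unknowns.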
%==========================================================================
%==========================================================================
Running the Code
The driver for SMG98 is called `struct_linear_solvers', and is located
in the linear_solvers/test subdirectory. Type
mpirun -np 1 struct_linear_solvers -help
to get usage information. This prints out the following:
Usage: .../linear_solvers/test/struct_linear_solvers [<options>]
-n <nx> <ny> <nz> : problem size per block
-P <Px> <Py> <Pz> : processor topology
-b <bx> <by> <bz> : blocking per processor
-c <cx> <cy> <cz> : diffusion coefficients
-v <n_pre> <n_post> : number of pre and post relaxations
-d <dim> : problem dimension (2 or 3)
-solver <ID> : solver ID
All of the arguments are optional. The most important options for the
SMG98 compact application are the `-n' and `-P' options. The `-n'
option allows one to specify the local problem size per processor, and
the `-P' option specifies the processor topology to run on. The
global problem size will be <Px>*<nx> by <Py>*<ny> by <Pz>*<nz>.
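For example, the command
   mpirun -np 8 struct_linear_solvers -n 40 40 40 -P 2 2 2
runs on 8 processors arranged in a 2 x 2 x 2 logical topology, with a
40x40x40 local problem on each processor and therefore an 80x80x80
global problem.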
%==========================================================================
%==========================================================================
Timing Issues
The whole code is timed using the MPI timers. Timing results are
printed to standard out, and are divided into "Setup Phase" times and
"Solve Phase" times. Timings for a few individual routines are also
printed out.
%==========================================================================
%==========================================================================
Memory Needed
SMG98 is a memory-intensive code, and its memory needs are somewhat
complicated to describe. For the 3D problems discussed in this
document, memory requirements are roughly 54 times the local problem
size times the size of a double plus some overhead for storing ghost
points, etc. in the code. Unfortunately, the overhead required by
this version of the SMG code grows with problem size, and can be
quite substantial for very large runs.
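For example, a 40x40x40 local problem (64,000 grid points) needs
roughly 54 * 64,000 * 8 bytes, or about 28 MBytes, per processor
before counting that overhead.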
%==========================================================================
%==========================================================================
About the Data
%==========================================================================
%==========================================================================
Expected Results
Consider the following run:
mpirun -np 1 struct_linear_solvers -n 12 12 12 -c 2.0 3.0 40
This is what SMG98 prints out:
Running with these driver parameters:
(nx, ny, nz) = (12, 12, 12)
(Px, Py, Pz) = (1, 1, 1)
(bx, by, bz) = (1, 1, 1)
(cx, cy, cz) = (2.000000, 3.000000, 40.000000)
(n_pre, n_post) = (1, 1)
dim = 3
solver ID = 0
=============================================
Setup phase times:
=============================================
SMG Setup:
wall clock time = 0.383954 seconds
wall MFLOPS = 0.825474
cpu clock time = 0.380000 seconds
cpu MFLOPS = 0.834063
SMG:
wall clock time = 0.050722 seconds
wall MFLOPS = 4.637869
cpu clock time = 0.050000 seconds
cpu MFLOPS = 4.704840
SMGRelax:
wall clock time = 0.072742 seconds
wall MFLOPS = 4.357098
cpu clock time = 0.080000 seconds
cpu MFLOPS = 3.961800
SMGResidual:
wall clock time = 0.024079 seconds
wall MFLOPS = 5.973338
cpu clock time = 0.040000 seconds
cpu MFLOPS = 3.595800
CyclicReduction:
wall clock time = 0.039484 seconds
wall MFLOPS = 3.662243
cpu clock time = 0.040000 seconds
cpu MFLOPS = 3.615000
SMGIntAdd:
wall clock time = 0.002812 seconds
wall MFLOPS = 5.633001
cpu clock time = 0.000000 seconds
cpu MFLOPS = 0.000000
SMGRestrict:
wall clock time = 0.001255 seconds
wall MFLOPS = 10.097211
cpu clock time = 0.000000 seconds
cpu MFLOPS = 0.000000
=============================================
Solve phase times:
=============================================
SMG Solve:
wall clock time = 0.526451 seconds
wall MFLOPS = 4.864426
cpu clock time = 0.530000 seconds
cpu MFLOPS = 4.831853
SMG:
wall clock time = 0.526433 seconds
wall MFLOPS = 4.864592
cpu clock time = 0.530000 seconds
cpu MFLOPS = 4.831853
SMGRelax:
wall clock time = 0.496075 seconds
wall MFLOPS = 4.651397
cpu clock time = 0.510000 seconds
cpu MFLOPS = 4.524396
SMGResidual:
wall clock time = 0.216037 seconds
wall MFLOPS = 6.290034
cpu clock time = 0.230000 seconds
cpu MFLOPS = 5.908174
CyclicReduction:
wall clock time = 0.235903 seconds
wall MFLOPS = 3.644964
cpu clock time = 0.260000 seconds
cpu MFLOPS = 3.307146
SMGIntAdd:
wall clock time = 0.027868 seconds
wall MFLOPS = 6.407349
cpu clock time = 0.020000 seconds
cpu MFLOPS = 8.928000
SMGRestrict:
wall clock time = 0.015325 seconds
wall MFLOPS = 9.321240
cpu clock time = 0.000000 seconds
cpu MFLOPS = 0.000000
Iterations = 4
Final Relative Residual Norm = 8.972097e-07
The relative residual norm may differ from machine to machine or
compiler to compiler, but only very slightly (say, in the 6th or 7th
decimal place). Also, the code should generate nearly
identical results for a given problem, independent of the data
distribution. The only part of the code that does not guarantee
bitwise identical results is the inner product used to compute norms.
In practice, the above residual norm has remained the same.
%==========================================================================
%==========================================================================
Release and Modification Record