Doc updates (#974)

* Updated documentation for clarity and to clean up a few typos.
* Add warning messages to FEI, ParaSails, PILUT, Euclid.
* Improved and updated GPU information
* Added CMake build information
2023-09-28 18:43:53 -07:00
commit 27b8471742
parent 57862ef6e2
11 changed files with 370 additions and 112 deletions
--- a/.gitignore
+++ b/.gitignore
@ -19,11 +19,13 @@ cmbuild/
 ###############
 # Documentation
 ###############
+src/docs/ref-manual/html/
 src/docs/ref-manual-html/
 src/docs/ref-manual.pdf
 src/docs/ref-manual/latex/
 src/docs/ref-manual/xml/
 src/docs/usr-manual-html/
+src/docs/usr-manual/html/
 src/docs/usr-manual.pdf
 src/docs/usr-manual/_build/

--- a/src/docs/usr-manual/ch-fei.rst
+++ b/src/docs/usr-manual/ch-fei.rst
@ -10,6 +10,11 @@
 Finite Element Interface
 ******************************************************************************

+.. warning::
+   FEI is not actively supported by the hypre development team. For similar
+   functionality, we recommend using :ref:`sec-Block-Structured-Grids-FEM`, which
+   allows the representation of block-structured grid problems via hypre's
+   SStruct interface.

 Introduction
 ==============================================================================
@ -48,7 +53,7 @@ illustrate this in the rest of the section and refer to example 10 (in the
 In hypre, one creates an instance of the FEI as follows:

 .. code-block:: c++
-   
+
   LLNL_FEI_Impl *feiPtr = new LLNL_FEI_Impl(mpiComm);

 Here ``mpiComm`` is an MPI communicator (e.g. ``MPI_COMM_WORLD``).  If
@ -56,7 +61,7 @@ Sandia's FEI package is to be used, one needs to define a hypre solver object
 first:

 .. code-block:: c++
-   
+
   LinearSystemCore   *solver = HYPRE_base_create(mpiComm);
   FEI_Implementation *feiPtr = new FEI_Implementation(solver,mpiComm,rank);

@ -76,19 +81,19 @@ variables, and if we assign ``fieldID`` :math:`7` and :math:`8` to them,
 respectively, then the finite element field information can be set up by

 .. code-block:: c++
-   
+
   nFields   = 2;                 /* number of unknown fields */
   fieldID   = new int[nFields];  /* field identifiers */
   fieldSize = new int[nFields];  /* vector dimension of each field */
-   
+
   /* velocity (a 3D vector) */
   fieldID[0]   = 7;
   fieldSize[0] = 3;
-   
+
   /* pressure (a scalar function) */
   fieldID[1]   = 8;
   fieldSize[1] = 1;
-   
+
   feiPtr -> initFields(nFields, fieldSize, fieldID);

 Once the field information has been established, we are ready to initialize an
@ -99,21 +104,21 @@ element fields (fields that have been defined previously). Suppose we use
 consists of

 .. code-block:: c++
-   
+
   elemBlkID  = 0;     /* identifier for a block of elements */
   nElems     = 1000;  /* number of elements in the block */
   elemNNodes = 8;     /* number of nodes per element */
-   
+
   /* nodal-based field for the velocity */
   nodeNFields     = 1;
   nodeFieldIDs    = new int[nodeNFields];
   nodeFieldIDs[0] = fieldID[0];
-   
+
   /* element-based field for the pressure */
   elemNFields     = 1;
   elemFieldIDs    = new int[elemNFields];
   elemFieldIDs[0] = fieldID[1];
-   
+
   feiPtr -> initElemBlock(elemBlkID, nElems, elemNNodes, nodeNFields,
                           nodeFieldIDs, elemNFields, elemFieldIDs, 0);

@ -132,14 +137,14 @@ are shared with the other processors.  The syntax for setting up the shared
 nodes is

 .. code-block:: c++
-   
+
   feiPtr -> initSharedNodes(nShared, sharedIDs, sharedLengs, sharedProcs);

 This completes the initialization phase, and a completion signal is sent to the
 FEI via

 .. code-block:: c++
-   
+
   feiPtr -> initComplete();

 Next, we begin the *load* phase. The first entity for loading is the nodal
@ -149,7 +154,7 @@ on whether the boundary conditions are Dirichlet, Neumann, or mixed, the three
 values should be passed into the FEI accordingly.

 .. code-block:: c++
-   
+
   feiPtr -> loadNodeBCs(nBCs, BCEqn, fieldID, alpha, beta, gamma);

 The element stiffness matrices are to be loaded in the next step. We need to
@ -162,7 +167,7 @@ equations are arranged (similar to the interleaving scheme mentioned above).
 The calling sequence for loading element stiffness matrices is

 .. code-block:: c++
-   
+
   for (i = 0; i < nElems; i++)
      feiPtr -> sumInElem(elemBlkID, elemID, elemConn[i], elemStiff[i],
                          elemLoads[i], elemFormat);
@ -171,6 +176,5 @@ To complete the assembling of the global stiffness matrix and the corresponding
 right hand side, a signal is sent to the FEI via

 .. code-block:: c++
-   
-   feiPtr -> loadComplete();

+   feiPtr -> loadComplete();
--- a/src/docs/usr-manual/ch-intro.rst
+++ b/src/docs/usr-manual/ch-intro.rst
@ -127,7 +127,7 @@ As previously noted, on most systems hypre can be built by simply typing
 Alternatively, the CMake system [CMakeWeb]_ can be used, and is the best
 approach for building hypre on Windows systems in particular.  For more detailed
 instructions, read the ``INSTALL`` file provided with the hypre distribution or
-refer to the last chapter in this manual.  Note the following requirements:
+the :ref:`ch-General` section of this manual.  Note the following requirements:

 * To run in parallel, hypre requires an installation of MPI.

@ -235,7 +235,7 @@ Writing your code
 As discussed in the previous section, the following decisions should be made
 before writing any code:

-* Choose a conceptual interface. 
+* Choose a conceptual interface.
 * Choose your desired solver strategy.
 * Look up matrix requirements for each solver and preconditioner.
 * Choose a matrix storage class that is compatible with your solvers and
--- a/src/docs/usr-manual/ch-misc.rst
+++ b/src/docs/usr-manual/ch-misc.rst
@ -14,24 +14,21 @@ General Information
 Getting the Source Code
 ==============================================================================

-The hypre distribution tar file is available from the Software link of the hypre
-web page, http://www.llnl.gov/CASC/hypre/.  The hypre Software distribution page
-allows access to the tar files of the latest and previous general and beta
-distributions as well as documentation.
-
+The most recent hypre distribution is available at
+https://github.com/hypre-space/hypre/tags along with previous distribution versions.

 Building the Library
 ==============================================================================

 In this and the following several sections, we discuss the steps to install and
-use hypre on a Unix-like operating system, such as Linux, AIX, and Mac OS X.
-Alternatively, the CMake build system [CMakeWeb]_ can be used, and is the best
-approach for building hypre on Windows systems in particular (see the
-``INSTALL`` file for details).
+use hypre.  First, we focus on the primary method targeting Unix-like operating
+systems, such as Linux, AIX, and Mac OS X.  Then in `CMake instructions`_, we
+explain an alternative approach using the CMake build system [CMakeWeb]_, which
+is the best approach for building hypre on Windows systems in particular.

 After unpacking the hypre tar file, the source code will be in the ``src``
 sub-directory of a directory named hypre-VERSION, where VERSION is the current
-version number (e.g., hypre-1.8.4, with a "b" appended for a beta release).
+version number (e.g., hypre-2.29.0).

 Move to the ``src`` sub-directory to build hypre for the host platform.  The
 simplest method is to configure, compile and install the libraries in
@ -87,7 +84,7 @@ is to display the help package, by executing ``./configure --help``, which also
 includes the usage information.  The user can mix and match the configure
 options and variable settings to meet their needs.

-Some of the commonly used options include:
+Some commonly used options include:

 .. code-block:: none

@ -102,10 +99,12 @@ Some of the commonly used options include:
   --with-openmp                  Use OpenMP. This may affect which compiler is
                                  chosen.
   --enable-bigint                Use long long int for HYPRE_Int (default is NO).
+                                  NOTE: This option is not available for Nvidia
+                                  and AMD GPUs.
   --enable-mixedint              Use long long int for HYPRE_BigInt and int for
                                  HYPRE_Int.
                                  NOTE: This option disables Euclid, ParaSails,
-                                        pilut and CGC coarsening.
+                                        PILUT and CGC coarsening.

 The user can mix and match the configure options and variable settings to meet
 their needs.  It should be noted that hypre can be configured with external BLAS
@ -167,7 +166,7 @@ Hypre can support NVIDIA GPUs with CUDA and OpenMP (:math:`{\ge}` 4.5). The rela

 .. code-block:: none

-  --with-cuda             Use CUDA. Require cuda-8.0 or higher (default is
+  --with-cuda             Use CUDA. Require cuda-9.0 or higher (default is
                          NO).

  --with-device-openmp    Use OpenMP 4.5 Device Directives. This may affect
@ -190,12 +189,11 @@ need to be set properly, which can be also set by
   --with-cuda-home=DIR

 When configured with ``--with-cuda`` or ``--with-device-openmp``, the memory allocated on the GPUs, by default, is the GPU device memory, which is not accessible from the CPUs.
-Hypre's structured solvers can work fine with device memory,
-whereas only selected unstructured solvers can run with device memory. See 
-Chapter :ref:`ch-boomeramg-gpu` for details.
-In general, BoomerAMG and the SStruct
-require  unified (CUDA managed) memory, for which
-the following option should be added
+Hypre's structured solvers can run with device memory,
+whereas only selected unstructured solvers can run with device memory. See
+:ref:`ch-boomeramg-gpu` for details.
+Some solver options for BoomerAMG require unified (CUDA managed) memory.
+To use these options add the following configure option:

 .. code-block:: none

@ -220,11 +218,11 @@ The other NVIDIA GPU related options include:

 * ``--enable-gpu-profiling``  Use NVTX on CUDA, rocTX on HIP (default is NO)
 * ``--enable-cusparse``       Use cuSPARSE for GPU sparse kernels (default is YES)
-* ``--enable-cublas``         Use cuBLAS for GPU dense kernels (default is NO)
+* ``--enable-cublas``         Use cuBLAS for GPU dense kernels (default is YES)
 * ``--enable-curand``         Use random numbers generators on GPUs (default is YES)

 Allocations and deallocations of GPU memory are expensive. Memory pooling is a common approach to reduce such overhead and improve performance.
-hypre provides caching allocators for GPU device memory and unified memory, 
+hypre provides caching allocators for GPU device memory and unified memory,
 enabled by

 .. code-block:: none
@ -250,15 +248,44 @@ For running on AMD GPUs, configure with
  --with-hip              Use HIP for AMD GPUs. (default is NO)
  --with-gpu-arch=ARG     Use appropriate AMD GPU architecture

-Currently, only BoomerAMG is supported with HIP. The other AMD GPU related options include:
+The other AMD GPU related options include:

 * ``--enable-gpu-profiling``  Use NVTX on CUDA, rocTX on HIP (default is NO)
 * ``--enable-rocsparse``      Use rocSPARSE (default is YES)
 * ``--enable-rocblas``        Use rocBLAS (default is NO)
 * ``--enable-rocrand``        Use rocRAND (default is YES)

+All the options supported by CUDA are also supported with HIP. Note that the ``--enable-bigint`` option is **not supported** with CUDA or HIP.
+
+For running on Intel GPUs, configure with
+
+.. code-block:: none
+
+  --with-sycl             Use SYCL for Intel GPUs. (default is NO).
+  --with-sycl-target=ARG  User specifies sycl targets for AOT compilation in
+                          ARG, where ARG is a comma-separated list (enclosed
+                          in quotes), e.g. "spir64_gen".
+  --with-sycl-target-backend=ARG
+                          User specifies additional options for the sycl
+                          target backend for AOT compilation in ARG, where ARG
+                          contains the desired options (enclosed in
+                          double+single quotes), e.g.
+                          --with-sycl-target-backend="'-device
+                          12.1.0,12.4.0'".
+
+Intel oneMKL functionality is also used by default (and required for certain hypre solvers):
+
+.. code-block:: none
+
+  --enable-onemklsparse   Use oneMKL sparse (default is YES).
+  --enable-onemklblas     Use oneMKL blas (default is YES).
+  --enable-onemklrand     Use oneMKL rand (default is YES).
+
+The SYCL backend now supports all GPU-enabled hypre functionality currently supported by CUDA/HIP except for FSAI (work in progress).
+The ``--enable-bigint`` option is supported with SYCL (not supported for CUDA/HIP).
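+
+As a sketch, typical configure invocations for the three GPU backends might look
+as follows. The architecture ``gfx90a`` is an assumed example value for
+``--with-gpu-arch``; substitute the architecture of the target AMD GPU:
+
+.. code-block:: none
+
+  ./configure --with-cuda --enable-unified-memory
+  ./configure --with-hip --with-gpu-arch=gfx90a
+  ./configure --with-sycl --with-sycl-target="spir64_gen"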
+
 Testing the Library
-==============================================================================
+------------------------------------------------------------------------------

 The ``examples`` subdirectory contains several codes that can be used to test
 the newly created hypre library.  To create the executable versions, move into
@ -266,6 +293,181 @@ the ``examples`` subdirectory, enter ``make`` then execute the codes as
 described in the initial comments section of each source code.


+.. _CMake instructions:
+
+CMake-based Build Instructions
+==============================================================================
+
+This section describes hypre's CMake build system, which is particularly useful for building
+the code on Windows machines. CMake-based installation provides a platform-independent
+build system. CMake can generate Unix and Linux Makefiles, as well as Visual Studio and
+(Apple) XCode project files from the same configuration file.  In addition,
+CMake also provides a GUI front end, which allows an interactive build and
+installation process. For more detailed information on using CMake,
+see `CMake's User Interaction Guide <https://cmake.org/cmake/help/latest/guide/user-interaction/index.html>`_.
+
+**Note**: Not all options are currently supported when using CMake. Support for
+all hypre configure options is an ongoing effort.
+
+Here are the basic steps to configure, make, and install hypre using CMake:
+
+#. Ensure that CMake version 3.13.0 or later is installed on the system.
+#. After unpacking the hypre tar file or cloning, move to the ``src`` sub-directory.
+#. To build the library, run CMake on the top-level hypre source directory to
+   generate files appropriate for the native build system.  To prevent writing
+   over the Makefiles in hypre's configure/make system above, only out-of-source
+   builds are allowed with CMake, that is, it is required to use a separate build
+   directory.
+
+   The directory ``src/cmbuild``
+   is provided in the release for convenience, but
+   alternative build directories may be created by the user. To configure with
+   the default options:
+
+   - Unix: From the ``src/cmbuild`` directory, type ``cmake ..``.
+
+   - Windows Visual Studio: Set the source and build directories to ``src`` and ``src/cmbuild``,
+     then click on `Configure` followed by `Generate`.
+
+
+#. To build the library, compile with the native build system:
+
+   - Unix: From the ``src/cmbuild`` directory, type ``make`` or ``make -j 4``
+     (for a faster parallel build with 4 threads).
+
+   - Windows Visual Studio: Open the `hypre` VS solution file generated by CMake
+     and build the `ALL_BUILD` target.
+
+#. To install hypre to the installation directory specified in the configuration:
+
+   - Unix: From the ``src/cmbuild`` directory, type ``make install``.
+
+   - Windows Visual Studio: Open the `hypre` VS solution file generated by CMake
+     and build the `INSTALL` target.
+
+   - *Note*: The default installation location is set to ``src/hypre``.
+     Use the ``HYPRE_INSTALL_PREFIX`` option to change this location if desired.
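+
+As a sketch, the Unix steps above can be run as a single command sequence. It
+assumes the current directory is the unpacked or cloned hypre directory and
+uses only the default options:
+
+.. code-block:: none
+
+  cd src/cmbuild
+  cmake ..
+  make -j 4
+  make install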
+
+Changing Default CMake Configuration Options
+------------------------------------------------------------------------------
+
+Various configuration options can be set from within CMake (see `CMake options`_).
+One option is to specify these options in the command-line CMake invocation,
+e.g., to enable building of the examples:
+
+.. code-block:: none
+
+  cmake -DHYPRE_BUILD_EXAMPLES=ON ..
+
+Another option is to use the CMake GUI (``ccmake`` or ``cmake-gui``) to change the default options
+as appropriate, then reconfigure / generate:
+
+- Unix: From the ``src/cmbuild`` directory, type ``ccmake ..``.
+
+  * Change options to desired settings:
+
+    * To set a variable, move the cursor to the variable and press enter.
+    * If it is a boolean (ON/OFF) it will toggle the value.
+    * If it is a string or file, it will allow editing of the string.
+
+  * Then configure (``c`` key).
+  * Repeat until all values are set as desired and then generate (``g`` key).
+
+- Windows Visual Studio: Change options, then click on `Configure` then `Generate`.
+
+Then rebuild and reinstall with the updated configuration options.
+
+.. _CMake options:
+
+CMake Configure Options
+------------------------------------------------------------------------------
+
+There are many options to allow the user to override and refine
+the defaults for any system.  The best way to find out what options are available
+is to use ``cmake``, ``cmake-gui``, or inspect using Windows Visual Studio.
+
+
+Some commonly used options (default value) include:
+
+.. code-block:: none
+
+ HYPRE_INSTALL_PREFIX (src/hypre) Installation location.
+ HYPRE_BUILD_EXAMPLES (OFF)       Compile test cases for examples of using the library.
+ HYPRE_BUILD_TYPE (Release)       Sets compiler flags to generate information
+                                  needed for debugging.
+ HYPRE_ENABLE_SHARED (OFF)        Build shared libraries.
+ HYPRE_PRINT_ERRORS (OFF)         Print HYPRE errors.
+ HYPRE_WITH_OPENMP (OFF)          Use OpenMP.
+
+ HYPRE_ENABLE_BIGINT (OFF)        Use long long int for HYPRE_Int.
+ HYPRE_ENABLE_MIXEDINT (OFF)      Use long long int for HYPRE_BigInt and int for
+                                  HYPRE_Int.
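+
+For instance, several of these options can be combined in one invocation. The
+following is a sketch only; ``/path/to/install`` is a placeholder, not a
+default:
+
+.. code-block:: none
+
+  cmake -DHYPRE_BUILD_EXAMPLES=ON -DHYPRE_ENABLE_SHARED=ON \
+        -DHYPRE_INSTALL_PREFIX=/path/to/install ..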
+
+GPU CMake Build Options
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Some of the commonly used options for GPU CMake builds of hypre are listed below.
+
+Relevant options for CUDA support on NVIDIA GPUs:
+
+.. code-block:: none
+
+ HYPRE_WITH_CUDA (OFF)            Use CUDA v9.0 or higher.
+ HYPRE_CUDA_SM (70)               Target CUDA architecture.
+
+When configured with CUDA, the memory allocated on the GPUs, by default, is the GPU device memory, which is not accessible from the CPUs.
+Hypre's structured solvers can run with device memory,
+whereas only selected unstructured solvers can run with device memory. See
+:ref:`ch-boomeramg-gpu` for details.
+Some solver options for BoomerAMG require unified (CUDA managed) memory.
+To use these options turn the following option on:
+
+.. code-block:: none
+
+  HYPRE_ENABLE_UNIFIED_MEMORY (OFF)  Use unified memory for allocating the memory.
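+
+As an illustration, a CUDA build with unified memory could be configured from
+the ``src/cmbuild`` directory as follows. The architecture value ``70`` is simply
+the default listed above and should be matched to the target GPU:
+
+.. code-block:: none
+
+  cmake -DHYPRE_WITH_CUDA=ON -DHYPRE_CUDA_SM=70 -DHYPRE_ENABLE_UNIFIED_MEMORY=ON ..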
+
+The other NVIDIA GPU related options include:
+
+.. code-block:: none
+
+ HYPRE_ENABLE_GPU_PROFILING (OFF) Use NVTX.
+ HYPRE_ENABLE_CUSPARSE (ON)       Use cuSPARSE for GPU sparse kernels.
+ HYPRE_ENABLE_CUBLAS (OFF)        Use cuBLAS for GPU dense kernels.
+ HYPRE_ENABLE_CURAND (ON)         Use random number generators on GPUs.
+
+Allocations and deallocations of GPU memory are expensive. Memory pooling is a common approach to reduce such overhead and improve performance.
+hypre provides caching allocators for GPU device memory and unified memory,
+enabled by
+
+.. code-block:: none
+
+ HYPRE_ENABLE_DEVICE_POOL (OFF)   Enable the caching GPU memory allocator in hypre
+
+
+hypre also supports Umpire [Umpire]_. To enable Umpire pool, include the following options:
+
+.. code-block:: none
+
+ HYPRE_WITH_UMPIRE (OFF)          Use Umpire Allocator for device and unified memory.
+ TPL_UMPIRE_LIBRARIES             List of absolute paths to Umpire link libraries.
+ TPL_UMPIRE_INCLUDE_DIRS          List of absolute paths to Umpire include directories.
+
+Relevant options for SYCL support on Intel GPUs:
+
+.. code-block:: none
+
+ HYPRE_WITH_SYCL (OFF)            Enable SYCL support.
+ HYPRE_SYCL_TARGET                Target SYCL architecture, e.g. 'spir64_gen'.
+ HYPRE_SYCL_TARGET_BACKEND        Additional SYCL backend options, e.g. '-device 12.1.0,12.4.0'.
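+
+A corresponding sketch for a SYCL build, reusing the example target listed
+above (the appropriate target depends on the device and compiler):
+
+.. code-block:: none
+
+  cmake -DHYPRE_WITH_SYCL=ON -DHYPRE_SYCL_TARGET=spir64_gen ..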
+
+
+Testing the Library with CMake Build Process
+------------------------------------------------------------------------------
+
+The ``examples`` subdirectory contains several codes that can be used to test
+the newly created hypre library. The CMake option ``HYPRE_BUILD_EXAMPLES`` should
+be enabled to ensure that the executables in the ``examples`` subdirectory are built.
+
 Linking to the Library
 ==============================================================================

@ -365,6 +567,12 @@ system, MPI implementation, compiler, and any error messages produced.
 Using HYPRE in External FEI Implementations
 ==============================================================================

+.. warning::
+   FEI is not actively supported by the hypre development team. For similar
+   functionality, we recommend using :ref:`sec-Block-Structured-Grids-FEM`, which
+   allows the representation of block-structured grid problems via hypre's
+   SStruct interface.
+
 To set up hypre for use in external, e.g. Sandia's, FEI implementations one
 needs to follow the following steps:

--- a/src/docs/usr-manual/ch-references.rst
+++ b/src/docs/usr-manual/ch-references.rst
@ -141,6 +141,10 @@
   Approximate Inverse Preconditionings I. Theory. *SIAM J. Matrix Anal. A.*, 14(1):45--58, 1993.
   `https://doi.org/10.1137/0614004 <https://doi.org/10.1137/0614004>`_.

+.. [LiSY2021] R. Li, B. Sjogreen and U. M. Yang. A new class of AMG interpolation
+   methods based on matrix-matrix multiplications. *SIAM J. Sci. Comput.*,
+   43(5):S540--S564, 2021.
+
 .. [JaFe2015] C. Janna, M. Ferronato, F. Sartoretto and G. Gambolati.
   FSAIPACK: A Software Package for High-Performance Factored Sparse Approximate Inverse
   Preconditioning. *ACM T. Math. Software*, 41(2):1--26, 2015.
--- a/src/docs/usr-manual/solvers-boomeramg.rst
+++ b/src/docs/usr-manual/solvers-boomeramg.rst
@ -10,13 +10,11 @@ BoomerAMG
 BoomerAMG is a parallel implementation of the algebraic multigrid method
 [RuSt1987]_.  It can be used both as a solver or as a preconditioner.  The user
 can choose between various different parallel coarsening techniques,
-interpolation and relaxation schemes.  While the default settings work fairly
-well for two-dimensional diffusion problems, for three-dimensional diffusion
-problems, it is recommended to choose a lower complexity coarsening like HMIS or
-PMIS (coarsening 10 or 8) and combine it with a distance-two interpolation
-(interpolation 6 or 7), that is also truncated to 4 or 5 elements per
-row. Additional reduction in complexity and increased scalability can often be
-achieved using one or two levels of aggressive coarsening.
+interpolation and relaxation schemes. The default settings for CPUs, HMIS 
+(coarsening 8) combined with a distance-two interpolation (6) truncated to 4
+or 5 elements per row, should work fairly well for two- and three-dimensional 
+diffusion problems. Additional reduction in complexity and increased scalability 
+can often be achieved using one or two levels of aggressive coarsening.


 Parameter Options
@ -42,6 +40,7 @@ techniques can be found in [HeYa2002]_, [Yang2005]_.
 Various coarsening techniques are available:

 * the Cleary-Luby-Jones-Plassman (CLJP) coarsening,
+* parallel versions of the classical RS coarsening described in [HeYa2002]_,
 * the Falgout coarsening which is a combination of CLJP and the classical RS
  coarsening algorithm,
 * CGC and CGC-E coarsenings [GrMS2006a]_, [GrMS2006b]_,
@ -51,14 +50,15 @@ Various coarsening techniques are available:
  techniques mentioned above and thus achieving much lower complexities and
  lower memory use [Stue1999]_.

-To use aggressive coarsening the user has to set the number of levels to which
-he wants to apply aggressive coarsening (starting with the finest level) via
+To use aggressive coarsening users have to set the number of levels to which
+they want to apply aggressive coarsening (starting with the finest level) via
 ``HYPRE_BoomerAMGSetAggNumLevels``. Since aggressive coarsening requires long
 range interpolation, multipass interpolation is always used on levels with
 aggressive coarsening, unless the user specifies another long-range
-interpolation suitable for aggressive coarsening.
+interpolation suitable for aggressive coarsening via 
+``HYPRE_BoomerAMGSetAggInterpType``.

-Note that the default coarsening is HMIS [DeYH2004]_.
+Note that the default coarsening is HMIS for CPUs and PMIS for GPUs [DeYH2004]_.


 Interpolation Options
@ -66,18 +66,19 @@ Interpolation Options

 Various interpolation techniques can be set using ``HYPRE_BoomerAMGSetInterpType``:

-* the "classical" interpolation as defined in [RuSt1987]_,
-* direct interpolation [Stue1999]_,
-* standard interpolation [Stue1999]_,
+* the "classical" interpolation (0) as defined in [RuSt1987]_,
+* direct interpolation (3) [Stue1999]_,
+* standard interpolation (8) [Stue1999]_,
 * an extended "classical" interpolation, which is a long range interpolation and
  is recommended to be used with PMIS and HMIS coarsening for harder problems
-  [DFNY2008]_,
-* multipass interpolation [Stue1999]_,
+  (6) [DFNY2008]_,
+* distance-two interpolation based on matrix operations (17) [LiSY2021]_,
+* multipass interpolation (4) [Stue1999]_,
 * two-stage interpolation [Yang2010]_,
 * Jacobi interpolation [Stue1999]_,
-* the "classical" interpolation modified for hyperbolic PDEs.
+* the "classical" interpolation modified for hyperbolic PDEs (2).

-Jacobi interpolation is only use to improve certain interpolation operators and
+Jacobi interpolation is only used to improve certain interpolation operators and
 can be used with ``HYPRE_BoomerAMGSetPostInterpType``.  Since some of the
 interpolation operators might generate large stencils, it is often possible and
 recommended to control complexity and truncate the interpolation operators using
@ -85,7 +86,8 @@ recommended to control complexity and truncate the interpolation operators using
 ``HYPRE_BoomerAMGSetJacobiTruncThreshold`` (for Jacobi interpolation only).

 Note that the default interpolation is extended+i interpolation [DFNY2008]_
-truncated to 4 elements per row.
+truncated to 4 elements per row, for CPUs, and a version of this interpolation
+based on matrix operations for GPUs [LiSY2021]_.


 Non-Galerkin Options
@ -112,11 +114,12 @@ Smoother Options
 A good overview of parallel smoothers and their properties can be found in
 [BFKY2011]_. Various of the described relaxation techniques are available:

-* weighted Jacobi relaxation,
-* a hybrid Gauss-Seidel / Jacobi relaxation scheme,
-* a symmetric hybrid Gauss-Seidel / Jacobi relaxation scheme,
-* l1-Gauss-Seidel or Jacobi,
-* Chebyshev smoothers,
+* weighted Jacobi relaxation (0),
+* a hybrid Gauss-Seidel / Jacobi relaxation scheme (3 4),
+* a symmetric hybrid Gauss-Seidel / Jacobi relaxation scheme (6),
+* l1-Gauss-Seidel or Jacobi (13 14 18 8),
+* Chebyshev smoothers (16),
+* two-stage Gauss-Seidel smoothers (11 12) [BKRHSMTY2021]_,
 * hybrid block and Schwarz smoothers [Yang2004]_,
 * Incomplete LU factorization, see :ref:`ilu-amg-smoother`.
 * Factorized Sparse Approximate Inverse (FSAI), see :ref:`fsai-amg-smoother`.
@ -144,6 +147,19 @@ used. Functions that enable the user to access the systems AMG version are
 ``HYPRE_BoomerAMGSetNumFunctions``, ``HYPRE_BoomerAMGSetDofFunc`` and
 ``HYPRE_BoomerAMGSetNodal``.

+There are two basic approaches to dealing with matrices derived from systems
+of PDEs. The unknown-based approach (which is the default) treats variables
+corresponding to the same unknown or function separately, i.e., when coarsening
+or generating interpolation, connections between variables associated with
+different unknowns are ignored. This can work well for weakly coupled PDEs,
+but will be problematic for strongly coupled PDEs. For such problems, we recommend
+using hypre's multigrid reduction (MGR) solver. The second approach, called
+the nodal approach, considers all unknowns at a physical grid point together
+such that coarsening, interpolation and relaxation occur in a point-wise fashion.
+It is possible and sometimes preferred to combine nodal coarsening with unknown-based
+interpolation. For this case, ``HYPRE_BoomerAMGSetNodal`` should be set to a value > 1.
+For details see the reference manual.
+
 If the user can provide the near null-space vectors, such as the rigid body
 modes for linear elasticity problems, an interpolation is available that will
 incorporate these vectors with ``HYPRE_BoomerAMGSetInterpVectors`` and
@ -178,8 +194,25 @@ The currently available  GPU-supported BoomerAMG options include:
 * Interpolation:  direct (3), BAMG-direct (15), extended (14), extended+i (6) and extended+e (18)
 * Aggressive coarsening
 * Second-stage interpolation with aggressive coarsening: extended (5) and extended+e (7)
-* Smoother: Jacobi (7), l1-Jacobi (18), hybrid Gauss Seidel/SRROR (3 4 6), two-stage Gauss-Seidel (11,12) [BKRHSMTY2021]_
-* Relaxation order: must be 0, i.e., lexicographic order
+* Smoother: Jacobi (7), l1-Jacobi (18), hybrid Gauss-Seidel/SSOR (3 4 6), two-stage Gauss-Seidel (11 12) [BKRHSMTY2021]_, and Chebyshev (16)
+* Relaxation order: 0 (lexicographic order), or C/F for smoothers (7) and (18)
+
+Memory locations and execution policies
+------------------------------------------------------------------------------
+Hypre provides two user-level memory locations, ``HYPRE_MEMORY_HOST`` and ``HYPRE_MEMORY_DEVICE``, where
+``HYPRE_MEMORY_HOST`` is always the CPU memory while ``HYPRE_MEMORY_DEVICE`` can be mapped to different memory spaces 
+based on the configure options of hypre.
+When built with ``--with-cuda``, ``--with-hip``, ``--with-sycl``, or ``--with-device-openmp``,
+``HYPRE_MEMORY_DEVICE`` is the GPU device memory,
+and when built additionally with ``--enable-unified-memory``, it is the GPU unified memory (UM).
+For a non-GPU build, ``HYPRE_MEMORY_DEVICE`` is also mapped to the CPU memory.
+The default memory location of hypre's matrix and vector objects is ``HYPRE_MEMORY_DEVICE``,
+which can be changed at runtime by ``HYPRE_SetMemoryLocation(...)``.
+
+The execution policy determines where computations run (host or device), based on the memory locations of the participating objects.
+The default policy is ``HYPRE_EXEC_HOST``, i.e., executing on the host **if the objects are accessible from the host**.
+It can be adjusted by ``HYPRE_SetExecutionPolicy(...)``.
+Clearly, this policy only affects objects in UM, since UM is accessible from **both CPUs and GPUs**.

 A sample code of setting up IJ matrix :math:`A` and solve :math:`Ax=b` using AMG-preconditioned CG
 on GPUs is shown below.
@ -249,8 +282,6 @@ For best performance, it might be necessary to set certain parameters, which
 will affect both coarsening and interpolation.  One important parameter is the
 strong threshold, which can be set using the function
 ``HYPRE_BoomerAMGSetStrongThreshold``.  The default value is 0.25, which appears
-to be a good choice for 2-dimensional problems and the low complexity coarsening
-algorithms.  For 3-dimensional problems a better choice appears to be 0.5, when
-using the default coarsening algorithm. However, the choice of the strength
-threshold is problem dependent and therefore there could be better choices than
-the two suggested ones.
+to be a good choice for diffusion problems.  The choice of the strong
+threshold is problem dependent. For example, elasticity problems often require a larger
+strong threshold.
--- a/src/docs/usr-manual/solvers-fei.rst
+++ b/src/docs/usr-manual/solvers-fei.rst
@ -9,6 +9,12 @@
 FEI Solvers
 ==============================================================================

+.. warning::
+   FEI is not actively supported by the hypre development team. For similar
+   functionality, we recommend using :ref:`sec-Block-Structured-Grids-FEM`, which
+   allows the representation of block-structured grid problems via hypre's
+   SStruct interface.
+
 After the FEI has been used to assemble the global linear system (as described
 in Chapter :ref:`ch-FEI`), a number of hypre solvers can be called to perform
 the solution.  This is straightforward, if hypre's FEI has been used.  If an
@ -20,24 +26,24 @@ the available options can be found in the FEI section of the reference manual.
 They are passed to the FEI as in the following example:

 .. code-block:: c++
-   
+
   nParams = 5;
   paramStrings = new char*[nParams];
   for (i = 0; i < nParams; i++)
      paramStrings[i] = new char[100];
-   
+
   strcpy(paramStrings[0], "solver cg");
   strcpy(paramStrings[1], "preconditioner diag");
   strcpy(paramStrings[2], "maxiterations 100");
   strcpy(paramStrings[3], "tolerance 1.0e-6");
   strcpy(paramStrings[4], "outputLevel 1");
-   
+
   feiPtr -> parameters(nParams, paramStrings);

 To solve the linear system of equations, we call

 .. code-block:: c++
-   
+
   feiPtr -> solve(&status);

 where the returned value ``status`` indicates whether the solve was successful.
@ -45,7 +51,7 @@ where the returned value ``status`` indicates whether the solve was successful.
 Finally, the solution can be retrieved by the following function call:

 .. code-block:: c++
-   
+
   feiPtr -> getBlockNodeSolution(elemBlkID, nNodes, nodeIDList,
                                  solnOffsets, solnValues);

@ -53,7 +59,7 @@ where ``nodeIDList`` is a list of nodes in element block ``elemBlkID``, and
 ``solnOffsets[i]`` is the index pointing to the first location where the
 variables at node :math:`i` are returned in ``solnValues``.

-Solvers Available Only through the FEI 
+Solvers Available Only through the FEI
 ------------------------------------------------------------------------------

 While most of the solvers from the previous sections are available through the
@ -71,9 +77,9 @@ following we list some of these internal solvers.

 #. Additional Krylov solvers (FGMRES, TFQMR, symmetric QMR),
 #. SuperLU direct solver (sequential),
-#. SuperLU direct solver with iterative refinement (sequential), 
+#. SuperLU direct solver with iterative refinement (sequential).

-Parallel Preconditioners 
+Parallel Preconditioners
 ^^^^^^^^^^^^^^^^^^^^^^^^

 The performance of the Krylov solvers can be improved by clever selection of
@ -82,7 +88,7 @@ following preconditioners are available via the ``LinearSystemCore`` interface:

 #. the modified version of MLI, which requires the finite element substructure
   matrices to construct the prolongation operators,
-#. parallel domain decomposition with inexact local solves (``DDIlut``), 
+#. parallel domain decomposition with inexact local solves (``DDIlut``),
 #. least-squares polynomial preconditioner,
 #. :math:`2 \times 2` block preconditioner, and
 #. :math:`2 \times 2` Uzawa preconditioner.
@ -107,25 +113,25 @@ The incoming linear system of equations is assumed to be in the form:

 .. math::

-   \left[ 
-   \begin{array}{cc} 
+   \left[
+   \begin{array}{cc}
      D   & B \\
      B^T & 0
   \end{array}
-     \right] 
+     \right]
     \left[
-   \begin{array}{c} 
+   \begin{array}{c}
      x_1 \\
      x_2
   \end{array}
-     \right] 
+     \right]
     =
     \left[
-   \begin{array}{c} 
+   \begin{array}{c}
      b_1 \\
      b_2
   \end{array}
-     \right] 
+     \right]

 where :math:`D` is a diagonal matrix.  After Schur complement reduction is
 applied, the resulting linear system becomes
@ -145,20 +151,20 @@ re-order the system into a :math:`3 \times 3` block matrix.

 .. math::

-   \left[ 
-   \begin{array}{ccc} 
+   \left[
+   \begin{array}{ccc}
      A_{11}  & A_{12} & N \\
      A_{21}  & A_{22} & D \\
      N^{T}   & D      & 0 \\
   \end{array}
-   \right] 
+   \right]
   =
-   \left[ 
-   \begin{array}{ccc} 
+   \left[
+   \begin{array}{ccc}
      A_{11}       & \hat{A}_{12} \\
      \hat{A}_{21} & \hat{A}_{22}.
   \end{array}
-   \right] 
+   \right]

 The reduced system has the form:

--- a/src/docs/usr-manual/solvers-fsai.rst
+++ b/src/docs/usr-manual/solvers-fsai.rst
@ -3,6 +3,7 @@

   SPDX-License-Identifier: (Apache-2.0 OR MIT)

+.. _fsai:

 FSAI
 ==============================================================================
--- a/src/docs/usr-manual/solvers-parasails.rst
+++ b/src/docs/usr-manual/solvers-parasails.rst
@ -7,6 +7,11 @@
 ParaSails
 ==============================================================================

+.. warning::
+   ParaSails is not actively supported by the hypre development team. We recommend using
+   :ref:`fsai` for parallel sparse approximate inverse algorithms. This new implementation
+   includes NVIDIA/AMD GPU support through the CUDA/HIP backends.
+
 ParaSails is a parallel implementation of a sparse approximate inverse
 preconditioner, using *a priori* sparsity patterns and least-squares (Frobenius
 norm) minimization.  Symmetric positive definite (SPD) problems are handled
@ -53,7 +58,7 @@ in order to construct the preconditioner.
 The ParaSails Create function differs from the synopsis in the following way:

 .. code-block:: c
-   
+
   int HYPRE_ParaSailsCreate(MPI_Comm comm, HYPRE_Solver *solver, int symmetry);

 where ``comm`` is the MPI communicator.
@ -75,7 +80,7 @@ For more information about the final case, see section :ref:`nearly`.
 Parameters for setting up the preconditioner are specified using

 .. code-block:: c
-   
+
   int HYPRE_ParaSailsSetParams(HYPRE_Solver solver, double thresh,
                                int nlevel, double filter);

@ -119,4 +124,3 @@ latter may be guaranteed by 1) constructing the sparsity pattern with a
 symmetric matrix, or 2) if the matrix is structurally symmetric (has symmetric
 pattern), then thresholding to construct the pattern is not used (i.e., zero
 value of the ``thresh`` parameter is used).
-
--- a/src/parcsr_ls/HYPRE_parcsr_ls.h
+++ b/src/parcsr_ls/HYPRE_parcsr_ls.h
@ -1280,7 +1280,7 @@ HYPRE_Int HYPRE_BoomerAMGSetPrintLevel(HYPRE_Solver solver,

 /**
 * (Optional) Requests additional computations for diagnostic and similar
- * data to be logged by the user. Default to 0 for do nothing.  The latest
+ * data to be logged by the user. Default to 0 to do nothing.  The latest
 * residual will be available if logging > 1.
 **/
 HYPRE_Int HYPRE_BoomerAMGSetLogging(HYPRE_Solver solver,
@ -4059,7 +4059,7 @@ HYPRE_MGRSetReservedCpointsLevelToKeep( HYPRE_Solver solver, HYPRE_Int level);
 * Currently supports the following flavors of relaxation types
 * as described in the \e BoomerAMGSetRelaxType:
 * \e relax_type 0, 3 - 8, 13, 14, 18. Also supports AMG (options 1 and 2)
- *    and direct solver variants (9, 99, 199). See HYPRE_MGRSetLevelFRelaxType for details.
+ *    and direct solver variants (9, 99, 199). See \e HYPRE_MGRSetLevelFRelaxType for details.
 **/
 HYPRE_Int
 HYPRE_MGRSetRelaxType(HYPRE_Solver solver,
@ -4072,7 +4072,7 @@ HYPRE_MGRSetRelaxType(HYPRE_Solver solver,
 *    - 0 : Single-level relaxation sweeps for F-relaxation as prescribed by \e MGRSetRelaxType
 *    - 1 : Multi-level relaxation strategy for F-relaxation (V(1,0) cycle currently supported).
 *
- *    NOTE: This function will be removed in favor of /e HYPRE_MGRSetLevelFRelaxType!!
+ *    NOTE: This function will be removed in favor of \e HYPRE_MGRSetLevelFRelaxType!!
 **/
 HYPRE_Int
 HYPRE_MGRSetFRelaxMethod(HYPRE_Solver solver,
@ -4148,7 +4148,7 @@ HYPRE_MGRSetRestrictType( HYPRE_Solver solver,
                          HYPRE_Int restrict_type);

 /**
- * (Optional) This function is an extension of HYPRE_MGRSetRestrictType. It allows setting
+ * (Optional) This function is an extension of \e HYPRE_MGRSetRestrictType. It allows setting
 * the restriction operator strategy for each MGR level.
 **/
 HYPRE_Int
@ -4182,7 +4182,7 @@ HYPRE_MGRSetInterpType( HYPRE_Solver solver,
                        HYPRE_Int interp_type );

 /**
- * (Optional) This function is an extension of HYPRE_MGRSetInterpType. It allows setting
+ * (Optional) This function is an extension of \e HYPRE_MGRSetInterpType. It allows setting
 * the prolongation (interpolation) operator strategy for each MGR level.
 **/
 HYPRE_Int
@ -4198,7 +4198,7 @@ HYPRE_MGRSetNumRelaxSweeps( HYPRE_Solver solver,
                            HYPRE_Int nsweeps );

 /**
- * (Optional) This function is an extension of HYPRE_MGRSetNumRelaxSweeps. It allows setting
+ * (Optional) This function is an extension of \e HYPRE_MGRSetNumRelaxSweeps. It allows setting
 * the number of single-level relaxation sweeps for each MGR level.
 **/
 HYPRE_Int
@ -4287,10 +4287,8 @@ HYPRE_MGRSetCoarseGridPrintLevel( HYPRE_Solver solver,
                                  HYPRE_Int print_level );

 /**
- * (Optional) Set the threshold to compress the coarse grid at each level
- * Use threshold = 0.0 if no truncation is applied. Otherwise, set the threshold
- * value for dropping entries for the coarse grid.
- * The default is 0.0.
+ * (Optional) Set the threshold for dropping small entries on the coarse grid at each level.
+ * No dropping is applied if \e threshold = 0.0 (default). 
 **/
 HYPRE_Int
 HYPRE_MGRSetTruncateCoarseGridThreshold( HYPRE_Solver solver,
@ -4299,7 +4297,7 @@ HYPRE_MGRSetTruncateCoarseGridThreshold( HYPRE_Solver solver,
 /**
 * (Optional) Requests logging of solver diagnostics.
 * Requests additional computations for diagnostic and similar
- * data to be logged by the user. Default to 0 for do nothing.  The latest
+ * data to be logged by the user. Default is 0 (do nothing).  The latest
 * residual will be available if logging > 1.
 **/
 HYPRE_Int
@ -4338,8 +4336,8 @@ HYPRE_Int
 HYPRE_MGRSetLevelSmoothIters( HYPRE_Solver solver,
                              HYPRE_Int *smooth_iters );
 /**
- * (Optional) Set the smoothing order for global smoothing at each level.
- * Options for \e level_smooth_order are:
+ * (Optional) Set the cycle for global smoothing.
+ * Options for \e global_smooth_cycle are:
 *    - 1 : Pre-smoothing - Down cycle (default)
 *    - 2 : Post-smoothing - Up cycle
 **/
--- a/src/parcsr_ls/HYPRE_parcsr_mgr.c
+++ b/src/parcsr_ls/HYPRE_parcsr_mgr.c
@ -533,9 +533,9 @@ HYPRE_MGRSetLevelSmoothIters( HYPRE_Solver solver,
 * HYPRE_MGRSetGlobalsmoothType
 *--------------------------------------------------------------------------*/
 HYPRE_Int
-HYPRE_MGRSetGlobalSmoothType( HYPRE_Solver solver, HYPRE_Int iter_type )
+HYPRE_MGRSetGlobalSmoothType( HYPRE_Solver solver, HYPRE_Int smooth_type )
 {
-   return hypre_MGRSetGlobalSmoothType(solver, iter_type);
+   return hypre_MGRSetGlobalSmoothType(solver, smooth_type);
 }
 /*--------------------------------------------------------------------------
 * HYPRE_MGRSetLevelsmoothType