# High Performance Conjugate Gradient benchmark for Intel GPUs

This is a version of the High Performance Conjugate Gradient (HPCG) benchmark based on the
[HPCG benchmark](http://www.hpcg-benchmark.org/) application. It is written in
C++/SYCL and is optimized for Intel's latest discrete GPUs, in particular the
Intel(R) Data Center GPU Max Series (previously codenamed Ponte Vecchio or PVC).

This implementation is compliant with version 3.1 of the HPCG benchmark.

## Package Contents

This package contains Intel(R) Optimized High Performance Conjugate Gradient
Benchmark (Intel(R) Optimized HPCG) for Linux* OS with Intel GPUs. The package is
based on the reference implementation of the HPCG benchmark v3.1 with the
following additions:

    setup/Make.*
      - Setup files for building a benchmark optimized for one or more Intel(R)
      Data Center GPU Max Series GPUs.  The build steps are described below
    example_*_runscript.sh
      - example build and runscripts for the various setup/Make.* configurations

Reference HPCG code in the ./src directory was modified in order to bundle Intel(R)
architecture optimizations with the implementation.

```
===============================================================================
  Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
  and/or other countries.

* Other names and brands may be claimed as the property of others.
===============================================================================
```

See `third-party-programs.txt` for the specific copyright license associated with this
benchmark and software.

## Requirements

- GNU Make
- Intel(R) Data Center GPU Max Series GPU(s) with latest recommended drivers installed
- "Intel(R) GPU"-aware MPI library
- Intel(R) oneAPI DPC++ Compiler (an implementation of C++/SYCL language)
- Intel(R) oneAPI DPC++ Library (oneDPL)
- Intel(R) oneAPI Math Kernel Library (oneMKL) (only actually necessary for testing or nonstandard code paths)

See `minimal_dependencies.txt` for versioning details.

## Quickstart: Build HPCG for Intel GPUs

Example build and runscripts are provided in `example_impi_runscript.sh`, `example_mpich_runscript.sh`
and `example_nompi_runscript.sh`; modify them to fit your local environment.

Follow steps along these lines to set up, build, and run the HPCG benchmark for Intel GPUs:

```
# Step 1: download and enter hpcg_gpu benchmark directory
cd /path/to/hpcg_gpu/benchmark

# Step 2: copy the example runscript that matches your configuration and software stack
cp example_impi_runscript.sh my-run.sh
#cp example_mpich_runscript.sh my-run.sh
#cp example_nompi_runscript.sh my-run.sh

# Step 2.1: modify my-run.sh to match your local environment and turn the build step
# in the script on or off as desired. If building, build-time settings can be changed in
# `setup/Make.<config>`, where <config> is one of IMPI_PVC, MPICH_PVC or NOMPI_PVC.

# Step 2.2: modify runtime parameters to configure benchmark settings.

# Step 3: (build and) run benchmark
./my-run.sh
```

## HPCG Benchmark Overview ##

HPCG is a software package that performs a fixed number of multigrid preconditioned
(using a symmetric Gauss-Seidel smoother) conjugate gradient (PCG) iterations using double
precision (64 bit) floating point values.

The HPCG rating is a weighted GFLOP/s (billion floating-point operations per second) value:
the number of operations performed in the PCG iteration phase divided by the time taken.
The overhead time of problem construction and of any modifications made to improve
performance is divided by 500 iterations (the amortization weight) and added to the runtime.
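
The arithmetic described above can be sketched as follows. This is a simplified illustration, not the benchmark's reporting code: the function name and the sample numbers are hypothetical, and only the amortization weight of 500 comes from the text.

```python
AMORTIZATION_WEIGHT = 500  # overhead time is divided by this value (see text above)


def hpcg_rating_gflops(total_flops, iteration_seconds, overhead_seconds):
    """Weighted GFLOP/s: PCG-phase operations over runtime plus amortized overhead."""
    effective_seconds = iteration_seconds + overhead_seconds / AMORTIZATION_WEIGHT
    return total_flops / effective_seconds / 1e9


# Hypothetical numbers, chosen only to keep the arithmetic easy to follow:
# 5000 s of overhead adds 5000 / 500 = 10 s to the 1790 s spent iterating.
rating = hpcg_rating_gflops(
    total_flops=3.6e14, iteration_seconds=1790.0, overhead_seconds=5000.0
)
print(f"{rating:.1f} GFLOP/s")  # prints "200.0 GFLOP/s"
```

Because the overhead is amortized, even a long setup phase changes the rating only slightly: in this example it lengthens the effective runtime by well under one percent.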

Integer arrays have global and local scope: global indices are unique across the
entire distributed memory system, while local indices are unique within a memory
image.  Integer data for global/local indices has three modes:

* 32/32 - global and local integers are 32-bit
* 64/32 - global integers are 64-bit, local are 32-bit
* 64/64 - global and local are 64-bit.

These modes are needed to address sufficiently large problems when the range of
indexing goes above 2^31 (roughly 2.147 billion), and to reduce storage costs
when the range of indexing is below 2^31.
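
As an illustration of how the three modes relate to the 2^31 limit, the sketch below picks the narrowest mode for a given pair of index ranges. The function and its arguments are hypothetical and only mirror the rule stated above; in the benchmark itself the integer widths are a build-time choice.

```python
INT32_RANGE = 2**31  # a signed 32-bit integer can index at most this many entries


def index_mode(global_count, local_count):
    """Return the narrowest global/local integer mode that can hold both index ranges."""
    if local_count > INT32_RANGE:
        return "64/64"
    if global_count > INT32_RANGE:
        return "64/32"
    return "32/32"


local_rows = 320**3  # rows owned by one rank for a 320x320x320 local problem
print(index_mode(64 * local_rows, local_rows))   # prints "32/32"
print(index_mode(128 * local_rows, local_rows))  # prints "64/32"
```

With 64 ranks the global range is 2,097,152,000, which is still below 2^31; doubling the rank count pushes the global range past the limit while the local range is unchanged.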

The HPCG software package requires an implementation of the Message Passing Interface (MPI)
on your system if the MPI build of HPCG is enabled, and a compiler that supports both OpenMP
syntax (for the reference code) and the SYCL language, including eSIMD extensions.  An
implementation of MPI that supports device-to-device communication is recommended.

## HPCG Valid Runs ##

HPCG can be run in just a few minutes from start to finish.  However, official
runs must be at least 1800 seconds (30 minutes) as reported in the output file.

A valid run must also execute a problem size that is large enough so that data
arrays accessed in the CG iteration loop do not fit in the cache of the device
in a way that would be unrealistic in a real application setting.  Presently this
restriction means that the problem size should be large enough to occupy a
significant fraction of *main device memory*, at least 1/4 of the total.
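
The two requirements above (reported runtime and memory footprint) can be expressed as a small pre-submission check. This is only a sketch under stated assumptions: the function name is hypothetical, the footprint and device memory values must be supplied by the user, and the sample numbers below are made up for illustration.

```python
MIN_OFFICIAL_SECONDS = 1800  # official runs must report at least 30 minutes
MIN_MEMORY_FRACTION = 0.25   # problem should occupy at least 1/4 of main device memory


def official_run_problems(reported_seconds, footprint_gib, device_memory_gib):
    """Return a list of reasons why a run would not qualify as official (empty if none)."""
    problems = []
    if reported_seconds < MIN_OFFICIAL_SECONDS:
        problems.append(
            f"runtime {reported_seconds} s is below {MIN_OFFICIAL_SECONDS} s"
        )
    required_gib = MIN_MEMORY_FRACTION * device_memory_gib
    if footprint_gib < required_gib:
        problems.append(
            f"footprint {footprint_gib} GiB is below the required {required_gib} GiB"
        )
    return problems


# Hypothetical values: a 60 s test run that uses 8 GiB of a 40 GiB device.
for problem in official_run_problems(60, 8.0, 40.0):
    print(problem)
```

An empty list means that neither of the two checks failed; it does not replace the validation performed by the benchmark itself.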

Future memory system architectures may require restatement of the specific memory
size requirements.  But the guiding principle will always be that the problem
size should reflect what would be reasonable for a real sparse iterative solver.

### PVC Recommended Local Problem Settings ###

On Intel(R) Data Center GPU Max Series (previously codenamed "PVC") systems, we
use explicit scaling with 1 MPI rank per GPU tile.  To comply with the spirit of
the memory requirement, we recommend local problem sizes similar to `320x320x320`
or `512x512x256`: these fit into `int32_t` representable CSR matrix format sizes,
yet are not so small that the vector fits into last level cache.  For convenience,
the following table provides estimated sizes for several local problem sizes.  The
counts and sizes are per rank and are approximations that typically err on the
upper-bound side, although the footprint could temporarily exceed these estimates
if the runtime delays some deallocation.


| nx  | ny  | nz  | local nrows | local rhs vector size | approximate local nnz | estimated peak memory footprint  | estimated main run memory footprint | OK for GPU?       |
| --- | --- | --- | ----------- | --------------------- | --------------------- | -------------------------------- | ----------------------------------- | ----------------- |
| 192 | 192 | 192 |   7,077,888 |   54.0 MiB            | 191.11 million        |    5.98 GiB                      |   3.1 GiB                           | too small for PVC 1550/1100|
| 256 | 256 | 256 |  16,777,216 |  128.0 MiB            | 452.99 million        |   14.14 GiB                      |   7.3 GiB                           | too small for PVC 1550/1100|
| 320 | 320 | 320 |  32,768,000 |  250.0 MiB            | 884.74 million        |   27.60 GiB                      |  14.8 GiB                           | OK for PVC 1550/1100       |
| 512 | 512 | 128 |  33,554,432 |  256.0 MiB            | 905.97 million        |   28.27 GiB                      |  15.2 GiB                           | OK for PVC 1550/1100       |
| 512 | 256 | 256 |  33,554,432 |  256.0 MiB            | 905.97 million        |   28.25 GiB                      |  15.2 GiB                           | OK for PVC 1550/1100       |
| 512 | 512 | 256 |  67,108,864 |  512.0 MiB            | 1.812  billion        |   56.49 GiB                      |  30.4 GiB                           | OK for PVC 1550            |
| 512 | 512 | 512 | 134,217,728 | 1024.0 MiB            | 3.624  billion        |  112.90 GiB                      |  60.6 GiB                           | Exceeds `int32_t` representable sizes |
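
The first columns of the table can be reproduced with a few lines of arithmetic. The sketch below assumes double precision vectors (8 bytes per entry) and the 27-point stencil used by HPCG, so 27 nonzeros per row is an upper bound on the local nnz. The helper name is hypothetical, and the memory footprint columns are not modeled here.

```python
INT32_MAX = 2**31 - 1
STENCIL_POINTS = 27    # assumption: at most 27 nonzeros per row (27-point stencil)
BYTES_PER_DOUBLE = 8


def local_problem_estimate(nx, ny, nz):
    """Estimate per-rank row count, vector size and an upper bound on nnz."""
    nrows = nx * ny * nz
    nnz_upper_bound = nrows * STENCIL_POINTS
    return {
        "nrows": nrows,
        "vector_mib": nrows * BYTES_PER_DOUBLE / 2**20,
        "nnz_upper_bound": nnz_upper_bound,
        "nnz_fits_int32": nnz_upper_bound <= INT32_MAX,
    }


for dims in [(320, 320, 320), (512, 512, 256), (512, 512, 512)]:
    est = local_problem_estimate(*dims)
    print(dims, est["nrows"], f'{est["vector_mib"]:.1f} MiB',
          est["nnz_upper_bound"], est["nnz_fits_int32"])
```

For `512x512x512` the upper bound on nnz is about 3.624 billion, which is why that row is marked as exceeding `int32_t` representable sizes.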


## HPCG Input Arguments ##

The following options can be passed into the executable as arguments to specify particular behaviour.

### Overview of possible input parameters and their meaning: ###
The HPCG benchmark constructs a 3D global rectangular domain which consists of 1
or more local 3D rectangular subdomains of size `nx`x`ny`x`nz`, each of which is
owned by an MPI rank.

The most common way to specify the HPCG benchmark is by setting the local problem
size (`nx`, `ny`, `nz`) and the number of MPI ranks.  The number of MPI ranks is then
factored into three terms, `num_mpi_ranks = npx * npy * npz`, giving the number of
ranks/processors in the x (`npx`), y (`npy`) and z (`npz`) directions.  The goal is to make
the resulting decomposition as close to a cube as possible.  Alternatively, the user can
specify `npx`, `npy` and `npz` directly.  In this case, the product `npx * npy * npz` must
be less than or equal to the number of MPI ranks spawned.
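
To make the three-factor breakdown concrete, the sketch below searches for the factorization of the rank count whose largest and smallest factors are closest together. This brute-force search is an illustration only and is not the routine the benchmark uses; the function names and the assignment of the sorted factors to x, y and z are assumptions.

```python
def near_cubic_factors(num_ranks):
    """Return (npx, npy, npz), sorted ascending, with product num_ranks and minimal spread."""
    best = None
    for a in range(1, num_ranks + 1):
        if num_ranks % a:
            continue
        rest = num_ranks // a
        for b in range(a, rest + 1):
            if rest % b:
                continue
            c = rest // b
            if c < b:
                continue  # keep a <= b <= c so each triple is visited once
            if best is None or (c - a) < (best[2] - best[0]):
                best = (a, b, c)
    return best


def global_domain(local_dims, rank_grid):
    """Global grid size obtained by tiling the local subdomain over the rank grid."""
    return tuple(n * p for n, p in zip(local_dims, rank_grid))


grid = near_cubic_factors(12)
print(grid)                                  # prints "(2, 2, 3)"
print(global_domain((320, 320, 320), grid))  # prints "(640, 640, 960)"
```

A prime rank count such as 7 can only be factored as `1 * 1 * 7`, which gives a long, thin decomposition; rank counts with balanced factors (8, 27, 64, ...) give a cubic one.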

HPCG version 3.1 adds the possibility of creating a "global almost rectangular domain" in which
there can be two rectangular zones, each with a different number of ranks in the z dimension.  The
`pz` parameter (default `pz=npz`) sets the partition point where the second zone of nz values
starts.  Then `zl` can be set as the number of processors in the z direction for the first zone
(`< pz`) and `zu` as the number of processors in the z direction for the second zone (`>= pz`).

### Local Problem Size parameters: ###

Specify a cubic local domain:
```
-n=320 # sets nx=ny=nz=value
```
or specify nx/ny/nz directly for more control:
```
--nx=512 # sets nx directly
--ny=512 # sets ny directly
--nz=256 # sets nz directly 
```

### Run time parameters: ###
```
-t=$(run_time_in_seconds) # >= 1800 for official runs submittable to TOP500 list
                          # = 60 or 100 is usually sufficient for stable results in testing
```

### Mapping MPI Ranks to GPUs within a Node parameters: ###
Each CPU node may have one or more GPU cards (each with 1 or more tiles) attached on which to assign
work.  The recommendation for PVC is to assign 1 MPI rank per tile, so the MPI
`processors per node (ppn)` value should not exceed the `tiles_per_node` count.  If fewer than
`tiles_per_node` ranks are used, for instance in a weak scaling analysis, there are two options for
how those ranks are distributed to the GPU tiles within the node.
```
--affinity-per-node=compact # compact     : Default, assign ranks close to each other, filling
                            #               up cards as more are assigned.
                            # round_robin : Assign ranks with 1 rank per card until all cards have
                            #               work, then wrap around and fill in a second tile on each
                            #               card until full.
```
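
The difference between the two policies can be illustrated with a small mapping function. This is a sketch of the behaviour described in the comments above, not the benchmark's implementation; the function name, the argument names, and the assumption that every card has the same number of tiles are hypothetical.

```python
def map_rank_to_tile(local_rank, num_cards, tiles_per_card, policy="compact"):
    """Return the (card, tile) pair assigned to a node-local MPI rank."""
    if not 0 <= local_rank < num_cards * tiles_per_card:
        raise ValueError("more ranks than tiles on this node")
    if policy == "compact":
        # Fill every tile of a card before moving on to the next card.
        return divmod(local_rank, tiles_per_card)
    if policy == "round_robin":
        # Put one rank on each card first, then wrap around to the next tile.
        tile, card = divmod(local_rank, num_cards)
        return card, tile
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical node with 4 cards of 2 tiles each, running only 4 ranks.
for policy in ("compact", "round_robin"):
    print(policy, [map_rank_to_tile(r, 4, 2, policy) for r in range(4)])
```

With 4 ranks on this hypothetical node, `compact` uses both tiles of the first two cards, while `round_robin` uses the first tile of all four cards.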

### Testing Parameters: ###
```
--run-real-ref=1 # 0 : runs 50 iters per CG set (no comparison of convergence to reference CG);
                 #     fast run, good for tracking
                 # 1 : Default, official runs must use this.  May require >= 50 iters to match reference
                 #     CG convergence after 50 iters, but will only count as 50.
                 # 2 : same as 1, but for certain sizes and small # of ranks, offers a shorter path to
                 #     measuring reference code convergence after 50 iters, not officially usable
```

### Less commonly used parameters: ###
```
--npx=4 # explicitly specify num MPI ranks in x direction of global domain
--npy=4 # explicitly specify num MPI ranks in y direction of global domain
--npz=4 # explicitly specify num MPI ranks in z direction of global domain

--pz=2 # explicitly partition the global z processor dimension into 2 zones, default is npz
--zl=4 # explicitly set nz for processors in z dimension with value < pz
--zu=3 # explicitly set nz for processors in z dimension with value >= pz

--yaml=my_custom_report # custom file name where results will be reported (.txt is appended internally)
                        # default is <nx>-<#mpi_ranks>-<#OMP threads>_<hpcg_version>_<date>
```
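
When scripting several runs, the options described in this section can be assembled programmatically. The helper below is a hypothetical convenience sketch: it only builds the argument list from flags documented above and checks the documented value sets. The executable name and the MPI launcher are not covered here and depend on your build and runscript.

```python
VALID_AFFINITY = ("compact", "round_robin")
VALID_RUN_REAL_REF = (0, 1, 2)


def build_hpcg_args(nx, ny, nz, run_seconds, affinity="compact", run_real_ref=1):
    """Build the list of command line arguments for one benchmark run."""
    if affinity not in VALID_AFFINITY:
        raise ValueError(f"affinity must be one of {VALID_AFFINITY}")
    if run_real_ref not in VALID_RUN_REAL_REF:
        raise ValueError(f"run_real_ref must be one of {VALID_RUN_REAL_REF}")
    return [
        f"--nx={nx}",
        f"--ny={ny}",
        f"--nz={nz}",
        f"-t={run_seconds}",
        f"--affinity-per-node={affinity}",
        f"--run-real-ref={run_real_ref}",
    ]


# A short test configuration: 60 seconds on a 320x320x320 local problem.
print(" ".join(build_hpcg_args(320, 320, 320, 60)))
```

The resulting string can be appended to the launch line in a copy of one of the example runscripts.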


## Technical Support ##

Maintained/Optimized by Intel(R) oneAPI Math Kernel Library software engineers. Any bugs, issues
or relevant feature requests can be submitted via the 
[Intel oneMKL Community Forum](https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/bd-p/oneapi-math-kernel-library).


## Further information ##

Check out the [Official HPCG Benchmark Site](http://www.hpcg-benchmark.org/) for the latest
information and performance results on various clusters and hardware.


