GPU Programming

Available GPUs

The full hardware specifications of the GPU compute nodes can be found on the HPC Resources page. The clusters may provide different software modules.

For example, you can list the available CUDA versions with:

marie@login$ module spider CUDA

Note that some modules depend on a specific CUDA version, which is visible in the module name, e.g. GDRCopy/2.1-CUDA-11.1.1 or Horovod/0.28.1-CUDA-11.7.0-TensorFlow-2.11.0.

This especially applies to the optimized CUDA libraries like cuDNN, NCCL and magma.

CUDA-aware MPI

When running CUDA applications that use MPI for interprocess communication, you additionally need to load the modules that enable CUDA-aware MPI, which may improve performance. These are UCX-CUDA and UCC-CUDA, which supplement the UCX and UCC modules, respectively. Some modules, like NCCL, load them automatically.

Using GPUs with Slurm

For general information on how to use the batch system Slurm, read the respective page in this compendium. When allocating resources on a GPU-node, you must specify the number of requested GPUs by using the --gres=gpu:<N> option, like this:

Cluster Alpha and Capella

#!/bin/bash                           # Batch script starts with shebang line

#SBATCH --ntasks=1                    # All #SBATCH lines have to follow uninterrupted
#SBATCH --time=01:00:00               # after the shebang line
#SBATCH --account=p_number_crunch     # Comments start with # and do not count as interruptions
#SBATCH --job-name=fancyExp
#SBATCH --output=simulation-%j.out
#SBATCH --error=simulation-%j.err
#SBATCH --gres=gpu:1                  # request GPU(s) from Slurm

module purge                          # Set up environment, e.g., clean modules environment
module load module/version module2    # and load necessary modules

srun ./application [options]          # Execute parallel application with srun

Alternatively, you can work on the clusters interactively:

marie@login$ srun --nodes=1 --gres=gpu:<N> --time=00:30:00 --pty bash
marie@compute$ module purge; module switch release/<env>

Directive Based GPU Programming

Directives are special compiler commands in your C/C++ or Fortran source code. They tell the compiler how to parallelize and offload work to a GPU. This section explains how to use this technique.

OpenACC

OpenACC is a directive-based GPU programming model. It currently only supports NVIDIA GPUs as a target.

Please use the following information as a start on OpenACC:

Introduction

OpenACC can be used with the NVIDIA HPC compilers "NVHPC" (formerly PGI), which are shipped as part of the NVIDIA HPC SDK.

nvc is the NVIDIA C compiler (not to be confused with nvcc, which is used for CUDA).

Using OpenACC with NVIDIA HPC Compilers

See NVIDIA's OpenACC getting started guide for related information.
Switch into the correct module environment for your selected compute nodes (see list of available GPUs).
Load the NVHPC module for the correct module environment. Either load the default (module load NVHPC) or search for a specific version.
Use the correct compiler for your code: nvc for C, nvc++ for C++ and nvfortran for Fortran.
Use the -acc and -Minfo flags.
To create optimized code for both the A100 and H100 GPUs, use -gpu=cc80,cc90.
Further information on this compiler is provided in the user guide and the reference guide, which includes descriptions of the available command line options.
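The steps above can be tried out on a small kernel. The following is a minimal sketch, not taken from the NVIDIA guides: the file name saxpy.c and the function saxpy are hypothetical, and the build line in the comment only combines the flags listed above with the standard -c (compile only) option.

```c
/* saxpy.c (hypothetical example): y = a*x + y, offloaded with OpenACC.
 * Possible build line using the flags listed above:
 *   nvc -acc -Minfo -gpu=cc80,cc90 -c saxpy.c
 * Compilers without OpenACC support ignore the pragma and run the
 * loop serially on the CPU. */

void saxpy(int n, float a, const float *x, float *y)
{
    /* copyin: x is only read on the device.
     * copy:   y is copied to the device and back to the host. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

A caller allocates x and y on the host and calls saxpy(n, a, x, y). With -Minfo, the compiler output should indicate whether and how the loop was parallelized for the GPU.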

Using OpenACC with MPI

To use an MPI + OpenACC programming model, you have to use an MPI module that is built with an OpenACC compiler. For example, a GCC-built MPI won't work with OpenACC. The NVHPC module ships an OpenMPI as part of the so-called HPC-X suite, but it has to be activated before it can be used (refer to loading HPC-X). Alternatively, we provide an OpenMPI module that has been built with the NVHPC compilers for this special case. It is available on our Alpha Centauri and Capella systems.

Load the available NVHPC and MPI modules: module load release/24.04 NVHPC OpenMPI.
Alternative: Use HPC-X shipped with the NVHPC module - ask for support on how to load it.

OpenMP Target Offloading

OpenMP supports target offloading as of version 4.0. A dedicated set of compiler directives can be used to annotate code sections that are intended for execution on the GPU (i.e., target offloading). Not all OpenMP compilers support target offloading; refer to the official list for details. Note: Experience has shown that the performance of target offloading with GCC is very poor (i.e., slower than pure CPU code). We recommend using the NVIDIA compilers (see below) or Clang for this programming model. Also consider consulting our performance experts if you have a use case.

Using OpenMP Target Offloading with NVIDIA HPC Compilers

Load the module environments and the NVIDIA HPC SDK as described in the OpenACC section.
Use the -mp=gpu flag to enable OpenMP with offloading.
-Minfo tells you what the compiler is actually doing to your code.
The same compiler options as mentioned in the OpenACC section are available for OpenMP, including the -gpu=ccXY flag.
OpenMP-specific advice may also be found in the respective section of NVIDIA's user guide.
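As an illustration of the directives involved, the following minimal sketch offloads a dot product. The file name dot.c and the function dot are hypothetical, and the build line in the comment only combines the flags listed above with the standard -c (compile only) option.

```c
/* dot.c (hypothetical example): dot product offloaded with OpenMP
 * target directives.
 * Possible build line using the flags listed above:
 *   nvc -mp=gpu -Minfo -gpu=cc80,cc90 -c dot.c
 * If OpenMP is not enabled, the pragma is ignored and the loop runs
 * serially on the host. */

double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;

    /* map(to:) copies the inputs to the device; sum is mapped tofrom
     * so that the reduced value is copied back to the host. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n], y[0:n]) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The combined construct distributes the loop iterations across teams and threads on the device. Explicit map clauses make the data transfers visible, which helps when comparing the code against the -Minfo output.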

Native GPU Programming

CUDA

Native CUDA programs can sometimes offer better performance. NVIDIA provides some introductory material and links. An introduction to CUDA is provided as well. The toolkit documentation page links to the programming guide and the best practice guide. Optimization guides for supported NVIDIA architectures are available, including for Ampere (A100) and Hopper (H100).

To compile an application with CUDA, use the nvcc compiler command, which is described in detail in the nvcc documentation. This compiler is available via several CUDA packages; a default version can be loaded via module load CUDA. Additionally, the NVHPC modules provide CUDA tools as well.

To use CUDA with Open MPI across multiple nodes, the loaded OpenMPI module must have been compiled with CUDA support. If you are not sure whether the module you are using supports it, you can check as follows:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value | awk -F":" '{print "Open MPI supports CUDA:",$7}'

Using the CUDA Compiler

The simple invocation nvcc <code.cu> will compile a valid CUDA program. nvcc differentiates between the device code and the host code, which are compiled in separate phases. Therefore, compiler options can be defined specifically for the device as well as for the host code. By default, GCC is used as the host compiler. The following flags may be useful:

--generate-code (-gencode): generate optimized code for a target GPU (caution: these binaries cannot be used with GPUs of other generations).
- For Ampere (A100): --generate-code arch=compute_80,code=sm_80
- For Hopper (H100): --generate-code arch=compute_90,code=sm_90
-Xcompiler: pass flags to the host compiler. E.g., generate OpenMP-parallel host code: -Xcompiler -fopenmp. The -Xcompiler flag has to be repeated for each host flag.

Performance Analysis

Consult NVIDIA's Best Practices Guide and the performance guidelines for possible steps in performance analysis and optimization.

Multiple tools can be used for the performance analysis. For the analysis of applications on the newer GPUs (A100 and H100), we recommend the newer NVIDIA Nsight tools: Nsight Systems for system-wide sampling and tracing, and Nsight Compute for a detailed analysis of individual kernels.

NVIDIA nvprof and Visual Profiler

The nvprof command line tool and the Visual Profiler are available once a CUDA module has been loaded. For a simple analysis, you can call nvprof without any options, like this:

marie@compute$ nvprof ./application [options]

For a more in-depth analysis, we recommend you use the command line tool first to generate a report file, which you can later analyze in the Visual Profiler. In order to collect a set of general metrics for the analysis in the Visual Profiler, use the --analysis-metrics flag to collect metrics and --export-profile to generate a report file, like this:

marie@compute$ nvprof --analysis-metrics --export-profile <output>.nvvp ./application [options]

Transfer the report file to your local system and analyze it in the Visual Profiler (nvvp) locally. This will give the smoothest user experience. Alternatively, you can use X11-forwarding. Refer to the documentation for details about the individual features and views of the Visual Profiler.

Besides these generic analysis methods, you can profile specific aspects of your GPU kernels. nvprof can profile specific events. For this, use

marie@compute$ nvprof --query-events

to get a list of available events. Analyze one or more events by specifying them, separated by commas:

marie@compute$ nvprof --events <event_1>[,<event_2>[,...]] ./application [options]

Additionally, you can analyze specific metrics. Similar to the profiling of events, you can get a list of available metrics:

marie@compute$ nvprof --query-metrics

One or more metrics can be profiled at the same time:

marie@compute$ nvprof --metrics <metric_1>[,<metric_2>[,...]] ./application [options]

If you want to limit the profiler's scope to one or more kernels, you can use the --kernels <kernel_1>[,<kernel_2>] flag. For further command line options, refer to the documentation on command line options.

NVIDIA Nsight Systems

Use NVIDIA Nsight Systems for a system-wide sampling of your code. Refer to the NVIDIA Nsight Systems User Guide for details. With this, you can identify parts of your code that take a long time to run and are suitable optimization candidates.

Use the command-line version to sample your code and create a report file for later analysis:

marie@compute$ nsys profile [--stats=true] ./application [options]

The --stats=true flag is optional and will create a summary on the command line. Depending on your needs, this analysis may be sufficient to identify optimization targets.

The graphical user interface version can be used for a thorough analysis of your previously generated report file. For an optimal user experience, we recommend a local installation of NVIDIA Nsight Systems. In this case, you can transfer the report file to your local system. Alternatively, you can use X11-forwarding. The graphical user interface is usually available as nsys-ui.

Furthermore, you can use the command line interface for further analyses. Refer to the documentation for a list of available command line options.

NVIDIA Nsight Compute

Nsight Compute is used for the analysis of individual GPU kernels. It supports GPUs from the Volta architecture onward (on the ZIH system: A100 and H100). If you are familiar with nvprof, you may want to consult the Nvprof Transition Guide, as Nsight Compute uses a new scheme for metrics. As optimization targets, we recommend the kernels that account for a large portion of your run time according to Nsight Systems. Nsight Compute is particularly useful for CUDA code, as you have much greater control over your code compared to the directive-based approaches.

Nsight Compute comes in a command line and a graphical version. Refer to the Kernel Profiling Guide to get an overview of the functionality of these tools.

You can call the command line version (ncu) without further options to get a broad overview of your kernel's performance:

marie@compute$ ncu ./application [options]

As with the other profiling tools, the Nsight Compute profiler can generate report files like this:

marie@compute$ ncu --export <report> ./application [options]

The report file automatically gets the file ending .ncu-rep, so you do not need to specify it manually.

This report file can be analyzed in the graphical user interface profiler. Again, we recommend you generate a report file on a compute node and transfer the report file to your local system. Alternatively, you can use X11-forwarding. The graphical user interface is usually available as ncu-ui or nv-nsight-cu.

Similar to the nvprof profiler, you can analyze specific metrics. NVIDIA provides a Metrics Guide. Use --query-metrics to get a list of available metrics, listing them by base name. Individual metrics can be collected by using

marie@compute$ ncu --metrics <metric_1>[,<metric_2>,...] ./application [options]

Collection of events is no longer possible with Nsight Compute. Instead, many nvprof events can be measured with metrics.

You can collect metrics for individual kernels by specifying the --kernel-name flag.