
Compare System Performance with SPEChpc

SPEChpc 2021 is a benchmark suite developed by the Standard Performance Evaluation Corporation (SPEC) for the evaluation of heterogeneous HPC systems. Documentation and released benchmark results can be found on their web page. In fact, our system Taurus (partition haswell) is the benchmark's reference system and thus represents the baseline score.

The suite includes nine real-world scientific applications (see benchmark table) with workload sizes tiny, small, medium, and large, and different parallelization models including MPI only, MPI+OpenACC, MPI+OpenMP, and MPI+OpenMP with target offloading. With this benchmark suite you can compare the performance of different HPC systems and, furthermore, evaluate parallelization strategies for applications on a target HPC system. If you want, e.g., to implement an algorithm, port an application to another platform, or integrate acceleration into your code, you can determine which target system and parallelization model your application's performance would benefit from most. You can also check whether an acceleration scheme can be deployed and run on a given system at all, since software issues may restrict otherwise capable hardware (see this CUDA issue).

Since TU Dresden is a member of the SPEC consortium, the HPC benchmarks can be requested by anyone interested. Please contact Holger Brunst for access.

Installation

The target partition determines which of the parallelization models can be used, and vice versa. For example, if you want to run a model including acceleration, you would have to use a partition with GPUs.

Once the target partition is determined, follow SPEC's Installation Guide. It is straightforward and easy to follow.

Building for partition ml

The partition ml is based on the Power9 architecture. Thus, you need to provide the -e ppc64le switch when installing.
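
A sketch of the corresponding installer call (the destination path is a placeholder and the -d option is an assumption about SPEC's install.sh; see SPEC's Installation Guide for the exact syntax):

# run from the unpacked SPEC distribution
./install.sh -d /scratch/ws/spec/installation -e ppc64le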

Building with NVHPC for partition alpha

To build the benchmark for partition alpha, you don't need an interactive session on the target architecture. You can stay on the login nodes as long as you set the flag -tp=zen. You can add this compiler flag to the configuration file.
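
As an illustration, appending the flag in the configuration file might look like the following (the OPTIMIZE variable and its other flags are placeholders; keep whatever your configuration already defines and only add -tp=zen):

default:
   OPTIMIZE = -O3 -tp=zen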

If you are facing errors during the installation process, check the solved and unresolved issues sections for our systems. The problem might already be listed there.

Configuration

How to build, run, and report the benchmark in a particular environment is controlled by a configuration file. A few examples are included in the source code. In the configuration file you can apply compiler tuning and porting, specify the runtime environment, and describe the system under test. SPEChpc 2021 has been deployed on the partitions haswell, ml, and alpha, and a configuration is available for each of them.

No matter which one you choose as a starting point, double-check the line that defines the submit command and make sure it says srun [...], e.g.

submit = srun $command

Otherwise, this can cause trouble (see the Slurm Bug below). You can also put Slurm options in the configuration file, but it is recommended to set them in a job script instead (see chapter Execution). Use the following to apply your configuration to the benchmark run:

runhpc --config <configfile.cfg> [...]

For more details about configuration settings, consult SPEC's documentation.

Execution

The SPEChpc 2021 benchmark suite is executed with the runhpc command, which also sets its configuration and controls its runtime behavior. For all options, see SPEC's documentation about runhpc options. First, execute source shrc in your SPEC installation directory. Then use a job script to submit a job with the benchmark or parts of it.
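
A minimal sequence on the login node, assuming a placeholder installation path:

cd /scratch/ws/spec/installation   # placeholder: your SPEC installation path
source shrc                        # sets up the SPEC environment, e.g. $SPEC and the tools
which runhpc                       # should now resolve to the SPEC tools directory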

In the following, job scripts are shown for the partitions haswell, ml, and alpha, respectively. You can use them as templates in order to reproduce results or to transfer the execution to a different partition.

  • Replace <p_number_crunch> (line 2) with your project name
  • Replace ws=</scratch/ws/spec/installation> (line 15/18) with your SPEC installation path

Submit SPEChpc Benchmarks with a Job File

Job script for partition haswell:
#!/bin/bash
#SBATCH --account=<p_number_crunch>
#SBATCH --partition=haswell64
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2541M
#SBATCH --time=16:00:00
#SBATCH --constraint=DA

module purge
module load gompi/2019a

ws=</scratch/ws/spec/installation>
cd ${ws}
source shrc

# reportable run with all benchmarks
BENCH="tiny"

runhpc --config gnu-taurus --define model=mpi --ranks=24 --reportable --tune=base --flagsurl=$SPEC/config/flags/gcc_flags.xml ${BENCH}

Job script for partition ml:
#!/bin/bash
#SBATCH --account=<p_number_crunch>
#SBATCH --partition=ml
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gres=gpu:6
#SBATCH --mem-per-cpu=5772M
#SBATCH --time=00:45:00
#SBATCH --export=ALL
#SBATCH --hint=nomultithread

module --force purge
module load modenv/ml NVHPC OpenMPI/4.0.5-NVHPC-21.2-CUDA-11.2.1

ws=</scratch/ws/spec/installation>
cd ${ws}
source shrc

export OMPI_CC=nvc
export OMPI_CXX=nvc++
export OMPI_FC=nvfortran

suite='tiny ^pot3d_t'
cfg=nvhpc_ppc.cfg

# test run
runhpc -I --config ${cfg} --ranks ${SLURM_NTASKS} --define pmodel=acc --size=test --noreportable --tune=base --iterations=1 ${suite}

# reference run
runhpc --config ${cfg} --ranks ${SLURM_NTASKS} --define pmodel=acc --rebuild --tune=base --iterations=3 ${suite}

Job script for partition alpha:
#!/bin/bash
#SBATCH --account=<p_number_crunch>
#SBATCH --partition=alpha
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=6
#SBATCH --gpus-per-task=1
#SBATCH --gres=gpu:8
#SBATCH --mem-per-cpu=20624M
#SBATCH --time=00:45:00
#SBATCH --export=ALL
#SBATCH --hint=nomultithread

module --force purge
module load modenv/hiera NVHPC OpenMPI

ws=</scratch/ws/spec/installation>
cd ${ws}
source shrc

suite='tiny'
cfg=nvhpc_alpha.cfg

# test run
runhpc -I --config ${cfg} --ranks ${SLURM_NTASKS} --define pmodel=acc --size=test --noreportable --tune=base --iterations=1 ${suite}

# reference workload
runhpc --config ${cfg} --ranks ${SLURM_NTASKS} --define pmodel=acc --tune=base --iterations=3 ${suite}
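
Once the placeholders are replaced, submit the job script of your choice as usual (the file name is hypothetical):

sbatch spechpc_alpha.sh
squeue -u $USER        # check the state of the submitted job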

Solved Issues

Fortran Compilation Error

PGF90-F-0004-Corrupt or Old Module file

Explanation

If this error arises at runtime, the benchmark binaries and the MPI module do not fit together. This happens when the benchmarks written in Fortran were built with a different compiler than the one used to build the MPI module loaded for the run.

Solution

  1. Use the correct MPI module
    • The MPI module in use must be compiled with the same compiler that was used to build the benchmark binaries. Check the results of module avail and choose a corresponding module (see the sketch after this list).
  2. Rebuild the binaries
    • Rebuild the binaries using the same compiler as for the compilation of the MPI module of choice.
  3. Request a new module
    • Ask the HPC support to install a compatible MPI module.
  4. Build your own MPI module (as a last resort)
    • Download and build a private MPI module using the same compiler as for building the benchmark binaries.
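
For the first option, a quick compatibility check could look like this (module names taken from the haswell job script above; adapt them to your system):

module avail OpenMPI           # list available MPI modules and their toolchains
module load gompi/2019a        # GCC-based toolchain matching GCC-built benchmark binaries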

pmix Error

PMIX ERROR

It looks like the function `pmix_init` failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during pmix_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;

pmix_progress_thread_start failed
--> Returned value -1 instead of PMIX_SUCCESS

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Explanation

This is most probably an MPI-related issue. If you built your own MPI module, PMIx support might be configured incorrectly.

Solution

Use the configure option --with-pmix=internal when building your own MPI module.
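
A sketch of such a build with Open MPI and the NVHPC compilers (version, paths, and further options are placeholders; adapt them to your setup):

# inside the unpacked Open MPI source tree
./configure --prefix=/path/to/openmpi-install \
            --with-pmix=internal \
            CC=nvc CXX=nvc++ FC=nvfortran
make -j
make install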

ORTE Error (too many processes)

Error: system limit exceeded on number of processes that can be started

ORTE_ERROR_LOG: The system limit on number of children a process can have was reached.

Explanation

There are too many processes spawned, probably due to a wrong job allocation and/or invocation.

Solution

Check the invocation in your job script: it must not read srun runhpc [...], but only runhpc [...]. The submit command in the configuration file already contains srun; when srun is present in both places, too many parallel processes are spawned.
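
In other words, srun belongs only in the configuration file, while the job script calls runhpc directly:

# configuration file: srun is added here
submit = srun $command

# job script: no srun in front of runhpc
runhpc --config <configfile.cfg> [...]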

Error with OpenFabrics Device

There was an error initializing an OpenFabrics device

Explanation

"I think it’s just trying to find the InfiniBand libraries, which aren’t used, but can’t. It’s probably safe to ignore."

Matthew Colgrove, Nvidia

Solution

This is just a warning which cannot be suppressed, but can be ignored.

Out of Memory

Out of memory

Out of memory allocating [...] bytes of device memory
call to cuMemAlloc returned error 2: Out of memory

Explanation

  • When running on a single node with all of its memory allocated, there is not enough memory for the benchmark.
  • When running on multiple nodes, this might be a wrong resource distribution caused by Slurm. Check the $SLURM_NTASKS_PER_NODE environment variable. If it says something like 15,1 when you requested 8 processes per node, Slurm was not able to hand over the resource distribution to mpirun.
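
A quick way to inspect the distribution from within the job script (a sketch; the expected value depends on your request):

# print the per-node task distribution Slurm passes to the job; a value like
# 15,1 instead of the requested 8 tasks per node indicates the problem above
echo "SLURM_NTASKS_PER_NODE=${SLURM_NTASKS_PER_NODE}"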

Solution

  • Expand your job from single node to multiple nodes.
  • Reduce the workload (e.g. from small to tiny).
  • Make sure to use srun instead of mpirun as the submit command in your configuration file.

Unresolved Issues

CUDA Reduction Operation Error

There was a problem while initializing support for the CUDA reduction operations.

Explanation

For OpenACC, NVHPC was in the process of adding OpenMP array reduction support, which is needed for the pot3d benchmark. An Nvidia driver version of 450.80.00 or higher is required. Since the driver version on partition ml is 440.64.00, it is not supported, and it is therefore not possible to run the pot3d benchmark in OpenACC mode there.
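
To check the driver version available on a node, you can query nvidia-smi directly (run it on the target node, e.g. inside a job):

nvidia-smi --query-gpu=driver_version --format=csv,noheader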

Workaround

On the partition ml, you can only wait until the OS update to CentOS 8 is carried out, since no driver update will be done beforehand. Until then, you can do one of the following:

  • Exclude the pot3d benchmark (see the example after this list).
  • Switch the partition (e.g. to partition alpha).
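
For the first workaround, the exclusion syntax is the one already used in the ml job script above:

suite='tiny ^pot3d_t'                  # tiny workload without the pot3d benchmark
runhpc --config ${cfg} [...] ${suite}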

Slurm Bug

Wrong resource distribution

When working with multiple nodes on partition ml or alpha, the Slurm parameter $SLURM_NTASKS_PER_NODE does not work as intended when used in conjunction with mpirun.

Explanation

In the described case, e.g. with SLURM_NTASKS_PER_NODE=8 and mpirun as launcher, Slurm is not able to pass the allocation settings on correctly. With two nodes, this leads to a distribution of 15 processes on the first node and 1 process on the second node instead of 8 and 8. In fact, none of the methods proposed in Slurm's man page (like --distribution=plane=8) gives the intended result in this case.

Workaround

  • Use srun instead of mpirun.
  • Use mpirun together with a rank-binding Perl script (like mpirun -np <ranks> perl <bind.pl> <command>), as seen at the bottom of the configurations here and here, in order to enforce the intended distribution of ranks.

Benchmark Hangs Forever

The benchmark runs forever and produces a timeout.

Explanation

The exact cause is not known; however, the issue is triggered by the flag -DSPEC_ACCEL_AWARE_MPI.

Workaround

Remove the flag -DSPEC_ACCEL_AWARE_MPI from the compiler options in your configuration file.
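
Depending on where your configuration file sets this flag, the change might look like the following (the variable name and the remaining flags are placeholders):

# before
OPTIMIZE = -O3 -DSPEC_ACCEL_AWARE_MPI
# after
OPTIMIZE = -O3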

Other Issues

For any further issues you can consult SPEC's FAQ page, search through their known issues or contact their support.