Compare System Performance with SPEChpc¶
SPEChpc 2021 is a benchmark suite developed by the Standard Performance Evaluation Corporation
(SPEC) for the evaluation of various, heterogeneous HPC systems. Documentation and released
benchmark results can be found on their web page. In fact, our system Taurus (partition haswell)
is the benchmark's reference system and thus represents the baseline score.
The suite includes nine real-world scientific applications (see benchmark table) in four workload sizes (tiny, small, medium and large) and with different parallelization models: MPI only, MPI+OpenACC, MPI+OpenMP, and MPI+OpenMP with target offloading. With this benchmark suite you can compare the performance of different HPC systems and also evaluate parallelization strategies for applications on a target HPC system. For example, when you want to implement an algorithm, port an application to another platform, or integrate acceleration into your code, you can determine which target system and parallelization model would benefit your application's performance most. You can also check whether an acceleration scheme can be deployed and run on a given system, since software issues may restrict otherwise capable hardware (see this CUDA issue).
Since TU Dresden is a member of the SPEC consortium, the HPC benchmarks can be requested by anyone interested. Please contact Holger Brunst for access.
The target partition determines which of the parallelization models can be used, and vice versa. For example, if you want to run a model including acceleration, you would have to use a partition with GPUs.
Once the target partition is determined, follow SPEC's Installation Guide. It is straightforward and easy to use.
Building for partition ml
Partition ml is a Power9 architecture. Thus, you need to provide the -e ppc64le switch.
Building with NVHPC for partition alpha
To build the benchmark for partition alpha, you don't need an interactive session on the target architecture. You can stay on the login nodes as long as you set the -tp=zen flag. You can add this compiler flag to the configuration file.
The behavior in terms of how to build, run and report the benchmark in a particular environment is
controlled by a configuration file. There are a few examples included in the source code.
Here you can apply compiler tuning and porting, specify the runtime environment and describe the
system under test. SPEChpc 2021 has been deployed on several partitions, among them alpha, and a
configuration is available for each of them.
No matter which one you choose as a starting point,
double-check the line that defines the submit command and make sure it says
srun [...], e.g.
submit = srun $command
Otherwise this can cause trouble (see Slurm Bug). You can also put Slurm options in the configuration, but it is recommended to do this in a job script (see chapter Execution). Use the following to apply your configuration to the benchmark run:
runhpc --config <configfile.cfg> [...]
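The check of the submit line can also be scripted. The following is a minimal sketch, assuming the submit command sits on a single line of the form submit = ... in the configuration file. The function name submit_uses_srun is made up for this example and is not part of SPEChpc.

```shell
# Succeeds (exit status 0) when the configuration file contains a line
# of the form "submit = srun ..."; fails otherwise.
# $1: path to the configuration file
submit_uses_srun() {
    grep -Eq '^[[:space:]]*submit[[:space:]]*=[[:space:]]*srun([[:space:]]|$)' "$1"
}

# Example call (the file name is a placeholder):
#   submit_uses_srun my-config.cfg || echo "submit line does not call srun"
```

The function only reads the file, so it is safe to run on any configuration before starting a benchmark.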
For more details about configuration settings check out the following links:
The SPEChpc 2021 benchmark suite is executed with the runhpc command, which also sets its
configuration and controls its runtime behavior. For all options, see SPEC's documentation about
runhpc. First, execute source shrc in your SPEC installation directory. Then use a job script to
submit a job with the benchmark or parts of it.
The following job scripts are provided for the individual partitions. You can use them as
templates to reproduce results or to transfer the execution to a different partition.
- Replace <p_number_crunch> (line 2) with your project name
- Replace ws=</scratch/ws/spec/installation> (line 15/18) with your SPEC installation path
Submit SPEChpc Benchmarks with a Job File¶
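A job script along these lines could look as follows. This is a minimal sketch and not one of the partition-specific scripts referred to above: all #SBATCH values, the SPEC_DIR and SPEC_CONFIG variables, the function name run_spechpc and the default suite tiny are assumptions chosen for illustration.

```shell
#!/bin/bash
# All #SBATCH values are placeholders; adjust them to your project
# and target partition.
#SBATCH --account=p_number_crunch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=02:00:00

# Hypothetical default; point this to your SPEC installation path.
SPEC_DIR="${SPEC_DIR:-/scratch/ws/spec/installation}"

run_spechpc() {
    # $1: configuration file, $2: benchmark or suite (assumed default: tiny)
    cd "$SPEC_DIR" || return 1
    # Set up the SPEC environment, which provides the runhpc command.
    . ./shrc || return 1
    # Call runhpc directly, without a leading srun: the submit line in
    # the configuration file starts the parallel processes.
    runhpc --config "$1" "${2:-tiny}"
}

# Start the benchmark only when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_spechpc "${SPEC_CONFIG:?set SPEC_CONFIG to your configuration file}" "$1"
fi
```

Wrapping the call in a function keeps the script harmless when it is sourced outside of a job, for example to inspect it on a login node.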
Fortran Compilation Error¶
PGF90-F-0004-Corrupt or Old Module file
If this error arises during runtime, the benchmark binaries and the MPI module do not fit together. This happens when the benchmarks written in Fortran were built with a different compiler than the one used to build the MPI module that was loaded for the run.
- Use the correct MPI module
  - The MPI module in use must be compiled with the same compiler that was used to build the benchmark binaries. Check the results of module avail and choose a corresponding module.
- Rebuild the binaries
  - Rebuild the binaries using the same compiler as for the compilation of the MPI module of choice.
- Request a new module
  - Ask the HPC support to install a compatible MPI module.
- Build your own MPI module (as a last resort)
  - Download and build a private MPI module using the same compiler as for building the benchmark binaries.
It looks like the function `pmix_init` failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during pmix_init; some of which are due to configuration or environment problems.
This failure appears to be an internal failure;
mix_progress_thread_start failed
--> Returned value -1 instead of PMIX_SUCCESS
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
This is most probably an MPI-related issue. If you built your own MPI module, PMIX support might be configured incorrectly.
Use configure --with-pmix=internal during the cmake configuration routine.
ORTE Error (too many processes)¶
Error: system limit exceeded on number of processes that can be started
ORTE_ERROR_LOG: The system limit on number of children a process can have was reached.
There are too many processes spawned, probably due to a wrong job allocation and/or invocation.
Check the invocation command line in your job script. It must not say
srun runhpc [...]
there, but only
runhpc [...]
The submit command in the configuration file already contains srun. If srun is called in both places, too many parallel processes are spawned.
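A quick way to spot this mistake is to search the job script for a line that puts srun in front of runhpc. The following is a minimal sketch, assuming the call sits on one line; the function name job_script_wraps_runhpc is hypothetical.

```shell
# Succeeds (exit status 0) when the job script contains a line that
# starts with srun and then calls runhpc, e.g. "srun runhpc [...]".
# Commented-out lines are ignored because the match is anchored at srun.
# $1: path to the job script
job_script_wraps_runhpc() {
    grep -Eq '^[[:space:]]*srun[[:space:]]+(.*[[:space:]])?runhpc([[:space:]]|$)' "$1"
}

# Example call (the file name is a placeholder):
#   job_script_wraps_runhpc job.sh && echo "remove srun in front of runhpc"
```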
Error with OpenFabrics Device¶
There was an error initializing an OpenFabrics device
"I think it’s just trying to find the InfiniBand libraries, which aren’t used, but can’t. It’s probably safe to ignore."
Matthew Colgrove, Nvidia
This is just a warning which cannot be suppressed, but can be ignored.
Out of Memory¶
Out of memory
Out of memory allocating [...] bytes of device memory
call to cuMemAlloc returned error 2: Out of memory
- When running on a single node with all of its memory allocated, there is not enough memory for the benchmark.
- When running on multiple nodes, this might be a wrong resource distribution caused by Slurm. Check the $SLURM_NTASKS_PER_NODE environment variable. If it says something like 15,1 when you requested 8 processes per node, Slurm was not able to hand over the resource distribution to mpirun.
- Expand your job from single node to multiple nodes.
- Reduce the workload (e.g. from small to tiny).
- Make sure to use srun, not mpirun, as the submit command in your configuration file.
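The distribution check described above can be sketched as a small helper that compares a per-node task list such as 15,1 with the number of tasks you requested per node. The list is passed in as an argument; which environment variable holds it on your system is an assumption to verify (the text names $SLURM_NTASKS_PER_NODE, and Slurm also documents SLURM_TASKS_PER_NODE, which uses the compact form 8(x2)). The function name check_task_distribution is hypothetical.

```shell
# Compare a Slurm per-node task list such as "8(x2)" or "15,1" with the
# expected number of tasks per node. Prints one line per mismatch and
# returns 1 if any entry differs, 0 otherwise.
# $1: distribution string, $2: expected tasks per node
check_task_distribution() {
    dist=$1
    expected=$2
    status=0
    old_ifs=$IFS
    IFS=,
    for entry in $dist; do
        # Strip an optional "(xN)" repeat suffix, e.g. "8(x2)" -> "8".
        count=${entry%%"("*}
        if [ "$count" -ne "$expected" ]; then
            echo "mismatch: a node runs $count tasks, expected $expected"
            status=1
        fi
    done
    IFS=$old_ifs
    return $status
}

# Example call inside a job (variable name to be verified on your system):
#   check_task_distribution "$SLURM_TASKS_PER_NODE" 8
```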
CUDA Reduction Operation Error¶
There was a problem while initializing support for the CUDA reduction operations.
For OpenACC, NVHPC was in the process of adding OpenMP array reduction support, which is needed
for the pot3d benchmark. An Nvidia driver version of 450.80.00 or higher is required. Since
the driver version on partition ml is 440.64.00, it is not supported and it is not possible to run
the pot3d benchmark in OpenACC mode here.
As for the partition ml, you can only wait until the OS update to CentOS 8 is carried out,
as no driver update will be done beforehand. As a workaround, you can do one of the following:
- Exclude the pot3d benchmark.
- Switch to another partition.
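Whether a node meets the driver requirement stated above (450.80.00 or higher) can be checked before an OpenACC run. The following is a minimal sketch that compares two dotted version numbers field by field; the function name driver_at_least is hypothetical, and the nvidia-smi call in the comment assumes the tool is available on the node.

```shell
# Succeeds (exit status 0) when the installed version is greater than or
# equal to the required one. Both are dotted numbers such as 450.80.00;
# missing fields count as 0.
# $1: installed version, $2: required minimum version
driver_at_least() {
    awk -v have="$1" -v need="$2" 'BEGIN {
        n = split(have, h, ".")
        m = split(need, r, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            a = h[i] + 0
            b = r[i] + 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}

# Example call on a GPU node:
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_at_least "$v" 450.80.00 || echo "driver too old for pot3d with OpenACC"
```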
Wrong resource distribution
When working with multiple nodes on partition alpha, the Slurm parameter $SLURM_NTASKS_PER_NODE does not work as intended when used in conjunction with mpirun. In the described case, when setting e.g. SLURM_NTASKS_PER_NODE=8 and calling mpirun, Slurm is not able to pass on the allocation settings correctly. With two nodes, this leads to a distribution of 15 processes on the first node and 1 process on the second node instead of 8 processes on each node. In fact, none of the methods proposed in Slurm's man page gives the intended result in this case.
Benchmark Hangs Forever¶
The benchmark runs forever and produces a timeout.
The reason for this is not known; however, it is triggered by the flag -DSPEC_ACCEL_AWARE_MPI.
Remove the flag -DSPEC_ACCEL_AWARE_MPI from the compiler options in your configuration file.
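Removing the flag can be done by hand or with a small filter. The following is a minimal sketch, assuming the flag appears verbatim in the configuration file; the function name strip_accel_aware_flag is hypothetical.

```shell
# Print a configuration file with every occurrence of the
# -DSPEC_ACCEL_AWARE_MPI flag (and the whitespace in front of it) removed.
# The result goes to standard output, so the original file stays untouched.
# $1: path to the configuration file
strip_accel_aware_flag() {
    sed 's/[[:space:]]*-DSPEC_ACCEL_AWARE_MPI//g' "$1"
}

# Example call (file names are placeholders):
#   strip_accel_aware_flag my-config.cfg > my-config-no-aware-mpi.cfg
```

Writing to a new file keeps the original configuration available for comparison.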