Slurm Resource Limits

There is no such thing as a free lunch at ZIH systems. Since compute nodes are operated in multi-user mode by default, jobs of several users can run at the same time on the very same node, sharing resources like memory (but not CPU). On the other hand, a higher throughput can be achieved with smaller jobs. Thus, restrictions w.r.t. memory and runtime limits have to be respected when submitting jobs.

Runtime Limits

Runtime limits on login nodes

There is a time limit of 600 seconds set for processes on login nodes. Each process running longer than this limit is automatically killed and terminates with the message

CPU time limit exceeded

The login nodes are shared resources between all users of the ZIH systems and thus need to remain available; they cannot be used for productive runs. Please submit extensive application runs to the compute nodes using the batch system.

Runtime limits are enforced.

A job is canceled as soon as it exceeds its requested limit. Currently, the maximum run time limit is 7 days.

Shorter jobs come with multiple advantages:

  • lower risk of loss of computing time,
  • shorter waiting time for scheduling,
  • higher job fluctuation; thus, jobs with high priorities may start faster.

To bring down the percentage of long-running jobs, we restrict the number of cores used by jobs longer than 2 days to approximately 50% and by jobs longer than 24 hours to 75% of the total number of cores. (These numbers are subject to change.) As best practice, we advise a run time of about 8 hours.
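
The time limit itself is requested at submission via the Slurm option --time. A minimal sketch (the job script name is a placeholder):

marie@login$ sbatch --time=08:00:00 my_job_script.sh

Inside a job script, the same can be set with the line #SBATCH --time=08:00:00.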

Please always try to make a good estimate of the time limit your job needs.

For this, you can use a command line like the following to compare the requested time limit with the elapsed time of your completed jobs that started after a given date:

marie@login$ sacct -X -S 2021-01-01 -E now --format=start,JobID,jobname,elapsed,timelimit -s COMPLETED

Instead of running one long job, you should split it up into a chain job. Even applications that are not capable of checkpoint/restart can be adapted. Please refer to the section Checkpoint/Restart for further documentation.
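
One way to build such a chain is to let each part depend on the successful completion of the previous one via Slurm job dependencies. A minimal sketch, assuming part.sh is a placeholder job script:

marie@login$ jobid=$(sbatch --parsable part.sh)
marie@login$ sbatch --dependency=afterok:${jobid} part.sh

The second job stays pending until the first one has finished successfully.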

Memory Limits

Memory limits are enforced.

Jobs which exceed their per-node memory limit are killed automatically by the batch system.

Memory requirements for your job can be specified via the sbatch/srun parameters --mem-per-cpu=<MB> or --mem=<MB> (the latter meaning "memory per node"). The default limit, regardless of the partition the job runs on, is quite low at 300 MB per CPU. If you need more memory, you need to request it.
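
For example, to request 2000 MB per CPU instead of the 300 MB default (the values and application name are illustrative only):

marie@login$ srun --ntasks=1 --cpus-per-task=4 --mem-per-cpu=2000 ./my_application

The same can be set in a job script via the line #SBATCH --mem-per-cpu=2000.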

ZIH systems comprise different sets of nodes with different amounts of installed memory, which affects where your job may be run. To achieve the shortest possible waiting time for your jobs, you should be aware of the limits shown in the Slurm resource limits table.

Slurm Resource Limits Table

The physically installed memory might differ from the amount available for Slurm jobs. One reason is so-called diskless compute nodes, i.e., nodes without an additional local drive. On these nodes, the operating system and other components reside in the main memory, lowering the amount available for jobs. The reserved amount of memory for system operation might vary slightly over time. The following table depicts the resource limits for all our HPC systems.

| HPC System     | Nodes                | # Nodes | Cores per Node | Threads per Core | Memory per Node [in MB] | Memory per (SMT) Core [in MB] | GPUs per Node | Cores per GPU | Job Max Time |
|----------------|----------------------|---------|----------------|------------------|-------------------------|-------------------------------|---------------|---------------|--------------|
| Barnard        | n[1001-1630].barnard | 630     | 104            | 2                | 515,000                 | 4,951                         | -             | -             | unlimited    |
| Power9         | ml[1-29].power9      | 29      | 44             | 4                | 254,000                 | 1,443                         | 6             | -             | unlimited    |
| Romeo          | i[8001-8190].romeo   | 190     | 128            | 2                | 505,000                 | 1,972                         | -             | -             | unlimited    |
| Julia          | julia                | 1       | 896            | 1                | 48,390,000              | 54,006                        | -             | -             | unlimited    |
| Alpha Centauri | i[8001-8037].alpha   | 37      | 48             | 2                | 990,000                 | 10,312                        | 8             | 6             | unlimited    |

Slurm resource limits table

All HPC systems have Simultaneous Multithreading (SMT) enabled. You can request these additional threads using the Slurm option --hint=multithread or by setting the environment variable SLURM_HINT=multithread. Besides using the threads to speed up computations, the memory of the other threads is allocated implicitly as well, and you will always get memory per core * number of threads as your memory pledge.
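
A minimal sketch of requesting the additional SMT threads (the application name is a placeholder):

marie@login$ srun --ntasks=1 --cpus-per-task=2 --hint=multithread ./my_application

or equivalently via the environment variable:

marie@login$ export SLURM_HINT=multithread
marie@login$ srun --ntasks=1 --cpus-per-task=2 ./my_application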