
Partitions and Limits

There is no such thing as a free lunch at ZIH systems. Since compute nodes are operated in multi-user mode by default, jobs of several users can run at the same time on the very same node, sharing resources like memory (but not CPU). On the other hand, a higher throughput can be achieved by smaller jobs. Thus, restrictions w.r.t. memory and runtime limits have to be respected when submitting jobs.

Runtime Limits

Runtime limits on login nodes

There is a time limit set for processes on the login nodes. If you run applications outside of a compute job, they will be stopped automatically after 5 minutes with

CPU time limit exceeded

Please start a job using the batch system.
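If you need more than a few minutes of interactive computing, request a compute node via the batch system instead of working on the login node. A minimal sketch using srun, where the resource values are only placeholders:

marie@login$ srun --ntasks=1 --cpus-per-task=4 --time=01:00:00 --pty bash -l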

Runtime limits are enforced.

A job is canceled as soon as it exceeds its requested limit. Currently, the maximum run time is 7 days.

Shorter jobs come with multiple advantages:

  • lower risk of loss of computing time,
  • shorter waiting time for scheduling,
  • higher job fluctuation; thus, jobs with high priorities may start faster.

To bring down the percentage of long-running jobs, we restrict the number of cores available to jobs longer than 2 days to approximately 50% and to jobs longer than 24 hours to 75% of the total number of cores. (These numbers are subject to change.) As a best practice, we advise a runtime of about 8 hours.

Please always try to make a realistic estimate of the time limit your job needs.

For this, you can use a command line like the following to compare the requested time limit with the elapsed time of your completed jobs that started after a given date:

marie@login$ sacct -X -S 2021-01-01 -E now --format=start,JobID,jobname,elapsed,timelimit -s COMPLETED
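The time limit itself is requested via the --time option of sbatch/srun. A minimal job script sketch, where the time value and the application name are only placeholders:

#!/bin/bash
#SBATCH --time=08:00:00        # requested walltime (hh:mm:ss)
#SBATCH --ntasks=1

srun ./my_application          # placeholder for your actual program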

Instead of running one long job, you should split it up into a chain job. Even applications that are not capable of checkpoint/restart can be adapted. Please refer to the section Checkpoint/Restart for further documentation.
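One common way to build such a chain is Slurm's job dependency mechanism, sketched here with placeholder script names and job ID; the second job only starts after the first one has finished successfully:

marie@login$ sbatch job_step_1.sh
marie@login$ sbatch --dependency=afterok:<jobid_of_step_1> job_step_2.sh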

Partitions
(Figure: overview of the partitions at ZIH systems)

Memory Limits

Memory limits are enforced.

Jobs which exceed their per-node memory limit are killed automatically by the batch system.

Memory requirements for your job can be specified via the sbatch/srun parameters:

--mem-per-cpu=<MB> (memory per CPU) or --mem=<MB> (memory per node). The default limit is quite low at 300 MB per CPU.
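For example, the following lines sketch how to raise the memory request in a job script; the values are only illustrative and must fit the limits of the chosen partition:

#SBATCH --mem-per-cpu=2541     # memory per allocated CPU in MB

or, alternatively, per node:

#SBATCH --mem=60000            # memory per node in MB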

The ZIH systems comprise different sets of nodes with different amounts of installed memory, which affects where your job may run. To achieve the shortest possible waiting time for your jobs, you should be aware of the limits shown in the following table.

Partitions and memory limits
| Partition | Nodes | # Nodes | Cores per Node | MB per Core | MB per Node | GPUs per Node |
|---|---|---|---|---|---|---|
| interactive | taurusi[6605-6612] | 8 | 24 | 2541 | 61000 | - |
| haswell64 | taurusi[4037-4104,5001-5612,6001-6604] | 1284 | 24 | 2541 | 61000 | - |
| haswell64ht | taurusi[4018-4036] | 18 | 24 (HT: 48) | 1270* | 61000 | - |
| haswell128 | taurusi[4105-4188] | 84 | 24 | 5250 | 126000 | - |
| haswell256 | taurusi[4189-4232] | 44 | 24 | 10583 | 254000 | - |
| broadwell | taurusi[4233-4264] | 32 | 28 | 2214 | 62000 | - |
| smp2 | taurussmp[3-7] | 5 | 56 | 36500 | 2044000 | - |
| gpu2** | taurusi[2045-2103] | 59 | 24 | 2583 | 62000 | 4 (2 dual GPUs) |
| hpdlf | taurusa[3-16] | 14 | 12 | 7916 | 95000 | 3 |
| ml** | taurusml[1-32] | 32 | 44 (HT: 176) | 1443* | 254000 | 6 |
| romeo** | taurusi[7001-7192] | 192 | 128 (HT: 256) | 1972* | 505000 | - |
| julia | taurussmp8 | 1 | 896 | 27343 | 49000000 | - |
| alpha** | taurusi[8001-8034] | 34 | 48 (HT: 96) | 10312* | 990000 | 8 |
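The limits configured in Slurm can also be queried directly on the system, for example for the partition haswell64:

marie@login$ scontrol show partition haswell64
marie@login$ sinfo --partition=haswell64 --format="%P %l %D %c %m"

The sinfo format string prints the partition name, time limit, node count, CPUs per node, and memory per node (in MB).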

Note

Some nodes have multithreading (SMT) enabled, so for every physical core allocated (e.g., with SLURM_HINT=nomultithread), you will always get MB per Core * number of threads, because the memory of the other threads is allocated implicitly, too. Those nodes are marked with an asterisk.

Some of the partitions, denoted with a double asterisk, have a counterpart for interactive jobs. These partitions carry an -interactive suffix (e.g. ml-interactive) and have the same configuration.

There is also a meta partition haswell, which contains the partitions haswell64, haswell128, haswell256, and smp2; it is also the default partition. If you specify no partition or the partition haswell, a Slurm plugin will choose the partition that fits your memory requirements. There are some other partitions that are not listed in the table above, but they should not be used directly.
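For example, a job can target a specific partition or its interactive counterpart explicitly; the job script name and resource values below are only placeholders:

marie@login$ sbatch --partition=haswell256 my_jobscript.sh
marie@login$ srun --partition=ml-interactive --gres=gpu:1 --time=01:00:00 --pty bash -l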