


Slurm

The HRSK-II systems are operated with the batch system Slurm. Just specify the resources you need in terms of cores, memory, and time and your job will be placed on the system.

Job Submission

Job submission can be done with the command: srun [options] <command>

However, using srun directly on the shell will block and launch an interactive job. Apart from short test runs, it is recommended to launch your jobs in the background by using batch jobs. For that, you can conveniently put the parameters directly into a job file, which you can submit using sbatch [options] <job file>

Some options of srun/sbatch are:

Slurm option: Description
-n <N> or --ntasks <N>: set the number of tasks to N (default: 1). This determines how many processes will be spawned by srun (for MPI jobs).
-N <N> or --nodes <N>: set the number of nodes that will be part of the job. On each node, --ntasks-per-node processes will be started; if the option --ntasks-per-node is not given, 1 process per node will be started.
--ntasks-per-node <N>: how many tasks per allocated node to start, as stated in the line before.
-c <N> or --cpus-per-task <N>: needed for multithreaded (e.g. OpenMP) jobs; it tells SLURM to allocate N cores per task. Typically N should be equal to the number of threads your program spawns, e.g. it should be set to the same number as OMP_NUM_THREADS.
-p <name> or --partition <name>: select the type of nodes where you want to execute your job; on Taurus we currently have haswell, smp, sandy, west, and gpu available.
--mem-per-cpu <MB>: specify the memory needed per allocated CPU in MB.
--time <HH:MM:SS>: specify the maximum runtime of your job; if you just put in a single number, it will be interpreted as minutes.
--mail-user <your email>: tell the batch system your email address to get updates about the status of the jobs.
--mail-type ALL: specify for which types of events you want to get a mail; valid options besides ALL are: BEGIN, END, FAIL, REQUEUE.
-J <name> or --job-name <name>: give your job a name which is shown in the queue; the name will also be included in job emails (but cut after 24 chars within emails).
--exclusive: tell SLURM that only your job is allowed on the nodes allocated to this job; please be aware that you will be charged for all CPUs/cores on the node.
-A <project>: charge resources used by this job to the specified project; useful if a user belongs to multiple projects.
-o <filename> or --output <filename>: specify a file name that will be used to store all normal output (stdout); you can use %j (job id) and %N (name of first node) to automatically adapt the file name to the job. By default, stdout goes to "slurm-%j.out".
-e <filename> or --error <filename>: specify a file name that will be used to store all error output (stderr); you can use %j (job id) and %N (name of first node) to automatically adapt the file name to the job. By default, stderr goes to "slurm-%j.out" as well.
-a or --array: submit an array job; see the extra section below.
-w <node1>,<node2>,...: restrict the job to run on the specified nodes only.
-x <node1>,<node2>,...: exclude the specified nodes from the job.
The following example job file shows how you can make use of sbatch:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --output=simulation-m-%j.out
#SBATCH --error=simulation-m-%j.err
#SBATCH --ntasks=512
#SBATCH -A myproject

echo Starting Program

During runtime, the environment variable SLURM_JOB_ID will be set to the id of your job.
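As a minimal sketch (the output directory and program name are placeholders), the job id can be used to keep the results of different runs apart:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --ntasks=1

# store the results of this run in a directory named after the job id
OUTDIR="results_${SLURM_JOB_ID}"
mkdir -p "$OUTDIR"
srun ./my_program > "$OUTDIR/output.log"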

You can also use our Slurm Batch File Generator, which can help you create basic SLURM job scripts.

Detailed information on memory limits on Taurus.

Interactive Jobs

Interactive activities like editing, compiling etc. are normally limited to the login nodes. For longer interactive sessions you can allocate cores on a compute node with the command "salloc". It takes the same options as sbatch to specify the required resources.

The difference to LSF is that salloc returns a new shell on the node where you submitted the job. You need to prefix the following commands with srun to have them executed on the allocated resources. If you allocate more than one task, please be aware that srun will run the command on each allocated task!
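As a minimal sketch of this salloc workflow (the program name is a placeholder):

salloc --ntasks=1 --cpus-per-task=4 --time=1:00:00 --mem-per-cpu=1700
srun ./my_program    # executed on the allocated compute node, not on the login node
exit                 # release the allocation when you are done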

An example of an interactive session using srun looks like:

tauruslogin3 /home/mark> srun --pty -n 1 -c 4 --time=1:00:00 --mem-per-cpu=1700 bash
srun: job 13598400 queued and waiting for resources
srun: job 13598400 has been allocated resources
taurusi1262 /home/mark> # start interactive work with e.g. 4 cores.

Note: A dedicated partition "interactive" is reserved for short jobs (<2h) with not more than one job per user. Please check the availability of nodes there with sinfo -p interactive .

Interactive X11/GUI Jobs

SLURM will forward your X11 credentials to the first (or even all) nodes of a job with the (undocumented) --x11 option. For example, an interactive session for 1 hour with Matlab using eight cores can be started with:

module load matlab
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first matlab

Requesting an Nvidia Tesla K20X / K80

SLURM will allocate one or more GPUs for your job if requested. Please note that GPUs are only available in the gpu and gpu-interactive partitions. The option for sbatch/srun in this case is --gres=gpu:[NUM_PER_NODE] (where NUM_PER_NODE can be 1, 2 or 4, meaning that one, two or four of the GPUs per node will be used for the job). A sample job file could look like this:
#!/bin/bash
#SBATCH -A Project1            # account CPU time to Project1
#SBATCH --nodes=2              # request 2 nodes
#SBATCH --mincpus=1            # allocate one task per node...
#SBATCH --ntasks=2             # ...which means 2 tasks in total (see note below)
#SBATCH --cpus-per-task=6      # use 6 threads per task
#SBATCH --gres=gpu:1           # use 1 GPU per node (i.e. use one GPU per task)
#SBATCH --time=01:00:00        # run for 1 hour

srun ./your/cuda/application   # start your application (probably requires MPI to use both nodes)

Please be aware that the partitions gpu, gpu1 and gpu2 can only be used for non-interactive jobs which are submitted by sbatch. Interactive jobs (salloc, srun) will have to use the partition gpu-interactive. SLURM will automatically select the right partition if the partition parameter (-p) is omitted.

Note: Due to an unresolved issue concerning the SLURM job scheduling behavior, it is currently not practical to use --ntasks-per-node together with GPU jobs. If you want to use multiple nodes, please use the parameters --ntasks and --mincpus instead. The value of mincpus * nodes has to equal ntasks in this case.

Limitations of GPU job allocations

The number of cores per node that are currently allowed to be allocated for GPU jobs is limited depending on how many GPUs are being requested. On the K80 nodes, you may only request up to 6 cores per requested GPU (8 per requested GPU on the K20X nodes). This is to prevent GPUs from remaining unusable because a single job uses all cores on a node without, at the same time, requesting all of its GPUs.

E.g., if you specify --gres=gpu:2, your total number of cores per node (meaning: ntasks * cpus-per-task) may not exceed 12 (on the K80 nodes).

Note that this also has implications for the use of the --exclusive parameter. Since this sets the number of allocated cores to 24 (or 16 on the K20X nodes), you also must request all four GPUs by specifying --gres=gpu:4, otherwise your job will not start. In the case of --exclusive, it won't be denied on submission, because this is evaluated in a later scheduling step. Jobs that directly request too many cores per GPU will be denied with the error message:
Batch job submission failed: Requested node configuration is not available
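As an illustrative sketch, a request that stays within the limit described above (values chosen for a K80 node) could look like:

#SBATCH --gres=gpu:2           # request 2 GPUs per node
#SBATCH --ntasks=2             # 2 tasks on the node...
#SBATCH --cpus-per-task=6      # ...with 6 cores each: 2 * 6 = 12 cores, within the limit for 2 GPUs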

Parallel Jobs

For submitting parallel jobs, a few rules have to be understood and followed. In general they depend on the type of parallelization and the architecture.

OpenMP Jobs

An SMP-parallel job can only run within a node, so it is necessary to include the options -N 1 and -n 1. The maximum number of processors for an SMP-parallel program is 488 on Venus and 56 on Taurus (SMP island). With --cpus-per-task <N>, SLURM will start one task and you will have N CPUs available for your job. An example job file would look like:
#!/bin/bash
#SBATCH -J Science1
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mail-type=end
#SBATCH --mail-user=your.name@tu-dresden.de
#SBATCH --time=08:00:00

export OMP_NUM_THREADS=8
./path/to/binary

MPI Jobs

For MPI jobs one typically allocates one core per task that has to be started. Please note: There are different MPI libraries on Taurus and Venus, so you have to compile the binaries specifically for their target.

#!/bin/bash
#SBATCH -J Science1
#SBATCH --ntasks=864
#SBATCH --mail-type=end
#SBATCH --mail-user=your.name@tu-dresden.de
#SBATCH --time=08:00:00

srun ./path/to/binary

Exclusive Jobs for Benchmarking

Jobs on Taurus run, by default, in shared mode, meaning that multiple jobs can run on the same compute nodes. Sometimes, this behaviour is not desired (e.g. for benchmarking purposes), in which case it can be turned off by specifying the SLURM parameter --exclusive .

Setting --exclusive only makes sure that there will be no other jobs running on your nodes. It does not, however, mean that you automatically get access to all the resources which the node might provide without explicitly requesting them, e.g. you still have to request a GPU via the generic resources parameter (gres) to run on the GPU partitions, or you still have to request all cores of a node if you need them.

CPU cores can either be used for a task (--ntasks) or for multi-threading within the same task (--cpus-per-task). Since those two options are semantically different (e.g., the former will influence how many MPI processes will be spawned by srun whereas the latter does not), SLURM cannot determine automatically which of the two you might want to use. Since we use cgroups for the separation of jobs, your job is not allowed to use more resources than requested.

If you just want to use all available cores in a node, you have to specify how Slurm should organize them, like with "-p haswell -c 24" or "-p haswell --ntasks-per-node=24".

Here is a short example to ensure that a benchmark is not spoiled by other jobs, even if it doesn't use up all resources in the nodes:
#!/bin/bash
#SBATCH -J Benchmark
#SBATCH -p haswell
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH --exclusive            # ensure that nobody spoils my measurement on 2 x 2 x 8 cores
#SBATCH --mail-user=your.name@tu-dresden.de
#SBATCH --time=00:10:00

srun ./my_benchmark

Array Jobs

Array jobs can be used to create a sequence of jobs that share the same executable and resource requirements, but have different input files, to be submitted, controlled, and monitored as a single unit. The arguments -a or --array take an additional parameter that specifies the array indices. Within the job you can read the environment variables SLURM_ARRAY_JOB_ID, which will be set to the first job ID of the array, and SLURM_ARRAY_TASK_ID, which will be set individually for each step.

Within an array job, you can use %a and %A in addition to %j and %N (described above) to make the output file name specific to the job. %A will be replaced by the value of SLURM_ARRAY_JOB_ID and %a will be replaced by the value of SLURM_ARRAY_TASK_ID.

Here is an example of what an array job can look like:

#!/bin/bash
#SBATCH -J Science1
#SBATCH --array 0-9
#SBATCH -o arraytest-%A_%a.out
#SBATCH -e arraytest-%A_%a.err
#SBATCH --ntasks=864
#SBATCH --mail-type=end
#SBATCH --mail-user=your.name@tu-dresden.de
#SBATCH --time=08:00:00

echo "Hi, I am step $SLURM_ARRAY_TASK_ID in this array job $SLURM_ARRAY_JOB_ID"

For further details please read the Slurm documentation.

Chain Jobs

You can use chain jobs to create dependencies between jobs. This is often the case if a job relies on the result of one or more preceding jobs. Chain jobs can also be used if the runtime limit of the batch queues is not sufficient for your job. SLURM has an option -d or "--dependency" that allows you to specify that a job may only start after another job has finished.

Here is an example of what a chain job can look like. The example submits 4 jobs (described in a job file) that will be executed one after another with different numbers of CPUs:
#!/bin/bash
TASK_NUMBERS="1 2 4 8"
DEPENDENCY=""
JOB_FILE="myjob.slurm"

for TASKS in $TASK_NUMBERS ; do
    JOB_CMD="sbatch --ntasks=$TASKS"
    if [ -n "$DEPENDENCY" ] ; then
        JOB_CMD="$JOB_CMD --dependency afterany:$DEPENDENCY"
    fi
    JOB_CMD="$JOB_CMD $JOB_FILE"
    echo -n "Running command: $JOB_CMD  "
    OUT=`$JOB_CMD`
    echo "Result: $OUT"
    DEPENDENCY=`echo $OUT | awk '{print $4}'`
done

Binding and Distribution of Tasks

SLURM provides several binding strategies to place and bind the tasks and/or threads of your job to cores, sockets and nodes. Note: Keep in mind that the distribution method has a direct impact on the execution time of your application. Manipulating the distribution can either speed up or slow down your application. More detailed information about the binding can be found here.

The default allocation of tasks/threads for OpenMP, MPI, and hybrid (MPI and OpenMP) jobs is as follows.

OpenMP

The illustration below shows the default binding of a pure OpenMP job on 1 node with 16 CPUs on which 16 threads are allocated.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=16

export OMP_NUM_THREADS=16

srun --ntasks 1 --cpus-per-task $OMP_NUM_THREADS ./application

MPI

The illustration below shows the default binding of a pure MPI job, in which 32 global ranks are distributed onto 2 nodes with 16 cores each. Each rank has 1 core assigned to it.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=16
#SBATCH --cpus-per-task=1

srun --ntasks 32 ./application

Hybrid (MPI and OpenMP)

The illustration below shows the default binding of a hybrid job, in which 8 global ranks are distributed onto 2 nodes with 16 cores each. Each rank has 4 cores assigned to it.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4

srun --ntasks 8 --cpus-per-task $OMP_NUM_THREADS ./application

Editing Jobs

Jobs that have not yet started can be altered. With scontrol update timelimit=4:00:00 jobid=<jobid> it is, for example, possible to modify the maximum runtime. scontrol understands many different options; please take a look at the man page for more details.
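As a sketch of a few commonly useful scontrol calls (<jobid> is a placeholder):

scontrol update timelimit=4:00:00 jobid=<jobid>    # change the time limit of a pending job
scontrol show job <jobid>                          # inspect the current job settings
scontrol hold <jobid>                              # prevent a pending job from starting
scontrol release <jobid>                           # allow a held job to start again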

Job and SLURM Monitoring

On the command line, use squeue to watch the scheduling queue. This command will tell you why a job is not running (the reason is shown in the last column of the output). More information about job parameters can also be obtained with scontrol -d show job <jobid>. Here are detailed descriptions of the possible reasons:

Reason Long description
Dependency This job is waiting for a dependent job to complete.
None No reason is set for this job.
PartitionDown The partition required by this job is in a DOWN state.
PartitionNodeLimit The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit The job's time limit exceeds its partition's current time limit.
Priority One or more higher priority jobs exist for this partition.
Resources The job is waiting for resources to become available.
NodeDown A node required by the job is down.
BadConstraints The job's constraints cannot be satisfied.
SystemFailure Failure of the SLURM system, a file system, the network, etc.
JobLaunchFailure The job could not be launched. This may be due to a file system problem, invalid program name, etc.
NonZeroExitCode The job terminated with a non-zero exit code.
TimeLimit The job exhausted its time limit.
InactiveLimit The job reached the system InactiveLimit.
In addition, the sinfo command gives you a quick status overview.

For detailed information on why your submitted job has not started yet, you can use whypending <jobid>.
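As a short sketch of a few further monitoring calls (<jobid> and the partition name are placeholders):

squeue -u $USER                 # show only your own jobs, including the pending reason
scontrol -d show job <jobid>    # detailed information about a single job
sinfo -p haswell                # status of the nodes in a single partition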

Accounting

The SLURM command sacct provides job statistics such as memory usage, CPU time, energy usage etc. Examples:

# show all own jobs contained in the accounting database
sacct
# show specific job
sacct -j <JOBID>
# specify fields
sacct -j <JOBID> -o JobName,MaxRSS,MaxVMSize,CPUTime,ConsumedEnergy
# show all fields
sacct -j <JOBID> -o ALL

Read the manpage (man sacct) for information on the provided fields.

Killing jobs

The command scancel <jobid> kills a single job and removes it from the queue. By using scancel -u <username> you are able to kill all of your jobs at once.
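As a short sketch (<jobid> is a placeholder):

scancel <jobid>                     # cancel a single job
scancel -u $USER                    # cancel all of your jobs
scancel -u $USER --state=PENDING    # cancel only your pending jobs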

Host List

If you want to place your job onto specific nodes, there are two options for doing this: either use -p to specify a host group that fits your needs, or use -w (or --nodelist) with a list of nodes that will work for you.
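As an illustrative sketch (the node names and file names are placeholders):

srun -w taurusi1001,taurusi1002 ./my_program    # run only on the named nodes
sbatch -x taurusi1001 my_job.sh                 # run anywhere except on the excluded node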

Job Profiling

SLURM offers the option to gather profiling data from every task / node of the job. The following data can be gathered:
  • Task data, such as CPU frequency, CPU utilization, memory consumption (RSS and VMSize), I/O
  • Energy consumption of the nodes
  • Infiniband data (currently deactivated)
  • Lustre filesystem data (currently deactivated)
The data is sampled at a fixed rate (e.g. every 5 seconds) and is stored in an HDF5 file.

CAUTION: Please be aware that the profiling data may be quite large, depending on job size, runtime, and sampling rate. Always remove the local profiles from /lustre/scratch2/profiling/${USER}, either by running sh5util as shown below or by simply removing those files.

Usage examples:

# create energy and task profiling data (--acctg-freq is the sampling rate in seconds)
srun --profile=All --acctg-freq=5,energy=5 -n 32 ./a.out
# create task profiling data only
srun --profile=All --acctg-freq=5 -n 32 ./a.out

# merge the node local files in /lustre/scratch2/profiling/${USER} to single file
# (without -o option output file defaults to job_<JOBID>.h5)
sh5util -j <JOBID> -o profile.h5
# in jobscripts or in interactive sessions (via salloc):
sh5util -j ${SLURM_JOBID} -o profile.h5

# view data:
module load hdf5/hdfview
hdfview.sh profile.h5

More information about profiling with SLURM:

Reservations

If you want to run jobs whose specifications are outside of our job limits, you can ask for a reservation (hpcsupport@zih.tu-dresden.de). Please add the following information to your request mail:
  • start time (please note that the start time has to be at least 7 days after the day of the request, preferably more, because the longest jobs run for 7 days)
  • duration or end time
  • account
  • node count or cpu count
  • partition
Once we have agreed on your requirements, we will send you an e-mail with your reservation name. You can then see more information about your reservation with the following command:

scontrol show res=<reservation name>
e.g. scontrol show res=hpcsupport_123

If you want to use your reservation, you have to add the parameter "--reservation=<reservation name>" either in your sbatch script or to your srun or salloc command.
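As a short sketch, using the reservation name from the example above (the program and job file names are placeholders):

# inside a job file:
#SBATCH --reservation=hpcsupport_123
# or directly on the command line:
srun --reservation=hpcsupport_123 ./my_program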

SLURM External Links