MEGWARE PC-Farm Atlas (Outdated)¶

Warning

This page is deprecated! Atlas is a former system!

System¶

The PC farm Atlas is a heterogeneous, general purpose cluster based on multicore chips AMD Opteron 6274 ("Bulldozer"). The nodes are operated by the Linux operating system SUSE SLES 11 with a 2.6 kernel. Currently, the following hardware is installed:

Component	Count
CPUs	AMD Opteron 6274
number of cores	5120
th. peak performance	45 TFLOPS
compute nodes	4-way nodes Saxonid with 64 cores
nodes with 64 GB RAM	48
nodes with 128 GB RAM	12
nodes with 512 GB RAM	8

Mars and Deimos users: Please read the migration hints.

All nodes share the /home and /fastfs filesystem with our other HPC systems. Each node has 180 GB local disk space for scratch mounted on /tmp. The jobs for the compute nodes are scheduled by the Platform LSF batch system from the login nodes atlas.hrsk.tu-dresden.de .

A QDR InfiniBand interconnect provides the communication and I/O infrastructure for low latency / high throughput data traffic.

Users with a login on the SGI Altix can access their home directory via NFS below the mount point /hpc_work.

CPU AMD Opteron 6274¶

Component	Count
Clock rate	2.2 GHz
cores	16
L1 data cache	16 KB per core
L1 instruction cache	64 KB shared in a module (i.e. 2 cores)
L2 cache	2 MB per module
L3 cache	12 MB total, 6 MB shared between 4 modules = 8 cores
FP units	1 per module (supports fused multiply-add)
th. peak performance	8.8 GFLOPS per core (w/o turbo)

The CPU belongs to the x86_64 family. Since it is fully capable of running x86-code, one should compare the performances of the 32 and 64 bit versions of the same code.

For more architectural details, see the AMD Bulldozer block diagram and topology of Atlas compute nodes.

Usage¶

Compiling Parallel Applications¶

When loading a compiler module on Atlas, the module for the MPI implementation OpenMPI is also loaded in most cases. If not, you should explicitly load the OpenMPI module with module load openmpi. This also applies when you use the system's (old) GNU compiler.

Use the wrapper commands mpicc , mpiCC , mpif77 , or mpif90 to compile MPI source code. They use the currently loaded compiler. To reveal the command lines behind the wrappers, use the option -show.

For running your code, you have to load the same compiler and MPI module as for compiling the program. Please follow the outlined guidelines to run your parallel program using the batch system.

Batch System¶

Applications on an HPC system can not be run on the login node. They have to be submitted to compute nodes with dedicated resources for the user's job. Normally a job can be submitted with these data:

number of CPU cores,
requested CPU cores have to belong on one node (OpenMP programs) or can distributed (MPI),
memory per process,
maximum wall clock time (after reaching this limit the process is killed automatically),
files for redirection of output and error messages,
executable and command line parameters.

LSF¶

The batch system on Atlas is LSF, see also the general information on LSF.

Submission of Parallel Jobs¶

To run MPI jobs ensure that the same MPI module is loaded as during compile-time. In doubt, check you loaded modules with module list. If you code has been compiled with the standard OpenMPI installation, you can load the OpenMPI module via module load openmpi.

Please pay attention to the messages you get loading the module. They are more up-to-date than this manual. To submit a job the user has to use a script or a command-line like this:

bsub -n <N> mpirun <program name>

Memory Limits¶

Memory limits are enforced. This means that jobs which exceed their per-node memory limit may be killed automatically by the batch system.

The default limit is 300 MB per job slot (bsub -n).

Atlas has sets of nodes with different amount of installed memory which affect where your job may be run. To achieve the shortest possible waiting time for your jobs, you should be aware of the limits shown in the following table and read through the explanation below.

Nodes	No. of Cores	Avail. Memory per Job Slot	Max. Memory per Job Slot for Oversubscription
`n[001-047]`	`3008`	`940 MB`	`1880 MB`
`n[049-072]`	`1536`	`1950 MB`	`3900 MB`
`n[085-092]`	`512`	`8050 MB`	`16100 MB`

Explanation¶

The amount of memory that you request for your job (-M ) restricts to which nodes it will be scheduled. Usually, the column Avail. Memory per Job Slot shows the maximum that will be allowed on the respective nodes.

However, we allow for oversubscribing of job slot memory. This means that jobs which use -n32 or less may be scheduled to smaller memory nodes.

Have a look at the examples below.

Monitoring Memory Usage¶

At the end of the job completion mail there will be a link to a website which shows the memory usage over time per node. This will only be available for longer running jobs (>10 min).

Examples¶

Job Spec.	Nodes Allowed	Remark
`bsub -n 1 -M 500`	All nodes	<= 940 Fits everywhere
`bsub -n 64 -M 700`	All nodes	<= 940 Fits everywhere
`bsub -n 4 -M 1800`	All nodes	Is allowed to oversubscribe on small nodes n[001-047]
`bsub -n 64 -M 1800`	`n[049-092]`	64*1800 will not fit onto a single small node and is therefore restricted to running on medium and large nodes
`bsub -n 4 -M 2000`	`-n[049-092]`	Over limit for oversubscribing on small nodes `n[001-047]`, but may still go to medium nodes
`bsub -n 32 -M 2000`	`-n[049-092]`	Same as above
`bsub -n 32 -M 1880`	All nodes	Using max. 1880 MB, the job is eligible for running on any node
`bsub -n 64 -M 2000`	`-n[085-092]`	Maximum for medium nodes is 1950 per slot - does the job really need 2000 MB per process?
`bsub -n 64 -M 1950`	`n[049-092]`	When using 1950 as maximum, it will fit to the medium nodes
`bsub -n 32 -M 16000`	`n[085-092]`	Wait time might be very long
`bsub -n 64 -M 16000`	`n[085-092]`	Memory request cannot be satisfied (6416 MB = 1024 GB), cannot schedule job*

Batch Queues¶

Batch queues are subject to (mostly minor) changes anytime. The scheduling policy prefers short running jobs over long running ones. This means that short jobs get higher priorities and are usually started earlier than long running jobs.

Batch Queue	Admitted Users	Max. Cores	Default Runtime	Max. Runtime
`interactive`	`all`	n/a	12h 00min	12h 00min
`short`	`all`	1024	1h 00min	24h 00min
`medium`	`all`	1024	24h 01min	72h 00min
`long`	`all`	1024	72h 01min	120h 00min
`rtc`	`on request`	4	12h 00min	300h 00min