HPC Resources

The HPC resources at ZIH comprise the High Performance Computing and Storage Complex and its extension, High Performance Computing – Data Analytics. Together they offer scientists about 60,000 CPU cores and a peak performance of more than 1.5 quadrillion floating point operations per second (1.5 PFLOPS). The architecture is tailored to data-intensive computing, Big Data analytics, and artificial intelligence methods. Its extensive capabilities for energy measurement and performance monitoring provide ideal conditions for achieving the ambitious research goals of the users and the ZIH.

HPC Systems Migration Phase

On December 11, 2023, Taurus will be decommissioned for good.

With our new HPC system Barnard comes a significant change in the HPC system landscape at ZIH: we will have five homogeneous clusters, each with its own Slurm instance and with cluster-specific login nodes running on the same CPU type as the compute nodes.

The installation and start of operation of the new HPC system Barnard bring significant changes to the HPC system landscape at ZIH. The former HPC system Taurus is partly switched off and partly split up into separate clusters. In the end, from the users' perspective, there will be five separate clusters:

Name            Description             Year of Installation   DNS
Barnard         CPU cluster             2023                   n[1001-1630].barnard.hpc.tu-dresden.de
Alpha Centauri  GPU cluster             2021                   i[8001-8037].alpha.hpc.tu-dresden.de
Julia           Single SMP system       2021                   julia.hpc.tu-dresden.de
Romeo           CPU cluster             2020                   i[8001-8190].romeo.hpc.tu-dresden.de
Power9          IBM Power/GPU cluster   2018                   ml[1-29].power9.hpc.tu-dresden.de

Each cluster will run its own Slurm batch system, and job submission is possible only from the respective login nodes.
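The bracket notation in the DNS column is a compact way of writing a range of hostnames. As a minimal sketch of how such a range expands, the following Python function handles a single numeric range per pattern. The name `expand_hostlist` is hypothetical and not part of any ZIH tooling.

```python
import re


def expand_hostlist(pattern: str) -> list[str]:
    """Expand one bracketed range, e.g. 'n[1001-1630].example', into hostnames."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No bracket range: the pattern already names a single host.
        return [pattern]
    prefix, first, last, suffix = match.groups()
    width = len(first)  # preserve zero padding if the range uses it
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(first), int(last) + 1)
    ]


barnard = expand_hostlist("n[1001-1630].barnard.hpc.tu-dresden.de")
print(len(barnard), barnard[0], barnard[-1])
```

For Barnard, the range n[1001-1630] covers 630 hostnames, which matches the 630 nodes listed in the Barnard section.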

Architectural Re-Design 2023

Over the last decade, we ran a highly heterogeneous HPC system under a single Slurm batch system. This made things very complicated, especially for inexperienced users. With the replacement of the Taurus system by the cluster Barnard, we now create homogeneous clusters, each with its own Slurm instance and cluster-specific login nodes running on the same CPU type as its compute nodes. Job submission will be possible only from within the cluster (compute or login node).

All clusters will be integrated into the new InfiniBand fabric and will then have the same access to the shared filesystems. This recabling will require a brief downtime of a few days.

Architecture overview 2023

Compute Systems

All compute clusters now act as separate entities, each with its own login nodes of the same hardware and its own Slurm batch system. Different hardware, e.g. Romeo and Alpha Centauri, is no longer managed via a single Slurm instance with corresponding partitions. Instead, you as a user now choose the hardware by choosing the corresponding login node.
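Because the login node determines the hardware, picking a cluster amounts to picking a hostname. The sketch below maps cluster names to the login-node patterns listed in the cluster sections of this page and builds the matching `ssh` command. The helper `ssh_command`, the short cluster keys, and the user name `marie` are assumptions made for illustration.

```python
# Login-node templates and counts, taken from the cluster sections of this page.
LOGIN_NODES = {
    "barnard": ("login{n}.barnard.hpc.tu-dresden.de", 4),
    "alpha": ("login{n}.alpha.hpc.tu-dresden.de", 2),
    "romeo": ("login{n}.romeo.hpc.tu-dresden.de", 2),
    "power9": ("login{n}.power9.hpc.tu-dresden.de", 2),
    "julia": ("julia.hpc.tu-dresden.de", 1),  # single node, no index
}


def ssh_command(cluster: str, user: str, node: int = 1) -> list[str]:
    """Return the ssh command (as an argument list) for a cluster's login node."""
    template, count = LOGIN_NODES[cluster.lower()]
    if not 1 <= node <= count:
        raise ValueError(f"{cluster} has login nodes 1..{count}, got {node}")
    return ["ssh", f"{user}@{template.format(n=node)}"]


# 'marie' is a placeholder user name.
print(" ".join(ssh_command("romeo", "marie")))
```

Returning an argument list rather than a string keeps the result directly usable with `subprocess.run` without shell quoting concerns.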

The login nodes can be used for smaller interactive work on the clusters. However, restrictions on usable resources and time per user are in place. For larger computations, please use interactive jobs.

Storage Systems

To make the major categories (size, speed) easier to grasp, the work filesystems are now named after animals.

Permanent Filesystems

We now have /home and /software in a Lustre filesystem. Snapshots and tape backup are configured. (/projects remains the same until the recabling.)

The Lustre filesystem /data/walrus is meant for larger data with slow access. It replaces /warm_archive.

Work Filesystems

With new players entering the filesystem market, it is getting more and more complicated to identify the best-suited filesystem for a specific use case. Often, only tests can find the best setup for a specific workload.

  • /data/horse - 20 PB - high bandwidth (Lustre)
  • /data/octopus - 0.5 PB - for interactive usage (Lustre) - to be mounted on Alpha Centauri
  • /data/weasel - 1 PB - for high IOPS (WEKA) - coming 2024.
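As a rough illustration of such a test, the following Python sketch measures sequential write throughput in a given directory. It is a deliberately simple probe under stated assumptions: it covers only one access pattern (large sequential writes), the function name `write_throughput` is hypothetical, and real workloads with many small files or random access can behave very differently.

```python
import os
import tempfile
import time


def write_throughput(directory: str, size_mb: int = 64, block_mb: int = 4) -> float:
    """Write about size_mb of data into directory and return the rate in MB/s."""
    block = os.urandom(block_mb * 1024 * 1024)
    blocks = size_mb // block_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(blocks):
                handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())  # include the time to reach storage
        elapsed = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)  # always clean up the temporary file
    return blocks * block_mb / elapsed


if __name__ == "__main__":
    # Replace this list with your own directories on the filesystems to compare.
    for candidate in [tempfile.gettempdir()]:
        rate = write_throughput(candidate, size_mb=16)
        print(f"{candidate}: {rate:.1f} MB/s")
```

Running the same probe against directories on different work filesystems gives a first impression only. Repeating the measurement and using file sizes typical for your own workload makes the comparison more meaningful.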

Difference Between "Work" And "Permanent"

A large number of changing files is a challenge for any backup system. To protect our snapshots and backup from work data, /projects cannot be used for temporary data on the compute nodes - it is mounted read-only.

For /home, we create snapshots and tape backups. That is why working there with a high frequency of changing files is a bad idea.

Please use our data mover mechanisms to transfer data worth keeping to permanent storage or long-term archives.

Migration Phase

For about one month, the new cluster Barnard and the old cluster Taurus will run side by side, each with its own filesystems. We provide a comprehensive description of the migration to Barnard.

Login and Dataport Nodes

Do not use Taurus for production anymore.

  • Login-Nodes
    • Individual for each cluster. See sections below.
  • 2 Data-Transfer-Nodes
    • 2 servers without interactive login, only available via file transfer protocols (rsync, ftp)
    • dataport[3-4].hpc.tu-dresden.de
    • IPs: 141.30.73.[4,5]
    • Further information on the usage is documented on the site Dataport Nodes
  • Outdated: 2 Data-Transfer-Nodes taurusexport[3-4].hrsk.tu-dresden.de
    • DNS alias taurusexport.hrsk.tu-dresden.de
    • 2 servers without interactive login, only available via file transfer protocols (rsync, ftp)
    • available as long as outdated filesystems (e.g. scratch) are accessible
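Since the dataport nodes accept only file transfers, a typical use is an `rsync` call that targets one of them. The sketch below assembles such a command in Python without executing it. The helper `rsync_command`, the user name `marie`, and the target directory `/data/horse/example_workspace/` are placeholders, and the exact transfer setup should be checked against the dataport documentation.

```python
import shlex

# Hostnames from the list above: dataport[3-4].hpc.tu-dresden.de
DATAPORT_HOSTS = ["dataport3.hpc.tu-dresden.de", "dataport4.hpc.tu-dresden.de"]


def rsync_command(source: str, target_dir: str, user: str,
                  host: str = DATAPORT_HOSTS[0]) -> list[str]:
    """Build an rsync argument list that copies source to target_dir on a dataport node."""
    return [
        "rsync",
        "--archive",  # preserve permissions, timestamps, and directory structure
        "--verbose",
        "--partial",  # keep partially transferred files so a retry can resume
        source,
        f"{user}@{host}:{target_dir}",
    ]


# Placeholder user and target directory, for illustration only.
cmd = rsync_command("results/", "/data/horse/example_workspace/", user="marie")
print(shlex.join(cmd))
```

To actually start the transfer, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where `rsync` is installed.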

Barnard

The cluster Barnard is a general purpose cluster by Bull. It is based on Intel Sapphire Rapids CPUs.

  • 630 nodes, each with
    • 2 x Intel Xeon Platinum 8470 (52 cores) @ 2.00 GHz, Multithreading enabled
    • 512 GB RAM
    • 12 nodes provide 1.8 TB local storage on NVMe device at /tmp
    • All other nodes are diskless and have no or very limited local storage (i.e. /tmp)
  • Login nodes: login[1-4].barnard.hpc.tu-dresden.de
  • Hostnames: n[1001-1630].barnard.hpc.tu-dresden.de
  • Operating system: Red Hat Enterprise Linux 8.7
  • Further information on the usage is documented on the site CPU Cluster Barnard
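For sizing jobs, it can help to derive per-node figures from the list above. The following Python sketch does this arithmetic for Barnard. The class `NodeSpec` is hypothetical, the value of two hardware threads per core is an assumption about the enabled multithreading, and the memory per core is computed from the installed RAM, so the amount actually available to jobs will be somewhat lower.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    nodes: int
    sockets: int
    cores_per_socket: int
    ram_gb: int
    threads_per_core: int = 2  # assumption: two hardware threads per core

    @property
    def cores_per_node(self) -> int:
        return self.sockets * self.cores_per_socket

    @property
    def threads_per_node(self) -> int:
        return self.cores_per_node * self.threads_per_core

    @property
    def total_cores(self) -> int:
        return self.nodes * self.cores_per_node

    @property
    def ram_per_core_gb(self) -> float:
        # Based on installed RAM, not on the memory Slurm makes available to jobs.
        return self.ram_gb / self.cores_per_node


# Figures from the Barnard list: 630 nodes, 2 x 52 cores, 512 GB RAM per node.
barnard_spec = NodeSpec(nodes=630, sockets=2, cores_per_socket=52, ram_gb=512)
print(barnard_spec.cores_per_node, barnard_spec.total_cores,
      round(barnard_spec.ram_per_core_gb, 2))
```

With these inputs, a Barnard node has 104 physical cores and roughly 4.9 GB of installed RAM per core.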

Alpha Centauri

The cluster Alpha Centauri (short: Alpha) by NEC provides AMD Rome CPUs and NVIDIA A100 GPUs and is designed for AI and ML tasks.

  • 34 nodes, each with
    • 8 x NVIDIA A100-SXM4 Tensor Core-GPUs
    • 2 x AMD EPYC CPU 7352 (24 cores) @ 2.3 GHz, Multithreading available
    • 1 TB RAM
    • 3.5 TB local storage on NVMe device at /tmp
  • Login nodes: login[1-2].alpha.hpc.tu-dresden.de
  • Hostnames: i[8001-8037].alpha.hpc.tu-dresden.de
  • Operating system: Rocky Linux 8.7
  • Further information on the usage is documented on the site GPU Cluster Alpha Centauri

Romeo

The cluster Romeo is a general purpose cluster by NEC based on AMD Rome CPUs.

  • 192 nodes, each with
    • 2 x AMD EPYC CPU 7702 (64 cores) @ 2.0 GHz, Multithreading available
    • 512 GB RAM
    • 200 GB local storage on SSD at /tmp
  • Login nodes: login[1-2].romeo.hpc.tu-dresden.de
  • Hostnames: i[7001-7190].romeo.hpc.tu-dresden.de
  • Operating system: Rocky Linux 8.7
  • Further information on the usage is documented on the site CPU Cluster Romeo

Julia

The cluster Julia is a large SMP (shared memory parallel) system by HPE based on Superdome Flex architecture.

  • 1 node, with
    • 32 x Intel(R) Xeon(R) Platinum 8276M CPU @ 2.20 GHz (28 cores)
    • 47 TB RAM
  • Configured as one single node
  • 48 TB RAM (usable: 47 TB - one TB is used for cache coherence protocols)
  • 370 TB of fast NVMe storage available at /nvme/<projectname>
  • Login node: julia.hpc.tu-dresden.de
  • Hostname: julia.hpc.tu-dresden.de
  • Further information on the usage is documented on the site SMP System Julia

Maintenance from November 27 to December 12

The recabling will take place from November 27 to December 12. The following work is planned:

  • update the software stack (OS, firmware, software),
  • change the Ethernet access (new VLANs),
  • complete the integration of Romeo and Julia into the Barnard InfiniBand network to get full bandwidth access to all Barnard filesystems,
  • configure and deploy stand-alone Slurm batch systems.

After the maintenance, the Julia system reappears as a stand-alone cluster that can be reached via julia.hpc.tu-dresden.de.

Changes w.r.t. filesystems: Your new /home directory (from Barnard) will become your /home on Romeo, Julia, Alpha Centauri and the Power9 system. Thus, please migrate your /home from Taurus to your new /home on Barnard.

The old work filesystems /lustre/scratch and /lustre/ssd will be turned off on January 1 2024 for good (no data access afterwards!). The new work filesystem available on the Julia system will be /horse. Please migrate your working data to /horse.

Power9

The cluster Power9 by IBM is based on Power9 CPUs and provides NVIDIA V100 GPUs. Power9 is specifically designed for machine learning (ML) tasks.

  • 32 nodes, each with
    • 2 x IBM Power9 CPU (2.80 GHz, 3.10 GHz boost, 22 cores)
    • 256 GB RAM DDR4 2666 MHz
    • 6 x NVIDIA VOLTA V100 with 32 GB HBM2
    • NVLINK bandwidth 150 GB/s between GPUs and host
  • Login nodes: login[1-2].power9.hpc.tu-dresden.de
  • Hostnames: ml[1-29].power9.hpc.tu-dresden.de (after recabling phase; expected January '24)
  • Further information on the usage is documented on the site GPU Cluster Power9

Maintenance

The recabling will take place from November 27 to December 12. After the maintenance, the Power9 system reappears as a stand-alone cluster that can be reached via ml[1-29].power9.hpc.tu-dresden.de.

Changes w.r.t. filesystems: Your new /home directory (from Barnard) will become your /home on Romeo, Julia, Alpha Centauri and the Power9 system. Thus, please migrate your /home from Taurus to your new /home on Barnard.

The old work filesystems /lustre/scratch and /lustre/ssd will be turned off on January 1 2024 for good (no data access afterwards!). The only work filesystem available on the Power9 system will be /beegfs. Please migrate your working data to /horse.