phone prefix: +49 351 463.....

HPC Support

Operation Status

Ulf Markwardt: 33640
Claudia Schmidt: 39833
hpcsupport@zih.tu-dresden.de

Login and project application

Phone: 40000
Fax: 42328
servicedesk@tu-dresden.de

Foreword

This compendium is a work in progress: we try to incorporate more information with growing experience and with every question you ask us. We invite you to take part in improving these pages by correcting or adding useful information and by commenting on the pages.

Ulf Markwardt

News

  • 2017-11-09 HPC Introduction Chemnitz
  • 2017-10-27 Downtimes for IB extension
    Our planned extension of Taurus requires changes to the Infiniband infrastructure. Island 6 of Taurus will get a new IB switch layer (complete
    re-cabling), and a new Infiniband top-level switch will be added. This work requires the following major restrictions:
    • November 6-17 - Taurus Island 6: taurusi[6001-6612]
      • the interconnect network will be stopped
      • /scratch and /lustre/ssd will not be available
    • November 7-9 - Taurus and Venus
      • the complete interconnect network will be stopped
      • /scratch and /lustre/ssd will not be available
    • Single-node jobs with low data requirements can still be run; please follow these hints for running jobs without Infiniband.
  • 2017-09-14: Introduction to HPC - slides
  • 2017-08-08:
    New Hardware: We now have 32 nodes with Intel Broadwell CPUs (28 cores and 64 GB RAM each, see https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/HardwareTaurus#broadwell). To explicitly run a job on these nodes, use the Slurm option "-p broadwell" (a minimal job script sketch follows this list). Without a partition flag, Slurm automatically schedules jobs to these nodes according to their memory requirements. This integration and a Slurm update last week were responsible for a few minor interruptions in the batch system... thank you for your patience!
    New Module Environment: We have decided to make LMOD the new default for our environment module infrastructure on Taurus. Its main advantage is the higher speed of commands like "module avail" and of tab completion, thanks to the caching of module files (a few typical commands are shown after this list). (https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/RuntimeEnvironment#Lmod:_An_Alternative_Module_Implementation)
    Checkpoint/Restart: We are aiming to reduce the maximum job run time on our cluster step by step. To help you shorten your individual jobs, we now offer a solution that automatically creates chain jobs with generic checkpoint/restart from normal batch scripts (the chain-job idea is illustrated after this list). There are still some limitations with regard to MPI applications, but we would be very interested in your feedback for your sequential/multi-threaded applications. You can find detailed documentation in our HPC wiki. (https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/CheckpointRestart)
    Containers: We have now established Singularity as the container environment on Taurus. Your existing Docker containers can easily be converted (see the sketch after this list). This can also be very useful for making software run on Taurus that has several distribution-specific dependencies which are not easily installable in our RHEL-based environment, as well as for circumventing problems with binary-distributed software that, e.g., depends on a newer GLIBC version than is currently available on Taurus. (https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/Container)
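
A minimal job script sketch for explicitly requesting the Broadwell nodes mentioned above; the resource values and the program name are placeholders, so adjust them to your application.

    #!/bin/bash
    #SBATCH -p broadwell          # explicitly request the Broadwell partition
    #SBATCH -N 1                  # one node (28 cores, 64 GB RAM)
    #SBATCH -n 28                 # use all 28 cores of the node
    #SBATCH --time=01:00:00       # adjust to the expected run time
    #SBATCH --mem=60G             # stay below the 64 GB of node memory

    srun ./my_program             # placeholder for your own application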
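
A few typical module commands under LMOD, as a sketch; the module names are only examples and may not match the modules installed on Taurus.

    module avail                  # list available modules (fast, thanks to LMOD's cache)
    module spider openmpi         # LMOD-specific search across the whole module tree
    module load gcc               # load a module as before
    module list                   # show the currently loaded modules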
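
The automatic checkpoint/restart solution itself is described on the linked wiki page. As a rough, hand-written illustration of the underlying chain-job idea (not the ZIH tool), consecutive Slurm jobs can be coupled with dependencies; the script names are placeholders, and each script is assumed to write and resume from its own checkpoint files.

    # submit the first part of the computation
    jobid=$(sbatch --parsable part1.sh)
    # the second part starts only after the first has ended and
    # is expected to restart from the checkpoint written by part1.sh
    sbatch --dependency=afterany:$jobid part2.sh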
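
A sketch of the usual Singularity workflow for an existing Docker image; the image name and command are placeholders, and the exact syntax and resulting file name depend on the installed Singularity version.

    # convert a Docker image from Docker Hub into a Singularity image
    singularity pull docker://ubuntu:16.04
    # run a command inside the resulting container image
    singularity exec ubuntu-16.04.img cat /etc/os-release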