
GPU Cluster Alpha Centauri

The multi-GPU cluster Alpha Centauri has been installed for AI-related computations (ScaDS.AI).

The hardware specification is documented on the page HPC Resources.

Recabling Maintenance

User Action Required

Please read the following information carefully and follow the provided instructions.

We are in the process of making Alpha Centauri a stand-alone cluster. The next planned step is the integration of the cluster into the InfiniBand infrastructure of the new cluster Barnard.

Maintenance Work

On June 4 and 5, we will shut down Alpha Centauri and migrate it to the Barnard InfiniBand infrastructure.

As a consequence:

  • BeeGFS will no longer be available,
  • all Barnard filesystems (/home, /software, /data/horse, /data/walrus) can be used normally.

For your convenience, we have already started migrating your data from /beegfs to /data/horse/beegfs. Starting with the downtime, we will synchronize these data again.

User Action Required

The less data we have to synchronize, the faster the overall process will be. So, please clean up as much as possible as soon as possible.

Important for your work:

  • Do not add terabytes of data to /beegfs if you cannot "consume" it before June 4.
  • After the final successful data transfer to /data/horse/beegfs, you then have to move your data to normal workspaces on /data/horse (see the sketch after this list).
  • Be prepared to adapt your workflows to the new paths.
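For the move from /data/horse/beegfs to a regular workspace, a minimal sketch could look as follows (invoked from Barnard; the -F horse option, the workspace name my_data, and the source directory are assumptions, and the target path is whatever ws_allocate actually prints for you):

marie@login.barnard$ ws_allocate -F horse -n my_data -d 30
marie@login.barnard$ rsync -a /data/horse/beegfs/<your_data>/ <workspace_path_printed_by_ws_allocate>/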

What happens afterward:

  • complete deletion of all user data in /beegfs
  • complete recabling of the storage nodes (BeeGFS hardware)
  • software and firmware updates
  • set-up of a new WEKA filesystem for high I/O demands on the same hardware

In case of any questions regarding this maintenance or the required actions, please do not hesitate to contact the HPC support team.

Becoming a Stand-Alone Cluster

The former HPC system Taurus is being partly switched off and partly split up into separate clusters until the end of 2023. One such separate cluster is what you have known as the partition alpha so far. With the end of the maintenance on November 30, 2023, Alpha Centauri is now a stand-alone cluster with

  • homogeneous hardware resources incl. two login nodes login[1-2].alpha.hpc.tu-dresden.de (see the login example below),
  • and its own Slurm batch system.
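For example, you can reach one of the new login nodes from your local machine via SSH (replace marie with your own login; the exact SSH options depend on your setup):

marie@local$ ssh marie@login1.alpha.hpc.tu-dresden.de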

Filesystems

Your new /home directory (from Barnard) is also your /home on Alpha Centauri. If you have not migrated your /home from Taurus to your new /home on Barnard, please do so as soon as possible!

Current limitations w.r.t. filesystems

For now, Alpha Centauri will not be integrated into the InfiniBand fabric of Barnard. This comes with a severe restriction: the only work filesystem for Alpha Centauri will be /beegfs. (/scratch and /lustre/ssd are no longer usable.)

Please prepare your stage-in/stage-out workflows using our datamovers to enable work with larger datasets that might be stored on Barnard’s new capacity filesystem /data/walrus. The datamover commands are not yet available on Alpha Centauri. Thus, you need to invoke them from Barnard!
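As an illustration, a hedged sketch of a stage-in from /data/walrus to /beegfs using the datamover command dtcp, invoked from a Barnard login node (directory names are placeholders; consult the datamover documentation for the full command set and exact options):

marie@login.barnard$ dtcp -r /data/walrus/ws/<your_workspace>/<dataset> /beegfs/ws/1/<your_beegfs_workspace>/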

The new Lustre filesystems, namely horse and walrus, will be mounted as soon as Alpha is recabled (planned for May 2024).

Current limitations w.r.t. workspace management

Workspace management commands do not work for /beegfs yet. (Use them from Taurus!)
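For instance, a workspace on /beegfs could be allocated from a Taurus login node following the same pattern as in the example further below (a sketch assuming the filesystem is addressed as beegfs; name and duration are placeholders):

marie@login.taurus$ ws_allocate -F beegfs -n my_workspace -d 30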

Usage

Note

The NVIDIA A100 GPUs may only be used with CUDA 11 or later. Earlier versions do not recognize the new hardware properly. Make sure the software you are using is built with CUDA 11 or later.
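A quick, hedged way to verify the CUDA version your software was built against is to query it from within the software itself, e.g. for the PyTorch module shown in the Modules section below (the module must be loaded first):

marie@login.alpha$ python -c "import torch; print(torch.version.cuda)"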

There is a total of 48 physical cores in each node. SMT is also active, so in total, 96 logical cores are available per node.

Note

Multithreading is disabled by default in a job. See the Slurm page on how to enable it.
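A minimal sketch of requesting an interactive job with SMT enabled via Slurm's --hint=multithread option (the resource values are examples only; see the Slurm page for details):

marie@login.alpha$ srun --nodes=1 --ntasks=1 --cpus-per-task=96 --hint=multithread --gres=gpu:1 --time=01:00:00 --pty bash -l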

Modules

The easiest way to use software is via the module system. All software available from the module system has been specifically built for the cluster Alpha, i.e., with optimizations for the Zen 2 microarchitecture and with CUDA support enabled.

To check the available modules for Alpha, use the command

marie@login.alpha$ module spider <module_name>
Example: Searching and loading PyTorch

For example, to check which PyTorch versions are available you can invoke

marie@login.alpha$ module spider PyTorch
-------------------------------------------------------------------------------------------------------------------------
  PyTorch:
-------------------------------------------------------------------------------------------------------------------------
    Description:
      Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
      that puts Python first.

     Versions:
        PyTorch/1.12.0
        PyTorch/1.12.1-CUDA-11.7.0
        PyTorch/1.12.1
[...]

Not all modules can be loaded directly. Most modules are built with a certain compiler or toolchain that needs to be loaded beforehand. Luckily, the module system can tell us what we need to do for a specific module or software version:

marie@login.alpha$ module spider PyTorch/1.12.1-CUDA-11.7.0

-------------------------------------------------------------------------------------------------------------------------
  PyTorch: PyTorch/1.12.1-CUDA-11.7.0
-------------------------------------------------------------------------------------------------------------------------
    Description:
      Tensors and Dynamic neural networks in Python with strong GPU acceleration. PyTorch is a deep learning framework
      that puts Python first.


    You will need to load all module(s) on any one of the lines below before the "PyTorch/1.12.1" module is available to load.

      release/23.04  GCC/11.3.0  OpenMPI/4.1.4
[...]

Finally, the command line to load the PyTorch/1.12.1-CUDA-11.7.0 module is

marie@login.alpha$ module load release/23.04  GCC/11.3.0  OpenMPI/4.1.4 PyTorch/1.12.1-CUDA-11.7.0
Module GCC/11.3.0, OpenMPI/4.1.4, PyTorch/1.12.1-CUDA-11.7.0 and 64 dependencies loaded.
marie@login.alpha$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"
1.12.1
True

Python Virtual Environments

Virtual environments allow you to install additional Python packages and create an isolated runtime environment. We recommend using virtualenv for this purpose.

Hint

We recommend using workspaces for your virtual environments.

Example: Creating a virtual environment and installing torchvision package
marie@login.alpha$ srun --nodes=1 --cpus-per-task=1 --gres=gpu:1 --time=01:00:00 --pty bash -l
marie@alpha$ ws_allocate -n python_virtual_environment -d 1
Info: creating workspace.
/beegfs/ws/1/marie-python_virtual_environment
remaining extensions  : 2
remaining time in days: 1
marie@alpha$ module load release/23.04 GCCcore/11.3.0 GCC/11.3.0 OpenMPI/4.1.4 Python/3.10.4
Module GCC/11.3.0, OpenMPI/4.1.4, Python/3.10.4 and 21 dependencies loaded.
marie@alpha$ module load PyTorch/1.12.1-CUDA-11.7.0
Module PyTorch/1.12.1-CUDA-11.7.0 and 42 dependencies loaded.
marie@alpha$ which python
/software/rome/r23.04/Python/3.10.4-GCCcore-11.3.0/bin/python
marie@alpha$ pip list
[...]
marie@alpha$ virtualenv --system-site-packages /beegfs/ws/1/marie-python_virtual_environment/my-torch-env
created virtual environment CPython3.8.6.final.0-64 in 42960ms
  creator CPython3Posix(dest=/beegfs/.global1/ws/marie-python_virtual_environment/my-torch-env, clear=False, global=True)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=~/.local/share/virtualenv)
    added seed packages: pip==21.1.3, setuptools==57.2.0, wheel==0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
marie@alpha$ source /beegfs/ws/1/marie-python_virtual_environment/my-torch-env/bin/activate
(my-torch-env) marie@alpha$ pip install torchvision==0.13.1
[...]
Installing collected packages: torchvision
Successfully installed torchvision-0.13.1
[...]
(my-torch-env) marie@alpha$ python -c "import torchvision; print(torchvision.__version__)"
0.13.1+cu102
(my-torch-env) marie@alpha$ deactivate

JupyterHub

JupyterHub can be used to run Jupyter notebooks on the Alpha Centauri cluster. You can either use the standard profiles for Alpha or use the advanced form and define the resources for your JupyterHub job. The "Alpha GPU (NVIDIA Ampere A100)" preset is a good starting configuration.

Containers

Singularity containers enable users to have full control of their software environment. For more information, see the Singularity container details.

Nvidia NGC containers can be used as an effective solution for machine learning related tasks. (Downloading containers requires registration.) Nvidia-prepared containers with software solutions for specific scientific problems can simplify the deployment of deep learning workloads on HPC. NGC containers have shown performance comparable to directly run code.
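As an illustration, a hedged sketch of pulling an NGC PyTorch container and running it on a GPU node with Singularity (the registry path and tag are examples only; check the NGC catalog for current versions):

marie@login.alpha$ singularity pull docker://nvcr.io/nvidia/pytorch:23.05-py3
marie@login.alpha$ srun --nodes=1 --gres=gpu:1 --time=01:00:00 singularity exec --nv pytorch_23.05-py3.sif python -c "import torch; print(torch.cuda.is_available())"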