Machine Learning¶
This is an introduction to running machine learning applications on ZIH systems.
For machine learning purposes, we recommend using the partitions alpha and/or ml.
Partition ml¶
The compute nodes of the partition ml are built on the IBM Power9 architecture. The system was
designed for AI challenges, analytics, data-intensive workloads and accelerated databases.
The main feature of the nodes is the NVIDIA Tesla V100 GPU with NVLink support, which allows a
total bandwidth of up to 300 GB/s. Each node on the partition ml has 6x Tesla V100 GPUs. You can
find a detailed specification of the partition in our Power9 documentation.
Note
The partition ml is based on the Power9 architecture, which means that software built
for x86_64 will not work on this partition. Users also need to use the modules that are
specially built for this architecture (from modenv/ml).
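Because software built for x86_64 does not run on the partition ml, it can help to check the CPU architecture at the start of a script. The following is a minimal sketch using only the Python standard library. The function name `is_ppc64le` is hypothetical, and the sketch assumes that Power9 nodes report `ppc64le` as their machine type, matching the paths and container tags used elsewhere on this page.

```python
import platform


def is_ppc64le(machine=None):
    """Return True if the given (or the current) machine type is ppc64le."""
    # platform.machine() returns e.g. "x86_64" or "ppc64le".
    machine = machine if machine is not None else platform.machine()
    return machine == "ppc64le"


if __name__ == "__main__":
    if is_ppc64le():
        print("Power architecture detected: use software built for ppc64le.")
    else:
        print("Machine type is", platform.machine())
```

A script could use such a check to fail early with a clear message instead of crashing on an incompatible binary.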
Modules¶
On the partition ml, load the module environment:
marie@ml$ module load modenv/ml
The following have been reloaded with a version change: 1) modenv/scs5 => modenv/ml
Power AI¶
IBM provides tools for AI tasks that work on the partition ml.
For more information see our Power AI documentation.
Partition alpha¶
Another partition for machine learning tasks is alpha. It is mainly dedicated to
ScaDS.AI topics. Each node on the partition alpha has 2x AMD EPYC CPUs, 8x NVIDIA
A100-SXM4 GPUs, 1 TB RAM and 3.5 TB local space (/tmp) on an NVMe device. You can find more
details on the partition in our Alpha Centauri documentation.
Modules¶
On the partition alpha, load the module environment:
marie@alpha$ module load modenv/hiera
The following have been reloaded with a version change: 1) modenv/ml => modenv/hiera
Note
On the partition alpha, the most recent modules are built in hiera. Alternative modules might be
built in scs5.
Machine Learning via Console¶
Python and Virtual Environments¶
Python users should use a virtual environment when conducting machine learning tasks via console.
For more details on machine learning or data science with Python see data analytics with Python.
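To confirm that a script really runs inside a virtual environment, the interpreter prefixes can be compared. This is a minimal sketch with a hypothetical function name. It relies on the standard `venv` behavior where `sys.prefix` differs from `sys.base_prefix`, and it does not detect other environment types such as conda environments.

```python
import sys


def in_virtual_environment():
    """Return True if the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points to the environment directory,
    # while sys.base_prefix still points to the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Virtual environment active:", in_virtual_environment())
    print("Interpreter prefix:", sys.prefix)
```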
R¶
R also supports machine learning via console. It does not require a virtual environment because its package management works differently.
For more details on machine learning or data science with R see data analytics with R.
Machine Learning with Jupyter¶
The Jupyter Notebook is an open-source web application that allows you to create documents containing live code, equations, visualizations, and narrative text. JupyterHub allows you to work with machine learning frameworks (e.g. TensorFlow or PyTorch) on ZIH systems and to run your Jupyter notebooks on HPC nodes.
After accessing JupyterHub, you can start a new session and configure it. For machine learning
purposes, select either the partition alpha or ml and the resources your application requires.
In your session you can use Python, R or RStudio for your machine learning and data science topics.
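Which frameworks are importable depends on the environment selected for the session. As a rough sketch, the following notebook cell checks for installed frameworks without importing them. The function name is hypothetical, and the module names `tensorflow` and `torch` are the usual import names of TensorFlow and PyTorch, assumed here rather than taken from this page.

```python
import importlib.util


def available_frameworks(candidates=("tensorflow", "torch")):
    """Return the candidate module names that can be found in this environment."""
    # find_spec() locates a top-level module without executing it.
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


if __name__ == "__main__":
    print("Available frameworks:", available_frameworks())
```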
Machine Learning with Containers¶
Some machine learning tasks require using containers. In the HPC domain, the Singularity container system is a widely used tool. Docker containers can also be used by Singularity. You can find further information on working with containers on ZIH systems in our containers documentation.
The official source for Docker containers with TensorFlow, PyTorch and many other packages is the PowerAI container DockerHub repository of IBM.
Note
You can find other versions of the software on the "tag" tab of the container's DockerHub page.
In the following example, we build a Singularity container with TensorFlow from the DockerHub and start it:
marie@ml$ singularity build my-ML-container.sif docker://ibmcom/powerai:1.6.2-tensorflow-ubuntu18.04-py37-ppc64le  # create a container from the PowerAI 1.6.2 TensorFlow image on DockerHub
[...]
marie@ml$ singularity run --nv my-ML-container.sif  # run the container with NVIDIA GPU support; you can also work with it via singularity shell or singularity exec
[...]
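When a container is started from a Python driver script rather than typed by hand, the `singularity exec` call can be assembled as an argument list. The sketch below only builds and prints the command. The helper name and the script name `train.py` are hypothetical, and passing the list to `subprocess.run` is left to the reader.

```python
import shlex


def singularity_exec_command(image, command, use_gpu=True):
    """Build the argument list for running a command inside a container."""
    argv = ["singularity", "exec"]
    if use_gpu:
        argv.append("--nv")  # enable NVIDIA GPU support, as in the example above
    argv.append(image)
    argv.extend(command)
    return argv


# train.py is a placeholder for your own script.
cmd = singularity_exec_command("my-ML-container.sif", ["python", "train.py"])
print(shlex.join(cmd))
```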
Additional Libraries for Machine Learning¶
The following NVIDIA libraries are available on all nodes:
Name | Path |
---|---|
NCCL | /usr/local/cuda/targets/ppc64le-linux |
cuDNN | /usr/local/cuda/targets/ppc64le-linux |
Note
For optimal NCCL performance, we recommend setting the NCCL_MIN_NRINGS environment variable during execution. You can try different values, but 4 is a good starting point.
marie@compute$ export NCCL_MIN_NRINGS=4
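The same setting can be applied from within a Python script, as long as it happens before the framework initializes NCCL. This is a minimal sketch with a hypothetical helper name. It keeps a value that is already exported in the shell and otherwise falls back to the starting point of 4 suggested above.

```python
import os


def configure_nccl(min_nrings=4, env=None):
    """Set NCCL_MIN_NRINGS unless it is already defined; return the value in effect."""
    env = os.environ if env is None else env
    # setdefault() leaves a value exported in the shell untouched.
    return env.setdefault("NCCL_MIN_NRINGS", str(min_nrings))


if __name__ == "__main__":
    # Call this before importing or initializing the framework that uses NCCL.
    print("NCCL_MIN_NRINGS =", configure_nccl())
```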
HPC-Related Software¶
The following HPC-related software is installed on all nodes:
Name | Path |
---|---|
IBM Spectrum MPI | /opt/ibm/spectrum_mpi/ |
PGI compiler | /opt/pgi/ |
IBM XLC Compiler | /opt/ibm/xlC/ |
IBM XLF Compiler | /opt/ibm/xlf/ |
IBM ESSL | /opt/ibmmath/essl/ |
IBM PESSL | /opt/ibmmath/pessl/ |
Datasets for Machine Learning¶
There are many different datasets designed for research purposes. If you would like to download some of them, keep in mind that many machine learning libraries have direct access to public datasets without downloading them, e.g. TensorFlow Datasets. If you still need to download datasets, use the Datamover machine.
The ImageNet Dataset¶
The ImageNet project is a large visual database designed for use in visual object recognition
software research. To save space in the filesystem by avoiding multiple copies, we have put a
copy of the ImageNet database (ILSVRC2012 and ILSVRC2017) under /scratch/imagenet, which you can
use without having to download it again. In the future, the ImageNet dataset will be available in
the Warm Archive. ILSVRC2017 also includes a dataset for recognizing objects in video. Please
respect the corresponding Terms of Use.
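Before pointing a data loader at the shared copy, it can be useful to see what is available. The sketch below lists the subdirectories of a dataset root with the standard library only. The helper name is hypothetical, and no particular directory layout below /scratch/imagenet is assumed.

```python
import os


def list_subdirs(root="/scratch/imagenet"):
    """Return the sorted names of the subdirectories of root, or [] if root is missing."""
    if not os.path.isdir(root):
        return []
    return sorted(entry.name for entry in os.scandir(root) if entry.is_dir())


if __name__ == "__main__":
    for name in list_subdirs():
        print(name)
```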