
Deep Learning Software

Please refer to our List of Modules page for a daily-updated list of the respective software versions that are currently installed.

Caffe

Caffe is available in our EasyBuild module environment under the module name "Caffe".

TensorFlow

TensorFlow is available in our EasyBuild module environment under the module name "tensorflow". A build with the "-avx2fma" suffix is also available; however, according to our internal tests, enabling AVX2/FMA did not improve performance, so we recommend using the default version instead.

Note that up to version 1.2.1, TensorFlow was installed from Google's binary distribution packages. Since those are built against GLIBC 2.16 while the operating system on Taurus only provides GLIBC 2.12, they cannot be used out of the box. You have to use the supplied wrapper script "python-glibc2.17" as the interpreter for your scripts to make them work. Starting with version 1.3.0 we provide custom builds that make this workaround obsolete.
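To check which GLIBC version your Python interpreter actually sees (and thus whether the wrapper is needed), the standard library can report it; a minimal sketch:

```python
import platform

# platform.libc_ver() inspects the C library the interpreter is linked
# against. On glibc-based Linux it returns e.g. ("glibc", "2.12"); on
# other platforms both fields may be empty strings.
lib, version = platform.libc_ver()
print(lib, version)
```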

Keras

Keras is available in our EasyBuild module environment under the module name "Keras".

It can use either TensorFlow or Theano as its backend. If you wish to use the TensorFlow backend, please note the special circumstances described in the section above (and don't forget to load the corresponding tensorflow module). Theano should be loaded automatically as a dependency. You can select the desired backend by editing the configuration file ~/.keras/keras.json in your home directory and specifying either:
   "backend": "tensorflow",

or:
   "backend": "theano",

The file is created automatically when running Keras for the first time.
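The backend entry can also be switched programmatically, e.g. from a setup script; a minimal sketch using only the standard library (the function name and the /tmp path are illustrative — point it at ~/.keras/keras.json on Taurus):

```python
import json
from pathlib import Path

def set_keras_backend(config_path, backend):
    """Set the "backend" key in a Keras-style JSON config file.

    Illustrative helper: creates the file with minimal defaults
    if it does not exist yet.
    """
    path = Path(config_path)
    if path.exists():
        config = json.loads(path.read_text())
    else:
        config = {"floatx": "float32", "epsilon": 1e-07}
    config["backend"] = backend
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4))
    return config

# Example (on Taurus you would pass "~/.keras/keras.json", expanded):
cfg = set_keras_backend("/tmp/keras_demo.json", "theano")
print(cfg["backend"])  # → theano
```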

Multi-CPU Theano

If you wish to use CPU-parallelism with Theano, you have to supply it with a multi-threaded BLAS library. On Taurus, it is recommended to use the Intel MKL for that. For our module, it is already loaded as a dependency, so you just have to add the following to your ~/.theanorc:
[blas]
ldflags = '-L${MKLROOT}/lib/intel64 -lmkl_avx2 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl'

#NOTE: you might have to replace "-lmkl_avx2" with "-lmkl_def" if you want to run on non-Haswell nodes

Then increase your --cpus-per-task SLURM parameter according to the number of threads you wish to use.
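Inside the job, the thread count available to MKL/OpenMP can be derived from the allocation; a small sketch (SLURM exports --cpus-per-task as SLURM_CPUS_PER_TASK; the fallback of 1 is an assumption for running outside a job):

```python
import os

# SLURM sets SLURM_CPUS_PER_TASK to the value of --cpus-per-task.
# Defaulting to 1 when it is unset is an assumption for local testing.
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

# Propagate the count to the threading layers MKL and OpenMP consult.
os.environ["OMP_NUM_THREADS"] = str(cpus)
os.environ["MKL_NUM_THREADS"] = str(cpus)
print(cpus)
```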

Multi-GPU Theano

For multi-GPU support you have to use the new libgpuarray backend for Theano (device=gpu selects the old backend, device=cuda the new one; also see [1]). It is supported starting from Theano 0.9.0, which is available as a module on Taurus.

Example:
THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'

Note: This does not work in an interactive bash session that was started with "srun --pty ... bash -l", because sourcing /etc/profile (triggered by the "-l" parameter to bash) overwrites some environment variables, which leads to a PMI error.

[1] https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end(gpuarray)
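The contexts flag maps named contexts to CUDA devices; its format can be illustrated with a small parser (purely illustrative helper, not part of Theano):

```python
def parse_contexts(flag):
    """Parse a THEANO_FLAGS contexts value such as
    "dev0->cuda0;dev1->cuda1" into a {context_name: device} mapping.
    Illustrative only; Theano does this parsing internally."""
    mapping = {}
    for entry in flag.split(";"):
        name, device = entry.split("->")
        mapping[name.strip()] = device.strip()
    return mapping

print(parse_contexts("dev0->cuda0;dev1->cuda1"))
# → {'dev0': 'cuda0', 'dev1': 'cuda1'}
```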

Test Case 1: Keras with Theano/Tensorflow on MNIST data

Go to a directory on Taurus, clone Keras to get the examples, and change into the examples directory:
git clone https://github.com/fchollet/keras.git
cd keras/examples/

Configuration used for our test case (if you do not specify a Keras backend, tensorflow is used by default):
$ cat ~/.keras/keras.json
{
    "floatx": "float32",
    "epsilon": 1e-07,
    "image_dim_ordering": "tf",
    "backend": "theano"
}
$ cat ~/.theanorc 
[global]
floatX = float32
device = cuda
[lib]
cnmem = 1
[nvcc]
fastmath = True

Job file (schedule the job with sbatch, check its status with squeue -u <username>):
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mem=10000
#SBATCH -p gpu2 # K80 GPUs on Haswell node
#SBATCH --time=01:00:00

## with Theano (using configs from above)
module purge # purge if you already have modules loaded
module load modenv/eb
module load Keras

srun python mnist_cnn.py


## with Tensorflow
module purge
module load modenv/eb
module load Keras
module load tensorflow
# if you see 'broken pipe' errors (might happen in an interactive session after the second srun command)
module load h5py/2.6.0-intel-2016.03-GCC-5.3-Python-3.5.2-HDF5-1.8.17-serial

export KERAS_BACKEND=tensorflow # configure Keras to use tensorflow

srun python mnist_cnn.py

Output with Theano backend:
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 45s - loss: 0.3244 - acc: 0.9008 - val_loss: 0.0748 - val_acc: 0.9768
Epoch 2/12
60000/60000 [==============================] - 44s - loss: 0.1116 - acc: 0.9662 - val_loss: 0.0543 - val_acc: 0.9827
Epoch 3/12
60000/60000 [==============================] - 44s - loss: 0.0860 - acc: 0.9744 - val_loss: 0.0516 - val_acc: 0.9840
Epoch 4/12
60000/60000 [==============================] - 44s - loss: 0.0714 - acc: 0.9788 - val_loss: 0.0404 - val_acc: 0.9865
Epoch 5/12
60000/60000 [==============================] - 44s - loss: 0.0634 - acc: 0.9808 - val_loss: 0.0366 - val_acc: 0.9877
Epoch 6/12
60000/60000 [==============================] - 44s - loss: 0.0569 - acc: 0.9825 - val_loss: 0.0337 - val_acc: 0.9885
Epoch 7/12
60000/60000 [==============================] - 44s - loss: 0.0519 - acc: 0.9844 - val_loss: 0.0324 - val_acc: 0.9893
Epoch 8/12
60000/60000 [==============================] - 44s - loss: 0.0472 - acc: 0.9860 - val_loss: 0.0316 - val_acc: 0.9895
Epoch 9/12
60000/60000 [==============================] - 44s - loss: 0.0445 - acc: 0.9865 - val_loss: 0.0331 - val_acc: 0.9889
Epoch 10/12
60000/60000 [==============================] - 44s - loss: 0.0406 - acc: 0.9885 - val_loss: 0.0319 - val_acc: 0.9895
Epoch 11/12
60000/60000 [==============================] - 44s - loss: 0.0384 - acc: 0.9889 - val_loss: 0.0299 - val_acc: 0.9901
Epoch 12/12
60000/60000 [==============================] - 44s - loss: 0.0366 - acc: 0.9888 - val_loss: 0.0298 - val_acc: 0.9906
Using Theano backend.
Using cuDNN version 5105 on context None
Mapped name None to device cuda: Tesla K80 (0000:05:00.0)

Test loss: 0.0298093453895
Test accuracy: 0.9906

Output with Tensorflow backend:
[...]
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:05:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0)
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 10s - loss: 0.3288 - acc: 0.8980 - val_loss: 0.0787 - val_acc: 0.9758
Epoch 2/12
60000/60000 [==============================] - 9s - loss: 0.1164 - acc: 0.9658 - val_loss: 0.0534 - val_acc: 0.9826
Epoch 3/12
60000/60000 [==============================] - 9s - loss: 0.0885 - acc: 0.9738 - val_loss: 0.0465 - val_acc: 0.9837
Epoch 4/12
60000/60000 [==============================] - 9s - loss: 0.0737 - acc: 0.9783 - val_loss: 0.0403 - val_acc: 0.9868
Epoch 5/12
60000/60000 [==============================] - 9s - loss: 0.0656 - acc: 0.9807 - val_loss: 0.0361 - val_acc: 0.9876
Epoch 6/12
60000/60000 [==============================] - 9s - loss: 0.0581 - acc: 0.9828 - val_loss: 0.0361 - val_acc: 0.9884
Epoch 7/12
60000/60000 [==============================] - 9s - loss: 0.0522 - acc: 0.9843 - val_loss: 0.0324 - val_acc: 0.9886
Epoch 8/12
60000/60000 [==============================] - 9s - loss: 0.0479 - acc: 0.9851 - val_loss: 0.0304 - val_acc: 0.9893
Epoch 9/12
60000/60000 [==============================] - 9s - loss: 0.0450 - acc: 0.9868 - val_loss: 0.0291 - val_acc: 0.9902
Epoch 10/12
60000/60000 [==============================] - 9s - loss: 0.0420 - acc: 0.9873 - val_loss: 0.0308 - val_acc: 0.9899
Epoch 11/12
60000/60000 [==============================] - 9s - loss: 0.0404 - acc: 0.9882 - val_loss: 0.0282 - val_acc: 0.9906
Epoch 12/12
60000/60000 [==============================] - 11s - loss: 0.0382 - acc: 0.9885 - val_loss: 0.0292 - val_acc: 0.9910
Test loss: 0.0292089643462
Test accuracy: 0.991
Using TensorFlow backend.

Test Case 2: Multi-GPU usage (Theano)

Job file
#!/bin/bash
#SBATCH --gres=gpu:2  # using 2 GPUs
#SBATCH --mem=5000
#SBATCH -p gpu2
#SBATCH --time=01:00:00

module purge
module load modenv/eb
module load Theano

THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1" python -c 'import theano'

Output (with Theano backend):
Using cuDNN version 5105 on context dev0
Mapped name dev0 to device cuda0: Tesla K80 (0000:04:00.0)
Using cuDNN version 5105 on context dev1
Mapped name dev1 to device cuda1: Tesla K80 (0000:05:00.0)

Jupyter Notebooks

This section shows how to run a Jupyter server within an sbatch GPU job and which modules and packages you need. The Jupyter module is loaded by:

module load modenv/eb
module load jupyter

Setup phase (optional)

If you need to adjust the config, you can create the template:

jupyter notebook --generate-config

If you want an SSL certificate for HTTPS connections, you can create a self-signed certificate:

openssl req -x509 -nodes -days 365 -newkey rsa:4096 -keyout mykey.key -out mycert.pem

Possible entries for your jupyter config (.jupyter/jupyter_notebook_config.py):

c.NotebookApp.certfile = u'<path-to-cert>/mycert.pem'
c.NotebookApp.keyfile = u'<path-to-cert>/mykey.key'
# set ip to '*' otherwise server is bound to localhost only
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
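Before settling on a port in the config, you can check whether it is free on the node; a minimal sketch using the standard library (the helper name and port 8888 are just examples, and a port that is free now may be taken by the time the server starts):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently listening on (host, port).

    Illustrative check only; there is an inherent race between this
    check and the notebook server actually binding the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 when a connection succeeds,
        # i.e. something is already listening on the port.
        return s.connect_ex((host, port)) != 0

print(port_is_free(8888))
```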

Your Keras configuration (.keras/keras.json):

{
    "floatx": "float32",
    "epsilon": 1e-07,
    "image_dim_ordering": "tf",
    "backend": "tensorflow"
}

Note that JSON does not support comments; to use Theano instead, change the value of "backend" to "theano".

SLURM job file to run the jupyter server on Taurus with 1x K80 (also works on K20)

#!/bin/bash
#SBATCH --gres=gpu:1 # request GPU
#SBATCH --partition=gpu2 # use GPU partition
#SBATCH --output=output.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=1:30:00
#SBATCH --mem=4000M
#SBATCH -J jupyter-server
#SBATCH -A <your_project> # -A selects the account/project to charge

## ... other settings

module purge
module load modenv/eb # load easybuild

module load matplotlib/1.5.3-intel-2016.03-GCC-5.3-Python-3.5.2
module load jupyter/1.0.0-Python-3.5.2
# load keras first to avoid issue with cudnn 6.0
module load Keras
module load tensorflow
# to avoid MPI and broken pipe errors
module load h5py/2.6.0-intel-2016.03-GCC-5.3-Python-3.5.2-HDF5-1.8.17-serial

## if cert does not exist, create one
if [ ! -f "$HOME/.jupyter/mycert.pem" ]; then
    mkdir -p $HOME/.jupyter
    openssl req -x509 -nodes -days 365 -newkey rsa:4096 -keyout "$HOME/.jupyter/mykey.key" -out "$HOME/.jupyter/mycert.pem" -subj "/C=DE/ST=./L=./O=./OU=./CN=." || exit 1
fi

srun jupyter notebook --no-browser --ip="*" --certfile="$HOME/.jupyter/mycert.pem" --keyfile="$HOME/.jupyter/mykey.key"

Check the status and the token of the server with less output.txt.
You can see the server node's hostname with squeue -u <username>.

Connect to the server

On your client you can now connect to the server. You need to know the node's hostname, the server's port, and the token to log in (see the paragraph above).
You can connect directly if you know the IP address (just ping the node's hostname while logged in on Taurus).
host taurusi2092
# copy IP address from output
# paste IP to your browser or call e.g.
firefox https://<IP>:<PORT>

Alternatively, you can create an SSH tunnel if you have problems with the approach above.
node=taurusi2092
localport=8888
serverport=8888
user=<username>
ssh -fNL ${localport}:${node}:${serverport} ${user}@tauruslogin4.hrsk.tu-dresden.de
pgrep -f "ssh -fNL ${localport}"
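If you script the connection, the tunnel command from the shell variables above can be assembled programmatically; a purely illustrative sketch (the helper name is hypothetical, node and login host are the examples from above):

```python
def ssh_tunnel_command(node, user, localport=8888, serverport=8888,
                       login_host="tauruslogin4.hrsk.tu-dresden.de"):
    """Build the ssh command that forwards localport on the client to
    node:serverport via the login host. Illustrative helper; run the
    resulting string in a shell."""
    return (f"ssh -fNL {localport}:{node}:{serverport} "
            f"{user}@{login_host}")

print(ssh_tunnel_command("taurusi2092", "myuser"))
# → ssh -fNL 8888:taurusi2092:8888 myuser@tauruslogin4.hrsk.tu-dresden.de
```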

To log in to the Jupyter notebook site you have to enter the token. Now you can create and execute notebooks on Taurus with GPU support.