GPU jobs

This page shows you how to run jobs that use GPUs on the Grid.

About GPU jobs

Certain problems benefit from running on special hardware, such as a GPU, to decrease the runtime of your program. The Grid has started to support GPU jobs. This section contains the best practices for running GPU jobs on the Dutch Grid, as supported by SURF.

GPU job submission

To submit a GPU job to the Grid, the JDL file must contain a Tags field set to gpu. This field looks like the following:

Tags = {"gpu"};

This will put your job in the GPU queue, after which the job lands on a compute element (CE) that contains a GPU and can run the relevant code.
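In context, the tag appears alongside the other attributes in your JDL. A minimal fragment for illustration (my_script.sh is a placeholder; a complete working JDL is given in the Example GPU Job section below):

[
  Executable = "my_script.sh";
  Tags = {"gpu"};
]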

Quick GPU example

To quickly run the example, download the shell script and JDL file given below. For a full explanation of what the job does, see the instructions in the Example GPU Job section.

  • Copy the shell script gpu_job.sh to your UI directory:

    $wget http://doc.grid.surfsara.nl/en/latest/_downloads/gpu_job.sh
    
  • Copy the JDL file gpu_job.jdl to your UI directory:

    $wget http://doc.grid.surfsara.nl/en/latest/_downloads/gpu_job.jdl
    

And submit the job with:

$dirac-wms-job-submit gpu_job.jdl
JobID = 123

And retrieve and inspect the output files with:

$dirac-wms-job-get-output 123
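If the job is still running, the output will not be available yet. You can first check the status of the job with the standard DIRAC status command, where 123 stands for the JobID returned at submission:

$dirac-wms-job-status 123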

Example GPU Job

An example of a GPU job starts with the code we want to run. For this example, we run a CUDA example made by Nvidia, which can be found on GitHub. To ensure the code runs on the Grid, it is containerized with Apptainer and the container is distributed through CVMFS.

To build the container yourself, you can run

$apptainer build --fakeroot --nv --sandbox cuda_example_unpacked.sif cuda_example.def

on a machine with Apptainer installed, where the user has fakeroot privileges. The cuda_example.def definition file contains the following recipe:

Bootstrap: docker
From: nvidia/cuda:11.8.0-devel-centos7

%post
#This section is run inside the container
yum -y install git make
mkdir /test_repo
cd /test_repo
git clone https://github.com/NVIDIA/cuda-samples.git
cd /test_repo/cuda-samples/Samples/2_Concepts_and_Techniques/eigenvalues/
make

%runscript
#Executes when the "apptainer run" command is used
#Useful when you want the container to run as an executable
cd /test_repo/cuda-samples/Samples/2_Concepts_and_Techniques/eigenvalues/
./eigenvalues

%help
This is a demo container to show how to build and run a CUDA application
on a GPU node

This will create a container with the compiled eigenvalues example inside, which makes use of an Nvidia GPU to calculate the eigenvalues of a 2048 x 2048 matrix.
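If you have access to a machine with Apptainer and an Nvidia GPU, you can test the freshly built container locally before distributing it. This is an optional sanity check; the --nv flag again exposes the host GPU driver to the container:

$apptainer run --nv cuda_example_unpacked.sif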

This container has already been distributed on CVMFS and can be found at /cvmfs/softdrive.nl/lodewijkn/cuda_example_unpacked.sif.
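To distribute a container of your own in the same way, you can use SURF's Softdrive service. A rough sketch of the workflow, assuming you have a Softdrive account (the hostname and publish command below follow the Softdrive documentation, which remains the authoritative reference; username is a placeholder):

$scp -r cuda_example_unpacked.sif username@softdrive.grid.surfsara.nl:/cvmfs/softdrive.nl/username/
$ssh username@softdrive.grid.surfsara.nl publish-my-softdrive

Note that it can take some time before a newly published container becomes visible on all Grid nodes.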

To run this container on the Grid, making use of the GPUs, the following JDL has to be submitted to DIRAC, or to the workload management system (WMS) of your choice. This JDL runs a shell script on the CE that calls the container on CVMFS. This script, called gpu_job.sh, is given below:

#!/bin/bash

#Show which CPU cores this job is allowed to use
cat /proc/self/status | grep 'Cpus_allowed_list:'
echo
#Show the GPU(s) available on this worker node
/usr/bin/nvidia-smi

#Print the working directory and the worker node hostname
pwd
hostname

#Run the containerized CUDA example; --nv exposes the GPU to the container
/cvmfs/oasis.opensciencegrid.org/mis/apptainer/bin/apptainer run --nv /cvmfs/softdrive.nl/lodewijkn/cuda_example_unpacked.sif

In the command that calls Apptainer, the --nv flag is necessary to expose the GPU to the container. The gpu_job.sh script is then passed along with the JDL to the Grid when submitting the job. The JDL, called gpu_job.jdl, is given next:

[
  JobName = "my_gpu_job";
  Executable = "gpu_job.sh";
  StdOutput = "StdOut";
  StdError = "StdErr";
  InputSandbox = {"gpu_job.sh"};
  OutputSandbox = {"StdOut","StdErr"};
  Site = "GRID.SURF.nl";
  Tags = {"gpu"};
  CPUTime = 600;
]

To submit the job, ensure all the necessary files are available on your UI: the gpu_job.sh script and the gpu_job.jdl file. Then submit the job with:

$dirac-wms-job-submit gpu_job.jdl

After the job has run successfully, the stdout output looks like:

Cpus_allowed_list:  11-21

Thu Dec  8 14:57:12 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          Off  | 00000000:00:07.0 Off |                    0 |
|  0%   42C    P0    57W / 150W |      0MiB / 23028MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
/tmp/ZTiMDmYBVN2nCifV3nVLvLKmABFKDmABFKDmURGKDmABFKDm5i8MKo/DIRAC_hA3xkNpilot/20
wn-a10-04.gina.surfsara.nl
Starting eigenvalues
GPU Device 0: "Ampere" with compute capability 8.6

Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: 'eigenvalues.dat'
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 0.987820 ms
Average time step 2, one intervals: 1.169709 ms
Average time step 2, mult intervals: 2.616700 ms
Average time TOTAL: 4.785920 ms
Test Succeeded!

And you have run your first GPU job on the Grid!
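To run your own CUDA application instead of the Nvidia sample, only the container path in gpu_job.sh needs to change; the JDL can stay the same. A minimal sketch, where the container path is a placeholder for your own location on CVMFS:

#!/bin/bash

#Placeholder path: replace with your own container on CVMFS.
#The --nv flag exposes the GPU to the container, as in the example above.
/cvmfs/oasis.opensciencegrid.org/mis/apptainer/bin/apptainer run --nv /cvmfs/softdrive.nl/username/my_cuda_container.sif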