GPU jobs¶
This page shows how to run jobs that use GPUs on the Grid.
About GPU jobs¶
Some problems run faster on special hardware such as a GPU. The Grid now supports GPU jobs. This section describes best practices for running GPU jobs on the Dutch Grid, as supported by SURF.
GPU job submission¶
To submit a GPU job to the Grid, the JDL file must contain a Tags field that includes gpu. The field looks like this:
Tags = {"gpu"};
This puts your job in the GPU queue, after which the job lands on a compute element (CE) that contains a GPU and can run the relevant code.
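Because a job without the tag will not be routed to the GPU queue, it can help to check the JDL file before submitting. The following is a minimal sketch; the function name has_gpu_tag is hypothetical and not part of DIRAC, and the pattern assumes the Tags field is written on a single line as shown above.

```shell
#!/bin/bash

# Hypothetical helper (not a DIRAC command): succeeds when the given JDL
# file has a Tags field whose list includes "gpu".
# Assumption: the whole Tags field sits on one line of the file.
has_gpu_tag() {
    local jdl_file="$1"
    grep -Eq 'Tags[[:space:]]*=[[:space:]]*\{[^}]*"gpu"' "$jdl_file"
}

# Example use:
#   has_gpu_tag gpu_job.jdl || echo "gpu_job.jdl has no gpu tag" >&2
```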
Quick GPU example¶
To quickly run the example, download the shell script and JDL file given below. For a full explanation of what the job does, see the Example GPU Job section.
Copy the shell script gpu_job.sh to your UI directory:
$ wget http://doc.grid.surfsara.nl/en/latest/_downloads/gpu_job.sh
Copy the JDL file gpu_job.jdl to your UI directory:
$ wget http://doc.grid.surfsara.nl/en/latest/_downloads/gpu_job.jdl
And submit the job with:
$ dirac-wms-job-submit gpu_job.jdl
JobID = 123
Then retrieve the output files, using the job ID returned at submission (123 in this example), and inspect them:
$ dirac-wms-job-get-output 123
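The job ID printed at submission is needed for every later command, so it can be convenient to capture it in a variable. The following is a rough sketch, assuming the submit command prints a line of the form JobID = 123 as shown above. The function names parse_job_id and submit_job are hypothetical and not part of DIRAC.

```shell
#!/bin/bash

# Extract the numeric ID from submit output such as "JobID = 123".
# Prints nothing if no such line is present.
parse_job_id() {
    sed -n 's/^JobID = \([0-9][0-9]*\).*$/\1/p' <<< "$1"
}

# Hypothetical wrapper: submit a JDL file and print only the job ID.
# Returns non-zero if submission fails or no job ID can be found.
submit_job() {
    local jdl_file="$1" submit_output job_id
    submit_output=$(dirac-wms-job-submit "$jdl_file") || return 1
    job_id=$(parse_job_id "$submit_output")
    [ -n "$job_id" ] || return 1
    echo "$job_id"
}

# Example use on the UI:
#   job_id=$(submit_job gpu_job.jdl)
#   dirac-wms-job-status "$job_id"      # repeat until the job has finished
#   dirac-wms-job-get-output "$job_id"
```

The output can only be retrieved once the job has finished, which is why the status is checked first in the usage example.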
Example GPU Job¶
An example GPU job starts with the code we want to run. For this example, we run a CUDA sample made by NVIDIA, which can be found on GitHub. To ensure the code runs on the Grid, it is containerized with Apptainer and the container is distributed through CVMFS.
To build the container yourself, you can run
$ apptainer build --fakeroot --nv --sandbox cuda_example_unpacked.sif cuda_example.def
on a machine with Apptainer installed, where the user has fakeroot privileges. The cuda_example.def definition file contains the following recipe:
Bootstrap: docker
From: nvidia/cuda:11.8.0-devel-centos7

%post
    #This section is run inside the container
    yum -y install git make
    mkdir /test_repo
    cd /test_repo
    git clone https://github.com/NVIDIA/cuda-samples.git
    cd /test_repo/cuda-samples/Samples/2_Concepts_and_Techniques/eigenvalues/
    make

%runscript
    #Executes when the "apptainer run" command is used
    #Useful when you want the container to run as an executable
    cd /test_repo/cuda-samples/Samples/2_Concepts_and_Techniques/eigenvalues/
    ./eigenvalues

%help
    This is a demo container to show how to build and run a CUDA application on a GPU node
This will create a container with the compiled eigenvalues example inside, which makes use of an NVIDIA GPU to calculate the eigenvalues of a 2048 x 2048 matrix.
This container has already been distributed on CVMFS and can be found at /cvmfs/softdrive.nl/lodewijkn/cuda_example_unpacked.sif.
To run this container on the Grid with GPU access, submit the JDL shown further below to DIRAC, or to the workload management system (WMS) of your choice. The JDL runs a shell script on the CE, which calls the container on CVMFS. This script, called gpu_job.sh, is given as:
#!/bin/bash
cat /proc/self/status | grep 'Cpus_allowed_list:'
echo
/usr/bin/nvidia-smi
pwd
hostname
/cvmfs/oasis.opensciencegrid.org/mis/apptainer/bin/apptainer run --nv /cvmfs/softdrive.nl/lodewijkn/cuda_example_unpacked.sif
In the command that calls apptainer, the --nv flag is necessary to expose the GPU to the container. The gpu_job.sh script is passed along with the JDL to the Grid when submitting the job. The JDL, called gpu_job.jdl, is given next:
[
    JobName = "my_gpu_job";
    Executable = "gpu_job.sh";
    StdOutput = "StdOut";
    StdError = "StdErr";
    InputSandbox = {"gpu_job.sh"};
    OutputSandbox = {"StdOut","StdErr"};
    Site = "GRID.SURF.nl";
    Tags = {"gpu"};
    CPUTime = 600;
]
To submit the job, ensure all the necessary files are available: the gpu_job.sh script and the gpu_job.jdl JDL file. Then submit the job with:
$ dirac-wms-job-submit gpu_job.jdl
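The file check mentioned above can be automated with a few lines of shell. This is only a sketch: the function name check_job_files is hypothetical, and it simply tests that each file you name exists and is not empty. It does not parse the InputSandbox field of the JDL.

```shell
#!/bin/bash

# Hypothetical pre-submission check (not a DIRAC command): report every
# listed file that is missing or empty, and return non-zero if any is.
check_job_files() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use: only submit when both files are in place.
#   check_job_files gpu_job.sh gpu_job.jdl && dirac-wms-job-submit gpu_job.jdl
```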
After the job has run successfully, the stdout output looks like:
Cpus_allowed_list: 11-21

Thu Dec 8 14:57:12 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          Off  | 00000000:00:07.0 Off |                    0 |
|  0%   42C    P0    57W / 150W |      0MiB / 23028MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
/tmp/ZTiMDmYBVN2nCifV3nVLvLKmABFKDmABFKDmURGKDmABFKDm5i8MKo/DIRAC_hA3xkNpilot/20
wn-a10-04.gina.surfsara.nl
Starting eigenvalues
GPU Device 0: "Ampere" with compute capability 8.6
Matrix size: 2048 x 2048
Precision: 0.000010
Iterations to be timed: 100
Result filename: 'eigenvalues.dat'
Gerschgorin interval: -2.894310 / 2.923303
Average time step 1: 0.987820 ms
Average time step 2, one intervals: 1.169709 ms
Average time step 2, mult intervals: 2.616700 ms
Average time TOTAL: 4.785920 ms
Test Succeeded!
And you have run your first GPU job on the Grid!
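When you run many jobs, reading each StdOut file by hand becomes tedious. The sketch below checks a retrieved StdOut file for the Test Succeeded! line that the eigenvalues sample prints. The function name job_succeeded is hypothetical, and the path in the usage comment assumes that the retrieved files are placed in a directory named after the job ID, so adjust it to wherever your output ends up.

```shell
#!/bin/bash

# Hypothetical helper (not a DIRAC command): succeeds when the given file
# contains the "Test Succeeded!" line printed by the eigenvalues sample.
job_succeeded() {
    local stdout_file="$1"
    grep -Fq 'Test Succeeded!' "$stdout_file"
}

# Example use (assumed output location: a directory named after the job ID):
#   if job_succeeded 123/StdOut; then
#       echo "GPU job 123 passed"
#   else
#       echo "GPU job 123 failed, see 123/StdErr" >&2
#   fi
```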