PBS local jobs

In this page we will talk about job submission to the local Life Science Grid (LSG) cluster. The information here applies to LSG users who have an account on their local LSG cluster.

Warning

The Life Science Grid infrastructure is scheduled to be decommissioned in mid-2018. After the decommissioning, the smaller LSG clusters within the UMCs and other universities will cease to exist; the large central Grid clusters at NIKHEF and SURFsara will remain. More details about the decommissioning can be found here: https://userinfo.surfsara.nl/documentation/decommissioning-life-science-grid

Introduction

The Life Science Grid (LSG) is a group of clusters which can be used either locally or together as one big cluster (the Grid). Each local LSG cluster that is part of the Life Science Grid has its own User Interface (UI) and two Worker Nodes of 64 cores (see LSG specifications). You can use the local UI for submitting both local PBS jobs and Grid jobs.

In this section we will focus on the usage of the local LSG cluster as a common batch system. Local job submission can be useful when:

  • prototyping your Grid application

  • running multicore jobs with a high number of cores (e.g. more than 8 cores)

  • running applications that require just a few jobs to complete. For large-scale applications that require thousands of analyses to complete, the Grid is the best option due to its large compute and storage capacity.

Quickstart example

In this example we will submit a simple PBS job to the local LSG cluster using the fractals example.

Preamble

  • Log in to the LSG User Interface, e.g. “ams” cluster (you can find the hostname in the list of LSG hostnames):

    $ssh -X homer@gb-ui-ams.els.sara.nl   # replace homer with your username and the UI address of your local cluster
    
  • Copy the tarball pbs_fractals.tar to your UI directory:

    $wget http://doc.grid.surfsara.nl/en/latest/_downloads/pbs_fractals.tar
    
  • Copy the fractals source code fractals.c to your UI directory:

    $wget http://doc.grid.surfsara.nl/en/latest/_downloads/fractals.c
    
  • Untar the example and check the files:

    $tar -xvf pbs_fractals.tar
    $cd pbs_fractals/
    $mv ../fractals.c ./
    $ls -l
    
    -rw-r--r-- 1 homer homer fractals.c
    -rw-rw-r-- 1 homer homer wrapper.sh
    
  • Compile the example:

    $cc fractals.c -o fractals -lm
    

Submit a PBS job

  • Submit the job to the local cluster:

    $qsub wrapper.sh
    
    6401.gb-ce-ams.els.sara.nl
    

This command returns a jobID (6401) that can be used to monitor the progress of the job.

  • Monitor the progress of your job:

    $qstat -f 6401   # replace 6401 with your jobID
    

    Optionally, when the job finishes, display the job output image:

    $convert output "output.png"
    $display output.png
    
  • List your own jobs:

    $qstat -u homer   # replace homer with your username
    
  • Cancel the job you submitted:

    $qdel 6401   # replace 6401 with your jobID
    

Directives

  • Specify the maximum job walltime in hh:mm:ss:

    #PBS -l walltime=4:00:00 # the job will run for at most 4 hours
    
  • Specify the number of cores to be allocated for your job:

    #PBS -l nodes=1:ppn=2  # asks for two cores on a single node
    
  • The default stdout/stderr target is the directory that you submit the job from. The following lines change the stdout/stderr directory to a specified path (e.g. the samples directory):

    #PBS -e /home/homer/samples/
    #PBS -o /home/homer/samples/
    
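Putting these directives together, the top of a job script could look like the following sketch (the resource values, the samples directory and the my_program executable are only examples; replace them with your own settings and files):

    #!/bin/bash
    #PBS -l walltime=4:00:00        # maximum runtime of 4 hours
    #PBS -l nodes=1:ppn=2           # two cores on a single node
    #PBS -e /home/homer/samples/    # directory for the stderr file
    #PBS -o /home/homer/samples/    # directory for the stdout file

    # the actual work of the job starts here
    ./my_program                    # hypothetical executable; replace with your own commands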

System status commands

  • List all the running/queued jobs in the cluster:

    $qstat
    
  • Get details for all jobs in a queue, e.g. “long”:

    $qstat -f long
    
  • Show all the running jobs in the system and the occupied cores on the two worker nodes. The very last number in each row (after ‘/’) shows the rank of the corresponding core:

    $qstat -an1
    
  • List all running jobs per worker node and core:

    $pbsnodes
    

Local queues

On the LSG clusters you can find different queue types. We recommend that you estimate the walltime of your jobs and specify the corresponding queue when you submit them. This can be done with the ‘-q’ option of the qsub command. For example, if you want to run a job for 72 hours, you need to specify the “long” queue:

$qsub -q long wrapper.sh # allow job to run for 72 hours

If you don’t specify a particular queue, your jobs will be scheduled by default on the medium queue (32-hour limit). When the queue walltime is reached, the job will be killed.
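
If you are not sure which queues exist on your local cluster or what their exact walltime limits are, you can list them with the standard PBS/Torque queue overview command (the queue names and limits shown depend on your cluster):

    $qstat -q   # list all queues with their limits and the number of running/queued jobs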

How to use local scratch

When you submit a local job, it will land on one of the cluster’s worker nodes. This means that the working directory will be different from the directory where you submitted the job (the worker node is a different machine than the UI).

The home directory on the UI is mounted on the worker nodes via NFS. For better I/O performance, copy your files to the worker node’s local /scratch and run your computation there.

Note

There is an environment variable set on the worker nodes called $TMPDIR that points to your job directory, e.g. /scratch/<jobID>.gb-ui-ams.els.sara.nl/.

Use $TMPDIR in your scripts to locate the /scratch directory. Using $TMPDIR also ensures that any data created there is cleaned up properly when the job has finished.

Example with $TMPDIR

  • Use the ${PBS_O_WORKDIR} variable to locate your scripts and make sure that your code does not contain any hard-coded paths pointing to your home directory. This variable points to the directory from which you submitted the job. Edit the script that you submit with qsub as follows (a complete sketch is shown after this list):

    cd $TMPDIR
    cp -r ${PBS_O_WORKDIR}/<your scripts,files> .  # note the dot at the end of `cp` command
    # ...
    # Run the executables
    # ...
    # When done, copy the output to your home directory:
    cp -r $TMPDIR/results ${PBS_O_WORKDIR}/
    
  • Submit the script with qsub.
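
As a complete illustration, a job script that combines the directives with the $TMPDIR pattern could look like the sketch below (fractals, input.dat and the results directory are only examples; replace them with your own executable and files):

    #!/bin/bash
    #PBS -l walltime=4:00:00   # adjust to your expected runtime
    #PBS -l nodes=1:ppn=1

    # work on the local scratch of the worker node for better I/O performance
    cd $TMPDIR

    # copy the executable and input from the directory the job was submitted from
    cp ${PBS_O_WORKDIR}/fractals .     # example executable
    cp ${PBS_O_WORKDIR}/input.dat .    # hypothetical input file

    # run the computation and collect the output in a sub-directory
    mkdir results
    ./fractals > results/output 2>&1   # example invocation; options depend on your program

    # copy the results back to the submit directory before the job ends
    cp -r results ${PBS_O_WORKDIR}/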

How to use Grid Storage from the local cluster

In many cases the data that your program needs is not available locally, either because the space in your home directory is limited or because the data is already stored on the Grid storage.

Any interaction with the Grid, whether with compute nodes or storage elements, requires a proxy for your authentication. Even if your computation runs on a local cluster worker node but needs data from the Grid storage, you will have to Get a Grid certificate and Join a Virtual Organisation.

To access the Grid storage you need a valid proxy certificate; for local jobs submitted with qsub, however, this proxy is not copied along automatically.

Therefore, to interact with the Grid storage, you need:

  1. A proxy certificate (see StartGridSession). You need to do this only once, not for each job.

  2. To tell the system where the proxy certificate is:

  • Copy your proxy certificate to, for example, your home directory:

    $cp /tmp/x509up_u39111 /home/homer/  # replace x509up_u39111 with your own proxy file, here "39111" is your unix user-id
    
  • Set the permissions of this file to 600 and treat it as confidential:

    $chmod 600 /home/homer/x509up_u39111
    

Because your home directory is shared across the cluster, your proxy will also be available on all nodes within the cluster.

You only need to repeat this step once every week, not for each job.

  • Tell the system where your proxy certificate is by setting an environment variable. Add the following line to your job script:

    export X509_USER_PROXY=/home/homer/x509up_u39111
    

Now the Storage clients commands will work from within your job.
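
For example, a job script that stages a file from the Grid storage to the local scratch could contain lines like the sketch below (the proxy file name, the gridftp endpoint and the file path are placeholders; use the storage client, endpoint and path that apply to your own VO and data):

    # point the Grid tools to the proxy you copied to your home directory
    export X509_USER_PROXY=/home/homer/x509up_u39111

    # stage the input data to the local scratch with a storage client, e.g. globus-url-copy
    cd $TMPDIR
    globus-url-copy gsiftp://gridftp.grid.sara.nl/pnfs/grid.sara.nl/data/lsgrid/homer/input.dat file://$PWD/input.dat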

See also

This section covers the basic usage of PBS jobs, in particular on the LSG. For advanced usage of a PBS cluster you may check out the Lisa batch usage guide or the NYU Cluster usage guide.