First Grid job with Dirac

This section summarises all the steps to submit your first job on the Grid, check its status and retrieve the output.

Warning

You can continue with this guide only after you have completed the preparations for Grid. If you skipped that, go back to the Prerequisites section. Still need help with obtaining or installing your certificate? We can help! Contact us at helpdesk@surfsara.nl.

Once you finish with the First Grid job, you can continue with more advanced topics and with Best practices, the section that contains guidelines for porting real, complex simulations to the Grid.

Grid job lifecycle

To run your application on the Grid you need to describe its requirements in a specific language called job description language (JDL). This is similar to the information that we need to specify when we run jobs using a local batch scheduling system (e.g., with PBS, SLURM), although it is slightly more complex as we are now scheduling jobs across multiple sites.

Besides the application requirements, you also need to specify the content of the input and output sandboxes in the JDL. These sandboxes allow you to transfer data to or from the Grid. The input sandbox contains all the files that you want to send with your job to the worker node, such as a script that you want executed. The output sandbox contains all the files that you want to have transferred back to the UI.

Note

The amount of data that you can transfer using the sandboxes is very limited, on the order of a few megabytes (less than 100 MB). This means that you should normally limit the input sandbox to a few script files and the output sandbox to the stderr and stdout files.

Once you have the JDL file ready, you can submit it to multiple clusters with dirac-* commands. Dirac will schedule your job on a Grid worker node. The purpose of Dirac is to distribute and manage tasks across computing resources. More specifically, Dirac will accept your job, assign it to the most appropriate Computing Element (CE), record the job status and retrieve the output.

Dirac proxy creation

Before submitting your first Grid job, you need to create a proxy from your certificate. The proxy has a short lifetime, so you never have to pass your personal certificate along to the Grid. The job will keep a copy of your proxy and pass it along to the Worker Node.

This section will show you how to create a valid proxy:

  • Log in to your UI account:

    $ssh homer@ui.grid.surfsara.nl # replace "homer" with your username
    
  • To enable the software environment to use Dirac tools, please run the following command:

    $source /etc/dirac/pro/bashrc
    

Please note that you need to run this command every time you log in to the UI. You may also add this command to your shell configuration file ($HOME/.bashrc).
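One way to add the line to $HOME/.bashrc without duplicating it on repeated runs is sketched below. The helper name add_dirac_env is hypothetical and not part of Dirac; only the source command itself comes from this guide.

```shell
# Append the Dirac environment setup to a shell startup file, but only once.
# add_dirac_env is a hypothetical helper name used for illustration.
add_dirac_env() {
    rcfile="${1:-$HOME/.bashrc}"          # default target: your .bashrc
    line='source /etc/dirac/pro/bashrc'   # the command shown in this guide
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$line" "$rcfile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rcfile"
    fi
}
```

Calling add_dirac_env with no argument targets $HOME/.bashrc; passing a path lets you try it on a scratch file first.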

  • Create a proxy with the following command and provide your Grid certificate password when prompted:

    $dirac-proxy-init -g pvier_user -M pvier --valid 168:00
    

Each VO (e.g., pvier in the above example) is mapped to a group in Dirac (pvier_user in this case), and the group name may differ from the name of the VO itself. Please contact helpdesk@surfsara.nl if you are unsure of the group name to use. The above command creates a local proxy with a maximum validity of 7 days (168 hours).

You should see a similar output displayed in your terminal:

    Generating proxy...
    Enter Certificate password:
    Added VOMS attribute /pvier
    Uploading proxy..
    Proxy generated:
    subject      : /DC=org/DC=terena/DC=tcs/C=NL/O=SURF B.V./CN=homer homer@example.com/...
    issuer       : /DC=org/DC=terena/DC=tcs/C=NL/O=SURF B.V./CN=homer homer@example.com/...
    identity     : /DC=org/DC=terena/DC=tcs/C=NL/O=SURF B.V./CN=homer homer@example.com
    timeleft     : 167:53:58
    DIRAC group  : pvier_user
    path         : /tmp/x509up_uxxxx
    username     : homer
    properties   : NormalUser
    VOMS         : True
    VOMS fqan    : [u'/pvier']

    Proxies uploaded:
    DN                                                                                   | Group | Until (GMT)
    /DC=org/DC=terena/DC=tcs/C=NL/O=SURF B.V./CN=homer homer@surf.nl |  | 2022/07/31 23:54
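Because the proxy expires, it can be handy to check how much time is left before submitting long-running work. The sketch below extracts the hours from a timeleft line in the format shown above. The helper name proxy_hours_left is hypothetical, and feeding it from dirac-proxy-info assumes that command prints the same timeleft line as dirac-proxy-init does.

```shell
# Print the whole hours left on the proxy, read from text on stdin that
# contains a line such as "timeleft     : 167:53:58".
# proxy_hours_left is a hypothetical helper name used for illustration.
proxy_hours_left() {
    # Split on ':' so that field 2 holds the hours; strip the padding spaces.
    awk -F: '/^[ \t]*timeleft/ { gsub(/ /, "", $2); print $2 + 0; exit }'
}
```

Assuming the output format matches, it could be used as `dirac-proxy-info | proxy_hours_left`.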

Note

What does the dirac-proxy-init command actually do?

  • It generates a local proxy x509up_uXXX in the /tmp/ directory of the UI
  • It uploads this proxy to the Dirac proxy server

You are now ready to submit jobs to the Grid, or to copy data to and from the Grid.

Describe your job in a JDL file

To submit a Grid job you must describe it in a plain-text file written in JDL. The JDL file passes the details of your job to Dirac.

Warning

Make sure you have started your session and have already created a valid proxy.

  • Log in to your User Interface.
  • Create a file with the following content describing the job requirements. Save it as simple.jdl:

This job involves no large input or output files. It will copy jobscript.sh to the Worker Node that the job lands on and execute it. The standard output and standard error will be directed to the files simple.out and simple.err, respectively, and returned when the job output is retrieved.
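As an illustration, the two files described above could be created from the UI command line as follows. This is only a sketch: the attribute names are common JDL attributes, but the values chosen for JobName and Executable, the enclosing brackets, and the body of jobscript.sh are assumptions rather than the guide's own example.

```shell
# Write a minimal job script (its contents are an assumption for illustration).
cat > jobscript.sh <<'EOF'
#!/bin/sh
# Print the node name so the output shows where the job ran.
echo "Job running on $(uname -n)"
EOF

# Write a JDL that ships jobscript.sh in the input sandbox and returns
# stdout and stderr in the output sandbox, as described in the text.
cat > simple.jdl <<'EOF'
[
  JobName = "Simple_Job";
  Executable = "/bin/sh";
  Arguments = "jobscript.sh";
  StdOutput = "simple.out";
  StdError = "simple.err";
  InputSandbox = {"jobscript.sh"};
  OutputSandbox = {"simple.out", "simple.err"};
]
EOF
```

Running the script locally with `sh jobscript.sh` before submitting is a quick way to catch errors that would otherwise only show up in simple.err.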

Job list match

Before actually submitting the job, you can optionally check which Computing Elements satisfy your job description. This check does not guarantee anything about the CE load; it only matches your JDL criteria against the resources available to your VO:

    $dirac-wms-match simple.jdl # replace simple.jdl with your JDL file

The job matching functionality is useful for testing purposes only and is not intended for use when submitting hundreds of jobs.

Your job is now ready. Continue to the next step to submit it to the Grid!

To submit your first Grid job and get an understanding of the job lifecycle, follow the steps in the sections below.

Submit the job to the Grid

By now you should have your simple.jdl file ready on the UI. When you submit this simple Grid job to Dirac, a job will be created and sent to a remote Worker Node. There it will execute the script jobscript.sh and write its standard output and standard error to simple.out and simple.err, respectively.

  • Submit the simple job by typing in your UI terminal this command:

    $dirac-wms-job-submit simple.jdl -f jobid
    JobID = 314
    

The option -f allows you to specify a file (in this case jobid) in which to store the unique job identifier. If you omit the -f option, the job ID is not saved in a file. If you do not keep this ID, you will effectively lose the output of your job!

Track the job status

To check the current job status from the command line, use the following command, which queries Dirac for the status of the job.

  • After submitting the job, type:

    $dirac-wms-job-status 314
    
  • Alternatively, if you have saved your job IDs in a file, you can use the -f option with the filename as argument:

    $dirac-wms-job-status -f jobid
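If you prefer to wait for a job instead of checking by hand, the status command can be polled from a small loop. The sketch below rests on two assumptions that are not stated in this guide: that each status line looks like `JobID=314 Status=Done; MinorStatus=...; Site=...;`, and that Done, Failed and Killed are the states at which polling should stop. The helper names job_state and wait_for_job are hypothetical.

```shell
# Print the value of the Status= field from status lines read on stdin.
# Assumed line format: "JobID=314 Status=Done; MinorStatus=...; Site=...;"
job_state() {
    # Require a space or ';' before "Status=" so "MinorStatus=" is not matched.
    sed -n 's/.*[ ;]Status=\([^;]*\);.*/\1/p' | head -n 1
}

# Poll one job until it reaches a state we treat as final (an assumption).
# Usage: wait_for_job JOBID [SECONDS_BETWEEN_CHECKS]
wait_for_job() {
    jobid="$1"
    interval="${2:-60}"
    while :; do
        state=$(dirac-wms-job-status "$jobid" | job_state)
        echo "job $jobid: ${state:-unknown}"
        case "$state" in
            Done|Failed|Killed) break ;;
        esac
        sleep "$interval"
    done
}
```

For example, `wait_for_job 314 120` would check job 314 every two minutes. Keep the interval generous so that the polling does not put unnecessary load on the Dirac service.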
    

Cancel job

  • If you realize that you need to cancel a submitted job, use the following command:

    $dirac-wms-job-kill 314
    

Retrieve the output

The output consists of the files listed in the OutputSandbox statement. You can retrieve the job output once the job has completed successfully, in other words once the job status has changed from Running to Done. The files in the output sandbox can be downloaded for approximately one week after the job finishes.

Note

You can choose the output directory with the -D option. If you do not use this option, the output will be copied to a directory named after the job ID in your current working directory on the UI.

  • To get the output, type:

    $dirac-wms-job-get-output 314
    
  • Alternatively, you can use the jobid file:

    $dirac-wms-job-get-output -f jobid
    

where you should substitute jobid with the file that you used to store the job IDs. Please bear in mind the size of your home directory on the UI when downloading large output files. When dealing with large input and/or output files, it is recommended to download the input data directly to the worker node and to upload the output data to a suitable storage space from within the job itself. Please check out the Grid storage section for details on the various clients supported on the worker nodes and for best practices.

Check job output

  • To check your job output, go to the downloaded output directory. It contains the simple.out and simple.err files specified in the OutputSandbox statement:

    $ls -l /home/homer/314
    
    -rw-rw-r-- 1 homer homer  0 Jan  5 18:06 simple.err
    -rw-rw-r-- 1 homer homer 20 Jan  5 18:06 simple.out
    
    $cat /home/homer/314/simple.out
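The same check can be scripted, which helps once you retrieve output for more than one job. The sketch below treats a non-empty simple.out together with an empty simple.err as a sign of success. That criterion and the helper name check_job_output are assumptions for this simple job, not a general rule.

```shell
# Check a retrieved output directory: simple.out must be non-empty and
# simple.err must be empty. check_job_output is a hypothetical helper name.
check_job_output() {
    dir="$1"
    if [ ! -s "$dir/simple.out" ]; then
        echo "simple.out is missing or empty in $dir" >&2
        return 1
    fi
    if [ -s "$dir/simple.err" ]; then
        echo "simple.err is not empty:" >&2
        cat "$dir/simple.err" >&2
        return 2
    fi
    echo "output looks fine"
}
```

For the job above, this would be called as `check_job_output /home/homer/314`.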
    

Recap & Next Steps

Congratulations! You have just executed your first job on the Grid!

Let’s summarise what we’ve seen so far.

You interact with the Grid via the UI machine ui.grid.surfsara.nl. You describe each job in a JDL (Job Description Language) file, where you list which program should be executed and what the worker node requirements are. From the UI, you first create a proxy of your Grid certificate and then submit your job with dirac-* commands. The resource broker Dirac accepts your jobs, assigns them to the most appropriate CE (Computing Element), records the job statuses and retrieves the output.

See also

Now try to port your own application to the Grid. Check out the Best practices section and run the example that suits your use case. The section Advanced topics will help you understand several of the Grid modules used in the Best practices.

Done with the General, but not sure how to proceed? We can help! Contact us at helpdesk@surfsara.nl.