Data replication

On this page we show an example of running applications on multiple LSG clusters with replicas, where the jobs also require large volumes of data:

Problem description

The InputSandbox and OutputSandbox attributes in the JDL file are the basic way to move files between the User Interface (UI) and the Worker Node (WN). However, if large files (roughly 100 MB and larger) or many files are involved, you should not use these sandboxes to move data around. Instead, use the Storage Elements and work with the lfc and lcg commands. These commands, and the storage system in general, are explained in the section lcg-lfn-lfc clients.

Here we give an example of how to use large input and output files that are needed by a job. We will use replicas to avoid transferring e.g. a database to multiple clusters every time.

Data Requirements

This example uses the DataRequirements attribute in your job description file; this attribute is a list of classads representing the data requirements of the job. Each classad has to contain three attributes:

InputData

The list of input files needed by the job

DataCatalog

The address (URL) of the data catalog if this is not the VO default one.

DataCatalogType

The type of data catalog - needed by the Grid middleware in order to resolve logical names to physical names. Fill in “DLI” here.

The presence of the DataRequirements attribute causes the job to run on a Computing Element (CE) which is next to the Storage Element (SE) where the requested file is stored. Note that this attribute doesn’t perform the actual copy of the file from the SE to the WN; as we will see, this has to be done by the user.

To do this, first register a file on an SE and in the LFC. We do this with copy-and-register (lcg-cr):

$ lcg-cr --vo lsgrid -d gb-se-ams.els.sara.nl -l lfn:/grid/lsgrid/homer/test.txt file:/home/homer/local_test.txt
guid:522350d4-a28a-48aa-939b-d85c9ab5443f

Note that the guid is the return value of the command; it identifies the file uniquely in the Grid storage. You can save this id for emergencies. The part which starts with lfn: is the logical file name of our uploaded file.

Note

The LFC needs to support your VO in order to work.
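
If you want to check the registration, or create a replica of the file on a second SE so that jobs on another cluster can also find the data nearby, you can use lcg-lr and lcg-rep. A minimal sketch; ANOTHER-SE-HOSTNAME is a placeholder for an SE of your choice:

$ lcg-lr --vo lsgrid lfn:/grid/lsgrid/homer/test.txt
$ lcg-rep --vo lsgrid -d ANOTHER-SE-HOSTNAME lfn:/grid/lsgrid/homer/test.txt

lcg-lr prints the SURLs of all replicas registered in the LFC for this LFN; after a successful lcg-rep it should list a replica on both SEs.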

Second, create a JDL file that describes your job. It will contain the LFN of the file, as is shown here.

$ cat inputdata.jdl
[
    Executable = "/bin/sh";
    Arguments = "scriptInputData.sh lfn:/grid/lsgrid/homer/test.txt";

    StdOutput = "std.out";
    StdError = "std.err";

    InputSandbox = "scriptInputData.sh";
    OutputSandbox = {"std.out","std.err"};

    DataRequirements = {
            [
              InputData = {"lfn:/grid/lsgrid/homer/test.txt"};
              DataCatalogType = "DLI";
              DataCatalog = "http://lfc.grid.sara.nl:8085";
            ]
    };
    DataAccessProtocol = {"gsiftp"};

    RetryCount = 3;
]

This JDL mentions the script scriptInputData.sh (as the value of Arguments), which will be submitted to the WMS and run on a Worker Node. This script needs an input file and expects an LFN as its argument. We will use the file that we copied to an SE earlier. In the DataRequirements section, we mention the LFN of this file as the value of InputData. Notice that DataCatalogType and DataCatalog are also specified. You can copy these values.

Note that this in itself is not enough for the script to use the file. It still needs to be copied to the worker node where the job lands. All that is achieved by this JDL description is that the job will land close to an SE which contains the needed data. The copying is done by the script itself. To actually copy the file associated with this LFN from the SE to the WN, the script uses an lcg-cp command. The script scriptInputData.sh is shown below.

The script gets the file, performs the ls command and shows the content of the file to stdout.

$ cat scriptInputData.sh
#!/bin/sh

# Set the proper environment
export LFC_HOST=lfc.grid.sara.nl
export LCG_GFAL_INFOSYS=bdii.grid.sara.nl:2170
export LCG_CATALOG_TYPE=lfc

# Download the file from the SE to the WN where this job runs
# note that the LFN is passed as input to this script
lcg-cp --vo lsgrid $1 file:`pwd`/local_file

echo "########################################"
ls -la local_file
echo "########################################"
# type the file just downloaded
cat local_file

Now the actual submission, status checking, output retrieval and inspection can take place. If you want to try this example, you have to create the two files inputdata.jdl and scriptInputData.sh with the content displayed above. Of course, you have to register your own file and change the LFN requested in the DataRequirements attribute accordingly.
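
For reference, a typical submit-and-inspect cycle with the gLite WMS commands could look like the sketch below; it assumes automatic proxy delegation (-a), and jobIds is just an example name for the file holding the job identifiers:

$ glite-wms-job-list-match -a inputdata.jdl
$ glite-wms-job-submit -a -o jobIds inputdata.jdl
$ glite-wms-job-status -i jobIds
$ glite-wms-job-output --dir . -i jobIds

The list-match step is optional, but with the DataRequirements attribute in place it is a quick way to see which CEs (close to the SE holding your file) are candidates to run the job.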

Moving output data from the job to the SE

What do you do when you have to move data from a running job on the Worker Node to a Storage Element? The answer is: the job has to do it by having a script copy the data. We give an example. Assume that the following script code is executed by a running job.

$ cat registeringfile-script.sh
#!/bin/sh
# Author : Emidio Giorgio
# Usage : register a file to the default SE, with a specified LFN
#  - The file to copy and register is passed as first input argument to the script ($1)
#  - The logical file name it will have is the second input argument to the script ($2)
#  - the LFN will be like this /grid/lsgrid/YOUR_DIRECTORY/$2

# Set the proper environment
export LFC_HOST=lfc.grid.sara.nl
export LCG_GFAL_INFOSYS=bdii.grid.sara.nl:2170
export LCG_CATALOG_TYPE=lfc

# Actually upload the file to the SE
# path to the file to be registered is built as {current path}/{relative path from this script to filename}
# REPLACE CHANGEME with an (already existing) LFC directory of your choice
lcg-cr --vo lsgrid -l lfn:/grid/lsgrid/CHANGEME/$2  file:$PWD/$1

This script is in charge of copying the output of your job. The simplest thing is to run it from within the main job script, as shown below:

$ cat  scriptWhichDoesSomething.sh
#!/bin/sh

# do whatever
echo "This is a very dummy test" > fileout.txt

# run the script which registers the file fileout.txt just created above
/bin/sh registeringfile-script.sh fileout.txt data_from_the_WN

# greetings
echo "All done correctly (I hope). Bye bye"

This could be a starting point for your JDL:

$ cat  JobWritingToSE.jdl
[
    Executable = "/bin/sh";
    Arguments = "scriptWhichDoesSomething.sh";

    StdOutput = "std.out";
    StdError = "std.err";

    # also ship the script which registers the file along with the job
    InputSandbox = {"scriptWhichDoesSomething.sh","registeringfile-script.sh"};
    OutputSandbox = {"std.out","std.err"};
]
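
Once this job has finished, the file written on the WN is on the SE and registered in the LFC, not in the OutputSandbox. A minimal sketch of how you could check for it and fetch it from the UI afterwards, assuming the same LFC host as before; replace CHANGEME with the LFC directory you used in registeringfile-script.sh:

$ export LFC_HOST=lfc.grid.sara.nl
$ lfc-ls /grid/lsgrid/CHANGEME
$ lcg-cp --vo lsgrid lfn:/grid/lsgrid/CHANGEME/data_from_the_WN file:$PWD/data_from_the_WN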

Alternatively, you can just append the content of registeringfile-script.sh to your main script.
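
A minimal sketch of what such a combined script could look like, assuming the same LFC and BDII hosts as before and again using CHANGEME as a placeholder for an existing LFC directory of your choice (the script name is chosen here just for illustration):

$ cat scriptAllInOne.sh
#!/bin/sh

# Set the proper environment
export LFC_HOST=lfc.grid.sara.nl
export LCG_GFAL_INFOSYS=bdii.grid.sara.nl:2170
export LCG_CATALOG_TYPE=lfc

# do whatever
echo "This is a very dummy test" > fileout.txt

# upload and register the result directly from the main script
lcg-cr --vo lsgrid -l lfn:/grid/lsgrid/CHANGEME/data_from_the_WN file:$PWD/fileout.txt

In the corresponding JDL, only scriptAllInOne.sh then needs to be listed in the InputSandbox.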