Basic Application Services User Guide

This guide describes how to use the Basic Application Services package to provide data storage and processing (using applications installed on a cluster) to trusted users.

1. Overview

Overview of the Basic Application Services

The GRIA basic application services package provides the core functionality for job and data management. It consists of:

A Data Service
This allows remote users to upload and download data files to the service provider, and to transfer data between Data Services hosted by different service providers. The Data Service also supports management of access rights (for read or read-write access) granted to other users or service providers.
A Job Service
This allows remote users to start, monitor or kill computational jobs executed by the service provider. The Job Service will fetch input from and write output to a local Data Service. The Job Service can be configured to support multiple applications, which are chosen by the service provider.

The application services can be configured to be either unmanaged (free) or managed by the GRIA service provider management package, as in the diagram below.

Highlighting the Basic Applications Package in the GRIA Architecture

2. Installation

GRIA Basic Application Services Installation

Standard Installation Procedure

The Basic Application Services package is provided as a zip file, or as a tar.gz archive for Linux. Unpack the archive and you will find the following items:

  • docs (folder)
  • gria-service-provider-mgt.war
  • README.html

Install the war file according to the Service Installation Manual. Once the initial configuration has been completed, the Basic Applications Package requires some extra configuration.

Additional Pre-requisites

Some additional pieces of software need to be installed.

Perl

Perl is required in order to run the Basic Application Services correctly. For Windows, the most common Perl implementations are Cygwin Perl and ActiveState Perl. We recommend that you install ActiveState Perl—choose the latest release, click the "Next" button, and then download the MSI distribution for Windows. To fully complete the installation you must restart Windows.

Test Application: ImageMagick

ImageMagick is the default test application used in GRIA. ImageMagick binaries for Windows can be downloaded from the ImageMagick website. Use a Q8 version, e.g. ImageMagick-6.4.2-x-Q8-windows-dll.exe.

The ImageMagick .exe package for Windows is self-extracting and the installation procedure starts automatically. Follow the instructions and select the default options.

Note: older versions of ImageMagick might not support all of the basic application service's default examples, e.g. blend.

Continuing the Installation

Once the additional pre-requisites have been installed, the installation can be continued by following the instructions for each service.

3. The Data Service

Overview and Configuration of the Data Service

Overview

The GRIA data service is used to manage "data stagers". A data stager is a container for a single file (or zip file). It has a unique identifier and an access control system for determining who can read and write the data. Clients can use the service to create new stagers, upload and download data, transfer data between stagers, and control others' access to the data.

Two items of configuration must be given before the data service can be used:

The location of the root data directory
The service stores any uploaded data inside this directory. If the data service is going to be used with a job service and jobs are going to execute on a cluster then the cluster's machines need to be able to read and write to this directory.
A list of trusted management services
Normally, you can just click Add to accept the default management service. This is the SLA management service from the GRIA service provider management package. Note that if the GRIA basic application services package is deployed on a different machine from the GRIA service provider management package, some additional access control setup is required. This is described in the Links with Other Services section of the Service Provider Management user guide. As an alternative to configuring the service to be managed, you can make it unmanaged (or "free") by clicking the Make service free button.

4. The Job Service

4.1. Basic Configuration

The GRIA job service is used to manage jobs. Clients can use the service to create new jobs, upload input data, start the job, monitor progress, and download results.

Each input and output of a job is actually a data stager managed by the local data service (the one in the same .war as the job service). Therefore, you must configure the data service before the job service can be used. Users can run jobs that take input from or send output to other data services by using the normal data transfer features provided by the data service.

To configure the job service you will need to specify:

  • The location of the root job directory. The service creates one subdirectory inside this directory for each job.
  • The directory containing the platform scripts. These scripts interact with the underlying resource manager, allowing the job service to be used with clusters of machines.
  • A list of trusted management services. Normally, you can just click Add to accept the default management service. This is the SLA management service from the GRIA service provider management package. Note that if the GRIA basic application services package is deployed on a different machine from the GRIA service provider management package, some additional access control setup is required. This is described in the Links with Other Services section of the Service Provider Management user guide. As an alternative to configuring the service to be managed, you can make the service unmanaged (free) by clicking the Make service free button.
  • A list of applications which users can run using the service. See the Managing applications section for details.

4.2. Managing Applications

This section covers the administrative tasks of deploying and undeploying applications. After deploying an application to your job service, it becomes available for execution by remote clients. Note, however, that clients usually must satisfy additional business constraints, such as having an appropriate service level agreement or account, before they can execute deployed applications.

The GRIA Basic Application Services software is provided with a set of tutorial applications. These are made available during installation of the software. The web-based Administration Interface provides the location of these files during the installation process, and guides you through the simple process of application deployment.

In addition to the tutorial applications, it's straightforward to develop new applications and deploy these in the same way.

This section assumes that you have all necessary files for application deployment and that any required executable applications have been installed according to the application documentation. If you are installing the tutorial applications, you have all the files you need. If you are deploying your own applications, first see Writing Application Scripts for details of the files you need to produce before application deployment.

Deploying applications

Having obtained or created the files and scripts needed for application deployment, you can now deploy the application to the job service. Make sure that Tomcat is running, then use a web browser to log into the GRIA Basic Application Services administration page. This can be found at http://<servername>:8080/gria-basic-app-services. Enter the appropriate security credentials, and adjust the URL for the administration page according to your server setup.

From the administration page, select the Job Service link, as shown below.

Link to the Job Service Administration Page

This displays the Job Service Admin page. In the Applications section, enter the location of the directory containing the files and scripts needed for deployment. Then click the Deploy new application button.

This opens an admin page that shows the properties of the application. You can optionally enter arguments for the platform script before clicking the Accept button.

This completes deploying an application to the Job Service.

Undeploying applications

Undeploying an application is straightforward. First click the Edit button alongside the entry for the application you wish to undeploy.

Click the Undeploy button to undeploy the application.

4.3. Execution Platform Models

Overview

The GRIA architecture is flexible enough to use a variety of underlying computing platforms to run jobs, from single computers to clusters of workstations or even supercomputers. The following sections of this document describe the constraints GRIA places on different platforms.

GRIA system administrators should read this document and configure GRIA accordingly to accommodate the infrastructure of their underlying computing platform.

  • Applications should be accessible by compute nodes only. Installation of the applications can be either local per compute node or over disk space shared among all nodes.
  • Platform scripts should be accessible by the GRIA Job Service middleware.
  • Application scripts should be accessible by the GRIA Job Service middleware and compute nodes, e.g. either copy wrappers locally per node or use a shared directory.
  • Scratch area, i.e. job workspace area, should be accessible (read/write) by both the GRIA Job Service middleware and compute nodes. The scratch area cannot be copied between compute nodes; instead it has to be exported as shared disk space.

Directory names and paths for wrappers should be similar whether the access point is a compute node or the GRIA Job Service middleware. The following figure illustrates an overview of the GRIA Job Service in relation to its execution resources.
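
The access requirements above can be checked before any application is deployed. The following is a minimal sketch, not part of GRIA: the function name check_access is hypothetical, and it assumes the script is run once on the service host and once on a compute node against the same directory paths.

```shell
#!/bin/sh
# Hypothetical preflight check (not a GRIA tool).
# usage: check_access MODE DIR   where MODE is "r" (read) or "rw" (read/write)
check_access() {
    mode=$1
    dir=$2
    [ -d "$dir" ] || { echo "missing: $dir"; return 1; }
    [ -r "$dir" ] || { echo "not readable: $dir"; return 1; }
    if [ "$mode" = rw ] && [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 1
    fi
    echo "ok: $dir"
}

# Example run: a throwaway directory stands in for the scratch area.
scratch=$(mktemp -d)
check_access rw "$scratch"
check_access r "$scratch/wrappers" || echo "fix the paths above before deploying"
rm -rf "$scratch"
```

Running the same check with the same paths on both kinds of machine also confirms that directory names match between the Job Service host and the compute nodes.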

Figure 1. Diagram of the job service and its wrapper scripts

The Need for Platform Scripts

GRIA Services do not access resource managers directly to submit and check jobs. Instead, GRIA Services introduce an extra layer of resource-manager-dependent platform scripts to submit and check jobs. GRIA requires a separate suite of platform scripts for each resource manager. This extra layer of platform-dependent scripts decouples GRIA Services from resource managers and applications.

GRIA defines a platform script API for handling jobs, with scripts such as:

  • Start job script to submit jobs
  • Check job to check the status of a job
  • Kill job to terminate a job

The Job Service can then be configured to use the platform scripts suitable for the underlying computing platform. The platform scripts know how to handle (start, check, kill) jobs for that particular computing platform, and can be instructed to run a particular application via its application wrapper.

Figure 1 illustrates how the platform script layer sits between the Job Service and application wrappers hiding resource manager details.

Local Execution Scenario

This is a minimum configuration scenario: all GRIA Services and jobs run locally on the same machine.

The following figure shows how GRIA can be deployed with services running locally. In this example the Job Service middleware is using platform scripts that run jobs locally. Job workspace, wrappers and applications should be accessible by the system that runs GRIA services.

Note: this is the same configuration the GRIA release demo is using to run the tutorial applications locally.

Figure 2 - GRIA service running jobs locally.

Using a Cluster of Resources Scenario

The GRIA Services can also be deployed to use a cluster of workstations as the computational platform. For example, in Figure 3 the Job Service middleware is using PBS platform scripts to handle jobs. Cluster compute nodes need to access the applications, the wrappers and the job workspaces, whereas the GRIA Services need to access only the job workspaces and the wrappers.

Figure 3 - GRIA service running jobs via a cluster of resources

4.4. Application Wrapper Scripts and Description Files

How to write the scripts and meta-data file to integrate your application.

4.4.1. Overview

Applications and The Job Service

Relation to the Job Service

A functioning GRIA Job Service installation is composed of three main parts:

  1. The service software running under Tomcat/AXIS.
  2. Scripts for running or interacting with application codes on an execution platform, which may be the service host itself, a separate execution server, or a cluster.
  3. Some installed applications capable of running on the execution platform, each corresponding to a different Job Service end-point.

The Job Service software (1) supports various bookkeeping operations for assigning job ids, workspace, etc. The core operations are those for actually starting and managing the execution of an application: starting jobs, checking their status, and killing jobs. Each of these has to access the execution platform by running the appropriate script (2), which can start or otherwise interact with the application running on the execution platform (3). For security reasons, we do not usually allow users to upload their own applications, so the service operator must install all the applications.

The Job Service and platform scripts are designed to support a uniform model of application execution, shown in Figure 1:

Figure 1 - Application Model

The workspace for each job is set up by the Job Service when the job is initialised (this is one of the bookkeeping service operations that must precede the call to start the job). The workspace has a standard directory structure so the Job Service and platform scripts can create and find information stored in it, including a working sub-directory where the job will actually run.

When the user starts the job, the Job Service transfers input data files from Data Service URIs into the job’s workspace. It then runs a platform script locally, which in turn submits the application to the execution platform (cluster, etc.) where it will run (using the specified command line), possibly after some queuing delay. The platform script saves the job handle to the workspace. The application has to read the input data left in its workspace by the Job Service, and write any outputs to the workspace so that the Job Service can find them and transfer them to output Data Service URIs when the job has finished.

When the user asks for the status of the job, the service runs a second platform script that reads status information (e.g. when the job started or finished, etc), which the application must store in the workspace. This platform script may also run an associated monitoring application (using the specified command line) to gather application-specific status information (e.g. number of iterations completed, convergence plots, etc) from the workspace.

If the user asks for the job to be killed, the service uses a third platform script that reads the job handle from the workspace, and issues a command to kill this job on the execution platform. The Job Service will detect that the job has finished, and will transfer any output produced. Note that the user (or their client-side application) should always check the status of a job to find out if it crashed or was killed, as some incomplete output may appear in this case.
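
The start, check and kill operations described above can be sketched for the simplest case, where jobs run as local processes. This is an illustration only and not GRIA's own platform scripts: the function names and the .job_handle and .job_killed file names are hypothetical, and the job handle is assumed to be a process id.

```shell
#!/bin/sh
# Hypothetical local platform operations (illustrative names throughout).

# start: run the given command in the workspace and save its process id
# as the job handle.
platform_start() {
    ws=$1
    shift
    (
        cd "$ws" || exit 1
        "$@" >/dev/null 2>&1 &
        echo "$!" > .job_handle
    )
}

# check: report a coarse status derived from the saved handle.
platform_check() {
    ws=$1
    [ -f "$ws/.job_handle" ] || { echo UNKNOWN; return 1; }
    if [ -f "$ws/.job_killed" ]; then
        echo KILLED
    elif kill -0 "$(cat "$ws/.job_handle")" 2>/dev/null; then
        echo RUNNING
    else
        echo FINISHED
    fi
}

# kill: read the handle back from the workspace and terminate the job.
platform_kill() {
    ws=$1
    kill "$(cat "$ws/.job_handle")" 2>/dev/null || true
    : > "$ws/.job_killed"
}

# Example: start a long-running stand-in job, query it, then kill it.
job=$(mktemp -d)
platform_start "$job" sleep 30
platform_check "$job"
platform_kill "$job"
platform_check "$job"
rm -rf "$job"
```

A suite for a batch system would keep the same three entry points but store the queue's job identifier as the handle and call the queue's own submit, status and delete commands instead.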

Why are Application Wrapper Scripts Required?

In practice, few legacy applications behave exactly according to the model shown in Figure 1. It is rarely possible to change the application itself to fix this, so instead GRIA uses so-called wrapper scripts that do conform to the application model for starting and managing the application.

In practice, the wrapper scripts can do more than just make the underlying application work as indicated in Figure 1. They can also be used to handle and implement application specific features of the service.

One can also use (optional) wrapper scripts to look for application-specific status information in the working directory of the job. Without such scripts, the platform scripts can only provide basic job status information from the job submission system.

Finally, wrapper scripts also provide a configurable mechanism for dealing with any application-specific security risks, e.g. checking for malicious input that may exploit a feature of the application. Few legacy applications were designed as network-accessible services, and since we can’t change them to remove security loopholes, the use of wrapper scripts is essential to check for any exploits of application vulnerabilities. In the limit, one can configure the wrapper (and platform) scripts to run the application in a sandbox (e.g. chroot), with access only to a working sub-directory of the job workspace, as shown in Figure 2.

Figure 2 - Wrapper Scripts and Security

4.4.2. Start Job

The startJob.pl Application Wrapper Script
This document refers to the startJob.pl application wrapper script, not the platform script of the same name.

Application Wrapper Functionality

The startJob.pl application wrapper script is a mandatory script that deals with any application specificity, allowing the Job Service to treat all applications in the same way, and so decoupling the Job Service from the details of the application.

The main functions of the application wrapper script are:

  • handling input and output data files;
  • creating a consistent environment that is referred to correctly in inputs;
  • enforcing any security precautions to protect against loop-holes in the application.

The application wrapper is designed to run on the execution platform, having been submitted by the platform script for starting a job. Prior to submitting the wrapper script on the execution platform, the Job Service will have set up a workspace (directory) for the job, copied input data into it e.g. work/inputs, and created a working sub-directory for the job to run in, e.g. work. The following listing shows a workspace directory structure with two input files and an empty outputs directory.

ff808081-1017450e-0110-174532dd-0001-1
`--work
   |-- inputs
   |   |-- input-0
   |   `-- input-1
   `-- outputs

After changing to the workspace directory (not the working sub-directory), the wrapper will be submitted using the following command line:

app-wrapper <application arguments>

The functionality of the wrapper script should include the following:

  1. Parse wrapper arguments, including security checks for illegal input designed to inject malicious commands into the command-line used to launch the application.
  2. Move input data files into the working sub-directory, including unpacking any that are compressed archives containing multiple inputs.
  3. Create a consistent environment in the working directory, by setting up environment variables and rewriting input data to match the local environment where necessary.
  4. Touch the file .app_started in the job workspace (this should be outside the working subdirectory), so recording the time when the application started. This step is optional.
  5. Build the command line and run the underlying application, making sure that the standard output channels are directed to .stdout and .stderr in the working directory. This step is optional.
  6. When the application has finished, touch the file .app_ended in the job workspace, so recording the time when the application finished. This step is optional.
  7. Copy output files from the working directory into the specified positions in the workspace, including packing multiple outputs into compressed archive files where necessary.
  8. Exit by returning the exit code of the application.

For simple applications, security can be maintained by checking input parameters during step 1 and, if necessary, data files during step 3. If the application is too complicated for this to be reliable, it may also be necessary to set up a sandboxed working environment and run the code inside it during step 5.
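
The eight steps can be laid out as a skeleton. The sketch below is written in shell to match the example wrapper later in this section. It assumes a single input and a single output, uses the standard sort utility as a stand-in for the real application, and is called from the job workspace directory. The function name run_wrapper and the asc/desc argument are hypothetical.

```shell
#!/bin/sh
# Skeleton of a start-job wrapper; `sort` stands in for the application.
run_wrapper() {
    # 1. Parse arguments: accept only the values the service advertises.
    case ${1-} in
        asc)  flag= ;;
        desc) flag=-r ;;
        *)    echo "illegal argument" >&2; return 2 ;;
    esac

    # 2. Move the staged input into the working sub-directory.
    cp work/inputs/input-0 work/in.txt || return 1

    # 3. Fix the environment so results do not depend on the caller's locale.
    LC_ALL=C
    export LC_ALL

    # 4. Record the start time, outside the working sub-directory.
    touch .app_started

    # 5. Run the application, capturing its output channels.
    code=0
    ( cd work && sort $flag -o out.txt in.txt > .stdout 2> .stderr ) || code=$?

    # 6. Record the end time.
    touch .app_ended

    # 7. Copy the result to its output position.
    if [ "$code" -eq 0 ]; then
        cp work/out.txt work/outputs/output-0 || code=1
    fi

    # 8. Return the application's exit code.
    return "$code"
}

# Example: build a workspace by hand and run the wrapper inside it.
ws=$(mktemp -d)
mkdir -p "$ws/work/inputs" "$ws/work/outputs"
printf 'b\na\nc\n' > "$ws/work/inputs/input-0"
( cd "$ws" && run_wrapper desc ) && cat "$ws/work/outputs/output-0"
rm -rf "$ws"
```

Mapping each accepted argument to a fixed flag, rather than passing user text through to the command line, means step 1 doubles as the security check.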

Some of these steps are considered in more detail below.

Input and Output Data Handling

When unpacking input data, the application wrapper should attend to the following:

  1. Create any substructure needed in the job's working sub-directory of the workspace.
  2. Copy or unzip input files from the inputs sub-directory into the job's working space.
  3. Check that all input needed to run the job is present.

The Job Service knows in advance which output files have to be placed in the outputs directory. The application wrapper has to create these files by:

  1. Copying or zipping data to create the required output files in the outputs sub-directory.
  2. Renaming these files to follow the output-x naming scheme.

When the script finishes, the Job Service will detect this and handle the transfer of output files accordingly.

The number of output files is always fixed, so the wrapper can know what outputs are needed, and how to assemble them. However, the user is allowed to distribute multiple inputs across an arbitrary number of input zip archives, provided they specify all the Data Service URIs where these are stored when they start the job. The application wrapper must be capable of handling this.
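
A wrapper that accepts any number of inputs, some of which may be zip archives, might stage them as follows. This is a sketch under assumptions: the helper names are hypothetical, archives are detected with the Info-ZIP unzip tool if it is installed, and the functions are called from the job workspace directory.

```shell
#!/bin/sh
# Hypothetical input/output helpers for a wrapper.

# Copy every staged input into the working sub-directory, unpacking any
# that are zip archives. Without unzip, every input is treated as a file.
stage_inputs() {
    for f in work/inputs/input-*; do
        [ -f "$f" ] || continue
        if unzip -tq "$f" >/dev/null 2>&1; then
            unzip -oq "$f" -d work || return 1
        else
            cp "$f" "work/$(basename "$f")" || return 1
        fi
    done
}

# Fail if any file the application needs is absent after staging.
require_inputs() {
    for name in "$@"; do
        [ -f "work/$name" ] || { echo "missing input: $name" >&2; return 1; }
    done
}

# Publish results under the fixed output-x names, in the order given.
publish_outputs() {
    n=0
    for result in "$@"; do
        cp "work/$result" "work/outputs/output-$n" || return 1
        n=$((n + 1))
    done
}

# Example: one plain input, one result published as output-0.
ws=$(mktemp -d)
mkdir -p "$ws/work/inputs" "$ws/work/outputs"
echo "first input" > "$ws/work/inputs/input-0"
(
    cd "$ws" || exit 1
    stage_inputs && require_inputs input-0 || exit 1
    cp work/input-0 work/result.txt    # stand-in for running the application
    publish_outputs result.txt && ls work/outputs
)
rm -rf "$ws"
```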

Consistent Context Reconstruction

Why do we need context reconstruction? The input data for our application has been created on another system with a different directory structure, environment and possibly even operating system. We have to set up an equivalent (not necessarily identical) environment on our execution platform, and make sure any input data references to the remote user's environment are mapped onto the one we have created, or they will be invalid when the application is started.

When and where should context reconstruction be performed? One should handle it as close as possible to the running application—certainly on the execution platform where the job will actually be run—as this is where the environment is needed. This is why the Job Service doesn't attempt to create the context itself - there is no point doing it at the service host if the job will be executed on a compute node in a Condor cluster. Instead, we leave it to the application wrapper to handle everything in an application specific way on the execution platform itself.

A typical approach to context reconstruction might involve passing an array of named parameters to the Job Service, including environment settings as well as application flags. These will be passed to the wrapper through its argument list. In addition, one can provide settings in an extra input file, intended for the wrapper rather than the job itself, and used to set up the environment prior to running the application code.

The hardest job for the wrapper is to parse and rewrite application input data where necessary to ensure it is consistent with the environment established on the execution platform. If this is not needed, it is usually quite easy to 'wrap' an application to run inside the Job Service. Where it is necessary, the wrapper may become a significant body of code in its own right.

For example, consider the following line of input intended for the rendering application AIR, used with the Job Service to provide a grid-enabled video rendering service:

  Option 'searchpath' 'shader' ['&:e:\AnimalLogic\MaxMan\shaders:C:\Sample\shaders']

The problem here is that the application uses plug-ins to perform part of the rendering calculation, and the search path for these can be specified in the user input. This particular input file has been generated automatically using a graphical environment for video post-production, which has filled in the relevant path based on where the shader libraries were installed on the user's local machine.

The wrapper has to identify which shaders are needed, and substitute the path to them on the local system:

   Option 'searchpath' 'shader' ['&:/export/apps/AnimalLogic/MaxMan/shaders:/export/apps/air/Sample/shaders']

In some cases, it may be possible to infer the meaning of client-side environment references by pattern matching against a list of meaningful terms used by the application. In others (probably in this case), it is necessary for the user to send the install path quoted for specific groups of plug-ins as service arguments or environment settings, so the wrapper can find them and map them onto the equivalent installed groups of components on the execution platform.
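
For the AIR example, the substitution itself can be done with sed once the mapping is known. The sketch below hard-codes the two mappings from the example. The function name rewrite_shader_paths is hypothetical, and a real wrapper would build the mapping from service arguments rather than fixing it in the script.

```shell
#!/bin/sh
# Rewrite client-side shader paths to their local equivalents.
# Backslashes in the client paths are doubled so sed treats them literally.
rewrite_shader_paths() {
    sed \
        -e 's|e:\\AnimalLogic\\MaxMan\\shaders|/export/apps/AnimalLogic/MaxMan/shaders|g' \
        -e 's|C:\\Sample\\shaders|/export/apps/air/Sample/shaders|g'
}

# Example: feed the line from the user's input file through the filter.
line="Option 'searchpath' 'shader' ['&:e:\AnimalLogic\MaxMan\shaders:C:\Sample\shaders']"
printf '%s\n' "$line" | rewrite_shader_paths
```

Lines that mention neither path pass through unchanged, so the filter can be applied to the whole input file.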

In extreme cases, it may be necessary to establish multiple services to run the same application in different ways, allowing a different, specific environment to be set up for each. For example, it probably wouldn't make sense to have a single service to run a computational fluid dynamics (CFD) code capable of simulating coolant flows through automotive engines AND the propagation of drugs in aerosol suspension in human lungs. It would be asking too much of a wrapper developer to differentiate and correctly handle such extreme cases, and instead one should set up two services each with its own wrapper specialised to one of these scenarios.

Security Containment

Why Wrappers have to Bother with Security

The Job Service regards application wrapper scripts as trustworthy, because the service operator can inspect them and make sure they don't do anything strange or foolhardy. However, the applications may be third party, closed source executables that cannot be inspected, and were not designed as network-accessible services in the first place.

Wrapper scripts can protect the service from malicious users in three ways:

  1. checking any user input used to create the command line for running the application, to exclude command injection attacks using parameters like 'method=gauss; cd /; rm */*';
  2. checking input data known to be used in an unsafe way by the application, e.g. to construct system calls for executing plug-ins or moving files around;
  3. confining the application to a sandbox, by first preparing the sandbox and then launching the application in it.

If the application is very simple, or designed to withstand malicious users, or if you have only a small number of users you know well (and trust not to mislay their credentials) then it may be OK to include only the first of these measures.
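
The first measure can be as simple as accepting only arguments drawn from a small, known-safe alphabet. The sketch below takes that approach. The function name is_safe_arg and the chosen alphabet are assumptions, and the alphabet should be narrowed to whatever the application actually needs.

```shell
#!/bin/sh
# Accept an argument only if it is non-empty and every character is a
# letter, digit, dot, underscore, equals sign or hyphen.
is_safe_arg() {
    case $1 in
        '' | *[!A-Za-z0-9._=-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example: the injection attempt from the list above is rejected.
for arg in 'method=gauss' 'method=gauss; cd /; rm */*'; do
    if is_safe_arg "$arg"; then
        echo "accepted: $arg"
    else
        echo "rejected: $arg"
    fi
done
```

Allowing a known-good character set is generally safer than trying to enumerate dangerous characters, because anything not anticipated is rejected by default.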

Legacy applications are quite likely to do things in unsafe ways. Renaming files or testing whether they exist is often done via system calls if the application developer wasn't expecting filenames to be sent by a remote user who may have malicious intent. If the application isn't too complex, or if you can check with the developer on what might happen, then it should be OK if you also check the user-supplied input and check filenames and other data that may be sent to unsafe system calls.

In the worst case, one has to assume the application will be unsafe, and attempt to contain any damage caused by malicious (or possibly careless) input by restricting what the application can do and where it can do it. There are several possible ways to achieve such restrictions.

Chroot

On Linux systems, chroot can be used to restrict a sub-process to an arbitrary sub-directory, e.g. a job's working directory. The chroot mechanism was designed for use by operating system developers to allow them to create a pseudo-root within which to test their code. Although the chroot container doesn't prevent access to low-level devices, it will prevent most legacy applications from accessing files outside the specified sub-directory. Chroot is widely used to contain web servers and other network applications to minimise the scope for damage if they are compromised.

To use chroot, it is necessary to create a complete operating system environment inside the job's working directory (which it will see as '/'). One has to copy application binaries, resolve any references to system/application libraries, create devices such as /dev/null, etc. To create a self-sufficient chroot 'jail' environment sufficient to run the application may not be easy, and of course, it would need to be repeated for each individual job. However, it can provide a good safety level as its 'jail' environment is enforced by the operating system itself.

Restricted Shells

Many shells, including bash, provide a restriction mechanism, usually invoked by running the shell with the -r switch. Common restrictions include preventing a program from changing directories, from running commands whose names contain a slash (so only commands found on the search path can be executed), from redirecting output on the command line, and from changing the search path.

Minimal privilege accounts

Another approach is to create a low-privilege account for each job. The wrapper script would then have to assign such an account, change the working directory so it is owned by this account, and run the application in that working directory under the same account. Provided the same account is not used for anything else (including running other jobs), the application can be prevented from accessing anything outside the working directory, even if it can be induced to run some unforeseen system call by sending some malicious input.

The two drawbacks with this approach are:

  1. ideally one should create a pool of accounts and provide a way for the wrapper to assign them to jobs rather than creating new accounts, but this isn't supported at present;
  2. the wrapper would need sufficient privilege to set the account under which a sub-process is run, which may make the wrapper more dangerous if it can be compromised.

The second drawback may not be too bad, given that the wrapper at least can be designed to check all inputs and avoid doing anything unpleasant. At present, the Job Service runs with a normal unprivileged user identity, so it may be better to use other methods to contain individual jobs.

Other Methods

The above list is by no means exhaustive. For example, if the chroot 'jail' is not sufficient, one can create an entire virtual machine on which to run a potentially unsafe application. Software such as VMWare can be used to implement this approach, but users who want to go to these lengths are on their own, at least in this version of the software.

Error Handling

If an error is encountered in the application, the wrapper must report the fact. If this is not done, the Job Service will assume everything is OK, and the user's client application will probably attempt to continue, which may not be appropriate if some output from the job is missing, etc.

Return values are passed back to the Job Service middleware via three files in the job workspace:

  • .app_wrapper_exit_code
  • .app_exit_code (optional)
  • .app_exit_status (optional)

The application startJob wrapper should exit with an exit status of zero if the job has completed successfully, or with a non-zero status if the job has failed. This value will be stored in .app_wrapper_exit_code by the platform scripts. Generic clients may stop executing a workflow, for example, if this result is not zero.

The two optional files may be used to provide extra application-specific information to specialised counterpart client applications. The Job Service is not expected to understand these codes, but will just pass the values, if set, to the client. The application wrapper must write these files, if they are needed.

The .app_exit_code file should contain the application's exit code.

The .app_exit_status file can contain further information.
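As a rough illustration of the rules above, the following sketch shows how a wrapper might record the two optional files. The helper name run_and_record, the STATUS_DIR variable and the text written to .app_exit_status are assumptions made for this example, and the command sh -c 'exit 3' merely stands in for a failing application.

```shell
#!/bin/sh
# Sketch only: run_and_record and STATUS_DIR are hypothetical names.
# STATUS_DIR is assumed to be the directory in which the Job Service
# looks for the exit-code files.
STATUS_DIR=$(mktemp -d)

run_and_record() {
    "$@"                                   # run the application
    code=$?
    echo "$code" > "$STATUS_DIR/.app_exit_code"
    if [ "$code" -eq 0 ]; then
        echo "completed" > "$STATUS_DIR/.app_exit_status"
    else
        echo "failed with exit code $code" > "$STATUS_DIR/.app_exit_status"
    fi
    return "$code"
}

# Stand-in for a failing application.
run_and_record sh -c 'exit 3'
WRAPPER_STATUS=$?
echo "wrapper should now exit with status $WRAPPER_STATUS"
```

A real wrapper would finish with exit "$WRAPPER_STATUS", so that the platform scripts can store the same value in .app_wrapper_exit_code.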

An Example Wrapper Script

This example is based on the ImageMagick application which was installed as part of the GRIA installation. To get started, we will create a simple wrapper that runs this application.

  1. Create a startJob wrapper script:
    #!/bin/sh
    exec > log 2>&1
    echo Swirl wrapper started
    echo Arguments are: $*
    
    INPUT="$2"
    OUTPUT="$4"
    
    echo Copying inputs to work directory...
    cp "$INPUT" image.jpg || exit 1
    
    echo Run the mogrify command...
    mogrify -swirl 60 image.jpg || exit 1
    
    echo Copying result to output stager...
    cp image.jpg "$OUTPUT" || exit 1
    
    echo Swirl job completed successfully
    

    This performs the following steps:

    1. Redirect all output to the log file.
    2. Print the arguments to the log.
    3. Copy the input image into the work directory.
    4. Run the mogrify command to transform the image.
    5. Copy the result to the output stager.
  2. Edit the startJob script to run your command instead of swirl (the command you tested above).
  3. Make the script executable:
    $ chmod a+x startJob
    
  4. Test the wrapper with this command:
    $ ./startJob -i image.jpg -o output.jpg
    $ cat log
    

Check that the log shows that the command ran correctly. It should only take a few seconds to process the image. If it takes longer, press Ctrl-C and examine the log.
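The example wrapper reads the input and output paths from fixed argument positions ($2 and $4). A slightly more defensive variant could parse the -i and -o options with the standard getopts built-in, as sketched here. The helper name parse_io_args is an assumption for illustration only.

```shell
#!/bin/sh
# Sketch only: parse_io_args is a hypothetical helper name.
parse_io_args() {
    INPUT=""
    OUTPUT=""
    OPTIND=1                      # reset so the function can be reused
    while getopts "i:o:" opt; do
        case $opt in
            i) INPUT=$OPTARG ;;
            o) OUTPUT=$OPTARG ;;
            *) echo "usage: startJob -i <input> -o <output>" >&2
               return 2 ;;
        esac
    done
    # Both options are required.
    [ -n "$INPUT" ] && [ -n "$OUTPUT" ]
}

# The options may now appear in either order.
parse_io_args -o output.jpg -i image.jpg && echo "input=$INPUT output=$OUTPUT"
```

With this approach a missing or unknown option makes the function return a non-zero status, which the wrapper can turn into a non-zero exit status.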

4.4.3. Check Job

Creating the checkJob.pl Status Wrapper Script
This document refers to the checkJob.pl application wrapper script, not the platform script of the same name.

Unlike the wrapper for starting an application, the application wrapper for reading status from the working directory is optional. If no such wrapper is provided, the platform script will create a simple status report by checking for files like .app-start and .app-end that indicate if and when the job started or finished, consulting the execution platform manager (e.g. batch queue system), and appending .stderr (if any) to the result.

If you want the job status report to include application-specific information such as convergence plots, iteration counters, etc., you should create a wrapper script that will be invoked by the client calling the checkJob method.

An application status wrapper is usually a lot simpler to create than the main wrapper because it does not take any user-supplied (and hence potentially malicious) arguments, and does not set up (or run) potentially untrustworthy code. All the status wrapper has to do is to examine the job's working directory, read any files it needs in order to extract the desired status information (in the limit, one could simply copy an application-level log file), and write it to the standard output.

Note: the format of the status information is open and application dependent; however, status information should not include binary data.

An Example Status Wrapper Script

This example is based on the ImageMagick application, which was installed as part of the GRIA installation and follows on from the start job example. In the same directory as startJob, create a script called checkJob:

#!/bin/sh
tail log

This is run each time the client checks the status of the job. This example simply returns the last few lines of the log file, and is executed as follows:

  1. Make the script executable:
    $ chmod a+x checkJob
  2. Test it with these commands:
    $ ./checkJob > statusfile
    $ cat statusfile
    

    You should find that the contents of log are now in statusfile.
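A status wrapper can add a little application-specific detail and guard against binary data in the report. The following is a minimal sketch under those assumptions; report_status is a hypothetical helper, and the log contents are made up for the demonstration.

```shell
#!/bin/sh
# Sketch only: report_status is a hypothetical helper; "log" is the file
# written by the example startJob wrapper.
report_status() {
    if [ ! -f log ]; then
        echo "No log file yet"
        return 0
    fi
    echo "Log lines so far: $(wc -l < log | tr -d ' ')"
    # Keep the last lines, dropping non-printable bytes so that the
    # report never contains binary data.
    tail -n 5 log | tr -cd '[:print:]\n'
}

# Demonstration in a scratch directory with a made-up log.
DEMO_DIR=$(mktemp -d)
cd "$DEMO_DIR" || exit 1
printf 'Swirl wrapper started\nRun the mogrify command...\n' > log
STATUS_REPORT=$(report_status)
echo "$STATUS_REPORT"
```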

4.4.4. Kill Job

Creating a killJob.pl Application Wrapper Script
This document refers to the killJob.pl application wrapper script, not the platform script of the same name.

The application specific kill script allows an application to be killed in a certain way, e.g. using an application specific mechanism. This script is optional: if it is not present in the argument list, the platform script will try to kill the job at the resource manager level. When an application specific kill script is provided, the platform script will invoke it instead.

The return code of this script should be 0 upon success or any other value on error. The platform script uses this value to decide whether the kill operation succeeded or failed. The functionality of this script is application dependent and difficult to describe in a generic way.

For some applications, terminating a job might be as simple as creating a single file in the job workspace. Other applications respond to signals that can be passed to them.

Appendix II describes in detail an alternative way to kill jobs gracefully which uses a signalling mechanism for the wrapper to kill the job. The application wrapper can respond to that signal by terminating the application and preparing any useful output data.

An Example Kill Wrapper Script

We are not aware of any particular way that ImageMagick can be killed, so we cannot provide a complete example of a kill-wrapper script. In practice, such an example would be very similar to the default job termination mechanism used in the platform kill script, e.g. kill -9 $jobID. The following paragraphs provide some hints about possible ways of terminating jobs.

If a particular application is aware of a termination file in the job workspace, then the kill-wrapper can be as simple as:

touch $WORK_DIR/.terminate
where .terminate is the particular termination filename that the application is aware of.

Many applications are aware of various signals, e.g. SIGTERM, SIGALRM, SIGSTOP, etc. The kill-wrapper can then pass the appropriate signal to the application, and the application can respond by terminating the job gracefully, for example:

# find the process ID $pid and send a termination signal
kill -TERM "$pid"
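To make the signal-based approach more concrete, here is a small self-contained sketch. It is not taken from the GRIA scripts: fake_app.sh, .ready and output.txt are made-up names, and the stand-in application simply sleeps until it receives SIGTERM, at which point it saves some partial output and exits.

```shell
#!/bin/sh
# Sketch only: fake_app.sh, .ready and output.txt are made-up names.
DEMO_DIR=$(mktemp -d)
cd "$DEMO_DIR" || exit 1

# Stand-in application: on SIGTERM it saves partial output and exits.
cat > fake_app.sh <<'EOF'
#!/bin/sh
trap 'echo "partial result" > output.txt; kill "$child" 2>/dev/null; exit 143' TERM
sleep 60 &
child=$!
touch .ready              # the signal handler is installed from here on
wait "$child"
EOF

sh fake_app.sh &
pid=$!

# Wait (up to about 10 seconds) until the application is ready.
tries=10
while [ ! -f .ready ] && [ "$tries" -gt 0 ]; do
    sleep 1
    tries=$((tries - 1))
done

# The kill-wrapper step: ask the application to stop gracefully.
kill -TERM "$pid"
wait "$pid"
APP_STATUS=$?
echo "application exited with status $APP_STATUS"
```

In a real kill-wrapper the process ID would have to be discovered first, for example from a file written by the application wrapper; how this is done depends on the application.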

4.4.5. XML Description

Creating an XML File to Describe an Application

Application description files are XML files containing metadata about applications deployed on GRIA. These files are essential for GRIA users to discover available applications and use them. To create an XML description file for an application, use the following schema to identify the application's main features, including name, version number, description, and inputs/outputs (if any). For example, the following code describes the Swirl application:

<application>

    <name>http://it-innovation.soton.ac.uk/2005/gria/tutorial/swirl</name>
    <version>1.0.0</version>
    <description>Application to swirl an image</description>

    <application-inputType>
        <name>image.jpg</name>
        <type>jpg</type>
        <description>image file of any type</description>
    </application-inputType>

    <application-outputType>
        <name>image.jpg</name>
        <type>jpg</type>
        <description>swirled image of the same type as input type</description>
    </application-outputType>

</application>

The following code describes the Paint application:

<application>

    <name>http://it-innovation.soton.ac.uk/2005/gria/tutorial/paint</name>
    <version>1.0.0</version>
    <description>Application to render an image into painting</description>

    <application-inputType>
        <name>image.jpg</name>
        <type>jpg</type>
        <description>image file of any type</description>
    </application-inputType>

    <application-outputType>
        <name>image.jpg</name>
        <type>jpg</type>
        <description>painted image of the same type as input type</description>
    </application-outputType>

</application>

Note that you can add as many inputs/outputs as necessary, according to your application.

Every type of application provided by the Job Service must be given a unique name. To ensure that names are unique, a URI is used. Note that although these names look like web page addresses, they do not necessarily point to web pages if treated as URLs. They are simply unique strings.

4.5. Platform Scripts

The platform scripts create the interface between GRIA and the scheduler.

4.5.1. The Platform Script Interface

Describing the three platform scripts

4.5.1.1. Overview

Overview of the platform script interface

The Job Service knows only how to start a job, check its status and kill it. In order to decouple the Job Service and application wrapper bindings from resource managers, an extra layer of resource manager dependent platform scripts is introduced. This means that Job Services can operate with different resource managers without changing the Job Service itself or the application wrappers.

The Job Service can then be configured to use platform scripts suitable for the underlying computing platform. Platform scripts know how to handle (start, check, kill) jobs for that particular computing platform, and can be instructed to run a particular application via its application wrapper.

Figure 1 illustrates how the platform script layer sits between the Job Service and application wrappers hiding resource manager details.

Figure 1. The job service interface and scripts

The GRIA Job Service requires the following platform scripts:

  • Start job: this script knows how to submit jobs for a specific resource manager
  • Check job: knows how to check the status of a job
  • Kill job: knows how to terminate a job

The following sections describe in detail the APIs and the required functionality for platform scripts.

Ideally, users should adopt one of the supplied platform scripts for running jobs using PBS, Condor or local execution. If it is necessary to develop platform scripts to address an unsupported execution platform, this can be done, but first read about the GRIA platform model.

4.5.1.2. Start Job

The startJob.pl platform script
This document refers to the startJob.pl platform script, not the application wrapper script of the same name.

Introduction

The startJob.pl platform script is responsible for setting up the platform dependent environment within the job workspace directory, generating a platform dependent job description file and submitting that file for execution to its underlying resource manager. The resource manager will then try to run that job description file and invoke the application via its application wrapper.

The start job script is invoked by the Job Service middleware, usually the same system that runs GRIA Services.

Job Life Cycle

When a job is submitted via the startJob script, it will follow a specific life cycle.

Figure 2 - The job life-cycle.

Figure 2 shows the life cycle of a job within GRIA (with time increasing along the x-axis). The numbered labels refer to timestamps that are required by the Services. The table below contains details of the labels with their meanings and the name of a file (to be stored in the job's session directory) that will be touched at the appropriate time in order to record the date and time that the event occurred.

Label in Figure 2   Meaning                          File used to store timestamp
1                   Job submission time              .job_submitted
2                   Application wrapper start time   .app_wrapper_started
3                   Application start time           .app_started
4                   Application end time             .app_ended
5                   Application wrapper end time     .app_wrapper_ended

The startJob script should create the file listed for label 1, while the application wrapper should create the rest.
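The application-level timestamps (labels 3 and 4) can be recorded by the application wrapper with two touch commands around the application call. The sketch below assumes that the session directory is the parent of the work directory; SESSION_DIR and run_with_timestamps are hypothetical names, and true stands in for the real application command.

```shell
#!/bin/sh
# Sketch only: SESSION_DIR and run_with_timestamps are hypothetical names.
# The session directory is assumed to be the parent of the work directory.
SESSION_DIR=$(mktemp -d)
mkdir "$SESSION_DIR/work"
cd "$SESSION_DIR/work" || exit 1

run_with_timestamps() {
    touch "$SESSION_DIR/.app_started"      # label 3 in Figure 2
    "$@"
    code=$?
    touch "$SESSION_DIR/.app_ended"        # label 4 in Figure 2
    return "$code"
}

# "true" stands in for the real application command.
run_with_timestamps true
APP_CODE=$?
ls -a "$SESSION_DIR"
```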

Script API

The start job script should comply with the following command line:

startJob
  -d <absolute path to workspace directory>
  -e <full path to application wrapper script>
  [-r [job constraints]...]
  [-- [application arguments]...]
  • Flag -d specifies the full path to job workspace directory, e.g. /mnt/data/ws-123
  • Flag -e specifies the application wrapper script
  • Flag -r specifies a list of directives/constraints for the resource manager. The script should understand these directives and translate them accordingly for the job description file. The constraints should be expressed as name=value pairs.
  • Flag -- specifies a list of arguments to be passed when invoking the underlying application.
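The supplied platform scripts are written in Perl; purely as an illustration of the command line above, the following shell sketch separates the four groups of arguments. The function parse_start_args, the variables it sets and the wrapper path used in the demonstration are all hypothetical.

```shell
#!/bin/sh
# Sketch only: parse_start_args and the variables it sets are hypothetical.
parse_start_args() {
    WORKSPACE="" WRAPPER="" CONSTRAINTS="" APP_ARGS=""
    while [ $# -gt 0 ]; do
        case $1 in
            -d) [ $# -ge 2 ] || return 2
                WORKSPACE=$2; shift 2 ;;
            -e) [ $# -ge 2 ] || return 2
                WRAPPER=$2; shift 2 ;;
            -r) shift
                # Collect name=value pairs until the next flag.
                while [ $# -gt 0 ]; do
                    case $1 in
                        -*)  break ;;
                        *=*) CONSTRAINTS="${CONSTRAINTS:+$CONSTRAINTS }$1"
                             shift ;;
                        *)   echo "bad constraint: $1" >&2; return 2 ;;
                    esac
                done ;;
            --) shift
                APP_ARGS=$*      # everything left belongs to the application
                break ;;
            *)  echo "unknown argument: $1" >&2; return 2 ;;
        esac
    done
    # -d and -e are mandatory.
    [ -n "$WORKSPACE" ] && [ -n "$WRAPPER" ]
}

parse_start_args -d /mnt/data/ws-123 -e /opt/gria/wrappers/startJob \
    -r WallClockTime=60 PhysicalMemory=512 -- -i in.jpg -o out.jpg
echo "workspace=$WORKSPACE constraints=$CONSTRAINTS app_args=$APP_ARGS"
```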

Script functionality

The functionality of the start job script should include the following:

  • Parse and identify script arguments.
  • Identify the job workspace directory structure, check that specified staging directories exist as well as the specified input data.
  • Change directory to job workspace, at this point script log files can be generated and stored in job workspace directory.
  • Create the .job_submitted timestamp file, corresponding to label 1 in Figure 2, and store the startJob PID in it. This file can be used as a lock file to indicate that the job is in the SUBMIT state. You need to remove the lock, i.e. empty the contents of this file, on exit.
  • Analyse and compose the argument string for invoking the application wrapper, e.g. make sure that application wrapper, property files exist, etc.
  • Generate RM job description file e.g. job-$$.pbs for PBS. This is a resource manager dependent file that stores resource manager directives, instructs execution nodes to run the job, etc. Usually this file should include the following:
    • Create a resource management directives section, e.g. parse -r arguments, etc.
    • Change directory to working directory.
    • Touch the file .app_wrapper_started in job workspace which corresponds to point 2 in Figure 2.
    • Run the application wrapper using the composed argument string.
    • Store the application wrapper exit code in the file .app_wrapper_exit_code in the job workspace.
    • Touch the file .app_wrapper_ended in the job workspace, which corresponds to point 5 in Figure 2.
  • Submit job description file to RM and store the job ID number into .jobPID file in the job workspace directory. Job status scripts will read this file to find out the status of the job with that ID.
  • Remove the submit job lock, i.e. empty the contents of .job_submitted.
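The steps above can be sketched for the simplest case, local execution, where "submitting" a job just means running the generated job description file in the background. This is an illustrative shell sketch and not the supplied Perl script; submit_local, job.sh and the trivial demo wrapper are hypothetical.

```shell
#!/bin/sh
# Sketch only: submit_local, job.sh and the demo wrapper are hypothetical.
submit_local() {
    session=$1
    wrapper=$2
    shift 2
    cd "$session" || return 1
    echo $$ > .job_submitted               # timestamp and lock (label 1)

    # Job description file: here simply a shell script.
    cat > job.sh <<EOF
#!/bin/sh
cd "$session/work" || exit 1
touch ../.app_wrapper_started
"$wrapper" $*
echo \$? > ../.app_wrapper_exit_code
touch ../.app_wrapper_ended
EOF
    chmod +x job.sh

    # "Submit" the job by running it in the background; keep its ID.
    ./job.sh > job.out 2> job.err &
    echo $! > .jobPID
    : > .job_submitted                     # release the lock
}

# Demonstration with a trivial wrapper that always succeeds.
SESSION_DIR=$(mktemp -d)
mkdir "$SESSION_DIR/work"
printf '#!/bin/sh\nexit 0\n' > "$SESSION_DIR/wrapper.sh"
chmod +x "$SESSION_DIR/wrapper.sh"
submit_local "$SESSION_DIR" "$SESSION_DIR/wrapper.sh" -i in.jpg
wait                                       # let the background job finish
cat "$SESSION_DIR/.app_wrapper_exit_code"
```

For a real resource manager, the background invocation would be replaced by the submit command, and the job ID returned by that command would be stored in .jobPID.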

Return values

Return values are passed back to the Job Service middleware. A return value of 0 indicates that the script has successfully submitted the job; in any other condition the script should return a non-zero value.

Job Constraints

Job constraints are passed to platform scripts either from the Job Service using the -r argument or directly by the client user via a constraints XML file which the Job Service will store in the job session directory in a file called resources.xml. The startJob platform script should parse these constraints and translate them to resource manager directives. Typical resource constraints are expected to describe constraints about WallClockTime, CPUSpeed, PhysicalMemory, DiskSpace, etc.

See the job constraints page for further information.

4.5.1.3. Check Job

The checkJob.pl platform script
This document refers to the checkJob.pl platform script, not the application wrapper script of the same name.

Introduction

The purpose of the checkJob.pl platform script is to check the status of submitted GRIA jobs. This script can be invoked either through the job service administration web interface, or by GRIA users using SOAP.

The job status information is returned to job service via the standard output. Any other information, e.g. debugging, etc, should be written to standard error or log files. The return code of this script determines whether the captured job status report is valid or not.

The job status information is fed back to the end-user. The check job platform script will be invoked, and run locally, by the Job Service middleware and will need to access files in the job workspace. It will invoke the application-specific checkJob script. This script will examine application-derived files in the job workspace working directory and will try to report the current status of the application, e.g. 50% of the job remaining, computation phase 3 complete, etc.

The STDOUT of this script will generate a status report for the Job Service middleware.

Script functionality

The check job script should do the following:

  • Identify the workspace directory and change directory to it.
  • Invoke the app-specific status script.

Script API

The check job script interface will be invoked with the following command:

checkJob <application specific get status executable>
  • argument: specifies the application status wrapper script

Return values

  • 0 upon success
  • non-zero return number indicates error, STDERR and STDOUT will provide more information.

The status report is printed to STDOUT and used by the end-user, exit-codes are used by Job Service. Status debugging information is always written to STDERR.

Application-Specific Status Script Interface

This is an optional application specific script that should provide a qualitative measure of the running job. This script will run in the job working directory, e.g. work, and report its status on STDOUT.

The functionality of this script is application dependent. It should try to estimate the progress of the running job, e.g. by examining job input and output files, or by any other applicable technique. The generated report should be short for practical reasons.
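The interplay between the platform check script and the optional application status script might look roughly like this. The sketch is a shell illustration, not the supplied Perl script: check_job, the way the session directory is passed in, and the state names printed by the default report are all assumptions.

```shell
#!/bin/sh
# Sketch only: check_job and the state names it prints are hypothetical.
check_job() {
    session=$1
    status_script=$2
    cd "$session/work" || { echo "cannot enter $session/work" >&2; return 1; }
    if [ -n "$status_script" ]; then
        echo "running $status_script" >&2      # debugging goes to STDERR
        "$status_script"
        return $?
    fi
    # Default report based on the timestamp files in the session directory.
    if [ -f ../.app_ended ]; then
        echo "FINISHED"
    elif [ -f ../.app_started ]; then
        echo "RUNNING"
    else
        echo "QUEUED"
    fi
}

# Demonstration: first without, then with an application status script.
SESSION_DIR=$(mktemp -d)
mkdir "$SESSION_DIR/work"
touch "$SESSION_DIR/.app_started"
DEFAULT_REPORT=$(check_job "$SESSION_DIR" "")
printf '#!/bin/sh\necho "50%% of the job remaining"\n' > "$SESSION_DIR/checkJob"
chmod +x "$SESSION_DIR/checkJob"
CUSTOM_REPORT=$(check_job "$SESSION_DIR" "$SESSION_DIR/checkJob" 2>/dev/null)
echo "$DEFAULT_REPORT"
echo "$CUSTOM_REPORT"
```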

4.5.1.4. Kill Job

The killJob.pl platform script
This document refers to the killJob.pl platform script, not the application wrapper script of the same name.

Script API

The killJob.pl platform script is used by the Job Service to terminate the execution of a submitted job prematurely. The kill job script takes similar arguments to the status monitor script. The API of the kill job script is:

killJob
  -d <absolute path to workspace directory>
  -e <full path to application wrapper script>
  • Flag -d specifies the full path to job workspace directory, e.g. /mnt/data/ws-123
  • Flag -e specifies the application kill-wrapper script

When this script is invoked it identifies the submitted process ID, e.g. by reading the .jobPID file in the job session directory, and then issues a kill operation for that job when the job is in running state.

If no application wrapper kill script is provided, i.e. default mode, the platform script will try to kill the job at the resource manager level. This can be done by issuing a resource manager specific kill command, e.g. qdel for PBS or kill -9 for local execution. Subsequently, the resource manager will try to kill the application wrapper and all its child processes. As a result, both the application and the application wrapper will be terminated abruptly without preparing any output data, timestamp files or exit codes.

When an application wrapper script is provided via the -e argument, the killJob script will try to invoke that script instead. The application wrapper kill script will be invoked within the job working directory, e.g. work. In this case, the application kill wrapper is responsible for terminating the running job.

Upon successful job termination, the killJob platform script creates a .killed timestamp in the job session directory. The .killed timestamp will cause subsequent status calls to report the job status as KILLED.

The Application-Specific Kill Script

The application specific kill script allows an application to be killed in a certain way, e.g. using an application specific mechanism, and should operate within the job working space. This script is optional: if it is not present in the argument list, the platform script will try to kill the job at the resource manager level. When an application specific kill script is provided, the platform script will invoke it instead.

The return code of this script should be 0 upon success or any other value on error. The platform script uses this value to decide whether the kill operation succeeded or failed. The functionality of this script is application dependent and difficult to describe in a generic way. For some applications it might be as simple as creating a single file in the job workspace; in other cases the script might have to pass an appropriate signal to the application.

Return values

  • 0 upon success
  • non-zero return number indicates error, i.e. the state of the job should not change.

Failing to terminate a job successfully implies that the job will continue to occupy and use resources.
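For local execution, the behaviour described in this section could be sketched as follows. This is a shell illustration rather than the supplied Perl script; kill_job is a hypothetical helper, a background sleep stands in for a running job, and SIGTERM is used here where the text mentions kill -9.

```shell
#!/bin/sh
# Sketch only: kill_job is a hypothetical helper for local execution.
kill_job() {
    session=$1
    kill_script=$2
    if [ -n "$kill_script" ]; then
        # Application-specific kill script, run inside the work directory.
        ( cd "$session/work" && "$kill_script" ) || return 1
    else
        pid=$(cat "$session/.jobPID") || return 1
        kill -TERM "$pid" || return 1
    fi
    touch "$session/.killed"       # later status calls can report KILLED
}

# Demonstration: a background sleep stands in for a running job.
SESSION_DIR=$(mktemp -d)
mkdir "$SESSION_DIR/work"
sleep 60 > /dev/null 2>&1 &
echo $! > "$SESSION_DIR/.jobPID"
kill_job "$SESSION_DIR" ""
KILL_STATUS=$?
wait                                # reap the terminated stand-in job
echo "killJob returned $KILL_STATUS"
```

The .killed timestamp is only created when the kill operation reports success, so a failed attempt leaves the job state unchanged.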

4.5.2. The Supplied Platform Scripts

Description of the supplied platform scripts

4.5.2.1. Overview

Note: please refer to the Documentation Tutorial Integrating Resource Manager Systems with GRIA section for the latest platform script updates for PBS and Condor.

GRIA can use any resource management system via its platform script API. The GRIA distribution comes with pre-supplied example scripts for PBS and Condor resource managers.

The Job Service uses scripts to submit and manage jobs on an execution platform, allowing GRIA to make use of a wide range of computational resources including remote compute servers and clusters.

Service administrators can create their own platform scripts to interface with a given execution platform and configure the Job Service to use them. The Services come with scripts for the following execution platforms:

  • Portable Batch System (PBS)
  • Condor
  • Local execution

Although these scripts provide full functionality for GRIA they assume a very basic PBS or Condor RM configuration.

The functionality of these scripts should be expanded for customised resource manager configurations. The following sections identify the parts of the pre-supplied example scripts for PBS and Condor that are most likely to require customisation.

By default, it is recommended that the service installer configures the Job Service to use the local execution scripts, which means that all jobs will be run locally on the same machine that runs the services. The scripts for this do not need to be modified.

The PBS and Condor scripts do normally require some customisation, and details of how to do this (and how to test any of the supplied scripts) can be found in the following sections.

Users who need to create their own scripts to address other execution platforms should read about the platform model, and also see the instructions on the platform script interface.

4.5.2.2. PBS

How to use and configure the supplied PBS platform scripts

The following sections describe how to configure the pre-supplied platform scripts for PBS. These scripts work for a very basic PBS configuration, but they can easily be modified to suit many customised PBS configurations. The basic PBS testbed platform we used to develop and test these scripts had the following configuration:

  • All PBS daemons (pbs_server, pbs_sched, pbs_mom) and the GRIA services run on the same machine
  • There is a default PBS queue, e.g. dque
  • System users, e.g. the GRIA user (tomcat), can submit and run simple PBS jobs

PBS platform scripts can be easily customised in the following sections:

Submit Job: startJob.pl

This is a perl script which can submit GRIA jobs to PBS. Customisation of this script will require modifications to the following:

  • SECTION A: Initialise Resource Manager global vars, such as path for PBS binaries, PBS server name, etc. In particular make sure that the following variables are set up correctly:
    • RM_PATH=<PBS binary path>
    • RM_SERVER=<PBS server name>
  • SECTION B: Turn verbose debug flags on/off. This step is optional.
  • SECTION C: This section generates a job description file (JDF), which is the file submitted to PBS to run the job. This section of the script should be adequate for simple PBS configurations. You should edit this part of code if you want to change any of the default PBS directives or change the way jobs are submitted.
    The PBS JDF file has two main parts. The first describes all the PBS directives required to run the job; the second describes how to invoke the application wrapper. This section of the code should be edited only when you have to pass PBS directives other than the existing ones, or to parse RM directives passed with the -r arguments (see SECTION E below). The default directives used in this script are:
    #PBS -N J${SESSION_NAME}
    #PBS -o job.out
    #PBS -e job.err
    #PBS -l     cput=3600
    #PBS -q dque
    ${raString} # see SECTION E
    ...
    The second part of the file describes how to invoke the application wrapper, how to create time-stamp files, etc. This part of the code should cover a wide range of PBS configurations.
  • SECTION D: This section contains the PBS submit command. You should only need to edit it for customised PBS configurations that use multiple queues, PBS servers, etc. The example code in this section submits jobs to the default queue in the PBS server defined in SECTION A, e.g.
    # compose submit command to the default queue 
    my $command_line="$RM_SUBMIT -q \@$RM_SERVER $JDF";
    
    # execute the submit command and store submission job ID
    my $sub = 0xffff & system "$command_line > $JOB_PID";
  • SECTION E: This subroutine should parse command line arguments for the RM. It should return a text string with valid PBS directives that is inserted into the PBS directives section of the JDF file in place of ${raString}. The current implementation of this subroutine returns an empty string. However, if you intend to pass RM directives dynamically using the -r command line arguments, you should parse them in this subroutine and return them as a PBS directive string, e.g.
    ...
    #PBS -l cput=2300
    #PBS -l 2
    ...
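As a hedged illustration of what such a translation might look like, the shell sketch below turns name=value constraints into PBS directive lines. It is not the Perl subroutine from the supplied script: constraints_to_pbs is a hypothetical name, and the units assumed for the constraint values (seconds for WallClockTime, megabytes for PhysicalMemory) are assumptions that should be checked against the job constraints page.

```shell
#!/bin/sh
# Sketch only: constraints_to_pbs is hypothetical, and the units assumed
# here (seconds for WallClockTime, megabytes for PhysicalMemory) are
# assumptions, not documented GRIA behaviour.
constraints_to_pbs() {
    for pair in "$@"; do
        name=${pair%%=*}
        value=${pair#*=}
        case $name in
            WallClockTime)  echo "#PBS -l walltime=$value" ;;
            PhysicalMemory) echo "#PBS -l mem=${value}mb" ;;
            *) echo "ignoring unsupported constraint: $name" >&2 ;;
        esac
    done
}

RA_STRING=$(constraints_to_pbs WallClockTime=3600 PhysicalMemory=512 DiskSpace=10)
echo "$RA_STRING"
```

Constraints that the sketch does not recognise are reported on standard error and left out of the directive string.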

Check Job: getJobStatus.pl

This is a perl script that checks and reports to GRIA the status of a PBS job. For most PBS configurations the editing of this script should be minimal:

  • SECTION A: Initialise Resource Manager global vars, such as path for PBS binaries, PBS server name, etc. In particular make sure that the following variables are set up correctly:
    • RM_PATH=<PBS binary path>
    • RM_SERVER=<PBS server>
  • SECTION B: Turn verbose debug flags on/off. This step is optional.
  • SECTION C: The first part of this section reads the status of the PBS job. Depending on your PBS configuration, you may have to edit the code that extracts the job status from the qstat output, i.e. the index of the field that holds the status column (S).
    my $qString = `${RM_QUEUE} | grep $concatPID`;
    my @words = &quotewords('\s+', 0, $qString);
    my $jStatus = $words[9];
    You do not need to change this section unless the output format of qstat differs from the following:
    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    74.siegerrebe    pm               tomcate           00:30 0 R dque

Kill Job: killJob

This is a perl script for terminating PBS jobs. The following parts of the code may need editing:

  • SECTION A: Initialise Resource Manager global vars, such as path for PBS binaries, PBS server name, etc. In particular make sure that the following variables are set up correctly:
    • RM_PATH=<PBS binary path>
    • RM_SERVER=<PBS server>
  • SECTION B: Turn verbose debug flags on/off. This step is optional.
  • SECTION C: The first part of this section reads the status of the PBS job. Depending on your PBS configuration, you may have to edit the code that extracts the job status from the qstat output, i.e. the index of the field that holds the status column (S).
    my $qString = `${RM_QUEUE} | grep $concatPID`;
    my @words = &quotewords('\s+', 0, $qString);
    my $jStatus = $words[9];
    You do not need to change this section unless the output format of qstat differs from the following:
    Job id           Name             User             Time Use S Queue
    ---------------- ---------------- ---------------- -------- - -----
    74.siegerrebe    pm               tomcate           00:30 0 R dque

4.5.2.3. Condor

How to use and configure the supplied Condor platform scripts

The GRIA pre-supplied platform scripts for Condor systems provide identical functionality to the PBS platform scripts. These scripts work on a very basic Condor configuration. The basic Condor testbed platform we used had the following configuration:

  • All condor and GRIA services run on the same system
  • Condor default values used
  • System users, e.g. the GRIA user (tomcat), can submit and run simple Condor jobs

Condor platform scripts can be easily customised in the following sections:

Submit Job: startJob.pl

This is a perl script to submit GRIA jobs in a Condor pool.

  • SECTION A: Initialise Resource Manager global vars, such as path for Condor binaries, master server name, etc. In particular make sure that the following variables are set up correctly:
    • RM_PATH=<Condor binary path>
    • RM_SERVER=<Condor server>
  • SECTION B: Turn verbose debug flags on/off. This step is optional.
  • SECTION C: Generate a job description file (JDF), which is the file submitted to Condor to run the job. The Condor JDF file includes all the Condor directives required to run the job, and the job itself is described as frame. Resource manager directives passed as command line arguments should be processed in SECTION E and appended to the end of the JDF. The default Condor directives section in this script includes:
    universe        = vanilla
    executable      = frame
    arguments       = $aRG
    shell           = /bin/bash
    error           = $JOB_ERR
    log             = job.log
    output          = $JOB_OUT
    should_transfer_files = IF_NEEDED
    when_to_transfer_output = ON_EXIT
    queue
    $raString    # see SECTION E
    You should edit this section if your condor configuration requires different directives.
    The frame script (see note 1 below) is a simple shell script that Condor has to run for every job submitted through GRIA. It changes the working directory, invokes the application wrapper and generates the time-stamp files before and after the execution of the application wrapper, e.g.
    #!/bin/bash
    cd $SESSION_DIR/$WORK_DIR
    touch ../$APP_WRAPPER_STARTED_TS
    ${EXE_WRAPPER} $aRG
    echo \$? > ../.app_wrapper_exit_code
    touch ../$APP_WRAPPER_ENDED_TS
    In most cases you should not need to change the frame code.
  • SECTION D: This section contains the condor submit command:
    # compose the submit argument 
    my $command_line="$RM_SUBMIT $JDF";
    
    # execute condor submit, store job ID
    my $sub = 0xffff & system "$command_line > $JOB_PID";
    The expected return of the condor submit command usually is similar to:
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 25.
    You should only change this part of the code if you use a customised condor submission command.
  • SECTION E: This subroutine should parse command line arguments for the RM. For the Condor system it should return a text string with valid Condor directives that will be attached to the JDF file in place of ${raString}, e.g.
    Requirements = Arch =="INTEL" && OpSys == "Linux" && Memory > 20
    Rank = (Memory > 32)*((Memory * 100) + (IsDedicated * 10000) + Mips)
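The submit output shown in SECTION D contains the cluster number that identifies the job. How the supplied scripts derive the job ID from that output is not shown here; as one possible approach, the following shell sketch extracts the number with sed. The helper name extract_cluster_id is hypothetical.

```shell
#!/bin/sh
# Sketch only: extract_cluster_id is a hypothetical helper.
extract_cluster_id() {
    # Print the number that follows "submitted to cluster".
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}

# Sample output copied from the text above.
SUBMIT_OUTPUT='Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 25.'

CLUSTER_ID=$(printf '%s\n' "$SUBMIT_OUTPUT" | extract_cluster_id)
echo "$CLUSTER_ID"
```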

Check Job: getJobStatus

This is a perl script that reports the status of a Condor job. Customisation of the code should take place in the following sections:

  • SECTION A: Initialise Resource Manager global vars, such as path for Condor binaries, master server name, etc. In particular make sure that the following variables are set up correctly:
    • RM_PATH=<Condor binary path>
    • RM_SERVER=<Condor server>
  • SECTION B: Turn verbose debug flags on/off. This step is optional.
  • SECTION C: This section reads the condor_q command output which typically should be similar to:
    -- Submitter: siegerrebe.it-innovation.soton.ac.uk : <xxx.xxx.xxx.xxx:42239> : siegerrebe.it-innovation.soton.ac.uk
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
      58.0   tomcat          7/11 14:46   0+00:00:00 R  0   0.0  frame -i ../staged
    In this example the job status is reported in the 6th whitespace-separated field (the ST column):
    my $qString = `${RM_QUEUE} $PID | grep $PID`;
    my @words = &quotewords('\s+', 0, $qString);
    my $jStatus = $words[6];
    You should only change this part of the code if you intend to use a customised format of the condor_q command.
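
To check the field position before editing Section C, the same extraction can be tried from the shell. The sketch below uses the sample line from above and assumes the default condor_q layout, where SUBMITTED is printed as separate date and time fields. The helpers job_state and describe_state are hypothetical. The letter meanings are the usual condor_q status codes, and how the supplied script maps them to GRIA job states is not shown here.

```shell
#!/bin/bash
# Sketch only: pull the ST column out of one condor_q job line.
# job_state and describe_state are hypothetical helpers.
job_state() {
    # awk ignores leading blanks, so ST is field 6 in the default layout.
    awk '{ print $6 }'
}

describe_state() {
    # Usual condor_q status letters; anything else is reported as unknown.
    case "$1" in
        I) echo "idle" ;;
        R) echo "running" ;;
        H) echo "held" ;;
        C) echo "completed" ;;
        X) echo "removed" ;;
        *) echo "unknown" ;;
    esac
}

# Sample job line copied from the guide.
line='  58.0   tomcat          7/11 14:46   0+00:00:00 R  0   0.0  frame -i ../staged'
state=$(printf '%s\n' "$line" | job_state)
echo "$state ($(describe_state "$state"))"
```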

Kill Job: killJob

This is a perl script that terminates condor jobs. The following parts of the code may need editing:

  • SECTION A: Initialise Resource Manager global vars, such as path for Condor binaries, master server name, etc. In particular make sure that the following variables are set up correctly:
    • RM_PATH=<Condor binary path>
    • RM_SERVER=<Condor server>
  • SECTION B: Turn verbose debug flags on/off. This step is optional.
  • SECTION C: This section reads the condor_q command output to determine the state of the condor job. The output should typically be similar to:
    -- Submitter: siegerrebe.it-innovation.soton.ac.uk : <xxx.xxx.xxx.xxx:42239> : siegerrebe.it-innovation.soton.ac.uk
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD 
      58.0   tomcat          7/11 14:46   0+00:00:00 R  0   0.0  frame -i ../staged
    In this example the job status is reported in the 6th whitespace-separated field (the ST column):
    my $qString = `${RM_QUEUE} $PID | grep $PID`;
    my @words = &quotewords('\s+', 0, $qString);
    my $jStatus = $words[6];
    You should only change this part of the code if you intend to use a customised format of the condor_q command.

 

1 Submitting a simple shell script to a resource manager instead of the real application itself can sometimes cause problems, e.g. in advanced configurations that run an application in parallel mode. In such cases it is advisable to move the necessary functionality either into the application wrapper or up to the resource manager, e.g. into the prologue and epilogue parts in PBS.

4.5.2.4. Testing the Platform Scripts

How to test the platform scripts

This section describes how to test platform scripts after they have been installed and configured, using the pre-supplied test application. The details of command lines, etc, are specific to the test application only.

All platform scripts should be able to run as stand-alone applications from a command line. Before running any of the tests, make sure that:

  1. the test application is installed and configured properly, e.g. properties files, etc
  2. you can submit and run jobs successfully via the cluster resource manager (e.g. PBS) if you are using one
  3. the scratch directory you are using, e.g. /scratch, is accessible by both front-end system and compute nodes.
  • Create a temporary workspace directory:
    $ mkdir -p /scratch/testgrid/work/{inputs,outputs}
    $ cd /scratch/testgrid
    
  • Copy an image file to the job workspace, e.g.
    $ cp some_demo.jpg /scratch/testgrid/work/inputs/input-0

Running a Job

By default the scripts are located under WRAPPERS_DIRECTORY/platform. To test the startJob script, use the following command (N.B. directory locations may vary depending on where the scripts have been copied to):

$ /opt/gria-platform-scripts/rm_local/startJob -v -d /scratch/testgrid -e /opt/tutorial-apps/swirl/startJob

This will run the test application and store the output results in the outputs subdirectory. The working subdirectory is set to work. The response should look like:

Session directory: 1149159584
Job submitted successfully

Checking the Job Status

From the command line type:

$ /opt/gria-platform-scripts/rm_local/getJobStatus -e /opt/tutorial-apps/swirl/checkJob -d /scratch/testgrid

A typical response looks like:

DEBUG Use session directory: /scratch/testgrid
DEBUG   Use application status script:

 +----------------------------------------------------+
 |                                                    |
 |   GRIA  Job getStatus wrapper ($Revision: 4190 $)  |
 |                                                    |
 +----------------------------------------------------+
 
 Resource Manager...: Local execution
 Check job status time....: Thu Jun  1 11:59:44 2006
 DEBUG   Using session name: testgrid
 DEBUG   Using concat session name: testgrid
 DEBUG   JOB_PID file found!
 Local job ID.........: 2286
 
 DEBUG   Detected platform: Unix
 DEBUG   qString: pagis     2286     1  0 11:59 pts/2    00:00:00 perl /scratch/testgrid/jdf.pl
 pagis     2289  2286  0 11:59 pts/2    00:00:00 /usr/bin/perl /opt/tutorial-apps/paint/startJob.pl
 
 JOB_STATUS            RUNNING
 JOB_SUBMITTED         1149159584000
 APP_WRAPPER_STARTED   1149159584000
 
 <------------------------>
 JOB_STATUS            RUNNING
 JOB_SUBMITTED         1149159584000
 APP_WRAPPER_STARTED   1149159584000
 
 <------------------------>
 Appliction specific status not available
 Check job status exit code: 0

Note: getJobStatus writes the job status report for the Job Service to STDOUT and writes debugging information to STDERR. The two output streams will be mixed on the screen unless you redirect one to a separate file, e.g. 2>status.err.
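
The redirection described in the note can be sketched as below. So that the example runs without GRIA installed, fake_status is a hypothetical stand-in that prints a few lines of the sample output above to the two streams. The file names status.out and status.err are also only illustrative.

```shell
#!/bin/bash
# Sketch only: fake_status is a hypothetical stand-in for the status script.
# It writes the report to STDOUT and debug lines to STDERR.
fake_status() {
    echo "DEBUG   JOB_PID file found!" >&2
    echo "JOB_STATUS            RUNNING"
    echo "JOB_SUBMITTED         1149159584000"
}

tmp=$(mktemp -d)

# Keep the report and the debug output in separate files.
fake_status >"$tmp/status.out" 2>"$tmp/status.err"

# Read the status value from the clean report.
job_status=$(awk '$1 == "JOB_STATUS" { print $2 }' "$tmp/status.out")
echo "Job status: $job_status"
```

The same two redirections can be appended to the real command line shown earlier in this section.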

Killing a Job

From the command line, type:

$ /opt/gria-platform-scripts/rm_local/killJob -v -d /scratch/testgrid

The response from this command will be similar to:

Try to kill the job
Use session directory: /scratch/testgrid
 Use application specific kill script:
 killJob ver: 5.0.0
 +--------------------------------------------------+
 |                                                  |
 |   GRIA  Job killJob wrapper ($Revision: 4190 )   |
 |                                                  |
 +--------------------------------------------------+
 Thu Jun  1 12:00:00 2006
 session name: testgrid
 DEBUG   JOB_PID file found!
 DEBUG   PID: 2286, 2286, 2286
 DEBUG   Detected platform: unix
 DEBUG    qString: <>
 DEBUG   Job is not found in Q
 Job is not running, it cannot be killed because it has already finished
 Kill job exit code: 0

After a job has completed successfully the workspace directory will have a directory structure similar to:

testgrid/
|-- .app_wrapper_ended
|-- .app_wrapper_exit_code
|-- .app_wrapper_started
|-- .jobPID
|-- .job_submitted
|-- jdf.pl
|-- log
|-- resources.xml
`-- work
    |-- image.jpg
    |-- inputs
    |   `-- input-0
    `-- outputs
        `-- output-0
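
The marker files in this listing can be used to check a finished job by hand. The sketch below assumes, based on the frame script shown earlier, that .app_wrapper_ended is created only after the application wrapper returns and that .app_wrapper_exit_code holds its exit status. The helper session_exit_code is hypothetical, and the example builds a throwaway directory rather than reading a real session.

```shell
#!/bin/bash
# Sketch only: session_exit_code is a hypothetical helper that inspects
# the marker files of a job session directory.
session_exit_code() {
    local dir=$1
    # No end marker means the wrapper has not returned yet.
    if [ ! -e "$dir/.app_wrapper_ended" ]; then
        echo "not finished"
        return 1
    fi
    # The exit code file holds the wrapper's exit status.
    cat "$dir/.app_wrapper_exit_code"
}

# Build a throwaway session directory to demonstrate the check.
demo=$(mktemp -d)
touch "$demo/.app_wrapper_started"
echo 0 > "$demo/.app_wrapper_exit_code"
touch "$demo/.app_wrapper_ended"

echo "Wrapper exit code: $(session_exit_code "$demo")"
```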

4.5.3. Job Constraints

The Job Service Constraints

Job Constraints

Job constraints are a new, experimental feature that does not currently integrate with the SLA Management Service.

Job constraints reach the platform scripts in one of two ways: statically from the Job Service, using the -r command line argument, or directly from the client user (using the client API) in a constraints XML file, which the Job Service stores in the job session directory as resources.xml. The startJob platform script should parse these constraints and translate them into resource manager directives. Typical job service constraints describe resources such as WallClockTime, CPUSpeed, PhysicalMemory and DiskSpace.

Constraints passed from the Job Service as command line arguments should follow the form of name=value pairs, for example -r CPUSpeed=1800 (in MHz for a CPU speed constraint), or -r WallClockTime=3600 for a runtime constraint of an hour.
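
As an illustration of the name=value form, the following sketch splits -r arguments into a name and a value. It is not the parser used by the supplied startJob scripts. The helper parse_constraints is hypothetical, and all other arguments are simply skipped.

```shell
#!/bin/bash
# Sketch only: split "-r name=value" arguments into name and value.
# parse_constraints is a hypothetical helper, not the GRIA startJob parser.
parse_constraints() {
    while [ $# -gt 0 ]; do
        case "$1" in
            -r)
                # Stop if -r has no value after it.
                [ $# -ge 2 ] || break
                # ${2%%=*} is the text before the first "=",
                # ${2#*=} is the text after it.
                echo "${2%%=*} ${2#*=}"
                shift 2
                ;;
            *)
                # Ignore arguments that are not constraints.
                shift
                ;;
        esac
    done
}

parse_constraints -r CPUSpeed=1800 -r WallClockTime=3600
```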

Job Service providers have to specify and advertise which job constraints a user is allowed to apply to a job run. This can be done easily with an appropriate XML schema. Client users can then submit jobs along with their constraints file. A simple user-supplied constraints file could be:

<?xml version="1.0"?>
<Resources>
        <CPUArchitecture>amd64</CPUArchitecture>
        <CPUSpeed>180</CPUSpeed>
        <PhysicalMemory>1024</PhysicalMemory>
        <!--DiskSpace>40</DiskSpace-->
        <WallClockTime>210</WallClockTime>
        <IndividualCPUCount>1</IndividualCPUCount>
        <TotalCPUCount>1</TotalCPUCount>
        <FileSizeLimit>200</FileSizeLimit>
</Resources>
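
Because a constraints file of this kind is flat and has no attributes, a single value can be read with standard text tools when testing by hand. The sketch below is a stand-in for a proper XML parser. It assumes one element per line, and the helper get_constraint is hypothetical. Note that the commented-out DiskSpace element is not matched.

```shell
#!/bin/bash
# Sketch only: read one element value from a flat constraints file.
# get_constraint is a hypothetical helper; it assumes one element per
# line and no attributes, and it is not a general XML parser.
get_constraint() {
    local file=$1 name=$2
    sed -n "s|.*<$name>\([^<]*\)</$name>.*|\1|p" "$file"
}

# A shortened copy of the example constraints file.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
<?xml version="1.0"?>
<Resources>
        <PhysicalMemory>1024</PhysicalMemory>
        <!--DiskSpace>40</DiskSpace-->
        <WallClockTime>210</WallClockTime>
</Resources>
EOF

echo "WallClockTime=$(get_constraint "$demo_file" WallClockTime)"
echo "DiskSpace=$(get_constraint "$demo_file" DiskSpace)"
```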

A suitable XML schema for these constraints could be:

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:annotation>
    <xsd:documentation xml:lang="en">
     GRIA resource schema
    </xsd:documentation>
  </xsd:annotation>

  <xsd:element name="Resources" type="ResourcesType"/>

  <xsd:complexType name="ResourcesType">
    <xsd:sequence>
      <xsd:element name="Comment" type="xsd:string" minOccurs="0"/>
      <xsd:element name="CPUArchitecture" type="cpuarchitecture" minOccurs="0" default="x86"/>
      <xsd:element name="CPUSpeed" type="xsd:int" minOccurs="0"/>
      <xsd:element name="PhysicalMemory" type="xsd:int" minOccurs="0"/>
      <xsd:element name="DiskSpace" type="xsd:int" minOccurs="0"/>
      <xsd:element name="WallClockTime" type="xsd:long"/>
      <xsd:element name="IndividualCPUCount" type="xsd:int" minOccurs="0" default="1"/>
      <xsd:element name="TotalCPUCount" type="xsd:int" minOccurs="0" default="1"/>
      <xsd:element name="FileSizeLimit" type="xsd:long" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:simpleType name="cpuarchitecture">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="x86"/>
      <xsd:enumeration value="ia64"/>
      <xsd:enumeration value="amd64"/>
      <xsd:enumeration value="sparc"/>
      <xsd:enumeration value="other"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>

Supported Constraints in the Supplied Platform Scripts

The platform (startJob) scripts supplied with GRIA implement the following job constraints:

WallClockTime
The maximum time, in seconds, that a job can run. If the job service and the user both specify this constraint, the minimum of the two is taken.
PhysicalMemory
The minimum amount of required physical memory in MB. If the job service and the user both specify this constraint, the maximum of the two is taken.
CPUSpeed
The minimum CPU speed required in MHz. If the job service and the user both specify this constraint, the maximum of the two is taken.
DiskSpace
The minimum amount of available disk space required in MB. If the job service and the user both specify this constraint, the maximum of the two is taken.
OSName
If the job service and the user both specify this constraint, the Job Service value overrides the user-supplied one.
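
The combination rules above can be summarised in a short sketch. It only restates the rules listed here and is not taken from the supplied scripts. The helper merge_constraint is hypothetical and expects integer values for the numeric constraints.

```shell
#!/bin/bash
# Sketch only: combine a Job Service value with a user value for one
# constraint. merge_constraint is a hypothetical helper.
# Usage: merge_constraint NAME SERVICE_VALUE USER_VALUE
merge_constraint() {
    local name=$1 service=$2 user=$3
    case "$name" in
        WallClockTime)
            # An upper limit: the smaller value wins.
            if [ "$service" -le "$user" ]; then echo "$service"; else echo "$user"; fi
            ;;
        PhysicalMemory|CPUSpeed|DiskSpace)
            # A minimum requirement: the larger value wins.
            if [ "$service" -ge "$user" ]; then echo "$service"; else echo "$user"; fi
            ;;
        OSName)
            # The Job Service value replaces the user value.
            echo "$service"
            ;;
        *)
            echo "unsupported constraint: $name" >&2
            return 1
            ;;
    esac
}

merge_constraint WallClockTime 3600 210
merge_constraint PhysicalMemory 512 1024
merge_constraint OSName Linux Windows
```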

Note: The GRIA pre-supplied scripts use the XML::Simple perl module to handle the user constraints XML file, so only simple XML documents without attributes are supported.

The following table shows which constraints are implemented with the GRIA pre-supplied platform scripts.

Constraint      Unit      Local execution                    PBS  Condor
WallClockTime   sec       OK                                 OK   OK
PhysicalMemory  MB        -                                  OK   OK
CPUSpeed        MHz       OK (req. perl win32::Info for XP)  -    -
OSName          <string>  -                                  OK   OK
DiskSpace       MB        -                                  -    OK

Depending on the platform capabilities, Job Service providers should customise section E of the startJob platform script accordingly.
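
As a rough idea of what such a customisation produces for Condor, the sketch below builds a Requirements line from a memory and an operating system constraint. The attribute names Memory and OpSys are taken from the Requirements example in the startJob description. The >= comparison, the helper name condor_requirements and the choice of only two constraints are assumptions made for illustration, and the supplied scripts may generate different directives.

```shell
#!/bin/bash
# Sketch only: build a Condor Requirements line from two constraints.
# condor_requirements is a hypothetical helper; the ">=" comparison is
# an assumption, not taken from the supplied scripts.
# Usage: condor_requirements MEMORY_MB OS_NAME (either may be empty)
condor_requirements() {
    local mem_mb=$1 os_name=$2
    local expr=""
    if [ -n "$mem_mb" ]; then
        expr="Memory >= $mem_mb"
    fi
    if [ -n "$os_name" ]; then
        # Join the two conditions only when both are present.
        if [ -n "$expr" ]; then
            expr="$expr && "
        fi
        expr="${expr}OpSys == \"$os_name\""
    fi
    # Print nothing when no constraint was given.
    if [ -n "$expr" ]; then
        echo "Requirements = $expr"
    fi
}

condor_requirements 1024 Linux
```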