
Basic Application Services User Guide


This guide describes how to use the Basic Application Services package to provide data storage and processing (using applications installed on a cluster) to trusted users.

1. Overview

Overview of the Basic Application Services

The GRIA Basic Application Services package provides the core functionality for job and data management. It consists of:

A Data Service
This allows remote users to upload and download data files to the service provider, and to transfer data between Data Services hosted by different service providers. The Data Service also supports management of access rights (for read or read-write access) granted to other users or service providers.
A Job Service
This allows remote users to start, monitor or kill computational jobs, executed by the service provider. The Job Service will fetch input from and write output to a local Data Service. The Job Service can be configured to support multiple applications, which are chosen by the service provider.

The application services can be configured to be either unmanaged (free) or managed by the GRIA Service Provider Management package, as in the diagram below.

Highlighting the Basic Applications Package in the GRIA Architecture


2. Installation

GRIA Basic Application Services Installation

Standard Installation Procedure

The Basic Application Services package is provided as a zip file (Windows) or tar.gz (Linux). Unpack the archive and you will find the following items:

  • docs (folder)
  • gria-basic-app-services.war
  • README.html

Install the war file according to the Service Installation Manual. Once the initial configuration has been completed, the Basic Applications Package requires some extra configuration.

Additional Pre-requisites

Some additional pieces of software need to be installed, as follows.

Python

Python 2.4 or newer is required to run the Basic Application Services correctly. The easiest way to check which version of Python is installed on your system is to run the following command from a terminal:

python -V
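The same check can be made from inside a script, which is useful if a wrapper script should fail early on an interpreter that is too old. This is a minimal sketch only; the helper name meets_minimum is hypothetical and is not part of GRIA.

```python
import sys

# Minimum interpreter version quoted in this guide.
MINIMUM = (2, 4)


def meets_minimum(version_info=None):
    """Return True if the (major, minor, ...) version is at least MINIMUM."""
    if version_info is None:
        # Default to the interpreter that is running this script.
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MINIMUM
```

A script could call meets_minimum() at start-up and exit with an error message if it returns False.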

Most Linux distributions come with a suitable version of Python installed. For Windows, the most common Python implementations are Cygwin Python and ActiveState Python. We recommend ActiveState Python: choose the latest release, click the "Next" button, and then download the MSI distribution for Windows. You must restart Windows to complete the installation.

N.B. Some Python implementations for Windows do not include the win32api package by default. In this case, the Basic Application Services scripts will fail with the error "No module named win32file". The easiest way to resolve the problem is to install the win32api package.

Mac OS X 10.4 ships with an old version of Python (2.3), but GRIA requires 2.4 or later. If you already have MacPorts installed on your system, simply type:

  sudo port install python2.4

Then change the link /usr/bin/python to point to the python2.4 binary. Alternatively, you can download and install the binary distribution.

Test Application: ImageMagick

ImageMagick is the default test application used in GRIA. ImageMagick binaries for Windows can be downloaded from here. Use a Q8 version, e.g. ImageMagick-6.4.2-x-Q8-windows-dll.exe.

The ImageMagick distribution package for Windows is self-extracting and the installation procedure starts automatically. Follow the instructions and select the default options.

Note: older versions of ImageMagick might not support all of the basic application service's default examples, e.g. blend.

For Mac OS X systems you can download and install the binary distribution. Alternatively, if you already have MacPorts installed on your system, simply type:

  sudo port install ImageMagick

Continuing the Installation

Once the additional pre-requisites have been installed, the installation can be continued by following the instructions for each service.

3. The Data Service

Overview and Configuration of the Data Service

Overview

The GRIA Data Service is used to manage "data stagers". A data stager is a container for a single file (or zip file). It has a unique identifier and an access control system for determining who can read and write the data. Clients can use the service to create new stagers, upload and download data, transfer data between stagers, and control others' access to the data.

Configuration

Two items of configuration must be given before the Data Service can be used:

The location of the root data directory
The service stores any uploaded data inside this directory. If the Data Service is going to be used with a Job Service and jobs are going to execute on a cluster then the cluster's machines need to be able to read and write to this directory.
A list of trusted management services
Normally, you can just click Add to accept the default management service. This is the SLA management service from the GRIA Service Provider Management package. Note that if the GRIA Basic Application Services package is deployed on a different machine to the GRIA Service Provider Management package, some additional access control setup is required. This is described in the Links with Other Services section of the Service Provider Management user guide. As an alternative to configuring the service to be managed, you can make it unmanaged (or "free") by clicking the Make service free button.

Enabling REST data transfer (optional)

Data is normally transferred using the SOAP-with-attachments protocol. However, this has a couple of limitations:

  • Many programs do not support this protocol.
  • SOAP requires the signature to be sent before the data. However, calculating the signature requires processing all the data first. On a fast network, this can roughly double the transfer time.

To solve these problems, the GRIA data service can be configured to allow downloading, uploading and deletion using the standard HTTP methods GET, PUT and DELETE.

In this case, the access control decision is made using the HTTPS transport-layer security credentials rather than the SOAP message-layer (WS-Security) credentials. Therefore, your server must be configured to request client authentication at the transport layer.

Also, GRIA services check roots of trust (i.e. trusted certificate authorities) on a per-rule basis, not by having a static set of trusted CAs. Therefore, you should disable certificate trust validation in your container.


Apache

The easiest way to configure this is to front your GRIA services with the Apache web-server. Use this option to request client certificates but leave trust validation to GRIA:
SSLVerifyClient optional_no_ca

N.B. If you enable this option, you must comment out the trusted certificate authority option, i.e. SSLCertificateChainFile.

Then, go to your service administration page and click on Endpoints configuration. Change the port from the Tomcat port (usually 8443) to the Apache port (usually 443).


Testing


You should now be able to upload and download data using any HTTPS-capable application, such as "curl".

The URL to use for transfers is of the form <webapp>/data-stager/<stager-ID>. For example, to upload file-to-upload to data stager ff808181-152215bf-0115-221982b5-0002:
curl --cert me.pem -T file-to-upload \
https://example.com/gria-basic-app-services/data-stager/ff808181-152215bf-0115-221982b5-0002

To get a pem file with your private key and certificate from a PKCS#12 file (e.g. one exported from your keystore using KeyToolGUI):

openssl pkcs12 -in me.p12 -out me.pem -clcerts

See this FAQ for more detail. The URL can also be found in the metadata section of the data stager's EPR, using the getRestURL method.
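The curl upload above can also be scripted. The following is a minimal sketch written for Python 3 (not the Python 2.4 that the services themselves require), using only the standard library. The function names stager_url and upload are hypothetical, and the sketch assumes the PEM file produced by the openssl command above and a server certificate that the client's default trust store accepts.

```python
import http.client
import os
import ssl
from urllib.parse import urlsplit


def stager_url(webapp_url, stager_id):
    """Build <webapp>/data-stager/<stager-ID> from the web application URL."""
    return webapp_url.rstrip("/") + "/data-stager/" + stager_id


def upload(url, local_path, pem_path):
    """PUT a local file to a data stager and return the HTTP status code."""
    parts = urlsplit(url)
    context = ssl.create_default_context()
    # The PEM file holds both the private key and the client certificate.
    context.load_cert_chain(pem_path)
    conn = http.client.HTTPSConnection(parts.hostname, parts.port or 443,
                                       context=context)
    try:
        with open(local_path, "rb") as body:
            # Send an explicit length so the file is not sent chunked.
            headers = {"Content-Length": str(os.path.getsize(local_path))}
            conn.request("PUT", parts.path, body=body, headers=headers)
            return conn.getresponse().status
    finally:
        conn.close()
```

For example, upload(stager_url("https://example.com/gria-basic-app-services", "ff808181-152215bf-0115-221982b5-0002"), "file-to-upload", "me.pem") would attempt the same transfer as the curl command.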

4. The Job Service

The Job Service

4.1. Overview

Overview of the GRIA Job Service

The GRIA Job Service is used to manage jobs. Clients can use the service to create new jobs, upload input data, start the job, monitor progress, and download results.

Each input and output of a job is actually a data stager managed by the local Data Service (the one in the same .war as the Job Service). Therefore, you must configure the Data Service before the Job Service can be used. Users can run jobs that take input from or send output to other Data Services by using the normal data transfer features provided by the Data Service.

Job Service Architecture

The GRIA Job Service architecture is flexible enough to run jobs on a variety of underlying computing platforms, from single computers to clusters of workstations or even supercomputers. To achieve this flexibility, the GRIA Job Service accesses resources indirectly via its RM connector scripts, which decouple the GRIA Job Service from resource managers and applications. The following sections of this document give an overall picture of the various components of the GRIA Job Service, and then describe some common deployment scenarios.

Components of the GRIA Job Service

Figure 1. Overview of the GRIA Job Service architecture

The GRIA Job Service is separated into several distinct components: (colours relate to the above diagram)

  1. The Resource Manager Connectors - The GRIA Job Service can submit jobs to different resource managers such as TorquePBS and Condor, or even run them on the local machine using the LocalExecution plugin. It is able to do this thanks to the Resource Manager Connector layer - a plugin architecture written in Python that abstracts away resource manager-specific details and presents a single interface for submitting and monitoring jobs. Service providers can configure the Job Service to use the existing Resource Manager Plugins or they can write their own to interface with custom configurations.

    Version 5.2 or newer of the GRIA Job Service can have any number of RM Connector Plugins loaded at the same time. The plugin to use for an individual job is selected in a series of steps, as explained in The Selection Process chapter.

  2. The Application Wrapper Scripts - Each application deployed on the GRIA Job Service needs to have a couple of small wrapper scripts installed alongside it. These scripts are responsible for providing the application with the correct files from the shared filesystem, and for making sure the outputs from the application are written or copied back to the correct location. Optional wrapper scripts can also be written to cancel a job gracefully and to report progress and usage information specific to that application (e.g. frames rendered by a graphics package).

  3. The Shared Filesystem - When the user creates a job, the GRIA Job Service creates a directory for it on the shared filesystem. The administrator should ensure that this directory can be read from and written to by both the Job Service running on the server, and the application wrapper scripts running on the execution nodes. The structure of this scratch directory is as follows:

    • logsys - Log file for the system administrator. Contains information about the RM connector plugins and resource constraints.
    • loguser - Log file for the user. Contains the stdout and stderr from the job executable.
    • work/ - Working directory in which the job executes.
    • work/inputs/ - Directory containing the named inputs for the job.
    • work/outputs/ - Directory to which the application wrapper scripts should write the job's outputs.
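When testing wrapper scripts away from a live Job Service, it can help to reproduce this layout by hand. The following is a minimal sketch in Python 3; the function name create_scratch is hypothetical, and in a real deployment these directories are created by the Job Service itself.

```python
import os


def create_scratch(root, job_id):
    """Create an empty job scratch directory with the layout listed above."""
    job_dir = os.path.join(root, job_id)
    os.makedirs(os.path.join(job_dir, "work", "inputs"))
    os.makedirs(os.path.join(job_dir, "work", "outputs"))
    for log_name in ("logsys", "loguser"):
        # Create the two log files empty; the Job Service normally fills them.
        open(os.path.join(job_dir, log_name), "a").close()
    return job_dir
```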

When configuring the Job Service, the administrator must be careful to ensure that the different files and executables can be accessed by the correct components in the system.

  • The application executables should be accessible by compute nodes only, and they should be read only. Installation of the applications can be either local per compute node or over disk space shared among all nodes.
  • The application wrapper scripts should be accessible by both the compute nodes and the Job Service. Like the application executables they should be read only, and can either be installed locally or part of a shared filesystem.
  • The RM connector scripts should be accessible by the GRIA Job Service middleware.
  • The job's scratch directory should be accessible by both the compute nodes and the Job Service. The Job Service must have write permission on the entire directory, but applications themselves only need to access the work/ subfolder. The application wrapper scripts could set up a chroot jail inside this directory to ensure nothing else on the filesystem can be accessed. Note that the scratch area cannot be copied between compute nodes; instead it must be exported as a shared disk space.

Local Execution Deployment

This is a typical minimum configuration scenario: the Job Service and the job execution run locally on the same machine.

Figure 2. Local Execution Deployment

Figure 2 shows how GRIA can be configured to run applications locally on the server machine running Tomcat. Because of its simplicity and minimal configuration, this deployment is commonly used for demonstrations and testing. This is the default configuration for the GRIA Job Service assuming the administrator does not set up the TorquePBS or Condor plugins. Note that it is not advisable to leave the GRIA Job Service configured like this in a production environment.

4.2. Basic Configuration

The Job Service

To configure the Job Service you will need to specify:

  • The location of the root job directory. The service creates one subdirectory inside this directory for each job. This directory must be on the same file system as the Data Service's data directory so that data files can be hard-linked efficiently as jobs are started and stopped. Otherwise, the data must be copied, which is slow and wastes disk space.
  • A list of trusted management services. Normally, you can just click Add to accept the default management service. This is the SLA management service from the GRIA Service Provider Management package. Note that if the GRIA Basic Application Services package is deployed on a different machine to the GRIA Service Provider Management package, some additional access control setup is required. This is described in Links with Other Services in the Service Provider Management user guide. As an alternative to configuring the service to be managed, you can make the service unmanaged (free) by clicking the Make service free button.
  • A list of applications which users can run using the service. See the Managing applications section for details.
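The same-file-system requirement for the root job directory can be checked before the service is configured. The sketch below, in Python 3, compares device numbers and shows the hard-link-then-copy fallback that the requirement is meant to avoid. The function names are hypothetical, and this illustrates the idea rather than the Job Service's own implementation.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """Return True if both existing paths report the same device number."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def link_or_copy(source, destination):
    """Hard-link source to destination, copying only if linking fails."""
    try:
        os.link(source, destination)
        return "linked"
    except OSError:
        # Different file systems (or no hard-link support): fall back to a copy.
        shutil.copy2(source, destination)
        return "copied"
```

If same_filesystem returns False for the data directory and the proposed job directory, every job start and stop would take the slower "copied" path.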

The other fields on the configuration page are optional and are used for connecting the Job Service to computational clusters.

4.3. Managing Applications

Managing applications

This section covers the administrative tasks of deploying and undeploying applications. After deploying an application to your job service, it becomes available for execution by remote clients. Note, however, that clients usually must satisfy additional business constraints, such as having an appropriate service level agreement or account, before they can execute deployed applications.

The GRIA Basic Application Services software is provided with a set of tutorial applications. These are made available during installation of the software. The web based Administration Interface provides the location of these files during the installation process, and guides you through the simple process of application deployment.

In addition to the tutorial applications, it's straightforward to develop new applications and deploy these in the same way.

This section assumes that you have all necessary files for application deployment and that any required executable applications have been installed according to the application documentation. If you are installing the tutorial applications, you have all the files you need. If you are deploying your own applications, first see Writing Application Scripts for details of the files you need to produce before application deployment.

Deploying applications

Having obtained or created the files and scripts needed for application deployment, you can now deploy the application to the job service. To do this, make sure that Tomcat is running, then use a web browser to log in to the GRIA Basic Application Services administration page. This can be found at http://<servername>:8080/gria-basic-app-services. Make sure you enter the appropriate security credentials and adjust the URL for the administration page according to your server setup.

From the administration page, select the Job Service link, as shown below.

Link to the Job Service Administration Page

This displays the Job Service Admin page. In the Applications section, enter the location of the directory containing the files and scripts needed for deployment. Then click the Deploy new application button.

This displays the Application properties for the application. You may optionally enter arguments for the application wrapper script, which will be supplied in addition to any user-specified arguments. Select the preferred resource manager for this application (LocalExecution is the default). Finally, resource manager directives may be supplied, as a JSDL (XML) fragment (see example provided).

Once all application properties have been set, click the Accept button.

This completes deploying an application to the Job Service.

Undeploying applications

Undeploying an application is straightforward. First click the Edit button alongside the entry for the application you wish to undeploy.

Click the Undeploy button to undeploy the application.

4.4. Application Wrapper Scripts

How to write the scripts and meta-data file to integrate your application.

4.4.1. Overview

Applications and The Job Service

Relation to the Job Service

Figure 1 - The Application Wrapper Scripts

The application wrapper scripts are deployed along with the applications - either on a shared file system or individually on each compute node. Their function is to provide a uniform interface to the Job Service for starting, monitoring and stopping applications.

The application model

The Job Service and RM connector scripts are designed to support a uniform model of application execution, shown in Figure 2:

Figure 2 - Application Model

The workspace for each job is set up by the Job Service when the job is initialised (this is one of the bookkeeping service operations that must precede the call to start the job). The workspace has a standard directory structure so the Job Service and RM connector scripts can create and find information stored in it, including a "work" sub-directory where the job will actually run.

When the user starts the job, the Job Service transfers input data files from Data Service URIs into the job's workspace. It then runs the RM connector script which submits the application to the execution platform (cluster, etc) where it will run (using the specified command line), possibly after some queuing delay. The application reads input data deposited in its workspace by the Job Service, and writes outputs back to the workspace once the job has finished. The Job Service can then find the outputs and transfer these to the correct output Data Service URIs.

When the user asks for the status of the job, the service queries the Resource Manager for status information (e.g. when the job started or finished, etc). The service may also run an associated monitoring application (using the specified command line) to gather application-specific status information (e.g. number of iterations completed, convergence plots, etc) from the workspace.

If the user asks for the job to be killed, the service issues a command to kill the job on the execution platform. The Job Service will detect that the job has finished, and will transfer any output produced. Note that the user (or their client-side application) should always check the status of a job to find out if it crashed or was killed, as some incomplete output may appear in the latter case.

Why are Application Wrapper Scripts Required?

In practice, few legacy applications behave exactly according to the model shown in Figure 2. It is rarely possible to change the application itself to fix this, so instead GRIA uses so-called wrapper scripts that do conform to the application model for starting and managing the application.

In practice, the wrapper scripts can do more than just make the underlying application work as indicated in Figure 2. They can also be used to handle and implement application specific features of the service.

One can also use (optional) wrapper scripts to look for application-specific status information in the working directory of the job. Without such scripts, the service can only obtain basic job status information from the job submission system.

Finally, wrapper scripts also provide a configurable mechanism for dealing with any application-specific security risks, e.g. checking for malicious input that may exploit a feature of the application. Few legacy applications were designed as network-accessible services, and since we can't change them to remove security loopholes, the use of wrapper scripts is essential to check for any exploits of application vulnerabilities. In the limit, one can configure the wrapper (and RM connector) scripts to run the application in a sandbox (e.g. chroot), with access only to a working sub-directory of the job workspace.

4.4.2. startJob Wrapper Script

The startJob.pl Application Wrapper Script

Language

Like all other application wrapper scripts, startJob can be written in any scripting language supported by the host OS.

  • On Linux, the first line of the script (e.g. #!/usr/bin/python) is used to determine which interpreter to use. The filename extension can be anything (e.g. startJob.py, startJob.sh).
  • On Windows, the filename extension is used to determine which interpreter to use. Currently only Python (.py) and Perl (.pl) are recognised.

Application Wrapper Functionality

The startJob application wrapper script is a mandatory script that deals with all application-specific details. This allows the Job Service to treat all applications in the same way, and so decouples the Job Service from the details of the application.

The main functions of the application wrapper script are:

  • handling input and output data files;
  • setting up an environment (i.e. environment variables) that is suitable to run the application;
  • enforcing any security precautions to protect against loop-holes in the application;
  • running the application itself.

The application wrapper is designed to run on the execution platform, having been submitted by the RM connector scripts to start a job. Before the wrapper script is submitted, the Job Service will have set up a workspace (directory) for the job, created a working sub-directory for the job to run in (e.g. work), and copied the input data into it (e.g. into work/inputs). The following listing shows a workspace directory structure with three input files and an empty outputs directory.

ff808081-1017450e-0110-174532dd-0001-1
`--work
   |-- inputs
   |   |-- namedinput
   |   |-- arrayinput-0
   |   `-- arrayinput-1
   `-- outputs

After changing to the workspace directory (not the working sub-directory), the wrapper will be submitted using the following command line:

app-wrapper <application arguments>

The functionality of the wrapper script should include the following:

  1. Parse wrapper arguments, including security checks for illegal input designed to inject malicious commands into the command-line used to launch the application.
  2. Move input data files into the working sub-directory, including unpacking any that are compressed archives containing multiple inputs.
  3. Create a consistent environment in the working directory, by setting up environment variables and rewriting input data to match the local environment where necessary.
  4. Build the command line and run the underlying application.
  5. Copy output files from the working directory into the output directory, including packing multiple outputs into compressed archive files where necessary.
  6. Exit by returning the exit code of the application.

For simple applications, security can be maintained by checking input parameters during step 1 and, if necessary, data files during step 3. If the application is too complicated for this to be reliable, it may also be necessary to set up a sandboxed working environment and run the code inside it during step 4.
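The six steps listed above can be sketched as a single Python 3 function. This is an illustrative skeleton under stated assumptions, not the wrapper shipped with GRIA: the function name start_job, the FORBIDDEN character set and the JOB_WORK_DIR environment variable are all made up for the example, and the unpacking and packing of archives is left out.

```python
import os
import shutil
import subprocess

# Characters refused in arguments; a deliberately small, hypothetical deny-list.
FORBIDDEN = set(";|&`$<>\n")


def start_job(workspace, app_command, args, output_names):
    """Run one job in `workspace` and return the application's exit code."""
    work = os.path.join(workspace, "work")
    inputs = os.path.join(work, "inputs")
    outputs = os.path.join(work, "outputs")

    # Step 1: reject arguments that look like command injection.
    for arg in args:
        if FORBIDDEN.intersection(arg):
            raise ValueError("illegal argument: %r" % arg)

    # Step 2: place the input files in the working sub-directory.
    for name in os.listdir(inputs):
        shutil.copy(os.path.join(inputs, name), os.path.join(work, name))

    # Step 3: build the environment (JOB_WORK_DIR is a made-up variable).
    env = dict(os.environ, JOB_WORK_DIR=work)

    # Step 4: run the application without a shell, in the working directory.
    exit_code = subprocess.call(list(app_command) + list(args),
                                cwd=work, env=env)

    # Step 5: copy the expected output files into the outputs directory.
    for name in output_names:
        produced = os.path.join(work, name)
        if os.path.isfile(produced):
            shutil.copy(produced, os.path.join(outputs, name))

    # Step 6: hand the application's exit code back to the caller.
    return exit_code
```

A real startJob script would call such a function with the current directory as the workspace and sys.argv[1:] as the arguments, and pass the result to sys.exit().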

Some of these steps are considered in more detail below.

Input and Output Data Handling

Note: As of version 5.2 of the GRIA Job Service, applications can specify names for their inputs in the application metadata file. For backwards compatibility, if the application metadata file still uses the old GRIA 5.1 format, inputs will be named numerically in the order they appear in the metadata (e.g. input-0, input-1, etc.).

When unpacking input data, the application wrapper should attend to the following:

  1. Create any substructure needed in the job's working sub-directory of the workspace.
  2. Copy or unzip input files from the inputs sub-directory into the job's working space.
  3. Check that all input needed to run the job is present.

The Job Service knows in advance which output files must be returned to the outputs directory. The application wrapper must create these files by:

  1. Copying or zipping data to create the required output files in the outputs sub-directory
  2. Renaming these files to the names specified in the metadata (or the output-x naming scheme for legacy applications)

The Job Service will detect that the application wrapper script has finished and handle the transfer of output files accordingly.
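The input and output handling described above might look as follows in Python 3. This is a sketch under assumptions: the function names unpack_inputs and pack_output are hypothetical, compressed inputs are assumed to be zip archives, and the names of the required inputs and of the output are supplied by the caller rather than read from the application metadata file.

```python
import os
import shutil
import zipfile


def unpack_inputs(work, required_names):
    """Copy or unzip work/inputs into work, then check required files exist."""
    inputs = os.path.join(work, "inputs")
    for name in os.listdir(inputs):
        source = os.path.join(inputs, name)
        if zipfile.is_zipfile(source):
            # An archive holding several inputs: extract its members.
            with zipfile.ZipFile(source) as archive:
                archive.extractall(work)
        else:
            shutil.copy(source, os.path.join(work, name))
    missing = [n for n in required_names
               if not os.path.exists(os.path.join(work, n))]
    if missing:
        raise RuntimeError("missing input files: %s" % ", ".join(missing))


def pack_output(work, member_names, output_name):
    """Zip the named files from work into work/outputs/<output_name>."""
    target = os.path.join(work, "outputs", output_name)
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in member_names:
            archive.write(os.path.join(work, name), arcname=name)
    return target
```

Here output_name would be the name given in the metadata, or output-0, output-1 and so on for legacy applications.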

Consistent Context Reconstruction

Why do we need context reconstruction? The input data for our application has been created on another system with a different directory structure, environment and possibly even operating system. We have to set up an equivalent (not necessarily identical) environment on our execution platform, and make sure any input data references to the remote user's environment are mapped onto the one we have created, or they will be invalid when the application is started.

When and where should context reconstruction be performed? One should handle it as close as possible to the running application—certainly on the execution platform where the job will actually be run—as this is where the environment is needed. This is why the Job Service doesn't attempt to create the context itself - there is no point doing it at the service host if the job will be executed on a compute node in a Condor cluster. Instead, we leave it to the application wrapper to handle everything in an application specific way on the execution platform itself.

A typical approach to context reconstruction might involve passing an array of named parameters to the Job Service, including environment settings as well as application flags. These will be passed to the wrapper through its argument list. In addition, one can provide settings in an extra input file, intended for the wrapper rather than the job itself, and used to set up the environment prior to running the application code.

The hardest job for the wrapper is to parse and rewrite application input data where necessary to ensure it is consistent with the environment established on the execution platform. If this is not needed, it is usually quite easy to 'wrap' an application to run inside the Job Service. Where it is necessary, the wrapper may become a significant body of code in its own right.

For example, consider the following line of input intended for the rendering application AIR, used with the Job Service to provide a grid-enabled video rendering service:

  Option 'searchpath' 'shader' ['&:e:\AnimalLogic\MaxMan\shaders:C:\Sample\shaders']

The problem here is that the application uses plug-ins to perform part of the rendering calculation, and the search path for these can be specified in the user input. This particular input file has been generated automatically using a graphical environment for video post-production, which has filled in the relevant path based on where the shader libraries were installed on the user's local machine.

The wrapper has to identify which shaders are needed, and substitute the path to them on the local system:

   Option 'searchpath' 'shader' ['&:/export/apps/AnimalLogic/MaxMan/shaders:/export/apps/air/Sample/shaders']

In some cases, it may be possible to infer the meaning of client-side environment references by pattern matching against a list of meaningful terms used by the application. In others (probably in this case), it is necessary for the user to send the install path quoted for specific groups of plug-ins as service arguments or environment settings, so the wrapper can find them and map them onto the equivalent installed groups of components on the execution platform.
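As an illustration of the second approach, the sketch below assumes the wrapper has already been given a mapping from client-side install paths to local ones (for example via service arguments) and simply substitutes them in each input line. It is written in Python 3; the function name rewrite_paths and the SHADER_PATHS table are hypothetical, and a real wrapper for AIR would need to understand the input format rather than rely on plain string replacement.

```python
# Hypothetical mapping from client-side install paths to local ones,
# taken from the example input and output lines above.
SHADER_PATHS = {
    r"e:\AnimalLogic\MaxMan\shaders": "/export/apps/AnimalLogic/MaxMan/shaders",
    r"C:\Sample\shaders": "/export/apps/air/Sample/shaders",
}


def rewrite_paths(line, path_map):
    """Replace each client-side path found in `line` with its local path."""
    # Longest client paths first, so a path is never clobbered by its prefix.
    for client_path in sorted(path_map, key=len, reverse=True):
        line = line.replace(client_path, path_map[client_path])
    return line
```

Note that the match is case-sensitive, whereas paths on the client's Windows machine are not, so a production wrapper would probably need to normalise case first.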

In extreme cases, it may be necessary to establish multiple services to run the same application in different ways, allowing a different, specific environment to be set up for each. For example, it probably wouldn't make sense to have a single service to run a computational fluid dynamics (CFD) code capable of simulating coolant flows through automotive engines AND the propagation of drugs in aerosol suspension in human lungs. It would be asking too much of a wrapper developer to differentiate and correctly handle such extreme cases, and instead one should set up two services each with its own wrapper specialised to one of these scenarios.

Security Containment

Why Wrappers have to Bother with Security

The Job Service regards application wrapper scripts as trustworthy, because the service operator can inspect them and make sure they don't do anything strange or foolhardy. However, the applications may be third party, closed source executables that cannot be inspected, and were not designed as network-accessible services in the first place.

Wrapper scripts can protect the service from malicious users in three ways:

  1. checking any user input used to create the command line for running the application, to exclude command injection attacks using parameters like 'method=gauss; cd /; rm */*';
  2. checking input data known to be used in an unsafe way by the application, e.g. to construct system calls for executing plug-ins or moving files around;
  3. confining the application to a sandbox, by first preparing the sandbox and then launching the application in it.

If the application is very simple, or designed to withstand malicious users, or if you have only a small number of users you know well (and trust not to mislay their credentials) then it may be OK to include only the first of these measures.
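The first measure is often easiest to implement as an allow-list: each parameter the application accepts is named explicitly, together with a pattern its value must match. The following Python 3 sketch shows the idea; the parameter names, the patterns and the function name build_argv are hypothetical and would need to be replaced with those of the real application.

```python
import re

# Hypothetical allow-list: parameter name -> pattern the whole value must match.
ALLOWED = {
    "method": re.compile(r"gauss|linear"),
    "iterations": re.compile(r"[0-9]{1,6}"),
}


def build_argv(executable, params):
    """Return an argument list for the application, or raise ValueError."""
    argv = [executable]
    for param in params:
        name, separator, value = param.partition("=")
        pattern = ALLOWED.get(name)
        # Reject unknown names, missing '=' and values that do not match.
        if not separator or pattern is None or not pattern.fullmatch(value):
            raise ValueError("rejected parameter: %r" % param)
        argv.append("%s=%s" % (name, value))
    return argv
```

With this table, 'method=gauss' is accepted while 'method=gauss; cd /; rm */*' is rejected. Passing the resulting list to subprocess.call without shell=True means that no shell ever interprets the values.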

Legacy applications are quite likely to do things in unsafe ways. Operations such as renaming files or testing whether they exist are sometimes done via system calls. This can be a security hole if the application developer wasn't expecting filenames to be sent by a remote user who may have malicious intent. If the application isn't too complex, or if you can check with the developer on what might happen, then it should be OK if you also check the user-supplied input, filenames and other data that may be sent to unsafe system calls.

In the worst case, one has to assume the application will be unsafe, and attempt to contain any damage caused by malicious (or possibly careless) input by restricting what the application can do and where it can do it. There are several possible ways to achieve such restrictions.

Chroot

On Linux systems, chroot can be used to restrict a sub-process to an arbitrary sub-directory, e.g. a job's working directory. The chroot mechanism was designed for use by operating system developers to allow them to create a pseudo-root within which to test their code. While the chroot container doesn't prevent access to low-level devices, it will prevent most legacy applications accessing files outside the specified sub-directory. Chroot is widely used to contain web servers and other network applications to minimise the scope for damage if they are compromised.

To use chroot, it is necessary to create a complete operating system environment inside the job's working directory (which the application will see as '/'). One has to copy application binaries, resolve any references to system/application libraries, create devices such as /dev/null, etc. Creating a self-sufficient chroot 'jail' in which the application can run may not be easy and, of course, it has to be repeated for each individual job. However, it can provide a good level of safety because the 'jail' is enforced by the operating system itself.

Restricted Shells

Many shells, including bash, provide a restriction mechanism, usually invoked by running the shell with the -r switch. Common features of restricted shells include preventing a program from changing directory, refusing to run commands whose names contain a path separator (so that only commands found on the search path can be executed), and prohibiting output redirection and changes to the search path.

Minimal privilege accounts

Another approach is to create a low-privilege account for each job. The wrapper script would then have to assign such an account, change the working directory so it is owned by this account, and run the application in that working directory under the same account. Provided the same account is not used for anything else (including running other jobs), the application can be prevented from accessing anything outside the working directory, even if it can be induced to run some unforeseen system call by sending some malicious input.

The two drawbacks with this approach are:

  1. ideally one should create a pool of accounts and provide a way for the wrapper to assign them to jobs rather than creating new accounts, but this isn't supported at present;
  2. the wrapper would need sufficient privilege to set the account under which a sub-process is run, which may make the wrapper more dangerous if it can be compromised.

The second drawback may not be too bad, given that the wrapper at least can be designed to check all inputs and avoid doing anything unpleasant. At present, the Job Service runs with a normal unprivileged user identity, so it may be better to use other methods to contain individual jobs.

Other Methods

The above list is by no means exhaustive. For example, if the chroot 'jail' is not sufficient, one can create an entire virtual machine on which to run a potentially unsafe application. Software such as VMware can be used to implement this approach, but users who want to go to these lengths are on their own, at least in this version of the software.

Error Handling

If an error is encountered in the application, the wrapper must report the fact. If this is not done, the Job Service will assume everything is OK, and the user's client application will probably attempt to continue, which may not be appropriate if some output from the job is missing, etc.

The application startJob wrapper should exit with an exit status of zero if the job has completed successfully, or with a non-zero status if the job has failed. This value will be stored in .exit_code by the RM Connector script. Generic clients may stop executing a workflow, for example, if this result is not zero.
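The following sketch shows one way a Python wrapper could turn the outcome of the application into an exit status, including the case where the executable cannot be started at all. The function name run_application and the choice of 1 as the failure status are assumptions made for illustration.

```python
import subprocess
import sys

def run_application(command):
    """Run the application and return the exit status the wrapper should report."""
    try:
        # subprocess.call waits for the command and returns its exit status.
        return subprocess.call(command)
    except OSError as err:
        # The executable could not be started; report that as a failure too.
        print("Failed to start %s: %s" % (command[0], err))
        return 1
```

A startJob wrapper would typically finish with sys.exit(run_application([...])) so that the status is passed back to the Job Service.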

An Example Wrapper Script

This example is based on the ImageMagick applications which were installed as part of the GRIA installation. To get started, we will create a simple wrapper that runs this application.

  1. Create a startJob wrapper script:
    #!/usr/bin/env python
    
    import shutil
    import subprocess
    import sys
    
    print("Swirl wrapper started")
    
    print("Copying input to work directory...")
    shutil.copyfile("inputs/sourceImage", "image.jpg")
    
    print("Transforming image...")
    # Arguments are passed as a list, so no shell is involved
    p = subprocess.Popen(["mogrify", "-swirl", "60", "image.jpg"])
    ret = p.wait()
    
    if ret != 0:
        # A non-zero exit status tells the Job Service that the job failed
        print("Failed to transform image, error=%s" % ret)
        sys.exit(ret)
    
    print("Copying result to output stager...")
    shutil.copyfile("image.jpg", "outputs/outputImage")
    
    print("Swirl job completed successfully")
    

    This will perform the following steps:

    1. Copy the input image into the work directory.
    2. Run the mogrify command to transform the image.
    3. Copy the result to the output stager.
  2. Edit the startJob.py script to run your command instead of mogrify (the command you tested above).
  3. Make the script executable:
    $ chmod a+x startJob.py

4.4.3. checkJob Wrapper Script

Creating the checkJob.pl Status Wrapper Script

Language

Like all other application wrapper scripts, checkJob can be written in any scripting language supported by the host OS.

  • For Linux, the first line of the script (e.g. #!/usr/bin/python) is used to determine which interpreter to use. The filename extension can be anything (e.g. checkJob.py, checkJob.sh).
  • On Windows, the filename extension is used to determine which interpreter to use. Currently only Python (.py) and Perl (.pl) are recognised.

Application Wrapper Functionality

Unlike the wrapper for starting an application, the application wrapper for reading status from the working directory is optional. If no such wrapper is provided, the platform script will create a simple status report by checking stdout and stderr of the application, and consulting the RM connector scripts.

If you want the job status report to include application-specific information such as convergence plots, iteration counters, etc, you should create a wrapper script that will be invoked by the client calling the checkJob method.

An application status wrapper is usually a lot simpler to create than the main wrapper because it does not take any user-supplied (and hence potentially malicious) arguments, and does not set up (or run) potentially untrustworthy code. All the status wrapper has to do is to examine the job's working directory, read any files it needs in order to extract the desired status information (in the limit, one could simply copy an application-level log file), and write it to the standard output.

Note: the format of the status information is open and application dependent, however status information must not include binary data since it will be returned to the user in an XML document.

An Example Status Wrapper Script

This example is based on the ImageMagick application, which was installed as part of the GRIA installation and follows on from the start job example. In the same directory as startJob.py, create a script called checkJob.sh:

#!/bin/sh
tail log

This is run each time the client checks the status of the job. This example simply returns the last few lines of the log file, and is executed as follows:

  1. Make the script executable:
    $ chmod a+x checkJob.sh
  2. Test it with these commands:
    $ ./checkJob.sh > statusfile
    $ cat statusfile
    

    You should find that the contents of log are now in statusfile.
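Because only Python and Perl scripts are recognised on Windows, a Python status wrapper may be more portable than the shell script. The sketch below returns the last few lines of a log file and drops non-printable characters, since the status must not contain binary data. The function name tail_text and the default of ten lines are illustrative choices.

```python
def tail_text(path, max_lines=10):
    """Return the last max_lines lines of a file with non-printable characters removed."""
    with open(path, "rb") as f:
        # Undecodable bytes are replaced rather than raising an error.
        lines = f.read().decode("utf-8", errors="replace").splitlines()
    cleaned = []
    for line in lines[-max_lines:]:
        cleaned.append("".join(ch for ch in line if ch.isprintable() or ch == "\t"))
    return "\n".join(cleaned)
```

A checkJob.py script built on this would simply call print(tail_text("log")).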

4.4.4. killJob Wrapper Script

Creating a killJob.pl Application Wrapper Script

Language

Like all other application wrapper scripts, killJob can be written in any scripting language supported by the host OS.

  • For Linux, the first line of the script (e.g. #!/usr/bin/python) is used to determine which interpreter to use. The filename extension can be anything (e.g. killJob.py, killJob.sh).
  • On Windows, the filename extension is used to determine which interpreter to use. Currently only Python (.py) and Perl (.pl) are recognised.

Application Wrapper Functionality

The application-specific kill script allows an application to be terminated in a more controlled way, e.g. using an application-specific mechanism. This script is optional and, if not available, the RM Connector script will try to kill a job at the resource manager level instead.

The return code of this script should be 0 upon success or any other value on error. The RM Connector will decide accordingly whether the killing operation was successful or if it failed. The functionality of this script is application dependent and difficult to describe in a generic way.

For some applications, terminating a job might be as simple as creating a single file (e.g. a "stop" file) in the job workspace. Some other applications are aware of signals that can be passed to them, etc.

An Example Kill Wrapper Script

We are not aware of any particular way that ImageMagick can be killed, therefore we cannot provide a complete example of a kill wrapper script. However, in the following paragraphs we provide some hints about possible ways of terminating jobs.

If a particular application is aware of a termination file in the job workspace, then the kill wrapper could be as simple as:

touch .terminate

Where .terminate is the particular termination filename that the application uses.

Many applications are aware of various signals e.g. SIGTERM, SIGALRM, SIGSTOP, etc. The kill wrapper could therefore pass the appropriate signal to the application and the application could respond by terminating the job gracefully, for example:

# read the process ID from .job.pid and send a termination signal
pid=$(cat .job.pid)
kill -TERM "$pid"

Where $pid is the process ID which can be read from the .job.pid file.
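The same signal-based approach can be sketched in Python. This assumes that .job.pid contains nothing but the process ID as decimal text, which is not confirmed here; the function name terminate_job is hypothetical.

```python
import os
import signal

def terminate_job(pid_file=".job.pid"):
    """Send SIGTERM to the process named in pid_file; return 0 on success, 1 on error."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
        os.kill(pid, signal.SIGTERM)
    except (OSError, ValueError) as err:
        # Missing file, malformed contents or a process that no longer exists.
        print("Could not terminate job: %s" % err)
        return 1
    return 0
```

A killJob.py script would end with sys.exit(terminate_job()) so that the RM Connector sees 0 on success and a non-zero value otherwise.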

4.4.5. Application Metadata XML

Creating an XML File to Describe an Application

Application description files are XML files containing metadata about an application deployed on a GRIA Job Service. These files are essential for GRIA users to discover and use available applications. To create an XML description for an application, you need to use the following schema to identify the application's main features including name, version number, description and inputs/outputs (if any).

The core elements in GRIA application metadata documents are:

  • JobServiceMinVersion, [1]
  • Application, [1]
  • Metrics, [0,1]
  • Parameters, [0,1]
  • DataStagers, [1]

The following paragraphs explain in more detail each of these elements.

JobServiceMinVersion Element

The GRIA Job Service version, e.g. 5.2, 5.3. Multiplicity: 1.

Application Element

This element contains elements that describe the application itself. Multiplicity: 1.

The Application element contains the following sub-elements:

Description Element

Application short description, multiplicity 0, 1, type string.

ApplicationName Element

Every application provided by the Job Service must be given a unique ApplicationName. To ensure uniqueness, a URI is used. Note that although these names look like web page addresses, they do not necessarily point to real web pages if treated as URLs; they are simply unique strings. Multiplicity 1, type string.

ApplicationVersion Element

This element describes the application version, multiplicity 0, 1, type string.

Group Element

The Group element describes optional information about an application's group, e.g. CFD, simulation, etc. Multiplicity 0, 1, type string.

Keywords Element

This is a string of keywords describing the application, multiplicity 0, 1.

Metrics Element

This is an optional element that can be used to describe application specific metrics to GRIA components. Currently not in use.

Parameters

This element describes the parameters that the application expects. The Job Service can use it to check the validity of the parameters (command-line arguments) submitted with a job. Multiplicity 0, *.

Each Parameter element inside Parameters can take the following attributes:

  • name
  • qualifier
  • type, optional
  • minOccurs, optional
  • maxOccurs, optional

The last two attributes can be used to represent single or multiple parameters.

Default Element

This element can be used to provide default parameters for the application, e.g. run all jobs in debug or verbose mode, etc. Multiplicity 0, 1.

Allowed Element

This element can be used to list the values that are allowed for a parameter. Multiplicity 0, *.

Description Element

This element can be used to provide a description of the parameter. Multiplicity 0, 1.

DataStagers Element

DataStager elements are used to specify job inputs and outputs. They can describe default, optional and multiple inputs and outputs that will be used for running the job. Multiplicity 1, *. Each DataStager element can contain the following attributes:

  • type, required
  • name, required
  • minOccurs, optional
  • maxOccurs, optional
  • defaultSize, optional

Description Element

This is an optional element describing a job file. Multiplicity 0, 1.

MimeType Element

MimeType is an optional element that describes the type of the file, e.g. text, image, etc. GRIA services can use it, for example, to display the contents of a data stager properly. Multiplicity 0, 1.

The following XML describes the Swirl application:

<?xml version="1.0" encoding="UTF-8" ?>

<GriaApplicationDescription xmlns="http://www.it-innovation.soton.ac.uk/2007/grid/application">

<JobServiceMinVersion>5.2</JobServiceMinVersion>

<Application>
<Description>Application to swirl an image</Description>
<ApplicationName>http://it-innovation.soton.ac.uk/grid/imagemagick/swirl</ApplicationName>
<ApplicationVersion>2.0-1</ApplicationVersion>
<Group>graphics</Group>
<Keywords>imagemagick, example</Keywords>
</Application>

<DataStagers>
<DataStager type="input" name="inputImage">
<Description>Input image to be swirled</Description>
<MimeType>image</MimeType>
</DataStager>

<DataStager type="output" name="outputImage">
<Description>Swirled image</Description>
<MimeType>image</MimeType>
</DataStager>
</DataStagers>

</GriaApplicationDescription>
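As a quick check that a metadata file is well formed and declares the stagers you expect, it can be parsed with Python's standard library. This is only a sketch: the function name list_stagers is hypothetical, and the code reads the document as plain XML without validating it against the GRIA schema.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the GriaApplicationDescription element above.
NS = {"app": "http://www.it-innovation.soton.ac.uk/2007/grid/application"}

def list_stagers(xml_text):
    """Return (type, name, mime type) for each DataStager in a metadata document."""
    root = ET.fromstring(xml_text)
    stagers = []
    for stager in root.findall("app:DataStagers/app:DataStager", NS):
        mime = stager.findtext("app:MimeType", default="", namespaces=NS)
        stagers.append((stager.get("type"), stager.get("name"), mime))
    return stagers
```

For the Swirl description this should list one input stager named inputImage and one output stager named outputImage.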

Advanced usage

Input arrays

An application might require arrays of inputs, whose exact sizes are specified by the user when creating the job. This is supported by GRIA using the minOccurs, maxOccurs and defaultSize attributes on DataStager elements.

For example, if your application took between 2 and 8 images as input, you might use the following XML:

<DataStager type="input" name="inputImage" minOccurs="2" maxOccurs="8" defaultSize="2">
<Description>Input image</Description>
<MimeType>image</MimeType>
</DataStager>

You can use the defaultSize attribute to support older clients that do not know how to specify the desired size of arrays.
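To illustrate how the three attributes interact, the sketch below works out the stager names for an array of a requested size. It assumes the numbering convention given for JSDL DataStaging names elsewhere in this guide (a numerical suffix such as inputarray-0, inputarray-1) and assumes that defaultSize applies when the client does not state a size. The function name stager_names is hypothetical.

```python
def stager_names(name, min_occurs, max_occurs, default_size, size=None):
    """Return the stager names for an input array of the requested size."""
    if size is None:
        # Assumed behaviour: clients that do not specify a size get defaultSize.
        size = default_size
    if not min_occurs <= size <= max_occurs:
        raise ValueError("size %d is outside [%d, %d]" % (size, min_occurs, max_occurs))
    return ["%s-%d" % (name, i) for i in range(size)]
```

With the attributes from the example above, a client that asks for three images would get inputImage-0, inputImage-1 and inputImage-2.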

Optional inputs

Optional inputs are described much like arrays, except the minOccurs attribute is 0 and the maxOccurs attribute is 1. For example:

<DataStager type="input" name="overlayImage" minOccurs="0" maxOccurs="1" defaultSize="0">
<Description>Optional image to superimpose on top of the result</Description>
<MimeType>image</MimeType>
</DataStager>

Command line arguments

If your metadata file describes the application's allowed command line arguments, the GRIA Client and Job Service can validate arguments as they are received from the user, before they are passed to the application wrappers.

For example:

<Parameters>
<Parameter name="string" qualifier="--string" type="string" minOccurs="0" maxOccurs="1"/>
<Parameter name="bool" qualifier="--bool" type="boolean" minOccurs="0" maxOccurs="1"/>
<Parameter name="data" qualifier="" type="string" minOccurs="1" maxOccurs="1">
<Allowed>one</Allowed>
<Allowed>two</Allowed>
<Allowed>three</Allowed>
</Parameter>
</Parameters>

This would allow the following command lines:

--string "This is a string" one
--bool three

In the above example, we specify whether parameters are optional or compulsory using the minOccurs="0" maxOccurs="1" or minOccurs="1" maxOccurs="1" attribute combinations, respectively.

The "data" parameter may take different values (hence the empty qualifier attribute). It is also further restricted by the use of specific allowed elements, forming a set of options.
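The validation described above can be approximated with a short script. This is a sketch of the idea and not the Job Service's actual algorithm. It assumes that a boolean parameter is a bare flag, that any other qualified parameter consumes the next argument as its value, that an empty qualifier marks a positional argument, and that minOccurs and maxOccurs default to 1.

```python
import xml.etree.ElementTree as ET

PARAMETERS_XML = """
<Parameters>
<Parameter name="string" qualifier="--string" type="string" minOccurs="0" maxOccurs="1"/>
<Parameter name="bool" qualifier="--bool" type="boolean" minOccurs="0" maxOccurs="1"/>
<Parameter name="data" qualifier="" type="string" minOccurs="1" maxOccurs="1">
<Allowed>one</Allowed>
<Allowed>two</Allowed>
<Allowed>three</Allowed>
</Parameter>
</Parameters>
"""

def validate(argv, xml_text=PARAMETERS_XML):
    """Return True if argv satisfies the Parameter declarations, else False."""
    params = ET.fromstring(xml_text).findall("Parameter")
    qualified = {p.get("qualifier"): p for p in params if p.get("qualifier")}
    positional = [p for p in params if not p.get("qualifier")]
    counts = {p.get("name"): 0 for p in params}
    args = list(argv)
    while args:
        token = args.pop(0)
        param = qualified.get(token)
        if param is None:
            # Not a known qualifier: treat the token as a positional value.
            if not positional:
                return False
            param, value = positional[0], token
        elif param.get("type") == "boolean":
            value = None  # assumed: boolean flags take no value
        elif args:
            value = args.pop(0)
        else:
            return False  # qualifier given without its value
        allowed = [a.text for a in param.findall("Allowed")]
        if allowed and value not in allowed:
            return False
        counts[param.get("name")] += 1
    # Every parameter must occur between minOccurs and maxOccurs times.
    return all(
        int(p.get("minOccurs", "1")) <= counts[p.get("name")] <= int(p.get("maxOccurs", "1"))
        for p in params
    )
```

Under these assumptions both command lines shown above are accepted, while a command line that omits the compulsory "data" value is rejected.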

4.4.6. jobUsage Wrapper Script

The jobUsage script generates application-specific usage reports

Language

Like all other application wrapper scripts, jobUsage can be written in any scripting language supported by the host OS.

  • For Linux, the first line of the script (e.g. #!/usr/bin/python) is used to determine which interpreter to use. The filename extension can be anything (e.g. jobUsage.py, jobUsage.sh).
  • On Windows, the filename extension is used to determine which interpreter to use. Currently only Python (.py) and Perl (.pl) are recognised.

Application Wrapper Functionality

The jobUsage application wrapper script is an optional script that generates usage reports, indicating how much resource a job for a specific application is using. The GRIA Job Service runs this script occasionally during the job's execution, and once when the job has completed, to gather usage reports. It combines these reports with the ones from the resource manager and then forwards them to the SLA service.

When is jobUsage run?

The jobUsage script is run approximately twice per minute during the job's execution, and then once again immediately after the job finishes. The exact frequency of the calls to jobUsage depends on service load, but is guaranteed to be at most once per 30 seconds.

Output format

The jobUsage script should print on stdout an XML fragment similar to the following:

<UsageReport metric="http://example.com/metrics/example1" type="instantaneous" time="2008-06-27T06:26:48">562</UsageReport>
<UsageReport metric="http://example.com/metrics/example2" type="cumulative" startTime="2008-06-27T06:26:44" endTime="2008-06-27T06:26:48">21</UsageReport>
<UsageReport metric="http://example.com/metrics/example3" type="instantaneous" time="2008-06-27T06:26:48">75.32</UsageReport>

Each usage report contains the following components:

  • Metric - identifies the metric for which usage is being reported. The metric URI should be unique. For example, the resource manager plugins use the following metric URI for reporting CPU usage:
    http://www.gria.org/sla/metric/resource/cpu
  • Type - either instantaneous or cumulative. Instantaneous reports describe the usage now, and should be reset to 0 when the job has finished. Cumulative reports describe the total usage over the entire duration of the job's execution.
  • Value - the numeric value of the usage report. The type of this is double.
  • Time - time in ISO 8601 format 'yyyy-MM-ddTHH:mm:ss' for instantaneous metrics.
  • StartTime - start time in ISO 8601 format 'yyyy-MM-ddTHH:mm:ss' for cumulative metrics.
  • EndTime - end time in ISO 8601 format 'yyyy-MM-ddTHH:mm:ss' for cumulative metrics.
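A jobUsage script written in Python could build these fragments with two small helpers. This is a sketch: the function names are hypothetical, the metric URIs are the placeholder ones from the example above, and whether times should be local or UTC is not stated in this guide.

```python
from datetime import datetime
from xml.sax.saxutils import quoteattr

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"

def instantaneous_report(metric, value, when):
    """Format an instantaneous usage report for the given metric URI."""
    return '<UsageReport metric=%s type="instantaneous" time="%s">%s</UsageReport>' % (
        quoteattr(metric), when.strftime(TIME_FORMAT), value)

def cumulative_report(metric, value, start, end):
    """Format a cumulative usage report covering the period from start to end."""
    return '<UsageReport metric=%s type="cumulative" startTime="%s" endTime="%s">%s</UsageReport>' % (
        quoteattr(metric), start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT), value)
```

The script would print one such line per metric on stdout, for example print(instantaneous_report("http://example.com/metrics/example1", 562, datetime.now())).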

4.5. JSDL Job Submission

The GRIA Job Service allows submission of jobs in JSDL format

4.5.1. Overview

An explanation of what JSDL is and how it is used in GRIA

JSDL (Job Submission Description Language) is an open standard developed by the Open Grid Forum. Since version 5.2 of the GRIA Job Service, users are able to submit information about the jobs they wish to create by using a JSDL document.

Version 5.3 of the GRIA Job Service supports a limited subset of the JSDL Specification, Version 1.0. Clients can use JSDL to:

  • Name their jobs, and give the URI of the application to be run
  • Specify any command-line arguments to be passed to the application wrapper scripts
  • List the number, name and type of any input/output data stagers to be created
  • Outline the expected resource usage of the job, to ensure the client's SLA has enough resource to allow the job to run
  • Define any constraints on the resources available to the job during its execution

The graphical GRIA client will create a JSDL document for you behind the scenes when you create a job. If you prefer, the graphical GRIA client is also able to upload a hand-written JSDL document when creating a job. You can view the JSDL document for any job in the GRIA client, or by using the JobResource.getJSDL() API method.

When using the client APIs to create a job, the JobDescription Java class can be used to easily create a JSDL document. Further information is provided in Workflow Tutorial 2 - Job Execution.

4.5.2. Supported Elements

List of the JSDL XML elements supported by the GRIA Job Service

The GRIA Job Service supports parts of the JSDL Specification, Version 1.0. This page gives a description of the elements that are supported. Any elements not listed here are ignored and not used by the GRIA Job Service.

JSDL Element Name | Supported | Notes
JobIdentification | |
  - JobName | Yes | Used to set the label of the job resource that is created
Application | |
  - ApplicationName | Yes | Should be set to the application URI (eg. http://it-innovation.soton.ac.uk/grid/imagemagick/swirl)
  - POSIXApplication | |
    - Argument | Yes | Specifies a single commandline argument to pass to the application wrapper scripts
    - FileSizeLimit | Partial | Only supported when using the LocalExecution plugin on POSIX
    - CoreDumpLimit | Partial | Only supported when using the LocalExecution plugin on POSIX
    - DataSegmentLimit | Partial | Only supported when using the LocalExecution plugin on POSIX
    - LockedMemoryLimit | Partial | Only supported when using the LocalExecution plugin on POSIX
    - OpenDescriptorsLimit | Partial | Only supported when using the LocalExecution plugin on POSIX
    - StackSizeLimit | Partial | Only supported when using the LocalExecution plugin on POSIX
    - CPUTimeLimit | Partial | Only supported when using the LocalExecution plugin on POSIX
    - Output | Partial | Added transparently when using the LSF plugin on POSIX
    - Error | Partial | Added transparently when using the LSF plugin on POSIX
    - WorkingDirectory | Partial | Added transparently when using the LSF plugin on POSIX
    - UserName | Partial | Added transparently when using the LSF plugin on POSIX
DataStaging | | One DataStaging element should be provided for each input or output your job requires
  - "name" attribute | Yes | Should match one of the input/output names in the application metadata. If the metadata describes an array, the name should have a numerical suffix, indicating which element of the array the stager represents (eg. inputarray-0, inputarray-1, etc.)
  - FileName | Yes | Used as above if the "name" attribute is not specified
  - DeleteOnTermination | Partial | Removed, when using the LSF plugin
Resources | |
  - IndividualCPUSpeed | Partial | See Support for Resource Elements
  - IndividualCPUTime | Partial | See Support for Resource Elements
  - IndividualCPUCount | Partial | See Support for Resource Elements
  - IndividualPhysicalMemory | Partial | See Support for Resource Elements
  - IndividualVirtualMemory | Partial | See Support for Resource Elements
  - IndividualDiskSpace | Partial | See Support for Resource Elements
  - TotalCPUTime | Partial | See Support for Resource Elements
  - TotalCPUCount | Partial | See Support for Resource Elements
  - TotalPhysicalMemory | Partial | See Support for Resource Elements
  - TotalVirtualMemory | Partial | See Support for Resource Elements
  - TotalDiskSpace | Partial | See Support for Resource Elements

4.5.3. Meaning of Resource and RangeValue types

RangeValue types in the JSDL allow the submitter of a job to specify ranges for resource usage

Overview

There are several elements in a JSDL document that contain RangeValue_Types. These elements usually allow the submitter of a job to specify a range of allowed values for a certain resource.

When used on elements in the <Resources> section of the JSDL, RangeValue_Types can be used to specify two different types of policy:

  1. The lower bound of the range is used to specify expected minimum usage, and is checked against the user's SLA at creation time. This check ensures that, if the user doesn't have sufficient resource left on his SLA, the job is not allowed to start (instead of being terminated half way through).
  2. The upper bound of the range is used to specify a maximum resource usage for the job, above which it should be terminated.

Examples

  • Lower bounded ranges. Example: "This job will use at least 20kb of disk space. Make sure I'm allowed this much before letting me start the job."
    <IndividualDiskSpace>
    <LowerBoundedRange>20000</LowerBoundedRange>
    </IndividualDiskSpace>
  • Upper bounded ranges. Example: "This job should run for at most 60 seconds. Terminate the job if it runs for longer."
    <IndividualCPUTime>
    <UpperBoundedRange>60</UpperBoundedRange>
    </IndividualCPUTime>
  • Ranges with lower and upper bounds. Example: "Make sure I can use 5MB of virtual memory before starting my job, but terminate it if it uses more than 10MB."
    <IndividualVirtualMemory>
    <Range>
    <LowerBound>5000000</LowerBound>
    <UpperBound>10000000</UpperBound>
    </Range>
    </IndividualVirtualMemory>
  • Exact values. Example: "This job will use 20MB of physical memory. Don't let it start if I'm not allowed to use that much, and terminate the job if it tries to use more."
    <IndividualPhysicalMemory>
    <Exact>20000000</Exact>
    </IndividualPhysicalMemory>

Service provider overrides

The previous examples were all written from the point of view of a user submitting a job. The service provider can also use JSDL to enforce policies on jobs that are submitted to his service. These types of service-provider policy are specified in the webadmin interface, and they are specific to each application.

  • Example: "Jobs submitted to the swirl application are not allowed to consume more than 5MB of disk space"
    The service provider would enter the following XML into the "Resource Manager Directives" field in the swirl properties page:
    <IndividualDiskSpace>
    <UpperBoundedRange>5000000</UpperBoundedRange>
    </IndividualDiskSpace>

If both the user and the service provider specify a range for the same resource type, the intersection between the two is used. For example:

Figure 1. Intersections between user-specified and service provider constraints
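The intersection of two ranges can be sketched as follows, with each range written as a (lower, upper) pair and None standing for a bound that is not set. How the Job Service treats ranges that do not overlap is not described here, so raising an error in that case is an assumption of this sketch.

```python
def intersect(user_range, provider_range):
    """Intersect two (lower, upper) ranges; None means that bound is not set."""
    lowers = [b for b in (user_range[0], provider_range[0]) if b is not None]
    uppers = [b for b in (user_range[1], provider_range[1]) if b is not None]
    # The tightest bounds win: highest lower bound, lowest upper bound.
    lower = max(lowers) if lowers else None
    upper = min(uppers) if uppers else None
    if lower is not None and upper is not None and lower > upper:
        raise ValueError("ranges do not overlap")
    return lower, upper
```

For example, a user's lower bound of 20000 bytes of disk space combined with the provider's upper bound of 5000000 bytes gives the range (20000, 5000000).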

4.5.4. Support for Resource elements

Not all Resource elements are supported by all Resource Manager plugins

Not all of the Resource Manager plugins bundled with the GRIA Job Service can support every type of Resource element defined in the JSDL specification. JSDL RangeValues are implemented only for UpperBoundedRange and LowerBoundedRange. The following table gives an overview of which elements can be supported at each stage of policy enforcement.

JSDL Element Name LocalExecution TorquePBS Condor
A B C D A B C D A B C D
CandidateHosts
ExclusiveExecution
OperatingSystem
CPUArchitecture
IndividualCPUSpeed
IndividualCPUTime
TotalCPUTime
IndividualCPUCount
TotalCPUCount
IndividualNetworkBandwidth
IndividualPhysicalMemory
TotalPhysicalMemory
IndividualVirtualMemory
TotalVirtualMemory
IndividualDiskSpace
TotalDiskSpace
DiskSpace
ResourceCount
POSIX Extensions

The "POSIX Extensions" row refers to the POSIXApplication element inside the Application section of the JSDL. Technically this is not a Resource element, but its purpose is similar. Details about which POSIX constraints are supported can be found in the Supported Elements page.

Legend

A job creation time - initial check on user's SLA using the RangeResource's lower bound
B job submission time - selecting which RMs are suitable for the running of a job using the RangeResource's lower bound (see step 2 of The Selection Process for more information)
C job submission time - RM directives to govern node selection and job execution
D runtime - usage report generation
Not supported
Supported on Linux only
Supported on Linux and Windows

4.6. Resource Managers

The GRIA Job Service can interface with a number of Resource Managers. This section describes how to configure it to do this.

4.6.1. Overview

Why are resource manager plugins needed? What do they do?

The GRIA Job Service does not access resource managers directly to submit and check jobs. Instead, it uses an extra layer of resource-manager-dependent scripts. For each resource manager, GRIA requires a separate Resource Manager (RM) Connector Plugin. This extra layer of platform-dependent scripts decouples the GRIA Job Service from resource managers and applications.

GRIA defines the RM Connector Plugin API to handle tasks such as:

  • Submitting jobs
  • Checking the status of a job
  • Checking the resource usage of a job
  • Terminating a job

The Job Service can then be configured to use RM Connector Plugins suitable for the underlying computing platform (or resource manager). The plugins then know how to handle (start, check, kill) jobs for that particular computing platform, and can be instructed to run a particular application via its application wrapper.

Figure 1: How Resource Managers fit in to the Job Service architecture.

4.6.2. The Selection Process and RMSelector.py

How does the Job Service know which RM plugin to use for a job?

The GRIA Job Service can be configured to use any number of Resource Managers. Deciding which one to use for a submitted job is done as a three-step selection process, outlined below.

In the screenshots below, we use five made-up Resource Manager Plugins - RM1, RM2, RM3, RM4 and RM5.

Figure 1: The decision process for selecting a Resource Manager

The Job Service starts by compiling a list of all the RM Connector Plugins that are installed and enabled. You can see this list by looking at the main Job Service administration page (see Figure 2 below) - all the plugins in the list that are not greyed out are enabled, and will be used in this selection process.

Figure 2: A list of enabled plugins

Step 1: Filter by Application

When the service administrator deploys a new application on the job service, he can indicate which resource managers have the application installed. Any resource managers not selected will be immediately excluded from the selection process, and no jobs for that application will be able to run on them.

Figure 3: Selecting resource managers for an application

Step 2: Filter by Resources

At this stage, the Job Service asks each RM Connector Plugin whether it has enough resources to run the job being submitted. The plugins will typically look at the resource requirements section of the JSDL, and then query the actual resource manager to see if it is able to run the job. See Writing Custom Resource Manager Plugins for details of the canRunJob Python function.

Some checks that the plugins might do include:

  • Checking whether the operating system and system architecture requested by the submitter is available on any of the compute nodes.
  • Checking whether there is a compute node with enough memory to run the job.

In our example, RM3 does not have enough resources to run the job and is excluded from further steps.

Note that support for this feature in the current set of RM plugins is quite limited. See column B in Support for Resource Elements for more information.

Step 3: Objective Function

The final decision as to which Resource Manager will be used for a job is made by a Python script - RMSelector.py. The default implementation of this function is to just choose the first plugin available, but administrators can override this behaviour.

The Python interface for selecting a plugin is very simple. The Job Service will look for a Python function called selectPlugin, and call it with two arguments:

  1. job - a Job object, containing information gathered from the JSDL including the job's name and its resource requirements.
  2. plugins - a list of RMConnector derived objects - representing all the plugins that made it to Step 3 in the selection process. The selectPlugin function is expected to return one of these objects.

A very simple example is given below. It selects the plugin whose class is named "RM4", and returns None if that plugin is not among the candidates.

#!/usr/bin/env python

def selectPlugin(job, plugins):
    for plugin in plugins:
        if plugin.__class__.__name__ == "RM4":
            return plugin
    return None
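
A slightly more defensive variant is sketched here. It prefers plugins by class name in a fixed order and otherwise falls back to the first plugin offered, which mirrors the default behaviour described above. It assumes only the selectPlugin(job, plugins) interface. The PREFERRED list and the class names in it are illustrative, not part of GRIA.

```python
# Illustrative preference order; replace with your own plugin class names.
PREFERRED = ["RM4", "RM2"]


def selectPlugin(job, plugins):
    """Return the most preferred available plugin, or the first one offered."""
    byName = {}
    for plugin in plugins:
        # Keep the first plugin seen for each class name
        byName.setdefault(plugin.__class__.__name__, plugin)
    for name in PREFERRED:
        if name in byName:
            return byName[name]
    # Fall back to the first plugin that reached Step 3, if any
    if plugins:
        return plugins[0]
    return None
```

With this version, a job is still placed when the preferred Resource Manager was filtered out in an earlier step.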

Once the administrator has written this script, they can instruct the Job Service to use it by entering its path in the configuration page:

Figure 4: Using another RMSelector

4.6.3. Using Condor

The GRIA Job Service can submit jobs to Condor clusters. Here's how.

Configuring Condor

This section assumes you already have a working Condor installation. If not, you can follow the installation guide in the Condor manual.

Figure 1: Typical Condor setup

Figure 1 shows a typical Condor/GRIA setup. Condor should be installed on the machine running the GRIA Job Service, and it should be allowed to submit jobs to the Condor Central Manager. To do this, you need to add the machine's hostname to the global Condor configuration file.

For example:

HOSTALLOW_WRITE = submit1.your.domain, submit2.your.domain, griaserver.your.domain

Configuring GRIA

Setting up the Condor plugin in the GRIA Job Service is simple. First click the Configure link next to Condor on the main admin page. This will open up the Condor configuration page:

Figure 2: Configuring the Condor plugin

Enter the paths to Condor's installation directory and its configuration file, then press Save configuration.

Customising Job Submissions

The job description template used for jobs submitted by GRIA is quite simple:

universe        = vanilla
executable = frame.py
shell = /bin/bash
log = loguser
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue

# Resource constraints
###RESOURCE_CONSTRAINTS###

(Note that ###RESOURCE_CONSTRAINTS### will be replaced automatically by the Job Service when a job is submitted)

This template is located inside the webapp (TOMCAT_DIR/webapps/gria-basic-app-services/WEB-INF/rm-connectors/plugins/CondorTemplate.jdf), and if you need to modify it you have two choices:

  1. Change the template file inside the webapp. The disadvantage of this is that your changes will be overwritten if you redeploy or upgrade the GRIA Job Service.
  2. Copy the plugin and template files to a location outside the webapp and then modify them. You will have to change the name of your new plugin copy so that the Job Service can distinguish it from the original. To do this open up Condor.py and change
    class Condor(RMConnector):
    to something like
    class CondorCopy(RMConnector):
    You can leave the filename the same, or change it if you prefer. You should now enter the directory containing your new plugin in the Job Service configuration page.

Condor on Windows

Condor needs to switch to the user account of the submitter whenever it runs a job. This is straightforward on UNIX when Condor runs as root but, on Windows, knowledge of the user's password is required even when running at the maximum privilege level.

The GRIA Job Service runs under Tomcat, which by default runs as the NT Local System user. In this configuration, GRIA is not able to submit jobs to Condor, as the Local System user does not have a password and cannot have one set. It is recommended that you create a separate user account for the GRIA Job Service, and have Tomcat run as that user.

4.6.4. Using Torque PBS

The GRIA Job Service can submit jobs to Torque PBS clusters. Here's how.

Configuring Torque

This section assumes you already have a working Torque installation. If not, you can follow the installation guide in the Torque Admin Manual.

Figure 1: Typical Torque setup

Figure 1 shows a typical Torque/GRIA setup. Torque should be installed on the machine running the GRIA Job Service, and it should be allowed to submit jobs to the machine running pbs_server. To do this, you need to add the machine's hostname to the list of allowed submit hosts with the following command:

qmgr -c 'set server submit_hosts += griaserver.your.domain'

Configuring GRIA

Setting up the Torque PBS plugin in the GRIA Job Service is simple. First click the Configure link next to TorquePBS on the main admin page. This will open up the Torque configuration page:

Figure 2: Configuring the TORQUE PBS plugin

Enter the path to Torque's installation directory, then press Save configuration.

Customising Job Submissions

The job description template used for jobs submitted by GRIA is quite simple:

## PBS directives
#PBS -N """JOB_NAME"""
#PBS -j oe
"""PBS_DIRECTIVES"""

(Note that """JOB_NAME""" and """PBS_DIRECTIVES""" will be replaced automatically by the Job Service when a job is submitted)
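
The substitution can be illustrated with a short sketch. This is not the plugin's actual code: the fillTemplate function and the example job name are hypothetical, and how the Job Service derives its directives from the JSDL is not shown in this guide. The two "-l" resource requests are ordinary PBS directive syntax, used here only as sample values.

```python
# The template text as shown above, with both placeholders intact
TEMPLATE = '''## PBS directives
#PBS -N """JOB_NAME"""
#PBS -j oe
"""PBS_DIRECTIVES"""
'''


def fillTemplate(template, jobName, directives):
    """Replace the two placeholders with a job name and directive lines."""
    # Each directive becomes its own "#PBS ..." line
    directiveText = "\n".join("#PBS " + d for d in directives)
    filled = template.replace('"""JOB_NAME"""', jobName)
    return filled.replace('"""PBS_DIRECTIVES"""', directiveText)


script = fillTemplate(TEMPLATE, "render-frame", ["-l mem=1gb", "-l nodes=1"])
```

If you add your own fixed directives to a copy of the template, keep both placeholders in place so that the substitution still has somewhere to write.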

This template is located inside the webapp (TOMCAT_DIR/webapps/gria-basic-app-services/WEB-INF/rm-connectors/plugins/TorquePBSTemplate.jdf), and if you need to modify it you have two choices:

  1. Change the template file inside the webapp. The disadvantage of this is that your changes will be overwritten if you redeploy or upgrade the GRIA Job Service.
  2. Copy the plugin and template files to a location outside the webapp and then modify them. You will have to change the name of your new plugin copy so that the Job Service can distinguish it from the original. To do this open up TorquePBS.py and change
    class TorquePBS(RMConnector):
    to something like
    class TorquePBSCopy(RMConnector):
    You can leave the filename the same, or change it if you prefer. You should now enter the directory containing your new plugin in the Job Service configuration page.

4.6.5. Writing Custom Resource Manager Plugins

System administrators can write their own RM Connector Plugins to interface with other Resource Managers.

What are they?

RM Connector Plugins are classes written in Python that handle all communication with the Resource Manager. The GRIA Job Service comes with three RM Connector Plugins that can be used as examples. These can be found inside the webapp:

TOMCAT_DIR/webapps/gria-basic-app-services/WEB-INF/rm-connectors/plugins

When writing your own plugins, you should not put them inside the webapp directory as they will be lost whenever you redeploy or upgrade the GRIA Job Service. Instead, place them in a new directory outside the webapp (e.g. /opt/gria/rm-connectors) and enter this path into the Job Service configuration:

Figure 1: Setting additional plugin directories

Plugins usually consist of one Python script (.py) and one or more template files. These templates are used when creating new jobs: values are substituted into them and the results are written into the new job directory. The Condor plugin consists of the following files:

  • Condor.py - the main Python script. Contains functions for interacting with the Resource Manager.
  • CondorTemplate.frame - template for job wrapper "frames".
  • CondorTemplate.jdf - template for new Job Description Files.

A Sample plugin

The code listing below is a sample RM Connector Plugin that you can use as a starting point for writing your own. It describes what you need to do to implement the five main functions in any plugin:

  • submit - submits a new job to the resource manager.
  • jobStatus - checks whether a job is still running.
  • jobUsage - gets usage reports.
  • killJob - terminates a job.
  • canRunJob - checks if the RM can run the requested job.

For more complete documentation on the API, see RMConnector.

#!/usr/bin/env python -tt
# -*- coding: UTF-8 -*-

from RMConnector import RMConnector, ScriptNotFound
from Logger import logsys, loguser
import platformUtils

class Sample(RMConnector):
	MIN_API_VERSION=1

	def submit(self, job):
		RMConnector.submit(self, job)

		logsys.info("Executable we're running is " + job.executableName)
		logsys.info("Our arguments are " + repr(job.arguments))

		# TODO: Submit the job to the resource manager

	def jobStatus(self):
		# TODO: Check whether our job is really running.  If not, write an
		# empty file called .FAILED in the current directory.

		pass

	def jobUsage(self, appWrapperDir):
		appUsageReports = RMConnector.jobUsage(self, appWrapperDir)

		# TODO: Get any usage information from the RM, and append it to
		# appUsageReports

		return appUsageReports

	def killJob(self, appWrapperDir):
		# Try using the application wrapper scripts to kill the job gracefully
		try:
			RMConnector.killJob(self, appWrapperDir)
		except ScriptNotFound:
			pass

		# TODO: Check if the job is still running.  If it is, it means there
		# wasn't an application wrapper script, or it couldn't kill the job.
		# In that case, kill the job forcefully.

		# Write a file telling GRIA we killed the job
		platformUtils.writeToFile(".KILLED", "")

	def canRunJob(self, job):
		# TODO: Check if this RM has enough resources to run the job.
		# If we can't, return False with a reason, eg:
		# return (False, "Not enough memory")
		return (True, None)
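
The jobStatus TODO above asks for an empty .FAILED file when the job is found not to be running. One way to write that marker is sketched below. The markFailedIfNotRunning helper is hypothetical and not part of the RMConnector API. How you determine whether the job is running depends entirely on your resource manager, so that result is passed in as an argument.

```python
import os


def markFailedIfNotRunning(isRunning, jobDir="."):
    """Create an empty .FAILED file in jobDir when the job is not running.

    isRunning is the result of whatever check your resource manager offers;
    obtaining it is specific to each RM and is not shown here.
    Returns True if the marker was written.
    """
    if isRunning:
        return False
    markerPath = os.path.join(jobDir, ".FAILED")
    # Opening in write mode and closing immediately leaves an empty file
    open(markerPath, "w").close()
    return True
```

Inside a real plugin, jobStatus could call a helper like this with the current directory once it has queried the resource manager.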

4.7. Standard CPU time

How to adjust reported CPU time depending on the performance of the individual node on which a job is run.

What is Standard CPU time?

The GRIA Job Service receives usage reports about the current CPU utilisation of running jobs. It forwards these usage reports to the SLA service so that users can be billed according to how much CPU time they are using on compute nodes.

If the Job Service is configured to submit jobs to a heterogeneous cluster (i.e. consisting of machines with different specifications), the administrator might want users to be billed more for CPU time if their jobs are executed on faster machines. The GRIA Job Service can adjust the amount of reported CPU usage according to the measured performance of the node on which the job is running.

Benchmarking each node

The process of recording the performance of each node is not automatic: the administrator must run a benchmark on each individual machine on which a job could be run. One standard CPU second is defined as 1 second of full CPU utilisation on a Pentium III 1GHz processor, which achieves 54.3 Mflop/s on the Linpack Java Benchmark.

From the machine you wish to benchmark, enter the URL http://www.netlib.org/benchmark/linpackjava into a browser (ensuring that Java has been enabled).

This will load an applet with a window similar to the one below:

Figure 1: Benchmark applet

Click the Press to Run Benchmark button, at the top of the applet window, to calculate your machine's performance.

Figure 2: Benchmark output

To find the power of your machine, look at the Mflop/s reading (highlighted above).

This number needs to be stored in a file: either /etc/gria/benchmark (Linux) or C:\gria\benchmark.txt (Windows). On Linux, you can create this file using the following command (supplying your value instead of 54.721):

echo "54.721" > /etc/gria/benchmark
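
As a rough illustration of how a stored benchmark value might translate into standard CPU time, the sketch below scales measured CPU seconds linearly against the 54.3 Mflop/s reference. The linear formula is an assumption made for this example; this guide does not state the exact calculation the Job Service performs. The function names are hypothetical.

```python
REFERENCE_MFLOPS = 54.3  # Pentium III 1GHz on the Linpack Java Benchmark


def parseBenchmark(text):
    """Parse the contents of the benchmark file into a float (Mflop/s)."""
    return float(text.strip())


def standardCpuSeconds(measuredSeconds, nodeMflops):
    """Scale measured CPU seconds linearly by node speed (assumed formula)."""
    return measuredSeconds * (nodeMflops / REFERENCE_MFLOPS)
```

Under this assumption, 100 CPU seconds on a node that benchmarks at 108.6 Mflop/s would count as about 200 standard CPU seconds, while the same time on the reference machine would count as 100.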