Basic Application Services User Guide
This guide describes how to use the Basic Application Services package to provide data storage and processing (using applications installed on a cluster) to trusted users.
1.
Overview
Overview of the Basic Application Services
The GRIA Basic Application Services package provides the core functionality for job and data management. It consists of:
- A Data Service
- This allows remote users to upload and download data
files to the service provider, and to transfer data between Data
Services hosted by different service providers. The Data Service also supports
management of access rights (for read or read-write access) granted to
other users or service providers.
- A Job Service
- This allows remote users to start, monitor or kill computational
jobs, executed by the service provider. The Job Service will fetch input from
and write output to a local Data Service. The Job Service can be configured to support
multiple applications, which are chosen by the service provider.
The application services can be configured to be either unmanaged (free) or
managed by the GRIA Service Provider Management package, as in the
diagram below.
2.
Installation
GRIA Basic Application Services Installation
Standard Installation Procedure
The Basic Application Services package is provided as a zip file (Windows) or tar.gz (Linux). Unpack the archive and you will find the following items:
- docs (folder)
- gria-basic-app-services.war
- README.html
Install the war file according to the Service Installation Manual. Once the initial configuration has been completed, the Basic Applications Package requires some extra configuration.
Additional Pre-requisites
Some additional pieces of software need to be installed, as follows.
Python
Python version 2.4 or higher is required in order to run the Basic Application Services correctly. For Windows, the most common Python implementations are Cygwin Python and ActiveState Python. We recommend that you install ActiveState Python—choose the latest release, click the "Next" button, and then download the MSI
distribution for Windows. To fully complete the installation you must restart Windows.
Test Application: ImageMagick
ImageMagick is the default test application used in GRIA. ImageMagick
binaries for Windows can be downloaded from here.
Use a Q8 version, e.g. ImageMagick-6.3.7-0-Q8-windows-dll.exe.
The ImageMagick distribution package for Windows is self-extracting and the installation procedure starts
automatically. Follow the instructions and select the default options.
Note: older versions of ImageMagick might not support all of the basic application service's default examples, e.g. blend.
Continuing the Installation
Once the additional pre-requisites have been installed, the installation can be continued by following the instructions for each service.
3.
The Data Service
Overview and Configuration of the Data Service
Overview
The GRIA Data Service is used to manage "data stagers". A data stager is a container for a single file (or zip file). It has a unique identifier and an access control system for determining who can read and write the data. Clients can use the service to create
new stagers, upload and download data, transfer data between stagers, and control others' access to
the data.
Two items of configuration must be given before the Data Service can be used:
- The location of the root data directory
- The service stores any uploaded data inside this directory. If the Data Service is going to be used with a Job Service and jobs are going to execute on a cluster then the cluster's machines need to be able to read and write to this directory.
- A list of trusted management services
- Normally, you can just click Add
to accept the default management service. This is the SLA management
service from the GRIA Service Provider Management package. Note that if
the GRIA Basic Application Services package is deployed on a different
machine to the GRIA Service Provider Management package, some
additional access control setup is required. This is described in the Links with Other Services section of the Service Provider Management user guide. As an alternative to configuring the service to be managed, you can make it unmanaged (or "free"), by clicking the Make service free button.
4.
The Job Service
The Job Service
4.1.
Overview
Overview of the GRIA Job Service
The GRIA Job Service is used to manage jobs. Clients can use the service to create new jobs, upload input data, start the job, monitor progress, and download results.
Each input and output of a job is actually a data stager managed by the local Data Service (the one in the same .war as the Job Service). Therefore, you must configure the Data Service before the Job Service can be used. Users can run jobs that take input from or send output to other Data Services by using the normal data transfer features provided by the Data Service.
Job Service Architecture
The GRIA Job Service architecture is flexible enough to run jobs on a variety of underlying computing
platforms, from single computers to clusters of workstations
or even supercomputers.
In order to achieve this flexibility, the GRIA Job Service accesses resources indirectly via its RM connector scripts, which decouple
the GRIA Job Service from resource managers and applications.
The following sections of this document give an overall
picture of the various components of the GRIA Job Service, and then describe
some common deployment scenarios.
Components of the GRIA Job Service
The GRIA Job Service is separated into several distinct components: (colours relate to the above diagram)
-
The Resource Manager Connectors - The GRIA Job Service can submit jobs to different resource managers such as TorquePBS and Condor, or even run them on the local machine using the LocalExecution plugin. It is able to do this thanks to the Resource Manager Connector layer - a plugin architecture written in Python that abstracts away resource manager-specific details and presents a single interface for submitting and monitoring jobs. Service providers can configure the Job Service to use the existing Resource Manager Plugins or they can write their own to interface with custom configurations.
Version 5.2 of the GRIA Job Service can have any number of RM Connector Plugins loaded at the same time. Selecting which one to use for an individual job is done in a series of steps, and is explained in The Selection Process chapter.
-
The Application Wrapper Scripts - Each application deployed on the GRIA Job Service needs to have a couple of small wrapper scripts installed alongside it. These scripts are responsible for providing the application with the correct files from the shared filesystem, and for making sure the outputs from the application are written or copied back to the correct location. Optional wrapper scripts can also be written to cancel a job gracefully and to report progress and usage information specific to that application (e.g. frames rendered by a graphics package).
-
The Shared Filesystem - When the user creates a job, the GRIA Job Service creates a directory for it on the shared filesystem. The administrator should ensure that this directory can be read from and written to by both the Job Service running on the server, and the application wrapper scripts running on the execution nodes. The structure of this scratch directory is as follows:
- logsys - Log file for the system administrator. Contains information about the RM connector plugins and resource constraints.
- loguser - Log file for the user. Contains the stdout and stderr from the job executable.
- work/ - Working directory in which the job executes.
- work/inputs/ - Directory containing the named inputs for the job.
- work/outputs/ - Directory to which the application wrapper scripts should write the job's outputs.
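The scratch layout listed above can be sketched in Python as follows. This is only an illustration of the directory structure: the function name `create_scratch_dir` and the way the job identifier is passed in are assumptions, not part of the GRIA Job Service.

```python
import os


def create_scratch_dir(root, job_id):
    """Create the scratch layout for one job (illustrative helper only)."""
    job_dir = os.path.join(root, job_id)
    # work/inputs and work/outputs; makedirs also creates job_dir and work/.
    for sub in ("inputs", "outputs"):
        os.makedirs(os.path.join(job_dir, "work", sub))
    # Empty log files: logsys for the administrator, loguser for the user.
    for log_name in ("logsys", "loguser"):
        open(os.path.join(job_dir, log_name), "a").close()
    return job_dir
```

Both the Job Service and the compute nodes would need access to the returned directory, which is why it has to live on the shared filesystem.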
When configuring the Job Service, the administrator must be careful to ensure that the different files and executables can be accessed by the correct components in the system.
- The application executables should be accessible by compute nodes only, and they should be read only. Installation of the applications can be either local per compute node or over disk space shared among all nodes.
- The application wrapper scripts should be accessible by both the compute nodes and the Job Service. Like the application executables they should be read only, and can either be installed locally or part of a shared filesystem.
- The RM connector scripts should be accessible by the GRIA Job Service middleware.
- The job's scratch directory should be accessible by both the compute nodes and the Job Service. The Job Service must have write permission on the entire directory, but applications themselves only need to access the work/ subfolder. The application wrapper scripts could set up a chroot jail inside this directory to ensure nothing else on the filesystem can be accessed. Note that the scratch area cannot be copied between compute nodes; instead it must be exported as a shared disk space.
Local Execution Deployment
This is a typical minimum configuration scenario: the Job Service and the job execution run locally on the same machine.
Figure 2 shows how GRIA can be configured to run applications locally on the server machine running Tomcat. Because of its simplicity and minimal configuration, this deployment is commonly used for demonstrations and testing. This is the default configuration for the GRIA Job Service assuming the administrator does not set up the TorquePBS or Condor plugins. Note that it is not advisable to leave the GRIA Job Service configured like this in a production environment.
4.2.
Basic Configuration
The Job Service
To configure the Job Service you will need to specify:
- The location of the root job directory. The service creates one subdirectory inside this directory for each job. This directory must be on the same file-system as the Data Service's data directory so that data files can be hard-linked efficiently as jobs are started and stopped. Otherwise, the data must be copied which is slow and wastes disk space.
- A list of trusted management services. Normally, you can just click Add
to accept the default management service. This is the SLA management
service from the GRIA Service Provider Management package. Note that if
the GRIA Basic Application Services package is deployed on a different
machine to the GRIA Service Provider Management package, some
additional access control setup is required. This is described in the Links with Other Services section of the Service Provider Management user guide. As an alternative to configuring the service to be managed, you can make the service unmanaged (free) by clicking the Make service free button.
- A list of applications which users can run using the service. See the Managing applications section for details.
The other fields on the configuration page are optional and are used for connecting the Job Service to computational clusters.
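The reason for keeping the root job directory and the data directory on the same file system can be illustrated with a short sketch. Hard links can only be created within a single file system, so a link attempt across file systems fails and the data has to be copied instead. The helper `stage_file` is hypothetical and is not necessarily how the Job Service is implemented.

```python
import os
import shutil


def stage_file(src, dst):
    """Hard-link src to dst, falling back to a copy (hypothetical helper).

    Returns "linked" or "copied" so the caller can see which path was taken.
    """
    try:
        os.link(src, dst)  # raises OSError when src and dst are on different file systems
        return "linked"
    except OSError:
        shutil.copyfile(src, dst)  # slower, and uses extra disk space
        return "copied"
```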
4.3.
Managing Applications
Managing applications
This section covers the administrative tasks of deploying and
undeploying applications. After deploying an application to your job
service, it becomes available for execution by remote clients. Note,
however, that clients usually must satisfy additional business
constraints, such as having an appropriate service level agreement or
account, before they can execute deployed applications.
The GRIA Basic Application Services software is provided with a set
of tutorial applications. These are made available during installation
of the software. The web based Administration Interface
provides the location of these files during the installation process, and guides you through the simple process of application
deployment.
In addition to the tutorial applications, it's straightforward to develop new applications and deploy these in the same way.
This section assumes that you have all necessary files for application
deployment and that any required executable applications
have been installed according to the application documentation. If you
are installing the tutorial applications, you have all the files you
need. If you are deploying your own applications, first see Writing Application Scripts
for details of the files you need to produce before application deployment.
Deploying applications
Having obtained or created the files and scripts needed for
application deployment, the application can now be deployed to the job
service. To do this, make sure that Tomcat is running then, using a web
browser, log into the GRIA Basic Application Services administration page. This can be found at
http://<servername>:8080/gria-basic-app-services.
Make sure you enter the appropriate security credentials and
adjust the URL for the administration page, according to your server
setup.
From the administration page, select the Job Service link, as shown below.
This displays the Job Service Admin page. In the Applications section, enter the location of the directory containing
the files and scripts needed for deployment. Then click the Deploy new application button.
This displays the Application properties for the application. You may optionally enter arguments for the
application wrapper script, which will be supplied in addition to any user-specified arguments. Select the preferred resource manager for this application (LocalExecution is the default). Finally, resource manager directives may be supplied, as a JSDL (XML) fragment (see example provided).
Once all application properties have been set, click the Accept button.
This completes deploying an application to the Job Service.
Undeploying applications
Undeploying an application is straightforward. First click the Edit button alongside the entry for the application
you wish to undeploy.
Click the Undeploy button to undeploy the application.
4.4.
Application Wrapper Scripts
How to write the scripts and meta-data file to integrate your application.
4.4.1.
Overview
Applications and The Job Service
Relation to the Job Service
The application wrapper scripts are deployed along with the applications - either on a shared file system or individually on each compute node. Their function is to provide a uniform interface to the Job Service for starting, monitoring and stopping applications.
The application model
The Job Service and RM connector scripts are designed to support a uniform model of application execution, shown in Figure 2:
The workspace for each job is set up by the Job Service when the job
is initialised (this is one of the bookkeeping service operations that
must precede the call to start the job). The workspace has a standard
directory structure so the Job Service and RM connector scripts can create
and find information stored in it, including a "work" sub-directory
where the job will actually run.
When the user starts the job, the Job Service transfers input data
files from Data Service URIs into the job's workspace. It then runs the
RM connector script which submits the application to the
execution platform (cluster, etc) where it will run (using the
specified command line), possibly after some queuing delay. The application
reads input data deposited in its workspace by the Job Service,
and writes outputs back to the workspace once the job has finished.
The Job Service can then find the outputs and transfer these to the correct output Data Service URIs.
When the user asks for the status of the job, the service queries the
Resource Manager for status information (e.g. when the job
started or finished, etc). The service may also run an associated monitoring
application (using the specified command line) to gather
application-specific status information (e.g. number of iterations
completed, convergence plots, etc) from the workspace.
If the user asks for the job to be killed, the service issues a command to
kill the job on the execution platform. The Job Service will detect that the
job has finished, and will transfer any output produced. Note that the user
(or their client-side application) should always check the status of a job to
find out if it crashed or was killed, as some incomplete output may appear in
the latter case.
Why are Application Wrapper Scripts Required?
In practice, few legacy applications behave exactly according to the
model shown in Figure 2. It is rarely possible to change the
application itself to fix this, so instead GRIA uses so-called wrapper
scripts that do conform to the application model for starting and
managing the application.
In practice, the wrapper scripts can do more than just make the
underlying application work as indicated in Figure 2. They can also be
used to handle and implement application specific features of the
service.
One can also use (optional) wrapper scripts to look for
application-specific status information in the working directory of the
job. Without such scripts, the service can only obtain basic
job status information from the job submission system.
Finally, wrapper scripts also provide a configurable mechanism for
dealing with any application-specific security risks, e.g. checking for
malicious input that may exploit a feature of the application. Few
legacy applications were designed as network-accessible services, and
since we can't change them to remove security loopholes, the use of
wrapper scripts is essential to check for any exploits of application
vulnerabilities. In the limit, one can configure the wrapper (and
RM connector) scripts to run the application in a sandbox (e.g. chroot),
with access only to a working sub-directory of the job workspace.
4.4.2.
startJob Wrapper Script
The startJob.pl Application Wrapper Script
Language
Like all other application wrapper scripts, startJob can be written in any scripting language supported by the host OS.
- For Linux, the first line of the script (e.g.
#!/usr/bin/python) is used to determine which interpreter to use. The filename extension can be anything (e.g. startJob.py, startJob.sh).
- On Windows, the filename extension is used to determine which interpreter to use. Currently only Python (.py) and Perl (.pl) are recognised.
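The two rules above can be sketched as follows. This illustrates the selection logic only; the function name `choose_interpreter` and the interpreter command names `python` and `perl` are assumptions, not taken from the Job Service itself.

```python
import os

# Extensions the text lists as recognised on Windows (command names are assumed).
WINDOWS_INTERPRETERS = {".py": "python", ".pl": "perl"}


def choose_interpreter(script_path, on_windows):
    """Return the interpreter for a wrapper script, or None if it cannot be determined."""
    if on_windows:
        ext = os.path.splitext(script_path)[1].lower()
        return WINDOWS_INTERPRETERS.get(ext)
    # On Linux the "#!" line names the interpreter; read it from the file.
    with open(script_path) as f:
        first_line = f.readline().strip()
    if first_line.startswith("#!"):
        return first_line[2:].strip()
    return None
```

Under these assumptions a script named startJob.sh would not be runnable on Windows, because its extension is not one of the two recognised ones.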
Application Wrapper Functionality
The startJob application wrapper script is mandatory. It handles everything that is specific to the application, allowing
the Job Service to treat all applications in the same way, and so decoupling
the Job Service from the details of the application.
The main functions of the application wrapper script are:
- handling input and output data files;
- setting up an environment (i.e. environment variables) that is suitable to run the application;
- enforcing any security precautions to protect against loop-holes in the application;
- running the application itself.
The application wrapper is designed to run on the execution platform, having
been submitted by the RM connector scripts for starting a job. Prior to submitting
the wrapper script on the execution platform, the Job Service will have set up
a workspace (directory) for the job, copied input data into it e.g. work/inputs,
and created a working sub-directory for the job to run in, e.g. work.
The following listing shows a workspace directory structure with two input
files and an empty outputs directory.
ff808081-1017450e-0110-174532dd-0001-1
`--work
|-- inputs
| |-- namedinput
| |-- arrayinput-0
| `-- arrayinput-1
`-- outputs
After changing to the workspace directory (not the working
sub-directory), the wrapper will be submitted using the following
command line:
app-wrapper <application arguments>
The functionality of the wrapper script should include the following:
1. Parse wrapper arguments, including security checks for illegal input designed to inject malicious commands into the command line used to launch the application.
2. Move input data files into the working sub-directory, including unpacking any that are compressed archives containing multiple inputs.
3. Create a consistent environment in the working directory, by setting up environment variables and rewriting input data to match the local environment where necessary.
4. Build the command line and run the underlying application.
5. Copy output files from the working directory into the output directory, including packing multiple outputs into compressed archive files where necessary.
6. Exit by returning the exit code of the application.
For simple applications, security can be maintained by checking
input parameters during step 1 and, if necessary, data files during step
3. If the application is too complicated for this to be reliable, it
may also be necessary to set up a sandboxed working environment and run
the code inside it during step 4.
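The six steps listed above can be sketched as a single function. This is a minimal outline under stated assumptions, not a complete wrapper: `run_wrapper` and its parameters are hypothetical names, the environment variable `JOB_WORK_DIR` is made up for illustration, and argument checking (step 1) is assumed to have been done by the caller.

```python
import os
import shutil
import subprocess


def run_wrapper(workspace, command, input_names, output_names):
    """Run wrapper steps 2 to 6 for one job and return the application's exit code."""
    work = os.path.join(workspace, "work")
    # Step 2: move input files into the working sub-directory.
    for name in input_names:
        shutil.copyfile(os.path.join(work, "inputs", name), os.path.join(work, name))
    # Step 3: set up the environment (JOB_WORK_DIR is a made-up variable).
    env = dict(os.environ, JOB_WORK_DIR=work)
    # Step 4: run the application; command is a list, so no shell parses it.
    exit_code = subprocess.call(command, cwd=work, env=env)
    # Step 5: copy any outputs that were produced into the outputs directory.
    for name in output_names:
        src = os.path.join(work, name)
        if os.path.exists(src):
            shutil.copyfile(src, os.path.join(work, "outputs", name))
    # Step 6: hand the application's exit code back to the caller.
    return exit_code
```

A real startJob script would finish by passing the returned value to sys.exit so that the exit status reflects the outcome of the application.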
Some of these steps are considered in more detail below.
Input and Output Data Handling
Note: As of version 5.2 of the GRIA Job Service, applications can specify names for their inputs in the application metadata file. As a backwards compatibility measure, if the application metadata file still uses the old GRIA 5.1 format, inputs will be named numerically in the order they appear in the metadata (e.g. input-0, input-1, etc).
When unpacking input data, the application wrapper should attend to the following:
- Create any substructure needed in the job's working sub-directory of the workspace.
- Copy or unzip input files from the inputs sub-directory into the job's working space.
- Check that all input needed to run the job is present.
The Job Service knows in advance which output files must be returned to the outputs directory. The application wrapper must create these files by:
- Copying or zipping data to create the required output files in the outputs sub-directory
- Renaming these files to the names specified in the metadata (or the output-x naming scheme for legacy applications)
The Job Service will detect that the application wrapper script has finished and handle the transfer of output files accordingly.
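The unpacking and packing described above might look like the following sketch. The function names are hypothetical, and it assumes that multi-file inputs and outputs are zip archives; an application that uses another archive format would need different code.

```python
import os
import shutil
import zipfile


def unpack_input(inputs_dir, name, work_dir):
    """Unzip an input into work_dir if it is a zip archive, otherwise copy it."""
    src = os.path.join(inputs_dir, name)
    if zipfile.is_zipfile(src):
        with zipfile.ZipFile(src) as archive:
            archive.extractall(work_dir)
    else:
        shutil.copyfile(src, os.path.join(work_dir, name))


def pack_output(work_dir, file_names, outputs_dir, output_name):
    """Zip several result files into one output file with the required name."""
    dst = os.path.join(outputs_dir, output_name)
    with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_name in file_names:
            archive.write(os.path.join(work_dir, file_name), arcname=file_name)
    return dst
```

The name passed as `output_name` would be the one given in the application metadata (or the output-x form for legacy applications), since that is the file the Job Service looks for.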
Consistent Context Reconstruction
Why do we need context reconstruction? The input data for our
application has been created on another system with a different
directory structure, environment and possibly even operating system. We
have to set up an equivalent (not necessarily identical) environment on
our execution platform, and make sure any input data references to the
remote user's environment are mapped onto the one we have created, or
they will be invalid when the application is started.
When and where should context reconstruction be performed? One
should handle it as close as possible to the running application—certainly on the execution platform where the job will actually be run—as this is where the environment is needed. This is why the Job
Service doesn't attempt to create the context itself - there is no
point doing it at the service host if the job will be executed on a
compute node in a Condor cluster. Instead, we leave it to the
application wrapper to handle everything in an application specific way
on the execution platform itself.
A typical approach to context reconstruction might involve passing
an array of named parameters to the Job Service, including environment
settings as well as application flags. These will be passed to the
wrapper through its argument list. In addition, one can provide
settings in an extra input file, intended for the wrapper rather than
the job itself, and used to set up the environment prior to running the
application code.
The hardest job for the wrapper is to parse and rewrite application
input data where necessary to ensure it is consistent with the
environment established on the execution platform. If this is not
needed, it is usually quite easy to 'wrap' an application to run inside
the Job Service. Where it is necessary, the wrapper may become a
significant body of code in its own right.
For example, consider the following line of input intended for the
rendering application AIR, used with the Job Service to provide a
grid-enabled video rendering service:
Option 'searchpath' 'shader' ['&:e:\AnimalLogic\MaxMan\shaders:C:\Sample\shaders']
The problem here is that the application uses plug-ins to perform
part of the rendering calculation, and the search path for these can be
specified in the user input. This particular input file has been
generated automatically using a graphical environment for video
post-production, which has filled in the relevant path based on where
the shader libraries were installed on the user's local machine.
The wrapper has to identify which shaders are needed, and substitute the path to them on the local system:
Option 'searchpath' 'shader' ['&:/export/apps/AnimalLogic/MaxMan/shaders:/export/apps/air/Sample/shaders']
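A simple form of this substitution can be sketched with a lookup table. The table and the function name `rewrite_search_path` are assumptions for illustration; the two path pairs are the ones from the example lines above, and a real wrapper would need a mapping for every plug-in group it supports.

```python
# Hypothetical mapping from client-side install paths to local install paths.
PATH_MAP = {
    r"e:\AnimalLogic\MaxMan\shaders": "/export/apps/AnimalLogic/MaxMan/shaders",
    r"C:\Sample\shaders": "/export/apps/air/Sample/shaders",
}


def rewrite_search_path(line, path_map=PATH_MAP):
    """Replace each known client-side path in an input line with its local path."""
    for remote, local in path_map.items():
        line = line.replace(remote, local)
    return line
```

Plain string replacement only works when the client-side paths are known in advance, which is the limitation discussed next.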
In some cases, it may be possible to infer the meaning of
client-side environment references by pattern matching against a list
of meaningful terms used by the application. In others (probably in
this case), it is necessary for the user to send the install path
quoted for specific groups of plug-ins as service arguments or
environment settings, so the wrapper can find them and map them onto
the equivalent installed groups of components on the execution platform.
In extreme cases, it may be necessary to establish multiple services
to run the same application in different ways, allowing a different,
specific environment to be set up for each. For example, it probably
wouldn't make sense to have a single service to run a computational
fluid dynamics (CFD) code capable of simulating coolant flows through
automotive engines AND the propagation of drugs in aerosol suspension
in human lungs. It would be asking too much of a wrapper developer to
differentiate and correctly handle such extreme cases, and instead one
should set up two services each with its own wrapper specialised to one
of these scenarios.
Security Containment
Why Wrappers have to Bother with Security
The Job Service regards application wrapper scripts as trustworthy,
because the service operator can inspect them and make sure they don't
do anything strange or foolhardy. However, the applications may be
third party, closed source executables that cannot be inspected, and
were not designed as network-accessible services in the first place.
Wrapper scripts can protect the service from malicious users in three ways:
- checking any user input used to create the command line for
running the application, to exclude command injection attacks using
parameters like 'method=gauss; cd /; rm */*';
- checking
input data known to be used in an unsafe way by the application, e.g.
to construct system calls for executing plug-ins or moving files around;
- confining the application to a sandbox, by first preparing the sandbox and then launching the application in it.
If the application is very simple, or designed to withstand
malicious users, or if you have only a small number of users you know
well (and trust not to mislay their credentials) then it may be OK to
include only the first of these measures.
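The first measure, checking user input before it reaches the command line, can be sketched as an allow-list test. The policy shown here (arguments must be "name=value" pairs made of letters, digits, underscore, dot and hyphen) is an assumption chosen for illustration; each application needs its own rules.

```python
import re

# Assumed policy: "name=value" pairs built only from a small set of safe characters.
SAFE_ARGUMENT = re.compile(r"\A[A-Za-z0-9_.-]+=[A-Za-z0-9_.-]+\Z")


def check_arguments(args):
    """Return the arguments as a list if all are safe, otherwise raise ValueError."""
    for arg in args:
        if not SAFE_ARGUMENT.match(arg):
            raise ValueError("rejected argument: %r" % arg)
    return list(args)
```

With this policy 'method=gauss' is accepted, while 'method=gauss; cd /; rm */*' is rejected because it contains spaces, semicolons and slashes. Passing the checked list directly to the application, rather than building a shell command string, removes a further opportunity for injection.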
Legacy applications are quite likely to do things in unsafe ways.
Renaming files or testing if they exist are sometimes done via system calls.
This can be a potential security hole if the application developer wasn't
expecting filenames to be sent by a remote
user who may have malicious intent. If the application isn't too
complex, or if you can check with the developer on what might happen,
then it should be OK if you also check the user-supplied input,
filenames and other data that may be sent to unsafe system calls.
In the worst case, one has to assume the application will be unsafe,
and attempt to contain any damage caused by malicious (or possibly
careless) input by restricting what the application can do and where it
can do it. There are several possible ways to achieve such restrictions.
Chroot
On Linux systems, chroot can be used to restrict a sub-process to an
arbitrary sub-directory, e.g. a job's working directory. The chroot
mechanism was designed for use by operating system developers to allow
them to create a pseudo-root within which to test their code. While the
chroot container doesn't prevent access to low-level devices, it will
prevent most legacy applications accessing files outside the specified
sub-directory. Chroot is widely used to contain web servers and other
network applications to minimise the scope for damage if they are
compromised.
To use chroot, it is necessary to create a complete operating system
environment inside the job's working directory (which it will see as '/'). One has to copy application binaries, resolve any references to system/application libraries, create devices such as /dev/null,
etc. To create a self-sufficient chroot 'jail' environment sufficient
to run the application may not be easy, and of course, it would need to
be repeated for each individual job. However, it can provide a good
safety level as its 'jail' environment is enforced by the operating
system itself.
Restricted Shells
Many shells, including bash, provide a restriction mechanism usually
invoked by running the shell with the -r switch. Some common features
of restricted shells are the ability to prevent a program from changing
directories, to only allow the execution of commands using absolute
pathnames, and to prohibit executing commands in other subdirectories,
using command-line redirection operations, or changing the search path.
Minimal privilege accounts
Another approach is to create a low-privilege account for each job.
The wrapper script would then have to assign such an account, change
the working directory so it is owned by this account, and run the
application in that working directory under the same account. Provided
the same account is not used for anything else (including running other
jobs), the application can be prevented from accessing anything outside
the working directory, even if it can be induced to run some unforeseen
system call by sending some malicious input.
The two drawbacks with this approach are:
- ideally one should create a pool of accounts and provide a
way for the wrapper to assign them to jobs rather than creating new
accounts, but this isn't supported at present;
- the wrapper
would need sufficient privilege to set the account under which a
sub-process is run, which may make the wrapper more dangerous if it can
be compromised.
The second drawback may not be too bad, given that the wrapper at
least can be designed to check all inputs and avoid doing anything
unpleasant. At present, the Job Service runs with a normal unprivileged
user identity, so it may be better to use other methods to contain
individual jobs.
Other Methods
The above list is by no means exhaustive. For example, if the chroot 'jail'
is not sufficient, one can create an entire virtual machine on which to run
a potentially unsafe application. Software such as VMWare can be used to implement
this approach, but users who want to go to these lengths are on their own, at
least in this version of the software.
Error Handling
If an error is encountered in the application, the wrapper must report the
fact. If this is not done, the Job Service will assume everything is OK, and
the user's client application will probably attempt to continue, which may not
be appropriate if, for example, some output from the job is missing.
The application startJob wrapper should exit with an exit status of zero if the job
has completed successfully, or with a non-zero status if the job has failed. This value
will be stored in .exit_code by the RM Connector script. Generic
clients may stop executing a workflow, for example, if this result is not zero.
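One way to decide the wrapper's exit status is sketched below. It is an illustrative policy rather than something the Job Service requires: the function name `final_exit_status` is hypothetical, and using status 1 to signal a missing output is an assumed convention.

```python
import os


def final_exit_status(app_exit_code, outputs_dir, expected_outputs):
    """Return the status the wrapper should exit with (illustrative policy)."""
    if app_exit_code != 0:
        return app_exit_code  # pass the application's own failure code through
    for name in expected_outputs:
        if not os.path.isfile(os.path.join(outputs_dir, name)):
            return 1  # assumed convention: 1 means an expected output is missing
    return 0
```

The wrapper would pass the result to sys.exit as its last action, so that a job whose application returned zero but produced no output is still reported as failed.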
An Example Wrapper Script
This example is based on the ImageMagick applications which were
installed as part of the GRIA installation. To get started, we will
create a simple wrapper that runs this application.
- Create a startJob wrapper script:
#!/usr/bin/env python
# startJob wrapper for the ImageMagick "swirl" example.
import shutil
import subprocess
import sys

print("Swirl wrapper started")

print("Copying input to work directory...")
shutil.copyfile("inputs/sourceImage", "image.jpg")

print("Transforming image...")
# The arguments are passed as a list, so no shell is involved.
p = subprocess.Popen(["mogrify", "-swirl", "60", "image.jpg"])
ret = p.wait()
if ret != 0:
    print("Failed to transform image, error=%s" % ret)
    sys.exit(ret)

print("Copying result to output stager...")
shutil.copyfile("image.jpg", "outputs/outputImage")

print("Swirl job completed successfully")
This will perform the following steps:
- Copy the input image into the work directory.
- Run the mogrify command to transform the image.
- Copy the result to the output stager.
- Edit the startJob.py script to run your command instead of mogrify (the command you tested above).
- Make the script executable:
$ chmod a+x startJob.py
4.4.3.
checkJob Wrapper Script
Creating the checkJob.pl Status Wrapper Script
Language
Like all other application wrapper scripts, checkJob can be written in any scripting language supported by the host OS.
- For Linux, the first line of the script (e.g. #!/usr/bin/python) is used to determine which interpreter to use. The filename extension can be anything (e.g. checkJob.py, checkJob.sh).
- On Windows, the filename extension is used to determine which interpreter to use. Currently only Python (.py) and Perl (.pl) are recognised.
Application Wrapper Functionality
Unlike the wrapper for starting an application, the application
wrapper for reading status from the working directory is optional. If
no such wrapper is provided, the platform script will create a simple
status report by checking stdout and stderr of the application,
and consulting the RM connector scripts.
If you want the job status report to include application-specific
information such as convergence plots, iteration counters, etc, you
should create a wrapper script that will be invoked by the client calling the checkJob method.
An application status wrapper is usually a lot simpler to create
than the main wrapper because it does not take any user-supplied (and
hence potentially malicious) arguments, and does not set up (or run)
potentially untrustworthy code. All the status wrapper has to do is to
examine the job's working directory, read any files it needs in order
to extract the desired status information (in the limit, one could
simply copy an application-level log file), and write it to the
standard output.
Note: the format of the status information is open and application dependent, however status information must not include binary data since it will be returned to the user in an XML document.
An Example Status Wrapper Script
This example is based on the ImageMagick application, which was
installed as part of the GRIA installation and follows on from the start job example. In the same directory as startJob.py, create a script called checkJob.sh:
#!/bin/sh
tail log
This is run each time the client checks the status of the job. This example simply returns the last few lines of the log file, and is executed as follows:
- Make the script executable:
$ chmod a+x checkJob.sh
- Test it with these commands:
$ ./checkJob.sh > statusfile
$ cat statusfile
You should find that the contents of log are now in statusfile.
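The same status wrapper can be written in Python, which is useful on Windows where only .py and .pl wrappers are recognised. The following is a minimal sketch, not part of the GRIA distribution: it assumes the application writes to a file called log (as in the shell example above), and the function name status_report and the choice of ten lines are illustrative. It also replaces control characters, since the status text is returned inside an XML document and must not contain binary data.

```python
#!/usr/bin/env python
import os
import sys


def status_report(log_path="log", max_lines=10):
    """Return the last few lines of the log as XML-safe text."""
    if not os.path.exists(log_path):
        return "No log output yet\n"
    with open(log_path, "rb") as f:
        lines = f.read().decode("utf-8", "replace").splitlines()
    safe = []
    for line in lines[-max_lines:]:
        # Keep tabs and ordinary characters; replace control characters,
        # which are not allowed in an XML 1.0 document.
        safe.append("".join(c if c == "\t" or ord(c) >= 32 else "?"
                            for c in line))
    return "\n".join(safe) + "\n"


if __name__ == "__main__":
    # The status wrapper reports by writing to standard output
    sys.stdout.write(status_report())
```

Saved as checkJob.py in the same directory as startJob.py, it can be tested in the same way as the shell version.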
4.4.4.
killJob Wrapper Script
Creating a killJob.pl Application Wrapper Script
Language
Like all other application wrapper scripts, killJob can be written in any scripting language supported by the host OS.
- For Linux, the first line of the script (e.g. #!/usr/bin/python) is used to determine which interpreter to use. The filename extension can be anything (e.g. killJob.py, killJob.sh).
- On Windows, the filename extension is used to determine which interpreter to use. Currently only Python (.py) and Perl (.pl) are recognised.
Application Wrapper Functionality
The application-specific kill script allows an application
to be terminated in a more controlled way, e.g. using an application-specific mechanism.
This script is optional and, if not available, the RM Connector
script will try to kill a job at the resource manager level instead.
The return code of this script should be 0 upon success or any other value
on error. The RM Connector will decide accordingly whether the killing
operation was successful or if it failed. The functionality of this script is
application dependent and difficult to describe in a generic way.
For some applications, terminating a job might be as simple as creating a
single file (e.g. a "stop" file) in the job workspace. Some other applications are aware of signals
that can be passed to them, etc.
An Example Kill Wrapper Script
We are not aware of any particular way that ImageMagick can be killed, therefore
we cannot provide a complete example of a kill wrapper script. However, in the
following paragraphs we provide some hints about possible ways of terminating jobs.
If a particular application is aware of a termination file in the job
workspace, then the kill wrapper could be as simple as:
touch .terminate
Where .terminate is the particular termination filename that
the application uses.
Many applications are aware of various signals e.g. SIGTERM, SIGALRM, SIGSTOP, etc.
The kill wrapper could therefore pass the appropriate signal to the application and the
application could respond by terminating the job gracefully, for example:
# find the process ID $pid and send a termination signal
kill -SIGTERM $pid
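The signal-based approach can be sketched in Python as follows. This is an illustration only: it assumes that the startJob wrapper has written the application's process ID to a file named .pid in the job workspace, which is a hypothetical convention and not something the Job Service does for you. The function returns 0 on success and 1 on error, matching the return code convention described above.

```python
#!/usr/bin/env python
import os
import signal

# Hypothetical: assumes the startJob wrapper saved the process ID here
PID_FILE = ".pid"


def kill_job(pid_file=PID_FILE):
    """Send SIGTERM to the recorded process; return 0 on success, 1 on error."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
        os.kill(pid, signal.SIGTERM)
    except (IOError, OSError, ValueError) as e:
        # Missing or malformed PID file, or the process no longer exists
        print("Failed to terminate job: %s" % e)
        return 1
    return 0
```

A complete killJob.py would end by passing the result to sys.exit, for example sys.exit(kill_job()), so that the RM Connector sees the return code.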
4.4.5.
Application Metadata XML
Creating an XML File to Describe an Application
Application description files are XML files containing metadata
about an application deployed on a GRIA Job Service. These files are essential for GRIA
users to discover and use available applications. To create an XML
description for an application, you need to use the following schema
to identify the application's main features including
name, version number, description and inputs/outputs (if any). For example, the
following XML describes the Swirl application:
<?xml version="1.0" encoding="UTF-8" ?>
<GriaApplicationDescription xmlns="http://www.it-innovation.soton.ac.uk/2007/grid/application">
<JobServiceMinVersion>5.2</JobServiceMinVersion>
<Application>
<Description>Application to swirl an image</Description>
<ApplicationName>http://it-innovation.soton.ac.uk/grid/imagemagick/swirl</ApplicationName>
<ApplicationVersion>2.0-1</ApplicationVersion>
<Group>graphics</Group>
<Keywords>imagemagick, example</Keywords>
</Application>
<DataStagers>
<DataStager type="input" name="inputImage">
<Description>Input image to be swirled</Description>
<MimeType>image</MimeType>
</DataStager>
<DataStager type="output" name="outputImage">
<Description>Swirled image</Description>
<MimeType>image</MimeType>
</DataStager>
</DataStagers>
</GriaApplicationDescription>
The main application metadata is contained within an Application element.
Every application provided by the Job Service must be given a unique ApplicationName. To ensure uniqueness, a URI is used. Note that although these names look like web page addresses, they do not necessarily point to real web pages if treated as URLs; they are simply unique strings.
Inputs and outputs are defined as DataStager elements, with the type attribute set to "input" or "output", as appropriate.
Note that you can add as many inputs/outputs as necessary, according to your application.
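When writing a description file by hand, it can help to check that it parses and declares the stagers you expect. The sketch below uses Python's standard xml.etree.ElementTree module to list the application name and the DataStager elements. It is a convenience check only, not a GRIA tool; the function name list_stagers is illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace used by the description file shown above
NS = {"g": "http://www.it-innovation.soton.ac.uk/2007/grid/application"}


def list_stagers(xml_text):
    """Return (application name, [(type, name), ...]) from a description."""
    root = ET.fromstring(xml_text)
    app_name = root.findtext("g:Application/g:ApplicationName",
                             namespaces=NS)
    stagers = [(s.get("type"), s.get("name"))
               for s in root.findall("g:DataStagers/g:DataStager", NS)]
    return app_name, stagers
```

For the Swirl description above, this would report one input named inputImage and one output named outputImage.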
Advanced usage
Input arrays
An application might require arrays of inputs, whose exact sizes are specified by the user when creating the job. This is supported by GRIA using the minOccurs, maxOccurs and defaultSize attributes on DataStager elements.
For example, if your application took between 2 and 8 images as input, you might use the following XML:
<DataStager type="input" name="inputImage" minOccurs="2" maxOccurs="8" defaultSize="2">
<Description>Input image</Description>
<MimeType>image</MimeType>
</DataStager>
You can use the defaultSize attribute to support older clients that do not know how to specify the desired size of arrays.
Optional inputs
Optional inputs are described much like arrays, except the minOccurs attribute is 0 and the maxOccurs attribute is 1. For example:
<DataStager type="input" name="overlayImage" minOccurs="0" maxOccurs="1" defaultSize="0">
<Description>Optional image to superimpose on top of the result</Description>
<MimeType>image</MimeType>
</DataStager>
Command line arguments
If your metadata file describes the application's allowed command line arguments, the GRIA Client and Job Service can validate arguments as they are received from the user, before they are passed to the application wrappers.
For example:
<Parameters>
<Parameter name="string" qualifier="--string" type="string" minOccurs="0" maxOccurs="1"/>
<Parameter name="bool" qualifier="--bool" type="boolean" minOccurs="0" maxOccurs="1"/>
<Parameter name="data" qualifier="" type="string" minOccurs="1" maxOccurs="1">
<allowed>one</allowed>
<allowed>two</allowed>
<allowed>three</allowed>
</Parameter>
</Parameters>
This would allow the following command lines:
--string "This is a string" one
--bool three
In the above example, we specify whether parameters are optional or compulsory using the
minOccurs="0" maxOccurs="1" or
minOccurs="1" maxOccurs="1"
attribute combinations, respectively.
The "data" parameter may take different values (hence the empty qualifier attribute). It is also further restricted by the use of specific allowed elements, forming a set of options.
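To make the rules concrete, the following Python sketch checks a command line against the three parameters from the example. It is not the validator used by the GRIA Client or Job Service; it assumes that a boolean parameter is a bare flag, that any other qualified parameter consumes the next argument as its value, and that an empty qualifier denotes a positional argument. The PARAMS structure and the validate function are illustrative names.

```python
# The three parameters from the example, as plain dictionaries
PARAMS = [
    {"qualifier": "--string", "type": "string", "min": 0, "max": 1,
     "allowed": None},
    {"qualifier": "--bool", "type": "boolean", "min": 0, "max": 1,
     "allowed": None},
    {"qualifier": "", "type": "string", "min": 1, "max": 1,
     "allowed": ["one", "two", "three"]},
]


def validate(args, params=PARAMS):
    """Return None if args satisfy params, else a message describing why not."""
    by_qualifier = dict((p["qualifier"], p) for p in params)
    counts = dict((p["qualifier"], 0) for p in params)
    i = 0
    while i < len(args):
        arg = args[i]
        if arg.startswith("--"):
            if arg not in by_qualifier:
                return "unknown option %s" % arg
            if by_qualifier[arg]["type"] != "boolean":
                # Assumption: non-boolean options consume the next argument
                if i + 1 >= len(args):
                    return "missing value for %s" % arg
                i += 1
            counts[arg] += 1
        else:
            param = by_qualifier.get("")
            if param is None:
                return "unexpected argument %s" % arg
            if param["allowed"] and arg not in param["allowed"]:
                return "value %s not allowed" % arg
            counts[""] += 1
        i += 1
    # Enforce minOccurs / maxOccurs for every parameter
    for p in params:
        n = counts[p["qualifier"]]
        if n < p["min"] or n > p["max"]:
            return "%s occurs %d times, expected %d to %d" % (
                p["qualifier"] or "positional argument", n, p["min"], p["max"])
    return None
```

Both command lines shown above pass this check, while a command line that omits the compulsory "data" argument, or supplies a value outside the allowed set, is rejected.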
4.4.6.
jobUsage Wrapper Script
The jobUsage script generates application-specific usage reports
Language
Like all other application wrapper scripts, jobUsage can be written in any scripting language supported by the host OS.
- For Linux, the first line of the script (e.g. #!/usr/bin/python) is used to determine which interpreter to use. The filename extension can be anything (e.g. jobUsage.py, jobUsage.sh).
- On Windows, the filename extension is used to determine which interpreter to use. Currently only Python (.py) and Perl (.pl) are recognised.
Application Wrapper Functionality
The jobUsage application wrapper script is an optional script that generates usage reports, indicating how much resource a job for a specific application is using.
The GRIA Job Service runs this script occasionally during the job's execution, and once when the job has completed, to gather usage reports.
It combines these reports with the ones from the resource manager and then forwards them to the SLA service.
When is jobUsage run?
The jobUsage script is run approximately twice per minute during the job's execution, and then once again immediately after the job finishes. The exact frequency of the calls to jobUsage depends on service load, but is guaranteed to be at most once per 30 seconds.
Output format
The jobUsage script should print on stdout an XML fragment similar to the following:
<UsageReport uri="http://example.com/metrics/example1" type="instantaneous">562</UsageReport>
<UsageReport uri="http://example.com/metrics/example2" type="cumulative">21</UsageReport>
<UsageReport uri="http://example.com/metrics/example3" type="instantaneous">75.32</UsageReport>
Each usage report contains the following components:
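Judging from the example above, each report carries a uri attribute naming the metric, a type attribute (instantaneous or cumulative) and a numeric value. A minimal sketch of a jobUsage script that reports the size of the job's working directory might look like this. The metric URI is hypothetical and chosen for illustration; a real deployment would use whatever metric URIs its SLA service expects.

```python
#!/usr/bin/env python
import os
from xml.sax.saxutils import quoteattr


def usage_report(uri, kind, value):
    """Format one UsageReport element; kind is the report type."""
    return "<UsageReport uri=%s type=%s>%s</UsageReport>" % (
        quoteattr(uri), quoteattr(kind), value)


def directory_size(path="."):
    """Total size in bytes of all regular files below path."""
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


if __name__ == "__main__":
    # Hypothetical metric URI, for illustration only
    print(usage_report("http://example.com/metrics/workdir-bytes",
                       "instantaneous", directory_size()))
```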
4.5.
JSDL Job Submission
The GRIA Job Service allows submission of jobs in JSDL format
4.5.1.
Overview
An explanation of what JSDL is and how it is used in GRIA
JSDL (Job Submission Description Language) is an open standard developed by the Open Grid Forum. Since version 5.2 of the GRIA Job Service, users are able to submit information about the jobs they wish to create by using a JSDL document.
Version 5.2 of the GRIA Job Service supports a limited subset of the JSDL Specification, Version 1.0. Clients can use JSDL to:
- Name their jobs, and give the URI of the application to be run
- Specify any command-line arguments to be passed to the application wrapper scripts
- List the number, name and type of any input/output data stagers to be created
- Outline the expected resource usage of the job, to ensure the client's SLA has enough resource to allow the job to run
- Define any constraints on the resources available to the job during its execution
The graphical GRIA client will create a JSDL document for you behind the scenes when you create a job. If you prefer, the graphical GRIA client is also able to upload a hand-written JSDL document when creating a job. You can view the JSDL document for any job in the GRIA client, or by using the JobResource.getJSDL() API method.
When using the client APIs to create a job, the
JobDescription Java class can be used to easily create a JSDL document.
4.5.2.
Supported Elements
List of the JSDL XML elements supported by the GRIA Job Service
The GRIA Job Service supports parts of the JSDL Specification, Version 1.0. This page gives a description of the elements that are supported. Any elements not listed here are ignored and not used by the GRIA Job Service.
| JSDL Element Name | Supported | Notes |
| JobIdentification | | |
| - JobName | Yes | Used to set the label of the job resource that is created |
| Application | | |
| - ApplicationName | Yes | Should be set to the application URI (e.g. http://it-innovation.soton.ac.uk/grid/imagemagick/swirl) |
| - POSIXApplication | | |
| - Argument | Yes | Specifies a single command-line argument to pass to the application wrapper scripts |
| - FileSizeLimit | Partial | Only supported when using the LocalExecution plugin on POSIX |
| - CoreDumpLimit | Partial | Only supported when using the LocalExecution plugin on POSIX |
| - DataSegmentLimit | Partial | Only supported when using the LocalExecution plugin on POSIX |
| - LockedMemoryLimit | Partial | Only supported when using the LocalExecution plugin on POSIX |
| - OpenDescriptorsLimit | Partial | Only supported when using the LocalExecution plugin on POSIX |
| - StackSizeLimit | Partial | Only supported when using the LocalExecution plugin on POSIX |
| - CPUTimeLimit | Partial | Only supported when using the LocalExecution plugin on POSIX |
| DataStaging | | One DataStaging element should be provided for each input or output your job requires |
| - "name" attribute | Yes | Should match one of the input/output names in the application metadata. If the metadata describes an array, the name should have a numerical suffix, indicating which element of the array the stager represents (eg. inputarray-0, inputarray-1, etc.) |
| - FileName | Yes | Used as above if the "name" attribute is not specified |
| Resources | | |
| - IndividualCPUSpeed | Partial | See Support for Resource Elements |
| - IndividualCPUTime | Partial | See Support for Resource Elements |
| - IndividualCPUCount | Partial | See Support for Resource Elements |
| - IndividualPhysicalMemory | Partial | See Support for Resource Elements |
| - IndividualVirtualMemory | Partial | See Support for Resource Elements |
| - IndividualDiskSpace | Partial | See Support for Resource Elements |
| - TotalCPUTime | Partial | See Support for Resource Elements |
| - TotalCPUCount | Partial | See Support for Resource Elements |
| - TotalPhysicalMemory | Partial | See Support for Resource Elements |
| - TotalVirtualMemory | Partial | See Support for Resource Elements |
| - TotalDiskSpace | Partial | See Support for Resource Elements |
4.5.3.
Meaning of Resource and RangeValue types
RangeValue types in the JSDL allow the submitter of a job to specify ranges for resource usage
Overview
There are several elements in a JSDL document that contain RangeValue_Types. These elements usually allow the submitter of a job to specify a range of allowed values for a certain resource.
When used on elements in the <Resources> section of the JSDL, RangeValue_Types can be used to specify two different types of policy:
- The lower bound of the range is used to specify expected minimum usage, and is checked against the user's SLA at creation time. This check ensures that, if the user doesn't have sufficient resource left on his SLA, the job is not allowed to start (instead of being terminated half way through).
- The upper bound of the range is used to specify a maximum resource usage for the job, above which it should be terminated.
Examples
- Lower bounded ranges. Example: "This job will use at least 20kb of disk space. Make sure I'm allowed this much before letting me start the job."
<IndividualDiskSpace> <LowerBoundedRange>20000</LowerBoundedRange> </IndividualDiskSpace>
- Upper bounded ranges. Example: "This job should run for at most 60 seconds. Terminate the job if it runs for longer."
<IndividualCPUTime> <UpperBoundedRange>60</UpperBoundedRange> </IndividualCPUTime>
- Ranges with lower and upper bounds. Example: "Make sure I can use 5MB of virtual memory before starting my job, but terminate it if it uses more than 10MB."
<IndividualVirtualMemory> <Range> <LowerBound>5000000</LowerBound> <UpperBound>10000000</UpperBound> </Range> </IndividualVirtualMemory>
- Exact values. Example: "This job will use 20MB of physical memory. Don't let it start if I'm not allowed to use that much, and terminate the job if it tries to use more."
<IndividualPhysicalMemory> <Exact>20000000</Exact> </IndividualPhysicalMemory>
Service provider overrides
The previous examples were all written from the point of view of a user submitting a job. The service provider can also use JSDL to enforce policies on jobs that are submitted to his service. These types of service-provider policy are specified in the webadmin interface, and they are specific to each application.
If both the user and the service provider specify a range for the same resource type, the intersection between the two is used. For example:
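The intersection rule can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the Job Service's code: each range is represented as a (lower, upper) pair, with None meaning "no bound on that side", and the function name intersect is an assumption. How the Job Service reacts when the two ranges do not overlap is not covered here; the sketch simply returns None in that case.

```python
def intersect(user, provider):
    """Intersect two (lower, upper) ranges; None means unbounded on that side.

    Returns the combined (lower, upper) range, or None if the two ranges
    do not overlap.
    """
    lowers = [b for b in (user[0], provider[0]) if b is not None]
    uppers = [b for b in (user[1], provider[1]) if b is not None]
    # The tighter of the two bounds wins on each side
    lower = max(lowers) if lowers else None
    upper = min(uppers) if uppers else None
    if lower is not None and upper is not None and lower > upper:
        return None
    return (lower, upper)
```

If a user asks for between 5000000 and 10000000 bytes of virtual memory and the service provider caps the application at 8000000 bytes, intersect((5000000, 10000000), (None, 8000000)) gives (5000000, 8000000).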
4.5.4.
Support for Resource elements
Not all Resource elements are supported by all Resource Manager plugins
Not all of the Resource Manager plugins bundled with the GRIA Job Service can support every type of Resource element defined in the JSDL specification. The following table gives an overview of which elements are supported at each stage of policy enforcement.
| JSDL Element Name | LocalExecution | | | | TorquePBS | | | | Condor | | | |
| | A | B | C | D | A | B | C | D | A | B | C | D |
| CandidateHosts | | | | | | | | | | | | |
| ExclusiveExecution | | | | | | | | | | | | |
| OperatingSystem | | | | | | | | | | | | |
| CPUArchitecture | | | | | | | | | | | | |
| CPUSpeed | | | | | | | | | | | | |
| CPUTime | | | | | | | | | | | | |
| CPUCount | | | | | | | | | | | | |
| NetworkBandwidth | | | | | | | | | | | | |
| PhysicalMemory | | | | | | | | | | | | |
| VirtualMemory | | | | | | | | | | | | |
| DiskSpace | | | | | | | | | | | | |
| ResourceCount | | | | | | | | | | | | |
| POSIX Extensions | | | | | | | | | | | | |
The "POSIX Extensions" row refers to the POSIXApplication element inside the Application section of the JSDL. Technically this is not a Resource element, but its purpose is similar. Details about which POSIX constraints are supported can be found in the Supported Elements page.
Legend
| A | job creation time - initial check on user's SLA using the RangeResource's lower bound |
| B | job submission time - selecting which RMs are suitable for the running of a job using the RangeResource's lower bound (see step 2 of The Selection Process for more information) |
| C | job submission time - RM directives to govern node selection and job execution |
| D | runtime - usage report generation |
| | Not supported |
| | Supported on Linux only |
| | Supported on Linux and Windows |
4.5.5.
Examples
Example JSDL documents suitable for the GRIA Job Service
This example would be sent to a GRIA 5.2 basic application service to create an example "blend" job with two inputs and one output:
<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl">
<JobDescription>
<JobIdentification>
<JobName>http://it-innovation.soton.ac.uk/grid/imagemagick/blend 1</JobName>
</JobIdentification>
<Application>
<ApplicationName>http://it-innovation.soton.ac.uk/grid/imagemagick/blend</ApplicationName>
</Application>
<DataStaging name="inputImage-0">
<FileName>inputImage-0</FileName>
<CreationFlag>overwrite</CreationFlag>
<DeleteOnTermination>true</DeleteOnTermination>
</DataStaging>
<DataStaging name="inputImage-1">
<FileName>inputImage-1</FileName>
<CreationFlag>overwrite</CreationFlag>
<DeleteOnTermination>true</DeleteOnTermination>
</DataStaging>
<DataStaging name="outputImage">
<FileName>outputImage</FileName>
<CreationFlag>overwrite</CreationFlag>
<DeleteOnTermination>true</DeleteOnTermination>
</DataStaging>
</JobDescription>
</JobDefinition>
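A document of this shape can also be generated programmatically. The following Python sketch builds an equivalent JSDL document with the standard xml.etree.ElementTree module. It is not part of the GRIA client APIs (which provide the JobDescription Java class for this purpose); the function name build_jsdl and its parameters are illustrative.

```python
import xml.etree.ElementTree as ET

JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"


def build_jsdl(job_name, application_uri, stager_names):
    """Return a JSDL document (as a string) in the shape of the example above."""
    # Serialise the JSDL namespace as the default namespace
    ET.register_namespace("", JSDL_NS)

    def el(parent, tag, text=None):
        child = ET.SubElement(parent, "{%s}%s" % (JSDL_NS, tag))
        child.text = text
        return child

    root = ET.Element("{%s}JobDefinition" % JSDL_NS)
    desc = el(root, "JobDescription")
    el(el(desc, "JobIdentification"), "JobName", job_name)
    el(el(desc, "Application"), "ApplicationName", application_uri)
    # One DataStaging element per input or output
    for name in stager_names:
        staging = el(desc, "DataStaging")
        staging.set("name", name)
        el(staging, "FileName", name)
        el(staging, "CreationFlag", "overwrite")
        el(staging, "DeleteOnTermination", "true")
    return ET.tostring(root, encoding="unicode")
```

Calling build_jsdl with the blend application URI and the names inputImage-0, inputImage-1 and outputImage produces the same elements as the example above, without the XML declaration or indentation.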
This example additionally has the "Resources" tag in it:
<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl">
<JobDescription>
<JobIdentification>
<JobName>http://it-innovation.soton.ac.uk/grid/imagemagick/blend 2</JobName>
</JobIdentification>
<Application>
<ApplicationName>http://it-innovation.soton.ac.uk/grid/imagemagick/blend</ApplicationName>
</Application>
<Resources>
<IndividualDiskSpace>
<LowerBoundedRange>20000.0000000000</LowerBoundedRange>
</IndividualDiskSpace>
</Resources>
<DataStaging name="inputImage-0">
<FileName>inputImage-0</FileName>
<CreationFlag>overwrite</CreationFlag>
<DeleteOnTermination>true</DeleteOnTermination>
</DataStaging>
etc
</JobDescription>
</JobDefinition>
4.6.
Resource Managers
The GRIA Job Service can interface with a number of Resource Managers. This section describes how to configure it to do this.
4.6.1.
Overview
Why are resource manager plugins needed? What do they do?
The GRIA Job Service does not access resource managers directly to submit and check jobs. Instead, it uses an extra layer of resource-manager-dependent scripts: for each resource manager, GRIA requires a separate Resource Manager (RM) Connector Plugin. This layer decouples the GRIA Job Service from resource managers and applications.
GRIA defines the RM Connector Plugin API to handle tasks such as:
- Submitting jobs
- Checking the status of a job
- Checking the resource usage of a job
- Terminating a job
The Job Service then can be configured to use RM Connector Plugins suitable for the underlying computing platform (or resource manager). The plugins then know how to handle (start, check, kill) jobs for that particular computing platform, and can be instructed to run a particular application via its application wrapper.
4.6.2.
The Selection Process and RMSelector.py
How does the Job Service know which RM plugin to use for a job?
The GRIA Job Service can be configured to use any number of Resource Managers. Deciding which one to use for a submitted job is done as a three-step selection process, outlined below.
In the screenshots below, we use five made-up Resource Manager Plugins - RM1, RM2, RM3, RM4 and RM5.
The Job Service starts by compiling a list of all the RM Connector Plugins that are installed and enabled. You can see this list by looking at the main Job Service administration page (see Figure 2 below) - all the plugins in the list that are not greyed out are enabled, and will be used in this selection process.
Step 1: Filter by Application
When the service administrator deploys a new application on the job service, he can indicate which resource managers have the application installed. Any resource managers not selected will be immediately excluded from the selection process, and no jobs for that application will be able to run on them.
Step 2: Filter by Resources
At this stage, the Job Service asks each RM Connector Plugin whether it has enough resources to run the job being submitted. The plugins will typically look at the resource requirements section of the JSDL, and then query the actual resource manager to see if it is able to run the job. See Writing Custom Resource Manager Plugins for details of the canRunJob python function.
Some checks that the plugins might do include:
- Checking whether the operating system and system architecture requested by the submitter are available on any of the compute nodes.
- Checking whether there is a compute node with enough memory to run the job.
In our example, RM3 does not have enough resources to run the job and is excluded from further steps.
Note that support for this feature in the current set of RM plugins is quite limited. See column B in Support for Resource Elements for more information.
Step 3: Objective Function
The final decision as to which Resource Manager will be used for a job is made by a Python script - RMSelector.py. The default implementation of this function is to just choose the first plugin available, but administrators can override this behaviour.
The Python interface for selecting a plugin is very simple. The Job Service will look for a Python function called selectPlugin, and call it with two arguments:
- job - a Job object, containing information gathered from the JSDL including the job's name and its resource requirements.
- plugins - a list of RMConnector derived objects - representing all the plugins that made it to Step 3 in the selection process. The selectPlugin function is expected to return one of these objects.
A very simple example is given below. This always selects the plugin named "RM4".
#!/usr/bin/env python
def selectPlugin(job, plugins):
    for plugin in plugins:
        if plugin.__class__.__name__ == "RM4":
            return plugin
    return None
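A slightly more flexible variant is sketched below. It tries a list of preferred plugins in order and, if none of them survived the earlier steps, falls back to the first candidate, which mirrors the default behaviour described above. The PREFERRED list and the plugin names in it are hypothetical; substitute the class names of your own plugins.

```python
#!/usr/bin/env python

# Hypothetical plugin class names, most preferred first
PREFERRED = ["RM4", "RM2"]


def selectPlugin(job, plugins):
    """Return the most preferred available plugin, else the first one offered."""
    by_name = dict((p.__class__.__name__, p) for p in plugins)
    for name in PREFERRED:
        if name in by_name:
            return by_name[name]
    # Fall back to the first candidate that reached Step 3
    if plugins:
        return plugins[0]
    return None
```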
Once the administrator has written this script, he can instruct the Job Service to use it by entering its path in the configuration page:
4.6.3.
Using Condor
The GRIA Job Service can submit jobs to Condor clusters. Here's how.
Configuring Condor
This section assumes you already have a working Condor installation. If not, you can follow the installation guide in the Condor manual.
Figure 1 shows a typical Condor/GRIA setup. Condor should be installed on the machine running the GRIA Job Service, and it should be allowed to submit jobs to the Condor Central Manager. To do this, you need to add the machine's hostname to the global Condor configuration file.
For example:
HOSTALLOW_WRITE = submit1.your.domain, submit2.your.domain, griaserver.your.domain
Configuring GRIA
Setting up the Condor plugin in the GRIA Job Service is simple. First click the Configure link next to Condor on the main admin page. This will open up the Condor configuration page:
Enter the paths to Condor's installation directory and its configuration file, then press Save configuration.
Customising Job Submissions
The job description template used for jobs submitted by GRIA is quite simple:
universe = vanilla
executable = frame.py
shell = /bin/bash
log = loguser
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
# Resource constraints
###RESOURCE_CONSTRAINTS###
(Note that ###RESOURCE_CONSTRAINTS### will be replaced automatically by the Job Service when a job is submitted)
This template is located inside the webapp (TOMCAT_DIR/webapps/gria-basic-app-services/WEB-INF/rm-connectors/plugins/CondorTemplate.jdf), and if you need to modify it you have two choices:
- Change the template file inside the webapp. The disadvantage of this is that your changes will be overwritten if you redeploy or upgrade the GRIA Job Service.
- Copy the plugin and template files to a location outside the webapp and then modify them. You will have to change the name of your new plugin copy so that the Job Service can distinguish it from the original. To do this open up Condor.py and change
class Condor(RMConnector):
to something like
class CondorCopy(RMConnector):
You can leave the filename the same, or change it if you prefer. You should now enter the directory containing your new plugin in the Job Service configuration page.
Condor on Windows
Condor needs to switch to the user account of the submitter whenever it runs a job. This is straightforward on UNIX when Condor runs as root but, on Windows, knowledge of the user's password is required even when running at the maximum privilege level.
The GRIA Job Service runs under Tomcat, which by default runs as the NT Local System user. In this configuration, GRIA is not able to submit jobs to Condor, as the Local System user does not have a password and cannot have one set. It is recommended that you create a separate user account for the GRIA Job Service, and have Tomcat run as that user.
4.6.4.
Using Torque PBS
The GRIA Job Service can submit jobs to Torque PBS clusters. Here's how.
Configuring Torque
This section assumes you already have a working Torque installation. If not, you can follow the installation guide in the Torque Admin Manual.
Figure 1 shows a typical Torque/GRIA setup. Torque should be installed on the machine running the GRIA Job Service, and it should be allowed to submit jobs to the machine running pbs_server. To do this, you need to add the machine's hostname to the list of allowed submit hosts with the following command:
qmgr -c 'set server submit_hosts += griaserver.your.domain'
Configuring GRIA
Setting up the Torque PBS plugin in the GRIA Job Service is simple. First click the Configure link next to TorquePBS on the main admin page. This will open up the Torque configuration page:
Enter the path to Torque's installation directory, then press Save configuration.
Customising Job Submissions
The job description template used for jobs submitted by GRIA is quite simple:
## PBS directives
#PBS -N """JOB_NAME"""
#PBS -j oe
"""PBS_DIRECTIVES"""
(Note that """JOB_NAME""" and """PBS_DIRECTIVES""" will be replaced automatically by the Job Service when a job is submitted)
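The substitution itself is simple text replacement. The Python sketch below shows the idea with the template above; it is not the plugin's actual code, and the function name fill_template is an assumption. The walltime directive is used only as an example of a line that might be generated from a job's resource constraints.

```python
# The Torque template shown above, with its two placeholders
TEMPLATE = '''## PBS directives
#PBS -N """JOB_NAME"""
#PBS -j oe
"""PBS_DIRECTIVES"""
'''


def fill_template(template, job_name, directives):
    """Replace the two placeholders with a job name and directive lines."""
    text = template.replace('"""JOB_NAME"""', job_name)
    return text.replace('"""PBS_DIRECTIVES"""', "\n".join(directives))
```

For example, fill_template(TEMPLATE, "swirl-1", ["#PBS -l walltime=00:01:00"]) yields a job file whose name line reads "#PBS -N swirl-1" and whose last line is the walltime directive.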
This template is located inside the webapp (TOMCAT_DIR/webapps/gria-basic-app-services/WEB-INF/rm-connectors/plugins/TorquePBSTemplate.jdf), and if you need to modify it you have two choices:
- Change the template file inside the webapp. The disadvantage of this is that your changes will be overwritten if you redeploy or upgrade the GRIA Job Service.
- Copy the plugin and template files to a location outside the webapp and then modify them. You will have to change the name of your new plugin copy so that the Job Service can distinguish it from the original. To do this open up TorquePBS.py and change
class TorquePBS(RMConnector):
to something like
class TorquePBSCopy(RMConnector):
You can leave the filename the same, or change it if you prefer. You should now enter the directory containing your new plugin in the Job Service configuration page.
4.6.5.
Writing Custom Resource Manager Plugins
System administrators can write their own RM Connector Plugins to interface with other Resource Managers.
What are they?
RM Connector Plugins are classes written in Python that handle all communication with the Resource Manager. The GRIA Job Service comes with three RM Connector Plugins that can be used as examples. These can be found inside the webapp:
TOMCAT_DIR/webapps/gria-basic-app-services/WEB-INF/rm-connectors/plugins
When writing your own plugins, you should not put them inside the webapp directory as they will be lost whenever you redeploy or upgrade the GRIA Job Service. Instead, place them in a new directory outside the webapp (eg. /opt/gria/rm-connectors) and enter this path into the Job Service configuration:
Plugins usually consist of one Python script (.py) and one or more template files. These templates are used when creating new jobs - values are substituted into them and they are written back into the new job directory. The Condor plugin consists of the following files:
- Condor.py - the main Python script. Contains functions for interacting with the Resource Manager.
- CondorTemplate.frame - template for job wrapper "frames".
- CondorTemplate.jdf - template for new Job Description Files.
A Sample plugin
The code listing below is a sample RM Connector Plugin that you can use as a starting point for writing your own. It describes what you need to do to implement the five main functions in any plugin:
- submit - submits a new job to the resource manager.
- jobStatus - checks whether a job is still running.
- jobUsage - gets usage reports.
- killJob - terminates a job.
- canRunJob - checks if the RM can run the requested job.
For more complete documentation on the API, see RMConnector.
#!/usr/bin/env python -tt
# -*- coding: UTF-8 -*-
from RMConnector import RMConnector, ScriptNotFound
from Logger import logsys, loguser
import platformUtils

class Sample(RMConnector):
    MIN_API_VERSION=1

    def submit(self, job):
        RMConnector.submit(self, job)
        logsys.info("Executable we're running is " + job.executableName)
        logsys.info("Our arguments are " + repr(job.arguments))
        # TODO: Submit the job to the resource manager

    def jobStatus(self):
        # TODO: Check whether our job is really running. If not, write an
        # empty file called .FAILED in the current directory.
        pass

    def jobUsage(self, appWrapperDir):
        appUsageReports = RMConnector.jobUsage(self, appWrapperDir)
        # TODO: Get any usage information from the RM, and append it to
        # appUsageReports
        return appUsageReports

    def killJob(self, appWrapperDir):
        # Try using the application wrapper scripts to kill the job gracefully
        try:
            RMConnector.killJob(self, appWrapperDir)
        except ScriptNotFound:
            pass
        # TODO: Check if the job is still running. If it is, it means there
        # wasn't an application wrapper script, or they couldn't kill the job
        # So kill the job forcefully.
        # Write a file telling GRIA we killed the job
        platformUtils.writeToFile(".KILLED", "")

    def canRunJob(self, job):
        # TODO: Check if this RM has enough resources to run the job.
        # If we can't, return False with a reason, eg:
        # return (False, "Not enough memory")
        return (True, None)
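Two of the TODO items in the listing above can be illustrated with small standalone helpers. This is a sketch, not part of the GRIA API: `mark_failed_if_not_running` and `check_resources` are hypothetical names, and the `is_running`, `required_mb` and `available_mb` values are assumed to come from whatever query your Resource Manager provides. Only the `.FAILED` marker file and the `(False, "Not enough memory")` return shape are taken from the comments in the sample.

```python
import os


def mark_failed_if_not_running(is_running, job_dir="."):
    """Create an empty .FAILED file in job_dir when the job is not running.

    is_running is a hypothetical boolean that a plugin would obtain by
    querying its Resource Manager. Returns True if the marker was written.
    """
    if is_running:
        return False
    marker = os.path.join(job_dir, ".FAILED")
    open(marker, "w").close()  # an empty file is all that is needed
    return True


def check_resources(required_mb, available_mb):
    """Return a (can_run, reason) tuple in the style of canRunJob.

    Both arguments are hypothetical inputs; the sample plugin does not
    say which attributes of the job describe its memory needs.
    """
    if required_mb > available_mb:
        return (False, "Not enough memory")
    return (True, None)
```

Inside a real plugin, `jobStatus` would call something like the first helper with the result of its own status query, and `canRunJob` would return the tuple produced by the second.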
4.7.
Standard CPU time
How to adjust reported CPU time depending on the performance of the individual node on which a job is run.
What is Standard CPU time?
The GRIA Job Service receives usage reports about the current CPU utilisation of running jobs. It forwards these usage reports to the SLA service so that users can be billed according to how much CPU time they are using on compute nodes.
If the Job Service is configured to submit jobs to a heterogeneous cluster (i.e. consisting of machines with different specifications), the administrator might want users to be billed more for CPU time if their jobs are executed on faster machines. The GRIA Job Service can adjust the amount of reported CPU usage according to the measured performance of the node on which the job is running.
Benchmarking each node
The process of recording the performance of each node is not automatic: the administrator must run a benchmark on every machine on which a job could be run. One standard CPU second is defined as one second of full CPU utilisation on a 1 GHz Pentium III processor, which achieves 54.3 Mflop/s on the Linpack Java Benchmark.
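To make the definition concrete, the conversion can be sketched as below. The linear scaling is an assumption made for illustration: this guide defines the reference value (54.3 Mflop/s) but does not spell out the exact formula the Job Service applies, and `standard_cpu_seconds` is a hypothetical name.

```python
# Reference performance from the definition of a standard CPU second.
REFERENCE_MFLOPS = 54.3


def standard_cpu_seconds(raw_cpu_seconds, node_mflops):
    """Scale raw CPU seconds by the node's benchmark relative to the reference.

    Assumes a simple linear relationship, which is an illustration only.
    """
    return raw_cpu_seconds * (node_mflops / REFERENCE_MFLOPS)


if __name__ == "__main__":
    # A node that benchmarks at twice the reference value would report
    # roughly twice as many standard CPU seconds for the same raw time.
    print(standard_cpu_seconds(10, 108.6))
```

Under this assumption, a job running on a node that matches the reference machine is billed for exactly the CPU time it used, while a faster node is billed proportionally more.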
From the machine you wish to benchmark, enter the URL http://www.netlib.org/benchmark/linpackjava into a browser (ensuring that Java has been enabled).
This will load an applet with a window similar to the one below:
Click the Press to Run Benchmark button, at the top of the applet window, to calculate your machine's performance.
To find the power of your machine, look at the Mflop/s reading (highlighted above).
This number needs to be stored in a file - either /etc/gria/benchmark or C:\gria\benchmark.txt depending on the operating system. On Linux, you can create this file using the following command (supplying your value instead of 54.721):
echo "54.721" > /etc/gria/benchmark
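Before relying on the file, it can be worth checking that it contains a single positive number. The helper below is a hypothetical sanity check, not the parser used by the Job Service, whose behaviour is not documented here.

```python
def read_benchmark(path):
    """Return the Mflop/s value stored in the benchmark file at path.

    Raises ValueError if the content is not a positive number.
    """
    handle = open(path)
    try:
        text = handle.read().strip()  # echo adds a trailing newline
    finally:
        handle.close()
    value = float(text)  # float() raises ValueError for non-numeric text
    if value <= 0:
        raise ValueError("benchmark must be positive, got %r" % value)
    return value


if __name__ == "__main__":
    print(read_benchmark("/etc/gria/benchmark"))
```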