Basic Application Services User Guide
This guide describes how to use the Basic Application Services Package for provision of data storage and processing (using application installed on a cluster) to trusted users.
1.
Overview
Overview of the Basic Application Services
The GRIA basic application services package provides the core functionality for job and data management. It consists of:
- A Data Service
- This allows remote users to upload and download data
files to the service provider, and to transfer data between Data
Services hosted by different service providers. The Data Service also supports
management of access rights (for read or read-write access) granted to
other users or service providers.
- A Job Service
- This allows remote users to start, monitor or kill computational
jobs executed by the service provider. The Job Service will fetch input from
and write output to a local Data Service. The Job Service can be configured to support
multiple applications, which are chosen by the service provider.
The application services can be configured to be either unmanaged (free) or
managed by the GRIA service provider management package, as in the
diagram below.
2.
Installation
GRIA Basic Application Services Installation
Standard Installation Procedure
The Basic Application Services package is provided as a zip file, or as a tar.gz archive for Linux. Unpack the archive and you will find the following items:
- docs (folder)
- gria-service-provider-mgt.war
- README.html
Install the war file according to the Service Installation Manual. Once the initial configuration has been completed, the Basic Applications Package requires some extra configuration.
Additional Pre-requisites
Some additional pieces of software need to be installed.
Perl
Perl is required in order to run the Basic Application Services correctly. For Windows, the most common Perl implementations are Cygwin Perl and ActiveState Perl. We recommend ActiveState Perl: choose the latest release, click the "Next" button, and then download the MSI
distribution for Windows. To complete the installation you must restart Windows.
Test Application: ImageMagick
ImageMagick is the default test application used in GRIA. ImageMagick
binaries for Windows can be downloaded from here.
Use a Q8 version, e.g. ImageMagick-6.4.2-x-Q8-windows-dll.exe.
The ImageMagick.exe package for Windows is self-extracting, and the installation procedure starts
automatically. Follow the instructions and select the default options.
Note: older versions of ImageMagick might not support all of the basic application service's default examples, e.g. blend.
Continuing the Installation
Once the additional pre-requisites have been installed, the installation can be continued by following the instructions for each service.
3.
The Data Service
Overview and Configuration of the Data Service
Overview
The GRIA data service is used to manage "data stagers". A data stager is a container for a single file (or zip file). It has a unique identifier and an access control system for determining who can read and write the data. Clients can use the service to create
new stagers, upload and download data, transfer data between stagers, and control others' access to
the data.
Two items of configuration must be given before the data service can be used:
- The location of the root data directory
- The service stores any uploaded data inside this directory. If the data service is going to be used with a job service and jobs are going to execute on a cluster then the cluster's machines need to be able to read and write to this directory.
- A list of trusted management services
- Normally, you can just click Add
to accept the default management service. This is the SLA management
service from the GRIA service provider management package. Note that if
the GRIA basic application services package is deployed on a different
machine to the GRIA service provider management package, some
additional access control setup is required. This is described in the Links with Other Services section of the Service Provider Management user guide. As an alternative to configuring the service to be managed, you can make it unmanaged (or "free"), by clicking the Make service free button.
4.
The Job Service
4.1.
Basic Configuration
The Job Service
The GRIA job service is used to manage jobs. Clients can use the service to create
new jobs, upload input data, start the job, monitor progress, and download results.
Each input and output of a job is actually a data stager managed by the local data service
(the one in the same .war as the job service). Therefore, you must configure
the data service before the job service can be used. Users can run jobs that take input from
or send output to other data services by using the normal data transfer features provided
by the data service.
To configure the job service you will need to specify:
- The location of the root job directory. The service creates one subdirectory inside this directory for each job.
- The directory containing the platform scripts. These scripts interact with the underlying resource manager,
allowing the job service to be used with clusters of machines.
- A list of trusted management services. Normally, you can just click Add
to accept the default management service. This is the SLA management
service from the GRIA service provider management package. Note that if
the GRIA basic application services package is deployed on a different
machine to the GRIA service provider management package, some
additional access control setup is required. This is described in the Links with Other Services section of the Service Provider Management user guide. As an alternative to configuring the service to be managed, you can make the service unmanaged (free), by clicking the Make service free button.
- A list of applications which users can run using the service. See the Managing applications section for details.
4.2.
Managing Applications
Managing applications
This section covers the administrative tasks of deploying and
undeploying applications. After deploying an application to your job
service, it becomes available for execution by remote clients. Note,
however, that clients usually must satisfy additional business
constraints, such as having an appropriate service level agreement or
account, before they can execute deployed applications.
The GRIA Basic Application Services software is provided with a set
of tutorial applications. These are made available during installation
of the software. The web based Administration Interface
provides the location of these files during the installation process, and guides you through the simple process of application
deployment.
In addition to the tutorial applications, it's straightforward to develop new applications and deploy these in the same way.
This section assumes that you have all necessary files for application
deployment and that any required executable applications
have been installed according to the application documentation. If you
are installing the tutorial applications, you have all the files you
need. If you are deploying your own applications, first see Writing Application Scripts
for details of the files you need to produce before application deployment.
Deploying applications
Having obtained or created the files and scripts needed for
application deployment, the application can now be deployed to the job
service. To do this, make sure that Tomcat is running, then use a web
browser to log into the GRIA Basic Application Services administration page. This can be found at
http://<servername>:8080/gria-basic-app-services.
Make sure you enter the appropriate security credentials and
adjust the URL for the administration page, according to your server
setup, as appropriate.
From the administration page, select the Job Service link, as shown below.
This displays the Job Service Admin page. In the Applications section, enter the location of the directory containing
the files and scripts needed for deployment. Then click the Deploy new application button.
This opens an admin page showing the properties of the application. You can optionally enter arguments for the
platform script, before clicking the Accept button.
This completes deploying an application to the Job Service.
Undeploying applications
Undeploying an application is straightforward. First click the Edit button alongside the entry for the application
you wish to undeploy.
Click the Undeploy button to undeploy the application.
4.3.
Execution Platform Models
Overview
The GRIA architecture is flexible enough to use a variety of underlying computing
platforms to run jobs, from single computers to clusters of workstations
or even supercomputers. The following sections of this document describe the
constraints that GRIA places on different platforms.
GRIA system administrators should read this document and configure GRIA accordingly
to accommodate the infrastructure of their underlying computing platform.
- Applications should be accessible by compute nodes only. Installation of
the applications can be either local per compute node or over disk space shared
among all nodes.
- Platform scripts should be accessible by the GRIA Job Service middleware.
- Application scripts should be accessible by the GRIA Job Service middleware
and compute nodes, e.g. either copy wrappers locally per node or use a shared
directory.
- Scratch area, i.e. job workspace area, should be accessible (read/write)
by both the GRIA Job Service middleware and compute nodes. The scratch area
cannot be copied between compute nodes, instead it has to be exported as a
shared disk space.
Directory names and paths for wrappers should be similar whether the access
point is a compute node or the GRIA Job Service middleware. The following figure
illustrates an overview of the GRIA Job Service in relation to its execution resources.
The Need for Platform Scripts
GRIA Services do not access resource managers directly to submit and
check jobs. Instead, GRIA Services introduce an extra layer of resource
manager dependent platform scripts to submit and check jobs. For each
resource manager GRIA requires a separate suite of platform scripts.
This extra layer of platform dependent scripts decouples GRIA Services
from resource managers and applications.
GRIA defines platform script APIs to handle jobs, such as:
- Start job script to submit jobs
- Check job to check the status of a job
- Kill job to terminate a job
The Job Service can then be configured to use platform scripts
suitable for the underlying computing platform. The platform scripts
know how to handle (start, check, kill) jobs for that particular
computing platform, and can be instructed to run a particular
application via its application wrapper.
Figure 1 illustrates how the platform script layer sits between the Job Service
and application wrappers hiding resource manager details.
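As an illustration only, the three operations could be sketched for the local execution case as POSIX shell functions. The function names (start_job, check_job, kill_job), the .job_handle file name and the status strings are assumptions made for this sketch; they are not the GRIA platform script API. A platform script for a resource manager such as PBS would call that manager's own submit, query and delete commands instead of the process commands used here.

```shell
#!/bin/sh
# Illustrative sketch only: the function names, the .job_handle file name and
# the status strings are assumptions, not part of GRIA.

# start_job WORKSPACE COMMAND [ARGS...]
# Run the command in the background inside WORKSPACE and save its process
# id in WORKSPACE/.job_handle so that later calls can find the job.
# Output is discarded here; the wrapper is expected to capture it itself.
start_job() {
    ws=$1
    shift
    ( cd "$ws" && exec "$@" ) > /dev/null 2>&1 &
    echo "$!" > "$ws/.job_handle"
}

# check_job WORKSPACE
# Print NOT_STARTED, RUNNING or FINISHED by probing the saved process id.
check_job() {
    ws=$1
    if [ ! -f "$ws/.job_handle" ]; then
        echo "NOT_STARTED"
    elif kill -0 "$(cat "$ws/.job_handle")" 2>/dev/null; then
        echo "RUNNING"
    else
        echo "FINISHED"
    fi
}

# kill_job WORKSPACE
# Read the job handle back from the workspace and terminate that process.
kill_job() {
    ws=$1
    [ -f "$ws/.job_handle" ] || return 1
    kill "$(cat "$ws/.job_handle")" 2>/dev/null
}
```

Because the job handle is stored in the workspace, each operation can run as a separate short-lived process, which is the property the Job Service relies on. A process id is a weak handle (it can be reused by the operating system), so a production script would normally use the identifier returned by the resource manager.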
Local Execution Scenario
This is the minimum configuration scenario: all GRIA Services and jobs run locally on the same machine.
The following figure shows how GRIA can be deployed with services running locally.
In this example the Job Service middleware is using platform scripts that run
jobs locally. Job workspace, wrappers and applications should be accessible
by the system that runs GRIA services.
Note: this is the same configuration the GRIA release demo is using to run
the tutorial applications locally.
Using a Cluster of Resources Scenario
The GRIA Services can also be deployed to use a cluster of workstations as the
computational platform. For example, in Figure 3 the Job Service middleware is using
PBS platform scripts to handle jobs. Cluster compute nodes need access to the applications,
the wrappers and the job workspace, whereas the GRIA Services need access to the
job workspaces and the wrappers.
4.4.
Application Wrapper Scripts and Description Files
How to write the scripts and meta-data file to integrate your application.
4.4.1.
Overview
Applications and The Job Service
Relation to the Job Service
A functioning GRIA Job Service installation is composed of three main parts:
1. The service software running under Tomcat/AXIS.
2. Scripts for running or interacting with application codes on an execution
platform, which may be the service host itself, a separate execution
server, or a cluster.
3. Some installed applications capable of running on the execution platform,
each corresponding to a different Job Service end-point.
The Job Service software (1) supports various bookkeeping operations
for assigning job ids, workspace, etc. The core operations are those
for actually starting and managing the execution of an application:
starting jobs, checking their status, and killing jobs. Each of these
has to access the execution platform by running the appropriate script
(2), which can start or otherwise interact with the application running
on the execution platform (3). For security reasons, we do not usually allow
users to upload their own applications, so the service operator must
install all the applications.
The Job Service and platform scripts are designed to support a uniform model of application execution, shown in Figure 1:
The workspace for each job is set up by the Job Service when the job
is initialised (this is one of the bookkeeping service operations that
must precede the call to start the job). The workspace has a standard
directory structure so the Job Service and platform scripts can create
and find information stored in it, including a working sub-directory
where the job will actually run.
When the user starts the job, the Job Service transfers input data
files from Data Service URIs into the job's workspace. It then runs a
platform script locally, which in turn submits the application to the
execution platform (cluster, etc.) where it will run (using the
specified command line), possibly after some queuing delay. The
platform script saves the job handle to the workspace. The application
has to read the input data left in its workspace by the Job Service,
and write any outputs to the workspace so that the Job Service can find
them and transfer them to output Data Service URIs when the job has
finished.
When the user asks for the status of the job, the service runs a
second platform script that reads status information (e.g. when the job
started or finished, etc), which the application must store in the
workspace. This platform script may also run an associated monitoring
application (using the specified command line) to gather
application-specific status information (e.g. number of iterations
completed, convergence plots, etc) from the workspace.
If the user asks for the job to be killed, the service uses a third platform
script that reads the job handle from the workspace, and issues a command to
kill this job on the execution platform. The Job Service will detect that the
job has finished, and will transfer any output produced. Note that the user
(or their client-side application) should always check the status of a job to
find out if it crashed or was killed, as some incomplete output may appear in
this case.
Why are Application Wrapper Scripts Required?
In practice, few legacy applications behave exactly according to the
model shown in Figure 1. It is rarely possible to change the
application itself to fix this, so instead GRIA uses so-called wrapper
scripts that do conform to the application model for starting and
managing the application.
In practice, the wrapper scripts can do more than just make the
underlying application work as indicated in Figure 1. They can also be
used to handle and implement application specific features of the
service.
One can also use (optional) wrapper scripts to look for
application-specific status information in the working directory of the
job. Without such scripts, the platform scripts can only provide basic
job status information from the job submission system.
Finally, wrapper scripts also provide a configurable mechanism for
dealing with any application-specific security risks, e.g. checking for
malicious input that may exploit a feature of the application. Few
legacy applications were designed as network-accessible services, and
since we can’t change them to remove security loopholes, the use of
wrapper scripts is essential to check for any exploits of application
vulnerabilities. In the limit, one can configure the wrapper (and
platform) scripts to run the application in a sandbox (e.g. chroot),
with access only to a working sub-directory of the job workspace, as
shown in Figure 2.
4.4.2.
Start Job
The startJob.pl Application Wrapper Script
Application Wrapper Functionality
The startJob.pl application wrapper script is a mandatory script that deals with any application specificity, allowing
the Job Service to treat all applications in the same way, and so decoupling
the Job Service from the details of the application.
The main functions of the application wrapper script are:
- handling input and output data files;
- creating a consistent environment that is referred to correctly in inputs;
- enforcing any security precautions to protect against loop-holes in the application.
The application wrapper is designed to run on the execution platform, having
been submitted by the platform script for starting a job. Prior to submitting
the wrapper script on the execution platform, the Job Service will have set up
a workspace (directory) for the job, copied input data into it e.g. work/inputs,
and created a working sub-directory for the job to run in, e.g. work.
The following listing shows a workspace directory structure with two input
files and an empty outputs directory.
ff808081-1017450e-0110-174532dd-0001-1
`-- work
    |-- inputs
    |   |-- input-0
    |   `-- input-1
    `-- outputs
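For testing a wrapper outside the Job Service, it can be useful to recreate this layout by hand. The following is a minimal sketch; make_workspace and stage_input are hypothetical helper names, and in a real deployment the Job Service creates the workspace itself.

```shell
#!/bin/sh
# Sketch only: make_workspace and stage_input are hypothetical helpers for
# building a test workspace with the layout shown in the listing.

# make_workspace ROOT JOB_ID
# Create ROOT/JOB_ID/work/inputs and ROOT/JOB_ID/work/outputs, then print
# the path of the new workspace.
make_workspace() {
    root=$1
    job_id=$2
    mkdir -p "$root/$job_id/work/inputs" "$root/$job_id/work/outputs" || return 1
    echo "$root/$job_id"
}

# stage_input WORKSPACE INDEX FILE
# Copy FILE into the inputs directory under the name input-INDEX.
stage_input() {
    cp "$3" "$1/work/inputs/input-$2"
}
```

For example, ws=$(make_workspace /tmp/jobs test-job-1) followed by stage_input "$ws" 0 image.jpg produces a workspace with a single input file named input-0.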
After changing to the workspace directory (not the working
sub-directory), the wrapper will be submitted using the following
command line:
app-wrapper <application arguments>
The functionality of the wrapper script should include the following:
1. Parse wrapper arguments, including security checks for
illegal input designed to inject malicious commands into the
command line used to launch the application.
2. Move input data files into the working sub-directory, including
unpacking any that are compressed archives containing multiple inputs.
3. Create a consistent environment in the working directory, by setting up
environment variables and rewriting input data to match the local
environment where necessary.
4. Touch the file .app_started in the job workspace (this should be
outside the working sub-directory), so recording the time when the
application started. This step is optional.
5. Build the command line and run the underlying application, making sure
that the standard output channels are directed to .stdout and .stderr in the
working directory. This step is optional.
6. When the application has finished, touch the file .app_ended in the
job workspace, so recording the time when the application finished.
This step is optional.
7. Copy output files from the working directory into the specified
positions in the workspace, including packing multiple outputs into
compressed archive files where necessary.
8. Exit by returning the exit code of the application.
For simple applications, security can be maintained by checking
input parameters during step 1 and, if necessary, data files during step
3. If the application is too complicated for this to be reliable, it
may also be necessary to set up a sandboxed working environment and run
the code inside it during step 5.
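The eight steps listed above can be sketched as a single shell function. This is a skeleton under stated assumptions rather than a GRIA-supplied wrapper: the "application" is the standard tr utility, --mode is a made-up argument, and the file names data.txt and result.txt are arbitrary. The function expects to be called from the job workspace directory.

```shell
#!/bin/sh
# Skeleton of a start wrapper following the eight steps in the list.
# Assumptions for illustration: the application is the standard tr utility,
# --mode is a made-up argument, data.txt and result.txt are arbitrary names.
wrapper_main() {
    # Step 1: parse arguments and reject anything not explicitly allowed.
    mode=
    while [ "$#" -gt 0 ]; do
        case $1 in
            --mode)
                [ "$#" -ge 2 ] || return 2
                mode=$2
                shift 2
                ;;
            *)
                echo "unknown argument: $1" >&2
                return 2
                ;;
        esac
    done
    case $mode in
        upper) from='a-z'; to='A-Z' ;;
        lower) from='A-Z'; to='a-z' ;;
        *) echo "illegal mode" >&2; return 2 ;;
    esac

    # Step 2: place the input data in the working sub-directory.
    cp work/inputs/input-0 work/data.txt || return 1

    # Step 3: set up a consistent environment for the application.
    LC_ALL=C
    export LC_ALL

    # Step 4 (optional): record the start time outside the working directory.
    touch .app_started

    # Step 5: run the application; anything it prints besides the result
    # file ends up in work/.stdout and work/.stderr.
    ( cd work && tr "$from" "$to" < data.txt > result.txt ) \
        > work/.stdout 2> work/.stderr
    status=$?

    # Step 6 (optional): record the end time.
    touch .app_ended

    # Step 7: copy the result to its output position.
    if [ "$status" -eq 0 ]; then
        cp work/result.txt work/outputs/output-0 || status=1
    fi

    # Step 8: return the exit code of the application.
    return "$status"
}
```

A wrapper script built on this would end with wrapper_main "$@" followed by exit $?. Note that the user-supplied value of --mode is only ever compared against a fixed list and is never placed on a command line itself.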
Some of these steps are considered in more detail below.
Input and Output Data Handling
When unpacking input data, the application wrapper should attend to the following:
- Create any substructure needed in the job's working sub-directory of the workspace.
- Copy or unzip input files from the inputs sub-directory into the job's working space.
- Check that all input needed to run the job is present.
The Job Service knows in advance which output files have to be sent
back to the outputs directory. The application wrapper has to create these
files by:
- Copying or zipping data to create the required output files in the outputs sub-directory
- Renaming these files to follow the output-x naming scheme.
When the script finishes, the Job Service will detect this and handle the transfer of output files accordingly.
The number of output files is always fixed, so the wrapper can know
what outputs are needed, and how to assemble them. However, the user is
allowed to distribute multiple inputs across an arbitrary number of
input zip archives, provided they specify all the Data Service URIs
where these are stored when they start the job. The application wrapper
must be capable of handling this.
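One way to cope with an arbitrary mix of plain and zipped inputs is sketched below. The helper names (is_zip, unpack_inputs, require_inputs, publish_output) are hypothetical, and the sketch assumes that the Info-ZIP unzip command is installed on the execution platform for the archive case. Archives are recognised by the two-byte "PK" signature at the start of the file rather than by name, since staged inputs are called input-0, input-1 and so on.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; unzip (Info-ZIP) is assumed to
# be available when an input turns out to be an archive.

# is_zip FILE: succeed if FILE starts with the zip signature "PK".
is_zip() {
    [ "$(od -An -c -N2 "$1" 2>/dev/null | tr -d ' \n')" = "PK" ]
}

# unpack_inputs INPUTS_DIR WORK_DIR
# Unzip archives and copy plain files, however many inputs were staged.
unpack_inputs() {
    inputs_dir=$1
    work_dir=$2
    for f in "$inputs_dir"/input-*; do
        [ -e "$f" ] || continue
        if is_zip "$f"; then
            unzip -q -o "$f" -d "$work_dir" || return 1
        else
            cp "$f" "$work_dir/" || return 1
        fi
    done
}

# require_inputs WORK_DIR NAME...
# Fail if any file the application needs is missing after unpacking.
require_inputs() {
    work_dir=$1
    shift
    for name in "$@"; do
        if [ ! -f "$work_dir/$name" ]; then
            echo "missing input: $name" >&2
            return 1
        fi
    done
}

# publish_output FILE OUTPUTS_DIR INDEX
# Copy FILE to OUTPUTS_DIR using the output-x naming scheme.
publish_output() {
    cp "$1" "$2/output-$3"
}
```

Calling require_inputs before starting the application lets the wrapper fail early with a clear message, instead of leaving the application to fail in some less obvious way part-way through the job.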
Consistent Context Reconstruction
Why do we need context reconstruction? The input data for our
application has been created on another system with a different
directory structure, environment and possibly even operating system. We
have to set up an equivalent (not necessarily identical) environment on
our execution platform, and make sure any input data references to the
remote user's environment are mapped onto the one we have created, or
they will be invalid when the application is started.
When and where should context reconstruction be performed? One
should handle it as close as possible to the running application, and certainly on the execution platform where the job will actually be run, as this is where the environment is needed. This is why the Job
Service doesn't attempt to create the context itself: there is no
point doing it at the service host if the job will be executed on a
compute node in a Condor cluster. Instead, we leave it to the
application wrapper to handle everything in an application-specific way
on the execution platform itself.
A typical approach to context reconstruction might involve passing
an array of named parameters to the Job Service, including environment
settings as well as application flags. These will be passed to the
wrapper through its argument list. In addition, one can provide
settings in an extra input file, intended for the wrapper rather than
the job itself, and used to set up the environment prior to running the
application code.
The hardest job for the wrapper is to parse and rewrite application
input data where necessary to ensure it is consistent with the
environment established on the execution platform. If this is not
needed, it is usually quite easy to 'wrap' an application to run inside
the Job Service. Where it is necessary, the wrapper may become a
significant body of code in its own right.
For example, consider the following line of input intended for the
rendering application AIR, used with the Job Service to provide a
grid-enabled video rendering service:
Option 'searchpath' 'shader' ['&:e:\AnimalLogic\MaxMan\shaders:C:\Sample\shaders']
The problem here is that the application uses plug-ins to perform
part of the rendering calculation, and the search path for these can be
specified in the user input. This particular input file has been
generated automatically using a graphical environment for video
post-production, which has filled in the relevant path based on where
the shader libraries were installed on the user's local machine.
The wrapper has to identify which shaders are needed, and substitute the path to them on the local system:
Option 'searchpath' 'shader' ['&:/export/apps/AnimalLogic/MaxMan/shaders:/export/apps/air/Sample/shaders']
In some cases, it may be possible to infer the meaning of
client-side environment references by pattern matching against a list
of meaningful terms used by the application. In others (probably in
this case), it is necessary for the user to send the install path
quoted for specific groups of plug-ins as service arguments or
environment settings, so the wrapper can find them and map them onto
the equivalent installed groups of components on the execution platform.
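For the AIR example above, the substitution itself could be done with a fixed mapping in sed, as in the following sketch. The function name rewrite_shader_paths is hypothetical, and the mapping is hard-coded from the two paths quoted in the example; a real wrapper would more likely build the mapping from service arguments, as just described.

```shell
#!/bin/sh
# Sketch only: rewrite_shader_paths is a hypothetical helper, and the two
# mappings are taken directly from the example input and output lines.

# rewrite_shader_paths: filter stdin to stdout, replacing the client-side
# shader directories with their locations on the execution platform.
# Inside the single quotes, \\ is a sed pattern for one literal backslash.
rewrite_shader_paths() {
    sed \
        -e 's|e:\\AnimalLogic\\MaxMan\\shaders|/export/apps/AnimalLogic/MaxMan/shaders|g' \
        -e 's|C:\\Sample\\shaders|/export/apps/air/Sample/shaders|g'
}
```

The wrapper would run this over the input file before starting the application, for example rewrite_shader_paths < scene.in > scene.local (both file names are placeholders).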
In extreme cases, it may be necessary to establish multiple services
to run the same application in different ways, allowing a different,
specific environment to be set up for each. For example, it probably
wouldn't make sense to have a single service to run a computational
fluid dynamics (CFD) code capable of simulating coolant flows through
automotive engines AND the propagation of drugs in aerosol suspension
in human lungs. It would be asking too much of a wrapper developer to
differentiate and correctly handle such extreme cases, and instead one
should set up two services each with its own wrapper specialised to one
of these scenarios.
Security Containment
Why Wrappers have to Bother with Security
The Job Service regards application wrapper scripts as trustworthy,
because the service operator can inspect them and make sure they don't
do anything strange or foolhardy. However, the applications may be
third party, closed source executables that cannot be inspected, and
were not designed as network-accessible services in the first place.
Wrapper scripts can protect the service from malicious users in three ways:
- checking any user input used to create the command line for
running the application, to exclude command injection attacks using
parameters like 'method=gauss; cd /; rm */*';
- checking input data known to be used in an unsafe way by the application, e.g.
to construct system calls for executing plug-ins or moving files around;
- confining the application to a sandbox, by first preparing the sandbox and then launching the application in it.
If the application is very simple, or designed to withstand
malicious users, or if you have only a small number of users you know
well (and trust not to mislay their credentials) then it may be OK to
include only the first of these measures.
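The first measure is usually implemented as a whitelist: rather than searching for dangerous characters, the wrapper accepts only parameter names and values it knows to be harmless. The sketch below assumes parameters arrive in name=value form; the function names and the permitted parameter names (method, iterations) are invented for illustration.

```shell
#!/bin/sh
# Sketch only: function names and the list of permitted parameter names are
# assumptions made for this example.

# is_safe_value VALUE
# Accept only non-empty values made of letters, digits, '.', '_' and '-'.
is_safe_value() {
    case $1 in
        '' | *[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

# validate_param NAME=VALUE
# Succeed only for a known parameter name with a safe value.
validate_param() {
    case $1 in
        *=*) ;;
        *) return 1 ;;
    esac
    name=${1%%=*}
    value=${1#*=}
    case $name in
        method | iterations) ;;
        *) return 1 ;;
    esac
    is_safe_value "$value"
}
```

With these checks, validate_param 'method=gauss' succeeds, while the injection attempt quoted in the list is rejected because its value contains spaces, semicolons and other characters outside the permitted set.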
Legacy applications are quite likely to do things in unsafe ways.
Renaming files or testing if they exist are often done via system calls
if the application developer wasn't expecting filenames to be sent by a remote
user who may have malicious intent. If the application isn't too
complex, or if you can check with the developer on what might happen,
then it should be OK if you also check the user-supplied input and
check filenames and other data that may be sent to unsafe system calls.
In the worst case, one has to assume the application will be unsafe,
and attempt to contain any damage caused by malicious (or possibly
careless) input by restricting what the application can do and where it
can do it. There are several possible ways to achieve such restrictions.
Chroot
On Linux systems, chroot can be used to restrict a sub-process to an
arbitrary sub-directory, e.g. a job's working directory. The chroot
mechanism was designed for use by operating system developers to allow
them to create a pseudo-root within which to test their code. Although the
chroot container doesn't prevent access to low-level devices, it will
prevent most legacy applications accessing files outside the specified
sub-directory. Chroot is widely used to contain web servers and other
network applications to minimise the scope for damage if they are
compromised.
To use chroot, it is necessary to create a complete operating system
environment inside the job's working directory (which it will see as '/'). One has to copy application binaries, resolve any references to system/application libraries, create devices such as /dev/null,
etc. To create a self-sufficient chroot 'jail' environment sufficient
to run the application may not be easy, and of course, it would need to
be repeated for each individual job. However, it can provide a good
safety level as its 'jail' environment is enforced by the operating
system itself.
Restricted Shells
Many shells, including bash, provide a restriction mechanism, usually
invoked by running the shell with the -r switch. Some common features
of restricted shells are the ability to prevent a program from changing
directories, to disallow command names that contain a path (so that only
commands found on the search path can be run), and to prohibit
output redirection and changes to the search path.
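As a sketch of how a wrapper might use this with bash, the function below runs a command string in a restricted shell whose search path contains a single directory of permitted commands. The name run_restricted and the idea of a dedicated "allowed" directory are assumptions for illustration, and the sketch requires bash to be installed. A restricted shell limits what the command string can do, but it is not a substitute for validating that string first.

```shell
#!/bin/sh
# Sketch only: run_restricted is a hypothetical helper; bash must be
# installed and built with restricted-shell support (the default).

# run_restricted ALLOWED_BIN_DIR COMMAND_STRING
# Run COMMAND_STRING under "bash -r" with PATH set to ALLOWED_BIN_DIR only.
# bash is located first, because it will not be on the reduced PATH.
run_restricted() {
    allowed_bin=$1
    command_string=$2
    bash_path=$(command -v bash) || return 127
    PATH="$allowed_bin" "$bash_path" -r -c "$command_string"
}
```

Inside the restricted shell, attempts such as cd /, running /bin/ls by its path, or redirecting output with > are refused, while shell built-ins and any programs placed in the permitted directory still work.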
Minimal privilege accounts
Another approach is to create a low-privilege account for each job.
The wrapper script would then have to assign such an account, change
the working directory so it is owned by this account, and run the
application in that working directory under the same account. Provided
the same account is not used for anything else (including running other
jobs), the application can be prevented from accessing anything outside
the working directory, even if it can be induced to run some unforeseen
system call by sending some malicious input.
The two drawbacks with this approach are:
- ideally one should create a pool of accounts and provide a
way for the wrapper to assign them to jobs rather than creating new
accounts, but this isn't supported at present;
- the wrapper would need sufficient privilege to set the account under which a
sub-process is run, which may make the wrapper more dangerous if it can
be compromised.
The second drawback may not be too bad, given that the wrapper at
least can be designed to check all inputs and avoid doing anything
unpleasant. At present, the Job Service runs with a normal unprivileged
user identity, so it may be better to use other methods to contain
individual jobs.
Other Methods
The above list is by no means exhaustive. For example, if the chroot 'jail'
is not sufficient, one can create an entire virtual machine on which to run
a potentially unsafe application. Software such as VMware can be used to implement
this approach, but users who want to go to these lengths are on their own, at
least in this version of the software.
Error Handling
If an error is encountered in the application, the wrapper must report the
fact. If this is not done, the Job Service will assume everything is OK, and
the user's client application will probably attempt to continue, which may not
be appropriate if some output from the job is missing, etc.
Return values are passed back to the Job Service middleware via three files in the job workspace:
- .app_wrapper_exit_code
- .app_exit_code (optional)
- .app_exit_status (optional)
The application startJob wrapper should exit with an exit status of zero if the job
has completed successfully, or with a non-zero status if the job has failed. This value
will be stored in .app_wrapper_exit_code by the platform scripts. Generic
clients may stop executing a workflow, for example, if this result is not zero.
The two optional files may be used to provide extra application-specific
information to specialised counterpart client applications. The Job Service is
not expected to understand these codes, but will just pass the values, if set,
to the client. The application wrapper must write these files, if they are needed.
The .app_exit_code file should contain the application's exit code.
The .app_exit_status file can contain further information.
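A wrapper might record these values with a small helper such as the one sketched here. The function name record_exit is hypothetical, and the one-line text format used for both files is an assumption; the guide does not prescribe their contents beyond what is stated above. The .app_wrapper_exit_code file is not written here, because the platform scripts create it from the wrapper's own exit status.

```shell
#!/bin/sh
# Sketch only: record_exit is a hypothetical helper and the one-line file
# format is an assumption.

# record_exit WORKSPACE APP_EXIT_CODE [STATUS_TEXT]
# Write the optional .app_exit_code and .app_exit_status files, then return
# APP_EXIT_CODE so that the wrapper can exit with the same status.
record_exit() {
    ws=$1
    code=$2
    text=${3:-}
    printf '%s\n' "$code" > "$ws/.app_exit_code" || return 1
    if [ -n "$text" ]; then
        printf '%s\n' "$text" > "$ws/.app_exit_status" || return 1
    fi
    return "$code"
}
```

At the end of a wrapper this could be used as record_exit "$PWD" "$status" "converged after 12 iterations" followed by exit $?, where the status text is whatever the counterpart client application expects.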
An Example Wrapper Script
This example is based on the ImageMagick application which was
installed as part of the GRIA installation. To get started, we will
create a simple wrapper that runs this application.
- Create a startJob wrapper script:
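#!/bin/sh
# Send everything the wrapper prints (stdout and stderr) to the file "log".
exec > log 2>&1
echo "Swirl wrapper started"
echo "Arguments are: $*"
# Expected invocation: startJob -i <input image> -o <output image>
INPUT="$2"
OUTPUT="$4"
echo "Copying inputs to work directory..."
# Use a separate working copy so the input may itself be named image.jpg.
cp "$INPUT" swirl-work.jpg || exit 1
echo "Run the mogrify command..."
# mogrify modifies the file in place, applying a 60 degree swirl.
mogrify -swirl 60 swirl-work.jpg || exit 1
echo "Copying result to output stager..."
cp swirl-work.jpg "$OUTPUT" || exit 1
echo "Swirl job completed successfully"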
This performs the following steps:
- Redirect all output to the log file.
- Print the arguments to the log.
- Copy the input image into the work directory.
- Run the mogrify command to transform the image.
- Copy the result to the output stager.
- Edit the startJob script to run your command instead of swirl (the command you tested above).
- Make the script executable:
$ chmod a+x startJob
- Test the wrapper with this command:
$ ./startJob -i image.jpg -o output.jpg
$ cat log
Check that the log shows that the command ran correctly. It should
only take a few seconds to process the image. If it takes longer, press
Ctrl-C and examine the log.
4.4.3.
Check Job
Creating the checkJob.pl Status Wrapper Script
Unlike the wrapper for starting an application, the application
wrapper for reading status from the working directory is optional. If
no such wrapper is provided, the platform script will create a simple
status report by checking for files like .app_started and .app_ended that
indicate if and when the job started or finished, consulting the
execution platform manager (e.g. batch queue system), and appending
.stderr (if any) to the result.
If you want the job status report to include application-specific
information such as convergence plots, iteration counters, etc., you
should create a wrapper script that will be invoked by the client calling the checkJob method.
An application status wrapper is usually a lot simpler to create
than the main wrapper because it does not take any user-supplied (and
hence potentially malicious) arguments, and does not set up (or run)
potentially untrustworthy code. All the status wrapper has to do is to
examine the job's working directory, read any files it needs in order
to extract the desired status information (in the limit, one could
simply copy an application-level log file), and write it to the
standard output.
Note: the format of the status information is open and application dependent. However, status information should not include binary data.
An Example Status Wrapper Script
This example is based on the ImageMagick application, which was
installed as part of the GRIA installation and follows on from the start job example. In the same directory as startJob, create a script called checkJob:
#!/bin/sh
tail log
This is run each time the client checks the status of the job. This example simply returns the last few lines of the log file, and is executed as follows:
- Make the script executable:
$ chmod a+x checkJob
- Test it with these commands:
$ ./checkJob > statusfile
$ cat statusfile
You should find that the contents of log are now in statusfile.
4.4.4.
Kill Job
Creating a killJob.pl Application Wrapper Script
The application-specific kill script allows an application to be killed in a
particular way, e.g. using an application-specific mechanism.
This script is optional. If it is not present in the argument list, the platform
script will try to kill the job at the resource manager level. When an
application-specific kill script is provided, the platform script will invoke it instead.
The return code of this script should be 0 upon success or any other value
on error. The platform script uses it to decide whether the kill
operation succeeded or failed. The functionality of this script is
application dependent and difficult to describe in a generic way.
For some applications, terminating a job might be as simple as creating a
single file in the job workspace. Other applications respond to signals
that can be passed to them.
Appendix II describes in detail an alternative way to kill jobs gracefully which
uses a signalling mechanism for the wrapper to kill the job. The application
wrapper can respond to that signal by terminating the application and preparing
any useful output data.
An Example Kill Wrapper Script
We are not aware of any particular way that ImageMagick can be killed, therefore
we cannot provide a complete example of a kill-wrapper script. In practice, such
an example will be very similar to the default job termination mechanism used in
the platform kill script, e.g. kill -9 $jobID. The following paragraphs
provide some hints about possible ways of terminating jobs.
If a particular application is aware of a termination file in the job
workspace, then the kill-wrapper can be as simple as:
touch $WORK_DIR/.terminate
where .terminate is the particular termination filename that
the application is aware of.
Many applications are aware of various signals, e.g. SIGTERM, SIGALRM, SIGSTOP, etc.
The kill-wrapper can then pass the appropriate signal to the application and the
application can respond by terminating the job gracefully, for example:
# find the process ID $pid and send a termination signal
kill -SIGTERM $pid
4.4.5.
XML Description
Creating an XML File to Describe an Application
Application description files are XML files containing metadata
about applications deployed on GRIA. These files are essential for GRIA
users to discover available applications and use them. To create an XML
description file of an application you need to use the following schema
to identify the application's main features including name, version
number, description, and inputs/outputs (if any). For example, the
following code describes the Swirl application:
<application>
<name>http://it-innovation.soton.ac.uk/2005/gria/tutorial/swirl</name>
<version>1.0.0</version>
<description>Application to swirl an image</description>
<application-inputType>
<name>image.jpg</name>
<type>jpg</type>
<description>image file of any type</description>
</application-inputType>
<application-outputType>
<name>image.jpg</name>
<type>jpg</type>
<description>swirled image of the same type as input type</description>
</application-outputType>
</application>
The following code describes the Paint application:
<application>
<name>http://it-innovation.soton.ac.uk/2005/gria/tutorial/paint</name>
<version>1.0.0</version>
<description>Application to render an image into painting</description>
<application-inputType>
<name>image.jpg</name>
<type>jpg</type>
<description>image file of any type</description>
</application-inputType>
<application-outputType>
<name>image.jpg</name>
<type>jpg</type>
<description>painted image of the same type as input type</description>
</application-outputType>
</application>
Note that you can add as many inputs/outputs as necessary, according to your application.
Every type of application provided by the Job Service must be given a unique name. To ensure that names are unique, a URI is used. Note that although these names look like web page addresses, they do not necessarily point to web pages if treated as URLs. They are simply unique strings.
4.5.
Platform Scripts
The platform scripts create the interface between GRIA and the scheduler.
4.5.1.
The Platform Script Interface
Describing the three platform scripts
4.5.1.1.
Overview
Overview of the platform script interface
The Job Service knows only how to start a job, check its status, and kill
it. In order to decouple the Job Service and application wrapper bindings from
resource managers, an extra layer of resource manager dependent platform scripts
is introduced. This means that Job Services can operate with different resource
managers without changing the Job Service itself or the application wrappers.
The Job Service can then be configured to use platform scripts
suitable for the underlying computing platform. Platform scripts
know how to handle (start, check, kill) jobs for that particular
computing platform, and can be instructed to run a particular
application via its application wrapper.
Figure 1 illustrates how the platform script layer sits between the Job
Service and application wrappers hiding resource manager details.
The GRIA Job Service requires the following platform scripts:
- Start job: this script knows how to submit jobs for a specific resource manager
- Check job: knows how to check the status of a job
- Kill job: knows how to terminate a job
The following sections describe in detail the APIs and the required functionality for platform scripts.
Ideally, users should adopt one of the supplied
platform scripts for running jobs using PBS, Condor or local execution.
If it is necessary to develop platform scripts to address an unsupported execution
platform, this can be done, but first read about the GRIA platform
model.
4.5.1.2.
Start Job
The startJob.pl platform script
Introduction
The startJob.pl platform script is responsible for setting up the platform
dependent environment within the job workspace directory, generating a
platform dependent job description file and submitting that file for
execution to its underlying resource manager. The resource manager will
then try to run that job description file and invoke the
application via its application wrapper.
The start job script is invoked by the Job Service middleware, usually on the same system that runs the GRIA Services.
Job Life Cycle
When a job is submitted via the startJob script, it will follow a specific life cycle.
Figure 2 shows the life cycle of a job within GRIA (with time
increasing along the x-axis). The numbered labels refer to timestamps
that are required by the Services. The table below contains details of
the labels with their meanings and the name of a file (to be stored in
the job's session directory) that will be touched at the appropriate
time in order to record the date and time that the event occurred.
| Label in Figure 2 | Meaning | File used to store timestamp |
| 1 | Job submission time | .job_submitted |
| 2 | Application wrapper start time | .app_wrapper_started |
| 3 | Application start time | .app_started |
| 4 | Application end time | .app_ended |
| 5 | Application wrapper end time | .app_wrapper_ended |
The startJob script should create the file listed for label 1, while the application wrapper should create the rest.
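The sketch below shows one way an application wrapper might record labels 3 and 4 around the application itself. The function name run_with_timestamps and the SESSION_DIR variable are hypothetical; the default of ".." assumes that the wrapper runs in a working directory directly below the job's session directory.

```shell
#!/bin/sh
# Hypothetical helper for an application wrapper.
# SESSION_DIR is an assumed variable naming the job's session directory;
# it defaults to the parent of the current working directory.
run_with_timestamps() {
    ts_dir="${SESSION_DIR:-..}"
    touch "$ts_dir/.app_started"    # label 3: application start time
    "$@"
    ts_rc=$?
    touch "$ts_dir/.app_ended"      # label 4: application end time
    # Preserve the application's exit code for the caller.
    return "$ts_rc"
}
```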
Script API
The start job script should comply with the following command line:
startJob
-d <absolute path to workspace directory>
-e <full path to application wrapper script>
[-r [job constraints]...]
[-- [application arguments]...]
- Flag -d specifies the full path to job workspace directory, e.g. /mnt/data/ws-123
- Flag -e specifies the application wrapper script
- Flag -r specifies a list of
directives/constraints for the resource manager. The script should
understand these directives and translate them accordingly for the job
description file. The constraints should be expressed as name=value pairs.
- Flag -- specifies a list of arguments to be passed when invoking the underlying application.
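The command line above could be parsed along the following lines. This is only a sketch in POSIX shell (the supplied scripts are written in Perl); the function name parse_start_args and the variables it sets are illustrative assumptions.

```shell
#!/bin/sh
# Sketch of parsing: startJob -d DIR -e WRAPPER [-r name=value...] [-- args...]
# Sets WORKSPACE, APP_WRAPPER, CONSTRAINTS and APP_ARGS (illustrative names).
parse_start_args() {
    WORKSPACE=""; APP_WRAPPER=""; CONSTRAINTS=""; APP_ARGS=""
    while [ $# -gt 0 ]; do
        case "$1" in
            -d|-e)
                [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 2; }
                if [ "$1" = "-d" ]; then WORKSPACE="$2"; else APP_WRAPPER="$2"; fi
                shift 2 ;;
            -r)
                shift
                # Collect name=value pairs until the next flag.
                while [ $# -gt 0 ]; do
                    case "$1" in
                        -*)  break ;;
                        *=*) CONSTRAINTS="${CONSTRAINTS:+$CONSTRAINTS }$1"; shift ;;
                        *)   echo "bad constraint: $1" >&2; return 2 ;;
                    esac
                done ;;
            --)
                # Everything after -- belongs to the application.
                # Joining with "$*" loses quoting; good enough for a sketch.
                shift; APP_ARGS="$*"; break ;;
            *)
                echo "unknown argument: $1" >&2; return 2 ;;
        esac
    done
    # Both -d and -e are mandatory.
    [ -n "$WORKSPACE" ] && [ -n "$APP_WRAPPER" ]
}
```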
Script functionality
The functionality of the start job script should include the following:
- Parse and identify script arguments.
- Identify the job workspace directory structure, check that specified
staging directories exist as well as the specified input data.
- Change directory to the job workspace. At this point script log files can be generated and stored in the job workspace directory.
- Create the .job_submitted timestamp file, corresponding to label 1 in Figure 2, and store the startJob PID in it.
This file can be used as a lock file to indicate that the job is in the SUBMIT state. You need to remove the lock,
i.e. empty the contents of this file, on exit.
- Analyse and compose the argument string for invoking the application
wrapper, e.g. make sure that application wrapper, property files exist,
etc.
- Generate RM job description file e.g. job-$$.pbs for PBS. This is a resource manager dependent file that
stores resource manager directives, instructs execution nodes to run the job, etc. Usually this file should
include the following:
- Create a resource management directives section, e.g. parse -r arguments, etc.
- Change directory to working directory.
- Touch the file .app_wrapper_started in job workspace which corresponds to point 2 in Figure 2.
- Run the application wrapper using the composed argument string.
- Store the application wrapper exit code in the file .app_wrapper_exit_code in the job workspace.
- Touch the file .app_wrapper_ended in the job workspace, which corresponds to point 5 in Figure 2.
- Submit job description file to RM and store the job ID number into
.jobPID file in the job workspace directory. Job status scripts will
read this file to find out the status of the job with that ID.
- Remove submit job lock file, e.g. .job_submitted.
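The body of a job description file for plain local execution might be generated as in the sketch below. The function name write_job_script, the file name job.sh and the working directory name work are assumptions for illustration; a real resource manager would also need its own directives section at the top of the file.

```shell
#!/bin/sh
# Sketch: generate a job script that brackets the application wrapper
# with the timestamp and exit-code files described above.
# Arguments: workspace directory, wrapper path, wrapper argument string.
write_job_script() {
    jdf_workspace="$1"; jdf_wrapper="$2"; jdf_args="$3"
    # The heredoc is unquoted, so $jdf_* expand now, while \$? is
    # written literally and evaluated when the job script runs.
    cat > "$jdf_workspace/job.sh" <<EOF
#!/bin/sh
cd "$jdf_workspace/work" || exit 1
touch ../.app_wrapper_started
"$jdf_wrapper" $jdf_args
echo \$? > ../.app_wrapper_exit_code
touch ../.app_wrapper_ended
EOF
    chmod +x "$jdf_workspace/job.sh"
}
```

Escaping the exit-status variable is the important detail: without the backslash, the generating script's own status would be written into the job script.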
Return values
Return values are passed back to the Job Service middleware. A return
value of 0 indicates that the script has successfully submitted the
job. In any other condition the script should return a non-zero value.
Job Constraints
Job constraints are passed to platform scripts either from the Job Service
using the -r argument or directly by the client user via a constraints XML file which the
Job Service will store in the job session directory in a file called resources.xml.
The startJob platform script should parse these constraints and translate them
to resource manager directives.
Typical resource constraints cover properties such as WallClockTime,
CPUSpeed, PhysicalMemory, DiskSpace, etc.
See the job constraints page for further information.
4.5.1.3.
Check Job
The checkJob.pl platform script
Introduction
The purpose of the checkJob.pl platform script is to check the status of submitted GRIA jobs.
This script can be invoked either through the job service administration web interface, or by GRIA users using SOAP.
The job status information is returned to the Job Service via the standard output.
Any other information, e.g. debugging output,
should be written to standard error or log files. The return code of this
script determines whether the captured job status report is valid or not.
The job status information is fed back
to the end-user. The check job platform script will be invoked, and
run locally, by the Job Service middleware and will need to access
files in the job workspace.
It will invoke the application-specific checkJob script. This script will
examine application-derived files in the job workspace working
directory and will try to report the current status of the application,
e.g. 50% of the job remaining, computation phase 3 complete, etc.
The STDOUT of this script will generate a status report
for the Job Service middleware.
Script functionality
The check job script should do the following:
- Identify the workspace directory and change directory to it.
- Invoke the app-specific status script.
Script API
The check job script interface will be invoked with the following command:
checkJob <application specific get status executable>
- argument: specifies the application status wrapper script
Return values
- 0 upon success
- non-zero return number indicates error, STDERR and STDOUT will provide more information.
The status report is printed to STDOUT and used by the end-user,
exit-codes are used by Job Service. Status debugging information is
always written to STDERR.
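The split between STDOUT, STDERR and the exit code can be sketched as follows. This is not the supplied script: the function name check_job, its two arguments and the working directory name work are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of the check-job flow. Runs in a subshell so that the caller's
# working directory is left untouched.
# Arguments: workspace directory, absolute path to the optional status script.
check_job() (
    cd "$1/work" || { echo "DEBUG cannot enter $1/work" >&2; exit 1; }
    if [ -n "${2:-}" ] && [ -x "${2:-}" ]; then
        echo "DEBUG invoking $2" >&2    # debugging goes to STDERR
        "$2"                            # status report goes to STDOUT
    else
        echo "Application specific status not available"
    fi
)
```

Because debugging output goes to STDERR, a caller can capture a clean report with, for example, check_job /scratch/testgrid /opt/tutorial-apps/swirl/checkJob 2>status.err.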
Application-Specific Status Script Interface
This is an optional application-specific script that should provide a qualitative measure of the
running job. This script will run in the job working directory, e.g. work, and report its status
on STDOUT.
The functionality of this script is application dependent. It should try to estimate the progress of
the running job, e.g. by examining job input and output files, or by any other applicable mechanism
that indicates how the job is progressing.
The generated report should be kept short for practical reasons.
4.5.1.4.
Kill Job
The killJob.pl platform script
Script API
The killJob.pl platform script is used by the Job Service to prematurely terminate the execution
of a submitted job. The kill job script takes similar arguments to the status
monitor script. The API of the kill job script is:
killJob
-d <absolute path to workspace directory>
-e <full path to application wrapper script>
- Flag -d specifies the full path to job workspace directory, e.g. /mnt/data/ws-123
- Flag -e specifies the application kill-wrapper script
When this script is invoked it identifies the submitted process ID,
e.g. by reading the .jobPID file in the job session directory, and then
issues a kill operation for that job when the job is in running state.
If no application wrapper kill script is provided, i.e. default mode, the
resource manager will try to kill the job at the resource manager level. This
can be done by issuing a resource manager specific kill command, e.g. qdel
for PBS or kill -9 for local execution, etc. Subsequently, the resource
manager will try to kill the application wrapper and all its child processes.
As a result both the application and the application wrapper will be terminated
abruptly without preparing any output data, or timestamp files, or exit codes.
When an application wrapper script is provided via the -e argument, the killJob script will try to
invoke that script instead. The application wrapper kill script will be invoked within the job
working directory, e.g. work. In this case, the application kill wrapper is responsible for
terminating the running job.
Upon successful job termination, the killJob platform script creates a .killed timestamp in the job
session directory. The .killed timestamp will cause subsequent status calls to report the job status
as KILLED.
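For local execution, the two kill paths described above might look like the sketch below. The function name kill_job and the working directory name work are assumptions; the .jobPID and .killed file names come from the text.

```shell
#!/bin/sh
# Sketch of the kill flow for local execution (runs in a subshell).
# Arguments: session directory, optional absolute path to a kill wrapper.
kill_job() (
    kj_session="$1"; kj_wrapper="${2:-}"
    if [ -n "$kj_wrapper" ]; then
        # Application-specific kill: run it inside the working directory.
        cd "$kj_session/work" || exit 1
        "$kj_wrapper" || exit 1
    else
        # Default mode: terminate the job ID recorded at submission time.
        kj_pid=$(cat "$kj_session/.jobPID") || exit 1
        kill -9 "$kj_pid" || exit 1
    fi
    # Mark the job as killed only once termination has succeeded.
    touch "$kj_session/.killed"
)
```

If either path fails, the function exits non-zero without creating .killed, so the job state is left unchanged.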
The Application-Specific Kill Script
The application-specific kill script allows an application to be killed
in a particular way, e.g. using an application-specific mechanism, and should
operate within the job working space. This script
is optional. If it is not present in the argument list, the platform script will try
to kill the job at the resource manager level. When an application-specific kill
script is provided, the platform script will invoke it instead.
The return code of this script should be 0 upon success or any other value
on error. The platform script uses it to decide whether the kill operation
succeeded or failed. The functionality of this script is application dependent
and difficult to describe in a generic way. For some applications it might
be as simple as creating a single file in the job workspace; in other cases
the script might have to pass an appropriate signal to the application.
Return values
- 0 upon success
- non-zero return number indicates error, i.e. the state of the job should not
change.
Failing to terminate a job successfully implies that the job will continue to occupy
and use resources.
4.5.2.
The Supplied Platform Scripts
Description of the supplied platform scripts
4.5.2.1.
Overview
Note: please refer to the Documentation Tutorial Integrating Resource Manager Systems with GRIA section for the latest platform script updates for PBS and Condor.
GRIA can use any resource management system via its platform script API. The GRIA distribution comes with pre-supplied example scripts for PBS and Condor resource managers.
The Job Service uses scripts to submit and manage jobs on an execution
platform, allowing GRIA to make use of a wide range of computational resources
including remote compute servers and clusters.
Service administrators can create their own platform scripts to interface with
a given execution platform and configure the Job Service to use them. The Services
come with scripts for the following execution platforms:
- Portable Batch System (PBS)
- Condor
- Local execution
Although these scripts provide full functionality for GRIA, they assume a very basic PBS or Condor RM configuration.
The functionality of these scripts should be expanded accordingly for customised resource manager configurations. This document describes the parts of the pre-supplied example scripts for PBS and Condor systems that are most likely to require customisation.
By default, it is recommended that the service installer configures the Job Service to use the local
execution scripts, which means that all jobs will be run locally on the same machine
that runs the services. The scripts for this do not need to be modified.
The PBS and Condor scripts do normally require some customisation, and details
of how to do this (and how to test any of the supplied scripts) can be found in the following sections.
Users who need to create their own scripts to address other execution platforms
should read about the platform model, and also
see the instructions on the platform script interface.
4.5.2.2.
PBS
How to use and configure the supplied PBS platform scripts
The following sections describe how to configure the pre-supplied platform scripts
for PBS. These scripts work for a very basic PBS configuration. However, they
can easily be adapted to many customised PBS configurations. The basic
PBS testbed platform we used to develop and test these scripts had the following
configuration:
- All PBS daemons (pbs_server, pbs_sched, pbs_mom) and GRIA services run on the same machine
- There is a default PBS queue, e.g. dque
- System users, e.g. the GRIA user (tomcat), can submit and run simple PBS
jobs
PBS platform scripts can be easily customised in the following sections:
Submit Job: startJob.pl
This is a perl script which can submit GRIA jobs to PBS. Customisation of this
script will require modifications to the following:
- SECTION A: Initialise Resource Manager global vars, such as path for PBS
binaries, PBS server name, etc. In particular make sure that the following
variables are set up correctly:
- RM_PATH=<PBS binary path>
- RM_SERVER=<PBS server name>
- SECTION B: Turn verbose debug flags on/off. This step is optional.
- SECTION C: This section generates a job description file (JDF), which is the
file submitted to PBS to run the job. This section of the script
should be adequate for simple PBS configurations. You should edit
this part of code if you want to change any of the default PBS
directives or change the way jobs are submitted.
The PBS JDF file has two main parts, the first one describes
all the PBS directives required to run the job.
The second part of the file describes how to invoke the
application wrapper, etc.
This section of the code should be edited only if you need to
pass PBS directives other than the existing ones, or to include RM
directives passed with the -r arguments (see SECTION E below).
The default directives used in this script are:
#PBS -N J${SESSION_NAME}
#PBS -o job.out
#PBS -e job.err
#PBS -l cput=3600
#PBS -q dque
${raString} # see SECTION E
...
The second part of the file describes how to invoke the
application wrapper, how to create time-stamp files,
etc. This part of the code should cover a wide range of PBS
configurations.
- SECTION D: This section contains the PBS submit
command. You should only need to edit it for customised PBS
configurations that use
multiple queues, PBS servers, etc.
The example code in this section submits jobs to the default
queue in the PBS server defined in SECTION A, e.g.
# compose submit command to the default queue
my $command_line="$RM_SUBMIT -q \@$RM_SERVER $JDF";
# execute the submit command and store submission job ID
my $sub = 0xffff & system "$command_line > $JOB_PID";
- SECTION E: This subroutine should parse command line arguments for the RM.
It should return a text string with valid PBS directives that is
attached to the PBS directives section of the JDF file,
e.g. ${raString}.
The current implementation of this subroutine returns an empty
string. However, if you intend to pass RM directives dynamically
using the -r command line arguments you should parse them
in this subroutine and return them as a PBS directive string, e.g.
...
#PBS cput=2300
#PBS -l 2
...
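The translation that SECTION E is expected to perform can be sketched in shell as follows (the supplied subroutine is written in Perl and currently returns an empty string). The mapping from constraint names to the PBS resources walltime and mem is an assumption made for illustration, values are passed through without unit conversion, and the function name constraints_to_pbs is hypothetical.

```shell
#!/bin/sh
# Sketch: turn name=value constraints into "#PBS -l" directive lines.
# The name-to-resource mapping below is an assumption, not the behaviour
# of the supplied script.
constraints_to_pbs() {
    for c2p_pair in "$@"; do
        c2p_name=${c2p_pair%%=*}     # text before the first "="
        c2p_value=${c2p_pair#*=}     # text after the first "="
        case "$c2p_name" in
            WallClockTime)  echo "#PBS -l walltime=$c2p_value" ;;
            PhysicalMemory) echo "#PBS -l mem=$c2p_value" ;;
            *) echo "DEBUG ignoring unsupported constraint: $c2p_name" >&2 ;;
        esac
    done
}
```

The directives are printed on STDOUT so they can be captured into a string, while unsupported constraints are only reported on STDERR.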
Check Job: getJobStatus.pl
This is a perl script that checks and reports to GRIA the status of
a PBS job. For most PBS configurations the editing of this script
should be minimal:
- SECTION A: Initialise Resource Manager global vars, such as path for PBS
binaries, PBS server name, etc. In particular make sure that the following
variables are set up correctly:
- RM_PATH=<PBS binary path>
- RM_SERVER=<PBS server>
- SECTION B: Turn verbose debug flags on/off. This step is optional.
- SECTION C: The first part of this section reads the status of the PBS job. According
to your PBS configuration you may have to edit the code that grabs the job
status, e.g. in a PBS qstat command the status of a job is always
the 6th field, etc.
my $qString = `${RM_QUEUE} | grep $concatPID`;
my @words = &quotewords('\s+', 0, $qString);
my $jStatus = $words[9];
Unless the output format of qstat is different you do not need to
change this section, e.g.
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
74.siegerrebe pm tomcate 00:30 0 R dque
Kill Job: killJob
This is a Perl script for terminating PBS jobs. The following parts
of the code need editing:
- SECTION A: Initialise Resource Manager global vars, such as path for PBS
binaries, PBS server name, etc. In particular make sure that the following
variables are set up correctly:
- RM_PATH=<PBS binary path>
- RM_SERVER=<PBS server>
- SECTION B: Turn verbose debug flags on/off. This step is optional.
- SECTION C: The first part of this section reads the status of the PBS job. According
to your PBS configuration you may have to edit the code that grabs the job
status, e.g. in a PBS qstat command the status of a job is always
the 6th field, etc.
my $qString = `${RM_QUEUE} | grep $concatPID`;
my @words = &quotewords('\s+', 0, $qString);
my $jStatus = $words[9];
Unless the output format of qstat is different you do not need to
change this section, e.g.
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
74.siegerrebe pm tomcate 00:30 0 R dque
4.5.2.3.
Condor
How to use and configure the supplied Condor platform scripts
The GRIA pre-supplied platform scripts for Condor systems provide identical functionality to
the PBS platform scripts. These scripts work on a very basic Condor configuration.
The basic Condor testbed platform we used had the following configuration:
- All condor and GRIA services run on the same system
- Condor default values used
- System users, i.e. GRIA user (tomcat) can submit and run simple condor jobs
Condor platform scripts can be easily customised in the following sections:
Submit Job: startJob.pl
This is a perl script to submit GRIA jobs in a Condor pool.
- SECTION A: Initialise Resource Manager global vars, such as path for Condor
binaries, master server name, etc. In particular make sure that the following
variables are set up correctly:
- RM_PATH=<Condor binary path>
- RM_SERVER=<Condor server>
- SECTION B: Turn verbose debug flags on/off. This step is optional.
- SECTION C: Generate a job description file (JDF), which is the file submitted to
Condor to run the job. The Condor JDF file includes all the Condor
directives required to run the job, and the job itself is described as
frame.
Resource manager directives passed as command line arguments should be
processed in SECTION E and appended to the end of the JDF. The default
Condor directives section in this script includes:
universe = vanilla
executable = frame
arguments = $aRG
shell = /bin/bash
error = $JOB_ERR
log = job.log
output = $JOB_OUT
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
$raString # see SECTION E
You should edit this section if your condor configuration requires
different directives.
The frame script (see footnote 1) is a simple shell script that Condor has
to run for every GRIA submitted job. The frame script changes
the working directory, invokes the
application wrapper and generates the time-stamp files before and
after the execution of the application wrapper, e.g.
#!/bin/bash
cd $SESSION_DIR/$WORK_DIR
touch ../$APP_WRAPPER_STARTED_TS
${EXE_WRAPPER} $aRG
echo \$? > ../.app_wrapper_exit_code
touch ../$APP_WRAPPER_ENDED_TS
In most cases you should not need to change the frame code.
- SECTION D: This section contains the condor submit command:
# compose the submit argument
my $command_line="$RM_SUBMIT $JDF";
# execute condor submit, store job ID
my $sub = 0xffff & system "$command_line > $JOB_PID";
The expected return of the condor submit command usually is similar to:
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 25.
You should only change this part of the code if you use a
customised condor submission command.
- SECTION E: This subroutine should parse command line arguments for the RM. For
the Condor system it should return a text string with valid Condor directives
that will be attached to the JDF file via ${raString}, e.g.
Requirements = Arch =="INTEL" && OpSys == "Linux" && Memory > 20
Rank = (Memory > 32)*((Memory * 100) + (IsDedicated * 10000) + Mips)
Check Job: getJobStatus
This is a Perl script that reports the status of a Condor job.
Customisation of the code should take place in the following sections:
- SECTION A: Initialise Resource Manager global vars, such as path for Condor
binaries, master server name, etc. In particular make sure that the following
variables are set up correctly:
- RM_PATH=<Condor binary path>
- RM_SERVER=<Condor server>
- SECTION B: Turn verbose debug flags on/off. This step is optional.
- SECTION C: This section reads the condor_q command output which typically should
be similar to:
-- Submitter: siegerrebe.it-innovation.soton.ac.uk : <xxx.xxx.xxx.xxx:42239> : siegerrebe.it-innovation.soton.ac.uk
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
58.0 tomcat 7/11 14:46 0+00:00:00 R 0 0.0 frame -i ../staged
In this example the job status is reported in the 6th field:
my $qString = `${RM_QUEUE} $PID | grep $PID`;
my @words = &quotewords('\s+', 0, $qString);
my $jStatus = $words[6];
You should only change this part of the code if you intend to use
a customised format of the condor_q command.
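When adapting SECTION C to a different output format, it can help to check the field position outside the Perl script first. The sketch below does this in shell with awk; the function name job_state is hypothetical, and it reads condor_q-style output from standard input rather than calling condor_q itself.

```shell
#!/bin/sh
# Sketch: print the ST column for one job ID from condor_q-style output.
# SUBMITTED spans two whitespace-separated fields (date and time), which
# is why ST is the 6th field. awk ignores leading whitespace on a line.
job_state() {
    # Appending "" forces a string comparison of the job ID.
    awk -v id="$1" '$1"" == id"" { print $6; exit }'
}
```

For example, piping the sample output shown above into job_state 58.0 prints R.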
Kill Job: killJob
This is a Perl script that terminates Condor jobs. The following
parts of the code may need editing:
- SECTION A: Initialise Resource Manager global vars, such as path for Condor
binaries, master server name, etc. In particular make sure that the following
variables are set up correctly:
- RM_PATH=<Condor binary path>
- RM_SERVER=<Condor server>
- SECTION B: Turn verbose debug flags on/off. This step is optional.
- SECTION C: This section reads the condor_q output in order to
determine the state of the Condor job. The command output should typically
be similar to:
-- Submitter: siegerrebe.it-innovation.soton.ac.uk : <xxx.xxx.xxx.xxx:42239> : siegerrebe.it-innovation.soton.ac.uk
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
58.0 tomcat 7/11 14:46 0+00:00:00 R 0 0.0 frame -i ../staged
In this example the job status is reported in the 6th field:
my $qString = `${RM_QUEUE} $PID | grep $PID`;
my @words = &quotewords('\s+', 0, $qString);
my $jStatus = $words[6];
You should only change this part of the code if you intend to use
a customised format of the condor_q command.
1 Submitting a simple shell script to a resource manager instead of the
real application itself can sometimes cause problems, e.g. with advanced configurations
that run an application in parallel mode. In such cases it is advisable to
move the necessary functionality either to the application wrapper or up to the resource
manager section, e.g. the prologue and epilogue parts in PBS.
4.5.2.4.
Testing the Platform Scripts
How to test the platform scripts
This section describes how to test platform scripts after they have been installed
and configured, using the pre-supplied test application.
The details of command lines, etc, are specific to the test application only.
All platform scripts should be able to run as stand-alone applications from
a command line. Before running any of the tests, make sure that:
- the test application is installed and configured properly, e.g. properties files, etc
- you can submit and run jobs successfully via the cluster resource manager (e.g. PBS) if you are using one
- the scratch directory you are using, e.g. /scratch, is accessible by both front-end system and compute nodes.
- Create a temporary workspace directory:
$ mkdir -p /scratch/testgrid/{work,inputs,outputs}
$ cd /scratch/testgrid
- Copy an image file to job workspace, e.g.
$ cp some_demo.jpg /scratch/testgrid/inputs/input-0
Running a Job
By default the scripts are located under WRAPPERS_DIRECTORY/platform.
To test the startJob script, use the following command (N.B. directory
locations may vary depending on where the scripts have been copied
to):
$ /opt/gria-platform-scripts/rm_local/startJob -v -d /scratch/testgrid -e /opt/tutorial-apps/swirl/startJob
This will run the test application and store the output results
in the outputs subdirectory. The working subdirectory is set to work. The
response should look like:
Session directory: 1149159584
Job submitted successfully
Checking the Job Status
From the command line type:
$ /opt/gria-platform-scripts/rm_local/getJobStatus -e /opt/tutorial-apps/swirl/checkJob -d /scratch/testgrid
A typical response looks like:
DEBUG Use session directory: /scratch/testgrid
DEBUG Use application status script:
+----------------------------------------------------+
| |
| GRIA Job getStatus wrapper ($Revision: 4190 $) |
| |
+----------------------------------------------------+
Resource Manager...: Local execution
Check job status time....: Thu Jun 1 11:59:44 2006
DEBUG Using session name: testgrid
DEBUG Using concat session name: testgrid
DEBUG JOB_PID file found!
Local job ID.........: 2286
DEBUG Detected platform: Unix
DEBUG qString: pagis 2286 1 0 11:59 pts/2 00:00:00 perl /scratch/testgrid/jdf.pl
pagis 2289 2286 0 11:59 pts/2 00:00:00 /usr/bin/perl /opt/tutorial-apps/paint/startJob.pl
JOB_STATUS RUNNING
JOB_SUBMITTED 1149159584000
APP_WRAPPER_STARTED 1149159584000
<------------------------>
JOB_STATUS RUNNING
JOB_SUBMITTED 1149159584000
APP_WRAPPER_STARTED 1149159584000
<------------------------>
Appliction specific status not available
Check job status exit code: 0
Note: getStatus writes the job status report for the
Job Service to STDOUT and debugging information to STDERR. The two
output streams will be mixed on the screen unless you redirect one of them to a
separate file, e.g. 2>status.err.
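The effect of the redirection can be tried out without a running job. The sketch below uses fake_status, a hypothetical stand-in for the real script that writes one debug line to STDERR and one status line to STDOUT; the temporary file name is generated with mktemp.

```shell
# fake_status is a hypothetical stand-in for the status script: it writes
# a debug line to STDERR and the status report to STDOUT.
fake_status() {
    echo "DEBUG JOB_PID file found!" >&2
    echo "JOB_STATUS RUNNING"
}

# Keep the report on the screen and send the debugging output to a file.
err_file=$(mktemp)
fake_status 2>"$err_file"

# Or capture only the report and discard the debugging output.
report=$(fake_status 2>/dev/null)
echo "$report"
```

With the real script, the same redirections are appended to the getJobStatus command line shown above.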
Killing a Job
From the command line, type:
$ /opt/gria-platform-scripts/rm_local/killJob -v -d /scratch/testgrid
The response from this command will be similar to:
Try to kill the job
Use session directory: /scratch/testgrid
Use application specific kill script:
killJob ver: 5.0.0
+--------------------------------------------------+
| |
| GRIA Job killJob wrapper ($Revision: 4190 ) |
| |
+--------------------------------------------------+
Thu Jun 1 12:00:00 2006
session name: testgrid
DEBUG JOB_PID file found!
DEBUG PID: 2286, 2286, 2286
DEBUG Detected platform: unix
DEBUG qString: <>
DEBUG Job is not found in Q
Job is not running, it cannot be killed because it has already finished
Kill job exit code: 0
After a job has completed successfully the workspace directory will have a directory structure similar to:
testgrid/
|-- .app_wrapper_ended
|-- .app_wrapper_exit_code
|-- .app_wrapper_started
|-- .jobPID
|-- .job_submitted
|-- jdf.pl
|-- log
|-- resources.xml
`-- work
|-- image.jpg
|-- inputs
| `-- input-0
`-- outputs
`-- output-0
4.5.3.
Job Constraints
The Job Service Constraints
Job constraints are a new, experimental feature. They do not currently integrate with the SLA Management Service.
Job constraints reach the platform scripts in one of two ways: from the Job Service,
as statically defined values passed with the -r command line argument, or directly from the client user (using the client API), as a constraints XML file which the
Job Service stores in the job session directory as resources.xml.
The startJob platform script should parse these constraints and translate them
to resource manager directives.
Typical job constraints describe resource requirements such as WallClockTime,
CPUSpeed, PhysicalMemory, DiskSpace, etc.
Constraints passed from the Job Service as command line arguments should take the
form of name=value pairs, for example
-r CPUSpeed=1800 for a CPU speed constraint (in MHz), or -r WallClockTime=3600 for a
runtime constraint of one hour.
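Splitting such a pair into its name and value can be sketched as follows. This is an illustration only, not the code used by the supplied platform scripts; parse_constraint is a hypothetical helper name.

```shell
# parse_constraint (hypothetical helper) splits a NAME=VALUE argument,
# as passed after -r, and prints "NAME VALUE". It rejects arguments that
# have no '=' or an empty name.
parse_constraint() {
    case "$1" in
        ?*=*) ;;   # at least one character before the '='
        *) echo "invalid constraint: $1" >&2; return 1 ;;
    esac
    name=${1%%=*}    # text before the first '='
    value=${1#*=}    # text after the first '='
    printf '%s %s\n' "$name" "$value"
}

parse_constraint "WallClockTime=3600"   # prints: WallClockTime 3600
```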
Job Service providers have to specify and advertise which job constraints a user
is allowed to apply to a job run. This can be done easily with an appropriate
XML schema.
Client users can then submit jobs along with their constraints file.
An example of a simple user-supplied constraints file could be:
<?xml version="1.0"?>
<Resources>
<CPUArchitecture>amd64</CPUArchitecture>
<CPUSpeed>180</CPUSpeed>
<PhysicalMemory>1024</PhysicalMemory>
<!--DiskSpace>40</DiskSpace-->
<WallClockTime>210</WallClockTime>
<IndividualCPUCount>1</IndividualCPUCount>
<TotalCPUCount>1</TotalCPUCount>
<FileSizeLimit>200</FileSizeLimit>
</Resources>
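For a quick command line check of a value in such a file, a sed one-liner is often enough. The sketch below is not how the supplied scripts read the file, and it is not a real XML parser: it assumes one element per line, as in the example above, and constraint_value is a hypothetical helper name. The sample file is a shortened copy of the example, written to a temporary file.

```shell
# Write a shortened copy of the example constraints file to a temp file.
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<?xml version="1.0"?>
<Resources>
<CPUSpeed>180</CPUSpeed>
<!--DiskSpace>40</DiskSpace-->
<WallClockTime>210</WallClockTime>
</Resources>
EOF

# constraint_value FILE NAME (hypothetical helper): print the text between
# <NAME> and </NAME>. Prints nothing if the element is absent.
constraint_value() {
    sed -n "s:.*<$2>\(.*\)</$2>.*:\1:p" "$1"
}

constraint_value "$xml_file" WallClockTime   # prints: 210
constraint_value "$xml_file" DiskSpace       # prints nothing (commented out)
```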
A suitable XML schema for these constraints could be:
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
<xsd:documentation xml:lang="en">
GRIA resource schema
</xsd:documentation>
</xsd:annotation>
<xsd:element name="Resources" type="ResourcesType"/>
<xsd:complexType name="ResourcesType">
<xsd:sequence>
<xsd:element name="Comment" type="xsd:string" minOccurs="0"/>
<xsd:element name="CPUArchitecture" type="cpuarchitecture" minOccurs="0" default="x86"/>
<xsd:element name="CPUSpeed" type="xsd:int" minOccurs="0"/>
<xsd:element name="PhysicalMemory" type="xsd:int" minOccurs="0"/>
<xsd:element name="DiskSpace" type="xsd:int" minOccurs="0"/>
<xsd:element name="WallClockTime" type="xsd:long"/>
<xsd:element name="IndividualCPUCount" type="xsd:int" minOccurs="0" default="1"/>
<xsd:element name="TotalCPUCount" type="xsd:int" minOccurs="0" default="1"/>
<xsd:element name="FileSizeLimit" type="xsd:long" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
<xsd:simpleType name="cpuarchitecture">
<xsd:restriction base="xsd:string">
<xsd:enumeration value="x86"/>
<xsd:enumeration value="ia64"/>
<xsd:enumeration value="amd64"/>
<xsd:enumeration value="sparc"/>
<xsd:enumeration value="other"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
Supported Constraints in the Supplied Platform Scripts
The platform (startJob) scripts supplied with GRIA implement the following job constraints:
- WallClockTime
- The maximum time a job can run, in seconds. If the job service and the user both specify this constraint, the minimum of the two is taken.
- PhysicalMemory
- The minimum amount of required physical memory in MB. If the job service and the user both specify this constraint, the maximum of the two is taken.
- CPUSpeed
- The minimum CPU speed required in MHz. If the job service and the user both specify this constraint, the maximum of the two is taken.
- DiskSpace
- The minimum amount of available disk space required in MB. If the job service and the user both specify this constraint, the maximum of the two is taken.
- OSName
- The value set by the Job Service overrides the user-supplied constraint.
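The merging rules listed above can be sketched as a small shell function. This is an illustration under stated assumptions, not the logic of the supplied scripts: merge_constraint, min and max are hypothetical names, the OSName values are made up, and the handling of a constraint set by only one side (that value is used as-is) is an assumption.

```shell
min() { if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi; }
max() { if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi; }

# merge_constraint NAME SERVICE_VALUE USER_VALUE
# Prints the effective value for one constraint.
merge_constraint() {
    # Assumption: if only one side sets a value, that value is used.
    if [ -z "$2" ]; then echo "$3"; return 0; fi
    if [ -z "$3" ]; then echo "$2"; return 0; fi
    case "$1" in
        WallClockTime)                     min "$2" "$3" ;;  # stricter limit wins
        PhysicalMemory|CPUSpeed|DiskSpace) max "$2" "$3" ;;  # larger requirement wins
        OSName)                            echo "$2" ;;      # Job Service value wins
        *) echo "unsupported constraint: $1" >&2; return 1 ;;
    esac
}

merge_constraint WallClockTime 3600 210    # prints: 210
merge_constraint PhysicalMemory 512 1024   # prints: 1024
merge_constraint OSName Linux Windows      # prints: Linux
```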
Note: The scripts supplied with GRIA use the XML::Simple Perl module to handle the user constraints XML file,
which is only capable of handling simple XML documents without attributes.
The following table shows which constraints are implemented with the GRIA pre-supplied
platform scripts.
| Constraint | Unit | Local execution | PBS | Condor |
| WallClockTime | sec | OK | OK | OK |
| PhysicalMemory | MB | - | OK | OK |
| CPUSpeed | MHz | OK (req. perl win32::Info for XP) | - | - |
| OSName | <string> | - | OK | OK |
| DiskSpace | MB | - | - | OK |
Depending on the capabilities of the platform, Job Service providers should customise section E of
the startJob platform script accordingly.
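As an illustration of the kind of translation such a customisation performs, the sketch below turns two constraints into PBS job script directives. It is not taken from the supplied startJob script: pbs_directive is a hypothetical helper, and it assumes that the local PBS installation accepts the walltime and mem resources of the -l option.

```shell
# pbs_directive NAME VALUE (hypothetical helper): print a PBS directive
# line for a constraint, or return 1 if the constraint is not handled.
pbs_directive() {
    case "$1" in
        WallClockTime)
            # Seconds are converted to the HH:MM:SS form.
            printf '#PBS -l walltime=%02d:%02d:%02d\n' \
                $(($2 / 3600)) $(($2 % 3600 / 60)) $(($2 % 60)) ;;
        PhysicalMemory)
            # The constraint is given in MB.
            printf '#PBS -l mem=%smb\n' "$2" ;;
        *)
            return 1 ;;
    esac
}

pbs_directive WallClockTime 3600    # prints: #PBS -l walltime=01:00:00
pbs_directive PhysicalMemory 1024   # prints: #PBS -l mem=1024mb
```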