GitHub - wvengen/lgipilot: LGI Pilot job manager

LGI Pilot job manager (lgipilot)
--------------------------------

To use the Leiden Grid Infrastructure (LGI) on a grid (like gLite) it can be
useful to make use of pilot jobs. Most importantly, this eliminates the latency
inherent in grids. Secondly, end-users can use the grid without the need for
grid certificates (though this requires some LGI tweaking at the moment).

Requirements:
 - Python (tested with 2.4)
 - gLite user-interface
 - LGI project server version 1.29 or later

License: GPL version 2 or later


* Quickstart overview

1. Install an LGI project server
   Please see http://fwnc7003.leidenuniv.nl/LGI/ for more information. This
   typically includes installing a LAMP stack, the LGI application, setting up
   users and groups, and configuring a resource daemon.

2. Configure a resource daemon for your application.
   This is explained by the LGI documentation as well. By running this resource
   daemon on a Linux host you should be able to use your application from
   within LGI.

3. Adapt the resource daemon configuration for the grid job.
   To run the resource daemon on the grid, it needs to be self-contained and
   slightly adapted. This is explained below.

4. Run the LGI pilot job manager (lgipilot)
   using the resource daemon tarball just created. See below.


* Resource daemon configuration

As mentioned in the LGI documentation, a resource daemon needs to be configured
for use with your specific application. This includes editing LGI.cfg, pointing
it to its key and certificate, and creating or editing various scripts for
things like starting and stopping the application.

To run the application on the grid, the resource daemon, its configuration and
the application must be bundled, including any dependencies. This is described
in pilotdist/README.pilot. This results in a tarball which should be placed in
pilotdist/pilotjob.tar.gz.
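
As an illustration of the final packing step only, here is a minimal Python
sketch. It assumes that the resource daemon, its configuration and the
application have already been collected in one directory (the name bundle_dir
is hypothetical), and that the archive should contain those files without a
leading directory. The authoritative procedure and layout are the ones in
pilotdist/README.pilot; this is not taken from it. The sketch is written for a
current Python 3 rather than the Python 2.4 that lgipilot was tested with.

```python
import os
import tarfile


def build_pilot_tarball(bundle_dir, out_path="pilotdist/pilotjob.tar.gz"):
    """Pack the contents of bundle_dir into a gzip-compressed tarball.

    bundle_dir is a hypothetical staging directory; the real layout is
    defined by pilotdist/README.pilot.
    """
    out_dir = os.path.dirname(out_path)
    if out_dir and not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        # Assumption: entries are stored relative to bundle_dir, so the
        # archive has no leading directory component.
        for name in sorted(os.listdir(bundle_dir)):
            tar.add(os.path.join(bundle_dir, name), arcname=name)
    return out_path


def list_tarball(path):
    """Return the sorted member names of a gzip-compressed tarball."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Listing the members with list_tarball() before starting lgipilot is a cheap
way to check that nothing was left out of the bundle.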


* Run LGI pilot job manager

With the pilot job tarball in place, the LGI pilot job manager (lgipilot) can
be started on a grid user-interface (UI), i.e. a host that has all the tools
and configuration needed to submit and interact with grid jobs. Make sure that
you have a valid proxy certificate (*).

You may want to change the pilot job manager configuration in lgipilot.ini,
section 'LGI'. SchedMin/SchedMax depend on your expected load and the number of
job slots (fair share) you have available; the other timing-related options
depend on the typical length of your jobs and on grid latency. More guidelines
on how to configure this, as well as monitoring tools, may become available in
the future.

The pilot job manager includes parts of Ganga, which has many configuration
options of its own in the same file. Please see the comments in lgipilot.ini
and the Ganga documentation.
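
To check the values before starting the manager, the two options named above
can be read back from the file. The following is a minimal sketch, not part of
lgipilot. It assumes that SchedMin and SchedMax are plain integers and that
SchedMin should not exceed SchedMax, which their names suggest but this README
does not state. The function name read_sched_limits is hypothetical. It uses
the Python 3 standard library, not the Python 2.4 that lgipilot was tested
with.

```python
import configparser


def read_sched_limits(ini_path="lgipilot.ini"):
    """Return (SchedMin, SchedMax) from section 'LGI' of the ini file."""
    # Interpolation is disabled and strict mode is off because the same
    # file also holds Ganga settings, whose values this sketch does not
    # try to interpret.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    if not parser.read(ini_path):
        raise IOError("could not read %s" % ini_path)
    # Assumption: both options hold plain integer values.
    sched_min = parser.getint("LGI", "SchedMin")
    sched_max = parser.getint("LGI", "SchedMax")
    # Assumption: SchedMin is a lower bound and SchedMax an upper bound.
    if sched_min > sched_max:
        raise ValueError("SchedMin (%d) is larger than SchedMax (%d)"
                         % (sched_min, sched_max))
    return sched_min, sched_max
```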

(*) For production deployments one can use a robot certificate that renews the
    proxy certificate periodically, for example once a day.


* Update pilot job tarball

It may happen that the pilot job tarball is modified at some point, for example
to update or add an application. New pilot jobs will run using the updated
tarball, but old pilot jobs may keep running with the old one, and there is no
way to tell which of the running pilot jobs will pick up a new LGI job.

Currently there is only a crude way to make sure that all pilot jobs run the
new version: killing the old ones. LGI jobs that were running on them will be
rescheduled by the LGI project server upon timeout, so no jobs will be lost.
All pilot jobs invoked by lgipilot can be killed with the command-line option
--pilot-cancel. The LGI pilot job manager will then automatically submit new
pilot jobs, running the new version, to compensate.
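
The update procedure can be sketched as a small Python helper: put the new
tarball in place, then invoke lgipilot with --pilot-cancel. The helper name
update_pilot_tarball, the ./lgipilot path and the injectable run parameter are
assumptions made for illustration. Only the --pilot-cancel option and the
tarball location come from this README, and how the option interacts with an
already running manager is not covered here.

```python
import shutil
import subprocess


def update_pilot_tarball(new_tarball, dest="pilotdist/pilotjob.tar.gz",
                         lgipilot="./lgipilot", run=subprocess.call):
    """Install a new pilot job tarball and cancel the running pilot jobs.

    Returns whatever the runner returns (the exit status for
    subprocess.call). The run parameter exists only so the command can be
    replaced by a stub when trying the sketch out.
    """
    # Replace the tarball first, so that pilot jobs submitted after the
    # cancel use the new version.
    shutil.copyfile(new_tarball, dest)
    # --pilot-cancel is the option described above; the path to the
    # lgipilot executable is an assumption.
    return run([lgipilot, "--pilot-cancel"])
```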