kutschkem/gluish

 
 


Gluish


Some glue around luigi.

Provides a base class that autogenerates its output filenames based on

  • some base path,
  • a tag,
  • the task id (the classname and the significant parameters)

Additionally, this package provides a few smaller utilities, like a TSV format, a benchmarking decorator and some task templates.

A basic task that knows its place

gluish.task.BaseTask is intended to be used as a supertask.

from gluish.task import BaseTask
import datetime
import luigi
import tempfile

class DefaultTask(BaseTask):
    """ Some default abstract task for your tasks. BASE and TAG determine
    the paths, where the artefacts will be stored. """
    BASE = tempfile.gettempdir()
    TAG = 'just-a-test'

class RealTask(DefaultTask):
    """ Note that this task has a `self.path()`, that figures out the full
    path for this class' output. """
    date = luigi.DateParameter(default=datetime.date(1970, 1, 1))
    def run(self):
        with self.output().open('w') as output:
            output.write('Hello World!')

    def output(self):
        return luigi.LocalTarget(path=self.path())

A RealTask instance is automatically assigned a structured output path, consisting of BASE, TAG, the task name and a slugified version of the significant parameters.

task = RealTask()
task.output().path
# would be something like this on OS X:
# /var/folders/jy/g_b2kpwx0850/T/just-a-test/RealTask/date-1970-01-01.tsv
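
To make the path layout concrete, here is a minimal sketch of how such a path could be assembled. The helper name sketch_path, its signature and the fixed 'tsv' extension are assumptions made for illustration; this is not gluish's actual implementation, which lives behind BaseTask.path().

```python
import datetime
import os
import tempfile


def sketch_path(base, tag, task_name, params, ext='tsv'):
    """Hypothetical helper, not gluish's code: join BASE, TAG, the task
    name and a slug built from the significant parameters."""
    # Build "name-value" pairs in a stable order and join them into a slug.
    slug = '-'.join('{0}-{1}'.format(name, value)
                    for name, value in sorted(params.items()))
    return os.path.join(base, tag, task_name, '{0}.{1}'.format(slug, ext))


path = sketch_path(tempfile.gettempdir(), 'just-a-test', 'RealTask',
                   {'date': datetime.date(1970, 1, 1)})
print(path)
```

The printed path ends in just-a-test/RealTask/date-1970-01-01.tsv, which matches the shape of the example above.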

A TSV format

This started as a discussion on the mailing list. Continuing the example from above, let's create a task named TabularSource that generates TSV files.

from gluish.format import TSV

class TabularSource(DefaultTask):
    date = luigi.DateParameter(default=datetime.date(1970, 1, 1))
    def run(self):
        with self.output().open('w') as output:
            for i in range(10):
                output.write_tsv(i, 'Hello', 'World')

    def output(self):
        return luigi.LocalTarget(path=self.path(), format=TSV)

Another class, TabularConsumer, can use iter_tsv on the handle obtained by opening the file. Each row is a tuple or, if cols is specified, a collections.namedtuple.

class TabularConsumer(DefaultTask):
    date = luigi.DateParameter(default=datetime.date(1970, 1, 1))
    def requires(self):
        return TabularSource()

    def run(self):
        with self.input().open() as handle:
            for row in handle.iter_tsv(cols=('id', 'greeting', 'greetee')):
                print('{0} {1}!'.format(row.greeting, row.greetee))

    def complete(self):
        return False
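
The row handling described above can be illustrated without luigi. The function iter_tsv_sketch below is a hypothetical stand-in written for this example, not the code behind gluish.format.TSV; it only shows the tuple versus namedtuple behaviour on an in-memory file.

```python
import collections
import io


def iter_tsv_sketch(handle, cols=None):
    """Hypothetical stand-in for iter_tsv: yield one row per line."""
    # Only build a namedtuple type when column names are given.
    row_type = collections.namedtuple('Row', cols) if cols else None
    for line in handle:
        fields = line.rstrip('\n').split('\t')
        yield row_type(*fields) if row_type else tuple(fields)


handle = io.StringIO('0\tHello\tWorld\n1\tHello\tWorld\n')
rows = list(iter_tsv_sketch(handle, cols=('id', 'greeting', 'greetee')))
print('{0} {1}!'.format(rows[0].greeting, rows[0].greetee))  # Hello World!
```

All fields come back as strings in this sketch, so numeric columns would need an explicit conversion.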

A benchmark decorator

Log some running times. Mostly useful in interactive mode.

from gluish.benchmark import timed

class SomeWork(luigi.Task):
    @timed
    def run(self):
        pass

    def complete(self):
        return False
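
For readers curious what such a decorator involves, the following is a minimal sketch of a timing decorator. The names timed_sketch and Work, the logger name and the log format are assumptions for this example; gluish.benchmark.timed may log differently.

```python
import functools
import logging
import time

logger = logging.getLogger('timed_sketch')


def timed_sketch(method):
    """Hypothetical decorator: log the wall-clock time of a method call."""
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return method(*args, **kwargs)
        finally:
            # Log even if the wrapped method raises.
            logger.debug('%s took %.3fs', method.__name__, time.time() - start)
    return wrapper


class Work(object):
    @timed_sketch
    def run(self):
        return sum(range(1000))


result = Work().run()
```

Using functools.wraps keeps the original method name and docstring intact, so the decorated run still looks like run to other tools.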

Elasticsearch template task

Modeled after luigi.contrib.CopyToTable.

from gluish.esindex import CopyToIndex
import luigi

class ExampleIndex(CopyToIndex):
    host = 'localhost'
    port = 9200
    index = 'example'
    doc_type = 'default'
    purge_existing_index = True

    def docs(self):
        return [{'_id': 1, 'title': 'An example document.'}]

if __name__ == '__main__':
    task = ExampleIndex()
    luigi.build([task], local_scheduler=True)

Elasticsearch support has been added to luigi as luigi.contrib.esindex.

FTP mirroring task

Mirrors FTP shares. This example reuses the DefaultTask from above. FTPMirror uses the lftp program under the hood, so lftp needs to be installed on your system.

from gluish.common import FTPMirror
from gluish.utils import random_string
import luigi

class MirrorTask(DefaultTask):
    """ Indicator makes this task run on each call. """
    indicator = luigi.Parameter(default=random_string())

    def requires(self):
        return FTPMirror(host='ftp.cs.brown.edu',
            username='anonymous', password='anonymous',
            pattern='*pdf', base='/pub/techreports/00')

    def run(self):
        with self.input().open() as handle:
            # FTPMirror output is in TSV
            for row in handle.iter_tsv(cols=('path',)):
                # do some useful things with the files here ...
                pass

    def output(self):
        return luigi.LocalTarget(path=self.path())

The output of FTPMirror is a single file that contains the paths to all mirrored files, one per line.

A short, self-contained example can be found in this gist.

To copy a single file from an FTP server, there is an FTPFile template task.

Easy shell calls

Leverage command line tools with gluish.utils.shellout. shellout will take a string argument and will format it according to the keyword arguments. The {output} placeholder is special, since it will be automatically assigned a path to a temporary file, if it is not specified as a keyword argument.

The return value of shellout is the path to the {output} file.

Spaces in the given string are normalized, unless preserve_whitespace=True is passed. A literal curly brace can be inserted by {{ and }} respectively.

An exception is raised whenever the command exits with a non-zero return value.

Note: If you want to make sure an executable is available on your system before the task runs, you can use a gluish.common.Executable task as a requirement.

from gluish.common import Executable
from gluish.utils import shellout
import luigi

class GIFScreencast(DefaultTask):
    """ Given a path to a screencast .mov, generate a GIF
        which is funnier by definition. """
    filename = luigi.Parameter(description='Path to a .mov screencast')
    delay = luigi.IntParameter(default=3)

    def requires(self):
        return [Executable(name='ffmpeg'),
                Executable(name='gifsicle', message='http://www.lcdf.org/gifsicle/')]

    def run(self):
        output = shellout("""ffmpeg -i {infile} -s 600x400
                                    -pix_fmt rgb24 -r 10 -f gif - |
                             gifsicle --optimize=3 --delay={delay} > {output} """,
                             infile=self.filename, delay=self.delay)
        luigi.LocalTarget(output).move(self.output().path)

    def output(self):
        return luigi.LocalTarget(path=self.path())
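
The behaviour described for shellout (template formatting, an automatically assigned {output} file, whitespace normalization, an exception on a non-zero exit) can be approximated in a few lines. The function shellout_sketch below is a hypothetical re-implementation for illustration only, assuming a POSIX shell; it is not the code in gluish.utils.

```python
import os
import re
import subprocess
import tempfile


def shellout_sketch(template, preserve_whitespace=False, **kwargs):
    """Hypothetical re-implementation of the behaviour described above."""
    if 'output' not in kwargs:
        # Assign a temporary file to {output} unless the caller set one.
        handle, kwargs['output'] = tempfile.mkstemp(prefix='shellout-')
        os.close(handle)
    command = template.format(**kwargs)
    if not preserve_whitespace:
        # Collapse runs of whitespace, including newlines, into one space.
        command = re.sub(r'\s+', ' ', command).strip()
    # check_call raises CalledProcessError on a non-zero exit status.
    subprocess.check_call(command, shell=True)
    return kwargs['output']


output = shellout_sketch("""echo {greeting} |
                            tr a-z A-Z > {output}""", greeting='hello')
with open(output) as handle:
    content = handle.read()
print(content)
```

Because the template goes through str.format, doubled braces in the command end up as literal braces, which matches the escaping rule stated above.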

Dynamic date parameter

Sometimes the effective date for a task needs to be determined dynamically.

Consider for example a workflow involving an FTP server.

A data source is fetched from FTP, but it is not known when updates are supplied, so the FTP server needs to be checked at regular intervals. Dependent tasks do not need to be updated as long as there is nothing new on the FTP server.

To map an arbitrary date to the closest date in the past where an update occurred, you can use a gluish.parameter.ClosestDateParameter, which is just an ordinary DateParameter but will invoke task.closest() behind the scenes to figure out the effective date.

from gluish.parameter import ClosestDateParameter
import datetime
import luigi

class SimpleTask(DefaultTask):
    """ Reuse DefaultTask from above """
    date = ClosestDateParameter(default=datetime.date.today())

    def closest(self):
        # invoke dynamic checks here ...
        # for simplicity, map this task to the last monday
        return self.date - datetime.timedelta(days=self.date.weekday())

    def run(self):
        with self.output().open('w') as output:
            output.write("It's just another manic Monday!")

    def output(self):
        return luigi.LocalTarget(path=self.path())

A short, self-contained example can be found in this gist.
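
The date arithmetic in closest() can be checked in isolation with the standard library alone. The function name last_monday is made up for this example.

```python
import datetime


def last_monday(date):
    """Map a date to the Monday of its week (a Monday maps to itself)."""
    # weekday() is 0 for Monday through 6 for Sunday.
    return date - datetime.timedelta(days=date.weekday())


# 1970-01-01 was a Thursday, so it maps back three days.
print(last_monday(datetime.date(1970, 1, 1)))    # 1969-12-29
print(last_monday(datetime.date(1969, 12, 29)))  # 1969-12-29
```

Every date from Monday through Sunday of the same week therefore resolves to the same effective date, and so to the same output path.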

If task.closest is a relatively expensive operation (FTP mirror, rsync) and the workflow uses a lot of ClosestDateParameter type of parameters, it is convenient to memoize the result of task.closest(). A @memoize decorator makes caching the result simple:

from gluish.utils import memoize
    ...

    @memoize
    def closest(self):
        return self.date - datetime.timedelta(days=self.date.weekday())
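
As a rough idea of what such a decorator does, here is a minimal memoization sketch. The names memoize_sketch and Expensive are hypothetical, and gluish.utils.memoize may be implemented differently; this version simply caches results in a dictionary keyed on the call arguments, which must be hashable.

```python
import functools


def memoize_sketch(func):
    """Hypothetical cache keyed on the positional and keyword arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper


class Expensive(object):
    calls = 0

    @memoize_sketch
    def closest(self):
        # Count how often the real computation runs.
        Expensive.calls += 1
        return 'computed'


task = Expensive()
task.closest()
task.closest()
print(Expensive.calls)  # 1
```

Note that this sketch keeps a reference to every instance it has seen, so cached entries live as long as the process does.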

Development

System package dependencies:

  • Ubuntu: libmysqlclient-dev
  • CentOS: mysql-devel

Setup:

$ git clone git@github.com:miku/gluish.git
$ cd gluish
$ mkvirtualenv gluish
$ pip install -r requirements.txt
$ nosetests

Coverage status

As of 4a7ec26ae6680d880ae20bebd514aba70d801687.

$ nosetests --verbose --with-coverage --cover-package gluish
Name               Stmts   Miss  Cover   Missing
------------------------------------------------
gluish                 4      0   100%
gluish.benchmark      32      3    91%   59-62
gluish.colors         14      4    71%   12, 20, 24, 32
gluish.common        103     18    83%   68, 72, 80-81, 83, 86, ...
gluish.database       50     18    64%   12, 69-76, 79-88, 91-94
gluish.esindex       105     71    32%   40-43, 62-66, 70-71, 77-86, ...
gluish.format         27      2    93%   69, 72
gluish.intervals      16      0   100%
gluish.oai            58     14    76%   83-99
gluish.parameter       5      0   100%
gluish.path           82     14    83%   25, 39-48, 101-102, 165
gluish.task           45      2    96%   95, 114
gluish.utils          80      3    96%   35, 127, 166
------------------------------------------------
TOTAL                621    149    76%
----------------------------------------------------------------------
Ran 33 tests in 0.269s

Pylint hook

$ pip install git-pylint-commit-hook
$ touch .git/hooks/pre-commit
$ chmod +x .git/hooks/pre-commit
$ echo '#!/usr/bin/env bash
git-pylint-commit-hook' > .git/hooks/pre-commit
