Cookie Monster

COOKIES! Om nom nom nom...

Summary

  1. Data retrievers can be set up to pull information into the system.
  2. The information is aggregated in a knowledge base, grouped by its relation to a distinct entity.
  3. When information becomes known about an entity, a production rule system is run, using rules whose arbitrarily complex preconditions can trigger arbitrarily complex productions.
  4. If it is determined that not enough is known about a data object to process it, the information held about the object can easily be enriched.

Key features

  • DSL-free.
  • Python 3.5+.
  • Simple to add production rules and methods of gathering more information on-the-fly.
  • Available as a Docker image.

Less documentation, more example

If you do not want to read about how the Cookie Monster system works and just want to look at an example of it in action, please see the HGI Cookie Monster setup.

Definitions

For better or for worse, naming within some parts of the system is Sesame Street themed...

The system is called "Cookie Monster" as its behaviour is similar to that of the Cookie Monster character in Sesame Street: it shovels in all of the cookies but only a few get digested/mashed into the hand puppet, with the rest falling back out.

Components

Cookie storage

At a minimum, a Cookie Monster installation comprises a CookieJar that can store Cookies. It is essentially a knowledge base that stores unstructured JSON data and a limited amount of associated metadata. Each Cookie in the jar holds the identifier of the data object to which it relates. A Cookie may also contain a number of "enrichments", each of which holds information about the data object, along with details of where and when this information was obtained.

A CookieJar implementation (named BiscuitTin), which uses a CouchDB database, is supplied. It can be set up with:

cookie_jar = BiscuitTin(couchdb_host, couchdb_database_name)
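
To illustrate what goes into the jar, the sketch below adds an enrichment to a Cookie by hand. The import path for BiscuitTin, the enrich_cookie method name and the shape of the metadata payload are assumptions made for illustration (only the Enrichment constructor arguments appear elsewhere in this README), so check the CookieJar interface for the exact names:

# A minimal sketch, assuming this import path for BiscuitTin and that the
# CookieJar interface exposes an enrich_cookie(identifier, enrichment) method
from datetime import datetime

from cookiemonster import Enrichment
from cookiemonster.cookiejar import BiscuitTin

cookie_jar = BiscuitTin("http://localhost:5984", "cookies")

# An enrichment records information about a data object, where it came from
# and when it was obtained (the metadata payload here is illustrative)
enrichment = Enrichment("my_data_source", datetime.now(), {"file_size": 1024})
cookie_jar.enrich_cookie("/my_study/data_object_1", enrichment)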

Cookie processing

A Cookie Monster installation can be set up with a Processor Manager, which uses Processors to examine Cookies after they have been enriched. Processors essentially implement a production rule system, in which predefined rules are evaluated in order of priority. If a rule's precondition is matched, its action is triggered, which may be an arbitrary set of instructions. The action method's return value indicates whether any further rules should be processed with the Cookie. If no rules are matched, or none of the matched rules indicates that further processing is unnecessary, the Processor will check whether the Cookie can be enriched further using an Enrichment Loader and put any extra information into the knowledge base.

A simple implementation of a Processor Manager (named BasicProcessorManager) is supplied. It can be constructed as follows:

processor_manager = BasicProcessorManager(number_of_processors, cookie_jar, rules_source, enrichment_loader_source)

It can then be set up to process Cookies as they are enriched in the CookieJar:

cookie_jar.add_listener(processor_manager.process_any_cookies)
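
Putting these components together, a complete (but hedged) wiring sketch might look like the following. The RuleSource constructor signature and the class name EnrichmentLoaderSource are assumptions made for illustration; both are taken to monitor a directory for *rule.py and *loader.py files respectively (see the sections below):

# A sketch only: RuleSource and EnrichmentLoaderSource (the latter name is
# assumed) are taken to watch the given directories for rule and enrichment
# loader files; adjust names and signatures to the actual API
cookie_jar = BiscuitTin("http://localhost:5984", "cookies")

rules_source = RuleSource("/path/to/rules")
enrichment_loader_source = EnrichmentLoaderSource("/path/to/enrichment_loaders")

processor_manager = BasicProcessorManager(10, cookie_jar, rules_source, enrichment_loader_source)

# Process Cookies whenever new enrichments are added to the jar
cookie_jar.add_listener(processor_manager.process_any_cookies)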

Rules

Rules have matching criteria (a precondition) against which Cookies are compared to determine whether any action should be taken. If a Cookie matches, the rule's action is executed, which can be an arbitrary set of commands. The action method then returns whether further processing of the Cookie is required. The order in which rules are evaluated is determined by their priority.

Changing rules on-the-fly

If a RuleSource is being used by your ProcessorManager to obtain the rules that are evaluated by Processor instances, it is possible to dynamically change the rules used by the Cookie Monster for future jobs (jobs already running will continue to use the set of rules that they had when they were started).

The following example illustrates how a rule is defined and registered. If appropriate, the code can be inserted into an existing rule file. Alternatively, it can be added to a new file in the rules directory, with a name matching the format: *rule.py. Rule files can be put into subdirectories. If the Python module does not compile (e.g. it contains invalid syntax or uses a Python library that has not been installed), the module will be ignored.

from cookiemonster.models import Cookie, Rule
from hgicommon.mixable import Priority
from hgicommon.data_source import register

MY_RULE_IDENTIFIER = "my_rule"

# `context` is provided by Cookie Monster when the rule is evaluated
def _matches(cookie: Cookie, context: Context) -> bool:
    return "my_study" in cookie.path

def _action(cookie: Cookie, context: Context) -> bool:
    # <Interesting actions>
    return whether_any_more_rules_should_be_processed

_priority = Priority.MAX_PRIORITY

_rule = Rule(_matches, _action, MY_RULE_IDENTIFIER, _priority)
register(_rule)

To delete a pre-existing rule, delete the file containing it or remove the relevant call to register. To modify a rule, simply change its code and it will be updated in Cookie Monster when it is saved.

Examples

Please see the [rules used in the HGI Cookie Monster setup](https://github.com/wtsi-hgi/hgi-cookie-monster-setup/tree/master/hgicookiemonster/rules).

Cookie Enrichments

If all the rules have been evaluated and none of them has indicated in its action that no further processing of the Cookie is required, Cookie "enrichment loaders" can be used to load more information about a Cookie.

Changing enrichment loaders on-the-fly

Similarly to rules, the enrichment loaders can be changed during execution. Files containing enrichment loaders must have a name matching the format: *loader.py.

from cookiemonster import EnrichmentLoader, Cookie, Enrichment
from hgicommon.mixable import Priority
from hgicommon.data_source import register

MY_ENRICHMENT_IDENTIFIER = "my_enrichment"

def _can_enrich(cookie: Cookie, context: Context) -> bool:
    # Only offer an enrichment if the Cookie has not already been enriched
    # from this source
    return "my_data_source" not in [enrichment.source for enrichment in cookie.enrichments]
    
def _load_enrichment(cookie: Cookie, context: Context) -> Enrichment:
    return my_data_source.load_more_information_about(cookie.path)

_priority = Priority.MAX_PRIORITY

_enrichment_loader = EnrichmentLoader(_can_enrich, _load_enrichment, MY_ENRICHMENT_IDENTIFIER, _priority)
register(_enrichment_loader)

Examples

Please see the [enrichment loaders used in the HGI Cookie Monster setup](https://github.com/wtsi-hgi/hgi-cookie-monster-setup/tree/master/hgicookiemonster/enrichment_loaders).

Data retrievers

A Cookie Monster installation may use data retrievers, which get updates about data objects; these updates can be used to enrich related Cookies in the CookieJar (creating them if no previous information is known).

A retriever that periodically gets information about updates made to entities in an iRODS database is shipped with the system. In order to use it, the specific queries defined in resources/specific-queries must be installed on your iRODS server and a version of baton above 0.16.3 must be installed. It can be set up as follows:

update_mapper = BatonUpdateMapper(baton_binaries_location)
database_connector = SQLAlchemyDatabaseConnector(retrieval_log_database)
retrieval_log_mapper = SQLAlchemyRetrievalLogMapper(database_connector)
retrieval_manager = PeriodicRetrievalManager(retrieval_period, update_mapper, retrieval_log_mapper)

It can then be linked to a CookieJar by:

from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

executor = ThreadPoolExecutor(max_workers=NUMBER_OF_THREADS)

def put_updates_in_cookie_jar(update_collection: UpdateCollection):
    for update in update_collection:
        enrichment = Enrichment("irods_update", datetime.now(), update.metadata)
        # `timed_enrichment` is a user-defined helper; a hypothetical
        # definition is sketched below
        executor.submit(timed_enrichment, update.target, enrichment)

retrieval_manager.add_listener(put_updates_in_cookie_jar)
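
A hypothetical definition of the timed_enrichment helper referenced above, assuming the CookieJar interface exposes an enrich_cookie method (check the actual interface for the exact name), might be:

def timed_enrichment(identifier: str, enrichment: Enrichment):
    # Hypothetical helper: add the enrichment to the Cookie with the given
    # identifier, creating the Cookie if nothing is yet known about it
    # (assumes a CookieJar method named `enrich_cookie`)
    cookie_jar.enrich_cookie(identifier, enrichment)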

HTTP API

A JSON-based HTTP API is provided to expose certain functionality as an outward-facing interface, on a configurable port. Currently, the following endpoints are defined:

/queue

  • GET Get the current status details of the "to process" queue, returning a JSON object with the following member: queue_length

/queue/reprocess

  • POST Mark a file as requiring reprocessing, which will immediately return it (if necessary) to the "to process" queue. This method expects a JSON request body consisting of an object with a path member; the same object is returned in the response.

/cookiejar/<identifier> (and /cookiejar?identifier=<identifier>)

  • GET Get a file and its enrichments from the metadata repository, by its identifier. (Note that the identifier must be percent encoded. If it begins with a slash, then the query string form of this endpoint must be used.)
  • DELETE Delete a file and its enrichments from the metadata repository, by its identifier. (Note that the identifier must be percent encoded. If it begins with a slash, then the query string form of this endpoint must be used.)

/debug/threads

  • GET Retrieve runtime state of all the current threads, for debugging.

Note that all requests must include application/json in their Accept header.
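
For example, the endpoints above might be exercised from Python as in the sketch below, which uses the third-party requests library; the host and port are assumptions, since the port is configurable:

import requests

API = "http://localhost:5000"  # host and port are deployment-specific (assumed here)
HEADERS = {"Accept": "application/json"}

# Get the length of the "to process" queue
response = requests.get(API + "/queue", headers=HEADERS)
print(response.json()["queue_length"])

# Mark a file as requiring reprocessing
response = requests.post(API + "/queue/reprocess", headers=HEADERS,
                         json={"path": "/my_study/data_object_1"})
print(response.json())

# Fetch a Cookie by identifier; the query string form is used because the
# identifier begins with a slash (requests percent encodes the value)
response = requests.get(API + "/cookiejar", headers=HEADERS,
                        params={"identifier": "/my_study/data_object_1"})
print(response.json())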

How to develop

Testing

Locally

To run the tests, use ./scripts/run-tests.sh from the project's root directory. This script will use pip to install all requirements for running the tests. Some tests use Docker, so a Docker daemon must be running on the test machine, with the environment variables DOCKER_TLS_VERIFY, DOCKER_HOST and DOCKER_CERT_PATH set.