COOKIES! Om nom nom nom...
- Data retrievers can be set up to pull information into the system.
- The information is aggregated in a knowledge base, grouped by its relation to a distinct entity.
- When information becomes known about an entity, a production rule system is run. Its rules may have arbitrarily complex preconditions, which trigger arbitrarily complex productions.
- Information about data objects can be easily enriched if it is determined that not enough information is known about the object to process it.
- DSL free.
- Python 3.5+.
- Simple to add production rules and methods of gathering more information on-the-fly.
- Available as a Docker image.
If you do not want to read about how the Cookie Monster system works and just want to look at an example of it in action, please see the HGI Cookie Monster setup.
For better or for worse, naming within some parts of the system is Sesame Street themed...
- The collection of all information known about a particular data object is referred to as a "Cookie".
- The subsystem that stores a collection of Cookies is referred to as a "CookieJar".
- The HTTP API is referred to as "Elmo".
The system is called "Cookie Monster" as its behaviour is similar to that of the Cookie Monster character in Sesame Street: it shovels in all of the cookies but only a few get digested/mashed into the hand puppet, with the rest falling back out.
At a minimum, a Cookie Monster installation comprises a CookieJar that can store Cookies. It is essentially a knowledge base that stores unstructured JSON data and a limited amount of associated metadata. Each Cookie in the jar holds the identifier of the data object to which it relates. A Cookie may also contain a number of "enrichments", each of which holds information about the data object, along with details about where and when this information was obtained.
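The relationship between a jar, its Cookies and their enrichments can be pictured with a minimal in-memory sketch. The class names below (`EnrichmentSketch`, `CookieSketch`, `InMemoryCookieJarSketch`) and the `enrich_cookie` method are hypothetical stand-ins chosen for illustration; they are not the classes shipped with Cookie Monster, and a real CookieJar would persist its data.

```python
from datetime import datetime


class EnrichmentSketch:
    """Hypothetical enrichment: what was learned, from where, and when."""

    def __init__(self, source, timestamp, metadata):
        self.source = source
        self.timestamp = timestamp
        self.metadata = metadata


class CookieSketch:
    """Hypothetical Cookie: a data object's identifier plus its enrichments."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.enrichments = []


class InMemoryCookieJarSketch:
    """Hypothetical jar that keeps Cookies in a dictionary keyed by identifier."""

    def __init__(self):
        self._cookies = {}

    def enrich_cookie(self, identifier, enrichment):
        # Create the Cookie the first time the identifier is seen, then append
        cookie = self._cookies.setdefault(identifier, CookieSketch(identifier))
        cookie.enrichments.append(enrichment)
        return cookie


jar = InMemoryCookieJarSketch()
jar.enrich_cookie(
    "/my_study/sample.cram",
    EnrichmentSketch("irods_update", datetime(2016, 1, 1), {"size": 42}))
cookie = jar.enrich_cookie(
    "/my_study/sample.cram",
    EnrichmentSketch("my_data_source", datetime(2016, 1, 2), {"owner": "someone"}))
```

After the two calls, the single Cookie for `/my_study/sample.cram` holds both enrichments in the order they arrived.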
A CookieJar implementation (named BiscuitTin), which uses a CouchDB database, is supplied. It can be set up with:
cookie_jar = BiscuitTin(couchdb_host, couchdb_database_name)
A Cookie Monster installation can be set up with a Processor Manager, which uses Processors to examine Cookies after they have been enriched. Processors essentially implement a production rule system, where predefined rules are evaluated in order of priority. If a rule's precondition is matched, its action is triggered, which may be an arbitrary set of instructions. The action method's return value indicates whether any further rules should be processed for the Cookie. If no rule matches, or no matched rule indicates that further processing is unnecessary, the Processor checks whether the Cookie can be enriched further using an Enrichment Loader and puts any extra information into the knowledge base.
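The evaluation order described above can be sketched in a few lines. This is an illustration only, not the Processor implementation: `RuleSketch`, `process` and `load_more` are hypothetical names, Cookies are plain dictionaries, and it is assumed that a larger priority value means earlier evaluation.

```python
from collections import namedtuple

# Hypothetical stand-in for a rule: a precondition, an action and a priority
RuleSketch = namedtuple("RuleSketch", ["matches", "action", "priority"])


def process(cookie, rules, load_more):
    """Evaluate rules by priority; fall back to enrichment if none ends processing."""
    # Assumption: a larger priority value means the rule is evaluated earlier
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if rule.matches(cookie):
            process_more_rules = rule.action(cookie)
            if not process_more_rules:
                return "handled"
    # No rule ended processing, so try to learn more about the data object
    enrichment = load_more(cookie)
    if enrichment is None:
        return "exhausted"
    cookie["enrichments"].append(enrichment)
    return "enriched"


handled = []


def _note_study(cookie):
    handled.append(cookie["identifier"])
    return False  # no further rules need to be processed


rules = [
    RuleSketch(lambda c: True, lambda c: True, 1),
    RuleSketch(lambda c: "my_study" in c["identifier"], _note_study, 10),
]

in_study = {"identifier": "/my_study/a", "enrichments": []}
elsewhere = {"identifier": "/other/b", "enrichments": []}
first = process(in_study, rules, lambda c: None)
second = process(elsewhere, rules, lambda c: {"source": "my_data_source"})
```

Here the first Cookie is stopped by the high-priority rule, while the second only matches a rule that asks for processing to continue, so it falls through to the enrichment step.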
A simple implementation of a Processor Manager (named BasicProcessorManager) is supplied. This can be constructed as follows:
processor_manager = BasicProcessorManager(number_of_processors, cookie_jar, rules_source, enrichment_loader_source)
It can then be set up to process Cookies as they are enriched in the CookieJar (see the wtsi-hgi#18 bug):
cookie_jar.add_listener(processor_manager.process_any_cookies)
Each rule has a matching criterion (a precondition) against which Cookies are compared to determine whether any action should be taken. If matched, the rule's action is executed, which can be an arbitrary set of commands. The action method then returns whether further processing of the Cookie is required. The order in which rules are evaluated is determined by their priority.
If RuleSource is being used by your ProcessorManager to obtain the rules that are evaluated by Processor instances, it is possible to dynamically change the rules used by the Cookie Monster for future jobs (jobs already running will continue to use the set of rules that they had when they were started).
The following example illustrates how a rule is defined and registered. If appropriate, the code can be inserted into an existing rule file. Alternatively, it can be added to a new file in the rules directory, with a name matching the format: *.rule.py. Rule files can be put into subdirectories. If the Python module does not compile (e.g. it contains invalid syntax or uses a Python library that has not been installed), the module will be ignored.
from cookiemonster.models import Cookie, Rule
from hgicommon.mixable import Priority
from hgicommon.data_source import register

# NOTE: Context must also be imported; its module is not shown here.

def _matches(cookie: Cookie, context: Context) -> bool:
    # Precondition: the rule applies to Cookies whose path mentions the study
    return "my_study" in cookie.path

def _action(cookie: Cookie, context: Context) -> bool:
    # <Interesting actions>
    # Return whether any further rules should be processed for this Cookie
    return whether_any_more_rules_should_be_processed

_priority = Priority.MAX_PRIORITY
_rule = Rule(_matches, _action, _priority, "optional_name")
register(_rule)
To delete a pre-existing rule, delete the file containing it or remove the relevant call to register. To modify a rule, simply change its code and it will be updated in Cookie Monster when it is saved.
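To check in advance which files in a rules directory would be picked up, the naming convention and the "must compile" condition can be approximated with the standard library. This is a rough sketch rather than the loader Cookie Monster uses: `find_loadable_rule_files` is a hypothetical helper, and it only detects syntax errors, so a module that imports a missing library would still be listed here.

```python
import os
import tempfile


def find_loadable_rule_files(rules_directory):
    """Return paths of *.rule.py files, in any subdirectory, that compile."""
    loadable = []
    for directory, _, file_names in os.walk(rules_directory):
        for file_name in file_names:
            if not file_name.endswith(".rule.py"):
                continue
            path = os.path.join(directory, file_name)
            with open(path) as source_file:
                source = source_file.read()
            try:
                # Syntax check only: the module is not imported or executed
                compile(source, path, "exec")
            except SyntaxError:
                continue
            loadable.append(path)
    return sorted(loadable)


# Demonstration against a throwaway directory
with tempfile.TemporaryDirectory() as rules_directory:
    os.makedirs(os.path.join(rules_directory, "sub"))
    files = {
        "good.rule.py": "x = 1\n",
        os.path.join("sub", "nested.rule.py"): "y = 2\n",
        "broken.rule.py": "def (:\n",
        "notes.txt": "not a rule file\n",
    }
    for relative_path, content in files.items():
        with open(os.path.join(rules_directory, relative_path), "w") as f:
            f.write(content)
    found = [os.path.basename(p) for p in find_loadable_rule_files(rules_directory)]
```

In the demonstration, the file with invalid syntax and the file with the wrong suffix are both left out of `found`.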
Please see the [rules used in the HGI Cookie Monster setup](https://github.com/wtsi-hgi/hgi-cookie-monster-setup/tree/master/hgicookiemonster/rules).
If all the rules have been evaluated and none of their actions indicated that no further processing of the Cookie is required, "enrichment loaders" can be used to load more information about the Cookie.
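One way to picture how a loader is chosen is a priority-ordered scan that stops at the first loader able to enrich the Cookie. The sketch below makes several assumptions: `LoaderSketch`, `load_next_enrichment` and `make_loader` are hypothetical names, Cookies and enrichments are plain dictionaries, and a larger priority value is taken to mean the loader is tried first.

```python
from collections import namedtuple

# Hypothetical stand-in for an enrichment loader
LoaderSketch = namedtuple("LoaderSketch", ["can_enrich", "load_enrichment", "priority"])


def load_next_enrichment(cookie, loaders):
    """Return an enrichment from the first loader able to enrich, else None."""
    # Assumption: a larger priority value means the loader is tried earlier
    for loader in sorted(loaders, key=lambda l: l.priority, reverse=True):
        if loader.can_enrich(cookie):
            return loader.load_enrichment(cookie)
    return None


def make_loader(source, priority):
    def can_enrich(cookie):
        # A source only enriches Cookies it has not already enriched
        return source not in [e["source"] for e in cookie["enrichments"]]

    def load_enrichment(cookie):
        return {"source": source, "metadata": {}}

    return LoaderSketch(can_enrich, load_enrichment, priority)


loaders = [make_loader("fallback", 1), make_loader("primary", 10)]
cookie = {"identifier": "/my_study/a", "enrichments": []}

# Keep enriching until no loader has anything further to offer
sources = []
while True:
    enrichment = load_next_enrichment(cookie, loaders)
    if enrichment is None:
        break
    cookie["enrichments"].append(enrichment)
    sources.append(enrichment["source"])
```

Because each loader refuses Cookies it has already enriched, the loop ends once every source has contributed.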
As with rules, the enrichment loaders can be changed during execution. Files containing enrichment loaders must have a name matching the format: *.loader.py.
from cookiemonster import EnrichmentLoader, Cookie, Enrichment
from hgicommon.mixable import Priority
from hgicommon.data_source import register

# NOTE: Context must also be imported; its module is not shown here.
# my_data_source is a placeholder for your own source of extra information.

def _can_enrich(cookie: Cookie, context: Context) -> bool:
    # Only enrich Cookies that this source has not already enriched
    return "my_data_source" not in [enrichment.source for enrichment in cookie.enrichments]

def _load_enrichment(cookie: Cookie, context: Context) -> Enrichment:
    return my_data_source.load_more_information_about(cookie.path)

_priority = Priority.MAX_PRIORITY
_enrichment_loader = EnrichmentLoader(_can_enrich, _load_enrichment, _priority, "optional_name")
register(_enrichment_loader)
Please see the [enrichment loaders used in the HGI Cookie Monster setup](https://github.com/wtsi-hgi/hgi-cookie-monster-setup/tree/master/hgicookiemonster/enrichment_loaders).
A Cookie Monster installation may use data retrievers, which get updates about data objects. These updates are used to enrich the related Cookies in the CookieJar; a Cookie is created if no previous information is known about the data object.
A retriever that periodically gets information about updates made to entities in an iRODS database is shipped with the system. In order to use it, the specific queries defined in resources/specific-queries must be installed on your iRODS server, and a version of baton that supports specific queries (such as that by wtsi-hgi) must be installed. It can be set up as follows:
update_mapper = BatonUpdateMapper(baton_binaries_location)
database_connector = SQLAlchemyDatabaseConnector(retrieval_log_database)
retrieval_log_mapper = SQLAlchemyRetrievalLogMapper(database_connector)
retrieval_manager = PeriodicRetrievalManager(retrieval_period, update_mapper, retrieval_log_mapper)
It can then be linked to a CookieJar with:
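from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

executor = ThreadPoolExecutor(max_workers=NUMBER_OF_THREADS)

def put_updates_in_cookie_jar(update_collection: UpdateCollection):
    for update in update_collection:
        enrichment = Enrichment("irods_update", datetime.now(), update.metadata)
        # timed_enrichment is not defined in this snippet; it stands for a callable that enriches the Cookie for update.target
        executor.submit(timed_enrichment, update.target, enrichment)

retrieval_manager.add_listener(put_updates_in_cookie_jar)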
A JSON-based HTTP API is provided to expose certain functionality as an outwardly facing interface, on a configurable port. Currently, the following endpoints are defined:
- /queue
  - GET: Get the current status details of the "to process" queue, returning a JSON object with the following members: queue_length.
- /queue/reprocess
  - POST: Mark a file as requiring reprocessing, which will immediately return it (if necessary) to the "to process" queue. This method expects a JSON request body consisting of an object with a path member, and returns the same.
- /cookiejar/<identifier> (and /cookiejar?identifier=<identifier>)
  - GET: Get a file and its enrichments from the metadata repository, by its identifier. (Note that the identifier must be percent encoded. If it begins with a slash, then the query string form of this endpoint must be used.)
  - DELETE: Delete a file and its enrichments from the metadata repository, by its identifier. (Note that the identifier must be percent encoded. If it begins with a slash, then the query string form of this endpoint must be used.)
Note that all requests must include application/json in their Accept header.
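As an illustration of the percent-encoding and Accept header requirements, the following sketch builds (but does not send) requests using only the standard library. The base URL is an assumption, since the port is configurable. `cookiejar_request` and `reprocess_request` are hypothetical helper names. The Content-Type header on the POST is also an assumption, not something the API is documented here as requiring.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # assumption: host and port depend on your setup


def cookiejar_request(identifier, method="GET"):
    """Build a GET or DELETE request for the Cookie with the given identifier."""
    encoded = quote(identifier, safe="")  # percent encode everything, including "/"
    if identifier.startswith("/"):
        # Identifiers that begin with a slash must use the query string form
        url = "{}/cookiejar?identifier={}".format(BASE_URL, encoded)
    else:
        url = "{}/cookiejar/{}".format(BASE_URL, encoded)
    return Request(url, method=method, headers={"Accept": "application/json"})


def reprocess_request(path):
    """Build a POST request that marks the file at the given path for reprocessing."""
    body = json.dumps({"path": path}).encode("utf-8")
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",  # assumption, see the note above
    }
    return Request(BASE_URL + "/queue/reprocess", data=body, method="POST", headers=headers)


get_request = cookiejar_request("/my_study/a b.cram")
delete_request = cookiejar_request("sample-1", method="DELETE")
post_request = reprocess_request("/my_study/a b.cram")
```

Each request object can then be passed to `urllib.request.urlopen` against a running instance.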
To run the tests, use ./scripts/run-tests.sh from the project's root directory. This script will use pip to install all requirements for running the tests. Some tests use Docker, so a Docker daemon must be running on the test machine, with the environment variables DOCKER_TLS_VERIFY, DOCKER_HOST and DOCKER_CERT_PATH set.