Python modules to generate BEL resource documents.
See the wiki for information about how to add new datasets.
- gp_baseline.py - acts as the driver for the resource-generator. This module uses configuration.py to determine which parsers to run over which datasets. After parsing and storing the data in a usable form, gp_baseline calls out to equiv.py to generate the new .belns, .belanno, and .beleq files.
- configuration.py - matches each dataset to the proper parser. This module can be used to customize which parsers to run. To run/not run a particular parser, simply uncomment/comment it.
- parsers.py - contains parsers for each dataset, and in some cases multiple parsers over the same data. This is mainly because withdrawn or deprecated terms are sometimes excluded during resource generation, but are still needed for resolving lost terms in the change log.
- parsed.py - acts as a storage module. Takes the data handed to it by the parser and stores it in a DataObject. Currently, all of the data used by this module is kept in memory. See the bug tracker for a possible solution to this memory constraint.
- datasets.py - each DataObject that holds a particular dataset is defined in this module. These objects act as an interface to the underlying dictionaries, and do various manipulations over the data to assist in generating the BEL resource files.
- equiv.py - the main function in this module will take a DataObject as a parameter, and use that object's defined functions to generate the new .beleq files.
- common.py - defines some common functions used throughout the program, namely a download() function and a function that will open and read a gzipped file.
- constants.py - any constants used throughout the program are defined in this module.
- rdf.py - loads each pickled dataset object generated by Phase II of gp_baseline and generates triples for each namespace 'concept', including id, preferred label, synonyms, concept type, and equivalences.
- change_log.py - a separate module from gp_baseline. This module will download and parse the old .belns and .beleq files and compare those results with the newly generated files that will be locally stored from gp_baseline.py. Currently, change_log.py must be run with the flag `-n <res_files>`, `res_files` being the directory in which the newly generated resource files are located. The result of running change_log.py will be a dictionary mapping all the old terms to either their replacement terms or the string `withdrawn`. This dictionary can be consumed by an update script to resolve lost terms in older versioned BEL documents.
- changelog_config.py - the configuration file for change_log.py. Much like configuration.py, this module maps which parsers will be needed, and the corresponding datasets for those parsers.
- write_log.py - the only task for this module is to write the change-log data out to a file using a `json` format.
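
As a rough illustration of the change-log dictionary described above, the sketch below maps each lost term to its replacement or to the string `withdrawn` and serializes the result as JSON. It is not the code in change_log.py or write_log.py: the function names `build_change_log` and `dump_change_log`, the placeholder term names, and the choice to record only terms missing from the new files are all assumptions made for this example.

```python
import json

WITHDRAWN = "withdrawn"  # marker string for terms that have no replacement


def build_change_log(old_terms, new_terms, replacements):
    """Return a dict mapping each lost old term to its replacement or 'withdrawn'.

    old_terms / new_terms: iterables of term strings (hypothetical inputs).
    replacements: dict of old term -> replacement term, where one is known.
    """
    new_terms = set(new_terms)
    change_log = {}
    for term in sorted(old_terms):
        if term in new_terms:
            continue  # term still exists in the new files, nothing to record
        change_log[term] = replacements.get(term, WITHDRAWN)
    return change_log


def dump_change_log(change_log, path):
    """Write the change-log dictionary to `path` as indented JSON."""
    with open(path, "w") as handle:
        json.dump(change_log, handle, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder term names, for illustration only.
    log = build_change_log(
        old_terms=["TERM_A", "TERM_B", "TERM_C"],
        new_terms=["TERM_A"],
        replacements={"TERM_B": "TERM_B2"},
    )
    print(json.dumps(log, indent=2, sort_keys=True))
```

An update script could then load the JSON file and look up each term of an older BEL document, rewriting it when a replacement exists and flagging it when the value is `withdrawn`.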

To run these Python scripts, the following software must be installed:
- Python 3.x - modules are written in Python 3.2.3
- lxml - used to parse various XML documents.
- rdflib - used by rdf.py
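
A quick way to confirm the requirements above before running anything is a small check like the following. This is only a sketch and is not part of the repository: the helper names `missing_modules` and `check_environment` are made up for this example, and `importlib.util.find_spec` needs Python 3.4 or later.

```python
import importlib.util
import sys

# Third-party modules listed in the requirements above.
REQUIRED_MODULES = ("lxml", "rdflib")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment():
    """Return a list of human-readable problems; empty if everything looks fine."""
    problems = []
    if sys.version_info[0] != 3:
        problems.append("Python 3.x is required, found %s" % sys.version.split()[0])
    for name in missing_modules():
        problems.append("missing module: %s" % name)
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```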