
acidburn0zzz/resource-generator

 
 


Project Overview

Python modules to generate BEL resource documents.

See wiki for information about how to add new datasets.

Resource Generator

  1. gp_baseline.py - acts as the driver for the resource-generator. This module uses configuration.py to determine which parsers to run over which datasets. After parsing and storing the data in a usable form, gp_baseline calls out to equiv.py to generate the new .belns, .belanno, and .beleq files.
  2. configuration.py - matches each dataset to the proper parser. This module can be used to customize which parsers to run. To run/not run a particular parser, simply uncomment/comment it.
  3. parsers.py - contains parsers for each dataset, and in some cases multiple parsers over the same data. Multiple parsers exist because withdrawn or deprecated terms are sometimes excluded during resource generation, but are still needed for resolving lost terms in the change log.
  4. parsed.py - acts as a storage module. Takes the data handed to it by the parser and stores it in a DataObject. All of the data used in this module is currently held in memory. See the bug tracker for a possible solution to this memory constraint.
  5. datasets.py - each DataObject that holds a particular dataset is defined in this module. These objects act as an interface to the underlying dictionaries, and do various manipulations over the data to assist in generating the BEL resource files.
  6. equiv.py - the main function in this module will take a DataObject as a parameter, and use that object's defined functions to generate the new .beleq files.
  7. common.py - defines some common functions used throughout the program, namely a download() function and a function that will open and read a gzipped file.
  8. constants.py - any constants used throughout the program are defined in this module.
  9. rdf.py - loads each pickled dataset object generated by Phase II of gp_baseline and generates triples for each namespace 'concept', including id, preferred label, synonyms, concept type, and equivalences.
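The flow described above (configuration selects parsers, the driver runs each parser over its dataset, and a helper reads gzipped input) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `read_gzipped_lines`, `parse_tab_separated`, `PARSERS`, `run_baseline`, and the two-column tab-separated layout are all hypothetical names and assumptions chosen for the example.

```python
import gzip


def read_gzipped_lines(path_or_file):
    """Yield text lines from a gzipped file (a stand-in for the helper in common.py)."""
    with gzip.open(path_or_file, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def parse_tab_separated(lines):
    """Hypothetical parser: assumes 'id<TAB>label' rows and '#' comment lines."""
    for line in lines:
        if line and not line.startswith("#"):
            term_id, label = line.split("\t")[:2]
            yield {"id": term_id, "label": label}


# Hypothetical configuration: dataset name -> parser.
# Commenting an entry out skips that parser, as described for configuration.py.
PARSERS = {
    "example_dataset": parse_tab_separated,
    # "other_dataset": parse_tab_separated,
}


def run_baseline(sources, parsers=PARSERS):
    """Run every configured parser over its dataset and keep the results in memory."""
    parsed = {}
    for name, parser in parsers.items():
        parsed[name] = list(parser(read_gzipped_lines(sources[name])))
    return parsed
```

Here `sources` maps each dataset name to a file path or an open binary file object. Keeping the dataset-to-parser mapping in one dictionary is what makes the comment/uncomment workflow possible.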

Change-Log

  1. change_log.py - a separate module from gp_baseline. This module downloads the pickled data objects generated by gp_baseline.py and outputs a JSON dictionary mapping each old term to either its replacement term or the string "withdrawn". An update script can consume this dictionary to resolve lost terms in BEL documents written against older resource versions.
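As a rough illustration of the mapping described above, the sketch below builds a change log from two sets of terms and shows how an update script might apply it. The function names, the flat dictionary layout, and the `replacements` input are assumptions made for this example; the real output of change_log.py may be organized differently (for example, grouped by namespace).

```python
import json

WITHDRAWN = "withdrawn"


def build_change_log(old_terms, new_terms, replacements):
    """Map each term missing from new_terms to its replacement, or to 'withdrawn'."""
    change_log = {}
    for term in sorted(old_terms):
        if term in new_terms:
            continue  # term still exists, nothing to record
        change_log[term] = replacements.get(term, WITHDRAWN)
    return change_log


def resolve_term(term, change_log):
    """Return the current term, or None if the term was withdrawn."""
    if term not in change_log:
        return term  # unchanged between versions
    replacement = change_log[term]
    return None if replacement == WITHDRAWN else replacement


if __name__ == "__main__":
    log = build_change_log({"A", "B", "C"}, {"A"}, {"B": "B2"})
    print(json.dumps(log, indent=2))
```

An update script would load the JSON with `json.load` and call something like `resolve_term` for every term it finds in an older document.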

Dependencies

To run these Python scripts, the following software must be installed:
  • Python 3.x - modules are written in Python 3.2.3
  • lxml - used to parse various XML documents.
  • rdflib - used by rdf.py
