Project Overview

Python modules to generate BEL resource documents.

See wiki for information about dataset objects and how to add new datasets.

Resource Generator

To run: './gp_baseline.py -n [dir]', where '[dir]' is the working directory to which data will be downloaded and in which new files will be generated.

gp_baseline.py runs in several phases:

1. data download
2. data parse (and save as pickled objects)
3. build '.belns' (namespace) files
4. (removed)
5. build '.beleq' (equivalence) files

The pipeline can be started and stopped at any phase using the '-b' and '-e' options. This enables re-running the pipeline on stored data downloads and pickled data objects.
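As an illustration of how phase selection with '-b' and '-e' might work, here is a minimal sketch. It is not the code in gp_baseline.py: the integer phase numbers, the defaults, and the names `PHASES`, `build_parser`, `phases_to_run`, and `main` are assumptions made for this example, and the phase bodies are reduced to a printed label.

```python
import argparse

# Phase numbers follow the order of the list above (an assumption).
# Phase 4 was removed from the pipeline, so it maps to None and is skipped.
PHASES = {
    1: "data download",
    2: "data parse (and save as pickled objects)",
    3: "build '.belns' (namespace) files",
    4: None,
    5: "build '.beleq' (equivalence) files",
}


def build_parser():
    """Build a parser for the hypothetical -n, -b and -e options."""
    parser = argparse.ArgumentParser(description="Sketch of a phase-gated pipeline")
    parser.add_argument("-n", dest="working_dir", required=True,
                        help="working directory for downloads and generated files")
    parser.add_argument("-b", dest="begin_phase", type=int,
                        choices=sorted(PHASES), default=1,
                        help="first phase to run")
    parser.add_argument("-e", dest="end_phase", type=int,
                        choices=sorted(PHASES), default=5,
                        help="last phase to run")
    return parser


def phases_to_run(begin_phase, end_phase):
    """Return the active phase numbers between begin_phase and end_phase, inclusive."""
    if begin_phase > end_phase:
        raise ValueError("begin phase must not come after end phase")
    return [number for number in sorted(PHASES)
            if begin_phase <= number <= end_phase and PHASES[number] is not None]


def main(argv=None):
    """Parse the options and report which phases would run."""
    args = build_parser().parse_args(argv)
    selected = phases_to_run(args.begin_phase, args.end_phase)
    for number in selected:
        print("phase {}: {}".format(number, PHASES[number]))
    return selected
```

Under these assumptions, `main(["-n", "work", "-b", "2"])` skips the download phase and reuses whatever is already stored in the working directory.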

gp_baseline.py - acts as the driver for the resource-generator.
configuration.py - configures the datasets to be included in the resource-generation pipeline, including initialization of the dataset objects, specification of a download url, and association with a parser.
parsers.py - contains parsers for each dataset.
parsed.py - acts as a storage module. Takes the data handed to it by the parser and stores it in a DataObject. All of the data used in this module is currently kept in memory; see the bug tracker for a possible solution to this memory constraint.
datasets.py - defines each DataObject class. See the wiki for information about DataObject classes, methods, and attributes.
equiv.py - takes a DataObject as a parameter and uses that object's defined functions to generate the new '.beleq' files.
common.py - defines common functions used throughout the program, namely a download() function and a function that opens and reads a gzipped file.
constants.py - any constants used throughout the program are defined in this module.
rdf.py - loads each pickled dataset object generated by Phase II of gp_baseline and generates triples for each namespace 'concept', including id, preferred label, synonyms, concept type, and equivalences.
belanno.py - generates 'belanno' files outside of the main gp_baseline pipeline (gp_baseline does download and create pickled data objects for the annotation data sets).
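The parse-and-pickle flow described for common.py, parsers.py, and parsed.py can be sketched roughly as follows. This is a simplified illustration, not the project's code: the function names, the two-column tab-delimited input, and the use of a plain dict in place of a DataObject are all assumptions. It needs Python 3.3 or later for text-mode `gzip.open`.

```python
import gzip
import pickle


def gzip_lines(path):
    """Yield the lines of a gzipped text file without trailing newlines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\r\n")


def parse_terms(lines):
    """Collect {term id: label} from tab-delimited lines (hypothetical layout)."""
    records = {}
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip rows that lack a label column
        records[fields[0]] = fields[1]
    return records


def parse_to_pickle(gz_path, pickle_path, name):
    """Parse a gzipped file, hold the result in memory, and pickle it."""
    # A plain dict stands in for a DataObject in this sketch.
    dataset = {"name": name, "records": parse_terms(gzip_lines(gz_path))}
    with open(pickle_path, "wb") as out:
        pickle.dump(dataset, out)
    return dataset


def load_pickle(pickle_path):
    """Load a dataset pickled by parse_to_pickle, as a later phase would."""
    with open(pickle_path, "rb") as handle:
        return pickle.load(handle)
```

Because the whole dataset is built in memory before it is pickled, a sketch like this shares the memory constraint noted for parsed.py.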

Change-Log

change_log.py - a separate module from gp_baseline. This module uses two sets ('old' and 'new') of pickled data objects generated by gp_baseline.py. change_log.py outputs a JSON dictionary mapping old terms to either their replacement terms or the string 'withdrawn'. This dictionary can be consumed by an update script to resolve lost terms in older versioned BEL documents.
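The shape of such a mapping, and how an update script might consume it, can be sketched as below. This is only an illustration under stated assumptions: the flat dictionary layout, the `replacements` input, and the function names are hypothetical, and the real change_log.py derives its information from the pickled data objects rather than from plain sets.

```python
import json

WITHDRAWN = "withdrawn"


def build_change_log(old_terms, new_terms, replacements):
    """Map each term lost between 'old' and 'new' to a replacement or 'withdrawn'.

    replacements is a hypothetical {old term: new term} lookup; a replacement
    only counts if it exists in the new term set.
    """
    new_terms = set(new_terms)
    change_log = {}
    for term in sorted(set(old_terms) - new_terms):
        replacement = replacements.get(term)
        change_log[term] = replacement if replacement in new_terms else WITHDRAWN
    return change_log


def resolve(term, change_log):
    """Return the current form of a term, or None if it was withdrawn."""
    target = change_log.get(term, term)
    return None if target == WITHDRAWN else target


def to_json(change_log):
    """Serialize the change log as a JSON dictionary."""
    return json.dumps(change_log, indent=2, sort_keys=True)
```

An update script built along these lines would call `resolve` on each term in an older BEL document and flag any term for which it returns None.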

Resource files

These scripts are used to generate additional resource files; see openbel-framework-resources.

orthology.py - creates the gene-orthology.bel file; requires the pickled data objects from the gp_baseline run.
gene_scaffolding.py - creates the gene_scaffolding_document_9606_10090_10116.bel; requires HGNC, MGI, and RGD '.belns' files generated from the gp_baseline run.
go_complexes_to_BEL.py - creates a '.bel' file with statements mapping Gene Ontology (GO) complexes to their human, mouse, and rat complex components based on data from GO. Uses the 'testing' version of the GOCC complexes '.belns' file and the current gene association files from GO. The output is not currently used for openbel-framework-resources.
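To give a feel for how a script can turn '.belns' values into BEL statements, here is a rough sketch in the spirit of gene_scaffolding.py. It is a guess at the general approach, not the script itself. It assumes the '.belns' file lists one 'VALUE|ENCODING' pair per line under a '[Values]' header, that the encoding letters G, R, and P mark gene, RNA, and protein, and that scaffolding means linking g() to r() with transcribedTo and r() to p() with translatedTo. All function names are hypothetical.

```python
import re


def read_belns_values(text, delimiter="|"):
    """Return {value: encoding} from the [Values] block of '.belns' text."""
    values = {}
    in_values = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            in_values = (line == "[Values]")  # track which block we are in
            continue
        if in_values and delimiter in line:
            value, encoding = line.rsplit(delimiter, 1)
            values[value] = encoding
    return values


def bel_term(function, prefix, value):
    """Format a BEL term, quoting values that are not plain word characters."""
    if not re.match(r"^\w+$", value):
        value = '"{}"'.format(value.replace('"', '\\"'))
    return "{}({}:{})".format(function, prefix, value)


def scaffold_statements(prefix, values):
    """Build gene -> RNA -> protein statements from encoded namespace values."""
    statements = []
    for value, encoding in sorted(values.items()):
        if "G" in encoding and "R" in encoding:
            statements.append("{} transcribedTo {}".format(
                bel_term("g", prefix, value), bel_term("r", prefix, value)))
        if "R" in encoding and "P" in encoding:
            statements.append("{} translatedTo {}".format(
                bel_term("r", prefix, value), bel_term("p", prefix, value)))
    return statements
```

A value encoded without P, such as a microRNA gene in this sketch, gets only the transcribedTo statement.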

Dependencies

To run these Python scripts, the following software must be installed:

Python 3.x - modules are written in Python 3.2.3
lxml - used to parse various XML documents.
rdflib - used by rdf.py

nbargnesi/resource-generator
