
The Genomics DeepDive (GDD) Project

We have a wiki for this project at http://dev.stanford.edu/confluence/display/collaborators/DeepDive+Genomics

Getting Started: Basics

1. Installing DeepDive via Docker

Note that a DeepDive installation is not needed to run the analysis, inspection, or labeling tasks; you only need access to the PostgreSQL database that the results are stored in.

The following steps are still pending confirmation:

  1. Install Docker (see Docker installation guide)

  2. Create a new directory on a filesystem with at least 20GB of free space, copy the file named Dockerfile from this repository into it, then cd into it

  3. Run the following:

     docker build -t deepdive .                                      # build the DeepDive image
     docker run -d --privileged --name db -h gphost readr/greenplum  # start the Greenplum database container
     docker run -t -d --link db:db --name deepdive deepdive bash     # start the DeepDive container, linked to the database
     docker exec -ti deepdive bash                                   # open a shell inside the DeepDive container
     cd ~/deepdive                                                   # (inside the container)
     make test                                                       # run DeepDive's tests
    
  4. Note: we are currently having issues with docker build; it may need to be stopped and restarted multiple times, but it does eventually finish

  5. DeepDive & Greenplum will now run in the background; to open up a shell again, run:

     docker exec -ti deepdive bash
    

Note: the shell may take a while to start, as it waits for Greenplum first...

2. Other Docker stuff

  • docker ps -a to list containers (there should be one for deepdive and one for greenplum)
  • To copy a file into or out of a running container, use docker cp (a minimal sketch follows below)
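
A minimal sketch of copying files in and out of the running deepdive container with docker cp (the file names and paths below are placeholders):

     docker cp ./genes.tsv deepdive:/root/genes.tsv    # host -> container
     docker cp deepdive:/root/out.tsv ./out.tsv        # container -> host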

3. Getting GDD running:

  1. Make sure environment variables are set correctly (see env.sh and the notes in section 4 below)

  2. If necessary, create the database, and then create the tables:

     ./code/create_schema.sh	
    
  3. Make sure that the user-defined functions (e.g. array_accum) are loaded into the database under the correct user ($DBUSER). Run the SQL in code/add_user_functions.sql

  4. Make sure that Greenplum's parallel file distribution server, gpfdist, is running with the correct settings (e.g. run ps aux | grep gpfdist and check that an instance is running with the correct $GPPATH and $GPPORT). If not, start a new one on a free port:

     gpfdist -d ${GPPATH} -p ${GPPORT} -m 268435456 &   # serve files in $GPPATH on $GPPORT; -m sets the maximum row length in bytes
    
  5. Load the data; if loading from a TSV file, you can use:

     ./code/copy_table_from_file.sh [DB_NAME] [TABLE_NAME] [TSV_FILE_PATH]
    
  6. Select the appropriate pipeline to run in the application.conf file (see the sketch after this list)

  7. Run!

[See the main DeepDive documentation for more detail]
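
As a rough sketch of step 6, selecting a pipeline in application.conf looks something like the following (the pipeline name "gene" is hypothetical; the extractor names follow this repo's ext_* conventions):

     # within the deepdive { ... } block of application.conf (HOCON syntax)
     pipeline.run: "gene"
     pipeline.pipelines {
       gene: ["ext_gene_candidates", "ext_gene_features"]
     }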

4. Setting environment variables:

In env.sh, change the variables marked as 'todo' as appropriate for your intended usage. Additional notes:

  • Mac OS X: you will need to either install coreutils (e.g. with Homebrew: brew install coreutils) or hardcode the absolute path; see env.sh
  • Most scripts in GDD will source env.sh automatically; if one does not, run source env.sh in the terminal session you are using. (Note that env.sh must be executable: chmod +x env.sh)
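
As a rough sketch, env.sh is expected to export variables along these lines (the variable names appear in the scripts referenced in this README; the values are placeholders):

     export DBNAME=genomics        # target database
     export DBUSER=gdd             # database user (owns the user-defined functions)
     export DBHOST=localhost
     export DBPORT=5432
     export GPPATH=/data/gpfdist   # directory served by gpfdist
     export GPPORT=8722            # port gpfdist listens on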

Running an Iteration of DeepDive

TO-DO...

Basic Contents of this Repository

At a high level, this repo consists of the "UDFs" (user-defined functions) and, more generally, the "extractors", which output candidates, features, and distantly-supervised examples for the core DeepDive system to utilize, plus some additional analysis & labeling code for use on the resulting output.

For the core DeepDive code, see the main DeepDive repository.

In more detail, this repo contains:

  1. The application.conf file: This is the primary configuration file for the core DeepDive system. In it we define (a schematic fragment follows this list):

    • Extractors: For candidates, features and distantly-supervised examples
    • Pipelines: Ordered sets of extractors & other operations to execute
    • Schema: Defining the random variables (RVs) that we are observing / trying to predict
    • Inference rules: Factors in the factor graph, which define causal relations between RVs in our schema
  2. The extractor code (code/ext_*): see the DeepDive extractor documentation as well. These are the UDF scripts used by the extractors (note that, formally, the extractors are defined in application.conf and may not require any UDF scripts); they generate the requisite inputs for DeepDive:

    • ext_gene_find_acronyms: Extracts acronyms, used for distantly supervising the gene mention classification
    • ext_{gene | pheno | genepheno}_candidates: Generates candidate mentions
    • ext_{gene | pheno | genepheno}_features: Outputs the features for candidate mentions
  3. Analysis, inspection & labeling scripts: See the following sections
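
As a schematic illustration of how the pieces in item 1 fit together (the extractor and table names follow this repo's conventions, but the queries, UDF path, and factor definition below are illustrative, not copied from the actual application.conf):

     deepdive {
       schema.variables {
         gene_mentions.is_correct: Boolean
       }
       extraction.extractors {
         ext_gene_candidates {
           input: "SELECT * FROM sentences_input"
           output_relation: "gene_mentions"
           udf: ${APP_HOME}"/code/ext_gene_candidates.py"    # hypothetical path/extension
           style: "tsv_extractor"
         }
       }
       inference.factors {
         f_gene_is_correct {                                 # hypothetical factor name
           input_query: "SELECT ... FROM gene_mentions ..."  # joins mentions with their features
           function: "IsTrue(gene_mentions.is_correct)"
           weight: "?(feature)"                              # weight learned per feature
         }
       }
     }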

Database Schema

Not an exhaustive list, but important tables:

  • sentences: The pre-processed input text, with one row for each sentence
  • sentences_input: Same as sentences, except arrays stored as strings (for convenience in certain processing steps)
  • {gene_mentions | pheno_mentions | genepheno_relations}: The extracted candidates (is_correct=NULL) and distantly- or directly-supervised examples (is_correct=true|false)
  • {gene | pheno | genepheno}_features: Separate tables for features for each of the candidate mentions/relations
  • {gene_mentions | pheno_mentions | genepheno_relations}_is_correct_inference[_bucketed]: Views of the results of the DeepDive run [bucketed by 0.1 increments in expectation value]
  • acronyms: Extracted gene acronyms, used for distant supervision
  • labeled_gp: Labeled GP relations (from ...?) for direct supervision of GP relations
  • TODO: expand this list...

NOTE: "pheno" is replaced by "hpoterm" in older datasets...

Postgres tips:

  • \d+: See all tables
  • \d+ TABLE_NAME: See the schema of a specific table TABLE_NAME
  • \q: Exit
  • psql -U $DBUSER -h $DBHOST -p $DBPORT $DBNAME: Open postgres client
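
For a quick sanity check on results, here is a minimal example query against the inference views listed above (expectation is the standard DeepDive output column; the other column names are assumptions about this schema):

     -- top high-confidence gene mention predictions from the last run
     SELECT doc_id, sentence_id, expectation
     FROM gene_mentions_is_correct_inference
     WHERE expectation > 0.9
     ORDER BY expectation DESC
     LIMIT 20;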

Running Simple Analyses of Output

After an iteration of DeepDive has been run, some simple analyses (just SQL queries + optional post-processing) can be computed for error analysis and assessment, in addition to the automatically produced calibration plots. To perform an analysis, run:

./analysis/run-analysis.sh NAME MODE

where NAME is "g", "p" or "gp" (as relevant). To see a list of available MODE arguments (analyses), just run the above with no arguments.
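
For example, to break gene mention counts down by journal source (see the list of analyses below):

./analysis/run-analysis.sh g mentions-by-source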

1. Current analyses available:

  • mentions-by-source: Number of mentions of NAME grouped by journal source, with counts broken down by labeled_true / labeled_false / bucket_n, where e.g. bucket_3 is the count of unlabeled mentions with inferred expectation between 0.3 and 0.4
  • mentions-by-entity: Number of mentions of NAME grouped by entity, with same columns as above. A post-processing step checks for any entities that are in the relevant dictionary but not in the table, and includes these with zero counts. [Note: the post-processing step doesn't work for phenotypes because the entity names are not yet resolved enough to match with our dictionary...]
  • relations-by-entity: Number of relations involving an entity E, grouped by E, with same columns as above.
  • postgres-stats: Statistics compiled by Postgres automatically for query planning (which is the only reason we include this analysis). Generates files labeled by column id and analysis type; e.g. output_2_most_common_values.csv contains the most common values for column 2 of the NAME_mentions table. See the Postgres documentation

[NOTE: gp relations not currently run on raiders4]

2. Creating new analyses:

The basic structure of an analysis script (a minimal sketch follows this list) is:

  • an input-sql.sh script that takes e.g. $1 \in {gene_mentions, pheno_mentions, genepheno_relations} as an argument and outputs the SQL to run
  • [optionally] a process.py post-processing script, which takes the input filename, the output path root, and the same table name argument as above as its inputs. Examples: unpacking & splitting data; filling in zero values from the dictionary; etc.
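
A minimal sketch of what an input-sql.sh might look like (the query and the entity column name are hypothetical, though the mention tables are grouped by entity in the analyses above):

     #!/usr/bin/env bash
     # hypothetical input-sql.sh: emit the SQL for a simple count-by-entity analysis
     # $1 is one of gene_mentions | pheno_mentions | genepheno_relations
     echo "SELECT entity, COUNT(*) AS n FROM $1 GROUP BY entity ORDER BY n DESC;"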

Other analyses we might want (TODO):

  • Compute the ratio of mentions to relations by entity in a single script?
  • Something around average mentions per doc?
  • Something involving common features (think about this more algorithmically / generally; longer term)

Performing Labeling & Data Inspection Tasks

Having analyzed the output of DeepDive using the calibration plots, analysis scripts described above, and other means, you may want to inspect and/or label specific examples to improve the system. We use a GUI tool called Mindtagger to expedite the labeling tasks necessary for evaluating the data products of DeepDive.

1. Using Mindtagger to inspect data

To inspect certain documents without labeling them (e.g. to get some quick intuition about a specific slice of the data while building feature extractors), run:

./inspection/inspect-data.sh SQL_FILE

where SQL_FILE is your .sql query for getting the documents you want to inspect. This will fetch the data and start the Mindtagger GUI. See example queries in the inspection/scripts directory, or just run the above command with no arguments to list available ones.
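
As a hypothetical example of such a query (column names are assumptions based on the schema section above), pulling a small sample of sentences containing unlabeled gene mention candidates:

     -- sample sentences that contain an unlabeled gene mention candidate
     SELECT s.doc_id, s.sentence_id, s.words
     FROM sentences s
     JOIN gene_mentions m
       ON m.doc_id = s.doc_id AND m.sentence_id = s.sentence_id
     WHERE m.is_correct IS NULL
     LIMIT 50;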

2. Using Mindtagger to label data

A set of Mindtagger labeling templates is provided for, e.g., precision and recall labeling tasks. First, create a new task by running

./labeling/create-new-task.sh TASK

(run without args to view available tasks). This will create a new output directory in labeling/ where the output tags from the labeling will be stored. Once all tasks are created, start the Mindtagger GUI by running:

./labeling/start-gui.sh

and then open a browser to localhost to view all the created tasks & label data!
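
A hypothetical end-to-end example (the task name is a placeholder; run create-new-task.sh with no arguments to list the real ones):

./labeling/create-new-task.sh                 # lists the available TASK templates
./labeling/create-new-task.sh precision       # "precision" is a placeholder task name
./labeling/start-gui.sh                       # then point a browser at localhost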

3. Using Mindtagger tags to supervise DeepDive

TODO
