We have a wiki for this project at http://dev.stanford.edu/confluence/display/collaborators/DeepDive+Genomics
Note that a DeepDive installation is not needed to run the analyses, inspection or labeling tasks (just access to the PSQL database that the results are in)
Pending confirmation...:

- Install Docker (see the Docker installation guide)
- Create a new directory with at least 20GB of free space, copy the `Dockerfile` in this directory into it, then `cd` into it
- Run the following:

  ```
  docker build -t deepdive .
  docker run -d --privileged --name db -h gphost readr/greenplum
  docker run -t -d --link db:db --name deepdive deepdive bash
  docker exec -ti deepdive bash
  cd ~/deepdive
  make test
  ```

- Note: we are currently having issues with running `docker build`; it may need to be stopped and restarted multiple times, but it does eventually finish
- DeepDive & Greenplum will now run in the background; to open up a shell again, run `docker exec -ti deepdive bash`. Note: this may take a while to start, as it waits for Greenplum first. Run `docker ps -a` to see the running containers (there should be one for deepdive and one for greenplum)
- To copy file into docker...
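To finish the note above: files can be moved between the host and a running container with `docker cp` (the file and destination paths here are hypothetical examples):

```
# Copy a local file from the host into the deepdive container (example paths)
docker cp ./env_local.sh deepdive:/root/deepdive/
# And copy a file back out of the container:
docker cp deepdive:/root/deepdive/run.log ./run.log
```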
- Make sure environment variables are set correctly; see `env.sh`. Run `cp env.sh env_local.sh`, then edit `env_local.sh` to suit yourself; it's ignored by git. Add `export DD_GENOMICS_HOME=...` to the `~/.bashrc` on the machine where the DB runs; it's required by the plpython scripts -- a hacky way to communicate the local repo path. For it to be picked up by PG / GP, you need to restart the DB server after the change. (Similarly, also make sure `DEEPDIVE_HOME` is set.)
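For example, the exports might look like the following (the paths are placeholders; substitute wherever your repo checkout and DeepDive installation actually live):

```shell
# Hypothetical example of the exports described above; the actual paths
# depend on your local setup.
export DD_GENOMICS_HOME="$HOME/dd-genomics"   # local repo path (placeholder)
export DEEPDIVE_HOME="$HOME/deepdive"         # DeepDive install path (placeholder)
```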
- If necessary, create the database and then create the tables:
  - For Greenplum: `./util/create_schema.sh`
  - For Postgres: `./util/create_schema.sh pg`
- Make sure that user functions (e.g. `array_accum`) are loaded into SQL under the correct user (`$DBUSER`). Run the SQL in `util/add_user_functions.sql`
- Make sure that Greenplum's parallel file distribution server, `gpfdist`, is running with the correct settings (e.g. run `ps aux | grep gpfdist` and check that an instance is running with the correct `$GPPATH` and `$GPPORT`). If not, start a new one on a free port: `gpfdist -d ${GPPATH} -p ${GPPORT} -m 268435456 &`
- Load data; if from a TSV file, you can use:

  ```
  ./util/copy_table_from_file.sh [DB_NAME] [TABLE_NAME] [TSV_FILE_PATH]
  ```
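For example, loading a TSV of pre-processed sentences might look like this (the database, table and file names here are hypothetical):

```
./util/copy_table_from_file.sh genomics sentences /data/sentences.tsv
```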
- Fetch and process ontology files: `cd onto; ./make_dicts.sh`
- Select the appropriate pipeline to use in the app.conf file
- Run! [See the DeepDive main documentation for more detail]
In env.sh, change the variables marked as 'todo' as appropriate for the intended usage. Additional notes:

- Mac OS X: You will need to either install coreutils (e.g. with Homebrew: `brew install coreutils`) or hardcode the absolute path; see env.sh
- Most scripts in GDD will call env.sh automatically; if one does not, run `source env.sh` in the terminal session being used. (Note that env.sh must be set as executable: `chmod +x env.sh`)
TO-DO...
At a high level, this repo consists of the "UDF" (User Defined Functions) and "extractors" more generally, which output candidates, features and distantly-supervised examples for the core DeepDive system to utilize; plus some additional analysis & labeling code for use on the resulting output.
For the core DeepDive code see this repo.
In more detail, this repo contains:
- The `application.conf` file: This is the primary configuration file for the core DeepDive system. In it we define:
  - Extractors: For candidates, features and distantly-supervised examples
  - Pipelines: Ordered sets of extractors & other operations to execute
  - Schema: Defining the random variables (RVs) that we are observing / trying to predict
  - Inference rules: Factors in the factor graph, which define causal relations between RVs in our schema
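As a rough sketch of what these definitions look like (the extractor name, input SQL, and UDF path below are hypothetical; see the DeepDive documentation for the exact syntax of your version), an extractor and pipeline entry take roughly this shape:

```
deepdive {
  extraction.extractors {
    # Hypothetical candidate extractor: reads sentences, emits gene mentions
    ext_gene_mentions {
      input: "SELECT * FROM sentences"
      output_relation: "gene_mentions"
      udf: ${APP_HOME}"/code/gene_mentions.py"
    }
  }
  # A pipeline is an ordered list of extractors (and other steps) to run
  pipeline.pipelines.mentions_only: [ext_gene_mentions]
}
```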
- The extractor code (in `code/`): see this documentation also. These are the UDF scripts used by the extractors (note that formally, the extractors are defined in application.conf, and may not require any UDF scripts) that generate the requisite inputs for DeepDive:
  - {gene | pheno}_mentions, gen_pheno_pairs: Generate candidate mentions
  - mention_features, pair_features: Output the features for candidate mentions
- Analysis, inspection & labeling scripts: See the following sections
Not an exhaustive list, but important tables:
- sentences: The pre-processed input text, with one row for each sentence
- sentences_input: Same as sentences, except arrays stored as strings (for convenience in certain processing steps)
- {gene_mentions | pheno_mentions | genepheno_relations}: The extracted candidates (`is_correct=NULL`) and distantly- or directly-supervised examples (`is_correct=true|false`)
- {gene | pheno | genepheno}_features: Separate tables for the features of each of the candidate mentions/relations
- {gene_mentions | pheno_mentions | genepheno_relations}_is_correct_inference[_bucketed]: Views of the results of the DeepDive run [bucketed by 0.1 increments in expectation value]
- acronyms: Extracted gene acronyms, used for distant supervision
- labeled_gp: Labeled GP relations (from ...?) for direct supervision of GP relations
- TODO: expand this list...
NOTE: "pheno" is replaced by "hpoterm" in older datasets...
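The `_bucketed` views above bin each unlabeled candidate by its inferred expectation in 0.1 increments. The views themselves are defined in SQL; purely as an illustration of the bucketing logic, an expectation in [0.3, 0.4) falls into bucket_3, and so on:

```shell
# Map an expectation value in [0, 1] to its bucket label, mirroring the
# 0.1-increment bucketing of the *_is_correct_inference_bucketed views.
bucket() {
  awk -v e="$1" 'BEGIN { b = int(e * 10); if (b > 9) b = 9; printf "bucket_%d\n", b }'
}
```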
Postgres tips:

- `psql -U $DBUSER -h $DBHOST -p $DBPORT $DBNAME`: Open the postgres client
- `\d+`: See all tables
- `\d+ TABLE_NAME`: See the schema of a specific table TABLE_NAME
- `\q`: Exit
After an iteration of DeepDive has been run, some simple analyses (just SQL queries + optional post-processing) can be computed for error analysis and assessment, in addition to the automatically produced calibration plots. To perform an analysis, run:
```
./analysis/run-analysis.sh NAME MODE
```

where `NAME` is "g", "p" or "gp" (as relevant). To see a list of available `MODE` arguments (analyses), just run the above with no arguments.
- mentions-by-source: Number of mentions of NAME grouped by journal source, with counts broken down by labeled_true / labeled_false / bucket_n, where e.g. bucket_3 is the count of unlabeled mentions with inferred expectation between 0.3 and 0.4
- mentions-by-entity: Number of mentions of NAME grouped by entity, with same columns as above. A post-processing step checks for any entities that are in the relevant dictionary but not in the table, and includes these with zero counts. [Note: the post-processing step doesn't work for phenotypes because the entity names are not yet resolved enough to match with our dictionary...]
- relations-by-entity: Number of relations involving an entity E, grouped by E, with same columns as above.
- postgres-stats: Compiled by postgres automatically for query planning (the only reason we include it). Generates files labeled by column id and analysis type; e.g. output_2_most_common_values.csv would be the most common values for column 2 of the NAME_mentions table. See the postgres documentation
[NOTE: gp relations not currently run on raiders4]
The basic structure of an analysis script is:

- an `input-sql.sh` script that takes e.g. $1 \in {gene_mentions, pheno_mentions, genepheno_relations} as an argument and outputs the SQL to run
- [optionally] a `process.py` post-processing script which takes the input filename, output path root and the table name argument as above as inputs. Examples: unpacking & splitting data; filling in zero-values from the dictionary; etc.
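As an illustration of the first piece (the query itself is hypothetical), an `input-sql.sh`-style script can simply emit SQL parameterized by the table name it receives:

```shell
# Minimal sketch of an input-sql.sh-style script: take a table name
# (e.g. gene_mentions) and emit the SQL that run-analysis.sh would execute.
input_sql() {
  local table="$1"
  cat <<EOF
SELECT doc_id, COUNT(*) AS n_mentions
FROM ${table}
GROUP BY doc_id
ORDER BY n_mentions DESC;
EOF
}
```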
Other analyses we might want (TODO):
- Compute ratio of mentions to relations by entity in one single script?
- Something around average mentions/doc?
- Something involving common features (think about this more algorithmically / generally- longer term)
Having analyzed the output of DeepDive using the calibration plots, analysis scripts described above, and other means, you may want to inspect and/or label specific examples to improve the system. We use a GUI tool called Mindtagger to expedite the labeling tasks necessary for evaluating the data products of DeepDive.
To inspect certain documents without labeling them (e.g. to get some quick intuition about a specific slice of the data while building feature extractors), run:

```
./inspection/inspect-data.sh SQL_FILE
```

where `SQL_FILE` is your .sql query for getting the documents you want to inspect. This will fetch the data and start the Mindtagger GUI. See the example queries in the `inspection/scripts` directory, or just run the above command with no arguments to list the available ones.
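For instance, a minimal query file for pulling a slice of documents might look like the following (the column names assume the sentences table described in the tables list above; adjust to your schema):

```shell
# Write a hypothetical example inspection query to a file that could be
# passed to ./inspection/inspect-data.sh
cat > /tmp/inspect_example.sql <<'EOF'
SELECT doc_id, sentence_id, words
FROM sentences
LIMIT 100;
EOF
```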
Provided is a set of Mindtagger labeling templates for e.g. precision and recall labeling tasks. First, create a new task by running `./labeling/create-new-task.sh TASK` (run without args to view the available tasks). This will create a new output directory in `labeling/` where the output tags from the labeling will be stored. Once all tasks are created, start the Mindtagger GUI by running `./labeling/start-gui.sh` and then open a browser to localhost to view all the created tasks & label data!
TODO