pcawg_tools

Docker builds, wrapper scripts and Galaxy tool config files related to the ICGC/TCGA PCAWG work ( http://pancancer.info/ )

Main code Base https://github.com/ucscCancer/pcawg_tools

Runner Project https://github.com/kellrott/nebula

Workflow Engine Docker build https://github.com/bgruening/docker-galaxy-stable

Reference Data https://www.synapse.org/#!Synapse:syn3241088

WhiteList/Assignment Tracking https://www.synapse.org/#!Synapse:syn4556289

All the tools are written as Galaxy tool wrappers (in the tools directory), with Dockerfile builds (except for the Broad tool). The workflows are the .ga files (in Galaxy format) in the workflow directory.

These can be used manually in a properly configured Galaxy environment. The trick is to automate deployment and instancing of the workflows. Nebula ( https://github.com/kellrott/nebula ) manages spinning up Docker-based Galaxy instances, which in turn run the workflows and manage the running of the tool containers. Once the work is done, the result data is copied out and the instance terminated.
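The lifecycle, very roughly, in plain Docker terms (an illustrative sketch only; Nebula drives this through its own code, and the image name and paths here are hypothetical):

# start a throwaway Galaxy instance (names hypothetical)
CID=$(docker run -d -p 8080:80 bgruening/galaxy-stable)
# ... Nebula submits the workflow through the Galaxy API and polls until it finishes ...
# copy the result data out, then terminate the instance
docker cp $CID:/export/outputs ./results
docker rm -f $CID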

There is also a workflow generator script included in each project that will check out tasks from the central whitelist (using the Synapse table API), download reference files (from https://www.synapse.org/#!Synapse:syn3241088) and build the workflow request files. I also use a 'doc store' to keep large files and their associated JSON metadata documents, each under a UUID. Right now, it's just a directory with predictable file names and associated JSON files.
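As an illustration, a doc-store entry might look like this on disk (the layout and names here are hypothetical; the actual naming is whatever the scripts generate):

docstore/0b1f0846-example-uuid        # the large data file
docstore/0b1f0846-example-uuid.json   # its JSON metadata document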

The generator script will also scan the doc-store after the workflows are done, to build upload submissions (I haven't connected the data submission to the workflow, because I needed to review results).

Installation

-1 - System Dependencies (Ubuntu 14.04)

sudo apt-get install python-virtualenv python-dev

0 - Python dependencies
If there are any questions about the Python environment, create a virtualenv and install the packages into it:

virtualenv venv
. venv/bin/activate
pip install synapseclient
pip install pandas

1 - Download PCAWG tools
git clone https://github.com/ucscCancer/pcawg_tools.git
This directory will be referred to as $PCAWG_DIR. You probably want to put $PCAWG_DIR in your .bashrc file.

2 - Check out the deployment version:
cd $PCAWG_DIR && git checkout 1.0.0

3 - Get Nebula for deployment (the master branch should work for deployment):

git clone https://github.com/kellrott/nebula.git

This directory will be referred to as $NEBULA

If you are running any programs that use this code, you'll need to set PYTHONPATH:
export PYTHONPATH=$NEBULA
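A quick sanity check that the path took effect (this assumes the nebula package sits at the top of that checkout):

python -c 'import nebula; print(nebula.__file__)'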

3.5 - Set Environment Variables

Change for your system:

echo "export PCAWG_DIR=/home/ubuntu/gitroot/pcawg_tools">> ~/.bashrc echo "export NEBULA=/home/ubuntu/gitroot/nebula">> ~/.bashrc echo "export PYTHONPATH=$NEBULA">> ~/.bashrc source ~/.bashrc

4 - Build tool containers

4.1 - Obtain GATK manually from https://www.broadinstitute.org/gatk/ then:
cp GenomeAnalysisTK.jar $PCAWG_DIR/tools/gatk_bqsr/

4.2 - Run the builder. This calls 'docker build' with the correct tag and dumps a copy of each image onto the file system. In $PCAWG_DIR run the command:
python $NEBULA/nebula/warpdrive.py build -o images/ tools/
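You can sanity-check the build with, for example:

docker images            # the newly tagged tool images should be listed
ls $PCAWG_DIR/images/    # and the saved copies should be on disk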

4.3 - Obtain the Broad Docker image
The Broad Docker image needs to be built separately. Gordon will provide the download and decrypt instructions (it's an encrypted file on the Jamboree site). The Jamboree URL is https://tcga-data-secure.nci.nih.gov/tcgafiles/pancan/forucsc/ and the files to download are:
/pancan/forucsc/refdata9
/pancan/forucsc/docker9
You will need to decrypt them (the Broad will send you the decryption passphrase):
openssl aes-128-cbc -d -salt -in docker9 -out docker9.tar.gz
openssl aes-128-cbc -d -salt -in refdata9 -out refdata9.tar.gz
You then build the Docker image:
tar zxf refdata9.tar.gz; tar zxf docker9.tar.gz; cd docker && docker build -t broad_variant_pipeline .
Save the image to the images directory (restoring a saved tar is shown just below the container list):
docker save broad_variant_pipeline > $PCAWG_DIR/images/docker_broad_variant_pipeline.tar
Need to check on the name and version used here...

4.4 - Final list of containers needed
Broad
GATK co-cleaning
MuSE
???
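As noted above, any of the saved image tars can be restored on another machine with docker load, e.g.:

docker load < $PCAWG_DIR/images/docker_broad_variant_pipeline.tar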

5 - Build workflow engine
You can obtain a copy of the workflow engine from the Docker registry (docker pull bgruening/galaxy-stable), but it's recommended you build it yourself so that you can change the deployed user and group IDs to match your own account, which removes file-access issues:
git clone --recursive https://github.com/bgruening/docker-galaxy-stable.git
Edit galaxy/Dockerfile to match your user and group IDs (GALAXY_UID and GALAXY_GID on lines 37 and 38; you can find yours with 'id -u' and 'id -g', and a scripted version of this edit is sketched at the end of this step). Then run:
docker build -t bgruening/galaxy-stable galaxy
There is also the DockerHub version ( https://hub.docker.com/r/bgruening/galaxy-stable/ ), but since it's pre-built it doesn't allow you to change the running user:

docker pull bgruening/galaxy-stable:dev

Cache the Docker Galaxy image:
docker save bgruening/galaxy-stable > $PCAWG_DIR/images/galaxy.tar
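As for the scripted Dockerfile edit mentioned in step 5, something like this may work (a sketch only: it assumes the values are written as GALAXY_UID=<number> and GALAXY_GID=<number>, so inspect galaxy/Dockerfile before trusting it):

sed -i -E "s/GALAXY_UID=[0-9]+/GALAXY_UID=$(id -u)/" galaxy/Dockerfile
sed -i -E "s/GALAXY_GID=[0-9]+/GALAXY_GID=$(id -g)/" galaxy/Dockerfile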

6 - Set up configuration files
GNOS keys and the Broad reference data directory go under the $PCAWG_DIR/tool_data/files directory. They should be:
$PCAWG_DIR/tool_data/files/cghub.key
$PCAWG_DIR/tool_data/files/icgc.key
$PCAWG_DIR/tool_data/files/refdata
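For example (the source paths are placeholders for wherever your copies of the keys and the step 4.3 refdata live):

mkdir -p $PCAWG_DIR/tool_data/files
cp /path/to/your/cghub.key $PCAWG_DIR/tool_data/files/cghub.key
cp /path/to/your/icgc.key $PCAWG_DIR/tool_data/files/icgc.key
cp -r /path/to/refdata $PCAWG_DIR/tool_data/files/refdata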

7 - Generate Jobs
Register for jobs (note: you'll need your Synapse login in a ~/.synapseConfig file, see https://www.synapse.org/#!Synapse:syn2375225/wiki/ , and you'll also have to request write access to the tracking table):
./scripts/pcawg_wf_gen.py register --count 15
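A minimal ~/.synapseConfig looks something like this (field names per the Synapse client docs linked above; use your own credentials):

[authentication]
username = your_synapse_username
password = your_synapse_password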

You can confirm this worked by querying the Synapse table (https://www.synapse.org/#!Synapse:syn4556289 ). In this case I’m querying for my username “briandoconnor” and it finds 15 entries.
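The same check can be scripted with the synapse command-line client that ships with synapseclient (the exact query syntax depends on your client version; see 'synapse query -h'):

synapse query 'SELECT * FROM syn4556289'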

Notes about possible failure: I tried the above on Ubuntu 14.04 and had to do the following to get it to work. In $PCAWG_DIR/scripts/synqueue.py, change:

import requests.packages.urllib3
requests.packages.urllib3.disable_warnings()

to:

import urllib3
urllib3.disable_warnings()

I also had to install:

sudo apt-get install python-dev
sudo pip install pandas

Now run the job creation script:
./scripts/pcawg_wf_gen.py gen --ref-download --create-service

Note: to use a separate directory for the workspace, add the parameter --work-dir </path/to/local/dir>; otherwise the workflow will execute in a directory under /var/lib/docker and may fill up the '/' partition.

Submit a job:
qsub sge_qsub_runworkflow.sh pcawg.service pcawg.tasks/workflow_id
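For example, to keep the workspace on a big scratch volume (path hypothetical) and then submit:

./scripts/pcawg_wf_gen.py gen --ref-download --create-service --work-dir /mnt/scratch/pcawg
qsub sge_qsub_runworkflow.sh pcawg.service pcawg.tasks/workflow_id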

Result files

Uploads

Debugging and Development
To manually instance Galaxy (with the PCAWG tools loaded) for interactive analysis and testing, in $PCAWG_DIR run the command:
python $NEBULA/nebula/warpdrive.py up -t tools -l data -a -p 8080 --sudo -c run_config.yaml -td tool_data -f -d
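Given the -p 8080 flag above, the instance should then be reachable at http://localhost:8080 once Galaxy finishes starting up.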

Runtime Notes
We're running on 32-core machines, scheduling a single donor per machine. For the PCAWG work, it's been an average of 9 hours for GATK indel realignment and 23 hours for BQSR; the Broad pipeline then averages 32 hours and MuSE averages 11 hours, but those two run in parallel. So the wall-time average is 9 + 23 + 32 = 64 hours (MuSE finishes under the Broad run).
