
Deep Learning Recipes for DNA reads and short variants.

Setting up your environment

We recommend using Anaconda to manage your Python environments. For CPU-only libraries:

conda env create -n gatk -f ./envs/gatkcondaenv_cpu.yml

To use a GPU, you will need an NVIDIA GPU with CUDA and cuDNN installed; TensorFlow has nice instructions for setting these up. Then create the GPU environment (a quick sanity check that TensorFlow sees the GPU follows below):

conda env create -n gatk -f ./envs/gatkcondaenv_gpu.yml
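
Once the environment is built, a quick sanity check that TensorFlow can see the GPU (this assumes a TensorFlow version that still provides tf.test.is_gpu_available; on newer conda use conda activate instead of source activate):

source activate gatk
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"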

Training models from example tensors

In the data directory we provide a small dataset of reference and read tensors from the NA12878 sample. The reference tensors are input for a 1D CNN: a 1-hot encoding of 128 base pairs of reference sequence centered at a variant (a small illustrative sketch of this encoding follows the tar commands below). The read tensors are input for a 2D CNN: they encode reference and read sequence as well as read metadata, using the TensorFlow default channel ordering of reads x sequence x channels. You can toggle between TensorFlow and Theano channel ordering with the --channels_last and --channels_first arguments. Uncompress the example tensors with tar:

cd data
tar -xzvf example_reference_tensors_chr1.tar.gz 
tar -xzvf example_read_tensors_chr1_channels_last.tar.gz
cd ..
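
For intuition, here is a minimal sketch of the channels-last, 1-hot reference encoding described above; the base-to-channel order and the handling of ambiguous bases are illustrative assumptions, not necessarily the exact layout recipes.py writes.

import numpy as np

# Illustrative assumption: one channel per base, in A, C, G, T order.
BASES = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot_reference(sequence):
    """1-hot encode a reference window (e.g. 128 bp centered at a variant)."""
    tensor = np.zeros((len(sequence), len(BASES)), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in BASES:  # ambiguous bases (e.g. N) stay all-zero here
            tensor[i, BASES[base]] = 1.0
    return tensor

window = 'ACGT' * 32                     # stand-in for a real 128 bp window
print(one_hot_reference(window).shape)   # (128, 4)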

Train a model that predicts variant quality from read tensors and variant annotations:

python recipes.py train_ref_read_anno \
  --data_dir ./data/example_read_tensors_chr1_channels_last/ \
  --tensor_map read_tensor \
  --annotation_set best_practices \
  --id ref_read_anno_model

Train a model that predicts variant quality from read tensors:

python recipes.py train_ref_read \
  --data_dir ./data/example_read_tensors_chr1_channels_last/ \
  --tensor_map read_tensor \
  --id ref_read_model

Train a model that predicts variant quality from reference sequence and annotations:

python recipes.py train_reference_annotation \
  --data_dir ./data/example_reference_tensors_chr1/ \
  --tensor_map reference \
  --annotation_set best_practices \
  --id ref_anno_model

Train a model that predicts variant quality from reference sequence only:

python recipes.py train_reference \
  --data_dir ./data/example_reference_tensors_chr1/ \
  --tensor_map reference \
  --id ref_model

Write tensors with your own data

Create read tensors with a truth VCF, a confident-region BED file, unfiltered variant calls, and aligned reads:

python recipes.py write_tensors \
  --reference_fasta reference.fasta \
  --train_vcf validated_calls.vcf.gz \
  --negative_vcf my_unfiltered_calls.vcf.gz \
  --bed_file validated_calls_confident_region.bed \
  --data_dir ./data/my_read_tensors/ \
  --bam_file my_aligned_reads.bam \
  --tensor_map read_tensor \
  --channels_last \
  --read_limit 128 \
  --window_size 128

Create reference tensors with a truth VCF, a confident-region BED file, and unfiltered variant calls:

python recipes.py write_dna_tensors \
  --reference_fasta reference.fasta \
  --train_vcf validated_calls.vcf.gz \
  --negative_vcf my_unfiltered_calls.vcf.gz \
  --bed_file validated_calls_confident_region.bed \
  --data_dir ./data/my_reference_tensors/ \
  --tensor_map reference \
  --window_size 128

You can downsample specific classes with the --downsample_class_label arguments. For example, to write only 10% of the positive SNPs, add --downsample_snps 0.1 to your command line; to keep half of the negative indel examples, use --downsample_not_indels 0.5.
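
For example, keeping 10% of the positive SNPs and half of the negative indel examples amounts to adding these two flags to the write_tensors command above:

  --downsample_snps 0.1 \
  --downsample_not_indels 0.5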

You can also parallelize over the genome via the --chrom, --start_pos, and --end_pos arguments.
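
A minimal sketch of per-chromosome parallelization for the read-tensor command above; chromosome naming (1..22 versus chr1..chr22), writing each contig to its own data_dir, and running one background process per chromosome are assumptions about your reference and workflow, so adapt them to your scheduler:

for chrom in $(seq 1 22); do
  python recipes.py write_tensors \
    --reference_fasta reference.fasta \
    --train_vcf validated_calls.vcf.gz \
    --negative_vcf my_unfiltered_calls.vcf.gz \
    --bed_file validated_calls_confident_region.bed \
    --data_dir ./data/my_read_tensors_chr${chrom}/ \
    --bam_file my_aligned_reads.bam \
    --tensor_map read_tensor \
    --channels_last \
    --read_limit 128 \
    --window_size 128 \
    --chrom ${chrom} &
done
wait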
