cohort-matcher

A workflow for comparing multiple cohorts of BAM files to determine if they contain reads sequenced from the same samples or patients by counting genotype matches at common SNPs. Cohort-matcher is an efficient, cloud-enabled variation of BAM-matcher.

Algorithm

The basic workflow consists of:

Genotype all the samples to be compared. (genotypeSamples.py)
Compare the genotypes of each sample against the genotypes of all the other samples. (compareSamples.py, which in turn uses compareGenotypes.py to compare a sample to the remaining cohort of samples)
Merge the results of the sample comparisons (mergeResults.py)
Generate plots based on the results and the known patient-to-sample association.
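The comparison step can be illustrated with a minimal sketch. This is not the code in compareGenotypes.py; the function name, the dictionary layout, and the treatment of genotypes as unordered allele pairs are all assumptions made for illustration.

```python
def count_genotype_matches(genotypes_a, genotypes_b):
    """Count matching genotypes at SNP sites called in both samples.

    Hypothetical data layout: each argument maps (chrom, pos) to a
    genotype given as a pair of alleles, e.g. ("A", "G").
    Allele order is ignored, so ("A", "G") matches ("G", "A").
    """
    # Only sites genotyped in both samples can be compared.
    common_sites = genotypes_a.keys() & genotypes_b.keys()
    matches = sum(
        1
        for site in common_sites
        if sorted(genotypes_a[site]) == sorted(genotypes_b[site])
    )
    total = len(common_sites)
    fraction = matches / total if total else 0.0
    return matches, total, fraction


sample1 = {("1", 100): ("A", "G"), ("1", 200): ("C", "C"), ("2", 50): ("T", "T")}
sample2 = {("1", 100): ("G", "A"), ("1", 200): ("C", "T"), ("3", 10): ("A", "A")}
matches, total, fraction = count_genotype_matches(sample1, sample2)
```

A high fraction of matching genotypes across many common SNPs suggests the two BAM files come from the same individual; the threshold used to make that call is not specified here.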

Some steps are parallelized to reduce runtime. Specifically:

Genotype each sample independently of the others
Compare a sample's genotype against all other samples (to create a sample's meltedResults file)
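Because each of these steps handles one sample at a time, the per-sample work can be dispatched independently. The sketch below shows the general pattern with a local thread pool; it is an assumption for illustration only, and says nothing about how genotypeSamples.py or compareSamples.py actually schedule their work in the cloud. The names run_per_sample and task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_per_sample(task, sample_names, max_workers=4):
    """Apply `task` to every sample independently and collect the results.

    `task` is a hypothetical callable taking a sample name, for example
    a function that genotypes one sample or compares it to the cohort.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        results = list(pool.map(task, sample_names))
    return dict(zip(sample_names, results))


# Stand-in task: a real one would run the genotyping or comparison.
outputs = run_per_sample(lambda name: name + ".vcf", ["sample1", "sample2"])
```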

How to run

Pre-req: Make input bamsheet

Construct a single three-column, tab-delimited text file that lists, for every sample, the sampleName, the S3 path to the sample's BAM file, and the reference the sample is mapped to (hg19 or GRCh37ERCC). For example:

P-1234.bamsheet.txt:

sample	s3 path to bamfile	reference
sample1	s3://bmsrd-ngs-results/P-12345678-1234/RNA-Seq/bam/sample1.GRCh37ERCC-ensembl75.bam	GRCh37ERCC
sample2	s3://bmsrd-ngs-results/P-12345678-4567/WES/bam/sample2.hg19.bam	hg19
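Before submitting jobs, it can help to check the bamsheet locally. The following is a minimal sketch, not part of cohort-matcher: the function read_bamsheet and its has_header parameter are hypothetical, and it assumes the first row is a header line as in the example above.

```python
import csv
import io

# References named in the bamsheet description.
VALID_REFERENCES = {"hg19", "GRCh37ERCC"}


def read_bamsheet(handle, has_header=True):
    """Parse a 3-column tab-delimited bamsheet into a list of dicts."""
    samples = []
    reader = csv.reader(handle, delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if has_header and line_number == 1:
            continue  # assumption: first row holds column titles
        if not row or all(not field.strip() for field in row):
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"line {line_number}: expected 3 columns, got {len(row)}")
        name, bam_path, reference = (field.strip() for field in row)
        if not bam_path.startswith("s3://"):
            raise ValueError(f"line {line_number}: BAM path is not an S3 path")
        if reference not in VALID_REFERENCES:
            raise ValueError(f"line {line_number}: unknown reference {reference!r}")
        samples.append({"sample": name, "bam": bam_path, "reference": reference})
    return samples


example = (
    "sample\ts3 path to bamfile\treference\n"
    "sample1\ts3://bmsrd-ngs-results/P-12345678-1234/RNA-Seq/bam/"
    "sample1.GRCh37ERCC-ensembl75.bam\tGRCh37ERCC\n"
    "sample2\ts3://bmsrd-ngs-results/P-12345678-4567/WES/bam/sample2.hg19.bam\thg19\n"
)
samples = read_bamsheet(io.StringIO(example))
```

To check a file on disk, pass an open file handle instead of the StringIO object.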

Call genotypeSamples.py

genotypeSamples.py -b P-1234.bamsheet.txt -o s3://bmsrd-ngs-results/P-1234/cohort-matcher

Call compareSamples.py

compareSamples.py -b P-1234.bamsheet.txt -CD s3://bmsrd-ngs-results/P-1234/cohort-matcher

Call mergeResults.py

mergeResults.py -b P-1234.bamsheet.txt -CD s3://bmsrd-ngs-results/P-1234/cohort-matcher

Call findSwaps.R

Rscript analysisScripts/findSwaps.R

or via Docker

docker run -ti --rm -v $PWD:/work -w /work -v /home/ec2-user/NGS/cohort-matcher:/cohort-matcher 483421617021.dkr.ecr.us-east-1.amazonaws.com/cohort-matcher-r Rscript /cohort-matcher/analysisScripts/findSwaps.R

Output

mergeResults.py creates meltedResults.txt, which contains the sample-to-sample comparisons.

Genome Reference

The focus of cohort-matcher v2 is on human (hg19 / GRCh37, and hg38 / GRCh38). Samples must be mapped against either:

hg19 or GRCh37

OR

hg38 or GRCh38

Other combinations of references will not work. In version 2, the chromosome map has been eliminated, and the VCF to TSV process removes the 'chr' chromosome prefix, if one exists, allowing all VCFs to be compared against each other.
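The prefix removal described above can be sketched as follows. This is an illustration rather than the project's VCF-to-TSV code; the function names are hypothetical, and the only VCF detail relied on is that CHROM and POS are the first two tab-separated columns of a data line.

```python
def normalize_chromosome(name):
    """Return the chromosome name without a leading 'chr' prefix."""
    return name[3:] if name.startswith("chr") else name


def site_key(vcf_line):
    """Build a (chrom, pos) key from one VCF data line.

    With the prefix removed, a site called against hg19 ("chr1") and
    the same site called against GRCh37 ("1") produce the same key.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    return normalize_chromosome(fields[0]), int(fields[1])


hg19_key = site_key("chr1\t12345\t.\tA\tG\t50\tPASS\t.\n")
grch37_key = site_key("1\t12345\t.\tA\tG\t50\tPASS\t.\n")
```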

Reference/Target Paths for GRCh37ERCC:

s3://bmsrd-ngs-repo/cohort-matcher/GRCh37ERCC.tar.bz2
s3://bmsrd-ngs-repo/cohort-matcher/GRCh37ERCC.cohort-matcher.bed

Reference/Target Paths for hg19:

s3://bmsrd-ngs-repo/cohort-matcher/hg19.tar.bz2
s3://bmsrd-ngs-repo/cohort-matcher/hg19.cohort-matcher.bed

Variant Callers

(At least one is required)

GATK (requires Java)
VarScan2 (requires Java and Samtools)
Freebayes

Note: Cohort-matcher only supports Freebayes at this time.

Installation

git clone https://github.com/golharam/cohort-matcher
pip install -r cohort-matcher/requirements.txt

The repository includes 3 VCF files which can be used for comparing human data (hg19/GRCh37).

These VCF files also contain variants extracted from 1000 Genomes project which are all exonic and have high likelihood of switching between REF and ALT alleles (global allele frequency between 0.45 and 0.55). The only difference between them is the number of variants contained within.
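The allele-frequency criterion can be expressed as a simple filter. The sketch below is an assumption about how such a selection might be done, not the procedure used to build the bundled VCF files: it reads an AF entry from a VCF INFO string, treats the 0.45 and 0.55 bounds as inclusive, and skips multi-allelic sites. Both function names are hypothetical.

```python
def parse_info_af(info):
    """Return the AF value from a VCF INFO string, or None if unusable."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "AF":
            if "," in value:
                return None  # multi-allelic site: skipped in this sketch
            try:
                return float(value)
            except ValueError:
                return None
    return None


def is_informative(allele_frequency, low=0.45, high=0.55):
    """True when the allele frequency lies within [low, high]."""
    return allele_frequency is not None and low <= allele_frequency <= high


info_fields = ["AC=5;AF=0.48;AN=10", "AF=0.30", "AF=0.5,0.1", "DP=12"]
kept = [info for info in info_fields if is_informative(parse_info_af(info))]
```

Variants near a frequency of 0.5 are the most likely to differ between two unrelated individuals, which is why they are useful for telling samples apart.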

The repository also includes several BAM files which can be used for testing (under test_data directory), as well as the expected results for various settings.

Cohort-matcher adds unit tests for the Python code.

LICENSE

The code is released under the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/). You are free to use and modify it for any purpose (including commercial), so long as you include appropriate attribution.

Citation

cohort-matcher - in prep

Contact

Ryan Golhar (ryan.golhar@bms.com)
