A workflow for comparing multiple cohorts of BAM files to determine if they contain reads sequenced from the same samples or patients by counting genotype matches at common SNPs. Cohort-matcher is an efficient, cloud-enabled variation of BAM-matcher.
The basic workflow consists of:
- Genotype all the samples to be compared. (genotypeSamples.py)
- Compare the genotypes of each sample against the genotypes of all the other samples. (compareSamples.py, which in turn uses compareGenotypes.py to compare a sample to the remaining cohort of samples)
- Merge the results of the sample comparisons (mergeResults.py)
- Generate plots based on the results and the known patient-to-sample association.
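The comparison step boils down to counting genotype matches at sites that both samples were genotyped at. The following is a minimal sketch of that idea only, not the code in compareGenotypes.py: the in-memory dictionary layout, the function names, and the example genotypes are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical in-memory layout: (chromosome, position) -> genotype string.
Genotypes = Dict[Tuple[str, int], str]


def normalize(gt: str) -> str:
    """Sort the alleles so that 'A/G' and 'G/A' compare equal."""
    return "/".join(sorted(gt.replace("|", "/").split("/")))


def fraction_matching(a: Genotypes, b: Genotypes) -> Tuple[int, int, float]:
    """Return (matches, sites compared, fraction matching) over shared sites."""
    shared = a.keys() & b.keys()
    matches = sum(1 for site in shared if normalize(a[site]) == normalize(b[site]))
    total = len(shared)
    return matches, total, (matches / total if total else 0.0)


# Made-up genotypes, for illustration only.
sample1 = {("1", 1000): "A/G", ("1", 2000): "C/C", ("2", 500): "T/T"}
sample2 = {("1", 1000): "G/A", ("1", 2000): "C/T", ("3", 42): "A/A"}
matches, total, frac = fraction_matching(sample1, sample2)
print(matches, total, frac)
```

Only sites present in both samples are counted, so a pair with few shared sites yields a fraction that should be read with caution.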
To reduce runtime, some steps are parallelized. Specifically:
- Each sample is genotyped independently of the others
- Each sample's genotype is compared against all other samples (to create that sample's meltedResults file)
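Because each sample is processed independently, the per-sample work can be fanned out. The sketch below shows the pattern with a local thread pool. It does not show how cohort-matcher actually dispatches its jobs; `run_per_sample` and `fake_genotype` are hypothetical names, and the placeholder task stands in for the real genotyping work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_per_sample(samples: List[str], task: Callable[[str], str],
                   workers: int = 4) -> Dict[str, str]:
    """Run task(sample) for every sample concurrently; return results by sample."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input list.
        return dict(zip(samples, pool.map(task, samples)))


def fake_genotype(sample: str) -> str:
    # Placeholder for the real per-sample work (hypothetical output name).
    return sample + ".vcf"


outputs = run_per_sample(["sample1", "sample2", "sample3"], fake_genotype)
print(outputs)
```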
Pre-req: Make input bamsheet
Construct a single three-column, tab-delimited text file listing, for every sample, the sample name, the S3 path to the sample's BAM file, and the reference the sample was mapped to (hg19 or GRCh37ERCC). For example:
P-1234.bamsheet.txt:
sample | s3 path to bamfile | reference |
---|---|---|
sample1 | s3://bmsrd-ngs-results/P-12345678-1234/RNA-Seq/bam/sample1.GRCh37ERCC-ensembl75.bam | GRCh37ERCC |
sample2 | s3://bmsrd-ngs-results/P-12345678-4567/WES/bam/sample2.hg19.bam | hg19 |
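It can help to check a bamsheet before submitting jobs. The following is a minimal validation sketch and is not part of cohort-matcher. It assumes the file has no header row and that the only accepted reference values are the two named above; `read_bamsheet` and `BamsheetRow` are hypothetical names.

```python
import csv
import io
from typing import List, NamedTuple, TextIO

ALLOWED_REFERENCES = {"hg19", "GRCh37ERCC"}  # the two values named in this README


class BamsheetRow(NamedTuple):
    sample: str
    bam_path: str
    reference: str


def read_bamsheet(handle: TextIO) -> List[BamsheetRow]:
    """Parse a tab-delimited bamsheet (assumed to have no header row)."""
    rows = []
    for lineno, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields or all(not f.strip() for f in fields):
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 columns, found {len(fields)}")
        row = BamsheetRow(*(f.strip() for f in fields))
        if not row.bam_path.startswith("s3://"):
            raise ValueError(f"line {lineno}: not an S3 path: {row.bam_path}")
        if row.reference not in ALLOWED_REFERENCES:
            raise ValueError(f"line {lineno}: unknown reference: {row.reference}")
        rows.append(row)
    return rows


example = (
    "sample1\ts3://bmsrd-ngs-results/P-12345678-1234/RNA-Seq/bam/"
    "sample1.GRCh37ERCC-ensembl75.bam\tGRCh37ERCC\n"
    "sample2\ts3://bmsrd-ngs-results/P-12345678-4567/WES/bam/sample2.hg19.bam\thg19\n"
)
rows = read_bamsheet(io.StringIO(example))
print([row.sample for row in rows])
```

For a file on disk, pass an open handle instead, for example `read_bamsheet(open("P-1234.bamsheet.txt", newline=""))`.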
- Call genotypeSamples.py
genotypeSamples.py -b P-1234.bamsheet.txt -o s3://bmsrd-ngs-results/P-1234/cohort-matcher
- Call compareSamples.py
compareSamples.py -b P-1234.bamsheet.txt -CD s3://bmsrd-ngs-results/P-1234/cohort-matcher
- Call mergeResults.py
mergeResults.py -b P-1234.bamsheet.txt -CD s3://bmsrd-ngs-results/P-1234/cohort-matcher
- Call findSwaps.R
Rscript analysisScripts/findSwaps.R
or via Docker
docker run -ti --rm -v $PWD:/work -w /work -v /home/ec2-user/NGS/cohort-matcher:/cohort-matcher 483421617021.dkr.ecr.us-east-1.amazonaws.com/cohort-matcher-r Rscript /cohort-matcher/analysisScripts/findSwaps.R
mergeResults.py creates meltedResults.txt, which contains the sample-to-sample comparisons.
Cohort-matcher v2 focuses on human data (hg19 / GRCh37 and hg38 / GRCh38). All samples must be mapped against either:
- hg19 or GRCh37
OR
- hg38 or GRCh38
Other combinations of references will not work. In version 2, the chromosome map has been eliminated, and the VCF to TSV process removes the 'chr' chromosome prefix, if one exists, allowing all VCFs to be compared against each other.
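The prefix removal can be pictured as follows. This is a minimal sketch of the normalization described above, not the code used in the VCF to TSV step, and `strip_chr_prefix` is a hypothetical name.

```python
def strip_chr_prefix(chrom: str) -> str:
    """Return the chromosome name without a leading 'chr', if one is present."""
    return chrom[3:] if chrom.startswith("chr") else chrom


names = ["chr1", "1", "chrX", "MT"]
normalized = [strip_chr_prefix(name) for name in names]
print(normalized)  # ['1', '1', 'X', 'MT']
```

After normalization, `chr1` from an hg19 VCF and `1` from a GRCh37 VCF refer to the same chromosome key. The sketch only strips the prefix; it does not reconcile other naming differences such as `chrM` versus `MT`.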
Reference/Target Paths for GRCh37ERCC:
- s3://bmsrd-ngs-repo/cohort-matcher/GRCh37ERCC.tar.bz2
- s3://bmsrd-ngs-repo/cohort-matcher/GRCh37ERCC.cohort-matcher.bed
Reference/Target Paths for hg19:
- s3://bmsrd-ngs-repo/cohort-matcher/hg19.tar.bz2
- s3://bmsrd-ngs-repo/cohort-matcher/hg19.cohort-matcher.bed
Variant callers (at least one is required):
- GATK (requires Java)
- VarScan2 (requires Java and Samtools)
- Freebayes
Note: Cohort-matcher only supports Freebayes at this time.
git clone https://github.com/golharam/cohort-matcher
pip install -r cohort-matcher/requirements.txt
The repository includes 3 VCF files which can be used for comparing human data (hg19/GRCh37).
These VCF files contain variants extracted from the 1000 Genomes Project, all of which are exonic and have a high likelihood of switching between REF and ALT alleles (global allele frequency between 0.45 and 0.55). The only difference between the files is the number of variants they contain.
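The allele-frequency criterion can be expressed as a simple filter over VCF records. The sketch below is illustrative and is not the procedure used to build the bundled files. It assumes the INFO column carries an `AF` key, treats the 0.45 and 0.55 bounds as inclusive, does not check whether a variant is exonic, and uses made-up records.

```python
from typing import Optional


def allele_frequency(info: str) -> Optional[float]:
    """Extract the first AF value from a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "AF" and value:
            return float(value.split(",")[0])
    return None


def is_balanced(vcf_line: str, low: float = 0.45, high: float = 0.55) -> bool:
    """True if the record's AF lies within [low, high]."""
    fields = vcf_line.rstrip("\n").split("\t")
    af = allele_frequency(fields[7])  # INFO is the eighth VCF column
    return af is not None and low <= af <= high


# Made-up records, for illustration only.
lines = [
    "1\t1000\trs1\tA\tG\t.\tPASS\tAF=0.50",
    "1\t2000\trs2\tC\tT\t.\tPASS\tAF=0.10;DP=100",
    "2\t500\trs3\tG\tA\t.\tPASS\tDP=30",
]
kept = [line for line in lines if is_balanced(line)]
print(len(kept))
```

Variants near 50% frequency are the most informative for sample matching, because two unrelated individuals are then likely to differ at many of the sites.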
The repository also includes several BAM files which can be used for testing (under test_data directory), as well as the expected results for various settings.
Cohort-matcher adds unit tests for the Python code.
The code is released under the Creative Commons Attribution 4.0 licence (http://creativecommons.org/licenses/by/4.0/). You are free to use and modify it for any purpose (including commercial), so long as you include appropriate attribution.
cohort-matcher - in prep
Ryan Golhar (ryan.golhar@bms.com)