A collection of Python tools for working with genomic variants and next-generation sequencing reads. Not particularly fast for large datasets. The emphasis is on extracting what you need from BAMs and VCFs into a CSV file for further analysis.
- varlens-variants
Combine, annotate, and filter variants from VCF or CSV files. Available annotations include genes, variant effects, surrounding sequence context, counts of supporting reads from specified BAM files, and MHC I binding affinity prediction of mutant peptides.
- varlens-reads
Display, filter, and copy reads from a SAM/BAM file. Partial replacement for
samtools view
.- varlens-allele-support
Count reads supporting each allele at specified sites in BAM files.
To install from PyPI:
pip install varlens
Or from a git checkout:
pip install .
To run the tests:
nosetests .
To build the documentation (just this README plus the commandline tool help):
pip install -e .
pip install Sphinx
cd docs
make clean setup rst html
The docs will be written to the _build/html
directory.
Given variants from one or more VCF or CSV files, apply filters, add additional columns, and output to CSV.
Currently we can only output to CSV, not VCF.
A number of useful annotations can be added for each variant by specifying options of the form '--include-XXX', e.g. '--include-gene'. See detailed help (run with -h).
Print basic info for the variants found in two VCF files. Note that variants found in both files are listed in one row, and the 'sources' column lists the files each variant was found in:
$ varlens-variants test/data/CELSR1/vcfs/vcf_1.vcf test/data/CELSR1/vcfs/vcf_2.vcf
genome,contig,interbase_start,interbase_end,ref,alt,sources
GRCh37,22,21829554,21829555,T,G,1.vcf
GRCh37,22,46931059,46931060,A,C,1.vcf
GRCh37,22,46931061,46931062,G,A,1.vcf 2.vcf
GRCh37,22,50636217,50636218,A,C,1.vcf
GRCh37,22,50875932,50875933,A,C,1.vcf
GRCh37,22,45309892,45309893,T,G,2.vcf
Same as the above but include additional columns giving varcode variant effect annotations and the genes the variants overlap, and write to a file:
$ varlens-variants test/data/CELSR1/vcfs/vcf_1.vcf test/data/CELSR1/vcfs/vcf_2.vcf \
--include-effect \
--include-gene \
--out /tmp/result.csv
Wrote: /tmp/result.csv
$ cat /tmp/result.csv
genome,contig,interbase_start,interbase_end,ref,alt,sources,effect,gene
GRCh37,22,21829554,21829555,T,G,1.vcf,non-coding-transcript,PI4KAP2
GRCh37,22,46931059,46931060,A,C,1.vcf,p.S670A,CELSR1
GRCh37,22,46931061,46931062,G,A,1.vcf 2.vcf,p.S669F,CELSR1
GRCh37,22,50636217,50636218,A,C,1.vcf,intronic,TRABD
GRCh37,22,50875932,50875933,A,C,1.vcf,splice-acceptor,PPP6R2
GRCh37,22,45309892,45309893,T,G,2.vcf,p.T214P,PHF21B
Print counts for number of reads supporting reference/variant/other alleles from the specified BAM, counting only reads with mapping quality >= 10:
$ varlens-variants test/data/CELSR1/vcfs/vcf_1.vcf \
--include-read-evidence \
--reads test/data/CELSR1/bams/bam_1.bam \
--min-mapping-quality 10
genome,contig,interbase_start,interbase_end,ref,alt,sources,num_alt,num_ref,total_depth
GRCh37,22,21829554,21829555,T,G,vcf_1.vcf,0,0,0
GRCh37,22,46931059,46931060,A,C,vcf_1.vcf,0,216,320
GRCh37,22,46931061,46931062,G,A,vcf_1.vcf,0,321,321
GRCh37,22,50636217,50636218,A,C,vcf_1.vcf,0,0,0
GRCh37,22,50875932,50875933,A,C,vcf_1.vcf,0,0,0
Filter reads from one or more BAMs and output a CSV or a new BAM.
Loci and VCF files may be specified, in which case reads are filtered to overlap the specified loci or variants.
Print basic fields for the reads in a BAM:
$ varlens-reads test/data/CELSR1/bams/bam_0.bam
query_name,reference_start,reference_end,cigarstring
HISEQ:142:C5822ANXX:3:2116:16538:101199,46929962,46930062,100M
HISEQ:142:C5822ANXX:3:1106:18985:32932,46929964,46930064,100M
HISEQ:142:C5822ANXX:3:2201:21091:67220,46929966,46930066,100M
HISEQ:142:C5822ANXX:4:1304:5363:12786,46929966,46930066,100M
HISEQ:142:C5822ANXX:4:1104:9008:85114,46929969,46930069,100M
HISEQ:142:C5822ANXX:3:2304:9921:94828,46929970,46930070,100M
HISEQ:142:C5822ANXX:3:2211:6266:74633,46929973,46930073,100M
HISEQ:142:C5822ANXX:3:1305:8982:42729,46929974,46930074,100M
HISEQ:142:C5822ANXX:4:2316:5630:7371,46929978,46930078,100M
...
Same as above but filter only to reads aligned on the (-) strand, write to a file instead of stdout, and also include the mapping quality and sequenced bases in the output:
$ varlens-reads test/data/CELSR1/bams/bam_0.bam \
--is-reverse \
--field mapping_quality query_alignment_sequence \
--out /tmp/result.csv
Wrote: /tmp/result.csv
$ head /tmp/result.csv
query_name,reference_start,reference_end,cigarstring,mapping_quality,query_alignment_sequence
HISEQ:142:C5822ANXX:3:2116:16538:101199,46929962,46930062,100M,60,CATGATCTGGGCATTAGGGCCTTCATCAGGGTCGTTAGCACGAATCTTTGCCACCACCGACCCCACTGGGTTGTTCTCCTCAACAAACAGCTCCAGTTCG
HISEQ:142:C5822ANXX:3:1106:18985:32932,46929964,46930064,100M,60,TGATCTGGGCATTAGGGCCTTCATCAGGGTCGTTAGCACGAATCTTTGCCACCACCGACCCCACTGGGTTGTTCTCCTCAACAAACAGCTCCAGTTCGTC
HISEQ:142:C5822ANXX:4:1104:9008:85114,46929969,46930069,100M,60,TGGGCATTAGGGCCTTCATCAGGGTCGTTAGCACGAATCTTTGCCACCACCGACCCCACTGGGTTGTTCTCCTCAACAAACAGCTCCAGTTCGTCCTTCT
HISEQ:142:C5822ANXX:4:1202:18451:91174,46929979,46930079,100M,60,GGCCTTCATCAGGGTCGTTAGCACGAATCTTTGCCACCACCGACCCCACTGGGTTGTTCTCCTCAACAAACAGCTCCAGTTCGTCCTTCTCAAACATGGG
HISEQ:142:C5822ANXX:3:1211:18522:54773,46929987,46930087,100M,60,TCAGGGTCGTTAGCACGAATCTTTGCCACCACCGACCCCACTGGGTTGTTCTCCTCAACAAACAGCTCCAGTTCGTCCTTCTCAAACATGGGGGCATTGT
HISEQ:142:C5822ANXX:3:2114:19455:45093,46929987,46930087,100M,60,TCAGGGTCGTTAGCACGAATCTTTGCCACCGCCGACCCCACTGGGTTGTTCTCCTCAACAAACAGCTCCAGTTCGTCCTTCTCAAACATGGGGGCATTGT
HISEQ:142:C5822ANXX:4:2115:9153:21593,46929994,46930094,100M,60,CGTTAGCACGAATCTTTGCCACCACCGACCCCACTGGGTTGTTCTCCTCAACAAACAGCTCCAGTTCGTCCTTCTCAAACATGGGGGCATTGTCATTAAT
HISEQ:142:C5822ANXX:4:1212:15644:87227,46929995,46930095,100M,60,GTTAGCACGTATGTTTGCCACCACCGACCCCACTGAGTTGTTCTCCTCAACAAACAGCTCCAGTTCGTGCTTCTCAAACATGGGGGCAGTGTCATTAATG
HISEQ:142:C5822ANXX:3:1103:4717:26369,46929997,46930097,100M,60,TAGCACGAATCTTTGCCACCACCGACCCCACTGGGTTGTTCTCCTCAACAAACAGCTCCAGTTCGTCCTTCTCAAACATGGGGGCATTGTCATTAATGTC
Write a bam file consisting of reads with mapping quality >=30 and overlapping a certain locus:
$ varlens-reads test/data/CELSR1/bams/bam_0.bam \
--min-mapping-quality 30 \
--locus 22:46932040-46932050 \
--out /tmp/result.bam
Write a bam file consisting of reads overlapping variants from a VCF:
$ varlens-reads test/data/CELSR1/bams/bam_0.bam \
--variants test/data/CELSR1/vcfs/vcf_1.vcf \
--out /tmp/result.bam
Print just the header for a BAM in csv format:
$ varlens-reads test/data/CELSR1/bams/bam_0.bam --header
Given one or more BAMs and some genomic sites to consider, write a csv file giving counts of reads supporting each allele at each site for each BAM.
The genomic sites to consider may be specified by locus (--locus option), or via one or more VCF files.
The positions outputted by this command are in interbase coordinates, i.e. starting at 0, inclusive on first index, exclusive on second (as opposed to the one-based inclusive coordinates used in VCF files).
varlens-allele-support \
--reads test/data/CELSR1/bams/bam_1.bam \
--locus 22:46931061 22:46931063
source,contig,interbase_start,interbase_end,allele,count
bam_1.bam,22,46931060,46931061,,1
bam_1.bam,22,46931060,46931061,G,329
bam_1.bam,22,46931062,46931063,A,327
bam_1.bam,22,46931062,46931063,AC,1
bam_1.bam,22,46931062,46931063,AG,2
varlens
uses 0-based half-open coordinates internally. Many tools (including samtools and VCF files) use inclusive 1-based coordinates. We try to keep the confusion to a minimum by using the term "interbase" whenever we're using 0-based half open coordinates and "inclusive" when we're using 1-based inclusive coordinates.
One particularly sticky place this comes up is when specifying loci on the commandline using e.g. --locus chr22:43243-43244
. To maintain consistency with the most common other tools, when you specify a locus like chr22:10-20
, we interpret that as a 1-based inclusive coordinate. To specify 0-based half-open coordinates, use this syntax: chr22/11-20
(i.e. a slash instead of a colon).
See this blog post for more details on coordinate systems.