Python package that helps intersect BED files with sequence conservation tracks.
This package was inspired by the need to extract conservation scores for a given set of BED intervals. It is based on a solution described by Dave Tang in this blog post.
pyconserve takes as input the following:
- A BED file containing the intervals of interest
- Phastcons or PhyloP bedGraph files (one per chromosome)
It then uses pybedtools to perform genomic intersections between the input BED file and the bedGraph files. Because each chromosome is queried independently, the intersections can be run in parallel to reduce runtime.
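The per-chromosome strategy can be sketched as follows. This is a minimal illustration, not pyconserve's actual implementation: the helper names (`chrom_from_path`, `intersect_one`, `intersect_all`), the `processes` parameter, and the assumption that each bedGraph file name begins with its chromosome name (as in `chr1.phastCons100way.bedGraph.gz`) are hypothetical. Running the intersection itself requires both pybedtools and bedtools to be installed.

```python
import os
from multiprocessing import Pool


def chrom_from_path(path):
    """Guess the chromosome from a name like 'chr1.phastCons100way.bedGraph.gz'.

    Assumption: the file name starts with the chromosome name followed by a dot.
    """
    return os.path.basename(path).split(".")[0]


def intersect_one(job):
    """Intersect the BED intervals on one chromosome with that chromosome's bedGraph."""
    bed_path, bedgraph_path = job
    # Imported lazily so the pure-Python helper above works without pybedtools.
    import pybedtools

    chrom = chrom_from_path(bedgraph_path)
    # Keep only the intervals on this chromosome and write them to a temp file.
    subset = pybedtools.BedTool(bed_path).filter(lambda f: f.chrom == chrom).saveas()
    if len(subset) == 0:
        return ""  # nothing to intersect on this chromosome
    # wa/wb report each BED interval next to the bedGraph record it overlaps.
    hits = subset.intersect(bedgraph_path, wa=True, wb=True)
    return str(hits)


def intersect_all(bed_path, bedgraph_paths, processes=4):
    """Run one intersection per bedGraph file in a pool of worker processes."""
    jobs = [(bed_path, path) for path in bedgraph_paths]
    with Pool(processes=processes) as pool:
        chunks = pool.map(intersect_one, jobs)
    return "".join(chunks)
```

Under these assumptions, `intersect_all("a.bed", sorted(glob.glob("chr*.phastCons100way.bedGraph.gz")))` would return the combined intersection output as one string. Each worker returns text rather than a `BedTool` object so that results can be passed back to the parent process without depending on temporary files.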
Note: this software is in beta and may contain bugs.
- Install bedtools
- Clone the pyconserve GitHub repo
    git clone https://www.github.com/kcha/pyconserve.git
    cd pyconserve
- Install the package
    python setup.py install
The following steps describe how to download conservation tracks and convert them to bedGraph format using the UCSC tools wigToBigWig and bigWigToBedGraph. These steps are adapted from this blog post.
- Download Phastcons or PhyloP wiggle files (one chromosome per file) from UCSC
- Convert wiggle to bigWig using wigToBigWig
    wigToBigWig chr1.phastCons100way.wigFix.gz hg19.chrom_sizes.txt \
        chr1.phastCons100way.bigWig
  This step requires a file containing the chromosome sizes of your species.
- Convert bigWig to bedGraph using bigWigToBedGraph
    bigWigToBedGraph chr1.phastCons100way.bigWig chr1.phastCons100way.bedGraph
- (optional) To save space, compress the bedGraph file and remove the wiggle and bigWig files
    rm -v chr1.phastCons100way.wigFix.gz chr1.phastCons100way.bigWig
    gzip -vf chr1.phastCons100way.bedGraph
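Because there is one file per chromosome, the conversion steps above are usually repeated many times. The sketch below shows one way to script them from Python. The function names and the `chrom_sizes` and `track` parameters are hypothetical, and the file-naming pattern is simply copied from the example commands above. The `rm` step is deliberately left out so that nothing is deleted automatically.

```python
import subprocess


def conversion_commands(chrom, chrom_sizes="hg19.chrom_sizes.txt",
                        track="phastCons100way"):
    """Build the wigToBigWig, bigWigToBedGraph and gzip commands for one chromosome."""
    wig = f"{chrom}.{track}.wigFix.gz"
    bigwig = f"{chrom}.{track}.bigWig"
    bedgraph = f"{chrom}.{track}.bedGraph"
    return [
        ["wigToBigWig", wig, chrom_sizes, bigwig],
        ["bigWigToBedGraph", bigwig, bedgraph],
        ["gzip", "-vf", bedgraph],
    ]


def convert(chrom, **kwargs):
    """Run the commands in order, stopping at the first one that fails."""
    for cmd in conversion_commands(chrom, **kwargs):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

For example, `for chrom in ["chr1", "chr2"]: convert(chrom)` would convert two chromosomes, assuming the UCSC tools are on the `PATH` and the wiggle files are in the current directory.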
Given a BED file and a set of bedGraph conservation files, pyconserve can be run as follows:
pyconserve a.bed chr*.phastCons100way.bedGraph.gz > a_conservation.bed
This will perform bedtools intersect and save the output to a_conservation.bed.
To aggregate these results by computing the mean conservation score for each interval in the BED file, use the command summarize_conserve:
summarize_conserve a_conservation.bed > summarized.bed
Alternatively, the above two commands can be chained together as follows:
pyconserve a.bed chr*.phastCons100way.bedGraph.gz | \
summarize_conserve - > summarized.bed
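To make the aggregation step concrete, here is a rough sketch of computing a mean conservation score per interval. It is not a description of how summarize_conserve works internally. It assumes tab-delimited input in which the first three columns are the BED interval and the last three columns are the bedGraph start, end and score, and it assumes a per-base mean (each score weighted by the number of overlapping bases). The function name `summarize` is hypothetical.

```python
def summarize(lines):
    """Return {(chrom, start, end): mean score}, weighted by overlapping bases."""
    totals = {}  # interval -> [sum of score * bases, total bases]
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        g_start, g_end, score = int(fields[-3]), int(fields[-2]), float(fields[-1])
        # Number of bases shared by the BED interval and the bedGraph record.
        bases = min(end, g_end) - max(start, g_start)
        if bases <= 0:
            continue
        acc = totals.setdefault((chrom, start, end), [0.0, 0])
        acc[0] += score * bases
        acc[1] += bases
    return {key: total / n for key, (total, n) in totals.items()}
```

Intervals with no overlapping conservation records do not appear in the result. Whether such intervals should instead be reported with a placeholder value is a design choice this sketch leaves open.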
This software is inspired by previous work from others:
- Extracting sequence conservation: https://davetang.org/muse/2012/08/07/sequence-conservation-in-vertebrates/
- Multiprocessing using pybedtools: https://daler.github.io/pybedtools/3-brief-examples.html#example-3-count-reads-in-introns-and-exons-in-parallel