PgPy is a Python library for population genomic analysis. It is written in Python and reads VCF files through the pysam library, which allows fast iteration over whole-genome data. The current release is developed under Python 3.6; Python 2.x is not supported.
Using pip
pip install git+https://github.com/jsgounot/PgPy.git
Or download / clone the GitHub repository
git clone https://github.com/jsgounot/PgPy.git
cd PgPy
python setup.py install --user
The main purpose of this library is to work with a merged VCF file built from the sequencing data of multiple samples. The final VCF file must be indexed using tabix. Merging multiple VCFs into a single VCF can be done using vcftools's vcf-merge tool. Since pysam works well with compressed files and tabix requires bgzip compression, you should compress the merged file with bgzip (shipped with tabix) before indexing it. If you want to work with snpEff results, do not forget to annotate your merged VCF file during the process.
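The compression and indexing steps can also be run from Python. The following is a minimal sketch, not part of PgPy itself: it assumes pysam is installed and uses its `tabix_compress` and `tabix_index` functions. The helper names `compressed_name` and `prepare_vcf` and the file name `merged.vcf` are hypothetical.

```python
def compressed_name(vcf_path):
    """Return the path of the bgzip-compressed file (hypothetical helper)."""
    vcf_path = str(vcf_path)
    return vcf_path if vcf_path.endswith(".gz") else vcf_path + ".gz"


def prepare_vcf(merged_vcf, force=False):
    """Compress a merged VCF with bgzip and index it with tabix (sketch)."""
    # Imported here so that this module can be loaded without pysam installed.
    import pysam

    merged_vcf = str(merged_vcf)
    compressed = compressed_name(merged_vcf)
    if compressed != merged_vcf:
        # bgzip compression: writes merged.vcf.gz next to merged.vcf
        pysam.tabix_compress(merged_vcf, compressed, force=force)
    # tabix indexing with the VCF preset: writes merged.vcf.gz.tbi
    pysam.tabix_index(compressed, preset="vcf", force=force)
    return compressed
```

Calling `prepare_vcf("merged.vcf")` should leave a `merged.vcf.gz` file and its `.tbi` index in the same directory, assuming the input is sorted by chromosome and position as tabix requires.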
PgPy has been designed to be minimalist and flexible. You can look at the introduction guide for a first overview of the possibilities. PgPy also provides several recipes which may help you see how it works. In short, PgPy allows you to:
- Iterate easily through variants along the genome or only a part of it (based on tabix support provided by pysam)
- Quickly produce alignments with inferred SNPs and / or indels
- Work within a Python environment and interface easily with the BioPython library
- Modify SNPs "on the fly", such as converting heterozygous SNPs into IUPAC codes
- Use multiprocessing to speed up processing by parallelizing operations per chromosome or region
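To illustrate what converting heterozygous SNPs into IUPAC codes means, here is a small standalone sketch. It does not use PgPy's API; the function name `genotype_to_iupac` and the lookup table are assumptions made for this example, and only single-nucleotide alleles are handled.

```python
# Standard IUPAC ambiguity codes for unordered pairs of nucleotides.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def genotype_to_iupac(allele_a, allele_b):
    """Return one IUPAC character for a diploid SNP genotype (hypothetical helper)."""
    a, b = allele_a.upper(), allele_b.upper()
    if a == b:
        # Homozygous site: keep the allele unchanged.
        return a
    try:
        # Heterozygous site: look up the ambiguity code, ignoring allele order.
        return IUPAC_PAIRS[frozenset((a, b))]
    except KeyError:
        raise ValueError("unsupported allele pair: %s/%s" % (a, b)) from None


print(genotype_to_iupac("A", "G"))
print(genotype_to_iupac("C", "C"))
```

A heterozygous A/G site is written as `R`, while a homozygous C/C site stays `C`, so each sample can be represented by a single sequence in an alignment.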