scripts for analysis of selective Whole Genome Amplification (sWGA) data generated for W. bancrofti
takes output of MSMC2 and reformats it to ms input
parses a fasta file by a list to include or exclude sequences. Works with both single-line and N-character line breaks.
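A minimal sketch of the include/exclude idea described above, not the script in this repository. The function names `read_fasta` and `filter_fasta` are hypothetical, and matching on the first word of the header is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined,
    so single-line and N-character line breaks are handled the same way."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(lines, names, exclude=False):
    """Keep records whose ID is in names, or drop them when exclude=True.
    Assumption: the ID is the first whitespace-delimited word of the header."""
    names = set(names)
    for header, seq in read_fasta(lines):
        words = header.split()
        seq_id = words[0] if words else ""
        if (seq_id in names) != exclude:
            yield header, seq
```

Passing an open file handle as `lines` works, since the functions only iterate over lines.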
applies filters to SNPs from a freebayes or GATK VCF file with 1 or more samples. Files should be prepped as: vcffilter -f 'QUAL > 30' -s FOO.vcf | vt decompose_blocksub - 2> /dev/null | vcffixup - | vcfstreamsort | vt normalize -r ref.fasta -q - 2> /dev/null | vcfuniqalleles > FOO.norm.vcf && bcftools filter -g5 -G10 -i' %TYPE="snp"' FOO.norm.vcf > FOO.norm.snps.vcf, then run fix_mnps.py
for each line in a vcf (no indels) where the 5th column is >2 and contains a "," character, this script prints the line as 2 lines. These can then be properly filtered by filter_snps.py
makes a bed file mask for use with bedtools maskfasta from a fasta where lower-case is considered a masked site. The fasta should be in single-line format, not broken every 80 (or however many) characters. This was specifically written to take a SNPable mask file and prep it for use with MSMC/MSMC2
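A minimal sketch of turning lower-case runs into BED intervals, assuming the single-line fasta described above. The function names are hypothetical. Coordinates are 0-based and half-open, as BED expects.

```python
import re


def lowercase_runs(seq):
    """Return (start, end) for every run of lower-case letters in seq."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


def fasta_to_bed(lines):
    """Yield one BED line per lower-case run.
    Assumption: each sequence sits on a single line after its header."""
    contig = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            contig = words[0] if words else ""
        elif line and contig is not None:
            for start, end in lowercase_runs(line):
                yield f"{contig}\t{start}\t{end}"
```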
change character breaks in a fasta file
this script removes lines in the VCF that fall within the mask provided in a bed file.
If there are 2 lines with the same coordinate, it always prints the 1st line. This assumes that fix_mnps has placed the allele with the highest frequency on the first line
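The two behaviours above (dropping masked positions, then keeping only the first line per coordinate) can be sketched as follows. This is an illustration under assumptions, not the repository script: the function names are hypothetical, the BED file is taken to be 0-based half-open, and the VCF is taken to be sorted so duplicate coordinates are adjacent.

```python
from bisect import bisect_right
from collections import defaultdict


def load_mask(bed_lines):
    """Map contig -> sorted, merged list of [start, end) BED intervals."""
    raw = defaultdict(list)
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] in ("#", "track", "browser"):
            continue
        raw[fields[0]].append([int(fields[1]), int(fields[2])])
    merged = {}
    for contig, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:          # overlaps or touches the previous interval
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[contig] = out
    return merged


def is_masked(mask, contig, pos):
    """True if the 1-based VCF position lies inside a mask interval."""
    intervals = mask.get(contig, [])
    i = bisect_right(intervals, [pos - 1, float("inf")]) - 1
    return i >= 0 and pos - 1 < intervals[i][1]


def filter_vcf(vcf_lines, mask):
    """Yield header lines plus the first unmasked record at each coordinate."""
    last = None
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        contig, pos = line.split("\t")[:2]
        if is_masked(mask, contig, int(pos)) or (contig, pos) == last:
            continue
        last = (contig, pos)
        yield line
```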
creates 4 masks: gap mask, mappability mask, low-complexity/repeat mask, and coverage mask. Paths must be set for the dependent software: bedtools, GapDistrFromFasta.pl (found in the Wb_Genome_L3 repository), RepeatMasker, bwa, and Heng Li's programs in seqbility-20091110
calculates assemblathon stats from assembly
calculates percent gc for each contig in a sliding window
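A minimal sketch of a sliding-window GC calculation for one contig. The function name and the default window and step sizes are assumptions, as is the choice to ignore non-ACGT characters (such as N) in the denominator.

```python
def gc_windows(seq, window=1000, step=500):
    """Yield (start, end, percent_gc) for each window along seq.
    Coordinates are 0-based, half-open. Bases other than A/C/G/T are
    left out of the denominator; a window with none of them gives NaN."""
    seq = seq.upper()
    if not seq:
        return
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        called = gc + chunk.count("A") + chunk.count("T")
        pct = 100.0 * gc / called if called else float("nan")
        yield start, start + len(chunk), pct
```

A contig shorter than the window is reported as a single window covering the whole sequence.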
fits mixture distribution to genome coverage profiles
two perl scripts for parsing mpileup files from samtools
calculates RAF (reference allele frequency) from a VCF. Outputs a format for ggplot2. This is a really terrible, inflexible implementation; I will rewrite one for the sWGA repository
This runs samtools mpileup after sort and rmdup and calculates the reference allele frequency as a diagnostic test of ploidy.
Calculates the distance between snps in a vcf and determines what % of the genome that length bins comprise. Currently requires a lengths.txt file of contig lengths to calculate edge cases. Writes an output file easily read by ggplot2
Calculate the number of snps per sliding window from a vcf file, works on individual
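A minimal sketch of counting SNPs per window from VCF lines. For brevity it uses non-overlapping windows (step equal to window size) rather than a true sliding window. The function name and default window size are assumptions.

```python
from collections import Counter


def snps_per_window(vcf_lines, window=10000):
    """Count records per (contig, window_start), where window_start is the
    0-based start of the non-overlapping window containing the SNP."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        contig, pos = line.split("\t")[:2]
        window_start = (int(pos) - 1) // window * window
        counts[(contig, window_start)] += 1
    return counts
```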
This code counts the number of fixed sites between PNG and Jak in a VCF by windows. It also counts the number of SNPs in the same window. The output is contig
this is a redo of the HKA test using the maffilter SNP file from the bmal and Wb alignment. I also added windows without SNPs that still have a real number of divergences. Previously I skipped contigs if they did not have SNPs; this was wrong, dead wrong! I love lamp. mafTools/bin/mafExtractor
Another attempt at calculating HKA table from the outgroup of Bmal. This iteration uses the maf.vcf created by maffilter and a mugsy alignment. It also utilizes the fasta alignment converted from the maf alignment. The maf.vcf can be loaded from the pickle produced by Ancestral allele script
slices region from maf for vcf ancestral allele
parses a maf.vcf (a vcf from maffilter between 2 species) into a dict. Uses the dict to add the ancestral state to the VCF file as AA:%s
Takes a vcf where the last sample entry is a copy of another sample (fake outgroup). The script then makes this sample the outgroup by replacing its genotype with that denoted by the AA (ancestral allele) column in the VCF. The AA can be added with the script mafvcf2ancestral.py
parses Rtable w/ significant Tajd and FayWuH w/ a list of intersecting contigs see popgenome.R for details
parses SweeD outfile
parses fasta for input into SweeD (http://pop-gen.eu/wordpress/software/sweed)
creates sweepfinder file from vcf w/ AA field
Calculates the distance between snps in a vcf. Only calculates the difference between alt homs 1/1 and assumes only 1 sample per vcf. This was used to compare Wb from Jakarta with Wb from PNG. 1) call snps 2) bcftools isec 3) take private JAK only 4) run this script
replaces the bases in the reference fasta with the outgroup base in the maf.vcf
used for plotting PSMC data. requires a headless txt file of RS and TR values which can be created by grepping for these lines after only the n=20 iteration is pulled (psmctrunc in utils) from all the bootstrapped files. Really just a way for me to remember how to use ggplot2 to plot psmc
takes MaCS (markovian approx coal sims, Chen) output (after it is formatted to ms) and distills it into the format output by msHOT-lite (Heng Li in Foreign). This can then be read by ms2psmcfa in psmc/utils
lists all gaps in a single fasta file. I used this to create the negative mask in msmc
R script that reads in VCF and computes all stats for popgenome module in R. Includes a section on jointDH test and MKT
popgenome kept throwing an error about the format of missing data in the input vcf. This script just fixes those positions to a more compliant format
takes a single-sample vcf and a reference fasta file and creates a diploid consensus w/ IUPAC characters.
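A minimal sketch of building a diploid consensus for one contig, not the repository script. The function name is hypothetical. Assumptions: the sample is in the 10th VCF column, GT is the first FORMAT field, and indels, MNPs and missing calls are skipped so the reference base is kept.

```python
# IUPAC ambiguity codes for the six heterozygous base pairs.
IUPAC = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def diploid_consensus(ref_seq, vcf_lines, contig):
    """Return ref_seq with SNP calls for contig applied; heterozygous
    sites are written as IUPAC ambiguity codes."""
    seq = list(ref_seq.upper())
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 10 or f[0] != contig:
            continue
        alleles = [f[3].upper()] + f[4].upper().split(",")
        gt = f[9].split(":")[0].replace("|", "/").split("/")
        if "." in gt or any(len(alleles[int(i)]) != 1 for i in gt):
            continue  # missing genotype, or an indel/MNP allele
        bases = frozenset(alleles[int(i)] for i in gt)
        code = next(iter(bases)) if len(bases) == 1 else IUPAC.get(bases)
        if code:
            seq[int(f[1]) - 1] = code  # VCF positions are 1-based
    return "".join(seq)
```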
takes a single-sample vcf and a reference fasta file and creates a haploid consensus.
This script constructs a table of polymorphism and divergence from alignment data in fasta or nbrf format
inverts a matrix
vcf2genepop.pl (https://github.com/z0on/2bRAD_denovo) ~Mikhail V Matz
Converts VCF to multiallelic GENEPOP, preserves chromosome and position info
LDnull.py (requires covld (https://github.com/alanrogers/covld) )
a very crappy script that interfaces with covld to calculate background levels of LD from a 012 vcftools output
counts derived SNPs from a vcf with 'AA'
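A minimal sketch of counting derived alleles per sample. What exactly the repository script counts is not stated here, so the per-sample tally of derived allele copies is an assumption, as is the function name. The regular expression accepts both `AA=X` and `AA:X` in the INFO column, and multiallelic sites and sites whose AA matches neither REF nor ALT are skipped.

```python
import re

AA_RE = re.compile(r"(?:^|;)AA[=:]([^;]+)")


def count_derived(vcf_lines):
    """Return a list with the number of derived allele copies per sample."""
    totals = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        f = line.rstrip("\n").split("\t")
        ref, alt = f[3].upper(), f[4].upper()
        match = AA_RE.search(f[7])
        aa = match.group(1).upper() if match else ""
        if "," in alt or aa not in (ref, alt):
            continue  # multiallelic, or ancestral state unknown
        derived = "1" if aa == ref else "0"  # GT index of the derived allele
        samples = f[9:]
        if not totals:
            totals = [0] * len(samples)
        for i, sample in enumerate(samples):
            gt = sample.split(":")[0].replace("|", "/").split("/")
            totals[i] += gt.count(derived)
    return totals
```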