Find all gaps for a Roche 454 assembly or mapping project. Eventually, it will accept BAM/SAM files too.
See install_requires in setup.py for the Python packages that are required: https://github.com/VDBWRAIR/NGSCoverage/blob/master/setup.py#L33
aligncoverage uses the 454AlignmentInfo.tsv file to build a CoverageRegion for each contig listed in the file. A CoverageRegion's type is determined from the depth at each base and is not easily configurable at this time.
If the project being analyzed is an assembly, the AlignDepth column is used; otherwise the TotalDepth column is used. Refer to the Roche documentation for all of the available fields.
Because the 454AlignmentInfo.tsv file can contain multiple contigs that correspond to the same mapped reference (in a mapping project), aligncoverage merges the regions produced from all SeqAlignments for each reference.
It then outputs those regions in human-readable or csv format.
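The depth-based classification described above can be sketched roughly as follows. This is only an illustration, not the aligncoverage implementation: the region type names, the `LOW_COVERAGE_MAX` threshold and both function names are hypothetical, since the real cutoffs are not documented here.

```python
from itertools import groupby

# Hypothetical threshold; the real cutoff used by aligncoverage is not documented here.
LOW_COVERAGE_MAX = 10


def depth_column(is_assembly):
    """Pick the depth column as described above: AlignDepth for assemblies, TotalDepth otherwise."""
    return "AlignDepth" if is_assembly else "TotalDepth"


def classify(depth):
    """Map a single base depth to a (hypothetical) region type name."""
    if depth == 0:
        return "Gap"
    if depth < LOW_COVERAGE_MAX:
        return "LowCoverage"
    return "Normal"


def regions_from_depths(depths):
    """Collapse per-base depths into (type, start, end) regions, 1-based and inclusive."""
    regions = []
    position = 1
    for region_type, run in groupby(depths, key=classify):
        length = len(list(run))
        regions.append((region_type, position, position + length - 1))
        position += length
    return regions
```

Consecutive bases that share a type collapse into a single region, which is the kind of output that a merge across contigs would then combine per reference.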
#> aligncoverage -h
usage: aligncoverage [-h] -d DIR [-c]
optional arguments:
-h, --help show this help message and exit
-d DIR, --dir DIR 454 Project directory path to find gaps in
-c, --csv Output in csv format
Usage is straightforward. You have to provide a Roche 454 project directory using the -d or --dir option. If you want csv output, also specify --csv or -c. The csv output is what gapstoscatter reads.
Note: All output is written to the terminal (standard output), so you will likely want to redirect it to a file using the Unix redirect operator (>).
gapstoscatter reads any csv file generated by aligncoverage and produces a graphic that displays the gaps and low-coverage regions visually. If you give it a primer file, it also maps the primers onto the graphic for visual comparison.
These csv gap files have the form of
<samplename or reference>|<reference length>,<GapType>,<start>,<end>...
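If you want to read these files yourself, a minimal parsing sketch might look like the following. It assumes that the trailing "..." means the GapType, start, end triple repeats for each region on the line; `parse_gap_line` is a hypothetical helper and not part of this package.

```python
def parse_gap_line(line):
    """Parse one csv gap line into (name, reference_length, regions).

    Assumes the layout name|length,GapType,start,end[,GapType,start,end...].
    """
    fields = line.strip().split(",")
    # Split on the last "|" so that names containing "|" are kept whole.
    name, _, length = fields[0].rpartition("|")
    rest = fields[1:]
    regions = [
        (rest[i], int(rest[i + 1]), int(rest[i + 2]))
        for i in range(0, len(rest) - 2, 3)
    ]
    return name, int(length), regions
```

For a line such as `Sample1|100,Gap,1,10` (made-up values), this yields the name, the reference length as an integer, and a list with one region tuple.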
#> gapstoscatter -h
usage: gapstoscatter [-h] --csv CSVFILE [-o OUTPUTFILE] [-t TITLE]
optional arguments:
-h, --help show this help message and exit
--csv CSVFILE CSV Gaps file to parse
-o OUTPUTFILE Filepath to put output image[Default: ./gaps.png]
-t TITLE, --title TITLE
Title for the scatterplot
Generate Graphic without Primer
#> aligncoverage -d /some/454/project/dir --csv > gaps.csv
#> gapstoscatter --csv gaps.csv
Generate Graphic with Primer
#> aligncoverage -d /some/454/project/dir --csv > gaps.csv
#> gapstoscatter --csv gaps.csv -p /path/to/primer
Notes:
- You can also change the title of the graphic by utilizing the -t option. If you don't specify the -t it will use the name of the file from the --csv option
- If you do not specify the -o option it will use gaps.png as the output
Some quick notes about the structure expected in the primer file: to determine where each primer begins and ends and which direction it runs, each primer identifier needs to be specifically formatted.
Basically, you need to ensure that the sequence id field looks similar to this: R followed by a number -- or -- F followed by a number.
R represents a reverse primer and F represents a forward primer. The number is the start position of the primer if it is forward, or the end position if it is reverse.
The code that parses this is located here: https://github.com/VDBWRAIR/pyWrairLib/blob/master/wrairlib/primer.py#L358
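To make the naming convention concrete, here is a rough sketch of how such an identifier could be interpreted. It is not the parser from pyWrairLib linked above; the regular expression, the `parse_primer_id` name and the example identifiers are assumptions for illustration only.

```python
import re

# Hypothetical pattern: an F or R immediately followed by digits, anywhere in the id.
PRIMER_ID = re.compile(r"(?P<direction>[FR])(?P<position>\d+)")


def parse_primer_id(seq_id):
    """Return (direction, which_end, position) for a primer sequence id."""
    match = PRIMER_ID.search(seq_id)
    if match is None:
        raise ValueError("no F<number> or R<number> found in %r" % seq_id)
    position = int(match.group("position"))
    if match.group("direction") == "F":
        # For a forward primer the number is its start position.
        return "forward", "start", position
    # For a reverse primer the number is its end position.
    return "reverse", "end", position
```

An identifier without an F or R followed by a number raises ValueError in this sketch, which makes badly named primers easy to spot before plotting.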
gapsformids.py is useful if you want to run a batch of existing newbler project folders through the aligncoverage script. It lets you generate graphics for many projects at once so you can see how good your coverage has been.
The index file is simply a text file containing newbler project directories, listed one per line. The script has multiprocessor support to speed it up.
usage: gapsformids.py [-h] [-i INDEX] [--cpus CPUS] [-o OUTPUTDIR]
optional arguments:
-h, --help show this help message and exit
-i INDEX, --index INDEX
Index to use. Defaults to standard
input
--cpus CPUS Number of cpus to use[Default: 1]
-o OUTPUTDIR, --output OUTPUTDIR
Output directory[Default: Current directory]
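Since the index is just one project directory per line, it can be produced with a few lines of Python. The sketch below is only one possible way to do it; `write_index` is a hypothetical helper, and choosing which directories to list is left to you.

```python
def write_index(project_dirs, index_path):
    """Write one newbler project directory per line to index_path."""
    with open(index_path, "w") as handle:
        for project_dir in project_dirs:
            handle.write(project_dir + "\n")
```

The resulting file can then be passed to the script with the -i option, or piped in on standard input.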
Because this script generates comma-separated value files, you can open them with Excel or OpenOffice.org and select comma as the delimiter.
Hint: You can open the file from the command line in Linux by using openoffice.org -calc gaps.csv
refcoverage lets you map a primer fasta file against a reference fasta file. It draws a transparent horizontal line for every reference sequence found in the reference fasta file. The length of the line is the ENTIRE sequence length, and the line is annotated at the end with the exact length. It then gathers all of the primer regions from the primer fasta file and draws a line on top of each reference for every primer that belongs to that reference, with arrows indicating reverse or forward.
usage: refcoverage [-h] --reference REFFILE --primer PRIMERFILE
[--pattern REFPATTERN] [--title TITLE] [-o OUTPUTFILE]
[--debug {DEBUG,INFO,WARNING}]
optional arguments:
-h, --help show this help message and exit
--reference REFFILE Reference file
--primer PRIMERFILE Primer file
--pattern REFPATTERN Regex to match genes in reference id lines. Has to
have a named pattern called gene somewhere in it.
[Default: (?P<name>(?P<accession>.*?)_(?P<gene>.*?)_(?
P<strain>.*))]
--title TITLE Title that will appear at top of image
-o OUTPUTFILE Output file name(should end in .png)
--debug {DEBUG,INFO,WARNING}
Output level of logger
The --pattern option is a bit clunky, but it is necessary so that you can customize how the script identifies which gene segment each reference identifier belongs to. The pattern must contain a named group called gene, such as (?P<gene>...). This named group is used to determine which primers match which references, so make sure that your primer identifier lines contain exactly the same gene name as your reference identifier lines.
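You can try a pattern in plain Python before handing it to --pattern. The sketch below applies the default pattern from the help text to a made-up identifier; `gene_of` is a hypothetical helper, and whether refcoverage uses re.search or re.match internally is an assumption here.

```python
import re

# Default pattern as shown in the refcoverage help text.
DEFAULT_PATTERN = r"(?P<name>(?P<accession>.*?)_(?P<gene>.*?)_(?P<strain>.*))"


def gene_of(identifier, pattern=DEFAULT_PATTERN):
    """Return the text captured by the named group 'gene', or None if there is no match."""
    match = re.search(pattern, identifier)
    return match.group("gene") if match else None


# Made-up identifier in accession_gene_strain form.
print(gene_of("ACC123_PB2_StrainA"))
```

If your reference identifiers use a different layout, pass your own pattern as the second argument and check that the gene it returns is the same string that appears in your primer identifiers.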