
DepthCharge: Genome assembly quality control and misassembly repair

DepthCharge v0.2.0

For a better rendering and navigation of this document, please download and open ./docs/depthcharge.docs.html, or visit https://slimsuite.github.io/depthcharge/. Documentation can also be generated by running DepthCharge with the dochtml=T option. (R and pandoc must be installed - see below.)

Introduction

DepthCharge is an assembly quality control and misassembly repair program. It uses mapped long read depth of coverage to charge through a genome assembly and identify coverage "cliffs" that may indicate a misassembly. If appropriate, it will then blast the assembly into fragments at those misassemblies.

DepthCharge uses a genome assembly and a PAF file of mapped reads as input. If no PAF file is provided, minimap2 will be used to generate one.
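To make the PAF input concrete, the following is a minimal sketch of reading the twelve mandatory PAF columns and collecting the target (assembly) intervals covered by each read. It is not DepthCharge's own parser; the names `PafHit`, `parse_paf_line` and `target_intervals` are hypothetical and used only for illustration.

```python
from collections import namedtuple

# The 12 mandatory PAF columns, in order. Field names are illustrative.
PafHit = namedtuple(
    "PafHit",
    "qname qlen qstart qend strand tname tlen tstart tend matches alnlen mapq",
)

TEXT_FIELDS = ("qname", "strand", "tname")


def parse_paf_line(line):
    """Parse one PAF line into a PafHit, ignoring optional tag columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("PAF lines need at least 12 tab-separated columns")
    values = []
    for name, raw in zip(PafHit._fields, fields[:12]):
        values.append(raw if name in TEXT_FIELDS else int(raw))
    return PafHit(*values)


def target_intervals(lines):
    """Group (tstart, tend) intervals by assembly sequence name.

    PAF coordinates are 0-based with an exclusive end.
    """
    intervals = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        hit = parse_paf_line(line)
        intervals.setdefault(hit.tname, []).append((hit.tstart, hit.tend))
    return intervals
```

The per-sequence interval lists returned by `target_intervals` are the only part of the PAF file a depth-of-coverage scan needs.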

For each sequence, DepthCharge starts at the beginning of the sequence and scans the mapped reads in the PAF file for positions where coverage drops below the mindepth=INT threshold (default = 1 read). These positions are marked as "bad" and compressed into regions of adjacent bad positions. Regions at the start or end of a sequence are labelled "end". Regions overlapping gaps are labelled "gap". Otherwise, regions are labelled "bad". All regions are output to *.depthcharge.tdt along with the length of each sequence (region type "all").

Future versions will either fragment the assembly at "bad" regions (and at "gap" regions if breakgaps=T) or, if breakmode=gap, replace bad regions with a gap (NNNN...) of length gapsize=INT. If breakmode=report, no additional processing of the assembly will be performed. Otherwise, the processed assembly will be saved as *.depthcharge.fasta.
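The region-calling logic described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the DepthCharge code itself (which relies on grep and awk): the function name `low_depth_regions` is hypothetical, coordinates are assumed to be 0-based with an exclusive end, and the "end" label is assumed to take priority over "gap".

```python
def low_depth_regions(seqlen, intervals, gaps=(), mindepth=1):
    """Return (start, end, type) regions where read depth < mindepth.

    intervals: (start, end) read alignments on this sequence.
    gaps: (start, end) runs of N in this sequence.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (seqlen + 1)
    for start, end in intervals:
        delta[min(seqlen, max(0, start))] += 1
        delta[min(seqlen, max(0, end))] -= 1

    # Walk the sequence, merging adjacent low-depth positions into regions.
    regions = []
    depth = 0
    region_start = None
    for pos in range(seqlen + 1):
        if pos < seqlen:
            depth += delta[pos]
        low = pos < seqlen and depth < mindepth
        if low and region_start is None:
            region_start = pos
        elif not low and region_start is not None:
            regions.append((region_start, pos))
            region_start = None

    # Label each region as "end", "gap" or "bad".
    labelled = []
    for start, end in regions:
        if start == 0 or end == seqlen:
            kind = "end"
        elif any(start < gend and gstart < end for gstart, gend in gaps):
            kind = "gap"
        else:
            kind = "bad"
        labelled.append((start, end, kind))
    return labelled
```

For example, a 20 bp sequence covered by reads at (2, 8) and (10, 18) would yield two "end" regions and one internal region, which is "bad" unless it overlaps a supplied gap.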
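As a rough illustration of how the three breakmode settings differ, the sketch below applies them to a single sequence. It is a hypothetical helper (`apply_breakmode` is not a DepthCharge function), and it assumes that the low-coverage region itself is dropped when fragmenting; the actual behaviour of future versions may differ.

```python
def apply_breakmode(seq, regions, breakmode="fragment", breakgaps=False, gapsize=100):
    """Return the list of output sequences for one input sequence.

    regions: (start, end, type) tuples, 0-based with an exclusive end.
    """
    if breakmode == "report":
        return [seq]  # regions are reported only; assembly is untouched
    if breakmode == "gap":
        targets = [r for r in regions if r[2] == "bad"]
    elif breakmode == "fragment":
        wanted = ("bad", "gap") if breakgaps else ("bad",)
        targets = [r for r in regions if r[2] in wanted]
    else:
        raise ValueError("breakmode must be report, gap or fragment")

    # Cut the sequence around each target region (assumption: region is dropped).
    pieces = []
    prev = 0
    for start, end, _kind in sorted(targets):
        pieces.append(seq[prev:start])
        prev = end
    pieces.append(seq[prev:])

    if breakmode == "gap":
        # Rejoin the pieces with a fixed-length run of N.
        return [("N" * gapsize).join(pieces)]
    return [p for p in pieces if p]
```

With breakmode=gap the sequence count is unchanged and only the bad region is replaced, whereas breakmode=fragment returns one sequence per retained piece.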


Running DepthCharge

DepthCharge is written in Python 2.x and can be run directly from the commandline:

python $CODEPATH/depthcharge.py [OPTIONS]

If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone DepthCharge git repo, $CODEPATH will be the path to the code/ directory. Please see details in the DepthCharge git repo for running on example data.
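When launching DepthCharge from another script, the option=value command line can be assembled programmatically. The helper below is a minimal sketch; `depthcharge_command` is a hypothetical name, and the file names in the usage note are placeholders rather than files shipped with the repo.

```python
import os


def depthcharge_command(codepath, **options):
    """Build the argument list for a DepthCharge run from option=value pairs."""
    script = os.path.join(codepath, "depthcharge.py")
    cmd = ["python", script]
    for key in sorted(options):
        value = options[key]
        if isinstance(value, bool):
            value = "T" if value else "F"  # DepthCharge uses T/F for booleans
        cmd.append("{0}={1}".format(key, value))
    return cmd
```

For instance, `depthcharge_command(codepath, seqin="assembly.fasta", breakmode="report")` returns a list that could be passed to `subprocess.call`.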

Dependencies

DepthCharge uses grep and awk. To generate documentation with dochtml, R will need to be installed and a pandoc environment variable must be set, e.g.

export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc

If a PAF file is not provided, minimap2 must be installed and either added to the environment $PATH or given with the minimap2=PROG setting.
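The sketch below shows one way to check that minimap2 is available and to turn a mapopt-style dictionary string into a minimap2 command line. How DepthCharge itself builds this command, and how readtype interacts with the x: preset, is not documented here, so the mapping from keys to flags is an assumption; `find_minimap2`, `parse_cdict` and `minimap2_command` are hypothetical names.

```python
import shutil


def find_minimap2(prog="minimap2"):
    """Return the resolved path to minimap2, or None if it cannot be found."""
    return shutil.which(prog)


def parse_cdict(text):
    """Turn 'N:100,p:0.0001,x:asm5' into a list of (key, value) pairs."""
    pairs = []
    for item in text.split(","):
        if item:
            key, _, value = item.partition(":")
            pairs.append((key.strip(), value.strip()))
    return pairs


def minimap2_command(assembly, reads, paf, minimap2="minimap2",
                     mapopt="N:100,p:0.0001,x:asm5"):
    """Build a minimap2 argument list that writes PAF output to `paf`.

    Assumption: single-letter keys map to short flags (-N, -p, -x) and
    longer keys map to long flags (--key).
    """
    cmd = [minimap2]
    for key, value in parse_cdict(mapopt):
        cmd.append(("-" if len(key) == 1 else "--") + key)
        if value:
            cmd.append(value)
    cmd += ["-o", paf, assembly]
    cmd += list(reads)
    return cmd
```

Calling `find_minimap2()` before building the command gives an early, readable failure when the program is neither on $PATH nor supplied with minimap2=PROG.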

For full documentation of the DepthCharge workflow, run with dochtml=T and read the *.docs.html file generated.

Commandline options

### ~ Main DepthCharge run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE      : Input sequence assembly [None]
basefile=FILE   : Root of output file names [$SEQIN basefile]
paf=FILE        : PAF file of long reads mapped onto assembly [$BASEFILE.paf]
breakmode=X     : How to treat misassemblies (report/gap/fragment) [fragment]
breakgaps=T/F   : Whether to break at gaps where coverage drops if breakmode=fragment [False]
gapsize=INT     : Size of gaps to insert when breakmode=gap [100]
mindepth=INT    : Minimum depth to class as OK [1]
### ~ PAF file generation options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
reads=FILELIST  : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST   : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
minimap2=PROG   : Full path to run minimap2 [minimap2]
mapopt=CDICT    : Dictionary of minimap2 options [N:100,p:0.0001,x:asm5]
### ~ Additional options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
dochtml=T/F     : Generate HTML DepthCharge documentation (*.docs.html) instead of main run [False]
logfork=T/F     : Whether to log forking in main log [False]
tmpdir=PATH     : Path for temporary output files during forking (not all modes) [./tmpdir/]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

© 2021 Richard Edwards | richard.edwards@unsw.edu.au
