Diploidocus v1.5.1
For a better rendering and navigation of this document, please download and open ./docs/diploidocus.docs.html
, or visit https://slimsuite.github.io/diploidocus/.
Documentation can also be generated by running Diploidocus with the dochtml=T
option. (R and pandoc must be installed - see below.)
In addition, Diploidocus will use mapped long reads and BUSCO single copy read depths for genome size prediction
(gensize
), and coverage (regcheck
) or copy number estimation (regcnv
) for user-defined regions. Diploidocus
also has functions for removing redundancy (sortnr
), generating a non-redundant pseudo-diploid assembly with primary
and secondary scaffolds from 10x pseudohap output (diphap
), and creating an in silico diploid set of long reads from two
haploid parents (for testing phasing etc.) (insilico
).
Please note that Diploidocus is still in development and documentation is currently a bit sparse.
The different run modes are set using runmode=X
:
diploidocus
default run mode will rungensize
,telomeres
,vecscreen
,deptrim
andpurgehap
analysisdipcycle
runs iterative cycles ofdiploidocus
mode until convergence (no more filtering) is reachedgensize
uses BUSCO results, a BAM file and read file(s) to predict the genome size of the organismpurgehap
filters scaffolds based on post-processing of purge_haplotigstelomeres
performs a regex telomere search based on method of https://github.com/JanaSperschneider/FindTelomeresvecscreen
searches for contaminants and flags/masks/removes identified scaffoldsdeptrim
trims sequence termini of at leastmintrim=INT
bp with less thandeptrim=INT
read depthregcheck
checks reads spanning given regions and also calculates mean depth and estimated copy number (if regcnv=T)regcnv
calculates mean depth and estimated copy number for regcheck regions from BAM file (no read spanning analysis)gapspan
is a specialisedregcheck
analysis that checks for reads spanning genome assembly gapsgapass
extends thegapspan
mode to extract the reads spanning a gap and re-assemble them usingflye
gapfill
extends thegapass
mode to map re-assembled gaps back onto the assembly and replace gaps spanned by the new contigs *sortnr
performs an all-by-all mapping with minimap2 and then removes redundancydiphap
splits a pseudodiploid assembly into primary and alternative scaffoldsdiphapnr
runssortnr
followed bydiphap
insilico
generates balanced diploid combined reads from two sequenced haploid parentssummarise
just runs the seqin summarise code and then stops.
See https://slimsuite.github.io/diploidocus/ for details of each mode. General SLiMSuite run documentation can be found at https://github.com/slimsuite/SLiMSuite.
Diploidocus is available as part of SLiMSuite, or via a standalone GitHub repo at https://github.com/slimsuite/diploidocus.
The main Diploidocus tidy mode has been published as part of the Waratah genome paper:
Chen SH, Rossetto M, van der Merwe M, Lu-Irving P, Yap JS, Sauquet H, Bourke G, Amos TG, Bragg JG & Edwards RJ (2022). Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. Molecular Ecology Resources doi: 10.1111/1755-0998.13574
Note that the genome size prediction and copy number estimation modes are now available through DepthSizer and DepthKopy, which should cite the same article. Please contact the author if you have trouble getting the full text version, or read the bioRxiv preprint version:
Chromosome-level de novo genome assembly of Telopea speciosissima (New South Wales waratah) using long-reads, linked-reads and Hi-C. bioRxiv 2021.06.02.444084; doi: 10.1101/2021.06.02.444084.
If using the simplified Nala version of the tidy algorithm, please cite: Edwards RJ et al. (2021), BMC Genomics [PMID: 33726677].
If using the 10x genomics non-redundancy pipeline, cite the Starling genome paper:
Stuart KC*, Edwards RJ*, Cheng Y, Warren WC, Burt DW, Sherwin WB, Hofmeister NR, Werner SJ, Ball GF, Bateson M, Brandley MC, Buchanan KL, Cassey P, Clayton DF, De Meyer T, Meddle SL & Rollins LA (2022): Transcript- and annotation-guided genome assembly of the European starling. Molecular Ecology 22(8):3141-3160. doi: 10.1111/1755-0998.13679) [*Joint first authors]
Diploidocus is written in Python 2.x and can be run directly from the commandline:
python $CODEPATH/diploidocus.py [OPTIONS]
If running as part of SLiMSuite, $CODEPATH
will be the SLiMSuite tools/
directory. If running from the standalone Diploidocus git repo, $CODEPATH
will be the path the to code/
directory. Please see details in the Diploidocus git repo
for running on example data.
For full functionality, Diploidocus needs a number of additional programs installed on the system:
- Python 2.x and Python 3.x
- KAT
- BedTools
- R
- purge_haplotigs
- bbmap
- BLAST+
- Minimap2 (added to environment
$PATH
or given with theminimap2=PROG
setting) - Samtools
To generate documentation with dochtml
, R will need to be installed and a pandoc environment variable must be set, e.g.
export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc
For Diploidocus documentation, run with dochtml=T
and read the *.docs.html
file generated.
A list of commandline options can be generated at run-time using the -h
or help
flags. Please see the general
SLiMSuite documentation for details of how to
use commandline options, including setting default values with INI files.
### ~ Main Diploidocus run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE : Input sequence assembly [None]
runmode=X : Diploidocus run mode (diploidocus/dipcycle/gensize/purgehap/telomeres/vecscreen/regcheck/regcnv/deptrim/sortnr/diphap/diphapnr/insilico) [diploidocus]
basefile=FILE : Root of output file names [diploidocus or $SEQIN basefile]
summarise=T/F : Whether to generate and output summary statistics sequence data before and after processing [True]
genomesize=INT : Haploid genome size (bp) [0]
scdepth=NUM : Single copy ("diploid") read depth. If zero, will use SC BUSCO mode [0]
bam=FILE : BAM file of long reads mapped onto assembly [$BASEFILE.bam]
bamcsi=T/F : Use CSI indexing for BAM files, not BAI (needed for v long scaffolds) [False]
reads=FILELIST : List of fasta/fastq files containing reads. Wildcard allowed. Can be gzipped. []
readtype=LIST : List of ont/pb/hifi file types matching reads for minimap2 mapping [ont]
dochtml=T/F : Generate HTML Diploidocus documentation (*.docs.html) instead of main run [False]
tmpdir=PATH : Path for temporary output files during forking (not all modes) [./tmpdir/]
### ~ Genome size prediction options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
busco=TSVFILE : BUSCO full table [full_table_$BASEFILE.busco.tsv]
readbp=INT : Total combined read length for depth calculations (over-rides reads=FILELIST) []
quickdepth=T/F : Whether to use samtools depth in place of mpileup (quicker but underestimates?) [False]
depdensity=T/F : Whether to use the BUSCO depth density profile in place of modal depth [True]
mapadjust=T/F : Whether to adjust predicted genome size based on read length:mapping ratio [False]
### ~ DiploidocusHocusPocus and Purge haplotigs options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
kmerreads=FILELIST : File of high quality reads for KAT kmer analysis []
10xtrim=T/F : Whether to trim 16bp 10x barcodes from Read 1 of Kmer Reads data for KAT analysis [False]
minmedian=INT : Minimum median depth coverage to avoid low coverage filter [3]
minlen=INT : Minimum scaffold length to avoid low quality filter [500]
purgehap=X : Purge_haplotigs method (purgehap/diploidocus) [purgehap]
phlow=INT : Low depth cutoff for purge_haplotigs (-l X). Will use SCDepth/4 if zero. [0]
phmid=INT : Middle depth for purge_haplotigs (-m X). Will derive from SCDepth if zero. [0]
phhigh=INT : High depth cutoff for purge_haplotigs (-h X). Will use SCDepth x 2 if zero. [0]
zeroadjust=T/F : Add zero coverage bases to purge_haplotigs LowPerc and adjust total [True]
includegaps=T/F : Whether to include gaps in the zero coverage bases for adjustment (see docs) [False]
mingap=INT : Minimum length of a stretch of N bases to count as a gap for exclusion [10]
purgemode=X : Rules used for purgehap analysis (simple/complex/nala) [complex]
diploidify=T/F : Whether to generate alternative diploid output with duplicated diploid contigs and no hpurge [False]
pretrim=T/F : Run vectrim/vecmask and deptrim trimming prior to diploidocus run [False]
maxcycle=INT : Restrict run to maximum of INT cycles (0=No limit) [0]
### ~ Depth Trim options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
deptrim=INT : Trim termini with <X depth [1]
mintrim=INT : Min length of terminal depth trimming [1000]
### ~ Telomere options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
telofwd=X : Regex for 5' telomere sequence search [C{2,4}T{1,2}A{1,3}]
telorev=X : Regex for 5' telomere sequence search [T{1,3}A{1,2}G{2,4}]
telosize=INT : Size of terminal regions (bp) to scan for telomeric repeats [50]
teloperc=PERC : Percentage of telomeric region matching telomeric repeat to call as telomere [50]
telonull=T/F : Whether to output sequences without telomeres to telomere table [False]
### ~ VecScreen options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
screendb=FILE : File of vectors/contaminants to screen out using blastn and VecScreen rules []
screenmode=X : Action to take following vecscreen searching (report/purge) [report]
minvechit=INT : Minimum length for a screendb match [50]
minidhit=INT : Minimum length for an identical screendb match (based on NCBI filtering) [27]
efdr=NUM : Expected FDR threshold for VecScreen queries (0 is no filter) [1.0]
vecpurge=PERC : Remove any scaffolds with >= PERC % vector coverage [50.0]
vectrim=INT : Trim any vector hits (after any vecpurge) within INT bp of the nearest end of a scaffold [1000]
vecmask=INT : Mask any vector hits of INT bp or greater (after vecpurge and vectrim) [900]
maskmode=X : Whether to perform full (all bases) or partial (every other base) masking of hits [partial]
keepnames=T/F : Whether to keep names unchanged for edited sequences or append 'X' [False]
veccheck=T/F : Check coverage of filtered contaminant hits using reads=FILELIST data [False]
### ~ Region checking options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
regcheck=FILE : File of SeqName, Start, End positions for read coverage checking [None]
checkfields=LIST: Fields in checkpos file to give Locus, Start and End for checking (unless GFF) [Hit,SbjStart,SbjEnd]
checkflanks=LIST: List of lengths flanking check regions that must also be spanned by reads [0,100,1000,5000]
spanid=X : Generate sets of read IDs that span veccheck/regcheck regions, grouped by values of field X []
regcnv=T/F : Whether to calculate mean depth and predicted CNV of regcheck regions based on SCdepth [True]
gfftype=LIST : Optional feature types to use if performing regcheck on GFF file (e.g. gene) ['gene']
mingapspan=INT : Minimum number of reads spanning a gap in order to re-assemble [2]
minlocid=PERC : Minimum percentage identity for aligned chunk to be kept (local %identity) [0]
minloclen=INT : Minimum length for aligned chunk to be kept (local hit length in bp) [500]
### ~ SortNR filtering/output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
checkcov=PERC : Percentage coverage for double-checking partial exact matches [95]
seqout=FILE : Output sequence assembly [$BASEFILE.nr.fasta]
### ~ In silico diploid input/output options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
rqfilter=X : Minimum RQ for output subreads [0]
lenfilter=X : Min read length for filtered subreads [500]
parent1=FOFN : File of file names for subreads fasta files on Parent 1. []
parent2=FOFN : File of file names for subreads fasta files on Parent 2. []
See also SMRTSCAPE `summarise=T` options if `*.unique.tdt`/`*.smrt.tdt` have not been pre-generated with SMRTSCAPE.
### ~ Advanced/Developmental options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
memperthread=INT: Number of Gb per thread to allocate to samtools etc. [6]
useqsub=T/F : Whether to use qsub to queue up system calls [False]
qsubvmem=INT : Memory setting (Gb) when queuing with qsub [126]
qsubwall=INT : Walltime setting (hours) when queuing with qsub [12]
modules=LIST : List of modules that needs to be loaded for running with qsub []
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
Details for the main Diploidocus run modes are given below.
NOTE: Diploidocus is under development and documentation might be a bit sparse. Please contact the author or post an issue on GitHub if you have any questions.
Diploidocus can be automated to cycle through repeated rounds of the main purge_haplotigs/filtering method until
no further scaffolds are removed. Each cycle will run Diploidocus with a numerical suffix, e.g. $BASEFILE.1.*
,
using the *.diploidocus.fasta
output from the previous cycle as input. Cycling will continue until no further
scaffolds are filtered into either *.quarantine.fasta
or *.junk.fasta
.
Output for each cycle will be initially generated in the run directory but then moved to a dipcycle_$BASEFILE
directory upon completion.
Final outputs from the final cycle will then be compiled under the original $BASEFILE
prefix:
$BASEFILE.diploidocus.tdt
= Final ratings for the input scaffolds. This is like the single cycle output with an additionalCycle
field containing the last cycle this scaffold was processed in.$BASEFILE.ratings.tdt
= Reduced final ratings output for the input scaffolds (SeqName
,SeqLen
,ScreenPerc
,Class
,Rating
,Set
,Cycle
).$BASEFILE.diploidocus.fasta
= the scaffolds kept from the final Diploidocus cycle$BASEFILE.repeats.fasta
= the repeat scaffolds excluded from core$BASEFILE.core.fasta
= the same set of scaffolds, minus repeats$BASEFILE.quarantine.fasta
= concatenated purged scaffolds from all Diploidocus cycles.$BASEFILE.junk.fasta
= concatenated low coverage and low quality scaffolds, removed as junk, from all cycles.
NOTE: Contents for these four *.fasta
files are summarised in the main log. Individual purge cycles have their own
log files in the dipcycle_$BASEFILE
directory.
See runmode=diploidocus
documentation for more details.
If Diploidocus is killed before completion, files will not have been moved to the dipcycle_$BASEFILE
directory.
Any cycles that have generated a *.diploidocus.fasta
file will be skipped until an incomplete cycle is found.
In the rare case that Diploidocus was killed whilst generating fasta output, and *.diploidocus.fasta
is thus
incomplete, further cycles will be messed up. In this case, deleted the relevant *.diploidocus.fasta
file
before re-running.
If Diploidocus is run with runmode=dipcycle
and the dipcycle_$BASEFILE
directory is found, it will abort.
Rename or delete the directory to re-run Diploidocus with the same $BASEFILE
setting, or change basefile=X
.
Diploidocus builds on the PurgeHaplotigs classifications to use the depth bins, KAT assembly kmer frequencies and BUSCO results to reclassify scaffolds and partition them into:
*.diploidocus.fasta
= the scaffolds kept for the next round of PurgeHap*.core.fasta
= the same set of scaffolds, minus repeats*.repeats.fasta
= the repeat scaffolds excluded from core*.junk.fasta
= low coverage scaffolds, removed as junk*.quarantine.fasta
= putative haplotigs, removed from the assembly but retained for reference.
Unless summarise=F
, summary statistics for sequence data before and after processing will be included in the run log.
In addition, two key output tables are generated:
*.diploidocus.tdt
= full compiled data and classification. (See field descriptions, below.)*.rating.tdt
= reduced data and classification for easier extraction of sequence names withgrep
etc.
All output files are named with the basefile=X
prefix. This is the prefix of the assembly given with seqin=FILE
if no basefile=X
setting is provided.
Diploidocus needs the following programs installed for full functionality:
bbmap
blast+
kat
minimap2
purge_haplotigs
samtools
The main inputs for Diploidocus rating and filtering are:
seqin=FILE
: Input sequence assembly to tidy [Required].screendb=FILE
: File of vectors/contaminants to screen out using blastn and VecScreen rules [Optional].reads=FILELIST
: List of fasta/fastq files containing long reads. Wildcard allowed. Can be gzipped. For a single run (not cycling), a BAM file can be supplied instead withbam=FILE
. (This will be preferentially used if found, and defaults to$BASEFILE.bam
.) Read types (pb/ont) for each file are set withreadtype=LIST
, which will be cycled if shorter (default=ont
). Optionally, the pre-calculated total read length can be provided withreadbp=INT
and/or the pre-calculated (haploid) genome size can be provided withgenomesize=INT
.busco=TSVFILE
: BUSCO full table [full_table_$BASEFILE.busco.tsv
] used for calculating single copy ("diploid") read depth. This can be over-ridden by settingscdepth=NUM
.kmerreads=FILELIST
: File of high quality (i.e. short or error-corrected) reads for KAT kmer analysis [Optional]
If a BAM file is not provided/found, Diploidocus will use minimap2 to generate a BAM file of reads=FILELIST
data mapped onto the seqin=FILE
assembly. Each read file is mapped separately (--secondary=no -L -ax map-ont
or --secondary=no -L -ax map-pb
) and converted into a sorted BAM files, before merging the BAM files with samtools
and indexing the combined file.
Diploidocus will re-use files where they already exist, providing the downstream files are newer than the upstream files. (If files have been copied and lost their datestamp information, switching ignoredate=T
will re-use files regardless.) Setting force=T
should force regeneration of files even if they exist.
Diploidocus uses output from purge_haplotigs as on of the core components for its classification system. Running purge_haplotigs is automated, using the single copy (diploid) read depth to set the purge_haplotigs thresholds. If scdepth=NUM
is not set, this will be calculated using BUSCO single copy genes (see Genome size prediction [runmode=gensize] for details). The purge_haplotigs
mid cutoff (-m X
) will be set at the halfway point between the diploid depth (SCDepth) and the haploid depth (SCDepth/2). The low depth (-l X
) and high depth (-h X
) cutoffs will then be set to SCDepth/4 and SCDepth*2. This can be over-ridden with:
phlow=INT
: Low depth cutoff for purge_haplotigs (-l X
).phmid=INT
: Middle depth for purge_haplotigs (-m X
).phhigh=INT
: High depth cutoff for purge_haplotigs (-h X
).
Diploidocus adjusts the purge_haplotigs
binning proportions, which do not use the zero coverage bases, adding
the zero coverage bases to the low coverage bin. This adjustment can be with zeroadjust=F
. By default, gaps are
excluded from this adjustment, defined as a run of 10+ N
bases. Setting includegaps=T
will include these
regions in the adjustment, whilst changing the mingap=INT
setting will redefine gaps.
Next, the kat sect
function of KAT is used to calculate kmer frequencies for (a) raw data using the kmerreads=FILELIST
file of high quality (i.e. short or error-corrected) reads, and (b) the assembly itself. This generates four outputs:
*.kat-stats.tsv
= summarykmerreads
kmer coverage per sequence*.kat-counts.cvg
=kmerreads
kmer coverage per base for each input sequence*.selfkat-stats.tsv
= summary assembly kmer coverage per sequence*.selfkat-counts.cvg
= assembly kmer coverage per base for each input sequence
If the kmerreads=FILELIST
files are 10x chromium reads, the first 16bp of read 1 (the 10x barcode) can be trimmed by setting 10xtrim=T
. (This sets --5ptrim 16
in kat for the first read file.)
Read depth statistics for the BAM file are calculated per input sequence using the pileup.sh
program of bbmap
. This generates two files:
*.depth.tdt
= read depth stats per sequence*.depths.stats
= overall read depth summary, generated bypileup.sh
stdout and stderr
The runmode=vecscreen
contaminant screening using vector/contaminants from screendb=FILE
is run on the assembly - see VecScreen mode documentation for details.
The runmode=telomeres
telomere identifcation screen is run on the assembly - see Telomeres mode documentation for details.
The full BUSCO results table (busco=FILE
) is used to generate counts of Complete
, Duplicate
and Fragmented
BUSCO genes for each input sequence. These are converted into a single BUSCO rating of the dominant (i.e. most abundant) BUSCO class during classification.
Once all the different elements have been run, Diploidocus compiles results per input sequence, extracting a subset of fields of interest and renaming fields where appropriate. The following results files are compiled, joined by input sequence name (SeqName
):
*.depth.tdt
: bbmap pileup.sh read mapping statisticsSeqName
= Scaffold sequence nameSeqLen
= Scaffold sequence length (bp)Median_fold
= Median long read sequence depth for sequenceAvg_fold
= Mean long read sequence depth for sequenceCovered_percent
= Percentage of sequence covered by long readsCovered_bases
= Number of bp covered by long readsPlus_reads
= Number of reads mapped to positive strandMinus_reads
= Number of reads mapped to negative strandRead_GC
= GC content of reads mapping to sequence
*.purge.coverage_stats.csv
: purge_haplotigs coverage statsLowPerc
= Proportion of bases with read depth below low coverage cutoffHapPerc
= Proportion of bases with read depth between low coverage and mid coverage cutoffsDipPerc
= Proportion of bases with read depth between mid coverage and high coverage cutoffsHighPerc
= Proportion of bases with read depth above high coverage
*.purge.reassignments.tsv
: purge_haplotigs best sequence hits and reassignmentsTopHit
= scaffold with biggest sequence coverage matchSecHit
= scaffold with second biggest sequence coverage matchTopHitCov
= percentage sequence coverage byTopHit
MaxHitCov
= combined percentage sequence coverage by all sequence hitsPurgeHap
= purge_haplotigs reassignment ratingTopNum
= number of sequences for which this sequence is theTopHit
SecNum
= number of sequences for which this sequence is theSecHit
*.selfkat-stats.tsv
: assembly kmer coverage statsSelfMedK
= Median assembly kmer frequency for sequenceSelfAvgK
= Mean assembly kmer frequency for sequence
*.kat-stats.tsv
: short read kmer coverage statsMedK
= Median short read kmer frequency for sequenceAvgK
= Mean short read kmer frequency for sequenceSeqGC
= GC content for sequenceKPerc
= Percentage of non-gap kmers in sequence that are found in short read data- NOTE: In the absence of short read data, these values will be set to -1
*.screencov.tdt
: Contaminant coverage from vecscreen modeScreenPerc
= Total combined sequence coverage by contaminants
*.telomeres.tdt
: Telomere prediction resultsTel5
= Whether a 5' telomere is predictedTel3
= Whether a 5' telomere is predictedTel5Len
= Length in window chunks of 5' telomereTel3Len
= Length in window chunks of 3' telomereTelPerc
= Percentage of sequence predicted to telomeres. (Crude calculation.)
full_table_*.busco.tsv
Complete
= Number of BUSCO Complete genes in sequenceDuplicated
= Number of BUSCO Duplicated genes in sequenceFragmented
= Number of BUSCO Fragmented genes in sequence
$SEQIN
: input assemblyN_bases
= total count of all unresolved (N
) basesGap_bases
= total count of all gap (N
) bases as determined bymingap=INT
SeqDesc
= sequence description
Diploidocus classification adds the following fields to the compiled table:
Class
= Six-part (PURITY|DEPTH|HOM|TOP|MEDK|BUSCO
) Diploidocus classifcation (see below)TopClass
= Class ofTopHit
(if any)SecClass
= Class ofSecHit
(if any)
Classification is based on a combination of commandline parameters and hard-coded cutoffs:
minmedian=INT
: Minimum median depth coverage to avoid low coverage filter [3]pureperc=80
: Depth class percentage to classify asPURE
rephitcov=250
: MinMaxHitCov
value to classifty asREPEAT
Each sequence is given classification in six different criteria (PURITY|DEPTH|HOM|TOP|MEDK|BUSCO
) with two possible suffixes:
PURITY
: Purity of dominant read depth classPURE
= At least 80% of sequence in that depth binGOOD
= At least 50% of sequence in that depth binWEAK
= Under 50% of sequence in that depth bin
DEPTH
: Dominant read depth classLOWX
= Median read depth (Median_fold
) fails to meetminmedian=INT
criterion (default=3)LOW
=LowPerc
read depth bin has highest percentage coverage (ties assigned to other class)HAP
=HapPerc
read depth bin has highest percentage coverage (non-DIP
ties addigned toHAP
)DIP
=DipPerc
read depth bin has highest percentage coverage (ties assigned toDIP
)EXS
=HighPerc
read depth bin has highest percentage coverage
HOM
: Homology/repeat status based on purge_haplotigs hitsUNIQ
= NoTopHit
PART
= Partial (<50%) coverage ofTopHit
HAPL
= 50%+TopHit
coverage but noSecHit
HOMO
=TopHit
andSecHit
but combinedMaxHitCov
< 250%REPT
=TopHit
andSecHit
and 250%+ combinedMaxHitCov
TOP
: Top/Sec Hit status for sequenceTOP
=TopHit
for at least one other sequenceSEC
= Not aTopHit
butSecHit
at least one other sequenceNON
= Neither aTopHit
norSecHit
for any other sequence
MEDK
: Assembly redundancy based on KAT assembly kmersPRI
= Over 50% unique kmers (SelfMedK
= 1)ALT
= Median kmer frequency of two (SelfMedK
= 2)REP
= Median kmer frequency exceeds two (SelfMedK
> 2)
BUSCO
: Dominant BUSCO class (can help decide about risk)COMP
= 1+ Complete BUSCO genes and more Complete than DuplicatedDUPL
= 1+ Duplicated BUSCO genes and more Duplicated than CompleteFRAG
= 1+ Fragmented BUSCO genes and no Complete or DuplicatedNONE
= No Complete, Duplicated or Fragmented BUSCO genes
+TEL
: If any telomeres are detected,+TEL
is added+VEC
: If any contamination is detected,+VEC
is added
Diploidocus rating based on purgemode=X
adds the following fields to the compiled table:
Rating
= Main Diploidocus rating of sequence (see below)TopRating
= Rating ofTopHit
(if any)SecRating
= Rating ofSecHit
(if any)Set
= Diploidocus output set (keep
/repeat
/quarantine
/junk
)
There are three basic purgemode
implementations:
purgemode=complex
[Default] uses all of the Diploidocus classification data to partition sequences into classes and then partition those classes into the main output sets.purgemode=simple
uses a simplified set of Diploidocus classification rules data to partition sequences into classes before partitioning those classes into the main output sets. This is conceptually easier than thecomplex
mode but unlikely to produce as good results.purgemode=crude
uses a crude set of Diploidocus classification rules data to partition sequences into classes before partitioning those classes into the main output sets. This is more complicated thansimple
but still conceptually easier than thecomplex
mode but unlikely to produce as good results.purgemode=nala
uses a simplified and conservative set of filtering rules used for the Nala German Shepherd Dog assembly.
Details for these modes can be found below.
Ratings are based on a combination of commandline parameters and hard-coded cutoffs:
minlen=INT
: Minimum scaffold length to avoid low quality filter [500]mindipcov=20
: Minimim DipPerc coverage to assign a "KEEP" rating (purgemode=crude
)artefactcov=80
: Min High/Low coverage to assign as artefacts (purgemode=crude
,purgemode=nala
)hpurgecov=80
: Min TopHitCov to assignHPURGE
rating.covwarning=95
: Read coverage threshold for warnings
Diploidocus performs a hierarchical rating of scaffolds, based on their classifications and compiled data:
- Initial low quality filter:
CONTAMINATION
= 50%+ identified contamination (ScreenPerc
)LOWCOV
= Poor median read coverage (Median_fold < minmedian=INT
)LOWQUAL
= Scaffolds below the sequence length set byminlen=INT
LOWCOV
= Low read depth and 50%+ sequence without short read kmers (PURE|LOW|...
orGOOD|LOW|...
, andMedK=0
)LOWCOV
= Low read depth and high frequency assembly kmers (...|LOW|...|REP|...
)
- High quality scaffolds to keep:
QUALITY
= Highest quality scaffolds: pure duploid, complete BUSCOs, no Duplicated BUSCOs (PURE|DIP|UNIQ|...|PRI|COMP
&Duplicated
= 0)FINAL
= As Quality butPRI|FRAG
andPRI|NONE
also allowed (PURE|DIP|UNIQ|...|PRI|...
&Duplicated
= 0)CORE
= Predominantly Diploid with <50% covered byTopHit
andSelfMedK
=1 ([PURE/GOOD]|DIP|[UNIQ/PART]|...|PRI|...
)COREHAP
= Predominantly haploid but <50% covered byTopHit
and 1+ Complete BUSCOs (...|HAP|[UNIQ/PART]|...
&Complete
> 0)PRIMARY
= Putative primary scaffold but with possible alternative scaffolds still in assembly and/or low quality regions (...|DIP|[UNIQ/PART/HAPL/HOMO]|...|[PRI/ALT]|...
)
- Repetitive scaffolds to keep:
PRIRPT
= Putative primary scaffold but >50% repeated (...|DIP|[UNIQ/PART]|...|REP|...
or[PURE/GOOD]|DIP|...|REP|...
)COLLAPSED
= High coverage scaffolds representing putative collapsed repeats. (Check for other contamination.) (...|EXS|[UNIQ/PART]|...
or[GOOD/PURE]|EXS|...|TOP|...
, orWEAK|EXS|...
and less of the sequence in the Diploid coverage bin than below Diploid coverage.REPEAT
= Predominantly Diploid scaffolds that have major signs of redundancy, probably due to presence of alternative contigs (...|DIP|REPT|...
orWEAK|DIP|[HAPL/HOMO]|...|REP|...
)
- Haploid coverage scaffolds to keep (NOTE: any
HAPLOTIG
rating will be converted toHAPRPT
if...|REP|...
)HAPLOID
= Predominantly haploid coverage but enough unique sequence to keep for now? (Option to quarantine?) Might be very heterozygous Alternative haplotigs (...|HAP|[UNIQ/PART]|PRI|...
)HAPLOTIG
= Predominantly haploid coverage but enough unique sequence to keep for now - possible Alternative haplotig (...|HAP|[UNIQ/PART]|ALT|...
or...|HAP|...|TOP|[PRI/ALT]|...
)HAPRPT
= Low quality scaffold that is probably most repeats, but not bad enough to dump outright (...|HAP|[UNIQ/PART]|...|REP|...
or...|HAP|...|TOP|...|REP|...
)HAPLOTIG
= Low/haploid coverage scaffold with insufficient coverage of a Diploid scaffold to purge (...|[LOW/HAP]|[HAPL/HOMO/REPT]|...
and|DIP|
inTopHit
rating andTopHitCov
<hpurgecov
(80%))HAPLOTIG
= Low/haploid coverage scaffold with insufficient coverage of another scaffold to purge ([PURE/GOOD]|HAP|HOMO|...' and
TopHitCov<
hpurgecov` (80%))HAPRPT
= Other haploid coverage scaffold with insufficient coverage of another scaffold to purge (...|HAP|...
andTopHitCov
<hpurgecov
(80%))HAPRPT
= Low coverage and collapsed repeats with insufficient coverage of another scaffold to purge (...|EXS|...
and less of the sequence in the Diploid coverage bin than below Diploid coverage andTopHitCov
<hpurgecov
(80%))REPEAT
= High coverage scaffold with insufficient coverage of another scaffold to purge ([GOOD/PURE]|EXS|[HAPL/HOMO/REPT]|[SEC/NON]|...
andTopHitCov
<hpurgecov
(80%))
- Scaffolds to quarantine for removal
RPURGE
= Messy scaffolds that are largely repeats and are sufficiently redundant/low quality to purge (WEAK|EXS
and less of the sequence in the Diploid coverage bin than below Diploid coverage, or[GOOD/PURE]|EXS|[HAPL/HOMO/REPT]|[SEC/NON]
, andTopHitCov
>=hpurgecov
(80%))
- Additional low coverage scaffolds
LOWCOV
= Low depth of coverage and strong matches to other scaffolds, and not TopHit (...|LOW|[HAPL/HOMO/REPT]|[SEC/NON]
)
- Further scaffolds to quarantine for removal
HPURGE
= Clear candidate haplotig to purge (...|LOW|[HAPL/HOMO/REPT]|...
and|DIP|
inTopHit
rating, or...|HAP|...
, andTopHitCov
>=hpurgecov
(80%))
- Final low quality filter
LOWQUAL
= Unconvincing scaffolds that do not fall into a clear class but are not bad enough to dump outright as junk (...|LOW|...
)
Any scaffolds that escape the above rules will be UNRATED
. (There should not be any!)
The simplified rating system is based much more on the dominant read depth class. A subset core scaffolds are identified and the rest are classified on the basis of dominant read depth and REP
or REPT
status.
LOWCOV
= Poor median read coverage (Median_fold < minmedian=INT
)CONTAMINATION
= 50%+ identified contamination (ScreenPerc
)LOWQUAL
= Scaffolds below the sequence length set byminlen=INT
LOWQUAL
= Low quality artefacts with less than 50% of the scaffold covered by short-read kmers (MedK=0
)- Keep Scaffolds with Diploid+ depth (
[DIP/EXS]
), or limited homology to other scaffolds ([UNIQ/PART]
) or Top hits for other scaffolds (TOP
) or 1+Complete
BUSCO genes. These are broken down into:PRIMARY
= Diploid depth or noTopHit
or dominant BUSCO rating isComplete
, and not repetitivePRIRPT
= AsPRIMARY
butREPT
orREP
classificationsCOLLAPSED
= Dominant high depth (EXS
)REPEAT
= Remaining Scaffolds to keep that haveREPT
orREP
classificationsHAPLOID
= Remaining Scaffolds to keep withoutREPT
orREP
classifications
HPURGE
= Dominant haploid read depth withoutREPT
orREP
classificationsRPURGE
= Dominant haploid read depth, plusREPT
orREP
classificationsLOWCOV
= Any remaining scaffolds (dominantLOW
read depth)
The crude
purge mode is a less nuanced version of the main Diploidocus rating.
LOWCOV
= Poor median read coverage (Median_fold < minmedian=INT
)CONTAMINATION
= 50%+ identified contamination (ScreenPerc
)LOWQUAL
= Scaffolds below the sequence length set byminlen=INT
HPURGE
= Low+Haploid coverage meetartefactcov
cutoff (>=80%) andTopHitCov
meetshpurgecov
cutoff (>=80%)HAPLOTIG
= Haploid coverage exceeds 50%,TopHitCov
meetshpurgecov
cutoff (>=80%), andSelfMedK=2
- Scaffolds with a Diploid depth coverage otherwise meeting the
mindipcov
cutoff (>=20%) are kept:PRIMARY
= Median assembly kmer frequency of 1REPEAT
=MaxHitCov
meets therephitcov
cutoff (>=250%)KEEP
= Remaining diploid scaffolds.
LOWCOV
= Any other scaffolds with 80%+ low coverage bases are filtered as Low Coverage.COLLAPSED
= Any other scaffolds with 80%+ high coverage bases are flagged as COLLAPSED.HAPLOTIG
= Scaffolds that are 50%+ Haploid Depth if mostly present 2+ times (median assembly kmer frequency = 2)HAPLOID
= Scaffolds that are 50%+ Haploid Depth if mostly present once (median assembly kmer frequency = 1)ARTEFACT
= Scaffolds <50% Haploid Depth but 80%+ Low or Haploif depthHAPLOTIG
= Scaffolds with 80%+ hitting other scaffolds but does not meet therephitcov
cutoff (<250%)HAPRPT
= Scaffolds with 80%+ hitting other scaffolds and meets therephitcov
cutoff (>=250%)ARTEFACT
= Remaining scaffolds with 80%+ bases in the low/haploid coverage bins
The purgemode=nala
rating scheme was used for the Nala German Shepherd Dog genome assembly, and features a simplified set of ratings:
CONTAMINATION
= 50%+ identified contamination (ScreenPerc
)LOWCOV
= Poor median read coverage (Median_fold < minmedian=INT
)LOWQUAL
= Scaffolds below the sequence length set byminlen=INT
HPURGE
= Any scaffold with 80%+ bases in the low/haploid coverage bins (haplotigs or assembly artefacts).PRIMARY
= Scaffolds with 20%+ diploid coverage are marked as retention as probable diploids.COLLAPSED
= Scaffolds with <20% diploid coverage and 50%+ high coverage are marked as probable collapsed repeats.JUNK
/HAPLOTIG
/KEEP
= Remaining Scaffolds are given the PurgeHaplotigs rating (over 80% low/high coverage will be filtered as a probable artefact)
The purgemode
ratings are then converted into subsets of scaffolds for output:
keep
= Scaffolds that form part of the*.core.fasta
and*.diploidocus.fasta
outputs:- ['QUALITY','FINAL','CORE','COREHAP','PRIMARY','PRIRPT','HAPLOID','HAPLOTIG','KEEP']
repeat
= Scaffolds excluded from the*.core.fasta
but part of the*.diploidocus.fasta
output:- ['COLLAPSED', 'REPEAT', 'HAPRPT']
quarantine
= Scaffolds removed as probably Haplotigs and/or excess copies of repeats but saved to*.quarantine.fasta
in case additional checks are required:- ['HPURGE', 'RPURGE']
junk
= Scaffolds removed due to low coverage, poor quality, or contamination. These are partitioned into*.junk.fasta
:- ['LOWCOV', 'LOWQUAL', 'CONTAMINATION','JUNK']
The final ratings for each TopHit
and SecHit
are also added to the main Diploidocus table as TopHitRating
and SecHitRating
fields. This might flag up additional scaffolds (e.g. with HAPLOTIG
ratings) that the user decides to remove despite failing to meet the automated criteria.
NOTE: A cutdown output of Scaffold classifications and ratings is saved as *.ratings.tdt
. This can be used to easily extract a subset of sequence names that can be filtered from the final output using SeqSuite
, e.g.
python $SLIMSUITE/tools/seqsuite.py seqin=$PREFIX badseq=$(grep HAPLOTIG | awk '{print $1;}' ORS=",") seqout=$PREFIX.filtered.fasta
Once rating is complete, the following warnings are generated in the log file:
- All
CONTAMINATION
scaffolds - Scaffolds not classed as
junk
but:- <50% covered by short read kmers
- any vector contamination flagged
- long read coverage < 95%
- Any
COREHAP/HAPLOID/HAPLOTIG
scaffolds withDuplicated
BUSCO genes (more evidence for redundancy) - Any purged (
junk
orquarantine
) scaffolds that are the PurgeHaplotigsTopHit
for 1+ other scaffolds
Temporary files generated during forking will be generated in a subdirectory set by tmpdir=PATH
(Default: ./tmpdir/
).
Details of full outputs will be added. Please see other documentation sections for specific outputs, or contact the author if anything is unclear.
Diploidocus features a special diploidify=T
mode, which aims to generate fully diploid output. This is achieved by adding HPURGE
scaffolds to the keep
set rather than the quarantine
set. In addition, diploid scaffolds are duplicated. For this purpose, diploid scaffolds are defined as those in the keep
or repeat
sets that meet are:
- Classified as
...|DUP|...
- or Classified as
WEAK|...|NON|...
, i.e. are neither aTopHit
norSecHit
for any other sequence
Diploid scaffolds are saved with an X2
suffix on their sequence name.
NOTE: This mode is now primarily documented and updated through DepthSizer.
The main inputs for Diploidocus genome size prediction are:
seqin=FILE
: Input sequence assembly to tidy [Required].reads=FILELIST
: List of fasta/fastq files containing long reads. Wildcard allowed. Can be gzipped. For a single run (not cycling), a BAM file can be supplied instead withbam=FILE
. (This will be preferentially used if found, and defaults to$BASEFILE.bam
.) Read types (pb/ont/hifi) for each file are set withreadtype=LIST
, which will be cycled if shorter (default=ont
). Optionally, the pre-calculated total read length can be provided withreadbp=INT
and/or the pre-calculated (haploid) genome size can be provided withgenomesize=INT
.busco=TSVFILE
: BUSCO full table [full_table_$BASEFILE.busco.tsv
] used for calculating single copy ("diploid") read depth. This can be over-ridden by settingscdepth=NUM
.quickdepth=T/F
: Whether to use samtools depth in place of mpileup (quicker but underestimates?) [Default:False
]
This works on the principle that Complete
BUSCO genes should represent predominantly single copy (diploid read
depth) regions along with some ooor quality and/or repeat regions. Assembly artefacts and collapsed repeats etc.
are predicted to deviate from diploid read depth in an inconsistent manner. Therefore, even if less than half the
region is actually diploid coverage, the modal read depth is expected to represent the actual single copy
read depth.
Diploidocus uses samtools mpileup
(or samtools depth
if quickdepth=T
) to calculate the per-base read depth
and extracts the modal read depth for each BUSCO gene along with the overall modal read depth for all gene
regions. Genome size is then estimated based on a crude calculation using the total combined sequencing length.
This will be caculated from reads=FILELIST
unless provided with readbp=INT
.
BUSCO single-copy genes are parsed from a BUSCO full results table, given by busco=TSVFILE
(default
full_table_$BASEFILE.busco.tsv
). This can be replaced with any table with the fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with Status
= Complete
and the Contig
, Start
and End
fields are used to define the regions that should be predominantly single copy.
NOTE: There is currently no adjustment for contamination, read mapping or imbalanced insertion:deletion ratios etc. As a consequence, the current genome size prediction appears to be an over-estimate.
Diploidocus can automate the running of purge_haplotigs is automated, using the single copy (diploid) read depth
to set the purge_haplotigs thresholds. If scdepth=NUM
is not set, this will be calculated using BUSCO single
copy genes (see Genome size prediction [runmode=gensize] for details).
The purge_haplotigs
mid cutoff (-m X
) will be set at the halfway point between the diploid depth (SCDepth)
and the haploid depth (SCDepth/2). The low depth (-l X
) and high depth (-h X
) cutoffs will then be set to
SCDepth/4 and SCDepth*2. This can be over-ridden with:
phlow=INT
: Low depth cutoff for purge_haplotigs (-l X
).phmid=INT
: Middle depth for purge_haplotigs (-m X
).phhigh=INT
: High depth cutoff for purge_haplotigs (-h X
).
Output from this run is:
*.purge.coverage_stats.csv
: purge_haplotigs coverage stats*.purge.reassignments.tsv
: purge_haplotigs best sequence hits and reassignments
Additional purge_haplotigs files and output will be saved in a subdirectory:
purge_*/
This is to stop multiple purge_haplotigs runs from interfering with each other.
Diploidocus has implemented a regex-based search for Telomeres, based on the code at
https://github.com/JanaSperschneider/FindTelomeres. This looks for a canonical telomere motif of TTAGGG/CCCTAA,
allowing for some variation. For each sequence, Diploidocus trims off any trailing Ns and then searches for
telomere-like sequences at sequence ends. For each sequence, the presence/absence and length of trimming are
reported for the 5' end (tel5
and trim5
) and 3' end (tel3
and trim3
), along with the total percentage
telomeric sequence (TelPerc
).
By default, Diploidocus searches for a forward telomere regex sequence of C{2,4}T{1,2}A{1,3}
at the 5' end, and
a reverse sequence at the 3' end of T{1,3}A{1,2}G{2,4}
. These can be set with telofwd=X
and telorev=X
.
Telomeres are marked if at least 50% (teloperc=PERC
) of the terminal 50 bp (telosize=INT
) matches the
appropriate regex. If either end contains a telomere, the total percentage of the sequence matching either
regex is calculated as TelPerc
. Note that this number neither restricts matches to the termini, not includes
sequences within predicted telomeres that do not match the regex.
By default, only sequences with telomeres are output to the *.telomeres.tdt
output, but
switching telonull=T
will output all sequences.
This mode screens scaffolds for possible contaminants, given by screendb=FILE
.
First, a blastn
search of the screendb=FILE
sequences is performed, using the NCBI VecScreen search and
match strategy (see below). These parameters can be modified using blastopt=X
and/or optionfile=FILE
,
which will be appended to the run command. Alternatively, the $BASEFILE.vecscreen.blast
file produced can be
pre-made with different parameters (if force=F
).
Results are then parsed into a local hits table and rated according to the strength of a match. This is performed
iteratively, re-assigning internal matches as "proximal" if they are withing 25 bases of another match (of any
strength) and re-rating the strength of the match, until no ratings changes occur. Two additional fields are
added to the local hits table during this process: MatchPos
and MatchStr
. Once all assignments have been
made, segments of the assembly between two matches and/or sequence ends are added to the table as Suspect
.
MatchPos
will have a value of:
Terminal
= within 25 bp of either end of the sequence.Proximal
= within 25 bp of a vecsreen match (Weak
,Moderate
orStrong
).Internal
= over 25 bp from a sequence end or vecsreen match.Suspect
= Segments added asSuspect
.
MatchStr
will have a value of:
Strong
=Terminal
/Proximal
match with Score >= 24, orInternal
match with Score >= 30.Moderate
=Terminal
/Proximal
match with Score 19 to 23, orInternal
match with Score 25 to 29.Weak
=Terminal
/Proximal
match with Score 16 to 18, orInternal
match with Score 23 to 24.Suspect
= Any segment of fewer than 50 bases between two vector matches or between a match and an end.
These parameters seem to be designed for small (Illumina?) assemblies, and give excessive false positives for larger genomes. Diploidocus therefore features two additional contaminant identification filters:
eFDR
= The "Expected False Discovery Rateis calculated for each contaminant sequence. This is simply the
Expectvalue for that hit, divided by the total number of hits with that
Expectvalue (or lower). If
efdr=Xis set, any hits with an
eFDRvalue exceeding the threshold will be removed. By default, this is set to
1.0, i.e. any contaminants with fewer assembly hits than their
Expect` value will be dropped.Minimum hit length
= Short hits are inevitable in large assemblies and unlikely to be real for long read assemblies (e.g. without cloning etc.). An additionalminvechit=X
setting will remove any really short hits. This is set by default tominvechit=50
, meaning that a hit of at least 50 bp is required.
Finally, the percentage coverage per scaffold is calculated from the filtered hits. This is performed first for each contaminant individually, before being collapsed into total contamination coverage per query.
Results are output into two main delimited results files:
*.vecscreen.tdt
= Contaminant local hit table with VecScreen ratings and eFDR calculation. (Unfiltered)*.screencov.tdt
= Query coverage per contaminantQuery
= Contaminant name. For total coverage, this will beTOTAL
.Hit
= Scaffold name.HitLen
= Length of scaffold (bp).Coverage
= Covered length of scaffold (bp).CovPC
= Percentage coverage of scaffold (0-100).
The VecScreen BLAST+ parameters are pre-set using blastn options: -task blastn -reward 1 -penalty -5 -gapopen 3 -gapextend 3 -dust yes -soft_masking true -evalue 700 -searchsp 1750000000000
VecScreen Match Categories Vector contamination usually occurs at the beginning or end of a sequence; therefore, different criteria are applied for terminal and internal matches. VecScreen considers a match to be terminal if it starts within 25 bases of the beginning of the query sequence or stops within 25 bases of the end of the sequence. Matches that start or stop within 25 bases of another match are also treated like terminal matches. Matches are categorized according to the expected frequency of an alignment with the same score occurring between random sequences.
Strong Match to Vector (Expect 1 random match in 1,000,000 queries of length 350 kb.) Terminal match with Score >= 24. Internal match with Score >= 30.
Moderate Match to Vector (Expect 1 random match in 1,000 queries of length 350 kb.) Terminal match with Score 19 to 23. Internal match with Score 25 to 29.
Weak Match to Vector (Expect 1 random match in 40 queries of length 350 kb.) Terminal match with Score 16 to 18. Internal match with Score 23 to 24.
Segment of Suspect Origin Any segment of fewer than 50 bases between two vector matches or between a match and an end.
If veccheck=T
then long reads given by reads=FILELIST readtype=LIST
will be mapped onto the assembly using
minimap2 and complete coverage of each hit reported. This will report the number of reads that span the entrie
contaminant hit, plus flanking distances each side set by checkflanks=LIST
(default: 0,100,1000,5000). If the
flanking distance (bp) or the end of the sequence is reached before the end of the read in both directions, that
read will be added to the coverage count. Because this can inflate apparent flanking coverage for hits near the
end of a sequence, the distance from each end of the sequence is also returned as MaxFlank5
and MaxFlank3
.
Depth trimming (deptrim
) mode trims sequence termini of at least mintrim=INT
bp with less than deptrim=INT
read depth. First, samtools mpileup
or depth
(depmethod=X
) is used to find the first and last positions that
exceed deptrim=INT
. If no positions meet this criterio, the entire sequence will removed. Otherwise, either
terminus that exceeds mintrim=INT
base pairs of insufficent read depth are trimmed.
Region checking, whether for read spanning analysis (runmode=regcheck
) or copy number analysis
(runmode=regcnv
or runmode=regcheck regcnv=T
), analyses regions extracted from a delimited file given by:
regcheck=FILE
. This can be a GFF file, in which case gfftype=LIST
will restrict analysis to specific feature
types, or regions can be defined by checkfields=LIST
, which defines the locus, start and end positions. These
fields default to SeqName
, Start
and End
fields. If these fields cannot be found, the first three fields
of the regcheck=FILE
file will be used.
Long read data, given with the reads=FILELIST
and readtype=LIST
options, are mapped onto the assembly
(seqin=FILE
) using minimap2 to generate a PAF file. This is then parsed and reads spanning each feature based
on their positions and the target start and end positions in the PAF file. In addition to absolute spanning of
regions, reads spanning a region +/- distances set by checkflanks=LIST
will also be calculated. If the end of a
sequence is reached before the end of the read, this will also count as flanking. Such regions can be identified
using the MaxFlank5
and MaxFlank3
fields, which identify the maximum distance 5' and 3' that a read can span
due to sequence length constraints.
If regcnv=T
then the region copy number analysis (below) will also be performed.
NOTE: Depth calculations are performed in parallel in the directory set with tmpdir=PATH
. The number of
parallel calculations is set by forks=INT
.
Copy number analysis uses the same single copy depth profile analysis as the DepthSizer
(Diploidocus runmode=gensize
) genome size prediction. In short, the modal read depth of BUSCO single copy
Complete
genes is calculated using samtools
mpileup
(or samtools depth
if quickdepth=T
) and used to defined "single copy read depth". BUSCO single-copy
genes are parsed from a BUSCO full results table, given by busco=TSVFILE
(default
full_table_$BASEFILE.busco.tsv
). This can be replaced with any table with the fields:
['BuscoID','Status','Contig','Start','End','Score','Length']. Entries are reduced to those with
Status
= Complete
and the Contig
, Start
and End
fields are used to define the regions that should be
predominantly single copy.
Single-copy read depth can also be set using scdepth=NUM
to re-used or over-ride previously calculated values.
Long read data, given with the reads=FILELIST
and readtype=LIST
options, are then mapped onto the assembly
(seqin=FILE
) using minimap2. This can be skipped by providing a BAM file with bam=FILE
. For each regcheck
feature, the same samtools depth calculation is performed as for the BUSCO data. The mean depth across the region
is then divided by the single copy depth to estimate total copy number across the region. Note that unlike the
single copy depth estimation itself, this might be biased by repeat sequences etc. that have a different depth
profile to the main region. One way to control for this might be to restrict analysis to a subset of reads that
meet a certain minimum length cutoff, e.g. 10kb.
Query-based CNV analysis. If the regcheck=FILE
file has additional Qry
, QryLen
, QryStart
and QryEnd
fields, the copy number analysis will have an additional query-focus. In this case, each region mapping onto a
specific query is summed up, adjusted for the proportion of the query covered by that region. For example, 3.5X
mean depth of a 100% length copy and 3.0X coverage of a 50% length copy would sum to (3.5x1.0 + 3x0.5 = 5 total
copies). If these fields are not present, each region will be analysed independently.
NOTE: Depth calculations are performed in parallel in the directory set with tmpdir=PATH
. The number of
parallel calculations is set by forks=INT
.
NOTE: These modes are now primarily documented and updated through GapSpanner.
This mode first identifies all the gaps in an assembly (seqin=FILE
) (using SeqList gapstats
or $SEQBASE.gaps.tdt
if pre-
existing) and then runs the read spanning analysis (runmode=regcheck
) with regcnv=F
. Long read data, given
with the reads=FILELIST
and readtype=LIST
options, are mapped onto the assembly using minimap2 to generate a PAF file.
This is then parsed and reads spanning each gap are identified based on their positions and the target start and end positions in the PAF file.
In addition to absolute spanning of regions, reads spanning a region +/- distances set by checkflanks=LIST
will also be calculated. If the end of a
sequence is reached before the end of the read, this will also count as flanking. Such regions can be identified
using the MaxFlank5
and MaxFlank3
fields, which identify the maximum distance 5' and 3' that a read can span
due to sequence length constraints.
Spanning spanid
output is also generated for each gap and saved in $BASEFILE_spanid
. Each gap will be named:
seqname.start-end
.
In addition to the gapspan
analysis, reads identified as spanning each gap are extracted and assembled using flye
in a $BASEFILE__gapassemble/
output directory. Only gaps with at least mingapspan=INT
(default 2) reads are
re-assembled.
In addition to the gapspan
and gapass
outputs, re-assembled gap regions are compiled into a single file and then
mapped back on the original assembly using Minimap2, with tabulated hit output into $BASEFILE__gapfill/
. Local hits
are reduced to unique coverage of the assembly sequences. Gaps are filled if one of the two conditions are met:
- A single local alignment spans an entire gap.
- A pair of concordant local alignments from the same re-assembly contig (same orientation) flank an entire gap.
In the case of a single spanning local alignment, the entire assembly region is replaced by the corresponding re-assembly contig region. For a pair of hits, the region between the two hits is replaced.
The sorted non-redundant assembly cleanup mode (runmode=sortnr
) screens out any sequences that are 100% gap,
then removes any sequences that are 100% redundant with other sequences in the input. This includes full and
partial matches, i.e. if sequence X is wholly contained within sequence Y then X will be removed.
First, sequences are loaded from the file given with seqin=FILE
and any rje_seqlist
filters and sequence sorting are applied to the input. Sequences that are 100% Ns are removed and any gaps
exceeding 10 nt are reduced to 10 N
s (NNNNNNNNNN
) to prevent minimap2 from splitting sequences on long gaps.
These gap-reduced sequences are output to $BASEFILE.tmp.fasta
and used for an all-by-all minimap2 search.
By default, minimap2 is run with the options to generate a $BASEFILE.tmp.paf
file:
--cs -p 0.0001 -t 4 -x asm20 -N 250
To modify minimap2 search settings, please see the rje_paf documentation.
NOTE: These run options can probably be made more stringent to speed up minimap2 without loss of function. Future releases may alter defaults accordingly.
Minimap2 output is parsed to identify scaffold-scaffold matches. Self-hits are ignored.
The minimum (gap-reduced) sequence length is used as a rapid parsing filter: any minimap2 matches that are less
than 95% of the query sequence (Length
+nn
fields) or less that 100% identity (Identity
+nn
)/(Length
+nn
)
are filtered during parsing.
NOTE: Future releases may feature an option to reduce the global percentage identity cut-off. Please contact the author if you wish to see this implemented.
Minimap2 hits are then processed reverse-sorted by reference sequence size (e.g. scaffold length). Any hits
where either sequence has already been filtered are skipped. Otherwise, if the match (as determined by the
length of :
regions in the CS string) matches the query length, the Query sequence will be flagged for
remove as "identical to" or "contained within" the Hit. (Mutually partial overlapping exact matches are NOT
filtered.) Filtered IDs and their matches are output to $BASEFILE.redundant.txt
.
Once all sequences have been filtered, the remaining sequences are output to: $BASEFILE.nr.fasta
.
NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order,
add the sortseq=invsize
command to the Diploidocus run command.
Finally, the input and output files are summarised (unless summarise=F
) and statistics output to:
$BASEFILE.summarise.tdt
.
Temporary gap-reduced and minimap2 PAF files are deleted unless running in debug
or
dev
modes.
This protocol is based on 10x assemblies made for multiple organisms with supernova v2.0.0 and supernova v2.1.1. In each case, some redundancy in output was discovered (a) within pseudohap output, and (b) in terms of fully homozygous (identical) scaffolds shared by both haplotigs. It was also not entirely clear on what basis a particular haplotype was assigned to pseudohap1 or pseudohap2.
The general workflow therefore sought to remove redundancy, generate a set of primary scaffolds based on scaffold
length, and generate a non-redundant set of alternative scaffolds where heterozygosity exists. If diphapnr
mode
is used, the full workflow is implemented by first running the sortnr
workflow described above. In the reduced
diphap
mode, redundancy is not removed first.
Sequences are loaded and matching haplotigs identified based on their names. Sequence names MUST end HAP(\d+)
,
where (\d+)
indicates an integer that matches up haplotigs (as produced by supernova pseudohap2 output, for
example). This is not a pipeline to identify haplotig pairs, it is purely for splitting identified
haplotigs into primary and alternative assemblies.
Processing itself is quite simple. Haplotig pairs are identified based on matching HAP(\d+)
numbers. Where a
single haplotig is found, it is assigned as diploid
, under the assumption that the two haplotigs were identical
and one was removed. (It is possible that only one parent had this scaffold, e.g. sex chromosomes, so some post-
processing of descriptions may be required.) If two haplotigs with the same number are identified, the longest
is assigned to haploidA
and the shorter haploidB
.
The Primary Assemmbly is then compiled from all haploidA
and diploid
sequences. These are given pri
prefixes and output to $BASEFILE.pri.fasta
. The Alternative comprised of all haploidB
sequences is output
to $BASEFILE.alt.fasta
. If redundancy has been removed, this will likely be a subset of the full assembly. The
combined set of all primary and alternative sequences is output to $BASEFILE.dipnr.fasta
.
NOTE: By default, sequences are output in the same order as in the input. To output in reverse size order,
add the sortseq=invsize
command to the Diploidocus run command.
Finally, the input and output files are summarised (unless summarise=F
) and statistics output to
$BASEFILE.summarise.tdt
:
$BASEFILE.dipnr.fasta
= Combined pseudodiploid withhaploidA
,haploidB
anddiploid
annotation.$BASEFILE.pri.fasta
= Primary assembly withhaploidA
anddiploid
sequences.$BASEFILE.alt.fasta
= Alternative assembly withhaploidB
sequences.
This module generates balanced "in silico diploid" PacBio subread data from two sequenced haploid parents. Each
parent must first be run through SMRTSCAPE to generate subread summary data. (This will be performed if missing. Each
parent needs a *.fofn
file of subread file names, *.unique.tdt
unique subreads table and *.smrt.tdt
SMRT cell
identifier table.)
A new set of subreads is then generated from the combined set of parent subreads. This is done by first ranking the
unique subreads from each parent by length. First, the longest subread from each parent are compared and the shortest
selected to be the first subread of the diploid. (The shortest is taken to minimise length differences between the
two parents.) Next, the longest subread from the next parent that is no longer than the previous subread is added.
This cycles, picking a read from the the parent with fewest cumulative bases each cycle. The longest subread that is
no longer than the previous subread is selected. This continues until one parent runs out of subreads. Additional
subreads will be added from the other parent if they reduce the difference in cumulative output for each parent, or
until lenfilter=X
is reached.
Final output will be a *.LXXXRQXX.fasta
file in which each parent has a similar total sequence content and for
which the subread length distributions should also be similar. This is to overcome biases in resulting diploid
assemblies, where one parent has higher quality data than the other.
NOTE: If performing downstream filtering by Read Quality (RQ), this might reintroduce a bias if one parent has much
higher RQ values than the other. The rqfilter=X
setting can therefore be used to restrict output to reads with a
minimum RQ value. By default this is 0.84. If you do not get enough sequence output, this setting may need to be
relaxed. Similarly, only sequences above lenfilter=X
in length will be output. These are the figures given in the
LXXXRQXX
part of the output file, e.g. defaults of RQ>=0.84 and Len>=500 generates *.L500RQ84.fas
.
© 2019 Richard Edwards | richard.edwards@unsw.edu.au