RepUnitTyping

RepUnitTyping version 1.6

RepUnitTyping.py was originally created to predict copy numbers of repeat units in VNTR loci of Mycobacterium tuberculosis genome from PCR-free Illumina short reads.

In the non-prediction mode, however, it can also indicate the presence/insertion or absence/deletion of any genomic regions (not only VNTR loci) defined by specific nucleotide sequences (~50 bp), when prepared and optimized for your own settings in the multi-fasta format.
This script is using a backbone of SpoTyping-v2.1, a well-known in-silico spoligotyping tool (Xia E et al. Genome Med. 2016), but entire modules for VNTR repeat-unit prediction and its extension were newly built.

Citation

[Genotyping of Mycobacterium tuberculosis spreading in Hanoi, Vietnam using conventional and whole genome sequencing methods] (https://doi.org/10.1016/j.meegid.2019.104107)

ref/rep_unit.fasta contains a provisional set of repeat unit sequences and their flanking sequences observed in the 33 VNTR loci of M. tuberculosis genome including the conventional 24-MIRU-VNTR loci, although you may prepare another multi-fasta file.

Locus labels (=keys):
[M2, 0424, ETR-C, M4, M40, M10, M16, 1955, 1982, M20, 2074, 2163a, 2163b, ETR-A, 2347, 2372, 2401, ETR-B, M23, M24, M26, M27, 3155, 3171, M31, 3232, 3336, 3690, 3820, 4052, 4120, 4156, M39]

the 24 MIRU-VNTR loci = [MIRU02(M2), Mtub04(0424), ETR-C, MIRU04(M4), MIRU40(M40), MIRU10(M10), MIRU16(M16), Mtub21(1955), MIRU20(M20), QUB11b(2163b), ETR-A, Mtub29(2347), Mtub30(2401), ETR-B, MIRU23(M23), MIRU24(M24), MIRU26(M26), MIRU27(M27), Mtub34(3171), MIRU31(M31), Mtub39(3690), QUB26(4052), QUB4156(4156), MIRU39(M39)]

Prerequisites:

Python2.7 or 3.5+
BLAST+ [ncbi-blast(-2.8.1+)]
(ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/)
samtools-1.9+
(http://www.htslib.org/download/)

python libraries used inside:
    import sys
    import os
    import re
    from optparse import OptionParser
    import subprocess
    import gzip
    from collections import OrderedDict

Installation:

cd
git clone https://github.com/NKrit/RepUnitTyping.git

Input:

Fastq(.gz) file or paired-end files
Fasta file of a complete genomic sequence or assembled contigs

Output:

Output file specified: predicted number of repeat units in VNTR loci
(In the non-prediction mode, the presence or absence of a given sequence is shown)
Output log file: number of hits ('CompleteHits'= No mismatch, 'ZeroOneErrorHits'= No or one mismatch) in BLAST for each repeat unit or flanking sequence
Output log2 file: summary of search results in the fasta mode

Usage:

python RepUnitTyping.py --help

    Usage: python RepUnitTyping.py [options] FASTQ_1/FASTA FASTQ_2(optional)

    Options:
  	--version             show program's version number and exit
	-h, --help            show this help message and exit
  	-p, --pred            set this if you try prediction of the number of repeat
        	              units based on hits on flanking sequences [Default is
                              off]
  	--seq                 set this if input is a fasta file that contains only a
                              complete genomic sequence or assembled contigs
                              [Default is off]
  	-s SWIFT, --swift=SWIFT
                              swift mode, either "on" or "off" [Default: off]
  	-O OUTDIR, --outdir=OUTDIR
                              output directory [Default: running directory]
  	-o OUTPUT, --output=OUTPUT
          	              basename of output files generated [Default:
          	   	      RepUnitTyping]
  	-f, --filter          stringent filtering of reads (used only for low
                              quality reads) [Default is off]
  	--sorted              set this only when the reads are sorted to a reference
      	                      genome [Default is off]
  	-d, --detail          enable detail mode, keeping intermediate files for
      	                      checking [Default is off]
  	-q Q_FASTA, --query=Q_FASTA
                              query file for repeat units [Default is rep_unit.fasta
                              in "ref" subdirectory]
  	-c CUTOFF, --cutoff=CUTOFF
        	              threshold for the presence of each sequence [Default:
                	      0.15] times the average read depth calculated from Mtb
                              genome size
        -g GENOME_SIZE, --genome=GENOME_SIZE
                              target genome size [Default: 4500000]

Examples:

cd ./RepUnitTyping
python RepUnitTyping.py -s off ../AL123456.3H37Rv_HS25-l150-f200_R1.fastq.gz ../AL123456.3H37Rv_HS25-l150-f200_R2.fastq.gz -q rep_unit.fasta -O RepUnit_out -o 2019RepUnitTyping -p # prediction/non-swift mode for paired fastq files
python RepUnitTyping.py -s on ../AL123456.3H37Rv_HS25-l150-f200_R1.fastq.gz ../AL123456.3H37Rv_HS25-l150-f200_R2.fastq.gz -q rep_unit.fasta -O RepUnit_out -o 2019RepUnitTyping # non-prediction/swift mode for paired fastq files 
python RepUnitTyping.py --seq ../AL123456.3H37Rv.fasta -q rep_unit.fasta -O RepUnit_out -o 2019RepUnitTyping -p # fasta mode mainly for complete genome sequence data
# rep_unit.fasta should be located in the ref subdirectory.
# Output files created from the last command are present in the RepUnit_out directory as an example. You can delete them when new analyses are made.

or you may use a shell script, rep-unit-typing.sh, to run RepUnitTyping.py interactively and repeatedly.

cd ./RepUnitTyping
sh rep-unit-typing.sh

For good prediction, PCR-free deep sequencing (depth of coverage > 200) is indispensable.
When inconsistencies with experimental typing results are suspected, incomplete matches due to unidentified repeat unit variants or flanking sequences should be considered, and an optimal rep_unit.fasta file should be reconstructed, extracting unlisted variants from de-novo assembled sequences.
v1.4 (2019-05-06): initially opened.
v1.5 (2020-07-07): gzip interface mode was modified, and each run is ~40% quicker than before. --sorted option and cleaning-up process were also modified.
v1.6 (2022-06-07): an option to change genome size was added to the python script, and variants found in a genome larger than that of M. tuberculosis can also be handled.

Name		Name	Last commit message	Last commit date
Latest commit History 248 Commits
RepUnit_out		RepUnit_out
ref		ref
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
RepUnitTyping.py		RepUnitTyping.py
rep-unit-typing.sh		rep-unit-typing.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RepUnit_out

RepUnit_out

ref

ref

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

RepUnitTyping.py

RepUnitTyping.py

rep-unit-typing.sh

rep-unit-typing.sh

Repository files navigation

RepUnitTyping

RepUnitTyping version 1.6

Citation

Prerequisites:

Installation:

Input:

Output:

Usage:

Examples:

About

Releases

Packages

Languages

License

NKrit/RepUnitTyping

Folders and files

Latest commit

History

Repository files navigation

RepUnitTyping

RepUnitTyping version 1.6

Citation

Prerequisites:

Installation:

Input:

Output:

Usage:

Examples:

About

Resources

License

Stars

Watchers

Forks

Languages