NKrit/RepUnitTyping

RepUnitTyping

RepUnitTyping version 1.6

RepUnitTyping.py was originally created to predict the copy numbers of repeat units in VNTR loci of the Mycobacterium tuberculosis genome from PCR-free Illumina short reads.

In non-prediction mode, however, it can also report the presence (insertion) or absence (deletion) of any genomic region, not only VNTR loci, that is defined by a specific nucleotide sequence (~50 bp). To use it this way, prepare your own query sequences in multi-fasta format and optimize them for your setting.
The script is built on the backbone of SpoTyping-v2.1, a well-known in-silico spoligotyping tool (Xia E et al. Genome Med. 2016), but the modules for VNTR repeat-unit prediction and its extensions were newly written.

Citation

Genotyping of Mycobacterium tuberculosis spreading in Hanoi, Vietnam using conventional and whole genome sequencing methods (https://doi.org/10.1016/j.meegid.2019.104107)

ref/rep_unit.fasta contains a provisional set of repeat unit sequences and their flanking sequences observed in the 33 VNTR loci of the M. tuberculosis genome, including the conventional 24 MIRU-VNTR loci. You may also prepare your own multi-fasta file.

Locus labels (=keys):
[M2, 0424, ETR-C, M4, M40, M10, M16, 1955, 1982, M20, 2074, 2163a, 2163b, ETR-A, 2347, 2372, 2401, ETR-B, M23, M24, M26, M27, 3155, 3171, M31, 3232, 3336, 3690, 3820, 4052, 4120, 4156, M39]

  • the 24 MIRU-VNTR loci = [MIRU02(M2), Mtub04(0424), ETR-C, MIRU04(M4), MIRU40(M40), MIRU10(M10), MIRU16(M16), Mtub21(1955), MIRU20(M20), QUB11b(2163b), ETR-A, Mtub29(2347), Mtub30(2401), ETR-B, MIRU23(M23), MIRU24(M24), MIRU26(M26), MIRU27(M27), Mtub34(3171), MIRU31(M31), Mtub39(3690), QUB26(4052), QUB4156(4156), MIRU39(M39)]
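
The correspondence between the conventional locus names and the keys listed above can be written as a small lookup table. This is only an illustrative sketch: the variable and function names are made up here, and whether the keys appear in exactly this form in the headers of rep_unit.fasta is an assumption that should be checked against the file.

```python
# All 33 locus keys, in the order listed in this README.
ALL_LOCUS_KEYS = [
    "M2", "0424", "ETR-C", "M4", "M40", "M10", "M16", "1955", "1982", "M20",
    "2074", "2163a", "2163b", "ETR-A", "2347", "2372", "2401", "ETR-B",
    "M23", "M24", "M26", "M27", "3155", "3171", "M31", "3232", "3336",
    "3690", "3820", "4052", "4120", "4156", "M39",
]

# Conventional 24 MIRU-VNTR locus names mapped to the keys above.
CONVENTIONAL_24 = {
    "MIRU02": "M2", "Mtub04": "0424", "ETR-C": "ETR-C", "MIRU04": "M4",
    "MIRU40": "M40", "MIRU10": "M10", "MIRU16": "M16", "Mtub21": "1955",
    "MIRU20": "M20", "QUB11b": "2163b", "ETR-A": "ETR-A", "Mtub29": "2347",
    "Mtub30": "2401", "ETR-B": "ETR-B", "MIRU23": "M23", "MIRU24": "M24",
    "MIRU26": "M26", "MIRU27": "M27", "Mtub34": "3171", "MIRU31": "M31",
    "Mtub39": "3690", "QUB26": "4052", "QUB4156": "4156", "MIRU39": "M39",
}


def extra_loci():
    """Return the keys that are not part of the conventional 24-locus set."""
    conventional = set(CONVENTIONAL_24.values())
    return [key for key in ALL_LOCUS_KEYS if key not in conventional]


print(extra_loci())
```

The list returned by `extra_loci()` holds the nine loci that rep_unit.fasta covers in addition to the conventional set.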

Prerequisites:

Python 2.7 or 3.5+
BLAST+ [ncbi-blast(-2.8.1+)]
(ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/)
samtools-1.9+
(http://www.htslib.org/download/)
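
Before a first run it may help to confirm that the external tools are on the PATH. The sketch below is not part of RepUnitTyping; it assumes the executable names `blastn` and `makeblastdb` (shipped with BLAST+) and `samtools`, and this README does not state which of them the script actually calls. It uses `shutil.which`, so it needs Python 3.

```python
import shutil

# Assumed executable names; adjust to whatever your installation provides.
REQUIRED_TOOLS = ["blastn", "makeblastdb", "samtools"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All assumed prerequisites were found.")
```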

Python standard-library modules used internally:
    import sys
    import os
    import re
    from optparse import OptionParser
    import subprocess
    import gzip
    from collections import OrderedDict

Installation:

cd
git clone https://github.com/NKrit/RepUnitTyping.git

Input:

Fastq(.gz) file or paired-end files
Fasta file of a complete genomic sequence or assembled contigs

Output:

Output file specified: predicted number of repeat units in VNTR loci
(In the non-prediction mode, the presence or absence of a given sequence is shown)
Output log file: number of hits ('CompleteHits'= No mismatch, 'ZeroOneErrorHits'= No or one mismatch) in BLAST for each repeat unit or flanking sequence
Output log2 file: summary of search results in the fasta mode

Usage:

python RepUnitTyping.py --help

    Usage: python RepUnitTyping.py [options] FASTQ_1/FASTA FASTQ_2(optional)

    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      -p, --pred            set this if you try prediction of the number of repeat
                            units based on hits on flanking sequences [Default is
                            off]
      --seq                 set this if input is a fasta file that contains only a
                            complete genomic sequence or assembled contigs
                            [Default is off]
      -s SWIFT, --swift=SWIFT
                            swift mode, either "on" or "off" [Default: off]
      -O OUTDIR, --outdir=OUTDIR
                            output directory [Default: running directory]
      -o OUTPUT, --output=OUTPUT
                            basename of output files generated [Default:
                            RepUnitTyping]
      -f, --filter          stringent filtering of reads (used only for low
                            quality reads) [Default is off]
      --sorted              set this only when the reads are sorted to a reference
                            genome [Default is off]
      -d, --detail          enable detail mode, keeping intermediate files for
                            checking [Default is off]
      -q Q_FASTA, --query=Q_FASTA
                            query file for repeat units [Default is rep_unit.fasta
                            in "ref" subdirectory]
      -c CUTOFF, --cutoff=CUTOFF
                            threshold for the presence of each sequence [Default:
                            0.15] times the average read depth calculated from Mtb
                            genome size
      -g GENOME_SIZE, --genome=GENOME_SIZE
                            target genome size [Default: 4500000]
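
To make the `--cutoff` and `--genome` options concrete, the sketch below shows how a presence threshold could follow from them. It assumes that the average read depth is the total number of sequenced bases divided by the genome size; the function name and the exact way RepUnitTyping.py computes depth and counts hits are assumptions, not taken from the script.

```python
def presence_threshold(total_bases, genome_size=4500000, cutoff=0.15):
    """Return (average_depth, threshold) for a given amount of sequence data.

    Assumption: average depth = total sequenced bases / genome size, and a
    sequence counts as present when its hit count reaches cutoff * depth.
    """
    average_depth = float(total_bases) / genome_size
    return average_depth, cutoff * average_depth


# 900 Mb of reads against the default 4.5 Mb genome size.
depth, threshold = presence_threshold(900000000)
print("depth: %.1f, threshold: %.1f" % (depth, threshold))
```

Under these assumptions, 900 Mb of reads correspond to roughly 200x depth, so with the default cutoff a sequence would need on the order of 30 hits to be called present.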

Examples:

cd ./RepUnitTyping
python RepUnitTyping.py -s off ../AL123456.3H37Rv_HS25-l150-f200_R1.fastq.gz ../AL123456.3H37Rv_HS25-l150-f200_R2.fastq.gz -q rep_unit.fasta -O RepUnit_out -o 2019RepUnitTyping -p # prediction/non-swift mode for paired fastq files
python RepUnitTyping.py -s on ../AL123456.3H37Rv_HS25-l150-f200_R1.fastq.gz ../AL123456.3H37Rv_HS25-l150-f200_R2.fastq.gz -q rep_unit.fasta -O RepUnit_out -o 2019RepUnitTyping # non-prediction/swift mode for paired fastq files 
python RepUnitTyping.py --seq ../AL123456.3H37Rv.fasta -q rep_unit.fasta -O RepUnit_out -o 2019RepUnitTyping -p # fasta mode mainly for complete genome sequence data
# rep_unit.fasta should be located in the ref subdirectory.
# Output files created from the last command are present in the RepUnit_out directory as an example. You can delete them when new analyses are made.
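
When many samples need the same settings, the command lines above can also be assembled programmatically. The helper below is a hypothetical sketch (the function name, its parameters, and the sample names are illustrative); it uses only options listed in the help text and mirrors the argument order of the examples. It prints the commands rather than running them.

```python
import sys


def build_command(inputs, outdir, basename, predict=False, swift="off",
                  fasta=False, script="RepUnitTyping.py",
                  query="rep_unit.fasta"):
    """Assemble an argument list mirroring the example commands above."""
    cmd = [sys.executable, script]
    if fasta:
        cmd.append("--seq")        # fasta mode: complete genome or contigs
    else:
        cmd += ["-s", swift]       # fastq mode: swift "on" or "off"
    cmd += list(inputs)
    cmd += ["-q", query, "-O", outdir, "-o", basename]
    if predict:
        cmd.append("-p")           # prediction mode
    return cmd


# Hypothetical sample names, used only for illustration.
for sample in ["sampleA", "sampleB"]:
    reads = ["../%s_R1.fastq.gz" % sample, "../%s_R2.fastq.gz" % sample]
    print(" ".join(build_command(reads, "RepUnit_out", sample, predict=True)))
```

Each returned list can be passed to `subprocess.check_call` from inside the RepUnitTyping directory to actually run the analysis.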

Alternatively, you may use the shell script rep-unit-typing.sh to run RepUnitTyping.py interactively and repeatedly.

cd ./RepUnitTyping
sh rep-unit-typing.sh

Notes:

  • For good prediction, PCR-free deep sequencing (depth of coverage > 200) is indispensable.
  • When inconsistencies with experimental typing results are suspected, consider incomplete matches caused by unidentified repeat unit variants or flanking sequences. In that case, rebuild an optimal rep_unit.fasta file by extracting the unlisted variants from de novo assembled sequences.

Version history:

  • v1.4 (2019-05-06): initially opened.
  • v1.5 (2020-07-07): gzip interface mode was modified, and each run is ~40% quicker than before. The --sorted option and the cleaning-up process were also modified.
  • v1.6 (2022-06-07): an option to change the genome size was added to the Python script, so variants found in genomes larger than that of M. tuberculosis can also be handled.
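
The notes above recommend a depth of coverage above 200 for good prediction. A rough way to check this before running the tool is to count the bases in the gzipped FASTQ files and divide by the genome size. The sketch below is not part of RepUnitTyping; it assumes plain four-line FASTQ records (no wrapped sequence lines), uses the default genome size from the help text, and needs Python 3.

```python
import gzip


def count_bases(fastq_gz_path):
    """Sum the lengths of all sequence lines in a gzipped FASTQ file."""
    total = 0
    with gzip.open(fastq_gz_path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # second line of each four-line record
                total += len(line.strip())
    return total


def estimated_depth(fastq_paths, genome_size=4500000):
    """Estimate depth of coverage as total bases / genome size."""
    total_bases = sum(count_bases(path) for path in fastq_paths)
    return total_bases / float(genome_size)
```

For a paired-end run, pass both files, for example `estimated_depth(["sample_R1.fastq.gz", "sample_R2.fastq.gz"])`, where the file names are placeholders.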
