This package is developed for automated whole-genome de novo transposable element (TE) annotation and for benchmarking the annotation performance of TE libraries.
For the initial search of TE candidates, the package incorporates LTRharvest, LTR_FINDER_parallel, and LTR_retriever to identify LTR retrotransposons; TIR-Learner and MITE-Hunter to identify TIR transposons (a subclass of DNA transposons); and HelitronScanner to identify Helitron transposons (a subclass of DNA transposons). Finally, RepeatModeler is used to identify any TEs missed by these structure-based programs.
The EDTA package was designed to filter out false discoveries in raw TE candidates and generate a high-quality non-redundant TE library for whole-genome TE annotation.
For benchmarking a test TE library, we provide the curated TE annotation (v6.9.5) for the rice genome (TIGR7/MSU7 version). You may use the lib-test.pl script to compare the annotation performance of your method/library with that of the methods we have tested (usage shown below).
conda create -n EDTA
conda activate EDTA
conda install -c conda-forge perl perl-text-soundex
conda install -c cyclus java-jdk
conda install -c bioconda cd-hit
conda install -c bioconda/label/cf201901 repeatmasker
conda install -c bioconda repeatmodeler
git clone https://github.com/oushujun/EDTA
./EDTA/EDTA.pl
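After installing, it can help to confirm that the dependencies are visible in the active environment. The following is a minimal sketch, not part of EDTA: the helper name missing_tools is hypothetical, and the executable names passed to it (for example cd-hit-est for the cd-hit package) are assumptions that may differ on your system.

```shell
#!/bin/sh
# Print the name of every listed command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names below are assumptions; adjust them to your installation.
missing=$(missing_tools perl java cd-hit-est RepeatMasker RepeatModeler blastn)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing >&2
else
    echo "All listed tools were found."
fi
```

Run it inside the activated EDTA environment; anything it reports should be installed or added to PATH before running EDTA.pl.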
From head to toe (you have a genome and want a high-quality TE library):
perl EDTA.pl -genome your_genome.fasta -threads 36
Just the body (you have raw TE candidates from various programs and want to filter them using EDTA):
perl EDTA_process.pl [options]
-genome [File] The genome FASTA
-ltr [File] The raw LTR library FASTA
-tir [File] The raw TIR library FASTA
-mite [File] The raw MITE library FASTA
-helitron [File] The raw Helitron library FASTA
-repeatmasker [path] The directory containing RepeatMasker (default: read from ENV)
-blast [path] The directory containing Blastn (default: read from ENV)
-threads [int] Number of threads used to run this script
-help|-h Display this help info
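Because EDTA_process.pl takes several FASTA inputs, a quick sanity check before launching may save a failed run. This is a rough sketch under stated assumptions: the helper looks_like_fasta and all file names are hypothetical, and the check only looks at the first non-blank line of each file.

```shell
#!/bin/sh
# Return 0 if the first non-blank line of the file starts with '>', else 1.
looks_like_fasta() {
    [ -s "$1" ] || return 1          # missing or empty file
    first=$(grep -v '^[[:space:]]*$' "$1" | head -n 1)
    case "$first" in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Hypothetical file names; replace them with your own genome and raw libraries.
for f in genome.fasta raw.LTR.fasta raw.TIR.fasta raw.MITE.fasta raw.Helitron.fasta; do
    looks_like_fasta "$f" || echo "Check input: $f" >&2
done
```

If every file passes, a call using the options listed above would look like: perl EDTA_process.pl -genome genome.fasta -ltr raw.LTR.fasta -tir raw.TIR.fasta -mite raw.MITE.fasta -helitron raw.Helitron.fasta -threads 36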
If you have a TE library and want to compare its annotation performance to other methods, you can:
1. Annotate the rice genome with your test library:
RepeatMasker -pa 36 -q -no_is -norna -nolow -div 40 -lib custom.TE.lib.fasta -cutoff 225 rice_genome.fasta
2. Test the annotation performance of a particular TE category:
perl lib-test.pl -genome genome.fasta -std genome.stdlib.RM.out -tst genome.testlib.RM.out -cat [options]
-genome [file] FASTA format genome sequence
-std [file] RepeatMasker .out file of the standard library
-tst [file] RepeatMasker .out file of the test library
-cat [string] Testing TE category. Use one of LTR|nonLTR|LINE|SINE|TIR|MITE|Helitron|Total|Classified
-N [0|1] Include Ns in the total length of the genome. Default: 0 (do not include Ns).
-unknown [0|1] Include unknown annotations in the testing category. Use this when
the test library has no classification and you assume all entries belong to the
target category specified by -cat. Default: 0 (do not include unknowns).
e.g.
perl lib-test.pl -genome rice_genome.fasta -std ./EDTA/database/Rice_MSU7.fasta.std6.9.5.out -tst rice_genome.fasta.test.out -cat LTR
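To benchmark several categories in one go, the same command can be repeated with different -cat values. The sketch below only prints the commands (a dry run); the helper name print_libtest_cmds is hypothetical, and the file names are taken from the example above.

```shell
#!/bin/sh
# Print one lib-test.pl command per TE category (dry run; nothing is executed).
print_libtest_cmds() {
    genome=$1; std=$2; tst=$3; shift 3
    for category in "$@"; do
        printf 'perl lib-test.pl -genome %s -std %s -tst %s -cat %s\n' \
            "$genome" "$std" "$tst" "$category"
    done
}

print_libtest_cmds rice_genome.fasta \
    ./EDTA/database/Rice_MSU7.fasta.std6.9.5.out \
    rice_genome.fasta.test.out \
    LTR TIR Helitron Total
```

Review the printed lines, then pipe the output to sh to execute them. Note that RepeatMasker normally writes its table as rice_genome.fasta.out next to the input, so the -tst name used here assumes that file was renamed.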
You may download the rice genome here.