mendezg/DATOL

DATOL


A. What's This (could use a name)?

This is a pipeline, inspired by the CEGMA pipeline, for generating phylogenetic trees from transcriptomic data or annotated genomic data. A non-redundant set of genes from the CEGMA and BUSCO projects that are highly conserved across eukaryotic taxa is provided as a potential gene set to use with the pipeline, but any set of genes can be used. Scripts are provided to prepare your own set of genes for the pipeline.

This pipeline is also useful for finding genes in transcriptomic or annotated genomic datasets in a robust manner using phylogenetic trees to inform the user whether potential hits are homologous to the queried genes.

This pipeline uses blastp to generate a list of candidate hits that are scored according to an HMM. Candidate hits are fed into a series of iterative phylogeny-based tests to find the best hit. Part of this iterative process may flag certain genes as unacceptable for phylogenetic use because their orthologs cannot be differentiated. In the end, a subset of the initial genes that all provide reliable hits is used to generate phylogenetic trees.

The pipeline can be installed on *NIX platforms and requires HMMER 3.1, MAFFT 7, RAxML 8, SAMTools, Usearch 8.1, TrimAL 1.2, and Python 2.7, along with the Python libraries Bio, numpy, matplotlib, and ete2.

The necessary data inputs are fasta files of open reading frames for the CDS and the corresponding proteins. If you wish to use your own genes, you will also need a fasta file for each gene listing sequences from a diverse set of taxa.

You can direct questions/comments to mendezg@umd.edu


B. Installation

Install: HMMER 3.1, MAFFT 7, RAxML 8, SAMTools, usearch v8.1, TrimAL 1.2, Python 2.7

Install Python libraries: Bio, numpy, matplotlib, ete2

Add the scripts to your bash PATH


C. How to use this

This pipeline is a multi-stage process that requires user interaction at several stages.

If you are starting from transcriptomic assemblies, you will need to find open reading frames. We suggest using transdecoder or a similar tool. This will output the .cds and .pep files needed for this pipeline.

If you are starting from genomic data, you will need the CDS and protein sequences generated after annotation. We provide a script to generate CDS sequences from a gff file and a fasta-formatted genomic assembly.

The first pipeline script to use is loop.sh.

loop.sh

You need to generate several input data files before running this script:

  1. ORFs as both DNA and peptide sequences for each species, in fasta format, with a separate fasta file for each species.
  2. Query sequences in fasta format, with a separate fasta file for each gene.
  3. HMMs for each query gene.
  4. A cutoff file with a bitscore cutoff specified for each gene.
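
Before launching loop.sh, it can be worth confirming that every species has both a DNA file and a protein file. The following is a minimal sketch, not part of the pipeline. It assumes the two files for a species share a base name and end in .cds and .pep (as the transdecoder output mentioned above does); the function name and the extension defaults are hypothetical and should be adjusted to your own naming.

```python
import os


def pair_orf_files(dna_dir, protein_dir, dna_ext=".cds", pep_ext=".pep"):
    """Compare base names of ORF files in the DNA and protein directories.

    Returns three sorted lists: names present in both directories,
    names with only a DNA file, and names with only a protein file.
    The extensions are assumptions; change them to match your data.
    """
    def base_names(directory, ext):
        # Strip the extension so that sp1.cds and sp1.pep compare equal.
        return {
            name[: -len(ext)]
            for name in os.listdir(directory)
            if name.endswith(ext)
        }

    dna = base_names(dna_dir, dna_ext)
    pep = base_names(protein_dir, pep_ext)
    return sorted(dna & pep), sorted(dna - pep), sorted(pep - dna)
```

Any name returned in the second or third list points to a species that is missing one of its two input files.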

Scripts called by this script:

  • big_ublastp.sh - This performs the ublast searches, searching for each gene of interest against the blast databases of your species of interest.
  • get_seq.sh - This uses the text file from PepFromBlast.py to write sequence files.
  • hmm_from_pep.sh - This performs an hmmsearch on each sequence from the blast search to filter the results based on the cutoff scores.
  • parse_hmm_search.py - This generates a text file from the hmmsearch output, which is needed to look up the sequences.
  • write_cds_pep.sh - This uses the text file generated by parse_hmm_search.py to write a DNA and a peptide file for each sequence.
  • gene_species_table.sh - This generates a table (in csv format) indicating the presence or absence of genes in the species searched.
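
The filtering idea behind the hmmsearch step can be illustrated with a small sketch. This is not the pipeline's code: the in-memory data layout, the function name, and the choice to treat a score equal to the cutoff as passing are all assumptions made for illustration.

```python
def filter_hits(hits, cutoffs):
    """Keep candidate hits whose HMM bitscore meets the per-gene cutoff.

    hits:    iterable of (gene, sequence_id, bitscore) tuples
    cutoffs: dict mapping gene name to its minimum bitscore

    Returns a dict mapping each gene to its passing (sequence_id, bitscore)
    pairs, best score first. Genes without a cutoff are skipped.
    """
    kept = {}
    for gene, seq_id, score in hits:
        if gene not in cutoffs:
            continue  # no cutoff defined, so the hit cannot be judged
        if score >= cutoffs[gene]:  # assumption: equality counts as a pass
            kept.setdefault(gene, []).append((seq_id, score))
    for gene in kept:
        kept[gene].sort(key=lambda item: item[1], reverse=True)
    return kept
```

Genes with no passing hit are simply absent from the result, which corresponds to a "0" in the gene-species table.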

Named variables. Every run needs the following defined:

  • -q | --query_dir - The directory containing the query directory, the hmm directory, and the scores_cutoff.txt file.
  • -d | --dna_dir - The directory containing the open reading frames (DNA sequences).
  • -p | --protein_dir - The directory containing the translated open reading frames (protein sequences).
  • -o | --output_dir - The directory in which to put the output.
  • -t | --threads - How many threads to use.

Example:

loop.sh -d ~/mydata/dna -p ~/mydata/prot -q ~/pipeline/queries -o ~/myoutput/loop1 -t 32

When loop.sh has completed, examine the gene_species_table.csv file stored in your output directory. This table is a spreadsheet showing species in columns and genes in rows. If a gene was found for a given species, a "1" is listed; if it was not found, a "0" is listed. Using this information, choose a set of species and genes with no or few gaps in the data to use in the next steps. Save the species names to one text file and the gene names to a separate text file, with one species or gene on each line. When you have prepared those lists, you can launch the next script, pretree_loop.sh, using the command suggested in the final loop.sh output.
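
Choosing species and genes by eye works for small tables; for larger ones a short script can propose a starting selection. The sketch below is not part of the pipeline. It assumes the csv has a header row of species names (with an arbitrary first cell) and one row per gene whose first cell is the gene name; the function names and the two thresholds are hypothetical.

```python
import csv


def select_species_and_genes(table_path, max_missing_genes=0, max_missing_species=0):
    """Pick species and genes with few gaps from a presence/absence table.

    Species missing more than max_missing_genes genes are dropped first;
    genes missing from more than max_missing_species of the remaining
    species are then dropped. Returns (species, genes).
    """
    with open(table_path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]
    species = rows[0][1:]
    genes = [row[0] for row in rows[1:]]
    present = [[cell.strip() == "1" for cell in row[1:]] for row in rows[1:]]

    # Keep the columns (species) that have few enough absent genes.
    kept_cols = [
        col for col in range(len(species))
        if sum(not row[col] for row in present) <= max_missing_genes
    ]
    # Keep the rows (genes) that are absent from few enough kept species.
    kept_rows = [
        idx for idx, row in enumerate(present)
        if sum(not row[col] for col in kept_cols) <= max_missing_species
    ]
    return [species[c] for c in kept_cols], [genes[r] for r in kept_rows]


def write_list(names, path):
    """Write one name per line, the layout expected for the list files."""
    with open(path, "w") as handle:
        for name in names:
            handle.write(name + "\n")
```

The result is only a starting point; you may still want to add or remove taxa by hand, for example to keep an outgroup species that has a few gaps.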

pretree_loop.sh

Named variables. Every run needs the following defined:

  1. -i | --input_dir - The directory containing the Working directory from loop.sh.
  2. -s | --species_list - A text file listing species; one species per line.
  3. -g | --gene_list - A text file listing genes; one gene per line.
  4. -t | --threads - How many threads to use.

Example:

pretree_loop.sh -t 24 -i ~/myoutput/loop1 -s ~/myoutput/loop1/Working_Dir_Mon_Dec_7_161512_EST_2015/lists/species_list.txt -g ~/myoutput/loop1/Working_Dir_Mon_Dec_7_161512_EST_2015/lists/gene_list.txt

When pretree_loop.sh is done, you will have alignment files available for tree finding. There will be one file for DNA and another for AA. Perform a RAxML tree search using each of these files. Bootstrapping is not necessary, but it is nice for reference. Examples (make sure your output is named exactly as shown below):

raxmlHPC-PTHREADS-AVX -f a -p 258755 -x 258755 -# 100 -m PROTGAMMAAUTO -s supermatrix_pep.phylip -T 32 -n loop_1_pep.tre
raxmlHPC-PTHREADS-AVX -f a -p 258755 -x 258755 -# 100 -m GTRGAMMA -s supermatrix_cds.phylip -T 16 -n loop_1_cds.tre

When your tree finding is complete, place the resulting files in the same directory as the input alignment files. The next script to launch is search_optimization.sh.
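
RAxML names its output files RAxML_<type>.<run name>, where the run name is the value passed to -n. A quick way to confirm that the results landed next to the alignments is to list the files matching that pattern. This is a minimal sketch; the function name is hypothetical, and which of the listed files search_optimization.sh actually reads is not checked here.

```python
import glob
import os


def raxml_outputs(alignment_dir, run_name):
    """List RAxML output files for one run found in alignment_dir.

    run_name is the value given to RAxML's -n option,
    for example "loop_1_pep.tre".
    """
    pattern = os.path.join(glob.escape(alignment_dir), "RAxML_*." + run_name)
    return sorted(os.path.basename(path) for path in glob.glob(pattern))
```

An empty list for loop_1_pep.tre or loop_1_cds.tre suggests that the run used a different name or that its files are still in another directory.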

search_optimization.sh

Before launching search_optimization.sh, you need to prepare a text file listing outgroups. This is necessary to root the trees being examined. It is best to have multiple possible rooting groups in case genes were not found for your first rooting choice. The first line should start with the word Outgroup1, followed by a name for the group, followed by a list of all the species in the outgroup, with everything separated by spaces. Specify Outgroup2 and Outgroup3 in the same way on separate lines. Example outgroup file:

Outgroup1 Vitrella Vitrella_brassicaformis Vitrella_brassicaformis_CCMP3155
Outgroup2 Oxyrrhis Oxyrrhis_marina_LB1974 Oxyrrhis_marina Oxyrrhis_marina_unknown
Outgroup3 Amphidinium Amphidinium_massartii Amphidinium_sp_cladeA Amphidinium_carterae_MMET Amphidinium_carterae
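
A malformed outgroup file is easy to produce by hand, so a small check before launching can save a failed run. The sketch below is not part of the pipeline. It follows the layout described above (label, group name, then species, separated by whitespace) and additionally assumes that every label is the word Outgroup followed by a number; the function name is hypothetical.

```python
def parse_outgroups(path):
    """Read an outgroup file into a list of (label, group_name, species) tuples.

    Raises ValueError for a line with an unexpected label or with
    no species listed after the group name.
    """
    outgroups = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # tolerate blank lines
            label = fields[0]
            # Assumption: labels look like Outgroup1, Outgroup2, ...
            if not (label.startswith("Outgroup") and label[len("Outgroup"):].isdigit()):
                raise ValueError("line %d: unexpected label %r" % (line_number, label))
            if len(fields) < 3:
                raise ValueError("line %d: no species listed" % line_number)
            outgroups.append((label, fields[1], fields[2:]))
    return outgroups
```

The parsed species names can then be compared against your species list to confirm that at least one outgroup is represented in the data.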

Example execution of search_optimization.sh script:

search_optimization.sh -i ~/myoutput/loop1 -og ~/myoutput/outgroups.txt -t 32

When search_optimization.sh has completed, it is time to run loop.sh again, but this time using the rebuilt queries generated by search_optimization.sh instead of the queries initially provided. These new queries, HMMs, and cutoff scores are generated from your own input data. The script will provide the required command. This is also a good time to review the generated html report, which gives detailed information and figures for each gene showing which sequences were identified as paralogs.

Second round loop.sh

When the search_optimization.sh script completes, it will print a message that includes the command to start the second loop. In this loop, the new search terms, HMMs, and screened libraries generated by search_optimization.sh are used to create a final set of gene sequences. The provided command will follow this format:

loop.sh -d ~/myoutput/CDS_degenerate_seqs -p ~/myoutput/Loop_2/screened_fasta -q ~/myoutput/Loop_2/rebuilt_queries -o ~/myoutput -t 32

Screen new loop sequences using final_check.sh

When loop.sh completes this second time, it will print instructions on how to execute the next step. As with the first loop, you should first review the provided gene-species coverage table and create a list of species and a list of genes. I recommend placing your species.txt and genes.txt files in loop_1_out/lists and your outgroups file in your main directory.

Example execution of final_check.sh:

final_check.sh -i ~/myoutput/ -s ~/myoutput/Loop_2/lists/species.txt -g ~/myoutput/Loop_2/lists/genes.txt -og ~/myoutput/outgroups.txt -t 32

Create Sequence Alignments using pretree_loop.sh

The final output from final_check.sh will provide instructions on how to execute this script. You should review the revised list of genes and then run pretree_loop.sh.

Example:

pretree_loop.sh -t 24 -i ~/myoutput/loop2 -s ~/myoutput/loop2/Working_Dir_Mon_Dec_7_161512_EST_2015/lists/species_list.txt -g ~/myoutput/loop2/Working_Dir_Mon_Dec_7_161512_EST_2015/lists/gene_list.txt

When pretree_loop.sh is done, you will have alignment files available for tree finding. There will be one file for DNA and another for AA. These are the final product of this pipeline, and you can proceed with whatever phylogenetic tests you feel are relevant to your work. You can perform a RAxML tree search using each of these files.

Examples (make sure your output is named exactly as shown below):

raxmlHPC-PTHREADS-AVX -f a -p 258755 -x 258755 -# 100 -m PROTGAMMAAUTO -s supermatrix_pep.phylip -T 32 -n loop_2_pep.tre
raxmlHPC-PTHREADS-AVX -f a -p 258755 -x 258755 -# 100 -m GTRGAMMA -s supermatrix_cds.phylip -T 16 -n loop_2_cds.tre
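
Before committing many threads to a tree search, it can help to confirm that each supermatrix contains the number of taxa you expect. A PHYLIP file begins with a header line giving the number of sequences and the alignment length, so a few lines are enough to read it. This is a minimal sketch with a hypothetical function name; it only reads the header and does not validate the alignment itself.

```python
def read_phylip_header(path):
    """Return (number_of_taxa, alignment_length) from a PHYLIP file header."""
    with open(path) as handle:
        fields = handle.readline().split()
    if len(fields) < 2:
        raise ValueError("first line does not look like a PHYLIP header")
    return int(fields[0]), int(fields[1])
```

The taxon count should match the number of species in your species list, and the protein and CDS supermatrices should report the same number of taxa.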

About

Phylogenetic Marker Discovery Pipeline Utilizing Deep Sequencing Data
