santirdnd/PTU_paper


pTU_paper

This repository contains all the scripts used to produce the results shown in the paper. Some of these scripts are just shell wrappers around other commands; they are included because they document the exact parameters used in the pipeline. To install, copy the scripts to a folder and execute them from there. Make sure that all dependencies are in your PATH. The required external dependencies are listed below. The entire enveomics package is not required to reproduce the results shown in the paper: the only requirement from enveomics is ani.rb, a Ruby script for ANI calculation. This pipeline has been tested on Ubuntu 16.04 LTS.

Dependencies:

  • External programs:

    • AcCNET v1.2
    • ani.rb from enveomics repository
    • Blast+ v2.6.0
    • HMMER v3.1b2
    • PlasmidFinder v1.3
    • Gephi v0.9.2
    • graph-tool v2.29
  • Other software and libraries:

    • Matlab 2019a
    • Python 3.6
    • BioPython v1.69
    • R 3.4
    • Perl v5.26.2
    • BioPerl 1.6.924
    • Easyfig v2.2.2
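
Before running the pipeline, it can help to confirm that the command-line dependencies listed above are reachable from your PATH. The following is a minimal sketch, not part of the repository. The executable names are assumptions: `blastn` and `hmmsearch` are the usual BLAST+ and HMMER binaries, `ani.rb` is named above, and the others should be adjusted to your installation.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["blastn", "hmmsearch", "ani.rb", "Rscript", "perl"]


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

This only checks that each executable can be located; it does not verify the versions listed above.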

Pipeline description

To reproduce the manuscript's results, the scripts must be executed in a precise order, as some steps require the output of previous commands. Not all scripts are documented, as some are simply wrappers around other commands. A broad description of the different steps follows:

Plasmid sequences metadata compilation
  • download_ncbi_taxonomy.sh: Generate a database for taxonomy annotations
  • extract_RefSeq84_database.sh: Extract the RefSeq plasmid sequences database (modify according to your database version). One of the outputs of this command is plasmid.lst, a file listing all accession numbers of the plasmid dataset
  • generate_protein_seqs.sh: Extract amino acid sequences from genome GBK files. As argument, use the file plasmid.lst generated by the previous step
  • extract_plasmid_info.sh: Generate the plasmid metadata database plasmid.tsv using GenBank annotations and taxonomy
Relaxase and replicon annotations
  • assign_mob_classes.sh: Find the relaxases of each plasmid. The relaxase HMM profile database is the one from MOBscan. This wrapper takes as argument the same plasmid.lst file used previously
  • assign_pfinder_classes.sh: Type plasmid replicons with PlasmidFinder software and database
Compile network topologies
  • list_subgroups.sh: Generate different subsets of plasmids (Enterobacterales, Escherichia, etc). Some accession numbers are blacklisted because they were found not to be real plasmids
  • append_pGroup_annotation.sh: This is a bit underhanded, but here we update the plasmid metadata database with the PTUs manually defined from the output of the next commands
  • accnet_RefSeq84.sh: Execute AcCNET to generate the plasmidome/ORFeome bipartite network. Use Gephi with the output of this script to produce the network layout
  • ani_RefSeq84.sh: This is the main step of our analysis, as it produces the files later used with Gephi to lay out the PTU network. This script combines two distinct functions:
    • calculate_ani_distances_p.py: Produce the list of pairwise ANI comparisons. In its current form this script is very inefficient and will take several weeks to complete on a personal computer
    • genome_similarity_nerwork.py: Take the ANI comparisons and generate the file of edge similarity and distance measures
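
The two ANI functions above can be pictured with a small sketch. It is not the code from the repository: the function names are hypothetical, and the conversion (similarity = ANI / 100, distance = 1 - similarity) is an assumption made for illustration. The actual scripts may define these measures differently.

```python
from itertools import combinations


def all_pairs(accessions):
    """Return every unordered pair of accessions, in sorted order."""
    return combinations(sorted(accessions), 2)


def ani_to_edges(ani_by_pair):
    """Convert {(a, b): ani_percent} into (a, b, similarity, distance) rows.

    Assumption: similarity = ANI / 100 and distance = 1 - similarity.
    The real script may use a different definition.
    """
    edges = []
    for (a, b), ani in sorted(ani_by_pair.items()):
        similarity = ani / 100.0
        edges.append((a, b, similarity, 1.0 - similarity))
    return edges
```

Each row returned by `ani_to_edges` corresponds to one weighted edge that a tool such as Gephi could import.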
PID: topological algorithm for automatic PTU identification

This algorithm has been implemented with the Matlab files setglobal.m, divide.m, escribe_componentes.m, keephojas.m and dibuja.m

To execute it, simply enter the following commands in the Matlab Command Window:

>> setglobal;
>> divide(G, '0');
>> keephojas;
PTU validation with stochastic block modeling (SBM)
  • graph-tool_SBM_script.py: Script used to generate different SBM models
  • graph-tool_NSBM_script.py: Script used to generate different Hierarchical SBM models
  • simulation.py: Script used for sHSBM performance simulation
  • ptu_classifier.py: Script used for sHSBM PTU classification
  • ptu_comparison.py: Script used to compare the sHSBM and PID PTU classifications
Host range visual representation
  • bipartite_kept_190823.sh: Generate the bipartite network used for host range visualization. Use Gephi to convert it into a monopartite network of the hosts present per PTU
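
The conversion itself is done in Gephi, but the idea of a monopartite projection can be sketched in a few lines. This is an illustration only: it assumes the bipartite network links PTUs to host taxa, and it weights each host pair by the number of PTUs they share, which may differ from the projection used for the paper. The function name and the labels in the usage example are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_hosts(ptu_host_edges):
    """Project (ptu, host) edges onto hosts.

    Returns {(host_a, host_b): number_of_shared_ptus} with host_a < host_b.
    """
    hosts_by_ptu = defaultdict(set)
    for ptu, host in ptu_host_edges:
        hosts_by_ptu[ptu].add(host)

    weights = defaultdict(int)
    for hosts in hosts_by_ptu.values():
        # Every pair of hosts seen in the same PTU gains one shared PTU.
        for a, b in combinations(sorted(hosts), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Usage with made-up labels:
example = [("ptu1", "hostA"), ("ptu1", "hostB"),
           ("ptu2", "hostA"), ("ptu2", "hostB"), ("ptu2", "hostC")]
print(project_onto_hosts(example))
```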
Checking the networks
  • Connections_v2.py: Calculate plasmid/HpC statistics of bipartite network stratified by the taxonomic levels of nodes

  • Connection_plots.R: R script for visualization of Connections_v2.py output

  • pANI_prepare_data.sh: Compile a BLAST database of plasmid fragments sized for alignment fraction (AF) calculation

  • pANI_Enterobacterales.sh: Generate the pairwise list of alignment fraction (AF) results. Again, this step will take too long to execute on a personal computer

  • summarize_pGroups_info.sh: Generate a basic description of PTU composition

  • calculate_cluster_density.py: Calculate inter and intra-cluster density of PTU clusters

  • check_database_redundancy.py: Verify the percentage of plasmid duplication in PTU clusters
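
As an illustration of what intra- and inter-cluster density mean, here is a minimal sketch. It is not the repository's calculate_cluster_density.py. It assumes an undirected simple graph and defines each density as the number of observed edges divided by the number of possible edges of that kind; the script in this repository may use a different definition.

```python
def cluster_densities(edges, cluster_of):
    """Return (intra_density, inter_density) for an undirected simple graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    cluster_of: {node: cluster_label} covering every node.
    """
    edges = list(edges)

    # Count how many nodes each cluster contains.
    sizes = {}
    for label in cluster_of.values():
        sizes[label] = sizes.get(label, 0) + 1

    n = len(cluster_of)
    possible_total = n * (n - 1) // 2
    possible_intra = sum(s * (s - 1) // 2 for s in sizes.values())
    possible_inter = possible_total - possible_intra

    intra = sum(1 for u, v in edges if cluster_of[u] == cluster_of[v])
    inter = len(edges) - intra

    intra_density = intra / possible_intra if possible_intra else 0.0
    inter_density = inter / possible_inter if possible_inter else 0.0
    return intra_density, inter_density
```

Under this definition, a well-separated clustering shows an intra-cluster density close to 1 and an inter-cluster density close to 0.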

Expected output and execution time

The expected outputs of this pipeline are the files defining the networks shown in the paper and the list of plasmids classified into different PTUs. The Gephi network files corresponding to the Plasmidome/ORFeome bipartite network (Figure 1), the full RefSeq84 plasmidome PTU network (Figures 3 and 6) and the Enterobacterales plasmidome subset (Figure 4) can be downloaded from the Supplementary Material attached to the paper. These files are, respectively, Supplementary Files SF4, SF1, and SF2. Supplementary Table ST5 lists the plasmids automatically classified into different PTUs after applying the PID and sHSBM algorithms to the adjacency matrix of the RefSeq84 plasmidome network.

Completing the full pipeline on a personal computer will take a few weeks because the number of pairwise ANI comparisons grows quadratically with the number of plasmids. Moreover, the Plasmidome/ORFeome bipartite AcCNET network is big enough to endanger the normal execution of Gephi on typical personal computers. ANI networks, being monopartite, are not yet limited by plasmid number.
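
To see why the number of ANI comparisons dominates the runtime, note that n plasmids give n * (n - 1) / 2 unordered pairs. The short sketch below prints that count for a few dataset sizes. The sizes are hypothetical and chosen only to show the growth; they are not the size of the RefSeq84 dataset.

```python
from math import comb


def pairwise_comparisons(n_plasmids):
    """Number of unordered plasmid pairs: n * (n - 1) / 2."""
    return comb(n_plasmids, 2)


# Hypothetical dataset sizes: doubling n roughly quadruples the work.
for n in (1_000, 2_000, 4_000):
    print(n, pairwise_comparisons(n))
# prints:
# 1000 499500
# 2000 1999000
# 4000 7998000
```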

License

All this software is released under the GPL license.

About

Scripts and data used for the PTU definition paper
