PTU_paper

This repository contains all scripts used to produce the results shown in the paper. Some of these scripts are just shell wrappers around other commands; they are included because they document the exact parameters used in the pipeline. To install, copy the scripts to a folder and execute them from there. Make sure that all dependencies are in your PATH. The required external dependencies are listed below. The entire enveomics package is not required to reproduce the results shown in the paper; the only requirement from enveomics is ani.rb, a Ruby script for ANI calculation. This pipeline has been tested on Ubuntu 16.04 LTS.

Dependencies:

External programs:
- AcCNET v1.2
- ani.rb from enveomics repository
- Blast+ v2.6.0
- HMMER v3.1b2
- PlasmidFinder v1.3
- Gephi v0.9.2
- graph-tool v2.29
Other software and libraries:
- Matlab 2019a
- Python 3.6
- BioPython v1.69
- R 3.4
- Perl v5.26.2
- BioPerl 1.6.924
- Easyfig v2.2.2
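
Since every dependency must be reachable through PATH, a quick check before starting may save time. The following is a minimal sketch and not part of the repository; the executable names (blastn, hmmscan, ani.rb, python3, Rscript, perl) are assumptions about how the tools are installed and should be adjusted to your system. It does not check versions.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["blastn", "hmmscan", "ani.rb", "python3", "Rscript", "perl"]


def find_missing_tools(tools):
    """Return the tools that cannot be located through the PATH variable."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found in PATH: " + ", ".join(missing))
else:
    print("All listed tools were found in PATH.")
```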

Pipeline description

To reproduce the manuscript's results, the scripts must be executed in a precise order, as some steps require the output of previous commands. Not all scripts are documented, as some are simply wrappers around other commands. A broad description of the different steps follows:

Plasmid sequences metadata compilation

download_ncbi_taxonomy.sh: Generate a database for taxonomy annotations
extract_RefSeq84_database.sh: Extract the RefSeq plasmid sequence database (modify according to your database version). One of the outputs of this command is plasmid.lst, a file listing all accession numbers of the plasmid dataset
generate_protein_seqs.sh: Extract amino acid sequences from genome GBK files. Use the plasmid.lst file generated by the previous step as the argument
extract_plasmid_info.sh: Generate the plasmid metadata database plasmid.tsv using GenBank annotations and taxonomy

Relaxase and replicon annotations

assign_mob_classes.sh: Find the relaxasome of each plasmid. The relaxase HMM profile database comes from MOBscan. This wrapper takes the same plasmid.lst file used previously as its argument
assign_pfinder_classes.sh: Type plasmid replicons with PlasmidFinder software and database

Compile network topologies

list_subgroups.sh: Generate different subsets of plasmids (Enterobacterales, Escherichia, etc.). Some accession numbers are blacklisted because they were found not to be real plasmids
append_pGroup_annotation.sh: Update the plasmid metadata database with the PTUs that were manually defined based on the output of the following commands. This step is admittedly out of order, since it depends on results produced later in the pipeline
accnet_RefSeq84.sh: Execute AcCNET to generate the plasmidome/ORFeome bipartite network. Use Gephi with the output of this script to produce the network layout
ani_RefSeq84.sh: This is the main step of our analysis, as it produces the files later used with Gephi to lay out the PTU network. This script combines two distinct functions:
- calculate_ani_distances_p.py: Produce the list of pairwise ANI comparisons. In its current form this script is very inefficient to run on a personal computer and will take several weeks to complete
- genome_similarity_network.py: Take the ANI comparisons and generate the file of edge similarity and distance measures
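
The two functions of the ANI step can be illustrated with a small sketch. It is not the code in the repository: the accession names and ANI values are made up, and defining distance as 1 minus similarity is an assumption that may differ from what the actual scripts compute.

```python
from itertools import combinations


def pairwise_jobs(accessions):
    """Return every unordered pair of accessions exactly once."""
    return combinations(sorted(accessions), 2)


def edge_measures(ani_percent):
    """Convert an ANI percentage into (similarity, distance) in [0, 1].

    Assumption: distance is defined as 1 - similarity. The repository's
    scripts may use a different definition.
    """
    similarity = ani_percent / 100.0
    return similarity, 1.0 - similarity


# Hypothetical accession names and ANI values, for illustration only.
accessions = ["plasmid_A", "plasmid_B", "plasmid_C"]
ani_results = {("plasmid_A", "plasmid_B"): 98.5, ("plasmid_B", "plasmid_C"): 75.0}

edges = []
for pair in pairwise_jobs(accessions):
    if pair in ani_results:  # pairs without a reported ANI get no edge
        similarity, distance = edge_measures(ani_results[pair])
        edges.append((pair[0], pair[1], similarity, distance))
```

Each tuple in edges holds the two accessions followed by the similarity and the distance, which is the kind of edge list that a tool such as Gephi can import.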

PID: topological algorithm for automatic PTU identification

This algorithm has been implemented with the Matlab files setglobal.m, divide.m, escribe_componentes.m, keephojas.m and dibuja.m

To execute it, enter the following commands in the Matlab Command Window:

>> setglobal;
>> divide(G, '0');
>> keephojas;

PTU validation with stochastic block modeling (SBM)

graph-tool_SBM_script.py: Script used to generate different SBM models
graph-tool_NSBM_script.py: Script used to generate different Hierarchical SBM models
simulation.py: Script used for sHSBM performance simulation
ptu_classifier.py: Script used for sHSBM PTU classification
ptu_comparison.py: Script used to compare the sHSBM and PID PTU classifications
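
As an illustration of what comparing two classifications can involve, the sketch below computes the Rand index, the fraction of plasmid pairs on which two partitions agree. This is one common agreement measure chosen here for illustration; ptu_comparison.py may use a different one. The plasmid names and labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    Both arguments map each item to a cluster label and must share keys.
    """
    items = sorted(labels_a)
    agree = total = 0
    for x, y in combinations(items, 2):
        same_a = labels_a[x] == labels_a[y]
        same_b = labels_b[x] == labels_b[y]
        agree += same_a == same_b  # True counts as 1, False as 0
        total += 1
    return agree / total if total else 1.0


# Hypothetical classifications of four plasmids by two methods.
pid_labels = {"p1": "PTU-1", "p2": "PTU-1", "p3": "PTU-2", "p4": "PTU-2"}
shsbm_labels = {"p1": "a", "p2": "a", "p3": "b", "p4": "c"}

agreement = rand_index(pid_labels, shsbm_labels)
```

In this toy case the two methods disagree only on the pair p3 and p4, so five of the six pairs agree.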

Host range visual representation

bipartite_kept_190823.sh: Generate the bipartite network used for host range visualization. Use Gephi to convert it into a monopartite network of the hosts present per PTU
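
The conversion from a bipartite to a monopartite network is done in Gephi, but the idea can be sketched in a few lines. The example below assumes that two hosts are linked when they share at least one PTU and that the edge weight counts the shared PTUs; the projection settings used for the paper may differ, and the PTU labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_hosts(ptu_host_edges):
    """Project PTU-host edges onto hosts; weight = number of shared PTUs."""
    hosts_by_ptu = defaultdict(set)
    for ptu, host in ptu_host_edges:
        hosts_by_ptu[ptu].add(host)

    weights = defaultdict(int)
    for hosts in hosts_by_ptu.values():
        for pair in combinations(sorted(hosts), 2):
            weights[pair] += 1
    return dict(weights)


# Hypothetical bipartite edges (PTU label, host genus), for illustration only.
bipartite_edges = [
    ("PTU_x", "Escherichia"),
    ("PTU_x", "Klebsiella"),
    ("PTU_y", "Escherichia"),
    ("PTU_y", "Klebsiella"),
    ("PTU_y", "Salmonella"),
]

host_network = project_onto_hosts(bipartite_edges)
```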

Checking the networks

Conections_v2.py: Calculate plasmid/HpC statistics of the bipartite network, stratified by the taxonomic level of the nodes
Conection_plots.R: R script for visualization of the Conections_v2.py output
pANI_prepare_data.sh: Compile a BLAST database of plasmid fragments sized for alignment fraction (AF) calculation
pANI_Enterobacterales.sh: Generate the pairwise list of alignment fraction (AF) results. Again, this step will take too long to execute on a personal computer
summarize_pGroups_info.sh: Generate a basic description of PTU composition
calculate_clusters_density.py: Calculate the inter- and intra-cluster density of PTU clusters
check_dataset_redundancy.py: Verify the percentage of plasmid duplication in PTU clusters
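
For readers unfamiliar with cluster density, the sketch below shows one common definition: the number of observed edges divided by the number of possible node pairs, computed separately for pairs inside the same cluster (intra) and pairs spanning two clusters (inter). It assumes an undirected graph without self-loops and is not taken from the repository, which may define the measure differently. Node names and cluster labels are hypothetical.

```python
from itertools import combinations


def cluster_densities(edges, cluster_of):
    """Return (intra_density, inter_density) for an undirected simple graph."""
    nodes = sorted(cluster_of)
    possible_intra = possible_inter = 0
    for u, v in combinations(nodes, 2):
        if cluster_of[u] == cluster_of[v]:
            possible_intra += 1
        else:
            possible_inter += 1

    observed_intra = observed_inter = 0
    for u, v in {tuple(sorted(edge)) for edge in edges}:  # drop duplicate edges
        if cluster_of[u] == cluster_of[v]:
            observed_intra += 1
        else:
            observed_inter += 1

    intra = observed_intra / possible_intra if possible_intra else 0.0
    inter = observed_inter / possible_inter if possible_inter else 0.0
    return intra, inter


# Hypothetical toy network: two clusters joined by a single edge.
cluster_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("c", "d")]

intra_density, inter_density = cluster_densities(toy_edges, cluster_of)
```

Well-separated clusters show an intra-cluster density close to 1 and an inter-cluster density close to 0.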

Expected output and execution time

The expected outputs of this pipeline are the files defining the networks shown in the paper and the list of plasmids classified into different PTUs. The Gephi network files corresponding to the Plasmidome/ORFeome bipartite network (Figure 1), the full RefSeq84 plasmidome PTU network (Figures 3 and 6) and the Enterobacterales plasmidome subset (Figure 4) can be downloaded from the Supplementary Material attached to the paper. These files are, respectively, Supplementary Files SF4, SF1, and SF2. Supplementary Table ST5 lists the plasmids automatically classified into different PTUs after applying the PID and sHSBM algorithms to the adjacency matrix of the RefSeq84 plasmidome network.

The full pipeline will take a few weeks to complete on a personal computer because the number of pairwise ANI comparisons grows quadratically with the number of plasmids. Moreover, the Plasmidome/ORFeome bipartite AcCNET network is large enough to compromise the normal execution of Gephi on a typical personal computer. ANI networks, being monopartite, are not yet limited by the number of plasmids.
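
The quadratic growth can be made concrete with a short calculation: n plasmids require n(n-1)/2 unordered pairwise comparisons. The sketch below is only a back-of-the-envelope estimate; the dataset size and the time per comparison are hypothetical placeholders, not measurements from this pipeline.

```python
def pairwise_comparisons(n_plasmids):
    """Number of unordered pairs among n plasmids: n * (n - 1) / 2."""
    return n_plasmids * (n_plasmids - 1) // 2


def estimated_days(n_plasmids, seconds_per_comparison, n_workers=1):
    """Rough wall-clock estimate in days, assuming perfectly parallel workers."""
    total_seconds = pairwise_comparisons(n_plasmids) * seconds_per_comparison
    return total_seconds / n_workers / 86400


# Hypothetical inputs: 10,000 plasmids and 0.05 s per comparison.
n_pairs = pairwise_comparisons(10_000)
days_single_core = estimated_days(10_000, 0.05)
days_eight_cores = estimated_days(10_000, 0.05, n_workers=8)
```

Doubling the number of plasmids roughly quadruples the number of comparisons, which is why the ANI step dominates the running time.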

License

All this software is released under the GPL license.
