CCB-SB/plsdb

Pipeline for data collection

News

Our manuscript discussing the new features of PLSDB was accepted to the 2022 Nucleic Acids Research Database Issue! The manuscript can be found here.

Summary

pipeline graph

  • retrieve_plasmid: Retrieve plasmid data from NCBI database
  • filter_metadata: Remove incomplete or non-bacterial records by filtering on metadata attributes:
    • Duplicated entry (NUCCORE_DuplicatedEntry)
    • Record description (regular expression from Orlek et al.)
    • Assembly Latest (True)
    • Assembly Completeness (Complete)
      • If there is no assembly: the completeness status of the nuccore record has to be complete
      • If there is an assembly: the assembly status of the latest version has to be Complete genome
    • By taxonomy: superkingdom taxon should be Bacteria
  • filter_sequences: To remove identical records
    • Group plasmids with identical sequences
    • Among these groups select one record
      • Prefer the one from RefSeq
      • Prefer the one with location information
      • Prefer the one with an assembly
      • Prefer the most recent assembly release date
      • Prefer the one with a biosample
      • Prefer the newest nuccore creation date
      • Prefer the one with the highest coverage
      • If all are equal, choose the first one
  • filter_rmlst: To remove putative chromosomal sequences
    • Obtain rMLST database
    • Create a local NCBI chromosomal sequences database
    • The plasmid sequences are aligned against the rMLST allele sequences and local NCBI-db
    • Records having more than 5 unique rMLST loci are searched in NCBI chromosomal sequences using BLASTn (remote access)
    • Records with hits are removed
  • filter_artifacts: Remove possible artifact sequences
  • process_abricate: Annotate antimicrobial resistance or virulence genes.
    • BLASTn search in DBs provided by ABRicate
      • Blaster from CGE core module is used for search and pre-processing
      • Filtering:
        • Identity and coverage cutoffs
        • Overlapping matches are removed
      • All hits are collected into one file
  • process_pmlst: Annotate using pMLST
    • For each found replicon use the associated pMLST scheme (if available)
    • Use mlst to perform the pMLST analysis
    • Process the results
      • Set IncF ST according to the FAB formula (Villa et al.)
    • Create BLAST database file from plasmid FASTA
    • Create sketches from plasmid FASTA using Mash
  • dstream_sim_records: List of similar plasmids
    • Use Mash to compute pairwise distances (use a distance cutoff)
    • Create a list of unique pairs
  • dstream_umap: Embedding
    • Compute pairwise distances between plasmids using Mash
    • Compute embedding using UMAP
  • process_infotable: Create info table
    • Record information
    • Embedding coordinates
    • PlasmidFinder hits
    • pMLST hits
  • dstream_compare: Compare created table to an older version
    • Which plasmid records were removed
    • Which plasmid records were added
    • Which plasmid records changed
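
The record selection in filter_sequences amounts to a lexicographic ranking over the listed preferences. A minimal Python sketch of that idea follows. The Record class and all of its field names are hypothetical and are not the pipeline's actual column names.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    # All field names are hypothetical; the pipeline's real columns may differ.
    accession: str
    is_refseq: bool
    has_location: bool
    assembly_release: Optional[date]  # None if the record has no assembly
    has_biosample: bool
    created: date
    coverage: float


def preference_key(r: Record):
    # Tuples compare element by element, mirroring the order of the criteria.
    return (
        r.is_refseq,
        r.has_location,
        r.assembly_release is not None,
        r.assembly_release or date.min,
        r.has_biosample,
        r.created,
        r.coverage,
    )


def select_representative(group):
    # max() returns the first maximal element, so full ties keep the first record.
    return max(group, key=preference_key)


# Made-up group of records that share an identical sequence.
group = [
    Record("A", False, True, date(2020, 1, 1), True, date(2020, 1, 2), 50.0),
    Record("B", True, False, None, False, date(2019, 5, 1), 30.0),
    Record("C", True, True, None, True, date(2018, 3, 1), 10.0),
]
print(select_representative(group).accession)
```

Encoding the preferences as one tuple keeps the priority order visible in a single place, so reordering the criteria only means reordering the tuple.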

Preparations

PubMLST data

This data processing pipeline makes use of the PubMLST website developed by Keith Jolley (Jolley & Maiden 2010, BMC Bioinformatics, 11:595) and sited at the University of Oxford. The development of that website was funded by the Wellcome Trust.

rMLST data

Notes:

  • Requires a PubMLST account.
  • Requires a graphical interface.
  • The PubMLST account needs to request access to the Ribosomal MLST locus/sequence definitions database from the rMLST admin. Access is normally granted within a day.
  • Error "Message: 'chromedriver' executable needs to be in PATH": Make sure that chromedriver is installed. As chromium already comes with a chromedriver installation, you can try: sudo pacman -Syyu followed by sudo pacman -S chromium

To remove putative chromosomal sequences, an rMLST analysis is performed, which requires rMLST sequences from PubMLST. There is an API for the PubMLST services; however, using it seems to require much more effort than downloading the data through a web browser. Thus, there is a rule (retrieve_rmlst_data) that downloads the sequences automatically (given the login data). This rule needs a graphical interface, so please run it locally on your computer.

A login and password are required. Please create an account and specify your credentials in config.yml.

Note: The cookie agreement might cause problems. Minor changes are required if the link text "Got it!" is changed to a different text.

pMLST

There is a mapping from PlasmidFinder IDs to pMLST profile names in pipeline.json (pmlst/map). It may require an update depending on which pMLST schemes are available from PubMLST and which IDs are currently in the PlasmidFinder database.

  • Note: Information on pMLST schemes is shown in: https://pubmlst.org/plasmid/

  • Path to installed pMLST schemes: ~/miniconda3/envs/plsdb/db/pubmlst/

    • Each scheme is one directory (also listed in the log file when created)
  • Path to installed PlasmidFinder DB: ~/miniconda3/envs/plsdb/db/plasmidfinder/sequences
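
Whether the mapping needs an update can be checked by comparing the scheme names it references with the installed scheme directories. The Python sketch below assumes that pipeline.json nests the mapping under the keys "pmlst" and "map" and that the mapped values are scheme directory names. Both are assumptions, and the example IDs are made up.

```python
import json
from pathlib import Path


def load_pmlst_map(pipeline_json):
    # Assumption: the mapping sits under the keys "pmlst" -> "map".
    with open(pipeline_json) as fh:
        return json.load(fh)["pmlst"]["map"]


def installed_schemes(scheme_dir):
    # Each installed scheme is one directory.
    return {p.name for p in Path(scheme_dir).expanduser().iterdir() if p.is_dir()}


def missing_schemes(mapping, available):
    # Scheme names referenced by the mapping but not installed.
    return sorted(set(mapping.values()) - set(available))


# Made-up mapping and installed schemes, for illustration only.
example_map = {"IncF_example": "incF", "IncI1_example": "incI1"}
print(missing_schemes(example_map, {"incF"}))
```

With real data, load_pmlst_map("pipeline.json") and installed_schemes("~/miniconda3/envs/plsdb/db/pubmlst/") would supply the two arguments of missing_schemes.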

ABRicate

Please check whether the most recent version of ABRicate contains the most recent database links in abricate-get_db.

IMPORTANT: Currently, ABRicate (version 1.0.1) does not update some databases correctly:

  • ARG-ANNOT: The URL changed

The file patch_abricate-get_db (abricate: getdb_bin param in config.yml) should resolve these issues. It is a copy of abricate-get_db with the ARG-ANNOT URL changed. If you find other deprecated links, please replace them and set the param abricate: replace_getdb: True in config.yml.

API keys

NCBI data

To retrieve data from NCBI, please obtain an API key and specify it in the config.yml.
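
As an illustration of where the key ends up, the NCBI E-utilities accept it as the api_key query parameter. The Python sketch below only builds such a request URL. The search term is a made-up example, and the pipeline itself may use a different client and different queries.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, api_key, db="nuccore", retmax=20):
    # api_key is the key obtained from your NCBI account settings.
    params = {"db": db, "term": term, "retmax": retmax, "api_key": api_key}
    return EUTILS_ESEARCH + "?" + urlencode(params)


# Hypothetical search term and placeholder key, for illustration only.
print(esearch_url("plasmid[Title]", "YOUR_API_KEY"))
```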

Location queries

To map location names to coordinates, the Nominatim API and the Google API are used. The Google API is only used for comparative purposes, as its policy doesn't allow the storage of Google's content (more here). Google requires an API key, for which you need to register (more info).
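
A location lookup against the public Nominatim search endpoint can be sketched in Python as follows. This is not the pipeline's own code: the helper names are hypothetical, and the sample response is made up to show the shape of a hit (Nominatim returns a JSON list whose entries carry "lat" and "lon" as strings).

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def nominatim_url(location_name):
    # Ask for JSON output and only the best hit.
    query = {"q": location_name, "format": "json", "limit": 1}
    return NOMINATIM_SEARCH + "?" + urlencode(query)


def first_coordinates(results):
    # Return (latitude, longitude) of the first hit, or None if nothing was found.
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


# Made-up response, standing in for the parsed JSON of a real request.
sample_response = [{"lat": "52.5", "lon": "13.4", "display_name": "Example"}]
print(nominatim_url("Berlin"))
print(first_coordinates(sample_response))
```

Returning None for an empty result makes unknown locations easy to collect for the manual inspection step.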

BIOSAMPLE_Host

Already known hosts are in hosts_version.csv. Run the rule process_create_host_mapping and manually check the new versions of host mapping. Find more details at the end of the log file of the rule (logs/process_create_host_mapping.log) or in the rule process_manually_inspect_hosts.

BIOSAMPLE_Location

Already known locations are in locations_version.csv, and corrections to find some specific locations are listed in location_correction_version.csv.

Run the rule process_parse_locations and manually check the new versions of location and location_corrections. Find more details in the rule process_manually_inspect_locations.

Conda & Snakemake

If needed, install (mini-)conda

cd ~
# get miniconda (for linux)
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
# install
bash Miniconda3-latest-Linux-x86_64.sh
# set path to binaries in your ~/.bashrc
export PATH=$HOME/miniconda3/bin:$PATH

If needed, install snakemake

# If required, install mamba package manager in your base env
conda install  -c conda-forge mamba
# Install snakemake
mamba create -c conda-forge -c bioconda -n snakemake snakemake

Current versions:
  • conda==23.7.3
  • snakemake==7.32.4

Comparing new and old versions

The last rule in the pipeline requires a "master" table from an older version. The path has to be set in config.yml (attribute previous_table) and the file must exist.
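
The comparison itself boils down to set operations on record identifiers. A minimal Python sketch, assuming a tab-separated table with a unique key column; the column name "accession" and the example rows are hypothetical:

```python
import csv
import io


def load_table(handle, key="accession"):
    # Assumption: tab-separated table with one unique key column.
    return {row[key]: row for row in csv.DictReader(handle, delimiter="\t")}


def compare_tables(old, new):
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    # A record counts as changed if any of its fields differs.
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return removed, added, changed


# Made-up tables, for illustration only.
old = load_table(io.StringIO("accession\tlength\nP1\t100\nP2\t200\nP3\t300\n"))
new = load_table(io.StringIO("accession\tlength\nP2\t200\nP3\t350\nP4\t400\n"))
print(compare_tables(old, new))
```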

Running the pipeline

  • Do not ignore log files. They contain timestamps and useful information to detect possible errors. Add more information if necessary.
  • Take notes: Write down what was changed: version updates, bug fixes etc.
  • Do not run the pipeline on the first day of a month (usually many requests fail or return nothing)
  • Do not execute the complete pipeline: use snakemake --use-conda -c {CORES} --force target_rule to run groups of rules
  • Before running something, list the commands to be executed: snakemake -np. See Section
  • If a step where some data is fetched from NCBI fails, try to re-run the step
    • Sometimes NCBI returns an empty result or the request fails
    • Re-running the command usually solves the problem
  • The longest steps are:
    • Automatic rules:
      • retrieve_nuccoredb_seqs: Download of putative chromosomes from NCBI server to create local db (2023-10-06: ~7h)
      • process_rmlst_blastn: BLASTn search against rmlst database (2023-10-05: ~16h)
      • filtering_rmlst: filter plasmid data using information from nuccoredb and rmlst_blastn (2023-10-06: ~2h)
    • Manual curation rules:
      • process_manually_inspect_hosts: Depends on the number of unknown hosts, but reserve at least 2 days (16h)
      • process_manually_inspect_locations: Depends on the number of unknown locations, but reserve at least 1 day (8h)
  • Other steps usually require only a few minutes and should finish in under one hour
  • Try to run all steps requiring updated data on the same day
    • I.e. getting new data for rMLST, abricate, pMLST, NCBI data (retrieve rules)
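
The advice to re-run a failed NCBI step can also be applied inside custom fetch code. The Python sketch below shows a small retry wrapper that treats empty results and network errors as failures. It is an illustration only and not part of the pipeline; the function name and defaults are hypothetical.

```python
import time


def retry(fetch, attempts=3, delay=5.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result or attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = fetch()
            if result:  # treat an empty response as a failure
                return result
            last_error = ValueError("empty result")
        except OSError as err:  # network errors, including urllib.error.URLError
            last_error = err
        if attempt < attempts:
            sleep(delay * attempt)  # wait a little longer after each failure
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error


# Simulated fetch that returns nothing twice before succeeding.
calls = []


def flaky_fetch():
    calls.append(1)
    return "" if len(calls) < 3 else ">record"


print(retry(flaky_fetch, sleep=lambda seconds: None))
```

The sleep argument is injectable so that the waiting can be switched off in tests.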

Groups of execution (sequentially)

  • RETRIEVAL:
    • Plasmid NCBI retrieval: retrieve_plasmid_metadata retrieve_plasmid_taxid filter_metadata retrieve_fasta
    • NCBI chromosomal sequences retrieval: retrieve_nuccoredb_ids retrieve_nuccoredb_seqs process_make_nuccoredb_blastdb
    • ABRicate retrieval: retrieve_abricate_getdb retrieve_abricatedb
    • pMLST retrieval: retrieve_pmlst_data process_make_pmlst_blastdb
    • rMLST retrieval (graphical interface): retrieve_rmlst_data process_make_rmlst_blastdb
    • human disease ontology: retrieve_disease_ont
  • FILTERING:
    • filter_sequences process_rmlst_blastn filter_rmlst filter_artifacts
  • PROCESSING:
    • process_calculate_GC process_abricate process_join_abricate process_pmlst
    • process_mash_sketch process_mash_dist process_umap process_mash_dist_sim process_dstream_sim_records
  • MANUAL_CURATION:
    • process_create_host_mapping process_manually_inspect_hosts process_infer_host
    • process_disease_ont (intermediate step, not manual curation)
    • process_parse_locations process_manually_inspect_locations
  • DSTREAM:
    • process_infotable dstream_krona_xml dstream_krona_html dstream_summary dstream_compare
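
Running the groups one rule at a time can be scripted. The Python sketch below builds the snakemake command lines from the flags mentioned in "Running the pipeline" and prints them as dry runs. Storing the groups in a dict and targeting one rule per call are assumptions of this sketch; only two groups are shown.

```python
import shlex

# Rule names are copied from the list above; the dict layout is illustrative.
GROUPS = {
    "FILTERING": [
        "filter_sequences",
        "process_rmlst_blastn",
        "filter_rmlst",
        "filter_artifacts",
    ],
    "DSTREAM": [
        "process_infotable",
        "dstream_krona_xml",
        "dstream_krona_html",
        "dstream_summary",
        "dstream_compare",
    ],
}


def snakemake_command(target_rule, cores, dry_run=True):
    cmd = ["snakemake", "--use-conda", "-c", str(cores), "--force", target_rule]
    if dry_run:
        cmd.append("-np")  # only list the commands that would be executed
    return cmd


for rule in GROUPS["FILTERING"]:
    print(shlex.join(snakemake_command(rule, cores=4)))
```

To actually execute a step, the returned list could be passed to subprocess.run(cmd, check=True) with dry_run=False, after reviewing the dry-run output.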

References

  • Mash: "Mash: fast genome and metagenome distance estimation using MinHash", B. D. Ondov, T. J. Treangen, P. Melsted, A. B. Mallonee, N. H. Bergman, S. Koren and A. M. Phillippy, Genome Biology, 2016, [paper link](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x), repository link
  • UMAP: "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction", L. McInnes, J. Healy, N. Saul and L. Großberger, Journal of Open Source Software, 2018, paper link, repository link
  • BLAST: "Basic local alignment search tool.", S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, J. Mol. Biol. 215:403-410, BLAST paper link, BLAST+ paper link, tool link
  • ABRicate: Tool implemented by Thorsten Seemann repository link
  • ARG-ANNOT: "ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes", S. K. Gupta, B. R. Padmanabhan, S. M. Diene, R. Lopez-Rojas, M. Kempf, L. Landraud, and J. M. Rolain, Antimicrob. Agents Chemother., 2014, paper link
  • CARD: "CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database.", B. Jia, A. R. Raphenya, B. Alcock, N. Waglechner, P. Guo, K. K. Tsang, B. A. Lago, B. M. Dave, S. Pereira, A. N. Sharma, S. Doshi, M. Courtot, R. Lo, L. E. Williams, J. G. Frye, T. Elsayegh, D. Sardar, E. L. Westman, A. C. Pawlowski, T. A. Johnson, F. S. Brinkman, G. D. Wright, and A. G. McArthur, Nucleic Acids Res., 2017, paper link
  • ResFinder: "Identification of acquired antimicrobial resistance genes", E. Zankari, H. Hasman, S. Cosentino, M. Vestergaard, S. Rasmussen, O. Lund, F. M. Aarestrup, and M. V. Larsen, J. Antimicrob. Chemother., 2012, paper link
  • VFDB: "VFDB: a reference database for bacterial virulence factors", L. Chen, J. Yang, J. Yu, Z. Yao, L. Sun, Y. Shen, and Q. Jin, Nucleic Acids Res., 2005, paper link
  • PlasmidFinder: "In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing.", A. Carattoli, E. Zankari, A. Garcia-Fernandez, M. Voldby Larsen, O. Lund, L. Villa, F. Møller Aarestrup, and H. Hasman, Antimicrob. Agents Chemother., 2014, paper link, repository link
  • pMLST in PubMLST: web-site
  • mlst: Tool implemented by Thorsten Seemann, repository link
  • OpenCageData: An API to convert coordinates to and from places, web-site
  • rMLST: rMLST at PubMLST
  • Jolley et al., Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain, K. A. Jolley, C. M. Bliss, J. S. Bennett, H. B. Bratcher, C. Brehony, F. M. Colles, H. Wimalarathna, O. B. Harrison, S. K. Sheppard, A. J. Cody, M. C. Maiden, Microbiology, 2012, paper link
  • Orlek et al.: Ordering the mob: Insights into replicon and MOB typing schemes from analysis of a curated dataset of publicly available plasmids, A. Orlek, H. Phan, A. E. Sheppard, M. Doumith, M. Ellington, T. Peto, D. Crook, A. S. Walker, N. Woodford, M. F. Anjum, N. Stoesser, Plasmid, 2017, paper link
  • Yutin et al.: Distribution of ribosomal protein genes across bacterial genome partitions, N. Yutin, P. Puigbò, E. V. Koonin, Y. I. Wolf, PLoS One, 2012, paper link
  • Villa et al.: Replicon sequence typing of IncF plasmids carrying virulence and resistance determinants, L. Villa, A. García-Fernández, D. Fortini, A. Carattoli, Journal of Antimicrobial Chemotherapy, 2010, paper link
  • CGE core module: repository link
  • fuzzywuzzy: repository link
  • MOB-suite: “MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.” Robertson, James, and John H E Nash. Microbial genomics vol. 4,8 (2018) paper link, repository link
  • taxize: "taxize: taxonomic search and retrieval in R." Chamberlain SA and Szöcs E. F1000Res. 2013;2:191. paper link, repository link
