Benchmarking common tasks on proteins in various languages and packages

diegozea/pdb-benchmarks

 
 

PDB benchmarks

Open source software packages to parse Protein Data Bank (PDB) files and manipulate protein structures exist in many languages, often as part of Bio* projects.

This repository aims to collate benchmarks for common tasks across various languages and packages. The collection of scripts may also be useful to get an idea of how each package works.

Please feel free to contribute scripts for other packages, or submit improvements to the scripts already present. I'm looking for the fastest implementation for each package that makes use of its provided API.

Disclosure: I contributed the BioStructures.jl package to BioJulia.

Tests

  • Parsing 3 PDB files, taken from the benchmarking in [1]:
    • 1CRN - hydrophobic protein (327 atoms)
    • 3JYV - 80S rRNA (57,327 atoms)
    • 1HTQ - multicopy glutamine synthetase (10 models of 97,872 atoms)
  • Counting the number of alanine residues in adenylate kinase (1AKE)
  • Calculating the distance between residues 50 and 60 of chain A in adenylate kinase (1AKE)
  • Calculating the Ramachandran phi/psi angles in adenylate kinase (1AKE)

[1] Gajda MJ, hPDB - Haskell library for processing atomic biomolecular structures in protein data bank format, BMC Research Notes 2013, 6:483
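To make the counting and distance tasks listed above concrete, here is a minimal pure-Python sketch that reads ATOM records by their fixed-width columns. It is not the code of any benchmarked package: the names `Atom`, `parse_atoms`, `count_residues`, `ca_distance` and `atom_line` are hypothetical, the distance is assumed to be the CA-CA distance, and HETATM records, alternate locations and insertion codes are ignored.

```python
import math
from collections import namedtuple

# Hypothetical record type for this sketch; not taken from any package.
Atom = namedtuple("Atom", "name resname chain resseq x y z")

def parse_atoms(lines):
    """Read ATOM records of the first model from fixed-width PDB lines."""
    atoms = []
    for line in lines:
        if line.startswith("ENDMDL"):  # stop after the first model
            break
        if line.startswith("ATOM  "):
            atoms.append(Atom(
                name=line[12:16].strip(),     # columns 13-16
                resname=line[17:20].strip(),  # columns 18-20
                chain=line[21],               # column 22
                resseq=int(line[22:26]),      # columns 23-26
                x=float(line[30:38]),         # columns 31-38
                y=float(line[38:46]),         # columns 39-46
                z=float(line[46:54]),         # columns 47-54
            ))
    return atoms

def count_residues(atoms, resname):
    """Count distinct (chain, residue number) pairs with the given name."""
    return len({(a.chain, a.resseq) for a in atoms if a.resname == resname})

def ca_distance(atoms, chain, res_a, res_b):
    """Distance between the CA atoms of two residues in one chain (assumed definition)."""
    ca = {a.resseq: a for a in atoms if a.chain == chain and a.name == "CA"}
    p, q = ca[res_a], ca[res_b]
    return math.sqrt((p.x - q.x) ** 2 + (p.y - q.y) ** 2 + (p.z - q.z) ** 2)

def atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Build a fixed-width ATOM line for the toy data below."""
    return "ATOM  {:5d} {:<4s} {:>3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, resname, chain, resseq, x, y, z)

# Toy data, not taken from 1AKE.
lines = [
    atom_line(1, "N", "ALA", "A", 50, 0.0, 0.0, 0.0),
    atom_line(2, "CA", "ALA", "A", 50, 1.0, 0.0, 0.0),
    atom_line(3, "CA", "GLY", "A", 51, 2.0, 1.0, 0.0),
    atom_line(4, "CA", "ALA", "A", 60, 4.0, 4.0, 0.0),
]
atoms = parse_atoms(lines)
n_ala = count_residues(atoms, "ALA")   # 2
d = ca_distance(atoms, "A", 50, 60)    # 5.0
```

The packages in this repository differ in how they define the distance between residues, so check each script before comparing numbers.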

The PDB files can be downloaded to the pdbs directory by running source tools/download_pdbs.sh from this directory. If you have all the software installed, and compiled where applicable, you can use the script tools/run_benchmarks.sh from this directory to run the benchmarks. The mean over a number of runs is taken for each benchmark to obtain the values below.

Benchmarks were carried out on an Intel Xeon CPU E5-1620 v3 3.50GHz x 8 processor with 32 GB 2400 MHz DDR4 RAM. The operating system was CentOS v7.4.1708. Time is the elapsed time.
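The timing approach described above (mean elapsed time over several runs) can be sketched in a few lines of Python. This is an illustration only, not the repository's benchmarking scripts: `mean_elapsed`, `n_runs` and `warmup` are hypothetical names, and the default run count is an assumption because the actual number of runs is not stated here.

```python
import time

def mean_elapsed(func, n_runs=10, warmup=0):
    """Mean wall-clock (elapsed) time of func() in seconds over n_runs calls.

    warmup calls are made first and not timed, which mirrors measuring
    after JIT compilation in languages that compile on first use.
    """
    for _ in range(warmup):
        func()
    total = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        func()
        total += time.perf_counter() - start
    return total / n_runs

# Example: time a stand-in workload instead of a real parser call.
t = mean_elapsed(lambda: sum(range(100000)), n_runs=5, warmup=1)
```

`time.perf_counter` measures elapsed time rather than CPU time, matching the statement that the reported time is the elapsed time.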

Software

  • BioJulia v0.2.0 branch running on Julia v0.6.0 (times measured after JIT compilation)
  • MIToS v2.1.1 running on Julia v0.6.0 (times measured after JIT compilation)
  • Biopython v1.71 running on Python v3.6.2
  • ProDy v1.10.7 running on Python v3.6.2
  • MDAnalysis v0.18.0 running on Python v3.6.2
  • Bio3D v2.3-4 running on R v3.5.0
  • Rpdb v2.3 running on R v3.5.0
  • BioPerl v1.007002 running on Perl v5.16.3
  • BioRuby v1.5.1 running on Ruby v2.0.0
  • Victor v1.0 compiled with g++ v7.3.1
  • ESBTL v1.0-beta01 compiled with g++ v7.3.1

Comparison

Note that direct comparison between these times should be treated with caution, as each package does something slightly different. For example, things that increase parsing time include:

  • Parsing the PDB header
  • Accounting for disorder at both the atom and residue (point mutation) level
  • Forming a hierarchical model of the protein that makes access to specific residues, atoms etc. easier and faster after parsing
  • Checking that the PDB format is adhered to at various levels of strictness

Each package supports these to varying degrees.
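As a rough illustration of why a hierarchical model and disorder handling add work at parse time, the sketch below groups flat atom records into nested dictionaries keyed by chain, residue number, atom name and alternate location ID. The function name `build_hierarchy` and the record layout are assumptions made for this example; real packages use their own richer classes, and residue-level disorder (point mutations) is not handled here.

```python
from collections import defaultdict

def build_hierarchy(records):
    """Group records as chain -> residue number -> atom name -> {altloc: coords}."""
    model = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    for chain, resseq, name, altloc, coords in records:
        model[chain][resseq][name][altloc] = coords
    return model

# Toy records: (chain, residue number, atom name, alt loc ID, coordinates).
records = [
    ("A", 50, "CA", " ", (1.0, 0.0, 0.0)),
    ("A", 60, "CA", "A", (4.0, 4.0, 0.0)),  # disordered atom, first location
    ("A", 60, "CA", "B", (4.5, 4.0, 0.0)),  # same atom, second location
]
model = build_hierarchy(records)

# After the one-off grouping cost, lookups are direct rather than a scan.
ca_60 = model["A"][60]["CA"]
```

A parser that skips this grouping can finish sooner, but later queries such as "atom CA of residue 60 in chain A" then need a scan over all atoms.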

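|  | BioJulia | MIToS | Biopython | ProDy | MDAnalysis | Bio3D | Rpdb | BioPerl | BioRuby | Victor | ESBTL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parse 1CRN / ms | 1.3 | 1.9 | 8.0 | 4.2 | 7.4 | 13 | 12 | 72 | 49 | 35 | 4.5 |
| Parse 3JYV / s | 0.42 | 0.39 | 1.1 | 0.47 | 0.54 | 0.85 | 1.3 | 3.8 | 1.4 | 9.9 | 0.65 |
| Parse 1HTQ / s | 5.5 | 17 | 24 | 2.0 | 2.2 | 4.5 | 22 | 81 | 23 | 25 | - |
| Count / ms | 0.74 | 0.078 | 0.52 | 14 | 0.15 | 0.23 | 0.28 | 1.2 | 0.23 | - | - |
| Distance / ms | 0.068 | 0.010 | 0.54 | 72 | 1.1 | 26 | 1.6 | 1.3 | 1.9 | - | - |
| Ramachandran / ms | 6.9 | - | 150 | 330 | 2000 | - | - | - | - | - | - |
| Language | Julia | Julia | Python | Python | Python | R | R | Perl | Ruby | C++ | C++ |
| Parses header |  |  |  |  |  |  |  |  |  |  |  |
| Hierarchical parsing |  |  |  |  |  |  |  |  |  |  |  |
| Supports disorder |  |  |  |  |  |  |  |  |  |  |  |
| Writes PDBs |  |  |  |  |  |  |  |  |  |  |  |
| Superimposition |  |  |  |  |  |  |  |  |  |  |  |
| PCA |  |  |  |  |  |  |  |  |  |  |  |
| License | MIT | MIT | Biopython | MIT | GPLv2 | GPLv2 | GPL | GPL/Artistic | Ruby | GPLv3 | GPLv3 |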
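For readers unfamiliar with the Ramachandran task, the quantity being timed is a pair of backbone dihedral angles per residue: phi is defined by C(i-1), N(i), CA(i), C(i) and psi by N(i), CA(i), C(i), N(i+1). The following is a minimal pure-Python sketch of that calculation, written for illustration and not taken from any of the benchmarked packages; the names `dihedral` and `phi_psi` are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Remove the components of b0 and b2 along the central bond, leaving
    # their projections onto the plane perpendicular to it.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

def phi_psi(c_prev, n, ca, c, n_next):
    """Backbone (phi, psi) for one residue from five atom positions."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)

# Four points arranged at a right angle about the z axis.
angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                 (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

Residues at chain ends or next to chain breaks lack one of the neighbouring atoms, so a full implementation has to skip or flag them.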

Benchmarks as a plot:

[Figure: benchmarks]

Parsing the whole PDB

It is instructive to run parsers over the whole PDB to see where errors arise. This approach has led me to submit corrections for small mistakes (e.g. duplicate atoms, residue number errors) in a few PDB structures. As of July 2018, the PDB entries that error with the Biopython (permissive mode) and BioJulia parsers are:

  • 4UDF - mmCIF file errors in Biopython and BioJulia due to duplicate C and O atoms in Lys91 of chains B, F etc.
  • 1EJG - mmCIF file errors in Biopython due to blank and non-blank alt loc IDs at residue Pro22/Ser22.
  • 5O61 - mmCIF file errors in Biopython due to an incorrect residue number at line 165,223.

Running Biopython in non-permissive mode picks up more potential problems such as broken chains and mixed blank/non-blank alt loc IDs. For further discussion on errors in PDB files see the Biopython documentation. The scripts to reproduce the whole PDB checking can be found in checkwholepdb. There is also a script to check recent PDB changes that can be run as a CRON job.
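The kinds of problems mentioned above (duplicate atoms, mixed blank and non-blank alt loc IDs) can be detected with simple bookkeeping over atom identifiers. The sketch below is a simplified illustration and not the checkwholepdb scripts: the function names and the record layout are assumptions, and the toy records are made up rather than copied from the listed entries.

```python
from collections import Counter, defaultdict

def find_duplicate_atoms(records):
    """Return identifiers that occur more than once."""
    counts = Counter(records)
    return sorted(key for key, n in counts.items() if n > 1)

def find_mixed_altlocs(records):
    """Return atoms that have both a blank and a non-blank alt loc ID."""
    altlocs = defaultdict(set)
    for chain, resseq, name, altloc in records:
        altlocs[(chain, resseq, name)].add(altloc)
    return sorted(key for key, ids in altlocs.items()
                  if " " in ids and len(ids) > 1)

# Toy records: (chain, residue number, atom name, alt loc ID).
records = [
    ("B", 91, "C", " "),
    ("B", 91, "C", " "),   # exact duplicate of the previous atom
    ("B", 91, "O", " "),
    ("A", 22, "CA", " "),
    ("A", 22, "CA", "B"),  # blank and non-blank IDs for the same atom
]
duplicates = find_duplicate_atoms(records)
mixed = find_mixed_altlocs(records)
```

A strict parser would raise an error on either finding, whereas a permissive one might warn and keep going.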

Opinions

  • For most purposes, particularly work on small numbers of files, the speed of the programs will not hold you back. In this case use the language/package you are most familiar with.
  • If you are analysing ensembles of proteins use packages with that functionality, such as ProDy or Bio3D, rather than writing the code yourself.
  • For fast parsing, consider using a binary format such as MMTF or binaryCIF.

Contributing

If you want to contribute benchmarks for a package, please make a pull request with the script(s) in a directory like those of the other packages. I will then rerun the benchmarks and update the README. Thanks.

Plans

  • Test BioJava, hPDB, possibly others.
  • Add benchmarks for parsing mmCIF, the standard PDB archive format.
  • Add benchmarks for parsing binary formats, e.g. MMTF.

Resources

  • Benchmarks for mmCIF parsing can be found here.
  • The PDB file format documentation can be found here.
  • A list of PDB parsing packages, particularly in C/C++, can be found here.
  • The Biopython documentation has a useful discussion on disorder at the atom and residue level.
  • Sets of utility scripts exist including pdbtools, pdb-tools and PDBFixer.
