seonwoo-min/DeeperHSP

Protein transfer learning improves identification of heat shock protein families (PLOS ONE 2021)

Official PyTorch implementation of DeeperHSP | Paper

Abstract

Heat shock proteins (HSPs) play a pivotal role as molecular chaperones against unfavorable conditions. Although HSPs are of great importance, their computational identification remains a significant challenge. Previous studies have two major limitations. First, they relied heavily on amino acid composition features, which inevitably limited their prediction performance. Second, their prediction performance was overestimated because of the independent two-stage evaluations and train-test data redundancy. To overcome these limitations, we introduce two novel deep learning algorithms: (1) time-efficient DeepHSP and (2) high-performance DeeperHSP. We propose a convolutional neural network (CNN)-based DeepHSP that classifies both non-HSPs and six HSP families simultaneously. It outperforms state-of-the-art algorithms, despite taking 14-15 times less time for both training and inference. We further improve the performance of DeepHSP by taking advantage of protein transfer learning. While DeepHSP is trained on raw protein sequences, DeeperHSP is trained on top of pre-trained protein representations. Therefore, DeeperHSP remarkably outperforms state-of-the-art algorithms increasing F1 scores in both cross-validation and independent test experiments by 20% and 10%, respectively. We envision that the proposed algorithms can provide a proteome-wide prediction of HSPs and help in various downstream analyses for pathology and clinical research.

How to Run

Example:

CUDA_VISIBLE_DEVICES=0 python embed_data.py --data-path data/ --model-config config/model/DeeperHSP.json
CUDA_VISIBLE_DEVICES=0 python train_model.py --data-config config/data/HSP_train.json --model-config config/model/DeeperHSP.json --run-config config/run/run.json --output-path results/DeeperHSP_final/
CUDA_VISIBLE_DEVICES=0 python evaluate_model.py --data-config config/data/HSP_test.json --model-config config/model/DeeperHSP.json --run-config config/run/run.json --checkpoint pretrained_models/DeeperHSP_final.pt --output-path results/DeeperHSP_final/
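
The three commands above run in order: embedding, training, then evaluation. As a rough sketch, they could be chained from Python as shown below. The helper names `build_commands` and `run_pipeline` are hypothetical and not part of this repository; the script names, flags, and paths are copied from the example as-is. The sketch defaults to a dry run that only prints the commands.

```python
import os
import subprocess
import sys

# Paths copied from the example commands above.
MODEL_CONFIG = "config/model/DeeperHSP.json"
RUN_CONFIG = "config/run/run.json"
OUTPUT_PATH = "results/DeeperHSP_final/"


def build_commands(python=sys.executable):
    """Return the embed, train, and evaluate commands as argument lists."""
    embed = [python, "embed_data.py",
             "--data-path", "data/",
             "--model-config", MODEL_CONFIG]
    train = [python, "train_model.py",
             "--data-config", "config/data/HSP_train.json",
             "--model-config", MODEL_CONFIG,
             "--run-config", RUN_CONFIG,
             "--output-path", OUTPUT_PATH]
    evaluate = [python, "evaluate_model.py",
                "--data-config", "config/data/HSP_test.json",
                "--model-config", MODEL_CONFIG,
                "--run-config", RUN_CONFIG,
                "--checkpoint", "pretrained_models/DeeperHSP_final.pt",
                "--output-path", OUTPUT_PATH]
    return [embed, train, evaluate]


def run_pipeline(gpu="0", dry_run=True):
    """Print each command; run it on the chosen GPU unless dry_run is set."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline if any step fails.
            subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the steps; a failed step raises `subprocess.CalledProcessError` so later steps do not run on incomplete outputs.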

HSP Datasets

  • FASTA : FASTA files for generating OneHot/ESM embeddings (CV and Test datasets)
  • OneHot : DeepHSP OneHot embedding files (Test dataset)
  • ESM : DeeperHSP ESM embedding files (Test dataset)

Due to the large file sizes, we provide OneHot and ESM embedding files only for the Test dataset.
OneHot and ESM embedding files for the CV dataset can be generated from the FASTA files using the embed_data.py script.
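
To illustrate what a OneHot embedding of a FASTA record looks like, here is a minimal standard-library sketch. It is not the encoding implemented in embed_data.py: the 20-letter alphabet, its ordering, and the all-zero row for unknown residues are assumptions made for illustration, and `read_fasta` and `one_hot` are hypothetical helper names.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def one_hot(sequence):
    """Encode a sequence as a list of length-20 rows, one row per residue.

    Residues outside the assumed alphabet (e.g. X) become all-zero rows.
    """
    rows = []
    for aa in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(aa)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows
```

For a file on disk, `read_fasta(open(path))` works as well, since a file object iterates over its lines.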

Requirements

  • Python 3.8
  • PyTorch 1.5.1
  • Bio Embeddings 0.1.5
  • NumPy 1.20.1
  • SciPy 1.6.0
  • scikit-learn 0.24.1
  • Thop 0.0.31
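
One way to compare an existing environment against the versions listed above is sketched below. The PyPI distribution names in `PINNED` (for example `torch` for PyTorch and `bio-embeddings` for Bio Embeddings) are assumptions, and `check_requirements` is a hypothetical helper, not part of this repository. The Python version itself is not checked here.

```python
from importlib import metadata

# Versions from the list above; distribution names are assumed PyPI names.
PINNED = {
    "torch": "1.5.1",
    "bio-embeddings": "0.1.5",
    "numpy": "1.20.1",
    "scipy": "1.6.0",
    "scikit-learn": "0.24.1",
    "thop": "0.0.31",
}


def version_matches(installed, wanted):
    """True if installed equals wanted, allowing build suffixes.

    Suffixes such as '+cu101' or '.post1' after the pinned version are accepted.
    """
    return (installed == wanted
            or installed.startswith(wanted + ".")
            or installed.startswith(wanted + "+"))


def check_requirements(pinned=PINNED):
    """Return {name: (wanted, installed_or_None, matches)} for each package."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and version_matches(installed, wanted)
        report[name] = (wanted, installed, ok)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_requirements().items():
        status = "ok" if ok else "mismatch"
        print(f"{name}: wanted {wanted}, found {installed} ({status})")
```

A mismatch does not necessarily mean the code will fail; it only flags where the environment differs from the one listed here.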

