3D Convolutional Neural Networks for Classifying Protein Structures into Folds
The data can be downloaded from the Berkeley SCOP website (scop.berkeley.edu). Download the pdbstyle archives linked below into the PRO3DCNN directory and extract each one with tar -xf 'name of tar file'. The extracted files take up around 50GB.
https://scop.berkeley.edu/downloads/parse/dir.cla.scop.1.55.txt
https://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-1.55.tgz

https://scop.berkeley.edu/downloads/parse/dir.cla.scope.2.07-stable.txt
https://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-2.07-1.tgz
https://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-2.07-2.tgz
https://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-2.07-3.tgz
https://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-2.07-4.tgz
https://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-2.07-5.tgz
https://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-2.07-6.tgz
https://scop.berkeley.edu/downloads/pdbstyle/pdbstyle-2.07-7.tgz
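The download and extraction step can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `archive_url` and `fetch_and_extract` are hypothetical and not part of this repository, and the archive names are taken from the links above.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://scop.berkeley.edu/downloads"

# Archive names taken from the download links in this README.
ARCHIVES = ["pdbstyle-1.55.tgz"] + [f"pdbstyle-2.07-{i}.tgz" for i in range(1, 8)]


def archive_url(name):
    """Build the download URL for one pdbstyle archive."""
    return f"{BASE_URL}/pdbstyle/{name}"


def fetch_and_extract(name, dest="."):
    """Download one archive into dest (skipped if already present) and extract it."""
    dest = Path(dest)
    target = dest / name
    if not target.exists():
        urllib.request.urlretrieve(archive_url(name), str(target))
    # Equivalent to running: tar -xf <name>
    with tarfile.open(target) as tar:
        tar.extractall(dest)
```

Calling `fetch_and_extract(name, "PRO3DCNN")` for each entry in `ARCHIVES` would reproduce the manual steps. Since the extracted data is around 50GB, it is worth checking free disk space first.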
The following workflow is used to process the data.
Protein Chain -> Distance Matrix (Dist M) -> Cropped Dist M
Protein Chain -> Dist M -> Persistent Barcodes -> Persistence Images
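To make the first branch of the workflow concrete, here is a small illustrative sketch of turning chain coordinates into a pairwise distance matrix and cropping it into fixed-size windows. It does not use the MAT module. The function names are hypothetical, and the diagonal, non-overlapping windowing is only an assumption about one plausible cropping scheme; MAT.splitMat may select its windows differently.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points (e.g. one atom per residue)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def diagonal_windows(matrix, window_size):
    """Cut square window_size x window_size blocks along the main diagonal.

    Assumption: this is one plausible cropping scheme, not necessarily
    the one implemented by MAT.splitMat.
    """
    n = len(matrix)
    return [
        [row[start:start + window_size] for row in matrix[start:start + window_size]]
        for start in range(0, n - window_size + 1, window_size)
    ]


# Toy chain: four points spaced 1 unit apart along the x-axis.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
dist_m = distance_matrix(coords)
crops = diagonal_windows(dist_m, window_size=2)
```

With real data the window size would match the windowSize argument used in the processing step (100 in the calls below).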
import MAT
import TDA
# This will save the protein chains in batches as chains/0 chains/1 chains/2 ...
# Note 'chains/' directory must exist.
MAT.getChains(loadPath='pdbstyle-1.55/', savePath='chains/',SCOPEdir='dir.cla.scop.1.55.txt')
MAT.getChains(loadPath='pdbstyle-2.07/', savePath='chains/',SCOPEdir='dir.cla.scope.2.07-stable.txt')
# rangeTo is the number of files under 'chains/' that we should process (replace N with that number)
# Note 'mats/' directory must exist.
MAT.getDistMs(loadPath='chains/',savePath='mats/',sparse=False,rangeTo=N)
#########################
# Cropped Dist M
#########################
#upToBatchNum is the number of files under 'mats/' that we should process
# Note 'croppedMats/' directory must exist along with the batch directories.
# mkdir croppedMats
# cd croppedMats
# mkdir {1..999}
MAT.splitMat(loadPath='mats/', savePath='croppedMats/',windowSize=100,upToBatchNum=N)
#########################
# Persistent Homology
#########################
#toRange is the number of files under 'mats/' that we should process
# Note 'barcodes/' directory must exist
TDA.genHoms(loadPath='mats/', savePath='barcodes/', toRange=N)
#rng is the number of files under 'barcodes/' that we should process
# Note 'barcodeImgs/' directory must exist along with the batch directories.
# mkdir barcodeImgs
# cd barcodeImgs
# mkdir {1..999}
TDA.getBcodeImgsSeparated(loadPath='barcodes/', savePath='barcodeImgs/',rng=N)
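The steps above expect several output directories to exist before they run. As a convenience, the following sketch creates them from Python instead of the shell. The function name `make_output_dirs` is hypothetical; the directory names and the 1..999 batch range are taken from the comments above.

```python
from pathlib import Path


def make_output_dirs(root=".", n_batches=999):
    """Create the output directories that the processing steps expect."""
    root = Path(root)
    # Flat output directories.
    for name in ("chains", "mats", "barcodes"):
        (root / name).mkdir(parents=True, exist_ok=True)
    # Directories that also need one subdirectory per batch (1..n_batches),
    # mirroring: mkdir {1..999}
    for name in ("croppedMats", "barcodeImgs"):
        for i in range(1, n_batches + 1):
            (root / name / str(i)).mkdir(parents=True, exist_ok=True)
```

Running `make_output_dirs()` once from the PRO3DCNN directory before the processing calls should be sufficient; `exist_ok=True` makes it safe to run again.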
Training and evaluation are done by running the following scripts. The parameters that control which files to load are set in each script's ###### Parameters ####### section.
trainDistM.py
trainBcodeImg.py
trainBoth.py
If using a GPU, use the following instead:
trainDistMgpu.py
trainBcodeImggpu.py
trainBothgpu.py
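The naming pattern of the six scripts can be captured in a small helper, which may be handy when launching runs from another script. This is only a sketch: the `SCRIPTS` keys and the `training_script` function are hypothetical labels introduced here, while the file names come from the lists above.

```python
# Hypothetical short labels mapped to the script base names listed in this README.
SCRIPTS = {
    "distm": "trainDistM",
    "bcodeimg": "trainBcodeImg",
    "both": "trainBoth",
}


def training_script(inputs, use_gpu=False):
    """Return the script file name for the chosen inputs and device."""
    suffix = "gpu" if use_gpu else ""
    return f"{SCRIPTS[inputs]}{suffix}.py"
```

For example, `subprocess.run([sys.executable, training_script("both", use_gpu=True)], check=True)` would launch trainBothgpu.py with the current Python interpreter.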