A simplified drug discovery pipeline: generating molecules as SMILES strings with AlphaSMILES, predicting protein structure with AlphaFold, and checking druggability with fpocket/Amber.
As the description suggests, the project has 3 parts (molecule, protein, and mol-x-protein). I am currently stuck at reconstructing the protein tertiary structure in PDB format from a contact map.
- Research how to use Tinker to reconstruct protein tertiary structure.
- Add functions to `fmol.py` to reconstruct protein structure from CASP-RR files.
- Maybe create a visual configurator for the `rnn` and `mcts` configurations used in AlphaSMILES.
In short, AlphaSMILES relies on TensorFlow but alphafold relies on PyTorch, so the cleaner way to run this project is to set up separate virtual environments according to their original documentation and run them separately. However, to save disk space and make things more automatic, here are 2 ways to set up a single shared environment.
I recommend using Anaconda to manage your packages and environments. One reason is that AlphaSMILES uses rdkit, which does not provide a way to install via pip.
Use the following lines to create a new environment named `fmol` with some packages already installed, and then activate it.
conda create -n fmol python=3 anaconda
conda activate fmol
Then we need to install TensorFlow and PyTorch. Anaconda may get stuck at some point while solving the environment. Don't worry: press Ctrl + C and try again. After a few attempts the issue usually resolves itself.
- If you only want to use CPU for this project
conda install tensorflow
conda install pytorch torchvision cpuonly -c pytorch
- If you also want to use GPU to shorten the runtime
conda install tensorflow-gpu
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch
Change the version of `cudatoolkit` to match your CUDA installation. For more details, check the PyTorch website.
Then install other libraries with the following lines.
conda install -c rdkit rdkit
conda install -c conda-forge keras
conda install -c omnia cclib
pip install pptree
conda install -c anaconda joblib
conda install -c conda-forge tensorboard
Don't forget to install the third-party libraries if you want the project to work as expected. After that, we are done with the environment.
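To confirm that the environment is complete, a quick check like the one below can help. It is only a sketch: the module list is an assumption based on the import names of the packages installed above, and the helper `missing_modules` is a hypothetical name, not part of this project.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = [
    "tensorflow", "torch", "torchvision", "rdkit", "keras",
    "cclib", "pptree", "joblib", "tensorboard",
]


def missing_modules(names):
    """Return the module names that cannot be located in the current environment."""
    # find_spec only looks a top-level module up; it does not import it.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated `fmol` environment; anything it lists still needs to be installed.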
It's also possible to clone an existing environment from a specification file I provided:
conda create -n fmol --file spec-file.txt
Then activate it with
conda activate fmol
Keras prints `Using TensorFlow backend.` or `Using Theano backend.` when you launch the program. If the default backend used by Keras is Theano, use the following line to switch to TensorFlow:
export KERAS_BACKEND='tensorflow'
This is already configured in the PyCharm configuration file `.idea/workspace.xml`, but you need to set it manually if you don't run this project via PyCharm.
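If you prefer to set the backend from Python rather than from the shell, the variable has to be set before Keras is imported. A minimal sketch, assuming a multi-backend Keras version that reads `KERAS_BACKEND` at import time; the helper name `select_keras_backend` is hypothetical:

```python
import os


def select_keras_backend(backend="tensorflow", environ=os.environ):
    """Set KERAS_BACKEND in the given environment mapping and return its value."""
    environ["KERAS_BACKEND"] = backend
    return environ["KERAS_BACKEND"]


if __name__ == "__main__":
    select_keras_backend("tensorflow")
    # Import Keras only after the variable is set, otherwise it has no effect:
    # import keras
    print("KERAS_BACKEND =", os.environ["KERAS_BACKEND"])
```

The `environ` parameter exists only so the function can be tried on a plain dict without touching the real environment.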
- AlphaSMILES uses the 3D calculation (DFT) library Gaussian 09 by default. If you want this functionality to work, here are some guides on how to set up Gaussian 09 on Ubuntu.
- I use RECONSTRUCT to reconstruct the protein tertiary structure in `.pdb` format from a contact map. This software does not work as expected so far; it is still a beta version and the organization is working on it. It is expected to provide an easy way to reconstruct protein tertiary structure. For chemistry professionals, see Recovery of protein structure from contact maps. They use Tinker to reconstruct the protein tertiary structure.
I develop and run this project on an Ubuntu 20.04 instance with CUDA 10.2 + cuDNN 7. I have not tested it on macOS or Windows, since my MacBook does not have a graphics card and running the bash scripts on Windows via WSL is inefficient. Feel free to open an issue if you test this project on another platform and run into compatibility problems.
- Download AlphaFold weight data from here.
- Install Gaussian 09 and make sure `g09` works well in your terminal.
- Extract the sample input data in `AlphaSMILES/data_in`, provided in `.tar.xz` and `.tar.gz` format.
- Make a new subfolder `alphafold_pytorch/model` and extract the weight folders into `model`.
- Modify the variable in `fmol.py` according to your PC.
- Run `./fmol.py`
Please check the doc for a usage tutorial. Cyril-Grl has written brilliant documentation for it. I provide some additional input data, sample configurations for `rnn` and `mcts`, and a sample output produced with the sample configurations. There is also a local copy of the documentation in case Cyril's website shuts down; it is in `AlphaSMILES/doc/_build/html/index.html`.
If you have Gaussian 09 set up, `g09` works well in your terminal, and you just want a quick start:
- Extract the sample input data in `AlphaSMILES/data_in`, provided in `.tar.xz` and `.tar.gz` format.
- Change the options in `AlphaSMILES/main.py`
- Simply run `AlphaSMILES/main.py`
- To run the project, first download the pre-trained weights from the DeepMind repos.
- Create a folder named `model` under `alphafold_pytorch`.
- Extract the weights downloaded in the first step and move the 3 folders `873731`, `916425`, and `941521` into the `model` folder.
- The sample inputs are provided, so simply run `./alphafold_pytorch/alphafold.sh` to run the project.
- Technically we could use the original DeepMind AlphaFold rather than alphafold_pytorch. But I got too many errors and warnings when I ran their code, and they didn't provide a good way to visualize the output, so I chose alphafold_pytorch in the end.
- For more details, check alphafold_pytorch readme
- If you run into an out-of-GPU-memory error, uncomment line 16 of `alphafold_pytorch/alphafold.sh`. That runs 3 trainings at a time instead of the default of all 8.
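Before launching `alphafold.sh`, it can be useful to confirm that the weight folders ended up where the steps above put them. The following is a small sketch; the folder names come from the steps above, while the helper name `missing_weight_folders` is hypothetical and not part of this project.

```python
from pathlib import Path

# Weight folder names listed in the setup steps above.
WEIGHT_FOLDERS = ("873731", "916425", "941521")


def missing_weight_folders(model_dir):
    """Return the expected weight folders that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in WEIGHT_FOLDERS if not (model_dir / name).is_dir()]


if __name__ == "__main__":
    missing = missing_weight_folders("alphafold_pytorch/model")
    if missing:
        print("Missing weight folders:", ", ".join(missing))
    else:
        print("All weight folders are in place.")
```

Run it from the repository root so that the relative path `alphafold_pytorch/model` resolves correctly.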
I provide a method to convert a CASP13-RR file into a contact map file that RECONSTRUCT accepts. It creates a contact map file in `.cm` format in the same folder as the input `.rr` file.
- input(string) - path to the input file
- None
- Use the `install_fpocket.sh` shell script under the `scripts` folder to install fpocket on your machine.
- For more information, check their repo.
- I have not included any part of Amber in this project, but it is a powerful and useful library in chemistry.
- The output file of alphafold comes in the CASP13-RR format (`.rr`). It stores the probability that two residues on the protein chain are in contact, that is, within 8 angstroms of each other. But fpocket only accepts input files in `.pdb` format, which basically stores the 3-D coordinates of each atom. Reconstructing a reliable PDB file from a CASP13-RR file is still an unsolved problem in academia. RECONSTRUCT is third-party software built on the TINKER package that aims to reconstruct a PDB file from a `.cm` contact map file, but it does not work well. I wrote a tool to convert the CASP13-RR format into the contact map format (see `utils.rr_to_cm`).
- DeepMind didn't open-source the procedure for protein tertiary structure prediction, especially the part that trains the model from the CASP PDB dataset. However, it is essential to the accuracy of predictions for arbitrary protein structures.
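To make the CASP13-RR format described above more concrete, here is a minimal sketch that reads the `i j d_min d_max probability` data lines of an RR file and turns them into a binary contact matrix. It is not the implementation of `utils.rr_to_cm` and it does not write the `.cm` file that RECONSTRUCT expects; the function names and the 0.5 probability threshold are assumptions chosen for illustration.

```python
def parse_rr_contacts(lines):
    """Yield (i, j, probability) from RR data lines, skipping header and sequence lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # headers such as "PFRMAT RR", sequence lines, "END"
        try:
            i, j = int(fields[0]), int(fields[1])
            prob = float(fields[4])
        except ValueError:
            continue  # a five-word header or remark line, not contact data
        yield i, j, prob


def contact_matrix(lines, length, threshold=0.5):
    """Build a symmetric 0/1 matrix; the threshold is an assumed cut-off."""
    matrix = [[0] * length for _ in range(length)]
    for i, j, prob in parse_rr_contacts(lines):
        # Residue indices in RR files are 1-based.
        if prob >= threshold and 1 <= i <= length and 1 <= j <= length:
            matrix[i - 1][j - 1] = 1
            matrix[j - 1][i - 1] = 1
    return matrix


if __name__ == "__main__":
    sample = ["PFRMAT RR", "TARGET T0000", "MODEL 1", "ACDE",
              "1 3 0 8 0.90", "2 4 0 8 0.20", "END"]
    for row in contact_matrix(sample, length=4):
        print(row)
```

In the sample, only the pair (1, 3) passes the threshold, so the matrix has ones at positions (1, 3) and (3, 1) and zeros elsewhere.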
To make the project easier to deploy on the cloud, I copied and merged some repos into this project in accordance with their licences.