eToxPred is a tool to reliably estimate the toxicity and synthetic accessibility of small organic compounds.
This README file is written by Limeng PU.
If you find this tool useful, please cite this paper:
Limeng Pu, Misagh Naderi, Tairan Liu, Hsiao-Chun Wu, Supratik Mukhopadhyay, and Michal Brylinski. "eToxPred: A Machine Learning-Based Approach to Estimate the Toxicity of Drug Candidates."
- Python 2.7+ or Python 3.5+
- Theano
- numpy 1.8.2 or higher
- scipy 0.13.3 or higher
- scikit-learn 0.18.1 (newer versions can produce errors because the models were trained with this version)
- Openbabel 2.3.1
- (Optional) CUDA 8.0 or higher
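The version requirements above can be checked before running anything. The following is a minimal sketch using only the standard library; the helper names version_tuple and meets_minimum are hypothetical and are not part of eToxPred.

```python
def version_tuple(version):
    """Turn a dotted version string such as '0.18.1' into (0, 18, 1).

    Non-numeric parts (for example 'rc1') are dropped, so this is only a
    rough comparison, not a full version parser.
    """
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def matches_exactly(installed, required):
    """Return True if the installed version equals the required one."""
    return version_tuple(installed) == version_tuple(required)
```

For example, meets_minimum(numpy.__version__, "1.8.2") covers the numpy requirement, and matches_exactly(sklearn.__version__, "0.18.1") covers the pinned scikit-learn version.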
The software package contains 2 parts:
- SAscore prediction (in the folder SAscore)
- Toxicity prediction (in the folder toxicity)
To use the trained models for predictions:
- Download and extract the package. Make sure etoxpred.py and the other two folders (SAscore and toxicity) are in the same folder. Otherwise, you have to change the paths in etoxpred.py (lines 13 and 14).
- Run eToxPred by
python etoxpred.py -i tcm600_nr.smi -o output
- The first input argument, -i, specifies the input .smi file that stores the SMILES data.
- The second input argument, -o, specifies the output file for the predicted SAscores and Tox-scores. No file extension is needed, since the program produces two files, output_sa.txt and output_tox.txt, which store the IDs with the predicted SAscores and Tox-scores, respectively.
- The corresponding trained models are in the SAscore and toxicity folders, respectively. The trained_model_gpu.pkl model can be used when CUDA is installed and properly configured.
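The prediction step above can also be driven from a script. The sketch below assembles the documented command line and derives the two output file names from the -o value. The helper names (output_paths, build_command, run_etoxpred) are hypothetical, and sys.executable is used in place of python as an assumption, so that the same interpreter runs etoxpred.py.

```python
import subprocess
import sys


def output_paths(prefix):
    """Return the two result files written for a given -o value."""
    return prefix + "_sa.txt", prefix + "_tox.txt"


def build_command(smi_path, prefix, script="etoxpred.py"):
    """Assemble the documented command line as an argument list."""
    return [sys.executable, script, "-i", smi_path, "-o", prefix]


def run_etoxpred(smi_path, prefix):
    """Run the prediction and return the expected output file names.

    Raises subprocess.CalledProcessError if etoxpred.py exits with an error.
    """
    subprocess.check_call(build_command(smi_path, prefix))
    return output_paths(prefix)
```

Calling run_etoxpred("tcm600_nr.smi", "output") from the folder that contains etoxpred.py should be equivalent to the command shown above and returns ("output_sa.txt", "output_tox.txt").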
To use the package to train your own models:
- Prepare the training dataset. The dataset contains two parts: the fingerprints and the label. The label can be the binary class labels for toxicity prediction or the SAscores. The dataset has to be stored in a .smi file in the format: [SMILES string\tID\tLabel].
- Train the DBN for SAscore prediction. Run sa_dbn.py in the SAscore folder by
python sa_dbn.py -i your_training_set.smi
- The input argument is the path to your training dataset. The data has to be in the format: [SMILES string\tID\tSAscore].
- The data will be randomly split into training, testing, and validation sets (60%/20%/20%).
- The parameters of the DBN can be changed in sa_dbn.py at line 471:
  - finetune_lr is the learning rate used in the finetuning stage. Default is 0.2.
  - pretraining_epochs is the number of epochs used in the pretraining stage. Default is 20.
  - k is the number of Gibbs steps in CD/PCD. Default is 1.
  - training_epochs is the maximal number of iterations to run the optimizer. Default is 1000.
  - batch_size is the size of a minibatch. Default is 50.
- The best trained model will be saved as best_sa_model.pkl, which can be used for prediction later. Note that a model trained with a GPU can only be used for prediction with a GPU.
- Train the ET for toxicity prediction; the best parameters are selected automatically. Run xtrees_param_tune.py in the toxicity folder by
python xtrees_param_tune.py -i your_training_set.txt
- The input argument is the path to your training dataset.
- The input data should contain both toxic and non-toxic instances. Otherwise, the code will produce an error, since the model would predict everything to be either toxic or non-toxic.
- The parameters to be tuned are:
  - min_samples_leaf: the minimum number of samples required to be at a leaf node.
  - max_features: the number of features to consider when looking for the best split.
  - min_samples_split: the minimum number of samples required to split an internal node.
- The tuning range can be set in the setgrid() function in xtrees_param_tune.py.
- The best set of parameters will be printed and the model will be saved as best_tox_model.pkl. Note that this step might take a long time. Progress is printed while it runs.
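As a rough illustration of why the tuning step can take a long time, the sketch below enumerates every combination in a small parameter grid using only the standard library. The value ranges and the name candidate_settings are assumptions made for this example; the actual ranges are whatever setgrid() in xtrees_param_tune.py defines.

```python
from itertools import product

# Hypothetical ranges, chosen only for illustration.
GRID = {
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
}


def candidate_settings(grid):
    """Yield one dict per parameter combination in the grid."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


settings = list(candidate_settings(GRID))
# The number of candidates is the product of the list lengths: 3 * 2 * 3.
print(len(settings))
```

Every candidate requires training and evaluating one model, so the running time grows multiplicatively with each value added to the grid.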
An example test dataset that can be used for prediction (in the .smi format) is provided in tcm600_nr.smi. The ready-to-use datasets for ET and DBN training can be found at https://osf.io/m4ah5/. The data is in text format. The general format is SMILES string\tID\tSAscore/Toxicity. The results of our experiments in terms of SAscores and Tox-scores are also provided in sa_results.txt and tox_results.txt. Both the ID and the SAscore/Tox-score are included in these files.
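To make the tab-separated format concrete, here is a minimal sketch that parses such lines and performs a random 60%/20%/20% split like the one described for DBN training. It is not the code used in sa_dbn.py; the function names, the fixed seed, and the example records (with placeholder labels, not real measurements) are assumptions for illustration.

```python
import random


def parse_records(lines):
    """Parse 'SMILES<TAB>ID<TAB>label' lines into (smiles, id, label) tuples."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        smiles, mol_id, label = line.split("\t")
        records.append((smiles, mol_id, float(label)))
    return records


def split_records(records, seed=0):
    """Shuffle, then split into 60% training, 20% testing, 20% validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_test = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]
    return train, test, valid


# Placeholder records: the labels are dummy values, not real data.
example = [
    "CCO\tmol1\t1",
    "c1ccccc1\tmol2\t0",
    "CC(=O)O\tmol3\t1",
    "CCN\tmol4\t0",
    "C\tmol5\t1",
]
records = parse_records(example)
train, test, valid = split_records(records)
```

Integer arithmetic is used for the split sizes to avoid floating-point rounding; any remainder ends up in the validation set.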