sdvillal/manysources

Introduction

We are revisiting, through theory, discussion and (not so) anecdotal evidence, several issues that arise when applying statistical learning [1] to chemical datasets with scaffold overrepresentation, a problem commonly called analog bias in the cheminformatics literature.

Because of the way chemical collections are usually constructed (by exploring substitutions around scaffolds of interest), the way molecules are often represented when fed to statistical models (1D and 2D descriptors dominate the academic literature) and the way these models work (learning repeated discriminative or correlative patterns), two problems pervade models built using the average chemical dataset:

* Overoptimistic evaluation: inflated performance estimates that are irrelevant for generalisation

* Scaffold overfitting: models that give too much importance to overrepresented features and miss what is often more interesting, activity cliffs
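
One common way to expose overoptimistic evaluation is to keep every molecule that shares a scaffold in the same partition, so that a model is never tested on close analogs of its training molecules. The sketch below is only an illustration of that idea in plain Python; it is not the partitioning code used in this repository, and the function name `scaffold_grouped_folds` and the toy scaffold labels are hypothetical.

```python
import random
from collections import defaultdict


def scaffold_grouped_folds(scaffolds, n_folds=3, seed=0):
    """Assign molecule indices to folds so that no scaffold spans two folds.

    `scaffolds` holds one scaffold label per molecule (hypothetical input;
    in practice the labels could come from any scaffold definition).
    """
    # Collect the molecule indices that belong to each scaffold.
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Shuffle to break ties reproducibly, then place the largest groups first.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    keys.sort(key=lambda key: len(groups[key]), reverse=True)

    # Greedy balancing: each whole group goes to the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for key in keys:
        smallest = min(folds, key=len)
        smallest.extend(groups[key])
    return folds


# Toy dataset with one heavily overrepresented scaffold (made-up labels).
scaffolds = ["benzene"] * 6 + ["indole"] * 3 + ["pyridine"] * 2 + ["furan"]
folds = scaffold_grouped_folds(scaffolds, n_folds=3)
for number, fold in enumerate(folds):
    print(number, sorted({scaffolds[i] for i in fold}), len(fold))
```

Comparing performance estimates from such grouped folds with estimates from purely random folds gives a rough idea of how much of the apparent performance comes from analogs shared between training and test sets.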

Model <-> Example <-> Prediction interactions (with a chemical twist)

The tools in this repository are useful on their own too. Some highlights:

  • We use unfolded, unhashed fingerprints.
    • Pros: good for model interpretability and for avoiding the effects of hash clashes. They usually provide elevated performance [2].
    • Cons: no more regularisation by hashing; one needs a model that scales well with very high dimensionality.
  • We provide a general framework to understand individual molecule predictions in the context of a concrete dataset.
    • It links predictions to feature importance, assesses the influence of other molecules and selects influential molecules.
    • It provides quantitative and qualitative insight into the warts of evaluation and the whys of predictions.
    • It can be extended to provide different hypotheses for individual molecule predictions at model deployment time.
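
To make the trade-off between unfolded and folded (hashed) fingerprints concrete, here is a minimal sketch in plain Python. It is not the feature extraction code of this repository: the substructure strings are made-up stand-ins for the features a cheminformatics toolkit would produce, and the function names and the use of `zlib.crc32` as the hash are assumptions chosen for illustration.

```python
import zlib


def unfolded_columns(molecule_features):
    """Give each distinct feature its own column: no clashes, interpretable columns."""
    column_of = {}
    for features in molecule_features:
        for feature in features:
            # A new feature gets the next free column index.
            column_of.setdefault(feature, len(column_of))
    return column_of


def folded_columns(molecule_features, n_bits):
    """Hash each feature into one of n_bits columns: distinct features may share a column."""
    column_of = {}
    for features in molecule_features:
        for feature in features:
            column_of[feature] = zlib.crc32(feature.encode("utf-8")) % n_bits
    return column_of


def count_clashes(column_of):
    """Number of features that do not have a column of their own."""
    return len(column_of) - len(set(column_of.values()))


# Hypothetical substructure features for three molecules.
molecules = [
    ["c1ccccc1", "C(=O)O", "CN"],
    ["c1ccccc1", "CO", "C(=O)N"],
    ["CN", "CCl"],
]

unfolded = unfolded_columns(molecules)
folded = folded_columns(molecules, n_bits=4)
print("columns (unfolded):", len(unfolded), "clashes:", count_clashes(unfolded))
print("columns (folded):  ", len(set(folded.values())), "clashes:", count_clashes(folded))
```

With six distinct features and only four bits, the folded representation must merge some features into one column, which blurs feature importances; the unfolded one keeps a one-to-one mapping at the cost of a number of columns that grows with the dataset.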

Usage

We include several of the datasets we use in our study in this repository (see data).

  • Use manysources/datasets.py for feature extraction.
  • Use manysources/experiments.py to generate new results. These build and evaluate models for many different data partitions (note that we run these for a few hours in something like 30 parallel jobs).
  • Use manysources/hub.py to easily link everything, from molecules to features to model to prediction and back.
  • There are some example analyses in manysources/analyses.

Installation

We recommend using the Anaconda scientific Python distribution to install manysources and its dependencies. Dependencies are listed in setup.py. Assuming a conda environment is active, these commands install the required software:

conda install numpy scipy pandas h5py matplotlib seaborn joblib scikit-learn cytoolz networkx numba
conda install -c rdkit rdkit
pip install whatami tsne argh

To install manysources itself there are a few options:

...or...

  • pip install git+git://github.com/sdvillal/manysources.git

Proper releases are coming (at least there will be one when a publication happens).

Work in progress, but come back soon

This is the code we (Floriane, Santi) are using in our experiments. A documented, stable and more featureful release is happening in 2015 Q4. In the meantime, feel free to peek around the code; it is not too bad!


  1. But our research is also relevant to other QSAR methods, from statistical pharmacophore mining to docking evaluation.

  2. One can also use the analog bias to do well in competitions ;-)
