Introduction

We revisit, through theory, discussion and (not so) anecdotal evidence, several issues that arise when applying statistical learning¹ to chemical datasets with scaffold overrepresentation, a problem commonly called analog bias in the cheminformatics literature.

Because of the way chemical collections are usually constructed (by exploring substitutions around scaffolds of interest), the way molecules are often represented when fed to statistical models (1D and 2D descriptors dominate the academic literature) and the way these models work (learning repeated discriminative or correlative patterns), two problems pervade models built using the average chemical dataset:

* Overoptimistic evaluation: inflated performance estimates that are irrelevant for generalisation

* Scaffold overfitting: models that give too much importance to overrepresented features and miss what is often more interesting, activity cliffs
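One common way to reduce the first problem is to keep all molecules that share a scaffold on the same side of a train/test split. The following is a minimal sketch of that idea, not the splitting code used in this repository. The function name `scaffold_grouped_folds` and the scaffold labels are hypothetical; in practice the labels could come from any scaffold definition.

```python
from collections import defaultdict


def scaffold_grouped_folds(scaffolds, n_folds=3):
    """Assign each molecule to a fold so that no scaffold spans two folds.

    `scaffolds` holds one scaffold identifier per molecule (hypothetical
    labels here). Returns a list with the fold index of each molecule.
    """
    # Collect the molecule indices that share each scaffold.
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Greedy balancing: place the largest groups first, always into the
    # fold that currently holds the fewest molecules.
    fold_sizes = [0] * n_folds
    fold_of = [None] * len(scaffolds)
    for scaffold in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        smallest = fold_sizes.index(min(fold_sizes))
        for index in groups[scaffold]:
            fold_of[index] = smallest
        fold_sizes[smallest] += len(groups[scaffold])
    return fold_of


# Toy dataset: one overrepresented scaffold and a few rare ones.
scaffolds = ["benzene"] * 5 + ["indole"] * 2 + ["pyridine"] * 2 + ["furan"]
folds = scaffold_grouped_folds(scaffolds, n_folds=3)

for test_fold in range(3):
    train = {s for s, f in zip(scaffolds, folds) if f != test_fold}
    test = {s for s, f in zip(scaffolds, folds) if f == test_fold}
    # No scaffold seen in training is ever used for testing.
    assert not (train & test)
```

With a plain random split, analogs of the overrepresented scaffold would land in both training and test sets, which is one way performance estimates get inflated.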

Model <-> Example <-> Prediction interactions (with a chemical twist)

The tools in this repository are useful on their own too. Some highlights:

We use unfolded, unhashed fingerprints.
- Pros: good for model interpretability and for avoiding the effects of hash clashes. They usually also give better performance ²
- Cons: no more regularisation by hashing; one needs a model that scales well to very high dimensionality
We provide a general framework to understand individual molecule predictions in the context of a concrete dataset.
- Links predictions to feature importance, assesses the influence of other molecules and selects influential molecules.
- Provides quantitative and qualitative insight into the warts of evaluation and the whys of predictions.
- Can be extended to provide different hypotheses for individual molecule predictions at model deployment time.
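To make the trade-off between unfolded and hashed fingerprints concrete, here is a small sketch in plain Python. It is an illustration only and does not use the fingerprinting code in this repository: the substructure strings, the function names and the choice of `zlib.crc32` as the folding hash are all assumptions made for the example.

```python
import zlib


def unfolded_columns(molecule_features):
    """Give every distinct substructure its own column (no hashing)."""
    vocabulary = sorted({f for feats in molecule_features for f in feats})
    return {feature: column for column, feature in enumerate(vocabulary)}


def folded_columns(molecule_features, n_bits):
    """Fold substructures into `n_bits` columns with a deterministic hash."""
    vocabulary = sorted({f for feats in molecule_features for f in feats})
    return {f: zlib.crc32(f.encode("utf-8")) % n_bits for f in vocabulary}


def clashes(columns):
    """Return the columns that are shared by more than one substructure."""
    by_column = {}
    for feature, column in columns.items():
        by_column.setdefault(column, []).append(feature)
    return {c: fs for c, fs in by_column.items() if len(fs) > 1}


# Hypothetical substructures found in three molecules (7 distinct ones).
molecule_features = [
    {"c1ccccc1", "C(=O)O", "CN"},
    {"c1ccccc1", "CCl", "C#N"},
    {"CO", "CN", "CF"},
]

unfolded = unfolded_columns(molecule_features)
folded = folded_columns(molecule_features, n_bits=4)

# Unfolded: one column per substructure, so every column is interpretable.
assert not clashes(unfolded)
# Folded: 7 substructures in 4 columns must collide somewhere.
assert clashes(folded)
```

The unfolded mapping grows with the number of distinct substructures in the dataset, which is why a model that copes with very high dimensionality is needed.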

Usage

We include several of the datasets we use in our study on this repository (see data).

* Use manysources/datasets.py for feature extraction.

* Use manysources/experiments.py to generate new results. These experiments build and evaluate models for many different data partitions (note that we run these for a few hours in something like 30 parallel jobs).

* Use manysources/hub.py to easily link everything, from molecules to features to model to prediction and back.

* There are some example analyses in manysources/analyses.
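The idea of evaluating many data partitions in parallel can be sketched as follows. This is not the interface of manysources/experiments.py: the functions `random_group_partition` and `evaluate_partition`, the toy data and the majority-class "model" are placeholders invented for the illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def random_group_partition(groups, test_fraction, seed):
    """Hold out whole groups (e.g. scaffolds) chosen at random."""
    rng = random.Random(seed)
    unique = sorted(set(groups))
    rng.shuffle(unique)
    n_test = max(1, int(round(test_fraction * len(unique))))
    held_out = set(unique[:n_test])
    test = [i for i, g in enumerate(groups) if g in held_out]
    train = [i for i, g in enumerate(groups) if g not in held_out]
    return train, test


def evaluate_partition(seed, groups, labels, test_fraction=0.25):
    """Placeholder experiment: majority-class baseline scored on the test split."""
    train, test = random_group_partition(groups, test_fraction, seed)
    train_labels = [labels[i] for i in train]
    # Most frequent training label; ties go to the smallest label.
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    accuracy = sum(labels[i] == majority for i in test) / len(test)
    return seed, accuracy


# Toy data: a group identifier and a binary activity label per molecule.
groups = ["A", "A", "A", "B", "B", "C", "C", "D"]
labels = [1, 1, 0, 0, 0, 1, 0, 1]

# One job per partition seed; each job is independent of the others.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda s: evaluate_partition(s, groups, labels), range(10)))

for seed, accuracy in results:
    print(f"partition {seed}: accuracy {accuracy:.2f}")
```

Threads keep the sketch short; for CPU-bound model fitting, a process pool or joblib (already among the dependencies) would be the more usual choice.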

Installation

We recommend using the Anaconda scientific Python distribution to install manysources and its dependencies. Dependencies are listed in setup.py. Assuming a conda environment, these commands install the required software:

conda install numpy scipy pandas h5py matplotlib seaborn joblib scikit-learn cytoolz networkx numba
conda install -c rdkit rdkit
pip install whatami tsne argh

To install manysources itself there are a few options:

Download a zip file or clone the manysources repository, and then either tweak $PYTHONPATH or use pip install -e

...or...

pip install git+git://github.com/sdvillal/manysources.git

Proper releases are coming (at least there will be one when a publication happens).

Work in progress, but come back soon

This is the code we (Floriane, Santi) are using in our experiments. A documented, stable and more featureful release is happening in 2015 Q4. In the meantime, feel free to peek around the code, it is not too bad!

¹ But our research is also relevant to other QSAR methods, from statistical pharmacophore mining to docking evaluation.
² One can also use the analog bias to do well in competitions ;-)
