Skip to content

API for feature engineering in Cheminformatics Tasks

Notifications You must be signed in to change notification settings

Bundaberg-Joey/ChemScI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ChemSci

Build Status

Goals

  • Provide a consistent interface for the featurisation of molecules or other chemical systems for informatics studies
  • Allow for these featurisations to be easily saved to a wide variety of formats
  • Allow for novel featurisations to easily be incorporated / implemented from literature within minimal boiler plate required from contributing authors

ChemSci is designed to facilitate an entirely modular approach to chemical featurisation, where the featurisation of a chemical system does not assume it originates from a SMILES string etc. This prevents commonly used tools from being subtly hard-coded into the ChemSci and ensures a flexible API is maintained.

Through providing a consistent API, it is hoped that newly developed featurisations for cheminformatics can be readily incorporated and hence provide easy access to the informatics community. In development of new molecule featurisations, users are encourgaed to submit their own featurisations to ChemSci so that a consistent api for all featurisations is maintained.

FeatureFactory

The core object of ChemSci is the FeatureFactory which can be used as a sklearn transformer or as a stand alone object in informatics studies. Within the FeatureFactory are the converter and featuriser callables which are specified at initialisation, either through default keywords or custom callables. The converter callable takes a standard representation of a chemical system and converts it into a workable obeject from which featurisations can be derived. The featuriser callable takes the workable object and generates the featurisation from it.

from chemsci.factory import FeatureFactory

ff = FeatureFactory(converter='smiles', featuriser='maccs')  
# create featuriser which reads smiles strings and calculates maccs fingerprints

X = ['smiles_1', 'smiles_2', 'smiles_n']
out = ff.fit_transform(X)

Default converters and featurisers

Default converters:

  • smiles
  • smarts
  • inchi
  • mol (path to file)
  • pdb (path to file)
  • pubchem (CID number)
  • none (when no conversion is required)

Default featurisers:

  • maccs (Molecular Access fingerprint)
  • avalon (Avalon fingerprint)
  • daylight (daylight fingerprint)
  • ecfp_4_1024 (1024 bit ECFP of radius 4)
  • ecfp_6_1024 (1024 bit ECFP of radius 6)
  • ecfp_4_2048 (2048 bit ECFP of radius 4)
  • ecfp_6_2048 (2048 bit ECFP of radius 6)
  • fcfp_4_1024 (1024 bit ECFP of radius 4)
  • fcfp_6_1024 (1024 bit ECFP of radius 6)
  • fcfp_4_2048 (2048 bit ECFP of radius 4)
  • fcfp_6_2048 (2048 bit ECFP of radius 6)

Custom Converters and Featurisers:

Any callable can be passed in place of a converter and featuriser, with several customs available in custom_featurisers and converters. For example, both the "MAP4" molecular fingerprints (DOI 10.1186/1758-2946-5-26) and pubchempy fingerprints are available to be used.

from chemsci.factory import FeatureFactory
from chemsci.custom_featurisers import Map4Fingerprint

ff = FeatureFactory(converter='smiles', featuriser=Map4Fingerprint())

The contracts of the callables are as below where any type can be used as input but the output of featuriser must be a numpy array.

def convert(representation):
    """
    Parameters
    ----------
    representation: any
        any type input.
    
    Returns
    -------
    out : any
    """
    out = custom_conversion(representation)
    return out


def featurise(mol):
    """
    Parameters
    ----------
    representation: any
        any type input.
    
    Returns
    -------
    out : np.ndarray
    """
    out = custom_featuisation(representation)
    return out

If more custom and detailed logic is required (as in MAP4) then a class specifying the __call__ method can be passed instead so long as it is instantiated. The below gives an example of an arbitrary class where we wish to weight the generated value with a custom value we specified at initialisation.

from chemsci.factory import FeatureFactory

class CustomFeaturiser:
    def __init__(self, custom_param=1):
        self.custom_param = int(custom_param)
    
    def __call__(self, *args, **kwargs):
        out = custom_func(*args, **kwargs)
        return out + self.custom_param
        
ff = FeatureFactory(converter='smiles', featuriser=CustomFeaturiser(3))

Integration with sklearn

The FeatureFactory object inherits from sklearn.base.TransformerMixin and so can readily be included in any pipeline / workflow with ease.

Convenience Outputs

As with pandas dataframes, the results of a FeatureFactory transformation can be returned as given datatypes / saved to given file structures with ease due to the provided convenience functions:

  • to_list()
  • to_dict(convert_string=False)
  • to_arrary(as_type=None)
  • to_series(index=None, name=None, as_type=None)
  • to_csv(self, path, index=None, name=None, as_type=None, sep=',', save_index=True, columns=None, header=True)
  • to_json(self, path=None)
  • to_yaml(self, path)

Installation

git clone https://github.com/Bundaberg-Joey/ChemScI.git
conda create --name <env_name> --file=environment.yml
pip install .

About

API for feature engineering in Cheminformatics Tasks

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published