Machine_Learning_Immunogenicity

This is a repo for the Machine Learning Immunogenicity Team in the August 2016 NCBI Hackathon

This project looks into the application of Machine Learning (ML) techniques in the prediction of Immunogenicity (Categorical; Positive or Negative) based on a peptide and its associated amino acid properties. This study uses peptide data from the Immune Epitope Database (IEDB). The R package "Peptides" has been used to compute the amino acid properties and mashup with peptide data to enable the use of ML algorithms for immunogenicity analysis, particularly, the algorithms that are more efficient with numeric and categorical data instead of string sequence.

Tensorflow is an open source software library ML that provides linear regression and classification algorithms (open sourced by Google in Nov 2015) for multi-dimensional arrays (aka “Tensors”). K-fold cross-validation as well as hold-out of test data was used to train and test the generated models.

Initial application of Logistic Regression (LR) and Neural Networks (NN) looks promising with approximately 82% and 90% predictive accuracy respectively. Note: Further cross-validation and rigorous analysis needs to be performed to validate these performance metrics. Various other ML algorithms such as variants of Neural Networks such as Convoluted NN, RESNET, MUST-CNN as well as Random Forest, Bayesian Networks should be considered as part of future work.

The following are provided:

R scripts for data wrangling of IEDB data and mashup with Amino Acid properties
Python Notebook for application of Logistic Regression and Neural Networks using Tensorflow
Python script for computing binding affinities using several published approaches based on several datasets

As part of initial results, the convergence of predictive accuracy for Neural Network is presented below.

Amino acid properties computed using R peptides package and added to the new dataset

Tiny.Molecular.Percent
Small.Molecular.Percent
Aliphatic.Molecular.Percent
Aromatic.Molecular.Percent
Polar.Molecular.Percent
Charged.Molecular.Percent
Basic.Molecular.Percent
Acidic.Molecular.Percent
Peptide.Aliphatic.Index
Peptide.Boman
Peptide.Charge
Peptide.hmoment
Peptide.hydrophobicity
Peptide.instaindex
Peptide.Kidera.helix.bend.pref
Peptide.Kidera.side.chain.size
Peptide.Kidera.extended.str.pref
Peptide.Kidera.hydrophobicity
Peptide.Kidera.double.bend.pref
Peptide.Kidera.partial.spec.vol
Peptide.Kidera.flat.ext.pref
Peptide.Kidera.occurrence.alpha.reg
Peptide.Kidera.pK.C
Peptide.Kidera.surrounding.hydrop
Peptide.mw
Peptide.pI

Test if epitope predictors predict immunogenicity on published data

This script allows to run published epitope binding predictors on immunogenicity data from IEDB T-cell and MHC assays (http://www.iedb.org/), as well as data from IMMA2. This script can be used to test whether binding predictions also predict immugenicity (Note: they don't). We use the data interface implemented in the pepdata package from the Hammer Lab (https://github.com/hammerlab/pepdata) and the implementations of various predictors in the Fred 2 framework for computational immunogenomics by Schubert et al. (https://github.com/FRED-2/Fred2 and http://bioinformatics.oxfordjournals.org/content/32/13/2044).

Usage

usage: run-data-on-predictor.py [-h] --predictor PREDICTOR --dataset DATASET
                                [-n N] [--allele ALLELE]

Call epitope predictors on data.

optional arguments:
  -h, --help            show this help message and exit
  -n N                  Number of rows to take from dataset
  --allele ALLELE       Allelle

required arguments:
  --predictor PREDICTOR
                        Epitope predictors [see all with --predictor=list]
  --dataset DATASET     Immunogenic dataset [see all with --dataset=list]

List available predictors and datasets

run-data-on-predictor.py  --predictor list --dataset list

Set one of the predictors with --predictor: ['smmpmbec', 'syfpeithi', 'netctlpan', 'smm', 'tepitopepan', 'netmhcii', 'arb', 'pickpocket', 'epidemix', 'unitope', 'netmhciipan', 'comblibsidney', 'netmhcpan', 'calisimm', 'hammer', 'svmhc', 'bimas']

Details from https://bioinformatics.oxfordjournals.org/content/suppl/2016/02/26/btw113.DC1/S1.pdf

Method	Type	Publication
SYFPEITHI	T-cell epitope	(Rammensee, et al., 1999)
BIMAS	MHC-I binding	(Parker, et al., 1994)
SVMHC	MHC-I binding	(Dönnes and Elofsson, 2002)
ARB	MHC-I binding	(Bui, et al., 2005)
SMM	MHC-I binding	(Peters and Sette, 2005)
SMMPMBEC	MHC-I binding	(Kim, et al., 2009)
Epidemix	MHC-I binding	(Feldhahn, et al., 2009)
Comblib	MHC-I binding	(Sidney, et al., 2008)
PickPocket*	MHC-I binding	(Zhang, et al., 2009)
NetMHC*	MHC-I binding	(Lundegaard, et al., 2008)
NetMHCpan*	MHC-I binding	(Hoof, et al., 2009)
HAMMER	MHC-II binding	(Sturniolo, et al., 1999)
TEPITOPEpan	MHC-II binding	(Zhang, et al., 2012)
NetMHCII*	MHC-II binding	(Nielsen, et al., 2007)
NetMHCIIpan*	MHC-II binding	(Karosiene, et al., 2013)
UniTope	T-cell epitope	(Toussaint, et al., 2011)
NetCTLpan*	T-cell epitope	(Stranzl, et al., 2010)

available datasets: iedb.tcell, ideb.mhc, imma2

Example output

python -W ignore src/run-data-on-predictor.py --predictor all --dataset iedb.mhc -n 10

smmpmbec  immunogenic peptides
                                           A*01:01       A*02:01       B*15:01
Seq                         Method                                            
(A, T, F, S, V, P, M, E, K) smmpmbec  17077.331150  15646.960997  71277.096461
(E, V, M, P, V, S, M, A, K) smmpmbec  51691.602975  56419.599637  20603.926881
(K, L, E, D, L, E, R, D, L) smmpmbec  76281.636225   2096.186888  72435.256110
(K, T, F, P, P, T, E, P, K) smmpmbec  50748.124892  51455.286421  13550.333994
(L, I, T, G, R, L, Q, S, L) smmpmbec  47360.905813   1504.630981   3147.385937
            A*01:01       A*02:01       B*15:01
count      5.000000      5.000000      5.000000
mean   48631.920211  25424.532985  36202.799877
std    21069.256898  26693.953616  33136.522923
min    17077.331150   1504.630981   3147.385937
25%    47360.905813   2096.186888  13550.333994
50%    50748.124892  15646.960997  20603.926881
75%    51691.602975  51455.286421  71277.096461
max    76281.636225  56419.599637  72435.256110

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
R_Scripts		R_Scripts
data		data
pics		pics
relevant_papers		relevant_papers
slides		slides
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
explore_with_Allel_tcell.ipynb		explore_with_Allel_tcell.ipynb
explore_without_Allel_bcell.ipynb		explore_without_Allel_bcell.ipynb
explore_without_Allel_tcell.ipynb		explore_without_Allel_tcell.ipynb
ideas.md		ideas.md

License

NCBI-Hackathons/Machine_Learning_Immunogenicity

Folders and files

Latest commit

History

Repository files navigation

Machine_Learning_Immunogenicity

Amino acid properties computed using R peptides package and added to the new dataset

Test if epitope predictors predict immunogenicity on published data

Usage

List available predictors and datasets

Example output

About

Resources

License

Stars

Watchers

Forks

Languages