Course project for CU Boulder CSCI 5622: Machine Learning with Dr. Jordan Boyd-Graber
Team members: Nicolas Metts, Matthew Pennington, Rani Schwindt and Carter Tillquist
An "essential" gene is one whose absence or deletion confers a lethal phenotype. We propose that gene essentiality can be predicted using a weighted combination of features. Here we use data sets with genes from Saccharomyces cerevisiae, a well-characterized yeast species.
In 2006, Seringhaus et al. used 14 biological features to train a classifier for predicting essential genes in S. cerevisiae and a related organism, S. mikatae. On 4,648 genes in S. cerevisiae, the classifier achieved a precision, TP/(TP+FP), of 0.69 and a recall, TP/(TP+FN), of 0.091. The classifier was an average of 7 different classifiers, including logistic regression, Naive Bayes, and AdaBoost.
The data from this paper were provided by the lab here; however, they include a complete feature set for only 3,500 genes.
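
As a rough illustration of the precision and recall definitions above and of averaging several classifiers, here is a minimal Python sketch. The function names, the confusion-matrix counts, and the per-classifier probabilities are all hypothetical. The counts were chosen only so that the ratios land near the reported 0.69 and 0.091, and are not the paper's actual confusion matrix. Averaging predicted probabilities and thresholding at 0.5 is one assumed reading of "an average of 7 different classifiers", not necessarily the paper's exact procedure.

```python
from statistics import mean


def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_ensemble(prob_lists, threshold=0.5):
    """Average per-classifier probabilities gene by gene, then threshold.

    prob_lists holds one list of P(essential) values per classifier,
    all in the same gene order.
    """
    averaged = [mean(gene_probs) for gene_probs in zip(*prob_lists)]
    return [1 if p >= threshold else 0 for p in averaged]


# Hypothetical counts (NOT from the paper), chosen to mimic a classifier
# that is fairly precise but misses most essential genes.
p, r = precision_recall(tp=9, fp=4, fn=90)
print(f"precision={p:.2f} recall={r:.3f}")

# Three hypothetical classifiers scoring the same four genes.
probs = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.4, 0.3, 0.2],
    [0.7, 0.3, 0.5, 0.0],
]
labels = average_ensemble(probs)
print(labels)
```

A low recall alongside a reasonable precision, as in this toy example, means that the genes called essential are usually correct but most essential genes go undetected.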
Label = SGD_ess (1 = essential, 0 = nonessential)
Feature | Description | Raw data format |
---|---|---|
Mitochondria | Does the protein localize to the mitochondria (predicted) | Binary |
Cytoplasm | Does the protein localize to the cytoplasm (predicted) | Binary |
ER | Does the protein localize to the ER (predicted) | Binary |
Nucleus | Does the protein localize to the nucleus (predicted) | Binary |
Vacuole | Does the protein localize to the vacuole (predicted) | Binary |
Other | Does the protein localize somewhere else (predicted) | Binary |
CAI | Codon adaptation index | 0 to 1 |
Nc | Effective number of codons | Integer |
GC | GC content | 0 to 1 |
L_aa | Number of amino acids in protein (predicted) | Integer |
Gravy | Hydrophobicity (positive) or hydrophilicity (negative) | -inf to inf |
DovEXPR | Unknown | 0 to inf |
BLAST_hits_in_yeast | Number of related genes in yeast (BLAST similarity) | Integer |
INTXN_partners | Number of protein interaction partners | Integer |
Chromosome | Which of the 16 yeast chromosomes the gene is on | Integer |
Chr_position | Where does the gene start relative to the whole chromosome | 0 to 1 |
Intron | Unknown | Binary |
CLOSE_STOP_RATIO | % of codons one third-base away from stop codon | 0 to 1 |
RARE_AA_RATIO | % of rare amino acids in translated ORF | 0 to 1 |
TM_HELIX | Number of transmembrane helices (predicted) | Integer |
In_how_many_of_5_proks_BLAST | Number of related genes in 5 prokaryotes (BLAST similarity) | Integer |
In_how_many_of_6_close_yeast_BLAST | Number of related genes in 6 yeast species (BLAST similarity) | Integer |
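
To make the idea of a "weighted combination of features" concrete, the sketch below scores one gene with a logistic function of a weighted sum, using a few feature names from the table above. It is only an assumed form of the model: the weights, bias, and feature values are invented for illustration and are not fitted to any data, and `essentiality_score` is a hypothetical helper name.

```python
import math


def essentiality_score(features, weights, bias=0.0):
    """Logistic function of a weighted sum of feature values."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and bias; a real model would learn these from data.
weights = {
    "Nucleus": 0.8,
    "CAI": 1.5,
    "INTXN_partners": 0.1,
    "BLAST_hits_in_yeast": -0.3,
}

# Hypothetical feature values for a single gene.
gene = {
    "Nucleus": 1,
    "CAI": 0.4,
    "INTXN_partners": 6,
    "BLAST_hits_in_yeast": 2,
}

score = essentiality_score(gene, weights, bias=-2.0)
predicted_label = 1 if score >= 0.5 else 0  # 1 = essential, 0 = nonessential
print(f"score={score:.3f} label={predicted_label}")
```

Because the raw features span very different ranges (binary flags, ratios from 0 to 1, and unbounded integer counts), rescaling them before fitting the weights would likely make the weights easier to compare.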
We are compiling features for all 5,799 genes in S. cerevisiae. Label = Essential (1 = essential, 0 = nonessential). Note that the localization features in this data set were experimentally determined rather than predicted.
Feature | Description | Raw data format |
---|---|---|
Transcript length | Length of the transcribed gene, including UTRs | Integer |
Strand | Whether gene is on the positive DNA strand (1) or negative (-1) | 1 or -1 |
GC | GC content | 0 to 1 |
Enzyme | Does the protein have enzymatic activity | 0 to 1 |
SEG.low.complexity | Predicted to have low-complexity regions | 0 to 1 |
Transmembrane.domain | Does the protein have a transmembrane domain | 0 to 1 |
Signal.peptide | Does the protein have a signal peptide | 0 to 1 |
Coiled.coil | Does the protein have a coiled coil | 0 to 1 |
Nucleus | Does the protein localize to the nucleus | Binary |
Mitochondria | Does the protein localize to the mitochondria | Binary |
ER | Does the protein localize to the ER | Binary |
Cytoplasm | Does the protein localize to the cytoplasm | Binary |
Ribosome | Does the protein localize to the ribosome | Binary |
TO DO: add Gene Ontology terms about protein and gene function