Code for the precisionFDA sample mislabeling identification challenge.

File Structure


This directory houses all the data used by the code for the challenge.


Contains standalone scripts that serve auxiliary purposes in the project, such as cleaning the raw input data and creating tidy data.


Contains all R scripts and R Markdown documents used in the analysis.

Is a Python module intended to be used by other scripts and never to be run on its own.

The file contains functions used for:

- training scikit-learn classifiers
- making predictions with each algorithm

To contribute a new model to this module, do the following:

  1. Add an import statement for just the classifier being used.
from sklearn.neighbors import KNeighborsClassifier
  2. Create a function called train_name-of-model that takes two parameters: the training data and the labels for that data. The function should create a new classifier, make a stratified shuffled split of the data, compute the model's cross-validation scores, and print them.
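def train_knn(data, labels):
    # Create the classifier with scikit-learn's default settings.
    knn = KNeighborsClassifier()
    # Stratified shuffled splits keep the class proportions in every split.
    cv = StratifiedShuffleSplit(
        n_splits=NUMBER_OF_SPLITS,
        test_size=TEST_SIZE,
        random_state=RAND_STATE,
    )
    scores = cross_val_score(knn, data, labels, cv=cv, scoring=SCORING_METHOD)
    # Print the cross-validation scores, as described in the step above.
    print(scores)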
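
The two steps above can be tried end to end with the following sketch. It is only an illustration under stated assumptions: the constant values are hypothetical placeholders (the module's real values may differ), the data is synthetic rather than the challenge data, and fitting and returning the model at the end is an assumption about how a caller would go on to make predictions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical values; the constants defined in the module may differ.
NUMBER_OF_SPLITS = 5
TEST_SIZE = 0.2
RAND_STATE = 0
SCORING_METHOD = "accuracy"


def train_knn(data, labels):
    """Cross-validate a k-nearest-neighbors classifier and print its scores."""
    knn = KNeighborsClassifier()
    cv = StratifiedShuffleSplit(
        n_splits=NUMBER_OF_SPLITS,
        test_size=TEST_SIZE,
        random_state=RAND_STATE,
    )
    scores = cross_val_score(knn, data, labels, cv=cv, scoring=SCORING_METHOD)
    print("knn cross-validation scores:", scores)
    # Assumption: fit on all the data and return the model so it can predict.
    knn.fit(data, labels)
    return knn, scores


# Synthetic stand-in for the real training data and labels.
data, labels = make_classification(
    n_samples=100, n_features=8, random_state=RAND_STATE
)
model, scores = train_knn(data, labels)
predictions = model.predict(data[:5])
```

cross_val_score returns one score per split, so the printed array has NUMBER_OF_SPLITS entries.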

This is the main script for running the analysis and the algorithms. The functions in learner_functions should be called from this file.

Feature Selection

Three types of feature selection are implemented: variance thresholding, univariate selection, and feature elimination:

import feature_selection
from load_data import LoadData
from sklearn.feature_selection import RFE, SelectKBest
from sklearn.svm import SVC

data = LoadData()

var_threshold = feature_selection.variance(data.proteomic, threshold=0.125)
k_best = feature_selection.univariate(data.proteomic, data.clinical, method=SelectKBest)
feature_elim = feature_selection.elimination(data.proteomic, data.clinical, SVC(), eliminator=RFE, n_features_to_select=15)
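
For readers without access to the project's feature_selection module, the sketch below shows the same three kinds of selection using scikit-learn's own classes directly. How the project's wrapper functions call these classes internally is not shown in this README, so treat the mapping as an assumption; the data is synthetic. Note that scikit-learn's RFE requires an estimator that exposes coef_ or feature_importances_, so the sketch uses a linear-kernel SVC.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.svm import SVC

# Synthetic stand-in for the proteomic matrix (X) and a clinical label (y).
X, y = make_classification(
    n_samples=120, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
# Append a constant column so the variance filter has something to remove.
X = np.hstack([X, np.full((X.shape[0], 1), 3.0)])

# 1. Variance threshold: drop features whose variance is below the cutoff.
by_variance = VarianceThreshold(threshold=0.125).fit_transform(X)

# 2. Univariate selection: keep the k features with the best ANOVA F-score.
k_best = SelectKBest(score_func=f_classif, k=10).fit_transform(by_variance, y)

# 3. Recursive feature elimination: repeatedly fit the estimator and drop
#    the weakest feature until n_features_to_select remain.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=15)
eliminated = rfe.fit_transform(by_variance, y)

print(X.shape, by_variance.shape, k_best.shape, eliminated.shape)
```

The variance filter is unsupervised and needs only the feature matrix, while the other two methods also need the labels.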

Siamese Network + Genetic Algorithm

The siamese network (along with the other learners) generates the probability that each sample is a mismatch. These probabilities can be generated by running:


(Note: Keras and TensorFlow must be installed for this to work.)

The genetic algorithm can then use these probabilities to generate the best rematching of the clinical, RNA-Seq, and proteomic data:


All relevant hyperparameters for the genetic algorithm are defined as constants at the beginning of its script (after the imports).
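
To make the idea concrete, here is a minimal sketch of how a genetic algorithm could turn mismatch probabilities into a rematching. It is not the project's implementation: it pairs only two data types instead of three, and the hyperparameter names and values, the mismatch_prob matrix, and all function names are hypothetical. Each candidate is a permutation in which matching[i] is the sample paired with sample i, and the cost to minimize is the total mismatch probability.

```python
import random

# Hypothetical hyperparameters, defined as constants after the imports.
POPULATION_SIZE = 60
GENERATIONS = 200
MUTATION_RATE = 0.3
ELITE_COUNT = 2
SEED = 0


def cost(matching, mismatch_prob):
    """Total mismatch probability when sample i is paired with matching[i]."""
    return sum(mismatch_prob[i][j] for i, j in enumerate(matching))


def order_crossover(parent_a, parent_b, rng):
    """Keep a slice of parent_a and fill the rest in parent_b's order."""
    n = len(parent_a)
    start, end = sorted(rng.sample(range(n + 1), 2))
    middle = parent_a[start:end]
    taken = set(middle)
    rest = [gene for gene in parent_b if gene not in taken]
    return rest[:start] + middle + rest[start:]  # always a valid permutation


def swap_mutation(matching, rng):
    """Swap two positions in place."""
    i, j = rng.sample(range(len(matching)), 2)
    matching[i], matching[j] = matching[j], matching[i]


def rematch(mismatch_prob, rng):
    n = len(mismatch_prob)
    # Start from the current labeling (identity) plus random permutations.
    population = [list(range(n))]
    while len(population) < POPULATION_SIZE:
        population.append(rng.sample(range(n), n))
    for _ in range(GENERATIONS):
        population.sort(key=lambda m: cost(m, mismatch_prob))
        next_gen = [m[:] for m in population[:ELITE_COUNT]]  # elitism
        parents = population[: POPULATION_SIZE // 2]  # truncation selection
        while len(next_gen) < POPULATION_SIZE:
            child = order_crossover(rng.choice(parents), rng.choice(parents), rng)
            if rng.random() < MUTATION_RATE:
                swap_mutation(child, rng)
            next_gen.append(child)
        population = next_gen
    return min(population, key=lambda m: cost(m, mismatch_prob))


# Made-up probabilities: low for the true pairing, high for every other pair.
rng = random.Random(SEED)
truth = [0, 1, 3, 2, 4, 5, 7, 6]  # two pairs of samples have swapped labels
mismatch_prob = [
    [rng.uniform(0.0, 0.2) if j == truth[i] else rng.uniform(0.6, 1.0)
     for j in range(len(truth))]
    for i in range(len(truth))
]
best = rematch(mismatch_prob, rng)
print("best matching:", best, "cost:", round(cost(best, mismatch_prob), 3))
```

Because the best candidates are copied unchanged into each new generation, the returned matching never costs more than the original labeling the search started from.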