Code for the precisionFDA sample mislabeling identification challenge.
This directory houses all the data used by this code for the challenge.
Contains standalone scripts that serve auxiliary purposes in the project, such as cleaning the raw input data and creating tidy data.
Contains all R scripts and R Markdown documents used in the analysis.
A Python module intended to be used by other scripts and never run on its own.
This file contains functions for:
- training scikit-learn classifiers
- making predictions with each algorithm
To contribute a new model to this module, do the following:
- Add an import statement for just the module being used.
from sklearn.neighbors import KNeighborsClassifier
- Create a function called train_<name-of-model> (for example, train_knn) that takes two parameters: the training data and the labels for that data. This function should create a new classifier, make a stratified shuffled split of the data, compute the cross-validation scores of the model, and print the scores.
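def train_knn(data, labels):
    # The upper-case names are assumed to be constants defined in this module.
    knn = KNeighborsClassifier()
    # Stratified shuffled split: each split keeps the class proportions.
    cv = StratifiedShuffleSplit(
        n_splits=NUMBER_OF_SPLITS,
        test_size=TEST_SIZE,
        random_state=RAND_STATE)
    # One cross-validation score per split.
    scores = cross_val_score(knn, data, labels, cv=cv, scoring=SCORING_METHOD)
    print(scores)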
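To make the "stratified shuffled split" step concrete, here is a minimal pure-Python sketch of the idea: shuffle the samples within each class and hold out the same fraction of every class. It is only an illustration and not how scikit-learn's StratifiedShuffleSplit is implemented; the function name stratified_shuffle_split and the toy labels are made up for this example.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, test_size=0.25, seed=0):
    """Return (train_indices, test_indices) with class ratios roughly preserved.

    Simplified single-split illustration; not scikit-learn's implementation.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class (at least one sample).
        n_test = max(1, round(len(indices) * test_size))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels (hypothetical): 8 samples of class 0 and 4 samples of class 1.
labels = [0] * 8 + [1] * 4
train_idx, test_idx = stratified_shuffle_split(labels, test_size=0.25, seed=42)
print("train:", train_idx)
print("test:", test_idx)
```

With a 25% test size, the held-out set contains two samples of class 0 and one of class 1, so the 2:1 class ratio of the full data is kept in both parts. In the module itself, StratifiedShuffleSplit repeats this kind of split NUMBER_OF_SPLITS times.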
This is the main script for executing the analysis and algorithm. The functions in learner_functions are meant to be called from this file.
There are three types of feature selection implemented:
- Variance Threshold - Remove features with low variance.
- Univariate Feature Selection - Select the best features based on some metric. Default is SelectKBest, which gets the k features that classify with the highest score (default is accuracy).
- Recursive Feature Elimination - Recursively eliminate features that look less important after classification.
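import feature_selection
from load_data import LoadData
from sklearn.feature_selection import RFE, SelectKBest
from sklearn.svm import SVC

data = LoadData()
# Variance threshold: drop features whose variance is low.
var_threshold = feature_selection.variance(data.proteomic, threshold=0.125)
# Univariate selection using SelectKBest.
k_best = feature_selection.univariate(data.proteomic, data.clinical, method=SelectKBest)
# Recursive feature elimination driven by an SVC, keeping 15 features.
feature_elim = feature_selection.elimination(data.proteomic, data.clinical, SVC(), eliminator=RFE, n_features_to_select=15)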
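As a rough illustration of the simplest of the three strategies, the sketch below applies a variance threshold in plain Python. How feature_selection.variance is actually implemented is not shown in this README, so the helper names variance_filter and keep_columns, the toy matrix, and the choice of population variance with a strict "greater than" comparison are all assumptions made for this example.

```python
from statistics import pvariance


def variance_filter(rows, threshold=0.125):
    """Return the indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    return [i for i, column in enumerate(columns) if pvariance(column) > threshold]


def keep_columns(rows, kept):
    """Build a new matrix that contains only the kept feature columns."""
    return [[row[i] for i in kept] for row in rows]


# Toy data (hypothetical): 4 samples x 3 features.
# Feature 0 is constant, feature 1 barely varies, feature 2 varies a lot.
samples = [
    [1.0, 0.5, 3.0],
    [1.0, 0.6, 1.0],
    [1.0, 0.4, 2.0],
    [1.0, 0.5, 4.0],
]

kept = variance_filter(samples, threshold=0.125)
reduced = keep_columns(samples, kept)
print("kept feature indices:", kept)
print("reduced data:", reduced)
```

Only the third feature survives here: its variance is 1.25, while the other two are at or near zero. A feature that hardly changes across samples cannot help tell matched and mismatched samples apart, which is why it is dropped first.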
The siamese network (along with the other learners) generates mismatch probabilities for the samples. These probabilities can be generated by running:
python probabilities.py
(Note: Keras and TensorFlow must be installed for this to work.)
The genetic algorithm can then use these probabilities to search for the best rematching of clinical, RNA-Seq, and proteomic data:
python genetic.py
All relevant hyperparameters for the genetic algorithm can be found as constants at the beginning of genetic.py (after the imports).
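The internals of genetic.py are not reproduced here. As a hedged sketch of the general idea, the example below runs a mutation-only evolutionary search (a simplified relative of a genetic algorithm, with no crossover) over one-to-one matchings, with its hyperparameters written as constants after the imports. The constant names, the match_prob matrix, and the fitness function are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical hyperparameters, kept as module-level constants.
POPULATION_SIZE = 30
GENERATIONS = 200
SEED = 0


def fitness(assignment, match_prob):
    """Sum of match probabilities when sample i is paired with assignment[i]."""
    return sum(match_prob[i][j] for i, j in enumerate(assignment))


def mutate(assignment, rng):
    """Swap two positions, so the result is still a valid one-to-one matching."""
    child = list(assignment)
    a, b = rng.sample(range(len(child)), 2)
    child[a], child[b] = child[b], child[a]
    return child


def evolve(match_prob, rng):
    """Search for the matching with the highest total match probability."""
    n = len(match_prob)
    # Start from random permutations of the sample indices.
    population = [rng.sample(range(n), n) for _ in range(POPULATION_SIZE)]
    for _ in range(GENERATIONS):
        # Keep the fitter half unchanged and refill with mutated copies.
        population.sort(key=lambda a: fitness(a, match_prob), reverse=True)
        survivors = population[: POPULATION_SIZE // 2]
        children = [mutate(parent, rng) for parent in survivors]
        population = survivors + children
    return max(population, key=lambda a: fitness(a, match_prob))


# Toy probabilities (hypothetical): row i is a clinical sample, column j is an
# omics sample. Samples 1 and 2 look swapped.
match_prob = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.8, 0.1],
    [0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.9],
]

best = evolve(match_prob, random.Random(SEED))
print("best rematching:", best)
print("fitness:", round(fitness(best, match_prob), 2))
```

Swap mutation is used because every candidate has to remain a permutation: each clinical sample is paired with exactly one omics sample and no sample is used twice. Raising POPULATION_SIZE or GENERATIONS widens the search at the cost of run time.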