FDA-Sampling

Code for the precisionFDA sample mislabeling identification challenge.

File Structure

`data/`

This directory houses all the data used by this code for the challenge.

`scripts/`

Contains standalone scripts that serve an auxiliary purpose in the project, such as cleaning the raw input data and creating tidy data.

`r/`

Contains all R scripts and R Markdown documents used in the analysis.

`learner_functions.py`

A Python module intended to be imported by other scripts and never run on its own.

This file contains functions for:

- training scikit-learn classifiers
- making predictions with each algorithm

To contribute a new model to this module do the following:

Add an import statement for just the module being used.

from sklearn.neighbors import KNeighborsClassifier

Create a function called `train_<name-of-model>` that takes two parameters: the training data and the labels for that data. The function should create a new classifier, make a stratified shuffled split of the data, compute the model's cross-validation scores, and print them.

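# Assumes StratifiedShuffleSplit and cross_val_score (sklearn.model_selection)
# and the upper-case constants are already defined at the top of the module.
def train_knn(data, labels):
    knn = KNeighborsClassifier()
    cv = StratifiedShuffleSplit(
            n_splits=NUMBER_OF_SPLITS,
            test_size=TEST_SIZE,
            random_state=RAND_STATE)

    # Print the cross-validation score for each split
    scores = cross_val_score(knn, data, labels, cv=cv, scoring=SCORING_METHOD)
    print(scores)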
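As an illustration of the same steps applied to a different model, here is a minimal self-contained sketch that adds a random forest. The constant values, the function name `train_random_forest`, the returned scores, and the synthetic dataset are all assumptions made for this example; the real constants are whatever `learner_functions.py` defines.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Hypothetical values for illustration; the module defines its own constants.
NUMBER_OF_SPLITS = 5
TEST_SIZE = 0.2
RAND_STATE = 0
SCORING_METHOD = "accuracy"


def train_random_forest(data, labels):
    """Cross-validate a random forest, print the scores, and return them."""
    forest = RandomForestClassifier(n_estimators=50, random_state=RAND_STATE)
    cv = StratifiedShuffleSplit(
        n_splits=NUMBER_OF_SPLITS,
        test_size=TEST_SIZE,
        random_state=RAND_STATE,
    )
    scores = cross_val_score(forest, data, labels, cv=cv, scoring=SCORING_METHOD)
    print(scores)
    return scores  # returning the scores is an addition to the pattern above


# Synthetic stand-in for the challenge data.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
scores = train_random_forest(X, y)
```

Each entry in `scores` corresponds to one stratified shuffle split, so the array has `NUMBER_OF_SPLITS` values.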

`main.py`

This is the main script for running the analysis. Functions from `learner_functions` should be called from here.

Feature Selection

There are three types of feature selection implemented:

Variance Threshold - Remove features with low variance.
Univariate Feature Selection - Select the best features based on some metric. Default is SelectKBest, which gets the k features that classify with the highest score (default is accuracy).
Recursive Feature Elimination - Recursively eliminate features that look less important after classification.

import feature_selection
from load_data import LoadData
from sklearn.feature_selection import RFE, SelectKBest
from sklearn.svm import SVC

data = LoadData()

# One call per feature selection type listed above
var_threshold = feature_selection.variance(data.proteomic, threshold=0.125)
k_best = feature_selection.univariate(data.proteomic, data.clinical, method=SelectKBest)
feature_elim = feature_selection.elimination(data.proteomic, data.clinical, SVC(), eliminator=RFE, n_features_to_select=15)
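For readers without the repository at hand, the three approaches can be sketched with scikit-learn directly. This is not the code in `feature_selection.py`; the synthetic data, the `f_classif` scoring function, `k=10`, and the linear kernel are assumptions chosen for the example. Plain `RFE` needs an estimator that exposes `coef_` or `feature_importances_`, which is why the sketch uses a linear SVC.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.svm import SVC

# Synthetic stand-in for the proteomic features and class labels.
X, y = make_classification(
    n_samples=80, n_features=30, n_informative=5, random_state=0
)

# Variance threshold: drop features whose variance is not above the cutoff.
X_var = VarianceThreshold(threshold=0.125).fit_transform(X)

# Univariate selection: keep the k features with the highest ANOVA F-score.
X_kbest = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Recursive feature elimination: repeatedly fit the estimator and discard
# the feature with the smallest weight until 15 features remain.
rfe = RFE(SVC(kernel="linear"), n_features_to_select=15)
X_rfe = rfe.fit_transform(X, y)

print(X_var.shape, X_kbest.shape, X_rfe.shape)
```

After fitting, `rfe.support_` is a boolean mask marking which of the original columns were kept.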

Siamese Network + Genetic Algorithm

The siamese network (along with the other learners) generates probabilities that samples are mismatched. Generate these probabilities by running:

python probabilities.py

(Note: Keras and TensorFlow must be installed for this to work.)

The genetic algorithm can then use these probabilities to find the best rematching of clinical, RNA-Seq, and proteomic data:

python genetic.py

All relevant hyperparameters for the genetic algorithm are defined as constants at the top of genetic.py (after the imports).
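To give a feel for how a genetic algorithm can search for a rematching, here is a simplified sketch. It is not the code in genetic.py. It assumes only two data types, a hypothetical square matrix `match_prob` where entry `[i, j]` is the probability that row sample `i` belongs with column sample `j`, and a mutation-only scheme with elitism (no crossover). All constant names and values are invented for the example.

```python
import numpy as np

# Hypothetical hyperparameters, kept as constants after the imports.
POPULATION_SIZE = 50
GENERATIONS = 200
ELITE_COUNT = 10
SEED = 0


def fitness(perm, match_prob):
    """Sum of match probabilities when row i is paired with column perm[i]."""
    return match_prob[np.arange(len(perm)), perm].sum()


def mutate(perm, rng):
    """Return a copy of perm with two randomly chosen positions swapped."""
    child = perm.copy()
    i, j = rng.choice(len(perm), size=2, replace=False)
    child[i], child[j] = child[j], child[i]
    return child


def best_rematching(match_prob):
    """Evolve permutations toward a high total match probability."""
    rng = np.random.default_rng(SEED)
    n = match_prob.shape[0]
    population = [rng.permutation(n) for _ in range(POPULATION_SIZE)]
    for _ in range(GENERATIONS):
        # Keep the fittest candidates unchanged (elitism).
        population.sort(key=lambda p: fitness(p, match_prob), reverse=True)
        elite = population[:ELITE_COUNT]
        # Refill the population with mutated copies of random elite members.
        children = [
            mutate(elite[rng.integers(ELITE_COUNT)], rng)
            for _ in range(POPULATION_SIZE - ELITE_COUNT)
        ]
        population = elite + children
    return max(population, key=lambda p: fitness(p, match_prob))


# Random probabilities as a stand-in for the learners' output.
match_prob = np.random.default_rng(1).random((6, 6))
best = best_rematching(match_prob)
print(best, fitness(best, match_prob))
```

Representing a candidate as a permutation guarantees that every sample is assigned exactly once, and swap mutation preserves that property.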
