
Owkin application - Thomas AUZARD

Description

Repository for the Owkin data challenge, submitted as part of my application for the "Machine Learning Scientist - Medical Imaging - Internship" offer. It contains the (short) code, the data used (.csv files) and some explanations.

Requirements

pip install -r requirements.txt

My approach for this problem

What's in and behind the code

After reading and understanding the challenge description, I decided to use the lifelines package for the Cox model. I quickly obtained a first prediction using all radiomics + all clinical data, which gave me a score just below the benchmark. After playing around with the features and with lifelines, which can be very visual, I did a feature selection based on the Pearson correlation coefficient. From that I extracted the "useful" features and made a second prediction, which scored a little above the benchmark (0.7198). I finally used lifelines to get a quick overview of the model and realised that many features were "useless" for the regression (zero coefficients in the β vector). I removed them and made a last prediction on the test dataset, which gave a score of 0.728 (please note that the submissions from the accounts orsonlelyonnais and thomas_auzard are both mine).
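
A minimal sketch of this workflow with lifelines (the file name train.csv, the column names SurvivalTime and Event, the penalizer value and the coefficient threshold are all assumptions for illustration, not necessarily what the actual code uses):

```python
# Minimal sketch of the lifelines workflow described above.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("train.csv")  # merged radiomics + clinical features (assumed name)

cph = CoxPHFitter(penalizer=0.1)
cph.fit(df, duration_col="SurvivalTime", event_col="Event")
cph.print_summary()  # quick visual overview of the fitted model

# Drop features whose coefficient in the beta vector is (near) zero, then refit
useless = cph.params_[cph.params_.abs() < 1e-3].index
df_reduced = df.drop(columns=list(useless))
cph.fit(df_reduced, duration_col="SurvivalTime", event_col="Event")
```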

I first coded some functions: it was much easier to play with the data that way. I searched a little online about the Cox model and survival prediction, and realised that in the amount of time I wanted to spend on this project, I would probably not be able to use anything more elaborate than the "standard" Cox model.

I found some repositories on GitHub and other packages with "optimized" Cox models (gradient-boosted, different base learners), but I decided to stick with the standard one and play with its parameters instead.
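
For reference, this is roughly what one of those "optimized" variants looks like, here a gradient-boosted Cox model from the scikit-survival package; a hedged sketch only, since this was not used in the final code (column names are assumed as above):

```python
# Sketch of a gradient-boosted Cox model with scikit-survival (not used here).
import pandas as pd
from sksurv.ensemble import GradientBoostingSurvivalAnalysis
from sksurv.util import Surv

df = pd.read_csv("train.csv")
X = df.drop(columns=["SurvivalTime", "Event"])  # assumed column names
y = Surv.from_arrays(event=df["Event"].astype(bool), time=df["SurvivalTime"])

gbs = GradientBoostingSurvivalAnalysis(n_estimators=100, learning_rate=0.1)
gbs.fit(X, y)
risk_scores = gbs.predict(X)  # higher score = higher predicted risk
```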

What was tested but is not in the code

Regarding feature selection, I tried Spearman selection, but it did not give me better results than Pearson selection. I also tried to standardize the data (using a sklearn function), but again, the results were not better (quite oddly).
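
A sketch of the correlation-based selection and the standardization that was tested (the 0.1 threshold and the column names are illustrative assumptions):

```python
# Sketch of correlation-based feature selection plus optional standardization.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")
target = "SurvivalTime"  # assumed name of the duration column

# Pearson (or Spearman, via method="spearman") correlation with the target
corr = df.drop(columns=["Event"]).corr(method="pearson")[target].drop(target)
selected = corr[corr.abs() > 0.1].index  # keep the "useful" features

# Optional standardization (tested, but it did not improve the score)
scaler = StandardScaler()
df.loc[:, selected] = scaler.fit_transform(df[selected])
```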

I also coded a cross-validation method, which I did not have enough time to actually use (see cross_val.py).
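
For illustration, lifelines ships a helper that does essentially this; a sketch in the spirit of cross_val.py, not its actual content (column names assumed as above):

```python
# Sketch of k-fold cross-validation of the Cox model, scored by C-index.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import k_fold_cross_validation

df = pd.read_csv("train.csv")
scores = k_fold_cross_validation(
    CoxPHFitter(penalizer=0.1), df,
    duration_col="SurvivalTime", event_col="Event",
    k=5, scoring_method="concordance_index",
)
print(sum(scores) / len(scores))  # mean C-index over the folds
```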

I wanted to give a neural net (Keras) a try even though I thought it would probably be overkill and inefficient. At first, I used 3D convolutional layers with the raw scans as inputs, which turned out to be way too computationally expensive for my computer (I wanted to use the scans or the masks, because the radiomics were described as biased and suboptimal). It was still too slow with the binary masks, so I decided to try with the radiomics and clinical data as inputs. I tried a "regression" neural net (single output, MSE on the survival time as loss), which gave terrible results.
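
A sketch of what such a "regression" net looks like in Keras (layer sizes and the placeholder data are illustrative; the actual architecture is not in this repository):

```python
# Sketch of the "regression" net: features in, a single survival time out.
import numpy as np
from tensorflow import keras

# Placeholder data standing in for the radiomics + clinical features
X_train = np.random.rand(200, 40).astype("float32")
y_train = (np.random.rand(200) * 1000).astype("float32")  # survival times

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(40,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),  # single output: predicted survival time
])
model.compile(optimizer="adam", loss="mse")  # MSE on the survival time
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)
```

Plain MSE ignores censoring, which is likely part of why this performed so poorly.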

I then tried a "discrete time" model, framed as classification so a standard neural network could be used. My idea was to set a maximum survival time and divide it into a given number of intervals; the goal was to predict which interval the lifetime would fall into. Unfortunately, I spent some time on it without getting any results, so I stopped trying neural nets.
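
A sketch of the discrete-time idea (the maximum survival time, the number of intervals and the placeholder data are illustrative assumptions):

```python
# Sketch of the discrete-time idea: bin survival times into intervals
# and predict the interval as a class.
import numpy as np
from tensorflow import keras

max_time, n_bins = 3000.0, 10  # assumed max survival time and interval count
bins = np.linspace(0.0, max_time, n_bins + 1)

X_train = np.random.rand(200, 40).astype("float32")       # placeholder features
times = np.random.rand(200).astype("float32") * max_time  # placeholder times
labels = np.clip(np.digitize(times, bins) - 1, 0, n_bins - 1)  # interval index

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(40,)),
    keras.layers.Dense(n_bins, activation="softmax"),  # one class per interval
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X_train, labels, epochs=50, batch_size=32, verbose=0)
```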

What I wanted to test but did not

  • Extracting features directly from the scans instead of using the "standard" radiomics features. I thought that after extensive work on the radiomics features (feature selection based on different criteria, dimensionality reduction (PCA, ICA, etc.)), we could extract a set of features. Then, using a neural net with the scans or the binary masks as inputs (possibly with some imaging pipelines), we could train the network to predict those new features, which would then be used in the Cox model or any other model.

  • The different optimized Cox fitters I read about or found online, and more effective feature selection. I used Pearson correlation coefficients but could probably have used other criteria.

  • Using other survival time prediction models: from the few papers and articles I read about survival time prediction, the Cox model appeared to be the best "estimator" when it comes to censored data (which is why I decided to stick with it), yet I could have tested other models (there are actually some in the lifelines package; see the sketch after this list).
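
For illustration, fitting one of those other lifelines models, e.g. a Weibull accelerated failure time (AFT) model; a sketch with the same assumed column names as above:

```python
# Sketch of an alternative survival model from lifelines: Weibull AFT.
import pandas as pd
from lifelines import WeibullAFTFitter

df = pd.read_csv("train.csv")
aft = WeibullAFTFitter()
aft.fit(df, duration_col="SurvivalTime", event_col="Event")
aft.print_summary()
predicted = aft.predict_median(df)  # median predicted survival time per patient
```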
