Context: Capstone project of Udacity's Machine Learning Engineer Nanodegree.
About:
- The Python script takes a dataset from a pan-cancer analysis of paediatric cancers as input.
- It builds a series of classifiers to predict cancer histotypes: a decision tree, a naive Bayes classifier, support vector machines, an ensemble method (AdaBoost), and a multilayer perceptron. The classifiers are trained on the activities of mutational signatures in the dataset.
- It quantifies the intra-histotype variations in the dataset by hierarchical clustering.
- It extracts latent features from the dataset by principal component analysis.
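The classification and PCA steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in 'Capstone.py': the data are synthetic stand-ins generated with scikit-learn, the variable names and hyperparameters are assumptions, and scikit-learn's MLPClassifier is swapped in for the Keras multilayer perceptron so that the sketch needs only one library.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): rows are samples, columns play the role of
# signature activities, and y plays the role of the histotype label.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One model per classifier family named in the list above.
# Hyperparameters are illustrative defaults, not tuned values.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "svm": SVC(kernel="rbf"),
    "adaboost": AdaBoostClassifier(random_state=0),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Extract latent features with PCA; the number of components is arbitrary here.
pca = PCA(n_components=5)
X_latent = pca.fit_transform(X)

for name in sorted(scores):
    print(name, round(scores[name], 3))
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

With the real dataset, X and y would be replaced by the signature activities and histotype labels loaded from 'nature25795-s4'.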
Files:
- The Python script, 'Capstone.py', should be run with Python 3.5.
- The dataset named 'nature25795-s4' must be in the same directory as the script when the latter is run.
- The script can be run without changes. One known issue: the three dendrograms produced by the final block of the script are squeezed into a single plot, so they need to be plotted again separately after the script has run.
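One way to replot the dendrograms separately is to give each one its own figure and axes. The sketch below assumes the squeezing happens because all three dendrograms are drawn on the same axes; the group names, the random data, and the Ward linkage are placeholders rather than values taken from 'Capstone.py'.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove to view plots
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Placeholder data (assumption): three groups of samples, each a small
# samples-by-signatures matrix standing in for one histotype.
rng = np.random.RandomState(0)
groups = {
    "histotype_A": rng.rand(12, 6),
    "histotype_B": rng.rand(15, 6),
    "histotype_C": rng.rand(10, 6),
}

figures = {}
for name, activities in groups.items():
    # Hierarchical clustering of the samples within one group.
    Z = linkage(activities, method="ward")

    # A fresh figure and axes per dendrogram keeps the plots apart.
    fig, ax = plt.subplots(figsize=(8, 4))
    dendrogram(Z, ax=ax)
    ax.set_title(name)
    ax.set_ylabel("distance")
    figures[name] = fig
    # fig.savefig("dendrogram_" + name + ".png")  # uncomment to write to disk
```

Passing `ax=ax` to `dendrogram` directs each plot to its own axes instead of whichever axes happen to be current.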
Modules:
- NumPy is needed for array and matrix support.
- Pandas is needed for data manipulation and analysis.
- matplotlib is needed for visualisation.
- scikit-learn is needed for most of the classification models, their optimisation, their metrics, and principal component analysis.
- Keras and TensorFlow are needed for deep learning.
- SciPy is needed for hierarchical clustering.
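Before running the script, it may help to check that the modules above are importable. The helper below is a hypothetical convenience, not part of 'Capstone.py'; note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the modules listed above (scikit-learn imports as "sklearn").
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn", "keras", "tensorflow", "scipy"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All required modules are importable.")
```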