STATS202: Data Mining and Analysis
The Positive and Negative Syndrome Scale (PANSS) is a medical scale used for measuring symptom severity of patients with schizophrenia. Raw and cleaned datasets are located at ./data/
.
Please note that datasets ./data/Study_[A, B, C, D].csv
are labelled training sets, while ./data/Study_E.csv
is unlabelled testing sets.
The final project consisted of four tasks: testing for treatment effect, clustering patients, classifying assessment validity, and forecasting further PANSS scores.
A more detailed description of the four tasks for this project (mentioned above) can be found in the instruction document at ./final_project.pdf
A 17-page report for this project was uploaded as well: ./report/writeup.pdf
Most of this project was done using python while the hypothesis testing part used R;
K-Mean, PCA, and agglomerative clustering in clustering task were from sklearn
;
Boosting, random forest, support vector regression, and support vector classification were from sklearn
;
Deep neural networks were from tensorflow
.
./classification/
Assessment validity classification (task 4).
./cluster/
Patient classification (task 2).
./descriptive/
Data visualization and treatment effect testing (task 1).
./forecasting/
Forecasting PANSS scores (task 3).
./report/
The writeup report for this final project.
./util/
Data manipulation utilities.
To verify the validity of k-mean clustering, the clustering results were visualized no a principal component space instead of the feature space.
Run code below using bash to generate an interactive 3D plot.
cd ./clustering
python3.7 ./KMean_PCA.py --clusters=3 --components=3
Sample static plots (axes were adjusted for better illustration):
References for what PANSS scores are and how they are measured can be found at ./references/