Midterm project for Prof. Louzoun's Unsupervised Learning course.
Three data sets were analyzed using five unsupervised learning methods. The first describes online shoppers' purchasing intention. The second covers a decade (1999-2008) of clinical care for patients with diabetes at 130 US hospitals. The third contains click-stream data from an online store offering clothing for pregnant women. For each data set, the goal was to cluster the data, visualize the clustering results, measure how well each clustering method agrees with the external classification, determine which clustering algorithm performs best, and explain the differences between them. Of the five algorithms tested, Hierarchical Complete with four clusters performed best on the online shoppers' intention and e-shop clothing data, whereas K Means with three clusters gave the best results on the clinical data.
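As a rough illustration of the comparison described above, the following sketch clusters a synthetic data set with Hierarchical Complete (four clusters) and K Means (three clusters), then scores each result against known labels. The synthetic data from `make_blobs`, the standardization step, and the use of adjusted mutual information as the agreement measure are all assumptions made for this example. They are not necessarily what the project code does.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real data set (assumption: not the project data).
# y_true plays the role of the external classification.
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
X = StandardScaler().fit_transform(X)

# Two of the configurations named in the summary above.
models = {
    "Hierarchical Complete (k=4)": AgglomerativeClustering(n_clusters=4, linkage="complete"),
    "K Means (k=3)": KMeans(n_clusters=3, n_init=10, random_state=0),
}

# Score each clustering against the external labels.
# Adjusted mutual information is an assumed choice of metric.
scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    scores[name] = adjusted_mutual_info_score(y_true, labels)

for name, score in scores.items():
    print(f"{name}: AMI = {score:.3f}")
```

A higher score means closer agreement with the external labels, so the same loop can rank any number of clustering methods on one data set.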
The data are too large to upload. They can be found here:
- Online Shoppers Purchasing Intention Dataset Data Set
- Diabetes 130-US hospitals for years 1999-2008 Data Set
- clickstream data for online shopping Data Set
To run the code, download the data sets and place them in a directory named 'dataset'.
The main modules used in this project are:
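Once the files are in place, they could be loaded along the following lines. This is only a sketch: the helper name `load_datasets` is hypothetical, and it assumes the downloaded files are CSVs sitting directly inside 'dataset'. The actual file names and delimiters depend on the downloads.

```python
from pathlib import Path

import pandas as pd


def load_datasets(directory="dataset", **read_csv_kwargs):
    """Read every CSV file in `directory` into a DataFrame, keyed by file stem.

    Hypothetical helper: assumes the data sets are stored as CSV files.
    Extra keyword arguments (for example `sep`) are passed to pandas.read_csv.
    """
    folder = Path(directory)
    if not folder.is_dir():
        raise FileNotFoundError(f"Expected a directory named '{directory}'")
    return {
        path.stem: pd.read_csv(path, **read_csv_kwargs)
        for path in sorted(folder.glob("*.csv"))
    }
```

Calling `load_datasets()` from the project root returns a dictionary of DataFrames. If a file uses a different delimiter, it can be supplied with `sep`.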
- Sklearn
- Matplotlib
- Skfuzzy
- Numpy
- Pandas
- Scipy
- Yellowbrick