zhangyilun/waterloo-stat441-project

University of Waterloo STAT 441 Final Project

This report summarizes supervised and semi-supervised machine learning algorithms applied to predicting the predominant forest cover type in a Kaggle competition. The analysis covers exploratory data analysis, dimensionality reduction, supervised model fitting (logistic regression, tree-based ensemble methods, gradient boosting, adaptive boosting, naive Bayes, support vector machines, and neural networks), feature creation and selection, and semi-supervised learning using graph-based label spreading and label propagation. Classification error rate was used to measure model performance. The best-performing model was an extremely randomized trees model with grid-searched parameters, fitted on 116 features selected by Gini variable importance from a pool of base features and their 2-way and 3-way interactions. This model ranked 362nd among the 1694 teams that participated in the competition.
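The feature selection and tuning steps described above can be illustrated with a minimal sketch. It is not the project's actual code. It assumes scikit-learn, synthetic data in place of the competition data, interactions formed as products of features, and illustrative values for the number of kept features (`n_keep`) and the parameter grid; none of these details come from the report.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the competition data (assumption: not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Base features plus 2-way and 3-way interactions.
# Assumption: interactions are products of distinct features.
poly = PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)
X_inter = poly.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_inter, y, test_size=0.25, random_state=0, stratify=y
)

# Rank features by Gini importance (the default split criterion is "gini").
ranker = ExtraTreesClassifier(n_estimators=100, random_state=0)
ranker.fit(X_train, y_train)
n_keep = 20  # hypothetical; the report kept 116 features
top_idx = np.argsort(ranker.feature_importances_)[::-1][:n_keep]

# Grid search an extremely randomized trees model on the selected features.
# The grid values are illustrative only.
param_grid = {"n_estimators": [100, 200], "max_features": ["sqrt", None]}
search = GridSearchCV(
    ExtraTreesClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train[:, top_idx], y_train)

# Classification error rate on held-out data.
error_rate = 1.0 - search.score(X_test[:, top_idx], y_test)
print(f"candidate features: {X_inter.shape[1]}, kept: {n_keep}")
print(f"best params: {search.best_params_}, test error rate: {error_rate:.3f}")
```

In this sketch the importance ranking is computed on the training split only, so the held-out error rate is not influenced by the selection step.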
