Predicting the results of 2017 March Madness

What am I doing?
The data
The models
What have I missed

This project is hosted by Kaggle, which provides the initial data. The goal is to predict the outcome of all 2278 possible matchups (= 68*67/2, since there are 68 teams in the NCAA tournament). Then, given the initial bracket positions, one can fill out the bracket and predict the champion. Let's cut to the chase: I have UCLA as my 2017 champion, and below is the detailed bracket. As of 4/4/2017, I was quite wrong... I got only 32 games right out of 67. I might as well use the model as an anti-predictor...
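
As a quick check on the matchup count, here is a minimal sketch that enumerates every unordered pair of 68 teams. The team IDs are hypothetical placeholders, not the real Kaggle IDs.

```python
from itertools import combinations

# Hypothetical placeholder IDs; the real Kaggle team IDs are not reproduced here.
team_ids = list(range(1, 69))  # 68 tournament teams

# Every unordered pair of teams is one possible matchup to predict.
matchups = list(combinations(team_ids, 2))

print(len(matchups))  # 68 * 67 / 2 = 2278
```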

Even though my predictions didn't work out well, let me walk through my procedure anyway.

The data

The data are provided by Kaggle and are already nicely parsed. The game statistics are quite straightforward if one knows the basics of basketball, and I am mostly interested in the datasets with detailed statistics for each game. My training set is mostly based on this detailed data set (minus the win/lose result). To augment the statistics, I have also added two more types of features: (1) the game-averaged statistics for each team, up to the game day during the season, and (2) the averaged Massey Ordinal ranking, up to the game day during the season. All the coding is done with Python and pandas. I paid special attention here to make sure the team statistics are consistent with the game day, i.e., the statistics are only collected up to the game day. For (2), if a specific agency has no ranking yet, I take the average of the other available agencies to fill the void. I have also included home / away game as a feature.
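
The two feature-building steps described above can be sketched with pandas roughly as follows. This is an illustration on toy data, not the code from this repository: the column names (`team`, `daynum`, `score`) and team labels are assumptions, and only the agency codes (POM, SAG, DC2) are borrowed from the model output further down.

```python
import pandas as pd

# Toy game log, one row per team per game. Column names are illustrative.
games = pd.DataFrame({
    "team":   ["A", "A", "A", "B", "B"],
    "daynum": [10, 14, 20, 11, 20],
    "score":  [70, 80, 90, 60, 64],
})
games = games.sort_values(["team", "daynum"])

# shift(1) excludes the current game, so each row only averages earlier games.
# A team's first game of the season has no history and stays NaN.
games["avg_score_before"] = (
    games.groupby("team")["score"]
         .transform(lambda s: s.shift(1).expanding().mean())
)

# Toy ordinal rankings, one column per agency; NaN means "no ranking yet".
nan = float("nan")
ranks = pd.DataFrame(
    {"POM": [5.0, 12.0, 30.0], "SAG": [7.0, nan, 28.0], "DC2": [nan, 10.0, nan]},
    index=["team1", "team2", "team3"],
)

# Fill each missing ranking with the mean of the agencies that are available.
row_mean = ranks.mean(axis=1)  # NaN values are skipped by default
filled = ranks.apply(lambda col: col.fillna(row_mean))

print(games)
print(filled)
```

The `shift(1)` before `expanding().mean()` is what keeps the current game's result out of its own feature, which is the leakage the paragraph above is guarding against.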

The models

Preparing the training data set is actually the time-consuming part, while the regression is more like a one-button operation. For this project, I've tried (1) random forest, (2) neural network, (3) logistic regression, and (4) gradient boosting, all from the sklearn package. Since the Kaggle competition requires a winning probability, a plain neural network classifier does not seem quite appropriate here. I've also held out 20% of the training data for cross-validation. Below is a snapshot of the training results. I have played with the hyperparameters a little bit, trying to get better results.
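
For readers who want to reproduce the general workflow, here is a minimal sketch of a 20% hold-out evaluation with three of the sklearn models mentioned above. The data is synthetic and the hyperparameters are arbitrary assumptions; none of this is the configuration actually used in this project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and win/lose labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters here are arbitrary, chosen only to keep the example fast.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Gradient Boost": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    win_proba = model.predict_proba(X_val)[:, 1]  # probability of class 1
    results[name] = {
        "accuracy": accuracy_score(y_val, model.predict(X_val)),
        "logloss": log_loss(y_val, win_proba),
    }
    print(name, results[name])
```

Scoring with `predict_proba` rather than `predict` matters here because the competition is judged on the predicted probabilities, not on the hard win/lose calls.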

===== run #1 of 1 ======

Random Forest:
Importance:      B_POM:0.014
Importance:      A_PIG:0.012
Importance:      B_DC2:0.011
Importance:      A_DC2:0.011
Importance:      A_SFX:0.011
Importance:      A_POM:0.011
Importance:      A_TRP:0.010
Importance:      A_LOG:0.010
Importance:      B_SAG:0.010
Importance:      A_EBP:0.010
Training Accurancy : 0.772124427296841
x-validation Accurancy: 0.7145612343297975
Time spent for RF: 70.883s
The logloss is: 0.5520256161356428

Neural Network:
Training Accurancy : 0.7390884977091874
x-validation Accurancy: 0.7107039537126326
Time spent for NN:  1.888s
The logloss is: 0.5540510895470059

Logistic regression:
Training Accurancy : 0.7390884977091874
x-validation Accurancy: 0.7020250723240116
Time spent for GLM:  1.577s
The logloss is: 0.5709980792779981

Gradient Boost RT:
Training Accurancy : 1.0       
x-validation Accurancy: 0.7675988428158148
Time spent for GBRT: 32.882s
The logloss is: 1.624365021920362

From the random forest output, it seems that the Massey Ordinal rankings are the most important predictors.
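
One pattern in the output above is worth a closer look: Gradient Boost has the best validation accuracy but by far the worst logloss. A plausible reading (an interpretation, not something verified against this project's predictions) is that the model is overconfident, since logloss punishes confident wrong answers very heavily. The sketch below uses made-up probabilities to show the effect.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def accuracy(y_true, p_pred):
    """Fraction of games where the side given more than 50% actually won."""
    hits = sum(int(p > 0.5) == y for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

y_true = [1, 1, 1, 0]

# Made-up predictions: both sets pick the same winners (3 of 4 correct),
# but the second set is nearly certain about every pick.
hedged = [0.7, 0.7, 0.3, 0.3]
confident = [0.999, 0.999, 0.001, 0.001]

print(accuracy(y_true, hedged), log_loss(y_true, hedged))
print(accuracy(y_true, confident), log_loss(y_true, confident))
```

Both prediction sets have the same accuracy, yet the near-certain set has a logloss roughly three times larger because of its single confident miss.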

What have I missed

Since I did poorly, both in predicting the outcomes and in the competition ranking, I am asking myself where it went wrong. From the discussions posted on Kaggle, it seems the Massey Ordinal rankings are really the key, and I should probably just use them as features. Some people also used Elo ratings with good results, and some took the distance between the game location and the school location into consideration. Well, I guess I've learnt quite a lot... It has been really fun!
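
For reference, the Elo rating mentioned above can be sketched in a few lines. This is the textbook Elo update, not the variant any particular Kaggle participant used; the starting rating of 1500 and the K-factor of 20 are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo win probability of team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_winner, rating_loser, k=20.0):
    """Return the new (winner, loser) ratings after one game.

    k is the K-factor; 20 is an assumed value, not a tuned one.
    """
    gain = k * (1.0 - expected_score(rating_winner, rating_loser))
    return rating_winner + gain, rating_loser - gain

# Two hypothetical teams starting at the same assumed rating of 1500.
team_a, team_b = 1500.0, 1500.0
team_a, team_b = update(team_a, team_b)  # team A wins the game

print(team_a, team_b)
print(expected_score(team_a, team_b))  # A's win probability in a rematch
```

The output of `expected_score` is already a probability, which is the quantity the competition asks for.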

changyaochen/March-Madness