
This is the repository Mike Vella and Ben Derrett used to code the solution that achieved position #13 in the Kaggle competition See Click Predict Fix.

WARNING

Most of the code here is a highly experimental prototype. It was written as a one-off for the competition and to help BD and MV learn about sklearn and Pandas. I would not advise anyone to use this codebase for educational purposes.

Things to try

In order of subjective importance (i.e. expected score improvement / implementation time):

BD, MV: Can we incorporate text with GBM? Perhaps we do this by reducing the number of features to the GBM.
MV: Tune the max_depth parameter of the GBM, as suggested on the sklearn GBM page.
Run tf-idf with the max_features parameter set, e.g. max_features=100. This should greatly speed up computation and reduce overfitting.
MV: Try submitting a blend of regressors.
BD: See whether special considerations for Chicago improve the score.
MV: Look for duplicates in test data. How common are they? The city often marks duplicates as closed. Doing this removes the option to vote. Can we use this?
BD, MV: Why do the mean corrections in time not improve the score?
BD, MV: Why do we have large variation in correction factors between v, v and c?
BD: Speed up computation by not computing features which are never used.
MV: Separate hyperparameter scans for each of {c,v,v}.
MV: Separate training for remote_api_created, perhaps even separate training sets.
MV: Data preformatting: this has to be critical, what can we do?
MV: With GBM, going from 40 to 30 estimators caused a very slight improvement. Would going from 30 to 20 cause a more significant improvement? Where on the estimators/score curve is the right place? (BD: Have you tried this, MV?)
BD: Think in terms of excess votes (>1).
Move code from notebooks when it is useful to keep.
Plot the proportion of data in each small niche for the training and test data sets.
Can we combine the mean information we have for the remoteapi with that for the first and second halves of the test data?
Use locality data: http://sedac.ciesin.columbia.edu/data/set/usgrid-summary-file3-2000-msa
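
The tf-idf and GBM items above can be sketched roughly as follows. This is a minimal illustration, not the code used in the competition: the texts and vote counts are made-up toy data, and the parameter values (max_features=5, n_estimators=30, max_depth=3) are arbitrary assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for issue descriptions and vote counts.
texts = [
    "pothole on main street",
    "graffiti on the wall",
    "street light is out",
    "large pothole near school",
    "broken street light on main",
    "graffiti near the park",
]
votes = np.array([3, 1, 2, 4, 2, 1])

# Cap the vocabulary so only the most frequent terms become features.
vectorizer = TfidfVectorizer(max_features=5)
# With so few columns, converting to a dense array is cheap.
X_text = vectorizer.fit_transform(texts).toarray()

# Small GBM on the reduced text features (values are illustrative only).
model = GradientBoostingRegressor(n_estimators=30, max_depth=3, random_state=0)
model.fit(X_text, votes)
pred = model.predict(X_text)

print(sorted(vectorizer.vocabulary_))  # the terms that survived the cap
print(pred.round(2))
```

Capping max_features keeps the text matrix narrow enough to hand to the GBM alongside other features. Whether it helps the score would still need checking on held-out data.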

Useful Background

Read http://eaves.ca/2013/09/11/announcing-the-311-data-challenge-soon-to-be-launched-on-kaggle/ and in particular the link to the Open311 standard API: http://open311.org/. This should really help our intuition of what's going on.
Visualization of tag types in different cities: https://www.kaggle.com/c/the-seeclickfix-311-challenge/visualization/1299

Observations

11 Nov submission (Score 0.30339, pos #8)

With gradient boosters, increasing the number of estimators can be counter-productive.
Cross-validation (CV) scores can be misleading.
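
One way to look at the estimators/score curve without refitting for every estimator count is sklearn's staged_predict, which yields predictions after each boosting stage. The sketch below uses synthetic data and mean squared error as assumptions for illustration. It is not the competition data or cost function, and a single held-out split can mislead in the same way CV can.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption: stands in for the real features).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit once with a generous number of estimators.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Validation error after each boosting stage (stage i uses i + 1 estimators).
val_errors = [
    mean_squared_error(y_val, stage_pred)
    for stage_pred in model.staged_predict(X_val)
]
best_n = int(np.argmin(val_errors)) + 1

print("best number of estimators on this split:", best_n)
print("validation MSE at best_n:", val_errors[best_n - 1])
print("validation MSE at 200:", val_errors[-1])
```

If the error at best_n is clearly lower than at the final stage, the extra estimators are overfitting on this split.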
