## Project
The code in this repository is an attempt to solve the AVITO.ru Kaggle challenge. The goal of the project is to develop a model that predicts whether an online ad (in Russian) contains mention of illicit materials and should be flagged for review. The metric of success is Average Precision at k (AP@k), which integrates the algorithm's precision (# correct predictions / total predictions) over recall from 0% (no correct predictions returned by the model) to 100% (the model correctly predicts all true positives in the test data). For the public Kaggle leaderboard, k is set at 32,500 entries.
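The repository's APatK.py provides the metric used here; for reference, a minimal illustrative implementation of AP@k (not the repository's actual code) might look like this:

```python
def ap_at_k(actual, predicted, k):
    """Average Precision at k.

    `actual` is a set of relevant item ids; `predicted` is a ranked list
    of predictions. Precision is accumulated at each rank where a relevant
    item appears in the top-k, then normalized.
    """
    hits = 0
    score = 0.0
    for i, p in enumerate(predicted[:k]):
        if p in actual:
            hits += 1
            score += hits / (i + 1)  # precision at this cut-off
    if not actual:
        return 0.0
    return score / min(len(actual), k)
```

For example, with relevant items `{1, 2, 3}` and ranked predictions `[1, 4, 2, 3]`, the precision at ranks 1, 3, and 4 is averaged over the three relevant items.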
## Prerequisite external modules
- NumPy
- SciPy
- scikit-learn
- NLTK
  - Use `nltk.download()` to obtain the `corpora/stopwords` corpus
## Expected folder structure
- root
  - data
    - avito_train.tsv (training dataset)
    - avito_test.tsv (testing dataset)
  - results
    - avito_starter_solution.csv (sample submission generated by sample.py)
  - sample.py (Avito-provided sample submission generator)
  - APatK.py (methods for generating the AP@k metric)
  - all other source files
The code will be stored in this repository, but please obtain the data from Kaggle and unpack it into the location above. Combined, the training and testing data are ~4 GB.
The code in sample.py has been modified to expect the above structure. If you commit modified paths, please let the other contributors know, as doing so will likely break their workflow.
## Notebooks
## Results
- The sample submission provided by Avito yields AP@k 0.05367.
- The set generated by sample.py yields AP@k 0.88598.
- The current benchmark AP@k from sample.py is 0.89061.