## Project
The code in this repository is an attempt to solve the AVITO.ru Kaggle challenge. The goal of the project is to develop a model that predicts whether an online ad (in Russian) contains mention of illicit materials and should be flagged for review. The metric of success is Average Precision at k (AP@k), which integrates the algorithm's precision (# correct predictions / total predictions) over recall from 0% (no correct predictions returned by the model) to 100% (the model correctly predicts all true positives in the test data). For the public Kaggle leaderboard, k is set at 32,500 entries.
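The repository's APatK.py provides the metric used here; for reference, a minimal illustrative implementation of AP@k (not the repository's actual code) might look like this:

```python
def ap_at_k(actual, predicted, k):
    """Average Precision at k.

    `actual` is a set of relevant item ids; `predicted` is a ranked list
    of predictions. Precision is accumulated at each rank where a relevant
    item appears in the top-k, then normalized.
    """
    hits = 0
    score = 0.0
    for i, p in enumerate(predicted[:k]):
        if p in actual:
            hits += 1
            score += hits / (i + 1)  # precision at this cut-off
    if not actual:
        return 0.0
    return score / min(len(actual), k)
```

For example, with relevant items `{1, 2, 3}` and ranked predictions `[1, 4, 2, 3]`, the precision at ranks 1, 3, and 4 is averaged over the three relevant items.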
## Prerequisite external modules
- NumPy
- SciPy
- scikit-learn
- NLTK
  - Use `nltk.download()` to obtain the `corpora/stopwords` corpus
## Expected folder structure
- root
  - data
    - avito_train.tsv (training dataset)
    - avito_test.tsv (testing dataset)
  - results
    - avito_starter_solution.csv (sample submission generated by sample.py)
  - sample.py (Avito-provided sample submission generator)
  - APatK.py (methods for generating the AP@k metric)
  - all other source files
The code will be stored in this repository, but please obtain the data from Kaggle and unpack it into the location above. Combined, the training and testing data are ~4 GB.
The code in sample.py has been modified to expect the above structure. If you commit modified paths, please let the other contributors know, as doing so will likely break their workflow.
## Notebooks
## Results
- The sample submission provided by Avito yields AP@k 0.05367.
- The set generated by sample.py yields AP@k 0.88598.
- The current benchmark AP@k from sample.py is 0.89061.