eyedvabny/CDIPS-WS-2014

CDIPS Data Science Workshop 2014

##Project

The code in this repository is an attempt to solve the AVITO.ru Kaggle challenge. The goal of the project is to develop a model that predicts whether an online ad (in Russian) mentions illicit materials and should be flagged for review. The metric of success is Average Precision at k (AP@k), which tracks the precision of the model (# correct predictions / total predictions) integrated over recall from 0% (no true positives returned by the model) to 100% (the model returns all true positives in the test data), considering only the top k ranked entries. The k for the public Kaggle leaderboard is set at 32500 entries.
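The AP@k metric described above can be illustrated with a minimal sketch. This is not the code in APatK.py; the function name `average_precision_at_k` and its arguments are hypothetical, and normalizing by `min(k, number of positives)` is an assumed convention that may differ from the one Kaggle uses for this challenge.

```python
def average_precision_at_k(ranked_labels, k):
    """Compute AP@k for a ranked list of 0/1 labels.

    ranked_labels: true labels (1 = illicit, 0 = clean), ordered from the
                   ad the model is most confident about to the least.
    k:             number of top-ranked entries to evaluate.
    """
    total_positives = sum(ranked_labels)
    if total_positives == 0:
        return 0.0  # no positives to find, so the score is defined as 0 here

    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label:
            hits += 1
            # precision at this rank = correct predictions / total predictions
            precision_sum += hits / rank

    # Assumed normalization: the best achievable number of hits within k
    return precision_sum / min(k, total_positives)
```

For example, a ranking with true labels `[1, 0, 1, 0]` and `k=4` has hits at ranks 1 and 3, so the precisions 1/1 and 2/3 are averaged over the two positives.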

##Prerequisite external modules

##Expected folder structure

  • root
    • data
      • avito_train.tsv (training dataset)
      • avito_test.tsv (testing dataset)
    • results
      • avito_starter_solution.csv (sample submission generated by sample.py)
    • sample.py (Avito-provided sample submission generator)
    • APatK.py (Methods for generating the AP@k metric)
    • all other source files

The code will be stored in this repository, but please obtain the data from Kaggle and unpack into the appropriate location. Combined, the testing and training data are ~4 GB.

The code in sample.py has been modified to expect the above structure. If you commit modified paths, please let the others know, as doing so will likely break their workflow.
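Before running anything, it may help to confirm that the files are where the code expects them. The following is a small sketch of such a check, assuming the folder structure listed above; the helper `missing_files` and the `EXPECTED_FILES` list are hypothetical and not part of this repository.

```python
from pathlib import Path

# Paths relative to the repository root, taken from the expected structure.
# results/avito_starter_solution.csv is left out because sample.py generates it.
EXPECTED_FILES = [
    "data/avito_train.tsv",
    "data/avito_test.tsv",
    "sample.py",
    "APatK.py",
]


def missing_files(root):
    """Return the expected files that are not present under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Folder structure looks complete.")
```

Run from the repository root, this prints any of the listed files that have not yet been downloaded or unpacked.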

##Notebooks

##Results

The sample submission provided by Avito yields AP@k 0.05367

The set generated by sample.py yields AP@k 0.88598

The current benchmark AP@k from sample.py is 0.89061
