Solution to the Adthena query categorization task

The solution implements two models:

KNN with cosine similarity
Logistic regression from Vowpal Wabbit (VW) with the hashing trick to reduce memory requirements

Scores for the test sample are in ./data
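As an illustration of the first model, here is a minimal sketch of top-k cosine KNN over bag-of-words vectors. It is not the repository's implementation: the function names (`bag_of_words`, `cosine`, `knn_predict`), the whitespace tokenizer and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-tokenized term counts for one query."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query, train, k=20):
    """Majority vote over the k training queries most similar to `query`."""
    q = bag_of_words(query)
    scored = sorted(
        ((cosine(q, bag_of_words(text)), label) for text, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train = [
    ("cheap flights to paris", 0),
    ("flights london paris", 0),
    ("running shoes sale", 1),
    ("buy running shoes", 1),
]
print(knn_predict("paris flights", train, k=2))
```

The only hyperparameter in this sketch is `k`, the number of neighbours that vote.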

Usage instructions to train and score the models:

Create a virtual environment for the dependencies

virtualenv -p python3 query_cat
source query_cat/bin/activate

Clone the repository, change into its folder and install the dependencies

git clone https://github.com/mindis/QueryCategorization/
cd QueryCategorization
pip3 install -r requirements.txt

Inside the folder, run the following command to download the data:

python3 run.py load

Train and evaluate KNN and LR models

python3 run.py train -i ./data/trainSet.csv -m knn
python3 run.py train -i ./data/trainSet.csv -m lr

Score KNN and LR models

python3 run.py score -i ./data/candidateTestSet.csv -m knn
python3 run.py score -i ./data/candidateTestSet.csv -m lr

Answers

Due to the large number of categories, short queries and noisy labels, I tried simple top-20 cosine KNN and logistic regression models.
Of the out-of-the-box preprocessing tools, only lemmatization improved accuracy significantly.
The models were evaluated with F1 and accuracy.
Top-20 KNN is fast and only requires tuning the number of neighbours. LR with Vowpal Wabbit allows balancing memory, speed and accuracy. sklearn LR ran out of memory.
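The memory trade-off of the hashing trick can be sketched as follows. This is an illustration only: `hash_features` is a hypothetical helper, and `zlib.crc32` is a stand-in for VW's own hash function. In VW the table size is set with the `-b` option (number of bits); the sketch mirrors that with a `bits` parameter.

```python
import zlib


def hash_features(text, bits=18):
    """Map a query's tokens to a sparse {index: count} vector with 2**bits slots."""
    n_slots = 1 << bits
    vector = {}
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % n_slots
        vector[index] = vector.get(index, 0) + 1
    return vector


features = hash_features("cheap flights to paris", bits=18)
print(features)
```

Fewer bits mean a smaller weight vector but more collisions between unrelated tokens, which is the memory versus accuracy balance mentioned above.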
Weaknesses:

The KNN similarity measure is not optimized for predictive accuracy. Similarity based on trained embeddings could potentially work better (e.g. Prototypical and Matching Networks).

VW LR options were not optimized; the model could be improved with trainable embeddings, n-grams and feature interactions.
Better data cleaning, preprocessing and hand-crafted features could also help:

Translate queries to English.

Create better-quality labels.

Put more emphasis on nouns and named entities (keywords).
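To make the n-gram idea from the list of possible improvements concrete, here is a small sketch of word n-gram features for a query. The helper name `word_ngrams` and the underscore separator are assumptions for this example. VW can also generate n-grams itself via its `--ngram` option, so a helper like this would only be needed when building features outside VW.

```python
def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max, joined with underscores."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams


print(word_ngrams("cheap flights paris"))
# ['cheap', 'flights', 'paris', 'cheap_flights', 'flights_paris']
```

Bigrams such as `cheap_flights` keep some word order, which plain bag-of-words features discard.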
