Feature Engineering

We used regular expressions to strip punctuation from the question strings and filtered out stopwords using the list from `nltk.corpus`.
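A minimal sketch of this cleaning step, not the repository's actual code. The stopword set below is a tiny hypothetical stand-in for the `nltk.corpus` list so that the example runs without downloading NLTK data, and the function name `clean_tokens` is an assumption.

```python
import re

# Hypothetical stand-in for the nltk.corpus stopword list, kept tiny so the
# example runs without any NLTK data download.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "i", "to", "of", "in"}

# Anything that is neither a word character nor whitespace is treated as punctuation.
PUNCT_RE = re.compile(r"[^\w\s]")


def clean_tokens(question):
    """Lowercase, replace punctuation with spaces, split, and drop stopwords."""
    text = PUNCT_RE.sub(" ", str(question).lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]
```

The remaining features can then be computed on the token lists for q1 and q2.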

word_match_share
tfidf_word_match_share
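The README does not define these two features, so the following is a sketch under an assumed definition: the share of distinct words common to both questions, first unweighted and then weighted so that rarer words count more. The weighting function `inverse_count_weights` and its `eps` and `min_count` parameters are assumptions, not taken from the repository.

```python
from collections import Counter


def word_match_share(q1_tokens, q2_tokens):
    """Share of distinct words that appear in both questions."""
    q1, q2 = set(q1_tokens), set(q2_tokens)
    if not q1 or not q2:
        return 0.0
    return 2.0 * len(q1 & q2) / (len(q1) + len(q2))


def inverse_count_weights(corpus_tokens, eps=10000.0, min_count=2):
    """Assumed weighting: rarer words weigh more; words below min_count weigh 0."""
    counts = Counter(tok for tokens in corpus_tokens for tok in tokens)
    return {w: (1.0 / (c + eps) if c >= min_count else 0.0) for w, c in counts.items()}


def tfidf_word_match_share(q1_tokens, q2_tokens, weights):
    """Same share as above, but each word contributes its weight instead of 1."""
    q1, q2 = set(q1_tokens), set(q2_tokens)
    total = sum(weights.get(w, 0.0) for w in q1) + sum(weights.get(w, 0.0) for w in q2)
    if total == 0:
        return 0.0
    return 2.0 * sum(weights.get(w, 0.0) for w in q1 & q2) / total
```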

Added counting, distance, and further TF-IDF features (BOW, TFIDF) over n-grams with n = 1, 2, 3.

Following Chenglong Chen's idea and code: https://github.com/ChenglongChen/Kaggle_CrowdFlower/tree/master/Code/Feat
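As a rough illustration of the n-gram step, the sketch below builds unigrams, bigrams, and trigrams from a token list. The function names and the `_` separator are assumptions for illustration and may differ from what `ngram.py` does.

```python
def ngrams(tokens, n, sep="_"):
    """Return all contiguous n-grams, each joined into a single string."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def all_ngrams(tokens, max_n=3):
    """Map each n in 1..max_n to the list of n-grams for the given tokens."""
    return {n: ngrams(tokens, n) for n in range(1, max_n + 1)}
```

Each feature listed below can then be computed once per n-gram size.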

Count of words
Count of unique words
Ratio of unique words
Count of digits
Count of unique digits
Ratio of unique digits
Count of q1 words in q2
Count of q2 words in q1
Ratio of q1 words in q2
Ratio of q2 words in q1
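The counting features above could be computed roughly as follows. This is a sketch with assumed names and assumed denominators: the ratio of unique digits is taken here as unique digit tokens over all digit tokens, which the README does not specify.

```python
def try_divide(x, y, default=0.0):
    """Divide x by y, returning a default when y is zero."""
    return x / y if y != 0 else default


def count_features(tokens):
    """Per-question counts; a 'digit' is a token made only of digit characters."""
    uniq = set(tokens)
    digits = [t for t in tokens if t.isdigit()]
    return {
        "count_of_words": len(tokens),
        "count_of_unique_words": len(uniq),
        "ratio_of_unique_words": try_divide(len(uniq), len(tokens)),
        "count_of_digits": len(digits),
        "count_of_unique_digits": len(set(digits)),
        # Assumed denominator: number of digit tokens in the same question.
        "ratio_of_unique_digits": try_divide(len(set(digits)), len(digits)),
    }


def intersect_features(q1_tokens, q2_tokens):
    """Counts and ratios of one question's tokens that occur in the other."""
    q1_set, q2_set = set(q1_tokens), set(q2_tokens)
    q1_in_q2 = sum(1 for t in q1_tokens if t in q2_set)
    q2_in_q1 = sum(1 for t in q2_tokens if t in q1_set)
    return {
        "count_of_q1_in_q2": q1_in_q2,
        "count_of_q2_in_q1": q2_in_q1,
        "ratio_of_q1_in_q2": try_divide(q1_in_q2, len(q1_tokens)),
        "ratio_of_q2_in_q1": try_divide(q2_in_q1, len(q2_tokens)),
    }
```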

jaccard_coef between q1 and q2
dice_dist between q1 and q2
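Both measures compare the two questions as sets of tokens (or n-grams). The sketch below uses the standard Jaccard coefficient and the Dice coefficient. Note that, despite its name, `dice_dist` here returns a similarity (higher means more overlap); whether the repository uses this value or one minus it is not stated in the README, so treat that as an assumption.

```python
def jaccard_coef(a, b):
    """|A intersect B| / |A union B| over the distinct items of a and b."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice_dist(a, b):
    """Dice coefficient: 2 * |A intersect B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2.0 * len(a & b) / total if total else 0.0
```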

cosine similarity of q1&q2 tfidf
cosine similarity of q1&q2 BOW
cosine similarity of q1&q2 SVD tfidf
cosine similarity of q1&q2 SVD BOW
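To make the cosine features concrete, here is a dependency-free sketch for the BOW and TF-IDF variants. The smoothed idf formula and all function names are assumptions; the repository may well rely on a library vectorizer instead. The SVD variants would apply the same cosine to the reduced vectors and are not shown.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bow_vector(tokens):
    """Bag-of-words vector: raw term counts."""
    return dict(Counter(tokens))


def idf_table(corpus_tokens):
    """Assumed smoothed idf: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(corpus_tokens)
    df = Counter(tok for tokens in corpus_tokens for tok in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf_vector(tokens, idf):
    """TF-IDF vector: term count times idf; unseen terms get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}
```

Because the TF-IDF variant down-weights words that occur in many questions, a word shared by every pair contributes less to the similarity than it does under BOW.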

Modeling

Simple Models

logistic regression -- train AUC score: 0.804

xgboost -- test AUC score: 0.872

LSTM/GRU -- following the leaky feature solution of lystdo: 
https://www.kaggle.com/lystdo/lb-0-18-lstm-with-glove-and-magic-features
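For readers unfamiliar with the AUC scores quoted above, the sketch below computes ROC AUC directly from its pairwise definition. It is an illustrative, quadratic-time version with an assumed function name, not the evaluation code used in the repository.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of duplicate pairs above non-duplicate pairs.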

mochiliu3000/Kaggle_Quora