
Entry for Kaggle - Quora Competition

Team Members: Germayne Ng, Alson Yap

Competition


Link to competition details and question: https://www.kaggle.com/c/quora-question-pairs

Update Logs


  • Version 1.6 - 13th May 2017:

    • Revamped the LSA features: train and test data are now treated as one 'document' corpus, as opposed to referencing them separately.
    • 6 new features: LSA Q1 components 1 and 2, Q2 components 1 and 2, and 2 distance features based on the components.
    • Total: 55 features. (Note that 2 features overwrite the old distance features, so there are 4 net additions.)
    • Major change in the xgboost.py script: section 5 is split into 5a, 5b and 5c. For cross-validation, run 5a then 5b. For modelling a submission, run 5a then 5c (skip 5b), so that the full training set is used for modelling.
    • Score: 0.27XX
  • Version 1.6 - 12th May 2017:

    • Added 4 magic features - 0.30XX
    • Added AB 12 features - 0.28XX
    • Total 51 features
  • Version 1.5 - 20th April 2017:

    • Finally, the best score so far. Managed to make good use of the LSA component features.
    • Distance features can be further distinguished: Alson computed distances on single vectors, but if you define LSA components and apply a distance to them, that yields a separate feature.
    • Based on Alson's distance functions, I created euclidean and manhattan functions for single vectors. Essentially, there are 2 features each for the euclidean and manhattan distances (4 in total).

    LSA components: each question is now a vector of LSA-TFIDF components. (The values below are arbitrary, just for example's sake.)

    | question             | component 1 | component 2 | question             | component 1 | component 2 |
    |----------------------|-------------|-------------|----------------------|-------------|-------------|
    | question 1 of pair 1 | 0.23        | 0.56        | question 2 of pair 1 | 0.4         | 0.7         |

    So question 1, instead of words, is now [0.23, 0.56] and question 2 is [0.4, 0.7]. As vectors, we can then calculate the distances between them.

    • Tuned nrounds to 1000 and lowered the learning rate to 0.1, which gave better results.
  • Version 1.4 - 4th April 2017:

    • Expanded the TFIDF features (3); added character count without spaces (3) and character count per word (3).
    • Total 30 features. Accuracy: 0.32624 (Rank 162: Top 14%)
    • Updated the features dataset in Dropbox.
  • Version 1.3 - 1st April 2017:

    • Added Jaccard distance and Cosine distance features. Total 21 features. (Rank 133: Top 15%)
  • Version 1.2 - 1st April 2017:

    • Added 7 FuzzyWuzzy features. Total 19 features. (Rank 151: Top 15%)
  • Version 1.1 - 31st March 2017:

    • Added additional features. Total 12 features. (Rank: 215 Top 25%)
  • Version 1.0 - 30th March 2017:

    • Implemented Xgboost with 6 features.
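The LSA-component distance features from Versions 1.5/1.6 can be sketched as follows. This is a minimal sketch assuming scikit-learn's TfidfVectorizer and TruncatedSVD; the repo's actual SVD settings, number of components and column names may differ:

```python
# Sketch: LSA (TF-IDF + truncated SVD) components per question, plus
# euclidean and manhattan distances between each pair's component vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

q1 = ["How do I learn Python?", "What is machine learning?"]
q2 = ["What is the best way to learn Python?", "How does machine learning work?"]

# Fit on the combined train/test questions as one corpus
# (the Version 1.6 change: one shared 'document' corpus, not separate fits).
tfidf = TfidfVectorizer().fit(q1 + q2)
svd = TruncatedSVD(n_components=2, random_state=0).fit(tfidf.transform(q1 + q2))

v1 = svd.transform(tfidf.transform(q1))  # each question -> [component 1, component 2]
v2 = svd.transform(tfidf.transform(q2))

# Two distance features per pair, computed on the component vectors.
euclidean = np.sqrt(((v1 - v2) ** 2).sum(axis=1))
manhattan = np.abs(v1 - v2).sum(axis=1)
```

The four component columns (Q1 components 1 and 2, Q2 components 1 and 2) plus the two distances give the six new features mentioned above.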
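The 7 features from Version 1.2 come from the fuzzywuzzy package. To show the idea behind one of them, here is token_sort_ratio sketched with only the standard library's difflib; the real package uses Levenshtein-based ratios, so scores can differ slightly:

```python
# Sketch of FuzzyWuzzy's token_sort_ratio using only the standard library.
from difflib import SequenceMatcher

def token_sort_ratio(a, b):
    """Sort tokens before comparing, so word order does not matter."""
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sa, sb).ratio())

token_sort_ratio("learn python how", "how learn python")  # -> 100
```

Order-insensitive scores like this are useful here because duplicate Quora questions often contain the same words in a different order.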


To do/ comments

  1. Running the spell checker script on the train and test sets (~20k rows per hour)

So the train set (~404k rows) will take about 404/20 ≈ 20 hours.
The test set will then take about 20 hours × 6 = 120 hours, i.e. 5+ days.

Spell checker script taken from: http://norvig.com/spell-correct.html

Download the big.txt file, because the script needs to reference that corpus of words. I only added the sentence_correction function, which is to be run on both data sets.
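The added wrapper can be sketched as below. This is an assumed shape: the repo's sentence_correction may handle punctuation and case differently, and `correction` stands for Norvig's word-level corrector, which needs big.txt loaded as its word-frequency corpus:

```python
# Sketch: apply a word-level spell corrector to every word of a sentence.
# `correction` is assumed to be Norvig's corrector from spell-correct.html.
def sentence_correction(sentence, correction):
    return " ".join(correction(word) for word in sentence.split())

# Usage with a toy corrector standing in for Norvig's:
toy = {"quikc": "quick", "learnig": "learning"}
fix = lambda w: toy.get(w, w)
sentence_correction("quikc learnig", fix)  # -> "quick learning"
```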

Conclusion: the FuzzyWuzzy features are really significant. But note that after implementing the Jaccard and cosine distance features, the feature-importance rankings changed. Refer to the feature importances in the figures folder.

