Classification and Auto-Tagging of Stack Exchange Questions

In this project, I tested a suite of machine learning and NLP techniques to learn patterns from Stack Exchange posts.

Word clouds and POS tag analysis
Document vectorization (bag of words, TF-IDF, word2vec)
Manifold visualization with t-SNE
Document classification
- Multinomial Naive Bayes
- Random Forest (achieved an F1 score of 0.999848 on the holdout test set)
- Logistic Regression
- Support Vector Machines
- Gradient Boosted Trees
- AdaBoost
Topic Modeling with LDA
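
As a rough illustration of the document classification step, the sketch below implements a bag-of-words Multinomial Naive Bayes classifier from scratch using only the standard library. It is not the code from the notebooks: the `tokenize` helper, the `MultinomialNB` class, and the toy posts are all hypothetical, and in practice a library implementation such as scikit-learn's would be the more usual choice.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class MultinomialNB:
    """Bag-of-words multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihoods; unseen words are skipped.
            score = math.log(n / n_docs)
            for word in tokenize(doc):
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy posts, not taken from the Kaggle data.
train_docs = [
    "how long should i boil an egg",
    "best flour for baking bread",
    "visa requirements for visiting japan",
    "cheap flights and hostels in europe",
]
train_labels = ["cooking", "cooking", "travel", "travel"]

model = MultinomialNB().fit(train_docs, train_labels)
predictions = [model.predict(d) for d in ["baking an egg bread", "flights to japan"]]
print(predictions)
```

Each class score is the log prior plus the sum of smoothed log word likelihoods, so a post is assigned to the site whose vocabulary it overlaps most.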

Problem posed by Kaggle: predict tags for physics questions using models trained on unrelated topics.

"What does physics have in common with biology, cooking, cryptography, diy, robotics, and travel? If you answered "all pursuits are governed by the immutable laws of physics" we'll begrudgingly give you partial credit. If you answered "all were chosen randomly by a scheming Kaggle employee for a twisted transfer learning competition", congratulations, we accept your answer and mark the question as solved.

In this competition, we provide the titles, text, and tags of Stack Exchange questions from six different sites. We then ask for tag predictions on unseen physics questions. Solving this problem via a standard machine learning approach might involve training an algorithm on a corpus of related text. Here, you are challenged to train on material from outside the field. Can an algorithm learn appropriate physics tags from "extreme-tourism Antarctica"? Let's find out."
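
Because no labeled physics questions are available, one possible baseline is to skip supervised training and propose each question's most distinctive words as candidate tags. The sketch below ranks words by TF-IDF over the unlabeled questions. It is an assumption for illustration, not the method used in the notebooks: `suggest_tags`, the stopword list, and the sample questions are all hypothetical.

```python
import math
from collections import Counter

# A tiny hand-picked stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "what", "why", "how", "does", "do", "to", "and"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def suggest_tags(questions, top_k=2):
    """Return the top_k highest TF-IDF words of each question as candidate tags."""
    docs = [tokenize(q) for q in questions]
    n_docs = len(docs)
    # Document frequency: number of questions containing each word.
    df = Counter(w for doc in docs for w in set(doc))
    suggestions = []
    for doc in docs:
        tf = Counter(doc)
        scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        # Sort by score descending, then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        suggestions.append(ranked[:top_k])
    return suggestions


# Hypothetical unlabeled questions standing in for the physics test set.
questions = [
    "does gravity bend light and does gravity slow time",
    "what is the speed of light in water",
    "how is time measured in a moving frame",
]
tags = suggest_tags(questions)
print(tags[0])
```

A word that appears in every question gets an IDF of zero under this scheme, so very common domain terms are never suggested. That is one reason to treat this only as a starting point.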

Breakdown of the project

Read the notebooks in the following order:

EDA.ipynb
LDA.ipynb
Document_Classification.ipynb
Autotagging.ipynb

How the modules in the code folder relate to each other.
