All-Things-Data-Science: Article Finder

Joyce Duan

##Overview

About 4,000 data-science-related articles were scraped from more than 800 websites. Non-negative matrix factorization (NMF) was used for topic modeling. A web app lets users browse the topics and search the articles.

##Motivation

With the rapid advancement of technology, data scientists need to keep up with the latest developments in methodology and business applications. Such information is abundant but fragmented, scattered across blogs, LinkedIn influencer articles, newspapers, and professional organization websites. In this project, I used text analysis to model topics related to analytics and data science. The collection of articles enabled exploratory analysis of topic trends. The web app organizes these articles and lets users search for articles of interest. The workflow developed in this project is also applicable to personalized content recommendations for other professions or personal interest areas.

##Getting the Data

Links to articles were collected from various sources, including DataTau, weekly newsletters, and collections of blogs recommended on Quora. Articles published between Dec 2013 and June 2014 were scraped from the corresponding source websites.

##Topic Modeling

The pipeline includes the following steps:

1. Extract the body text using boilerpipe.
2. Clean the text by removing URL links.
3. Tokenization: convert each document into a list of 1-, 2-, and 3-grams.
4. Stemming: reduce inflected or derived words to their base forms. This reduces the dimension of the TF-IDF matrix and also enables search using alternative forms of a word.
5. TF-IDF: represent the collection of documents as a document-term matrix of TF-IDF features.
6. NMF: approximate the TF-IDF matrix by the product of two low-rank matrices, a document-topic weighting and a topic-word weighting.
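
The cleaning, tokenization, stemming, and TF-IDF steps can be sketched in a few lines of plain Python. This is an illustration only, not the project's code: the function names (`crude_stem`, `to_ngrams`, `tfidf`), the suffix list, and the smoothed IDF formula are all assumptions. The suffix stripper is a deliberately crude stand-in for a real stemmer, and in practice a library such as scikit-learn (`TfidfVectorizer`) would likely replace these helpers.

```python
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+")
SUFFIXES = ("ing", "ed", "es", "s")  # assumption: crude stand-in for a real stemmer


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_ngrams(text, max_n=3):
    """Remove URLs, lowercase, stem, and return all 1..max_n-grams."""
    cleaned = URL_RE.sub(" ", text.lower())
    words = [crude_stem(w) for w in TOKEN_RE.findall(cleaned)]
    return [
        " ".join(words[i : i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def tfidf(docs, max_n=3):
    """Return (vocab, matrix) where matrix[d][t] is the TF-IDF weight."""
    counts = [Counter(to_ngrams(doc, max_n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    matrix = [[c[t] * idf[t] for t in vocab] for c in counts]
    return vocab, matrix


docs = [
    "Clustering models explained http://example.com/post",
    "Training models",
]
vocab, matrix = tfidf(docs)
print(to_ngrams(docs[0]))
```

Because stemming maps "models" and "model" to the same term, both forms land in one column of the matrix, which is what makes search by alternative word forms possible.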

I tested different tokenizers, stemmers, n-gram sizes ranging from 1 to 5, and numbers of topics from 20 to 40. The results were manually reviewed to check whether the topics were distinct and whether the articles with the highest weights under each topic shared similar content. The combination with the most intuitive and sensible results was used in the final model.
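
The factorization and the review loop can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the project's implementation: it uses the Lee-Seung multiplicative updates for the squared-error objective, a tiny made-up matrix in place of the real TF-IDF matrix, and hypothetical helper names (`nmf`, `top_documents`, `reconstruction_error`). The README does not say which NMF implementation was used; a library routine such as scikit-learn's `NMF` would be the usual choice at full scale.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a dense matrix."""
    return [list(col) for col in zip(*m)]


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate v (docs x terms) by w (docs x topics) times h (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


def top_documents(w, topic, n=3):
    """Indices of the n documents with the highest weight for a topic."""
    return sorted(range(len(w)), key=lambda i: w[i][topic], reverse=True)[:n]


def reconstruction_error(v, w, h):
    """Frobenius norm of v - w h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2
               for row_v, row_a in zip(v, approx)
               for a, b in zip(row_v, row_a)) ** 0.5


# Toy stand-in for the TF-IDF matrix: 4 documents, 4 terms.
toy = [[2, 2, 0, 0], [1, 1, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
for n_topics in (2, 3):  # the project swept 20 to 40 topics
    w, h = nmf(toy, n_topics)
    print(n_topics, "topics, error:", reconstruction_error(toy, w, h))
    for topic in range(n_topics):
        print("  topic", topic, "top documents:", top_documents(w, topic, n=2))
```

Listing the highest-weight documents per topic for each candidate topic count mirrors the manual review described above: a reviewer reads those articles and judges whether each topic is coherent and distinct from the others.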

##Final Output

The web app allows users to browse topics, search for articles, and explore visualizations of trends for each topic.

##Next Steps

- Add a daily feed of newly published articles
- Add a feature that allows users to rate articles
- Build a personalized recommendation engine using user-rating data
