DynamicLDA

Dynamic Topic Model of Reuters News Articles between 2007-2013

We have implemented a fast version of the Dynamic Topic Model (DTM) proposed by David Blei and John Lafferty in 2006.

This version takes advantage of recent advances in LDA inference. The LDA part of the DTM is implemented using SCVB0, proposed by Foulds et al. in 2013. The SCVB0 implementation is parallelized using OpenMP.
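To give a feel for the kind of update SCVB0 performs, here is a minimal single-token sketch in Python. It is not the project's OpenMP implementation: the function name `scvb0_token_update`, the argument layout, and the use of symmetric scalar priors `alpha` and `eta` are assumptions made for illustration, and the minibatch updates of the word-topic statistics are left out.

```python
def scvb0_token_update(word, n_phi, n_z, n_theta_j, doc_len, alpha, eta, rho_theta):
    """One SCVB0-style update for a single token of one document.

    n_phi[w][k]  : expected count of word w assigned to topic k
    n_z[k]       : expected total count for topic k
    n_theta_j[k] : expected topic counts for this document (updated in place)
    doc_len      : number of tokens in the document
    rho_theta    : step size for the document-topic statistics
    """
    vocab_size = len(n_phi)
    num_topics = len(n_z)

    # Unnormalised topic responsibilities for this token (CVB0-style ratio).
    gamma = [
        (n_phi[word][k] + eta) / (n_z[k] + vocab_size * eta) * (n_theta_j[k] + alpha)
        for k in range(num_topics)
    ]
    total = sum(gamma)
    gamma = [g / total for g in gamma]

    # Blend the old document statistics with an estimate based on this token,
    # scaled up as if the whole document looked like it.
    for k in range(num_topics):
        n_theta_j[k] = (1.0 - rho_theta) * n_theta_j[k] + rho_theta * doc_len * gamma[k]
    return gamma
```

Because each token needs only a handful of arithmetic operations over the topics, updates of this kind are cheap and lend themselves to minibatch processing.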

In our evaluation, even the serial version gives a 36X speedup, and the parallel version gives a 53X speedup when run on a Core 2 Duo 2 GHz machine with 2 GB of RAM.

(Report with detailed evaluation)

Reuters News Dataset Details

Timestamped news articles published by Reuters between 2007 and 2013. This is a corpus of 161,989 documents with a vocabulary size of 32,468 after preprocessing. The following preprocessing steps were performed (scripts are available in the Scrapper folder):

From the Reuters data we removed all documents shorter than 100 words.
We scraped a random 10% of the data from each day. This was done only to reduce the corpus size. The assumption is that randomly selected data won't cause problems when finding the long-running, major topics.
We removed all punctuation marks and performed stemming using the Porter2 stemmer.
We also removed words with a frequency of less than 25 or more than 100,000. For an example run of text2ldac, see the multitext2ldac.py command under Execution Commands.
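The length and frequency filters described in this section can be sketched in Python as follows. This is a minimal illustration, not the code in the Scrapper folder: the function name `preprocess` and its parameters are hypothetical, document length is counted on whitespace-separated tokens (an assumption), and the Porter2 stemming step is omitted to keep the sketch dependency-free.

```python
import string
from collections import Counter


def preprocess(docs, min_doc_len=100, min_freq=25, max_freq=100_000):
    """Filter short documents, strip punctuation, and drop rare/common words.

    docs is a list of raw text strings; the result is a list of token lists.
    """
    # Drop documents with fewer than min_doc_len whitespace-separated words.
    long_enough = [doc for doc in docs if len(doc.split()) >= min_doc_len]

    # Remove punctuation characters, then tokenize on whitespace.
    table = str.maketrans("", "", string.punctuation)
    tokenized = [doc.translate(table).split() for doc in long_enough]

    # Count corpus-wide word frequencies and keep words inside the band.
    counts = Counter(word for tokens in tokenized for word in tokens)
    keep = {w for w, c in counts.items() if min_freq <= c <= max_freq}
    return [[w for w in tokens if w in keep] for tokens in tokenized]
```

The defaults mirror the thresholds stated above; smaller values are convenient when trying the function on a toy corpus.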

Topic Chains

We have investigated Topic Chains, a solution to the topic birth-death problem in Dynamic LDA, proposed by Kim et al. in 2013.

We use the same Reuters dataset and use the Jensen-Shannon (JS) divergence to measure the similarity between topics.
We evaluate performance at different similarity thresholds and window sizes and find results similar to those reported in the original paper.
We identify some issues in the method and propose solutions to them (please refer to the report for more details).
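The role of the JS divergence, the similarity threshold, and the window size can be illustrated with a short Python sketch. This is a simplified illustration and not the GenerateChains code: the function names are hypothetical, similarity is taken to be 1 minus the base-2 JS divergence (an assumption), and every sufficiently similar topic inside the window is linked, which may differ from the linking rule used in the original method.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def link_topics(slices, window_size, similarity_threshold):
    """Link each topic to similar topics in the preceding window_size slices.

    slices[t] is a list of topic-word distributions for time slice t.
    Returns a list of links ((t_prev, k_prev), (t, k)).
    """
    links = []
    for t in range(1, len(slices)):
        for k, topic in enumerate(slices[t]):
            for t_prev in range(max(0, t - window_size), t):
                for k_prev, prev_topic in enumerate(slices[t_prev]):
                    similarity = 1.0 - js_divergence(prev_topic, topic)
                    if similarity >= similarity_threshold:
                        links.append(((t_prev, k_prev), (t, k)))
    return links
```

A larger window lets a topic that disappears for a slice or two reconnect to its earlier occurrence, while a higher threshold produces fewer, stricter links.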

Execution Commands

Scrape data from the Reuters archive website for num_of_months months, starting at startMonth
python init.py startMonth num_of_months
Get stopwords
python removeInfrequentWords.py
Convert the text data to the ldac format used by Blei's implementation
python multitext2ldac.py data_folder --stopwords stopwords_file
Convert data to UCI format
python ldac2uci.py
Compile Dynamic LDA
make
Execute Dynamic Topic Modeling on the UCI dataset
./fastLDA UCIFormat_data_file iterations NumOfTopics MiniBatchSize Vocab_file GeneratePi
Get the word trend in a topic
python getWordVariation.py TopicId WordId PiFolderPath StartYear EndYear
Compile GetData for Topic Chains, which gets all the topics in the dataset for all the time slices
make GetData
Execute GetData for Topic Chains
./GetData UCIFormat_data_file iterations NumOfTopics MiniBatchSize Vocab_file GeneratePi
Compile GenerateChains for Topic Chains
make GenerateChains
Execute GenerateChains
./GenerateChains Pi_folder num_topics WindowSize SimilarityThreshold
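For readers unfamiliar with the two file formats in the conversion step, the sketch below shows what an ldac-to-UCI conversion involves. It is not the project's ldac2uci.py: the function name `ldac_to_uci` is hypothetical, and it assumes the usual conventions, namely that each ldac line reads `M term:count ...` with 0-based term ids, and that a UCI bag-of-words file starts with three header lines (number of documents, vocabulary size, number of non-zero entries) followed by `docID wordID count` lines with 1-based ids.

```python
def ldac_to_uci(ldac_lines, vocab_size):
    """Convert ldac-format lines into the lines of a UCI bag-of-words file."""
    triples = []
    for doc_id, line in enumerate(ldac_lines, start=1):
        fields = line.split()
        # fields[0] is the number of unique terms; the rest are term:count pairs.
        for pair in fields[1:]:
            term, count = pair.split(":")
            # Shift 0-based ldac term ids to 1-based UCI word ids.
            triples.append((doc_id, int(term) + 1, int(count)))

    header = [str(len(ldac_lines)), str(vocab_size), str(len(triples))]
    return header + [f"{d} {w} {c}" for d, w, c in triples]
```

For example, the two ldac lines `2 0:3 4:1` and `1 2:2` with a vocabulary of 5 words yield the header `2`, `5`, `3` followed by `1 1 3`, `1 5 1`, and `2 3 2`.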
