Dynamic Topic Modeling and Topic Chains of Reuters News Articles using SCVB0

VaradPathak/DynamicLDA

DynamicLDA

Dynamic Topic Model of Reuters News Articles between 2007-2013

We have implemented a fast version of the Dynamic Topic Model (DTM) proposed by David Blei and John Lafferty in 2006.

This version takes advantage of recent advances in LDA inference. We implemented the LDA part of DTM using SCVB0 (stochastic collapsed variational Bayes, zeroth-order), proposed by Foulds et al. in 2013. The SCVB0 implementation is parallelized using OpenMP.
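To make the SCVB0 step concrete, here is a minimal Python sketch of the per-token update as described in the Foulds et al. paper. It is not taken from this repository's C++ code: the function name, argument names, and step-size handling are assumptions for illustration, and the minibatch updates of the topic-word statistics are omitted.

```python
def scvb0_token_update(n_phi_w, n_z, n_theta_j, doc_len,
                       alpha, eta, vocab_size, rho_theta):
    """One SCVB0 step for a single token of word w in document j.

    n_phi_w[k]   : expected count of word w in topic k
    n_z[k]       : expected total token count in topic k
    n_theta_j[k] : expected count of topic k in document j (updated in place)
    Returns gamma, the variational topic distribution for this token.
    """
    num_topics = len(n_z)
    # gamma_k is proportional to
    # (N_phi[w,k] + eta) / (N_z[k] + W*eta) * (N_theta[j,k] + alpha)
    weights = [
        (n_phi_w[k] + eta) / (n_z[k] + vocab_size * eta) * (n_theta_j[k] + alpha)
        for k in range(num_topics)
    ]
    total = sum(weights)
    gamma = [w / total for w in weights]
    # Online average: blend the old document-topic estimate with the
    # estimate implied by this single token, scaled to the document length.
    for k in range(num_topics):
        n_theta_j[k] = (1.0 - rho_theta) * n_theta_j[k] + rho_theta * doc_len * gamma[k]
    return gamma
```

In the full algorithm the same gamma values are also accumulated over a minibatch and blended into the topic-word counts with a second step size; that part is left out here to keep the sketch short.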

In our evaluation, even the serial version gives a 36x speedup, and the parallel version gives a 53x speedup when run on a 2 GHz Core 2 Duo machine with 2 GB of memory.

(Report with detailed evaluation)

Reuters News Dataset Details

Timestamped news articles published by Reuters between 2007 and 2013. The corpus contains 161,989 documents with a vocabulary size of 32,468 after preprocessing. The following preprocessing steps were performed (scripts are available in the Scrapper folder):

  • From the Reuters data we removed all documents shorter than 100 words.
  • We scraped a random 10% of the data from each day. This was done only to reduce the corpus size. The assumption is that random sampling will not cause problems when finding long-running, major topics.
  • We removed all punctuation marks and performed stemming using the Porter2 stemmer.
  • We also removed words with a frequency of less than 25 or more than 100,000. Example run of text2ldac:
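The length and frequency filters above can be sketched in a few lines of Python. This is an illustration only, not the script from the Scrapper folder: the function name is hypothetical, "frequency" is assumed to mean the total count of a word across the corpus, and sampling, punctuation removal, and stemming are assumed to have been done already.

```python
from collections import Counter

def filter_corpus(docs, min_doc_len=100, min_freq=25, max_freq=100_000):
    """docs is a list of token lists. Returns (filtered_docs, vocab)."""
    # Drop documents shorter than min_doc_len words.
    kept = [d for d in docs if len(d) >= min_doc_len]
    # Count every word over the remaining corpus (assumed: total occurrences).
    freq = Counter(tok for d in kept for tok in d)
    # Keep only words whose count lies within [min_freq, max_freq].
    vocab = {w for w, c in freq.items() if min_freq <= c <= max_freq}
    filtered = [[t for t in d if t in vocab] for d in kept]
    return filtered, sorted(vocab)
```

The default thresholds mirror the numbers quoted in the list; smaller values are convenient for trying the function on a toy corpus.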

Topic Chains

We have investigated Topic Chains, a solution to the topic birth-death problem in Dynamic LDA, proposed by Kim et al. in 2013.

  • We use the same Reuters dataset and use the Jensen-Shannon (JS) divergence to measure similarity between topics.
  • We evaluate performance at different similarity thresholds and window sizes and find results similar to those in the original paper.
  • We identify some issues in the method and propose solutions to them (please refer to the report for more details).
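As a rough illustration of how a similarity threshold and a window size interact, the sketch below computes the JS divergence between two topic-word distributions and links each topic to similar topics in the nearest earlier time slice. It is not the GenerateChains code: the function names, the choice of similarity as 1 minus the base-2 JS divergence, and the rule of stopping at the nearest slice that has a match are all assumptions made for this example.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs, range 0..1) of two distributions."""
    def kl(a, b):
        # Terms with a zero in the first argument contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def link_topics(slices, window_size, threshold):
    """slices[t][k] is the word distribution of topic k in time slice t.

    Returns a list of edges ((t_prev, k_prev), (t, k)).
    """
    edges = []
    for t in range(1, len(slices)):
        for k, topic in enumerate(slices[t]):
            # Look back one slice at a time, at most window_size slices.
            for back in range(1, window_size + 1):
                s = t - back
                if s < 0:
                    break
                matches = [j for j, prev in enumerate(slices[s])
                           if 1.0 - js_divergence(topic, prev) >= threshold]
                if matches:
                    edges.extend(((s, j), (t, k)) for j in matches)
                    break  # stop at the nearest slice that has a match
    return edges
```

With a larger window, a topic that disappears for a slice or two can still be joined to its earlier occurrence, while a higher threshold yields fewer and shorter chains.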

Execution Commands

  • Scrape data from the Reuters archive website for num_of_months months, starting at startMonth
    python init.py startMonth num_of_months
  • Get stopwords
    python removeInfrequentWords.py
  • Convert the text data to the ldac format used by Blei's implementation
    python multitext2ldac.py data_folder --stopwords stopwords_file
  • Convert the data to UCI format
    python ldac2uci.py
  • Compile Dynamic LDA
    make
  • Execute Dynamic Topic Modeling on the UCI dataset
    ./fastLDA UCIFormat_data_file iterations NumOfTopics MiniBatchSize Vocab_file GeneratePi
  • Get the trend of a word in a topic
    python getWordVariation.py TopicId WordId PiFolderPath StartYear EndYear
  • Compile GetData for Topic Chains, which extracts all the topics in the dataset for all time slices
    make GetData
  • Execute GetData for Topic Chains
    ./GetData UCIFormat_data_file iterations NumOfTopics MiniBatchSize Vocab_file GeneratePi
  • Compile GenerateChains for Topic Chains
    make GenerateChains
  • Execute GenerateChains
    ./GenerateChains Pi_folder num_topics WindowSize SimilarityThreshold
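For readers unfamiliar with the two file formats in the pipeline above, the following Python sketch shows one way the conversion could work. It is not the repository's ldac2uci.py: the function name is hypothetical, and it assumes the usual conventions, namely ldac lines of the form "M term:count ..." with 0-based term ids, and UCI bag-of-words files with three header lines (documents, vocabulary size, non-zero entries) followed by 1-based "docID wordID count" triples.

```python
def ldac_to_uci(ldac_lines, vocab_size):
    """Convert ldac-format lines to a list of UCI bag-of-words lines."""
    triples = []
    num_docs = 0
    for line in ldac_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        num_docs += 1
        # parts[0] is the number of unique terms; the rest are term:count pairs.
        for pair in parts[1:]:
            term, count = pair.split(":")
            # Shift the 0-based ldac term id to a 1-based UCI word id.
            triples.append((num_docs, int(term) + 1, int(count)))
    header = [str(num_docs), str(vocab_size), str(len(triples))]
    return header + [f"{d} {w} {c}" for d, w, c in triples]
```

For example, the two ldac lines "2 0:3 4:1" and "1 2:2" with a vocabulary of 5 words would yield a header of 2, 5, 3 followed by one line per non-zero count.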
