tatianaruediger/online_twitter_lda

 
 


This package contains scripts and Python tools for running an online implementation of LDA (latent Dirichlet allocation).

Directory Structure and Files

  • input: the input directory; contains example input files for testing the program.
  • lda.py: main program that runs online LDA.
  • run_lda.sh: script that drives the online LDA system.
  • stopwords.txt: a list of common stopwords to remove when generating the vocabulary.
  • vocabulary.py: complementary program that manages the update of vocabulary in documents.
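As a rough illustration of what removing stopwords and applying a minimum frequency threshold can look like, here is a minimal sketch. It is not the code in vocabulary.py: the function name build_vocabulary, the min_freq parameter, and the lowercase whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(documents, stopwords, min_freq=2):
    """Return the sorted words that are not stopwords and occur at least min_freq times.

    Hypothetical helper: the tokenization (lowercase, split on whitespace)
    is an assumption, not necessarily what vocabulary.py does.
    """
    counts = Counter(
        token
        for doc in documents
        for token in doc.lower().split()
        if token not in stopwords
    )
    return sorted(word for word, n in counts.items() if n >= min_freq)


# Made-up documents, one string per document.
docs = ["the storm hit the coast", "storm warning for the coast", "a quiet day"]
vocab = build_vocabulary(docs, stopwords={"the", "a", "for"}, min_freq=2)
print(vocab)
```

Only words that survive both filters remain; raising min_freq shrinks the vocabulary further.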

Running the System

  • Generate input files according to the input format in the input directory.
  • Execute run_lda.sh.
  • System output is generated in output-time_slice directories.

Input Format

  • time_slice.text: text of the documents, one line per document.
  • time_slice.time: time information for the documents; each line corresponds to the document on the same line number in time_slice.text.
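Because the two files are paired by line number, it can help to check that they have the same number of lines before running the system. The sketch below assumes that time_slice stands for the name of a slice; the helper read_time_slice and the slice name slice_0 are hypothetical, and the time values are treated as opaque strings because their format is not specified here.

```python
import tempfile
from pathlib import Path


def read_time_slice(directory, name):
    """Pair each line of <name>.time with the same line of <name>.text."""
    base = Path(directory)
    texts = (base / f"{name}.text").read_text(encoding="utf-8").splitlines()
    times = (base / f"{name}.time").read_text(encoding="utf-8").splitlines()
    if len(texts) != len(times):
        raise ValueError(
            f"{name}: {len(texts)} documents but {len(times)} time entries"
        )
    return list(zip(times, texts))


# Demo with made-up data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "slice_0.text").write_text("first tweet\nsecond tweet\n", encoding="utf-8")
    (base / "slice_0.time").write_text("time-a\ntime-b\n", encoding="utf-8")
    pairs = read_time_slice(tmp, "slice_0")

print(pairs)
```

A mismatch in line counts raises a ValueError instead of silently misaligning documents and times.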

Parameter Settings

Most parameter values (number of cores to use, minimum frequency threshold of the vocabulary, etc.) are set in lda.py. The number of topics, T, is specified in run_lda.sh. The size of the sliding window is fixed at 2 time slices; changing it requires modifying the code.
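To make the sliding window concrete, the sketch below groups consecutive time slices into windows of 2. It only illustrates the grouping: how lda.py treats the first slice or updates the model between windows is not described here, and the function sliding_windows and the slice names are hypothetical.

```python
def sliding_windows(slices, window_size=2):
    """Yield tuples of window_size consecutive time slices, oldest first."""
    for end in range(window_size, len(slices) + 1):
        yield tuple(slices[end - window_size:end])


# Made-up slice names in chronological order.
windows = list(sliding_windows(["slice_0", "slice_1", "slice_2", "slice_3"]))
print(windows)
```

With four slices and a window of 2, each new slice is grouped with the one immediately before it, giving three windows.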

Credits & Licensing

Publications

  • Jey Han Lau, Nigel Collier and Timothy Baldwin (2012). On-line Trend Analysis with Topic Models: #twitter trends detection topic model online. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India.

About

Detection of novel events in microblogs using an online variant of a topic model.
