
***********************************************************************
Posterior inference in topic models with provably guaranteed algorithms
***********************************************************************

(C) Copyright 2015, Khoat Than and Tung Doan. 
This package is for academic uses only. OTHER USAGES MUST ASK FOR PERMISSION.

------------------------------------------------------------------------
This Python package contains six algorithms for learning Latent Dirichlet
Allocation (LDA) at large scales: ML-FW, Online-FW, Streaming-FW, ML-OPE, Online-OPE, and Streaming-OPE. They are stochastic algorithms that can work with big text collections and text streams. At their core are two fast approaches for inferring the topic mixture of an individual document.

The two inference approaches are Frank-Wolfe (FW) and Online Maximum a Posteriori Estimation (OPE). FW has a linear convergence rate and offers a principled way to trade off the sparsity of solutions against quality. OPE theoretically converges to a local optimum or stationary point at a linear rate. Both methods can easily be employed for posterior inference in various probabilistic models.
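
For concreteness, here is a minimal sketch of FW inference for a single document, written from the description above rather than from the package's code (the function name, the centroid initialization, and the 2/(t+2) step size are our assumptions; the package may use a line search and a different starting point):

     import numpy as np

     def fw_infer(counts, beta, n_iters=50):
         # counts: length-V numpy array of word counts for one document
         # beta:   K x V numpy array of topic-word probabilities (rows sum to 1)
         # Frank-Wolfe maximization over the unit simplex of
         #   f(theta) = sum_j counts[j] * log(sum_k theta[k] * beta[k, j])
         K = beta.shape[0]
         theta = np.full(K, 1.0 / K)         # assumed: start at the centroid
         nz = counts > 0                     # only words present in the document
         for t in range(n_iters):
             x = theta.dot(beta[:, nz])      # probability of each observed word
             grad = beta[:, nz].dot(counts[nz] / x)   # gradient of f at theta
             i = int(np.argmax(grad))        # vertex e_i with steepest ascent
             step = 2.0 / (t + 2)            # classic FW step size (assumed)
             theta *= 1.0 - step             # convex combination of theta and e_i,
             theta[i] += step                # which keeps theta on the simplex
         return theta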

If you find the package useful, please consider citing our related work:

Khoat Than and Tu Bao Ho. Inference in topic models I: sparsity trade-off. Technical report, 2015.
Khoat Than and Tung Doan. Inference in topic models II: provably guaranteed algorithms. Technical report, 2015.

------------------------------------------------------------------------
TABLE OF CONTENTS


A. LEARNING 

   1. SETTINGS FILE

   2. DATA FILE FORMAT

B. MEASURE

C. PRINTING TOPICS


------------------------------------------------------------------------
A. LEARNING 

To learn LDA by a method, do the following steps:

- Change the current directory to the folder that contains the learning method (e.g., ML-FW)
- Estimate a model by executing:

     python run_[name of algorithm].py [train file] [setting file] [model folder] [test data folder]

[train file]          the training data.
[setting file]        file providing parameters for learning/inference.
[model folder]        folder for saving the learned model.
[test data folder]    folder containing data for computing perplexity (described in detail in B).

The model folder will also contain several files recording statistics of the model after each mini-batch is processed: topic mixture sparsity, perplexity, the top ten words of each topic, and the time taken by the E and M steps.

Example: python ./run_ML_FW.py ../data/nyt_50k.txt ../settings.txt ../models/ML_FW/nyt ../data

1. Settings file

See settings.txt for a sample.
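
The key names below are purely hypothetical, meant only to illustrate the kind of parameters such a file provides; settings.txt in the repository is the authoritative reference:

     # hypothetical keys for illustration only -- see settings.txt for the real ones
     num_topics: 100      # number of topics K
     alpha: 0.01          # Dirichlet prior on topic mixtures
     batch_size: 500      # documents per mini-batch
     iter_infer: 50       # FW/OPE iterations per document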

2. Data format

The implementations only support data in the LDA-C format. Please refer to the following site for details:

http://www.cs.columbia.edu/~blei/lda-c/

Under LDA, the words of each document are assumed exchangeable.  Thus, each document is succinctly represented as a sparse vector of word counts. The data is a file where each line is of the form:

     [M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document.  Note that [term_1] is an integer which indexes the term; it is not a string.
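
For illustration, here is a minimal sketch of reading this format into a list of (term id, count) pairs per document; the function name and return structure are ours, not the package's:

     def read_lda_c(path):
         # Each line: [M] [term_1]:[count] ... [term_M]:[count]
         docs = []
         with open(path) as f:
             for line in f:
                 parts = line.split()
                 if not parts:
                     continue
                 m = int(parts[0])                    # number of unique terms
                 pairs = (p.split(':') for p in parts[1:m + 1])
                 docs.append([(int(t), int(c)) for t, c in pairs])
         return docs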


------------------------------------------------------------------------

B. MEASURE

Perplexity is a popular measure of the predictiveness and generalization of a topic model.

Computing the perplexity of a model requires testing data. Each document in the testing data is randomly divided into two disjoint parts, w_obs and w_ho, in the ratio 80:20; the topic mixture of a document is inferred from w_obs and then evaluated on how well it predicts w_ho. The two parts are stored in [test data folder] in files named:

data_test_part_1.txt
data_test_part_2.txt
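
A minimal sketch of such a split (the 80:20 ratio and the two parts follow the description above; how the parts are written into the two files is an assumption, as is everything else here):

     import random

     def split_document(tokens, ratio=0.8, seed=0):
         # Randomly divide a document's tokens into an observed part (w_obs)
         # and a held-out part (w_ho), in the given ratio (80:20 by default).
         rng = random.Random(seed)
         tokens = list(tokens)
         rng.shuffle(tokens)
         cut = int(ratio * len(tokens))
         return tokens[:cut], tokens[cut:]   # (w_obs, w_ho)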


------------------------------------------------------------------------

C. PRINTING TOPICS

The Python script topics.py lets you print out the top N
words from each topic in a .topic file.  Usage is:

     python topics.py [beta file] [vocab file] [n words] [result file]
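
A sketch of the idea behind such a script, assuming the beta file stores one topic per line with one weight per vocabulary term (the actual file layout used by the package may differ):

     import numpy as np

     def print_top_words(beta_file, vocab_file, n_words, result_file):
         beta = np.loadtxt(beta_file)                  # one topic per row (assumed)
         vocab = [w.strip() for w in open(vocab_file)]
         with open(result_file, 'w') as out:
             for k, topic in enumerate(beta):
                 top = np.argsort(topic)[::-1][:n_words]   # largest weights first
                 out.write('topic %03d: %s\n' % (k, ' '.join(vocab[i] for i in top)))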
