GitHub - Syed-Arshad/streaming_vb: This is code associated with the paper: Broderick, T, Boyd, N, Wibisono, A, Wilson, AC, and Jordan, MI. Streaming variational Bayes. Neural Information Processing Systems, 2013. papers.nips.cc/paper/4980-streaming-variational-bayes.pdf

forked from tbroderick/streaming_vb

Files in this repository:
README.txt
archived_dataset.py
asynchronous.py
batchvb.py
ep2_lda.py
ep_lda.py
evaluation.py
filtering.py
gnu_gpl_v3.txt
onlineldavb.py
onlinewikipedia.py
parallelfiltering.py
topicstofiles.py
utils.py

======================================
= README
======================================

Contents
1. History and licensing information
2. Data format
3. How to run

======================================
1. History and licensing information
======================================

This code is largely the same as, and adapted from, the online VB (aka stochastic variational Bayes) code of
Matthew D. Hoffman, Copyright (C) 2010
found here: http://www.cs.princeton.edu/~blei/topicmodeling.html
The GPL license is inherited from that code.

Adapted by: Nick Boyd, Tamara Broderick, Andre Wibisono, Ashia C. Wilson

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program.  If not, see <http://www.gnu.org/licenses/>.

======================================
2. Data format
======================================

Each data set should consist of three files. In what follows, NAME represents the name of the corpus used as a prefix in filenames. In our experiments we used NAME equal to "wiki" or "nature". The three files should be:
1. NAME_vocab.txt: A file with one vocabulary word from the corpus per line.
2. NAME_train.txt: The training data for the corpus. Each line represents a document. Each line should be in the format:
U_D I_1:N_1 I_2:N_2 ... I_M:N_M
where U_D is the number of unique vocabulary words in this document, I_m is the index of the mth unique vocabulary word in the NAME_vocab.txt file, and N_m is the number of times this word occurs in this document. There is a space in between each index-count pair and a space after the count of unique vocabulary words.
3. NAME_test.txt: The test data for the corpus. This file is in the same format as the training data.
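
The document line format described above can be read and written with a few lines of Python. The following is a minimal sketch, not code from this repository: the function names parse_document and format_document are hypothetical, and the sample line is made up for illustration. The sketch does not assume whether vocabulary indices start at 0 or 1, since that is not stated here.

```python
from typing import Dict


def parse_document(line: str) -> Dict[int, int]:
    """Parse one 'U_D I_1:N_1 ... I_M:N_M' line into {word_index: count}."""
    fields = line.split()
    if not fields:
        raise ValueError("empty document line")
    num_unique = int(fields[0])  # U_D: number of unique vocabulary words
    pairs = fields[1:]
    if len(pairs) != num_unique:
        raise ValueError(
            "expected %d index:count pairs, found %d" % (num_unique, len(pairs))
        )
    counts = {}
    for pair in pairs:
        index_text, count_text = pair.split(":")
        counts[int(index_text)] = int(count_text)
    return counts


def format_document(counts: Dict[int, int]) -> str:
    """Write {word_index: count} back out in the same line format."""
    fields = [str(len(counts))]
    fields += ["%d:%d" % (index, count) for index, count in sorted(counts.items())]
    return " ".join(fields)


# Hypothetical document with three unique words.
example = "3 0:2 5:1 17:4"
document = parse_document(example)
print(document)
print(format_document(document))
```

A line whose leading count disagrees with the number of index-count pairs raises ValueError, which is a convenient way to catch malformed rows before training.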

======================================
3. How to run
======================================

Below are some example use cases.

To run single-thread streaming variational Bayes on a data set with name NAME:
$ python onlinewikipedia.py --algorithmname=filtering --corpus=NAME --batchsize=32768 --eta=0.01 --max_iters=100 --threshold=1

To run synchronous, distributed, streaming variational Bayes on a data set with name NAME with 16 processors:
$ python onlinewikipedia.py --algorithmname=filtering --corpus=NAME --batchsize=32768 --eta=0.01 --max_iters=100 --threshold=1 --numthreads=16

To run asynchronous, distributed, streaming variational Bayes on a data set with name NAME with 16 processors:
$ python onlinewikipedia.py --algorithmname=filtering --corpus=NAME --batchsize=32768 --async_batches_per_eval=4 --eta=0.01 --max_iters=100 --threshold=1 --numthreads=16

To run the sufficient statistics algorithm on a data set with name NAME:
$ python onlinewikipedia.py --algorithmname=ss --corpus=NAME --batchsize=32768 --eta=0.01
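
When several of these configurations are run in sequence, it can help to assemble the command lines programmatically. The following is a minimal sketch under stated assumptions: build_command is a hypothetical helper that is not part of this repository, it uses only the flags shown in the examples above, and the corpus name "wiki" is the one mentioned in the data format section.

```python
from typing import List


def build_command(algorithmname: str, corpus: str, **options) -> List[str]:
    """Assemble an argument list for onlinewikipedia.py from keyword options."""
    command = [
        "python",
        "onlinewikipedia.py",
        "--algorithmname=%s" % algorithmname,
        "--corpus=%s" % corpus,
    ]
    # Remaining options are emitted in the order given, as --flag=value.
    for flag, value in options.items():
        command.append("--%s=%s" % (flag, value))
    return command


# Single-thread streaming variational Bayes, as in the first example above.
single = build_command(
    "filtering", "wiki", batchsize=32768, eta=0.01, max_iters=100, threshold=1
)

# The same run distributed over 16 threads.
distributed = build_command(
    "filtering", "wiki", batchsize=32768, eta=0.01, max_iters=100, threshold=1,
    numthreads=16,
)

print(" ".join(single))
print(" ".join(distributed))
# To launch a run from the repository directory, one could pass the list to
# subprocess.run(single, check=True).
```

Building the command as a list rather than a single string avoids shell quoting issues if a corpus name ever contains spaces.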
