Yet another one... I know... But it seems it is necessary for my needs. Luckily I can stand on the shoulders of talented hackers who came before me: David Newman and Radim Řehůřek.
Here is a small example of how to use liblda:
- Prepare the corpus of documents
- Prepare the vocabulary
- Run the LDA model
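The exact file formats expected by `run.py` are defined by liblda itself (see the test files under `liblda/test/`). As an illustration only, assuming one whitespace-tokenized document per line in the docs file and one word per line in the vocabulary file, the two preparation steps might be sketched like this:

```python
# Hypothetical preparation step -- the assumed formats are:
#   docs file:  one whitespace-tokenized document per line
#   vocab file: one word per line
# Check liblda's test files (arXiv_docs.txt / arXiv_vocab.txt) for the real format.

docs = [
    "latent dirichlet allocation topic model",
    "gibbs sampling topic inference",
]

# Write one document per line
with open("docs.txt", "w") as f:
    for d in docs:
        f.write(d + "\n")

# Collect the unique words, one per line, in a stable (sorted) order
vocab = sorted({w for d in docs for w in d.split()})
with open("vocab.txt", "w") as f:
    for w in vocab:
        f.write(w + "\n")
```

These two files would then be passed to `run.py` via `--docs` and `--vocab`.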
```shell
./run.py \
  --docs liblda/test/arXiv_docs.txt \
  --vocab liblda/test/arXiv_vocab.txt \
  --numT 20 \
  --seed 3 \
  --iter 300 \
  --alpha 0.01 \
  --beta 0.01 \
  --save_z --save_probs --print_topics 12
```
Yes, that is a lot of command-line options, I know...
You can also use the LdaModel class in your own code.
```python
import sys
sys.path.insert(1, '/Projects/LatentDirichletAllocation/')

from gensim import corpora
import liblda
from liblda.LDAmodel import LdaModel

# Load a corpus in Matrix Market format
c = corpora.mmcorpus.MmCorpus("liblda/test/test_corpus.mm")

# Train a 3-topic model
lda = LdaModel(numT=3, corpus=c)
lda.train()

# After training, the results are available as:
lda.phi    # topic-word distributions
lda.theta  # document-topic distributions
```
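Once training finishes, `phi` is what you inspect to see what each topic is about. Here is a minimal sketch using a toy NumPy matrix in place of a real `lda.phi`; the rows-are-topics, columns-are-words layout is an assumption for illustration:

```python
import numpy as np

# Toy stand-in for lda.phi: 2 topics over a 4-word vocabulary.
# Assumed layout: rows are topics, columns are word probabilities.
vocab = ["topic", "model", "gibbs", "sampler"]
phi = np.array([
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])

# Print the top-2 words per topic, highest probability first
for t, row in enumerate(phi):
    top = [vocab[i] for i in np.argsort(row)[::-1][:2]]
    print("topic %d: %s" % (t, " ".join(top)))
```

The same argsort trick works on `theta` to find the dominant topics of a document.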
Under the hood, this calls out to Dave Newman's Gibbs sampler, which scales well: 20 topics over 6M very short documents with a 1M-word vocabulary is feasible.