This is the source code that accompanies the blog article
"Clustering Text with Transformed Document Vectors".
Dependencies:
numpy
elasticsearch
nltk
gensim
scikit-learn
wordcloud
image
matplotlib
pyyaml
cd wordclouds
python ./plotWords.py twenty-news
to generate the word-cloud images for twenty-news
(or)
python ./plotWords.py acl-imdb
to generate them for acl-imdb.
cd analysis
python ./analyze.py twenty-news
python ./processAnalysis.py twenty-news
python ./analyze.py acl-imdb
python ./processAnalysis.py acl-imdb
to generate the box-and-whisker plot
and the figure for the intercluster/intracluster ratio.
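For readers unfamiliar with an intercluster/intracluster ratio, the following is a minimal sketch of one common way to compute such a number. It is an illustration under stated assumptions and not necessarily what analyze.py or processAnalysis.py does: the function name inter_intra_ratio, the use of Euclidean distance, and the centroid-based definitions of both distances are hypothetical choices made here.

```python
import numpy as np


def inter_intra_ratio(X, labels):
    """Ratio of mean centroid-to-centroid distance to mean point-to-centroid distance.

    Assumed definitions (hypothetical, not taken from the repository):
      intracluster = mean Euclidean distance from each point to its own centroid
      intercluster = mean Euclidean distance over all pairs of centroids
    A larger ratio suggests tighter, better separated clusters.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    ids = np.unique(labels)  # sorted unique cluster labels
    if ids.size < 2:
        raise ValueError("need at least two clusters")

    # One centroid per cluster, in the same order as `ids`.
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])

    # Map every point to the row index of its own centroid.
    idx = np.searchsorted(ids, labels)
    intra = np.linalg.norm(X - centroids[idx], axis=1).mean()
    if intra == 0.0:
        raise ValueError("intracluster distance is zero; ratio is undefined")

    # Pairwise centroid distances, upper triangle only (each pair once).
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    upper = np.triu_indices(ids.size, k=1)
    inter = dists[upper].mean()

    return inter / intra


if __name__ == "__main__":
    # Two small, well separated clusters in 2-D.
    points = [[0, 0], [0, 2], [10, 0], [10, 2]]
    assignment = [0, 0, 1, 1]
    print(inter_intra_ratio(points, assignment))
```

With document vectors as the rows of X and cluster assignments as labels, a higher value under these assumed definitions means the clusters are far apart relative to their spread.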
cd clusters
mkdir logs
./run.sh twenty-news
./run.sh acl-imdb