
Analyzing Corpora

For project Sherlock, our team aims to use NLP tools to analyse large collections of documents. The original description of the team's goals is on Sherlock's repo.

The following sections describe the process for going from a collection of plain text documents (e-mails, in this case) to a visualization of the topics in these documents.

Tools

We are working in a mixture of Python, Scala, Spark, R, and other tools. Setup instructions for each of these tools are described below.

Spark setup

How do you get Spark running locally? Is this necessary, or is it optional because we are using Spark on the cluster?

Forqlift

Forqlift is a tool for converting plain text files into Hadoop sequence files. HDFS (and thus Spark) does not handle large numbers of small files well, so the e-mails are packed into a sequence file instead.

To install forqlift, simply download the binaries and extract them. Add $FORQLIFT/bin to your PATH and you are ready to run forqlift.

Dataset

A good example data set is the Enron email archive. This data set can be downloaded from here.

Step 1 - The original data

The initial Enron e-mail data set can be found here. This compressed archive contains plain text e-mails. Use forqlift to create a sequence file:

forqlift fromarchive enron_mail_20150507.tgz --file enron_mail.seq --compress bzip2 --data-type text

Inputs:

  • .tgz file

Outputs:

  • .seq file containing all e-mails
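
To check the result, the sequence file can be read back with Spark. This is a minimal sketch, assuming spark-shell (so a SparkContext sc already exists) and forqlift's text key/value layout, where the key is typically the original file name inside the archive and the value its contents:

  // Sketch only, not part of the pipeline: count the e-mails in the sequence
  // file and print the first key. Assumes spark-shell, so `sc` is already defined.
  val emails = sc.sequenceFile[String, String]("data/enron_mail.seq")
  println(s"number of e-mails: ${emails.count()}")
  emails.take(1).foreach { case (name, body) => println(name) }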

Step 2 - Preprocessing

Prepare the e-mails stored in the sequence file for LDA classification with EmailParser.scala.

spark-submit --class EmailParser $myjar data/enron_mail.seq --metadata data/metadata.seq --dictionary data/dic.csv --corpus data/bow.csv

The preprocessing excludes words that are too common as well as words that are too rare; the criteria for both can be set with the optional arguments.
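
For illustration, this kind of document-frequency filtering can be sketched as follows; the names minDf and maxDfFraction and their default values are assumptions for the example, not EmailParser's actual options:

  // Sketch of vocabulary filtering by document frequency (not EmailParser's actual code).
  // `tokenized` holds one token sequence per e-mail; a word is kept if it appears
  // in at least minDf documents and in at most maxDfFraction of all documents.
  import org.apache.spark.rdd.RDD

  def filterVocabulary(tokenized: RDD[Seq[String]],
                       minDf: Long = 5,
                       maxDfFraction: Double = 0.5): Set[String] = {
    val nDocs = tokenized.count().toDouble
    tokenized
      .flatMap(_.distinct.map(word => (word, 1L)))   // count documents, not occurrences
      .reduceByKey(_ + _)
      .filter { case (_, df) => df >= minDf && df / nDocs <= maxDfFraction }
      .keys
      .collect()
      .toSet
  }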

Inputs:

  • enron_mail.seq sequence file with all the e-mails

  • paths for the output files (metadata, dictionary, corpus)

  • optional arguments for EmailParser, see also:

    spark-submit --class EmailParser $myjar --help

Outputs:

  • Dictionary (.csv):
    • dictionary linking wordid (integer) and word (character)
  • Bags of words (.csv):
    • word count (integer) per document: documentid x wordid
  • Metadata (.seq), one record per e-mail (illustrated below):
    • id: unique e-mail identifier
    • path: path to the e-mail file
    • user (sender): character
    • from: e-mail address
    • to: list of e-mail address(es)
    • cc: list of e-mail address(es)
    • bcc: list of e-mail address(es)
    • sent data-type / rec. data-type
    • MIMMSGID *
    • subject: subject of the e-mail, one character string
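
For reference, one metadata record corresponds roughly to the following structure; the field names and types here are illustrative, based on the list above, and are not EmailParser's actual schema:

  // Illustrative shape of a metadata record (assumed names/types, not the actual schema).
  case class EmailMetadata(
      id: String,              // unique e-mail identifier
      path: String,            // path to the e-mail file
      user: String,            // sender (mailbox owner)
      from: String,            // From: address
      to: Seq[String],         // To: addresses
      cc: Seq[String],         // Cc: addresses
      bcc: Seq[String],        // Bcc: addresses
      sent: String,            // sent / received date information
      messageId: String,       // message id (MIMMSGID)
      subject: String)         // subject line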

Step 3 - Train LDA

This step can be run multiple times (for different numbers of topics); a sketch of the underlying training call is shown at the end of this step.

See also the documentation on: https://github.com/nlesc-sherlock/spark-lda

  spark-submit --class ScalaLDA $myjar --k 10 data/bow.csv data/lda.csv

Inputs:

  • Bags of words generated by Step 2
  • k: the desired number of topics

Outputs:

  • Word by topic matrix (.csv)
    • weights (floating point between 0 and 1) for wordid (integer) x topicid (integer)

For more information on the LDA optimization, see here.
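
Roughly, the training boils down to a Spark MLlib LDA call. The sketch below is not ScalaLDA's actual code; it assumes that bow.csv contains documentid,wordid,count rows, that dic.csv has one dictionary entry per line, and that a SparkContext sc is available (e.g. in spark-shell):

  // Sketch of the training step with Spark MLlib (assumed file layouts, see above).
  import org.apache.spark.mllib.clustering.LDA
  import org.apache.spark.mllib.linalg.Vectors

  val vocabSize = sc.textFile("data/dic.csv").count().toInt   // one dictionary entry per line (assumed)
  val corpus = sc.textFile("data/bow.csv")
    .map(_.split(","))                                        // assumed layout: documentid,wordid,count
    .map { case Array(doc, word, count) => (doc.toLong, (word.toInt, count.toDouble)) }
    .groupByKey()
    .map { case (doc, counts) => (doc, Vectors.sparse(vocabSize, counts.toSeq)) }
    .cache()

  val model = new LDA().setK(10).run(corpus)                  // k topics, as in --k 10
  val wordByTopic = model.topicsMatrix                        // vocabSize x k matrix of word weights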

Step 4 - Apply LDA

Use the trained LDA model to generate the document-by-topic matrix (a sketch of this step follows the lists below):

spark-submit --class ApplyLDA $myjar data/lda.csv.model data/bow.csv data/document_topics.csv

Inputs:

  • LDA model (containing the word-by-topic matrix) from Step 3
  • Bags of words from Step 2

Outputs:

  • Document by topic matrix / the LDA classification (.csv)
    • weights (floating point between 0 and 1) for documentid (integer) x topicid (integer)
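
Roughly, this step loads the trained model and computes a topic distribution per document. The sketch below is not ApplyLDA's actual code; it assumes the saved model can be loaded as an MLlib LocalLDAModel and reuses the corpus RDD from the Step 3 sketch:

  // Sketch of applying the trained model (assumed model format, see above).
  import org.apache.spark.mllib.clustering.LocalLDAModel

  val ldaModel = LocalLDAModel.load(sc, "data/lda.csv.model")
  // One (documentid, topic-weight vector) pair per document; weights are between 0 and 1.
  val docTopics = ldaModel.topicDistributions(corpus)
  docTopics
    .map { case (doc, weights) => (doc +: weights.toArray.toSeq).mkString(",") }
    .saveAsTextFile("data/document_topics")                   // writes a directory of part files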

Step 5 - Visualization

Step 5.a - Run clustering / visualization (IPython notebook)

Step 5.b - Run R-shiny visualization

Further reading

This paper discusses issues with topic model stability; it would be interesting to read and see what we can learn from it.
