Textmining Project - Similarity measurement of News Articles and Party Manifestos

Prerequisites

Manifestos are available as CSV files and Spiegel articles as .xml files. The preprocessing step converts these files into corresponding .json files, which all following steps use. For best reproducibility, extract the Spiegel corpus XML files to data/spiegel/ and the manifesto CSV files to data/manifestos/. Python 3 is required. To install all required Python dependencies, run pip3 install -r requirements.txt in the project root.
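The conversion to .json can be pictured roughly as follows. This is only a sketch using the standard library, not the project's actual preprocessing code: the CSV column name "content", the XML tag names "title" and "text", and the shape of the resulting JSON are all assumptions made for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def manifesto_csv_to_records(csv_text, text_column="content"):
    """Collect the non-empty text cells of a manifesto CSV.

    The column name "content" is an assumption of this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        cell = (row.get(text_column) or "").strip()
        if cell:
            records.append(cell)
    return records


def article_xml_to_record(xml_text, title_tag="title", body_tag="text"):
    """Extract title and body from one article XML document.

    The tag names are assumptions; the real Spiegel files may differ.
    """
    root = ET.fromstring(xml_text)
    title = root.findtext(f".//{title_tag}", default="")
    body = root.findtext(f".//{body_tag}", default="")
    return {"title": title.strip(), "text": body.strip()}


# Tiny in-memory stand-ins for the real input files.
sample_csv = "content,code\nWe will invest in schools.,506\n,NA\nTaxes must fall.,402\n"
sample_xml = "<article><title>Budget debate</title><text>Parliament discussed taxes.</text></article>"

manifesto_json = json.dumps({"sentences": manifesto_csv_to_records(sample_csv)})
article_json = json.dumps(article_xml_to_record(sample_xml))
print(manifesto_json)
print(article_json)
```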

Word Based Similarity Measurement Algorithms

The algorithms (Jaccard, tf cosine similarity and tf-idf cosine similarity) provide the following functionality:

Compare the manifesto of a certain party to all other manifestos.
Compare manifestos to Spiegel articles.
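To make the three measures concrete, here is a minimal sketch in plain Python. It is not the code in jaccard.py, cosine_sim.py or tf_idf.py; the whitespace tokenizer, the smoothed idf formula and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption of this sketch).
    return text.lower().split()


def jaccard(a, b):
    """Size of the shared vocabulary divided by size of the combined vocabulary."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(u, v):
    """Cosine similarity of two sparse term -> weight mappings."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tf_vector(text):
    """Raw term counts."""
    return Counter(tokenize(text))


def tfidf_vectors(texts):
    """Term counts weighted by a smoothed inverse document frequency."""
    tfs = [tf_vector(t) for t in texts]
    n = len(texts)
    df = Counter(term for tf in tfs for term in tf)
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    return [{term: count * idf[term] for term, count in tf.items()} for tf in tfs]


manifesto = "lower taxes and better schools"
article_a = "parliament debates lower taxes"
article_b = "football season starts"

print(jaccard(manifesto, article_a))
print(cosine(tf_vector(manifesto), tf_vector(article_a)))
vec_m, vec_a, vec_b = tfidf_vectors([manifesto, article_a, article_b])
print(cosine(vec_m, vec_a), cosine(vec_m, vec_b))
```

Jaccard only looks at which words occur, tf cosine also accounts for how often they occur, and tf-idf cosine additionally down-weights words that appear in many documents.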

A typical workflow involves:

Place the raw manifestos and Spiegel articles in the data folder.
Calculate the similarity between manifestos and newspaper articles.
Evaluate the output for a specific purpose, e.g. print the articles that have the highest similarity to a certain manifesto.
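The evaluation step could look like the sketch below. It assumes the similarity scores are already available as a nested mapping from manifesto name to article id to score; that layout, the names and the numbers are invented for illustration and are not the project's output format.

```python
import heapq


def top_articles(scores, manifesto, k=3):
    """Return the k (article_id, score) pairs with the highest score."""
    return heapq.nlargest(k, scores[manifesto].items(), key=lambda item: item[1])


# Hypothetical similarity scores, invented for this example.
scores = {
    "party_x_manifesto": {"a1": 0.12, "a2": 0.41, "a3": 0.27, "a4": 0.05},
}

best = top_articles(scores, "party_x_manifesto", k=2)
for article_id, score in best:
    print(f"{article_id}\t{score:.2f}")
```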

Sample workflows are available in the folder "examples", which contains sample bash scripts for the different algorithms. An older version used the file "plot.py" for result visualization; the current outputs are visualized with Microsoft Excel. The IPython notebook "EvaluationSample.ipynb" provides further samples in which output is generated with respect to different metrics. To use the notebook, extract the files provided at https://drive.google.com/open?id=0Bx3bSdxqXIf6TWZ0YTZOdUFxMzQ and copy them to the folder "data".

LDA

Several scripts are available to extract and explore topics and to compare similarities using Latent Dirichlet Allocation (LDA). Each script has configurable variables that can be changed in the corresponding Python file.

lda_manifesto.py: train an LDA model using all manifestos in a pre-configured folder. Each manifesto is split into sentences, and these sentences form the input corpus. Optionally, a pyLDAvis HTML file can be generated.
lda_spiegel.py: train an LDA model using all Spiegel articles in a pre-configured folder. The Spiegel articles of one year form the input corpus. Optionally, a pyLDAvis HTML file can be generated.
lda_print_topics.py: print a configured number of topics and terms per topic for a pre-calculated LDA model.
lda_generate_pyLDAvis_html.py: generate a pyLDAvis HTML file for exploring an LDA model (the LDA model is loaded from file).
lda_compare_manifesto_with_manifestos.py: compare one manifesto with a configurable number of other manifestos. Similarity is calculated as the cosine similarity of the LDA model vectors.
lda_compare_parties.py: compare the LDA models generated for each party (using all manifestos of that party as input). Similarity is calculated as the cosine similarity of the LDA model vectors.
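The cosine comparison of LDA vectors used by the two compare scripts can be sketched as follows. This is an illustration under assumptions, not the scripts' code: it supposes each document is represented as a sparse list of (topic_id, probability) pairs, and the topic distributions below are made up.

```python
import math


def to_dense(sparse_topics, num_topics):
    """Expand a sparse [(topic_id, probability), ...] list into a dense vector."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


num_topics = 4
# Made-up topic distributions for three manifestos.
manifesto_a = to_dense([(0, 0.7), (2, 0.3)], num_topics)
manifesto_b = to_dense([(0, 0.6), (2, 0.4)], num_topics)
manifesto_c = to_dense([(1, 0.5), (3, 0.5)], num_topics)

sim_ab = cosine_similarity(manifesto_a, manifesto_b)
sim_ac = cosine_similarity(manifesto_a, manifesto_c)
print(sim_ab, sim_ac)
```

Documents that load on the same topics score close to 1, while documents with no topics in common score 0.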

A typical workflow involves:

Place the raw manifestos and Spiegel articles in the data folder (as described in Prerequisites).
Train LDA models for all manifestos and for the Spiegel years.
Explore the topics using lda_print_topics.py and, for more advanced exploration, pyLDAvis.
Compare manifestos and parties.

For example:

# train LDA model for manifestos
python3 lda_manifesto.py
# train LDA model for Spiegel articles (depending on e.g. the number of topics, this can take a long time)
python3 lda_spiegel.py
# generate pyLDAvis html for a specific LDA model
python3 lda_generate_pyLDAvis_html.py
# open the generated html and analyse topics
