Wiki-ESA

Explicit Semantic Analysis based on Wikipedia

This is a Python library that contains code to 1) construct a semantic interpreter from Wikipedia data and 2) apply that interpreter to various kinds of text.

To construct an interpreter, first obtain a Wikipedia XML dump from http://dumps.wikimedia.org/enwiki/

Then run xml_parse.py with the downloaded file as its argument. This outputs some temporary files containing information on the words, links and articles encountered.
Next, run generate_indices.py to generate lists of indices corresponding to the unique words and articles encountered.
Finally, run matrix_builder.py to construct a very large sparse interpretation matrix. Each row corresponds to a unique word and each column to a 'concept', i.e. a Wikipedia article; entry (i, j) is the TF-IDF score for word i in article j. The matrix is saved in separate chunks to conserve memory.
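
To make the structure of the interpretation matrix concrete, here is a minimal sketch of a word-by-concept TF-IDF matrix stored sparsely as nested dicts. It is not the code in matrix_builder.py: the function name `build_tfidf`, the toy articles, and the particular TF-IDF variant (relative term frequency times log inverse document frequency) are assumptions chosen for illustration, and the chunked saving is left out. It assumes Python 3.

```python
import math
from collections import Counter, defaultdict


def build_tfidf(articles):
    """Build a sparse word-by-concept TF-IDF matrix (illustrative sketch).

    articles: dict mapping article title -> list of word tokens.
    Returns (words, concepts, matrix), where matrix[i][j] is the score
    for word i in article j and zero entries are simply not stored.
    """
    concepts = sorted(articles)
    words = sorted({w for tokens in articles.values() for w in tokens})
    word_index = {w: i for i, w in enumerate(words)}

    # Document frequency: the number of articles containing each word.
    df = Counter()
    for tokens in articles.values():
        df.update(set(tokens))

    n_articles = len(concepts)
    matrix = defaultdict(dict)
    for j, title in enumerate(concepts):
        tokens = articles[title]
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)                # relative term frequency
            idf = math.log(n_articles / df[word])   # rarer words score higher
            score = tf * idf
            if score > 0:                           # keep the matrix sparse
                matrix[word_index[word]][j] = score
    return words, concepts, dict(matrix)


# Hypothetical toy input standing in for parsed Wikipedia articles.
toy_articles = {
    "Cat": ["cat", "purr", "animal"],
    "Dog": ["dog", "bark", "animal"],
}
words, concepts, matrix = build_tfidf(toy_articles)
```

In this toy case "animal" occurs in every article, so its inverse document frequency is zero and its row stays empty, while "cat" only scores for the "Cat" concept.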

medium_wiki.xml can be used as an example file for demonstration/testing purposes, as it contains only the first 100 or so Wikipedia articles.
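
The three build steps can be chained from a small driver script, sketched below for the example file. The helper names `pipeline_commands` and `run_pipeline` are hypothetical. The sketch also assumes that the scripts are run from the repository directory, that generate_indices.py and matrix_builder.py take no arguments, and that `python` resolves to an interpreter version the scripts support; none of this is confirmed by the text above, which only states that xml_parse.py takes the dump file as its argument.

```python
import subprocess


def pipeline_commands(dump_path, python="python"):
    """Return the three build commands in the order they must run."""
    return [
        [python, "xml_parse.py", dump_path],   # parse the XML dump
        [python, "generate_indices.py"],       # assumed to take no arguments
        [python, "matrix_builder.py"],         # assumed to take no arguments
    ]


def run_pipeline(dump_path, python="python"):
    """Run each step in turn, stopping at the first one that fails."""
    for command in pipeline_commands(dump_path, python):
        subprocess.check_call(command)


# Example call, run from the repository directory:
# run_pipeline("medium_wiki.xml")
```

`subprocess.check_call` raises `CalledProcessError` on a non-zero exit status, so a failed parse does not go on to build indices from incomplete temporary files.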

cunning_linguistics.py contains classes to perform text analysis and to harvest tweets for analysis.
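
As a rough illustration of how an interpretation matrix can be applied to text, the sketch below maps a text to a concept vector by summing the rows of its words, then compares two texts by cosine similarity. This is the general explicit semantic analysis idea, not the API of cunning_linguistics.py: the functions `interpret` and `cosine`, the whitespace tokenisation, and the toy word vectors with made-up scores are all assumptions.

```python
import math
from collections import Counter


def interpret(text, word_vectors):
    """Map a text to a concept vector by summing its words' concept scores.

    word_vectors: dict mapping word -> {concept: TF-IDF score}, i.e. the
    rows of an interpretation matrix. Unknown words are ignored.
    """
    concept_vector = Counter()
    for word, count in Counter(text.lower().split()).items():
        for concept, score in word_vectors.get(word, {}).items():
            concept_vector[concept] += count * score
    return dict(concept_vector)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical rows of an interpretation matrix with made-up scores.
toy_vectors = {
    "beer": {"Brewing": 0.8, "Carlsberg": 0.5},
    "lager": {"Brewing": 0.6, "Carlsberg": 0.7},
    "piano": {"Music": 0.9},
}
beer = interpret("beer", toy_vectors)
lager = interpret("lager", toy_vectors)
piano = interpret("piano", toy_vectors)
```

Here `cosine(beer, lager)` is positive because both texts load on the same concepts, while `cosine(beer, piano)` is 0.0 because they share none.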

bjarkemoensted/Wiki-ESA