
Interpreting Word Embeddings with Eigenvector Analysis

License: MIT

This is the code for: Interpreting Word Embeddings with Eigenvector Analysis. Jamin Shin, Andrea Madotto, and Pascale Fung. NeurIPS 2018 Workshop on Interpretability and Robustness in Audio, Speech, and Language (IRASL). [PDF]

The code mainly consists of Jupyter Notebook analyses and the hyperwords word embedding library. If you use any source code or datasets from this toolkit in your work, please cite the following paper. The BibTeX entry is listed below:

@InProceedings{shin2018interpreting,
  author    = "Shin, Jamin and Madotto, Andrea and Fung, Pascale",
  title     = "Interpreting Word Embeddings with Eigenvector Analysis",
  booktitle = "Workshop on Interpretability and Robustness in Audio, Speech, and Language (IRASL)",
  year      = "2018",
  publisher = "NeurIPS IRASL"
}

Abstract

Dense word vectors have proven their value in many downstream NLP tasks over the past few years. However, the dimensions of such embeddings are not easily interpretable: of the d dimensions in a word vector, we cannot tell what high or low values mean. Previous approaches to this issue have mainly focused on either training word embeddings with sparsity or non-negativity constraints, or post-processing standard pre-trained word embeddings. In contrast, we analyze conventional word embeddings trained with Singular Value Decomposition and reveal similar interpretability. We use a novel eigenvector analysis method inspired by Random Matrix Theory and show that semantically coherent groups form not only in the row space but also in the column space. This allows us to view individual word vector dimensions as human-interpretable semantic features.
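For intuition, here is a minimal, self-contained sketch of the kind of analysis described above (not the notebook's actual pipeline): build a PPMI co-occurrence matrix from a toy corpus, factorize it with truncated SVD, and read each embedding dimension as a ranked word list. The toy corpus and all variable names are illustrative assumptions.

```python
# Minimal sketch of the analysis idea, not the notebook's actual
# pipeline: build a PPMI matrix from a toy corpus, factorize it with
# truncated SVD, and read each dimension as a ranked word list.
# The corpus and all variable names are illustrative assumptions.
import numpy as np
from itertools import combinations
from scipy.sparse.linalg import svds

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell as markets closed",
    "markets rose and stocks gained",
]

# Symmetric co-occurrence counts (window = whole sentence).
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    for a, b in combinations(s.split(), 2):
        C[idx[a], idx[b]] += 1
        C[idx[b], idx[a]] += 1

# Positive PMI: max(0, log(P(a,b) / (P(a) P(b)))).
total = C.sum()
Pw = C.sum(axis=1) / total
with np.errstate(divide="ignore"):
    pmi = np.log((C / total) / np.outer(Pw, Pw))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# Truncated SVD: rows of U (scaled by S) are the word vectors.
U, S, Vt = svds(ppmi, k=3)

# Inspect each dimension: if it is interpretable, the words with the
# largest loadings should form a semantically coherent group.
for d in range(3):
    top = np.argsort(-np.abs(U[:, d]))[:5]
    print("dim %d:" % d, [vocab[i] for i in top])
```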

Installation

Our source code mainly consists of analysis code in Jupyter Notebooks, a modified version of the Perl-based Wikipedia preprocessing script provided by Matt Mahoney, and the hyperwords library by Omer Levy, which requires Python 2.7.

For the data, please download the latest Wikipedia dump; note that the dump we used (April 2018) is no longer available on Wikimedia Downloads. A minimal fetch sketch, assuming the standard enwiki "latest" URL layout on dumps.wikimedia.org, is shown below.
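```python
# Hedged sketch: download the latest English Wikipedia dump as a
# stand-in for the April 2018 dump used in the paper. The URL follows
# the usual dumps.wikimedia.org layout; the file is tens of gigabytes.
import urllib.request

URL = ("https://dumps.wikimedia.org/enwiki/latest/"
       "enwiki-latest-pages-articles.xml.bz2")
urllib.request.urlretrieve(URL, "enwiki-latest-pages-articles.xml.bz2")
```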

Training and Testing

All training and testing follow the hyperwords source code; the corresponding documentation is in hyperwords.md.
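Purely as an illustration of what the word-similarity side of that evaluation boils down to (the file name and vector format below are assumptions, not hyperwords' actual output), consider:

```python
# Illustrative only: cosine nearest neighbors over word vectors stored
# as "word v1 v2 ... vd" per line. File name and format are assumed;
# see hyperwords.md for the actual training/evaluation commands.
import numpy as np

def load_vectors(path):
    words, rows = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            words.append(parts[0])
            rows.append(np.array(parts[1:], dtype=float))
    M = np.vstack(rows)
    M /= np.linalg.norm(M, axis=1, keepdims=True)  # unit-normalize rows
    return words, M

def nearest(query, words, M, n=5):
    q = M[words.index(query)]
    sims = M.dot(q)  # cosine similarity, since rows are unit vectors
    order = np.argsort(-sims)
    return [(words[i], float(sims[i])) for i in order[1:n + 1]]

words, M = load_vectors("svd_vectors.txt")  # hypothetical output file
print(nearest("king", words, M))
```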

Issues

This code is old and largely unmaintained. However, the main ideas are all shown in the paper and the analysis notebook. If you find any issues, please open a GitHub Issue.

About

Code for "Interpreting Word Embeddings with Eigenvector Analysis": https://openreview.net/forum?id=rJfJiR5ooX
