Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation

Introduction

This is a Python implementation of the method for jointly learning embeddings of words and Wikipedia entities proposed in the paper cited in the Reference section below. The embeddings can be learned directly from a Wikipedia dump retrieved from Wikimedia Downloads.

Install

% pip install Cython numpy
% pip install -r requirements.txt
% python setup.py develop
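
If the installation succeeded, the package should be importable from Python. A minimal check, assuming the entity_vector module name used in the sample code below:

>>> from entity_vector import EntityVector  # should import without errors once the Cython extensions are built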

Basic Usage

The pre-trained embedding can be downloaded from the following links. Please note that these files must be placed in the same directory.

Alternatively, the embedding can be built from a Wikipedia dump using the following commands:

% entity-vector build_dictionary <WIKIPEDIA_DUMP_FILE> <DICTIONARY_FILE>
% entity-vector train_embedding <WIKIPEDIA_DUMP_FILE> <DICTIONARY_FILE> <OUT_FILE>
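
Once training finishes, the resulting <OUT_FILE> should be loadable in the same way as the pre-trained pickle shown in the sample code below. A minimal sketch, assuming the output is a pickle accepted by EntityVector.load and using a hypothetical file name:

>>> from entity_vector import EntityVector
>>> entvec = EntityVector.load('my_entity_vector.pickle')  # hypothetical name of <OUT_FILE>
>>> entvec.most_similar(entvec.get_word(u'apple'))  # nearest words and entities to a query word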

Sample Code

>>> from entity_vector import EntityVector
>>> entvec = EntityVector.load('enwiki_entity_vector_500_20151026.pickle')
>>> word = entvec.get_word(u'c-3po')
>>> entvec[word]
memmap([ 0.05961042,  0.24534572,  0.42090839, -0.01455959,  0.11772038,
        0.55437287, -0.62508648, -0.24478671,  0.07838536,  0.27331885,
        0.35184374,  0.34113087,  0.11718472, -0.14086614, -0.00730115,
...
>>> entvec.most_similar(word)
[(<Word c-3po>, 1.0000000000000002),
 (<Entity C-3PO>, 0.8855517572211461),
 (<Word r2-d2>, 0.85768096183067088),
 (<Entity R2-D2>, 0.81842535257607718),
 (<Word chewbacca>, 0.7771232783769505),
 (<Entity Chewbacca>, 0.77412692204846856),
...
>>> entity = entvec.get_entity(u'C-3PO')
>>> entvec[entity]
memmap([ -3.51071961e-03,   4.82281654e-01,   6.72443198e-01,
         2.41103170e-01,   1.43198542e-01,   6.44051048e-01,
        -5.48925964e-01,  -4.64934616e-01,  -2.48444133e-01,
...
>>> entvec.most_similar(entity)
[(<Entity C-3PO>, 1.0),
 (<Entity R2-D2>, 0.90188752966007535),
 (<Word c-3po>, 0.88555175722114643),
 (<Entity Chewbacca>, 0.8304708994223623),
 (<Word r2-d2>, 0.82777910810169675),
 (<Entity Han Solo>, 0.80912814689071744),
 ...
>>> entvec.get_similarity(word, entity)
0.90466782126690559
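
The similarity scores above look like cosine similarities between the stored vectors; this is an assumption rather than something documented here. A small sketch that recomputes the score directly from the raw vectors with numpy:

>>> import numpy as np
>>> v_word = np.asarray(entvec[word])
>>> v_entity = np.asarray(entvec[entity])
>>> cos = float(np.dot(v_word, v_entity) / (np.linalg.norm(v_word) * np.linalg.norm(v_entity)))
>>> # if get_similarity returns cosine similarity, cos should match the value above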

Reference

If you use the code or the pretrained embedding in your research, please cite the following paper:

@InProceedings{yamada-EtAl:2016:CoNLL,
  author    = {Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  title     = {Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  booktitle = {Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {250--259}
}

License

Apache License 2.0
