zhouyonglong/arxiv2018-xling-sentence-embeddings

Concatenated p-mean Embeddings as Universal Cross-Lingual Sentence Representations

This repository contains the data and code to reproduce the results of our paper: https://arxiv.org/abs/1803.01400. It also contains the cross-lingual word embeddings that we used in our experiments, translated variants of SNLI, and our code to map the embeddings of two languages into a common space.

Please use the following citation:

@article{rueckle:2018,
  title = {Concatenated p-mean Embeddings as Universal Cross-Lingual Sentence Representations},
  author = {R{\"u}ckl{\'e}, Andreas and Eger, Steffen and Peyrard, Maxime and Gurevych, Iryna},
  journal = {arXiv},
  year = {2018},
  url = "https://arxiv.org/abs/1803.01400"
}

Abstract: Average word embeddings are a common baseline for more sophisticated sentence embedding techniques. An important advantage of average word embeddings is their computational and conceptual simplicity. However, they typically fall short of the performances of more complex models such as InferSent. Here, we generalize the concept of average word embeddings to p-mean word embeddings, which are (almost) as efficiently computable. We show that the concatenation of different types of p-mean word embeddings considerably closes the gap to state-of-the-art methods such as InferSent monolingually and substantially outperforms these more complex techniques cross-lingually. In addition, our proposed method outperforms different recently proposed baselines such as SIF and Sent2Vec by a solid margin, thus constituting a much harder-to-beat monolingual baseline for a wide variety of transfer tasks. Our data and code are publicly available.
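To make the idea in the abstract concrete, here is a minimal NumPy sketch of concatenated p-means (power means) over the word vectors of one sentence. It is an illustration and not the code in the model folder: the function names `p_mean` and `concat_p_means`, the toy vectors, and the chosen set of p values (mean, minimum, maximum) are assumptions made for this example.

```python
import numpy as np


def p_mean(vectors: np.ndarray, p: float) -> np.ndarray:
    """Power mean over the word axis (axis 0) of a (num_words, dim) array.

    p = 1 is the arithmetic mean; p = +inf and p = -inf are the limits
    (element-wise max and min). Other finite p are assumed to be
    positive integers in this sketch.
    """
    if p == float("inf"):
        return vectors.max(axis=0)
    if p == float("-inf"):
        return vectors.min(axis=0)
    m = np.mean(np.power(vectors, p), axis=0)
    # Sign-preserving root so that odd p also works for negative components.
    return np.sign(m) * np.abs(m) ** (1.0 / p)


def concat_p_means(vectors: np.ndarray,
                   ps=(1.0, float("-inf"), float("inf"))) -> np.ndarray:
    """Concatenate several p-means into one sentence vector."""
    return np.concatenate([p_mean(vectors, p) for p in ps])


# Toy "sentence" of two words with 2-dimensional embeddings.
words = np.array([[1.0, 2.0], [3.0, 4.0]])
print(concat_p_means(words))  # mean, min, max: [2. 3. 1. 2. 3. 4.]
```

The resulting vector has `len(ps) * dim` dimensions. Concatenating several embedding types in the same way multiplies the size again, which is why the computation stays almost as cheap as plain averaging.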

Contact person: Andreas Rücklé

https://www.ukp.tu-darmstadt.de/

https://www.tu-darmstadt.de/

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Usage

We offer several TF-Hub modules for convenience:

import tensorflow_hub as hub

url_de = "https://public.ukp.informatik.tu-darmstadt.de/arxiv2018-xling-sentence-embeddings/tf-hub/en-de/1"
url_fr = "https://public.ukp.informatik.tu-darmstadt.de/arxiv2018-xling-sentence-embeddings/tf-hub/en-fr/1"
url_monolingual = "https://public.ukp.informatik.tu-darmstadt.de/arxiv2018-xling-sentence-embeddings/tf-hub/monolingual/1"

# Pick one of the URLs above (hub.Module is the TF1-style TF-Hub API).
embed = hub.Module(url_de)
# Inputs are tokenized, lowercased, and carry a language suffix.
representations = embed(["a_en long_en sentence_en ._en", "another_en sentence_en"])

The input strings must be tokenized (tokens separated by spaces), lowercased, and suffixed with _en, _de, or _fr (no suffix for the monolingual model). (We usually don't lowercase everything, but at this time we don't see a simple way of doing this in a saved TF graph.) If you want to work with non-lowercased sequences, download and run the model as described below.
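The input format described above can be produced with a small helper. This is a sketch under assumptions: `prepare_sentence` is a hypothetical name, and the regex tokenizer is a simple stand-in, not necessarily the tokenizer we used in our experiments.

```python
import re

SUPPORTED_LANGS = {"en", "de", "fr"}


def prepare_sentence(text: str, lang: str = "en", monolingual: bool = False) -> str:
    """Lowercase and tokenize `text`, then append a language suffix per token.

    The monolingual module expects tokens without a suffix.
    """
    if not monolingual and lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language suffix: {lang}")
    # Simple stand-in tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    if monolingual:
        return " ".join(tokens)
    return " ".join(f"{tok}_{lang}" for tok in tokens)


print(prepare_sentence("A long sentence."))  # a_en long_en sentence_en ._en
```

The returned strings can be passed directly in the list given to `embed(...)`.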

For full reproducibility, please use our Python code:

cd model
pip install -r requirements.txt
python main.py

Sub-Projects

This repository contains different sub-projects:

<ROOT>
├── README.md
├── model/
├── evaluation/
├── data/
└── map-word-embeddings/

Model: This is our concatenated p-means model. On execution, it automatically fetches all required resources and provides an embeddings webserver that can generate sentence embeddings using our models (en-de, en-fr, monolingual).

Evaluation: Contains the evaluation framework that we use to evaluate the three additional tasks we provide (mainly from argumentation mining).

Data: We provide our datasets and other resources in this folder. This includes our cross-lingual tasks.

Map-Word-Embeddings: We provide the software that we used to induce our cross-lingual word embeddings and to re-map existing ones. See the appendix of our paper for more details.

Additional Downloads

More details can be found in the data folder.
