Concatenated p-mean Embeddings as Universal Cross-Lingual Sentence Representations

This repository contains the data and code to reproduce the results of our paper (https://arxiv.org/abs/1803.01400). It also contains the cross-lingual word embeddings that we used in our experiments, translated variants of SNLI, and our code to map the embeddings of two languages into a common space.

Please use the following citation:

@article{rueckle:2018,
  title = {Concatenated p-mean Embeddings as Universal Cross-Lingual Sentence Representations},
  author = {R{\"u}ckl{\'e}, Andreas and Eger, Steffen and Peyrard, Maxime and Gurevych, Iryna},
  journal = {arXiv},
  year = {2018},
  url = "https://arxiv.org/abs/1803.01400"
}

Abstract: Average word embeddings are a common baseline for more sophisticated sentence embedding techniques. An important advantage of average word embeddings is their computational and conceptual simplicity. However, they typically fall short of the performances of more complex models such as InferSent. Here, we generalize the concept of average word embeddings to p-mean word embeddings, which are (almost) as efficiently computable. We show that the concatenation of different types of p-mean word embeddings considerably closes the gap to state-of-the-art methods such as InferSent monolingually and substantially outperforms these more complex techniques cross-lingually. In addition, our proposed method outperforms different recently proposed baselines such as SIF and Sent2Vec by a solid margin, thus constituting a much harder-to-beat monolingual baseline for a wide variety of transfer tasks. Our data and code are publicly available.
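
As a rough illustration of the idea described in the abstract, the sketch below computes element-wise power means over the word vectors of a sentence and concatenates them. The function names, the toy vectors, and the choice of p values (1, +inf, -inf) are assumptions made for this example; this is not the code from the model folder.

```python
import math

import numpy as np


def p_mean(vectors, p):
    """Element-wise power mean over the rows of `vectors` (n_words x dim).

    Only the cases p = 1, +inf and -inf are handled in this sketch.
    """
    if p == 1:
        return vectors.mean(axis=0)  # ordinary average
    if p == math.inf:
        return vectors.max(axis=0)   # limit p -> +inf is the maximum
    if p == -math.inf:
        return vectors.min(axis=0)   # limit p -> -inf is the minimum
    raise ValueError("this sketch only supports p in {1, +inf, -inf}")


def concat_p_means(vectors, ps=(1, math.inf, -math.inf)):
    """Concatenate several p-means into one sentence vector."""
    return np.concatenate([p_mean(vectors, p) for p in ps])


# Two hypothetical 2-dimensional word vectors for a two-word sentence.
word_vectors = np.array([[1.0, 2.0], [3.0, -4.0]])
sentence_embedding = concat_p_means(word_vectors)
```

With three p values, the resulting sentence vector has three times the dimensionality of the word embeddings.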

Contact person: Andreas Rücklé

https://www.ukp.tu-darmstadt.de/

https://www.tu-darmstadt.de/

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Usage

We offer several TF-Hub modules for convenience:

import tensorflow_hub as hub

url_de = "https://public.ukp.informatik.tu-darmstadt.de/arxiv2018-xling-sentence-embeddings/tf-hub/en-de/1"
url_fr = "https://public.ukp.informatik.tu-darmstadt.de/arxiv2018-xling-sentence-embeddings/tf-hub/en-fr/1"
url_monolingual = "https://public.ukp.informatik.tu-darmstadt.de/arxiv2018-xling-sentence-embeddings/tf-hub/monolingual/1"

# Pick one of the three module URLs above, e.g. the en-de model
embed = hub.Module(url_de)
representations = embed(["a_en long_en sentence_en ._en", "another_en sentence_en"])

The input strings must be tokenized (tokens separated by spaces) and lowercased, and each token must be suffixed with _en, _de, or _fr (no suffix for the monolingual model). (We usually don't lowercase everything, but we currently see no simple way of doing this in a saved TF graph.) If you want to work with non-lowercased sequences, download and run the model as described below.
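
The input format can be produced with a small helper such as the one below. This is a minimal sketch: the function name `format_for_hub` is hypothetical, and it assumes the sentence is already tokenized so that splitting on whitespace yields the tokens.

```python
def format_for_hub(sentence, lang=None):
    """Lowercase a whitespace-tokenized sentence and add language suffixes.

    `lang` is "en", "de" or "fr" for the cross-lingual modules, or None for
    the monolingual module, which expects tokens without a suffix.
    """
    tokens = sentence.lower().split()
    if lang is None:
        return " ".join(tokens)
    if lang not in {"en", "de", "fr"}:
        raise ValueError("lang must be 'en', 'de', 'fr' or None")
    return " ".join(f"{token}_{lang}" for token in tokens)


cross_lingual_input = format_for_hub("A long sentence .", "en")
monolingual_input = format_for_hub("A long sentence .")
```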

For full reproducibility, please use our Python code:

cd model
pip install -r requirements.txt
python main.py

Sub-Projects

This repository contains different sub-projects:

<ROOT>
├── README.md
├── model/
├── evaluation/
├── data/
└── map-word-embeddings/

Model: This is our concatenated p-means model. On execution, it automatically fetches all required resources and starts an embeddings web server that generates sentence embeddings with our models (en-de, en-fr, monolingual).

Evaluation: Contains the evaluation framework that we use for the three additional tasks we provide (mainly from argumentation mining).

Data: We provide our datasets and other resources in this folder, including our cross-lingual tasks.

Map-Word-Embeddings: We provide the software that we used to induce our cross-lingual word embeddings and to re-map existing ones. See the appendix of our paper for more details.
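
To give an intuition for what mapping the embeddings of two languages into a common space can look like, the sketch below uses orthogonal Procrustes, a common generic technique. This is an assumption chosen for illustration and not necessarily the method implemented in map-word-embeddings; the function name and the synthetic data are hypothetical.

```python
import numpy as np


def procrustes_map(src, tgt):
    """Return the orthogonal matrix W that minimizes ||src @ W - tgt||_F.

    `src` and `tgt` have shape (n_pairs, dim); row i of `src` is the
    embedding of a word whose translation is embedded in row i of `tgt`.
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic example: the "target language" is a rotated copy of the source.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 4))
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
tgt = src @ rotation

mapping = procrustes_map(src, tgt)
mapped_src = src @ mapping  # source vectors expressed in the target space
```

In this synthetic setting the learned mapping recovers the rotation exactly (up to numerical precision); real bilingual dictionaries only give an approximate alignment.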

Additional Downloads

Cross-lingual SNLI: en-de, en-fr
en-de cross-lingual word embeddings: BIVCD, AttractRepel, Fasttext (300K), Fasttext (Full)
en-fr cross-lingual word embeddings: BIVCD, AttractRepel, Fasttext (300K), Fasttext (Full)

More details can be found in the data folder.
