europarl-dissect

Introduction

europarl-dissect is an experimental toolset to play around with distributional semantics.

Its basic idea is to provide the best translation candidates for the words of a sentence (for any language direction we have training material for). Candidates are chosen not only by the global relatedness of words in a training corpus, but also by the concrete words of the input sentence provided.

In principle the system can also be used to check the relatedness of words within a single language. In that case it yields not translations but something closer to synsets or semantic nets, still considering the local context of the sentence provided.

The system consists of two programs. One prepares and preprocesses the input data; the other calculates the similarity of each input word to other words in the local and global semantic spaces established by training and displays the best candidates available:

create_input_data.py processes (parallelized) training data available
besttranslations.py shows candidates based on training data and input
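
To make the idea of ranking candidates by similarity concrete, here is a minimal sketch in plain Python. It is not the project's code and does not use the DISSECT API. Cosine similarity is assumed here as a common choice for comparing word vectors; the vectors, the candidate words and the function names are made up for illustration, and the sketch ignores how local and global spaces are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(word_vec, candidates):
    """Return (candidate, similarity) pairs, most similar first."""
    scored = [(name, cosine(word_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors over three invented context dimensions (illustration only).
source_word = [2.0, 0.0, 1.0]
candidates_en = {
    "bank": [2.0, 0.0, 1.0],
    "bench": [0.0, 3.0, 0.0],
    "shore": [0.0, 1.0, 1.0],
}

ranking = rank_candidates(source_word, candidates_en)
print(ranking)  # "bank" first, then "shore", then "bench"
```

In the real system the vectors would come from the semantic spaces built during training rather than being written by hand.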

Dependencies

europarl-dissect runs under Python versions >= 2.6.6.

It was tested with the following additional Python libraries, which it relies on externally:

DISSECT 0.1.0 is used for creating the semantic spaces and calculating all similarities involved.
NLTK 2.0.4
SciPy 0.12.0

The following program is included in the lib/ directory to ensure better compatibility for the time being (as it does not appear to be packaged by operating systems or available through e.g. pip2):

TreeTaggerWrapper provides efficient access to the TreeTagger program for POS tagging and lemmatization of the input material in various languages.

Installation

Resolving the dependencies

The three libraries listed above can be installed either from the operating system's package manager (as is the case for SciPy) or with pip2.

On a Debian system the following commands (issued as root) usually satisfy these dependencies:

apt-get install python-scipy
pip2 install nltk
pip2 install dissect

TreeTagger configuration

After installing the TreeTagger package (cf. information on the project's home given above) it is necessary to specify where its base directory can be found.

This can be set permanently in the parameters.py file by setting the TREETAGGER_BASE_PATH variable accordingly. The path can be given either relative to the project's directory or as an absolute path.
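
As a rough illustration, the setting might look like the sketch below. Only the variable name TREETAGGER_BASE_PATH comes from this README. The path value and the helper resolve_base_path are hypothetical, and the project may resolve relative paths differently.

```python
import os

# Hypothetical value; point this at your own TreeTagger installation.
TREETAGGER_BASE_PATH = "lib/treetagger"


def resolve_base_path(path, project_dir):
    """Return path unchanged if absolute, else join it onto project_dir."""
    if os.path.isabs(path):
        return path
    return os.path.normpath(os.path.join(project_dir, path))
```

Calling resolve_base_path(TREETAGGER_BASE_PATH, project_dir) with the directory of your checkout shows which directory a relative setting would point to.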

Installation of europarl training data

There is a README.rst file in data/in/ giving more information on where to find the europarl files we used to test the system.

Please note that in principle other data can be used as well, as long as it adheres to the format of the europarl corpus: each language in a plain text file (UTF-8) with one sentence per line.
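
If you bring your own data, a quick sanity check of the format can help. The sketch below is not part of europarl-dissect, and the function names are made up. It assumes that the two language files are meant to be aligned line by line, as in the europarl parallel corpus, and only compares line counts.

```python
import io


def read_sentences(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with io.open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def alignment_problems(sentences_a, sentences_b):
    """Return a list of problem descriptions; empty if the line counts match."""
    problems = []
    if len(sentences_a) != len(sentences_b):
        problems.append(
            "line counts differ: %d vs %d" % (len(sentences_a), len(sentences_b))
        )
    return problems
```

A typical call would pass the result of read_sentences for each of the two language files to alignment_problems. A file that is not valid UTF-8 makes read_sentences raise a UnicodeDecodeError, which is itself a useful signal.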

Preparing a (small) set of testing data

Example input material to test europarl-dissect for its disambiguation capabilities and example usage can be found in the files data/eval-de.txt and data/eval-en.txt for some German and English sentences, respectively.

More sentences in the same text domain can be found in data/europarl_2014_de.txt and data/europarl_2014_en.txt. It is also possible to use input material in the TreeTagger output format; this is exemplified by europarl/europarl_2014_de_tagged.txt and europarl_2014_en_tagged.txt.
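
For orientation, TreeTagger normally prints one token per line as token, tag and lemma separated by tabs. The sketch below parses such lines. It assumes the tagged example files follow that default three-column layout, which this README does not spell out. The sample lines and the function name are made up.

```python
def parse_tagged_line(line):
    """Split a 'token<TAB>tag<TAB>lemma' line; return None for any other shape."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None
    token, tag, lemma = fields
    return token, tag, lemma


# Invented sample in the assumed three-column layout.
sample = "Das\tPDS\tdie\nist\tVAFIN\tsein\ngut\tADJD\tgut\n"

parsed = [parse_tagged_line(line) for line in sample.splitlines()]
lemmas = [entry[2] for entry in parsed if entry is not None]
```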

The most important thing to note is not to use sentences that are already part of the training material, as this would clearly bias the results.
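
One simple way to guard against such overlap is to drop every test sentence that also occurs verbatim in the training data. The helper below is a sketch under that assumption and is not part of the project. It compares sentences only after stripping surrounding whitespace, so near-duplicates are not caught.

```python
def drop_seen(test_sentences, training_sentences):
    """Keep only test sentences that do not occur verbatim in the training data."""
    seen = set(sentence.strip() for sentence in training_sentences)
    return [s for s in test_sentences if s.strip() not in seen]
```

For large training files it may be preferable to build the set while streaming the file line by line instead of loading a list first.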

Clone the source code

To download and test our code, simply issue the following command:

git clone https://github.com/2mh/europarl-dissect.git

This will create the folder europarl-dissect, from which you can work.

Usage

Basic usage

Once you have downloaded the code and put some training data in place (assuming the files data/in/europarl-v7.de-en.de and data/in/europarl-v7.de-en.en exist), you can immediately test the system with the following commands:

# Create training material with 1,000 sentences.
./create_input_data.py -l 1000
# Test with test input material provided.
./show_candidates.sh de en # From DE to EN
./show_candidates.sh en de # From EN to DE

To test the system with lemmatized training and input material, do the following:

# As above, but with lemmatization; -t meaning to use TreeTagger.
./create_input_data.py -l 1000 -t
# The -l option makes sure the system knows it deals with lemmatization.
./show_candidates.sh de en -l # From DE to EN, lemmatized
./show_candidates.sh en de -l # From EN to DE, lemmatized

Advanced usage

To follow.
