TemporalReferencing

An easy and robust model for Lexical Semantic Change Detection

Data and code for the experiments in:

Haim Dubossarsky, Simon Hengchen, Nina Tahmasebi, and Dominik Schlechtweg. 2019. Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy.

If you use this software for academic research, please cite the paper above and give appropriate credit to the software mentioned below, on which this repository strongly depends.

The code heavily relies on DISSECT (modules/composes). For aligning embeddings (SGNS) we used VecMap (alignment/map_embeddings). For aligning the PPMI matrices and for measuring cosine distance we relied on code from LSCDetection. We used hyperwords for training SGNS and PPMI on the extracted word-context pairs.

Usage Note

The scripts should be run directly from the main directory. If you wish to run them from elsewhere, you may have to adjust the path passed to sys.path.append('./modules/') in the scripts. All scripts can be run directly from the command line, e.g.:

python corpus_processing/extract-pairs.py <windowSize> <corpDir> <outDir> <lowerBound> <upperBound> <testset> <freqset> <minfreq>
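
For example, a hypothetical invocation on the test corpus might look as follows (all argument values here are illustrative guesses, not taken from the repository; check the script's documentation for the exact semantics of each parameter):

python corpus_processing/extract-pairs.py 5 corpora/test/ output/pairs/ 1920 1929 testsets/test/testset.csv testsets/test/freqset.csv 10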

We recommend running the scripts with Python 2.7.15; only VecMap requires Python 3. You will have to install some additional packages, such as docopt and gensim, among others. Packages that aren't available from the Anaconda installer can be installed via EasyInstall, or by running pip install -r requirements.txt.
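
Relatedly, if you do run a script from outside the main directory, the only change usually needed is the relative module path mentioned above. A minimal sketch, assuming (hypothetically) that you run from one directory below the main one:

import sys
# Point to modules/ relative to the new working directory (illustrative path).
sys.path.append('../modules/')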

Pipeline

In scripts/run_test.sh you will find an example of a full pipeline for the models on a small test corpus. Assuming you are working on a UNIX-based system, first make the script executable with

chmod 755 scripts/run_test.sh

Then run it with

bash -e scripts/run_test.sh

Make sure matrices/ is empty each time you run it. You can take a look at the code to understand how it makes use of the different scripts in the repository. It first reads the gzipped test corpus in corpora/test/testcorpus.gz, which contains many duplicate sentences for the time bins 1920, 1930 and 1940, with each line in the following format:

year [tab] word1 word2 word3...
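
As an illustration, here is a minimal sketch (not part of the repository) for reading this format and grouping the tokenized sentences by time bin:

import gzip
from collections import defaultdict

# Collect the tokenized sentences of each time bin.
sentences_by_year = defaultdict(list)
with gzip.open('corpora/test/testcorpus.gz', 'rb') as f:
    for line in f:
        # Each line is 'year<TAB>word1 word2 word3...'.
        year, sentence = line.decode('utf-8').rstrip('\n').split('\t', 1)
        sentences_by_year[year].append(sentence.split())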

It then uses corpus_processing/extract-pairs.py to extract

  1. regular word-context pairs for each time bin (for alignment) and
  2. temporally referenced word-context pairs for the specified target words (target_year); a minimal sketch of this rewriting follows below.
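
The idea behind temporal referencing is that each occurrence of a target word is rewritten as a time-stamped token, so each target receives one vector per time bin while all context words keep a single shared representation. A simplified sketch of the rewriting (the actual extract-pairs.py additionally handles the context window and frequency thresholds, see its windowSize and minfreq arguments):

def temporal_reference(tokens, year, targets):
    # Rewrite each target occurrence as its time-referenced token,
    # e.g. 'bank' in a 1920 sentence becomes 'bank_1920'.
    return [t + '_' + str(year) if t in targets else t for t in tokens]

# temporal_reference(['the', 'bank', 'closed'], 1920, {'bank'})
# -> ['the', 'bank_1920', 'closed']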

Next, it creates basic matrices for PPMI and SGNS using the scripts under hyperwords/ and aligns the time-binned matrices with the scripts under alignment/. Finally, it extracts cosine distances (displacement) for each pair of adjacent time bins and nearest neighbors for each time bin to results/ using the scripts under measures/.

The script measures/displacement.py needs a different input for each model: for the alignment models it takes the file testsets/test/testset-pairs.csv in the following format:

target1 [tab] target1
target2 [tab] target2

For Temporal Referencing it takes the files testsets/test/testset-1920-1930-pairs.csv and testsets/test/testset-1930-1940-pairs.csv as input with the following respective formats:

target1_year1 [tab] target1_year2
target2_year1 [tab] target2_year2

and

target1_year2 [tab] target1_year3
target2_year2 [tab] target2_year3
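
For intuition, the displacement measure reduces to the cosine distance between a target's vectors in adjacent time bins; for Temporal Referencing these are the vectors of the two time-referenced tokens in one pair. A minimal NumPy sketch (the repository itself relies on the DISSECT/LSCDetection code for this step):

import numpy as np

def cosine_distance(v1, v2):
    # Displacement: 1 minus the cosine similarity of the two vectors.
    return 1.0 - np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# E.g., with hypothetical vectors for one temporally referenced target:
# displacement = cosine_distance(vectors['target1_1920'], vectors['target1_1930'])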

Data

Under data/ we provide a spreadsheet with experimental results on the Word Sense Change test set, as well as lists of the nearest neighbors found for the test words by the different models when trained on COHA (1920-1970).

BibTeX

@inproceedings{Dubossarskyetal19,
    title = "Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change",
    author = "Dubossarsky, Haim and Hengchen, Simon and Tahmasebi, Nina and Schlechtweg, Dominik",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics"
}
