This repository contains code to create input for an annotation task. In this task, property-concept pairs should be annotated with semantic relations. The property-concept pairs were collected from various resources (https://github.com/cltl/semantic_property_dataset).
The motivation of this data set is described in this paper:
@inproceedings{sommerauer-2020-why, title={Why is penguin more similar to polar bear than to sea gull? Analyzing conceptual knowledge in distributional models}, author={Sommerauer, Pia}, booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop}, year={in press}}
The collection of property-concept pairs to be annotated is outlined in this paper:
@inproceedings{sommerauer-etal-2019-towards, Address = {Wroclaw, Poland}, Author = {Sommerauer, Pia and Fokkens, Antske and Vossen, Piek}, Booktitle = {Proceedings of the 10th Global Wordnet Conference}, Pages = {85--95}, Title = {Towards Interpretable, Data-derived Distributional Semantic Representations for Reasoning: A Dataset of Properties and Concepts}, Url = {https://clarin-pl.eu/dspace/handle/11321/718}, Year = {2019}, Bdsk-Url-1 = {https://clarin-pl.eu/dspace/handle/11321/718 }}
Please use these references if you are using this repository in your research.
The annotation task is set up as follows: Participants are presented with short descriptions of a relation between a property and a concept and asked to indicate whether they agree or disagree with the description.
Example of a single instance in the annotation task
We run the task using the Lingoturk framework (https://github.com/FlorianPusse/Lingoturk) and distribute it via the platform Prolific (https://www.prolific.co/).
You can try out the task using a small demo set.
(1) Create all questions of a run of your experiment. Each question will receive a unique identifier. Do not change them anymore.
Each question consists of:
- property
- concept
- relation
- sentence describing the concept-property-relation
- a positive example of such a relation
- a negative example of such a relation
- pos and neg example property
- pos and neg example concept
- source of the property-concept pair
Exclude concepts and update dataset:
- Follow instructions in this notebook: scripts/Filter pairs
- Replace concepts with concepts from the full candidate set by running:
cd scripts/
python replace_excluded_conepts.py [property] [collection]
- Then update the dataset by running:
python update_dataset.py
The resampled set is stored in data/resampled/
Important note: From now on, only data in data/resampled/
will be used for question generation. (update November 27, 2020)
To create the questions, run:
cd scripts/
python create_questions.py [run number]
Replace [run number]
with the experiment run. Currently, I created run3. The run number determines which descriptions of relations will be used (stored in templates/
). The examples are taken from examples/
.
Update November 27, 2020: As of now, questions will only be created based on updated data (stored in data/resampled)
(2) Create batches of questions which have not been annotated yet as you go.
cd scripts/
python create_batch.py [run] [prolific completion-url] [n_participants]
Replace [prolific completion-url]
with the completion url you received from Prolific when setting up the task on the platform. Replace [n_participants]
with either the number of participants (used to calculate a suggested reward) or'test' (for testing).
(3) Launch the task:
- Upload your input to your Lingoturk platform
- Link the task to your Prolific Task
- Publish
The link to the annotated dataset will be made available here.
To modify examples, adapt the existing examples in examples/[relation]-pairs.csv
.
When you're done, run:
cd scripts
python add_properties_to_info.py
Then add the information (property-category, etc) manually in the file data/property_info.csv