This repository accompanies our EMNLP 2020 paper, which you can cite like this:

    @inproceedings{rodriguez2020curiosity,
        title = {Information Seeking in the Spirit of Learning: A Dataset for Conversational Curiosity},
        author = {Pedro Rodriguez and Paul Crook and Seungwhan Moon and Zhiguang Wang},
        year = {2020},
        booktitle = {Empirical Methods in Natural Language Processing}
    }
To explore the dataset visit: datasets.pedro.ai/dataset/curiosity
For a summary of our work visit: pedro.ai/curiosity
The project has three components:

- The Curiosity dataset in `dialog_data/curiosity_dialogs.json`, with folded versions in `dialog_data/curiosity_dialogs.*.json`
- Modeling code used in our experiments
- Analysis, plotting, and LaTeX code that generates the publication's PDF file

The data files are:

- `curiosity_dialogs.{train,val,test,test_zero}.json`: dialogs corresponding to each data fold
- `wiki_sql.sqlite.db`: a sqlite database storing our processed version of the Wikipedia subset that we use
- `fact_db_links.json`: a JSON file containing an entity-linked version of our Wikipedia data; it stores the location of each entity link, which the database does not contain
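As a quick sanity check, the dialog JSON and the sqlite database can both be opened with the Python standard library. The snippet below is a sketch: it builds a tiny stand-in file and an in-memory database so it runs anywhere; the `dialogs` key and the `facts` table are illustrative assumptions, not the real schema (list the tables first rather than guessing).

```python
import json
import os
import sqlite3
import tempfile

# Stand-in for dialog_data/curiosity_dialogs.json (the "dialogs" key
# is an assumption for illustration; inspect the real file's keys).
dialogs = {"dialogs": [{"messages": ["Hi!", "Tell me about Tokyo."]}]}
path = os.path.join(tempfile.mkdtemp(), "curiosity_dialogs.json")
with open(path, "w") as f:
    json.dump(dialogs, f)

with open(path) as f:
    data = json.load(f)
num_dialogs = len(data["dialogs"])

# wiki_sql.sqlite.db can be explored the same way; querying
# sqlite_master shows the schema without any prior knowledge.
conn = sqlite3.connect(":memory:")  # use the .db path for the real file
conn.execute("CREATE TABLE facts (id INTEGER, text TEXT)")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall()]
conn.close()
```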
There are two ways to download our data.

First, you can clone our repository and use git lfs:

- Install git lfs
- Clone the repository: `git clone https://github.com/facebookresearch/curiosity.git`
- Run `git lfs pull`

Second, you can download the files from these URLs with a tool like `wget`:
- https://obj.umiacs.umd.edu/curiosity/curiosity_dialogs.json
- https://obj.umiacs.umd.edu/curiosity/curiosity_dialogs.train.json
- https://obj.umiacs.umd.edu/curiosity/curiosity_dialogs.val.json
- https://obj.umiacs.umd.edu/curiosity/curiosity_dialogs.test.json
- https://obj.umiacs.umd.edu/curiosity/curiosity_dialogs.test_zero.json
- https://obj.umiacs.umd.edu/curiosity/fact_db_links.json
- https://obj.umiacs.umd.edu/curiosity/wiki2vec_entity_100d.txt
- https://obj.umiacs.umd.edu/curiosity/wiki_sql.sqlite.db
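If `wget` is not available, the same files can be fetched programmatically. A minimal sketch that assembles the URLs above (the actual download call is left as a comment so the snippet runs offline):

```python
# Build the download URL for every file listed above.
BASE = "https://obj.umiacs.umd.edu/curiosity"
FILES = [
    "curiosity_dialogs.json",
    "curiosity_dialogs.train.json",
    "curiosity_dialogs.val.json",
    "curiosity_dialogs.test.json",
    "curiosity_dialogs.test_zero.json",
    "fact_db_links.json",
    "wiki2vec_entity_100d.txt",
    "wiki_sql.sqlite.db",
]
urls = [f"{BASE}/{name}" for name in FILES]
# Each url can then be fetched with, e.g.:
#   urllib.request.urlretrieve(url, filename)
```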
We provide the inputs to our modeling experiments; to reproduce these inputs yourself, follow these instructions. The file `wiki2vec_entity_100d.txt` is the output of the following steps:

- Download the embeddings: `wget http://wikipedia2vec.s3.amazonaws.com/models/en/2018-04-20/enwiki_20180420_100d.txt.bz2`
- Decompress: `bzip2 -d enwiki_20180420_100d.txt.bz2`
- Filter out non-entities: `./cli filter-emb enwiki_20180420_100d.txt wiki2vec_entity_100d.txt`
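The filtering step keeps only entity vectors from the word2vec-style text file. A rough sketch of the idea (wikipedia2vec text dumps mark entities with an `ENTITY/` prefix; the exact behavior of `./cli filter-emb` may differ, so treat this as illustrative):

```python
import io

def filter_entity_embeddings(src, dst, prefix="ENTITY/"):
    """Copy only entity vectors from a word2vec-format text file.

    Sketch of the filtering idea; the ENTITY/ prefix is how
    wikipedia2vec text dumps mark entities, but verify against
    your file before relying on it.
    """
    header = src.readline().split()       # "<count> <dim>"
    kept = [line for line in src if line.startswith(prefix)]
    dim = header[1] if len(header) == 2 else ""
    dst.write(f"{len(kept)} {dim}\n")     # rewrite header with new count
    dst.writelines(kept)
    return len(kept)

# Tiny in-memory example standing in for enwiki_20180420_100d.txt.
src = io.StringIO("3 2\nword 0.1 0.2\nENTITY/Tokyo 0.3 0.4\nENTITY/Japan 0.5 0.6\n")
dst = io.StringIO()
n = filter_entity_embeddings(src, dst)
```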
Our model code is written using `pytorch` and `allennlp`.
Before reproducing our experiments, you'll need to install some software.

Install a recent version of Anaconda Python from https://www.anaconda.com/distribution/. The canonical way to reproduce our experiments is with the poetry configuration. We also provide anaconda environment definitions, but the exact versions of all dependencies are not pinned there, so results may differ.

- Install poetry
- Run `conda create -n curiosity python=3.7`
- Run `conda activate curiosity` (fish shell)
- Run `poetry install`
- Before running any model commands, activate the environment with `poetry shell`
For CPU:

- Create an anaconda environment: `conda env create -f environment.yaml` (creates an environment named curiosity)
- Activate the environment; in fish shell this is `conda activate curiosity`

For GPU:

- Create an anaconda environment: `conda env create -f environment_gpu.yaml` (creates an environment named curiosity)
- Activate the environment; in fish shell this is `conda activate curiosity`
If you prefer using Docker for dependencies, we include a `Dockerfile` that builds all the required dependencies. Note that to enable GPU support, you may need to use nvidia-docker and modify this file to install CUDA dependencies.
Models are run using a combination of the `allennlp train` command, the `allennlp evaluate` command, and the `./cli` command (in this repository).

In our paper, we vary models according to two axes:

- Our `charm` model corresponds to `glove_bilstm`; `glove_distributed` is the context-free version of `charm`
- The `bert` baseline corresponds to `e2e_bert`
- Names like `glove_bilstm-feature` mean train `glove_bilstm` while ablating (`-`, minus) `feature`
`allennlp` defines model configuration with `jsonnet` or `json` files. In our work, we used these configuration files in `configs/generated/`:

- `glove_bilstm.json`
- `glove_distributed.json`
- `e2e_bert.json`

These configurations were generated from the parent configuration `configs/model.jsonnet`.
To re-generate these, you can run this command:
$ ./cli gen-configs experiments/
This generates configurations in `configs/generated/` and a run file `run_allennlp.sh` that lists the correct command to run each model variant. Generally, the commands look like this:
$ allennlp train --include-package curiosity -s models/glove_bilstm -f configs/generated/glove_bilstm.json
$ allennlp evaluate --include-package curiosity --output-file experiments/glove_bilstm_val_metrics.json models/glove_bilstm dialog_data/curiosity_dialogs.val.json
$ allennlp evaluate --include-package curiosity --output-file experiments/glove_bilstm_test_metrics.json models/glove_bilstm dialog_data/curiosity_dialogs.test.json
$ allennlp evaluate --include-package curiosity --output-file experiments/glove_bilstm_zero_metrics.json models/glove_bilstm dialog_data/curiosity_dialogs.test_zero.json
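`allennlp evaluate` writes its `--output-file` as a flat JSON dictionary mapping metric names to values, which makes it easy to collect results across folds. A sketch of reading one such file (the metric names and values below are placeholders, not the exact keys our models report):

```python
import json
import os
import tempfile

# Stand-in for a metrics file written by allennlp evaluate; the key
# names here are placeholders for illustration only.
metrics_path = os.path.join(tempfile.mkdtemp(), "glove_bilstm_val_metrics.json")
with open(metrics_path, "w") as f:
    json.dump({"like_accuracy": 0.80, "policy_accuracy": 0.55}, f)

# Read it back and pick out the highest-scoring metric.
with open(metrics_path) as f:
    metrics = json.load(f)
best = max(metrics, key=metrics.get)
```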
By default, the configurations don't specify the cuda device, so this must be passed in as an override:

- For `allennlp train`: `-o '{"trainer": {"cuda_device": 0}}'`
- For `allennlp evaluate`: `--cuda-device 0`
The configuration generator also names files so that if you copy them as shown below, the results will automagically update the next time you run `make 2020_acl_curiosity.paper.pdf` in the paper repository.
By default, the scripts in `run_allennlp.sh` put models in `models/` and experimental results (metrics etc.) in `experiments/`. Our code is designed so that copying the contents of `experiments/` into the corresponding directory in the paper code "imports" the results into the paper:
# Local copy
cp experiments/* ~/code/curiosity-paper/2020_emnlp_curiosity/data/experiments/
# Remote copy
scp experiments/* hostname:~/code/curiosity-paper/2020_emnlp_curiosity/data/experiments/
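The same copy can be done portably from Python, which sidesteps shell globbing quirks. A sketch using temporary directories as stand-ins for the real `experiments/` and paper-repo paths:

```python
import shutil
import tempfile
from pathlib import Path

# Stand-ins for experiments/ and the paper repo's data/experiments/.
src_dir = Path(tempfile.mkdtemp()) / "experiments"
dst_dir = Path(tempfile.mkdtemp()) / "data" / "experiments"
src_dir.mkdir()
dst_dir.mkdir(parents=True)
(src_dir / "glove_bilstm_val_metrics.json").write_text("{}")

# Equivalent of `cp experiments/* <dst>` without relying on the shell.
for path in src_dir.glob("*"):
    shutil.copy(path, dst_dir / path.name)
copied = sorted(p.name for p in dst_dir.iterdir())
```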
Run `pytest` to run unit tests for the loss, metrics, and reader.
The code for the paper can be found here: github.com/entilzha/publications
- Is the data collection interface open source? No; unfortunately, it is tied to internal systems, so it is difficult to open source. The interfaces were written in a combination of ReactJS and Python/Flask.
- Who should I contact with questions? Please email Pedro Rodriguez at me@pedro.ai
Curiosity is released under CC-BY-NC-4.0, see LICENSE for details.