Learning to automatically select pull quotes (wikipedia).
This code accompanies the accepted COLING-2020 paper Catching Attention with Automatic Pull Quote Selection.
This project is written in Python 3.6.9.
The following non-default libraries are used:
- numpy 1.18.2
- sklearn 0.22.2.post1
- seaborn 0.9.0
- matplotlib 3.1.2
- scipy 1.4.1
- keras 2.3.0
- tensorflow 1.14.0
- sumy 0.8.1
- nltk 3.4.5
- textstat 0.6.0
- textblob 0.15.3
- sentence_transformers 0.2.5
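Before running anything, it can help to check the installed packages against the versions listed above. The following is a minimal sketch and not part of the repository: the helper names (`installed_version`, `compare_versions`) are hypothetical, and the mapping from the names above to pip distribution names (for example `sklearn` to `scikit-learn`) is an assumption.

```python
# Pinned versions from the list above, keyed by the assumed pip distribution name.
PINNED = {
    "numpy": "1.18.2",
    "scikit-learn": "0.22.2.post1",   # listed above as sklearn
    "seaborn": "0.9.0",
    "matplotlib": "3.1.2",
    "scipy": "1.4.1",
    "keras": "2.3.0",
    "tensorflow": "1.14.0",
    "sumy": "0.8.1",
    "nltk": "3.4.5",
    "textstat": "0.6.0",
    "textblob": "0.15.3",
    "sentence-transformers": "0.2.5",  # listed above as sentence_transformers
}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        # Standard library on Python 3.8+.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for Python 3.6/3.7, where setuptools provides pkg_resources.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def compare_versions(pinned, lookup=installed_version):
    """Map each package to (wanted, found, status); status is ok, mismatch, or missing."""
    report = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found is None:
            status = "missing"
        elif found == wanted:
            status = "ok"
        else:
            status = "mismatch"
        report[name] = (wanted, found, status)
    return report


if __name__ == "__main__":
    for name, (wanted, found, status) in compare_versions(PINNED).items():
        print("{:<24} wanted {:<14} found {:<14} {}".format(name, wanted, str(found), status))
```

A mismatch does not necessarily mean the code will fail; the check only shows where an environment differs from the one listed here.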
To reproduce our dataset:
- navigate to the datasets/url_lists/ directory and unzip url_lists.zip so that the 4 files are in datasets/url_lists/
- navigate to datasets/ and run python3.6 construct_dataset.py source my_save_dir/
  - source can be one of intercept, ottawa-citizen, cosmo, national-post, or all
  - the samples for a given source will be stored in my_save_dir/source/
  - ⚠️ Update settings.py so that base_pq_directory points to my_save_dir/
  - ⚠️ This will take a long time.
- navigate to the root repo folder and run python3.6 calculate_data_stats.py to calculate dataset statistics to compare with our paper.
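Passing all as the source builds every dataset in one go. To build only some sources, the per-source commands described above can be generated with a small driver. This is a sketch under assumptions, not part of the repository: `build_commands` and `run_all` are hypothetical helpers, and the commands are assumed to be run from the datasets/ directory.

```python
import subprocess

# Individual source names accepted by construct_dataset.py, as listed above.
SOURCES = ("intercept", "ottawa-citizen", "cosmo", "national-post")


def build_commands(sources, save_dir):
    """Return one construct_dataset.py command (as an argument list) per source."""
    unknown = [s for s in sources if s not in SOURCES]
    if unknown:
        raise ValueError("unknown source(s): {}".format(", ".join(unknown)))
    return [["python3.6", "construct_dataset.py", s, save_dir] for s in sources]


def run_all(commands, cwd="datasets", dry_run=True):
    """Print each command; execute it in cwd only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check_call raises CalledProcessError on a non-zero exit code,
            # so a failed source stops the loop instead of being skipped.
            subprocess.check_call(cmd, cwd=cwd)


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_all(build_commands(["cosmo", "intercept"], "my_save_dir/"))
```

Since dataset construction takes a long time, the dry-run default makes it possible to inspect the commands before starting them with `dry_run=False`.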
To reproduce our experimental results, run bash run_experiments.sh (output will be stored in /results).
ℹ️ To first make sure that things work, run bash run_experiments.sh --quick. It should take just a few minutes.
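The quick check and the full run can be chained so that the long run only starts once the quick one succeeds. The sketch below is hypothetical and not part of the repository; it assumes it is called from the root repo folder and that run_experiments.sh exits with a non-zero code on failure, which the README does not state.

```python
import subprocess

QUICK_CMD = ["bash", "run_experiments.sh", "--quick"]
FULL_CMD = ["bash", "run_experiments.sh"]


def smoke_then_full(run=subprocess.call):
    """Run the quick check, then the full experiments only if the quick check exits with 0.

    Returns a list of (command, exit_code) pairs for the commands that were run.
    """
    history = []
    quick_code = run(QUICK_CMD)
    history.append((QUICK_CMD, quick_code))
    if quick_code == 0:
        # Assumption: exit code 0 from the --quick run means the setup works.
        history.append((FULL_CMD, run(FULL_CMD)))
    return history
```

Calling `smoke_then_full()` runs both steps; the `run` parameter exists only so the control flow can be exercised without launching the scripts.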
To reproduce the handcrafted feature value distribution figures, run python3.6 view_feature_dists.py.
To analyze test articles with all models, run bash generate_model_samples.sh. The --quick argument can similarly be used to make sure things are working.