This repository contains reimplementations of neural headline generation systems built as part of my Advanced Computer Science MPhil dissertation at the University of Cambridge. The baseline system is a PyTorch port of the Lua implementation of the Neural Attention Model for Abstractive Summarization (Rush et al., 2015). The extension system is an implementation of a coarse-to-fine hierarchical attention system based on Tan et al. (2017).
Existing efforts in the task of abstractive headline generation have limited their evaluation to the domain of newspaper articles. In this project I explore the effectiveness of neural approaches to abstractive headline generation on general text. I reimplement two neural abstractive text summarisers using the PyTorch library. As a baseline, a feed-forward attention-based neural network (Rush et al., 2015) is reimplemented. I implement an extension which features a coarse-to-fine approach, where extractive summarisers are first employed to find important sentences; these are used to train a recurrent neural network to predict the summary (Tan et al., 2017). Additionally, I utilise the OpenNMT framework (Klein et al., 2017) to measure the effect of using recurrent neural networks without the coarse-to-fine approach. Along with the Gigaword dataset of newspaper articles, the systems are evaluated using short stories from English Language exams. The style of this dataset is less journalistic, which highlights how well these systems perform on general text. Quantitative evaluation measures the lexical and semantic similarity between the predicted and actual titles, and the outputs are also evaluated qualitatively by measuring their grammaticality. The results show that the OpenNMT (Klein et al., 2017) model trained to produce summaries from lead sentences produces the most accurate headlines. Evaluating the systems on a dataset of general text shows that they do not reach the same accuracies as on the Gigaword dataset, suggesting they do not generalise effectively across domains.
```
git clone https://github.com/martinhartt/HGfGT
cd HGfGT
```

Install Pipenv and run the following commands after cloning the repository:

```
pipenv install
pipenv shell
```
- Include the raw Gigaword dataset (with uncompressed files) under the path `data/agiga`.
- Create train/test/validation splits with:

  ```
  bash bin/create_splits.sh data/agiga
  ```

- Preprocess the files using the following command (use the `--filter` flag for the baseline and `--all` for the extension):

  ```
  bash bin/dataprep_agiga.sh --extract --splits --filter --all --test
  ```

- If using the `--all` flag for the extension system, run `bash bin/dataprep_extsum.sh` in parallel with multiple workers, with the `$SLURM_ARRAY_TASK_ID` environment variable set to the id of the worker (1-32). If multiple workers are not available, it can be run sequentially with the following command (this can take a few days for the Gigaword dataset):

  ```
  for i in `seq 1 32`; do export SLURM_ARRAY_TASK_ID=$i; bash bin/dataprep_extsum.sh; done
  ```

- If using the `--all` flag for the extension system, combine the sharded files with the `bash bin/dataprep_combine.sh` command.
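If SLURM is unavailable but multiple cores are, the shards can also be run locally with a bounded worker pool. A minimal Python sketch (the worker count of 4 is an assumption, and the `true` command is a placeholder — substitute `["bash", "bin/dataprep_extsum.sh"]` inside the repository):

```python
# Sketch: run the 32 dataprep shards with a bounded local worker pool.
# Each worker gets its shard id via the SLURM_ARRAY_TASK_ID env variable,
# matching what bin/dataprep_extsum.sh expects.
from concurrent.futures import ThreadPoolExecutor
import os
import subprocess


def run_shard(task_id: int) -> int:
    env = dict(os.environ, SLURM_ARRAY_TASK_ID=str(task_id))
    # Placeholder command; replace with ["bash", "bin/dataprep_extsum.sh"].
    return subprocess.run(["true"], env=env).returncode


with ThreadPoolExecutor(max_workers=4) as pool:
    codes = list(pool.map(run_shard, range(1, 33)))

print(all(code == 0 for code in codes))
```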
To train a model, run the following command:

```
bin/train.sh model_name
```

- `--hier`: Include to train the extension system, and leave out for the baseline.
- `--restore`: Include to continue training a saved model.
- `--glove`: Include to use pretrained GloVe embeddings.
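For context, pretrained GloVe vectors are distributed as plain-text lines of the form `word v1 v2 ...`. A hypothetical sketch of how such vectors can be read into an embedding matrix (this is illustrative only, not the repository's actual loader; the vocabulary and vectors below are invented):

```python
def load_glove(lines, vocab, dim):
    """Build an embedding matrix from GloVe-format lines.

    Words without a pretrained vector (e.g. <unk>) keep a zero row.
    """
    emb = [[0.0] * dim for _ in vocab]
    for line in lines:
        word, *vals = line.split()
        if word in vocab:
            emb[vocab[word]] = [float(v) for v in vals]
    return emb


# Toy example: a 3-word vocabulary and two pretrained 3-d vectors.
vocab = {"<unk>": 0, "the": 1, "stock": 2}
emb = load_glove(["the 0.1 0.2 0.3", "stock 0.4 0.5 0.6"], vocab, 3)
print(emb[2])
```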
To generate headlines from an input file:

```
bin/generate.sh model_name --agiga > output_file
```

- `--agiga`: Uses the Gigaword dataset.
- `--hier`: Include if generating from an extension system.
- `--no-repeat`: If included, the model doesn't repeat words in headlines.
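A no-repeat constraint of this kind is typically applied during decoding by masking out words that have already been emitted. A hypothetical sketch in greedy decoding (the function and the scores below are invented for illustration, not taken from the repository):

```python
def greedy_no_repeat(step_scores):
    """Greedy decoding with a no-repeat constraint.

    step_scores: one dict per decoding step, mapping candidate word -> score.
    Already-emitted words are masked out before taking the argmax; if every
    candidate has been used, fall back to the unmasked scores.
    """
    emitted = []
    for scores in step_scores:
        allowed = {w: s for w, s in scores.items() if w not in emitted}
        if not allowed:
            allowed = scores
        emitted.append(max(allowed, key=allowed.get))
    return emitted


headline = greedy_no_repeat([
    {"stocks": 0.9, "fall": 0.1},
    {"stocks": 0.8, "fall": 0.6},   # "stocks" is blocked, so "fall" wins
    {"fall": 0.7, "sharply": 0.5},  # "fall" is blocked, so "sharply" wins
])
print(headline)
```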
To evaluate the generated headlines with various metrics, use the following commands:

```
# Download SpaCy model
pipenv install --ignore-pipfile https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz

python summary/evaluate.py output_file
```

- `--csv`: Include to get results in CSV format.
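The lexical-similarity metrics compare predicted and reference titles at the word level. A self-contained sketch of one such measure, unigram-overlap F1 (illustrative only — not necessarily the exact metric implemented in `summary/evaluate.py`):

```python
def unigram_f1(predicted: str, reference: str) -> float:
    """F1 over unigram overlap between a predicted and a reference title."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Clipped counts: each reference occurrence can be matched at most once.
    overlap = sum(min(pred.count(w), ref.count(w)) for w in set(pred))
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("us stocks fall on rate fears",
                   "us stocks drop on rate fears")
print(round(score, 2))
```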