Skip to content

martinhartt/HGfGT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Headline Generation for General Text

This repository contains the reimplementations of neural headline generation systems built as part of my Advanced Computer Science MPhil dissertation at the University of Cambridge. The baseline system is a PyTorch port of the Lua implementation of the Neural Attention Model for Abstractive Summarization (Rush et al. 2015). The extension system is an implementation of the coarse-to-fine heirarchal attention system based on (Tan et al. 2017).

Project abstract

Existing efforts in the task of abstractive headline generation have limited their evaluation to the domain of newspaper articles. In this project I explore the effectiveness of neural approaches to abstractive headline generation on general text. I reimplement two neural abstractive text summarisers using the Pytorch library. As a baseline, a feed-forward attention-based neural network (Rush et al., 2015) is reimplemented. I implement an extension which features a coarse-to-fine approach, where extractive summarisers are first employed to find important sentences. These are used to train a recurrent neural network to predict the summary (Tan et al., 2017). Additionally, I utilise the OpenNMT framework (Klein et al., 2017) to find the effect of using Recurrent Neural Networks without the coarse-to-fine approach. Along with the Gigaword dataset featuring newspaper articles, the systems are evaluated using short stories from English Language exams. The style of this dataset is less journalistic, which highlights how well these systems perform on general text. Quantitative evaluation is conducted which measures the lexical and semantic similarity of the predicted and actual titles. I also evaluate the outputs qualitatively by measuring the grammaticality of the outputs. From the results, it is found that the OpenNMT (Klein et al., 2017) model trained to produce summaries from lead sentences, is able to produce the most accurate headlines. By evaluating the systems on a dataset with general text, it is found that they do not produce the same accuracies as the Gigaword dataset, suggesting they don’t effectively generalise across domains.

Usage

Clone repository

git clone https://github.com/martinhartt/HGfGT
cd HGfGT

Dependencies

Install Pipenv and run the following commands after cloning the repository:

pipenv install
pipenv shell

Data preparation

  1. Include the raw Gigaword dataset (with uncompressed files) under the path data/agiga.
  2. Create train/test/validation splits with bash bin/create_splits.sh data/agiga
  3. Preprocess the files using the command bash bin/dataprep_agiga.sh --extract --splits --filter --all --test (use --filter flag for baseline and --all for extension)
  4. If using --all flag for extension system, run the bash bin/dataprep_extsum.sh in parallel with multiple workers with the $SLURM_ARRAY_TASK_ID environment variable set as the id of the worker (1-32). If multiple workers are not available, it can be run sequentially with the following command (it can take a few days for the Gigaword dataset):
for i in `seq 1 32`; do export SLURM_ARRAY_TASK_ID=$i; bash bin/dataprep_extsum.sh; done
  1. If using --all flag for extension system, combine the sharded files with the bash bin/dataprep_combine.sh command.

Training

To train the dataset run the following command:

bin/train.sh model_name

Flags

  • --hier: Include to train the extension system, and leave out for the baseline.
  • --restore: Include to continue training a saved model
  • --glove: Include to use pretrained GloVe embeddings

Generation

To generate headlines from an input file:

bin/generate.sh model_name --agiga > output_file

Flags

  • --agiga: Uses the Gigaword dataset
  • --hier: Include if generating from an extension system
  • --no-repeat: If included, the model doesn't repeat words in headlines

Evaluation

To evaluate the generated headlines with various metrics, use the following command:

# Download SpaCy model
pipenv install --ignore-pipfile https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz
python summary/evaluate.py output_file

Flags

  • --csv: Include to get results in CSV format

About

Headline Generation for General Text (reimplementations of Rush et al. 2015 and Tan et al. 2017)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published