
Unsupervised Opinion Summarization as Copycat-Review Generation

This repository contains the Python (PyTorch) codebase of the corresponding paper accepted at ACL 2020, Seattle, USA.

The model is fully unsupervised and is trained on a large corpus of customer reviews, such as Yelp or Amazon. It generates abstractive summaries condensing common opinions across a group of reviews. It relies on Bayesian auto-encoding that fosters learning rich hierarchical semantic representations of reviews and products. Finally, the model uses a copy mechanism to better preserve details of input reviews.

Example summaries produced by the system are shown below.

  • This restaurant is a hidden gem in Toronto. The food is delicious, and the service is impeccable. Highly recommend for anyone who likes French bistro.

  • This is a great case for the Acer Aspire 14" laptop. It is a little snug for my laptop, but it's a nice case. I would recommend it to anyone who wants to protect their laptop.

  • This is the best steamer I have ever owned. It is easy to use and easy to clean. I have used it several times and it works great. I would recommend it to anyone looking for a steamer.

For more examples, please refer to the artifacts folder.

Installation

The easiest way to proceed is to create a separate conda environment.

conda create -n copycat python=3.6.9
conda activate copycat

Install required modules.

pip install -r requirements.txt

Add the root directory to the path.

export PYTHONPATH=root_path:$PYTHONPATH
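To confirm that the path is set up correctly, a quick check along the following lines may help. It is only a sketch: the helper `is_importable` is hypothetical and not part of the codebase, and it assumes that the top-level package names are `copycat` and `mltoolkit`, as suggested by the paths and notes in this README.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located on the current path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    # Package names are assumed from the paths used elsewhere in this README.
    for name in ("copycat", "mltoolkit"):
        status = "found" if is_importable(name) else "NOT found; check PYTHONPATH"
        print(f"{name}: {status}")
```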

Data

Our model is trained on two collections of customer reviews: Amazon and Yelp. The evaluation was performed on human-annotated summaries based on both datasets.

Preprocessing of Unsupervised Data

To train the model, one needs to download the datasets from the official websites. Both are publicly available, free of charge. The model expects a certain format of input, which can be obtained by preprocessing the downloaded data using the provided preprocessing scripts.

Input Data Format

To train on a different dataset, follow the input format illustrated in artifacts. Each business or product must be stored in its own CSV file, where each line corresponds to a single review.

group_id review_text rating category
159985130X We recommend the Magnifier ... 4.0 health_and_personal_care

The rating column is optional as it is not used by the model.
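As a rough illustration of the format, the sketch below parses one group file with the standard `csv` module. It assumes tab-separated columns with a header row, which is a guess based on the sample above; the function `read_reviews` is hypothetical and not part of the codebase.

```python
import csv
import io

# Assumed layout: tab-separated columns with a header row, as in the sample above.
SAMPLE = (
    "group_id\treview_text\trating\tcategory\n"
    "159985130X\tWe recommend the Magnifier ...\t4.0\thealth_and_personal_care\n"
)


def read_reviews(handle, delimiter="\t"):
    """Parse one group file into a list of dicts, treating `rating` as optional."""
    reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    required = {"group_id", "review_text", "category"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [dict(row) for row in reader]


reviews = read_reviews(io.StringIO(SAMPLE))
print(reviews[0]["group_id"], reviews[0]["category"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper, and adjust `delimiter` if your files use a different separator.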

Evaluation Summaries

Evaluation can be performed on human-created summaries; both the Amazon and Yelp summaries are publicly available. No preprocessing is needed for evaluation. We created the Amazon summaries using the Mechanical Turk platform; more information on the process can be found in the corresponding folder.

Running

If you preprocessed data yourself, please create your vocabulary and truecaser. Otherwise, you can skip the following two sections.

Vocabulary Creation

The vocabulary file contains a mapping from words to their frequencies, where a word's position in the file corresponds to the id used by the model.

python copycat/scripts/create_vocabulary.py --data_path=your_data_path --vocab_fp=data/dataset_name/vocabs/vocab.txt
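To make the idea concrete, here is a minimal sketch of a frequency vocabulary whose line positions double as word ids. It is not the logic of `create_vocabulary.py`: the lowercasing, whitespace tokenization, sort order, and absence of special tokens are all assumptions, and `build_vocab` and `word_to_id` are hypothetical names.

```python
from collections import Counter


def build_vocab(texts, min_freq=1):
    """Count whitespace-separated tokens and order them by descending frequency."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # Sort by frequency, then alphabetically, so the ordering is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, freq) for word, freq in ordered if freq >= min_freq]


def word_to_id(vocab):
    """Map each word to its position in the vocabulary list (its line index)."""
    return {word: idx for idx, (word, _) in enumerate(vocab)}


vocab = build_vocab(["the food is great", "the service is great too"])
ids = word_to_id(vocab)
for word, freq in vocab:
    # One "word frequency" pair per line; the line index is the word id.
    print(f"{word} {freq}")
```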

Truecaser Creation

The truecaser is used to restore the original casing of lowercased text, and it needs to be trained (quickly) by scanning the dataset. Note that multiple folders can be assigned to the data_path parameter.

python copycat/scripts/train_truecaser.py --data_path=your_data_path --tcaser_fp=data/dataset_name/tcaser.model
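For intuition, truecasing can be approximated by remembering the most frequent surface form of each word. The sketch below is a simplified, hypothetical version of that idea and not the algorithm used by `train_truecaser.py`; the function names and the whitespace tokenization are assumptions.

```python
from collections import Counter, defaultdict


def train_truecaser(texts):
    """Record, for each lowercased word, its most frequent surface form."""
    forms = defaultdict(Counter)
    for text in texts:
        tokens = text.split()
        # Skip the first token of each text, whose capitalization is uninformative.
        for token in tokens[1:]:
            forms[token.lower()][token] += 1
    return {lower: counter.most_common(1)[0][0] for lower, counter in forms.items()}


def truecase(text, model):
    """Replace each lowercased token with its learned form, if one exists."""
    return " ".join(model.get(token, token) for token in text.split())


model = train_truecaser([
    "We flew to Toronto last week",
    "The bistro in Toronto is great",
    "I love Toronto",
])
print(truecase("a hidden gem in toronto", model))
```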

Workflow

Set the parameters of the workflow in copycat/hparams/run_hp.py, e.g., by altering data paths or specifying the number of training epochs.

The file run_workflow.py contains a workflow of operations that prepare the necessary objects (e.g., beam search) and then run a training and/or evaluation procedure. After adjusting the run parameters, execute the following command.

python copycat/scripts/run_workflow.py

Checkpoints

Amazon and Yelp checkpoints are available for download. Please place them in copycat/artifacts/, in the corresponding dataset sub-folders.

LICENSE

MIT

Citation

@inproceedings{brazinskas2020unsupervised,
  title={Unsupervised Opinion Summarization as Copycat-Review Generation},
  author={Bra{\v{z}}inskas, Arthur and Lapata, Mirella and Titov, Ivan},
  booktitle={Proceedings of Association for Computational Linguistics (ACL)},
  year={2020}
}

Notes

  • Minor deviations from the published results are expected as the code was migrated from a bleeding-edge PyTorch version and Python 2.7.

  • Post factum, we added a beam search generator with n-gram blocking functionality (based on OpenNMT). This enhancement reduces repetition in the generated summaries.

  • The setup was fully tested with Python 3.6.9.

  • The model works on a single GPU only.

  • mltoolkit provides the backbone functionality for data processing and modelling. Make sure it's visible to the interpreter.
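The n-gram blocking mentioned in the notes above can be illustrated with a small sketch. It shows the general technique only, not the OpenNMT-based implementation used here: the function names, the dictionary of candidate scores, and the choice of n=3 are all assumptions made for illustration.

```python
def would_repeat_ngram(prefix, candidate, n=3):
    """Return True if appending `candidate` to `prefix` repeats an existing n-gram."""
    if len(prefix) < n - 1:
        return False
    # The n-gram that would be formed by the last n-1 tokens plus the candidate.
    new_ngram = tuple(prefix[len(prefix) - (n - 1):]) + (candidate,)
    seen = {tuple(prefix[i:i + n]) for i in range(len(prefix) - n + 1)}
    return new_ngram in seen


def block_repeats(prefix, scores, n=3):
    """Set the score of every candidate token that would repeat an n-gram to -inf."""
    return {
        token: float("-inf") if would_repeat_ngram(prefix, token, n) else score
        for token, score in scores.items()
    }


# Hypothetical decoding state: tokens generated so far and next-token log-scores.
prefix = ["it", "is", "easy", "to", "use", "and", "it", "is"]
scores = {"easy": -0.1, "great": -0.9}
blocked = block_repeats(prefix, scores, n=3)
best = max(blocked, key=blocked.get)
print(best)
```

In this example "easy" has the higher score, but choosing it would repeat the trigram "it is easy", so the generator falls back to "great".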
