axenov/MultiDocMultiLingualSum

Multi-Document Multi-Lingual Summarization

Code to reproduce the data, models, and results of the paper Multi-Language Multi-Document Summarization.

Multi-Wiki-News

Reproduce the dataset

All the code to create Multi-Wiki-News and to reproduce its statistics and explanations is in the dataset folder.

Load the dataset

The raw data for each version of the dataset is available here.

You can also load the dataset with the HuggingFace nlp library using en_wiki_multi_news.py for the English version, de_wiki_multi_news.py for the German version or fr_wiki_multi_news.py for the French one.

To load Multi-en-Wiki-News, run:

from nlp import load_dataset

dataset = load_dataset('en_wiki_multi_news.py', cache_dir='dataset/.en-wiki-multi-news-cache')

train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']
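To check what was loaded, it can help to look at the size and field names of each split. The helper below is a minimal sketch, not part of this repository: `describe_splits` is a hypothetical name, and it only assumes that each split supports `len()` and integer indexing that returns a dict-like example. The stand-in data and its field names (`document`, `summary`) are illustrative assumptions, not the dataset's confirmed schema.

```python
def describe_splits(splits):
    """Return {split_name: (number of examples, sorted field names of the first example)}."""
    summary = {}
    for name, split in splits.items():
        # Empty splits have no example to inspect, so report no fields.
        first = split[0] if len(split) > 0 else {}
        summary[name] = (len(split), sorted(first.keys()))
    return summary


# Stand-in data with hypothetical field names, used only to illustrate the helper.
fake_splits = {
    "train": [{"document": "text A", "summary": "sum A"}],
    "validation": [],
}
for split_name, (size, fields) in describe_splits(fake_splits).items():
    print(split_name, size, fields)
```

With the real data, passing the `dataset` object returned by `load_dataset` in place of `fake_splits` should work the same way, since it maps split names to splits.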

Models

Trained models are available as HuggingFace models here.

Implementation code and training scripts are in the train folder.

For example, you can use BART fine-tuned on Multi-en-Wiki-News as follows:

from transformers import AutoTokenizer, AutoModelWithLMHead

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

# Load model
model = AutoModelWithLMHead.from_pretrained("airKlizz/bart-large-multi-en-wiki-news")

# Prepare inputs (TEXT_TO_SUMMARIZE is a placeholder for your own input string)
inputs = tokenizer.encode_plus(TEXT_TO_SUMMARIZE, max_length=1024, return_tensors="pt")

# Summarize
outputs = model.generate(
  input_ids=inputs['input_ids'], 
  attention_mask=inputs['attention_mask'], 
  max_length=400, 
  min_length=150, 
  length_penalty=2.0, 
  num_beams=4, 
  early_stopping=True
)

# Decode
summary = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(summary)

Results

All extractive and abstractive model implementations and evaluation scripts are in the evaluate folder.

We created a summarization evaluation environment that is easy to use with all models and all datasets. You can find more details in the evaluate folder.
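For a quick sanity check of a generated summary outside that environment, a simple overlap score can be computed by hand. The sketch below is an assumption-laden illustration, not the repository's evaluation code: this README does not name the metrics it uses, and `rouge1_f1` is a hypothetical helper that implements a simplified unigram-overlap F1 in the spirit of ROUGE-1, with lowercasing and whitespace tokenization only (no stemming or stopword handling).

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only; replace them with a reference summary and a model output.
print(rouge1_f1("the cat sat on the mat", "the cat"))
```

Scores from this sketch are not comparable with numbers produced by a full ROUGE implementation, which usually applies additional text normalization.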

Demo

A demo will be available soon.
