Code to reproduce data, models and results of the paper Multi-Language Multi-Document Summarization.
All the code to create Multi-Wiki-News and to reproduce the statistics and explanations is in the dataset folder.
Raw data for each version of the dataset are available here.
You can also load the dataset with the HuggingFace nlp library using en_wiki_multi_news.py for the English version, de_wiki_multi_news.py for the German version, or fr_wiki_multi_news.py for the French one.
To load Multi-en-Wiki-News, run:
```python
from nlp import load_dataset

dataset = load_dataset('en_wiki_multi_news.py', cache_dir='dataset/.en-wiki-multi-news-cache')
train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']
```
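Each split behaves like a list of examples. For multi-document summarization, the source articles of one example typically need to be joined into a single string before tokenization. A minimal sketch of such a preprocessing helper (the separator and the idea of character-level truncation are illustrative assumptions, not the dataset's actual schema or the paper's exact preprocessing):

```python
def prepare_input(source_documents, separator=" ||| ", max_chars=None):
    """Concatenate the source articles of one example into a single string.

    separator and max_chars are illustrative choices; adjust them to match
    the preprocessing used by the model you feed the text to.
    """
    text = separator.join(doc.strip() for doc in source_documents)
    if max_chars is not None:
        text = text[:max_chars]
    return text
```

The resulting string can then be passed directly to a tokenizer.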
Trained models are available as HuggingFace models here.
Implementation code and training scripts are in the train folder.
For example, you can use BART fine-tuned on Multi-en-Wiki-News as follows:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

# Load model
model = AutoModelWithLMHead.from_pretrained("airKlizz/bart-large-multi-en-wiki-news")

# Prepare inputs
inputs = tokenizer.encode_plus(TEXT_TO_SUMMARIZE, max_length=1024, return_tensors="pt")

# Summarize
outputs = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=400,
    min_length=150,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True,
)

# Decode
summary = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(summary)
```
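Note that BART accepts at most 1024 subword tokens, so longer inputs are cut off at tokenization time. If you want to shorten very long concatenated inputs yourself first, a rough whitespace-word truncation can serve as a guard (the 800-word margin is a heuristic assumption, not the model's actual subword budget):

```python
def rough_truncate(text, max_words=800):
    # Whitespace-word truncation as a rough guard before subword tokenization;
    # 800 words is a heuristic margin under BART's 1024-token input limit.
    words = text.split()
    return " ".join(words[:max_words])
```

This only approximates the subword count; the tokenizer's own max_length truncation remains the authoritative limit.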
Implementations of all extractive and abstractive models and the evaluation scripts are in the evaluate folder.
We created a summarization evaluation environment that is easy to use with all models and all datasets. You can find more details in the evaluate folder.
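As a quick sanity check before running the full evaluation pipeline, a minimal ROUGE-1 F1 on whitespace tokens can be sketched as follows (a simplified stand-in for the actual evaluation scripts: lowercased unigram overlap only, no stemming and no multi-reference support):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Minimal ROUGE-1 F1: overlap of lowercased whitespace unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For reported results, use the full evaluation scripts in the evaluate folder rather than this sketch.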
A demo will be available soon.