Code to reproduce data, models and results of the paper Multi-Language Multi-Document Summarization.
All the code to create Multi-Wiki-News and to reproduce the statistics and explanations is in the dataset folder.
Raw data for each version of the dataset are available here.
You can also load the dataset with the HuggingFace nlp library using en_wiki_multi_news.py for the English version, de_wiki_multi_news.py for the German version, or fr_wiki_multi_news.py for the French one.
To load Multi-en-Wiki-News, run:
```python
from nlp import load_dataset

dataset = load_dataset('en_wiki_multi_news.py', cache_dir='dataset/.en-wiki-multi-news-cache')
train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']
```
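Each split behaves like a list of examples. For multi-document summarization, the source articles of one example typically need to be joined into a single string before tokenization. A minimal sketch of such a preprocessing helper (the separator and the idea of character-level truncation are illustrative assumptions, not the dataset's actual schema or the paper's exact preprocessing):

```python
def prepare_input(source_documents, separator=" ||| ", max_chars=None):
    """Concatenate the source articles of one example into a single string.

    separator and max_chars are illustrative choices; adjust them to match
    the preprocessing used by the model you feed the text to.
    """
    text = separator.join(doc.strip() for doc in source_documents)
    if max_chars is not None:
        text = text[:max_chars]
    return text
```

The resulting string can then be passed directly to a tokenizer.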
Trained models are available as HuggingFace models here.
Implementation code and training scripts are in the train folder.
For example, you can use BART fine-tuned on Multi-en-Wiki-News as follows:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")

# Load model
model = AutoModelWithLMHead.from_pretrained("airKlizz/bart-large-multi-en-wiki-news")

# Prepare inputs
inputs = tokenizer.encode_plus(TEXT_TO_SUMMARIZE, max_length=1024, return_tensors="pt")

# Summarize
outputs = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_length=400,
    min_length=150,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True,
)

# Decode
summary = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(summary)
```
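Note that BART accepts at most 1024 subword tokens, so longer inputs are cut off at tokenization time. If you want to shorten very long concatenated inputs yourself first, a rough whitespace-word truncation can serve as a guard (the 800-word margin is a heuristic assumption, not the model's actual subword budget):

```python
def rough_truncate(text, max_words=800):
    # Whitespace-word truncation as a rough guard before subword tokenization;
    # 800 words is a heuristic margin under BART's 1024-token input limit.
    words = text.split()
    return " ".join(words[:max_words])
```

This only approximates the subword count; the tokenizer's own max_length truncation remains the authoritative limit.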
Implementations of all extractive and abstractive models and the evaluation scripts are in the evaluate folder.
We created a summarization evaluation environment that is easy to use with all models and all datasets. You can find more details in the evaluate folder.
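As a quick sanity check before running the full evaluation pipeline, a minimal ROUGE-1 F1 on whitespace tokens can be sketched as follows (a simplified stand-in for the actual evaluation scripts: lowercased unigram overlap only, no stemming and no multi-reference support):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Minimal ROUGE-1 F1: overlap of lowercased whitespace unigrams."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For reported results, use the full evaluation scripts in the evaluate folder rather than this sketch.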
A demo will be available soon.