EMNLP19 Submission: "MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance"
MoverScore measures semantic distance between system and reference texts by aligning semantically similar words and finding the corresponding travel costs.
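The idea of "travel cost" can be illustrated with a small, self-contained sketch. It is not the MoverScore implementation: it assumes hand-made two-dimensional toy vectors instead of contextualized embeddings, uniform word weights, and a relaxed nearest-neighbour variant of the mover distance (a lower bound on the exact transport cost) instead of solving the full optimization problem. The names `relaxed_mover_distance` and `toy_embed` are hypothetical and exist only for this illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_mover_distance(sys_tokens, ref_tokens, embed):
    """Relaxed mover distance with uniform word weights (illustrative only).

    Each word sends all of its mass to its nearest word on the other side.
    The larger of the two directions is returned; this lower-bounds the
    exact Earth Mover Distance between the two texts.
    """
    def one_way(src, dst):
        weight = 1.0 / len(src)  # assumption: every word carries equal mass
        return sum(
            weight * min(euclidean(embed[s], embed[d]) for d in dst)
            for s in src
        )
    return max(one_way(sys_tokens, ref_tokens), one_way(ref_tokens, sys_tokens))

# Hypothetical toy embeddings: semantically similar words lie close together.
toy_embed = {
    "guy": (1.0, 0.0),
    "man": (0.9, 0.1),
    "boat": (0.0, 1.0),
    "canoe": (0.1, 0.9),
    "ferry": (0.3, 0.8),
}

close = relaxed_mover_distance(["guy", "boat"], ["man", "canoe"], toy_embed)
far = relaxed_mover_distance(["guy", "boat"], ["ferry"], toy_embed)
print(close < far)  # the first reference is the cheaper one to "travel" to
```

A lower distance means the system text is semantically closer to the reference, so in this toy setup the reference containing "man" and "canoe" is the better match for "guy" and "boat".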
EvalService is an evaluation framework for NLG tasks. It assigns scores (e.g., ROUGE and MoverScore) to system-generated text by comparing it against human references for content matching.
Install the server and client from their source directories. They can be installed separately or even on different machines:
cd server/
python3 setup.py install # server
cd client/
python3 setup.py install # client
Note that the server MUST run on Python >= 3.5; it does not support Python 2.
The client can run on both Python 2 and 3.
After installing the server, start a service as follows:
summ-eval-start -data_dir ../../ -num_worker=4
This will start the service with four workers, meaning that it can handle up to four concurrent requests.
Now you can get scores:
from summ_eval.client import EvalClient
ec = EvalClient()
system = ['A guy with a red jacket is standing on a boat']
references = ['A man wearing a lifevest is sitting in a canoe','A small white ferry rides through water']
example_1 = [system, references, 'rouge_1']
example_2 = [system, references, 'rouge_2']
example_3 = [system, references, 'wmd_1'] # BERTWordMover-unigram
example_4 = [system, references, 'wmd_2'] # BERTWordMover-bigram
example_5 = [system, references, 'smd'] # BERTSentMover
ec.eval([example_1,example_2,example_3,example_4,example_5])
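Since every request above is a `[system, references, metric]` triple, the list can also be built programmatically. The following is a minimal sketch; `build_requests` is a hypothetical helper that is not part of the client, and the final call to the server is left as a comment because it needs a running service.

```python
def build_requests(system, references, metrics):
    """Return one [system, references, metric] request per metric name."""
    return [[system, references, metric] for metric in metrics]

system = ['A guy with a red jacket is standing on a boat']
references = ['A man wearing a lifevest is sitting in a canoe',
              'A small white ferry rides through water']

# Metric names taken from the example above.
metrics = ['rouge_1', 'rouge_2', 'wmd_1', 'wmd_2', 'smd']

batch = build_requests(system, references, metrics)

# With the service running, the batch can be sent in a single call:
# from summ_eval.client import EvalClient
# EvalClient().eval(batch)
```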