- A CNTK (Microsoft's deep learning toolkit) implementation of "S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension", with some modifications
- This project is designed for the MSMARCO dataset
- The code structure is based on the CNTK BiDAF example
- Supports MSMARCO v1 and v2
The following are required for training:
- Python 3.6
- CUDA 9.0 (required by CNTK)
- Open MPI 1.10 (required by CNTK)
- GCC >= 6 (required by CNTK)
- Python packages: see requirements.txt
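
Before installing CNTK, it can help to confirm that the system-level requirements above are visible on `PATH`. The following is a minimal sketch, not part of this repository: `check_environment` is a hypothetical helper, and the executable names (`nvcc`, `mpirun`, `gcc`) are assumptions about how CUDA, Open MPI and GCC are usually exposed. It only checks presence, not versions.

```python
import shutil
import sys

# Hypothetical helper, not part of this repository.
# Assumed executable names: CUDA compiler, Open MPI launcher, GCC.
REQUIRED_TOOLS = ("nvcc", "mpirun", "gcc")


def check_environment(tools=REQUIRED_TOOLS):
    """Return a dict mapping each requirement to True (found) or False."""
    report = {"python3.6": sys.version_info[:2] == (3, 6)}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:12s} {'found' if ok else 'MISSING'}")
```

A `MISSING` entry does not always mean the dependency is absent; for example, CUDA may be installed without `nvcc` being on `PATH`.
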
Download the MSMARCO v1 dataset and the GloVe embeddings
cd data
python3.6 download.py v1
Convert the raw data to TSV format
python3.6 convert_msmarco.py --threads=`nproc`
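
To illustrate what a multi-threaded raw-to-TSV conversion might look like, here is a rough sketch. It is not the code in `convert_msmarco.py`: the function names, the TSV column order (query id, query, context, answer) and the JSON field names (`query_id`, `query`, `passages`, `passage_text`, `is_selected`, `answers`) are assumptions based on the one-record-per-line layout of MSMARCO v1.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def record_to_tsv(line):
    """Turn one JSON record into a single TSV row (hypothetical column order)."""
    rec = json.loads(line)
    passages = rec.get("passages", [])
    # Prefer passages marked as selected; fall back to all passages.
    selected = [p["passage_text"] for p in passages if p.get("is_selected")]
    context = " ".join(selected or [p["passage_text"] for p in passages])
    # Use the first answer if there is one, otherwise an empty field.
    answer = (rec.get("answers") or [""])[0]
    fields = [str(rec.get("query_id", "")), rec.get("query", ""), context, answer]
    # Collapse whitespace so embedded tabs or newlines cannot break the row.
    return "\t".join(" ".join(f.split()) for f in fields)


def convert(lines, threads=4):
    """Convert records in parallel; pool.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(record_to_tsv, lines))
```

Threads keep the sketch simple, but JSON parsing is CPU-bound in Python, so a process pool is likely the better choice when real speed-up is needed.
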
Convert the TSV files to CTF (CNTK text format) and build the vocabulary dictionaries
python3.6 tsv2ctf.py
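
For readers unfamiliar with CTF, the sketch below shows the general idea of the step above: build a token vocabulary, then write each sequence as lines of the form `seq_id |stream index:1`, one token per line, using the sparse `index:value` notation. It is an illustration only, not the code in `tsv2ctf.py`: the stream names `query` and `context`, the `<UNK>` entry and both function names are assumptions.

```python
from collections import Counter
from itertools import zip_longest


def build_vocab(token_lists, min_count=1):
    """Map tokens to integer ids; id 0 is reserved for unknown tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<UNK>": 0}
    # Most frequent first, ties broken alphabetically, so ids are stable.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def to_ctf(seq_id, query, context, vocab):
    """Emit CTF lines for one example, one token of each stream per line."""
    lines = []
    for q_tok, c_tok in zip_longest(query, context):
        parts = [str(seq_id)]
        if q_tok is not None:
            parts.append(f"|query {vocab.get(q_tok, 0)}:1")
        if c_tok is not None:
            parts.append(f"|context {vocab.get(c_tok, 0)}:1")
        lines.append(" ".join(parts))
    return lines
```

Example usage:

```python
query = ["what", "is", "cntk"]
context = ["cntk", "is", "a", "toolkit"]
vocab = build_vocab([query, context])
print("\n".join(to_ctf(0, query, context, vocab)))
# 0 |query 5:1 |context 1:1
# 0 |query 2:1 |context 2:1
# 0 |query 1:1 |context 3:1
# 0 |context 4:1
```
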
Generate the ELMo embeddings
sh elmo.sh
Download the MSMARCO v2 dataset and the GloVe embeddings
cd data
python3.6 download.py v2
Convert the raw data to TSV format
python3.6 convert_msmarco.py --threads=`nproc` --ratio=0.1
Convert the TSV files to CTF (CNTK text format) and build the vocabulary dictionaries
python3.6 tsv2ctf.py
Generate the ELMo embeddings
sh elmo.sh
Train the model
cd ../script
mkdir log
sh run.sh
Evaluate on MSMARCO v1
cd Evaluation
sh eval.sh v1
Evaluate on MSMARCO v2
cd Evaluation
sh eval.sh v2
Model | rouge-l | bleu_1
---|---|---
S-Net (Extraction) | 41.45 | 44.08
S-Net (Extraction, Ensemble) | 42.92 | 44.97
Configuration | rouge-l | bleu_1
---|---|---
MSMARCO v1 w/o ELMo | 38.43 | 39.14
MSMARCO v1 w/ ELMo | 39.42 | 39.47
MSMARCO v2 w/ ELMo | 43.66 | 44.44
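
As a reminder of what the rouge-l column measures, here is a minimal sketch of ROUGE-L on token lists: the longest common subsequence (LCS) gives precision `P = LCS / len(candidate)` and recall `R = LCS / len(reference)`, combined as `F = (1 + beta^2) * P * R / (R + beta^2 * P)`. This is not the evaluation script used by `eval.sh`; the function names are hypothetical and the default `beta=1.2` is an assumption. The tables report scores multiplied by 100.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            # Extend the diagonal on a match, else carry the best neighbour.
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure in [0, 1]; beta > 1 weights recall more heavily."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

For example, with candidate `the cat sat on the mat` and reference `the cat is on the mat`, the LCS is `the cat on the mat` (5 tokens), so precision and recall are both 5/6 and the score is about 0.833 for any `beta`.
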
- Multi-threaded preprocessing
- ELMo embeddings
- Evaluation script
- MSMARCO v2 support
- Reasonable metrics