This is the official code repository for the paper "A Hybrid Approach to Biomedical Natural Language Inference" by Zhaofeng Wu, Yan Song, Sicong Huang, Yuanhe Tian, and Fei Xia.
This package is based on the repository for Multi-Task Deep Neural Networks (MT-DNN).
## Quickstart

- Requirement: Python >= 3.6; install dependencies with `pip install -r requirements.txt`
- Get the MT-DNN pre-trained models: `sh download.sh`
- Get the MedNLI data (`mli_{train,dev,test}_v1.jsonl`) and put them under `data/`
- As the paper describes, we merge the original MedNLI train and dev sets, and treat the original test set as the new dev set:

  ```bash
  cat mli_dev_v1.jsonl >> mli_train_v1.jsonl && mv mli_train_v1.jsonl train.json && mv mli_test_v1.jsonl dev.json
  ```
- Convert the MedNLI json files into tsv: `python json2tsv.py` (a conceptual sketch of this conversion appears after this list)
- Optional: Process the MedNLI data using this tool, which substitutes masked PHI tokens with pseudo-information (roughly illustrated after this list). You might need to make some minor modifications to the script, as MedNLI uses a slightly different PHI format from MIMIC-III
- Preprocess the tsv files into a format that the MT-DNN scripts like: `python prepro.py`
- Generate features: `python generate_domain_features.py && python generate_generic_features.py`
- Put `glove.6B.300d.txt` under the root directory
- Train:

  ```bash
  python train.py --data_dir data/mt_dnn/ --init_checkpoint mt_dnn_models/mt_dnn_base.pt --batch_size 16 --output_dir checkpoints/model --log_file checkpoints/model/log.log --answer_opt 0 --train_datasets mednli --test_datasets mednli --epochs 15 --stx_parse_dim 100 --glove_path glove.6B.300d.txt --unk_threshold 5 --feature_dim 20 --use_parse --use_generic_features --use_domain_features
  ```
- Postprocessing: see `postpro/postpro.py`
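
For reference, the json-to-tsv conversion is conceptually just flattening each jsonl record into tab-separated columns. Below is a minimal sketch assuming the standard SNLI-style MedNLI fields; the column order, label encoding, and file layout used by the actual `json2tsv.py` may differ:

```python
import json

# Minimal sketch of a jsonl -> tsv conversion; the actual json2tsv.py
# may use a different column order, label encoding, or file layout.
for split in ("train", "dev"):
    with open(f"data/{split}.json") as fin, open(f"data/{split}.tsv", "w") as fout:
        for line in fin:
            ex = json.loads(line)
            # MedNLI follows the SNLI jsonl schema: each record has a pairID,
            # a gold_label, a premise (sentence1), and a hypothesis (sentence2).
            row = [ex["pairID"], ex["gold_label"], ex["sentence1"], ex["sentence2"]]
            fout.write("\t".join(row) + "\n")
```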
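
And a rough illustration of what the optional PHI substitution does. Everything here (the hypothetical `PSEUDO` table and the fallback behavior) is an assumption for illustration only; the linked tool is far more thorough:

```python
import re

# Hypothetical pseudo-information table; the real tool draws on much
# richer surrogate lists than this.
PSEUDO = {"First Name": "John", "Last Name": "Smith", "Hospital": "General Hospital"}

def substitute_phi(text: str) -> str:
    """Replace MIMIC-style masked PHI spans such as [**First Name**]."""
    def repl(match: re.Match) -> str:
        key = match.group(1).strip()
        return PSEUDO.get(key, "UNKNOWN")  # fallback for unlisted PHI types
    return re.sub(r"\[\*\*(.*?)\*\*\]", repl, text)

print(substitute_phi("Seen at [**Hospital**] by Dr. [**Last Name**]."))
# -> Seen at General Hospital by Dr. Smith.
```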
## Domain Fine-tuning

1. Get some unlabeled corpus. We used MIMIC-III
2. Do some cleaning and convert the format to what step 3 below expects (the LM fine-tuning scripts there take plain text, one sentence per line, with blank lines between documents). We used this tool to substitute masked PHI tokens with pseudo-information
3. Use the repo pytorch-pretrained-BERT (now renamed to pytorch-transformers): run the `pregenerate_training_data.py` and `finetune_on_pregenerated.py` scripts under their `examples/lm_finetuning` directory
4. Resolve some format inconsistencies between the two libraries (see the sketch after this list):

   ```bash
   python wrapping_util.py [PATH_TO_pytorch-pretrained-BERT_TRAINED_MODEL] mt_dnn_models/mt_dnn_base.pt && mv new_pytorch_model.bin mt_dnn_models/finetuned_model.pt
   ```

   This assumes that you're using the base model; change to `mt_dnn_large.pt` otherwise
5. Train the model as in Quickstart, but change `--init_checkpoint` to point to this new model (`mt_dnn_models/finetuned_model.pt`)
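
For reference, here is a sketch of the kind of repackaging step 4 performs. The checkpoint layout (weights bundled under a `state` key) and the key handling are assumptions, not the actual contents of `wrapping_util.py`:

```python
import sys
import torch

# Hedged sketch: copy fine-tuned BERT weights from a pytorch-pretrained-BERT
# checkpoint (a bare state dict) into an MT-DNN checkpoint (assumed here to
# bundle its weights under a 'state' key alongside a training config).
lm_path, mt_dnn_path = sys.argv[1], sys.argv[2]
lm_state = torch.load(lm_path, map_location="cpu")
checkpoint = torch.load(mt_dnn_path, map_location="cpu")

state = checkpoint["state"]
for key, tensor in lm_state.items():
    if key in state:          # overwrite shared BERT parameters
        state[key] = tensor
    # parameters only in the LM checkpoint (e.g. the LM head) are dropped

torch.save(checkpoint, "new_pytorch_model.bin")
```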