MEDIQA_WTMED

This is the official code repository for the paper

Zhaofeng Wu, Yan Song, Sicong Huang, Yuanhe Tian and Fei Xia
A Hybrid Approach to Biomedical Natural Language Inference

This package is based on the repository for Multi-Task Deep Neural Networks (MT-DNN).

Quickstart

  1. Requirement: Python >= 3.6; pip install -r requirements.txt
  2. Get MT-DNN pre-trained models: sh download.sh
  3. Get the MedNLI data mli_{train,dev,test}_v1.jsonl and put them under data/
  4. As the paper describes, we merge the original MedNLI train and dev sets, and treat the original test set as the new dev set: cat mli_dev_v1.jsonl >> mli_train_v1.jsonl && mv mli_train_v1.jsonl train.json && mv mli_test_v1.jsonl dev.json
  5. Convert the MedNLI json files into tsv: python json2tsv.py (a sketch of this conversion appears after this list)
  6. Optional: Process the MedNLI data using this tool that substitutes masked PHI tokens with pseudo-information. You might need to make minor modifications to the script, as MedNLI's PHI format differs slightly from MIMIC-III's
  7. Preprocess tsv into a format that the MT-DNN scripts like: python prepro.py
  8. Generate features: python generate_domain_features.py && python generate_generic_features.py
  9. Put glove.6B.300d.txt under the root directory
  10. python train.py --data_dir data/mt_dnn/ --init_checkpoint mt_dnn_models/mt_dnn_base.pt --batch_size 16 --output_dir checkpoints/model --log_file checkpoints/model/log.log --answer_opt 0 --train_datasets mednli --test_datasets mednli --epochs 15 --stx_parse_dim 100 --glove_path glove.6B.300d.txt --unk_threshold 5 --feature_dim 20 --use_parse --use_generic_features --use_domain_features
  11. Postprocessing: see postpro/postpro.py
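
For reference, the sketch below shows the kind of jsonl-to-tsv conversion step 5 performs. The output column order (index, label, premise, hypothesis) and the output file names are assumptions for illustration; json2tsv.py in this repository defines the exact layout that prepro.py expects.

    # Minimal sketch of the jsonl-to-tsv conversion in step 5.
    # The column order and output paths are assumptions; json2tsv.py is authoritative.
    import json

    def convert(jsonl_path, tsv_path):
        with open(jsonl_path) as fin, open(tsv_path, "w") as fout:
            for idx, line in enumerate(fin):
                ex = json.loads(line)
                # MedNLI follows the SNLI jsonl schema: "sentence1" is the premise,
                # "sentence2" the hypothesis, "gold_label" the label.
                row = [str(idx), ex["gold_label"], ex["sentence1"], ex["sentence2"]]
                fout.write("\t".join(row) + "\n")

    if __name__ == "__main__":
        convert("data/train.json", "data/mednli_train.tsv")
        convert("data/dev.json", "data/mednli_dev.tsv")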

Language Model Fine-tuning

  1. Get an unlabeled corpus. We used MIMIC-III
  2. Clean the corpus and convert it to the format that step 3 below expects. We used this tool to substitute masked PHI tokens with pseudo-information
  3. Use the repo pytorch-pretrained-BERT (now renamed pytorch-transformers), specifically the pregenerate_training_data.py and finetune_on_pregenerated.py scripts under its examples/lm_finetuning directory
  4. We need to resolve some format inconsistencies between the two libraries: python wrapping_util.py [PATH_TO_pytorch-pretrained-BERT_TRAINED_MODEL] mt_dnn_models/mt_dnn_base.pt && mv new_pytorch_model.bin mt_dnn_models/finetuned_model.pt. This assumes that you're using the base model; change to mt_dnn_large.pt otherwise. A sketch of this re-wrapping appears after this list
  5. Train the model as in Quickstart, but change --init_checkpoint to point to this new model
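
Conceptually, step 4 keeps MT-DNN's checkpoint wrapper but swaps in the encoder weights fine-tuned with pytorch-pretrained-BERT. The sketch below illustrates that idea only; the "state" key and the "bert." prefix are assumptions about the MT-DNN checkpoint layout, and wrapping_util.py in this repository is authoritative.

    # Minimal sketch of the checkpoint re-wrapping in step 4 of LM fine-tuning.
    # The "state" key and "bert." prefix are assumptions; see wrapping_util.py.
    import sys
    import torch

    def wrap(finetuned_bert_path, mt_dnn_path, out_path="new_pytorch_model.bin"):
        bert_state = torch.load(finetuned_bert_path, map_location="cpu")
        mt_dnn_ckpt = torch.load(mt_dnn_path, map_location="cpu")
        # Keep MT-DNN's wrapper (config, task heads, etc.) and overwrite only
        # the shared BERT encoder weights with the fine-tuned ones.
        for name, tensor in bert_state.items():
            mt_dnn_ckpt["state"]["bert." + name] = tensor
        torch.save(mt_dnn_ckpt, out_path)

    if __name__ == "__main__":
        wrap(sys.argv[1], sys.argv[2])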
