Skip to content

(FORKED REPO) Biomedical Entity Representations with Synonym Marginalization (Sung et al., ACL 2020)

License

Notifications You must be signed in to change notification settings

ronaldseoh/BioSyn

Repository files navigation

BioSyn GitHub

Biomedical Entity Representations with Synonym Marginalization

BioSyn Overview

We present BioSyn for learning biomedical entity representations. You can train BioSyn with the two main components described in our paper: 1) synonym marginalization and 2) iterative candidate retrieval. Once you train BioSyn, you can easily normalize any biomedical mentions or represent them into entity embeddings.

Requirements

$ conda create -n BioSyn python=3.6
$ conda activate BioSyn
$ conda install numpy tqdm nltk scikit-learn
$ conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch
$ pip install transformers==2.0.0

Note that Pytorch has to be installed depending on the version of CUDA.

Resources

Pretrained Model

We use the Huggingface version of BioBERT v1.1 so that the pretrained model can be run on the pytorch framework.

Datasets

Datasets consist of queries (train, dev, test, and traindev), and dictionaries (train_dictionary, dev_dictionary, and test_dictionary). Note that the only difference between the dictionaries is that test_dictionary includes train and dev mentions, and dev_dictionary includes train mentions to increase the coverage. The queries are pre-processed with lowercasing, removing punctuations, resolving composite mentions and resolving abbreviation (Ab3P). The dictionaries are pre-processed with lowercasing, removing punctuations (If you need the pre-processing codes, please let us know by openning an issue).

Note that we use development (dev) set to search the hyperparameters, and train on traindev (train+dev) set to report the final performance.

TAC2017ADR dataset cannot be shared because of the license issue. Please visit the website or see here for pre-processing scripts.

Train

The following example fine-tunes our model on NCBI-Disease dataset (train+dev) with BioBERTv1.1.

MODEL=biosyn-ncbi-disease
BIOBERT_DIR=./pretrained/pt_biobert1.1
OUTPUT_DIR=./tmp/${MODEL}
DATA_DIR=./datasets/ncbi-disease

python train.py \
    --model_dir ${BIOBERT_DIR} \
    --train_dictionary_path ${DATA_DIR}/train_dictionary.txt \
    --train_dir ${DATA_DIR}/processed_traindev \
    --output_dir ${OUTPUT_DIR} \
    --use_cuda \
    --topk 20 \
    --epoch 10 \
    --train_batch_size 16\
    --initial_sparse_weight 0\
    --learning_rate 1e-5 \
    --max_length 25 \
    --dense_ratio 0.5

Note that you can train the model on processed_train and evaluate it on processed_dev when you want to search for the hyperparameters. (the argument --save_checkpoint_all can be helpful. )

Evaluation

The following example evaluates our trained model with NCBI-Disease dataset (test).

MODEL=biosyn-ncbi-disease
MODEL_DIR=./tmp/${MODEL}
OUTPUT_DIR=./tmp/${MODEL}
DATA_DIR=./datasets/ncbi-disease

python eval.py \
    --model_dir ${MODEL_DIR} \
    --dictionary_path ${DATA_DIR}/test_dictionary.txt \
    --data_dir ${DATA_DIR}/processed_test \
    --output_dir ${OUTPUT_DIR} \
    --use_cuda \
    --topk 20 \
    --max_length 25 \
    --save_predictions

Result

The predictions are saved in predictions_eval.json with mentions, candidates and accuracies (the argument --save_predictions has to be on). Following is an example.

{
  "queries": [
    {
      "mentions": [
        {
          "mention": "ataxia telangiectasia",
          "golden_cui": "D001260",
          "candidates": [
            {
              "name": "ataxia telangiectasia",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "ataxia telangiectasia syndrome",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "ataxia telangiectasia variant",
              "cui": "C566865",
              "label": 0
            },
            {
              "name": "syndrome ataxia telangiectasia",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "telangiectasia",
              "cui": "D013684",
              "label": 0
            }]
        }]
    },
    ...
    ],
    "acc1": 0.9114583333333334,
    "acc5": 0.9385416666666667
}

Inference

We provide a simple script that can normalize a biomedical mention or represent the mention into an embedding vector with BioSyn. If you do not have pre-trained BioSyn, please download BioSyn pre-trained on NCBI-Disease.

Predictions (Top 5)

The example below gives the top 5 predictions for a mention ataxia telangiectasia. Note that the initial run will take some time to embed the whole dictionary. You can download the dictionary file here.

MODEL=biosyn-ncbi-disease
MODEL_DIR=./tmp/${MODEL}
DATA_DIR=./datasets/ncbi-disease

python inference.py \
    --model_dir ${MODEL_DIR} \
    --dictionary_path ${DATA_DIR}/test_dictionary.txt \
    --use_cuda \
    --mention "ataxia telangiectasia" \
    --show_predictions

Result

{
  "mention": "ataxia telangiectasia", 
  "predictions": [
    {"name": "ataxia telangiectasia", "id": "D001260|208900"}, 
    {"name": "ataxia telangiectasia syndrome", "id": "D001260|208900"}, 
    {"name": "ataxia telangiectasia variant", "id": "C566865"}, 
    {"name": "syndrome ataxia telangiectasia", "id": "D001260|208900"}, 
    {"name": "telangiectasia", "id": "D013684"}
  ]
}

Embeddings

The example below gives an embedding of a mention ataxia telangiectasia.

MODEL=biosyn-ncbi-disease
MODEL_DIR=./tmp/${MODEL}
DATA_DIR=./datasets/ncbi-disease

python inference.py \
    --model_dir ${MODEL_DIR} \
    --use_cuda \
    --mention "ataxia telangiectasia" \
    --show_embeddings

Result

{
  "mention": "ataxia telangiectasia", 
  "mention_sparse_embeds": array([0.05979538, 0., ..., 0., 0.], dtype=float32), 
  "mention_dense_embeds": array([-7.14258850e-02, ..., -4.03847933e-01,],dtype=float32)
}

Demo

How to run web demo

Web demo is implemented on Tornado framework. If a dictionary is not yet cached, it will take about couple of minutes to create dictionary cache.

MODEL=biosyn-ncbi-disease
MODEL_DIR=./tmp/${MODEL}

python demo.py \
  --model_dir ${MODEL_DIR} \
  --use_cuda \
  --dictionary_path ./datasets/ncbi-disease/test_dictionary.txt

You can try our running demo for nomalization of disease named entities.

Citations

@inproceedings{sung2020biomedical,
    title={Biomedical Entity Representations with Synonym Marginalization},
    author={Sung, Mujeen and Jeon, Hwisang and Lee, Jinhyuk and Kang, Jaewoo},
    booktitle={ACL},
    year={2020},
}

About

(FORKED REPO) Biomedical Entity Representations with Synonym Marginalization (Sung et al., ACL 2020)

Resources

License

Stars

Watchers

Forks