Named Entity Recognition of diseases

Introduction

Named-entity recognition (NER) is an NLP task that seeks to locate and classify named entities mentioned in unstructured text. In this repository, I give a quick overview of supervised and unsupervised methods for this task. The goal is to find diseases in a given text, so this is a very specific case of NER.

This README gives an overview of the models and their results; you can find all the details in the notebooks in the notebooks folder.

To run the notebooks, install the Python libraries listed in requirements.txt, or run the Docker containers described below.

You may also need to install additional components, such as the English model used by spaCy. If you get an error when importing en_core_web_sm, run this command first: python -m spacy download en_core_web_sm.

Dataset

The NCBI Disease Corpus: The NCBI disease corpus is fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Source: https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/

Note 1: Download the mentioned dataset and place the files NCBI_corpus_training.txt and NCBI_corpus_testing.txt in the folder ./data. Once you put the files there, you are ready to run the commands described below.
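As a rough sketch of how the corpus files could be read once they are in ./data: the snippet below assumes that disease mentions are marked inline with tags of the form <category="SpecificDisease">...</category>. That format is an assumption, so check it against the files you downloaded. The names TAG, extract_mentions and strip_tags are hypothetical and are not part of this repository.

```python
import re

# Assumed inline annotation format: <category="Label">mention</category>
TAG = re.compile(r'<category="(?P<label>[^"]+)">(?P<mention>.*?)</category>')


def extract_mentions(annotated_text):
    """Return (label, mention) pairs in the order they appear in the text."""
    return [(m.group("label"), m.group("mention")) for m in TAG.finditer(annotated_text)]


def strip_tags(annotated_text):
    """Return the plain text with the annotation tags removed."""
    return TAG.sub(lambda m: m.group("mention"), annotated_text)


sample = 'Mutations cause <category="SpecificDisease">cystic fibrosis</category> in some patients.'
print(extract_mentions(sample))
print(strip_tags(sample))
```

Keeping the labels separate from the plain text makes it easy to build token-level tags for a supervised model later.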

Note 2: In order to use the notebook with GloVe, first you need to download the file glove.840B.300d.txt from https://nlp.stanford.edu/projects/glove/. Then, place the file in the folder ./data.
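For reference, a GloVe text file stores one token per line followed by its vector components, separated by spaces. The sketch below shows one way such a file could be loaded with the standard library only; it is not necessarily how the notebook does it. The function name load_glove and its parameters are hypothetical. Since a token may itself contain spaces, the vector is taken from the last dim fields of each line.

```python
def load_glove(path, dim=300, vocabulary=None):
    """Load GloVe vectors from a text file into a dict of word -> list of floats.

    If `vocabulary` is given, only words in that collection are kept,
    which saves a lot of memory with the 840B file.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) <= dim:
                continue  # malformed or empty line: no room for a token
            # Everything before the last `dim` fields is the token itself.
            word = " ".join(parts[:-dim])
            if vocabulary is not None and word not in vocabulary:
                continue
            vectors[word] = [float(value) for value in parts[-dim:]]
    return vectors


# Example call (path taken from Note 2 above):
# vectors = load_glove("./data/glove.840B.300d.txt", vocabulary={"cancer", "asthma"})
```

Passing a vocabulary built from the training corpus is usually worthwhile, because the full file is several gigabytes.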

Commands

Note: These commands have only been tested on macOS, but they should also work in Git Bash on Windows.

You can control the Docker containers with these two commands:

sh manager.sh docker:run
sh manager.sh docker:down

Now, you have two commands that you can use to train the model and to make predictions:

sh manager.sh train "model-name"
sh manager.sh predict "model-name" "Write your text here..."

There are two model names that you can use: dictionary or lstm-crf. For example:

sh manager.sh train "lstm-crf"
sh manager.sh predict "lstm-crf" "Write your text here..."

Tip: remember that you can make predictions using dictionary without any previous training.

Have fun! ᕙ (° ~ ° ~)

Algorithms

Dictionary

This is a naive approach to NER that uses a fixed list of medical terms and disease names. The recognizer can only be as good as the list. I used a scraper to obtain a small list of diseases (source: https://www.nhsinform.scot/illnesses-and-conditions/a-to-z), did a bit of manual parsing on the list, and stored it in the file diseases.txt. My goal is to see how well an algorithm based on such a list can perform.

A similar approach is to write a list of regular expressions for a particular language. A person has to write those patterns by hand, so a large list of patterns can be hard to maintain.
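To make the dictionary idea concrete, here is a minimal sketch that compiles a list of terms into a single regular expression and reports the matches. It is an illustration only, not the code used in this repository; build_matcher and find_diseases are hypothetical names, and loading the terms from diseases.txt with one term per line is an assumption about that file.

```python
import re


def build_matcher(terms):
    """Compile a case-insensitive pattern that matches any of the given terms."""
    cleaned = {term.strip() for term in terms if term.strip()}
    if not cleaned:
        return re.compile(r"(?!)")  # a pattern that never matches
    # Longest terms first, so "type 2 diabetes" wins over "diabetes".
    ordered = sorted(cleaned, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_diseases(text, matcher):
    """Return (start, end, matched_text) for every dictionary hit in the text."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


# Assumed usage with a one-term-per-line file:
# with open("diseases.txt", encoding="utf-8") as handle:
#     matcher = build_matcher(handle)
matcher = build_matcher(["diabetes", "type 2 diabetes", "asthma"])
print(find_diseases("She has Type 2 diabetes and asthma.", matcher))
```

Exact matching like this misses spelling variants, abbreviations and any disease that is not in the list, which is the main weakness of the approach.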

LSTM-CRF with PoS

This supervised approach is based on the idea described in Bidirectional LSTM-CRF Models for Sequence Tagging by Zhiheng Huang et al.: https://arxiv.org/abs/1508.01991.
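In an LSTM-CRF tagger, the bidirectional LSTM produces a score for each tag at each token, and the CRF layer adds tag-to-tag transition scores so that the whole tag sequence is chosen jointly. At prediction time this amounts to Viterbi decoding. The sketch below shows only that decoding step in plain Python, with made-up scores; it is not the model from the notebooks, and the name viterbi_decode and the three-tag scheme are assumptions for illustration.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at token t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters)
    """
    if not emissions:
        return []
    n_tags = len(tags)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [tags[i] for i in path]


tags = ["O", "B", "I"]
# Illustrative scores: a strong penalty forbids the transition O -> I.
transitions = [[0, 0, -10], [0, 0, 0], [0, 0, 0]]
emissions = [[1.0, 0.9, 0.0], [0.0, 0.0, 2.0]]
print(viterbi_decode(emissions, transitions, tags))
```

With these scores, picking the best tag per token independently would give the invalid sequence O, I; the transition penalty steers the decoder to B, I instead.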

Results

I evaluated the performance of each algorithm using the Jaccard Index:

Dictionary: 0.1369
LSTM-CRF with PoS: 0.8154
LSTM-CRF with PoS and GloVe: 0.8435

Notes:

I did not optimize the hyperparameters of the algorithms, so you may obtain better results if you do that.
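For readers unfamiliar with the metric, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below applies it to the token positions tagged as disease in a gold and a predicted sequence. This is one plausible way to compute it and an assumption on my part; the exact computation used for the numbers above is in the notebooks. The function names are hypothetical.

```python
def disease_positions(tags, outside="O"):
    """Return the set of token positions that carry a non-outside tag."""
    return {index for index, tag in enumerate(tags) if tag != outside}


def jaccard_index(gold, predicted):
    """Jaccard index |A & B| / |A | B| of two collections treated as sets."""
    gold, predicted = set(gold), set(predicted)
    union = gold | predicted
    if not union:
        return 1.0  # both empty: treat as perfect agreement
    return len(gold & predicted) / len(union)


gold_tags = ["O", "B", "I", "O", "B"]
predicted_tags = ["O", "B", "O", "O", "B"]
score = jaccard_index(disease_positions(gold_tags), disease_positions(predicted_tags))
print(round(score, 4))
```

A score of 1.0 means the predicted positions match the gold positions exactly, and 0.0 means they do not overlap at all.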

Future work

I would like to implement the idea described at http://www.cogprints.org/5025/1/NRC-48727.pdf
Build regular expressions using guidance from health organizations, for example: https://apps.who.int/iris/bitstream/handle/10665/163636/WHO_HSE_FOS_15.1_eng.pdf
Apply k-fold cross-validation to the training of the LSTM-CRF.
Implement a BERT model.

Code organization

code
data
doc
misc
notebooks
.env
.gitignore
README.md
docker-compose.yml
manager.sh
