PhenoBERT

A knowledge-enhanced tool for recognizing human clinical phenotype with deep-learning.

What is PhenoBERT?

PhenoBERT uses deep learning models (convolutional neural networks and BERT) to identify clinical disease phenotypes in free text. Currently, only English text is supported. On the expert-annotated test set, PhenoBERT achieves state-of-the-art (SOTA) performance compared with the other methods listed below.

Results on the GSC+ dataset (Lobo et al., 2017):

Method	Precision	Recall	F1-score	Set Similarity
NCBO Annotator	66.32	47.79	55.55	72.04
NeuralCR	70.52	63.98	67.09	78.34
Clinphen	49.82	37.99	43.11	56.29
MetaMapLite	62.43	46.67	53.41	66.02
Doc2hpo	68.93	44.98	54.44	69.19
PhenoBERT	75.42	66.18	70.50	82.39
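As a quick consistency check, the F1-score column can be reproduced from the precision and recall columns, since F1 is the harmonic mean of the two. The sketch below is illustrative only: the function name `f1_score` is ours and is not part of PhenoBERT.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from the table above.
results = {
    "NCBO Annotator": (66.32, 47.79),
    "NeuralCR": (70.52, 63.98),
    "Clinphen": (49.82, 37.99),
    "MetaMapLite": (62.43, 46.67),
    "Doc2hpo": (68.93, 44.98),
    "PhenoBERT": (75.42, 66.18),
}

for method, (p, r) in results.items():
    print(f"{method}\t{f1_score(p, r):.2f}")
```

Each printed value matches the F1-score reported in the table to two decimal places.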

Citation:

Yuhao Feng, Lei Qi, Weidong Tian*; PhenoBERT: a knowledge-enhanced tool for recognizing human clinical phenotype with deep-learning. Coming soon

How to install PhenoBERT

You can run PhenoBERT on your local machine. A web version of PhenoBERT is not yet available.

From Source

Download the whole project from GitHub.

git clone https://github.com/EclipseCN/PhenoBERT.git

Enter the project main directory.

cd PhenoBERT

Install dependencies in the current Python3 environment.

Note: we recommend using a Python virtual environment to avoid dependency conflicts.

pip install -r requirements.txt
python setup.py

Move the pretrained files into the corresponding folder.

# download files from Google Drive in advance
mv /path/to/download/embeddings/* phenobert/embeddings
mv /path/to/download/models/* phenobert/models

After step 4, the file structure should look like this:

- phenobert/
    -- models/
         -- HPOModel_H/
         -- bert_model_max_triple.pkl
    -- embeddings/
         -- biobert_v1.1_pubmed/
         -- fasttext_pubmed.bin
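To confirm that the pretrained files ended up in the right place, a small check like the one below can be run from the project main directory. This is a minimal sketch: the helper name `missing_pretrained_files` is hypothetical, and the paths are simply those shown in the file structure above.

```python
from pathlib import Path

# Paths taken from the file structure listed above (relative to phenobert/).
EXPECTED_PATHS = [
    "models/HPOModel_H",
    "models/bert_model_max_triple.pkl",
    "embeddings/biobert_v1.1_pubmed",
    "embeddings/fasttext_pubmed.bin",
]


def missing_pretrained_files(phenobert_dir: str = "phenobert") -> list:
    """Return the expected paths that do not exist under phenobert_dir."""
    root = Path(phenobert_dir)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_pretrained_files()
    if missing:
        print("Missing pretrained files:", ", ".join(missing))
    else:
        print("All pretrained files are in place.")
```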

Pretrained embeddings and models

Pre-trained fastText and BERT embeddings, together with the trained model files (including the .pkl file), are available for download from Google Drive.

click download link

Directory Name	File Name	Description
models/	HPOModel_H/	CNN hierarchical model file
	bert_model_max_triple.pkl	BERT model file
embeddings/	biobert_v1.1_pubmed/	BERT embedding obtained from BioBERT
	fasttext_pubmed.bin	fastText embedding trained on pubmed

Once the download is complete, place the files in the corresponding folders so that PhenoBERT can load them.

How to use PhenoBERT?

We provide three ways to use PhenoBERT. Due to a known issue, all commands must be run from the phenobert/utils directory.

cd phenobert/utils

Annotate corpus folder

The most common usage is recognizing human clinical disease phenotype from free text.

Given a set of text files, PhenoBERT annotates each file and writes an annotation file with the same name to the target folder.

Example using annotate.py:

python annotate.py -i DIR_IN -o DIR_OUT

Arguments:

[Required]

 -i directory for storing text files
 -o directory for storing annotation files
 
[Optional]

 -p1 parameter for CNN model [0.8]
 -p2 parameter for CNN model [0.6]
 -p3 parameter for BERT model [0.9]
 -al flag to keep overlapping concepts (no filtering)
 -nb flag to disable the BERT model
 -t  number of CPU threads used for calculation [10]
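When annotate.py is driven from another Python script, the command line can be assembled from the options listed above. The following is a sketch only: `build_annotate_command` and `run_annotate` are hypothetical helper names, and the keyword names (`keep_overlaps`, `use_bert`, `threads`) are our own labels for the -al, -nb and -t options. Options left as `None` are omitted so that annotate.py falls back to its defaults.

```python
import subprocess
import sys


def build_annotate_command(dir_in, dir_out, p1=None, p2=None, p3=None,
                           keep_overlaps=False, use_bert=True, threads=None):
    """Build the argument list for annotate.py from the options listed above."""
    cmd = [sys.executable, "annotate.py", "-i", str(dir_in), "-o", str(dir_out)]
    for flag, value in (("-p1", p1), ("-p2", p2), ("-p3", p3), ("-t", threads)):
        if value is not None:
            cmd += [flag, str(value)]
    if keep_overlaps:
        cmd.append("-al")  # do not filter overlapping concepts
    if not use_bert:
        cmd.append("-nb")  # skip the BERT model
    return cmd


def run_annotate(dir_in, dir_out, **options):
    """Run annotate.py; the working directory must be phenobert/utils."""
    subprocess.run(build_annotate_command(dir_in, dir_out, **options), check=True)


if __name__ == "__main__":
    # DIR_IN and DIR_OUT are placeholders, as in the example above.
    print(" ".join(build_annotate_command("DIR_IN", "DIR_OUT", threads=4)))
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.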

Related API

We also provide some APIs for other programs to integrate.

from api import *

Running the above code imports the related functions and models and keeps them as global variables, so that repeated calls are fast. Alternatively, you can simply use the Python interactive shell.

Currently we have integrated the following functions:

annotate directly from String

print(annotate_text("I have a headache"))

Output:

9       17      headache        HP:0002315        1.0

Note: setting output = path/ redirects the output to the specified file.
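The sample output line above has five columns. Assuming they are the start offset, end offset, matched text, HPO ID and a score (the offsets 9 and 17 do delimit "headache" in the input sentence), and assuming the columns are separated by tabs or runs of spaces, a line can be parsed as sketched below. `Annotation` and `parse_annotation_line` are hypothetical names, not part of the PhenoBERT API.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int     # character offset where the mention begins
    end: int       # character offset just past the mention
    text: str      # matched phrase
    hpo_id: str    # HPO identifier, e.g. "HP:0002315"
    score: float   # value in the last column (assumed to be a confidence score)


def parse_annotation_line(line: str) -> Annotation:
    """Parse one output line, splitting on tabs or runs of two or more spaces."""
    fields = [f for f in re.split(r"\s{2,}|\t", line.strip()) if f]
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    start, end, text, hpo_id, score = fields
    return Annotation(int(start), int(end), text.strip(), hpo_id, float(score))


sentence = "I have a headache"
ann = parse_annotation_line("9       17      headache        HP:0002315        1.0")
print(ann.hpo_id, sentence[ann.start:ann.end])  # prints: HP:0002315 headache
```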

get the approximate location of the disease

print(get_L1_HPO_term(["cardiac hypertrophy", "renal disease"]))

Output:

[['cardiac hypertrophy', {'HP:0001626'}], ['renal disease', {'HP:0000119'}]]

get the most similar HPO terms

print(get_most_related_HPO_term(["cardiac hypertrophy", "renal disease"]))

Output:

[['cardiac hypertrophy', 'None'], ['renal disease', 'HP:0000112']]
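In the sample output above, a phrase without a match is paired with the string 'None' rather than Python's `None`. Assuming the return value has exactly the shape printed above, a small helper can turn it into a dictionary that is easier to filter. `to_term_map` is a hypothetical name, and the sample data is copied from the output shown, so no PhenoBERT call is made.

```python
from typing import Dict, List, Optional


def to_term_map(pairs: List[list]) -> Dict[str, Optional[str]]:
    """Map each phrase to its HPO ID, turning the 'None' placeholder into None."""
    return {
        phrase: (None if hpo_id in (None, "None") else hpo_id)
        for phrase, hpo_id in pairs
    }


# Sample output copied from above.
sample = [["cardiac hypertrophy", "None"], ["renal disease", "HP:0000112"]]
term_map = to_term_map(sample)
unmatched = [phrase for phrase, hpo_id in term_map.items() if hpo_id is None]
print(unmatched)  # prints: ['cardiac hypertrophy']
```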

determine if two phrases match

print(is_phrase_match_BERT("cardiac hypertrophy", "Ventricular hypertrophy"))

Output:

Match

GUI application

For users who are not comfortable with command-line tools, we also provide a GUI annotation application.

Simply use

python gui.py

Then you will get a visual interactive interface as shown in the figure below, in which the yellow highlighted dialog box will display the running status.

Dataset

We provide here the annotated corpora used in the evaluation (phenobert/data), which can be made publicly available because they have undergone privacy processing.

Dataset	Num	Description
GSC+	228	Contains 228 abstracts of biomedical literature (Lobo et al., 2017) in raw format
68_clinical	68	Clinical description of 68 real cases in the intellectual disability study (Anazi et al., 2017)
GeneReviews	10	Contains 10 GeneReviews clinical cases and annotations
val	30	Contains 30 disease research articles from the OMIM database to determine hyperparameters in our model

Train your own model

For users who cannot log in to Google Drive or who want to customize the training process themselves, we provide the training Python scripts and the training set used by PhenoBERT. The training set can also be customized to generate specific models for other purposes.

cd phenobert/utils

# produce trained models for CNN model
python train.py
python train_sub.py

# produce trained models for BERT model
python my_bert_match.py
