# Named Entity Recognition


Training and deployment of BiLSTM and RoBERTa models in AWS SageMaker for a NER task.
I strongly encourage you to use Jupyter Notebook Viewer to explore this repository.

## tl;dr

Fine-tuned RoBERTa (F1 = 0.838) turned out to outperform BiLSTM (F1 = 0.788). In this repository you can explore the capabilities of AWS SageMaker (training and deployment scripts for TensorFlow and PyTorch), S3, Lambda, and API Gateway (model deployment), as well as the Flask framework (web app).

## Project report

If you would like to go through the whole project, start with the project report and then follow the code as described in the sections below.

## Project flow

*(Mermaid flowchart of the project)*

If you would like to replicate the results, simply follow the flowchart; you will find all the necessary scripts in `src/`.

## Data

Data source: Annotated Corpus for Named Entity Recognition

This is an extract from the Groningen Meaning Bank (GMB) corpus, which is tagged and annotated specifically for training classifiers to predict named entities such as location names, organisations, times, and people.

The dataset consists of:

- 47,959 sentences
- 1,354,149 words
- 17 distinct entity tags
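In the source CSV, the sentence identifier appears only on the first word of each sentence, so the rows have to be grouped back into sentences before training. A minimal sketch of that grouping, assuming a simplified three-column layout (the sample rows below are illustrative, not taken from the dataset):

```python
import csv
import io

def rows_to_sentences(rows):
    """Group (sentence_id, word, tag) rows into sentences.

    The sentence-id column is filled only on the first word of each
    sentence, so an empty id means "continue the current sentence".
    """
    sentences = []
    for sent_id, word, tag in rows:
        if sent_id:                      # a new sentence starts here
            sentences.append([])
        sentences[-1].append((word, tag))
    return sentences

# Tiny in-memory sample mimicking the dataset layout (hypothetical rows).
sample = io.StringIO(
    "Sentence #,Word,Tag\n"
    "Sentence: 1,Thousands,O\n"
    ",of,O\n"
    ",demonstrators,O\n"
    "Sentence: 2,London,B-geo\n"
)
reader = csv.reader(sample)
next(reader)                             # skip the header row
sentences = rows_to_sentences(reader)
print(len(sentences))        # 2
print(sentences[1])          # [('London', 'B-geo')]
```

The full preparation steps live in `prepare_data_bilstm.ipynb` and `prepare_data_for_roberta.ipynb`.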

## BiLSTM and RoBERTa Source

The `src` directory contains the source code for both models, as well as the EDA, full data preparation, and inference code. I tried to follow the cookie-cutter layout but had to make some slight adjustments. The training processes are thoroughly described in the `train_*.ipynb` notebooks.

Folder tree made with the simple yet amazing repository-tree:

```
├─ src
│  ├─ data_processing
│  │  ├─ helpers.py
│  │  ├─ prepare_data_bilstm.ipynb
│  │  └─ prepare_data_for_roberta.ipynb
│  ├─ eda
│  │  └─ eda.ipynb
│  ├─ serve
│  │  ├─ predict.py
│  │  └─ requirements.txt
│  ├─ source_bilstm
│  │  └─ train_bilstm.py
│  ├─ source_roberta
│  │  ├─ requirements.txt
│  │  ├─ train_roberta.py
│  │  └─ utils.py
```
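The BiLSTM pipeline needs each sentence mapped to integer indices and padded to a fixed length before it can be fed to the network. A minimal sketch of that idea, with an illustrative vocabulary and `max_len` (not the notebook's actual values):

```python
def build_vocab(sentences, pad="PAD", unk="UNK"):
    """Map each distinct word to an integer index; 0 and 1 are reserved."""
    vocab = {pad: 0, unk: 1}
    for sent in sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab, max_len):
    """Convert a sentence to a fixed-length list of word indices."""
    ids = [vocab.get(w, vocab["UNK"]) for w in sentence]
    ids = ids[:max_len]                          # truncate long sentences
    return ids + [vocab["PAD"]] * (max_len - len(ids))   # pad short ones

# Illustrative sentences, not taken from the dataset.
sents = [["London", "is", "big"], ["Paris", "is", "bigger", "still"]]
vocab = build_vocab(sents)
print(encode(sents[0], vocab, max_len=5))   # [2, 3, 4, 0, 0]
```

The same padding idea applies to the tag sequences, so that words and labels stay aligned position by position.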

I also experimented with other architectures such as BERT, DistilBERT, and BiLSTM-CRF (which, unfortunately, is not yet supported in AWS SageMaker script mode with TensorFlowPredictor). However, RoBERTa performed better than all of them; I am curious how it will compare to BiLSTM-CRF.

## Model evaluation

Both models were tested on the same held-out test set (10% of the data) and achieved the following results:

| Model   | F1 score |
| ------- | -------- |
| BiLSTM  | 0.788    |
| RoBERTa | 0.838    |

Fine-tuned RoBERTa clearly outperforms BiLSTM, as well as all the models presented in Kaggle kernels for this dataset.
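For NER, F1 is usually computed at the entity level rather than the token level: a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. A small seqeval-style sketch of that metric (the helper names are my own, not the repo's):

```python
def bio_spans(tags):
    """Extract (entity_type, start, end) spans from a BIO tag sequence."""
    spans, etype, start = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == etype
        if not inside:
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
    return spans

def entity_f1(gold, pred):
    """Micro-averaged entity-level F1 over lists of tag sequences."""
    g, p = set(), set()
    for k, (gt, pt) in enumerate(zip(gold, pred)):
        g |= {(k,) + s for s in bio_spans(gt)}   # key spans by sentence index
        p |= {(k,) + s for s in bio_spans(pt)}
    tp = len(g & p)                              # exact span-and-type matches
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = [["B-geo", "I-geo", "O", "B-per"]]
pred = [["B-geo", "I-geo", "O", "O"]]
print(round(entity_f1(gold, pred), 3))   # 0.667
```

In practice a library such as seqeval does this for you; the sketch only shows why a partially correct span scores zero for that entity.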

## Model Deployment

For deployment I utilized two additional AWS services: Lambda and API Gateway. I also developed a Flask web app that enables the user to query the API.

If you need any help with Lambda or API Gateway, check out this deployment cheatsheet.
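In this pattern, API Gateway forwards the HTTP request to a Lambda function, which calls the SageMaker endpoint and returns the predictions. A minimal sketch of such a handler — the field names are illustrative, and the `predict` parameter stands in for the real SageMaker runtime call (boto3 `invoke_endpoint`) so the sketch stays self-contained:

```python
import json

def lambda_handler(event, context, predict=None):
    """Sketch of a Lambda behind an API Gateway proxy integration.

    `predict` is a stand-in for invoking the SageMaker endpoint;
    injecting it keeps this sketch runnable without AWS credentials.
    """
    body = json.loads(event["body"])
    text = body.get("text", "")
    if not text:
        return {"statusCode": 400,
                "body": json.dumps({"error": "empty text"})}
    entities = predict(text) if predict else []
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"text": text, "entities": entities}),
    }

# Simulated API Gateway event with a stub predictor.
event = {"body": json.dumps({"text": "London is calling"})}
resp = lambda_handler(event, None, predict=lambda t: [["London", "B-geo"]])
print(resp["statusCode"])   # 200
```

API Gateway's proxy integration expects exactly this response shape (`statusCode`, optional `headers`, and a string `body`), which is why the payload is JSON-encoded rather than returned as a dict.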

A brief look at the app:

*(demo GIF)*

## Further research

- Experiment with CRF layers (combined with BiLSTM and embeddings such as ELMo)
- Experiment with CNN character embeddings
- Experiment with different XAI techniques (such as LIME and eli5) to explain NER predictions


## Contributions

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
If you would like to collaborate on any of the further-research points, feel free to open an issue or message me on LinkedIn 😉