# Named Entity Recognition


Training and deployment of BiLSTM and RoBERTa models in AWS SageMaker for a NER task.
I strongly encourage you to use Jupyter Notebook Viewer to explore this repository.

## tl;dr

Fine-tuned RoBERTa (F1 = 0.838) turned out to outperform BiLSTM (F1 = 0.788). In this repository you can explore the capabilities of AWS SageMaker (training and deployment scripts for TensorFlow and PyTorch), S3, Lambda, and API Gateway (model deployment), as well as the Flask framework (web app).

## Project report

If you would like to go through the whole project, start with the project report and then follow the code as described in the sections below.

## Project flow

*(Mermaid flowchart of the project)*

If you would like to replicate the results, simply follow the flowchart; you will find all the necessary scripts in `src/`.

## Data

Data source: Annotated Corpus for Named Entity Recognition

This is an extract from the Groningen Meaning Bank (GMB) corpus, which is tagged and annotated specifically for training classifiers to predict named entities such as location names, organisations, times, and people.

The dataset consists of:

- 47,959 sentences
- 1,354,149 words
- 17 distinct entity tags
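In the source CSV, the sentence identifier appears only on the first word of each sentence, so the rows have to be grouped back into sentences before training. A minimal sketch of that grouping, assuming a simplified three-column layout (the sample rows below are illustrative, not taken from the dataset):

```python
import csv
import io

def rows_to_sentences(rows):
    """Group (sentence_id, word, tag) rows into sentences.

    The sentence-id column is filled only on the first word of each
    sentence, so an empty id means "continue the current sentence".
    """
    sentences = []
    for sent_id, word, tag in rows:
        if sent_id:                      # a new sentence starts here
            sentences.append([])
        sentences[-1].append((word, tag))
    return sentences

# Tiny in-memory sample mimicking the dataset layout (hypothetical rows).
sample = io.StringIO(
    "Sentence #,Word,Tag\n"
    "Sentence: 1,Thousands,O\n"
    ",of,O\n"
    ",demonstrators,O\n"
    "Sentence: 2,London,B-geo\n"
)
reader = csv.reader(sample)
next(reader)                             # skip the header row
sentences = rows_to_sentences(reader)
print(len(sentences))        # 2
print(sentences[1])          # [('London', 'B-geo')]
```

The full preparation steps live in `prepare_data_bilstm.ipynb` and `prepare_data_for_roberta.ipynb`.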

## BiLSTM and RoBERTa Source

The `src` directory contains the source code for both models, as well as the EDA, full data preparation, and inference code. I tried to follow the cookie-cutter layout but had to make some slight adjustments. The training processes are thoroughly described in the `train_*.ipynb` notebooks.

Folder tree made with the simple yet amazing repository-tree:

```
├─ src
│  ├─ data_processing
│  │  ├─ helpers.py
│  │  ├─ prepare_data_bilstm.ipynb
│  │  └─ prepare_data_for_roberta.ipynb
│  ├─ eda
│  │  └─ eda.ipynb
│  ├─ serve
│  │  ├─ predict.py
│  │  └─ requirements.txt
│  ├─ source_bilstm
│  │  └─ train_bilstm.py
│  ├─ source_roberta
│  │  ├─ requirements.txt
│  │  ├─ train_roberta.py
│  │  └─ utils.py
```
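The BiLSTM pipeline needs each sentence mapped to integer indices and padded to a fixed length before it can be fed to the network. A minimal sketch of that idea, with an illustrative vocabulary and `max_len` (not the notebook's actual values):

```python
def build_vocab(sentences, pad="PAD", unk="UNK"):
    """Map each distinct word to an integer index; 0 and 1 are reserved."""
    vocab = {pad: 0, unk: 1}
    for sent in sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab, max_len):
    """Convert a sentence to a fixed-length list of word indices."""
    ids = [vocab.get(w, vocab["UNK"]) for w in sentence]
    ids = ids[:max_len]                          # truncate long sentences
    return ids + [vocab["PAD"]] * (max_len - len(ids))   # pad short ones

# Illustrative sentences, not taken from the dataset.
sents = [["London", "is", "big"], ["Paris", "is", "bigger", "still"]]
vocab = build_vocab(sents)
print(encode(sents[0], vocab, max_len=5))   # [2, 3, 4, 0, 0]
```

The same padding idea applies to the tag sequences, so that words and labels stay aligned position by position.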

I also experimented with other architectures such as BERT, DistilBERT, and BiLSTM-CRF (which, unfortunately, is not yet supported in AWS SageMaker script mode with TensorFlowPredictor). However, RoBERTa performed better than all of them; I am curious how it will compare to BiLSTM-CRF.

## Model evaluation

Both models were tested on the same held-out test set (10% of the data) and achieved the following results:

| Model   | F1 score |
| ------- | -------- |
| BiLSTM  | 0.788    |
| RoBERTa | 0.838    |

Fine-tuned RoBERTa clearly outperforms BiLSTM, as well as all the models presented in Kaggle kernels for this dataset.
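For NER, F1 is usually computed at the entity level rather than the token level: a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. A small seqeval-style sketch of that metric (the helper names are my own, not the repo's):

```python
def bio_spans(tags):
    """Extract (entity_type, start, end) spans from a BIO tag sequence."""
    spans, etype, start = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == etype
        if not inside:
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = (tag[2:], i) if tag.startswith("B-") else (None, None)
    return spans

def entity_f1(gold, pred):
    """Micro-averaged entity-level F1 over lists of tag sequences."""
    g, p = set(), set()
    for k, (gt, pt) in enumerate(zip(gold, pred)):
        g |= {(k,) + s for s in bio_spans(gt)}   # key spans by sentence index
        p |= {(k,) + s for s in bio_spans(pt)}
    tp = len(g & p)                              # exact span-and-type matches
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

gold = [["B-geo", "I-geo", "O", "B-per"]]
pred = [["B-geo", "I-geo", "O", "O"]]
print(round(entity_f1(gold, pred), 3))   # 0.667
```

In practice a library such as seqeval does this for you; the sketch only shows why a partially correct span scores zero for that entity.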

## Model Deployment

For deployment I utilized two additional AWS services: Lambda and API Gateway. I also developed a Flask web app that enables the user to query the API.

If you need any help with Lambda or API Gateway, check out this deployment cheatsheet.
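In this pattern, API Gateway forwards the HTTP request to a Lambda function, which calls the SageMaker endpoint and returns the predictions. A minimal sketch of such a handler — the field names are illustrative, and the `predict` parameter stands in for the real SageMaker runtime call (boto3 `invoke_endpoint`) so the sketch stays self-contained:

```python
import json

def lambda_handler(event, context, predict=None):
    """Sketch of a Lambda behind an API Gateway proxy integration.

    `predict` is a stand-in for invoking the SageMaker endpoint;
    injecting it keeps this sketch runnable without AWS credentials.
    """
    body = json.loads(event["body"])
    text = body.get("text", "")
    if not text:
        return {"statusCode": 400,
                "body": json.dumps({"error": "empty text"})}
    entities = predict(text) if predict else []
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"text": text, "entities": entities}),
    }

# Simulated API Gateway event with a stub predictor.
event = {"body": json.dumps({"text": "London is calling"})}
resp = lambda_handler(event, None, predict=lambda t: [["London", "B-geo"]])
print(resp["statusCode"])   # 200
```

API Gateway's proxy integration expects exactly this response shape (`statusCode`, optional `headers`, and a string `body`), which is why the payload is JSON-encoded rather than returned as a dict.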

A brief look at the app:

*(demo GIF)*

## Further research

- Experiment with CRF layers (combined with BiLSTM and embeddings such as ELMo)
- Experiment with CNN character embeddings
- Experiment with different XAI techniques (such as LIME and eli5) to explain NER predictions


## Contributions

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
If you would like to collaborate on any of the further-research points, feel free to open an issue or message me on LinkedIn 😉