botbot-nlp

Quick start

Environmental setup:

  1. Run data/get_data.bash to download the data files. Note: data path configurations are stored in config.py

  2. Set up the Python environment:
  • Download Anaconda/Miniconda for Python 3.6
  • Create a new environment for the project by running conda create --name botbot-nlp python=3.6
  • Switch to the newly created environment by running source activate botbot-nlp (activate botbot-nlp inside Anaconda Prompt on Windows)
  • Install dependencies by running pip install -r requirements.txt or pip install -r requirements_win.txt for Windows
  • Install any missing dependencies (because of platform differences)
  3. Build the Cython modules
  • Run python setup.py build_ext --inplace (requires gcc/clang - sudo apt-get install build-essential on Debian/Ubuntu, or Xcode + CLI tools on macOS)
  4. Using Jupyter Notebook for evaluation
  • Activate the environment by source activate botbot-nlp
  • Navigate to the project root directory using cd
  • Run jupyter notebook. A browser tab should open to the Jupyter Notebook interface at localhost:8888 by default
  • Open entities_train.ipynb inside the notebook
  • Click Kernel > Run All to start training

Note: if the progress bars don't show up properly during training, run conda install -c conda-forge ipywidgets

Scripts

train_quora.py trains the paraphrasing model on the Quora duplicate questions dataset

train_sent_to_vec.py trains the InferSent model on NLI + SNLI

train_amazon_sentiment.py trains the classification model on the amazon sentiment dataset

train_conll_eval trains the entity recognition model on the CoNLL-2003 dataset

start_flask.py (to be used with -debug True) runs the NLU Flask server for debugging

Notes about using the Flask server

The model should be run under Gunicorn:

gunicorn -w 1 -t 0 -b 127.0.0.1:5000 flask_app.entrypoint:app

Arguments:

  • -w: the number of workers
  • -t: timeout in seconds (set to a high number because loading word vectors takes a while)
  • -b: optionally binds to a different address

After running the server

  1. /upload (POST) uploads a data file and trains the NLU on it; pass the file with the file argument - e.g:

curl -X POST -F "file=@./data/botbot/francis.json" 127.0.0.1:5000/upload

Returns: model_id

  2. /predict (POST) sends a query for prediction

e.g:

curl -X POST -H "Content-Type: application/json"  -d '{"query":"Hello world!", "model_id": "guid_in_previous_step"}' 127.0.0.1:5000/predict | json_pp
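
Both calls can also be made from Python. Below is a minimal client sketch using the requests package; the URL, endpoints, and field names mirror the curl examples above, and the assumption that /upload returns the model_id as its JSON body is mine, not the repo's documented contract:

import requests

BASE_URL = "http://127.0.0.1:5000"  # default binding from the Gunicorn command above

# Upload a training file; the server trains the NLU on it and returns a model_id
with open("./data/botbot/francis.json", "rb") as f:
    upload_response = requests.post(BASE_URL + "/upload", files={"file": f})
model_id = upload_response.json()  # assumption: model_id comes back as the JSON body

# Send a query for prediction against the trained model
predict_response = requests.post(
    BASE_URL + "/predict",
    json={"query": "Hello world!", "model_id": model_id},
)
print(predict_response.json())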

Extras

  1. Using GPU

Using a GPU massively speeds up training and inference (it brings training down from hours of CPU time to about an hour, or even a few minutes, depending on GPU spec)

e.g:

  • Training on CPU: Intel Core i7-4710HQ @2.7Ghz: ~45m/epoch - 12 epochs to reach 87% accuracy

  • Training on GPU: same machine, NVIDIA GeForce GTX 850M: ~4m/epoch

  • Follow the Tensorflow-GPU setup instructions at https://www.tensorflow.org/install/install_linux (including installing the exact CUDA & CuDNN versions - e.g. if using Tensorflow 1.5.0 then CUDA 9.0 and CuDNN v7.0 are required even if newer versions exist)

  • Run source activate botbot-nlp

  • Use the platform-specific commands at http://pytorch.org/ to install PyTorch

  • (Run pip uninstall tensorflow then pip install --upgrade tensorflow-gpu, then verify with the sanity check below)
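
A quick sanity check (a generic snippet, not part of this repo) confirms that both frameworks can see the GPU:

import torch
import tensorflow as tf

# Both should report True on a correctly configured GPU machine
print("PyTorch sees CUDA:", torch.cuda.is_available())
print("TensorFlow sees a GPU:", tf.test.is_gpu_available())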

  2. Using Tensorboard
  • Run tensorboard --logdir=bilstm/logs/
  • Navigate to localhost:6006 to see the training graphs (a logging sketch follows below)
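
For reference, here is a minimal sketch of how such logs are typically written, assuming the tensorboardX package (an assumption for illustration - the repo's actual logging code may differ):

from tensorboardX import SummaryWriter

writer = SummaryWriter("bilstm/logs")  # the same directory passed to --logdir above

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder value; log the real training loss instead
    writer.add_scalar("train/loss", loss, step)

writer.close()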

(The following steps are completely optional; they support deprecated code paths and might help with experimenting.)

  3. Install HDF5 for Keras model saving/loading (on macOS, using Homebrew):
brew tap homebrew/science
brew install hdf5
  • Activate the environment by source activate botbot-nlp
  • Install h5py by running pip install h5py
  4. Download NLTK data:
  • Activate the environment by source activate botbot-nlp
  • Run python -m nltk.downloader all

Notes about pytorch-qrnn

Although almost as effective as LSTMs, Salesforce's QRNN is barely maintained. I needed to create a fork to address the following issues:

  • cupy & pynvrtc should be optional imports on machines without an NVIDIA GPU (the library is supposed to support CPU as well)
  • The CUDA code is encoded twice: once by the QRNN library code and once more by pynvrtc. I needed to remove the encoding in pynvrtc.

To install, run pip install cupy pynvrtc git+https://github.com/luungoc2005/pytorch-qrnn
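
The fork keeps the upstream interface, which is designed as a drop-in replacement for nn.LSTM. A minimal usage sketch based on the upstream Salesforce README (shapes are illustrative; this runs on CPU, which is the case the fork fixes):

import torch
from torchqrnn import QRNN

seq_len, batch_size, input_size, hidden_size = 7, 20, 128, 256

x = torch.randn(seq_len, batch_size, input_size)  # same (seq, batch, features) layout as nn.LSTM
qrnn = QRNN(input_size, hidden_size, num_layers=2, dropout=0.4)
output, hidden = qrnn(x)  # output: (seq_len, batch_size, hidden_size)
print(output.shape)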

Explanation

This project demonstrates the use of Facebook's fastText for intent classification. Entity recognition is a bi-directional LSTM + CRF implemented in PyTorch, which also uses fastText word embeddings in place of Char-CNN + GloVe embeddings, for performance reasons (the other option should be re-explored some time in the future).
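
As an illustration of the fastText half of that, the official fasttext Python bindings expose supervised intent classification directly. This is a generic sketch, not the repo's actual training code; the file name and hyperparameters are invented:

import fasttext

# intents.train.txt holds one labeled utterance per line, e.g.:
# __label__greeting hello there
# __label__weather what is the forecast for tomorrow
model = fasttext.train_supervised(input="intents.train.txt", epoch=25, wordNgrams=2)

labels, probs = model.predict("hi, how are you?", k=2)  # top-2 intents with confidence scores
print(labels, probs)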

TODOS

Models:

  • Try using spaces as separate tokens (modify the wordpunct_tokenize function in bilstm/utils - see the sketch below)
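
A possible starting point for that experiment - a wordpunct-style pattern extended with a whitespace token class (an assumption about the intended change; the actual function in bilstm/utils may differ):

import re

# NLTK's wordpunct_tokenize matches \w+|[^\w\s]+; adding \s+ keeps spaces as tokens
_TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")

def wordpunct_tokenize_with_spaces(text):
    return _TOKEN_RE.findall(text)

print(wordpunct_tokenize_with_spaces("Hello, world!"))
# -> ['Hello', ',', ' ', 'world', '!']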

Other:

  • Set up a Docker image to make getting started easier
  • Set up a small Flask server and CLI to make the project easier to use