botbot-nlp

Quick start

Environmental setup:

  1. Run data/get_data.bash to download the data files. Note: data path configurations are stored in config.py

  2. Set up the Python environment:
  • Download Anaconda/Miniconda for Python 3.6
  • Create a new environment for the project by running conda create --name botbot-nlp python=3.6
  • Switch to the newly created environment by running source activate botbot-nlp (activate botbot-nlp inside Anaconda Prompt on Windows)
  • Install dependencies by running pip install -r requirements.txt or pip install -r requirements_win.txt for Windows
  • Install any missing dependencies (because of platform differences)
  3. Build the Cython modules
  • Run python setup.py build_ext --inplace (requires gcc/clang - sudo apt-get install build-essential on Debian/Ubuntu, or Xcode + CLI tools on macOS)
  4. Using Jupyter Notebook for evaluation
  • Activate the environment by source activate botbot-nlp
  • Navigate to the project root directory using cd
  • Run jupyter notebook. A browser tab should open to the Jupyter Notebook interface at localhost:8888 by default
  • Open entities_train.ipynb inside the notebook
  • Click Kernel > Run All to start training

Note: if the progress bars don't show up properly during training, run conda install -c conda-forge ipywidgets

Scripts

train_quora.py trains the paraphrasing model on the Quora duplicate questions dataset

train_sent_to_vec.py trains the InferSent model on NLI + SNLI

train_amazon_sentiment.py trains the classification model on the amazon sentiment dataset

train_conll_eval trains the entity recognition model on the CoNLL-2003 dataset

start_flask.py (to be used with -debug True) runs the NLU Flask server for debugging

Notes about using the Flask server

The model should be run under Gunicorn:

gunicorn -w 1 -t 0 -b 127.0.0.1:5000 flask_app.entrypoint:app

Arguments:

  • -w: the number of workers
  • -t: timeout in seconds (set to a high number because loading word vectors takes a while)
  • -b: optionally binds to a different address

After running the server

  1. /upload (POST) uploads a data file and trains the NLU on it; pass the file with the file argument - e.g:

curl -X POST -F "file=@./data/botbot/francis.json" 127.0.0.1:5000/upload

Returns: model_id

  2. /predict (POST) sends a query for prediction

e.g:

curl -X POST -H "Content-Type: application/json"  -d '{"query":"Hello world!", "model_id": "guid_in_previous_step"}' 127.0.0.1:5000/predict | json_pp
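
Both calls can also be made from Python. Below is a minimal client sketch using the requests package; the URL, endpoints, and field names mirror the curl examples above, and the assumption that /upload returns the model_id as its JSON body is mine, not the repo's documented contract:

import requests

BASE_URL = "http://127.0.0.1:5000"  # default binding from the Gunicorn command above

# Upload a training file; the server trains the NLU on it and returns a model_id
with open("./data/botbot/francis.json", "rb") as f:
    upload_response = requests.post(BASE_URL + "/upload", files={"file": f})
model_id = upload_response.json()  # assumption: model_id comes back as the JSON body

# Send a query for prediction against the trained model
predict_response = requests.post(
    BASE_URL + "/predict",
    json={"query": "Hello world!", "model_id": model_id},
)
print(predict_response.json())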

Extras

  1. Using GPU

Using a GPU massively speeds up training and inference (it brings training down from hours of CPU time to about an hour, or even a few minutes, depending on GPU spec)

e.g:

  • Training on CPU: Intel Core i7-4710HQ @2.7Ghz: ~45m/epoch - 12 epochs to reach 87% accuracy

  • Training on GPU: same machine, NVIDIA GeForce GTX 850M: ~4m/epoch

  • Follow the Tensorflow-GPU setup instructions at https://www.tensorflow.org/install/install_linux (including installing the exact CUDA & CuDNN versions - e.g. if using Tensorflow 1.5.0 then CUDA 9.0 and CuDNN v7.0 are required even if newer versions exist)

  • Run source activate botbot-nlp

  • Use the platform-specific commands at http://pytorch.org/ to install PyTorch

  • (Run pip uninstall tensorflow then pip install --upgrade tensorflow-gpu, then verify with the sanity check below)
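
A quick sanity check (a generic snippet, not part of this repo) confirms that both frameworks can see the GPU:

import torch
import tensorflow as tf

# Both should report True on a correctly configured GPU machine
print("PyTorch sees CUDA:", torch.cuda.is_available())
print("TensorFlow sees a GPU:", tf.test.is_gpu_available())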

  2. Using Tensorboard
  • Run tensorboard --logdir=bilstm/logs/
  • Navigate to localhost:6006 to see the training graphs (a logging sketch follows below)
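
For reference, here is a minimal sketch of how such logs are typically written, assuming the tensorboardX package (an assumption for illustration - the repo's actual logging code may differ):

from tensorboardX import SummaryWriter

writer = SummaryWriter("bilstm/logs")  # the same directory passed to --logdir above

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder value; log the real training loss instead
    writer.add_scalar("train/loss", loss, step)

writer.close()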

(The following steps are completely optional; they support deprecated code paths and might help with experimenting.)

  3. Install HDF5 for Keras model saving/loading (on macOS, using Homebrew):
brew tap homebrew/science
brew install hdf5
  • Activate the environment by source activate botbot-nlp
  • Install h5py by running pip install h5py
  4. Download NLTK data:
  • Activate the environment by source activate botbot-nlp
  • Run python -m nltk.downloader all

Notes about pytorch-qrnn

Although almost as effective as LSTMs, Salesforce's QRNN is barely maintained. I needed to create a fork to address the following issues:

  • cupy & pynvrtc should be optional imports on machines without an NVIDIA GPU (the library is supposed to support CPU as well)
  • The CUDA code is encoded twice: once by the QRNN library code and once more by pynvrtc. I needed to remove the encoding in pynvrtc.

To install, run pip install cupy pynvrtc git+https://github.com/luungoc2005/pytorch-qrnn
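
The fork keeps the upstream interface, which is designed as a drop-in replacement for nn.LSTM. A minimal usage sketch based on the upstream Salesforce README (shapes are illustrative; this runs on CPU, which is the case the fork fixes):

import torch
from torchqrnn import QRNN

seq_len, batch_size, input_size, hidden_size = 7, 20, 128, 256

x = torch.randn(seq_len, batch_size, input_size)  # same (seq, batch, features) layout as nn.LSTM
qrnn = QRNN(input_size, hidden_size, num_layers=2, dropout=0.4)
output, hidden = qrnn(x)  # output: (seq_len, batch_size, hidden_size)
print(output.shape)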

Explanation

This project demonstrates the use of Facebook's fastText for intent classification. Entity recognition is a bi-directional LSTM + CRF implemented in PyTorch, which also uses fastText word embeddings in place of Char-CNN + GloVe embeddings, for performance reasons (the other option should be re-explored some time in the future).
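
As an illustration of the fastText half of that, the official fasttext Python bindings expose supervised intent classification directly. This is a generic sketch, not the repo's actual training code; the file name and hyperparameters are invented:

import fasttext

# intents.train.txt holds one labeled utterance per line, e.g.:
# __label__greeting hello there
# __label__weather what is the forecast for tomorrow
model = fasttext.train_supervised(input="intents.train.txt", epoch=25, wordNgrams=2)

labels, probs = model.predict("hi, how are you?", k=2)  # top-2 intents with confidence scores
print(labels, probs)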

TODOS

Models:

  • Try using spaces as separate tokens (modify the wordpunct_tokenize function in bilstm/utils - see the sketch below)
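
A possible starting point for that experiment - a wordpunct-style pattern extended with a whitespace token class (an assumption about the intended change; the actual function in bilstm/utils may differ):

import re

# NLTK's wordpunct_tokenize matches \w+|[^\w\s]+; adding \s+ keeps spaces as tokens
_TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")

def wordpunct_tokenize_with_spaces(text):
    return _TOKEN_RE.findall(text)

print(wordpunct_tokenize_with_spaces("Hello, world!"))
# -> ['Hello', ',', ' ', 'world', '!']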

Other:

  • Set up a Docker image to make getting started easier
  • Set up a small Flask server and CLI to make the project easier to use