Overview

This project builds a chatbot based on movie reviews, so that you can ask it questions and chat freely about the topic.

Motivation

Recently I had to buy a new internet service, so I tried to do it through the company's chatbot. I noticed that the conversation was driven by rules and conditions: for each question I asked, the bot sent me a list of options to choose from before moving to the next step. The experience was not good and it did not solve my problem. Out of curiosity, I started searching for possible solutions and found some content on the internet about training chatbots with Natural Language Processing (NLP). After reading it, I decided to take up the challenge and train my own chatbot for natural conversations.

The chat bot works as follows:

An input message is provided by the user.
The chat bot receives this message and saves it in a data file for future improvements.
The message is preprocessed to feed the neural network, which labels it as a question (1) or an answer (0).
The same original message is also preprocessed to feed the similarity algorithm.
If a message cannot be used in either preprocessing step (for example, numbers only or special characters only), a standard emergency message is returned to the user. This message is fetched from a list of standard messages.
Depending on its label, the preprocessed message is compared with the list of messages that have the same label. For example, a message labeled as a question is compared to the questions dataset.
If a similar message is found, the chat bot returns the response associated with it.
If there is no similar message, a polite standard message is returned to the user.
All messages
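
The flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `is_usable`, `classify`, `find_similar`, `reply`, the fallback messages, and the toy datasets are all hypothetical stand-ins for the real preprocessing, neural network, similarity search, and data.

```python
import random
import re

# Hypothetical fallback messages; the project's real list is not shown here.
FALLBACK_MESSAGES = ["Sorry, I did not understand that.", "Could you rephrase that?"]

def is_usable(message):
    """Reject messages with no letters (numbers only, special characters only, empty)."""
    return re.search(r"[A-Za-z]", message) is not None

def classify(message):
    """Stand-in for the neural network: 1 = question, 0 = answer."""
    return 1 if message.strip().endswith("?") else 0

def find_similar(message, dataset):
    """Stand-in for the similarity search: exact match on the lowercased text."""
    return dataset.get(message.strip().lower())

def reply(message, questions, answers, log):
    log.append(message)  # keep the raw message for future improvements
    if not is_usable(message):
        return random.choice(FALLBACK_MESSAGES)
    # Compare only against messages that carry the same label.
    dataset = questions if classify(message) == 1 else answers
    response = find_similar(message, dataset)
    return response if response is not None else random.choice(FALLBACK_MESSAGES)

questions = {"what is it?": "a dog"}
answers = {"a dog": "what kind of dog?"}
log = []
print(reply("What is it?", questions, answers, log))  # prints "a dog"
```

Keeping the classifier and the similarity search behind small functions like these makes it possible to swap the toy stand-ins for trained components without changing the control flow.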

Concepts

Pre-processing data

The dataset is pre-processed into pairs of input-output messages, for example "what is it?"-"a dog". These pairs are used to map a given user message to the closest answer.
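
As a small sketch of the pairing idea, assuming a conversation is an ordered list of lines and each line is answered by the one that follows it (the function name `build_pairs` and the sample lines are illustrative, not taken from the project):

```python
def build_pairs(lines):
    """Pair each message with the message that follows it in the conversation."""
    return list(zip(lines[:-1], lines[1:]))

conversation = ["what is it?", "a dog", "is it yours?"]
pairs = build_pairs(conversation)
print(pairs)  # [('what is it?', 'a dog'), ('a dog', 'is it yours?')]
```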

Page Rank

A graph of similar messages is built to feed the Page Rank algorithm, so that the most relevant messages are ranked at the top of the list. The rank is used to choose the output message.
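
To make the ranking step concrete, here is a minimal power-iteration sketch of Page Rank in plain Python. It assumes the graph is a dict that maps each message to the list of messages it is linked to, and that every linked message is also a key of the dict. How the project actually builds its graph and which Page Rank implementation it uses are not stated here, so treat the function name, the damping factor, and the example graph as assumptions.

```python
def page_rank(graph, damping=0.85, iterations=50):
    """Power-iteration Page Rank over a dict mapping each node to its neighbours."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for neighbour in graph[node]:
                new_rank[neighbour] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank

# Message "a" is similar to both "b" and "c", so it ends up with the highest rank.
similar = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
ranks = page_rank(similar)
print(max(ranks, key=ranks.get))  # prints "a"
```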

Cosine similarity

Cosine similarity is used to match the user's input message against the most similar messages in the dataset. For each message in the dataset, the similarity value is added to the message's Page Rank. The message with the highest total (Page Rank + similarity) is returned to the user.
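
The scoring described above can be sketched like this. The example represents each text as a simple word-count vector, which is an assumption: the project may vectorize messages differently (for instance with sklearn). The names `cosine_similarity` and `best_match` and the sample ranks are illustrative only.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two texts using word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def best_match(user_message, messages, ranks):
    """Return the message with the highest (Page Rank + similarity) score."""
    return max(messages, key=lambda m: ranks.get(m, 0.0) + cosine_similarity(user_message, m))

messages = ["what is it", "who are you"]
ranks = {"what is it": 0.1, "who are you": 0.2}
print(best_match("what is it", messages, ranks))  # prints "what is it"
```

Because cosine similarity lies between 0 and 1 for count vectors while Page Rank values sum to 1 across the whole dataset, the similarity term usually dominates the sum and the rank mostly acts as a tie-breaker.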

How it was done

Used a dataset with fictional conversations about movies
Processed the data to build the sequence of conversations
Applied capitalization, lemmatization and stemming to reduce the variation of words
Enriched the dataset with more features (similarity of sentences)
Trained each message with its corresponding answer using a Neural Network
Built a user interface to allow the interaction with the chatbot
Deployed the chatbot on a free public hosting platform (Heroku)
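
The word-normalization step in the list above can be illustrated with a toy example. The suffix stripping below is a deliberately crude stand-in for real lemmatization and stemming, which would normally come from an NLP library; the function name `normalize` and the suffix list are assumptions made for this sketch.

```python
import re

def normalize(message):
    """Lowercase, drop non-alphanumeric characters, and crudely strip common suffixes."""
    words = re.sub(r"[^a-z0-9\s]", " ", message.lower()).split()
    stemmed = []
    for word in words:
        for suffix in ("ing", "ed", "s"):
            # Only strip when a reasonably long stem remains.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        stemmed.append(word)
    return " ".join(stemmed)

print(normalize("Watching Movies!"))  # prints "watch movie"
```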

The chatbot

Used libraries

pandas
re
keras
numpy
sklearn
scipy
train_test_split (part of sklearn)
math

Interface

The chat bot is deployed at https://chatbotnaive.herokuapp.com/, so give it a try :)

Installing the chatbot locally

pip3 install -r requirements.txt

Note: on Windows, install Xming and export the DISPLAY variable. The server must be running before launching the UI. More details in this answer: https://stackoverflow.com/questions/39804366/tclerror-no-display-name-and-no-display-environment-variable-on-windows-10-bas/39805613.

Running the app server

cd src/
python3 app.py

Open the URL shown by the server, for example http://127.0.0.1:5000/

Running the chatbot in CLI

cd src/
python3 run_cli.py

Running the chatbot in a Desktop UI

export DISPLAY=:0.0
cd src/
python3 run_ui.py

Running the tests and coverage

cd src/
sh coverage.sh

The coverage report is generated in htmlcov/index.html

The current coverage is:

Name                        Stmts   Miss  Cover
-----------------------------------------------
backend/__init__.py             0      0   100%
backend/chatbot.py             40      3    92%
backend/dataset.py             28      0   100%
backend/pre_processing.py      62      0   100%
backend/predict.py             34      3    91%
backend/similarity.py          46      0   100%
backend/utils.py               17      0   100%
settings.py                    16      0   100%
-----------------------------------------------
TOTAL                         243      6    98%

Attention

This chat bot was developed on WSL Ubuntu, so it is not guaranteed to work in other environments.
To retrain the chat bot, run the notebooks in file order (001, 002, ...). The notebooks may need to be adapted depending on your dataset.
The notebooks generate the 3 datasets used by the chat bot: movie_lines_pre_processed_for_test.tvs, page_rank_questions.txt and page_rank_answers.txt. If you retrain, take the generated files from notebooks/chatdata and put them in src/chatdata.
The model.h5 and tokenizer.pickle files are also generated by the notebooks, and both must be copied to src/chatdata.
This chat bot was developed using 30000 messages due to performance issues, so pay attention to the size of your dataset if you are retraining the chat bot.
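
If you retrain often, the copy step can be scripted. The sketch below uses the file names listed above and assumes that model.h5 and tokenizer.pickle are also written to notebooks/chatdata, which the notes do not state explicitly. The function name `copy_generated_files` is hypothetical.

```python
import shutil
from pathlib import Path

# File names taken from the notes above.
GENERATED_FILES = [
    "movie_lines_pre_processed_for_test.tvs",
    "page_rank_questions.txt",
    "page_rank_answers.txt",
    "model.h5",
    "tokenizer.pickle",
]

def copy_generated_files(source_dir="notebooks/chatdata", target_dir="src/chatdata"):
    """Copy the generated files found in source_dir; return the names that were missing."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in GENERATED_FILES:
        if (source / name).is_file():
            shutil.copy2(source / name, target / name)
        else:
            missing.append(name)
    return missing
```

Calling `copy_generated_files()` from the repository root copies whatever is available and returns the list of files it could not find, so a non-empty result is a hint that a notebook has not been run yet.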

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Star the repository if you liked it.

License

MIT

douglasdcm/chatbot_for_movies
