discord-chatbot

A Keras project that builds a conversational model for use by a Discord bot.

Environment

I have an AMD GPU, so a Docker container that supports it is provided.

If you do not have an AMD GPU, you can change which TensorFlow package is installed in requirments.py.
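For example (a hypothetical sketch; I am assuming requirments.py pip-installs its packages, and that tensorflow-rocm is the AMD ROCm build on PyPI):

# Hypothetical: swap the AMD build for the standard one.
# tensorflow-rocm  -> AMD (ROCm) GPUs
# tensorflow       -> CPU / NVIDIA (CUDA) builds
import subprocess, sys

subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow'])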

Process

Words need to be converted to tokens before the model can learn from them. I use Keras's tokenizer: https://keras.io/preprocessing/text/.

Note: The model cannot learn new words dynamically; unknown words it receives are mapped to a special token. To learn more words, the process must start over and the model must be retrained.
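As a minimal sketch of that behavior (illustrative only; assumes the tokenizer is constructed with an oov_token):

from keras.preprocessing.text import Tokenizer

# Fit on the training vocabulary only.
t = Tokenizer(oov_token='<unk>')
t.fit_on_texts(['hello there friend'])

# A word never seen during fitting maps to the single <unk> index,
# so new vocabulary cannot be picked up without refitting and retraining.
print(t.texts_to_sequences(['hello stranger']))  # [[2, 1]] -- 1 is <unk>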

The following process depends on the contents of settings.toml.

  1. Prepare the data by running python prep.py -> produces data/training/data.txt
  2. Create the model by running python makemodel.py -> produces models/production.h5
  3. Continue training the model by running python3 train.py -> updates models/production.h5

Note that the tokenizer is saved alongside the model, at the path given by settings['tokenizer']['production'].
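A hedged sketch of what loading both artifacts for inference might look like (the tokenizer path and pickle format here are assumptions; only models/production.h5 comes from the steps above):

import pickle

from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

model = load_model('models/production.h5')
with open('tokenizers/production.pickle', 'rb') as f:  # hypothetical path
    tokenizer = pickle.load(f)

# Encode and pad a message the same way the training data was prepared.
seq = pad_sequences(tokenizer.texts_to_sequences(['hello bot']), maxlen=20)  # maxlen assumed
prediction = model.predict(seq)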

Data

The training data is located in data/train/data.txt; the path is specified in settings.toml under files->training. The file is a list of sentences separated by newlines, where every other sentence is a reply to the one before it.
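For illustration, reading the file back into (input, reply) pairs under that layout might look like:

# Read data/train/data.txt into (input, reply) pairs.
with open('data/train/data.txt') as f:
    lines = [line.strip() for line in f if line.strip()]

# Every other sentence answers the one before it.
pairs = list(zip(lines[0::2], lines[1::2]))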

Every dataset name listed in settings.toml->preperation->sets is used as a key into data_preperation_procedures to call the function that prepares that set.

Adding sets

  1. Place the data set in its own subfolder under data/sets.
  2. Add a unique name for the set in settings.toml->preperation->sets.
  3. Create a uniquely named function in prep.py that parses the data.
  4. Add that function to data_preperation_procedures in prep.py as a value, with the name from step 2 as the key.

The function added in step 3 returns the data as (input sentence)\n(output sentence)\n ... The output of all sets is concatenated and consolidated into data/train/data.txt.
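A hypothetical sketch of steps 3 and 4 inside prep.py (the set name cornell and its parser body are invented for illustration; only the return contract comes from above):

def prep_cornell():
    """Parse data/sets/cornell into alternating input/output sentences."""
    # Parsing is set-specific; this stub only shows the return contract.
    pairs = [('hi there', 'hello'), ('how are you', 'fine thanks')]
    return ''.join(f'{q}\n{a}\n' for q, a in pairs)

# Step 4: the key matches the name listed in settings.toml->preperation->sets.
data_preperation_procedures = {
    'cornell': prep_cornell,
}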

Recommended sets

The Model

LSTM (Long Short-Term Memory)

This is the primary layer of the model. An LSTM maintains an internal state across time steps, letting it predict the next element of a sequence from the elements that came before it. I found this tutorial to be an adequate introduction: https://adventuresinmachinelearning.com/keras-lstm-tutorial/
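A minimal Keras sketch of this kind of model (layer sizes and lengths are illustrative, not the project's actual settings):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 5000  # assumed: size of the fitted tokenizer's vocabulary
seq_length = 20    # assumed: padded sentence length

model = Sequential([
    Embedding(vocab_size, 128, input_length=seq_length),
    LSTM(256),                                # the primary sequence layer
    Dense(vocab_size, activation='softmax'),  # next-token probabilities
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')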

Analysis

PCA

Basic sentences cluster together, while more complicated sentences spread toward the fringes. I found that a large number of components is required to capture 90% or more of the variance in a given sentence.
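That check can be reproduced with scikit-learn's PCA (a sketch; the random matrix stands in for tokenizer-encoded sentences):

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for tokenizer.texts_to_matrix(sentences, mode='count').
X = np.random.rand(100, 50)  # 100 sentences, 50-word vocabulary

pca = PCA(n_components=0.90)  # keep enough components for 90% of the variance
reduced = pca.fit_transform(X)
print(pca.n_components_)      # how many components that actually takes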

Vocab representation

The Keras tokenizer can provide a wealth of information. For example, this snippet is taken from https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/:

from keras.preprocessing.text import Tokenizer

# define 5 documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

Output:

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'work': 1, 'effort': 6, 'done': 3, 'great': 5, 'good': 4, 'excellent': 8, 'well': 2, 'nice': 7}
{'work': 2, 'effort': 1, 'done': 1, 'well': 1, 'good': 1, 'great': 1, 'excellent': 1, 'nice': 1}
[[ 0.  0.  1.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.]]
