Pasadóir is an IRC bot that generates text using individualized Markov chains based on given source material. It can be used to generate "new" phrases from a user based on things they have said before. It is written in python, using the Twisted framework.
The term "lookback tuple" refers to current state in the Markov chain. The length of the lookback tuple determines the quality of quote generated. In general, a lower tuple length will lead to longer, more rambling, and less coherent phrases. Conversely, a higher length will generate shorter, less varied, and more syntactically correct quotes, which are more likely to be exact matches to underlying source material. The tuple length is configurable for Pasadóir, and is set to 2 by default.
The term "transition table" is used here to describe how lookback tuples are mapped to potential following tokens.
Take a user with the following input set:
she sells sea-shells by the sea-shore
the dog was eating sausages by the dozen
The following are the possible outputs, where the lookback tuple length is 2:
she sells sea-shells by the sea-shore
she sells sea-shells by the dozen
the dog was eating sausages by the dozen
the dog was eating sausages by the sea-shore
There must be a single input directory, containing the items listed below. The location of this directory does not matter; the directory name will be passed at run-time.
These are plain text files, each with the extension .src
. The filename minus the extension is the username. The following format must apply.
- Each line of input is to be on its own line in the file.
- Each line is to have at least the same number of words as the lookback tuple length.
How the source material is generated is up to you. Typically, it will involve parsing IRC log files, stripping out very short lines, and maybe some normalization.
Note that the more input (both number of lines and number of words per line) we have for a user, the better the output will be. The generator works best when there are many possible successors to each tuple of words; the more possible successors, the more varied the generated lines.
Say we have a user with username kitten
. In our sources directory there must be a file named kitten.src
. Its contents could be something like this:
I yawn in your general direction
my occupation is mousing
om nom nom tasty fish
I'm just going to snooze here while I wait for my opportunity to pounce
An optional file called source.info
may be added to the input directory. If present, this would contain metadata about source generation. The following attributes are supported.
date
: Unix timestamp of when the source material was generated.channels
: Names of IRC channels used to generate the source material.
date=1489964352
channels=#underthetree #kitten_test #sleep
An optional file called merge.lst
may be added to the input directory. If present, this would contain mappings of user aliases (other nicks a user has been known by) to canonical nicknames. The file should be laid out as follows.
- Each user must be on its own line.
- Lines consist of tab-separated nicks.
- The first nick is the canonical one; all other nicks are aliases.
kitten\tkitten_\tkitten1
user1\tUserOne
- Python 3: tested with Python 3.8.10, other versions of Python 3.x may also work
- python3.x-venv
python3 -m venv venv
. venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
To run unit tests, run the following from the checkout location.
export PYTHONPATH=$PWD/src:$PYTHONPATH
python -m unittest discover test
Run the following command from the checkout location.
python src/bot.py <host> <port> <channel> <src_dir>
Where
host
andport
are the host and port to run on.channel
is the IRC channel for the bot to run in.src_dir
is the location of the directory containing the source material as described above.
By default, the bot will connect with the nick pasadoir
. If this is taken, then it will fail. The nick may be configured to something else in config.py
.
The instructions below indicate what to type in IRC to prompt the bot. First, you must connect to IRC as usual and join the channel the bot is in.
There are several different triggers for the Pasadóir bot. The bot looks only at lines starting with those characters, and ignores all others. These are all configurable, with the following defaults.
!
^
~
: these are all used to generate quotes; the differences between them will be described below.@
: this is used for non-quote related commands.
Quotes are generated by typing one of the quote-generating triggers followed by a nickname. To generate a quote based on a single user named kitten
, type the following.
!kitten
A quote may be generated from the source of multiple users combined. There is a limit to the number of users a quote may be generated for; this is determined by the cache size (see the section on caching below).
The syntax for this is to separate nicks with a colon.
!kitten:daffodil:user1
The ordering does not matter. The configurations below (or any other possible combinations) would produce exactly the same results.
!kitten:user1:daffodil
!daffodil:user1:kitten
Optionally, quotes may be seeded, meaning it is possible to give some words for the quote to start from.
A few notes on this:
- Seed words must be space-separated.
- While nicks and special-feature triggers are case-insensitive, seed words are case-sensitive.
- Seed words will match from anywhere within a user's (or multiple users') material, i.e. not just on their starting tuples.
- The number of seed words given must be less than or equal to the lookback tuple length.
- If no matching tuples or partial tuples are found, the bot will not output anything.
The different quote triggers are used to vary the behaviour of the seeding. The following defaults apply.
^
indicates that the quote is to start with the words given.~
indicates that the quote is to end with the words given.!
is used to indicate that the words given may appear anywhere in the quote, i.e. the generation runs bidirectionally from the seed words.
As an example, imagine a user kitten
with the following source material, and where the tuple length is 2:
I yawn in your general direction
The following will be the result of various interactions:
| Input | Output |
|----------------------|-------------------------------------------|
| ^kitten I | [kitten] I yawn in your general direction |
| ^kitten I yawn | [kitten] I yawn in your general direction |
| ^kitten your | [kitten] your general direction |
| ~kitten direction | [kitten] I yawn in your general direction |
| ~kitten your general | [kitten] I yawn in your general |
| !kitten I | [kitten] I yawn in your general direction |
| !kitten direction | [kitten] I yawn in your general direction |
| !kitten your | [kitten] I yawn in your general direction |
| !kitten in your | [kitten] I yawn in your general direction |
Note that either the starts-with trigger ^
or the ends-with trigger ~
may also be used without arguments to generate basic quotes as well; however, without seed words there is no specific reason to use them.
- If the bot is called on a username that does not exist, then it does not do anything.
- If the bot is called on a nick that is an alias, then the real user will be resolved.
- If it is called for a combination of a user that exists and one that does not, then it will only return a quote for one that exists.
To see basic general numbers, run:
@stats
To see basic information relating to a specific user, run:
@stats <nick>
At start time, a user list is constructed based on the source files found in the given source directory. However, as the source material itself may be quite large, it is not loaded at start time. Because building transition tables may be quite slow (especially for multi-user quotes) a cache is maintained containing tables for the most recently-called users (or combinations of users).
When a quote is requested for a user that is not already in the cache, that user's transition table is built and placed in the cache. The cache is keyed by user(s) and by directionality (keeping in mind that a different table must be built depending on whether the quote runs forwards or backwards from a seed).
The number of slots in the cache is configurable. If the cache is full and a quote is requested of a user not in the cache, then the least recently called user is evicted.
When generating a quote from multiple users' source, each of those users is read into the cache, as well as their combined transition table. For this reason, the number of users that may be combined is limited by the cache size.
If a quote is requested for a user already in the cache, then that user's transition table is returned directly. This speeds up access, but may also lead to stale information, e.g. where the content on disk has been updated. To purge the cache, run:
@refresh