mjziolko/ThenThanMistakeNoted

ThenThanMistakeNoted

Reddit bot that attempts to detect when someone comments with the incorrect usage of then or than.

How it Works

The bot uses PRAW to interact with the Reddit API.
In this implementation of the bot, PRAW is used to gather new comments from a specified list of subreddits and to comment on posts that have been identified as misusing then or than.

As the bot is gathering comments, it will identify which ones contain any mention of then or than and store them in a list to analyze later.
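The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the function names are hypothetical, and it works on plain comment strings rather than on PRAW comment objects.

```python
import re

# Matches "then" or "than" as whole words, case-insensitively.
THEN_THAN = re.compile(r"\b(then|than)\b", re.IGNORECASE)

def mentions_then_or_than(text):
    """Return True if the text contains 'then' or 'than' as a whole word."""
    return THEN_THAN.search(text) is not None

def collect_candidates(comment_bodies):
    """Keep only the comment bodies worth analyzing later."""
    return [body for body in comment_bodies if mentions_then_or_than(body)]
```

The word-boundary match keeps words such as "Athens" from being picked up just because they contain the letters "then".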

Once the bot has finished gathering comments, it begins an analysis phase, running an algorithm I developed to identify mistakes in usage.

There are 3 main parts to the algorithm:

In part one, the algorithm collects data about the words surrounding the then/than. It grabs up to three words on either side, along with their positions, and stores them in a database table specific to then or than, depending on which one was used in the sentence. For instance, the sentence "This implementation is better than the other one." would yield the pairs (better, -1), (is, -2), (implementation, -3), (the, 1), (other, 2), (one, 3). When the pairs are stored in the database, a running count of the occurrences of each word/position combination is kept. Once the bot has processed every sentence, it proceeds to part two.
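As a rough sketch of part one, the snippet below extracts the word/position pairs and counts them. It makes several assumptions: the two in-memory counters stand in for the two database tables, the simple regex tokenizer is a placeholder, only the first then/than in a sentence is handled, and all names are hypothetical.

```python
import re
from collections import Counter

WINDOW = 3  # up to three words on either side of then/than

def context_pairs(sentence, window=WINDOW):
    """Return (target, [(word, offset), ...]) for the first then/than, else None."""
    words = re.findall(r"[a-z']+", sentence.lower())
    for i, word in enumerate(words):
        if word in ("then", "than"):
            pairs = []
            for offset in range(-window, window + 1):
                j = i + offset
                # Skip the target itself and positions outside the sentence.
                if offset != 0 and 0 <= j < len(words):
                    pairs.append((words[j], offset))
            return word, pairs
    return None

# One counter per target word stands in for the two database tables.
counts = {"then": Counter(), "than": Counter()}

def record(sentence):
    """Add the sentence's word/position pairs to the matching counter."""
    found = context_pairs(sentence)
    if found is not None:
        target, pairs = found
        counts[target].update(pairs)
```

Calling `context_pairs("This implementation is better than the other one.")` yields the same six pairs listed in the example above, with "than" as the target.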

Part two attempts to determine a certainty that the commenter has used then or than incorrectly. The higher the certainty value, the more likely the bot is to treat the usage as incorrect and comment on the post. It determines the certainty by first grabbing the most frequently seen word/position combinations from the previous step. It then checks whether the sentence being analyzed matches any of the pairs frequently seen around the opposite word from the one actually used. Using the previous example, but modified to use the incorrect word: "This implementation is better then the other one." and a top generated pair list of [(better, -1), (rather, -1), (better, -3), (greater, -1), (the, 1)], the bot would see that the sentence contains "better" at the -1 position, and since the list was generated from the common words surrounding than, it ranks the usage as fairly likely to be incorrect. It assigns a value depending on how common (better, -1) is in the list as well as its distance from the then/than in question. Following this, it repeats the exact same process with the list generated from then. The difference between the two values is the bot's confidence level. If the sentence scored 90 in the incorrect analysis and 10 in the correct analysis, its final confidence is 80. The bot then proceeds to part three of the algorithm.
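The scoring in part two might look something like the sketch below. The exact formula is not given here, so the weighting is an assumption: each matched pair contributes its share of the list's total count, divided by its distance from the then/than. The function names and the sample counts in the usage note are made up for illustration.

```python
def pattern_score(sentence_pairs, top_pairs):
    """Score how strongly a sentence matches a list of frequent pairs.

    top_pairs maps (word, offset) -> occurrence count. Weighting by share of
    the list and by 1/|offset| is an assumption, not the original formula.
    """
    total = sum(top_pairs.values())
    if total == 0:
        return 0.0
    score = 0.0
    for word, offset in sentence_pairs:
        count = top_pairs.get((word, offset), 0)
        score += (count / total) / abs(offset)
    return 100.0 * score

def confidence(sentence_pairs, used, top_then, top_than):
    """Score against the opposite word's list minus score against the used word's list."""
    if used == "then":
        opposite, same = top_than, top_then
    else:
        opposite, same = top_then, top_than
    return pattern_score(sentence_pairs, opposite) - pattern_score(sentence_pairs, same)
```

For example, with invented counts `top_than = {("better", -1): 50, ("rather", -1): 30, ("the", 1): 20}` and `top_then = {("and", -1): 60, ("the", 1): 40}`, the sentence "This implementation is better then the other one." scores higher against the than list than against the then list, so the confidence that then was misused comes out positive.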

Part three compares the confidence level obtained in part two against a threshold. The threshold was generated by me manually telling the bot when it was correct or incorrect. If the confidence was above the threshold and the usage was in fact a misuse, the bot would comment and leave the threshold the same. Similarly, if the confidence was below the threshold and the usage was in fact correct, it would not comment and would leave the threshold the same. If the confidence was above the threshold but the usage was actually correct, the bot would raise the threshold to compensate for the false positive. Finally, if the confidence was below the threshold but the usage was incorrect, it would lower the threshold. This teaching phase lasted only a couple of days before I was comfortable with the results and fully automated the bot. Once fully automated, the bot simply compared the confidence level with the threshold and decided whether to comment based on that alone.
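The teaching phase in part three can be sketched as a small update rule. The step size below is hypothetical, since the amount by which the threshold was raised or lowered is not stated, and the function name is made up for illustration.

```python
def update_threshold(threshold, confidence, actually_misused, step=1.0):
    """Return (should_comment, new_threshold) for one manually labeled example.

    step is a hypothetical adjustment size; the original value is not stated.
    """
    should_comment = confidence > threshold
    if should_comment and not actually_misused:
        threshold += step  # false positive: demand more confidence next time
    elif not should_comment and actually_misused:
        threshold -= step  # missed mistake: demand less confidence next time
    return should_comment, threshold
```

After the teaching phase, only the `confidence > threshold` comparison would remain, with the threshold held fixed.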

After the algorithm has run through these steps, it starts from the beginning again and continues to gather new comments.

Unfortunately, my HDD containing the database has crashed and the bot is no longer running. I lost my entire set of data in the crash, which contained tens of thousands of entries for comments viewed and word pairs gathered, and hundreds of entries for comments that the bot made.

Other Methods

While working on the project, I became very interested in natural language processing and researched many methods for improving my bot's results. My algorithm worked very well for what it did and achieved about a 90% success rate for mistakes it identified, but it also tossed out a lot of mistakes that it couldn't have known about due to the simplistic nature of pattern matching.

One of the first methods I tried after developing this algorithm was to use grammar tags to replace the word pattern matching. I thought that there might be more concrete grammar elements used before a than over a then. For instance, than is used for comparison purposes, so I set out to determine whether comparative adverbs or comparative adjectives predominantly surrounded than and not then. I used the Python library NLTK (Natural Language Toolkit) to tokenize the sentence and automatically generate the grammar tags.
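A minimal sketch of the grammar-tag idea is shown below. It assumes the sentence has already been tagged as a list of (word, tag) tuples with Penn Treebank tags, which is the shape `nltk.pos_tag(nltk.word_tokenize(sentence))` returns once NLTK's tokenizer and tagger data are downloaded. The helper name and the three-token window are assumptions, and this is not the original experiment's code.

```python
# Penn Treebank tags: JJR = comparative adjective, RBR = comparative adverb.
COMPARATIVE_TAGS = {"JJR", "RBR"}

def comparative_before(tagged_tokens, window=3):
    """Return True if a comparative tag appears within `window` tokens before then/than."""
    for i, (word, _tag) in enumerate(tagged_tokens):
        if word.lower() in ("then", "than"):
            preceding = tagged_tokens[max(0, i - window):i]
            if any(tag in COMPARATIVE_TAGS for _word, tag in preceding):
                return True
    return False
```

With hand-written tags such as `[("This", "DT"), ("is", "VBZ"), ("better", "JJR"), ("than", "IN"), ("that", "DT")]`, the function reports a comparative just before than.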

It turns out this method returned wildly varying results, and then and than appeared to have too much in common grammar-wise for my algorithm to tell them apart.

Following this, while researching natural language processing further, I came across another promising option. Microsoft has a service called Microsoft Cognitive Services that contains a lot of interesting APIs for things like image recognition, natural language processing, speech recognition, and more. The APIs have some restrictions on usage per month, but I felt the free tier was very generous for what was on offer. In my case, their Web Language Model API was free for 100,000 transactions/month and $.05/1000 transactions after that. I decided to give it a shot, specifically the Calculate Conditional Probability module in the WebLM API. This module will "Calculate the conditional probability that a particular word will follow a given sequence of words." I thought it would be perfect to pass in the words preceding a potential then/than misuse and compare which of the two words Microsoft rated as more probable to follow. Unfortunately, the results were not as great as I had originally anticipated. It provided correct results occasionally, but was wrong often enough that it did not warrant a change from my original algorithm. I was, however, hopeful enough to retain this code in my unstable branch, as I had hoped to tinker with the functionality to see if I could pull out any better results.
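To illustrate the idea of comparing conditional probabilities, here is a toy stand-in that does not call the WebLM API at all. It estimates the probability of the next word from bigram counts over a tiny corpus and conditions on a single preceding word, whereas the service accepted a sequence of words. Every name here, and the corpus in the usage note, is made up for illustration.

```python
from collections import Counter

def bigram_counts(corpus_sentences):
    """Count (previous word, next word) pairs in a list of sentences."""
    counts = Counter()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts

def conditional_probability(counts, previous, candidate):
    """Estimate P(candidate | previous) from raw bigram counts."""
    total = sum(n for (prev, _next), n in counts.items() if prev == previous)
    return counts[(previous, candidate)] / total if total else 0.0

def likelier_word(counts, previous):
    """Pick whichever of then/than is more probable after `previous` (ties go to then)."""
    p_then = conditional_probability(counts, previous, "then")
    p_than = conditional_probability(counts, previous, "than")
    return "than" if p_than > p_then else "then"
```

Given counts built from a corpus like `["this is better than that", "better than ever", "we ate and then left"]`, the sketch prefers than after "better" and then after "and".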

With these two results being sub-optimal, I decided there was nothing left to do besides research machine learning myself to see if I could come up with something that would improve my algorithm. It was at this stage that my HDD died and I lost all of my progress. I decided my time would be better spent focusing on the remainder of my internship and studying for future interviews for when I graduate, rather than rebuilding everything I had lost in the crash.

Reddit's Reception

This was meant to be a fun little project that I hoped people would get a kick out of on Reddit. The reception was generally positive: quite a few people loved its execution, and it occasionally sparked enough interest for people to ask how it worked. I had a couple of discussions with Reddit users who were concerned about the morality behind my bot. They didn't take as kindly to a faceless robot critiquing people's grammar, especially in an area of linguistics where the actual difference between then and than might not matter so much, as the two words are beginning to become almost interchangeable. I also received quite a bit of criticism from users who were offended that they were publicly called out, some numerous times, and from others who felt the bot derailed discussions in some of the more focused subreddits. I tried my best to make sure my bot was not actively harassing anyone, mostly by monitoring its activity and sometimes even shutting it down when I couldn't police it. I also tried to stay away from subreddits where the discussions were more serious in nature and where the bot’s presence would be seen as inappropriate.

Ultimately, I greatly enjoyed my time developing this bot and also my conversations through the bot with curious and passionate Reddit users. I hope that this project might be of interest to someone or anyone who is remotely curious about natural language processing or perhaps even just computer science in general.

Thank you for reading!
