This repository contains the final project for Princeton COS598C, Spring 2020, taught by Prof. Danqi Chen.
👉 In short, the project explored the performance of frozen and unfrozen pre-trained BERT-base-uncased
models on joke recognition, and the effect that dataset composition has on performance.
🧠 Obtained good performance on joke recognition. Observed that cleaner datasets give better results, and that assigning lower weight to poor-quality samples can help. Broadening the domain of the training data improves performance on some tasks. Selecting the best combination of datasets with frozen BERT, with the intent of then training an unfrozen BERT on it, did not work.
📜 See the final report in this repository
❗❗❗ All datasets except the first one contain offensive jokes; be careful
😂 Humicroedit and FunLines datasets
Nabil Hossain, John Krumm, and Michael Gamon. "President Vows to Cut Taxes Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. 2019. In NAACL.
FunLines - Nabil Hossain, John Krumm, Tanvir Sajed, and Henry Kautz. "Stimulating Creativity with FunLines: A Case Study of Humor Generation in Headlines." arXiv preprint (2020).
Description: Both datasets pair an original news title with a “microedited” one, where a single word was replaced to make it funny. Readers then assigned each edited title a funniness score from 0 to 3; every title was reviewed by multiple people.
Download from: https://www.cs.rochester.edu/u/nhossain/humicroedit.html
😂 CrowdTruth Short-Text-Corpus-For-Humor-Detection
Description: Scraped from Twitter, contains posts from “funny” accounts, as well as Reuters headlines, English proverbs, and Wikipedia sentences. This results in approximately 22K funny items and 21K neutral posts.
Download from: https://github.com/CrowdTruth/Short-Text-Corpus-For-Humor-Detection
😂 Kaggle, Short Jokes
Download from: https://www.kaggle.com/abhinavmoudgil95/short-jokes
😂 Puns, Reddit full and short jokes
Download from: https://github.com/orionw/RedditHumorDetection
🤔 For additional balancing: A Million News Headlines
Use the most recent 200K news titles: https://www.kaggle.com/therohk/million-headlines
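The balancing step above amounts to labeling jokes as positive, taking the most recent news headlines as negatives, and shuffling the combined set. A minimal pandas sketch of that idea is below; the column names (`text`, `publish_date`, `label`) and the helper itself are illustrative assumptions, not taken from the repo's notebooks:

```python
import pandas as pd

def balance_with_headlines(jokes: pd.DataFrame, headlines: pd.DataFrame,
                           n_recent: int = 200_000) -> pd.DataFrame:
    """Label jokes 1 and headlines 0, keep only the most recent
    headlines, and return one shuffled combined dataset."""
    jokes = jokes.assign(label=1)
    # Assume headlines carry a sortable publish_date column, as in the Kaggle dump.
    recent = headlines.sort_values("publish_date").tail(n_recent).assign(label=0)
    combined = pd.concat([jokes[["text", "label"]], recent[["text", "label"]]],
                         ignore_index=True)
    return combined.sample(frac=1, random_state=0).reset_index(drop=True)

# Tiny synthetic demo (made-up rows, just to show the shapes)
jokes = pd.DataFrame({"text": ["joke a", "joke b"]})
headlines = pd.DataFrame({"text": ["news 1", "news 2", "news 3"],
                          "publish_date": [20190101, 20200101, 20200201]})
mixed = balance_with_headlines(jokes, headlines, n_recent=2)
print(mixed["label"].value_counts().to_dict())  # → {0: 2, 1: 2}
```

In practice you would read the Kaggle CSVs into `jokes` and `headlines` instead of building them inline.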
BERT.ipynb contains a simple working fine-tuning example
Dataset analysis and preparation.ipynb describes the datasets and explains how the data was mixed
dataset comparison and assemble.ipynb assembles the mixed datasets
Humicroedit.ipynb takes a closer look at the Humicroedit dataset, since it is an interesting one
Followed this tutorial for BERT fine-tuning: https://medium.com/swlh/painless-fine-tuning-of-bert-in-pytorch-b91c14912caa
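The frozen/unfrozen distinction studied here comes down to whether the encoder's parameters receive gradient updates during fine-tuning. A minimal PyTorch sketch of that pattern is below; it uses a tiny stand-in encoder instead of the real `bert-base-uncased` so it runs without any download (with the `transformers` library you would substitute `BertModel.from_pretrained("bert-base-uncased")` for the stand-in):

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for BERT-base: maps token ids to one pooled vector."""
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids):
        return self.embed(input_ids).mean(dim=1)  # crude "pooled" output

class JokeClassifier(nn.Module):
    def __init__(self, encoder, hidden=32, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden, 2)  # joke vs. not-joke
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False  # frozen: encoder gets no updates

    def forward(self, input_ids):
        return self.head(self.encoder(input_ids))

model = JokeClassifier(TinyEncoder(), freeze_encoder=True)
# In the frozen setting only the classification head is trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # → ['head.weight', 'head.bias']
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```

Passing `freeze_encoder=False` gives the unfrozen setting, where the optimizer also updates all encoder weights.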