ksenia007/humor_recognition

Humor Project

This is the repository for the Princeton COS598C final project, Spring 2020, taught by Prof. Danqi Chen.

👉 In short, the project explored the performance of frozen and unfrozen pre-trained BERT-base-uncased models on joke recognition, and the effect that dataset formation has on performance.

🧠 Obtained good performance on joke recognition. Observed that cleaner datasets give better results and that assigning a lower weight to poor-quality samples could be useful. Broadening the domain of the training data results in better performance on some tasks. Selecting the best combination of datasets with a frozen BERT, with the intent of then training an unfrozen BERT on that combination, would not work.
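The idea of giving poor-quality samples a lower weight can be illustrated with a weighted binary cross-entropy. This is only a minimal sketch in plain Python: the function name `weighted_bce`, the example probabilities, and the weight of 0.5 are all illustrative assumptions, not the scheme used in the report.

```python
import math

def weighted_bce(probs, labels, weights):
    """Weighted mean binary cross-entropy over a batch.

    probs   -- predicted probability that each sample is a joke
    labels  -- 1 for joke, 0 for non-joke
    weights -- per-sample weight; a value below 1.0 reduces the
               influence of a sample (hypothetical quality weighting)
    """
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)

# Made-up batch: the third sample is treated as lower quality.
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
weights = [1.0, 1.0, 0.5]
loss = weighted_bce(probs, labels, weights)
```

With all weights set to 1.0 this reduces to the ordinary mean cross-entropy, so the weighting only changes how much each sample contributes.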

📜 See final report in the repo above

Datasets:

❗❗❗ All datasets except the first one contain offensive jokes. Be careful.

😂 Humicroedit and FunLines datasets

Nabil Hossain, John Krumm and Michael Gamon. "President Vows to Cut Taxes Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. 2019. In NAACL.

FunLines - Nabil Hossain, John Krumm, Tanvir Sajed and Henry Kautz. "Stimulating Creativity with FunLines: A Case Study of Humor Generation in Headlines". arXiv preprint (2020).

Description: Both datasets contain an original news title and a “microedited” one, in which one word was replaced to make it funny. Readers then assigned a score from 0 to 3; each title was reviewed by multiple people.
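A small sketch of how one such record could be processed. It assumes that the replaced word is marked in the original title as `<word/>` and that the per-reader grades arrive as a string of digits; both are assumptions about the file layout, so check them against the downloaded data. The helper names `apply_edit` and `mean_grade` and the grade string are hypothetical.

```python
import re

def apply_edit(original, edit):
    """Replace the single <word/> span in the title with the edit word."""
    return re.sub(r"<(.+?)/>", lambda m: edit, original, count=1)

def mean_grade(grades):
    """Average a string of per-reader grades, each a digit from 0 to 3."""
    values = [int(ch) for ch in grades]
    return sum(values) / len(values)

# Example built from the title of the Humicroedit paper; the grades are made up.
original = "President Vows to Cut <Taxes/>"
edited = apply_edit(original, "Hair")
score = mean_grade("32211")
```

Averaging the grades gives one funniness score per title, which can then be used for regression or thresholded for classification.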

Download from: https://www.cs.rochester.edu/u/nhossain/humicroedit.html

😂 CrowdTruth Short-Text-Corpus-For-Humor-Detection

Description: Scraped from Twitter; contains posts from “funny” accounts, as well as Reuters headlines, English proverbs and Wiki sentences. This results in approximately 22K funny items and 21K neutral posts.

Download from: https://github.com/CrowdTruth/Short-Text-Corpus-For-Humor-Detection

😂 Kaggle, Short Jokes

Download from: https://www.kaggle.com/abhinavmoudgil95/short-jokes

😂 Puns, Reddit full and short jokes

Download from: https://github.com/orionw/RedditHumorDetection

🤔 For additional balancing: A Million News Headlines

Use the most recent 200K news titles: https://www.kaggle.com/therohk/million-headlines
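Selecting the most recent titles could look roughly like the sketch below. It assumes the CSV has a `publish_date` column in `YYYYMMDD` form and a `headline_text` column; these column names are an assumption and should be checked against the downloaded file. The function name `most_recent_headlines` and the sample rows are made up for illustration.

```python
import csv
import io

def most_recent_headlines(csv_file, n):
    """Return the n most recent headlines from an open CSV file object."""
    rows = list(csv.DictReader(csv_file))
    # YYYYMMDD strings sort chronologically, so a string sort is enough.
    rows.sort(key=lambda r: r["publish_date"], reverse=True)
    return [r["headline_text"] for r in rows[:n]]

# Tiny in-memory stand-in for the real file.
sample = io.StringIO(
    "publish_date,headline_text\n"
    "20030219,first headline\n"
    "20191231,third headline\n"
    "20100101,second headline\n"
)
recent = most_recent_headlines(sample, 2)
```

For the real data, open the downloaded CSV and pass `n=200000`.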

Contents of notebook datasets

BERT.ipynb contains a simple working fine-tuning example

Dataset analysis and preparation.ipynb contains dataset description and data mixing explanations

dataset comparison and assemble.ipynb mixes the data

Humicroedit.ipynb takes a closer look at the Humicroedit dataset, since it is an interesting one

Note

Followed this tutorial for BERT fine-tuning: https://medium.com/swlh/painless-fine-tuning-of-bert-in-pytorch-b91c14912caa

About

Scoring how funny phrases are, Princeton COS598C final project
