This repository contains the final project for Princeton COS598C, Spring 2020, taught by Prof. Danqi Chen.
👉 In short, the project explored the performance of frozen and unfrozen pre-trained BERT-base-uncased
models on joke recognition, and the effect that dataset composition has on performance.
🧠 Obtained good performance on joke recognition. Observed that cleaner datasets give better results, and that assigning lower weight to poor-quality samples can help. Broadening the domain of the training data improves performance on some tasks. Selecting the best combination of datasets with frozen BERT, with the intent of then training an unfrozen BERT on it, did not work.
📜 See the final report in this repository
❗❗❗ All datasets except the first one contain offensive jokes; be careful
😂 Humicroedit and FunLines datasets
Nabil Hossain, John Krumm, and Michael Gamon. "President Vows to Cut Taxes Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines. 2019. In NAACL.
FunLines - Nabil Hossain, John Krumm, Tanvir Sajed, and Henry Kautz. "Stimulating Creativity with FunLines: A Case Study of Humor Generation in Headlines." arXiv preprint (2020).
Description: Both datasets pair an original news title with a “microedited” one, where a single word was replaced to make it funny. Readers then assigned each edited title a funniness score from 0 to 3; every title was reviewed by multiple people.
Download from: https://www.cs.rochester.edu/u/nhossain/humicroedit.html
😂 CrowdTruth Short-Text-Corpus-For-Humor-Detection
Description: Scraped from Twitter, contains posts from “funny” accounts, as well as Reuters headlines, English proverbs, and Wikipedia sentences. This results in approximately 22K funny items and 21K neutral posts.
Download from: https://github.com/CrowdTruth/Short-Text-Corpus-For-Humor-Detection
😂 Kaggle, Short Jokes
Download from: https://www.kaggle.com/abhinavmoudgil95/short-jokes
😂 Puns, Reddit full and short jokes
Download from: https://github.com/orionw/RedditHumorDetection
🤔 For additional balancing: A Million News Headlines
Use the most recent 200K news titles: https://www.kaggle.com/therohk/million-headlines
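The balancing step above amounts to labeling jokes as positive, taking the most recent news headlines as negatives, and shuffling the combined set. A minimal pandas sketch of that idea is below; the column names (`text`, `publish_date`, `label`) and the helper itself are illustrative assumptions, not taken from the repo's notebooks:

```python
import pandas as pd

def balance_with_headlines(jokes: pd.DataFrame, headlines: pd.DataFrame,
                           n_recent: int = 200_000) -> pd.DataFrame:
    """Label jokes 1 and headlines 0, keep only the most recent
    headlines, and return one shuffled combined dataset."""
    jokes = jokes.assign(label=1)
    # Assume headlines carry a sortable publish_date column, as in the Kaggle dump.
    recent = headlines.sort_values("publish_date").tail(n_recent).assign(label=0)
    combined = pd.concat([jokes[["text", "label"]], recent[["text", "label"]]],
                         ignore_index=True)
    return combined.sample(frac=1, random_state=0).reset_index(drop=True)

# Tiny synthetic demo (made-up rows, just to show the shapes)
jokes = pd.DataFrame({"text": ["joke a", "joke b"]})
headlines = pd.DataFrame({"text": ["news 1", "news 2", "news 3"],
                          "publish_date": [20190101, 20200101, 20200201]})
mixed = balance_with_headlines(jokes, headlines, n_recent=2)
print(mixed["label"].value_counts().to_dict())  # → {0: 2, 1: 2}
```

In practice you would read the Kaggle CSVs into `jokes` and `headlines` instead of building them inline.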
BERT.ipynb contains a simple working fine-tuning example
Dataset analysis and preparation.ipynb describes the datasets and explains how the data was mixed
dataset comparison and assemble.ipynb assembles the mixed datasets
Humicroedit.ipynb takes a closer look at the Humicroedit dataset, since it is an interesting one
Followed this tutorial for BERT fine-tuning: https://medium.com/swlh/painless-fine-tuning-of-bert-in-pytorch-b91c14912caa
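The frozen/unfrozen distinction studied here comes down to whether the encoder's parameters receive gradient updates during fine-tuning. A minimal PyTorch sketch of that pattern is below; it uses a tiny stand-in encoder instead of the real `bert-base-uncased` so it runs without any download (with the `transformers` library you would substitute `BertModel.from_pretrained("bert-base-uncased")` for the stand-in):

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for BERT-base: maps token ids to one pooled vector."""
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids):
        return self.embed(input_ids).mean(dim=1)  # crude "pooled" output

class JokeClassifier(nn.Module):
    def __init__(self, encoder, hidden=32, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden, 2)  # joke vs. not-joke
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False  # frozen: encoder gets no updates

    def forward(self, input_ids):
        return self.head(self.encoder(input_ids))

model = JokeClassifier(TinyEncoder(), freeze_encoder=True)
# In the frozen setting only the classification head is trainable.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # → ['head.weight', 'head.bias']
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```

Passing `freeze_encoder=False` gives the unfrozen setting, where the optimizer also updates all encoder weights.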