SwearMapper


The most important Twitter analysis ever done: Mapping how much each state curses on Twitter.

Overview

The goal of this project was to measure how much each US state curses on Twitter, as a proportion of its total tweet output. To accomplish this, I used the Tweepy library to interface with Twitter's streaming API, the Geocoder Python library (in conjunction with Mapbox's geocoding API), and Plotly's Pandas API to build an interactive choropleth map.

Getting the Data

In building the listener, I mostly followed this tutorial written by Adil Moujahid. I modified the listener filter parameters to capture only Tweets that included location tags from within the continental US.

I ran the listener for roughly 2 hours, redirecting the output to a text file with the following command in the terminal: $ python listener.py > data.txt

The code for this can be found in listener.py.
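For reference, here is a minimal sketch of what such a listener looks like, following the older Tweepy 3.x streaming interface used in the tutorial. The credentials are placeholders and the bounding box is only a rough approximation of the continental US; see listener.py for the actual implementation.

```python
# Minimal streaming listener sketch (Tweepy 3.x style). Credentials are placeholders.
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

# Rough bounding box for the continental US: [SW lon, SW lat, NE lon, NE lat]
CONTINENTAL_US = [-125.0, 24.0, -66.0, 50.0]


class StdOutListener(StreamListener):
    """Print each raw tweet JSON object to stdout so it can be redirected to a file."""

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        # Stop on errors (e.g. rate limiting) rather than hammering the API.
        print(status)
        return False


if __name__ == "__main__":
    auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    stream = Stream(auth, StdOutListener())
    # Only tweets geotagged inside the bounding box are delivered.
    stream.filter(locations=CONTINENTAL_US)
```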

Parsing the Data

Parsing the data involved loading the JSON of each tweet from the data.txt file and feeding it into a Pandas dataframe. Each tweet had coordinate data, which I reverse-geocoded to extract the state it was sent from. To use the Mapbox API during this step, you have to set a Mapbox access token as an environment variable in the shell where you run the parser, like so: $ export MAPBOX_ACCESS_TOKEN=<Secret Access Token>. On ordinary hardware, the reverse geocoding step can take up to 2 hours for a dataset of roughly 50k tweets.

The code for this can be found in parse.py.
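The sketch below shows the general shape of this step, not the actual contents of parse.py: the JSON fields and the geocoder attribute names are my assumptions, and it assumes MAPBOX_ACCESS_TOKEN is already exported.

```python
# Sketch of the parsing step: load raw tweet JSON, reverse-geocode coordinates to states.
import json
import geocoder
import pandas as pd

rows = []
with open("data.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip truncated records from the stream
        coords = tweet.get("coordinates")
        if not coords:
            continue
        lon, lat = coords["coordinates"]  # GeoJSON order: [longitude, latitude]
        rows.append({"text": tweet.get("text", ""), "lat": lat, "lon": lon})

df = pd.DataFrame(rows)


def to_state(row):
    # Reverse-geocode one coordinate pair via the geocoder library's Mapbox provider.
    g = geocoder.mapbox([row["lat"], row["lon"]], method="reverse")
    return g.state  # attribute name assumed; e.g. "Michigan"


# One HTTP request per tweet, which is why this step can take hours on ~50k tweets.
df["state"] = df.apply(to_state, axis=1)
df.to_csv("tweets_with_states.csv", index=False)
```

Each tweet requires its own geocoding request, so this is the bottleneck of the whole pipeline.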

What is a Swear?

Easy. The swear_set is derived from this scene in the canonical cinematic work on profanity, "South Park: Bigger, Longer, and Uncut" (specifically starting at the 00:47 mark).
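The check itself is simple, and works under the same assumptions noted in the Caveats below (correctly spelled, whitespace-delimited words). A sketch, with the actual contents of swear_set replaced by sanitized placeholders:

```python
# Sketch of the profanity check. The real swear_set comes from the movie clip;
# these entries are placeholders.
swear_set = {"swear1", "swear2", "swear3"}


def contains_swear(text):
    """Return True if any whitespace-delimited token in the tweet is in swear_set."""
    return any(word in swear_set for word in text.lower().split())
```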

Results

The top 5 swear-iest states on Twitter were:

| State | % of Tweets Containing Profanity |
| --- | --- |
| Michigan | 7.75 |
| New Jersey | 6.11 |
| New Mexico | 5.71 |
| Georgia | 5.37 |
| Washington DC | 5.08 |

My graphed results can be found here; I used Plotly's Pandas API to generate them. The map must be generated from within an IPython Notebook. For instructions on getting set up with Plotly's API, see here.

Find the graphing code in swearmap.ipynb.
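A rough sketch of what a choropleth like this looks like is below. It is not the notebook's code: the dataframe here is a hypothetical aggregation (two-letter state codes plus the profanity percentage), and it uses plotly.offline so it runs in a notebook without a hosted Plotly account.

```python
# Sketch of a US-states choropleth in a notebook, using plotly.offline.
import pandas as pd
from plotly.offline import init_notebook_mode, iplot

# Hypothetical aggregated results; the real values come from the parsed tweets.
results = pd.DataFrame({
    "state_code": ["MI", "NJ", "NM", "GA", "DC"],
    "pct_profane": [7.75, 6.11, 5.71, 5.37, 5.08],
})

data = [{
    "type": "choropleth",
    "locationmode": "USA-states",
    "locations": results["state_code"],
    "z": results["pct_profane"],
    "colorbar": {"title": "% tweets with profanity"},
}]

layout = {
    "title": "Profanity on Twitter by state",
    "geo": {"scope": "usa"},
}

init_notebook_mode()
iplot({"data": data, "layout": layout})
```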

Caveats

My process as outlined here has several potential issues:

  • If the listener receives data faster than it can store it, it will fall behind the stream and disconnect. I ran into this issue several times before getting a volume of data I was satisfied with. The listener as currently built has no handling for this; it's a subject for future development.
  • Limited sample size. The dataset contained over 50k tweets, but there were still a number of states for which no data was collected. On the positive side, the proportion of tweets containing profanity was fairly consistent among the states with decent samples, typically in the 2-5% range.
  • Limited definition of profanity. I limited my definition of profanity to this clip from "South Park: Bigger, Longer, and Uncut". Furthermore, I assumed all instances of profanity were correctly spelled and properly spaced, by far the most naive assumption ever made about Twitter.

Conclusion

This was just an exercise, and an admittedly silly one at that. I make no claims about the results being meaningful in any way, but I hope there are at least some useful technical points here on API utilization and dataframe manipulation. Cheers!

Libraries and APIs Used:

  • Numpy
  • Pandas
  • Geocoder/Mapbox
  • Tweepy/Twitter Streaming API
  • Plotly
