Collects all tweets from the sample Public stream using Twitter's streaming API, and saves them to a file for later use as a corpus.
The sample Public stream "Returns a small random sample of all public statuses. The Tweets returned by the default access level are the same, so if two different clients connect to this endpoint, they will see the same Tweets."
This module consumes tweets from the sample Public stream and puts them on a queue. The tweets are then consumed from the queue and written to a file in JSON format, exactly as sent by Twitter, with one tweet per line. This file can then be processed and filtered as necessary to create a corpus of tweets for use with Machine Learning, Natural Language Processing, and other Human-Centered Computing applications.
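The queue-and-writer arrangement described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the names `write_tweets`, `run_writer`, and `SENTINEL` are hypothetical, and the list of raw strings stands in for whatever the stream listener receives.

```python
import queue
import threading

SENTINEL = None  # hypothetical marker that tells the writer to stop


def write_tweets(tweet_queue, path):
    """Drain raw JSON strings from the queue, writing one tweet per line."""
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        while True:
            raw = tweet_queue.get()
            if raw is SENTINEL:
                break
            # Keep the JSON as received; only normalise the line ending.
            out.write(raw.strip() + "\n")
            count += 1
    return count


def run_writer(raw_tweets, path):
    """Feed an iterable of raw JSON strings through a queue to a writer thread."""
    tweet_queue = queue.Queue()
    writer = threading.Thread(target=write_tweets, args=(tweet_queue, path))
    writer.start()
    for raw in raw_tweets:  # assumption: in the script, the stream listener does this
        tweet_queue.put(raw)
    tweet_queue.put(SENTINEL)
    writer.join()
```

Decoupling the network reader from the file writer in this way means a slow disk write does not block the code that receives tweets.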
First, you will need to configure the script by supplying the tokens that Twitter generates for your application. Follow the instructions given at the top of the script. Other module configuration values are also defined at the top, such as the default name of the file in which the tweets are stored.
Use the following examples as a guide for connecting to the appropriate stream in the main() function.
By default, the script is configured to connect to the sample stream, which "returns a small random sample of all public statuses".
stream.sample()
If you would like to filter tweets by location bounding boxes, be sure to read the location parameter information in the Twitter API documentation first. Below is an example that filters for tweets from the continental United States.
LOCATIONS = [-124.85, 24.39, -66.88, 49.38,]
stream.filter(locations=LOCATIONS)
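The flat list is read as longitude/latitude pairs, with the south-west corner of each box first and the north-east corner second. As a rough sketch, the helper below (the name `in_bounding_boxes` is hypothetical and not part of Tweepy or this script) checks a point against such a list, which can be handy when verifying saved tweets locally.

```python
def in_bounding_boxes(lon, lat, locations):
    """Return True if (lon, lat) falls inside any box in a flat locations list."""
    if len(locations) % 4 != 0:
        raise ValueError("locations must hold four numbers per bounding box")
    for i in range(0, len(locations), 4):
        # Each box is [west longitude, south latitude, east longitude, north latitude].
        west, south, east, north = locations[i:i + 4]
        if west <= lon <= east and south <= lat <= north:
            return True
    return False


LOCATIONS = [-124.85, 24.39, -66.88, 49.38]
print(in_bounding_boxes(-87.63, 41.88, LOCATIONS))  # Chicago: True
print(in_bounding_boxes(-0.13, 51.51, LOCATIONS))   # London: False
```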
If you would like to filter by keywords instead, use the track parameter. Below is an example that filters for some example emoticons.
EMOTICONS = ">:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^) "
EMOTICONS = EMOTICONS.strip().split(' ')
stream.filter(track=EMOTICONS)
For more details, please refer to the streaming.py module from the Tweepy library.
If you would like to modify the application to process tweets as they are received instead of saving them to a file for later processing, edit the worker() function as needed.
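As one possible starting point for such processing, the sketch below parses a raw message and returns the tweet text. The function name `extract_text` is hypothetical, and how it would be called from worker() depends on the script's own code, which is not shown here. It assumes each queue item is a raw JSON string, and it skips blank keep-alive lines and messages without a `text` field, such as delete notices.

```python
import json


def extract_text(raw):
    """Return the tweet text from a raw stream message, or None if it is not a tweet."""
    raw = raw.strip()
    if not raw:
        return None  # blank keep-alive line
    try:
        message = json.loads(raw)
    except ValueError:
        return None  # malformed or truncated message
    if not isinstance(message, dict) or "text" not in message:
        return None  # e.g. a delete notice rather than a tweet
    return message["text"]
```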
- Public streams, which describe the types of streams available.
- Statuses/filter, which describes the limits on the number of keywords, users, and location boxes that you are allowed to use with the filter. Note in particular that all filters are combined with the OR operator, not the AND operator. For example, specifying both location and track parameters will return Tweet objects that match either criterion, and not necessarily both.
- Tweet objects, which are returned by the streams. Describes all the fields present.
- User objects, which are also embedded into each Tweet object. Describes all the fields present.
- Location parameter information.
- Track (keyword) parameter information.
- A Python library for accessing the Twitter API.
- Does all the heavy lifting of connecting to the sample Public stream.
- Available at: https://github.com/tweepy/tweepy
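Because the filter parameters are combined with OR, as noted under Statuses/filter, requiring both a keyword and a location means filtering again locally. The following is a simplified sketch under stated assumptions: the helper names are hypothetical, keywords are matched as plain substrings, and only tweets carrying an exact `coordinates` point are considered, which is narrower than Twitter's own matching rules.

```python
import json


def matches_both(tweet, keywords, box):
    """Return True only if the tweet matches a keyword AND lies inside the box."""
    text = tweet.get("text", "")
    has_keyword = any(keyword in text for keyword in keywords)
    point = tweet.get("coordinates")  # GeoJSON point, or None when absent
    if not has_keyword or not point:
        return False
    lon, lat = point["coordinates"]  # GeoJSON order is [longitude, latitude]
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def filter_lines(lines, keywords, box):
    """Yield parsed tweets from one-JSON-object-per-line input that match both criteria."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        if matches_both(tweet, keywords, box):
            yield tweet
```

Since an open file iterates line by line, `filter_lines(open(path, encoding="utf-8"), keywords, box)` would apply the same check to a saved tweet file.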