BHL TWARC

The Bentley Historical Library's implementation of twarc, which captures searches for hashtags and mentions through the Twitter API.

Requirements

BHL TWARC Set Up

Clone the repository
cd bhl_twarc
Create a configuration file called feeds.txt
Entries in the configuration file should look like this:

[examplehashtag]
crawl:True
name:Example Hashtag (#examplehashtag)
crawl_type:hashtag
search_string:#examplehashtag

[examplementions]
crawl:False
name:Example Mentions (@examplementions)
crawl_type:mentions
search_string:@examplementions
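
As a rough illustration of how entries like these can be read, the sketch below parses the same layout with Python's standard configparser module and picks out the feeds whose crawl setting is True. The function names load_feeds and feeds_to_crawl are hypothetical, and this is not necessarily how bhl_twarc.py itself reads feeds.txt.

```python
import configparser

# Sample text in the same layout as the feeds.txt entries above.
SAMPLE = """\
[examplehashtag]
crawl:True
name:Example Hashtag (#examplehashtag)
crawl_type:hashtag
search_string:#examplehashtag

[examplementions]
crawl:False
name:Example Mentions (@examplementions)
crawl_type:mentions
search_string:@examplementions
"""


def load_feeds(text):
    """Parse feeds.txt-style text into {section: {option: value}} (hypothetical helper)."""
    # interpolation=None keeps any '%' characters in search strings literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def feeds_to_crawl(feeds):
    """Return the names of feeds whose crawl option is True (hypothetical helper)."""
    return [
        name
        for name, options in feeds.items()
        if options.get("crawl", "").strip().lower() == "true"
    ]


if __name__ == "__main__":
    feeds = load_feeds(SAMPLE)
    for name in feeds_to_crawl(feeds):
        print(name, feeds[name]["search_string"])
```

Note that configparser only treats # as a comment marker at the start of a line, so a value such as #examplehashtag is preserved.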

Twitter API Set Up

Create an application at apps.twitter.com
Note the consumer key, consumer secret, access token, and access token secret associated with the application

Use

Run bhl_twarc.py
The script will parse entries in feeds.txt and initiate a Twitter search for all that have a crawl setting of True
bhl_twarc will create the following directory structure if it does not already exist (shown here for the examplehashtag feed from the configuration above):

feeds
  examplehashtag
    html
    json
    logs
    media
      profile_images
      tweet_images
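
The layout above could be created with something like the following sketch. It assumes the feeds directory lives under a caller-supplied base directory. The helper name ensure_feed_dirs is hypothetical and is not taken from bhl_twarc.py.

```python
import os

# Subdirectories created for every feed, mirroring the tree shown above.
FEED_SUBDIRS = [
    "html",
    "json",
    "logs",
    os.path.join("media", "profile_images"),
    os.path.join("media", "tweet_images"),
]


def ensure_feed_dirs(base_dir, feed_name):
    """Create feeds/<feed_name>/... under base_dir if missing; return the feed root."""
    feed_root = os.path.join(base_dir, "feeds", feed_name)
    for subdir in FEED_SUBDIRS:
        # exist_ok=True makes repeated runs harmless when the directories already exist.
        os.makedirs(os.path.join(feed_root, subdir), exist_ok=True)
    return feed_root


if __name__ == "__main__":
    print(ensure_feed_dirs(".", "examplehashtag"))
```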

The raw JSON returned by the Twitter API will be saved to the feed's json directory
Logs for the API search will be written to a twarc.log file in the logs directory
An HTML file generated from the Twitter JSON will be stored in the html directory
- The HTML conversion is based heavily on twarc's wall.py
Profile images and embedded images from tweets will be fetched and stored in the corresponding folders in the media directory
- The paths to images in the converted HTML files will point to the images stored in the media directory
- CSV files will also be created and stored in the media directory, indicating each image's original URL and download location
An index.html file will be created in the feed's root directory containing a table pointing to the raw JSON and converted HTML for each crawl
The README.txt from bhl_twarc\lib will be copied to the feed's root directory
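
To make the image CSV files more concrete, here is a minimal sketch of recording each image's original URL next to its download location. The column names, the file-naming rule (keep the last path segment of the URL), and both function names are assumptions for illustration. The actual files written by bhl_twarc may be laid out differently.

```python
import csv
import os
from urllib.parse import urlparse


def local_image_path(media_dir, url):
    """Map an image URL to a path inside media_dir, keeping the URL's file name."""
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(media_dir, filename)


def write_image_manifest(csv_path, media_dir, urls):
    """Write one CSV row per image: original URL, then local download location."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Hypothetical header names; the real CSV headers are not documented here.
        writer.writerow(["original_url", "download_location"])
        for url in urls:
            writer.writerow([url, local_image_path(media_dir, url)])


if __name__ == "__main__":
    write_image_manifest(
        "tweet_images.csv",
        os.path.join("media", "tweet_images"),
        ["https://example.com/img/abc.jpg"],
    )
```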

First time use

The first time bhl_twarc.py is run, it will prompt you for your consumer key, consumer secret, access token, and access token secret, which will then be stored in a file called .twarc. This file is ignored by default in this repo's .gitignore. Make sure not to commit this file to GitHub, as it will contain your Twitter API secret keys.
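
The prompt-once-then-reuse behavior described above might look roughly like the sketch below. The INI layout with a [main] section, the option names, and the function name load_or_prompt_credentials are all assumptions. The real .twarc file written by bhl_twarc or twarc may be structured differently.

```python
import configparser
import os

# The four values the Twitter API requires, as listed above.
KEYS = ["consumer_key", "consumer_secret", "access_token", "access_token_secret"]
SECTION = "main"  # assumed section name, not confirmed by the source


def load_or_prompt_credentials(path, prompt=input):
    """Return stored credentials from path, prompting and saving them on first use."""
    parser = configparser.ConfigParser(interpolation=None)
    if os.path.exists(path):
        parser.read(path)
        if parser.has_section(SECTION) and all(
            parser.has_option(SECTION, key) for key in KEYS
        ):
            return {key: parser.get(SECTION, key) for key in KEYS}
    # First run (or incomplete file): ask for each value and persist it.
    credentials = {key: prompt("Enter your {}: ".format(key)).strip() for key in KEYS}
    parser[SECTION] = credentials
    with open(path, "w", encoding="utf-8") as handle:
        parser.write(handle)
    return credentials


if __name__ == "__main__":
    creds = load_or_prompt_credentials(".twarc")
    print("Loaded credentials for consumer key ending in", creds["consumer_key"][-4:])
```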

Options

The following command-line arguments can be passed to bhl_twarc.py.

To search only one particular feed from feeds.txt:

bhl_twarc.py -f examplehashtag

To exclude one or more feeds in feeds.txt from a crawl:

bhl_twarc.py -e examplehashtag examplementions

To run a test crawl that uses a configuration file called feeds_test.txt and saves its results to a directory called test_crawls:

bhl_twarc.py -t
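
For readers curious how flags like these can be wired up, the following is a minimal argparse sketch. Only the short flags -f, -e, and -t come from this README. The long option names, the help text, and the select_feeds helper are hypothetical, and the script's real argument handling may differ.

```python
import argparse


def build_parser():
    """Build a parser accepting the -f, -e, and -t flags described above."""
    parser = argparse.ArgumentParser(prog="bhl_twarc.py")
    # Long option names below are illustrative assumptions.
    parser.add_argument("-f", "--feed", help="crawl only this feed")
    parser.add_argument(
        "-e", "--exclude", nargs="+", default=[], help="feeds to skip"
    )
    parser.add_argument(
        "-t", "--test", action="store_true", help="run a test crawl"
    )
    return parser


def select_feeds(candidates, args):
    """Filter candidate feed names according to the parsed arguments."""
    if args.feed:
        return [name for name in candidates if name == args.feed]
    return [name for name in candidates if name not in args.exclude]


if __name__ == "__main__":
    arguments = build_parser().parse_args()
    print(select_feeds(["examplehashtag", "examplementions"], arguments))
```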
