A proof-of-concept for harvesting social media and other web resources to WARC files.
- Create a virtual environment:
virtualenv ENV
source ENV/bin/activate
- Install requirements:
pip install -r requirements.txt
- Make a local copy of config.py:
cp sample_config.py config.py
At this point, there are no values you should need to change in this file.
- Make a copy of sample_seeds:
cp -r sample_seeds seeds
Edit the seed files (or create additional ones) as appropriate. You must provide valid Twitter, Flickr, and Tumblr API credentials.
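After editing the seed files by hand, it can help to confirm they still parse before starting a harvest. The following is a minimal standalone sketch, not part of this project: it assumes Python 3 and that seed files are JSON files ending in .json (as in seeds/flickr_seeds.json). The function name find_invalid_seed_files is hypothetical, and the sketch checks only JSON syntax, not the fields sfh expects.

```python
import json
from pathlib import Path


def find_invalid_seed_files(seed_dir):
    """Return (path, error message) pairs for seed files that fail to parse as JSON.

    Hypothetical helper: it checks syntax only and knows nothing about
    the structure sfh expects inside a seed file.
    """
    problems = []
    for path in sorted(Path(seed_dir).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (OSError, ValueError) as exc:
            # json.JSONDecodeError is a subclass of ValueError.
            problems.append((path, str(exc)))
    return problems
```

Calling find_invalid_seed_files("seeds") returns an empty list when every seed file parses.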
sfh will harvest Twitter, Flickr, and Tumblr data. It can be invoked with:
python sfh.py <collection path> <seed file>
For example:
python sfh.py /collections/my_collection seeds/flickr_seeds.json
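To harvest from several seed files in turn, the invocation above can be wrapped in a small script. This is a rough sketch under stated assumptions: Python 3, the script run from the directory containing sfh.py, and seed files ending in .json. The names build_harvest_commands and run_harvests are hypothetical, and whether several seed files should share one collection path is also an assumption to check against your setup.

```python
import subprocess
import sys
from pathlib import Path


def build_harvest_commands(collection_path, seed_dir, script="sfh.py"):
    """Build one `python sfh.py <collection path> <seed file>` command per seed file."""
    return [
        [sys.executable, script, str(collection_path), str(seed_file)]
        for seed_file in sorted(Path(seed_dir).glob("*.json"))
    ]


def run_harvests(collection_path, seed_dir):
    """Run the harvests one after another, stopping at the first failure."""
    for command in build_harvest_commands(collection_path, seed_dir):
        # check=True raises CalledProcessError if sfh.py exits with a non-zero status.
        subprocess.run(command, check=True)
```

For example, run_harvests("/collections/my_collection", "seeds") would invoke sfh.py once for each .json file in seeds.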
twh will harvest from the Twitter Streaming API. It is currently broken.