
# Scraper

Requirements: Python 2.7, MySQL, Redis, and the packages in requirements.txt

A Python scraper that takes an input tab-separated file of IDs and URLs, and stores the HTML contents of the URLs in a MySQL database, keyed by ID.
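The input format described above can be parsed with the standard `csv` module. The sample IDs and URLs below are hypothetical; the real ID format depends on your data:

```python
import csv
import io

# Hypothetical example of the tab-separated input: one "id<TAB>url" pair per line.
SAMPLE = u"101\thttp://example.edu/syllabus1.html\n102\thttp://example.edu/syllabus2.html\n"

def parse_id_url_file(fileobj):
    """Yield (id, url) pairs from a tab-separated file object."""
    for row in csv.reader(fileobj, delimiter="\t"):
        if len(row) == 2:
            yield row[0], row[1]

pairs = list(parse_id_url_file(io.StringIO(SAMPLE)))
```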

### Run

  1. Set config variables in `scraper/config.py` and `scraper/db_settings.py`.
  2. Initialize Redis with `python helpers/makequeue.py`.
  3. Run `python scraper/run_workers.py`.

### Settings

`scraper/config.py` - contains settings information

  - `num_workers` - the number of parallel workers to create
  - `redis_name` - the name of the Redis database
  - `path` - the location of the tab-separated id-url file
  - `timeout` - the number of seconds before a worker raises a timeout exception
  - `wayback` - `false`: scrape the URL itself; `true`: scrape the Wayback Machine cache of the page
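Putting the options above together, a minimal `scraper/config.py` might look like the following. The values are illustrative, not defaults from this repository:

```python
# scraper/config.py - illustrative values only; tune for your environment.

num_workers = 8               # number of parallel workers to create
redis_name = "scrape_queue"   # name of the Redis database/queue
path = "data/id_urls.tsv"     # location of the tab-separated id-url file
timeout = 30                  # seconds before a worker raises a timeout exception
wayback = False               # True: scrape the Wayback cache instead of the live URL
```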

`scraper/db_settings.py` - database settings (MySQL)
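A sketch of what `scraper/db_settings.py` could contain. The variable names here are assumptions; check the file in the repository for the names the scraper actually reads:

```python
# scraper/db_settings.py - hypothetical MySQL settings; the real variable
# names in this repository may differ.

db_host = "localhost"
db_port = 3306          # MySQL's default port
db_user = "scraper"
db_password = "change-me"
db_name = "syllabi"
```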

`helpers/` contains several scripts for setting up Redis, peeking at scrape status, and cleaning up aborted scrapes.
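The kind of status peek those helpers provide can be sketched with redis-py, assuming the pending id-url pairs live in a Redis list (the queue name is an assumption):

```python
# Sketch of peeking at scrape status with redis-py; assumes the pending
# id-url pairs sit in a Redis list named by redis_name in config.py.

def queue_status(conn, queue_name):
    """Return how many id-url pairs are still waiting in the queue."""
    return conn.llen(queue_name)

# Usage (requires a running Redis server):
#   import redis
#   conn = redis.Redis()
#   print(queue_status(conn, "scrape_queue"))
```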

### Todo

  - Automatically scrape the Wayback Machine on failure.
  - Implement Google Cache support.
  - Fix sentinel values (the program currently idles on completion).
