These code snippets are the core of a post I wrote about web scraping in Python. It's aimed at people who have already done a bit of coding but want to explore scraping in Python
in more depth. The workshop will be much easier if you have a Mac or Linux-based computer.
Download repo: https://github.com/abelsonlive/scrape-the-gibson
Install dependencies
- If you don't have pip installed, type:
sudo easy_install pip
- Change directories:
cd nyu-skill-share-scraping
- Now run:
sudo pip install -r requirements.txt
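To check that the install worked, a quick sanity check along these lines can help. This is only a sketch: the module names below are an assumption based on the libraries this workshop mentions (requests, BeautifulSoup, dataset), and the actual contents of requirements.txt may differ. `missing_modules` is a made-up helper name.

```python
import importlib.util

# Assumed import names for the libraries the workshop mentions;
# requirements.txt in the repo may list different or additional packages.
WORKSHOP_MODULES = ["requests", "bs4", "dataset"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(WORKSHOP_MODULES)
    if missing:
        print("Not installed yet:", ", ".join(missing))
    else:
        print("All workshop modules were found.")
```

If anything is reported as missing, re-running the pip command above is the first thing to try.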
The workshop covers:
- Getting started with scraping in Python using requests
- Exploring HTML documents and extracting the data with BeautifulSoup
- Saving scraped data to a database with dataset
- Thinking about ETL (Extract, Transform, Load)
- Keeping your source data around
- Running multiple requests in parallel to scrape faster
- Using regular expressions to extract more data
- Programmatic crawling of entire sites
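As a rough illustration of how a few of these topics fit together (parallel requests, keeping source data, and regular expressions), here is a minimal sketch. It is not the workshop's own code: the function names, the `raw_dir` folder, and the price pattern are all made up for illustration, and the real download step assumes `requests` is installed.

```python
import hashlib
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical pattern: pulls dollar amounts such as "$1,250.00" out of text.
PRICE_RE = re.compile(r"\$(\d[\d,]*(?:\.\d{2})?)")


def fetch(url):
    """Extract: download one page and return its HTML as text."""
    import requests  # imported here so the helpers below work without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def save_raw(url, html, raw_dir):
    """Keep the source: write the untouched HTML to disk, named by URL hash."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = raw_dir / name
    path.write_text(html, encoding="utf-8")
    return path


def extract_prices(html):
    """Transform: return every price found in the page as a float."""
    return [float(m.replace(",", "")) for m in PRICE_RE.findall(html)]


def scrape_all(urls, raw_dir, fetcher=fetch, workers=4):
    """Fetch URLs in parallel threads, archive each page, then parse it."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        pages = list(pool.map(fetcher, urls))
    rows = []
    for url, html in zip(urls, pages):
        save_raw(url, html, raw_dir)
        rows.append({"url": url, "prices": extract_prices(html)})
    return rows
```

Because the raw HTML is saved before parsing, the regular expression can be changed later and re-run against the files on disk without hitting the site again. The `fetcher` argument is there so a fake download function can be swapped in while experimenting.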
There are plenty of existing resources on scraping. A few links:
- Paul Bradshaw's Scraping for Journalists, excellent for non-coders.
- School of Data Handbook Recipes
- ScraperWiki (Classic) Docs, moving to GitHub