
Scraping Workshop for H/H BA

These code snippets are the core of a scraping workshop for the Hacks/Hackers Buenos Aires Media Party. It is aimed at people who have already done some Python coding but want to explore scraping in more depth.

Get a working environment

To recreate examples from the workshop, create a Python virtual environment like this:

# Create the virtualenv:
virtualenv scraping-env

# Activate it:
source scraping-env/bin/activate

# Finally, install the dependencies for this workshop:
pip install -r requirements.txt

Topics

Introduction

  • Getting started with Scraping in Python using requests
  • Exploring HTML documents and extracting the data, with lxml
  • Saving scraped data to a database with dataset

Advanced

  • Thinking about ETL (Extract, Transform, Load)
  • Keeping your source data around
  • Dealing with sessions (e.g. logins), forms and searches
  • Running multiple requests in parallel to scrape faster
  • Performing sanity checks on your data
  • Understanding HTTP cache controls to check if new content is available
  • Hiding the fact that you're scraping a site
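
Some of the topics above can be sketched in a few lines. The example below is one possible approach, not the workshop's own code: it keeps the raw pages on disk so the transform step can be re-run without downloading again, fetches several URLs in parallel threads, and runs a simple sanity check on the extracted rows. The `raw-pages` directory, the function names and the required field names are all hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

CACHE_DIR = Path("raw-pages")  # hypothetical directory for source data


def cache_path(url, cache_dir=CACHE_DIR):
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.html"


def fetch_cached(url, session=None, cache_dir=CACHE_DIR):
    """Extract step: return the raw bytes for url, downloading only once."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_bytes()  # reuse the copy we kept around
    http = session or requests  # a requests.Session keeps login cookies
    response = http.get(url, timeout=30)
    response.raise_for_status()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(response.content)
    return response.content


def fetch_all(urls, fetch=fetch_cached, max_workers=4):
    """Download several pages in parallel threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def check_rows(rows, required=("title", "url")):
    """Sanity check: return the rows that are missing a required field."""
    return [row for row in rows if any(not row.get(key) for key in required)]
```

Keeping `max_workers` small is a deliberate choice here: threads speed up a scrape because most of the time is spent waiting on the network, but a large pool can overload the site being scraped.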

Pro

Links

There are plenty of existing resources on scraping. A few links:
