Scraping Workshop for H/H BA

These code snippets are the core of a scraping workshop for the Hacks/Hackers Buenos Aires Media Party. It's aimed at people who have already done some Python coding but want to explore scraping in more depth.

Get a working environment

To recreate examples from the workshop, create a Python virtual environment like this:

# Create the virtualenv:
virtualenv scraping-env

# Activate it:
source scraping-env/bin/activate

# Finally, install the dependencies for this workshop:
pip install -r requirements.txt

Topics

Introduction

Getting started with scraping in Python using requests
Exploring HTML documents and extracting the data, with lxml
Saving scraped data to a database with dataset
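
The three introductory steps (fetch a page, extract data from its HTML, save the rows) can be sketched end to end. The workshop itself uses requests, lxml and dataset; the sketch below deliberately swaps in the standard library's html.parser and sqlite3 so it runs without any installed dependencies. The sample HTML, the `parties` table and all function names are hypothetical and only serve as illustration.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample page; in the workshop this would come from an HTTP GET.
SAMPLE_HTML = """
<table id="parties">
  <tr><th>Name</th><th>Seats</th></tr>
  <tr><td>Alpha</td><td>12</td></tr>
  <tr><td>Beta</td><td>7</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty and are skipped.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def extract_rows(html):
    """Turn the table rows into dictionaries with typed values."""
    parser = TableParser()
    parser.feed(html)
    return [{"name": name, "seats": int(seats)} for name, seats in parser.rows]


def save_rows(conn, rows):
    """Store the rows; re-running the scraper replaces rows instead of duplicating them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parties (name TEXT PRIMARY KEY, seats INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO parties (name, seats) VALUES (:name, :seats)", rows
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_rows(conn, extract_rows(SAMPLE_HTML))
    print(conn.execute("SELECT name, seats FROM parties ORDER BY name").fetchall())
```

Keying the table on `name` is one possible design choice: it makes repeated runs idempotent, which matters once a scraper is scheduled to run regularly.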

Advanced

Thinking about ETL (Extract, Transform, Load)
- Keep your source data around
Dealing with sessions (e.g. logins), forms and searches.
Running multiple requests in parallel to scrape faster
- Thready
Performing sanity checks on your data
- Sunlight's validictory
- Colander
- Example: UK Spend Reporting Tool
Understanding HTTP cache controls to check if new content is available.
Hiding the fact that you're scraping a site
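
Three of the ideas above fit together naturally: keeping the raw source data on disk, fetching several pages in parallel, and sanity-checking records before loading them. The following is a minimal sketch under stated assumptions, not the workshop's own code: it uses the standard library's ThreadPoolExecutor rather than Thready, and a hand-written check rather than validictory or Colander. The function names, the `name` and `amount` fields and the file naming scheme are all hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen


def fetch(url):
    """Extract step: download the raw bytes of a page."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def fetch_cached(url, store_dir, fetcher=fetch):
    """Return the raw page, downloading it only if no local copy exists."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, stable file name (an assumed naming scheme).
    path = store_dir / (hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_bytes()
    raw = fetcher(url)
    path.write_bytes(raw)
    return raw


def fetch_all(urls, store_dir, fetcher=fetch, workers=4):
    """Fetch several pages in parallel; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_cached(u, store_dir, fetcher), urls))


def check_record(record):
    """Return a list of problems found in one record (empty means it passed)."""
    problems = []
    if not record.get("name"):
        problems.append("missing name")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("amount must be a non-negative number")
    return problems
```

Because the raw pages stay on disk, the transform step can be rewritten and re-run as often as needed without hitting the source site again. Passing the `fetcher` in as a parameter also makes the pipeline easy to try out offline.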
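
HTTP cache controls let a scraper ask the server whether a page has changed since the last visit, instead of downloading it again. A hedged sketch using only the standard library: the validators (`ETag` and `Last-Modified`) saved from a previous response are sent back as `If-None-Match` and `If-Modified-Since`, and a `304 Not Modified` reply means there is no new content. The function names are hypothetical, and not every server supports these headers.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server to reply only if the page changed."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, validators); body is None when the server answers 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            validators = {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
            return response.read(), validators
    except urllib.error.HTTPError as error:
        # urllib reports 304 Not Modified as an HTTPError; treat it as "unchanged".
        if error.code == 304:
            return None, {"etag": etag, "last_modified": last_modified}
        raise
```

A scraper would store the returned validators next to the saved page and pass them in on the next run, skipping the transform and load steps whenever the body comes back as `None`.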

Pro

Building your own ScraperWiki with Jenkins CI

Links

There are plenty of existing resources on scraping. A few links:

Paul Bradshaw's Scraping for Journalists, excellent for non-coders.
School of Data Handbook Recipes
ScraperWiki (Classic) Docs, moving to GitHub
