Scrape the Gibson

These code snippets are the core of a post I wrote about web scraping in Python. It's aimed at people who have already done a bit of coding but want to explore scraping in Python in more depth. The workshop will be much easier if you have a Mac or a Linux-based computer.

Dependencies

Download repo: https://github.com/abelsonlive/scrape-the-gibson
Install dependencies

If you don't have pip installed, type:

sudo easy_install pip

Change into the repository directory:

cd scrape-the-gibson

Now run:

sudo pip install -r requirements.txt

Topics

Introduction

Getting started with scraping in Python using requests
Exploring HTML documents and extracting data with BeautifulSoup
Saving scraped data to a database with dataset
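
The three introductory topics chain together roughly as follows. This is a minimal sketch rather than the code from the workshop files: the function names, the `links` table, the `links.db` file, and the sample HTML are all assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_links(html):
    """Return one dict per <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"text": a.get_text(strip=True), "href": a["href"]}
        for a in soup.find_all("a", href=True)
    ]


def save_links(rows, db_url="sqlite:///links.db"):
    """Upsert rows into a hypothetical 'links' table, keyed on href."""
    # Imported here so the fetch/parse half works without dataset installed.
    import dataset

    db = dataset.connect(db_url)
    table = db["links"]
    for row in rows:
        table.upsert(row, ["href"])


# Made-up HTML so the parsing step can be tried without a network connection.
SAMPLE_HTML = """
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b"> Second </a></li>
  <li><a>No link here</a></li>
</ul>
"""

if __name__ == "__main__":
    for row in parse_links(SAMPLE_HTML):
        print(row)
```

To scrape a live page, pass the result of `fetch_html(url)` to `parse_links`, then hand the rows to `save_links`. Upserting on `href` means re-running the scraper should update existing rows instead of duplicating them.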

Advanced

Thinking about ETL (Extract, Transform, Load)
Keep your source data around.
Running multiple requests in parallel to scrape faster
- Thready
Regular expressions to extract more data
Programmatic crawling of entire sites
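
Several of the advanced ideas can be combined in one short sketch: keep the raw HTML on disk (the "extract" step), pull fields out of it with a regular expression (the "transform" step), and fetch several pages at once. This version uses only the standard library, with `concurrent.futures.ThreadPoolExecutor` standing in for Thready. The function names, the cache layout, and the e-mail pattern are assumptions for illustration, not the contents of the workshop files.

```python
import hashlib
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen

# Deliberately simple pattern for e-mail-like strings; real addresses vary more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def download(url):
    """Fetch a URL and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def cached_fetch(url, cache_dir, fetch=download):
    """Extract: return the raw HTML, reusing the copy on disk if one exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html


def extract_emails(html):
    """Transform: return the sorted, de-duplicated e-mail-like strings."""
    return sorted(set(EMAIL_RE.findall(html)))


def scrape_all(urls, cache_dir, fetch=download, workers=4):
    """Fetch all URLs in parallel threads and map each URL to its e-mails."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = list(pool.map(lambda u: cached_fetch(u, cache_dir, fetch), urls))
    return {url: extract_emails(html) for url, html in zip(urls, pages)}
```

Because the raw pages stay in the cache directory, the regular expression can be changed and `scrape_all` re-run without hitting the site again. The `fetch` parameter exists so a different downloader (for example one built on requests) can be swapped in.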

Links

There are plenty of existing resources on scraping. A few links:

Paul Bradshaw's Scraping for Journalists, excellent for non-coders.
School of Data Handbook Recipes
ScraperWiki (Classic) Docs, moving to GitHub

