coyotemarin/scrape-campaigns

 
 


Scrapers for Consumer Campaigns

The goal of this project is to scrape consumer campaign data into a common format so that any tool (e.g. websites, browser extensions, apps) can help people be a part of any consumer campaign.

This is a project of SpendRight. You can contact the author (David Marin) at dave@spendright.org.

Using the Data

This data probably isn't very useful as-is, because different campaigns can refer to the same company in different ways (e.g. "LG", "LGE", "LG Electronics"), and some contain inaccurate brand data. Instead, we recommend getting your data from here, which merges together data from the various campaigns in a consistent way.

Also, please note that we don't place any restrictions on the data, but these campaigns are copyrighted by the non-profits who created them. Here's the current status of each campaign, to the best of our knowledge:

Writing a Scraper

Writing a scraper is pretty simple: create a module in scrapers/ that defines a function scrape_campaign(). The function should yield tuples of table_name, row. For example:

yield 'brand', {'brand': "Burt's Bees", 'company': 'Clorox'}

The names and fields of each table are described in this README.

For ratings and the campaign itself, don't include campaign_id; this is added automatically. You may also refer to campaign_brand_rating and campaign_company_rating as simply brand_rating and company_rating.
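As a rough illustration of the points above, here is a minimal sketch of what a scraper module might look like. The module name, the hard-coded EXAMPLE_RATINGS data, and the field names 'campaign', 'url', and 'judgment' are assumptions made for this example; check the table descriptions for the actual fields. A real scraper would parse its rows from the campaign's website rather than a constant.

```python
# Hypothetical module, e.g. scrapers/example_campaign.py (name is illustrative).

# Stand-in for data a real scraper would parse from the campaign's pages.
# Each entry is (company, brand, judgment); the values are made up.
EXAMPLE_RATINGS = [
    ('Example Corp', 'Example Brand', 1),
    ('Other Corp', 'Other Brand', -1),
]


def scrape_campaign():
    # The campaign row itself. Note there is no campaign_id here;
    # the README says the harness adds it. Field names are assumed.
    yield 'campaign', {
        'campaign': 'Example Campaign',
        'url': 'http://example.com/',
    }

    for company, brand, judgment in EXAMPLE_RATINGS:
        # A (table_name, row) tuple, as in the README's own example.
        yield 'brand', {'brand': brand, 'company': company}

        # 'company_rating' is the short name for 'campaign_company_rating'.
        # The 'judgment' field name is an assumption.
        yield 'company_rating', {'company': company, 'judgment': judgment}
```

Because scrape_campaign() is a plain generator, you can inspect what it emits with list(scrape_campaign()) before running it through the harness.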

It's fine to use other Python libraries; just please add them to requirements.txt.

The harness that runs scrapers provides a number of tricks so that your scraper can follow the structure of the page rather than the structure of our tables:

It's okay to output duplicates of the same row; the harness will merge them before writing them to the database. Look at TABLE_TO_KEY_FIELDS in scraper.py to see the primary key of each table (it's pretty much what you'd expect).

Strings are automatically stripped.

"company" is usually a text field, but you can also use a dict if you have other information about the company (e.g. its URL). In that case, the company's name goes in the dict's "company" field.

If you are outputting a company or company rating, you can add a "brands" field which is a list of brands. These are usually strings, but they can also be dicts (like for "company").

If you are outputting a company, brand, or rating, you can add a "categories" field which is a list of categories for the company/brand.

Rows in company and brand are automatically created for every company and brand/company pair mentioned. You might still want to emit rows for companies or brands if you have additional information (e.g. their twitter_handle).
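The conveniences listed above might be used together roughly as follows. This is only a sketch of scraper output, not of the harness itself: the company and brand names are invented, and the 'url' and 'judgment' field names are assumptions ('twitter_handle', 'brands', and 'categories' are the fields mentioned above).

```python
def scrape_campaign():
    # One rating row that follows the shape of a typical campaign page:
    # a company, its brands, and the categories it is listed under.
    rating = {
        # "company" as a dict: the name goes in the "company" field,
        # alongside extra information ('url' is an assumed field name).
        'company': {
            'company': 'Example Corp',
            'url': 'http://example.com/',
        },
        # "brands" may mix plain strings and dicts.
        'brands': [
            'Example Brand ',  # stray whitespace; strings are stripped
            {'brand': 'Other Brand', 'twitter_handle': '@otherbrand'},
        ],
        'categories': ['Electronics'],
        'judgment': 1,  # assumed field name
    }
    yield 'company_rating', rating

    # Emitting the same row again (say, because the company appears on
    # two pages) is fine; the README says duplicates are merged.
    yield 'company_rating', rating
```

Nothing here emits 'company' or 'brand' rows directly; per the list above, those are created automatically from the companies and brands mentioned in the rating.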

Translating a campaign's ratings into a judgment is sometimes (ahem) a judgment call, but it's usually obvious. Mapping green to 1, yellow to 0, and red to -1 is a pretty safe bet, as is (for grades) mapping A and B to 1, C to 0, and D through F to -1 (scraper.grade_to_judgment() does exactly that). Sometimes the campaign is just a list of things to support or avoid, in which case you should use the same judgment throughout.
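The mappings described above can be sketched as follows. This is a stand-alone illustration, not the code of scraper.grade_to_judgment(); the names COLOR_TO_JUDGMENT and sketch_grade_to_judgment are hypothetical, and treating "B+" or "a-" like their base letter is an assumption of this sketch.

```python
# Hypothetical names; the real helper is scraper.grade_to_judgment().
COLOR_TO_JUDGMENT = {'green': 1, 'yellow': 0, 'red': -1}


def sketch_grade_to_judgment(grade):
    """Map a letter grade to a judgment of 1, 0, or -1.

    Only the first letter is considered, so modifiers such as
    '+' and '-' are ignored (an assumption of this sketch).
    """
    letter = grade.strip().upper()[:1]
    if letter in ('A', 'B'):
        return 1
    if letter == 'C':
        return 0
    if letter in ('D', 'E', 'F'):
        return -1
    raise ValueError('unrecognized grade: %r' % (grade,))
```

For a campaign that is only a list of things to avoid, no mapping is needed: every row would simply carry the same judgment (for example -1).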

If you're not sure, ask the campaign's creator.

Once you're done, submit a pull request on GitHub.

If you get stuck, ask me questions! (dave@spendright.org)
