Scrapers for Consumer Campaigns

The goal of this project is to scrape consumer campaign data into a common format so that any tool (e.g. websites, browser extensions, apps) can help people be a part of any consumer campaign.

This is a project of SpendRight. You can contact the author (David Marin) at dave@spendright.org.

Using the Data

This data probably isn't very useful as-is because different campaigns can refer to the same company in different ways (e.g. "LG", "LGE", "LG Electronics"), and some contain inaccurate brand data. Instead, we recommend getting your data from here, which merges together data from the various campaigns in a consistent way.

Also, please note that we don't place any restrictions on the data, but these campaigns are copyrighted by the non-profits who created them. Here's the current status of each campaign, to the best of our knowledge:

b_corp: The entire list of Certified B Corporations. Their Terms of Use are horribly awful (they actually threaten to prosecute people who "illegally attempt to mine member data from the site"), but everyone I've actually talked to at B Labs has been friendly and supportive. As far as I've been able to gather, they just don't want people to somehow pull non-public data from the website. Just to be safe, I'd recommend getting written permission from them, as required in their Terms of Use (email thelab@bcorporation.net).
bang_accord: Signatories of the Accord on Fire and Building Safety In Bangladesh. No explicit permission, but you probably don't need it; these are just facts.
climate_counts: The Climate Counts Scorecard. Their website actually explicitly invites people to build tools that use their data. Send them an email at info@climatecounts.org; they'll be happy to hear from you!
free2work: Free2Work by Not for Sale. No explicit policy, but I have talked to them personally, and they seem to be okay with people using their data. It's a good idea to email feedback@free2work.org, but expect a very slow response (weeks to months).
greenpeace_electronics: Greenpeace International's Guide to Greener Electronics. They have an explicit and very liberal copyright policy. If you want to use it commercially, you need to ask permission; email supporter.services.int@greenpeace.org (expect a response within a week).
hope4congo: RAISE Hope for Congo's Conflict Minerals Company Rankings. They have a scary-sounding but actually very liberal reuse policy that even allows commercial reuse. Just make sure to link back to their website and include this text: "This material [article] was created by RAISE Hope for Congo, a campaign of the Enough Project"
hrc: The Human Rights Campaign's Buyer's Guide. No explicit policy on the website. I tried to contact them through their Buyer's Guide's feedback form, to no avail. If you have an email or phone number for the people who work on the Buyer's Guide, please pass it along!
rankabrand: Rank a Brand. No explicit policy, but I got a positive, friendly response by email. wegreen and Ethical Barcode also use their data. Probably a good idea to shoot them an email at contact@rankabrand.com. They respond quickly.

Writing a Scraper

Writing a scraper is pretty simple: create a module in scrapers/ that defines a function scrape_campaign(). The function should yield tuples of table_name, row. For example:

yield 'brand', {'brand': "Burt's Bees", 'company': 'Clorox'}

The names and fields of each table are described in this README.

For ratings and the campaign itself, don't include campaign_id; this is added automatically. You may also refer to campaign_brand_rating and campaign_company_rating as simply brand_rating and company_rating.
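
Putting these pieces together, a scraper module might look roughly like the sketch below. This is only an illustration: the module name, the hard-coded data, and the 'campaign' and 'judgment' field names are assumptions, so check the table descriptions in the README mentioned above for the real schema. A real scraper would fetch and parse the campaign's website instead of using a hard-coded list.

```python
# scrapers/example_campaign.py (hypothetical module name)

# Stand-in for data a real scraper would parse from the campaign's website.
# Assumption: the second entry and both judgments are invented for illustration.
EXAMPLE_DATA = [
    ("Burt's Bees", 'Clorox', 1),
    ('Example Brand', 'Example Co', -1),
]


def scrape_campaign():
    """Yield (table_name, row) tuples for a made-up campaign."""
    # Assumption: 'campaign' as a field name is a guess at the schema.
    # Note that campaign_id is left out; it is added automatically.
    yield 'campaign', {'campaign': 'Example Campaign'}

    for brand, company, judgment in EXAMPLE_DATA:
        yield 'brand', {'brand': brand, 'company': company}
        # 'brand_rating' is the short name for campaign_brand_rating.
        # Assumption: 'judgment' as a field name is a guess at the schema.
        yield 'brand_rating', {
            'brand': brand,
            'company': company,
            'judgment': judgment,
        }
```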

It's fine to use other Python libraries; just please add them to requirements.txt.

The harness that runs scrapers provides a number of tricks so that your scraper can follow the structure of the page rather than the structure of our tables:

It's okay to output duplicates of the same row; the harness will merge them before writing them to the database. Look at TABLE_TO_KEY_FIELDS in scraper.py to see the primary key of each table (it's pretty much what you'd expect).

Strings are automatically stripped.

"company" is usually a text field, but you can also use a dict if you have other information about the company (e.g. its URL). In that case, the company's name goes in the dict's "company" field.

If you are outputting a company or company rating, you can add a "brands" field which is a list of brands. These are usually strings, but they can also be dicts (like for "company").

If you are outputting a company, brand, or rating, you can add a "categories" field which is a list of categories for the company/brand.

Rows in company and brand are automatically created for every company and brand/company pair mentioned. You might still want to emit rows for companies or brands if you have additional information (e.g. their twitter_handle).
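
To make these conveniences concrete, here is a rough sketch of rows that rely on them, plus a toy version of the strip-and-merge step. It is not the harness's actual code: EXAMPLE_KEY_FIELDS and merge_duplicates() are hypothetical stand-ins (the real primary keys are in TABLE_TO_KEY_FIELDS in scraper.py), the 'url' field and the "brand" key inside a brand dict are guesses, and "later values win" is an assumed conflict rule.

```python
# Assumption: illustrative key fields only; see TABLE_TO_KEY_FIELDS in
# scraper.py for the real primary keys.
EXAMPLE_KEY_FIELDS = {'brand': ('company', 'brand')}


def scrape_campaign():
    """Yield rows that lean on the harness conveniences described above."""
    yield 'company', {
        # A dict instead of a string; the name lives under "company".
        # Assumption: 'url' is a guessed field name.
        'company': {'company': 'Example Co', 'url': 'http://example.com'},
        # Brands may be strings or dicts.
        # Assumption: a brand dict keeps its name under "brand".
        'brands': ['Brand One',
                   {'brand': 'Brand Two', 'twitter_handle': '@brandtwo'}],
        'categories': ['Food'],
    }
    # The same brand seen twice on the page, once with stray whitespace.
    yield 'brand', {'brand': ' Brand One ', 'company': 'Example Co'}
    yield 'brand', {'brand': 'Brand One', 'company': 'Example Co',
                    'twitter_handle': '@brandone'}


def merge_duplicates(records, key_fields=EXAMPLE_KEY_FIELDS):
    """Strip strings, then merge rows sharing a table and primary key.

    Toy version: only handles rows whose key fields are plain strings.
    """
    merged = {}
    for table_name, row in records:
        clean = {k: v.strip() if isinstance(v, str) else v
                 for k, v in row.items()}
        key = (table_name,) + tuple(clean[f] for f in key_fields[table_name])
        # Assumption: later values win on conflict; the real harness may differ.
        merged.setdefault(key, {}).update(clean)
    return [(key[0], row) for key, row in merged.items()]


brand_rows = [r for r in scrape_campaign() if r[0] == 'brand']
print(merge_duplicates(brand_rows))
```

Here the two 'brand' rows collapse into a single row that also carries the twitter_handle from the second one, which is why a scraper can emit rows as it encounters them on the page without tracking what it has already output.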

Translating a campaign's ratings into a judgment is sometimes (ahem) a judgment call, but it's usually obvious. Mapping green to 1, yellow to 0, and red to -1 is a pretty safe bet, as is (for grades) mapping A and B to 1, C to 0, and D through F to -1 (scraper.grade_to_judgment() does exactly that). Sometimes the campaign is just a list of things to support or avoid, in which case you should use the same judgment throughout.
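
As an illustration of the mappings just described, here is a small stand-in. It is not the code of scraper.grade_to_judgment() itself: the names COLOR_TO_JUDGMENT and example_grade_to_judgment are hypothetical, and looking only at the first letter (so that "A-" or "B+" count as A or B) is an assumption about how plus/minus grades might be handled.

```python
# Traffic-light ratings: green is good, yellow is mixed, red is bad.
COLOR_TO_JUDGMENT = {'green': 1, 'yellow': 0, 'red': -1}


def example_grade_to_judgment(grade):
    """Map a letter grade to 1 (support), 0 (mixed), or -1 (avoid).

    Hypothetical helper, not the real scraper.grade_to_judgment().
    """
    # Assumption: only the first letter matters, so "A-" is treated as "A".
    letter = grade.strip().upper()[:1]
    if letter in ('A', 'B'):
        return 1
    if letter == 'C':
        return 0
    if letter in ('D', 'E', 'F'):
        return -1
    raise ValueError('unrecognized grade: %r' % (grade,))
```

Raising an error on anything unrecognized, rather than quietly returning 0, is a deliberate choice in this sketch: a silent default could turn a parsing mistake into a published judgment.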

If you're not sure, ask the campaign's creator.

Once you're done, submit a pull request on GitHub.

If you get stuck, ask me questions! (dave@spendright.org)
