SOUP MINER README

COPYRIGHT INFORMATION:
Soup Miner is licensed under GPL 3.0. The full license can be found here: http://www.gnu.org/copyleft/gpl.html

OVERVIEW
Soup Miner is an open-source data miner for lightweight projects.
It uses urllib2 to fetch web pages,
and the BeautifulSoup HTML parsing library to extract data from them.
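
Below is a minimal sketch of that fetch-and-parse pattern, assuming Python 2 (which urllib2 requires) and the BeautifulSoup 3 import path; the URL is a placeholder, not one the spiders actually use:

	import urllib2
	from BeautifulSoup import BeautifulSoup

	# Fetch a page, parse it, and print every link found on it.
	html = urllib2.urlopen('http://example.com/items/1').read()
	soup = BeautifulSoup(html)
	for link in soup.findAll('a'):
	    print link.get('href')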

CONTENTS
	manager.py
		Control panel. Start, reset, and run spiders.
	miner.py
		Data mining library.
	[spider]/spider.py
		Contains the spider's model as a dict,
		a function that builds the list of URLs to visit,
		and the mining code that says what to do with each URL.
		(A hypothetical sketch follows the end of this list.)
	[spider]/results.csv
		Contains the aggregate of the spider's collected data in CSV form.
		Won't exist until the spider is run.
	[spider]/recent.csv
		Contains data from the most recent run.
		Won't exist until the spider is run.
	[spider]/settings.txt
		Contains spider settings, meant to be edited by hand (so the unsavvy needn't soil themselves with the command line). The settings cover:
		- what URL to use as a base
		- whether to run in debug mode
		- what the last ID parsed was
		- how many iterations the spider last ran
		- whether to rectify missing URLs -- that is, pages we should have parsed, but for whatever reason haven't
		(An example settings file follows this list.)
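
The spider.py sketch below is hypothetical: the names (model, urls, mine, and the settings keys) are illustrative assumptions, not the exact interface manager.py and miner.py expect.

	# Hypothetical [spider]/spider.py -- names here are illustrative only.
	model = {
	    'url': 'URL',        # column names for results.csv
	    'title': 'Title',
	}

	def urls(settings):
	    # Build the list of pages to visit from the hand-edited settings.
	    base = settings['base_url']
	    start = int(settings['last_id']) + 1
	    return [base + str(i) for i in range(start, start + 10)]

	def mine(url, soup):
	    # Given a URL and its parsed page (a BeautifulSoup object),
	    # return one row of data matching the model above.
	    return {'url': url, 'title': soup.find('title').string}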
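A hypothetical settings.txt might look like the following; the key names and layout are assumptions, since the real file is whatever miner.py reads and writes, but the five values map to the settings listed above.

	base_url = http://example.com/items/
	debug = False
	last_id = 0
	iterations = 100
	rectify = True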
