garbados/Soup-Miner
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
SOUP MINER README COPYRIGHT INFORMATION: Soup Miner is licensed under GPL 3.0. The full license can be found here: http://www.gnu.org/copyleft/gpl.html OVERVIEW Soup Miner is an open-source data-miner for lightweight projects. It revolves around the use of urllib2 to collect web pages, while using the BeautifulSoup HTML parsing library to parse the collected data. CONTENTS manager.py Control panel. Start, reset, and run spiders. miner.py Data mining library. [spider]/spider.py Contains the spider's model as a dict, a function that defines the list of URLs, and the mining code as what to do given a URL. [spider]/results.csv Contains the aggregate of the spider's collected data in CSV form. Won't exist until spider is run. [spider]/recent.csv Contains data from the most recent run. Won't exist until spider is run. [spider]/settings.txt Contains spider settings, meant to be edited by hand (so the unsavvy needn't soil themselves with the command line) pertaining to: - what url to use as a base - whether to run in debug mode - what the last ID parsed was - how many iterations the spider last ran - whether to rectify missing urls -- as in, pages we should have parsed, but for whatever reason haven't.
About
A small, portable data scraper built on BeautifulSoup
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published