PubCrawl

As the world doesn't have enough horrible crawlers

This is a minimalist web crawler written in Python for fun over two evenings. PubCrawl currently uses Redis as a persistent backend for the URL frontier (queue), seen URL sets and Memcache style key expiration for use with robots.txt. On a commodity laptop with a standard ADSL2+ connection the download rate is a sustained 25-50 pages per second (or 2-4 million crawled pages per day).

Multiple Python clients can be started and run at the same time either using multiple processes (multiprocessing) or threads
The crawl can be stopped and restarted with only a minimal loss of URLs (zero by the future addition of an in-progress set of URLs)
The web crawler respects robots.txt even when the crawl is stopped and restarted (as long as the Redis database layer persists)
PubCrawl only depends on Redis which plays the role of message queue, URL set curator and Memcache server

The web crawler respects robots.txt through the use of a slightly modified robot exclusion rules parser by Philip Semanchuk. Currently when a website is retrieved the web domain is added as a key to Redis and given an expiry time. The expiry time is either the one provided by robots.txt or a one second program default.

For actual production use it's suggested to install a local DNS caching server such as dnsmasq or ncsd for performance reasons.

Dependencies

Python 2.6 (due to multiprocessing)
Redis
redis-py sudo easy_install redis

Todo

Investigate alternative methods for fetching URLs in order to improve the speed and concurrency over urllib2.urlopen
Better busy queue implementation for handling links that are constrained due to the robots.txt delay
The link graph may be better stored on disk (or at the very least there should be an interface for storage/manipulation)
Implement a flat file storage system for the page contents (as this is currently only useful for the link graph)
The cache for robots.txt should be globally accessible and not a localised Python object
Allow for modifying both the structure, database layer and location of the Redis database server
For longer crawls exceptions and their tracebacks are ignored but should be stored for later review
Make CrawlRequest easily extendible so that a new layer of processing can be added arbitrarily
Improve the robots.txt parser to handle Unicode and generally more complex formats

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
crawl.cfg		crawl.cfg
crawler.py		crawler.py
raw_link_graph.py		raw_link_graph.py
readme.md		readme.md
robotexclusionrulesparser.py		robotexclusionrulesparser.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crawl.cfg

crawl.cfg

crawler.py

crawler.py

raw_link_graph.py

raw_link_graph.py

readme.md

readme.md

robotexclusionrulesparser.py

robotexclusionrulesparser.py

Repository files navigation

PubCrawl

Dependencies

Todo

About

Releases

Packages

Languages

Smerity/pubcrawl

Folders and files

Latest commit

History

Repository files navigation

PubCrawl

Dependencies

Todo

About

Resources

Stars

Watchers

Forks

Languages