Python crawler exercise

Usage: ./crawl.py http://www.periapt.co.uk/ --verbose --dir periapt

I chose htmldata (http://pypi.python.org/pypi/htmldata/1.1.1) to extract URLs from the HTML, for its ease of use, even though it is neither perfect nor up to date.
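For illustration, a minimal sketch of what the extraction step can look like with htmldata (Python 2, matching the snippet below). It assumes htmldata's urlextract(doc, siteurl) helper, which returns match objects carrying a url attribute; the extract_links name is purely illustrative, not the actual crawl.py code:

import urllib2
import htmldata

def extract_links(page_url):
    # Fetch a page and return the URLs found in its HTML,
    # resolved against page_url (assumption: htmldata.urlextract
    # returns objects exposing a .url attribute).
    html = urllib2.urlopen(page_url).read()
    return [match.url for match in htmldata.urlextract(html, page_url)]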

I defined the "domain" as:

import re
import urlparse

splitted_url = urlparse.urlsplit( seed )
domain = splitted_url.netloc
domain = re.sub( r'^www\.', '', domain )  # strip a leading "www."

so that subdomains such as code.periapt.co.uk or web.periapt.co.uk, where "www" is replaced by another prefix, are also crawled.
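A minimal sketch of the kind of in-domain test this enables, in the same Python 2 style as the snippet above (the same_domain name and the exact matching rule are illustrative assumptions, not necessarily the crawler's actual code):

import re
import urlparse

def same_domain(url, domain):
    # Reduce the candidate URL to its host, drop any leading "www.",
    # then accept it if it is the domain itself or one of its subdomains.
    netloc = re.sub(r'^www\.', '', urlparse.urlsplit(url).netloc)
    return netloc == domain or netloc.endswith('.' + domain)

# e.g. same_domain('http://code.periapt.co.uk/page', 'periapt.co.uk') -> True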

I did not handle user-agent definitions or robots.txt rules.
