
simplecrawler

A simple multithreaded web crawler that extracts assets and links from each page it visits and builds a sitemap. It uses Beautiful Soup 4 with lxml to parse the HTML and find the relevant elements, and urllib to download pages.
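
A minimal sketch of what that extraction step can look like with Beautiful Soup 4 on top of lxml; the function name and the exact tag/attribute choices are illustrative, not necessarily the repository's implementation:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract(url):
    """Download a page and return its links and asset references."""
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")  # lxml as the underlying parser

    # Anchor hrefs become candidate links for further crawling.
    links = [a["href"] for a in soup.find_all("a", href=True)]

    # Static assets come from src attributes (images, scripts)
    # and from stylesheet <link> tags.
    assets = [tag["src"] for tag in soup.find_all(["img", "script"], src=True)]
    assets += [l["href"] for l in soup.find_all("link", rel="stylesheet", href=True)]

    return links, assets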

If a robots.txt file is present, it is taken into account and its rules are applied to each URL to decide whether it should be added to the sitemap. URLs with .pdf and .xml extensions are ignored by default.
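
A hedged sketch of that filtering step using the standard library's urllib.robotparser; the helper names and the IGNORED_EXTENSIONS constant are assumptions for illustration:

from urllib import robotparser
from urllib.parse import urljoin, urlparse

IGNORED_EXTENSIONS = (".pdf", ".xml")  # skipped by default

def build_robot_parser(domain):
    """Fetch and parse robots.txt for a domain, if it exists."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(domain, "/robots.txt"))
    rp.read()  # a missing robots.txt simply allows everything
    return rp

def should_crawl(rp, url):
    """Apply robots.txt rules and the extension blacklist to a URL."""
    if urlparse(url).path.lower().endswith(IGNORED_EXTENSIONS):
        return False
    return rp.can_fetch("*", url)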

Installation

Clone the repository and install the dependencies (preferably in a virtualenv):

pip3 install -r requirements.txt

Usage

python3 crawl.py <domain_to_crawl>

Outputting to a file:

python3 crawl.py <domain_to_crawl> > out.txt

Results are output in the following JSON format:

{
   "Links": [
      "#start-of-content",
      "https://github.com/",
      "/personal",
      "/open-source",
      "/business",
      "/explore",
      "/join?source=header-home",
      "/login",
      "/pricing",
      "/blog",
      ...
   ],
   "Assets": [
      "https://assets-cdn.github.com/images/modules/site/home-ill-build.png?sn",
      "https://assets-cdn.github.com/images/modules/site/home-ill-work.png?sn",
      "https://assets-cdn.github.com/images/modules/site/home-ill-projects.png?sn",
      "https://assets-cdn.github.com/images/modules/site/home-ill-platform.png?sn",
      "https://assets-cdn.github.com/images/modules/site/org_example_nasa.png?sn",
      ...
   ],
   "URL": "http://github.com/"
}

Design decisions

  • Multi-threaded. Worker threads feed URLs into a shared priority queue, which 'schedules' them for parsing, so the crawler never has to wait for one result before fetching the next page. The number of crawler threads can be set at initialization (see the sketch after this list).

  • Compliant with robots.txt.

  • Recursive and duplicate crawls are avoided with two sets that track already visited URLs and URLs excluded via robots.txt. Access to the sets is synchronized with a Lock.

  • A parser instead of regex to extract elements from HTML. Regular expressions are not parsers; they are tools for finding patterns. HTML can be nested, malformed, and otherwise messy, and there is plenty of discussion of why regexes fall short for parsing it.

  • lxml as a BeautifulSoup 4 parser.
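
A minimal sketch of the threading model described above: worker threads pull URLs from a shared queue, and the visited/excluded sets are guarded by a Lock. The class layout is illustrative, queue.Queue stands in for the priority queue mentioned above, and parse() is a placeholder for the extraction step:

import threading
from queue import Queue

class Crawler:
    def __init__(self, start_url, num_workers=4):
        self.queue = Queue()           # shared work queue of URLs
        self.visited = set()           # URLs already crawled
        self.excluded = set()          # URLs disallowed by robots.txt
        self.lock = threading.Lock()   # guards access to both sets
        self.num_workers = num_workers
        self.queue.put(start_url)

    def parse(self, url):
        """Placeholder: download url and return extracted links (see above)."""
        return []

    def worker(self):
        while True:
            url = self.queue.get()
            with self.lock:
                seen = url in self.visited or url in self.excluded
                self.visited.add(url)
            if not seen:
                for link in self.parse(url):
                    self.queue.put(link)
            self.queue.task_done()

    def run(self):
        for _ in range(self.num_workers):
            threading.Thread(target=self.worker, daemon=True).start()
        self.queue.join()  # block until every queued URL has been processed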
