
simplecrawler

A simple multithreaded web crawler that extracts assets and links from each page it visits and builds a sitemap. It uses Beautiful Soup 4 with lxml to parse the HTML and find the relevant elements, and urllib to download pages.
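
A minimal sketch of what that extraction step can look like with Beautiful Soup 4 on top of lxml; the function name and the exact tag/attribute choices are illustrative, not necessarily the repository's implementation:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract(url):
    """Download a page and return its links and asset references."""
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")  # lxml as the underlying parser

    # Anchor hrefs become candidate links for further crawling.
    links = [a["href"] for a in soup.find_all("a", href=True)]

    # Static assets come from src attributes (images, scripts)
    # and from stylesheet <link> tags.
    assets = [tag["src"] for tag in soup.find_all(["img", "script"], src=True)]
    assets += [l["href"] for l in soup.find_all("link", rel="stylesheet", href=True)]

    return links, assets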

If a robots.txt file is present, it is taken into account and its rules are applied to each URL to decide whether it should be added to the sitemap. URLs with .pdf and .xml extensions are ignored by default.
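
A hedged sketch of that filtering step using the standard library's urllib.robotparser; the helper names and the IGNORED_EXTENSIONS constant are assumptions for illustration:

from urllib import robotparser
from urllib.parse import urljoin, urlparse

IGNORED_EXTENSIONS = (".pdf", ".xml")  # skipped by default

def build_robot_parser(domain):
    """Fetch and parse robots.txt for a domain, if it exists."""
    rp = robotparser.RobotFileParser()
    rp.set_url(urljoin(domain, "/robots.txt"))
    rp.read()  # a missing robots.txt simply allows everything
    return rp

def should_crawl(rp, url):
    """Apply robots.txt rules and the extension blacklist to a URL."""
    if urlparse(url).path.lower().endswith(IGNORED_EXTENSIONS):
        return False
    return rp.can_fetch("*", url)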

Installation

Clone the repository and install the dependencies (preferably in a virtualenv):

pip3 install -r requirements.txt

Usage

python3 crawl.py <domain_to_crawl>

Outputting to a file:

python3 crawl.py <domain_to_crawl> > out.txt

Results are output in the following JSON format:

{
   "Links": [
      "#start-of-content",
      "https://github.com/",
      "/personal",
      "/open-source",
      "/business",
      "/explore",
      "/join?source=header-home",
      "/login",
      "/pricing",
      "/blog",
      ...
   ],
   "Assets": [
      "https://assets-cdn.github.com/images/modules/site/home-ill-build.png?sn",
      "https://assets-cdn.github.com/images/modules/site/home-ill-work.png?sn",
      "https://assets-cdn.github.com/images/modules/site/home-ill-projects.png?sn",
      "https://assets-cdn.github.com/images/modules/site/home-ill-platform.png?sn",
      "https://assets-cdn.github.com/images/modules/site/org_example_nasa.png?sn",
      ...
   ],
   "URL": "http://github.com/"
}

Design decisions

  • Multi-threaded. Worker threads feed URLs into a shared priority queue, which 'schedules' them for parsing, so the crawler never has to wait for one result before fetching the next page. The number of crawler threads can be set at initialization (see the sketch after this list).

  • Compliant with robots.txt.

  • Recursive and duplicate crawls are avoided with two sets that track already visited URLs and URLs excluded via robots.txt. Access to the sets is synchronized with a Lock.

  • A parser instead of regex to extract elements from HTML. Regular expressions are not parsers; they are tools for finding patterns. HTML can be nested, malformed, and otherwise messy, and there is plenty of discussion of why regexes fall short for parsing it.

  • lxml as a BeautifulSoup 4 parser.
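
A minimal sketch of the threading model described above: worker threads pull URLs from a shared queue, and the visited/excluded sets are guarded by a Lock. The class layout is illustrative, queue.Queue stands in for the priority queue mentioned above, and parse() is a placeholder for the extraction step:

import threading
from queue import Queue

class Crawler:
    def __init__(self, start_url, num_workers=4):
        self.queue = Queue()           # shared work queue of URLs
        self.visited = set()           # URLs already crawled
        self.excluded = set()          # URLs disallowed by robots.txt
        self.lock = threading.Lock()   # guards access to both sets
        self.num_workers = num_workers
        self.queue.put(start_url)

    def parse(self, url):
        """Placeholder: download url and return extracted links (see above)."""
        return []

    def worker(self):
        while True:
            url = self.queue.get()
            with self.lock:
                seen = url in self.visited or url in self.excluded
                self.visited.add(url)
            if not seen:
                for link in self.parse(url):
                    self.queue.put(link)
            self.queue.task_done()

    def run(self):
        for _ in range(self.num_workers):
            threading.Thread(target=self.worker, daemon=True).start()
        self.queue.join()  # block until every queued URL has been processed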
