🕷 Crawler

🕷 Crawler is a simple (but effective!) web crawler written in Python. It outputs a flat dictionary which shows each page crawled, along with the static assets (e.g. images) found and the links between pages.
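As a rough illustration of how one page could be turned into an entry of that dictionary, here is a minimal sketch using only the standard library. The names `PageParser` and `parse_page` are hypothetical and are not taken from this repository; the real crawler may parse pages differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Hypothetical helper: collects image, script and link URLs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images, self.scripts, self.links = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Resolve relative URLs against the page's own URL.
        if tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "script" and attrs.get("src"):
            self.scripts.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))


def parse_page(url, html):
    """Return one dictionary entry (assets and links) for a single page."""
    parser = PageParser(url)
    parser.feed(html)
    parser.close()
    return {
        "assets": {"images": parser.images, "scripts": parser.scripts},
        "links": parser.links,
    }


sample_html = """
<html><body>
  <img src="/image.png">
  <script src="https://othersite.tld/script.js"></script>
  <a href="page.html">Next</a>
</body></html>
"""
entry = parse_page("https://website.tld", sample_html)
print(entry)
```

The sketch only looks at `img`, `script` and `a` tags; other asset types (stylesheets, for example) would need extra branches.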

Key features:

Fast LRU cache from Python's standard library (functools.lru_cache)
Unit tests (more to come soon!)
Outputs a flat Python dict — easily serializable to JSON
Configurable maximum recursion depth
Restricted to crawling same-domain pages
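The caching, depth limit and same-domain restriction listed above can be sketched together as follows. This is an illustration under stated assumptions, not the repository's actual code: `FAKE_SITE`, `fetch_links`, `same_domain` and `crawl` are hypothetical names, and an in-memory dictionary stands in for real HTTP requests so the example runs offline.

```python
from functools import lru_cache
from urllib.parse import urlparse

# Hypothetical in-memory "site": URL -> links found on that page.
FAKE_SITE = {
    "https://website.tld": ["https://website.tld/a", "https://othersite.tld/x"],
    "https://website.tld/a": ["https://website.tld/b", "https://website.tld"],
    "https://website.tld/b": ["https://website.tld/c"],
    "https://website.tld/c": [],
}


@lru_cache(maxsize=1024)
def fetch_links(url):
    """Stand-in for fetching and parsing a page; results are cached per URL."""
    return tuple(FAKE_SITE.get(url, ()))


def same_domain(url, root):
    """True when both URLs share the same network location."""
    return urlparse(url).netloc == urlparse(root).netloc


def crawl(root, max_depth):
    """Recursively crawl same-domain pages, at most max_depth links from root."""
    results = {}

    def visit(url, depth):
        if url in results or depth > max_depth:
            return
        links = list(fetch_links(url))
        results[url] = {"links": links}
        for link in links:
            # Off-domain links are recorded above but never followed.
            if same_domain(link, root):
                visit(link, depth + 1)

    visit(root, 0)
    return results


site = crawl("https://website.tld", max_depth=2)
print(sorted(site))
```

With `max_depth=2`, the page at `/c` sits three links from the root and is skipped, and `https://othersite.tld/x` is listed as a link but never crawled.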

A sample of the output format:

{
  "https://website.tld": {
    "assets": {
      "images": ["https://website.tld/image.png"],
      "scripts": ["https://othersite.tld/script.js"]
    },
    "links": ["https://website.tld/page.html"]
  },
  
  "https://website.tld/page.html": {
    "assets": {
      "images": [],
      "scripts": ["https://website.tld/scripts/counter.js"]
    },
    "links": []
  }
}
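Because the result is a plain dict of strings and lists, it can be passed straight to the `json` module. The sketch below assumes that `links` is always a list and uses a hypothetical variable name, `crawl_result`, for the crawler's output; it round-trips the sample through JSON and then gathers every script URL across all pages.

```python
import json

# Sample output, with "links" assumed to be a list on every page.
crawl_result = {
    "https://website.tld": {
        "assets": {
            "images": ["https://website.tld/image.png"],
            "scripts": ["https://othersite.tld/script.js"],
        },
        "links": ["https://website.tld/page.html"],
    },
    "https://website.tld/page.html": {
        "assets": {
            "images": [],
            "scripts": ["https://website.tld/scripts/counter.js"],
        },
        "links": [],
    },
}

# Serialize to JSON text and load it back to confirm nothing is lost.
as_json = json.dumps(crawl_result, indent=2, sort_keys=True)
restored = json.loads(as_json)

# The flat layout makes cross-page queries a single comprehension.
all_scripts = sorted(
    script
    for page in restored.values()
    for script in page["assets"]["scripts"]
)
print(all_scripts)
```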

Tests can be run in the root of this repository with python -m nose.

bedekelly/crawler
