A simple (but effective!) web crawler written in Python.

bedekelly/crawler

🕷 Crawler

🕷 Crawler is a simple (but effective!) web crawler written in Python. It outputs a flat dictionary keyed by page URL: each entry lists the static assets (e.g. images and scripts) found on that page and the pages it links to.

Key features:

  • Fast LRU cache from Python's standard library (functools.lru_cache)
  • Unit tests (more to come soon!)
  • Outputs a flat Python dict — easily serializable to JSON
  • Configurable maximum recursion depth
  • Restricted to crawling same-domain pages
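As a rough illustration of how three of these features could fit together (cached fetching, a depth limit, and the same-domain restriction), here is a minimal sketch. It is not the code in this repository: `FAKE_SITE`, `fetch_links`, `same_domain` and `crawl` are hypothetical names, and an in-memory dictionary stands in for real HTTP requests so that the example runs offline.

```python
from functools import lru_cache
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site" standing in for real HTTP responses.
# Each key is a page URL; each value is the list of hrefs found on it.
FAKE_SITE = {
    "https://website.tld/": ["/page.html", "https://othersite.tld/"],
    "https://website.tld/page.html": ["/", "/deep.html"],
    "https://website.tld/deep.html": [],
}


@lru_cache(maxsize=128)
def fetch_links(url):
    """Return the raw hrefs on a page; repeated calls for a URL hit the cache."""
    return tuple(FAKE_SITE.get(url, ()))


def same_domain(url, root):
    """True when both URLs share the same network location (host and port)."""
    return urlparse(url).netloc == urlparse(root).netloc


def crawl(url, root, max_depth, results=None):
    """Depth-limited, same-domain crawl that fills a flat dict keyed by URL."""
    if results is None:
        results = {}
    # Stop on pages already recorded or once the depth budget is used up.
    if url in results or max_depth < 0:
        return results
    # Resolve relative hrefs against the current page, then drop other domains.
    links = [urljoin(url, href) for href in fetch_links(url)]
    links = [link for link in links if same_domain(link, root)]
    results[url] = {"links": links}
    for link in links:
        crawl(link, root, max_depth - 1, results)
    return results


result = crawl("https://website.tld/", "https://website.tld/", max_depth=1)
```

With `max_depth=1` the sketch records the start page and the pages one link away, so `/deep.html` is left out, and the link to `othersite.tld` is dropped by the same-domain check.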

A sample of the output format:

{
  "https://website.tld": {
    "assets": {
      "images": ["https://website.tld/image.png"],
      "scripts": ["https://othersite.tld/script.js"]
    },
    "links": ["https://website.tld/page.html"]
  },
  
  "https://website.tld/page.html": {
    "assets": {
      "images": [],
      "scripts": ["https://website.tld/scripts/counter.js"]
    },
    "links": []
  }
}
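Because the output is a flat dict of strings and lists, it can be passed straight to the standard `json` module. The sketch below assumes a result shaped like the sample above; the variable names (`site_map`, `serialized`, `restored`, `all_images`) are illustrative and not part of the crawler's API.

```python
import json

# A result shaped like the sample above (illustrative data only).
site_map = {
    "https://website.tld": {
        "assets": {
            "images": ["https://website.tld/image.png"],
            "scripts": ["https://othersite.tld/script.js"],
        },
        "links": ["https://website.tld/page.html"],
    },
    "https://website.tld/page.html": {
        "assets": {
            "images": [],
            "scripts": ["https://website.tld/scripts/counter.js"],
        },
        "links": [],
    },
}

# Serialize to JSON text and read it back; the round trip is lossless
# because the structure holds only dicts, lists and strings.
serialized = json.dumps(site_map, indent=2)
restored = json.loads(serialized)

# The flat layout makes whole-site queries a single comprehension,
# e.g. every distinct image found anywhere on the site.
all_images = sorted(
    {image for page in restored.values() for image in page["assets"]["images"]}
)
```

Keeping the structure flat (one top-level key per page) means no recursion is needed to answer questions about the whole site.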

Tests can be run in the root of this repository with python -m nose.
