techhat/cauthon

Cauthon

Cauthon is a web crawler and processing engine, with filters based on the Salt loader system.

import cauthon

# Create a crawler and scrape a single page.
crawler = cauthon.Crawler()
links = crawler.scrape('http://example.com/path/to/page.html')

TO DO

  • Change the SQLite schema to map from URL to checksum to content, using some sort of hashmap.
  • Allow Cauthon to connect to other workers and command them.
  • Splay processing and downloading across multiple workers.
  • Add more intelligent methods for selecting which filters to run than just a site map. Filters that analyze pages to categorize and rank them cannot be restricted to selection by domain name.
  • Support databases other than SQLite.
  • Genesis should be added as a generic database driver.
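
The first item above (URL to checksum to content) could look roughly like the following. This is only a sketch of one possible layout, not Cauthon's actual schema: the table names `urls` and `content`, the helper functions `open_store`, `store_page`, and `load_page`, and the choice of SHA-256 as the checksum are all assumptions made for illustration.

```python
import hashlib
import sqlite3

# Hypothetical two-table layout: each URL points at a checksum, and each
# checksum points at one copy of the page body.
SCHEMA = """
CREATE TABLE IF NOT EXISTS content (
    checksum TEXT PRIMARY KEY,
    body     BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS urls (
    url      TEXT PRIMARY KEY,
    checksum TEXT NOT NULL REFERENCES content (checksum)
);
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_page(conn, url, body):
    """Store a page body (bytes) under its checksum and point the URL at it."""
    checksum = hashlib.sha256(body).hexdigest()
    with conn:  # commit both statements together
        # Identical bodies share one row, so duplicates are stored once.
        conn.execute(
            "INSERT OR IGNORE INTO content (checksum, body) VALUES (?, ?)",
            (checksum, body),
        )
        # Re-crawling a URL simply repoints it at the new checksum.
        conn.execute(
            "INSERT OR REPLACE INTO urls (url, checksum) VALUES (?, ?)",
            (url, checksum),
        )
    return checksum


def load_page(conn, url):
    """Return the stored body for a URL, or None if the URL is unknown."""
    row = conn.execute(
        "SELECT c.body FROM urls u JOIN content c ON c.checksum = u.checksum "
        "WHERE u.url = ?",
        (url,),
    ).fetchone()
    return row[0] if row else None
```

With this layout, two URLs that serve the same bytes share a single `content` row, and a changed page can be detected by comparing the new checksum with the one already recorded for its URL.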

Why the Name?

The Cauthon web crawler is so named in part because it can collect data from various sources, and compile it into a larger database. It can analyze those data to reach certain conclusions. It also has the ability to command other instances of itself, increasing its ability to complete the task at hand.
