webcrawler

A command-line web crawler built with Scrapy. It indexes websites and collects meta information about them.

Prerequisites

The tool is written in Python and uses 2.7 syntax, so first install Python 2.7 if it is not already available.

The required packages are installed with pip, so make sure pip is available first.
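
The exact package list is not reproduced in this README; assuming the dependencies implied above (Scrapy for crawling, SQLAlchemy for the database logging), the install step would look roughly like:

pip install scrapy SQLAlchemy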

The tool logs its work to a database, so a database server must be available on your machine. The supported databases are listed [here](http://docs.sqlalchemy.org/en/rel_0_8/core/engines.html#database-urls):

  • MySQL / PostgreSQL / Oracle / SQLite
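
The README does not show how the connection is configured; a minimal sketch of creating a SQLAlchemy engine from such a database URL (the file name, credentials and database name are placeholders):

from sqlalchemy import create_engine

# SQLite needs no server, only a file path
engine = create_engine("sqlite:///crawler.db")

# Server-based databases use a URL of the form dialect://user:password@host/dbname, e.g.
# engine = create_engine("mysql://user:password@localhost/webcrawler")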

Usage

Get URLs first: the first spider starts at webdesigner.startpagina.nl and crawls recursively until the end of the internet.

startpagina crawler

scrapy crawl startpagina-crawly
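
The repository's actual spider code is not shown in this README; a minimal sketch of what a recursive Scrapy spider with that start URL could look like (the class name, callback and collected fields are illustrative):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class StartpaginaSpider(CrawlSpider):
    # "startpagina-crawly" is the name used by the scrapy crawl command above
    name = "startpagina-crawly"
    start_urls = ["http://webdesigner.startpagina.nl"]

    # Follow every link found and hand each page to parse_item
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        # Collect basic meta information about the page
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").extract_first(),
        }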

wordpress crawler

scrapy crawl wp-crawly
