This is a Scrapy project to scrape websites from public web directories.
This project is licensed under the terms of the MIT license.
The items scraped by this project are websites. The item is defined in the class:
dirbot.items.Website
See the source code for more details.
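As a rough illustration of what a scraped item carries, here is a dependency-free stand-in for the Website item; the real class subclasses scrapy.Item, and the field names (name, url, description) are assumptions based on the project layout, not a verbatim copy:

```python
def make_website(name, url, description):
    """Dict stand-in for a dirbot.items.Website item.

    The real item is a scrapy.Item subclass; the fields shown here
    (name, url, description) are assumed for illustration.
    """
    return {"name": name, "url": url, "description": description}
```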
This project contains one spider called dmoz
that you can see by running:
scrapy list
The dmoz spider scrapes the Open Directory Project (dmoz.org); it is based on the dmoz spider described in the Scrapy tutorial.
This spider doesn't crawl the entire dmoz.org site; by default it scrapes only a few pages (defined in the start_urls
attribute). These pages are:
- http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
- http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
So, if you run the spider as-is (with scrapy crawl dmoz), it will scrape only those two pages.
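A minimal sketch of how such a spider is declared is shown below. In the real project the class subclasses scrapy.Spider; the dependency is omitted here so the sketch stands alone, and allowed_domains is an assumption:

```python
class DmozSpiderSketch:
    """Sketch of the dmoz spider's declaration.

    The real spider subclasses scrapy.Spider and implements parse();
    only the identifying attributes are shown. allowed_domains is an
    assumption for illustration.
    """
    name = "dmoz"
    allowed_domains = ["dmoz.org"]  # assumed
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]
```

Running scrapy crawl dmoz schedules requests for each URL in start_urls and feeds the responses to the spider's parse callback.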
A pipeline to filter out websites containing certain forbidden words in their description. This pipeline is defined in the class:
dirbot.pipelines.FilterWordsPipeline
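The gist of a word-filtering pipeline can be sketched without Scrapy installed; in the real project DropItem comes from scrapy.exceptions, and the word list below is hypothetical, not the project's actual list:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class FilterWordsPipelineSketch:
    """Drops items whose description contains a forbidden word."""

    # Hypothetical word list; the real one lives in dirbot.pipelines.
    words_to_filter = ("politics", "religion")

    def process_item(self, item, spider):
        description = item.get("description", "").lower()
        for word in self.words_to_filter:
            if word in description:
                raise DropItem(f"Contains forbidden word: {word}")
        return item
```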
A pipeline to discard items that lack certain required fields. This pipeline is defined in the class:
dirbot.pipelines.RequiredFieldsPipeline
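A required-fields check follows the same process_item pattern; the field names below are assumed from the item definition, and DropItem again stands in for scrapy.exceptions.DropItem:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""


class RequiredFieldsPipelineSketch:
    """Drops items that are missing any required field."""

    # Assumed required fields; the real list lives in dirbot.pipelines.
    required_fields = ("name", "url", "description")

    def process_item(self, item, spider):
        for field in self.required_fields:
            if not item.get(field):
                raise DropItem(f"Missing required field: {field}")
        return item
```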
A pipeline to store (insert or update) scraped items in a database. This pipeline is defined in the class:
dirbot.pipelines.DbPipeline
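The insert-or-update behavior can be sketched with the stdlib sqlite3 module standing in for MySQL (the real pipeline talks to MySQL using the DB_* settings); the table layout and the choice of url as the unique key are assumptions, the authoritative schema is db/script.sql:

```python
import sqlite3


class DbPipelineSketch:
    """Insert-or-update sketch using sqlite3 as a stand-in for MySQL.

    Column names and the url unique key are assumed; see db/script.sql
    for the real schema.
    """

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS website ("
            " url TEXT PRIMARY KEY, name TEXT, description TEXT)"
        )

    def process_item(self, item, spider):
        # Upsert keyed on url: insert a new row, or refresh an existing one.
        self.conn.execute(
            "INSERT INTO website (url, name, description) VALUES (?, ?, ?) "
            "ON CONFLICT(url) DO UPDATE SET name=excluded.name, "
            "description=excluded.description",
            (item["url"], item["name"], item["description"]),
        )
        self.conn.commit()
        return item
```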
The database schema is defined in db/script.sql, and the settings file contains the default DB_*
settings values (which assume MySQL). Scraped items are stored in the website
table of the dirbot
database.
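The authoritative setting names live in the project's settings file; a hypothetical shape of the DB_* block for a MySQL connection might look like this (every name and value below is an illustrative assumption, not copied from dirbot/settings.py):

```python
# Hypothetical DB_* settings for a MySQL connection.
# Names and defaults are illustrative, not taken from dirbot/settings.py.
DB_HOST = "localhost"
DB_PORT = 3306
DB_USER = "root"
DB_PASSWD = ""
DB_DB = "dirbot"
```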