This repository contains a Scrapy project that scrapes the first pages of three GitHub projects, along with figures of the results. It aims to demonstrate that Scrapy's so-called "depth-first" order is actually a breadth-first order.
The Scrapy project crawls a given, fixed web structure and records the order in which requests are emitted and responses are received, allowing the walked graph to be reconstructed.
The actual crawling orders, for both requests and responses, are drawn (by hand) as graphs in the tree/
directory (SVG files created with Inkscape and exported as PNG). They exist for two Scrapy configurations:
the default one (files named github-tree-*-depth_priority_0.*
), documented as the configuration for "depth-first" order, and an alternative configuration (files named github-tree-*-depth_priority_1.*
) for "breadth-first" order.
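In principle, the queue discipline alone determines the traversal order: a LIFO scheduler walks a structure depth-first, a FIFO one breadth-first; the crawls recorded here check whether Scrapy's actual behavior matches that expectation. As a baseline, a minimal single-threaded simulation of the two disciplines (over a small hypothetical tree, not this project's code):

```python
from collections import deque

# Hypothetical miniature structure standing in for the crawled pages.
TREE = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}

def crawl(tree, lifo):
    """Simulate a scheduler queue: LIFO (Scrapy's default, documented as
    "depth-first") or FIFO (DEPTH_PRIORITY = 1, "breadth-first")."""
    queue = deque(["root"])
    order = []
    while queue:
        node = queue.pop() if lifo else queue.popleft()
        order.append(node)
        for child in tree.get(node, []):
            queue.append(child)
    return order
```

With a FIFO queue the simulation visits all children of a node before any grandchild (breadth-first); with a LIFO queue it dives into the most recently discovered branch first (depth-first, last child first). The point of this repository is to compare such idealized orders against what Scrapy actually does.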
The project crawls three GitHub projects (scrapy/scrapy, scrapy/scrapyd, scrapinghub/scrapylib); in each project it crawls two or three directories, and in each of these directories one to three files.
The complete crawled structure follows, and is defined in the project as github.spiders.PROJECTS
:
```
github.com
\_ github's search page
   \_ scrapy/scrapy
      \_ docs/
         \_ README
         \_ conf.py
         \_ faq.rst
      \_ scrapy/
         \_ VERSION
         \_ spider.py
      \_ extras/
         \_ scrapy.1
         \_ scrapy_zsh_completion
   \_ scrapy/scrapyd
      \_ docs/
         \_ conf.py
         \_ index.rst
         \_ install.rst
      \_ scrapyd/
         \_ VERSION
         \_ app.py
         \_ utils.py
      \_ extras/
         \_ test-scrapyd.sh
   \_ scrapinghub/scrapylib
      \_ scrapylib/
         \_ redisqueue.py
         \_ links.py
      \_ tests/
         \_ test_links.py
         \_ test_magicfields.py
```
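The structure above could be encoded as a nested mapping; the following is only a sketch of what github.spiders.PROJECTS might look like (the actual layout in the project may differ):

```python
# Hypothetical encoding of the crawled structure; keys and nesting are
# assumptions, the names come from the tree above.
PROJECTS = {
    "scrapy/scrapy": {
        "docs/": ["README", "conf.py", "faq.rst"],
        "scrapy/": ["VERSION", "spider.py"],
        "extras/": ["scrapy.1", "scrapy_zsh_completion"],
    },
    "scrapy/scrapyd": {
        "docs/": ["conf.py", "index.rst", "install.rst"],
        "scrapyd/": ["VERSION", "app.py", "utils.py"],
        "extras/": ["test-scrapyd.sh"],
    },
    "scrapinghub/scrapylib": {
        "scrapylib/": ["redisqueue.py", "links.py"],
        "tests/": ["test_links.py", "test_magicfields.py"],
    },
}
```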
For each parent node, the order of its direct children is given above top-to-bottom; e.g., in the scrapy/scrapy
project's docs/
directory, the crawler will request README
first, then conf.py
, and finally faq.rst
. In the tree/github-tree*
figures, this order is represented left-to-right.
The project is configured through github/settings.py
. The default configuration, as documented by Scrapy here: http://doc.scrapy.org/en/1.0/faq.html#does-scrapy-crawl-in-breadth-first-or-depth-first-order , is for "depth-first" order.
To switch to "breadth-first" order, uncomment the last lines as such:
```python
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```
To ensure that requests are processed without any randomized delay, in the order they are emitted by the spider callbacks, RANDOMIZE_DOWNLOAD_DELAY
is set to False
.
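Taken together, the relevant part of github/settings.py might look like the sketch below (an assumed layout; the breadth-first lines are commented out by default, and the actual file may differ):

```python
# Sketch of the relevant settings (assumed layout; see github/settings.py).
RANDOMIZE_DOWNLOAD_DELAY = False  # process requests in emission order

# Uncomment to switch from the default "depth-first" (LIFO) scheduler
# to "breadth-first" (FIFO) order:
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
```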
In the parsing callbacks, emitted requests and received responses are stored in the spider's GitHubSpider.requests
and GitHubSpider.responses
lists.
Requests are stored just before being yielded by the parsing callback:
```python
def parse_directory(self, response):
    # ...
    for filename in project.dirs[crawled_infos.current_dir]:
        # ...
        request = Request(...)
        self.requests.append(request)
        yield request
```
Responses are stored at the beginning of the parsing callback:

```python
def parse_directory(self, response):
    self.responses.append(response)
    # ...
```
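Once a crawl finishes, the two lists can be compared to see whether responses came back in the same order the requests went out. A small hypothetical helper (not part of the project; only the .url attribute of Scrapy's Request/Response objects is assumed here, stubbed with a namedtuple so the sketch is self-contained):

```python
from collections import namedtuple

# Stand-in for scrapy's Request/Response: only .url is needed here.
Page = namedtuple("Page", "url")

def order_preserved(requests, responses):
    """True if responses arrived in the exact order their requests
    were emitted, compared URL by URL."""
    return [r.url for r in requests] == [r.url for r in responses]
```

Applied to GitHubSpider.requests and GitHubSpider.responses after a crawl, a helper like this would make the depth-first vs breadth-first discrepancy directly visible.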