Demo of scraping GitHub pages to illustrate Scrapy's "depth-first" order

This repository contains a Scrapy project that scrapes the first pages of three GitHub projects, along with some figures of the results. It aims to demonstrate that Scrapy's so-called "depth-first" order is actually a breadth-first order.

The Scrapy project crawls a given, fixed web structure (see "Scraped structure" below), and records the orders in which requests are emitted and responses are processed, allowing the walked graph to be reconstructed.

The actual crawl orders, for both requests and responses, are drawn by hand as graphs in the tree/ directory (SVG files created with Inkscape and exported as PNG). They exist for two Scrapy configurations (see "Configurations" below): the default one (files named github-tree-*-depth_priority_0.*), documented as the configuration for "depth-first" order, and an alternative one (files named github-tree-*-depth_priority_1.*) for "breadth-first" order.

Scraped structure

The project crawls three GitHub projects (scrapy/scrapy, scrapy/scrapyd, scrapinghub/scrapylib); in each project it crawls two or three directories, and in each of these directories one, two, or three files.

The complete crawled structure follows; it is defined in the project as github.spiders.PROJECTS:

github.com
 \_ github's search page
     \_ scrapy/scrapy
         \_ docs/
             \_ README
             \_ conf.py
             \_ faq.rst
         \_ scrapy/
             \_ VERSION
             \_ spider.py
         \_ extras/
             \_ scrapy.1
             \_ scrapy_zsh_completion
     \_ scrapy/scrapyd
         \_ docs/
             \_ conf.py
             \_ index.rst
             \_ install.rst
         \_ scrapyd/
             \_ VERSION
             \_ app.py
             \_ utils.py
         \_ extras/
             \_ test-scrapyd.sh
     \_ scrapinghub/scrapylib
         \_ scrapylib/
             \_ redisqueue.py
             \_ links.py
         \_ tests/
             \_ test_links.py
             \_ test_magicfields.py

For each parent node, the order of its direct children is given above top-to-bottom: e.g., in the docs/ directory of the scrapy/scrapy project, the crawler requests README first, then conf.py, and finally faq.rst. The same order is represented left-to-right in the tree/github-tree* figures.
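
The exact definition of github.spiders.PROJECTS lives in the source and is not reproduced here. As a minimal, hypothetical sketch, consistent with the project.dirs access shown further below but with invented field names, such a structure could look like:

from collections import OrderedDict  # preserves the top-to-bottom crawl order

class Project(object):
    """Hypothetical container for one GitHub project and its crawled content."""
    def __init__(self, name, dirs):
        self.name = name  # e.g. 'scrapy/scrapy'
        self.dirs = dirs  # ordered mapping: directory -> ordered list of files

PROJECTS = [
    Project('scrapy/scrapy', OrderedDict([
        ('docs/', ['README', 'conf.py', 'faq.rst']),
        ('scrapy/', ['VERSION', 'spider.py']),
        ('extras/', ['scrapy.1', 'scrapy_zsh_completion']),
    ])),
    Project('scrapy/scrapyd', OrderedDict([
        ('docs/', ['conf.py', 'index.rst', 'install.rst']),
        ('scrapyd/', ['VERSION', 'app.py', 'utils.py']),
        ('extras/', ['test-scrapyd.sh']),
    ])),
    Project('scrapinghub/scrapylib', OrderedDict([
        ('scrapylib/', ['redisqueue.py', 'links.py']),
        ('tests/', ['test_links.py', 'test_magicfields.py']),
    ])),
]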

Configurations

The project is configured through github/settings.py. The default configuration is the one documented by Scrapy (see http://doc.scrapy.org/en/1.0/faq.html#does-scrapy-crawl-in-breadth-first-or-depth-first-order) as yielding "depth-first" order.

To switch to "breadth-first" order, uncomment its last lines:

# With a positive DEPTH_PRIORITY, deeper requests get lower priority,
# and FIFO queues replace Scrapy's default LIFO ones:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

To ensure that requests are processed without any randomized delay, in the order they are emitted by the spider's callbacks, RANDOMIZE_DOWNLOAD_DELAY is set to False.
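
Put together, the relevant end of github/settings.py might look like the following sketch (only these lines are shown; the rest of the real file is omitted):

# Always download in emission order, without random delays.
RANDOMIZE_DOWNLOAD_DELAY = False

# Uncomment the following lines to switch from the default
# "depth-first" configuration to the "breadth-first" one:
#DEPTH_PRIORITY = 1
#SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
#SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'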

Collecting requests and responses

In each parsing function, emitted requests and received responses are stored in the spider's GitHubSpider.requests and GitHubSpider.responses lists.

Requests are stored just before being yielded by the parsing function:

def parse_directory(self, response):
    #...
    for filename in project.dirs[crawled_infos.current_dir]:
        #...
        request = Request(...)
        self.requests.append(request)  # record emission order
        yield request

Responses are stored at the beginning of the parsing function:

def parse_directory(self, response):
    self.responses.append(response)  # record reception order
    #...
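
How the project actually outputs both lists is not reproduced here; as a minimal sketch, Scrapy's optional closed() hook could dump the two recorded orders once the crawl ends, so that the walked graph can be reconstructed:

def closed(self, reason):
    """Called by Scrapy when the spider closes: dump both recorded orders."""
    for i, request in enumerate(self.requests):
        self.logger.info("request #%d: %s", i, request.url)
    for i, response in enumerate(self.responses):
        self.logger.info("response #%d: %s", i, response.url)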
