Benchmarking CLI for Scrapy

(The project is still in development.)

A command-line interface for benchmarking Scrapy that reflects real-world usage.

Why?

  • Currently, the built-in scrapy bench command just spawns a spider that aggressively crawls randomly generated links at high speed.
  • The speed thus obtained, while it may be useful for comparisons, does not reflect a real-world scenario.
  • The actual speed also varies with the Python and Scrapy versions in use.

Current Features

  • Spawns a CPU-intensive spider that follows a fixed number of links on a static snapshot of the site Books to Scrape.
  • Mimics a real-world scenario in which various details of each book are extracted and stored in a .csv file.
  • A broad crawl benchmark that uses 1000 copies of the site Books to Scrape, dynamically generated using Twisted (see the server.py file in the repository).
  • A micro-benchmark that tests the LinkExtractor() function by extracting links from a collection of HTML pages.
  • A micro-benchmark that tests extraction using CSS selectors from a collection of HTML pages.
  • A micro-benchmark that tests extraction using XPath from a collection of HTML pages.
  • Profiling of the benchmarkers with vmprof, with optional upload of the profile to the vmprof website.

Options

  • --n-runs option for performing more than one iteration of the spider to improve precision.
  • --only_result option for viewing only the results.
  • --upload_result option to upload the results to a local Codespeed instance for easier comparison.

Installation

For Ubuntu

  • First, download a static snapshot of the website Books to Scrape. This can be done with wget:

      wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
          http://books.toscrape.com/index.html
    
  • Then link the downloaded site directory into /var/www/html:

      sudo ln -s `pwd`/books.toscrape.com/ /var/www/html/
    
  • nginx is required to serve the website, so it must be installed and configured. Once it is running, you should be able to see the site at http://localhost/books.toscrape.com/index.html (on Ubuntu, nginx's default document root is /var/www/html).

  • If not, install it as follows:

      sudo apt-get update
      sudo apt-get install nginx
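
  • To check that nginx is serving the snapshot, a request such as the one below should return 200 OK (the path assumes nginx's default /var/www/html document root and the symlink created above):

      curl -I http://localhost/books.toscrape.com/index.html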
    
  • For the broad crawl, use the server.py file to serve the various copies of the local Books to Scrape snapshot, which should already be in /var/www/html. A sketch of starting it is shown below.
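
  • A minimal sketch of starting the server, assuming server.py is run directly with no arguments and listens on port 8880 as described below:

      python server.py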

  • Add the following entries to the /etc/hosts file (a shell loop that generates them all is sketched after this list):

    127.0.0.1    domain1
    127.0.0.1    domain2
    127.0.0.1    domain3
    127.0.0.1    domain4
    127.0.0.1    domain5
    127.0.0.1    domain6
    127.0.0.1    domain7
    127.0.0.1    domain8
    ....................
    127.0.0.1    domain1000
    
  • This points the hostnames http://domain1:8880/index.html through http://domain1000:8880/index.html at the same locally generated site as http://localhost:8880/index.html.
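
  • Writing 1000 entries by hand is impractical; a small shell loop such as the following appends them all (sudo is needed to modify /etc/hosts):

      for i in $(seq 1 1000); do
          echo "127.0.0.1    domain$i"
      done | sudo tee -a /etc/hosts > /dev/null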

There are 130 HTML files in sites.tar.gz, which were downloaded using download.py from the Alexa top sites list.

There are 200 HTML files in bookfiles.tar.gz, which were downloaded using download.py from the website Books to Scrape.

The download.py spider dumps each response body as Unicode to a file. The list of top sites was taken from the Alexa rankings.
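
If you want to inspect these pages, the archives unpack with tar; where the benchmarks expect them is not specified here, so the target directory is left to you:

    tar -xzf bookfiles.tar.gz
    tar -xzf sites.tar.gz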

  • Do the following to complete the installation:

    git clone https://github.com/scrapy/scrapy-bench.git  
    cd scrapy-bench/  
    virtualenv env  
    . env/bin/activate   
    pip install --editable .
    
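  • To confirm that the CLI is installed, display its help text:

      scrapy-bench --help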

Usage

Usage: scrapy-bench [OPTIONS] COMMAND [ARGS]...

  A benchmark suite for Scrapy.

Options:
  --n-runs INTEGER  Take multiple readings for the benchmark.
  --only_result     Display the results only.
  --upload_result   Upload the results to local codespeed
  --book_url TEXT   Use with bookworm command. The url to books.toscrape.com on your local machine
  --vmprof          Profile the benchmarker with Vmprof
  --help            Show this message and exit.

Commands:
  bookworm       Spider to scrape locally hosted site
  broadworm      Broad crawl spider to scrape locally hosted...
  cssbench       Micro-benchmark for extraction using css
  linkextractor  Micro-benchmark for LinkExtractor()
  xpathbench     Micro-benchmark for extraction using xpath
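
For example, to run the bookworm benchmark three times against the locally hosted snapshot, and to profile a single run with vmprof (the --book_url value assumes the nginx setup described above):

    scrapy-bench --n-runs 3 --book_url http://localhost/books.toscrape.com/index.html bookworm
    scrapy-bench --vmprof bookworm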
