scrapy-simple-http-queue

A Scrapy plugin that uses simple-http-queue as the URL queue, enabling distributed crawling.

First initialize the external libraries (simple-http-queue is vendored as a git submodule):

git submodule init
git submodule update

Then start simple-http-queue:

cd externals/simple-http-queue/simple_http_queue
python HttpQueue.py /tmp/queue.dat 8888

For a complete example, see run_example.sh.

The following options are available in settings.py:

HTTP_HOST (default: localhost)
HTTP_PORT (default: 8888)
SCHEDULER_PERSIST (default: True)
SCHEDULER_QUEUE_NAME (default: the spider's name)
QUEUE_TYPE (default: FIFO; also accepts LIFO)
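A settings.py using these options might look like the following sketch; the concrete values (host, port, queue name) are illustrative, not requirements of the plugin:

```python
# settings.py -- illustrative values; adjust to your deployment
HTTP_HOST = "localhost"        # host where HttpQueue.py is running
HTTP_PORT = 8888               # port passed to HttpQueue.py on startup
SCHEDULER_PERSIST = True       # keep the queue contents across spider restarts
SCHEDULER_QUEUE_NAME = "example_spider"  # defaults to the spider's name
QUEUE_TYPE = "FIFO"            # "FIFO" for breadth-first, "LIFO" for depth-first
```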

Use FIFO for breadth-first crawling and LIFO for depth-first crawling.

LIFO typically consumes less memory, since the queue stays shorter while crawling pages.
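The difference between the two orderings can be sketched with a plain in-memory queue (the real queue here is the HTTP service, not a deque; the link graph below is made up for illustration):

```python
from collections import deque

# A tiny hypothetical link graph: the start URL yields two links,
# each of which yields two more.
links = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1", "/b/2"],
}

def crawl(queue_type):
    """Return the visit order for a FIFO (breadth-first) or
    LIFO (depth-first) scheduling discipline."""
    queue = deque(["/"])
    order = []
    while queue:
        # FIFO pops the oldest entry; LIFO pops the newest.
        url = queue.popleft() if queue_type == "FIFO" else queue.pop()
        order.append(url)
        queue.extend(links.get(url, []))
    return order

print(crawl("FIFO"))  # visits /, /a, /b before any leaf (breadth-first)
print(crawl("LIFO"))  # dives into one branch before the other (depth-first)
```

Note how FIFO holds a whole level of the crawl frontier in the queue at once, while LIFO drains each branch before expanding the next, which is why the LIFO queue stays shorter.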
