scrapy-simple-http-queue

A Scrapy plugin that uses simple-http-queue as the URL queue, enabling distributed crawling.

First initialize the external libraries (simple-http-queue is vendored as a git submodule):

git submodule init
git submodule update

Then start simple-http-queue:

cd externals/simple-http-queue/simple_http_queue
python HttpQueue.py /tmp/queue.dat 8888

For a complete example, see run_example.sh.

The following options are available in settings.py:

HTTP_HOST (default: localhost)
HTTP_PORT (default: 8888)
SCHEDULER_PERSIST (default: True)
SCHEDULER_QUEUE_NAME (default: the spider's name)
QUEUE_TYPE (default: FIFO; also accepts LIFO)
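A settings.py using these options might look like the following sketch; the concrete values (host, port, queue name) are illustrative, not requirements of the plugin:

```python
# settings.py -- illustrative values; adjust to your deployment
HTTP_HOST = "localhost"        # host where HttpQueue.py is running
HTTP_PORT = 8888               # port passed to HttpQueue.py on startup
SCHEDULER_PERSIST = True       # keep the queue contents across spider restarts
SCHEDULER_QUEUE_NAME = "example_spider"  # defaults to the spider's name
QUEUE_TYPE = "FIFO"            # "FIFO" for breadth-first, "LIFO" for depth-first
```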

Use FIFO for breadth-first crawling and LIFO for depth-first crawling.

LIFO typically consumes less memory, since the queue stays shorter while crawling pages.
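The difference between the two orderings can be sketched with a plain in-memory queue (the real queue here is the HTTP service, not a deque; the link graph below is made up for illustration):

```python
from collections import deque

# A tiny hypothetical link graph: the start URL yields two links,
# each of which yields two more.
links = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1", "/b/2"],
}

def crawl(queue_type):
    """Return the visit order for a FIFO (breadth-first) or
    LIFO (depth-first) scheduling discipline."""
    queue = deque(["/"])
    order = []
    while queue:
        # FIFO pops the oldest entry; LIFO pops the newest.
        url = queue.popleft() if queue_type == "FIFO" else queue.pop()
        order.append(url)
        queue.extend(links.get(url, []))
    return order

print(crawl("FIFO"))  # visits /, /a, /b before any leaf (breadth-first)
print(crawl("LIFO"))  # dives into one branch before the other (depth-first)
```

Note how FIFO holds a whole level of the crawl frontier in the queue at once, while LIFO drains each branch before expanding the next, which is why the LIFO queue stays shorter.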
