Scraper

Introduction

Scraper is a distributed web crawler framework derived from Scrapy that relies on Redis for task distribution. It is written in Python.

Architecture Diagram

Components

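Scraper daemon: A daemon application that polls the Redis server for crawl task requests and spawns a Scrapy process for each request to run the corresponding crawl task. Scrapy is a single-threaded application built on Twisted's event-driven model, and Python's global interpreter lock (GIL) limits multithreaded performance, so the daemon uses a multi-process model instead.
Spider A, B, C...: The processes that actually run crawl tasks. Each is equivalent to running scrapy crawl [spidername], but has likewise been adapted for distributed operation through Redis, so the same task can run concurrently in multiple processes or on multiple hosts.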
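
The daemon's poll-and-spawn behavior described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: build_command, poll_once, and the pop_task callable are hypothetical names, and the way a request is popped from Redis is left abstract because this README does not state how the daemon reads the queue.

```python
import json
import subprocess


def build_command(task_json):
    """Turn a queued JSON request into a `scrapy crawl` command line."""
    task = json.loads(task_json)
    return ["scrapy", "crawl", task["name"]]


def poll_once(pop_task, launch=subprocess.Popen):
    """Pop one request, if any, and start a separate process for it.

    pop_task is a hypothetical callable that returns the next request as a
    JSON string, or None when the queue is empty. launch defaults to
    subprocess.Popen, so each task runs in its own OS process.
    """
    task_json = pop_task()
    if task_json is None:
        return None
    return launch(build_command(task_json))
```

Running each spider in its own process keeps every Twisted reactor isolated and sidesteps the GIL, which matches the multi-process reasoning given for the daemon.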

Installation and Usage

Install Python 2.7
Install pip
Run git clone https://github.com/LightKool/scraper.git
Run cd scraper && python setup.py install
Run scraperd
The scraper daemon is now running. Using any Redis client (the control program is not yet complete), run the Redis command zadd scraperd:spider:queue 100 "{\"name\": \"[spidername]\"}". A new process is then created to run the Spider named [spidername] (see the Scrapy documentation for details).
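
The same request can also be submitted from Python instead of a Redis command-line client. The sketch below is an assumption-laden illustration: build_task and enqueue_spider are hypothetical helper names, and client is assumed to be an object with a redis-py 3.x style zadd(key, mapping) method, such as a redis.Redis instance. The queue key, score, and JSON shape are taken from the command above.

```python
import json

QUEUE_KEY = "scraperd:spider:queue"  # key used in the ZADD command above


def build_task(spider_name):
    """Serialize a task request in the JSON shape shown in this README."""
    return json.dumps({"name": spider_name})


def enqueue_spider(client, spider_name, score=100):
    """Add a task request to the sorted set, like the ZADD command above.

    client is assumed to expose zadd(key, {member: score}) as redis-py 3.x
    does; the return value is whatever the client's zadd returns.
    """
    return client.zadd(QUEUE_KEY, {build_task(spider_name): score})
```

With redis-py installed and a Redis server reachable, a call such as enqueue_spider(redis.Redis(), "[spidername]") should have the same effect as the zadd command.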
