- Messy spiders that crawl news.sina.com.cn, news.qq.com, chinanews.com, and weibo.cn, built on the Python Scrapy framework.
- This project serves as the base module of a web data analysis project.
- It's a remote collaborative practice project, so I can't guarantee that every spider works well.
- It may contain code copied from elsewhere, possibly without a valid license.
- The code has few comments.
- Our team's git server looks like it will never be ready, so I persuaded the group to at least use GitHub.
- Documents and notes are mostly in Chinese.
- But my buddy and I will try our best to standardize this project.
¯\\\_(ツ)\_/¯
Our project is a bit special:
it accepts a keyword, then searches for and crawls data containing that keyword,
instead of building a general website topology.
For a diagram-style view of this project, click here.
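As a rough sketch of the keyword-driven entry point described above: each spider starts from the target sites' search pages rather than from a fixed site map. The search-URL templates below are illustrative placeholders, not the actual endpoints the spiders use.

```python
from urllib.parse import quote

# Hypothetical search-page templates, one per target site.
# Real endpoints live in the individual spiders; these are examples only.
SEARCH_URL_TEMPLATES = {
    "sina": "https://search.sina.com.cn/?q={kw}&c=news",
    "qq": "https://new.qq.com/search?query={kw}",
    "chinanews": "https://sou.chinanews.com/search.do?q={kw}",
    "weibo": "https://weibo.cn/search/?keyword={kw}",
}

def build_start_urls(keyword: str) -> list[str]:
    """Return one search URL per site for the given keyword."""
    kw = quote(keyword)  # percent-encode, so Chinese keywords are safe
    return [tpl.format(kw=kw) for tpl in SEARCH_URL_TEMPLATES.values()]
```

A spider can then use the URL for its own site as its `start_urls` entry and follow the result links it finds there.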
- /ScrapySwarm, yes, that's a Scrapy project.
- /Doc, documentation about ScrapySwarm.
- scrapy.cfg, auto-generated by the Scrapy console when the project was initialized.
- /mysite, a Django app which provides a web interface to run all spiders.
  - It can only run all spiders at once.
  - To run spiders with more control, you'd better use a Python script that imports ScrapySwarm.control.swarm_api.
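The interface of ScrapySwarm.control.swarm_api isn't documented in this README, so as a sketch of what such a runner script can look like, here is the same idea expressed with Scrapy's standard CrawlerProcess API; the `run_all_spiders` helper and its `keyword` parameter are illustrative, not the actual swarm_api interface.

```python
def run_all_spiders(keyword: str) -> None:
    """Start every spider registered in the project, passing the
    search keyword through as a spider argument.

    Scrapy is imported lazily so this helper can be defined (and the
    script inspected) without Scrapy installed.
    """
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Picks up settings from scrapy.cfg / settings.py in the project root.
    process = CrawlerProcess(get_project_settings())
    for name in process.spider_loader.list():
        process.crawl(name, keyword=keyword)
    process.start()  # blocks until all spiders finish

if __name__ == "__main__":
    run_all_spiders("some keyword")
```

Run it from the project root (next to scrapy.cfg) so `get_project_settings()` can find the project settings.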
Environment setup: https://github.com/boholder/ScrapySwarm/wiki/Set-Up-Environment