This repository has been archived by the owner on Sep 8, 2019. It is now read-only.

Messy spiders that crawl several Chinese news websites, built on the Python Scrapy framework. It's a cooperative practice project, so I can't guarantee that every spider works well. It may contain code copied from elsewhere without a valid licence.

qieting/ScrapySwarm

ScrapySwarm

About

  • Messy spiders that crawl news.sina.com.cn, news.qq.com, chinanews.com, and weibo.cn, built on the Python Scrapy framework.
    • This project serves as a base module for a web data analysis project.
  • It's a remote cooperative practice project, so I can't guarantee that every spider works well.
    • It may contain code copied from elsewhere, even without a valid licence.
    • There are few comments in the code.
  • Our team's own git server looks like it will never be ready, so I persuaded the group to at least use GitHub.
  • Documents and notes are mostly in Chinese.
  • But I'll try my best to standardize this project with my buddy.
    ¯\_(ツ)_/¯

Our project is a bit special:
it accepts a keyword, then searches for and crawls data that contains that keyword,
instead of mapping out a website's overall topology.
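
The keyword-driven idea can be sketched in plain Python. This is only an illustration, not the project's actual code: the `SEARCH_TEMPLATES` URLs are placeholders on example.com, and `build_start_urls` and `mentions_keyword` are hypothetical helper names that do not come from ScrapySwarm.

```python
from urllib.parse import quote_plus

# Placeholder search-URL templates (assumption): the real endpoints the
# spiders query are not documented here, so example.com stands in for them.
SEARCH_TEMPLATES = {
    "site_a": "https://search.example.com/a?q={query}&page={page}",
    "site_b": "https://search.example.com/b?keyword={query}&page={page}",
}


def build_start_urls(keyword, pages=1):
    """Return one search URL per site and result page for the keyword."""
    query = quote_plus(keyword)  # URL-encode spaces and non-ASCII text
    urls = []
    for template in SEARCH_TEMPLATES.values():
        for page in range(1, pages + 1):
            urls.append(template.format(query=query, page=page))
    return urls


def mentions_keyword(text, keyword):
    """Case-insensitive check that crawled text contains the keyword."""
    return keyword.lower() in text.lower()
```

In a Scrapy spider, URLs like these would typically become the start requests, and a filter like `mentions_keyword` would decide which scraped items to keep.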

For a diagram type view of this project, click here

Directory introduction:

  • /ScrapySwarm, the Scrapy project itself.
  • /Doc, documentation about ScrapySwarm.
  • scrapy.cfg, auto-generated by the Scrapy command line when the project was initialized.
  • /mysite, a Django app that provides a web interface for running all spiders.
    • It can only run all spiders at once.
    • It's better to run spiders from a Python script that imports Scrapyswarm.control.swarm_api.

Set up environment

https://github.com/boholder/ScrapySwarm/wiki/Set-Up-Environment

How to run
