
rss_crawl (RSS spider)

A crawler for RSS websites. The more websites it covers, the better.

LICENSE: GPL V2

requirements:

python 2.7
Scrapy==0.16.5
Twisted==13.1.0
MySQL-python==1.2.4
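
A minimal environment sketch, assuming pip and virtualenv are available; note that MySQL-python needs the MySQL client development headers at build time:

    virtualenv --python=python2.7 rss_env
    . rss_env/bin/activate
    pip install Scrapy==0.16.5 Twisted==13.1.0 MySQL-python==1.2.4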

how to use:

  • configure the database via the rss_spider/settings.py file (an end-to-end sketch follows this list)

      HOST = ""
      DB = ""
      USER = ""
      PASSWD = ""
    
  • create the database via the sql.sql file

  • initialize the rss_urls table via the script in the script folder:

      python insert_rss_list.py  
    

(the rss_list_init.txt file includes more than 800 RSS links)

  • start a Scrapy server:

      scrapy server &
    
  • start a spider:

      cd script
      ./start_spider.py
    
  • stop a spider:

      cd script
      ./stop_spider.py
    
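
Putting the steps together, here is an end-to-end sketch; the host, database name, user, and password below are placeholder values for illustration, not ones the project prescribes:

    # rss_spider/settings.py
    HOST = "127.0.0.1"
    DB = "rss_spider"
    USER = "rss"
    PASSWD = "secret"

Then, from the repository root:

    mysql -u rss -p < sql.sql    # create the database/tables from sql.sql
    cd script
    python insert_rss_list.py    # load the feed list from rss_list_init.txt
    cd ..
    scrapy server &              # start the Scrapy server from the project root
    cd script
    ./start_spider.py            # schedule the spider
    ./stop_spider.py             # stop it again later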

autostart:

you can start and stop the spider automatically via cron, for example:

    crontab -e
    0 * * * * /home/name/work/rss_spider/script/start_spider.py
    28 * * * * /home/name/work/rss_spider/script/stop_spider.py
    30 * * * * /home/name/work/rss_spider/script/start_spider.py
    58 * * * * /home/name/work/rss_spider/script/stop_spider.py

(start the spider at minutes 0 and 30 of each hour, and stop it at minutes 28 and 58.)

interface:

I have also written a website in Clojure for handling the information downloaded from these RSS sources. I will open-source it some day.
