pyspider-stock

Note: This README has both a Chinese and an English version; the Chinese version comes first because the project targets the Chinese stock market.

Update:

  • Added crawling and analysis of IT-sector stocks

What does this project do?

This project uses pyspider to crawl posts from Eastmoney Guba, Xueqiu, and Sina Guba, then applies natural language processing (sentiment analysis) to gauge public opinion.

It therefore has two parts:

  1. crawl posts
  2. sentiment analysis

How do I run it?

Step 1 Crawl posts

  • Install pyspider, mongoDB, redis, snowNLP, pymongo (2.9), and their dependencies
  • Run set_codes/set_hs300.py and set_codes/set_IT.py (the former loads the HS300 constituent stock codes into mongoDB; the latter loads the codes of IT stocks; see the sketch after this list)
  • Put resultdb.py into pyspider's database/mongodb directory (so that crawled data is saved to mongoDB); locate the pyspider path with pip show pyspider
  • Start redis
  • From the directory containing config.json, run pyspider -c config.json all & on the command line
  • Copy a script from the script folder and paste it into your own project at localhost:5000 (paste the script for whichever site you want to crawl), then save
  • Finally, click run on the localhost:5000 page
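
As a rough illustration, here is a minimal sketch of what set_codes/set_hs300.py does: load the HS300 constituent codes into mongoDB with pymongo. The CSV file name and the database/collection names are assumptions, not necessarily what the script actually uses.

# Hypothetical sketch: load HS300 constituent codes into mongoDB (pymongo 2.x API).
import csv
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
codes = client['stockcodes']['hs300']  # assumed database/collection names

with open('hs300.csv') as f:  # assumed CSV of constituent codes, one per row
    for row in csv.reader(f):
        code = row[0].strip()
        # upsert so the script can be re-run without creating duplicates
        codes.update({'code': code}, {'code': code}, upsert=True)

print('%d codes loaded' % codes.count())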

Run the last two steps before the market opens in the morning, and you will have yesterday's HS300 sentiment data before each day's open.

Step 2 Sentiment analysis

This step can be run 30 minutes after Step 1 has finished.

On the first run, create a data directory in the same directory as main.py.

Then run main.py.

What happens?

By default, gubaEast.py crawls the GuYouHui section of Eastmoney, because it is the most stable.

After Step 1 finishes, you will find a collection named [date]GuYouHui under a database named [stockcode]eastmoney, where [stockcode] is an HS300 constituent's stock code and [date] is yesterday's date.

Next comes the sentiment analysis part.

The core is three pieces of code:

produceFactor.getSentimentFactor(stockCode, grab_time)

Obtains, for the given stock and crawl date, each post's sentiment factor and sentiment value (the sentiment factor multiplied by the post's read count).
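
For concreteness, here is a sketch of this per-post computation, assuming each crawled post carries a title and a read count (the field names are hypothetical); SnowNLP's sentiments property scores Chinese text in [0, 1]:

# Hypothetical per-post computation: SnowNLP yields a sentiment factor in [0, 1];
# multiplying by the read count gives the post's sentiment value.
from snownlp import SnowNLP

def post_sentiment(post):
    # 'title' and 'read_num' are assumed field names for a crawled post
    factor = SnowNLP(post['title']).sentiments
    value = factor * int(post['read_num'])
    return factor, value

factor, value = post_sentiment({'title': u'业绩大超预期,看好后市', 'read_num': '1024'})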

aggregateFactor.aggregate(stockCode, grab_time)

Obtains the sentiment value of the given stock for the crawl date (the sum of all its posts' sentiment values); the result is saved in [date]SentimentFactor under [stockcode]eastmoney.

dailyResult.setDailyResult(stockCode, grab_time)

Aggregates the sentiment values and post counts of all HS300 stocks for the crawl date; the result is in the DailyResult collection under the [date] database.
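
Putting the three calls together, main.py presumably loops over every HS300 code along these lines (the imports, date format, and code store are assumptions):

# Hypothetical driver loop: run the three stages for every HS300 code.
import datetime
from pymongo import MongoClient

import produceFactor, aggregateFactor, dailyResult  # the project's own modules

grab_time = datetime.date.today().strftime('%Y-%m-%d')  # assumed date format
client = MongoClient('localhost', 27017)

for doc in client['stockcodes']['hs300'].find():  # assumed code store from Step 1
    stockCode = doc['code']
    produceFactor.getSentimentFactor(stockCode, grab_time)
    aggregateFactor.aggregate(stockCode, grab_time)
    dailyResult.setDailyResult(stockCode, grab_time)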

The results are then saved as an Excel file in the data directory.
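
A minimal sketch of writing such an .xls file, assuming the xlwt package and a hypothetical column layout:

# Hypothetical export of the daily result to data/<grab_time>result.xls with xlwt.
import xlwt

def save_result(grab_time, rows):
    # rows: iterable of (stock_code, sentiment_value, post_count) tuples (assumed)
    book = xlwt.Workbook()
    sheet = book.add_sheet('DailyResult')
    for col, header in enumerate(('code', 'sentiment', 'posts')):
        sheet.write(0, col, header)
    for r, (code, sentiment, posts) in enumerate(rows, start=1):
        sheet.write(r, 0, code)
        sheet.write(r, 1, sentiment)
        sheet.write(r, 2, posts)
    book.save('data/' + grab_time + 'result.xls')  # matches the mv path below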

The results are emailed to the recipients you specify via the sendMail module.
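
The sendMail module is not shown here; this is a sketch of that kind of mailer using only the standard library (the SMTP server, credentials, and addresses are placeholders):

# Hypothetical mailer: send the Excel report as an attachment with smtplib.
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.application import MIMEApplication

def send_report(path, sender, password, recipients):
    msg = MIMEMultipart()
    msg['Subject'] = 'Daily sentiment result'
    msg['From'] = sender
    msg['To'] = ', '.join(recipients)
    with open(path, 'rb') as f:
        attachment = MIMEApplication(f.read(), Name=path)
    attachment['Content-Disposition'] = 'attachment; filename="%s"' % path
    msg.attach(attachment)
    server = smtplib.SMTP('smtp.example.com', 25)  # placeholder SMTP server
    server.login(sender, password)
    server.sendmail(sender, recipients, msg.as_string())
    server.quit()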

Finally, the task is cleared from taskdb so that tomorrow's crawl picks up only new posts. At the same time, data from five days ago is exported from the database, stored locally, and then deleted from the database.
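
A sketch of that five-day rotation, reusing the collection naming above (the backup path and JSON format are assumptions):

# Hypothetical rotation: dump a five-day-old collection to disk, then drop it.
import json
import datetime
from pymongo import MongoClient

def rotate(stock_code, days=5):
    old = (datetime.date.today() - datetime.timedelta(days=days)).strftime('%Y-%m-%d')
    db = MongoClient('localhost', 27017)[stock_code + 'eastmoney']
    coll = db[old + 'GuYouHui']
    docs = [{k: v for k, v in d.items() if k != '_id'} for d in coll.find()]
    with open('backup/%s_%s.json' % (stock_code, old), 'w') as f:
        json.dump(docs, f)  # local backup
    coll.drop()             # free the space in mongoDB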

If you want to view the results in the Android app, keep the line

os.system('mv data/' + grab_time + 'result.xls' + ' /var/www/html')

English version

What's the aim of this project?

This project uses pyspider to collect posts from Eastmoney, Xueqiu, and Sina Guba, then uses NLP techniques to analyze public sentiment in order to select stocks.

It therefore has two parts:

  1. crawl posts
  2. sentiment analysis

How to run this project?

Step 1 Crawl posts

  • Install pyspider, mongoDB, redis, snowNLP, and the other dependencies
  • Run set_codes/set_hs300.py (to load all HS300 symbols into mongoDB)
  • Put resultdb.py into pyspider's database/mongodb directory (to save the crawled data to mongoDB)
  • Start redis
  • Run pyspider -c config.json all & on the command line from the directory containing config.json
  • Copy a script from the script folder, paste it into your own project at localhost:5000, and save
  • Click the run button at localhost:5000

Complete the last two steps before the market opens, and you'll get sentiment data every day.

Step 2 Sentiment analysis

Run main.py after the posts have been crawled and stored; remember to create the data directory before the first run.

What happens?

By default, gubaEast.py crawls the GuYouHui section, because it is the most stable.

After Step 1 finishes, you'll find a collection named [date]GuYouHui under a database called [stockcode]eastmoney, where [stockcode] is an HS300 symbol and [date] is yesterday's date.

The other part is sentiment analysis.

The core is three pieces of code:

produceFactor.getSentimentFactor(stockCode, grab_time)

Obtains each post's sentiment factor and sentiment value for a specific symbol and crawl date (the sentiment factor is computed by snowNLP; the sentiment value is the factor times the post's read count).

aggregateFactor.aggregate(stockCode, grab_time)

Obtains the sentiment value for a specific symbol and crawl date (by summing the values of all that stock's posts on that day); the result is in [date]SentimentFactor under [stockcode]eastmoney.

dailyResult.setDailyResult(stockCode, grab_time)

Collects the sentiment values and post counts of all HS300 stocks for the crawl date; the result is in the DailyResult collection under the [date] database.

An Excel file is then saved under the data directory as the final result.

The result is mailed to the specified users through the sendMail module.

Tasks under taskdb are deleted so that posts can be crawled incrementally each day. Meanwhile, data stored five days ago is dumped as a backup, and the original is deleted from mongoDB.

If you want to use the Android app to check the result, keep the following code:

os.system('mv data/' + grab_time + 'result.xls' + ' /var/www/html')
