c-wchen/xpc

Xinpianchang crawler

Crawl scope: comments and home-page information (covering video images and video links).

│  README.md
│  requirements.txt   packages that need to be installed
│  scrapy.cfg
│  xpc.sql            Xinpianchang database schema
└─xpc
    │  items.py      
    │  middlewares.py  
    │  pipelines.py   item pipelines
    │  run.py
    │  settings.py  
    │  __init__.py
    │
    ├─spiders
       │  discovery.py
       │  __init__.py
       │
       └─__pycache__
               discovery.cpython-37.pyc
               __init__.cpython-37.pyc

Solutions to difficulties encountered

  1. 429 Too Many Requests
# In settings.py, add a delay between requests (in seconds)
DOWNLOAD_DELAY = 1
# Maximum number of concurrent requests to a single IP
CONCURRENT_REQUESTS_PER_IP = 16
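
Beyond a fixed delay, Scrapy's built-in AutoThrottle extension and retry middleware may also help with 429 responses. The fragment below is only a sketch for settings.py: the first two settings come from this README, while every other value is an illustrative assumption and not taken from this project.

```python
# settings.py (sketch; values marked "assumption" are illustrative only)

# Fixed delay between requests, in seconds (from this README).
DOWNLOAD_DELAY = 1
# Cap on concurrent requests to a single IP (from this README).
CONCURRENT_REQUESTS_PER_IP = 16

# Let Scrapy adapt the delay to the observed response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1   # assumption: initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30    # assumption: upper bound in seconds

# Retry rejected requests a few times. The list is spelled out so that 429
# is included regardless of the default in the installed Scrapy version.
RETRY_ENABLED = True
RETRY_TIMES = 3                # assumption
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```

When CONCURRENT_REQUESTS_PER_IP is non-zero, Scrapy enforces DOWNLOAD_DELAY per IP rather than per domain.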
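  2. Tracking down errors with a debugger
# run.py: launch the crawl from run.py instead of the command line, so it can run under a debugger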
from scrapy import cmdline
cmdline.execute('scrapy crawl discovery'.split(' '))
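
If the launch script needs per-run overrides while debugging, one option is to build the argument list in a small helper. This is a sketch, not the project's actual run.py: `build_crawl_argv` and `run` are hypothetical names, and it relies on the `-s NAME=VALUE` option of `scrapy crawl`.

```python
# run.py (sketch): build the same command as above, with optional overrides.

def build_crawl_argv(spider, **settings):
    """Return the argv list for `scrapy crawl <spider>` with -s overrides."""
    argv = ["scrapy", "crawl", spider]
    for name, value in settings.items():
        # Each keyword argument becomes one `-s NAME=VALUE` pair.
        argv += ["-s", f"{name}={value}"]
    return argv


def run(spider="discovery", **settings):
    # Imported here so the helper above can be used without Scrapy installed.
    from scrapy import cmdline
    cmdline.execute(build_crawl_argv(spider, **settings))


# At the bottom of run.py, call for example:
#     run("discovery", LOG_LEVEL="DEBUG")
```

Setting a breakpoint in discovery.py and starting run.py from the IDE's debugger should then stop inside the spider callbacks.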

Configurable settings

  1. Database configuration
  • Create the database xpc
  • Set the database host, user, and password in pipelines.py
  • Enable MySQLPipeline
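
In Scrapy, enabling a pipeline usually means listing it in ITEM_PIPELINES in settings.py. The sketch below assumes the dotted path `xpc.pipelines.MySQLPipeline` (inferred from the project layout) and an arbitrary priority of 300. `build_insert` is a hypothetical helper, not code from this repository; it shows one way a pipeline could turn an item into a parameterized INSERT using the `%s` placeholder style accepted by common Python MySQL drivers.

```python
# settings.py (sketch): enable the pipeline. The dotted path is inferred from
# the project layout and 300 is an arbitrary priority; both are assumptions.
ITEM_PIPELINES = {
    "xpc.pipelines.MySQLPipeline": 300,
}


# Hypothetical helper: build a parameterized INSERT from an item's fields,
# leaving value escaping to the database driver.
def build_insert(table, item):
    columns = list(item.keys())
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    params = [item[c] for c in columns]
    return sql, params
```

The returned pair can be passed to a driver cursor as `cursor.execute(sql, params)`. Table and column names should come from trusted code (for example xpc.sql), never from scraped data, because they are interpolated into the statement directly.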
  2. scrapy-redis configuration
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://:password@106.14.136.195:6379'
SCHEDULER_PERSIST = True
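
To keep the Redis password out of version control, the URL could be read from the environment instead of being hard-coded. This is a sketch under stated assumptions: `XPC_REDIS_URL` is a hypothetical variable name, and the fallback points at a local, password-less Redis.

```python
# settings.py (sketch): the same scrapy-redis settings, with the URL taken
# from the environment. XPC_REDIS_URL is a hypothetical variable name.
import os

# Store the request queue in Redis so several crawler processes can share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate requests through Redis as well.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Fall back to a local Redis when the variable is unset (assumption).
REDIS_URL = os.environ.get("XPC_REDIS_URL", "redis://localhost:6379")
# Keep the queue and dupefilter in Redis after the spider closes, so a later
# run can resume instead of starting from scratch.
SCHEDULER_PERSIST = True
```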

About

xinpianchang.com crawler