SinaSpider - A Distributed Spider System for Sina Weibo

1. Working Schema

This spider system is written in pure Python and works in a master-slave schema.

The master node does no crawling; it is only responsible for task assignment and data storage. The slave nodes do the crawling and commit the parsed data to the master node for persistence.
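
The division of work above can be pictured with a small in-memory stand-in. This is only a sketch: the `Master` class, its `assign_batch` and `commit` methods, and the use of a plain list instead of MySQL are all hypothetical names and simplifications chosen for illustration, not the project's actual code.

```python
from collections import deque


class Master:
    """Toy master node: hands out uid batches and stores committed data.

    Hypothetical stand-in; the real master persists data in MySQL.
    """

    def __init__(self, uids, task_num):
        self.pending = deque(uids)   # uids that have not been assigned yet
        self.task_num = task_num     # batch size, like TASK_NUM in Config.py
        self.stored = []             # stands in for the database

    def assign_batch(self):
        """Return up to task_num uids, removing them from the pending queue."""
        batch = []
        while self.pending and len(batch) < self.task_num:
            batch.append(self.pending.popleft())
        return batch

    def commit(self, records):
        """Accept parsed records from a slave node for persistence."""
        self.stored.extend(records)
```

A slave would then loop: ask for a batch, crawl each uid, and call `commit` with the parsed records until `assign_batch` returns an empty list.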

Each time a spider running on a slave node needs work, it fetches a batch of uids (Weibo user IDs) from the master node as its crawling task. The spider then crawls the four parts of each user's Weibo data: followees, followers, timeline and profile. Note that one spider uses multiple Weibo accounts in a round-robin strategy: while one account is working, the remaining accounts rest; after a period, the next account starts working and the previous one rests, and so on.
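
The round-robin account rotation can be sketched as follows. This is a minimal illustration under assumptions: the `AccountRotator` class and its `current` method are hypothetical names, and `change_time` plays the role that ACCOUNT_CHANGE_TIME plays in Config.py. The caller passes in the current time explicitly, which keeps the logic easy to test.

```python
class AccountRotator:
    """Pick the working account, switching to the next one after a fixed span.

    Hypothetical helper for illustration; not taken from the project source.
    """

    def __init__(self, accounts, change_time, now=0.0):
        if not accounts:
            raise ValueError("at least one account is required")
        self.accounts = list(accounts)
        self.change_time = change_time  # working time span of one account
        self.index = 0                  # position of the working account
        self.switched_at = now          # when the working account took over

    def current(self, now):
        """Return the account that should be working at time `now`."""
        if now - self.switched_at >= self.change_time:
            # The working span has elapsed: the next account takes over
            # and the previous one rests until its turn comes around again.
            self.index = (self.index + 1) % len(self.accounts)
            self.switched_at = now
        return self.accounts[self.index]
```

In a crawling loop one might call `rotator.current(time.time())` before each request, so the switch happens the first time a request is made after the span has elapsed.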

2. Environment Deployment

2.1 Get the Source Code

git clone https://github.com/ChenghaoZHU/SinaSpider.git

2.2 Install Relevant Dependencies

rsa
PIL
sqlalchemy
pymysql
caca-utils (Only necessary for Linux)

If you have Anaconda installed and use Red Hat Linux, the following commands may be helpful:

conda install -c https://conda.anaconda.org/jiangxiluning rsa
conda install PIL
conda install sqlalchemy
conda install pymysql

sudo yum install caca-utils

2.3 Install MySQL Database

MySQL only needs to be installed on the master node. The table structures are saved in the sina_weibo_table_structures.sql file; you can create the database by executing this SQL file.
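
One possible way to execute the SQL file from Python is sketched below, using pymysql from the dependency list. The helper names `split_statements` and `load_schema` are hypothetical. The sketch assumes that the target database already exists and that the file contains plain statements separated by semicolons; the naive split will break on semicolons inside string literals or comments, in which case the mysql command-line client is the safer choice.

```python
def split_statements(script):
    """Naively split an SQL script on ';' and drop empty pieces."""
    return [part.strip() for part in script.split(";") if part.strip()]


def load_schema(sql_path, host, user, password, database, charset="utf8"):
    """Run every statement in the SQL file against an existing database."""
    import pymysql  # imported here so split_statements works without the driver

    with open(sql_path, encoding="utf-8") as f:
        statements = split_statements(f.read())

    conn = pymysql.connect(host=host, user=user, password=password,
                           database=database, charset=charset)
    try:
        with conn.cursor() as cursor:
            for statement in statements:
                cursor.execute(statement)
        conn.commit()
    finally:
        conn.close()
```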

3. Get Started

Before you run the spider, edit the Config.py file. Then start the spider with:

python CompleteCrawl.py

The parameters in Config.py are listed below:

Variable	Description
LOG_FILE	Log file path
SLEEP_BETWEEN_2FPAGES	Sleep time after reading one relationship page
SLEEP_BETWEEN_TIMELINE_PAGES	Sleep time between reading two timeline pages
SLEEP_WHEN_EXCEPTION	Sleep time after encountering an exception
ACCOUNT_CHANGE_TIME	Working time span of a single account
TABLES	Mapping from program variables to database tables
DB_USER	Database user name
DB_PASSWD	Database user password
DB_HOST	IP address of the database
DB_DATABASE	Database name
DB_CHARSET	Database character set
ACCOUNT_NUM	Number of accounts one spider uses
TASK_NUM	Number of uids in one batch
OS	Operating system: 0 for Windows, 1 for Linux

Usually, you only need to edit DB_USER, DB_PASSWD, DB_HOST and OS to start a spider. The other parameters are there for customization.
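
As an illustration, a minimal set of edits in Config.py might look like the fragment below. Every value shown is a placeholder, not a project default. The `build_db_url` helper is hypothetical: it only shows how such settings could be combined into a SQLAlchemy URL for the pymysql driver, and is not necessarily how the project builds its connection.

```python
from urllib.parse import quote_plus

# Placeholder values: replace them with the settings of your own master node.
DB_USER = "spider"
DB_PASSWD = "change-me"
DB_HOST = "192.168.1.10"
DB_DATABASE = "sina_weibo"   # assumed database name
DB_CHARSET = "utf8"
OS = 1                       # 0 for Windows, 1 for Linux


def build_db_url(user, passwd, host, database, charset):
    """Build a SQLAlchemy URL for MySQL via pymysql (hypothetical helper)."""
    return "mysql+pymysql://{}:{}@{}/{}?charset={}".format(
        quote_plus(user), quote_plus(passwd), host, database, charset)
```

Quoting the user name and password keeps characters such as `@` or `/` from being misread as part of the URL structure.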

4. FAQ

Q: Why can't I view the captcha picture on Windows 7?
A: See http://stackoverflow.com/questions/7715501/pil-image-show-doesnt-work-on-windows-7
