
=========================================================================================================
                                            INTRODUCTION

        This is a distributed microblog spider that can be used to collect users' social
    networks and the microblogs they post. Besides, it can update to the latest
    information every day.
        The spider is mainly divided into two parts, the server and the client. The server
    is responsible for assigning tasks, collecting data, maintaining the proxy pool and
    managing the databases, while the client is responsible for collecting data from the
    website and sending it to the server.

=========================================================================================================
                                             DEPENDENCY

                      Language              :   Python 3.4

                      External Python Lib   :   tornado
                                                pymongo
                                                redis
                                                pymysql

                      Database              :   MySQL
                                                Redis
                                                MongoDB

                      Others                :   nginx

                      URL   :   python  :   https://www.python.org/
                                MySQL   :   http://www.mysql.com/downloads/
                                MongoDB :   https://www.mongodb.org/
                                Redis   :   http://redis.io/
                                tornado :   http://www.tornadoweb.org/en/stable/
                                nginx   :   http://nginx.org/

=========================================================================================================
                                              USAGE

        Before running this spider, you have to create a database and several tables in
    MySQL first (a pymysql sketch follows the table list below).
        -------------------------------------------------------------
        1. Build new database.  Name            :   microblog_spider
                                Character Sets  :   gb2312
                                Collations      :   gb2312_chinese_ci
        -------------------------------------------------------------
        2. Create some tables.  Name            :   accuracy_table
                                Column          :   float   |   accuracy
                                                    datetime|   time
                                                    int     |   wanted_blog_num
                                -------------------------------------
                                Name            :   cache_atten_web
                                Column          :   varchar |   from_uid
                                                    int     |   from_fans_num
                                                    int     |   from_blog_num
                                                    varchar |   to_uid
                                                    int     |   to_fans_num
                                                    int     |   to_blog_num
                                -------------------------------------
                                Name            :   cache_attends
                                Column          :   varchar |   uid
                                                    varchar |   basic_page
                                                    varchar |   name
                                                    varchar |   gender
                                                    int     |   blog_num
                                                    varchar |   description
                                                    int     |   fans_num
                                -------------------------------------
                                Name            :   cache_history
                                Column          :   datetime|   latest_time
                                                    int     |   latest_timestamp
                                                    varchar |   container_id
                                                    int     |   checkin_timestamp
                                                    datetime|   is_dealing
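
        A minimal setup sketch using the pymysql driver listed under DEPENDENCY. The
    connection credentials and the VARCHAR lengths are assumptions; only the database,
    table and column names come from the description above.

        import pymysql

        # Credentials are placeholders; adjust them to your installation.
        conn = pymysql.connect(host="localhost", user="root",
                               password="secret", charset="gb2312")
        with conn.cursor() as cur:
            # 1. Build the new database with the character set given above.
            cur.execute("CREATE DATABASE IF NOT EXISTS microblog_spider "
                        "CHARACTER SET gb2312 COLLATE gb2312_chinese_ci")
            cur.execute("USE microblog_spider")
            # 2. Create the cache tables. VARCHAR lengths are assumed.
            cur.execute("""CREATE TABLE IF NOT EXISTS accuracy_table (
                               accuracy        FLOAT,
                               time            DATETIME,
                               wanted_blog_num INT)""")
            cur.execute("""CREATE TABLE IF NOT EXISTS cache_atten_web (
                               from_uid VARCHAR(20), from_fans_num INT, from_blog_num INT,
                               to_uid   VARCHAR(20), to_fans_num   INT, to_blog_num   INT)""")
            cur.execute("""CREATE TABLE IF NOT EXISTS cache_attends (
                               uid VARCHAR(20), basic_page VARCHAR(255), name VARCHAR(100),
                               gender VARCHAR(10), blog_num INT, description VARCHAR(255),
                               fans_num INT)""")
            cur.execute("""CREATE TABLE IF NOT EXISTS cache_history (
                               latest_time DATETIME, latest_timestamp INT,
                               container_id VARCHAR(50), checkin_timestamp INT,
                               is_dealing DATETIME)""")
        conn.close()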
                                
The components and details of the project (edition 0.9)
---------------------------------------------------------------------------------------------------------
This is a Sina Microblog spider built from a server terminal and a client terminal.
The server terminal mainly deals with 4 tasks.

    1.      Get proxies from HTTP proxy websites. Unfortunately, not every proxy scraped
        from a website is valid; in fact, only about a quarter of them are. So it is
        important to check whether each proxy is valid. One thread of the server manages
        a proxy pool.
            Raw, unchecked proxies are scraped from the websites, and this thread hands
        them out to several sub-threads, each of which checks whether its assigned
        proxies are valid. If so, the proxies are sent to the valid-proxy pool, ready
        for use.
            Besides, a valid proxy may expire after some time, so it is also necessary
        to re-check the proxies already in the pool. (A sketch of this checking thread
        follows the list.)

    2.      The second task is communicating with clients. Clients request proxies and
        tasks from the server. When a client finishes the task of collecting user info,
        it returns the info to the server, which stores it in the cache database. (A
        sketch of such an endpoint follows the list.)

    3.      The third task is receiving the data transferred from clients. A client
        sometimes transfers data hundreds of MB in size, and while the server is
        receiving it, queries from other clients would be denied. So it is necessary
        to add dedicated data servers used solely for receiving data.

    4.      The fourth task is processing the data in the cache databases. Each crawled
        user is checked against user_info_table: if absent, the user is inserted into
        user_info_table, and in either case the user is deleted from the ready_to_get
        table. As for the user's attends, any that are not yet contained in
        user_info_table are inserted into the ready_to_get table. (A sketch follows
        the list.)
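
    A minimal sketch of the proxy-checking thread from task 1. The test URL, the
    timeout and the number of sub-threads are illustrative assumptions, not the
    project's actual values.

        import queue
        import threading
        import urllib.request

        raw_proxies = queue.Queue()   # unchecked proxies scraped from proxy websites
        valid_pool = set()            # proxies that passed the check, ready for use
        pool_lock = threading.Lock()

        def check_worker():
            """Take raw proxies off the queue and keep only the ones that work."""
            while True:
                proxy = raw_proxies.get()
                opener = urllib.request.build_opener(
                    urllib.request.ProxyHandler({"http": proxy}))
                try:
                    opener.open("http://weibo.cn", timeout=5)  # assumed test URL
                    with pool_lock:
                        valid_pool.add(proxy)                  # valid: into the pool
                except Exception:
                    pass                                       # invalid or expired
                finally:
                    raw_proxies.task_done()

        # The managing thread hands raw proxies to several checking sub-threads;
        # re-feeding the pool's own entries re-checks proxies that may have expired.
        for _ in range(8):
            threading.Thread(target=check_worker, daemon=True).start()
        raw_proxies.put("123.57.53.204:8080")                  # example raw proxy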
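
    A minimal sketch of the client-facing endpoint from task 2, using the tornado
    dependency. The URL path, port and payload fields are hypothetical.

        import json
        import tornado.ioloop
        import tornado.web

        tasks = ["2144058372"]   # uids waiting to be crawled (placeholder)
        cache = []               # stands in for the cache database

        class TaskHandler(tornado.web.RequestHandler):
            def get(self):
                # hand the client one task and one proxy from the valid pool
                task = tasks.pop() if tasks else None
                self.write(json.dumps({"task": task,
                                       "proxy": "123.57.53.204:8080"}))

            def post(self):
                # the client returns collected user info; store it in the cache
                cache.append(json.loads(self.request.body.decode("utf-8")))
                self.write("ok")

        app = tornado.web.Application([(r"/task", TaskHandler)])
        app.listen(8888)
        tornado.ioloop.IOLoop.current().start()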
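
    A minimal sketch of the cache-processing logic from task 4. The uid column and
    the schemas of user_info_table and ready_to_get are assumptions; only the table
    names and the insert/delete rules come from the text.

        def process_cached_user(cur, uid, attends):
            """Move one crawled user and its attends out of the cache.

            cur is a pymysql cursor; attends is a list of followee uids.
            """
            cur.execute("SELECT 1 FROM user_info_table WHERE uid=%s", (uid,))
            if cur.fetchone() is None:
                # unseen user: add it to user_info_table
                cur.execute("INSERT INTO user_info_table (uid) VALUES (%s)", (uid,))
            # whether it existed or not, the user leaves the to-do list
            cur.execute("DELETE FROM ready_to_get WHERE uid=%s", (uid,))
            for attend in attends:
                cur.execute("SELECT 1 FROM user_info_table WHERE uid=%s", (attend,))
                if cur.fetchone() is None:
                    # unseen attends become new crawling tasks
                    cur.execute("INSERT INTO ready_to_get (uid) VALUES (%s)", (attend,))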
==========================================================================================================

UPDATE HISTORY:
    _0.9_   :   add the update function; microblogs can now be updated
    _0.8_   :   start using multiple data servers and nginx
    _0.7_   :   the function that puts data into MongoDB is moved from server.py to
                server_database.py, which improves the reaction speed of the main server
    _0.6_   :   add a data server to avoid server breakdowns, and transfer weibo data
                in small parts
    _0.5_   :   add the function of scraping the history of a given user
    _0.4_   :   if the average proxy pool size is less than 30, refuse to assign tasks
                to clients
    _0.3_   :   drop attends whose fans count is less than 1000
    _0.2_   :   add the partition that monitors the state of the proxy pool
    _0.1_   :   the initial edition

==========================================================================================================

Startup procedure:

1. Start MySQL
	run: mysqld --user=mysql
2. Start MongoDB
	run: mongod --dbpath I:\MongoDB\data
3. Disable Redis persistence
	config set save ""
4. Write user_info_table into Redis (see the sketch after this list)
5. Start nginx
6. Start the Oray Peanut Shell (花生壳) DDNS client
7. Start the server
8. Connect to the cloud host over SSL
9. Start the client
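
A minimal sketch of steps 3 and 4 with the redis and pymysql libraries from the
DEPENDENCY list. The Redis key name and the credentials are assumptions.

        import pymysql
        import redis

        r = redis.StrictRedis(host="localhost", port=6379)
        r.config_set("save", "")          # step 3: disable RDB persistence

        conn = pymysql.connect(host="localhost", user="root", password="secret",
                               db="microblog_spider", charset="gb2312")
        with conn.cursor() as cur:
            cur.execute("SELECT uid FROM user_info_table")
            for (uid,) in cur.fetchall():
                # step 4: mirror the table into a Redis set (assumed key name)
                r.sadd("user_info_table", uid)
        conn.close()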

==========================================================================================================

Cold-restart procedure for the spider (a combined Python sketch follows the steps):
1. Drop the assemble factory and formal collections in MongoDB
2. Empty the cache_history table in MySQL
	truncate table cache_history
3. Set update_time, latest_blog and isGettingBlog to NULL in user_info_table in MySQL
	update user_info_table set update_time=null,latest_blog=null,isGettingBlog=null
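
A minimal sketch of the three steps, assuming pymongo and pymysql and the MongoDB
database name microblog_spider; the collection names are copied from step 1 as
written.

        import pymongo
        import pymysql

        # step 1: drop the two MongoDB collections
        mongo = pymongo.MongoClient("localhost", 27017)
        mongo["microblog_spider"].drop_collection("assemble factory")
        mongo["microblog_spider"].drop_collection("formal")

        conn = pymysql.connect(host="localhost", user="root", password="secret",
                               db="microblog_spider", charset="gb2312")
        with conn.cursor() as cur:
            cur.execute("TRUNCATE TABLE cache_history")            # step 2
            cur.execute("UPDATE user_info_table SET update_time=NULL, "
                        "latest_blog=NULL, isGettingBlog=NULL")    # step 3
        conn.commit()
        conn.close()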
