lyp607720/bilibiliRankML

Spider, Proxypool and MachineLearning. A big project😁

Update Record

- 2019/6/12
    finished the spider module; it can now fetch raw data from bilibili
- 2019/7/2
    refactored almost everything
- 2019/7/4
    improved try/except handling
    fixed a crash when fetching info for a nonexistent or deleted video
- 2019/7/5
    added proxy support
- 2019/7/6
    added an error-logging module
- 2019/7/8
    bug fixes; added the MachineLearning perceptron module
- 2019/7/9
    polished the perceptron module and improved the code design
- 2019/7/10
    fixed a perceptron module bug; added failure statistics
- 2019/7/13
    formal support for multiple processes
- 2019/7/16
    configuration supported
- 2019/7/17
    added a generator function to the fakedata module
    added the proxy_pool module (not ready for use yet)
- 2019/7/18
    greatly adjusted the file structure
- 2019/7/19
    changed how error data is generated, making it more principled
- 2019/7/21
    tried coroutines for the proxy pool module (not working yet)
- 2019/7/22
    successfully applied coroutines to the proxy pool module
- 2019/7/23
    added the Redis module
    fixed Redis module bugs, improving stability
- 2019/7/24
    added a Flask server module to connect to the proxypool database
- 2019/7/25
    added many interfaces for the proxypool (pool module, database module)
- 2019/7/26
    dropped coroutines for fetching kuaidaili.com proxies
    (because of its anti-spider strategy)
- 2019/7/28
    polished the Flask server API: get_one and feedback endpoints
- 2019/8/2
    added the GradientDescent module (.ipynb for Jupyter Notebook)
- 2019/8/3
    fixed a perceptron bias-calculation bug
- 2019/8/8
    fixed a Proxy class equality bug
- 2019/8/9
    fixed a ProxyPool evaluation-rule bug
    ProxyPool is now stable and ready for use
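The get_one and feedback endpoints in the record above suggest a score-based pool. A minimal sketch of that logic, with a plain dict standing in for the Redis sorted set — the class name, score constants, and method signatures here are my own assumptions, not the repo's actual API:

```python
# Score-based proxy pool sketch: scores rise on success, fall on failure,
# and a proxy is evicted once its score hits zero.
INITIAL_SCORE = 10
MAX_SCORE = 100

class ProxyPool:
    def __init__(self):
        self.scores = {}  # proxy -> score (a dict stands in for Redis ZSET)

    def add(self, proxy):
        self.scores.setdefault(proxy, INITIAL_SCORE)

    def get_one(self):
        # Hand out the currently highest-scored proxy, or None if empty.
        if not self.scores:
            return None
        return max(self.scores, key=self.scores.get)

    def feedback(self, proxy, ok):
        # Reward a working proxy; punish and eventually evict a dead one.
        if proxy not in self.scores:
            return
        if ok:
            self.scores[proxy] = min(self.scores[proxy] + 1, MAX_SCORE)
        else:
            self.scores[proxy] -= 10
            if self.scores[proxy] <= 0:
                del self.scores[proxy]

pool = ProxyPool()
pool.add("1.2.3.4:8080")
pool.add("5.6.7.8:3128")
pool.feedback("1.2.3.4:8080", ok=True)
```

The asymmetric reward (+1 on success, −10 on failure) is one common choice: free proxies die fast, so a single failure should count much more than a single success.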

Journal

1. 2019/7/4
Bugs and exceptions came up frequently when I updated the spider module to pull extra video info through other APIs, and I gradually realized how risky it is to read info straight from a video page. One big problem: you get a video's aid from the ranking list, but the video has already been deleted, so almost every function that looks up info by aid fails.
Another problem: the 'guochuang' partition has a lot of official videos with an entirely different HTML page, which breaks functions like 'get_video_upload_time'.
For both problems I've tried unique flags and try/except to keep them from blowing up. Will it hold up in the end? Somehow I'm not confident.
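The deleted-video guard can be sketched like this. The response shape (a code field that is 0 on success, negative such as -404 for missing videos) matches bilibili's web API as I understand it, but the function names are mine, not the repo's:

```python
def parse_video_info(resp):
    """Return the video's data dict, or None if the video is gone.

    bilibili's JSON API wraps results as {"code": 0, "data": {...}} and
    returns a non-zero code (e.g. -404) for deleted or hidden videos.
    """
    if resp.get("code", -1) != 0:
        return None  # deleted / not found: skip this aid, don't crash
    return resp["data"]

def safe_get(fetch, aid):
    # Wrap the network call itself too, so one bad aid can't kill the run.
    try:
        return parse_video_info(fetch(aid))
    except Exception as exc:
        print(f"aid {aid} failed: {exc}")
        return None
```

Returning None as a unique flag lets the data-processing step drop these rows later instead of the spider dying mid-crawl.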
2. 2019/7/5
I didn't expect the program to fail to fetch info again when I checked the data, and my IP got blocked by bilibili.com while I was traversing some of its interfaces. That was a shock, since I'd long assumed bilibili.com was weak on anti-spider measures.
I used a proxy pool to solve it. For the most part it works well and my IP is safe, but for some video categories the free proxies themselves get blocked by bilibili.com, e.g. the 'guochuang' category. Yep, 'guochuang' is the hard case again.
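The retry-through-proxies idea can be sketched independently of any HTTP library. The fetch callable and names below are placeholders of mine; in practice fetch would wrap something like requests.get(url, proxies={"http": proxy}):

```python
def fetch_via_pool(fetch, url, proxies, retries=3):
    """Try proxies in turn until one answers; give up after `retries` tries.

    `fetch(url, proxy)` is whatever performs the real request and raises
    on a blocked or timed-out proxy.
    """
    last_error = None
    for proxy in proxies[:retries]:
        try:
            return fetch(url, proxy)
        except Exception as exc:  # this proxy is blocked or dead
            last_error = exc
    raise RuntimeError(f"all proxies failed: {last_error}")
```

Capping retries matters with free proxies: for a hard target like 'guochuang', an unbounded loop could spin through dead proxies forever.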
3. 2019/7/6
It's necessary to add an error-logging module for analysis. During data processing we need to drop, via a special flag, the videos whose length or upload time couldn't be fetched.
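A minimal version of such an error-logging setup, using the standard logging module — the file name, logger name, and message format are my own guesses, not the repo's:

```python
import logging

def make_error_logger(path="spider_errors.log"):
    """Log failed aids to a file so data processing can filter them later."""
    logger = logging.getLogger("spider")
    logger.setLevel(logging.ERROR)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

A typical call site would be `logger.error("aid=12345 missing upload_time")`; grepping the file afterward gives exactly the list of videos to discard.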
4. 2019/7/8
Holy shit, I forgot to collect the videos' points info! So I have to crawl it all again. To figure out how the website computes a video's points from its other indexes, treating it as a multiple linear regression model is a good choice, and a perceptron is a simple enough but decent fitter for a linear problem.
The spider is steady now that I use the proxy module and have wrapped try/except around almost every step that can raise errors.
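Fitting a linear "points" formula with a perceptron-style update (the delta/LMS rule) can be sketched as below. The feature count matches the 8-element function mentioned later in this journal, but the data here is synthetic — these are not bilibili's real indexes or weights:

```python
import random

def fit_linear(samples, n_features, lr=0.05, epochs=500):
    """SGD on squared error: w += lr * (y - w.x - b) * x  (the LMS rule)."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = y - (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(n_features):
                w[i] += lr * err * x[i]
            b += lr * err
    return w, b

# Synthetic stand-in for 8 video indexes -> points, generated from known weights.
random.seed(0)
true_w = [3.0, -1.0, 0.5, 2.0, -0.5, 1.5, 0.0, 4.0]
data = []
for _ in range(200):
    x = [random.random() for _ in range(8)]
    y = sum(wi * xi for wi, xi in zip(true_w, x)) + 1.0  # true bias = 1.0
    data.append((x, y))
w, b = fit_linear(data, 8)
```

On noiseless, truly linear data like this the learned weights recover the generating weights, which is exactly the behavior described below for the test data; on real bilibili data a poor fit is the signal that one of the linearity/cleanliness/feature assumptions is broken.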
5. 2019/7/9
The perceptron can fit an 8-feature linear function from 1000 test samples in under 20 iterations, but it can't fit the bilibili data. I see at least three possible causes:
1. the relationship is nonlinear
2. some data needs cleaning
3. features are missing, or there are too many
Either way, I'll analyze the data first.
6. 2019/7/13
After days of trying, I finally chose multiple processes to speed up the spider. At first I wanted threads, but Python threads are hamstrung by the GIL. Today I tried coroutines, but adopting them would mean many changes, while multiprocessing slots straight into my existing framework.
Crawling the whole bilibili ranking (about 1300 entries):
- single process: about 14 minutes
- multiple processes: about 5 minutes
That cuts the time by well over half.
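The multi-process layout can be sketched with multiprocessing.Pool; crawl_one here is a stand-in of mine for the real per-video fetch, not the repo's actual function:

```python
from multiprocessing import Pool

def crawl_one(aid):
    # Placeholder for the real per-video fetch; here it just echoes the aid.
    return {"aid": aid, "ok": True}

def crawl_all(aids, workers=4):
    # Worker processes share the task list; map() returns results in input
    # order, so downstream processing stays deterministic.
    with Pool(processes=workers) as pool:
        return pool.map(crawl_one, aids)

if __name__ == "__main__":
    results = crawl_all(range(10), workers=2)
    print(len(results))
```

Dropping the Pool in at the map/loop boundary is why this fit the existing framework so easily: the per-video function is unchanged, only the driver loop is replaced.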
7. 2019/7/17
These days I'm laying the mathematical groundwork for building a fully connected network out of perceptrons. Easy in theory, hard in practice.
Today's later update added a test module and the proxy pool module, partly borrowed from others' proxy_pool projects. The module's biggest problem is that its crawling is far too slow; multiple processes, threads, and coroutines are all needed for decent performance. No avoiding them.
8. 2019/7/22
Finally got coroutines working in the proxy pool module! Just days ago I had failed many times to apply them to the bilibili spider module; the aiohttp and Python asyncio documentation helped a lot.
The proxy pool module now uses multiple threads plus coroutines, and the crawl is more than 10 times faster than before. Over the next few days I'll try Redis and Flask to build my own proxy pool.
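The coroutine speedup comes from overlapping waits. A stdlib-only sketch of the structure — the real module uses aiohttp for the request itself, so check_proxy here is just a stub that sleeps where the HTTP call would be:

```python
import asyncio
import time

async def check_proxy(proxy, delay=0.1):
    # Stand-in for an aiohttp request; the await is where the speedup
    # comes from, since other checks run while this one waits.
    await asyncio.sleep(delay)
    return proxy, True

async def check_all(proxies):
    # Launch every check at once and collect results in order.
    return await asyncio.gather(*(check_proxy(p) for p in proxies))

start = time.perf_counter()
results = asyncio.run(check_all([f"10.0.0.{i}:8080" for i in range(20)]))
elapsed = time.perf_counter() - start
```

Twenty 0.1 s waits finish in roughly 0.1 s total instead of 2 s, which is where a ">10x faster" crawl over hundreds of proxies comes from.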
9. 2019/8/9
The proxypool is ready for steady use; each spider run currently yields about 20 working HTTP proxies. One small regret: asyncio.Semaphore wouldn't behave when I tried to use it in the proxy checker to cap the concurrency.
Now I just need to polish the Flask server so the spider can use it conveniently, and that looks like a big job.
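For reference, the semaphore pattern I was aiming for looks like this stdlib-only sketch; the counter dict is just instrumentation of mine to show that the cap holds:

```python
import asyncio

async def bounded_check(sem, state):
    async with sem:                      # at most `limit` checks in flight
        state["now"] += 1
        state["peak"] = max(state["peak"], state["now"])
        await asyncio.sleep(0.01)        # stand-in for the real proxy check
        state["now"] -= 1

async def run_checks(n_tasks, limit):
    # The Semaphore must be created inside the running event loop.
    sem = asyncio.Semaphore(limit)
    state = {"now": 0, "peak": 0}
    await asyncio.gather(*(bounded_check(sem, state) for _ in range(n_tasks)))
    return state["peak"]

peak = asyncio.run(run_checks(30, limit=5))
```

One classic cause of a "misbehaving" semaphore is creating it outside the loop that the tasks run on (e.g. at module import, before asyncio.run), so it binds to a different event loop; creating it inside the coroutine, as above, avoids that.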
