A spider that crawls web pages and analyzes the ads they contain. Its main objective is to generate ABP (Adblock Plus) filters automatically.
Many ads are presented using the following model:
<a href="ad target URL">
  <img src="ad content URL" />
</a>
The host URL is the URL of the page that hosts the ad.
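As a rough illustration of this model, the sketch below pulls (host URL, target URL, content URL) triples out of a page using only the standard library. The names `AdLinkParser` and `extract_ad_profiles` and the dictionary keys are hypothetical and chosen for this example; the project itself may extract these fields differently (for instance with lxml inside a Scrapy spider).

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class AdLinkParser(HTMLParser):
    """Collect (target URL, content URL) pairs for <img> tags nested in <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.current_href = None  # href of the <a> we are currently inside, if any
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.current_href = attrs.get("href")
        elif tag == "img" and self.current_href and attrs.get("src"):
            # An image wrapped in a link matches the ad presentation model.
            self.pairs.append((self.current_href, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


def extract_ad_profiles(host_url, html):
    """Return one candidate profile item (a dict) per linked image on the page."""
    parser = AdLinkParser()
    parser.feed(html)
    parser.close()
    return [
        {"host_url": host_url, "target_url": target, "content_url": content}
        for target, content in parser.pairs
    ]
```

Every linked image is only a candidate at this point; deciding which candidates are really ads is left to the analysis step.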
- Based on this ad presentation model, crawl the web and record ad profile items in a database.
- Analyze the profiles and find the items that are probably ads.
- Generate ABP filters from the ad profile items.
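The analysis and filter-generation steps above might look roughly like the sketch below. It is an assumption-laden illustration, not the project's actual logic: `is_probably_ad` and `to_abp_filter` are hypothetical names, the heuristic simply compares full hostnames (a real analysis would more likely compare registered domains, for example with the tld package), and the emitted filter form `||host^$image,third-party` is just one valid ABP pattern picked for the example.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_probably_ad(item):
    """Heuristic (assumption): the link target and the image are both served
    from a host other than the page that embeds them."""
    host = urlparse(item["host_url"]).hostname
    target = urlparse(item["target_url"]).hostname
    content = urlparse(item["content_url"]).hostname
    if None in (host, target, content):
        return False  # relative or malformed URL: treat as first-party content
    return host != target and host != content


def to_abp_filter(item):
    """Build an ABP filter that blocks third-party images from the content host."""
    content_host = urlparse(item["content_url"]).hostname
    return "||{0}^$image,third-party".format(content_host)


item = {
    "host_url": "http://example.com/index.html",
    "target_url": "http://ads.example.net/click",
    "content_url": "http://cdn.example.net/banner.png",
}
if is_probably_ad(item):
    print(to_abp_filter(item))
```

Blocking by content host is deliberately coarse; a narrower filter could include part of the URL path if whole-host blocking turns out to be too aggressive.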
- Python 2.7
- The pip and setuptools Python packages. Nowadays pip requires setuptools and installs it if it is missing.
- Install tld through pip
$ pip install tld
- Install lxml for Python
$ pip install lxml
- Install MySQL-python through yum
$ yum install MySQL-python
- Install python-gflags
$ pip install python-gflags
- Install Scrapy
$ pip install scrapy
- Install Frontera
$ pip install "frontera[distributed,zeromq,sql]"
Thanks to Scrapinghub, AdSpider integrates Scrapy with Frontera to build a broad, distributed spider.