yuanbei/adspider

AdSpider

A spider that crawls web pages and analyzes the ads in them. The main objective is to generate ABP (Adblock Plus) filters automatically.

Ads present model

Many ads are presented using the model below.

<a href="ad target URL">
  <img src="ad content URL" />
</a>

The host URL is the URL of the page that hosts the ad.
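
As an illustration of this model, the following minimal sketch collects every `<img>` nested inside an `<a href>` and records the host, target, and content URLs. It uses the standard-library HTML parser rather than lxml so that it runs without extra packages. The names `AdCandidateParser`, `extract_ad_candidates`, and the dictionary keys are hypothetical and are not taken from the AdSpider code.

```python
try:
    from html.parser import HTMLParser      # Python 3
    from urllib.parse import urljoin
except ImportError:
    from HTMLParser import HTMLParser       # Python 2.7
    from urlparse import urljoin


class AdCandidateParser(HTMLParser):
    """Collects <img> elements nested inside <a href=...> (hypothetical helper)."""

    def __init__(self, host_url):
        HTMLParser.__init__(self)
        self.host_url = host_url
        self.link_stack = []    # href values of the currently open <a> elements
        self.candidates = []    # one dict per matching <a><img></a> pair

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.link_stack.append(attrs.get("href"))
        elif tag == "img" and self.link_stack:
            href, src = self.link_stack[-1], attrs.get("src")
            if href and src:
                self.candidates.append({
                    "host_url": self.host_url,
                    # Resolve relative URLs against the hosting page.
                    "target_url": urljoin(self.host_url, href),
                    "content_url": urljoin(self.host_url, src),
                })

    def handle_endtag(self, tag):
        if tag == "a" and self.link_stack:
            self.link_stack.pop()


def extract_ad_candidates(host_url, html):
    """Return the ad candidates found in one page's HTML."""
    parser = AdCandidateParser(host_url)
    parser.feed(html)
    parser.close()
    return parser.candidates
```

An image that is not wrapped in a link, such as a site logo, is ignored, so only pairs matching the model are recorded.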

Core logics of AdSpider

  1. Crawl the web based on the ads present model and record each ad profile item in the database.
  2. Analyze the profiles and find the items that are probably ads.
  3. Generate ABP filters from the ad profile items.
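
Steps 2 and 3 could be sketched as follows. The heuristic is an assumption made for illustration only: an item is treated as a probable ad when its content is served from a different hostname than the host page. AdSpider's actual analysis is not described here and may differ. Each profile is assumed to be a dictionary with hypothetical `host_url` and `content_url` keys, and the function names are also hypothetical. The generated filters use the Adblock Plus `||domain^` anchor with the `$third-party` option.

```python
try:
    from urllib.parse import urlparse       # Python 3
except ImportError:
    from urlparse import urlparse           # Python 2.7


def looks_like_ad(profile):
    """Assumed heuristic: the content is served from a different hostname."""
    host = urlparse(profile["host_url"]).hostname
    content = urlparse(profile["content_url"]).hostname
    return bool(host and content and host != content)


def to_abp_filter(profile):
    """Build a domain-anchored ABP filter for the content host."""
    content_host = urlparse(profile["content_url"]).hostname
    return "||{0}^$third-party".format(content_host)


def generate_filters(profiles):
    """Return the sorted, de-duplicated filters for all probable ads."""
    return sorted({to_abp_filter(p) for p in profiles if looks_like_ad(p)})
```

Comparing raw hostnames is deliberately crude, since `www.example.com` and `static.example.com` would count as different sites. Comparing registered domains, for example with the tld package listed in the requirements, would likely be more accurate.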

Common Requirements

  1. Python 2.7
  2. tld
  3. lxml

Requirements for MySQL tools

  1. MySQL-python
  2. python-gflags
  3. google mysql-tools

Requirements for Spider

  1. Scrapy
  2. Frontera

Installation Guide

  1. Install Python 2.7.
  2. Install the pip and setuptools Python packages. Current versions of pip require setuptools and install it if it is missing.
  3. Install tld through pip
$ pip install tld
  4. Install lxml for Python
$ pip install lxml
  5. Install MySQL-python through yum
$ yum install MySQL-python
  6. Install python-gflags
$ pip install python-gflags
  7. Install Scrapy
$ pip install scrapy
  8. Install Frontera
$ pip install frontera[distributed,zeromq,sql]

Deployment

Thanks to Scrapinghub, AdSpider integrates Scrapy with Frontera to run as a distributed broad-crawl spider.
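
For orientation, a Scrapy project is usually wired to Frontera through a few entries in its `settings.py`, as in the sketch below. The middleware and scheduler paths follow the Frontera documentation. The `FRONTERA_SETTINGS` module path is a hypothetical placeholder, and AdSpider's own configuration may differ.

```python
# settings.py of the Scrapy project (sketch, not AdSpider's actual file)

# Let Frontera observe crawl results and schedule follow-up requests.
SPIDER_MIDDLEWARES = {
    "frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware": 1000,
}
DOWNLOADER_MIDDLEWARES = {
    "frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware": 1000,
}

# Replace Scrapy's default scheduler with the Frontera-backed one.
SCHEDULER = "frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler"

# Hypothetical module path: point this at the project's Frontera settings module.
FRONTERA_SETTINGS = "adspider.frontera_settings"
```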
