Webpager

A simple library that classifies whether an anchor on an HTML page is a pagination link.

Installation

Clone the repository, then install the package requirements (the package requires lxml and scikit-learn):

$ pip install -r requirements.txt

Then install the package itself:

$ python setup.py install

Usage

Fetch an HTML page from somewhere:

>>> from urllib import urlopen  # Python 2; on Python 3 use: from urllib.request import urlopen
>>> url = 'http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-Trattoria_Caffe_Monteverdi-Hong_Kong.html'
>>> html = urlopen(url).read()

Load the web pager and classify the anchors:

>>> from webpager import WebPager
>>> webpager = WebPager()
>>> for anchor, label in webpager.paginate(html, url):
...     if label:  # a truthy label marks a pagination link
...         print(anchor.get('href'))

http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or10-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or40-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
http://www.tripadvisor.com/Restaurant_Review-g294217-d3639657-Reviews-or10-Trattoria_Caffe_Monteverdi-Hong_Kong.html#REVIEWS
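
The output above contains the same URL more than once, which can happen when a page repeats its pagination links. The following is a minimal sketch of one way to collect each link only once, assuming that ``paginate`` yields ``(anchor, label)`` pairs as in the loop above. The helper ``unique_pagination_links`` is hypothetical and not part of webpager, and the demo data uses plain dicts as stand-ins for the anchor elements so that the snippet runs without the library installed.

```python
def unique_pagination_links(pairs):
    """Return the hrefs of anchors labelled as pagination links.

    `pairs` is any iterable of (anchor, label) tuples, where `anchor`
    exposes .get('href'). Order of first appearance is preserved and
    duplicates are dropped. Hypothetical helper, not part of webpager.
    """
    seen = set()
    links = []
    for anchor, label in pairs:
        href = anchor.get('href')
        # Keep only positively labelled anchors with a new, non-empty href.
        if label and href and href not in seen:
            seen.add(href)
            links.append(href)
    return links


# Stand-in data (assumption): dicts mimic the .get('href') access used above.
demo_pairs = [
    ({'href': '/reviews-or10'}, 1),
    ({'href': '/about'}, 0),
    ({'href': '/reviews-or40'}, 1),
    ({'href': '/reviews-or10'}, 1),
]
print(unique_pagination_links(demo_pairs))
```

With the real library, the same helper would presumably be called as ``unique_pagination_links(webpager.paginate(html, url))``.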

Training

See train.ipynb for more details.
