ZoeyYoung/python-readability

 
 

readability_lxml

This is a Python port of a Ruby port of arc90's readability project.

Given an HTML document, it pulls out the main body text and cleans it up. It can also clean up the title, based on the latest readability.js code.

Inspiration

Try it out!

You can try out the parser by entering your test URLs on the following test service.

http://readable.bmark.us

Installation

$ easy_install readability-lxml
# or
$ pip install readability-lxml

Usage

Command Line Client

$ readability http://pypi.python.org/pypi/readability-lxml
$ readability /home/rharding/sampledoc.html

As a Library

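from readability.readability import Document
import urllib

# urllib.urlopen is the Python 2 API; any page URL works here
url = 'http://pypi.python.org/pypi/readability-lxml'
html = urllib.urlopen(url).read()
# summary() returns the cleaned-up article body
readable_article = Document(html).summary()
# short_title() returns the cleaned-up title
readable_title = Document(html).short_title()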
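
The snippet above relies on urllib.urlopen, which only exists on Python 2. As a rough sketch, assuming a Python 3 interpreter and an installed readability-lxml release that supports it, the fetching step could look like the following. The helper names fetch_html and summarize are hypothetical and not part of the library; only Document, summary() and short_title() come from the usage shown above.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when present,
        # otherwise fall back to UTF-8 (an assumption, not a guarantee).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def summarize(url):
    """Return (title, article_html) for a URL (hypothetical helper)."""
    # Imported lazily so fetch_html works without readability-lxml installed.
    from readability.readability import Document

    html = fetch_html(url)
    title = Document(html).short_title()
    article = Document(html).summary()
    return title, article
```

Calling summarize('http://pypi.python.org/pypi/readability-lxml') would then mirror the Python 2 example, provided the installed package accepts decoded text as input.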

You can also use the summary_with_metadata method to get back other metadata, such as the confidence score computed while processing the input.

doc = Document(html).summary_with_metadata()
print(doc.html)
print(doc.confidence)

Optional Document keyword arguments:

  • attributes:
  • debug: output debug messages
  • min_text_length:
  • multipage: should we try to parse and combine multiple page articles?
  • retry_length:
  • url: will allow adjusting links to be absolute
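
To illustrate what "adjusting links to be absolute" means for the url argument, here is a small standard-library sketch. It is not the library's implementation; the absolutize helper and the sample links are made up for illustration, and the base URL is simply the package page used earlier.

```python
from urllib.parse import urljoin


def absolutize(links, base_url):
    """Resolve each link against base_url; absolute links pass through."""
    return [urljoin(base_url, link) for link in links]


# Hypothetical links as they might appear in a fetched page.
base = "http://pypi.python.org/pypi/readability-lxml"
links = ["/pypi", "docs/index.html", "http://readable.bmark.us"]

for resolved in absolutize(links, base):
    print(resolved)
```

Without a base URL there is nothing to resolve relative links against, which is presumably why the url keyword has to be supplied for this adjustment to happen.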

Test and Build Status

Tests are run against the package at:

http://build.bmark.us/job/readability-lxml/

Visit that page for build history and test status.

History

  • 0.2.5 Update setup.py for uploading .tar.gz to pypi

About

fast python port of arc90's readability tool, updated to match latest readability.js!
