CarterCole/dragnet

 
 

dragnet

Dragnet isn't interested in the shiny chrome or boilerplate dressing of a webpage. It's interested in... 'just the facts.'

It is meant to become a collection of reference implementations of various dechroming / content extraction algorithms.

Each algorithm is implemented as a class of static methods that can be imported from the top level of dragnet. Every such class implements a method, analyze, which accepts a string of HTML and returns a string representing the content.
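To make that interface concrete, here is a minimal sketch of a class that follows the same shape: a static analyze method that takes an HTML string and returns a content string. The class LongestBlock, its helper _TextCollector, and the "keep the longest text run" heuristic are all hypothetical stand-ins for illustration. They are not part of dragnet and do not reproduce either of the algorithms described below.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects text runs, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace so lengths are comparable.
        text = " ".join(data.split())
        if text and not self._skip_depth:
            self.chunks.append(text)


class LongestBlock(object):
    """Hypothetical extractor with a dragnet-style interface (not in dragnet)."""

    @staticmethod
    def analyze(html):
        # Accepts a string of HTML, returns a string of "content".
        collector = _TextCollector()
        collector.feed(html)
        collector.close()
        return max(collector.chunks, key=len) if collector.chunks else ""
```

Because analyze is a static method, callers use it directly on the class, for example LongestBlock.analyze(html), without creating an instance.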

Running

Fill a directory named documents with per-site folders, each containing the HTML sources of documents from that site. run.py then iterates over the input files and writes, for each one, a corresponding file in output containing just the content. For example,

documents/
    wired.com/
        latest-higgs-rumors
    seomoz.org/
        8-attributes-of-content-that-inspire-action
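The loop that run.py performs over this layout might look roughly like the sketch below. This is an assumption about its behavior based only on the description above, not the actual script: the function name extract_all, its parameters, and the choice to mirror each site folder under the output directory are all hypothetical.

```python
import os


def extract_all(analyze, src="documents", dst="output"):
    """Write analyze(html) for every src/<site>/<doc> to dst/<site>/<doc>.

    `analyze` is any callable mapping an HTML string to a content string.
    Returns the list of output paths written.
    """
    written = []
    for site in sorted(os.listdir(src)):
        site_dir = os.path.join(src, site)
        if not os.path.isdir(site_dir):
            continue  # only per-site folders are expected at the top level
        out_dir = os.path.join(dst, site)
        os.makedirs(out_dir, exist_ok=True)
        for name in sorted(os.listdir(site_dir)):
            in_path = os.path.join(site_dir, name)
            if not os.path.isfile(in_path):
                continue
            with open(in_path, encoding="utf-8", errors="replace") as f:
                content = analyze(f.read())
            out_path = os.path.join(out_dir, name)
            with open(out_path, "w", encoding="utf-8") as f:
                f.write(content)
            written.append(out_path)
    return written
```

Passing the extractor in as a callable keeps the directory walk independent of which algorithm is used, so the same loop could be run with any class that exposes an analyze method.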

Arias et al.

Based on Language Independent Content Extraction from Web Pages

# Fetch a page and print the content extracted by the Arias et al. algorithm.
import requests
from dragnet import Arias

r = requests.get(
    'http://www.wired.com/wiredscience/2012/06/latest-higgs-rumors/')
print(Arias.analyze(r.content))

Kohlschütter et al.

Based on Boilerplate Detection using Shallow Text Features

# Fetch a page and print the content extracted by the Kohlschütter et al. algorithm.
import requests
from dragnet import Kohlschuetter

r = requests.get(
    'http://www.wired.com/wiredscience/2012/06/latest-higgs-rumors/')
print(Kohlschuetter.analyze(r.content))

About

Just the facts -- web page content extraction
