Libextract: extract data from websites

Libextract is a statistics-enabled data extraction library, written in Python, that works on HTML and XML documents. Originating from eatiht, its extraction algorithm makes one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.
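
To make the "repetitive elements" assumption concrete, here is a minimal sketch that uses only the standard library. It is not libextract's actual implementation: the function name rank_repetitive_parents, the sample markup, and the scoring rule (how often the most common child tag repeats under each parent) are all assumptions chosen for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def rank_repetitive_parents(markup, count=5):
    """Hypothetical helper, not part of libextract.

    Returns up to `count` tuples of (parent_tag, child_tag, repeats),
    ordered by how often the most common child tag repeats.
    """
    root = ET.fromstring(markup.strip())
    scored = []
    for parent in root.iter():
        child_tags = Counter(child.tag for child in parent)
        if not child_tags:
            continue  # leaf element: nothing repeats under it
        child_tag, repeats = child_tags.most_common(1)[0]
        scored.append((parent.tag, child_tag, repeats))
    # Highest repeat count first; ties keep document order
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:count]


# Made-up, well-formed sample markup
SAMPLE = """
<html><body>
  <div><h1>Title</h1></div>
  <ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>
  <table><tr><td>1</td></tr><tr><td>2</td></tr></table>
</body></html>
"""

top = rank_repetitive_parents(SAMPLE, count=2)
print(top)  # [('ul', 'li', 4), ('table', 'tr', 2)]
```

In this toy ranking the list and the table rise to the top because their children repeat, while the heading wrapper does not. That matches the intuition stated above, but it should not be read as a description of the library's statistics.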

Overview

libextract.api.extract(document, encoding='utf-8', count=5): Given an HTML document, and optionally its encoding, return a list of the count nodes most likely to contain data (5 by default).

Installation

pip install libextract

Usage

Because our definition of "data" is so simple, the library exposes a single function, extract. Post-processing is up to you.

from requests import get
from libextract.api import extract

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))

Using lxml's built-in methods for post-processing:

>>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...

The extraction algorithm handles tabular data the same way it handles article text:

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))

>>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
 'Average male height',
 'Average female height',
 ...]
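
Since post-processing is left to the caller, one possible next step is to flatten a table-like node into rows of cell text. The sketch below is an illustration under stated assumptions, not part of libextract: table_rows is a hypothetical helper, and it is demonstrated on a small, made-up standard-library xml.etree.ElementTree element so that it runs without network access. It relies only on iter() and itertext(), which lxml elements also provide.

```python
import xml.etree.ElementTree as ET


def table_rows(table):
    """Hypothetical helper: return one list of cell strings per <tr>."""
    rows = []
    for tr in table.iter("tr"):
        # Join all text inside each header or data cell, including nested tags
        cells = [
            "".join(cell.itertext()).strip()
            for cell in tr
            if cell.tag in ("th", "td")
        ]
        if cells:
            rows.append(cells)
    return rows


# Made-up sample standing in for a node returned by extract()
sample = ET.fromstring(
    "<table>"
    "<tr><th>Country</th><th>Height</th></tr>"
    "<tr><td>A</td><td><b>170</b> cm</td></tr>"
    "</table>"
)
rows = table_rows(sample)
print(rows)  # [['Country', 'Height'], ['A', '170 cm']]
```

With a real result, the same helper could presumably be called as table_rows(tabs[0]), assuming the first node returned is the table of interest.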

Dependencies

lxml
statscounter

Disclaimer

This project is still in its infancy; advice and suggestions as to what this library could and should be would be greatly appreciated.

:)
