GitHub - netconstructor/pydepta: A python implementation of DEPTA

pydepta

pydepta is a Python implementation of Yanhong Zhai and Bing Liu's work on Web Data Extraction Based on Partial Tree Alignment [1]. The basic idea is to locate the data regions with a tree matching algorithm (see Bing Liu's previous work on MDR [3]) and then build a seed tree on top of the records to extract the data fields.
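To make the tree matching step more concrete, here is a minimal sketch of Simple Tree Matching, the kind of tree comparison the MDR/DEPTA line of work relies on. It is an illustration only, not the code in `mdr.py` or `trees.py`: the function names (`simple_tree_match`, `tree_size`, `normalized_match`) are hypothetical, it uses the standard library's `xml.etree.ElementTree` instead of pydepta's own tree types, and the normalization (match score divided by the average tree size) is assumed to follow the usual definition from the papers.

```python
import xml.etree.ElementTree as ET


def simple_tree_match(a, b):
    """Return the number of node pairs in a maximum top-down matching of a and b."""
    if a.tag != b.tag:
        return 0  # different root tags: nothing below can match either
    kids_a, kids_b = list(a), list(b)
    m, n = len(kids_a), len(kids_b)
    # table[i][j] = best score using the first i children of a and first j of b
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + simple_tree_match(kids_a[i - 1], kids_b[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots


def tree_size(node):
    """Count the nodes in the subtree rooted at node."""
    return 1 + sum(tree_size(child) for child in node)


def normalized_match(a, b):
    """Match score divided by the average size of the two trees (assumed definition)."""
    return simple_tree_match(a, b) / ((tree_size(a) + tree_size(b)) / 2.0)


# Hypothetical record subtrees: two similar list items and one unrelated node.
record_1 = ET.fromstring("<li><a/><span/><b/></li>")
record_2 = ET.fromstring("<li><a/><b/></li>")
unrelated = ET.fromstring("<p><i/></p>")
```

Sibling subtrees whose normalized score exceeds a chosen threshold can be treated as candidate data records of the same region; the threshold itself is a tuning parameter and is not specified here.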

Special thanks to SDE [2], a Java implementation of DEPTA. I basically rewrote it in Python with some improvements.

Usage

extract from html page

>>> import depta
>>> from urllib2 import urlopen
>>> d = depta.Depta()
>>> html = urlopen('http://www.amazon.com').read()
>>> d.extract(html)

extract from url

>>> import depta
>>> d = depta.Depta()
>>> d.extract(url='http://www.amazon.com')

extract and annotate the data records with colors

>>> import depta
>>> d = depta.Depta()
>>> d.extract(url='http://www.amazon.com', annotate='1.html')

get the data fields

>>> import sys
>>> from depta import Depta
>>> d = Depta()
>>> items = d.extract(url=sys.argv[1])
>>> for item in items:
...     print ' | '.join(map(lambda x: x.text, item.fields))

Author

pengtaoo AT gmail.com

