Introduction

Textminer extracts values, lists and dicts from text. It works on any text format and is most often used on html pages.

Given a piece of html:

<html>
<body>
<div id="value1">111</div>
...
<div id="value2">222</div>
</body>
</html>

The usual way of extracting the two values "111" and "222" is:

prefix1 = '<div id="value1">'
prefix2 = '<div id="value2">'

# Check the result of find() before adding the prefix length,
# otherwise a missing prefix (-1) can never be detected.
pos1 = html.find(prefix1)
if pos1 == -1:
    end1 = 0
    value1 = None
else:
    start1 = pos1 + len(prefix1)
    end1 = html.find('</div>', start1)
    value1 = int(html[start1:end1])

# Search for the second value after the end of the first one.
pos2 = html.find(prefix2, end1)
if pos2 == -1:
    value2 = None
else:
    start2 = pos2 + len(prefix2)
    end2 = html.find('</div>', start2)
    value2 = int(html[start2:end2])

The textminer way of doing the same thing is:

import textminer

rule = '''
dict:
- key: value1
  prefix: <div id="value1">
  suffix: </div>
  type: int
- key: value2
  prefix: <div id="value2">
  suffix: </div>
  type: int
'''
results = textminer.extract(html, rule)

Textminer uses yaml to define rules, which is much clearer and more expressive than hand-written string searching. This lets you write very complex rules for hierarchical extraction (see below).

Installation

pip install textminer

Demo

You can test your rules here.

Basic Usage

Extract a single value from html

import textminer

html = '<html><body><div>abc</div></body></html>'
rule = '''
value:
  prefix: <div>
  suffix: </div>
'''
result = textminer.extract(html, rule)
# result == 'abc'

Extract a list from html

import textminer

html = '''
<html>
<body>
<ul>
	<li>aaa</li>
	<li>bbb</li>
	<li>ccc</li>
</ul>
</body>
</html>
'''
rule = '''
list:
  prefix: <li>
  suffix: </li>
'''
result = textminer.extract(html, rule)
# result == ['aaa', 'bbb', 'ccc']
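
A list rule can be pictured as a prefix/suffix scan that repeats until no more matches are found. The following is a minimal sketch of that idea in plain python, not textminer's actual implementation; the helper name find_all_between is hypothetical.

```python
def find_all_between(text, prefix, suffix):
    """Collect every substring enclosed by prefix and suffix, left to right."""
    results = []
    pos = 0
    while True:
        start = text.find(prefix, pos)
        if start == -1:
            break
        start += len(prefix)
        end = text.find(suffix, start)
        if end == -1:
            break
        results.append(text[start:end])
        # Continue scanning after the suffix that was just consumed.
        pos = end + len(suffix)
    return results


html = '<ul><li>aaa</li><li>bbb</li><li>ccc</li></ul>'
items = find_all_between(html, '<li>', '</li>')
print(items)  # ['aaa', 'bbb', 'ccc']
```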

Extract a dict from html

import textminer

html = '''
<html>
<body>
<div id="code">001</div>
<div id="value">123</div>
</body>
</html>
'''
rule = '''
dict:
- key: code
  prefix: <div id="code">
  suffix: </div>
- key: value
  prefix: <div id="value">
  suffix: </div>
'''
result = textminer.extract(html, rule)
# result == {'code': '001', 'value': '123'}

Note that the fields in the rule should be in the order they appear in the html.
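
One plausible reason for the ordering requirement is that fields are matched with a cursor that only moves forward, as in the hand-written code in the introduction. The sketch below illustrates that assumption in plain python. The helper extract_in_order is hypothetical and not part of textminer, and returning None for a missing field is a choice made for this illustration only.

```python
def extract_in_order(text, fields):
    """Extract fields with a forward-only cursor.

    fields is a list of (key, prefix, suffix) tuples.
    """
    result = {}
    pos = 0
    for key, prefix, suffix in fields:
        start = text.find(prefix, pos)
        if start == -1:
            result[key] = None
            continue
        start += len(prefix)
        end = text.find(suffix, start)
        if end == -1:
            result[key] = None
            continue
        result[key] = text[start:end]
        # The next field is only searched for after this one.
        pos = end + len(suffix)
    return result


html = '<div id="code">001</div><div id="value">123</div>'
code = ('code', '<div id="code">', '</div>')
value = ('value', '<div id="value">', '</div>')

print(extract_in_order(html, [code, value]))  # {'code': '001', 'value': '123'}
print(extract_in_order(html, [value, code]))  # {'value': '123', 'code': None}
```

With the fields reversed, the cursor has already passed the "code" div by the time it is searched for, so it is not found.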

Hierarchical extraction

The real power of textminer lies in hierarchical extraction: a field can contain a nested list or dict rule that is applied to the text matched by that field.

import textminer

html = '''
<html>
<body>
<h1>Test Page</h1>
<table>
    <tr>
        <td>001</td>
        <td>123</td>
    </tr>
    <tr>
        <td>002</td>
        <td>321</td>
    </tr>
</table>
</body>
</html>
'''
rule = '''
dict:
- key: title
  prefix: <h1>
  suffix: </h1>
- key: items
  prefix: <table>
  suffix: </table>
  list:
    prefix: <tr>
    suffix: </tr>
    dict:
    - key: id
      prefix: <td>
      suffix: </td>
    - key: value
      prefix: <td>
      suffix: </td>
      type: int
'''
result = textminer.extract(html, rule)
# result == {
#     'title': 'Test Page',
#     'items': [
#         {'id': '001', 'value': 123},
#         {'id': '002', 'value': 321}
#     ]
# }
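
To make the nesting concrete, here is a small recursive interpreter for rules of this shape, written in plain python with the rule given as a dict. It is a sketch of the general technique under simplifying assumptions (literal prefixes only, forward-only matching, int as the only type), not textminer's code; the names cut and apply_rule are hypothetical.

```python
def cut(text, rule, pos=0):
    """Return (fragment, next_pos) for the text between the rule's prefix
    and suffix, searching from pos; (None, pos) if there is no match."""
    start = text.find(rule['prefix'], pos)
    if start == -1:
        return None, pos
    start += len(rule['prefix'])
    end = text.find(rule['suffix'], start)
    if end == -1:
        return None, pos
    return text[start:end], end + len(rule['suffix'])


def apply_rule(text, rule):
    """Apply the nested part of a rule (list, dict or plain value) to text."""
    if 'list' in rule:
        sub, items, pos = rule['list'], [], 0
        while True:
            fragment, pos = cut(text, sub, pos)
            if fragment is None:
                return items
            items.append(apply_rule(fragment, sub))
    if 'dict' in rule:
        result, pos = {}, 0
        for field in rule['dict']:
            fragment, pos = cut(text, field, pos)
            result[field['key']] = (
                None if fragment is None else apply_rule(fragment, field))
        return result
    return int(text) if rule.get('type') == 'int' else text


html = ('<h1>Test Page</h1><table>'
        '<tr><td>001</td><td>123</td></tr>'
        '<tr><td>002</td><td>321</td></tr></table>')
rule = {'dict': [
    {'key': 'title', 'prefix': '<h1>', 'suffix': '</h1>'},
    {'key': 'items', 'prefix': '<table>', 'suffix': '</table>',
     'list': {'prefix': '<tr>', 'suffix': '</tr>', 'dict': [
         {'key': 'id', 'prefix': '<td>', 'suffix': '</td>'},
         {'key': 'value', 'prefix': '<td>', 'suffix': '</td>', 'type': 'int'},
     ]}},
]}
result = apply_rule(html, rule)
print(result['title'])  # Test Page
print(result['items'])  # [{'id': '001', 'value': 123}, {'id': '002', 'value': 321}]
```

Each level only sees the fragment cut out by its parent, which is why the inner rules can reuse the same td prefix without matching cells from other rows.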

Extract from a url

Since textminer is heavily used on web pages, it provides a utility function extract_from_url that downloads the html and extracts from it. This saves you a few lines of code.

import textminer

rule = '''
value:
  prefix: <title>
  suffix: </title>
'''
textminer.extract_from_url('http://www.google.com/', rule)
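
This document does not show how extract_from_url is implemented; presumably it amounts to a download followed by a call to extract. A rough equivalent under that assumption, using python 3's standard urllib and assuming utf-8 content, could look as follows. The names fetch_text and extract_from_url_sketch, and the extract parameter, are hypothetical.

```python
from urllib.request import urlopen


def fetch_text(url, encoding='utf-8'):
    """Download url and decode the body; the utf-8 default is an assumption."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors='replace')


def extract_from_url_sketch(url, rule, extract):
    """Hypothetical equivalent: download first, then run the given extract function."""
    return extract(fetch_text(url), rule)


# A data: URL keeps the demonstration offline; an http(s) URL is fetched the same way.
page = fetch_text('data:text/html,%3Ctitle%3EHi%3C/title%3E')
print(page)  # <title>Hi</title>
```

With textminer installed, passing textminer.extract as the extract argument would give the "few lines" that the utility function saves.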

Advanced Usage

Filters

import textminer

html = '<html><body><div>1<b>2</b>3</div></body></html>'
rule = '''
value:
  prefix: <div>
  suffix: </div>
  filters:
  - strip_html
  - float
  - eval('value / 100')
'''
result = textminer.extract(html, rule)
# result == 1.23
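
Judging by the expected result, the filters run in order, each receiving the previous filter's output: '1<b>2</b>3' becomes '123', then 123.0, then 1.23. The sketch below reproduces that chain in plain python. The regex-based strip_html and the divide_by_100 function are stand-ins written for illustration; how textminer implements its filters is not shown in this document.

```python
import re


def strip_html(value):
    """Remove anything that looks like an html tag (naive, for illustration)."""
    return re.sub(r'<[^>]+>', '', value)


def divide_by_100(value):
    """Stand-in for the eval('value / 100') filter."""
    return value / 100


def apply_filters(value, filters):
    """Feed the value through each filter in turn."""
    for f in filters:
        value = f(value)
    return value


result = apply_filters('1<b>2</b>3', [strip_html, float, divide_by_100])
print(result)  # 1.23
```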

Regular expressions for prefix & suffix

Regular expressions are denoted by "/" before and after the string.

import textminer

html = '<html><body><div sessionId="123456789">aaa</div></body></html>'
rule = '''
value:
  prefix: /<div sessionId="\\d+">/
  suffix: </div>
'''
result = textminer.extract(html, rule)
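
Note the doubled backslash in the rule: it is written inside a regular (non-raw) python string, so \\d reaches the yaml parser as \d. The sketch below uses the standard re module to show what a regex prefix combined with a literal suffix would match on this input. It illustrates the idea only; between_regex is a hypothetical helper and textminer's matching details may differ.

```python
import re


def between_regex(text, prefix_pattern, suffix):
    """Return the text after the first regex prefix match, up to the literal suffix."""
    match = re.search(prefix_pattern, text)
    if match is None:
        return None
    start = match.end()
    end = text.find(suffix, start)
    if end == -1:
        return None
    return text[start:end]


html = '<html><body><div sessionId="123456789">aaa</div></body></html>'
value = between_regex(html, r'<div sessionId="\d+">', '</div>')
print(value)  # aaa
```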

Using rules of other formats

Yaml suits rules well, but textminer also supports json strings and raw python dicts, selected with the fmt argument.

import textminer

html = '<html><body><div>123</div></body></html>'

python_rule = {'value': {'prefix': '<body>', 'suffix': '</body>'}}
result = textminer.extract(html, python_rule, fmt=None)

json_rule = '{"value": {"prefix": "<body>", "suffix": "</body>"}}'
result = textminer.extract(html, json_rule, fmt='json')
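
All of these formats describe the same nested structure. As a quick check using only the standard library, the json rule above parses to exactly the python dict rule. The yaml form is left out here because parsing it needs a third-party yaml library.

```python
import json

python_rule = {'value': {'prefix': '<body>', 'suffix': '</body>'}}
json_rule = '{"value": {"prefix": "<body>", "suffix": "</body>"}}'

# Both rules carry identical keys and values once the json is parsed.
parsed = json.loads(json_rule)
print(parsed == python_rule)  # True
```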

Python3 Support

Textminer is tested under python 2.7 and python 3.3.

Author

Mengchen LEE: Google Plus, LinkedIn
