Textminer extracts values, lists and dicts from text. It works on all text formats and is heavily used on html pages.
Given a piece of HTML:
<html>
<body>
<div id="value1">111</div>
...
<div id="value2">222</div>
</body>
</html>
The usual way of extracting the two values "111" and "222" is:
prefix1 = '<div id="value1">'
start1 = html.find(prefix1)
if start1 == -1:
    # Prefix not found: keep scanning from the beginning for the next value.
    end1 = 0
    value1 = None
else:
    start1 += len(prefix1)
    end1 = html.find('</div>', start1)
    value1 = int(html[start1:end1])
prefix2 = '<div id="value2">'
start2 = html.find(prefix2, end1)
if start2 == -1:
    value2 = None
else:
    start2 += len(prefix2)
    end2 = html.find('</div>', start2)
    value2 = int(html[start2:end2])
The textminer way of doing the same thing is:
import textminer
rule = '''
dict:
- key: value1
  prefix: <div id="value1">
  suffix: </div>
  type: int
- key: value2
  prefix: <div id="value2">
  suffix: </div>
  type: int
'''
results = textminer.extract(html, rule)
Textminer uses YAML to define rules, which is clearer and more expressive than hand-written string searching. This lets you write complex rules for hierarchical extraction (see below).
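To make the prefix/suffix idea concrete, here is a minimal plain-Python sketch of what a single `prefix`/`suffix`/`type` field amounts to. It is not textminer's implementation; the helper name `extract_between` and its `convert` parameter are made up for illustration.

```python
def extract_between(text, prefix, suffix, start=0, convert=None):
    """Return (value, next_pos) for the first prefix...suffix span at or after start.

    Returns (None, start) when either marker is missing.
    This helper is hypothetical and only illustrates the idea.
    """
    begin = text.find(prefix, start)
    if begin == -1:
        return None, start
    begin += len(prefix)
    end = text.find(suffix, begin)
    if end == -1:
        return None, start
    value = text[begin:end]
    if convert is not None:
        value = convert(value)
    return value, end + len(suffix)


html = ('<html><body><div id="value1">111</div>...'
        '<div id="value2">222</div></body></html>')

# Each field continues scanning from where the previous one ended.
value1, pos = extract_between(html, '<div id="value1">', '</div>', convert=int)
value2, pos = extract_between(html, '<div id="value2">', '</div>', start=pos, convert=int)
print(value1, value2)  # 111 222
```

A rule file replaces these positional calls with declarative fields, which is what keeps longer extractions readable.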
pip install textminer
You can test your rules here.
import textminer
html = '<html><body><div>abc</div></body></html>'
rule = '''
value:
  prefix: <div>
  suffix: </div>
'''
result = textminer.extract(html, rule)
# result == 'abc'
import textminer
html = '''
<html>
<body>
<ul>
<li>aaa</li>
<li>bbb</li>
<li>ccc</li>
</ul>
</body>
</html>
'''
rule = '''
list:
  prefix: <li>
  suffix: </li>
'''
result = textminer.extract(html, rule)
# result == ['aaa', 'bbb', 'ccc']
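For comparison, a `list` rule behaves like repeating the same prefix/suffix search until no further match is found. The sketch below shows that loop in plain Python; `extract_all` is a hypothetical helper, not part of textminer.

```python
def extract_all(text, prefix, suffix):
    """Collect every prefix...suffix span in text, scanning left to right."""
    items = []
    pos = 0
    while True:
        begin = text.find(prefix, pos)
        if begin == -1:
            break
        begin += len(prefix)
        end = text.find(suffix, begin)
        if end == -1:
            break
        items.append(text[begin:end])
        pos = end + len(suffix)  # resume after the suffix just consumed
    return items


html = '<ul><li>aaa</li><li>bbb</li><li>ccc</li></ul>'
items = extract_all(html, '<li>', '</li>')
print(items)  # ['aaa', 'bbb', 'ccc']
```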
import textminer
html = '''
<html>
<body>
<div id="code">001</div>
<div id="value">123</div>
</body>
</html>
'''
rule = '''
dict:
- key: code
  prefix: <div id="code">
  suffix: </div>
- key: value
  prefix: <div id="value">
  suffix: </div>
'''
result = textminer.extract(html, rule)
# result == {'code': '001', 'value': '123'}
Note that the fields in the rule should be in the order they appear in the html.
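The ordering requirement follows naturally if fields are matched with a single cursor that only moves forward. The sketch below assumes that model; `extract_fields` is a hypothetical helper, and returning `None` for a missed field is an assumption of this sketch, not documented textminer behavior.

```python
def extract_fields(text, fields):
    """Extract (key, prefix, suffix) fields in order, using one forward-moving cursor."""
    result = {}
    pos = 0
    for key, prefix, suffix in fields:
        begin = text.find(prefix, pos)
        end = text.find(suffix, begin + len(prefix)) if begin != -1 else -1
        if begin == -1 or end == -1:
            result[key] = None  # assumption: a field that is not found yields None
            continue
        result[key] = text[begin + len(prefix):end]
        pos = end + len(suffix)  # later fields are only searched after this point
    return result


html = '<div id="code">001</div><div id="value">123</div>'
in_order = [
    ('code', '<div id="code">', '</div>'),
    ('value', '<div id="value">', '</div>'),
]
out_of_order = list(reversed(in_order))

good = extract_fields(html, in_order)
bad = extract_fields(html, out_of_order)
print(good)  # {'code': '001', 'value': '123'}
print(bad)   # {'value': '123', 'code': None}
```

With the fields reversed, the cursor has already passed the `code` element by the time that field is searched for, so it is missed.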
The real power of textminer lies in hierarchical extraction.
import textminer
html = '''
<html>
<body>
<h1>Test Page</h1>
<table>
<tr>
<td>001</td>
<td>123</td>
</tr>
<tr>
<td>002</td>
<td>321</td>
</tr>
</table>
</body>
</html>
'''
rule = '''
dict:
- key: title
  prefix: <h1>
  suffix: </h1>
- key: items
  prefix: <table>
  suffix: </table>
  list:
    prefix: <tr>
    suffix: </tr>
    dict:
    - key: id
      prefix: <td>
      suffix: </td>
    - key: value
      prefix: <td>
      suffix: </td>
      type: int
'''
result = textminer.extract(html, rule)
# result == {
#     'title': 'Test Page',
#     'items': [
#         {'id': '001', 'value': 123},
#         {'id': '002', 'value': 321}
#     ]
# }
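To show what the nested rule saves you from writing, here is a rough hand-written equivalent in plain Python: narrow the text to the table, split it into rows, then read the cells of each row in order. The helpers `between` and `all_between` are hypothetical and are not textminer functions.

```python
def between(text, prefix, suffix, start=0):
    """Return (value, next_pos) for the first prefix...suffix span, or (None, start)."""
    begin = text.find(prefix, start)
    if begin == -1:
        return None, start
    begin += len(prefix)
    end = text.find(suffix, begin)
    if end == -1:
        return None, start
    return text[begin:end], end + len(suffix)


def all_between(text, prefix, suffix):
    """Return every prefix...suffix span in text, left to right."""
    items, pos = [], 0
    while True:
        value, next_pos = between(text, prefix, suffix, pos)
        if value is None:
            return items
        items.append(value)
        pos = next_pos


html = ('<h1>Test Page</h1><table>'
        '<tr><td>001</td><td>123</td></tr>'
        '<tr><td>002</td><td>321</td></tr>'
        '</table>')

title, pos = between(html, '<h1>', '</h1>')
table, _ = between(html, '<table>', '</table>', pos)  # outer level: the table body

items = []
for row in all_between(table, '<tr>', '</tr>'):       # middle level: one entry per row
    item_id, cell_pos = between(row, '<td>', '</td>')  # inner level: cells in order
    value, _ = between(row, '<td>', '</td>', cell_pos)
    items.append({'id': item_id, 'value': int(value)})

result = {'title': title, 'items': items}
print(result)
```

Each level of nesting in the rule corresponds to one level of narrowing here, which is why the rule stays short while the hand-written version grows with every level.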
Since textminer is heavily used on web pages, it provides a utility function, extract_from_url, that downloads the HTML and extracts from it. This saves you a few lines of code.
import textminer
rule = '''
value:
prefix: <title>
suffix: </title>
'''
textminer.extract_from_url('http://www.google.com/', rule)
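For reference, the download-then-extract steps that such a helper wraps might look like the following with only the standard library. This is a sketch and not how extract_from_url is implemented; `title_from_url` and its `timeout` parameter are hypothetical.

```python
import urllib.request


def title_from_url(url, timeout=10):
    """Download a page and return the text between <title> and </title>, or None."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or 'utf-8'
        page = response.read().decode(charset, errors='replace')
    begin = page.find('<title>')
    if begin == -1:
        return None
    begin += len('<title>')
    end = page.find('</title>', begin)
    return page[begin:end] if end != -1 else None


# A data: URL keeps the example offline; an http(s) URL is fetched the same way.
print(title_from_url('data:text/html,<title>Hi</title>'))  # Hi
```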
import textminer
html = '<html><body><div>1<b>2</b>3</div></body></html>'
rule = '''
value:
  prefix: <div>
  suffix: </div>
  filters:
  - strip_html
  - float
  - eval('value / 100')
'''
result = textminer.extract(html, rule)
# result == 1.23
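The filters in the rule above run in sequence, each one receiving the previous filter's output. A plain-Python sketch of that pipeline follows. The `strip_html` here is a simple regex stand-in and may differ from textminer's own filter, and `apply_filters` is a hypothetical helper.

```python
import re


def strip_html(value):
    """Remove anything that looks like an HTML tag (a simplification)."""
    return re.sub(r'<[^>]+>', '', value)


def apply_filters(value, filters):
    """Feed value through each filter in order and return the final result."""
    for apply_filter in filters:
        value = apply_filter(value)
    return value


raw = '1<b>2</b>3'  # the text between <div> and </div>
result = apply_filters(raw, [strip_html, float, lambda value: value / 100])
print(result)  # 1.23
```

The steps are '1<b>2</b>3' to '123' to 123.0 to 1.23, so the order of the filters matters: dividing before converting to float would fail.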
Regular expressions are denoted by "/" before and after the string.
import textminer
html = '<html><body><div sessionId="123456789">aaa</div></body></html>'
rule = '''
value:
  prefix: /<div sessionId="\\d+">/
  suffix: </div>
'''
result = textminer.extract(html, rule)
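One way to picture the "/…/" convention: a marker wrapped in slashes is compiled as a regular expression, and anything else is matched literally. The sketch below implements that reading with the `re` module. It is an illustration under that assumption, and `compile_marker` and `extract_between` are hypothetical names.

```python
import re


def compile_marker(marker):
    """Compile /.../ as a regex; escape any other marker so it matches literally."""
    if len(marker) >= 2 and marker.startswith('/') and marker.endswith('/'):
        return re.compile(marker[1:-1])
    return re.compile(re.escape(marker))


def extract_between(text, prefix, suffix):
    """Return the text between the first prefix match and the next suffix match."""
    start = compile_marker(prefix).search(text)
    if start is None:
        return None
    end = compile_marker(suffix).search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()]


html = '<html><body><div sessionId="123456789">aaa</div></body></html>'
value = extract_between(html, r'/<div sessionId="\d+">/', '</div>')
print(value)  # aaa
```

The regex prefix matches the opening tag whatever the session id is, which a fixed-string prefix cannot do.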
YAML is well suited to writing rules, but textminer also supports JSON and raw Python dicts.
import textminer
html = '<html><body><div>123</div></body></html>'
python_rule = {'value': {'prefix': '<body>', 'suffix': '</body>'}}
result = textminer.extract(html, python_rule, fmt=None)
json_rule = '{"value": {"prefix": "<body>", "suffix": "</body>"}}'
result = textminer.extract(html, json_rule, fmt='json')
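The two rules above describe the same structure. A quick standard-library check shows that the JSON text decodes to exactly the dict form; how textminer parses rules internally is not covered here.

```python
import json

python_rule = {'value': {'prefix': '<body>', 'suffix': '</body>'}}
json_rule = '{"value": {"prefix": "<body>", "suffix": "</body>"}}'

# Decoding the JSON rule yields the same nested dict as the Python rule.
decoded = json.loads(json_rule)
print(decoded == python_rule)  # True
```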
Textminer is tested under Python 2.7 and Python 3.3.
Mengchen LEE: Google Plus, LinkedIn