Skip to content

CooledCoffee/textminer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

93 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introduction

Textminer extracts values, lists and dicts from text. It works on all text formats and is heavily used on html pages.

Giving a piece of html

<html>
<body>
<div id="value1">111</div>
...
<div id="value2">222</div>
</body>
</html>

The usual way of extracting the two values "111" and "222" is:

start1 = html.find('<div id="value1">') + len('<div id="value1">')
if start1 == -1:
    end1 = 0
    value1 = None
else:
    end1 = html.find('</div>', start1)
    value1 = html[start1:end1]
    value1 = int(value1)
start2 = html.find('<div id="value2">') + len('<div id="value1">', end1)
if start2 == -1:
    value2 = None
else:
    end2 = html.find('</div>', start2)
    value2 = html[start2:end2]
    value2 = int(value2)

The textminer's way of doing the same thing is:

import textminer

rule = '''
dict:
- key: value1
  prefix: <div id="value1">
  suffix: </div>
  type: int
- key: value2
  prefix: <div id="value2">
  suffix: </div>
  type: int
'''
results = textminer.extract(html, rule)

Textminer uses yaml to define rules, which is far more clear and expressive. This enables you to write very complicated rule for hierarchical extraction (see below).

Installation

pip install textminer

Demo

You can test your rules here.

Basic Usage

Extract a single value from html

import textminer

html = '<html><body><div>abc</div></body></html>'
rule = '''
value:
  prefix: <div>
  suffix: </div>
'''
result = textminer.extract(html, rule)
# result == 'abc'

Extract a list from html

import textminer

html = '''
<html>
<body>
<ul>
	<li>aaa</li>
	<li>bbb</li>
	<li>ccc</li>
</ul>
</body>
</html>
'''
rule = '''
list:
  prefix: <li>
  suffix: </li>
'''
result = textminer.extract(html, rule)
# result == ['aaa', 'bbb', 'ccc']

Extract a dict from html

import textminer

html = '''
<html>
<body>
<div id="code">001</div>
<div id="value">123</div>
</body>
</html>
'''
rule = '''
dict:
- key: code
  prefix: <div id="code">
  suffix: </div>
- key: value
  prefix: <div id="value">
  suffix: </div>
'''
result = textminer.extract(html, rule)
# result == {'code': '001', 'value': '123'}

Note that the fields in the rule should be in the order they appear in the html.

Hierarchical extraction

The real power of textminer is to do hierarchical extraction.

import textminer

html = '''
<html>
<body>
<h1>Test Page</h1>
<table>
    <tr>
        <td>001</td>
        <td>123</td>
    </tr>
    <tr>
        <td>002</td>
        <td>321</td>
    </tr>
</table>
</body>
</html>
'''
rule = '''
dict:
- key: title
  prefix: <h1>
  suffix: </h1>
- key: items
  prefix: <table>
  suffix: </table>
  list:
    prefix: <tr>
    suffix: </tr>
    dict:
    - key: id
      prefix: <td>
      suffix: </td>
    - key: value
      prefix: <td>
      suffix: </td>
      type: int
'''
result = textminer.extract(html, rule)
# result == {
#     'title': 'Test Page',
#     'items': [
#         {'code': '001', 'value': 123},
#         {'code': '002', 'value': 321}
#     ]
# }

Extract from a url

Since textminer is heavily used on web pages. It provides a utility function extract_from_url to download html and extract from it. This saves you a few lines of code.

import textminer

rule = '''
value:
  prefix: <title>
  suffix: </title>
'''
textminer.extract_from_url('http://www.google.com/', rule)

Advanced Usage

Filters

import textminer

html = '<html><body><div>1<b>2</b>3</div></body></html>'
rule = '''
value:
  prefix: <div>
  suffix: </div>
  filters:
  - strip_html
  - float
  - eval('value / 100')
'''
result = textminer.extract(html, rule)
# result == 1.23

Regular expressions for prefix & suffix

Regular expressions are denoted by "/" before and after the string.

import textminer

html = '<html><body><div sessionId="123456789">aaa</div></body></html>'
rule = '''
value:
  prefix: /<div sessionId="\\d+">/
  suffix: </div>
'''
result = textminer.extract(html, rule)

Using rules of other formats

Yaml is perfect for the rules, but textminer also supports json and raw python dict.

import textminer

html = '<html><body><div>123</div></body></html>'

python_rule = {'value': {'prefix': '<body>', 'suffix': '</body>'}}
result = textminer.extract(html, python_rule, fmt=None)

json_rule = '{"value": {"prefix": "<body>", "suffix": "</body>"}}'
result = textminer.extract(html, json_rule, fmt='json')

Python3 Support

Textminer is tested under python 2.7 and python 3.3.

Author

Mengchen LEE: Google Plus, LinkedIn

Releases

No releases published

Packages

No packages published

Languages