Skip to content

PieterjanMontens/yate

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 

Repository files navigation

#yate Yet Another Text Extrator: a small, simple command-line python tool for extracting data from raw text. Goal is to feed that data to search engines and graph databases.

It outputs a simple key/value list.

##How to use## Yate receives text input, and according to the data points defined in the config file, extracts them and outputs them as key/value, in different formats.

The data point definitions are regexes (match, extract or multi-extract) or lambda functions that extract value(s) from the raw text.

##Tester## Yate also includes a tester, which scans outgoing key/value pairs for missing or invalid values, which is meant for singling out documents which have parsing problems.

##Differ & Patcher## These two tools make it possible to generate patches to do on-the-fly correction of texts with minor errors which make data points extraction troublesome. It works extactly like GNU diff & patch, although Google's diff-match-patch library is used, so it's entirely done in Python.

#TODO#

  • Include more config examples in config_default
  • Review code to match common Python code conventions (I'm not that much of a Python programmer, yet)
  • Include a working example and use case (Content I'm currently working on is not public)
  • Reorganize code, external lib directory, executable scripts in root directory
  • Better error handling, more options, ...

About

Yet another text stuff extractor

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages