======================================================================== GECCO - Generic Environment for Context-Aware Correction of Orthography

A generic modular framework for spelling correction. Aimed to quickly build a complete context-aware spelling correction system given your own data set. Most components will be language independent and trainable from a source corpus, training is explicitly included in the framework. The framework aims to easily extendible, modules can be written in Python 3. Moreover, the framework is scalable and distributable over multiple servers.

The system can be invoked from the command-line, as a Python binding, as a RESTful webservice, or through the web application (two interfaces).

Features:

Generic built-in modules (all will have a server mode):
- Confusible Module
  - Timbl-based (IGTree)
  - alternative: Colibri-Core based
- Language Model Module
  - Timbl-based (inspired on WOPR)
  - alternative: SRILM-based, Colibri-Core based
- Aspell Module
- Lexicon Module
  - Colibri-core
- Errorlist Module
- Split Module
  - Colibri-core
- Runon Module
  - Colibri-core
- Garbage Module
- Punctuation Module (new)
- Language Detection
Easily extendible by adding modules using the gecco module API
Language independent
Built-in training pipeline (given corpus input)
Built-in testing pipeline (given test corpus), returns simple report of evaluation metrics per module
Built-in tuning pipeline (given development corpus), tunes and stores parameter weights
Distributable & Scalable:
- Load balancing: backend servers can run on multiple hosts, master process chooses host with least load
- Master process always on a single host (reduces network load and increases performance)
  - Multithreaded, modules can be invoked in parallel
Input and output is FoLiA XML
- Automatic input conversion from plain text using ucto

Dependencies:

Generic:
python 3
pynlpl (for FoLiA)
python-ucto & ucto
Module-specific:
Timbl
Aspell
colibri-core (py3)
aspell-python-py3
python-timbl
Webservice:
CLAM (port to Python 3 still in progress)

Workload:

Build framework
Abstract Module
- Client/server functionality
Load balancing
Reimplement all modules within new framework, in Python
CLAM integration
Generic client (separate from main command line tool)

Configuration

A Gecco system consists of a configuration, either in the form of a simple Python script or an external YAML configuration file.

Example YAML configuration:

name: fowlt
path: /path/to/fowlt
language: en
modules:
    - module: IGTreeConfusibleModule
      id: confusibles
      source: train.txt
      test: test.txt                        [or a floating point value to automatically reserve a portion of the train set]
      tune: tune.txt                        [or a floating point value to automatically reserve a portion of the train set]
      model: confusible.model
      servers:
          - host: blah
            port: 12345
          - host: blah2
            port: 12345
      confusible: [then,than]

Alternatively, the configuration can be done in Python directly, in which case the script will be the tool that exposes all functionality:

corrector = Corrector(id="fowlt", root="/path/to/fowlt/")
corrector.append( IGTreeConfusibleModule("thenthan", source="train.txt",test_crossvalidate=True,test=0.1,tune=0.1,model="confusibles.model", confusible=('then','than')))
corrector.append( IGTreeConfusibleModule("its", source="train.txt",test_crossvalidate=True,test=0.1,tune=0.1,model="confusibles.model", confusible=('its',"it's")))
corrector.append( ErrorListModule("errorlist", source="errorlist.txt",model="errorlist.model", servers=[("blah",1234),("blah2",1234)]  )
corrector.append( LexiconModule("lexicon", source=["lexicon.txt","lexicon2.txt"],model=["lexicon.model","lexicon2.model"], servers=[("blah",1235)]  )
corrector.main()

Command line usage

Invoke all gecco functionality through a single command line tool

$ gecco myconfig.yml [subcommand]

or

$ myspellingcorrector.py [subcommand]

Subcommands:

reset [$id] -- delete models for all modules or specified module
train [$id] -- train all modules or specified module
start [$id] -- Start module servers (all or specified) that match the current host, this will have to be run for each host
run [filename] [$id] [$parameters] -- Run (all or specified module) on specified FoLiA document
build -- Builds CLAM configuration and wrapper and django webinterface
runserver -- Starts CLAM webservice (using development server, not recommended for production use) and Django interface (using development server, not for producion use)

Server setup

On each server that runs one or more modules a gecco myconfig.yml start has to be issued to start the modules. Modules not set up to run as a server will simply be started and invoked locally on request.

gecco run <input.folia.xml> is executed on the master server to process a given FoLiA document or plaintext document, it will invoke all the modules which may be distributed over multiple servers, if multiple server instances of the same module are running, the one with the lowest load is selected. Output will be delivered in FoLiA XML.

RESTUL webservice access is available through CLAM, the CLAM service can be automatically generated. Web-application access is available either through the generic interface in CLAM, as well as the more user-friendly interface of Valkuil/Fowlt.

Name		Name	Last commit message	Last commit date
Latest commit History 352 Commits
gecco		gecco
test		test
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py
test.py		test.py
test.sh		test.sh
test.yml		test.yml
virtualenv-install.sh		virtualenv-install.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gecco

gecco

test

test

.travis.yml

.travis.yml

LICENSE

LICENSE

README.md

README.md

requirements.txt

requirements.txt

setup.py

setup.py

test.py

test.py

test.sh

test.sh

test.yml

test.yml

virtualenv-install.sh

virtualenv-install.sh

Repository files navigation

======================================================================== GECCO - Generic Environment for Context-Aware Correction of Orthography

Configuration

Command line usage

Server setup

About

Releases

Packages

Languages

License

wollmers/gecco

Folders and files

Latest commit

History

Repository files navigation

======================================================================== GECCO - Generic Environment for Context-Aware Correction of Orthography

Configuration

Command line usage

Server setup

About

Resources

License

Stars

Watchers

Forks

Languages