petrduda/kontext

 
 


Introduction

KonText is an advanced corpus query interface for the Manatee-open corpus search engine. It builds on the core server-side libraries from NoSketchEngine, and the two applications are data-compatible. Development is maintained by the Institute of the Czech National Corpus.

Features

new features

  • fully editable query chain
    • any operation in a user-defined sequence (e.g. query -> filter -> sample -> sorting) can be changed, after which the whole sequence is re-executed
  • support for spoken corpora
    • defined concordance segments can be played back as audio
    • KWIC detail provides a custom rendering with easily distinguishable speeches
  • support for user-defined line groups
    • users can attach custom numeric tags to concordance lines, filter out the remaining lines, and review the ratios between groups
  • improved subcorpus creation
    • users can examine the corpus structure by selecting some text types and seeing how the availability of other text type attributes changes ("which publishers are there if only fiction is selected?")
    • a custom text types ratio can be defined ("give me 20% fiction and 80% journalism")
    • a sub-corpus can be created by a custom CQL expression
    • subcorpora are backed up as CQL queries which makes further modification/restoring possible
  • frequency distribution
    • 2-dimensional frequency distribution for both positional and structural attributes
    • result caching decreases time required to navigate between pages
    • on the multilevel frequency distribution page, the starting word can be specified for multi-word KWICs
  • persistent URLs for large queries - you can send a link to someone even if the query is megabytes in size
  • access to previous queries, named queries
  • access to favorite corpora (subcorpora, aligned corpora)
  • interactive PoS tag tool - for positional PoS tag formats, tag queries can be composed interactively
  • a concordance/frequency/collocation listing can be saved in Excel format (xlsx)
  • a correct i.p.m. (i.e. one calculated only from the selected text types) can be computed on demand for ad-hoc subcorpora
  • result shuffling can be pre-set
  • fewer full-page reloads
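
The i.p.m. (instances per million) item above can be illustrated with a short calculation. The following is a minimal sketch rather than KonText code: the function name `ipm` and all figures are hypothetical, and it assumes that the "correct" value is obtained by dividing by the size of the selected text types instead of the size of the whole corpus.

```python
def ipm(hits, size_in_tokens):
    """Return instances per million tokens for `hits` matches in a text of the given size."""
    if size_in_tokens <= 0:
        raise ValueError("size_in_tokens must be positive")
    return hits * 1000000.0 / size_in_tokens


# Hypothetical figures: 50 hits, all of them inside the selected text types.
hits = 50
whole_corpus_size = 100000000  # tokens in the whole corpus
selected_size = 2000000        # tokens in the selected text types only

# Dividing by the whole corpus understates the frequency within the selection.
naive_ipm = ipm(hits, whole_corpus_size)

# Dividing by the selected text types gives the figure relevant to the ad-hoc subcorpus.
correct_ipm = ipm(hits, selected_size)
```

With these figures the two values differ by a factor of 50, which is the ratio between the whole corpus and the selection.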

internal changes

  • server-side rewritten as a WSGI application (Bonito-open is CGI-based)
  • completely rewritten client-side code (React+Flux architecture, TypeScript + ES6, modularized)
  • modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication method, corpus listing widgets, HTTP session management)
  • fully decoupled background concordance/frequency/collocation calculation based on the Celery task queue (alternatively, Python's multiprocessing package can be used)
  • improved logging, error processing and debugging support
  • improved code documentation
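
To make the WSGI point above more concrete, here is a minimal sketch of a WSGI callable that uses only Python's standard library. It is not taken from the KonText code base; the names `application` and `call_app` and the `/hello` path are hypothetical. Unlike a CGI script, which is started as a new process for each request, a WSGI application is a callable that a long-running server invokes once per request.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A minimal WSGI callable; the server calls it once per request."""
    path = environ.get("PATH_INFO", "/")
    body = u"requested path: {0}".format(path).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(path):
    """Invoke the application in-process, roughly the way a WSGI server would."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI environ keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        # Record what the application reports instead of sending it to a client.
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(application(environ, start_response))
    return captured["status"], body


status, body = call_app("/hello")
```

A server such as Gunicorn is typically pointed at a callable like this with a command of the form `gunicorn mymodule:application`, where `mymodule` is a hypothetical module name.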

Requirements

  • a WSGI-compatible server
    • recommended setup: Gunicorn + a reverse proxy (e.g. Nginx or Apache2)
    • supported setup: Apache2 with mod_wsgi
  • Python 2.7 and:
    • Cheetah Template Engine
    • lxml library
    • werkzeug library (provides WSGI middleware)
    • PyICU library (optional but preferred)
    • markdown library (optional, for formatted corpora references)
    • openpyxl library (optional, for XLSX export)
  • corpus search engine Manatee
    • versions from 2.83.3 to 2.150 are supported (the latest one is highly recommended); unless there is an incompatible change in Manatee, newer versions should work too
  • a key-value storage
    • any custom implementation (Redis and SQLite backends are available by default)
  • (optional) Celery task queue for (asynchronous) background calculations and maintenance tasks
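
The key-value storage requirement above can be pictured with a small example. This is only a sketch of what a custom backend could look like, assuming a plain get/set/remove interface; the class name `SqliteKeyValueStore`, its methods and the table layout are hypothetical and do not reproduce KonText's actual plug-in interface.

```python
import sqlite3


class SqliteKeyValueStore(object):
    """Hypothetical minimal key-value store backed by a single SQLite table."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS data (key TEXT PRIMARY KEY, value TEXT)")

    def set(self, key, value):
        # INSERT OR REPLACE overwrites any value already stored under the key.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO data (key, value) VALUES (?, ?)",
                (key, value))

    def get(self, key, default=None):
        row = self._conn.execute(
            "SELECT value FROM data WHERE key = ?", (key,)).fetchone()
        return row[0] if row is not None else default

    def remove(self, key):
        with self._conn:
            self._conn.execute("DELETE FROM data WHERE key = ?", (key,))


# Usage with an in-memory database; the key and value are made up.
store = SqliteKeyValueStore()
store.set("query:1", '[word="corpus"]')
stored_query = store.get("query:1")
```

Keeping the interface this small is what makes backends interchangeable: a Redis-based variant would only need to provide the same three methods.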

Build and installation

Please refer to the doc/INSTALL.md file for details.

Customization and contribution

Please refer to our Wiki.

Notable installations
