Introduction

KonText is an advanced corpus query interface and corpus data integration middleware built around the Manatee-open corpus search engine. Development is maintained by the Institute of the Czech National Corpus.

Features

new features

  • fully editable query chain
    • any operation in a user-defined sequence (e.g. query -> filter -> sample -> sorting) can be changed, and the whole sequence is then re-executed
  • advanced CQL editor with syntax highlighting and attribute recognition
  • support for spoken corpora
    • defined concordance segments can be played back as audio
    • KWIC detail provides a custom rendering with easily distinguishable speeches
  • support for user-defined line groups
    • users can attach custom numeric tags to concordance lines, filter out the other lines, and review the group ratios
  • improved subcorpus creation
    • users can easily examine corpus structure by selecting some text types and seeing how the availability of other text type attributes changes ("which publishers are there if only fiction is selected?")
    • a custom text types ratio can be defined ("give me 20% fiction and 80% journalism")
    • a sub-corpus can be created by a custom CQL expression
    • a sub-corpus can be published so other users can access it
    • subcorpora are backed up as CQL queries which makes further modification/restoring possible
  • frequency distribution
    • 2-dimensional frequency distribution for both positional and structural attributes
    • result caching decreases time required to navigate between pages
    • on the multilevel frequency distribution page, the starting word can be specified for multi-word KWICs
  • persistent URL for any query - you can send a link to someone even if the query string was megabytes long
  • access to previous queries, named queries
  • access to favorite corpora (subcorpora, aligned corpora)
  • interactive PoS tag tool - for positional PoS tag formats, an interactive tool can be used to write tag queries
  • a concordance/frequency/collocation listing can be saved in Excel format (xlsx)
  • concordance tokens and KWICs can be connected to external data services (e.g. dictionaries, encyclopedias)
  • a correct i.p.m. (instances per million, i.e. one calculated only against the selected text types) can be computed on demand for ad-hoc subcorpora
  • integrability with external data resources (e.g. dictionaries, media libraries)
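
To illustrate why the on-demand i.p.m. matters, here is a minimal sketch of the arithmetic. It is not KonText code: the function name and all numbers are made up for the example, and it assumes i.p.m. means hits per million tokens of the reference size.

```python
def ipm(hits, size_in_tokens):
    """Return instances per million tokens for the given reference size."""
    if size_in_tokens <= 0:
        raise ValueError("size_in_tokens must be positive")
    return hits * 1e6 / size_in_tokens


# Hypothetical figures: 150 hits, all of them inside the selected text types.
hits = 150
whole_corpus_size = 120000000   # tokens in the whole corpus (assumed)
selected_types_size = 30000000  # tokens in the selected text types only (assumed)

naive_ipm = ipm(hits, whole_corpus_size)      # relative to the whole corpus
correct_ipm = ipm(hits, selected_types_size)  # relative to the ad-hoc subcorpus
```

With these assumed figures the two values differ by a factor of four, which is why the denominator has to be the size of the selected text types rather than the size of the whole corpus.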

internal features

  • server-side written as a WSGI application
  • modern client-side application (event stream architecture, React components, extensible, written in TypeScript)
  • modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication method, corpus listing widgets, HTTP session management)
  • fully decoupled background concordance/frequency/collocation calculation based on the Celery task queue (alternatively, Python's multiprocessing package can be used)
  • improved logging, error processing and debugging support
  • improved code documentation
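
The idea of dynamically loadable plug-ins can be sketched roughly as follows. This is only an illustration of the general technique (importing a module by its configured name and calling a factory in it); the names `load_plugin`, `create_instance` and `demo_auth_plugin` are assumptions made for the example and should not be read as KonText's actual plug-in API.

```python
import importlib
import sys
import types


def load_plugin(module_name, factory_name="create_instance"):
    """Import a plug-in module by name and return its factory callable."""
    module = importlib.import_module(module_name)
    factory = getattr(module, factory_name, None)
    if not callable(factory):
        raise ImportError("%s has no callable %s" % (module_name, factory_name))
    return factory


# Stand-in plug-in module registered in memory so the sketch runs on its own;
# a real plug-in would be a regular module found on the import path.
demo = types.ModuleType("demo_auth_plugin")
demo.create_instance = lambda conf: {"plugin": "auth", "conf": conf}
sys.modules["demo_auth_plugin"] = demo

# The module name would normally come from configuration.
make_auth = load_plugin("demo_auth_plugin")
auth = make_auth({"backend": "redis"})
```

Because the module name is just a string, swapping one implementation for another (e.g. a different database adapter) becomes a configuration change rather than a code change.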

Requirements

  • WSGI-compatible server
  • Reverse proxy server
  • Python 2.7 and:
    • Cheetah Template Engine
    • lxml library
    • werkzeug library (provides WSGI middleware)
    • PyICU library (optional but preferred)
    • markdown library (optional, for formatted corpora references)
    • openpyxl library (optional, for XLSX export)
  • corpus search engine Manatee
    • versions from 2.83.3 to 2.158.8 are supported (the latest one is highly recommended); unless there is an incompatible change in Manatee, newer versions should work too
  • a key-value storage
    • any custom implementation (Redis and SQLite backends are available by default)
  • (optional) Celery task queue for (asynchronous) background calculations and maintenance tasks
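
The supported Manatee range above can be checked with a small helper like the one below. It is a sketch only: the function names are hypothetical, the version string is supplied by the caller (the sketch does not query Manatee itself), and it assumes plain dot-separated numeric versions.

```python
MIN_SUPPORTED = (2, 83, 3)
MAX_TESTED = (2, 158, 8)


def parse_version(text):
    """Turn a version string such as '2.158.8' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def manatee_support(version_text):
    """Classify a Manatee version against the range stated in the requirements."""
    version = parse_version(version_text)
    if version < MIN_SUPPORTED:
        return "unsupported"
    if version <= MAX_TESTED:
        return "supported"
    # Newer releases should work unless Manatee introduced an incompatible change.
    return "untested"


status_old = manatee_support("2.83.2")
status_ok = manatee_support("2.158.8")
status_new = manatee_support("2.200.0")
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexicographically, where "2.83.3" would sort after "2.158.8".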

Build and installation

Please refer to the doc/INSTALL.md file for details.

Customization and contribution

Please refer to our Wiki.

Notable installations
