svitalsky/kontext

 
 


Important note

Please note that due to the Python 2 EOL, KonText version 0.13.x is the last one that runs on Python 2. The next release (planned for Q3/Q4 2020) will run only on Python 3. For users of the master branch: the last commit supporting Python 2 is tagged py2_last_version and the first one supporting Python 3 is tagged py3_initial_version. To upgrade, please refer to doc/py2to3.md. For new installations, please follow doc/INSTALL.md.

Introduction

KonText is an advanced corpus query interface and corpus data integration middleware built around the corpus search engine Manatee-open. Development is maintained by the Institute of the Czech National Corpus.

Notable end-user features

  • fully editable query chain
    • any operation in a user-defined sequence (e.g. query -> filter -> sample -> sorting) can be changed, and the whole sequence is then re-executed
  • advanced CQL editor with syntax highlighting and attribute recognition
    • interactive PoS tag tool: for positional PoS tag formats, an interactive tool can be used to write tag queries
  • support for spoken corpora
    • defined concordance segments can be played back as audio
    • KWIC detail provides a custom rendering with easily distinguishable speeches
  • support for user-defined line groups
    • users can define custom numeric tags attached to concordance lines, filter out other lines, and review group ratios
  • rich subcorpus-related functionality
    • users can easily examine corpus structure by selecting some text types and seeing how the availability of other text type attributes changes ("which publishers are there if only fiction is selected?")
    • a custom text types ratio can be defined ("give me 20% fiction and 80% journalism")
    • a subcorpus can be created by a custom CQL expression
    • a subcorpus can be published so other users can access it
    • subcorpora are backed up as CQL queries which makes further modification/restoring possible
  • frequency distribution
    • 2-dimensional frequency distribution for both positional and structural attributes
    • result caching decreases time required to navigate between pages
    • on the multilevel frequency distribution page, a starting word can be specified for multi-word KWICs
  • persistent URL for any query - you can send a link to someone even if the query string was megabytes long
  • access to previous queries, named queries
  • access to favorite corpora (subcorpora, aligned corpora)
  • a concordance/frequency/collocation listing can be saved in Excel format (xlsx)
  • concordance tokens and KWICs can be connected to external data services (e.g. dictionaries, encyclopedias)
  • a correct i.p.m. (i.e. one calculated only from the selected text types) can be calculated on demand for ad-hoc subcorpora
  • integrability with external data resources (e.g. dictionaries, media libraries)
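
The "fully editable query chain" feature can be illustrated with a small sketch. This is not KonText's actual implementation: the `Operation` class, `run_chain` and `replace_operation` are hypothetical names, and a concordance is simplified to a plain list of strings. The point is only that when one step is replaced, the whole sequence is re-executed from the start.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Simplifying assumption: a "concordance" is just a list of text lines.
Lines = List[str]


@dataclass(frozen=True)
class Operation:
    """One step of the chain (hypothetical type, for illustration only)."""
    name: str
    apply: Callable[[Lines], Lines]


def run_chain(corpus: Sequence[str], chain: Sequence[Operation]) -> Lines:
    """Execute all operations in order, feeding each result to the next step."""
    result = list(corpus)
    for op in chain:
        result = op.apply(result)
    return result


def replace_operation(chain: Sequence[Operation], index: int,
                      new_op: Operation) -> List[Operation]:
    """Return a new chain with the step at `index` swapped for `new_op`."""
    return [*chain[:index], new_op, *chain[index + 1:]]


corpus = ["the cat sat", "a dog ran", "the dog sat", "a cat ran"]

chain = [
    Operation("query", lambda lines: [l for l in lines if "dog" in l]),
    Operation("filter", lambda lines: [l for l in lines if l.startswith("the")]),
    Operation("sort", sorted),
]
first = run_chain(corpus, chain)

# Edit the first step only; the later steps are kept and re-run unchanged.
edited = replace_operation(
    chain, 0, Operation("query", lambda lines: [l for l in lines if "cat" in l])
)
second = run_chain(corpus, edited)
```

Here `first` holds the lines matching the original query and `second` the lines produced after the initial query step was edited; the filter and sort steps are reused as they were.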

Internal features

  • modern client-side application (written in TypeScript, event stream architecture, React components, extensible)
  • server-side written as a WSGI application with fully decoupled background concordance/frequency/collocation calculation (using an integrated worker server)
  • modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication method, corpus listing widgets, HTTP session management)
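
As a rough illustration of dynamically loadable plug-ins, the sketch below imports a module by name at runtime and asks it for an instance. The `load_plugin` helper, the `create_instance` factory convention, the `InMemoryDb` class and the `demo_db_plugin` module name are all assumptions made for this example; they are not taken from KonText's code base or configuration format.

```python
import importlib
import sys
import types
from typing import Any, Dict


def load_plugin(module_name: str, conf: Dict[str, Any]) -> Any:
    """Import a plug-in module by name and build an instance from it.

    Assumption: each plug-in module exposes a `create_instance(conf)` factory.
    """
    module = importlib.import_module(module_name)
    factory = getattr(module, "create_instance", None)
    if not callable(factory):
        raise ImportError(f"{module_name} does not define create_instance()")
    return factory(conf)


class InMemoryDb:
    """A toy key-value database adapter standing in for a real backend."""

    def __init__(self, conf: Dict[str, Any]) -> None:
        self._data = dict(conf.get("initial", {}))

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value


# For a self-contained demo, register a module object in memory instead of
# creating a file on disk; a real plug-in would live in its own package.
demo_module = types.ModuleType("demo_db_plugin")
demo_module.create_instance = lambda conf: InMemoryDb(conf)
sys.modules["demo_db_plugin"] = demo_module

# The module name would typically come from a configuration file.
db = load_plugin("demo_db_plugin", {"initial": {"greeting": "hello"}})
db.set("answer", 42)
```

Because the application only depends on the factory and on the methods it calls on the returned object, a different backend can be swapped in by changing the configured module name.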

Requirements

  • Reverse proxy server
  • Python 3.6 (or newer) and:
    • WSGI-compatible server
    • Werkzeug web application library
    • Jinja2 template engine
    • lxml library
    • PyICU library (optional but preferred)
    • markdown library (optional, for formatted corpora references)
    • openpyxl library (optional, for XLSX export)
    • Babel library
  • corpus search engine Manatee
    • versions 2.167.8 and newer are supported by KonText 0.15 and newer
    • versions from 2.83.3 to 2.158.8 are supported by KonText 0.13 and older
  • a key-value storage
    • any custom implementation (Redis and SQLite backends are available by default)
  • a task queue for asynchronous/demanding background calculations and maintenance tasks
    • Celery task queue (more mature implementation in KonText)
    • Rq (lightweight worker, more recent implementation in KonText)

Note: KonText versions up to 0.13.x (incl.) run on Python 2. To use Python 3, 0.15.x and newer versions of KonText must be used.
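
The Python-level requirements above can be checked with a short script before installation. This is a minimal sketch, not part of KonText: `check_environment` and `missing_modules` are hypothetical helpers, and the import names (`werkzeug`, `jinja2`, `lxml`, `babel`, `icu`, `markdown`, `openpyxl`) are the usual module names of the listed libraries. Manatee, the key-value storage and the task queue are not covered.

```python
import sys
from importlib.util import find_spec
from typing import Dict, List, Tuple

# Display name -> importable module name.
REQUIRED = {"Werkzeug": "werkzeug", "Jinja2": "jinja2", "lxml": "lxml", "Babel": "babel"}
OPTIONAL = {"PyICU": "icu", "markdown": "markdown", "openpyxl": "openpyxl"}


def missing_modules(modules: Dict[str, str]) -> List[str]:
    """Return display names of libraries whose module cannot be found."""
    return sorted(name for name, mod in modules.items() if find_spec(mod) is None)


def check_environment(min_python: Tuple[int, int] = (3, 6)) -> dict:
    """Summarize whether the interpreter and libraries meet the requirements."""
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        "missing_required": missing_modules(REQUIRED),
        "missing_optional": missing_modules(OPTIONAL),
    }


report = check_environment()
```

An empty `missing_required` list together with `python_ok` being true indicates that the mandatory Python dependencies are present; entries in `missing_optional` only disable the corresponding optional features.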

Build and installation

KonText provides a script for automatic installation on an existing Ubuntu system. The easiest way to install KonText is to create an LXC/LXD container, clone the repository there and run the script. On a decently fast network, the whole process takes only a couple of seconds. Please refer to the doc/INSTALL.md file for details.

Customization and contribution

Please refer to our Wiki.
