An alternative web front-end for the Manatee corpus search engine

KonText screenshot

Important note

Please note that due to the Python 2 end of life, KonText 0.13.x is the last version that runs on Python 2. The next release (planned for Q1 2020) will run only on Python 3. For users of the master branch, the last commit supporting Python 2 is tagged py2_last_version and the first one supporting Python 3 is tagged py3_initial_version. To upgrade, please refer to doc/py2to3.md. For new installations, please follow doc/INSTALL.md.

Introduction

KonText is an advanced corpus query interface and corpus data integration middleware built around the Manatee-open corpus search engine. Development is maintained by the Institute of the Czech National Corpus.

Features

notable end-user features

  • fully editable query chain
    • any operation in a user-defined sequence (e.g. query -> filter -> sample -> sorting) can be changed, and the whole sequence is then re-executed
  • advanced CQL editor with syntax highlighting and attribute recognition
  • support for spoken corpora
    • defined concordance segments can be played back as audio
    • KWIC detail provides a custom rendering with easily distinguishable speeches
  • support for user-defined line groups
    • users can define custom numeric tags attached to concordance lines, filter out other lines, and review group ratios
  • improved subcorpus creation
    • users can examine the corpus structure by selecting some text types and seeing how the availability of other text type attributes changes ("which publishers are there if only fiction is selected?")
    • a custom text types ratio can be defined ("give me 20% fiction and 80% journalism")
    • a subcorpus can be created by a custom CQL expression
    • a subcorpus can be published so other users can access it
    • subcorpora are backed up as CQL queries which makes further modification/restoring possible
  • frequency distribution
    • 2-dimensional frequency distribution for both positional and structural attributes
    • result caching decreases time required to navigate between pages
    • on the multilevel frequency distribution page, a starting word can be specified for multi-word KWICs
  • persistent URL for any query - you can send a link to someone even if the query string was megabytes long
  • access to previous queries, named queries
  • access to favorite corpora (subcorpora, aligned corpora)
  • interactive PoS tag tool - for positional PoS tag formats, an interactive tool can be used to write tag queries
  • a concordance/frequency/collocation listing can be saved in Excel format (xlsx)
  • concordance tokens and KWICs can be connected to external data services (e.g. dictionaries, encyclopedias)
  • a correct i.p.m. (i.e. one calculated using only the selected text types) can be computed on demand for ad-hoc subcorpora
  • integrability with external data resources (e.g. dictionaries, media libraries)
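
The "fully editable query chain" idea from the list above can be illustrated with a small, self-contained sketch. This is not KonText's implementation: the names `run_chain` and `replace_step` and the toy string-based "concordance lines" are assumptions made only for illustration. It shows a sequence of operations in which one step is swapped and the whole sequence is re-executed from the original input.

```python
from typing import Callable, List, Sequence, Tuple

Line = str
Operation = Callable[[List[Line]], List[Line]]
Step = Tuple[str, Operation]  # (human-readable label, operation)


def run_chain(lines: Sequence[Line], chain: Sequence[Step]) -> List[Line]:
    """Apply each operation in order to the result of the previous one."""
    result = list(lines)
    for _label, op in chain:
        result = op(result)
    return result


def replace_step(chain: Sequence[Step], index: int, label: str, op: Operation) -> List[Step]:
    """Return a new chain with the step at `index` swapped out (hypothetical helper)."""
    new_chain = list(chain)
    new_chain[index] = (label, op)
    return new_chain


# Toy data standing in for a corpus; real concordance lines are richer objects.
corpus = ["the cat sat", "a dog ran", "the dog sat", "a cat ran"]

chain: List[Step] = [
    ("query: 'sat' or 'ran'", lambda ls: [l for l in ls if "sat" in l or "ran" in l]),
    ("filter: 'dog'", lambda ls: [l for l in ls if "dog" in l]),
    ("sort", lambda ls: sorted(ls)),
]

first = run_chain(corpus, chain)

# Edit the middle step only, then re-execute the whole sequence.
edited = replace_step(chain, 1, "filter: 'cat'", lambda ls: [l for l in ls if "cat" in l])
second = run_chain(corpus, edited)
```

Keeping the chain as plain data (a list of labelled steps) is what makes the edit-and-rerun behaviour cheap to express: the original chain stays untouched and the edited copy is simply executed again.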

internal features

  • server-side written as a WSGI application
  • modern client-side application (event stream architecture, React components, extensible, written in TypeScript)
  • modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication method, corpus listing widgets, HTTP session management)
  • fully decoupled background concordance/frequency/collocation calculation based on the Celery task queue (alternatively, Python's multiprocessing package can be used)
  • improved logging, error processing and debugging support
  • improved code documentation
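
As a rough illustration of what "dynamically loadable plug-ins" can mean in a Python application, the sketch below resolves a plug-in module by name at runtime and asks it for an instance. The function name `load_plugin`, the factory name `create_instance` and the shape of the configuration dictionary are assumptions for this example and should not be read as KonText's actual plug-in API; please refer to the project's own documentation for that.

```python
import importlib
from typing import Any, Dict


def load_plugin(module_path: str, conf: Dict[str, Any],
                factory_name: str = "create_instance") -> Any:
    """Import `module_path` at runtime and build a plug-in via its factory.

    Both the factory name and its single `conf` argument are hypothetical
    conventions chosen for this sketch.
    """
    module = importlib.import_module(module_path)
    factory = getattr(module, factory_name, None)
    if not callable(factory):
        raise ImportError(
            "module {!r} does not provide a callable {!r}".format(module_path, factory_name)
        )
    return factory(conf)
```

With such a loader, the module path can come from a configuration file, so a different database adapter or authentication method can be selected without touching the calling code.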

Requirements

  • Reverse proxy server
  • Python 3.6 (or newer) and:
    • WSGI-compatible server
    • Werkzeug web application library
    • Jinja2 template engine
    • lxml library
    • PyICU library (optional but preferred)
    • markdown library (optional, for formatted corpora references)
    • openpyxl library (optional, for XLSX export)
  • corpus search engine Manatee
    • versions 2.167.8 and newer are supported by KonText 0.15 and newer
    • versions from 2.83.3 to 2.158.8 are supported by KonText 0.13 and older
  • a key-value storage
    • any custom implementation (Redis and SQLite backends are available by default)
  • Celery task queue for (asynchronous) background calculations and maintenance tasks

Note: KonText versions up to and including 0.13.x run on Python 2. Python 3 requires KonText 0.15.x or newer.
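
A quick way to see which of the Python-side requirements are present in an environment is a small check like the one below. It is only a sketch and not part of KonText: the helper names are made up, and the import names (`werkzeug`, `jinja2`, `lxml`, `icu` for PyICU, `markdown`, `openpyxl`) are the usual module names of the listed libraries. Manatee, the key-value storage, the WSGI server and Celery are not covered.

```python
import importlib.util
import sys
from typing import Dict, Iterable

# Import names of the libraries listed above (PyICU is imported as `icu`).
REQUIRED = ("werkzeug", "jinja2", "lxml")
OPTIONAL = ("icu", "markdown", "openpyxl")


def python_is_supported(version_info=sys.version_info) -> bool:
    """True for Python 3.6 or newer, the minimum stated in the requirements."""
    return tuple(version_info[:2]) >= (3, 6)


def find_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python 3.6+:", python_is_supported())
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        for name, found in find_modules(names).items():
            print("{:8s} {:10s} {}".format(label, name, "ok" if found else "missing"))
```

`importlib.util.find_spec` only locates a module without importing it, so the check has no side effects beyond the lookup itself.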

Build and installation

KonText provides a script for automatic installation to an existing Ubuntu system. The easiest way to install KonText is to create an LXC/LXD container, clone the repository there and run the script. On a decently fast network, the whole process takes only a couple of seconds. Please refer to the doc/INSTALL.md file for details.

Customization and contribution

Please refer to our Wiki.

Notable users
