Comperio text analytics

ElasticSearch based text analytics.

Implementation of:

Single document significant terms - Trello board (private)
Classification - Trello board (private)
Sentiment analysis - Trello board (private)

Norwegian linguistics support for text analytics

There is currently partial experimental support for some linguistic analysis of Norwegian Bokmål. This support depends on the following resources:

Norwegian Dependency Treebank (NDT) (freely available, permissive licensing).
Norsk Ordbank (available on request, GPL or commercial licensing).

Norsk Ordbank must be obtained separately and unzipped in the data directory for it to be used automatically by linguistic processing components.

Tokenizer

Currently a simple application of the UAX29 standard for Unicode tokenization. Will be expanded to handle hyphens in accordance with Norwegian norms.

tokenizer = NOTokenizer()
tokenizer.tokenize(u'Vi er konsulenter, med fokus på søk!')

[u'Vi',
 u'er',
 u'konsulenter',
 u',',
 u'med',
 u'fokus',
 u'på',
 u'søk',
 u'!']

Part of speech annotation

Adds part of speech descriptions. THe default annotation is a very simplified version of the one used by Norsk Ordbank and NDT.

tagger = NOBTagger()
tagger.tag(u'Vi spiste lunsj ute i det fine været.')

[(u'Vi', 'PRON_PERS'),
 (u'spiste', 'VERB'),
 (u'lunsj', 'SUBST'),
 (u'ute', 'PREP'),
 (u'i', 'PREP'),
 (u'det', 'DET'),
 (u'fine', 'ADJ'),
 (u'været', 'SUBST'),
 (u'.', 'PUNKT')]

Evaluation of the tagger precision is forthcoming, but users should expect a reasonable error rate given the limited trraining data available.

Lemmatization

Based on Norsk Ordbank. It is possible to pass the part of speech tag in order to disambiguate words which can have more than one lemma form.

sent = tagger.tag(u'Vi er godt forberedt.')
[(word, lem.lemmatize(word, tag)) for word, tag in sent]

[(u'Vi', u'vi'),
 (u'er', u'være'),
 (u'godt', u'god'),
 (u'forberedt', u'forberedt'),
 (u'.', u'.')]

Decompounder

Simple heuristics based decompounder based on the word forms in Norsk Ordbank. This can overgenerate so it should primarily be used on wellformed text.

dec = NOBDecompounder()
dec.decompound(u'lampekostbatteri'), dec.decompound(u'søkekonsulenter')

[u'lampe', u'kost', u'batteri'], [u'søke', u'konsulenter']

Vagrant development server

In order to set up up a Vagrant development server run:

vagrant up
fab vagrant provision_server

#Installation notes

To use the tagger, decompounder and lemmatizer you will need to download Norsk ordbank. You can download it by registering at http://www.edd.uio.no/prosjekt/ordbanken/

You will need to build models for the tagger, by running:

from es_text_analytics.tagger import install_hunpos
install_hunpos()
comperio-text-analytics\python-client\bin\build-all-models.bat

Name		Name	Last commit message	Last commit date
Latest commit History 114 Commits
provision		provision
python-client		python-client
spark-jobs		spark-jobs
.env		.env
.gitignore		.gitignore
.travis.yml		.travis.yml
Vagrantfile		Vagrantfile
environment.yml		environment.yml
fabfile.py		fabfile.py
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

provision

provision

python-client

python-client

spark-jobs

spark-jobs

.env

.env

.gitignore

.gitignore

.travis.yml

.travis.yml

Vagrantfile

Vagrantfile

environment.yml

environment.yml

fabfile.py

fabfile.py

readme.md

readme.md

Repository files navigation

Comperio text analytics

Norwegian linguistics support for text analytics

Tokenizer

Part of speech annotation

Lemmatization

Decompounder

Vagrant development server

About

Releases

Packages

Languages

aandreassen/comperio-text-analytics

Folders and files

Latest commit

History

Repository files navigation

Comperio text analytics

Norwegian linguistics support for text analytics

Tokenizer

Part of speech annotation

Lemmatization

Decompounder

Vagrant development server

About

Resources

Stars

Watchers

Forks

Languages