segments

The segments package provides Unicode Standard tokenization routines and orthography segmentation, implementing the linear algorithm described in the orthography profile specification from The Unicode Cookbook (Moran and Cysouw 2018 ).

Command line usage

Create a text file:

$ echo "aäaaöaaüaa" > text.txt

Now look at the profile:

$ cat text.txt | segments profile
Grapheme        frequency       mapping
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö

Write the profile to a file:

$ cat text.txt | segments profile > profile.prf

Edit the profile:

$ more profile.prf
Grapheme        frequency       mapping
aa      0       x
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö

Now tokenize the text without profile:

$ cat text.txt | segments tokenize
a ä a a ö a a ü a a

And with profile:

$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x

API

>>> from __future__ import unicode_literals, print_function
>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme	mapping
ab	x
cd	y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
src/segments		src/segments
tests		tests
.gitignore		.gitignore
.travis.yml		.travis.yml
CHANGES.md		CHANGES.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
RELEASING.md		RELEASING.md
faq.md		faq.md
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py
tox.ini		tox.ini

License

Anaphory/segments

Folders and files

Latest commit

History

Repository files navigation

segments

Command line usage

API

About

Resources

License

Stars

Watchers

Forks

Languages