Razmetka

This repository contains a Python utility for training and testing part-of-speech taggers from a provided training file.

Training Files

This package assumes a training file structured according to the following rules:

- Each line contains one sentence (i.e., sentences are separated by the newline character \n).
- Each sentence is white-space delimited, so a space should precede every punctuation mark.
- Each token in each sentence consists of three parts: the word or punctuation mark itself, the separator character, and the associated tag. For example, here is a breakdown of the token Men_PN1s:
  - Men -- the word as it would appear in an untagged sentence.
  - _ -- the separator character. The underscore _ is the default, but other separators may be specified; the slash / is also common.
  - PN1s -- the (part-of-speech) tag, here indicating a first-person singular pronoun.
- UTF-8 is the default encoding, but other encodings may be specified.

For example, three sentences in this format:

Men_PN1s besh_NU minut_N usul_N oynidim_Vt-PST.dir-1s1 ._PUNCT
Sen_PN2si poluni_N-ACC yéding_Vt-PST.dir-2si2 dédi_Vt-PST.dir-3s2 Tursun_Npr ._PUNCT
Xinjiangda_Ntop-LOC turghan_Vi-REL.PST méning_PN1s.GEN ayalim_N-POSS1s qaytip_Vi-CNV keldi_Vdirc-PST.dir-3s2 ._PUNCT
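
To make the rules above concrete, here is a minimal sketch of how one line of such a training file could be split into (word, tag) pairs. It is not taken from Razmetka's source; the function name parse_tagged_line is hypothetical, and the sketch assumes that tags never contain the separator character, so each token is split at the last separator.

```python
def parse_tagged_line(line, separator="_"):
    """Split one training-file line into a list of (word, tag) pairs.

    Hypothetical helper for illustration; not part of Razmetka's API.
    Assumes the tag itself never contains the separator character.
    """
    pairs = []
    for token in line.split():  # tokens are white-space delimited
        # Split at the LAST separator so words containing it stay intact.
        word, found, tag = token.rpartition(separator)
        if not found or not word or not tag:
            raise ValueError("malformed token: %r" % token)
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    sample = "Men_PN1s besh_NU minut_N usul_N oynidim_Vt-PST.dir-1s1 ._PUNCT"
    for word, tag in parse_tagged_line(sample):
        print(word, tag)
```

To read a whole file, open it with the appropriate encoding (UTF-8 by default) and call the function once per non-empty line.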

Provided Files

This repository includes a sample file, uyghurtagger.train, structured according to the standards described above. The Uyghur sentences in this file are taken from the public online corpus of the Uyghur Light Verbs Project (PI Arienne M. Dwyer, NSF BCS-1053152).

Usage

Train a Brill tagger on a provided training file:

import razmetka.tag
btt = razmetka.tag.TTBrillTaggerTrainer(file_name='uyghurtagger.train',
                                        language='Uyghur')
btt.train(verbose=True)

Train and test Stanford log-linear taggers from a provided training file using ten-fold cross-validation:

import razmetka.tag
tst = razmetka.tag.TaggerTester(file_name='uyghurtagger.train',
                                language='Uyghur')
tst.split_groups()
tst.estimate_tagger_accuracy()
tst.print_results()
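
For readers unfamiliar with ten-fold cross-validation, the general idea can be sketched in plain Python as follows. This is an illustration of the technique only, not Razmetka's implementation; the names k_fold_splits and token_accuracy are hypothetical, and the real TaggerTester may shuffle, split, or score differently.

```python
def k_fold_splits(sentences, k=10):
    """Yield (train, test) pairs, using each of k folds once as the test set.

    Hypothetical helper for illustration; sentences are dealt into folds
    round-robin, without shuffling.
    """
    folds = [sentences[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Everything outside the held-out fold becomes training data.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


def token_accuracy(gold_sentences, predicted_sentences):
    """Return the fraction of tokens whose predicted tag matches the gold tag.

    Both arguments are lists of sentences; each sentence is a list of
    (word, tag) pairs with the same tokenization.
    """
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    return correct / total if total else 0.0
```

In each of the ten rounds, a tagger is trained on train, used to tag the words of test, and scored against the held-out gold tags; the ten scores are then averaged.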

Repeat the entire ten-fold cross-validation process multiple times:

import razmetka.tag
razmetka.tag.repeat_tagger_tests(fname='uyghurtagger.train',
                                 number_of_tests=3, language='Uyghur')
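
Repeating the whole procedure gives one accuracy estimate per run, and a common way to report them is a mean with a measure of spread. The sketch below assumes the caller has already collected the per-run accuracies in a list; it does not assume anything about what repeat_tagger_tests returns or prints, and the name summarize_runs is hypothetical.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies.

    Hypothetical helper for illustration; not part of Razmetka's API.
    With a single run, the spread is reported as 0.0.
    """
    mean = statistics.mean(accuracies)
    spread = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, spread
```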

Requirements

The Razmetka package requires NLTK 3.0+.

TODOs

Use a temporary directory (with nltk.compat.TemporaryDirectory() as tempdir:) for storing the properties files and training files generated when using the Stanford NLP POS tagger.
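
A minimal sketch of the temporary-directory pattern this TODO describes is given below. It uses the standard library's tempfile.TemporaryDirectory rather than the NLTK compatibility wrapper, and the helper name run_with_temp_files, the file names, and the action callback are all hypothetical; nothing here reflects how Razmetka currently writes its files.

```python
import os
import tempfile


def run_with_temp_files(file_contents, action):
    """Write files into a temporary directory, run action, then clean up.

    file_contents maps a file name to its text; action receives a dict
    mapping each file name to its full path. Hypothetical helper for
    illustration; not part of Razmetka's API.
    """
    with tempfile.TemporaryDirectory() as tempdir:
        paths = {}
        for name, text in file_contents.items():
            path = os.path.join(tempdir, name)
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(text)
            paths[name] = path
        # The files exist only while action runs; the directory and its
        # contents are deleted when the with block exits.
        return action(paths)
```

The tagger training and testing would happen inside action, so that no generated properties or training files are left behind afterwards.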

Support

This Python package is being written to support the work of the Annotating Turki Manuscripts Online project (Principal Investigators: Arienne M. Dwyer and C.M. Sperberg-McQueen), sponsored by the Luce Foundation. The support of the Luce Foundation is gratefully acknowledged.
