Overview

Last updated on August 24th, 2017 by Mirith

Overview

This project uses many modules from nltk. And mean from numpy to average the results. It takes an previously tagged dataset and trains a trigram tagger based on that dataset. The trigram tagger backs off into a bigram tagger, which backs off into a unigram tagger, which in turn backs off into a default tagger.

Usage

You'll need python (this was done in python 3) and the dataset. The small one provided is probably not going to give you very accurate results (with the full set it's about 88-89% accurate). But it will give you an idea of how it works, while drastically reducing the training time.

Files

estonianSmall.txt

Tagged Estonian data. Each word has its own tag, comprised of one letter. IE

word/single letter tag

Only includes the first 200 lines of a much, much larger dataset.

taggers.py

Currently kfold validation doesn't work unless hard-coded, which is less than ideal. But this file basically loads the tagged corpus, then splits the data, then trains based off the hard-coded split data, and prints the results. Capitalizing all the words improves accuracy just slightly.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
estonianSmall.txt		estonianSmall.txt
readme.md		readme.md
taggers.py		taggers.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

estonianSmall.txt

estonianSmall.txt

readme.md

readme.md

taggers.py

taggers.py

Repository files navigation

Overview

Usage

Files

estonianSmall.txt

taggers.py

About

Releases

Packages

Languages

Mirith/Trigram-tagger

Folders and files

Latest commit

History

Repository files navigation

Overview

Usage

Files

estonianSmall.txt

taggers.py

About

Topics

Resources

Stars

Watchers

Forks

Languages