
Python, tags trigrams! K-fold validation doesn't work yet.


Mirith/Trigram-tagger


Last updated on August 24th, 2017 by Mirith

Overview

This project uses several modules from nltk, plus mean from numpy to average the results. It takes a previously tagged dataset and trains a trigram tagger on it. The trigram tagger backs off to a bigram tagger, which backs off to a unigram tagger, which in turn backs off to a default tagger.
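The backoff chain described above can be sketched roughly as follows. This is a minimal illustration and not the code in taggers.py: the helper name build_backoff_tagger, the default tag "S", and the toy training sentences are all assumptions made for the example.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger


def build_backoff_tagger(train_sents, default_tag="S"):
    """Chain trigram -> bigram -> unigram -> default taggers.

    default_tag is a hypothetical choice; pick whichever tag is the
    most sensible fallback for your own corpus.
    """
    default = DefaultTagger(default_tag)
    unigram = UnigramTagger(train_sents, backoff=default)
    bigram = BigramTagger(train_sents, backoff=unigram)
    return TrigramTagger(train_sents, backoff=bigram)


# Toy training data; the words and tags are made up for illustration.
train_sents = [
    [("the", "D"), ("cat", "S"), ("sleeps", "V")],
    [("the", "D"), ("dog", "S"), ("runs", "V")],
]
tagger = build_backoff_tagger(train_sents)

# "zzz" never appears in training, so every n-gram tagger passes it
# down the chain until the default tagger assigns the fallback tag.
tagged = tagger.tag(["the", "cat", "runs", "zzz"])
print(tagged)
```

Each tagger only answers when it has seen the context during training; otherwise it defers to its backoff, which is why unseen words end up with the default tag.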

Usage

You'll need Python (this was written in Python 3), nltk, numpy, and the dataset. The small dataset provided probably won't give very accurate results (with the full set, accuracy is about 88-89%), but it shows how the tagger works while drastically reducing training time.

Files

estonianSmall.txt

Tagged Estonian data. Each word has its own single-letter tag, in the form:

word/single letter tag

Includes only the first 200 lines of a much larger dataset.
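A line in the word/tag format could be parsed into (word, tag) pairs along these lines. This is only a sketch: it assumes tokens are separated by whitespace and that each line holds one sentence, neither of which is confirmed here. The function name parse_tagged_line and the sample line with its tags are invented for the example.

```python
def parse_tagged_line(line):
    """Turn 'word/T word/T ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so words that contain '/' stay intact.
        word, sep, tag = token.rpartition("/")
        if not sep or not word:
            continue  # skip tokens that lack a word or a separator
        pairs.append((word, tag))
    return pairs


# Hypothetical sample line; the tags are illustrative, not from the dataset.
sample = "koer/S jookseb/V kiiresti/D"
sentence = parse_tagged_line(sample)
print(sentence)
```

A list of such sentences is the shape nltk's n-gram taggers expect as training data.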

taggers.py

K-fold validation currently works only when the folds are hard-coded, which is less than ideal. This file loads the tagged corpus, splits the data, trains on the hard-coded splits, and prints the results. Capitalizing all the words improves accuracy slightly.
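One way the hard-coded split could be generalized is sketched below. It is not the approach taken in taggers.py. The names k_fold_splits, cross_validate, and train_and_score are hypothetical, and statistics.mean from the standard library stands in for numpy's mean so the sketch has no dependencies.

```python
from statistics import mean


def k_fold_splits(sentences, k):
    """Yield (train, test) pairs; each sentence is tested exactly once."""
    if not 2 <= k <= len(sentences):
        raise ValueError("k must be between 2 and the number of sentences")
    for i in range(k):
        test = sentences[i::k]  # every k-th sentence, starting at index i
        train = [s for j, s in enumerate(sentences) if j % k != i]
        yield train, test


def cross_validate(sentences, k, train_and_score):
    """Average the score returned by train_and_score over k folds.

    train_and_score is a caller-supplied function (hypothetical here)
    that trains on the first argument and returns accuracy on the second.
    """
    return mean(
        train_and_score(train, test)
        for train, test in k_fold_splits(sentences, k)
    )


# Integers stand in for tagged sentences to show how the folds divide.
data = list(range(10))
folds = list(k_fold_splits(data, 5))
```

In practice train_and_score would build the backoff tagger on the training fold and return its accuracy on the test fold; the interleaved slicing keeps fold sizes within one sentence of each other.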
