boerschi/artlangseg
Word Segmentation and Artificial Languages

Makefile
  Use this to run an experiment. By default, the Unigram grammar is run
  on the first random Hawaiian corpus.

  To run on a different corpus version (there are currently 4 sets; there
  is nothing special about them, they were just generated by running
  Robert's scripts 4 times), use

      make SET=0[1234]

  To run on a different language (right now, there is only Berber in
  addition), use

      make LANGUAGE=berber

  To keep the model as simple as possible, the base distribution is
  assumed to be fixed: words are generated assuming a uniform phoneme
  distribution and a constant stopping probability of 0.5. Hyper-parameter
  inference is still performed, to get around having to set arbitrary
  hyper-parameters. (As of now, it uses the original DP model instead of
  the slightly more expressive PYP model.)

runs/
  Where experiments are put. There are two folders for every experiment:
  a Tmp folder, which contains the actual data (inputs, outputs, trace
  files) for different runs of the same experiment, and an Eval folder,
  which only contains a "summary" file that gives a single score.

  Experiments are referred to by the language plus the grammar name, e.g.

      runs/hawaiian_unigram(Tmp|Eval)

  To distinguish different corpus versions, each file has "s0[1234]" in
  its name. To illustrate,

      runs/berber_unigramEval/r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s01.trscore
      runs/berber_unigramEval/r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s02.trscore

  are the score files for the unigram model on two different random
  Berber corpora.

data/
  Contains test corpora, named corpus_<language>_<set>.txt, e.g.

      corpus_hawaiian_01.txt
      corpus_berber_04.txt

scripts/
  Contains scripts to generate artificial corpora and analyse their
  properties.

  corpus/
    Scripts to generate corpora.

  analysis/
    Scripts to perform posterior analysis (input / output projections).
    Use as follows:

        cat runs/hawaiian_unigramTmp/r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s02_fold0*.trsws | python scripts/analysis/posteriorOutputprojections.py runs/hawaiian_unigramTmp/AGgold_02.txt

  ambiguity/
    Scripts to calculate segmentation ambiguity.

dictionaries/
  Contains dictionaries generated by Robert's MaxEnt-grammar-generator.

prog_seg/
  Contains scripts required by the Adaptor Grammar evaluation.

py-cfg/
  Contains source code for Adaptor Grammars (use the Makefile to build).
  Only needs to be compiled once.
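
The fixed base distribution mentioned in the Makefile notes (uniform phoneme distribution, constant stopping probability of 0.5) can be illustrated with a small Python sketch. This is not taken from py-cfg; the function name `base_log_prob` and its parameters are hypothetical, and the sketch assumes that every word has at least one phoneme and that the stopping decision is made after each phoneme, which gives a geometric length distribution. The actual parameterisation used by the Adaptor Grammar code may differ.

```python
import math


def base_log_prob(word, num_phonemes, p_stop=0.5):
    """Log-probability of a word under a fixed base distribution (sketch).

    `word` is any sequence of phonemes (a string or a list).
    `num_phonemes` is the size of the phoneme inventory.
    Assumption: each phoneme is drawn uniformly, and after each phoneme
    the word ends with probability `p_stop`.
    """
    n = len(word)
    if n == 0:
        raise ValueError("a word must contain at least one phoneme")
    # Each of the n phonemes has probability 1 / num_phonemes.
    log_phonemes = -n * math.log(num_phonemes)
    # Continue after the first n - 1 phonemes, stop after the last one.
    log_length = (n - 1) * math.log(1.0 - p_stop) + math.log(p_stop)
    return log_phonemes + log_length


if __name__ == "__main__":
    # Two phonemes from a two-phoneme inventory: (1/2)^2 * 0.5 * 0.5
    print(math.exp(base_log_prob("ab", num_phonemes=2)))
```

Under these assumptions the probability of a word depends only on its length and the inventory size, which is what makes the base distribution "fixed": nothing about it is learned from the corpus.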
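
The naming conventions described for runs/ and data/ can be picked apart with a few regular expressions when collecting results across languages and corpus sets. The following is a minimal sketch, not part of the repository: the helper names `parse_corpus_name` and `parse_run_path` are hypothetical, and only the parts of the file names that the README explains (language, grammar, Tmp/Eval, and the "s0[1234]" set marker) are interpreted. The remaining fields of the run file names are left alone.

```python
import re
from pathlib import PurePosixPath

# corpus_<language>_<set>.txt, e.g. corpus_hawaiian_01.txt
CORPUS_RE = re.compile(r"^corpus_(?P<language>[^_]+)_(?P<set>\d{2})\.txt$")
# <language>_<grammar>(Tmp|Eval), e.g. hawaiian_unigramTmp
EXPERIMENT_RE = re.compile(r"^(?P<language>[^_]+)_(?P<grammar>.+?)(?P<kind>Tmp|Eval)$")
# the corpus-set marker "s01" .. "s04" inside a run file name
SET_RE = re.compile(r"_s(?P<set>\d{2})(?=[_.])")


def parse_corpus_name(name):
    """Return (language, set) for a corpus file name."""
    match = CORPUS_RE.match(name)
    if match is None:
        raise ValueError(f"not a corpus file name: {name!r}")
    return match.group("language"), match.group("set")


def parse_run_path(path):
    """Return language, grammar, folder kind and corpus set for a run file."""
    p = PurePosixPath(path)
    experiment = EXPERIMENT_RE.match(p.parent.name)
    corpus_set = SET_RE.search(p.name)
    if experiment is None or corpus_set is None:
        raise ValueError(f"unrecognised run path: {path!r}")
    info = experiment.groupdict()
    info["set"] = corpus_set.group("set")
    return info


if __name__ == "__main__":
    print(parse_corpus_name("corpus_berber_04.txt"))
    print(parse_run_path(
        "runs/berber_unigramEval/"
        "r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s01.trscore"
    ))
```

The set marker is matched with a lookahead for "_" or "." so that the same pattern works for the score files in the Eval folder and for the per-fold files in the Tmp folder.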