boerschi/artlangseg

Word Segmentation and Artificial Languages

Makefile
  use this to run an experiment. By default, the Unigram grammar is run on
  the first random Hawaiian corpus. To run on a different corpus version
  (there are currently 4 sets; there is nothing special about them, they
  were simply generated by running Robert's scripts 4 times), use

  make SET=0[1234]

  To run on a different language (at the moment, Berber is the only
  alternative), use

  make LANGUAGE=berber

  To keep the model as simple as possible, the base distribution is assumed
  to be fixed: words are generated assuming a uniform phoneme distribution
  and a constant stopping probability of 0.5. Hyper-parameter inference is
  still performed, however, to avoid having to set arbitrary hyper-parameters.
  (As of now, the original DP model is used instead of the slightly more
  expressive PYP model.)
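  As an illustration, the fixed base distribution described above can be
  sketched in a few lines of Python. The phoneme inventory below is invented
  for the example and is not the one used in the experiments:

```python
import random

# Illustrative phoneme inventory (an assumption for this sketch, not the
# inventory used in the actual experiments).
PHONEMES = ["a", "e", "i", "o", "u", "k", "l", "m", "n", "p"]
STOP_PROB = 0.5  # the constant stopping probability described above

def sample_word(rng):
    """Draw one word: uniform phonemes, stop with probability 0.5 after each."""
    word = []
    while True:
        word.append(rng.choice(PHONEMES))
        if rng.random() < STOP_PROB:
            return "".join(word)

def base_probability(word):
    """P(word) = (1/|PHONEMES|)^n * (1 - STOP_PROB)^(n-1) * STOP_PROB,
    for n = len(word)."""
    n = len(word)
    return (1.0 / len(PHONEMES)) ** n * (1.0 - STOP_PROB) ** (n - 1) * STOP_PROB

rng = random.Random(0)
words = [sample_word(rng) for _ in range(1000)]
# Word lengths are geometric with mean 1 / STOP_PROB = 2 phonemes.
```

  With a stopping probability of 0.5 the base distribution prefers very
  short words; the DP (or PYP) adaptor on top of it is what lets frequently
  used words be reused wholesale.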

runs/
  where experiments are put. There are two folders for every experiment: a
  Tmp folder, which contains the actual data (inputs, outputs, trace files)
  for the different runs of the same experiment, and an Eval folder, which
  only contains a "summary" file giving a single score.
  Experiments are referred to by the language plus the grammar name, e.g.

  runs/hawaiian_unigram(Tmp|Eval)

  To distinguish different corpus versions, each file has "s0[1234]" in its
  name. To illustrate,

  runs/berber_unigramEval/r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s01.trscore
  runs/berber_unigramEval/r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s02.trscore

  are the score files for the unigram model on two different random Berber
  corpora.
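  The run-file names are regular enough to pull apart programmatically. A
  minimal sketch in Python (only the run number, grammar, and corpus set are
  interpreted; the remaining fields are left undecoded since their meanings
  are not spelled out in this README):

```python
import re

# Matches names like r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s01.trscore;
# only the run number, grammar name, and corpus set are extracted.
NAME_RE = re.compile(r"r(?P<run>\d+)_G(?P<grammar>[^_]+)_.*_s(?P<set>\d+)")

def parse_run_name(filename):
    """Return {'run': ..., 'grammar': ..., 'set': ...} for a run-file name."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError("unrecognised run-file name: %s" % filename)
    return m.groupdict()

info = parse_run_name("r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s01.trscore")
```

  The same pattern also works for the per-fold trace files in the Tmp
  folders, since they share the naming scheme.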

data/
  contains the test corpora, named corpus_<language>_<set>.txt, e.g.

  corpus_hawaiian_01.txt
  corpus_berber_04.txt
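  Since the corpora follow this naming convention, a few lines of Python are
  enough to group them by language (illustrative only; the repository itself
  drives everything through the Makefile):

```python
from collections import defaultdict
from pathlib import Path

def corpora_by_language(data_dir="data"):
    """Group the test corpora by language, relying only on the
    corpus_<language>_<set>.txt naming convention described above."""
    groups = defaultdict(list)
    for path in sorted(Path(data_dir).glob("corpus_*_*.txt")):
        _, language, set_id = path.stem.split("_", 2)
        groups[language].append(set_id)
    return dict(groups)
```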

scripts/
  contains scripts to generate artificial corpora / analyse their properties

  corpus/
    scripts to generate corpora

  analysis/
    scripts to perform posterior analysis (input / output projections)
    use as follows:

    cat runs/hawaiian_unigramTmp/r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s02_fold0*.trsws | python scripts/analysis/posteriorOutputprojections.py runs/hawaiian_unigramTmp/AGgold_02.txt

  ambiguity/
    scripts to calculate segmentation ambiguity

dictionaries/
  contains dictionaries generated by Robert's MaxEnt-grammar-generator

prog_seg/
  contains scripts required by the Adaptor Grammar evaluation

py-cfg/
  contains the source code for Adaptor Grammars (use its Makefile to build);
  it only needs to be compiled once
