scripts/setup.sh
Prepares corpus chunks from the limited subset of the Hungarian Webcorpus in build/text_template.
scripts/run.sh
After starting it, point your browser to http://localhost:9999.
gzcat resource/sg3_nom_acc_sentences_xaa.txt.gz | langmodel/gibberize.py | less
Replaces content words in the input sentences with gibberish word forms. The input should be formatted as follows:
# sentence_number
word <TAB> lemma <TAB> analysis
word <TAB> lemma <TAB> analysis
...
# sentence_number
...
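The input format above can be read with a few lines of Python. This is only a minimal sketch of a reader for that layout, not the parser used by langmodel/gibberize.py; the function name read_sentences and the analysis tags in the sample (ART, NOUN, VERB) are placeholders invented for illustration.

```python
import io


def read_sentences(stream):
    """Yield (sentence_number, tokens) pairs from the format described above.

    Each sentence starts with a "# sentence_number" line and is followed by
    one "word<TAB>lemma<TAB>analysis" line per token. Hypothetical helper,
    not part of the repository.
    """
    sent_id = None
    tokens = []
    for line in stream:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # tolerate blank lines between sentences
        if line.startswith("#"):
            if sent_id is not None:
                yield sent_id, tokens
            sent_id = line[1:].strip()
            tokens = []
        else:
            if sent_id is None:
                raise ValueError("token line before any sentence header: %r" % line)
            fields = line.split("\t")
            if len(fields) != 3:
                raise ValueError("expected 3 tab-separated fields: %r" % line)
            tokens.append(tuple(fields))
    if sent_id is not None:
        yield sent_id, tokens


# Tiny made-up sample; the analysis column uses placeholder tags.
sample = "# 1\na\ta\tART\nkutya\tkutya\tNOUN\n# 2\nfut\tfut\tVERB\n"
for number, toks in read_sentences(io.StringIO(sample)):
    print(number, [word for word, lemma, analysis in toks])
```

Passing sys.stdin instead of the StringIO object would let the same function sit in a shell pipeline like the one shown above.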
basic_sentence_demo.py
Generates 1000 sentences with the following structure: definite_article subject verb indefinite_article adjective object
phonmodel.py <list-of-existing-stems
Creates a trigram model from the input character sequences and outputs 100 generated stems.
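The idea of a character trigram model can be sketched as follows. This is an illustration of the general technique under stated assumptions, not the code in phonmodel.py: the names train and generate, the padding symbols ^ and $, and the three sample stems are all chosen here for the example, and it assumes the stems themselves never contain ^ or $.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # padding symbols; assumed not to occur in the stems


def train(stems):
    """Count which character follows each two-character context."""
    counts = defaultdict(Counter)
    for stem in stems:
        padded = START * 2 + stem + END
        for i in range(len(padded) - 2):
            counts[padded[i:i + 2]][padded[i + 2]] += 1
    return counts


def generate(counts, rng, max_len=20):
    """Sample one stem character by character from the trigram counts."""
    if not counts:
        raise ValueError("the model was trained on no stems")
    context = START * 2
    out = []
    while len(out) < max_len:
        followers = counts[context]
        chars = list(followers)
        weights = list(followers.values())
        nxt = rng.choices(chars, weights=weights)[0]
        if nxt == END:
            break
        out.append(nxt)
        context = context[1] + nxt
    return "".join(out)


# Illustrative input only; a real run would read the stem list from stdin.
model = train(["alma", "ablak", "asztal"])
rng = random.Random(0)
for _ in range(5):
    print(generate(model, rng))
```

Each generated character depends only on the two characters before it, so the output tends to look like plausible stems without necessarily being existing ones. The max_len cap is a safety limit added for this sketch.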
gzcat webcorpus.tagged.gz | iconv -f latin2 -t utf8 | resource/webcorp-parse.py | resource/sentence-filter.py