GitHub - kelayamatoz/JapaneseFuriganaInferencer: Automatic Furigana Inferencer for Japanese Learners

Furigana Inferencer for Japanese Learners

Japanese has a complicated writing system that combines three different sets of glyphs: hiragana, katakana, and kanji (Chinese characters). Hiragana and katakana are phonetic: each character represents a syllable with a single pronunciation. Kanji, however, are ideographs whose pronunciation is not evident from the glyph. Moreover, the pronunciation of a kanji (written in hiragana and called furigana) is usually polysyllabic and not necessarily unique. These features pose great difficulties for Japanese learners, especially at the beginning stage, and demand a great deal of effort and memorization.

Language learning is a process of accumulation: learners gradually acquire new vocabulary and expressions. In the case of Japanese vocabulary, learners memorize tuples of kanji, furigana, and explanation, for example (戦う, たたかう, “to fight; to struggle”) or (大手企業, おおてきぎょう, “large enterprises”). Because a word may mix hiragana with several kanji, it is very useful for the learner to know:

1. The pronunciation of each kanji separately. For example, 戦＝たたか.
2. Other pronunciations of the same kanji in previously learned words. For example, if the learner has already learned (戦争, せんそう, “war”), it helps to hint that 戦＝たたか or せん.
3. Previously learned kanji with the same pronunciation that might be confused with the new one. For example, if the learner has already learned (洗濯, せんたく, “washing, laundry”), it also helps to hint that せん＝戦 or 洗.
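The three kinds of hints can be sketched as two lookup tables. This is only an illustration: `ReadingIndex`, `add`, and `hints` are hypothetical names rather than code from this repository, and the sketch assumes the per-kanji readings have already been worked out.

```python
from collections import defaultdict


class ReadingIndex:
    """Toy index of kanji readings learned so far (hypothetical helper)."""

    def __init__(self):
        self.readings_of = defaultdict(set)  # kanji -> readings seen for it
        self.kanji_with = defaultdict(set)   # reading -> kanji seen with it

    def add(self, kanji, reading):
        """Record that `kanji` was read as `reading` in some learned word."""
        self.readings_of[kanji].add(reading)
        self.kanji_with[reading].add(kanji)

    def hints(self, kanji, reading):
        """Return the three hints for one kanji of a newly learned word."""
        return {
            "reading": reading,
            "other_readings": sorted(self.readings_of[kanji] - {reading}),
            "homophones": sorted(self.kanji_with[reading] - {kanji}),
        }


index = ReadingIndex()
index.add("戦", "せん")    # from (戦争, せんそう)
index.add("洗", "せん")    # from (洗濯, せんたく)
index.add("戦", "たたか")  # from (戦う, たたかう)

print(index.hints("戦", "たたか"))  # other_readings lists せん
print(index.hints("戦", "せん"))    # homophones lists 洗
```

Keeping both directions of the mapping makes the second and third hints simple set lookups.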

When I was learning Japanese myself, I wrote an application that did exactly this, but it is far from perfect. It uses a deterministic inference model that infers a pronunciation only when the inference is unique. For example, it can infer 戦＝たたか from (戦う, たたかう, “to fight; to struggle”) because that is the only possibility. But it cannot infer the pronunciation of any kanji in 大手企業, because there are many possible ways to split the furigana among the four kanji. This might become possible with a probability-based model, such as an MDP. Say there is another vocabulary tuple (大阪, おおさか, “Osaka”). Although neither tuple can be resolved deterministically on its own, we can still infer that 大＝おお is much more probable than 大＝おおて, because if the latter were the case, 大阪 could not be pronounced おおさか.
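To make the ambiguity concrete, the sketch below enumerates every way of splitting a furigana string among the kanji of a word, then scores each candidate reading by giving all splits of a word equal weight. This is not the repository's code and not the MDP mentioned above; it is a much simpler stand-in. The names `is_kanji`, `splits`, and `reading_scores` are hypothetical. The sketch also assumes that kana in the word appear literally in the furigana and that kanji fall in the basic CJK block.

```python
from collections import defaultdict


def is_kanji(ch):
    # Simplification: only the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def splits(word, reading):
    """Yield every alignment as a tuple of (kanji, reading_part) pairs."""
    if not word:
        if not reading:
            yield ()
        return
    head, rest = word[0], word[1:]
    if not is_kanji(head):
        # Kana in the word must appear literally in the reading.
        if reading.startswith(head):
            yield from splits(rest, reading[1:])
        return
    # A kanji takes at least one kana; try every possible length.
    for n in range(1, len(reading) + 1):
        for tail in splits(rest, reading[n:]):
            yield ((head, reading[:n]),) + tail


def reading_scores(vocab):
    """Score (kanji, reading) pairs, weighting each word's splits equally."""
    counts = defaultdict(float)
    per_kanji = defaultdict(float)
    for word, reading in vocab:
        alignments = list(splits(word, reading))
        if not alignments:
            continue
        weight = 1.0 / len(alignments)
        for alignment in alignments:
            for kanji, part in alignment:
                counts[(kanji, part)] += weight
                per_kanji[kanji] += weight
    return {pair: c / per_kanji[pair[0]] for pair, c in counts.items()}


# Deterministic case: exactly one split exists, so 戦 = たたか.
print(list(splits("戦う", "たたかう")))

# Ambiguous case: many splits, so nothing can be inferred from one word.
print(len(list(splits("大手企業", "おおてきぎょう"))))

scores = reading_scores([("大手企業", "おおてきぎょう"), ("大阪", "おおさか")])
print(scores[("大", "おお")] > scores[("大", "おおて")])
```

With only these two words the score ranks 大＝おお above 大＝おおて, but it cannot yet separate 大＝おお from 大＝お. More vocabulary, or a real probabilistic model, would be needed for that.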

In short, we wish to create a furigana inferencer for Japanese learners. It differs from a general word-segmentation algorithm in that it applies a probabilistic logic inference model to a limited set of data (the vocabulary the learner has already learned), so it can serve as a tool for learning, reviewing, and organizing knowledge. The basic interaction with the inferencer would be as follows: the learner supplies a new word tuple (kanji, furigana, explanation), and the inferencer returns information (the three points mentioned above) about the kanji in that word.

Folders and files

corpus
minimal_example
plot
real_example
result
test
README.md
baseline_test_result.txt
learner.py
model.py
parser.py
test.py
test.txt
test_log.txt
test_plot.m
test_result.txt
tuples.txt
util.py
