kelayamatoz/JapaneseFuriganaInferencer

Furigana Inferencer for Japanese Learners

Japanese has a complicated writing system that uses three entirely different scripts: hiragana, katakana and kanji (Chinese characters). Hiragana and katakana are phonetic: each character represents one syllable with a single pronunciation. Kanji, by contrast, are ideographs whose pronunciation is not explicit at all. Moreover, the pronunciation of a kanji (written in hiragana and called furigana) is usually polysyllabic and not necessarily unique. These features of kanji pose great difficulties for Japanese learners, especially beginners, and demand a great deal of effort and memorization.

Language learning is a process of accumulation: learners gradually pick up new vocabulary and expressions. In the case of Japanese vocabulary, learners memorize tuples of kanji, furigana and explanation, for example (戦う, たたかう, “to fight; to struggle”) or (大手企業, おおてきぎょう, “large enterprises”). Because a word may mix hiragana with several kanji, it is very useful for the learner to know:

1. The pronunciation of each kanji separately. For example, 戦=たたか.
2. Other pronunciations of the same kanji in words the learner has already learned. For example, if the learner has already learned (戦争, せんそう, “war”), it helps to show the hint 戦=たたか or せん.
3. Previously learned kanji with the same pronunciation that might be confused with the new one. For example, if the learner has already learned (洗濯, せんたく, “washing, laundry”), it also helps to show the hint せん=戦 or 洗.
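The second and third kinds of hint amount to a two-way lookup between kanji and readings. The following is a minimal sketch of such a lookup, not the application's actual code: the class name `ReadingIndex` and its methods are hypothetical, and it assumes the per-kanji readings (such as 戦=せん from 戦争) have already been worked out.

```python
from collections import defaultdict


class ReadingIndex:
    """Two-way lookup between kanji and the readings learned so far (hypothetical helper)."""

    def __init__(self):
        self.readings_of = defaultdict(set)  # kanji -> readings seen for it
        self.kanji_with = defaultdict(set)   # reading -> kanji seen with it

    def learn(self, kanji, reading):
        """Record that `kanji` was seen with `reading` in some learned word."""
        self.readings_of[kanji].add(reading)
        self.kanji_with[reading].add(kanji)

    def hints(self, kanji, reading):
        """Return (other readings of this kanji, other kanji sharing this reading)."""
        other_readings = self.readings_of.get(kanji, set()) - {reading}
        homophones = self.kanji_with.get(reading, set()) - {kanji}
        return other_readings, homophones


index = ReadingIndex()
index.learn("戦", "せん")    # assumed split of 戦争 (せんそう)
index.learn("洗", "せん")    # assumed split of 洗濯 (せんたく)
index.learn("戦", "たたか")  # from 戦う (たたかう)

other_readings, homophones = index.hints("戦", "せん")
print(other_readings, homophones)
```

Looking up 戦 with the reading せん yields たたか as its other known reading and 洗 as a kanji that shares the reading, which matches the two hints in the examples above.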

When I was learning Japanese myself, I wrote an application that did exactly this, but it is far from perfect. It uses a deterministic inference model that infers a pronunciation only when the inference is unique. For example, it can infer “戦=たたか” from (戦う, たたかう, “to fight; to struggle”) because that is the only possibility. But it cannot infer the pronunciation of any kanji in “大手企業”, because there are many possible ways to split the furigana among the four kanji. It might become possible if we introduce a probability-based model, such as an MDP. Say there is another vocabulary tuple (大阪, おおさか, “Osaka”). Although neither tuple can be resolved deterministically on its own, together they let us infer that “大=おお” is much more probable than “大=おおて”, because under the latter “大阪” could not be pronounced “おおさか”.
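The difference between the two approaches can be illustrated with a short sketch. It is not the original application and not an MDP: `segmentations` enumerates every way to split a furigana string among the kanji of a word (the deterministic case is the one where exactly one split exists), and `reading_support` is a crude stand-in for a probabilistic model that simply counts in how many learned words each candidate reading is feasible. Both function names, and the assumption that kanji fall in the Unicode range U+4E00 to U+9FFF, are illustrative choices.

```python
from collections import Counter


def is_kanji(ch):
    """Assumption: treat the CJK Unified Ideographs block as kanji."""
    return "\u4e00" <= ch <= "\u9fff"


def segmentations(word, reading):
    """Return every way to assign a non-empty piece of `reading` to each kanji.

    Kana in `word` must match the reading literally; each result is a list
    of (kanji, reading piece) pairs.
    """
    if not word:
        return [[]] if not reading else []
    head, rest = word[0], word[1:]
    if not is_kanji(head):
        # Kana stand for themselves, so they must appear in the reading.
        if reading.startswith(head):
            return segmentations(rest, reading[1:])
        return []
    results = []
    for n in range(1, len(reading) + 1):
        for tail in segmentations(rest, reading[n:]):
            results.append([(head, reading[:n])] + tail)
    return results


def reading_support(vocab):
    """Count, per (kanji, reading) pair, the words in which it is feasible."""
    support = Counter()
    for word, reading in vocab:
        feasible = {pair for seg in segmentations(word, reading) for pair in seg}
        support.update(feasible)
    return support


# Deterministic case: only one split exists.
print(segmentations("戦う", "たたかう"))

# Ambiguous case: many splits, so compare candidates across words.
vocab = [("大手企業", "おおてきぎょう"), ("大阪", "おおさか")]
support = reading_support(vocab)
print(support[("大", "おお")], support[("大", "おおて")])
```

With only these two words, 大=おお is feasible in both while 大=おおて is feasible in just one, so the count already ranks おお higher. It cannot yet separate おお from お, which is equally feasible; that kind of tie is where more learned vocabulary or a real probabilistic model would be needed.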

In short, we wish to create a furigana inferencer for Japanese learners. It differs from a general word-partition algorithm because it applies a probabilistic logic inference model to a limited set of data (the vocabulary the learner has already learned), so it can be used as a tool for learning, reviewing and organizing knowledge. The basic interaction with the inferencer would be as follows: the learner enters a new word tuple (kanji, furigana, explanation), and the inferencer returns information (the three points listed above) about the kanji in that new word.
