Skip to content

Adds missing Romanian diacritics using Bidirectional Long-Short Term Memory Artificial Neural Networks

License

Notifications You must be signed in to change notification settings

Claudiu/keras-diacritics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Adding Romanian Diacritics using Bidirectional LSTM

Diacritics are part of the Romanian identity but are usually dismissed in colloquial speech to favour faster typing. In this project I am proposing a faster alternative to inserting them, using Bidirectional Long-Short Term Memory Artificial Neural Networks.

Understanding the problem

  • There are 5 types of diacritics in the romanian language. (Ș /ʃ/, Ă /ə/, Ț /t͡s/, Â /ɨ/, Î /ɨ/ and their lowercase parts)
  • Comma-below (ș and ț) versus cedilla (ş and ţ) -- * Many printed and online texts still incorrectly use "s with cedilla" and "t with cedilla". * [Wikipedia:en@Romanian_alphabet]
  • According to the 1993 reform, the choice between î and â is thus again based on a rule that is neither strictly etymological nor phonological, but positional and morphological. The sound is always spelled as â, except at the beginning and the end of words, where î is to be used instead. Exceptions include proper nouns where the usage of the letters is frozen, whichever it may be, and compound words, whose components are each separately subjected to the rule (e.g. ne- + îndemânatic → neîndemânatic "clumsy", not *neândemânatic). [Wikipedia:en@Romanian_alphabet]

Output Targets

  • Ă ă (a with breve)
  • Â â (a with circumflex)
  • Î î (i with circumflex)
  • Ș ș (s with comma)
  • Ț ț (t with comma)
  • Not Diacritic (Ignore / Discard)

Now that we understand the problem at hand, we need to tinker around with the inputs and outputs of our ANN. Since we don't really need 6 outputs, we can simplify our targets using the following format:

ă or ş or ț î â Not diacritic
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1

Usage

usage: main.py [-h] [--weights WEIGHTS] [--epochs EPOCHS]
               [--timeseries TIMESERIES] [--lstmSize LSTMSIZE]
               [--dropout DROPOUT] [--train TRAIN] [--test TEST]
               action

positional arguments:
  action                Test or Train

optional arguments:
  -h, --help            show this help message and exit
  --weights WEIGHTS     A filename where to save our weights
  --epochs EPOCHS       The number of epochs to train the neural network for.
  --timeseries TIMESERIES
                        Size of a single timeseries
  --lstmSize LSTMSIZE   The size of our first LSTM layer
  --dropout DROPOUT     Percent Dropout
  --train TRAIN         Train dataset
  --test TEST           Test dataset

To train

python main.py train --train train.txt --test test.txt --epochs 200

For HTTP Backend

python main.py serve --weights save.hf

About

Adds missing Romanian diacritics using Bidirectional Long-Short Term Memory Artificial Neural Networks

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published