
mano143/telugu_ocr_banti

 
 


Comprehensive OCR System for Telugu Language

The Banti Framework

This framework relies on the ability of a segmentation algorithm to break the text into glyphs. Hence it can be extended to other scripts with well-separated glyphs, such as Malayalam, Oriya, Tamil, Kannada, Thai etc.

Features

  • Opens box files generated by banti (segmentation program)
  • Passes them to a neural network trained by theanet
  • n-gram modelling of the language
  • Ability to stitch broken glyphs (using the language model).
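
To make the language-model idea concrete, here is a minimal sketch of how an n-gram model can rank competing readings of the same piece of text, for example when deciding whether two fragments belong to one glyph. It is not taken from the banti code: the function names (train_trigrams, log_prob, best_candidate) are hypothetical, the add-one smoothing is an assumption, and Latin letters stand in for glyph labels.

```python
import math
from collections import Counter


def train_trigrams(corpus):
    """Count trigrams and their two-symbol contexts over padded words."""
    tri, bi, vocab = Counter(), Counter(), set()
    for word in corpus:
        symbols = ["<s>", "<s>"] + list(word) + ["</s>"]
        vocab.update(symbols)
        for i in range(2, len(symbols)):
            tri[tuple(symbols[i - 2:i + 1])] += 1
            bi[tuple(symbols[i - 2:i])] += 1
    return tri, bi, len(vocab)


def log_prob(word, model):
    """Add-one smoothed log-probability of a symbol sequence (assumed scheme)."""
    tri, bi, vocab_size = model
    symbols = ["<s>", "<s>"] + list(word) + ["</s>"]
    total = 0.0
    for i in range(2, len(symbols)):
        context = tuple(symbols[i - 2:i])
        trigram = tuple(symbols[i - 2:i + 1])
        total += math.log((tri[trigram] + 1) / (bi[context] + vocab_size))
    return total


def best_candidate(candidates, model):
    """Return the candidate reading the model finds most plausible."""
    return max(candidates, key=lambda w: log_prob(w, model))


# Toy training data; a real model would be trained on a large text corpus.
model = train_trigrams(["banana", "bandana", "cabana"])
choice = best_candidate(["banana", "bnaana"], model)
```

In this toy setup the reading whose trigrams were seen during training scores higher than the scrambled one, which is the property a stitching step would rely on.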

Dependencies

  1. Python3
  2. Numpy, Scipy, Nose etc.
  3. Theano
  4. banti segmenter
  5. Theanet

Installation Instructions

  1. Install python3

You might already have it. Run which python3 to check. Make sure you also have pip3: Python 3.4 comes with pip3, while Python 3.3 and older need pip3 installed separately.

  2. Install Theano after installing its dependencies. Here are the General and the Ubuntu-specific instructions. You just need to install numpy, scipy, nose etc.

  3. Install Theanet by running the setup.py

  4. Clone this repo. (telugu_ocr_banti)

  5. Set the following theano flag(s). I just put the following in my .bashrc file.

export THEANO_FLAGS='floatX=float32'
  6. Get the required files to load the neural network and the ngram library.
# change to cloned project directory
mkdir library
wget http://stanford.edu/~rakesha/banti/library/4hidaux_252611_01.pkl -O library/nn.pkl
wget http://stanford.edu/~rakesha/banti/library/mega.123.pkl -P library/
  7. Run the OCR program
python3 recognize.py sample_images/praasa.box 
# Run for help
python3 recognize.py -h

Here you are running on the provided sample image praasa.box, generated from praasa.tif (both in the sample_images directory).

  8. OCRing your own images.
python3 recognize.py sample_images/praasa.tif
python3 recognize.py sample_images/praasa.jpg
python3 recognize.py sample_images/praasa.png

Note that recognize.py needs images in .box format. These files are generated by the banti segmenter, which you can install to generate box files from your tiff files. Once you obtain the banti_segmenter binary/executable, you can leave it in the same directory as recognize.py or pass it as an argument. This enables recognize.py to convert tiff files to box files. png and jpg files can also be used: they are converted to tiff before being converted to box files.
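
The chain described above (png/jpg to tiff, tiff to box, box to text) can be summarised in a short sketch. planned_stages is a hypothetical helper and is not part of recognize.py; the stage strings are only descriptive labels, and treating .jpeg and .tiff as equivalents of .jpg and .tif is an assumption.

```python
from pathlib import Path

# Extensions that would first be converted to tiff (".jpeg" is an assumption).
RASTER_EXTENSIONS = {".png", ".jpg", ".jpeg"}
# Extensions the segmenter would turn into box files (".tiff" is an assumption).
TIFF_EXTENSIONS = {".tif", ".tiff"}


def planned_stages(path):
    """List the processing stages an input file would go through."""
    suffix = Path(path).suffix.lower()
    stages = []
    if suffix in RASTER_EXTENSIONS:
        stages.append("convert to tiff")
        suffix = ".tif"
    if suffix in TIFF_EXTENSIONS:
        stages.append("segment into box file (banti_segmenter)")
        suffix = ".box"
    if suffix != ".box":
        raise ValueError(f"unsupported input type: {path}")
    stages.append("recognize glyphs (recognize.py)")
    return stages


png_stages = planned_stages("sample_images/praasa.png")
box_stages = planned_stages("sample_images/praasa.box")
```

A box file needs only the final stage, which is why the segmenter binary is optional when box files are already available.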

Alternatively, to get box files you can try running [this binary](https://stanford.edu/~rakesha/banti/banti_segmenter), which was built on a 64-bit Ubuntu Linux machine. (Run it without any arguments to see all options.)
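
Before the first run, it can help to check that the downloaded files are where the steps above put them. The sketch below is a hypothetical helper (preflight is not part of this repo) and assumes the segmenter was saved as banti_segmenter next to recognize.py. A downloaded binary usually lacks the execute permission, so the check reports that case as well.

```python
import os
from pathlib import Path

# Paths used by the wget commands in the installation steps.
REQUIRED_LIBRARY_FILES = ("library/nn.pkl", "library/mega.123.pkl")
# Assumed file name and location of the segmenter binary.
SEGMENTER_NAME = "banti_segmenter"


def preflight(project_dir):
    """Return a list of human-readable problems; empty means ready to run."""
    root = Path(project_dir)
    problems = []
    for rel in REQUIRED_LIBRARY_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing {rel}")
    segmenter = root / SEGMENTER_NAME
    if not segmenter.is_file():
        problems.append(f"missing {SEGMENTER_NAME} (only needed for image input)")
    elif not os.access(segmenter, os.X_OK):
        problems.append(f"{SEGMENTER_NAME} is not executable; try chmod +x")
    return problems


if __name__ == "__main__":
    for problem in preflight("."):
        print(problem)
```

Run it from the cloned project directory; no output means all three files were found and the binary is executable.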
