Comprehensive OCR System for Telugu Language

The Banti Framework

This framework relies on the ability of a segmentation algorithm to break the text into glyphs. Hence it can be extended to other scripts with well-separated glyph images, such as Malayalam, Oriya, Tamil, Kannada, and Thai.

Features

Opens box files generated by banti (segmentation program)
Passes them to a neural network trained by theanet
n-gram modelling of the language
Ability to stitch broken glyphs (using the language model)
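
To make the n-gram feature more concrete, here is a minimal sketch of counting 1-, 2- and 3-grams over sequences of glyph labels and scoring a glyph given its context with add-one smoothing. The class name `NGramCounter`, the smoothing scheme, and the toy glyph labels are assumptions made for illustration; the repository's own ngram.py may work differently.

```python
from collections import Counter
import math


class NGramCounter:
    """Toy 1/2/3-gram counter over glyph labels (illustrative, not the repo's code)."""

    def __init__(self, max_n=3):
        self.max_n = max_n
        self.counts = Counter()  # maps a tuple of glyph labels to its count
        self.vocab = set()       # distinct glyph labels seen so far
        self.total = 0           # total number of glyphs seen

    def add(self, glyphs):
        """Count every n-gram (n = 1..max_n) in one glyph sequence."""
        glyphs = list(glyphs)
        self.vocab.update(glyphs)
        self.total += len(glyphs)
        for n in range(1, self.max_n + 1):
            for i in range(len(glyphs) - n + 1):
                self.counts[tuple(glyphs[i:i + n])] += 1

    def log_prob(self, context, glyph):
        """Rough add-one smoothed log P(glyph | last max_n - 1 context glyphs)."""
        keep = self.max_n - 1
        context = tuple(context)[-keep:] if keep else ()
        seen = self.counts[context] if context else self.total
        hits = self.counts[context + (glyph,)]
        return math.log((hits + 1) / (seen + len(self.vocab)))


# Hypothetical training data: two short glyph sequences.
model = NGramCounter()
model.add(["ka", "ma", "la"])
model.add(["ka", "ma", "ta"])
# A continuation seen in training scores higher than an unseen one.
print(model.log_prob(["ka"], "ma") > model.log_prob(["ka"], "la"))
```

A language model of this kind can rank alternative readings of an ambiguous or broken glyph by how plausible each is in context.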

Dependencies

Python3
NumPy, SciPy, Nose, etc.
Theano
banti segmenter
Theanet

Installation Instructions

Install python3

You might already have it; run which python3 to check. Make sure you also have pip3. Python 3.4 and later come with pip3; Python 3.3 and older need pip3 installed separately.

Install Theano after installing its dependencies; see Theano's general and Ubuntu-specific installation instructions. You just need to install numpy, scipy, nose, etc.
Install Theanet by running its setup.py.
Clone this repo (telugu_ocr_banti).
Set the following Theano flag. I put this line in my .bashrc file.

export THEANO_FLAGS='floatX=float32'
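
If editing .bashrc is not convenient, the same flag can be set from Python, as long as this happens before Theano is first imported, because Theano reads THEANO_FLAGS when it is imported. The following is a minimal sketch; the helper name `ensure_theano_flag` is hypothetical and not part of Theano or this repository.

```python
import os


def ensure_theano_flag(key, value):
    """Add key=value to THEANO_FLAGS unless that key is already present."""
    # THEANO_FLAGS is a comma-separated list of key=value pairs.
    flags = [f for f in os.environ.get("THEANO_FLAGS", "").split(",") if f]
    if not any(f.split("=", 1)[0] == key for f in flags):
        flags.append("{}={}".format(key, value))
    os.environ["THEANO_FLAGS"] = ",".join(flags)
    return os.environ["THEANO_FLAGS"]


# Must run before the first `import theano` in the process.
ensure_theano_flag("floatX", "float32")
```

The helper leaves an existing floatX setting untouched, so a value exported in the shell still takes precedence.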

Get the required files to load the neural network and the ngram library.

# change to cloned project directory
mkdir library
wget http://stanford.edu/~rakesha/banti/library/4hidaux_252611_01.pkl -O library/nn.pkl
wget http://stanford.edu/~rakesha/banti/library/mega.123.pkl -P library/
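
Before running the OCR program, it can help to confirm that both downloads landed where the wget commands above put them. This is a small sketch; the function name `missing_library_files` is hypothetical, and the two file names are taken from those wget commands.

```python
from pathlib import Path

# File names as saved by the wget commands: nn.pkl (via -O) and mega.123.pkl (via -P).
REQUIRED_FILES = ("nn.pkl", "mega.123.pkl")


def missing_library_files(project_dir="."):
    """Return the required files that are absent from <project_dir>/library."""
    library = Path(project_dir) / "library"
    return [name for name in REQUIRED_FILES if not (library / name).is_file()]


if __name__ == "__main__":
    missing = missing_library_files()
    if missing:
        print("Missing from library/:", ", ".join(missing))
    else:
        print("All library files are in place.")
```

Run it from the cloned project directory, or pass that directory as `project_dir`.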

Run the OCR program

python3 recognize.py sample_images/praasa.box 
# Run for help
python3 recognize.py -h

Here you are running on the provided sample image praasa.box, generated from praasa.tif (both are in the sample_images directory).

OCRing your own images.

python3 recognize.py sample_images/praasa.tif
python3 recognize.py sample_images/praasa.jpg
python3 recognize.py sample_images/praasa.png

Note that recognize.py needs images in .box format. These files are generated by the banti segmenter, which you can install to generate box files from your TIFF files. Once you obtain the banti_segmenter binary/executable, you can leave it in the same directory as recognize.py or pass it as an argument. This enables recognize.py to convert TIFF files to box files. PNG and JPG files can also be converted to TIFF before being converted to box files.
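
The two ways of supplying the segmenter described above can be sketched as a small lookup helper. This is only an illustration: the function name `find_segmenter`, the lookup order, and the final fallback to the system PATH are assumptions, not necessarily what recognize.py does.

```python
import os
import shutil


def find_segmenter(explicit_path=None, script_dir=".", name="banti_segmenter"):
    """Return a path to an executable segmenter, or None if none is found.

    Assumed lookup order: the explicitly given path, then script_dir,
    then the system PATH.
    """
    candidates = []
    if explicit_path:
        candidates.append(explicit_path)
    candidates.append(os.path.join(script_dir, name))
    for path in candidates:
        # Accept only regular files that the current user may execute.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return shutil.which(name)
```

If the helper returns None, only .box inputs can be processed until the segmenter is installed or its location is given.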

Alternatively, to get box files, you can try running [this binary](https://stanford.edu/~rakesha/banti/banti_segmenter), which was built on a 64-bit Ubuntu Linux machine. (Run it without any arguments to see all options.)
