Introduction

A TensorFlow implementation of Neural Language Correction (http://arxiv.org/abs/1603.09727), adapted for Japanese text.

Pre-processing data

$ python preprocessing.py --source_path="/path/to/create/ocr_text/x.txt" --target_path="/path/to/create/ground_truth_text/y.txt" --output_folder="/dir/to/save/model_dataset" --kanji_folder="/dir/to/kanji_characters" --original_path="/path/to/pairs_of_sentence/pairs.txt" --ratio=0.2
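
The meaning of `--ratio` is not documented here. As a minimal sketch, assuming it is the fraction of sentence pairs held out for validation (an assumption, not confirmed by preprocessing.py), a paired split could look like the following. `split_pairs` and its parameters are hypothetical names used only for illustration.

```python
import random


def split_pairs(sources, targets, ratio=0.2, seed=0):
    """Split aligned (OCR, ground-truth) lines into train and held-out sets.

    ASSUMPTION: `ratio` is the held-out fraction; the real preprocessing.py
    may interpret --ratio differently.
    """
    if len(sources) != len(targets):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(sources, targets))
    # Shuffle the pairs together so each OCR line stays aligned with its ground truth.
    random.Random(seed).shuffle(pairs)
    n_held_out = int(len(pairs) * ratio)
    return pairs[n_held_out:], pairs[:n_held_out]


if __name__ == "__main__":
    ocr = ["ocr line %d" % i for i in range(10)]
    truth = ["true line %d" % i for i in range(10)]
    train, held_out = split_pairs(ocr, truth, ratio=0.2)
    print(len(train), len(held_out))
```

Shuffling the zipped pairs, rather than each file separately, is what keeps line i of the OCR text matched with line i of the ground truth.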

Generating pairs of fake data (optional)

$ python generate_fake_data.py --ocr_path_1="/path/to/ocr_path_1/ocr_path_1.txt" --ground_truth_path_1="/path/to/ground_truth_path_1/ground_truth_path_1.txt" --ocr_path_2="/path/to/ocr_path_2/ocr_path_2.txt" --ground_truth_path_2="/path/to/ground_truth_path_2/ground_truth_path_2.txt" --ocr_output_path="/path/to/save/ocr_output_text/ocr.txt" --ground_truth_output_path="/path/to/save/ground_truth_text/gt.txt"

Generating binary files with the KenLM toolkit

$ /path/to/kenlm_toolkit/kenlm/build/bin/lmplz -o N < /path/to/separated_sentence_path/separated_sent.txt > /path/to/save/arpa_file/corpus.arpa
or, if discount estimation fails (for example on a small corpus):
$ /path/to/kenlm_toolkit/kenlm/build/bin/lmplz -o N --discount_fallback < /path/to/separated_sentence_path/separated_sent.txt > /path/to/save/arpa_file/corpus.arpa
$ /path/to/kenlm_toolkit/kenlm/build/bin/build_binary -s /path/to/arpa_file/corpus.arpa /path/to/save/binary_file/corpus.binary

NOTICE: Replace N with the n-gram order (3, 4, 5, ...). The separated sentences are derived from the ground_truth.txt file. To load and use the language model from Python, install the kenlm Python package; see the KenLM instructions for building the toolkit and installing the Python package.
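
As a rough sketch of the NOTICE above: assuming "separated sentences" means one space between every character (an assumption; the repository's own separation step is not shown here), the input for lmplz could be prepared and the resulting binary queried as follows. `kenlm.Model` and its `score` method come from the kenlm Python package; the helper names are hypothetical.

```python
def separate_characters(line):
    """Return the line with a single space between every non-space character."""
    return " ".join(ch for ch in line.strip() if not ch.isspace())


def write_separated(ground_truth_path, output_path):
    """Write a character-separated copy of the ground-truth file for lmplz."""
    with open(ground_truth_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            separated = separate_characters(line)
            if separated:  # skip empty lines
                dst.write(separated + "\n")


def load_language_model(binary_path):
    """Load a binary produced by build_binary (requires the kenlm package)."""
    import kenlm  # imported lazily so the helpers above work without it
    return kenlm.Model(binary_path)


def score_sentence(model, sentence):
    """Return the log10 probability of a sentence under the language model.

    ASSUMPTION: the model was trained on character-separated text, so the
    sentence is separated the same way before scoring.
    """
    return model.score(separate_characters(sentence), bos=True, eos=True)
```

A typical use would be `model = load_language_model("/path/to/save/binary_file/corpus.binary")` followed by `score_sentence(model, candidate)` for each candidate correction; higher (less negative) scores indicate text the language model finds more plausible.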

Training

To train the character-level model (for Japanese, it is best to use only this mode):

$ python train.py

NOTICE: Hyper-parameters (e.g., learning rate, dropout, batch size, optimizer, ...) are set through TensorFlow FLAGS. To train with different values, adjust the flags accordingly.

Local Decoding

$ python decode.py --test_path="/path/to/test_path/test.txt" --output_path="/path/to/save/result.txt"

NOTICE: test.txt contains the OCR output text, one sentence per line, and result.txt contains the corresponding corrected lines in the same order.
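
Because the two files correspond line by line, the decoder's effect can be inspected by pairing them up. The following is a minimal sketch under that assumption; `pair_corrections` and `changed_pairs` are hypothetical helpers, not part of decode.py.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def pair_corrections(test_path, result_path):
    """Pair each OCR input line with the decoded line at the same position."""
    ocr_lines = read_lines(test_path)
    corrected_lines = read_lines(result_path)
    if len(ocr_lines) != len(corrected_lines):
        raise ValueError(
            "line counts differ: %d input vs %d output"
            % (len(ocr_lines), len(corrected_lines))
        )
    return list(zip(ocr_lines, corrected_lines))


def changed_pairs(pairs):
    """Return (line_number, ocr, corrected) for lines the decoder modified."""
    return [
        (number, ocr, corrected)
        for number, (ocr, corrected) in enumerate(pairs, start=1)
        if ocr != corrected
    ]
```

Checking the line counts first guards against a silently misaligned comparison if decoding stopped early.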

Getting OCR output text through the OCR API

$ python3 OCR_API.py --ocr_path="/path/to/save/ocr_output/x.txt" --gt_path="/path/to/save/gt_output/y.txt" --csv_file_path="/path/to/csv_file/file.csv" --images_folder="/dir/to/images"

NOTICE: file.csv lists the ground truth for each image in the images folder.

Web Decoding

$ export FLASK_APP=server.py; flask run --host=0.0.0.0

You can now POST JSON data of the form `{"original_text": "xxx"}` to `http://server_ip:5000/revise` to get the revised sentence.
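
A client for this endpoint could be sketched as follows, using only the Python standard library. It assumes the server reads the request body as JSON with an `application/json` content type; the response format is not documented here, so the sketch returns the raw response text. `build_revise_request` and `revise` are hypothetical helper names.

```python
import json
import urllib.request


def build_revise_request(server_ip, text, port=5000):
    """Build a POST request carrying {"original_text": text} as JSON."""
    url = "http://%s:%d/revise" % (server_ip, port)
    # ensure_ascii=False keeps Japanese characters readable in the payload.
    body = json.dumps({"original_text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def revise(server_ip, text, timeout=10.0):
    """Send the text to the server and return the raw response body.

    ASSUMPTION: the response is UTF-8 text; parse it further once the
    actual response format of server.py is known.
    """
    request = build_revise_request(server_ip, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, `revise("server_ip", "xxx")` would send the request; note that JSON requires double-quoted keys and strings, which `json.dumps` produces.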

TensorFlow Dependency

TensorFlow 1.4.0 or later

anoopyear2020/OCR_Correction
