School project for scanning printed text (receipts).
- Install tesseract
sudo apt install tesseract-ocr
- Adding support for other languages in Tesseract: download the *.traineddata file from github.com/tesseract-ocr/tessdata_fast, then place it in the tessdata/ directory of your Tesseract installation (e.g. /usr/share/tesseract-ocr/4.00/tessdata).
- Support for the Polish language: pol.traineddata
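To confirm that a language file ended up in the right place, a small check like the one below may help. It is only a sketch: the helper name `missing_languages` is made up for illustration, and the tessdata path in the usage note is the example path from the list above, which may differ on your system.

```python
from pathlib import Path


def missing_languages(tessdata_dir, languages):
    """Return the language codes that have no <code>.traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    return [
        code
        for code in languages
        if not (directory / f"{code}.traineddata").is_file()
    ]
```

For example, `missing_languages("/usr/share/tesseract-ocr/4.00/tessdata", ["pol"])` returns an empty list once pol.traineddata is installed in that directory.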
Reader is meant to take an input image (e.g. a photo taken with a smartphone) and output the formatted contents of the scanned receipt. To achieve this, it goes through the following steps:
To help the OCR module and increase its accuracy, we preprocess the images:
- Cropping - "cutting out" the receipt from the original image.
- Rescaling - Tesseract works best when the image is at least 300 dpi.
- Blurring - used to reduce noise.
- Thresholding
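The rescaling, blurring and thresholding steps can be illustrated with a dependency-free sketch. This is not the project's actual code, which more likely relies on an image library. The function names, the 3x3 mean blur, the fixed cutoff of 128 and the idea of deriving dpi from the receipt's physical width are all assumptions made for illustration. Images are represented as nested lists of grayscale values (0-255).

```python
def scale_factor_for_dpi(width_px, width_inches, target_dpi=300):
    """Return the resize factor needed to reach at least target_dpi.

    width_inches is the physical width of the receipt (an assumed input).
    Images that already meet the target are left alone (factor 1.0).
    """
    if width_px <= 0 or width_inches <= 0:
        raise ValueError("width_px and width_inches must be positive")
    current_dpi = width_px / width_inches
    return max(1.0, target_dpi / current_dpi)


def box_blur(image):
    """Apply a 3x3 mean blur; border pixels average only existing neighbours."""
    height, width = len(image), len(image[0])
    blurred = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            neighbours = [
                image[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            blurred[y][x] = sum(neighbours) // len(neighbours)
    return blurred


def threshold(image, cutoff=128):
    """Binarize: pixels at or above cutoff become white (255), the rest black (0)."""
    return [[255 if pixel >= cutoff else 0 for pixel in row] for row in image]
```

A 600 px wide photo of a 3 inch wide receipt is at 200 dpi, so `scale_factor_for_dpi(600, 3.0)` asks for a 1.5x upscale. Blurring before thresholding keeps isolated noisy pixels from turning into black specks in the binarized image.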
Cropping works best when the receipt is visible in its entirety (all four corners of the paper sheet have to be visible). It is best for pictures to have a dark, uniform background; otherwise there might be problems detecting the receipt.
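Once four corner points of the sheet have been detected, they usually need to be put in a consistent order before the receipt can be cut out and straightened. The sketch below shows one common way to do that ordering. It is an assumption about how such a step could look, not a description of this project's cropping code, and `order_corners` is a hypothetical helper name. Points are (x, y) pairs in image coordinates, with y growing downwards.

```python
def order_corners(points):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left."""
    if len(points) != 4:
        raise ValueError("exactly four corner points are required")
    # The top-left corner has the smallest x + y, the bottom-right the largest.
    by_sum = sorted(points, key=lambda p: p[0] + p[1])
    top_left, bottom_right = by_sum[0], by_sum[-1]
    # The bottom-left corner has the smallest x - y, the top-right the largest.
    by_diff = sorted(points, key=lambda p: p[0] - p[1])
    bottom_left, top_right = by_diff[0], by_diff[-1]
    return [top_left, top_right, bottom_right, bottom_left]
```

This also shows why all four corners must be visible: with a corner missing from the photo, there is no quadrilateral to order or straighten.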
For OCR we use Tesseract together with pytesseract, its Python wrapper.
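A minimal sketch of calling Tesseract through pytesseract might look like the following. It assumes Pillow is installed for loading the image. The helper names, the page segmentation mode (`--psm 6`, a single uniform block of text) and the engine mode (`--oem 3`, the default) are illustrative choices, not settings taken from this project.

```python
def build_config(psm=6, oem=3):
    """Build the Tesseract command-line options passed via pytesseract's config argument."""
    return f"--oem {oem} --psm {psm}"


def read_text(image_path, lang="pol"):
    """Run OCR on an image file and return the recognized text.

    Requires Tesseract, pytesseract and Pillow; lang="pol" needs pol.traineddata.
    """
    # Imported here so that build_config can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang, config=build_config())
```

Calling `read_text("receipt.png")` (a placeholder file name) returns the raw text, which still needs to be parsed into items and prices.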
The problem with parsing the contents of receipts is that every store has a different receipt layout. Because of that, we had to create different parsing strategies for different layouts.
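The idea of one parsing strategy per layout can be sketched as a lookup from a store marker to a parser function. Everything specific here is hypothetical: the store names "SHOP A" and "SHOP B", the two layouts and the function names are invented for illustration and do not reflect the layouts this project actually supports.

```python
import re
from decimal import Decimal

ITEM_WITH_PRICE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+[.,]\d{2})$")
PRICE_ONLY = re.compile(r"\d+[.,]\d{2}")


def to_decimal(text):
    """Convert '3,49' or '3.49' to a Decimal."""
    return Decimal(text.replace(",", "."))


def parse_single_line_items(lines):
    """Hypothetical layout A: item name and price on the same line."""
    items = []
    for line in lines:
        match = ITEM_WITH_PRICE.match(line.strip())
        if match:
            items.append((match.group("name"), to_decimal(match.group("price"))))
    return items


def parse_two_line_items(lines):
    """Hypothetical layout B: item name on one line, price alone on the next."""
    cleaned = [line.strip() for line in lines if line.strip()]
    items = []
    for name, price in zip(cleaned, cleaned[1:]):
        if PRICE_ONLY.fullmatch(price) and not PRICE_ONLY.fullmatch(name):
            items.append((name, to_decimal(price)))
    return items


# Maps a marker expected somewhere in the receipt text to its parsing strategy.
STRATEGIES = {
    "SHOP A": parse_single_line_items,
    "SHOP B": parse_two_line_items,
}


def parse_receipt(text):
    """Pick a strategy based on the store marker and return (name, price) pairs."""
    upper_text = text.upper()
    for marker, strategy in STRATEGIES.items():
        if marker in upper_text:
            return strategy(text.splitlines())
    raise ValueError("unknown receipt layout")
```

Supporting a new store then means writing one more parser function and registering it in `STRATEGIES`, without touching the existing ones.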
Well you don't... at this point