forked from poke1024/origami

A suite of batches and tools for OCR tasks.


sepastian/origami

 
 


Origami

Origami is a self-contained suite of batches and tools for OCR processing of historical newspapers. It covers many essential steps in a digitization pipeline, including (1) building training data for training models, and (2) generating Page-XML OCR output from pages using trained models.

Apart from its specific features, Origami is

  • easy to set up
  • easy to use
  • based on file-based intermediary results that allow customization

Origami's current default implementation features:

  • DNN segmentation
  • dewarping
  • reading order detection
  • simple table support
  • Page-XML export

Origami also provides additional tools for:

  • annotating ground truth
  • debugging
  • creating annotated images
  • evaluation of OCR quality

Installation

Basics

conda create --name origami python=3.7 -c defaults -c conda-forge --file origami/requirements/conda.txt
conda activate origami
pip install -r origami/requirements/pip.txt

Troubleshooting scikit-geometry

On some systems (e.g. macOS 10.15.7) the conda installation of scikit-geometry is broken. In these cases, you can always build scikit-geometry from scratch, i.e.:

conda activate origami
git clone https://github.com/scikit-geometry/scikit-geometry
cd scikit-geometry
python setup.py install

General Usage

cd /path/to/origami
python -m origami.batch.detect.segment

All command line tools will give you help information on their arguments when called as above.

The given data path should contain processed pages as images. Generated data is put into the same path. Images may be structured into any hierarchy of sub folders.

Batches

Artifacts

Origami's processing happens in separate stages, with batches that read and write information from well-defined files (also called artifacts). Each batch creates and depends upon various artifacts, as shown in the following table. Rows depict artifacts, columns depict detection batches (i.e. the batches found under origami.batch.detect). Blank circles indicate a read, filled circles indicate a write. As illustrated here, later batches depend on information provided by earlier batches.


Batches (columns, in pipeline order): segment, contours, flow, dewarp, layout, lines, order, ocr, compose

Artifacts (rows):

  • page image
  • segment.zip
  • contours.0.zip
  • flow.zip
  • lines.0.zip
  • contours.1.zip
  • dewarp.zip
  • contours.2.zip
  • tables.json
  • contours.3.zip
  • lines.3.zip
  • order.json
  • ocr.zip
  • compose.zip

Running Batches

Order

Given an OCR model, and as illustrated in the table in the previous section, the detection batches must be run in the following order to perform OCR on a folder of documents:

1 segment
2 contours
3 flow
4 dewarp
5 layout
6 lines
7 order
8 ocr
9 compose
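
The order above can be scripted. The following is a minimal sketch, not part of Origami: build_commands and its extra_args parameter are hypothetical names, and it assumes every detection batch takes the data path as a positional argument in the same way the compose example later in this README does. Stage-specific arguments (such as --model for segment) are passed through unchanged; check each batch's command line help for the arguments it actually expects.

```python
import sys

# Detection batches in the order listed above.
STAGES = ["segment", "contours", "flow", "dewarp", "layout",
          "lines", "order", "ocr", "compose"]


def build_commands(data_path, extra_args=None):
    """Return one argv list per detection batch, in pipeline order.

    extra_args (hypothetical helper parameter) maps a stage name to a list
    of additional command line arguments for that stage.
    """
    extra_args = extra_args or {}
    commands = []
    for stage in STAGES:
        # Assumption: the data path is a positional argument for every batch.
        argv = [sys.executable, "-m", f"origami.batch.detect.{stage}", data_path]
        argv.extend(extra_args.get(stage, []))
        commands.append(argv)
    return commands
```

Each returned list can be handed to subprocess.run(argv, check=True), so that a failing stage stops the pipeline before later stages try to read artifacts that were never written.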

Concurrency

Batch processes can be run concurrently. To coordinate them, Origami supports either file-based locking or locking through a database (see --lock-strategy). The database strategy is more compatible and is the default. Use --lock-database to specify the path to a lock database (if none is specified, Origami will create one in your data folder).
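
To illustrate the general idea behind file-based locking, here is a generic sketch. It is not Origami's implementation; page_lock and the lock file path are hypothetical. It relies on the fact that creating a file with O_CREAT | O_EXCL fails if the file already exists, so only one process can hold the lock for a given page.

```python
import os
from contextlib import contextmanager


@contextmanager
def page_lock(lock_path):
    """Yield True if this process acquired the lock, False otherwise."""
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another process is working on this page; the caller should skip it.
        yield False
        return
    try:
        os.close(fd)
        yield True
    finally:
        # Release the lock by deleting the lock file.
        os.remove(lock_path)
```

A worker would wrap the processing of each page in "with page_lock(path) as acquired:" and move on to the next page when acquired is False. Exclusive file creation is not reliable on every network file system, which is one common reason to prefer a database for locking.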

Modifying Results

It is possible to replace Origami pipeline stages/batches by custom implementations by simply reading and writing Origami's artifacts using the documented file formats.

It is also possible to run Origami stages and then postprocess the generated artifacts before continuing with later stages.
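
Before postprocessing an artifact, it helps to look at what it contains. The sketch below uses only Python's standard zipfile module to list the members of one of the .zip artifacts. list_artifact is a hypothetical helper, and the member names and formats inside each artifact are not described here, so consult the documented file formats before modifying anything.

```python
import zipfile


def list_artifact(path):
    """Return (member name, size in bytes) for every entry of a zip artifact."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]
```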

The Detection Batches

segment

origami.batch.detect.segment
Performs segmentation (e.g. separation into text and background) on all images using a neural network model.
If you have not trained a custom model, you should download and use Origami's default model. Specify the path to the downloaded model via the --model argument when calling origami.batch.detect.segment.
The predicted classes and labels are embedded in the specified model.

contours

origami.batch.detect.contours
From the pixelwise segmentation information, detects connected components to produce vectorized polygonal contours for blocks and separator lines.
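
As a rough illustration of the connected component step only, the sketch below labels 4-connected foreground regions in a small binary mask using breadth-first search. It is a generic textbook version, not Origami's code, and it stops at pixel sets: turning components into polygonal contours is a further step not shown here.

```python
from collections import deque


def connected_components(mask):
    """Find 4-connected foreground regions of a 2D grid of 0/1 values.

    Returns a list of components, each a sorted list of (row, col) pixels.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Start a new component and flood it with breadth-first search.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(sorted(pixels))
    return components
```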

flow

origami.batch.detect.flow
Detects baselines and warping in separators to produce an overall description of page curvature.

dewarp

origami.batch.detect.dewarp
Creates a dewarping transformation that is used in subsequent stages.

layout

origami.batch.detect.layout
Refines regions by fixing over- and under-segmentation via heuristic rules.

lines

origami.batch.detect.lines
Detects baselines and line boundaries for each text line.

order

origami.batch.detect.order
Finds a reading order using a variant of the XY Cut algorithm.
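
For readers unfamiliar with XY Cut, here is a simplified sketch of the basic idea. It is not Origami's variant. Boxes are assumed to be (x0, y0, x1, y1) tuples with y growing downwards, and the rule of always cutting at the widest gap is a choice made for this example. The region is split recursively along empty horizontal or vertical strips, and the parts are read top to bottom and left to right.

```python
def _gaps(intervals):
    """Return (gap_width, cut_position) for each gap between merged intervals."""
    gaps = []
    end = None
    for lo, hi in sorted(intervals):
        if end is not None and lo > end:
            gaps.append((lo - end, (lo + end) / 2))
        end = hi if end is None else max(end, hi)
    return gaps


def xy_cut(boxes):
    """Order boxes (x0, y0, x1, y1) by recursively cutting at the widest gap."""
    if len(boxes) <= 1:
        return list(boxes)
    best_x = max(_gaps([(b[0], b[2]) for b in boxes]), default=None)
    best_y = max(_gaps([(b[1], b[3]) for b in boxes]), default=None)
    if best_x is None and best_y is None:
        # No empty strip separates the boxes: fall back to a simple sort.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    if best_y is None or (best_x is not None and best_x[0] > best_y[0]):
        axis, cut = 0, best_x[1]  # vertical cut: left part first
    else:
        axis, cut = 1, best_y[1]  # horizontal cut: top part first
    first = [b for b in boxes if b[axis] < cut]
    second = [b for b in boxes if b[axis] >= cut]
    return xy_cut(first) + xy_cut(second)
```

With a headline spanning the page above two columns, the first cut separates the headline, the second separates the columns at the gutter, and the remaining cuts separate the lines within each column.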

ocr

origami.batch.detect.ocr
Performs OCR on each detected line using the specified Calamari OCR model. For more details on OCR models, see the section on Origami Models.

compose

origami.batch.detect.compose
Composes text into one file using the detected reading order. Can also produce PageXML output.

Debugging

origami.batch.detect.stats
Prints out statistics on computed artifacts and errors. This is useful for understanding how many pages were processed, and for which stages this processing is finished.
origami.batch.annotate.contours
Produces debug images for understanding the result of the contours batch stage.
origami.batch.annotate.lines
Produces debug images for understanding the line detection stage.
origami.batch.annotate.layout
Produces debug images for understanding the results of the layout and order batch stages.

Tools for Ground Truth and Evaluation

Tools

origami.tool.annotate
Tool for annotating, viewing and searching for ground truth.
origami.tool.pick
Tool for adding or removing single lines from the ground truth for fine tuning.
origami.tool.sample
Create a new annotation database by randomly sampling lines from a corpus. The details of sampling (numbers of items for each segmentation label type per page) can be specified. Allows import of transcriptions stored in accompanying PageXML. See command line help for more details.
origami.tool.schema
Run an annotation normalization schema on the given ground truth text files.
origami.tool.export
From the given annotation database, export line images of the specified height and binarization together with accompanying ground truth text files. Annotation normalization through a schema is supported. Use this command to generate training data for Calamari. See command line help for details.
origami.tool.xycut
Debug internal X-Y cut implementation.
origami.batch.export.lines (debugging only)
Export images of lines detected during lines batch.
origami.batch.export.pagexml (debugging only)
Export polygons of lines detected during lines batch as PageXML.

How to create ground truth

For generating ground truth for training an OCR engine from a corpus, we suggest this general process:

  • Run batches up to lines on your page images.
  • Sample random lines using origami.tool.sample.
  • Fine tune your training corpus using origami.tool.pick (optional).
  • Annotate using origami.tool.annotate.
  • Export annotations using origami.tool.export.
  • Train your OCR model.

Origami Models

For line-based OCR, Origami uses Calamari internally and therefore can be used with any Calamari model.

However, Origami's way of segmenting lines is slightly different from other pipelines: lines are not binarized and they are not scaled horizontally (therefore they might be wider than what some models are trained on).

One model specifically trained for Origami is the model used to perform OCR on the Berliner Börsen-Zeitung. The model (and more context on its training) is available under https://github.com/poke1024/origami_models

Another suitable model is the GT4HistOCR model for Calamari. Note that you need to enable binarization in the OCR for the latter.

Evaluation via Dinglehopper

To evaluate performance using Dinglehopper, you probably want to use:

python -m origami.batch.utils.evaluate DATA_PATH

Alternatively, you can create the PageXML files manually:

python -m origami.batch.detect.compose DATA_PATH \
    --page-xml --only-page-xml-regions \
    --regions regions/TEXT \
    --ignore-letters "{}[]"
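
One of the metrics Dinglehopper reports is the character error rate (CER). The sketch below shows the basic computation: the Levenshtein distance between ground truth and OCR text, divided by the length of the ground truth. It is a simplified illustration that compares plain Python characters without any Unicode normalization, so its values may differ from Dinglehopper's; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(ground_truth, ocr_text):
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        return 0.0 if not ocr_text else float("inf")
    return edit_distance(ground_truth, ocr_text) / len(ground_truth)
```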
