Human-Machine Text Extraction for Biocollections

This repository contains the scripts used in the research "Quality-aware Human-Machine Text Extraction for Biocollections using Ensembles of OCRs".

The automated steps of the text extraction process are the following, in order:

1. Lines' Extraction
1.1. Resize images larger than 10 MB (a Google Cloud limitation). Run script resizeDir_mt.py manually.
1.2. Line segmentation with the Google Cloud Vision API. Script get_lines_google.py. This step also extracts the text of the lines.
1.3. Binarization of the lines with OCRopus. Script binarizeDir_mt.py.
1.4. Extraction of the lines' text using OCRopus. Script recognizeDir_mt.py.
1.5. Extraction of the lines' text using Tesseract. Script tessDir_mt.py.
2. Ensemble of OCRs
2.1. Accept lines through majority voting. Script getLinesAccepted.py.
2.2. Separate the lines on which all 3 OCR engines match. Script getLinesAccepted_Match3.py.
2.3. N-grams construction. Script get_n_grams.py.
2.4. Computation of the per-character descriptive statistics. Script get_stats_from_probs.py.
2.5. Augment the probabilities of the characters in the lines using the n-grams and descriptive statistics. Script augment_prob_ngrams.py.
2.6. Accept the lines in which every character has probability 1.0. Script accept_from_ngrams.py.
3. Compose the Full Transcription Text of the Images
3.1. Construction of the full text transcriptions from the lines. Script build_labels.py.
3.2. Computation of the Damerau-Levenshtein similarity to the ground truth data. Script fulltext_similarity_DL_dir.py.
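
To make steps 2.1 and 3.2 more concrete, the following is a minimal Python sketch of both ideas. It is not the code in getLinesAccepted.py or fulltext_similarity_DL_dir.py. The function names (`majority_vote`, `osa_distance`, `similarity`) are hypothetical, and several details are assumptions: two OCR outputs are treated as agreeing when they are identical after trimming whitespace, the distance is the optimal string alignment variant of Damerau-Levenshtein, and the similarity is normalized by the length of the longer string.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(candidates: Sequence[str], min_votes: int = 2) -> Optional[str]:
    """Return the text produced by at least `min_votes` engines, else None."""
    if not candidates:
        return None
    # Assumption: outputs "agree" when identical after trimming whitespace.
    counts = Counter(text.strip() for text in candidates)
    text, votes = counts.most_common(1)[0]
    return text if votes >= min_votes else None


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest
```

For example, `majority_vote(["Coll. J. Smith", "Coll. J. Smith ", "Col1. J. Smith"])` accepts the line because two of the three outputs agree, while three different outputs yield `None` and the line would be left for later steps.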

For a more detailed description of the text extraction process, see the following Jupyter notebooks:
1. Lines' Extraction: [L_aocr_entomology.ipynb](https://github.com/acislab/HuMaIN_Text_Extraction/blob/master/notebooks/L_aocr_entomology.ipynb).
2. Ensemble of OCRs: [E_aocr_entomology.ipynb](https://github.com/acislab/HuMaIN_Text_Extraction/blob/master/notebooks/E_aocr_entomology.ipynb).

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
