Auto Annotation of Pathology Images

Columbia Data Science Institute Capstone Project, Fall 2020

Mentor: Dr. Adler Perotte

Instructor: Dr. Adam S. Kelleher

Team members:

Yihao Li, Chao Huang, Yufeng Ma, Xiaoyun Zhu, Shuo Yang

This project aims to create a machine learning-driven user interface for annotating very large pathology images. Each image may measure tens of thousands of pixels on each side. As a result, annotating an entire slide for object recognition or semantic/instance segmentation can be time-consuming when the entities of interest are only a few pixels in diameter. This project aims to build a framework that makes the most of expert annotator (clinician) time by interleaving annotation (label generation) with inference, giving the annotator an intuitive sense of model fit and of the minimal amount of labeling required for acceptable model performance.

Project Final Report

The final report for this project is available here: Final Report

Video Demonstration

A video presentation with slides is available on YouTube at https://youtu.be/XTHRxxOoG-k.

Installation

Required packages are listed in the requirements file. We recommend installing them with pip inside a virtual environment.
Note that although detectron2 is used in this repository, it is NOT listed in the requirements, because its installation depends on the specific versions of PyTorch and CUDA on your machine. Build it from source by following the official guide instead.
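Because detectron2 is installed separately, it can help to confirm that it is importable before running anything else. The snippet below is a minimal sketch rather than part of the repository: the helper name `check_environment` is hypothetical, and it only reports whether the packages can be found and which CUDA version the installed PyTorch build targets.

```python
import importlib.util


def check_environment(packages=("torch", "detectron2")):
    """Return a dict mapping each package name to True if it is importable.

    Hypothetical helper for illustration; not part of this repository.
    """
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    status = check_environment()
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'missing'}")
    if status["torch"]:
        import torch

        # torch.version.cuda is None for CPU-only builds of PyTorch.
        print("PyTorch", torch.__version__, "| CUDA build:", torch.version.cuda)
```

If detectron2 is reported as missing, compare the printed PyTorch and CUDA versions with the ones the official detectron2 guide expects before building.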

Repository Structure

Collage Generator: the module for generating synthetic whole slide images (a.k.a. collages) from vignettes. The generation algorithm is fairly involved and is fully described in the sub-directory called illustration.
Vignettes Data: contains vignettes used for generating synthetic whole slide images.
COCO-Format Converter: the module for generating instance segmentation datasets from collages in a COCO-compatible format.
Core ML Components: the module storing essential functions and tools for training and serving UNet models for segmentation.
- preprocessing: contains functions for the preprocessing pipeline, namely cropping images into patches, saving patches as HDF5 files, and loading data as PyTorch Datasets with augmentations.
- modeling: contains the UNet model architecture, wrapped as a PyTorch Lightning model, along with essential functions for postprocessing.
- utils: contains essential utility functions for manipulating slides and annotations.
- api: high-level APIs exposed for the model serving component.
- config: a configuration file denoting target classes and parameters for the segmentation task.
Scripts: contains useful scripts for tuning (using Optuna) and testing models. They can also serve as a reference for calling low-level functions.
Demo Notebooks: contains several useful demo notebooks showing the usage of core components.
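To make the output of the COCO-Format Converter more concrete, the sketch below builds a minimal COCO-style instance segmentation record from polygon outlines. It is an illustration under stated assumptions, not the converter's actual code: the function names, the file name `collage_0.png`, and the class name `nucleus` are all hypothetical. Only the field layout (`images`, `annotations`, `categories`, with `bbox` as `[x, y, width, height]`) follows the standard COCO format.

```python
import json


def polygon_area(xs, ys):
    """Area of a simple polygon via the shoelace formula."""
    n = len(xs)
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    return abs(twice_area) / 2.0


def make_coco_dataset(file_name, width, height, instances, class_names):
    """Build a single-image COCO-style dict.

    instances: list of (class_name, [(x, y), ...]) polygons in pixel coordinates.
    Hypothetical helper for illustration; not the repository's converter.
    """
    categories = [{"id": i + 1, "name": name} for i, name in enumerate(class_names)]
    name_to_id = {c["name"]: c["id"] for c in categories}
    annotations = []
    for ann_id, (class_name, polygon) in enumerate(instances, start=1):
        xs = [float(x) for x, _ in polygon]
        ys = [float(y) for _, y in polygon]
        x_min, y_min = min(xs), min(ys)
        annotations.append({
            "id": ann_id,
            "image_id": 1,
            "category_id": name_to_id[class_name],
            # COCO polygons are flat lists: [x1, y1, x2, y2, ...]
            "segmentation": [[c for point in zip(xs, ys) for c in point]],
            # COCO boxes are [x, y, width, height]
            "bbox": [x_min, y_min, max(xs) - x_min, max(ys) - y_min],
            "area": polygon_area(xs, ys),
            "iscrowd": 0,
        })
    return {
        "images": [{"id": 1, "file_name": file_name, "width": width, "height": height}],
        "annotations": annotations,
        "categories": categories,
    }


# Example: one rectangular instance of a hypothetical "nucleus" class.
dataset = make_coco_dataset(
    "collage_0.png", 1024, 1024,
    [("nucleus", [(10, 10), (30, 10), (30, 20), (10, 20)])],
    ["nucleus"],
)
coco_json = json.dumps(dataset, indent=2)
```

A dictionary of this shape can be written to a JSON file and registered with tools that read COCO annotations, such as detectron2.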
