Bengali.AI Handwritten Grapheme Classification

Kaggle Competition Entry

Bengali.AI Handwritten Grapheme Classification

Classify the components of handwritten Bengali

https://www.kaggle.com/c/bengaliai-cv19

Scores

Position	Private	Public	Prize	Notes
1st Place	0.9762	0.9952	$5000
2nd Place	0.9689	0.9955	$2000
3rd Place	0.9645	0.9945	$1000
Top 14	0.9491	0.9913	Gold Medal
Top 5% (102)	0.9348	0.9858	Silver Medal
Top 10% (205)	0.9306	0.9791	Bronze Medal
Final Writeup	0.9038	0.9381	1410/2059	Late Submission
ImageDataGenerator - CNN	0.9010	0.9413		44 CNN + 1256 Dense
ImageDataGenerator - CNN	0.8961	0.9482		35 CNN + 1256 Dense + Y+=grapheme
ImageDataGenerator - CNN	0.8921	0.9396		34 CNN + 2256 Dense + Y+=grapheme
Multi Output DF CNN	0.8901	0.9402		35 CNN + 1256 Dense + Y+=grapheme - no augmentation
Multi Output DF CNN	0.8828	0.9337		34 CNN + 2256 Dense + Y+=grapheme - no augmentation
ImageDataGenerator - CNN	0.8797	0.9198		44 CNN + 1256 Dense + Y+=grapheme + Regularization
sample_submission.csv	0.0614	0.0614	Random Score

Dataset

The parquet training dataset consists of 200,840 grayscale images (in 4 files) at 137x236 resolution with a total filesize of 4.8GB of data.

./requirements.sh                                             # create ./venv/

kaggle competitions download -c bengaliai-cv19 -p ./input/
unzip ./input/bengaliai-cv19.zip -d ./input/bengaliai-cv19/

time ./src/preprocessing/write_images_to_filesystem.py        # optional

Bengali is the 5th most spoken language in the world with hundreds of million of speakers. It’s the official language of Bangladesh and the second most spoken language in India. Considering its reach, there’s significant business and educational interest in developing AI that can optically recognize images of the language handwritten . This challenge hopes to improve on approaches to Bengali recognition.

Optical character recognition is particularly challenging for Bengali. While Bengali has 49 letters (to be more specific 11 vowels and 38 consonants) in its alphabet, there are also 18 potential diacritics, or accents. This means that there are many more graphemes, or the smallest units in a written language. The added complexity results in ~13,000 different grapheme variations (compared to English’s 250 graphemic units).

For this competition, you’re given the image of a handwritten Bengali grapheme and are challenged to separately classify three constituent elements in the image: grapheme root, vowel diacritics, and consonant diacritics.

This dataset contains images of individual hand-written Bengali characters. Bengali characters (graphemes) are written by combining three components: a grapheme_root, vowel_diacritic, and consonant_diacritic. Your challenge is to classify the components of the grapheme in each image. There are roughly 10,000 possible graphemes, of which roughly 1,000 are represented in the training set. The test set includes some graphemes that do not exist in train but has no new grapheme components. It takes a lot of volunteers filling out sheets like this to generate a useful amount of real data; focusing the problem on the grapheme components rather than on recognizing whole graphemes should make it possible to assemble a Bengali OCR system without handwriting samples for all 10,000 graphemes.

Notebooks

Previous Competitions

Kaggle Competition Entry - MNIST Digit Recognizer

Technical Research

Image Preprocessing

Image Processing [Kaggle Version]
Dataset as Image Directory [Kaggle Version]

Exploratory Data Analysis

Writeup and Submission

Bengali AI - CNN Data Pipeline + Problem Solving

The Exploratory Data Analysis showed that only certain combinations of vowel/consonant diacritics where regularly combined with certain grapheme roots.

Vowel / Consonant Combinations:

Vowel #0 and Consonant #0 combine with everything
Vowels #3, #5, #6, #8 have limited combinations with Consonants
Consonant #3 is never combined except with Vowel #0
Consonant #6 only combineds with Vowels #0 and #1

Grapheme Root Combinations:

Vowel #0 and Consonant #0 combine with (nearly) everything
ALL Roots combine with some Consonant #0
Several Roots do NOT combine with Vowel #0 = [26, 28, 33, 34, 73, 82, 108, 114, 126, 152, 157, 158, 163]
Several Roots do combine ALL Vowels = [13, 23, 64, 72, 79, 81, 96, 107, 113, 115, 133, 147]}
Only Root #107 combines with ALL Consonants

It was further discovered that Unicode itself is encoded as a multibyte string, using a lower level of base_graphemes than root/vowel/consonant diacritics. Some Benglai Graphemes have multiple renderings for the same root/vowel/consonant combination, which is implemented in unicode by allowing duplicate base_graphemes within the encoding.

A combination_matrix() function was written that permitted a Unicode Visualization of the Bengali Language, tabulating the different combinations of grapheme_roots, vowels and consonants.

src/jupyter/combination_matrix.py

Technical Challenges

Kaggle Kernels Only Competition

Solution: kaggle_compile.py

This codebase is approximately 2k CLOC, and builds upon my previous efforts for the MINST Digit Recognizer competition.

Developing a professionally engineered data pipeline using only a Jupyter Notebook would be impractical, and my preferred development workflow is to use the IntelliJ IDE and commit my code to github.

My first attempt involved manually copy and pasting required functions into a Kaggle Script Notebook, but then I decided that this process could be automated with a script.

./kaggle_compile.py is a simple python script compiler. It reads in a python executable script, parses the import headers for any local include files, then recursively builds a dependency tree and concatenates these into single python file (much like an old-school javascript compiler). It can be called with either --save or --commit cli flags to automattically save to disk or commit the result to git.

Limitations: the script only works for from local.module import function syntax, and does not support import local .module syntax, as calling module.function() inside a script would not work with script concatenation. Also the entire python file for the dependency is included, which does not guarantee the absence of namespace conflicts, but with awareness of this issue good coding practices it is sufficiently practical for generating Kaggle submissions.

There are other more robust solutions to this problem such as stickytape, which allow for module imports, however the code for dependency files is obfuscated into a single line string variable, which makes for an unreadable Kaggle submission. ./kaggle_compile.py produces readable and easily editable output.

Settings File

According to the Kaggle Winning Model Documentation Guidelines

https://www.kaggle.com/WinningModelDocumentationGuidelines

settings.json This file specifies the path to the train, test, model, and output directories. Here is an example file.

This is the only place that specifies the path to these directories. Any code that is doing I/O should use the appropriate base paths from SETTINGS.json

This was implemented as a python script src/settings.py which defines default settings and hyperparameters.

Different Hyperparameter Configurations for Localhost, Kaggle Interactive and Kaggle Batch Mode

I discovered that it is possible to detect which environment the python code is running in using

os.environ.get('KAGGLE_KERNEL_RUN_TYPE', 'Localhost')  # == 'Localhost' | 'Interactive' | 'Batch'

This was explored in the following notebook:

Jupyter Environment Variable os.environ

The main uses for this are in the settings.py file to specify different environment filesystem paths (eg ./input/bengaliai-cv19/ vs ../input/bengaliai-cv19/) as well as different default timeout and verbosity settings.

A second use is tweak hyperparameter settings in Interactive Mode (ie editing a script in Kaggle), such as setting epochs=0 or epochs=1 to allow quick sanity checking of the final script and that submission .csv is generated without throwing an exception. Using this flag removes the danger forgetting to revert these changes and Batch Mode commit accidentally running using debug configuration.

Modifing Hyperparameters using CLI

Solution: src/util/argparse.py

Most of the hyperparameters in this codebase have been abstracted into dictionary arguments. As an alternative to editing the code, or using KAGGLE_KERNEL_RUN_TYPE it would be nice to be able to directly modify them in Localhost mode on the CLI.

src/util/argparse.py provides two useful functions argparse_from_dict({}) and argparse_from_dicts([{},{}]) that will autogenerate argparse CLI config from a dictionary or list of dictionaries , with the option to edit them inplace. The result is that useful CLI flags can be implemented with a single line of code.

This is useful for passing --timeout 1m or --epochs 0 when debugging locally. It can also be used as a basic method for hyperparameter search.

Out of Memory Errors

Early attempts at running this codebase as a Kaggle Script often terminated in Exit code 137 (Out of Memory), and that the 16GB of RAM available is insufficient to both load the entire training dataset into memory as well as a tensorflow model.

This prompted a full investigation into memory profiling and the different methods to read parquet files:

Reading Parquet Files RAM CPU Optimization

The dataformat of the parquet files means they cannot be read row-by-row, but only in their entirety.

The optimal method chosen was to using pd.read_parquet() but to place this inside a generator function. Each parquet would be read 3 times, but the generator would only store 1/3 of the raw parquet data in memory at any one time. This reduced peak memory usage from 1.7GB to 555MB whilst increasing disk-IO timings from 46s to 116s (+74s) per epoch.

This was considered an acceptable (if overly conservative) RAM/Disk-IO tradeoff, but solved the Out of Memory errors.

Generator Source Code:

src/dataset/ParquetImageDataGenerator.py

Additional memory optimizations involved using float16 rather than float64 during image normalization, and rescaling the image size by 50% before feeding it into the neural network model.

Notebook Exceeded Allowed Compute

This Kaggle competition has a 120m maximum runtime constraint.

The simplest solution to this is to write a custom tensorflow callback that sets a timer and exits training before the deadline. The source code for KaggleTimeoutCallback() is in:

src/callbacks/KaggleTimeoutCallback.py

Pretrained Model Weights

Another way around the timeout issue is to use pretrained model weights, run on Localhost without a timer, upload these files to Kaggle as a private dataset, and then Commit the kaggle script using epochs=0.

This requires using the ModelCheckpoint() callback to save the weights to a file, glob2.glob(f"../input /**/{os.path.basename(model_file)}") to find the relevant file within the attached Kaggle dataset, and model.load_weights() to load them again before training.

This was the method used to generate a score of 0.9396, which took 3h to train to convergence on a Localhost laptop.

Submission - Notebook Threw Exception

The public test dataset comprises only 12 test images, which is deceptively small. The result is that the Kaggle Kernel can compile without exception, yet submitting the final submission.csv will generate a Notebook Threw Exception error.

The root cause is that the hidden test dataset is significantly larger, and presumably comparable in size to the training dataset. This can cause both out-of-memory errors and timeout errors if the code for generating a submission.csv file is not performance optimized.

There is known bug in model.predict(...) that results in a memory leak if the function is called in a loop. The solution to this is to call model.predict_on_batch(...) which was combined with using a generator to read the test parquet files, and a for loop to manually batch the data before calling model.predict_on_batch(...)

keras-team/keras#13118

A second issue is that the submission.csv has a funny format, which is not a 4 column table, but rather in {image_id}_{column}, {value} format. My initial naive code implementation used a nested for loop to write the values of submission.csv cell by cell. Rewriting this using a vectorized df.apply() solved the Notebook Threw Exception error.

Optimized code for generating submission.csv:

src/util/csv.py

Model Generating Random Results

The random score of the sample_submission.csv is 0.0614, and several of my submission returned this result.

Debugging this discovered several code issues:

Normalizing the data for training, but not for testing
Incorrect ordering of keys in submission.csv - https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/69366
Training using loss_weights=sqrt(output_shape.values())
Using tf.keras.backend.set_floatx('float16') then retraining on pretrained model weights

Image Preprocessing

The following transforms where conducted as preprocessing:

Invert: set background to 0 and text to 255
Resize: reduce image size by 50% using skimage.measure.block_reduce((1,2,2,1), np.max)
Rescale: rescale brightness of maximum pixel to 255
Denoise: set background pixels less than maximum mean (42) to 0
Center: crop image background border, and recenter image
Normalize: Convert pixel value range from 0 - 255 to 0.0 - 1.0

This provided a clean and standarized format for both training and test data.

An exploration and testing of Image Preprocessing was conducted in Jupyter Notebook

Image Processing

The final code for image transforms

src/dataset/Transforms.py

Image Augmentation

ImageDataGenerator was used to provide Image Augmentation with the following settings:

datagen_args = {
    # "rescale":          1./255,  # "normalize": True is default in Transforms
    # "brightness_range": 0.5,   # Preprocessing normalized brightness
    "zoom_range":         0.2,
    "width_shift_range":  0.1,     # Preprocessing already centered image
    "height_shift_range": 0.1,     # Preprocessing already centered image
    "rotation_range":     45/2,
    "shear_range":        45/2,
    "fill_mode":         'constant',
    "cval": 0,
    # "featurewise_center": True,             # No visible effect in plt.imgshow()
    # "samplewise_center": True,              # No visible effect in plt.imgshow()
    # "featurewise_std_normalization": True,  # No visible effect in plt.imgshow() | requires .fit()
    # "samplewise_std_normalization": True,   # No visible effect in plt.imgshow() | requires .fit()
    # "zca_whitening": True,                  # Kaggle, insufficent memory
}

Due the memory constraints of not being able to load the entire training dataset into a single dataframe, and the performance issues of saving the test/train datasets to a directory tree of image files, a ImageDataGenerator subclass was written with a custom ParquetImageDataGenerator.flow_from_parquet()
method. This used the generator pattern as described above to read from the parquet files, and run preprocessing Transforms on batches of data, whilst keep the minimum data possible in RAM.

src/dataset/ParquetImageDataGenerator.py

Writing Images to Filesystem

As an experiment, a script was written to write the images in the parquet file to the filesystem after Image Preprocessing.

This took 11 minutes of runtime to complete on Kaggle, generating 200,840 image files.

In theory this would have allowed ImageDataGenerator.flow_from_dictectory() to be used, and this directory tree could have been imported as an external dataset. On localhost, having 200k files in a single directory tree produces performance issues, and this method was abandoned in favour of ParquetImageDataGenerator.flow_from_parquet()

Neural Network Models

Single Output CNN

First attempt, simplest thing that could possibly work, was to port the CNN code used for MINST, load the data for each parquet file into a pandas DataFrame and train 3 separate neural networks, one for each output variable.

The main issue with this simplistic approach is that the data must be read three times, and the entire training process takes three times as long.

Unable to obtain a Kaggle Score using this method.

Multi Output CNN

The next evolution was to write a single CNN neural network with multiple outputs, thus reducing training time by 3x.

The model was trained using all 4 of the columns provided in the train.csv

grapheme_root
vowel_diacritic
consonant_diacritic
grapheme

The CNN Architecture uses multiple layers of:

Conv2D(3x3)
MaxPooling2D()
BatchNormalization()
Flatten()
Dropout()
Dense()
Dropout()

TODO:

split the grapheme into constituent unicode bytes and retrain using softmax Multi-Hot-Encoding
Test to see if this method can generate a Kaggle score using only Image Preprocessing

ImageDataGenerator CNN

This approach used ImageDataGenerator() for Image Augmentation with a custom ParquetImageDataGenerator.flow_from_parquet() subclass function to read the parquet files in a memory optimized fashion

The following hyperparameters generated a 0.9396 Kaggle Score

model_hparams = {
    "cnns_per_maxpool":   3,
    "maxpool_layers":     4,
    "dense_layers":       2,
    "dense_units":      256,
    "regularization": False,
    "global_maxpool": False,
}
train_hparams = {
    "optimizer":     "RMSprop",
    "scheduler":     "constant",
    "learning_rate": 0.001,
    "best_only":     True,
    "batch_size":    128,     
    "patience":      10,
    "epochs":        999,
    "loss_weights":  False,
}

EarlyStopping(patience=10) was used with a 75/25 train/test split (3 parquet files for train, 1 for test), and took 3.4 hours to converge on Localhost after 77 epochs

Increasing dense_units from 256 -> 512:

converged in less epochs: 77 -> 66
slightly quicker to train: 3.4h -> 2.5h
had slightly worse accuracy: 0.969 -> 0.9346 (val_grapheme_root_accuracy)

TODO:

Run a full hyperparameter search to discover if it is possible to obtain a higher score

ImageDataGenerator Keras Application

An alternative to a custom designed CNN is to use a published CNN network architecture from tf.keras.applications

This requires removing the first and last layers of the application, then adding additional Dense layers before splitting the output.

The first application investigated was NASNetMobile, however compared to the custom CNN model:

converged in less epochs: 77 -> 53
much slower to train: 3.4h -> 4.15h
had much worse accuracy: 0.969 -> 0.873 (val_grapheme_root_accuracy)

TODO:

Test the full range of tf.keras.applications
Test if running for a fixed number of epochs produces more consistent results

Hyperparameter Search

Additional Visualizations of Hyperparameter Search

notebooks/Hyperparamer Search.ipynb

Previous research on MINST produced the following shortlist of Optimizer / Learning Rate / Scheduler combinations

"optimized_scheduler": {
    "Adagrad_triangular": { "learning_rate": 0.1,    "optimizer": "Adagrad",  "scheduler": "CyclicLR_triangular"  },
    "Adagrad_plateau":    { "learning_rate": 0.1,    "optimizer": "Adagrad",  "scheduler": "plateau2"      },
    "Adam_triangular2":   { "learning_rate": 0.01,   "optimizer": "Adam",     "scheduler": "CyclicLR_triangular2" },
    "Nadam_plateau":      { "learning_rate": 0.01,   "optimizer": "Nadam",    "scheduler": "plateau_sqrt"  },
    "Adadelta_plateau":   { "learning_rate": 1.0,    "optimizer": "Adadelta", "scheduler": "plateau10"     },
    "SGD_triangular2":    { "learning_rate": 1.0,    "optimizer": "SGD",      "scheduler": "CyclicLR_triangular2" },
    "RMSprop_constant":   { "learning_rate": 0.001,  "optimizer": "RMSprop",  "scheduler": "constant"      },
}

https://github.com/JamesMcGuigan/kaggle-digit-recognizer

Optimizers and Learning Rates

Time	Epocs	Loss	Optimizer	Scheduler	Learning Rate	Notes
7580	54	0.637	Adadelta	plateau10	1.000	Lowest loss, least epochs, middle speed
9558	68	0.759	Adam	constant	0.001	Lowest loss with constant optimizer
4514	32	0.818	RMSProp	constant	0.001	Fastest with decent loss

CNN Model Hparams

A hyperparameter grid search was run varying the number of layers within the CNN model

regularization

regularization=False almost always outperforms regularization=True
regularization=True prefers fewer dense units

global_maxpool

regularization=True prefers global_maxpool=False (but not vice veras)
global_maxpool=True prefers double the number of dense_units and +1 cnns_per_maxpool

cnns

increasing maxpool_layers prefers fewer cnns_per_maxpool (ideal total CNNs = 15 / 16)

dense units

dense_layers=1 is preferred over 2 or 3

Fastest with high score

maxpool_layers=5 | cnns_per_maxpool=3 | dense_layers=1 | dense_units=256 | global_maxpool=False | regularization=False

Shortlist:

maxpool_layers=5 | cnns_per_maxpool=3 | dense_layers=1 | dense_units=512 | global_maxpool=True | regularization=False
maxpool_layers=4 | cnns_per_maxpool=4 | dense_layers=1 | dense_units=256 | global_maxpool=False | regularization=False
maxpool_layers=4 | cnns_per_maxpool=4 | dense_layers=1 | dense_units=256 | global_maxpool=False | regularization=True

Results:

Optimizing architecture from 3\*4 CNN + 2\*256 Dense -> 3\*5 CNN + 1\*256 Dense improves kaggle score 0.8921/0.9396 -> 0.8961/0.9482
Adding regularization reduces score from 0.8961/0.9482 -> 0.8797/0.9198
Removing Y+=grapheme improves private score from 0.8961/0.9482 -> 0.9010/0.9413

TODO:

Train Y on unicode grapheme components
Hyperparameter search on X_transforms + Augmentation params
Kaggle trained models still timeout, so pretrained models might produce better results

Name		Name	Last commit message	Last commit date
Latest commit History 194 Commits
.dvc		.dvc
data_output		data_output
input		input
notebooks		notebooks
src		src
.gitignore		.gitignore
.pythonstartup.py		.pythonstartup.py
README.md		README.md
kaggle_compile.py		kaggle_compile.py
logfiles_to_csv.sh		logfiles_to_csv.sh
requirements.in		requirements.in
requirements.sh		requirements.sh
requirements.txt		requirements.txt

leeloolee/kaggle-bengali-ai

Folders and files

Latest commit

History

Repository files navigation

Kaggle Competition Entry

Bengali.AI Handwritten Grapheme Classification

Scores

Dataset

Notebooks

Previous Competitions

Technical Research

Image Preprocessing

Exploratory Data Analysis

Writeup and Submission

Technical Challenges

Kaggle Kernels Only Competition

Settings File

Different Hyperparameter Configurations for Localhost, Kaggle Interactive and Kaggle Batch Mode

Modifing Hyperparameters using CLI

Out of Memory Errors

Notebook Exceeded Allowed Compute

Pretrained Model Weights

Submission - Notebook Threw Exception

Model Generating Random Results

Image Preprocessing

Image Augmentation

Writing Images to Filesystem

Neural Network Models

Single Output CNN

Multi Output CNN

ImageDataGenerator CNN

ImageDataGenerator Keras Application

Hyperparameter Search

Optimizers and Learning Rates

CNN Model Hparams

About

Resources

Stars

Watchers

Forks

Languages