coco_mcu

Description

Submission to the Visual Wake Words Challenge @ CVPR'19. All the code developed by the contributors (see the Contributions section) is released under the MIT license.

Contributions

  • Equally contributed: Javier Fernández-Marqués and Fernando García-Redondo
  • Support: Shidhartha Das and Paul Whatmough
  • Advice: Ramón Matas

Contact

fernando.garciaredondo@arm.com and javier.fernandez-marques@arm.com

Performance Metrics Summary

Following the instructions of the Visual Wake Words Challenge, we generated the TFRecords using the script build_visualwakewords_data.py from the slim/datasets library.

  • Best validation accuracy: 88.85%
  • Validation accuracy on minival_dataset: 87.96%
  • Model size: 247.5 KB
  • Peak memory usage: 249 KB
  • MACs for inference: 56.8 million ops

In the following sections we describe our model architecture and provide further information on how the metrics above were obtained. Our model is still training; we will report the final validation accuracy once the 140 epochs are completed.

Network Architecture

Related work

The skeleton of the developed network is based on Google's MobileNetV3, using the bottleneck layer implementation from Bisonai as the basic CNN building block.

Proposed network

The network architecture is described in the file mb_att.py, and depicted in the following picture:

[Figure: overall network architecture]

The attention and bneck layers are defined as follows:

[Figure: attention and bneck layer definitions]
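
For orientation only, the sketch below shows a generic MobileNetV3-style bottleneck (point-wise expansion, depth-wise convolution, point-wise projection, optional residual) in tf.keras. It is not taken from the repository; the actual bneck and attention layers (based on the Bisonai implementation) are defined in mb_att.py.

```python
import tensorflow as tf

def bneck(x, expand, out_channels, kernel=3, stride=1):
    # Generic MobileNetV3-style bottleneck, shown for orientation only.
    # (MobileNetV3 also uses h-swish and squeeze-and-excite, omitted here.)
    inp = x
    x = tf.keras.layers.Conv2D(expand, 1, use_bias=False)(x)        # point-wise expansion
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation("relu")(x)
    x = tf.keras.layers.DepthwiseConv2D(kernel, strides=stride,
                                        padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation("relu")(x)
    x = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(x)  # point-wise projection
    x = tf.keras.layers.BatchNormalization()(x)
    if stride == 1 and inp.shape[-1] == out_channels:
        x = tf.keras.layers.Add()([inp, x])                         # residual connection
    return x
```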

Key points in NN interpretation for MCU deployment

Total operations

The TF Profiler, invoked at the beginning of training, reports a total of 1,983,338 FLOPs. See the get_flops() method in keras_q_model.py.
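
For reference, a minimal sketch of how the total float operations can be obtained with the TF 1.x profiler (the repository's version lives in keras_q_model.py and may differ in details):

```python
import tensorflow as tf

def get_flops():
    # Profile the current default graph and return the total float operations.
    run_meta = tf.RunMetadata()
    opts = tf.profiler.ProfileOptionBuilder.float_operation()
    flops = tf.profiler.profile(tf.get_default_graph(),
                                run_meta=run_meta, cmd='op', options=opts)
    return flops.total_float_ops  # e.g. 1,983,338 for this model
```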

Quantization

  • The network is quantized using TF (v1.13) quantization functions.
  • Activations are quantized to 8 bits.
  • Weights are quantized to 2 bits.
  • The quantization delay is non-zero: quantization is applied after 10K batches (approximately at epoch 10 for a batch size of 64). For more information see mb_att.py.
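
A minimal sketch of the kind of TF 1.13 graph rewrite this corresponds to (the exact call and arguments used by the repository are in mb_att.py; the values below simply reflect the settings described above):

```python
import tensorflow as tf

# Rewrite the training graph with fake-quantization nodes:
# 8-bit activations, 2-bit weights, enabled after 10k steps.
tf.contrib.quantize.experimental_create_training_graph(
    input_graph=tf.get_default_graph(),
    weight_bits=2,
    activation_bits=8,
    quant_delay=10000)
```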

ROM Size

  • The network is composed of a total of 1,013,554 parameters.
  • Each parameter uses 2 bits in ROM, giving a total of 247.5 KB.
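
As a check: 1,013,554 parameters x 2 bits = 2,027,108 bits ≈ 253,389 bytes ≈ 247.5 KB (with 1 KB = 1024 bytes).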

RAM Usage

During the development of the proposed model we kept in mind how neural network libraries for MCUs (see Arm's CMSIS-NN) optimize not only the computation but also the memory footprint, allowing, where possible, the reuse of temporary buffers or in-place computation. Following these principles,

  • Point-wise additions and multiplications are performed in place. Should a custom operation f(t) be applied to a tensor t, a single uint8 can temporarily hold the value aux = f(t_i), which then replaces the memory location previously occupied by t_i.
  • If h(t) = g(f(t)) and t is not required for later operations, then once f(t) has been computed the memory previously occupied by t can be freed and reassigned.
  • If h(t) = y(g(t), f(t)) is a graph section with two branches that depend on t, we first compute one branch, e.g. g(t), and then f(t); at that point t no longer needs to be held in memory and its space can be reallocated to compute y().

Taking the previous considerations into account, and given the architecture defined above (described in detail in mb_att.py), the RAM memory peak occurs at the following nodes:

  • Input conv: 256x256x3 + 86x86x8 = 249.8KB

  • Bneck0: 86x86x8 [shared input] + 86x86x16 = 173.34KB. After computation buffer of 86x86x16

  • Att0: Max peak with 86x86x16 [bneck0 out] + 86x86x8 [shared input that can be discarded after first max_pool] + 43x43x8 = 218KB. After first pooling the peak is: 86x86x16 + 43x43x8 + 22x22x(16+16+32) = 164.1KB. After computation buffer of 22x22x16

  • Pointwise Mult Att0*Bneck0 = [attention is upscaled] 2x86x86x16 = 231.12KB

  • Att1: Max peak with 86x86x8 [shared input that can be discarded after first max_pool] + 43x43x8 = 218KB. After first pooling the peak is: 86x86x16 + 43x43x16 + 22x22x(16+16+32) = 178.9KB. After computation buffer of 22x22x16

  • Bneck1: 86x86x16 [shared input] + 86x86x16 + 22x22x16 [att1 output] = 238.7KB. After computation buffer of 86x86x16

  • Mult Att1*Bneck1 = [attention is upscaled] 2x86x86x16 = 231.12KB

  • Residual is stored: 22x22x16 = 7.5KB

  • Bneck2: Peak with 86x86x16 + 43x43x32 = 180.9KB, output 43x43x24

  • Bneck3: Peak with 43x43x24 + 22x22x64 + 22x22x32 = 89KB, output 22x22x32

  • Bneck4: Peak with 22x22x32 + 11x11x(256 + 128) = 60.5KB, output 11x11x128

  • Bneck5: Peak with 11x11x128 + 6x6x128 [res] + 6x6x256 = 60.5KB, output 6x6x128

  • Bneck6: Peak with 6x6x128 + 6x6x128 [res] + 6x6x256 = 18KB, output 6x6x128

  • LastStage Layer: Peak with 6x6x(128 + 512 + 1200) = 64.7KB

Peak use: 249KB
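
As a sanity check, the snippet below (a sketch, not repository code) reproduces a few of the per-node totals above, assuming uint8 activations so that each feature-map element takes one byte:

```python
# Live-buffer sizes (bytes) for a few of the nodes listed above,
# assuming uint8 activations (1 byte per element).
nodes = {
    "input_conv":       256 * 256 * 3 + 86 * 86 * 8,                   # ~249.8 KB
    "bneck0":           86 * 86 * 8 + 86 * 86 * 16,                    # ~173.3 KB
    "mult_att0_bneck0": 2 * 86 * 86 * 16,                              # ~231.1 KB
    "bneck1":           86 * 86 * 16 + 86 * 86 * 16 + 22 * 22 * 16,    # ~238.7 KB
}
for name, size in nodes.items():
    print("%-18s %.1f KB" % (name, size / 1024.0))
print("peak: %.1f KB" % (max(nodes.values()) / 1024.0))  # input conv dominates, ~249 KB
```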

Key points in NN design and training

General description

The proposed NN customizes the recently presented MobileNetV3, introducing:

  • 2-bit weights, which allow 4x as many weights in the final layers.
  • Attention mechanisms in the first layers so that, given the variety of input images, feature extraction can focus on the regions of interest.
  • Instead of the traditional CNN approach, and given the varied image dataset (sizes, etc.), we try to preserve as much resolution as possible as we go deeper into the network.

Data augmentation

We applied data augmentation during training to avoid overfitting. The file input_data.py describes the mechanisms, which include random cropping, hue and brightness variation, rotations, etc.
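
A minimal sketch of what such a TF 1.x augmentation pipeline can look like (illustrative only; the crop size and deltas below are assumptions, and the actual operations are defined in input_data.py):

```python
import tensorflow as tf

def augment(image):
    # Illustrative augmentation; parameter values are assumptions, not the repo's.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.15)   # brightness variation
    image = tf.image.random_hue(image, max_delta=0.05)          # hue variation
    angle = tf.random_uniform([], minval=-0.1, maxval=0.1)      # small rotation, radians
    image = tf.contrib.image.rotate(image, angle)
    image = tf.random_crop(image, size=[256, 256, 3])           # crop to the model input size
    return image
```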

Regularization and other anti-overfitting mechanisms

We used L2 regularization together with dropout layers. Additionally, 2-bit weight quantization helps reduce overfitting.
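
For illustration, a hypothetical tf.keras block combining L2 weight decay with dropout (the coefficients below are assumed; the actual values are set in mb_att.py):

```python
import tensorflow as tf

def regularized_conv(x, filters, weight_decay=1e-4, drop_rate=0.2):
    # Hypothetical helper: convolution with L2 weight decay followed by dropout.
    x = tf.keras.layers.Conv2D(
        filters, 3, padding="same",
        kernel_regularizer=tf.keras.regularizers.l2(weight_decay))(x)
    x = tf.keras.layers.Dropout(drop_rate)(x)
    return x
```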

Learning rate scheduling

A custom learning rate scheduler has been used (see mb_att.py) so that oscillations caused by both learning and quantization are kept under control.
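
A minimal sketch of such a scheduler as a tf.keras callback (the breakpoints and rates below are assumptions; the actual schedule is defined in mb_att.py). Dropping the rate around the point where quantization kicks in (~epoch 10) helps damp quantization-induced oscillations:

```python
import tensorflow as tf

def lr_schedule(epoch):
    # Hypothetical step schedule; values are assumptions, not the repo's.
    base_lr = 1e-3
    if epoch < 10:           # before fake-quant nodes become active
        return base_lr
    if epoch < 80:           # after quantization is enabled
        return base_lr * 0.1
    return base_lr * 0.01    # final fine-tuning

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# model.fit(..., callbacks=[lr_callback])
```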
