This repository has moved to a new location: https://github.com/mlperf-deepts/nns
The `nns` (Neural Network Summary) package implements a simple class, `ModelSummary`, that estimates the compute and memory requirements of neural networks:
- Compute requirements are estimated in terms of FLOPs, where a FLOP is a floating-point multiply-add operation. The implemented algorithm follows the standard approach of counting only the operators that dominate computation: matrix-multiply and convolutional operators. The backward pass is assumed to be twice as compute-intensive as the forward pass:
  `FLOPs(backward) = 2 * FLOPs(forward)`
  `FLOPs(training) = FLOPs(forward) + FLOPs(backward) = 3 * FLOPs(forward)`
- Memory requirements are estimated based on the memory required to store activations. The current implementation is quite naive and may only be used to compare different models:
  `MEMORY(backward) = MEMORY(forward)`
  `MEMORY(training) = MEMORY(forward) + MEMORY(backward) = 2 * MEMORY(forward)`
While these estimates are not accurate enough to compute, for instance, batch times, they can nevertheless be used to compare different models with each other.
Assumption: computations in deep NNs are dominated by multiply-adds in dense and convolutional layers. Other operators, such as non-linearities, dropout and normalization, are ignored. The description on this page may be outdated, so study the source files, in particular `nns.py`.
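The training-time scaling rules above can be sketched in a few lines of Python. The function names here are illustrative, not part of the `nns` API:

```python
# Sketch of the scaling rules used by nns: backward pass costs 2x the
# forward pass in FLOPs, and as much activation memory as the forward pass.
# (Helper names are my own, not the package API.)

def training_flops(forward_flops: int) -> int:
    """FLOPs(training) = FLOPs(forward) + 2 * FLOPs(forward)."""
    return 3 * forward_flops

def training_memory(forward_bytes: int) -> int:
    """MEMORY(training) = MEMORY(forward) + MEMORY(backward)."""
    return 2 * forward_bytes

print(training_flops(1_000_000))  # -> 3000000
print(training_memory(64))        # -> 128
```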
The `ModelSummary` class computes FLOPs by iterating over the layers of a neural network. The model itself is defined as a Keras model. Only supported layers are taken into account, so make sure your model does not contain unsupported compute-intensive layers. The class reports an approximate FLOP count for one input instance (batch size is 1). The following layers are supported (bias is not taken into account):
These operators do not contribute to FLOPs. Memory is estimated based on the shape of the output tensors. List of operators: `Dropout`, `GlobalMaxPooling1D`, `GlobalMaxPooling2D`, `GlobalMaxPooling3D`, `GlobalAveragePooling1D`, `GlobalAveragePooling2D`, `GlobalAveragePooling3D`, `AveragePooling1D`, `AveragePooling2D`, `AveragePooling3D`, `MaxPooling1D`, `MaxPooling2D`, `MaxPooling3D`, `Flatten`, `Reshape`, `RepeatVector`, `Lambda`, `Permute`, `Activation`, `LeakyReLU`, `PReLU`, `ELU`, `ThresholdedReLU`, `Softmax`, `ReLU`, `Add`, `Subtract`, `Multiply`, `Average`, `Maximum`, `Minimum`, `Concatenate`, `UpSampling1D`, `UpSampling2D`, `UpSampling3D`, `ZeroPadding1D`, `ZeroPadding2D`, `ZeroPadding3D`, `BatchNormalization`.
For a `Dense` layer, `X` is the rank-2 input tensor, `Y` is the rank-2 output tensor and `W` is the rank-2 weight tensor.
- Matrix-matrix multiply: Y = X * W, FLOPs = W.nrows * W.ncols
- Matrix-matrix multiply: dX = dY * W.T, FLOPs = W.nrows * W.ncols
- Matrix-matrix multiply: dW = X.T * dY, FLOPs = W.nrows * W.ncols
Both forward and backward passes depend linearly on the batch size. Also (see the expressions for the backward pass above): `FLOPs(backward) = 2 * FLOPs(forward)`.
Memory requirements are computed based on the size of the output tensor plus the weight tensor. NNS uses a simple strategy to count the number of activation tensors. The Keras API for `Dense` layers supports an `activation` parameter. If the activation is not linear, NNS does not assume a fused implementation, so the number of activation tensors doubles, and so do the memory requirements: `MEMORY(Dense) = MEMORY(Activations_Dense) + MEMORY(Activations_Activation) = 2 * MEMORY(Activations_Dense)`. This is also the case for other layers that accept an `activation` parameter, such as `Conv1D` and `Conv2D`.
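The `Dense` rules above can be sketched as follows. The helper names are illustrative, not the `nns` API; FLOPs are counted as multiply-adds with batch size 1 and bias ignored:

```python
# Sketch of FLOPs/memory estimation for a Dense layer, following the
# formulas above. One multiply-add per weight element of W.
# (Helper names are illustrative, not the nns API.)

def dense_forward_flops(in_features: int, out_features: int) -> int:
    # Y = X * W with W of shape [in_features, out_features].
    return in_features * out_features

def dense_activation_memory(out_features: int, fused: bool = False,
                            bytes_per_elem: int = 4) -> int:
    # Output activations; doubled when a non-linear activation is present
    # and no fused implementation is assumed.
    mem = out_features * bytes_per_elem
    return mem if fused else 2 * mem

fwd = dense_forward_flops(1024, 4096)
print(fwd)        # -> 4194304
print(2 * fwd)    # backward = 2 * forward
```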
We can think of the `Conv2D` operation as a bunch of dot products between filters and elements of the input feature map.
`Filter` is a rank-3 tensor of filter weights with shape `[Depth, Height, Width]`, where `Depth` equals the depth of the input feature map and `[Height, Width]` is the filter's receptive field. `Output` is a rank-2 tensor that is the result of convolving the input feature map with one filter; its spatial shape is `[Height, Width]`. `NumFilters` is the number of filters. In the more common problem definition, the output feature map is a rank-3 tensor of shape `[NumFilters, Height, Width]`.
Given the above definitions, we have:
FLOPs = NumFilters * (Output.H * Output.W) * (Filter.D * Filter.H * Filter.W)
which is equal to the product of the number of elements in the output feature map and the number of FLOPs per element. To simplify computations (primarily for the backward pass), NNS assumes an `im2col` implementation [1, 2]. Below (forward/backward passes) we use the notation from [2]:
- `K` is the number of output feature maps (number of filters).
- `N` is the batch size. Total FLOPs are proportional to the batch size, so we can safely assume it equals 1.
- `P` is the height of an output feature map.
- `Q` is the width of an output feature map.
- `C` is the number of input feature maps (the depth of the rank-3 input tensor, same as `Filter.Depth`).
- `R` is the filter height (`Filter.Height`).
- `S` is the filter width (`Filter.Width`).
- `Y` is the output feature map (`Output`).
- `F` is the filter tensor (`Filter`).
- `D` is the input data (`X`).
- Matrix-matrix multiply: Y[K, NPQ] = F[K, CRS] * D[CRS, NPQ], FLOPs = KCRSNPQ. If we remove N from this expression (N is the batch size), it equals exactly the one presented above (the one based on counting dot products).
- Matrix-matrix multiply: dX = F.T[CRS, K] * dY[K, NPQ], FLOPs = CRSKNPQ
- Matrix-matrix multiply: dW = dY[K, NPQ] * D.T[NPQ, CRS], FLOPs = KNPQCRS = CRSKNPQ
Both forward and backward passes depend linearly on the batch size. Also, `FLOPs(backward) = 2 * FLOPs(forward)`.
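The `im2col`-style counts above can be sketched in Python using the same notation (`K` filters, `C` input channels, `R x S` filter, `P x Q` output). Batch size N = 1, bias ignored; function names are illustrative, not the `nns` API:

```python
# Sketch of the im2col-style FLOPs count for Conv2D, following the
# forward/backward expressions above. FLOPs are multiply-adds.
# (Helper names are illustrative, not the nns API.)

def conv2d_forward_flops(K: int, C: int, R: int, S: int,
                         P: int, Q: int) -> int:
    # Y[K, PQ] = F[K, CRS] * D[CRS, PQ] -> K*C*R*S*P*Q multiply-adds.
    return K * C * R * S * P * Q

def conv2d_backward_flops(K: int, C: int, R: int, S: int,
                          P: int, Q: int) -> int:
    # dX and dW each cost as much as the forward pass.
    return 2 * conv2d_forward_flops(K, C, R, S, P, Q)

# Example: 64 filters of shape 32x3x3 producing a 56x56 output map.
fwd = conv2d_forward_flops(K=64, C=32, R=3, S=3, P=56, Q=56)
print(fwd, conv2d_backward_flops(64, 32, 3, 3, 56, 56))
```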
Memory requirements are computed based on the size of the output tensor plus the weight tensor. If one of these layers is coupled with a non-linear layer, memory doubles, the same as for the `Dense` layer.
Bidirectional models double the number of FLOPs. The following cells are supported: `SimpleRNNCell`, `LSTMCell` and `GRUCell`. RNNs use matrix multiplies, so forward/backward FLOPs are similar to those of the `Dense` layer. Also: `FLOPs(LSTM) ~ 4 * FLOPs(RNN)` and `FLOPs(GRU) ~ 3 * FLOPs(RNN)`.
- RNN: two matrix multiplications per time step: `hidden[t] = x[t]*Wxh + hidden[t-1]*Whh`.
- LSTM: hidden and cell sizes are equal. In total, 4 matrix multiplications with the input X and 4 matrix multiplications with the hidden state H, plus a bunch of element-wise multiplications, sums and activations that we do not take into account.
- GRU: the update/reset/hidden gates each have one matrix multiply with X and one with H, so in total 3 matrix multiplies with X and 3 with H.
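The per-cell matrix-multiply counts above lead directly to the `4x` and `3x` ratios. A small sketch, with illustrative helper names that are not the `nns` API:

```python
# Per-time-step FLOPs (multiply-adds) for recurrent cells, based on the
# matrix-multiply counts above; element-wise ops and activations ignored.
# (Helper names are illustrative, not the nns API.)

def rnn_step_flops(input_size: int, hidden_size: int) -> int:
    # hidden[t] = x[t]*Wxh + hidden[t-1]*Whh: two matrix multiplies.
    return input_size * hidden_size + hidden_size * hidden_size

def lstm_step_flops(input_size: int, hidden_size: int) -> int:
    # 4 gates, each with one multiply against x and one against h.
    return 4 * rnn_step_flops(input_size, hidden_size)

def gru_step_flops(input_size: int, hidden_size: int) -> int:
    # update/reset/hidden: 3 multiplies with x and 3 with h.
    return 3 * rnn_step_flops(input_size, hidden_size)

# Bidirectional wrappers would double each of these totals.
print(rnn_step_flops(128, 256), lstm_step_flops(128, 256))
```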
A time-distributed layer applies a base layer to every temporal slice of an input. The base layer can be any layer described above. The number of FLOPs of the base layer is multiplied by the sequence length. Memory requirements are computed as the memory of the base layer times the number of time steps.
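The time-distributed rule is a simple multiplication; a minimal sketch, with illustrative names that are not the `nns` API:

```python
# Sketch of the TimeDistributed rule: base-layer cost times sequence length.
# (Helper names are illustrative, not the nns API.)

def time_distributed_flops(base_layer_flops: int, time_steps: int) -> int:
    # The base layer runs once per temporal slice.
    return base_layer_flops * time_steps

def time_distributed_memory(base_layer_memory: int, time_steps: int) -> int:
    # One set of base-layer activations per time step.
    return base_layer_memory * time_steps

# e.g. TimeDistributed over a 1024 -> 4096 Dense layer, 50 time steps:
print(time_distributed_flops(1024 * 4096, 50))  # -> 209715200
```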
Examples are in the `notebooks` folder. First, set up a Python virtual environment:
```shell
virtualenv -p python3 ./.tf2
source ./.tf2/bin/activate
# This will install TensorFlow, pandas and their dependencies. Jupyter needs to be installed manually.
pip install -r ./requirements.txt
jupyter notebook
```