
This repository has moved to a new location: https://github.com/mlperf-deepts/nns

Neural network runtime characteristics - FLOPS and memory requirements

The nns (Neural Network Summary) package implements a simple class ModelSummary that estimates compute and memory requirements for neural networks:

  1. Compute requirements are estimated in terms of FLOPs, where a FLOP is a floating-point multiply-add operation. The implemented algorithm follows the standard approach of taking into account the operators that dominate computations: matrix multiply and convolution operators. The backward pass is assumed to be twice as compute-intensive as the forward pass:
    FLOPs(backward) = 2 * FLOPs(forward)
    FLOPs(training) = FLOPs(forward) + FLOPs(backward) = 3 * FLOPs(forward)
    
  2. Memory requirements are estimated based on the memory required to store activations. The current implementation is quite naive and should only be used to compare different models.
    MEMORY(backward) = MEMORY(forward)
    MEMORY(training) = MEMORY(forward) + MEMORY(backward) = 2 * MEMORY(forward)  
    

While these estimates are not accurate enough to compute, for instance, batch times, they can nevertheless be used to compare different models with each other.
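
A minimal arithmetic sketch of these relations (hypothetical helper names; the multipliers are the ones stated above):

# Training cost from forward-pass cost, per the relations above.
def training_flops(forward_flops: int) -> int:
    return 3 * forward_flops   # forward + backward, FLOPs(backward) = 2 * FLOPs(forward)

def training_memory(forward_memory: int) -> int:
    return 2 * forward_memory  # MEMORY(backward) = MEMORY(forward)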

Assumption: computations in deep NNs are dominated by multiply-adds in dense and convolutional layers. Other operators, such as non-linearities, dropout and normalization, are ignored. The description on this page may be outdated, so study the source files, in particular nns.py.

The ModelSummary class computes FLOPs by iterating over the layers of a neural network; the model itself is defined as a Keras model. Only supported layers are taken into account, so make sure your model does not contain unsupported compute-intensive layers. The class reports an approximate FLOP count for one input instance (batch size 1). The layers listed in the Operators section below are supported (bias is not taken into account). The traversal can be illustrated with the short sketch below.
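
A minimal, self-contained sketch of this traversal, assuming TensorFlow 2 Keras. This is not the nns API: count_dense_flops is a hypothetical helper that counts only Dense layers, following the rules from the Dense section below.

# Illustrative sketch (not the nns API): count multiply-add FLOPs contributed
# by the Dense layers of a Keras model for one input instance (batch size 1).
import tensorflow as tf

def count_dense_flops(model: tf.keras.Model) -> int:
    flops = 0
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.Dense):
            w = layer.kernel                            # shape [input_dim, units]
            flops += int(w.shape[0]) * int(w.shape[1])  # W.nrows * W.ncols
    return flops

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10),
])
print(count_dense_flops(model))  # 784*256 + 256*10 = 203264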

Operators

Generic operators

These operators do not contribute to FLOPs. Memory is estimated based on the shape of the output tensors. List of operators: Dropout, GlobalMaxPooling1D, GlobalMaxPooling2D, GlobalMaxPooling3D, GlobalAveragePooling1D, GlobalAveragePooling2D, GlobalAveragePooling3D, AveragePooling1D, AveragePooling2D, AveragePooling3D, MaxPooling1D, MaxPooling2D, MaxPooling3D, Flatten, Reshape, RepeatVector, Lambda, Permute, Activation, LeakyReLU, PReLU, ELU, ThresholdedReLU, Softmax, ReLU, Add, Subtract, Multiply, Average, Maximum, Minimum, Concatenate, UpSampling1D, UpSampling2D, UpSampling3D, ZeroPadding1D, ZeroPadding2D, ZeroPadding3D, BatchNormalization.

Dense

  • X is the rank 2 input tensor.
  • Y is the rank 2 output tensor.
  • W is the rank 2 weight tensor.

Forward pass

  1. Matrix-matrix multiply: Y = X * W, FLOPs = W.nrows * W.ncols

Backward pass

  1. Matrix-matrix multiply: dX = dY * W.T, FLOPs = W.nrows * W.ncols
  2. Matrix-matrix multiply: dW = X.T * dY, FLOPs = W.nrows * W.ncols

Both forward and backward passes depend linearly on the batch size. Also (see the expressions for the backward pass above): FLOPs(backward) = 2 * FLOPs(forward). Memory requirements are computed based on the size of the output tensor plus the weight tensor. NNS uses a simple strategy to count the number of activation tensors. The Keras API for Dense layers supports an activation parameter. If the activation is not linear, NNS does not assume a fused implementation, so the number of activation tensors is doubled and so are the memory requirements: MEMORY(Dense) = MEMORY(Activations_Dense) + MEMORY(Activations_Activation) = 2 * MEMORY(Activations_Dense). The same applies to other layers that accept an activation parameter, such as Conv1D and Conv2D.
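
A sketch of this accounting under the stated assumptions (hypothetical helpers; bias ignored, 4-byte floats assumed):

# Dense layer FLOPs and weight/activation memory as described above.
def dense_flops(input_dim: int, units: int, batch_size: int = 1):
    forward = batch_size * input_dim * units  # Y = X * W
    backward = 2 * forward                    # dX = dY * W.T and dW = X.T * dY
    return forward, backward

def dense_memory(input_dim: int, units: int, batch_size: int = 1,
                 linear_activation: bool = True, dtype_bytes: int = 4) -> int:
    weights = input_dim * units * dtype_bytes       # weight tensor
    activations = batch_size * units * dtype_bytes  # output (activation) tensor
    if not linear_activation:
        activations *= 2                            # non-fused activation doubles it
    return weights + activations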

Conv1D, Conv2D, Conv2DTranspose

We can think of the Conv2D operation as a collection of dot products between filters and patches of the input feature map.

  • Filter is a rank 3 tensor with filter weights. Its shape is [Depth, Height, Width], where Depth equals the depth of the input feature map and [Height, Width] is the filter's receptive field.
  • Output is a rank 2 tensor that is the result of convolving the input feature map with one filter. Its spatial shape is [Height, Width].
  • NumFilters is the number of filters. In the more common problem definition, the output feature map is a rank 3 tensor of shape [NumFilters, Height, Width].

Given the above definitions, we have:

FLOPs = NumFilters * (Output.H * Output.W) * (Filter.D * Filter.H * Filter.W)

which equals the product of the number of elements in the output feature map and the number of FLOPs per element. To simplify computations (primarily for the backward pass), NNS assumes an im2col implementation [1, 2]. Below (forward/backward passes) we use the notation from [2]:

  • K is the number of output feature maps (number of filters).

  • N is the batch size. Since total FLOPs are proportional to the batch size, we can safely assume N = 1.

  • P is the height of an output feature map.

  • Q is the width of an output feature map.

  • C is the number of input feature maps (depth of input rank 3 tensor, same as Filter.Depth).

  • R is the filter height (Filter.Height).

  • S is the filter width (Filter.Width).

  • Y is the output feature map (Output).

  • F is the filter tensor (Filter).

  • D is the input data (X).

Forward pass

  1. Matrix-matrix multiply: Y[K, NPQ] = F[K, CRS] * D[CRS, NPQ], FLOPs = KCRSNPQ. If we remove N (the batch size) from this expression, it matches exactly the formula presented above (the one based on counting dot products).

Backward pass

  1. Matrix-matrix multiply: dX = F.T[CRS, K] * dY[K, NPQ], FLOPs = CRSKNPQ
  2. Matrix-matrix multiply: dW = dY[K, NPQ] * D.T[NPQ, CRS], FLOPs = KNPQCRS = CRSKNPQ

Both forward and backward passes depend linearly on the batch size. Also, FLOPs(backward) = 2 * FLOPs(forward). Memory requirements are computed based on the size of the output tensor plus the weight tensor. If one of these layers is coupled with a non-linear activation, memory doubles, same as for the Dense layer.
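
A sketch of the same accounting in the notation above (hypothetical helper; N = 1):

# Conv2D FLOPs via the im2col view: Y[K, PQ] = F[K, CRS] * D[CRS, PQ].
def conv2d_flops(K: int, P: int, Q: int, C: int, R: int, S: int):
    forward = K * P * Q * C * R * S  # one multiply-add per output element per filter tap
    backward = 2 * forward           # dX and dW matrix multiplies
    return forward, backward

# Example: 64 filters of shape [32, 3, 3] producing a 56x56 output feature map.
fwd, bwd = conv2d_flops(K=64, P=56, Q=56, C=32, R=3, S=3)
print(fwd, bwd)  # 57802752 115605504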

RNN / Bidirectional layers

Bidirectional models double the number of FLOPs. The following cells are supported: SimpleRNNCell, LSTMCell and GRUCell. RNNs are built on matrix multiplies, so forward/backward FLOPs follow the same rules as for the Dense layer.
Also: FLOPs(LSTM) ~ 4 * FLOPs(RNN) and FLOPs(GRU) ~ 3 * FLOPs(RNN).

  1. RNN: two matrix multiplications per time step: hidden[t] = x[t]*Wxh + hidden[t-1]*Whh.
  2. LSTM: hidden and cell sizes are equal. In total, 4 matrix multiplications with the input X and 4 with the hidden state H, plus element-wise multiplications, sums and activations that we do not take into account.
  3. GRU: the update, reset and hidden gates each have one matrix multiply with X and one with H, so in total 3 matrix multiplies with X and 3 with H.
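
A sketch of this per-timestep accounting (hypothetical helper; element-wise operations are ignored, as stated above):

# Per-timestep multiply-add FLOPs for the supported cells.
def rnn_cell_flops(input_dim: int, hidden_dim: int, cell: str = 'rnn',
                   bidirectional: bool = False) -> int:
    gates = {'rnn': 1, 'gru': 3, 'lstm': 4}[cell]
    # Each gate multiplies x[t] by Wxh ([input_dim, hidden_dim]) and
    # hidden[t-1] by Whh ([hidden_dim, hidden_dim]).
    flops = gates * (input_dim * hidden_dim + hidden_dim * hidden_dim)
    return 2 * flops if bidirectional else flops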

TimeDistributed

The TimeDistributed layer applies a base layer to every temporal slice of an input. The base layer can be any layer described above. The FLOP count of the base layer is multiplied by the sequence length, and memory requirements equal the base layer's memory times the number of time steps.
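
A one-line sketch of this rule (hypothetical helper):

# TimeDistributed: the base layer's cost is repeated for every time step.
def time_distributed(base_flops: int, base_memory: int, time_steps: int):
    return base_flops * time_steps, base_memory * time_steps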

Examples

Examples are in the notebooks folder. First, set up a Python virtual environment:

virtualenv -p python3 ./.tf2
source ./.tf2/bin/activate
# This installs TensorFlow, pandas and their dependencies. Jupyter is not
# included in requirements.txt, so install it manually.
pip install -r ./requirements.txt
pip install jupyter
jupyter notebook

License

Apache License 2.0

Questions?

Contact me.

References

  1. Convnet: Implementing Convolution Layer with Numpy
  2. cuDNN: Efficient Primitives for Deep Learning
  3. Memory usage and computational considerations