This is a demo of
- The method described here for finding the best learning rate; and
- The use of cyclical learning rates (from the same paper). This is not a drop-in module for finding a good learning rate in other problems, though the method is transferable.
The script takes a 'warm-up' run of either a simple LeNet-style MNIST model or a WRN-28-10 Cifar10 model (https://arxiv.org/abs/1605.07146), varying the learning rate from a very small value to a very large one to determine the value at which the loss is smallest. It then either trains the model at a multiple of that learning rate or cycles the learning rate, depending on flags passed at runtime.
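A minimal sketch of that warm-up sweep, assuming a compiled Keras model whose optimizer exposes `lr` (the class name and details are illustrative, not the script's actual implementation):

```python
import numpy as np
from keras.callbacks import Callback
import keras.backend as K

class LRRangeTest(Callback):
    """Sweep the learning rate geometrically from lr_min to lr_max
    over num_batches batches, recording the loss at each step."""

    def __init__(self, lr_min=1e-4, lr_max=10.0, num_batches=4000):
        super(LRRangeTest, self).__init__()
        self.lrs = np.geomspace(lr_min, lr_max, num_batches)
        self.losses = []

    def on_batch_begin(self, batch, logs=None):
        # Clamp the index in case training runs longer than num_batches.
        i = min(len(self.losses), len(self.lrs) - 1)
        K.set_value(self.model.optimizer.lr, self.lrs[i])

    def on_batch_end(self, batch, logs=None):
        self.losses.append(logs.get('loss'))

# After one sweep (model.fit with this callback attached), the candidate
# is the learning rate at the (smoothed) minimum of the recorded loss:
# best_lr = lr_range.lrs[np.argmin(lr_range.losses)]
```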
In addition to Keras with the TensorFlow backend and the usual NumPy/Matplotlib/SciPy stack, you need the custom clr_callback for Keras found here.
Basic usage: `python best_lr.py`

With Cifar10 and some additional flags: `python best_lr.py --dataset=cifar10 --num_epochs=100 --lr_min=1e-6 --cycle --save_model`
Both the MNIST and the Cifar10 models are pretty much hardcoded, but options to change the kernel initializer, the activation function, and the amount of weight decay for the Cifar10 model are provided.
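For illustration, here is roughly where those three options land in a convolutional layer (the layer width and shape are hypothetical; the real WRN-28-10 architecture lives in the script):

```python
from keras.layers import Conv2D
from keras.regularizers import l2

# Hypothetical layer showing how the three Cifar10 flags plug in.
conv = Conv2D(160, (3, 3), padding='same',
              kernel_initializer='he_normal',   # --kernel_initializer
              activation='relu',                # --activation
              kernel_regularizer=l2(0.0005))    # --weight_decay
```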
- `--dataset`: one of `'mnist'`, `'cifar10'`. Default is `'mnist'`.
- `--kernel_initializer`: default is `'he_normal'`. Only modifies the Cifar10 model.
- `--activation`: default is `'relu'`. Only modifies the Cifar10 model.
- `--weight_decay`: default is `0.0005`. Only modifies the Cifar10 model.
- `--optimizer`: one of `adam`, `rmsprop`, or `sgd`. Default is `sgd`.
- `--momentum`: default is `0.9`. Only used with `sgd`.
- `--batch_size`: default is `32`.
- `--num_epochs`: default is `55`.
- `--num_learning_batches`: how many batches to vary the learning rate over when determining the best learning rate. Default is `4000`.
- `--lr_min`: minimum learning rate to try when determining the best learning rate. Default is `1e-4`.
- `--lr_max`: maximum learning rate to try when determining the best learning rate. Default is `10`.
- `--lr_max_multiplier`: when training with a fixed learning rate, that rate will be `lr_max_multiplier * <lr-with-min-loss>`. When cycling the learning rate, this sets the maximum learning rate. Default is `1`.
- `--lr_min_multiplier`: when cycling the learning rate, the minimum learning rate will be `lr_min_multiplier * <lr-with-min-loss>`. Default is `0.1`. (See the sketch after this list for how the two multipliers are used.)
- `--cycle`: cycle the learning rate? Default is false.
- `--skip_test`: want to skip the learning rate test and just get on with training? Specify the learning rate here; it will be treated as `<lr-with-min-loss>` (see `lr_max_multiplier` and `lr_min_multiplier` above). Should be a float. Default is `None`.
- `--no_plots`: suppress plot generation.
- `--save_model`: default is false.
- `--log_dir`: where to save TensorBoard logs/model weights. Default is `'./logs'`.
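To make the multiplier flags concrete, here is a hedged sketch of how training might be configured once `<lr-with-min-loss>` is known. The `CyclicLR` callback is the clr_callback mentioned above; the exact keyword arguments and step size shown are assumptions, not the script's actual wiring:

```python
from keras.optimizers import SGD
from clr_callback import CyclicLR  # the custom Keras callback mentioned above

best_lr = 0.08            # <lr-with-min-loss> from the range test (illustrative)
lr_min_multiplier = 0.1   # --lr_min_multiplier
lr_max_multiplier = 1.0   # --lr_max_multiplier

# With --cycle: oscillate between two multiples of best_lr.
clr = CyclicLR(base_lr=lr_min_multiplier * best_lr,
               max_lr=lr_max_multiplier * best_lr,
               step_size=2000)  # step_size (in batches) is an assumption

# Without --cycle: train at a fixed multiple of best_lr instead.
sgd = SGD(lr=lr_max_multiplier * best_lr, momentum=0.9)
```

When cycling, the callback is attached via `model.fit(..., callbacks=[clr])`.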
On Cifar10, with learning rate cycling, batch size 128, and all other parameters at their defaults, the model typically reaches a test accuracy in the high 80s or low 90s in 20-30 epochs (approximately one hour on a single Titan Xp). On MNIST, cycling the learning rate with batch size 64 typically achieves greater than 99% test accuracy in under 10 epochs.