def __init__(self, learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-7, centered=False, name="RMSprop", **kwargs): """Construct a new RMSprop optimizer. Note that in the dense implementation of this algorithm, variables and their corresponding accumulators (momentum, gradient moving average, square gradient moving average) will be updated even if the gradient is zero (i.e. accumulators will decay, momentum will be applied). The sparse implementation (used when the gradient is an `IndexedSlices` object, typically because of `tf.gather` or an embedding lookup in the forward pass) will not update variable slices or their accumulators unless those slices were used in the forward pass (nor is there an "eventual" correction to account for these omitted updates). This leads to more efficient updates for large embedding lookup tables (where most of the slices are not accessed in a particular graph execution), but differs from the published algorithm. Args: learning_rate: A Tensor or a floating point value. The learning rate. rho: Discounting factor for the history/coming gradient momentum: A scalar tensor. epsilon: Small value to avoid zero denominator. centered: If True, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to True may help with training, but is slightly more expensive in terms of computation and memory. Defaults to False. name: Optional name prefix for the operations created when applying gradients. Defaults to "RMSprop". @compatibility(eager) When eager execution is enabled, `learning_rate`, `decay`, `momentum`, and `epsilon` can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ if epsilon is None: epsilon = backend_config.epsilon() super(RMSprop, self).__init__(name, **kwargs) self._set_hyper("learning_rate", kwargs.get("lr", learning_rate)) self._set_hyper("decay", self._initial_decay) self._set_hyper("rho", rho) self._momentum = False if isinstance(momentum, ops.Tensor) or callable(momentum) or momentum > 0: self._momentum = True if isinstance(momentum, (int, float)) and (momentum < 0 or momentum > 1): raise ValueError("`momentum` must be between [0, 1].") self._set_hyper("momentum", momentum) self._set_hyper("epsilon", epsilon) self.centered = centered
def __init__(self, learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-7, name='Adagrad', **kwargs): """Construct a new Adagrad optimizer. Args: learning_rate: A `Tensor` or a floating point value. The learning rate. initial_accumulator_value: A floating point value. Starting value for the accumulators, must be positive. epsilon: A floating point value. Starting value for the accumulators, must be positive. name: Optional name prefix for the operations created when applying gradients. Defaults to "Adagrad". **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. Raises: ValueError: If the `initial_accumulator_value` or `epsilon` is invalid. @compatibility(eager) When eager execution is enabled, `learning_rate` can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility """ if initial_accumulator_value < 0.0: raise ValueError('initial_accumulator_value must be non-negative: %s' % initial_accumulator_value) if epsilon is None: epsilon = backend_config.epsilon() if epsilon < 1e-7: raise ValueError('epsilon must be larger than 1e-7: %s' % epsilon) super(Adagrad, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._initial_accumulator_value = initial_accumulator_value self._set_hyper('epsilon', epsilon)
def __init__(self, learning_rate=0.001, rho=0.95, epsilon=1e-7, name='Adadelta', **kwargs): """Construct a new Adadelta optimizer. Adadelta is a more robust extension of Adagrad that adapts learning rates based on a moving window of gradient updates, instead of accumulating all past gradients. This way, Adadelta continues learning even when many updates have been done. Compared to Adagrad, in the original version of Adadelta you don't have to set an initial learning rate. In this version, initial learning rate can be set, as in most other Keras optimizers. Args: learning_rate: A `Tensor` or a floating point value. The learning rate. To match the exact form in the original paper use 1.0. rho: A `Tensor` or a floating point value. The decay rate. epsilon: A `Tensor` or a floating point value. A constant epsilon used to better conditioning the grad update. name: Optional name prefix for the operations created when applying gradients. Defaults to "Adadelta". **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. @compatibility(eager) When eager execution is enabled, `learning_rate`, `rho`, and `epsilon` can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility """ if epsilon is None: epsilon = backend_config.epsilon() super(Adadelta, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('rho', rho) self._set_hyper('epsilon', epsilon)
def __init__( self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, name="AdamMultilr", pattern_lrs=None, **kwargs, ): super(AdamMultilr, self).__init__(name, **kwargs) self._set_hyper("learning_rate", kwargs.get("lr", learning_rate)) self._set_hyper("decay", self._initial_decay) self._set_hyper("beta_1", beta_1) self._set_hyper("beta_2", beta_2) self.epsilon = epsilon or backend_config.epsilon() self.amsgrad = amsgrad self.pattern_lrs = pattern_lrs # {["pattern": [pattern1, pattern2], "lr": lr]}
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, name='Adam', **kwargs): """Construct a new Adam optimizer. Args: learning_rate: A `Tensor`, floating point value, or a schedule that is a `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable that takes no arguments and returns the actual value to use, The learning rate. Defaults to 0.001. beta_1: A float value or a constant float tensor, or a callable that takes no arguments and returns the actual value to use. The exponential decay rate for the 1st moment estimates. Defaults to 0.9. beta_2: A float value or a constant float tensor, or a callable that takes no arguments and returns the actual value to use, The exponential decay rate for the 2nd moment estimates. Defaults to 0.999. epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. Defaults to 1e-7. amsgrad: Boolean. Whether to apply AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and beyond". Defaults to `False`. name: Optional name for the operations created when applying gradients. Defaults to "Adam". **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ super(NonFusedAdam, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon() self.amsgrad = amsgrad
def __init__(self, decay_steps, warmup_steps, min_lr=0.0, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, weight_decay=0., weight_decay_pattern=None, amsgrad=False, name='AdamWarmup', **kwargs): r"""Construct a new Adam optimizer. Args: decay_steps: Learning rate will decay linearly to zero in decay steps. warmup_steps: Learning rate will increase linearly to lr in first warmup steps. lr: float >= 0. Learning rate. beta_1: float, 0 < beta < 1. Generally close to 1. beta_2: float, 0 < beta < 1. Generally close to 1. epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`. weight_decay: float >= 0. Weight decay. weight_decay_pattern: A list of strings. The substring of weight names to be decayed. All weights will be decayed if it is None. amsgrad: boolean. Whether to apply the AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and Beyond". """ super(AdamWarmup, self).__init__(name, **kwargs) self._set_hyper('decay_steps', float(decay_steps)) self._set_hyper('warmup_steps', float(warmup_steps)) self._set_hyper('min_lr', min_lr) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self._set_hyper('weight_decay', weight_decay) self.epsilon = epsilon or backend_config.epsilon() self.amsgrad = amsgrad self._initial_weight_decay = weight_decay self._weight_decay_pattern = weight_decay_pattern
def __init__(self, keep_range=False, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, name='AdamApprox', **kwargs): super(AdamApprox, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon() ###################################################### self.keep_range = keep_range self.precision = 32 self.coeff_values = {'values': np.array([])}
def __init__(self, learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-7, name='Adagrad', **kwargs): """Construct a new Adagrad optimizer. Args: learning_rate: A `Tensor` or a floating point value. The learning rate. initial_accumulator_value: A floating point value. Starting value for the accumulators, must be positive. epsilon: A floating point value. Starting value for the accumulators, must be positive. name: Optional name prefix for the operations created when applying gradients. Defaults to "Adagrad". **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. Raises: ValueError: If the `initial_accumulator_value` or `epsilon` is invalid. @compatibility(eager) When eager execution is enabled, `learning_rate` can be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility """ if initial_accumulator_value < 0.0: raise ValueError('initial_accumulator_value must be non-negative: %s' % initial_accumulator_value) if epsilon is None: epsilon = backend_config.epsilon() if epsilon < 1e-7: raise ValueError('epsilon must be larger than 1e-7: %s' % epsilon) super(Adagrad, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._initial_accumulator_value = initial_accumulator_value
def __init__(self, accumulation_steps, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, name='Adam', **kwargs): r"""Construct a new Adam optimizer. Args: accumulation_steps: An integer. Update gradient in every accumulation steps. learning_rate: A Tensor or a floating point value. The learning rate. beta_1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. beta_2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates. epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. amsgrad: boolean. Whether to apply AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and beyond". name: Optional name for the operations created when applying gradients. Defaults to "Adam". @compatibility(eager) When eager execution is enabled, `learning_rate`, `beta_1`, `beta_2`, and `epsilon` can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ super(AdamAccumulated, self).__init__(name, **kwargs) self._set_hyper('accumulation_steps', accumulation_steps) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon() self.amsgrad = amsgrad
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, name='Nadam', **kwargs): """Construct a new Nadam optimizer. Args: learning_rate: A Tensor or a floating point value. The learning rate. beta_1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. beta_2: A float value or a constant float tensor. The exponential decay rate for the exponentially weighted infinity norm. epsilon: A small constant for numerical stability. name: Optional name for the operations created when applying gradients. Defaults to "Adamax". **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ # Backwards compatiblity with keras NAdam optimizer. kwargs['decay'] = kwargs.pop('schedule_decay', 0.004) learning_rate = kwargs.get('lr', learning_rate) if isinstance(learning_rate, learning_rate_schedule.LearningRateSchedule): raise ValueError('The Nadam optimizer does not support ' 'tf.keras.optimizers.LearningRateSchedules as the ' 'learning rate.') if epsilon is None: epsilon = backend_config.epsilon() super(Nadam, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self._set_hyper('epsilon', epsilon) self._m_cache = None
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, name='Nadam', **kwargs): """Construct a new Nadam optimizer. Args: learning_rate: A Tensor or a floating point value. The learning rate. beta_1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. beta_2: A float value or a constant float tensor. The exponential decay rate for the exponentially weighted infinity norm. epsilon: A small constant for numerical stability. name: Optional name for the operations created when applying gradients. Defaults to "Nadam". **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ # Backwards compatibility with keras NAdam optimizer. kwargs['decay'] = kwargs.pop('schedule_decay', 0.004) learning_rate = kwargs.get('lr', learning_rate) if isinstance(learning_rate, learning_rate_schedule.LearningRateSchedule): raise ValueError( 'The Nadam optimizer does not support ' 'tf.keras.optimizers.LearningRateSchedules as the ' 'learning rate.') super(Nadam, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon() self._m_cache = None
def ctc_batch_cost(y_true, y_pred, input_length, label_length): """Runs CTC loss algorithm on each batch element. Arguments: y_true: tensor `(samples, max_string_length)` containing the truth labels. y_pred: tensor `(samples, time_steps, num_categories)` containing the prediction, or output of the softmax. input_length: tensor `(samples, 1)` containing the sequence length for each batch item in `y_pred`. label_length: tensor `(samples, 1)` containing the sequence length for each batch item in `y_true`. Returns: Tensor with shape (samples,1) containing the CTC loss of each element. """ label_length = backend.math_ops.cast( backend.array_ops.squeeze(label_length, axis=-1), backend.dtypes_module.int32) input_length = backend.math_ops.cast( backend.array_ops.squeeze(input_length, axis=-1), backend.dtypes_module.int32) sparse_labels = backend.math_ops.cast( backend.ctc_label_dense_to_sparse(y_true, label_length), backend.dtypes_module.int32) y_pred = backend.math_ops.log( backend.array_ops.transpose(y_pred, perm=[1, 0, 2]) + backend_config.epsilon()) # overwrite here return backend.array_ops.expand_dims( backend.ctc.ctc_loss( inputs=y_pred, labels=sparse_labels, sequence_length=input_length, preprocess_collapse_repeated=config.preprocess_collapse_repeated, ctc_merge_repeated=config.ctc_merge_repeated, time_major=config.time_major), 1)
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, amsgrad=False, total_iterations=0, total_iterations_wd=None, use_cosine_annealing=False, weight_decays=None, lr_multipliers=None, init_verbose=True, eta_min=0, eta_max=1, name="AdamW", **kwargs): super(AdamW, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.eta_min = K.constant(eta_min, name='eta_min') self.eta_max = K.constant(eta_max, name='eta_max') if use_cosine_annealing: self._eta_t = None self._iter_updates = None self.total_iterations = total_iterations self.total_iterations_wd = total_iterations_wd or total_iterations self.lr_multipliers = lr_multipliers self.weight_decays = weight_decays or {} self.init_verbose = init_verbose self.use_cosine_annealing = use_cosine_annealing self.epsilon = epsilon or backend_config.epsilon() self.amsgrad = amsgrad _check_args(total_iterations, use_cosine_annealing, self.weight_decays) self._updates_processed = 0 # to track num calls to '_resource_apply_...' self._init_notified = False
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8, weight_decay=0.0, delta=0.1, wd_ratio=0.1, nesterov=False, name='AdamP', **kwargs): super(AdamP, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self._set_hyper('delta', delta) self._set_hyper('wd_ratio', wd_ratio) self.epsilon = epsilon or backend_config.epsilon() self.weight_decay = weight_decay self.nesterov = nesterov
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, name='Adam', param_lrs=None, **kwargs): super(DLR_Adam, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon() self.amsgrad = amsgrad self.param_lrs = param_lrs initiation_dict = {k: 1 for (k, v) in self.param_lrs.items()} self.initiation_dict = initiation_dict print("NOTE: Discriminative LR Adam is used.")
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0., amsgrad=False, model=None, zero_penalties=True, total_iterations=0, total_iterations_wd=None, use_cosine_annealing=False, lr_multipliers=None, weight_decays=None, autorestart=None, init_verbose=True, eta_min=0, eta_max=1, t_cur=0, name="AdamW", **kwargs): if total_iterations > 1: weight_decays = _init_weight_decays(model, zero_penalties, weight_decays) eta_t = kwargs.pop('eta_t', 1.) super(AdamW, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.eta_min = K.constant(eta_min, name='eta_min') self.eta_max = K.constant(eta_max, name='eta_max') self.eta_t = K.variable(eta_t, dtype='float32', name='eta_t') self.t_cur = K.variable(t_cur, dtype='int64', name='t_cur') self.total_iterations = total_iterations self.total_iterations_wd = total_iterations_wd or total_iterations self.lr_multipliers = lr_multipliers self.weight_decays = weight_decays or {} self.init_verbose = init_verbose self.use_cosine_annealing = use_cosine_annealing self.epsilon = epsilon or backend_config.epsilon() self.amsgrad = amsgrad _set_autorestart(self, autorestart, use_cosine_annealing) _check_args(self, total_iterations, use_cosine_annealing, weight_decays) self._init_lr = kwargs.get('lr', learning_rate) # to print lr_mult setup self._updates_processed = 0 # to track num calls to '_resource_apply_...' self._init_notified = False self._init_lr = kwargs.get('lr', learning_rate)
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, name='Plugin_Adam', **kwargs): super(Adam, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper("beta_1", beta_1) self._set_hyper("beta_2", beta_2) self.epsilon = epsilon or backend_config.epsilon() self._beta1 = beta_1 self._beta2 = beta_2 self._optimizer_handler = kit_lib.create_global_adam_optimizer( beta1=self._beta1, beta2=self._beta2, epsilon=self.epsilon) if not kit_lib.in_tensorflow2(): collections = [sok_GraphKeys.SparseOperationKitOptimizer] ops.add_to_collections(collections, self) _initializer_op = kit_lib.optimizer_init(self._optimizer_handler) self._initializer_op = control_flow_ops.group(_initializer_op)
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, name='Nadam', **kwargs): # Backwards compatibility with keras NAdam optimizer. kwargs['decay'] = kwargs.pop('schedule_decay', 0.004) learning_rate = kwargs.get('lr', learning_rate) if isinstance(learning_rate, learning_rate_schedule.LearningRateSchedule): raise ValueError( 'The Nadam optimizer does not support ' 'tf.keras.optimizers.LearningRateSchedules as the ' 'learning rate.') super(Nadam, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon() self._m_cache = None
def __init__(self, learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-7, centered=False, name="RMSprop", **kwargs): """Construct a new RMSprop optimizer. Note that in the dense implementation of this algorithm, variables and their corresponding accumulators (momentum, gradient moving average, square gradient moving average) will be updated even if the gradient is zero (i.e. accumulators will decay, momentum will be applied). The sparse implementation (used when the gradient is an `IndexedSlices` object, typically because of `tf.gather` or an embedding lookup in the forward pass) will not update variable slices or their accumulators unless those slices were used in the forward pass (nor is there an "eventual" correction to account for these omitted updates). This leads to more efficient updates for large embedding lookup tables (where most of the slices are not accessed in a particular graph execution), but differs from the published algorithm. Args: learning_rate: A `Tensor`, floating point value, or a schedule that is a `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable that takes no arguments and returns the actual value to use. The learning rate. Defeaults to 0.001. rho: Discounting factor for the history/coming gradient. Defaults to 0.9. momentum: A scalar or a scalar `Tensor`. Defaults to 0.0. epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. Defaults to 1e-7. centered: Boolean. If `True`, gradients are normalized by the estimated variance of the gradient; if False, by the uncentered second moment. Setting this to `True` may help with training, but is slightly more expensive in terms of computation and memory. Defaults to `False`. name: Optional name prefix for the operations created when applying gradients. Defaults to "RMSprop". @compatibility(eager) When eager execution is enabled, `learning_rate`, `decay`, `momentum`, and `epsilon` can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ super(RMSprop, self).__init__(name, **kwargs) self._set_hyper("learning_rate", kwargs.get("lr", learning_rate)) self._set_hyper("decay", self._initial_decay) self._set_hyper("rho", rho) self._momentum = False if isinstance(momentum, ops.Tensor) or callable(momentum) or momentum > 0: self._momentum = True if isinstance(momentum, (int, float)) and (momentum < 0 or momentum > 1): raise ValueError("`momentum` must be between [0, 1].") self._set_hyper("momentum", momentum) self.epsilon = epsilon or backend_config.epsilon() self.centered = centered
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, name='Adamax', **kwargs): """Construct a new Adamax optimizer. Initialization: ``` m_0 <- 0 (Initialize initial 1st moment vector) v_0 <- 0 (Initialize the exponentially weighted infinity norm) t <- 0 (Initialize timestep) ``` The update rule for `variable` with gradient `g` uses an optimization described at the end of section 7.1 of the paper: ``` t <- t + 1 m_t <- beta1 * m_{t-1} + (1 - beta1) * g v_t <- max(beta2 * v_{t-1}, abs(g)) variable <- variable - learning_rate / (1 - beta1^t) * m_t / (v_t + epsilon) ``` Similar to AdamOptimizer, the epsilon is added for numerical stability (especially to get rid of division by zero when v_t = 0). Contrast to AdamOptimizer, the sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of `tf.gather` or an embedding lookup in the forward pass) only updates variable slices and corresponding `m_t`, `v_t` terms when that part of the variable was used in the forward pass. This means that the sparse behavior is contrast to the dense behavior (similar to some momentum implementations which ignore momentum unless a variable slice was actually used). Args: learning_rate: A `Tensor`, floating point value, or a schedule that is a `tf.keras.optimizers.schedules.LearningRateSchedule`. The learning rate. beta_1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. beta_2: A float value or a constant float tensor. The exponential decay rate for the exponentially weighted infinity norm. epsilon: A small constant for numerical stability. name: Optional name for the operations created when applying gradients. Defaults to "Adamax". **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ super(Adamax, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon()
def __init__( self, learning_rate, momentum=0.9, weight_decay=0.0001, # The LARS coefficient is a hyperparameter eeta=0.001, epsilon=0.0, name="LARSOptimizer", # Enable skipping variables from LARS scaling. # TODO(sameerkm): Enable a direct mechanism to pass a # subset of variables to the optimizer. skip_list=None, use_nesterov=False, **kwargs): """Construct a new LARS Optimizer. Args: learning_rate: A `Tensor`, floating point value, or a schedule that is a `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable that takes no arguments and returns the actual value to use. The learning rate. momentum: A floating point value. Momentum hyperparameter. weight_decay: A floating point value. Weight decay hyperparameter. eeta: LARS coefficient as used in the paper. Dfault set to LARS coefficient from the paper. (eeta / weight_decay) determines the highest scaling factor in LARS. epsilon: Optional epsilon parameter to be set in models that have very small gradients. Default set to 0.0. name: Optional name prefix for variables and ops created by LARSOptimizer. skip_list: List of strings to enable skipping variables from LARS scaling. If any of the strings in skip_list is a subset of var.name, variable 'var' is skipped from LARS scaling. For a typical classification model with batch normalization, the skip_list is ['batch_normalization', 'bias'] use_nesterov: when set to True, nesterov momentum will be enabled **kwargs: keyword arguments. Raises: ValueError: If a hyperparameter is set to a non-sensical value. """ if momentum < 0.0: raise ValueError("momentum should be positive: %s" % momentum) if weight_decay < 0.0: raise ValueError("weight_decay should be positive: %s" % weight_decay) super(LARSOptimizer, self).__init__(name=name, **kwargs) self._set_hyper("learning_rate", learning_rate) # When directly using class members, instead of # _set_hyper and _get_hyper (such as learning_rate above), # the values are fixed after __init(), and not being # updated during the training process. # This provides better performance but less flexibility. self.momentum = momentum self.weight_decay = weight_decay self.eeta = eeta self.epsilon = epsilon or backend_config.epsilon() self._skip_list = skip_list self.use_nesterov = use_nesterov
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, name='Adam', **kwargs): r"""Construct a new Adam optimizer. If amsgrad = False: initialize $m_0$ as 1st moment vector initialize $v_0$ as 2nd moment vector The update rule for $\theta$ with gradient $g$ uses an optimization described at the end of section 2 of the paper: $$lr_t = \mathrm{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$ $$m_t = \beta_1 * m_{t-1} + (1 - \beta_1) * g$$ $$v_t = \beta_2 * v_{t-1} + (1 - \beta_2) * g^2$$ $$\theta_t = \theta_{t-1} - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$ If amsgrad = True: initialize $m_0$ as 1st moment vector initialize $v_0$ as 2nd moment vector initialize $\hat{v}_0$ as 2nd moment vector The update rule for $\theta$ with gradient $g$ uses an optimization described at the end of section 2 of the paper: $$lr_t = \mathrm{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$ $$m_t = \beta_1 * m_{t-1} + (1 - \beta_1) * g$$ $$v_t = \beta_2 * v_{t-1} + (1 - \beta_2) * g^2$$ $$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$$ $$\theta_t = \theta_{t-1} - lr_t * m_t / (\sqrt{\hat{v}_t} + \epsilon)$$ The default value of 1e-7 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper. The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used). Usage: >>> opt = tf.keras.optimizers.Adam(learning_rate=0.1) >>> var1 = tf.Variable(10.0) >>> loss = lambda: (var1 ** 2)/2.0 # d(loss)/d(var1) == var1 >>> step_count = opt.minimize(loss, [var1]).numpy() >>> # The first step is `-learning_rate*sign(grad)` >>> var1.numpy() 9.9 Args: learning_rate: A `Tensor`, floating point value, or a schedule that is a `tf.keras.optimizers.schedules.LearningRateSchedule`, or a callable that takes no arguments and returns the actual value to use, The learning rate. Defaults to 0.001. beta_1: A float value or a constant float tensor, or a callable that takes no arguments and returns the actual value to use. The exponential decay rate for the 1st moment estimates. Defaults to 0.9. beta_2: A float value or a constant float tensor, or a callable that takes no arguments and returns the actual value to use, The exponential decay rate for the 2nd moment estimates. Defaults to 0.999. epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. Defaults to 1e-7. amsgrad: Boolean. Whether to apply AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and beyond". Defaults to `False`. name: Optional name for the operations created when applying gradients. Defaults to "Adam". **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ super(Adam, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon() self.amsgrad = amsgrad
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, model=None, zero_penalties=True, total_iterations=0, total_iterations_wd=None, use_cosine_annealing=False, lr_multipliers=None, weight_decays=None, autorestart=None, init_verbose=True, eta_min=0, eta_max=1, t_cur=0, name="NCustomOptimizer", **kwargs): if total_iterations > 1: weight_decays = _init_weight_decays(model, zero_penalties, weight_decays) # Backwards compatibility with keras NAdam optimizer. kwargs['decay'] = kwargs.pop('schedule_decay', 0.004) eta_t = kwargs.pop('eta_t', 1.) learning_rate = kwargs.get('lr', learning_rate) if isinstance(learning_rate, learning_rate_schedule.LearningRateSchedule): raise ValueError( 'The Nadam optimizer does not support ' 'tf.keras.optimizers.LearningRateSchedules as the ' 'learning rate.') super(NCustomOptimizer, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon() self._m_cache = None self.eta_min = K.constant(eta_min, name='eta_min') self.eta_max = K.constant(eta_max, name='eta_max') self.eta_t = K.variable(eta_t, dtype='float32', name='eta_t') self.t_cur = K.variable(t_cur, dtype='int64', name='t_cur') self.total_iterations = total_iterations self.total_iterations_wd = total_iterations_wd or total_iterations self.lr_multipliers = lr_multipliers self.weight_decays = weight_decays or {} self.init_verbose = init_verbose self.use_cosine_annealing = use_cosine_annealing self.epsilon = epsilon or backend_config.epsilon() _set_autorestart(self, autorestart, use_cosine_annealing) _check_args(self, total_iterations, use_cosine_annealing, weight_decays) self._init_lr = kwargs.get('lr', learning_rate) # to print lr_mult setup self._updates_processed = 0 # to track num calls to '_resource_apply_...' self._init_notified = False
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, name='Adam', **kwargs): r"""Construct a new Adam optimizer. If amsgrad = False: Initialization: $$m_0 := 0 \text{(Initialize initial 1st moment vector)}$$ $$v_0 := 0 \text{(Initialize initial 2nd moment vector)}$$ $$t := 0 \text{(Initialize timestep)}$$ The update rule for `variable` with gradient `g` uses an optimization described at the end of section 2 of the paper: $$t := t + 1$$ $$lr_t := \text{learning\_rate} * \sqrt{1 - beta_2^t} / (1 - beta_1^t)$$ $$m_t := beta_1 * m_{t-1} + (1 - beta_1) * g$$ $$v_t := beta_2 * v_{t-1} + (1 - beta_2) * g * g$$ $$variable := variable - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$ If amsgrad = True: Initialization: $$m_0 := 0 \text{(Initialize initial 1st moment vector)}$$ $$v_0 := 0 \text{(Initialize initial 2nd moment vector)}$$ $$v_hat_0 := 0 \text{(Initialize initial 2nd moment vector)}$$ $$t := 0 \text{(Initialize timestep)}$$ The update rule for `variable` with gradient `g` uses an optimization described at the end of section 2 of the paper: $$t := t + 1$$ $$lr_t := \text{learning\_rate} * \sqrt{1 - beta_2^t} / (1 - beta_1^t)$$ $$m_t := beta_1 * m_{t-1} + (1 - beta_1) * g$$ $$v_t := beta_2 * v_{t-1} + (1 - beta_2) * g * g$$ $$v_hat_t := max(v_hat_{t-1}, v_t) $$variable := variable - lr_t * m_t / (\sqrt{v_hat_t} + \epsilon)$$ The default value of 1e-7 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper. The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used). Args: learning_rate: A Tensor or a floating point value. The learning rate. beta_1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. beta_2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates. epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. amsgrad: boolean. Whether to apply AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and beyond". name: Optional name for the operations created when applying gradients. Defaults to "Adam". @compatibility(eager) When eager execution is enabled, `learning_rate`, `beta_1`, `beta_2`, and `epsilon` can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ if epsilon is None: epsilon = backend_config.epsilon() super(Adam, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self._set_hyper('epsilon', epsilon) self.amsgrad = amsgrad
def test_epsilon(self): epsilon = 1e-2 backend_config.set_epsilon(epsilon) self.assertEqual(backend_config.epsilon(), epsilon) backend_config.set_epsilon(1e-7) self.assertEqual(backend_config.epsilon(), 1e-7)
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, name='Adam', **kwargs): r"""Construct a new Adam optimizer. If amsgrad = False: Initialization: $$m_0 := 0 \text{(Initialize initial 1st moment vector)}$$ $$v_0 := 0 \text{(Initialize initial 2nd moment vector)}$$ $$t := 0 \text{(Initialize timestep)}$$ The update rule for `variable` with gradient `g` uses an optimization described at the end of section 2 of the paper: $$t := t + 1$$ $$lr_t := \text{learning\_rate} * \sqrt{1 - beta_2^t} / (1 - beta_1^t)$$ $$m_t := beta_1 * m_{t-1} + (1 - beta_1) * g$$ $$v_t := beta_2 * v_{t-1} + (1 - beta_2) * g * g$$ $$variable := variable - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$ If amsgrad = True: Initialization: $$m_0 := 0 \text{(Initialize initial 1st moment vector)}$$ $$v_0 := 0 \text{(Initialize initial 2nd moment vector)}$$ $$v_hat_0 := 0 \text{(Initialize initial 2nd moment vector)}$$ $$t := 0 \text{(Initialize timestep)}$$ The update rule for `variable` with gradient `g` uses an optimization described at the end of section 2 of the paper: $$t := t + 1$$ $$lr_t := \text{learning\_rate} * \sqrt{1 - beta_2^t} / (1 - beta_1^t)$$ $$m_t := beta_1 * m_{t-1} + (1 - beta_1) * g$$ $$v_t := beta_2 * v_{t-1} + (1 - beta_2) * g * g$$ $$v_hat_t := max(v_hat_{t-1}, v_t)$$ $$variable := variable - lr_t * m_t / (\sqrt{v_hat_t} + \epsilon)$$ The default value of 1e-7 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper. The sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of `tf.gather` or an embedding lookup in the forward pass) does apply momentum to variable slices even if they were not used in the forward pass (meaning they have a gradient equal to zero). Momentum decay (beta1) is also applied to the entire momentum accumulator. This means that the sparse behavior is equivalent to the dense behavior (in contrast to some momentum implementations which ignore momentum unless a variable slice was actually used). Args: learning_rate: A Tensor or a floating point value. The learning rate. beta_1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. beta_2: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates. epsilon: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper. amsgrad: boolean. Whether to apply AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and beyond". name: Optional name for the operations created when applying gradients. Defaults to "Adam". @compatibility(eager) When eager execution is enabled, `learning_rate`, `beta_1`, `beta_2`, and `epsilon` can each be a callable that takes no arguments and returns the actual value to use. This can be useful for changing these values across different invocations of optimizer functions. @end_compatibility **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ super(Adam, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self.epsilon = epsilon or backend_config.epsilon() self.amsgrad = amsgrad
def __init__(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, name='Adamax', **kwargs): """Construct a new Adamax optimizer. Initialization: ``` m_0 <- 0 (Initialize initial 1st moment vector) v_0 <- 0 (Initialize the exponentially weighted infinity norm) t <- 0 (Initialize timestep) ``` The update rule for `variable` with gradient `g` uses an optimization described at the end of section 7.1 of the paper: ``` t <- t + 1 m_t <- beta1 * m_{t-1} + (1 - beta1) * g v_t <- max(beta2 * v_{t-1}, abs(g)) variable <- variable - learning_rate / (1 - beta1^t) * m_t / (v_t + epsilon) ``` Similar to AdamOptimizer, the epsilon is added for numerical stability (especially to get rid of division by zero when v_t = 0). Contrast to AdamOptimizer, the sparse implementation of this algorithm (used when the gradient is an IndexedSlices object, typically because of `tf.gather` or an embedding lookup in the forward pass) only updates variable slices and corresponding `m_t`, `v_t` terms when that part of the variable was used in the forward pass. This means that the sparse behavior is contrast to the dense behavior (similar to some momentum implementations which ignore momentum unless a variable slice was actually used). Args: learning_rate: A Tensor or a floating point value. The learning rate. beta_1: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates. beta_2: A float value or a constant float tensor. The exponential decay rate for the exponentially weighted infinity norm. epsilon: A small constant for numerical stability. name: Optional name for the operations created when applying gradients. Defaults to "Adamax". **kwargs: keyword arguments. Allowed to be {`clipnorm`, `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients by norm; `clipvalue` is clip gradients by value, `decay` is included for backward compatibility to allow time inverse decay of learning rate. `lr` is included for backward compatibility, recommended to use `learning_rate` instead. """ if epsilon is None: epsilon = backend_config.epsilon() super(Adamax, self).__init__(name, **kwargs) self._set_hyper('learning_rate', kwargs.get('lr', learning_rate)) self._set_hyper('decay', self._initial_decay) self._set_hyper('beta_1', beta_1) self._set_hyper('beta_2', beta_2) self._set_hyper('epsilon', epsilon)