Variational Continual Learning

Original paper by Cuong V. Nguyen, Yingzhen Li, Thang D. Bui and Richard E. Turner

Part 1. Paper Summary

1. Introduction

  • Continual Learning
  • Challenge for Continual Learning
  • Variational Continual Learning

2. Continual Learning by Approximate Bayesian Inference

  • Online updating, derived from Bayes' rule

  • Posterior after the $T$-th dataset is proportional to the posterior after the $(T-1)$-th dataset multiplied by the likelihood of the $T$-th dataset: $p(\theta \mid \mathcal{D}_{1:T}) \propto p(\theta \mid \mathcal{D}_{1:T-1})\, p(\mathcal{D}_T \mid \theta)$ (a worked numerical example follows this list)
  • Projection Operation: approximation for the intractable posterior, applied recursively, $q_t(\theta) = \mathrm{proj}\big(q_{t-1}(\theta)\, p(\mathcal{D}_t \mid \theta)\big)$

  • This paper will use Online VI as it outperforms other methods for complex models in the static setting (Bui et al., 2016)
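
To make the recursion concrete, here is a minimal, self-contained example (not taken from the paper or this repository) of exact online Bayesian updating for the mean of a 1-D Gaussian with known noise variance; the `update` function and the three-task loop are illustrative only.

```python
# Exact recursive Bayesian updating for the mean of a 1-D Gaussian with known
# observation noise: the posterior after dataset t serves as the prior for t+1.
# Illustrative only; in VCL the same recursion is approximated with variational
# inference because the posterior over network weights is intractable.
import numpy as np

def update(prior_mu, prior_var, data, noise_var=1.0):
    """Conjugate Gaussian update: returns the posterior mean and variance."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (prior_mu / prior_var + np.sum(data) / noise_var)
    return post_mu, post_var

rng = np.random.default_rng(0)
mu, var = 0.0, 1.0                      # prior p(theta) = N(0, 1)
for t in range(3):                      # three "tasks" D_1, D_2, D_3
    data_t = rng.normal(2.0, 1.0, size=20)
    mu, var = update(mu, var, data_t)   # p(theta | D_{1:t}) ∝ p(theta | D_{1:t-1}) p(D_t | theta)
    print(f"after task {t + 1}: mean={mu:.3f}, var={var:.4f}")
```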

2.1. VCL and Episodic Memory Enhancement

  • Projection Operation: KL divergence minimization, $q_t(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\!\left( q(\theta) \,\middle\|\, \tfrac{1}{Z_t}\, q_{t-1}(\theta)\, p(\mathcal{D}_t \mid \theta) \right)$, with $q_0(\theta)$ set to the prior $p(\theta)$

  • $Z_t$: normalizing constant (not required when computing the optimum)
  • VCL recovers exact Bayesian inference if the approximating family $\mathcal{Q}$ is rich enough to contain the true posterior at every step
  • Potential Problems
  • Errors from repeated approximation → forget old tasks
  • Minimization at each step is also approximate → information loss
  • Solution: Coreset
  • Coreset: small representative set of data from previously observed tasks
  • Analogous to episodic memory (Lopez-Paz & Ranzato, 2017)
  • Coreset VCL: equivalent to a message-passing implementation of VI in which the coreset data point updates are scheduled after updating the other data
  • $C_t$: updated using $C_{t-1}$ and selected data points from $\mathcal{D}_t$ (e.g. random selection, the K-center algorithm, ...)
  • K-center algorithm: return $K$ data points that are spread throughout the input space (Gonzalez, 1985); a selection sketch follows this list
  • Variational Recursion: $\tilde{q}_t(\theta) = \mathrm{proj}\big(\tilde{q}_{t-1}(\theta)\, p(\mathcal{D}_t \cup C_{t-1} \setminus C_t \mid \theta)\big)$, with the predictive posterior recovered as $q_t(\theta) = \mathrm{proj}\big(\tilde{q}_t(\theta)\, p(C_t \mid \theta)\big)$

  • Algorithm
  • Step 1: Observe $\mathcal{D}_t$
  • Step 2: Update the coreset $C_t$ using $C_{t-1}$ and $\mathcal{D}_t$
  • Step 3: Update $\tilde{q}_t(\theta)$ (used for propagation): $\tilde{q}_t(\theta) \leftarrow \mathrm{proj}\big(\tilde{q}_{t-1}(\theta)\, p(\mathcal{D}_t \cup C_{t-1} \setminus C_t \mid \theta)\big)$
  • Step 4: Update $q_t(\theta)$ (used for prediction): $q_t(\theta) \leftarrow \mathrm{proj}\big(\tilde{q}_t(\theta)\, p(C_t \mid \theta)\big)$
  • Step 5: Perform prediction: $p(y^* \mid x^*, \mathcal{D}_{1:t}) = \int q_t(\theta)\, p(y^* \mid \theta, x^*)\, d\theta$
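
The following is a minimal sketch of the greedy K-center heuristic mentioned above, assuming plain numpy inputs; the function name `k_center_indices` is illustrative and not necessarily this repository's API.

```python
# Greedy K-center selection: pick K points spread throughout the input space by
# repeatedly choosing the point farthest from the already-selected set.
# Minimal illustrative sketch; the repository's implementation may differ.
import numpy as np

def k_center_indices(x, k, first=0):
    """Return indices of k points chosen by the greedy K-center heuristic."""
    chosen = [first]
    # Distance from every point to its nearest chosen centre so far.
    dists = np.linalg.norm(x - x[first], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))              # farthest point from current centres
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(x - x[nxt], axis=1))
    return np.array(chosen)

# Example: select a 5-point coreset from 200 random 2-D inputs.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(200, 2))
coreset = inputs[k_center_indices(inputs, k=5)]
print(coreset)
```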

3. VCL in Deep Discriminative Models

  • Multi-head Networks
  • Standard architecture used for multi-task learning (Bakker & Heskes, 2003)
  • Share parameters close to the inputs / Separate heads for each output
  • More advanced model structures:
  • for continual learning (Rusu et al., 2016)
  • for multi-task learning in general (Swietojanski & Renals, 2014; Rebuffi et al., 2017)
  • automatic continual model building: adding new structure as new tasks are encountered
  • This paper assumes that the model structure is known a priori
  • Formulation
  • Model parameters: $\theta = \{\theta^S, \theta^H_1, \ldots, \theta^H_T\}$
  • Shared parameters $\theta^S$: updated constantly
  • Head parameters $\theta^H_t$: set to the prior at the beginning, then updated incrementally as each task emerges
  • For simplicity, use a Gaussian mean-field approximate posterior: $q_t(\theta) = \prod_d \mathcal{N}\big(\theta_{t,d};\, \mu_{t,d},\, \sigma^2_{t,d}\big)$
  • Network Training
  • Maximize the negative online variational free energy, i.e. the variational lower bound to the online marginal likelihood, $\mathcal{L}^t_{\mathrm{VCL}}(q_t(\theta)) = \sum_{n=1}^{N_t} \mathbb{E}_{\theta \sim q_t(\theta)}\big[\log p\big(y_t^{(n)} \mid \theta, x_t^{(n)}\big)\big] - \mathrm{KL}\big(q_t(\theta) \,\|\, q_{t-1}(\theta)\big)$, with respect to the variational parameters $\{\mu_{t,d}, \sigma_{t,d}\}$; the expectation is estimated by Monte Carlo with the reparameterization trick (a sketch follows this list)
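
Below is a minimal sketch, assuming PyTorch, of the two ingredients above: a factorised Gaussian layer and the per-task VCL objective (a single-sample Monte Carlo estimate of the expected log-likelihood minus the KL to the previous posterior). The names `MeanFieldLinear`, `kl_to_prev`, `finish_task`, and `vcl_loss` are illustrative, not this repository's API.

```python
# Minimal illustrative sketch (not this repo's code): a mean-field Gaussian layer
# and the per-task VCL objective  L_t = E_q[log p(D_t | theta)] - KL(q_t || q_{t-1}).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanFieldLinear(nn.Module):
    """Linear layer with a factorised Gaussian posterior over weights and biases."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.w_mu = nn.Parameter(0.1 * torch.randn(n_in, n_out))
        self.w_logvar = nn.Parameter(-6.0 * torch.ones(n_in, n_out))
        self.b_mu = nn.Parameter(torch.zeros(n_out))
        self.b_logvar = nn.Parameter(-6.0 * torch.ones(n_out))
        # q_{t-1}(theta): the previous posterior, initialised to the prior N(0, 1).
        self.register_buffer("prior_w_mu", torch.zeros(n_in, n_out))
        self.register_buffer("prior_w_logvar", torch.zeros(n_in, n_out))
        self.register_buffer("prior_b_mu", torch.zeros(n_out))
        self.register_buffer("prior_b_logvar", torch.zeros(n_out))

    def forward(self, x):
        # Reparameterised sample theta ~ q_t(theta) for a Monte Carlo estimate.
        w = self.w_mu + torch.exp(0.5 * self.w_logvar) * torch.randn_like(self.w_mu)
        b = self.b_mu + torch.exp(0.5 * self.b_logvar) * torch.randn_like(self.b_mu)
        return x @ w + b

    def kl_to_prev(self):
        # KL(q_t || q_{t-1}) for factorised Gaussians, summed over all parameters.
        def kl(mu, logvar, mu0, logvar0):
            return 0.5 * torch.sum(
                logvar0 - logvar - 1.0
                + (torch.exp(logvar) + (mu - mu0) ** 2) / torch.exp(logvar0))
        return (kl(self.w_mu, self.w_logvar, self.prior_w_mu, self.prior_w_logvar)
                + kl(self.b_mu, self.b_logvar, self.prior_b_mu, self.prior_b_logvar))

    def finish_task(self):
        # After task t, the current posterior becomes the "prior" for task t + 1.
        with torch.no_grad():
            for name in ("w_mu", "w_logvar", "b_mu", "b_logvar"):
                getattr(self, "prior_" + name).copy_(getattr(self, name))

def vcl_loss(layer, x, y, n_train):
    """Negative per-task lower bound (single-sample MC estimate, KL scaled per datum)."""
    nll = F.cross_entropy(layer(x), y, reduction="mean")
    return nll + layer.kl_to_prev() / n_train
```

In a multi-head network, the shared trunk and each task head would be built from such layers, with `finish_task()` called once training on a task is complete.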

4. VCL in Deep Generative Models

  • Deep Generative Models
  • Formulation - VAE approach (batch learning): joint model $p(x \mid z, \theta)\, p(z)$
  • $p(z)$: prior over latent variables / typically Gaussian
  • $p(x \mid z, \theta)$: defined by a DNN $f_\theta$, where $\theta$ collects the weight matrices and bias vectors
  • Learning $\theta$: approximate MLE (maximize the variational lower bound $\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \sum_{n} \mathbb{E}_{q_\phi(z^{(n)} \mid x^{(n)})}\big[\log \tfrac{p(x^{(n)} \mid z^{(n)}, \theta)\, p(z^{(n)})}{q_\phi(z^{(n)} \mid x^{(n)})}\big]$ with respect to $\theta$ and $\phi$)

  • No parameter uncertainty estimates (which are needed to weight the information learned from old data)
  • Formulation - VCL approach (continual learning)
  • Approximate full posterior over parameters: $q_t(\theta) \approx p(\theta \mid \mathcal{D}_{1:t})$
  • Maximize the full variational lower bound with respect to $q_t(\theta)$ and $\phi$ (a sketch follows this list):

  $\mathcal{L}^t_{\mathrm{VCL}}(q_t(\theta), \phi) = \mathbb{E}_{q_t(\theta)}\Big[\sum_{n=1}^{N_t} \mathbb{E}_{q_\phi(z_t^{(n)} \mid x_t^{(n)})}\big[\log \tfrac{p(x_t^{(n)} \mid z_t^{(n)}, \theta)\, p(z_t^{(n)})}{q_\phi(z_t^{(n)} \mid x_t^{(n)})}\big]\Big] - \mathrm{KL}\big(q_t(\theta) \,\|\, q_{t-1}(\theta)\big)$

  • Encoder $q_\phi(z_t \mid x_t)$: task-specific → likely to be beneficial to share (parts of) these encoder networks across tasks
  • Model Architecture
  • Latent variables → Intermediate-level representations
  • Architecture 1: shared bottom network - suitable when data are composed of a common set of structural primitives (e.g. strokes)
  • Architecture 2: shared head network - suitable when the task-specific information tends to be entirely encoded in the bottom network
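
As a sketch of this objective (again assuming PyTorch), the function below combines the usual per-datum ELBO over latents with a KL term over the Bayesian decoder parameters; it assumes a hypothetical `encoder(x)` returning a Gaussian mean and log-variance over `z`, and a `decoder` built from mean-field Gaussian layers exposing `kl_to_prev()` as in the earlier sketch. It is illustrative, not this repository's implementation.

```python
# Illustrative sketch (not this repo's code) of the per-task VCL objective for a
# VAE: the usual ELBO over latents z plus a KL that ties the Bayesian decoder
# parameters q_t(theta) to the previous posterior q_{t-1}(theta).
import torch
import torch.nn.functional as F

def vcl_vae_loss(encoder, decoder, x, n_train):
    # Task-specific encoder q_phi(z | x): amortised Gaussian posterior over latents.
    z_mu, z_logvar = encoder(x)
    z = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)

    # Reconstruction term log p(x | z, theta); assumes binarised inputs and a
    # decoder that samples theta ~ q_t(theta) internally and returns logits.
    rec = F.binary_cross_entropy_with_logits(decoder(z), x, reduction="none").sum(dim=-1)

    # KL(q_phi(z | x) || p(z)) against a standard-normal prior over latents.
    kl_z = 0.5 * torch.sum(torch.exp(z_logvar) + z_mu ** 2 - 1.0 - z_logvar, dim=-1)

    # KL(q_t(theta) || q_{t-1}(theta)) over decoder weights, shared across the task.
    kl_theta = decoder.kl_to_prev()

    # Negative of the full per-task lower bound, averaged per data point.
    return (rec + kl_z).mean() + kl_theta / n_train
```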

5. Related Work

  • Continual Learning for Deep Discriminative Models (regularized MLE): $\theta_t = \arg\max_\theta \sum_{n=1}^{N_t} \log p\big(y_t^{(n)} \mid \theta, x_t^{(n)}\big) - \tfrac{1}{2} \lambda_t \big(\theta - \theta_{t-1}\big)^{\top} \Sigma_{t-1}^{-1} \big(\theta - \theta_{t-1}\big)$

  • ML Estimation - set $\lambda_t = 0$ (no regularization)
  • MAP Estimation - assume a Gaussian prior and use CV to find $\lambda_t$ → catastrophic forgetting
  • Laplace Propagation (LP) (Smola et al., 2004) - recursion for $\Sigma_t$ using Laplace's approximation, $\Sigma_t^{-1} = \Phi_t + \Sigma_{t-1}^{-1}$, where $\Phi_t$ is the Hessian of the negative log-likelihood at $\theta_t$
  • Diagonal LP: retain only the diagonal terms of $\Sigma_t^{-1}$ to avoid computing the full Hessian

  • Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) - modified diagonal LP
  • Approximate the average Hessian of the likelihoods using the Fisher information
  • Regularization term: introduce a hyperparameter $\lambda$, remove the prior term, and regularize toward the intermediate estimates $\theta_1, \ldots, \theta_{t-1}$ rather than only the latest one (see the sketch after the table below)

  • Synaptic Intelligence (SI) (Zenke et al., 2017) - compute $\Sigma_t^{-1}$ using a measure of the importance of each parameter to each task
  • Approximate Bayesian Training of NN (focused on approximating $q(\theta) \approx p(\theta \mid \mathcal{D})$)

|Approach|References|
|-|-|
|extended Kalman filtering|Singhal & Wu, 1989|
|Laplace's approximation|MacKay, 1992|
|variational inference|Hinton & Van Camp, 1993; Barber & Bishop, 1998; Graves, 2011; Blundell et al., 2015; Gal & Ghahramani, 2016|
|sequential Monte Carlo|de Freitas et al., 2000|
|expectation propagation|Hernández-Lobato & Adams, 2015|
|approximate power EP|Hernández-Lobato et al., 2016|
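
For contrast with VCL's KL term, here is a minimal numpy sketch of the EWC-style quadratic penalty described above (a diagonal Fisher approximation weighting distances to previous parameter estimates). The function names and the single shared `lam` are illustrative simplifications, not the original EWC implementation.

```python
# EWC-style regulariser (illustrative only): a diagonal Fisher approximation
# weights how far each parameter may move from the estimate found on a past task.
import numpy as np

def diagonal_fisher(per_example_grads):
    """diag(F) estimated as the mean squared per-example log-likelihood gradient."""
    g = np.asarray(per_example_grads)        # shape: (n_examples, n_params)
    return np.mean(g ** 2, axis=0)

def ewc_penalty(theta, past_thetas, past_fishers, lam=1.0):
    """Sum of quadratic penalties toward every previous task's parameter estimate."""
    total = 0.0
    for theta_old, fisher in zip(past_thetas, past_fishers):
        total += 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)
    return total
```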

  • Continual Learning for Deep Generative Models
  • Naïve approach: apply the VAE objective to $\mathcal{D}_t$ with parameters initialized at $\theta_{t-1}$ → catastrophic forgetting
  • Alternative: add the EWC regularization term to the VAE objective and approximate the marginal likelihood by the variational lower bound
  • Similar approximations can be used for the Hessian matrices required by LP and for SI (importance sampling: Burda et al., 2016)
