# Variational Factorization Machines


Chainer Variational Factorization Machine implementation.

Run `python run_movielens.py` to download, extract, and run the FM model on MovieLens 1M data.

| model | batchsize | rank | intx term | lambda0 | lambda1 | lambda2 | RMSE | Notes |
|-------|-----------|------|-----------|---------|---------|---------|------|-------|
| FM  | 8192 | 0  | N | 0 | 1e-2 | 0    | 0.9305 | Regression with regularization |
| FM  | 8192 | 0  | N | 0 | 0    | 0    | 0.9115 | Regression with no regularization |
| FM  | 8192 | 0  | N | 0 | 1e-3 | 0    | 0.9112 | Regression with less regularization |
| FM  | 8192 | 20 | Y | 0 | 0    | 1e-3 | 0.8633 | FM model w/ 20D latent vector |
| FM  | 8192 | 20 | Y | 0 | 1e-3 | 1e-3 | 0.8618 | FM model w/ 20D latent vector and regularization |
| VFM | 8192 | 20 | Y | 0 | 1e-3 | 1e-3 | 0.8625 | Variational FM model with arbitrary regularization |
| VFM | 8192 | 20 | Y | 1 | 1    | 1    | 0.8620 | Variational FM model with default priors |
| VFM | 8192 | 20 | Y | 1 | 1    | 1    | 0.8585 | Variational FM model with grouping |
| VFM | 8192 | 64 | Y | 1 | 1    | 1    | 0.8800 | Higher rank model does worse |

After Dec 19:

| model | batchsize | rank | intx term | lambda0 | lambda1 | lambda2 | RMSE | Notes |
|-------|-----------|------|-----------|---------|---------|---------|------|-------|
| VFM | 4096 | 8 | Y | 0   | 0   | 0   | 0.8782 | no regularization |
| VFM | 4096 | 8 | Y | 0   | 1   | 1   | 0.8775 | |
| VFM | 4096 | 8 | Y | 1   | 1   | 1   | 0.8870 | with alpha=1e-2, fast but inaccurate |
| VFM | 4096 | 8 | Y | 10  | 10  | 10  | 0.8628 | more regularization than default |
| VFM | 4096 | 8 | Y | 1   | 1   | 1   | 0.8805 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 10  | 1   | 1   | 0.8805 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 1   | 10  | 1   | 0.8793 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 1   | 1   | 10  | 0.8623 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 10  | 10  | 10  | 0.8619 | added 300 epochs to 10-10-10 run |
| VFM | 4096 | 8 | Y | 0   | 1   | 10  | 0.8629 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 0   | 0   | 10  | 0.8793 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 0   | 0   | 1   | 0.8815 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 1   | 1   | 50  | 0.8561 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 0   | 1   | 50  | 0.8561 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 0   | 1   | 100 | 0.8672 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 1   | 1   | 100 | 0.8673 | default, initialized from 10-10-10 run for 200->500 epochs |
| VFM | 4096 | 8 | Y | 100 | 100 | 100 | 0.8708 | initialized from 10-10-10 model |

Yamada [1] reports the following root mean squared errors (RMSE) on a 25% test split of the same ML-1M dataset:

| Model | RMSE |
|-------|------|
| libFM ALS | 0.981 |
| libFM SGD | 0.943 |
| libFM MCMC 0.05 | 0.877 |
| CFM | 0.866 |
| CFM (BCD) | 0.850 |
| libFM MCMC 0.10 | 0.846 |

[1] https://arxiv.org/pdf/1507.01073.pdf

## Discussion

Within the Variational FM framework we can get more than a good point estimate: we can also get an estimate of the mean and variance of a single feature. This means we can estimate the variance conditioned on a few active features (e.g. conditioned on a single user) and retrieve the most uncertain item for that user. The idea here is to switch inference from a gradient descent model (which makes point estimates) to variational stochastic gradient descent (which estimates approximate posteriors) in order to build an active learning model.

## Regression

For typical linear regression with interactions we have:

$$y = \beta_0 + \sum_i \beta_i x_i + \sum_i \sum_{j > i} w_{ij} x_i x_j$$

Note that x_i is usually a sparse feature vector (but doesn't have to be). In the land of recommenders, we're usually interested in the coefficient w_ij in front of an interaction such as x_i x_j, where x_i might be a dummy-encoded user id and x_j an item id. The big problem here is that w_ij is quadratic in the number of features (e.g. # of users + # of items), so there are lots of parameters to estimate from sparse observations. (Note: we've also left off any regularization, but might choose to L2-penalize w_ij or beta_i.)
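To make that blow-up concrete, here is a minimal NumPy sketch (not code from this repository; the sizes and names are hypothetical) of regression with an explicit w_ij interaction matrix over dummy-encoded user and item features:

```python
import numpy as np

# Hypothetical sizes, not taken from this repo: 1000 users + 2000 items,
# each dummy-encoded as one feature.
n_users, n_items = 1000, 2000
n_features = n_users + n_items

rng = np.random.default_rng(0)
beta0 = 0.0
beta = rng.normal(size=n_features)
# Explicit interaction weights: one w_ij per feature pair -> O(N^2) parameters
# (here 3000 x 3000 = 9M), most of which are never observed together.
W = rng.normal(size=(n_features, n_features))

def predict(x):
    """y = beta_0 + sum_i beta_i x_i + sum_{i<j} w_ij x_i x_j"""
    nz = np.flatnonzero(x)  # only the active (dummy-encoded) features matter
    interactions = sum(W[i, j] * x[i] * x[j]
                       for a, i in enumerate(nz) for j in nz[a + 1:])
    return beta0 + beta @ x + interactions

# One rating event: only user 19 and item 45 are active
x = np.zeros(n_features)
x[19] = 1.0
x[n_users + 45] = 1.0
print(predict(x))
```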

## FMs

FMs fix this with a low-rank approximation to w_ij: w_ij ≈ v_i0 * v_j0 + ... + v_ik * v_jk = ⟨v_i, v_j⟩, where each feature i has a latent rank-k vector v_i. Instead of computing an N x N matrix of w_ij, we compute N x k parameters in the form of N vectors v_i, yielding a new objective function:

$$y = \beta_0 + \sum_i \beta_i x_i + \sum_i \sum_{j > i} \langle \vec{v}_i, \vec{v}_j \rangle\, x_i x_j$$
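As a sketch of the math (the repo's model is written in Chainer; this NumPy snippet only illustrates the prediction, and its sizes are hypothetical), the pairwise term can be computed in O(Nk) per example using the standard identity sum_{i<j} ⟨v_i, v_j⟩ x_i x_j = 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, k = 3000, 20          # hypothetical: N features, rank-k latent vectors

beta0 = 0.0
beta = rng.normal(scale=0.01, size=n_features)
V = rng.normal(scale=0.01, size=(n_features, k))   # one k-dim vector per feature

def fm_predict(x):
    """FM prediction: y = b0 + sum_i b_i x_i + sum_{i<j} <v_i, v_j> x_i x_j."""
    linear = beta0 + beta @ x
    xv = x @ V                     # shape (k,): sum_i v_if x_i for each factor f
    x2v2 = (x ** 2) @ (V ** 2)     # shape (k,): sum_i v_if^2 x_i^2
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return linear + pairwise

x = np.zeros(n_features)
x[19] = 1.0      # user 19
x[1045] = 1.0    # item 45, offset by 1000 users
print(fm_predict(x))
```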

## VFMs

In variational FMs we impose a bit more hierarchy by grouping feature vectors, and we swap out L2 regularization for Gaussian priors:

$$\beta_i \sim \mathcal{N}(\mu_\beta, \sigma_\beta)$$

$$\vec{v}_i \sim \mathcal{N}(\vec{\mu}_v, \vec{\sigma}_v)$$

And then we group these (hyper)priors together, assuming a normal prior with unit variance. The vectors v_i are drawn from a multivariate prior with a diagonal covariance matrix. The assumption is that there's a group feature vector, and individual feature vectors need evidence to deviate from that group vector. The log-normal prior on the variance isn't the most disciplined choice (an inverse Wishart, I think?) but it is convenient and amenable to stochastic variational Bayes inference.

$$\mu_\beta \sim \mathcal{N}(0, 1), \qquad \log\sigma_\beta \sim \mathcal{N}(0, 1)$$

As you can see in the results table, shrinking to the groups improves the test set score (0.8620 → 0.8585).

This forms a deep model: the hyperpriors mu_beta and sigma_beta pick the group mean and group variance from which the individual beta_i and v_i are drawn. In variational inference, those beta_i and v_i in turn have their own means and variances, so we're not just point-estimating beta_i but in fact estimating mu_beta_i and sigma_beta_i. If you're curious how this mode of inference works, read this or this for the trick in 140 characters -- it's at the heart of Bayesian deep learning techniques.
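For intuition, here is a minimal NumPy sketch of that trick, the reparameterization of a Gaussian posterior sample (the variable names and values are hypothetical, not this repo's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variational parameters for one weight beta_i: instead of a
# point estimate, the model stores a posterior mean and log standard deviation.
mu_beta_i = 0.3
log_sigma_beta_i = -2.0

def sample_beta_i():
    """Reparameterization trick: beta_i = mu + sigma * eps, with eps ~ N(0, 1).
    The sample is a deterministic, differentiable function of (mu, log_sigma),
    so gradients from the loss flow back into both variational parameters."""
    eps = rng.standard_normal()
    return mu_beta_i + np.exp(log_sigma_beta_i) * eps

samples = np.array([sample_beta_i() for _ in range(10000)])
print(samples.mean(), samples.std())  # ~ mu_beta_i and ~ exp(log_sigma_beta_i)
```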

## Model uncertainty

With estimates of mu_v_i = E[v_i] and sigma_v_i^2 = Var[v_i] we finally get the critical ingredient for active learning on FMs -- an uncertainty estimate around the feature vector v_i. But we need the uncertainty for the whole model, which is composed of interactions on v_i:

$$\mathrm{Var}[v_i v_j] = \sigma_{v_i}^2 \sigma_{v_j}^2 + \sigma_{v_i}^2 \mu_{v_j}^2 + \sigma_{v_j}^2 \mu_{v_i}^2$$

Note that the above is just the identity for the variance of a product of two independent random variables. Technically v_i is a vector, but its components are independent, so we can replace v_i above with an arbitrary component of that vector:

$$\mathrm{Var}[v_{ik} v_{jk}] = \sigma_{v_{ik}}^2 \sigma_{v_{jk}}^2 + \sigma_{v_{ik}}^2 \mu_{v_{jk}}^2 + \sigma_{v_{jk}}^2 \mu_{v_{ik}}^2$$

The beta components do not covary with the v_i components, so the full model variance decomposes into the sum of the individual variances:

$$\mathrm{Var}[y] = \sigma_{\beta_0}^2 + \sum_i x_i^2\, \sigma_{\beta_i}^2 + \sum_i \sum_{j > i} x_i^2 x_j^2 \sum_k \mathrm{Var}[v_{ik} v_{jk}]$$

We've used the fact that beta and v_i are independent to sum the variances independently.

So to pick the next question we can rank by the above measure and take the highest-variance question. The observed features x_i x_j are known for each trial (they're usually just the user ID and item ID), and the means mu and variances sigma are easily accessible model parameters.

## Example

For a concrete example, suppose we're interested in user 19 (i.e. x_19 = 1) and want to know how uncertain we are about how item 45 might be rated (if there are 1000 users, then item 45 is feature 1000 + 45, so x_1045 = 1):

$$\mathrm{Var}[y] = \sigma_{\beta_0}^2 + \sigma_{\beta_{19}}^2 + \sigma_{\beta_{1045}}^2 + \sum_k \left( \sigma_{v_{19,k}}^2 \sigma_{v_{1045,k}}^2 + \sigma_{v_{19,k}}^2 \mu_{v_{1045,k}}^2 + \sigma_{v_{1045,k}}^2 \mu_{v_{19,k}}^2 \right)$$

The interpretation is that the variance is driven by an overall constant term, the summed variances of the user and the item, and then interaction terms that combine pairs of feature variances and combine feature means with variances.
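As a sketch of how this could be computed (the parameter arrays and function names here are hypothetical, not this repo's API), assuming independent Gaussian posteriors as above:

```python
import numpy as np

def rating_variance(i, j, sigma_beta0, sigma_beta, mu_V, sigma_V):
    """Predictive variance for a rating with exactly two active features i and j
    (e.g. a user index and an offset item index):
    Var[y] = Var[beta_0] + Var[beta_i] + Var[beta_j]
             + sum_k (s_ik^2 s_jk^2 + s_ik^2 m_jk^2 + s_jk^2 m_ik^2)."""
    s_i2, s_j2 = sigma_V[i] ** 2, sigma_V[j] ** 2
    m_i2, m_j2 = mu_V[i] ** 2, mu_V[j] ** 2
    interaction = np.sum(s_i2 * s_j2 + s_i2 * m_j2 + s_j2 * m_i2)
    return sigma_beta0 ** 2 + sigma_beta[i] ** 2 + sigma_beta[j] ** 2 + interaction

# Toy parameters (made up, not read from a trained model)
rng = np.random.default_rng(0)
n_features, k = 3000, 8
sigma_beta = np.abs(rng.normal(scale=0.1, size=n_features))
mu_V = rng.normal(scale=0.1, size=(n_features, k))
sigma_V = np.abs(rng.normal(scale=0.1, size=(n_features, k)))
print(rating_variance(19, 1045, 0.05, sigma_beta, mu_V, sigma_V))
```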

## Next steps

We should be able to emulate active learning within this dataset: at training time, instead of drawing the next example randomly from the dataset, use this model to rank the available training data and re-train using only the most informative datapoint at every timestep.
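A minimal sketch of that emulation loop, assuming hypothetical `fit` and `predictive_variance` callables rather than this repo's actual API:

```python
import numpy as np

def active_learning_loop(pool_X, pool_y, init_idx, n_rounds, fit, predictive_variance):
    """Greedy active-learning emulation (a sketch, under assumed interfaces):
      fit(X, y) -> model
      predictive_variance(model, X) -> per-row Var[y]
    Each round moves the highest-variance pool example into the training set."""
    train_idx = list(init_idx)
    seen = set(train_idx)
    pool_idx = [i for i in range(len(pool_y)) if i not in seen]
    model = None
    for _ in range(n_rounds):
        model = fit(pool_X[train_idx], pool_y[train_idx])
        variances = predictive_variance(model, pool_X[pool_idx])
        pick = pool_idx[int(np.argmax(variances))]  # most informative datapoint
        train_idx.append(pick)
        pool_idx.remove(pick)
    return model, train_idx
```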
