Skip to content

LiuShifeng/Matrix_Factor_Python

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Matrix_Factor_Python

Notes for mf.py

  • This code is based on the version of Sapphire1211(https://github.com/Sapphire1211). Thanks a lot!
  • This code was developed with python version 2.7.8
  • This code is a example to know how SGD works on the matrix factorization.
  • This code hold a example in it. You can run it directly.

Notes for PMF.py

  • This code is based on the version of Sapphire1211(https://github.com/Sapphire1211). Thanks a lot!
  • This code was developed with python version 2.7.8
  • This code suits measurements of both RMSE and Top-K.The measurement of Top-K is under development.
  • Data is taken from a subsample of the MovieLens 100K data set.More details is under below.

Notes for dsgd_mf.py && eval_acc.py

  • This code is based on the version of Anttthea(https://github.com/Anttthea). Thanks a lot!
  • This code was developed with python version 2.7.8
  • This code suits measurements of both RMSE and Top-K.The measurement of Top-K is under development.
  • Data is taken from a subsample of the MovieLens 100K data set. It is in the triples format:
<user_1>,<movie_1>,<rating_11>
...
<user_i>,<movie_j>,<rating_ij>
...
<user_M>,<movie_N>,<rating_ij>

Here, user i is an integer ID for the ith user, movie j is an integer ID for the jth movie, and rating ij is the rating given by user i to movie j. For example, the content of the input file to the script will look like:

196,242,3
186,302,3
22,377,1
244,51,2
166,346,1
298,474,4
...
276,1090,1
13,225,2
12,203,3

Pre requisites

As pre requisites you need

  • python 2.7
  • pyspark installation version spark-1.4.0-bin-hadoop2.4

Big Matrix Factorization using Spark

  • This script implements Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent (DSGD-MF) in Spark.
  • The reference paper sets forth a solution for matrix factorization using minimization of sum of local losses. The solution involves dividing the matrix into strata for each iteration and performing sequential stochastic gradient descent within each stratum in parallel. The two losses considered are the plain non-zero square loss and the non-zero square loss with L2 regularization of parameters W and H.
  • DSGD-MF is a fully distributed algorithm i.e. both the data matrix V and factor matrices W and H can be carefully split and distributed to multiple workers for parallel computation without communication costs between the workers. Hence, it is a good match for implementation in a distributed in-memory data processing system like Spark.
  • if Wm = 0, it is the same to the basic version of Matrix Factorization. Otherwise, it is used for recommend system of recommending the topK items.

Usage

>>> spark-submit dsgd_mf.py <num_factors> <num_workers> <num_iterations> <beta_value> <lambda_value> <Wm_value> <inputV_filepath> <outputW_filepath> <outputH_filepath>

Notice: It is recommended using the input file on hdfs and output file on local

Example:

>>> spark-submit dsgd_mf.py 100 10 50 0.8 1.0 0.2 u.csv w.csv h.csv
>>> python eval_acc.py u.csv w.csv h.csv
Reconstruction error: 0.826437682942

Reference

Gemulla, Rainer, et al. "Large-scale matrix factorization with distributed stochastic gradient descent." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.
Yang X, Steck H, Guo Y, et al. On top-k recommendation using social networks[C]//Proceedings of the sixth ACM conference on Recommender systems. ACM, 2012: 67-74.
Yang X, Guo Y, Liu Y, et al. A survey of collaborative filtering based social recommender systems[J]. Computer Communications, 2014, 41: 1-10.

About

Matrix Factor Methods using SGD based on Spark and standalone

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages