The Netflix Prize was an open competition to find the best collaborative filtering algorithm for predicting how customers would rate specific movies, given rating data on thousands of movies and customers. The winning team was announced in September 2009 and awarded $1,000,000. More information is available at https://en.wikipedia.org/wiki/Netflix_Prize.
In my Software Engineering class (CS 373), we were tasked with the same competition. The goal of our assignment was to design an algorithm whose predictions achieve a root mean squared error (RMSE) of less than 1.00.
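For reference, RMSE is the square root of the mean of the squared differences between predicted and actual ratings. The following is a minimal sketch of that calculation; the function name `rmse` and its signature are illustrative and not necessarily those used in Netflix.py.

```python
from math import sqrt


def rmse(predictions, actuals):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predictions) != len(actuals) or len(predictions) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sqrt(sum(squared_errors) / len(squared_errors))
```

A set of predictions that is off by exactly one star on every rating has an RMSE of 1.00, so the assignment target means doing better than that on average.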
We were given the following:
Training Data:
- 17,770 movies
- 480,189 customers
- about 100,000,000 ratings
- about 5,600 ratings per movie
- about 200 ratings per customer
Probe Data:
- 1,425,333 ratings
- subset of training data used to test prediction algorithms
Movie Data:
- 17,770 movies
- Title and year of release for each movie
The Training Data was separated into four different text files. Each file started with the movie id (followed by a colon), and the following lines each contained a customer id, rating, and date rated for that movie. More movie blocks followed. Each of the four files contained approximately an equal number of movies.
    2043:
    716091,2,2003-10-02
    1990901,5,2001-09-27
    1481271,3,2000-09-09
    2098867,4,2005-07-12
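A training file in this format could be read along the following lines. This is only a sketch: `parse_training` is a hypothetical helper, it assumes one record per line as described above, and it ignores the rating date.

```python
def parse_training(lines):
    """Return {movie_id: {customer_id: rating}} from training-file lines."""
    ratings = {}
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A line such as "2043:" opens a new movie block.
            movie_id = int(line[:-1])
            ratings[movie_id] = {}
        else:
            if movie_id is None:
                raise ValueError("rating line found before any movie id")
            customer_id, rating, _date = line.split(",")
            ratings[movie_id][int(customer_id)] = int(rating)
    return ratings


sample_training = [
    "2043:",
    "716091,2,2003-10-02",
    "1990901,5,2001-09-27",
]
parsed_training = parse_training(sample_training)
```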
The Probe Data was structured like the training data, but without ratings or dates. Each movie block started with a movie id (followed by a colon), and the following lines each contained a single customer id.
    2043:
    1417435
    1828683
    818484
    10851:
    1417435
    2312054
    462685
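Reading the probe format reduces to collecting (movie id, customer id) pairs, which are the ratings to be predicted. A rough sketch, with `parse_probe` as a hypothetical name:

```python
def parse_probe(lines):
    """Return a list of (movie_id, customer_id) pairs in file order."""
    pairs = []
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Start of a new movie block.
            movie_id = int(line[:-1])
        else:
            if movie_id is None:
                raise ValueError("customer id found before any movie id")
            pairs.append((movie_id, int(line)))
    return pairs


sample_probe = ["2043:", "1417435", "1828683", "10851:", "1417435"]
probe_pairs = parse_probe(sample_probe)
```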
The Movie Data was one text file with a movie id, year released, and a title on each line.
    2043,1953,Shane
    10851,1948,Red River
    16306,1960,Spartacus
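The movie file can be parsed one line at a time. The sketch below uses a hypothetical `parse_movies` helper and assumes every line has a numeric year; it splits on only the first two commas so that a title containing a comma would stay intact.

```python
def parse_movies(lines):
    """Return {movie_id: (year_released, title)} from movie-file lines."""
    movies = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # maxsplit=2 keeps any commas inside the title itself.
        movie_id, year, title = line.split(",", 2)
        movies[int(movie_id)] = (int(year), title)
    return movies


sample_movies = ["2043,1953,Shane", "10851,1948,Red River"]
movies = parse_movies(sample_movies)
```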
I used the training data to create caches (Python dictionaries) stored as pickle files. Examples of the caches are the average customer rating {(int) customer_id : (float) avg_rating} and the average movie rating {(int) movie_id : (float) avg_rating}. Because the caches are precomputed, predictions can look up these values directly instead of rescanning the training data, which made predicting both faster and more accurate.
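As an illustration of the idea, the two average caches could be built from a {movie_id: {customer_id: rating}} dictionary and written out with `pickle` roughly as follows. The helper names are hypothetical and this is not the actual contents of createCaches.py.

```python
import os
import pickle
import tempfile


def build_average_caches(ratings_by_movie):
    """Return (avg_movie_ratings, avg_customer_ratings) as two dictionaries."""
    avg_movie = {}
    customer_totals = {}  # customer_id -> [sum of ratings, number of ratings]
    for movie_id, by_customer in ratings_by_movie.items():
        if by_customer:
            avg_movie[movie_id] = sum(by_customer.values()) / len(by_customer)
        for customer_id, rating in by_customer.items():
            totals = customer_totals.setdefault(customer_id, [0, 0])
            totals[0] += rating
            totals[1] += 1
    avg_customer = {c: s / n for c, (s, n) in customer_totals.items()}
    return avg_movie, avg_customer


def dump_cache(cache, path):
    """Write a cache dictionary to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(cache, f)


def load_cache(path):
    """Read a cache dictionary back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy data: two movies, two customers.
toy_ratings = {1: {10: 5, 11: 3}, 2: {10: 1, 11: 3}}
avg_movie, avg_customer = build_average_caches(toy_ratings)

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "avgMovieRatings.p")
    dump_cache(avg_movie, cache_path)
    restored = load_cache(cache_path)
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.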
Netflix.py - utilizes caches to generate predictions
RunNetflix.in - subset of probe data, used for testing
RunNetflix.out - prediction results for RunNetflix.in, RMSE printed at bottom
RunNetflix.py - uses Netflix.py to solve for predictions
TestNetflix.out - testing results
TestNetflix.py - contains 22 unit tests, testing read, print, predict, rmse, solve, and cache
makefile - used for automated building
probe.out - prediction results for probe.txt data, RMSE printed at bottom
probe.txt - subset of training data
caches/
createCaches.py - Creates dictionary caches using given data, dumps caches into pickle files
movieYears.p - Year in which each movie was released {(int) movie_id : (int) year_released}
ratingsMovies.p - Training Data Ratings {(int) movie_id : {(int) cust_id : (int) actual_rating} }
ratingsCustomers.p - Training Data Ratings {(int) cust_id : {(int) movie_id : (int) actual_rating} }
avgCustomerRatings.p - Average Customer Rating from Training Data {(int) customer_id : (float) avg_rating}
avgMovieRatings.p - Average Movie Rating from Training Data {(int) movie_id : (float) avg_rating}
yearsSinceRelease.p - Contains how many years have passed since movie release at the time of rating { (int) movie_id : {(int) cust_id : (int) years_passed} }
moviePredictionErrorCorrelations.p - Contains correlations between the prediction (Approach 1) errors of the three most-watched movies. Example: if, across customers, prediction Approach 1 tends to err in the same direction (guessing above or below the actual rating) for a pair of movies, the correlation for that pair is positive. { (int) movie_id : {(int) movie_id : (float) correlation} }
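One way such a correlation could be computed is sketched below. It assumes a Pearson correlation over the customers who rated both movies, which is a guess at the measure used; `error_correlation` is a hypothetical name, and each argument maps customer id to that customer's signed Approach 1 error on one movie.

```python
from math import sqrt


def error_correlation(errors_a, errors_b):
    """Pearson correlation of two movies' errors over shared customers."""
    shared = sorted(set(errors_a) & set(errors_b))
    if len(shared) < 2:
        return 0.0  # not enough overlap to estimate a correlation
    xs = [errors_a[c] for c in shared]
    ys = [errors_b[c] for c in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant errors carry no directional information
    return cov / sqrt(var_x * var_y)
```

A value near 1 means the errors on the two movies tend to move together, a value near -1 means they tend to move in opposite directions, and 0 is returned here when there is too little data to say.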
- Overall Average - The overall average rating of all movies and customers
- Customer offset - The amount by which the average rating for a given customer exceeds the overall average
- Movie offset - The amount by which the average rating for a given movie exceeds the overall average
PREDICTED_RATING = OVERALL_AVG + CUSTOMER_OFFSET + MOVIE_OFFSET
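The formula above translates almost directly into code. In this sketch the function name is hypothetical, and clamping the result to the 1 to 5 star range is an added assumption rather than something stated in the formula.

```python
def predict_approach_1(overall_avg, customer_avg, movie_avg):
    """Baseline prediction: overall average plus customer and movie offsets."""
    customer_offset = customer_avg - overall_avg
    movie_offset = movie_avg - overall_avg
    prediction = overall_avg + customer_offset + movie_offset
    # Assumption: keep the prediction inside the valid 1..5 rating range.
    return min(5.0, max(1.0, prediction))


# Overall average 3.7, a generous customer (4.2), a below-average movie (3.1):
# 3.7 + 0.5 - 0.6 gives roughly 3.6.
example_prediction = predict_approach_1(3.7, 4.2, 3.1)
```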
Thoughts behind approach:
- The overall average serves as a good baseline prediction.
- Customers whose average rating is higher than the overall average are (presumably) more likely to rate any given movie higher than its average.
- Movies whose average rating is higher than the overall average are (presumably) more likely to be rated higher than the customer average rating.
- Each movie is similar to other movies to some degree.
- If prediction approach 1 was too high (or low) on other movies a customer watched, then, depending on how similar those movies are to this one, approach 1 is likely to predict too high (or low) on this movie as well.
I used the correlations between the prediction approach 1 errors of the top 3 most watched movies to enhance prediction approach 1.
APPROACH_2_OFFSET = CORRELATION(OTHER_MOVIE_WATCHED, THIS_MOVIE) * OTHER_MOVIE_PREDICTION_ERROR
PREDICTED_RATING = OVERALL_AVG + CUSTOMER_OFFSET + MOVIE_OFFSET + APPROACH_2_OFFSET
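A sketch of how the offset might be applied is shown below. Several details are assumptions: the error is taken as actual minus predicted so that adding the offset moves the prediction toward the actual rating, the terms are averaged when a customer has rated more than one of the correlated movies, and the function names are hypothetical.

```python
def approach_2_offset(this_movie, customer_errors, correlations):
    """Average of correlation * error over the customer's other rated movies.

    customer_errors: {other_movie_id: actual - approach_1_prediction}
    correlations:    {movie_id: {movie_id: correlation}}
    """
    terms = []
    for other_movie, error in customer_errors.items():
        if other_movie == this_movie:
            continue
        correlation = correlations.get(other_movie, {}).get(this_movie)
        if correlation is not None:
            terms.append(correlation * error)
    # Assumption: average the terms; fall back to no correction without data.
    return sum(terms) / len(terms) if terms else 0.0


def predict_approach_2(approach_1_prediction, offset):
    """Add the offset to the baseline and clamp to the 1..5 rating range."""
    return min(5.0, max(1.0, approach_1_prediction + offset))


# Approach 1 overshot this customer's rating of movie 2043 by one star
# (error = actual - predicted = -1.0), and 2043 correlates 0.5 with 10851.
toy_correlations = {2043: {10851: 0.5}}
toy_errors = {2043: -1.0}
offset = approach_2_offset(10851, toy_errors, toy_correlations)
adjusted = predict_approach_2(3.6, offset)
```

In this toy case the baseline for movie 10851 is lowered by half a star, because the baseline was too high on a positively correlated movie.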
The actual benefit of this approach was minimal, but with more time and resources, more correlations could be calculated and used.
- Add more (or all) movie pair correlations
- Incorporate Customer pair correlations (similar to movie pair correlations)