NETFLIX PRIZE

The Netflix Prize was an open competition to find the best collaborative filtering algorithm for predicting how customers would rate specific movies, given rating data on thousands of movies and customers. The winning team was announced in September 2009 and awarded $1,000,000. More information is available at https://en.wikipedia.org/wiki/Netflix_Prize.

In my Software Engineering class (CS 373), we were given a version of the same challenge. The goal of our assignment was to design an algorithm whose predictions achieved an RMSE (root mean squared error) of less than 1.00.
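For reference, RMSE is the square root of the mean of the squared differences between predicted and actual ratings. The snippet below is a minimal sketch of that definition, not the rmse function in Netflix.py; the function name and the toy ratings are assumptions made for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    # Sum of squared differences, pair by pair.
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


# Toy data (not from the probe set): one of three predictions is off by 1.
print(rmse([3.0, 4.0, 5.0], [3, 5, 5]))
```

Because the errors are squared before averaging, a few large misses hurt the score more than many small ones.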

We were given the following:

Training Data:

17,770 movies
480,189 customers
about 100,000,000 ratings
about 5,600 ratings per movie
about 200 ratings per customer

Probe Data:

1,425,333 ratings
subset of training data used to test prediction algorithms

Movie Data:

17,770 movies
Title and year of release for each movie

The Training Data was split across four text files, each containing approximately the same number of movies. Within a file, each movie block started with the movie id (followed by a colon), and each subsequent line held a customer id, rating, and rating date for that movie. Further movie blocks followed in the same format.

2043:
716091,2,2003-10-02
1990901,5,2001-09-27
1481271,3,2000-09-09
2098867,4,2005-07-12
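A block in this format can be read with a small state machine that remembers the most recent movie id. This is a sketch under my own naming (parse_training is hypothetical and is not taken from createCaches.py); it builds the same shape as the ratingsMovies.p cache described below and ignores the date field.

```python
from collections import defaultdict


def parse_training(lines):
    """Return {movie_id: {customer_id: rating}} from training-file lines."""
    ratings = defaultdict(dict)
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A header line such as "2043:" starts a new movie block.
            movie_id = int(line[:-1])
        else:
            if movie_id is None:
                raise ValueError("rating line before any movie header")
            customer_id, rating, _date = line.split(",")
            ratings[movie_id][int(customer_id)] = int(rating)
    return dict(ratings)


sample = """2043:
716091,2,2003-10-02
1990901,5,2001-09-27
1481271,3,2000-09-09
2098867,4,2005-07-12
""".splitlines()

ratings_movies = parse_training(sample)
print(ratings_movies[2043][716091])
```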

The Probe Data was structured like the training data, but without ratings or dates. Each movie block started with a movie id (followed by a colon), followed by lines that each contained one customer id.

The Movie Data was one text file with a movie id, year released, and a title on each line.

2043,1953,Shane
10851,1948,Red River
16306,1960,Spartacus
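One way to read these lines is sketched below. The function name is hypothetical. Splitting at most twice is a defensive assumption so that a title containing a comma stays intact, and treating a non-numeric year as missing is likewise an assumption; neither case appears in the three sample lines above.

```python
def parse_movie_line(line):
    """Split 'movie_id,year,title' into (int, int or None, str)."""
    # maxsplit=2 keeps any commas inside the title together.
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    # Assumption: a year that is not all digits is treated as unknown.
    year_value = int(year) if year.isdigit() else None
    return int(movie_id), year_value, title


lines = ["2043,1953,Shane", "10851,1948,Red River", "16306,1960,Spartacus"]

# Same shape as the movieYears.p cache: {movie_id: year_released}.
movie_years = {}
for line in lines:
    movie_id, year, _title = parse_movie_line(line)
    movie_years[movie_id] = year

print(movie_years)
```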

I used the training data to build caches (Python dictionaries), each stored as a pickle file. Examples include the average customer rating {(int) customer_id : (float) avg_rating} and the average movie rating {(int) movie_id : (float) avg_rating}. These caches allowed me to predict ratings both more quickly and more accurately.
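The sketch below shows how such caches could be derived and stored. It is not the contents of createCaches.py: the helper names are hypothetical, the ratings are toy values, and the pickle file is written to a temporary directory rather than to caches/.

```python
import os
import pickle
import tempfile


def average_ratings(ratings):
    """{outer_id: {inner_id: rating}} -> {outer_id: mean rating}."""
    return {key: sum(inner.values()) / len(inner)
            for key, inner in ratings.items() if inner}


def invert(ratings_movies):
    """{movie_id: {cust_id: rating}} -> {cust_id: {movie_id: rating}}."""
    ratings_customers = {}
    for movie_id, by_customer in ratings_movies.items():
        for cust_id, rating in by_customer.items():
            ratings_customers.setdefault(cust_id, {})[movie_id] = rating
    return ratings_customers


# Toy ratings, invented for illustration only.
ratings_movies = {2043: {716091: 2, 1990901: 5}, 10851: {716091: 4}}
avg_movie = average_ratings(ratings_movies)
avg_customer = average_ratings(invert(ratings_movies))

# Round trip through pickle, as a cache file would be written and later read.
path = os.path.join(tempfile.mkdtemp(), "avgMovieRatings.p")
with open(path, "wb") as f:
    pickle.dump(avg_movie, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == avg_movie)
```

Computing the averages once and loading them at prediction time replaces a pass over the ratings with a dictionary lookup.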

FILE DESCRIPTIONS:

Netflix.py - utilizes caches to generate predictions

RunNetflix.in - subset of probe data, used for testing

RunNetflix.out - prediction results for RunNetflix.in, RMSE printed at bottom

RunNetflix.py - uses Netflix.py to solve for predictions

TestNetflix.out - testing results

TestNetflix.py - contains 22 unit tests, testing read, print, predict, rmse, solve, and cache

makefile - used for automated building

probe.out - prediction results for probe.txt data, RMSE printed at bottom

probe.txt - subset of training data

caches/

createCaches.py - Creates dictionary caches using given data, dumps caches into pickle files

movieYears.p - Year in which each movie was released {(int) movie_id : (int) year_released}

ratingsMovies.p - Training Data Ratings {(int) movie_id : {(int) cust_id : (int) actual_rating} }

ratingsCustomers.p - Training Data Ratings {(int) cust_id : {(int) movie_id : (int) actual_rating} }

avgCustomerRatings.p - Average Customer Rating from Training Data {(int) customer_id : (float) avg_rating}

avgMovieRatings.p - Average Movie Rating from Training Data {(int) movie_id : (float) avg_rating}

yearsSinceRelease.p - Contains how many years have passed since movie release at the time of rating { (int) movie_id : {(int) cust_id : (int) years_passed} }

moviePredictionErrorCorrelations.p - Contains correlations between the prediction (Approach 1) errors of the three most-watched movies. Example: if, across customers, prediction Approach 1 tends to err in the same direction (guessing above or below the actual rating) for a pair of movies, the correlation for that pair is positive. { (int) movie_id : {(int) movie_id : (float) correlation} }

PREDICTION APPROACH 1:

Overall Average - The overall average rating of all movies and customers
Customer offset - The amount by which the average rating for a given customer exceeds the overall average
Movie offset - The amount by which the average rating for a given movie exceeds the overall average

PREDICTED_RATING = OVERALL_AVG + CUSTOMER_OFFSET + MOVIE_OFFSET

Thoughts behind approach:

The overall average serves as a good baseline prediction.
Customers whose average rating is higher than the overall average are (presumably) more likely to rate any given movie higher than its average.
Movies whose average rating is higher than the overall average are (presumably) more likely to be rated higher than the customer average rating.
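The formula above can be sketched directly from the two average-rating caches. The function name and the toy averages are assumptions, and clamping the result to the 1 to 5 star range is my addition rather than something stated in this README.

```python
def predict_approach_1(overall_avg, avg_customer, avg_movie, cust_id, movie_id):
    """OVERALL_AVG + CUSTOMER_OFFSET + MOVIE_OFFSET, clamped to [1, 5]."""
    customer_offset = avg_customer[cust_id] - overall_avg
    movie_offset = avg_movie[movie_id] - overall_avg
    prediction = overall_avg + customer_offset + movie_offset
    # Assumption: keep the prediction inside the valid star range.
    return min(5.0, max(1.0, prediction))


# Toy values: a customer who rates 0.5 above average,
# and a movie rated 0.25 below average.
overall_avg = 3.5
avg_customer = {716091: 4.0}
avg_movie = {2043: 3.25}

print(predict_approach_1(overall_avg, avg_customer, avg_movie, 716091, 2043))
```

Note that the sum simplifies to customer average plus movie average minus overall average, so a generous customer and a well-liked movie can push the raw value above 5.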

PREDICTION APPROACH 2:

Each movie is similar to other movies to some degree.
If prediction approach 1 was too high (or low) on other movies a customer watched, then, depending on how similar those movies are to this one, approach 1 is likely to be too high (or low) on this movie as well.

I used the correlations between the prediction approach 1 errors of the top 3 most watched movies to enhance prediction approach 1.
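As an illustration of what such a correlation measures, the sketch below computes a Pearson correlation between the Approach 1 errors of two movies, using only customers who rated both. This README does not say which correlation measure was used, so Pearson is an assumption, as are the function name and the toy errors.

```python
import math


def error_correlation(errors_a, errors_b):
    """Pearson correlation of two {cust_id: error} dicts over shared customers."""
    shared = sorted(set(errors_a) & set(errors_b))
    if len(shared) < 2:
        return 0.0  # Assumption: too little overlap means no usable signal.
    xs = [errors_a[c] for c in shared]
    ys = [errors_b[c] for c in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # Constant errors have no defined correlation.
    return cov / math.sqrt(var_x * var_y)


# Toy errors: Approach 1 misses in the same direction for every shared customer.
errors_movie_a = {1: 0.5, 2: -0.5, 3: 1.0}
errors_movie_b = {1: 1.0, 2: -1.0, 3: 2.0, 4: 0.25}

print(error_correlation(errors_movie_a, errors_movie_b))
```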

APPROACH_2_OFFSET = CORRELATION(OTHER_MOVIE_WATCHED, THIS_MOVIE) * OTHER_MOVIE_PREDICTION_ERROR

PREDICTED_RATING = OVERALL_AVG + CUSTOMER_OFFSET + MOVIE_OFFSET + APPROACH_2_OFFSET

The actual benefit of this approach was minimal, but given more time and resources, more correlations could be calculated and used.
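A sketch of how the offset could be applied is shown below. Several details are assumptions that this README does not specify: the error is taken as actual minus predicted (so an over-prediction on a correlated movie pulls this prediction down), offsets from several other movies are averaged, and the result is clamped to the 1 to 5 range. The function name and toy values are hypothetical.

```python
def predict_approach_2(base_prediction, movie_id, customer_errors, correlations):
    """Adjust an Approach 1 prediction using errors on correlated movies.

    base_prediction: Approach 1 prediction for this customer and movie_id
    customer_errors: {other_movie_id: actual - approach_1_prediction}
    correlations:    {movie_id: {other_movie_id: correlation}}
    """
    row = correlations.get(movie_id, {})
    offsets = [row[other] * error
               for other, error in customer_errors.items()
               if other in row and other != movie_id]
    # Assumption: average the offsets when several other movies are available.
    offset = sum(offsets) / len(offsets) if offsets else 0.0
    return min(5.0, max(1.0, base_prediction + offset))


# Toy values: Approach 1 over-predicted movie 10851 by 1 star for this customer,
# and errors on 10851 and 2043 have correlation 0.5.
correlations = {2043: {10851: 0.5}}
customer_errors = {10851: -1.0}

print(predict_approach_2(3.75, 2043, customer_errors, correlations))
```

When no correlated movie has been rated by the customer, the function falls back to the unchanged Approach 1 prediction.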

FUTURE APPROACH:

Add more (or all) movie pair correlations
Incorporate Customer pair correlations (similar to movie pair correlations)
