- graphlab-create
- SFrames
- numpy
- matplotlib
Final+Notebook.py contains the executable code. The script will call suggest() function which maps all the outputs from different models together. This file alone (with required data) is enough to make predicitons, all the other notebooks are there only to show data preprocessing and training of models. You can adjust suggest() function and refer to the comments to know more about the parameters of function. The output of this function call will be information about five recommended books and their cover image. [Look at Final_Notebook.ipynb notebook to know more about predicted output format]
In layman terms, a recommender systems can predict something about a user based on user's past activities. Here the activity can be - a purchase history of the user, a book rated by some user, a movie rated by user on netflix. Example of Recommender System is the recommended products that amazon website shows to its users based on their past purchases.
It works in two phases
In this phase we need to provide the data about every single user(means a lots of data). Then a mathematical model is formulated using some state of the art prediction algorithm which has a lots of statistics behind it. And we get a Black Box of trained model that can be used to make predictions.
In this phase only the user id is required(in most cases) and our model can output a list of items that a user is likely to purchase or in our case the books the user might love to read.
Recommender Systems are hot research topics and they lie in the realm of Machine Learning. There are many approches that can be used to build a recommender system like Matrix Factorization, Collaborative filtering, Item-based similarity, User-based similarity etc.
Generally, most simplest of recommender systems are constructed based on one of the techniques mentioned above. But these recommender systems do not provide the right accuracy which is the most crucial aspect of recommender systems that are to be used in commercial applications. Inaccurate results can incurr significant loss and wastage of resources. Now there are some ways to increase accuracy like:
So, my dataset was static and I already had the cutting edge techniques for predictions, so I chose to ensemble models trained differently and output cumulative predictions. The five models that I made use of were:
Programming Language: Python 2.7 IDE: jupyter-notebook libraries: graphlab-create/SFrames | (Just for the Matrix factrization model, all other models are implemented from SCRATCH)
All of these folders contain the clean data derived from the noisy data that I was working with. Various preprocessing techniques were used to make this data usable. The information about what paritcular folder is about can be found in comments, also the different preprocessing techniques used to filter this data is mentioned in comments
This folder contains the first step for construction of classification based model which is to train explicit data on regression model. Data is stored in some format used by graphlab-create so it might not make sense but again comments can be referred to know what's that folder is doing.
All the notebooks are in .ipynb format which can be directly viewed on github only without downloading or running them.
As the name suggest this notebook contains the implementation of Co-occurence matrix based model.
All the preprocessing steps taken to clean data for usability are employed in this notebook.
This notebook contains the code assembled from all other distinct notebooks i.e. this is where ENSEMbLING of all 5 models is performed. This notbook alone with the required data is enough for using this model.
Contains code for popularity based model.
Some play around code for increasing accuracy of the model. Comments can be referred for the use of this code in improving accuracy of the model.
A brief info of the challenges I faced and tricks i have used to make best predictions.
Code for classification based model.
Code for User-similarity based model
A python dictionary storing numpy arrays, which are used in cooccurence model
A comparison between matrix factorization based models trained fon different data (implicit and explicit, with or without rankings)
Another dictionary storing numpy arrays used in user-simlilarity model
This python code is well modularized and can be integrated with any webapp or mobile application with some glue code and hence it can work as a Core Engine for your Book Recommending Website. There is even no need to include the CSV files as the cleaned data is already stored in folders and hence it won't consume much space. There is only one function call which takes as parameter a UserId and another parameter denoting wheater the user is a new user or an old user and ouput is well defined description about book with book's image(amazon url's are used to display images hence internet connection is required for displaying images)
-graphlab-create and python 2.7 must be installed to execute this project.
CAUTION: Matrix Factorization model cannot be used directly as it was trained solely on graphlab-create and was quite large in size to be uploaded on github and hence the code using matrix factorization based model has been commented so that it do not hinder the execution of rest of the code. But, the output shown in the final notebook include predictions from Matrix Factorization model, so you can get a fair idea of what's this model is doing.
Yes, one feature of this Recommender System is that it can also be used for prediction for completely new users with no past rating history. Popularity based model and Classification based model will be used for making predictions for such users based on the characteristics of users like thier location etc. This feature is not a rare one, today almost every recommender system is capable of making predictions for new users but its definitely one of the strongest feature that a recommender system can provide.