Overview

Our objective is to predict a new venue's popularity from information available when the venue opens. We will do this by training a machine-learning model on a dataset of venue popularities provided by Yelp. The dataset contains metadata about each venue (where it is located, the type of food served, etc.). It also contains a star rating. This tutorial will walk you through one way to build a machine-learning model.

Metric

Your model will be assessed based on the root mean squared error of the number of stars you predict. There is a reference solution (which should not be too hard to beat). The reference solution has a score of 1.
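
As a rough illustration of the metric, here is a minimal sketch of root mean squared error in plain Python. The function name `rmse` and the sample values are made up for illustration; how the grader normalizes the score against the reference solution is not specified here.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up star ratings: only the last prediction is off (by 2 stars).
example_rmse = rmse([3.0, 4.0, 5.0], [3.0, 4.0, 3.0])
```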

Download and parse the incoming data

link

Notice that each row of the file is a JSON blob. You can read it in Python.

Hints:

gzip.open has the same interface as open but is for .gz files.
ujson has the same interface as the built-in json library, but is substantially faster (at the cost of non-robust handling of malformed JSON).
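
Putting the two hints together, one possible way to parse the file is sketched below. It uses the built-in json module (ujson could be swapped in if installed). The helper name `read_records` and the sample file written to a temporary directory are assumptions for illustration, not the real dataset.

```python
import gzip
import json
import os
import tempfile


def read_records(path):
    """Parse a gzip-compressed file that holds one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write a tiny made-up sample so the sketch can run without the real data.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write('{"city": "Phoenix", "stars": 4.5}\n')
    f.write('{"city": "Madison", "stars": 3.0}\n')

records = read_records(sample_path)
```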

Set up cross-validation:

In order to track the performance of your model, you might want to use cross_validation.train_test_split (available as model_selection.train_test_split in scikit-learn 0.18 and later).
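
A minimal sketch of a hold-out split, assuming scikit-learn 0.18 or later, where the function lives in sklearn.model_selection. The toy records and ratings are made up; with the real data you would pass the parsed records and their star ratings.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the parsed records and their star ratings.
records = [{"city": "City %d" % (i % 3)} for i in range(10)]
stars = [1.0 + (i % 5) for i in range(10)]

# Hold out 20% of the records for evaluation; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    records, stars, test_size=0.2, random_state=42
)
```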

Building models in sklearn

All estimators (e.g. linear regression, k-means, etc.) support fit and predict methods. In fact, you can build your own by inheriting from classes in sklearn.base, using this template:

class Estimator(base.BaseEstimator, base.RegressorMixin):
    def __init__(self, ...):
        # initialization code

    def fit(self, X, y):
        # fit the model ...
        return self

    def predict(self, X):
        return # prediction

The intended usage is:

estimator = Estimator(...)  # initialize
estimator.fit(X_train, y_train)  # fit data
y_pred = estimator.predict(X_test)  # predict answer
estimator.score(X_test, y_test)  # evaluate performance

RegressorMixin provides an implementation of .score (the coefficient of determination, R^2, of the prediction). Conforming to this convention has the benefit that many tools (e.g. cross-validation, grid search) rely on this interface, so you can use your new estimators with the existing sklearn infrastructure.
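
To make the template concrete, here is a minimal sketch of a custom estimator that always predicts the mean of the training targets. The class name `MeanRegressor` and the toy data are illustrative assumptions; the .score method is inherited from RegressorMixin.

```python
import numpy as np
from sklearn import base


class MeanRegressor(base.BaseEstimator, base.RegressorMixin):
    """Baseline estimator: ignore the features and predict the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # One prediction per input row, all equal to the stored mean.
        return np.full(len(X), self.mean_)


X_toy = [[0], [1], [2]]
y_toy = [1.0, 2.0, 3.0]

estimator = MeanRegressor().fit(X_toy, y_toy)
y_pred = estimator.predict([[5], [6]])
r2_on_training_data = estimator.score(X_toy, y_toy)
```

A constant predictor equal to the mean scores an R^2 of zero on the data it was fitted to, which makes it a convenient baseline to compare other models against.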

For example, grid_search.GridSearchCV (model_selection.GridSearchCV in scikit-learn 0.18 and later) takes an estimator and a grid of hyperparameters as arguments, and returns another estimator. Upon fitting, it fits the best model (chosen by cross-validation over the supplied hyperparameters) and uses that for prediction.
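
A minimal sketch of a grid search, assuming scikit-learn 0.18 or later (sklearn.model_selection). The Ridge estimator, the alpha grid and the synthetic data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for real features and ratings.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)

# Try each alpha with 5-fold cross-validation, then refit the best one.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid=param_grid, cv=5)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
y_pred = search.predict(X[:5])  # delegates to the refitted best estimator
```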

Of course, we sometimes need to process or transform the data before we can do machine-learning on it. sklearn has Transformers to help with this. They implement this interface:

class Transformer(base.BaseEstimator, base.TransformerMixin):
    def __init__(self, ...):
        # initialization code

    def fit(self, X, y=None):
        # fit the transformation
        # ...
        return self

    def transform(self, X):
        return ... # transformation

When combined with our previous estimator, the intended usage is

transformer = Transformer(...)  # initialize
X_trans_train = transformer.fit_transform(X_train)  # fit / transform data
estimator.fit(X_trans_train, y_train)  # fit new model on training data
X_trans_test = transformer.transform(X_test)  # transform test data
estimator.score(X_trans_test, y_test)  # evaluate performance on test data

Here, .fit_transform is implemented in sklearn.base.TransformerMixin in terms of the .fit and .transform methods. For many transformers, .fit does nothing but return self, and only .transform actually does something.

The real reason we use transformers is that we can chain them together with pipelines. For example, this

new_model = pipeline.Pipeline([('trans', Transformer(...)),
                               ('est', Estimator(...))
                              ])
new_model.fit(X_train, y_train)
new_model.score(X_test, y_test)

would replace all the fitting and scoring code above. That is, the pipeline itself is an estimator (and implements the .fit and .predict methods). Note that a pipeline can have multiple transformers chained up but at most one (optional) terminal estimator.
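
As a concrete instance of the pattern, the sketch below chains a small custom transformer with a linear regression. `CenterTransformer` and the toy arrays are made up for illustration; the transformer learns column means in fit and subtracts them in transform.

```python
import numpy as np
from sklearn import base, pipeline
from sklearn.linear_model import LinearRegression


class CenterTransformer(base.BaseEstimator, base.TransformerMixin):
    """Subtract the per-column means learned from the training data."""

    def fit(self, X, y=None):
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.means_


# Toy data in which the target equals the first column exactly.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 35.0], [4.0, 40.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])

model = pipeline.Pipeline([("center", CenterTransformer()),
                           ("est", LinearRegression())])
model.fit(X_train, y_train)  # fits the transformer, then the estimator
y_pred = model.predict(np.array([[5.0, 50.0]]))
```

Because the pipeline stores the means learned during fit, the same centering is applied automatically at prediction time.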

A few helpful notes about performance.

To deploy a model, we suggest using the dill library or joblib to save it to disk and check it into git. This allows you to train the model offline in a separate script and then load it here for prediction. The model is far too complicated to be trained in real time!
Make sure you load the dill file upon server start, not upon a call to the solution function. This can be done by loading the model into the global scope. The model is too complicated even to be loaded in real time!
Make sure you call predict once per call of the solution function, and that it returns a single number. For testing convenience you may want to allow it to work using a list of JSON dicts as input, but during grading the model will be passed a single JSON blob at a time.
You probably want to use GridSearchCV to find the best hyperparameters by splitting the data into training and test sets. But for the final model that you submit, don't forget to retrain on all your data (training and test) with these best parameters.
GridSearchCV objects are capable of prediction, but they contain many versions of your model which you'll never use. From a deployment standpoint, it makes sense to submit only the best estimator once you've trained it on the full data set. To troubleshoot deployment errors, look here.
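
The notes above could be put into practice roughly as follows, using joblib. This is only a sketch: the file name, the synthetic training data and the record field names ("latitude", "longitude") are assumptions, and in a real project the training half and the serving half would live in separate files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# --- offline training script (run once, outside the server) ---
rng = np.random.RandomState(0)
X_all = rng.uniform(size=(50, 2))        # synthetic stand-in for coordinates
y_all = rng.uniform(1.0, 5.0, size=50)   # synthetic stand-in for star ratings
final_model = KNeighborsRegressor(n_neighbors=5).fit(X_all, y_all)

# Hypothetical file name; a temporary directory keeps the sketch self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "lat_long_model.joblib")
joblib.dump(final_model, model_path)

# --- serving module: load once into the global scope, at import time ---
MODEL = joblib.load(model_path)


def lat_long_model(record):
    """Return a single float prediction for one JSON record (a dict)."""
    features = [[record["latitude"], record["longitude"]]]
    return float(MODEL.predict(features)[0])


prediction = lat_long_model({"latitude": 0.3, "longitude": 0.7})
```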

Questions

city_model

The venues belong to different cities. You can imagine that the ratings in some cities are probably higher than in others, and use this as an estimator.

Build an estimator that uses groupby and mean to compute the average rating in each city. Use this as a predictor.

Note: the solution function (def city_model, etc.) takes a single argument, record.
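
One possible shape for such an estimator is sketched below, using pandas for the groupby. The class name, the "city" field name, the toy records and the fallback to the overall mean for unseen cities are all assumptions rather than part of the assignment.

```python
import pandas as pd
from sklearn import base


class CityMeanEstimator(base.BaseEstimator, base.RegressorMixin):
    """Predict each venue's rating as the mean training rating of its city."""

    def fit(self, X, y):
        # X is a list of record dicts; "city" is an assumed field name.
        df = pd.DataFrame({"city": [record["city"] for record in X],
                           "stars": list(y)})
        self.city_means_ = df.groupby("city")["stars"].mean().to_dict()
        # Assumption: fall back to the overall mean for unseen cities.
        self.global_mean_ = float(df["stars"].mean())
        return self

    def predict(self, X):
        return [self.city_means_.get(record["city"], self.global_mean_)
                for record in X]


train = [{"city": "Phoenix"}, {"city": "Phoenix"}, {"city": "Madison"}]
estimator = CityMeanEstimator().fit(train, [4.0, 5.0, 3.0])
predictions = estimator.predict([{"city": "Phoenix"}, {"city": "Las Vegas"}])


def city_model(record):
    """Solution function: one record in, one number out."""
    return float(estimator.predict([record])[0])
```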

lat_long_model

You can imagine that a city-based model might not be sufficiently fine-grained. For example, we know that some neighborhoods are trendier than others. We might consider a K Nearest Neighbors or Random Forest model based on the latitude and longitude as a way to understand neighborhood dynamics.

You should implement a generic ColumnSelectTransformer that is told which columns to select when it is constructed, and use a non-linear model like sklearn.neighbors.KNeighborsRegressor or sklearn.ensemble.RandomForestRegressor as the estimator (why would you choose a non-linear model?). Bonus points if you wrap the estimator in grid_search.GridSearchCV and use cross-validation to determine the optimal values of the hyperparameters.
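
A minimal sketch of the idea, assuming each record is a dict with "latitude" and "longitude" keys (assumed field names) and using a handful of made-up venues. The constructor argument name `col_names` is also an illustrative choice.

```python
from sklearn import base, pipeline
from sklearn.neighbors import KNeighborsRegressor


class ColumnSelectTransformer(base.BaseEstimator, base.TransformerMixin):
    """Turn a list of record dicts into rows holding only the chosen columns."""

    def __init__(self, col_names):
        self.col_names = col_names

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [[record[name] for name in self.col_names] for record in X]


# Made-up venues: two near one location, two near another.
records = [
    {"latitude": 33.4, "longitude": -112.0},
    {"latitude": 33.5, "longitude": -112.1},
    {"latitude": 43.0, "longitude": -89.4},
    {"latitude": 43.1, "longitude": -89.5},
]
stars = [4.0, 5.0, 2.0, 3.0]

selector = ColumnSelectTransformer(["latitude", "longitude"])
model = pipeline.Pipeline([("select", selector),
                           ("est", KNeighborsRegressor(n_neighbors=2))])
model.fit(records, stars)
y_pred = model.predict([{"latitude": 33.45, "longitude": -112.05}])
```

When such a pipeline is wrapped in a grid search, the hyperparameters of a step are addressed as step name, two underscores, parameter name, for example `est__n_neighbors`.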

category_model

While location is important, we could also try seeing how predictive the venue's category is. Build a custom transformer that massages the data so that it can be fed into a sklearn.feature_extraction.DictVectorizer, which in turn generates a large matrix obtained by One-Hot-Encoding. Feed this into a Linear Regression (and cross-validate it!). Can you beat this with another type of non-linear estimator?

Hints:

With a large sparse feature set like this, we often use a cross-validated regularized linear model.
Some categories (e.g. Restaurants) are not very specific. Others (Japanese sushi) are much more so. How can we account for this in our model? (Hint: look at TF-IDF.)
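
One way these pieces could fit together is sketched below. It assumes each record stores its categories as a list of strings under a "categories" key (an assumed layout), uses a few made-up records, and picks RidgeCV as one example of a cross-validated regularized linear model.

```python
from sklearn import base, pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import RidgeCV


class CategoryDictTransformer(base.BaseEstimator, base.TransformerMixin):
    """Map each record to a {category: 1} dict that DictVectorizer accepts."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [{category: 1 for category in record["categories"]}
                for record in X]


train_records = [
    {"categories": ["Restaurants", "Sushi Bars"]},
    {"categories": ["Restaurants", "Pizza"]},
    {"categories": ["Restaurants", "Sushi Bars", "Japanese"]},
    {"categories": ["Bars"]},
    {"categories": ["Restaurants", "Pizza", "Italian"]},
    {"categories": ["Bars", "Nightlife"]},
]
train_stars = [4.5, 3.0, 4.0, 3.5, 3.5, 3.0]

model = pipeline.Pipeline([
    ("to_dict", CategoryDictTransformer()),
    ("vec", DictVectorizer()),        # sparse one-hot matrix
    ("tfidf", TfidfTransformer()),    # down-weights very common categories
    ("est", RidgeCV(alphas=[0.1, 1.0, 10.0])),
])
model.fit(train_records, train_stars)
predictions = model.predict([{"categories": ["Restaurants", "Japanese"]},
                             {"categories": ["Bars", "Karaoke"]}])
feature_names = model.named_steps["vec"].feature_names_
```

Categories that never appeared during fitting (such as "Karaoke" here) are silently ignored by DictVectorizer at prediction time.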

attribute_knn_model

Venues have (potentially nested) attributes:

    {'Attire': 'casual',
     'Accepts Credit Cards': True,
     'Ambience': {'casual': False, 'classy': False}}

Categorical data like this should often be transformed by a One Hot Encoding. For example, we might flatten the above into something like this:

    {'Attire_casual' : 1,
     'Accepts Credit Cards': 1,
     'Ambience_casual': 0,
     'Ambience_classy': 0 }

Build a custom transformer that flattens attributes and feed its output into DictVectorizer. Feed the result into a (cross-validated) linear model (or something else!).
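
The flattening step might look like the following sketch, which reproduces the example above. The helper name `flatten_attributes`, the class name, and the assumption that attributes live under an "attributes" key of each record are illustrative choices.

```python
from sklearn import base
from sklearn.feature_extraction import DictVectorizer


def flatten_attributes(attributes, prefix=""):
    """Recursively flatten nested attributes into a one-level dict."""
    flat = {}
    for key, value in attributes.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten_attributes(value, prefix=name + "_"))
        elif isinstance(value, bool):
            flat[name] = int(value)          # True/False -> 1/0
        elif isinstance(value, str):
            flat[name + "_" + value] = 1     # one-hot encode string values
        else:
            flat[name] = value               # keep numeric values as they are
    return flat


class AttributeFlattenTransformer(base.BaseEstimator, base.TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [flatten_attributes(record.get("attributes", {}))
                for record in X]


record = {"attributes": {"Attire": "casual",
                         "Accepts Credit Cards": True,
                         "Ambience": {"casual": False, "classy": False}}}
flat = AttributeFlattenTransformer().fit_transform([record])
matrix = DictVectorizer().fit_transform(flat)
```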

full_model

So far we have only built models based on individual features. We could obviously combine them. One (highly recommended) way to do this is through a sklearn.pipeline.FeatureUnion.

Combine all the above models using a feature union. Notice that a feature union takes transformers, not models, as arguments. The way around this is to build a transformer that outputs the prediction in the transform method, thus turning the model into a transformer. Use a cross-validated linear regression (or some other algorithm) to weight these signals.
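
The wrapping trick can be sketched as follows. `EstimatorTransformer` is a hypothetical name, and the two sub-models and synthetic data merely stand in for the city, location, category and attribute models built above.

```python
import numpy as np
from sklearn import base, pipeline
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor


class EstimatorTransformer(base.BaseEstimator, base.TransformerMixin):
    """Wrap an estimator so its predictions become a single feature column."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        # FeatureUnion stacks columns, so return shape (n_samples, 1).
        return np.asarray(self.estimator.predict(X)).reshape(-1, 1)


# Synthetic data standing in for the real features and ratings.
rng = np.random.RandomState(0)
X = rng.uniform(size=(40, 2))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + 0.05 * rng.normal(size=40)

union = pipeline.FeatureUnion([
    ("knn", EstimatorTransformer(KNeighborsRegressor(n_neighbors=3))),
    ("linear", EstimatorTransformer(LinearRegression())),
])
full_model = pipeline.Pipeline([("features", union),
                                ("blend", LinearRegression())])
full_model.fit(X, y)

stacked = full_model.named_steps["features"].transform(X[:3])
y_pred = full_model.predict(X[:3])
```

Note that in this sketch the blending regression is trained on the sub-models' predictions for the same data they were fitted on, so it may put too much weight on sub-models that overfit.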
