Kaggle-Whats-Cooking

Here we are going to solve the What's Cooking challenge on Kaggle:

https://www.kaggle.com/c/whats-cooking

I will be working with Python's scikit-learn library. I am going to use TfidfVectorizer for feature extraction and the LogisticRegression classifier for classification. I am going to combine these two components in a scikit-learn Pipeline and tune it with GridSearchCV.

GridSearchCV is a very useful scikit-learn class for tuning and training machine learning models. We will use it to:

Build a pipeline consisting of a feature extractor and a classifier.

Provide a set of options/parameters for the pipeline's constituents.

Store the best set of parameters that GridSearchCV comes up with.

Use the optimal parameters from the step above to build an optimal pipeline for predictions.
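
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src: the step names `vect` and `clf` are inferred from the parameter names in the log output further down, while the toy data, the grid values and `max_iter=1000` are assumptions made only to keep the example small and runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical): one string of ingredients per recipe.
docs = [
    "soy sauce ginger rice", "soy sauce sesame oil noodles",
    "ginger garlic soy sauce", "rice vinegar soy sauce scallion",
    "sesame oil ginger rice", "soy sauce garlic noodles",
    "olive oil basil tomato", "parmesan basil pasta",
    "tomato garlic olive oil", "pasta olive oil parmesan",
    "basil tomato mozzarella", "olive oil garlic pasta",
]
labels = ["chinese"] * 6 + ["italian"] * 6

# 1. Build a pipeline: feature extractor followed by classifier.
pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 2. Options for each step, addressed as <step name>__<parameter>.
#    These value lists are assumptions, not the repository's actual grid.
param_grid = {
    "vect__max_df": [0.7, 1.0],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__use_idf": [True, False],
    "clf__C": [1, 10],
}

# 3. Search the grid with 3-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=1)
search.fit(docs, labels)

# 4. Keep the winning parameters and the pipeline refit with them.
best_params = search.best_params_
best_pipeline = search.best_estimator_
print(best_params)
```

With the default `refit=True`, `best_estimator_` is already retrained on all of the data passed to `fit`, so it can be used for predictions directly.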

Structure

data

This folder has the training and test data.

picks

This folder is used to store intermediate results such as the best set of parameters determined by GridSearchCV.

results

This folder is used to store the feedback loop results and the final predictions.

src

This folder contains all of the source code.

Training

Please look at the well-documented cookTrain.py script in the cook package inside src. Here is the flow of events:

Create a GridSearchCV pipeline comprising a TfidfVectorizer and a LogisticRegression classifier.
Load the training data and clean up the relevant columns. For our purpose, this is the ingredients column. Then create training and validation sets.
Fit the pipeline on the training set.
Determine the best set of parameters and make predictions on the validation set.
Evaluate metrics and scores for the pipeline's best_estimator_.
Document the prediction results on the validation set and create a pandas DataFrame for feedback.
Extract the 'mistakes' from the validation set predictions and re-train the pipeline. This is our feedback loop.
After incorporating the mistakes into the training data, recreate the training and validation sets. Finally, make predictions on the new validation set and evaluate metrics and scores for the pipeline's best_estimator_.
Store the best set of parameters.
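
A condensed sketch of the cleanup, split and feedback steps is given below. It is only an approximation of the flow, not the contents of cookTrain.py: the `clean_ingredients` helper, the record layout (`id`, `cuisine`, `ingredients`), the toy recipes and the way mistakes are appended to the training set are all assumptions, and a plain pipeline stands in for the full grid search to keep it short.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def clean_ingredients(ingredients):
    """Join an ingredient list into one lowercase string of letters and spaces."""
    text = " ".join(ingredients).lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


# Hypothetical records, assumed to have id, cuisine and ingredients fields.
raw = [
    ("chinese", ["Soy Sauce", "ginger", "rice"]),
    ("chinese", ["soy sauce", "sesame oil", "noodles"]),
    ("chinese", ["ginger", "garlic", "soy sauce"]),
    ("chinese", ["rice vinegar", "soy sauce", "scallion"]),
    ("chinese", ["sesame oil", "ginger", "rice"]),
    ("chinese", ["soy sauce", "garlic", "noodles"]),
    ("italian", ["olive oil", "basil", "tomato"]),
    ("italian", ["parmesan", "basil", "pasta"]),
    ("italian", ["tomato", "garlic", "olive oil"]),
    ("italian", ["pasta", "olive oil", "parmesan"]),
    ("italian", ["basil", "tomato", "mozzarella"]),
    ("italian", ["olive oil", "garlic", "pasta"]),
]
records = [{"id": i, "cuisine": c, "ingredients": ing}
           for i, (c, ing) in enumerate(raw)]

df = pd.DataFrame(records)
df["text"] = df["ingredients"].apply(clean_ingredients)

# Training and validation sets, stratified by cuisine.
train, valid = train_test_split(
    df, test_size=0.25, stratify=df["cuisine"], random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train["text"], train["cuisine"])

# Feedback DataFrame: one row per validation recipe.
feedback = pd.DataFrame({
    "text": valid["text"].to_numpy(),
    "actual": valid["cuisine"].to_numpy(),
    "predicted": model.predict(valid["text"]),
})
accuracy = accuracy_score(feedback["actual"], feedback["predicted"])

# Feedback loop (assumed form): append misclassified rows and re-train.
mistakes = feedback[feedback["actual"] != feedback["predicted"]]
augmented = pd.concat([
    train[["text", "cuisine"]],
    mistakes[["text", "actual"]].rename(columns={"actual": "cuisine"}),
], ignore_index=True)
model.fit(augmented["text"], augmented["cuisine"])
```

In the real flow the re-training step would run the grid search again on the augmented data, which is what produces the second block of log output under What I Observed.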

Predict

Please look at the well-documented cookPredict.py script in the cook package inside src. Here is the flow of events:

Create a GridSearchCV pipeline comprising a TfidfVectorizer and a LogisticRegression classifier. This time we use the best set of parameters obtained from training.
Load the test data and clean up the relevant columns. For our purpose, this is the ingredients column.
Make predictions on the test data and store the results.
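
The prediction flow could look something like the sketch below. The parameter values are the first "best parameters set" from the log under What I Observed. Everything else is an assumption for illustration: the pickle file name, the toy training and test data, and the `id`/`cuisine` column names for the stored results.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Best parameters as reported by the first grid search in the log.
best_params = {
    "clf__C": 10,
    "vect__max_df": 0.7,
    "vect__ngram_range": (1, 1),
    "vect__use_idf": True,
}

# Round-trip through a pickle file (hypothetical name) to mimic storing the
# parameters during training and loading them again for prediction.
with tempfile.TemporaryDirectory() as tmp:
    params_path = Path(tmp) / "best_params.pkl"
    params_path.write_bytes(pickle.dumps(best_params))
    loaded_params = pickle.loads(params_path.read_bytes())

# Same pipeline layout as in training, configured with the stored parameters.
pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.set_params(**loaded_params)

# Toy training data (hypothetical) standing in for the cleaned training set.
train_text = [
    "soy sauce ginger rice", "sesame oil soy sauce noodles",
    "ginger garlic scallion rice", "olive oil basil tomato",
    "parmesan basil pasta", "tomato garlic mozzarella pasta",
]
train_labels = ["chinese"] * 3 + ["italian"] * 3
pipeline.fit(train_text, train_labels)

# Toy test data (hypothetical) after the same ingredients cleanup.
test = pd.DataFrame({
    "id": [101, 102],
    "text": ["ginger soy sauce noodles", "basil tomato pasta"],
})
submission = pd.DataFrame({
    "id": test["id"],
    "cuisine": pipeline.predict(test["text"]),
})
print(submission)
```

Calling `submission.to_csv(path, index=False)` would then write the predictions to a file in the results folder.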

What I Observed

Here are the stats that I observed before feedback (first run) and after feedback (second run):

Fitting 3 folds for each of 300 candidates, totalling 900 fits
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:  4.6min
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed: 23.1min
[Parallel(n_jobs=3)]: Done 444 tasks      | elapsed: 71.4min
[Parallel(n_jobs=3)]: Done 794 tasks      | elapsed: 164.7min
[Parallel(n_jobs=3)]: Done 900 out of 900 | elapsed: 194.8min finished
best score: 0.779
best parameters set:
	 clf__C: 10
	 vect__max_df: 0.7
	 vect__ngram_range: (1, 1)
	 vect__use_idf: True
('Accuracy:', 0.78521746417497695)
Fitting 3 folds for each of 300 candidates, totalling 900 fits
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:  3.9min
[Parallel(n_jobs=3)]: Done 194 tasks      | elapsed: 17.7min
[Parallel(n_jobs=3)]: Done 444 tasks      | elapsed: 54.2min
[Parallel(n_jobs=3)]: Done 794 tasks      | elapsed: 131.1min
[Parallel(n_jobs=3)]: Done 900 out of 900 | elapsed: 153.2min finished
best score: 0.719
best parameters set:
	 clf__C: 20
	 vect__max_df: 0.6
	 vect__ngram_range: (1, 2)
	 vect__use_idf: True
('Accuracy:', 0.73130892348169263)

It is evident that the best set of parameters changed after feedback. Hence, the classifier did 'learn' something from the feedback.
Note that the accuracy actually decreased with the new set of parameters (from about 0.785 to about 0.731). This is probably due to overfitting.
Nonetheless, when I submitted my results on Kaggle, I got about 78% accuracy.
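
As a side note on the log line "Fitting 3 folds for each of 300 candidates, totalling 900 fits": the number of candidates is the product of the lengths of the value lists in the grid, and the number of fits is that product times the number of folds. The small grid below is hypothetical and is not the grid that produced the 300 candidates above; it only shows the arithmetic.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: these value lists are assumptions, not the real search space.
param_grid = {
    "vect__max_df": [0.6, 0.7, 0.8],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__use_idf": [True, False],
    "clf__C": [1, 10, 20],
}
n_folds = 3

# One candidate per combination of values: 3 * 2 * 2 * 3 = 36.
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold: 36 * 3 = 108.
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

By the same arithmetic, the 300 candidates in the log times 3 folds give the 900 fits reported.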

verram/Kaggle-Whats-Cooking
