Find the best classification model for your dataset

There are many classification models, and each one needs tuning to find the best parameters for a given dataset. This can be challenging, so I present a Python library that finds the best parameter for each of a group of classifiers: K Neighbors Classifier, Support Vector Classifier, Decision Tree Classifier, Random Forest Classifier and AdaBoost Classifier.

In the first part I use a library to merge dataframes containing the nutrition features of foods. The data comes from https://ndb.nal.usda.gov/ndb/ . I am not including the CSVs yet, since the data can be obtained from this source in JSON format. I will also publish a separate post on how to get the data.

I use tSNE to show how vegetables, fruits, beef and fish cluster.
Afterwards, I show some examples of how to use the library, model_score_plot. The function msp.modelsCalculation returns the best parameter settings for all of these classifiers at once.

For more information, type help(model_score_plot) after importing.
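The tSNE view mentioned above can be sketched with scikit-learn alone. This is a minimal sketch, not the code in df_proc: the data is synthetic (make_blobs with 4 groups and 9 features stands in for the food CSVs), and the perplexity value is an arbitrary choice.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the food data (assumption): 4 groups, 9 numeric features
X, y = make_blobs(n_samples=120, centers=4, n_features=9, random_state=0)
Xn = MinMaxScaler().fit_transform(X)

# Project the 9 scaled features down to 2 dimensions
embedding = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(Xn)

# One colour per group, using the same names and colours as typesL and colorL
types = ['beef', 'fish', 'fruit', 'vegetable']
colors = ['black', 'blue', 'red', 'green']
fig, ax = plt.subplots(figsize=(6, 6))
for i, (t, c) in enumerate(zip(types, colors)):
    mask = y == i
    ax.scatter(embedding[mask, 0], embedding[mask, 1], color=c, label=t, s=10)
ax.legend()
ax.set_title('tSNE of scaled features')
```

Call plt.show() or fig.savefig('tsne.png') afterwards to see the image.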

import importlib
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
# MinMaxScaler and RFC are used below; import them explicitly
# instead of relying on the star import at the end
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.preprocessing import MinMaxScaler

# Make the local PyLib directory importable
sys.path.insert(0, 'PyLib')

# A modified library from the internet, suited to this demonstration
import plot_confusion_matrix_compact
importlib.reload(plot_confusion_matrix_compact)
from plot_confusion_matrix_compact import plot_confusion_matrix

# The library presented here
import model_score_plot
importlib.reload(model_score_plot)
from model_score_plot import ModelScorePlot as MSP

# Library to merge dataframes and show a tSNE image
import df_proc
importlib.reload(df_proc)
from df_proc import *

Prepare Dataset

csvDir='CSVs'
# One CSV per food group, indexed by food name
beef_df=pd.read_csv(os.path.join(csvDir,'beef_df_100g.csv'),sep=',',index_col=0)
fish_df=pd.read_csv(os.path.join(csvDir,'fish_df_100g.csv'),sep=',',index_col=0)
veg_df=pd.read_csv(os.path.join(csvDir,'veg_df_100g.csv'),sep=',',index_col=0)
fruit_df=pd.read_csv(os.path.join(csvDir,'fruit_df_100g.csv'),sep=',',index_col=0)
# dfL, typesL and colorL must be in the same order
dfL=[beef_df,fish_df,fruit_df,veg_df]
typesL=['beef','fish','fruit','vegetable']
colorL=['black','blue','red','green']
# Merge the four dataframes; the merged frame is returned in dataDic['df']
dataDic=combineDfs(iDfL=dfL,typesL=typesL,colorL=colorL,figSize=(50,50),iDpi=80,plotText=False)

dataDic['df'].head(3)

	Proximates Water g	Proximates Energy kcal	Proximates Energy kJ	Proximates Protein g	Proximates Total lipid (fat) g	Proximates Ash g	Proximates Carbohydrate, by difference g	Minerals Calcium, Ca mg	Minerals Iron, Fe mg	type
Beef_ribeye_cap_steak_boneless_separable_lean_only_trimmed_to_0"_fat_choice_raw	66.50	187.0	784.0	19.46	11.40	0.89	1.75	6.0	2.64	beef
Beef_loin_tenderloin_steak_boneless_separable_lean_only_trimmed_to_0"_fat_choice_raw	72.04	143.0	597.0	21.78	6.16	1.11	0.00	13.0	2.55	beef
Beef_rib_eye_steakslashroast_boneless_lip-on_separable_lean_only_trimmed_to_1slash8"_fat_select_raw	70.89	148.0	619.0	22.55	6.41	1.03	0.00	5.0	1.80	beef

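# Feature columns are every column except the 'type' label
cols=list(dataDic['df'].columns.values)
colsX=cols.copy()
colsX.remove('type')
# set() ordering is arbitrary, so the label order can differ between runs
labels=list(set(dataDic['df']['type'].values))
print(cols)
print(labels)
print(colsX)
X=dataDic['df'][colsX].values
# Copy so that the in-place encoding below does not touch the dataframe
y=dataDic['df']['type'].values.copy()
# Encode each label as its position in labels
for i,t in enumerate(labels):
    y[y==t]=i
y=y.flatten().astype(int)
X=X.astype(float)

# Scale every feature to the [0, 1] range
mms=MinMaxScaler()
Xn=mms.fit_transform(X)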

['Proximates Water g', 'Proximates Energy kcal', 'Proximates Energy kJ', 'Proximates Protein g', 'Proximates Total lipid (fat) g', 'Proximates Ash g', 'Proximates Carbohydrate, by difference g', 'Minerals Calcium, Ca mg', 'Minerals Iron, Fe mg', 'type']
['vegetable', 'beef', 'fish', 'fruit']
['Proximates Water g', 'Proximates Energy kcal', 'Proximates Energy kJ', 'Proximates Protein g', 'Proximates Total lipid (fat) g', 'Proximates Ash g', 'Proximates Carbohydrate, by difference g', 'Minerals Calcium, Ca mg', 'Minerals Iron, Fe mg']
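The same preparation can be sketched with scikit-learn's LabelEncoder, which sorts the class names and therefore gives the same label order on every run. This is an alternative to the loop above, not what the notebook does; toy_df and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy stand-in for dataDic['df'] (assumption): two features plus the 'type' label
toy_df = pd.DataFrame({
    'Proximates Water g': [66.5, 72.0, 88.0, 91.0],
    'Proximates Protein g': [19.5, 21.8, 0.9, 1.2],
    'type': ['beef', 'fish', 'fruit', 'vegetable'],
})

# Features are every column except the label
feature_cols = [c for c in toy_df.columns if c != 'type']
X = toy_df[feature_cols].to_numpy(dtype=float)

# LabelEncoder maps each class name to an integer, in sorted order
encoder = LabelEncoder()
y = encoder.fit_transform(toy_df['type'])
labels = list(encoder.classes_)

# Scale every feature to the [0, 1] range
Xn = MinMaxScaler().fit_transform(X)

print(labels)
print(y)
```

encoder.inverse_transform(y) recovers the original names, which is handy when reading a confusion matrix.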

Create a Model Score Plot object

msp = MSP()

K Neighbors Classifier

knc_df=msp.kncScores(Xn,y,cv=5,param_name='n_neighbors',paramRange=(1,100,1),trainW=1,testW=2,title='KNC',plot=True)

knc_df

	best_param	model	param_name	test_score	train_score	weighted_score
0	5	SVC poly5	n_neighbors	0.921001	0.937735	0.926579
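The weighted_score column is consistent with a weighted mean of train_score and test_score, using trainW and testW as the weights. This is inferred from the numbers in the table, not taken from the library's source, so treat the helper below (weighted_score is a hypothetical name) as a sketch of that assumption.

```python
def weighted_score(train_score, test_score, train_w=1, test_w=2):
    """Weighted mean of the train and test scores (assumed formula)."""
    return (train_w * train_score + test_w * test_score) / (train_w + test_w)


# Values from the KNC row above, with trainW=1 and testW=2
knc = weighted_score(train_score=0.937735, test_score=0.921001)
print(round(knc, 6))
```

With testW larger than trainW, a model that scores well on the training folds but worse on the validation folds is ranked lower than its training score alone would suggest.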

SVC

Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

svc_df=msp.svcScores(Xn,y,cv=5,param_name='C',max_iter=50000,degrees=(2,4,1),paramRange=(100,1000,100),plot=True)

svc_df

	best_param	model	param_name	test_score	train_score	weighted_score
3	400	SVC rbf	C	0.938881	0.952694	0.943486
0	100	SVC linear	C	0.940086	0.947906	0.942693
4	800	SVC sigmoid	C	0.936479	0.945807	0.939588
1	800	SVC poly2	C	0.925813	0.936825	0.929484
2	700	SVC poly3	C	0.909024	0.907187	0.908412
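A comparable sweep over kernels and C can be sketched with scikit-learn's GridSearchCV. This is not how svcScores is implemented (I make no claim about its internals); the iris dataset and the three C values are stand-ins chosen to keep the example quick.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in dataset (assumption): iris instead of the food data
X, y = load_iris(return_X_y=True)
Xn = MinMaxScaler().fit_transform(X)

# 'degree' only matters for the polynomial kernel, so it gets its own grid
param_grid = [
    {'kernel': ['linear', 'rbf', 'sigmoid'], 'C': [100, 400, 800]},
    {'kernel': ['poly'], 'degree': [2, 3], 'C': [100, 400, 800]},
]
search = GridSearchCV(SVC(max_iter=50000), param_grid, cv=5,
                      return_train_score=True)
search.fit(Xn, y)

# Keep the best row per kernel, ranked by mean validation score
results = pd.DataFrame(search.cv_results_)
best_rows = results.loc[results.groupby('param_kernel')['mean_test_score'].idxmax()]
best_per_kernel = best_rows[['param_kernel', 'param_C',
                             'mean_test_score', 'mean_train_score']]
print(best_per_kernel.sort_values('mean_test_score', ascending=False))
```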

Decision Tree Classifier

dtc_df=msp.dtcScores(Xn,y,cv=5,param_name='max_depth',paramRange=(1,10,1),trainW=1,testW=2,title='Decision Tree classifier',plot=True)
dtc_df

	best_param	model	param_name	test_score	train_score	weighted_score
0	8	Decision Tree classifier gini	max_depth	0.937733	0.996106	0.957191
1	7	Decision Tree classifier entropy	max_depth	0.937698	0.988019	0.954472

Random Forest Classifier

rfc_df=msp.rfcScores(Xn,y,cv=5,param_name='max_depth',estimatorsRange=(2,11,1),paramRange=(1,15,1),trainW=1,testW=2,title='Randorm Forest classifier',clfArg={},plot=True)

rfc_df

	best_param	model	param_name	test_score	train_score	weighted_score
6	11	Randorm Forest classifier. Criterion: gini. Es...	max_depth	0.954459	0.996706	0.968541
8	14	Randorm Forest classifier. Criterion: gini. Es...	max_depth	0.952071	0.997308	0.967150
14	14	Randorm Forest classifier. Criterion: entropy....	max_depth	0.952106	0.996103	0.966771
15	12	Randorm Forest classifier. Criterion: entropy....	max_depth	0.950852	0.995804	0.965836
13	8	Randorm Forest classifier. Criterion: entropy....	max_depth	0.952078	0.992813	0.965656
5	8	Randorm Forest classifier. Criterion: gini. Es...	max_depth	0.950986	0.994608	0.965527
7	9	Randorm Forest classifier. Criterion: gini. Es...	max_depth	0.947265	0.996107	0.963546
12	11	Randorm Forest classifier. Criterion: entropy....	max_depth	0.949682	0.991015	0.963460
4	7	Randorm Forest classifier. Criterion: gini. Es...	max_depth	0.948527	0.990404	0.962486
17	5	Randorm Forest classifier. Criterion: entropy....	max_depth	0.953261	0.978150	0.961557
16	5	Randorm Forest classifier. Criterion: entropy....	max_depth	0.951986	0.980534	0.961502
3	11	Randorm Forest classifier. Criterion: gini. Es...	max_depth	0.944842	0.992211	0.960631
10	7	Randorm Forest classifier. Criterion: entropy....	max_depth	0.944884	0.981743	0.957170
2	10	Randorm Forest classifier. Criterion: gini. Es...	max_depth	0.940107	0.989223	0.956479
11	6	Randorm Forest classifier. Criterion: entropy....	max_depth	0.938825	0.979630	0.952427
1	7	Randorm Forest classifier. Criterion: gini. Es...	max_depth	0.934112	0.982631	0.950285
9	6	Randorm Forest classifier. Criterion: entropy....	max_depth	0.924610	0.963467	0.937562
0	4	Randorm Forest classifier. Criterion: gini. Es...	max_depth	0.914940	0.935347	0.921743

Adaboost Classifier

Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier

abc_df=msp.abcScores(Xn,y,cv=5,param_name='n_estimators',paramRange=(1,10,1),trainW=1,testW=2,title='Adaboost classifier',clfArg={},plot=True)

abc_df

	best_param	model	param_name	test_score	train_score	weighted_score
0	2	Adaboost classifier	n_estimators	0.766362	0.777865	0.770196

models=[knc_df,svc_df,dtc_df,rfc_df,abc_df]
pd.concat(models).sort_values(by='weighted_score',ascending=False)

	best_param	model	param_name	test_score	train_score	weighted_score
5	14	Randorm Forest classifier. Estimators: 7	max_depth	0.958066	0.995210	0.970447
8	6	Randorm Forest classifier. Estimators: 10	max_depth	0.952141	0.987423	0.963902
6	9	Randorm Forest classifier. Estimators: 8	max_depth	0.948484	0.994608	0.963859
7	6	Randorm Forest classifier. Estimators: 9	max_depth	0.948477	0.990122	0.962359
4	7	Randorm Forest classifier. Estimators: 6	max_depth	0.944891	0.988020	0.959268
3	6	Randorm Forest classifier. Estimators: 5	max_depth	0.948435	0.979036	0.958635
2	11	Randorm Forest classifier. Estimators: 4	max_depth	0.940178	0.987420	0.955925
0	6	Decision Tree classifier	max_depth	0.936564	0.982339	0.951822
1	5	Randorm Forest classifier. Estimators: 3	max_depth	0.941284	0.966166	0.949578
3	400	SVC rbf	C	0.938881	0.952694	0.943486
0	100	SVC linear	C	0.940086	0.947906	0.942693
4	800	SVC sigmoid	C	0.936479	0.945807	0.939588
1	800	SVC poly2	C	0.925813	0.936825	0.929484
0	12	Randorm Forest classifier. Estimators: 2	max_depth	0.910207	0.962560	0.927658
0	5	KNC	n_neighbors	0.921001	0.937735	0.926579
2	700	SVC poly3	C	0.909024	0.907187	0.908412
0	2	Adaboost classifier	n_estimators	0.766362	0.777865	0.770196
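Each per-model dataframe keeps its own index, which is why the combined table above repeats index values such as 0. A minimal sketch of the same ranking with a clean index follows; the two small frames reuse scores from the tables above and are otherwise stand-ins for the real results.

```python
import pandas as pd

# Stand-ins for knc_df and svc_df, with scores copied from the tables above
knc_df = pd.DataFrame({'model': ['KNC'], 'weighted_score': [0.926579]})
svc_df = pd.DataFrame({'model': ['SVC rbf', 'SVC linear'],
                       'weighted_score': [0.943486, 0.942693]})

# Concatenate, rank by weighted score and renumber the rows from 0
ranking = (pd.concat([knc_df, svc_df])
             .sort_values(by='weighted_score', ascending=False)
             .reset_index(drop=True))

# The first row is now the best model
best = ranking.iloc[0]
print(ranking)
print(best['model'])
```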

Selection of best model and classification reports

# Hold out a test set (train_test_split keeps 25% for testing by default)
X_train, X_test, y_train, y_test = train_test_split(Xn,y)
# Best setting from the ranking above
rfc=RFC(max_depth=14,n_estimators=7)
rfc.fit(X_train,y_train)
y_pred=rfc.predict(X_test)
# Rows are true classes, columns are predicted classes
conf_matr=confusion_matrix(y_test, y_pred)
print(conf_matr)
print(classification_report(y_test, y_pred,target_names=labels))

[[103   0   0   0]
 [  0  23   0   4]
 [  1   0  25   0]
 [  1   4   0  48]]
             precision    recall  f1-score   support

       beef       0.98      1.00      0.99       103
      fruit       0.85      0.85      0.85        27
       fish       1.00      0.96      0.98        26
  vegetable       0.92      0.91      0.91        53

avg / total       0.95      0.95      0.95       209

plot_confusion_matrix(conf_matr,labels,normalize=False)

Confusion matrix, without normalization
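The precision, recall and f1-score columns of the report can be derived directly from the confusion matrix printed above. The sketch below does this with NumPy; the matrix is copied from the output, with rows as true classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix from the output above (rows: true, columns: predicted)
conf_matr = np.array([[103, 0, 0, 0],
                      [0, 23, 0, 4],
                      [1, 0, 25, 0],
                      [1, 4, 0, 48]])

# Correct predictions per class sit on the diagonal
true_pos = np.diag(conf_matr)

# Precision divides by everything predicted as the class (column sums)
precision = true_pos / conf_matr.sum(axis=0)
# Recall divides by everything that truly is the class (row sums)
recall = true_pos / conf_matr.sum(axis=1)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / conf_matr.sum()

print(np.round(precision, 2))  # [0.98 0.85 1.   0.92]
print(np.round(recall, 2))     # [1.   0.85 0.96 0.91]
print(np.round(f1, 2))         # [0.99 0.85 0.98 0.91]
```

The row sums (103, 27, 26, 53) are the support column of the report.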

Run all the models to get scores at once

models_df=msp.modelsCalculation(
    Xn,y,
    abc={'paramRange':(2,20,2)},
    rfc={'estimatorsRange':(10,20,1),'paramRange':(1,20,1)},
    dtc={'paramRange':(1,20,1)})
models_df

	best_param	model	param_name	test_score	train_score	weighted_score
8	18	Randorm Forest classifier. Estimators: 18	max_depth	0.955635	0.998205	0.969825
9	14	Randorm Forest classifier. Estimators: 19	max_depth	0.954473	0.999101	0.969349
6	12	Randorm Forest classifier. Estimators: 16	max_depth	0.954466	0.998804	0.969245
5	16	Randorm Forest classifier. Estimators: 15	max_depth	0.954494	0.998499	0.969163
2	16	Randorm Forest classifier. Estimators: 12	max_depth	0.954466	0.997906	0.968946
7	19	Randorm Forest classifier. Estimators: 17	max_depth	0.953289	0.999701	0.968760
3	10	Randorm Forest classifier. Estimators: 13	max_depth	0.952063	0.998505	0.967544
1	14	Randorm Forest classifier. Estimators: 11	max_depth	0.950837	0.998503	0.966726
4	19	Randorm Forest classifier. Estimators: 14	max_depth	0.950887	0.998202	0.966659
0	17	Randorm Forest classifier. Estimators: 10	max_depth	0.950922	0.997601	0.966482
0	13	Decision Tree classifier	max_depth	0.937733	1.000000	0.958489
0	5	KNC	n_neighbors	0.921001	0.937735	0.926579
0	8	SVC linear	C	0.919817	0.920061	0.919899
5	9	SVC rbf	C	0.892192	0.895507	0.893297
6	9	SVC sigmoid	C	0.850308	0.853003	0.851206
0	2	Adaboost classifier	n_estimators	0.766362	0.777865	0.770196
1	9	SVC poly2	C	0.734166	0.735332	0.734555
2	6	SVC poly3	C	0.730580	0.730243	0.730467
4	1	SVC poly5	C	0.492236	0.492217	0.492230
3	1	SVC poly4	C	0.492236	0.492217	0.492230
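The idea of scoring several classifiers in one pass can be sketched with plain scikit-learn. This is not the implementation of modelsCalculation: the candidates dictionary, the parameter grids and the iris dataset are all assumptions for illustration, and the ranking here uses the mean validation score rather than the weighted score.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): iris instead of the food data
X, y = load_iris(return_X_y=True)
Xn = MinMaxScaler().fit_transform(X)

# One (estimator, parameter grid) pair per model; the grids are illustrative
candidates = {
    'KNC': (KNeighborsClassifier(), {'n_neighbors': [1, 5, 15]}),
    'Decision Tree': (DecisionTreeClassifier(random_state=0),
                      {'max_depth': [2, 4, 8]}),
    'Random Forest': (RandomForestClassifier(n_estimators=10, random_state=0),
                      {'max_depth': [2, 4, 8]}),
    'AdaBoost': (AdaBoostClassifier(random_state=0),
                 {'n_estimators': [2, 10, 20]}),
}

rows = []
for name, (clf, grid) in candidates.items():
    # 5-fold cross-validated search over this model's grid
    search = GridSearchCV(clf, grid, cv=5, return_train_score=True).fit(Xn, y)
    i = search.best_index_
    rows.append({
        'model': name,
        'best_params': search.best_params_,
        'test_score': search.cv_results_['mean_test_score'][i],
        'train_score': search.cv_results_['mean_train_score'][i],
    })

# One row per model, best validation score first
models_df = (pd.DataFrame(rows)
               .sort_values('test_score', ascending=False)
               .reset_index(drop=True))
print(models_df)
```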

References

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
http://scikit-learn.org/stable/model_selection.html
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
http://scikit-learn.org/stable/modules/classes.html
http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
https://github.com/andreashsieh/stacked_generalization
https://stackoverflow.com/questions/37095246/how-to-use-adaboost-with-different-base-estimator-in-scikit-learn

bsaldivaremc2/classification_model_selector
