bsaldivaremc2/classification_model_selector

Find the best classification model for your dataset

There are several classification models, and each requires tuning to find the best parameters for a given dataset. This task can get challenging, which is why I present a Python library that finds the best parameter for each of a group of classifiers: K Neighbors Classifier, Support Vector Classifier, Decision Tree Classifier, Random Forest Classifier, and AdaBoost Classifier.

In the first part I use a library to merge dataframes containing food nutrition features. The source of this data is https://ndb.nal.usda.gov/ndb/ . I am not including the CSVs yet, since the data can be obtained from this source in JSON format. In addition, I will post another publication on how to get the data.

I use tSNE to show how vegetables, fruits, beef and fish are clustered.
Afterwards, I show some examples of how to use the library, model_score_plot. The function msp.modelsCalculation gets the best parameter settings for all of the classifiers above at once.

To see more information, type help(model_score_plot) after importing.

import pandas as pd
import os
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
#MinMaxScaler and RandomForestClassifier are used further below
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier as RFC
import importlib #the imp module is deprecated and later removed; importlib provides reload
import sys
sys.path.insert(0, 'PyLib')

#A library from the internet, modified to suit this demonstration
import plot_confusion_matrix_compact 
importlib.reload(plot_confusion_matrix_compact)
from plot_confusion_matrix_compact import plot_confusion_matrix

#The mentioned library
import model_score_plot
importlib.reload(model_score_plot)
from model_score_plot import ModelScorePlot as MSP

#Library to merge dataframes and show a tSNE image
import df_proc
importlib.reload(df_proc)
from df_proc import *

Prepare Dataset

csvDir='CSVs'
beef_df=pd.read_csv(csvDir+'/'+'beef_df_100g.csv',sep=',',index_col=0)
fish_df=pd.read_csv(csvDir+'/'+'fish_df_100g.csv',sep=',',index_col=0)
veg_df=pd.read_csv(csvDir+'/'+'veg_df_100g.csv',sep=',',index_col=0)
fruit_df=pd.read_csv(csvDir+'/'+'fruit_df_100g.csv',sep=',',index_col=0)
dfL=[beef_df,fish_df,fruit_df,veg_df]
typesL=['beef','fish','fruit','vegetable']
colorL=['black','blue','red','green']
dataDic=combineDfs(iDfL=dfL,typesL=typesL,colorL=colorL,figSize=(50,50),iDpi=80,plotText=False)

(figure: tSNE scatter plot of the combined food dataset, colored by type)

dataDic['df'].head(3)
| | Proximates Water g | Proximates Energy kcal | Proximates Energy kJ | Proximates Protein g | Proximates Total lipid (fat) g | Proximates Ash g | Proximates Carbohydrate, by difference g | Minerals Calcium, Ca mg | Minerals Iron, Fe mg | type |
|---|---|---|---|---|---|---|---|---|---|---|
| Beef_ribeye_cap_steak_boneless_separable_lean_only_trimmed_to_0"_fat_choice_raw | 66.50 | 187.0 | 784.0 | 19.46 | 11.40 | 0.89 | 1.75 | 6.0 | 2.64 | beef |
| Beef_loin_tenderloin_steak_boneless_separable_lean_only_trimmed_to_0"_fat_choice_raw | 72.04 | 143.0 | 597.0 | 21.78 | 6.16 | 1.11 | 0.00 | 13.0 | 2.55 | beef |
| Beef_rib_eye_steakslashroast_boneless_lip-on_separable_lean_only_trimmed_to_1slash8"_fat_select_raw | 70.89 | 148.0 | 619.0 | 22.55 | 6.41 | 1.03 | 0.00 | 5.0 | 1.80 | beef |
cols=list(dataDic['df'].columns.values)
colsX=cols.copy()
colsX.remove('type')
labels=list(set(dataDic['df']['type'].values))
print(cols)
print(labels)
print(colsX)
X=dataDic['df'][colsX].values
y=dataDic['df']['type'].values
for i,t in enumerate(labels):
    y[y==t]=i
y=y.flatten().astype(int)
X=X.astype(float)

mms=MinMaxScaler()
Xn=mms.fit_transform(X)
['Proximates Water g', 'Proximates Energy kcal', 'Proximates Energy kJ', 'Proximates Protein g', 'Proximates Total lipid (fat) g', 'Proximates Ash g', 'Proximates Carbohydrate, by difference g', 'Minerals Calcium, Ca mg', 'Minerals Iron, Fe mg', 'type']
['vegetable', 'beef', 'fish', 'fruit']
['Proximates Water g', 'Proximates Energy kcal', 'Proximates Energy kJ', 'Proximates Protein g', 'Proximates Total lipid (fat) g', 'Proximates Ash g', 'Proximates Carbohydrate, by difference g', 'Minerals Calcium, Ca mg', 'Minerals Iron, Fe mg']
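
Although combineDfs renders the tSNE image internally, a minimal sketch of the same projection written directly against sklearn (assuming the Xn, y and labels variables prepared above) could look like this:

#Sketch only: 2-D tSNE embedding of the scaled features, one scatter group per food type
emb=TSNE(n_components=2,random_state=0).fit_transform(Xn)
for i,name in enumerate(labels):
    mask=(y==i)
    plt.scatter(emb[mask,0],emb[mask,1],s=8,label=name)
plt.legend()
plt.title('tSNE of food nutrition features')
plt.show()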

Create a Model Score Plot object

msp = MSP()

K Neighbors Classifier

knc_df=msp.kncScores(Xn,y,cv=5,param_name='n_neighbors',paramRange=(1,100,1),trainW=1,testW=2,title='KNC',plot=True)

(figure: KNC train/test score curves over n_neighbors)

knc_df
|   | best_param | model | param_name  | test_score | train_score | weighted_score |
|---|------------|-------|-------------|------------|-------------|----------------|
| 0 | 5          | KNC   | n_neighbors | 0.921001   | 0.937735    | 0.926579       |
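
The weighted_score column appears to be the weighted mean of the cross-validated test and train scores using the testW and trainW arguments; this is an inference from the numbers, not documented behavior:

#Inferred (not documented): with trainW=1, testW=2 this reproduces the row above
print((2*0.921001+1*0.937735)/(2+1)) #-> 0.926579, matching weighted_score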

SVC

Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

svc_df=msp.svcScores(Xn,y,cv=5,param_name='C',max_iter=50000,degrees=(2,4,1),paramRange=(100,1000,100),plot=True)

(figures: SVC train/test score curves over C, one per kernel)

svc_df
|   | best_param | model       | param_name | test_score | train_score | weighted_score |
|---|------------|-------------|------------|------------|-------------|----------------|
| 3 | 400        | SVC rbf     | C          | 0.938881   | 0.952694    | 0.943486       |
| 0 | 100        | SVC linear  | C          | 0.940086   | 0.947906    | 0.942693       |
| 4 | 800        | SVC sigmoid | C          | 0.936479   | 0.945807    | 0.939588       |
| 1 | 800        | SVC poly2   | C          | 0.925813   | 0.936825    | 0.929484       |
| 2 | 700        | SVC poly3   | C          | 0.909024   | 0.907187    | 0.908412       |
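
For comparison, a rough sklearn-only equivalent of this sweep (a sketch of the idea, not the internals of svcScores) would be a cross-validated grid search over C and the kernel:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
#Sketch: 5-fold CV search over the same C range for a few kernels
grid=GridSearchCV(SVC(max_iter=50000),
                  param_grid={'C':list(range(100,1000,100)),'kernel':['linear','rbf','sigmoid']},
                  cv=5,return_train_score=True)
grid.fit(Xn,y)
print(grid.best_params_,grid.best_score_)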

Decision Tree Classifier

dtc_df=msp.dtcScores(Xn,y,cv=5,param_name='max_depth',paramRange=(1,10,1),trainW=1,testW=2,title='Decision Tree classifier',plot=True)
dtc_df

(figures: decision tree score curves over max_depth, for the gini and entropy criteria)

|   | best_param | model                            | param_name | test_score | train_score | weighted_score |
|---|------------|----------------------------------|------------|------------|-------------|----------------|
| 0 | 8          | Decision Tree classifier gini    | max_depth  | 0.937733   | 0.996106    | 0.957191       |
| 1 | 7          | Decision Tree classifier entropy | max_depth  | 0.937698   | 0.988019    | 0.954472       |
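
The same criterion comparison can be sketched directly with sklearn (again, an illustration of what is being swept, not the library's implementation):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
#Sketch: 5-fold CV score for both split criteria at one fixed depth
for crit in ('gini','entropy'):
    clf=DecisionTreeClassifier(criterion=crit,max_depth=8)
    print(crit,cross_val_score(clf,Xn,y,cv=5).mean())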

Random Forest Classifier

rfc_df=msp.rfcScores(Xn,y,cv=5,param_name='max_depth',estimatorsRange=(2,11,1),paramRange=(1,15,1),trainW=1,testW=2,title='Random Forest classifier',clfArg={},plot=True)

(figures: random forest score curves over max_depth, one per criterion and estimator-count combination)

rfc_df
|    | best_param | model                                             | param_name | test_score | train_score | weighted_score |
|----|------------|---------------------------------------------------|------------|------------|-------------|----------------|
| 6  | 11         | Random Forest classifier. Criterion: gini. Es...  | max_depth  | 0.954459   | 0.996706    | 0.968541       |
| 8  | 14         | Random Forest classifier. Criterion: gini. Es...  | max_depth  | 0.952071   | 0.997308    | 0.967150       |
| 14 | 14         | Random Forest classifier. Criterion: entropy....  | max_depth  | 0.952106   | 0.996103    | 0.966771       |
| 15 | 12         | Random Forest classifier. Criterion: entropy....  | max_depth  | 0.950852   | 0.995804    | 0.965836       |
| 13 | 8          | Random Forest classifier. Criterion: entropy....  | max_depth  | 0.952078   | 0.992813    | 0.965656       |
| 5  | 8          | Random Forest classifier. Criterion: gini. Es...  | max_depth  | 0.950986   | 0.994608    | 0.965527       |
| 7  | 9          | Random Forest classifier. Criterion: gini. Es...  | max_depth  | 0.947265   | 0.996107    | 0.963546       |
| 12 | 11         | Random Forest classifier. Criterion: entropy....  | max_depth  | 0.949682   | 0.991015    | 0.963460       |
| 4  | 7          | Random Forest classifier. Criterion: gini. Es...  | max_depth  | 0.948527   | 0.990404    | 0.962486       |
| 17 | 5          | Random Forest classifier. Criterion: entropy....  | max_depth  | 0.953261   | 0.978150    | 0.961557       |
| 16 | 5          | Random Forest classifier. Criterion: entropy....  | max_depth  | 0.951986   | 0.980534    | 0.961502       |
| 3  | 11         | Random Forest classifier. Criterion: gini. Es...  | max_depth  | 0.944842   | 0.992211    | 0.960631       |
| 10 | 7          | Random Forest classifier. Criterion: entropy....  | max_depth  | 0.944884   | 0.981743    | 0.957170       |
| 2  | 10         | Random Forest classifier. Criterion: gini. Es...  | max_depth  | 0.940107   | 0.989223    | 0.956479       |
| 11 | 6          | Random Forest classifier. Criterion: entropy....  | max_depth  | 0.938825   | 0.979630    | 0.952427       |
| 1  | 7          | Random Forest classifier. Criterion: gini. Es...  | max_depth  | 0.934112   | 0.982631    | 0.950285       |
| 9  | 6          | Random Forest classifier. Criterion: entropy....  | max_depth  | 0.924610   | 0.963467    | 0.937562       |
| 0  | 4          | Random Forest classifier. Criterion: gini. Es...  | max_depth  | 0.914940   | 0.935347    | 0.921743       |

AdaBoost Classifier

Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier

abc_df=msp.abcScores(Xn,y,cv=5,param_name='n_estimators',paramRange=(1,10,1),trainW=1,testW=2,title='Adaboost classifier',clfArg={},plot=True)

(figure: AdaBoost score curve over n_estimators)

abc_df
|   | best_param | model               | param_name   | test_score | train_score | weighted_score |
|---|------------|---------------------|--------------|------------|-------------|----------------|
| 0 | 2          | Adaboost classifier | n_estimators | 0.766362   | 0.777865    | 0.770196       |
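
AdaBoost scores far below the other models here. One common remedy (see the AdaBoost base-estimator reference at the end) is to boost a stronger base estimator than the default stump; a minimal sketch, outside of model_score_plot:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
#Sketch: boost depth-3 trees instead of the default depth-1 stumps
abc=AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),n_estimators=50)
print(cross_val_score(abc,Xn,y,cv=5).mean())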
models=[knc_df,svc_df,dtc_df,rfc_df,abc_df]
pd.concat(models).sort_values(by='weighted_score',ascending=False)
|   | best_param | model                                    | param_name   | test_score | train_score | weighted_score |
|---|------------|------------------------------------------|--------------|------------|-------------|----------------|
| 5 | 14         | Random Forest classifier. Estimators: 7  | max_depth    | 0.958066   | 0.995210    | 0.970447       |
| 8 | 6          | Random Forest classifier. Estimators: 10 | max_depth    | 0.952141   | 0.987423    | 0.963902       |
| 6 | 9          | Random Forest classifier. Estimators: 8  | max_depth    | 0.948484   | 0.994608    | 0.963859       |
| 7 | 6          | Random Forest classifier. Estimators: 9  | max_depth    | 0.948477   | 0.990122    | 0.962359       |
| 4 | 7          | Random Forest classifier. Estimators: 6  | max_depth    | 0.944891   | 0.988020    | 0.959268       |
| 3 | 6          | Random Forest classifier. Estimators: 5  | max_depth    | 0.948435   | 0.979036    | 0.958635       |
| 2 | 11         | Random Forest classifier. Estimators: 4  | max_depth    | 0.940178   | 0.987420    | 0.955925       |
| 0 | 6          | Decision Tree classifier                 | max_depth    | 0.936564   | 0.982339    | 0.951822       |
| 1 | 5          | Random Forest classifier. Estimators: 3  | max_depth    | 0.941284   | 0.966166    | 0.949578       |
| 3 | 400        | SVC rbf                                  | C            | 0.938881   | 0.952694    | 0.943486       |
| 0 | 100        | SVC linear                               | C            | 0.940086   | 0.947906    | 0.942693       |
| 4 | 800        | SVC sigmoid                              | C            | 0.936479   | 0.945807    | 0.939588       |
| 1 | 800        | SVC poly2                                | C            | 0.925813   | 0.936825    | 0.929484       |
| 0 | 12         | Random Forest classifier. Estimators: 2  | max_depth    | 0.910207   | 0.962560    | 0.927658       |
| 0 | 5          | KNC                                      | n_neighbors  | 0.921001   | 0.937735    | 0.926579       |
| 2 | 700        | SVC poly3                                | C            | 0.909024   | 0.907187    | 0.908412       |
| 0 | 2          | Adaboost classifier                      | n_estimators | 0.766362   | 0.777865    | 0.770196       |
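
A sketch of pulling the single best configuration out of the combined table programmatically (assuming the models list and the concat call above):

#Sketch: rank all models and report the overall winner
all_df=pd.concat(models).sort_values(by='weighted_score',ascending=False)
best=all_df.iloc[0]
print(best['model'],best['param_name'],best['best_param'])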

Selection of the best model and classification reports

X_train, X_test, y_train, y_test = train_test_split(Xn,y)
rfc=RFC(max_depth=14,n_estimators=7)
rfc.fit(X_train,y_train)
y_pred=rfc.predict(X_test)
conf_matr=confusion_matrix(y_test, y_pred)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,target_names=labels))
[[103   0   0   0]
 [  0  23   0   4]
 [  1   0  25   0]
 [  1   4   0  48]]
             precision    recall  f1-score   support

       beef       0.98      1.00      0.99       103
      fruit       0.85      0.85      0.85        27
       fish       1.00      0.96      0.98        26
  vegetable       0.92      0.91      0.91        53

avg / total       0.95      0.95      0.95       209
plot_confusion_matrix(conf_matr,labels,normalize=False)
Confusion matrix, without normalization

(figure: confusion matrix plot)
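
One caveat on the split above: the classes are imbalanced (103 beef rows versus 26 fish rows in this test set), so a stratified split may give more stable per-class reports; a minimal variant of the call above:

#Sketch: keep class proportions equal across train and test
X_train,X_test,y_train,y_test=train_test_split(Xn,y,stratify=y,random_state=0)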

Run all the models to get scores at once

models_df=msp.modelsCalculation(Xn,y,abc={'paramRange':(2,20,2)},rfc={'estimatorsRange':(10,20,1),'paramRange':(1,20,1)},dtc={'paramRange':(1,20,1)})
models_df
|   | best_param | model                                    | param_name   | test_score | train_score | weighted_score |
|---|------------|------------------------------------------|--------------|------------|-------------|----------------|
| 8 | 18         | Random Forest classifier. Estimators: 18 | max_depth    | 0.955635   | 0.998205    | 0.969825       |
| 9 | 14         | Random Forest classifier. Estimators: 19 | max_depth    | 0.954473   | 0.999101    | 0.969349       |
| 6 | 12         | Random Forest classifier. Estimators: 16 | max_depth    | 0.954466   | 0.998804    | 0.969245       |
| 5 | 16         | Random Forest classifier. Estimators: 15 | max_depth    | 0.954494   | 0.998499    | 0.969163       |
| 2 | 16         | Random Forest classifier. Estimators: 12 | max_depth    | 0.954466   | 0.997906    | 0.968946       |
| 7 | 19         | Random Forest classifier. Estimators: 17 | max_depth    | 0.953289   | 0.999701    | 0.968760       |
| 3 | 10         | Random Forest classifier. Estimators: 13 | max_depth    | 0.952063   | 0.998505    | 0.967544       |
| 1 | 14         | Random Forest classifier. Estimators: 11 | max_depth    | 0.950837   | 0.998503    | 0.966726       |
| 4 | 19         | Random Forest classifier. Estimators: 14 | max_depth    | 0.950887   | 0.998202    | 0.966659       |
| 0 | 17         | Random Forest classifier. Estimators: 10 | max_depth    | 0.950922   | 0.997601    | 0.966482       |
| 0 | 13         | Decision Tree classifier                 | max_depth    | 0.937733   | 1.000000    | 0.958489       |
| 0 | 5          | KNC                                      | n_neighbors  | 0.921001   | 0.937735    | 0.926579       |
| 0 | 8          | SVC linear                               | C            | 0.919817   | 0.920061    | 0.919899       |
| 5 | 9          | SVC rbf                                  | C            | 0.892192   | 0.895507    | 0.893297       |
| 6 | 9          | SVC sigmoid                              | C            | 0.850308   | 0.853003    | 0.851206       |
| 0 | 2          | Adaboost classifier                      | n_estimators | 0.766362   | 0.777865    | 0.770196       |
| 1 | 9          | SVC poly2                                | C            | 0.734166   | 0.735332    | 0.734555       |
| 2 | 6          | SVC poly3                                | C            | 0.730580   | 0.730243    | 0.730467       |
| 4 | 1          | SVC poly5                                | C            | 0.492236   | 0.492217    | 0.492230       |
| 3 | 1          | SVC poly4                                | C            | 0.492236   | 0.492217    | 0.492230       |

References

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
http://scikit-learn.org/stable/model_selection.html
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
http://scikit-learn.org/stable/modules/classes.html
http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
https://github.com/andreashsieh/stacked_generalization
https://stackoverflow.com/questions/37095246/how-to-use-adaboost-with-different-base-estimator-in-scikit-learn
