There are several classification models, and each needs tuning to find the best parameters for a given dataset. This task can get challenging, which is why I present a Python library that finds the best parameter for each classifier in a group. These classifiers are: K Neighbors Classifier, Support Vector Classifier, Decision Tree Classifier, Random Forest Classifier, and AdaBoost Classifier.
In the first part I use a library to merge dataframes containing foods' nutrition features. The source of this data is https://ndb.nal.usda.gov/ndb/ . I am not including the CSVs yet, since the data can be obtained from this source in JSON format. I will also post another publication on how to get the data.
I use t-SNE to show how vegetables, fruits, beef and fish are clustered.
Afterwards, I show some examples of how to use the mentioned library, model_score_plot. The function msp.modelsCalculation returns the best parameter settings for all of these classifiers at once.
To see more information, type help(model_score_plot) after importing.
import pandas as pd
import os
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
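from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
#Imported explicitly because MinMaxScaler and RFC are used further down
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier as RFC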
#importlib replaces the deprecated imp module (removed in Python 3.12)
import importlib
import sys
sys.path.insert(0, 'PyLib')
#A modified library from the internet, suited to this demonstration
import plot_confusion_matrix_compact
importlib.reload(plot_confusion_matrix_compact)
from plot_confusion_matrix_compact import plot_confusion_matrix
#The mentioned library
import model_score_plot
importlib.reload(model_score_plot)
from model_score_plot import ModelScorePlot as MSP
#Library to merge dataframes and show a tSNE image
import df_proc
importlib.reload(df_proc)
from df_proc import *
csvDir='CSVs'
beef_df=pd.read_csv(os.path.join(csvDir,'beef_df_100g.csv'),sep=',',index_col=0)
fish_df=pd.read_csv(os.path.join(csvDir,'fish_df_100g.csv'),sep=',',index_col=0)
veg_df=pd.read_csv(os.path.join(csvDir,'veg_df_100g.csv'),sep=',',index_col=0)
fruit_df=pd.read_csv(os.path.join(csvDir,'fruit_df_100g.csv'),sep=',',index_col=0)
dfL=[beef_df,fish_df,fruit_df,veg_df]
typesL=['beef','fish','fruit','vegetable']
colorL=['black','blue','red','green']
dataDic=combineDfs(iDfL=dfL,typesL=typesL,colorL=colorL,figSize=(50,50),iDpi=80,plotText=False)
dataDic['df'].head(3)
| | Proximates Water g | Proximates Energy kcal | Proximates Energy kJ | Proximates Protein g | Proximates Total lipid (fat) g | Proximates Ash g | Proximates Carbohydrate, by difference g | Minerals Calcium, Ca mg | Minerals Iron, Fe mg | type |
---|---|---|---|---|---|---|---|---|---|---|
Beef_ribeye_cap_steak_boneless_separable_lean_only_trimmed_to_0"_fat_choice_raw | 66.50 | 187.0 | 784.0 | 19.46 | 11.40 | 0.89 | 1.75 | 6.0 | 2.64 | beef |
Beef_loin_tenderloin_steak_boneless_separable_lean_only_trimmed_to_0"_fat_choice_raw | 72.04 | 143.0 | 597.0 | 21.78 | 6.16 | 1.11 | 0.00 | 13.0 | 2.55 | beef |
Beef_rib_eye_steakslashroast_boneless_lip-on_separable_lean_only_trimmed_to_1slash8"_fat_select_raw | 70.89 | 148.0 | 619.0 | 22.55 | 6.41 | 1.03 | 0.00 | 5.0 | 1.80 | beef |
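The internals of combineDfs are not shown here, so the following is only a minimal sketch of the general idea: stack the per-category frames, tag each row with its category, and project the numeric features to two dimensions with t-SNE. The helper name `combine_frames`, the column names, and the synthetic values are all hypothetical stand-ins, not the library's API or the USDA data.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

def combine_frames(frames, types):
    """Stack per-category frames and record each row's category in a 'type' column."""
    tagged = [df.assign(type=t) for df, t in zip(frames, types)]
    return pd.concat(tagged)

# Synthetic stand-ins for two nutrition tables (hypothetical values).
rng = np.random.default_rng(0)
cols = ['Water g', 'Protein g', 'Carbohydrate g']
meat = pd.DataFrame(rng.normal([70, 20, 0], 2, size=(30, 3)), columns=cols)
plant = pd.DataFrame(rng.normal([88, 1, 9], 2, size=(30, 3)), columns=cols)

combined = combine_frames([meat, plant], ['beef', 'fruit'])

# Project the numeric features to 2-D; perplexity must stay below the sample count.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(
    combined[cols].to_numpy())
print(embedding.shape)
```

Each row of `embedding` can then be drawn as a point in a scatter plot, colored by the matching entry of `combined['type']`.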
cols=list(dataDic['df'].columns.values)
colsX=cols.copy()
colsX.remove('type')
labels=list(set(dataDic['df']['type'].values))
print(cols)
print(labels)
print(colsX)
X=dataDic['df'][colsX].values
y=dataDic['df']['type'].values
for i,t in enumerate(labels):
y[y==t]=i
y=y.flatten().astype(int)
X=X.astype(float)
mms=MinMaxScaler()
Xn=mms.fit_transform(X)
['Proximates Water g', 'Proximates Energy kcal', 'Proximates Energy kJ', 'Proximates Protein g', 'Proximates Total lipid (fat) g', 'Proximates Ash g', 'Proximates Carbohydrate, by difference g', 'Minerals Calcium, Ca mg', 'Minerals Iron, Fe mg', 'type']
['vegetable', 'beef', 'fish', 'fruit']
['Proximates Water g', 'Proximates Energy kcal', 'Proximates Energy kJ', 'Proximates Protein g', 'Proximates Total lipid (fat) g', 'Proximates Ash g', 'Proximates Carbohydrate, by difference g', 'Minerals Calcium, Ca mg', 'Minerals Iron, Fe mg']
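Because `list(set(...))` has no guaranteed order, the integer assigned to each class can change between sessions. One possible alternative, sketched here under the assumption that a fixed alphabetical order is acceptable, is scikit-learn's LabelEncoder. The arrays below are small made-up examples, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Tiny illustrative inputs (not the real dataset).
types = np.array(['beef', 'fish', 'fruit', 'vegetable', 'beef', 'fruit'])
features = np.array([[66.5, 187.0], [78.0, 90.0], [85.0, 52.0],
                     [92.0, 25.0], [72.0, 143.0], [88.0, 40.0]])

encoder = LabelEncoder()
y_demo = encoder.fit_transform(types)    # classes are sorted alphabetically
labels_demo = list(encoder.classes_)     # entry i is the name of class i

scaler = MinMaxScaler()
X_demo = scaler.fit_transform(features)  # each column rescaled to [0, 1]

print(labels_demo)
print(y_demo)
```

With this approach `labels_demo` can be passed as `target_names` to classification_report and always lines up with the encoded integers.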
msp = MSP()
knc_df=msp.kncScores(Xn,y,cv=5,param_name='n_neighbors',paramRange=(1,100,1),trainW=1,testW=2,title='KNC',plot=True)
knc_df
| | best_param | model | param_name | test_score | train_score | weighted_score |
---|---|---|---|---|---|---|
0 | 5 | SVC poly5 | n_neighbors | 0.921001 | 0.937735 | 0.926579 |
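The weighted_score values in these tables are consistent with a weighted mean of the train and test scores using trainW and testW. That formula is inferred from the printed numbers, not taken from the library source. The sketch below shows how such a sweep could be done with scikit-learn's validation_curve; whether kncScores works this way is an assumption, and the Iris data is only a stand-in.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

def weighted_score(train, test, train_w=1, test_w=2):
    """Weighted mean of train and test scores (formula inferred from the tables)."""
    return (train_w * train + test_w * test) / (train_w + test_w)

# Iris is only a stand-in for the nutrition data.
X_demo, y_demo = load_iris(return_X_y=True)
X_demo = MinMaxScaler().fit_transform(X_demo)

param_range = np.arange(1, 30)
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X_demo, y_demo,
    param_name='n_neighbors', param_range=param_range, cv=5)

# Average over the folds, then combine into one score per candidate value.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
combined = weighted_score(mean_train, mean_test)
best_k = int(param_range[np.argmax(combined)])
print(best_k, round(float(combined.max()), 4))

# Check the inferred formula against the KNC row: train 0.937735, test 0.921001.
print(round(weighted_score(0.937735, 0.921001), 6))
```

Weighting the test score more heavily than the train score favors settings that generalize, while still penalizing a model that barely fits the training folds.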
Reference:
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
svc_df=msp.svcScores(Xn,y,cv=5,param_name='C',max_iter=50000,degrees=(2,4,1),paramRange=(100,1000,100),plot=True)
svc_df
| | best_param | model | param_name | test_score | train_score | weighted_score |
---|---|---|---|---|---|---|
3 | 400 | SVC rbf | C | 0.938881 | 0.952694 | 0.943486 |
0 | 100 | SVC linear | C | 0.940086 | 0.947906 | 0.942693 |
4 | 800 | SVC sigmoid | C | 0.936479 | 0.945807 | 0.939588 |
1 | 800 | SVC poly2 | C | 0.925813 | 0.936825 | 0.929484 |
2 | 700 | SVC poly3 | C | 0.909024 | 0.907187 | 0.908412 |
dtc_df=msp.dtcScores(Xn,y,cv=5,param_name='max_depth',paramRange=(1,10,1),trainW=1,testW=2,title='Decision Tree classifier',plot=True)
dtc_df
| | best_param | model | param_name | test_score | train_score | weighted_score |
---|---|---|---|---|---|---|
0 | 8 | Decision Tree classifier gini | max_depth | 0.937733 | 0.996106 | 0.957191 |
1 | 7 | Decision Tree classifier entropy | max_depth | 0.937698 | 0.988019 | 0.954472 |
rfc_df=msp.rfcScores(Xn,y,cv=5,param_name='max_depth',estimatorsRange=(2,11,1),paramRange=(1,15,1),trainW=1,testW=2,title='Randorm Forest classifier',clfArg={},plot=True)
rfc_df
| | best_param | model | param_name | test_score | train_score | weighted_score |
---|---|---|---|---|---|---|
6 | 11 | Randorm Forest classifier. Criterion: gini. Es... | max_depth | 0.954459 | 0.996706 | 0.968541 |
8 | 14 | Randorm Forest classifier. Criterion: gini. Es... | max_depth | 0.952071 | 0.997308 | 0.967150 |
14 | 14 | Randorm Forest classifier. Criterion: entropy.... | max_depth | 0.952106 | 0.996103 | 0.966771 |
15 | 12 | Randorm Forest classifier. Criterion: entropy.... | max_depth | 0.950852 | 0.995804 | 0.965836 |
13 | 8 | Randorm Forest classifier. Criterion: entropy.... | max_depth | 0.952078 | 0.992813 | 0.965656 |
5 | 8 | Randorm Forest classifier. Criterion: gini. Es... | max_depth | 0.950986 | 0.994608 | 0.965527 |
7 | 9 | Randorm Forest classifier. Criterion: gini. Es... | max_depth | 0.947265 | 0.996107 | 0.963546 |
12 | 11 | Randorm Forest classifier. Criterion: entropy.... | max_depth | 0.949682 | 0.991015 | 0.963460 |
4 | 7 | Randorm Forest classifier. Criterion: gini. Es... | max_depth | 0.948527 | 0.990404 | 0.962486 |
17 | 5 | Randorm Forest classifier. Criterion: entropy.... | max_depth | 0.953261 | 0.978150 | 0.961557 |
16 | 5 | Randorm Forest classifier. Criterion: entropy.... | max_depth | 0.951986 | 0.980534 | 0.961502 |
3 | 11 | Randorm Forest classifier. Criterion: gini. Es... | max_depth | 0.944842 | 0.992211 | 0.960631 |
10 | 7 | Randorm Forest classifier. Criterion: entropy.... | max_depth | 0.944884 | 0.981743 | 0.957170 |
2 | 10 | Randorm Forest classifier. Criterion: gini. Es... | max_depth | 0.940107 | 0.989223 | 0.956479 |
11 | 6 | Randorm Forest classifier. Criterion: entropy.... | max_depth | 0.938825 | 0.979630 | 0.952427 |
1 | 7 | Randorm Forest classifier. Criterion: gini. Es... | max_depth | 0.934112 | 0.982631 | 0.950285 |
9 | 6 | Randorm Forest classifier. Criterion: entropy.... | max_depth | 0.924610 | 0.963467 | 0.937562 |
0 | 4 | Randorm Forest classifier. Criterion: gini. Es... | max_depth | 0.914940 | 0.935347 | 0.921743 |
abc_df=msp.abcScores(Xn,y,cv=5,param_name='n_estimators',paramRange=(1,10,1),trainW=1,testW=2,title='Adaboost classifier',clfArg={},plot=True)
abc_df
| | best_param | model | param_name | test_score | train_score | weighted_score |
---|---|---|---|---|---|---|
0 | 2 | Adaboost classifier | n_estimators | 0.766362 | 0.777865 | 0.770196 |
models=[knc_df,svc_df,dtc_df,rfc_df,abc_df]
pd.concat(models).sort_values(by='weighted_score',ascending=False)
| | best_param | model | param_name | test_score | train_score | weighted_score |
---|---|---|---|---|---|---|
5 | 14 | Randorm Forest classifier. Estimators: 7 | max_depth | 0.958066 | 0.995210 | 0.970447 |
8 | 6 | Randorm Forest classifier. Estimators: 10 | max_depth | 0.952141 | 0.987423 | 0.963902 |
6 | 9 | Randorm Forest classifier. Estimators: 8 | max_depth | 0.948484 | 0.994608 | 0.963859 |
7 | 6 | Randorm Forest classifier. Estimators: 9 | max_depth | 0.948477 | 0.990122 | 0.962359 |
4 | 7 | Randorm Forest classifier. Estimators: 6 | max_depth | 0.944891 | 0.988020 | 0.959268 |
3 | 6 | Randorm Forest classifier. Estimators: 5 | max_depth | 0.948435 | 0.979036 | 0.958635 |
2 | 11 | Randorm Forest classifier. Estimators: 4 | max_depth | 0.940178 | 0.987420 | 0.955925 |
0 | 6 | Decision Tree classifier | max_depth | 0.936564 | 0.982339 | 0.951822 |
1 | 5 | Randorm Forest classifier. Estimators: 3 | max_depth | 0.941284 | 0.966166 | 0.949578 |
3 | 400 | SVC rbf | C | 0.938881 | 0.952694 | 0.943486 |
0 | 100 | SVC linear | C | 0.940086 | 0.947906 | 0.942693 |
4 | 800 | SVC sigmoid | C | 0.936479 | 0.945807 | 0.939588 |
1 | 800 | SVC poly2 | C | 0.925813 | 0.936825 | 0.929484 |
0 | 12 | Randorm Forest classifier. Estimators: 2 | max_depth | 0.910207 | 0.962560 | 0.927658 |
0 | 5 | KNC | n_neighbors | 0.921001 | 0.937735 | 0.926579 |
2 | 700 | SVC poly3 | C | 0.909024 | 0.907187 | 0.908412 |
0 | 2 | Adaboost classifier | n_estimators | 0.766362 | 0.777865 | 0.770196 |
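The concatenated table keeps each frame's original row labels, so the index values repeat (several rows are labeled 0). A small sketch of one way to get a clean ranking, using a few rows copied from the tables above, follows; the `_demo` frame names are hypothetical.

```python
import pandas as pd

# A few rows taken from the result tables above.
knc_demo = pd.DataFrame({'model': ['KNC'], 'weighted_score': [0.926579]})
svc_demo = pd.DataFrame({'model': ['SVC linear', 'SVC rbf'],
                         'weighted_score': [0.942693, 0.943486]})

# ignore_index discards the per-frame labels; reset_index renumbers after sorting.
ranking = (pd.concat([knc_demo, svc_demo], ignore_index=True)
             .sort_values(by='weighted_score', ascending=False)
             .reset_index(drop=True))
print(ranking)

best = ranking.iloc[0]  # top-ranked model after sorting
```

After this, row 0 is always the best-scoring model and every row label is unique, which makes label-based lookups such as `ranking.loc[0]` unambiguous.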
X_train, X_test, y_train, y_test = train_test_split(Xn,y)
rfc=RFC(max_depth=14,n_estimators=7)
rfc.fit(X_train,y_train)
y_pred=rfc.predict(X_test)
conf_matr=confusion_matrix(y_test, y_pred)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,target_names=labels))
[[103 0 0 0]
[ 0 23 0 4]
[ 1 0 25 0]
[ 1 4 0 48]]
| | precision | recall | f1-score | support |
---|---|---|---|---|
beef | 0.98 | 1.00 | 0.99 | 103 |
fruit | 0.85 | 0.85 | 0.85 | 27 |
fish | 1.00 | 0.96 | 0.98 | 26 |
vegetable | 0.92 | 0.91 | 0.91 | 53 |
avg / total | 0.95 | 0.95 | 0.95 | 209 |
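To make the link between the confusion matrix and the report explicit, the per-class metrics can be recomputed by hand. In scikit-learn's convention, rows are true classes and columns are predicted classes. The sketch below uses the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
conf = np.array([[103, 0, 0, 0],
                 [0, 23, 0, 4],
                 [1, 0, 25, 0],
                 [1, 4, 0, 48]])
names = ['beef', 'fruit', 'fish', 'vegetable']

true_pos = np.diag(conf)
recall = true_pos / conf.sum(axis=1)     # correct / all samples of that true class
precision = true_pos / conf.sum(axis=0)  # correct / all samples predicted as that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / conf.sum()

for name, p, r, f in zip(names, precision, recall, f1):
    print(f'{name:10s} {p:.2f} {r:.2f} {f:.2f}')
print(f'accuracy {accuracy:.3f}')
```

The off-diagonal entries show that almost all errors are fruit and vegetable samples confused with each other, which is why those two classes have the lowest scores.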
plot_confusion_matrix(conf_matr,labels,normalize=False)
Confusion matrix, without normalization
models_df=msp.modelsCalculation(Xn,y,abc={'paramRange':(2,20,2)},rfc={'estimatorsRange':(10,20,1),'paramRange':(1,20,1)},dtc={'paramRange':(1,20,1)})
models_df
| | best_param | model | param_name | test_score | train_score | weighted_score |
---|---|---|---|---|---|---|
8 | 18 | Randorm Forest classifier. Estimators: 18 | max_depth | 0.955635 | 0.998205 | 0.969825 |
9 | 14 | Randorm Forest classifier. Estimators: 19 | max_depth | 0.954473 | 0.999101 | 0.969349 |
6 | 12 | Randorm Forest classifier. Estimators: 16 | max_depth | 0.954466 | 0.998804 | 0.969245 |
5 | 16 | Randorm Forest classifier. Estimators: 15 | max_depth | 0.954494 | 0.998499 | 0.969163 |
2 | 16 | Randorm Forest classifier. Estimators: 12 | max_depth | 0.954466 | 0.997906 | 0.968946 |
7 | 19 | Randorm Forest classifier. Estimators: 17 | max_depth | 0.953289 | 0.999701 | 0.968760 |
3 | 10 | Randorm Forest classifier. Estimators: 13 | max_depth | 0.952063 | 0.998505 | 0.967544 |
1 | 14 | Randorm Forest classifier. Estimators: 11 | max_depth | 0.950837 | 0.998503 | 0.966726 |
4 | 19 | Randorm Forest classifier. Estimators: 14 | max_depth | 0.950887 | 0.998202 | 0.966659 |
0 | 17 | Randorm Forest classifier. Estimators: 10 | max_depth | 0.950922 | 0.997601 | 0.966482 |
0 | 13 | Decision Tree classifier | max_depth | 0.937733 | 1.000000 | 0.958489 |
0 | 5 | KNC | n_neighbors | 0.921001 | 0.937735 | 0.926579 |
0 | 8 | SVC linear | C | 0.919817 | 0.920061 | 0.919899 |
5 | 9 | SVC rbf | C | 0.892192 | 0.895507 | 0.893297 |
6 | 9 | SVC sigmoid | C | 0.850308 | 0.853003 | 0.851206 |
0 | 2 | Adaboost classifier | n_estimators | 0.766362 | 0.777865 | 0.770196 |
1 | 9 | SVC poly2 | C | 0.734166 | 0.735332 | 0.734555 |
2 | 6 | SVC poly3 | C | 0.730580 | 0.730243 | 0.730467 |
4 | 1 | SVC poly5 | C | 0.492236 | 0.492217 | 0.492230 |
3 | 1 | SVC poly4 | C | 0.492236 | 0.492217 | 0.492230 |
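For readers without the library, a comparable "tune everything at once" workflow can be sketched with scikit-learn's GridSearchCV. This is not how modelsCalculation is implemented. The parameter ranges and the Iris stand-in data are assumptions, and GridSearchCV ranks candidates by mean test score rather than by the weighted score used above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Iris is only a stand-in for the nutrition data.
X_demo, y_demo = load_iris(return_X_y=True)

# One (estimator, grid) pair per model family; the ranges are illustrative.
searches = {
    'KNC': (KNeighborsClassifier(), {'n_neighbors': list(range(1, 16))}),
    'Decision Tree': (DecisionTreeClassifier(random_state=0),
                      {'max_depth': list(range(1, 8))}),
    'Random Forest': (RandomForestClassifier(n_estimators=10, random_state=0),
                      {'max_depth': list(range(1, 8))}),
}

rows = []
for name, (estimator, grid) in searches.items():
    search = GridSearchCV(estimator, grid, cv=5, return_train_score=True)
    search.fit(X_demo, y_demo)
    i = search.best_index_  # position of the best candidate in cv_results_
    rows.append({
        'model': name,
        'best_params': search.best_params_,
        'test_score': search.cv_results_['mean_test_score'][i],
        'train_score': search.cv_results_['mean_train_score'][i],
    })

summary = pd.DataFrame(rows).sort_values(by='test_score', ascending=False)
print(summary)
```

Keeping the train score next to the test score makes it easy to spot models, such as deep trees, that fit the training folds almost perfectly but generalize less well.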
References:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
http://scikit-learn.org/stable/model_selection.html
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
http://scikit-learn.org/stable/modules/classes.html
http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
https://github.com/andreashsieh/stacked_generalization
https://stackoverflow.com/questions/37095246/how-to-use-adaboost-with-different-base-estimator-in-scikit-learn