def train(self, features, answers):
    """
    Train the MLP classifier.

    Args:
        features: Array of input data
        answers: Array of labels
    """
    print("1.Training")
    mlpPerf = [['Epoch', 'Batch Size', 'Accuracy']]
    model = KerasClassifier(build_fn=self.create_model, verbose=0)

    # One-hot encode the label array
    answers = np_utils.to_categorical(answers)

    # Fit data to algo
    model.fit(features, answers)

    # Save results (accuracy on the training data, computed once)
    score = model.score(features, answers)
    mlpPerf.append([
        self.epoch,
        self.batch_size,
        "{0:.2f}".format(score * 100)
    ])
    self.precision.append(score)
    self.best_score = self.precision[0]

    # Print table
    print(Tabulate(mlpPerf, headers='firstrow'))
    print()
    return model

def test_keras_classifier():
    model = Sequential()
    model.add(Dense(input_dim, input_shape=(input_dim,)))
    model.add(Activation('relu'))
    model.add(Dense(nb_class))
    model.add(Activation('softmax'))

    sklearn_clf = KerasClassifier(model, optimizer=optim, loss=loss,
                                  train_batch_size=batch_size,
                                  test_batch_size=batch_size,
                                  nb_epoch=nb_epoch)
    sklearn_clf.fit(X_train, y_train)
    sklearn_clf.score(X_test, y_test)

def runSonarNN(params):
    # NOTE: scikit-learn doesn't like KerasClassifier parameters being passed in,
    # so the values grid search found are hardcoded here.
    mod = KerasClassifier(build_fn=create_smaller, epochs=1200, batch_size=5, verbose=0, learn_rate=.0001)
    history = mod.fit(X, encoded_Y, validation_split=.2)
    print(history.history.keys())

    # summarize history for accuracy
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper right')
    plt.show()

    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper right')
    plt.show()

    #print("Smaller: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
    title = "NN curve"
    x = plot_learning_curve(mod, title, X, encoded_Y, cv=3, n_jobs=1)
    x.show()
    return mod.score(Xval, encoded_Yval)

def SimpleLoss(individual, data, labels, layers, activation, *_):
    network = KerasClassifier(
        build_fn=CreateNeuralNetwork,
        input_size=data['train'].shape[1],
        output_size=2 if len(labels['train'].shape) < 2 else labels['train'].shape[1],
        layers=layers,
        activation=activation,
        lr=individual[1],
        dropout=individual[2],
        epochs=int(individual[0]),
        verbose=0)
    network.fit(data['train'], labels['train'])
    score = network.score(data['test'], labels['test'])
    # Loss to minimise: 1 - test accuracy
    return 1 - score

def main():
    # ++++++++++++++++++++++++++++++++++++++++++
    # DATA PREPROCESSING
    # ++++++++++++++++++++++++++++++++++++++++++

    #########
    # EITHER
    sentences = treebank.tagged_sents()
    # OR
    # sentences = parsebrown()  # have to dl brown corpus ("brown-universal.txt") and change path in parsebrown function
    #########

    # trnstc, tststc, valstc = ttvsplit(sentences[0:50000], .6, .3, .1)
    trnstc, tststc, valstc = ttvsplit(sentences, .6, .3, .1)
    xtrn, ytrn = str2dct(trnstc)
    xtst, ytst = str2dct(tststc)
    xval, yval = str2dct(valstc)
    dict_encoder, xtrn, xtst, xval = dct2arr(xtrn, xtst, xval)
    label_encoder, ytrn, ytst, yval = catenc(ytrn, ytst, yval)
    ytrn, ytst, yval = ohenc(ytrn, ytst, yval)

    # print(xtrn[0])
    # treebank (61014, 44232)
    # brown (860100, 188)
    #
    # print(ytrn[0])
    # treebank (61014, 46)
    # brown (860100, 9)

    # ++++++++++++++++++++++++++++++++++++++++++
    # MODEL
    # ++++++++++++++++++++++++++++++++++++++++++
    model_params = {
        'build_fn': build_model,
        'input_dim': xtrn.shape[1],
        'hidden_neurons': 512,
        'output_dim': ytrn.shape[1],
        'epochs': 3,
        'batch_size': 1024,
        'verbose': 1,
        'validation_data': (xval, yval),
        'shuffle': True
    }
    m = KerasClassifier(**model_params)
    hist = m.fit(xtrn, ytrn)
    score = m.score(xtst, ytst)
    print("score")
    print(score)
    m.model.save('model')

def runNN(params):
    estimator = KerasClassifier(build_fn=create_smaller, epochs=params["epochs"],
                                batch_size=params["batch_size"], verbose=0)

    #optimizer = ['SGD', 'RMSprop', 'Adam', 'Adamax']
    #epoch = [3000,5000]
    #learn_rate = [0.01 / (10**i) for i in range(6)]
    #param_grid = dict(learn_rate=learn_rate,epochs = epoch)
    #print(param_grid)
    #grid = GridSearchCV(estimator=estimator, param_grid=param_grid)#, n_jobs=-1)
    #grid_result = grid.fit(X,encoded_Y,validation_split=.33)
    # summarize results
    #print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    #means = grid_result.cv_results_['mean_test_score']
    #stds = grid_result.cv_results_['std_test_score']
    #params = grid_result.cv_results_['params']
    #for mean, stdev, param in zip(means, stds, params):
    #    print("%f (%f) with: %r" % (mean, stdev, param))

    history = estimator.fit(X, encoded_Y, validation_split=0.2)
    results = cross_val_score(estimator, X, encoded_Y, cv=3)
    print("Larger: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))
    #print(int(time.time()) - startTime)

    #print(history.history.keys())
    # summarize history for accuracy
    #plt.plot(history.history['acc'])
    #plt.plot(history.history['val_acc'])
    #plt.title('model accuracy')
    #plt.ylabel('accuracy')
    #plt.xlabel('epoch')
    #plt.legend(['train', 'test'], loc='upper right')
    #plt.show()

    # summarize history for loss
    #plt.plot(history.history['loss'])
    #plt.plot(history.history['val_loss'])
    #plt.title('model loss')
    #plt.ylabel('loss')
    #plt.xlabel('epoch')
    #plt.legend(['train', 'test'], loc='upper right')
    #plt.show()

    #print(estimator.score(Xval,encoded_Yval),"nn test")
    #print(estimator.score(X, encoded_Y),"nn train")
    title = "NN curve"
    #cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    print("Right before curve")
    x = plot_learning_curve(estimator, title, X, encoded_Y, cv=3, n_jobs=1)
    x.show()
    print("Training Time:", int(time.time()) - startTime, "\n")
    print(results)

    return estimator.score(Xval, encoded_Yval)

def main():
    code_dir = '/home/john/git/kaggle/OttoGroup/'
    data_dir = '/home/john/data/otto/'
    training_file = 'train.csv'

    os.chdir(code_dir)
    np.random.seed(1337)

    print('Starting script...')

    print('Loading data...')
    X, labels = load_training_data(data_dir, training_file)

    print('Pre-processing...')
    scaler = create_scaler(X)
    X = apply_scaler(X, scaler)
    y, y_onehot, encoder = preprocess_labels(labels)
    num_features = X.shape[1]
    num_classes = y_onehot.shape[1]
    print('Features = ' + str(num_features))
    print('Classes = ' + str(num_classes))

    print('Building model...')
    model = define_model(num_features, num_classes)
    print('Complete.')

    print('Training model...')
    wrapper = KerasClassifier(model)
    wrapper.fit(X, y_onehot, nb_epoch=20)
    print('Complete.')

    print('Training score = ' + str(wrapper.score(X, y_onehot)))

    preds = wrapper.predict(X)
    print('Predictions shape = ' + str(preds.shape))

    proba = wrapper.predict_proba(X)
    print('Probabilities shape = ' + str(proba.shape))

    print('Building ensemble...')
    ensemble = BaggingClassifier(wrapper, n_estimators=3, max_samples=1.0, max_features=1.0)
    print('Complete.')

    print('Training ensemble...')
    ensemble.fit(X, y)
    print('Complete.')

    print('Ensemble score = ' + str(ensemble.score(X, y)))

    print('Script complete.')

def runSonarNN(params):
    #estimators = []
    #estimators.append(('standardize', StandardScaler()))
    #estimators.append(('mlp', KerasClassifier(build_fn=create_smaller, epochs=200, batch_size=5, verbose=0,learn_rate=.0001)))
    #pipeline = Pipeline(estimators)
    #kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    #results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
    #print("Smaller: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
    title = "NN curve"
    #cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

    # NOTE: scikit-learn doesn't like KerasClassifier parameters being passed in,
    # so the values grid search found are hardcoded here.
    mod = KerasClassifier(build_fn=create_smaller, epochs=200, batch_size=5, verbose=0, learn_rate=.0001)
    history = mod.fit(X, encoded_Y, validation_split=.2)
    print(history.history.keys())

    # summarize history for accuracy
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()

    # summarize history for loss
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper right')
    plt.show()

    x = plot_learning_curve(mod, title, X, encoded_Y, cv=3, n_jobs=1)
    x.show()
    #print("Smaller: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
    #title = "NN curve"
    #cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    #x = plot_learning_curve(pipeline, title, X, encoded_Y, cv=3, n_jobs=1)
    #x.show()
    return mod.score(Xval, encoded_Yval)

class BaseKerasSklearnModel(base_model.BaseModel):
    '''
    base keras model wrapped for scikit-learn (KerasClassifier)
    '''

    ##    def __init__(self, data_file, delimiter, lst_x_keys, lst_y_keys, log_filename=DEFAULT_LOG_FILENAME, model_path=DEFAULT_MODEL_PATH, create_model_func=create_model_demo):
    ##        '''
    ##        init
    ##        '''
    ##        import framework.tools.log as log
    ##        loger = log.init_log(log_filename)
    ##        self.load_data(data_file, delimiter, lst_x_keys, lst_y_keys)
    ##        self.model_path = model_path
    ##        self.create_model_func=create_model_func

    def __init__(self, **kargs):
        '''
        init
        '''
        import framework.tools.log as log
        self.kargs = kargs
        log_filename = self.kargs["basic_params"]["log_filename"]
        model_path = self.kargs["basic_params"]["model_path"]
        self.load_data_func = self.kargs["load_data"]["method"]
        self.create_model_func = self.kargs["create_model"]["method"]
        loger = log.init_log(log_filename)
        (self.dataset, self.X, self.Y, self.X_evaluation, self.Y_evaluation) = self.load_data_func(
            **self.kargs["load_data"]["params"])
        self.model_path = model_path
        self.dic_params = {}

    def load_data(self, data_file, delimiter, lst_x_keys, lst_y_keys):
        '''
        load data
        '''
        # Load the dataset, honouring the caller's delimiter
        self.dataset = numpy.loadtxt(data_file, delimiter=delimiter)
        self.X = self.dataset[:, lst_x_keys]
        self.Y = self.dataset[:, lst_y_keys]

    def init_callbacks(self):
        '''
        init all callbacks
        '''
        os.system("mkdir -p %s" % (self.model_path))
        checkpoint_callback = ModelCheckpoint(
            self.model_path + '/weights.{epoch:02d}-{acc:.2f}.hdf5',
            monitor='acc', save_best_only=False)
        history_callback = LossHistory()
        callbacks_list = [checkpoint_callback, history_callback]
        self.dic_params["callbacks"] = callbacks_list

    def init_model(self):
        '''
        init model
        '''
        train_params = {"nb_epoch": 10, "batch_size": 10}
        self.dic_params.update(train_params)
        self.model = KerasClassifier(build_fn=self.create_model_func, **self.kargs["create_model"]["params"])
        # self.model = KerasClassifier(build_fn=self.create_model_func)
        self.model.set_params(**self.dic_params)

    def train_model(self):
        '''
        train model
        '''
        X = self.X
        Y = self.Y
        X_evaluation = self.X_evaluation
        Y_evaluation = self.Y_evaluation
        seed = 7
        numpy.random.seed(seed)

        # Fit, then score on the training data
        history = self.model.fit(X, Y)
        scores = self.model.score(X, Y)

        #history_callback = self.dic_params["callbacks"][1]
        # print dir(history_callback)
        # logging.info(str(history_callback.losses))
        logging.info("final : %.2f%%" % (scores * 100))
        logging.info(str(history.history))

    def process(self):
        '''
        process
        '''
        self.init_callbacks()
        self.init_model()
        self.train_model()

print('Defining model')
model = Sequential()
model.add(Dense(784, 50))
model.add(Activation('relu'))
model.add(Dense(50, 10))
model.add(Activation('softmax'))

print('Creating wrapper')
classifier = KerasClassifier(model)

print('Fitting model')
classifier.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch)

print('Testing score function')
score = classifier.score(X_train, Y_train)
print('Score: ', score)

print('Testing predict function')
preds = classifier.predict(X_test)
print('Preds.shape: ', preds.shape)

print('Testing predict proba function')
proba = classifier.predict_proba(X_test)
print('Proba.shape: ', proba.shape)

print('Testing get params')
print(classifier.get_params())

print('Testing set params')
classifier.set_params(optimizer='sgd', loss='mse')

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils.np_utils import to_categorical

x_train, x_test, y_train, y_test = get_all_data()

# `scaler` is defined outside this snippet
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

n_input = x_train.shape[1]     # number of features
n_output = 6                   # number of possible labels
n_samples = x_train.shape[0]   # number of training samples
n_hidden_units = 60

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# print(y_train.shape), print(y_test.shape)


def create_model():
    model = Sequential()
    model.add(Dense(n_hidden_units, input_dim=n_input, activation='relu'))
    # Only the first layer needs input_dim; later layers infer their input size
    model.add(Dense(n_hidden_units, activation='relu'))
    model.add(Dense(n_output, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


estimator = KerasClassifier(build_fn=create_model, epochs=20, batch_size=10, verbose=False)
estimator.fit(x_train, y_train)
print('Score:{}'.format(estimator.score(x_test, y_test)))

import numpy as np
from keras.utils.np_utils import to_categorical
from keras.wrappers.scikit_learn import KerasClassifier
from ionyx.contrib.keras_builder import KerasBuilder
from ionyx.datasets import DataSetLoader

print('Beginning keras builder test...')

data, X, y = DataSetLoader.load_forest_cover()
n_classes = len(np.unique(y)) + 1

# Plain Keras model
model = KerasBuilder.build_dense_model(input_size=X.shape[1], output_size=n_classes,
                                       loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, to_categorical(y, n_classes))
score = model.evaluate(X, to_categorical(y, n_classes))
print('Model score = {0}'.format(score[1]))

# Same builder wrapped as a scikit-learn estimator
estimator = KerasClassifier(build_fn=KerasBuilder.build_dense_model, input_size=X.shape[1],
                            output_size=n_classes, loss='categorical_crossentropy',
                            metrics=['accuracy'])
estimator.fit(X, to_categorical(y, n_classes))
score = estimator.score(X, to_categorical(y, n_classes))
print('Estimator score = {0}'.format(score))

print('Done.')

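# Several snippets here one-hot encode integer labels with to_categorical before
# calling fit and score. The sketch below is a plain-Python illustration of that
# encoding, not the Keras implementation; `one_hot` is a hypothetical helper name.
# It also shows why a script whose labels presumably start at 1 would ask for one
# extra class: column 0 is allocated but never set.

```python
def one_hot(labels, num_classes=None):
    """Return one one-hot row (list of floats) per non-negative integer label."""
    labels = [int(label) for label in labels]
    if num_classes is None:
        # Same convention as the snippets rely on: width is largest label + 1
        num_classes = max(labels) + 1
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError("label {} outside [0, {})".format(label, num_classes))
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


# Labels 1..3 need 4 columns; column 0 stays all zeros
encoded = one_hot([1, 2, 3], num_classes=4)
for row in encoded:
    print(row)
```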
#     ax1.set_ylabel('loss')
#     ax1.tick_params('y')
#     ax1.legend(loc='upper right', shadow=False)
#     ax1.set_title('Model loss through #epochs', color=orange, fontweight='bold')
#
#     # Plot the model accuracy curve
#     ax2.plot(range(1, len(train_acc) + 1), train_acc, blue, linewidth=5, label='training')
#     ax2.plot(range(1, len(train_val_acc) + 1), train_val_acc, green, linewidth=5, label='validation')
#     ax2.set_xlabel('# epoch')
#     ax2.set_ylabel('accuracy')
#     ax2.tick_params('y')
#     ax2.legend(loc='lower right', shadow=False)
#     ax2.set_title('Model accuracy through #epochs', color=orange, fontweight='bold')
#
#
# plot_model_performance(
#     train_loss=hist.history.get('loss', []),
#     train_acc=hist.history.get('acc', []),
#     train_val_loss=hist.history.get('val_loss', []),
#     train_val_acc=hist.history.get('val_acc', [])
# )
#
# plot_model(clf.model, to_file='model.png', show_shapes=True)

# 14. Model evaluation
score = clf.score(X_test, y_test)
print(score)

# 15. Saving the model
clf.model.save('/tmp/keras_mlp.h5')

# Final Model
finalModel = True
if finalModel:
    best_model = KerasClassifier(build_fn=classification_model, epochs=epo, batch_size=bat, verbose=0)
    t_fit = time.time()
    best_model.fit(X_train1, y_train1, batch_size=bat, epochs=epo)  # train on the whole training set
    print("Fit time = {}".format(time.time() - t_fit))
    t_pred = time.time()
    y_pred = best_model.predict(X_test)
    # Measure prediction time from t_pred, not t_fit
    print("Pred time = {}".format(time.time() - t_pred))

    # Accuracy restricted to the test rows of each motion type
    for motion_type in class_names:
        pred_score = best_model.score(X_test[y_test.motion_type == motion_type],
                                      y_test[y_test.motion_type == motion_type])
        print("{} accuracy = {p:8.4f}".format(motion_type, p=pred_score))

    # scikit-learn metrics take (y_true, y_pred) in that order
    print("Cohen Kappa: {}".format(cohen_kappa_score(y_test, y_pred)))
    print("Accuracy: {}".format(accuracy_score(y_test, y_pred)))
    print("F1 Score: {}".format(f1_score(y_test, y_pred, average='weighted')))
    print("Precision: {}".format(precision_score(y_test, y_pred, average='weighted')))
    print("Recall: {}".format(recall_score(y_test, y_pred, average='weighted')))

learning_curves = False
if learning_curves:
    estimator = KerasClassifier(build_fn=classification_model, epochs=100, batch_size=bat, verbose=0)
    #scorer = make_scorer(cohen_kappa_score)
    plot_learning_curves(estimator, X_train1, y_train1,
                         title="Neural Network - Motions Set - Post-Tuning Learning Curves",
                         low_limit=0.6)

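# The per-motion-type loop above calls score() once per class on a filtered test
# set. A minimal sketch of the same idea on plain lists, assuming labels and
# predictions are already aligned; `per_class_accuracy` and the sample labels are
# hypothetical and not taken from the snippet.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of samples with that true label predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


y_true = ["walk", "walk", "run", "run", "run"]
y_pred = ["walk", "run", "run", "run", "walk"]
accuracy_by_class = per_class_accuracy(y_true, y_pred)
for label in sorted(accuracy_by_class):
    print("{} accuracy = {:8.4f}".format(label, accuracy_by_class[label]))
```

# This is per-class recall under another name, so classes with few test samples
# give noisy values.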
    x2 = Dense(8, activation='relu', name='hidden3')(x1)
    prediction = Dense(1, activation='sigmoid', name='output')(x2)
    model = Model(inputs=inputs, outputs=prediction)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


# Compile the model
from keras.wrappers.scikit_learn import KerasClassifier
model = KerasClassifier(build_fn=build_network, verbose=1, epochs=20, batch_size=8)
# Not passed to fit() anywhere below, so this callback currently has no effect
early = EarlyStopping(monitor='val_acc', patience=30, mode='auto')

# Run the model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
pipe = Pipeline([("scaler", MinMaxScaler()), ('svm', model)])
pipe.fit(x_train, y_train)

# Print the results
# print("\n Accuracy: %.4f" % (model.evaluate(x_test,y_test)[1]))
# Score through the pipeline so x_test is scaled the same way as x_train
print(pipe.score(x_test, y_test))

                  learn_rate=learn_rate, dropout_rate=dropout_rate, neurons1=neurons1)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring="f1_micro", cv=3)
grid_search = grid.fit(train_X, train_y)
print(grid_search.best_score_)
print(grid_search.best_params_)
model = grid_search.best_estimator_

# Plots
# Plot Train/Test curve
plotTrainTestLines("MLP(" + str(number_of_columns) + " features)", model,
                   train_X.values, train_y, validation_X.values, validation_y)

# Get probabilities from model
probs = model.predict_proba(validation_X.values)

# Calculate precision/recall values
prec_rec_dict = precision_recall_values(probs, validation_y.values)

# Plot Precision/Recall curve
plotPrecRecCurve(train_X.values, train_y, validation_X.values, validation_y, {'MLP': prec_rec_dict})

# Test score
print("MLP test score : " + str(model.score(test_X.values, test_y.values)))

def create_model():
    model = Sequential()
    model.add(Dense(n_hidden_units, input_dim=n_input, activation="relu"))
    # Only the first layer needs input_dim; later layers infer their input size
    model.add(Dense(n_hidden_units, activation="relu"))
    model.add(Dense(n_output, activation="softmax"))
    # Compile model
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
    return model


estimator = KerasClassifier(build_fn=create_model, epochs=20, batch_size=10, verbose=False)
estimator.fit(X_train, Y_train)
print("Score: {}".format(estimator.score(X_test, Y_test)))  # accuracy 88.7%

from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = LoadAllData()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

model = RandomForestClassifier(n_estimators=500)
model.fit(X_train, y_train)
model.score(X_test, y_test)

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import StandardScaler, LabelEncoder

class FinalModelATC(BaseEstimator, TransformerMixin): def __init__(self, model, model_name=None, ml_for_analytics=False, type_of_estimator='classifier', output_column=None, name=None, _scorer=None, training_features=None, column_descriptions=None, feature_learning=False, uncertainty_model=None, uc_results = None): self.model = model self.model_name = model_name self.ml_for_analytics = ml_for_analytics self.type_of_estimator = type_of_estimator self.name = name self.training_features = training_features self.column_descriptions = column_descriptions self.feature_learning = feature_learning self.uncertainty_model = uncertainty_model self.uc_results = uc_results if self.type_of_estimator == 'classifier': self._scorer = _scorer else: self._scorer = _scorer def get(self, prop_name, default=None): try: return getattr(self, prop_name) except AttributeError: return default def fit(self, X, y): self.model_name = get_name_from_model(self.model) X_fit = X if self.model_name[:12] == 'DeepLearning' or self.model_name in ['BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit', 'ARDRegression', 'Perceptron', 'PassiveAggressiveClassifier', 'SGDClassifier', 'RidgeClassifier', 'LogisticRegression']: if scipy.sparse.issparse(X_fit): X_fit = X_fit.todense() if self.model_name[:12] == 'DeepLearning': # For Keras, we need to tell it how many input nodes to expect, which is our num_cols num_cols = X_fit.shape[1] model_params = self.model.get_params() del model_params['build_fn'] if self.type_of_estimator == 'regressor': self.model = KerasRegressor(build_fn=utils_models.make_deep_learning_model, num_cols=num_cols, feature_learning=self.feature_learning, **model_params) elif self.type_of_estimator == 'classifier': self.model = KerasClassifier(build_fn=utils_models.make_deep_learning_classifier, num_cols=num_cols, feature_learning=self.feature_learning, **model_params) try: if self.model_name[:12] == 'DeepLearning': print('\nWe will stop training early if we have not seen an improvement 
in training accuracy in 25 epochs') from keras.callbacks import EarlyStopping early_stopping = EarlyStopping(monitor='loss', patience=25, verbose=1) self.model.fit(X_fit, y, callbacks=[early_stopping]) elif self.model_name[:16] == 'GradientBoosting': if scipy.sparse.issparse(X_fit): X_fit = X_fit.todense() patience = 20 best_val_loss = -10000000000 num_worse_rounds = 0 best_model = deepcopy(self.model) X_fit, X_test, y, y_test = train_test_split(X_fit, y, test_size=0.15) # Add a variable number of trees each time, depending how far into the process we are num_iters = list(range(1, 50, 1)) + list(range(50, 100, 2)) + list(range(100, 250, 3)) + list(range(250, 500, 5)) + list(range(500, 1000, 10)) + list(range(1000, 2000, 20)) + list(range(2000, 10000, 100)) try: for num_iter in num_iters: warm_start = True if num_iter == 1: warm_start = False self.model.set_params(n_estimators=num_iter, warm_start=warm_start) self.model.fit(X_fit, y) try: val_loss = self._scorer.score(self, X_test, y_test) except Exception as e: val_loss = self.model.score(X_test, y_test) if val_loss > best_val_loss: best_val_loss = val_loss num_worse_rounds = 0 best_model = deepcopy(self.model) else: num_worse_rounds += 1 if num_worse_rounds >= patience: break except KeyboardInterrupt: print('Heard KeyboardInterrupt. 
Stopping training, and using the best checkpointed GradientBoosting model') pass self.model = best_model print('The number of estimators that were the best for this training dataset: ' + str(self.model.get_params()['n_estimators'])) print('The best score on a random 15 percent holdout set of the training data: ' + str(best_val_loss)) else: self.model.fit(X_fit, y) except TypeError as e: if scipy.sparse.issparse(X_fit): X_fit = X_fit.todense() self.model.fit(X_fit, y) except KeyboardInterrupt as e: print('Stopping training at this point because we heard a KeyboardInterrupt') print('If the model is functional at this point, we will output the model in its latest form') print('Note that not all models can be interrupted and still used, and that this feature generally is an unofficial beta-release feature that is known to fail on occasion') pass return self def remove_categorical_values(self, features): clean_features = set([]) for feature in features: if '=' not in feature: clean_features.add(feature) else: clean_features.add(feature[:feature.index('=')]) return clean_features def verify_features(self, X, raw_features_only=False): if self.column_descriptions is None: print('This feature is not enabled by default. 
Depending on the shape of the training data, it can add hundreds of KB to the saved file size.') print('Please pass in `ml_predictor.train(data, verify_features=True)` when training a model, and we will enable this function, at the cost of a potentially larger file size.') warnings.warn('Please pass verify_features=True when invoking .train() on the ml_predictor instance.') return None print('\n\nNow verifying consistency between training features and prediction features') if isinstance(X, dict): prediction_features = set(X.keys()) elif isinstance(X, pd.DataFrame): prediction_features = set(X.columns) # If the user passed in categorical features, we will effectively one-hot-encode them ourselves here # Note that this assumes we're using the "=" as the separater in DictVectorizer/DataFrameVectorizer date_col_names = [] categorical_col_names = [] for key, value in self.column_descriptions.items(): if value == 'categorical' and 'day_part' not in key: try: # This covers the case that the user passes in a value in column_descriptions that is not present in their prediction data column_vals = X[key].unique() for val in column_vals: prediction_features.add(key + '=' + str(val)) categorical_col_names.append(key) except: print('\nFound a column in your column_descriptions that is not present in your prediction data:') print(key) elif 'day_part' in key: # We have found a date column. 
Make sure this date column is in our prediction data # It is outside the scope of this function to make sure that the same date parts are available in both our training and testing data raw_date_col_name = key[:key.index('day_part') - 1] date_col_names.append(raw_date_col_name) elif value == 'output': try: prediction_features.remove(key) except KeyError: pass # Now that we've added in all the one-hot-encoded categorical columns (name=val1, name=val2), remove the base name from our prediction data prediction_features = prediction_features - set(categorical_col_names) # Get only the unique raw_date_col_names date_col_names = set(date_col_names) training_features = set(self.training_features) # Remove all of the transformed date column feature names from our training data features_to_remove = [] for feature in training_features: for raw_date_col_name in date_col_names: if raw_date_col_name in feature: features_to_remove.append(feature) training_features = training_features - set(features_to_remove) # Make sure the raw_date_col_name is in our training data after we have removed all the transformed feature names training_features = training_features | date_col_names # MVP means ignoring text features print_nlp_warning = False nlp_example = None for feature in training_features: if 'nlp_' in feature: print_nlp_warning = True nlp_example = feature training_features.remove(feature) if print_nlp_warning == True: print('\n\nWe found an NLP column in the training data') print('verify_features() currently does not support checking all of the values within an NLP column, so if the text of your NLP column has dramatically changed, you will have to check that yourself.') print('Here is one example of an NLP feature in the training data:') print(nlp_example) training_not_prediction = training_features - prediction_features if raw_features_only == True: training_not_prediction = self.remove_categorical_values(training_not_prediction) if len(training_not_prediction) > 0: 
print('\n\nHere are the features this model was trained on that were not present in this prediction data:') print(sorted(list(training_not_prediction))) else: print('All of the features this model was trained on are included in the prediction data') prediction_not_training = prediction_features - training_features if raw_features_only == True: prediction_not_training = self.remove_categorical_values(prediction_not_training) if len(prediction_not_training) > 0: # Separate out those values we were told to ignore by column_descriptions ignored_features = [] for feature in prediction_not_training: if self.column_descriptions.get(feature, 'False') == 'ignore': ignored_features.append(feature) prediction_not_training = prediction_not_training - set(ignored_features) print('\n\nHere are the features available in the prediction data that were not part of the training data:') print(sorted(list(prediction_not_training))) if len(ignored_features) > 0: print('\n\nAdditionally, we found features in the prediction data that we were told to ignore in the training data') print(sorted(list(ignored_features))) else: print('All of the features in the prediction data were in this model\'s training data') print('\n\n') return { 'training_not_prediction': training_not_prediction , 'prediction_not_training': prediction_not_training } def score(self, X, y, verbose=False): # At the time of writing this, GradientBoosting does not support sparse matrices for predictions if (self.model_name[:16] == 'GradientBoosting' or self.model_name in ['BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit', 'ARDRegression']) and scipy.sparse.issparse(X): X = X.todense() if self._scorer is not None: if self.type_of_estimator == 'regressor': return self._scorer.score(self, X, y) elif self.type_of_estimator == 'classifier': return self._scorer.score(self, X, y) else: return self.model.score(X, y) def predict_proba(self, X, verbose=False): if (self.model_name[:16] == 'GradientBoosting' or 
self.model_name[:12] == 'DeepLearning' or self.model_name in ['BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit', 'ARDRegression']) and scipy.sparse.issparse(X): X = X.todense() try: predictions = self.model.predict_proba(X) except AttributeError as e: try: predictions = self.model.predict(X) except TypeError as e: if scipy.sparse.issparse(X): X = X.todense() predictions = self.model.predict(X) except TypeError as e: if scipy.sparse.issparse(X): X = X.todense() predictions = self.model.predict_proba(X) # If this model does not have predict_proba, and we have fallen back on predict, we want to make sure we give results back in the same format the user would expect for predict_proba, namely each prediction is a list of predicted probabilities for each class. # Note that this DOES NOT WORK for multi-label problems, or problems that are not reduced to 0,1 # If this is not an iterable (ignoring strings, which might be iterable), then we will want to turn our predictions into tupled predictions if not (hasattr(predictions[0], '__iter__') and not isinstance(predictions[0], str)): tupled_predictions = [] for prediction in predictions: if prediction == 1: tupled_predictions.append([0,1]) else: tupled_predictions.append([1,0]) predictions = tupled_predictions # This handles an annoying edge case with libraries like Keras that, for a binary classification problem, with return a single predicted probability in a list, rather than the probability of both classes in a list if len(predictions[0]) == 1: tupled_predictions = [] for prediction in predictions: tupled_predictions.append([1 - prediction[0], prediction[0]]) predictions = tupled_predictions if X.shape[0] == 1: return predictions[0] else: return predictions def predict(self, X, verbose=False): if (self.model_name[:16] == 'GradientBoosting' or self.model_name[:12] == 'DeepLearning' or self.model_name in ['BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit', 'ARDRegression']) and scipy.sparse.issparse(X): 
X_predict = X.todense() else: X_predict = X prediction = self.model.predict(X_predict) # Handle cases of getting a prediction for a single item. # It makes a cleaner interface just to get just the single prediction back, rather than a list with the prediction hidden inside. if isinstance(prediction, np.ndarray): prediction = prediction.tolist() if isinstance(prediction, float) or isinstance(prediction, int) or isinstance(prediction, str): return prediction if len(prediction) == 1: return prediction[0] else: return prediction # transform is initially designed to be used with feature_learning def transform(self, X): predicted_features = self.predict(X) predicted_features = list(predicted_features) X = scipy.sparse.hstack([X, predicted_features], format='csr') return X def predict_uncertainty(self, X): if self.uncertainty_model is None: print('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!') print('This model was not trained to predict uncertainties') print('Please follow the documentation to tell this model at training time to learn how to predict uncertainties') print('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!') raise ValueError('This model was not trained to predict uncertainties') base_predictions = self.predict(X) if isinstance(base_predictions, Iterable): base_predictions_col = [[val] for val in base_predictions] base_predictions_col = np.array(base_predictions_col) else: base_predictions_col = [base_predictions] X_combined = scipy.sparse.hstack([X, base_predictions_col], format='csr') uncertainty_predictions = self.uncertainty_model.predict_proba(X_combined) results = { 'base_prediction': base_predictions , 'uncertainty_prediction': uncertainty_predictions } if isinstance(base_predictions, Iterable): results['uncertainty_prediction'] = [row[1] for row in results['uncertainty_prediction']] results = pd.DataFrame.from_dict(results, orient='columns') if self.uc_results is not None: 
calibration_results = {} # grab the relevant properties from our uc_results, and make them each their own list in calibration_results for key, value in self.uc_results[1].items(): calibration_results[key] = [] for proba in results['uncertainty_prediction']: max_bucket_proba = 0 bucket_num = 1 while proba > max_bucket_proba: calibration_result = self.uc_results[bucket_num] max_bucket_proba = self.uc_results[bucket_num]['max_proba'] bucket_num += 1 for key, value in calibration_result.items(): calibration_results[key].append(value) # TODO: grab the uncertainty_calibration data for DataFrames df_calibration_results = pd.DataFrame.from_dict(calibration_results, orient='columns') del df_calibration_results['max_proba'] results = pd.concat([results, df_calibration_results], axis=1) else: if self.uc_results is not None: # TODO: grab the uncertainty_calibration data for dictionaries for bucket_name, bucket_result in self.uc_results.items(): if proba > bucket_result['max_proba']: break results.update(bucket_result) del results['max_proba'] return results def score_uncertainty(self, X, y, verbose=False): return self.uncertainty_model.score(X, y, verbose=False)
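The fallback at the end of `predict_proba` in the snippet above coerces whatever the wrapped model returns into two-class probability rows. A minimal standalone sketch of that coercion follows. `to_two_class_probas` is a hypothetical helper name, not part of the original class, and like the original it only makes sense for binary problems reduced to 0/1.

```python
def to_two_class_probas(predictions):
    """Return each prediction as a [p_class0, p_class1] pair.

    Hypothetical helper for illustration; binary 0/1 problems only.
    """
    if len(predictions) == 0:
        return []
    first = predictions[0]
    is_row = hasattr(first, '__iter__') and not isinstance(first, str)
    if not is_row:
        # Hard labels from predict(): 1 -> [0, 1], anything else -> [1, 0].
        return [[0, 1] if p == 1 else [1, 0] for p in predictions]
    if len(first) == 1:
        # One positive-class probability per row (the Keras binary case).
        return [[1 - row[0], row[0]] for row in predictions]
    # Already one probability per class; just copy the rows.
    return [list(row) for row in predictions]
```

Keeping the three cases in one function makes the contract explicit: callers always get one row per sample with a probability for each of the two classes.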
    'output_dim': y_train.shape[1],
    'epochs': 5,
    'batch_size': 256,
    'verbose': 1,
    'validation_data': (X_val, y_val),
    'shuffle': True
}

# Create a new sklearn classifier
clf = KerasClassifier(**model_params)

# Finally, fit our classifier
hist = clf.fit(X_train, y_train)

# Plot model performance
plot_model_performance(
    train_loss=hist.history.get('loss', []),
    train_acc=hist.history.get('acc', []),
    train_val_loss=hist.history.get('val_loss', []),
    train_val_acc=hist.history.get('val_acc', [])
)

# Evaluate model accuracy
score = clf.score(X_test, y_test, verbose=0)
print('model accuracy: {}'.format(score))

# Visualize model architecture
plot_model(clf.model, to_file='./model_structure.png', show_shapes=True)

# Finally save model
clf.model.save('/tmp/keras_mlp.h5')
epochs = [20, 25, 30]
learn_rate = [0.005, 0.01, 0.015]
momentum = [0.85, 0.9, 0.95]
grid = dict(epochs=epochs, batch_size=batch_size, learn_rate=learn_rate, momentum=momentum)

t1 = time.time()
scores = []
model_tt = KerasClassifier(build_fn=create_model, verbose=0)
# Fit and score one model per parameter combination.
for g in ParameterGrid(grid):
    model_tt.set_params(**g)
    model_tt.fit(X_train, Y_train)
    scores.append(dict(params=g, score=model_tt.score(X_test, Y_test)))
    print('model#', len(scores), scores[-1])
t2 = time.time()
print("Training time:", t2 - t1, 'sec')

# One row per combination: the parameters as columns, plus the score.
df = pandas.DataFrame([{**row['params'], **row} for row in scores])
df = df.drop('params', axis=1)
df = df.sort_values('score')  # sort_values returns a new frame, so keep the result

# Note: model_tt holds the model from the last combination fitted,
# which is not necessarily the best-scoring one.
model_tt.model.save('my_model.h5')
model = keras.models.load_model('my_model.h5')
print(model)
model.predict(x)
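The loop above scores every combination but never selects a winner. A minimal sketch of the enumerate-score-select pattern in plain Python is shown below. `iter_grid`, `search`, and `toy_score` are hypothetical names, and `toy_score` stands in for "fit the model, then score it on held-out data"; nothing here calls Keras or scikit-learn.

```python
from itertools import product


def iter_grid(grid):
    """Yield one dict per parameter combination, keys in sorted order."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def search(grid, fit_and_score):
    """Score every combination and return (best_params, best_score, results)."""
    results = [(params, fit_and_score(params)) for params in iter_grid(grid)]
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


def toy_score(params):
    # Stand-in objective (an assumption for the demo): peaks at
    # learn_rate=0.01, momentum=0.9.
    return -abs(params['learn_rate'] - 0.01) - abs(params['momentum'] - 0.9)


grid = {'learn_rate': [0.005, 0.01, 0.015], 'momentum': [0.85, 0.9, 0.95]}
best_params, best_score, results = search(grid, toy_score)
print(best_params, best_score)
```

In a real run, `fit_and_score` would also be the place to save the fitted model whenever it beats the best score so far, so that the file on disk matches the winning parameters.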
finalModel = True
if finalModel:
    best_model = KerasClassifier(build_fn=classification_model, batch_size=bat, verbose=0)
    t_fit = time.time()
    best_model.fit(X_train1, y_train1, batch_size=bat, epochs=epo)  # train on the whole training set
    print("Fit time = {}".format(time.time() - t_fit))
    t_pred = time.time()
    y_pred = best_model.predict(X_test)
    # Measure from t_pred, not t_fit, so this reports prediction time only.
    print("Pred time = {}".format(time.time() - t_pred))
    for particle_type in class_names:
        pred_score = best_model.score(X_test[y_test.id == particle_type],
                                      y_test[y_test.id == particle_type])
        print("{} accuracy = {p:8.4f}".format(particle_type, p=pred_score))
    # scikit-learn metrics take (y_true, y_pred), in that order; swapping them
    # exchanges precision and recall.
    print("Cohen Kappa: {}".format(cohen_kappa_score(y_test, y_pred)))
    print("Accuracy: {}".format(accuracy_score(y_test, y_pred)))
    print("Balanced Accuracy: {}".format(
        balanced_accuracy_score(y_test, y_pred)))
    print("F1 Score: {}".format(f1_score(y_test, y_pred, average='weighted')))
    print("Precision: {}".format(
        precision_score(y_test, y_pred, average='weighted')))
    print("Recall: {}".format(recall_score(y_test, y_pred, average='weighted')))

learning_curves = False
if learning_curves:
    estimator = KerasClassifier(build_fn=classification_model, epochs=epo,
                  kernel_initializer='uniform', input_dim=561))
    seq.add(Dense(units=200, activation='relu', kernel_initializer='uniform'))
    seq.add(Dense(units=200, activation='relu', kernel_initializer='uniform'))
    seq.add(Dense(units=6, activation='softmax', kernel_initializer='uniform'))
    seq.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return seq


from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score, KFold

#kfold = KFold(n_splits=10, shuffle=True)
# Keras 2 names this fit argument `epochs`; `nb_epoch` is the Keras 1 spelling.
classifier = KerasClassifier(build_fn=build, batch_size=32, epochs=20)
# 10-fold cross-validation accuracy on the training set
cv_scores = cross_val_score(estimator=classifier, X=X_train, y=Y_train, cv=10, n_jobs=-1)
mean = cv_scores.mean()
std = cv_scores.std()

# Refit on the full training set and evaluate on the held-out test set
classifier.fit(X_train, Y_train)
print("Accuracy: {}%\n".format(classifier.score(X_test, Y_test) * 100))
classifier.predict(X_test)
# - Train without cross-validation.

# In[57]:

history = model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=epoch, batch_size=batch_size, verbose=1)

# - Evaluate on the test data.

# In[58]:

score = model.score(x_test, y_test)
#print('Test Loss:', score[0])
#print('Test accuracy:', score[1])
print('Testing Accuracy:', score)
valiscore = score

# - Evaluate on the real data, without data augmentation.

# In[59]:

score = model.score(tempx, tempy)
#print('Test Loss:', score[0])
#print('Test accuracy:', score[1])
print('Testing Accuracy:', score)
testscore = score
mlp = MlpTrading_old(symbol='^GSPC')
df_all = mlp.data_prepare(0.33, 16660, False)
sk_params = {'size_input': mlp.size_input, 'size_output': mlp.size_output, 'size_hidden': 15, 'dropout': 0.0, 'optimizer': 'rmsprop', 'activation': 'sigmoid'}
model = KerasClassifier(build_fn=mlp.model_create_mlp, **sk_params)
history = model.fit(mlp.x_train, mlp.y_train, sample_weight=None, batch_size=128, epochs=10, verbose=1)  #validation_data=(mlp.x_test, mlp.y_test) kwargs=kwargs)
# model = mlp.model_create_mlp(activation='softmax')
# 575/2575 [==============================] - 0s 25us/step - loss: 0.6850 - acc: 0.5538 - val_loss: 0.6900 - val_acc: 0.5296
# TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator <keras.engine.sequential.Sequential object at 0x138a7fd68> does not.
# history = mlp.model_fit(model,epochs=10, verbose=1)
mlp.model_weights(model, mlp.x_test, mlp.y_test, mlp.names_input)
plot_stat_loss_vs_accuracy2(history.history)
plt.show()
score = model.score(mlp.x_test, mlp.y_test)
print(f'accuracy= {score} ')
#
#
#
# def baseline_model(inputs=4):
#     model = Sequential()
#     model.add(Dense(units=8, activation='sigmoid', input_shape=(inputs,)))
#     model.add(Dense(units=3, activation='sigmoid'))
#     model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
#     model.summary()
#     #model.fit()
    model = Sequential()
    # Keras 2 argument names: kernel_initializer (was init), epochs (was nb_epoch)
    model.add(Dense(8, input_dim=4, kernel_initializer='normal', activation='relu'))
    model.add(Dense(3, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model


estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
X_train, X_test, Y_train, Y_test = train_test_split(X, dummy_y, test_size=0.20, random_state=seed)
estimator.fit(X_train, Y_train)
predictions = estimator.predict(X_test)
# model = baseline_model()
# loss_and_metrics = model.evaluate(X_test, Y_test)
# print(loss_and_metrics)
score = estimator.score(X_test, Y_test)
# print(dummy_y)
print("Accuracy: %.2f%%" % (score * 100))
# print(predictions)
# Pass the decoded labels as a separate argument instead of concatenating
# a str with an array.
print('Prediction:', encoder.inverse_transform(predictions))
print(Y_test)
# print(encoder.inverse_transform(Y_test[1]))
# print(Y_test)
# results = cross_val_score(estimator, X, dummy_y, cv=kfold)
# print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
    model.add(Dense(n_hidden_units, input_dim=n_input, activation="relu"))
    model.add(Dense(n_output, activation="softmax"))
    # Compile Model
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])
    return model


estimator = KerasClassifier(build_fn=create_model, epochs=20, batch_size=10, verbose=False)
estimator.fit(Scaled_trainData, Y_train)
print("Keras Classifier Score:{}".format(
    estimator.score(Scaled_testData, Y_test)))  # 0.95

# Ensemble Methods
Y_test = to_categorical(testLabelE)  # one-hot encoded labels
Y_train = to_categorical(trainLabelE)

model = ExtraTreesClassifier(n_estimators=500)
model.fit(Scaled_trainData, Y_train)
print("ExtraTree Classifier results %.3f" % model.score(Scaled_testData, Y_test))  # 0.915

model = RandomForestClassifier(n_estimators=500)
model.fit(Scaled_trainData, Y_train)
print("RandomForest Classifier results %.3f" % model.score(Scaled_testData, Y_test))  # 0.90
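The snippet above one-hot encodes integer class labels with `to_categorical` before fitting. As a rough illustration of what that encoding produces and how to map predictions back to class indices, here is a plain-Python sketch. `one_hot` and `from_one_hot` are hypothetical stand-ins written for this example, not the Keras functions, and they assume labels are integers starting at 0.

```python
def one_hot(labels, num_classes=None):
    """Encode integer labels 0..K-1 as rows with a single 1.0."""
    if num_classes is None:
        num_classes = max(labels) + 1
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


def from_one_hot(rows):
    """Map each row (one-hot or probabilities) to its argmax index."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


encoded = one_hot([0, 2, 1])
decoded = from_one_hot(encoded)
```

The same argmax step recovers a class index from a softmax output row, which is why the decoder accepts probabilities as well as exact one-hot rows.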
class NNTR: def __init__(self, args): self.data_path = "readscore/all_score.csv" self.train = pd.read_csv(self.data_path, low_memory=False) self.features = [ 'infonoisescore', 'logcontentlength', 'hasinfobox', 'logreferences', 'logpagelinks', 'numimageslength', 'num_citetemplates', 'lognoncitetemplates', 'num_categories', 'lvl2headings', 'lvl3heading', 'number_chars', 'number_words', 'number_types', 'number_sentences', 'number_syllables', 'number_polysyllable_words', 'difficult_words', 'number_words_longer_4', 'number_words_longer_6', 'number_words_longer_10', 'number_words_longer_longer_13', 'flesch_reading_ease', 'flesch_kincaid_grade_level', 'coleman_liau_index', 'gunning_fog_index', 'smog_index', 'ari_index', 'lix_index', 'dale_chall_score', 'linsear_write_formula', 'grammar' ] self.classes = ['Stub', 'Start', 'C', 'B', 'GA', 'FA'] self.by = label_binarize(self.train['rating'], classes=self.classes) scaler = MinMaxScaler(feature_range=(-1, 1)) scaler.fit(self.train[self.features]) self.train_mm = scaler.transform(self.train[self.features]) self.X_train, self.X_test, self.y_train, self.y_test = train_test_split( self.train_mm, self.train['rating'], test_size=0.10, random_state=2) def auroc(self, y_true, y_pred): return tf.py_function(roc_auc_score, (y_true, y_pred), tf.double) def auc(self, y_true, y_pred): auc = tf.metrics.AUC(y_true, y_pred)[1] K.get_session().run(tf.local_variables_initializer()) return auc # define baseline model def baseline_model(self): initializer = initializers.he_uniform() self.clf = Sequential() self.clf.add(Dense(128, input_dim=32, kernel_initializer=initializer)) self.clf.add(LeakyReLU()) self.clf.add(Dense(512, kernel_initializer=initializer)) self.clf.add(LeakyReLU()) self.clf.add(Dropout(0.1)) self.clf.add(Dense(256, kernel_initializer=initializer)) self.clf.add(LeakyReLU()) self.clf.add(Dropout(0.1)) self.clf.add(Dense(6, activation='softmax')) # Final Layer using Softmax optimizer = optimizers.Adamax(0.0008) 
self.clf.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy']) return self.clf def learn(self): self.estimator = KerasClassifier(build_fn=self.baseline_model, epochs=1, verbose=1) self.estimator.fit(self.X_train, self.y_train) self.score = self.estimator.predict(self.X_test) #roc = roc_auc_score(self.y_test, self.score) acc = self.estimator.score(self.X_test, self.y_test) print( pd.crosstab(self.y_test, self.score, rownames=['Actual Species'], colnames=['predicted'])) print(acc) #print(roc) def computeROC(self): # Binarize the output self.n_classes = self.by.shape[1] self.y_score = label_binarize(self.score, classes=self.classes_n) kf = KFold(shuffle=True, n_splits=5) # roc_auc_score = cross_val_score(self.estimator, self.train_mm, self.by, cv=kf, scoring='roc_auc') # print('roc_auc_score ', np.mean(roc_auc_score), roc_auc_score) # Compute ROC curve and ROC area for each class fpr = dict() tpr = dict() roc_auc = dict() for i in range(self.n_classes): fpr[i], tpr[i], _ = roc_curve(self.byy[:, i], self.y_score[:, i]) roc_auc[i] = auc(fpr[i], tpr[i]) # Compute micro-average ROC curve and ROC area fpr["micro"], tpr["micro"], _ = roc_curve(self.byy.ravel(), self.y_score.ravel()) roc_auc["micro"] = auc(fpr["micro"], tpr["micro"]) lw = 2 # First aggregate all false positive rates all_fpr = np.unique( np.concatenate([fpr[i] for i in range(self.n_classes)])) # Then interpolate all ROC curves at this points mean_tpr = np.zeros_like(all_fpr) for i in range(self.n_classes): mean_tpr += interp(all_fpr, fpr[i], tpr[i]) # Finally average it and compute AUC mean_tpr /= self.n_classes fpr["macro"] = all_fpr tpr["macro"] = mean_tpr roc_auc["macro"] = auc(fpr["macro"], tpr["macro"]) # Plot all ROC curves plt.figure() plt.plot(fpr["micro"], tpr["micro"], label='micro-average ROC curve (area = {0:0.2f})' ''.format(roc_auc["micro"]), color='deeppink', linestyle=':', linewidth=4) plt.plot(fpr["macro"], tpr["macro"], label='macro-average ROC curve 
(area = {0:0.2f})' ''.format(roc_auc["macro"]), color='navy', linestyle=':', linewidth=4) colors = cycle( ['aqua', 'darkorange', 'cornflowerblue', 'yellow', 'pink']) for i, color in zip(range(self.n_classes), colors): plt.plot(fpr[i], tpr[i], color=color, lw=lw, label='ROC curve of class {0} (area = {1:0.2f})' ''.format(self.classes[i], roc_auc[i])) plt.plot([0, 1], [0, 1], 'k--', lw=lw) plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.05]) plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.title( 'Some extension of Receiver operating characteristic to multi-class' ) plt.legend(loc="lower right") plt.show()
class FinalModelATC(BaseEstimator, TransformerMixin): def __init__(self, model, model_name=None, ml_for_analytics=False, type_of_estimator='classifier', output_column=None, name=None, _scorer=None, training_features=None, column_descriptions=None, feature_learning=False, uncertainty_model=None, uc_results=None, training_prediction_intervals=False, min_step_improvement=0.0001, interval_predictors=None, keep_cat_features=False, is_hp_search=None, X_test=None, y_test=None): self.model = model self.model_name = model_name self.ml_for_analytics = ml_for_analytics self.type_of_estimator = type_of_estimator self.name = name self.training_features = training_features self.column_descriptions = column_descriptions self.feature_learning = feature_learning self.uncertainty_model = uncertainty_model self.uc_results = uc_results self.training_prediction_intervals = training_prediction_intervals self.min_step_improvement = min_step_improvement self.interval_predictors = interval_predictors self.is_hp_search = is_hp_search self.keep_cat_features = keep_cat_features self.X_test = X_test self.y_test = y_test if self.type_of_estimator == 'classifier': self._scorer = _scorer else: self._scorer = _scorer def get(self, prop_name, default=None): try: return getattr(self, prop_name) except AttributeError: return default def fit(self, X, y): global keras_imported, KerasRegressor, KerasClassifier, EarlyStopping, ModelCheckpoint, TerminateOnNaN, keras_load_model self.model_name = get_name_from_model(self.model) X_fit = X if self.model_name[:12] == 'DeepLearning' or self.model_name in [ 'BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit', 'ARDRegression', 'Perceptron', 'PassiveAggressiveClassifier', 'SGDClassifier', 'RidgeClassifier', 'LogisticRegression' ]: if scipy.sparse.issparse(X_fit): X_fit = X_fit.todense() if self.model_name[:12] == 'DeepLearning': if keras_imported == False: # Suppress some level of logs os.environ['TF_CPP_MIN_VLOG_LEVEL'] = '3' 
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' from keras.callbacks import EarlyStopping, ModelCheckpoint, TerminateOnNaN from keras.models import load_model as keras_load_model from keras.wrappers.scikit_learn import KerasRegressor, KerasClassifier keras_imported = True # For Keras, we need to tell it how many input nodes to expect, which is our num_cols num_cols = X_fit.shape[1] model_params = self.model.get_params() del model_params['build_fn'] try: del model_params['feature_learning'] except: pass try: del model_params['num_cols'] except: pass if self.type_of_estimator == 'regressor': self.model = KerasRegressor( build_fn=utils_models.make_deep_learning_model, num_cols=num_cols, feature_learning=self.feature_learning, **model_params) elif self.type_of_estimator == 'classifier': self.model = KerasClassifier( build_fn=utils_models.make_deep_learning_classifier, num_cols=num_cols, feature_learning=self.feature_learning, **model_params) if self.model_name[:12] == 'DeepLearning': try: if self.is_hp_search == True: patience = 5 verbose = 0 else: patience = 25 verbose = 2 X_fit, y, X_test, y_test = self.get_X_test(X_fit, y) try: X_test = X_test.toarray() except AttributeError as e: pass if not self.is_hp_search: print( '\nWe will stop training early if we have not seen an improvement in validation accuracy in {} epochs' .format(patience)) print( 'To measure validation accuracy, we will split off a random 10 percent of your training data set' ) early_stopping = EarlyStopping(monitor='val_loss', patience=patience, verbose=verbose) terminate_on_nan = TerminateOnNaN() now_time = datetime.datetime.now() time_string = str(now_time.year) + '_' + str( now_time.month) + '_' + str(now_time.day) + '_' + str( now_time.hour) + '_' + str(now_time.minute) temp_file_name = 'tmp_dl_model_checkpoint_' + time_string + str( random.random()) + '.h5' model_checkpoint = ModelCheckpoint(temp_file_name, monitor='val_loss', save_best_only=True, mode='min', period=1) callbacks = [early_stopping, 
terminate_on_nan] if not self.is_hp_search: callbacks.append(model_checkpoint) self.model.fit(X_fit, y, callbacks=callbacks, validation_data=(X_test, y_test), verbose=verbose) # TODO: give some kind of logging on how the model did here! best epoch, best accuracy, etc. if self.is_hp_search is False: self.model = keras_load_model(temp_file_name) try: os.remove(temp_file_name) except OSError as e: pass except KeyboardInterrupt as e: print( 'Stopping training at this point because we heard a KeyboardInterrupt' ) print( 'If the deep learning model is functional at this point, we will output the model in its latest form' ) print( 'Note that this feature is an unofficial beta-release feature that is known to fail on occasion' ) if self.is_hp_search is False: self.model = keras_load_model(temp_file_name) try: os.remove(temp_file_name) except OSError as e: pass elif self.model_name[:4] == 'LGBM': X_fit = X.toarray() X_fit, y, X_test, y_test = self.get_X_test(X_fit, y) try: X_test = X_test.toarray() except AttributeError as e: pass if self.type_of_estimator == 'regressor': eval_metric = 'rmse' elif self.type_of_estimator == 'classifier': if len(set(y_test)) > 2: eval_metric = 'multi_logloss' else: eval_metric = 'binary_logloss' verbose = True if self.is_hp_search == True: verbose = False if self.X_test is not None: eval_name = 'X_test_the_user_passed_in' else: eval_name = 'random_holdout_set_from_training_data' cat_feature_indices = self.get_categorical_feature_indices() if cat_feature_indices is None: self.model.fit(X_fit, y, eval_set=[(X_test, y_test)], early_stopping_rounds=100, eval_metric=eval_metric, eval_names=[eval_name], verbose=verbose) else: self.model.fit(X_fit, y, eval_set=[(X_test, y_test)], early_stopping_rounds=100, eval_metric=eval_metric, eval_names=[eval_name], categorical_feature=cat_feature_indices, verbose=verbose) elif self.model_name[:8] == 'CatBoost': X_fit = X_fit.toarray() if self.type_of_estimator == 'classifier' and len( pd.Series(y).unique()) > 
2: # TODO: we might have to modify the format of the y values, converting them all to ints, then back again (sklearn has a useful inverse_transform on some preprocessing classes) self.model.set_params(loss_function='MultiClass') cat_feature_indices = self.get_categorical_feature_indices() self.model.fit(X_fit, y, cat_features=cat_feature_indices) elif self.model_name[:16] == 'GradientBoosting': if not sklearn_version > '0.18.1': X_fit = X_fit.toarray() patience = 20 best_val_loss = -10000000000 num_worse_rounds = 0 best_model = deepcopy(self.model) X_fit, y, X_test, y_test = self.get_X_test(X_fit, y) # Add a variable number of trees each time, depending how far into the process we are if os.environ.get('is_test_suite', False) == 'True': num_iters = list(range(1, 50, 1)) + list(range( 50, 100, 2)) + list(range(100, 250, 3)) else: num_iters = list(range( 1, 50, 1)) + list(range(50, 100, 2)) + list( range(100, 250, 3)) + list(range(250, 500, 5)) + list( range(500, 1000, 10)) + list(range( 1000, 2000, 20)) + list(range( 2000, 10000, 100)) # TODO: get n_estimators from the model itself, and reduce this list to only those values that come under the value from the model try: for num_iter in num_iters: warm_start = True if num_iter == 1: warm_start = False self.model.set_params(n_estimators=num_iter, warm_start=warm_start) self.model.fit(X_fit, y) if self.training_prediction_intervals == True: val_loss = self.model.score(X_test, y_test) else: try: val_loss = self._scorer.score(self, X_test, y_test) except Exception as e: val_loss = self.model.score(X_test, y_test) if val_loss - self.min_step_improvement > best_val_loss: best_val_loss = val_loss num_worse_rounds = 0 best_model = deepcopy(self.model) else: num_worse_rounds += 1 print( '[' + str(num_iter) + '] random_holdout_set_from_training_data\'s score is: ' + str(round(val_loss, 3))) if num_worse_rounds >= patience: break except KeyboardInterrupt: print( 'Heard KeyboardInterrupt. 
Stopping training, and using the best checkpointed GradientBoosting model' ) pass self.model = best_model print( 'The number of estimators that were the best for this training dataset: ' + str(self.model.get_params()['n_estimators'])) print('The best score on the holdout set: ' + str(best_val_loss)) else: self.model.fit(X_fit, y) if self.X_test is not None: del self.X_test del self.y_test return self def remove_categorical_values(self, features): clean_features = set([]) for feature in features: if '=' not in feature: clean_features.add(feature) else: clean_features.add(feature[:feature.index('=')]) return clean_features def verify_features(self, X, raw_features_only=False): if self.column_descriptions is None: print( 'This feature is not enabled by default. Depending on the shape of the training data, it can add hundreds of KB to the saved file size.' ) print( 'Please pass in `ml_predictor.train(data, verify_features=True)` when training a model, and we will enable this function, at the cost of a potentially larger file size.' ) warnings.warn( 'Please pass verify_features=True when invoking .train() on the ml_predictor instance.' 
) return None print( '\n\nNow verifying consistency between training features and prediction features' ) if isinstance(X, dict): prediction_features = set(X.keys()) elif isinstance(X, pd.DataFrame): prediction_features = set(X.columns) # If the user passed in categorical features, we will effectively one-hot-encode them ourselves here # Note that this assumes we're using the "=" as the separater in DictVectorizer/DataFrameVectorizer date_col_names = [] categorical_col_names = [] for key, value in self.column_descriptions.items(): if value == 'categorical' and 'day_part' not in key: try: # This covers the case that the user passes in a value in column_descriptions that is not present in their prediction data column_vals = X[key].unique() for val in column_vals: prediction_features.add(key + '=' + str(val)) categorical_col_names.append(key) except: print( '\nFound a column in your column_descriptions that is not present in your prediction data:' ) print(key) elif 'day_part' in key: # We have found a date column. 
Make sure this date column is in our prediction data # It is outside the scope of this function to make sure that the same date parts are available in both our training and testing data raw_date_col_name = key[:key.index('day_part') - 1] date_col_names.append(raw_date_col_name) elif value == 'output': try: prediction_features.remove(key) except KeyError: pass # Now that we've added in all the one-hot-encoded categorical columns (name=val1, name=val2), remove the base name from our prediction data prediction_features = prediction_features - set(categorical_col_names) # Get only the unique raw_date_col_names date_col_names = set(date_col_names) training_features = set(self.training_features) # Remove all of the transformed date column feature names from our training data features_to_remove = [] for feature in training_features: for raw_date_col_name in date_col_names: if raw_date_col_name in feature: features_to_remove.append(feature) training_features = training_features - set(features_to_remove) # Make sure the raw_date_col_name is in our training data after we have removed all the transformed feature names training_features = training_features | date_col_names # MVP means ignoring text features print_nlp_warning = False nlp_example = None for feature in training_features: if 'nlp_' in feature: print_nlp_warning = True nlp_example = feature training_features.remove(feature) if print_nlp_warning == True: print('\n\nWe found an NLP column in the training data') print( 'verify_features() currently does not support checking all of the values within an NLP column, so if the text of your NLP column has dramatically changed, you will have to check that yourself.' 
) print( 'Here is one example of an NLP feature in the training data:') print(nlp_example) training_not_prediction = training_features - prediction_features if raw_features_only == True: training_not_prediction = self.remove_categorical_values( training_not_prediction) if len(training_not_prediction) > 0: print( '\n\nHere are the features this model was trained on that were not present in this prediction data:' ) print(sorted(list(training_not_prediction))) else: print( 'All of the features this model was trained on are included in the prediction data' ) prediction_not_training = prediction_features - training_features if raw_features_only == True: prediction_not_training = self.remove_categorical_values( prediction_not_training) if len(prediction_not_training) > 0: # Separate out those values we were told to ignore by column_descriptions ignored_features = [] for feature in prediction_not_training: if self.column_descriptions.get(feature, 'False') == 'ignore': ignored_features.append(feature) prediction_not_training = prediction_not_training - set( ignored_features) print( '\n\nHere are the features available in the prediction data that were not part of the training data:' ) print(sorted(list(prediction_not_training))) if len(ignored_features) > 0: print( '\n\nAdditionally, we found features in the prediction data that we were told to ignore in the training data' ) print(sorted(list(ignored_features))) else: print( 'All of the features in the prediction data were in this model\'s training data' ) print('\n\n') return { 'training_not_prediction': training_not_prediction, 'prediction_not_training': prediction_not_training } def score(self, X, y, verbose=False): # At the time of writing this, GradientBoosting does not support sparse matrices for predictions if (self.model_name[:16] == 'GradientBoosting' or self.model_name in [ 'BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit', 'ARDRegression' ]) and scipy.sparse.issparse(X): X = X.todense() if self._scorer is 
not None: if self.type_of_estimator == 'regressor': return self._scorer.score(self, X, y) elif self.type_of_estimator == 'classifier': return self._scorer.score(self, X, y) else: return self.model.score(X, y) def predict_proba(self, X, verbose=False): if (self.model_name[:16] == 'GradientBoosting' or self.model_name[:12] == 'DeepLearning' or self.model_name in [ 'BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit', 'ARDRegression' ]) and scipy.sparse.issparse(X): X = X.todense() elif (self.model_name[:8] == 'CatBoost' or self.model_name[:4] == 'LGBM') and scipy.sparse.issparse(X): X = X.toarray() try: if self.model_name[:4] == 'LGBM': try: best_iteration = self.model.best_iteration except AttributeError: best_iteration = self.model.best_iteration_ predictions = self.model.predict_proba( X, num_iteration=best_iteration) else: predictions = self.model.predict_proba(X) except AttributeError as e: try: predictions = self.model.predict(X) except TypeError as e: if scipy.sparse.issparse(X): X = X.todense() predictions = self.model.predict(X) except TypeError as e: if scipy.sparse.issparse(X): X = X.todense() predictions = self.model.predict_proba(X) # If this model does not have predict_proba, and we have fallen back on predict, we want to make sure we give results back in the same format the user would expect for predict_proba, namely each prediction is a list of predicted probabilities for each class. 
# Note that this DOES NOT WORK for multi-label problems, or problems that are not reduced to 0,1 # If this is not an iterable (ignoring strings, which might be iterable), then we will want to turn our predictions into tupled predictions if not (hasattr(predictions[0], '__iter__') and not isinstance(predictions[0], str)): tupled_predictions = [] for prediction in predictions: if prediction == 1: tupled_predictions.append([0, 1]) else: tupled_predictions.append([1, 0]) predictions = tupled_predictions # This handles an annoying edge case with libraries like Keras that, for a binary classification problem, with return a single predicted probability in a list, rather than the probability of both classes in a list if len(predictions[0]) == 1: tupled_predictions = [] for prediction in predictions: tupled_predictions.append([1 - prediction[0], prediction[0]]) predictions = tupled_predictions if X.shape[0] == 1: return predictions[0] else: return predictions def predict(self, X, verbose=False): if (self.model_name[:16] == 'GradientBoosting' or self.model_name[:12] == 'DeepLearning' or self.model_name in [ 'BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit', 'ARDRegression' ]) and scipy.sparse.issparse(X): X_predict = X.todense() elif self.model_name[:8] == 'CatBoost' and scipy.sparse.issparse(X): X_predict = X.toarray() else: X_predict = X if self.model_name[:4] == 'LGBM': try: best_iteration = self.model.best_iteration except AttributeError: best_iteration = self.model.best_iteration_ predictions = self.model.predict(X, num_iteration=best_iteration) else: predictions = self.model.predict(X_predict) # Handle cases of getting a prediction for a single item. # It makes a cleaner interface just to get just the single prediction back, rather than a list with the prediction hidden inside. 
if isinstance(predictions, np.ndarray): predictions = predictions.tolist() if isinstance(predictions, float) or isinstance( predictions, int) or isinstance(predictions, str): return predictions if isinstance(predictions[0], list) and len(predictions[0]) == 1: predictions = [row[0] for row in predictions] if len(predictions) == 1: return predictions[0] else: return predictions def predict_intervals(self, X, return_type=None): if self.interval_predictors is None: print( '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!' ) print('This model was not trained to predict intervals') print( 'Please follow the documentation to tell this model at training time to learn how to predict intervals' ) print( '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!' ) raise ValueError('This model was not trained to predict intervals') base_prediction = self.predict(X) result = {'prediction': base_prediction} for tup in self.interval_predictors: predictor_name = tup[0] predictor = tup[1] result[predictor_name] = predictor.predict(X) if scipy.sparse.issparse(X): len_input = X.shape[0] else: len_input = len(X) if (len_input == 1 and return_type is None) or return_type == 'dict': return result elif (len_input > 1 and return_type is None ) or return_type == 'df' or return_type == 'dataframe': return pd.DataFrame(result) elif return_type == 'list': if len_input == 1: list_result = [base_prediction] for tup in self.interval_predictors: list_result.append(result[tup[0]]) else: list_result = [] for idx in range(len_input): row_result = [base_prediction[idx]] for tup in self.interval_predictors: row_result.append(result[tup[0]][idx]) list_result.append(row_result) return list_result else: print( 'Please pass in a return_type value of one of the following: ["dict", "dataframe", "df", "list"]' ) raise (ValueError( 'Please pass in a return_type value of one of the following: ["dict", "dataframe", "df", "list"]' )) # transform is initially 
    # designed to be used with feature_learning
    def transform(self, X):
        predicted_features = self.predict(X)
        predicted_features = list(predicted_features)
        X = scipy.sparse.hstack([X, predicted_features], format='csr')
        return X

    # Allows the user to get the fully transformed data
    def transform_only(self, X):
        return X

    def predict_uncertainty(self, X):
        if self.uncertainty_model is None:
            print(
                '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
            )
            print('This model was not trained to predict uncertainties')
            print(
                'Please follow the documentation to tell this model at training time to learn how to predict uncertainties'
            )
            print(
                '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'
            )
            raise ValueError(
                'This model was not trained to predict uncertainties')

        base_predictions = self.predict(X)

        if isinstance(base_predictions, Iterable):
            base_predictions_col = [[val] for val in base_predictions]
            base_predictions_col = np.array(base_predictions_col)
        else:
            base_predictions_col = [base_predictions]

        X_combined = scipy.sparse.hstack([X, base_predictions_col],
                                         format='csr')

        uncertainty_predictions = self.uncertainty_model.predict_proba(
            X_combined)

        results = {
            'base_prediction': base_predictions,
            'uncertainty_prediction': uncertainty_predictions
        }

        if isinstance(base_predictions, Iterable):
            results['uncertainty_prediction'] = [
                row[1] for row in results['uncertainty_prediction']
            ]

            results = pd.DataFrame.from_dict(results, orient='columns')

            if self.uc_results is not None:
                calibration_results = {}
                # grab the relevant properties from our uc_results, and make them each their own list in calibration_results
                for key, value in self.uc_results[1].items():
                    calibration_results[key] = []

                for proba in results['uncertainty_prediction']:
                    max_bucket_proba = 0
                    bucket_num = 1
                    while proba > max_bucket_proba:
                        calibration_result = self.uc_results[bucket_num]
                        max_bucket_proba = self.uc_results[bucket_num][
                            'max_proba']
                        bucket_num += 1

                    for key, value in calibration_result.items():
                        calibration_results[key].append(value)

                # TODO: grab the uncertainty_calibration data for DataFrames
                df_calibration_results = pd.DataFrame.from_dict(
                    calibration_results, orient='columns')
                del df_calibration_results['max_proba']

                results = pd.concat([results, df_calibration_results], axis=1)

        else:
            if self.uc_results is not None:
                # TODO: grab the uncertainty_calibration data for dictionaries
                # NOTE: `proba` is never assigned in this branch, so this lookup is unfinished
                for bucket_name, bucket_result in self.uc_results.items():
                    if proba > bucket_result['max_proba']:
                        break
                    results.update(bucket_result)
                    del results['max_proba']

        return results

    def score_uncertainty(self, X, y, verbose=False):
        return self.uncertainty_model.score(X, y, verbose=False)

    def get_categorical_feature_indices(self):
        cat_feature_indices = None
        if self.keep_cat_features == True:
            cat_feature_names = [
                k for k, v in self.column_descriptions.items()
                if v == 'categorical'
            ]
            cat_feature_indices = [
                self.training_features.index(cat_name)
                for cat_name in cat_feature_names
            ]
        return cat_feature_indices

    def get_X_test(self, X_fit, y):
        if self.X_test is not None:
            return X_fit, y, self.X_test, self.y_test
        else:
            X_fit, X_test, y, y_test = train_test_split(X_fit,
                                                        y,
                                                        test_size=0.15)
            return X_fit, y, X_test, y_test
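# The bucket lookup in predict_uncertainty walks the calibration buckets with a
# while loop until it finds one whose 'max_proba' covers the predicted
# probability. A minimal sketch of the same idea as a standalone helper follows.
# It assumes uc_results maps bucket numbers (starting at 1) to dicts that each
# carry a 'max_proba' key in ascending order; find_calibration_bucket,
# example_uc_results and the 'percent_correct' key are hypothetical names used
# only for illustration.

```python
from bisect import bisect_left


def find_calibration_bucket(uc_results, proba):
    """Return the entry of the first bucket whose max_proba is >= proba."""
    bucket_nums = sorted(uc_results)
    max_probas = [uc_results[num]['max_proba'] for num in bucket_nums]
    # bisect_left gives the first index whose boundary is >= proba
    idx = bisect_left(max_probas, proba)
    # Clamp so a probability above the last boundary lands in the last bucket
    idx = min(idx, len(bucket_nums) - 1)
    return uc_results[bucket_nums[idx]]


# Hypothetical calibration table, for illustration only
example_uc_results = {
    1: {'max_proba': 0.25, 'percent_correct': 0.9},
    2: {'max_proba': 0.5, 'percent_correct': 0.7},
    3: {'max_proba': 1.0, 'percent_correct': 0.4},
}

bucket = find_calibration_bucket(example_uc_results, 0.3)
print(bucket['max_proba'])
```

# Unlike the while loop, this version also returns a bucket when proba is 0.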
class BaseKerasSklearnModel(base_model.BaseModel):
    '''
    Base Keras model built on Keras's scikit-learn wrapper (KerasClassifier)
    '''

    ## def __init__(self, data_file, delimiter, lst_x_keys, lst_y_keys, log_filename=DEFAULT_LOG_FILENAME, model_path=DEFAULT_MODEL_PATH, create_model_func=create_model_demo):
    ##     '''
    ##     init
    ##     '''
    ##     import framework.tools.log as log
    ##     loger = log.init_log(log_filename)
    ##     self.load_data(data_file, delimiter, lst_x_keys, lst_y_keys)
    ##     self.model_path = model_path
    ##     self.create_model_func=create_model_func

    def __init__(self, **kargs):
        '''
        init
        '''
        import framework.tools.log as log
        self.kargs = kargs
        log_filename = self.kargs["basic_params"]["log_filename"]
        model_path = self.kargs["basic_params"]["model_path"]
        self.load_data_func = self.kargs["load_data"]["method"]
        self.create_model_func = self.kargs["create_model"]["method"]
        loger = log.init_log(log_filename)
        (self.dataset, self.X, self.Y, self.X_evaluation,
         self.Y_evaluation) = self.load_data_func(
             **self.kargs["load_data"]["params"])
        self.model_path = model_path
        self.dic_params = {}

    def load_data(self, data_file, delimiter, lst_x_keys, lst_y_keys):
        '''
        load data
        '''
        # Load the dataset, honoring the delimiter passed by the caller
        self.dataset = numpy.loadtxt(data_file, delimiter=delimiter)
        self.X = self.dataset[:, lst_x_keys]
        self.Y = self.dataset[:, lst_y_keys]

    def init_callbacks(self):
        '''
        init all callbacks
        '''
        os.system("mkdir -p %s" % (self.model_path))
        checkpoint_callback = ModelCheckpoint(
            self.model_path + '/weights.{epoch:02d}-{acc:.2f}.hdf5',
            monitor='acc',
            save_best_only=False)
        history_callback = LossHistory()
        callbacks_list = [checkpoint_callback, history_callback]
        self.dic_params["callbacks"] = callbacks_list

    def init_model(self):
        '''
        init model
        '''
        train_params = {"nb_epoch": 10, "batch_size": 10}
        self.dic_params.update(train_params)
        self.model = KerasClassifier(build_fn=self.create_model_func,
                                     **self.kargs["create_model"]["params"])
        # self.model = KerasClassifier(build_fn=self.create_model_func)
        self.model.set_params(**self.dic_params)

    def train_model(self):
        '''
        train model
        '''
        X = self.X
        Y = self.Y
        X_evaluation = self.X_evaluation
        Y_evaluation = self.Y_evaluation
        # Fix the random seed for reproducibility
        seed = 7
        numpy.random.seed(seed)

        # Fit the model, then score it on the training data
        history = self.model.fit(X, Y)
        scores = self.model.score(X, Y)
        #history_callback = self.dic_params["callbacks"][1]
        # print dir(history_callback)
        # logging.info(str(history_callback.losses))
        logging.info("final : %.2f%%" % (scores * 100))
        logging.info(str(history.history))

    def process(self):
        '''
        process
        '''
        self.init_callbacks()
        self.init_model()
        self.train_model()
class FinalModelATC(BaseEstimator, TransformerMixin):

    def __init__(self,
                 model,
                 model_name=None,
                 ml_for_analytics=False,
                 type_of_estimator='classifier',
                 output_column=None,
                 name=None,
                 scoring_method=None,
                 training_features=None,
                 column_descriptions=None):
        self.model = model
        self.model_name = model_name
        self.ml_for_analytics = ml_for_analytics
        self.type_of_estimator = type_of_estimator
        self.name = name
        self.training_features = training_features
        self.column_descriptions = column_descriptions

        if self.type_of_estimator == 'classifier':
            self._scorer = scoring_method
        else:
            self._scorer = scoring_method

    def fit(self, X, y):
        self.model_name = get_name_from_model(self.model)

        # if self.model_name[:3] == 'XGB' and scipy.sparse.issparse(X):
        #     ones = [[1] for x in range(X.shape[0])]
        #     # Trying to force XGBoost to play nice with sparse matrices
        #     X_fit = scipy.sparse.hstack((X, ones))
        # else:
        X_fit = X

        if self.model_name[:12] == 'DeepLearning' or self.model_name in [
                'BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit',
                'ARDRegression', 'Perceptron', 'PassiveAggressiveClassifier',
                'SGDClassifier', 'RidgeClassifier', 'LogisticRegression'
        ]:
            if scipy.sparse.issparse(X_fit):
                X_fit = X_fit.todense()

            if self.model_name[:12] == 'DeepLearning':
                if keras_installed:
                    # For Keras, we need to tell it how many input nodes to expect, which is our num_cols
                    num_cols = X_fit.shape[1]

                    model_params = self.model.get_params()
                    del model_params['build_fn']

                    if self.type_of_estimator == 'regressor':
                        self.model = KerasRegressor(
                            build_fn=utils_models.make_deep_learning_model,
                            num_cols=num_cols,
                            **model_params)
                    elif self.type_of_estimator == 'classifier':
                        self.model = KerasClassifier(
                            build_fn=utils_models.
                            make_deep_learning_classifier,
                            num_cols=num_cols,
                            **model_params)
                else:
                    print(
                        'WARNING: We did not detect that Keras was available.')
                    raise TypeError(
                        'A DeepLearning model was requested, but Keras was not available to import'
                    )

        try:
            if self.model_name[:12] == 'DeepLearning':
                print(
                    'Stopping training early if we have not seen an improvement in training loss in 25 epochs'
                )
                from keras.callbacks import EarlyStopping
                early_stopping = EarlyStopping(monitor='loss',
                                               patience=25,
                                               verbose=1)
                self.model.fit(X_fit, y, callbacks=[early_stopping])
            else:
                self.model.fit(X_fit, y)

        except TypeError as e:
            if scipy.sparse.issparse(X_fit):
                X_fit = X_fit.todense()
            self.model.fit(X_fit, y)

        except KeyboardInterrupt as e:
            pass

        return self

    def remove_categorical_values(self, features):
        clean_features = set([])
        for feature in features:
            if '=' not in feature:
                clean_features.add(feature)
            else:
                clean_features.add(feature[:feature.index('=')])

        return clean_features

    def verify_features(self, X, raw_features_only=False):
        if self.column_descriptions is None:
            print(
                'This feature is not enabled by default. Depending on the shape of the training data, it can add hundreds of KB to the saved file size.'
            )
            print(
                'Please pass in `ml_predictor.train(data, verify_features=True)` when training a model, and we will enable this function, at the cost of a potentially larger file size.'
            )
            warnings.warn(
                'Please pass verify_features=True when invoking .train() on the ml_predictor instance.'
            )
            return None

        print(
            '\n\nNow verifying consistency between training features and prediction features'
        )
        if isinstance(X, dict):
            prediction_features = set(X.keys())
        elif isinstance(X, pd.DataFrame):
            prediction_features = set(X.columns)

        # If the user passed in categorical features, we will effectively one-hot-encode them ourselves here
        # Note that this assumes we're using the "=" as the separator in DictVectorizer/DataFrameVectorizer
        date_col_names = []
        categorical_col_names = []
        for key, value in self.column_descriptions.items():
            if value == 'categorical' and 'day_part' not in key:
                try:
                    # This covers the case that the user passes in a value in column_descriptions that is not present in their prediction data
                    column_vals = X[key].unique()
                    for val in column_vals:
                        prediction_features.add(key + '=' + str(val))

                    categorical_col_names.append(key)
                except Exception:
                    print(
                        '\nFound a column in your column_descriptions that is not present in your prediction data:'
                    )
                    print(key)

            elif 'day_part' in key:
                # We have found a date column.
                # Make sure this date column is in our prediction data
                # It is outside the scope of this function to make sure that the same date parts are available in both our training and testing data
                raw_date_col_name = key[:key.index('day_part') - 1]
                date_col_names.append(raw_date_col_name)

            elif value == 'output':
                try:
                    prediction_features.remove(key)
                except KeyError:
                    pass

        # Now that we've added in all the one-hot-encoded categorical columns (name=val1, name=val2), remove the base name from our prediction data
        prediction_features = prediction_features - set(categorical_col_names)

        # Get only the unique raw_date_col_names
        date_col_names = set(date_col_names)

        training_features = set(self.training_features)

        # Remove all of the transformed date column feature names from our training data
        features_to_remove = []
        for feature in training_features:
            for raw_date_col_name in date_col_names:
                if raw_date_col_name in feature:
                    features_to_remove.append(feature)
        training_features = training_features - set(features_to_remove)

        # Make sure the raw_date_col_name is in our training data after we have removed all the transformed feature names
        training_features = training_features | date_col_names

        # MVP means ignoring text features
        print_nlp_warning = False
        nlp_example = None
        # Iterate over a copy, since removing from a set while iterating over it raises a RuntimeError
        for feature in list(training_features):
            if 'nlp_' in feature:
                print_nlp_warning = True
                nlp_example = feature
                training_features.remove(feature)

        if print_nlp_warning == True:
            print('\n\nWe found an NLP column in the training data')
            print(
                'verify_features() currently does not support checking all of the values within an NLP column, so if the text of your NLP column has dramatically changed, you will have to check that yourself.'
            )
            print(
                'Here is one example of an NLP feature in the training data:')
            print(nlp_example)

        training_not_prediction = training_features - prediction_features

        if raw_features_only == True:
            training_not_prediction = self.remove_categorical_values(
                training_not_prediction)

        if len(training_not_prediction) > 0:
            print(
                '\n\nHere are the features this model was trained on that were not present in this prediction data:'
            )
            print(sorted(list(training_not_prediction)))
        else:
            print(
                'All of the features this model was trained on are included in the prediction data'
            )

        prediction_not_training = prediction_features - training_features

        if raw_features_only == True:
            prediction_not_training = self.remove_categorical_values(
                prediction_not_training)

        if len(prediction_not_training) > 0:
            # Separate out those values we were told to ignore by column_descriptions
            ignored_features = []
            for feature in prediction_not_training:
                if self.column_descriptions.get(feature, 'False') == 'ignore':
                    ignored_features.append(feature)
            prediction_not_training = prediction_not_training - set(
                ignored_features)

            print(
                '\n\nHere are the features available in the prediction data that were not part of the training data:'
            )
            print(sorted(list(prediction_not_training)))

            if len(ignored_features) > 0:
                print(
                    '\n\nAdditionally, we found features in the prediction data that we were told to ignore in the training data'
                )
                print(sorted(list(ignored_features)))
        else:
            print(
                'All of the features in the prediction data were in this model\'s training data'
            )

        print('\n\n')
        return {
            'training_not_prediction': training_not_prediction,
            'prediction_not_training': prediction_not_training
        }

    def score(self, X, y, verbose=False):
        # At the time of writing this, GradientBoosting does not support sparse matrices for predictions
        if (self.model_name[:16] == 'GradientBoosting' or self.model_name in [
                'BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit',
                'ARDRegression'
        ]) and scipy.sparse.issparse(X):
            X = X.todense()

        if self._scorer is not None:
            if self.type_of_estimator == 'regressor':
                return self._scorer.score(self, X, y)
            elif self.type_of_estimator == 'classifier':
                return self._scorer.score(self, X, y)
        else:
            return self.model.score(X, y)

    def predict_proba(self, X, verbose=False):
        # if self.model_name[:3] == 'XGB' and scipy.sparse.issparse(X):
        #     ones = [[1] for x in range(X.shape[0])]
        #     # Trying to force XGBoost to play nice with sparse matrices
        #     X = scipy.sparse.hstack((X, ones))

        if (self.model_name[:16] == 'GradientBoosting'
                or self.model_name[:12] == 'DeepLearning'
                or self.model_name in [
                    'BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit',
                    'ARDRegression'
                ]) and scipy.sparse.issparse(X):
            X = X.todense()

        try:
            predictions = self.model.predict_proba(X)
        except AttributeError as e:
            # print('This model has no predict_proba method. Returning results of .predict instead.')
            try:
                predictions = self.model.predict(X)
            except TypeError as e:
                if scipy.sparse.issparse(X):
                    X = X.todense()
                predictions = self.model.predict(X)
        except TypeError as e:
            if scipy.sparse.issparse(X):
                X = X.todense()
            predictions = self.model.predict_proba(X)

        # If this model does not have predict_proba, and we have fallen back on predict, we want to make sure we give results back in the same format the user would expect for predict_proba, namely each prediction is a list of predicted probabilities for each class.
        # Note that this DOES NOT WORK for multi-label problems, or problems that are not reduced to 0,1
        # If this is not an iterable (ignoring strings, which might be iterable), then we will want to turn our predictions into tupled predictions
        if not (hasattr(predictions[0], '__iter__')
                and not isinstance(predictions[0], str)):
            tupled_predictions = []
            for prediction in predictions:
                if prediction == 1:
                    tupled_predictions.append([0, 1])
                else:
                    tupled_predictions.append([1, 0])
            predictions = tupled_predictions

        # This handles an annoying edge case with libraries like Keras that, for a binary classification problem, will return a single predicted probability in a list, rather than the probability of both classes in a list
        if len(predictions[0]) == 1:
            tupled_predictions = []
            for prediction in predictions:
                tupled_predictions.append([1 - prediction[0], prediction[0]])
            predictions = tupled_predictions

        if X.shape[0] == 1:
            return predictions[0]
        else:
            return predictions

    def predict(self, X, verbose=False):
        # if self.model_name[:3] == 'XGB' and scipy.sparse.issparse(X):
        #     ones = [[1] for x in range(X.shape[0])]
        #     # Trying to force XGBoost to play nice with sparse matrices
        #     X_predict = scipy.sparse.hstack((X, ones))

        if (self.model_name[:16] == 'GradientBoosting'
                or self.model_name[:12] == 'DeepLearning'
                or self.model_name in [
                    'BayesianRidge', 'LassoLars', 'OrthogonalMatchingPursuit',
                    'ARDRegression'
                ]) and scipy.sparse.issparse(X):
            X_predict = X.todense()
        else:
            X_predict = X

        prediction = self.model.predict(X_predict)
        # Handle cases of getting a prediction for a single item.
        # It makes a cleaner interface to get just the single prediction back, rather than a list with the prediction hidden inside.
        if len(prediction) == 1:
            return prediction[0]
        else:
            return prediction
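# The fallback handling in predict_proba coerces two awkward output shapes
# (hard 0/1 labels, and single-probability rows as returned by some Keras
# binary classifiers) into [P(class 0), P(class 1)] rows. A minimal standalone
# sketch of that normalization follows. to_class_probabilities is a
# hypothetical helper name, and unlike the method above it inspects every row
# rather than only the first one.

```python
def to_class_probabilities(predictions):
    """Coerce binary predictions into [P(class 0), P(class 1)] rows."""
    rows = []
    for prediction in predictions:
        is_sequence = (hasattr(prediction, '__iter__')
                       and not isinstance(prediction, str))
        if not is_sequence:
            # Hard label from .predict(): turn it into a one-hot pair
            rows.append([0, 1] if prediction == 1 else [1, 0])
        elif len(prediction) == 1:
            # Single probability of the positive class
            rows.append([1 - prediction[0], prediction[0]])
        else:
            # Already one probability per class
            rows.append(list(prediction))
    return rows


print(to_class_probabilities([1, 0]))
print(to_class_probabilities([[0.25], [0.5]]))
```

# As in the original, this only makes sense for binary problems reduced to 0/1.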
dummy_y = np_utils.to_categorical(encoded_Y)

from sklearn.model_selection import train_test_split

# Shuffle and split the dataset into training and testing sets (10% held out for testing)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, dummy_y, test_size=0.1, random_state=42, stratify=Y)


# Define the neural network baseline model
def baseline_model():
    # Create model
    model = Sequential()
    model.add(Dense(4, input_dim=2, init='normal', activation='relu'))
    model.add(Dense(6, init='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model


# Evaluate the model with k-fold cross-validation
estimator = KerasClassifier(build_fn=baseline_model,
                            nb_epoch=200,
                            batch_size=5,
                            verbose=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X_train, Y_train, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

# Make predictions
estimator.fit(X_train, Y_train)
print("Accuracy: {}%\n".format(estimator.score(X_test, Y_test) * 100))
predictions = estimator.predict(X_test)
print(predictions)
print(encoder.inverse_transform(predictions))
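# The script above encodes string labels as integers, one-hot encodes them for
# training, and finally maps predicted class indices back to the original
# labels. A minimal pure-Python sketch of that round trip follows. The helpers
# encode_labels, one_hot and decode are hypothetical stand-ins for
# LabelEncoder, np_utils.to_categorical and encoder.inverse_transform; they are
# not the library implementations.

```python
def encode_labels(labels):
    """Map each distinct label to an integer index (sorted order)."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


def one_hot(indices, num_classes):
    """Turn class indices into one-hot rows."""
    return [[1.0 if i == idx else 0.0 for i in range(num_classes)]
            for idx in indices]


def decode(score_rows, classes):
    """Pick the highest-scoring class per row and map it back to its label."""
    return [classes[max(range(len(row)), key=row.__getitem__)]
            for row in score_rows]


labels = ['cat', 'dog', 'cat', 'bird']
classes, encoded = encode_labels(labels)
targets = one_hot(encoded, len(classes))
# Feeding the one-hot targets back in stands in for a perfect classifier
print(decode(targets, classes))
```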