def run_experiment(model_params: Dict[str, Any],
                   X_train: pd.DataFrame,
                   y: pd.Series,
                   X_test: Optional[pd.DataFrame] = None,
                   logging_directory: str = 'output/{time}',
                   if_exists: str = 'error',
                   eval_func: Optional[Callable] = None,
                   algorithm_type: Union[str, Type[BaseEstimator]] = 'lgbm',
                   fit_params: Optional[Union[Dict[str, Any], Callable]] = None,
                   cv: Optional[Union[int, Iterable, BaseCrossValidator]] = None,
                   groups: Optional[pd.Series] = None,
                   categorical_feature: Optional[List[str]] = None,
                   sample_submission: Optional[pd.DataFrame] = None,
                   submission_filename: Optional[str] = None,
                   type_of_target: str = 'auto',
                   feature_list: Optional[List[Union[int, str]]] = None,
                   feature_directory: Optional[str] = None,
                   inherit_experiment: Optional[Experiment] = None,
                   with_auto_hpo: bool = False,
                   with_auto_prep: bool = False,
                   with_mlflow: bool = False):
    """
    Evaluate metrics by cross-validation and store the results (log, oof prediction, test prediction,
    feature importance plot and submission file) under the specified directory.

    One of the following estimators is used (automatically dispatched by ``type_of_target(y)``
    and ``algorithm_type``):

    * LGBMClassifier
    * LGBMRegressor
    * CatBoostClassifier
    * CatBoostRegressor

    The output files are laid out as follows:

    .. code-block:: none

      <logging_directory>/
          log.txt                  <== Logging file
          importance.png           <== Feature importance plot generated by nyaggle.util.plot_importance
          oof_prediction.npy       <== Out-of-fold prediction in numpy array format
          test_prediction.npy      <== Test prediction in numpy array format
          submission.csv           <== Submission csv file
          metrics.json             <== Metrics
          params.json              <== Parameters
          models/
              fold1                <== The trained model in fold 1
              ...

    Args:
        model_params:
            Parameters passed to the constructor of the classifier/regressor object (i.e. LGBMRegressor).
        X_train:
            Training data. Categorical features should be cast to pandas categorical type or
            encoded to integers.
        y:
            Target variable.
        X_test:
            Test data (optional). If specified, prediction on the test data is performed using
            an ensemble of the fold models.
        logging_directory:
            Path to the directory where the output of the experiment is stored.
        if_exists:
            How to behave if the logging directory already exists.

            - error: Raise a ValueError.
            - replace: Delete the logging directory before logging.
            - append: Append to the existing experiment.
            - rename: Rename the current directory by adding a "_1", "_2"... suffix.
        fit_params:
            Parameters passed to the fit method of the estimator. If a dict is passed, the same
            parameters (except ``eval_set``) are passed for each fold. If a callable is passed,
            the return value of ``fit_params(fold_id, train_index, test_index)`` is used for each fold.
        eval_func:
            Function used for logging and calculation of the returned scores. This parameter is not
            passed to the GBDT library, so you should set ``objective`` and ``eval_metric`` separately
            if needed. If ``eval_func`` is ``None``, ``roc_auc_score`` or ``mean_squared_error`` is
            used by default.
        algorithm_type:
            Type of gradient boosting library used. "lgbm" (lightgbm) or "cat" (catboost).
        cv:
            int, cross-validation generator or an iterable which determines the cross-validation
            splitting strategy.

            - None, to use the default ``KFold(5, random_state=0, shuffle=True)``,
            - integer, to specify the number of folds in a ``(Stratified)KFold``,
            - CV splitter (an instance of ``BaseCrossValidator``),
            - An iterable yielding (train, test) splits as arrays of indices.
        groups:
            Group labels for the samples. Only used in conjunction with a "Group" cv instance
            (e.g., ``GroupKFold``).
        sample_submission:
            A sample dataframe aligned with the test data (in Kaggle it is usually available as
            sample_submission.csv). The submission file will be created with the same schema as
            this dataframe.
        submission_filename:
            The name of the submission file created under the logging directory. If ``None``,
            the basename of the logging directory will be used as the filename.
        categorical_feature:
            List of categorical column names. If ``None``, categorical columns are automatically
            determined by dtype.
        type_of_target:
            The type of the target variable. If ``auto``, the type is inferred by
            ``sklearn.utils.multiclass.type_of_target``. Otherwise, ``binary``, ``continuous``,
            or ``multiclass`` are supported.
        feature_list:
            The list of feature ids saved through the nyaggle.feature_store module.
        feature_directory:
            The location of the stored features. Only used if ``feature_list`` is not empty.
        inherit_experiment:
            An experiment object which is used to log results. If not ``None``, all logs in this
            function are treated as a part of that experiment.
        with_auto_prep:
            If True, the input datasets will be copied and automatic preprocessing will be
            performed on them. For example, if ``algorithm_type = 'cat'``, all missing values in
            categorical features will be filled.
        with_auto_hpo:
            If True, model parameters will be automatically tuned using optuna
            (only available for LightGBM).
        with_mlflow:
            If True, `mlflow tracking <https://www.mlflow.org/docs/latest/tracking.html>`_ is used.
            One instance of ``nyaggle.experiment.Experiment`` corresponds to one run in mlflow.
            Note that output files are stored both under ``logging_directory`` and mlflow's
            directory (``mlruns`` by default).

    :return:
        Namedtuple with the following members

        * oof_prediction:
            numpy array, shape (len(X_train),); predicted values on the out-of-fold validation data.
        * test_prediction:
            numpy array, shape (len(X_test),); predicted values on the test data.
            ``None`` if ``X_test`` is ``None``.
        * metrics:
            list of float, shape (nfolds+1); ``metrics[i]`` denotes the validation score in the i-th
            fold, and ``metrics[-1]`` is the overall score.
        * models:
            list of objects, shape (nfolds); the trained model for each fold.
        * importance:
            list of pd.DataFrame; feature importance for each fold (type="gain").
        * time:
            Training time in seconds.
        * submit_df:
            The dataframe saved as submission.csv.
    """
    start_time = time.time()
    cv = check_cv(cv, y)

    if feature_list:
        X = pd.concat([X_train, X_test]) if X_test is not None else X_train
        X.reset_index(drop=True, inplace=True)
        X = load_features(X, feature_list, directory=feature_directory)
        ntrain = len(X_train)
        X_train, X_test = X.iloc[:ntrain, :], X.iloc[ntrain:, :].reset_index(drop=True)

    _check_input(X_train, y, X_test)

    if categorical_feature is None:
        categorical_feature = [c for c in X_train.columns
                               if X_train[c].dtype.name in ['object', 'category']]

    if type_of_target == 'auto':
        type_of_target = multiclass.type_of_target(y)

    model_type, eval_func, cat_param_name = _dispatch_models(algorithm_type, type_of_target, eval_func)

    if with_auto_prep:
        assert algorithm_type in ('cat', 'xgb', 'lgbm'), "with_auto_prep is only supported for gbdt"
        X_train, X_test = autoprep_gbdt(algorithm_type, X_train, X_test, categorical_feature)

    logging_directory = logging_directory.format(time=datetime.now().strftime('%Y%m%d_%H%M%S'))

    if inherit_experiment is not None:
        experiment = ExpeimentProxy(inherit_experiment)
    else:
        experiment = Experiment(logging_directory, if_exists=if_exists, with_mlflow=with_mlflow)

    with experiment as exp:
        exp.log('Algorithm: {}'.format(algorithm_type))
        exp.log('Experiment: {}'.format(exp.logging_directory))
        exp.log('Params: {}'.format(model_params))
        exp.log('Features: {}'.format(list(X_train.columns)))
        exp.log_param('algorithm_type', algorithm_type)
        exp.log_param('num_features', X_train.shape[1])
        if callable(fit_params):
            exp.log_param('fit_params', str(fit_params))
        else:
            exp.log_dict('fit_params', fit_params)
        exp.log_dict('model_params', model_params)
        if feature_list is not None:
            exp.log_param('features', feature_list)

        if with_auto_hpo:
            assert algorithm_type == 'lgbm', 'auto-tuning is only supported for LightGBM'
            model_params = find_best_lgbm_parameter(model_params, X_train, y, cv=cv, groups=groups,
                                                    type_of_target=type_of_target)
            exp.log_param('model_params_tuned', model_params)

        exp.log('Categorical: {}'.format(categorical_feature))

        models = [model_type(**model_params) for _ in range(cv.get_n_splits())]

        if fit_params is None:
            fit_params = {}
        if cat_param_name is not None and not callable(fit_params) and cat_param_name not in fit_params:
            fit_params[cat_param_name] = categorical_feature
        if isinstance(fit_params, Dict):
            exp.log_params(fit_params)

        result = cross_validate(models, X_train=X_train, y=y, X_test=X_test, cv=cv, groups=groups,
                                logger=exp.get_logger(), eval_func=eval_func, fit_params=fit_params,
                                type_of_target=type_of_target)

        # save oof
        exp.log_numpy('oof_prediction', result.oof_prediction)
        exp.log_numpy('test_prediction', result.test_prediction)

        for i in range(cv.get_n_splits()):
            exp.log_metric('Fold {}'.format(i + 1), result.scores[i])
        exp.log_metric('Overall', result.scores[-1])

        # save importance plot
        if result.importance:
            importance = pd.concat(result.importance)
            plot_file_path = os.path.join(exp.logging_directory, 'importance.png')
            plot_importance(importance, plot_file_path)
            exp.log_artifact(plot_file_path)

        # save trained model
        for i, model in enumerate(models):
            _save_model(model, exp.logging_directory, i + 1, exp)

        # save submission.csv
        submit_df = None
        if X_test is not None:
            submit_df = make_submission_df(result.test_prediction, sample_submission, y)
            exp.log_dataframe(submission_filename or os.path.basename(exp.logging_directory), submit_df, 'csv')

        elapsed_time = time.time() - start_time

        return ExperimentResult(result.oof_prediction, result.test_prediction,
                                result.scores, models, result.importance, elapsed_time, submit_df)
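

# Usage sketch for ``run_experiment`` (illustrative only, not executed by the library).
# ``make_regression_df`` is the synthetic-data helper already used in the ``adversarial_validate``
# doctest below; the model parameters here are arbitrary placeholders.
#
# >>> from nyaggle.testing import make_regression_df
# >>> X, y = make_regression_df(n_samples=128)
# >>> result = run_experiment(
# ...     model_params={'max_depth': 4, 'learning_rate': 0.1},  # forwarded to the estimator constructor
# ...     X_train=X, y=y,
# ...     logging_directory='output/{time}',                     # '{time}' is replaced by a timestamp
# ...     cv=5,                                                  # expands to (Stratified)KFold(5)
# ...     if_exists='replace')
# >>> print(result.metrics[-1])                                  # overall cross-validation score
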
def adversarial_validate(X_train: pd.DataFrame,
                         X_test: pd.DataFrame,
                         importance_type: str = 'gain',
                         estimator: Optional[BaseEstimator] = None,
                         cat_cols=None,
                         cv=None) -> ADVResult:
    """
    Perform adversarial validation between X_train and X_test.

    Args:
        X_train:
            Training data.
        X_test:
            Test data.
        importance_type:
            The type of feature importance calculated.
        estimator:
            The custom estimator. If ``None``, LGBMClassifier is automatically used.
            Only LGBMModel or CatBoost instances are allowed.
        cat_cols:
            List of categorical column names. If specified, it is passed to the classifier's
            ``fit`` as ``categorical_feature``.
        cv:
            Cross validation split. If ``None``, the first fold out of 5 folds is used as validation.

    Returns:
        Namedtuple with the following members

        * auc:
            float, ROC AUC score of the adversarial validation.
        * importance:
            pandas DataFrame, feature importance of the adversarial model (ordered by importance).

    Example:
        >>> from sklearn.model_selection import train_test_split
        >>> from nyaggle.testing import make_regression_df
        >>> from nyaggle.validation import adversarial_validate

        >>> X, y = make_regression_df(n_samples=8)
        >>> X_train, X_test, y_train, y_test = train_test_split(X, y)
        >>> auc, importance = adversarial_validate(X_train, X_test)
        >>>
        >>> print(auc)
        0.51078231
        >>> importance.head()
        feature importance
        col_1   231.5827204
        col_5   207.1837266
        col_7   188.6920685
        col_4   174.5668498
        col_9   170.6438643
    """
    concat = pd.concat([X_train, X_test]).copy().reset_index(drop=True)
    y = np.array([1] * len(X_train) + [0] * len(X_test))

    if estimator is None:
        requires_lightgbm()
        from lightgbm import LGBMClassifier
        estimator = LGBMClassifier(n_estimators=10000, objective='binary',
                                   importance_type=importance_type, random_state=0)
    else:
        assert is_instance(estimator, ('lightgbm.sklearn.LGBMModel', 'catboost.core.CatBoost')), \
            'Only CatBoostClassifier or LGBMClassifier is allowed'

    if cv is None:
        cv = Take(1, KFold(5, shuffle=True, random_state=0))

    fit_params = {'verbose': -1}
    if cat_cols:
        fit_params['categorical_feature'] = cat_cols

    result = cross_validate(estimator, concat, y, None, cv=cv, eval_func=roc_auc_score,
                            fit_params=fit_params, importance_type=importance_type)

    importance = pd.concat(result.importance)
    importance = importance.groupby('feature')['importance'].mean().reset_index()
    importance.sort_values(by='importance', ascending=False, inplace=True)
    importance.reset_index(drop=True, inplace=True)

    return ADVResult(result.scores[-1], importance)
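

# Usage sketch for ``adversarial_validate`` with explicit categorical columns and a full 3-fold
# split instead of the default single fold (illustrative only; the column names are hypothetical).
# ``cat_cols`` is forwarded to the classifier's fit() as ``categorical_feature``.
#
# >>> from sklearn.model_selection import KFold
# >>> auc, importance = adversarial_validate(
# ...     X_train, X_test,
# ...     cat_cols=['city', 'device'],                # hypothetical categorical column names
# ...     cv=KFold(3, shuffle=True, random_state=0))  # overall AUC across the 3 folds
# >>> importance.head()                               # features that most separate train from test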