Example #1
from kinetica_proc import ProcData
"""Copies the data from input_data to output_data.
    Refer to the official documentation: https://www.kinetica.com/docs/udf/example_table_copy.html
"""

proc_data = ProcData()

for in_table, out_table in zip(proc_data.input_data, proc_data.output_data):
    out_table.size = in_table.size

    for in_column, out_column in zip(in_table, out_table):
        out_column.extend(in_column)

proc_data.results.update(proc_data.params)
proc_data.bin_results.update(proc_data.bin_params)

proc_data.complete()
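The copy pattern above pairs each input table with its output table and each input column with its output column via zip, then bulk-appends values with extend. A minimal sketch of the same pattern using plain Python lists in place of ProcData tables:

```python
# Stand-ins for proc_data.input_data / proc_data.output_data:
# each "table" is a list of columns, each column a list of values.
input_data = [
    [[1, 2, 3], ['a', 'b', 'c']],   # table 0: two columns
    [[10.0, 20.0]],                 # table 1: one column
]
output_data = [[[] for _ in table] for table in input_data]

for in_table, out_table in zip(input_data, output_data):
    for in_column, out_column in zip(in_table, out_table):
        out_column.extend(in_column)  # bulk copy, column by column

print(output_data)
```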

Example #2
from kinetica_proc import ProcData
from sklearn import tree
import pickle
import test_environment as te
from sklearn.metrics import accuracy_score
"""
    This is a distributed UDF that trains and stores a decision tree model - one per TOM. Note that one instance
    of the UDF runs per rank per TOM. If you only have one rank, you may want to increase the property
    ranks_per_tom in gpudb.conf to something greater than 1 (e.g. 8).
    Since each instance of this UDF only sees the data local to its rank and TOM, it is important that the data
    was distributed randomly (uniformly) at ingest time. This can be achieved by not using a shard key, or by
    introducing a column containing a random number and using it as the shard key.
    The data used in this example is about loans with target feature 'bad_loan' that can have the values 0 and 1.
"""

proc_data = ProcData()
"""Output rank & tom information"""
rank_number = proc_data.request_info['rank_number']
tom_number = proc_data.request_info['tom_number']
print('\nUDF train r{}_t{}: instantiated.'.format(rank_number, tom_number))
"""Load and prepare training data"""
training_data = proc_data.to_df().dropna()  # only use non-NAN rows
num_input_data = training_data.shape[0]
X = training_data[[
    'loan_amnt', 'int_rate', 'emp_length', 'annual_inc', 'dti', 'delinq_2yrs',
    'revol_util', 'total_acc', 'longest_credit_length'
]]
y = training_data[['bad_loan']]
"""Train model"""
print('UDF train r{}_t{}: learning model on {} data points.'.format(
    rank_number, tom_number, num_input_data))
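The training code is cut off above; the step that usually follows fits the decision tree on (X, y), serializes it with pickle, and stores the bytes (e.g. via proc_data.bin_results, as in Example #1) so the per-TOM model can be loaded at inference time. A self-contained sketch on toy data:

```python
import pickle
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan features/target used above.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Serialize the per-TOM model; in the UDF these bytes would be stored
# via proc_data.bin_results (the result key name would be up to you).
model_bytes = pickle.dumps(model)
restored = pickle.loads(model_bytes)
print(list(restored.predict([[0, 0], [1, 1]])))  # -> [0, 1]
```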
Example #3
    like to learn multiple models - each on a fraction of the data. However, in this situation it is important to 
    keep two things in mind:
    1) The data needs to be distributed across TOMs such that its distribution is (nearly) the same on each TOM. When
        you use KiFS this should automatically be the case since the data is distributed randomly.
    2) At the inference step all models are applied to a record and the prediction results need to be combined. 
        This could be done efficiently through another distributed UDF.
"""


"""Initialize demo dependencies"""
h2o.init(nthreads=-1)


"""Get H2O data frame via Kinetica UDF API"""
print('Receiving h2o df...')
proc_data = ProcData()
h2o_df = proc_data.to_h2odf()

print('h2o df shape: {}'.format(h2o_df.shape))


"""Use H2O API to learn a GLM model"""
print('Partitioning data')
splits = h2o_df.split_frame(ratios=[0.7, 0.15], seed=1)
train = splits[0]
valid = splits[1]
test = splits[2]
print('Identify response and predictor variables')
y = 'bad_loan'
x = list(h2o_df.columns)
x.remove(y)  # remove the response
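split_frame(ratios=[0.7, 0.15]) yields three frames: 70% train, 15% validation, and the remaining ~15% test. The same three-way split can be sketched with NumPy index arithmetic:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 1000
idx = rng.permutation(n)

# ratios=[0.7, 0.15] leaves the remaining ~15% for the test split,
# mirroring the split_frame call above.
train_end = int(0.7 * n)
valid_end = train_end + int(0.15 * n)
train_idx, valid_idx, test_idx = np.split(idx, [train_end, valid_end])

print(len(train_idx), len(valid_idx), len(test_idx))  # -> 700 150 150
```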

Example #4
from kinetica_proc import ProcData
import pickle
import test_environment as te
from sklearn.metrics import accuracy_score
import numpy as np


"""
    This is a distributed UDF that combines predictions of multiple models in an ensemble.
    There will be one instance of this UDF running, per rank per tom. If you only have one rank, you may want to 
    increase the property ranks_per_tom to something greater than 1 (e.g. 8), in gpudb.conf.
    An instance of this UDF loads all models into memory and executes all models against the inference data that is 
    local to on their rank and tom. The prediction results of all models on one inference record are then combined.
"""


proc_data = ProcData()

"""Output rank & tom information"""
rank_number = proc_data.request_info['rank_number']
tom_number = proc_data.request_info['tom_number']
print('\nUDF test r{}_t{}: instantiated.'.format(rank_number, tom_number))

"""Load test data - NOTE: same data prep steps required as in dt_train.py!"""
test_data = proc_data.to_df().dropna()
num_test_data = test_data.shape[0]
X = test_data[['loan_amnt', 'int_rate', 'emp_length', 'annual_inc', 'dti', 'delinq_2yrs', 'revol_util', 'total_acc',
               'longest_credit_length']]
y_actual = test_data[['bad_loan']]
record_ids = test_data[['record_id']]

"""Get output table information"""