UPSG is a standard methodology, an interchange format, and a Python library for writing machine learning pipelines.
It is designed primarily to give teams working on different machine learning problems a way to share code across languages and environments.
Install with:
pip install git+https://github.com/dssg/UPSG.git
To use the UPSG Python library, we currently require the following packages. In most environments, pip should take care of this for you.
This is how to implement the sklearn "Getting started" pipeline:
from sklearn import datasets
from sklearn.svm import SVC
from upsg.fetch.np import NumpyRead
from upsg.wrap.wrap_sklearn import wrap_and_make_instance
from upsg.export.csv import CSVWrite
from upsg.transform.split import SplitTrainTest
from upsg.pipeline import Pipeline
digits = datasets.load_digits()
digits_data = digits.data
# for now, we need a column vector rather than an array
digits_target = digits.target
p = Pipeline()
# load data from a numpy dataset
stage_data = NumpyRead(digits_data)
stage_target = NumpyRead(digits_target)
# train/test split
stage_split_data = SplitTrainTest(2, test_size=1, random_state=0)
# build a classifier
stage_clf = wrap_and_make_instance(SVC, gamma=0.001, C=100.)
# output to a csv
stage_csv = CSVWrite('out.csv')
node_data, node_target, node_split, node_clf, node_csv = map(
    p.add, [
        stage_data, stage_target, stage_split_data, stage_clf,
        stage_csv])
# connect the pipeline stages together
node_data['output'] > node_split['input0']
node_target['output'] > node_split['input1']
node_split['train0'] > node_clf['X_train']
node_split['train1'] > node_clf['y_train']
node_split['test0'] > node_clf['X_test']
node_clf['y_pred'] > node_csv['input']
p.run()
# results are now in out.csv
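The `node['output'] > node['input']` lines above may look unusual if you have not seen operator overloading used for wiring. The sketch below shows one way such syntax can be built in plain Python. It is not UPSG's actual implementation: the `Graph`, `Node`, and `Port` classes are hypothetical names invented for this illustration, and they only record edges rather than run anything.

```python
class Port:
    """One named input or output of a node (hypothetical, for illustration)."""

    def __init__(self, node, key):
        self.node = node
        self.key = key

    def __gt__(self, other):
        # `a > b` is reinterpreted as "connect output port a to input port b".
        edge = (self.node.name, self.key, other.node.name, other.key)
        self.node.graph.edges.append(edge)
        return True


class Node:
    """A stage that has been added to a graph (hypothetical)."""

    def __init__(self, graph, name):
        self.graph = graph
        self.name = name

    def __getitem__(self, key):
        # node['output'] returns a Port bound to this node.
        return Port(self, key)


class Graph:
    """Collects nodes and the edges between their ports (hypothetical)."""

    def __init__(self):
        self.edges = []

    def add(self, name):
        return Node(self, name)


g = Graph()
node_data, node_split = map(g.add, ["data", "split"])

# Same shape as the wiring lines in the UPSG example above.
node_data["output"] > node_split["input0"]

print(g.edges)  # [('data', 'output', 'split', 'input0')]
```

The comparison is used only for its side effect, so its result is discarded. A real pipeline library would also need to validate port names and schedule the stages, which this sketch leaves out.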
Check out the documentation