UPSG: The Universal Pipeline for Social Good

Introduction

UPSG is a standard methodology, an interchange format, and a Python library for writing machine learning pipelines.

Its primary goal is to let teams working on different machine learning problems share code across languages and environments.

Installation

Install with pip:

pip install git+git://github.com/dssg/UPSG.git

The UPSG Python library currently requires the following packages. In most environments, pip installs them for you.

Required

Python packages

Other packages

Optional

Python packages

Other packages

Example

The following example implements the scikit-learn "Getting started" pipeline with UPSG:

from sklearn import datasets
from sklearn.svm import SVC

from upsg.fetch.np import NumpyRead
from upsg.wrap.wrap_sklearn import wrap_and_make_instance
from upsg.export.csv import CSVWrite
from upsg.transform.split import SplitTrainTest
from upsg.pipeline import Pipeline

digits = datasets.load_digits()
digits_data = digits.data
# for now, we need a column vector rather than an array
digits_target = digits.target

p = Pipeline()

# load data from a numpy dataset
stage_data = NumpyRead(digits_data)
stage_target = NumpyRead(digits_target)

# train/test split
stage_split_data = SplitTrainTest(2, test_size=1, random_state=0)

# build a classifier
stage_clf = wrap_and_make_instance(SVC, gamma=0.001, C=100.)

# output to a csv
stage_csv = CSVWrite('out.csv')

node_data, node_target, node_split, node_clf, node_csv = map(
    p.add, [
        stage_data, stage_target, stage_split_data, stage_clf,
        stage_csv])

# connect the pipeline stages together
node_data['output'] > node_split['input0']
node_target['output'] > node_split['input1']
node_split['train0'] > node_clf['X_train']
node_split['train1'] > node_clf['y_train']
node_split['test0'] > node_clf['X_test']
node_clf['y_pred'] > node_csv['input']

p.run()

# results are now in out.csv
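
The wiring syntax in the example, node['output'] > node['input'], can look surprising at first. The block below is a minimal, standalone sketch of how such a syntax could be built with Python operator overloading. It is not UPSG's implementation: the Node and Port classes, the 'values' key, and the lazy run() method are hypothetical names made up for illustration.

```python
class Port:
    """A named input or output slot on a node (hypothetical helper)."""

    def __init__(self, node, key):
        self.node = node
        self.key = key

    def __gt__(self, other):
        # `out_port > in_port` records that the downstream input
        # should be fed from this upstream output.
        other.node.sources[other.key] = self
        return True


class Node:
    """Wraps a function that maps keyword inputs to a dict of outputs."""

    def __init__(self, func):
        self.func = func
        self.sources = {}    # input key -> upstream Port
        self.results = None  # cached outputs after the first run

    def __getitem__(self, key):
        # node['name'] yields a Port that can be used on either side of `>`
        return Port(self, key)

    def run(self):
        # Run upstream nodes first, then this node; cache the result.
        if self.results is None:
            kwargs = {
                key: port.node.run()[port.key]
                for key, port in self.sources.items()
            }
            self.results = self.func(**kwargs)
        return self.results


# Three toy stages: produce a list, sort it, sum it.
source = Node(lambda: {'output': [3, 1, 2]})
sorter = Node(lambda values: {'output': sorted(values)})
total = Node(lambda values: {'output': sum(values)})

# Connect the stages together.
source['output'] > sorter['values']
sorter['output'] > total['values']

result = total.run()['output']  # result is 6
```

In this sketch each connection is stored on the downstream node, so running the last node pulls data through every stage it depends on. A real pipeline library would typically also validate keys and detect cycles, which this toy version omits.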

Next Steps

Check out the documentation

About

A set of tools and conventions to help data scientists share code
