
chevrons >>

Rapidly build pipelines for out-of-core data processing using higher-order functions.

Table of Contents

  • What is this?
  • Higher-Order Functions
  • Parallelism
  • Implementing Your Own Pipeline Pieces
  • Example: Training a scikit-learn Model on Synthetic Data
  • Contact

What is this?

chevrons is a collection of tools for building memory-efficient pipelines on iterators by composing functions. It introduces a clean syntax for passing data between composed functions. Data passing through each stage of the pipeline is evaluated lazily, making it easy to work with very large (or infinite) streams of data in settings with limited resources.

chevrons is perfect for situations where you have a lot of data that you want to process out-of-core, but not enough to warrant a "big data" solution. It lets you build pipeline components that can be composed, reordered, and reused easily. The name of the package comes from the syntax of these pipelines, which looks like this:

output = input_data | function_1 >> function_2 >> function_3

This syntax makes pipelines readable and easy to understand. Subsections of a pipeline can also be named and reused:

preprocess = function_1 >> function_2
output1 = input_data1 | preprocess >> function_3
output2 = input_data2 | preprocess >> function_4

The objects passed between functions are all generators/iterators, so output1 and output2 are generators that can be evaluated one element at a time (for example, written to a file or used to train a machine learning model).
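
For instance (a minimal sketch, assuming the Map block from pipeline_hof introduced below, and that pipeline output behaves as an ordinary iterator as described above):

from chevrons.pipeline_hof import Map

# Building the pipeline computes nothing yet, even for a huge input...
doubled = range(10 ** 9) | Map(lambda x: x * 2)

# ...elements are only produced as they are consumed.
for value in doubled:
    print(value)
    break  # stop after the first element; the rest is never computed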

You can construct your own pipeline pieces by subclassing one of the classes in pipeline_base, or use the ready-made Map, Fold, and Reduce blocks from pipeline_hof.

Please note that at the moment, chevrons supports Python 3 only.

Higher-Order Functions

Currently, the Map, Filter, Fold, and Reduce functions are included as pipeline components. For a quick introduction to these functions and how they work, check out this page.

Each one takes a user-defined helper function. For example, take the following (contrived) pipeline, which sums up the square roots of all inputs that are multiples of three:

from chevrons.pipeline_hof import Map, Filter, Fold
import math

def is_multiple_of_three(x):
    return x % 3 == 0

def add(x1, x2):
    return x1 + x2

data = range(0, 10)

# Keep multiples of three, take their square roots, then sum them.
sum_sqrt_threes = data | Filter(is_multiple_of_three) >> Map(math.sqrt) >> Fold(add)
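
For data = range(0, 10), the inputs that survive the filter are 0, 3, 6, and 9, so the result works out to √0 + √3 + √6 + √9 ≈ 7.18.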

Parallelism

Note: Parallel processing is still in beta; while these classes are stable enough for most use cases, please report any bugs that you find.

One advantage of structuring computation with higher-order functions is that the result is easy to parallelize effectively, and chevrons makes pipeline elements parallel with minimal changes. The pipeline_parallel module defines pipeline pieces for running work in parallel. Note that any run of consecutive parallel elements should begin with a BeginParallel block and end with either an EndParallel or a FoldParallel block. BeginParallel takes a batch size, which is used throughout the parallel section of the pipeline. The example above can be rewritten to use parallelization very easily:

import math
from chevrons.pipeline_parallel import MapParallel, FilterParallel, FoldParallel, BeginParallel, EndParallel

# is_multiple_of_three and add are the helper functions defined above.
data = range(0, 10)

# Process the stream in parallel, in batches of 5 elements.
data | BeginParallel(5) >> FilterParallel(is_multiple_of_three) >> MapParallel(math.sqrt) >> FoldParallel(add)

For pipelines whose individual functions are computationally intensive, this can yield significant speedups when a suitable batch size is chosen.
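
As a rough intuition for why the batch size matters, here is a generic sketch using only the standard library (plain multiprocessing, not chevrons internals): every batch handed to a worker process carries fixed inter-process overhead, so cheap per-element functions need larger batches to amortize it.

# Generic illustration with the standard library, NOT chevrons internals:
# a larger chunksize amortizes the fixed per-task overhead of sending work
# to another process.
import math
from multiprocessing import Pool

if __name__ == "__main__":
    with Pool() as pool:
        # One element per task: overhead dominates for a cheap function.
        slow = pool.map(math.sqrt, range(100000), chunksize=1)
        # 1000 elements per task: far fewer round-trips between processes.
        fast = pool.map(math.sqrt, range(100000), chunksize=1000)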

Implementing Your Own Pipeline Pieces
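
As noted above, you write a custom piece by subclassing one of the classes in pipeline_base. The exact base-class interface is not reproduced here; the following is a self-contained sketch of the underlying mechanism (operator overloading on >> and |), not chevrons' actual API:

# A self-contained sketch of the mechanism, NOT chevrons' actual base class:
# pieces overload __rshift__ (>>) for composition and __ror__ (|) so that an
# iterable on the left-hand side feeds the pipeline lazily.
class PipelinePiece:
    def run(self, iterable):
        raise NotImplementedError

    def __rshift__(self, other):
        return _Composed(self, other)

    def __ror__(self, iterable):
        return self.run(iterable)

class _Composed(PipelinePiece):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def run(self, iterable):
        return self.second.run(self.first.run(iterable))

class Double(PipelinePiece):
    def run(self, iterable):
        return (2 * x for x in iterable)  # a lazy generator

print(list(range(4) | Double() >> Double()))  # [0, 4, 8, 12]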

Example: Training a scikit-learn Model on Synthetic Data

import random
from chevrons.pipeline_base import Zip
from chevrons.pipeline_hof import Map
from chevrons.pipeline_extra import TrainScikitModel
from sklearn.linear_model import LinearRegression

def add_noise(data_point):
    return random.gauss(data_point, 1)

def calculate_true_value(X):
    true_slope = 0.4
    true_intercept = 1
    return true_slope * X[0] + true_intercept

X = [[i] for i in range(1000)]
model = LinearRegression()

# Generate noisy targets from the true linear relationship...
generate_synthetic_output = Map(calculate_true_value) >> Map(add_noise)
y = X | generate_synthetic_output

# ...then pair features with targets and train the model in batches of 100.
(X, y) | Zip() >> TrainScikitModel(model, batch_size=100)
print(model.coef_)
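
Since the synthetic targets were generated with a true slope of 0.4, the printed coefficient should come out close to 0.4.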

Contact

I'd love to hear any feedback you have, or ideas for features you think would be interesting! Send me an email at louiscialdella@gmail.com if you'd like to let me know of your happiness/displeasure/exuberance/vexation.
