
Machine Learning Input Data Processing as a Service

Cachew is a multi-tenant service for efficient input data processing in machine learning jobs.

To minimize end-to-end training time and cost, Cachew jointly optimizes:

  1. elastic, distributed resource allocation for input data processing and
  2. input data caching and materialization of preprocessed data within and across jobs.

Cachew builds on the tf.data data loading framework in TensorFlow, extending the tf.data service with autoscaling and autocaching policies.
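
For reference, offloading an input pipeline to the tf.data service uses the standard tf.data.experimental.service.distribute transformation, which Cachew's service mode builds on. The sketch below is illustrative only: the dispatcher address and the parse_and_augment function are placeholders, not part of this repository.

    import tensorflow as tf

    # Hypothetical dispatcher address; a real deployment points this at a
    # running Cachew dispatcher.
    DISPATCHER = "grpc://cachew-dispatcher:38571"

    def parse_and_augment(serialized):
        # Placeholder preprocessing: decode a serialized example, then augment.
        features = tf.io.parse_single_example(
            serialized, {"image": tf.io.FixedLenFeature([], tf.string)})
        image = tf.io.decode_jpeg(features["image"], channels=3)
        image = tf.image.resize(image, [224, 224])
        return tf.image.random_flip_left_right(image)

    # An ordinary tf.data input pipeline: list files, read records, preprocess.
    files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")
    dataset = files.interleave(tf.data.TFRecordDataset,
                               num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.map(parse_and_augment,
                          num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(128)

    # Offload the preprocessing to the service: the map/batch work above now
    # runs on the service's input data workers instead of the training client.
    dataset = dataset.apply(tf.data.experimental.service.distribute(
        processing_mode="distributed_epoch", service=DISPATCHER))
    dataset = dataset.prefetch(tf.data.AUTOTUNE)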

This repository is a fork of TensorFlow with the source code for Cachew.

Cachew System Architecture

Cachew consists of a centralized dispatcher, a dynamic number of input data workers, and a disaggregated storage cluster for data caching.

(Figure: Cachew system architecture)

Users register training nodes (i.e., clients) with the Cachew dispatcher. To execute an input pipeline with Cachew, clients provide a graph representation of the pipeline and a path to the input dataset in a cloud storage bucket. Cachew supports and extends the tf.data API for defining input pipelines from a collection of composable, user-parametrizable operators. Users can annotate their tf.data input pipeline to mark candidate locations for caching and reusing data across executions; Cachew then automatically applies caching at the throughput-optimal candidate location.
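
As a sketch of what these annotations look like, the pipeline below marks two candidate caching locations. The autocache() op follows the description in the Cachew paper; treat its exact name and placement here as illustrative rather than as the definitive API of this fork.

    import tensorflow as tf

    def decode_image(serialized):
        # Placeholder decode step for illustration.
        feats = tf.io.parse_single_example(
            serialized, {"image": tf.io.FixedLenFeature([], tf.string)})
        image = tf.io.decode_jpeg(feats["image"], channels=3)
        return tf.image.resize(image, [224, 224])

    files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")
    dataset = files.interleave(tf.data.TFRecordDataset,
                               num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.map(decode_image, num_parallel_calls=tf.data.AUTOTUNE)

    # Candidate 1: cache decoded images; the random augmentations downstream
    # stay fresh across epochs.
    dataset = dataset.autocache()  # illustrative Cachew annotation op

    dataset = dataset.map(tf.image.random_flip_left_right,
                          num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(128)

    # Candidate 2: cache fully preprocessed batches, trading augmentation
    # freshness for less per-epoch work. Cachew profiles the pipeline and
    # applies caching at whichever candidate maximizes throughput.
    dataset = dataset.autocache()  # illustrative Cachew annotation op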

Cachew's input data workers are stateless components responsible for producing batches of preprocessed data for clients. The dispatcher dynamically adjusts the number of input data workers for each job to minimize epoch time while keeping costs low. The dispatcher also profiles and maintains metadata about input pipeline executions across jobs to make data caching decisions. Cachew stores cached datasets in a GlusterFS remote storage cluster.
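
The scaling policy itself lives in the dispatcher; the sketch below is only an illustrative reconstruction of the decision logic (grow while an extra worker still buys meaningful throughput, shrink when it does not), and every name in it is hypothetical.

    SCALE_UP_GAIN = 1.05  # hypothetical threshold: require a 5% throughput gain

    def autoscale_step(job):
        # Illustrative sketch, not Cachew's actual implementation.
        # job.throughput_history holds batches/sec measured after each change
        # in worker count, as reported through heartbeats.
        before, after = job.throughput_history[-2], job.throughput_history[-1]
        if after / before >= SCALE_UP_GAIN:
            job.add_worker()      # still scaling well: keep growing
        elif after < before:
            job.remove_worker()   # the last worker did not pay off: shrink
        # otherwise hold steady near the knee of the scaling curve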

Clients fetch data from the workers that are assigned to them by the dispatcher. Clients and workers periodically send heartbeats to the dispatcher to maintain membership in the service and provide metrics used for the autoscaling and autocaching policies.
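
A minimal sketch of the client side of this heartbeat protocol, assuming hypothetical RPC stubs for the dispatcher:

    import time

    HEARTBEAT_INTERVAL_S = 5  # illustrative; the real interval is a service setting

    def client_heartbeat_loop(dispatcher, job_id, metrics, on_assignment):
        # Illustrative sketch: each heartbeat keeps the client registered with
        # the dispatcher, reports consumption metrics that feed the autoscaling
        # and autocaching policies, and refreshes the set of assigned workers
        # from which the client fetches batches.
        while True:
            response = dispatcher.heartbeat(job_id=job_id,
                                            metrics=metrics.snapshot())
            on_assignment(response.assigned_workers)
            time.sleep(HEARTBEAT_INTERVAL_S)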

Deploying and Using Cachew

The cachew_experiments repository provides scripts and instructions to get started with a Cachew deployment and execute example ML input data pipelines. The repository also provides detailed instructions for reproducing the key results from the Cachew research paper published at USENIX ATC'22.

Contributing

We welcome contributions and PRs to Cachew.

Referencing our work

Cachew appeared at USENIX ATC'22. If you use Cachew in your work, please cite our paper:

@inproceedings{cachew,
  author    = {Dan Graur and
               Damien Aymon and
               Dan Kluser and
               Tanguy Albrici and
               Chandramohan A. Thekkath and
               Ana Klimovic},
  title     = {Cachew: Machine Learning Input Data Processing as a Service},
  booktitle = {Proceedings of the USENIX Annual Technical Conference (ATC'22)},
  publisher = {{USENIX}},
  year      = {2022},
}
