
IPTF

A distributed filesystem for TensorFlow models, data, and logs

IPTF brings the best parts of IPFS into the world of TensorFlow. If you're not familiar with IPFS, think of it as a cross between git and BitTorrent: a global peer-to-peer network of machines that provide instant, reliable access to massive datasets.

Why should I use IPTF?

IPTF is free, peer-to-peer, and requires zero configuration. Use IPTF in your machine learning workflow to handle enormous quantities of data without needing to set up, configure, or orchestrate a separate storage solution.

Fully integrated

By embedding an IPFS daemon directly into TensorFlow, IPTF provides seamless read/write access to the entire IPFS network. IPTF is fully integrated with tf.gfile and all of the io_ops, so iptf:// prefixed paths can be used directly in Python or with any core file operation in a TensorFlow graph.
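For example, a minimal sketch of both styles of access; the hash and file name here are hypothetical placeholders, and the embedded daemon must be able to resolve the path:

import tensorflow as tf

# Hypothetical content hash and file name, for illustration only.
path = "iptf://ipfs/QmExampleHashPlaceholder/data.txt"

# Direct use from Python via tf.gfile:
with tf.gfile.GFile(path, "rb") as f:
    data = f.read()

# Or inside a TensorFlow graph via the core io_ops:
contents = tf.read_file(path)
with tf.Session() as sess:
    print(sess.run(contents))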

Quicker, easier access to popular datasets

Instead of downloading each of the MNIST files and storing them somewhere on your machine, just use the path "iptf://ipfs/QmazxWBPrSfTzkuzQNvyzYWx438q98Q1pQ7vRJkQZh7x97", and TensorFlow will fetch and cache the data you need on the fly.

The QmazxWBPrSfTzkuzQNvyzYWx438q98Q1pQ7vRJkQZh7x97 part of the path is the fingerprint of the MNIST dataset. The only file in the world that has that fingerprint is the MNIST dataset. So not only is it enough information to request that file from the network, it's also enough information to check that you received a perfect, uncorrupted copy of MNIST. Similar to BitTorrent, IPTF downloads pieces of your files from many computers in parallel and still guarantees you're getting exactly the right sequence of bytes.
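That integrity guarantee falls out of content addressing: the fingerprint is a pure function of the bytes, so the receiver can recompute it and compare. A simplified sketch of the principle (real IPFS hashes a Merkle DAG of fixed-size blocks and encodes the result as a multihash, rather than taking one SHA-256 over the whole file):

import hashlib

def fingerprint(data):
    # One SHA-256 stands in for IPFS's block-level Merkle DAG hash.
    return hashlib.sha256(data).hexdigest()

published = b"...the MNIST bytes..."
downloaded = b"...the MNIST bytes..."

# Any corrupted or substituted byte changes the recomputed
# fingerprint, so a bad download is detected before it's used.
assert fingerprint(downloaded) == fingerprint(published)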

Better bandwidth usage

Addressing files and directories by their content helps optimize bandwidth usage as well. Blocks that a peer on your local network has already cached are fetched from that peer directly, saving your internet bandwidth for data that hasn't reached your network yet.

Models in IPTF

Datasets aren't the only large files in machine learning. Model weights get large as well. You can use IPTF to load a model like Inception in a single line of code.
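A sketch of what that can look like, assuming the model is published as a frozen GraphDef; the hash below is a hypothetical placeholder, not Inception's real fingerprint:

import tensorflow as tf

# Hypothetical fingerprint; substitute the real hash of the frozen graph.
graph_def = tf.GraphDef.FromString(
    tf.gfile.GFile("iptf://ipfs/QmInceptionPlaceholder/inception_v3.pb", "rb").read())
tf.import_graph_def(graph_def, name="inception")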

Reproducibility and IPTF

Reproducibility is a key aspect of machine learning. IPTF provides a simple way to reproduce any training result at any time. When IPTF loads data to feed to a model, it remembers the fingerprint of the data in a way that's easy to refer to later, similar to git’s reflog. This makes it easy to reuse any data you've used in the past to reproduce a result or compare a new model to an old one.
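IPTF keeps that history for you, but the idea is easy to illustrate by hand: record the content path of every input to a run, and the exact same bytes can be fetched again later. A hedged sketch (the checkpoint hash is a hypothetical placeholder; the data path is the MNIST fingerprint from above):

import json, time

run_record = {
    "timestamp": time.time(),
    "train_data": "iptf://ipfs/QmazxWBPrSfTzkuzQNvyzYWx438q98Q1pQ7vRJkQZh7x97",
    "checkpoint": "iptf://ipfs/QmCheckpointPlaceholder",  # hypothetical
}

# Each fingerprint resolves to exactly one file, so replaying the
# recorded paths reruns the experiment on byte-identical inputs.
with open("runs.jsonl", "a") as f:
    f.write(json.dumps(run_record) + "\n")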

Storage efficiency and IPTF

Using IPTF during your project’s exploration phase has a bonus benefit: storage efficiency. Whenever new data is stored in IPTF, only the blocks that differ from existing files will be stored.
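A toy sketch of why that works: files stored as content-addressed blocks share storage for every block they have in common (real IPFS chunks at 256 KiB by default and links the blocks into a Merkle DAG):

import hashlib

CHUNK = 256 * 1024
store = {}  # block hash -> block; each unique block is stored once

def put(data):
    # Split into fixed-size blocks; re-adding an existing block is a no-op.
    for i in range(0, len(data), CHUNK):
        block = data[i:i + CHUNK]
        store[hashlib.sha256(block).hexdigest()] = block

put(b"A" * CHUNK + b"B" * CHUNK)             # stores 2 blocks
put(b"A" * CHUNK + b"B" * CHUNK + b"C" * 8)  # adds only 1 new block
print(len(store))  # 3 unique blocks back both versions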

Developing IPTF

Running tests

Run Python smoke tests

$ bazel run //iptf/python:file_system_test

Run C++ tensorflow::FileSystem tests

$ bazel run //iptf/cpp:file_system_test

Testing other projects

To experiment with IPTF-enabled TensorFlow on an existing project while still building from source:

$ bazel run //iptf/python:iptf -- python foo.py bar

Pre-launch task list

  • Walkthrough of how to use IPTF in a real project
  • Example Jupyter notebook
  • Build and test pip package on macOS
  • Get build working on linux
  • Build and test pip package on linux
  • Run go fmt on the source
  • Clean up Python smoke test
  • Check that the Docker container can run IPTF-enabled TensorBoard
  • Write up real README
  • Write up build instructions
  • Push upstream SWIG patch(es)
  • Push upstream TensorFlow patch(es)
  • Push upstream go-ipfs patch(es)
  • iptf standalone command
    • iptf serve [...]
    • iptf tensorboard ...
    • iptf python ...
  • Performance testing
    • IPFS read/write performance
    • IPFS network performance
    • No memory leaks under load
