Prototype for content-addressed self-deduplicating storage using Ceph. Similar to Plan 9 Venti, but on top of RADOS.
The core idea is to content-address Ceph objects. Usually, Ceph object names are chosen arbitrarily by a client application. In content-addressed storage (CAS), the object name is instead determined by fingerprinting the content.
Therefore, writing the same data results in the same fingerprint. The same fingerprint means that we access the same object, since we use the fingerprint as its name.
To keep track of how many clients refer to a single object, each object stores a reference counter that is incremented every time a client writes the same data.
The reference count is also used to determine when an object is not used anymore.
With this approach in Ceph we get storage pools that deduplicate data.
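The naming and reference-counting rule described above can be sketched with a toy in-memory store. This is only an illustration, not the Ceph object class: the class name `RefCountedCAS`, the `refcount` helper, and the choice of SHA-256 as the fingerprint function are assumptions made for the example.

```python
import hashlib


class RefCountedCAS:
    """Toy in-memory stand-in for a CAS pool (hypothetical, not the object class)."""

    def __init__(self):
        self._objects = {}  # fingerprint -> [data, refcount]

    def put(self, data):
        # The object name is the fingerprint of the content (SHA-256 assumed here).
        fingerprint = hashlib.sha256(data).hexdigest()
        entry = self._objects.get(fingerprint)
        if entry is None:
            self._objects[fingerprint] = [data, 1]
        else:
            entry[1] += 1  # same data written again: only the counter changes
        return fingerprint

    def get(self, fingerprint):
        return self._objects[fingerprint][0]

    def up(self, fingerprint):
        self._objects[fingerprint][1] += 1

    def down(self, fingerprint):
        entry = self._objects[fingerprint]
        entry[1] -= 1
        if entry[1] == 0:
            # No client refers to the object anymore, so it can be dropped.
            del self._objects[fingerprint]

    def refcount(self, fingerprint):
        entry = self._objects.get(fingerprint)
        return entry[1] if entry else 0


store = RefCountedCAS()
fp_a = store.put(b"same bytes")
fp_b = store.put(b"same bytes")  # deduplicated: same name, refcount is now 2
```

Writing the same bytes twice yields one stored object with a reference count of two; two `down` calls remove it again.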
The drawback of this approach is that extra metadata is necessary to map client file names to CAS objects. This metadata is stored in index objects, which use a filename as the object name, just like regular Ceph objects. Each index object contains an object map that maps version numbers to references to recipe objects.
Recipe objects are stored like CAS objects, so that storing the same file twice will also deduplicate the recipe object.
A recipe object stores an extent-like list that maps file regions to fingerprints of CAS objects.
Situation: A client stores a backup tarball of several gigabytes using Veintidos.
First Veintidos uses a chunker to split the tarball. The chunker generates a list of chunk extents: (offset, size, data chunk).
The data chunk is written as a CAS object and the chunk extent is changed to (offset, size, fingerprint).
All of these tuples are recorded and, after all CAS writes finish, converted into a recipe.
This recipe is then written as another CAS object, and its fingerprint, together with a version number, is stored in an index object that takes the name of the original tarball.
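The write path just described can be sketched end to end with plain dictionaries standing in for the CAS and INDEX namespaces. This is a simplified sketch under several assumptions: the helper names, the tiny `CHUNK_SIZE`, SHA-256 fingerprints, and JSON as the recipe encoding are all chosen for illustration, and reference counting is left out.

```python
import hashlib
import json
import time

CHUNK_SIZE = 4  # hypothetical, tiny on purpose so the example stays readable


def static_chunks(data, size=CHUNK_SIZE):
    """Yield chunk extents (offset, size, data chunk) of fixed size."""
    for offset in range(0, len(data), size):
        chunk = data[offset:offset + size]
        yield offset, len(chunk), chunk


def cas_put(cas, data):
    """Store data under its fingerprint; identical data maps to the same key."""
    fingerprint = hashlib.sha256(data).hexdigest()
    cas.setdefault(fingerprint, data)
    return fingerprint


def write_full(cas, index, name, data, version=None):
    # 1. Write every data chunk and turn (offset, size, chunk)
    #    into (offset, size, fingerprint).
    extents = [(off, size, cas_put(cas, chunk))
               for off, size, chunk in static_chunks(data)]
    # 2. Only after all chunk writes finished, encode and store the recipe.
    recipe = json.dumps(extents).encode("utf-8")
    recipe_fingerprint = cas_put(cas, recipe)
    # 3. Record version -> recipe fingerprint under the user-chosen name.
    if version is None:
        version = int(time.time())
    index.setdefault(name, {})[version] = recipe_fingerprint
    return version


cas, index = {}, {}
write_full(cas, index, "backup", b"aaaabbbbaaaa", version=1)
write_full(cas, index, "backup", b"aaaabbbbaaaa", version=2)
```

In this run the repeated `aaaa` chunk is stored once, and the second write adds only a new index entry because both the chunks and the recipe already exist.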
The system consists of a Ceph Object Class (github) that implements reference counting and metadata storage for CAS objects, and a client library that adds chunking, recipes, and compression.
Source code documentation is available on Github Pages: https://irq0.github.io/veintidos/
From bottom-up the Veintidos stack looks like this:
- Ceph Object Class - New CAS RADOS Operations (Run on OSD)
- RADOS, librados, Python bindings
- cas.py - Python interface to CAS object class
- compressor.py - Compression for CAS class
- chunk.py - Chunker abstraction (write_full, read_full) and static chunker
- fingerprint.py - Fingerprinting functions
- recipe.py - Read / write recipe objects
- veintidos.py - CLI
- cas.put(data) -> fingerprint (refcount++)
- cas.get(fingerprint) -> data
- cas.up(fingerprint) -> (refcount++)
- cas.down(fingerprint) -> (refcount--)
- cas.list() -> [[fingerprint, refcount], …]
- cas.info() -> metadata
- chunker.write_full(name, data) -> version (cas.put(data), cas.put(recipe))
- chunker.read_full(name) -> data (cas.get(recipe), cas.get(data))
- chunker.read(name, off, size) -> data
- chunker.versions(name) -> [HEAD, …]
- chunker.remove_version(name, version)
Veintidos uses a single Ceph pool, but two distinct namespaces to avoid name collisions between content-defined names and user-defined names.
- The CAS namespace for data chunks and recipe objects. Names are content defined
- INDEX namespace for index objects that map filenames to recipe objects. Names are user defined
- Store arbitrary data
- Created using cas.put and named after the fingerprint of their content
- Reference counted by the CAS Ceph object class
- Store additional metadata for the compression and fingerprinting algorithm
- Store an encoded list of extents
- Each extent has the form (offset, length, fingerprint)
- Stored as CAS objects
- Written after all chunks are successfully written
- Do not store the name of the original file
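Because a recipe is a list of (offset, length, fingerprint) extents, a partial read only needs the extents that overlap the requested range. The following is a minimal sketch of that lookup. The function signatures are assumptions (only the name `get_extents_in_range` appears in this document), `fetch` is a hypothetical callback that returns the data for a fingerprint, and the extents are assumed to be sorted by offset and non-overlapping.

```python
def get_extents_in_range(extents, offset, length):
    """Return the extents that overlap the byte range [offset, offset + length)."""
    end = offset + length
    return [e for e in extents if e[0] < end and e[0] + e[1] > offset]


def read_range(extents, fetch, offset, length):
    """Assemble the requested range from only the chunks that overlap it."""
    end = offset + length
    parts = []
    for ext_off, ext_len, fingerprint in get_extents_in_range(extents, offset, length):
        data = fetch(fingerprint)
        # Trim the chunk to the part that lies inside the requested range.
        start = max(offset, ext_off) - ext_off
        stop = min(end, ext_off + ext_len) - ext_off
        parts.append(data[start:stop])
    return b"".join(parts)


# Example recipe for the 10-byte file b"abcdefghij", split into three chunks.
chunks = {"fp0": b"abcd", "fp1": b"efgh", "fp2": b"ij"}
extents = [(0, 4, "fp0"), (4, 4, "fp1"), (8, 2, "fp2")]
result = read_range(extents, chunks.__getitem__, 2, 5)  # b"cdefg"
```

Only the first two chunks are fetched for this read; the third extent starts after the requested range and is skipped.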
- Associate recipe objects with a filename and version
- The version is a UNIX timestamp
- Store version -> fingerprint mapping as an object map
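The version map of an index object can be modelled as a small key/value dictionary. This is a sketch under assumptions: the helper names are hypothetical, keys are kept as strings to mimic a key/value object map, and HEAD is taken to mean the newest version.

```python
import time


def add_version(index_entry, recipe_fingerprint, now=None):
    """Record a new version, keyed by a UNIX timestamp, for one filename."""
    version = int(time.time() if now is None else now)
    index_entry[str(version)] = recipe_fingerprint
    return version


def head(index_entry):
    """Return the recipe fingerprint of the newest version (assumed to be HEAD)."""
    # Compare keys numerically: as strings, "999" would sort after "1000".
    latest = max(index_entry, key=int)
    return index_entry[latest]


entry = {}
add_version(entry, "fp-old", now=999)
add_version(entry, "fp-new", now=1000)
```

Comparing the keys as integers matters here, since timestamps stored as strings do not sort chronologically once their lengths differ.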
Since Veintidos is a prototype the implementation has some limitations:
- Client-side code is written in Python
- CAS PUT uses JSON
- Data is Base64 encoded; Metadata is taken as is by the object class
- A rewrite in C++ could leverage the Ceph encode/decoder, which is unavailable in Python
- CAS GET is implemented using regular RADOS operations
- Can’t use Cephx to limit operations to CAS object class
- CAS Chunker has no partial write support
- Can’t partially update files
- No structure sharing between recipes of the same file
- Only static chunking
- Add dynamic chunker to increase dedup ratio
- Recipes are implemented client side and not in an object class
- Possibly move get_extents_in_range to the OSD to speed up operations such as partial read/write
- Client crashes result in orphaned objects
- If a client crashes before writing the recipe and index objects, the data objects in the CAS pool end up unreferenced
- Fix: Add intent logging to client
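To make the JSON limitation listed above more concrete, the sketch below shows what wrapping binary chunk data in a JSON request implies. It is not the actual wire format: the field names `data` and `meta` and both helper functions are hypothetical and exist only for this illustration.

```python
import base64
import json


def encode_put_payload(data, metadata):
    """Wrap binary data and metadata in JSON (field names are hypothetical)."""
    return json.dumps({
        # JSON cannot carry raw bytes, so the data has to be Base64 encoded.
        "data": base64.b64encode(data).decode("ascii"),
        # Metadata is already JSON-compatible and is passed through as is.
        "meta": metadata,
    })


def decode_put_payload(payload):
    """Reverse encode_put_payload: returns (data, metadata)."""
    doc = json.loads(payload)
    return base64.b64decode(doc["data"]), doc["meta"]


payload = encode_put_payload(b"\x00\xff chunk", {"compression": "snappy"})
data, meta = decode_put_payload(payload)

# Base64 turns every 3 input bytes into 4 output characters.
encoded_size = len(base64.b64encode(b"\x00" * 300))  # 400
```

The roughly one third size increase from Base64 is the kind of overhead that a native binary encoding would avoid.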
The code contains both a library and a command line utility to write files to a CAS pool.
veintidos.py put "backup" <(tar cvf - /)
veintidos.py get "backup" root_backup.tar
Veintidos has two layers:
- CAS: Thin layer over the RADOS / CAS object class. Provides methods to put and get objects and to increment / decrement their reference counters.
- Chunker: Adds chunking and recipes on top of CAS.
- Ceph Cluster with CAS object class installed. Not part of mainline Ceph. Branch: github
- Python 2.7
- Python RADOS bindings with execute support
- msgpack
- python-snappy
- nose for the unittests