shvar/redfs

What is RedFS?

RedFS is an open-source, P2P-oriented, scalable distributed file synchronization and backup system: roughly a cross between Dropbox and BitTorrent.

You can install this system on multiple computers and create your own tiny personal cloud, with the storage distributed among all of the peer clients (running Linux, OS X, Windows, or possibly any other OS for which you can find all the required dependencies). Such a cloud can span multiple computers connected via the public Internet. Moreover, the clients do not need to trust each other: even while providing storage space to the system, a client cannot access the data of other clients, because that data is encrypted.

Everybody is invited to hack on RedFS, to help ensure its security and stability, and to improve it further.

License

RedFS is multi-licensed under GPLv3 (with the “or any later version” clause) and under a proprietary closed-source license.

  1. You can use the RedFS code under the GPLv3 (or any later version) license conditions.
  2. If you want to use the RedFS code under any other licensing conditions, please contact us for proprietary closed-source licensing details.
  3. When you contribute to the RedFS project in any form (for example, but not limited to, by explicitly sending us patches or GitHub pull requests), you accept that your contribution will be multi-licensed under these conditions.
  4. The RedFS team reserves the right to change these multi-licensing conditions at any time.

Features

RedFS supports (or is ready to support) the following features:

  • Peer-to-peer storage on untrusted client computers;
  • Automatic replication of data between the client storage, to maintain the desired level of data redundancy;
  • Multi-level encryption to protect the data of the clients:
      • cloud-level (implemented), to prevent casual data snooping;
      • usergroup-level (implemented), to protect the data from unauthorized access;
      • end-user (planned), as an ultimate measure to prevent any man-in-the-middle access to the data, including access by the cloud maintainer/administrator;
  • Data deduplication;
  • Continuous data protection (in the watched directory);
  • Data synchronization between the computers of the same user.
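
The automatic replication feature can be pictured as a periodic check of how many online hosts hold each piece of data. The sketch below is only an illustration under stated assumptions, not RedFS code: the function name `replication_deficits`, the chunk and host identifiers, and the target of 3 replicas are all hypothetical.

```python
from typing import Dict, Set


def replication_deficits(
    chunk_locations: Dict[str, Set[str]],
    online_hosts: Set[str],
    desired_replicas: int,
) -> Dict[str, int]:
    """Return how many extra replicas each under-replicated chunk needs.

    Only replicas held by currently online hosts are counted.
    All names here are hypothetical and chosen for illustration.
    """
    deficits = {}
    for chunk_id, hosts in chunk_locations.items():
        live = len(hosts & online_hosts)
        if live < desired_replicas:
            deficits[chunk_id] = desired_replicas - live
    return deficits


# Hypothetical state: which hosts hold which chunk.
locations = {
    "chunk-a": {"host-1", "host-2", "host-3"},
    "chunk-b": {"host-1"},
    "chunk-c": {"host-2", "host-4"},
}
deficits = replication_deficits(locations, {"host-1", "host-2", "host-3"}, 3)
print(deficits)  # {'chunk-b': 2, 'chunk-c': 2}
```

Here `chunk-c` counts only one live replica because `host-4` is offline, so it needs two more copies to reach the desired level.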

Requirements

RedFS was originally written for Linux and is most thoroughly tested there (especially on Debian/Ubuntu); the RedFS clients are also intended (and expected) to run under OS X and Windows. It also requires a number of external packages to be installed; see Dependencies for details.

Overview

Suppose an administrator is planning to launch a RedFS-based cloud. What concepts and ideas should they know and understand in order to plan such a cloud?

Cloud zones

From the maintenance point of view, the whole RedFS cloud is divided into two zones:

  • Trusted zone: the cloud components hosted within the reach of the administrator, on controlled premises (such as a controlled data center). It is assumed that the communication between these cloud components is secured and cannot be intercepted, and that reasonable security measures prevent unauthorized parties from gaining access to any component in the zone.
  • Untrusted zone: the cloud components hosted outside the Trusted zone, such as on the personal computers of third-party cloud users.

Cloud components

Cloud components in Trusted zone

RelDB (relational database)

The cloud needs a relational database to function, in particular to enforce data deduplication, to store the structure of the user content held in the cloud, and to version that content.

Ops note: this requires a PostgreSQL deployment. The amount of data stored in the database is not large; for example, one cloud running RedFS-based storage in production and storing about 128 GiB of real user data used a PostgreSQL database of about 60 MiB. On the other hand, some (rarely executed but still crucial) queries may be quite complex, so a powerful high-CPU server is a good choice for the database.

Docstore (non-relational database)

For data whose usage profile is less suitable for relational databases, the cloud requires a non-relational/NoSQL database. Currently, it uses MongoDB for all these purposes. Dev note: the code subsystem that provides access to the non-relational database is internally known as Docstore, meaning that it stores “documents” rather than “relations”.

Data is stored in the non-relational database/Docstore in two significantly different scenarios: either the system stores small volumes of frequently updated and frequently read data, or it stores sizable data chunks on a kind of “distributed Big-Data filesystem” and seldom needs to write, update, or read them. Luckily, MongoDB handles both scenarios well, using general-purpose MongoDB collections for the first scenario and the GridFS overlay for the second. In RedFS terms, the first kind of database is called FastDB and the second is called BigDB, which makes it possible to implement backends other than MongoDB in the future.
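
To make the FastDB/BigDB split concrete, the following sketch models each as a small abstract interface with an in-memory stand-in. Only the names FastDB and BigDB come from this document; the method names (`put`, `get`, `write_blob`, `read_blob`) and the in-memory classes are assumptions made for illustration and are not the actual Docstore API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class FastDB(ABC):
    """Small, frequently read and updated documents (hypothetical interface)."""

    @abstractmethod
    def put(self, key: str, doc: dict) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[dict]: ...


class BigDB(ABC):
    """Large, rarely touched blobs (hypothetical interface)."""

    @abstractmethod
    def write_blob(self, name: str, data: bytes) -> None: ...

    @abstractmethod
    def read_blob(self, name: str) -> bytes: ...


class InMemoryFastDB(FastDB):
    """Dictionary-backed stand-in, useful for tests."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def put(self, key: str, doc: dict) -> None:
        self._docs[key] = dict(doc)  # keep a copy, not the caller's object

    def get(self, key: str) -> Optional[dict]:
        doc = self._docs.get(key)
        return dict(doc) if doc is not None else None


class InMemoryBigDB(BigDB):
    """Dictionary-backed stand-in for a blob store."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def write_blob(self, name: str, data: bytes) -> None:
        self._blobs[name] = bytes(data)

    def read_blob(self, name: str) -> bytes:
        return self._blobs[name]


fast = InMemoryFastDB()
fast.put("host-1", {"online": True})
big = InMemoryBigDB()
big.write_blob("chunk-a", b"\x00" * 1024)
print(fast.get("host-1"), len(big.read_blob("chunk-a")))
```

A MongoDB-backed pair of classes would implement the same two interfaces on top of ordinary collections and GridFS respectively, leaving calling code unchanged.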

Ops note: the usual suggestion for any MongoDB installation is to deploy it on high-memory servers; this, and/or placing the MongoDB storage on an SSD, is especially useful for FastDB. FastDB does not occupy much space in normal use, though it is heavier than RelDB; for example, the production cloud with 128 GiB of real user data used about 0.5 GiB of MongoDB-based FastDB. The BigDB deployment is out of the scope of this documentation.
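
As a back-of-the-envelope aid, the production figures quoted in the Ops notes (128 GiB of user data, about 60 MiB of RelDB, about 0.5 GiB of FastDB) can be scaled to another cloud size. The sketch assumes that both databases grow linearly with the amount of user data, which this document does not claim; treat the result as a rough order of magnitude only. The function name is hypothetical.

```python
# Figures quoted in this document for one production cloud.
USER_DATA_GIB = 128
RELDB_MIB = 60
FASTDB_MIB = 0.5 * 1024  # 0.5 GiB expressed in MiB


def estimate_metadata_mib(user_data_gib: float) -> dict:
    """Linearly extrapolate RelDB and FastDB sizes (linearity is an assumption)."""
    scale = user_data_gib / USER_DATA_GIB
    return {
        "reldb_mib": RELDB_MIB * scale,
        "fastdb_mib": FASTDB_MIB * scale,
    }


estimate = estimate_metadata_mib(1024)  # a hypothetical cloud with 1 TiB of user data
print(estimate)  # {'reldb_mib': 480.0, 'fastdb_mib': 4096.0}
```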

Node

Node is the central component of the whole cloud. Technically, it is a process running on a (most likely dedicated) server with access to multiple databases: RelDB, FastDB and BigDB. It keeps track of all the data stored in the distributed cloud, as well as of all the clients that are currently connected and providing storage space.

Cloud components in Untrusted zone

Host

Host is the name of the client-side process that connects to the RedFS cloud (specifically, to the Node) and is capable both of providing some space to the cloud and of synchronizing/backing up the data in some directory (to ensure continuous data protection). Each user may have multiple host processes running simultaneously (say, on different computers), all authenticated under the same username.

While running, the host stores some transient information about the backup/synchronization process, as well as about the data that is being or has been backed up. It uses an SQLite database for this purpose. Dev note: the data schema on the host is very similar to the one in the Trusted zone, so most of the DB-related code is shared between them, thanks to the SQLAlchemy ORM.

While synchronizing content from the cloud, and in order to hold the data of other peers, the host also stores so-called chunks of data in a predefined directory.
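
One simple way to picture a chunk directory is a content-addressed store, where each chunk is written to a file named after its hash, so that identical chunks are kept only once. The sketch below is an illustration under stated assumptions, not the RedFS on-disk format: the 1 MiB chunk size, the SHA-256 naming, the flat directory layout, and the function names are all hypothetical, and encryption is omitted entirely.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List

CHUNK_SIZE = 1024 * 1024  # 1 MiB, an arbitrary illustrative value


def store_chunks(data: bytes, chunk_dir: Path, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Split data into chunks and write each under its SHA-256 hex digest.

    Returns the ordered list of digests needed to reassemble the data.
    A chunk whose file already exists is not written again.
    """
    chunk_dir.mkdir(parents=True, exist_ok=True)
    digests = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        path = chunk_dir / digest
        if not path.exists():
            path.write_bytes(chunk)
        digests.append(digest)
    return digests


def load_chunks(digests: List[str], chunk_dir: Path) -> bytes:
    """Reassemble the original data from an ordered list of digests."""
    return b"".join((chunk_dir / d).read_bytes() for d in digests)


with tempfile.TemporaryDirectory() as tmp:
    chunk_dir = Path(tmp) / "chunks"
    payload = b"A" * 8 + b"B" * 8 + b"A" * 8
    digests = store_chunks(payload, chunk_dir, chunk_size=8)
    stored_files = len(list(chunk_dir.iterdir()))
    restored = load_chunks(digests, chunk_dir)

print(len(digests), stored_files, restored == payload)  # 3 2 True
```

The payload splits into three chunks, but only two files are written because the first and third chunks are identical.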

History

The RedFS project took more than 10 man-years to implement, with multiple strategy pivots along the way. In particular, it was rebranded several times in the process. You will probably spot these intermediate brand names throughout the code; do not let them confuse you.

The very first (internal) project code name was Calathi (from the in-team Latin motto Ovbus calathi, meaning “(multiple) Buckets for the eggs”), emphasizing the idea of not storing sensitive data at a single point of failure, but distributing and replicating it between the storage hosts. Dev note: consider “Calathi” an internal project codename, as “Chicago” or “Memphis” were for Windows.

Another pivotal point involved renaming the storage technology/framework behind the system to FreeBrie («free» as in beer, «brie» as in cheese), emphasizing the opportunity to join the cloud and back up an unlimited amount of data absolutely for free… but with the tongue-in-cheek notice that you must provide a matching amount of storage on your own computer as well. Dev note: consider “FreeBrie” the historical name of the technology.

This technology was finally open-sourced under the RedFS brand. Dev note: consider “RedFS” the name of the open-source project.
