3 APR, 2024

Network coding for distributed automated system: adar

I'm sure you all have data to store. This is a tale of how I tried to bring some alternative thinking to how consumers store their data amongst the devices they own.

This wasn't just for fun: I tried to do some real good and provide genuine solutions, as this body of work was completed as part of my university degree. It was submitted in 2023 as my honours dissertation, for which I received the highest grade.

What is the problem?

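There are two perspectives to consider this from: running out of space on a device ("Oh no, I'm out of space again!"), and the chore of organising data across the devices you own ("Ugh, I hate organising data between my own devices").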

This project, adar, seeks to solve both perspectives with one solution.

What is distributed storage?

Typically, data is stored on a single device, such as your phone or laptop. Simple and easy. But it means it can only be accessed on that device, and losing that device means you've lost your data. That's no good! A very popular solution to this is to store nothing on your own device, and instead store it on...

The cloud

Data stored on the cloud is really just stored on someone else's device, and they set up access for you to view it from anywhere with an internet connection. Usually you have to pay for the right to store your data on their device, but it allows you to use the same access (account) on all of your individual devices to see the same data. And losing your device doesn't affect your data. But losing your account does!

There are often mechanisms to keep some of the data on your local device for faster access, or if you're without internet for a bit. These mechanisms need to include some synchronisation service, too. Consumers would be most familiar with this as Microsoft's OneDrive service, which is integrated into the Windows File Explorer.

Windows File Explorer sidebar, showing the integration of OneDrive amongst local folders

Non-cloud

In terms of consumer products for distributed storage that aren't just "make it someone else's problem" (i.e. the cloud) there just... aren't any. What about a Network Attached Storage (NAS) device? They're good dedicated storage devices, but they're hardly a distributed solution.

Actual distributed storage is: segregating data seamlessly between multiple storage nodes (devices) in a system that allows for uniform access.

Why am I doing this?

As alluded to in past posts, I have been studying Computer Science at university. As part of my degree, I must complete a year-long project as a dissertation to earn my honours. I have personally struggled with managing data amongst the various devices that I own, and I am tired and frustrated by the antiquity of it all. I brought these concerns to my supervisor, saying that I wanted to do something to help clear up this area of computing. They suggested that I investigate network coding within distributed storage.

After conducting a review of the literature, I determined that there are three key tenets that would lead to an effective solution: network coding, automation, and distribution. No other system implemented all three in a unique protocol, so I was determined to do so myself.

But why these three? Well, current distributed storage solutions are custom-built and tweaked for cloud providers, and far too complex for consumer use. Thus, my system must be as simple as possible. However, the system cannot be without some form of technical complexity, so these aspects must be automated away from the user, as technology should always be easy to use. Finally, to have some actual use, it must be able to effectively distribute the data across peers, but with a very simple system. The simplest manner of doing so requires the use of network coding, which sounds complicated but actually solves a lot of problems.

How does this actually solve the problems?

Let's refer back to our original two user perspectives:

Oh no, I'm out of space again!

Out of space... but probably only on one of your devices. Your other devices are likely not full, and would be well able to store some extra stuff. For example, most people these days would have a phone and a laptop. Typically, their phone is full (or near enough), whilst their laptop is heavily underutilised. Or they may have a laptop that is full, and a desktop computer that is effectively unused. The typical usage pattern is that the user's primary device, the one that is most portable and convenient to them, is full of everything they want to have access to and store. This leaves their secondary devices underutilised, when the storage load could be shared amongst all devices. It just makes sense, right?

Distributed storage using network coding means that each device stores a portion of the total data. For example, with 100 gigabytes of data amongst two devices, each one could store 50 gigabytes, and together they hold the full 100. Managing this manually would suck, so having it automated is key. A proper protocol is also necessary so that each device can still seamlessly access the full 100 gigabytes, and network coding is the most efficient way of doing all of this.
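To make the 100 gigabyte example a bit more concrete, here is a minimal sketch of the "each device stores a portion" idea, scaled down to 100 bytes. It uses plain splitting with no coding at all, and the function names (`split_between_peers`, `reassemble`) are made up for illustration; this is not adar's actual protocol.

```python
def split_between_peers(data: bytes, peer_count: int) -> list[bytes]:
    """Cut data into peer_count contiguous shares of near-equal size."""
    share_size = -(-len(data) // peer_count)  # ceiling division
    return [data[i * share_size:(i + 1) * share_size] for i in range(peer_count)]


def reassemble(shares: list[bytes]) -> bytes:
    """Rebuild the original data once every share has been fetched."""
    return b"".join(shares)


data = bytes(range(100))               # stand-in for 100 GB of user data
shares = split_between_peers(data, 2)  # one share per device
print([len(s) for s in shares])        # [50, 50]
assert reassemble(shares) == data
```

The catch with plain splitting is that every share is needed to get the data back, which is part of why a smarter encoding is worth the effort.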

Ugh, I hate organising data between my own devices

Even though devices are very connected, and we have accounts for every little thing on the cloud to sync between devices, storage has not kept up. I don't know why we've been ignoring this whole sector of computing. Regardless, each device acts as a unique bastion of data storage, which is pretty obtuse when they're owned and used by the same physical person. As such, it would be great if all devices that you own could just work together so that you can access the same data on all of them seamlessly and natively as per each platform.

As for the other perspective, manually splitting up and managing data storage across devices is a tedious and painful task. Hence, an automated protocol would be pretty neat. Furthermore, it wouldn't make sense to just fully replicate all data on each device, as that means that every device needs enough storage space to store everything. So, it would be best to split up the data across devices, with the automated system keeping it all working in sync and easily accessible.
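The difference between replicating and splitting is easy to put into numbers. The sketch below assumes an even split and ignores any overhead from encoding or redundancy; `per_device_gb` is a hypothetical helper, not something from adar.

```python
def per_device_gb(total_gb: float, devices: int, replicate: bool) -> float:
    """Storage each device needs: everything if replicated, an even share if split."""
    if devices < 1:
        raise ValueError("need at least one device")
    return total_gb if replicate else total_gb / devices


for devices in (2, 3, 4):
    replicated = per_device_gb(100, devices, replicate=True)
    split = per_device_gb(100, devices, replicate=False)
    print(f"{devices} devices: {replicated:.0f} GB each replicated, {split:.1f} GB each split")
```

With full replication the requirement never drops below the full 100 GB per device, no matter how many devices you add.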

A giant overview

Hopefully the motivation behind this is starting to make a bit of sense now. Let's jump into a giant overview description of how this all works...

Each device is a storage node, and is known as a peer.

Each peer stores a proportion of the total data.

Once peers are paired, they will always connect and synchronise.

Data is stored in an encoded form using network coding.

All user interaction and usage is native to the peer's platform.
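For a feel of what "stored in an encoded form" can mean, here is a toy sketch with three peers. The data is cut into two halves, A and B, and the third peer stores A XOR B, a linear combination over GF(2). This is a textbook illustration chosen for simplicity, not adar's actual encoding, and all names (`encode`, `decode`, the fragment keys) are assumptions.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("fragments must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes) -> dict[str, bytes]:
    """Produce three fragments, one per peer: A, B, and A XOR B."""
    if len(data) % 2:
        raise ValueError("this sketch assumes an even number of bytes")
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return {"a": a, "b": b, "a_xor_b": xor_bytes(a, b)}


def decode(fragments: dict[str, bytes]) -> bytes:
    """Recover the original data from any two of the three fragments."""
    if "a" in fragments and "b" in fragments:
        a, b = fragments["a"], fragments["b"]
    elif "a" in fragments and "a_xor_b" in fragments:
        a = fragments["a"]
        b = xor_bytes(a, fragments["a_xor_b"])  # A ^ (A ^ B) == B
    elif "b" in fragments and "a_xor_b" in fragments:
        b = fragments["b"]
        a = xor_bytes(b, fragments["a_xor_b"])  # B ^ (A ^ B) == A
    else:
        raise ValueError("need at least two fragments")
    return a + b


original = b"holiday-photos!!"
stored = encode(original)

# Simulate each peer being offline in turn; the other two still suffice.
for missing in stored:
    available = {name: frag for name, frag in stored.items() if name != missing}
    assert decode(available) == original
```

In this sketch each peer holds only half the size of the original data, yet any one peer can be offline without losing access.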

How is this actually accomplished?

Well, I can't really explain it all in one go, so I'm breaking it down into a few different articles: