Using BitTorrent to On-board New Datatrusts

One of the major challenges facing a multi-datatrust system is how a new datatrust can join a data market and catch up with the datatrusts already in the system. In a previous post, I suggested that a datatrust can be modeled as a key-value store. A multi-datatrust system can then use the Tendermint consensus algorithm to reach consensus on the set of key-value transactions that have occurred so far. The transactions here would be simple updates, such as “listing with listinghash 0xacwv3 added data with hash 0x3453f”, that specify additions and deletions to the key-value store.
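To make the key-value model concrete, here is a minimal sketch of how a datatrust might rebuild its store by replaying the agreed-upon transaction log. The `Transaction` type, its field names, and the `replay` function are hypothetical names chosen for illustration; they are not part of Tendermint or of any datatrust specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One agreed-upon update to the key-value store (hypothetical shape)."""
    op: str              # "add" or "delete"
    listing_hash: str    # key: identifies the listing
    data_hash: str = ""  # value: hash of the listing's data (unused for deletes)


def replay(transactions):
    """Rebuild the key-value store by applying transactions in consensus order."""
    store = {}
    for tx in transactions:
        if tx.op == "add":
            store[tx.listing_hash] = tx.data_hash
        elif tx.op == "delete":
            # Deleting a listing that is not present is treated as a no-op.
            store.pop(tx.listing_hash, None)
        else:
            raise ValueError(f"unknown operation: {tx.op!r}")
    return store


log = [
    Transaction("add", "0xacwv3", "0x3453f"),
    Transaction("add", "0xbeef1", "0x77aa0"),
    Transaction("delete", "0xbeef1"),
]
store = replay(log)
```

Note that the store only holds hashes. Replaying the log tells a datatrust which listings exist, but not how to obtain their contents.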

The catch, of course, is that this protocol doesn’t specify how the actual data for each listing is sent to the new datatrust. It’s worth recalling that a listing might be a sizable chunk of data, on the order of gigabytes or more. The new datatrust has to download the set of all current listings, which in aggregate could reach into terabytes or petabytes of data. Complicating matters, there’s no incentive for any one datatrust already in the market to shoulder the burden of on-boarding a new datatrust into the system. So we can’t necessarily just open a connection to any pre-existing datatrust and directly stream files over. What we need is some way to spread the burden across the existing set of datatrusts.

It’s worth noting here that the problem of moving large data files across the internet has a long and storied history. In particular, peer-to-peer protocols like BitTorrent have long been used for transferring large files. As a very brief summary, BitTorrent chops up a large file into “pieces” which are transmitted across the network. As a new node joins, it starts downloading individual pieces. As it finishes downloading each piece, it starts serving that piece to other nodes in the network. Once all pieces are downloaded (not necessarily sequentially), the torrent client assembles the full file locally.
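The piece mechanism described above can be sketched in a few lines. This is a toy illustration rather than a BitTorrent implementation: the function names are hypothetical, there is no networking, and the only detail borrowed from the original BitTorrent protocol is that each piece is checked against a SHA-1 hash before it is accepted.

```python
import hashlib
import random


def split_into_pieces(data, piece_size):
    """Chop a file's bytes into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(pieces):
    """Per-piece SHA-1 digests, as a downloader would know them in advance."""
    return [hashlib.sha1(piece).digest() for piece in pieces]


def assemble(received, expected_hashes):
    """Verify every piece, then join the pieces in index order.

    `received` maps piece index -> bytes and may have been filled in any order.
    """
    for index, expected in enumerate(expected_hashes):
        if index not in received:
            raise KeyError(f"piece {index} has not been downloaded yet")
        if hashlib.sha1(received[index]).digest() != expected:
            raise ValueError(f"piece {index} failed its hash check")
    return b"".join(received[i] for i in range(len(expected_hashes)))


listing = bytes(range(256)) * 10           # stand-in for a listing's data
pieces = split_into_pieces(listing, 1000)
hashes = piece_hashes(pieces)

# Simulate pieces arriving out of order from different peers.
arrival_order = list(range(len(pieces)))
random.shuffle(arrival_order)
received = {index: pieces[index] for index in arrival_order}

rebuilt = assemble(received, hashes)
```

Because each piece is verified independently, a downloader can accept pieces from any peer in any order without trusting that peer.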

We can easily adapt this idea to the datatrust context. Each datatrust in the data market runs a BitTorrent client. Each listing is a “file” to be shared across the network. A new datatrust joining the data market uses the BitTorrent protocol to download existing listings. Torrents have been used to transmit very large files, so this protocol should be capable of handling the transfer of very large listings. As each new datatrust gets up to speed, it becomes a new node on the torrent network. BitTorrent networks become more robust as nodes are added, so the availability and download speed of listings should improve as more datatrusts join the network.
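As a rough sketch of the on-boarding step, a new datatrust could compare the store it has reconstructed from consensus against what it already holds locally, and fetch only the difference. The function name and the dictionary layout (listing hash mapped to data hash) are assumptions made for illustration; the actual download of each listing would be handed to a BitTorrent client, which is not shown here.

```python
def listings_to_fetch(consensus_store, local_store):
    """Return the listing hashes whose data is missing or stale locally.

    Both arguments map listing hash -> data hash. A listing needs fetching
    if it is absent locally or if the local data hash differs from the
    one agreed on by consensus.
    """
    return sorted(
        listing_hash
        for listing_hash, data_hash in consensus_store.items()
        if local_store.get(listing_hash) != data_hash
    )


consensus_store = {"0xaa": "0x01", "0xbb": "0x02", "0xcc": "0x03"}
local_store = {"0xaa": "0x01", "0xbb": "0xff"}  # 0xbb is stale, 0xcc is missing

todo = listings_to_fetch(consensus_store, local_store)
```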

Notice that we can also use this BitTorrent protocol to handle the addition of new listings. Let’s suppose a new candidate listing has been transmitted to an existing datatrust. This datatrust will make the candidate available for download as a torrent. Other datatrusts in the data market can then start downloading the listing. The download will proceed slowly at first since there’s only a single source, but as additional nodes start serving pieces of the listing, the download speed should increase significantly.
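The speed-up from re-sharing can be illustrated with a toy round-based simulation. This is a deliberately simplified model and not how BitTorrent schedules transfers: it assumes every node can upload exactly one piece per round, ignores bandwidth and piece-selection strategy, and uses a hypothetical function name. It compares a single source serving everyone against a swarm in which peers re-share the pieces they already hold.

```python
def rounds_to_distribute(num_peers, num_pieces, peers_reshare):
    """Count rounds until every peer holds every piece of a new listing.

    Node 0 is the datatrust that first received the listing. Each round,
    every uploader sends one piece to the first peer still missing one of
    the pieces the uploader held at the start of the round.
    """
    have = [set(range(num_pieces))] + [set() for _ in range(num_peers)]
    rounds = 0
    while any(len(held) < num_pieces for held in have[1:]):
        rounds += 1
        # Uploaders may only send pieces they held when the round began.
        snapshot = [set(held) for held in have]
        uploaders = range(len(have)) if peers_reshare else [0]
        for uploader in uploaders:
            for downloader in range(1, len(have)):
                if downloader == uploader:
                    continue
                missing = snapshot[uploader] - have[downloader]
                if missing:
                    have[downloader].add(min(missing))
                    break  # one upload per uploader per round
    return rounds


single_source = rounds_to_distribute(num_peers=4, num_pieces=8, peers_reshare=False)
swarm = rounds_to_distribute(num_peers=4, num_pieces=8, peers_reshare=True)
```

In this model the single-source case needs one round per piece per peer, while the swarm finishes in fewer rounds because peers begin uploading as soon as they hold anything.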

You might ask, is this what existing blockchain systems do for transmitting transactions? Sort of. Ethereum, for example, has a wire protocol that it uses to transmit batches of transactions across the network. However, these types of protocols are optimized for transmitting transactions, a task quite different from transmitting a large file across the network. The needs of multi-datatrust systems are much more similar to those of existing torrent networks that transmit large movies or files across the web.