A Programming Model for Datatrusts

In the current version of the protocol, datatrusts serve as repositories of data. Interested parties can purchase raw data from a datatrust, priced by the parameter COST_PER_BYTE. Raw data purchases are useful for many downstream applications, and the scheme maintains considerable flexibility, since the datatrust doesn’t try to curtail what can or cannot be done with data from the data market.
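Concretely, a raw purchase under this model is priced purely by payload size. A minimal sketch of the idea (the per-byte value and units here are invented for illustration; the real COST_PER_BYTE is a protocol parameter):

```python
# Toy illustration of raw-data pricing in the current protocol.
# The value below is purely hypothetical; the actual COST_PER_BYTE
# is a parameter set by the protocol.
COST_PER_BYTE = 2  # hypothetical price per byte

def purchase_price(data: bytes) -> int:
    """Price of buying a raw payload: size times the per-byte rate."""
    return len(data) * COST_PER_BYTE

price = purchase_price(b"latitude,longitude\n23.111,12.12312\n")
```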

At the same time, this raw flexibility comes with a cost. As our recent paper suggests, there are high-value datasets for which direct sale of the data might not be feasible; medical or genomic data, for example, might not be sellable outright. In that work, we experimented with cryptographic techniques such as garbled circuits and homomorphic encryption to enable secure computation on sensitive datasets. We constructed a custom implementation that prototyped these concepts, but we did not integrate it with the datatrust code itself.

Part of the reason we couldn’t take this additional step is that at present, there’s no programming model for the datatrust. What do I mean by that? To be precise, there’s no way to run user-specified code on the raw data within the datatrust. For the most part, this is the right decision for a first datatrust implementation: there is considerable security risk in allowing users to execute external code. In the longer run, though, it feels like the capability to run external user code will be critically important for unlocking many useful applications. First, we should be able to use this facility to implement the secure computing features described in the paper. Second, there might be a number of other use cases that a suitable programming model could unlock; one of the powers of Turing-complete systems is that they don’t limit the user to specific applications.

Let’s start to think about what this programming model might look like. For simplicity, in this post I’m not going to say much about potential VM designs for the datatrust; instead, let’s focus on types. Any programming model for data in a data market will need to handle a variety of types, since many crop up in real-world data markets. Floating-point support in particular will be critical, since many datasets are natively represented as floats. Here’s a specification of the primitive types we might want to support:

data VMPrimitiveType
    = Integer
    | Bool
    | Char
    | String
    | Float

(It’s likely in practice we’ll want additional types like Float16, UInt32, etc., but I’m not spelling these out for simplicity.) We’ll also need a native type for vectors/arrays:

data VMListTypes
    = [Integer]
    | [Bool]
    | [Char]
    | [String]
    | [Float]

SQL data will also appear very commonly in data markets, so it makes sense to support row types:

(name :: String, age :: Integer)

Such row types can be used to represent SQL data. OK, now that we’ve got this collection of types, what are we going to do with them? The basic idea is that each listing in a data market will be annotated with a type. For example, consider a GPS data market whose listings carry latitude and longitude fields. We can represent the listing type as

newtype ListingType = (latitude :: Float, longitude :: Float)

Type annotations of this sort provide a very simple and flexible scheme for describing the data in a data market. SQL datasets can be represented with row types; image datasets can be represented with array types:

newtype ImageType = [[Float]]

(It might be worthwhile to add size annotations so we could specify image dimensions in the type, but that complicates the type system, so we won’t get into it here.) You might be asking what the point of all this effort is. The answer is that a type system gives us a fairly sensible programming model on top almost for free: we allow users to specify functions that transform the typed data in the data market into a suitably typed output. It might be easier to provide some sample code than to explain in more detail.

# Listings for a latitude/longitude data market
listings = {
    "0xabcwe23": (23.111, 12.12312),
    "0xw45f3sd": (11.455, 5.334552),
}

def mean_location(listings):
    sum_lat, sum_long = 0.0, 0.0
    for listing_hash, (lat, long) in listings.items():
        sum_lat += lat
        sum_long += long
    mean_lat = sum_lat / len(listings)
    mean_long = sum_long / len(listings)
    return (mean_lat, mean_long)

This snippet is a short program that computes the mean latitude/longitude in a GPS data market. The language itself is Python-esque (but type-safe) and lets the user compute a desired function over the listings. The datatrust ought to be able to use static analysis to approximately gauge the compute cost of a submitted function, which will let it decide whether it wants to execute that function.
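One crude way to approximate compute cost is to walk the submitted program’s syntax tree, counting operations and weighting anything inside a loop by an assumed iteration count. The sketch below uses Python’s ast module purely to illustrate the idea; the cost weights and the fixed loop bound are invented, and a real datatrust would derive bounds from the typed listing metadata it already holds:

```python
import ast

# Hypothetical bound on loop iterations; in practice this would come
# from the sizes of the typed listings the function iterates over.
ASSUMED_LOOP_ITERATIONS = 1000

def estimate_cost(source: str) -> int:
    """Crude static cost estimate: each AST node costs 1 unit, and
    the cost of a loop (including its body) is multiplied by an
    assumed iteration count."""
    def node_cost(node: ast.AST) -> int:
        cost = 1
        for child in ast.iter_child_nodes(node):
            cost += node_cost(child)
        if isinstance(node, (ast.For, ast.While)):
            cost *= ASSUMED_LOOP_ITERATIONS
        return cost
    return node_cost(ast.parse(source))

cheap = estimate_cost("x = 1 + 2")
pricey = estimate_cost("for v in xs:\n    total += v")
```

With an estimate like this in hand, the datatrust could simply refuse to run any submitted function whose estimated cost exceeds some budget.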

It’s important to note that this post only sketches out a crude idea. Considerably more work will be needed to refine the type system and the language itself, and to ensure security and feasibility. Additional work will also be needed to add programming primitives that support secure computation techniques, so we can implement the secure computing paper on top of this framework. But there’s something compelling about allowing computation on data in data markets: it could open up a new frontier of exciting use cases.
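As a hint of what such secure-computation primitives might look like, here is a textbook additive secret-sharing scheme, sketched in Python for illustration only: each contributor splits a value into random shares that individually reveal nothing, yet the shares jointly reconstruct the value. (Values are integers here for simplicity; real GPS floats would need a fixed-point encoding, and the actual techniques from the paper, garbled circuits and homomorphic encryption, are considerably more involved.)

```python
import random

# Toy additive secret sharing over a prime field. Illustration only:
# this sketch omits all of the surrounding protocol machinery.
PRIME = 2**61 - 1  # any prime larger than the shared values works

def share(value: int, n_parties: int) -> list[int]:
    """Split `value` into n random shares that sum to it mod PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares: list[int]) -> int:
    """Recover the shared value (or a sum of values) from all shares."""
    return sum(shares) % PRIME
```

Because shares of different values can be added component-wise before reconstruction, an aggregate like the mean latitude computed earlier could in principle be produced without any single party ever seeing an individual listing’s data.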