Valuing Derived Datasets

The core of the Computable protocol is designed to fairly value a gathered dataset. In this model, makers harvest data, and stakeholders in the market accept the data if it meets the market’s quality standards. Data that’s accessed often by buyers gains listing rewards. This design may seem simplistic at first blush. How, for example, could we value a derived dataset?

To start, what is a derived dataset? Let’s take the example of a location-based data market. Suppose that datapoints consist of (latitude, longitude) tuples. A listing will in this case consist of many datapoints; say, for simplicity, 1000 tuples per listing. In this setup, makers would contribute location listings and earn market tokens. It’s often desirable to augment the base data with additional information. For example, if the data were gathered in the United States, it might be useful to augment the (latitude, longitude) pairs with zipcodes. This process is called reverse geocoding and can be quite tricky to pull off, since open source libraries don’t do zipcodes well at present. A reverse geocoded version of the dataset would have datapoints of the form (latitude, longitude, zipcode). This augmented dataset is the “derived dataset.”
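As a rough illustration of the augmentation step, here is a minimal Python sketch. The lookup table, the placeholder zipcodes, and the names `toy_reverse_geocode` and `derive_listing` are all hypothetical and stand in for a real reverse geocoding service; nothing here is part of the Computable protocol itself.

```python
from typing import Callable, List, Optional, Tuple

Location = Tuple[float, float]                    # (latitude, longitude)
Augmented = Tuple[float, float, Optional[str]]    # (latitude, longitude, zipcode)

# Hypothetical lookup table: bounding box -> zipcode.
# Boxes are (min_lat, max_lat, min_lon, max_lon). The boxes and the
# zipcodes are placeholders, not real postal boundaries.
TOY_ZIP_BOXES = [
    ((37.70, 37.80, -122.50, -122.40), "00001"),
    ((40.70, 40.80, -74.05, -73.95), "00002"),
]


def toy_reverse_geocode(lat: float, lon: float) -> Optional[str]:
    """Return the placeholder zipcode whose box contains the point, if any."""
    for (min_lat, max_lat, min_lon, max_lon), zipcode in TOY_ZIP_BOXES:
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return zipcode
    return None


def derive_listing(
    listing: List[Location],
    geocode: Callable[[float, float], Optional[str]] = toy_reverse_geocode,
) -> List[Augmented]:
    """Turn (latitude, longitude) tuples into (latitude, longitude, zipcode) tuples."""
    return [(lat, lon, geocode(lat, lon)) for lat, lon in listing]


# A tiny base listing (a real listing might hold around 1000 tuples).
base_listing = [(37.75, -122.45), (40.75, -74.00), (0.0, 0.0)]
derived_listing = derive_listing(base_listing)
```

Points that fall outside every box get a zipcode of `None`, which mirrors the fact that reverse geocoding will not always find a match.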

The challenge now is how to value this derived dataset. One possibility is that the derived dataset isn’t represented directly on-chain. That is, an interested party can purchase access to the original dataset and construct a derived dataset with reverse geocodings, which they then sell independently. It’s then the responsibility of this party to value the new dataset, perhaps using the data market protocol for this purpose. Alternatively, if the interested party operates the datatrust for the original market, they can just perform this augmentation directly on the raw data. Both of these options have the advantage of simplicity, since the protocol doesn’t need to track complex data lineages. The first mainnet version of Computable data markets will enable these types of derived datasets.

Let’s consider a more complex example. Suppose that a second data market contains points of interest (locations such as stores or landmarks). In this case, the listing for a point of interest (POI) would be a tuple (POI-latitude, POI-longitude, POI-description). A derived dataset might consist of a “join” of sorts between the two data markets. This derived dataset would have tuples of the form (latitude, longitude, POI), where each location tuple is annotated with a point of interest that it might correspond to.
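One way such a “join” could be computed is a nearest-neighbor match within a distance cutoff. The sketch below is only an assumption about how a dataset creator might do it: the haversine distance, the 100 meter cutoff, and the names `haversine_m` and `join_with_pois` are illustrative choices, not something the protocol prescribes.

```python
import math
from typing import List, Optional, Tuple

Location = Tuple[float, float]                  # (latitude, longitude)
POI = Tuple[float, float, str]                  # (POI-latitude, POI-longitude, POI-description)
Joined = Tuple[float, float, Optional[str]]     # (latitude, longitude, POI)

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def join_with_pois(
    locations: List[Location],
    pois: List[POI],
    max_distance_m: float = 100.0,  # assumed cutoff, chosen for illustration
) -> List[Joined]:
    """Annotate each location with the nearest POI within the cutoff, else None."""
    joined = []
    for lat, lon in locations:
        best_desc, best_dist = None, max_distance_m
        for p_lat, p_lon, desc in pois:
            dist = haversine_m(lat, lon, p_lat, p_lon)
            if dist <= best_dist:
                best_desc, best_dist = desc, dist
        joined.append((lat, lon, best_desc))
    return joined


locations = [(37.7750, -122.4194), (10.0, 10.0)]
pois = [(37.7751, -122.4195, "coffee shop"), (37.7800, -122.4100, "bookstore")]
joined = join_with_pois(locations, pois)
```

This brute-force loop compares every location against every POI, which is fine for a sketch; a real join over large markets would likely use a spatial index instead.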

The creator of the derived dataset will need to have access to both of the original data markets. Let’s assume that this entity has purchased access to both original data markets and has assembled this new derived “join” dataset. The protocol itself is permissionless, so the purchaser has the ability to create a new data market which exposes this joined dataset to buyers. If they do this, then the derived dataset can be valued by the data market protocol, but with the limitation that the original makers will not profit from the derived dataset. At first glance, this may seem like a drawback; shouldn’t the original makers profit from the derived dataset? However, it’s important to realize that the creator of the derived dataset has already paid for access, so the makers have seen their reward. Cutting the lineage of ownership after one purchase allows for flexibility and rapid data innovation, since there are no data rent seekers who can claim rent on derived data in perpetuity.
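The payment flow described above can be illustrated with a toy ledger. This is a deliberately simplified accounting sketch, not the protocol’s actual reward mechanics: the participant names, the prices, and the even split among makers in `buy_access` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, List

# Toy token balances for each participant; everyone starts at zero.
balances: DefaultDict[str, float] = defaultdict(float)


def buy_access(buyer: str, market_makers: List[str], price: float) -> None:
    """Buyer pays `price`, split evenly among the makers of that market.

    The even split is a simplifying assumption for this sketch.
    """
    balances[buyer] -= price
    share = price / len(market_makers)
    for maker in market_makers:
        balances[maker] += share


location_makers = ["alice"]   # makers of the original location market
poi_makers = ["bob"]          # makers of the original POI market
derived_makers = ["carol"]    # creator of the derived "join" market

# Carol pays once for access to each original market.
buy_access("carol", location_makers, 10.0)
buy_access("carol", poi_makers, 10.0)

# Dave later buys the derived dataset. Only Carol is paid: the lineage
# of ownership was cut when she purchased access to the originals.
buy_access("dave", derived_makers, 30.0)
```

In this sketch Alice and Bob each end with 10 tokens from Carol’s one-time purchases, and the 30 tokens Dave pays go entirely to Carol, leaving her with a net balance of 10.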