How Serious is Data Counterfeiting?

In a previous post, I introduced the idea of a derived dataset. To recap, a derived dataset is one where additional information is layered on top of the original dataset, increasing its usefulness or market value. If this information is added by the datatrust operator, it can be incorporated directly into the source data market. Alternatively, a third party can purchase access to the data outright and then spin up a new, independent data market that exposes the augmented dataset. In this scenario, the third party has likely expended significant money and effort to create the derived dataset, so this doesn't seem like a terribly unfair trade.
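As a concrete toy illustration (the records and the added field here are hypothetical, not from the original post), deriving a dataset can be as simple as computing a new feature on top of the raw records:

```python
# Hypothetical toy example of a derived dataset: the raw records are
# plain measurements, and the "derivation" adds a computed field that
# makes each record more useful to buyers.

raw_dataset = [
    {"sensor_id": "a1", "temp_c": 21.5},
    {"sensor_id": "b2", "temp_c": 19.0},
]

def derive(record: dict) -> dict:
    # Augment the original record with a Fahrenheit conversion.
    return {**record, "temp_f": record["temp_c"] * 9 / 5 + 32}

derived_dataset = [derive(r) for r in raw_dataset]
print(derived_dataset[0])  # {'sensor_id': 'a1', 'temp_c': 21.5, 'temp_f': 70.7}
```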

Now let's consider an alternate setting where the third party is a little lazier and a little greedier. In this case, the third party obtains access to the data and then spins up a new, independent data market with prices lower than the original's. Such a data market can reasonably be called a "counterfeit data market."

Counterfeit data markets have a number of classical analogues in the more traditional economy. For example, there exists a vast ecosystem of counterfeit Nike knock-offs (see this recent article for example). The usual diagnosis is that Nike must fight counterfeiting by partnering with law enforcement, since counterfeits dilute the Nike brand. At the same time, a quick look at Nike's stock price shows a very healthy increase over the last few years. What could explain this discrepancy? One real possibility is that Nike counterfeits result in some lost sales, but also serve to popularize Nike's brand among consumers who previously felt it was out of reach.

Coming back to data markets, it's likely that a counterfeit data market could well drive revenue and interest back to the source data market. There are a few reasons for this, all rooted in network effects that make it very difficult for the counterfeit market to overtake the source. First, the makers who serve the source data market would each have to individually decide to switch to the counterfeit market, and inertia and trust in the original market's brand will slow this switch. Even more importantly, the counterfeit data market can't steal the source data market's reserve. The reserve is what motivates contributions by makers, so its presence will likely keep makers tied to the source market. This stickiness means that newly gathered data will continue to be contributed to the source market. Buyers attracted by the counterfeit market's cheap prices will likely be drawn upstream over time to the higher-quality source market.

Note, though, that this analysis presumes the data is somewhat "volatile." That is, we assume old data loses value and is less interesting than newer data. This assumption holds for many types of datasets, such as time series data, sensor data (where newer sensors are better than older ones), and even some biological data (newer assays are often superior to older assays). However, some non-volatile datasets do exist. For example, the structures of small molecules designed by pharmaceutical companies are often kept highly secret and can be valuable IP for decades at a time. In this situation, counterfeits (in the absence of patent protection) could prove damaging.
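To make "volatility" a bit more concrete, one hypothetical way to model it (this is illustrative, not a model from the original post) is to treat a record's market value as decaying with a half-life: volatile data loses value quickly, while non-volatile data corresponds to an effectively infinite half-life:

```python
def data_value(initial_value: float, age_days: float, half_life_days: float) -> float:
    """Hypothetical decay model: a record's market value halves every
    `half_life_days`. Non-volatile data corresponds to an effectively
    infinite half-life."""
    return initial_value * 0.5 ** (age_days / half_life_days)

# A volatile sensor reading is worth little after a few months.
print(data_value(100.0, age_days=90, half_life_days=30))      # 12.5

# A non-volatile molecular structure barely decays over a decade.
print(data_value(100.0, age_days=3650, half_life_days=36500)) # ~93.3
```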

How can counterfeiting be controlled in non-volatile data markets? The solution will likely involve cryptographic techniques that limit information leakage from the source market. For example, a molecular structure data market could choose to answer only a limited set of secure queries for buyers instead of selling arbitrary data. However, additional research will be needed to scope out a proper set of "secure computation" primitives.
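To make the restricted-query idea concrete, here is a minimal sketch of what such an interface might look like. The class name and the whitelisted queries are hypothetical, and a real deployment would need genuine cryptographic protections (e.g., secure multi-party computation or differentially private answers) on top of this access-control shell, since even aggregate answers leak information under repeated querying:

```python
from typing import Callable

class MoleculeMarket:
    """Hypothetical data market that never exposes raw structures;
    buyers can only invoke a small, pre-approved set of aggregate queries."""

    def __init__(self, structures: list[dict]):
        self._structures = structures  # raw data, never returned directly
        # Whitelist of approved queries; anything else is rejected.
        self._queries: dict[str, Callable[[list[dict]], float]] = {
            "count": lambda rows: float(len(rows)),
            "mean_mol_weight": lambda rows: sum(r["mol_weight"] for r in rows) / len(rows),
        }

    def query(self, name: str) -> float:
        if name not in self._queries:
            raise PermissionError(f"query '{name}' is not whitelisted")
        return self._queries[name](self._structures)

market = MoleculeMarket([{"mol_weight": 180.2}, {"mol_weight": 151.2}])
print(market.query("mean_mol_weight"))  # 165.7
# market.query("dump_all")  # PermissionError: raw structures stay private
```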
