How to think about fraudulent submissions to a data market
The Computable protocol—a set of smart contracts built on the Ethereum platform—aims to democratize the task of creating and using large scale datasets by wrapping a governance and financial layer around a dataset, turning a dataset into a data market.
Buyers pay to access data in the market; patrons have equity in the market in the form of tokens; and makers contribute data to the market in exchange for tokens.
This note explores how dishonest makers might try to forge data and sneak it into a data market.
To incentivize makers to only submit data that would increase the value of the market, the protocol incorporates voting by token-holders. Candidates—new data instances that makers furnish and propose be added to the market—are voted in by token-holders. If an undesirable data instance slips through this process, it can later be challenged, and a vote amongst token-holders can remove it.
These two mechanics incentivize token-holders to make decisions to only include data that they believe would increase the value of the market. They don’t, however, answer how a rational token holder should evaluate whether a candidate would increase the value of the market.
Assuming a data market has a well defined purpose, how does one evaluate the contribution (positive or negative) of any given data instance to the overall value of the market?
One way to better understand and frame the problem is to list potential attacks. We’ll restrict the list to attacks where a maker is dishonest and attempts to get a fraudulent data instance listed in the market. We assume token-holders who vote and participants who challenge listings are honest.
For clarity, let’s use an example data market with the purpose of training computer vision machine learning models. The market has images of a number of different objects like dogs, furniture, people, vehicles, etc. Each image is accompanied by a list of text labels corresponding to objects in the scene.
Total garbage attack
A random bitstring is submitted as a listing candidate without any attempt to conform to the expected data structure. For example, a random string of bits that isn’t even an image file.
Structured garbage attack
A data instance that does conform to the expected structure (valid jpg file, right resolution, contains a list of labels), but whose content is drawn at random from some arbitrary simple distribution. For example, an image where each pixel value is chosen from a uniform distribution.
Unmodified copy attack
A data instance that is an exact byte-for-byte copy of one already in the market.
Semantic copy attack
A data instance that is a copy of one already in the dataset, but has had bits changed such that semantically it is very similar but a hash function taken over the data would give a different result than the original. An example of this is copying an image and changing a single pixel value.
Remix attack
A data instance that is a copy of one already in the dataset, but that has been more significantly changed than in a semantic copy attack. The change modifies the original but does not add any new information. For example, several existing images are cropped together; an existing image has a region of pixels blacked out; an existing image has filters applied, or is rotated.
Model driven generative attack
Some or all of the dataset is used to create a model that then generates new data instances that are submitted as candidates. For example, the dataset is used to train a GAN (generative adversarial network) and outputs of the generator model are submitted. Here, the proposed image might be unique but does not add information to the dataset, since it was generated by a process that was trained on the dataset itself.
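To make the difference between the unmodified copy attack and the semantic copy attack concrete, here is a minimal Python sketch. It is an illustration only: the protocol is not described here as using SHA-256, and the names content_hash, is_exact_copy, and listed_hashes are hypothetical. The byte string stands in for an image file.

```python
import hashlib
from typing import Set


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a data instance's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def is_exact_copy(candidate: bytes, listed_hashes: Set[str]) -> bool:
    """True if the candidate is byte-for-byte identical to a listed instance."""
    return content_hash(candidate) in listed_hashes


# Hypothetical stand-in for an image already listed in the market.
original = bytes(range(256)) * 4
listed_hashes = {content_hash(original)}

# Semantic copy: flip a single bit in one byte of the original.
tampered_buffer = bytearray(original)
tampered_buffer[10] ^= 0x01
tampered = bytes(tampered_buffer)

exact_flagged = is_exact_copy(original, listed_hashes)     # caught by the hash check
tampered_flagged = is_exact_copy(tampered, listed_hashes)  # slips past the hash check
```

A hash comparison is enough to catch the unmodified copy, but changing one bit produces a different digest, which is why the semantic copy needs a different kind of check.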
Evaluating the impact of attacks on the data market
The total garbage attack, structured garbage attack, unmodified copy attack, and semantic copy attack clearly don’t add value to the data market.
Evaluating the remix attack and the model driven generative attack is more complicated. Computer vision training datasets are routinely enlarged and diversified by duplicating images which are then visually distorted through filters or rotation. The process leads to more robust recognition of objects when the algorithm is challenged by varying lighting conditions or camera angles. This kind of distortion is the same as the remix attack.
Likewise, synthetic datasets, where training instances are computer generated, are used in the wild to train machine learning algorithms, similar to the model driven generative attack.
The challenge with these last two types of attacks is that though they may be valuable additions to the market, they add less information than de-novo “real” training instances. Combined with the fact that an automated process can generate an unlimited number of these types of instances, allowing them in a data market without some alternative pricing model would distort the economics and potentially create a disincentive for submitting “real” instances.
An inherent challenge to defending against these types of attacks comes from the informational purpose of creating a dataset in the first place.
How can an honest evaluator (voter, challenger, buyer of data) decide whether a given data instance is valid or fraudulent given that the purpose of the assembly of data is to create ground truth information where before there was none?
In other words, the data models an unknown distribution. If the distribution were known in advance, it could be used to assess a candidate data instance, but knowing it in advance would also defeat the purpose of trying to create a dataset in the first place.
We can break down defenses into two classes: those that do and those that do not have access to a reference independent of the dataset in the market.
Reference free defenses
Again using the computer vision example, intuitively, it should be easy to write (or train) an algorithm that detects and throws out fraudulent data instances that are just pixels drawn at random (this would be a structured garbage attack). Visually, these images would look like snow on an old TV set and nothing like a picture of anything from the real world.
Statistically these should be straightforward to flag because all valid images in the dataset should lie on a lower dimensional manifold. A garbage image would be so far from this manifold that it could be automatically thrown out without fear of it being a rare but valuable outlier.
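As a rough illustration of how such a flag might work, the sketch below uses a much cruder statistic than a learned manifold: the correlation between horizontally adjacent pixels. Neighboring pixels in real photographs tend to be strongly correlated, while independently drawn pixels are not. The function names, the 0.5 threshold, and the synthetic test images are all assumptions made for this example.

```python
import random


def neighbor_correlation(pixels, width, height):
    """Pearson correlation between each pixel and its right-hand neighbor.

    `pixels` is a flat, row-major list of grayscale values.
    """
    xs, ys = [], []
    for r in range(height):
        for c in range(width - 1):
            xs.append(pixels[r * width + c])
            ys.append(pixels[r * width + c + 1])
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Constant image: correlation is undefined, so treat it as 0 here.
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def looks_like_noise(pixels, width, height, threshold=0.5):
    """Flag images whose adjacent pixels are essentially uncorrelated.

    The threshold is a hypothetical value chosen for this example.
    """
    return abs(neighbor_correlation(pixels, width, height)) < threshold


W, H = 64, 64
rng = random.Random(0)

# Structured garbage: every pixel drawn independently from a uniform distribution.
noise = [rng.randrange(256) for _ in range(W * H)]

# Stand-in for a real photograph: a smooth diagonal gradient.
smooth = [2 * c + 2 * r for r in range(H) for c in range(W)]

noise_flagged = looks_like_noise(noise, W, H)
smooth_flagged = looks_like_noise(smooth, W, H)
```

A real market would more plausibly rely on a trained model, but even a statistic this simple separates uniform noise from structured images in this toy setup.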
The same reasoning should allow us to flag semantic copy attacks, as these images would likely be extremely close to an existing instance in the dataset. In pixel-space they might be different, but in the lower dimensional space that any neural network would transform the pixel values into, they should lie very close to an existing valid instance, and hence be seen as a copy.
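The following Python sketch shows the idea with a deliberately simple stand-in for a neural network embedding: each image is reduced to an 8x8 grid of block averages, and a candidate is treated as a copy if its reduced vector lies within a small distance of a listed one. The embed function, the grid size, and the distance threshold of 5.0 are assumptions for illustration, not part of the protocol.

```python
def embed(pixels, width, height, grid=8):
    """Map a flat, row-major grayscale image to grid*grid block averages."""
    bw, bh = width // grid, height // grid
    vec = []
    for gr in range(grid):
        for gc in range(grid):
            total = 0
            for r in range(gr * bh, (gr + 1) * bh):
                for c in range(gc * bw, (gc + 1) * bw):
                    total += pixels[r * width + c]
            vec.append(total / (bw * bh))
    return vec


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def is_near_copy(candidate, listed_embeddings, width, height, threshold=5.0):
    """True if the candidate embeds within `threshold` of any listed instance.

    The threshold is a hypothetical value chosen for this example.
    """
    cand_vec = embed(candidate, width, height)
    return any(distance(cand_vec, vec) < threshold for vec in listed_embeddings)


W, H = 64, 64

# Stand-in for an image already listed in the market.
original = [2 * c + 2 * r for r in range(H) for c in range(W)]
listed_embeddings = [embed(original, W, H)]

# Semantic copy: the same image with a single pixel value changed.
semantic_copy = list(original)
semantic_copy[100] = 255

# A genuinely different image (here, the inverted gradient).
different = [252 - v for v in original]

copy_flagged = is_near_copy(semantic_copy, listed_embeddings, W, H)
different_flagged = is_near_copy(different, listed_embeddings, W, H)
```

The single-pixel change barely moves the block averages, so the copy lands almost on top of the original, while the different image lands far away.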
Candidate instances that lie close to the manifold but not on top of a valid instance are harder to assess. Is it the product of a remix attack or a model driven generative attack that uses information in the dataset to create convincing frauds or is it new information from an honest maker?
Independent reference defenses
If we know something about where a candidate data instance came from, we may be able to decide if it is valid or fraudulent. For certain types of data, this of course will be hard. A candidate instance in a data market of original measurements of the atmospheric temperature of planets in the solar system would be impossible to independently assess.
Many potential data markets, however, may be aimed at data hungry machine learning applications. Often these are deep neural networks for perceptual applications that are trained to recognize features in images or audio. Datasets (labeled or unlabeled) for these applications are inherently independently verifiable, since they rely on a human to attach labels in the first place.
To put it another way, a person can just look at the data and tell the difference between a valid instance and an attack instance. (Note, this may change in the future as GANs or image completion networks become more sophisticated and start to generate data that fools a human).
A practical approach to evaluating data instances
Practically, a data market would likely employ both automated and human checks of listing candidates. Automated checks could flag low quality attacks.
Candidates that are not flagged by an automated check (because they’re sophisticated frauds or because they’re actually valid listings) would need to be inspected by a human voter before that voter casts a vote. This assumes that the data is of a type that can be validated by inspection.
The question then becomes how to make human inspection, which is necessarily time consuming, scalable. One way is to use reputation as a heuristic. Since all actions (listing data, supporting or withdrawing support from a market, buying data) leave a public log of the action, tied to the actor’s public Ethereum address, voters (and challengers) can look at the history of actions taken by a listing candidate’s owner to decide how intensely to scrutinize a candidate.
Makers with long histories and no challenges would be trusted, and might have listings approved without any human inspection. Makers with short or no history would expect more scrutiny. To be successful, a dishonest maker would have to behave honestly to establish enough history to be trusted before submitting fraudulent candidates. Even then, they are at risk, since finding any fraudulent listing would lead to increased scrutiny of all their listings.
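One way to picture this heuristic is as a function from a maker's public history to a level of scrutiny. The Python sketch below is purely illustrative: the MakerHistory fields, the three scrutiny levels, and the thresholds of 10 and 50 accepted listings are assumptions, and a real market would derive the history from on-chain logs.

```python
from dataclasses import dataclass


@dataclass
class MakerHistory:
    """Hypothetical summary of a maker's public record in a market."""
    accepted_listings: int
    removed_after_challenge: int  # listings voted out following a challenge


def scrutiny_level(history: MakerHistory,
                   spot_check_after: int = 10,
                   trusted_after: int = 50) -> str:
    """Decide how closely a voter should inspect this maker's next candidate.

    The thresholds are hypothetical values chosen for this example.
    """
    # Any fraudulent listing found leads to full scrutiny of everything else.
    if history.removed_after_challenge > 0:
        return "full inspection"
    # A long, clean history earns approval without human inspection.
    if history.accepted_listings >= trusted_after:
        return "automated checks only"
    # A moderate clean history earns lighter-touch review.
    if history.accepted_listings >= spot_check_after:
        return "spot check"
    # Short or no history: expect more scrutiny.
    return "full inspection"


newcomer = scrutiny_level(MakerHistory(accepted_listings=0, removed_after_challenge=0))
veteran = scrutiny_level(MakerHistory(accepted_listings=120, removed_after_challenge=0))
caught = scrutiny_level(MakerHistory(accepted_listings=120, removed_after_challenge=1))
```

In this sketch a single removed listing outweighs any amount of clean history, which mirrors the risk a dishonest maker takes on after building up trust.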
Finally, this record of a maker’s history of past listings, challenges, and even market token ownership is what is granted by the cryptographically signed and transactional nature of anything built on Ethereum. There is a wide design space of features that combine a maker’s private key with other platforms. For example, makers could cryptographically link their Ethereum private key and another online identity (Twitter, GitHub, Facebook, etc.) by signing and posting a message. This would increase trust at the expense of anonymity, but might be the design-space sweet spot to broaden participation in the creation of shared datasets while maintaining quality and fair compensation.