What is a dataset? The question may at first blush appear silly; isn’t it pretty obvious? But there’s more subtlety here than it seems. For example, in a previous post, I explained the notion of a derived dataset and how a dataset can gain in value as it’s annotated with relevant information. Data augmentation is a related concept in which a set of transformations is applied to data points to create a larger dataset from the base dataset. For example, images can be perturbed in a structured fashion to create new images (source).
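As a concrete illustration of the kind of structured perturbation described above, here is a minimal sketch in plain Python (images as nested lists of grayscale values; the particular transforms — flip, shift, noise — and all function names are illustrative choices, not from the original post):

```python
import random

def augment_image(img, rng):
    """Return perturbed copies of one grayscale image (a list of rows)."""
    flipped = [row[::-1] for row in img]                 # horizontal flip
    shifted = [row[1:] + row[:1] for row in img]         # 1-pixel cyclic shift
    noisy = [[min(1.0, max(0.0, px + rng.gauss(0, 0.05))) for px in row]
             for row in img]                             # additive Gaussian noise
    return [flipped, shifted, noisy]

def augment_dataset(images, seed=0):
    """Expand a base dataset by appending augmented variants of each image."""
    rng = random.Random(seed)
    out = list(images)
    for img in images:
        out.extend(augment_image(img, rng))
    return out

# A tiny base dataset of three 4x4 "images" grows fourfold after augmentation.
base = [[[0.1 * (r + c) % 1.0 for c in range(4)] for r in range(4)]
        for _ in range(3)]
augmented = augment_dataset(base)
print(len(base), len(augmented))  # 3 12
```

Each base image yields three new images here, but real pipelines typically sample many random variants per image.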
How can a data market account for these sorts of augmentations? Until recently, this question would have been of only academic interest; data augmentation wasn’t especially useful in practice. But this state of affairs is starting to change. Recent papers have begun to work out augmentation transforms that are useful well beyond simple image perturbations. The cumulative research is building to a point where “semi-supervised” techniques can achieve impressive learning results on relatively small datasets.
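One idea underlying many of these semi-supervised techniques is consistency regularization: the model is penalized when its prediction changes under a label-preserving augmentation, so unlabeled data contributes to training. The toy model and loss below are stand-ins of my own, not from any particular paper:

```python
import math
import random

def predict(weights, x):
    """Toy linear model squashed through a sigmoid."""
    s = sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-s))

def augment(x, rng):
    """Label-preserving perturbation: small additive noise."""
    return [v + rng.gauss(0, 0.01) for v in x]

def consistency_loss(weights, unlabeled, rng):
    """Mean squared disagreement between predictions on original
    and augmented versions of unlabeled points."""
    total = 0.0
    for x in unlabeled:
        total += (predict(weights, x) - predict(weights, augment(x, rng))) ** 2
    return total / len(unlabeled)

rng = random.Random(0)
unlabeled = [[rng.random() for _ in range(3)] for _ in range(5)]
weights = [0.5, -0.2, 0.1]
loss = consistency_loss(weights, unlabeled, rng)
# The loss is small but nonzero: predictions shift slightly under the noise.
```

In a full method this term is added to the supervised loss on the labeled subset, letting a small labeled dataset plus a larger unlabeled pool train a usable model.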
Does the advent of these newly empowered transformations break the data valuation aspects of our protocol? Not necessarily. Most of these augmentation techniques still require source data, which can be governed by the data governance protocol. In fact, the advent of data augmentation might actually increase the usefulness of the protocol by making smaller datasets valuable enough to be worth governing. That said, as data augmentation grows more valuable, it will likely be useful to have the reference datatrust implementation support standard data augmentation techniques for various datatypes. These transformations will allow buyers who lack the expertise to construct their own augmentation pipelines to benefit nonetheless.
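One way such datatrust support could be organized is a registry that maps datatypes to standard augmentation transforms, so buyers can request augmentation without building a pipeline. This is a hypothetical design sketch of my own; the registry name, decorator, and example transforms are all assumptions, not part of the protocol:

```python
# Hypothetical registry mapping datatypes to standard augmentation transforms,
# as a reference datatrust implementation might expose them to buyers.
AUGMENTATIONS = {}

def register(datatype):
    """Decorator that files a transform under the given datatype."""
    def wrap(fn):
        AUGMENTATIONS.setdefault(datatype, []).append(fn)
        return fn
    return wrap

@register("image")
def horizontal_flip(img):
    # Image as nested lists of pixel values.
    return [row[::-1] for row in img]

@register("text")
def token_dropout(tokens):
    # Drop every fifth token as a crude text perturbation.
    return [t for i, t in enumerate(tokens) if i % 5 != 4]

def augment(datatype, datapoint):
    """Apply every registered transform for this datatype."""
    return [fn(datapoint) for fn in AUGMENTATIONS.get(datatype, [])]

variants = augment("image", [[1, 2], [3, 4]])
print(variants)  # [[[2, 1], [4, 3]]]
```

The appeal of a registry is that new datatypes and transforms can be added without changing buyer-facing code: a buyer only names the datatype and receives whatever standard variants the datatrust supports.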