Listings Should Be Typed

One of the fundamental challenges that data markets face is validating listing candidates. The listing itself is an off-chain chunk of data, so it’s not possible to control its exact structure on-chain (except through coarse mechanisms such as dataHash, which can store the hash of the listing on-chain). As a result, current data markets have only very loose control over the structure of submitted listings.
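To see why a hash is only a coarse control, here is a minimal sketch. It assumes SHA-256 as a stand-in for whatever hash function the contract actually uses, and the helper names are hypothetical. A hash pins down the exact bytes of a listing, but a garbage listing has a perfectly valid hash too.

```python
import hashlib
import json


def listing_hash(listing_bytes: bytes) -> str:
    """Hex digest of the raw listing bytes (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256(listing_bytes).hexdigest()


def matches_on_chain_hash(listing_bytes: bytes, on_chain_hash: str) -> bool:
    """Check that off-chain data is the data whose hash was recorded on-chain."""
    return listing_hash(listing_bytes) == on_chain_hash


well_formed = json.dumps({"personId": 1, "lastName": "Smith"}).encode()
garbage = b"Efkw;elfkjw;elkjr"

# Both listings verify against their own recorded hash, so the hash
# proves integrity but carries no information about structure.
print(matches_on_chain_hash(well_formed, listing_hash(well_formed)))  # True
print(matches_on_chain_hash(garbage, listing_hash(garbage)))          # True
print(matches_on_chain_hash(garbage, listing_hash(well_formed)))      # False
```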

This looseness could cause considerable confusion as the ecosystem grows. If a data market employs a third-party datatrust, how can the datatrust operator decide what constitutes a fraudulent listing submission? Under the taxonomy of fraudulent listing submission types, even defending against a “garbage attack” isn’t straightforward for a datatrust, since the desired structure of the data isn’t specified anywhere on-chain. At present, the datatrust would have to be a semi-trusted party that maintains off-chain communication channels with key stakeholders in the market. While this situation is fine for an initial launch, it would be useful to make it easier for third-party datatrusts to detect garbage listings.

Luckily for us, the programming languages community has produced a wealth of research on the problem of structuring data. Strong type systems are capable of describing very complex data structures. Let’s look at an example. Suppose we wanted to specify that listings should match rows in a SQL table. To be concrete, consider the table created by the following CREATE TABLE command:


CREATE TABLE Persons (
    PersonID int,
    LastName varchar(255),
    FirstName varchar(255),
    Address varchar(255),
    City varchar(255)
);

To specify that a listing should constitute a row in this table, we can use a row type (I’m going to use Haskell/PureScript type syntax):


(personId :: Int, lastName :: String, firstName :: String, address :: String, city :: String)

Intuitively, this row type specifies that a listing is valid only if it contains five fields, each with the specified name and type. If this type were stored somewhere on-chain (say, as a string field on the Datatrust.vy contract), any datatrust would be able to evaluate the basic correctness of new data by running a type checking algorithm. This basic type checker could serve as a defense against garbage attacks on our data market.
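As a rough illustration of what such a type checker might look like, here is a minimal sketch in Python. It assumes the row type is available as a plain string and that listings arrive as dictionaries. The names parse_row_type, type_check, and BASE_TYPES are hypothetical, and only the Int and String types from the example are supported.

```python
# Hypothetical mapping from type names in the row type to Python types.
BASE_TYPES = {"Int": int, "String": str}


def parse_row_type(row_type: str) -> dict:
    """Parse '(a :: Int, b :: String)' into {'a': int, 'b': str}."""
    body = row_type.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("row type must be wrapped in parentheses")
    schema = {}
    for field in body[1:-1].split(","):
        name, _, type_name = (part.strip() for part in field.partition("::"))
        if not name or type_name not in BASE_TYPES:
            raise ValueError(f"cannot parse field: {field.strip()!r}")
        schema[name] = BASE_TYPES[type_name]
    return schema


def type_check(listing, schema: dict) -> bool:
    """True if listing is a dict with exactly the schema's fields and types."""
    if not isinstance(listing, dict) or listing.keys() != schema.keys():
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    return all(
        isinstance(listing[name], expected) and not isinstance(listing[name], bool)
        for name, expected in schema.items()
    )


ROW_TYPE = (
    "(personId :: Int, lastName :: String, firstName :: String, "
    "address :: String, city :: String)"
)
schema = parse_row_type(ROW_TYPE)

good = {"personId": 1, "lastName": "Smith", "firstName": "Ann",
        "address": "1 Main St", "city": "Oslo"}
bad_type = dict(good, personId="one")                     # wrong type for personId
missing = {k: v for k, v in good.items() if k != "city"}  # city field dropped

print(type_check(good, schema))      # True
print(type_check(bad_type, schema))  # False
print(type_check(missing, schema))   # False
```

Because the checker is driven entirely by the type string, any datatrust that reads the same string would reach the same verdict on a given listing.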

At the same time, we should note the limitations of this technique. In particular, there’s a slight mismatch between the SQL specification and the type. In SQL, we have


LastName varchar(255)

But in our type system we have


lastName :: String

which doesn’t enforce the length restriction on the string. One attack that could slip through would use strings that exceed the limit (say, 512 characters). It’s possible to catch such errors with more expressive type systems (such as refinement types), but we won’t discuss that here.
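To make the gap concrete, the sketch below approximates a length-refined string with a runtime check. This is not a real refinement type system, only an illustration of the extra predicate such a type would carry. The VarChar class is a hypothetical name, and it counts Python characters, which may differ from how a particular database counts length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarChar:
    """A string type refined by a maximum length, mirroring SQL varchar(n)."""
    max_length: int

    def accepts(self, value) -> bool:
        # A plain String check would stop at isinstance; the length test
        # is the extra predicate a refinement would add.
        return isinstance(value, str) and len(value) <= self.max_length


last_name = VarChar(255)

print(last_name.accepts("Smith"))    # True
print(last_name.accepts("x" * 512))  # False
print(isinstance("x" * 512, str))    # True: the unrefined check lets it through
```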

As another limitation, note that the type system can’t defend against “structured garbage attacks.” Suppose, for example, that I submitted a nonsensical record like the following:


(1231, “:CLKJSCSD”, “@#$@#$@#”, “kds;cs435”, “l3kj453453”)

The type system wouldn’t be able to detect this error. More sophisticated “copy attacks” or “generative attacks” would also be hard to detect with pure type specifications and would require more sophisticated defenses.

Nonetheless, despite these limitations, strong listing types could catch nonsensical garbage inputs like


“Efkw;elfkjw;elkjr”

which are malformed. In addition, since the type is compact enough to be stored on-chain, it provides a single source of truth that external datatrust providers can refer to as they decide on the validity of submitted listing candidates. Specifying such types will take data markets one step closer to decentralization by reducing the need for off-chain communication channels between market participants.
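The two garbage examples from this section can be run through a small positional check to show where the line falls. This is a sketch only: it assumes listings are submitted as tuples, and PERSON_ROW and is_row are hypothetical names.

```python
# Positional version of the row type: one Python type per column.
PERSON_ROW = (int, str, str, str, str)


def is_row(candidate, row_type=PERSON_ROW) -> bool:
    """True if candidate is a tuple whose items match row_type by position."""
    return (
        isinstance(candidate, tuple)
        and len(candidate) == len(row_type)
        and all(isinstance(value, expected)
                for value, expected in zip(candidate, row_type))
    )


unstructured = "Efkw;elfkjw;elkjr"
structured = (1231, ":CLKJSCSD", "@#$@#$@#", "kds;cs435", "l3kj453453")

print(is_row(unstructured))  # False: malformed input is rejected
print(is_row(structured))    # True: structured garbage still type checks
```

The second result is the limitation discussed earlier: the record has the right shape, so catching it needs a defense beyond the type.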

As a final consideration, it’s worth asking how this type system would interact with the existing challenge mechanism for removing listings. The challenge system requires significant effort on the part of the challenger. Multiple Ethereum transactions are required, and the challenger risks losing their stake if the challenge fails. For this reason, challenges will likely be relatively few in number and reserved for removing egregiously bad listings. The type system can serve as an automatic and fair safeguard that handles simpler failure cases. This will allow challengers to reserve their attention for more complicated cases (such as the structured garbage attack above).