Serializing Data with Protobufs or Apache Arrow

I’ve argued in a previous forum post that adding type annotations to listings would considerably simplify the data validation and verification process. At the same time, though, building a type system for data brings real challenges. Designing a new language is an extraordinary development lift, so it seems worthwhile to explore existing alternatives.

There have been a number of previous attempts to structure data serialized to disk. The two most widely used formats are likely JSON and protocol buffers. Both of these formats model a datapoint as a set of subfields with attached types. For example, here is a Person protobuf:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

Using the Haskell-like syntax from the previous post, we would say Person has the type

(name :: string, id :: int32, email :: string)

The idea is that each data market would have a listing protobuf specifying the format of valid listings for that market. This protobuf specification would be stored on-chain as the canonical reference for datatrusts.
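
As a rough sketch of how a datatrust might enforce this, suppose the Person message above has been compiled with protoc into a person_pb2 Python module (the module and helper names here are illustrative, not an existing datatrust API):

from google.protobuf.message import DecodeError

import person_pb2  # generated via: protoc --python_out=. person.proto


def is_valid_listing(payload: bytes) -> bool:
    """Return True if payload parses as a Person with all required fields set."""
    person = person_pb2.Person()
    try:
        person.ParseFromString(payload)
    except DecodeError:
        return False
    # IsInitialized() checks that the required fields (name, id) are present.
    return person.IsInitialized()


# A seller serializes a listing...
listing = person_pb2.Person(name="Alice", id=1, email="alice@example.com")
payload = listing.SerializeToString()
# ...and the datatrust checks it against the on-chain schema.
assert is_valid_listing(payload)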

Protobufs are widely used for serializing data and have been broadly productionized, so they should be relatively straightforward to deploy. Their downside is that they were not designed with machine-learning or numerical data in mind. Apache Arrow, on the other hand, was designed for numerical data serialization from day one. Arrow is baked into the Ray library as its native data serialization format and has been deployed in large production contexts.

The major advantage of using Arrow is that it would allow us to model numerical datatypes much more naturally. For example, data with the following format

[(1, 2), 'hello', 3, 4, np.array([5.0, 6.0])]

would map onto the following Arrow data structure:

UnionArray(type_ids=[tuple, string, int, int, ndarray],
           tuples=ListArray(offsets=[0, 2],
                            UnionArray(type_ids=[int, int],
                                       ints=[1, 2])),
           strings=['hello'],
           ints=[3, 4],
           ndarrays=[<offset of numpy array>])
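
In actual pyarrow code, a layout along these lines can be built with a dense union. The sketch below simplifies things by representing the tuple as a list of int64 and the numpy array as a list of float64; it illustrates the idea rather than Ray's actual serialization path:

import pyarrow as pa

# One child array per branch of the union.
tuples = pa.array([[1, 2]], type=pa.list_(pa.int64()))          # the (1, 2) tuple
strings = pa.array(["hello"], type=pa.string())
ints = pa.array([3, 4], type=pa.int64())
ndarrays = pa.array([[5.0, 6.0]], type=pa.list_(pa.float64()))  # stand-in for the ndarray

# type_ids: which child each top-level element lives in
# (0 = tuple, 1 = string, 2 = int, 3 = ndarray).
type_ids = pa.array([0, 1, 2, 2, 3], type=pa.int8())
# value_offsets: each element's index within its child array.
value_offsets = pa.array([0, 0, 0, 1, 0], type=pa.int32())

union = pa.UnionArray.from_dense(
    type_ids,
    value_offsets,
    [tuples, strings, ints, ndarrays],
    ["tuple", "string", "int", "ndarray"],
)
print(union.type)  # dense union over the four child types above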

In addition, Arrow data carries a schema that specifies its type. This schema could be stored on-chain as the canonical type for all datatrusts to enforce.
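
As a minimal pyarrow sketch (the field names mirror the Person example and are purely illustrative), the schema can be serialized to a bytestring that could live on-chain, recovered later, and used to check incoming data:

import pyarrow as pa

# The listing schema; its serialized form is what would be stored on-chain.
schema = pa.schema([
    ("name", pa.string()),
    ("id", pa.int32()),
    ("email", pa.string()),
])
schema_bytes = schema.serialize().to_pybytes()

# Later, a datatrust recovers the schema from the on-chain bytes...
stored_schema = pa.ipc.read_schema(pa.BufferReader(schema_bytes))

# ...and checks that a submitted batch of listings conforms to it.
batch = pa.RecordBatch.from_pydict(
    {"name": ["Alice"], "id": [1], "email": ["alice@example.com"]},
    schema=stored_schema,
)
assert batch.schema.equals(stored_schema)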

The major advantage of using Arrow or protobufs is that we would gain a production-hardened set of tools we could put to use quickly. The downside is that we might be locked into the Arrow/protobuf schema format. However, the benefits of having a functioning type system sooner rather than later would be considerable, so the tradeoff seems worthwhile. In fact, it doesn’t seem unreasonable for different data markets to use different serialization formats, since the canonical on-chain schema/type could just be an arbitrary bytestring, which could be a protobuf, an Arrow schema, or anything else we desire.
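
To make that concrete, here is a hypothetical sketch of how a datatrust might dispatch on such a bytestring, reusing the validators from the earlier snippets (the format tags and function names are invented for illustration):

import pyarrow as pa


def conforms_to_onchain_schema(format_tag: str, schema_bytes: bytes, payload: bytes) -> bool:
    """The chain stores only an opaque schema bytestring plus a tag saying how to read it."""
    if format_tag == "arrow":
        # schema_bytes is a serialized Arrow schema; payload is an Arrow IPC stream.
        schema = pa.ipc.read_schema(pa.BufferReader(schema_bytes))
        reader = pa.ipc.open_stream(pa.BufferReader(payload))
        return reader.schema.equals(schema)
    if format_tag == "protobuf":
        # Here the generated Person class stands in for the schema; a fuller version
        # might decode schema_bytes as a descriptor set instead.
        return is_valid_listing(payload)
    return False  # reject formats the datatrust doesn't understand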