Comparing Shapley Values with Data Market Valuation

The Shapley value is an idea from game theory that attempts to fairly distribute the surplus generated by a coalition of actors. It estimates the contribution each player makes to the coalition and assigns a payoff proportional to that contribution. What does this mean mathematically? Suppose that a set N of n actors is participating in a system. Suppose in addition that there exists a utility function U: 2^N \to \mathbb{R} that captures the value created by any subset of actors.

Before we go further, let's pause and quickly talk through the idea of the utility function. We are in essence assuming that there exists some function capable of capturing the economic outcomes that result from a subset of the actors working together. In the case of data, each actor could be a listing \ell_i, and U(\{\ell_1,\dotsc,\ell_k\}) could be the total economic reward that a market with these k listings accrues from buyers. Note that in practice, U is often not known, but the use of utility functions is a well-established mathematical tool in economics (and AI).
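To make this concrete, here is a toy utility function over three hypothetical listings, written in Python. The listing names and utility numbers below are invented purely for illustration; in a real market U would come from observed economic rewards and is typically not available in closed form.

```python
# Three hypothetical data listings; a "coalition" is any subset of them.
LISTINGS = ("l1", "l2", "l3")

# A toy utility function U: 2^N -> R, tabulated over all 2^3 coalitions.
# All numbers are made up for illustration only.
UTILITY = {
    frozenset(): 0.0,
    frozenset({"l1"}): 4.0,
    frozenset({"l2"}): 3.0,
    frozenset({"l3"}): 1.0,
    frozenset({"l1", "l2"}): 9.0,  # l1 and l2 are complementary
    frozenset({"l1", "l3"}): 5.0,
    frozenset({"l2", "l3"}): 4.0,
    frozenset({"l1", "l2", "l3"}): 10.0,
}

def U(coalition):
    """Utility accrued by a market containing exactly these listings."""
    return UTILITY[frozenset(coalition)]
```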

We then define the Shapley value for the i-th actor as follows:

s_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \left( U(S \cup \{i\}) - U(S) \right)

This definition is hard to parse, so it might be easier to look at the interpretation:

s_i = \frac{1}{\text{Number of Players}} \sum_{\text{coalitions excluding i}} \frac{\text{marginal contribution of }i\text{ to coalition}}{\text{Number of coalitions excluding }i\text{ of this size}}
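For small n, the sum can be evaluated exactly by enumerating every coalition. Here is a minimal sketch reusing the toy U and LISTINGS above; the exponential blow-up discussed next is visible in the loop over all subsets.

```python
from itertools import combinations
from math import factorial

def shapley_value(i, players, utility):
    """Exact Shapley value of player i: sum over all coalitions S not
    containing i, weighted by |S|! (n - |S| - 1)! / n!."""
    others = [p for p in players if p != i]
    n = len(players)
    value = 0.0
    for size in range(len(others) + 1):
        for S in combinations(others, size):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            value += weight * (utility(set(S) | {i}) - utility(set(S)))
    return value

values = {p: shapley_value(p, LISTINGS, U) for p in LISTINGS}
print(values)                # {'l1': 5.0, 'l2': 4.0, 'l3': 1.0}
print(sum(values.values()))  # 10.0: payoffs sum to U(N), the "efficiency" property
```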

Some recent work has attempted to use the Shapley value to solve the problem of data valuation. In this formulation, the utility function is taken to be the model accuracy on a held-out test set. The main challenge in this context is that computing the Shapley value naively involves an exponentially large summation, so clever techniques are needed to control the computational complexity. A straightforward approximation is to use random sampling: pick random terms in the summation and compute an estimate of the true Shapley value. With some clever algorithmic tricks, the cost of computing the Shapley value for ML applications can be reduced considerably and brought down to a reasonable level for moderately sized datasets.
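One common form of the random-sampling idea (a generic sketch, not any specific paper's algorithm) samples random orderings of the players and averages each player's marginal contribution at the moment it joins. This estimate is unbiased because the Shapley value equals the expected marginal contribution over a uniformly random permutation.

```python
import random

def shapley_monte_carlo(players, utility, num_permutations=1000, seed=0):
    """Approximate Shapley values by sampling random orderings and
    averaging each player's marginal contribution on arrival."""
    rng = random.Random(seed)
    players = list(players)
    estimates = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = players[:]
        rng.shuffle(order)
        coalition = set()
        prev = utility(coalition)
        for p in order:
            coalition.add(p)
            cur = utility(coalition)
            estimates[p] += cur - prev
            prev = cur
    return {p: total / num_permutations for p, total in estimates.items()}

print(shapley_monte_carlo(LISTINGS, U))  # close to the exact values above
```

Note that in the data-valuation setting each evaluation of the utility means training and testing a model, which is exactly why further algorithmic tricks are needed on top of plain sampling.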

One of the key limitations of this line of work, however, is the assumption that the utility function is the test accuracy of a particular model trained on the dataset at hand. Consider the ImageNet dataset. This data has been used for hundreds or thousands of different tasks. The true utility created by the data is hard to capture, especially since different data points could have contributed more or less to different downstream tasks. Similarly, for a data market, it isn't reasonable to assume that the data will be used for only one downstream ML task. The data from a data market may be used for training many different types of ML models or may feed into the construction of downstream derived datasets. It's not clear how the true utility function can be meaningfully modeled for a data market. For this reason, we've chosen at present to use a simpler scheme based on usage rewards. This scheme is easy to implement within smart contracts, and has the advantage of not assuming a single downstream ML application.
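For contrast, here is a minimal sketch of one natural reading of a usage-reward scheme: a proportional split of a reward pool by per-listing usage counts. The exact on-chain rule is not specified here, and the counts below are hypothetical. Unlike the Shapley value, this requires no utility function over coalitions.

```python
def usage_rewards(usage_counts, total_reward):
    """Split a reward pool across listings in proportion to usage."""
    total_usage = sum(usage_counts.values())
    if total_usage == 0:
        return {listing: 0.0 for listing in usage_counts}
    return {
        listing: total_reward * count / total_usage
        for listing, count in usage_counts.items()
    }

# Hypothetical usage numbers for the three toy listings.
print(usage_rewards({"l1": 70, "l2": 25, "l3": 5}, total_reward=10.0))
```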

At the same time, there might be clever ways to model the true utility function for a data market more accurately. If such approximate utility functions could be constructed, it might then become interesting to consider alternative valuations based on Shapley values. At present, though, this is left for future research.