Secure Computation in Decentralized Data Markets

In this blog post, we describe our recent paper exploring secure computation protocols for decentralized data markets. As quick background, decentralized data markets gather data from many contributors to create a joint data cooperative governed by market stakeholders. The ability to perform secure computation on decentralized data markets would allow for useful insights to be gained while respecting the privacy of data contributors. In this paper, we design secure protocols for such computation by utilizing secure multi-party computation techniques including garbled circuit evaluation and homomorphic encryption. Our proposed solutions are efficient and capable of performing arbitrary computation, but we report performance on two specific applications in the healthcare domain to emphasize the applicability of our methods to sensitive datasets.

One of the challenges of building a decentralized data market is providing adequate protection for the privacy of data contributors. Data contributors might be unwilling to contribute sensitive information into a data market if they lack adequate protections for their data. Economic considerations may ease some of these worries, but for high-value datasets more powerful cryptographic tools may be necessary to secure user data.

Therefore, in the paper, we focus on an interesting scenario where different makers wish to share their own listings (datapoints) to make data available for buyers to perform specific computations on aggregated data. At the same time, we assume makers are not comfortable sharing plaintext data; therefore, our main goal in this work is performing computation on protected and aggregated data. In many practice, these listings may not be owned by one organization, and might be distributed across a decentralized data market.

In our previous work, we introduced decentralized data markets which provide a powerful framework for constructing datasets with distributed ownership and control. In these scenarios, storing protected listings and performing computation on them is not straightforward. Who performs encryption upon listings? How is computation done on encrypted data? In this paper, we explore these questions on two healthcare inspired examples: performing logistic regression on the breast cancer Wisconsin Dataset, and the linkage disequilibrium test on GWAS data. We design a secure solution to compute both logistic regression and linkage disequilibrium tests, but our designed protocol is general and can be used to perform arbitrary computation.

Here, we explore this topic on two healthcare inspired examples: Logistic Regression on the breast cancer Wisconsin Dataset and Linkage Disequilibrium test on GWAS data. We design a secure solution to carry out the computation of Logistic Regression and Linkage Disequilibrium tests, but the proposed solution is not limited to these two particular algorithms; the designed protocol is general and can handle arbitrary computation.

In the following, we first provide some background related to the computation of Logistic regression, Linkage Disequilibrium, and cryptographic tools. Then we moved to our designed protocols, and at the end, we briefly report our experimental results.

Logistic Regression Computation:
Logistic Regression (LR) is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. We can divide the computation of LR in two main categories: training and testing. In training, a model is trained based on training samples and parameters are computed and then the trained model is run on the test cases. There are some standard and open source tools to perform training phase of LR such as TensorFlow (designed by Google), PyTorch (designed by Facebook), and SKLearn (very useful for some standard classifiers like LR). In here, we assume that we have access to a trained model and all the needed parameters and we only focus on the implementation of LR of the testing phase.

If we have a dimension n, precomputed parameters W = (w1,..., wn) and b which are regression coefficients of a trained model, and a sample X = (x1,..., xn), then we can compute probability p as:


As you can see in the computation of p, if we can compute the exponentiation, we can easily compute the rest of the probability.

Linkage Disequilibrium:
A DNA is a sequence of nucleotides {A,C,G,T}. An individual’s collection of genes is called a genotype and physical observable characteristics of an individual are called a phenotype. A genetic marker is defined as a gene or a DNA segment with a known locus (location) on a chromosome, which is typically used to help link an inherited disease with the responsible gene.
A single nucleotide polymorphism (SNP) represents a common type of a genetic variation among people in a single nucleotide that occurs at a specific locus in a genome. One of a number of alternative forms of a gene at a given locus is called an allele. The most and the least common alleles that occur in a given population are called major and minor alleles, respectively. We denote a major allele by a capital letter, e.g., A, and a minor allele by the corresponding lowercase letter, e.g., a. Let N denote the total number of collected alleles in a pool of genes. We then use NA and Na to denote the number of major and minor alleles in the observed population, respectively.
Linkage Disequilibrium (LD) is an important notion in population genetics and it occurs when genotypes at two different loci are not independent of each other. LD computation is based on Chi-square statistics and we can summarize its computation as:

Cryptographic Tools:
We design our solution based on Homomorphic Encryption (HE) and Garbled Circuit (GC). Note that we can use any HE including additive HE and fully HE, but here we are more interested in the fully HE (e.g., Paillier encryption as the additive HE and Lattice-based cryptography as fully HE). In the following, we describe HE and GC.

Homomorphic Encryption:
HE is a type of encryption that allows computations to be performed on encrypted data without revealing any information from the data. In here, we use a specific type of HE where its key is defined in a public-key cryptosystem. This scheme is defined by three algorithms (Gen, Enc, Dec), where Gen is a key generation algorithm that on input of a security parameter 1^κ produces a public-private key pair (pk, sk); Enc is an encryption algorithm that on input of a public key pk and message m produces ciphertext c; and Dec is a decryption algorithm that on input of a private key sk and ciphertext c produces decrypted message m or special character that indicates failure. For conciseness, we use notation Enc_pk(m) or Enc(m) and Dec_sk(c) or Dec(c) in place of Enc(pk,m) and Dec(sk,c), respectively. A semantically secure encryption scheme guarantees that no information about the encrypted message can be learned from its ciphertext with more than a negligible (in κ) probability.
Note that, fully HE supports arbitrary computation and it is a powerful tool, but we need to define a specific noise budget for sequential multiplication operations which affects the performance of computation. Since we use SEAL library for implementation, more information about fully HE can be found in SEAL’s paper.

Garbled Circuit:
The use of GC allows two parties P1 and P2 to securely evaluate a Boolean circuit of their choice. That is, given an arbitrary function f(x1,x2) that depends on private inputs x1 and x2 of P1 and P2, respectively, the parties first represent is as a Boolean circuit. One party, say P1, acts as a circuit generator and creates a garbled representation of the circuit by associating both values of each binary wire with random labels. The other party, say P2, acts as a circuit evaluator and evaluates the circuit in its garbled representation without knowing the meaning of the labels that it handles during the evaluation. The output labels can be mapped to their meaning and revealed to either or both parties.
Note that, in the two-party setting solution based on GC, the complexity of operation is measured in the number of non-free (i.e., non-XOR) Boolean gates. Also, some computations like shift operation do not consist of any kind of gate and it is totally free. Therefore, to have an optimized solution, we need to minimize the number of non-XOR gates by using more free operations during the computation instead.

Designed Protocols:
This is the designed protocol based on HE. In the following protocols, CSP is a Crypto Service Provider.

This is the designed protocol based on GC.

Experimental Results:
In the following, you can see how the designed protocols perform for two LD and LR tests. These are two examples, but the protocols support any arbitrary computation. For more information about the details of the protocols and their performance, we encourage you to read our paper.

This research post was supported by a Computable Fellowship. If you’re interested in performing fundamental research about decentralizing control of data, please apply!

1 Like