In this blog post, we describe our recent paper exploring secure computation protocols for decentralized data markets. As quick background, decentralized data markets gather data from many contributors to create a joint data cooperative governed by market stakeholders. The ability to perform secure computation on decentralized data markets would allow for useful insights to be gained while respecting the privacy of data contributors. In this paper, we design secure protocols for such computation by utilizing secure multi-party computation techniques including garbled circuit evaluation and homomorphic encryption. Our proposed solutions are efficient and capable of performing arbitrary computation, but we report performance on two specific applications in the healthcare domain to emphasize the applicability of our methods to sensitive datasets.

One of the challenges of building a decentralized data market is providing adequate protection for the privacy of data contributors. Data contributors might be unwilling to contribute sensitive information into a data market if they lack adequate protections for their data. Economic considerations may ease some of these worries, but for high-value datasets more powerful cryptographic tools may be necessary to secure user data.

Therefore, in the paper, we focus on an interesting scenario where different makers wish to share their own listings (datapoints) to make data available for buyers to perform specific computations on aggregated data. At the same time, we assume makers are not comfortable sharing plaintext data; therefore, our main goal in this work is performing computation on protected and aggregated data. In many practice, these listings may not be owned by one organization, and might be distributed across a decentralized data market.

In our previous work, we introduced decentralized data markets which provide a powerful framework for constructing datasets with distributed ownership and control. In these scenarios, storing protected listings and performing computation on them is not straightforward. Who performs encryption upon listings? How is computation done on encrypted data? In this paper, we explore these questions on two healthcare inspired examples: performing logistic regression on the breast cancer Wisconsin Dataset, and the linkage disequilibrium test on *GWAS* data. We design a secure solution to compute both logistic regression and linkage disequilibrium tests, but our designed protocol is general and can be used to perform arbitrary computation.

Here, we explore this topic on two healthcare inspired examples: Logistic Regression on the breast cancer Wisconsin Dataset and Linkage Disequilibrium test on *GWAS* data. We design a secure solution to carry out the computation of Logistic Regression and Linkage Disequilibrium tests, but the proposed solution is not limited to these two particular algorithms; the designed protocol is general and can handle arbitrary computation.

In the following, we first provide some background related to the computation of Logistic regression, Linkage Disequilibrium, and cryptographic tools. Then we moved to our designed protocols, and at the end, we briefly report our experimental results.

**Logistic Regression Computation:**

Logistic Regression (*LR*) is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. We can divide the computation of *LR* in two main categories: training and testing. In training, a model is trained based on training samples and parameters are computed and then the trained model is run on the test cases. There are some standard and open source tools to perform training phase of *LR* such as TensorFlow (designed by Google), PyTorch (designed by Facebook), and SKLearn (very useful for some standard classifiers like *LR*). In here, we assume that we have access to a trained model and all the needed parameters and we only focus on the implementation of *LR* of the testing phase.

If we have a dimension `n`

, precomputed parameters `W = (w1,..., wn)`

and b which are regression coefficients of a trained model, and a sample `X = (x1,..., xn)`

, then we can compute probability `p`

as:

As you can see in the computation of `p`

, if we can compute the exponentiation, we can easily compute the rest of the probability.

**Linkage Disequilibrium:**

A *DNA* is a sequence of nucleotides `{A,C,G,T}`

. An individual’s collection of genes is called a genotype and physical observable characteristics of an individual are called a phenotype. A genetic marker is defined as a gene or a *DNA* segment with a known locus (location) on a chromosome, which is typically used to help link an inherited disease with the responsible gene.

A single nucleotide polymorphism (*SNP*) represents a common type of a genetic variation among people in a single nucleotide that occurs at a specific locus in a genome. One of a number of alternative forms of a gene at a given locus is called an allele. The most and the least common alleles that occur in a given population are called major and minor alleles, respectively. We denote a major allele by a capital letter, e.g., `A`

, and a minor allele by the corresponding lowercase letter, e.g., `a`

. Let N denote the total number of collected alleles in a pool of genes. We then use `NA`

and `Na`

to denote the number of major and minor alleles in the observed population, respectively.

Linkage Disequilibrium (*LD*) is an important notion in population genetics and it occurs when genotypes at two different loci are not independent of each other. *LD* computation is based on Chi-square statistics and we can summarize its computation as:

**Cryptographic Tools:**

We design our solution based on Homomorphic Encryption (*HE*) and Garbled Circuit (*GC*). Note that we can use any *HE* including additive *HE* and fully *HE*, but here we are more interested in the fully *HE* (e.g., Paillier encryption as the additive *HE* and Lattice-based cryptography as fully *HE*). In the following, we describe *HE* and *GC*.

**Homomorphic Encryption:**

*HE* is a type of encryption that allows computations to be performed on encrypted data without revealing any information from the data. In here, we use a specific type of *HE* where its key is defined in a public-key cryptosystem. This scheme is defined by three algorithms `(Gen, Enc, Dec)`

, where Gen is a key generation algorithm that on input of a security parameter `1^κ`

produces a public-private key pair `(pk, sk)`

; Enc is an encryption algorithm that on input of a public key `pk`

and message `m`

produces ciphertext `c`

; and `Dec`

is a decryption algorithm that on input of a private key `sk`

and ciphertext `c`

produces decrypted message `m`

or special character `⊥`

that indicates failure. For conciseness, we use notation `Enc_pk(m)`

or `Enc(m)`

and `Dec_sk(c)`

or `Dec(c)`

in place of `Enc(pk,m)`

and `Dec(sk,c)`

, respectively. A semantically secure encryption scheme guarantees that no information about the encrypted message can be learned from its ciphertext with more than a negligible (in `κ`

) probability.

Note that, fully *HE* supports arbitrary computation and it is a powerful tool, but we need to define a specific noise budget for sequential multiplication operations which affects the performance of computation. Since we use *SEAL* library for implementation, more information about fully *HE* can be found in *SEAL*’s paper.

**Garbled Circuit:**

The use of *GC* allows two parties `P1`

and `P2`

to securely evaluate a Boolean circuit of their choice. That is, given an arbitrary function `f(x1,x2)`

that depends on private inputs `x1`

and `x2`

of `P1`

and `P2`

, respectively, the parties first represent is as a Boolean circuit. One party, say `P1`

, acts as a circuit generator and creates a garbled representation of the circuit by associating both values of each binary wire with random labels. The other party, say `P2`

, acts as a circuit evaluator and evaluates the circuit in its garbled representation without knowing the meaning of the labels that it handles during the evaluation. The output labels can be mapped to their meaning and revealed to either or both parties.

Note that, in the two-party setting solution based on *GC*, the complexity of operation is measured in the number of non-free (i.e., *non-XOR*) Boolean gates. Also, some computations like shift operation do not consist of any kind of gate and it is totally free. Therefore, to have an optimized solution, we need to minimize the number of *non-XOR* gates by using more free operations during the computation instead.

**Designed Protocols:**

This is the designed protocol based on *HE*. In the following protocols, *CSP* is a Crypto Service Provider.

This is the designed protocol based on *GC*.

**Experimental Results:**

In the following, you can see how the designed protocols perform for two *LD* and *LR* tests. These are two examples, but the protocols support any arbitrary computation. For more information about the details of the protocols and their performance, we encourage you to read our paper.

*This research post was supported by a Computable Fellowship. If you’re interested in performing fundamental research about decentralizing control of data, please apply!*