Methods for Reducing Disclosure Risks When Sharing Data:
Overview of Computer Science Methods
Like statisticians, computer
scientists conduct research on privacy and confidentiality. A first area is secure distributed computing. In this scenario, multiple
parties (e.g., companies, government agencies) own different data sources. The parties seek to allow one
another, and possibly others, to analyze the combined data without actually revealing their own data. Computer
scientists have developed algorithms and protocols to facilitate this form of data sharing. A second active area is designing
systems that answer queries without revealing sensitive data. Typically, the protection in these systems
derives from withholding answers to selected queries or from perturbing query outputs. A third area overlaps substantially with research in statistical disclosure limitation:
developing measures of disclosure risk and data usefulness, and creating methods for altering individual data values to protect confidentiality.
Research in these three topics is often called privacy-preserving data mining in the computer science literature.
Below are introductions to these three topics.
Computer scientists also work extensively on information security, for example, encrypting
data to prevent hackers from learning sensitive
information stored on computing systems. The literature on information security is vast, and we do not
cover it here.
1. Secure Distributed Computing
When two or more parties seek to compute specific functions of their combined data, they can use
protocols from secure multi-party computation. Although there are many protocols, the
secure summation protocol illustrates the approach. Suppose that K parties each own one data value, and the parties want to
compute the sum of the K values without revealing the individual values. The first
party selects a very large random number, say R, and adds its data value x to it.
The first party then sends S = R + x to the second party, which adds its value y to
S and passes the sum to the third party. The process continues until party K adds its value
and returns the running total to the first party, which subtracts R to obtain the true sum. It then shares
the sum with the other parties.
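The protocol above can be sketched in a few lines. This is an illustrative simulation of the message passing, not a hardened implementation; following common practice, the arithmetic is done modulo a large number so that the masked running totals reveal nothing about the true partial sums:

```python
import random

def secure_sum(values, modulus=2**64):
    """Illustrative sketch of the secure summation protocol.

    Only the first party knows the random mask R, so no intermediate
    total reveals any single party's value.
    """
    # Party 1 picks a large random mask R and adds its own value.
    R = random.randrange(modulus)
    s = (R + values[0]) % modulus
    # Parties 2..K each add their value and pass the total along.
    for v in values[1:]:
        s = (s + v) % modulus
    # Party 1 removes the mask to recover the true sum.
    return (s - R) % modulus

print(secure_sum([12, 7, 30]))  # 49
```

Note that each party learns only a masked running total; the first party alone can unmask the final result, which it then shares.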
A summary of secure multi-party computation, including some key references, can be found at Latanya Sweeney's data privacy lab website.
Secure summation protocols can be used to estimate parameters of a variety
of statistical models; see Karr et al. (2007, Technometrics, 49, 335-345) for a review.
2. Secure Query Systems
In a secure query system, users submit requests for functions of the data, but they are not allowed to see or
download the individual records in the database. Additional protections are needed because
some queries can result in disclosures; an obvious example is a direct query for a particular
record's sensitive data. To protect confidentiality, designers of query systems take one or both of the
following approaches. First, they decline to answer some queries. Specifying the unanswerable
set is a challenging problem because of the potential for interaction among queries: the results
of one query, combined with those of another, may lead to disclosures. Second, they perturb the output of the query (not the
data themselves) by aggregating or adding noise. Computer scientists have developed theory on
the types of perturbations that best ensure protection. An excellent website on this theory is maintained by
Microsoft's database privacy group.
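As a concrete illustration of the first approach, a query system might refuse to answer any query whose matching group of records is too small, since small groups risk revealing individuals. The threshold of 5 and the field names below are illustrative choices, not features of any specific system, and a real system would also have to track interactions among queries, which this sketch does not:

```python
def answer_mean_query(records, predicate, field, min_group_size=5):
    """Answer a mean query, but withhold the answer when the
    matching group is smaller than min_group_size.
    """
    matches = [r[field] for r in records if predicate(r)]
    if len(matches) < min_group_size:
        return None  # query withheld: group too small
    return sum(matches) / len(matches)

# Hypothetical salary table.
data = [{"dept": "A", "salary": s} for s in (50, 55, 60, 65, 70)]
data += [{"dept": "B", "salary": 90}]

print(answer_mean_query(data, lambda r: r["dept"] == "A", "salary"))  # 60.0
print(answer_mean_query(data, lambda r: r["dept"] == "B", "salary"))  # None
```

The second query is withheld because answering it would reveal one employee's exact salary.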
3. Some Risk Metrics and Perturbation Approaches
- k-anonymity: every released record is indistinguishable from at least k-1 other records in the released data on its identifying attributes.
Latanya Sweeney's k-anonymity research describes some
of the algorithms used to achieve k-anonymity.
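The k-anonymity condition can be checked mechanically: group the released records by their quasi-identifying attributes and verify that every group has at least k members. A minimal sketch, where the table and field names are hypothetical:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records, so each record is indistinguishable
    from at least k-1 others on those attributes.
    """
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Hypothetical released table: zip code and age range are quasi-identifiers.
table = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "40-49", "diagnosis": "diabetes"},
]
print(is_k_anonymous(table, ["zip", "age"], k=2))  # False: one record is unique
```

Algorithms for achieving k-anonymity typically generalize or suppress quasi-identifier values (as in the starred zip codes above) until this check passes.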
- l-diversity: all records with the same non-sensitive attributes have sufficient variation of sensitive values.
Computer scientists at Cornell University describe an
algorithm for implementing l-diversity.
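The simplest variant, distinct l-diversity, can be checked in the same grouped fashion: within each group of records sharing quasi-identifier values, the sensitive attribute must take at least l distinct values. A minimal sketch with hypothetical data:

```python
def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Return True if, in every group of records with identical
    quasi-identifier values, the sensitive attribute takes at least
    l distinct values (distinct l-diversity).
    """
    groups = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

table = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
]
print(is_l_diverse(table, ["zip", "age"], "diagnosis", l=2))  # True
print(is_l_diverse(table, ["zip", "age"], "diagnosis", l=3))  # False
```

The check guards against the weakness of k-anonymity alone: a group can be large yet still disclose a sensitive value if every member shares it.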
- differential privacy: complicated mathematically, but intuitively it guarantees that the risk of disclosing any individual's data
does not meaningfully depend on whether that individual's record is in the released database. Vitaly Shmatikov and
Adam Smith have posted an informative
PowerPoint presentation on differential privacy.
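For numeric queries, the standard route to differential privacy is the Laplace mechanism: add noise whose scale is the query's sensitivity (the most the answer can change when one individual's record is added or removed) divided by the privacy parameter epsilon. A minimal sketch; the data are made up:

```python
import random

def laplace_mechanism(query_answer, sensitivity, epsilon):
    """Release a numeric query answer under epsilon-differential privacy.

    Adding or removing one person's record then shifts the output
    distribution by at most a factor of exp(epsilon).
    """
    scale = sensitivity / epsilon
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace distribution with that scale.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return query_answer + noise

# A count query has sensitivity 1: one person changes it by at most 1.
incomes = [52_000, 61_000, 48_000, 75_000]
exact_count = sum(1 for x in incomes if x > 50_000)  # 3
print(laplace_mechanism(exact_count, sensitivity=1, epsilon=0.5))
```

Smaller epsilon means stronger privacy but noisier answers; choosing epsilon is a policy decision, not a purely technical one.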