In the paper "Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset)", Arvind Narayanan and Vitaly Shmatikov of The University of Texas at Austin describe the problem, show a general method of de-anonymizing statistical data, and demonstrate its use in an area where the participants were under the impression that their information was anonymous.
Collectors and publishers of data need to be aware of the potential for exposure of information that may be regarded as sensitive. As the authors put it:

Datasets containing “micro-data,” that is, information about specific individuals, are increasingly becoming
public—both in response to “open government” laws, and to support data mining research. Some datasets
include legally protected information such as health histories; others contain individual preferences, purchases,
and transactions, which many people may view as private or sensitive.
Privacy risks of publishing micro-data are well-known. Even if identifying information such as names,
addresses, and Social Security numbers has been removed, the adversary can use contextual and background
knowledge, as well as cross-correlation with publicly available databases, to re-identify individual
data records. Famous re-identification attacks include de-anonymization of a Massachusetts hospital discharge
database by joining it with a public voter database [...]
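The re-identification attack mentioned above is, at bottom, a database join on quasi-identifiers. The sketch below illustrates the idea with entirely synthetic records and invented field names; the real attack joined Massachusetts hospital discharge data with a voter roll on attributes such as ZIP code, birth date, and sex.

```python
# Illustrative linkage attack on synthetic data. All records, names, and
# field names here are invented for this example.

# "Anonymized" medical records: names removed, quasi-identifiers kept.
discharge_records = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-03-12", "sex": "M", "diagnosis": "asthma"},
]

# Public voter roll: names present, same quasi-identifiers.
voter_roll = [
    {"name": "J. Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
    {"name": "A. Smith", "zip": "02144", "dob": "1980-01-01", "sex": "M"},
]

def reidentify(medical, voters):
    """Join the two datasets on the shared quasi-identifiers."""
    matches = []
    for m in medical:
        for v in voters:
            if (m["zip"], m["dob"], m["sex"]) == (v["zip"], v["dob"], v["sex"]):
                matches.append((v["name"], m["diagnosis"]))
    return matches

print(reidentify(discharge_records, voter_roll))
# A unique match links a named voter to a supposedly anonymous diagnosis.
```

No sensitive field was published with a name attached, yet the combination of unremarkable-looking attributes is often unique enough to identify one person.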
We present a very general class of statistical de-anonymization algorithms which
demonstrate the fundamental limits of privacy in public micro-data. We then show how these methods
can be used in practice to de-anonymize the Netflix Prize dataset, a 500,000-record public dataset.
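A simplified sketch in the spirit of the paper's approach is shown below. This is not the authors' actual algorithm, and the dataset, ratings, and threshold are all invented: the idea is that each anonymized record is scored against the adversary's auxiliary knowledge, rare attributes count for more than common ones, and a match is claimed only when the best score clearly stands out from the runner-up.

```python
import math

# Toy anonymized dataset: record id -> {movie: rating}. Synthetic data.
dataset = {
    "r1": {"Movie A": 5, "Movie B": 2, "Obscure Film": 4},
    "r2": {"Movie A": 4, "Movie B": 2},
    "r3": {"Movie B": 1, "Obscure Film": 4, "Movie C": 3},
}

# How many records rate each movie: rare movies are more identifying.
support = {}
for ratings in dataset.values():
    for movie in ratings:
        support[movie] = support.get(movie, 0) + 1

def weight(movie):
    # Rarity weight: attributes shared by few records score higher.
    return 1.0 / math.log(1 + support[movie])

def score(aux, record):
    # Sum the rarity weights of attributes on which the auxiliary
    # information and the record agree (same movie, rating within 1 star).
    return sum(
        weight(m) for m, r in aux.items()
        if m in record and abs(record[m] - r) <= 1
    )

def deanonymize(aux, eccentricity=0.5):
    scores = sorted(((score(aux, rec), rid) for rid, rec in dataset.items()),
                    reverse=True)
    best, runner_up = scores[0], scores[1]
    # Only claim a match if the top score stands out from the second best;
    # otherwise the auxiliary information is not identifying enough.
    if best[0] - runner_up[0] >= eccentricity:
        return best[1]
    return None

# Auxiliary knowledge: the target rated these two movies roughly so.
print(deanonymize({"Obscure Film": 4, "Movie A": 5}))
```

Even imprecise auxiliary knowledge of a handful of ratings can single out one record, because in a sparse dataset few records agree on the same rare attributes.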
The issue is not limited to widely disseminated information. Individuals or special-interest groups may have a legitimate need for micro-data (for example, in health funding policy) yet also have the means to uncover personal data for an unauthorised purpose. This raises several questions:
- are ethics sufficient to protect the privacy of individuals described by such micro-data?
- is the information exposed by statistical de-anonymization sufficiently protected by legislation?
- where would you go for assurance that the data that you are providing is not susceptible to statistical de-anonymization?