Quasi-identifier

Quasi-identifiers are pieces of information that are not of themselves unique identifiers, but are sufficiently well correlated with an entity that they can be combined with other quasi-identifiers to create a unique identifier.^[1]

Quasi-identifiers can thus, when combined, become personally identifying information. This process is called re-identification. As an example, Latanya Sweeney has shown that even though neither gender, birth dates nor postal codes uniquely identify an individual, the combination of all three is sufficient to identify 87% of individuals in the United States.^[2]

The term was introduced by Tore Dalenius in 1986.^[3] Since then, quasi-identifiers have been the basis of several attacks on released data. For instance, Sweeney linked health records to publicly available information to locate the then-governor of Massachusetts' hospital records using uniquely identifying quasi-identifiers,^[4]^[5] and Sweeney, Abu and Winn used public voter records to re-identify participants in the Personal Genome Project.^[6] Additionally, Arvind Narayanan and Vitaly Shmatikov discussed on quasi-identifiers to indicate statistical conditions for de-anonymizing data released by Netflix.^[7]

Motwani and Ying warn about potential privacy breaches being enabled by publication of large volumes of government and business data containing quasi-identifiers.^[8]

^ "Glossary of Statistical Terms: Quasi-identifier". OECD. November 10, 2005. Retrieved 29 September 2013.
^ Sweeney, Latanya. Simple demographics often identify people uniquely. Carnegie Mellon University, 2000. http://dataprivacylab.org/projects/identifiability/paper1.pdf
^ Dalenius, Tore. Finding a Needle In a Haystack or Identifying Anonymous Census Records. Journal of Official Statistics, Vol.2, No.3, 1986. pp. 329–336. http://www.jos.nu/Articles/abstract.asp?article=23329 Archived 2017-08-08 at the Wayback Machine
^ Anderson, Nate. Anonymized data really isn’t—and here’s why not. Ars Technica, 2009. https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/
^ Barth-Jones, Daniel C. The're-identification'of Governor William Weld's medical information: a critical re-examination of health data identification risks and privacy protections, then and now. Then and Now (June 4, 2012) (2012).
^ Sweeney, Latanya, Akua Abu, and Julia Winn. "Identifying participants in the personal genome project by name." Available at SSRN 2257732 (2013).
^ Narayanan, Arvind and Shmatikov, Vitaly. Robust De-anonymization of Large Sparse Datasets. The University of Texas at Austin, 2008. https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
^ Rajeev Motwani and Ying Xu (2008). Efficient Algorithms for Masking and Finding Quasi-Identifiers (PDF). Proceedings of SDM’08 International Workshop on Practical Privacy-Preserving Data Mining.

[1] "Glossary of Statistical Terms: Quasi-identifier". OECD. November 10, 2005. Retrieved 29 September 2013.

[2] Sweeney, Latanya. Simple demographics often identify people uniquely. Carnegie Mellon University, 2000. http://dataprivacylab.org/projects/identifiability/paper1.pdf

[3] Dalenius, Tore. Finding a Needle In a Haystack or Identifying Anonymous Census Records. Journal of Official Statistics, Vol.2, No.3, 1986. pp. 329–336. http://www.jos.nu/Articles/abstract.asp?article=23329 Archived 2017-08-08 at the Wayback Machine

[4] Anderson, Nate. Anonymized data really isn’t—and here’s why not. Ars Technica, 2009. https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/

[5] Barth-Jones, Daniel C. The're-identification'of Governor William Weld's medical information: a critical re-examination of health data identification risks and privacy protections, then and now. Then and Now (June 4, 2012) (2012).

[6] Sweeney, Latanya, Akua Abu, and Julia Winn. "Identifying participants in the personal genome project by name." Available at SSRN 2257732 (2013).

[7] Narayanan, Arvind and Shmatikov, Vitaly. Robust De-anonymization of Large Sparse Datasets. The University of Texas at Austin, 2008. https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

[8] Rajeev Motwani and Ying Xu (2008). Efficient Algorithms for Masking and Finding Quasi-Identifiers (PDF). Proceedings of SDM’08 International Workshop on Practical Privacy-Preserving Data Mining.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]