Anonymizing datasets through de-identification and sampling before sharing them has been a central tool for addressing privacy concerns. Re-identifications happen regularly, but how can one be sure that the right person was truly identified? We propose a generative copula-based method that can accurately estimate the likelihood of a specific person being correctly re-identified, even in a heavily incomplete dataset. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR, and they seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
[How to cite] Rocher, L., Hendrickx, J. M., & de Montjoye, Y.-A. (2019). Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications, 10(1), 3069.
[Selected press] New York Times, Guardian, CNBC, The Telegraph, TechCrunch, Technology Review, New Scientist, Gizmodo, Scientific American, RT, Forbes, El Pais (ES), Sueddeutsche Zeitung (DE), Le Soir (FR), La Libre (FR), L’Echo (FR), De Morgen (NL)