Misinterpretation, privacy and data protection challenges – putting proxy data under the spotlight

written on November 29, 2022

[Republished from the OII blog]

I recently gave evidence to the Centre for Data Ethics and Innovation (CDEI) of the Department for Digital, Culture, Media & Sport, at an expert workshop on risks and benefits of data proxies for bias monitoring held on 23 November 2022.

Automated and algorithmic decision-making systems are increasingly used in the private and public sectors. Researchers have highlighted the potential harms that arise from these systems, such as unfair or illegal decisions, and the difficulty of monitoring them in practice.

The CDEI’s Demographic Data Project noted that deployed systems often lack the demographic data needed to monitor whether and how harms are mitigated in practice. Testing whether deployed systems illegally discriminate based on protected characteristics such as race, gender, religion, or sexuality could be impossible if no socio-demographic data exist to measure how outcomes differ across groups, notably for marginalised groups. The CDEI is currently studying the implications of using inferred, proxy data to monitor such biases when no data exist in the first place.

Proxy measurements have been used extensively in the social sciences to stand in for variables that cannot be measured directly, from the macro level (e.g., satellite images to estimate vegetation) down to the micro level (e.g., inferring individual traits). At the micro level, new methods are regularly proposed to derive proxy data by inferring demographics, attitudes, and other sensitive attributes using statistical and machine learning (ML) techniques. Proxy data can come from simple heuristics, such as inferring someone’s gender from their first name and location. They can also come from complex ML methods, such as when researchers claimed to be able to infer sexual orientation from face photographs [1]. But social science researchers have raised the alarm on multiple occasions, suggesting that proxy data can easily be misinterpreted since the models that produce them simply “interpolate the training data” they learn from [2]. Gelman et al. point out that these methods oversimplify complex, contextual human phenomena (e.g., sexual orientation or gender), create stereotypes, and offer illusory accuracy, notably for marginalised groups [2]. Indeed, even for a simple heuristic such as inferring gender from a first name, a very high accuracy of 98% across the whole UK population will drop to 0 for non-binary people, whose names do not signal their gender in many parts of the world [3].
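To make the point about illusory accuracy concrete, here is a minimal sketch in Python. The counts are hypothetical, chosen only so that the overall figure matches the 98% mentioned above; they are not real UK statistics, and the function names are illustrative. It assumes a name-based heuristic that can only ever output a binary label.

```python
from collections import defaultdict

def overall_accuracy(records):
    """Share of (true_label, predicted_label) pairs where the two match."""
    return sum(true == pred for true, pred in records) / len(records)

def accuracy_by_group(records):
    """Accuracy computed separately for each true label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true, pred in records:
        total[true] += 1
        correct[true] += int(true == pred)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical population of 1,000 people (illustrative numbers only).
# The heuristic never predicts "non-binary", so every such person is mislabelled.
records = (
    [("woman", "woman")] * 490
    + [("man", "man")] * 490
    + [("non-binary", "woman")] * 10
    + [("non-binary", "man")] * 10
)

overall = overall_accuracy(records)
per_group = accuracy_by_group(records)
print(f"overall accuracy: {overall:.2f}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.2f}")
```

Under these assumptions the aggregate number looks excellent while one group is misclassified every single time, which is why reporting accuracy per group matters when proxy data are used for bias monitoring.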

Beyond their controversial utility, proxy data pose additional challenges in terms of privacy and data protection. Inferring information that does not exist or has not been disclosed does not necessarily result in better privacy protection by design.

When it comes to special categories (sensitive data that need more protection), the ICO notes that “if you can infer relevant information with a reasonable degree of certainty then it’s likely to be special category data even if it’s not a cast-iron certainty.” [4] Even inferring information from signals as simple as a first name can attract such scrutiny. The ICO also notes that “if you process such names specifically because they indicate ethnicity or religion, […] then you are processing special category data” [5].

Proxy data may therefore not minimise risk to participants as might be expected, and may require a similar level of protection and transparency as traditional data under UK and European data protection regimes. The CDEI asked whether modern “privacy-enhancing technologies”, such as synthetic data generation, can be used to better protect such sensitive (proxy) data.

Fifty years of research on anonymity, privacy, and privacy-enhancing technologies suggest that there is currently no silver bullet: no method perfectly protects the right to privacy while preserving the utility of collected and inferred data. Re-identification attacks allow attackers to infer sensitive information from individual-level datasets [6], interactive systems where analysts send queries and obtain aggregate answers [7], and even differentially private aggregated datasets [8]. Some of these attacks exploit risks inherent to heuristic defences, others exploit incorrect assumptions, or even bugs in software implementations [9].

Re-identification attacks are practical and occur frequently [10]. In 2016, journalists re-identified politicians in an anonymized browsing history dataset of 3 million German citizens, uncovering their medical information and their sexual preferences. A few months earlier, the Australian Department of Health publicly released de-identified medical records for 10% of the population, only for researchers to re-identify them six weeks later. In 2019, the New York Times re-identified then-US President Trump’s tax records using anonymous public I.R.S. information on top earners. Re-identification attacks on location data have been demonstrated several times on different datasets, such as taxi rides in NYC, bike-sharing trips in London, and GPS trajectories collected by apps such as Grindr.
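One reason attacks on individual-level datasets work is that a handful of seemingly harmless attributes can single a person out even after names are removed. The sketch below is a toy illustration in Python, not a reproduction of any attack cited here: the records, the attribute names, and the `unique_fraction` helper are all hypothetical. It measures the share of records that are unique on a given combination of attributes.

```python
from collections import Counter

def unique_fraction(rows, quasi_identifiers):
    """Fraction of rows whose combination of the given attributes is unique."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)

# Hypothetical "de-identified" records: no names, only coarse attributes.
rows = [
    {"postcode": "OX1", "birth_year": 1980, "gender": "F"},
    {"postcode": "OX1", "birth_year": 1980, "gender": "F"},
    {"postcode": "OX1", "birth_year": 1980, "gender": "M"},
    {"postcode": "OX1", "birth_year": 1975, "gender": "M"},
    {"postcode": "OX2", "birth_year": 1975, "gender": "M"},
    {"postcode": "OX2", "birth_year": 1990, "gender": "F"},
]

# Each extra attribute an attacker knows shrinks the crowd a person can hide in.
for attrs in (["postcode"],
              ["postcode", "birth_year"],
              ["postcode", "birth_year", "gender"]):
    print(attrs, round(unique_fraction(rows, attrs), 2))
```

In this toy dataset nobody is unique on postcode alone, but half the records become unique once birth year is added, and two thirds once gender is added as well. An inferred proxy attribute stored alongside such records is one more column an attacker could combine in the same way.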

Overall, while proxy data can offer evidence of harm, notably for systems in which socio-demographic data were not available, their use calls for heightened transparency and scrutiny. In particular, the inherent privacy risks and limitations of these techniques need to be clearly communicated to the individuals about whom inferences are being made.


  1. Wang, Y. and Kosinski, M., 2018. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. Journal of Personality and Social Psychology, 114(2), p.246.
  2. Gelman, A., Mattson, G.G. and Simpson, D., 2018. Gaydar and the Fallacy of Decontextualized Measurement. Sociological Science, 5, pp.270-280.
  3. Dev, S., Monajatipoor, M., Ovalle, A., Subramonian, A., Phillips, J. and Chang, K.W., 2021, November. Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 1968-1994).
  4. Information Commissioner’s Office, 2022. What is special category data?
  5. Ibid.
  6. Rocher, L., Hendrickx, J.M. and de Montjoye, Y.A., 2019. Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications, 10(1), pp.1-9.
  7. Gadotti, A., Houssiau, F., Rocher, L., Livshits, B. and de Montjoye, Y.A., 2019. When the signal is in the noise: Exploiting Diffix’s Sticky Noise. In 28th USENIX Security Symposium (USENIX Security 19) (pp. 1081-1098).
  8. Houssiau, F., Rocher, L. and de Montjoye, Y.A., 2022. On the difficulty of achieving Differential Privacy in practice: user-level guarantees in aggregate location data. Nature Communications, 13(1), pp.1-3.
  9. Near, J. and Darais, D., 2021. Differential Privacy Bugs and Why They’re Hard to Find. NIST.
  10. Rocher, L., Hendrickx, J.M. and de Montjoye, Y.A., 2019. Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications, 10(1), pp.1-9.