Decision Making in Dataset Anonymization: Protecting the Privacy of Data Subjects while Preserving Analytical Value

This article was written by Junior Researcher Viktor Hargitai and is based on his 2017 Bachelor project of the same title. The Bachelor project was supervised by Irina Shklovski, associate professor in the Technologies in Practice and Interaction Design (IxD) research groups at the IT University of Copenhagen.

This article reflects on my Bachelor project investigating the decision making involved in applications of data anonymization, which I carried out as a Junior Researcher at ETHOS Lab. During my studies on the Global Business Informatics BSc programme at ITU, I had previously collaborated with the Lab as a Junior Researcher on a project focusing on privacy, and I decided that working on my Bachelor thesis in this capacity would be a valuable source of additional inspiration and guidance.

Data anonymization refers to algorithms, definitions, and methods that seek to make the identification of individuals in datasets difficult or impossible, while preserving analytical value. Incorporating contributions from fields like statistics, computer science, cryptography, data mining, and privacy research, new anonymization and de-anonymization approaches are actively being developed.
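To make the idea concrete, consider a minimal sketch of one classic technique, generalization: coarsening quasi-identifiers such as age and postal code so that records become harder to link to individuals while remaining useful for aggregate analysis. The field names and records below are hypothetical illustrations, not taken from the project:

```python
# Generalization: replace exact quasi-identifier values with coarser ones,
# keeping the analytically valuable attribute (here, the diagnosis) intact.
# All field names and values are hypothetical.

def generalize(record):
    decade = (record["age"] // 10) * 10
    return {
        "age_range": f"{decade}-{decade + 9}",   # e.g. 34 -> "30-39"
        "zip_prefix": record["zip"][:3] + "**",  # e.g. "10115" -> "101**"
        "diagnosis": record["diagnosis"],        # preserved analytical value
    }

records = [
    {"age": 34, "zip": "10115", "diagnosis": "asthma"},
    {"age": 37, "zip": "10117", "diagnosis": "diabetes"},
]

for r in records:
    print(generalize(r))
```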

Changing perspectives on the value of personal data, the exponentially growing amount and granularity of the data being captured, and the proliferation of increasingly advanced analytics introduce new privacy risks, of which citizens are more and more aware. Anonymization methods are being applied in several areas of the public and private sectors in order to limit the risks of and concerns about handling personal data, and to comply with data protection regulations. In light of data subjects’ growing concerns about privacy, it is important to evaluate whether data anonymization can truly provide a solution, and an understanding of practical decision making is a prerequisite for doing so.

As the work of prominent authors like Latanya Sweeney and Cynthia Dwork shows, applying contemporary anonymization methods involves complex choices about what is considered sensitive information, and the calibration of algorithmic parameters to find an ideal balance, in a particular context, between reducing privacy risks to data subjects and preserving the analytical value of the published data.
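In Dwork’s differential privacy framework, for instance, that calibration is explicit in a single parameter, the privacy budget epsilon: a smaller epsilon means more noise and stronger protection, but less accurate query results. The sketch below, using hypothetical numbers, shows the trade-off for a simple counting query under the Laplace mechanism:

```python
import numpy as np

def noisy_count(true_count, epsilon):
    # Laplace mechanism: a counting query has sensitivity 1, so the noise
    # scale is 1/epsilon. Smaller epsilon -> stronger privacy, noisier answer.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

true_count = 42  # hypothetical query result, e.g. "how many patients have asthma?"
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: noisy count = {noisy_count(true_count, eps):.1f}")
```

Choosing epsilon is precisely the kind of context-dependent judgment call the project set out to examine: no parameter value is “correct” in the abstract.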

Yet, it became clear early on during my investigation of the topic that there is a striking lack of published research about the way data anonymization – and the crucial decision making involved – is carried out in practice. While the shortage of available information proved to be the greatest challenge throughout my research process, it also highlighted the need for further inquiry in this field.

Due to the scarcity of relevant literature on this relatively underexplored topic, I focused on gathering information from interviews with experts who use or implement anonymization, such as analysts, data scientists, developers, and researchers. I sought to interview as many candidates as possible, across as wide a spectrum as possible in terms of position, field, professional interest, and location, initially reaching out through my extended network in the EU. However, even among the (public and private) organizations whose primary activities involve processing sensitive personal data, very few disclose their use of any anonymization methods. As this became apparent during the background research, I adapted the data collection approach to contact entities where there was a reasonable likelihood of such activities being performed.

In total, after contacting close to 70 persons and organizations, I received answers from 12 entities, as well as several suggestions for other interview candidates. Six interviewees provided complete and relevant responses, each from a different field. Four of the responses were provided by practitioners in the private sector, and two of these interviewees requested that their answers be handled confidentially – thus, only aggregated insights from my research are shared here. While confidentiality imposed limitations on the findings I can share, it enabled me to receive information that would otherwise have been impossible to obtain.

In my analysis, I focused on the interviewees who provided complete answers – i.e. answers relevant to the scope of this study and addressing each of the questions being investigated. Every one of these six interview subjects is engaged in a different field, working with different kinds of data sources. Based on their responses, I argued that the following main conclusions emerged:

Although anonymization has featured in public discourse, been emphasized in personal data regulations, and attracted academic inquiry in multiple disciplines, there seem to be fewer practitioners than this prominence would indicate. One of the most common reasons contacts gave for declining to be interviewed was that they do not perform anonymization. Of particular local relevance are the lack of research on the subject conducted in Denmark and the apparent obstacles in the way of commercial projects.

Furthermore, a large portion of the organizations that engage with anonymization treat the subject as sensitive. Multiple interviewees stated that although they share general insights at professional events, they do not disclose their organization’s anonymization practices in detail, due to security concerns.

Anonymization practices were revealed to be highly heterogeneous, with great differences in decision making between the various fields of application. The interviewees’ organizations contrasted especially in how they select anonymization methods and on what basis they grant access to data. For example, only one interviewee, working in healthcare, performs de-anonymization attacks and computes probabilities to express re-identification risks, reviewing the results before finalizing his approach.
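The interviews did not reveal the exact computations involved, but one common way to express such risk, in line with Sweeney’s k-anonymity, is to estimate each record’s re-identification probability as 1/k, where k is the size of its equivalence class on the quasi-identifiers. A sketch under that assumption, with hypothetical records:

```python
from collections import Counter

def reidentification_risks(records, quasi_identifiers):
    # Each record's risk is 1/k, where k is the number of records sharing its
    # combination of quasi-identifier values (its equivalence class). This is
    # one standard risk measure, not necessarily the interviewee's own method.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]

# Hypothetical records as released after generalization.
released = [
    {"age_range": "30-39", "zip_prefix": "101**"},
    {"age_range": "30-39", "zip_prefix": "101**"},
    {"age_range": "40-49", "zip_prefix": "102**"},  # unique record -> risk 1.0
]

risks = reidentification_risks(released, ["age_range", "zip_prefix"])
print("max risk:", max(risks), "mean risk:", sum(risks) / len(risks))
```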

Echoing some of the ongoing debates about the viability of data anonymization, the interviewees expressed significantly different opinions on this matter. Perspectives ranged from general optimism about the possibilities of emergent anonymization methods all the way to the complete dismissal of conventional anonymization practices and the policies that regulate them. A researcher working with location data and a developer who uses masked data for testing both argued that anonymization fails outright in their domains.

Concluding Remarks

Despite the non-trivial nature of data anonymization, little research has been conducted on how it is applied in practice, especially with regard to decision making, and there is a particularly startling absence of anonymization research in Denmark. Practitioners’ contrasting approaches to decision making and views on the viability of anonymization can be seen as troubling in light of citizens’ growing concerns about privacy and the position of anonymization as a potential solution in discourse and regulation.

In my view, the findings from this project suggest that citizens’ concerns about privacy, the prominence of data anonymization as a solution in data protection regulation and relevant public discourse, and researchers’ efforts to develop ever more advanced anonymization methods are not aligned with how such techniques are applied in practice: very few organizations use them; those that do are somewhat opaque about their practices; their decision making is more ad hoc than standardized; and their approaches to anonymization hardly address the emergent threats posed by, for example, new analytical approaches. I find this concerning, due in part to the significance of anonymization in current and planned data protection regulations.