success stories

Quantifying the likelihood of re-identification of anonymized personal data

Customer

LANTK S.A.M.P.

Sector

Public Administrations

Capabilities used

Development in Python of a tool for calculating the re-identification probability of anonymized personal data, based on the k-anonymity, l-diversity and t-closeness measures.

Situation

The client must provide external entities with data on dependent persons. These data, which are special category data under the General Data Protection Regulation, are anonymized before the transfer so that, a priori, the person to whom they belong cannot be identified. However, there was no quantitative measure of how likely a malicious actor would be to re-identify the dissociated data, or of the inferences and deductions such an actor might draw from them. It was therefore not possible to answer rigorously the question: is the anonymization process sufficient to ensure that the individuals concerned cannot be re-identified?

Tasks

The main objective of the project was to quantify the re-identification probability of an anonymized dataset with more than 40 attributes and more than 39,000 records. The calculation was based on the k-anonymity, l-diversity and t-closeness measures discussed in Opinion 05/2014 on Anonymisation Techniques of the Article 29 Data Protection Working Party.
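As an illustration of how k-anonymity relates to a re-identification probability, the following is a minimal sketch and not the tool developed in the project. It assumes that records sharing the same values for a chosen set of quasi-identifiers form an equivalence class, and that an attacker who knows those values picks the right record with probability 1 divided by the class size. The attribute names, the toy records and the function names are hypothetical.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class: the k for which the data is k-anonymous."""
    return min(equivalence_class_sizes(records, quasi_identifiers).values())

def reidentification_probabilities(records, quasi_identifiers):
    """Per-record probability 1/|class|, assuming the attacker knows the quasi-identifiers."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return [1 / sizes[tuple(r[q] for q in quasi_identifiers)] for r in records]

# Hypothetical toy data; the real dataset has more than 40 attributes.
records = [
    {"age_band": "70-79", "postcode": "480", "dependency_degree": "II"},
    {"age_band": "70-79", "postcode": "480", "dependency_degree": "III"},
    {"age_band": "80-89", "postcode": "480", "dependency_degree": "I"},
    {"age_band": "80-89", "postcode": "481", "dependency_degree": "II"},
    {"age_band": "80-89", "postcode": "481", "dependency_degree": "II"},
]
qi = ["age_band", "postcode"]

print(k_anonymity(records, qi))
print(reidentification_probabilities(records, qi))
```

In this toy example the third record is alone in its equivalence class, so the dataset is only 1-anonymous and that record has a re-identification probability of 1 under the stated assumption.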

Action

Although some software tools perform the anonymization process and calculate certain risk parameters, none provided the level of detail required for the study. For this reason, specific software was developed within the scope of the project to quantify the probability of re-identification of the anonymized data based on the measures indicated above.
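The other two measures can be sketched in the same spirit. The example below is an assumption-laden illustration, not the project's software: it computes distinct l-diversity (the smallest number of different sensitive values found in any equivalence class) and a t-closeness value for a categorical sensitive attribute, using the total variation distance between each class distribution and the overall distribution. That distance coincides with the Earth Mover's Distance when all categories are treated as equally far apart. All names and data are hypothetical.

```python
from collections import Counter, defaultdict

def group_sensitive_values(records, quasi_identifiers, sensitive):
    """Collect the sensitive values of each equivalence class."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return groups

def distribution(values):
    """Relative frequency of each value in a list."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def l_diversity(records, quasi_identifiers, sensitive):
    """Distinct l-diversity: fewest distinct sensitive values in any class."""
    groups = group_sensitive_values(records, quasi_identifiers, sensitive)
    return min(len(set(values)) for values in groups.values())

def t_closeness(records, quasi_identifiers, sensitive):
    """Largest total variation distance between a class and the whole dataset."""
    overall = distribution([r[sensitive] for r in records])
    groups = group_sensitive_values(records, quasi_identifiers, sensitive)
    worst = 0.0
    for values in groups.values():
        local = distribution(values)
        distance = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, distance)
    return worst

# Hypothetical toy data with one sensitive attribute.
records = [
    {"age_band": "70-79", "postcode": "480", "dependency_degree": "II"},
    {"age_band": "70-79", "postcode": "480", "dependency_degree": "III"},
    {"age_band": "80-89", "postcode": "480", "dependency_degree": "I"},
    {"age_band": "80-89", "postcode": "481", "dependency_degree": "II"},
    {"age_band": "80-89", "postcode": "481", "dependency_degree": "II"},
]
qi = ["age_band", "postcode"]

print(l_diversity(records, qi, "dependency_degree"))
print(t_closeness(records, qi, "dependency_degree"))
```

A low l value means that even an attacker who cannot single out one record can still infer the sensitive value of everyone in a class; a high t value means that a class reveals a distribution very different from the population as a whole.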

Result

The calculations showed that, after a first anonymization pass that simply removed directly identifying data, more than 50% of the records could be associated with a single person with 100% certainty if the attacker had sufficient information. Various alternatives were therefore proposed to reduce this percentage to below 1%, each accompanied by its numerical re-identification probabilities. This made it possible to eliminate these uniquely identifiable records without significantly affecting the quality of the data provided. As a result, the persons responsible for data processing had an objective assessment of the effectiveness of the anonymization process carried out prior to the transfer of the data. This gave them the certainty needed to validate the anonymization process within the scope of an impact assessment on the processing of personal data.
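The kind of before-and-after comparison described above can be sketched as follows. This is only an illustration under stated assumptions: it measures the share of records that are unique on a set of quasi-identifiers, then applies one possible mitigation (generalizing exact ages into 10-year bands) and measures again. The source does not say which alternatives were actually proposed; the attribute names, band width, data and function names are hypothetical.

```python
from collections import Counter

def unique_share(records, quasi_identifiers):
    """Fraction of records that are alone in their equivalence class."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return sum(1 for key in keys if sizes[key] == 1) / len(keys)

def generalize_age(record, width=10):
    """Return a copy of the record with the exact age replaced by a band."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Hypothetical toy data in which every record is unique on (age, postcode).
records = [
    {"age": 71, "postcode": "480"},
    {"age": 74, "postcode": "480"},
    {"age": 78, "postcode": "480"},
    {"age": 82, "postcode": "481"},
    {"age": 85, "postcode": "481"},
    {"age": 93, "postcode": "481"},
]
qi = ["age", "postcode"]

before = unique_share(records, qi)
after = unique_share([generalize_age(r) for r in records], qi)
print(f"unique before: {before:.0%}, after generalization: {after:.0%}")
```

In the toy data, generalization leaves one record unique; a further step such as suppressing that record or widening the bands would be needed to remove it, at some cost in data quality.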