“Anonymised data can never be totally anonymous”.1 A major concern with anonymising data is de-anonymisation: the process of combining information from different data sets to re-identify the individuals behind the anonymised data. For instance, a research team at the University of Texas at Austin set out to demonstrate how de-anonymisation could take place with very little information. Using background information available from IMDb, they successfully identified the Netflix records of known users, and thereby uncovered sensitive information about those users, including their apparent political preferences.2

In India, ‘anonymisation’ is dealt with under the Personal Data Protection Bill, 2019. The consequences of personal data falling into the wrong hands, or being handled carelessly by data fiduciaries, can be catastrophic, which is why anonymisation is included in the framework of privacy jurisprudence.

An individual whose personal data is stolen could be exposed to misuse and fraud, including identity theft, phishing attacks and user tracking. Anonymisation or pseudonymisation is used to protect the integrity of an individual's personal data by preventing such malicious use by third parties. So, let us first understand the meaning of these terms.

What is anonymisation?

Personally identifiable information (PII) or sensitive data contains identifying markers such as an individual's name, age, address or date of birth. When personal data is collected, these identifiers enable the data fiduciary or the data controller to relate the personal data back to the individual. Many jurisdictions regulate the use and flow of such personal data by data fiduciaries and data controllers. Examples include the European Union's General Data Protection Regulation (GDPR) and the Indian Personal Data Protection Bill, 2019. The former introduced the concept of ‘pseudonymisation’, while the latter uses the term ‘anonymisation’; each offers its respective procedure to keep personal data from being traced to an identifiable individual.

The Personal Data Protection Bill, 2019 defines ‘anonymisation’ under Clause 3(2) as “…such irreversible process of transforming or converting personal data to a form in which a data principal cannot be identified, which meets the standards of irreversibility specified by the Authority.”

The GDPR, 2016, in turn, defines ‘pseudonymisation’ under Article 4(5) as “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, as long as such additional information is kept separately and subject to technical and organizational measures to ensure non-attribution to an identified or identifiable individual.”
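
To make the idea of ‘additional information kept separately’ concrete, the following is a minimal sketch of one common way to pseudonymise a direct identifier: replacing it with a keyed hash. The records, the field names and the key value are hypothetical, and keyed hashing is only one possible technique; neither the GDPR nor the Bill prescribes it.

```python
import hashlib
import hmac

def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (a pseudonym)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Hypothetical key. In practice it would be generated randomly and stored
# apart from the data set, since it is the "additional information" that
# allows a pseudonym to be linked back to a person.
SECRET_KEY = b"example-key-stored-separately"

# Hypothetical records containing a direct identifier ("name").
records = [
    {"name": "Asha Rao", "diagnosis": "asthma"},
    {"name": "Vikram Shah", "diagnosis": "diabetes"},
    {"name": "Asha Rao", "diagnosis": "migraine"},
]

# The published data set carries the pseudonym instead of the name.
pseudonymised = [
    {"pseudonym": pseudonymise(r["name"], SECRET_KEY), "diagnosis": r["diagnosis"]}
    for r in records
]
```

The same person always receives the same pseudonym, so records can still be linked to each other for analysis. Anyone holding the key can recompute the pseudonym for a known name, which is why pseudonymised data is still treated as relating to an identifiable individual.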

“Anonymisation is not only removing personal information of individuals”

The term “anonymisation” refers to the broad range of techniques and processes that can be used to prevent the identification of the individuals to whom data relates. Anonymisation is the process of turning personal data into anonymous information, such that the individual to whom the data relates is no longer identifiable.

How is it done?

A few practical examples of data anonymisation techniques, provided by the UK Information Commissioner's Office,3 include:

  1. Data reduction
  • Removing variables: The simplest method of anonymisation is the removal of variables which provide direct or indirect identifiers from the data file. These need not necessarily be names; a variable should be removed when it is highly identifying in the context of the data.
  • Removing records: Removing records of particular units or individuals can be adopted as an extreme measure of data protection when the unit is identifiable in spite of the application of other protection techniques.
  • Global recoding: The global recoding method consists of aggregating the values observed in a variable into pre-defined classes (for example, replacing exact ages with age bands). Every record in the table is recoded.
  • Local suppression: Local suppression consists of replacing the observed value of one or more variables in a certain record with a ‘missing' value. E.g.: we can suppress the variable ‘Age' and recode it as ‘missing'.
  2. Data Perturbation (alters values or adds noise)
  • Micro-aggregation: The idea of micro-aggregation is to replace a value with the average computed on a small group of units.
  • Data swapping: Data swapping alters records in the data by switching values of variables.
  • Post-Randomisation Method (PRAM): “Protection method for microdata in which the scores of a categorical variable are changed with certain probabilities into other scores”.4
  • Adding noise: Adding noise consists of adding a random value “n” to all values in the variable that is to be protected.
  • Resampling: Resampling has three steps. First, we have to identify the way that the sensitive or key data variables vary across the whole population. The second step is to generate a distorted sample artificially which has the same parameter values as our estimate. The sample should be the same size as the database. The third step is to replace the confidential data in the database with the distorted sample.
  3. Non-Perturbation Methods (do not alter values)
  • Sampling: Sampling is suitable when the original data is large enough for a sample to be meaningful. Instead of publishing the original data, a sample is taken from it and published without identifiers. The resulting sample may contain sensitive information. However, it may not be traceable to any particular individual.
  • Cross-tabulation of data: When we have a table of data with two or more variables, we can create another table by tabulating two of the variables against each other.
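
Several of the techniques listed above can be illustrated in a few lines of code. The following is a minimal sketch only: the records, the field names, the band width, the group size and the noise scale are all hypothetical choices made for illustration, and the noise is drawn independently for each value, which is one reading of the description above.

```python
import random
from statistics import mean

# Hypothetical records; "name" is a direct identifier.
records = [
    {"name": "Asha", "age": 23, "income": 30000},
    {"name": "Bela", "age": 27, "income": 32000},
    {"name": "Chand", "age": 34, "income": 41000},
    {"name": "Deep", "age": 38, "income": 45000},
    {"name": "Esha", "age": 45, "income": 52000},
    {"name": "Farid", "age": 91, "income": 250000},
]

def remove_variables(rows, variables):
    """Removing variables: drop the named columns from every record."""
    return [{k: v for k, v in row.items() if k not in variables} for row in rows]

def global_recode(rows, variable, width):
    """Global recoding: replace every value with its pre-defined band."""
    recoded = []
    for row in rows:
        low = (row[variable] // width) * width
        recoded.append({**row, variable: f"{low}-{low + width - 1}"})
    return recoded

def local_suppress(rows, variable, should_suppress):
    """Local suppression: blank out one variable only in the risky records."""
    return [
        {**row, variable: None} if should_suppress(row) else dict(row)
        for row in rows
    ]

def micro_aggregate(rows, variable, group_size):
    """Micro-aggregation: replace each value with the mean of its small group.

    Records are returned sorted by the aggregated variable.
    """
    ordered = sorted(rows, key=lambda row: row[variable])
    groups = [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
    # Fold an undersized final group into the previous one.
    if len(groups) > 1 and len(groups[-1]) < group_size:
        tail = groups.pop()
        groups[-1].extend(tail)
    aggregated = []
    for group in groups:
        average = mean(row[variable] for row in group)
        aggregated.extend({**row, variable: average} for row in group)
    return aggregated

def add_noise(rows, variable, scale, rng):
    """Adding noise: add an independent zero-mean random value to each value."""
    return [{**row, variable: row[variable] + rng.gauss(0, scale)} for row in rows]

without_names = remove_variables(records, {"name"})
banded = global_recode(without_names, "age", width=10)
suppressed = local_suppress(without_names, "age", lambda row: row["age"] > 80)
averaged = micro_aggregate(without_names, "income", group_size=3)
noisy = add_noise(without_names, "income", scale=1000, rng=random.Random(0))
```

In this toy data the 91-year-old remains alone in the ‘90-99’ band after global recoding, which illustrates why an outlier may still need local suppression, or removal of the record, even after other techniques have been applied.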

Although anonymisation of data might seem, in theory, like a straightforward method of protecting personal data, it is not so in practice. Sometimes a contextual reference can lead to identification even when the ‘variables’ are not direct ‘identifiers’. The same risk arises with pseudonymisation, where other available data could be combined to identify the individuals behind the pseudonymised data. Thus, these methods are fallible and are not an ideal route for personal data protection.
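
The risk described above can be sketched as a simple linkage between two tables. In this hypothetical example, a data set with names removed is joined to a public list on three indirect identifiers (pincode, year of birth and gender); every record, name and field here is invented for illustration.

```python
from collections import defaultdict

# Hypothetical "anonymised" release: names removed, indirect identifiers kept.
anonymised = [
    {"pincode": "560001", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"pincode": "560001", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
    {"pincode": "110001", "birth_year": 1985, "gender": "F", "diagnosis": "migraine"},
    {"pincode": "110001", "birth_year": 1985, "gender": "F", "diagnosis": "arthritis"},
]

# Hypothetical auxiliary data that is publicly available.
public = [
    {"name": "Asha Rao", "pincode": "560001", "birth_year": 1985, "gender": "F"},
    {"name": "Meera Nair", "pincode": "110001", "birth_year": 1985, "gender": "F"},
]

QUASI_IDENTIFIERS = ("pincode", "birth_year", "gender")

def link(anonymised_rows, auxiliary_rows, keys=QUASI_IDENTIFIERS):
    """Return people whose indirect identifiers match exactly one record."""
    index = defaultdict(list)
    for row in anonymised_rows:
        index[tuple(row[k] for k in keys)].append(row)
    matches = {}
    for person in auxiliary_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches[person["name"]] = candidates[0]
    return matches

reidentified = link(anonymised, public)
```

Here one person is matched to a single record and so to a diagnosis, while the other is not, because two records share the same combination of indirect identifiers. The more records that share each combination, the harder this kind of linkage becomes.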

Some suggested alternatives to anonymisation of data include:

  • “Differential privacy: It is a technique by which information about a dataset is publicly shared by describing groups' patterns within the dataset, while concealing the personally identifiable information.
  • Federated learning: Google introduced the technique in 2017. Federated learning enables researchers to train statistical models based on decentralised servers with a local data set. Meaning, there is no need to upload private data to the cloud or exchange it with other teams. Federated learning is better than traditional machine learning techniques as it mitigates data security and privacy risks.
  • Homomorphic encryption: In this technique, calculations are performed on encrypted data without first decrypting it. Since Homomorphic encryption makes it possible to manipulate encrypted data without revealing the actual data, it has enormous potential in healthcare and financial services where the person's privacy is most important.”5
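
As an illustration of the first alternative, the following is a minimal sketch of one standard way to achieve differential privacy for a counting query: adding Laplace noise scaled to a privacy parameter, conventionally called epsilon. The records, the function names and the value of epsilon are hypothetical, and this simple floating-point version is not suitable for production use.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def private_count(rows, predicate, epsilon, rng):
    """Release a count with Laplace noise calibrated to epsilon.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical records: 3 of the 8 patients have the condition.
patients = [{"has_condition": flag} for flag in (1, 0, 0, 1, 0, 1, 0, 0)]

rng = random.Random(42)
released = private_count(patients, lambda row: row["has_condition"] == 1,
                         epsilon=0.5, rng=rng)
```

A smaller epsilon adds more noise and gives stronger protection, at the cost of a less accurate published figure. The published figure describes the group as a whole, while the presence or absence of any single individual has only a limited effect on it.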

In conclusion, anonymisation and pseudonymisation are commonly used as tools to protect personal data. The process of anonymising data can be either simple or complex, depending on the technique used. While infallible anonymisation is the ideal, it might not be possible any time soon. Even when anonymisation techniques are used, there may still be a risk of the relevant individual being identified. This risk does not mean that the anonymisation technique is ineffective, nor that the data is not effectively anonymised for the purposes of protective legislation. However, the above alternatives could help close the technical loophole of de-anonymisation. Data privacy laws should ideally incorporate a more robust method of protecting personal data, especially when better alternatives are available. It is not too late for India to incorporate a sound alternative to ‘anonymisation’ in its Personal Data Protection Bill, 2019. Such a minor alteration would herald a more advanced data privacy law.

Footnotes

1 https://www.theguardian.com/technology/2019/jul/23/anonymised-data-never-be-anonymous-enough-study-finds

2 https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf

3 https://ico.org.uk/media/1061/anonymisation-code.pdf

4 https://stats.oecd.org/glossary/detail.asp?ID=6954

5 https://analyticsindiamag.com/data-anonymization-is-not-a-fool-proof-method-heres-why/
