Manuel Herranz

Data Anonymisation is the process of encrypting or removing personal data or personally identifiable data (in order to protect private or sensitive information) from data sets so that the person can no longer be identified directly or indirectly. Identifiers such as contact data, names, addresses, social security numbers etc., are erased or encrypted so the individual cannot be traced back to the stored data. When a person cannot be re-identified the data is no longer considered personal data and GDPR does not apply. Running software such as MAPA’s open source docker will make personal data disappear through Anonymisation or Pseudonymisation techniques, making it possible for Public Administrations to easily deploy truly multilingual anonymisation to their data sets.

However, there is always a possibility that even data is cleared of identifiers, attackers could use de-anonymisation techniques in a sort of back-engineering to link the anonymised personal data to the person. Since data usually may be transferred through multiple sources and some can be available to the public, potential de-anonymisation techniques could cross-reference the sources and reveal personal information.

The General Data Protection Regulation (GDPR) outlines a specific set of rules that protect citizens and user data and create transparency in information sharing. GDPR is the strictest data privacy regulation in the world, but it allows companies to collect anonymised data without consent, use it for any purpose, and store it for an indefinite time as long as companies remove all identifiers from the data.

Data Anonymisation Techniques

Data Masking, that is, hiding data with altered values such as placeholders, gaps, etc. You can create a mirror version of a database and apply modification techniques such as word or character substitution, character shuffling or encryption. For example, you can replace a value character with a symbol such as “*” or “x”. This type of irreversible anonymisation makes reverse engineering or detection impossible.
Pseudonymization is a data management and de-identification method that replaces private identifiers with pseudonyms (that is invented or fake identifiers). For example, this table showing current Italian ministers completely identifies the person as the one holding office at a Ministry

Minister
Mit Geschäftsbereich

Amt oder Ressort	Bild	Name	Partei
Arbeit und Sozialpolitik		Andrea Orlando		PD
Auswärtiges und internationale Zusammenarbeit		Luigi Di Maio		M5S
Gesundheit		Roberto Speranza		Art. 1
Infrastruktur und Verkehr		Enrico Giovannini		parteilos
Inneres		Luciana Lamorgese		parteilos
Justiz		Marta Cartabia		parteilos
Kultur		Dario Franceschini		PD
Landwirtschaft, Ernährung und Forstwirtschaft		Stefano Patuanelli		M5S
Ökologischer Übergang		Roberto Cingolani		parteilos
Universitäten und Forschung		Maria Cristina Messa		parteilos
Unterricht		Patrizio Bianchi		parteilos
Verteidigung		Lorenzo Guerini		PD
Wirtschaft und Finanzen		Daniele Franco		parteilos
Wirtschaftliche Entwicklung		Giancarlo Giorgetti		Lega

could be pseudoanonymised with MAPA as

Work
Mit Geschäftsbereich

Amt oder Ressort	Bild	Name	Partei
Arbeit und Sozialpolitik		Ana Maria Lynch		G6E9
Auswärtiges und internationale Zusammenarbeit		Olga Simoneva		B3B0
Gesundheit		Ernest van Dyck		C2C
Infrastruktur und Verkehr		Viljar Mälk		m1A2
Inneres		Olof Mann		n1A2
Justiz		Roberto Rossi		B1A2
Kultur		Alexander Owloski		C6E9
Landwirtschaft, Ernährung und Forstwirtschaft		Jordi Lluch		Q23s
Ökologischer Übergang		Else Frandsen		X7V2
Universitäten und Forschung		Alberto Casamonte		D2D3
Unterricht		Cristina Longo		Bg06
Verteidigung		Jennifer Low		5TGR
Wirtschaft und Finanzen		Aristoteles Myriakos		2D40
Wirtschaftliche Entwicklung		Deborah C. Myers		K1FF

Note that not only names but also political affiliation has been substituted in order to mask any trace. Also, there is no relation between the original gender and the pseudonymised. Of course, we could apply world knowledge and find out who the Minister for Universities is, and that is why the “Minister” term has also been pseudonymised. Political affiliation has been masked, too, using synthetic random data (see below). Thus, the table could belong to any job category. Pseudonymization preserves statistical accuracy and data integrity, allowing the modified data to be used for training, development, testing, and analytics while protecting data privacy.

Data perturbation: modifies the original dataset slightly by applying techniques that round numbers and add random noise. The range of values needs to take into account the proportion to the perturbation (years, membership, work, term in office). A small base may lead to weak anonymization while a large base can reduce the utility of the dataset. For example, you can use a base of 5 or 10 for rounding values like age or house number because it’s proportional to the original value. However, using multipliers like x25 may make the memberships, work history, term in office or people’s age fake.
Synthetic data is algorithmically manufactured information that has no connection to real events. Synthetic data is used to create artificial datasets for names, addresses, political groups or other identifiers instead of altering the original dataset or using it as is and risking privacy and security. The process involves previously creating statistical models which can be based on general data (people’s names, street names, etc.) and patterns found in the original dataset. Alternatively, methods such as standard deviations, medians, linear regression or other statistical techniques can be used to generate the synthetic data.
Generalization (not a full GDPR compliant technique) removes some of the data to make it less identifiable. For example, the house number in an address can be removed, but not the road name. The purpose is to eliminate some of the identifiers whilst retaining a measure of data accuracy. Data is thus partly obfuscated into a broad area or set of ranges with appropriate boundaries.
Data shuffling, also known as swapping or permutation. Here, the dataset attribute values are rearranged so they don’t correspond with the original records.

Disadvantages of Full Data Anonymisation

GDPR stipulates that websites have to obtain explicit consent from their users to collect personal information such as IP addresses, device ID, and cookies, something that all website managers are aware of nowadays. GDPR does not tend to speak about full data anonymisation as that would have to absolutely guarantee that even applying external knowledge the individual or individuals mentioned in a data set could not be identified. Also, data anonymisation is an irreversible process and that is why the technique usually applied is pseudonymisation. Commercially speaking, collecting anonymous data and deleting the identifiers from the database limits the ability to derive value and insights from it. However, it allows Public Administrations to comply with Open Data directives and share citizens’ data safely. The Multilingual Anonymisation for Public Administrations (MAPA) project has been created with the needs of Public Administrations in mind. In commercial environments, anonymised data clearly loses value as it cannot be used for marketing efforts, or to personalize the website’s user experience. The selling of un-anonymised personal data is strictly forbidden.

Author: Manuel Herranz

Data Anonymisation and Data Pseudonymisation techniques

Data Anonymisation Techniques

Disadvantages of Full Data Anonymisation

Anonymisation and Pseudonymisation

What do we mean by Anonymisation and Pseudonymisation