Action completion

The MAPA Action ran through the middle of the first, second and third waves of the pandemic, ending in December 2021 at the beginning of the sixth (omicron) wave. Undoubtedly, the first lockdowns that took place in Europe from March 2020 had an effect on the way the project had to be managed (fully remote) and, of course, on the in-person gatherings for national dissemination that had to take place.

Nevertheless, and despite the above, MAPA has not experienced any major deviations in its final results, deliverables and objectives. The wider language community and European institutions are aware of the importance of anonymisation, and MAPA has become a reference point as the first truly multilingual open-source anonymisation software available for Public Administrations, with reference Use Cases. The MAPA docker is a solid, usable piece of software that can be deployed as a set of language-independent engines or as a general multilingual tool that can be integrated into document processing or other workflows for general-purpose anonymisation, with a specialism in the legal and health domains.

MAPA can be deployed easily, as all engines are fully dockerised. Its suitability for use at Public Administrations has been independently tested by partner LIMSI (CNRS), a specialist in de-identification software, and tried at the engaged Use Cases (the Spanish Ministry of Justice and Complaints Watch by DG-Justice). The security features are standard HTTPS-compliant and those of a docker system, as use in all cases has been internal to each organisation. An AS4 Domibus-compliant connection is possible, as planned for SaaS implementations. The docker can easily be connected to popular Computer-Assisted Translation (CAT) tools used at Public Administrations to speed up the work of translators, or built upon for full anonymisation during document translation processes, for example. Other uses are those implemented by Complaints Watch (anonymisation of Excel files) and the Spanish MoJ (documents, with document reconstruction taking place externally).
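Because the engines are dockerised, a deployment can be described in a few lines of configuration. The fragment below is only an illustration of such an internal deployment; the service name, image name and port are invented, not MAPA's actual distribution (check the GitLab/ELG release for the real values):

```yaml
# docker-compose.yml (illustrative; image name and port are invented)
services:
  mapa-anonymiser:
    image: mapa/anonymiser:latest   # hypothetical image name
    ports:
      - "8080:8080"                 # engine endpoint, reachable inside the organisation
    networks:
      - internal
networks:
  internal:
    internal: true                  # no outside connections, as in on-premises use
```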

The consortium’s sustainability plan is designed to lead to a potential source of revenue with European Public Administrations as clients, and potentially to fork and further develop the tool to create other distributions and commercial solutions in the future. Furthermore, the Action has engaged several organisations as part of an interest group: members of the language association GALA and users of the European Language Grid. Several language service providers have expressed their interest in obtaining the open-source version on GitLab (December 2021-January 2022). The set of engines is free for European Public Administrations and the general public at large to download and implement (through ELG or GitLab). Consultancy and/or maintenance contracts, as well as anonymisation engine updates, may add to the software's maintenance, product viability and sustainability.

All partners are committed to the maintenance of the product as part of their software portfolio offerings. The anonymisation engines are and will remain an open-source engine system that can also be used as an academic benchmark to further the state of the art. As a set of anonymisation engines, a private MAPA deployment is possible either in a European cloud or on premises when privacy requires no outside connections.

Data Anonymisation and Data Pseudonymisation techniques

Data Anonymisation is the process of encrypting or removing personal data or personally identifiable data from data sets (in order to protect private or sensitive information) so that the person can no longer be identified, directly or indirectly. Identifiers such as contact data, names, addresses, social security numbers, etc. are erased or encrypted so that the individual cannot be traced back from the stored data. When a person cannot be re-identified, the data is no longer considered personal data and GDPR does not apply. Running software such as MAPA's open-source docker makes personal data disappear through anonymisation or pseudonymisation techniques, making it possible for Public Administrations to easily deploy truly multilingual anonymisation on their data sets.
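As a minimal sketch of this idea (the regular expressions and placeholder labels below are illustrative, not MAPA's actual detection rules), direct identifiers can be erased like this:

```python
import re

# Illustrative patterns for two common direct identifiers; a real
# anonymiser needs far broader coverage (names, addresses, IDs, ...).
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d\b"),
}

def mask(text: str) -> str:
    """Replace each matched identifier with an opaque placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(mask("Contact Jane at jane.doe@example.org or +34 612 345 678."))
# -> Contact Jane at <EMAIL> or <PHONE>.
```

Once every identifier is replaced by an opaque placeholder, the record can no longer be traced back to the individual from the stored text alone.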

However, even when data is cleared of identifiers, there is always a possibility that attackers could use de-anonymisation techniques, a form of reverse engineering, to link the anonymised personal data back to the person. Since data may be transferred through multiple sources, some of them available to the public, de-anonymisation techniques can cross-reference those sources and reveal personal information.

The General Data Protection Regulation (GDPR) outlines a specific set of rules that protect citizens and user data and create transparency in information sharing. GDPR is among the strictest data privacy regulations in the world, yet it allows companies to collect anonymised data without consent, use it for any purpose, and store it for an indefinite time, as long as they remove all identifiers from the data.


Data Anonymisation Techniques

  • Data Masking, that is, hiding data with altered values such as placeholders, gaps, etc. You can create a mirror version of a database and apply modification techniques such as word or character substitution, character shuffling or encryption. For example, you can replace a value character with a symbol such as “*” or “x”. This type of irreversible anonymisation makes reverse engineering or detection impossible.
  • Pseudonymization is a data management and de-identification method that replaces private identifiers with pseudonyms (that is, invented or fake identifiers). For example, the table below, showing current Italian ministers, completely identifies each person as the one holding office at a Ministry:

With portfolio

Office or portfolio | Name | Party
Labour and Social Policy | Andrea Orlando | PD
Foreign Affairs and International Cooperation | Luigi Di Maio | M5S
Health | Roberto Speranza | Art. 1
Infrastructure and Transport | Enrico Giovannini | independent
Interior | Luciana Lamorgese | independent
Justice | Marta Cartabia | independent
Culture | Dario Franceschini | PD
Agriculture, Food and Forestry | Stefano Patuanelli | M5S
Ecological Transition | Roberto Cingolani | independent
Universities and Research | Maria Cristina Messa | independent
Education | Patrizio Bianchi | independent
Defence | Lorenzo Guerini | PD
Economy and Finance | Daniele Franco | independent
Economic Development | Giancarlo Giorgetti | Lega

could be pseudonymised with MAPA as

With portfolio

Office or portfolio | Name | Party
Labour and Social Policy | Ana Maria Lynch | G6E9
Foreign Affairs and International Cooperation | Olga Simoneva | B3B0
Health | Ernest van Dyck | C2C
Infrastructure and Transport | Viljar Mälk | m1A2
Interior | Olof Mann | n1A2
Justice | Roberto Rossi | B1A2
Culture | Alexander Owloski | C6E9
Agriculture, Food and Forestry | Jordi Lluch | Q23s
Ecological Transition | Else Frandsen | X7V2
Universities and Research | Alberto Casamonte | D2D3
Education | Cristina Longo | Bg06
Defence | Jennifer Low | 5TGR
Economy and Finance | Aristoteles Myriakos | 2D40
Economic Development | Deborah C. Myers | K1FF

Note that not only the names but also the political affiliations have been substituted in order to mask any trace, and there is no relation between the original gender and the pseudonymised one. Of course, we could apply world knowledge and find out who the Minister for Universities is, and that is why the "Minister" term has also been pseudonymised. Political affiliation has been masked, too, using synthetic random data (see below), so the table could belong to any job category. Pseudonymization preserves statistical accuracy and data integrity, allowing the modified data to be used for training, development, testing, and analytics while protecting data privacy.

  • Data perturbation modifies the original dataset slightly by rounding numbers and adding random noise. The range of values needs to be proportional to the perturbation (years, membership, work, term in office): a small base may lead to weak anonymization, while a large base can reduce the utility of the dataset. For example, you can use a base of 5 or 10 for rounding values like age or house number, because it is proportional to the original value; using multipliers like x25, however, would make memberships, work history, terms in office or people's ages look fake.
  • Synthetic data is algorithmically manufactured information that has no connection to real events. Synthetic data is used to create artificial datasets for names, addresses, political groups or other identifiers instead of altering the original dataset, or using it as is and risking privacy and security. The process involves first creating statistical models, which can be based on general data (people's names, street names, etc.) and patterns found in the original dataset. Alternatively, methods such as standard deviations, medians, linear regression or other statistical techniques can be used to generate the synthetic data.
  • Generalization (not a fully GDPR-compliant technique) removes some of the data to make it less identifiable. For example, the house number in an address can be removed, but not the road name. The purpose is to eliminate some of the identifiers while retaining a measure of data accuracy. Data is thus partly obfuscated into a broad area or set of ranges with appropriate boundaries.
  • Data shuffling, also known as swapping or permutation, rearranges the dataset's attribute values so they no longer correspond with the original records.
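Several of the techniques above can be sketched in a few lines of Python; the records, the pseudonym scheme and the perturbation base are invented purely for illustration:

```python
import random

random.seed(0)  # reproducible for the example

records = [
    {"name": "Andrea Orlando", "age": 52, "party": "PD"},
    {"name": "Luigi Di Maio",  "age": 35, "party": "M5S"},
    {"name": "Marta Cartabia", "age": 58, "party": "independent"},
]

# Pseudonymisation: consistent invented replacements, so the same
# original value always maps to the same pseudonym.
pseudonyms = {}
def pseudonymise(value):
    if value not in pseudonyms:
        pseudonyms[value] = f"PERSON_{len(pseudonyms) + 1:03d}"
    return pseudonyms[value]

# Perturbation: round ages to a small base (5) so aggregates stay close.
def perturb_age(age, base=5):
    return base * round(age / base)

# Shuffling: permute the "party" column so values no longer line up
# with their original rows.
parties = [r["party"] for r in records]
random.shuffle(parties)

for r, party in zip(records, parties):
    r["name"] = pseudonymise(r["name"])
    r["age"] = perturb_age(r["age"])
    r["party"] = party

for r in records:
    print(r)
```

Note that shuffling keeps the column's overall distribution intact (the same three party values still appear), which is what makes the result usable for aggregate statistics while breaking the row-level link.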

Disadvantages of Full Data Anonymisation

GDPR stipulates that websites have to obtain explicit consent from their users to collect personal information such as IP addresses, device IDs, and cookies, something all website managers are aware of nowadays. GDPR tends not to speak about full data anonymisation, as that would have to absolutely guarantee that, even applying external knowledge, the individual or individuals mentioned in a data set could not be identified. Moreover, data anonymisation is an irreversible process, which is why the technique usually applied is pseudonymisation. Commercially speaking, collecting anonymous data and deleting the identifiers from the database limits the ability to derive value and insights from it; however, it allows Public Administrations to comply with Open Data directives and share citizens' data safely. The Multilingual Anonymisation for Public Administrations (MAPA) project has been created with the needs of Public Administrations in mind. In commercial environments, anonymised data clearly loses value, as it cannot be used for marketing efforts or to personalise the website's user experience, and the selling of un-anonymised personal data is strictly forbidden.

Anonymisation and Pseudonymisation

In our first blog post, we examine differences between Anonymisation and Pseudonymisation.

A fundamental principle of the EU's General Data Protection Regulation (GDPR), which came into force on 25 May 2018, is the recommendation to pseudonymise personal data wherever possible. Articles 4, 6, 25, 32, 40 and 89, as well as Recitals 28, 29, 75, 78, 85 and 156 of the GDPR, explicitly mention pseudonymisation. While this is deemed sufficient, it does not preclude other means of data protection (Recital 28). GDPR does not refer to anonymisation anywhere in those articles and recitals.

What do we mean by Anonymisation and Pseudonymisation?

Let's look at an ideal and perfect scenario: anonymization. According to GDPR, anonymisation is the processing of data so that it can no longer be associated with a particular individual. For truly effective anonymisation, it must be impossible for readers of the resulting data to identify the person originally associated with it, even with the help of other knowledge about the anonymised data.

This ideal scenario presents a problem for data controllers and data processors because the data is also rendered useless for most analytics. Let’s have a look at the anonymisation options below.

In the next blog post of our GDPR compliance series, we will review all these techniques and how they can become useful for Public Administrations, the focus of our project, as well as common element-level protection techniques. We will also show and map anonymisation and pseudonymisation to those techniques.

Forgoing the ability to do valuable analytics could be one explanation for GDPR's omission of the term anonymisation. Nevertheless, anonymized data can still be useful for development and testing use cases, and MAPA will provide the three options as part of its deliverables.

Let's consider a potential use at a Public Administration with the table below.

Fig 1 — Table before anonymisation

After anonymisation, the table could look like the one below (Fig 2).

Fig 2 — Table after anonymisation

This provides extreme untraceability and looks very secure, but it is of little use if we plan to process the data in any way, since the process is irreversible.

Now let's consider pseudonymization. Let us assume that the "Department" column is not modified for whatever reason (for example, because it has not been identified as essential in our anonymization procedures, or because we are using software that does not allow us to modify it). The table in Fig 3 results from that action rather than the table in Fig 2.

Fig 3 — Table after Pseudonymisation (Case A).

Note that every cell in the spreadsheet has been modified in Fig 3 just as in Fig 2, except "Department". The assumption is that each department consists of more than one person and, therefore, getting back to the original data is not possible, even with the use of additional external information. However, if Sales or Engineering were a single-person department, then we could single out that person's record. But as all the other values apart from "Department" are also anonymized, obtaining that external information would be useless to anyone with access to the record.

But let's assume our anonymization software does not consider salary information relevant (in fact, it is not a stated requirement of GDPR, CCPA, HIPAA, NIST, the Japanese APPI or the Brazilian LGPD).

Fig 4 — Table after Pseudonymisation (Case B)

If President Janalyn Czerwinski (row 3) had not been in the data set, this pseudonymised Case B would have been equivalent to an anonymized set. However, since the "Salary" column has not been modified, the significant outlier in that column (over €490,000) gives away the identity and salary information of the highest official. To put it simply, the knowledge that the President, or the person with the highest responsibility in an organisation, is likely to earn well above everyone else basically re-identifies the record even though it has been pseudonymised.
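This re-identification risk is easy to demonstrate on a toy dataset (the pseudonyms and salary figures below are invented): with every field pseudonymised except "Salary", the outlier alone singles out one record.

```python
# Invented pseudonymised records: names are fake, salaries untouched.
records = [
    {"name": "PERSON_001", "salary": 61_000},
    {"name": "PERSON_002", "salary": 58_500},
    {"name": "PERSON_003", "salary": 492_000},  # unmodified outlier
    {"name": "PERSON_004", "salary": 63_200},
]

# Attacker's external knowledge: the President is very likely the
# highest-paid person in the organisation.
suspected_president = max(records, key=lambda r: r["salary"])
print(suspected_president)  # the pseudonym no longer protects this record
```

A single unmodified column with a distinctive value is thus enough to undo the pseudonymisation of the rest of the record.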

In all the examples above, since the number of fields transformed is substantial in comparison with the total number of fields, the data, while usable for testing, is rendered useless for meaningful analysis. To draw meaningful conclusions, the fields of interest need to be available without transformation, or at least within the same range, so that aggregate results remain the same.

Organisations, from Public Administrations to corporations, should have an insight into the data they are processing. They should also minimise the use, collection, and retention of PII or Personal Data to what is strictly necessary to accomplish their business purpose, and implement proper procedures and technical and organisational safeguards.

The degree of anonymization, whether a data set is irreversibly anonymized or merely pseudonymized for further processing, and the variety of techniques applied all depend greatly on the nature of the un-transformed data and how much it might reveal. In our example today, that "sufficient additional information" is the logical assumption that the President of an organisation is very likely its highest-paid employee. Additional information might be public, or it might be data available in other tables or data stores in the organization.

As we saw in the discussion above, anonymization and pseudonymization are distinct approaches that protect data as a whole, in the aggregate, and their effects are achieved by applying transformations at the unit (element) level. We will delve into these element-level techniques in the next blog post and map them to anonymization and pseudonymization.

Project Kicks Off in Valencia

The consortium met in Valencia at PangeaMT's facilities to confirm the working plan and work packages, and to discuss the best strategies for data acquisition, word embeddings and multilingual approaches to anonymisation.


Why Anonymize Data?

GDPR has changed how multinational organisations keep and share personal data, and it obliges them to protect citizens' data so that it is not released to third parties.

MAPA anonymisation will provide the means to share language data through a toolkit designed to protect personal or sensitive data. The project will focus on practical applications by justice departments and health authorities (Public Administrations). One of the aims of MAPA is to provide access to data and manage an anonymisation strategy. A byproduct of anonymisation, for example, can be the release of large amounts of anonymised data that can give the community more AI training data.

Most importantly, MAPA will satisfy GDPR requirements at scale. Although no software can guarantee 100% accuracy in anonymization, just as perfect machine translation does not exist (yet), it will make sharing documents while keeping personal details private a straightforward exercise.

Technical Approaches to Anonymisation

At its core, the MAPA anonymisation toolkit will use Named-Entity Recognition and Classification (NERC) techniques powered by Deep Learning with neural networks.

In addition, thanks to the transfer learning capabilities shown by new types of Deep Learning models, new systems can be trained using relatively small datasets of manually labelled data. The knowledge acquired for a given domain or language can be transferred and re-used cross-language or cross-domain. MAPA will be trained to detect named entities that involve sensitive information.
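How NERC output can drive anonymisation is sketched below; `detect_entities` is a stand-in stub for a trained neural model, and the names and labels are invented for illustration, not MAPA's actual API:

```python
# Stub standing in for a trained NERC model: returns (start, end, label)
# spans. A real system would use a fine-tuned neural network instead.
def detect_entities(text):
    spans = []
    for name, label in [("Marta Cartabia", "PERSON"), ("Rome", "LOC")]:
        start = text.find(name)
        if start != -1:
            spans.append((start, start + len(name), label))
    return spans

def anonymise(text):
    """Replace detected entity spans with their labels, working right
    to left so earlier offsets stay valid after each substitution."""
    for start, end, label in sorted(detect_entities(text), reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

print(anonymise("Marta Cartabia chaired the hearing in Rome."))
# -> <PERSON> chaired the hearing in <LOC>.
```

The same replacement logic works unchanged whatever model produces the spans, which is what makes the approach language- and domain-independent once a NERC model is available.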

Use cases

MAPA is committed to implementing Use Cases at a national level through several consortium members. These Use Cases will engage public institutions in Spain, Malta and Latvia, as well as eTranslation as an institution, with the focus on the health domain plus one Use Case in the legal domain, where the Spanish Ministry of Justice has already shown interest in the results. Both domains were selected for their strong anonymisation requirements, as they are sensitive to leaks of personal details.

The system will be tailored to the specific needs of the relevant institution.

MAPA is funded by the Connecting Europe Facility (CEF) programme, under grant No A2019/1927065, and will run from January 2020 until December 2021.