The consortium met in Valencia at PangeaMT’s facilities to confirm the working plan and work packages, and to discuss the best strategies for data acquisition, word embeddings and multilingual approaches to anonymisation.
Why Anonymise Data?
GDPR has changed how multinational organisations keep and share personal data, and it obliges them to protect citizens’ data so that it is not released to third parties.
MAPA anonymisation will provide the means to share language data through a toolkit designed to protect personal or sensitive data. The project will focus on practical applications in Public Administrations such as justice departments and health authorities. One of the aims of MAPA is to provide access to data while managing an anonymisation strategy. A byproduct of anonymisation, for example, can be the release of large amounts of anonymised data, giving the community more AI training data.
Most importantly, MAPA will satisfy GDPR requirements at scale. Although no software can guarantee 100% accuracy in anonymisation, just as perfect machine translation does not exist (yet), MAPA will make sharing documents while keeping personal details private a straightforward exercise.
Technical Approaches to Anonymisation
At its core, the MAPA anonymisation toolkit will use Named-Entity Recognition and Classification (NERC) techniques based on Deep Learning with neural networks.
In addition, thanks to the transfer learning capabilities shown by new types of Deep-Learning models, new systems can be trained using relatively small datasets of manually labelled data. The knowledge acquired for a given domain or language can be transferred and re-used cross-language or cross-domain. MAPA will be trained to detect named entities that involve sensitive information.
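To make the idea concrete, here is a minimal Python sketch of how detected entities could be turned into anonymised text. It is not the MAPA implementation: the regular-expression detector is only a toy stand-in for a trained NERC model, and the names Entity, PATTERNS, detect_entities and anonymise, as well as the EMAIL and DATE labels, are assumptions made for illustration.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    """A detected span: character offsets plus an entity label."""
    start: int
    end: int
    label: str


# Hypothetical patterns standing in for a trained NERC model.
# A real system would predict spans with a neural network instead.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
}


def detect_entities(text: str) -> List[Entity]:
    """Return all pattern matches as entities, ordered by position."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(match.start(), match.end(), label))
    return sorted(entities)


def anonymise(text: str, entities: List[Entity]) -> str:
    """Replace each entity span with a placeholder such as [EMAIL]."""
    pieces = []
    cursor = 0
    for entity in sorted(entities):
        if entity.start < cursor:
            continue  # skip spans that overlap one already replaced
        pieces.append(text[cursor:entity.start])
        pieces.append(f"[{entity.label}]")
        cursor = entity.end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Contact ana@example.org before 03/12/2020."
print(anonymise(sample, detect_entities(sample)))
```

Keeping detection and replacement separate means the toy detector could be swapped for a neural NERC model without touching the replacement step, as long as the model returns character offsets and labels.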
MAPA is committed to implementing Use Cases at a national level through several consortium members. These Use Cases will engage public institutions in Spain, Malta and Latvia, as well as eTranslation as an institution. The focus will be on the health domain and the legal domain, where the Spanish Ministry of Justice has already shown interest in the results. Both domains were selected for their strong anonymisation requirements, as they are particularly sensitive to leaks of personal details.
The system will be tailored to the specific needs of the relevant institution.
MAPA is funded by the Connecting Europe Facility (CEF) programme, under grant No A2019/1927065, and will run from January 2020 until December 2021.