Utility-preserving anonymization of textual documents

Fadi Abdulfattah Mohammed Hassan

Author: Fadi Abdulfattah Mohammed Hassan
Thesis supervisors: Josep Domingo i Ferrer (supervisor), David Sánchez Ruenes (supervisor)
Defense: At the Universitat Rovira i Virgili (Spain) in 2021
Language: Spanish
Thesis committee: Vicenç Torra i Reventós (chair), Jordi Castellà Roca (secretary), Francesc Sebé Feixas (member)
Links
- Open-access thesis at: TDX
Abstract
- Text is the most common way to share information in society. Textual data are therefore a crucial resource for many businesses and researchers. For instance, medical histories and clinical notes are needed in medical and pharmacological research, posts on social networks can drive socioeconomic studies, and written opinions and reviews can be used to improve recommender systems. Yet, if textual documents contain sensitive personal information, they cannot be shared with third parties or released publicly without properly protecting the fundamental right to privacy of the individuals to whom the text refers.
  
  Privacy-preserving mechanisms provide ways to sanitize data so that identities and/or confidential attributes are not disclosed. In the last twenty years, a panoply of privacy protection methods have been proposed in the literature, most of them focused on structured data (that is, data that conform to a regular model such as a database schema) and more concretely on numerical attributes. However, little attention has been devoted to unstructured textual data. This contrasts with the fact that the vast majority of data generated nowadays are unstructured. Specifically, unstructured text is the most common form of unstructured data, and it can be found in books, articles, web pages, emails, posts in social networks or clinical reports.
  
  If dealing with structured data may be challenging, protecting unstructured text is even more complex. First, we no longer have a fixed list of attributes: textual data may contain any information, which varies across documents. Furthermore, deciding which part of the text should be protected is much more complex than with structured data: for each piece of text we need to judge whether it can be used for re-identification or may disclose sensitive values. Such a judgment is not easy for a human expert, let alone for a computer program.
  
  In general, accurate protection of textual documents remains a largely manual process. At most, very limited (semi)automatic tools based on named entity recognition (NER) have been designed to remove some of the burden from human experts.
  
  Objectives
  
  In this thesis, we aim to develop methods to automatically anonymize textual data. To that end, we set the following goals:
  
  • To study the privacy threats underlying textual data releases and survey works on data protection framed in the areas of statistical disclosure control (SDC) and privacy-preserving data publishing (PPDP), with a focus on protection methods for unstructured textual data.
  
  • To develop and improve the current machine and deep learning methods (i.e. based on sequence labeling or NER) to tackle the medical document anonymization problem.
  
  • To propose an extension of current NER-based models that is more in line with the notion of privacy as understood in the literature on SDC. To this end, we leverage NER-based methods to detect identifiers, quasi-identifiers and confidential attributes, and we thereafter protect these in-text attributes using standard masking methods.
  
  • To design and develop an integral approach that captures a broader and more accurate notion of privacy and of privacy requirements. To do this, we delve into state-of-the-art linguistic techniques, and more specifically into word embedding models, to automatically detect and mask quasi-identifiers in plain text. The goal is to offer a more flexible, robust and utility-preserving protection of unstructured documents.
  
  • To design new metrics to evaluate the robustness of data protection, the potential disclosure risk and the degree of semantics/utility preservation of the masked outputs.
  
  Contributions
  
  Chapter 3 (Medical document anonymization) tackles the problem of anonymizing medical documents in the Spanish language. We developed two systems, ReCRF and E2EJ. Both were submitted to the MEDDOCAN 2019 contest, where they ranked second and fifth, respectively. ReCRF is a combination of hand-crafted features and automatically generated regular expressions, while E2EJ is an end-to-end model based on deep learning methods. This work resulted in two conference papers.
  
  Chapter 4 (Approaching document anonymization from an SDC perspective) presents a first approach applying the notion of disclosure risk as understood in the literature on SDC to textual documents. The proposal leverages NER-based models to detect quasi-identifiers and/or confidential terms in these documents. Once these terms have been located, we can build a structured representation of the sensitive information contained in the document, which can be anonymized through standard SDC methods (e.g. generalization, suppression, etc.) to keep the disclosure risk under control. This work resulted in one conference paper.
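
  The pipeline described for Chapter 4 can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the detection step has already produced a structured record, and the names `CITY_TO_REGION`, `generalize_age` and `mask_record`, as well as the sample values, are hypothetical.

```python
# Assumption: a detector (NER-based in the thesis) has already extracted the
# sensitive terms of one document into a flat record. Values are made up.
record = {"name": "Maria Lopez", "age": 47, "city": "Tarragona", "diagnosis": "asthma"}

# Hypothetical generalization hierarchy for locations (city -> region).
CITY_TO_REGION = {"Tarragona": "Catalonia", "Reus": "Catalonia"}


def generalize_age(age: int, width: int = 10) -> str:
    """Generalize an exact age to an interval of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def mask_record(rec: dict) -> dict:
    """Suppress the direct identifier and generalize the quasi-identifiers."""
    masked = dict(rec)  # work on a copy; the input record is left untouched
    masked["name"] = "*"  # suppression
    masked["age"] = generalize_age(rec["age"])  # generalization to an interval
    # Generalize the city to its region; suppress it if no mapping is known.
    masked["city"] = CITY_TO_REGION.get(rec["city"], "*")
    return masked


print(mask_record(record))
```

  In this sketch the confidential attribute (`diagnosis`) is kept unchanged to preserve utility, while the attributes that could re-identify the individual are coarsened. Which attributes to mask, and how strongly, would depend on the disclosure-risk model chosen.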
  
  In Chapter 5 (Utility-preserving protection of documents via word embeddings), we introduce a complete framework for document anonymization that leverages word embedding models and ontologies to provide robust and utility-preserving anonymization of textual documents. The presented approach is more general and, at the same time, more flexible than methods based on NER models. The experiments show that the proposed model significantly outperforms NER models. The work in this chapter resulted in a conference paper and an extension to a journal paper.
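
  The abstract does not detail how embeddings and ontologies are combined. As a rough sketch of the general idea only, the snippet below flags terms whose embedding is close to that of a term to be protected and replaces them with a more general concept. The 3-dimensional vectors, the `GENERALIZATIONS` table (standing in for an ontology), the threshold and the function names are all assumptions made for illustration.

```python
import math

# Toy embeddings (assumption): real models use hundreds of dimensions.
TOY_EMBEDDINGS = {
    "diabetes": [0.9, 0.1, 0.0],
    "insulin": [0.8, 0.2, 0.1],
    "glucose": [0.7, 0.3, 0.0],
    "weather": [0.0, 0.1, 0.9],
    "visited": [0.1, 0.9, 0.2],
}

# Hypothetical stand-in for an ontology: term -> more general concept.
GENERALIZATIONS = {"diabetes": "disease", "insulin": "medication", "glucose": "substance"}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mask_related(tokens, protected, threshold=0.9):
    """Replace tokens semantically close to `protected` with a generalization."""
    target = TOY_EMBEDDINGS[protected]
    masked = []
    for tok in tokens:
        vec = TOY_EMBEDDINGS.get(tok)
        if vec is not None and cosine(vec, target) >= threshold:
            # Prefer a generalization (keeps some utility) over plain removal.
            masked.append(GENERALIZATIONS.get(tok, "[MASKED]"))
        else:
            masked.append(tok)  # unrelated or out-of-vocabulary tokens are kept
    return masked


tokens = ["patient", "visited", "for", "insulin", "and", "glucose", "checks"]
print(mask_related(tokens, "diabetes"))
```

  With these toy vectors, "insulin" and "glucose" are replaced by "medication" and "substance", while the remaining tokens are left as they are. Unlike a NER tagger, this kind of check is not tied to a fixed set of entity types, which is the flexibility the chapter claims.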
  
  Conclusions
  
  This thesis has dealt with anonymization methods for unstructured textual data. First, we have focused on improving the current sequence labeling mechanisms (i.e. NER models). Even though our methods outperform the current state of the art in specific tasks of medical document anonymization, they are hampered by the inherent limitations of NER methods applied to data anonymization.
  
  Next, we have shown that, provided that collections of textual documents can be transformed to structured lists of (quasi-)identifiers, standard SDC methods can be applied to enforce more robust anonymization. The detection of (quasi-)identifiers is, however, very challenging for textual documents and, again, relying on NER-based methods severely limits the generality of the approach.
  
  To overcome the shortcomings of NER-based methods, we leveraged the notion of semantic relatedness via word embeddings and the structured knowledge modeled in ontologies. In this way, we were able to build a complete automated framework for textual data anonymization. The empirical work we carried out on real textual data supported our starting hypothesis: by relying on sound semantic tools and resources, textual data can be protected while preserving their utility significantly better than with naive methods like NER-based models.