What does it mean to anonymize text?

By Bennett Kleinberg, Maximilian Mozes, Toby Davies

Text data are a resource that we are only beginning to understand. Many human interactions are moving to the digital world, and we are becoming increasingly sophisticated in documenting them. Face-to-face encounters are replaced by written communication (e.g., WhatsApp, Twitter), and every crime incident or hospital visit is recorded. All of these interactions leave a trace in the form of text data.

Legal frameworks such as the General Data Protection Regulation (GDPR) prohibit intrusions into an individual’s privacy. That protection is achieved by ensuring that no data containing sensitive information (e.g., identifiers such as names, age, details of crimes, patient dossiers) can be shared without consent. That poses a dilemma for data holders (e.g., police forces, hospitals) who, by adhering to GDPR rules, cannot share their data with those able to harness the potential of text data (typically academic researchers). While there is broad consensus that the solution must lie in text anonymization, little attention has been paid to what it means to anonymize text.

Here, we try to provide perspectives that can help guide anonymization efforts. Specifically, we look at (1) over- and under-anonymization, (2) the issue of data re-identification, and (3) the validation and benchmarking of automated anonymization efforts.

Over- and under-anonymization – How much does a text have to be altered?

The task of anonymizing text represents a difficult challenge, ranging from the requirements for a text document to be considered anonymized to the tools that can be used to achieve "anonymization". Developing a fixed set of steps to algorithmically anonymize text is practically impossible: the vast number of potential rules, and of exceptions to them, renders any rule-based, algorithmic approach technically intractable.

Given the immense difficulty of defining anonymization procedures explicitly, an alternative way of approaching the problem is to provide general guidelines that are applicable across documents of different types and from different backgrounds. In an attempt to enumerate such guidelines, the UK Data Service has published a list of best practices for anonymizing text that should help individuals carry out the anonymization of qualitative data correctly. In this context, the UK Data Service differentiates between the concepts of "under-anonymization" and "over-anonymization". A piece of text is under-anonymized if identifying information (such as names and locations) is only partially removed, or replaced in a way that still allows the described individuals to be re-identified from the document. Over-anonymization, in contrast, can be interpreted as an anonymization that, even though it is strong (in that it becomes impossible to re-identify certain entities), blurs essential information so that the semantics and contextual information in the text are not preserved. We illustrate this with the following example statement:

“Alfred and Annabel have been married for 37 years. They have two children called Mary and Elizabeth. While Mary has a closer relation to Alfred than to Annabel, Elizabeth feels more connected with Annabel and barely speaks to Alfred.”

An under-anonymized version of this statement could be produced, for example, by removing only the names Alfred and Annabel. The result would be:

“[name] and [name] have been married for 37 years. They have two children called Mary and Elizabeth. While Mary has a closer relation to [name] than to [name], Elizabeth feels more connected with [name] and barely speaks to [name].”

However, this anonymized statement leaves sufficient identifying information (such as the 37 years of marriage and the names of the two children) to re-identify Alfred and Annabel, and can hence be considered under-anonymized. If, in contrast, we anonymize the statement by replacing every capitalized word and every number, we obtain:

“X and X have been married for X years. X have two children called X and X. X X has a closer relation to X than to X, X feels more connected with X and barely speaks to X.”

This statement represents an over-anonymization: although it is now challenging to re-identify the individuals, the statement loses essential information and is turned into a highly ambiguous piece of text. For instance, the above anonymization makes it impossible to understand the relationships between Alfred and Annabel and their two daughters.
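To make the contrast concrete, the two rule-based strategies can be mimicked in a few lines of Python. This is a minimal illustrative sketch, not an anonymization tool; the regular expressions simply encode the two rules described above:

```python
import re

statement = (
    "Alfred and Annabel have been married for 37 years. They have two "
    "children called Mary and Elizabeth. While Mary has a closer relation "
    "to Alfred than to Annabel, Elizabeth feels more connected with "
    "Annabel and barely speaks to Alfred."
)

# Under-anonymization: replace only the two known names, leaving other
# identifying details (the children's names, the 37 years of marriage) intact.
under = re.sub(r"\b(Alfred|Annabel)\b", "[name]", statement)

# Over-anonymization: replace every capitalized word and every number,
# destroying the relational structure of the statement along the way.
over = re.sub(r"\b(?:[A-Z][a-z]*|\d+)\b", "X", statement)

print(under)
print(over)
```

Neither rule finds the balance: the first leaks indirect identifiers, while the second strips the semantics that make the text useful for analysis.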

This example illustrates the difficulties associated with the problem of de-identifying individuals in text data. To be practical and useful for further research analysis, a piece of text should be anonymized such that it is neither over- nor under-anonymized: it should be impossible to re-identify the described individuals, while the contextual information needed to retain the semantic properties of the original text is fully preserved. Finding the balance between these two extremes, however difficult it may seem, is essential for reliable text anonymization. But to find this balance, one first needs a means of measuring whether either over- or under-anonymization has occurred for a given statement. What is required, therefore, is a way of assessing whether or not a piece of text is sufficiently anonymized.


Text re-identification – Can texts ever be 100% anonymous?

The litmus test for any anonymization procedure is how easily the original data can be inferred from the anonymized version. Under GDPR, however, the threshold for successful anonymization is even higher, since the regulation protects the individual whom the data are about. The UK Information Commissioner’s Office (ICO) code of practice on data anonymisation states that “it can be enough to be able to establish a reliable connection between particular data and a known individual” (p. 21). Translated to text data, this implies that an anonymized text should not allow for the identification of the individual(s) in the text.

Suppose researchers want to examine whether teenagers’ diaries provide leading indicators of the onset of eating disorders. The data are shared under the premise that the diaries are anonymized. When formulating the specific requirements of that anonymization, one will likely encounter the question of “how much anonymization is enough?” Would it suffice to remove the names of the diaries’ authors? What about the names of friends, frequently mentioned locations, or even very specific and unique details (e.g., the favorite plate with flamingos on it)?

Arguably, if the parents of the teenage diary author were to read a version that has all names removed but still contains the details about the flamingo plate, they might still be able to identify their child. Similarly, a health care professional will possess privileged information about individuals that might allow for their identification even if the direct identifiers in patient dossiers are redacted. While many will agree that direct identifiers do not belong in the public domain and should not be shared between organizations, it is debatable whether data need to be anonymized so thoroughly that even someone with highly privileged information could not re-identify an individual.

To determine the appropriate degree of anonymization, the ICO put forward the motivated intruder test. Here, the anonymized text is subjected to the scenario of “a person who starts without any prior knowledge but who wishes to identify the individual […] [and] is reasonably competent, has access to resources such as the internet, libraries, and all public documents, and would employ investigative techniques” (p. 22). That intruder “is not assumed to have any specialist knowledge […], or to have access to specialist equipment or to resort to criminality such as burglary, to gain access to data that is kept securely” (p. 23). The motivated intruder test thus raises two related questions: who should be unable to re-identify an individual, and in what context does the anonymization take place? In the diary example, the parents possess specialist knowledge, and for a health care professional treating the teenager, it is even essential to identify the individual. The protection against re-identification, then, does not need to extend beyond intruders who are motivated but lack such specialist knowledge.

Although no set of unambiguous rules can be derived from the motivated intruder test, it is of great value for those working on text anonymization. The qualitative nature of text data, in particular, presents the challenge of "unforgiving mistakes": the leakage of seemingly minor details makes it hard to ever anonymize a text to 100%, since there will likely always be someone with in-depth knowledge of an individual who can infer that it must be Anna who always eats from her favorite flamingo plate. Yet this is not the threshold of anonymization.

Thus, while it will be difficult to ever fully anonymize a text without losing the essential characteristics needed for analysis, it is not necessarily required to anonymize for everyone. Instead, the motivated intruder test is a guide for building scenarios that help formulate meaningful anonymization requirements. An idea of how far the protection against re-identification must go is vital for all anonymization efforts. Automated approaches, which are needed to scale the process up, have to consider an additional dimension, however: that of benchmarks.

Validating and setting benchmarks for automated text anonymization

When designing automated text anonymization systems, the question of validation, and of comparing performance against some benchmark, will eventually come up. It is essential to differentiate between the validation of the system and the validation of the system's goal. System validation is the most common approach and typically involves counting how many pieces of information in predefined categories (e.g., names, locations, dates) are identified and altered. While this kind of validation does ascertain how accurately the system performs its job, it does not allow for conclusions about the validity of the anonymization. Validating the system's goal, in contrast, would submit an anonymized text to the motivated intruder test: if a motivated human can re-identify individuals from the text, the anonymization was unsuccessful. Both validation procedures result in a performance metric (e.g., an accuracy score), but they mean fundamentally different things. A system can be 95% accurate in identifying and replacing names/locations/dates but only 40% successful in preventing re-identification.
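The difference between the two metrics can be made concrete with a toy evaluation harness. This is a hypothetical sketch: the data structures, documents, and numbers are invented for illustration, and real goal validation would require actual motivated intruder trials:

```python
from dataclasses import dataclass

@dataclass
class Document:
    gold_entities: set        # identifying spans marked by a human annotator
    removed_entities: set     # spans the anonymization system actually altered
    reidentified: bool        # did a motivated intruder succeed on the output?

def system_accuracy(docs):
    """System validation: share of annotated identifying spans that were altered."""
    found = sum(len(d.gold_entities & d.removed_entities) for d in docs)
    total = sum(len(d.gold_entities) for d in docs)
    return found / total

def goal_success(docs):
    """Goal validation: share of documents that withstood the motivated intruder."""
    return sum(not d.reidentified for d in docs) / len(docs)

# Invented outcome: strong entity-level performance, weak protection.
docs = [
    Document({"Alfred", "Annabel", "37 years"}, {"Alfred", "Annabel"}, True),
    Document({"Mary", "Elizabeth"}, {"Mary", "Elizabeth"}, False),
]
print(system_accuracy(docs))  # 0.8 -> the system looks accurate
print(goal_success(docs))     # 0.5 -> yet half the documents remain re-identifiable
```

The point of the sketch is only that the two scores can diverge: a high entity-level score says nothing, by itself, about protection against re-identification.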

Yet another aspect of automated anonymization tools is comparing their performance against some benchmark, both in identifying and altering information and in preventing re-identification. That question will become increasingly important because it is the critical test of whether automated anonymization can be endorsed to replace human anonymization efforts. Automation has several advantages over manual effort, including the possibility of scaling up in a fraction of the time needed by humans and removing the need for security-vetted individuals, thereby drastically reducing costs. Deciding whether automation is appropriate, in essence, raises the broader question of whether the status quo of manual anonymization is the gold standard.

A system that can automatically remove identifying information will outperform human anonymizers in the speed and reliability of the task. Ideally, one could argue, the automated system would also reach human accuracy in preventing re-identification. If that were the case, all a computerized system would need to do is mirror the human anonymization process: if the human information removal process is automated, the results should be identical. But this argument assumes that human anonymization does its job. To date, there are practically no evaluations of the effectiveness of human anonymization, and the ICO's code of practice is unique in its attempt to provide some overarching guidelines, albeit in an understandably vague manner. Thus, by automating the human process, we might perpetuate a suboptimal idea of anonymization. A promising approach for the future could lie in reversing the definitional logic: rather than assuming that what a human does equals anonymization, we could simply define anonymized texts as those that withstand the motivated intruder test. How exactly that is achieved is then of secondary importance. A system that automatically modifies texts so that they preserve their meaningfulness for further analysis and pass the motivated intruder test would meet today's best practices for text anonymization.

What’s next?

In many contexts, proper automated text anonymization is what stands between the data and the people capable of working with those data. Text anonymization will therefore play a role in the solution to pressing issues. This blog post discussed some definitional pitfalls of text anonymization and highlighted possible means of assessing what makes a good anonymization. With Text Wash, we are committed to contributing to part of the solution by creating a tool that uses natural language processing and machine learning to automatically make text data robust against the motivated intruder test. Because we believe that the validation of any such system is as essential as the system itself, we will make the benchmarking and validation results of our efforts publicly available.

About

Dr Bennett Kleinberg is an assistant professor at the Department of Security and Crime Science and the Dawes Centre for Future Crime at University College London. He is interested in understanding crime and security problems with computational techniques and in behavioral inferences from text data.

Maximilian Mozes is a Ph.D. student at University College London supervised by Lewis Griffin (Department of Computer Science) and Bennett Kleinberg (Department of Security and Crime Science). His research interests lie at the intersection of natural language processing and crime science and his doctoral studies focus on assessing the vulnerabilities of statistical learning techniques operating on textual data.

Dr Toby Davies is an assistant professor in the Department of Security and Crime Science at University College London. His interests lie in the quantitative and computational analysis of crime, with particular focus on spatial analysis and the role of networks in crime.

