Automated text analysis: Who is the threatening minority?

By Christa Brelsford, Chico Camargo and Teun Cuijpers

News media serves as a window into the society its readership represents. A newspaper’s description of a social group both demonstrates and constructs perceptions of that group within its audience. Understanding long-term trends or spatial differences in the representation of minority groups in news media can contribute to ongoing theoretical debates about the role and perception of minority groups in society. This has been discussed by Esther Greussing and Hajo G. Boomgaarden in a 2017 article analyzing the 2015 refugee crisis in Europe, as well as by Erik Bleich, Hasher Nisar and Rama Abdelhamid in a 2016 article which uses New York Times headlines to consider the effect of terrorist events on media portrayals of Islam and Muslims. 

Automated text analysis of media corpora can contribute to systematic and empirical strategies for following representation of different groups in the media. This method allows studies to incorporate thousands of articles, over long time periods, and across different newspapers, languages, and media audiences—including extension to comparisons between print and digital media. Similarly, the scale at which automated text analysis can work makes it possible to explore intersectional perspectives on social groups, such as looking at female, ethnic or religious minorities.

In a 2015 article in the Journal of Ethnic and Migration Studies, Erik Bleich, Irene Bloemraad, and Els de Graauw call for exactly this type of research, even though they acknowledge that automated analysis will lose much of the theoretical and narrative richness of traditional analytical methods. They also express some concern about the potential difficulty of applying these tools.

In a 12-hour ‘Datathon’ at the International Conference on Computational Social Science at Kellogg Northwestern in July 2018, we demonstrated that automated text analysis methods can quickly and easily be applied to a broad corpus of news media in order to identify trends in sentiment regarding national origin and religion. We aimed to discover whether western attitudes toward different nationalities, as described in news media, have changed over time. For this purpose, we used corpora from both the New York Times, an American daily newspaper, and Der Spiegel, a German weekly magazine.

Initially, we identified a list of 27 countries from all continents, including Germany and the USA, to be tracked over time. We used the New York Times API to acquire articles mentioning these countries, and scraped the Der Spiegel website for its articles ranging from 1950 to 2015. For each country, we collected 10 articles from the New York Times containing the country name for each 5-year time period between 1950 and 2015, totaling 3,780 articles. Similarly, we selected all articles published in Der Spiegel between 1947 and 2016 which mention at least one of the selected countries.

After selecting these articles for each time period, we operationalized the attitudes present in the text as sentiment, measured with SentiStrength. This tool, developed by Mike Thelwall, provides word-level positivity and negativity scores for short texts, as described in a 2017 chapter in Cyberemotions. We applied this tool to all articles in both the American and German corpora, and aggregated positivity and negativity scores to 5-year periods for American articles, and 1-year periods for German articles.
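The aggregation step can be sketched in a few lines of Python. This is a minimal illustration rather than the code used at the Datathon: it assumes each article has already been scored and is available as a (year, positive, negative) tuple, with SentiStrength's usual ranges of 1 to 5 for positivity and -1 to -5 for negativity. The function names, the use of a plain mean, and the sample records are all our own assumptions for the example.

```python
from collections import defaultdict
from statistics import mean


def period_start(year, width=5, origin=1950):
    """Return the first year of the `width`-year bin that contains `year`."""
    return origin + ((year - origin) // width) * width


def aggregate_sentiment(records, width=5, origin=1950):
    """Average positive and negative scores of all articles in each period.

    `records` is an iterable of (year, positive, negative) tuples.
    This layout is hypothetical; it is not SentiStrength's output format.
    """
    bins = defaultdict(list)
    for year, pos, neg in records:
        bins[period_start(year, width, origin)].append((pos, neg))
    return {
        start: (mean(p for p, _ in scores), mean(n for _, n in scores))
        for start, scores in sorted(bins.items())
    }


# Made-up scores, for illustration only.
records = [(1951, 2, -1), (1953, 3, -2), (1957, 1, -4), (1958, 2, -2)]
summary = aggregate_sentiment(records)
for start, (pos, neg) in summary.items():
    print(start, pos, neg)
```

Setting `width=1` gives yearly bins, which corresponds to the 1-year aggregation described for the German articles.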

We found that, in general, text mentioning these country names did not become more negative or more positive over a period of approximately 70 years. There were some exceptions, however; for example, in Der Spiegel, Germany was discussed more positively as the years went on. More interestingly, we used the same method to examine attitudes toward different religions. We found that sentiment towards Christianity remained stable over time. However, we encountered a positive trend in the sentiment towards Judaism, and a negative trend in the sentiment regarding Islam.


Following our sentiment analysis, we also applied word embedding methods to our datasets. Word embedding is a name for a number of linguistic tools which map words and phrases from a document to vectors of real numbers, which in turn allows the application of geometric tools of data analysis. The core idea behind word embedding is that similar words should map to similar vectors, so one can measure the distance between two vectors as a proxy for the similarity in meaning between the two corresponding words. Words such as “cat” and “kitten”, for example, would be closer than “cat” and “umbrella”.
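To make the geometric idea concrete, here is a small sketch using cosine similarity, one common way of comparing word vectors. The three-dimensional vectors below are invented for illustration; real embeddings typically have hundreds of dimensions and are learned from text rather than written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, made up for this example only.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "umbrella": [0.1, 0.2, 0.95],
}

cat_kitten = cosine_similarity(vectors["cat"], vectors["kitten"])
cat_umbrella = cosine_similarity(vectors["cat"], vectors["umbrella"])
print(cat_kitten > cat_umbrella)
```

With these toy numbers, “cat” and “kitten” point in almost the same direction, while “cat” and “umbrella” do not, so the first similarity is the larger of the two.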

In our work, we performed word embedding using word2vec, developed by a team led by Tomas Mikolov, then at Google (a 2013 working paper describing this work by Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean is available on arXiv). Word2vec places words in a multidimensional space according to the contexts in which they occur across a series of texts, so that words used in similar contexts end up close together. After applying word2vec to both English and German corpora, we looked at the word vectors which were the nearest neighbours to a set of chosen keywords. Since word2vec brings together words with similar meanings, this is a way to assess how the meaning of a given word has changed over time, at least in the newspapers we studied. Both SentiStrength and word2vec can be applied to text in many different languages, allowing quantitative analysis of text in languages which the researchers cannot themselves read.
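The nearest-neighbour lookup can be sketched as follows. This is an illustration under stated assumptions, not our Datathon pipeline: it assumes one set of word vectors per time period, stored as plain dictionaries, and the period labels, two-dimensional vectors and function names are all made up. In practice the vectors would come from a trained word2vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_neighbours(embeddings, keyword, k=2):
    """Return the k words whose vectors are most similar to the keyword's vector."""
    target = embeddings[keyword]
    scored = [
        (cosine_similarity(target, vec), word)
        for word, vec in embeddings.items()
        if word != keyword
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


# Hypothetical embeddings for two periods; the numbers are invented.
embeddings_by_period = {
    "period A": {"bomb": [1.0, 0.1], "atomic": [0.9, 0.2], "jungle": [0.1, 1.0]},
    "period B": {"bomb": [0.2, 1.0], "atomic": [0.9, 0.2], "jungle": [0.1, 1.0]},
}

for period, embeddings in embeddings_by_period.items():
    print(period, nearest_neighbours(embeddings, "bomb", k=1))
```

Neighbours are ranked within each period separately. If each period has its own separately trained model, the raw vectors are not directly comparable across periods, but the neighbour lists are.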

Take the word “refugee”, for example: in our Der Spiegel corpus, over time its vector features close to words corresponding to different nationalities, ranging from Pakistani and Indian in the mid-twentieth century, to Palestinian and Turkish, Vietnamese, and finally Afghan in the twenty-first century. After 2015, it also appears close to “Mediterranean”. Historical patterns such as the ones observed for “refugee” can also be observed in the word “bomb”, which features next to “atomic” in the mid-twentieth century, passing by “jungle” during the time corresponding to the Vietnam War, and more recently moving close to words such as “Baghdad” and “Kandahar”, cities respectively in Iraq and Afghanistan.

The strongest patterns of change in word meaning, however, occur for the words “Muslim” and “Islam”. Throughout the twentieth century, they were associated with neutral words such as “religion”, “spiritual” and “tradition”, but from 2000 to 2005, the words nearest to them become “hate”, “violence” and “enemies”. This pattern is not observed for words representing other religions such as “Judaism” or “Christianity”.

Our brief study shows how automated methods of text analysis can detect change in the framing of different groups. These tools can also be easily extended to study the portrayal of other minority groups in other corpora, and to study how their representation in news media has changed both over time, and across space through the use of different national media sources. In this way, these methods can contribute to the discussion of how religions and different nationalities are described and perceived in the news media and popular culture, by allowing fast analysis across long time periods, different nations, languages, and media sources.


Christa Brelsford is the Liane B. Russell Fellow at Oak Ridge National Laboratory in the Geographic Information Science and Technology group. Previously, she was a Postdoctoral Fellow at the Santa Fe Institute. She obtained her Ph.D. from the School of Sustainability at Arizona State University in 2014 for research on the determinants of residential water demand. Christa’s core research goal is to develop empirical methods to understand interactions between human and physical systems, especially in an urban context. She uses empirical methods like spatial analysis, network analysis, and remote sensing to explore the shape and topology of cities and neighborhoods. Christa’s research has been applied to problems of water demand, water institutions, and informal settlement upgrading.

Chico Camargo is a Postdoctoral Researcher in Data Science at the Oxford Internet Institute, University of Oxford. He uses tools from complex systems, data science and evolutionary theory to answer questions in the social sciences, focusing on public opinion dynamics, information dynamics, and human mobility.

Teun Cuijpers is a data scientist at Isatis Health, where he focuses on improving the Dutch pharmaceutical healthcare system with process mining, unsupervised machine learning, and predictive analysis techniques. Teun recently graduated from a research master's degree in behavioral science at Radboud University Nijmegen, with a major in deep learning on social media popularity.