Tapping into the hidden power of big search data

By Sam Gilbert, Department of Politics & International Studies, University of Cambridge

What do trundle beds, nail guns and pug insurance have in common? They’re all e-commerce success stories that started with big search data.

On the subject of trundle beds, Richard Tucker, co-founder of homewares retailer WorldStores says: ‘None of us had a clue what one was, but people were typing it into search. So we got them in stock and they sold like hell.’ (It’s a low, wheeled bed, stored under another bed—in case you’re wondering.)

Here’s data scientist Steve Johnston on the opportunities that he found in search data for his client, Screwfix: ‘“Security lighting” they called “outdoor lighting”, “nail guns” they called “nailers”, and “kitchen lighting” they didn’t call anything because they considered, quite reasonably, that the majority of lights could exist in many room types, even though that is exactly how many of their customers search.’

As for pug insurance, that was the first big opportunity I found when mining search data for the then-startup Bought By Many. Dog insurance already existed, of course, but no insurers had thought to develop product features that met the specific needs of pugs. Fast forward six years, and Bought By Many is the UK’s 13th fastest growing tech company (and the number 1 provider of insurance for unusual pets).

Search data in academic journals


Think of internet search data as a vast reservoir of human needs and desires: a kind of collective consciousness. Data-driven entrepreneurs and search marketers know how to draw on it, as these stories show. But what about academic researchers?

To find out, and with help from SAGE Ocean and Cambridge University Library, I searched peer-reviewed journals and found 265 articles that use internet search data in some way.

Public health researchers have been the most enthusiastic adopters of search data. It’s been used to analyze the symptoms of fibromyalgia, understand the seasonality of domestic violence in Finland, and forecast the spread of the West Nile virus in the United States. Kudos to Nicola Bragazzi, an infodemiologist at the University of Genoa, who has co-authored almost 10% of all the papers that use search data.

The most cited articles, meanwhile, are in economics. Here, the main use-case is so-called ‘nowcasting’—that is, understanding changes in consumption, financial markets or tourism before conventional (lagging) indicators become available.

The opportunity for search data in social science

In general, however, social scientists are yet to tap into the exciting potential of search data. Across psychology and sociology journals, there are only three papers that use it to advance causal claims. Only one author, Jon Mellon, has published papers using search data in political science journals.

Why is this? Because there aren’t any interesting puzzles in these fields that search data can help with?

That’s not it. I did some analysis using Google Trends data to test Bernard Manin’s theory of Audience Democracy. Manin suggests that in the age of broadcast media, political parties have become ‘instruments in service of a leader,’ and that electoral competition is therefore driven by personalities more than by policy programs. This seems intuitive when thinking of figures like Macron, Salvini, Duterte or Bolsonaro. But why not test it using data? I compared search volume for parties to search volume for party leaders during the most recent election year in 61 states.

Party:Party Leader Google Ratio (x-axis, log scale) compared to Freedom House Scores 2018 (y-axis)


This analysis, based on the Google searches of hundreds of millions—perhaps billions—of individuals, suggests that parties are more front-of-mind than party leaders in a significant minority of cases. In some cases this is not so surprising—non-presidential states with proportional representation systems in the Nordic countries, for example. Others, like Taiwan, Pakistan and Angola, are much more intriguing. This is the kind of variation that gets scholars of comparative politics hot under the collar.
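For anyone who wants to try a comparison like this, here is a minimal sketch of one way a party-to-leader ratio could be computed. It is not necessarily the method behind the chart. It assumes you have pulled both series from the same Google Trends comparison, so that they share one index, and the numbers and the function name `party_leader_ratio` are made up for illustration.

```python
import math
from statistics import mean


def party_leader_ratio(party_series, leader_series):
    """Return mean party search interest divided by mean leader search interest."""
    party_avg = mean(party_series)
    leader_avg = mean(leader_series)
    if leader_avg == 0:
        raise ValueError("leader series has no search interest to compare against")
    return party_avg / leader_avg


# Made-up indexed values for one hypothetical country's election year.
# Both lists must come from the same Trends comparison to be comparable.
party = [40, 55, 80, 100]
leader = [20, 25, 45, 60]

ratio = party_leader_ratio(party, leader)
log_ratio = math.log10(ratio)  # position on a log-scale axis

print(f"ratio={ratio:.2f}, log10={log_ratio:.2f}")
```

A ratio above 1 (a positive log value) means the party attracted more search interest than its leader over the period. A ratio below 1 means the reverse.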

Inspired by a conversation with Helene Bie Lilleør at Rockwool Fonden in Denmark, I went on to use search data to test a hypothesis about the relationship between social media usage and self-harm (‘selvskade’ in Danish). The analysis showed a striking correlation. There’s not enough in it to start making causal claims, but it ought to at least pique the interest of some sociology and psychology scholars.

Searches for selvskade (lhs) compared to % of population using Instagram and Snapchat every day (rhs), 2013-17


My suggestion is that the main reason social scientists aren’t making more use of search data is that they (you) simply haven’t thought to use it, and/or don’t know where to find it. Let me explain how you can do that.

How social scientists can use search data

The go-to source of this data is Google Trends. It’s a free resource that will give you an indexed volume for any search term over time, going back to 2004. Helpfully, Google has solved the problem of lexical ambiguity for you with the concept of ‘topics’. For example, there are different topics for ‘Apple’ (the technology company) and ‘apple’ (the fruit). You can compare volumes for up to five different search terms or topics, get information on national and regional variations, and download everything in CSV format.
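Once you have a CSV download, loading it takes only a few lines. The sketch below uses Python’s standard library and assumes the export starts with a short preamble line before the header row, and that very low values can appear as ‘<1’. Check this against your own file, since the layout is an assumption. The sample values are invented, and `read_trends_csv` is a hypothetical helper name.

```python
import csv
import io

# Invented sample that mimics the assumed layout of a Trends export.
SAMPLE_EXPORT = """Category: All categories

Week,asylum seeker: (United Kingdom)
2019-01-06,54
2019-01-13,<1
2019-01-20,100
"""


def read_trends_csv(text):
    """Parse an export into (column names, list of row dicts)."""
    lines = text.splitlines()
    # Skip the preamble: the header is assumed to be the first line with a comma.
    start = next(i for i, line in enumerate(lines) if "," in line)
    reader = csv.DictReader(io.StringIO("\n".join(lines[start:])))
    date_column = reader.fieldnames[0]
    rows = []
    for raw in reader:
        row = {}
        for key, value in raw.items():
            if key == date_column:
                row[key] = value  # keep the date as text
            else:
                # Assumption: treat "<1" as 0 so every value is an integer.
                row[key] = 0 if value.startswith("<") else int(value)
        rows.append(row)
    return reader.fieldnames, rows


columns, rows = read_trends_csv(SAMPLE_EXPORT)
print(columns)
print(rows)
```

Remember that the values are an index rather than raw search counts, so they are only comparable within a single download.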

There’s also an extension to Google Trends called Google Correlate. This allows you to upload your own time series data set, and get back a list of the search terms that most strongly correlate with it. (For a detailed primer on using Google Trends and Google Correlate, including some re-usable code, check out Seth Stephens-Davidowitz and Hal Varian’s Hands-on Guide to Google Data.)
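To make the idea concrete, here is a rough sketch of ranking search terms by how closely they track a series of your own. It uses plain Pearson correlation as a stand-in, because Google’s actual method isn’t described here. The term names and numbers are placeholders, and `pearson` and `rank_by_correlation` are hypothetical helpers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(target, candidates):
    """Return (term, correlation) pairs, most strongly positive first."""
    scored = [(term, pearson(target, series)) for term, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder data: your own time series and three candidate search terms.
target = [1, 2, 3, 4, 5]
candidates = {
    "term_a": [2, 4, 6, 8, 10],
    "term_b": [5, 4, 3, 2, 1],
    "term_c": [3, 1, 4, 1, 5],
}

ranking = rank_by_correlation(target, candidates)
for term, r in ranking:
    print(f"{term}: {r:.2f}")
```

A high correlation only tells you that two series move together. With enough candidate terms, some will correlate by chance, so treat the top of the list as a source of hypotheses rather than findings.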

Google Trends is the data source for the overwhelming majority of the 265 peer-reviewed papers mentioned above. If you only do one thing as a result of reading this blog, do this: Spend 15 minutes messing around with Google Trends on topics that are interesting for your research, and see where it takes you.

Search data isn’t just about Google Trends, however. A very valuable thing that Google keeps to itself is the full set of variations on a search term—that is, what exactly people search for when they search on a particular topic. Let’s say you were interested in attitudes towards asylum seekers. You would want to know not just the volume of searches for the term ‘asylum seeker’ over time, but also what exactly people wanted to know when they made the search. Were they hostile, or curious? Were they interested in the policies of political parties, or in asylum seekers’ existing rights and entitlements?

Answer The Public, a fantastic tool built by the SEO agency Propellernet, can help with these kinds of insights. It works by scraping and aggregating search engines’ autocomplete suggestions for searches that are formulated as a question. It’s designed for marketing and PR people looking for new content ideas, but is just as useful for social science researchers. Examples for our current thought experiment include, ‘how are asylum seekers treated in Australia?’ and ‘can asylum seekers work?’ You can run a small number of queries each day for free and download the data in CSV format. A paid subscription is required if you want to run unlimited queries, or run queries in multiple countries.
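If you download the suggestions, a natural first step is to group them by the question word they start with. The sketch below shows one way to do that. It is not a description of how Answer The Public itself works. The list of question words is my own assumption, `group_questions` is a hypothetical helper, and the sample phrases are taken from this post.

```python
from collections import defaultdict

# Assumed list of question openers; extend it to suit your topic.
QUESTION_WORDS = ("how", "can", "what", "why", "where", "when", "who", "which", "are", "will")


def group_questions(suggestions):
    """Group suggestions by their opening question word, dropping non-questions."""
    groups = defaultdict(list)
    for suggestion in suggestions:
        text = suggestion.strip()
        words = text.lower().split()
        if words and words[0] in QUESTION_WORDS:
            groups[words[0]].append(text)
    return dict(groups)


suggestions = [
    "how are asylum seekers treated in Australia?",
    "can asylum seekers work?",
    "difference between refugee and asylum seeker",  # not phrased as a question
]

grouped = group_questions(suggestions)
for word, questions in grouped.items():
    print(word, questions)
```

Grouping like this makes it easier to see whether people are mostly asking about process (‘how’), entitlement (‘can’) or motive (‘why’).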

For researchers, there are two main downsides of Answer The Public as a source of search term variations. The first is its scope; it is limited to question formulations, and only provides a very high-level indication of relative search volume. The second drawback relates to the methodology used for data collection. Since Google removes queries that violate its policies from autocomplete, Answer The Public’s data set will omit the searches that would be most important for answering research questions on racial prejudice, violence, sectarianism and so on. 

Unfiltered search term variation data is, however, available from specialist data companies like Hitwise, Jumpshot, and Similarweb, who collect it anonymously and with permission from panels of millions of internet users. Searching for ‘asylum seeker’ in Hitwise’s database returns 1,146 variations in the last 12 weeks, together with a granular search volume metric for each variation. Some interesting examples in the top 20 search variations by volume include ‘ptsd in asylum seekers ted talk’, ‘difference between refugee and asylum seeker’, and ‘failed asylum seekers amnesty 2019’. Hitwise also enables time-based comparisons.

Unfortunately for scholars, these rich sources of search term variation data are currently only available under licence, and are priced for commercial applications in e-commerce retail and hedge funds—not with academic budgets in mind.

But help is at hand. Dr Gilad Rosner and I are on a mission to make big search data widely available for use in social science research and public policy analysis. We plan to do this by creating a free online source of deep search term variation data, focused on topic areas that are insightful for researchers and policymakers (but have limited commercial value). We’re starting with the topic of LGBT inclusion, and we’d love to hear from you via Twitter or LinkedIn about other research questions you have that big search data might provide answers to.

About Sam Gilbert


An expert in data-driven marketing, Sam was CMO at the insurtech company Bought By Many from startup until 2018. Previously, he was Head of Strategy and Development at Experian. He is currently on sabbatical at the University of Cambridge’s Department of Politics and International Studies.