Following the launch of the SAGE Ocean initiative in February 2018, the inaugural winners of the SAGE Concept Grant program were announced in March of the same year. As we build up to this year’s winner announcement, we’ve caught up with the three winners from 2018 to see what they’ve been up to and how the seed funding has helped in the development of their tools.
Here we catch up with Ken Benoit, who developed Quanteda, a large R package originally designed for the quantitative analysis of textual data.
What is Quanteda and how did you start or come up with the idea?
Quanteda is a large R package originally designed for the quantitative analysis of textual data, from which the name is derived. In my SAGE Concept Grant, I have been building a graphical user interface via a web application around this library, an application that we are calling “Quanteda-GUI” (and actively looking for a better name).
The software library is 100% open-source, and grew out of a collection of reusable code that I was using in different research projects back around 2011. At the time, I was using Python for text manipulation (the nltk package) and, once this was completed, dumping the results as a matrix into R for analysis or visualization. As I learned more and more about R, I started to write more functions to move a lot of the “natural language processing” pipeline into R, and gradually phased out my use of Python for this purpose. I also started to use some of the existing packages for NLP in R, at the same time that these began to develop.
Because it is a software library, Quanteda requires the ability to program in R in order to use it. In order to make its functionality accessible to users without R programming knowledge, I designed a graphical user interface via a web application that uses Quanteda as its back-end. That application is the focus of my SAGE Concept Grant.
When I refer to “we”, by the way, I mean a team of co-authors who have worked with me on the project over the years, including especially Kohei Watanabe, without whose massive talent and effort Quanteda would be nothing like it is today, but also Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. Gokhan Ciflikli has worked with me directly on Quanteda-GUI over the past year.
Over the past two years, what were your main challenges and how did you overcome them?
There were two principal technical challenges. First, a web application for text analytics is only as good as the analytical software engine on which it is built. Developing the underlying Quanteda software library that drives the web application took years, and that work continues actively today. At the beginning there were many challenges due to my inexperience in software development in R, which is a high-level language that requires a lot of discipline to use cleanly and efficiently. The biggest early challenge was mainly learning about best practices and workflow for software development, including how to make the code modular, efficient, readable, and easy to extend. Publishing the software on CRAN (the Comprehensive R Archive Network) is only possible after passing a very rigorous checklist, designed to ensure stability and compatibility across numerous platforms and with other R packages.
Writing code efficiently was another key challenge. Most of the modern Quanteda functions have undergone extensive rewriting through various iterations designed to boost every bit of performance possible from the code. In the early days this consisted of algorithmic efficiency within R, but moved soon to writing core functions in C++, and later to rewriting those to use parallel processing for even more speed.
The second major challenge was one of deployment: how can a web application for large-scale use be built on top of an R engine? We knew that the “Shiny” technology developed by RStudio could be used for small-scale web applications, but were unsure that it would provide the basis for an application with user access control, scalable deployment, and linkage to scalable and economically viable cloud computing resources that could be delivered to users via both free and paid tiers. In developing our app, which is built on Shiny but highly customized, we have faced a number of really interesting technical challenges. Our development team, Appsilon Data Science, is based in Poland, has great expertise in this area, and has developed many cutting-edge tools for this purpose.
Why does Quanteda resonate so well with social science researchers?
We get a lot of praise from users (see some testimonials here) for the combination of power and simplicity, as well as our extreme concern with design. This appeals to all of our users, I think, once they come to appreciate that we have written a coherent ecosystem that applies consistent conventions. Quanteda is also extremely flexible. It is designed to allow powerful operations to be performed using the defaults, by users with relatively little experience, but it also allows extensive customization and adaptation by power users through its many options. Finally, it is super fast relative to alternatives, even those in other languages such as Python or Java.
Social science researchers appreciate that Quanteda was designed by social scientists, and hence many of the functions are designed for addressing substantive research purposes. These include: judging the similarity of texts; comparing texts on the basis of readability or lexical diversity; identifying key words in context or key words through statistical associations; extensive plotting and visualization tools; machine learning for text classification; and statistical scaling methods for measuring latent traits from text. Not only are all of the functions extensively documented, but the documentation also contains scholarly references to the source material that motivated their design. When implementing a function we go back to the original source material and try to match its examples in unit tests. Sometimes this is very difficult, such as verifying the correct behavior of readability indices from articles published in the 1950s or even earlier, when natural language processing was all done by hand.
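To make a few of these operations concrete, here is a minimal, language-neutral sketch in plain Python. It is not Quanteda's API or implementation: the function names (`tokenize`, `type_token_ratio`, `keywords_in_context`, `cosine_similarity`), the very simple tokenization rule, and the toy sentences are all assumptions made for illustration. It shows one common lexical diversity measure (the type-token ratio), a basic keywords-in-context search, and cosine similarity between word counts as one way of judging text similarity.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Lexical diversity: number of unique word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def keywords_in_context(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for every match of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two tokenized texts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Two made-up sentences, used only to exercise the functions.
    text_a = tokenize("We should take back control of our laws")
    text_b = tokenize("We should keep control of our future")
    print(type_token_ratio(text_a))
    print(keywords_in_context(text_a, "control"))
    print(cosine_similarity(text_a, text_b))
```

A production library would add much more on top of this, such as proper tokenization, sparse matrices, and many alternative measures, which is where the performance and design work described in this interview comes in.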
Being power users of the software ourselves means we share the perspective and concerns of our user base. After all, we originally developed the software for our own use. In extending the user base to non-programmers via the web application, we hope to reach a much broader category of social scientists. They will be able to trust the results it generates because of the trust built from years of developing the underlying Quanteda software library, for a user base of hundreds of thousands of R programmers who rely on it for text analytics.
Do you have any interesting examples or case studies to share?
As part of the ERC-funded project EUENGAGE, I used Quanteda to analyze tens of millions of tweets we captured about Brexit, from the beginning of 2016 up to the referendum held on 23 June 2016. We used a combination of frequency and keyword analysis, dictionary analysis, and machine learning to classify and compare the language employed by pro-leave and pro-remain social media users. We did all of this analysis using Quanteda, which performed like a champion even when classifying seven million user accounts containing the text of 23 million tweets.
Most of my recent research publications also use the software for analysis. We also encourage other researchers to cite us, so that we not only get academic credit but also have a method of tracking scholarly usage. These are tracked on Google Scholar here.
What sets you apart from other tools and services in this space?
Our web application is built on entirely open-source software, unlike other non-programmer solutions available. Open-source software can be trusted, because it is completely open to scrutiny and verification by experts who want to know exactly how its results are generated. Some competing tools, such as Provalis’s WordStat software, are fantastic alternatives, but are not open-source.
Our application also follows a software-as-a-service model, meaning that all computation is done on a cloud server, and requires only a web browser to use. No installation or updating is required by the user, and the amount of computing power is scalable without respect to the capacity of the user’s computer.
How did you hear about the SAGE Concept Grants, what made you apply, and how did the funding help you bring your idea closer to a fully operational tool that researchers can use?
I heard about this through my contacts with SAGE and a book manuscript on quantitative text analysis that I am working on. I learned that SAGE was interested in exploring novel areas beyond the traditional academic publishing model, such as SAGE Ocean, and in supporting software projects that could aid social science users. As I was developing software while also thinking of ways to reach non-programmers, the SAGE Concept Grant was a perfect fit at a very suitable time.
Where can researchers find your tool and can they use it already?
We’re undergoing one final development cycle in June, and then will start offering free tier usage for people willing to provide feedback. We hope to roll out the application on a much larger scale by September. Our last development cycle is prioritizing three areas: stability, usability, and security.