Earlier this year, Allen AI was announced as the winner of the NYU Coleridge Initiative’s Rich Context Competition. The goal of the competition was to automate the discovery of research datasets, along with the associated research methods and fields, in social science research publications. You can find out about all the finalists and their work here.
We caught up with Allen AI to talk about the work and their involvement in this year’s competition.
Can you tell us a little bit about Allen AI and the work you’re involved with there?
The Allen Institute for Artificial Intelligence (AI2) is a non-profit research institute with a mission of doing high-impact artificial intelligence (AI) research for the common good. We specifically work on the Semantic Scholar and AllenNLP projects. Semantic Scholar is an academic search engine that uses AI techniques to empower researchers to perform more effective and efficient literature review. Some of the distinctive features of Semantic Scholar include extracting numerical results and figures from papers, linking papers to relevant related content like blogs and videos, and predicting the intent of citations. AllenNLP is an open-source, PyTorch-based Python library that enables researchers to more easily perform NLP research.
How did you hear about the competition?
The Semantic Scholar team is always on the lookout for opportunities to engage with the wider research community to address hard problems in this domain, so when we came across the Rich Context Competition (RCC), we considered it a great opportunity for applying and extending our research on analyzing scientific documents.
What is the relationship between the work you do at Allen AI and the Rich Context Competition?
On the Semantic Scholar team, we do lots of research around information extraction and natural language processing (NLP) with the aim of augmenting the scientific literature to assist researchers. Augmenting the scientific literature with structured information about datasets would enable researchers to more easily discover additional information and previous research on new and existing datasets. The RCC was aimed exactly at this task of identifying both new and existing datasets in the literature, so it aligned perfectly with Semantic Scholar’s goals.
Why would knowing who else worked with a dataset, on what topics and with what results help researchers?
In order to perform data-driven research, researchers need to work with high-quality datasets which capture the observations and phenomena under study. However, it is not always easy to find which datasets are relevant for a given research problem, or assess their quality. By reading previously published research on several datasets, researchers could save months of investigation and experiments by homing in on the highest quality datasets relevant to their problem.
This is especially true for AI research, where it is common to focus work on a problem around a particular dataset (or a set of datasets). At the early stages of a research project, we would like to know what datasets already exist that are relevant to our research problem, who else has worked on them, what methods they have used, what results they achieved, and what possible issues with the dataset they may have found. This is all part of a preliminary investigation when starting a research project. The process of searching for these pieces of information is currently quite manual and time consuming, so automated tools to assist this search could both speed it up and make it more complete.
Did you learn anything interesting through participation in this competition?
This competition was a good reminder that many important research problems still lack quality data, and curating these datasets is an expensive and challenging process. Existing techniques still struggle on problems with only a small amount of labeled data. Some of the subtasks in the RCC competition were also extremely hard to evaluate due to the lack of labeled data, which is a fairly common situation when working on a low-resource domain.
What kind of machine learning approach did you adopt in the end, and why?
We ultimately framed this problem using a fairly standard information extraction framework: named entity recognition (NER) followed by entity linking. This involves first identifying pieces of text that are likely to correspond to a dataset, and then, when possible, linking these pieces of text to the real-world dataset that they refer to. More specifically, for NER we used a BiLSTM+CRF model, and for entity linking we used TF-IDF-based candidate generation, followed by a gradient boosted trees binary classifier. Some of the models we used in this competition are available in an open source library which was built here at AI2 for performing efficient modeling of scientific text (https://github.com/allenai/scispacy based on https://spacy.io/).
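To make the candidate-generation step concrete, here is a minimal sketch of TF-IDF similarity between an extracted dataset mention and a catalogue of known dataset names, using only the Python standard library. The function names and the tiny catalogue are illustrative, not AI2's actual code; in their system a gradient boosted trees classifier then decides whether each candidate is a true link.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: count each token once per doc
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def candidates(mention, dataset_names, k=2):
    """Rank known dataset names by TF-IDF similarity to an extracted mention."""
    docs = [name.lower().split() for name in dataset_names]
    vecs = tfidf_vectors(docs + [mention.lower().split()])
    mention_vec, name_vecs = vecs[-1], vecs[:-1]
    scored = sorted(
        zip(dataset_names, (cosine(mention_vec, v) for v in name_vecs)),
        key=lambda p: p[1], reverse=True)
    return [name for name, score in scored[:k] if score > 0]
```

For example, given a catalogue containing "National Longitudinal Survey of Youth", the mention "longitudinal survey of youth" would rank that entry first, and the downstream classifier would then accept or reject the link. In practice, whitespace tokenization would be replaced by character n-grams or a proper tokenizer to handle abbreviations and typos.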
Is there anything you would have done differently?
We would have spent more time with the dataset. The noise in the dataset and distributional shift between the train and test sets made evaluation and iteration quite difficult, and more time spent examining and cleaning the dataset would have helped us try more experimental techniques on the problem. Another direction we could have pursued more is identifying publicly available and naturally occurring datasets which are related to the competition.
Are there other initiatives that you are working on that are of interest to the wider academic community?
We are always working on multiple initiatives at Semantic Scholar that are of interest to the wider academic community. As mentioned above, we built scispaCy (https://github.com/allenai/scispacy), a version of the popular Python NLP library spaCy (https://spacy.io/) optimized for scientific and biomedical text. Right now, we are working on expanding Semantic Scholar to cover all scientific domains, personalized paper recommendations for researchers, and classifying the intent of a citation in a scientific paper, among other things. We encourage you to check out https://www.semanticscholar.org/ and sign up for our mailing list (at the bottom of the homepage) to receive updates on our efforts, and also try out AllenNLP (https://allennlp.org/) if you are interested in doing NLP research.
The Allen Institute for Artificial Intelligence team was: Daniel King, Suchin Gururangan, Christine Betts, Iz Beltagy, Waleed Ammar, Madeleine van Zuylen
Find out more about the Allen Institute for Artificial Intelligence, and the AllenNLP and Semantic Scholar projects.
The data for this competition was provided by ICPSR, Digital Science and SAGE.