By James Siddle
In part one of this series we introduced the topic of automated image tagging and showed how Cloud Vision APIs such as Clarifai can be used to classify images into different categories. We showed examples of SAGE images and the tags assigned by different Cloud Vision APIs, then discussed use cases for this innovative technology—primarily in discoverability and accessibility.
In this follow-on post, we focus on data analysis and specifically co-occurrence networks. By way of example we present a co-occurrence network derived from Clarifai image tags, which represents a kind of mental model of the SAGE journal images we processed. The following image is a visualization of the co-occurrence network that we created:
The article starts with a short introduction to co-occurrence networks—what they are, and what you can learn from them. We'll then take a look at how the visualization was created, what characteristics it has, and then discuss what it tells us about SAGE images and the underlying image tagging techniques.
Data Analysis: Co-occurrence Networks
The visualization above shows co-occurrences between tags assigned to SAGE journal images. A co-occurrence represents, simply put, a link between two things that appear together in some context. For example, if two terms, say Romeo and Juliet, or flour and eggs, appear together in the same sentence, then you can infer that there is a relationship between them.
A co-occurrence network is a collection of concepts and relationships within some scope, such as a body of text, or in our case SAGE journal image tags.
The exact nature of a co-occurrence relationship is not known; you just know that there is a link. As such, co-occurrences are useful for exploring potential relationships and developing new insights into known concepts. They are imprecise, but quick and easy to determine compared to more advanced techniques that attempt to extract knowledge semantics from images or text.
We decided to look at co-occurrences between image tags to attempt to extract a rudimentary "mental model" of SAGE journal topics, based on the images published in journal articles.
As discussed in the previous post, image tags from the Clarifai API provide a richer source of concepts compared to other sources such as the Google Cloud Vision API, so we decided to focus on co-occurrences between Clarifai tags.
Clarifai returns a confidence score for each tag, per image. To filter out noise we set a threshold score of 0.95, meaning that the minimum acceptable confidence level was 95%. This helped to eliminate spurious concepts from the results and led to a cleaner and more understandable mental model.
Whenever two tags were assigned to the same image, we counted that as a co-occurrence between the two concepts, as the following example demonstrates.
To detect the co-occurrences, we wrote a simple Python script. The script's job was to extract co-occurrences from the Clarifai responses, then collect these together into a single dataset of all known concept co-occurrences, with a count per co-occurrence. This was the data for our mental model.
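The script itself isn't reproduced here, but its core logic can be sketched in a few lines. The response structure below is a simplified stand-in for Clarifai's actual JSON (here, just a list of (tag, confidence) pairs per image); the thresholding and pair-counting is the essence of the approach:

```python
from collections import Counter
from itertools import combinations

CONFIDENCE_THRESHOLD = 0.95  # minimum acceptable confidence level (95%)

def extract_cooccurrences(responses, threshold=CONFIDENCE_THRESHOLD):
    """Count tag-pair co-occurrences across a collection of tagged images.

    Each response is assumed to be a list of (tag, confidence) pairs; the
    real Clarifai response structure differs, but the principle is the same.
    """
    counts = Counter()
    for tags in responses:
        # Keep only tags at or above the confidence threshold
        kept = sorted({tag for tag, score in tags if score >= threshold})
        # Every unordered pair of surviving tags is one co-occurrence
        for pair in combinations(kept, 2):
            counts[pair] += 1
    return counts

# Illustrative tag data (not real Clarifai output)
responses = [
    [("medicine", 0.98), ("science", 0.97), ("almanac", 0.60)],
    [("medicine", 0.99), ("science", 0.96)],
    [("education", 0.97), ("business", 0.95)],
]
print(extract_cooccurrences(responses))
```

Note how the low-confidence "almanac" tag is dropped before pairing, so it never contributes a co-occurrence.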
Building the Data Visualization
We fed the dataset of co-occurrences and counts into a tool called Gephi to create the kind of static "nodes and edges" image that is common in network visualization. Gephi is a network visualization and exploratory data analysis tool, ideally suited to an investigation into relationships between image concepts.
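Gephi can import an edge list directly from a CSV file with Source, Target, and Weight columns, so getting the co-occurrence counts into the tool is straightforward. A minimal sketch (the tag names and counts here are illustrative):

```python
import csv
from collections import Counter

# Illustrative co-occurrence counts of the kind produced by the extraction script
cooccurrences = Counter({
    ("medicine", "science"): 12,
    ("education", "business"): 7,
})

# Write an edge list that Gephi's spreadsheet importer understands
with open("cooccurrences.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), count in cooccurrences.items():
        writer.writerow([source, target, count])
```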
Gephi uses circles to display nodes in the network, and these represent the different tags or concepts detected in the images. Gephi draws connections between nodes that correspond to the edges in the network, in this case representing co-occurrence relationships.
To show the relative popularity of concepts and their co-occurrences, the size of nodes and edges in the visualization was adjusted—so bigger circles represent more popular topics, and wider connections represent more frequently occurring co-occurrences. This means that the most topical concepts and relationships for SAGE journal images were highlighted.
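Gephi computes these sizes itself, but one plausible way to derive such a "popularity" measure from the co-occurrence data is a node's weighted degree: the sum of the counts on all edges touching it. A sketch with illustrative tags and counts:

```python
from collections import Counter

# Illustrative co-occurrence counts between tags
cooccurrences = {
    ("medicine", "science"): 12,
    ("medicine", "education"): 4,
    ("education", "business"): 7,
}

# Size each tag by its total co-occurrence weight
# (in network terms, the node's weighted degree)
popularity = Counter()
for (a, b), count in cooccurrences.items():
    popularity[a] += count
    popularity[b] += count

print(popularity.most_common())
```

A tag like "medicine" that co-occurs frequently with many other tags ends up with the largest circle in the visualization.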
The layout of the visualization is based on an algorithm (called Force Atlas 2) that tends to pull related concepts closer together based on relationships, and highlights significant or influential nodes.
The colors show potential communities or clusters of terms, detected based on the structure of the network using a community detection algorithm called the Louvain method. This algorithm finds coherent groups of concepts in the network and is useful for distinguishing different aspects of the data.
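Gephi applies the Louvain method internally when computing modularity, but the same idea can be illustrated outside the tool. The sketch below uses networkx, which provides a Louvain implementation as `louvain_communities`, on a toy weighted co-occurrence graph; the tags and weights are illustrative:

```python
import networkx as nx

# A toy co-occurrence network: two tight clusters joined by one weak edge.
# Edge weights stand in for co-occurrence counts.
G = nx.Graph()
G.add_weighted_edges_from([
    ("medicine", "anatomy", 10), ("medicine", "surgery", 8), ("anatomy", "surgery", 6),
    ("table", "almanac", 9), ("table", "monthly", 7), ("almanac", "monthly", 5),
    ("medicine", "table", 1),  # weak link between the two clusters
])

# Louvain community detection: groups nodes to maximize modularity
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
print(communities)
```

On this toy graph the algorithm separates the medical tags from the table-related tags, which is exactly the kind of grouping the colors in the visualization reflect.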
In this case, the network layout and communities give strong indications of different classes of image used in SAGE journals.
What can we learn from the visualization of our mental model?
First, there are a handful of very popular concepts that appear in many images and are connected to many other concepts, such as Education, Medicine, Science, and Business. These terms correspond nicely with SAGE's focus on scholarly publishing and suggest that the core concepts in the mental model are sound.
The popularity of the Medicine topic is a little surprising, however, as this is a growth area for SAGE. The disproportionate popularity is probably due to the higher accuracy of tags for medical images, which are quite visually distinct, perhaps combined with a greater frequency of images in medical journals.
Second, there are very clear clusters corresponding to classes of image included in SAGE journals. There is a crisply defined cluster of medical image terms to the left, then clusters to the right that correspond to tables, diagrams, and other styles of illustration, as well as some smaller clusters relating to people, geographic concepts, and chemistry concepts.
Unfortunately, the concepts that make up these clusters are in some cases inaccurate or overly vague, which limits the usefulness of this approach in exploring the conceptual space of SAGE journals.
For example, there are many terms such as Almanac or Monthly in the leftmost cluster in pink; an inspection of the underlying images shows that this cluster represents the many tables that appear in SAGE journals, but that the tags in the cluster are imprecise and have little bearing on the content of the tables. A second area where the analysis demonstrates shortcomings in the image tagging is the small cluster of chemistry-focused images near the top. These images contain graph-like visual designs (e.g. a small social network) that can easily be mistaken for a picture of a molecular structure.
As such, one of the main takeaways of this exercise is that co-occurrence networks derived from image tags will be more accurate and meaningful if the images are photographic in nature. Images of tables, diagrams, or similar figures, which often appear in scientific journals, will benefit from other types of analysis, such as Optical Character Recognition, which extracts textual content.
It's worth noting that image tags and associated co-occurrences can likely be used to filter out or identify tables, figures, etc., which could prove to be a useful categorization technique in the absence of image metadata or dedicated image classifiers.
Thankfully, SAGE has metadata that allows us to identify photographic images, so in part three of this blog series we'll present an updated version of the visualization focused solely on photographs, and we'll explore the conceptual space in more detail.
"Half Product Manager, half Software Engineer, with a sprinkling of Data Scientist, James is an independent IT consultant based in London. He works with various companies in the London technology sector, and has a particular interest in scientific research workflows, life science research, and scholarly publishing. He also has a blog called The Variable Tree where he writes about data, from mining and integration, to analysis and visualization."