Digital DNA: How to map our online behavior

By Beatrice Rapisarda and Stefano Cresci

The spread of fake news, rumors and hoaxes, along with malware and viruses, across online social networks has become serious enough to transcend the virtual ecosystem and interfere with our businesses and societies. At present, we are unable to deal with these issues effectively.

A survey conducted in 2012 by Pew Internet Research confirmed that we run a high risk of “distribution of harms” due to the abundance and spread of inaccurate and false information. Furthermore, in 2013, the World Economic Forum listed “massive digital misinformation”, whether intentional or unintentional, as one of the main risks for modern society.

Social data, meaning user-generated content in social media and in other online activities, are being exploited for a myriad of goals: to improve healthcare, sports performance and security, to optimize urban mobility and business processes, and to inform financial trading, to give just a few examples. But without adequate tools, we run the risk that much of the content we rely on is actually fake, possibly created on purpose to mislead algorithms and users alike. In fact, evidence of fake accounts, spam and automated (bot) activity on social platforms is being reported at a growing rate. Should this risk materialize, the real-world consequences would be severe. For example, it has been demonstrated that bots tampered with recent US, Italian, French, Japanese and, to a lesser extent, German political elections, as well as with online discussions about the 2016 UK Brexit referendum.

This problem affects all sectors. During the 2010 earthquake in Chile, for example, rumors spread through Twitter that a volcano had become active and that a tsunami warning had been issued in Valparaiso. These reports were later found to be false, but in the meantime they caused panic among the population.

Moreover, fake news can make or break financial markets. For instance, in 2013, the US International Press Officer's Twitter account was hacked and a false rumor was posted reporting that President Obama had been injured in a terrorist attack. The fake news rapidly caused a stock market collapse that wiped out US$136 billion. Then, in 2014, the previously unknown Cynk Technology briefly became a company worth US$6 billion: automatic trading algorithms detected a fake social discussion and began to invest heavily in the company's shares. By the time analysts noticed the orchestration, the investments had already turned into heavy losses.

  Figure 1: Illustrative figure of Digital DNA.

However, recent advances in theoretical data science, together with the development of big data systems capable of processing the huge volume of data produced by online social networks, give us an unprecedented opportunity to tackle these critical and multidisciplinary issues.

“Taking inspiration from biological DNA, we propose modeling online user behavior with strings of characters representing the sequence of a user’s online actions. Each action type, such as posting new content or following or replying to a user, can be encoded with a different character, just as in DNA sequences, where characters encode nucleotide bases. According to this paradigm, online user actions would represent the bases of their digital DNA”, explains Stefano Cresci.

Different kinds of user behaviors, in fact, can be observed on the Internet, and digital DNA is a flexible and compact way of modeling such behaviors. The flexibility lies in the possibility of choosing which actions form the sequence. For example, digital DNA sequences on Facebook could include a different base for each user-to-user interaction type: comments (C), likes (L), shares (S), and mentions (M). Then, interactions can be encoded as strings formed by such characters according to the sequence of user-performed actions. Similarly, user-to-item interactions on an e-commerce platform could be modeled by using a base for every product category. User purchasing behaviors could be encoded as a sequence of characters according to the category of products they buy. In this regard, digital DNA shows a major difference from biological DNA, where the four nucleotide bases are fixed; in digital DNA, both the number and the meaning of the bases can change according to the behavior or interaction to be modeled. Just like its biological predecessor, digital DNA is a compact representation of information—for example, a Twitter user’s timeline could be encoded as a single string of 3,200 characters (one character per tweet), as shown in the example of Figure 2.
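To make the encoding concrete, here is a minimal sketch built on the Facebook alphabet mentioned above (C, L, S, M). The action names, the `ALPHABET` mapping and the `encode_digital_dna` function are hypothetical names chosen for illustration; they are not part of the DDNA Toolbox.

```python
# Hypothetical alphabet taken from the Facebook example in the text:
# comments (C), likes (L), shares (S), mentions (M).
ALPHABET = {"comment": "C", "like": "L", "share": "S", "mention": "M"}


def encode_digital_dna(actions, alphabet=ALPHABET):
    """Map a chronological list of action names to a digital DNA string."""
    bases = []
    for action in actions:
        if action not in alphabet:
            raise ValueError(f"action {action!r} is not in the alphabet")
        bases.append(alphabet[action])
    return "".join(bases)


# Made-up timeline, for illustration only.
timeline = ["like", "like", "comment", "share", "mention", "like"]
print(encode_digital_dna(timeline))  # LLCSML
```

Because the alphabet is just a dictionary, modeling a different behavior (for instance, one base per product category on an e-commerce platform) only requires passing a different mapping.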

Figure 2: Excerpt of a digital DNA extraction process in Twitter. In digital DNA each user action is associated with a given character, according to a predefined alphabet.


“We exploit digital DNA to study the behavior of groups of users following the intuition that, because of their automated nature, spammers and bots are likely to share more similarities in their digital DNA than a group of heterogeneous genuine users will”, continues Stefano Cresci.

This process is called digital DNA fingerprinting and encompasses four main steps: (i) acquisition of behavioral data; (ii) extraction of DNA sequences; (iii) comparison of DNA sequences; (iv) evaluation. First, datasets of verified spambots and genuine Twitter accounts are created. Then, the digital DNA of the accounts is extracted, that is, each account is associated with a string that encodes its behavioral information. Spammers and bots are found by studying the similarities among the DNA sequences of the investigated accounts.

“We consider similarity as a proxy for automation and, thus, an exceptionally high level of similarity among a large group of accounts serves as a red flag for anomalous behaviors. In particular, we quantify similarity by looking at the Longest Common Substring (LCS) among digital DNA sequences. We show that the similarity, as measured by the LCS, between the DNA sequences of spambots is much higher than that of genuine accounts, and we leverage this distinctive feature to perform our spam and bot detection”, concludes Maurizio Tesconi.
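The comparison step can be illustrated with a short sketch. The function below finds the longest substring shared by every sequence in a group using a simple brute-force search; this is an assumption made for readability, not the authors' implementation, and a realistic dataset would call for a more efficient structure such as a generalized suffix tree. The function name and the example sequences are invented for illustration and are not real account data.

```python
def longest_common_substring(sequences):
    """Return one longest substring that occurs in every sequence.

    Brute force: try every substring of the shortest sequence,
    from the longest candidates down to single characters.
    """
    if not sequences:
        return ""
    shortest = min(sequences, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in seq for seq in sequences):
                return candidate
    return ""


# Made-up digital DNA strings, for illustration only.
bot_like = ["ACACACAC", "CACACACA", "ACACACAT"]
human_like = ["ACTTCAAC", "TTACCATA", "CATCTTAC"]

print(len(longest_common_substring(bot_like)))    # 7
print(len(longest_common_substring(human_like)))  # 2
```

In this toy example the repetitive, automation-like group shares a much longer common substring than the heterogeneous group, which is the kind of gap the quoted intuition relies on.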

The research aims to develop a Digital DNA (DDNA) Toolbox that provides researchers and practitioners from many disciplines with a collection of algorithms, cutting-edge tools, and techniques for analyzing account activity, in order to highlight suspicious (e.g., fake, bot) accounts and unreliable (e.g., fake, unverified) content. The compact digital DNA representation makes it possible to perform both individual and group analyses efficiently. The core techniques that will constitute the DDNA Toolbox have already been employed successfully for the detection of fake and bot accounts in online social networks, for the detection of fake content, and for the analysis of discussion forums.

The theoretical foundations of digital DNA have been developed during the last three years as part of national and European research projects by renowned researchers in web and data science. Digital DNA has now reached an adequate level of maturity to be profitably turned into a useful product for a wide technical and non-technical audience.

Algorithms and techniques at the core of the DDNA Toolbox are listed within the methods catalogue of the European Research Infrastructure for Big Data Analytics (SoBigData.eu). The DDNA Toolbox will be made available as a Python and R library.


The DDNA Toolbox was one of the inaugural winners of the SAGE Ocean Concept Grant program, winning $35,000 to support the development of the project. Read the press release and find out more about the DDNA Toolbox here.