Assembly Fellowship • May 2020 • DISINFORMATION

The spike and the long tail

Filling the Data Void

By EC, Rafiq Copeland, Jenny Fan and Tanay Jaeel

The risks posed by “data voids” – the absence of high-quality, authoritative sources in search results – have only recently been explored in the context of disinformation. As gatekeepers to an information ecosystem built in part from public contributions, Google and Wikipedia (which features prominently in search results) are vulnerable to media manipulators seeking to distort the narrative. This project aims to add more structure and empirical data to the understanding of this challenge with: 1) a Harms Framework to evaluate existing and emerging data voids, and 2) a data-driven method to map the life cycle of data voids across Google search trends, MediaCloud journalistic coverage, and Wikipedia page edits. The life cycle timelines point to a distinction between the "interest spikes" of breaking news data voids, which are quickly filled by mainstream news coverage, and "long tail" data voids, which persist over time and slowly accumulate problematic content. The timelines also highlight the interconnected relationship between mainstream media coverage and organic search in both covering and amplifying disinformation messages, and reinforce Wikipedia's role as a potential future battleground for disinformation campaigns. By providing a theoretical framework for assessing emerging data voids and a programmatic script for analyzing cross-platform media coverage, we are able to better understand the scope of the problem. This work also opens up a number of questions and areas for future research on how to develop appropriate policy responses to emerging data voids.

Monday, February 3rd was a tough day for the Iowa Democratic Caucus. A glitch in a recently launched app to report results meant that voting outcomes were significantly delayed. While the US waited with bated breath, a narrative started to emerge that the faulty app was indicative of vote rigging. The conspiracy theory was encouraged by prominent political figures and news outlets, which led people to search for information online about “Iowa Caucuses Rigged.” This, in turn, led to claims that, in the absence of credible, authoritative information showing the Iowa Caucuses were not rigged, conspiracy theorists and propagandists had successfully hijacked public discourse. Pushing a fringe narrative during such a crucial moment in the US’s political calendar would surely undermine the eventual election results and erode people’s trust in democratic processes.

This and other recent events have ignited questions in American society on how users search for and are influenced by information. From civic trust during the Iowa Caucus to public health and safety during COVID-19, concerns have been raised over the presence of misinformation within important social and political issues, and to what extent this misinformation hijacks the public narrative when credible information is not available to balance it out.

This phenomenon has a name within academic circles: a data void, or a lack of readily available, credible, authoritative information relating to a specific topic (Golebiewski & boyd 2019). In this paper, we set out to better understand the concept of data voids, specifically:

  1. What is the current framing around data voids?
  2. How should we think about the harms posed by data voids?
  3. What data exists to measure and further understand data voids?

Background: How information search works

Before we dive into these questions and into data voids, however, it's important to first step back and understand how users seek and receive information online.

Before search engines, searching for information online required more user-directed methods that offered greater transparency and control at the cost of effort, such as manually looking up indices (Caroll 2014). The invention and popularization of search engines solidified their position as a ubiquitous and indispensable way to navigate the world wide web (Strategic Direction 2014). Google alone is responsible for 87% of all searches online (Clement 2020), and this important role as both gatekeeper and intermediary has not gone unnoticed (Caroll 2014).

By design, search relies on a "vast ecosystem of networked information that is both created and ordered by a crowd of contributors", and search engines rely on proprietary algorithms to present the most relevant content (Graham 2013). This system is held together by high user trust: to cope with the sheer volume of information, most users process search results heuristically rather than systematically, assuming that highly ranked results are automatically more credible and authoritative (Werner 2007).

For the most part, the information ecosystem has indeed been optimized over time for credible, authoritative information. Google Search has made several changes over the last few years to optimize its ranking algorithms for relevance and quality. Webpages that potentially spread hate, cause harm, misinform, or deceive users are rated as not authoritative or trustworthy by Google’s Search Quality Raters, meaning that those webpages will be ranked lower in Google Search than other authoritative, trustworthy sources.

To further support heuristic search and rely on the broader networked information ecosystem, Google introduced the Knowledge Graph feature in 2012 to present facts about entities (e.g. people, places, events) at a glance, with the underlying data pulled from Wikipedia (Singhal 2012). To help curate the quality of its crowdsourced content, Wikipedia refers to a dynamic list of reliable and non-reliable information sources compiled by its community of moderators, and maintains a running list of Controversial Issues to protect sensitive pages against biased information and “edit wars”. Newsrooms are also becoming increasingly aware of the importance of responsible and timely reporting on sensitive events in a decentralized online information environment.


Part 2: What is the current framing around data voids?

Despite recent improvements made in the information ecosystem, not all topics are covered with equal rigor by authoritative, credible sources – nor is it possible to do so. This is where we return to our exploration of data voids.

To explore how data voids are currently framed, we will first examine an oft-cited data void: the term “crisis actor.” Historically, “crisis actors” referred to the use of actors in emergency preparedness training. Crisis actors would play the role of victims or perpetrators to help first responders prepare for disaster scenarios. Over time, the term was co-opted by conspiracy theorists to refer to genuine victims of violent events (mass shootings, terrorist attacks, etc.) in an attempt to prove that an event was staged or did not happen. The co-opted use of “crisis actor” first cropped up prominently amongst conspiracy theorists after the shooting at Sandy Hook Elementary School in Newtown, Connecticut in 2012.

While the facts of the conspiracy itself were quickly debunked, the crisis actor narrative continued and grew. From the time of the Sandy Hook shooting onwards, online interest in the term "crisis actor" remained low but fairly steady, with noticeable spikes around key breaking news events such as the LAX shooting in 2013, the Pulse Nightclub shooting in 2016, and the Las Vegas shooting in 2017. This changed in 2018, when the term “crisis actor” was thrust into the mainstream.

David Hogg is interviewed on CNN to dispute crisis actor claims.

In the wake of the Marjory Stoneman Douglas High School shooting in Parkland, Florida, in which 17 people were killed, a survivor named David Hogg was accused of being a crisis actor, resulting in a huge increase of attention to the term. As this spike in search interest grew, authoritative media sources began to run stories debunking the term “crisis actor” and the related conspiracy theory. On the evening of February 20th, Anderson Cooper interviewed David Hogg on CNN to respond to these conspiracies, which appears to have fed the surge in search interest. By the following morning, as search interest peaked, mainstream media sources like Vox and USA Today continued to debunk the narrative. Once these credible, mainstream sources entered the discussion, an individual searching for the term "crisis actor" – on its own or in relation to David Hogg and the Parkland shooting – would almost certainly be shown accurate content identifying the term as misinformation. At this point we may consider the data void closed due to an abundance of authoritative, credible information.

Google search trends for "crisis actor" (11/1/12 - 4/19/20)

Both the Iowa Caucus and "crisis actor" cases offer a glimpse into the sheer breadth and scope of data voids that can exist. Theoretically, any search query could be used as a coded message or gateway to an unsavory data void, and there have been many coordinated efforts on 4chan to evade content moderation by using common terms such as "Google", "Yahoo", and "Skittles" as racist dog-whistles. In its most harmless form, a data void could emerge around a suddenly viral meme, such as when the term "generously buttered noodles" emerged from the New York Times cooking comment section in January 2019.1

To more adequately scrutinize the impact of a “data void” in search results, we need to better understand the potential harms caused by these voids. Golebiewski and boyd's taxonomy of data voids covers five types:

  • Breaking news events that are quickly filled by news organizations (e.g. the "Sutherland Springs, TX" shooting),
  • Strategic terms that are created by media manipulators to hijack a narrative (e.g. "black on white crimes", referenced by Dylann Roof),
  • Outdated terms that are co-opted for use due to a lack of credible, recent content (e.g. "social justice warriors"),
  • Fragmented concepts or even syntactical differences (Tripodi 2018) that lead to filter bubbles (e.g. "Vatican sexual abuse" vs. "Vatican pedophiles"), and
  • Problematic queries which return controversial results in absence of authoritative content (e.g. "did the Holocaust happen?")2

Part 3: How should we think about the harms posed by data voids?

While these categories help give some structure to a broad theoretical concept, it remains difficult to analyze the harms of any individual data void purely from the "data" of search engine results alone.3 The difficulty of analyzing the most harmful instances of data voids stems from the broader context in which they develop. Search does not occur in a vacuum (or a void) - what prompts users to search for a specific query is potentially as relevant as the results they surface. Often, activity happens across multiple open and closed digital platforms and even other communications media, with cable television a particularly powerful contributor to a developing story.

What's missing from the current understanding of data voids is a harms framework that contextualizes them in a bigger picture. As a starting point, we suggest the following framework to help platforms, journalists, and researchers monitor and assess the importance of emerging data voids.

To get a rough heuristic for expected harm, first refer to these five questions about the subject matter of the data void. Not all categories will be relevant, so weight the relative importance of each category accordingly.

  • Who is the data void affecting?
  • What topic is the data void addressing?
  • Where is the impacted area of the data void?
  • How fast is this data void developing?
  • How long has this data void lasted?

Next, look at the broader media audience for context surrounding an emerging data void:

  • Scope of exposure: What is the audience size of those who might be exposed to and internalize mis/disinformation around a data void?
  • Likelihood of immediate action: How likely is immediate action to be taken as a result of data voids?
  • Severity of impact: How harmful are the consequences of these actions?

This framework draws inspiration from other attempts to define harmful speech, ranging from legal exceptions to free speech protections (e.g. the oft-cited public endangerment of "shouting fire in a crowded theater" or incitement of "imminent lawless action"4) to the Dangerous Speech Project, which identifies a subset of hate speech with greater potential to incite violence.5 Notably, the question of "why" is absent from this framework, as intent is notoriously difficult to assess online. Search results are filtered through an opaque ranking algorithm, making coordinated manipulation attempts even more difficult to spot.

Examples of low- and high-harm signals along each dimension of the framework:

Who
  • Scope of exposure – Low: broad audience (e.g. "NYC lockdown"); High: specific group (e.g. "Chinese virus")
  • Likelihood of imminent action – Low: general public sentiment; High: credible, specific threats
  • Severity of impact – Low: no specific targets; High: targeted hate crimes

What
  • Scope of exposure – Low: mild memes (e.g. "generously buttered noodles"); High: dog-whistling (e.g. "cut the tall trees")
  • Likelihood of imminent action – Low: no suggested call to action; High: dog-whistling (e.g. "cut the tall trees")
  • Severity of impact – Low: shared in-group jokes; High: radicalization, terrorism, public health risks

Where
  • Scope of exposure – Low: stable, harmonious societies; High: conflict areas (e.g. warzones)
  • Likelihood of imminent action – Low: no pre-existing tensions; High: strained existing tensions (e.g. "Notre Dame fire Muslims")
  • Severity of impact – Low: localized confusion; High: exacerbates existing conflicts

How fast
  • Scope of exposure – Low: slow, well fact-checked; High: breaking news
  • Likelihood of imminent action – Low: many existing authorities; High: vacuum of credible sources
  • Severity of impact – Low: well-vetted (e.g. elections); High: high info disorder, distrust

How long
  • Scope of exposure – Low: short-lived viral content; High: long-running, recurring interest
  • Likelihood of imminent action – Low: low relevance to any group; High: highly valued by a community (e.g. anti-vaxxers)
  • Severity of impact – Low: publicly available archives; High: closed conspiracies (e.g. "white genocide")
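To make the rubric concrete, here is a toy scoring sketch of how the three axes might be combined (our own illustration with hypothetical ratings and weights, not a validated instrument), echoing the guidance above that not all categories will be equally relevant:

```python
# Toy illustration of the harms framework (hypothetical weights and ratings,
# not a validated instrument): rate each axis from 0.0 (low) to 1.0 (high),
# then combine with weights reflecting how relevant each axis is to this void.
HARM_AXES = ("scope_of_exposure", "likelihood_of_imminent_action", "severity_of_impact")

def harm_score(ratings, weights):
    """Weighted average of axis ratings; higher scores suggest higher priority."""
    total = sum(weights[axis] for axis in HARM_AXES)
    return sum(ratings[axis] * weights[axis] for axis in HARM_AXES) / total

# Example: a breaking-news void reaching a broad audience but carrying a weak
# call to action (all values are illustrative only).
ratings = {"scope_of_exposure": 0.9,
           "likelihood_of_imminent_action": 0.3,
           "severity_of_impact": 0.5}
weights = {"scope_of_exposure": 1.0,
           "likelihood_of_imminent_action": 0.5,
           "severity_of_impact": 1.0}
print(f"priority = {harm_score(ratings, weights):.2f}")  # 0.62
```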

Part 4: What data can we use to understand data voids?

If we take a step back, we see a common theme across all the different types of data voids we have discussed to date: each has two potential states that it can exist in (or switch between). The first state is an interest spike, when a topic experiences high-velocity, “breaking news” style attention (such as "crisis actor" after the Parkland shooting). The second state is the long tail, before or after the spike, when the narrative around a topic persists and develops over a long stretch of time under the public radar, with a relatively low volume of search activity (such as "crisis actor" before the Parkland shooting). Examples of both categories are offered in Golebiewski and boyd's initial paper introducing data voids, such as “was the Holocaust real?” as a low-lying query that has simmered across the internet for years, and “Sutherland Springs” as a query catapulted into the mainstream during a mass shooting in November 2017.

In each of these states, there is potential harm that a data void can cause. In the long-tail state, a (relatively) small number of searchers may engage with misinformation or conspiracies that could stand uncorrected for long periods of time. In a breaking news spike, large numbers of searchers looking to learn more about a recent event may be shown misinformation or conspiracy content (generated either before or after the event), potentially including material created by bad actors consciously capitalizing on the influx of attention to a topic. Golebiewski and boyd take a clear stance on the relative importance and harm posed by these two states, asserting that “generally speaking, data voids are not a liability until something happens that results in an increase of searches on a term” (Golebiewski and boyd 2019).

In the broader discussion of data voids, there has notably been a void of quantitative, reproducible data that could help us better understand this phenomenon in mis/disinformation. Given this, our team was curious to dig deeper and inject data into the conversation. Could we evaluate this claim and observe how the surfacing of information (both credible and malicious) surrounding these breaking news data voids played out? Was it possible to approximate what these massive numbers of users saw when they searched for these terms?

To begin, we compiled a list of breaking news search terms / data voids associated with misinformation as the subjects for our study. In keeping with previous researchers' practice of not amplifying new harmful topics, we stuck to known data voids. For search terms that were innocuous on their face, we identified associated misinformation and conspiracies that began to spread online, so that we could track when those topics were covered by the mainstream media.

Next, we aligned on our goal: to create a timeline of credible information and mainstream coverage around each of these search terms. As the Data & Society authors note, data voids are filled when credible information addresses the misinformation related to a topic, in that “the time between the first report and the creation of massive news content is when manipulators have the largest opportunity to capture attention” (Golebiewski and boyd 2019). In an ideal world we could compile a list of the exact results users were shown when they searched for a term at a specific time (e.g. “50% of users who searched 'Sutherland Springs' within a few hours of the event were shown Reddit threads touting conspiracy theories”). Given that this data is not publicly available, we decided to create a timeline that includes both search activity for a breaking news data void and the presence of authoritative media sources addressing or correcting misinformation related to that subject.

Finally, we amassed the data needed to create this timeline for each breaking news data void. We utilized publicly available Google Trends data to quantify the number of searches for a particular term over time (note that Google normalizes all search trends data, with 0 being minimal search volume and 100 being the maximum for that term). To approximate the presence of media articles on a topic, we examined two data sets: MIT and the Berkman Klein Center’s MediaCloud system (an archive of media stories from across the web) and Wikipedia article edit data (given the prominent role that Wikipedia plays in search engine results, such as the Knowledge Panel on Google).
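As a rough sketch of this gathering step (assuming the open-source pytrends client, an unofficial wrapper around the same normalized Google Trends data; our actual script may differ):

```python
# Minimal data-gathering sketch. Google Trends values are normalized by
# Google to a 0-100 scale per query and timeframe.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(kw_list=["crisis actor"],
                       timeframe="2012-11-01 2020-04-19")
trends = pytrends.interest_over_time()    # DataFrame indexed by week
search_interest = trends["crisis actor"]  # normalized 0-100 series

# MediaCloud story timestamps and Wikipedia edit timestamps are gathered
# separately (MediaCloud via its public API; Wikipedia as sketched later)
# and aligned against this date index.
print(search_interest.idxmax(), int(search_interest.max()))  # peak week, 100
```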

Data void lifecycles

Below, we have plotted the peak week of search activity for each term (in red) and layered in the specific times that authoritative media articles entered the discussion (in blue) and the specific times that edits were made to relevant Wikipedia articles (in yellow). With this data we have a timeline of when credible news sources posted about a term relative to when searches for that data void were spiking.
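A minimal sketch of how each such timeline can be drawn (assuming matplotlib, the search_interest series from the sketch above, and hypothetical article_times and edit_times lists of datetimes drawn from MediaCloud and Wikipedia):

```python
# Timeline sketch: red line for normalized search interest, blue ticks for
# media articles, yellow ticks for Wikipedia edits (variable names carried
# over from the gathering step; article_times / edit_times are assumed).
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(search_interest.index, search_interest.values,
        color="red", label="Search interest (0-100)")
for i, t in enumerate(article_times):
    ax.axvline(t, color="blue", alpha=0.4,
               label="Media article" if i == 0 else None)
for i, t in enumerate(edit_times):
    ax.axvline(t, color="gold", alpha=0.4,
               label="Wikipedia edit" if i == 0 else None)
ax.set_title('Data void life cycle: "crisis actor"')
ax.legend(loc="upper left")
plt.tight_layout()
plt.show()
```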

The patterns we observe in the graphs below lead us to several key findings:

Mainstream media’s capacity to respond to fast-moving misinformation is complicated and varied.

To illustrate this, consider examples where the media were timely in responding directly to conspiracies and misinformation, as well as times they were not. In the case of the 2020 Iowa Caucuses, numerous mainstream outlets (the Seattle Times, MSNBC, and more) posted content specifically addressing the misinformation that the Caucus was rigged during - and even slightly before - the spike in searches for "Iowa Caucus". This indicates that the searchers who pulled out their phones and laptops during these “crucial hours” (as Golebiewski and boyd phrase it) would have been presented with credible sources. On the other hand, the Sutherland Springs mass shooting reveals some room for improvement, with sites like the Washington Post and FactCheck.org correcting misinformation (related to Antifa being responsible for the shooting) hours or even days after search activity had quieted down (although it is important to note that smaller news sources like the San Antonio Express-News, as well as a few blogs, did cover the Antifa misinformation in a much more timely manner, and are likely to have shown up in early search results).

Amplification or Inoculation?

For these search terms, the peak in search interest was always preceded by some mainstream media coverage. This seems to confirm an important point from Golebiewski & boyd: mainstream media’s tendency to quickly generate content and draw attention to unfolding events often amplifies attention and helps drive search interest. However, this also raises a key paradox in our exploration of data voids. If a data void is defined as a search query that surfaces no authoritative sources, then we must consider a void filled when authoritative media sources appear in top search results and Wikipedia includes relevant information. In each case we looked at, we found a clear correlation between increased search interest and increased availability of authoritative information. As attention spikes, mainstream media publish more information directly addressing the relevant term, and thereby fill the void.

One of the starkest examples we observed in our data was the term “crisis actor”, when the previously mentioned Anderson Cooper interview drove massive search interest from viewers to a term that had been discussed online at a low level for several years. This raises a critical question: when is media attention a good thing, and when is it harmful? An argument can be made that this media coverage is positive based on inoculation theory, which centers on informing readers of misinformation and proactively labeling it as false. It can also be seen as harmful amplification of otherwise latent misinformation, giving a platform and voice to content that would otherwise have passed under the radar. The same mainstream media reporting that "fills the void" for a given term may lead audiences to seek out and engage with disinformation and conspiracy content. At this stage it is hard to quantify the harms of amplification compared to those of leaving a void unfilled. However, it is critical that we recognize that post-void amplification and data voids where no authoritative content exists are two markedly separate - if closely related - problems.

Discussions with breaking news reporters about the crucial minutes and hours of an event such as a school shooting shed light on both the strengths and the drawbacks of the media process. Mainstream media sources with a dedicated breaking news team aim to publish something in response to an event like the Sutherland Springs shooting extremely quickly - in some cases perhaps as soon as 20 minutes after the event itself. Monitoring social media, including accounts of local authorities or accounts dedicated to police scanners, allows newsrooms to be alerted to such events increasingly quickly. Confirming details, however, will always take time. The faster an article is published, the less detail it is likely to contain - especially as mainstream media are ethically committed (and otherwise incentivized) not to get things wrong. These initial article stubs are added to as more details emerge. With ethical constraints limiting media reporting of details such as the names of perpetrators or victims until they are confirmed by appropriate authorities, publication of these details can lag behind public discussion. Online, true facts about the event shared by eyewitnesses through social media or other channels may mingle with rumor, conspiracy, and deliberate disinformation. It is this mixture of good information and garbage that seems to dominate a breaking news event, rather than a true "void" or complete absence of authoritative content.

Sting in the Tail

While the exposure to much wider audiences delivered by a spike in search interest undoubtedly creates a large potential for harm, the presence of at least some authoritative information before or during these search spikes leads us to question whether our definition of harm should be focused on breaking news data voids, as Golebiewski & boyd originally stated. Rather, for many data voids the greatest harm may be manifested well before there is a surge in search interest (for example, in the six years from 2012 to 2018 that we observed as the long tail for “crisis actor”6). It is during the long tail period, when a user searching for a given subject will find no accurate or authoritative results, that conspiracy and disinformation narratives develop and grow communities of interest. When a spike of search interest does occur, these established ideas are able to find new audiences, pushed through social media and other platforms - from cable television to celebrity endorsements - even after authoritative counter-information is readily available via search.

Wikipedia as a battleground for misinformation

There are extremely good reasons why Wikipedia is ranked so highly in Google search results. But the crowdsourced nature of the platform does make it vulnerable to disinformation actors and media manipulators, and these vulnerabilities are amplified by its close relationship with Google search and other platforms.

Turning back to our example of "crisis actor", in the history of Wikipedia edits we can clearly see a battle being waged between mis/disinfo actors and long-term Wikipedia editors. In the days after the Parkland shooting, the Crisis Actor Wikipedia page saw multiple attempts at “spam, abuse, bias, and conspiracy theories”, with bad actors peddling false narratives about the mass shooting. Eventually, the page was classified as “Protected” to limit who could modify the article, due to “persistent vandalism and garbage edits” (the same Wikipedia edit wars can be seen across multiple data void topics, including the 2020 Iowa Caucuses and Pizzagate).
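The full revision history behind such edit wars is publicly queryable. A minimal sketch using the standard MediaWiki API (the keyword filter at the end is our own illustrative heuristic, not a Wikipedia feature):

```python
# Pull the revision history of the "Crisis actor" article via the public
# MediaWiki API, then flag edit comments suggestive of vandalism fights.
import requests

resp = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query",
    "prop": "revisions",
    "titles": "Crisis actor",
    "rvprop": "timestamp|user|comment",
    "rvlimit": "max",   # up to 500 revisions per request; paginate for more
    "format": "json",
})
page = next(iter(resp.json()["query"]["pages"].values()))
for rev in page.get("revisions", []):
    comment = rev.get("comment", "").lower()
    # Illustrative keyword heuristic for contested edits.
    if any(kw in comment for kw in ("vandal", "revert", "conspiracy", "protect")):
        print(rev["timestamp"], rev.get("user", "?"), comment[:80])
```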

Wikipedia has already made huge steps to mitigate these risks. Better understanding and documentation of how these manipulations play out - particularly as they move to platforms outside of Wikipedia itself - will help continued resilience building.


Part 5: Future research agenda

The concept of a "data void" around particular search terms, potentially weaponized by disinformation actors, is a relatively new and powerful one. Thinking about the problem of misinformation and disinformation through this prism helps us to understand the wider information ecosystem. Many related questions remain to be explored.

If data voids are a potential pathway to misinformation and radicalization, what pathways bring people to the data void itself? People looking for content about “crisis actors” have almost certainly already encountered the term somewhere else, and are searching for more information, be it confirmation, contradiction, or simply context. People who ask Google “is the Holocaust real?” would seem to have already opened the window to doubt. Exposure to these terms and ideas may come via a meme or a message board, cable TV or a Presidential press conference. Whatever the medium, this initial exposure is arguably the critical point of contagion, even if it is the resulting search activity that ultimately leads to belief and even radicalization. Understanding what drives people to search for these terms is as important as understanding what information they surface when they do so.

Our research into the timelines of specific data voids has led us to concentrate on the distinction between breaking news events and those problematic or weaponized terms that deepen and develop in the long tail of search interest. Based on the data, it is our assessment that the biggest spikes in search interest in a specific term are not necessarily the most harmful, at least when framed exclusively as a data void. Good information is reasonably quickly available within these timeframes - perhaps surprisingly so, given the inherent chaos of breaking news. But exposing wider audiences to problematic ideas is likely to be harmful, even when good information and “fact checks” are readily accessible. People may be inclined to believe or take interest in a conspiracy despite quick or concurrent exposure to accurate information. People are like that. Exploring the development of communities of interest around disinformation and conspiracy content that grow in the long tail of search interest is essential, as is understanding if and how these communities expand when these ideas are exposed to new audiences through rapid media amplification.

A better understanding of the relationship between social media and authoritative media in developing and amplifying narratives is also of obvious interest. A preliminary investigation using CrowdTangle to compare Facebook and Reddit activity against the search and media graphs above seems to indicate that authoritative sources ultimately receive significantly more engagement, but that problematic content is more likely to appear on social media first. This aligns with our understanding of disinformation models and is not surprising. For practical reasons, disinformation and media manipulation campaigns are likely to focus their efforts on social media and “fringe” websites, allowing them to seed narratives, develop communities, and create content around strategic terms. Further pinning down the transition between the ground-laying that occurs within a data void and the wider exposure that comes when a void is filled seems possible with the data available.

Better understanding how authoritative media sources cover breaking news stories, and how this affects the timing and ranking of search results, has the potential to help both media and search platforms mitigate much of the risk of these data voids. How do we ensure that the results from authoritative sources appear first during a breaking news event, even if they lack specific details that people might be searching for? This is potentially a consideration for social platforms like Twitter and Facebook in particular, where “top” results based on engagement may contain disinformation. When looking at the short timelines between events and reporting, it is also imperative to better understand how updated news stories affect search results. From a research perspective, we have questions about how these updated stories are indexed and surfaced through tools such as MediaCloud and CrowdTangle, given that information containing search terms may be added after initial publication, potentially confusing timelines.

The prominent use of Wikipedia in search engine tools and results presents a potentially valuable opportunity for both research and practical improvements in mitigating the risks of disinformation and misinformation. The ability to track every change on Wikipedia allows unique insight into how manipulation plays out and is resolved. Guidelines for editing and laying out controversial pages already exist, but they could be further developed with their impact on tools such as Google's Knowledge Graph taken into consideration.

Conclusion

As is true for anyone studying misinformation and disinformation, our research team is left with just as many – if not many more – questions as answers. One question in particular has captured our attention, which we’d like to leave you with.

The term “data void” implies that what is missing is data: if only more data (or information more broadly) were readily available, the problem would be fixed, as people would be able to access credible and authoritative content on a topic. But this hypothesis fails to explain how misinformation and conspiracy still percolate within topics for which plenty of credible information exists all over the web.

Perhaps, after considering the lens of media attention and the estimated harms of a data void, what is missing from the analysis is not only credible and authoritative data, but also user trust. User trust is needed to form a connection with the information emerging in data voids, encouraging users not only to read and believe what they see, but also to evangelize and proactively refer back to these sources when faced with future misinformation. This is a difficult challenge, as user trust is capricious and difficult to track online. However, until we understand how to fill data voids with both data and trust, there will always be a gap between what is known and what is believed.