Lynn de Rijk and Martha Larson
This poster provides an overview of the datasets used in recent research on false information detection and a characterization of how the ground truth for these datasets is created. “False information” refers to misinformation, disinformation and fake news, the three types of information that machine learning models are generally built to detect. We point out that whoever creates the ground truth has an enormous impact on the moderation of online discussions. Datasets are used to train models, which could find their way into real-world use. Automatically flagging and filtering online content using such models could be seen as deciding what people should believe. Once a dataset is in use, how the ground truth was defined and how the data was labeled are often forgotten. Moreover, the labeling procedure followed for one dataset can be adopted for another without revisiting the original rationale for the ground truth.
Ground truth that is created quickly by non-experts or using heuristics is particularly worrisome. Studies have shown that limitations in NLP training data can lead to ethically problematic effects in the resulting models, e.g., models learning cultural biases about gender, race, ethnicity and/or religion. It is therefore imperative to ensure that dataset labeling leads to a model that does what it is intended to do and avoids propagating undesired patterns. One way to achieve this is by improving transparency through standardized data documentation.
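As a purely illustrative aside (not part of the poster's contribution), such standardized documentation could take a minimal machine-readable form; the record below is a hedged sketch with hypothetical field names, meant only to show the kind of labeling provenance a data statement might capture.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of a minimal data-statement record for a false
# information dataset; all field names and values are illustrative only.
@dataclass
class DataStatement:
    dataset_name: str
    label_source: str          # e.g. fact-checking organisation, crowd workers, heuristic
    label_definition: str      # how "false" was operationalised
    annotator_expertise: str   # e.g. professional fact-checkers, non-experts
    known_limitations: List[str] = field(default_factory=list)
    derived_from: List[str] = field(default_factory=list)  # parent datasets, if any

example = DataStatement(
    dataset_name="ExampleNewsCorpus",  # placeholder, not a real dataset
    label_source="heuristic: source-level reliability lists",
    label_definition="an article inherits the veracity label of its outlet",
    annotator_expertise="none (labels assigned automatically)",
    known_limitations=["source-level labels may mislabel individual articles"],
)
```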
The transparency of data and coding is also important for improving detection models in general. For example, some multimedia concept detection models rely on ontologies of concepts that are not always consistent with the vantage points of users of such systems. Standardizing data documentation could thus be a step towards more grounded and expertise-informed coding in detection models overall.
For automated false information detection specifically, problems regarding free speech arise. Data needs to be labeled for veracity, but how can it be decided fairly what information should be regarded as false? Answering this question involves deciding who gets to be the arbiter of truth and when content should be regarded as inadmissibly misleading. The choices made impact free speech and threaten the democratizing nature of the internet.
We provide an overview of the ground truth used in the false information detection literature and the implications of how it is created for free speech, showing where possible issues arise and how these are addressed by researchers in the field. Through a systematic literature search, recent papers presenting false information detection models were collected and the datasets used were indexed, focusing on 1) who gets to decide what is true and 2) how data is labeled for veracity. Lastly, a genealogy of the English datasets was constructed, showing relations between datasets and how older data may propagate into newer ones. The findings show that there is an urgent need for data documentation and indicate which concerns data statements for false information datasets should address.
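To make the genealogy idea concrete, relations between datasets can be thought of as a directed graph in which an edge from a parent to a child means the child reuses data or labels from the parent. The sketch below is a hypothetical illustration under that assumption; the dataset names are placeholders, not the datasets actually surveyed.

```python
from collections import defaultdict, deque

# Hypothetical genealogy: parent -> children that reuse its data or labels.
# Dataset names are placeholders, not the datasets surveyed in the poster.
genealogy = {
    "SeedClaimsDataset": ["DerivedNewsDataset", "AugmentedClaimsDataset"],
    "DerivedNewsDataset": ["MultimodalNewsDataset"],
}

def ancestors(dataset, edges):
    """All datasets whose data may propagate (transitively) into `dataset`."""
    parents = defaultdict(set)
    for parent, children in edges.items():
        for child in children:
            parents[child].add(parent)
    seen, queue = set(), deque([dataset])
    while queue:
        for p in parents[queue.popleft()]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen

print(ancestors("MultimodalNewsDataset", genealogy))
# {'DerivedNewsDataset', 'SeedClaimsDataset'}
```

Tracing ancestors in this way is one simple way to surface which original labeling decisions silently carry over into newer datasets.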