HintsOfTruth

HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims

Michiel van der Meer¹, Pavel Korshunov², Sébastien Marcel², Lonneke van der Plas³

¹ Leiden Institute of Advanced Computer Science, Leiden University, The Netherlands
² Biometrics Security and Privacy group, Idiap Research Institute, Switzerland
³ Institute of Argumentation, Linguistics and Semiotics, USI Università della Svizzera italiana, Switzerland

📄 Preprint | 🤗 HuggingFace Hub | 🛠️ Source code (coming soon)

HintsOfTruth is the first publicly available multimodal dataset of image-text pairs that contains both real-world and synthetically generated claims, each either checkworthy or non-checkworthy. Real claims are sourced from existing datasets such as 5Pils, Multiclaim, Flickr30K, and SentiCap. Synthetic images are generated with Flux and StableDiffusion 3.5, and synthetic text with Llava and BLIP. The dataset is intended as a benchmark for checkworthiness detection models.
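As a rough illustration of the benchmark use case, the sketch below scores a binary checkworthiness detector on a handful of image-text pairs. Everything in it is an assumption made for illustration: the `Example` fields, the `evaluate` and `naive_predict` helpers, and the toy records are hypothetical and do not reflect the actual schema or labels of the released dataset. Only the first toy claim is taken from this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Example:
    """One image-text pair. Field names are hypothetical, not the dataset schema."""
    image_path: str
    claim: str
    checkworthy: bool  # gold label


def evaluate(examples: Iterable[Example],
             predict: Callable[[Example], bool]) -> Dict[str, float]:
    """Compute precision, recall, F1 and accuracy for a binary detector."""
    tp = fp = fn = tn = 0
    for ex in examples:
        pred = predict(ex)
        if pred and ex.checkworthy:
            tp += 1
        elif pred and not ex.checkworthy:
            fp += 1
        elif not pred and ex.checkworthy:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def naive_predict(ex: Example) -> bool:
    """Toy text-only baseline: flag a claim if any word after the first is
    capitalised (a crude proxy for named entities). It ignores the image."""
    words = ex.claim.split()
    return any(w[0].isupper() for w in words[1:])


# Toy records for illustration only; paths and labels are made up.
TOY_EXAMPLES = [
    Example("img_0.jpg", "Photo of a flooded Ahmedabad International Airport.", True),
    Example("img_1.jpg", "Two dogs play in a park.", False),
    Example("img_2.jpg", "A man rides a bicycle past the Eiffel Tower.", False),
]

if __name__ == "__main__":
    print(evaluate(TOY_EXAMPLES, naive_predict))
```

A real evaluation would replace `naive_predict` with a multimodal model that also reads the image, and would iterate over the published dataset splits rather than hand-written records.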

Checkworthy example

Claim:

Photo of a flooded Ahmedabad International Airport.

Llava-generated claim:

The image shows a flooded airport with several airplanes parked in the water. There are five airplanes in total, with one of them being a large jetliner. The airplanes are parked in a row, with some of them partially submerged in the water. The scene appears to be a mix of a flooded airport and a beach, with the airplanes serving as a unique and unexpected sight.

BLIP-generated claim:

Arafly parked airplanes are lined up in a row at an airport

[Images: Real, Flux-generated, StableDiffusion-generated]

Non-checkworthy example