The Data Labelling Crisis: Why Ground Truth Is Becoming AI's Scarcest Resource

Amira Khalil*
Oxford Internet Institute, University of Oxford, Oxford, UK, OX1 3JS
amira.khalil@oii.ox.ac.uk
Thomas van der Berg
Department of Information and Computing Sciences, Utrecht University, Utrecht, Netherlands, 3584 CS

Abstract

The AI research community's relationship with labelled data has always been ambivalent — it is simultaneously the foundation of supervised learning and the bottleneck that constrains what supervised learning can achieve. As large-scale pre-training has reduced the labelled data requirements for many standard NLP and computer vision tasks, a somewhat premature narrative has emerged suggesting that the labelling problem is essentially solved. This perspective argues the opposite. The labelling crisis is not receding; it is intensifying and shifting. The tasks for which labelling is most needed — nuanced safety evaluation, cross-cultural preference alignment, specialised scientific annotation, temporal event ordering in long documents — are precisely those for which automated labelling approaches perform worst, crowd-sourcing quality is most unreliable, and the epistemic foundations of 'ground truth' are most contested. We examine three dimensions of the labelling crisis: the labour economics of annotation at scale, the epistemic problem of ground truth in subjective or culturally-dependent tasks, and the emerging challenge of synthetic data for LLM training as a substitute for — and complication of — human-generated ground truth. We conclude with specific recommendations for the research community and for organisations building annotation infrastructure.

Article details

How to Cite

Khalil, A., & Berg, T. van der. (2025). The Data Labelling Crisis: Why Ground Truth Is Becoming AI’s Scarcest Resource. DATAMIND, 3(4), 1-4. https://doi.org/10.63646/