The Data Labelling Crisis: Why Ground Truth Is Becoming AI's Scarcest Resource
Abstract
The AI research community's relationship with labelled data has always been ambivalent: it is simultaneously the foundation of supervised learning and the bottleneck that constrains what supervised learning can achieve. As large-scale pre-training has reduced the labelled data requirements for many standard NLP and computer vision tasks, a premature narrative has emerged suggesting that the labelling problem is essentially solved. This perspective argues the opposite. The labelling crisis is not receding; it is intensifying and shifting. The tasks for which labelling is most needed (nuanced safety evaluation, cross-cultural preference alignment, specialised scientific annotation, temporal event ordering in long documents) are precisely those for which automated labelling approaches perform worst, crowdsourced annotation quality is least reliable, and the epistemic foundations of 'ground truth' are most contested. We examine three dimensions of the labelling crisis: the labour economics of annotation at scale, the epistemic problem of ground truth in subjective or culturally dependent tasks, and the emerging challenge of synthetic data for LLM training as both a substitute for, and a complication of, human-generated ground truth. We conclude with specific recommendations for the research community and for organisations building annotation infrastructure.
