Why Your ML Model Is Lying to You: A Practitioner's Guide to Distribution Shift
Abstract
Distribution shift — the divergence between the statistical properties of training data and those of the data a deployed model encounters in production — is among the most common and most underappreciated causes of model failure in practice. This perspective piece argues that the field has developed sophisticated theoretical frameworks for characterising distribution shift (covariate shift, label shift, concept drift, dataset shift) but has invested comparatively little in the practical tooling that would help working data scientists detect, diagnose, and respond to shift in deployed systems. We draw on our experience deploying and monitoring ML systems at scale to identify four categories of shift that practitioners encounter most frequently, describe the specific signals that indicate each category, and recommend a monitoring architecture that provides early warning across all four. We also address a subtler issue that the theoretical literature largely ignores: the difference between a model that performs poorly because of shift and a model that appears to perform well despite shift, because the shift has affected the evaluation metric along with the model inputs. This phenomenon — which we call metric blindness to shift — is more common than the literature acknowledges and is potentially the most dangerous failure mode in deployed ML systems.
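To make the covariate-shift case concrete, the following is a minimal sketch of one common detection signal: per-feature two-sample Kolmogorov-Smirnov tests comparing a training sample against a production window. The function name, the significance threshold, and the choice of test are illustrative assumptions, not the monitoring architecture recommended in the article.

```python
# Minimal covariate-shift check: compare each feature's training and
# production distributions with a two-sample KS test.
# Illustrative sketch only; names and thresholds are assumptions.
import numpy as np
from scipy.stats import ks_2samp

def detect_covariate_shift(train: np.ndarray, prod: np.ndarray, alpha: float = 0.01):
    """Return (feature_index, KS statistic, p-value) for features whose
    production distribution differs significantly from training."""
    shifted = []
    for j in range(train.shape[1]):
        stat, p_value = ks_2samp(train[:, j], prod[:, j])
        if p_value < alpha:
            shifted.append((j, stat, p_value))
    return shifted

# Example with synthetic data: feature 1 drifts in "production".
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 3))
X_prod = rng.normal(size=(5000, 3))
X_prod[:, 1] += 0.5  # simulated covariate shift on one feature
print(detect_covariate_shift(X_train, X_prod))
```

A per-feature test like this catches only marginal drift; joint or label-conditional shifts, including the metric-blindness scenario described above, require the richer signals discussed later in the article.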
