# Detecting {term}`drift` in Datasets

## What is {term}`drift`?

Drift refers to the phenomenon where the statistical properties of data change over time, leading to discrepancies between the data a model was trained on and the data it encounters during deployment. This can significantly degrade the performance of {term}`machine learning` models, as the assumptions made during training may no longer hold in real-world scenarios.

### _Formal Definition and Types of Drift_

In the context of {term}`supervised learning`, a model is trained to predict the conditional probability $P(Y|X)$, where $X$ represents the input features and $Y$ the target variable. Drift occurs when the joint distribution $P(X, Y)$ changes between the training and deployment phases, that is, when the joint distribution $P_t(X, Y)$ during training differs from the joint distribution $P_d(X, Y)$ during deployment:

$$
P_t(X, Y) \neq P_d(X, Y)
$$

The joint distribution $P(X, Y)$ can be decomposed into two equivalent forms:

- As the product of the posterior probability and the evidence: $P(X, Y) = P(Y|X)P(X)$
- As the product of the likelihood and the prior: $P(X, Y) = P(X|Y)P(Y)$

Different types of drift can be identified by analyzing which components of these decompositions have changed.

### _Covariate Shift_

Covariate shift (also known as population shift or virtual drift) occurs when the conditional probability of the target given the input, $P(Y|X)$, remains unchanged, but the distribution of the input features, $P(X)$, changes between training and deployment:

$$
P_t(Y|X) = P_d(Y|X) \quad \text{but} \quad P_t(X) \neq P_d(X)
$$

This type of shift often arises from environmental variability, sensor degradation, or biased sampling during training. Covariate shift can lead to poor model performance when deployment data falls in regions of the input space that the model did not see during training (i.e., "blind spots").

### _Label Shift_

{term}`Label shift