Leakage (machine learning)

In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process that would not be expected to be available at prediction time. As a result, the predictive scores (metrics) overestimate the model's utility when it is run in a production environment.[1]

Leakage is often subtle and indirect, making it hard to detect and eliminate. Because scores obtained with leaked information are inflated, leakage can cause a statistician or modeler to select a suboptimal model that a leakage-free model could outperform.[1]
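
The effect described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the column names `signal` and `leak` are hypothetical, and the "model" is a fixed threshold rule rather than a fitted estimator. The `leak` column copies the label, standing in for a field that is only recorded after the outcome is known.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def make_row():
    """One synthetic record: a legitimate feature, a leaky feature, and the label."""
    label = rng.randint(0, 1)
    signal = label + rng.gauss(0.0, 0.5)  # noisy, but known at prediction time
    leak = label                          # only recorded after the outcome is known
    return {"signal": signal, "leak": leak, "label": label}

rows = [make_row() for _ in range(4000)]
train, held_out = rows[:2000], rows[2000:]

def accuracy(feature, data):
    """Accuracy of the rule 'predict 1 when the feature exceeds 0.5'."""
    hits = sum((row[feature] > 0.5) == (row["label"] == 1) for row in data)
    return hits / len(data)

# Model selection on the training split keeps whichever feature scores best.
chosen = max(["signal", "leak"], key=lambda f: accuracy(f, train))

# Offline evaluation still sees the leaky column, so the score looks perfect.
offline_score = accuracy(chosen, held_out)

# At prediction time the outcome has not happened yet, so the leaky column
# is empty; here that is modelled by setting it to 0 for every row.
production = [dict(row, leak=0) for row in held_out]
production_score = accuracy(chosen, production)

# The leakage-free alternative uses only the feature that is really available.
honest_score = accuracy("signal", production)

print("selected feature:", chosen)
print("offline score:", offline_score)
print("production score:", production_score)
print("leakage-free score:", honest_score)
```

In this sketch the selection step picks the leaky feature because it scores 1.0 offline. Once that feature is unavailable, the same rule predicts the negative class for every row, while the rule based on `signal` alone, which looked worse offline, performs better in the simulated production setting.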

  1. Shachar Kaufman; Saharon Rosset; Claudia Perlich (January 2011). "Leakage in data mining: Formulation, detection, and avoidance". Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. Vol. 6. pp. 556–563. doi:10.1145/2020408.2020496. ISBN 9781450308137. S2CID 9168804. Retrieved 13 January 2020.