In mathematics, statistics, finance,[1] and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer to a problem into a simpler one. It is often used in solving ill-posed problems or to prevent overfitting.[2]
Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:
In explicit regularization, independent of the problem or model, there is always a data term, which corresponds to a likelihood of the measurement, and a regularization term, which corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off the two objectives, one chooses either to adhere more closely to the data or to enforce regularization (to prevent overfitting). An entire research branch deals with all possible regularizations. In practice, one usually tries a specific regularization and then works out the probability density that corresponds to it, in order to justify the choice. The regularization term can also be physically motivated by common sense or intuition.
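A common way to make this Bayesian reading concrete is the maximum a posteriori (MAP) estimate. The following sketch uses a generic parameter vector θ, data D, loss L, regularizer R and trade-off parameter λ (notation introduced here for illustration, not tied to a particular model); the data term arises as the negative log-likelihood and the regularization term as the negative log-prior:

```latex
p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
\quad\Longrightarrow\quad
\hat{\theta}_{\mathrm{MAP}}
= \arg\min_{\theta}\Bigl[\underbrace{-\log p(D \mid \theta)}_{\text{data term}}
\;\underbrace{-\log p(\theta)}_{\text{regularization term}}\Bigr]
= \arg\min_{\theta}\bigl[L(\theta; D) + \lambda R(\theta)\bigr].
```

With a Gaussian likelihood and a Gaussian prior on θ, this reduces to least squares with a squared-norm (Tikhonov) penalty.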
In machine learning, the data term corresponds to the training data and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e. the error score of the trained model on an evaluation set rather than on the training data.[3]
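As an illustrative sketch (the synthetic data, the choice of a squared-norm penalty and the hyperparameters below are arbitrary choices for this example, not taken from the cited sources), the following compares the training and evaluation error of a linear model fitted by gradient descent with and without a regularization term:

```python
# Linear regression trained by gradient descent with an optional L2 penalty
# ("weight decay"); generalization error is measured on a held-out evaluation
# set rather than on the training data.
import numpy as np

rng = np.random.default_rng(0)
X_train, X_eval = rng.normal(size=(50, 20)), rng.normal(size=(50, 20))
true_w = np.concatenate([rng.normal(size=5), np.zeros(15)])   # sparse ground truth
y_train = X_train @ true_w + 0.5 * rng.normal(size=50)
y_eval = X_eval @ true_w + 0.5 * rng.normal(size=50)

def fit(lam, steps=2000, lr=0.01):
    """Gradient descent on mean squared error plus lam * ||w||^2 / 2."""
    w = np.zeros(X_train.shape[1])
    for _ in range(steps):
        grad = X_train.T @ (X_train @ w - y_train) / len(y_train) + lam * w
        w -= lr * grad
    return w

for lam in (0.0, 1.0):
    w = fit(lam)
    train_err = np.mean((X_train @ w - y_train) ** 2)
    eval_err = np.mean((X_eval @ w - y_eval) ** 2)
    print(f"lambda={lam}: training MSE={train_err:.3f}, evaluation MSE={eval_err:.3f}")
```

Typically the unregularized fit attains a lower training error, while the regularized fit attains a lower error on the evaluation set.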
One of the earliest uses of regularization is Tikhonov regularization (ridge regression), related to the method of least squares.
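In Tikhonov regularization the least-squares objective is augmented with a squared-norm penalty; as a sketch (with design matrix X, response vector y and regularization parameter λ ≥ 0, notation introduced here for illustration):

```latex
\hat{w}_{\mathrm{ridge}}
= \arg\min_{w}\; \lVert Xw - y \rVert_2^2 + \lambda \lVert w \rVert_2^2
= (X^{\mathsf T}X + \lambda I)^{-1} X^{\mathsf T} y,
```

which reduces to the ordinary least squares estimator when λ = 0.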
Term structure models can be regularized to remove arbitrage opportunities.
If the number of predictors p exceeds the number of observations n, the ordinary least squares estimator is not unique and will heavily overfit the data. In that case, a form of complexity regularization is necessary.
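As a small self-contained illustration (the dimensions, data and value of λ are arbitrary choices for this example), the rank deficiency of XᵀX when p > n, and the fact that adding λI restores a unique solution, can be checked numerically:

```python
# With more features (p) than samples (n), X^T X is singular, so ordinary least
# squares has no unique solution; the ridge system (X^T X + lambda I) w = X^T y
# is invertible for any lambda > 0 and yields a unique estimate.
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                                   # fewer samples than features
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))           # at most n (= 10) < p, so singular

lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)   # unique solution
print(w_ridge.shape)                            # (50,)
```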