Learning rate

In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function.^[1] Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns". In the adaptive control literature, the learning rate is commonly referred to as gain.^[2]
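As a minimal sketch of the step-size role described above, the following assumes plain gradient descent with a fixed learning rate. The function name `gradient_descent` and the quadratic loss are hypothetical choices made for illustration and do not come from the cited sources.

```python
def gradient_descent(grad, theta0, learning_rate, steps):
    """Plain gradient descent: theta <- theta - learning_rate * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        # The gradient gives the direction; the learning rate scales the step.
        theta = theta - learning_rate * grad(theta)
    return theta


# Illustrative loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(
    grad=lambda t: 2.0 * (t - 3.0),
    theta0=0.0,
    learning_rate=0.1,
    steps=100,
)
```

With these assumed settings the iterate approaches the minimizer at 3; each step shrinks the remaining error by a constant factor that depends on the learning rate.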

In setting a learning rate, there is a trade-off between the rate of convergence and overshooting. While the descent direction is usually determined from the gradient of the loss function, the learning rate determines how big a step is taken in that direction. A learning rate that is too high will make the learning jump over minima, whereas one that is too low will either take too long to converge or get stuck in an undesirable local minimum.^[3]
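The trade-off can be illustrated on a toy problem. The sketch below assumes the loss f(x) = x^2 (gradient 2x) and three arbitrarily chosen learning rates; the function `run_descent` and all numeric values are hypothetical. Because this loss is convex, it shows only overshooting and slow convergence, not the local-minimum behaviour.

```python
def run_descent(learning_rate, x0=1.0, steps=50):
    """Run gradient descent on f(x) = x**2 and return the final iterate."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2.0 * x  # gradient of x**2 is 2*x
    return x


too_high = run_descent(1.1)    # each step overshoots the minimum and |x| grows
too_low = run_descent(0.001)   # each step barely moves x toward 0
moderate = run_descent(0.4)    # x contracts quickly toward the minimum at 0
```

On this loss each update multiplies x by (1 - 2 * learning_rate), so the iterates diverge when that factor has magnitude above 1 and converge slowly when it is just below 1.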

To achieve faster convergence, prevent oscillations, and avoid getting stuck in undesirable local minima, the learning rate is often varied during training, either according to a learning rate schedule or by using an adaptive learning rate.^[4] The learning rate and its adjustments may also differ per parameter, in which case it is a diagonal matrix that can be interpreted as an approximation to the inverse of the Hessian matrix in Newton's method.^[5] The learning rate is related to the step length determined by inexact line search in quasi-Newton methods and related optimization algorithms.^[6]^[7]
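The two ideas above can be sketched as follows, under stated assumptions: exponential decay stands in for "a learning rate schedule" and an AdaGrad-style update stands in for a per-parameter adaptive rate. Both are common choices rather than the only ones, and the function names, default values, and example loss are hypothetical.

```python
import math


def exponential_decay(initial_rate, decay, step):
    """Schedule: the rate shrinks exponentially with the step count."""
    return initial_rate * math.exp(-decay * step)


def adagrad_step(theta, grads, accum, base_rate=0.5, eps=1e-8):
    """One AdaGrad-style update in which each parameter has its own rate.

    accum holds the running sum of squared gradients per parameter; the
    effective rates base_rate / (sqrt(accum) + eps) act like the diagonal
    of a matrix-valued learning rate.
    """
    new_theta, new_accum = [], []
    for t, g, a in zip(theta, grads, accum):
        a = a + g * g
        rate = base_rate / (math.sqrt(a) + eps)
        new_theta.append(t - rate * g)
        new_accum.append(a)
    return new_theta, new_accum


# Illustrative loss 10*x**2 + 0.1*y**2 with gradient (20*x, 0.2*y).
theta = [1.0, 1.0]
grads = [20.0 * theta[0], 0.2 * theta[1]]
theta, accum = adagrad_step(theta, grads, accum=[0.0, 0.0])
```

In this example the two gradient components differ by a factor of 100, yet the first update moves both parameters by roughly the same amount, because each effective rate is scaled by that parameter's own gradient history.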

[1] Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press. p. 247. ISBN 978-0-262-01802-9.

[2] Delyon, Bernard (2000). "Stochastic Approximation with Decreasing Gain: Convergence and Asymptotic Theory". Unpublished Lecture Notes. Université de Rennes. CiteSeerX 10.1.1.29.4428.

[3] Buduma, Nikhil; Locascio, Nicholas (2017). Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms. O'Reilly. p. 21. ISBN 978-1-4919-2558-4.

[4] Patterson, Josh; Gibson, Adam (2017). "Understanding Learning Rates". Deep Learning: A Practitioner's Approach. O'Reilly. pp. 258–263. ISBN 978-1-4919-1425-0.

[5] Ruder, Sebastian (2017). "An Overview of Gradient Descent Optimization Algorithms". arXiv:1609.04747 [cs.LG].

[6] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Boston: Kluwer. p. 25. ISBN 1-4020-7553-7.

[7] Dixon, L. C. W. (1972). "The Choice of Step Length, a Crucial Factor in the Performance of Variable Metric Algorithms". Numerical Methods for Non-linear Optimization. London: Academic Press. pp. 149–170. ISBN 0-12-455650-7.
