Vanishing gradient problem

In machine learning, the vanishing gradient problem is encountered when training neural networks with gradient-based learning methods and backpropagation. In such methods, during each training iteration, each neural network weight receives an update proportional to the partial derivative of the loss function with respect to the current weight.^[1] The problem is that as the network depth or sequence length increases, the gradient magnitude typically is expected to decrease (or grow uncontrollably), slowing the training process.^[1] In the worst case, this may completely stop the neural network from further learning.^[1] As one example of the problem cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range $[-1,1]$ , and backpropagation computes gradients using the chain rule. This has the effect of multiplying $n$ of these small numbers to compute gradients of the early layers in an $n$ -layer network, meaning that the gradient (error signal) decreases exponentially with $n$ while the early layers train very slowly.

Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem",^[2]^[3] which not only affects many-layered feedforward networks,^[4] but also recurrent networks.^[5]^[6] The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time-step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time).

When activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem.

^ ^a ^b ^c Basodi, Sunitha; Ji, Chunyan; Zhang, Haiping; Pan, Yi (September 2020). "Gradient amplification: An efficient way to train deep neural networks". Big Data Mining and Analytics. 3 (3): 198. arXiv:2006.10560. doi:10.26599/BDMA.2020.9020004. ISSN 2096-0654. S2CID 219792172.
^ Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (Diplom thesis). Institut f. Informatik, Technische Univ. Munich.
^ Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. (2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kremer, S. C.; Kolen, J. F. (eds.). A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press. ISBN 0-7803-5369-2.
^ Goh, Garrett B.; Hodas, Nathan O.; Vishnu, Abhinav (15 June 2017). "Deep learning for computational chemistry". Journal of Computational Chemistry. 38 (16): 1291–1307. arXiv:1701.04503. Bibcode:2017arXiv170104503G. doi:10.1002/jcc.24764. PMID 28272810. S2CID 6831636.
^ Bengio, Y.; Frasconi, P.; Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. IEEE International Conference on Neural Networks. IEEE. pp. 1183–1188. doi:10.1109/ICNN.1993.298725. ISBN 978-0-7803-0999-9.
^ Pascanu, Razvan; Mikolov, Tomas; Bengio, Yoshua (21 November 2012). "On the difficulty of training Recurrent Neural Networks". arXiv:1211.5063 [cs.LG].

[Basodi2020-1] Basodi, Sunitha; Ji, Chunyan; Zhang, Haiping; Pan, Yi (September 2020). "Gradient amplification: An efficient way to train deep neural networks". Big Data Mining and Analytics. 3 (3): 198. arXiv:2006.10560. doi:10.26599/BDMA.2020.9020004. ISSN 2096-0654. S2CID 219792172.

[2] Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (Diplom thesis). Institut f. Informatik, Technische Univ. Munich.

[3] Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. (2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kremer, S. C.; Kolen, J. F. (eds.). A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press. ISBN 0-7803-5369-2.

[4] Goh, Garrett B.; Hodas, Nathan O.; Vishnu, Abhinav (15 June 2017). "Deep learning for computational chemistry". Journal of Computational Chemistry. 38 (16): 1291–1307. arXiv:1701.04503. Bibcode:2017arXiv170104503G. doi:10.1002/jcc.24764. PMID 28272810. S2CID 6831636.

[5] Bengio, Y.; Frasconi, P.; Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. IEEE International Conference on Neural Networks. IEEE. pp. 1183–1188. doi:10.1109/ICNN.1993.298725. ISBN 978-0-7803-0999-9.

[:1-6] Pascanu, Razvan; Mikolov, Tomas; Bengio, Yoshua (21 November 2012). "On the difficulty of training Recurrent Neural Networks". arXiv:1211.5063 [cs.LG].

[1]

[2]

[3]

[4]

[5]

[6]