ReLU was first used by Alston Householder in 1941 as a mathematical abstraction of biological neural networks.[10] Kunihiko Fukushima introduced it in 1969 in the context of visual feature extraction in hierarchical neural networks.[11][12] It was later argued to have strong biological motivations and mathematical justifications.[13][14] In 2011,[4] ReLU activation enabled training deep supervised neural networks without unsupervised pre-training, in contrast to the activation functions widely used until then, e.g., the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more practical[15] counterpart, the hyperbolic tangent.
^Fukushima, K. (1969). "Visual feature extraction by a multilayered network of analog threshold elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. doi:10.1109/TSSC.1969.300225.
^Fukushima, K.; Miyake, S. (1982). "Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Visual Pattern Recognition". Competition and Cooperation in Neural Nets. Lecture Notes in Biomathematics. Vol. 45. Springer. pp. 267–285. doi:10.1007/978-3-642-46466-9_18. ISBN 978-3-540-11574-8.