Stochastic gradient descent

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization.

The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s.[1]

Background

Statistical estimation considers the problem of minimizing an objective function that has the form of a sum:

Q(w)={\frac {1}{n}}\sum _{i=1}^{n}Q_{i}(w),

where the parameter $w$ that minimizes $Q(w)$ is to be estimated. Each summand function $Q_{i}$ is typically associated with the $i$ -th observation.

In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation.[2] Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

The sum-minimization problem also arises for empirical risk minimization. In this case, $Q_{i}(w)$ is the value of the loss function at $i$ -th example, and $Q(w)$ is the empirical risk.

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

w:=w-\eta \nabla Q(w)=w-{\frac {\eta }{n}}\sum _{i=1}^{n}\nabla Q_{i}(w),

where $\eta$ is a step size.

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.

However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the data set is large and no simple formulas exist, evaluating the sum of gradients becomes very expensive, because the gradient of every summand function must be computed. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step.
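As a minimal sketch of this trade-off, the following Python compares the full sum-gradient with an estimate computed from a random subset of summands. The toy summands $Q_{i}(w)=(w-a_{i})^{2}$, the data `a`, and the helper names `full_gradient` and `subset_gradient` are illustrative assumptions, not part of the original presentation.

```python
import random

def full_gradient(grad_i, w, n):
    """Average of all n summand gradients: one pass over the whole data set."""
    return sum(grad_i(w, i) for i in range(n)) / n

def subset_gradient(grad_i, w, n, batch_size, rng):
    """Average over a random subset of summands: cost scales with batch_size, not n."""
    batch = rng.sample(range(n), batch_size)
    return sum(grad_i(w, i) for i in batch) / batch_size

# Hypothetical toy problem: Q_i(w) = (w - a_i)^2, so grad Q_i(w) = 2 * (w - a_i).
a = [1.0, 2.0, 3.0, 6.0]
grad_i = lambda w, i: 2.0 * (w - a[i])

rng = random.Random(0)
exact = full_gradient(grad_i, 0.0, len(a))               # uses all 4 summands
estimate = subset_gradient(grad_i, 0.0, len(a), 2, rng)  # uses only 2 of them
```

The estimate is noisy, but it is unbiased when the subset is drawn uniformly, which is what makes the cheaper update usable.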

Iterative method

[Figure: Fluctuations in the total objective function as gradient steps with respect to mini-batches are taken.]

In stochastic (or "on-line") gradient descent, the true gradient of $Q(w)$ is approximated by a gradient at a single example:

w:=w-\eta \nabla Q_{i}(w).

As the algorithm sweeps through the data samples, it performs the above update for each sample. Several passes can be made over the data set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles.

In pseudocode, stochastic gradient descent can be presented as follows:

Choose an initial vector of parameters $w$ and step size $\eta$ .
Repeat until an approximate minimum is obtained:
- Randomly shuffle examples in the data set.
- For $i=1,2,...,n$, do:
  - $\!w:=w-\eta \nabla Q_{i}(w).$
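The pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `sgd`, the callback signature `grad_i(w, i)`, the fixed number of passes used in place of a convergence test, and the toy problem are all hypothetical choices.

```python
import random

def sgd(grad_i, w, n, eta=0.01, epochs=100, seed=0):
    """Run SGD where grad_i(w, i) returns the gradient of Q_i at w as a list."""
    rng = random.Random(seed)
    w = list(w)
    order = list(range(n))
    for _ in range(epochs):          # stand-in for "until an approximate minimum"
        rng.shuffle(order)           # reshuffle the examples on every pass
        for i in order:
            g = grad_i(w, i)
            w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical toy problem: Q_i(w) = (w - a_i)^2, minimized by the mean of a.
a = [1.0, 2.0, 3.0, 6.0]
w = sgd(lambda w, i: [2.0 * (w[0] - a[i])], [0.0], len(a), eta=0.01, epochs=200)
```

With a constant step size the iterate does not settle exactly; it keeps fluctuating in a small neighbourhood of the minimizer (here the mean, 3.0).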

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the step size $\eta$ decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum.[3][4] This is in fact a consequence of the Robbins–Siegmund theorem.[5]

Example

Let's suppose we want to fit a straight line ${\hat {y}}=\!w_{1}+w_{2}x$ to create an estimate of a linear relationship between a data set with sample observations $(x_{1},x_{2},\ldots ,x_{n})$ and corresponding estimated responses $({\hat {y_{1}}},{\hat {y_{2}}},\ldots ,{\hat {y_{n}}})$ using least squares. The objective function to be minimized is:

Q(w)=\sum _{i=1}^{n}Q_{i}(w)=\sum _{i=1}^{n}\left({\hat {y_{i}}}-y_{i}\right)^{2}=\sum _{i=1}^{n}\left(w_{1}+w_{2}x_{i}-y_{i}\right)^{2}.

The last line in the above pseudocode for this specific problem will become:

{\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}:={\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}{\frac {\partial }{\partial w_{1}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\\{\frac {\partial }{\partial w_{2}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\end{bmatrix}}={\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}2(w_{1}+w_{2}x_{i}-y_{i})\\2x_{i}(w_{1}+w_{2}x_{i}-y_{i})\end{bmatrix}}.

Note that in each iteration (also called an update), the gradient is evaluated only at a single point $x_{i}$ instead of at the set of all samples.

The key difference compared to standard ("batch") gradient descent is that only one piece of data from the data set is used to calculate the step, and that piece of data is picked randomly at each step.
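The per-sample update for the straight-line fit can be sketched as below. The data, step size, and number of passes are assumptions chosen for illustration; the data are noiseless so that the recovered parameters are easy to check.

```python
import random

def fit_line_sgd(xs, ys, eta=0.01, epochs=1000, seed=0):
    """Fit y_hat = w1 + w2 * x by SGD on the squared residuals."""
    rng = random.Random(seed)
    w1, w2 = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            r = w1 + w2 * xs[i] - ys[i]          # residual at the single point i
            # Gradient of (w1 + w2*x_i - y_i)^2 is [2r, 2*x_i*r].
            w1, w2 = w1 - eta * 2.0 * r, w2 - eta * 2.0 * xs[i] * r
    return w1, w2

# Hypothetical noiseless data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
w1, w2 = fit_line_sgd(xs, ys)
```

Because every point lies exactly on the line, all summands share the same minimizer and the iterates approach $w_{1}=1$, $w_{2}=2$; with noisy data a constant step size would leave residual fluctuation.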

Notable applications

Stochastic gradient descent is used in signal processing applications such as adaptive filters, noise cancellation and the least mean squares (LMS) adaptive filter.

It is used in Geophysics, specifically in applications of Full-Waveform Inversion (FWI).[6]

Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning, including (linear) support vector machines, logistic regression (see, e.g., Vowpal Wabbit) and graphical models.[7] When combined with the backpropagation algorithm, it is the de facto standard algorithm for training artificial neural networks.[8]

Stochastic gradient descent competes with the L-BFGS algorithm, which is also widely used. Stochastic gradient descent has been used since at least 1960 for estimating linear regression models, originally under the name ADALINE.[9]

Extensions and variants

Many improvements on the basic stochastic gradient descent algorithm have been proposed and used. In particular, in machine learning, the need to set a learning rate (step size) has been recognized as problematic. Setting this parameter too high can cause the algorithm to diverge; setting it too low makes it slow to converge.[10] A conceptually simple extension of stochastic gradient descent makes the learning rate a decreasing function $\eta _{t}$ of the iteration number $t$ , giving a learning rate schedule, so that the first iterations cause large changes in the parameters, while the later ones do only fine-tuning. Such schedules have been known since the work of MacQueen on $k$ -means clustering.[11] Practical guidance on choosing the step size in several variants of SGD is given by Spall.[12]
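A learning rate schedule can be sketched as follows. The inverse-time form $\eta _{t}=\eta _{0}/(1+kt)$ is one common choice and is an assumption here, as are the names `inverse_time_decay` and `sgd_with_schedule` and the toy summands $Q_{t}(w)={\tfrac {1}{2}}(w-a_{t})^{2}$.

```python
def inverse_time_decay(eta0, decay, t):
    """Hypothetical schedule: eta_t = eta0 / (1 + decay * t)."""
    return eta0 / (1.0 + decay * t)

def sgd_with_schedule(samples, w0=0.0, eta0=1.0, decay=1.0):
    """One pass of SGD on Q_t(w) = 0.5 * (w - a_t)^2 with a decaying step size."""
    w = w0
    for t, a in enumerate(samples):
        eta_t = inverse_time_decay(eta0, decay, t)
        w -= eta_t * (w - a)     # gradient of Q_t at w is (w - a_t)
    return w

samples = [4.0, 8.0, 6.0, 2.0]
w = sgd_with_schedule(samples)
```

With $\eta _{0}=1$ and $k=1$ the update reduces to $w_{t+1}=(t\,w_{t}+a_{t})/(t+1)$, so the iterate is the running mean of the samples seen so far: early samples move it a lot, later ones only fine-tune it.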

Implicit updates (ISGD)

As mentioned earlier, classical stochastic gradient descent is generally sensitive to the step size $η$ . Fast convergence requires large step sizes but this may induce numerical instability. The problem can be largely solved[13] by considering implicit updates whereby the stochastic gradient is evaluated at the next iterate rather than the current one:

w^{new}:=w^{old}-\eta \nabla Q_{i}(w^{new}).

This equation is implicit since $w^{new}$ appears on both sides of the equation. It is a stochastic form of the proximal gradient method since the update can also be written as:

w^{new}:=\arg \min _{w}\{Q_{i}(w)+{\frac {1}{2\eta }}||w-w^{old}||^{2}\}.

As an example, consider least squares with features $x_{1},\ldots ,x_{n}\in \mathbb {R} ^{p}$ and observations $y_{1},\ldots ,y_{n}\in \mathbb {R}$ . We wish to solve:

\min _{w}\sum _{j=1}^{n}(y_{j}-x_{j}'w)^{2},

where $x_{j}'w=x_{j,1}w_{1}+x_{j,2}w_{2}+...+x_{j,p}w_{p}$ indicates the inner product. Note that $x$ could have "1" as the first element to include an intercept. Classical stochastic gradient descent proceeds as follows:

w^{new}=w^{old}+\eta (y_{i}-x_{i}'w^{old})x_{i}

where $i$ is uniformly sampled between 1 and $n$ . Although theoretical convergence of this procedure happens under relatively mild assumptions, in practice the procedure can be quite unstable. In particular, when $\eta$ is misspecified so that $I-\eta x_{i}x_{i}'$ has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast, implicit stochastic gradient descent (shortened as ISGD) can be solved in closed-form as:

w^{new}=w^{old}+{\frac {\eta }{1+\eta ||x_{i}||^{2}}}(y_{i}-x_{i}'w^{old})x_{i}.

This procedure will remain numerically stable for virtually all $\eta$ as the step size is now normalized. This comparison between classical and implicit stochastic gradient descent in the least squares problem is very similar to the comparison between the least mean squares (LMS) and normalized least mean squares (NLMS) filters.
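The contrast between the two least squares updates can be sketched as follows. The data, the true parameter `w_true`, and the deliberately oversized step $\eta =1$ are assumptions chosen to make the instability visible; the samples are cycled in order rather than drawn at random, to keep the run deterministic.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_step(w, x, y, eta):
    """Classical update: w + eta * (y - x'w) * x."""
    r = y - dot(x, w)
    return [wj + eta * r * xj for wj, xj in zip(w, x)]

def isgd_step(w, x, y, eta):
    """Implicit update in closed form: step size normalized by 1 + eta * ||x||^2."""
    r = y - dot(x, w)
    scale = eta / (1.0 + eta * dot(x, x))
    return [wj + scale * r * xj for wj, xj in zip(w, x)]

def err(w, w_true):
    return sum((a - b) ** 2 for a, b in zip(w, w_true)) ** 0.5

# Hypothetical noiseless data; the first feature is the intercept.
xs = [[1.0, 10.0], [1.0, -10.0], [1.0, 5.0]]
w_true = [1.0, 2.0]
ys = [dot(x, w_true) for x in xs]

eta = 1.0                                  # far too large for the classical update
w_sgd, w_isgd = [0.0, 0.0], [0.0, 0.0]
for _ in range(3):                         # three passes over the data
    for x, y in zip(xs, ys):
        w_sgd = sgd_step(w_sgd, x, y, eta)
        w_isgd = isgd_step(w_isgd, x, y, eta)

sgd_error, isgd_error = err(w_sgd, w_true), err(w_isgd, w_true)
```

In this run the classical iterate blows up because $I-\eta x_{i}x_{i}'$ has an eigenvalue of magnitude far above 1, while the implicit iterate moves steadily closer to `w_true`.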

Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that $Q_{i}(w)$ depends on $w$ only through a linear combination with features $x_{i}$ , so that we can write $\nabla _{w}Q_{i}(w)=-q(x_{i}'w)x_{i}$ , where $q()\in \mathbb {R}$ may depend on $x_{i},y_{i}$ as well but not on $w$ except through $x_{i}'w$ . Least squares obeys this rule, and so does logistic regression, and most generalized linear models. For instance, in least squares, $q(x_{i}'w)=y_{i}-x_{i}'w$ , and in logistic regression $q(x_{i}'w)=y_{i}-S(x_{i}'w)$ , where $S(u)=e^{u}/(1+e^{u})$ is the logistic function. In Poisson regression, $q(x_{i}'w)=y_{i}-e^{x_{i}'w}$ , and so on.

In such settings, ISGD is simply implemented as follows. Let $f(\xi )=\eta q(x_{i}'w^{old}+\xi ||x_{i}||^{2})$ , where $\xi$ is scalar. Then, ISGD is equivalent to:

w^{new}=w^{old}+\xi ^{\ast }x_{i},~{\text{where}}~\xi ^{\ast }=f(\xi ^{\ast }).

The scaling factor $\xi ^{\ast }\in \mathbb {R}$ can be found through the bisection method since in most regular models, such as the aforementioned generalized linear models, function $q()$ is decreasing, and thus the search bounds for $\xi ^{\ast }$ are $[\min(0,f(0)),\max(0,f(0))]$ .
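The bisection search for $\xi ^{\ast }$ can be sketched for logistic regression as follows. The function name `isgd_logistic_step` and the fixed iteration count are assumptions; the sketch also assumes moderate values of $x_{i}'w$ so that `math.exp` does not overflow.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def isgd_logistic_step(w, x, y, eta, iters=60):
    """One implicit SGD step for logistic regression, solving xi = f(xi) by bisection."""
    xw = sum(wj * xj for wj, xj in zip(w, x))
    x2 = sum(xj * xj for xj in x)

    def f(xi):
        # f(xi) = eta * q(x'w_old + xi * ||x||^2) with q(u) = y - S(u)
        return eta * (y - sigmoid(xw + xi * x2))

    lo, hi = min(0.0, f(0.0)), max(0.0, f(0.0))   # search bounds from the text
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - f(mid) > 0.0:    # xi - f(xi) is increasing because f is decreasing
            hi = mid
        else:
            lo = mid
    xi = 0.5 * (lo + hi)
    return [wj + xi * xj for wj, xj in zip(w, x)], xi

w_old, x, y, eta = [0.0, 0.0], [1.0, 2.0], 1.0, 0.5
w_new, xi = isgd_logistic_step(w_old, x, y, eta)
```

The returned `w_new` satisfies the implicit equation up to the bisection tolerance, which can be checked by evaluating $\eta \,q(x_{i}'w^{new})$ and comparing it with `xi`.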

Momentum

Further proposals include the momentum method, which appeared in Rumelhart, Hinton and Williams' paper on backpropagation learning.[14] Stochastic gradient descent with momentum remembers the update $\Delta w$ at each iteration, and determines the next update as a linear combination of the gradient and the previous update:[15][16]

\Delta w:=\alpha \Delta w-\eta \nabla Q_{i}(w)

w:=w+\Delta w

that leads to:

w:=w-\eta \nabla Q_{i}(w)+\alpha \Delta w

where the parameter $w$ which minimizes $Q(w)$ is to be estimated, $\eta$ is a step size (sometimes called the learning rate in machine learning) and $\alpha$ is an exponential decay factor between 0 and 1 that determines the relative contribution of the current gradient and earlier gradients to the weight change.

The name momentum stems from an analogy to momentum in physics: the weight vector $w$ , thought of as a particle traveling through parameter space,[14] incurs acceleration from the gradient of the loss ("force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully by computer scientists in the training of artificial neural networks for several decades.[17]
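The momentum update can be sketched as a single step function. The name `momentum_step`, the default values of $\eta$ and $\alpha$, and the constant gradient used in the demonstration are illustrative assumptions.

```python
def momentum_step(w, dw, g, eta=0.1, alpha=0.9):
    """One momentum update: dw := alpha * dw - eta * g, then w := w + dw."""
    dw = [alpha * d - eta * gj for d, gj in zip(dw, g)]
    w = [wj + d for wj, d in zip(w, dw)]
    return w, dw

# Two steps with a constant (hypothetical) stochastic gradient g = [2.0].
w, dw = [1.0], [0.0]
w, dw = momentum_step(w, dw, [2.0])   # dw is about -0.2,  w about 0.8
w, dw = momentum_step(w, dw, [2.0])   # dw is about -0.38, w about 0.42
```

The second step is larger than the first even though the gradient is unchanged, because a fraction $\alpha$ of the previous update is carried over.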

Averaging

Averaged stochastic gradient descent, invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of[18]

{\bar {w}}={\frac {1}{t}}\sum _{i=0}^{t-1}w_{i}.

When optimization is done, this averaged parameter vector takes the place of $w$ .
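A minimal sketch of averaging, assuming a scalar parameter for brevity: the running mean is updated incrementally so that no history of iterates needs to be stored. The function name `averaged_sgd` and the toy summand are hypothetical.

```python
import random

def averaged_sgd(grad_i, w0, n, eta, steps, seed=0):
    """Ordinary SGD that also tracks w_bar = (1/t) * sum of w_0 .. w_{t-1}."""
    rng = random.Random(seed)
    w, w_bar = w0, 0.0
    for t in range(steps):
        w_bar += (w - w_bar) / (t + 1)   # fold w_t into the running mean
        i = rng.randrange(n)             # pick one summand at random
        w -= eta * grad_i(w, i)          # the SGD update itself is unchanged
    return w, w_bar

# Hypothetical single summand Q(w) = 0.5 * (w - 2)^2, so the iterates are
# w_0 = 0, w_1 = 1, w_2 = 1.5, w_3 = 1.75.
a = [2.0]
w, w_bar = averaged_sgd(lambda w, i: w - a[i], 0.0, len(a), eta=0.5, steps=3)
```

After three steps `w` holds $w_{3}$ and `w_bar` holds the mean of $w_{0},w_{1},w_{2}$, matching the formula above with $t=3$.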

AdaGrad

AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with per-parameter learning rate, first published in 2011.[19] Informally, this increases the learning rate for sparser parameters and decreases the learning rate for ones that are less sparse. This strategy often improves convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative. Examples of such applications include natural language processing and image recognition.[19] It still has a base learning rate $\eta$ , but this is multiplied with the elements of a vector $\{G_{j,j}\}$ which is the diagonal of the outer product matrix

G=\sum _{\tau =1}^{t}g_{\tau }g_{\tau }^{\mathsf {T}}

where $g_{\tau }=\nabla Q_{i}(w)$ , the gradient, at iteration $\tau$ . The diagonal is given by

G_{j,j}=\sum _{\tau =1}^{t}g_{\tau ,j}^{2}.

This vector is updated after every iteration. The formula for an update is now

w:=w-\eta \,\mathrm {diag} (G)^{-{\frac {1}{2}}}\circ g

[a]

or, written as per-parameter updates,

w_{j}:=w_{j}-{\frac {\eta }{\sqrt {G_{j,j}}}}g_{j}.

Each $G_{i,i}$ gives rise to a scaling factor for the learning rate that applies to a single parameter $w_{i}$ . Since the denominator in this factor, ${\sqrt {G_{i,i}}}={\sqrt {\sum _{\tau =1}^{t}g_{\tau ,i}^{2}}}$ , is the ℓ₂ norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.[17]

While designed for convex problems, AdaGrad has been successfully applied to non-convex optimization.[20]
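The per-parameter AdaGrad update can be sketched as below. The small constant `eps` added to the denominator is an assumption that guards against division by zero when a gradient component has always been zero; it does not appear in the formulas above. The function name and default values are likewise illustrative.

```python
import math

def adagrad_step(w, G_diag, g, eta=0.1, eps=1e-8):
    """One AdaGrad update; G_diag accumulates squared gradients per parameter."""
    G_diag = [Gj + gj * gj for Gj, gj in zip(G_diag, g)]
    w = [wj - eta / (math.sqrt(Gj) + eps) * gj
         for wj, Gj, gj in zip(w, G_diag, g)]
    return w, G_diag

# First step with gradient components of very different size.
w, G_diag = adagrad_step([1.0, 1.0], [0.0, 0.0], [3.0, 0.5])
```

On the first step both parameters move by roughly $\eta$ regardless of gradient magnitude, since each component is divided by its own accumulated norm.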

RMSProp

RMSProp (for Root Mean Square Propagation) is also a method in which the learning rate is adapted for each of the parameters. The idea is to divide the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.[21] So, first the running average is calculated in terms of the mean square,

v(w,t):=\gamma v(w,t-1)+(1-\gamma )(\nabla Q_{i}(w))^{2}

where $\gamma$ is the forgetting factor.

And the parameters are updated as,

w:=w-{\frac {\eta }{\sqrt {v(w,t)}}}\nabla Q_{i}(w)

RMSProp has shown good adaptation of the learning rate in different applications. RMSProp can be seen as a generalization of Rprop and is capable of working with mini-batches as well, as opposed to only full batches.[22]
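The two RMSProp equations can be sketched as one step function. The constant `eps` in the denominator is an assumption added to avoid division by zero; the function name and default values of $\eta$ and $\gamma$ are also illustrative.

```python
import math

def rmsprop_step(w, v, g, eta=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update; v is the running mean of squared gradients."""
    v = [gamma * vj + (1.0 - gamma) * gj * gj for vj, gj in zip(v, g)]
    w = [wj - eta / (math.sqrt(vj) + eps) * gj
         for wj, vj, gj in zip(w, v, g)]
    return w, v

w, v = rmsprop_step([1.0], [0.0], [2.0])
```

Unlike the AdaGrad accumulator, `v` forgets old gradients at rate $\gamma$ , so the effective learning rate does not shrink monotonically.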

Adam

Adam[23] (short for Adaptive Moment Estimation) is an update to the RMSProp optimizer. In this optimization algorithm, running averages of both the gradients and the second moments of the gradients are used. Given parameters $w^{(t)}$ and a loss function $L^{(t)}$ , where $t$ indexes the current training iteration (indexed at $0$ ), Adam's parameter update is given by:

m_{w}^{(t+1)}\leftarrow \beta _{1}m_{w}^{(t)}+(1-\beta _{1})\nabla _{w}L^{(t)}

v_{w}^{(t+1)}\leftarrow \beta _{2}v_{w}^{(t)}+(1-\beta _{2})(\nabla _{w}L^{(t)})^{2}

{\hat {m}}_{w}={\frac {m_{w}^{(t+1)}}{1-\beta _{1}^{t+1}}}

{\hat {v}}_{w}={\frac {v_{w}^{(t+1)}}{1-\beta _{2}^{t+1}}}

w^{(t+1)}\leftarrow w^{(t)}-\eta {\frac {{\hat {m}}_{w}}{{\sqrt {{\hat {v}}_{w}}}+\epsilon }}

where $\epsilon$ is a small scalar (e.g. $10^{-8}$ ) used to prevent division by 0, and $\beta _{1}$ (e.g. 0.9) and $\beta _{2}$ (e.g. 0.999) are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting are done elementwise.
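The Adam update can be sketched directly from the equations above, with $t$ counted from 0 as in the text. The function name `adam_step` and the default learning rate are assumptions.

```python
import math

def adam_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at zero-based iteration t; all operations are elementwise."""
    m = [beta1 * mj + (1.0 - beta1) * gj for mj, gj in zip(m, g)]
    v = [beta2 * vj + (1.0 - beta2) * gj * gj for vj, gj in zip(v, g)]
    # Bias correction compensates for m and v being initialized at zero.
    m_hat = [mj / (1.0 - beta1 ** (t + 1)) for mj in m]
    v_hat = [vj / (1.0 - beta2 ** (t + 1)) for vj in v]
    w = [wj - eta * mh / (math.sqrt(vh) + eps)
         for wj, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# First step (t = 0) from zero-initialized moments.
w, m, v = adam_step([1.0], [0.0], [0.0], [4.0], t=0)
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so the parameter moves by about $\eta$ whatever the gradient's scale.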

Backtracking line search

Backtracking line search is another variant of gradient descent. It is based on a condition known as the Armijo–Goldstein condition. Both backtracking line search and adaptive SGD allow learning rates to change at each iteration; however, the manner of the change is different. Backtracking line search uses function evaluations to check Armijo's condition, and in principle the loop in the algorithm for determining the learning rates can be long and unknown in advance. Adaptive SGD does not need a loop in determining learning rates. On the other hand, adaptive SGD does not guarantee the "descent property", which backtracking line search enjoys, namely that $f(x_{n+1})\leq f(x_{n})$ for all $n$ . If the gradient of the cost function is globally Lipschitz continuous, with Lipschitz constant $L$ , and the learning rate is chosen of the order $1/L$ , then the standard version of SGD is a special case of backtracking line search.
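A minimal sketch of one backtracking step, assuming the sufficient-decrease form of Armijo's condition, $f(x-\eta \nabla f(x))\leq f(x)-c\,\eta \,\|\nabla f(x)\|^{2}$ . The initial step `eta0`, the shrink factor `beta`, the constant `c`, and the cap on the loop length are hypothetical choices.

```python
def backtracking_step(f, grad, x, eta0=1.0, beta=0.5, c=1e-4, max_halvings=50):
    """Shrink the step size until Armijo's sufficient-decrease condition holds."""
    g = grad(x)
    g2 = sum(gj * gj for gj in g)
    fx = f(x)
    eta = eta0
    for _ in range(max_halvings):
        x_new = [xj - eta * gj for xj, gj in zip(x, g)]
        if f(x_new) <= fx - c * eta * g2:   # Armijo's condition
            return x_new, eta
        eta *= beta                          # step too long: shrink and retry
    return x, 0.0                            # no acceptable step found

# Hypothetical objective f(x) = x^2 starting from x = 3.
f = lambda x: x[0] ** 2
grad = lambda x: [2.0 * x[0]]
x_new, eta = backtracking_step(f, grad, [3.0])
```

Here the first trial step overshoots to $x=-3$ and is rejected; halving it lands on the minimizer. The accepted step never increases $f$ , which is the descent property mentioned above.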

Second-order methods

A stochastic analogue of the standard (deterministic) Newton–Raphson algorithm (a "second-order" method) provides an asymptotically optimal or near-optimal form of iterative optimization in the setting of stochastic approximation. A method that uses direct measurements of the Hessian matrices of the summands in the empirical risk function was developed by Byrd, Hansen, Nocedal, and Singer.[24] However, directly determining the required Hessian matrices for optimization may not be possible in practice. Practical and theoretically sound methods for second-order versions of SGD that do not require direct Hessian information are given by Spall and others.[25][26][27] (A less efficient method based on finite differences, instead of simultaneous perturbations, is given by Ruppert.[28]) These methods not requiring direct Hessian information are based on either values of the summands in the above empirical risk function or values of the gradients of the summands (i.e., the SGD inputs). In particular, second-order optimality is asymptotically achievable without direct calculation of the Hessian matrices of the summands in the empirical risk function.

Notes

[a] $\circ$ is the element-wise product.

External links

Using stochastic gradient descent in C++, Boost, Ublas for linear regression
Machine Learning Algorithms
Goh (April 4, 2017). "Why Momentum Really Works". Distill. An interactive paper explaining momentum.
"Gradient Descent, How Neural Networks Learn". 3Blue1Brown. October 16, 2017 – via YouTube.

References

[1] Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
[2] Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the American Statistical Association. 77 (380): 831–834. doi:10.1080/01621459.1982.10477894. JSTOR 2287314.
[3] Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
[4] Kiwiel, Krzysztof C. (2001). "Convergence and efficiency of subgradient methods for quasiconvex minimization". Mathematical Programming, Series A. 90 (1). Berlin, Heidelberg: Springer. pp. 1–25. doi:10.1007/PL00011414. ISSN 0025-5610. MR 1819784.
[5] Robbins, Herbert; Siegmund, David O. (1971). "A convergence theorem for non negative almost supermartingales and some applications". In Rustagi, Jagdish S. (ed.). Optimizing Methods in Statistics. Academic Press. ISBN 0-12-604550-X.
[6] Jerome R. Krebs, John E. Anderson, David Hinkley, Ramesh Neelamani, Sunwoong Lee, Anatoly Baumstein, and Martin-Daniel Lacasse, (2009), "Fast full-wavefield seismic inversion using encoded sources," GEOPHYSICS 74: WCC177-WCC188.
[7] Jenny Rose Finkel, Alex Kleeman, Christopher D. Manning (2008). Efficient, Feature-based, Conditional Random Field Parsing. Proc. Annual Meeting of the ACL.
[8] LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer Berlin Heidelberg, 2012. 9-48
[9] Avi Pfeffer. "CS181 Lecture 5 — Perceptrons" (PDF). Harvard University.
[10] Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning. MIT Press. p. 291. ISBN 978-0262035613.
[11] Cited by Darken, Christian; Moody, John (1990). Fast adaptive k-means clustering: some empirical results. Int'l Joint Conf. on Neural Networks (IJCNN). IEEE. doi:10.1109/IJCNN.1990.137720.
[12] Spall, J. C. (2003). Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Hoboken, NJ: Wiley. pp. Sections 4.4, 6.6, and 7.5. ISBN 0-471-33052-3.
[13] Toulis, Panos; Airoldi, Edoardo (2017). "Asymptotic and finite-sample properties of estimators based on stochastic gradients". Annals of Statistics. 45 (4): 1694–1727. arXiv:1408.2923. doi:10.1214/16-AOS1506. S2CID 10279395.
[14] Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0. S2CID 205001834.
[15] Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey E. (June 2013). Sanjoy Dasgupta and David Mcallester (ed.). On the importance of initialization and momentum in deep learning (PDF). In Proceedings of the 30th international conference on machine learning (ICML-13). 28. Atlanta, GA. pp. 1139–1147. Retrieved 14 January 2016.
[16] Sutskever, Ilya (2013). Training recurrent neural networks (PDF) (Ph.D.). University of Toronto. p. 74.
[17] Zeiler, Matthew D. (2012). "ADADELTA: An adaptive learning rate method". arXiv:1212.5701 [cs.LG].
[18] Polyak, Boris T.; Juditsky, Anatoli B. (1992). "Acceleration of stochastic approximation by averaging" (PDF). SIAM J. Control Optim. 30 (4): 838–855. doi:10.1137/0330046.
[19] Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive subgradient methods for online learning and stochastic optimization" (PDF). JMLR. 12: 2121–2159.
[20] Gupta, Maya R.; Bengio, Samy; Weston, Jason (2014). "Training highly multiclass classifiers" (PDF). JMLR. 15 (1): 1461–1492.
[21] Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude" (PDF). p. 26. Retrieved 19 March 2020.
[22] Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude" (PDF). p. 29. Retrieved 19 March 2020.
[23] Kingma, Diederik; Ba, Jimmy (2014). "Adam: A method for stochastic optimization". arXiv:1412.6980 [cs.LG].
[24] Byrd, R. H.; Hansen, S. L.; Nocedal, J.; Singer, Y. (2016). "A Stochastic Quasi-Newton method for Large-Scale Optimization". SIAM Journal on Optimization. 26 (2): 1008–1031. arXiv:1401.7020. doi:10.1137/140954362. S2CID 12396034.
[25] Spall, J. C. (2000). "Adaptive Stochastic Approximation by the Simultaneous Perturbation Method". IEEE Transactions on Automatic Control. 45 (10): 1839–1853. doi:10.1109/TAC.2000.880982.
[26] Spall, J. C. (2009). "Feedback and Weighting Mechanisms for Improving Jacobian Estimates in the Adaptive Simultaneous Perturbation Algorithm". IEEE Transactions on Automatic Control. 54 (6): 1216–1229. doi:10.1109/TAC.2009.2019793.
[27] Bhatnagar, S.; Prasad, H. L.; Prashanth, L. A. (2013). Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods. London: Springer. ISBN 978-1-4471-4284-3.
[28] Ruppert, D. (1985). "A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure". Annals of Statistics. 13 (1): 236–245. doi:10.1214/aos/1176346589.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
