Training

The most enigmatic procedure in machine learning is the training of neural networks or, more generally, of parametric families of functions.

Essentially, training is the minimization of a loss: for a given loss function $ L $ on a space of functions $ f^\theta $ parametrized by parameters $ \theta \in \Theta $, one searches for $$ \operatorname{arginf}_{\theta \in \Theta} L(f^\theta) \, . $$

Assume that $ \Theta $ is a manifold and that $ U : \theta \mapsto L(f^\theta) $ is a sufficiently regular function with a unique minimizer $ \theta^* \in \Theta $. Then one can describe essentially one local and one global method to find the infimum:

  1. If $ U $ is strictly convex and $ C^2 $ in a neighborhood of the unique minimizer $ \theta^* $, in the sense that there is a chart in which $ U $ has these properties, then the gradient flow $$ d \theta_t = - D_\Theta U(\theta_t) dt $$ converges to $ \theta^* $ as $ t \to \infty $ for appropriate first order differential operators $ D_\Theta $. Let us consider this theorem for $ \Theta $ equal to the unit ball and $ D_\Theta = \nabla $.

For any $ t \geq 0 $ it holds that $$ d U(\theta_t) = - {|| \nabla U(\theta_t) ||}^2 dt \, , $$ i.e. the value of $ U $ is strictly decreasing along the path $ t \mapsto \theta_t $ away from $ \theta^* $. Together with the fact that $ U $ is strictly convex we obtain the convergence estimate $ || \theta_t - \theta^* || \leq C \exp(- \lambda_{\text{min}} t ) $ as $ t \to \infty $, where $ \lambda_{\text{min}} $ is the minimum of the smallest eigenvalue of the Hessian of $ U $ on $ \Theta $. Remarkably, this holds for any starting point $ \theta_0 \in \Theta $ and is the basis of all sorts of gradient descent algorithms.
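
The following minimal sketch illustrates this contraction numerically: an explicit Euler discretization of the gradient flow for a hypothetical strictly convex quadratic $ U(\theta) = \frac{1}{2} \theta^\top A \theta $ on $ \mathbb{R}^2 $ (the matrix $ A $, the starting point and the step size are arbitrary illustrative choices), compared against the $ \exp(-\lambda_{\text{min}} t) $ rate.

```python
import numpy as np

# Minimal sketch: explicit Euler discretization of the gradient flow
# d theta_t = -grad U(theta_t) dt for a (hypothetical) strictly convex
# quadratic U(theta) = 0.5 * theta^T A theta with minimizer theta* = 0.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # symmetric positive definite Hessian of U
theta_star = np.zeros(2)              # unique minimizer

def grad_U(theta):
    return A @ theta

theta_0 = np.array([1.0, -1.5])       # arbitrary starting point
theta = theta_0.copy()
dt, n_steps = 0.01, 2000
lam_min = np.linalg.eigvalsh(A).min() # smallest eigenvalue of the Hessian

for _ in range(n_steps):
    theta = theta - dt * grad_U(theta)

t = n_steps * dt
print("||theta_t - theta*||      :", np.linalg.norm(theta - theta_star))
print("C * exp(-lam_min * t) rate:", np.linalg.norm(theta_0) * np.exp(-lam_min * t))
```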

A far-reaching generalization is given by the following consideration: consider $ U $ on $ \Theta $ with a unique minimizer $ \theta^* \in \Theta $. Then the probability measure given by the density $$ p_\epsilon := \frac{1}{Z_\epsilon} \exp \big( -\frac{U}{\epsilon} \big) $$ tends in law to $ \delta_{\theta^*} $ as $ \epsilon \to 0 $. The normalizing constant $ Z_\epsilon $ is just the integral $ \int_{\Theta} \exp(-U(\theta)/\epsilon) d \lambda(\theta) $, and the above statement is nothing else than the fact that the described density concentrates at $ \theta^* $. If one manages to sample from the measure $ p_\epsilon d \lambda $, then one can approximate $ \theta^* $ empirically.

One method to sample from this measure is to simulate a stochastic differential equation (for $ \Theta = \mathbb{R}^N $) of the type $$ d \theta_t = - \nabla U(\theta_t) dt + \alpha(t) dW_t \, , $$ where $ W $ is an $ N $-dimensional Brownian motion and the non-negative function $ \alpha $, called the cooling schedule, satisfies $ \alpha(t)^2 \sim \frac{c}{\log(t)} $ as $ t \to \infty $. For an appropriate constant $ c $ we obtain that $ \theta_t $ converges in law to $ \delta_{\theta^*} $ as $ t \to \infty $. This procedure is called simulated annealing and is the foundation of global minimization algorithms of the type 'steepest descent plus noise'.
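
As a rough illustration, here is an Euler-Maruyama simulation of such an SDE for a hypothetical one-dimensional double-well potential whose global minimum lies near $ \theta = 1 $ and whose non-global local minimum lies near $ \theta = -1 $; the potential, the constant in the cooling schedule and the step size are arbitrary illustrative choices, not a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch of 'steepest descent plus noise': Euler-Maruyama scheme for
# d theta_t = -grad U(theta_t) dt + alpha(t) dW_t on Theta = R, with a
# hypothetical double-well potential U(x) = (x^2 - 1)^2 + 0.3*x^2 - 0.3*x,
# whose global minimum is near x = 1 and whose local minimum is near x = -1.
def grad_U(x):
    return 4.0 * x * (x**2 - 1.0) + 0.6 * x - 0.3

def alpha(t, c=1.0):
    # cooling schedule with alpha(t)^2 ~ c / log(t), shifted to avoid log(0)
    return np.sqrt(c / np.log(np.e + t))

dt, n_steps = 0.01, 200_000
theta = -1.0                          # start in the basin of the local minimum

for n in range(n_steps):
    t = n * dt
    theta += -grad_U(theta) * dt + alpha(t) * np.sqrt(dt) * rng.standard_normal()

# with high probability the path has left the local basin and hovers near +1
print("final theta (global minimizer is near +1):", theta)
```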

In machine learning applications, however, one applies an algorithm that traces back to the work of Robbins-Monro (and Kiefer-Wolfowitz, respectively) in the 1950s: the so-called stochastic approximation algorithm.

The stochastic gradient descent algorithm essentially says that, for a function of expectation type $ U(\theta) = E \big[ V(\theta,\cdot) \big] $, the iteration $$ \theta_{n+1} = \theta_n - \gamma_n \nabla V(\theta_n,\omega_n) $$ with independently chosen samples $ \omega_n $ and an appropriate step-size sequence $ (\gamma_n) $ converges in law to $ \theta^* $. Notice first that all our examples, in particular the ones from mathematical finance, are of expectation type, where the samples $ \omega $ are usually seen as elements of the training data set.
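
A minimal sketch of this iteration, assuming the toy objective $ V(\theta, \omega) = \frac{1}{2} (\theta - \omega)^2 $ so that $ \theta^* = E[\omega] $, with the training data set replaced by a synthetic sample; all names and values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal sketch of stochastic gradient descent for an expectation-type
# objective U(theta) = E[ V(theta, omega) ] with the toy choice
# V(theta, omega) = 0.5 * (theta - omega)^2, so that theta* = E[omega].
# The 'training set' below is a synthetic, purely illustrative sample.
data = rng.normal(loc=2.0, scale=1.0, size=10_000)     # samples omega
theta_star = data.mean()                               # empirical minimizer

def grad_V(theta, omega):
    return theta - omega

theta = 0.0
for n in range(1, 50_001):
    omega = data[rng.integers(len(data))]              # independently chosen sample
    gamma_n = 1.0 / n                                  # Robbins-Monro step sizes:
    theta -= gamma_n * grad_V(theta, omega)            # sum gamma_n = inf, sum gamma_n^2 < inf

print("theta after SGD     :", theta)
print("empirical minimizer :", theta_star)
```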

There are several proofs of this statement, but I want to focus on one particular aspect, which connects stochastic approximation with simulated annealing.

By the central limit theorem and appropriate sub-sampling one can understand the stochastic gradient as $$ \nabla V(\theta,\omega) = \nabla U(\theta) + \text{ 'Gaussian noise with a certain covariance structure' } \Sigma(\theta) \, , $$ where $ \Sigma(\theta) $ is essentially given by $$ \operatorname{cov}\big(\nabla V(\theta,\cdot)\big) \, . $$ Hence one can understand stochastic gradient descent as simulated annealing with a space-dependent covariance structure.
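
This decomposition can be checked numerically: for mini-batches of size $ B $ the averaged gradient fluctuates around $ \nabla U(\theta) $ with covariance close to $ \Sigma(\theta)/B $. The sketch below does this for the same toy quadratic objective as before; the data set, the point $ \theta $ and the batch size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Minimal sketch of the noise decomposition: for a mini-batch of size B the
# averaged gradient equals grad U(theta) plus noise whose covariance is
# approximately Sigma(theta) / B, with Sigma(theta) = cov(grad V(theta, .)).
# Data, theta and B below are illustrative choices.
data = rng.normal(size=(100_000, 2)) @ np.array([[1.0, 0.5],
                                                 [0.0, 1.0]])

def grad_V(theta, omega):             # per-sample gradient of 0.5*||theta - omega||^2
    return theta - omega

theta = np.array([0.3, -0.7])
full_grad = grad_V(theta, data).mean(axis=0)           # approximates grad U(theta)
Sigma = np.cov(grad_V(theta, data), rowvar=False)      # Sigma(theta)

B = 64
noise = []
for _ in range(5_000):
    batch = data[rng.integers(len(data), size=B)]      # sub-sample of size B
    noise.append(grad_V(theta, batch).mean(axis=0) - full_grad)
noise = np.array(noise)

print("empirical covariance of mini-batch gradient noise:\n", np.cov(noise, rowvar=False))
print("predicted Sigma(theta) / B:\n", Sigma / B)
```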

Such simulated annealing algorithms exist but are more sophisticated in nature; in particular, one needs to work with geometric structures on $ \Theta $.