{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most enigmatic procedure in machine learning is training of neural networks, or, in general, parametric families of functions.\n",
    "\n",
    "Essentially training is described as minimization of loss, i.e. for a given loss function $ L $ on a space of functions $ f $ parametrized by a set of parameters $ \\theta \\in \\Theta $\n",
    "$$\n",
    "\\operatorname{arginf}_{\\theta \\in \\Theta} L(f^\\theta)\n",
    "$$\n",
    "is searched."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assume that $ \\Theta $ is some manifold of points and $ U : \\theta \\to L(f^\\theta) $ a sufficiently regular function with a unique minimum $ \\theta^* \\in \\Theta $, then one can describe essentially one local and one global method to find the infimum:\n",
    "\n",
    "1. If $ U $ is strictly convex and $ C^2 $ in a neighborhood of the unique minimizer $ \\theta^* $, in the sense that there is a chart such that $ U $ has these properties, then\n",
    "$$\n",
    "d \\theta_t = - D_\\Theta U(\\theta_t) dt\n",
    "$$\n",
    "converges to $ \\theta^* $ as $ t \\to \\infty $ for appropriate first order differential operators $ D_\\Theta $. Let us consider this theorem on $ \\Theta $ equal to the unit ball $ D_\\Theta = \\nabla $.\n",
    "\n",
    "For any $ t \\geq 0 $ it holds that\n",
    "$$\n",
    "d U(\\theta_t) = - {|| \\nabla U(\\theta_t) ||}^2 dt \\, ,\n",
    "$$\n",
    "i.e. the value of $ U $ is strictly increasing along the path $ t \\mapsto \\theta_t $. Together with the fact that $ U $ is strictly convex we obtain a convergence of $ || \\theta_ t - \\theta^* || \\leq C \\exp(- \\lambda_{\\text{min}} t ) $ as $ t \\to \\infty $, where $ \\lambda_min $ is the minimum of the smallest eigenvalue of the Hessian of $ U $ on $ \\Theta $. This holds remarkably for any starting point $ \\theta_0 \\in \\Theta $ and is the basis of all sorts of gradient descent algorithms."
   ]
  },
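  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The exponential bound above can be checked numerically. The following is a minimal sketch, not a definitive implementation: it discretizes the gradient flow with explicit Euler steps, which is plain gradient descent with a constant step size. The quadratic $ U $, the matrix `A`, the minimizer `theta_star`, the step size `dt` and the starting point are all assumptions chosen for illustration. For this quadratic the printed error should stay below the bound $ || \\theta_0 - \\theta^* || \\exp(- \\lambda_{\\text{min}} t) $."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Assumed example: U(theta) = 0.5 * (theta - theta_star)^T A (theta - theta_star)\n",
    "A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite Hessian\n",
    "theta_star = np.array([0.2, -0.1])      # unique minimizer of U\n",
    "\n",
    "def grad_U(theta):\n",
    "    # Gradient of the quadratic U\n",
    "    return A @ (theta - theta_star)\n",
    "\n",
    "lambda_min = np.linalg.eigvalsh(A)[0]   # smallest eigenvalue of the Hessian\n",
    "\n",
    "dt = 1e-3                               # Euler step size (assumption)\n",
    "n_steps = 5000\n",
    "theta = np.array([0.6, 0.5])            # starting point inside the unit ball\n",
    "err0 = np.linalg.norm(theta - theta_star)\n",
    "\n",
    "for k in range(1, n_steps + 1):\n",
    "    # Explicit Euler step of d theta_t = - grad U(theta_t) dt\n",
    "    theta = theta - dt * grad_U(theta)\n",
    "    if k % 1000 == 0:\n",
    "        t = k * dt\n",
    "        err = np.linalg.norm(theta - theta_star)\n",
    "        bound = err0 * np.exp(-lambda_min * t)\n",
    "        print(f't={t:.1f}  error={err:.3e}  bound={bound:.3e}')"
   ]
  },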
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A far reaching generalization is given by the following consideration: consider $ U $ on $ \\Theta $ having a unique minimizer $ \\theta^* \\in \\Theta $, then the probability measure given by the density\n",
    "$$\n",
    "p_\\epsilon := \\frac{1}{Z_\\epsilon} \\exp \\big( -\\frac{U}{\\epsilon} \\big)\n",
    "$$\n",
    "tends in law to $ \\delta_{\\theta^*} $ as $ \\epsilon \\to 0 $. The denominator $ Z_\\epsilon $ is just the integral $ \\int_{\\Theta} \\exp(-U(\\theta)/\\epsilon) d \\lambda(\\theta)  $ and the above statement nothing else than the fact that the described density function concentrates at $ \\theta^* $. If one manages to sample from the measure $ p_\\epsilon d \\lambda $, then one can approximate empirically $ \\theta^* $. One method to sample from this measure is to simulate from a stochastic differential equation (for $ \\Theta = \\mathbb{R}^N $) of the type\n",
    "$$\n",
    "d \\theta_t = - \\nabla U(\\theta_t) dt + \\alpha(t) dW_t \n",
    "$$\n",
    "where $ W $ is an $ N $-dimensional Brownian motion and the non-negative quantity, called cooling schedule, $ \\alpha(t) = O(\\frac{1}{\\log(t)}) $ as $ t \\to \\infty $. For appropriate constants we obtain that $ \\theta_t $ converges in law to $ \\delta_{\\theta^*} $ as $ t \\to \\infty $. This procedure is called simulated annealing and is the fundament for global mimimization algorithms of the type 'steepest descent plus noise'."
   ]
  },
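  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the 'steepest descent plus noise' idea concrete, here is a hedged sketch that simulates the stochastic differential equation above with the Euler-Maruyama scheme. Everything specific in it is an assumption made for illustration: the tilted double-well potential `U` (whose global minimizer lies in the left well), the schedule $ \\alpha(t)^2 = c / \\log(e + t) $ with `c = 1`, the step size and the number of paths. All paths start in the well of the local, non-global minimum. No particular outcome is guaranteed for a finite time horizon; the cell only reports how many paths ended in the left well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "\n",
    "def U(x):\n",
    "    # Assumed tilted double well: the left well (x < 0) is the deeper one\n",
    "    return (x**2 - 1.0)**2 + 0.3 * x\n",
    "\n",
    "def grad_U(x):\n",
    "    return 4.0 * x * (x**2 - 1.0) + 0.3\n",
    "\n",
    "def alpha(t, c=1.0):\n",
    "    # Assumed cooling schedule with alpha(t)^2 = c / log(e + t)\n",
    "    return np.sqrt(c / np.log(np.e + t))\n",
    "\n",
    "dt = 1e-2                       # time step (assumption)\n",
    "n_steps = 20000                 # time horizon T = n_steps * dt = 200\n",
    "n_paths = 200\n",
    "theta = np.full(n_paths, 1.0)   # start in the well of the local minimum\n",
    "\n",
    "for k in range(n_steps):\n",
    "    t = k * dt\n",
    "    dW = np.sqrt(dt) * rng.standard_normal(n_paths)  # Brownian increments\n",
    "    # Euler-Maruyama step of d theta = - grad U(theta) dt + alpha(t) dW\n",
    "    theta = theta - grad_U(theta) * dt + alpha(t) * dW\n",
    "\n",
    "print('fraction of paths in the left well:', np.mean(theta < 0))\n",
    "print('average value of U at the end:', np.mean(U(theta)))"
   ]
  },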
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In machine learning applications, however, an algorithm, which traces back to work of Robbins-Monro (Kiefer-Wolfowitz respectively) in the fifties of the last century, is applied, the so called [stochastic approximation algorithm](https://en.wikipedia.org/wiki/Stochastic_approximation).\n",
    "\n",
    "The stochastic gradient descent algorithm essentially says for a function of expectation type $ U(\\theta) = E \\big[ V(\\theta) \\big] $\n",
    "$$\n",
    "\\theta_{n+1} = \\theta_n - \\gamma_n \\nabla V(\\theta_n,\\omega_n)\n",
    "$$\n",
    "for independently chosen samples $ \\omega_n $ converges in law to $ \\theta^* $. Notice first that all our examples, in particular the ones from mathematical finance, are of expectation type, where the samples $ \\omega $ are usually seen as elements from the training data set."
   ]
  },
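  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The recursion above can be tried on a toy problem. The following sketch assumes $ V(\\theta,\\omega) = \\frac{1}{2} (\\theta - \\omega)^2 $ with $ \\omega \\sim N(\\mu, 1) $, so that $ \\theta^* = \\mu $, and the step sizes $ \\gamma_n = 1/(n+1) $. The value `mu = 2.0`, the sample size and the names `grad_V` and `samples` are hypothetical choices, not taken from the text. With this particular step size the iterate coincides, up to rounding, with the running mean of the samples, which is why the two printed numbers should agree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(1)\n",
    "mu = 2.0  # assumed mean of the samples; the minimizer of U is theta_star = mu\n",
    "\n",
    "def grad_V(theta, omega):\n",
    "    # Gradient in theta of V(theta, omega) = 0.5 * (theta - omega)**2\n",
    "    return theta - omega\n",
    "\n",
    "n_iter = 10000\n",
    "samples = mu + rng.standard_normal(n_iter)  # independent samples omega_n\n",
    "\n",
    "theta = 0.0  # arbitrary starting point\n",
    "for n in range(n_iter):\n",
    "    gamma_n = 1.0 / (n + 1)  # decreasing step size (assumption)\n",
    "    theta = theta - gamma_n * grad_V(theta, samples[n])\n",
    "\n",
    "print('SGD iterate:', theta)\n",
    "print('sample mean:', samples.mean())"
   ]
  },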
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several proofs of this statement, but I want to focus one particular aspect which connects stochastic approximation with simulated annealing.\n",
    "\n",
    "By the central limit theorem and appropriate sub-sampling one can understand\n",
    "$$\n",
    "\\nabla V(\\theta,\\omega) = \\nabla U(\\theta) + \\text{ 'Gaussian noise with a certain covariance structure' } \\Sigma(\\theta)\n",
    "$$\n",
    "where $ \\Sigma(\\theta) $ is essentially given by\n",
    "$$\n",
    "\\operatorname{cov}(\\nabla V(\\theta)) \\, .\n",
    "$$\n",
    "Whence one can understand stochastic gradient descent as simulated annealing with a space dependent covariance structure.\n",
    "\n",
    "Such simulated annealing algorithms exist but are more sophisticated in nature, in particular one needs to work with geometric structures on $ \\Theta $."
   ]
  },
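  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dependence of the noise covariance $ \\Sigma(\\theta) $ on $ \\theta $ can be observed empirically. As a rough sketch, assume a least squares loss $ V(\\theta,(x,y)) = \\frac{1}{2} (\\theta \\cdot x - y)^2 $ on synthetic data; the data-generating parameters, the helper `per_sample_grads` and the two evaluation points are hypothetical choices for illustration only. The cell estimates $ \\nabla U(\\theta) $ by the mean and $ \\Sigma(\\theta) $ by the empirical covariance of the per-sample gradients at two different points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(2)\n",
    "n_samples, dim = 5000, 2\n",
    "\n",
    "# Synthetic training data (assumption): y = x . theta_true + small noise\n",
    "X = rng.standard_normal((n_samples, dim))\n",
    "theta_true = np.array([1.0, -2.0])\n",
    "y = X @ theta_true + 0.1 * rng.standard_normal(n_samples)\n",
    "\n",
    "def per_sample_grads(theta):\n",
    "    # Gradient of 0.5 * (theta . x - y)**2 for every sample, shape (n_samples, dim)\n",
    "    residuals = X @ theta - y\n",
    "    return residuals[:, None] * X\n",
    "\n",
    "for theta in (np.zeros(dim), theta_true):\n",
    "    grads = per_sample_grads(theta)\n",
    "    grad_U = grads.mean(axis=0)           # Monte Carlo estimate of grad U(theta)\n",
    "    Sigma = np.cov(grads, rowvar=False)   # empirical covariance of grad V(theta, .)\n",
    "    print('theta =', theta)\n",
    "    print('estimate of grad U:', grad_U)\n",
    "    print('estimate of Sigma:')\n",
    "    print(Sigma)"
   ]
  },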
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}