{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most enigmatic procedure in machine learning is the training of neural networks or, more generally, of parametric families of functions.\n", "\n", "Essentially, training is described as the minimization of a loss: for a given loss function $ L $ on a space of functions $ f^\\theta $ parametrized by a set of parameters $ \\theta \\in \\Theta $, one searches for\n", "$$\n", "\\operatorname{arginf}_{\\theta \\in \\Theta} L(f^\\theta) \\, .\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Assume that $ \\Theta $ is some manifold and $ U : \\theta \\mapsto L(f^\\theta) $ is a sufficiently regular function with a unique minimizer $ \\theta^* \\in \\Theta $. Then one can describe essentially one local and one global method to find the infimum:\n", "\n", "1. If $ U $ is strictly convex and $ C^2 $ in a neighborhood of the unique minimizer $ \\theta^* $, in the sense that there is a chart such that $ U $ has these properties, then the solution of\n", "$$\n", "d \\theta_t = - D_\\Theta U(\\theta_t) dt\n", "$$\n", "converges to $ \\theta^* $ as $ t \\to \\infty $ for appropriate first order differential operators $ D_\\Theta $. Let us consider this statement for $ \\Theta $ equal to the unit ball and $ D_\\Theta = \\nabla $.\n", "\n", "For any $ t \\geq 0 $ it holds that\n", "$$\n", "d U(\\theta_t) = - {|| \\nabla U(\\theta_t) ||}^2 dt \\, ,\n", "$$\n", "i.e. the value of $ U $ is strictly decreasing along the path $ t \\mapsto \\theta_t $ as long as $ \\nabla U(\\theta_t) \\neq 0 $. Together with the fact that $ U $ is strictly convex we obtain the convergence estimate $ || \\theta_t - \\theta^* || \\leq C \\exp(- \\lambda_{\\text{min}} t ) $ as $ t \\to \\infty $, where $ \\lambda_{\\text{min}} $ is the minimum of the smallest eigenvalue of the Hessian of $ U $ on $ \\Theta $. Remarkably, this holds for any starting point $ \\theta_0 \\in \\Theta $ and is the basis of all sorts of gradient descent algorithms." 
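] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a sanity check of the rate above, here is a minimal worked example under an assumption not made in the text: take the quadratic $ U(\\theta) = \\frac{1}{2} \\theta^\\top A \\theta $ with a symmetric positive definite matrix $ A $ (a hypothetical choice for illustration), so that $ \\theta^* = 0 $, $ \\nabla U(\\theta) = A \\theta $ and the Hessian is $ A $. The gradient flow can then be solved explicitly:\n", "$$\n", "\\theta_t = e^{-tA} \\theta_0 \\, , \\qquad || \\theta_t - \\theta^* || \\leq e^{- \\lambda_{\\text{min}} t} \\, || \\theta_0 || \\, , \\qquad \\frac{d}{dt} U(\\theta_t) = - {|| A \\theta_t ||}^2 \\leq 0 \\, ,\n", "$$\n", "where $ \\lambda_{\\text{min}} $ is the smallest eigenvalue of $ A $. In this special case one may take $ C = || \\theta_0 || $, and since $ || \\theta_t || $ is non-increasing the path never leaves the unit ball when it starts there." 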
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A far-reaching generalization is given by the following consideration: if $ U $ on $ \\Theta $ has a unique minimizer $ \\theta^* \\in \\Theta $, then the probability measure given by the density\n", "$$\n", "p_\\epsilon := \\frac{1}{Z_\\epsilon} \\exp \\big( -\\frac{U}{\\epsilon} \\big)\n", "$$\n", "tends weakly to $ \\delta_{\\theta^*} $ as $ \\epsilon \\to 0 $. The denominator $ Z_\\epsilon $ is just the integral $ \\int_{\\Theta} \\exp(-U(\\theta)/\\epsilon) d \\lambda(\\theta) $, and the above statement is nothing other than the fact that the described density function concentrates at $ \\theta^* $. If one manages to sample from the measure $ p_\\epsilon d \\lambda $, then one can approximate $ \\theta^* $ empirically. One method to sample from this measure is to simulate a stochastic differential equation (for $ \\Theta = \\mathbb{R}^N $) of the type\n", "$$\n", "d \\theta_t = - \\nabla U(\\theta_t) dt + \\alpha(t) dW_t\n", "$$\n", "where $ W $ is an $ N $-dimensional Brownian motion and the non-negative quantity $ \\alpha $, called the cooling schedule, satisfies $ \\alpha(t) = O(\\frac{1}{\\log(t)}) $ as $ t \\to \\infty $. For appropriate constants we obtain that the law of $ \\theta_t $ converges to $ \\delta_{\\theta^*} $ as $ t \\to \\infty $. This procedure is called simulated annealing and is the foundation of global minimization algorithms of the type 'steepest descent plus noise'." 
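] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The concentration of $ p_\\epsilon $ can be checked by hand in a minimal example, assuming (for illustration only) $ \\Theta = \\mathbb{R} $ and $ U(\\theta) = \\frac{1}{2} \\theta^2 $, so that $ \\theta^* = 0 $:\n", "$$\n", "Z_\\epsilon = \\int_{\\mathbb{R}} \\exp \\big( -\\frac{\\theta^2}{2 \\epsilon} \\big) d \\theta = \\sqrt{2 \\pi \\epsilon} \\, , \\qquad p_\\epsilon(\\theta) = \\frac{1}{\\sqrt{2 \\pi \\epsilon}} \\exp \\big( -\\frac{\\theta^2}{2 \\epsilon} \\big) \\, ,\n", "$$\n", "i.e. $ p_\\epsilon d \\lambda $ is the normal distribution $ N(0,\\epsilon) $, whose variance $ \\epsilon $ vanishes as $ \\epsilon \\to 0 $. In the same example, freezing the noise level at a constant $ \\alpha(t) \\equiv \\alpha $ turns the stochastic differential equation into the Ornstein-Uhlenbeck equation $ d \\theta_t = - \\theta_t dt + \\alpha dW_t $, whose stationary law is $ N(0, \\alpha^2 / 2) $; this coincides with $ p_\\epsilon d \\lambda $ for $ \\epsilon = \\alpha^2 / 2 $." 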
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In machine learning applications, however, one applies an algorithm which traces back to the work of Robbins-Monro (Kiefer-Wolfowitz, respectively) in the 1950s: the so-called [stochastic approximation algorithm](https://en.wikipedia.org/wiki/Stochastic_approximation).\n", "\n", "The stochastic gradient descent algorithm essentially says that, for a function of expectation type $ U(\\theta) = E \\big[ V(\\theta,\\omega) \\big] $, the recursion\n", "$$\n", "\\theta_{n+1} = \\theta_n - \\gamma_n \\nabla V(\\theta_n,\\omega_n)\n", "$$\n", "with suitably chosen step sizes $ \\gamma_n $ and independently chosen samples $ \\omega_n $ converges in law to $ \\theta^* $. Notice first that all our examples, in particular the ones from mathematical finance, are of expectation type, where the samples $ \\omega $ are usually seen as elements of the training data set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several proofs of this statement, but I want to focus on one particular aspect which connects stochastic approximation with simulated annealing.\n", "\n", "By the central limit theorem and appropriate sub-sampling one can understand\n", "$$\n", "\\nabla V(\\theta,\\omega) = \\nabla U(\\theta) + \\text{ 'Gaussian noise with a certain covariance structure' } \\Sigma(\\theta)\n", "$$\n", "where $ \\Sigma(\\theta) $ is essentially given by\n", "$$\n", "\\operatorname{cov}(\\nabla V(\\theta,\\omega)) \\, .\n", "$$\n", "Hence one can understand stochastic gradient descent as simulated annealing with a space-dependent covariance structure.\n", "\n", "Such simulated annealing algorithms exist but are more sophisticated in nature; in particular, one needs to work with geometric structures on $ \\Theta $." 
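] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A minimal worked example may help to fix ideas. Assume (hypothetically, for illustration) real-valued samples $ \\omega $ with mean $ m $ and variance $ \\sigma^2 $, and $ V(\\theta,\\omega) = \\frac{1}{2} (\\theta - \\omega)^2 $. Then\n", "$$\n", "U(\\theta) = \\frac{1}{2} (\\theta - m)^2 + \\frac{1}{2} \\sigma^2 \\, , \\qquad \\theta^* = m \\, , \\qquad \\nabla V(\\theta,\\omega) = \\underbrace{(\\theta - m)}_{\\nabla U(\\theta)} + \\underbrace{(m - \\omega)}_{\\text{noise}} \\, ,\n", "$$\n", "where the noise term has mean zero and variance $ \\sigma^2 $, so that $ \\Sigma(\\theta) = \\sigma^2 $ does not depend on $ \\theta $ in this special case. With the step sizes $ \\gamma_n = \\frac{1}{n+1} $ the recursion started at $ n = 0 $ becomes\n", "$$\n", "\\theta_{n+1} = \\theta_n - \\frac{1}{n+1} (\\theta_n - \\omega_n) = \\frac{1}{n+1} \\sum_{k=0}^{n} \\omega_k \\, ,\n", "$$\n", "i.e. the running sample mean, which converges to $ m = \\theta^* $ by the law of large numbers." 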
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }