{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Re-inforcement Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At the basis of many algorithms applied in Re-inforcement learning the dynamic programming principle (DPP) lies, which we shall introduce in detail in the sequel.\n", "\n", "We shall work in the category of Feller processes, however, proofs will be presented in detail only in the case of Markov chains, i.e. Feller processes with finite state space.\n", "\n", "Let $X$ be a compact state space (more generally, locally compact with adjoint cemetary state $\\Delta$). A Feller semigroup is a strongly continuous semigroup of bounded linear operators $ (P_t) $ acting on real valued continuous functions $ f \\in C(X) $ with\n", "1. $P_t f \\geq 0 $ for all $ f \\geq 0 $,\n", "2. $P_t 1 = 1 $,\n", "for all $ t \\geq 0 $.\n", "\n", "Strong continuity can be characterized by\n", "$$\n", "\\lim_{t \\to 0} P_t f(x) = f(x)\n", "$$\n", "for all $ x \\in X $ and $ f \\in C(X) $ for linear semigroups of bounded, positive linear operators with $ P_t 1 = 1 $ for $ t \\geq 0 $. We shall denote its generator by $ A $ usually only densely defined on $ \\operatorname{dom}(A) $. A densely defined operator $A$ which satisfies the positive maximum principle, i.e. $ A f(z) \\leq 0 $ whenever $ f \\leq f(z) $ for some $ z \\in X $ and $ f \\in \\operatorname{dom}(A) $, and for which exists $ \\lambda > 0 $ such that $ \\operatorname{rg}(\\lambda - A ) $ is dense in $ C(X) $ is the generator of a Feller semigroup (this is the contents of the [Lumer-Phillips theorem](https://en.wikipedia.org/wiki/Lumer%E2%80%93Phillips_theorem)).\n", "\n", "For all details on Feller semigroups see the excellent textbook of [Ethier-Kurtz](https://onlinelibrary.wiley.com/doi/book/10.1002/9780470316658).\n", "\n", "For every Feller semigroup $ (P_t)_{t \\geq 0} $ we can construct a family of measures $ \\mathbb{P}_x $ for $ x \\in X $ on path space $ D([0,\\infty[) $ such that the canonical process $ (x(t))_{t \\geq 0} $ is a Markov process for each measure $ \\mathbb{P}_x $ starting at $ x \\in X $ with Markov semigroup $ (P_t)_{t \\geq 0} $, i.e.\n", "$$\n", "E_x \\big[ f(x(t)) \\, | \\; \\mathcal{F}_s \\big] = E_y \\Big . \\big[ f(x(t-s) \\big]\\Big|_{y = x(s)} \n", "$$\n", "$\\mathbb{P}_x $ almost surely as well as\n", "$$\n", "E_x[f(x(t))] = P_t f(x)\n", "$$\n", "each for all $ f \\in C(X) $ and $ 0 \\leq s \\leq t$. In particular we have that\n", "$$\n", "(P_{t-s}f(x(s)))_{0 \\leq s \\leq t} \\text{ or } \\big( f(x(t)) - f(x(0)) - \\int_0^t A f(x(s)) ds \\big)_{t \\geq 0}\n", "$$\n", "are $ \\mathbb{P}_x $-martingales for every $ x \\in X $ and every $ f \\in C(X) $, or $ f \\in \\operatorname{dom}(A) $, respectively." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We shall focus in the sequel on $ X = \\{1,\\ldots,n \\} $. In this case the strongly continuous semigroup is uniformly continuous and in the representation $C(X) = \\mathbb{R}^n $ the generator $A$ is a stochastic matrix, i.e. diagonal elements are non-positive, off-diagonal elements non-negative and rows sum up to zero. Then and only then $ P_t = \\exp(At) $ defines a Feller semigroup.\n", "\n", "Let us define jump measures on $ X $ by\n", "$$\n", "\\mu(i,.) = \\frac{1}{\\sum_{i \\neq k} a_{ik}} \\sum_{k\\neq i} a_{ik} \\delta_{k}\n", "$$\n", "if $ - a_{ii} = \\sum_{i \\neq k} a_{ik} > 0 $, otherwise $ \\mu(i,.) = 0 $. 
Then the pure jump process which jumps with intensity $ -a_{ii} $ at $ i \\in X $ with jump measure $ \\mu(i,.) $ coincides with the Markov process associated to the above Feller semigroup." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the sequel we shall consider a finite set $ U $ of controls (actions) such that $ (A^u)_{u \\in U} $ is a family of Feller generators on (the finite set) $ X $ depending (continuously of course) on $U$. We shall consider processes $ x^\\pi $ with values in $X$ controlled by controls $ \\pi \\in \\Pi $, where $ \\Pi $ is a set of predictable processes taking values in $U$ defined on path space, such that\n", "$$\n", "\\big(f(x^\\pi(t))-f(x) - \\int_0^t A^{\\pi_s} f(x^{\\pi}(s)) ds \\big)_{t \\geq 0}\n", "$$\n", "is a $\\mathbb{P}_x $-martingale for every $ x \\in X $ and $ f \\in \\cap_{u \\in U} \\operatorname{dom}(A^u) $." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $ R : X \\to \\mathbb{R} $ be a reward function and $ c: [0,\\infty[ \\times X \\times U \\to \\mathbb{R} $ be a continuous cost function.\n", "\n", "We shall always assume the following two properties for the set of strategies (policies) $\\Pi$:\n", "1. The set $ \\Pi $ is translation invariant, i.e. $ \\pi_{t+h}(\\omega(.+h)) \\in \\Pi $ for all $ h \\geq 0 $ (shift invariance).\n", "2. For all initial laws of the type $ x^\\pi(t) $ for $ \\pi \\in \\Pi $ and $ t \\geq 0 $ we have the expectation property, namely that\n", "$$\n", "\\sup_{\\pi \\in \\Pi} E_\\nu \\big[R(x^\\pi(t)) + \\int_0^t c(r,x^\\pi(r),\\pi_r)dr \\big] = E_\\nu \\Big[ \\sup_{\\pi \\in \\Pi} \\big( R(x^\\pi(t)) + \\int_0^t c(r,x^\\pi(r),\\pi_r)dr \\big)\\Big] \\, . \n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We shall now consider a stochastic optimization problem, namely\n", "$$\n", "\\sup_{\\pi \\in \\Pi} E_x\\big[ R(x^\\pi(T)) + \\int_0^T c(s,x^\\pi(s),\\pi_s) ds \\big]\n", "$$\n", "for $ x \\in X $. We shall solve this problem by dynamization, i.e. consider\n", "$$\n", "V^\\ast(t,x) := \\sup_{\\pi \\in \\Pi} E_{t,x}\\big[ R(x^\\pi(T)) + \\int_t^{T} c(s,x^\\pi(s),\\pi_s) ds \\big]\n", "$$\n", "for $ x \\in X $ and $ 0 \\leq t \\leq T $." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By means of these properties we can prove the dynamic programming principle (DPP): for $ 0 \\leq s \\leq t \\leq T $ we have that\n", "$$\n", "V^\\ast(s,x) = \\sup_{\\pi \\in \\Pi} E_{s,x} \\big[ V^\\ast(t,x^\\pi(t)) + \\int_s^t c(r,x^\\pi(r),\\pi_r) dr \\big] \\, .\n", "$$\n", "\n", "The proof is direct:\n", "$$\n", "V^\\ast(s,x) = \\sup_{\\pi \\in \\Pi} E_{s,x} \\big[ R(x^\\pi(T)) + \\int_s^T c(r,x^\\pi(r),\\pi_r) dr \\big] = \\sup_{\\pi \\in \\Pi} E_{s,x} \\big[ R(x^\\pi(T)) + \\int_s^t c(r,x^\\pi(r),\\pi_r) dr + \\int_t^T c(r,x^\\pi(r),\\pi_r) dr \\big] \\, , \n", "$$\n", "where the supremum can be split over suprema on processes on $ [s,t[ $ and on $ [t,T] $. By the shift invariance and the expectation property this yields the result."
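, "\n", "\n", "To make the backward structure of the DPP concrete, here is a minimal numerical sketch (an illustration with hypothetical data, not part of the development above): two intensity matrices on a three-point state space, a terminal reward $ R $, a running cost $ c $, and the backward recursion $ V(t_i,x) = \\max_u \\big( P^u_{\\Delta t} V(t_{i+1},.)(x) + c(x,u) \\Delta t \\big) $ on an equidistant time grid.\n", "\n", "```python\n", "import numpy as np\n", "from scipy.linalg import expm\n", "\n", "# hypothetical data: two controls u = 0, 1 on the state space X = {0, 1, 2},\n", "# each given by an intensity matrix (off-diagonal >= 0, rows sum to zero)\n", "A = [np.array([[-1.0, 1.0, 0.0],\n", "               [0.5, -1.0, 0.5],\n", "               [0.0, 2.0, -2.0]]),\n", "     np.array([[-2.0, 0.5, 1.5],\n", "               [0.0, 0.0, 0.0],\n", "               [1.0, 1.0, -2.0]])]\n", "R = np.array([0.0, 1.0, 3.0])                         # terminal reward R(x)\n", "c = np.array([[0.1, -0.2], [0.0, 0.0], [-0.1, 0.3]])  # running cost c(x, u)\n", "\n", "T, n_steps = 1.0, 100\n", "dt = T / n_steps\n", "P = [expm(a * dt) for a in A]  # transition operators P^u_dt = exp(A^u dt)\n", "\n", "# backward recursion (DPP on the time grid) with V(T, x) = R(x)\n", "V = R.copy()\n", "for _ in range(n_steps):\n", "    V = np.max(np.stack([P[u] @ V + c[:, u] * dt for u in range(len(A))]), axis=0)\n", "print(V)  # approximation of V*(0, x) for x = 0, 1, 2\n", "```"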
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case of finite $X$ no regularity issues arise and we can take the derivative of the dynamic programming principle, yielding the Hamilton-Jacobi-Bellman (HJB) equation\n", "$$\n", "\\partial_t V^\\ast(t,x) + \\sup_{u \\in U} (A^u V^\\ast(t,x) + c(t,x,u)) = 0\n", "$$\n", "for $ x \\in X $ and $ 0 \\leq t \\leq T $ with $ V^\\ast(T,x) = R(x) $ for $ x \\in X $.\n", "\n", "From the DPP we can immediately derive that a strategy $ \\pi $ is optimal for the optimization problem if and only if\n", "$$\n", "(V^*(t,x^\\pi(t))+\\int_0^tc(r,x^\\pi(r),\\pi_r)dr)_{0 \\leq t \\leq T}\n", "$$\n", "is a $ \\mathbb{P}_x $-martingale for all $ x \\in X$.\n", "\n", "Indeed, let the previous expression be a martingale for some strategy $ \\pi $, then\n", "$$\n", "E_x \\big[ R(x^\\pi(T))+\\int_0^Tc(r,x^\\pi(r),\\pi_r)dr \\big]=V^\\ast(0,x) \\, ,\n", "$$\n", "which is precisely the optimality condition. Let $ \\pi $ be any strategy, then\n", "$$\n", "V^*(t,x^\\pi(t))+\\int_0^tc(r,x^\\pi(r),\\pi_r)dr = V^*(0,x) + \\int_0^t \\big(\\partial_r V^\\ast(r,x^\\pi(r))+ A^{\\pi_r} V^*(r,x^\\pi(r))+c(r,x^\\pi(r),\\pi_r)\\big)dr + \\text{ martingale }\n", "$$\n", "is a $ \\mathbb{P}_x $ supermartingale for all $ x \\in X $, since the integrand is non-positive by the HJB equation.\n", "\n", "Finally let us assume that there is a measurable function $ \\pi(s,.) $ from $ X $ to $ U $ such that\n", "$$\n", "\\pi(s,x) \\in \\operatorname{argmax}_{u \\in U} (A^u V^\\ast(s,x) + c(s,x,u)) \\, ,\n", "$$\n", "such that $ \\pi^*_s := \\pi(s,x(s-)) $ for $ s \\geq 0 $ is an element of $ \\Pi $; then the above martingale condition is satisfied." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar conditions can be derived in the case of time-discrete or infinite-time-horizon problems: let us formulate the DPP for a time-discrete infinite-horizon problem with objective function\n", "$$\n", "\\sup_{\\pi \\in \\Pi} E_x\\big[ \\sum_{\\tau \\geq s \\geq 0} \\gamma^s r(x^\\pi(s)) \\big] \\, ,\n", "$$\n", "where $ \\tau $ is the first time terminal states are reached and $ \\gamma $ is usually a discount factor, so less than one. Since the problem has infinite horizon the value function does not depend on time (stationary case). The DPP then reads as follows\n", "$$\n", "V^\\ast(x) =\\gamma \\sup_{u \\in U} P_1^u V^\\ast (x) + r(x) \\, ,\n", "$$\n", "which can be solved by a Banach fixed point principle. Again the optimal strategy (policy) can be found by looking at\n", "$$\n", "\\pi^\\ast(x) \\in \\operatorname{argmax}_{u \\in U} P_1^u V^\\ast(x) \\, \n", "$$\n", "for $ x $ not terminal. This is clearly the case since\n", "$$\n", "V^\\ast(x) =\\gamma P_1^{\\pi^\\ast(x)} V^\\ast (x) + r(x) \\, ,\n", "$$\n", "for $ x $ not terminal, which means by iteration\n", "$$\n", "V^\\ast(x) = E_x\\big[ \\sum_{\\tau \\geq s \\geq 0} \\gamma^s r(x^{\\pi^\\ast}(s)) \\big] \\, .\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An important concept for future considerations is the Q-function:\n", "$$\n", "Q^*(x,u) := r(x) + \\gamma P_1^u V^*(x) \\, ,\n", "$$\n", "from which we can of course calculate $ V^* $ via $ V^*(x) = \\sup_{u \\in U} Q^*(x,u) $." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case we can distinguish four approaches to finding solutions of the problem:\n", "1. (Value iteration) Solve the Bellman equation by a Banach fixed point argument. Choose an arbitrary initial value function $ V^{(0)} $, then perform\n", "$$\n", "Q^{(n+1)}(x,u) = \\gamma P_1^u V^{(n)}(x) + r(x); \\quad V^{(n+1)}(x) = \\sup_{u \\in U} Q^{(n+1)}(x,u) \\, .\n", "$$\n", "2. 
(Policy iteration) Choose an initial strategy $ \\pi^{(0)} $ and calculate its value function\n", "$$\n", "V^{\\pi^{(n)}}(x) = E_x\\big[ \\sum_{\\tau \\geq s \\geq 0} \\gamma^s r(x^{\\pi^{(n)}}(s)) \\big] \\, ,\n", "$$\n", "then define\n", "$$\n", "\\pi^{(n+1)}(x) \\in \\operatorname{argmax}_{u \\in U} \\big( \\gamma P_1^u V^{\\pi^{(n)}}(x) + r(x) \\big) \\, .\n", "$$\n", "3. (Q-Learning -- known environment) Choose an initial $ Q^{(0)} $ function and update via\n", "$$\n", "\\pi^{(n)}(x) \\in \\operatorname{argmax}_{u \\in U} Q^{(n)}(x,u)\n", "$$\n", "and\n", "$$\n", "Q^{(n+1)}(x,u) = \\gamma P_1^u \\big( Q^{(n)}(.,\\pi^{(n)}(.)) \\big)(x) + r(x) \\, .\n", "$$\n", "4. (Q-Learning -- unknown environment) Choose an initial $ Q^{(0)} $ function and update via a learning rate $ \\alpha $ in a possibly unknown environment, i.e.\n", "$$\n", "Q^{(n+1)}(x,u) = (1-\\alpha) Q^{(n)}(x,u) + \\alpha \\big( r(x) + \\gamma \\sup_{u' \\in U} Q^{(n)}(x',u') \\big)\n", "$$\n", "for some state $ x' \\in X $ which is reached after action $u$. Here the question is how to choose the action $ u $ (and thereby the following state $ x' $) for the next updating step. This leads to the exploitation-exploration dilemma: either the action is chosen where $ Q^{(n)}(x,.) $ takes its maximum (exploitation), or a random action is chosen (exploration), and $ x' $ is the state reached after taking it. The background of this algorithm lies in an extended version of the stochastic approximation algorithm of [Robbins-Monro](https://en.wikipedia.org/wiki/Stochastic_approximation). It reads as follows and can be found in [John Tsitsiklis' famous article](https://link.springer.com/content/pdf/10.1007/BF00993306.pdf): imagine we want to solve a fixed point problem $ Q(x,u) = r(x) + P_1^u (\\sup_{u \\in U} Q(.,u))(x) $, then we can consider a stochastic algorithm of the following type\n", "$$\n", "Q^{(n+1)}(x,u) = Q^{(n)}(x,u) + \\alpha \\big( P_1^u (\\sup_{u \\in U} Q^{(n)}(.,u))(x) - Q^{(n)}(x,u) + \\text{ 'noise'} \\big)\n", "$$\n", "with a learning rate $ \\alpha $. Then one replaces\n", "$$\n", "P_1^u (\\sup_{u \\in U} Q^{(n)}(.,u))(x) + \\text{ 'noise'} \n", "$$\n", "by an estimator for the expectation, for instance its value at the sampled next state $ x' $, or an average over sampled next states; this is proven to converge in the cited paper." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar algorithms may be designed in any DPP situation, with adaptations according to the structure of the DPP. Take for instance the previous problem, for simplicity with $ c = 0 $; then value iteration, policy iteration or Q-learning are just ways to solve the HJB equation\n", "$$\n", "\\partial_t V^\\ast(t,x) + \\sup_{u \\in U} A^u V^\\ast(t,x) = 0 \\, , \\; V^\\ast(T,x) = R(x)\n", "$$\n", "on refining time grids." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A major insight into the structure and solution concepts of stochastic control problems is the following relaxation procedure: instead of considering strategies (policies) as predictable processes with values in $ U $ one considers _randomized_ strategies $ \\delta_s \\in \\Delta(U) $ (again with predictable time dependence). The corresponding martingale problem requires that\n", "$$\n", "\\Big(f(x^\\delta(t))-f(x) - \\int_0^t \\big(\\int_U A^{u} f \\delta_s(du) \\big)(x^{\\delta}(s)) ds \\Big)_{t \\geq 0}\n", "$$\n", "be a $\\mathbb{P}_x $-martingale for every $ x \\in X $ and $ f \\in \\cap_{u \\in U} \\operatorname{dom}(A^u) $, i.e. 
the control has an additional randomness before actually acting on the environment.\n", "\n", "This relaxation has two advantages: first, the set of controls is convex and the controlled generator depends linearly on the control; second, a more robust solution theory is achieved in case 'classical' solutions are difficult to construct.\n", "\n", "The HJB equation then looks the same (under mild regularity assumptions),\n", "$$\n", "\\partial_t V^\\ast(t,x) + \\sup_{\\delta \\in \\Delta(U)} \\int_U A^u V^\\ast (t,x) \\, \\delta(du) = \\partial_t V^\\ast(t,x) + \\sup_{u \\in U} A^u V^\\ast (t,x) = 0 \\, ,\n", "$$\n", "however, new algorithms can emerge." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us look again at the algorithms from the point of view of the above HJB equation; fix an equidistant grid in time $ 0 = t_0 < t_1 < \\dots < t_n = T $ with mesh $ \\Delta t $:\n", "1. (Value iteration) We solve backwards: $V^{(n)}(t_n,x) = R(x) $ and\n", "$$\n", "V^{(n)}(t_i,x) = \\sup_{u \\in U} P^u_{\\Delta t} V^{(n)}(t_{i+1},x)\n", "$$\n", "for $ 0 \\leq i < n $, which yields under weak regularity assumptions a converging scheme.\n", "2. (Policy iteration) We choose a policy $ \\pi^{(n)}(t_i,.) $ and calculate the value function for this very policy via $ V^{\\pi^{(n)}}(t_n,x) = R(x) $ and\n", "$$\n", "V^{\\pi^{(n)}}(t_i,x) = P^{\\pi^{(n)}(t_i,x)}_{\\Delta t} V^{\\pi^{(n)}}(t_{i+1},x) \\, .\n", "$$\n", "Then we optimize the policy at one intermediate point $ t_i $ and at one $ x \\in X $." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a last theoretical step we move forward to Markov games, i.e. situations where several independent agents optimize their strategies. Again we shall be able to formulate dynamic programming principles. The situation is actually very similar and we just formulate the corresponding principles.\n", "\n", "We shall consider two finite sets $ U_1 $ and $ U_2$ of controls (actions) such that $ (A^{(u_1,u_2)})_{(u_1,u_2) \\in U_1 \\times U_2} $ is a family of Feller generators on (the finite set) $ X $ depending (continuously of course) on $U := U_1 \\times U_2$. We shall consider processes $ x^{(\\pi_1,\\pi_2)} $ with values in $X$ controlled by controls $ \\pi := (\\pi_1,\\pi_2) \\in \\Pi := \\Pi_1 \\times \\Pi_2 $, where $ \\Pi $ is a set of predictable processes taking values in $U_1 \\times U_2$ defined on path space, such that\n", "$$\n", "\\big(f(x^\\pi(t))-f(x) - \\int_0^t A^{\\pi_s} f(x^{\\pi}(s)) ds \\big)_{t \\geq 0}\n", "$$\n", "is a $\\mathbb{P}_x $-martingale for every $ x \\in X $ and $ f \\in \\cap_{u \\in U} \\operatorname{dom}(A^u) $." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $ R : X \\to \\mathbb{R} $ be again a reward function and $ c: [0,\\infty[ \\times X \\times U \\to \\mathbb{R} $ be a continuous cost function.\n", "\n", "We shall always assume the following two properties for the set of strategies (policies) $\\Pi$:\n", "1. The set $ \\Pi $ is translation invariant, i.e. $ \\pi_{t+h}(\\omega(.+h)) \\in \\Pi $ for all $ h \\geq 0 $ (shift invariance).\n", "2. 
For all initial laws of the type $ x^\\pi(t) $ for $ \\pi \\in \\Pi $ and $ t \\geq 0 $ we have the expectation property, namely that\n", "$$\n", "\\sup_{\\pi_1 \\in \\Pi_1} \\inf_{\\pi_2 \\in \\Pi_2} E_\\nu \\big[R(x^\\pi(t)) + \\int_0^t c(r,x^\\pi(r),\\pi_r)dr \\big] = E_\\nu \\Big[ \\sup_{\\pi_1 \\in \\Pi_1} \\inf_{\\pi_2 \\in \\Pi_2} \\big( R(x^\\pi(t)) + \\int_0^t c(r,x^\\pi(r),\\pi_r)dr \\big)\\Big] \\, ,\n", "$$\n", "furthermore\n", "$$\n", "\\inf_{\\pi_2 \\in \\Pi_2} \\sup_{\\pi_1 \\in \\Pi_1} E_\\nu \\big[R(x^\\pi(t)) + \\int_0^t c(r,x^\\pi(r),\\pi_r)dr \\big] = E_\\nu \\Big[ \\inf_{\\pi_2 \\in \\Pi_2} \\sup_{\\pi_1 \\in \\Pi_1} \\big( R(x^\\pi(t)) + \\int_0^t c(r,x^\\pi(r),\\pi_r)dr \\big)\\Big] \\, , \n", "$$\n", "and\n", "$$\n", "\\sup_{\\pi_1 \\in \\Pi_1} \\inf_{\\pi_2 \\in \\Pi_2} E_\\nu \\big[R(x^\\pi(t)) + \\int_0^t c(r,x^\\pi(r),\\pi_r)dr \\big] = \\inf_{\\pi_2 \\in \\Pi_2} \\sup_{\\pi_1 \\in \\Pi_1} E_\\nu \\big[R(x^\\pi(t)) + \\int_0^t c(r,x^\\pi(r),\\pi_r)dr \\big] \\, .\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Under these assumptions a completely analogous dynamic programming principle for this zero-sum two-player Markov game can be formulated. The game is obviously (Markov) stochastic since each of the two players just controls a Markov process. It is furthermore _zero sum_ since a gain for player 1 means a loss for player 2. The _Isaacs condition_ of interchanging suprema and infima allows one to interpret the solution as a Nash equilibrium, i.e. no player can improve her situation if the other one plays a fixed strategy. There is only one value function, which satisfies the DPP: for $ 0 \\leq s \\leq t \\leq T $ we have\n", "$$\n", "V^\\ast(s,x) = \\sup_{\\pi_1 \\in \\Pi_1} \\inf_{\\pi_2 \\in \\Pi_2} E_{s,x} \\big[ V^\\ast(t,x^\\pi(t)) + \\int_s^t c(r,x^\\pi(r),\\pi_r) dr \\big] \\, , \\; V^\\ast(T,x) = R(x) \\, ,\n", "$$\n", "leading to the HJB equation\n", "$$\n", "\\partial_t V^\\ast(t,x) + \\sup_{u_1 \\in U_1} \\inf_{u_2 \\in U_2} (A^{u_1,u_2} V^\\ast(t,x) + c(t,x,u_1,u_2)) = 0\n", "$$\n", "for $ x \\in X $ and $ 0 \\leq t \\leq T $ with $ V^\\ast(T,x) = R(x) $ for $ x \\in X $. In the finite state space case this can be proved under mild regularity assumptions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generalizations are similar: policies can be relaxed, more than 2 players can be considered, and also non-zero-sum games can be treated." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us just show in the case of a simple Markov decision problem why the standard algorithms converge. We consider the fixed point equation\n", "$$\n", "V^*(x) = r(x) + \\gamma \\max_{u \\in U} P_1^u V^*(x) \\, \n", "$$\n", "for $ 0 < \\gamma < 1 $.\n", "\n", "1. (Value iteration) The Bellman operator\n", "$$\n", "Q \\mapsto \\mathcal{B}(Q) := r + \\gamma \\max_{u \\in U} P_1^u Q\n", "$$\n", "is contractive. Indeed\n", "$$\n", "| \\mathcal{B}(Q_1)(x) - \\mathcal{B}(Q_2)(x) | = \\gamma \\, \\big| \\max_{u \\in U} P_1^u Q_1(x) - \\max_{u \\in U} P_1^u Q_2(x) \\big| \\leq \\gamma \\max_{u \\in U} | P_1^u Q_1(x) - P_1^u Q_2(x) | \\leq \\gamma {\\| Q_1 - Q_2 \\|}_\\infty \n", "$$\n", "for all $ x \\in X $. Whence we obtain exponential convergence, i.e.\n", "$$\n", "{\\| Q^{(n)} - V^* \\|}_\\infty \\leq C \\gamma^n \n", "$$\n", "as $ n \\to \\infty $ in the supremum norm. However, computations are relatively heavy due to the involved nature of the Bellman operator.\n", "\n", "2. 
(Policy iteration) Here at each step $ V^\\pi $ is calculated, which is done either by calculating the expectation or by just solving the linear system\n", "$$\n", "V^\\pi = (\\operatorname{id} - \\gamma P_1^{\\pi(.)})^{-1} r \\, ,\n", "$$\n", "then the Bellman operator is applied, which yields an improved strategy. The value function improves at each step, whence again by a contraction principle one obtains convergence, but it might be much quicker due to particularities of the control problem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the sequel several games from the [OpenAI Gym project](https://gym.openai.com/) are shown to illustrate and deepen the concepts." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\"\"\"\n", "Solving FrozenLake8x8 environment using Value-Iteration.\n", "Author : Moustafa Alzantot (malzantot@ucla.edu)\n", "\"\"\"\n", "import numpy as np\n", "import gym\n", "from gym import wrappers\n", "\n", "\n", "def run_episode(env, policy, gamma = 1.0, render = True):\n", "    \"\"\" Evaluates policy by using it to run an episode and finding its\n", "    total reward.\n", "    args:\n", "        env: gym environment.\n", "        policy: the policy to be used.\n", "        gamma: discount factor.\n", "        render: boolean to turn rendering on/off.\n", "    returns:\n", "        total reward: real value of the total reward received by agent under policy.\n", "    \"\"\"\n", "    obs = env.reset()\n", "    total_reward = 0\n", "    step_idx = 0\n", "    while True:\n", "        if render:\n", "            env.render()\n", "        obs, reward, done, _ = env.step(int(policy[obs]))\n", "        total_reward += (gamma ** step_idx * reward)\n", "        step_idx += 1\n", "        if done:\n", "            break\n", "    return total_reward\n", "\n", "\n", "def evaluate_policy(env, policy, gamma = 1.0, n = 100):\n", "    \"\"\" Evaluates a policy by running it n times.\n", "    returns:\n", "        average total reward\n", "    \"\"\"\n", "    scores = [run_episode(env, policy, gamma = gamma, render = False) for _ in range(n)]\n", "    return np.mean(scores)\n", "\n", "def extract_policy(v, gamma = 1.0):\n", "    \"\"\" Extract the greedy policy given a value-function (uses the global env) \"\"\"\n", "    policy = np.zeros(env.nS)\n", "    for s in range(env.nS):\n", "        q_sa = np.zeros(env.action_space.n)\n", "        for a in range(env.action_space.n):\n", "            for next_sr in env.P[s][a]:\n", "                # next_sr is a tuple of (probability, next state, reward, done)\n", "                p, s_, r, _ = next_sr\n", "                q_sa[a] += (p * (r + gamma * v[s_]))\n", "        policy[s] = np.argmax(q_sa)\n", "    return policy\n", "\n", "\n", "def value_iteration(env, gamma = 1.0):\n", "    \"\"\" Value-iteration algorithm \"\"\"\n", "    v = np.zeros(env.nS)  # initialize value-function\n", "    max_iterations = 100000\n", "    eps = 1e-20\n", "    for i in range(max_iterations):\n", "        prev_v = np.copy(v)\n", "        for s in range(env.nS):\n", "            # Bellman update: expected reward plus discounted value of the next state\n", "            q_sa = [sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.P[s][a]]) for a in range(env.nA)]\n", "            v[s] = max(q_sa)\n", "        if (np.sum(np.fabs(prev_v - v)) <= eps):\n", "            print ('Value-iteration converged at iteration# %d.' 
%(i+1))\n", " break\n", " return v" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Value-iteration converged at iteration# 2357.\n", "Policy average score = 1.0\n" ] } ], "source": [ "env_name = 'FrozenLake8x8-v0'\n", "gamma = 1.0\n", "env = gym.make(env_name)\n", "env=env.unwrapped\n", "optimal_v = value_iteration(env, gamma);\n", "policy = extract_policy(optimal_v, gamma)\n", "policy_score = evaluate_policy(env, policy, gamma, n=1000)\n", "print('Policy average score = ', policy_score)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\"\"\"\n", "Solving FrozenLake8x8 environment using Policy iteration.\n", "Author : Moustafa Alzantot (malzantot@ucla.edu)\n", "\"\"\"\n", "import numpy as np\n", "import gym\n", "from gym import wrappers\n", "\n", "\n", "def run_episode(env, policy, gamma = 1.0, render = False):\n", " \"\"\" Runs an episode and return the total reward \"\"\"\n", " obs = env.reset()\n", " total_reward = 0\n", " step_idx = 0\n", " while True:\n", " if render:\n", " env.render()\n", " obs, reward, done , _ = env.step(int(policy[obs]))\n", " total_reward += (gamma ** step_idx * reward)\n", " step_idx += 1\n", " if done:\n", " break\n", " return total_reward\n", "\n", "\n", "def evaluate_policy(env, policy, gamma = 1.0, n = 100):\n", " scores = [run_episode(env, policy, gamma, False) for _ in range(n)]\n", " return np.mean(scores)\n", "\n", "def extract_policy(v, gamma = 1.0):\n", " \"\"\" Extract the policy given a value-function \"\"\"\n", " policy = np.zeros(env.nS)\n", " for s in range(env.nS):\n", " q_sa = np.zeros(env.nA)\n", " for a in range(env.nA):\n", " q_sa[a] = sum([p * (r + gamma * v[s_]) for p, s_, r, _ in env.P[s][a]])\n", " policy[s] = np.argmax(q_sa)\n", " return policy\n", "\n", "def compute_policy_v(env, policy, gamma=1.0):\n", " \"\"\" Iteratively evaluate the value-function under policy.\n", " Alternatively, we could formulate a set of linear equations in iterms of v[s] \n", " and solve them to find the value function.\n", " \"\"\"\n", " v = np.zeros(env.nS)\n", " eps = 1e-10\n", " while True:\n", " prev_v = np.copy(v)\n", " for s in range(env.nS):\n", " policy_a = policy[s]\n", " v[s] = sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.P[s][policy_a]])\n", " if (np.sum((np.fabs(prev_v - v))) <= eps):\n", " # value converged\n", " break\n", " return v\n", "\n", "def policy_iteration(env, gamma = 1.0):\n", " \"\"\" Policy-Iteration algorithm \"\"\"\n", " policy = np.random.choice(env.nA, size=(env.nS)) # initialize a random policy\n", " max_iterations = 200000\n", " gamma = 1.0\n", " for i in range(max_iterations):\n", " old_policy_v = compute_policy_v(env, policy, gamma)\n", " new_policy = extract_policy(old_policy_v, gamma)\n", " if (np.all(policy == new_policy)):\n", " print ('Policy-Iteration converged at step %d.' 
%(i+1))\n", " break\n", " policy = new_policy\n", " return policy" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Policy-Iteration converged at step 12.\n", "Average scores = 1.0\n" ] } ], "source": [ "env_name = 'FrozenLake8x8-v0'\n", "env = gym.make(env_name)\n", "env = env.unwrapped\n", "optimal_policy = policy_iteration(env, gamma = 1.0)\n", "scores = evaluate_policy(env, optimal_policy, gamma = 1.0)\n", "print('Average scores = ', np.mean(scores))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\"\"\"\n", "Q-Learning example using OpenAI gym MountainCar enviornment\n", "Author: Moustafa Alzantot (malzantot@ucla.edu)\n", "\"\"\"\n", "import numpy as np\n", "\n", "import gym\n", "from gym import wrappers\n", "\n", "n_states = 40\n", "iter_max = 10000\n", "\n", "initial_lr = 1.0 # Learning rate\n", "min_lr = 0.003\n", "gamma = 1.0\n", "t_max = 10000\n", "eps = 0.02\n", "\n", "def run_episode(env, policy=None, render=False):\n", " obs = env.reset()\n", " total_reward = 0\n", " step_idx = 0\n", " for _ in range(t_max):\n", " if render:\n", " env.render()\n", " if policy is None:\n", " action = env.action_space.sample()\n", " else:\n", " a,b = obs_to_state(env, obs)\n", " action = policy[a][b]\n", " obs, reward, done, _ = env.step(action)\n", " total_reward += gamma ** step_idx * reward\n", " step_idx += 1\n", " if done:\n", " break\n", " return total_reward\n", "\n", "def obs_to_state(env, obs):\n", " \"\"\" Maps an observation to state \"\"\"\n", " env_low = env.observation_space.low\n", " env_high = env.observation_space.high\n", " env_dx = (env_high - env_low) / n_states\n", " a = int((obs[0] - env_low[0])/env_dx[0])\n", " b = int((obs[1] - env_low[1])/env_dx[1])\n", " return a, b" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "----- using Q Learning -----\n", "Iteration #1 -- Total reward = -200.\n", "Iteration #101 -- Total reward = -200.\n", "Iteration #201 -- Total reward = -200.\n", "Iteration #301 -- Total reward = -200.\n", "Iteration #401 -- Total reward = -200.\n", "Iteration #501 -- Total reward = -200.\n", "Iteration #601 -- Total reward = -200.\n", "Iteration #701 -- Total reward = -200.\n", "Iteration #801 -- Total reward = -200.\n", "Iteration #901 -- Total reward = -200.\n", "Iteration #1001 -- Total reward = -200.\n", "Iteration #1101 -- Total reward = -200.\n", "Iteration #1201 -- Total reward = -200.\n", "Iteration #1301 -- Total reward = -200.\n", "Iteration #1401 -- Total reward = -200.\n", "Iteration #1501 -- Total reward = -200.\n", "Iteration #1601 -- Total reward = -178.\n", "Iteration #1701 -- Total reward = -200.\n", "Iteration #1801 -- Total reward = -200.\n", "Iteration #1901 -- Total reward = -200.\n", "Iteration #2001 -- Total reward = -200.\n", "Iteration #2101 -- Total reward = -200.\n", "Iteration #2201 -- Total reward = -200.\n", "Iteration #2301 -- Total reward = -200.\n", "Iteration #2401 -- Total reward = -200.\n", "Iteration #2501 -- Total reward = -200.\n", "Iteration #2601 -- Total reward = -200.\n", "Iteration #2701 -- Total reward = -200.\n", "Iteration #2801 -- Total reward = -200.\n", "Iteration #2901 -- Total reward = -200.\n", "Iteration #3001 -- Total reward = -200.\n", "Iteration #3101 -- Total reward = -200.\n", "Iteration #3201 -- Total reward = -200.\n", "Iteration #3301 -- Total reward = -200.\n", 
"Iteration #3401 -- Total reward = -200.\n", "Iteration #3501 -- Total reward = -200.\n", "Iteration #3601 -- Total reward = -200.\n", "Iteration #3701 -- Total reward = -200.\n", "Iteration #3801 -- Total reward = -200.\n", "Iteration #3901 -- Total reward = -200.\n", "Iteration #4001 -- Total reward = -200.\n", "Iteration #4101 -- Total reward = -200.\n", "Iteration #4201 -- Total reward = -200.\n", "Iteration #4301 -- Total reward = -200.\n", "Iteration #4401 -- Total reward = -200.\n", "Iteration #4501 -- Total reward = -200.\n", "Iteration #4601 -- Total reward = -200.\n", "Iteration #4701 -- Total reward = -200.\n", "Iteration #4801 -- Total reward = -200.\n", "Iteration #4901 -- Total reward = -200.\n", "Iteration #5001 -- Total reward = -200.\n", "Iteration #5101 -- Total reward = -200.\n", "Iteration #5201 -- Total reward = -200.\n", "Iteration #5301 -- Total reward = -200.\n", "Iteration #5401 -- Total reward = -200.\n", "Iteration #5501 -- Total reward = -200.\n", "Iteration #5601 -- Total reward = -200.\n", "Iteration #5701 -- Total reward = -200.\n", "Iteration #5801 -- Total reward = -200.\n", "Iteration #5901 -- Total reward = -200.\n", "Iteration #6001 -- Total reward = -200.\n", "Iteration #6101 -- Total reward = -200.\n", "Iteration #6201 -- Total reward = -200.\n", "Iteration #6301 -- Total reward = -200.\n", "Iteration #6401 -- Total reward = -200.\n", "Iteration #6501 -- Total reward = -200.\n", "Iteration #6601 -- Total reward = -200.\n", "Iteration #6701 -- Total reward = -200.\n", "Iteration #6801 -- Total reward = -200.\n", "Iteration #6901 -- Total reward = -200.\n", "Iteration #7001 -- Total reward = -200.\n", "Iteration #7101 -- Total reward = -200.\n", "Iteration #7201 -- Total reward = -200.\n", "Iteration #7301 -- Total reward = -200.\n", "Iteration #7401 -- Total reward = -200.\n", "Iteration #7501 -- Total reward = -200.\n", "Iteration #7601 -- Total reward = -200.\n", "Iteration #7701 -- Total reward = -200.\n", "Iteration #7801 -- Total reward = -200.\n", "Iteration #7901 -- Total reward = -200.\n", "Iteration #8001 -- Total reward = -200.\n", "Iteration #8101 -- Total reward = -200.\n", "Iteration #8201 -- Total reward = -200.\n", "Iteration #8301 -- Total reward = -200.\n", "Iteration #8401 -- Total reward = -200.\n", "Iteration #8501 -- Total reward = -200.\n", "Iteration #8601 -- Total reward = -200.\n", "Iteration #8701 -- Total reward = -200.\n", "Iteration #8801 -- Total reward = -200.\n", "Iteration #8901 -- Total reward = -200.\n", "Iteration #9001 -- Total reward = -200.\n", "Iteration #9101 -- Total reward = -200.\n", "Iteration #9201 -- Total reward = -200.\n", "Iteration #9301 -- Total reward = -200.\n", "Iteration #9401 -- Total reward = -200.\n", "Iteration #9501 -- Total reward = -200.\n", "Iteration #9601 -- Total reward = -200.\n", "Iteration #9701 -- Total reward = -200.\n", "Iteration #9801 -- Total reward = -200.\n", "Iteration #9901 -- Total reward = -200.\n", "Average score of solution = -132.57\n" ] }, { "data": { "text/plain": [ "-158.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "env_name = 'MountainCar-v0'\n", "env = gym.make(env_name)\n", "env.seed(0)\n", "np.random.seed(0)\n", "print ('----- using Q Learning -----')\n", "q_table = np.zeros((n_states, n_states, 3))\n", "for i in range(iter_max):\n", " obs = env.reset()\n", " total_reward = 0\n", " ## eta: learning rate is decreased at each step\n", " eta = max(min_lr, initial_lr * (0.85 ** (i//100)))\n", " for j in 
range(t_max):\n", "        a, b = obs_to_state(env, obs)\n", "        if np.random.uniform(0, 1) < eps:\n", "            action = np.random.choice(env.action_space.n)\n", "        else:\n", "            logits = q_table[a][b]\n", "            logits_exp = np.exp(logits)\n", "            probs = logits_exp / np.sum(logits_exp)\n", "            action = np.random.choice(env.action_space.n, p=probs)\n", "        obs, reward, done, _ = env.step(action)\n", "        total_reward += (gamma ** j) * reward\n", "        # update q table\n", "        a_, b_ = obs_to_state(env, obs)\n", "        q_table[a][b][action] = q_table[a][b][action] + eta * (reward + gamma * np.max(q_table[a_][b_]) - q_table[a][b][action])\n", "        if done:\n", "            break\n", "    if i % 100 == 0:\n", "        print('Iteration #%d -- Total reward = %d.' %(i+1, total_reward))\n", "solution_policy = np.argmax(q_table, axis=2)\n", "solution_policy_scores = [run_episode(env, solution_policy, False) for _ in range(100)]\n", "print(\"Average score of solution = \", np.mean(solution_policy_scores))\n", "# Animate it\n", "run_episode(env, solution_policy, True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Deep Reinforcement Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we have been fully in the field of optimal control without any appearance of deep learning techniques. It is particularly interesting to think of exploring an unknown environment, learning a Q function increasingly well and storing the information in a deep neural network. In terms of the HJB equation this amounts to solving the equation by a deep neural network." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are basically two approaches: learning the $Q$ function and learning the policy $ \\pi $ (often in a relaxed version). One can see this from the point of view of the HJB equation, which we take in the simplest case (one player, $c=0$):\n", "1. (Value iteration) Approximate solutions of the HJB equation by neural networks, i.e. choose a value function parametrized as a neural network and run one step of value iteration (a minimal sketch of this approach is given below).\n", "2. (Policy iteration) Approximate policies by neural networks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The previous algorithms were just implementations of solving fixed point problems by value or policy iteration; this can also be done with learning technology, yielding surprising and not yet fully understood effects. It is not clear why this works so well and, in contrast to some classical learning tasks, there is little regularity involved.\n", "\n", "However, very direct approaches are also efficient: in the sequel the game Cartpole is shown from several angles and a very direct approach to learning an efficient strategy is presented; we follow here the great blog entry by [Greg Surma](https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288)."
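, "\n", "\n", "Before turning to Cartpole, here is a minimal sketch of the first approach (a hypothetical illustration, not the method used in the cells below): a small Q-network is regressed onto Bellman targets built from a batch of transitions; the transitions here are random placeholders and the architecture is only illustrative.\n", "\n", "```python\n", "import numpy as np\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.optimizers import Adam\n", "\n", "obs_dim, n_actions, gamma = 4, 2, 0.95  # assumed environment dimensions\n", "\n", "# Q-network: maps an observation to one Q-value per action\n", "q_net = Sequential()\n", "q_net.add(Dense(24, input_dim=obs_dim, activation='relu'))\n", "q_net.add(Dense(n_actions, activation='linear'))\n", "q_net.compile(loss='mse', optimizer=Adam())\n", "\n", "# placeholder batch of transitions (state, action, reward, next state)\n", "states = np.random.randn(32, obs_dim)\n", "actions = np.random.randint(n_actions, size=32)\n", "rewards = np.random.randn(32)\n", "next_states = np.random.randn(32, obs_dim)\n", "\n", "# one step of fitted value iteration: regress onto the Bellman targets\n", "targets = q_net.predict(states)\n", "targets[np.arange(32), actions] = rewards + gamma * np.max(q_net.predict(next_states), axis=1)\n", "q_net.fit(states, targets, epochs=1, verbose=0)\n", "```\n", "\n", "Iterating this step on observed instead of random transitions is exactly the deep Q-learning scheme appearing at the end of this notebook."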
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "import gym\n", "import random\n", "import numpy as np\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.optimizers import Adam" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "env = gym.make('CartPole-v1')\n", "env.reset()\n", "goal_steps = 500\n", "score_requirement = 60\n", "intial_games = 10000" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def play_a_random_game_first():\n", " for step_index in range(goal_steps):\n", " env.render()\n", " action = env.action_space.sample()\n", " observation, reward, done, info = env.step(action)\n", " print(\"Step {}:\".format(step_index))\n", " print(\"action: {}\".format(action))\n", " print(\"observation: {}\".format(observation))\n", " print(\"reward: {}\".format(reward))\n", " print(\"done: {}\".format(done))\n", " print(\"info: {}\".format(info))\n", " if done:\n", " break\n", " env.reset()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Step 0:\n", "action: 0\n", "observation: [-0.019366 -0.17879223 -0.00830127 0.25901343]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 1:\n", "action: 0\n", "observation: [-0.02294184 -0.37379469 -0.003121 0.5490665 ]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 2:\n", "action: 0\n", "observation: [-0.03041774 -0.56887266 0.00786033 0.84076446]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 3:\n", "action: 0\n", "observation: [-0.04179519 -0.76410104 0.02467562 1.13590889]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 4:\n", "action: 0\n", "observation: [-0.05707721 -0.95953696 0.0473938 1.43622743]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 5:\n", "action: 1\n", "observation: [-0.07626795 -0.76503029 0.07611835 1.1587236 ]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 6:\n", "action: 0\n", "observation: [-0.09156856 -0.96105713 0.09929282 1.47426961]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 7:\n", "action: 1\n", "observation: [-0.1107897 -0.76727897 0.12877821 1.2141782 ]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 8:\n", "action: 1\n", "observation: [-0.12613528 -0.57403203 0.15306178 0.96446429]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 9:\n", "action: 0\n", "observation: [-0.13761592 -0.77084187 0.17235106 1.30105232]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 10:\n", "action: 0\n", "observation: [-0.15303276 -0.96768009 0.19837211 1.64235593]\n", "reward: 1.0\n", "done: False\n", "info: {}\n", "Step 11:\n", "action: 1\n", "observation: [-0.17238636 -0.77535698 0.23121923 1.41746846]\n", "reward: 1.0\n", "done: True\n", "info: {}\n" ] } ], "source": [ "play_a_random_game_first()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can read at the [Cartpole documentation](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py) what the numbers do precisely mean. Now we create a set of random strategies which were up to some extend successful. Notice that you have to install from some [packages](https://github.com/gsurma/cartpole)." 
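, "\n", "\n", "As a quick check of the interface (a small illustration, not taken from the cited blog post), the observation and action spaces of the environment created above can be inspected directly:\n", "\n", "```python\n", "import gym\n", "\n", "env = gym.make('CartPole-v1')\n", "print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole velocity at tip\n", "print(env.action_space)       # Discrete(2): push the cart to the left or to the right\n", "print(env.reset())            # an initial observation: a vector of four floats close to zero\n", "```"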
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def model_data_preparation():\n", " training_data = []\n", " accepted_scores = []\n", " for game_index in range(intial_games):\n", " score = 0\n", " game_memory = []\n", " previous_observation = []\n", " for step_index in range(goal_steps):\n", " action = random.randrange(0, 2)\n", " observation, reward, done, info = env.step(action)\n", " \n", " if len(previous_observation) > 0:\n", " game_memory.append([previous_observation, action])\n", " \n", " previous_observation = observation\n", " score += reward\n", " if done:\n", " break\n", " \n", " if score >= score_requirement:\n", " accepted_scores.append(score)\n", " for data in game_memory:\n", " if data[1] == 1:\n", " output = [0, 1]\n", " elif data[1] == 0:\n", " output = [1, 0]\n", " training_data.append([data[0], output])\n", " \n", " env.reset()\n", "\n", " print(accepted_scores)\n", " print(len(accepted_scores))\n", " \n", " return training_data" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[86.0, 62.0, 63.0, 62.0, 62.0, 139.0, 61.0, 69.0, 72.0, 74.0, 66.0, 62.0, 83.0, 71.0, 68.0, 89.0, 65.0, 60.0, 84.0, 80.0, 69.0, 79.0, 82.0, 67.0, 94.0, 88.0, 67.0, 109.0, 60.0, 76.0, 65.0, 90.0, 66.0, 65.0, 63.0, 81.0, 94.0, 85.0, 68.0, 71.0, 79.0, 61.0, 80.0, 67.0, 91.0, 64.0, 72.0, 62.0, 91.0, 61.0, 74.0, 67.0, 114.0, 61.0, 60.0, 61.0, 83.0, 79.0, 66.0, 63.0, 80.0, 102.0, 89.0, 75.0, 67.0, 72.0, 71.0, 60.0, 62.0, 77.0, 66.0, 69.0, 88.0, 64.0, 66.0, 65.0, 92.0, 61.0, 87.0, 62.0, 64.0, 66.0, 67.0, 69.0, 64.0, 68.0, 105.0, 66.0, 63.0, 63.0, 79.0, 65.0, 69.0, 70.0, 60.0, 60.0, 85.0, 60.0, 74.0, 67.0, 66.0, 63.0, 74.0, 94.0, 70.0, 60.0, 66.0, 82.0, 71.0, 63.0, 81.0, 73.0, 60.0, 66.0, 63.0, 82.0, 73.0, 60.0, 78.0, 94.0, 78.0, 60.0, 63.0, 65.0, 68.0, 61.0, 88.0, 95.0, 95.0, 69.0, 76.0, 73.0, 96.0, 62.0, 64.0, 73.0, 83.0, 63.0, 62.0, 83.0, 73.0, 64.0, 63.0, 98.0, 64.0, 110.0, 74.0, 70.0, 76.0, 65.0, 63.0, 77.0, 71.0, 76.0, 62.0, 61.0, 70.0, 67.0, 62.0, 65.0, 64.0, 68.0, 62.0, 60.0, 63.0, 78.0, 75.0, 63.0, 62.0, 65.0, 79.0, 68.0, 67.0, 63.0]\n", "174\n" ] } ], "source": [ "training_data = model_data_preparation()\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def build_model(input_size, output_size):\n", " model = Sequential()\n", " model.add(Dense(128, input_dim=input_size, activation='relu'))\n", " model.add(Dense(52, activation='relu'))\n", " model.add(Dense(output_size, activation='linear'))\n", " model.compile(loss='mse', optimizer=Adam())\n", "\n", " return model\n", "\n", "def train_model(training_data):\n", " X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]))\n", " y = np.array([i[1] for i in training_data]).reshape(-1, len(training_data[0][1]))\n", " model = build_model(input_size=len(X[0]), output_size=len(y[0]))\n", " \n", " model.fit(X, y, epochs=10)\n", " return model" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:From /scratch/users/jteichma/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1290: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n", "Instructions for updating:\n", "keep_dims is deprecated, use keepdims instead\n", "Epoch 
1/10\n", "12442/12442 [==============================] - 0s - loss: 0.2465 \n", "Epoch 2/10\n", "12442/12442 [==============================] - 0s - loss: 0.2331 \n", "Epoch 3/10\n", "12442/12442 [==============================] - 0s - loss: 0.2317 \n", "Epoch 4/10\n", "12442/12442 [==============================] - 0s - loss: 0.2315 \n", "Epoch 5/10\n", "12442/12442 [==============================] - 0s - loss: 0.2306 \n", "Epoch 6/10\n", "12442/12442 [==============================] - 0s - loss: 0.2310 \n", "Epoch 7/10\n", "12442/12442 [==============================] - 0s - loss: 0.2307 \n", "Epoch 8/10\n", "12442/12442 [==============================] - 0s - loss: 0.2303 \n", "Epoch 9/10\n", "12442/12442 [==============================] - 0s - loss: 0.2301 \n", "Epoch 10/10\n", "12442/12442 [==============================] - 0s - loss: 0.2299 \n" ] } ], "source": [ "trained_model = train_model(training_data)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[155.0, 253.0, 197.0, 228.0, 135.0, 138.0, 172.0, 265.0, 161.0, 131.0, 331.0, 326.0, 284.0, 330.0, 152.0, 213.0, 311.0, 212.0, 152.0, 209.0, 159.0, 388.0, 139.0, 129.0, 265.0, 172.0, 285.0, 323.0, 200.0, 187.0, 166.0, 267.0, 297.0, 275.0, 172.0, 161.0, 271.0, 253.0, 155.0, 211.0, 246.0, 358.0, 175.0, 151.0, 202.0, 174.0, 239.0, 136.0, 187.0, 182.0, 239.0, 367.0, 210.0, 142.0, 373.0, 247.0, 276.0, 376.0, 139.0, 136.0, 217.0, 131.0, 275.0, 160.0, 265.0, 168.0, 150.0, 346.0, 277.0, 331.0, 445.0, 147.0, 280.0, 200.0, 150.0, 164.0, 172.0, 260.0, 187.0, 118.0, 153.0, 141.0, 144.0, 177.0, 247.0, 135.0, 181.0, 180.0, 406.0, 257.0, 129.0, 153.0, 239.0, 211.0, 352.0, 127.0, 294.0, 339.0, 334.0, 298.0]\n", "Average Score: 223.25\n", "choice 1:0.49993281075027995 choice 0:0.50006718924972\n" ] } ], "source": [ "scores = []\n", "choices = []\n", "for each_game in range(100):\n", " score = 0\n", " prev_obs = []\n", " for step_index in range(goal_steps):\n", " # Uncomment below line if you want to see how our bot is playing the game.\n", " #env.render()\n", " #print('Step:', step_index)\n", " if len(prev_obs)==0:\n", " action = random.randrange(0,2)\n", " else:\n", " action = np.argmax(trained_model.predict(prev_obs.reshape(-1, len(prev_obs)))[0])\n", " \n", " choices.append(action)\n", " new_observation, reward, done, info = env.step(action)\n", " prev_obs = new_observation\n", " score+=reward\n", " if done:\n", " break\n", " #print('Game:', each_game)\n", " env.reset()\n", " scores.append(score)\n", "\n", "print(scores)\n", "print('Average Score:',sum(scores)/len(scores))\n", "print('choice 1:{} choice 0:{}'.format(choices.count(1)/len(choices),choices.count(0)/len(choices)))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import random\n", "import gym\n", "import numpy as np\n", "from collections import deque\n", "from keras.models import Sequential\n", "from keras.layers import Dense\n", "from keras.optimizers import Adam\n", "\n", "\n", "from scores.score_logger import ScoreLogger\n", "\n", "ENV_NAME = \"CartPole-v1\"\n", "\n", "GAMMA = 0.95\n", "LEARNING_RATE = 0.001\n", "\n", "MEMORY_SIZE = 1000000\n", "BATCH_SIZE = 20\n", "\n", "EXPLORATION_MAX = 1.0\n", "EXPLORATION_MIN = 0.01\n", "EXPLORATION_DECAY = 0.995\n", "\n", "\n", "class DQNSolver:\n", "\n", " def __init__(self, observation_space, action_space):\n", " self.exploration_rate = EXPLORATION_MAX\n", "\n", " self.action_space = 
action_space\n", " self.memory = deque(maxlen=MEMORY_SIZE)\n", "\n", " self.model = Sequential()\n", " self.model.add(Dense(24, input_shape=(observation_space,), activation=\"relu\"))\n", " self.model.add(Dense(24, activation=\"relu\"))\n", " self.model.add(Dense(self.action_space, activation=\"linear\"))\n", " self.model.compile(loss=\"mse\", optimizer=Adam(lr=LEARNING_RATE))\n", "\n", " def remember(self, state, action, reward, next_state, done):\n", " self.memory.append((state, action, reward, next_state, done))\n", "\n", " def act(self, state):\n", " if np.random.rand() < self.exploration_rate:\n", " return random.randrange(self.action_space)\n", " q_values = self.model.predict(state)\n", " return np.argmax(q_values[0])\n", "\n", " def experience_replay(self):\n", " if len(self.memory) < BATCH_SIZE:\n", " return\n", " batch = random.sample(self.memory, BATCH_SIZE)\n", " for state, action, reward, state_next, terminal in batch:\n", " q_update = reward\n", " if not terminal:\n", " q_update = (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))\n", " q_values = self.model.predict(state)\n", " q_values[0][action] = q_update\n", " self.model.fit(state, q_values, verbose=0)\n", " self.exploration_rate *= EXPLORATION_DECAY\n", " self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)\n", "\n", "\n", "def cartpole():\n", " env = gym.make(ENV_NAME)\n", " score_logger = ScoreLogger(ENV_NAME)\n", " observation_space = env.observation_space.shape[0]\n", " action_space = env.action_space.n\n", " dqn_solver = DQNSolver(observation_space, action_space)\n", " run = 0\n", " while True:\n", " run += 1\n", " state = env.reset()\n", " state = np.reshape(state, [1, observation_space])\n", " step = 0\n", " while True:\n", " step += 1\n", " #env.render()\n", " action = dqn_solver.act(state)\n", " state_next, reward, terminal, info = env.step(action)\n", " reward = reward if not terminal else -reward\n", " state_next = np.reshape(state_next, [1, observation_space])\n", " dqn_solver.remember(state, action, reward, state_next, terminal)\n", " state = state_next\n", " if terminal:\n", " print(\"Run: \" + str(run) + \", exploration: \" + str(dqn_solver.exploration_rate) + \", score: \" + str(step))\n", " score_logger.add_score(step, run)\n", " break\n", " dqn_solver.experience_replay()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "cartpole()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }