Deep optimal stopping

In the following we will implement a deep learning method for optimal stopping problems based on the article Becker, S., Cheridito, P., and Jentzen, A. Deep optimal stopping. Journal of Machine Learning Research 20, 74 (2019), 1–25, and test it on Bermudan max-call options.

In [8]:
import numpy as np
import tensorflow as tf
from scipy.stats import norm
from tensorflow.python.ops import init_ops
from tensorflow.contrib.layers.python.layers import initializers
from tensorflow.python.training.moving_averages import assign_moving_average
from tensorflow.contrib.layers.python.layers import utils
In [9]:
def neural_net(x, name, num_neurons, 
               activation_fn=tf.nn.relu, reuse=None, dtype=tf.float32):
    """Feedforward network; num_neurons lists the output sizes of all layers."""

    def fc_layer(_x, out_size, activation):
        # affine map _x W + b followed by the given activation
        shape = _x.get_shape().as_list()
        w = tf.get_variable(
            name='weights',
            shape=[shape[-1], out_size],
            dtype=dtype,
            initializer=tf.random_uniform_initializer())
        b = tf.get_variable(
            name='bias',
            shape=[1, out_size],
            dtype=dtype,
            initializer=tf.random_normal_initializer())
        return activation(tf.matmul(_x, w) + b)

    with tf.variable_scope(name, reuse=reuse):
        # hidden layers use activation_fn; the last layer stays linear
        for i in range(len(num_neurons)):
            with tf.variable_scope('layer_{0}'.format(i)):
                x = fc_layer(x, num_neurons[i],
                             activation_fn if i < len(num_neurons) - 1 else tf.identity)
        return x
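
For instance, the following hypothetical call builds a network with two hidden ReLU layers of 50 neurons and a scalar linear output; the input dimension and layer sizes here are illustrative choices:

x = tf.placeholder(tf.float32, shape=[None, 5])
y = neural_net(x, name='example_net', num_neurons=[50, 50, 1])

A second call with the same name and reuse=True attaches to the already created variables instead of creating new ones.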

Bermudan max-call options

First of all, we need to create training samples. In this case, the task boils down to simulating sample paths of the $d$ underlying assets:

We assume that the risk-neutral dynamics of the assets are given by a multi-dimensional Black-Scholes model

$$ S_t^i = s_0 \exp([r - \delta - \sigma^2/2] t + \sigma W_t^i), \quad i = 1, 2, \ldots, d, $$

for initial values $s_0 \in (0,\infty) $, a risk-free interest rate $ r \in \mathbb{R} $, dividend yields $ \delta \in [0, \infty) $, volatilities $ \sigma \in (0, \infty) $ and a $d$-dimensional Brownian motion $W$ with constant instantaneous correlations $\rho_{ij} = 0$ for $i\neq j$ between different components $W^i$ and $W^j$; in other words, the components of $W$ are independent. In the following we assume a time grid of the form $t_n = nT/N$ and denote $X_n^i = S_{t_n}^i$ for $n=0,1,\ldots,N$. A Bermudan max-call option on $S^1, S^2, \ldots, S^d$ pays $(\max_{1\leq i \leq d} S_t^i - K)^+$ when exercised at time $t$ and can be exercised at any point of the time grid $0=t_0 < t_1 < \ldots < t_N $. The price can then be written as $\sup_{\tau \in \mathcal{T}} \mathbb{E}\, g(\tau,X_{\tau})$ for

$$ g(n,x) = e^{-r t_n} \left( \max_{1 \leq i \leq d} x^i - K \right)^+, $$

where $\mathcal{T}$ is the set of $X$-stopping times taking values in $\{0, 1, \ldots, N\}$.
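
Under these dynamics, training samples can be generated by direct simulation. Below is a minimal NumPy sketch; batch_size and the seed are arbitrary illustrative choices, and the model parameters match the example cell further down:

import numpy as np

T, N, d = 3., 9, 5
r, delta, sigma, s_0, K = 0.05, 0.1, 0.2, 100., 100.
batch_size = 4  # illustrative

rng = np.random.RandomState(0)
dW = rng.normal(size=(batch_size, d, N)) * np.sqrt(T / N)  # Brownian increments
W = np.cumsum(dW, axis=2)                                  # W_{t_1}, ..., W_{t_N}
t = np.arange(1, N + 1) * T / N
# X[:, i, n - 1] holds S^{i+1}_{t_n}; X_0 = s_0 is deterministic and omitted
X = s_0 * np.exp((r - delta - sigma ** 2 / 2.) * t + sigma * W)
# discounted payoffs g(n, X_n) for n = 1, ..., N
g = np.exp(-r * t) * np.maximum(np.max(X, axis=1) - K, 0.)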

In the next step we need to implement the deep optimal stopping method:

The numerical method consists in iteratively approximating optimal stopping decisions $ f_n \colon \mathbb{R}^d \to \{0,1\} $, $ n = 0, 1, \dots, N - 1 $, by a neural network $f^{ \theta } \colon \mathbb{R}^d \to \{ 0, 1 \}$ with parameter $ \theta \in \mathbb{R}^q $. We do this by starting with the terminal stopping decision $ f_N \equiv 1 $ and proceeding by backward induction. More precisely, let $ n \in \{ 0, 1, \dots, N - 1 \} $, and assume parameter values $ \theta_{ n + 1 }, \theta_{ n + 2 }, \dots, \theta_N \in \mathbb{R}^q $ have been found such that $f^{\theta_N} \equiv 1$ and the stopping time

$$ \tau_{n+1} = \sum_{m =n+1}^{N} m f^{\theta_m}(X_m) \prod_{j=n+1}^{m-1} (1-f^{\theta_j}(X_j)) $$

produces an expected value $ \mathbb{E} \, g(\tau_{n+1},X_{\tau_{n+1}}) $ close to the optimal one achievable with stopping times taking values in $\{n+1, n+2, \ldots, N\}$. Since $f^{\theta}$ takes values in $\{0,1\}$, it does not directly lend itself to a gradient-based optimization method. So, as an intermediate step, we introduce a feedforward neural network $F^{\theta} \colon \mathbb{R}^d \to (0,1)$ of the form

$$ F^{\theta} = \psi \circ a^{\theta}_I \circ \varphi_{q_{I-1}} \circ a^{\theta}_{I-1} \circ \dots \circ \varphi_{q_1} \circ a^{\theta}_1,$$

where

  • $I, q_1, q_2, \dots, q_{I-1}$ are positive integers specifying the depth of the network and the number of nodes in the hidden layers,

  • $a^{\theta}_1 \colon \mathbb{R}^d \to \mathbb{R}^{q_1}, \dots, a^{\theta}_{I-1} \colon \mathbb{R}^{q_{I-2}} \to \mathbb{R}^{q_{I-1}} $ and $a^{\theta}_I \colon \mathbb{R}^{q_{I-1}} \to \mathbb{R}$ are affine functions,

  • for $j \in \mathbb{N}$, $\varphi_j \colon \mathbb{R}^j \to \mathbb{R}^j$ is the component-wise ReLU activation function given by $\varphi_j(x_1, \dots, x_j) = (x^+_1 , \dots, x^+_j)$, where $x^+ = \max\{x, 0\}$, and

  • $\psi \colon \mathbb{R} \to (0,1)$ is the standard logistic function $\psi(x) = e^x/(1+ e^x) = 1 / ( 1 + e^{ - x } )$.

The components of the parameter $\theta \in \mathbb{R}^q$ of $F^{\theta}$ consist of the entries of the matrices $A_1 \in \mathbb{R}^{q_1 \times d}, \dots,$ $A_{I-1} \in \mathbb{R}^{q_{I-1} \times q_{I-2}}, A_I \in \mathbb{R}^{1 \times q_{I-1}}$ and the vectors $b_1 \in \mathbb{R}^{q_1}, \dots, b_{I-1} \in \mathbb{R}^{q_{I-1}}, b_I \in \mathbb{R}$ given by the representation of the affine functions

$$a^{\theta}_i(x) = A_i x + b_i, \quad i = 1, \dots, I. $$

Our aim is to determine $\theta_n \in \mathbb{R}^q$ so that

$$ \mathbb{E} \left[g(n,X_n) F^{\theta_n}(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}}) (1- F^{\theta_n}(X_n))\right] $$

is close to the supremum $\sup_{\theta \in \mathbb{R}^q} \mathbb{E} \left[g(n,X_n) F^{\theta}(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}}) (1- F^{\theta}(X_n))\right]$. Once this has been achieved, we define the function $f^{\theta_n} \colon \mathbb{R}^d \to \{0,1\}$ by

$$ f^{\theta_n} = 1_{[0,\infty)} \circ a^{\theta_n}_I \circ \varphi_{q_{I-1}} \circ a^{\theta_n}_{I-1} \circ \dots \circ \varphi_{q_1} \circ a^{\theta_n}_1, $$

where $1_{[0,\infty)} \colon \mathbb{R} \to \{0,1\}$ is the indicator function of $[0,\infty)$. The only difference between $F^{\theta_n}$ and $f^{\theta_n}$ is the final nonlinearity. While $F^{\theta_n}$ produces a stopping probability in $(0,1)$, the output of $f^{\theta_n}$ is a hard stopping decision given by $0$ or $1$, depending on whether $F^{\theta_n}$ takes a value below $1/2$ or not.
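
In code, $F^{\theta}$ and $f^{\theta}$ share all affine layers and differ only in the final nonlinearity, so both can be read off the pre-sigmoid output of the neural_net helper above. A sketch for a single time step $n$, with placeholder tensors for the payoffs (the article uses two hidden layers of $d + 40$ neurons each):

d, n = 5, 3                                  # illustrative dimension and time step
x_n = tf.placeholder(tf.float32, [None, d])  # batch of states X_n
g_n = tf.placeholder(tf.float32, [None])     # g(n, X_n)
g_tau = tf.placeholder(tf.float32, [None])   # g(tau_{n+1}, X_{tau_{n+1}})

logits = neural_net(x_n, 'decision_{0}'.format(n), [d + 40, d + 40, 1])[:, 0]
F_theta = tf.nn.sigmoid(logits)              # soft decision in (0, 1)
f_theta = tf.cast(logits >= 0., tf.float32)  # hard decision; logits >= 0 iff F_theta >= 1/2
# training maximizes E[g(n, X_n) F + g(tau_{n+1}, X_{tau_{n+1}}) (1 - F)],
# i.e. minimizes its negative
step_loss = -tf.reduce_mean(g_n * F_theta + g_tau * (1. - F_theta))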

Some results from our article, where $\hat{L}$ and $\hat{U}$ denote the lower and upper bound estimates of the price, $t_L$ and $t_U$ the corresponding computation times in seconds, and the point estimate is the midpoint of $\hat{L}$ and $\hat{U}$:

| $d$ | $s_0$ | $\hat{L}$ | $t_L$ | $\hat{U}$ | $t_U$ | Point est. | 95% CI |
|----:|------:|----------:|------:|----------:|------:|-----------:|:-----------------|
| 2 | 90 | 8.072 | 28.7 | 8.075 | 25.4 | 8.074 | [8.060, 8.081] |
| 2 | 100 | 13.895 | 28.7 | 13.903 | 25.3 | 13.899 | [13.880, 13.910] |
| 2 | 110 | 21.353 | 28.4 | 21.346 | 25.3 | 21.349 | [21.336, 21.354] |
| 3 | 90 | 11.290 | 28.8 | 11.283 | 26.3 | 11.287 | [11.276, 11.290] |
| 3 | 100 | 18.690 | 28.9 | 18.691 | 26.4 | 18.690 | [18.673, 18.699] |
| 3 | 110 | 27.564 | 27.6 | 27.581 | 26.3 | 27.573 | [27.545, 27.591] |
| 5 | 90 | 16.648 | 27.6 | 16.640 | 28.4 | 16.644 | [16.633, 16.648] |
| 5 | 100 | 26.156 | 28.1 | 26.162 | 28.3 | 26.159 | [26.138, 26.174] |
| 5 | 110 | 36.766 | 27.7 | 36.777 | 28.4 | 36.772 | [36.745, 36.789] |
| 10 | 90 | 26.208 | 30.4 | 26.272 | 33.9 | 26.240 | [26.189, 26.289] |
| 10 | 100 | 38.321 | 30.5 | 38.353 | 34.0 | 38.337 | [38.300, 38.367] |
| 10 | 110 | 50.857 | 30.8 | 50.914 | 34.0 | 50.886 | [50.834, 50.937] |
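
Here the 95% interval combines the one-sided normal confidence bounds of $\hat{L}$ and $\hat{U}$. A small sketch of this arithmetic, with standard errors made up to roughly reproduce the $d = 5$, $s_0 = 100$ row:

from scipy.stats import norm  # imported at the top of the notebook

# lower/upper bound estimates with hypothetical Monte Carlo standard errors
L_hat, se_L = 26.156, 0.009
U_hat, se_U = 26.162, 0.006

z = norm.ppf(0.975)                        # one-sided 97.5% normal quantile
point_estimate = (L_hat + U_hat) / 2.      # 26.159
ci = (L_hat - z * se_L, U_hat + z * se_U)  # approx. (26.138, 26.174)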
In [11]:
tf.reset_default_graph()

# parameters for the max-call example: maturity T, N time steps, d assets
T, N, d = 3., 9, 5
r, delta, sigma = 0.05, 0.1, 0.2
K, s_0 = 100., 100.
train_steps, mc_runs = 0, 0  # <-- placeholders, set to positive values


# <-- create training samples, implement the algorithm, and fill in loss and g_tau in a meaningful way
loss = 0.
g_tau = 0.


# bare minimum boilerplate code below
px_mean_batch = tf.reduce_mean(g_tau)

optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(tf.reduce_mean(loss))

with tf.Session() as sess:

    sess.run(tf.global_variables_initializer())
    
    for _ in range(train_steps):
        sess.run(train_op)
    
    px_mean = 0.
    for _ in range(mc_runs):
        px_mean += sess.run(px_mean_batch)

    print(px_mean / mc_runs)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-3f70bd095c29> in <module>
     17 
     18 optimizer = tf.train.AdamOptimizer()
---> 19 train_op = optimizer.minimize(tf.reduce_mean(loss))
     20 
     21 with tf.Session() as sess:

/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py in minimize(self, loss, global_step, var_list, gate_gradients, aggregation_method, colocate_gradients_with_ops, name, grad_loss)
    398         aggregation_method=aggregation_method,
    399         colocate_gradients_with_ops=colocate_gradients_with_ops,
--> 400         grad_loss=grad_loss)
    401 
    402     vars_with_grad = [v for g, v in grads_and_vars if g is not None]

/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py in compute_gradients(self, loss, var_list, gate_gradients, aggregation_method, colocate_gradients_with_ops, grad_loss)
    511     processors = [_get_processor(v) for v in var_list]
    512     if not var_list:
--> 513       raise ValueError("No variables to optimize.")
    514     var_refs = [p.target() for p in processors]
    515     grads = gradients.gradients(

ValueError: No variables to optimize.
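
One possible way to fill in the placeholder block above (and make the cell run) is sketched below. This is a minimal illustration rather than the implementation of the article: batch_size, the layer widths and the training schedule are ad-hoc choices, paths are simulated inside the graph so that every training step sees a fresh batch, and all stopping decisions are trained jointly through a summed loss, whereas the article trains them one at a time by backward induction. Since $g(0, X_0) = (s_0 - K)^+ = 0$ in this example, no decision network is needed at $n = 0$.

batch_size = 8192                 # illustrative
train_steps, mc_runs = 1500, 500  # illustrative

# simulate a fresh batch of Black-Scholes paths in every session run
dW = tf.random_normal([batch_size, d, N], dtype=tf.float32) * np.sqrt(T / N)
W = tf.cumsum(dW, axis=2)
t = tf.constant(np.arange(1, N + 1) * T / N, dtype=tf.float32)
X = s_0 * tf.exp((r - delta - sigma ** 2 / 2.) * t + sigma * W)  # [batch, d, N]

# discounted payoffs g(n, X_n) for n = 1, ..., N
g = tf.exp(-r * t) * tf.nn.relu(tf.reduce_max(X, axis=1) - K)    # [batch, N]

# backward induction: f_N = 1, then decisions for n = N - 1, ..., 1
g_tau = g[:, N - 1]               # payoff of the current stopping rule tau_{n+1}
loss = 0.
for n in range(N - 1, 0, -1):
    logits = neural_net(X[:, :, n - 1], 'decision_{0}'.format(n),
                        [d + 40, d + 40, 1])[:, 0]
    F = tf.nn.sigmoid(logits)              # soft decision F^theta
    f = tf.cast(logits >= 0., tf.float32)  # hard decision f^theta
    loss += -tf.reduce_mean(g[:, n - 1] * F + g_tau * (1. - F))
    # tau-update: stop at n where f = 1, otherwise keep the previous stopping time
    g_tau = g[:, n - 1] * f + g_tau * (1. - f)

With these definitions the optimizer finds trainable variables, and px_mean approximates a lower bound for the option price.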

Below you will find some more helper functions for neural network architectures that we will use later to improve our implementation.

def neural_net_bn(x, name, num_neurons, is_training, 
                  activation_fn=tf.nn.relu, reuse=None, 
                  decay=0.9, var_epsilon=1e-6, dtype=tf.float32):

    def batch_normalization(_x):
        shape = _x.get_shape().as_list()
        beta = tf.get_variable(name='beta', shape=[shape[-1]], dtype=dtype,
                               initializer=init_ops.zeros_initializer())
        gamma = tf.get_variable(name='gamma', shape=[shape[-1]], dtype=dtype,
                                initializer=init_ops.ones_initializer())
        mv_mean = tf.get_variable('mv_mean', [shape[-1]], dtype=dtype,
                                  initializer=init_ops.zeros_initializer(), 
                                  trainable=False)
        mv_var = tf.get_variable('mv_var', [shape[-1]], dtype=dtype,
                                 initializer=init_ops.zeros_initializer(), 
                                 trainable=False)
        mean, variance = tf.nn.moments(_x, [0], name='moments')
        tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, 
                             assign_moving_average(mv_mean, mean, decay, 
                                                   zero_debias=True))
        tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, 
                             assign_moving_average(mv_var, variance, decay, 
                                                   zero_debias=True))
        # batch statistics during training, moving averages at inference time
        mean, variance = utils.smart_cond(is_training, 
                                          lambda: (mean, variance), 
                                          lambda: (mv_mean, mv_var))
        return gamma * (_x - mean) / tf.sqrt(variance + var_epsilon) + beta

    def fc_layer(_x, out_size, activation):
        shape = _x.get_shape().as_list()
        w = tf.get_variable(
            name='weights',
            shape=[shape[-1], out_size],
            dtype=dtype,
            initializer=initializers.xavier_initializer())
        return activation(batch_normalization(tf.matmul(_x, w)))

    with tf.variable_scope(name, reuse=reuse):
        x = batch_normalization(x)
        for i in range(len(num_neurons)):
            with tf.variable_scope('layer_{0}'.format(i)):
                x = fc_layer(x, num_neurons[i],
                             activation_fn if i < len(num_neurons) - 1 else tf.identity)
        return x
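
One caveat when using these batch-normalized networks: the moving-average updates are only collected in tf.GraphKeys.UPDATE_OPS, so they have to be run together with the training op. A sketch of the standard pattern; x, the layer sizes and the loss are stand-ins for the actual model:

x = tf.placeholder(tf.float32, [None, 5])
is_training = tf.placeholder(tf.bool, [])
y = neural_net_bn(x, 'bn_net', [50, 50, 1], is_training)
loss = tf.reduce_mean(tf.square(y))  # placeholder objective

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):  # run the moving-average updates each step
    train_op = tf.train.AdamOptimizer().minimize(loss)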

def neural_net_bn_combined(x, name, num_neurons, is_training, 
                           activation_fn=tf.nn.relu, reuse=None, 
                           decay=0.9, var_epsilon=1e-6, dtype=tf.float32):

    def batch_normalization(_x):
        shape = _x.get_shape().as_list()
        _x = tf.reshape(_x, [-1, shape[1] * shape[2]])
        beta = tf.get_variable(name='beta', 
                               shape=[shape[1] * shape[2]], dtype=dtype,
                               initializer=init_ops.zeros_initializer())
        gamma = tf.get_variable(name='gamma', 
                                shape=[shape[1] * shape[2]], dtype=dtype,
                                initializer=init_ops.ones_initializer())
        mv_mean = tf.get_variable('mv_mean', 
                                  shape=[shape[1] * shape[2]], dtype=dtype,
                                  initializer=init_ops.zeros_initializer(), 
                                  trainable=False)
        mv_var = tf.get_variable('mv_var', 
                                 shape=[shape[1] * shape[2]], 
                                 dtype=dtype,
                                 initializer=init_ops.zeros_initializer(), 
                                 trainable=False)
        mean, variance = tf.nn.moments(_x, [0], name='moments')
        tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, 
                             assign_moving_average(mv_mean, mean, decay, 
                                                   zero_debias=True))
        tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, 
                             assign_moving_average(mv_var, variance, decay, 
                                                   zero_debias=True))
        # batch statistics during training, moving averages at inference time
        mean, variance = utils.smart_cond(is_training, 
                                          lambda: (mean, variance), 
                                          lambda: (mv_mean, mv_var))
        return tf.reshape(gamma * (_x - mean) / tf.sqrt(variance + var_epsilon) + beta, 
                          [-1, shape[1], shape[2]])

    def fc_layer(_x, out_size, activation):
        shape = _x.get_shape().as_list()
        # one weight matrix per slice of the last axis (one per time step)
        w = tf.get_variable(
            name='weights',
            shape=[shape[2], shape[1], out_size],
            dtype=dtype,
            initializer=initializers.xavier_initializer())
        # batched matmul: [N, batch, q] x [N, q, out_size] -> [N, batch, out_size]
        _x = tf.matmul(tf.transpose(_x, [2, 0, 1]), w)
        return activation(batch_normalization(tf.transpose(_x, [1, 2, 0])))

    with tf.variable_scope(name, reuse=reuse):
        x = batch_normalization(x)
        for i in range(len(num_neurons)):
            with tf.variable_scope('layer_{0}'.format(i)):
                x = fc_layer(x, num_neurons[i],
                             activation_fn if i < len(num_neurons) - 1 else tf.identity)
        return x
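
neural_net_bn_combined expects inputs of shape [batch_size, d, N] and, through the batched matmul over the last axis, evaluates one independent sub-network per time step in a single pass. A sketch of a call with illustrative shapes:

x_all = tf.placeholder(tf.float32, [None, 5, 9])  # all states X_1, ..., X_N as [batch, d, N]
is_training = tf.placeholder(tf.bool, [])
logits_all = neural_net_bn_combined(x_all, 'combined', [45, 45, 1], is_training)
# logits_all[:, 0, n] holds the decision logits of the sub-network for time step n + 1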