Deep optimal stopping

In the following we will implement a deep learning method for optimal stopping problems based on the article Becker, S., Cheridito, P., and Jentzen, A. Deep optimal stopping. Journal of Machine Learning Research 20, 74 (2019), 1–25, and test it on Bermudan max-call options.

In [8]:
import numpy as np
import tensorflow as tf
from scipy.stats import norm
from tensorflow.python.ops import init_ops
from tensorflow.contrib.layers.python.layers import initializers
from tensorflow.python.training.moving_averages import assign_moving_average
from tensorflow.contrib.layers.python.layers import utils
In [9]:
def neural_net(x, name, num_neurons, 
               activation_fn=tf.nn.relu, reuse=None, dtype=tf.float32):
    """Feedforward network; num_neurons lists the output sizes of all layers."""

    def fc_layer(_x, out_size, activation):
        # affine map _x W + b followed by the given activation
        shape = _x.get_shape().as_list()
        w = tf.get_variable(
            name='weights',
            shape=[shape[-1], out_size],
            dtype=dtype,
            initializer=tf.random_uniform_initializer())
        b = tf.get_variable(
            name='bias',
            shape=[1, out_size],
            dtype=dtype,
            initializer=tf.random_normal_initializer())
        return activation(tf.matmul(_x, w) + b)

    with tf.variable_scope(name, reuse=reuse):
        # hidden layers use activation_fn; the last layer stays linear
        for i in range(len(num_neurons)):
            with tf.variable_scope('layer_{0}'.format(i)):
                x = fc_layer(x, num_neurons[i],
                             activation_fn if i < len(num_neurons) - 1 else tf.identity)
        return x
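
For instance, the following hypothetical call builds a network with two hidden ReLU layers of 50 neurons and a scalar linear output; the input dimension and layer sizes here are illustrative choices:

x = tf.placeholder(tf.float32, shape=[None, 5])
y = neural_net(x, name='example_net', num_neurons=[50, 50, 1])

A second call with the same name and reuse=True attaches to the already created variables instead of creating new ones.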

Bermudan max-call options

First of all, we need to create training samples. In this case, the task boils down to simulating sample paths of the $d$ underlying assets:

We assume that the risk-neutral dynamics of the assets are given by a multi-dimensional Black-Scholes model

$$ S_t^i = s_0 \exp([r - \delta - \sigma^2/2] t + \sigma W_t^i), \quad i = 1, 2, \ldots, d, $$

for initial values $s_0 \in (0,\infty) $, a risk-free interest rate $ r \in \mathbb{R} $, dividend yields $ \delta \in [0, \infty) $, volatilities $ \sigma \in (0, \infty) $ and a $d$-dimensional Brownian motion $W$ with constant instantaneous correlations $\rho_{ij} = 0$ for $i\neq j$ between different components $W^i$ and $W^j$; in other words, the components of $W$ are independent. In the following we assume a time grid of the form $t_n = nT/N$ and denote $X_n^i = S_{t_n}^i$ for $n=0,1,\ldots,N$. A Bermudan max-call option on $S^1, S^2, \ldots, S^d$ pays $(\max_{1\leq i \leq d} S_t^i - K)^+$ when exercised at time $t$ and can be exercised at any point of the time grid $0=t_0 < t_1 < \ldots < t_N $. The price can then be written as $\sup_{\tau \in \mathcal{T}} \mathbb{E}\, g(\tau,X_{\tau})$ for

$$ g(n,x) = e^{-r t_n} \left( \max_{1 \leq i \leq d} x^i - K \right)^+, $$

where $\mathcal{T}$ is the set of $X$-stopping times taking values in $\{0, 1, \ldots, N\}$.
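
Under these dynamics, training samples can be generated by direct simulation. Below is a minimal NumPy sketch; batch_size and the seed are arbitrary illustrative choices, and the model parameters match the example cell further down:

import numpy as np

T, N, d = 3., 9, 5
r, delta, sigma, s_0, K = 0.05, 0.1, 0.2, 100., 100.
batch_size = 4  # illustrative

rng = np.random.RandomState(0)
dW = rng.normal(size=(batch_size, d, N)) * np.sqrt(T / N)  # Brownian increments
W = np.cumsum(dW, axis=2)                                  # W_{t_1}, ..., W_{t_N}
t = np.arange(1, N + 1) * T / N
# X[:, i, n - 1] holds S^{i+1}_{t_n}; X_0 = s_0 is deterministic and omitted
X = s_0 * np.exp((r - delta - sigma ** 2 / 2.) * t + sigma * W)
# discounted payoffs g(n, X_n) for n = 1, ..., N
g = np.exp(-r * t) * np.maximum(np.max(X, axis=1) - K, 0.)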

In the next step we need to implement the deep optimal stopping method:

The numerical method consists in iteratively approximating optimal stopping decisions $ f_n \colon \mathbb{R}^d \to \{0,1\} $, $ n = 0, 1, \dots, N - 1 $, by a neural network $f^{ \theta } \colon \mathbb{R}^d \to \{ 0, 1 \}$ with parameter $ \theta \in \mathbb{R}^q $. We do this by starting with the terminal stopping decision $ f_N \equiv 1 $ and proceeding by backward induction. More precisely, let $ n \in \{ 0, 1, \dots, N - 1 \} $, and assume parameter values $ \theta_{ n + 1 }, \theta_{ n + 2 }, \dots, \theta_N \in \mathbb{R}^q $ have been found such that $f^{\theta_N} \equiv 1$ and the stopping time

$$ \tau_{n+1} = \sum_{m =n+1}^{N} m f^{\theta_m}(X_m) \prod_{j=n+1}^{m-1} (1-f^{\theta_j}(X_j)) $$

produces an expected value $ \mathbb{E} \, g(\tau_{n+1},X_{\tau_{n+1}}) $ close to the optimal one achievable with stopping times taking values in $\{n+1, n+2, \ldots, N\}$. Since $f^{\theta}$ takes values in $\{0,1\}$, it does not directly lend itself to a gradient-based optimization method. So, as an intermediate step, we introduce a feedforward neural network $F^{\theta} \colon \mathbb{R}^d \to (0,1)$ of the form

$$ F^{\theta} = \psi \circ a^{\theta}_I \circ \varphi_{q_{I-1}} \circ a^{\theta}_{I-1} \circ \dots \circ \varphi_{q_1} \circ a^{\theta}_1,$$

where

  • $I, q_1, q_2, \dots, q_{I-1}$ are positive integers specifying the depth of the network and the number of nodes in the hidden layers,

  • $a^{\theta}_1 \colon \mathbb{R}^d \to \mathbb{R}^{q_1}, \dots, a^{\theta}_{I-1} \colon \mathbb{R}^{q_{I-2}} \to \mathbb{R}^{q_{I-1}} $ and $a^{\theta}_I \colon \mathbb{R}^{q_{I-1}} \to \mathbb{R}$ are affine functions,

  • for $j \in \mathbb{N}$, $\varphi_j \colon \mathbb{R}^j \to \mathbb{R}^j$ is the component-wise ReLU activation function given by $\varphi_j(x_1, \dots, x_j) = (x^+_1 , \dots, x^+_j)$, where $x^+ = \max\{x, 0\}$, and

  • $\psi \colon \mathbb{R} \to (0,1)$ is the standard logistic function $\psi(x) = e^x/(1+ e^x) = 1 / ( 1 + e^{ - x } )$.

The components of the parameter $\theta \in \mathbb{R}^q$ of $F^{\theta}$ consist of the entries of the matrices $A_1 \in \mathbb{R}^{q_1 \times d}, \dots,$ $A_{I-1} \in \mathbb{R}^{q_{I-1} \times q_{I-2}}, A_I \in \mathbb{R}^{1 \times q_{I-1}}$ and the vectors $b_1 \in \mathbb{R}^{q_1}, \dots, b_{I-1} \in \mathbb{R}^{q_{I-1}}, b_I \in \mathbb{R}$ given by the representation of the affine functions

$$a^{\theta}_i(x) = A_i x + b_i, \quad i = 1, \dots, I. $$

Our aim is to determine $\theta_n \in \mathbb{R}^q$ so that

$$ \mathbb{E} \left[g(n,X_n) F^{\theta_n}(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}}) (1- F^{\theta_n}(X_n))\right] $$

is close to the supremum $\sup_{\theta \in \mathbb{R}^q} \mathbb{E} \left[g(n,X_n) F^{\theta}(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}}) (1- F^{\theta}(X_n))\right]$. Once this has been achieved, we define the function $f^{\theta_n} \colon \mathbb{R}^d \to \{0,1\}$ by

$$ f^{\theta_n} = 1_{[0,\infty)} \circ a^{\theta_n}_I \circ \varphi_{q_{I-1}} \circ a^{\theta_n}_{I-1} \circ \dots \circ \varphi_{q_1} \circ a^{\theta_n}_1, $$

where $1_{[0,\infty)} \colon \mathbb{R} \to \{0,1\}$ is the indicator function of $[0,\infty)$. The only difference between $F^{\theta_n}$ and $f^{\theta_n}$ is the final nonlinearity. While $F^{\theta_n}$ produces a stopping probability in $(0,1)$, the output of $f^{\theta_n}$ is a hard stopping decision given by $0$ or $1$, depending on whether $F^{\theta_n}$ takes a value below $1/2$ or not.
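
In code, $F^{\theta}$ and $f^{\theta}$ share all affine layers and differ only in the final nonlinearity, so both can be read off the pre-sigmoid output of the neural_net helper above. A sketch for a single time step $n$, with placeholder tensors for the payoffs (the article uses two hidden layers of $d + 40$ neurons each):

d, n = 5, 3                                  # illustrative dimension and time step
x_n = tf.placeholder(tf.float32, [None, d])  # batch of states X_n
g_n = tf.placeholder(tf.float32, [None])     # g(n, X_n)
g_tau = tf.placeholder(tf.float32, [None])   # g(tau_{n+1}, X_{tau_{n+1}})

logits = neural_net(x_n, 'decision_{0}'.format(n), [d + 40, d + 40, 1])[:, 0]
F_theta = tf.nn.sigmoid(logits)              # soft decision in (0, 1)
f_theta = tf.cast(logits >= 0., tf.float32)  # hard decision; logits >= 0 iff F_theta >= 1/2
# training maximizes E[g(n, X_n) F + g(tau_{n+1}, X_{tau_{n+1}}) (1 - F)],
# i.e. minimizes its negative
step_loss = -tf.reduce_mean(g_n * F_theta + g_tau * (1. - F_theta))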

Some results from our article, where $\hat{L}$ and $\hat{U}$ denote the lower and upper bound estimates of the price, $t_L$ and $t_U$ the corresponding computation times in seconds, and the point estimate is the midpoint of $\hat{L}$ and $\hat{U}$:

| $d$ | $s_0$ | $\hat{L}$ | $t_L$ | $\hat{U}$ | $t_U$ | Point est. | 95% CI |
|----:|------:|----------:|------:|----------:|------:|-----------:|:-----------------|
| 2 | 90 | 8.072 | 28.7 | 8.075 | 25.4 | 8.074 | [8.060, 8.081] |
| 2 | 100 | 13.895 | 28.7 | 13.903 | 25.3 | 13.899 | [13.880, 13.910] |
| 2 | 110 | 21.353 | 28.4 | 21.346 | 25.3 | 21.349 | [21.336, 21.354] |
| 3 | 90 | 11.290 | 28.8 | 11.283 | 26.3 | 11.287 | [11.276, 11.290] |
| 3 | 100 | 18.690 | 28.9 | 18.691 | 26.4 | 18.690 | [18.673, 18.699] |
| 3 | 110 | 27.564 | 27.6 | 27.581 | 26.3 | 27.573 | [27.545, 27.591] |
| 5 | 90 | 16.648 | 27.6 | 16.640 | 28.4 | 16.644 | [16.633, 16.648] |
| 5 | 100 | 26.156 | 28.1 | 26.162 | 28.3 | 26.159 | [26.138, 26.174] |
| 5 | 110 | 36.766 | 27.7 | 36.777 | 28.4 | 36.772 | [36.745, 36.789] |
| 10 | 90 | 26.208 | 30.4 | 26.272 | 33.9 | 26.240 | [26.189, 26.289] |
| 10 | 100 | 38.321 | 30.5 | 38.353 | 34.0 | 38.337 | [38.300, 38.367] |
| 10 | 110 | 50.857 | 30.8 | 50.914 | 34.0 | 50.886 | [50.834, 50.937] |
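
Here the 95% interval combines the one-sided normal confidence bounds of $\hat{L}$ and $\hat{U}$. A small sketch of this arithmetic, with standard errors made up to roughly reproduce the $d = 5$, $s_0 = 100$ row:

from scipy.stats import norm  # imported at the top of the notebook

# lower/upper bound estimates with hypothetical Monte Carlo standard errors
L_hat, se_L = 26.156, 0.009
U_hat, se_U = 26.162, 0.006

z = norm.ppf(0.975)                        # one-sided 97.5% normal quantile
point_estimate = (L_hat + U_hat) / 2.      # 26.159
ci = (L_hat - z * se_L, U_hat + z * se_U)  # approx. (26.138, 26.174)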
In [11]:
tf.reset_default_graph()

# parameters for the max-call example: maturity T, N time steps, d assets
T, N, d = 3., 9, 5
r, delta, sigma = 0.05, 0.1, 0.2
K, s_0 = 100., 100.
train_steps, mc_runs = 0, 0  # <-- placeholders, set to positive values


# <-- create training samples, implement the algorithm, and fill in loss and g_tau in a meaningful way
loss = 0.
g_tau = 0.


# bare minimum boilerplate code below
px_mean_batch = tf.reduce_mean(g_tau)

optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(tf.reduce_mean(loss))

with tf.Session() as sess:

    sess.run(tf.global_variables_initializer())
    
    for _ in range(train_steps):
        sess.run(train_op)
    
    px_mean = 0.
    for _ in range(mc_runs):
        px_mean += sess.run(px_mean_batch)

    print(px_mean / mc_runs)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-3f70bd095c29> in <module>
     17 
     18 optimizer = tf.train.AdamOptimizer()
---> 19 train_op = optimizer.minimize(tf.reduce_mean(loss))
     20 
     21 with tf.Session() as sess:

/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py in minimize(self, loss, global_step, var_list, gate_gradients, aggregation_method, colocate_gradients_with_ops, name, grad_loss)
    398         aggregation_method=aggregation_method,
    399         colocate_gradients_with_ops=colocate_gradients_with_ops,
--> 400         grad_loss=grad_loss)
    401 
    402     vars_with_grad = [v for g, v in grads_and_vars if g is not None]

/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py in compute_gradients(self, loss, var_list, gate_gradients, aggregation_method, colocate_gradients_with_ops, grad_loss)
    511     processors = [_get_processor(v) for v in var_list]
    512     if not var_list:
--> 513       raise ValueError("No variables to optimize.")
    514     var_refs = [p.target() for p in processors]
    515     grads = gradients.gradients(

ValueError: No variables to optimize.
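
One possible way to fill in the placeholder block above (and make the cell run) is sketched below. This is a minimal illustration rather than the implementation of the article: batch_size, the layer widths and the training schedule are ad-hoc choices, paths are simulated inside the graph so that every training step sees a fresh batch, and all stopping decisions are trained jointly through a summed loss, whereas the article trains them one at a time by backward induction. Since $g(0, X_0) = (s_0 - K)^+ = 0$ in this example, no decision network is needed at $n = 0$.

batch_size = 8192                 # illustrative
train_steps, mc_runs = 1500, 500  # illustrative

# simulate a fresh batch of Black-Scholes paths in every session run
dW = tf.random_normal([batch_size, d, N], dtype=tf.float32) * np.sqrt(T / N)
W = tf.cumsum(dW, axis=2)
t = tf.constant(np.arange(1, N + 1) * T / N, dtype=tf.float32)
X = s_0 * tf.exp((r - delta - sigma ** 2 / 2.) * t + sigma * W)  # [batch, d, N]

# discounted payoffs g(n, X_n) for n = 1, ..., N
g = tf.exp(-r * t) * tf.nn.relu(tf.reduce_max(X, axis=1) - K)    # [batch, N]

# backward induction: f_N = 1, then decisions for n = N - 1, ..., 1
g_tau = g[:, N - 1]               # payoff of the current stopping rule tau_{n+1}
loss = 0.
for n in range(N - 1, 0, -1):
    logits = neural_net(X[:, :, n - 1], 'decision_{0}'.format(n),
                        [d + 40, d + 40, 1])[:, 0]
    F = tf.nn.sigmoid(logits)              # soft decision F^theta
    f = tf.cast(logits >= 0., tf.float32)  # hard decision f^theta
    loss += -tf.reduce_mean(g[:, n - 1] * F + g_tau * (1. - F))
    # tau-update: stop at n where f = 1, otherwise keep the previous stopping time
    g_tau = g[:, n - 1] * f + g_tau * (1. - f)

With these definitions the optimizer finds trainable variables, and px_mean approximates a lower bound for the option price.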

Below you will find some more helper functions for neural network architectures that we will use later to improve our implementation.

def neural_net_bn(x, name, num_neurons, is_training, 
                  activation_fn=tf.nn.relu, reuse=None, 
                  decay=0.9, var_epsilon=1e-6, dtype=tf.float32):

    def batch_normalization(_x):
        shape = _x.get_shape().as_list()
        beta = tf.get_variable(name='beta', shape=[shape[-1]], dtype=dtype,
                               initializer=init_ops.zeros_initializer())
        gamma = tf.get_variable(name='gamma', shape=[shape[-1]], dtype=dtype,
                                initializer=init_ops.ones_initializer())
        mv_mean = tf.get_variable('mv_mean', [shape[-1]], dtype=dtype,
                                  initializer=init_ops.zeros_initializer(), 
                                  trainable=False)
        mv_var = tf.get_variable('mv_var', [shape[-1]], dtype=dtype,
                                 initializer=init_ops.zeros_initializer(), 
                                 trainable=False)
        mean, variance = tf.nn.moments(_x, [0], name='moments')
        tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, 
                             assign_moving_average(mv_mean, mean, decay, 
                                                   zero_debias=True))
        tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, 
                             assign_moving_average(mv_var, variance, decay, 
                                                   zero_debias=True))
        # batch statistics during training, moving averages at inference time
        mean, variance = utils.smart_cond(is_training, 
                                          lambda: (mean, variance), 
                                          lambda: (mv_mean, mv_var))
        return gamma * (_x - mean) / tf.sqrt(variance + var_epsilon) + beta

    def fc_layer(_x, out_size, activation):
        shape = _x.get_shape().as_list()
        w = tf.get_variable(
            name='weights',
            shape=[shape[-1], out_size],
            dtype=dtype,
            initializer=initializers.xavier_initializer())
        return activation(batch_normalization(tf.matmul(_x, w)))

    with tf.variable_scope(name, reuse=reuse):
        x = batch_normalization(x)
        for i in range(len(num_neurons)):
            with tf.variable_scope('layer_{0}'.format(i)):
                x = fc_layer(x, num_neurons[i],
                             activation_fn if i < len(num_neurons) - 1 else tf.identity)
        return x
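
One caveat when using these batch-normalized networks: the moving-average updates are only collected in tf.GraphKeys.UPDATE_OPS, so they have to be run together with the training op. A sketch of the standard pattern; x, the layer sizes and the loss are stand-ins for the actual model:

x = tf.placeholder(tf.float32, [None, 5])
is_training = tf.placeholder(tf.bool, [])
y = neural_net_bn(x, 'bn_net', [50, 50, 1], is_training)
loss = tf.reduce_mean(tf.square(y))  # placeholder objective

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):  # run the moving-average updates each step
    train_op = tf.train.AdamOptimizer().minimize(loss)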

def neural_net_bn_combined(x, name, num_neurons, is_training, 
                           activation_fn=tf.nn.relu, reuse=None, 
                           decay=0.9, var_epsilon=1e-6, dtype=tf.float32):

    def batch_normalization(_x):
        shape = _x.get_shape().as_list()
        _x = tf.reshape(_x, [-1, shape[1] * shape[2]])
        beta = tf.get_variable(name='beta', 
                               shape=[shape[1] * shape[2]], dtype=dtype,
                               initializer=init_ops.zeros_initializer())
        gamma = tf.get_variable(name='gamma', 
                                shape=[shape[1] * shape[2]], dtype=dtype,
                                initializer=init_ops.ones_initializer())
        mv_mean = tf.get_variable('mv_mean', 
                                  shape=[shape[1] * shape[2]], dtype=dtype,
                                  initializer=init_ops.zeros_initializer(), 
                                  trainable=False)
        mv_var = tf.get_variable('mv_var', 
                                 shape=[shape[1] * shape[2]], 
                                 dtype=dtype,
                                 initializer=init_ops.zeros_initializer(), 
                                 trainable=False)
        mean, variance = tf.nn.moments(_x, [0], name='moments')
        tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, 
                             assign_moving_average(mv_mean, mean, decay, 
                                                   zero_debias=True))
        tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, 
                             assign_moving_average(mv_var, variance, decay, 
                                                   zero_debias=True))
        # batch statistics during training, moving averages at inference time
        mean, variance = utils.smart_cond(is_training, 
                                          lambda: (mean, variance), 
                                          lambda: (mv_mean, mv_var))
        return tf.reshape(gamma * (_x - mean) / tf.sqrt(variance + var_epsilon) + beta, 
                          [-1, shape[1], shape[2]])

    def fc_layer(_x, out_size, activation):
        shape = _x.get_shape().as_list()
        # one weight matrix per slice of the last axis (one per time step)
        w = tf.get_variable(
            name='weights',
            shape=[shape[2], shape[1], out_size],
            dtype=dtype,
            initializer=initializers.xavier_initializer())
        # batched matmul: [N, batch, q] x [N, q, out_size] -> [N, batch, out_size]
        _x = tf.matmul(tf.transpose(_x, [2, 0, 1]), w)
        return activation(batch_normalization(tf.transpose(_x, [1, 2, 0])))

    with tf.variable_scope(name, reuse=reuse):
        x = batch_normalization(x)
        for i in range(len(num_neurons)):
            with tf.variable_scope('layer_{0}'.format(i)):
                x = fc_layer(x, num_neurons[i],
                             activation_fn if i < len(num_neurons) - 1 else tf.identity)
        return x
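
neural_net_bn_combined expects inputs of shape [batch_size, d, N] and, through the batched matmul over the last axis, evaluates one independent sub-network per time step in a single pass. A sketch of a call with illustrative shapes:

x_all = tf.placeholder(tf.float32, [None, 5, 9])  # all states X_1, ..., X_N as [batch, d, N]
is_training = tf.placeholder(tf.bool, [])
logits_all = neural_net_bn_combined(x_all, 'combined', [45, 45, 1], is_training)
# logits_all[:, 0, n] holds the decision logits of the sub-network for time step n + 1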