In the following we will implement a deep learning method for optimal stopping problems based on the article Becker, S., Cheridito, P., and Jentzen, A. Deep optimal stopping. Journal of Machine Learning Research 20, 74 (2015), 1–25 and test it on Bermudan max-call options.
import numpy as np
import tensorflow as tf
from scipy.stats import norm
from tensorflow.python.ops import init_ops
from tensorflow.contrib.layers.python.layers import initializers
from tensorflow.python.training.moving_averages import assign_moving_average
from tensorflow.contrib.layers.python.layers import utils
def neural_net(x, name, num_neurons,
activation_fn=tf.nn.relu, reuse=None, dtype=tf.float32):
def fc_layer(_x, out_size, activation):
shape = _x.get_shape().as_list()
w = tf.get_variable(
name='weights',
shape=[shape[-1], out_size],
dtype=dtype,
initializer=tf.random_uniform_initializer())
b = tf.get_variable(
name='bias',
shape=[1, out_size],
dtype=dtype,
initializer=tf.random_normal_initializer())
return activation(tf.matmul(_x, w) + b)
with tf.variable_scope(name, reuse=reuse):
for i in range(len(num_neurons)):
with tf.variable_scope('layer_{0}'.format(i)):
x = fc_layer(x, num_neurons[i],
activation_fn if i < len(num_neurons) - 1 else tf.identity)
return x
First of all we need to create training samples. In this case this task boils down to the simulation of sample paths of $d$ underlying assets:
We assume that the risk-neutral dynamics of the assets are given by a multi-dimensional Black-Scholes model
$$ S_t^i = s_0 \exp([r - \delta - \sigma^2/2] t + \sigma W_t^i), \quad i = 1, 2, \ldots, d, $$
for initial values $s_0 \in (0,\infty) $, a risk-free interest rate $ r \in \mathbb{R} $, dividend yields $ \delta \in [0, \infty) $, volatilities $ \sigma \in (0, \infty) $ and a $d$-dimensional Brownian motion $W$ with constant instantaneous correlations $p_{ij} = 0$ for $i\neq j$ between different components $W^i$ and $W^j$. In the following we assume a time grid of the form $t_n = nT/N$ and denote $X_n^i = S_{t_n}^i$ for $n=0,1,\ldots,N$. A Bermudan max-call option on $S^1, S^2, \ldots, S^d$ has payoff $(\max_{1\leq i \leq d} S_t^i - K)^+$ and can be exercised at any point of the time grid $0=t_0 < t_1 < \ldots < t_N $. The price can then be written as $\sup_{\tau \in \mathcal{T}} \mathbb{E}g(\tau,X_{\tau})$ for
$$ g(n,x) = e^{-r t_n} \left( \max_{1 \leq i \leq d} x^i - K \right)^+, $$
where $\mathcal{T}$ is the set of $X$-stopping times taking values in $\{t_0, t_1, \ldots, t_N\}$.
In the next step we need to implement the deep optimal stopping method:
The numerical method consists in iteratively approximating optimal stopping decisions $ f_n \colon \mathbb{R}^d \to \{0,1\} $, $ n = 0, 1, \dots, N - 1 $, by a neural network $f^{ \theta } \colon \mathbb{R}^d \to \{ 0, 1 \}$ with parameter $ \theta \in \mathbb{R}^q $. We do this by starting with the terminal stopping decision $ f_N \equiv 1 $ and proceeding by backward induction. More precisely, let $ n \in \{ 0, 1, \dots, N - 1 \} $, and assume parameter values $ \theta_{ n + 1 }, \theta_{ n + 2 }, \dots, \theta_N \in \mathbb{R}^q $ have been found such that $f^{\theta_N} \equiv 1$ and the stopping time
$$ \tau_{n+1} = \sum_{m =n+1}^{N} m f^{\theta_m}(X_m) \prod_{j=n+1}^{m-1} (1-f^{\theta_j}(X_j)) $$
produces an expected value $ \mathbb{E} \, g(\tau_{n+1},X_{\tau_{n+1}}) $. Since $f^{\theta}$ takes values in $\{0,1\}$, it does not directly lend itself to a gradient-based optimization method. So, as an intermediate step, we introduce a feedforward neural network $F^{\theta} \colon \mathbb{R}^d \to (0,1)$ of the form
$$ F^{\theta} = \psi \circ a^{\theta}_I \circ \varphi_{q_{I-1}} \circ a^{\theta}_{I-1} \circ \dots \circ \varphi_{q_1} \circ a^{\theta}_1,$$
where
$I, q_1, q_2, \dots, q_{I-1}$ are positive integers specifying the depth of the network and the number of nodes in the hidden layers,
$a^{\theta}_1 \colon \mathbb{R}^d \to \mathbb{R}^{q_1}, \dots, a^{\theta}_{I-1} \colon \mathbb{R}^{q_{I-2}} \to \mathbb{R}^{q_{I-1}} $ and $a^{\theta}_I \colon \mathbb{R}^{q_{I-1}} \to \mathbb{R}$ are affine functions,
for $j \in \mathbb{N}$, $\varphi_j \colon \mathbb{R}^j \to \mathbb{R}^j$ is the component-wise ReLU activation function given by $\varphi_j(x_1, \dots, x_j) = (x^+_1 , \dots, x^+_j)$
$\psi \colon \mathbb{R} \to (0,1)$ is the standard logistic function $\psi(x) = e^x/(1+ e^x) = 1 / ( 1 + e^{ - x } )$.
The components of the parameter $\theta \in \mathbb{R}^q$ of $F^{\theta}$ consist of the entries of the matrices $A_1 \in \mathbb{R}^{q_1 \times d}, \dots,$ $A_{I-1} \in \mathbb{R}^{q_{I-1} \times q_{I-2}}, A_I \in \mathbb{R}^{1 \times q_{I-1}}$ and the vectors $b_1 \in \mathbb{R}^{q_1}, \dots, b_{I-1} \in \mathbb{R}^{q_{I-1}}, b_I \in \mathbb{R}$ given by the representation of the affine functions
$$a^{\theta}_i(x) = A_i x + b_i, \quad i = 1, \dots, I. $$
Our aim is to determine $\theta_n \in \mathbb{R}^q$ so that
$$ \mathbb{E} \left[g(n,X_n) F^{\theta_n}(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}}) (1- F^{\theta_n}(X_n))\right] $$
is close to the supremum $\sup_{\theta \in \mathbb{R}^q} \mathbb{E} \left[g(n,X_n) F^{\theta}(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}}) (1- F^{\theta}(X_n))\right]$. Once this has been achieved, we define the function $f^{\theta_n} \colon \mathbb{R}^d \to \{0,1\}$ by
$$ f^{\theta_n} = 1_{[0,\infty)} \circ a^{\theta_n}_I \circ \varphi_{q_{I-1}} \circ a^{\theta_n}_{I-1} \circ \dots \circ \varphi_{q_1} \circ a^{\theta_n}_1, $$
where $1_{[0,\infty)} \colon \mathbb{R} \to \{0,1\}$ is the indicator function of $[0,\infty)$. The only difference between $F^{\theta_n}$ and $f^{\theta_n}$ is the final nonlinearity. While $F^{\theta_n}$ produces a stopping probability in $(0,1)$, the output of $f^{\theta_n}$ is a hard stopping decision given by $0$ or $1$, depending on whether $F^{\theta_n}$ takes a value below or above $1/2$.
Some results from our article:
$d$ | $s_0$ | $\hat{L}$ | $t_L$ | $\hat{U}$ | $t_U$ | Point est. | 95% CI |
---|---|---|---|---|---|---|---|
2 | 90 | 8.072 | 28.7 | 8.075 | 25.4 | 8.074 | [8.060, 8.081] |
2 | 100 | 13.895 | 28.7 | 13.903 | 25.3 | 13.899 | [13.880, 13.910] |
2 | 110 | 21.353 | 28.4 | 21.346 | 25.3 | 21.349 | [21.336, 21.354] |
3 | 90 | 11.290 | 28.8 | 11.283 | 26.3 | 11.287 | [11.276, 11.290] |
3 | 100 | 18.690 | 28.9 | 18.691 | 26.4 | 18.690 | [18.673, 18.699] |
3 | 110 | 27.564 | 27.6 | 27.581 | 26.3 | 27.573 | [27.545, 27.591] |
5 | 90 | 16.648 | 27.6 | 16.640 | 28.4 | 16.644 | [16.633, 16.648] |
5 | 100 | 26.156 | 28.1 | 26.162 | 28.3 | 26.159 | [26.138, 26.174] |
5 | 110 | 36.766 | 27.7 | 36.777 | 28.4 | 36.772 | [36.745, 36.789] |
10 | 90 | 26.208 | 30.4 | 26.272 | 33.9 | 26.240 | [26.189, 26.289] |
10 | 100 | 38.321 | 30.5 | 38.353 | 34.0 | 38.337 | [38.300, 38.367] |
10 | 110 | 50.857 | 30.8 | 50.914 | 34.0 | 50.886 | [50.834, 50.937] |
tf.reset_default_graph()
# parameters for the max-call example
T, N, d = 3., 9, 5
r, delta, sigma = 0.05, 0.1, 0.2
K, s_0 = 100., 100.
train_steps, mc_runs = 0, 0
# <-- create training samples, implement the algorithm, and fill in loss and g_tau in a meaningful way
loss = 0.
g_tau = 0.
# bare minimum boilerplate code below
px_mean_batch = tf.reduce_mean(g_tau)
optimizer = tf.train.AdamOptimizer()
train_op = optimizer.minimize(tf.reduce_mean(loss))
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for _ in range(train_steps):
sess.run(train_op)
px_mean = 0.
for _ in range(mc_runs):
px_mean += sess.run(px_mean_batch)
print(px_mean / mc_runs)
Below you will find some more helper functions for neural network architectures that we will use later to improve our implementation.
def neural_net_bn(x, name, num_neurons, is_training,
activation_fn=tf.nn.relu, reuse=None,
decay=0.9, dtype=tf.float32):
def batch_normalization(_x):
shape = _x.get_shape().as_list()
beta = tf.get_variable(name='beta', shape=[shape[-1]], dtype=dtype,
initializer=init_ops.zeros_initializer())
gamma = tf.get_variable(name='gamma', shape=[shape[-1]], dtype=dtype,
initializer=init_ops.ones_initializer())
mv_mean = tf.get_variable('mv_mean', [shape[-1]], dtype=dtype,
initializer=init_ops.zeros_initializer(),
trainable=False)
mv_var = tf.get_variable('mv_var', [shape[-1]], dtype=dtype,
initializer=init_ops.zeros_initializer(),
trainable=False)
mean, variance = tf.nn.moments(_x, [0], name='moments')
tf.add_to_collection(tf.GraphKeys.UPDATE_OPS,
assign_moving_average(mv_mean, mean, decay,
zero_debias=True))
tf.add_to_collection(tf.GraphKeys.UPDATE_OPS,
assign_moving_average(mv_var, variance, decay,
zero_debias=True))
mean, variance = utils.smart_cond(is_training,
lambda: (mean, variance),
lambda: (mv_mean, mv_var))
return gamma * (_x - mean) / (variance + var_epsilon) + beta
def fc_layer(_x, out_size, activation):
shape = _x.get_shape().as_list()
w = tf.get_variable(
name='weights',
shape=[shape[-1], out_size],
dtype=dtype,
initializer=initializers.xavier_initializer())
return activation(batch_normalization(tf.matmul(_x, w)))
with tf.variable_scope(name, reuse=reuse):
x = batch_normalization(x)
for i in range(len(num_neurons)):
with tf.variable_scope('layer_{0}'.format(i)):
x = fc_layer(x, num_neurons[i],
activation_fn if i < len(num_neurons) - 1 else tf.identity)
return x
def neural_net_bn_combined(x, name, num_neurons, is_training,
activation_fn=tf.nn.relu, reuse=None,
decay=0.9, dtype=tf.float32):
def batch_normalization(_x):
shape = _x.get_shape().as_list()
_x = tf.reshape(_x, [-1, shape[1] * shape[2]])
beta = tf.get_variable(name='beta',
shape=[shape[1] * shape[2]], dtype=dtype,
initializer=init_ops.zeros_initializer())
gamma = tf.get_variable(name='gamma',
shape=[shape[1] * shape[2]], dtype=dtype,
initializer=init_ops.ones_initializer())
mv_mean = tf.get_variable('mv_mean',
shape=[shape[1] * shape[2]], dtype=dtype,
initializer=init_ops.zeros_initializer(),
trainable=False)
mv_var = tf.get_variable('mv_var',
shape=[shape[1] * shape[2]],
dtype=dtype,
initializer=init_ops.zeros_initializer(),
trainable=False)
mean, variance = tf.nn.moments(_x, [0], name='moments')
tf.add_to_collection(tf.GraphKeys.UPDATE_OPS,
assign_moving_average(mv_mean, mean, decay,
zero_debias=True))
tf.add_to_collection(tf.GraphKeys.UPDATE_OPS,
assign_moving_average(mv_var, variance, decay,
zero_debias=True))
mean, variance = utils.smart_cond(is_training,
lambda: (mean, variance),
lambda: (mv_mean, mv_var))
return tf.reshape(gamma * (_x - mean) / (variance + var_epsilon) + beta,
[-1, shape[1], shape[2]])
def fc_layer(_x, out_size, activation):
shape = _x.get_shape().as_list()
w = tf.get_variable(
name='weights',
shape=[shape[2], shape[1], out_size],
dtype=dtype,
initializer=initializers.xavier_initializer())
_x = tf.matmul(tf.transpose(_x, [2, 0, 1]), w)
return activation(batch_normalization(tf.transpose(_x, [1, 2, 0])))
with tf.variable_scope(name, reuse=reuse):
x = batch_normalization(x)
for i in range(len(num_neurons)):
with tf.variable_scope('layer_{0}'.format(i)):
x = fc_layer(x, num_neurons[i],
activation_fn if i < len(num_neurons) - 1 else tf.identity)
return x