We shall in the sequel take an unusual (but familiar to many researchers) point of view towards deep neural networks. Consider a controlled ordinary differential equation (ODE), i.e. $$ (\ast) \qquad dX_t = \sum_{i=1}^d V_i(X_t) du^i(t), \; X_0 = x $$ where $ V_i $ are some smooth vector fields on $ \mathbb{R}^m $ with all proper derivatives bounded. $ u $ is a finite variation curve with values in $ \mathbb{R}^d $. For simplicity we shall always consider finitely many jumps of $ u $ in this lecture, and $ u $ being $ C^\infty$ between two jumps.
A solution of this equation is given by a fixed point of $$ X_t = x + \sum_{i=1}^d \int_0^t V_i(X_s) du^i(s) $$ for $ t \geq 0 $. Notice that the integral is well defined due to the finite variation property.
We shall analyze two aspects in the sequel:
For the first aspect consider vector fields of the form $$ V(x) = \sum_{j=1}^N \alpha_j \phi(\langle \mu_j , x \rangle + \beta_j) $$ for some vectors $ \alpha_j, \mu_j \in \mathbb{R}^m $ and numbers $ \beta_j $. Notice that we could also take vector fields of the form $$ V(x) = W {\phi}(x) + a - x $$ for matrices $ W $ and vectors $ a $ to achive the same result.
In this case the Euler scheme, which approximates the solution of the ODE, is an iterative procedure where maps of the type $$ x \mapsto x + \sum_{i=1}^d V_i(x) \Delta u^i(t) $$ at point $ x \in \mathbb{R}^N $ and $ t \geq 0 $ are concatenated. Obviously one can understand such finite concatenations as deep neural networks with at most $ m + N $ neurons in each layers and depths being equal to the time discretization with respect to the ODE's initial value.
<font color=red>We therefore call the map $ x \mapsto X_t $ a deep neural network of infinite depths. Notice that for piecewise constant $ u $ we just obtain classical neural networks of finite depths.</font>
Notice that one could also consider the Picard-Lindelof iteration scheme to obtain neural networks, then depth would be related to the iteration step.
This obvious point of view has been taken in several papers, most recently in Neural ordinary differential equations. There are three new aspects in our approach: first the control approach, i.e. the use of additional control variables $u^i$, second, that actually every feedforward neural network is this very form, and, third, the obvious question how the choice of the vector fields influences the expressibility of the networks.
For the third aspect consider a training data set $ {(x_l,y_l)}_{l \in L} $, i.e. a subset of a graph of an $ \mathbb{R}^m$-valued continuous function on some compact set. The cardinality of $L$ should be seen as large but finite.
Consider the following ODE on $ \mathbb{R}^{mL} $ $$ dZ_t = \sum_{i=1}^d W_i(Z_t) du^i(t) \, , $$ where $$ (W_i((x_l)))_l:=V_i(x_l) $$ for $ l \in L $ and $ i = 1,\ldots,d $. We actually make a particle system of $ L $ particles moving precisely according to the dynamics of $ (\ast) $. This system also has a unique solution for all times just as equation $(\ast)$.
Representability of the non-linear function on the training set amounts precisely to saying that, for given $ t > 0 $, the particle system $ Z $ can possibly reach at $ t $ any point $ (y_l) $ if starting at $ (x_l) $ for some choice of control $u$.
For such questions the notion of controllability has been introduced, in particular Chow's theorem:
If the Lie brackets of $ (W_i) $ generate at the point $ (x_l) $ the whole space $ \mathbb{R}^{mL} $, then any point $ (y_l) $ can be reached locally. Such systems are called exactly controllable.
There are many different notions of controllability but all are related to Lie brackets on the one hand ('geometric condition' or Hörmander condition) and certain analytic assumptions on the vector fields on the other hand.
Can we reach with our particular vector fields the geometric condition? We consider some illustrative examples of course for $ d \geq 2 $.
Having this theoretical result at hand we know that there are actually 'generic' choices of vector fields (often realized by random choices) which at least locally allow for controllability.
The actual interpretation (which is another universal approximation theorem) is the following: a generic choice of vector fields $ V_i $ of the above controlled system can be realized by random neural networks and a certain number of controls $ u^1,\ldots,u^d$ leads to an infinite depth neural network with $ m + N $ nodes on each layer which can approximate any continuous function on a compact set.
This might help to explain the following curious phenomenon (we remember the standard MNIST classification from last week's lecture 1):
import numpy as np
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
import matplotlib.pyplot as plt
#matplotlib inline # Only use this if using iPython
image_index = np.random.randint(60000) #7777 # You may select anything up to 60,000
print(y_train[image_index]) # The label is printed
plt.imshow(x_train[image_index], cmap='Greys')
plt.show()
# Reshaping the array to 4-dims so that it can work with the Keras API
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)
# Making sure that the values are float so that we can get decimal points after division
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# Normalizing the RGB codes by dividing it to the max RGB value.
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print('Number of images in x_train', x_train.shape[0])
print('Number of images in x_test', x_test.shape[0])
y_train.shape
# Importing the required Keras modules containing model and layers
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D
# Creating a Sequential Model and adding the layers
model = Sequential()
layer1=Conv2D(28, kernel_size=(3,3), input_shape=input_shape)
layer1.trainable=True
model.add(layer1) # (3,3) for mnist
model.add(MaxPooling2D(pool_size=(2, 2))) # (2,2) for mnist
model.add(Flatten()) # Flattening the 2D arrays for fully connected layers
layer2 = Dense(128, activation=tf.nn.relu,kernel_initializer='random_uniform')#glorot_uniform is by default
layer2.trainable=True#False
model.add(layer2)
#layer3 = Dense(2*128, activation=tf.nn.relu,kernel_initializer='random_uniform')
#layer3.trainable=False
#model.add(layer3)
model.add(Dropout(0.2))
model.add(Dense(10,activation=tf.nn.softmax))
model.summary()
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
for i in range(1):
model.fit(x=x_train,y=y_train, epochs=1)
x = model.evaluate(x_test, y_test)
print('\n',x)
from keras import initializers
model = Sequential()
layer1=Dense(128,activation=tf.nn.relu, input_shape=input_shape)#128
layer1.trainable=False
model.add(layer1)
layer2=Dense(32,activation=tf.nn.relu)#128
layer2.trainable=False
model.add(layer2)
#model.add(Flatten())
#layer4=Dense(128,activation=tf.nn.relu)
#layer4.trainable=False
#layer5=ConvexCombination()([layer2,layer4])
for i in range(3):
layer3=Dense(128,activation=tf.nn.relu,kernel_initializer=initializers.random_normal(0,0.1))
layer3.trainable=False
model.add(layer3)
#model.add(MaxPooling2D(pool_size=(2, 2))) # (2,2) for mnist
model.add(Flatten()) # Flattening the 2D arrays for fully connected layers
#for i in range(5):
# layer3 = Dense(128, activation=tf.nn.relu,kernel_initializer='random_uniform')#glorot_uniform is by default
# layer3.trainable=False
# model.add(layer3)
#layer3 = Dense(2*128, activation=tf.nn.relu,kernel_initializer='random_uniform')
#layer3.trainable=False
#model.add(layer3)
#model.add(Dropout(0.2))
#layer4=Dense(100,activation='linear')
#layer4.trainable=False
#model.add(layer4)
model.add(Dense(10,activation=tf.nn.softmax))
model.summary()
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
for i in range(10):
model.fit(x=x_train,y=y_train, epochs=1)
x = model.evaluate(x_test, y_test)
print('\n',x)
We realize: if we replace the well specified architecture by a random one also obtain reasonable training results (even though speed deteriorates). Of course this is an extremely generic implementation, however, it shows the effect.
img_rows=28
img_cols=28
count = 0
for i in range(100):
image_index = np.random.randint(10000)
pred = model.predict(x_test[image_index].reshape(1, img_rows, img_cols, 1))
if pred.argmax() != y_test[image_index]:
plt.imshow(x_test[image_index].reshape(28, 28),cmap='Greys')
plt.show()
print('Prediction',pred.argmax())
print('Grand Truth',y_test[image_index])
count = count + 1
#break
print(count/100)
In order to understand actual training we shall derive the first variation with respect to parameters $ \theta $, i.e. we take the derivative with respect to the parameter $ \theta $, denoted by $ \partial X_T $ of $$ dX^\theta_t = \sum_{i=1}^d V^\theta_i(t,X^\theta_t) du^i(t), \; X_0 = x \; . $$ Notice that we write for convenience an additional time dependence in the vector fields since this is actually an important case. This leads to $$ d \partial X^\theta_t = \sum_{i=1}^d \big( \partial V^\theta_i(t, X^\theta_t) + dV^\theta_i(t, X^\theta_t) \partial X^\theta_t \big) du^i(t), \; \partial X^\theta_0 = 0 \, , $$ which can be solved by first solving the equation for $ X^\theta $ in a forward manner, i.e. $$ X^\theta_t = x + \sum_{i=1}^d \int_0^t V^\theta_i(s,X^\theta_s) du^i(s) \, $$ for $ t \geq 0 $, and observing that the solution of the first variation equation is actually given by an evolution operator $ J_{s,T} $ which satisfies with respect to $s$ $$ d J_{s,T} = - \sum_{i=1}^d dV^\theta_i(s,X^\theta_s) J_{s,T} du^i(s), \; J_{T,T} = \operatorname{id} $$ and the variation of constants formula $$ \partial X^\theta_T = \sum_{i=1}^d \int_0^T J_{s,T} \partial V^\theta_i(s,X^\theta_s) du^i(s) \, . $$
These remarkable formulas can be interpreted in the following way: calculate first $ J_{s,T} $ which is calculated backwards in time along the solution trajectory of the forward process $ X^\theta $ on $[0,t]$. Then integrate the derivatives of $ \partial V^\theta $ transformed by $ J_{s,T} $ to obtain the derivative of $ \partial X_T^\theta $ $$ \partial X^\theta_T = \sum_{i=1}^d \int_0^T J_{s,T} \partial V^\theta_i(s,X^\theta_s) du^i(s) \, . $$ If instead one is interested in the gradient of a loss function $ L(X^\theta_T) $, by $ \partial L(X^\theta_T) = \nabla L(X_T^\theta) \partial X^\theta_T $ the problem is solved, too.
In our approach we do also consider the controls $ u^i $ as variables which can be optimized, hence also the gradients with respect to $ u^i $ are of importance: we calculate the derivative of $ u^i + \epsilon a^i $ for $ \epsilon = 0 $ and obtain $$ d \partial_\epsilon X_T^\theta = \sum_{i=1}^d V^\theta_i(t, X^\theta_t) da^i(t) + dV^\theta_i(t, X^\theta_t) \partial_\epsilon X^\theta_t du^i(t), \; \partial_\epsilon X^\theta_0 = 0 \, , $$ hence a completely analogue situation appears, i.e. $$ \partial_\epsilon X_T^\theta = \sum_{i=1}^d \int_0^t J_{s,T} V^\theta_i(s, X^\theta_s) da^i(s) $$ by variation of constants.
Notice that the surjectivity of the map $ a \mapsto \partial_\epsilon X^\theta_T $ leads precisely to a Hörmander condition for the vector fields $ V_i $.
Let us sum up the previous considerations:
$$ L^{(i)}: x \mapsto W^{(i)}x + a^{(0)} \mapsto \phi (W^{(i)}x + a^{(0)}) $$ Translating this to a controlled ODE needs the following ingredients:
The corresponding neural network at 'time' $ 3 $ is just $$ x \mapsto L^{(0)}(x) \mapsto \phi(W^{(1)}L^{(0)}(x) + a^{(1)}) \, $$ which is a one hidden layer neural network with linear readout.
The parameters $ \theta $ are the matrix weights $ W^{(i)} $ and shifts $ a^{(i)} $.
The backwards first variation can be constructed by looking at the derivatives of $ V(t,.) $ with respect to $ x $ and solving the corresponding equation backwards: at 'time' $T=3$ we have $$ J_{s,3} v =v + 1_{[0,2[}(s)(dL^{(1)}(x)v-v) +1_{[0,1[}(s)(d L^{(0)}(x)v-v) $$ which is nothing than encoding the outer part of the chain rule.
Calculating now the derivative of the network with respect to, e.g., $ a^{(1)} $ amounts then just to calculting the variation of constants integral which gives the correct partial derivatives.
Let us look at this phenomenon in an example following Deep Learning with Keras and Tensorflow. We consider a one hidden layer neural network as above with loss function $ 0.5*(y-\hat{y})^2 $ whose gradient is easily calculated in terms of the scalar product of the error with the respective gradient of $ X^\theta $ with respect to the network parameters.
What is implemented in the sequel is a simple gradient descent (with momentum): let us denote the loss function by $ L(\theta) $, then $$ \theta^{(n+1)} = \theta^{(n)} - N \, \nabla L(\theta^{(n)}) - M \, \nabla L(\theta^{(n-1)}) \, . $$
# Import the required packages
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy
# Display plots in notebook
%matplotlib inline
# Define plot's default figure size
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
import random
random.seed(123)
# calculate a random number where: a <= rand < b
def rand(a, b):
return (b-a)*random.random() + a
#read the datasets
train = pd.read_csv("data/intro_to_ann.csv")
X, y = np.array(train.ix[:,0:2]), np.array(train.ix[:,2])
print(X.shape)
print(y.shape)
#Let's plot the dataset and see how it is
plt.scatter(X[:,0], X[:,1], s=40, c=1-y, cmap=plt.cm.BuGn)
# Make a matrix
def makeMatrix(I, J, fill=0.0):
return np.zeros([I,J])
# our sigmoid function
def sigmoid(x):
#return math.tanh(x)
return 1/(1+np.exp(-x))
# derivative of our sigmoid function, in terms of the output (i.e. y)
def dsigmoid(y):
return y - y**2
# the multilayerperceptron class (with one hidden layer)
class MLP:
def __init__(self, ni, nh, no):
# number of input, hidden, and output nodes
self.ni = ni + 1 # +1 for bias node
self.nh = nh
self.no = no
# activations for nodes -- these are list of respective length
self.ai = [1.0]*self.ni
self.ah = [1.0]*self.nh
self.ao = [1.0]*self.no
# create weights
self.wi = makeMatrix(self.ni, self.nh)
self.wo = makeMatrix(self.nh, self.no)
# set them to random vaules
for i in range(self.ni):
for j in range(self.nh):
self.wi[i][j] = rand(-0.2, 0.2)
for j in range(self.nh):
for k in range(self.no):
self.wo[j][k] = rand(-2.0, 2.0)
# last change in weights for momentum
self.ci = makeMatrix(self.ni, self.nh)
self.co = makeMatrix(self.nh, self.no)
def backPropagate(self, targets, N, M):
if len(targets) != self.no:
print(targets)
raise ValueError('wrong number of target values')
# calculate error terms for output
output_deltas = np.zeros(self.no)
for k in range(self.no):
error = targets[k]-self.ao[k]
output_deltas[k] = dsigmoid(self.ao[k]) * error
# calculate error terms for hidden
hidden_deltas = np.zeros(self.nh)
for j in range(self.nh):
error = 0.0
for k in range(self.no):
error += output_deltas[k]*self.wo[j][k]
hidden_deltas[j] = dsigmoid(self.ah[j]) * error
# update output weights
for j in range(self.nh):
for k in range(self.no):
change = output_deltas[k] * self.ah[j]
self.wo[j][k] += N*change + M*self.co[j][k] # N is the learning rate, M the momentum
self.co[j][k] = change
# update input weights
for i in range(self.ni):
for j in range(self.nh):
change = hidden_deltas[j]*self.ai[i]
self.wi[i][j] += N*change + M*self.ci[i][j]
self.ci[i][j] = change
# calculate error
error = 0.0
for k in range(len(targets)):
error += 0.5*(targets[k]-self.ao[k])**2
return error
def test(self, patterns):
self.predict = np.empty([len(patterns), self.no])
for i, p in enumerate(patterns):
self.predict[i] = self.activate(p)
#self.predict[i] = self.activate(p[0])
def activate(self, inputs): # here the network is constructed as a function
if len(inputs) != self.ni-1:
print(inputs)
raise ValueError('wrong number of inputs')
# input activations
for i in range(self.ni-1):
self.ai[i] = inputs[i]
# hidden activations
for j in range(self.nh):
sum_h = 0.0
for i in range(self.ni):
sum_h += self.ai[i] * self.wi[i][j]
self.ah[j] = sigmoid(sum_h)
# output activations
for k in range(self.no):
sum_o = 0.0
for j in range(self.nh):
sum_o += self.ah[j] * self.wo[j][k]
self.ao[k] = sigmoid(sum_o)
return self.ao[:]
def train(self, patterns, iterations=1000, N=0.5, M=0.1):
# N: learning rate
# M: momentum factor
patterns = list(patterns)
for i in range(iterations):
error = 0.0
for p in patterns:
inputs = p[0]
targets = p[1]
self.activate(inputs)
error += self.backPropagate([targets], N, M)
if i % 5 == 0:
print('error in interation %d : %-.5f' % (i,error))
print('Final training error: %-.5f' % error)
# create a network with two inputs, ten hidden, and one output nodes
ann = MLP(2, 10, 1)
%timeit -n 1 -r 1 ann.train(zip(X,y), iterations=1)
# Helper function to plot a decision boundary.
# This generates the contour plot to show the decision boundary visually
def plot_decision_boundary(nn_model):
# Set min and max values and give it some padding
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = 0.01
# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Predict the function value for the whole gid
nn_model.test(np.c_[xx.ravel(), yy.ravel()])
Z = nn_model.predict
Z[Z>=0.5] = 1
Z[Z<0.5] = 0
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], s=40, c=y, cmap=plt.cm.BuGn)
plot_decision_boundary(ann)
plt.title("Model with 1 iteration")
# create a network with two inputs, ten hidden, and one output nodes
ann = MLP(2, 10, 1)
%timeit -n 1 -r 1 ann.train(zip(X,y), iterations=100)
plot_decision_boundary(ann)
plt.title("Model with 100 iterations")
Several remarks are due here:
In the sequel the same implementation done in Keras.
modelann = Sequential()
layer1=Dense(10,activation='sigmoid', input_shape=(2,))
layer1.trainable=True
modelann.add(layer1)
layer2=Dense(1,activation='sigmoid')
layer2.trainable=True
modelann.add(layer2)
modelann.compile(optimizer='adam',
loss='mse',
metrics=['accuracy'])
modelann.fit(x=X,y=y, epochs=1000)
#x = modelann.evaluate(X, y)
#print('\n',x)
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = 0.01
# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# Predict the function value for the whole gid
Z= modelann.predict(np.c_[xx.ravel(),yy.ravel()])
Z[Z>=0.5]=1
Z[Z<0.5]=0
Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], s=40, c=y, cmap=plt.cm.BuGn)