Universal Approximation - an dynamic point of view¶

We shall in the sequel take an unusual (but familiar to many researchers) point of view towards deep neural networks. Consider a controlled ordinary differential equation (ODE), i.e. $$ (\ast) \qquad dX_t = \sum_{i=1}^d V_i(X_t) du^i(t), \; X_0 = x $$ where $ V_i $ are some smooth vector fields on $ \mathbb{R}^m $ with all proper derivatives bounded. $ u $ is a finite variation curve with values in $ \mathbb{R}^d $. For simplicity we shall always consider finitely many jumps of $ u $ in this lecture, and $ u $ being $ C^\infty$ between two jumps.

A solution of this equation is given by a fixed point of $$ X_t = x + \sum_{i=1}^d \int_0^t V_i(X_s) du^i(s) $$ for $ t \geq 0 $. Notice that the integral is well defined due to the finite variation property.

We shall analyze two aspects in the sequel:

Connection to deep neural networks.
How expressive are those networks.

For the first aspect consider vector fields of the form $$ V(x) = \sum_{j=1}^N \alpha_j \phi(\langle \mu_j , x \rangle + \beta_j) $$ for some vectors $ \alpha_j, \mu_j \in \mathbb{R}^m $ and numbers $ \beta_j $. Notice that we could also take vector fields of the form $$ V(x) = W {\phi}(x) + a - x $$ for matrices $ W $ and vectors $ a $ to achive the same result.

In this case the Euler scheme, which approximates the solution of the ODE, is an iterative procedure where maps of the type $$ x \mapsto x + \sum_{i=1}^d V_i(x) \Delta u^i(t) $$ at point $ x \in \mathbb{R}^N $ and $ t \geq 0 $ are concatenated. Obviously one can understand such finite concatenations as deep neural networks with at most $ m + N $ neurons in each layers and depths being equal to the time discretization with respect to the ODE's initial value.

<font color=red>We therefore call the map $ x \mapsto X_t $ a deep neural network of infinite depths. Notice that for piecewise constant $ u $ we just obtain classical neural networks of finite depths.</font>

Notice that one could also consider the Picard-Lindelof iteration scheme to obtain neural networks, then depth would be related to the iteration step.

This obvious point of view has been taken in several papers, most recently in Neural ordinary differential equations. There are three new aspects in our approach: first the control approach, i.e. the use of additional control variables $u^i$, second, that actually every feedforward neural network is this very form, and, third, the obvious question how the choice of the vector fields influences the expressibility of the networks.

For the third aspect consider a training data set $ {(x_l,y_l)}_{l \in L} $, i.e. a subset of a graph of an $ \mathbb{R}^m$-valued continuous function on some compact set. The cardinality of $L$ should be seen as large but finite.

Consider the following ODE on $ \mathbb{R}^{mL} $ $$ dZ_t = \sum_{i=1}^d W_i(Z_t) du^i(t) \, , $$ where $$ (W_i((x_l)))_l:=V_i(x_l) $$ for $ l \in L $ and $ i = 1,\ldots,d $. We actually make a particle system of $ L $ particles moving precisely according to the dynamics of $ (\ast) $. This system also has a unique solution for all times just as equation $(\ast)$.

Representability of the non-linear function on the training set amounts precisely to saying that, for given $ t > 0 $, the particle system $ Z $ can possibly reach at $ t $ any point $ (y_l) $ if starting at $ (x_l) $ for some choice of control $u$.

For such questions the notion of controllability has been introduced, in particular Chow's theorem:

If the Lie brackets of $ (W_i) $ generate at the point $ (x_l) $ the whole space $ \mathbb{R}^{mL} $, then any point $ (y_l) $ can be reached locally. Such systems are called exactly controllable.

There are many different notions of controllability but all are related to Lie brackets on the one hand ('geometric condition' or Hörmander condition) and certain analytic assumptions on the vector fields on the other hand.

Can we reach with our particular vector fields the geometric condition? We consider some illustrative examples of course for $ d \geq 2 $.

Take a one point training set and choose $ V_i(x) = A_i x $ to be linear vector fields, e.g. with respect to a standard normal distribution on the matrix elements. Then the geometric condition (Hörmander condition) is satisfied at any point $ x \neq 0 $ with probability $1$.
Take a one point traning set and choose $V_i $ one layer neural networks with randomly choosen directions $ \mu_j $ and activation functions with non-vanishing derivative at all points, then the geometric condition is satisfied.
Take a general $ \operatorname{card}(L) $ point training set and consider $ d = m \times \operatorname{card} $ control directions and vector fields. Assume that for every point $ x_l $ in the training set there are $m$ linearly independent vector fields. Then the geometric condition is satisfied.
We consider this question also in an infinite dimensional situation, for instance on the torus: for every $ m \geq 1 $ for \emph{every} $ d \geq 2 $ there are vector fields $ V_i $ on the torus such that the transport operators $$ V_i f(x) = d f(x) \cdot V(x) $$ on $L^2(\mathbb{T}^m)$ generate with their Lie brackets a dense subspace at any non-constant Fourier polynomial $ f \in L^2(\mathbb{T}^m) $.

Having this theoretical result at hand we know that there are actually 'generic' choices of vector fields (often realized by random choices) which at least locally allow for controllability.

The actual interpretation (which is another universal approximation theorem) is the following: a generic choice of vector fields $ V_i $ of the above controlled system can be realized by random neural networks and a certain number of controls $ u^1,\ldots,u^d$ leads to an infinite depth neural network with $ m + N $ nodes on each layer which can approximate any continuous function on a compact set.

This might help to explain the following curious phenomenon (we remember the standard MNIST classification from last week's lecture 1):

import numpy as np
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

import matplotlib.pyplot as plt
#matplotlib inline # Only use this if using iPython
image_index = np.random.randint(60000) #7777 # You may select anything up to 60,000
print(y_train[image_index]) # The label is printed
plt.imshow(x_train[image_index], cmap='Greys')
plt.show()

2

# Reshaping the array to 4-dims so that it can work with the Keras API
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)
# Making sure that the values are float so that we can get decimal points after division
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# Normalizing the RGB codes by dividing it to the max RGB value.
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print('Number of images in x_train', x_train.shape[0])
print('Number of images in x_test', x_test.shape[0])
y_train.shape

x_train shape: (60000, 28, 28, 1)
Number of images in x_train 60000
Number of images in x_test 10000

(60000,)

# Importing the required Keras modules containing model and layers
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D
# Creating a Sequential Model and adding the layers
model = Sequential()
layer1=Conv2D(28, kernel_size=(3,3), input_shape=input_shape)
layer1.trainable=True
model.add(layer1) # (3,3) for mnist
model.add(MaxPooling2D(pool_size=(2, 2))) # (2,2) for mnist
model.add(Flatten()) # Flattening the 2D arrays for fully connected layers
layer2 = Dense(128, activation=tf.nn.relu,kernel_initializer='random_uniform')#glorot_uniform is by default
layer2.trainable=True#False
model.add(layer2)
#layer3 = Dense(2*128, activation=tf.nn.relu,kernel_initializer='random_uniform')
#layer3.trainable=False
#model.add(layer3)
model.add(Dropout(0.2))
model.add(Dense(10,activation=tf.nn.softmax))

Using TensorFlow backend.

WARNING:tensorflow:From /scratch/users/jteichma/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1205: calling reduce_prod (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 26, 26, 28)        280       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 28)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 4732)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 128)               605824    
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1290      
=================================================================
Total params: 607,394
Trainable params: 607,394
Non-trainable params: 0
_________________________________________________________________

model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])
for i in range(1):
    model.fit(x=x_train,y=y_train, epochs=1)
    x = model.evaluate(x_test, y_test)
    print('\n',x)

WARNING:tensorflow:From /scratch/users/jteichma/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1290: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /scratch/users/jteichma/.local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1154: calling reduce_max (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Epoch 1/1
60000/60000 [==============================] - 45s - loss: 0.2034 - acc: 0.9383    
 9952/10000 [============================>.] - ETA: 0s
 [0.07407166879139841, 0.9771]

from keras import initializers

model = Sequential()
layer1=Dense(128,activation=tf.nn.relu, input_shape=input_shape)#128
layer1.trainable=False
model.add(layer1)
layer2=Dense(32,activation=tf.nn.relu)#128
layer2.trainable=False
model.add(layer2)
#model.add(Flatten())
#layer4=Dense(128,activation=tf.nn.relu)
#layer4.trainable=False
#layer5=ConvexCombination()([layer2,layer4])
for i in range(3):
    layer3=Dense(128,activation=tf.nn.relu,kernel_initializer=initializers.random_normal(0,0.1))   
    layer3.trainable=False
    model.add(layer3)
#model.add(MaxPooling2D(pool_size=(2, 2))) # (2,2) for mnist
model.add(Flatten()) # Flattening the 2D arrays for fully connected layers
#for i in range(5):
#    layer3 = Dense(128, activation=tf.nn.relu,kernel_initializer='random_uniform')#glorot_uniform is by default
#    layer3.trainable=False
#    model.add(layer3)
#layer3 = Dense(2*128, activation=tf.nn.relu,kernel_initializer='random_uniform')
#layer3.trainable=False
#model.add(layer3)
#model.add(Dropout(0.2))
#layer4=Dense(100,activation='linear')
#layer4.trainable=False
#model.add(layer4)
model.add(Dense(10,activation=tf.nn.softmax))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_116 (Dense)            (None, 28, 28, 128)       256       
_________________________________________________________________
dense_117 (Dense)            (None, 28, 28, 32)        4128      
_________________________________________________________________
dense_118 (Dense)            (None, 28, 28, 128)       4224      
_________________________________________________________________
dense_119 (Dense)            (None, 28, 28, 128)       16512     
_________________________________________________________________
dense_120 (Dense)            (None, 28, 28, 128)       16512     
_________________________________________________________________
flatten_18 (Flatten)         (None, 100352)            0         
_________________________________________________________________
dense_121 (Dense)            (None, 10)                1003530   
=================================================================
Total params: 1,045,162
Trainable params: 1,003,530
Non-trainable params: 41,632
_________________________________________________________________

model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])
for i in range(10):
    model.fit(x=x_train,y=y_train, epochs=1)
    x = model.evaluate(x_test, y_test)
    print('\n',x)

Epoch 1/1
60000/60000 [==============================] - 394s - loss: 0.4408 - acc: 0.8862   
10000/10000 [==============================] - 52s    

 [0.30022240661382676, 0.9167]
Epoch 1/1
60000/60000 [==============================] - 391s - loss: 0.2980 - acc: 0.9170   
10000/10000 [==============================] - 53s    

 [0.2787766061902046, 0.9214]
Epoch 1/1
60000/60000 [==============================] - 388s - loss: 0.2795 - acc: 0.9210   
10000/10000 [==============================] - 53s    

 [0.2696303952664137, 0.9242]
Epoch 1/1
60000/60000 [==============================] - 384s - loss: 0.2704 - acc: 0.9242   
10000/10000 [==============================] - 51s    

 [0.2721088460594416, 0.9251]
Epoch 1/1
60000/60000 [==============================] - 387s - loss: 0.2647 - acc: 0.9262   
10000/10000 [==============================] - 55s    

 [0.26890463502407075, 0.9235]
Epoch 1/1
60000/60000 [==============================] - 382s - loss: 0.2600 - acc: 0.9276   - ETA: 3s - loss: 0.259
10000/10000 [==============================] - 53s    

 [0.2688384540170431, 0.9259]
Epoch 1/1
60000/60000 [==============================] - 391s - loss: 0.2566 - acc: 0.9287   
10000/10000 [==============================] - 52s    

 [0.2709984438627958, 0.9264]
Epoch 1/1
60000/60000 [==============================] - 391s - loss: 0.2540 - acc: 0.9296   
10000/10000 [==============================] - 53s    

 [0.2688856121197343, 0.926]
Epoch 1/1
60000/60000 [==============================] - 327s - loss: 0.2517 - acc: 0.9305   
 9984/10000 [============================>.] - ETA: 0s
 [0.2685184384942055, 0.9268]
Epoch 1/1
60000/60000 [==============================] - 198s - loss: 0.2500 - acc: 0.9309   
 9984/10000 [============================>.] - ETA: 0s
 [0.266049697086215, 0.926]

We realize: if we replace the well specified architecture by a random one also obtain reasonable training results (even though speed deteriorates). Of course this is an extremely generic implementation, however, it shows the effect.

img_rows=28
img_cols=28
count = 0
for i in range(100):
    image_index = np.random.randint(10000)
    pred = model.predict(x_test[image_index].reshape(1, img_rows, img_cols, 1))
    if pred.argmax() != y_test[image_index]:
        plt.imshow(x_test[image_index].reshape(28, 28),cmap='Greys')
        plt.show()
        print('Prediction',pred.argmax())
        print('Grand Truth',y_test[image_index])
        count = count + 1
        #break
print(count/100)

Prediction 0
Grand Truth 2

Prediction 8
Grand Truth 6

Prediction 7
Grand Truth 2

Prediction 9
Grand Truth 5

Prediction 9
Grand Truth 7

Prediction 3
Grand Truth 5

Prediction 8
Grand Truth 3

Prediction 3
Grand Truth 5

Prediction 8
Grand Truth 3

Prediction 1
Grand Truth 7
0.1

In order to understand actual training we shall derive the first variation with respect to parameters $ \theta $, i.e. we take the derivative with respect to the parameter $ \theta $, denoted by $ \partial X_T $ of $$ dX^\theta_t = \sum_{i=1}^d V^\theta_i(t,X^\theta_t) du^i(t), \; X_0 = x \; . $$ Notice that we write for convenience an additional time dependence in the vector fields since this is actually an important case. This leads to $$ d \partial X^\theta_t = \sum_{i=1}^d \big( \partial V^\theta_i(t, X^\theta_t) + dV^\theta_i(t, X^\theta_t) \partial X^\theta_t \big) du^i(t), \; \partial X^\theta_0 = 0 \, , $$ which can be solved by first solving the equation for $ X^\theta $ in a forward manner, i.e. $$ X^\theta_t = x + \sum_{i=1}^d \int_0^t V^\theta_i(s,X^\theta_s) du^i(s) \, $$ for $ t \geq 0 $, and observing that the solution of the first variation equation is actually given by an evolution operator $ J_{s,T} $ which satisfies with respect to $s$ $$ d J_{s,T} = - \sum_{i=1}^d dV^\theta_i(s,X^\theta_s) J_{s,T} du^i(s), \; J_{T,T} = \operatorname{id} $$ and the variation of constants formula $$ \partial X^\theta_T = \sum_{i=1}^d \int_0^T J_{s,T} \partial V^\theta_i(s,X^\theta_s) du^i(s) \, . $$

These remarkable formulas can be interpreted in the following way: calculate first $ J_{s,T} $ which is calculated backwards in time along the solution trajectory of the forward process $ X^\theta $ on $[0,t]$. Then integrate the derivatives of $ \partial V^\theta $ transformed by $ J_{s,T} $ to obtain the derivative of $ \partial X_T^\theta $ $$ \partial X^\theta_T = \sum_{i=1}^d \int_0^T J_{s,T} \partial V^\theta_i(s,X^\theta_s) du^i(s) \, . $$ If instead one is interested in the gradient of a loss function $ L(X^\theta_T) $, by $ \partial L(X^\theta_T) = \nabla L(X_T^\theta) \partial X^\theta_T $ the problem is solved, too.

In our approach we do also consider the controls $ u^i $ as variables which can be optimized, hence also the gradients with respect to $ u^i $ are of importance: we calculate the derivative of $ u^i + \epsilon a^i $ for $ \epsilon = 0 $ and obtain $$ d \partial_\epsilon X_T^\theta = \sum_{i=1}^d V^\theta_i(t, X^\theta_t) da^i(t) + dV^\theta_i(t, X^\theta_t) \partial_\epsilon X^\theta_t du^i(t), \; \partial_\epsilon X^\theta_0 = 0 \, , $$ hence a completely analogue situation appears, i.e. $$ \partial_\epsilon X_T^\theta = \sum_{i=1}^d \int_0^t J_{s,T} V^\theta_i(s, X^\theta_s) da^i(s) $$ by variation of constants.

Notice that the surjectivity of the map $ a \mapsto \partial_\epsilon X^\theta_T $ leads precisely to a Hörmander condition for the vector fields $ V_i $.

Let us sum up the previous considerations:

We can understand deep neural networks as flows of controlled ordinary differential equations.
Backpropagation appears as the classical forward-backward calculation of first variations.
Variations with respect to the control can also be calculated in a forward-backward manner.
Since gradients can be efficiently calculated we can set up search algorithms stochastic gradient descent, simulated annealing, etc.

Let us consider this for one layer of a neural network:¶

$$ L^{(i)}: x \mapsto W^{(i)}x + a^{(0)} \mapsto \phi (W^{(i)}x + a^{(0)}) $$ Translating this to a controlled ODE needs the following ingredients:

a control $u(t)=1_{[1,2[}(t)+2 \, 1_{[2,\infty[}(t)$, i.e. we have two jumps by $1$ at $ t = 1 $ and $ t =2 $ of size $1$, and
a time-dependent vector field $$ V(t,x) = 1_{[0,1]}(t)(L^{(0)}(x)-x)+ 1_{]1,\infty]}(t)(L^{(1)}(x) -x ) \, . $$

The corresponding neural network at 'time' $ 3 $ is just $$ x \mapsto L^{(0)}(x) \mapsto \phi(W^{(1)}L^{(0)}(x) + a^{(1)}) \, $$ which is a one hidden layer neural network with linear readout.

The parameters $ \theta $ are the matrix weights $ W^{(i)} $ and shifts $ a^{(i)} $.

The backwards first variation can be constructed by looking at the derivatives of $ V(t,.) $ with respect to $ x $ and solving the corresponding equation backwards: at 'time' $T=3$ we have $$ J_{s,3} v =v + 1_{[0,2[}(s)(dL^{(1)}(x)v-v) +1_{[0,1[}(s)(d L^{(0)}(x)v-v) $$ which is nothing than encoding the outer part of the chain rule.

Calculating now the derivative of the network with respect to, e.g., $ a^{(1)} $ amounts then just to calculting the variation of constants integral which gives the correct partial derivatives.

Let us look at this phenomenon in an example following Deep Learning with Keras and Tensorflow. We consider a one hidden layer neural network as above with loss function $ 0.5*(y-\hat{y})^2 $ whose gradient is easily calculated in terms of the scalar product of the error with the respective gradient of $ X^\theta $ with respect to the network parameters.

What is implemented in the sequel is a simple gradient descent (with momentum): let us denote the loss function by $ L(\theta) $, then $$ \theta^{(n+1)} = \theta^{(n)} - N \, \nabla L(\theta^{(n)}) - M \, \nabla L(\theta^{(n-1)}) \, . $$

# Import the required packages
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy

# Display plots in notebook 
%matplotlib inline
# Define plot's default figure size
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

import random
random.seed(123)

# calculate a random number where:  a <= rand < b
def rand(a, b):
    return (b-a)*random.random() + a

#read the datasets
train = pd.read_csv("data/intro_to_ann.csv")

X, y = np.array(train.ix[:,0:2]), np.array(train.ix[:,2])

print(X.shape)
print(y.shape)

(500, 2)
(500,)

#Let's plot the dataset and see how it is
plt.scatter(X[:,0], X[:,1], s=40, c=1-y, cmap=plt.cm.BuGn)

<matplotlib.collections.PathCollection at 0x7fdd1e2f8048>

# Make a matrix 
def makeMatrix(I, J, fill=0.0):
    return np.zeros([I,J])

# our sigmoid function
def sigmoid(x):
    #return math.tanh(x)
    return 1/(1+np.exp(-x))

# derivative of our sigmoid function, in terms of the output (i.e. y)
def dsigmoid(y):
    return y - y**2

# the multilayerperceptron class (with one hidden layer)

class MLP:
    def __init__(self, ni, nh, no):
        # number of input, hidden, and output nodes
        self.ni = ni + 1 # +1 for bias node
        self.nh = nh
        self.no = no

        # activations for nodes -- these are list of respective length
        self.ai = [1.0]*self.ni
        self.ah = [1.0]*self.nh
        self.ao = [1.0]*self.no
        
        # create weights
        self.wi = makeMatrix(self.ni, self.nh)
        self.wo = makeMatrix(self.nh, self.no)
        
        # set them to random vaules
        for i in range(self.ni):
            for j in range(self.nh):
                self.wi[i][j] = rand(-0.2, 0.2)
        for j in range(self.nh):
            for k in range(self.no):
                self.wo[j][k] = rand(-2.0, 2.0)

        # last change in weights for momentum   
        self.ci = makeMatrix(self.ni, self.nh)
        self.co = makeMatrix(self.nh, self.no)
        

    def backPropagate(self, targets, N, M):
        
        if len(targets) != self.no:
            print(targets)
            raise ValueError('wrong number of target values')

        # calculate error terms for output
        output_deltas = np.zeros(self.no)
        for k in range(self.no):
            error = targets[k]-self.ao[k] 
            output_deltas[k] = dsigmoid(self.ao[k]) * error

        # calculate error terms for hidden
        hidden_deltas = np.zeros(self.nh)
        for j in range(self.nh):
            error = 0.0
            for k in range(self.no):
                error += output_deltas[k]*self.wo[j][k]
            hidden_deltas[j] = dsigmoid(self.ah[j]) * error

        # update output weights
        for j in range(self.nh):
            for k in range(self.no):
                change = output_deltas[k] * self.ah[j]
                self.wo[j][k] += N*change + M*self.co[j][k] # N is the learning rate, M the momentum
                self.co[j][k] = change

        # update input weights
        for i in range(self.ni):
            for j in range(self.nh):
                change = hidden_deltas[j]*self.ai[i]
                self.wi[i][j] += N*change + M*self.ci[i][j]
                self.ci[i][j] = change

        # calculate error
        error = 0.0
        for k in range(len(targets)):
            error += 0.5*(targets[k]-self.ao[k])**2
        return error


    def test(self, patterns):
        self.predict = np.empty([len(patterns), self.no])
        for i, p in enumerate(patterns):
            self.predict[i] = self.activate(p)
            #self.predict[i] = self.activate(p[0])
            
    def activate(self, inputs): # here the network is constructed as a function
        
        if len(inputs) != self.ni-1:
            print(inputs)
            raise ValueError('wrong number of inputs')

        # input activations
        for i in range(self.ni-1):
            self.ai[i] = inputs[i]

        # hidden activations
        for j in range(self.nh):
            sum_h = 0.0
            for i in range(self.ni):
                sum_h += self.ai[i] * self.wi[i][j]
            self.ah[j] = sigmoid(sum_h)

        # output activations
        for k in range(self.no):
            sum_o = 0.0
            for j in range(self.nh):
                sum_o += self.ah[j] * self.wo[j][k]
            self.ao[k] = sigmoid(sum_o)

        return self.ao[:]
    

    def train(self, patterns, iterations=1000, N=0.5, M=0.1):
        # N: learning rate
        # M: momentum factor
        patterns = list(patterns)
        for i in range(iterations):
            error = 0.0
            for p in patterns:
                inputs = p[0]
                targets = p[1]
                self.activate(inputs)
                error += self.backPropagate([targets], N, M)
            if i % 5 == 0:
                print('error in interation %d : %-.5f' % (i,error))
            print('Final training error: %-.5f' % error)

# create a network with two inputs, ten hidden, and one output nodes
ann = MLP(2, 10, 1)

%timeit -n 1 -r 1 ann.train(zip(X,y), iterations=1)

error in interation 0 : 31.50196
Final training error: 31.50196
91.5 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# Helper function to plot a decision boundary.
# This generates the contour plot to show the decision boundary visually
def plot_decision_boundary(nn_model):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), 
                         np.arange(y_min, y_max, h))
    # Predict the function value for the whole gid
    nn_model.test(np.c_[xx.ravel(), yy.ravel()])
    Z = nn_model.predict
    Z[Z>=0.5] = 1
    Z[Z<0.5] = 0
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], s=40,  c=y, cmap=plt.cm.BuGn)

plot_decision_boundary(ann)
plt.title("Model with 1 iteration")

<matplotlib.text.Text at 0x7fdd1e24f2e8>

# create a network with two inputs, ten hidden, and one output nodes
ann = MLP(2, 10, 1)

%timeit -n 1 -r 1 ann.train(zip(X,y), iterations=100)

error in interation 0 : 29.18252
Final training error: 29.18252
Final training error: 25.14727
Final training error: 24.99525
Final training error: 24.96553
Final training error: 24.94232
error in interation 5 : 24.91374
Final training error: 24.91374
Final training error: 24.88046
Final training error: 24.84468
Final training error: 24.80822
Final training error: 24.77233
error in interation 10 : 24.73777
Final training error: 24.73777
Final training error: 24.70496
Final training error: 24.67407
Final training error: 24.64516
Final training error: 24.61817
error in interation 15 : 24.59302
Final training error: 24.59302
Final training error: 24.56961
Final training error: 24.54783
Final training error: 24.52756
Final training error: 24.50869
error in interation 20 : 24.49111
Final training error: 24.49111
Final training error: 24.47472
Final training error: 24.45942
Final training error: 24.44511
Final training error: 24.43169
error in interation 25 : 24.41908
Final training error: 24.41908
Final training error: 24.40718
Final training error: 24.39591
Final training error: 24.38517
Final training error: 24.37487
error in interation 30 : 24.36491
Final training error: 24.36491
Final training error: 24.35519
Final training error: 24.34558
Final training error: 24.33593
Final training error: 24.32608
error in interation 35 : 24.31578
Final training error: 24.31578
Final training error: 24.30473
Final training error: 24.29249
Final training error: 24.27840
Final training error: 24.26147
error in interation 40 : 24.24005
Final training error: 24.24005
Final training error: 24.21136
Final training error: 24.17036
Final training error: 24.10761
Final training error: 24.00473
error in interation 45 : 23.82632
Final training error: 23.82632
Final training error: 23.50945
Final training error: 22.96422
Final training error: 22.11245
Final training error: 20.95999
error in interation 50 : 19.62322
Final training error: 19.62322
Final training error: 18.25842
Final training error: 16.97667
Final training error: 15.82596
Final training error: 14.81553
error in interation 55 : 13.93791
Final training error: 13.93791
Final training error: 13.17903
Final training error: 12.52233
Final training error: 11.95116
Final training error: 11.45073
error in interation 60 : 11.00907
Final training error: 11.00907
Final training error: 10.61709
Final training error: 10.26800
Final training error: 9.95653
Final training error: 9.67844
error in interation 65 : 9.43003
Final training error: 9.43003
Final training error: 9.20805
Final training error: 9.00956
Final training error: 8.83190
Final training error: 8.67268
error in interation 70 : 8.52974
Final training error: 8.52974
Final training error: 8.40117
Final training error: 8.28527
Final training error: 8.18054
Final training error: 8.08567
error in interation 75 : 7.99952
Final training error: 7.99952
Final training error: 7.92109
Final training error: 7.84953
Final training error: 7.78406
Final training error: 7.72406
error in interation 80 : 7.66893
Final training error: 7.66893
Final training error: 7.61819
Final training error: 7.57140
Final training error: 7.52816
Final training error: 7.48813
error in interation 85 : 7.45100
Final training error: 7.45100
Final training error: 7.41651
Final training error: 7.38440
Final training error: 7.35446
Final training error: 7.32650
error in interation 90 : 7.30033
Final training error: 7.30033
Final training error: 7.27579
Final training error: 7.25275
Final training error: 7.23107
Final training error: 7.21063
error in interation 95 : 7.19134
Final training error: 7.19134
Final training error: 7.17310
Final training error: 7.15582
Final training error: 7.13942
Final training error: 7.12384
7.52 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

plot_decision_boundary(ann)
plt.title("Model with 100 iterations")

<matplotlib.text.Text at 0x7fdd1e1205f8>

Several remarks are due here:

There are different ways to convert time to depth leading to slightly different ways how to implement backpropagation. It is very efficient to have the 'continuous-time' formulas in mind.
It might be also useful to think about controls being themselves trajectories of a stochastic process, for intance a generic semi-martingale.
Take controls being $ d $ Brownian motion (with respect to Stratonovich integration) and assume a Hörmander condition valid for the particle system starting on the training data set, that is the ODE on $ \mathbb{R}^{mL} $ given by $$ dZ_t = \sum_{i=1}^d W_i(Z_t) \circ dB^i(t) \, , $$ where $$ (W_i((x_l)))_l:=V_i(x_l) $$ for $ l \in L $ and $ i = 1,\ldots,d $. Then with probability one a (possibly only small) neighborhood of $ (x_l) $ is hit by the particle system. A random search can exhibit such trajectories and whence solve the supervised learning problem by simulation.
One can also view this as a deterministic mean field control problem, where the controls only depend on the behavior of all particles.

In the sequel the same implementation done in Keras.

modelann = Sequential()
layer1=Dense(10,activation='sigmoid', input_shape=(2,))
layer1.trainable=True
modelann.add(layer1)
layer2=Dense(1,activation='sigmoid')
layer2.trainable=True
modelann.add(layer2)

modelann.compile(optimizer='adam', 
              loss='mse', 
              metrics=['accuracy'])
modelann.fit(x=X,y=y, epochs=1000)
#x = modelann.evaluate(X, y)
#print('\n',x)

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
h = 0.01
# Generate a grid of points with distance h between them
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), 
                    np.arange(y_min, y_max, h))
# Predict the function value for the whole gid
Z= modelann.predict(np.c_[xx.ravel(),yy.ravel()])
Z[Z>=0.5]=1
Z[Z<0.5]=0

Z = Z.reshape(xx.shape)
# Plot the contour and training examples
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
plt.scatter(X[:, 0], X[:, 1], s=40,  c=y, cmap=plt.cm.BuGn)

<matplotlib.collections.PathCollection at 0x7fdd1d81ed30>