The assignments for the Coursera Machine Learning course require you to implement a neural network using the feed-forward and backpropagation algorithms. I didn’t want to submit the assignments in MATLAB/Octave, so I decided to write them up in Python and post them as a blog post. GitHub markdown can’t seem to process LaTeX commands, so the full notebook (containing maths!) this post was based on can be found here: notebook with maths.
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as scipy_io
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from scipy.optimize import fmin_cg
The data
The dataset was provided as part of the assignment. Essentially, it is a set of greyscale images of handwritten digits (0-9), along with the correct classification for each image. The images look like this:
pix_file = '../assignments/machine-learning-ex4/ex4/ex4data1.mat'
mat_in = scipy_io.loadmat(pix_file)
X_in = mat_in['X']
y = np.squeeze(mat_in['y'])
# Split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X_in, y, test_size=0.2)
fig,ax = plt.subplots(5,5, figsize=(5,5))
axs = iter(np.ravel(ax))
for i in np.random.randint(0, 3000, 25):
ax = next(axs)
ax.imshow(X_train[i,:].reshape(20,20).T, cmap='Greys');
ax.set_axis_off();
The neural network
The network I am using for this example consists of 3 layers - an input layer, an output layer, and one hidden layer. The input layer consists of 400 input nodes (one for each pixel in an image; there are actually 401 inputs once the bias term, x_0=1, is included). I’m using 25 nodes (or neurons) in the hidden layer (this also gets a bias term added, a_0^(2)=1), and the output layer consists of 10 nodes - one for each of the classes.
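For orientation, here is a quick sketch (not from the assignment code, just a reference) of the weight array shapes this architecture implies; the names mirror the Theta1 and Theta2 arrays used later.
# Shapes of the weight arrays implied by the 400 -> 25 -> 10 architecture (reference only)
# Theta1 maps the 401 inputs (400 pixels + bias) onto the 25 hidden nodes
# Theta2 maps the 26 hidden values (25 activations + bias) onto the 10 output classes
theta1_shape = (25, 400 + 1)                  # rows: hidden nodes, columns: inputs + bias
theta2_shape = (10, 25 + 1)                   # rows: output classes, columns: hidden nodes + bias
n_weights = 25 * (400 + 1) + 10 * (25 + 1)    # 10285 weights in total, unrolled into one vector later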
Feed-forward
The feed-forward calculation takes the input values and applies the activation function (in this case the sigmoid) using the provided weights (Theta1 and Theta2). These values are then “fed forward” - used as the inputs to the next activation layer. This is implemented in the feedforward() function below.
# Some helper functions
def add_bias_term(a):
    '''Inserts a 1 as the first column of the array'''
m,n = a.shape
b = np.ones([m, n+1])
b[:,1:] = a
return b
def x_to_theta(x, s_1, s_2, s_3):
'''Takes the vector of weights and unravels to the required array shapes'''
Theta1 = x[:s_2*(s_1+1)].reshape([s_2, s_1+1])
Theta2 = x[s_2*(s_1+1):].reshape([s_3, s_2+1])
return Theta1, Theta2
def sigmoid(z):
return 1./(1. + np.exp(-1.*z))
def sigmoidGradient(z):
return sigmoid(z) * (1. - sigmoid(z))
def feedforward(X, n_layers, theta=None):
    '''Performs the feed-forward calculation in the neural net using the sigmoid activation function
inputs:
X: input variables
n_layers: total number of layers in the neural network
        theta: list of length n_layers-1 containing the weights for each neuron for each
            activation layer (2, ..., n_layers)
returns:
        a_outs: list of length n_layers-1 containing the computed activations for each layer (2, ..., n_layers)
'''
a_outs = []
#check for theta
if theta is not None:
# check theta lengths
if len(theta) == n_layers-1:
#calculate the activations for each of the layers
for j in range(n_layers-1):
# add the bias terms
if j==0:
                    # activation is performed on the input X values
a = add_bias_term(X)
else:
                    # activation is performed on the previous layer's activation
a = add_bias_term(a_out)
                # get the weights for this layer
theta_j = theta[j]
# calculate the activation
a_out = sigmoid(np.dot(a, theta_j.T))
a_outs.append(a_out)
return a_outs
else:
            print('theta list is of wrong dimension')
return None
else:
        print('need to provide thetas')
return None
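As a quick check of the function above, it can be called with some small random weights (the _tmp names below are just for illustration, not part of the assignment code); each element of the returned list holds the activations for one layer.
# A minimal usage sketch for feedforward() with random weights (illustrative only)
Theta1_tmp = np.random.rand(25, 401) * 0.02 - 0.01   # [s_2, s_1 + 1]
Theta2_tmp = np.random.rand(10, 26) * 0.02 - 0.01    # [s_3, s_2 + 1]
a_check = feedforward(X_train, n_layers=3, theta=[Theta1_tmp, Theta2_tmp])
print(a_check[0].shape)   # hidden-layer activations, [n_images, 25]
print(a_check[1].shape)   # output-layer activations, [n_images, 10]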
Back-propagation
In order for the neural net to “learn” the optimal weights (or parameters, Theta) for the classification, we need to minimise the cost function over Theta. To do this we need the cost function and its gradient.
Cost function
The cost function we’re using for this neural network is similar to that used for logistic regression, with the addition of a sum over the number of classes. This is implemented in the nn_costfunction() function below.
# functions to use with scipy's fmin_cg
def nn_costfunction(x, *args):
'''Calculates the cost function value for the neural net
inputs:
x: vector containing weights
args: tuple containing:
y: target vector
X_f: input variables/features
n_layers: number of layers in the neural net
k: number of classes
s_1: number of variables for each input
s_2: number of neurons in layer 2
s_3: number of neurons in layer 3
        lamb: lambda weighting for regularisation
return:
J_theta: value of the cost function
TODO: make the code insensitive to the number of layers
'''
y, X_f, n_layers, k, s_1, s_2, s_3, lamb = args
# create an [m, k] y array, where the column index
# corresponding to the class of that row is set to 1
m = y.size
y_exp = np.zeros([m, k])
y_exp[range(m),y-1] = 1
# convert the input x vector to the theta arrays
Theta1, Theta2 = x_to_theta(x, s_1, s_2, s_3)
    # compute the activations using the feedforward function
a_outs = feedforward(X_f, n_layers=n_layers, theta=[Theta1, Theta2])
# the final classification is set to the last entry in a_outs
h_theta_x = a_outs[n_layers - 2]
    h_theta_x[h_theta_x==1] = 0.9999 # just in case. If these are exactly 1 the log fails
    # calculate the cost function
J_theta = np.sum(-1.* y_exp * np.log(h_theta_x) - (1. - y_exp) * np.log(1. - h_theta_x))/float(m)
#include regularization
if lamb is not None:
J_theta = J_theta + (lamb/(2*m)) * (np.sum(Theta1[:,1:]**2) + np.sum(Theta2[:,1:]**2))
return J_theta
Gradient
The gradient is calculated via the method of back-propagation. The basic idea behind this approach is that the error between the actual values and the predicted values (as determined by the cost function) is propagated back through the neural network, starting from the final layer. The weight connecting each node (or neuron) is assigned a portion of the error, based upon the gradient of the cost function.
This is implemented in nn_grad() below. It has not been optimised in any way (I’m pretty sure it can be vectorised to remove the loop), but it works and gives a good idea of what the algorithm is doing.
def nn_grad(x, *args):
'''
    Performs the back-propagation algorithm in order to calculate the partial differential of the
cost function.
inputs:
x: vector containing weights
args: tuple containing:
y: target vector
        X_f: input variables/features
n_layers: number of layers in the neural net
k: number of classes
s_1: number of variables for each input
s_2: number of neurons in layer 2
s_3: number of neurons in layer 3
        lamb: lambda weighting for regularisation
return:
        grad_out: vector of partial derivatives of the cost function with respect to each weight
TODO: make the code insensitive to the number of layers
'''
y, X_f, n_layers, k, s_1, s_2, s_3, lamb = args
    # create an [m, k] y (target) array, where the column index
# corresponding to the class of that row is set to 1
m = y.size
y_exp = np.zeros([m, k])
y_exp[range(m),y-1] = 1
# convert the input x vector to the theta arrays
Theta1, Theta2 = x_to_theta(x, s_1, s_2, s_3)
# initialise the Del arrays (used to store the accumulated cost function gradients)
Del_1 = np.zeros([s_2, s_1+1])
Del_2 = np.zeros([s_3, s_2+1])
# loop over the input variables
for t in range(m):
# this just makes the input variable a [1,s_1] vector instead of a [s_1,]
x_in = np.expand_dims(X_f[t,:], axis=0)
# compute the activations
a_outs = feedforward(x_in, n_layers=n_layers, theta=[Theta1, Theta2])
a_3 = a_outs[1]
a_2 = add_bias_term(a_outs[0])
x_in = add_bias_term(x_in)
# compute the "error" between the predicted values and actual values
delta_3 = a_3 - y_exp[t,:]
# apportion the error due to each of the neurons in layer 2
delta_2 = np.dot(delta_3, Theta2) * a_2 * (1. - a_2)
# calculate and accumulate the gradients
Del_2 += np.dot(delta_3.T, a_2)
Del_1 += np.dot(delta_2[:,1:].T, x_in)
Del_1 = Del_1/m
Del_2 = Del_2/m
# apply regularisation
if lamb is not None:
        Del_1[:,1:] = Del_1[:,1:] + (lamb / m) * Theta1[:,1:]
        Del_2[:,1:] = Del_2[:,1:] + (lamb / m) * Theta2[:,1:]
grad_out = np.zeros(Del_1.size + Del_2.size)
grad_out[:s_2*(s_1+1)] = Del_1.ravel()
grad_out[s_2*(s_1+1):] = Del_2.ravel()
return grad_out
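Before trusting the back-propagation code it is worth checking it numerically. The sketch below is entirely my own (the tiny layer sizes and _c names are arbitrary, chosen so the per-sample loop stays fast); it compares nn_grad against a central finite-difference approximation of nn_costfunction for a few randomly chosen weights.
# A small finite-difference gradient check (a sketch; the tiny layer sizes are arbitrary)
def check_gradient(eps=1e-4, n_check=10):
    s1_c, s2_c, s3_c, k_c = 4, 3, 2, 2          # tiny network: 4 inputs, 3 hidden nodes, 2 classes
    m_c = 5                                      # a handful of random "examples"
    X_c = np.random.rand(m_c, s1_c)
    y_c = np.random.randint(1, k_c + 1, m_c)     # labels in 1..k, matching the y-1 indexing above
    x0 = np.random.rand(s2_c * (s1_c + 1) + s3_c * (s2_c + 1)) * 0.2 - 0.1
    args_c = (y_c, X_c, 3, k_c, s1_c, s2_c, s3_c, 0.1)
    grad_bp = nn_grad(x0, *args_c)               # gradient from back-propagation
    for i in np.random.choice(x0.size, n_check, replace=False):
        # perturb one weight at a time and difference the cost function
        x_plus, x_minus = x0.copy(), x0.copy()
        x_plus[i] += eps
        x_minus[i] -= eps
        grad_fd = (nn_costfunction(x_plus, *args_c) -
                   nn_costfunction(x_minus, *args_c)) / (2. * eps)
        print('weight {0}: backprop {1: .6f}, finite-difference {2: .6f}'.format(i, grad_bp[i], grad_fd))

check_gradient()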
Running the neural network
The neural network is initialised using randomly assigned arrays for initTheta1 and initTheta2, and the scipy routine fmin_cg is used to iterate through and minimise the cost function.
#number of classes
k = 10
# input sizes
s_1 = 400
s_2 = 25
s_3 = 10
# set regularization
lamb = 0.1
# number of layers
n_layers = 3
#arguments that get used in the cost and grad functions
args = (y_train, X_train, n_layers, k, s_1, s_2, s_3, lamb)
# initialise the Theta arrays with some random values
initE = 0.3
# np.random.seed(0)
initTheta1 = np.random.rand(s_2, s_1+1) * 2 * initE - initE
# np.random.seed(0)
initTheta2 = np.random.rand(s_3, s_2+1) * 2 * initE - initE
#put the theta arrays into a vector
x = np.zeros(initTheta1.size + initTheta2.size)
x[:initTheta1.size] = initTheta1.ravel()
x[initTheta1.size:] = initTheta2.ravel()
# print(nn_costfunction(x, *args))
# print(nn_grad(x, *args))
# run the optimisation code
theta_out = fmin_cg(nn_costfunction, x, fprime=nn_grad, args=args, maxiter=400)
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.079758
Iterations: 400
Function evaluations: 1349
Gradient evaluations: 1349
The minimisation function returns the optimal Theta1 and Theta2 arrays. We can then take those values and feed them through the neural network to get h_theta_x, our predicted classification.
Theta1, Theta2 = x_to_theta(theta_out, s_1, s_2, s_3)
a_outs = feedforward(X_test, n_layers=n_layers, theta=[Theta1, Theta2])
h_theta_x = a_outs[1]
#find the most-likely class (maximum value in row), adds 1 because index 0 is class 1, not class 0.
y_pred = np.argmax(h_theta_x, axis=1) + 1
print('Agreement rate {0:2.2f}%'.format(100.0 * np.sum(y_test == y_pred) / float(y_test.size)))
print('Confusion matrix:')
confusion_matrix(y_test, y_pred)
Agreement rate 92.80%
Confusion matrix:
array([[101, 1, 0, 0, 3, 0, 0, 2, 0, 0],
[ 1, 98, 0, 0, 0, 2, 1, 3, 1, 0],
[ 0, 0, 79, 1, 1, 0, 0, 2, 0, 0],
[ 0, 0, 0, 99, 0, 4, 0, 0, 2, 0],
[ 0, 0, 3, 0, 92, 1, 0, 1, 1, 0],
[ 1, 1, 0, 0, 0, 91, 0, 1, 0, 1],
[ 1, 2, 0, 0, 0, 1, 87, 1, 4, 1],
[ 0, 1, 0, 0, 2, 1, 3, 95, 5, 1],
[ 1, 0, 1, 6, 0, 0, 3, 2, 92, 1],
[ 0, 0, 0, 0, 1, 0, 0, 0, 0, 94]])
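The diagonal of the confusion matrix holds the correctly classified counts for each class, so dividing it by the row sums gives a per-class agreement rate. A quick sketch of that calculation (not part of the assignment code):
# Per-class agreement (recall) from the confusion matrix: diagonal / row sums
cm = confusion_matrix(y_test, y_pred)
recall = np.diag(cm) / cm.sum(axis=1).astype(float)
for cls, r in zip(range(1, 11), recall):
    print('class {0}: {1:5.1f}%'.format(cls, 100. * r))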
This shows we get ~93% agreement between the predicted and actual classifications. It’s OK agreement; it might be possible to improve it by increasing the number of neurons or layers, playing with the regularisation value, and running the optimisation for more iterations. But it has been interesting building the neural network, and it has given me a good insight into the mechanics of a basic neural network.