This is my custom extension of one of Andrew NG's neural network from deep learning course where instead of producing 0 or 1 for binary classification I'm attempting
to classify multiple examples.
Both the inputs and outputs are one hot encoded.
With not much training I receive an accuracy of 'train accuracy: 67.51658067499625 %'
How can I classify a single training example instead of classifying all training examples?
I think a bug exists in my implementation as an issue with this network is training examples (train_set_x) and output values (train_set_y) both need to have same dimensions or an error related to the dimensionality of matrices is received.
For example using :
train_set_x = np.array([
[1,1,1,1],[0,1,1,1],[0,0,1,1]
])
train_set_y = np.array([
[1,1,1],[1,1,0],[1,1,1]
])
returns error :
ValueError Traceback (most recent call last)
<ipython-input-11-0d356e8d66f3> in <module>()
27 print(A)
28
---> 29 np.multiply(train_set_y,A)
30
31 def initialize_with_zeros(numberOfTrainingExamples):
ValueError: operands could not be broadcast together with shapes (3,3) (1,4)
network code :
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from scipy import ndimage
import pandas as pd
%matplotlib inline
train_set_x = np.array([
[1,1,1,1],[0,1,1,1],[0,0,1,1]
])
train_set_y = np.array([
[1,1,1,0],[1,1,0,0],[1,1,1,1]
])
numberOfFeatures = 4
numberOfTrainingExamples = 3
def sigmoid(z):
s = 1 / (1 + np.exp(-z))
return s
w = np.zeros((numberOfTrainingExamples , 1))
b = 0
A = sigmoid(np.dot(w.T , train_set_x))
print(A)
np.multiply(train_set_y,A)
def initialize_with_zeros(numberOfTrainingExamples):
w = np.zeros((numberOfTrainingExamples , 1))
b = 0
return w, b
def propagate(w, b, X, Y):
m = X.shape[1]
A = sigmoid(np.dot(w.T , X) + b)
cost = -(1/m)*np.sum(np.multiply(Y,np.log(A)) + np.multiply((1-Y),np.log(1-A)), axis=1)
dw = ( 1 / m ) * np.dot( X, ( A - Y ).T ) # consumes ( A - Y )
db = ( 1 / m ) * np.sum( A - Y ) # consumes ( A - Y ) again
# cost = np.squeeze(cost)
grads = {"dw": dw,
"db": db}
return grads, cost
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = True):
costs = []
for i in range(num_iterations):
grads, cost = propagate(w, b, X, Y)
dw = grads["dw"]
db = grads["db"]
w = w - (learning_rate * dw)
b = b - (learning_rate * db)
if i % 100 == 0:
costs.append(cost)
if print_cost and i % 10000 == 0:
print(cost)
params = {"w": w,
"b": b}
grads = {"dw": dw,
"db": db}
return params, grads, costs
def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False):
w, b = initialize_with_zeros(numberOfTrainingExamples)
parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost = True)
w = parameters["w"]
b = parameters["b"]
Y_prediction_train = sigmoid(np.dot(w.T , X_train) + b)
print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.0001, print_cost = True)
Update: A bug exists in this implementation in that the training example pairs (train_set_x , train_set_y) must contain the same dimensions. Can point in direction of how linear algebra should be modified?
Update 2 :
I modified #Paul Panzer answer so that learning rate is 0.001 and train_set_x , train_set_y pairs are unique :
train_set_x = np.array([
[1,1,1,1,1],[0,1,1,1,1],[0,0,1,1,0],[0,0,1,0,1]
])
train_set_y = np.array([
[1,0,0],[0,0,1],[0,1,0],[1,0,1]
])
grads = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True)
# To classify single training example :
print(sigmoid(dw # [0,0,1,1,0] + db))
This update produces following output :
-2.09657359028
-3.94918577439
[[ 0.74043089 0.32851512 0.14776077 0.77970162]
[ 0.04810012 0.08033521 0.72846174 0.1063849 ]
[ 0.25956911 0.67148488 0.22029838 0.85223923]]
[[1 0 0 1]
[0 0 1 0]
[0 1 0 1]]
train accuracy: 79.84462279013312 %
[[ 0.51309252 0.48853845 0.50945862]
[ 0.5110232 0.48646923 0.50738869]
[ 0.51354109 0.48898712 0.50990734]]
Should print(sigmoid(dw # [0,0,1,1,0] + db)) produce a vector that once rounded matches train_set_y corresponding value : [0,1,0] ?
Modifying to produce a vector with (adding [0,0,1,1,0] to numpy array and taking transpose):
print(sigmoid(dw # np.array([[0,0,1,1,0]]).T + db))
returns :
array([[ 0.51309252],
[ 0.48646923],
[ 0.50990734]])
Again, rounding these values to nearest whole number produces vector [1,0,1] when [0,1,0] is expected.
These are incorrect operations to produce a prediction for single training example ?
Your difficulties come from mismatched dimensions, so let's walk through the problem and try and get them straight.
Your network has a number of inputs, the features, let's call their number N_in (numberOfFeatures in your code). And it has a number of outputs which correspond to different classes let's call their number N_out. Inputs and outputs are connected by the weights w.
Now here is the problem. Connections are all-to-all, so we need a weight for each of the N_out x N_in pairs of outputs and inputs. Therefore in your code the shape of w must be changed to (N_out, N_in). You probably also want an offset b for each output, so b should be a vector of size (N_out,) or rather (N_out, 1) so it plays well with the 2d terms.
I've fixed that in the modified code below and I tried to make it very explicit. I've also thrown a mock data creator into the bargain.
Re the one-hot encoded categorical output, I'm not an expert on neural networks but I think, most people understand it so that classes are mutually exclusive, so each sample in your mock output should have one one and the rest zeros.
Side note:
At one point a competing answer advised you to get rid of the 1-... terms in the cost function. While that looks like an interesting idea to me my gut feeling (Edit Now confirmed using gradient-free minimizer; use activation="hybrid" in code below. Solver will simply maximize all outputs which are active in at least one training example.) is it won't work just like that because the cost will then fail to penalise false positives (see below for detailed explanation). To make it work you'd have to add some kind of regularization. One method that appears to work is using the softmax instead of the sigmoid. The softmax is to one-hot what the sigmoid is to binary. It makes sure the output is "fuzzy one-hot".
Therefore my recommendation is:
If you want to stick with sigmoid and not explicitly enforce one-hot predictions. Keep the 1-... term.
If you want to use the shorter cost function. Enforce one-hot predictions. For example by using softmax instead of sigmoid.
I've added an activation="sigmoid"|"softmax"|"hybrid" parameter to the code that switches between models. I've also made the scipy general purpose minimizer available, which may be useful when the gradient of the cost is not at hand.
Recap on how the cost function works:
The cost is a sum over all classes and all training samples of the term
-y log (y') - (1-y) log (1-y')
where y is the expected response, i.e. the one given by the "y" training sample for the input (the "x" training sample). y' is the prediction, the response the network with its current weights and biases generates. Now, because the expected response is either 0 or 1 the cost for a single category and a single training sample can be written
-log (y') if y = 1
-log(1-y') if y = 0
because in the first case (1-y) is zero, so the second term vanishes and in the secondo case y is zero, so the first term vanishes.
One can now convince oneself that the cost is high if
the expected response y is 1 and the network prediction y' is close to zero
the expected response y is 0 and the network prediction y' is close to one
In other words the cost does its job in punishing wrong predictions. Now, if we drop the second term (1-y) log (1-y') half of this mechanism is gone. If the expected response is 1, a low prediction will still incur a cost, but if the expected response is 0, the cost will be zero, regardless of the prediction, in particular, a high prediction (or false positive) will go unpunished.
Now, because the total cost is a sum over all training samples, there are three possibilities.
all training samples prescribe that the class be zero:
then the cost will be completely independent of the predictions for this class and no learning can take place
some training samples put the class at zero, some at one:
then because "false negatives" or "misses" are still punished but false positives aren't the net will find the easiest way to minimize the cost which is to indiscriminately increase the prediction of the class for all samples
all training samples prescribe that the class be one:
essentially the same as in the second scenario will happen, only here it's no problem, because that is the correct behavior
And finally, why does it work if we use softmax instead of sigmoid? False positives will still be invisible. Now it is easy to see that the sum over all classes of the softmax is one. So I can only increase the prediction for one class if at least one other class is reduced to compensate. In particular, there can be no false positives without a false negative, and the false negative the cost will detect.
On how to get a binary prediction:
For binary expected responses rounding is indeed the appropriate procedure. For one-hot I'd rather find the largest value, set that to one and all others to zero. I've added a convenience function, predict, implementing that.
import numpy as np
from scipy import optimize as opt
from collections import namedtuple
# First, a few structures to keep ourselves organized
Problem_Size = namedtuple('Problem_Size', 'Out In Samples')
Data = namedtuple('Data', 'Out In')
Network = namedtuple('Network', 'w b activation cost gradient most_likely')
def get_dims(Out, In, transpose=False):
"""extract dimensions and ensure everything is 2d
return Data, Dims"""
# gracefully acccept lists etc.
Out, In = np.asanyarray(Out), np.asanyarray(In)
if transpose:
Out, In = Out.T, In.T
# if it's a single sample make sure it's n x 1
Out = Out[:, None] if len(Out.shape) == 1 else Out
In = In[:, None] if len(In.shape) == 1 else In
Dims = Problem_Size(Out.shape[0], *In.shape)
if Dims.Samples != Out.shape[1]:
raise ValueError("number of samples must be the same for Out and In")
return Data(Out, In), Dims
def sigmoid(z):
s = 1 / (1 + np.exp(-z))
return s
def sig_cost(Net, data):
A = process(data.In, Net)
logA = np.log(A)
return -(data.Out * logA + (1-data.Out) * (1-logA)).sum(axis=0).mean()
def sig_grad (Net, Dims, data):
A = process(data.In, Net)
return dict(dw = (A - data.Out) # data.In.T / Dims.Samples,
db = (A - data.Out).mean(axis=1, keepdims=True))
def sig_ml(z):
return np.round(z).astype(int)
def sof_ml(z):
hot = np.argmax(z, axis=0)
z = np.zeros(z.shape, dtype=int)
z[hot, np.arange(len(hot))] = 1
return z
def softmax(z):
z = z - z.max(axis=0, keepdims=True)
z = np.exp(z)
return z / z.sum(axis=0, keepdims=True)
def sof_cost(Net, data):
A = process(data.In, Net)
logA = np.log(A)
return -(data.Out * logA).sum(axis=0).mean()
sof_grad = sig_grad
def get_net(Dims, activation='softmax'):
activation, cost, gradient, ml = {
'sigmoid': (sigmoid, sig_cost, sig_grad, sig_ml),
'softmax': (softmax, sof_cost, sof_grad, sof_ml),
'hybrid': (sigmoid, sof_cost, None, sig_ml)}[activation]
return Network(w=np.zeros((Dims.Out, Dims.In)),
b=np.zeros((Dims.Out, 1)),
activation=activation, cost=cost, gradient=gradient,
most_likely=ml)
def process(In, Net):
return Net.activation(Net.w # In + Net.b)
def propagate(data, Dims, Net):
return Net.gradient(Net, Dims, data), Net.cost(Net, data)
def optimize_no_grad(Net, Dims, data):
def f(x):
Net.w[...] = x[:Net.w.size].reshape(Net.w.shape)
Net.b[...] = x[Net.w.size:].reshape(Net.b.shape)
return Net.cost(Net, data)
x = np.r_[Net.w.ravel(), Net.b.ravel()]
res = opt.minimize(f, x, options=dict(maxiter=10000)).x
Net.w[...] = res[:Net.w.size].reshape(Net.w.shape)
Net.b[...] = res[Net.w.size:].reshape(Net.b.shape)
def optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = True):
w, b = Net.w, Net.b
costs = []
for i in range(num_iterations):
grads, cost = propagate(data, Dims, Net)
dw = grads["dw"]
db = grads["db"]
w -= learning_rate * dw
b -= learning_rate * db
if i % 100 == 0:
costs.append(cost)
if print_cost and i % 10000 == 0:
print(cost)
return grads, costs
def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False, activation='sigmoid'):
data, Dims = get_dims(Y_train, X_train, transpose=True)
Net = get_net(Dims, activation)
if Net.gradient is None:
optimize_no_grad(Net, Dims, data)
else:
grads, costs = optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = True)
Y_prediction_train = process(data.In, Net)
print(Y_prediction_train)
print(data.Out)
print(Y_prediction_train.sum(axis=0))
print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - data.Out)) * 100))
return Net
def predict(In, Net, probability=False):
In = np.asanyarray(In)
is1d = In.ndim == 1
if is1d:
In = In.reshape(-1, 1)
Out = process(In, Net)
if not probability:
Out = Net.most_likely(Out)
if is1d:
Out = Out.reshape(-1)
return Out
def create_data(Dims):
Out = np.zeros((Dims.Out, Dims.Samples), dtype=int)
Out[np.random.randint(0, Dims.Out, (Dims.Samples,)), np.arange(Dims.Samples)] = 1
In = np.random.randint(0, 2, (Dims.In, Dims.Samples))
return Data(Out, In)
train_set_x = np.array([
[1,1,1,1,1],[0,1,1,1,1],[0,0,1,1,0],[0,0,1,0,1]
])
train_set_y = np.array([
[1,0,0],[1,0,0],[0,0,1],[0,0,1]
])
Net1 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='sigmoid')
Net2 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='softmax')
Net3 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='hybrid')
Dims = Problem_Size(8, 100, 50)
data = create_data(Dims)
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='softmax')
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='sigmoid')
Both the idea of how to fix the bug and how you can extend the implementation to classify between more classes can be solved with some dimensionality analysis.
I am assuming that you by classifying multiple examples mean multiple classes and not multiple samples, as we need multiple samples to train even for 2 classes.
Where N = number of samples, D = number of features, K = number of categories(with K=2 being a special case where one can reduce this down to one dimension,ie K=1 with y=0 signifying one class and y=1 the other). The data should have the following dimensions:
X: N * D #input
y: N * K #output
W: D * K #weights, also dW has same dimensions
b: 1 * K #bias, also db has same dimensions
#A should have same dimensions as y
The order of the dimensions can be switched around, as long as the dot products are done correctly.
First dealing with your bug: You are initializing W as N * K instead of D * K ie. in the binary case:
w = np.zeros((numberOfTrainingExamples , 1))
#instead of
w = np.zeros((numberOfFeatures , 1))
This means that the only time you are initializing W to correct dimensions is when y and X (coincidentally) have same dimensions.
This will mess with your dot products as well:
np.dot(X, w) # or np.dot(w.T,X.T) if you define y as [K * N] dimensions
#instead of
np.dot(w.T , X)
and
np.dot( X.T, ( A - Y ) ) #np.dot( X.T, ( A - Y ).T ) if y:[K * N]
#instead of
np.dot( X, ( A - Y ).T )
Also make sure that the cost function returns one number (ie. not an array).
Secondly going on to K>2 you need to make some changes. b is no longer a single number, but a vector (1D-array). y and W go from being 1D-array to 2D array. To avoid confusion and hard-to-find bugs it could be good to set K, N and D to different values
Related
I am trying to build a simple neural network class from scratch using numpy, and test it using the XOR problem. But the backpropagation function (backprop) does not seem to be working correctly.
In the class, I construct instances by passing in the size of each layer, and the activation functions to use at each layer. I assume that the final activation function is softmax, so that I can calculate the derivative of cross-entropy loss wrt to Z of the last layer. I also do not have a separate set of bias matrices in my class. I just include them in the weight matrices as an extra column at the end.
I know that my backprop function is not working correctly, because the neural network does not ever converge on a somewhat correct output. I also created a numerical gradient function, and when comparing the results of both. I get drastically different numbers.
My understanding from what I have read is that the delta values of each layer (with L being the last layer, and i representing any other layer) should be:
And the respective gradients/weight-update of those layers should be:
Where * is the hardamard product, a represents the activation of some layer, and z represents the nonactivated output of some layer.
The sample data that I am using to test this is at the bottom of the file.
This is my first time trying to implement the backpropagation algorithm from scratch. So I am a bit lost on where to go from here.
import numpy as np
def sigmoid(n, deriv=False):
if deriv:
return np.multiply(n, np.subtract(1, n))
return 1 / (1 + np.exp(-n))
def softmax(X, deriv=False):
if not deriv:
exps = np.exp(X - np.max(X))
return exps / np.sum(exps)
else:
raise Error('Unimplemented')
def cross_entropy(y, p, deriv=False):
"""
when deriv = True, returns deriv of cost wrt z
"""
if deriv:
ret = p - y
return ret
else:
p = np.clip(p, 1e-12, 1. - 1e-12)
N = p.shape[0]
return -np.sum(y*np.log(p))/(N)
class NN:
def __init__(self, layers, activations):
"""random initialization of weights/biases
NOTE - biases are built into the standard weight matrices by adding an extra column
and multiplying it by one in every layer"""
self.activate_fns = activations
self.weights = [np.random.rand(layers[1], layers[0]+1)]
for i in range(1, len(layers)):
if i != len(layers)-1:
self.weights.append(np.random.rand(layers[i+1], layers[i]+1))
for j in range(layers[i+1]):
for k in range(layers[i]+1):
if np.random.rand(1,1)[0,0] > .5:
self.weights[-1][j,k] = -self.weights[-1][j,k]
def ff(self, X, get_activations=False):
"""Feedforward"""
activations, zs = [], []
for activate, w in zip(self.activate_fns, self.weights):
X = np.vstack([X, np.ones((1, 1))]) # adding bias
z = w.dot(X)
X = activate(z)
if get_activations:
zs.append(z)
activations.append(X)
return (activations, zs) if get_activations else X
def grad_descent(self, data, epochs, learning_rate):
"""gradient descent
data - list of 2 item tuples, the first item being an input, and the second being its label"""
grad_w = [np.zeros_like(w) for w in self.weights]
for _ in range(epochs):
for x, y in data:
grad_w = [n+o for n, o in zip(self.backprop(x, y), grad_w)]
self.weights = [w-(learning_rate/len(data))*gw for w, gw in zip(self.weights, grad_w)]
def backprop(self, X, y):
"""perfoms backprop for one layer of a NN with softmax/cross_entropy output layer"""
(activations, zs) = self.ff(X, True)
activations.insert(0, X)
deltas = [0 for _ in range(len(self.weights))]
grad_w = [0 for _ in range(len(self.weights))]
deltas[-1] = cross_entropy(y, activations[-1], True) # assumes output activation is softmax
grad_w[-1] = np.dot(deltas[-1], np.vstack([activations[-2], np.ones((1, 1))]).transpose())
for i in range(len(self.weights)-2, -1, -1):
deltas[i] = np.dot(self.weights[i+1][:, :-1].transpose(), deltas[i+1]) * self.activate_fns[i](zs[i], True)
grad_w[i] = np.hstack((np.dot(deltas[i], activations[max(0, i-1)].transpose()), deltas[i]))
# check gradient
num_gw = self.gradient_check(X, y, i)
print('numerical:', num_gw, '\nanalytic:', grad_w)
return grad_w
def gradient_check(self, x, y, i, epsilon=1e-4):
"""Numerically calculate the gradient in order to check analytical correctness"""
grad_w = [np.zeros_like(w) for w in self.weights]
for w, gw in zip(self.weights, grad_w):
for j in range(w.shape[0]):
for k in range(w.shape[1]):
w[j,k] += epsilon
out1 = cross_entropy(self.ff(x), y)
w[j,k] -= 2*epsilon
out2 = cross_entropy(self.ff(x), y)
gw[j,k] = np.float64(out1 - out2) / (2*epsilon)
w[j,k] += epsilon # return weight to original value
return grad_w
##### TESTING #####
X = [np.array([[0],[0]]), np.array([[0],[1]]), np.array([[1],[0]]), np.array([[1],[1]])]
y = [np.array([[1], [0]]), np.array([[0], [1]]), np.array([[0], [1]]), np.array([[1], [0]])]
data = []
for x, t in zip(X, y):
data.append((x, t))
def nn_test():
c = NN([2, 2, 2], [sigmoid, sigmoid, softmax])
c.grad_descent(data, 100, .01)
for x in X:
print(c.ff(x))
nn_test()
UPDATE: I found one small bug in the code, but it still does not converge correctly. I calculated/derived the gradients for both matrices by hand and found no errors in my implementation, so I still do not know what is wrong with it.
UPDATE #2: I created a procedural version of what I was using above with the following code. Upon testing I discovered that the NN was able to learn the correct weights for classifying each of the 4 cases in XOR separately, but when I try to train using all the training examples at once (as shown), the resultant weights almost always output something around .5 for both output nodes. Could someone please tell me why this is occurring?
X = [np.array([[0],[0]]), np.array([[0],[1]]), np.array([[1],[0]]), np.array([[1],[1]])]
y = [np.array([[1], [0]]), np.array([[0], [1]]), np.array([[0], [1]]), np.array([[1], [0]])]
weights = [np.random.rand(2, 3) for _ in range(2)]
for _ in range(1000):
for i in range(4):
#Feedforward
a0 = X[i]
z0 = weights[0].dot(np.vstack([a0, np.ones((1, 1))]))
a1 = sigmoid(z0)
z1 = weights[1].dot(np.vstack([a1, np.ones((1, 1))]))
a2 = softmax(z1)
# print('output:', a2, '\ncost:', cross_entropy(y[i], a2))
#backprop
del1 = cross_entropy(y[i], a2, True)
dcdw1 = del1.dot(np.vstack([a1, np.ones((1, 1))]).T)
del0 = weights[1][:, :-1].T.dot(del1)*sigmoid(z0, True)
dcdw0 = del0.dot(np.vstack([a0, np.ones((1, 1))]).T)
weights[0] -= .03*weights[0]*dcdw0
weights[1] -= .03*weights[1]*dcdw1
i = 0
a0 = X[i]
z0 = weights[0].dot(np.vstack([a0, np.ones((1, 1))]))
a1 = sigmoid(z0)
z1 = weights[1].dot(np.vstack([a1, np.ones((1, 1))]))
a2 = softmax(z1)
print(a2)
Softmax doesn't look right
Using cross entropy loss, the derivative for softmax is really nice (assuming you are using a 1 hot vector, where "1 hot" essentially means an array of all 0's except for a single 1, ie: [0,0,0,0,0,0,1,0,0])
For node y_n it ends up being y_n-t_n. So for a softmax with output:
[0.2,0.2,0.3,0.3]
And desired output:
[0,1,0,0]
The gradient at each of the softmax nodes is:
[0.2,-0.8,0.3,0.3]
It looks as if you are subtracting 1 from the entire array. The variable names aren't very clear, so if you could possibly rename them from L to what L represents, such as output_layer I'd be able to help more.
Also, for the other layers just to clear things up. When you say a^(L-1) as an example, do you mean "a to the power of (l-1)" or do you mean "a xor (l-1)"? Because in python ^ means xor.
EDIT:
I used this code and found the strange matrix dimensions (modified at line 69 in the function backprop)
deltas = [0 for _ in range(len(self.weights))]
grad_w = [0 for _ in range(len(self.weights))]
deltas[-1] = cross_entropy(y, activations[-1], True) # assumes output activation is softmax
print(deltas[-1].shape)
grad_w[-1] = np.dot(deltas[-1], np.vstack([activations[-2], np.ones((1, 1))]).transpose())
print(self.weights[-1].shape)
print(activations[-2].shape)
exit()
I have a following loop where I am calculating softmax transform for batches of different sizes as below
import numpy as np
def softmax(Z,arr):
"""
:param Z: numpy array of any shape (output from hidden layer)
:param arr: numpy array of any shape (start, end)
:return A: output of multinum_logit(Z,arr), same shape as Z
:return cache: returns Z as well, useful during back propagation
"""
A = np.zeros(Z.shape)
for i in prange(len(arr)):
shiftx = Z[:,arr[i,1]:arr[i,2]+1] - np.max(Z[:,int(arr[i,1]):int(arr[i,2])+1])
A[:,arr[i,1]:arr[i,2]+1] = np.exp(shiftx)/np.exp(shiftx).sum()
cache = Z
return A,cache
Since this for loop is not vectorized it is the bottleneck in my code. What is a possible solution to make it faster. I have tried using #jit of numba which makes it little faster but not enough. I was wondering if there is another way to make it faster or vectorize/parallelize it.
Sample input data for the function
Z = np.random.random([1,10000])
arr = np.zeros([100,3])
arr[:,0] = 1
temp = int(Z.shape[1]/arr.shape[0])
for i in range(arr.shape[0]):
arr[i,1] = i*temp
arr[i,2] = (i+1)*temp-1
arr = arr.astype(int)
EDIT:
I forgot to stress here that my number of class is varying. For example batch 1 has say 10 classes, batch 2 may have 15 classes. Therefore I am passing an array arr which keeps track of the which rows belong to batch1 and so on. These batches are different than the batches in traditional neural network framework
In the above example arr keeps track of starting index and end index of rows. So the denominator in the softmax function will be sum of only those observations whose index lie between the starting and ending index.
Here's a vectorized softmax function. It's the implementation of an assignment from Stanford's cs231n course on conv nets.
The function takes in optimizable parameters, input data, targets, and a regularizer. (You can ignore the regularizer as that references another class exclusive to some cs231n assignments).
It returns a loss and gradients of the parameters.
def softmax_loss_vectorized(W, X, y, reg):
"""
Softmax loss function, vectorized version.
Inputs and outputs are the same as softmax_loss_naive.
"""
# Initialize the loss and gradient to zero.
loss = 0.0
dW = np.zeros_like(W)
num_train = X.shape[0]
scores = X.dot(W)
shift_scores = scores - np.amax(scores,axis=1).reshape(-1,1)
softmax = np.exp(shift_scores)/np.sum(np.exp(shift_scores), axis=1).reshape(-1,1)
loss = -np.sum(np.log(softmax[range(num_train), list(y)]))
loss /= num_train
loss += 0.5* reg * np.sum(W * W)
dSoftmax = softmax.copy()
dSoftmax[range(num_train), list(y)] += -1
dW = (X.T).dot(dSoftmax)
dW = dW/num_train + reg * W
return loss, dW
For comparison's sake, here is a naive (non-vectorized) implementation of the same method.
def softmax_loss_naive(W, X, y, reg):
"""
Softmax loss function, naive implementation (with loops)
Inputs have dimension D, there are C classes, and we operate on minibatches
of N examples.
Inputs:
- W: A numpy array of shape (D, C) containing weights.
- X: A numpy array of shape (N, D) containing a minibatch of data.
- y: A numpy array of shape (N,) containing training labels; y[i] = c means
that X[i] has label c, where 0 <= c < C.
- reg: (float) regularization strength
Returns a tuple of:
- loss as single float
- gradient with respect to weights W; an array of same shape as W
"""
loss = 0.0
dW = np.zeros_like(W)
num_train = X.shape[0]
num_classes = W.shape[1]
for i in xrange(num_train):
scores = X[i].dot(W)
shift_scores = scores - max(scores)
loss_i = -shift_scores[y[i]] + np.log(sum(np.exp(shift_scores)))
loss += loss_i
for j in xrange(num_classes):
softmax = np.exp(shift_scores[j])/sum(np.exp(shift_scores))
if j==y[i]:
dW[:,j] += (-1 + softmax) * X[i]
else:
dW[:,j] += softmax *X[i]
loss /= num_train
loss += 0.5 * reg * np.sum(W * W)
dW /= num_train + reg * W
return loss, dW
Source
I'm trying to code a neural network from scratch in python. To check whether everything works I wanted to overfit the network but the loss seems to explode at first and then comes back to the initial value and stops there (Doesn't converge). I've checked my code and could find the reason. I assume my understanding or implementation of backpropagation is incorrect but there might be some other reason. Can anyone help me out or at least point me in the right direction?
# Initialize weights and biases given dimesnsions (For this example the dimensions are set to [12288, 64, 1])
def initialize_parameters(dims):
# Initiate parameters
parameters = {}
L = len(dims) # Number of layers in the network
# Loop over the given dimensions. Initialize random weights and set biases to zero.
for i in range(1, L):
parameters["W" + str(i)] = np.random.randn(dims[i], dims[i-1]) * 0.01
parameters["b" + str(i)] = np.zeros([dims[i], 1])
return parameters
# Activation Functions
def relu(x, deriv=False):
if deriv:
return 1. * (x > 0)
else:
return np.maximum(0,x)
def sigmoid(x, deriv=False):
if deriv:
return x * (1-x)
else:
return 1/(1 + np.exp(-x))
# Forward and backward pass for 2 layer neural network. (1st relu, 2nd sigmoid)
def forward_backward(X, Y, parameters):
# Array for storing gradients
grads = {}
# Get the length of examples
m = Y.shape[1]
# First layer
Z1 = np.dot(parameters["W1"], X) + parameters["b1"]
A1 = relu(Z1)
# Second layer
Z2 = np.dot(parameters["W2"], A1) + parameters["b2"]
AL = sigmoid(Z2)
# Compute cost
cost = (-1 / m) * np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL)))
# Backpropagation
# Second Layer
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
dZ2 = dAL * sigmoid(AL, deriv=True)
grads["dW2"] = np.dot(dZ2, A1.T) / m
grads["db2"] = np.sum(dZ2, axis=1, keepdims=True) / m
# First layer
dA1 = np.dot(parameters["W2"].T, dZ2)
dZ1 = dA1 * relu(A1, deriv=True)
grads["dW1"] = np.dot(dZ1, X.T)
grads["db1"] = np.sum(dZ1, axis=1, keepdims=True) / m
return AL, grads, cost
# Hyperparameters
dims = [12288, 64, 1]
epoches = 2000
learning_rate = 0.1
# Initialize parameters
parameters = initialize_parameters(dims)
log_list = []
# Train the network
for i in range(epoches):
# Get X and Y
x = np.array(train[0:10],ndmin=2).T
y = np.array(labels[0:10], ndmin=2).T
# Perform forward and backward pass
AL, grads, cost = forward_backward(x, y, parameters)
# Compute cost and append to the log_list
log_list.append(cost)
# Update parameters with computed gradients
parameters = update_parameters(grads, parameters, learning_rate)
plt.plot(log_list)
plt.title("Loss of the network")
plt.show()
I am struggling to find the place where you calculate the error gradients and the input training data sample would also help...
I don't know if this will help you, but I'll share my solution for Python neural network to learn XOR problem.
import numpy as np
def sigmoid_function(x, derivative=False):
"""
Sigmoid function
“x” is the input and “y” the output, the nonlinear properties of this function means that
the rate of change is slower at the extremes and faster in the centre. Put plainly,
we want the neuron to “make its mind up” instead of indecisively staying in the middle.
:param x: Float
:param Derivative: Boolean
:return: Float
"""
if (derivative):
return x * (1 - x) # Derivative using the chain rule.
else:
return 1 / (1 + np.exp(-x))
# create dataset for XOR problem
input_data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
ideal_output = np.array([[0.0], [1.0], [1.0], [0.0]])
#initialize variables
learning_rate = 0.1
epoch = 50000 #number or iterations basically - One round of forward and back propagation is called an epoch
# get the second element from the numpy array shape field to detect the count of features for input layer
input_layer_neurons = input_data.shape[1]
hidden_layer_neurons = 3 #number of hidden layer neurons
output_layer_neurons = 1 #number of output layer neurons
#init weight & bias
weights_hidden = np.random.uniform(size=(input_layer_neurons, hidden_layer_neurons))
bias_hidden = np.random.uniform(1, hidden_layer_neurons)
weights_output = np.random.uniform(size=(hidden_layer_neurons, output_layer_neurons))
bias_output = np.random.uniform(1, output_layer_neurons)
for i in range(epoch):
#forward propagation
hidden_layer_input_temp = np.dot(input_data, weights_hidden) #matrix dot product to adjust for weights in the layer
hidden_layer_input = hidden_layer_input_temp + bias_hidden #adjust for bias
hidden_layer_activations = sigmoid_function(hidden_layer_input) #use the activation function
output_layer_input_temp = np.dot(hidden_layer_activations, weights_output)
output_layer_input = output_layer_input_temp + bias_output
output = sigmoid_function(output_layer_input) #final output
#backpropagation (where adjusting of the weights happens)
error = ideal_output - output #error gradient
if (i % 1000 == 0):
print("Error: {}".format(np.mean(abs(error))))
#use derivatives to compute slope of output and hidden layers
slope_output_layer = sigmoid_function(output, derivative=True)
slope_hidden_layer = sigmoid_function(hidden_layer_activations, derivative=True)
#calculate deltas
delta_output = error * slope_output_layer
error_hidden_layer = delta_output.dot(weights_output.T) #calculates the error at hidden layer
delta_hidden = error_hidden_layer * slope_hidden_layer
#change the weights
weights_output += hidden_layer_activations.T.dot(delta_output) * learning_rate
bias_output += np.sum(delta_output, axis=0, keepdims=True) * learning_rate
weights_hidden += input_data.T.dot(delta_hidden) * learning_rate
bias_hidden += np.sum(delta_hidden, axis=0, keepdims=True) * learning_rate
I have implemented and trained a neural network with Theano of k binary inputs (0,1), one hidden layer and one unit in the output layer. Once it has been trained I want to obtain inputs that maximizes the output (e.g. x which makes unit of output layer closest to 1). So far I haven't found an implementation of it, so I am trying the following approach:
Train network => obtain trained weights (theta1, theta2)
Define the neural network function with x as input and trained theta1, theta2 as fixed parameters. That is: f(x) = sigmoid( theta1*(sigmoid (theta2*x ))). This function takes x and with given trained weights (theta1, theta2) gives output between 0 and 1.
Apply gradient descent w.r.t. x on the neural network function f(x) and obtain x that maximizes f(x) with theta1 and theta2 given.
For these I have implemented the following code with a toy example (k = 2). Based on the tutorial on http://outlace.com/Beginner-Tutorial-Theano/ but changed vector y, so that there is only one combination of inputs that gives f(x) ~ 1 which is x = [0, 1].
Edit1: As suggested optimizer was set to None and bias unit was fixed to 1.
Step 1: Train neural network. This runs well and with out error.
import os
os.environ["THEANO_FLAGS"] = "optimizer=None"
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np
x = T.dvector()
y = T.dscalar()
def layer(x, w):
b = np.array([1], dtype=theano.config.floatX)
new_x = T.concatenate([x, b])
m = T.dot(w.T, new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
h = nnet.sigmoid(m)
return h
def grad_desc(cost, theta):
alpha = 0.1 #learning rate
return theta - (alpha * T.grad(cost, wrt=theta))
in_units = 2
hid_units = 3
out_units = 1
theta1 = theano.shared(np.array(np.random.rand(in_units + 1, hid_units), dtype=theano.config.floatX)) # randomly initialize
theta2 = theano.shared(np.array(np.random.rand(hid_units + 1, out_units), dtype=theano.config.floatX))
hid1 = layer(x, theta1) #hidden layer
out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression
cost = theano.function(inputs=[x, y], outputs=fc, updates=[
(theta1, grad_desc(fc, theta1)),
(theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)
inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 0, 0, 0]) #training data Y
cur_cost = 0
for i in range(5000):
for k in range(len(inputs)):
cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
print(run_forward([0,1]))
Output of run forward for [0,1] is: 0.968905860574.
We can also get values of weights with theta1.get_value() and theta2.get_value()
Step 2: Define neural network function f(x). Trained weights (theta1, theta2) are constant parameters of this function.
Things get a little trickier here because of the bias unit, which is part of he vector of inputs x. To do this I concatenate b and x. But the code now runs well.
b = np.array([[1]], dtype=theano.config.floatX)
#b_sh = theano.shared(np.array([[1]], dtype=theano.config.floatX))
rand_init = np.random.rand(in_units, 1)
rand_init[0] = 1
x_sh = theano.shared(np.array(rand_init, dtype=theano.config.floatX))
th1 = T.dmatrix()
th2 = T.dmatrix()
nn_hid = T.nnet.sigmoid( T.dot(th1, T.concatenate([x_sh, b])) )
nn_predict = T.sum( T.nnet.sigmoid( T.dot(th2, T.concatenate([nn_hid, b]))))
Step 3:
Problem is now in gradient descent as is not limited to values between 0 and 1.
fc2 = (nn_predict - 1)**2
cost3 = theano.function(inputs=[th1, th2], outputs=fc2, updates=[
(x_sh, grad_desc(fc2, x_sh))])
run_forward = theano.function(inputs=[th1, th2], outputs=nn_predict)
cur_cost = 0
for i in range(10000):
cur_cost = cost3(theta1.get_value().T, theta2.get_value().T) #call our Theano-compiled cost function, it will auto update weights
if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
print('Cost: %s' % (cur_cost,))
print x_sh.get_value()
The last iteration prints:
Cost: 0.000220317356533
[[-0.11492753]
[ 1.99729555]]
Furthermore input 1 keeps becoming more negative and input 2 increases, while the optimal solution is [0, 1]. How can this be fixed?
You are adding b=[1] via broadcasting rules as opposed to concatenating it. Also, once you concatenate it, your x_sh has one dimension to many which is why the error occurs at nn_predict and not nn_hid
I've been experimenting with very basic recurrent networks and have seen really strange behaviors. I've spent quite a bit of time trying to narrow down where it goes wrong and I ended up finding that the gradients computed by theano and by finite differentiation are radically different when using a recurrent layer. What is going on here?
Here is the kind of problem I have:
I have n_seq sequences of n_steps feature vectors of dimension n_feat, along with their labels among n_class classes. The labels are per time step, not per sequence (so I have n_seq*n_steps labels).
My goal is to train a model to correctly classify the feature vectors.
Here's my minimal example:
(In reality, there would be some sequential information in the data, hence recurrent networks should do better, but in this minimal example I generate purely random data, which is good enough to expose the bug.)
I create 2 minimal networks:
1) A regular feed-forward (not recurrent), with just the input layer and an output layer with softmax (no hidden layer). I discard the sequential information by considering a "batch" of n_seq*n_steps "independent" feature vectors.
2) An identical network but where the output layer is recurrent. The batch is now of size n_seq and each input is a full sequence of n_steps feature vectors. Finally I reshape the output back to a "batch" of size n_seq*n_steps.
If the recurrent weights are set to 0, the 2 networks should be equivalent. Indeed I do see that the initial loss for both networks are the same in this case, no matter what random initialization for the feed-forward weights I have.
If I implement a finite differentiation, I also get that the (initial) gradients for the feed-forward weights are the same (as they should). However the gradients obtained from theano are radically different (but only for the recurrent network).
Here is my code with sample result:
NOTE: when run for the first time, I get this warning, I don't know what triggers it but I bet it is relevant to my problem.
Warning: In the strict mode, all neccessary shared variables must be passed as a part of non_sequences
'must be passed as a part of non_sequences', Warning)
Any insight would be much appreciated!
CODE:
import numpy as np
import theano
import theano.tensor as T
import lasagne
# GENERATE RANDOM DATA
n_steps = 10**4
n_seq = 10
n_feat = 2
n_class = 2
data_X = lasagne.utils.floatX(np.random.randn(n_seq, n_steps, n_feat))
data_y = np.random.randint(n_class, size=(n_seq, n_steps))
# INITIALIZE WEIGHTS
# feed-forward weights (random)
W = theano.shared(lasagne.utils.floatX(np.random.randn(n_feat,n_class)), name="W")
# recurrent weights (set to 0)
W_rec = theano.shared(lasagne.utils.floatX(np.zeros((n_class,n_class))), name="Wrec")
# bias (set to 0)
b = theano.shared(lasagne.utils.floatX(np.zeros((n_class,))), name="b")
def create_functions(model, X, y, givens):
"""Helper for building a network."""
loss = lasagne.objectives.categorical_crossentropy(lasagne.layers.get_output(model, X), y).mean()
get_loss = theano.function(
[], loss,
givens=givens
)
all_params = lasagne.layers.get_all_params(model)
get_theano_grad = [
theano.function(
[], g,
givens=givens
)
for g in theano.grad(loss, all_params)
]
return get_loss, get_theano_grad
def feedforward():
"""Creates a minimal feed-forward network."""
l_in = lasagne.layers.InputLayer(
shape=(n_seq*n_steps, n_feat),
)
l_out = lasagne.layers.DenseLayer(
l_in,
num_units=n_class,
nonlinearity=lasagne.nonlinearities.softmax,
W=W,
b=b
)
model = l_out
X = T.matrix('X')
y = T.ivector('y')
givens={
X: theano.shared(data_X.reshape((n_seq*n_steps, n_feat))),
y: T.cast(theano.shared(data_y.reshape((n_seq*n_steps,))), 'int32'),
}
return (model,) + create_functions(model, X, y, givens)
def recurrent():
"""Creates a minimal recurrent network."""
l_in = lasagne.layers.InputLayer(
shape=(n_seq, n_steps, n_feat),
)
l_out = lasagne.layers.RecurrentLayer(
l_in,
num_units=n_class,
nonlinearity=lasagne.nonlinearities.softmax,
gradient_steps=1,
W_in_to_hid=W,
W_hid_to_hid=W_rec,
b=b,
)
l_reshape = lasagne.layers.ReshapeLayer(l_out, (n_seq*n_steps, n_class))
model = l_reshape
X = T.tensor3('X')
y = T.ivector('y')
givens={
X: theano.shared(data_X),
y: T.cast(theano.shared(data_y.reshape((n_seq*n_steps,))), 'int32'),
}
return (model,) + create_functions(model, X, y, givens)
def finite_diff(param, loss_func, epsilon):
"""Computes a finitie differentation gradient of loss_func wrt param."""
loss = loss_func()
P = param.get_value()
grad = np.zeros_like(P)
it = np.nditer(P , flags=['multi_index'])
while not it.finished:
ind = it.multi_index
dP = P.copy()
dP[ind] += epsilon
param.set_value(dP)
grad[ind] = (loss_func()-loss)/epsilon
it.iternext()
param.set_value(P)
return grad
def theano_diff(net, get_theano_grad):
for p,g in zip(lasagne.layers.get_all_params(net), get_theano_grad):
if p.name == "W":
gW = np.array(g())
if p.name == "b":
gb = np.array(g())
return gW, gb
def compare_ff_rec():
eps = 1e-3 # for finite differentiation
ff, get_loss_ff, get_theano_grad_ff = feedforward()
rec, get_loss_rec, get_theano_grad_rec = recurrent()
gW_ff_finite = finite_diff(W, get_loss_ff, eps)
gb_ff_finite = finite_diff(b, get_loss_ff, eps)
gW_rec_finite = finite_diff(W, get_loss_rec, eps)
gb_rec_finite = finite_diff(b, get_loss_rec, eps)
gW_ff_theano, gb_ff_theano = theano_diff(ff, get_theano_grad_ff)
gW_rec_theano, gb_rec_theano = theano_diff(rec, get_theano_grad_rec)
print "\nloss:"
print "FF:\t", get_loss_ff()
print "REC:\t", get_loss_rec()
print "\ngradients:"
print "W"
print "FF finite:\n", gW_ff_finite.ravel()
print "FF theano:\n", gW_ff_theano.ravel()
print "REC finite:\n", gW_rec_finite.ravel()
print "REC theano:\n", gW_rec_theano.ravel()
print "b"
print "FF finite:\n", gb_ff_finite.ravel()
print "FF theano:\n", gb_ff_theano.ravel()
print "REC finite:\n", gb_rec_finite.ravel()
print "REC theano:\n", gb_rec_theano.ravel()
compare_ff_rec()
RESULTS:
loss:
FF: 0.968060314655
REC: 0.968060314655
gradients:
W
FF finite:
[ 0.23925304 -0.23907423 0.14013052 -0.14001131]
FF theano:
[ 0.23917811 -0.23917811 0.14011626 -0.14011627]
REC finite:
[ 0.23931265 -0.23907423 0.14024973 -0.14001131]
REC theano:
[ 1.77408110e-05 -1.77408110e-05 1.21677476e-05 -1.21677458e-05]
b
FF finite:
[ 0.00065565 -0.00047684]
FF theano:
[ 0.00058145 -0.00058144]
REC finite:
[ 0.00071526 -0.00047684]
REC theano:
[ 7.53380482e-06 -7.53380482e-06]
The problem was coming from non-intuitive (maybe) effect of gradient_steps clipping in BPTT as explained here:
https://groups.google.com/forum/#!topic/theano-users/QNge6fC6C4s