Pytorch BCELoss function different outputs for same inputs

I am trying to compute the cross-entropy loss for a binary classification problem with PyTorch's BCELoss. While experimenting, I ran into the following behaviour, which I cannot explain.
import torch
from torch import nn
sigmoid = nn.Sigmoid()
loss = nn.BCELoss(reduction="sum")
target = torch.tensor([0., 1.])
input1 = torch.tensor([1., 1.], requires_grad=True)
input2 = sigmoid(torch.tensor([10., 10.], requires_grad=True))
print(input2) #tensor([1.0000, 1.0000], grad_fn=<SigmoidBackward>)
print(loss(input1, target)) #tensor(100., grad_fn=<BinaryCrossEntropyBackward>)
print(loss(input2, target)) #tensor(9.9996, grad_fn=<BinaryCrossEntropyBackward>)
Both input1 and input2 print as the same values, so shouldn't the loss be the same for both, rather than 100 and 9.9996? I would expect 100 in both cases: the first element contributes log(1 - 1) = log(0) = -infinity, which PyTorch clamps to -100 (see https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).
What is going on here, and where am I going wrong?

sigmoid(10) is not exactly equal to 1:
>>> 1 / (1 + torch.exp(-torch.tensor(10.))).item()
0.9999545833234493
In your case:
>>> sigmoid(torch.tensor([10., 10.], requires_grad=True)).tolist()
[0.9999545812606812, 0.9999545812606812]
So input1 and input2 are not the same: [1.0, 1.0] vs [0.9999545812606812, 0.9999545812606812]. The printed tensor([1.0000, 1.0000]) is only rounded to four decimal places for display.
Let's compute BCE manually:
def bce(x, y):
    return - (y * torch.log(x) + (1 - y) * torch.log(1 - x)).item()
# input1
x1 = torch.tensor(1.)
x2 = torch.tensor(1.)
y1 = torch.tensor(0.)
y2 = torch.tensor(1.)
print("input1:", sum([bce(x1, y1), bce(x2, y2)]))
# input2
x1 = torch.tensor(0.9999545812606812)
x2 = torch.tensor(0.9999545812606812)
y1 = torch.tensor(0.)
y2 = torch.tensor(1.)
print("input2:", sum([bce(x1, y1), bce(x2, y2)]))
input1: nan
input2: 9.999631525119185
For input1 the manual computation gives nan (0 * log(0) evaluates to 0 * -inf), but according to the docs:
Our solution is that BCELoss clamps its log function outputs to be greater than or equal to -100. This way, we can always have a finite loss value and a linear backward method.
That is why PyTorch's BCELoss returns 100 for input1.
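The clamping described in the docs can be mimicked without PyTorch. The following is a minimal sketch in plain Python, assuming the -100 clamp is applied to each log term before it is multiplied by the target; the helper names clamped_log and bce_sum are made up for this illustration and are not part of PyTorch. It works in float64, so the second value comes out close to, but not identical with, the float32 result 9.9996 shown in the question.

```python
import math

def clamped_log(p):
    # log(p), clamped from below at -100 (assumption: mirrors the BCELoss docs)
    return max(math.log(p), -100.0) if p > 0.0 else -100.0

def bce_sum(inputs, targets):
    # Summed binary cross-entropy with clamped log terms
    total = 0.0
    for x, y in zip(inputs, targets):
        total -= y * clamped_log(x) + (1.0 - y) * clamped_log(1.0 - x)
    return total

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

target = [0.0, 1.0]
# Inputs exactly equal to 1: log(0) is clamped to -100, so the loss is finite
loss_exact = bce_sum([1.0, 1.0], target)
# Inputs just below 1: no clamping is triggered
loss_sigmoid = bce_sum([sigmoid(10.0), sigmoid(10.0)], target)
print(loss_exact, loss_sigmoid)
```

Because the clamp happens before the multiplication, the 0 * -inf product that produced nan in the manual computation never occurs.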

Related

Weights in Numpy Neural Net Not Updating, Error is Static

I'm trying to build a neural network on the MNIST dataset for a homework assignment. I'm not asking anyone to do the assignment for me; I just can't figure out why the training and test accuracy stay the same in every epoch.
It's as if my way of updating the weights is not working.
Epoch: 0, Train Accuracy: 10.22%, Train Cost: 3.86, Test Accuracy: 10.1%
Epoch: 1, Train Accuracy: 10.22%, Train Cost: 3.86, Test Accuracy: 10.1%
Epoch: 2, Train Accuracy: 10.22%, Train Cost: 3.86, Test Accuracy: 10.1%
Epoch: 3, Train Accuracy: 10.22%, Train Cost: 3.86, Test Accuracy: 10.1%
.
.
.
However, when I run the actual forward and backprop lines in a loop without any 'fluff' of classes or methods the cost goes down. I just can't seem to get it working in the current class setup.
I've tried building my own methods that pass the weights and biases between the backprop and feed-forward methods explicitly, however, those changes haven't done anything to fix this gradient descent issue.
I'm pretty sure it has to do with the definition of the backprop method in the NeuralNetwork class below. I've been struggling to find a way to update the weights by accessing the weight and bias variables in the main training loop.
def backward(self, Y_hat, Y):
    '''
    Backward pass through network. Update parameters
    INPUT
        Y_hat: Network predicted
            shape: (?, 10)
        Y: Correct target
            shape: (?, 10)
    RETURN
        cost: calculate J for errors
            type: (float)
    '''
    #Naked Backprop
    dJ_dZ2 = Y_hat - Y
    dJ_dW2 = np.matmul(np.transpose(X2), dJ_dZ2)
    dJ_db2 = Y_hat - Y
    dJ_dX2 = np.matmul(dJ_db2, np.transpose(NeuralNetwork.W2))
    dJ_dZ1 = dJ_dX2 * d_sigmoid(Z1)
    inner_mat = np.matmul(Y-Y_hat,np.transpose(NeuralNetwork.W2))
    dJ_dW1 = np.matmul(np.transpose(X),inner_mat) * d_sigmoid(Z1)
    dJ_db1 = np.matmul(Y - Y_hat, np.transpose(NeuralNetwork.W2)) * d_sigmoid(Z1)
    lr = 0.1
    # weight updates here
    #just line 'em up and do lr * the dJ_.. vars you found above
    NeuralNetwork.W2 = NeuralNetwork.W2 - lr * dJ_dW2
    NeuralNetwork.b2 = NeuralNetwork.b2 - lr * dJ_db2
    NeuralNetwork.W1 = NeuralNetwork.W1 - lr * dJ_dW1
    NeuralNetwork.b1 = NeuralNetwork.b1 - lr * dJ_db1
    # calculate the cost
    cost = -1 * np.sum(Y * np.log(Y_hat))
    # calc gradients
    # weight updates
    return cost#, W1, W2, b1, b2
I'm really at a loss here, any help is appreciated!
Full code is shown here...
import keras
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
np.random.seed(0)
"""### Load MNIST Dataset"""
(x_train, y_train), (x_test, y_test) = mnist.load_data()
X = x_train[0].reshape(1,-1)/255.; Y = y_train[0]
zeros = np.zeros(10); zeros[Y] = 1
Y = zeros
#Here we implement the forward pass for the network using the single example, $X$, from above
### Initialize weights and Biases
num_hidden_nodes = 200
num_classes = 10
# init weights
#first set of weights (these are what the input matrix is multiplied by)
W1 = np.random.uniform(-1e-3,1e-3,size=(784,num_hidden_nodes))
#this is the first bias layer and i think it's a 200 dimensional vector of the biases that go into each neuron before the sigmoid function.
b1 = np.zeros((1,num_hidden_nodes))
#again this are the weights for the 2nd layer that are multiplied by the activation output of the 1st layer
W2 = np.random.uniform(-1e-3,1e-3,size=(num_hidden_nodes,num_classes))
#these are the biases that are added to each neuron before the final softmax activation.
b2 = np.zeros((1,num_classes))
# multiply input with weights
Z1 = np.add(np.matmul(X,W1), b1)
def sigmoid(z):
return 1 / (1 + np.exp(- z))
def d_sigmoid(g):
return sigmoid(g) * (1. - sigmoid(g))
# activation function of Z1
X2 = sigmoid(Z1)
Z2 = np.add(np.matmul(X2,W2), b2)
# softmax
def softmax(z):
# subracting the max adds numerical stability
shiftx = z - np.max(z)
exps = np.exp(shiftx)
return exps / np.sum(exps)
def d_softmax(Y_hat, Y):
return Y_hat - Y
# the hypothesis,
Y_hat = softmax(Z2)
"""Initially the network guesses all categories equally. As we perform backprop the network will get better at discerning images and their categories."""
"""### Calculate Cost"""
cost = -1 * np.sum(Y * np.log(Y_hat))
#so i think the main thing here is like a nested chain rule thing, where we find the change in the cost with respec to each
# set of matrix weights and biases?
#here is probably the order of how we do things based on whats in math below...
'''
1. find the partial deriv of the cost function with respect to the output of the second layer, without the softmax it looks like for some reason?
2. find the partial deriv of the cost function with respect to the weights of the second layer, which is dope cause we can re-use the partial deriv from step 1
3. this one I know intuitively we're looking for the parial deriv of cost with respect to the bias term of the second layer, but how TF does that math translate into
numpy? is that the same y_hat - Y from the first step? where is there anyother Y_hat - y?
4. This is also confusing cause I know where to get the weights for layer 2 from and how to transpose them, but again, where is the Y_hat - Y?
5. Here we take the missing partial deriv from step 4 and multiply it by the d_sigmoid function of the first layer outputs before activations.
6. In this step we multiply the first layer weights (transposed) by the var from 5
7. And this is weird too, this just seems like the same step as number 5 repeated for some reason but with y-y_hat instead of y_hat-y
'''
#look at tutorials like this https://www.youtube.com/watch?v=7qYtIveJ6hU
#I think the most backprop layer steps are fine without biases but how do we find the bias derivatives
#maybe just the hypothesis matrix minus the actual y matrix?
dJ_dZ2 = Y_hat - Y
#find partial deriv of cost w respect to 2nd layer weights
dJ_dW2 = np.matmul(np.transpose(X2), dJ_dZ2)
#finding the partial deriv of cost with respect to the 2nd layer biases
#I'm still not 100% sure why this is here and why it works out to Y_hat - Y
dJ_db2 = Y_hat - Y
#finding the partial deriv of cost with respect to 2nd layer inputs
dJ_dX2 = np.matmul(dJ_db2, np.transpose(W2))
#finding the partial deriv of cost with respect to Activation of layer 1
dJ_dZ1 = dJ_dX2 * d_sigmoid(Z1)
#y-yhat matmul 2nd layer weights
#I added the transpose to the W2 var because the matrices were not compaible sizes without it
inner_mat = np.matmul(Y-Y_hat,np.transpose(W2))
dJ_dW1 = np.matmul(np.transpose(X),inner_mat) * d_sigmoid(Z1)
class NeuralNetwork:
# set learning rate
lr = 0.01
# init weights
W1 = np.random.uniform(-1e-3,1e-3,size=(784,num_hidden_nodes))
b1 = np.zeros((1,num_hidden_nodes))
W2 = np.random.uniform(-1e-3,1e-3,size=(num_hidden_nodes,num_classes))
b2 = np.zeros((1,num_classes))
def __init__(self, num_hidden_nodes, num_classes, lr=0.01):
'''
# set learning rate
lr = lr
# init weights
W1 = np.random.uniform(-1e-3,1e-3,size=(784,num_hidden_nodes))
b1 = np.zeros((1,num_hidden_nodes))
W2 = np.random.uniform(-1e-3,1e-3,size=(num_hidden_nodes,num_classes))
b2 = np.zeros((1,num_classes))
'''
def forward(self, X1):
'''
Forward pass through the network
INPUT
X: input to network
shape: (?, 784)
RETURN
Y_hat: prediction from output of network
shape: (?, 10)
'''
Z1 = np.add(np.matmul(X,W1), b1)
X2 = sigmoid(Z1)# activation function of Z1
Z2 = np.add(np.matmul(X2,W2), b2)
Y_hat = softmax(Z2)
#return the hypothesis
return Y_hat
# store input for backward pass
# you can basically copy and past what you did in the forward pass above here
# think about what you need to store for the backward pass
return
def backward(self, Y_hat, Y):
'''
Backward pass through network. Update parameters
INPUT
Y_hat: Network predicted
shape: (?, 10)
Y: Correct target
shape: (?, 10)
RETURN
cost: calculate J for errors
type: (float)
'''
#Naked Backprop
dJ_dZ2 = Y_hat - Y
dJ_dW2 = np.matmul(np.transpose(X2), dJ_dZ2)
dJ_db2 = Y_hat - Y
dJ_dX2 = np.matmul(dJ_db2, np.transpose(NeuralNetwork.W2))
dJ_dZ1 = dJ_dX2 * d_sigmoid(Z1)
inner_mat = np.matmul(Y-Y_hat,np.transpose(NeuralNetwork.W2))
dJ_dW1 = np.matmul(np.transpose(X),inner_mat) * d_sigmoid(Z1)
dJ_db1 = np.matmul(Y - Y_hat, np.transpose(NeuralNetwork.W2)) * d_sigmoid(Z1)
lr = 0.1
# weight updates here
#just line 'em up and do lr * the dJ_.. vars you found above
NeuralNetwork.W2 = NeuralNetwork.W2 - lr * dJ_dW2
NeuralNetwork.b2 = NeuralNetwork.b2 - lr * dJ_db2
NeuralNetwork.W1 = NeuralNetwork.W1 - lr * dJ_dW1
NeuralNetwork.b1 = NeuralNetwork.b1 - lr * dJ_db1
# calculate the cost
cost = -1 * np.sum(Y * np.log(Y_hat))
# calc gradients
# weight updates
return cost#, W1, W2, b1, b2
nn = NeuralNetwork(200,10,lr=.01)
num_train = float(len(x_train))
num_test = float(len(x_test))
for epoch in range(10):
train_correct = 0; train_cost = 0
# training loop
for i in range(len(x_train)):
x = x_train[i]; y = y_train[i]
# standardizing input to range 0 to 1
X = x.reshape(1,784) /255.
# forward pass through network
Y_hat = nn.forward(X)
# get pred number
pred_num = np.argmax(Y_hat)
# check if prediction was accurate
if pred_num == y:
train_correct += 1
# make a one hot categorical vector; same as keras.utils.to_categorical()
zeros = np.zeros(10); zeros[y] = 1
Y = zeros
# compute gradients and update weights
train_cost += nn.backward(Y_hat, Y)
test_correct = 0
# validation loop
for i in range(len(x_test)):
x = x_test[i]; y = y_test[i]
# standardizing input to range 0 to 1
X = x.reshape(1,784) /255.
# forward pass
Y_hat = nn.forward(X)
# get pred number
pred_num = np.argmax(Y_hat)
# check if prediction was correct
if pred_num == y:
test_correct += 1
# no backward pass here!
# compute average metrics for train and test
train_correct = round(100*(train_correct/num_train), 2)
test_correct = round(100*(test_correct/num_test ), 2)
train_cost = round( train_cost/num_train, 2)
# print status message every epoch
log_message = 'Epoch: {epoch}, Train Accuracy: {train_acc}%, Train Cost: {train_cost}, Test Accuracy: {test_acc}%'.format(
epoch=epoch,
train_acc=train_correct,
train_cost=train_cost,
test_acc=test_correct
)
print (log_message)
The project is also available in this Colab and ipynb notebook.

I believe the problem is in this part of your loop:
for epoch in range(10):
    train_correct = 0; train_cost = 0
    # training loop
    for i in range(len(x_train)):
        x = x_train[i]; y = y_train[i]
        # standardizing input to range 0 to 1
        X = x.reshape(1,784) /255.
        # forward pass through network
        Y_hat = nn.forward(X)
        # get pred number
        pred_num = np.argmax(Y_hat)
        # check if prediction was accurate
        if pred_num == y:
            train_correct += 1
        # make a one hot categorical vector; same as keras.utils.to_categorical()
        zeros = np.zeros(10); zeros[y] = 1
        Y = zeros
        # compute gradients and update weights
        train_cost += nn.backward(Y_hat, Y)
    test_correct = 0
    # validation loop
    for i in range(len(x_test)):
        x = x_test[i]; y = y_test[i]
        # standardizing input to range 0 to 1
        X = x.reshape(1,784) /255.
        # forward pass
        Y_hat = nn.forward(X)
        # get pred number
        pred_num = np.argmax(Y_hat)
        # check if prediction was correct
        if pred_num == y:
            test_correct += 1
        # no backward pass here!
    # compute average metrics for train and test
    train_correct = round(100*(train_correct/num_train), 2)
    test_correct = round(100*(test_correct/num_test ), 2)
    train_cost = round( train_cost/num_train, 2)
    # print status message every epoch
    log_message = 'Epoch: {epoch}, Train Accuracy: {train_acc}%, Train Cost: {train_cost}, Test Accuracy: {test_acc}%'.format(
        epoch=epoch,
        train_acc=train_correct,
        train_cost=train_cost,
        test_acc=test_correct
    )
    print (log_message)
In each of the 10 epochs of your loop you reset train_correct and train_cost to 0, hence nothing is carried over from one epoch to the next.
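Separately from the point above, it may be worth checking which variables forward actually reads. In the posted class, forward computes with the module-level X, W1, b1, W2 and b2, while backward writes its updates to NeuralNetwork.W1, NeuralNetwork.W2 and so on, so the updated weights may never reach the forward pass. The toy sketch below (the class Net and its single weight W are invented for illustration) shows the effect in isolation:

```python
W = 1.0  # module-level weight; nothing ever updates it

class Net:
    W = 1.0  # class attribute; this is the one backward updates

    def forward_global(self, x):
        # Reads the module-level W, like forward in the posted code
        return x * W

    def forward_attr(self, x):
        # Reads the class attribute that backward writes to
        return x * Net.W

    def backward(self, grad, lr=0.1):
        Net.W = Net.W - lr * grad

net = Net()
before_global = net.forward_global(2.0)
before_attr = net.forward_attr(2.0)
net.backward(grad=5.0)                   # Net.W becomes 0.5
after_global = net.forward_global(2.0)   # unchanged: still uses module-level W
after_attr = net.forward_attr(2.0)       # reflects the update
print(before_global, after_global, before_attr, after_attr)
```

If this is what happens in the full code, the predictions would be identical in every epoch, which is consistent with the constant accuracy and cost in the log.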

Tensorflow: NaN for custom softmax

Replacing tf.nn.softmax with a hand-written combination based on tf.exp, and leaving everything else unchanged, makes not only the gradients but also the intermediate variable s contain NaN. I have no idea why.
tempX = x
tempW = W
tempMult = tf.matmul(tempX, W)
s = tempMult + b
#! ----------------------------
#p = tf.nn.softmax(s)
p = tf.exp(s) / tf.reduce_sum(tf.exp(s), axis=1)
#!------------------------------
myTemp = y*tf.log(p)
cost = tf.reduce_mean(-tf.reduce_sum(myTemp, reduction_indices=1)) + mylambda*tf.reduce_sum(tf.multiply(W,W))
grad_W, grad_b = tf.gradients(xs=[W, b], ys=cost)
new_W = W.assign(W - tf.multiply(learning_rate, grad_W))
new_b = b.assign(b - tf.multiply(learning_rate, grad_b))

Answer
tf.exp(s) easily overflows for large s. That is the main reason tf.nn.softmax does not use that equation directly but computes something equivalent to it (according to the docs).
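One widely used equivalent form subtracts the largest score before exponentiating. The sketch below is plain Python rather than TensorFlow, and the function name stable_softmax is chosen for this illustration; it is not a claim about how tf.nn.softmax is implemented internally.

```python
import math

def stable_softmax(scores):
    # Subtracting the max leaves the result mathematically unchanged
    # but keeps every exponent <= 0, so exp() cannot overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) on its own would raise OverflowError
probs = stable_softmax([1000.0, 1001.0, 1002.0])
print(probs)
```

The shift cancels in the ratio, so the probabilities are the same as for the scores [0, 1, 2].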
Discussion
When I rewrote your softmax function to
p = tf.exp(s) / tf.reshape( tf.reduce_sum(tf.exp(s), axis=1), [-1,1] )
it worked without a problem. The reshape turns the per-row sums into a column of shape [batch, 1], so the division broadcasts across each row.
Here is a fully working Python 2.7 implementation that uses a hand-crafted softmax (with the reshape):
# -- imports --
import tensorflow as tf
import numpy as np
# np.set_printoptions(precision=1) reduces np precision output to 1 digit
np.set_printoptions(precision=2, suppress=True)
# -- constant data --
x = [[0., 0.], [1., 1.], [1., 0.], [0., 1.]]
y_ = [[1., 0.], [1., 0.], [0., 1.], [0., 1.]]
# -- induction --
# 1x2 input -> 2x3 hidden sigmoid -> 3x1 sigmoid output
# Layer 0 = the x2 inputs
x0 = tf.constant(x, dtype=tf.float32)
y0 = tf.constant(y_, dtype=tf.float32)
# Layer 1 = the 2x3 hidden sigmoid
m1 = tf.Variable(tf.random_uniform([2, 3], minval=0.1, maxval=0.9, dtype=tf.float32))
b1 = tf.Variable(tf.random_uniform([3], minval=0.1, maxval=0.9, dtype=tf.float32))
h1 = tf.sigmoid(tf.matmul(x0, m1) + b1)
# Layer 2 = the 3x2 softmax output
m2 = tf.Variable(tf.random_uniform([3, 2], minval=0.1, maxval=0.9, dtype=tf.float32))
b2 = tf.Variable(tf.random_uniform([2], minval=0.1, maxval=0.9, dtype=tf.float32))
h2 = tf.matmul(h1, m2) + b2
y_out = tf.exp(h2) / tf.reshape( tf.reduce_sum(tf.exp(h2), axis=1) , [-1,1] )
# -- loss --
# loss : sum of the squares of y0 - y_out
loss = tf.reduce_sum(tf.square(y0 - y_out))
# training step : gradient descent (1.0) to minimize loss
train = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
# -- training --
# run 500 times using all the X and Y
# print out the loss and any other interesting info
#with tf.Session() as sess:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print "\nloss"
for step in range(500):
    sess.run(train)
    if (step + 1) % 100 == 0:
        print sess.run(loss)
results = sess.run([m1, b1, m2, b2, y_out, loss])
labels = "m1,b1,m2,b2,y_out,loss".split(",")
for label, result in zip(*(labels, results)):
    print ""
    print label
    print result
print ""
Perhaps your initial values for W and b are too large. I re-ran my code above with the weights initialized to large numbers and was able to reproduce your NaN issue.

Keras - custom loss function - Computing squared distance of softmax output from truth label

I am training a classification problem that ends in a softmax layer. I want to compute the loss as the average of the product of the prediction probability and square distances of each example's softmax output from the truth label.
In pseudocode: average(probability*distance_from_label**2)
Right now I am using the following code, which runs, but converges to outputting '0' for every instance. There is something wrong in its implementation:
bs = batch_size
l = labels
X = K.constant([list(range(l)) for _ in range(bs)], shape=(bs, l))
X = tf.add(-y_true, X)
X = tf.abs(X)
X = tf.multiply(y_pred, X)
X = tf.multiply(X, X)
return K.mean(X)
Is there a way to implement this squared-distance loss function and keep a softmax layer, while still measuring the actual Euclidean distance between the predicted and true labels, not just the element-wise difference between one-hot vectors?
For clarity, I have provided these examples:
Example 1:
label1 = [0, 0, 1, 0], prediction1 = [1, 0, 0, 0]
loss1 = 4 = (4 + 0 + 0 + 0) = (1*2^2 + 0*1^2 + 0*0^2 + 0*1^2)
Example 2:
label2 = [0, 1, 0, 0],prediction2 = [0.3, 0.1, 0.3, 0.3]
loss2 = 1.8 = (0.3 + 0 + 0.3 + 1.2) = (0.3*1^2 + 0.1*0^2 + 0.3*1^2 + 0.3*2^2)
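The two examples above can be reproduced with a few lines of plain Python. This is only a reference sketch of the intended arithmetic for a single example, not a Keras loss; the function name expected_sq_distance is invented here.

```python
def expected_sq_distance(label, prediction):
    # Index of the true class in the one-hot label
    true_idx = label.index(1)
    # Weight each predicted probability by its squared index distance
    # from the true class, then sum over classes
    return sum(p * (i - true_idx) ** 2 for i, p in enumerate(prediction))

loss1 = expected_sq_distance([0, 0, 1, 0], [1, 0, 0, 0])
loss2 = expected_sq_distance([0, 1, 0, 0], [0.3, 0.1, 0.3, 0.3])
print(loss1, loss2)
```

A batched tensor implementation should return the mean of these per-example values.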

You can use Tensorflow (or Theano) as well as Keras Backends when designing a custom loss function. Note the tf.multiply and other functions from Tensorflow.
The following code implements this custom loss function in Keras (tested and working):
def custom_loss(y_true, y_pred):
    bs = 1 # batch size
    l = 4 # label number
    c = K.constant([[x for x in range(l)] for y in range(bs)], shape=((bs, l)))
    truths = tf.multiply(y_true, c)
    truths = K.sum(truths, axis=1)
    truths = K.concatenate(list(truths for i in range(l)))
    truths = K.reshape(truths, ((bs,l)))
    distances = tf.add(truths, -c)
    sqdist = tf.multiply(distances, distances)
    out = tf.multiply(y_pred, sqdist)
    out = K.sum(out, axis=1)
    return K.mean(out)
y_true = K.constant([0,0,1,0])
y_pred = K.constant([1,0,0,0])
print(custom_loss(y_true, y_pred)) # tf.Tensor(4.0, shape=(), dtype=float32)

Here is another implementation that achieves the same objective with slightly different weights. Note: the one-hot labels need to be dtype=float32. All credit goes to my supervisor.
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
model = keras.models.Sequential()
model.add(keras.layers.Dense(5, activation="relu"))
model.add(keras.layers.Dense(4, activation="softmax"))
opt = keras.optimizers.Adam(lr=1e-3, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
def custom_loss(y_true, y_pred):
    a = tf.reshape(tf.range(tf.cast(tf.shape(y_pred)[1],dtype=tf.float32)),(1,-1))
    c = tf.tile(a,(tf.shape(y_pred)[0],1))
    x = tf.math.multiply(y_true,c)
    x = tf.reshape(tf.reduce_sum(x, axis=1),(-1,1))
    x = tf.tile(x,(1,tf.shape(y_pred)[1]))
    x = tf.math.pow(tf.math.subtract(x,c),2)
    x = tf.math.multiply(x,y_pred)
    x = tf.reduce_sum(x, axis=1)
    return tf.reduce_mean(x)
yTrue= np.array([[0,0,1,0],[0,0,1,0],[0, 1, 0, 0]]).astype(np.float32)
y_true = tf.constant(yTrue)
y_pred = tf.constant([[1,0,0,0],[0,0,1,0],[0.3, 0.1, 0.3, 0.3]])
custom_loss(y_true,y_pred)
model.compile(optimizer=opt, loss=custom_loss)
model.fit(np.random.random((3,10)),yTrue, epochs=100)

What's wrong with my backpropagation?

I'm trying to code a neural network from scratch in Python. To check that everything works I wanted to overfit the network, but the loss explodes at first, then returns to its initial value and stays there (it doesn't converge). I've checked my code and could not find the reason. I assume my understanding or implementation of backpropagation is incorrect, but there might be some other reason. Can anyone help me out or at least point me in the right direction?
# Initialize weights and biases given dimensions (For this example the dimensions are set to [12288, 64, 1])
def initialize_parameters(dims):
    # Initiate parameters
    parameters = {}
    L = len(dims) # Number of layers in the network
    # Loop over the given dimensions. Initialize random weights and set biases to zero.
    for i in range(1, L):
        parameters["W" + str(i)] = np.random.randn(dims[i], dims[i-1]) * 0.01
        parameters["b" + str(i)] = np.zeros([dims[i], 1])
    return parameters
# Activation Functions
def relu(x, deriv=False):
    if deriv:
        return 1. * (x > 0)
    else:
        return np.maximum(0,x)
def sigmoid(x, deriv=False):
    if deriv:
        return x * (1-x)
    else:
        return 1/(1 + np.exp(-x))
# Forward and backward pass for 2 layer neural network. (1st relu, 2nd sigmoid)
def forward_backward(X, Y, parameters):
    # Array for storing gradients
    grads = {}
    # Get the length of examples
    m = Y.shape[1]
    # First layer
    Z1 = np.dot(parameters["W1"], X) + parameters["b1"]
    A1 = relu(Z1)
    # Second layer
    Z2 = np.dot(parameters["W2"], A1) + parameters["b2"]
    AL = sigmoid(Z2)
    # Compute cost
    cost = (-1 / m) * np.sum(np.multiply(Y, np.log(AL)) + np.multiply(1 - Y, np.log(1 - AL)))
    # Backpropagation
    # Second Layer
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    dZ2 = dAL * sigmoid(AL, deriv=True)
    grads["dW2"] = np.dot(dZ2, A1.T) / m
    grads["db2"] = np.sum(dZ2, axis=1, keepdims=True) / m
    # First layer
    dA1 = np.dot(parameters["W2"].T, dZ2)
    dZ1 = dA1 * relu(A1, deriv=True)
    grads["dW1"] = np.dot(dZ1, X.T)
    grads["db1"] = np.sum(dZ1, axis=1, keepdims=True) / m
    return AL, grads, cost
# Hyperparameters
dims = [12288, 64, 1]
epoches = 2000
learning_rate = 0.1
# Initialize parameters
parameters = initialize_parameters(dims)
log_list = []
# Train the network
for i in range(epoches):
    # Get X and Y
    x = np.array(train[0:10],ndmin=2).T
    y = np.array(labels[0:10], ndmin=2).T
    # Perform forward and backward pass
    AL, grads, cost = forward_backward(x, y, parameters)
    # Compute cost and append to the log_list
    log_list.append(cost)
    # Update parameters with computed gradients
    parameters = update_parameters(grads, parameters, learning_rate)
plt.plot(log_list)
plt.title("Loss of the network")
plt.show()
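A standard way to localize a backpropagation bug is a finite-difference gradient check. The sketch below is plain Python on a single scalar, with helper names invented for illustration. It confirms that for a sigmoid output with the cross-entropy cost used above, dAL * AL * (1 - AL) collapses to AL - Y; using that simplified form avoids the divisions by AL and 1 - AL. When comparing gradients in the posted code, note also that dW1 is the only one of the four that is not divided by m.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(z, y):
    # Binary cross-entropy of a sigmoid output, as a function of the logit z
    a = sigmoid(z)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def numeric_grad(z, y, eps=1e-6):
    # Central finite difference of the cost with respect to z
    return (bce(z + eps, y) - bce(z - eps, y)) / (2.0 * eps)

z, y = 0.7, 1.0
analytic = sigmoid(z) - y      # the simplified gradient dZ = A - Y
numeric = numeric_grad(z, y)
print(analytic, numeric)
```

The same comparison, applied entry by entry to a small weight matrix, would show which of the analytic gradients disagrees with its numerical estimate.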

I am struggling to find the place where you calculate the error gradients; a sample of the input training data would also help.
I don't know if this will help you, but I'll share my solution: a Python neural network that learns the XOR problem.
import numpy as np
def sigmoid_function(x, derivative=False):
    """
    Sigmoid function
    “x” is the input and “y” the output, the nonlinear properties of this function means that
    the rate of change is slower at the extremes and faster in the centre. Put plainly,
    we want the neuron to “make its mind up” instead of indecisively staying in the middle.
    :param x: Float
    :param Derivative: Boolean
    :return: Float
    """
    if (derivative):
        return x * (1 - x) # Derivative using the chain rule.
    else:
        return 1 / (1 + np.exp(-x))
# create dataset for XOR problem
input_data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
ideal_output = np.array([[0.0], [1.0], [1.0], [0.0]])
#initialize variables
learning_rate = 0.1
epoch = 50000 #number of iterations basically - One round of forward and back propagation is called an epoch
# get the second element from the numpy array shape field to detect the count of features for input layer
input_layer_neurons = input_data.shape[1]
hidden_layer_neurons = 3 #number of hidden layer neurons
output_layer_neurons = 1 #number of output layer neurons
#init weight & bias
weights_hidden = np.random.uniform(size=(input_layer_neurons, hidden_layer_neurons))
bias_hidden = np.random.uniform(size=(1, hidden_layer_neurons)) #shape (1, n) so each neuron gets its own bias
weights_output = np.random.uniform(size=(hidden_layer_neurons, output_layer_neurons))
bias_output = np.random.uniform(size=(1, output_layer_neurons))
for i in range(epoch):
    #forward propagation
    hidden_layer_input_temp = np.dot(input_data, weights_hidden) #matrix dot product to adjust for weights in the layer
    hidden_layer_input = hidden_layer_input_temp + bias_hidden #adjust for bias
    hidden_layer_activations = sigmoid_function(hidden_layer_input) #use the activation function
    output_layer_input_temp = np.dot(hidden_layer_activations, weights_output)
    output_layer_input = output_layer_input_temp + bias_output
    output = sigmoid_function(output_layer_input) #final output
    #backpropagation (where adjusting of the weights happens)
    error = ideal_output - output #error gradient
    if (i % 1000 == 0):
        print("Error: {}".format(np.mean(abs(error))))
    #use derivatives to compute slope of output and hidden layers
    slope_output_layer = sigmoid_function(output, derivative=True)
    slope_hidden_layer = sigmoid_function(hidden_layer_activations, derivative=True)
    #calculate deltas
    delta_output = error * slope_output_layer
    error_hidden_layer = delta_output.dot(weights_output.T) #calculates the error at hidden layer
    delta_hidden = error_hidden_layer * slope_hidden_layer
    #change the weights
    weights_output += hidden_layer_activations.T.dot(delta_output) * learning_rate
    bias_output += np.sum(delta_output, axis=0, keepdims=True) * learning_rate
    weights_hidden += input_data.T.dot(delta_hidden) * learning_rate
    bias_hidden += np.sum(delta_hidden, axis=0, keepdims=True) * learning_rate

Calculate optimal input of a neural network with theano, by using gradient descent w.r.t. inputs

I have implemented and trained a neural network in Theano with k binary inputs (0, 1), one hidden layer and one unit in the output layer. Once it has been trained, I want to obtain the inputs that maximize the output (i.e. the x that brings the output unit closest to 1). So far I haven't found an implementation of this, so I am trying the following approach:
Train the network => obtain trained weights (theta1, theta2).
Define the neural network function with x as input and the trained theta1, theta2 as fixed parameters. That is: f(x) = sigmoid(theta2 * sigmoid(theta1 * x)). This function takes x and, with the trained weights (theta1, theta2), gives an output between 0 and 1.
Apply gradient descent w.r.t. x on the neural network function f(x) and obtain the x that maximizes f(x) with theta1 and theta2 given.
For these I have implemented the following code with a toy example (k = 2). Based on the tutorial on http://outlace.com/Beginner-Tutorial-Theano/ but changed vector y, so that there is only one combination of inputs that gives f(x) ~ 1 which is x = [0, 1].
Edit1: As suggested optimizer was set to None and bias unit was fixed to 1.
Step 1: Train neural network. This runs well and without error.
import os
os.environ["THEANO_FLAGS"] = "optimizer=None"
import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np
x = T.dvector()
y = T.dscalar()
def layer(x, w):
    b = np.array([1], dtype=theano.config.floatX)
    new_x = T.concatenate([x, b])
    m = T.dot(w.T, new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1
    h = nnet.sigmoid(m)
    return h
def grad_desc(cost, theta):
    alpha = 0.1 #learning rate
    return theta - (alpha * T.grad(cost, wrt=theta))
in_units = 2
hid_units = 3
out_units = 1
theta1 = theano.shared(np.array(np.random.rand(in_units + 1, hid_units), dtype=theano.config.floatX)) # randomly initialize
theta2 = theano.shared(np.array(np.random.rand(hid_units + 1, out_units), dtype=theano.config.floatX))
hid1 = layer(x, theta1) #hidden layer
out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression
cost = theano.function(inputs=[x, y], outputs=fc, updates=[
(theta1, grad_desc(fc, theta1)),
(theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)
inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 0, 0, 0]) #training data Y
cur_cost = 0
for i in range(5000):
    for k in range(len(inputs)):
        cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
print(run_forward([0,1]))
Output of run forward for [0,1] is: 0.968905860574.
We can also get values of weights with theta1.get_value() and theta2.get_value()
Step 2: Define neural network function f(x). Trained weights (theta1, theta2) are constant parameters of this function.
Things get a little trickier here because of the bias unit, which is part of the input vector x. To handle this I concatenate b and x. The code now runs without error.
b = np.array([[1]], dtype=theano.config.floatX)
#b_sh = theano.shared(np.array([[1]], dtype=theano.config.floatX))
rand_init = np.random.rand(in_units, 1)
rand_init[0] = 1
x_sh = theano.shared(np.array(rand_init, dtype=theano.config.floatX))
th1 = T.dmatrix()
th2 = T.dmatrix()
nn_hid = T.nnet.sigmoid( T.dot(th1, T.concatenate([x_sh, b])) )
nn_predict = T.sum( T.nnet.sigmoid( T.dot(th2, T.concatenate([nn_hid, b]))))
Step 3:
The problem now is that gradient descent does not keep x within the range 0 to 1.
fc2 = (nn_predict - 1)**2
cost3 = theano.function(inputs=[th1, th2], outputs=fc2, updates=[
(x_sh, grad_desc(fc2, x_sh))])
run_forward = theano.function(inputs=[th1, th2], outputs=nn_predict)
cur_cost = 0
for i in range(10000):
    cur_cost = cost3(theta1.get_value().T, theta2.get_value().T) #call our Theano-compiled cost function, it will auto update x_sh
    if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
        print('Cost: %s' % (cur_cost,))
        print(x_sh.get_value())
The last iteration prints:
Cost: 0.000220317356533
[[-0.11492753]
[ 1.99729555]]
Furthermore, input 1 keeps becoming more negative and input 2 keeps increasing, while the optimal solution is [0, 1]. How can this be fixed?

You are adding b=[1] via broadcasting rules as opposed to concatenating it. Also, once you concatenate it, your x_sh has one dimension too many, which is why the error occurs at nn_predict and not at nn_hid.
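On the separate point raised in the question, that plain gradient descent on x is not confined to values between 0 and 1, one common option is to clip x back into [0, 1] after every update (projected gradient ascent). The sketch below is plain Python rather than Theano and uses a made-up one-unit model with hypothetical weights w and b, so it only illustrates the clipping step, not the network from the question.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fixed weights of a one-unit model: f(x) = sigmoid(w . x + b)
w = [-4.0, 4.0]
b = -2.0

def f(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def clip01(v):
    # Project a single coordinate back into the interval [0, 1]
    return min(1.0, max(0.0, v))

x = [0.5, 0.5]
lr = 0.5
for _ in range(200):
    out = f(x)
    # df/dx_i = f(x) * (1 - f(x)) * w_i
    grad = [out * (1.0 - out) * wi for wi in w]
    # Ascent step followed by projection onto the box [0, 1]^2
    x = [clip01(xi + lr * gi) for xi, gi in zip(x, grad)]
print(x, f(x))
```

With these toy weights the iterate ends at the corner [0, 1] instead of drifting outside the box. In Theano the same idea would amount to clipping the updated value of x_sh in the updates list.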
