When using the chain rule to calculate the slope of the cost function with respect to the weights of layer L, the formula becomes:
dC0/dw(L) = ... * da(L)/dz(L) * ...
With:
z(L) being the induced local field: z(L) = w1(L) * a1(L-1) + w2(L) * a2(L-1) + ...
a(L) being the output: a(L) = σ(z(L))
σ being the sigmoid function used as an activation function
Note that L is used as a layer indicator, not as an index.
Now:
da(L)/dz(L) = σ'(z(L))
With σ' being the derivative of the sigmoid function.
The problem:
But in this post by James Loy on building a simple neural network from scratch with Python, when doing the backpropagation he didn't give z(L) as the input to σ' to replace da(L)/dz(L) in the chain rule. Instead he gave it the output, i.e. the last activation of layer L, as the input to the sigmoid derivative σ':
def feedforward(self):
    self.layer1 = sigmoid(np.dot(self.input, self.weights1))
    self.output = sigmoid(np.dot(self.layer1, self.weights2))

def backprop(self):
    # application of the chain rule to find the derivative of the loss function with respect to weights2 and weights1
    d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output) * sigmoid_derivative(self.output)))
Note that in the code above, layer L is layer 2, which is the last or output layer.
And sigmoid_derivative(self.output) is where the activation of the current layer is given as input to the derivative of the sigmoid activation function.
The question:
Shouldn't we use this sigmoid_derivative(np.dot(self.layer1, self.weights2)) instead of this sigmoid_derivative(self.output)?
It turns out that σ(z(L)), i.e. the output, was used simply to accommodate the way sigmoid_derivative is implemented.
Here is the code of the sigmoid_derivative:
def sigmoid(x):
    return 1.0/(1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1.0 - x)
The mathematical formula for the sigmoid derivative can be written as: σ'(x) = σ(x) * (1 - σ(x))
So to get the formula above, σ(z) and not z was passed to sigmoid_derivative, so that it returns: σ(z) * (1.0 - σ(z))
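To make this concrete, here is a small standalone check (my own sketch, not part of the tutorial) that feeding the activation into this implementation reproduces the analytic derivative σ'(z) = σ(z) * (1 - σ(z)):

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # expects the *activation* sigmoid(z), not z itself
    return x * (1.0 - x)

z = np.linspace(-4, 4, 9)
analytic = sigmoid(z) * (1 - sigmoid(z))         # true sigma'(z)
via_activation = sigmoid_derivative(sigmoid(z))  # what the tutorial computes
print(np.allclose(analytic, via_activation))     # True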
You want to use the derivative with respect to the output. During backpropagation we use the weights only to determine how much of the error belongs to each of the weights, and by doing so we can propagate the error further back through the layers.
In the tutorial, the sigmoid is applied to the last layer:
self.output = sigmoid(np.dot(self.layer1, self.weights2))
From your question:
Shouldn't we use this sigmoid_derivative(np.dot(self.layer1, self.weights2)) instead of this sigmoid_derivative(self.output)?
You cannot do:
sigmoid_derivative(np.dot(self.layer1, self.weights2))
because here you are trying to take the derivative of the sigmoid when you have not yet applied it.
This is why you have to use:
sigmoid_derivative(self.output)
You're right, it looks like the author made a mistake. I'll explain: when the network is done with a forward pass (all activations + loss), you use gradient descent to update the weights according to the loss function. To do this, you need the partial derivative of the loss function with respect to each weight matrix.
Some notation before I continue: the loss is L, A is the activation (aka sigmoid), and Z is the net input, in other words the result of W . X. Numbers are indices, so A1 means the activation of the first layer.
You can use the chain rule to move backwards through the network and express the weights as a function of the loss. To begin the backward pass, you start by getting the derivative of the loss with respect to the last layer's activation. This is dL/dA2, because the second layer is the final layer. To update the weights of the second layer, we need to compute dA2/dZ2 and dZ2/dW2.
Before continuing, remember that the second layer's activation is A2 = sigmoid(W2 . A1) and Z2 = W2 . A1. For clarity, we'll write A2 = sigmoid(Z2) and treat Z2 as its own variable. So if you compute dA2/dZ2, you get sigmoid_derivative(Z2), which is sigmoid_derivative(W2 . A1), i.e. sigmoid_derivative(np.dot(self.layer1, self.weights2)). So it shouldn't be sigmoid_derivative(self.output), because output has already been passed through the sigmoid.
Related
I'm trying to differentiate a gradient in PyTorch. I found this link but can't get it to work.
My code looks as follows:
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim

class net_x(nn.Module):
    def __init__(self):
        super(net_x, self).__init__()
        self.fc1 = nn.Linear(2, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 4)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.out(x)
        return x

nx = net_x()
r = torch.tensor([1.0, 2.0])
nx(r)
>>>tensor([-0.2356, -0.7315, -0.2100, -0.6741], grad_fn=<AddBackward0>)
But when I try to differentiate the function with respect to the first parameter
grad(nx, r[0])
I get the error
TypeError: 'net_x' object is not iterable
Update
Trying to extend this to tensors:
For some reason the gradient is the same for all inputs.
a = torch.rand((8, 2), requires_grad=True)
s = []
s_t = []
for input_tensor in a:
    output_tensor = nx(input_tensor)
    s.append(output_tensor[0])
    s_t_value = grad(output_tensor[0], input_tensor)[0][0]
    s_t.append(s_t_value)
print(s_t)
But the output is:
[tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326), tensor(-0.1326)]
The first thing to change, if you want the gradients with respect to r, is to set the requires_grad flag to True for this tensor:
nx = net_x()
r = torch.tensor([1.0,2.0], requires_grad=True)
Then, as explained in the autograd documentation, grad computes the gradients of outputs with respect to inputs, so you need to save the output of the model:
y = nx(r)
Now you can compute the gradients with respect to r. But there is one last issue: grad only knows how to propagate gradients from a scalar tensor, which y is not. So you need to compute the gradients with respect to each coordinate:
for x in y:
    print(grad(x, r, retain_graph=True))
or equivalently:
for i in range(y.shape[0]):
    # prints the vector (dy_i/dr_0, dy_i/dr_1, ..., dy_i/dr_n)
    print(grad(y[i], r, retain_graph=True))
You need retain_graph because without this flag, the computational graph is cleared after the first gradient propagation. And there you have it: the derivative of each coordinate of nx(r) with respect to r!
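As a side note, if you just want all of these per-coordinate gradients at once, recent PyTorch versions (roughly 1.5 and later, if I recall correctly) provide torch.autograd.functional.jacobian, which does the looping for you:

from torch.autograd.functional import jacobian

# J has shape (4, 2): J[i, j] = d y_i / d r_j
J = jacobian(nx, r)
print(J)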
To answer your question in the comments :
Not an error, it's normal. You have a batched input of size (B, 2), with B = 8, and you get a batched output of shape (B, 4). Now, for each vector of the batched output, for each coordinate of this vector, you can compute the derivative with respect to the batched input, which yields a gradient of size (B, 2), like this:
for b in y:      # There are B vectors b of shape (4)
    for x in b:  # There are 4 coordinates
        # This prints a tensor of shape (B, 2)
        print(grad(x, r, retain_graph=True))
Now remember the way batches work: all samples in a batch are computed together to harness the power of the GPU, but they are actually completely independent. So all the b vectors are results of the network on different inputs. This means the gradient of the i-th vector b with respect to the j-th vector of the input must be 0 if i != j. Does that make sense? It's like computing f(x, y) = (x^2, y^2): the derivative of y^2 with respect to x is obviously 0! Well, consider x and y to be two samples from one batch, and you have your explanation for why there are a lot of zeros in your results.
A last sample of code to make it even clearer :
inputs = [torch.randn(1, 2, requires_grad=True) for i in range(8)]
r = torch.cat(inputs)  # shape: (8, 2)
y = nx(r)              # shape: (8, 4)
for i in range(len(y)):
    print(f"Gradients of y[{i}] wrt r[{i}]")
    for x in y[i]:
        # prints a tensor of size (2)
        print(grad(x, inputs[i], retain_graph=True))
On to why all the gradients are the same. This is because your neural network is completely linear: you have three nn.Linear layers and no non-linear activation function (as a consequence, this is literally equivalent to a network with only one layer). One property of linear layers is that their gradient is constant: d(alpha*x)/dx = alpha (independent of x). Therefore the gradients are identical along all dimensions. Just add non-linear activation layers like sigmoids and this behavior will not happen again.
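For example, a minimal sketch of how net_x's forward could be changed (sigmoids are just one choice of non-linearity):

def forward(self, x):
    x = torch.sigmoid(self.fc1(x))  # the non-linearity breaks the overall linearity
    x = torch.sigmoid(self.fc2(x))
    return self.out(x)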
I am trying to apply an inverse sigmoid function to the last layer of my convolutional neural network.
I am building the network in PyTorch and I want to take the output of the last convolutional layer and apply the inverse sigmoid function to it.
I have read that the logit function is the inverse of the sigmoid function, and I tried implementing it, but it's not working. I used the logit function from the scipy library:
def InverseSigmoid(self, x):
    x = logit(x)
    return x
Sigmoid is just 1 / (1 + e**-x), so to invert it you can use -ln((1 / x) - 1). For numerical stability you can also use -ln((1 / (x + 1e-8)) - 1). This is the inverse function of the sigmoid, and the implementation is straightforward.
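A minimal PyTorch sketch of that formula (the eps value here is an arbitrary choice for stability, not something prescribed):

import torch

def inverse_sigmoid(x, eps=1e-8):
    # logit: -ln(1/x - 1) == ln(x / (1 - x)); eps avoids division by zero at x = 0
    return -torch.log(1.0 / (x + eps) - 1.0)

z = torch.tensor([-2.0, 0.0, 3.0])
print(inverse_sigmoid(torch.sigmoid(z)))  # approximately recovers [-2, 0, 3]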
Thank you in advance for any help with this! I have been given the Python code for a simple single-layer perceptron, with the task of altering the code so it becomes a multi-layer perceptron. I'm still very new to all of this, but from what I understand, the repeating feed-forward and back-propagation cycle is what creates the hidden layers. Given the following code, what should be altered to help create these hidden layers?
import numpy as np

# Creating a numerically stable logistic s-shaped definition to call
def sigmoid(x):
    x = np.clip(x, -500, 500)
    if x.any() >= 0:
        return 1/(1 + np.exp(-x))
    else:
        return np.exp(x)/(1 + np.exp(x))
# define the dimensions and set the weights to random numbers
def init_parameters(dim1, dim2=1, std=1e-1, random=True):
    if random:
        return np.random.random([dim1, dim2])*std
    else:
        return np.zeros([dim1, dim2])
# Single layer network: Forward Prop
# Passed the weight vector, bias vector, the input vector and Y
def fwd_prop(W1, bias, X, Y):
    Z1 = np.dot(W1, X) + bias  # dot product of the weights and X, plus bias
    A1 = sigmoid(Z1)           # uses sigmoid to create a predicted vector
    return A1
# Single layer network: Backprop
def back_prop(A1, W1, bias, X, Y):
    m = np.shape(X)[1]  # number of inputs, used to scale the cost by 1/m
    # Cross entropy loss function
    cost = (-1/m)*np.sum(Y*np.log(A1) + (1-Y)*np.log(1-A1))  # cost of error
    dZ1 = A1 - Y  # subtract actual from predicted activations
    dW1 = (1/m) * np.dot(dZ1, X.T)  # calc weight gradient
    dBias = (1/m) * np.sum(dZ1, axis=1, keepdims=True)  # calc bias gradient
    grads = {"dW1": dW1, "dB1": dBias}  # weight and bias gradients from backprop
    return grads, cost
def run_grad_desc(num_epochs, learning_rate, X, Y, n_1):
    n_0, m = np.shape(X)
    W1 = init_parameters(n_1, n_0, random=True)  # keyword arg so std keeps its default
    B1 = init_parameters(n_1, 1, random=True)
    loss_array = np.ones([num_epochs])*np.nan  # resets the loss_array to NaNs
    for i in np.arange(num_epochs):
        A1 = fwd_prop(W1, B1, X, Y)                # get predicted vector
        grads, cost = back_prop(A1, W1, B1, X, Y)  # get gradients and the cost from BP
        W1 = W1 - learning_rate*grads["dW1"]       # update weight vector
        B1 = B1 - learning_rate*grads["dB1"]       # update bias vector
        loss_array[i] = cost                       # store the cross-entropy loss
    parameter = {"W1": W1, "B1": B1}
    return parameter, loss_array
We've also been asked to be able to adjust the number of nodes in a hidden layer. Being honest, I am completely lost here and am not even clear on what the nodes represent, so help here would be appreciated as well. Thanks y'all.
It looks like this network is not even single layer. Normally, "single layer" means one layer of hidden neurons; this network outputs the activations of what would normally be the hidden layer.
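To give you a concrete picture, here is a rough sketch (my own naming, not an official solution to your assignment, reusing your sigmoid and numpy import) of how fwd_prop and back_prop could be extended with one hidden layer; W2 and B2 are the new output-layer parameters, and n_1 becomes the number of hidden nodes:

def fwd_prop2(W1, B1, W2, B2, X):
    Z1 = np.dot(W1, X) + B1   # hidden layer pre-activation
    A1 = sigmoid(Z1)          # hidden layer activations (these are the "nodes")
    Z2 = np.dot(W2, A1) + B2  # output layer pre-activation
    A2 = sigmoid(Z2)          # predictions
    return A1, A2

def back_prop2(A1, A2, W2, X, Y):
    m = np.shape(X)[1]
    cost = (-1/m)*np.sum(Y*np.log(A2) + (1-Y)*np.log(1-A2))
    dZ2 = A2 - Y                             # output-layer error
    dW2 = (1/m) * np.dot(dZ2, A1.T)
    dB2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)  # error propagated back through the sigmoid
    dW1 = (1/m) * np.dot(dZ1, X.T)
    dB1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    return {"dW1": dW1, "dB1": dB1, "dW2": dW2, "dB2": dB2}, cost

Each column of A1 holds the hidden-node values for one sample, so "adjusting the number of nodes" just means changing n_1, the first dimension of W1 and B1.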
My advice to you would be to start studying the basics of neural networks. There are lots of great resources, including on YouTube. For backpropagation, a good place to start is here.
Also note that if you are in a hurry, using autograd tools like TensorFlow or PyTorch takes care of the differentiation for you. Of course, if you are doing this to learn the details of neural networks, then building one from scratch is much better.
In a TensorFlow optimizer (Python), the method _apply_dense gets called separately for the neuron weights (layer connections) and for the bias weights, but I would like to use both in this method.
def _apply_dense(self, grad, weight):
    ...
For example: a fully connected neural network with two hidden layers of two neurons each, plus a bias for each layer.
If we take a look at layer 2, _apply_dense receives one call for the neuron weights and one call for the bias weights. But I would need either both matrices in a single call of _apply_dense, or one combined weight matrix containing both. Here X_2X_4, B_1X_4, ... is just a notation for the weight of the connection between two neurons; B_1X_4, for example, is a placeholder for the weight between B_1 and X_4.
How to do this?
MWE
For a minimal working example, here is a stochastic gradient descent optimizer implementation with momentum. For every layer, the momentum of all incoming connections from other neurons is reduced to its mean (see ndims == 2). What I need instead is the mean of not only the momentum values from the incoming neuron connections but also from the incoming bias connections (as described above).
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
from tensorflow.python.training import optimizer


class SGDmomentum(optimizer.Optimizer):
    def __init__(self, learning_rate=0.001, mu=0.9, use_locking=False, name="SGDmomentum"):
        super(SGDmomentum, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = mu
        self._lr_t = None
        self._mu_t = None

    def _create_slots(self, var_list):
        for v in var_list:
            self._zeros_slot(v, "a", self._name)

    def _apply_dense(self, grad, weight):
        learning_rate_t = tf.cast(self._lr_t, weight.dtype.base_dtype)
        mu_t = tf.cast(self._mu_t, weight.dtype.base_dtype)
        momentum = self.get_slot(weight, "a")

        if momentum.get_shape().ndims == 2:    # neuron weights
            momentum_mean = tf.reduce_mean(momentum, axis=1, keep_dims=True)
        elif momentum.get_shape().ndims == 1:  # bias weights
            momentum_mean = momentum
        else:
            momentum_mean = momentum

        momentum_update = grad + (mu_t * momentum_mean)
        momentum_t = tf.assign(momentum, momentum_update, use_locking=self._use_locking)

        weight_update = learning_rate_t * momentum_t
        weight_t = tf.assign_sub(weight, weight_update, use_locking=self._use_locking)

        return tf.group(*[weight_t, momentum_t])

    def _prepare(self):
        self._lr_t = tf.convert_to_tensor(self._lr, name="learning_rate")
        self._mu_t = tf.convert_to_tensor(self._mu, name="momentum_term")
For a simple neural network: https://raw.githubusercontent.com/aymericdamien/TensorFlow-Examples/master/examples/3_NeuralNetworks/multilayer_perceptron.py (only change the optimizer to the custom SGDmomentum optimizer)
Update: I'll try to give a better answer (or at least some ideas) now that I have some understanding of your goal but, as you suggest in the comments, there is probably no infallible way of doing this in TensorFlow.
Since TF is a general computation framework, there is no good way of determining what pairs of weights and biases exist in a model (or whether it is a neural network at all). Here are some possible approaches to the problem that I can think of:
Annotating the tensors. This is probably not practical since you already said you have no control over the model, but an easy option would be to add extra attributes to the tensors to signify the weight/bias relationships. For example, you could do something like W.bias = B and B.weight = W, and then in _apply_dense check hasattr(weight, "bias") and hasattr(weight, "weight") (there may be better designs in this sense); a rough sketch follows this list.
You can look into some framework built on top of TensorFlow where you may have better information about the model structure. For example, Keras is a layer-based framework that implements its own optimizer classes (based on TensorFlow or Theano). I'm not too familiar with the code or its extensibility, but you probably have more tools to use there.
Detect the structure of the network yourself from the optimizer. This is quite complicated, but theoretically possible. From the loss tensor passed to the optimizer, it should be possible to "climb up" the model graph to reach all of its nodes (taking the .op of the tensors and the .inputs of the ops). You could detect tensor multiplications and additions with variables and skip everything else (activations, loss computation, etc.) to determine the structure of the network; if the model does not match your expectations (e.g. there are no multiplications, or there is a multiplication without a later addition) you can raise an exception indicating that your optimizer cannot be used for that model.
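A rough sketch of the first option (the attribute names W.bias and B.weight are just an invented convention, nothing TensorFlow itself knows about):

# When building the model (this requires control over model construction)
n_in, n_out = 2, 4  # example layer sizes
W = tf.Variable(tf.truncated_normal([n_in, n_out]), name="W")
B = tf.Variable(tf.zeros([n_out]), name="B")
W.bias = B    # tf.Variable is a regular Python object, so extra attributes stick
B.weight = W

# Then, inside _apply_dense:
# if hasattr(weight, "bias"):
#     bias_momentum = self.get_slot(weight.bias, "a")
#     ...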
Old answer, kept for the record.
I'm not 100% clear on what you are trying to do, so I'm not sure if this really answers your question.
Let's say you have a dense layer transforming an input of size M to an output of size N. According to the convention you show, you'd have an N × M weights matrix W and an N-sized bias vector B. Then an input vector X of size M (or a batch of inputs of size M × K) would be processed by the layer as W · X + B, followed by the activation function (in the case of a batch, the addition would be a "broadcasted" operation). In TensorFlow:
X = ...  # Input batch of size M x K
W = ...  # Weights of size N x M
B = ...  # Biases of size N

Y = tf.matmul(W, X) + B[:, tf.newaxis]  # Output of size N x K
# Activation...
If you want, you can always put W and B together in a single extended weights matrix W*, basically adding B as a new column of W, so W* would be N × (M + 1). Then you just need to append a new element with a constant 1 to the input vector X (or a new row of ones if it's a batch), so you get X* of size M + 1 (or (M + 1) × K for a batch). The product W* · X* then gives you the same result as before. In TensorFlow:
X = ...       # Input batch of size M x K
W_star = ...  # Extended weights of size N x (M + 1)

# You can still have a "view" of the original W and B if you need it
W = W_star[:, :-1]  # the N x M weights part
B = W_star[:, -1]   # the N biases in the last column

X_star = tf.concat([X, tf.ones_like(X[:1])], axis=0)  # append a row of ones
Y = tf.matmul(W_star, X_star)  # Output of size N x K
# Activation...
Now you can compute the gradients and updates for weights and biases together. A drawback of this approach is that if you want to apply regularization, you should be careful to apply it only to the weights part of the matrix, not to the biases.
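For instance, a sketch of an L2 penalty restricted to the weights part, following the shapes above:

# Penalize every column of W_star except the last one (the bias column)
l2_loss = tf.reduce_sum(tf.square(W_star[:, :-1]))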
I am trying to understand backpropagation in a simple 3 layered neural network with MNIST.
There is the input layer with weights and a bias. The labels are MNIST so it's a 10 class vector.
The second layer is a linear transform. The third layer is a softmax activation, to get the output as probabilities.
Backpropagation calculates the derivative at each step and calls this the gradient.
Previous layers append the global or previous gradient to the local gradient. I am having trouble calculating the local gradient of the softmax.
Several resources online go through the explanation of the softmax and its derivative, and some even give code samples of the softmax itself:
def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)
The derivative is explained with respect to when i = j and when i != j. This is a simple code snippet I've come up with and was hoping to verify my understanding:
def softmax(self, x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)

def forward(self):
    # self.input is a vector of length 10
    # and is the output of
    # (w * x) + b
    self.value = self.softmax(self.input)

def backward(self):
    for i in range(len(self.value)):
        for j in range(len(self.input)):
            if i == j:
                self.gradient[i] = self.value[i] * (1 - self.input[i])
            else:
                self.gradient[i] = -self.value[i] * self.input[j]
Then self.gradient is the local gradient, which is a vector. Is this correct? Is there a better way to write this?
I am assuming you have a 3-layer NN where W1, b1 are associated with the linear transformation from the input layer to the hidden layer, and W2, b2 are associated with the linear transformation from the hidden layer to the output layer. Z1 and Z2 are the input vectors to the hidden layer and output layer, a1 and a2 are the outputs of the hidden layer and output layer, and a2 is your predicted output. delta3 and delta2 are the (backpropagated) errors, from which you get the gradients of the loss function with respect to the model parameters.
This is the general scenario for a 3-layer NN (input layer, one hidden layer and one output layer). You can follow the procedure described above to compute the gradients, which should be easy to compute! Since another answer to this post already pointed out the problem in your code, I am not repeating it here.
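Under that notation, here is a small self-contained sketch of what the computation could look like (assuming a sigmoid hidden layer and a softmax output with cross-entropy loss; all names are mine):

import numpy as np

def softmax(x):
    exps = np.exp(x - x.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

# Tiny random example: m samples, n_in inputs, n_h hidden units, 10 classes
m, n_in, n_h = 4, 8, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n_in))
Y = np.eye(10)[rng.integers(0, 10, m)]  # one-hot labels

W1, b1 = rng.standard_normal((n_in, n_h)), np.zeros(n_h)
W2, b2 = rng.standard_normal((n_h, 10)), np.zeros(10)

# Forward pass
Z1 = X @ W1 + b1
a1 = 1 / (1 + np.exp(-Z1))  # sigmoid hidden layer
Z2 = a1 @ W2 + b2
a2 = softmax(Z2)            # predicted probabilities

# Backward pass
delta3 = a2 - Y                           # softmax + cross-entropy error
dW2 = a1.T @ delta3
db2 = delta3.sum(axis=0)
delta2 = (delta3 @ W2.T) * a1 * (1 - a1)  # error pushed back through the sigmoid
dW1 = X.T @ delta2
db1 = delta2.sum(axis=0)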
As I said, you have n^2 partial derivatives.
If you do the math, you find that dSM[i]/dx[k] is SM[i] * (δ(i,k) - SM[k]), where δ(i,k) = dx[i]/dx[k] is 1 if i = k and 0 otherwise, so you should have:
if i == j:
    self.gradient[i,j] = self.value[i] * (1 - self.value[i])
else:
    self.gradient[i,j] = -self.value[i] * self.value[j]
instead of
if i == j:
    self.gradient[i] = self.value[i] * (1 - self.input[i])
else:
    self.gradient[i] = -self.value[i] * self.input[j]
By the way, this may be computed more concisely like so (vectorized):
SM = self.value.reshape((-1, 1))
jac = np.diagflat(self.value) - np.dot(SM, SM.T)
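As a quick standalone sanity check of that vectorized form (my own sketch): the rows of the softmax Jacobian must sum to zero, because the softmax outputs always sum to one.

import numpy as np

x = np.random.randn(5)
value = np.exp(x) / np.sum(np.exp(x))     # softmax output
SM = value.reshape((-1, 1))
jac = np.diagflat(value) - np.dot(SM, SM.T)
print(np.allclose(jac.sum(axis=1), 0.0))  # True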
np.exp is not stable because it can overflow to Inf, so you should subtract the maximum of x first.
def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x - x.max())
    return exps / np.sum(exps)
If x is a matrix, please check the softmax function in this notebook.
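For reference, one common way (a sketch, not the notebook's exact code) to make this work row-wise on a matrix is to shift and normalize along the last axis:

import numpy as np

def softmax(x):
    """Row-wise numerically stable softmax; also works for 1-D vectors."""
    shifted = x - x.max(axis=-1, keepdims=True)  # guard against overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)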