Thank you in advance for any help with this! I have been given the Python code for a simple single-layer perceptron, with the task of altering the code so it becomes a multi-layer perceptron. I'm still very new to all of this, but from what I understand the repeating feed-forward and back-propagation cycle is what creates the hidden layers. Given the following code, what should be altered to help create these hidden layers?
import numpy as np

# Creating a numerically stable logistic s-shaped (sigmoid) definition to call
def sigmoid(x):
    x = np.clip(x, -500, 500)
    if x.any() >= 0:
        return 1/(1 + np.exp(-x))
    else:
        return np.exp(x)/(1 + np.exp(x))

# Define the dimensions and set the weights to random numbers
def init_parameters(dim1, dim2=1, std=1e-1, random=True):
    if random:
        return np.random.random([dim1, dim2])*std
    else:
        return np.zeros([dim1, dim2])

# Single-layer network: forward prop
# Passed the weight vector, bias vector, the input vector and Y
def fwd_prop(W1, bias, X, Y):
    Z1 = np.dot(W1, X) + bias   # dot product of the weights and X, plus the bias
    A1 = sigmoid(Z1)            # uses sigmoid to create the predicted vector
    return A1

# Single-layer network: backprop
def back_prop(A1, W1, bias, X, Y):
    m = np.shape(X)[1]   # used to scale the cost by the number of inputs (1/m)
    # Cross-entropy loss function
    cost = (-1/m)*np.sum(Y*np.log(A1) + (1-Y)*np.log(1-A1))   # cost of error
    dZ1 = A1 - Y                                        # subtract actual from predicted
    dW1 = (1/m) * np.dot(dZ1, X.T)                      # gradient for the weight vector
    dBias = (1/m) * np.sum(dZ1, axis=1, keepdims=True)  # gradient for the bias vector
    grads = {"dW1": dW1, "dB1": dBias}                  # weight and bias gradients from backprop
    return grads, cost

def run_grad_desc(num_epochs, learning_rate, X, Y, n_1):
    n_0, m = np.shape(X)
    W1 = init_parameters(n_1, n_0, True)
    B1 = init_parameters(n_1, 1, True)
    loss_array = np.ones([num_epochs])*np.nan   # resets the loss_array to NaNs
    for i in np.arange(num_epochs):
        A1 = fwd_prop(W1, B1, X, Y)                 # get the predicted vector
        grads, cost = back_prop(A1, W1, B1, X, Y)   # get the gradients and the cost from BP
        W1 = W1 - learning_rate*grads["dW1"]        # update the weight vector: LR * gradient
        B1 = B1 - learning_rate*grads["dB1"]        # update the bias: LR * gradient
        loss_array[i] = cost                        # loss array gets the cross-entropy values
    parameter = {"W1": W1, "B1": B1}
    return parameter, loss_array
We've also been asked to be able to adjust the number of nodes in the hidden layer. Being honest, I am completely lost here and am not clear on what the nodes even represent, so help here would be appreciated as well. Thanks y'all.
It looks like this network is not even a single-layer network in the usual sense. Normally "single layer" means one layer of hidden neurons; this network directly outputs the activations of what would normally be the hidden layer.
My advice would be to start studying the basics of neural networks. There are lots of great resources, including on YouTube. For backpropagation, a good place to start is here.
Also note that if you are in a hurry, using autograd tools like Tensorflow or Pytorch takes care of the differentiation for you. Of course, if you are doing this to learn the details of neural networks, then building one from scratch is much better.
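To make that concrete, here is a minimal, untested sketch of how the question's fwd_prop and back_prop could be extended with one real hidden layer: W1/B1 map the input to n_1 hidden nodes (the "nodes" are simply the units whose activations A1 feed the next layer, so n_1 is the knob you adjust), and W2/B2 map those activations to the output. The cross-entropy cost and sigmoid activations are kept from the original code; treat this as an illustration, not a drop-in solution.

def fwd_prop(W1, B1, W2, B2, X):
    Z1 = np.dot(W1, X) + B1        # hidden-layer pre-activation (n_1 x m)
    A1 = sigmoid(Z1)               # hidden-layer activations
    Z2 = np.dot(W2, A1) + B2       # output-layer pre-activation
    A2 = sigmoid(Z2)               # predicted output
    return A1, A2

def back_prop(A1, A2, W2, X, Y):
    m = np.shape(X)[1]
    cost = (-1/m)*np.sum(Y*np.log(A2) + (1-Y)*np.log(1-A2))
    dZ2 = A2 - Y                                       # output error
    dW2 = (1/m) * np.dot(dZ2, A1.T)
    dB2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * A1 * (1 - A1)            # backpropagate through the sigmoid
    dW1 = (1/m) * np.dot(dZ1, X.T)
    dB1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    grads = {"dW1": dW1, "dB1": dB1, "dW2": dW2, "dB2": dB2}
    return grads, cost

run_grad_desc would then also initialize W2 and B2 (e.g. W2 = init_parameters(1, n_1) and B2 = init_parameters(1, 1) for a single output) and apply the same learning_rate * gradient update to all four parameters each epoch.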
I am handling a time series dataset with n timesteps, m features and k objects.
As a result my feature tensor has a shape of (n, k, m), while my targets have a shape of (n, m).
I want to predict the targets for every timestep and object, but with the same weights for every object. My loss function looks like this:
average_loss = loss_func(prediction, labels)
sum_loss = loss_func(sum(prediction), sum(labels))
loss = loss_weight * average_loss + (1-loss_weight) * sum_loss
My plan is to not only predict every item as well as possible, but also to predict the sum over all items well. loss_weight is a constant.
Currently I am doing this kind of ugly solution:
features = local_batch.squeeze(dim = 0)
labels = torch.unsqueeze(local_labels.squeeze(dim = 0), 1)
prediction = net(features)
I set my batch size to 1 and squeeze it so that the k objects become my batch.
My network looks like this:
import torch
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.predict = torch.nn.Linear(n_hidden, n_output)   # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))   # activation function for the hidden layer
        x = self.predict(x)          # linear output
        return x
How do I make sure I do a reasonable convolution over the object dimension in order to keep the same weights for all objects, without committing to batch size = 1? Also, how do I achieve the same loss function, where I compute the loss of the prediction sum vs. the target sum for every timestep?
It's not exactly ugly -- I would do the same but generalize it a bit for batch size >1 using view.
# Using your notations
n, k, m = features.shape
features = local_batch.view(n*k, m)
prediction = net(features).view(n, k, m)
With the prediction in the correct shape (n, k, m), implementing your loss function should not be difficult.
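For the combined loss, a hedged sketch of the structure (per_object_labels is a hypothetical tensor of shape (n, k, m) matching prediction, and loss_func is an elementwise criterion such as torch.nn.MSELoss(); summing over dim=1 collapses the k objects):

average_loss = loss_func(prediction, per_object_labels)
sum_loss = loss_func(prediction.sum(dim=1), per_object_labels.sum(dim=1))  # compare the sums over all k objects
loss = loss_weight * average_loss + (1 - loss_weight) * sum_loss
loss.backward()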
We are currently working on a project that involves neural networks. Our task consists of upscaling a 16x16 image all the way up to a 64x64 image.
The model used in the code example consists of only one hidden layer with 5 neurons in it. The hidden layer uses a ReLU activation and the output layer a simple linear one. The activations get saved and loaded when they need to be accessed.
There is something wrong with the calculation of the loss/cost, and therefore the weights get updated the wrong way: the weights keep getting bigger if they are positive and more negative if they are negative.
We have separated the back-propagation of the hidden layer and the output layer, since they calculate the errors differently. There are also some shape errors in the back-propagation for the hidden layer after the output layer, which could indicate that the calculation is somewhat wrong. We used https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html to derive our error terms.
We are by no means experts on the topic and would be happy to accept any help that we can get, so feel free to ask us any questions you like.
def linearfunc(x):   # the linear output function used for forward propagation
    return x

def linderiv(x):     # the derivative of the linear function, used for back-propagation
    return 1

def backward():      # back-propagation
    alpha = 0.001    # learning rate
    Weights_Output = np.random.randn(5, 12288)  # weights of the output layer / 5 is the number of neurons in the hidden layer, 12288 the number of neurons in the output layer
    Bias_Output = np.random.randn(12288, 1)     # bias of the output layer / 12288 is derived from the features of the output image: 64*64*3 (3 for the RGB values)
    Weights = np.random.randn(768, 5)           # weights of the only hidden layer / 768 is derived from the input image features: 16*16*3
    Bias = np.random.randn(5, 1)                # bias of the only hidden layer

    # Start of the back-propagation for the output layer
    DW_Output = np.zeros((12288, 5))    # preparing the shape of DW in the output layer
    DB_Output = np.zeros((12288, 1))    # preparing the shape of DB in the output layer
    da_dz2 = linderiv(Z_Output)         # derivative of the linear function applied to the Z's of the output layer / step 1
    y = np.load('DatasetY.npy')         # loading the output features (Y values) of our dataset
    a = np.load('Activations.npy')      # loading the activations of the layer before the output layer / our only hidden layer
    Error_Output = (activations_Output - y) * da_dz2  # DZ (the error) of the output layer: (output activations - Y values) times the derivative of the output activation applied to the output Z's
    DW_Output = np.dot(Error_Output, activations.T)   # DW of the output layer: dot product of DZ_Output and the hidden activations, which should then be divided by the number of images in the training set
    DB_Output = np.sum(Error_Output, axis=1, keepdims=True) * (1/10000)  # DB of the output layer: sum of the DZ's along axis 1 to get the right shape
    Weights_Output = Weights_Output - alpha * DW_Output.T  # gradient descent step for the output weights / the first time they update they all go way too high or too low
    Bias_Output = Bias_Output - alpha * DB_Output          # gradient descent step for the output bias / same problem

    # Start of the back-propagation for the hidden layer
    dw = np.zeros((5, 768))             # preparing the shape of dw in the only hidden layer
    db = np.zeros((5, 1))               # preparing the shape of db in the only hidden layer
    a_2 = np.load('DatasetX.npy')       # loading the activations of the layer before our hidden layer, i.e. the input features
    da_dz = z >= 0                      # derivative of the ReLU function applied to the Z's of the hidden layer / step 1
    da_dz = da_dz.astype(np.int)        # turning all TRUE values into 1 and all FALSE values into 0 / step 2
    Error = Error_Output * np.dot(Weights_Output.T, da_dz)  # << gives a shape error when updating the weights / dz (the error) of the hidden layer: the output-layer weights and error multiplied by the ReLU derivative applied to the hidden Z's
    dw = np.dot(Error, a_2.T) * (1/10000)                   # dw of the hidden layer: dot product of dz and the activations, divided by the number of images in the training set
    db = np.sum(Error, axis=1, keepdims=True) * (1/10000)   # db of the hidden layer: sum of the dz's along axis 1 to get the right shape
    Weights = Weights - alpha * dw.T    # gradient descent step for the hidden weights / the first time they update they all go negative
    Bias = Bias - alpha * db            # gradient descent step for the hidden bias / same problem
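For reference, here is a shape-consistent sketch of the two back-propagation steps described above, assuming the examples are stored column-wise (inputs 768 x 10000, targets 12288 x 10000) and the weight shapes from the code; the names follow the code and this is only an illustration of consistent shapes, not a tested fix:

m = 10000                                                     # number of training images
Error_Output = (activations_Output - y) * linderiv(Z_Output)  # (12288, m)
DW_Output = np.dot(Error_Output, activations.T) / m           # (12288, 5)
DB_Output = np.sum(Error_Output, axis=1, keepdims=True) / m   # (12288, 1)

da_dz = (z >= 0).astype(float)                                # ReLU derivative, (5, m)
Error = np.dot(Weights_Output, Error_Output) * da_dz          # (5, m); note the order of the dot product
dw = np.dot(Error, a_2.T) / m                                 # (5, 768)
db = np.sum(Error, axis=1, keepdims=True) / m                 # (5, 1)

Weights_Output = Weights_Output - alpha * DW_Output.T         # matches (5, 12288)
Bias_Output = Bias_Output - alpha * DB_Output                 # matches (12288, 1)
Weights = Weights - alpha * dw.T                              # matches (768, 5)
Bias = Bias - alpha * db                                      # matches (5, 1)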
In a TensorFlow optimizer (Python), the method _apply_dense gets called separately for the neuron weights (layer connections) and for the bias weights, but I would like to use both in one call of this method.
def _apply_dense(self, grad, weight):
    ...
For example: a fully connected neural network with two hidden layers of two neurons each and a bias for each layer.
If we take a look at layer 2, _apply_dense gets one call for the neuron weight matrix and another call for the bias weight vector. But I would need either both matrices in one call of _apply_dense, or a single combined weight matrix that covers both.
X_2X_4, B_1X_4, ... is just notation for the weight of the connection between two neurons; B_1X_4 is only a placeholder for the weight between B_1 and X_4.
How to do this?
MWE
As a minimal working example, here is a stochastic gradient descent optimizer implementation with momentum. For every layer, the momentum of all incoming connections from other neurons is reduced to the mean (see ndims == 2). What I need instead is the mean over not only the momentum values of the incoming neuron connections but also those of the incoming bias connection (as described above).
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
from tensorflow.python.training import optimizer


class SGDmomentum(optimizer.Optimizer):
    def __init__(self, learning_rate=0.001, mu=0.9, use_locking=False, name="SGDmomentum"):
        super(SGDmomentum, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = mu
        self._lr_t = None
        self._mu_t = None

    def _create_slots(self, var_list):
        for v in var_list:
            self._zeros_slot(v, "a", self._name)

    def _apply_dense(self, grad, weight):
        learning_rate_t = tf.cast(self._lr_t, weight.dtype.base_dtype)
        mu_t = tf.cast(self._mu_t, weight.dtype.base_dtype)
        momentum = self.get_slot(weight, "a")

        if momentum.get_shape().ndims == 2:    # neuron weights
            momentum_mean = tf.reduce_mean(momentum, axis=1, keep_dims=True)
        elif momentum.get_shape().ndims == 1:  # bias weights
            momentum_mean = momentum
        else:
            momentum_mean = momentum

        momentum_update = grad + (mu_t * momentum_mean)
        momentum_t = tf.assign(momentum, momentum_update, use_locking=self._use_locking)

        weight_update = learning_rate_t * momentum_t
        weight_t = tf.assign_sub(weight, weight_update, use_locking=self._use_locking)

        return tf.group(*[weight_t, momentum_t])

    def _prepare(self):
        self._lr_t = tf.convert_to_tensor(self._lr, name="learning_rate")
        self._mu_t = tf.convert_to_tensor(self._mu, name="momentum_term")
For a simple neural network: https://raw.githubusercontent.com/aymericdamien/TensorFlow-Examples/master/examples/3_NeuralNetworks/multilayer_perceptron.py (only change the optimizer to the custom SGDmomentum optimizer)
Update: I'll try to give a better answer (or at least some ideas) now that I have some understanding of your goal, but, as you suggest in the comments, there is probably no infallible way of doing this in TensorFlow.
Since TF is a general computation framework, there is no good way of determining what pairs of weights and biases are there in a model (or if it is a neural network at all). Here are some possible approaches to the problem that I can think of:
Annotating the tensors. This is probably not practical since you already said you have no control over the model, but an easy option would be to add extra attributes to the tensors to signify the weight/bias relationships. For example, you could do something like W.bias = B and B.weight = W, and then in _apply_dense check hasattr(weight, "bias") and hasattr(weight, "weight") (there may be some better designs in this sense).
You can look into some framework built on top of TensorFlow where you may have better information about the model structure. For example, Keras is a layer-based framework that implements its own optimizer classes (based on TensorFlow or Theano). I'm not too familiar with the code or its extensibility, but probably you have more tools there to use.
Detect the structure of the network yourself from the optimizer. This is quite complicated, but theoretically possible. From the loss tensor passed to the optimizer, it should be possible to "climb up" the model graph to reach all of its nodes (taking the .op of the tensors and the .inputs of the ops). You could detect tensor multiplications and additions with variables and skip everything else (activations, loss computation, etc.) to determine the structure of the network; if the model does not match your expectations (e.g. there are no multiplications, or there is a multiplication without a later addition) you can raise an exception indicating that your optimizer cannot be used for that model. A rough sketch of this idea follows below.
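A rough, untested sketch of that graph walk, assuming dense layers show up as a MatMul followed by an Add/BiasAdd (anything else would need extra cases):

import tensorflow as tf

def collect_weight_bias_pairs(loss):
    """Walk up the graph from the loss tensor and collect candidate
    (MatMul, Add/BiasAdd) op pairs, i.e. likely weight/bias applications."""
    pairs, seen, stack = [], set(), [loss.op]
    while stack:
        op = stack.pop()
        if op.name in seen:
            continue
        seen.add(op.name)
        if op.type in ("Add", "AddV2", "BiasAdd"):
            matmul_inputs = [t.op for t in op.inputs if t.op.type == "MatMul"]
            if matmul_inputs:
                pairs.append((matmul_inputs[0], op))   # (weight multiplication, bias addition)
        stack.extend(t.op for t in op.inputs)          # keep climbing towards the inputs
    return pairs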
Old answer, kept for the sake of keeping.
I'm not 100% clear on what you are trying to do, so I'm not sure if this really answers your question.
Let's say you have a dense layer transforming an input of size M to an output of size N. According to the convention you show, you'd have an N × M weights matrix W and an N-sized bias vector B. Then, an input vector X of size M (or a batch of inputs of size M × K) would be processed by the layer as W · X + B, followed by the activation function (in the case of a batch, the addition would be a "broadcasted" operation). In TensorFlow:
X = ... # Input batch of size M x K
W = ... # Weights of size N x M
B = ... # Biases of size N
Y = tf.matmul(W, X) + B[:, tf.newaxis] # Output of size N x K
# Activation...
If you want, you can always put W and B together in a single extended weights matrix W*, basically adding B as an extra column of W, so W* would be N × (M + 1). Then you just need to append a constant 1 to the input vector X (or a new row of ones if it's a batch), so you would get X* with size M + 1 (or (M + 1) × K for a batch). The product W* · X* then gives you the same result as before. In TensorFlow:
X = ... # Input batch of size M x K
W_star = ... # Extended weights of size N x (M + 1)
# You can still have a "view" of the original W and B if you need it
W = W_star[:, :-1]
B = W_star[:, -1]
X_star = tf.concat([X, tf.ones_like(X[:1])], axis=0)
Y = tf.matmul(W_star, X_star) # Output of size N x K
# Activation...
Now you can compute gradients and updates for weights and biases together. A drawback of this approach is that if you want to apply regularization then you should be careful to apply it only on the weights part of the matrix, not on the biases.
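For instance, a minimal sketch of weight-only L2 regularization under this layout, where reg_strength is a hypothetical constant:

reg_strength = 0.01                                   # hypothetical weighting factor
weights_only = W_star[:, :-1]                         # all columns except the bias column
regularization = reg_strength * tf.nn.l2_loss(weights_only)
# ...then add `regularization` to the loss you minimize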
Following Tensorflow LSTM Regularization, I am trying to add a regularization term to the cost function when training the parameters of LSTM cells.
Putting aside some constants I have:
def RegularizationCost(trainable_variables):
    cost = 0
    for v in trainable_variables:
        cost += r(tf.reduce_sum(tf.pow(r(v.name), 2)))
    return cost

...

regularization_cost = tf.placeholder(tf.float32, shape=())
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)

...

tv = tf.trainable_variables()
s = tf.Session()
r = s.run

...

while (...):
    ...
    reg_cost = RegularizationCost(tv)
    r(optimizer, feed_dict={x: x_b, y: y_b, regularization_cost: reg_cost})
The problem is that adding the regularization term hugely slows down the learning process, and the regularization term reg_cost visibly increases with each iteration while the term associated with pred - y pretty much stagnates, i.e. reg_cost does not seem to be taken into account.
I suspect I am adding this term in a completely wrong way. I did not know how to add it to the cost function itself, so I used a workaround with a scalar tf.placeholder and "manually" calculated the regularization cost. How do I do it properly?
Compute the L2 loss only once:
tv = tf.trainable_variables()
regularization_cost = tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv ])
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate = 0.01).minimize(cost)
You might want to remove the variables that are biases, as those should not be regularized, for example by filtering on the variable name as sketched below.
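A minimal sketch of that filtering (this assumes your bias variables actually contain "bias" in their names, which depends on how they were created):

tv = tf.trainable_variables()
regularization_cost = tf.reduce_sum(
    [tf.nn.l2_loss(v) for v in tv if "bias" not in v.name.lower()]   # skip bias variables
)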
It slows down because your code creates new graph nodes in every iteration; each tf.XXX operation creates new nodes. This is not how you code with TF: first you create your whole graph, including the regularization terms, and then, in the while loop, you only execute it.
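For illustration, a sketch of the build-once pattern using the names from the question; reg_constant is a hypothetical weighting factor, and the loop condition is left as a placeholder just like in the question:

# Build the whole graph once, before the training loop
tv = tf.trainable_variables()
regularization_cost = tf.reduce_sum([tf.nn.l2_loss(v) for v in tv])
cost = tf.reduce_sum(tf.pow(pred - y, 2)) + reg_constant * regularization_cost
optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)

s = tf.Session()
s.run(tf.global_variables_initializer())
while (...):
    ...
    # Inside the loop, only execute the existing nodes; no new ops are created
    s.run(optimizer, feed_dict={x: x_b, y: y_b})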
I am trying to understand backpropagation in a simple 3-layer neural network on MNIST.
There is the input layer with weights and a bias. The labels are MNIST, so it's a 10-class vector.
The second layer is a linear transform. The third layer is the softmax activation to get the output as probabilities.
Backpropagation calculates the derivative at each step and calls this the gradient.
Previous layers combine the global (upstream) gradient with their local gradient. I am having trouble calculating the local gradient of the softmax.
Several resources online go through the explanation of the softmax and its derivative and even give code samples of the softmax itself:
def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)
The derivative is explained with respect to when i = j and when i != j. This is a simple code snippet I've come up with and was hoping to verify my understanding:
def softmax(self, x):
    """Compute the softmax of vector x."""
    exps = np.exp(x)
    return exps / np.sum(exps)

def forward(self):
    # self.input is a vector of length 10
    # and is the output of
    # (w * x) + b
    self.value = self.softmax(self.input)

def backward(self):
    for i in range(len(self.value)):
        for j in range(len(self.input)):
            if i == j:
                self.gradient[i] = self.value[i] * (1 - self.input[i])
            else:
                self.gradient[i] = -self.value[i] * self.input[j]
Then self.gradient is the local gradient which is a vector. Is this correct? Is there a better way to write this?
I am assuming you have a 3-layer NN where W1, b1 are associated with the linear transformation from the input layer to the hidden layer, and W2, b2 with the linear transformation from the hidden layer to the output layer. Z1 and Z2 are the inputs to the hidden layer and the output layer, a1 and a2 are their outputs, and a2 is your predicted output. delta3 and delta2 are the (backpropagated) errors, from which you get the gradients of the loss function with respect to the model parameters.
This is the general scenario for a 3-layer NN (input layer, one hidden layer and one output layer). You can follow the procedure described above to compute the gradients, which should be easy to compute! Since another answer to this post already pointed out the problem in your code, I am not repeating it.
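Since the original equations are not reproduced here, the following is only a hedged sketch of those gradients in the same notation, assuming row-wise examples, a sigmoid hidden activation (replace the a1 * (1 - a1) factor with the derivative of whatever activation you actually use), and a softmax output trained with cross-entropy loss:

import numpy as np

def gradients(X, Y, W2, a1, a2):
    """Sketch of dL/dW1, dL/db1, dL/dW2, dL/db2 in the notation above."""
    m = X.shape[0]
    delta3 = a2 - Y                                  # output-layer error (softmax + cross-entropy)
    dW2 = a1.T.dot(delta3) / m                       # gradient w.r.t. W2
    db2 = delta3.sum(axis=0, keepdims=True) / m      # gradient w.r.t. b2
    delta2 = delta3.dot(W2.T) * a1 * (1 - a1)        # error backpropagated to the hidden layer
    dW1 = X.T.dot(delta2) / m                        # gradient w.r.t. W1
    db1 = delta2.sum(axis=0, keepdims=True) / m      # gradient w.r.t. b1
    return dW1, db1, dW2, db2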
As I said, you have n^2 partial derivatives.
If you do the math, you find that dSM[i]/dx[k] is SM[i] * (dx[i]/dx[k] - SM[k]), where dx[i]/dx[k] is 1 when i = k and 0 otherwise, so you should have:
if i == j:
    self.gradient[i, j] = self.value[i] * (1 - self.value[i])
else:
    self.gradient[i, j] = -self.value[i] * self.value[j]
instead of
if i == j:
    self.gradient[i] = self.value[i] * (1 - self.input[i])
else:
    self.gradient[i] = -self.value[i] * self.input[j]
By the way, this may be computed more concisely like so (vectorized):
SM = self.value.reshape((-1,1))
jac = np.diagflat(self.value) - np.dot(SM, SM.T)
np.exp is not stable because it can overflow to Inf.
So you should subtract the maximum of x.
def softmax(x):
    """Compute the softmax of vector x."""
    exps = np.exp(x - x.max())
    return exps / np.sum(exps)
If x is a matrix, please check the softmax function in this notebook.
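In case that notebook is not at hand, a minimal sketch of a row-wise (matrix) variant:

def softmax_matrix(x):
    """Numerically stable softmax, applied row-wise to a 2-D array (one sample per row)."""
    exps = np.exp(x - x.max(axis=1, keepdims=True))   # subtract each row's maximum
    return exps / exps.sum(axis=1, keepdims=True)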