I am trying to backpropagate from scratch using batches and I am having issues calculating dx. First, I would like to start by defining variables to avoid confusion:
a - The activation value calculated by passing z through an activation function
z - The value before the activation function of the layer
x - The inputs into the layer
w - The weights that connect the inputs to the output nodes
da - The derivative of a
dz - The derivative of z
dx - The derivative of x
I know that this is the derivative of x:
dx = w.T*dz
Note: * means dot and .T means transpose
Now let me introduce the problem. Say I have a neural network with 2 inputs, 3 output nodes, and a batch size of 5. How would I go about computing dx? In this case, the weights would be of shape (z, x) or (3, 2) before transposing and dz would be of shape (z, batches) or (3, 5). If I were to use the formula above, I would get a shape of (x, batches) or (2, 5). Would I take the sum with respect to the last dimension after using the formula above to get dx (resulting in a shape of (2, 1))? Below is a representation of the dot product using made-up values:
w.T * dz = dx
[[1, 2, 3, 4, 5],
[[1, 0.5, 1], * [1, 2, 3, 4, 5], = [[2.5, 5, 7.5, 10, 12.5],
[-1, -1, -0.5] [1, 2, 3, 4, 5]] [-2.5, -5, -7.5, -10, -12.5]]
You did everything correctly. X always needs to have the same dimensions as dX in backpropagation. If X was an intermediate outcome of shape (2,5), then the gradient also has the shape (2,5). In this way you can update the matrix X. Now in your case matrix X is the input matrix, which you will never update. You only need to update W.
If X was the result of a hidden layer, your calculation of the backpropagated gradient is correct.
Related
I am trying to take an inner product of two vectors in tensorflow, for which I use the dot product:
x = tf.constant([1, 2, 3], dtype=tf.int32)
y = tf.constant([4, 5, 6], dtype=tf.int32)
# desired result
tf.tensordot(x, y, axes=1)
# Output: 32
Now I'm dealing with batch tensors which both have shape (32, 3). I still want the same operation, yielding an output vector of shape (32, ). My only succesful attempt so far is:
tf.linalg.diag_part(tf.tensordot(x, y, axes=[[1], [1]]))
# Output: <tf.Tensor: shape=(32,)>
# where each entry is the inner product of the vectors of length 3
However, I compute 32 as many inner products as required.
How do I solve my problem more efficiently?
Think about what this operation is at the end of the day: Element-wise multiplication and a sum over axis 1. So you can just do this
tf.reduce_sum(x * y, axis=1)
A model that I am working on should be predicting quite a lot of variables simultaneously (>1000). Therefore I would like to have a small neural network at the end of the network for each output.
In order to do this compactly, I would like to find a way to create a sparse trainable connection between two layers in the neural network within the Tensorflow framework.
Only a small portion of the connection matrix should be trainable: It is only the parameters that are part of the block-diagonal.
For example:
The connection matrix is the following:
The trainable parameters should be in the place of the 1's.
I have written exactly such a layer:
https://github.com/ArnovanHilten/GenNet/blob/master/GenNet_utils/LocallyDirectedConnected_tf2.py
It takes a sparse matrix as an input and lets you decide how to connect between layers. The layer uses sparse tensors and matrix multiplications.
edit
so the comment was Is this a trainable object though?
The answer: No. You cannot use sparse matrix currently and make it trainable. Instead you can use a mask matrix (see at the end)
But if you need to use sparse matrix, you just have to use tf.sparse.sparse_dense_matmul() or tf.sparse_tensor_to_dense() where your sparse interacts with a dense matrix. I have taken a simple XOR example from here and replaced dense with a sparse matrix:
#Declaring necessary modules
import tensorflow as tf
import numpy as np
"""
A simple numpy implementation of a XOR gate to understand the backpropagation
algorithm
"""
x = tf.placeholder(tf.float32,shape = [4,2],name = "x")
#declaring a place holder for input x
y = tf.placeholder(tf.float32,shape = [4,1],name = "y")
#declaring a place holder for desired output y
m = np.shape(x)[0]#number of training examples
n = np.shape(x)[1]#number of features
hidden_s = 2 #number of nodes in the hidden layer
l_r = 1#learning rate initialization
theta1 = tf.SparseTensor(indices=[[0, 0],[0, 1], [1, 1]], values=[0.1, 0.2, 0.1], dense_shape=[3, 2])
#theta1 = tf.cast(tf.Variable(tf.random_normal([3,hidden_s]),name = "theta1"),tf.float64)
theta2 = tf.cast(tf.Variable(tf.random_normal([hidden_s+1,1]),name = "theta2"),tf.float32)
#conducting forward propagation
a1 = tf.concat([np.c_[np.ones(x.shape[0])],x],1)
#the weights of the first layer are multiplied by the input of the first layer
#z1 = tf.sparse_tensor_dense_matmul(theta1, a1)
z1 = tf.matmul(a1,tf.sparse_tensor_to_dense(theta1))
#the input of the second layer is the output of the first layer, passed through the
a2 = tf.concat([np.c_[np.ones(x.shape[0])],tf.sigmoid(z1)],1)
#the input of the second layer is multiplied by the weights
z3 = tf.matmul(a2,theta2)
#the output is passed through the activation function to obtain the final probability
h3 = tf.sigmoid(z3)
cost_func = -tf.reduce_sum(y*tf.log(h3)+(1-y)*tf.log(1-h3),axis = 1)
#built in tensorflow optimizer that conducts gradient descent using specified
optimiser = tf.train.GradientDescentOptimizer(learning_rate = l_r).minimize(cost_func)
#setting required X and Y values to perform XOR operation
X = [[0,0],[0,1],[1,0],[1,1]]
Y = [[0],[1],[1],[0]]
#initializing all variables, creating a session and running a tensorflow session
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
#running gradient descent for each iterati
for i in range(200):
sess.run(optimiser, feed_dict = {x:X,y:Y})#setting place holder values using feed_dict
if i%100==0:
print("Epoch:",i)
print(sess.run(theta1))
and the output is:
Epoch: 0
SparseTensorValue(indices=array([[0, 0],
[0, 1],
[1, 1]]), values=array([0.1, 0.2, 0.1], dtype=float32), dense_shape=array([3, 2]))
Epoch: 100
SparseTensorValue(indices=array([[0, 0],
[0, 1],
[1, 1]]), values=array([0.1, 0.2, 0.1], dtype=float32), dense_shape=array([3, 2]))
So the only way is to use a mask matrix. You can use it by multiplication or tf.where
1) Multiplication: You can create mask matrix of the desired shape and multiply it with your weight matrix:
mask = tf.Variable([[1,0,0],[0,1,0],[0,0,1]],name ='mask', trainable=False)
weight = tf.cast(tf.Variable(tf.random_normal([3,3])),tf.float32)
desired_tensor = tf.matmul(weight, mask)
2) tf.where
mask = tf.Variable([[1,0,0],[0,1,0],[0,0,1]],name ='mask', trainable=False)
weight = tf.cast(tf.Variable(tf.random_normal([3,3])),tf.float32)
desired_tensor = tf.where(mask > 0, tf.ones_like(weight), weight)
Hope it helps
You can do that by using sparse tensors like so:
SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4])
and the output is:
[[1, 0, 0, 0]
[0, 0, 2, 0]
[0, 0, 0, 0]]
you can look up more on the documentation of sparse tensor here:
https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor
Hope it helps!
I want to transform general tensors / vectors in Tensorflow, but to have a concrete example let's say rotate images.
For this I would like to have a rotation matrix R, which is learned by my network, i.e. there should be gradients computable.
How would you do this?
I found tf.contrib.image.transform, but for this it is said no gradients are computed into the transformation parameters.
Via py_func also the gradients are not available or would have to be calculated by hand - before writing a long custom solution for this (if even possible), are there maybe any ready-to-use solutions?
I cannot be the first one doing this.
For the requested code: I just want to feed an image as input, maybe apply some convolutional layer and in the end get a 2x2 matrix representing my transformation:
conv_1 = tf.layers.conv2d(conv1, 16, [3, 3], strides=(2, 2), padding='same', activation=tf.nn.leaky_relu)
...
M = tf.contrib.layers.fully_connected(conv_n, 4, activation_fn=tf.nn.tanh)
The matrix M then describes how my indices are transformed (imagining each pixel in the image as a vector with endpoint x, y), and I move each pixel then to its new location.
In numpy I could for example do this:
indices = []
for i in range(28):
for j in range(28):
indices.append([i, j])
indices = np.repeat(np.expand_dims(np.asarray(indices), 0), self.batch_size, 0)
transformed = []
for b in range(self.batch_size):
transformed.append(tf.matmul(indices[b], M[b]))
transformed = tf.stack(transformed)
transformed_img = np.zeros((self.batch_size, 28, 28))
for b in range(self.batch_size):
transformed_img[b, transformed[b, :, :, 0].astype(np.int32), transformed[b, :, :, 1].astype(np.int32)] = input_img[b, :, :, 0]
I'm working with sequences of different lengths. But I would only want to grad them based on the output computed at the end of the sequence.
The samples are ordered so that they are decreasing in length and they are zero-padded. For 5 1D samples it looks like this (omitting width dimension for visibility):
array([[5, 7, 7, 4, 5, 8, 6, 9, 7, 9],
[6, 4, 2, 2, 6, 5, 4, 2, 2, 0],
[4, 6, 2, 4, 5, 1, 3, 1, 0, 0],
[8, 8, 3, 7, 7, 7, 9, 0, 0, 0],
[3, 2, 7, 5, 7, 0, 0, 0, 0, 0]])
For the LSTM I'm using nn.utils.rnn.pack_padded_sequence with the individual sequence lengths:
x = nn.utils.rnn.pack_padded_sequence(x, [10, 9, 8, 7, 5], batch_first=True)
The initialization of LSTM in the Model constructor:
self.lstm = nn.LSTM(width, n_hidden, 2)
Then I call the LSTM and unpack the values:
x, _ = self.lstm(x)
x = nn.utils.rnn.pad_packed_sequence(x1, batch_first=True)
Then I'm applying a fully connected layer and a softmax
x = x.contiguous()
x = x.view(-1, n_hidden)
x = self.linear(x)
x = x.reshape(batch_size, n_labels, 10) # 10 is the sample height
return F.softmax(x, dim=1)
This gives me an output of shape batch x n_labels x height (5x12x10).
For each sample, I would only want to use a single score, for the last output batch x n_labels (5*12). My question is How can I achieve this?
One idea is to apply tanh on the last hidden layer returned from the model but I'm not quite sure if that would give the same results. Is it possible to efficiently extract the output computed at the end of the sequence eg using the same lengths sequence used for pack_padded_sequence?
As Neaabfi answered hidden[-1] is correct. To be more specific to your question, as the docs wrote:
output, (h_n, c_n) = self.lstm(x_pack) # batch_first = True
# h_n is a vector of shape (num_layers * num_directions, batch, hidden_size)
In your case, you have a stack of 2 LSTM layers with only forward direction, then:
h_n shape is (num_layers, batch, hidden_size)
Probably, you may prefer the hidden state h_n of the last layer, then **here is what you should do:
output, (h_n, c_n) = self.lstm(x_pack)
h = h_n[-1] # h of shape (batch, hidden_size)
y = self.linear(h)
Here is the code which wraps any recurrent layer LSTM, RNN or GRU into DynamicRNN. DynamicRNN has a capacity of performing recurrent computations on sequences of varied lengths without any care about the order of lengths.
You can access the last hidden layer as follows:
output, (hidden, cell) = self.lstm(x_pack)
y = self.linear(hidden[-1])
I have two tensors of shape N x D1 and M x D2 where D1 > D2, called X and Y respectively. For my task, X acts as the input and Y acts as the filter.
I want to calculate a matrix P of shape N x M x (D1-D2+1) such that:
P[0,0,0] = dot(X[0,0:D2], Y[0,:])
P[0,0,1] = dot(X[0,1:D2+1], Y[0,:])
...
P[N-1,M-1,D1-D2] = dot(X[N-1,D1-D2:D1], Y[M-1,:])
I can create a for loop and manually slide Y and calculate the dot products.
However I prefer using the correlation operator.
As I know, tensorflow has correlation operator implemented (https://www.tensorflow.org/versions/master/api_docs/python/nn/convolution) but I don't know how can I use my tensors as inputs and filters.
tf.nn.conv2d(input, filter, strides, padding, use_cudnn_on_gpu=None, data_format=None, name=None)
In your case, I'd set strides to 1, and padding to SAME.
tf.nn.conv2d(X, Y, strides=1, padding=SAME)
Yes, you can use indeed tf.nn.conv2d(), but you should add both batch and channel dimensions:
X = tf.expand_dims(tf.expand_dims(X,0),-1)
# X.shape [batch=1, in_height, in_width, in_channels=1]
Y = tf.expand_dims(tf.expand_dims(Y,-1),-1)
# Y.shape = [filter_height, filter_width, in_channels=1, out_channels=1]
# Convolution (actually correlation, see doc of conv2d)
xcorr = tf.nn.conv2d(X, Y, padding="VALID", strides=[1, 1, 1, 1])
# Padding should be VALID, since you've already padded your input
CAVEAT: However, you cannot extrapolate this approach for batches of signals, since tf.nn.conv2d uses always the same filter over the batch dimension, and from my understanding you do want to change it.