Workaround for tf.reshape breaking the flow of the gradient (Jacobian) - Python

I have a program in which I'm trying to calculate the Jacobian of a neural network. To properly define the Jacobian I used tf.reshape to flatten the data into vectors (as far as I know, the Jacobian dy/dx is only defined when y and x are vectors, not matrices or tensors).
This is my code:
@tf.function
def A_calculator():
    with tf.GradientTape(watch_accessed_variables=False) as gtape:
        noise = tf.random.normal([1000, 100])
        gtape.watch(noise)
        fakenoise = tf.reshape(gen(noise), [1000, -1])
        reshaped_noise = tf.reshape(noise, [1000, -1])
    # calculate the Jacobian
    Jz = gtape.batch_jacobian(fakenoise, reshaped_noise)
    return Jz
where gen is a neural network (a generator) that returns an image.
My problem is that Jz is always a tensor whose elements are all zero.
I searched for a solution, and the closest thing was here (this is what made me suspect that the problem is tf.reshape), but the solution there doesn't solve my problem, as I want to do the reshape after I pass the value to gen. Does anybody know how to solve this, or why Jz always comes out as a tensor of zeros?

Reshaping every tensor is unnecessary: reshaping a (1000, 100) tensor to (1000, -1) results in the same shape. Moreover, fakenoise is computed from noise, not from reshaped_noise, so there is no path between those two tensors in the graph, which is why the Jacobian comes out as zeros. Skip the reshaping of the input altogether and differentiate with respect to the watched noise tensor directly, as in the sketch below.
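A minimal sketch of that fix, assuming gen is the asker's generator model mapping a [batch, 100] noise tensor to images:
@tf.function
def A_calculator():
    noise = tf.random.normal([1000, 100])
    with tf.GradientTape(watch_accessed_variables=False) as gtape:
        gtape.watch(noise)
        # flatten only the generator output; the input needs no reshape
        fakenoise = tf.reshape(gen(noise), [1000, -1])
    # Jacobian of each flattened image w.r.t. its own noise vector;
    # result shape: [1000, image_size, 100]
    return gtape.batch_jacobian(fakenoise, noise)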
Also, please check the generator: it could take a lot of time to produce fakenoise.

Related

Keras custom loss function with argwhere-like check

I am trying to create a custom loss function in Keras for a generator that generates a matrix. The matrix consists of a large number of ordinary elements and a small number of centers. Centers have a high value compared to the other elements: elements have values < 0.1, while centers should reach values > 0.5. It is important that the centers are at exactly the correct indices, while fitting the other elements is less important. That is why I am trying to create a loss that does the following:
1. Select all elements from y_true where the value is > 0.5; in NumPy I would do indices = np.argwhere(y_true > 0.5).
2. Compare the values at the given indices for y_true and y_pred, something like loss = K.square(y_pred[indices] - y_true[indices]).
3. Select all other elements: indices_low = np.argwhere(y_true < 0.5).
4. Same as step 2, saved e.g. as loss_low.
5. Return a weighted loss, i.e. return loss*100 + loss_low, simply to give a higher weight to the more important data.
However, I cannot find a way to achieve this in the Keras backend. I have found a question about tf.where and tried to look for something similar to my problem, but there seems to be nothing like tf.argwhere (I can't find it in the docs, nor by browsing the net/SO). So how can I achieve this?
Note that the number and positions of centers can vary, and the generator is bad from the start, so it will generate none at all or way more than it really should; that is why I think I can't simply use tf.where. I might be incorrect here, as I am new to custom loss functions; any thoughts are welcome.
EDIT
After all, it seems K.tf.where was exactly what I was looking for, so I tried it out:
def custom_mse():
    def mse(y_true, y_pred):
        indices = K.tf.where(y_true > 0.5)
        loss = K.square(y_true[indices] - y_pred[indices])
        indices = K.tf.where(y_true < 0.5)
        loss_low = K.square(y_true[indices] - y_pred[indices])
        return 100 * loss + loss_low
    return mse
but this keeps throwing an error:
ValueError: Shape must be rank 1 but is rank 3 for 'loss_1/Generator_loss/strided_slice' (op: 'StridedSlice') with input shapes: [?,?,?,?], [1,?,4], [1,?,4], [1].
How can I use the where output?
After a while I finally found the correct solution, so it might help somebody in the future:
Firstly, my code was biased by my long-time work with NumPy and pandas, which made me expect that tensor elements could be addressed as y_true[indices]; there are actually built-in functions tf.gather and tf.gather_nd for getting elements of a tensor. However, since the number of selected elements differs between the two losses, I can't use them: adding the losses together would lead to a shape-mismatch error.
This led me to a different approach, thanks to this Q&A. Studying the code in the accepted answer, I found that tf.where can be used not only to get indices but also to apply masks to tensors. The final solution to my problem is to apply two masks to the input tensors and calculate two losses, one over the high values and one over the low values, then multiply the loss that should have the higher weight.
def custom_mse():
    def mse(y_true, y_pred):
        # mask of positions where the target is a "center"
        great = K.tf.greater(y_true, 0.5)
        loss = K.square(tf.where(great, y_true, tf.zeros(tf.shape(y_true)))
                        - tf.where(great, y_pred, tf.zeros(tf.shape(y_pred))))
        # mask of the remaining low-valued positions
        lower = K.tf.less(y_true, 0.5)
        loss_low = K.square(tf.where(lower, y_true, tf.zeros(tf.shape(y_true)))
                            - tf.where(lower, y_pred, tf.zeros(tf.shape(y_pred))))
        # weight the center loss 100x more heavily
        return 100 * loss + loss_low
    return mse
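For completeness, a hypothetical usage sketch; the model, optimizer, and training data names here are placeholders, not from the original question:
model.compile(optimizer='adam', loss=custom_mse())
model.fit(X_train, Y_train, epochs=10)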

PyTorch gradient becomes None when dividing by a scalar

Consider the following code block:
import torch
n=10
x = torch.ones(n, requires_grad=True)/n
y = torch.rand(n)
z = torch.sum(x*y)
z.backward()
print(x.grad) # results in None
print(y)
As written, x.grad is None. However, if I change the definition of x by removing the scalar division (x = torch.ones(n, requires_grad=True)), then I do get a non-None gradient that equals y.
I've googled a bunch looking for this issue, and I think it reflects something fundamental that I don't understand about how the computational graph in torch works. I'd love some clarification. Thanks!
When you set x to a tensor divided by some scalar, x is no longer what is called a "leaf" tensor in PyTorch. A leaf tensor is a tensor at the beginning of the computation graph (which is a DAG with nodes representing objects such as tensors and edges representing mathematical operations). More specifically, it is a tensor which was not created by a computational operation tracked by the autograd engine.
In your example - torch.ones(n, requires_grad=True) is a leaf tensor, but you can't access it directly in your code.
The reasoning behind not keeping the grad for non-leaf tensors is that typically, when you train a network, the weights and biases are leaf tensors and they are what we need the gradient for.
If you want to access the gradients of a non-leaf tensor, you should call the retain_grad function, which means in your code you should add:
x.retain_grad()
after the assignment to x.
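A minimal runnable sketch of that fix, applied to the snippet from the question:
import torch

n = 10
x = torch.ones(n, requires_grad=True) / n  # x is a non-leaf tensor
x.retain_grad()                            # ask autograd to keep x.grad
y = torch.rand(n)
z = torch.sum(x * y)
z.backward()
print(x.grad)  # now equals y, since dz/dx_i = y_i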
It is true that you need to retain the grad. However, the easiest correction to this issue is using the torch.div() function.

Tensorflow compute_weighted_loss Example

I want to use TensorFlow's tf.losses.compute_weighted_loss but cannot find any good example. I have a multi-class classification problem and use tf.nn.sigmoid_cross_entropy_with_logits as the loss. Now I want to weight the errors for each label independently. Let's say I have n labels; that means I need an n-sized weight vector. Unfortunately tf expects me to pass a (b, n)-shaped matrix of error weights, where b is the batch size. So basically I would need to repeat the weight vector b times. That's okay given a fixed batch size, but if my batch size is variable (e.g. a smaller batch at the end of the dataset) I have to adapt. Is there a way around this, or did I miss something?
I just had to reshape the vector from (n,) to (1,n) to make the broadcasting possible:
error_weights = error_weights.reshape(1, error_weights.shape[0])
Adding to the existing answer, use tf.expand_dims if error_weights is a Tensor.
error_weights = tf.expand_dims(error_weights, 0) # changes shape [n] to [1, n]
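Putting it together, a hedged TF1-style sketch (n_labels and the weight values are made up for illustration): compute_weighted_loss broadcasts the [1, n] weights across any batch size.
import tensorflow as tf

n_labels = 4
logits = tf.placeholder(tf.float32, [None, n_labels])
labels = tf.placeholder(tf.float32, [None, n_labels])

# per-element losses, shape [batch, n_labels]
losses = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
error_weights = tf.constant([1.0, 2.0, 0.5, 3.0])      # shape [n_labels]
error_weights = tf.expand_dims(error_weights, 0)       # shape [1, n_labels]
loss = tf.losses.compute_weighted_loss(losses, weights=error_weights)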

Efficient computation of layer Jacobians in Theano

I want to take a closer look at the Jacobians of each layer in a fully connected neural network, i.e. ∂y/∂x, where x is the input vector to a layer (the activations of the previous layer) and y is its output vector (the activations of this layer).
In an online learning scheme, this could be easily done as follows:
import theano
import theano.tensor as T
import numpy as np
x = T.vector('x')
w = theano.shared(np.random.randn(10, 5))
y = T.tanh(T.dot(w, x))
# computation of Jacobian
j = T.jacobian(y, x)
When learning on batches, you need an additional scan to get the Jacobian for each sample:
x = T.matrix('x')
...
# computation of the per-sample Jacobians
j, updates = theano.scan(lambda i, a, b: T.jacobian(b[i], a)[:, i],
                         sequences=T.arange(y.shape[0]),
                         non_sequences=[x, y])
This works perfectly well for toy examples, but when learning a network with multiple layers of 1000 hidden units on thousands of samples, this approach leads to a massive slowdown of the computation. (The idea behind indexing the result of the Jacobian can be found in this question.)
The thing is, I believe there is no need for this explicit Jacobian computation when we are already computing the derivative of the loss. After all, the gradient of the loss with respect to, e.g., the inputs of the network can be decomposed as
∂L(y, yL)/∂x = ∂L(y, yL)/∂yL · ∂yL/∂y(L-1) · ∂y(L-1)/∂y(L-2) · … · ∂y2/∂y1 · ∂y1/∂x
i.e. the gradient of the loss w.r.t. x is the product of derivatives of each layer (L would be the number of layers here).
My question is thus whether (and how) it is possible to avoid this extra computation and use the decomposition discussed above. I assume it should be possible, because automatic differentiation is practically an application of the chain rule (as far as I understand it). However, I can't seem to find anything to back this idea up. Any suggestions, hints or pointers?
T.jacobian is very inefficient because it uses scan internally. If you plan to multiply the Jacobian matrix by something, you should use T.Lop or T.Rop for left and right multiplication respectively. Currently a "smart" Jacobian does not exist in Theano's gradient module; you have to hand-craft one if you want an optimized Jacobian.
Instead of using theano.scan, use a batched Op such as T.batched_dot when possible; theano.scan always results in a CPU loop.
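A minimal sketch of the Lop/Rop route, reusing the toy layer from the question; u and v are the vectors to multiply with, and the Jacobian itself is never materialized:
import theano
import theano.tensor as T
import numpy as np

x = T.vector('x')
w = theano.shared(np.random.randn(10, 5))
y = T.tanh(T.dot(w, x))

u = T.vector('u')      # same length as y
v = T.vector('v')      # same length as x
vjp = T.Lop(y, x, u)   # u^T (dy/dx): a vector-Jacobian product
jvp = T.Rop(y, x, v)   # (dy/dx) v: a Jacobian-vector product
f = theano.function([x, u, v], [vjp, jvp])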

TensorFlow Multi-Layer Perceptron

I am learning TensorFlow, and my goal is to implement a multilayer perceptron for my needs. I checked the MNIST tutorial with a multilayer-perceptron implementation and everything was clear to me except this:
_, c = sess.run([optimizer, cost], feed_dict={x: batch_x,
                                              y: batch_y})
I guess x is an image itself (28*28 pixels, so the input is 784 neurons) and y is a label, which is a 1x10 array:
x = tf.placeholder("float", [None, n_input])
y = tf.placeholder("float", [None, n_classes])
They feed whole batches (which are packs of data points and labels)! How does TensorFlow interpret this "batch" input? And how does it update the weights: simultaneously after each element in a batch, or after running through the whole batch?
And, if I need to input one number (input_shape = [1,1]) and output four numbers (output_shape = [1,4]), how should I change the tf.placeholders and in which form should I feed them into the session?
When I ask how TensorFlow interprets it, I want to know how it splits the batch into single elements. For example, the batch is a 2-D array, right? In which direction does it split the array? Or does it use matrix operations and not split anything?
When I ask how I should feed my data, I want to know whether it should be a 2-D array with samples in its rows and features in its columns, or, maybe, whether it could be a 2-D list.
When I feed my float NumPy array X_train to x, which is:
x = tf.placeholder("float", [1, n_input])
I receive an error:
ValueError: Cannot feed value of shape (1, 18) for Tensor 'Placeholder_10:0', which has shape '(1, 1)'
It appears that I have to create my data as a Tensor too?
When I tried [18x1]:
Cannot feed value of shape (18, 1) for Tensor 'Placeholder_12:0', which has shape '(1, 1)'
They feed whole batches (which are packs of data points and labels)!
Yes, this is how neural networks are usually trained, due to some nice mathematical properties of having the best of two worlds: a better gradient approximation than in SGD on the one hand, and much faster convergence than full GD on the other.
How does tensorflow interpret this "batch" input?
It "interprets" it according to operations in your graph. You probably have reduce mean somewhere in your graph, which calculates average over your batch, thus causing this to be the "interpretation".
And how does it update the weights: 1. simultaneously after each element in a batch? 2. After running through the whole batch?
As in the previous answer, there is nothing "magical" about a batch: it is just another dimension, and each internal operation of a neural net is well defined for a batch of data, so there is still a single update in the end. Since you use a reduce_mean operation (or maybe reduce_sum?), you are updating according to the mean of the "small" per-sample gradients (or their sum if there is a reduce_sum instead). Again, you can control this up to that agglomerative behaviour; you cannot force per-sample updates unless you introduce a while loop into the graph. See the sketch after this paragraph.
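A small hedged sketch of that point (TF1 style; the names and shapes are made up for illustration): the batch is just a leading dimension, and reduce_mean collapses it into one scalar loss, hence one update.
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 3])   # any batch size
w = tf.Variable(tf.ones([3, 1]))
per_sample = tf.square(tf.matmul(x, w))     # shape [batch, 1]
loss = tf.reduce_mean(per_sample)           # one scalar for the whole batch
grad = tf.gradients(loss, w)[0]             # the mean of per-sample gradients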
And, if I need to input one number (input_shape = [1,1]) and output four numbers (output_shape = [1,4]), how should I change the tf.placeholders and in which form should I feed them into the session? THANKS!!
Just set the variables n_input=1 and n_classes=4, and push your data as before, as [batch, n_input] and [batch, n_classes] arrays (in your case batch=1, if by "1x1" you mean "one sample of dimension 1"; your edit suggests that you actually do have a batch, and by 1x1 you meant a 1-dimensional input).
EDIT: 1. When I ask how TensorFlow interprets it, I want to know how it splits the batch into single elements. For example, the batch is a 2-D array, right? In which direction does it split the array? Or does it use matrix operations and not split anything? 2. When I ask how I should feed my data, I want to know whether it should be a 2-D array with samples in its rows and features in its columns, or, maybe, whether it could be a 2-D list.
It does not split anything. It is just a matrix, and each operation is perfectly well defined for matrices as well. Usually you put examples in rows, i.e. in the first dimension, and this is exactly what [batch, n_inputs] says: you have batch rows, each with n_inputs columns. But again, there is nothing special about it, and you could also create a graph that accepts column-wise batches if you really needed to. A sketch of these shapes follows.
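A hedged sketch of the shapes under discussion (the batch of 18 comes from the asker's error message; the random data and variable names are made up):
import numpy as np
import tensorflow as tf

n_input, n_classes = 1, 4
x = tf.placeholder(tf.float32, [None, n_input])    # None lets batch size vary
y = tf.placeholder(tf.float32, [None, n_classes])

X_train = np.random.rand(18, n_input).astype(np.float32)    # samples in rows
Y_train = np.random.rand(18, n_classes).astype(np.float32)

with tf.Session() as sess:
    print(sess.run(tf.shape(x), feed_dict={x: X_train}))    # [18 1]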
