I want to take a closer look at the Jacobians of each layer in a fully connected neural network, i.e. ∂y/∂x, where x is the input vector of a layer (the activations of the previous layer) and y is its output vector (the activations of this layer).
In an online learning scheme this can easily be done as follows:
import theano
import theano.tensor as T
import numpy as np
x = T.vector('x')
w = theano.shared(np.random.randn(10, 5))
y = T.tanh(T.dot(w, x))
# computation of Jacobian
j = T.jacobian(y, x)
When learning on batches, you need an additional scan to get the Jacobian of each sample:
x = T.matrix('x')
...
# computation of Jacobian
j, updates = theano.scan(lambda i, a, b: T.jacobian(b[i], a)[:, i],
                         sequences=T.arange(y.shape[0]),
                         non_sequences=[x, y])
This works perfectly well for toy examples, but when training a network with multiple layers of 1000 hidden units on thousands of samples, this approach slows the computation down massively. (The idea behind indexing the result of the Jacobian is explained in this question.)
The thing is that I believe there is no need for this explicit Jacobian computation when we are already computing the derivative of the loss. After all, the gradient of the loss with respect to, e.g., the inputs of the network can be decomposed as
∂L(y, y_L)/∂x = ∂L(y, y_L)/∂y_L · ∂y_L/∂y_{L-1} · ∂y_{L-1}/∂y_{L-2} · … · ∂y_2/∂y_1 · ∂y_1/∂x
i.e. the gradient of the loss w.r.t. x is the product of the derivatives of each layer (L being the number of layers here).
My question is thus whether (and how) it is possible to avoid this extra computation and use the decomposition above. I assume it should be possible, because automatic differentiation is, as far as I understand it, essentially an application of the chain rule. However, I can't find anything that backs this idea up. Any suggestions, hints or pointers?
T.jacobian is very inefficient because it uses scan internally. If you plan to multiply the Jacobian matrix with something, you should use T.Lop or T.Rop for left or right multiplication respectively. A "smart" Jacobian does not currently exist in Theano's gradient module; you have to hand-craft it if you want an optimized Jacobian.
Instead of using theano.scan, use a batched op such as T.batched_dot when possible; theano.scan always results in a CPU loop.
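For example, here is a minimal sketch (reusing the variables from the question above, with a hypothetical vector v) of a vector–Jacobian product via T.Lop, which never materializes the full Jacobian:

import theano
import theano.tensor as T
import numpy as np

x = T.vector('x')
w = theano.shared(np.random.randn(10, 5))
y = T.tanh(T.dot(w, x))

v = T.vector('v')      # vector to multiply from the left, e.g. dL/dy
vjp = T.Lop(y, x, v)   # computes v . (dy/dx) without ever building dy/dx
f = theano.function([x, v], vjp)

T.Rop works analogously for Jacobian-times-vector products from the right.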
Related
I have a program in which I'm trying to calculate the Jacobian of a neural network, but in order to properly define the Jacobian I used tf.reshape to flatten the data into vectors (as far as I know, the Jacobian dy/dx is only defined when y and x are vectors, not matrices or tensors).
This is my code:
@tf.function
def A_calculator():
    with tf.GradientTape(watch_accessed_variables=False) as gtape:
        noise = tf.random.normal([1000, 100])
        gtape.watch(noise)
        fakenoise = tf.reshape(gen(noise), [1000, -1])
        reshaped_noise = tf.reshape(noise, [1000, -1])
    # calculate jacobian
    Jz = gtape.batch_jacobian(fakenoise, reshaped_noise)
    return Jz
where gen is a neural network (a generator) that returns an image.
My problem is that Jz is always a tensor with zeros as its elements.
I searched for a solution and the closest thing I found was here (this is what made me suspect that the problem is tf.reshape), but the solution there doesn't solve my problem, since I want to do the reshape after passing the value to the function gen. Does anybody know how to solve this, or why Jz always comes out as a tensor of zeros?
Reshaping is unnecessary here: reshaping a (1000, 100) tensor to (1000, -1) results in the same shape. More importantly, fakenoise is computed from noise, not from reshaped_noise, so there is no gradient path from reshaped_noise to fakenoise, which is why the Jacobian comes out as zeros. Skip the reshaping altogether, at all stages, and differentiate with respect to noise directly.
Also check the generator; it can take a lot of time to produce fakenoise.
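A minimal sketch of how the function could look without the reshapes (assuming gen maps a (1000, 100) noise batch to images; the exact output shape depends on gen):

@tf.function
def A_calculator():
    with tf.GradientTape(watch_accessed_variables=False) as gtape:
        noise = tf.random.normal([1000, 100])
        gtape.watch(noise)   # watch the tensor that gen actually consumes
        fake = gen(noise)    # e.g. shape (1000, H, W, C)
    # per-sample Jacobian of the generator output w.r.t. its input noise,
    # with shape (1000, H, W, C, 100)
    return gtape.batch_jacobian(fake, noise)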
I am learning about neural networks.
Here is the complete piece of code:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Exercises).ipynb
When I transpose features, I get the following output:
import torch

def activation(x):
    return 1 / (1 + torch.exp(-x))

### Generate some data
torch.manual_seed(7)  # Set the random seed so things are predictable

# Features are 5 random normal variables
features = torch.randn((1, 5))
# True weights for our data, random normal variables again
weights = torch.randn_like(features)
# and a true bias term
bias = torch.randn((1, 1))

product = features.t() * weights + bias
output = activation(product.sum())
tensor(0.9897)
However, if I transpose weights, I get a different output:
weights_prime = weights.view(5,1)
prod = torch.mm(features, weights_prime) + bias
y_hat = activation(prod.sum())
tensor(0.1595)
Why does this happen?
Update
I took a look at the solution:
https://github.com/udacity/deep-learning-v2-pytorch/blob/master/intro-to-pytorch/Part%201%20-%20Tensors%20in%20PyTorch%20(Solution).ipynb
And I saw this:
y = activation((features * weights).sum() + bias)
Why can a matrix features of shape (1,5) multiply another matrix weights of shape (1,5) without transposing weights first?
Update 2
After reading several posts, I realized that
matrixA * matrixB is different from torch.mm(matrixA, matrixB) and torch.matmul(matrixA, matrixB).
Could someone confirm my three understandings below?
So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication.
differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases.
In the neural network for this Udacity coding exercise mentioned in my link above, it needs element-wise multiplication.
Update 3
Just to bring in the video screenshot for anyone who has the same confusion:
And here is the video link: https://www.youtube.com/watch?time_continue=98&v=6Z7WntXays8&feature=emb_logo
Looking at https://pytorch.org/docs/master/generated/torch.nn.Linear.html
The typical linear (fully connected) layer in torch uses input features of shape (N,∗,in_features) and weights of shape (out_features,in_features) to produce an output of shape (N,*,out_features). Here N is the batch size, and * is any number of other dimensions (may be none).
The implementation is:
output = input.matmul(weight.t())
So, the answer is that neither of your formulas is correct according to convention; the standard formula is the one above.
You may use a non-standard shape since you're implementing things from scratch; as long as it's consistent it may work, but I don't recommend it for learning. It's unclear what 1 and 5 are in your code, but presumably you want 5 input features and one output feature, with a batch size of 1 as well. In that case the standard shapes would be input = torch.randn((1, 5)) for batch_size=1 and in_features=5, and weights = torch.randn((1, 5)) for out_features=1 and in_features=5, following the (out_features, in_features) convention above.
There is no reason why weights should ever be the same shape as features; thus weights = torch.randn_like(features) doesn't make sense.
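To make the convention concrete, here is a minimal sketch (batch_size=1, in_features=5, out_features=1; the names x, W, b are just for illustration) showing that the standard matmul form and the element-wise form used in the notebook agree for a single output unit:

import torch

torch.manual_seed(7)
x = torch.randn(1, 5)   # (batch_size, in_features)
W = torch.randn(1, 5)   # (out_features, in_features), as nn.Linear stores it
b = torch.randn(1, 1)

out = x.matmul(W.t()) + b          # standard linear layer, shape (1, 1)
out2 = (x * W).sum() + b           # element-wise product + sum, same single output unit
print(torch.allclose(out, out2))   # True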
Lastly, for your actual questions:
"Should I transpose features or weights in Neural network?" - in torch convention, you should transpose weights, but use matmul with the features first. Other frameworks may have a different convention; as long as in_features dimension of the weights is multiplied by the num_features dimension of the input, it would work.
"Why does this happen?" - these are two completely different calculations; there is no reason to think they would produce the same result.
"So the * means element-wise multiplication, whereas torch.mm() and torch.matmul() are matrix-wise multiplication." - Yes; mm is matrix-matrix only, matmul is vector-matrix or matrix-matrix, including batched versions of same - check the docs for everything matmul can do (which is kinda a lot).
"differences between torch.mm() and torch.matmul(): mm() is used specifically for 2 dimensions matrix, whereas matmul() can be used for more complicated cases." - Yes; the big difference is that matmul can broadcast. Use it when you specifically intend that; use mm to prevent unintentional broadcasting.
"In Neutral Network for this Udacity coding exercise mentioned in my above link, it needs element-wise multiplication." - I doubt it; it's probably an error in the Udacity code. This bit of code weights = torch.randn_like(features) looks like an error in any case; the dimensions of weights have a meaning different from the dimensions of features.
This line is taking an outer product between the two vectors.
product = features.t() * weights + bias
The resulting shape is 5x5.
If you change this to a dot product, then output will match y_hat.
product = torch.mm(weights, features.t()) + bias
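A quick shape check illustrating the difference, continuing with the same features and weights tensors:

print((features.t() * weights).shape)          # torch.Size([5, 5]) -- broadcasted outer product
print(torch.mm(weights, features.t()).shape)   # torch.Size([1, 1]) -- dot product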
I'm trying to use a custom loss function. I'm now using TF 2.x where eager execution is turned on by default. I gave this a go with TF 1.x, but ran into too many problems. Is there any alternative to wrapping my function with tf.py_function()? If not, how would I wrap this?
General purpose: an autoencoder with a custom loss function built around unusual ranked differences. For now I'm just using scipy.stats.rankdata, but that will change in the future.
Tensor shape: (n, x, x, 1), i.e. n images, each of dimension x by x.
Therefore, I want to run this custom loss function on each (orig, pred) pair for all n images.
General algorithm:
import scipy.stats as ss

def rank_loss(orig, pred):
    orig_arr = orig.numpy()  # want shape (x, x, 1)
    pred_arr = pred.numpy()
    orig_rank = ss.rankdata(orig_arr)  # returns a flat array, one rank per element
    pred_rank = ss.rankdata(pred_arr)
    distance_diff = 0
    for i in range(len(orig_rank)):  # sum of absolute rank differences
        distance_diff += abs(orig_rank[i] - pred_rank[i])
    return distance_diff
If I can't do this, am I limited to the available tf.* functions, or how can I pull the tensors out as some form of array so I can run comparison computations across the two of them?
I also looked at tf.make_ndarray, but that doesn't seem applicable.
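For reference, a minimal sketch of what wrapping this with tf.py_function might look like (the Tout dtype is an assumption). Note that because rank_loss drops down to NumPy/SciPy, no gradients will flow through it, which matters if it is used as a training loss:

import tensorflow as tf

def rank_loss_wrapped(orig, pred):
    # runs rank_loss eagerly on the underlying arrays
    return tf.py_function(func=rank_loss, inp=[orig, pred], Tout=tf.float32)

To apply it per image pair, one could for example map it over the batch dimension with tf.map_fn.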
I have two tensors from which I am calculating Spearman's rank correlation, and I would like PyTorch to automatically adjust the values in these tensors in a way that drives my Spearman's rank correlation as high as possible.
I have explored autograd but nothing I've found has explained it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I set up a loop that keeps adjusting the values in these two tensors so as to get the highest Spearman's rank correlation between the two lists I build from them, while letting PyTorch do the work? I just need a guide of where to go. Thank you!
I'm not familiar with Spearman's Rank Correlation, but if I understand your question you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case then I'll provide a simple least squares example which I believe should be informative to your effort.
Consider a set of 200 measurements of 10-dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates that we can accomplish this by finding the matrix M and vector b which minimize ‖y − (Mx + b)‖².
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim
# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise
# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))
# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)
for i in range(500):
    # compute loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)

    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()

    # compute new gradients
    loss.backward()

    # update parameters M and b
    optimizer.step()

    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())
# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems that aren't necessarily neural-network related. I also chose to explicitly define the matrix M and vector b here rather than using an equivalent nn.Linear layer, since I think that would just confuse things.
In your case you want to maximize something, so make sure to negate your objective function before calling backward.
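For instance, a minimal sketch of that last point, assuming you have some differentiable surrogate objective(a, psfm_s) for the correlation (the name is hypothetical; the plain Spearman rank itself is not differentiable, since ranks are piecewise constant in their inputs):

# inside a training loop analogous to the least-squares example above
score = objective(a, psfm_s)  # hypothetical differentiable correlation surrogate
loss = -score                 # negate: minimizing loss maximizes the score
optimizer.zero_grad()
loss.backward()
optimizer.step()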
I'm trying to implement a version of differentially private stochastic gradient descent (e.g., this), which goes as follows:
Compute the gradient with respect to each point in the batch of size L, then clip each of the L gradients separately, then average them together, and then finally perform a (noisy) gradient descent step.
What is the best way to do this in pytorch?
Preferably, there would be a way to simultaneously compute the gradients for each point in the batch:
x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L
loss.backward()  # stores L distinct gradients in each param.grad, magically
Failing that, I could compute each gradient separately and then clip its norm before accumulating, but the code below
x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L
for i in range(loss.size()[0]):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm(model.parameters(), clip_size)
accumulates the i-th gradient and only then clips, rather than clipping before accumulating it into the gradient. What's the best way to get around this issue?
I don't think you can do much better than the second method in terms of computational efficiency; you're losing the benefits of batching in your backward pass, and that's a fact. Regarding the order of clipping: autograd stores the gradients in the .grad of the parameter tensors. A crude solution would be to add a dictionary like
clipped_grads = {name: torch.zeros_like(param) for name, param in net.named_parameters()}
Run your for loop like
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm_(net.parameters(), clip_size)  # clip this sample's gradients
    for name, param in net.named_parameters():
        clipped_grads[name] += param.grad / loss.size(0)
    net.zero_grad()

for name, param in net.named_parameters():
    param.grad = clipped_grads[name]

optimizer.step()
where I omitted much of the detach, requires_grad=False and similar business which may be necessary to make it behave as expected.
The disadvantage of the above is that you end up storing 2x the memory for your parameter gradients. In principle you could take the "raw" gradient, clip it, add it to clipped_grads, and then discard it as soon as no downstream operations need it, whereas here you retain the raw values in grad until the end of a backward pass. It may be that register_backward_hook allows you to do that if you go against the guidelines and actually modify the grad_input, but you would have to verify with someone more intimately acquainted with autograd.
This package calculates per-sample gradients in parallel. The memory needed is still batch_size times that of standard stochastic gradient descent, but thanks to the parallelization it can run much faster.
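Separately from that package, newer PyTorch versions ship torch.func, which can compute per-sample gradients by vmapping a per-sample loss over the batch; a minimal sketch with a toy model (all names here are illustrative, not from the question):

import torch
from torch.func import functional_call, grad, vmap

model = torch.nn.Linear(10, 1)
params = {k: v.detach() for k, v in model.named_parameters()}

def sample_loss(params, sample, target):
    # loss of a single (sample, target) pair, evaluated with the given parameters
    pred = functional_call(model, params, (sample.unsqueeze(0),))
    return torch.nn.functional.mse_loss(pred, target.unsqueeze(0))

x = torch.randn(32, 10)  # batch of 32 inputs
y = torch.randn(32, 1)   # batch of 32 targets
# vmap over the batch dimension of (x, y) but not over params; every entry of
# per_sample_grads then carries a leading batch dimension of 32
per_sample_grads = vmap(grad(sample_loss), in_dims=(None, 0, 0))(params, x, y)

Each per-sample gradient can then be clipped individually before averaging, which is the ordering the question asks for.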