Add values of one tensor to another without affecting the graph - python

I simply want to add the first three values of the third dimension of tensor2 to tensor1, without affecting the graph during backpropagation. tensor2 is only required for its values; it should not become part of the graph.
Does this work? That's how I would have done it in numpy.
tensor1[:, :, :3] += tensor2[:, :, :3]
Would it be better to use torch.add(), or .data? I am confused about when to use what. Thank you.

You should be able to use detach() to get a version of the tensor (tensor2) that is cut out of the graph, with requires_grad = False. Using the in-place += operator causes errors during backpropagation (i.e. at various times during the forward pass, the same variable stored 2 different values with 2 different associated gradients, but only one value/gradient pair can be stored in that variable during the backward pass). I'm a bit fuzzy on whether in-place operations are allowed on variables that are part of the computation graph when the operation itself is not. You can test this to see, but to be safe I recommend:
tensor1[:, :, :3] = torch.add(tensor1[:, :, :3], tensor2[:, :, :3].detach())
Later, if you want to do another operation using tensor2 where the gradient IS part of the computation graph, you can still do that as well.
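To illustrate, here is a minimal runnable sketch (shapes are made up; tensor1 is cloned first because writing in place into a leaf tensor that requires grad raises an error, which is also the subject of the next question):

import torch

tensor1 = torch.randn(2, 4, 6, requires_grad=True)
tensor2 = torch.randn(2, 4, 6, requires_grad=True)

result = tensor1.clone()  # non-leaf copy, safe to write into
result[:, :, :3] = tensor1[:, :, :3] + tensor2[:, :, :3].detach()

result.sum().backward()
print(tensor1.grad is not None)  # True: tensor1 is still part of the graph
print(tensor2.grad)              # None: tensor2 never entered the graph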

Related

How to change the value of a torch tensor with requires_grad=True so that the backpropagation can start again?

a = torch.tensor([1.0, 2.0], requires_grad=True)  # dtype must be floating point for autograd
b = a[0] * a[1]
b.backward()
print(a.grad)  # tensor([2., 1.])
but when I use a += 1 to give a another value and do another round of backpropagation, it complains that "a leaf Variable that requires grad is being used in an in-place operation".
However, a = a + 1 seems to work. What's the difference between a = a + 1 and a += 1?
(solved)
My confusion about torch actually comes down to the following small example:
a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a[0] * a[1]
b.backward()
print(a.grad)           # tensor([2., 1.])
print(a.requires_grad)  # True
c = 2 * a[0] * a[1]
c.backward()
print(a.grad)           # tensor([6., 3.]): gradients accumulated
When I do the first round of backpropagation, I get a.grad = [2, 1].
Then I want to make a fresh start and calculate the gradient of c with respect to a; however, the gradients seem to accumulate. How can I eliminate this effect?
The += operator performs an in-place operation, while a = a + 1 does not (after it, a refers to a different tensor). In-place operations are disallowed on leaf variables that require grad, which is why a += 1 raises the error while a = a + 1 works.
But in your posted example you don't seem to be using either of these, so it is hard to say what you want to achieve.
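On the accumulation question: PyTorch deliberately accumulates gradients in .grad across backward() calls, so to start fresh you reset them explicitly before the next pass. A minimal sketch:

import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)

b = a[0] * a[1]
b.backward()
print(a.grad)    # tensor([2., 1.])

a.grad.zero_()   # or a.grad = None; clears the accumulated gradient

c = 2 * a[0] * a[1]
c.backward()
print(a.grad)    # tensor([4., 2.]) instead of the accumulated tensor([6., 3.])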

Freeze only some lines of a torch.nn.Embedding object

I am quite a newbie to Pytorch, and I am trying to implement a sort of "post-training" procedure on embeddings.
I have a vocabulary with a set of items, and I have learned one vector for each of them.
I keep the learned vectors in a nn.Embedding object.
What I'd like to do now is to add a new item to the vocabulary without updating the already learned vectors. The embedding for the new item would be initialized randomly, and then trained while keeping all the other embeddings frozen.
I know that in order to prevent an nn.Embedding from being trained, I need to set its requires_grad attribute to False. I have also found this other question that is similar to mine. The best answer proposes to
either store the frozen vectors and the vector to train in different nn.Embedding objects, the former with requires_grad = False and the latter with requires_grad = True,
or store the frozen vectors and the new one in the same nn.Embedding object, computing the gradient on all vectors but descending only along the dimensions of the new item's vector. This, however, leads to a significant degradation in performance (which I want to avoid, of course).
My problem is that I really need to store the vector for the new item in the same nn.Embedding object as the frozen vectors of the old items. The reason for this constraint is the following: when building my loss function with the embeddings of the items (old and new), I need to look up the vectors based on the ids of the items, and for performance reasons I need to use Python slicing. In other words, given a list of item ids item_ids, I need to do something like vecs = embedding[item_ids]. If I used two different nn.Embedding objects for the old items and the new one, I would need an explicit for-loop with if-else conditions, which would lead to worse performance.
Is there any way I can do this?
If you look at the implementation of nn.Embedding, you'll see it uses the functional form F.embedding in its forward pass. You could therefore implement a custom module that does something like this:
import torch
from torch.nn.parameter import Parameter
import torch.nn.functional as F

weights_freeze = torch.rand(10, 5)  # plain tensor, not a Parameter, so it stays frozen
weights_train = Parameter(torch.rand(2, 5))
weights = torch.cat((weights_freeze, weights_train), 0)

idx = torch.tensor([[11, 1, 3]])
lookup = F.embedding(idx, weights)
print(lookup)  # desired result

lookup.sum().backward()
# index 11 corresponds to row 1 of weights_train, so that row has a gradient
print(weights_train.grad)
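If you want this packaged as a reusable module, here is a minimal sketch (the class name and sizes are illustrative, not part of the original answer):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartiallyFrozenEmbedding(nn.Module):  # hypothetical name
    def __init__(self, n_frozen=10, n_train=2, dim=5):
        super().__init__()
        # a buffer travels with the module (state_dict, .to(device)) but is not trained
        self.register_buffer('weights_freeze', torch.rand(n_frozen, dim))
        self.weights_train = nn.Parameter(torch.rand(n_train, dim))

    def forward(self, idx):
        weights = torch.cat((self.weights_freeze, self.weights_train), 0)
        return F.embedding(idx, weights)

emb = PartiallyFrozenEmbedding()
out = emb(torch.tensor([[11, 1, 3]]))  # one lookup mixing frozen and trainable rows
out.sum().backward()
print(emb.weights_train.grad)  # only the trainable block receives gradients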

Using scipy minimize with constraint on one parameter

I am using the scipy.minimize function, and I'd like one of the parameters to search only over values with two decimals.
from scipy.optimize import minimize
from sklearn.metrics import mean_squared_error

def cost(parameters, input, target):
    output = self.model(parameters=parameters, input=input)
    return mean_squared_error(target.flatten(), output.flatten())

parameters = [1, 1]  # initial parameters
res = minimize(fun=cost, x0=parameters, args=(input, target))
model_parameters = res.x
Here self.model is a function that performs some matrix manipulation based on the parameters; input and target are two matrices. The function works the way I want, except that I would like parameters[1] to be constrained. Ideally I'd just like to give it a numpy array, like np.arange(0, 10, 0.01). Is this possible?
In general this is very hard to do, as smoothness is one of the core assumptions of these optimizers.
Problems where some variables are discrete and some are not are hard, and are usually tackled either by mixed-integer optimization (which works well for MI linear programming and reasonably well for MI convex programming, although there are fewer good solvers) or by global optimization (usually derivative-free).
Depending on the details of your task, I recommend decomposing the problem, as sketched below:
an outer loop that fixes the variable to each value in np.arange(0, 10, 0.01)
an inner loop that optimizes the remaining variables while this one is fixed
return the model with the best objective (with status=success)
This results in N inner optimizations, where N is the size of the state space of the variable to fix.
Depending on your task/data, it might be a good idea to traverse the fixing space monotonically (e.g. using np's arange) and use the solution of iteration i as the initial point for problem i+1 (potentially fewer iterations are needed if the guess is good). But this is probably not relevant here; see the next part.
If you really have only 2 parameters, as indicated, this decomposition leaves an inner problem with only 1 variable. In that case, don't use minimize; use minimize_scalar (faster and more robust; it does not need an initial point).
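A minimal sketch of that decomposition under the question's setup (it assumes the cost function, input and target defined above; since fixing parameters[1] leaves a single free variable, minimize_scalar handles the inner problem):

import numpy as np
from scipy.optimize import minimize_scalar

def cost_fixed(p0, p1, input, target):
    # wraps the asker's cost with parameters[1] held fixed at p1
    return cost([p0, p1], input, target)

best = None
for p1 in np.arange(0, 10, 0.01):  # outer loop over the discretized parameter
    res = minimize_scalar(cost_fixed, args=(p1, input, target))
    if res.success and (best is None or res.fun < best[0]):
        best = (res.fun, res.x, p1)

best_cost, best_p0, best_p1 = best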

Tensorflow: strange behavior in cifar10 example using custom variable creation method

This is a follow up question to this one.
I'm still working on the cifar10 example on the file cifar10.py and noticed some strange behavior regarding the creation of variables.
But first a side question: why are the variables created with a weight decay factor of wd=0.0 rather than wd=None? That way you would have fewer vertices in the computation graph.
Next, the strange behavior. I added the following function to make it more convenient to create variables:
def _create_variable(name, shape, initializer, wd=None):
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    with tf.device('/cpu:0'):
        var = tf.get_variable(name, shape, dtype, initializer)
    if wd is not None:
        wd_val = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', wd_val)
    return var
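For reference, a hypothetical usage following the cifar10 conventions (the shapes and initializer values here are illustrative, not from the original file):

weights = _create_variable('weights', [5, 5],
                           tf.truncated_normal_initializer(stddev=0.04),
                           wd=0.004)  # adds an L2 weight-decay term to 'losses'
biases = _create_variable('biases', [5],
                          tf.constant_initializer(0.0))  # wd=None: no extra graph vertices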
When using this function to create the variables (with the original parameters), the logits computed for the first batch fall in a range of about ±1e13, gradually improving to about ±1.5. The loss, on the other hand, starts at around 400000 and grows until it hits NaN.
When using the original functions to create the variables, the logits are in a range of ±1 right from the beginning, and the loss starts at around 4.5 and gradually gets smaller.
Can somebody explain what the difference is between my function and the provided ones for variable generation, and why the effect is so huge? I don't see it.
The full code of my modified cifar10.py can be found here. To test it out, simply replace the original file with my version. To then switch between the original function and mine, just change line 212 to CUSTOM = False.
Thank you in advance.
Stupid me! I used my own function the wrong way: I passed the value meant for stddev as the mean and used the default stddev of 1.
The curse of not passing arguments by name.
Anyway, why does this cause such a huge loss, sometimes even NaN?

What's difference between tf.sub and just minus operation in tensorflow?

I am trying to use Tensorflow. Here is a very simple piece of code.
train = tf.placeholder(tf.float32, [1], name="train")
W1 = tf.Variable(tf.truncated_normal([1], stddev=0.1), name="W1")
loss = tf.pow(tf.sub(train, W1), 2)
step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
Just ignore the optimization part (4th line). It will take a floating-point number and train W1 so as to minimize the squared difference.
My question is simple: if I use just the minus sign instead of tf.sub, as below, what is different? Will it cause a wrong result?
loss = tf.pow(train-W1, 2)
When I replace it, the result looks the same. If they are the same, why do we need the tf.add/tf.sub versions at all?
Can the built-in backpropagation calculation only be done through the tf.* functions?
Yes, - and + resolve to tf.sub and tf.add. If you look at the tensorflow code you will see that these operators on tf.Variable are overloaded with the tf.* methods.
As to why both exist, I assume the tf.* ones are there for consistency, so that sub and, say, matmul can be used in the same way, while the operator overloading is for convenience.
(tf.sub appears to have been replaced with tf.subtract)
The only advantage I see is that you can specify a name for the operation, as in:
tf.subtract(train, W1, name='foofoo')
This helps identify the operation causing an error, as the name you provide is shown in the message:
ValueError: Dimensions must be equal, but are 28 and 40 for 'foofoo' (op: 'Sub') with input shapes
It may also help when inspecting the graph in TensorBoard. It might be overkill for most people, since Python also shows the line number that triggered the error.
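A quick sketch showing that the two forms build the same op (this assumes TF 1.x graph mode, matching the question's API):

import tensorflow as tf

a = tf.constant([1.0, 2.0])
b = tf.constant([0.5, 0.5])

d1 = a - b                           # operator overload, auto-generated op name
d2 = tf.subtract(a, b, name='diff')  # same op type, explicit searchable name

print(d1.op.type, d2.op.type)  # both print 'Sub'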

Categories

Resources