Freeze only some lines of a torch.nn.Embedding object - python

I am quite a newbie to PyTorch, and I am trying to implement a sort of "post-training" procedure on embeddings.
I have a vocabulary with a set of items, and I have learned one vector for each of them.
I keep the learned vectors in a nn.Embedding object.
What I'd like to do now is to add a new item to the vocabulary without updating the already learned vectors. The embedding for the new item would be initialized randomly, and then trained while keeping all the other embeddings frozen.
I know that to prevent a nn.Embedding from being trained, I need to set its requires_grad attribute to False. I have also found another question similar to mine, whose best answer proposes to
either store the frozen vectors and the vector to train in different nn.Embedding objects, the former with requires_grad = False and the latter with requires_grad = True
or store the frozen vectors and the new one in the same nn.Embedding object, computing the gradient on all vectors but performing the descent step only on the dimensions of the new item's vector. This, however, leads to a significant degradation in performance (which I want to avoid, of course).
My problem is that I really need to store the vector for the new item in the same nn.Embedding object as the frozen vectors of the old items. The reason for this constraint is the following: when building my loss function with the embeddings of the items (old and new), I need to look up the vectors based on the ids of the items, and for performance reasons I need to use Python slicing. In other words, given a list of item ids item_ids, I need to do something like vecs = embedding[item_ids]. If I used two different nn.Embedding objects for the old items and the new one, I would need an explicit for-loop with if-else conditions, which would lead to worse performance.
Is there any way I can do this?

If you look at the implementation of nn.Embedding, you will see that it uses the functional form of embedding in the forward pass. Therefore, I think you could implement a custom module that does something like this:
import torch
from torch.nn.parameter import Parameter
import torch.nn.functional as F
weights_freeze = torch.rand(10, 5)  # frozen rows: a plain tensor, not a Parameter, so it receives no gradient
weights_train = Parameter(torch.rand(2, 5))  # trainable rows for the new items
weights = torch.cat((weights_freeze, weights_train), 0)  # full lookup table: ids 0-9 frozen, ids 10-11 trainable
idx = torch.tensor([[11, 1, 3]])
lookup = F.embedding(idx, weights)
# Desired result
print(lookup)
lookup.sum().backward()
# id 11 corresponds to row 1 in weights_train, so this has a gradient
print(weights_train.grad)
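For example, a minimal sketch of such a custom module (the class name and constructor signature below are illustrative, not from the original answer) could look like this:
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartiallyFrozenEmbedding(nn.Module):
    def __init__(self, frozen_weights, num_new, embedding_dim):
        super().__init__()
        # A buffer travels with the module (.to(), state_dict) but is never trained
        self.register_buffer('weights_freeze', frozen_weights)
        self.weights_train = nn.Parameter(torch.randn(num_new, embedding_dim))

    def forward(self, idx):
        # Single lookup table, so plain id-based indexing keeps working
        weights = torch.cat((self.weights_freeze, self.weights_train), 0)
        return F.embedding(idx, weights)

# Usage: ids 0-9 stay fixed, ids 10-11 are trainable
emb = PartiallyFrozenEmbedding(torch.rand(10, 5), num_new=2, embedding_dim=5)
lookup = emb(torch.tensor([[11, 1, 3]]))
lookup.sum().backward()
print(emb.weights_train.grad)  # only the new rows receive gradients
Note that only weights_train shows up in emb.parameters(), so an optimizer built from the module's parameters will never touch the frozen rows.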

Related

Add values of one tensor to another without affecting the graph

I simply want to add the first three values of the third dimension of tensor2 to tensor1, without affecting the graph when doing backpropagation. Tensor2 is only required for its values; it should not be part of the graph.
Does this work? That's how I would have done it in numpy.
tensor1[:, :, :3] += tensor2[:, :, :3]
Or should I rather use torch.add(), or use .data? I am confused about when to use what. Thank you.
You should be able to use detach() to return a copy of the tensor (tensor2) with requires_grad = False. Using the in-place += operator can cause errors during backpropagation (i.e. at various points in the forward pass the same variable stores two different values with two different associated gradients, but only one value/gradient pair can be kept for the backward pass). I'm a bit fuzzy on whether in-place operations are allowed on variables that are part of the computation graph when the operation itself is not; you can test this to see, but to be safe I recommend:
tensor1[:, :, :3] = torch.add(tensor1[:, :, :3], tensor2[:, :, :3].detach())
Later, if you want to do another operation with tensor2 where the gradient IS part of the computation graph, you can still do that as well.
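As a minimal sketch (shapes chosen arbitrarily, and tensor1 cloned first because writing in place into a leaf that requires grad is not allowed), you can check that the detached slice contributes values but no gradient:
import torch

tensor1 = torch.randn(2, 3, 5, requires_grad=True)
tensor2 = torch.randn(2, 3, 5, requires_grad=True)

out = tensor1.clone()  # non-leaf copy we are allowed to modify in place
out[:, :, :3] = torch.add(out[:, :, :3], tensor2[:, :, :3].detach())

out.sum().backward()
print(tensor1.grad.shape)  # gradients still reach tensor1
print(tensor2.grad)        # None: detach() kept tensor2 out of the graph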

When does dataloader shuffle happen for Pytorch?

I have been using the shuffle option of the PyTorch DataLoader many times. But I was wondering when this shuffling happens and whether it is performed dynamically during iteration. Take the following code as an example:
namesDataset = NamesDataset()
namesTrainLoader = DataLoader(namesDataset, batch_size=16, shuffle=True)
for batch_data in namesTrainLoader:
    print(batch_data)
When we define "namesTrainLoader", does that mean the shuffling is finished and the following iteration will be based on a fixed order of data? Will there be any randomness in the for loop after namesTrainLoader was defined?
I was trying to replace half of "batch_data" with some special value:
for batch_data in namesTrainLoader:
    batch_data[:8] = special_val
    pre = model(batch_data)
Let us say there will be an infinite number of epochs; will "model" eventually see all the data in "namesTrainLoader"? Or is half of the data of "namesTrainLoader" actually lost to "model"?
The shuffling happens when the iterator is created. In the case of the for loop, that happens just before the for loop starts.
You can create the iterator manually with:
# Iterator gets created, the data has been shuffled at this point.
data_iterator = iter(namesTrainLoader)
By default the data loader uses torch.utils.data.RandomSampler if you set shuffle=True (without providing your own sampler). Its implementation is very straightforward, and you can see where the data is shuffled when the iterator is created by looking at the RandomSampler.__iter__ method:
def __iter__(self):
    n = len(self.data_source)
    if self.replacement:
        return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
    return iter(torch.randperm(n).tolist())
The return statement is the important part, where the shuffling takes place. It simply creates a random permutation of the indices.
That means you will see your entire dataset every time you fully consume the iterator, just in a different order every time. Therefore there is no data lost (not including cases with drop_last=True) and your model will see all data at every epoch.
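You can see this behaviour with a toy dataset of integers standing in for NamesDataset: each pass over the loader draws a fresh permutation.
import torch
from torch.utils.data import DataLoader

dataset = list(range(8))  # toy stand-in for NamesDataset
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# Each for-loop (i.e. each call to iter()) shuffles anew
print([batch.tolist() for batch in loader])  # e.g. [[5, 2, 7, 0], [3, 6, 1, 4]]
print([batch.tolist() for batch in loader])  # a different order, same 8 elements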
You can check PyTorch's implementation of torch.utils.data.DataLoader here.
If you specify shuffle=True torch.utils.data.RandomSampler will be used (SequentialSampler otherwise).
When an instance of DataLoader is created, nothing is shuffled yet; it just instantiates the necessary private members of the object and does other setup-like things.
When the special __iter__ method is issued during iteration, as in your case, a special object named _SingleProcessDataLoader(self) is returned, which is a generator of data (possibly batched, shuffled etc., assuming you don't use multiprocessing).
There is a bit of a rabbit hole to follow to find all the related private and helper methods, but what it basically does is use the underlying sampler to get indices, which are used to fetch samples from the torch.utils.data.Dataset.
The sampler is run until exhaustion and then the process repeats (usually this corresponds to a single epoch).
Will there be any randomness in the for loop after namesTrainLoader was defined?
At the start of each cycle/epoch, RandomSampler shuffles the indices, so yes, it will be randomized before every epoch (when __iter__ is called and a new _SingleProcessDataLoader(self) is returned), and this can go on indefinitely.
[...] will "model" eventually see all the data in "namesTrainLoader"?
Yes, it will most probably see all data points eventually.
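To convince yourself that overwriting the first half of each batch does not permanently hide any data, here is a small simulation with toy integer data in place of NamesDataset:
import torch
from torch.utils.data import DataLoader

data = list(range(16))
loader = DataLoader(data, batch_size=16, shuffle=True)  # one shuffled batch per epoch

seen = set()
for epoch in range(50):              # stand-in for "infinitely many" epochs
    batch = next(iter(loader))       # fresh permutation every epoch
    seen.update(batch[8:].tolist())  # only the second half survives the overwrite
print(len(seen))                     # reaches 16 after a few epochs: nothing is lost for good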

Sklearn SVM with non-numpy objects and custom kernel

I want to do SVM classification (i.e. OneClassSVM) with sklearn.svm.OneClassSVM on physical states that come from a different library (tenpy). I'd define a custom kernel
def overlap(X, Y):
    return np.array([[x.overlap(y) for y in Y] for x in X])
where overlap() is a function defined in said library to calculate the overlap between states. When I try to fit with my data
clf = OneClassSVM(kernel=overlap)
clf.fit(states)
where states is a list of such state objects, I get the error
TypeError: float() argument must be a string or a number, not 'MPS'
Is there a way to tell sklearn to ignore this test (w/o editing the source code)?
To my understanding, the nature of the data and how it is processed is in principle not essential to the algorithm, as long as there is a well-defined kernel for the objects.
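One common workaround (not from this thread, just a sketch with a dummy class standing in for the tenpy MPS objects) is to precompute the Gram matrix yourself and pass kernel='precomputed', so sklearn only ever sees a numeric array and never tries to cast the state objects to floats:
import numpy as np
from sklearn.svm import OneClassSVM

class DummyState:
    """Stand-in for a tenpy MPS, just so the sketch runs end to end."""
    def __init__(self, v):
        self.v = np.asarray(v, dtype=float)
    def overlap(self, other):
        return float(self.v @ other.v)

def overlap_matrix(X, Y):
    # With real MPS objects the overlap may be complex; take e.g. np.real(...) if so
    return np.array([[x.overlap(y) for y in Y] for x in X], dtype=float)

states = [DummyState(np.random.rand(4)) for _ in range(20)]
clf = OneClassSVM(kernel='precomputed')
clf.fit(overlap_matrix(states, states))          # (n_samples, n_samples) Gram matrix

new_states = [DummyState(np.random.rand(4)) for _ in range(5)]
print(clf.predict(overlap_matrix(new_states, states)))  # kernel between new and training states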

Using scipy minimize with constraint on one parameter

I am using scipy's minimize function, and I'd like one parameter to only search over values with two decimals.
def cost(parameters, input, target):
    from sklearn.metrics import mean_squared_error
    output = self.model(parameters=parameters, input=input)
    cost = mean_squared_error(target.flatten(), output.flatten())
    return cost

parameters = [1, 1]  # initial parameters
res = minimize(fun=cost, x0=parameters, args=(input, target))
model_parameters = res.x
Here self.model is a function that performs some matrix manipulation based on the parameters. input and target are two matrices. The function works the way I want, except that I would like parameter[1] to have a constraint. Ideally I'd just like to give it a numpy array, like np.arange(0, 10, 0.01). Is this possible?
In general this is very hard to do, as smoothness is one of the core assumptions of those optimizers.
Problems where some variables are discrete and some are not are hard, and are usually tackled either by mixed-integer optimization (working well for MI linear programming, quite okay for MI convex programming, although there are fewer good solvers) or by global optimization (usually derivative-free).
Depending on your task details, I recommend decomposing the problem:
outer-loop for np.arange(0,10,0.01)-like fixing of variable
inner-loop for optimizing, where this variable is fixed
return the model with the best objective (with status=success)
This will result in N inner optimizations, where N is the size of the state space of the variable you fix.
Depending on your task/data, it might be a good idea to traverse the fixing space monotonically (e.g. using np's arange) and use the solution of iteration i as the initial point for problem i+1 (potentially fewer iterations needed if the guess is good). But this is probably not relevant here; see the next part.
If you really have only 2 parameters, as indicated, this decomposition leads to an inner problem with only 1 variable. In that case, don't use minimize, use minimize_scalar (faster and more robust; does not need an initial point).
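A rough sketch of that decomposition (with a toy quadratic objective and illustrative parameter names in place of the asker's self.model-based cost):
import numpy as np
from scipy.optimize import minimize_scalar

def cost(p0, p1):
    # Toy objective standing in for the real cost(parameters, input, target)
    return (p0 - 3.7) ** 2 + (p1 - 2.5) ** 2

best = None
for p1 in np.arange(0, 10, 0.01):                          # outer loop: fix the two-decimal parameter
    res = minimize_scalar(lambda p0, p1=p1: cost(p0, p1))  # inner loop: 1-D optimization over p0
    if best is None or res.fun < best[0]:
        best = (res.fun, res.x, round(p1, 2))

print(best)  # (objective value, best p0, best p1)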

Tensorflow: strange behavior in cifar10 example using custom variable creation method

This is a follow-up question to this one.
I'm still working on the cifar10 example in the file cifar10.py and noticed some strange behavior regarding the creation of variables.
But first a side question: why are the variables created with a weight decay factor of wd=0.0 and not wd=None? That way you would have fewer vertices in the computation graph.
Next, the strange behavior. I added the following function to make it more convenient to create variables:
def _create_variable(name, shape, initializer, wd=None):
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    with tf.device('/cpu:0'):
        var = tf.get_variable(name, shape, dtype, initializer)
    if wd is not None:
        wd_val = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', wd_val)
    return var
When using this function to create the variables (with the original parameters), the computed logits start out in a range of about ±1e13 for the first batch, gradually getting better and reaching ±1.5. The loss, on the other hand, starts at around 400,000 and gets bigger until it hits NaN.
When using the original functions to create the variables, the logits are in a range of ±1 right from the beginning and the loss starts at around 4.5, gradually getting smaller.
Can somebody explain to me what the difference is between my function and the provided functions for variable creation, and why the effect is so huge? I don't see it.
The full code of my modified cifar10.py can be found here. To test it out, simply replace the original file with my version. To then switch between the original and my function, simply change line 212 to CUSTOM = False.
Thank you in advance.
Stupid me! I used my own function the wrong way and passed the value intended for stddev as the mean, leaving the default stddev of 1.
The curse of not addressing the arguments by their name.
Anyway, why does this cause such a huge loss, sometimes even NaN?
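To illustrate the mix-up (assuming the initializer in use was tf.truncated_normal_initializer, as in the original cifar10.py, and taking 5e-2 purely as an example value): its first positional argument is the mean, so passing the intended stddev positionally leaves stddev at its default of 1.0 and produces weights roughly 20x larger than intended, which is consistent with the exploding logits and loss described above.
import tensorflow as tf

# Intended: small weights centered at zero
good_init = tf.truncated_normal_initializer(stddev=5e-2)

# What actually happened: 5e-2 was taken as the mean, stddev stayed at its default of 1.0
bad_init = tf.truncated_normal_initializer(5e-2)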
