I have been using the shuffle option of the PyTorch DataLoader many times. But I was wondering when this shuffling happens and whether it is performed dynamically during iteration. Take the following code as an example:
namesDataset = NamesDataset()
namesTrainLoader = DataLoader(namesDataset, batch_size=16, shuffle=True)
for batch_data in namesTrainLoader:
    print(batch_data)
When we define "namesTrainLoader", does that mean the shuffling is finished and the following iteration will be based on a fixed order of data? Will there be any randomness in the for loop after namesTrainLoader was defined?
I was trying to replace half of "batch_data" with some special value:
for batch_data in namesTrainLoader:
    batch_data[:8] = special_val
    pre = model(batch_data)
Let us say there will be an infinite number of epochs: will "model" eventually see all the data in "namesTrainLoader"? Or is half of the data of "namesTrainLoader" actually lost to "model"?
The shuffling happens when the iterator is created. In the case of the for loop, that happens just before the for loop starts.
You can create the iterator manually with:
# Iterator gets created, the data has been shuffled at this point.
data_iterator = iter(namesTrainLoader)
By default the data loader uses torch.utils.data.RandomSampler if you set shuffle=True (without providing your own sampler). Its implementation is very straightforward, and you can see where the data is shuffled when the iterator is created by looking at the RandomSampler.__iter__ method:
def __iter__(self):
    n = len(self.data_source)
    if self.replacement:
        return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
    return iter(torch.randperm(n).tolist())
The return statement is the important part, where the shuffling takes place. It simply creates a random permutation of the indices.
That means you will see your entire dataset every time you fully consume the iterator, just in a different order every time. Therefore there is no data lost (not including cases with drop_last=True) and your model will see all data at every epoch.
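A quick way to see this behavior yourself, sketched here with a toy TensorDataset standing in for NamesDataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))
loader = DataLoader(dataset, batch_size=5, shuffle=True)

# Each pass over the loader creates a fresh iterator, so each epoch
# gets its own independent shuffle.
for epoch in range(2):
    print([batch[0].tolist() for batch in loader])
# e.g. [[3, 7, 0, 9, 2], [5, 1, 8, 4, 6]] on the first pass
#      [[8, 2, 6, 1, 9], [0, 4, 7, 3, 5]] on the second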
You can check PyTorch's implementation of torch.utils.data.DataLoader here.
If you specify shuffle=True, torch.utils.data.RandomSampler will be used (SequentialSampler otherwise).
When an instance of DataLoader is created, nothing is shuffled yet; it just instantiates the necessary private members of the object and does other setup-like work.
When the special __iter__ method is called during iteration, as in your case, a special object named _SingleProcessDataLoaderIter is returned, which is a generator of data (possibly batched, shuffled etc., assuming you don't use multiprocessing).
There is a bit of a rabbit hole to follow through all the private and helper methods, but what it basically does is use the underlying sampler to get indices, which are then used to fetch samples from torch.utils.data.Dataset.
The sampler is run until exhaustion, and then the process repeats (usually that corresponds to a single epoch).
Will there be any randomness in the for loop after namesTrainLoader was defined?
At the start of each cycle/epoch, RandomSampler shuffles the indices, so yes, the order is randomized before every epoch (when __iter__ is called and a new _SingleProcessDataLoaderIter is returned), and this can go on indefinitely.
[...] will "model" eventually see all the data in "namesTrainLoader"?
Yes, it will most probably see all data points eventually.
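If you want to convince yourself empirically, you could track which samples ever reach the model through the unmasked half of a batch over many epochs; a rough sketch, with a toy dataset standing in for NamesDataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(64))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

seen = set()
for epoch in range(50):
    for (batch_data,) in loader:
        # batch_data[:8] would be overwritten by special_val, so only the
        # second half reaches the model untouched.
        seen.update(batch_data[8:].tolist())

print(len(seen))  # approaches 64 as the number of epochs grows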
Related
Hi, I'm trying to wrap my head around the concept of a generator in Python, specifically while using spaCy.
As far as I understand, a generator runs only once, and nlp.pipe(list) returns a generator in order to use the machine efficiently.
And the generator worked as I predicted, like below:
matches = ['one', 'two', 'three']
docs = nlp.pipe(matches)
type(docs)
for x in docs:
    print(x)
# First iteration, worked
one
two
three
for x in docs:
    print(x)
# Nothing is printed this time
But a strange thing happened when I tried to make a list using the generator:
for things in nlp.pipe(example1):
    print(things)
# First iteration prints things
a is something
b is other thing
c is new thing
d is extra
for things in nlp.pipe(example1):
    print(things)
# Second iteration prints things again!
a is something
b is other thing
c is new thing
d is extra
Why does this generator run infinitely? I tried several times and it seems like it runs infinitely.
Thank you
I think you're confused because the term "generator" can be used to mean two different things in Python.
The first thing it can mean is a "generator object", which is a kind of iterator. The docs variable you created in your first example is a reference to one of these. A generator object can only be iterated once; after that it's exhausted and you'll need to create another one if you want to do more iteration.
The other thing "generator" can mean is a "generator function". A generator function is a function that returns a generator object when you call it. Indeed, the term "generator" is sometimes sloppily used for functions that return iterators generally, even when that's not technically correct. A real generator function is implemented using the yield keyword, but from the caller's perspective, it doesn't really matter how the function is implemented, just that it returns some kind of iterator.
I don't know anything about the library you're using, but it seems like nlp.pipe returns an iterator, so at least in the loosest sense it can be called a generator function. The iterator it returns is (presumably) the generator object.
Generator objects are single-use, like all iterators are supposed to be. Generator functions on the other hand, can be called as many times as you find appropriate (some might have side effects). Each time you call the generator function, you'll get a new generator object. This is why your second code block works, as you're calling nlp.pipe once for each loop, rather than iterating on the same iterator for both loops.
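A minimal generic example of that distinction (count_up is a made-up generator function, not part of spaCy):

def count_up(n):
    # Generator function: each call returns a fresh generator object.
    for i in range(n):
        yield i

gen = count_up(3)
print(list(gen))          # [0, 1, 2]
print(list(gen))          # [] -- this generator object is exhausted

print(list(count_up(3)))  # [0, 1, 2] -- a new call gives a new generator object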
for things in nlp.pipe(example1) creates a new instance of the nlp.pipe() generator (i.e. an iterator).
If you had assigned the generator to a variable and used the variable multiple times, then you would have seen the effect that you were expecting:
pipeGen = nlp.pipe(example1)
for things in pipeGen:
    print(things)
# First iteration will print things
for things in pipeGen:
    print(things)
# Second iteration will print nothing
In other words nlp.pipe() returns a NEW iterator whereas pipeGen IS an iterator.
I am quite a newbie to PyTorch, and I am trying to implement a sort of "post-training" procedure on embeddings.
I have a vocabulary with a set of items, and I have learned one vector for each of them.
I keep the learned vectors in a nn.Embedding object.
What I'd like to do now is to add a new item to the vocabulary without updating the already learned vectors. The embedding for the new item would be initialized randomly, and then trained while keeping all the other embeddings frozen.
I know that in order to prevent an nn.Embedding from being trained, I need to set its requires_grad attribute to False. I have also found this other question that is similar to mine. The best answer proposes to
either store the frozen vectors and the vector to train in different nn.Embedding objects, the former with requires_grad = False and the latter with requires_grad = True
or store the frozen vectors and the new one in the same nn.Embedding object, computing the gradient on all vectors but descending only along the dimensions of the vector of the new item. This, however, leads to a relevant degradation in performance (which I want to avoid, of course).
My problem is that I really need to store the vector for the new item in the same nn.Embedding object as the frozen vectors of the old items. The reason for this constraint is the following: when building my loss function with the embeddings of the items (old and new), I need to look up the vectors based on the ids of the items, and for performance reasons I need to use Python slicing. In other words, given a list of item ids item_ids, I need to do something like vecs = embedding[item_ids]. If I used two different nn.Embedding objects for the old items and the new one, I would need to use an explicit for-loop with if-else conditions, which would lead to worse performance.
Is there any way I can do this?
If you look at the implementation of nn.Embedding it uses the functional form of embedding in the forward pass. Therefore, I think you could implement a custom module that does something like this:
import torch
from torch.nn.parameter import Parameter
import torch.nn.functional as F
# Frozen rows: a plain tensor, not a Parameter, so they receive no gradient
weights_freeze = torch.rand(10, 5)
# Trainable rows: wrapped in Parameter so they do receive gradients
weights_train = Parameter(torch.rand(2, 5))
# Single lookup table: rows 0-9 are frozen, rows 10-11 are trainable
weights = torch.cat((weights_freeze, weights_train), 0)
idx = torch.tensor([[11, 1, 3]])
lookup = F.embedding(idx, weights)
# Desired result
print(lookup)
lookup.sum().backward()
# Index 11 corresponds to row 1 of weights_train, so that row has a gradient
print(weights_train.grad)
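A rough sketch of how this could be wrapped in a custom module; the name PartiallyFrozenEmbedding is hypothetical, and the table is re-concatenated on every forward pass so that autograd only tracks the trainable rows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartiallyFrozenEmbedding(nn.Module):
    """Embedding whose first rows are frozen and whose last rows are trainable."""

    def __init__(self, frozen_weights, num_new, embedding_dim):
        super().__init__()
        # A buffer moves with the module (.to(device), state_dict) but gets no gradient.
        self.register_buffer("frozen", frozen_weights)
        self.trainable = nn.Parameter(torch.randn(num_new, embedding_dim))

    def forward(self, idx):
        weights = torch.cat((self.frozen, self.trainable), dim=0)
        return F.embedding(idx, weights)

# Usage: 10 frozen vectors of size 5 plus 1 new trainable vector.
emb = PartiallyFrozenEmbedding(torch.rand(10, 5), num_new=1, embedding_dim=5)
out = emb(torch.tensor([0, 3, 10]))  # index 10 is the new, trainable item
out.sum().backward()
print(emb.trainable.grad)            # only the trainable rows receive a gradient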
In the code below, i1 is an iterator.
def sq(x):
    y = []
    for i in x:
        y.append(i**2)
    return y

l1 = range(5)
s1 = sq(l1)
i1 = iter(s1)
I can write a generator for the same squaring operation. In the code below, g1 is a generator.
def sqg(x):
    for i in x:
        yield i**2

g1 = sqg(l1)
I know that generators use less code and are simpler to read and write. I know that generators also run faster because they maintain their local states.
Are there any advantages to using i1 over g1?
When you call sq(l1), a list y is populated inside sq. This consumes an amount of memory proportional to the size of x once the input is exhausted.
In the second case, when you call sqg(l1), sqg does not have any internal list used to store the results. It directly yields computed values, so the memory it consumes is constant and independent of the size of x.
As for advantages of non-generator iterators over generators, I don't believe there are performance advantages, but there could be structural advantages. A generator (a type of iterator like you noted) is defined to be an iterator returned by calling a function with yield statements inside of it. That means that you cannot add any additional methods that can be called to the object representing the generator, because this special type of iterator is given to you implicitly.
On the other hand, an iterator has a looser definition: an object with a __next__ method and an __iter__ method returning self. You could make a class Squares that follows the previous criteria for an iterator, and in order to get an instance to this iterator, you would have to explicitly instantiate Squares. Because you have control over the attributes of the iterator returned to you, you could add instance methods returning internal state of that iterator that aren't expressed through __next__, whereas with a generator you're locked into the generator object provided to you implicitly. Often a generator will do the job, but sometimes you need to use a non-generator iterator to get the control you need past the functionality provided by __next__.
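For example, a minimal sketch of such a Squares class, with an extra method exposing internal state that an implicitly created generator object could not offer:

class Squares:
    """Iterator over the squares of the values of an iterable."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.count = 0  # internal state we choose to expose

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self._it)  # raises StopIteration when the input is done
        self.count += 1
        return value ** 2

    def produced_so_far(self):
        # An extra instance method: not possible on a plain generator object.
        return self.count

sq_iter = Squares(range(5))
print(next(sq_iter), next(sq_iter))  # 0 1
print(sq_iter.produced_so_far())     # 2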
In this specific case, I don't believe you need the explicit control given to you by using a non-generator iterator, so it would be better to use a generator.
There are advantages of creating a list s1 over a generator - it has a defined length, you can index and slice it, and you can iterate through it multiple times without re-creating it. Maybe you don't count these as advantages of the non-generator iterator, though.
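For instance, reusing sq and sqg from the question:

s1 = sq(range(5))
g1 = sqg(range(5))

print(list(s1))  # [0, 1, 4, 9, 16]
print(list(s1))  # [0, 1, 4, 9, 16] -- the list can be traversed again

print(list(g1))  # [0, 1, 4, 9, 16]
print(list(g1))  # [] -- the generator is exhausted after one pass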
Another difference is that an iterator based on a list involves doing all the work upfront and then caching the results, while the generator does the work one step at a time. If the processing task is resource-intensive, then the list approach will cause an initial pause while the list is generated, then run faster (because you only have to retrieve the results from memory; also consider that the results could be cached in a file, for example). The generator approach would have no initial pause but would then run slower as it generates each result.
This is a follow-up question to this one.
I'm still working on the cifar10 example in the file cifar10.py and noticed some strange behavior regarding the creation of variables.
But first a side question: why are the variables created with a weight decay factor of wd=0.0 and not wd=None? That way you would have fewer vertices in the computation graph.
Next, the strange behavior. I added the following function to make it more convenient to create variables:
def _create_variable(name, shape, initializer, wd=None):
    dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
    with tf.device('/cpu:0'):
        var = tf.get_variable(name, shape, dtype, initializer)
    if wd is not None:
        wd_val = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', wd_val)
    return var
When using this function to create the variables (with the original parameters), the computed logits start in a range of ±1e13 for the first batch and gradually get better, reaching ±1.5. The loss, on the other hand, starts at around 400000 and gets bigger until it hits NaN.
When using the original functions to create the variables, the logits are in a range of ±1 right from the beginning and the loss starts at around 4.5, gradually getting smaller.
Can somebody explain to me what the difference is between my function and the provided ones for variable creation, and why the effect is so huge? I don't see it.
The full code of my modified cifar10.py can be found here. To test it out, simply replace the original file with my version. To then switch between the original and my function, simply change line 212 to CUSTOM = False.
Thank you in advance.
Stupid me! I used my own function the wrong way: I passed the values intended for stddev as the mean and used the default stddev of 1.
The curse of not addressing the arguments by their name.
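A hypothetical illustration of the mix-up, assuming a truncated-normal initializer like the one used in the original cifar10.py (TF 1.x-era API):

# Intended: zero mean, small spread.
init_ok = tf.truncated_normal_initializer(stddev=0.04)

# What actually happened: 0.04 was bound to the first positional
# argument (mean), leaving stddev at its default of 1.0.
init_bug = tf.truncated_normal_initializer(0.04)

A standard deviation of 1.0 makes the initial weights far larger than the small values the example intends, which is consistent with the huge initial logits described above.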
Anyway, why does this cause such a huge loss; sometimes even NaN?
I was wondering if there is a way that I can pass pyOpt a function that should be called at the end of each iteration?
The reason I need something like this is that I am running an FEA simulation in each function evaluation, and I would like to output the FEA results (displacements, stresses) to an ExodusII file after each optimization iteration. I originally placed my writeExodus function at the end of the "function evaluation" function; my problem with this is that a new "pseudo time-step" gets written to my exodus file each time the function is evaluated rather than only at the end of each iteration, so this obviously leads to extra, unnecessary output to the exodus file for numerical differentiation (finite difference, complex step) and for optimizers that make multiple function evaluations per iteration (e.g. GCMMA when checking whether the approximation is conservative).
So, is there a way I can tell pyOpt to execute a function (i.e. my exodusWrite function) at the end of each iteration? Or alternatively, is there any way I can track the optimizer iterations in pyOpt so that inside of my "function evaluation" function I can keep track of the optimizer iterations and only write the exodus output when the iteration number changes?
You could either put your function into a component that you place at the end of your model. Since it won't have any connections, you'll want to set the run order manually for your group.
Alternatively, you could just hack the pyopt_sparse driver code to manually call your function. You would just add a call to your method of choice at the end of this call, and it would then get called any time pyopt_sparse asks for an objective evaluation.
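A rough sketch of the first option, assuming a recent OpenMDAO-style API; the names ExodusWriter and write_exodus are hypothetical placeholders for your own writer:

import openmdao.api as om

class ExodusWriter(om.ExplicitComponent):
    """Unconnected component placed last in the model; its compute() runs
    once per model execution and dumps the current FEA state."""

    def setup(self):
        # Dummy output so the component participates in the model.
        self.add_output('dummy', val=0.0)

    def compute(self, inputs, outputs):
        write_exodus()  # placeholder for your ExodusII writer

# prob.model.add_subsystem('fea', FeaComponent())           # existing analysis
# prob.model.add_subsystem('exodus_writer', ExodusWriter())
# prob.model.set_order(['fea', 'exodus_writer'])            # force it to run last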