This question is making me pull my hair out.
If I do:

def mygen():
    for i in range(100):
        yield i
and call it from one thousand threads, how does the generator know what to send next for each thread?
Every time I call it, does the generator save a table with the counter and the caller reference, or something like that?
It's weird.
Please, clarify my mind on that one.
mygen does not have to remember anything. Every call to mygen() returns an independent generator object. These objects, on the other hand, have state: every time next() is called on one, execution jumps to the correct place in the generator code; when a yield is encountered, control is handed back to the caller. The actual implementation is rather messy, but in principle you can imagine that such an iterator stores the local variables, the bytecode, and the current position in the bytecode (a.k.a. the instruction pointer). There is nothing special about threads here.
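You can actually watch that per-object state from the outside with the standard inspect module. A small illustration, reusing the mygen from the question:

import inspect

def mygen():
    for i in range(100):
        yield i

g1 = mygen()
g2 = mygen()

print(inspect.getgeneratorstate(g1))  # GEN_CREATED: not started yet
next(g1)
next(g1)
print(g1.gi_frame.f_locals)           # {'i': 1}: g1's own saved locals
print(inspect.getgeneratorstate(g1))  # GEN_SUSPENDED: paused at the yield
print(next(g2))                       # 0: g2 is completely independent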
A function like this, when called, will return a generator object. If you have separate threads calling next() on the same generator object, they will interfere with each other. That is to say, 5 threads calling next() 10 times each will get 50 different yielded values between them.
If two threads each create a generator by calling mygen() within the thread, they will have separate generator objects.
A generator is an object, and its state will be stored in memory, so two threads that each create a mygen() will refer to separate objects. It'd be no different than two threads creating an object from a class, they'll each have a different object, even though the class is the same.
If you're coming at this from a C background, note that this is not the same thing as a function with static variables. The state is maintained in an object, not statically in variables contained in the function.
It might be clearer if you look at it this way. Instead of:
for i in mygen():
    . . .
use:
gen_obj = mygen()
for i in gen_obj:
    . . .
Then you can see that mygen() is only called once; it creates a new object, and it is that object that gets iterated. You could create two sequences in the same thread, if you wanted:
gen1 = mygen()
gen2 = mygen()
print(next(gen1), next(gen2), next(gen1), next(gen2))

This will print 0 0 1 1.
You could access the same iterator from two threads if you like, just store the generator object in a global:
global_gen = mygen()
Thread 1:

for i in global_gen:
    . . .

Thread 2:

for i in global_gen:
    . . .
This would probably cause all kinds of havoc. :-)
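If you really do need to share a single generator between threads, you can serialize access to it with a lock. A minimal sketch, reusing the mygen from the question:

import threading

def mygen():
    for i in range(100):
        yield i

global_gen = mygen()
lock = threading.Lock()

def worker():
    while True:
        with lock:                 # only one thread may advance the generator at a time
            try:
                value = next(global_gen)
            except StopIteration:  # exhausted: stop this thread
                return
        print(value)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Each value of the shared generator is seen by exactly one thread; without the lock, two threads resuming the generator at once can raise "ValueError: generator already executing".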
Hi, I'm trying to wrap my head around the concept of generators in Python, specifically using spaCy. As far as I understood, a generator can be consumed only once, and nlp.pipe(list) returns a generator to use the machine efficiently.
And the generator worked as I predicted like below.
matches = ['one', 'two', 'three']
docs = nlp.pipe(matches)
type(docs)

for x in docs:
    print(x)
# First iteration, worked:
one
two
three

for x in docs:
    print(x)
# Nothing is printed this time
But a strange thing happened when I used the generator function directly in the loop:
for things in nlp.pipe(example1):
    print(things)
# First iteration prints things:
a is something
b is other thing
c is new thing
d is extra

for things in nlp.pipe(example1):
    print(things)
# Second iteration prints things again!
a is something
b is other thing
c is new thing
d is extra
Why does this generator run indefinitely? I tried several times, and it seems to restart every time.
Thank you
I think you're confused because the term "generator" can be used to mean two different things in Python.
The first thing it can mean is a "generator object", which is a kind of iterator. The docs variable you created in your first example is a reference to one of these. A generator object can only be iterated once; after that it's exhausted, and you'll need to create another one if you want to iterate again.
The other thing "generator" can mean is a "generator function". A generator function is a function that returns a generator object when you call it. Indeed, the term "generator" is sometimes sloppily used for functions that return iterators generally, even when that's not technically correct. A real generator function is implemented using the yield keyword, but from the caller's perspective, it doesn't really matter how the function is implemented, just that it returns some kind of iterator.
I don't know anything about the library you're using, but it seems like nlp.pipe returns an iterator, so at least in the loosest sense it can be called a generator function. The iterator it returns is (presumably) a generator object.
Generator objects are single-use, like all iterators are supposed to be. Generator functions on the other hand, can be called as many times as you find appropriate (some might have side effects). Each time you call the generator function, you'll get a new generator object. This is why your second code block works, as you're calling nlp.pipe once for each loop, rather than iterating on the same iterator for both loops.
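A minimal illustration of the distinction, independent of spaCy:

def countdown(n):          # generator function
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)         # generator object
print(list(gen))           # [3, 2, 1]
print(list(gen))           # []: the object is exhausted
print(list(countdown(3)))  # [3, 2, 1]: a fresh object from a new call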
for things in nlp.pipe(example1) creates a new instance of the nlp.pipe() generator (i.e. an iterator).
If you had assigned the generator to a variable and used the variable multiple times, then you would have seen the effect that you were expecting:
pipeGen = nlp.pipe(example1)
for things in pipeGen:
    print(things)
# First iteration will print things

for things in pipeGen:
    print(things)
# Second iteration will print nothing
In other words nlp.pipe() returns a NEW iterator whereas pipeGen IS an iterator.
If within an instance, I have self.foo = 1, what is the difference between these (or other more complicated examples):
# 1
for i in range(10):
    print(self.foo)

# 2
foo = self.foo
for i in range(10):
    print(foo)
I'm currently looking at a code base where all the self variables are reassigned to something else. Just wondering if there is any reason to do so and would like to hear both from an efficiency standpoint and a code clarity standpoint.
Consider these possibilities:
The local variable self gets rebound in the middle of the loop. (That's not possible with the specific code you've given, but a different loop could conceivably do it.) In that case, #1 will see the new self's foo attribute, while #2 will not. Although, of course, you could just as easily rebind the local variable foo as the local variable self…
self is mutable, and self.foo is rebound to a different value in the middle of the loop. (That could happen more easily with, e.g., another thread operating on the same object; see the sketch after this list.) Again, #1 will see the new value of the foo attribute, but #2 will not.
self.foo is itself mutable, and its value is mutated in the middle of the loop (e.g., it's a list, and some other thread calls append(2) on it). Now both #1 and #2 will see the new value.
Everything is immutable, or there's just no code (including on other threads) to mutate anything. Now both #1 and #2 are going to see the original value, because there is no other value to see.
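Here is a runnable sketch of the second possibility, using a hypothetical Holder class as a stand-in for whatever class self is an instance of:

import threading, time

class Holder:                    # hypothetical stand-in for self's class
    def __init__(self):
        self.foo = 1

def loop_attr(obj):              # variant 1: re-read the attribute each pass
    for i in range(5):
        print('attr:', obj.foo)
        time.sleep(0.02)

def loop_local(obj):             # variant 2: snapshot the value once
    foo = obj.foo
    for i in range(5):
        print('local:', foo)
        time.sleep(0.02)

def rebind_later(obj):
    time.sleep(0.05)
    obj.foo = 2                  # rebind the attribute mid-loop

for loop in (loop_attr, loop_local):
    obj = Holder()
    t = threading.Thread(target=rebind_later, args=(obj,))
    t.start()
    loop(obj)                    # loop_attr starts printing 2; loop_local never does
    t.join()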
If any of those semantic differences are relevant, then of course you want to use whichever one gives you the right answer.
Meanwhile, every time you access self.foo, that requires doing an attribute lookup. In the most common case, this means looking up 'foo' in self.__dict__, which is pretty quick, but not free. And you can easily create pathological cases where it goes through 23 base classes in MRO order before calling a __getattr__ that creates the value on the fly and returns a descriptor whose __get__ method does some non-trivial transformation.
Accessing foo, on the other hand, is going to be compiled into just loading a value out of an array on the frame using a compiled-in index. So it will almost always be faster, and in some cases it can be a lot faster.
In most real-life cases, this doesn't matter at all. But occasionally, it does. In which case copying the value to a local outside the loop is a worthwhile micro-optimization. This is a little more common with bound methods than with normal values (because they always have a descriptor call in the way); see the unique_everseen recipe in the itertools docs for an example.
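The same hoisting idea in miniature, in the spirit of that recipe (a sketch, not a recommendation for every loop):

def collect(items):
    result = []
    append = result.append  # one attribute/descriptor lookup, done once
    for x in items:
        append(x)            # the loop body skips the per-iteration lookup
    return result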
Of course you could contrive a case where this optimization actually made things slower—e.g., make that loop really tiny, but put the whole thing inside an outer loop. Now the extra self.foo copy each time through the outer loop (and the fact that the bytecode involved in the loop is longer and may spill onto another cache line) could cost a lot more than it saves.
If there's no semantic difference that matters, and the performance difference doesn't matter, then it's just a matter of clarity.
If the expression is a lot more complicated than self.foo, it may well be clearer to pull out the value and give it a name.
But for a trivial case like this, it's probably clearer to just use self.foo. By taking the extra step of copying it to a local variable, you're signaling that you had some reason to do so. So a reader will wonder whether maybe self.foo can get rebound in a different thread, or maybe this loop is a major bottleneck in your code and the self.foo access is a performance issue, etc., and waste time dealing with all of those irrelevancies instead of just reading your code as intended.
I'm trying to manually step through a windows folder/ file structure using os.walk(). I'm working in Jupyter Notebooks.
If I execute:
next(os.walk(path))
I get a result that makes sense the first time, but then I keep getting exactly the same response every time I execute that statement.
However, if I do:
g = os.walk(path)
next(g)
then I do get the next logical record each time I execute:
next(g)
Note that both:
type(g) and type(os.walk(path))
return 'generator'.
Please explain why next() behaves differently depending on whether it is applied to g or to os.walk(path).
Thank you--
Because every time you call os.walk, you get a new generator which starts at the top (or bottom with topdown=False). If you call next repeatedly on the same generator, on the other hand, you will iterate through all the values it generates.
In principle, this is no different from how range operates: next(iter(range(42))) always produces 0 (a range is an iterable, not an iterator, so you first have to ask it for an iterator with iter()). If that were not the case, range would be pretty useless, since there would be no way of knowing where a given for i in range(x): iteration would start.
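The same pattern, spelled out in a small sketch:

r = range(42)
print(next(iter(r)))  # 0: iter() builds a fresh iterator each time
print(next(iter(r)))  # 0 again, for the same reason

it = iter(r)
print(next(it))       # 0
print(next(it))       # 1: the same iterator, advanced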
Every time you call os.walk(path) you create a new generator, one which is ready to walk through all the nodes accessible from the path, starting at the first one.
When you do next(os.walk(path)) you:
1. Create a new generator.
2. Extract the first item from the generator using next.
3. Drop the generator, which subsequently gets garbage collected and disappears, along with any knowledge of how many items you have extracted from it.
Repeating next(os.walk(path)) takes you back to point 1, which gets you a fresh generator starting at the first element yet again.
When you do g = os.walk(path); next(g) you:
1. Create a new generator.
2. Store the generator in the variable g. This prevents it from being garbage collected and preserves its internal state.
3. Extract the first element from the generator (using next) and advance its internal state.
Repeating next(g) gets you the next item from the generator you just used (see the sketch below).
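A concrete sketch of both behaviours in a throwaway directory tree (the tree layout is illustrative):

import os, tempfile

root = tempfile.mkdtemp()      # temporary tree: root/ and root/sub/
os.mkdir(os.path.join(root, 'sub'))

print(next(os.walk(root))[0])  # root: a fresh generator each time
print(next(os.walk(root))[0])  # root again

g = os.walk(root)
print(next(g)[0])              # root
print(next(g)[0])              # root/sub: the same generator, advanced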
os.walk is a generator function. Each time it is called, it returns a new generator object.
When you write g = os.walk(path), you create a new generator object and bind it to the name g. Each time you call next(g), the iterator takes one step.
When you write next(os.walk(path)), you create a new generator object but do not give it a name. You have advanced it once, but you have no way of advancing it again, because it was never bound to a name. That's the difference.
I've read through the documentation, but I don't understand what is meant by:
The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.
I'm using it to iterate over the list I want to operate on (allImages) as follows:
from joblib import Parallel, delayed

def joblib_loop():
    Parallel(n_jobs=8)(delayed(getHog)(i) for i in allImages)
This returns my HOG features, like I want (and with the speed gain using all my 8 cores), but I'm just not sure what it is actually doing.
My Python knowledge is alright at best, and it's very possible that I'm missing something basic. Any pointers in the right direction would be most appreciated
Perhaps things become clearer if we look at what would happen if instead we simply wrote
Parallel(n_jobs=8)(getHog(i) for i in allImages)
which, in this context, could be expressed more naturally as:
1. Create a Parallel instance with n_jobs=8.
2. Create a generator for the list [getHog(i) for i in allImages].
3. Pass that generator to the Parallel instance.
What's the problem? By the time the list gets passed to the Parallel object, all getHog(i) calls have already returned - so there's nothing left to execute in Parallel! All the work was already done in the main thread, sequentially.
What we actually want is to tell Python what functions we want to call with what arguments, without actually calling them - in other words, we want to delay the execution.
This is what delayed conveniently allows us to do, with clear syntax. If we want to tell Python that we'd like to call foo(2, g=3) sometime later, we can simply write delayed(foo)(2, g=3). Returned is the tuple (foo, (2,), {'g': 3}) (see the sketch after this list), containing:
a reference to the function we want to call, e.g. foo
all positional arguments ("args" for short), e.g. 2
all keyword arguments ("kwargs" for short), e.g. g=3
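Conceptually, delayed boils down to something like the following rough sketch (the real joblib version adds pickling checks, but the shape of the return value is the same):

def delayed(function):
    def delayed_function(*args, **kwargs):
        return function, args, kwargs
    return delayed_function

def foo(a, g=0):
    return a + g

print(delayed(foo)(2, g=3))  # (<function foo at 0x...>, (2,), {'g': 3})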
So, by writing Parallel(n_jobs=8)(delayed(getHog)(i) for i in allImages), instead of the above sequence the following happens:
1. A Parallel instance with n_jobs=8 gets created.
2. The list

[delayed(getHog)(i) for i in allImages]

gets created, evaluating to

[(getHog, (img1,), {}), (getHog, (img2,), {}), ... ]

3. That list is passed to the Parallel instance.
4. The Parallel instance creates 8 workers and distributes the tuples from the list to them.
5. Finally, each of those workers executes the tuples it receives, i.e., it calls the first element with the second and third elements unpacked as arguments, tup[0](*tup[1], **tup[2]), turning the tuple back into the call we actually intended to do, getHog(img2).
We need a loop to test a list of different model configurations. This is the main function that drives the grid search process and will call the score_model() function for each model configuration. We can dramatically speed up the grid search by evaluating model configurations in parallel. One way to do that is to use the joblib library. We can define a Parallel object with the number of cores to use, setting it to the number of cores detected in your hardware.
Define the executor:

from joblib import Parallel, delayed
from multiprocessing import cpu_count

executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')

Then create a list of tasks to execute in parallel, which will be one call to the score_model() function for each model configuration we have. Suppose:

def score_model(data, n_test, cfg):
    ...

Define the list of tasks:

tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)

Then we can use the Parallel object to execute the list of tasks in parallel:

scores = executor(tasks)
So what you want to be able to do is pile up a set of function calls and their arguments in such a way that you can pass them out efficiently to a scheduler/executor. delayed is a decorator that takes in a function and its args and wraps them into an object that can be put in a list and popped out as needed. Dask has the same thing, which it uses in part to feed its graph scheduler.
From reference https://wiki.python.org/moin/ParallelProcessing
The Parallel object creates a multiprocessing pool that forks the Python interpreter in multiple processes to execute each of the items of the list. The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax.
Another thing I would like to suggest: instead of explicitly defining the number of cores, we can generalize like this:

import multiprocessing

num_core = multiprocessing.cpu_count()
I am working with very large numpy/scipy arrays that take up a huge junk of memory. Suppose my code looks something like the following:
def do_something(a):
    a = a / a.sum()  # new memory is allocated
    # I don't need the original a any longer; how do I delete it?
    # do a lot more stuff

# a = super large numpy array
do_something(a)
print(a)  # still the same as originally
So I am calling a function with a huge numpy array. The function then processes the array in some way or the other, but the original object is still kept in memory. Is there any way to free the memory inside the function; deleting the reference does not work.
What you want cannot be done; Python will only free the memory when all references to the array object are gone, and you cannot delete the a reference in the calling namespace from inside the function.
Instead, break up your problem into smaller steps. Do your calculations on a in one function, delete a afterwards, and then call another function to do the rest of the work.
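A minimal sketch of that restructuring (the array size and names are illustrative):

import numpy as np

def normalize(a):
    return a / a.sum()          # allocates and returns a new array

a = np.random.rand(10_000_000)  # stand-in for the super large array
b = normalize(a)
del a                           # drop the last reference; the big array can now be freed
# ... do the rest of the work with b only ...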
Python uses a simple GC scheme: it relies primarily on reference counting (there is a generational collector too, but it is not relevant here). Every reference to an object increments a counter, and every reference that goes away decrements it.
The memory is deallocated only after the counter reaches 0.
So while you hold a reference to that object, it will stay in memory.
In your case, the caller of do_something still holds a reference to the object; if you want that variable gone, you can reduce its scope.
If you suspect memory leaks, you can set the DEBUG_LEAK flag and inspect the output; more info here: https://docs.python.org/2/library/gc.html
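A minimal sketch of that debugging hook (Python 3 syntax; the linked page is the Python 2 docs):

import gc

gc.set_debug(gc.DEBUG_LEAK)  # report objects the collector finds but cannot free
# ... run the code you suspect ...
gc.collect()                 # debug output is written to stderr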