Copy a generator - python

Let's say I have a generator like so
def gen():
    a = yield "Hello World"
    a_ = a + 1  # Imagine that on my computer "+ 1" is an expensive operation
    print "a_ = ", a_
    b = yield a_
    print "b =", b
    print "a_ =", a_
    yield b
Now let's say I do
>>> g = gen()
>>> g.next()
'Hello World'
>>> g.send(42)
a_ = 43
43
Now we have calculated a_. Now I would like to clone my generator like so.
>>> newG = clonify(g)
>>> newG.send(7)
b = 7
a_ = 43
7
but my original g still works.
>>> g.send(11)
b = 11
a_ = 43
11
Specifically, clonify takes the state of a generator, and copies it. I could just reset my generator to be like the old one, but that would require calculating a_. Note also that I would not want to modify the generator extensively. Ideally, I could just take a generator object from a library and clonify it.
Note: itertools.tee won't work, because it does not handle sends.
Note: I only care about generators created by placing yield statements in a function.

Python doesn't have any support for cloning generators.
Conceptually, this should be implementable, at least for CPython. But practically, it turns out to be very hard.
Under the covers, a generator is basically nothing but a wrapper around a stack frame.*
And a frame object is essentially just a code object, an instruction pointer (an index into that code object), the builtins/globals/locals environment, an exception state, and some flags and debugging info.
And both types are exposed to the Python level,** as are all the bits they need. So, it really should be just a matter of:
Create a frame object just like g.gi_frame, but with a copy of the locals instead of the original locals. (All the user-level questions come down to whether to shallow-copy, deep-copy, or one of the above plus recursively cloning generators here.)
Create a generator object out of the new frame object (and its code and running flag).
And there's no obvious practical reason it shouldn't be possible to construct a frame object out of its bits, just as it is for a code object or most of the other hidden builtin types.
Unfortunately, as it turns out, Python doesn't expose a way to construct a frame object. I thought you could get around that just by using ctypes.pythonapi to call PyFrame_New, but the first argument to that is a PyThreadState—which you definitely can't construct from Python, and shouldn't be able to. So, to make this work, you either have to:
Reproduce everything PyFrame_New does by banging on the C structs via ctypes, or
Manually build a fake PyThreadState by banging on the C structs (which will still require reading the code to PyFrame_New carefully to know what you have to fake).
I think this may still be doable (and I plan to play with it; if I come up with anything, I'll update the Cloning generators post on my blog), but it's definitely not going to be trivial—or, of course, even remotely portable.
There are also a couple of minor problems.
Locals are exposed to Python as a dict (whether you call locals() for your own, or access g.gi_frame.f_locals for a generator you want to clone). Under the covers, locals are actually stored on the C stack.*** You can get around this by using ctypes.pythonapi to call PyFrame_LocalsToFast and PyFrame_FastToLocals. But the dict just contains the values, not cell objects, so doing this shuffle will turn all nonlocal variables into local variables in the clone.****
Exception state is exposed to Python as a type/value/traceback 3-tuple, but inside a frame there's also a borrowed (non-refcounted) reference to the owning generator (or NULL if it's not a generator frame). (The source explains why.) So, your frame-constructing function can't refcount the generator or you have a cycle and therefore a leak, but it has to refcount the generator or you have a potentially dangling pointer until the frame is assigned to a generator. The obvious answer seems to be to leave the generator NULL at frame construction, and have the generator-constructing function do the equivalent of self.gi_f.f_generator = self; Py_DECREF(self).
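As an aside (not a cloning solution, just an illustration of the locals shuffle described above), here is a minimal, CPython-only sketch of editing a suspended generator's locals and pushing the dict back into the frame via ctypes.pythonapi. PyFrame_LocalsToFast is a real CPython API, but this behavior is implementation- and version-specific (pre-3.13 frame semantics assumed):

import ctypes

def demo_gen():
    a = 1
    yield a
    yield a  # re-yields whatever the fast local 'a' holds now

g = demo_gen()
print(next(g))                      # 1
frame = g.gi_frame
frame.f_locals['a'] = 99            # edits only the snapshot dict...
# ...so push the dict back into the frame's fast-locals array:
ctypes.pythonapi.PyFrame_LocalsToFast(ctypes.py_object(frame), ctypes.c_int(0))
print(next(g))                      # 99 on CPython (pre-3.13)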
* It also keeps a copy of the frame's code object and running flag, so they can be accessed after the generator exits and disposes of the frame.
** generator and frame are hidden from builtins, but they're available as types.GeneratorType and types.FrameType. And they have docstrings, descriptions of their attributes in the inspect module, etc., just like function and code objects.
*** When you compile a function definition, the compiler makes a list of all the locals, stored in co_varnames, and turns each variable reference into a LOAD_FAST/STORE_FAST opcode with the index into co_varnames as its argument. When a function call is executed, the frame object stores the stack pointer in f_valuestack, reserves len(co_varnames)*sizeof(PyObject *) bytes on the stack, and then LOAD_FAST 0 just accesses *f_valuestack[0]. Closures are more complicated; a bit too much to explain in a comment on an SO answer.
**** I'm assuming you wanted the clone to share the original's closure references. If you were hoping to recursively clone all the frames up the stack to get a new set of closure references to bind, that adds another problem: there's no way to construct new cell objects from Python either.

You can't, in general. However, if you parametrise over some expensive operation, why not lift that operation out and create a generator factory?
def make_gen(a):
    a_ = [a + 1]  # Perform expensive calculation
    def gen(a_=a_):
        while True:
            print "a_ = ", a_
            a_[0] = yield a_[0]
    return gen
Then you can create as many generators as you like from the returned object:
gen = make_gen(42)
g = gen()
g.send(None)
# a_ = [43]
g.send(7)
# a_ = [7]
new_g = gen()
new_g.send(None)
# a_ = [7]

Whilst not technically returning a generator, if you don't mind fully expanding your sequence:
source = ( x**2 for x in range(10) )
source1, source2 = zip(*( (s,s) for s in source ))
>>> print( source1, type(source1) )
(0, 1, 4, 9, 16, 25, 36, 49, 64, 81) <class 'tuple'>
>>> print( source2, type(source2) )
(0, 1, 4, 9, 16, 25, 36, 49, 64, 81) <class 'tuple'>
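If the two copies only ever need to be iterated (never sent to with .send(), which is the caveat in the question above), itertools.tee gives the same effect without expanding the whole sequence up front; a small sketch:

from itertools import tee

source = ( x**2 for x in range(10) )
source1, source2 = tee(source, 2)   # lazily buffers only what one copy is ahead by
print(list(source1))
print(list(source2))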
If your function is expensive, then consider using either joblib or pathos.multiprocessing. Joblib has simpler syntax and handles pool management behind the scenes, but only supports batch processing. Pathos forces you to manually manage and close your ProcessPools, but also has pool.imap() and pool.uimap() functions, which return generators.
import os
from pathos.multiprocessing import ProcessPool

pool = ProcessPool(ncpus=os.cpu_count())
try:
    def expensive(x): return x**2
    source = range(10)
    results = pool.imap(expensive, source)
    for result in results:
        print(result)
except KeyboardInterrupt: pass
except: pass
finally:
    pool.terminate()
In theory, you could run this in a separate thread and pass in two queue objects that could be read independently, preserving generator-like behavior, as suggested in this answer:
How to use multiprocessing queue in Python?
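A rough, purely illustrative sketch of that idea (expensive, producer, and stream_results are made-up names): a worker thread computes results and puts them on a queue, while a thin generator wrapper yields them to the caller as they arrive:

import queue
import threading

_SENTINEL = object()

def expensive(x):
    return x ** 2          # stand-in for the costly work

def producer(out_q):
    # Worker thread: compute results and hand them over as they are ready.
    for x in range(10):
        out_q.put(expensive(x))
    out_q.put(_SENTINEL)   # signal end-of-stream

def stream_results():
    # Generator facade: yields items as the worker produces them.
    q = queue.Queue(maxsize=2)   # small buffer keeps memory bounded
    threading.Thread(target=producer, args=(q,), daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item

for r in stream_results():
    print(r)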


Does python garbage collect variables in a higher scope that will never get used again?

For example, in the code:
a = [1, 2, 3, 4] # huge list
y = sum(a)
print( do_stuff(y) )
Will the memory for the list a ever get freed up before the program ends? Will the do_stuff function call have to do all its stuff with a constantly taking up memory, even though a's never going to be used again?
And if a doesn't get garbage collected, is the solution to manually set a=None once I'm done using it?
Imagine do_stuff did this:
def do_stuff(y):
    return globals()[input()]
And the user enters a, so that the list is used there after all. Python can't know that that won't happen, it would have to read the user's mind.
Consider a trivial case like this:
def do_stuff(y):
    return y
Now it's clear that a doesn't get used anymore, so Python could figure that out right? Well... print isn't a keyword/statement anymore. Python would have to figure out that you didn't overwrite print with something that does use a. And even if you didn't, it would need to know that its own print doesn't use a.
You'll have to delete it yourself. I'd use del a. (Unless you want a to still exist and have the value None).
a will never be freed unless it goes out of scope (i.e. it was in a function to begin with), or you manually set it to None.
Python's memory management is primarily based on reference counting. In short, every object has a reference counter that is incremented when a new reference to the object is created and decremented when a reference goes away (for example, when a name is rebound, set to None, or goes out of scope). When the counter reaches zero, the object is freed.
Example:
a = [999999]  # 1 reference, the value [999999] is stored in memory
b = a         # 2 references
def foo(x):
    print(x)
foo(a)        # 3 references during the function call
# back to 2 references
a = None      # 1 reference
b = None      # 0 references, the value [999999] is deleted from memory
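If you want to watch the count yourself, sys.getrefcount reports it (plus one extra reference for the argument it is temporarily passed); a quick sketch:

import sys

a = [999999]
print(sys.getrefcount(a))   # 2: the name 'a' plus getrefcount's own argument
b = a
print(sys.getrefcount(a))   # 3
del b
print(sys.getrefcount(a))   # back to 2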

Confused on why generators are useful [duplicate]

I'm starting to learn Python and I've come across generator functions, those that have a yield statement in them. I want to know what types of problems that these functions are really good at solving.
Generators give you lazy evaluation. You use them by iterating over them, either explicitly with 'for' or implicitly by passing it to any function or construct that iterates. You can think of generators as returning multiple items, as if they return a list, but instead of returning them all at once they return them one-by-one, and the generator function is paused until the next item is requested.
Generators are good for calculating large sets of results (in particular calculations involving loops themselves) where you don't know if you are going to need all results, or where you don't want to allocate the memory for all results at the same time. Or for situations where the generator uses another generator, or consumes some other resource, and it's more convenient if that happened as late as possible.
Another use for generators (that is really the same) is to replace callbacks with iteration. In some situations you want a function to do a lot of work and occasionally report back to the caller. Traditionally you'd use a callback function for this. You pass this callback to the work-function and it would periodically call this callback. The generator approach is that the work-function (now a generator) knows nothing about the callback, and merely yields whenever it wants to report something. The caller, instead of writing a separate callback and passing that to the work-function, does all the reporting work in a little 'for' loop around the generator.
For example, say you wrote a 'filesystem search' program. You could perform the search in its entirety, collect the results and then display them one at a time. All of the results would have to be collected before you showed the first, and all of the results would be in memory at the same time. Or you could display the results while you find them, which would be more memory efficient and much friendlier towards the user. The latter could be done by passing the result-printing function to the filesystem-search function, or it could be done by just making the search function a generator and iterating over the result.
If you want to see an example of the latter two approaches, see os.path.walk() (the old filesystem-walking function with callback) and os.walk() (the new filesystem-walking generator.) Of course, if you really wanted to collect all results in a list, the generator approach is trivial to convert to the big-list approach:
big_list = list(the_generator)
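As a rough sketch of the generator style described above (find_files is an illustrative name, not a standard function), results stream out of os.walk as soon as they are found:

import os
import fnmatch

def find_files(root, pattern):
    # Generator version of a 'filesystem search': yield matches as we go.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)

# The caller does the 'reporting' in a plain for loop -- no callback needed.
for path in find_files('.', '*.py'):
    print(path)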
One of the reasons to use generators is to make the solution clearer for some kinds of problems.
The other is to treat results one at a time, avoiding building huge lists of results that you would process separately anyway.
If you have a fibonacci-up-to-n function like this:
# function version
def fibon(n):
    a = b = 1
    result = []
    for i in xrange(n):
        result.append(a)
        a, b = b, a + b
    return result
You can more easily write the function as this:
# generator version
def fibon(n):
    a = b = 1
    for i in xrange(n):
        yield a
        a, b = b, a + b
The function is clearer. And if you use the function like this:
for x in fibon(1000000):
    print x,
In this example, the generator version never creates the whole 1000000-item list; it produces just one value at a time. That would not be the case with the list version, where the entire list would be created first.
Real World Example
Let's say you have 100 million domains in your MySQL table, and you would like to update Alexa rank for each domain.
First thing you need is to select your domain names from the database.
Let's say your table name is domains and column name is domain.
If you use SELECT domain FROM domains, it's going to return 100 million rows, which is going to consume a lot of memory. Your server might crash.
So you decided to run the program in batches. Let's say our batch size is 1000.
In our first batch we will query the first 1000 rows, check Alexa rank for each domain and update the database row.
In our second batch we will work on the next 1000 rows. In our third batch it will be from 2001 to 3000 and so on.
Now we need a generator function which generates our batches.
Here is our generator function:
def ResultGenerator(cursor, batchsize=1000):
    while True:
        results = cursor.fetchmany(batchsize)
        if not results:
            break
        for result in results:
            yield result
As you can see, our function keeps yielding the results. If you used the keyword return instead of yield, then the whole function would be ended once it reached return.
return - returns only once
yield - returns multiple times
If a function uses the keyword yield then it's a generator.
Now you can iterate like this:
db = MySQLdb.connect(host="localhost", user="root", passwd="root", db="domains")
cursor = db.cursor()
cursor.execute("SELECT domain FROM domains")
for result in ResultGenerator(cursor):
    doSomethingWith(result)
db.close()
I found this explanation helpful, since someone who doesn't know about generators probably doesn't know about yield either.
Return
The return statement is where all the local variables are destroyed and the resulting value is given back (returned) to the caller. Should the same function be called some time later, the function will get a fresh new set of variables.
Yield
But what if the local variables weren't thrown away when we exit a function? That would mean we could resume the function where we left off. This is where generators come in: the yield statement suspends the function, and the next call resumes it where it left off.
def generate_integers(N):
    for i in xrange(N):
        yield i
In [1]: gen = generate_integers(3)
In [2]: gen
<generator object at 0x8117f90>
In [3]: gen.next()
0
In [4]: gen.next()
1
In [5]: gen.next()
2
So that's the difference between return and yield statements in Python.
The yield statement is what makes a function a generator function.
So generators are a simple and powerful tool for creating iterators. They are written like regular functions, but they use the yield statement whenever they want to return data. Each time next() is called, the generator resumes where it left off (it remembers all the data values and which statement was last executed).
See the "Motivation" section in PEP 255.
A non-obvious use of generators is creating interruptible functions, which lets you do things like update UI or run several jobs "simultaneously" (interleaved, actually) while not using threads.
Buffering. When it is efficient to fetch data in large chunks, but process it in small chunks, then a generator might help:
def bufferedFetch():
    while True:
        buffer = getBigChunkOfData()
        # insert some code to break on 'end of data'
        for i in buffer:
            yield i
The above lets you easily separate buffering from processing. The consumer function can now just get the values one by one without worrying about buffering.
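A self-contained (and purely illustrative) version of that pattern, with a stand-in fetch_chunk playing the role of getBigChunkOfData:

def fetch_chunk(offset, size):
    # Stand-in for an expensive bulk fetch (a database or network call, say).
    data = list(range(10))
    return data[offset:offset + size]

def buffered_fetch(chunk_size=3):
    offset = 0
    while True:
        buffer = fetch_chunk(offset, chunk_size)   # fetch in large chunks...
        if not buffer:                             # ...stop on 'end of data'...
            return
        for item in buffer:                        # ...hand items out one by one
            yield item
        offset += chunk_size

for item in buffered_fetch():
    print(item)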
I have found that generators are very helpful for cleaning up your code, giving you a unique way to encapsulate and modularize it. In a situation where you need something to constantly spit out values based on its own internal processing, and that something needs to be callable from anywhere in your code (not just within a loop or a block, for example), generators are the feature to use.
An abstract example would be a Fibonacci number generator that does not live within a loop and when it is called from anywhere will always return the next number in the sequence:
def fib():
    first = 0
    second = 1
    yield first
    yield second
    while 1:
        next = first + second
        yield next
        first = second
        second = next
fibgen1 = fib()
fibgen2 = fib()
Now you have two Fibonacci number generator objects which you can call from anywhere in your code and they will always return ever larger Fibonacci numbers in sequence as follows:
>>> fibgen1.next(); fibgen1.next(); fibgen1.next(); fibgen1.next()
0
1
1
2
>>> fibgen2.next(); fibgen2.next()
0
1
>>> fibgen1.next(); fibgen1.next()
3
5
The lovely thing about generators is that they encapsulate state without having to go through the hoops of creating objects. One way of thinking about them is as "functions" which remember their internal state.
I got the Fibonacci example from Python Generators - What are they? and with a little imagination, you can come up with a lot of other situations where generators make for a great alternative to for loops and other traditional iteration constructs.
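To see what "without the hoops of creating objects" means, here is the same sequence written as an explicit iterator class (Python 3 syntax, illustrative only); all the state the generator keeps implicitly has to be managed by hand:

class Fib:
    """Same sequence as fib(), with the state spelled out explicitly."""
    def __init__(self):
        self.first, self.second = 0, 1

    def __iter__(self):
        return self

    def __next__(self):
        value = self.first
        self.first, self.second = self.second, self.first + self.second
        return value

fib_iter = Fib()
print([next(fib_iter) for _ in range(6)])   # [0, 1, 1, 2, 3, 5]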
The simple explanation:
Consider a for statement
for item in iterable:
    do_stuff()
A lot of the time, the items in iterable don't all need to exist from the start; they can be generated on the fly as they're required. This can be a lot more efficient in both
space (you never need to store all the items simultaneously) and
time (the iteration may finish before all the items are needed).
Other times, you don't even know all the items ahead of time. For example:
for command in user_input():
    do_stuff_with(command)
You have no way of knowing all the user's commands beforehand, but you can use a nice loop like this if you have a generator handing you commands:
def user_input():
    while True:
        wait_for_command()
        cmd = get_command()
        yield cmd
With generators you can also have iteration over infinite sequences, which is of course not possible when iterating over containers.
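For example (a trivial sketch), an infinite sequence can be consumed a slice at a time with itertools.islice, something no in-memory container could provide:

from itertools import islice

def naturals():
    n = 0
    while True:        # never terminates -- no list could hold this
        yield n
        n += 1

print(list(islice(naturals(), 5)))   # [0, 1, 2, 3, 4]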
My favorite uses are "filter" and "reduce" operations.
Let's say we're reading a file, and only want the lines which begin with "##".
def filter2sharps( aSequence ):
    for l in aSequence:
        if l.startswith("##"):
            yield l
We can then use the generator function in a proper loop
source = file( ... )
for line in filter2sharps( source.readlines() ):
    print line
source.close()
The reduce example is similar. Let's say we have a file where we need to locate blocks of <Location>...</Location> lines. [Not HTML tags, but lines that happen to look tag-like.]
def reduceLocation( aSequence ):
    keep = False
    block = None
    for line in aSequence:
        if line.startswith("</Location"):
            block.append( line )
            yield block
            block = None
            keep = False
        elif line.startswith("<Location"):
            block = [ line ]
            keep = True
        elif keep:
            block.append( line )
        else:
            pass
    if block is not None:
        yield block  # A partial block, icky
Again, we can use this generator in a proper for loop.
source = file( ... )
for b in reduceLocation( source.readlines() ):
    print b
source.close()
The idea is that a generator function allows us to filter or reduce a sequence, producing another sequence one value at a time.
A practical example where you could make use of a generator is if you have some kind of shape and you want to iterate over its corners, edges or whatever. For my own project (source code here) I had a rectangle:
class Rect():
    def __init__(self, x, y, width, height):
        self.l_top = (x, y)
        self.r_top = (x+width, y)
        self.r_bot = (x+width, y+height)
        self.l_bot = (x, y+height)

    def __iter__(self):
        yield self.l_top
        yield self.r_top
        yield self.r_bot
        yield self.l_bot
Now I can create a rectangle and loop over its corners:
myrect = Rect(50, 50, 100, 100)
for corner in myrect:
    print(corner)
Instead of __iter__ you could have a method iter_corners and call that with for corner in myrect.iter_corners(). It's just more elegant to use __iter__ since then we can use the class instance name directly in the for expression.
Basically, it's about avoiding callback functions when iterating over input while maintaining state.
See here and here for an overview of what can be done using generators.
Since the send method of a generator has not been mentioned, here is an example:
def test():
    for i in xrange(5):
        val = yield
        print(val)
t = test()
# Proceed to 'yield' statement
next(t)
# Send value to yield
t.send(1)
t.send('2')
t.send([3])
It shows that it's possible to send a value to a running generator. A more advanced course on generators is the video below (including an explanation of yield from, generators for parallel processing, escaping the recursion limit, etc.):
David Beazley on generators at PyCon 2014
Some good answers here, however, I'd also recommend a complete read of the Python Functional Programming tutorial which helps explain some of the more potent use-cases of generators.
Particularly interesting is that it is now possible to update the yield variable from outside the generator function, hence making it possible to create dynamic and interwoven coroutines with relatively little effort.
Also see PEP 342: Coroutines via Enhanced Generators for more information.
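A small, hedged sketch of that coroutine style (the classic running-average example, not taken from the answer above): values go in through send(), and the current average comes back out of the yield expression:

def averager():
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average   # receives what the caller send()s
        total += value
        count += 1
        average = total / count

avg = averager()
next(avg)             # prime the coroutine up to the first yield
print(avg.send(10))   # 10.0
print(avg.send(20))   # 15.0
print(avg.send(5))    # 11.666...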
I use generators when our web server is acting as a proxy:
The client requests a proxied url from the server
The server begins to load the target url
The server yields to return the results to the client as soon as it gets them
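A rough sketch of what such a handler might look like (proxy_stream and the chunk size are illustrative assumptions, not the answerer's code); each chunk is yielded to the client as soon as it arrives from the target:

from urllib.request import urlopen

def proxy_stream(url, chunk_size=64 * 1024):
    # Yield the upstream response piece by piece instead of buffering it all.
    with urlopen(url) as upstream:
        while True:
            chunk = upstream.read(chunk_size)
            if not chunk:
                return
            yield chunk

# A WSGI-style framework can usually send an iterable like this straight to
# the client, so the proxy never holds the whole body in memory.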
Piles of stuff. Any time you want to generate a sequence of items, but don't want to have to 'materialize' them all into a list at once. For example, you could have a simple generator that returns prime numbers:
import itertools

def primes():
    primes_found = set()
    primes_found.add(2)
    yield 2
    for i in itertools.count(1):
        candidate = i * 2 + 1
        if all(candidate % prime for prime in primes_found):  # no smaller prime divides it
            primes_found.add(candidate)
            yield candidate
You could then use that to generate the products of subsequent primes:
def prime_products():
    primeiter = primes()
    prev = primeiter.next()
    for prime in primeiter:
        yield prime * prev
        prev = prime
These are fairly trivial examples, but you can see how it can be useful for processing large (potentially infinite!) datasets without generating them in advance, which is only one of the more obvious uses.
Also good for printing the prime numbers up to n:
def genprime(n=10):
    for num in range(3, n+1):
        for factor in range(2, num):
            if num % factor == 0:
                break
        else:
            yield num

for prime_num in genprime(100):
    print(prime_num)

Python call by ref call by value using ctypes

I am trying to write a program to illustrate to A level students the difference between call by reference and call by value using Python. I had succeeded by passing mutable objects as variables to functions, but found I could also do the same using the ctypes library.
I don't quite understand how it works, because there is a byref() function in the ctypes library, but it didn't work in my example. However, calling the function without byref() did work!
My working code:
"""
Program to illustrate call by ref
"""
from ctypes import * #allows call by ref
test = c_int(56) #Python call by reference eg address
t = 67 #Python call by value eg copy
#expects a ctypes argument
def byRefExample(x):
x.value= x.value + 2
#expects a normal Python variable
def byValueExample(x):
x = x + 2
if __name__ == "__main__":
print "Before call test is",test
byRefExample(test)
print "After call test is",test
print "Before call t is",t
byValueExample(t)
print "After call t is",t
Question
When passing a normal Python variable to byValueExample() it works as expected. The copy of the function argument t changes but the variable t in the header does not. However, when I pass the ctypes variable test both the local and the header variable change, thus it is acting like a C pointer variable. Although my program works, I am not sure how and why the byref() function doesn't work when used like this:
byRefExample(byref(test))
You're actually using terminology that's not exactly correct, and potentially very misleading. I'll explain at the end. But first I'll answer in terms of your wording.
I had succeeded by passing mutable objects as variables to functions but found I could also do the same using the ctypes library.
That's because those ctypes objects are mutable objects, so you're just doing the same thing you already did. In particular, a ctypes.c_int is a mutable object holding an integer value, which you can mutate by setting its value member. So you're already doing the exact same thing you'd done without ctypes.
In more detail, compare these:
def by_ref_using_list(x):
    x[0] += 1
value = [10]
by_ref_using_list(value)
print(value[0])

def by_ref_using_dict(x):
    x['value'] += 1
value = {'value': 10}
by_ref_using_dict(value)
print(value['value'])

class ValueHolder(object):
    def __init__(self, value):
        self.value = value
def by_ref_using_int_holder(x):
    x.value += 1
value = ValueHolder(10)
by_ref_using_int_holder(value)
print(value.value)
You'd expect all three of those to print out 11, because they're just three different ways of passing different kinds of mutable objects and mutating them.
And that's exactly what you're doing with c_int.
You may want to read the FAQ How do I write a function with output parameters (call by reference)?, although it seems like you already know the answers there, and just wanted to know how ctypes fits in…
So, what is byref even for, then?
It's used for calling a C function that takes values by reference C-style: by using explicit pointer types. For example:
void by_ref_in_c(int *x) {
    *x += 1;
}
You can't pass this a c_int object, because it needs a pointer to a c_int. And you can't pass it an uninitialized POINTER(c_int), because then it's just going to be writing to random memory. You need to get the pointer to an actual c_int. Which you can do like this:
x = c_int(10)
xp = pointer(x)
by_ref_in_c(xp)
print(x)
That works just fine. But it's overkill, because you've created an extra Python ctypes object, xp, that you don't really need for anything. And that's what byref is for: it gives you a lightweight pointer to an object, that can only be used for passing that object by reference:
x = c_int(10)
by_ref_in_c(byref(x))
print(x)
And that explains why this doesn't work:
byRefExample(byref(test))
That call is making a lightweight pointer to test, and passing that pointer to byRefExample. But byRefExample doesn't want a pointer to a c_int, it wants a c_int.
Of course this is all in Python, not C, so there's no static type checking going on. The function call works just fine, and your code doesn't care what type it gets, so long as it has a value member that you can increment. But a POINTER doesn't have a value member. (It has a contents member instead.) So, you get an AttributeError trying to access x.value.
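(If you want to try byref against a real C function rather than the hypothetical by_ref_in_c, libc's time() takes a pointer argument; a quick sketch, assuming a Unix-like libc and that time_t fits in a c_long on your platform:)

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.time.restype = ctypes.c_long
libc.time.argtypes = [ctypes.POINTER(ctypes.c_long)]

t = ctypes.c_long(0)
libc.time(ctypes.byref(t))   # C writes the current time through the pointer
print(t.value)               # seconds since the epoch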
So, how do you do this kind of thing?
Well, using a single-element-list is a well-known hack to get around the fact that you need to share something mutable but you only have something immutable. If you use it, experienced Python programmers will know what you're up to.
That being said, if you think you need this, you're usually wrong. Often the right answer is to just return the new value. It's easier to reason about functions that don't mutate anything. You can string them together in any way you want, turn them inside-out with generators and iterators, ship them off to child processes to take advantage of those extra cores in your CPU, etc. And even if you don't do any of that stuff, it's usually faster to return a new value than to modify one in-place, even in cases where you wouldn't expect that (e.g., deleting 75% of the values in a list).
And often, when you really do need mutable values, there's already an obvious place for them to live, such as instance attributes of a class.
But sometimes you do need the single-element list hack, so it's worth having in your repertoire; just don't use it when you don't need it.
So, what's wrong with your terminology?
In a sense (the sense Ruby and Lisp programmers use), everything in Python is pass-by-reference. In another sense (the sense many Java and VB programmers use), it's all pass-by-value. But really, it's best to not call it either.* What you're passing is neither a copy of the value of a variable, nor a reference to a variable, but a reference to a value. When you call that byValueExample(t) function, you're not passing a new integer with the value 67 the way you would in C, you're passing a reference to the same integer 67 that's bound to the name t. If you could mutate 67 (you can't, because ints are immutable), the caller would see the change.
Second, Python names are not even variables in the sense you're thinking of. In C, a variable is an lvalue. It has a type and, more importantly, an address. So, you can pass around a reference to the variable itself, rather than to its value. In Python, a name is just a name (usually a key in a module, local, or object dictionary). It doesn't have a type or an address. It's not a thing you can pass around. So, there is no way to pass the variable x by reference.**
Finally, = in Python isn't an assignment operator that copies a value to a variable; it's a binding operator that gives a value a name. So, in C, when you write x = x + 1, that copies the value x + 1 to the location of the variable x, but in Python, when you write x = x + 1, that just rebinds the local variable x to refer to the new value x + 1. That won't have any effect on whatever value x used to be bound to. (Well, if it was the only reference to that value, the garbage collector might clean it up… but that's it.)
This is actually a lot easier to understand if you're coming from C++, which really forces you to understand rvalues and lvalues and different kinds of references and copy construction vs. copy assignment and so on… In C, it's all deceptively simple, which makes it harder to realize how very different it is from the equally-simple Python.
* Some people in the Python community like to call it "pass-by-sharing". Some researchers call it "pass-by-object". Others choose to first differentiate between value semantics and reference semantics, before describing calling styles, so you can call this "reference-semantics pass-by-copy". But, while at least those names aren't ambiguous, they also aren't very well known, so they're not likely to help anyone. I think it's better to describe it than to try to figure out the best name for it…
** Of course, because Python is fully reflective, you can always pass the string x and the context in which it's found, directly or indirectly… If your byRefExample did globals()['x'] = x + 2, that would affect the global x. But… don't do that.
Python uses neither "call-by-reference" nor "call-by-value" but "call-by-object". Assignment gives names to objects.
test = c_int(56)
t = 67
test is a name given to a ctypes.c_int object that internally has a value name assigned to an int object.
t is a name given to an int object.
When calling byRefExample(test), x is another name given to the ctypes.c_int object referenced by test.
x.value = x.value + 2
The above rebinds the value attribute stored on the ctypes.c_int object to a completely new int object with a different value. Since value is an attribute of the same ctypes.c_int object referred to by the names test and x, x.value and test.value refer to the same value.
When calling byValueExample(t), x is another name given to the int object referenced by t.
x = x + 2
The above reassigns the name x to a completely new int object with a different value. x and t no longer refer to the same object, so t will not observe the change. It still refers to the original int object.
You can observe this by printing the id() of the objects at different points in time:
from ctypes import *

test = c_int(56)
t = 67

print('test id =', id(test))
print('t id =', id(t))

# expects a ctypes argument
def byRefExample(x):
    print('ByRef x', x, id(x))
    print('ByRef x.value', x.value, id(x.value))
    x.value = x.value + 2
    print('ByRef x.value', x.value, id(x.value))
    print('ByRef x', x, id(x))

# expects a normal Python variable
def byValueExample(x):
    print('ByVal x', x, id(x))
    x = x + 2
    print('ByVal x', x, id(x))

print("Before call test is", test, id(test))
print("Before call test.value is", test.value, id(test.value))
byRefExample(test)
print("After call test.value is", test.value, id(test.value))
print("After call test is", test, id(test))
print("Before call t is", t, id(t))
byValueExample(t)
print("After call t is", t, id(t))
Output (with comments):
test id = 80548680
t id = 507083328
Before call test is c_long(56) 80548680
Before call test.value is 56 507082976
ByRef x c_long(56) 80548680 # same id as test
ByRef x.value 56 507082976
ByRef x.value 58 507083040 # x.value is new object!
ByRef x c_long(58) 80548680 # but x is still the same.
After call test.value is 58 507083040 # test.value sees new object because...
After call test is c_long(58) 80548680 # test is same object as x.
Before call t is 67 507083328
ByVal x 67 507083328 # same id as t
ByVal x 69 507083392 # x is new object!
After call t is 67 507083328 # t id same old object.

How do Python Generators know who is calling?

This question is making me pull my hair out.
if I do:
def mygen():
    for i in range(100):
        yield i
and call it from one thousand threads, how does the generator know what to send next for each thread?
Every time I call it, does the generator save a table with the counter and the caller reference, or something like that?
It's weird.
Please, clarify my mind on that one.
mygen does not have to remember anything. Every call to mygen() returns an independent iterable. These iterables, on the other hand, have state: Every time next() is called on one, it jumps to the correct place in the generator code -- when a yield is encountered, control is handed back to the caller. The actual implementation is rather messy, but in principle you can imagine that such an iterator stores the local variables, the bytecode, and the current position in the bytecode (a.k.a. instruction pointer). There is nothing special about threads here.
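On CPython you can actually peek at that per-iterator state through the gi_frame attribute (implementation-specific, so treat this as a sketch rather than a portable technique):

def mygen():
    for i in range(100):
        yield i

g = mygen()
next(g); next(g); next(g)

frame = g.gi_frame
print(frame.f_locals)   # {'i': 2} -- the suspended frame's locals
print(frame.f_lasti)    # index of the last bytecode instruction executed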
A function like this, when called, will return a generator object. If you have separate threads calling next() on the same generator object, they will interfere with each other. That is to say, 5 threads calling next() 10 times each will get 50 different yields.
If two threads each create a generator by calling mygen() within the thread, they will have separate generator objects.
A generator is an object, and its state will be stored in memory, so two threads that each create a mygen() will refer to separate objects. It'd be no different than two threads creating an object from a class, they'll each have a different object, even though the class is the same.
if you're coming at this from a C background, this is not the same thing as a function with static variables. The state is maintained in an object, not statically in the variables contained in the function.
It might be clearer if you look at it this way. Instead of:
for i in mygen():
    . . .
use:
gen_obj = mygen()
for i in gen_obj:
    . . .
then you can see that mygen() is only called once, and it creates a new object, and it is that object that gets iterated. You could create two sequences in the same thread, if you wanted:
gen1 = mygen()
gen2 = mygen()
print(gen1.__next__(), gen2.__next__(), gen1.__next__(), gen2.__next__())
This will print 0, 0, 1, 1.
You could access the same iterator from two threads if you like, just store the generator object in a global:
global_gen = mygen()
Thread 1:
for i in global_gen:
    . . .
Thread 2:
for i in global_gen:
    . . .
This would probably cause all kinds of havoc. :-)
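If you really do want several threads pulling from one shared generator, the usual fix is to serialize the next() calls with a lock; a sketch of that wrapper (LockedIterator is an illustrative name, not a stdlib class):

import threading

def mygen():
    for i in range(100):
        yield i

class LockedIterator:
    """Serialize next() calls so threads never re-enter the generator."""
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)

global_gen = LockedIterator(mygen())

def worker():
    for i in global_gen:
        pass   # each value goes to exactly one thread

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()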

