Populate generator on class init or defer?

I've got a django app that, at a very pseudo-codey level, does something like this:
class SerialisedContentItem():
    def __init__(self, item):
        self.__output = self.__jsonify(item)

    def fetch_output(self):
        return self.__output

    def __jsonify(self, item):
        serialized = do_a_bunch_of_serialisey_stuff()
        return json.dumps(serialized)
So basically - as soon as the class is instantiated, it:
runs an internal function to generate an output string of JSON
stores it in an internal variable
exposes a public function that can be called later to retrieve the JSON
It's then being used to generate a page something like this:
for item in page.items:
    json_item = SerialisedContentItem(item)
    yield json_item.fetch_output()
This, to me, seems a bit pointless. And it's also causing issues with some business logic changes we need to make.
What I'd prefer to do is defer the calling of the "jsonify" function until I actually want it. Roughly speaking, changing the above to:
class SerialisedContentItem():
    def __init__(self, item):
        self.__item = item

    def fetch_output(self):
        return self.__jsonify(self.__item)
This seems simpler, and mucks with my logic slightly less.
But: is there a downside I'm not seeing? Is my change less performant, or not a good way of doing things?

As long as you only call fetch_output once per item, there's no performance hit (there would be one, obviously, if you called fetch_output twice on the same SerialisedContentItem instance). And not doing useless operations is usually a good thing too (you don't expect open("/path/to/some/file.ext") to read the file's content, do you?).
The only caveat is that, with the original version, if item is mutated between the initialisation of SerialisedContentItem and the call to fetch_output, the change won't be reflected in the JSON output (since it's created right at initialisation time), while with your "lazy" version those changes WILL be reflected in the JSON. Whether this is a no-go, a potential issue or actually just what you want depends on the context, so only you can tell.
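To make that caveat concrete, here's a minimal sketch of the difference (the class names, the dict item and the stripped-down jsonify step are all made up for illustration):

import json

class EagerItem:
    def __init__(self, item):
        self.__output = json.dumps(item)   # snapshot is taken right now

    def fetch_output(self):
        return self.__output

class LazyItem:
    def __init__(self, item):
        self.__item = item                 # only a reference is kept

    def fetch_output(self):
        return json.dumps(self.__item)     # serialised on demand

item = {"title": "first draft"}
eager, lazy = EagerItem(item), LazyItem(item)
item["title"] = "second draft"             # mutate after construction
print(eager.fetch_output())                # {"title": "first draft"}
print(lazy.fetch_output())                 # {"title": "second draft"}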
EDIT:
what's prompted my question: according to my (poor) understanding of yield, using it here makes sense: I only need to iterate page items once, so do so in a way that minimises the memory footprint. But the bulk of the work isn't currently being done in the yield function, it's being done on the line above it when the class is instantiated, making yield a bit pointless. Or am I misunderstanding how it works?
I'm afraid you are indeed misunderstanding yield. Deferring the JSON serialisation until the yield json_item.fetch_output() line will change nothing (nada, zero, zilch, shunya) about memory consumption compared with the original version.
yield is not a function, it's a keyword. What it does is turn the function containing it into a "generator function" - a function that returns a generator (a lazy iterator) object, which you can then iterate over. It will not change anything about the memory used to jsonify an item, and whether this jsonification happens "on the same line" as the yield keyword or not is totally irrelevant.
What a generator brings you (with respect to memory use) is that you don't have to create the whole list of contents at once, i.e.:
def eager():
    result = []
    for i in range(1000):
        result.append("foo {}\n".format(i))
    return result

with open("file.txt", "w") as outfile:
    for item in eager():
        outfile.write(item)
This FIRST creates a 1000-item list in memory, then iterates over it.
vs
def lazy():
    for i in range(1000):
        yield "foo {}\n".format(i)

with open("file.txt", "w") as outfile:
    for item in lazy():
        outfile.write(item)
this lazily generates string after string on each iteration, so you don't end up with a 1000-item list in memory - BUT you still generate 1000 strings, each of them using the same amount of space as in the first solution. The difference is that since (in this example) you don't keep any reference to those strings, they can be garbage collected on each iteration, while storing them in a list prevents them from being collected until there's no more reference to the list itself.

Related

Confused on why generators are useful [duplicate]


Persistent variable in Python function

I've got this script written which does some involved parsing while running through a large file. For every line in the file (after some heavy manipulation), I need to add a check to see whether it meets certain criteria and if it does, include it in a list for additional processing later.
The function doing the parsing is already a bit cluttered, and I'm wondering whether it's possible to shunt the line-checking and list manipulation to another function to make things easier to modify later on. My inclination is to use a global variable which gets modified by the function, but I know that's usually poor form. Until now I have never made use of classes, but I vaguely recall there being some advantage to them regarding persistent local variables.
One version of this part of the script could be:
matchingLines = []

def lineProcess(line):
    global matchingLines
    if line.startswith(criteria):
        matchingLines.append(line)

for line in myFile:
    # lots of other stuff
    lineProcess(line)
Obviously in this simple example it's not much of a pain to just do the checking in the main function and not bother with any additional functions. But in principle, I'm wondering what a better general way of doing this sort of thing is without using external variables.
EDIT: Part of the reason I find the separate function attractive is that once I've collected the list of lines, I am going to use them to manipulate another external file, and it would be convenient to have the whole operation wrapped up in a contained module. However, I recognize that this might be premature optimization.
Something like the following might be considered more pythonic:
def is_valid_line(line):
    """Return True if the line is valid, else False."""
    return line.startswith(criteria)

valid_lines = [l for l in myFile if is_valid_line(l)]
Incidentally, it would be even better practice to use a generator expression rather than a list, e.g.
valid_lines = (l for l in myFile if is_valid_line(l))
That way the file reading and line validation will only actually happen when something tries to iterate over valid_lines, and not before. For example, in the following case:
valid_lines = [l for l in myFile if is_valid_line(l)]
for line in valid_lines:
    stuff_that_can_raise_exception(line)
In this case, you have read and validated the entire (huge) file and built the full list of valid lines; if the first line then causes an error, the time spent validating the whole file is wasted. If you use a generator expression (the (x for x in y) version) instead of a list comprehension (the [x for x in y] version), then when the error happens you haven't actually validated the file yet (only the first line). I only mention it because I am terrible for not doing this more often myself, and it can yield big performance gains (in CPU and memory) in a lot of cases.
You could use a class, and have matching_lines be an attribute. But you could also just do this:
def process_line(line, matching_lines):
    if line.startswith(criteria):
        matching_lines.append(line)

...

matches = []
for line in my_file:
    # lots of other stuff
    process_line(line, matches)

Reducing collection of reduce in Python

If I am reducing over a collection in Python, what's the most efficient way to get the rest of the collection (the unvisited items)? I quite often need to reduce over a collection, but I want my reducing function to take the unvisited items of the collection I am reducing over.
edit - to clarify, I want something like:
reduce(lambda to-return, item, rest: (code here), collection, initial)
where rest is the items not yet seen by my lambda
This is the best I can do. It expects that the "collection" be sliceable:
def myreduce(func, collection, *args):
    """func takes 3 parameters: the previous value,
    the current value, and the rest of the collection."""
    def new_func(x, y):
        try:
            return func(x[1], y[1], collection[y[0]:])
        except TypeError:
            return func(x, y[1], collection[y[0]:])
    return reduce(new_func, enumerate(collection), *args)

print myreduce(lambda x, y, rest: x + y + sum(rest), range(30))
Note that this is very poorly tested. Please test thoroughly before you attempt to use it in any real code. If you really want this to work for any iterable, you could put a collection = tuple(collection) in there at the top, I suppose (assuming you have enough memory to store your entire iterable in memory at once).

How to write Python generator function that never yields anything

I want to write a Python generator function that never actually yields anything. Basically it's a "do-nothing" drop-in that can be used by other code which expects to call a generator (but doesn't always need results from it). So far I have this:
def empty_generator():
    # ... do some stuff, but don't yield anything
    if False:
        yield
Now, this works OK, but I'm wondering if there's a more expressive way to say the same thing, that is, declare a function to be a generator even if it never yields any value. The trick I've employed above is to show Python a yield statement inside my function, even though it is unreachable.
Another way is
def empty_generator():
    return
    yield
Not really "more expressive", but shorter. :)
Note that iter([]) or simply [] will do as well.
An even shorter solution:
def empty_generator():
    yield from []
For maximum readability and maintainability, I would prioritize a construct which goes at the top of the function. So either
your original if False: yield construct, but hoisted to the very first line, or
a separate decorator which adds generator behavior to a non-generator callable.
(That's assuming you didn't just need a callable which did something and then returned an empty iterable/iterator. If so then you could just use a regular function and return ()/return iter(()) at the end.)
Imagine the reader of your code sees:
def name_fitting_what_the_function_does():
    # We need this function to be an empty generator:
    if False: yield

    # that crucial stuff that this function exists to do
Having this at the top immediately cues in every reader of this function to this detail, which affects the whole function - affects the expectations and interpretations of this function's behavior and usage.
How long is your function body? More than a couple lines? Then as a reader, I will feel righteous fury and condemnation towards the author if I don't get a cue that this function is a generator until the very end, because I will probably have spent significant mental cost weaving a model in my head based on the assumption that this is a regular function - the first yield in a generator should ideally be immediately visible, when you don't even know to look for it.
Also, in a function longer than a few lines, a construct at the very beginning of the function is more trustworthy - I can trust that anyone who has looked at a function has probably seen its first line every time they looked at it. That means a higher chance that if that line was mistaken or broken, someone would have spotted it. That means I can be less vigilant for the possibility that this whole thing is actually broken but being used in a way that makes the breakage non-obvious.
If you're working with people who are sufficiently fluently familiar with the workings of Python, you could even leave off that comment, because to someone who immediately remembers that yield is what makes Python turn a function into a generator, it is obvious that this is the effect, and probably the intent since there is no other reason for correct code to have a non-executed yield.
Alternatively, you could go the decorator route:
@generator_that_yields_nothing
def name_fitting_what_the_function_does():
    # that crucial stuff for which this exists
    ...

where the decorator itself is:

import functools

def generator_that_yields_nothing(wrapped):
    @functools.wraps(wrapped)
    def wrapper_generator():
        if False:
            yield
        wrapped()
    return wrapper_generator
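Either way, a quick sanity check of the sketches above: calling the function runs none of its body, and iterating over the result yields nothing:

gen = name_fitting_what_the_function_does()   # no body code has run yet
assert list(gen) == []                        # the body runs here and yields nothing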

What can you use generator functions for?

I'm starting to learn Python and I've come across generator functions, those that have a yield statement in them. I want to know what types of problems that these functions are really good at solving.
Generators give you lazy evaluation. You use them by iterating over them, either explicitly with 'for' or implicitly by passing it to any function or construct that iterates. You can think of generators as returning multiple items, as if they return a list, but instead of returning them all at once they return them one-by-one, and the generator function is paused until the next item is requested.
Generators are good for calculating large sets of results (in particular calculations involving loops themselves) where you don't know if you are going to need all results, or where you don't want to allocate the memory for all results at the same time. Or for situations where the generator uses another generator, or consumes some other resource, and it's more convenient if that happened as late as possible.
Another use for generators (that is really the same) is to replace callbacks with iteration. In some situations you want a function to do a lot of work and occasionally report back to the caller. Traditionally you'd use a callback function for this. You pass this callback to the work-function and it would periodically call this callback. The generator approach is that the work-function (now a generator) knows nothing about the callback, and merely yields whenever it wants to report something. The caller, instead of writing a separate callback and passing that to the work-function, does all the reporting work in a little 'for' loop around the generator.
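A minimal sketch of that shape (the function and the "work" inside it are made up for illustration):

def long_computation(steps):
    total = 0
    for i in range(steps):
        total += i * i                  # the actual work
        yield i, total                  # report progress instead of invoking a callback

# The caller does the "reporting" in a plain for loop around the generator:
for step, running_total in long_computation(5):
    print("step {}: running total {}".format(step, running_total))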
For example, say you wrote a 'filesystem search' program. You could perform the search in its entirety, collect the results and then display them one at a time. All of the results would have to be collected before you showed the first, and all of the results would be in memory at the same time. Or you could display the results while you find them, which would be more memory efficient and much friendlier towards the user. The latter could be done by passing the result-printing function to the filesystem-search function, or it could be done by just making the search function a generator and iterating over the result.
If you want to see an example of the latter two approaches, see os.path.walk() (the old filesystem-walking function with callback) and os.walk() (the new filesystem-walking generator). Of course, if you really wanted to collect all results in a list, the generator approach is trivial to convert to the big-list approach:
big_list = list(the_generator)
One of the reasons to use generators is to make the solution clearer for some kinds of problems.
The other is to treat results one at a time, avoiding building huge lists of results that you would process separately anyway.
If you have a fibonacci-up-to-n function like this:
# function version
def fibon(n):
    a = b = 1
    result = []
    for i in xrange(n):
        result.append(a)
        a, b = b, a + b
    return result
You can write the function more easily like this:
# generator version
def fibon(n):
    a = b = 1
    for i in xrange(n):
        yield a
        a, b = b, a + b
The function is clearer. And if you use the function like this:
for x in fibon(1000000):
    print x,
In this example, with the generator version the whole 1,000,000-item list is never created at all; values are produced one at a time. That would not be the case with the list version, where the whole list would be built first.
Real World Example
Let's say you have 100 million domains in your MySQL table, and you would like to update Alexa rank for each domain.
First thing you need is to select your domain names from the database.
Let's say your table name is domains and column name is domain.
If you use SELECT domain FROM domains, it's going to return 100 million rows, which is going to consume a lot of memory. So your server might crash.
So you decided to run the program in batches. Let's say our batch size is 1000.
In our first batch we will query the first 1000 rows, check Alexa rank for each domain and update the database row.
In our second batch we will work on the next 1000 rows. In our third batch it will be from 2001 to 3000 and so on.
Now we need a generator function which generates our batches.
Here is our generator function:
def ResultGenerator(cursor, batchsize=1000):
    while True:
        results = cursor.fetchmany(batchsize)
        if not results:
            break
        for result in results:
            yield result
As you can see, our function keeps yielding results. If you used the keyword return instead of yield, the whole function would end as soon as it reached return.
return - returns only once
yield - returns multiple times
If a function uses the keyword yield, it's a generator function.
Now you can iterate like this:
import MySQLdb

db = MySQLdb.connect(host="localhost", user="root", passwd="root", db="domains")
cursor = db.cursor()
cursor.execute("SELECT domain FROM domains")

for result in ResultGenerator(cursor):
    doSomethingWith(result)

db.close()
I found this explanation helpful, since someone who doesn't know generators may not know about yield either.
Return
The return statement is where all the local variables are destroyed and the resulting value is given back (returned) to the caller. Should the same function be called some time later, the function will get a fresh new set of variables.
Yield
But what if the local variables aren't thrown away when we exit a function? That would imply we could resume the function where we left off. This is where the concept of generators is introduced, and the yield statement is what resumes the function where it left off.
def generate_integers(N):
    for i in xrange(N):
        yield i
In [1]: gen = generate_integers(3)
In [2]: gen
<generator object at 0x8117f90>
In [3]: gen.next()
0
In [4]: gen.next()
1
In [5]: gen.next()
2
So that's the difference between the return and yield statements in Python.
The yield statement is what makes a function a generator function.
So generators are a simple and powerful tool for creating iterators. They are written like regular functions, but they use the yield statement whenever they want to return data. Each time next() is called, the generator resumes where it left off (it remembers all the data values and which statement was last executed).
See the "Motivation" section in PEP 255.
A non-obvious use of generators is creating interruptible functions, which lets you do things like update UI or run several jobs "simultaneously" (interleaved, actually) while not using threads.
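For instance, here is a rough sketch of two "jobs" interleaved by a tiny round-robin loop, with no threads involved (the countdown task is made up for illustration):

def countdown(name, n):
    while n > 0:
        print("{} is at {}".format(name, n))
        n -= 1
        yield                       # hand control back to the scheduling loop

def run_round_robin(tasks):
    tasks = list(tasks)
    while tasks:
        task = tasks.pop(0)
        try:
            next(task)              # run the task up to its next yield
            tasks.append(task)      # still alive, put it back in the queue
        except StopIteration:
            pass                    # this task is finished

run_round_robin([countdown("a", 3), countdown("b", 2)])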
Buffering. When it is efficient to fetch data in large chunks but process it in small chunks, a generator might help:
def bufferedFetch():
    while True:
        buffer = getBigChunkOfData()
        # insert some code to break on 'end of data'
        for i in buffer:
            yield i
The above lets you easily separate buffering from processing. The consumer function can now just get the values one by one without worrying about buffering.
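For instance, the consumer side might be no more than this (process, like getBigChunkOfData, is just a placeholder name):

for record in bufferedFetch():
    process(record)    # sees one value at a time, knows nothing about chunking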
I have found that generators are very helpful for cleaning up your code and for giving you a unique way to encapsulate and modularize code. In a situation where you need something to constantly spit out values based on its own internal processing, and when that something needs to be called from anywhere in your code (and not just within a loop or a block, for example), generators are the feature to use.
An abstract example would be a Fibonacci number generator that does not live within a loop and when it is called from anywhere will always return the next number in the sequence:
def fib():
    first = 0
    second = 1
    yield first
    yield second
    while 1:
        next = first + second
        yield next
        first = second
        second = next
fibgen1 = fib()
fibgen2 = fib()
Now you have two Fibonacci number generator objects which you can call from anywhere in your code and they will always return ever larger Fibonacci numbers in sequence as follows:
>>> fibgen1.next(); fibgen1.next(); fibgen1.next(); fibgen1.next()
0
1
1
2
>>> fibgen2.next(); fibgen2.next()
0
1
>>> fibgen1.next(); fibgen1.next()
3
5
The lovely thing about generators is that they encapsulate state without having to go through the hoops of creating objects. One way of thinking about them is as "functions" which remember their internal state.
I got the Fibonacci example from Python Generators - What are they? and with a little imagination, you can come up with a lot of other situations where generators make for a great alternative to for loops and other traditional iteration constructs.
The simple explanation:
Consider a for statement
for item in iterable:
    do_stuff()
A lot of the time, the items in iterable don't need to be there from the start, but can be generated on the fly as they're required. This can be a lot more efficient (see the sketch below) in both
space (you never need to store all the items simultaneously) and
time (the iteration may finish before all the items are needed).
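Here is that sketch: iteration stops as soon as a match is found, so later items are never produced at all (the squares generator is made up for illustration):

def squares(n):
    i = 0
    while i < n:
        yield i * i              # each square is computed only when requested
        i += 1

for value in squares(1000000):
    if value > 100:
        print(value)             # 121 - the remaining squares are never computed
        break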
Other times, you don't even know all the items ahead of time. For example:
for command in user_input():
    do_stuff_with(command)
You have no way of knowing all the user's commands beforehand, but you can use a nice loop like this if you have a generator handing you commands:
def user_input():
    while True:
        wait_for_command()
        cmd = get_command()
        yield cmd
With generators you can also have iteration over infinite sequences, which is of course not possible when iterating over containers.
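A throwaway example of such an infinite sequence (the consumer is expected to break out at some point):

def naturals():
    n = 0
    while True:                  # never terminates on its own
        yield n
        n += 1

for n in naturals():
    if n > 3:
        break
    print(n)                     # 0, 1, 2, 3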
My favorite uses are "filter" and "reduce" operations.
Let's say we're reading a file, and only want the lines which begin with "##".
def filter2sharps(aSequence):
    for l in aSequence:
        if l.startswith("##"):
            yield l
We can then use the generator function in a proper loop
source = file( ... )
for line in filter2sharps(source.readlines()):
    print line
source.close()
The reduce example is similar. Let's say we have a file where we need to locate blocks of <Location>...</Location> lines. [Not HTML tags, but lines that happen to look tag-like.]
def reduceLocation(aSequence):
    keep = False
    block = None
    for line in aSequence:
        if line.startswith("</Location"):
            block.append(line)
            yield block
            block = None
            keep = False
        elif line.startswith("<Location"):
            block = [line]
            keep = True
        elif keep:
            block.append(line)
        else:
            pass
    if block is not None:
        yield block  # A partial block, icky
Again, we can use this generator in a proper for loop.
source = file( ... )
for b in reduceLocation(source.readlines()):
    print b
source.close()
The idea is that a generator function allows us to filter or reduce a sequence, producing another sequence one value at a time.
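Since such a generator accepts any iterable of lines, you can also skip readlines() and let it pull lines lazily from the file object itself (the file name here is made up):

with open("notes.txt") as source:        # hypothetical input file
    for line in filter2sharps(source):   # the file is read lazily, line by line
        print(line.rstrip())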
A practical example where you could make use of a generator is if you have some kind of shape and you want to iterate over its corners, edges or whatever. For my own project (source code here) I had a rectangle:
class Rect():
    def __init__(self, x, y, width, height):
        self.l_top = (x, y)
        self.r_top = (x + width, y)
        self.r_bot = (x + width, y + height)
        self.l_bot = (x, y + height)

    def __iter__(self):
        yield self.l_top
        yield self.r_top
        yield self.r_bot
        yield self.l_bot
Now I can create a rectangle and loop over its corners:
myrect = Rect(50, 50, 100, 100)
for corner in myrect:
    print(corner)
Instead of __iter__ you could have a method iter_corners and call that with for corner in myrect.iter_corners(). It's just more elegant to use __iter__ since then we can use the class instance name directly in the for expression.
Basically, they let you avoid callback functions when iterating over input while maintaining state.
See here and here for an overview of what can be done using generators.
Since the send method of a generator has not been mentioned, here is an example:
def test():
    for i in xrange(5):
        val = yield
        print(val)

t = test()
# Proceed to 'yield' statement
next(t)
# Send value to yield
t.send(1)
t.send('2')
t.send([3])
It shows the possibility of sending a value to a running generator. There is a more advanced course on generators in the video below (including an explanation of yield from, generators for parallel processing, escaping the recursion limit, etc.):
David Beazley on generators at PyCon 2014
There are some good answers here; however, I'd also recommend a complete read of the Python Functional Programming tutorial, which helps explain some of the more potent use cases of generators.
Particularly interesting is that it is now possible to update the yield variable from outside the generator function, hence making it possible to create dynamic and interwoven coroutines with relatively little effort.
Also see PEP 342: Coroutines via Enhanced Generators for more information.
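As a small sketch of that two-way flow, here is a hypothetical running-average coroutine that both receives values via send() and hands back the updated average at each yield:

def running_average():
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average    # hand back the current average, wait for the next value
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)                        # prime the coroutine up to its first yield
print(avg.send(10))              # 10.0
print(avg.send(20))              # 15.0
print(avg.send(3))               # 11.0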
I use generators when our web server is acting as a proxy (a rough sketch follows the steps below):
The client requests a proxied url from the server
The server begins to load the target url
The server yields to return the results to the client as soon as it gets them
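A rough sketch of that streaming step, using urllib2 to read the upstream response in chunks (how each chunk then reaches the client depends on the web framework, which is left out here):

import urllib2   # urllib.request in Python 3

def proxy_response(target_url, chunk_size=8192):
    upstream = urllib2.urlopen(target_url)
    try:
        while True:
            chunk = upstream.read(chunk_size)
            if not chunk:
                break            # upstream is exhausted
            yield chunk          # each chunk can be streamed to the client immediately
    finally:
        upstream.close()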
Piles of stuff. Any time you want to generate a sequence of items, but don't want to have to 'materialize' them all into a list at once. For example, you could have a simple generator that returns prime numbers:
import itertools

def primes():
    primes_found = set()
    primes_found.add(2)
    yield 2
    for i in itertools.count(1):
        candidate = i * 2 + 1
        # candidate is prime if no previously found prime divides it
        if all(candidate % prime for prime in primes_found):
            primes_found.add(candidate)
            yield candidate
You could then use that to generate the products of subsequent primes:
def prime_products():
    primeiter = primes()
    prev = next(primeiter)
    for prime in primeiter:
        yield prime * prev
        prev = prime
These are fairly trivial examples, but you can see how it can be useful for processing large (potentially infinite!) datasets without generating them in advance, which is only one of the more obvious uses.
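For instance, itertools.islice lets you take just a slice of those infinite streams:

import itertools

# First five products of consecutive primes: 2*3, 3*5, 5*7, 7*11, 11*13
print(list(itertools.islice(prime_products(), 5)))   # [6, 15, 35, 77, 143]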
Also good for printing the prime numbers up to n:
def genprime(n=10):
    for num in range(2, n + 1):    # start at 2 so the first prime is included
        for factor in range(2, num):
            if num % factor == 0:
                break
        else:
            yield num

for prime_num in genprime(100):
    print(prime_num)
