I'm running into an issue while unit-testing a Python project that I'm working on which uses generators. Simplified, the project/unit-test looks like this:
I have a setUp() function which creates a Person instance. Person is a class that has a generator, next_task(), which yields the next task that a Person has.
I now have two unit-tests that test different things about the way the generator works, using a for loop. The first test works exactly as I'd expect, and the second one never even enters the loop. In both unit tests, the first line of code is:
for rank, task in enumerate(self.person.next_task()):
My guess is that this isn't working because the same generator function is being used in two separate unit tests. But that doesn't seem like the way that generators or unit-tests are supposed to work. Shouldn't I be able to iterate twice across the list of tasks? Also, shouldn't each unit-test be working with an essentially different instance of the Person, since the Person instance is created in setUp()?
If you are really creating a new Person object in setUp then it should work as you expect. There are several reasons why it may not be working:
1) you are initialising the Person's tasks from another iterator, and that is exhausted by the second time you create Person.
2) You are creating a new Person object each time but the task generator is a class variable instead of an instance variable, so is shared between the class instances.
3) You think you are creating a new Person object but in reality you are not for some reason. Perhaps it is implemented as a singleton.
4) the unittest setUp method is broken.
Of these I think (4) is least likely, but we would need to see more of your code before we can track down the real problem.
The yielded results of a generator are consumed by the first for loop that uses it. Afterwards, the generator function returns and is finished - and thus empty. As the second unit test uses the very same generator object, it doesn't enter the loop. You have to create a new generator for the second unit test, or use itertools.tee to make N separate iterators out of one generator.
Generators do not work the way you think. On each call to next(generatorObject), the next yielded result is returned, but the result does not get stored anywhere. That's why generators are often used for lazy operations. If you want to reuse the results, you can use itertools.tee as I said, or convert the generator to a tuple/list of results.
A hint on using itertools.tee from the documentation:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
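To see the exhaustion behaviour in isolation, here is a minimal sketch (next_task here is a stand-in generator, not the asker's actual Person method):

import itertools

def next_task():
    yield "task-1"
    yield "task-2"

tasks = next_task()
print(list(tasks))        # ['task-1', 'task-2']
print(list(tasks))        # [] -- the generator object is now exhausted

# Either call the generator function again for a fresh generator...
print(list(next_task()))  # ['task-1', 'task-2']

# ...or split one generator into independent iterators with itertools.tee.
first, second = itertools.tee(next_task())
print(list(first))        # ['task-1', 'task-2']
print(list(second))       # ['task-1', 'task-2']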
I'm learning a lot about Python. For one of my programs, I need to compare all objects that have been created, so I put them in a list. I thought it might be simpler to create a class variable that holds every object created.
This seems so obvious to me that I wonder why it isn't done all the time, so I figure there must be a really really good reason for that.
So for something like
class Basket:
    baskets = []

    def __init__(self, id, volume):
        self.id = id
        self.volume = volume
        Basket.baskets.append(self)
Why is this not done more often? It seems obvious. My assumption is that there are very good reasons why you wouldn't, so I'm curious to hear them.
This is one of those ideas new programmers come up with over and over again, that turns out to be unuseful and counterproductive in practice. It's important to be able to manage the objects you create, but a class-managed single list of every instance of that class ever turns out to do a very bad job of that.
The core problem is that the "every" in "every object created" is much too broad. Code that actually needs to operate on every single instance of a specific class is extremely rare. Much more commonly, code needs to operate on every instance that particular code creates, or every member of a particular group of objects.
Using a single list of all instances makes your code inflexible. It's a design that encourages writing code to operate on "all the instances" instead of "all the instances the code cares about". When the scope of a program expands, that design makes it really hard to create instances the code doesn't or shouldn't care about.
Plus, a list of every instance is a data structure with almost no structure. It does nothing to express the relationships between objects. If two objects are in such a list, that just says "both these objects exist". You quickly end up needing more complex data structures to represent useful information about your objects, and once you have those, the class-managed list doesn't do anything useful.
For example, you've got a Basket class. We don't have enough information to tell whether this is a shopping basket, or a bin-packing problem, or what. "Volume" suggests maybe it's a bin-packing problem, so let's go with that. We've got a number of items to pack into a number of baskets, and the solver has to know about "all the baskets" to figure out how to pack items into baskets... except, it really needs to know about all the baskets in this problem. Not every instance of Basket in the entire program.
What if you want to solve two bin-packing problems, with two separate sets of baskets? Throwing all the baskets into a single list makes it hard to keep track of things. What if you want to solve two bin-packing problems at the same time, maybe in two different threads? Then you can't even just clear the list when you're done with one problem before moving on to the next.
What if you want to write unit tests? Those will need to create Basket instances. If you have a class-managed list of all baskets, the tests will add Basket instances to that list, making the tests interfere with each other. The contents of the list when one test runs will depend on test execution order. That's not good. Unit tests are supposed to be independent of each other.
Consider the built-in classes. int, dict, str, classes like those. Have you ever wanted a list of every int in your entire program, or every string? It wouldn't be very useful. It'd include all sorts of stuff you don't care about, and stuff you didn't even know existed. Random constants from modules you've never heard of, os.name, the Python copyright string, etc. You wouldn't have the slightest clue where most of it even came from. How would you get anything useful done with a list like that?
On a smaller scale, the same thing applies to a list of every instance of a class you write. Sure, your class won't be used in quite as many situations as a class like int, but as the scope of a program expands, your class will end up used in more ways, and those uses probably won't need to know about each other. A single list of instances intrinsically makes it hard for different uses of a class to avoid interfering with each other.
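To make the contrast concrete, here is a minimal sketch of the alternative (the PackingProblem name is made up for the example): each problem owns exactly the baskets it cares about, instead of the class tracking every instance ever created.

class Basket:
    def __init__(self, id, volume):
        self.id = id
        self.volume = volume

class PackingProblem:
    def __init__(self, baskets):
        self.baskets = list(baskets)   # only the baskets this problem cares about

# Two independent problems, two independent sets of baskets -- no shared global state,
# and unit tests can build their own throwaway problems without interfering.
problem_a = PackingProblem(Basket(i, 10) for i in range(3))
problem_b = PackingProblem(Basket(i, 20) for i in range(5))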
Is it a bad practice to modify function arguments?
_list = [1, 2, 3]

def modify_list(list):
    list.append(4)

print(_list)        # [1, 2, 3]
modify_list(_list)
print(_list)        # [1, 2, 3, 4] -- the caller's list was mutated in place
At first it was supposed to be a comment, but it needed more formatting and space to explain the example. ;)
If you:
know what you're doing
can justify the use
don't use mutable default arguments (they are way too confusing in the way they behave, and I can't imagine a case where their use would be justified; see the sketch after this list)
don't use global mutables anywhere near that thing (modifying mutable global's contents AND modifying mutable argument's contents?)
and, most importantly, document this thing,
this thing shouldn't cause much harm (though it might still bite you if you only think you know what you're doing and in fact you don't) and can be useful!
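Here is the mutable-default-argument trap mentioned above, as a minimal sketch (the function names are made up for the example):

def append_item(item, bucket=[]):   # the default list is created ONCE, at definition time
    bucket.append(item)
    return bucket

print(append_item(1))   # [1]
print(append_item(2))   # [1, 2] -- the same default list is reused across calls

# The usual fix: default to None and create a fresh list inside the function.
def append_item_fixed(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket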
A real-world example:
I've worked with scripts (made by other programmers) that used mutable arguments. In this case: dictionaries.
The script was meant to be run with threads but also allowed a single-threaded run. Using dictionaries instead of return values removed the difference between getting the result in single- and multi-threaded runs:
Normally the value returned by a thread is encapsulated, but we only used the value after .join anyway and didn't care about threads killed by exceptions (the single-threaded run was mostly for debugging/local use).
That way, dictionaries (more than one per function) were used for appending new results on each run, without having to collect and filter the returned values manually (the called function knew which dict to put its result in, and used a lock to ensure thread safety).
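A rough sketch of that pattern, with made-up names (a shared results dict plus a lock passed into each worker):

import threading

def worker(task_id, results, lock):
    value = task_id * 2              # placeholder for the real work
    with lock:                       # the shared dict is only touched while holding the lock
        results[task_id] = value

results = {}                         # shared by all threads, and usable single-threaded too
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(i, results, lock)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                       # e.g. {0: 0, 1: 2, 2: 4, 3: 6}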
Was it a "good" or "wrong" way of doing things?
In my opinion it was a pythonic way of doing things:
easily readable in both forms - dealing with the result was the same in single- and multi-threaded runs
data came out "automatically" nicely formatted - as opposed to unwrapping thread results and collecting and parsing them manually
and fairly easy to understand - thanks to the first and last points in my list above ;)
I've been trying to switch to Python3. Surprisingly, my difficulty is not with modules or my own code breaking. My issue is that I am always trying and testing different aspects of my code in IPython as I write it, and having generators by default makes this infuriating. I'm hoping there is either a gap in my knowledge or some sort of work around to resolve this.
My issues are:
Whenever I test a few lines of code or a function and get a generator, I have no idea what's inside since I'm getting a response like this: <generator object <genexpr> at 0x0000000007947168>. Getting around it means I can't just run code directly from my editor -- I need to dump the output into a variable and/or wrap it in a list().
Once I do start to inspect the generator, I either consume it (fully or partially) which messes it up if I wish to test it further. Partially consuming is especially annoying, because sometimes I don't notice and see odd results from subsequent code.
Oddly enough, I keep finding that I am introducing bugs (or extraneous code), not because I don't understand lazy evaluation, but because of the mismatch between what I'm evaluating in the console and what's making its way into my editor without me noticing.
Off the top of my head, I'd like to do one of the following:
Configure IPython in some way to force some kind of strict evaluation (unless I shut it off explicitly)
Inspect a generator without consuming it (or maybe inspect it and then restart itself?)
Your idea of previewing or rewinding a generator is not possible in the general case. That's because generators can have side effects, which you'd either get earlier than expected (when you preview), or get multiple times (before and after rewinding). Consider the following generator, for example:
def foo_gen():
    print("start")
    yield 1
    print("middle")
    yield 2
    print("end")
If you could preview the results yielded by this generator (1 and 2), would you expect to get the print outs too?
That said, there may be some ways for you to make your code easier to deal with.
Consider using list comprehensions instead of generator expressions. This is quite simple in most situations: just put square brackets around the genexp you already have. In many situations where you pass a generator to other code, any iterable object (such as a list) will work just as well.
Similarly, if you're getting generators passed into your code from other places, you can often pass the generator to list and use the list in your later code. This is of course not very memory efficient, since you're consuming the whole generator up front, but if you want to see the values in the interactive console, that's probably going to be necessary.
You can also use itertools.tee to get two (or more) iterators that will yield the same values as the iterable you pass in. This will allow you to inspect the values from one, while passing the other on. Be aware though that the tee code will need to store all the values yielded by any of the iterators until it has been yielded by all of the other iterators too (so if you run one iterator far ahead of the others, you may end up using as much or more memory than if you'd just used a list).
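For example, a small sketch of inspecting one tee'd copy while keeping the other intact:

import itertools

gen = (x * x for x in range(5))
to_inspect, to_keep = itertools.tee(gen)

print(list(to_inspect))   # [0, 1, 4, 9, 16] -- consumed just to look at the values
print(next(to_keep))      # 0 -- the other copy still yields from the beginning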
In case it helps anyone else, this is a line magic for IPython I threw together in response to the answer. It makes it a tiny bit less painful:
%ins <var> will create two copies of <var> using itertools.tee. One will be re-assigned to <var> (so you can re-use it in its original state), the other will be passed to print(list()) so it outputs to the terminal.
%ins <expr> will pass the expression to print(list())
To install it, save it as ins.py in ~/.ipython/profile_default/startup
from IPython.core.magic import register_line_magic
import itertools

@register_line_magic
def ins(line):
    if globals().get(line, None):
        # The line names an existing variable: tee it so the original survives inspection.
        gen1, gen2 = eval("itertools.tee({})".format(line))
        globals()[line] = gen2
        print(list(gen1))
    else:
        # Otherwise treat the line as an expression and just print its values.
        print(list(eval(line)))

# You need to delete this item from the namespace
del ins
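A quick usage example (the session output here is illustrative):

In [1]: gen = (x * x for x in range(5))

In [2]: %ins gen
[0, 1, 4, 9, 16]

In [3]: list(gen)   # gen was re-bound to the tee'd copy, so it is still usable
Out[3]: [0, 1, 4, 9, 16]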
I have a class named "Problem", and another two called "Colony", and "Ant".
A "Problem" has an attribute of type "Colony", and each "Colony" has a list of "Ants".
Each Ant is meant to be a task to run through a multiprocessing.JoinableQueue() in a method of the Problem class, and when its __call__ method is invoked, it must consult and modify a graph attribute of the Problem class, which has to be accessible to every ant.
What would be the most efficient way to implement this?
I have thought of passing each ant a copy of the graph in its constructor and, when they are all finished, joining the subgraphs back into one graph. But I think it would be better to somehow share the resource directly between all ants, for example with a semaphore-style design.
Any ideas?
Thanks
If splitting the data and joining the results can be done reasonably, this is almost always going to be more efficient—and a lot simpler—than having them all fight over shared data.
There are cases where there is no reasonable way to do this (it's either very complicated, or very slow, to join the results back up). However, even in that case there can be a reasonable alternative: return "mutation commands" of some form. The parent process can then, e.g., iterate over the output queue and apply each result to the single big array.
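As a minimal sketch of the "mutation commands" idea (the task shape is made up, and a Pool stands in for whatever queue you use): each worker returns a (where, what) pair and only the parent touches the big array.

import multiprocessing

def work(task):
    index, value = task
    return (index, value * value)    # a "mutation command": where to write, and what

if __name__ == "__main__":
    big = [0] * 4
    tasks = list(enumerate([3, 1, 4, 1]))
    with multiprocessing.Pool() as pool:
        for index, result in pool.map(work, tasks):
            big[index] = result      # the parent applies each command to the single array
    print(big)                       # [9, 1, 16, 1]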
If even that isn't feasible, then you need sharing. There are two parts to sharing: making the data itself sharable, and locking it.
Whatever your graph type is, it probably isn't inherently shareable; it's probably got internal pointers and so on. This means you will need to construct some kind of representation in terms of multiprocessing.Array, or multiprocessing.sharedctypes around Structures, or the like, or in terms of bytes in a file that each process can mmap, or by using whatever custom multiprocessing support may exist in modules like NumPy that you may be using. Then, all of your tasks can mutate the Array (or whatever), and at the end, if you need an extra step to turn that back into a useful graph object, it should be pretty quick.
Next, for locking, the really simple thing to do is create a single multiprocessing.Lock, and have each task grab the lock when it needs to mutate the shared data. In some cases, it can be more efficient to have multiple locks, protecting different parts of the shared data. And in some cases it can be more efficient to grab the lock for each mutation instead of grabbing it once for a whole "transaction" worth of them (but of course it may not be correct). Without knowing your actual code, there's no way to make a judgment on these tradeoffs; in fact, a large part of the art of shared-data threading is knowing how to work this stuff out. (And a large part of the reason shared-nothing threading is easier and usually more efficient is that you don't need to work this stuff out.)
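As a rough sketch of the shared-Array-plus-Lock approach (a flat array of counters stands in for your graph; the names are illustrative):

import multiprocessing

def bump(shared, lock, index):
    with lock:                   # each task grabs the single lock before mutating shared data
        shared[index] += 1

if __name__ == "__main__":
    shared = multiprocessing.Array("i", 4)   # four shared C ints, initialised to 0
    lock = multiprocessing.Lock()
    procs = [multiprocessing.Process(target=bump, args=(shared, lock, i % 4))
             for i in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(list(shared))                      # [2, 2, 2, 2]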
Meanwhile, I'm not sure why you need an explicit JoinableQueue here in the first place. It sounds like everything you want can be done with a Pool. To take a simpler but concrete example:
import multiprocessing

a = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
with multiprocessing.Pool() as pool:
    b = pool.map(reversed, a, chunksize=1)   # each worker task reverses one row
c = [list(i) for i in b]                     # combine the results however you like
This is a pretty stupid example, but it illustrates that each task operates on one of the rows of a and returns something, which I can then combine in some custom way (by calling list on each one) to get the result I want.
I'm creating a basic database utility class in Python. I'm refactoring an old module into a class. I'm now working on an executeQuery() function, and I'm unsure of whether to keep the old design or change it. Here are the 2 options:
(The old design:) Have one generic executeQuery method that takes the query to execute and a boolean commit parameter that indicates whether to commit (insert, update, delete) or not (select), and determines with an if statement whether to commit or to select and return.
(This is the way I'm used to, but that might be because you can't have a function that sometimes returns something and sometimes doesn't in the languages I've worked with:) Have 2 functions, executeQuery and executeUpdateQuery (or something equivalent). executeQuery will execute a simple query and return a result set, while executeUpdateQuery will make changes to the DB (insert, update, delete) and return nothing.
Is it accepted to use the first way? It seems unclear to me, but maybe it's more Pythonic...? Python is very flexible; maybe I should take advantage of that flexibility in a way that isn't really possible in stricter languages...
And a second part of this question, unrelated to the main idea - what is the best way to return query results in Python? Using which function to query the database, in what format...?
It's probably just me and my FP fetish, but I think a function executed solely for side effects is very different from a non-destructive function that fetches some data, and the two should therefore have different names. Especially if the generic function would do something different depending on exactly that (which is what the commit parameter seems to imply).
As for how to return results... I'm a huge fan of generators, but if the library you use for database connections returns a list anyway, you might as well pass this list on - a generator wouldn't buy you anything in this case. But if it allows you to iterate over the results (one at a time), seize the opportunity to save a lot of memory on larger queries.
I don't know how to answer the first part of your question; it seems like a matter of style more than anything else. Maybe you could invoke the Single Responsibility Principle to argue that it should be two separate functions.
When you're going to return a sequence of indeterminate length, it's best to use a Generator.
I'd have two methods, one which updates the database and one which doesn't. Both could delegate to a common private method, if they share a lot of code.
By separating the two methods, it becomes clear to callers what the different semantics are between the two, makes documenting the different methods easier, and clarifies what return types to expect. Since you can pull out shared code into private methods on the object, there's no worry about duplicating code.
As for returning query results, it'll depend on whether you're loading all the results from the database before returning, or returning a cursor object. I'd be tempted to do something like the following:
with db.executeQuery('SELECT * FROM my_table') as results:
    for row in results:
        print(row['col1'], row['col2'])
... so the executeQuery method returns a ContextManager object (which cleans up any open connections, if needed), which also acts as a Generator. And the results from the generator act as read-only dicts.
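A minimal sketch of that shape, assuming a sqlite3 backend (the class and method names here are just illustrations of the interface, not a finished design):

import sqlite3
from contextlib import contextmanager

class DB:
    def __init__(self, path):
        self.path = path

    @contextmanager
    def executeQuery(self, sql, params=()):
        # Read-only path: yields a cursor that produces dict-like rows lazily.
        conn = sqlite3.connect(self.path)
        conn.row_factory = sqlite3.Row
        try:
            yield conn.execute(sql, params)
        finally:
            conn.close()

    def executeUpdateQuery(self, sql, params=()):
        # Write path: commits on success, rolls back on error, returns nothing.
        conn = sqlite3.connect(self.path)
        try:
            with conn:
                conn.execute(sql, params)
        finally:
            conn.close()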