How to inspect generators in the repl/ipython in Python3

I've been trying to switch to Python3. Surprisingly, my difficulty is not with modules or my own code breaking. My issue is that I am always trying and testing different aspects of my code in IPython as I write it, and having generators by default makes this infuriating. I'm hoping there is either a gap in my knowledge or some sort of work around to resolve this.
My issues are:
Whenever I test a few lines of code or a function and get a generator, I have no idea what's inside since I'm getting a response like this: <generator object <genexpr> at 0x0000000007947168>. Getting around it means I can't just run code directly from my editor -- I need to dump the output into a variable and/or wrap it in a list().
Once I do start to inspect the generator, I either consume it (fully or partially) which messes it up if I wish to test it further. Partially consuming is especially annoying, because sometimes I don't notice and see odd results from subsequent code.
Oddly enough, I keep finding that I am introducing bugs (or extraneous code), not because I don't understand lazy evaluation, but because of the mismatch between what I'm evaluating in the console and what's making its way into my editor, slipping through my view.
Off the top of my head, I'd like to do one of the following:
Configure IPython in some way to force some kind of strict evaluation (unless I shut it off explicitly)
Inspect a generator without consuming it (or maybe inspect it and then have it restart itself?)

Your idea of previewing or rewinding a generator is not possible in the general case. That's because generators can have side effects, which you'd either get earlier than expected (when you preview), or get multiple times (before and after rewinding). Consider the following generator, for example:
def foo_gen():
    print("start")
    yield 1
    print("middle")
    yield 2
    print("end")
If you could preview the results yielded by this generator (1 and 2), would you expect to get the print outs too?
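You can see this for yourself in an interactive session (a quick illustration; the prints are the side effects):

>>> g = foo_gen()   # nothing printed yet: the body hasn't started running
>>> next(g)         # runs up to the first yield
start
1
>>> list(g)         # consumes the rest, triggering the remaining prints
middle
end
[2]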
That said, there may be some ways for you to make your code easier to deal with.
Consider using list comprehensions instead of generator expressions. This is quite simple in most situations, just put square brackets around the genexp you already have. In many situations where you pass a generator to other code, any iterable object (such as a list) will work just as well.
Similarly, if you're getting generators passed into your code from other places, you can often pass the generator to list and use the list in your later code. This is of course not very memory efficient, since you're consuming the whole generator up front, but if you want to see the values in the interactive console, that's probably going to be necessary.
You can also use itertools.tee to get two (or more) iterators that will yield the same values as the iterable you pass in. This will allow you to inspect the values from one, while passing the other on. Be aware though that the tee code will need to store all the values yielded by any of the iterators until it has been yielded by all of the other iterators too (so if you run one iterator far ahead of the others, you may end up using as much or more memory than if you'd just used a list).
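As a minimal sketch of the tee approach (variable names are illustrative):

import itertools

gen = (x * x for x in range(4))
gen, preview = itertools.tee(gen)   # two independent iterators over the same values
print(list(preview))                # [0, 1, 4, 9] -- inspect one copy...
print(next(gen))                    # 0 -- ...while the other remains unconsumed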

In case it helps anyone else, this is a line magic for IPython I threw together in response to the answer. It makes it a tiny bit less painful:
%ins <var> will create two copies of <var> using itertools.tee. One will be re-assigned to <var> (so you can re-use it in its original state), the other will be passed to print(list()) so it outputs to the terminal.
%ins <expr> will pass the expression to print(list()).
To install, save it as ins.py in ~/.ipython/profile_default/startup:
from IPython.core.magic import register_line_magic
import itertools

@register_line_magic
def ins(line):
    # If the argument names an existing variable, tee it so inspection doesn't consume it
    if globals().get(line, None):
        gen1, gen2 = eval("itertools.tee({})".format(line))
        globals()[line] = gen2   # re-bind the name to an unconsumed copy
        print(list(gen1))
    else:
        # Otherwise evaluate the argument as an expression and print its values
        print(list(eval(line)))

# You need to delete this item from the namespace
del ins
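For illustration, a session with the magic installed might look like this (hypothetical input and output):

In [1]: gen = (x * x for x in range(5))

In [2]: %ins gen
[0, 1, 4, 9, 16]

In [3]: list(gen)   # still consumable: %ins re-bound the name to a fresh tee copy
Out[3]: [0, 1, 4, 9, 16]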

Related

Is it wrong practice to modify function arguments?

Is it a bad practice to modify function arguments?
_list = [1, 2, 3]

def modify_list(list):
    list.append(4)

print(_list)          # [1, 2, 3]
modify_list(_list)
print(_list)          # [1, 2, 3, 4]
At first it was supposed to be a comment, but it needed more formatting and space to explain the example. ;)
If you:
know what you're doing
can justify the use
don't use mutable default arguments (they are way too confusing in the way they behave -- see the sketch after this list; I can't imagine the reason their use would ever be justified)
don't use global mutables anywhere near that thing (modifying a mutable global's contents AND a mutable argument's contents at the same time is asking for trouble)
and, most importantly, document this thing,
this thing shouldn't cause much harm (though it might still bite you if you only think you know what you're doing, when in fact you don't) and can be useful!
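To illustrate the mutable-default-argument point from the list above, here is the classic pitfall (a sketch, not from the original question):

def append_to(item, target=[]):   # the default list is created ONCE, at def time
    target.append(item)
    return target

print(append_to(1))   # [1]
print(append_to(2))   # [1, 2] -- the same default list is shared between calls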
Example:
I've worked with scripts (made by other programmers) that used mutable arguments. In this case: dictionaries.
The script was supposed to be run with threads but also allowed a single-thread run. Using dictionaries instead of return values removed any difference between getting the result in single- and multiple-thread runs:
Normally the value returned by a thread's target function isn't directly retrievable, but we only used the values after .join anyway and didn't care about threads killed by exceptions (the single-thread run was mostly for debugging/local use).
That way, dictionaries (more than one in a single function) were used for appending new results on each run, without the need to collect returned values manually and filter them (the called function knew which dict to put its result in, and used a lock to ensure thread safety).
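A minimal sketch of that pattern (names are hypothetical, not taken from the actual scripts):

import threading

def do_work(task, results, lock):
    with lock:                        # the callee stores its own result, thread-safely
        results[task] = task * 2      # stand-in for the real work

lock = threading.Lock()

# single-thread run: plain calls
results = {}
for t in [1, 2, 3]:
    do_work(t, results, lock)
print(results)                        # {1: 2, 2: 4, 3: 6}

# multi-thread run: same function, and the results are read the same way
results = {}
threads = [threading.Thread(target=do_work, args=(t, results, lock))
           for t in [1, 2, 3]]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(results)                        # {1: 2, 2: 4, 3: 6}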
Was it a "good" or "wrong" way of doing things?
In my opinion it was a pythonic way of doing things:
easily readable in both forms - dealing with the result was the same in the single- and multi-threaded runs
data was "automatically" nicely formatted - as opposed to unwrapping thread results and collecting and parsing them manually
and fairly easy to understand - see the first and last points in my list above ;)

Walk a list without creating unneeded object

I usually do this:
[worker.do_work() for worker in workers]
This has the advantage of being very readable and contained in a single line, but the problem of creating an object (a list) which I do not need, which means garbage collection is unnecessarily triggered.
The obvious alternative:
for worker in workers:
    worker.do_work()
Is also quite readable, but uses two lines.
Is there a single-line way of achieving the same result, without creating unnecessary objects?
Sure, there is.
def doLotsOfWork(wks):
    for w in wks:
        w.do_work()
And now, your "one liner":
doLotsOfWork(workers)
In short, there's no "shorter" (or better) way besides using a for loop. I'd advise you not to use the list comprehension, because it relies on side effects - that's a code smell.
"GC" in python is quite different from java. A ref-count decrement is much much cheaper than mark-and-sweep. Benchmark it, then decide if you're placing too much emphasis on a small cost.
To make it a one liner, simply define a helper function and then it's a single line to invoke it. Bury the function in an imported library if convenient.

Reusing Generators in Different Unit Tests

I'm running into an issue while unit-testing a Python project that I'm working on which uses generators. Simplified, the project/unit-test looks like this:
I have a setUp() function which creates a Person instance. Person is a class that has a generator, next_task(), which yields the next task that a Person has.
I now have two unit-tests that test different things about the way the generator works, using a for loop. The first test works exactly as I'd expect, and the second one never even enters the loop. In both unit tests, the first line of code is:
for rank, task in enumerate(self.person.next_task()):
My guess is that this isn't working because the same generator function is being used in two separate unit tests. But that doesn't seem like the way that generators or unit-tests are supposed to work. Shouldn't I be able to iterate twice across the list of tasks? Also, shouldn't each unit-test be working with an essentially different instance of the Person, since the Person instance is created in setUp()?
If you are really creating a new Person object in setUp then it should work as you expect. There are several reasons why it may not be working:
1) You are initialising the Person's tasks from another iterator, and that iterator is exhausted by the second time you create a Person.
2) You are creating a new Person object each time, but the task generator is a class variable instead of an instance variable, so it is shared between the class instances (sketched below).
3) You think you are creating a new Person object but in reality you are not, for some reason. Perhaps it is implemented as a singleton.
4) The unittest setUp method is broken.
Of these I think (4) is least likely, but we would need to see more of your code before we can track down the real problem.
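Reason (2) is easy to reproduce; a hypothetical sketch:

class Person:
    tasks = iter(["eat", "sleep"])   # class attribute: ONE iterator shared by all instances

    def next_task(self):
        yield from Person.tasks

p1, p2 = Person(), Person()
print(list(p1.next_task()))          # ['eat', 'sleep']
print(list(p2.next_task()))          # [] -- the shared iterator is already exhausted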
The yielded results of a generator are consumed by the first for loop that uses it. Afterwards, the generator function returns and is finished - and thus empty. As the second unit test uses the very same generator object, it doesn't enter the loop. You have to create a new generator for the second unit test, or use itertools.tee to make N separate iterators out of one generator.
Generators do not work the way you think. On each call to next(generator_object) (generator_object.next() in Python 2), the next yielded result is returned, but the result does not get stored anywhere. That's why generators are often used for lazy operations. If you want to reuse the results, you can use itertools.tee as I said, or convert the generator to a tuple/list of results.
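For example (a minimal sketch):

def count_up():
    yield from range(3)

gen = count_up()
print(list(gen))          # [0, 1, 2]
print(list(gen))          # [] -- this generator object is exhausted

print(list(count_up()))   # [0, 1, 2] -- calling the function again gives a fresh generator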
A hint on using itertools.tee from the documentation:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

Check if something is a list

What is the easiest way to check if something is a list?
A method doSomething has the parameters a and b. In the method, it will loop through the list a and do something. I'd like a way to make sure a is a list, before looping through - thus avoiding an error or the unfortunate circumstance of passing in a string then getting back a letter from each loop.
This question must have been asked before - however my googles failed me. Cheers.
To enable more use cases but still treat strings as scalars, don't check whether a is a list; check that it isn't a string (basestring is the Python 2 name; in Python 3, test against str):
if not isinstance(a, basestring):
    ...
Typechecking hurts the generality, simplicity, and maintainability of your code. It is seldom used in good, idiomatic Python programs.
There are two main reasons people want to typecheck:
To issue errors if the caller provides the wrong type.
This is not worth your time. If the user provides an incompatible type for the operation you are performing, an error will already be raised when the incompatibility is hit. It is worrisome that this might not happen immediately, but it typically doesn't take long at all, and skipping the check results in code that is more robust, simple, efficient, and easier to write.
Oftentimes people insist on this in the hope that they can catch all the dumb things a user might do. If a user is willing to do arbitrarily dumb things, there is nothing you can do to stop him. Typechecking mainly has the potential of locking out a user who comes in with his own types that are drop-in replacements for the ones expected, or a user who recognizes that your function should actually be polymorphic and provides something different that accepts the same operations.
If I had a big system where lots of things made by lots of people should fit together right, I would use a system like zope.interface to make testing that everything fits together right.
To do different things based on the types of the arguments received.
This makes your code worse because your API is inconsistent. A function or method should do one thing, not fundamentally different things. This ends up being a feature not usually worth supporting.
One common scenario is to have an argument that can either be a foo or a list of foos. A cleaner solution is simply to accept a list of foos. Your code is simpler and more consistent. If it's an important, common use case only to have one foo, you can consider having another convenience method/function that calls the one that accepts a list of foos and lose nothing. Providing the first API would not only have been more complicated and less consistent, but it would break when the types were not the exact values expected; in Python we distinguish between objects based on their capabilities, not their actual types. It's almost always better to accept an arbitrary iterable or a sequence instead of a list and anything that works like a foo instead of requiring a foo in particular.
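For instance, a sketch of that cleaner design (the names are made up for illustration):

def process_foos(foos):
    """The one real API: accepts any iterable of foos."""
    return [foo * 2 for foo in foos]

def process_foo(foo):
    """Optional convenience wrapper for the single-foo case."""
    return process_foos([foo])[0]

print(process_foos([1, 2, 3]))   # [2, 4, 6]
print(process_foo(5))            # 10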
As you can tell, I do not think either reason is compelling enough to typecheck under normal circumstances.
I'd like a way to make sure a is a list, before looping through
Document the function.
Usually performing type-checks is considered bad style in Python, but try
if isinstance(a, list):
    ...
(I think you may also check if a.__iter__ exists.)
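If you want a capability check rather than an exact type check, a Python 3 sketch using collections.abc (my addition; the answers above are Python 2 flavoured):

from collections.abc import Iterable

def as_list(a):
    # accept any iterable, but treat strings as scalars
    if isinstance(a, str) or not isinstance(a, Iterable):
        return [a]
    return list(a)

print(as_list([1, 2]))   # [1, 2]
print(as_list("abc"))    # ['abc']
print(as_list(7))        # [7]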

Python - Things I shouldn't be doing?

I've got a few questions about best practices in Python. Not too long ago I would do something like this with my code:
...
junk_block = "".join(open("foo.txt","rb").read().split())
...
I don't do this anymore because I can see that it makes code harder to read, but would the code run slower if I split the statements up like so:
f_obj = open("foo.txt", "rb")
f_data = f_obj.read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
I also noticed that there's nothing keeping you from doing an 'import' within a function block, is there any reason why I should do that?
As long as you're inside a function (not at module top level), assigning intermediate results to local barenames has an essentially negligible cost. (At module top level, assigning to "local" barenames implies churning on a dict -- the module's __dict__ -- and is measurably costlier than it would be within a function; the remedy is never to have "substantial" code at module top level... always stash substantial code within a function!-)
Python's general philosophy includes "flat is better than nested" -- and that includes highly "nested" expressions. Looking at your original example...:
junk_block = "".join(open("foo.txt","rb").read().split())
presents another important issue: when is that file getting closed? In CPython today, you need not worry -- reference counting in practice does ensure timely closure. But most other Python implementations (Jython on the JVM, IronPython on .NET, PyPy on all sorts of backends, pynie on Parrot, Unladen Swallow on LLVM if and when it matures per its published roadmap, ...) do not guarantee the use of reference counting -- many garbage collection strategies may be involved, with all sorts of other advantages.
Without any guarantee of reference counting (and even in CPython it's always been deemed an implementation artifact, not part of the language semantics!), you might be exhausting resources by executing such "open but no close" code in a tight loop -- garbage collection is triggered by scarcity of memory and does not consider other limited resources such as file descriptors. Since 2.6 (and 2.5, with an "import from the future"), Python has had a great solution via the RAII ("resource acquisition is initialization") approach supported by the with statement:
with open("foo.txt","rb") as f:
junk_block = "".join(f.read().split())
is the least-"unnested" way that will ensure timely closure of the file across all compliant versions of Python. The stronger semantics make it preferable.
Beyond ensuring the correct, and prudent;-), semantics, there's not that much to choose between nested and flattened versions of an expression such as this. Given the task "remove all runs of whitespace from the file's contents", I would be tempted to benchmark alternative approaches based on re and on the .translate method of strings (the latter, esp. in Python 2.*, is often the fastest way to delete all characters from a certain set!), before settling on the "split and rejoin" approach if it proves to be faster -- but that's really a rather different issue;-).
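For reference, the alternatives mentioned would look something like this in Python 3 (a sketch; the benchmarking is left to the reader):

import re
import string

data = "some  text\twith   whitespace\n"

via_split = "".join(data.split())
via_re = re.sub(r"\s+", "", data)
via_translate = data.translate(str.maketrans("", "", string.whitespace))

assert via_split == via_re == via_translate   # all three delete every whitespace run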
First of all, there's not really a reason you shouldn't use the first example - it's quite readable in that it's concise about what it does. No reason to break it up since it's just a linear combination of calls.
Second, import within a function block is useful if there's a particular library function that you only need within that function - since the scope of an imported symbol is only the block within which it is imported, if you only ever use something once, you can just import it where you need it and not have to worry about name conflicts in other functions. This is especially handy with from X import Y statements, since Y won't be qualified by its containing module name and thus might conflict with a similarly named function in a different module being used elsewhere.
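For example, a function-local import keeps the imported name scoped to the function:

def parse_date(text):
    # datetime is only needed here, so import it where it's used
    from datetime import datetime
    return datetime.strptime(text, "%Y-%m-%d")

print(parse_date("2010-01-15"))   # 2010-01-15 00:00:00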
from PEP 8 (which is worth reading anyway)
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants
That line has almost the same result as this:
junk_block = open("foo.txt","rb").read().replace(' ', '')
In your example you are splitting the text into a list of words (split() breaks on any whitespace, including tabs and newlines) and then joining them back together with no spaces. The example above instead uses the str.replace() method, which removes only plain space characters.
The differences:
Yours builds a file object into memory, builds a string into memory by reading it, builds a list into memory by splitting the string, builds a new string by joining the list.
Mine builds a file object into memory, builds a string into memory by reading it, builds a new string into memory by replacing spaces.
You can see that the new variation uses a bit less RAM but a bit more processor time. RAM is more valuable in some cases, so memory waste is frowned upon when it can be avoided.
Most of the memory will be garbage collected immediately, but multiple users at the same time will still hog RAM.
If you want to know if your second code fragment is slower, the quick way to find out would be to just use timeit. I wouldn't expect there to be that much difference though, since they seem pretty equivalent.
You should also ask if a performance difference actually matters in the code in question. Often readability is of more value than performance.
I can't think of any good reasons for importing a module in a function, but sometimes you just don't know you'll need to do something until you see the problem. I'll have to leave it to others to point out a constructive example of that, if it exists.
I think both snippets are readable. I (and that's just a question of personal style) would probably use the first, adding a comment line, something like: "Open the file and convert the data inside into a list".
Also, there are times when I use the second, maybe not so separated, but something like
f_data = open("foo.txt", "rb").read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
But then I'm giving more prominence to each operation, which could be important in the flow of the code. I think it's important that you are comfortable with the code and won't find it difficult to understand in the future.
Definitely, the code will not be (at least, not much) slower, as the only "overhead" you're adding is assigning the results to names.
