Walk a list without creating an unneeded object - Python

I usually do this:
[worker.do_work() for worker in workers]
This has the advantage of being very readable and contained in a single line, but it has the problem of creating an object (a list) that I do not need, which means garbage collection is triggered unnecessarily.
The obvious alternative:
for worker in workers:
    worker.do_work()
Is also quite readable, but uses two lines.
Is there a single-line way of achieving the same result, without creating unnecessary objects?

Sure, there is.
def doLotsOfWork(wks):
    for w in wks:
        w.do_work()
And now, your "one liner":
doLotsOfWork(workers)
In short, there's no "shorter" (or better) way besides using a for loop. I'd advise you not to use the list comprehension, because it relies on side effects - that's a code smell.

"GC" in python is quite different from java. A ref-count decrement is much much cheaper than mark-and-sweep. Benchmark it, then decide if you're placing too much emphasis on a small cost.
To make it a one liner, simply define a helper function and then it's a single line to invoke it. Bury the function in an imported library if convenient.
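For illustration, a minimal sketch of that approach; the module and function names here (iterutils, for_each) are made up rather than standard-library, and workers is the question's own variable:
# iterutils.py -- hypothetical helper module
def for_each(func, iterable):
    # Apply func to every item; nothing is returned, so no throwaway list is built.
    for item in iterable:
        func(item)

# at the call site, a single line:
for_each(lambda w: w.do_work(), workers)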

Is it a bad practice to modify function arguments?
_list = [1, 2, 3]
def modify_list(list):
    list.append(4)
print(_list)
modify_list(_list)
print(_list)
At first it was supposed to be a comment, but it needed more formatting and space to explain the example. ;)
If you:
know what you're doing
can justify the use
don't use mutable default arguments (they are way too confusing in the way they behave - see the sketch after this list - and I can't imagine a reason their use would ever be justified)
don't use global mutables anywhere near that thing (modifying a mutable global's contents AND a mutable argument's contents in the same function is asking for trouble)
and, most importantly, document this thing,
then this thing shouldn't cause much harm (though it might still bite you if you only think you know what you're doing but in fact don't) and can even be useful!
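As an aside on the mutable-default point above, here is the classic sketch of why they behave confusingly (append_to is a made-up name for illustration):
def append_to(item, target=[]):   # the default list is created once, at definition time
    target.append(item)
    return target

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] -- the same default list is reused across calls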
Example:
I've worked with scripts (made by other programmers) that used mutable arguments. In this case: dictionaries.
The script was supposed to be run with threads, but it also allowed a single-threaded run. Using dictionaries instead of return values removed the difference between getting the result in single- and multi-threaded runs:
Normally the value produced by a thread is not directly accessible to the caller, but we only used the values after .join() anyway and didn't care about threads killed by exceptions (the single-threaded run was mostly for debugging/local use).
That way, dictionaries (more than one in a single function) were used for appending new results in each run, without the need to collect returned values manually and filter them (the called function knew which dict to put the result in and used a lock to ensure thread safety).
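A rough, self-contained sketch of that pattern with threading (the names worker and do_work and the squaring "work" are invented for illustration; the real scripts used dictionaries and locks in the same way):
import threading

def do_work(key):
    return key * key              # stand-in for the real computation

results = {}                      # shared dict, identical in single- and multi-threaded runs
results_lock = threading.Lock()

def worker(key, results):
    value = do_work(key)
    with results_lock:            # ensure thread safety when writing
        results[key] = value

threads = [threading.Thread(target=worker, args=(k, results)) for k in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                    # every thread's output, with no return values to gather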
Was it a "good" or "wrong" way of doing things?
In my opinion it was a pythonic way of doing things:
easily readable in both forms - dealing with the result was the same in single- and multi-threaded
data was "automatically" nicely formatted - as opposed to unpacking thread results, collecting them manually and parsing them
and fairly easy to understand - with first and last point in my list above ;)

How to inspect generators in the repl/ipython in Python3

I've been trying to switch to Python3. Surprisingly, my difficulty is not with modules or my own code breaking. My issue is that I am always trying and testing different aspects of my code in IPython as I write it, and having generators by default makes this infuriating. I'm hoping there is either a gap in my knowledge or some sort of work around to resolve this.
My issues are:
Whenever I test a few lines of code or a function and get a generator, I have no idea what's inside since I'm getting a response like this: <generator object <genexpr> at 0x0000000007947168>. Getting around it means I can't just run code directly from my editor -- I need to dump the output into a variable and/or wrap it in a list().
Once I do start to inspect the generator, I either consume it (fully or partially) which messes it up if I wish to test it further. Partially consuming is especially annoying, because sometimes I don't notice and see odd results from subsequent code.
Oddly enough, I keep finding that I am introducing bugs (or extraneous code), not because I don't understand lazy evaluation, but because the mismatch between what I'm evaluating in the console and what's making its way into my editor slips through my view.
Off the top of my head, I'd like to do one of the following:
Configure IPython in some way to force some kind of strict evaluation (unless I shut it off explicitly)
Inspect a generator without consuming it (or maybe inspect it and then rewind it?)
Your idea of previewing or rewinding a generator is not possible in the general case. That's because generators can have side effects, which you'd either get earlier than expected (when you preview), or get multiple times (before and after rewinding). Consider the following generator, for example:
def foo_gen():
    print("start")
    yield 1
    print("middle")
    yield 2
    print("end")
If you could preview the results yielded by this generator (1 and 2), would you expect to get the print outs too?
That said, there may be some ways for you to make your code easier to deal with.
Consider using list comprehensions instead of generator expressions. This is quite simple in most situations, just put square brackets around the genexp you already have. In many situations where you pass a generator to other code, any iterable object (such as a list) will work just as well.
Similarly, if you're getting generators passed into your code from other places, you can often pass the generator to list and use the list in your later code. This is of course not very memory efficient, since you're consuming the whole generator up front, but if you want to see the values in the interactive console, that's probably going to be necessary.
You can also use itertools.tee to get two (or more) iterators that will yield the same values as the iterable you pass in. This will allow you to inspect the values from one, while passing the other on. Be aware though that the tee code will need to store all the values yielded by any of the iterators until it has been yielded by all of the other iterators too (so if you run one iterator far ahead of the others, you may end up using as much or more memory than if you'd just used a list).
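For example, a quick way to peek at a generator in the shell without losing it, using itertools.tee as described above (just a sketch):
import itertools

gen = (x * x for x in range(5))
gen, peek = itertools.tee(gen)    # two independent iterators over the same values
print(list(peek))                 # [0, 1, 4, 9, 16] -- only the copy is consumed
# gen is still usable here; note that tee buffers everything peek has yielded
# and gen has not, so peeking at the whole thing stores the whole thing.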
In case it helps anyone else, this is a line magic for IPython I threw together in response to the answer. It makes it a tiny bit less painful:
%ins <var> will create two copies of <var> using itertools.tee. One will be re-assigned to <var> (so you can re-use it in its original state), the other will be passed to print(list()) so its contents are printed to the terminal.
%ins <expr> will pass the expression to print(list())
To install save it as ins.py in ~/.ipython/profile_default/startup
from IPython.core.magic import register_line_magic
import itertools

@register_line_magic
def ins(line):
    if globals().get(line, None):
        gen1, gen2 = eval("itertools.tee({})".format(line))
        globals()[line] = gen2
        print(list(gen1))
    else:
        print(list(eval(line)))

# You need to delete this item from the namespace
del ins

How to efficiently share a common parent class attribute among multiprocessing tasks?

I have a class named "Problem", and another two called "Colony", and "Ant".
A "Problem" has an attribute of type "Colony", and each "Colony" has a list of "Ants".
Each Ant is conceived as a task to be run via a multiprocessing.JoinableQueue() in a method of the "Problem" class, and when its __call__ method is invoked, it must consult and modify a graph attribute of the "Problem" class, which would have to be accessed by every ant.
What would be the most efficient way to implement this?
I have thought of passing each ant a copy of the graph in the constructor and then, when they are finished, joining all the subgraphs into one graph. But I think it would be better to somehow share the resource directly between all ants, using a "semaphore"-style design.
Any ideas?
Thanks
If splitting the data and joining the results can be done reasonably, this is almost always going to be more efficient—and a lot simpler—than having them all fight over shared data.
There are cases where there is no reasonable way to do this (it's either very complicated, or very slow, to join the results back up). However, even in that case there can be a reasonable alternative: return "mutation commands" of some form. The parent process can then, e.g., iterate over the output queue and apply each result to the single big array.
If even that isn't feasible, then you need sharing. There are two parts to sharing: making the data itself sharable, and locking it.
Whatever your graph type is, it probably isn't inherently shareable; it's probably got internal pointers and so on. This means you will need to construct some kind of representation in terms of multiprocessing.Array, or multiprocessing.sharedctypes around Structures, or the like, or in terms of bytes in a file that each process can mmap, or by using whatever custom multiprocessing support may exist in modules like NumPy that you may be using. Then, all of your tasks can mutate the Array (or whatever), and at the end, if you need an extra step to turn that back into a useful graph object, it should be pretty quick.
Next, for locking, the really simple thing to do is create a single multiprocessing.Lock and have each task grab the lock when it needs to mutate the shared data. In some cases, it can be more efficient to have multiple locks protecting different parts of the shared data. And in some cases it can be more efficient to grab the lock for each mutation instead of grabbing it once for a whole "transaction" worth of mutations (but of course it may not be correct). Without knowing your actual code, there's no way to make a judgment on these tradeoffs; in fact, a large part of the art of shared-data threading is knowing how to work this stuff out. (And a large part of the reason shared-nothing threading is easier and usually more efficient is that you don't need to work this stuff out.)
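A bare-bones sketch of a shared Array plus a single Lock (the task, increment_all, is invented for illustration; a real graph representation would of course be more involved):
import multiprocessing

def increment_all(shared, lock):
    with lock:                                        # one lock for the whole mutation
        for i in range(len(shared)):
            shared[i] += 1

if __name__ == "__main__":
    shared = multiprocessing.Array("i", [0, 1, 2, 3]) # shareable array of C ints
    lock = multiprocessing.Lock()
    procs = [multiprocessing.Process(target=increment_all, args=(shared, lock))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(list(shared))                               # [4, 5, 6, 7]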
Meanwhile, I'm not sure why you need an explicit JoinableQueue here in the first place. It sounds like everything you want can be done with a Pool. To take a simpler but concrete example:
import multiprocessing

a = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
with multiprocessing.Pool() as pool:
    b = pool.map(reversed, a, chunksize=1)
c = [list(i) for i in b]
This is a pretty stupid example, but it illustrates that each task operates on one of the rows of a and returns something, which I can then combine in some custom way (by calling list on each one) to get the result I want.

Check if something is a list

What is the easiest way to check if something is a list?
A method doSomething has the parameters a and b. In the method, it will loop through the list a and do something. I'd like a way to make sure a is a list, before looping through - thus avoiding an error or the unfortunate circumstance of passing in a string then getting back a letter from each loop.
This question must have been asked before - however my googles failed me. Cheers.
To enable more usecases, but still treat strings as scalars, don't check for a being a list, check that it isn't a string:
if not isinstance(a, basestring):
    ...
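Note: basestring exists only in Python 2; on Python 3, a roughly equivalent check (just a sketch, not the only option) is to exclude str and bytes explicitly:
# Python 3 variant of the same idea
if not isinstance(a, (str, bytes)):
    ...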
Typechecking hurts the generality, simplicity, and maintainability of your code. It is seldom used in good, idiomatic Python programs.
There are two main reasons people want to typecheck:
To issue errors if the caller provides the wrong type.
This is not worth your time. If the user provides an incompatible type for the operation you are performing, an error will already be raised when the incompatible operation is attempted. It is worrisome that this might not happen immediately, but it typically doesn't take long at all, and the result is code that is more robust, simple, efficient, and easier to write.
Oftentimes people insist on this with the hope they can catch all the dumb things a user can do. If a user is willing to do arbitrarily dumb things, there is nothing you can do to stop them. Typechecking mainly has the potential of locking out a user who comes with their own types that are drop-in replacements for the ones you expect, or who recognizes that your function should really be polymorphic and provides something different that supports the same operations.
If I had a big system where lots of things made by lots of people had to fit together right, I would use a system like zope.interface to test that everything fits together right.
To do different things based on the types of the arguments received.
This makes your code worse because your API is inconsistent. A function or method should do one thing, not fundamentally different things. This ends up being a feature not usually worth supporting.
One common scenario is to have an argument that can be either a foo or a list of foos. A cleaner solution is simply to accept a list of foos. Your code is simpler and more consistent. If having only one foo is an important, common use case, you can consider having another convenience method/function that calls the one accepting a list of foos, and you lose nothing. Providing the first API would not only have been more complicated and less consistent, but it would break when the types were not the exact ones expected; in Python we distinguish between objects based on their capabilities, not their actual types. It's almost always better to accept an arbitrary iterable or a sequence instead of a list, and anything that works like a foo instead of requiring a foo in particular.
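A small sketch of that convenience-wrapper pattern (all names here, including transform, are invented for illustration):
def transform(item):
    return item.upper()               # stand-in for the real per-foo work

def process_items(items):
    # the one real API: accepts any iterable of items
    return [transform(item) for item in items]

def process_item(item):
    # thin convenience wrapper for the common single-item case
    return process_items([item])[0]

print(process_items(["a", "b"]))      # ['A', 'B']
print(process_item("c"))              # 'C'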
As you can tell, I do not think either reason is compelling enough to typecheck under normal circumstances.
I'd like a way to make sure a is a list, before looping through
Document the function.
Usually it's not considered good style to perform type checks in Python, but try
if isinstance(a, list):
    ...
(I think you may also check if a.__iter__ exists.)

Python - Things I shouldn't be doing?

I've got a few questions about best practices in Python. Not too long ago I would do something like this with my code:
...
junk_block = "".join(open("foo.txt","rb").read().split())
...
I don't do this anymore because I can see that it makes code harder to read, but would the code run slower if I split the statements up like so:
f_obj = open("foo.txt", "rb")
f_data = f_obj.read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
I also noticed that there's nothing keeping you from doing an 'import' within a function block, is there any reason why I should do that?
As long as you're inside a function (not at module top level), assigning intermediate results to local bare names has an essentially negligible cost. (At module top level, assigning to the "local" bare names implies churning on a dict -- the module's __dict__ -- and is measurably costlier than it would be within a function; the remedy is never to have "substantial" code at module top level... always stash substantial code within a function!-)
Python's general philosophy includes "flat is better than nested" -- and that includes highly "nested" expressions. Looking at your original example...:
junk_block = "".join(open("foo.txt","rb").read().split())
presents another important issue: when is that file getting closed? In CPython today, you need not worry -- reference counting in practice does ensure timely closure. But most other Python implementations (Jython on the JVM, IronPython on .NET, PyPy on all sorts of backends, pynie on Parrot, Unladen Swallow on LLVM if and when it matures per its published roadmap, ...) do not guarantee the use of reference counting -- many garbage collection strategies may be involved, with all sorts of other advantages.
Without any guarantee of reference counting (and even in CPython it's always been deemed an implementation artifact, not part of the language semantics!), you might be exhausting resources by executing such "open but no close" code in a tight loop -- garbage collection is triggered by scarcity of memory and does not consider other limited resources such as file descriptors. Since 2.6 (and 2.5, with an "import from the future"), Python has a great solution via the RAII ("resource acquisition is initialization") approach supported by the with statement:
with open("foo.txt", "rb") as f:
    junk_block = "".join(f.read().split())
is the least-"unnested" way that will ensure timely closure of the file across all compliant versions of Python. The stronger semantics make it preferable.
Beyond ensuring the correct, and prudent;-), semantics, there's not that much to choose between nested and flattened versions of an expression such as this. Given the task "remove all runs of whitespace from the file's contents", I would be tempted to benchmark alternative approaches based on re and on the .translate method of strings (the latter, esp. in Python 2.*, is often the fastest way to delete all characters from a certain set!), before settling on the "split and rejoin" approach if it proves to be faster -- but that's really a rather different issue;-).
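For instance, the re-based alternative mentioned above might look like the sketch below (worth benchmarking against the split-and-join version before settling on either; it assumes the same foo.txt opened in binary mode):
import re

with open("foo.txt", "rb") as f:
    junk_block = re.sub(br"\s+", b"", f.read())   # delete every run of whitespace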
First of all, there's not really a reason you shouldn't use the first example - it's quite readable in that it's concise about what it does. No reason to break it up, since it's just a linear combination of calls.
Second, import within a function block is useful if there's a particular library function that you only need within that function - since the scope of an imported symbol is only the block within which it is imported, if you only ever use something once, you can just import it where you need it and not have to worry about name conflicts in other functions. This is especially handy with from X import Y statements, since Y won't be qualified by its containing module name and thus might conflict with a similarly named function in a different module being used elsewhere.
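For example (json here is just a stand-in for whatever library only this one function needs):
def dump_report(data, path):
    import json                       # imported locally: only this function uses it
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

dump_report({"ok": True}, "report.json")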
From PEP 8 (which is worth reading anyway):
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants
That line has almost the same result as this:
junk_block = open("foo.txt","rb").read().replace(' ', '')
In your example you are splitting the text into a list of words and then joining them back together with no spaces. The example above instead uses the str.replace() method. (Note that split() removes all whitespace - spaces, tabs, and newlines - while replace(' ', '') removes only space characters, so the two differ on files containing tabs or newlines.)
The differences:
Yours builds a file object into memory, builds a string into memory by reading it, builds a list into memory by splitting the string, builds a new string by joining the list.
Mine builds a file object into memory, builds a string into memory by reading it, builds a new string into memory by replacing spaces.
You can see that a bit less RAM is used in the new variation, but more processor time. RAM is more valuable in some cases, so memory waste is frowned upon when it can be avoided.
Most of the memory will be garbage collected immediately, but multiple concurrent users will still hog RAM.
If you want to know if your second code fragment is slower, the quick way to find out would be to just use timeit. I wouldn't expect there to be that much difference though, since they seem pretty equivalent.
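As a sketch, assuming the same foo.txt from the question exists in the working directory:
import timeit

one_liner = 'junk_block = "".join(open("foo.txt", "rb").read().split())'
split_up = '''
f_obj = open("foo.txt", "rb")
f_data = f_obj.read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
'''
print(timeit.timeit(one_liner, number=1000))   # time the one-liner
print(timeit.timeit(split_up, number=1000))    # time the split-up version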
You should also ask if a performance difference actually matters in the code in question. Often readability is of more value than performance.
I can't think of any good reasons for importing a module in a function, but sometimes you just don't know you'll need to do something until you see the problem. I'll have to leave it to others to point out a constructive example of that, if it exists.
I think both pieces of code are readable. I (and that's just a question of personal style) would probably use the first, adding a comment line, something like: "Open the file and convert the data inside into a list"
Also, there are times when I use the second, maybe not so separated, but something like
f_data = open("foo.txt", "rb").read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
But then I'm giving more weight to each operation, which could be important in the flow of the code. I think it's important that you are comfortable and don't find the code difficult to understand in the future.
Definitely, the code will not be (at least, much) slower, as the only "overhead" you're adding is assigning the results to names.
