If within an instance, I have self.foo = 1, what is the difference between these (or other more complicated examples):
# 1
for i in range(10):
print(self.foo)
# 2
foo = self.foo
for i in range(10):
print(foo)
I'm currently looking at a code base where all the self variables are reassigned to something else. Just wondering if there is any reason to do so and would like to hear both from an efficiency standpoint and a code clarity standpoint.
Consider these possibilities:
The local variable self gets rebound in the middle of the loop. (That's not possible with the specific code you've given, but a different loop could conceivably do it.) In that case, #1 will see the new self's foo attribute, while #2 will not. Although, of course, you could just as easily rebind the local variable foo as the local variable self…
self is mutable, and self.foo is rebound to a different value in the middle of the loop. (That could happen more easily with, e.g., another thread operating on the same object.) Again, #1 will see the new value of the foo attribute, but #2 will not.
self.foo is itself mutable, and its value is mutated in the middle of the loop (e.g., it's a list, and some other thread calls append(2) on it). Now both #1 and #2 will see the new value.
Everything is immutable, or there's just no code (including on other threads) to mutate anything. Now both #1 and #2 are going to see the original value, because there is no other value to see.
If any of those semantic differences are relevant, then of course you want to use whichever one gives you the right answer.
Meanwhile, every time you access self.foo, that requires doing an attribute lookup. In the most common case, this means looking up 'foo' in self.__dict__, which is pretty quick, but not free. And you can easily create pathological cases where it goes through 23 base classes in MRO order before calling a __getattr__ that creates the value on the fly and returns a descriptor whose __get__ method does some non-trivial transformation.
Accessing foo, on the other hand, is going to be compiled into just loading a value out of an array on the frame using a compiled-in index. So it will almost always be faster, and in some cases it can be a lot faster.
In most real-life cases, this doesn't matter at all. But occasionally, it does. In which case copying the value to a local outside the loop is a worthwhile micro-optimization. This is a little more common with bound methods than with normal values (because they always have a descriptor call in the way); see the unique_everseen recipe in the itertools docs for an example.
Of course you could contrive a case where this optimization actually made things slower—e.g., make that loop really tiny, but put the whole thing inside an outer loop. Now the extra self.foo copy each time through the outer loop (and the fact that the bytecode involved in the loop is longer and may spill onto another cache line) could cost a lot more than it saves.
If there's no semantic difference that matters, and the performance difference doesn't matter, then it's just a matter of clarify.
If the expression is a lot more complicated than self.foo, it may well be clearer to pull out the value and give it a name.
But for a trivial case like this, it's probably clearer to just use self.foo. By taking the extra step of copying it to a local variable, you're signaling that you had some reason to do so. So a reader will wonder whether maybe self.foo can get rebound in a different thread, or maybe this loop is a major bottleneck in your code and the self.foo access is a performance issue, etc., and waste time dealing with all of those irrelevancies instead of just reading your code as intended.
Related
I am coding in Python trying to decide whether I should return a numpy array (the result of a diff on some other array) or return numpy.where(diff)[0], which is a smaller array but requires that little extra work to create. Let's call the method where this happens methodB.
I call methodB from methodA. The rub is that I won't necessarily always need the where() result in methodA, but I might. So is it worth doing this work inside methodB, or should I pass back the (much larger memory-wise) diff itself and then only process it further in methodA if needed? That would be the more efficient choice assuming methodA just gets a reference to the result.
So, are function results ever not copied when they are passed back the the code that called that function?
I believe that when methodB finishes, all the memory in its frame will be reclaimed by the system, so methodA has to actually copy anything returned by methodB in to its own frame in order to be able to use it. I would call this "return by value". Is this correct?
Yes, you are correct. In Python, arguments are always passed by value, and return values are always returned by value. However, the value being returned (or passed) is a reference to a potentially shared, potentially mutable object.
There are some types for which the value being returned or passed may be the actual object itself, e.g. this is the case for integers, but the difference between the two can only be observed for mutable objects which integers aren't, and de-referencing an object reference is completely transparent, so you will never notice the difference. To simplify your mental model, you may just assume that arguments and return values are always passed by value (this is true anyhow), and that the value being passed is always a reference (this is not always true, but you cannot tell the difference, you can treat it as a simple performance optimization).
Note that passing / returning a reference by value is in no way similar (and certainly not the same thing) as passing / returning by reference. In particular, it does not allow you to mutate the name binding in the caller / callee, as pass-by-reference would allow you to.
This particular flavor of pass-by-value, where the value is typically a reference is the same in e.g. ECMAScript, Ruby, Smalltalk, and Java, and is sometimes called "call by object sharing" (coined by Barbara Liskov, I believe), "call by sharing", "call by object", and specifically within the Python community "call by assignment" (thanks to #timgeb) or "call by name-binding" (thanks to #Terry Jan Reedy) (not to be confused with call by name, which is again a different thing).
Assignment never copies data. If you have a function foo that returns a value, then an assignment like result = foo(arg) never copies any data. (You could, of course, have copy-operations in the function's body.) Likewise, return x does not copy the object x.
Your question lacks a specific example, so I can't go into more detail.
edit: You should probably watch the excellent Facts and Myths about Python names and values talk.
So roughly your code is:
def methodA(arr):
x = methodB(arr)
....
def methodB(arr):
diff = somefn(arr)
# return diff or
# return np.where(diff)[0]
arr is a (large) array, that is passed a reference to methodA and methodB. No copies are made.
diff is a similar size array that is generated in methodB. If that is returned, it be referenced in the methodA namespace by x. No copy is made in returning it.
If the where array is returned, diff disappears when methodB returns. Assuming it doesn't share a data buffer with some other array (such as arr), all the memory that it occupied is recovered.
But as long as memory isn't tight, returning diff instead of the where result won't be more expensive. Nothing is copied during the return.
A numpy array consists of small object wrapper with attributes like shape and dtype. It also has a pointer to a potentially large data buffer. Where possible numpy tries to share buffers, but readily makes new ndarray objects. Thus there's an important distinction between view and copy.
I see what I missed now: Objects are created on the heap, but function frames are on the stack. So when methodB finishes, its frame will be reclaimed, but that object will still exist on the heap, and methodA can access it with a simple reference.
I would like to write a Python function that mutates one of the arguments (which is a list, ie, mutable). Something like this:
def change(array):
array.append(4)
change(array)
I'm more familiar with passing by value than Python's setup (whatever you decide to call it). So I would usually write such a function like this:
def change(array):
array.append(4)
return array
array = change(array)
Here's my confusion. Since I can just mutate the argument, the second method would seem redundant. But the first one feels wrong. Also, my particular function will have several parameters, only one of which will change. The second method makes it clear what argument is changing (because it is assigned to the variable). The first method gives no indication. Is there a convention? Which is 'better'? Thank you.
The first way:
def change(array):
array.append(4)
change(array)
is the most idiomatic way to do it. Generally, in python, we expect a function to either mutate the arguments, or return something1. The reason for this is because if a function doesn't return anything, then it makes it abundantly clear that the function must have had some side-effect in order to justify it's existence (e.g. mutating the inputs).
On the flip side, if you do things the second way:
def change(array):
array.append(4)
return array
array = change(array)
you're vulnerable to have hard to track down bugs where a mutable object changes all of a sudden when you didn't expect it to -- "But I thought change made a copy"...
1Technically every function returns something, that _something_ just happens to be None ...
The convention in Python is that functions either mutate something, or return something, not both.
If both are useful, you conventionally write two separate functions, with the mutator named for an active verb like change, and the non-mutator named for a participle like changed.
Almost everything in builtins and the stdlib follows this pattern. The list.append method you're calling returns nothing. Same with list.sort—but sorted leaves its argument alone and instead returns a new sorted copy.
There are a handful of exceptions for some of the special methods (e.g., __iadd__ is supposed to mutate and then return self), and a few cases where there clearly has to be one thing getting mutating and a different thing getting returned (like list.pop), and for libraries that are attempting to use Python as a sort of domain-specific language where being consistent with the target domain's idioms is more important than being consistent with Python's idioms (e.g., some SQL query expression libraries). Like all conventions, this one is followed unless there's a good reason not to.
So, why was Python designed this way?
Well, for one thing, it makes certain errors obvious. If you expected a function to be non-mutating and return a value, it'll be pretty obvious that you were wrong, because you'll get an error like AttributeError: 'NoneType' object has no attribute 'foo'.
It also makes conceptual sense: a function that returns nothing must have side-effects, or why would anyone have written it?
But there's also the fact that each statement in Python mutates exactly one thing—almost always the leftmost object in the statement. In other languages, assignment is an expression, mutating functions return self, and you can chain up a whole bunch of mutations into a single line of code, and that makes it harder to see the state changes at a glance, reason about them in detail, or step through them in a debugger.
Of course all of this is a tradeoff—it makes some code more verbose in Python than it would be in, say, JavaScript—but it's a tradeoff that's deeply embedded in Python's design.
It hardly ever makes sense to both mutate an argument and return it. Not only might it cause confusion for whoever's reading the code, but it leaves you susceptible to the mutable default argument problem. If the only way to get the result of the function is through the mutated argument, it won't make sense to give the argument a default.
There is a third option that you did not show in your question. Rather than mutating the object passed as the argument, make a copy of that argument and return it instead. This makes it a pure function with no side effects.
def change(array):
array_copy = array[:]
array_copy.append(4)
return array_copy
array = change(array)
From the Python documentation:
Some operations (for example y.append(10) and y.sort()) mutate the
object, whereas superficially similar operations (for example y = y +
[10] and sorted(y)) create a new object. In general in Python (and in
all cases in the standard library) a method that mutates an object
will return None to help avoid getting the two types of operations
confused. So if you mistakenly write y.sort() thinking it will give
you a sorted copy of y, you’ll instead end up with None, which will
likely cause your program to generate an easily diagnosed error.
However, there is one class of operations where the same operation
sometimes has different behaviors with different types: the augmented
assignment operators. For example, += mutates lists but not tuples or
ints (a_list += [1, 2, 3] is equivalent to a_list.extend([1, 2, 3])
and mutates a_list, whereas some_tuple += (1, 2, 3) and some_int += 1
create new objects).
Basically, by convention, a function or method that mutates an object does not return the object itself.
Which design pattern is preferred?
For a dictionary d that may or may not have a key named 'foo'...
Pattern A
if d.get('foo'):
func(d.get('foo'))
Pattern B
foo = d.get('foo')
if foo:
func(foo)
I think I prefer the 2-line approach of Pattern A. Does the second lookup cost more than the extra assignment of Pattern B?
The two things you're writing do completely different things.
The first one uses the string 'foo' as the argument to func.
The second one uses whatever value is in d['foo'] as the argument to func.
Which one is right depends entirely on which one is what you actually want to do.
In your edited version, doing the lookup twice is silly.
Of course it "costs more"—you have to hash 'foo' twice, and look up that hash value (and possibly probe a few times) twice, and that costs twice as much as doing it once. But that performance cost is unlikely to matter in any real program. (Even if your key were expensive to hash—which a three-character string is not—most types that are expensive to hash will cache the hash value the first time…)
A more realistic potential problem is that you've added a race condition for no good reason if you ever want to make your code multithreaded or just reentrant.
But, more importantly, repeating yourself always introduces opportunities for error, because you have to repeat yourself exactly right, and it's not always obvious when you haven't done so, especially if you later edit one of the copies. (The fact that you got the question wrong in the first place is a pretty good argument…) And for the same reasons, you also introduce opportunities for people to read your code wrong, or just stumble over reading it and have to think about something that should have been obvious. That's why DRY (Don't Repeat Yourself—unless you have a good reason to do so) is a fundamental principle in programming.
That being said, the first one would be much better written as if 'foo' in d:. If you don't actually need the value, don't retrieve it.
And the second one might be better written with EAFP instead of LBYL:
try:
func(d['foo'])
except KeyError:
pass
Or, in Python 3.5:
func(d['foo']) except KeyError: None
I understand why Python requires explicit self qualifier when referring to instance attributes.
But I often forget it, since I didn't need it in C++.
The bug I introduce this way is sometimes extremely hard to catch; e.g., suppose I write
if x is not None:
f()
instead of
if self.x is not None:
f()
Suppose attribute x is usually None, so f() is rarely called. And suppose f() only creates a subtle side effect (e.g., a change in a numeric value, or clearing the cache, etc.). Unless I have insane amount of unit tests, this mistake is likely to remain unnoticed for a long time.
I am wondering if anyone knows coding techniques or IDE features that could help me catch or avoid this type of bug.
Don't name your instance attributes the same things as your globals/locals.
If there isn't a global/local of the same name, you'll get a global "foo" is not defined error when you try to access self.foo but forget the self..
As a corollary: give your variables descriptive names. Don't name everything x - not only does this make it far less likely you'll have variables with the same name as attributes, it also makes your code easier to read.
What are the best practices and recommendations for using explicit del statement in python? I understand that it is used to remove attributes or dictionary/list elements and so on, but sometimes I see it used on local variables in code like this:
def action(x):
result = None
something = produce_something(x)
if something:
qux = foo(something)
result = bar(qux, something)
del qux
del something
return result
Are there any serious reasons for writing code like this?
Edit: consider qux and something to be something "simple" without a __del__ method.
I don't remember when I last used del -- the need for it is rare indeed, and typically limited to such tasks as cleaning up a module's namespace after a needed import or the like.
In particular, it's not true, as another (now-deleted) answer claimed, that
Using del is the only way to make sure
a object's __del__ method is called
and it's very important to understand this. To help, let's make a class with a __del__ and check when it is called:
>>> class visdel(object):
... def __del__(self): print 'del', id(self)
...
>>> d = visdel()
>>> a = list()
>>> a.append(d)
>>> del d
>>>
See? del doesn't "make sure" that __del__ gets called: del removes one reference, and only the removal of the last reference causes __del__ to be called. So, also:
>>> a.append(visdel())
>>> a[:]=[1, 2, 3]
del 550864
del 551184
when the last reference does go away (including in ways that don't involve del, such as a slice assignment as in this case, or other rebindings of names and other slots), then __del__ gets called -- whether del was ever involved in reducing the object's references, or not, makes absolutely no difference whatsoever.
So, unless you specifically need to clean up a namespace (typically a module's namespace, but conceivably that of a class or instance) for some specific reason, don't bother with del (it can be occasionally handy for removing an item from a container, but I've found that I'm often using the container's pop method or item or slice assignment even for that!-).
No.
I'm sure someone will come up with some silly reason to do this, e.g. to make sure someone doesn't accidentally use the variable after it's no longer valid. But probably whoever wrote this code was just confused. You can remove them.
When you are running programs handling really large amounts of data ( to my experience when the totals memory consumption of the program approaches something like 1GB) deleting some objects:
del largeObject1
del largeObject2
…
can give your program the necessary breathing room to function without running out of memory. This can be the easiest way to modify a given program, in case of a “MemoryError” runtime error.
Actually, I just came across a use for this. If you use locals() to return a dictionary of local variables (useful when parsing things) then del is useful to get rid of a temporary that you don't want to return.