Why are mutable strings slower than immutable strings? - python

Why are mutable strings slower than immutable strings?
EDIT:
>>> import UserString
... def test():
...     s = UserString.MutableString('Python')
...     for i in range(3):
...         s[0] = 'a'
...
... if __name__=='__main__':
...     from timeit import Timer
...     t = Timer("test()", "from __main__ import test")
...     print t.timeit()
13.5236170292
>>> import UserString
... def test():
...     s = UserString.MutableString('Python')
...     s = 'abcd'
...     for i in range(3):
...         s = 'a' + s[1:]
...
... if __name__=='__main__':
...     from timeit import Timer
...     t = Timer("test()", "from __main__ import test")
...     print t.timeit()
6.24725079536
>>> import UserString
... def test():
...     s = UserString.MutableString('Python')
...     for i in range(3):
...         s = 'a' + s[1:]
...
... if __name__=='__main__':
...     from timeit import Timer
...     t = Timer("test()", "from __main__ import test")
...     print t.timeit()
38.6385951042
I think it is obvious why I put s = UserString.MutableString('Python') in the second test.

In a hypothetical language that offers both mutable and immutable, otherwise equivalent, string types (I can't really think of one offhand -- e.g., Python and Java both have immutable strings only, and other ways to make one through mutation which add indirectness and therefore can of course slow things down a bit;-), there's no real reason for any performance difference -- for example, in C++, interchangeably using a std::string or a const std::string I would expect to cause no performance difference (admittedly a compiler might be able to optimize code using the latter better by counting on the immutability, but I don't know any real-world ones that do perform such theoretically possible optimizations;-).
Having immutable strings may and does in fact allow very substantial optimizations in Java and Python. For example, if the strings get hashed, the hash can be cached, and will never have to be recomputed (since the string can't change) -- that's especially important in Python, which uses hashed strings (for look-ups in sets and dictionaries) so lavishly and even "behind the scenes". Fresh copies never need to be made "just in case" the previous one has changed in the meantime -- references to a single copy can always be handed out systematically whenever that string is required. Python also copiously uses "interning" of (some) strings, potentially allowing constant-time comparisons and many other similarly fast operations -- think of it as one more way, a more advanced one to be sure, to take advantage of strings' immutability to cache more of the results of operations often performed on them.
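As a small illustration of the interning point, here is a sketch in Python 3 syntax (where the intern() built-in lives at sys.intern; the variable names are made up for the example). Two equal strings built at run time are only guaranteed to compare equal, but once interned they are guaranteed to be the same object, so an identity check is enough.

```python
import sys

# Build two equal strings at run time so they are not simply the same literal.
parts = ["immutable", "string"]
a = " ".join(parts)
b = " ".join(parts)

same_value = (a == b)            # equality compares contents

# sys.intern() returns the one canonical copy for a given value, so the
# interned versions are guaranteed to be the very same object.
ia = sys.intern(a)
ib = sys.intern(b)
same_object = (ia is ib)

print(same_value, same_object)
```

This is only safe because a str can never change after it has been handed out; a mutable string could not be shared this way.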
That's not to say that a given compiler is going to take advantage of all possible optimizations, of course. For example, when a slice of a string is requested, there is no real need to make a new object and copy the data over -- the new slice might refer to the old one with an offset (and an independently stored length), potentially a great optimization for big strings out of which many slices are taken. Python doesn't do that because, unless particular care is taken in memory management, this might easily result in the "big" string being all kept in memory when only a small slice of it is actually needed -- but it's a tradeoff that a different implementation might definitely choose to perform (with that burden of extra memory management, to be sure -- more complex, harder-to-debug compiler and runtime code for the hypothetical language in question).
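Python's str does not share storage between a string and its slices, but the offset-plus-length idea can be seen with memoryview over a bytes object. That is a different mechanism from str slicing and is used here only as an illustration; the sketch is in Python 3 and the variable names are made up.

```python
data = bytes(range(256)) * 4096        # about 1 MB of immutable bytes

copied = data[1000:2000]               # ordinary slice: allocates and copies 1000 bytes
view = memoryview(data)[1000:2000]     # view: no copy, just a window onto data

print(len(copied), len(view))
print(bytes(view) == copied)
# The view keeps the whole underlying object alive, which is exactly the
# memory tradeoff described above.
print(view.obj is data)
```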
I'm just scratching the surface here -- and many of these advantages would be hard to keep if otherwise interchangeable string types could exist in both mutable and immutable versions (which I suspect is why, to the best of my current knowledge at least, C++ compilers actually don't bother with such optimizations, despite being generally very performance-conscious). But by offering only immutable strings as the primitive, fundamental data type (and thus implicitly accepting some disadvantage when you'd really need a mutable one;-), languages such as Java and Python can clearly gain all sorts of advantages -- performance issues being only one group of them (Python's choice to allow only immutable primitive types to be hashable, for example, is not a performance-centered design decision -- it's more about clarity and predictability of behavior for sets and dictionaries!-).

I don't know if they are really a lot slower, but immutable strings often make it easier to reason about a program, because the state of the object/string can't change. To me, that's the most important property of immutability.
Furthermore, you might assume that immutable strings are faster because they have less state that can change, which might mean lower memory consumption and fewer CPU cycles.
I also found this interesting article while googling which I would like to quote:
knowing that a string is immutable
makes it easy to lay it out at
construction time — fixed and
unchanging storage requirements

With an immutable string, Python can intern it and refer to it internally by its address in memory. This means that to compare two interned strings, it only has to compare their addresses in memory (if one of them isn't interned, it falls back to comparing the contents). Also, keep in mind that not all strings are interned. I've seen examples of constructed strings that are not interned.
With mutable strings, string comparison would involve comparing them character by character, and would also require either storing identical strings in different locations (malloc is not free) or adding logic to keep track of how many times a given string is referred to and making a copy for every mutation if there were more than one referrer.
It seems like Python is optimized for string comparison. This makes sense because even string manipulation involves string comparison in most cases, so for most use cases it's the lowest common denominator.
Another advantage of immutable strings is that they can be hashable, which is a requirement for using them as dictionary keys. Imagine a scenario where they were mutable:
s = 'a'
d = {s : 1}
s = s + 'b'
d[s] = ?
I suppose Python could keep track of which dicts have which strings as keys and update all of their hash tables when a string was modified, but that's just adding more overhead to dict insertion. It's not too far off the mark to say that you can't do anything in Python without a dict insertion/lookup, so that would be very, very bad. It also adds overhead to string manipulation.
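The scenario can be tried out with a made-up key class whose hash depends on state that can change. MutableKey is hypothetical and exists only for this sketch (Python 3 syntax); it stands in for the mutable string imagined above.

```python
class MutableKey:
    """Hypothetical mutable, hashable text wrapper (not a real Python type)."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return isinstance(other, MutableKey) and self.text == other.text

    def __hash__(self):
        # The hash depends on state that can change: this is the problem.
        return hash(self.text)


s = MutableKey("a")
d = {s: 1}
s.text += "b"                       # mutate the key while it sits in the dict

# The entry is still stored, but it was filed under the hash of "a" and now
# compares equal only to "ab", so neither value finds it by lookup.
print(len(d))
print(MutableKey("a") in d)
print(MutableKey("ab") in d)
```

The dict still holds one entry, yet a fresh key with either the old or the new text fails to find it.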

The obvious answer to your question is that normal strings are implemented in C, while MutableString is implemented in Python.
Not only does every operation on a mutable string have the overhead of going through one or more Python function calls, but the implementation is essentially a wrapper round an immutable string - when you modify the string it creates a new immutable string and throws the old one away. You can read the source in the UserString.py file in your Python lib directory.
To quote the Python docs:
Note:
This UserString class from this module
is available for backward
compatibility only. If you are writing
code that does not need to work with
versions of Python earlier than Python
2.2, please consider subclassing directly from the built-in str type
instead of using UserString (there is
no built-in equivalent to
MutableString).
This module defines a class that acts
as a wrapper around string objects. It
is a useful base class for your own
string-like classes, which can inherit
from them and override existing
methods or add new ones. In this way
one can add new behaviors to strings.
It should be noted that these classes
are highly inefficient compared to
real string or Unicode objects; this
is especially the case for
MutableString.
(Emphasis added).
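The "wrapper round an immutable string" behaviour described in this answer can be sketched roughly as follows. This is not the actual UserString.py source: SketchMutableString is a made-up, heavily simplified class in Python 3 syntax, meant only to show why every "mutation" costs a Python-level call plus a fresh string.

```python
class SketchMutableString:
    """Simplified stand-in for a pure-Python mutable string wrapper."""

    def __init__(self, text=""):
        self.data = str(text)          # the real storage is an immutable str

    def __setitem__(self, index, char):
        if not 0 <= index < len(self.data):
            raise IndexError("index out of range")
        # "Mutation" builds a brand-new str and drops the old one.
        self.data = self.data[:index] + char + self.data[index + 1:]

    def __getitem__(self, index):
        return self.data[index]

    def __str__(self):
        return self.data


s = SketchMutableString("Python")
before = s.data
s[0] = "a"
print(str(s))                # aython
print(s.data is before)      # False: a new str replaced the old one
```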

Related

what does it mean by 'passed by assignment'?

The following is my understanding of types and parameter passing in Java and Python:
In Java, there are primitive types and non-primitive types. The former are not objects; the latter are objects.
In Python, they are all objects.
In Java, arguments are passed by value because:
primitive types are copied and then passed, so they are passed by value for sure. Non-primitive types are passed by reference, but a reference (pointer) is also a value, so they are also passed by value.
In Python, the only difference is that 'primitive types' (for example, numbers) are not copied, but simply taken as objects.
According to the official docs, arguments are passed by assignment. What does 'passed by assignment' mean? Do objects in Java work the same way as in Python? What results in the difference (passed by value in Java and passed by assignment in Python)?
And is there anything wrong in my understanding above?
tl;dr: You're right that Python's semantics are essentially Java's semantics, without any primitive types.
"Passed by assignment" is actually making a different distinction than the one you're asking about.[1] The idea is that argument passing to functions (and other callables) works exactly the same way assignment works.
Consider:
def f(x):
    pass
a = 3
b = a
f(a)
b = a means that the target b, in this case a name in the global namespace, becomes a reference to whatever value a references.
f(a) means that the target x, in this case a name in the local namespace of the frame built to execute f, becomes a reference to whatever value a references.
The semantics are identical. Whenever a value gets assigned to a target (which isn't always a simple name—e.g., think lst[0] = a or spam.eggs = a), it follows the same set of assignment rules—whether it's an assignment statement, a function call, an as clause, or a loop iteration variable, there's just one set of rules.
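A short sketch of what "the same set of assignment rules" means in practice (Python 3 syntax; the function names are made up for the example). Binding a parameter behaves like an assignment statement: the name refers to the caller's object, rebinding the name does not affect the caller, and mutating the object does.

```python
def rebind(x):
    # x is a new local name bound to the caller's object; rebinding x
    # only changes what the local name refers to.
    x = [0]
    return x

def mutate(x):
    # Mutating through x changes the one shared object.
    x.append(0)

def same_object(x, original):
    # The parameter refers to the very object the caller passed.
    return x is original

a = [1, 2, 3]
b = a                      # assignment statement: b refers to the same list
print(b is a)              # True
print(same_object(a, a))   # True: the call bound x the same way
rebind(a)
print(a)                   # [1, 2, 3]
mutate(a)
print(a, b)                # [1, 2, 3, 0] [1, 2, 3, 0]
```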
But overall, your intuitive idea that Python is like Java but with only reference types is accurate: You always "pass a reference by value".
Arguing over whether that counts as "pass by reference" or "pass by value" is pointless. Trying to come up with a new unambiguous name for it that nobody will argue about is even more pointless. Liskov invented the term "call by object" three decades ago, and if that never caught on, anything someone comes up with today isn't likely to do any better.
You understand the actual semantics, and that's what matters.
And yes, this means there is no copying. In Java, only primitive values are copied, and Python doesn't have primitive values, so nothing is copied.
the only difference is that 'primitive types'(for example, numbers) are not copied, but simply taken as objects
It's much better to see this as "the only difference is that there are no 'primitive types' (not even simple numbers)", just as you said at the start.
It's also worth asking why Python has no primitive types, or why Java does.[2]
Making everything "boxed" can be very slow. Adding 2 + 3 in Python means dereferencing the 2 and 3 objects, getting the native values out of them, adding them together, and wrapping the result up in a new 5 object (or looking it up in a table because you already have an existing 5 object). That's a lot more work than just adding two ints.[3]
While a good JIT like HotSpot (or like PyPy for Python) can often automatically do those optimizations, sometimes "often" isn't good enough. That's why Java has native types: to let you manually optimize things in those cases.
Python, instead, relies on third-party libraries like NumPy, which let you pay the boxing costs just once for a whole array, instead of once per element. That keeps the language simpler, but at the cost of needing NumPy.[4]
1. As far as I know, "passed by assignment" appears a couple times in the FAQs, but is not actually used in the reference docs or glossary. The reference docs already lean toward intuitive over rigorous, but the FAQ, like the tutorial, goes much further in that direction. So, asking what a term in the FAQ means, beyond the intuitive idea it's trying to get across, may not be a meaningful question in the first place.
2. I'm going to ignore the issue of Java's lack of operator overloading here. There's no reason they couldn't include special language rules for a handful of core classes, even if they didn't let you do the same thing with your own classes—e.g., Go does exactly that for things like range, and people rarely complain.
3. … or even than looping over two arrays of 30-bit digits, which is what Python actually does. The cost of working on unlimited-size "bigints" is tiny compared to the cost of boxing, so Python just always pays that extra, barely-noticeable cost. Python 2 did, like Java, have separate fixed and bigint types, but a couple decades of experience showed that it wasn't getting any performance benefits out of the extra complexity.
4. The implementation of Numpy is of course far from simple. But using it is pretty simple, and a lot more people need to use Numpy than need to write Numpy, so that turns out to be a pretty decent tradeoff.
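The boxing cost can be glimpsed without NumPy by comparing a list of int objects with the standard-library array module, which stores raw machine values in one buffer. This is only a rough sketch: array is a stand-in for NumPy here, the sizes reported by sys.getsizeof are CPython implementation details, and the variable names are made up.

```python
import sys
from array import array

n = 1000
boxed = list(range(1_000_000, 1_000_000 + n))   # n separate int objects plus a list of pointers
unboxed = array("q", boxed)                      # n raw signed 64-bit values in one buffer

per_int_object = sys.getsizeof(boxed[0])         # size of one boxed int object
per_raw_value = unboxed.itemsize                 # bytes per raw value in the array

print(per_int_object, per_raw_value)
print(sum(boxed) == sum(unboxed))
```

The array pays for one object header for the whole container, while the list pays for one per element.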
Similar to passing reference types by value in C#.
Docs: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/passing-reference-type-parameters#passing-reference-types-by-value
Code demo:
# mutable object
l = [9, 8, 7]

def createNewList(l1: list):
    # l1 + [0] creates a new list object; only the local name l1 is rebound to it, so the global l is unaffected
    l1 = l1 + [0]

def changeList(l1: list):
    # append() mutates the list in place; l1 and l refer to the same object, so l changes too
    l1.append(0)

print(l)   # [9, 8, 7]
createNewList(l)
print(l)   # [9, 8, 7]
changeList(l)
print(l)   # [9, 8, 7, 0]

# immutable object
num = 9

def changeValue(val: int):
    # int is an immutable type; assigning to val rebinds the local name to the object 8
    # and does not change the value of num
    val = 8

print(num)   # 9
changeValue(num)
print(num)   # 9

Why are named tuples always tracked by python's GC?

As we (or at least I) learned in this answer, simple tuples that only contain immutable values are no longer tracked by Python's garbage collector once it figures out that they can never be involved in reference cycles:
>>> import gc
>>> x = (1, 2)
>>> gc.is_tracked(x)
True
>>> gc.collect()
0
>>> gc.is_tracked(x)
False
Why isn't this the case for namedtuples, which are a subclass of tuple from the collections module that features named fields?
>>> import gc
>>> from collections import namedtuple
>>> foo = namedtuple('foo', ['x', 'y'])
>>> x = foo(1, 2)
>>> gc.is_tracked(x)
True
>>> gc.collect()
0
>>> gc.is_tracked(x)
True
Is there something inherent in their implementation that prevents this or was it just overlooked?
The only comment about this that I could find is in the gcmodule.c file of the Python sources:
NOTE: about untracking of mutable objects.
Certain types of container cannot participate in a reference cycle, and so do not need to be tracked by the garbage collector.
Untracking these objects reduces the cost of garbage collections.
However, determining which objects may be untracked is not free,
and the costs must be weighed against the benefits for garbage
collection.
There are two possible strategies for when to untrack a container:
i) When the container is created.
ii) When the container is examined by the garbage collector.
Tuples containing only immutable objects (integers, strings etc,
and recursively, tuples of immutable objects) do not need to be
tracked. The interpreter creates a large number of tuples, many of
which will not survive until garbage collection. It is therefore
not worthwhile to untrack eligible tuples at creation time.
Instead, all tuples except the empty tuple are tracked when
created. During garbage collection it is determined whether any
surviving tuples can be untracked. A tuple can be untracked if all
of its contents are already not tracked. Tuples are examined for
untracking in all garbage collection cycles. It may take more than
one cycle to untrack a tuple.
Dictionaries containing only immutable objects also do not need to
be tracked. Dictionaries are untracked when created. If a tracked
item is inserted into a dictionary (either as a key or value), the
dictionary becomes tracked. During a full garbage collection (all
generations), the collector will untrack any dictionaries whose
contents are not tracked.
The module provides the python function is_tracked(obj), which
returns the current tracking status of the object. Subsequent
garbage collections may change the tracking status of the object.
Untracking of certain containers was introduced in issue #4688, and the algorithm was refined in response to issue #14775.
(See the linked issues to see the real code that was introduced to allow untracking)
This comment is a bit ambiguous; however, it does not state that the algorithm for choosing which objects to "untrack" applies to generic containers. This means that the code checks only tuples (and dicts), not their subclasses.
You can see this in the code of the file:
/* Try to untrack all currently tracked dictionaries */
static void
untrack_dicts(PyGC_Head *head)
{
    PyGC_Head *next, *gc = head->gc.gc_next;
    while (gc != head) {
        PyObject *op = FROM_GC(gc);
        next = gc->gc.gc_next;
        if (PyDict_CheckExact(op))
            _PyDict_MaybeUntrack(op);
        gc = next;
    }
}
Note the call to PyDict_CheckExact, and:
static void
move_unreachable(PyGC_Head *young, PyGC_Head *unreachable)
{
    PyGC_Head *gc = young->gc.gc_next;
    /* ... code omitted ... */
            if (PyTuple_CheckExact(op)) {
                _PyTuple_MaybeUntrack(op);
            }
Note the call to PyTuple_CheckExact.
Also note that a subclass of tuple need not be immutable. This means that if you wanted to extend this mechanism outside tuple and dict you'd need a generic is_immutable function. This would be really expensive, if at all possible, due to Python's dynamism (e.g. methods of the class may change at runtime, while this is not possible for tuple because it is a built-in type). Hence the devs chose to special-case only a few well-known built-ins.
That said, I believe they could special-case namedtuples too, since they are pretty simple classes. There would be some issues: for example, when you call namedtuple you are creating a new class, hence the GC would have to check for a subclass.
And this might be a problem with code like:
class MyTuple(namedtuple('A', 'a b')):
    # whatever code you want
    pass
The MyTuple class need not be immutable, so the GC would have to check that the class is a direct subclass of namedtuple to be safe. However, I'm pretty sure there are workarounds for this situation.
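To make the concern concrete, here is a sketch (Python 3 syntax; the attribute name me is made up) of a namedtuple subclass instance that takes part in a reference cycle. That is exactly the case in which untracking it would be unsafe. The count returned by gc.collect() is reported by CPython's collector and is used here only as a rough indicator.

```python
import gc
from collections import namedtuple

class MyTuple(namedtuple("A", "a b")):
    # No __slots__ here, so instances get a __dict__ and can grow attributes.
    pass

gc.disable()                # keep automatic collections out of the experiment
gc.collect()                # start from a clean slate

t = MyTuple(1, 2)
t.me = t                    # the instance now refers to itself: a reference cycle
tracked = gc.is_tracked(t)
del t                       # unreachable, but the cycle keeps the refcount above zero

found = gc.collect()        # number of unreachable objects the collector found
gc.enable()

print(tracked, found >= 1)
```

Only the cyclic collector can reclaim this instance, so it has to stay tracked.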
They probably didn't because namedtuples are part of the standard library, not the python core. Maybe the devs didn't want to make the core dependent on a module of the standard library.
So, to answer your question:
No, there is nothing in their implementation that inherently prevents untracking for namedtuples
No, I believe they did not "simply overlook" this. However only python devs could give a clear answer to why they chose not to include them. My guess is that they didn't think it would provide a big enough benefit for the change and they didn't want to make the core dependent on the standard library.
@Bakunu gave an excellent answer - accept it :-)
A gloss here: No untracking gimmick is "free": there are real costs, in both runtime and explosion of tricky code to maintain. The base tuple and dict types are very heavily used, both by user programs and by the CPython implementation, and it's very often possible to untrack them. So special-casing them is worth some pain, and benefits "almost all" programs. While it's certainly possible to find examples of programs that would benefit from untracking namedtuples (or ...) too, it wouldn't benefit the CPython implementation or most user programs. But it would impose costs on all programs (more conditionals in the gc code to ask "is this a namedtuple?", etc).
Note that all container objects benefit from CPython's "generational" cyclic gc gimmicks: the more collections a given container survives, the less often that container is scanned (because the container is moved to an "older generation", which is scanned less often). So there's little potential gain unless a container type occurs in great numbers (often true of tuples, rarely true of dicts) or a container contains a great many objects (often true of dicts, rarely true of tuples).

Why do Python variables take a new address (id) every time they're modified?

Just wondering what the logic behind this one is? On the surface it seems kind of inefficient, that every time you do something simple like "x=x+1" that it has to take a new address and discard the old one.
The Python variable (called an identifier or name, in Python) is a reference to a value. The id() function tells you something about that value, not the name.
Many values are not mutable; integers, strings, floats all do not change in place. When you add 1 to another integer, you return a new integer that then replaces the reference to the old value.
You can look at Python names as labels, tied to values. If you imagine values as balloons, you are retying the label to a new balloon each time you assign to that name. If there are no other labels attached to a balloon anymore, it simply drifts away in the wind, never to be seen again. The id() function gives you a unique number for that balloon.
See this previous answer of mine where I talk a little bit more about that idea of values-as-balloons.
This may seem inefficient. For many often used and small values, Python actually uses a process called interning, where it will cache a stash of these values for re-use. None is such a value, as are small integers and the empty tuple (()). You can use the intern() function (sys.intern() in Python 3) to do the same with strings you expect to use a lot.
But note that values are only cleaned up when their reference count (the number of 'labels') drops to 0. Loads of values are reused all over the place all the time, especially those interned integers and singletons.
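The labels-and-balloons picture can be checked directly. A minimal sketch in Python 3 syntax, with made-up variable names: rebinding a name to the result of x + 1 leaves the old int object untouched, while an in-place change to a mutable list keeps the same object.

```python
x = 10**20                 # far outside CPython's small-integer cache
old = x                    # a second label on the same balloon
x = x + 1                  # builds a new int and moves the label x to it

print(x is old)            # False: x now names a different object
print(old == 10**20)       # True: the original value was never modified

items = [1, 2]
same = items
items.append(3)            # lists are mutable: changed in place, same object
print(items is same)       # True
```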
Because the basic types are immutable, every time you "modify" one, a new object has to be instantiated
...which is perfectly fine, especially for thread-safe functions
The = operator doesn't modify an object, it assigns the name to a completely different object, which may or may not already have an id.
For your example, integers are immutable; there's no way to add something to one and keep the same id.
And, in fact, small integers are interned, at least in CPython, so if you do:
x = 1
y = 2
x = x + 1
Then x and y may have the same id.
In Python, "primitive" types like ints and strings are immutable, which means they cannot be modified.
Python is actually quite efficient, because, as @Wooble commented, «Very short strings and small integers are interned.»: if two variables reference the same (small) immutable value, their id is the same (reducing duplicated immutables).
>>> a = 42
>>> b = 5
>>> id(a) == id(b)
False
>>> b += 37
>>> id(a) == id(b)
True
One reason for using immutable types is that they allow safe concurrent access to those values.
At the end of the day, it comes down to a design choice.
Depending on your needs, one implementation may serve you better than another.
For instance, a different philosophy can be found in a somewhat similar language, Ruby, where some of the types that are immutable in Python (strings, for example) are mutable.
To be accurate, the assignment x=x+1 doesn't modify the object that x is referencing; it just makes x point to another object whose value is x+1.
To understand the logic behind this, one needs to understand the difference between value semantics and reference semantics.
For an object with value semantics, only its value matters, not its identity, while an object with reference semantics is defined by its identity (in Python, identity is what id(obj) returns).
Typically, value semantics implies immutability of the object. Conversely, if an object is mutable (i.e. can be changed in place), it has reference semantics.
Let's briefly explain the rationale behind this immutability.
Objects with reference semantics can be changed in-place without losing their original addresses/identities. This makes sense in that it's the identity of an object with reference semantics that makes itself distinguishable from other objects.
In contrast, an object with value-semantics should never change itself.
First, this is possible and reasonable in theory. Since only the value (not the identity) is significant, when a change is needed it's safe to swap in another object with a different value. This is called referential transparency. Note that this is impossible for objects with reference semantics.
Secondly, this is beneficial in practice. As the OP thought, it seems inefficient to discard the old object each time it's changed, but most of the time it's more efficient than the alternative. For one thing, Python (like many other languages) has interning/caching schemes so that fewer objects are created. What's more, if objects with value semantics were designed to be mutable, they would take much more space in most cases.
For example, Date has value semantics. If it's designed to be mutable, any method that returns a date from an internal field exposes the handle to the outside world, which is risky (e.g. outside code can directly modify this internal field without going through the public interface). Similarly, if one passes a date object by reference to some function/method, this object could be modified in that function/method, which may not be expected. To avoid these kinds of side effects, one has to program defensively: instead of directly returning the inner date field, one returns a clone of it; instead of passing by reference, one passes by value, which means extra copies are made. As one could imagine, this is likely to create more objects than necessary. What's worse, code becomes more complicated with all this extra cloning.
In a word, immutability enforces value semantics: it usually involves less object creation, has fewer side effects and less hassle, and is more test-friendly. Besides, immutable objects are inherently thread-safe, which means fewer locks and better efficiency in a multithreaded environment.
That's the reason why basic data types with value semantics like number, string, date, and time are all immutable (well, string in C++ is an exception; that's why there are so many const string& parameters, to avoid the string being modified unexpectedly). As a lesson, Java made mistakes in designing the value-semantic classes Date, Point, Rectangle, and Dimension as mutable.
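The defensive-copy argument can be sketched in Python with a frozen dataclass. Point and Shape here are made-up classes for illustration (they are not the Java classes named above): because the value object cannot be changed in place, an internal field can be handed out directly, and "changing" it means building a new value.

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class Point:
    """Hypothetical immutable value type used only for this sketch."""
    x: int
    y: int

class Shape:
    def __init__(self, origin):
        self._origin = origin

    @property
    def origin(self):
        # Safe to hand out directly: callers cannot modify a frozen Point,
        # so no defensive copy is needed.
        return self._origin

shape = Shape(Point(1, 2))
p = shape.origin

try:
    p.x = 99                         # attempted in-place change
    mutated = True
except FrozenInstanceError:
    mutated = False

moved = replace(p, x=99)             # "changing" a value means making a new one
print(mutated, shape.origin, moved)
```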
As we know, objects in OOP have three characteristics: state, behavior and identity. Objects with value semantics are not typical objects in that their identities do not matter at all. Usually they are passive, and mostly used to describe other real, active objects(i.e. those with reference semantics). This is a good hint to distinguish between value semantics and reference semantics.

Python: Do (explicit) string parameters hurt performance?

Suppose some function that always gets some parameter s that it does not use.
def someFunc(s):
    # do something _not_ using s, for example
    a=1
Now consider this call:
someFunc("the unused string")
which passes a string as a parameter that is not built at runtime but compiled straight into the binary (I hope that's right).
The question is: when calling someFunc this way, say, several thousand times, the reference to "the unused string" is always passed, but does that slow the program down?
In my naive view, the reference to "the unused string" is 'constant' and available in O(1) when a call to someFunc occurs. So I'd say 'no, that does not hurt performance'.
Same question as before: "Am I right?"
thanks for some :-)
The string is passed (by reference) each time, but the overhead is way too tiny to really affect performance unless it's in a super-tight loop.
This is an implementation detail of CPython and may not apply to other Pythons, but yes: in many cases in a compiled module, a constant string will reference the same object, minimizing the overhead.
In general, even if it didn't, you really shouldn't worry about it, as it's probably imperceptibly tiny compared to other things going on.
However, here's a little interesting piece of code:
>>> def somefunc(x):
...     print id(x) # prints the memory address of object pointed to by x
...
>>>
>>> def test():
...     somefunc("hello")
...
>>> test()
134900896
>>> test()
134900896 # Hooray, like expected, it's the same object id
>>> somefunc("h" + "ello")
134900896 # Whoa, how'd that work?
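What's happening here is that Python keeps a global table of interned strings, and in many cases, even when you concatenate two string literals, you will get the same object if the values match up (in CPython, the compiler folds "h" + "ello" into the single constant "hello" before the code ever runs).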
Note that this is an implementation detail, and you should NOT rely on it, as strings from any of: files, sockets, databases, string slicing, regex, or really any C module are not guaranteed to have this property. But it is interesting nonetheless.
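One way to see where the constant lives, as a sketch in Python 3 syntax: in CPython, string literals are stored in the code object's co_consts tuple, so each call passes a reference to the same already-built object. The function names are made up, and both co_consts and compile-time folding are CPython implementation details, not language guarantees.

```python
seen = []

def some_func(s):
    seen.append(s)             # record the object that was actually passed

def caller():
    some_func("the unused string")

def folded_caller():
    some_func("h" + "ello")    # two literals joined in the source

caller()
caller()
print(seen[0] is seen[1])                                  # same object on both calls
print("the unused string" in caller.__code__.co_consts)    # stored once with the code
print("hello" in folded_caller.__code__.co_consts)         # folded at compile time
```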

The advantages of having static function like len(), max(), and min() over inherited method calls

I am a Python newbie, and I am not sure why Python implemented len(obj), max(obj), and min(obj) as static-like functions (I come from Java) rather than obj.len(), obj.max(), and obj.min().
What are the advantages and disadvantages (other than the obvious inconsistency) of having len()... over the method calls?
Why did Guido choose this over method calls? (This could have been changed in Python 3 if needed, but it wasn't, so there have to be good reasons... I hope.)
Thanks!!
The big advantage is that built-in functions (and operators) can apply extra logic when appropriate, beyond simply calling the special methods. For example, min can look at several arguments and apply the appropriate inequality checks, or it can accept a single iterable argument and proceed similarly; abs when called on an object without a special method __abs__ could try comparing said object with 0 and using the object's change-sign method if needed (though it currently doesn't); and so forth.
So, for consistency, all operations with wide applicability must always go through built-ins and/or operators, and it's those built-ins' responsibility to look up and apply the appropriate special methods (on one or more of the arguments), use alternate logic where applicable, and so forth.
An example where this principle wasn't correctly applied (but the inconsistency was fixed in Python 3) is "step an iterator forward": in 2.5 and earlier, you needed to define and call the non-specially-named next method on the iterator. In 2.6 and later you can do it the right way: the new next built-in calls the iterator's method (still named next in 2.x, __next__ in 3.*) and can apply extra logic, for example to supply a default value (in 2.6 you can still call the method directly the bad old way, for backwards compatibility, though in 3.* you can't any more).
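A small sketch of that extra logic in Python 3 syntax. Countdown is a made-up iterator class; the point is that the default value is supplied by the next built-in, not by the iterator's own __next__ method.

```python
class Countdown:
    """Hypothetical iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

it = Countdown(2)
print(next(it))            # 2
print(next(it))            # 1
print(next(it, "done"))    # done: the built-in supplies the default itself
```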
Another example: consider the expression x + y. In a traditional object-oriented language (able to dispatch only on the type of the leftmost argument -- like Python, Ruby, Java, C++, C#, &c) if x is of some built-in type and y is of your own fancy new type, you're sadly out of luck if the language insists on delegating all the logic to the method of type(x) that implements addition (assuming the language allows operator overloading;-).
In Python, the + operator (and similarly of course the builtin operator.add, if that's what you prefer) tries x's type's __add__, and if that one doesn't know what to do with y, then tries y's type's __radd__. So you can define your types that know how to add themselves to integers, floats, complex, etc etc, as well as ones that know how to add such built-in numeric types to themselves (i.e., you can code it so that x + y and y + x both work fine, when y is an instance of your fancy new type and x is an instance of some builtin numeric type).
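A minimal sketch of that dispatch, in Python 3 syntax. Metres is a made-up type: returning NotImplemented from __add__ is what tells the + operator to go on and try the other operand's __radd__.

```python
class Metres:
    """Hypothetical numeric-like type used only for this sketch."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        # Handles Metres + Metres and Metres + number.
        if isinstance(other, Metres):
            return Metres(self.value + other.value)
        if isinstance(other, (int, float)):
            return Metres(self.value + other)
        return NotImplemented          # let Python try the other operand

    def __radd__(self, other):
        # Called for number + Metres after int.__add__ returns NotImplemented.
        return self.__add__(other)

    def __repr__(self):
        return f"Metres({self.value})"

print(Metres(2) + 3)       # Metres(5)
print(3 + Metres(2))       # Metres(5)
```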
"Generic functions" (as in PEAK) are a more elegant approach (allowing any overriding based on a combination of types, never with the crazy monomaniac focus on the leftmost arguments that OOP encourages!-), but (a) they were unfortunately not accepted for Python 3, and (b) they do of course require the generic function to be expressed as free-standing (it would be absolutely crazy to have to consider the function as "belonging" to any single type, where the whole POINT is that can be differently overridden/overloaded based on arbitrary combination of its several arguments' types!-). Anybody who's ever programmed in Common Lisp, Dylan, or PEAK, knows what I'm talking about;-).
So, free-standing functions and operators are just THE right, consistent way to go (even though the lack of generic functions, in bare-bones Python, does remove some fraction of the inherent elegance, it's still a reasonable mix of elegance and practicality!-).
It emphasizes the capabilities of an object, not its methods or type. Capabilities are declared by "helper" functions such as __iter__ and __len__ but they don't make up the interface. The interface is in the built-in functions, and beside this also in the built-in operators like + and [] for indexing and slicing.
Sometimes, it is not a one-to-one correspondence: for example, iter(obj) returns an iterator for an object, and will work even if __iter__ is not defined. If it is not defined, iter goes on to check whether the object defines __getitem__ and, if so, returns an iterator that accesses the object index-wise (like an array).
This goes together with Python's duck typing: we care only about what we can do with an object, not whether it is of a particular type.
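A small sketch of that fallback, using a hypothetical Squares class that defines only __getitem__ and no __iter__. The iterator calls __getitem__ with 0, 1, 2, ... and stops when IndexError is raised.

```python
class Squares(object):
    """Hypothetical sequence-like class with no __iter__ method."""

    def __init__(self, count):
        self.count = count

    def __getitem__(self, index):
        # Raising IndexError is what tells the fallback iterator to stop.
        if index < 0 or index >= self.count:
            raise IndexError(index)
        return index * index


squares = Squares(4)
has_iter_method = hasattr(squares, "__iter__")  # False: nothing defines it
values = list(iter(squares))                    # still works via __getitem__
```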
Actually, those aren't "static" methods in the way you are thinking about them. They are built-in functions that really just alias to certain methods on python objects that implement them.
>>> class Foo(object):
...     def __len__(self):
...         return 42
...
>>> f = Foo()
>>> len(f)
42
These are always available to be called, whether or not the object implements them. The point is to have some consistency. Instead of one class having a method called length() and another called size(), the convention is to implement __len__ and let callers always use the more readable len(obj) instead of obj.methodThatDoesSomethingCommon.
I thought the reason was so these basic operations could be done on iterators with the same interface as containers. However, it actually doesn't work with len:
def foo():
    for i in range(10):
        yield i
print len(foo())
... fails with TypeError. len() won't consume and count an iterator; it only works with objects that have a __len__ method.
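For completeness, a short sketch of the failure and of one common way to count the items anyway, by consuming the generator. The variable names are made up for the example, and counting this way exhausts the iterator.

```python
def foo():
    for i in range(10):
        yield i


try:
    len(foo())
    len_worked = True
except TypeError:
    # Generators have no __len__, so len() refuses them.
    len_worked = False

# Counting by consumption: walks the whole generator once.
count = sum(1 for _ in foo())
```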
So, as far as I'm concerned, len() shouldn't exist. It's much more natural to say obj.len than len(obj), and much more consistent with the rest of the language and the standard library. We don't say append(lst, 1); we say lst.append(1). Having a separate global function for length is an odd, inconsistent special case, and eats a very obvious name in the global namespace, which is a very bad habit of Python.
This is unrelated to duck typing; you can say getattr(obj, "len") to decide whether you can use len on an object just as easily as, and much more consistently than, you can use getattr(obj, "__len__").
All that said, as language warts go--for those who consider this a wart--this is a very easy one to live with.
On the other hand, min and max do work on iterators, which gives them a use apart from any particular object. This is straightforward, so I'll just give an example:
import random
def foo():
    for i in range(10):
        yield random.randint(0, 100)
print max(foo())
However, there are no __min__ or __max__ methods to override their behavior, so there's no consistent way to provide efficient searching for sorted containers. If a container is sorted on the same key that you're searching, min/max are O(1) operations instead of O(n), and the only way to expose that is by a different, inconsistent method. (This could be fixed in the language relatively easily, of course.)
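To make the complaint concrete, here is a sketch with a hypothetical SortedBag container (the class and its smallest/largest method names are invented for this example). The built-ins still give the right answer, but only by scanning every element, while the O(1) shortcut has to live under container-specific names.

```python
import bisect


class SortedBag(object):
    """Hypothetical container that keeps its items in sorted order."""

    def __init__(self, items=()):
        self._items = sorted(items)

    def add(self, item):
        bisect.insort(self._items, item)  # insert while keeping order

    def __iter__(self):
        return iter(self._items)

    def smallest(self):
        return self._items[0]   # O(1); raises IndexError if empty

    def largest(self):
        return self._items[-1]  # O(1); raises IndexError if empty


bag = SortedBag([5, 1, 9])
bag.add(3)

slow_min = min(bag)        # iterates over all items, O(n)
fast_min = bag.smallest()  # same answer without the scan
```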
To follow up with another issue with this: it prevents use of Python's method binding. As a simple, contrived example, you can do this to supply a function to add values to a list:
def add(f):
    f(1)
    f(2)
    f(3)
lst = []
add(lst.append)
print lst
and this works on all member functions. You can't do that with min, max or len, though, since they're not methods of the object they operate on. Instead, you have to resort to functools.partial, a clumsy second-class workaround common in other languages.
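A brief sketch of the contrast, reusing the add function from above. The names get_length and get_max are made up for the example.

```python
import functools


def add(f):
    f(1)
    f(2)
    f(3)


lst = []
add(lst.append)  # a bound method carries its object along

# len and max are not attributes of lst, so there is nothing to bind;
# functools.partial fixes the argument explicitly instead.
get_length = functools.partial(len, lst)
get_max = functools.partial(max, lst)
```

Calling get_length() or get_max() later evaluates against the current contents of lst, much as a bound method would.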
Of course, this is an uncommon case; but it's the uncommon cases that tell us about a language's consistency.
