When I read about Python's C3 method resolution order, I often hear it reduced to "children come before parents, and the order of subclasses is respected". Yet that only seems to hold true if all the subclasses inherit from the same ancestor.
E.g.
class X():
    def FOO(self):
        return 11

class A(X):
    def doIt(self):
        return super().FOO()

    def FOO(self):
        return 42

class B(X):
    def doIt(self):
        return super().FOO()

    def FOO(self):
        return 52

class KID(A, B):
    pass
Here the MRO of KID is:
KID, A, B, X
However, if I changed B to be instead:
class B(object):
The MRO of KID becomes:
KID, A, X, B
It seems we are searching A's superclass before we have finished searching all KID's parents.
So the rule now seems less intuitive: not simply "kids first, breadth first", but "kids first, breadth first if there is a common ancestor, otherwise depth first".
It would be quite the gotcha if a class stopped using a common ancestor and the MRO changed (even though the overall hierarchy is the same apart from that one link), so that you started calling a deeper ancestor's method rather than the one in that class.
All classes in Python 3 have a common base class, object. You can omit the class from the class definition, but it is there unless you already indirectly inherit from object. (In Python 2 you have to explicitly inherit from object to even have the use of super() as this is a new-style class feature).
You changed the base class of B from X to object, but X also inherits from object. The MRO changed to take this into account. The same simplification of the C3 rules (children come before parents, and the order of subclasses is respected) is still applicable here. B comes before object, as does X, and A and B are still listed in the same order. However, X should come before B, as both inherit from object and the subclass A(X) comes before B in KID.
Note that nowhere is it said that C3 is breadth-first. If anything, it is depth-first. See The Python 2.3 Method Resolution Order for an in-depth description of the algorithm and how it applies to Python. The linearisation of any class is the result of merging the linearisations of the base classes plus the base classes themselves:
L[KID] = KID + merge(L[A], L[B], (A, B))
where L[..] is the C3 linearisation of that class (its MRO).
So the linearisation of A comes before B when merging, making C3 look at hierarchies in depth rather than in breadth. Merging repeatedly takes the head of the left-most list, provided that head doesn't appear in the tail of any other list (the tail being everything but the first element); otherwise it moves on to the head of the next list, and so on.
In your first example, L[A] and L[B] are almost the same (they both end in (X, object) as their MRO, with only the first element differing), so merging is simple; you merge (A, X, object) and (B, X, object), and merging these gives you only A from the first list, then the whole second list, ending up with (KID, A, B, X, object) after prepending KID:
L[KID] = KID + merge((A, X, object), (B, X, object), (A, B))
#                        ^  ^^^^^^
#                        both removed as they appear in the next list
       = KID + (A,) + (B, X, object)
       = (KID, A, B, X, object)
In your second example, L[A] is unchanged, but L[B] is now (B, object) (dropping X), so merging prefers X before B as (A, X, object) comes first when merging and X doesn't appear in the second list. Thus
L[KID] = KID + merge((A, X, object), (B, object), (A, B))
#                           ^^^^^^
#                           removed as it appears in the next list
       = KID + (A, X) + (B, object)
       = (KID, A, X, B, object)
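If you want to confirm the two orders without working through the merge by hand, you can ask Python directly; a minimal sketch with the method bodies stripped out (names as in the question):

class X: pass
class A(X): pass

# Variant 1: B shares the ancestor X
class B(X): pass
class KID(A, B): pass
print([c.__name__ for c in KID.__mro__])   # ['KID', 'A', 'B', 'X', 'object']

# Variant 2: B inherits directly from object
class B(object): pass
class KID(A, B): pass
print([c.__name__ for c in KID.__mro__])   # ['KID', 'A', 'X', 'B', 'object']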
Related
I'm trying to get a better understanding of MRO in Python & came across this example:
class A:
    def process(self):
        print('A process()')

class B(A):
    pass

class C(A):
    def process(self):
        print('C process()')

class D(B, C):
    pass
obj = D()
obj.process()
which prints "C process()". I understand why: the order goes D > B > C > A. But when class C doesn't inherit from A, "A process()" is printed instead and the order shifts to D > B > A > C. What causes the order to shift here? Why isn't the C superclass reached before the A class now?
The C3 linearization algorithm is somewhat depth-first, so A, being reachable from B (which is listed before C in the base class list), is added before C.
The rationale is that D is more "B-like" than "C-like", so anything that is part of "B" should appear before "C".
(For fun, see what happens if you try something like class D(B, A, C) when C still inherits from A.)
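For reference, that experiment fails at class-creation time; a minimal sketch (the exact error wording may vary between Python versions):

class A:
    def process(self):
        print('A process()')

class B(A):
    pass

class C(A):
    def process(self):
        print('C process()')

try:
    # A is listed before its own subclass C, so no ordering can satisfy both
    # "children before parents" and the left-to-right order of the bases.
    class D(B, A, C):
        pass
except TypeError as exc:
    print(exc)   # Cannot create a consistent method resolution order (MRO) ...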
I have this code, showing a classic diamond pattern:
class A:
    def __init__( self, x ):
        print( "A:" + x )

class B( A ):
    def __init__( self, x ):
        print( "B:" + x )
        super().__init__( "b" )

class C( A ):
    def __init__( self, x ):
        print( "C:" + x )
        super().__init__( "c" )

class D( B, C ):
    def __init__( self ):
        super().__init__( "d" )

d = D()
The output is:
B:d
C:b
A:c
B:d makes sense, since D derives from B.
The A:c I almost get, though I could equally see A:b.
However, the C:b bit doesn't make sense: C does not derive from B.
Could someone explain?
Questions such as this unfortunately do not mention the parameters.
Python uses the C3 linearization algorithm to establish the method resolution order, which is the same order that super delegates in.
Basically, the algorithm keeps lists for every class containing that class and every class it inherits from, for all classes that the class in question inherits from. It then constructs an ordering of classes by taking classes that aren't inherited by any unexamined classes one by one, until it reaches the root, object. Below, I use O for object for brevity:
L(O) = [O]
L(A) = [A] + merge(L(O), [O]) = [A, O]
L(B) = [B] + merge(L(A), [A]) = [B] + merge([A, O], [A]) = [B, A] + merge([O])
     = [B, A, O]
L(C) = [C] + merge(L(A), [A]) = [C] + merge([A, O], [A]) = [C, A] + merge([O])
     = [C, A, O]
L(D) = [D] + merge(L(B), L(C), [B, C]) = [D] + merge([B, A, O], [C, A, O], [B, C])
     = [D, B] + merge([A, O], [C, A, O], [C]) = [D, B, C] + merge([A, O], [A, O])
     = [D, B, C, A, O]
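As a quick check (assuming the A, B, C and D classes from the question are defined), the hand computation matches what Python reports, and each super().__init__ simply moves to the next class in that list:

print([cls.__name__ for cls in D.mro()])
# ['D', 'B', 'C', 'A', 'object']
# D's super().__init__("d") runs B.__init__, printing "B:d";
# B's super().__init__("b") runs C.__init__ (the next class in D's MRO), printing "C:b";
# C's super().__init__("c") runs A.__init__, printing "A:c".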
Classes in Python are dynamically composed - that includes inheritance.
The C:b output does not imply that B magically inherits from C. If you instantiate either B or C, neither knows about the other.
>>> B('root')
B:root
A:b
However, D does know about both B and C:
class D(B,C):
...
There are a lot of technicalities to this. However, there are basically two parts to how it works:
Direct Base Classes are resolved in the order they appear.
B comes before C.
Recursive Base Classes are resolved so that they are not duplicated.
A Base Class of both B and C must follow both of them.
For the class D, that means the base classes resolve as B->C->A! C has sneaked in between B and A - but only for class D, not for class B.
Note that there is actually another class involved: all classes derive from object by default.
>>> D.__mro__
(__main__.D, __main__.B, __main__.C, __main__.A, object)
You have already written A knowing that there is no base to take its parameters. However, neither B nor C can assume this. They both expect to derive from an A object. Subclassing does imply that both B and C are valid A-objects as well, though!
It is valid for both B and C to precede A, since the two are subclasses of A. B->C->A->object does not break B's expectation that its superclass is of type A.
With the other combinations, one ends up either with object preceding another class (invalid) or with duplicated classes. That rules out the depth-first resolution B->A->object->C as well as the duplicating B->A->object->C->A->object.
This method resolution order is practical to enable mixins: classes that rely on other classes to define how methods are resolved.
There is a nice example of how a logger for dictionary access can accept both dict and OrderedDict.
import collections
import logging

# basic Logger working on ``dict``
class LoggingDict(dict):
    def __setitem__(self, key, value):
        logging.info('Setting %r to %r' % (key, value))
        super().__setitem__(key, value)

# mixin of different ``dict`` subclass
class LoggingOD(LoggingDict, collections.OrderedDict):
    pass
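A small usage sketch: the MRO is what makes this work, because LoggingDict's super().__setitem__ resolves to OrderedDict (not plain dict) when called on a LoggingOD instance:

logging.basicConfig(level=logging.INFO)

d = LoggingOD()
d['b'] = 2
d['a'] = 1          # each assignment is logged before being stored
print(list(d))      # ['b', 'a'] -- insertion order kept by OrderedDict
print([c.__name__ for c in LoggingOD.__mro__])
# ['LoggingOD', 'LoggingDict', 'OrderedDict', 'dict', 'object']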
You can always check the method resolution order that any class should have:
>>> D.mro()
[__main__.D, __main__.B, __main__.C, __main__.A, object]
As you can see, if everybody is doing the right thing (i.e. calling super), lookups follow the MRO: 1st parent, 2nd parent, then the parents' parent, and so on...
You can roughly think of it as depth-first, left to right, to find the order; the algorithm changed in Python 2.3, but the outcome is usually the same.
In this case B and C have the same parent A, and A doesn't call super.
I just want to be able to unpack the instance variables of class foo, for example:
x = foo("name", "999", "24", "0.222")
a, b, c, d = *x
a, b, c, d = [*x]
I am not sure as to which is the correct method for doing so when implementing my own __iter__ method, however, the latter is the one that has worked with mixed "success". I say mixed because doing so with the presented code appears to alter the original instance object x, such that it is no longer valid.
class foo:
    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d

    def __iter__(self):
        return iter([a, b, c, d])
I have read the myriad posts on this site regarding __iter__, __next__, generators etc., and also a Python book and docs.python.org, and seem unable to figure out what I am not understanding. I've gathered that __iter__ needs to return an iterable (which can just be self, but I am not sure how that works for what I want). I've also tried various ways of playing around with implementing __next__ and iterating over vars(foo).items(), either by casting to a list or as a dictionary, with no success.
I don't believe this is a duplicate post on account that the only similar questions I've seen present a single list sequence object attribute or employ a range of numbers instead of a four non-container variables.
If you want the instance's variables, you should access them through self:
def __iter__(self):
    return iter([self.a, self.b, self.c, self.d])
with this change,
a, b, c, d = list(x)
will get you the variables.
You could go for the riskier method of using vars(x) or x.__dict__, sorting it by the variable names (which is also what makes it limited: the attributes are not stored in a guaranteed useful order), and extracting the second element of each item. But I would say the iterator is definitely better.
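With the corrected __iter__ in place, the unpacking from the question works directly on the instance (a small sketch assuming the foo class above):

x = foo("name", "999", "24", "0.222")
a, b, c, d = x        # unpacking consumes the iterator returned by __iter__
a, b, c, d = [*x]     # also works, via an intermediate list
print(a, b, c, d)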
You can store the arguments in an attribute (self.e below) or return them on function call:
class foo:
    def __init__(self, *args):
        self.a, self.b, self.c, self.d = self.e = args

    def __call__(self):
        return self.e

x = foo("name", "999", "24", "0.222")
a, b, c, d = x.e
# or
a, b, c, d = x()
class A(object):
    def a(self, b=1):
        print 'Up'

    d = {1 : a}

    def b( self ):
        print self.d[1]
        print self.b
        print self.d[1].__get__( self, A )()
        # print self.d[1]()

class B( object ):
    def a( self ):
        print 'here??'
        return 10000

    d = {1 : a}

    def b( self ):
        print 'hurray'

o = A()
o.b()
b = B()
type( o ).__dict__['b'].__get__( b, type( b ) )()
Hi Folks,
I was going through Python: Bind an Unbound Method? and http://users.rcn.com/python/download/Descriptor.htm and was trying to experiment as part of my learning.
But, I have hit some new doubts now:-
In the last line of my code, I'm able to use __get__ with b object and instance: type(b). This only works if method b is defined in class B. Why is it so?
Even though the last line requires me to provide a method b in class B, still the method b in class A gets called. Why is it so?
To my utter surprise, after the above step, I notice that the method a of class A is not called by the code of method b of class A; instead, it calls the method a of class B. Why is it so?
I'm quite confused after seeing this behaviour. I might also need to learn more on descriptors. But, it would be a great help if you could answer my doubts
In the last line of my code, I'm able to use __get__ with b object and instance: type(b). This only works if method b is defined in class B. Why is it so?
You have to define a method b in class B, because in A.b you have print self.b. Here, self is an instance of the B class, so self.b means "the b method belonging to this B", not "the b method belonging to the class that this method exists in". If you delete print self.b, then the code will work even if B has no b.
Even though the last line requires me to provide a method b in class B, still the method b in class A gets called. Why is it so?
A.b is being called because you are explicitly accessing it with type( o ).__dict__['b']. Whether you bind that method to an A instance or a B instance doesn't matter; it's still A.b.
To my utter surprise, after the above step, I notice that the method a of class A is not called by the code of method b of class A; instead, it calls the method a of class B. Why is it so?
Even though b belongs to the class A, the self you pass to it is still an instance of the B class. Any attributes you access on that self will be B attributes, and any methods you call on it will be B methods.
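A minimal sketch of the same mechanism with simpler, made-up names (Python 3 syntax): the function is looked up in one class's __dict__, __get__ binds it to whatever instance you hand it, and every attribute lookup on self then resolves against that instance's class.

class A:
    def show(self):
        # self.a resolves on the instance's class, whatever that is
        return self.a()

    def a(self):
        return "A.a"

class B:
    def a(self):
        return "B.a"

plain = A.__dict__['show']        # the raw function object, not yet bound
bound = plain.__get__(B(), B)     # manually bind A's function to a B instance
print(bound())                    # prints "B.a" -- the code is A's, but self is a B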
What's a correct and good way to implement __hash__()?
I am talking about the function that returns a hashcode that is then used to insert objects into hashtables aka dictionaries.
As __hash__() returns an integer and is used for "binning" objects into hashtables I assume that the values of the returned integer should be uniformly distributed for common data (to minimize collisions).
What's a good practice to get such values? Are collisions a problem?
In my case I have a small class which acts as a container class holding some ints, some floats and a string.
An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.
Here's an example of using a key for hash and equality:
class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented
Also, the documentation of __hash__ has more information that may be valuable in some particular circumstances.
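A quick usage sketch (the snippet above has no __init__, so the attributes are set by hand here): objects that compare equal hash equal, which makes them interchangeable as dict keys or set members.

x = A(); x.attr_a, x.attr_b, x.attr_c = 1, 2.5, 'spam'
y = A(); y.attr_a, y.attr_b, y.attr_c = 1, 2.5, 'spam'

assert x == y and hash(x) == hash(y)
assert {x: 'value'}[y] == 'value'   # y finds the entry stored under x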
John Millikin proposed a solution similar to this:
class A(object):
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        return (isinstance(othr, type(self))
                and (self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))

    def __hash__(self):
        return hash((self._a, self._b, self._c))
The problem with this solution is that hash(A(a, b, c)) == hash((a, b, c)). In other words, the hash collides with that of the tuple of its key members. Maybe this does not matter very often in practice?
Update: the Python docs now recommend using a tuple as in the example above. Note that the documentation states
The only required property is that objects which compare equal have the same hash value
Note that the opposite is not true. Objects which do not compare equal may have the same hash value. Such a hash collision will not cause one object to replace another when used as a dict key or set element as long as the objects do not also compare equal.
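A tiny demonstration of both points, assuming the A class defined just above:

a_obj = A(1, 2, 3)
print(hash(a_obj) == hash((1, 2, 3)))   # True -- the object collides with the plain tuple
d = {a_obj: 'object', (1, 2, 3): 'tuple'}
print(len(d))                           # 2 -- the collision is harmless, they don't compare equal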
Outdated/bad solution
The Python documentation on __hash__ suggests combining the hashes of the sub-components using something like XOR, which gives us this:
class B(object):
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        if isinstance(othr, type(self)):
            return ((self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
        return NotImplemented

    def __hash__(self):
        return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
                hash((self._a, self._b, self._c)))
Update: as Blckknght points out, changing the order of a, b, and c could cause problems. I added an additional ^ hash((self._a, self._b, self._c)) to capture the order of the values being hashed. This final ^ hash(...) can be removed if the values being combined cannot be rearranged (for example, if they have different types and therefore the value of _a will never be assigned to _b or _c, etc.).
Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that
for c in some_string:
    hash = 101 * hash + ord(c)
worked surprisingly well for a wide variety of strings. I've found that similar polynomial techniques work well for computing a hash of disparate subfields.
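A rough Python sketch of the same polynomial idea, applied to several fields of different types (purely an illustration, not what Python's built-in hash does; the 64-bit mask is an arbitrary choice to keep the value bounded):

def poly_hash(*fields, multiplier=101):
    h = 0
    for field in fields:
        # fold each field's own hash into the running polynomial
        h = (multiplier * h + hash(field)) & 0xFFFFFFFFFFFFFFFF
    return h

print(poly_hash(42, 2.5, "spam"))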
A good way to implement hash (as well as list, dict, tuple) is to make the object have a predictable order of items by making it iterable using __iter__. So to modify an example from above:
class A:
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __iter__(self):
        yield "a", self._a
        yield "b", self._b
        yield "c", self._c

    def __hash__(self):
        return hash(tuple(self))

    def __eq__(self, other):
        return (isinstance(other, type(self))
                and tuple(self) == tuple(other))
(here __eq__ is not required for hash, but it's easy to implement).
Now add some members to see how it works:
a = 2; b = 2.2; c = 'cat'
hash(A(a, b, c)) # -5279839567404192660
dict(A(a, b, c)) # {'a': 2, 'b': 2.2, 'c': 'cat'}
list(A(a, b, c)) # [('a', 2), ('b', 2.2), ('c', 'cat')]
tuple(A(a, b, c)) # (('a', 2), ('b', 2.2), ('c', 'cat'))
Things only fall apart if you try to put non-hashable members in the object:
hash(A(a, b, [1])) # TypeError: unhashable type: 'list'
I can try to answer the second part of your question.
The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you'll get collisions on insertion.
In addition, I would think that collisions would be resolved by the collection internally, and there are many methods to resolve collisions. The simplest (and worst) is, given an entry to insert at index i, add 1 to i until you find an empty spot and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find!
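As an illustration only, here is a toy open-addressing table using the linear probing just described (Python's own dict uses a more elaborate probing scheme; this sketch also ignores deletions and a full table):

class ProbingTable:
    def __init__(self, size=32):
        self.slots = [None] * size

    def insert(self, key, value):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)   # step forward until a free (or matching) slot
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)

t = ProbingTable()
t.insert('spam', 1)
t.insert('eggs', 2)
print(t.get('spam'), t.get('eggs'))   # 1 2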
Other collision resolution methods reduce the retrieval time by moving entries around in the hash table when an item is inserted, to spread things out. This increases the insertion time but assumes you read more than you insert. There are also methods that try to branch different colliding entries out so that entries don't cluster in one particular spot.
Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.
In short, depending on what you're using the hash code for you may have to implement your own collision resolution method. If you're not storing them in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If so, you can make sure your container is bigger than it needs to be (the bigger the better of course) depending on your memory concerns.
Here are some links if you're interested more:
coalesced hashing on wikipedia
Wikipedia also has a summary of various collision resolution methods.
Also, "File Organization And Processing" by Tharp covers a lot of collision resolution methods extensively. IMO it's a great reference for hashing algorithms.
A very good explanation of when and how to implement the __hash__ function is on the programiz website:
[Screenshot overview from the programiz page omitted; retrieved 2019-12-13.]
As for a personal implementation of the method, the above-mentioned site provides an example that matches the answer of millerdev.
class Person:
    def __init__(self, age, name):
        self.age = age
        self.name = name

    def __eq__(self, other):
        return self.age == other.age and self.name == other.name

    def __hash__(self):
        print('The hash is:')
        return hash((self.age, self.name))
person = Person(23, 'Adam')
print(hash(person))
It depends on the size of the hash value you return. It's simple logic: if you need to return a 32-bit int based on the hash of four 32-bit ints, you're going to get collisions.
I would favor bit operations, like the following C pseudo code:
int a;
int b;
int c;
int d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0) | (d & 0x000F000F);
Such a system could work for floats too, and perhaps better, if you simply took their bit pattern rather than the floating-point value they represent.
For strings, I've got little/no idea.
@dataclass(frozen=True) (Python 3.7)
This awesome new feature, among other good things, automatically defines a __hash__ and __eq__ method for you, making it just work as usually expected in dicts and sets:
dataclass_cheat.py
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class MyClass1:
    n: int
    s: str

@dataclass(frozen=True)
class MyClass2:
    n: int
    my_class_1: MyClass1

d = {}
d[MyClass1(n=1, s='a')] = 1
d[MyClass1(n=2, s='a')] = 2
d[MyClass1(n=2, s='b')] = 3
d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] = 4
d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] = 5
d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] = 6
assert d[MyClass1(n=1, s='a')] == 1
assert d[MyClass1(n=2, s='a')] == 2
assert d[MyClass1(n=2, s='b')] == 3
assert d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] == 4
assert d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] == 5
assert d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] == 6

# Due to `frozen=True`
o = MyClass1(n=1, s='a')
try:
    o.n = 2
except FrozenInstanceError as e:
    pass
else:
    raise AssertionError('expected FrozenInstanceError')
As we can see in this example, the hashes are being calculated based on the contents of the objects, and not simply on the addresses of instances. This is why something like:
d = {}
d[MyClass1(n=1, s='a')] = 1
assert d[MyClass1(n=1, s='a')] == 1
works even though the second MyClass1(n=1, s='a') is a completely different instance from the first with a different address.
frozen=True is mandatory; otherwise the class is not hashable. Allowing mutation would make it possible for users to inadvertently make containers inconsistent by modifying objects after they are used as keys. Further documentation: https://docs.python.org/3/library/dataclasses.html
Tested on Python 3.10.7, Ubuntu 22.10.