What happens if an object's __hash__ changes? - python

In Python, I know that the value __hash__ returns for a given object is supposed to be the same for the lifetime of that object. But, out of curiosity, what happens if it isn't? What sort of havoc would this cause?
import random

class BadIdea(object):
    def __hash__(self):
        return random.randint(0, 10000)
I know __contains__ and __getitem__ would behave strangely, and dicts and sets would act odd because of that. You also might end up with "orphaned" values in the dict/set.
What else could happen? Could it crash the interpreter, or corrupt internal structures?

Your main problem would indeed be with dicts and sets. If you insert an object into a dict/set, and that object's hash changes, then when you try to retrieve that object you will end up looking in a different spot in the dict/set's underlying array and hence won't find the object. This is precisely why dict keys should always be immutable.
Here's a small example: let's say we put o into a dict, and o's initial hash is 3. We would do something like this (a slight simplification but gets the point across):
Hash table:
  0   1   2   3   4   5   6   7
+---+---+---+---+---+---+---+---+
|   |   |   | o |   |   |   |   |
+---+---+---+---+---+---+---+---+
              ^
              we put o here, since it hashed to 3
Now let's say the hash of o changes to 6. If we want to retrieve o from the dict, we'll look at spot 6, but there's nothing there! This will cause a false negative when querying the data structure. In reality, each element of the array above could have a "value" associated with it in the case of a dict, and there could be multiple elements in a single spot (e.g. a hash collision). Also, we'd generally take the hash value modulo the size of the array when deciding where to put the element. Irrespective of all these details, though, the example above still accurately conveys what could go wrong when the hash code of an object changes.
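To make the "orphaned value" idea concrete, here is a small sketch (it reuses the question's BadIdea class, so the exact outcome is probabilistic): the entry physically stays in the table and is still reachable by iteration, but a direct lookup usually misses it, because it probes a slot derived from a fresh, different hash.
import random

class BadIdea(object):
    def __hash__(self):
        return random.randint(0, 10000)   # a new hash on every call

d = {}
key = BadIdea()
d[key] = "value"

print(len(d))             # 1 -- the entry is still in the table
print(key in d)           # usually False -- the lookup probes a slot based on the new hash
print(list(d)[0] is key)  # True -- iteration still reaches the "orphaned" entry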
Could it crash the interpreter, or corrupt internal structures?
No, this won't happen. When we say an object's hash changing is "dangerous", we mean dangerous in the sense that it essentially defeats the purpose of hashing and makes the code difficult if not impossible to reason about. We don't mean dangerous in the sense that it could cause crashes.

There's a great post on GitHub about it: What happens when you mess with hashing.
First, you need to know that Python expects (quoted from the article):
The hash of an object does not change across the object's lifetime (in other words, a hashable object should be immutable).
a == b implies hash(a) == hash(b) (note that the reverse might not hold in the case of a hash collision).
Here's a code example which shows the problem of a varying hash, with a slightly different example class, but the idea remains the same:
>>> class Bad(object):
...     def __init__(self, arg):
...         self.arg = arg
...     def __hash__(self):
...         return hash(self.arg)
...
>>> Bad(1)
<__main__.Bad object at ...>
>>> hash(Bad(1))
1
>>> a = Bad(1)
>>> b = {a:1}
>>> a.arg = 2
>>> hash(a)
2
>>> b[a]
Traceback (most recent call last):
...
KeyError: <__main__.Bad object at ...>
Here, we implicitly changed the hash of a by mutating the argument of a that is used to compute the hash. As a result, the object is no longer found in a dictionary, which uses the hash to find the object.
Note that Python doesn't prevent me from doing this. I could prevent it if I wanted to, by making __setattr__ raise AttributeError, but even then the attribute could be forcibly changed by modifying the object's __dict__. This is what is meant when we say that Python is a "consenting adults" language.
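As a rough sketch of that point (the Frozen class below is just an illustration, not from the original example): __setattr__ can block ordinary mutation, but writing to __dict__ directly still gets through, and the hash changes with it.
class Frozen(object):
    def __init__(self, arg):
        object.__setattr__(self, "arg", arg)  # bypass our own guard once, during init

    def __setattr__(self, name, value):
        raise AttributeError("instances are read-only")

    def __hash__(self):
        return hash(self.arg)

f = Frozen(1)
try:
    f.arg = 2                 # blocked by __setattr__
except AttributeError:
    pass
f.__dict__["arg"] = 2         # ...but nothing stops a determined caller
print(hash(f))                # 2 -- the hash changed anyway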
It won't make Python crash, but unexpected behavior will happen with dicts, sets, and everything else based on object hashes.

understanding python id() uniqueness

Python documentation for id() function states the following:
This is an integer which is guaranteed to be unique and constant for
this object during its lifetime. Two objects with non-overlapping
lifetimes may have the same id() value.
CPython implementation detail: This is the address of the object in memory.
However, the snippet below shows that ids are repeated. Since I didn't explicitly del the objects, I presume they are all alive and should therefore have unique ids (I do not know what non-overlapping means).
>>> g = [0, 1, 0]
>>> for h in g:
... print(h, id(h))
...
0 10915712
1 10915744
0 10915712
>>> a=0
>>> b=1
>>> c=0
>>> d=[a, b,c]
>>> for e in d:
... print(e, id(e))
...
0 10915712
1 10915744
0 10915712
>>> id(a)
10915712
>>> id(b)
10915744
>>> id(c)
10915712
>>>
How can the id values for different objects be the same? Is it so because the value 0 (object of class int) is a constant and the interpreter/C compiler optimizes?
If I were to do a = c, then I understand c to have the same id as a since c would just be a reference to a (alias). I expected the objects a and c to have different id values otherwise, but, as shown above, they have the same values.
What's happening? Or am I looking at this the wrong way?
I would expect the ids of user-defined classes' instances to ALWAYS be unique, even if they have the exact same member values.
Could someone explain this behavior? (I looked at the other questions that ask uses of id(), but they steer in other directions)
EDIT (09/30/2019):
To extend what I already wrote, I ran Python interpreters in separate terminals and checked the id of 0 in all of them: for a given interpreter they were exactly the same, even across separate instances. Python 2 vs Python 3 gave different values, but every instance of the same Python 2 interpreter gave the same id.
My question comes from the fact that id()'s documentation doesn't mention any such optimizations, which seems misleading (I don't expect every quirk to be noted, but some note alongside the CPython implementation detail would be nice)...
EDIT 2 (09/30/2019):
The question stems from wanting to understand this behavior and to know whether there are any hooks to optimize user-defined classes in a similar way (by modifying the __eq__ method to identify whether two objects are the same; perhaps they would then point to the same address in memory, i.e. have the same id? Or by using some metaclass machinery.)
Ids are guaranteed to be unique for the lifetime of the object. If an object gets deleted, a new object can acquire the same id. CPython will delete items immediately when their refcount drops to zero. The garbage collector is only needed to break up reference cycles.
CPython may also cache and re-use certain immutable objects like small integers and strings defined by literals that are valid identifiers. This is an implementation detail that you should not rely upon. It is generally considered improper to use is checks on such objects.
There are certain exceptions to this rule, for example, using an is check on possibly-interned strings as an optimization before comparing them with the normal == operator is fine. The dict builtin uses this strategy for lookups to make them faster for identifiers.
a is b or a == b # This is OK
If the string happens to be interned, then the above can return true with a simple id comparison instead of a slower character-by-character comparison, but it still returns true if and only if a == b (because if a is b then a == b must also be true). However, a good implementation of .__eq__() would already do an is check internally, so at best you would only avoid the overhead of calling .__eq__().
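For illustration, here is a small sketch (not from the original answer; sys.intern is used to force interning so the fast path is visible):
import sys

a = sys.intern("some_fairly_long_identifier")
b = sys.intern("some_fairly_long_identifier")
a is b                    # True: both names refer to the single interned object
a is b or a == b          # True, decided by the cheap identity check alone

c = "".join(["some_fairly_long_", "identifier"])  # equal value, separate object
c is a                    # False: not interned, so the identity check fails
c is a or c == a          # True, falls through to the character-by-character ==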
Thanks for the answer. Would you elaborate on the uniqueness of user-defined objects; are their ids always unique?
The id of any object (be it user-defined or not) is unique for the lifetime of the object. It's important to distinguish objects from variables. It's possible to have two or more variables refer to the same object.
>>> a = object()
>>> b = a
>>> c = object()
>>> a is b
True
>>> a is c
False
Caching optimizations mean that you are not always guaranteed to get a new object in cases where one might naively think one should, but this does not in any way violate the uniqueness guarantee of IDs. Builtin types like int and str may have some caching optimizations, but they follow exactly the same rules: if they are live at the same time and their IDs are the same, then they are the same object.
Caching is not unique to builtin types. You can implement caching for your own objects.
>>> def the_one(it=object()):
... return it
...
>>> the_one() is the_one()
True
Even user-defined classes can cache instances. For example, this class only makes one instance of itself.
>>> class TheOne:
... _the_one = None
... def __new__(cls):
... if not cls._the_one:
... cls._the_one = super().__new__(cls)
... return cls._the_one
...
>>> TheOne() is TheOne() # There can be only one TheOne.
True
>>> id(TheOne()) == id(TheOne()) # This is what an is-check does.
True
Note that each construction expression evaluates to an object with the same id as the other. But this id is unique to the object. Both expressions reference the same object, so of course they have the same id.
The above class only keeps one instance, but you could also cache some other number. Perhaps recently used instances, or those configured in a way you expect to be common (as ints do), etc.
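As a sketch of that idea (the Point class and its _cache dict are hypothetical, not from the answer above), a class can keep a cache keyed by constructor arguments and hand back an existing instance when it already has one:
>>> class Point:
...     _cache = {}
...     def __new__(cls, x, y):
...         inst = cls._cache.get((x, y))
...         if inst is None:
...             inst = super().__new__(cls)
...             inst.x, inst.y = x, y
...             cls._cache[(x, y)] = inst   # remember it for later calls
...         return inst
...
>>> Point(1, 2) is Point(1, 2)   # same cached object, same id
True
>>> Point(1, 2) is Point(3, 4)   # different arguments, different object
False
The ids are still unique per live object; the cache merely means the two constructions on the first line yield the same object.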

Calculate an identifier for an object [duplicate]

This would be similar to the java.lang.Object.hashcode() method.
I need to store objects I have no control over in a set, and make sure that values are overwritten only if two objects are actually the same object (not objects that merely contain the same values).
id(x)
will do the trick for you. But I'm curious: what's wrong with a set of the objects themselves (which would combine objects by value)?
For your particular problem I would probably keep a set of ids or of wrapper objects. A wrapper object would hold one reference and compare by x == y <==> x.ref is y.ref, as sketched below.
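A minimal sketch of such a wrapper (the name IdentityWrapper is made up for illustration), hashing and comparing by the identity of the wrapped object rather than by its value:
class IdentityWrapper(object):
    def __init__(self, ref):
        self.ref = ref

    def __eq__(self, other):
        return isinstance(other, IdentityWrapper) and self.ref is other.ref

    def __hash__(self):
        return id(self.ref)          # identity, not value

seen = set()
x = [1, 2, 3]
y = [1, 2, 3]                        # equal value, but a different object
seen.add(IdentityWrapper(x))
print(IdentityWrapper(x) in seen)    # True  -- same underlying object
print(IdentityWrapper(y) in seen)    # False -- equal value, different identity
Unlike a bare set of ids, the wrapper also keeps the wrapped object alive, so its id cannot be recycled while the wrapper is in the set.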
It's also worth noting that Python objects have a hash function as well. This function is necessary to put an object into a set or dictionary. It is allowed to collide for different objects, though good hash implementations try to make that unlikely.
That's what "is" is for.
Instead of testing "if a == b", which tests for the same value,
test "if a is b", which tests for the same identity (i.e. the same object).
As ilya n mentions, id(x) produces a unique identifier for an object.
But your question is confusing, since Java's hashCode method doesn't give a unique identifier. Java's hashCode works like most hash functions: it always returns the same value for the same object, two objects that are equal always get equal codes, and unequal hash codes imply unequal objects. In particular, two different and unequal objects can get the same value.
This is confusing because cryptographic hash functions are quite different from this, and more like (though not exactly) the "unique id" that you asked for.
The Python equivalent of Java's hashCode method is hash(x).
You don't have to compare objects before placing them in a set. set() semantics already takes care of this.
class A(object):
    a = 10
    b = 20
    def __hash__(self):
        return hash((self.a, self.b))

a1 = A()
a2 = A()
a3 = A()
a4 = a1
s = set([a1, a2, a3, a4])
s
=> set([<__main__.A object at 0x222a8c>, <__main__.A object at 0x220684>, <__main__.A object at 0x22045c>])
Note: You really don't have to override hash to prove this behaviour :-)

Python: Identical strings (or numbers) with unique ids?

Python is wonderfully optimized, but I have a case where I'd like to work around it. It seems for small numbers and strings, python will automatically collapse multiple objects into one. For example:
>>> a = 1
>>> b = 1
>>> id(a) == id(b)
True
>>> a = str(a)
>>> b = str(b)
>>> id(a) == id(b)
True
>>> a += 'foobar'
>>> b += 'foobar'
>>> id(a) == id(b)
False
>>> a = a[:-6]
>>> b = b[:-6]
>>> id(a) == id(b)
True
I have a case where I'm comparing objects based on their Python ids. This is working really well except for the few cases where I run into small numbers. Does anyone know how to turn off this optimization for specific strings and integers? Something akin to an anti-intern()?
You shouldn't be relying on these objects to be different objects at all. There's no way to turn this behavior off without modifying and recompiling Python, and which particular objects it applies to is subject to change without notice.
You can't turn it off without re-compiling your own version of CPython.
But if you want to have "separate" versions of the same small integers, you can do that by maintaining your own id (for example a uuid4) associated with the object.
Since ints and strings are immutable, there's no obvious reason to do this - if you can't modify the object at all, you shouldn't care whether you have the "original" or a copy because there is no use-case where it can make any difference.
Related: How to create the int 1 at two different memory locations?
Sure, it can be done, but it's never really a good idea:
Z = 1

class MyString(str):
    def __init__(self, *args):
        global Z
        super(MyString, self).__init__()
        self.i = Z
        Z += 1
>>> a = MyString("1")
>>> b = MyString("1")
>>> a is b
False
By the way, to check whether two objects have the same id, just use a is b instead of id(a) == id(b).
The Python documentation on id() says
Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
CPython implementation detail: This is the address of the object in memory.
So it's guaranteed to be unique; it must be intended as a way to tell whether two variables are bound to the same object.
In a comment on StackOverflow here, Alex Martelli says the CPython implementation is not the authoritative Python, and other correct implementations of Python can and do behave differently in some ways - and that the Python Language Reference (PLR) is the closest thing Python has to a definitive specification.
In the PLR section on objects it says much the same:
Every object has an identity, a type and a value. An object’s identity never changes once it has been created; you may think of it as the object’s address in memory. The ‘is‘ operator compares the identity of two objects; the id() function returns an integer representing its identity (currently implemented as its address).
The language reference doesn't say it's guaranteed to be unique. It also says (re: the object's lifetime):
Objects are never explicitly destroyed; however, when they become unreachable they may be garbage-collected. An implementation is allowed to postpone garbage collection or omit it altogether — it is a matter of implementation quality how garbage collection is implemented, as long as no objects are collected that are still reachable.
and:
CPython implementation detail: CPython currently uses a reference-counting scheme with (optional) delayed detection of cyclically linked garbage, which collects most objects as soon as they become unreachable, but is not guaranteed to collect garbage containing circular references. See the documentation of the gc module for information on controlling the collection of cyclic garbage. Other implementations act differently and CPython may change. Do not depend on immediate finalization of objects when they become unreachable (ex: always close files).
This isn't actually an answer; I was hoping it would end up somewhere conclusive. But I don't want to delete it now that I've quoted and cited.
I'll go with turning your premise around: "python will automatically collapse multiple objects into one" - no it won't; they were never multiple objects, and they can't be, because they have the same id().
If id() is Python's definitive answer on whether two objects are the same or different, your premise is incorrect - this isn't an optimization, it's a fundamental part of Python's view on the world.
This version accounts for wim's concerns about more aggressive interning in the future. It will use more memory, which is why I discarded it originally, but it is probably more future proof.
>>> class Wrapper(object):
...     def __init__(self, obj):
...         self.obj = obj
...
>>> a = 1
>>> b = 1
>>> aWrapped = Wrapper(a)
>>> bWrapped = Wrapper(b)
>>> aWrapped is bWrapped
False
>>> aUnWrapped = aWrapped.obj
>>> bUnwrapped = bWrapped.obj
>>> aUnWrapped is bUnwrapped
True
Or a version that works like the pickle answer (wrap + pickle = wrapple):
class Wrapple(object):
    def __init__(self, obj):
        self.obj = obj

    @staticmethod
    def dumps(obj):
        return Wrapple(obj)

    def loads(self):
        return self.obj

aWrapped = Wrapple.dumps(a)
aUnWrapped = aWrapped.loads()
Well, seeing as no one posted a response that was useful, I'll just let you know what I ended up doing.
First, some friendly advice to someone who might read this one day. This is not recommended for normal use, so if you're contemplating it, ask yourself if you have a really good reason. There are good reasons, but they are rare, and if someone says there aren't, they just aren't thinking hard enough.
In the end, I just used pickle.dumps() on all the objects and passed the output in instead of the real object. On the other side I checked the id and then used pickle.loads() to restore the object. The nice part of this solution was that it works for all types, including None and booleans.
>>> import pickle
>>> a = 1
>>> b = 1
>>> a is b
True
>>> aPickled = pickle.dumps(a)
>>> bPickled = pickle.dumps(b)
>>> aPickled is bPickled
False
>>> aUnPickled = pickle.loads(aPickled)
>>> bUnPickled = pickle.loads(bPickled)
>>> aUnPickled is bUnPickled
True
>>> aUnPickled
1

Why do python sets hold False and Zero exclusively?

when creating a set:
>>> falsey_set = {0, '', False, None} # set([False, '', None])
>>> falsey_set = {False, '', 0, None} # set([0,'', None])
>>> # adding an item to the set doesn't change anything either
>>> falsey_set.add(False) # set([0,'',None])
or a dictionary, which mimics the behavior somewhat:
>>> falsey_dict = {0:"zero", False:"false"} # {0:'false'} # that's not a typo
>>> falsey_dict = {False:'false', 0:'zero'} # {False: 'zero'} # again, not a typo
>>> falsey_set.add(()) # set([0,'', None, ()])
>>> falsey_set.add({})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>> falsey_dict[()] = 'list' # {False:'zero', ():'list'}
>>> falsey_dict[{}] = 'dict'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
0 and False always displace one another in the set. In the case of dictionaries the results look incorrect altogether. Is there a reason for this? I realize that booleans are derived from integers in Python. What's the Pythonic reasoning for acting this way in the context of sets specifically (I don't care about dictionaries too much)? While equality is useful in truthy comparisons like:
>>> False == 0 # True
There is obvious value in differentiation:
>>> False is 0 # False
I've been looking over the documentation and can't seem to find a reference for the behavior
Update
@delnan I think you hit the nail on the head with the hash determinism you mentioned in the comments. As @mgilson notes, both False and 0 use the same hashing function; however, so do object and many of its subclasses (e.g. super) that have identical hash functions. The key seems to be in the phrase "Hashable objects which compare equal must have the same hash value" from the documentation. Since False == 0 and both are hashable, their hash values must, by Python's definition, be equal. Finally, the definition of hashable states how sets use hashability for set membership: "Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally." While I still don't understand why they both use the same hashing function, I can settle with going this deep.
If we all agree then someone propose a polished answer, and I'll accept it. If there could be some improvement or if I'm off base then please let it be known below.
It's because False and 0 hash to the same value and are equal.
The reason that they hash to the same value is because bool is a subclass of int so bool.__hash__ simply calls the same underlying mechanics that int.__hash__ calls...
>>> bool.__hash__ is int.__hash__
True
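You can check both facts, and their consequence for sets, directly (a quick interactive sketch, not part of the original answer):
>>> hash(False) == hash(0)
True
>>> False == 0
True
>>> len({0, False})
1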
First, let's try to explain what's going on at the beginning, with your falsey_set and falsey_dict, so you see that it's not "incorrect" but in fact the only possible consistent solution. To do so, we will remove bools from the picture temporarily and use something that more people grasp intuitively: decimal numbers.
>>> numset = {3, 5, 3.0, 4} # {3.0, 4, 5}
>>> numset.add(3) # no change
I hope you agree that this is exactly how set should work. If you don't, then it seems that either you think 3 and 3.0 are not really equal, or you think that a set should be allowed to have equal elements. Neither of these are really productive beliefs IMO.
(Of course, which one of 3 and 3.0 ends up in the set is a matter of how the display is processed, and set is a bit weird since it is an atrophied dict where key and value are the same. But it is consistent and specified in Python. The point for now is, surely they cannot both be in the set.)
One more point: as you can see, the fact that I can add many other truthy things to my set (like 4 and 5) doesn't matter at all. Likewise, the fact that you can add many other falsy things to your set (like '' and None) doesn't matter at all. Truthiness is a red herring. A set can have truthy elements and falsy elements. What it cannot have is equal elements.
>>> numdict = {3:"a", 3.0:"b"} # {3:"b"}
This looks weirder at first glance, but what's going on is in fact much clearer, since keys and values are separate. Python's rules are precise: read the dict display from left to right, take every pair a:b, and if key a is already in the dict, update its value to b, else insert key a into the dict with value b.
With that algorithm, I guess it's obvious how the final dict ends up like that, and all the other behaviours you've noticed. What's important is that, like in a set, what you really need in a dict is to have only one value for any given key. Having two equal keys in the same dict would be an invitation to disaster, since then you'd be able to assign them different values.
So, in a nutshell: I think you dug yourself too deep into hash functions and other implementation details. These are a nice way of seeing how Python does X, once you realize that X is the right thing to do. But first you have to see that X is the right thing to do, and I hope I've shown that to you now. A set cannot have equal elements; that would defeat a widely used purpose of a set, removing duplicates. And 3 and 3.0 really are equal. This has nothing to do with types; some embeddings are so natural that we've erased them at the mathematical level.
Of course, that leaves the question "why are 0 and False really equal"? In fact, the answer is not very different: just another mathematically erased embedding that's so incredibly useful we would have to jump through many ridiculous hoops without it. For more about that, read about Iverson bracket. ;-) But anyway, it seems you know about that part. The above is what was problematic, I guess.

Why can a Python dict have multiple keys with the same hash?

I am trying to understand the Python hash function under the hood. I created a custom class where all instances return the same hash value.
class C:
    def __hash__(self):
        return 42
I just assumed that only one instance of the above class can be in a dict at any time, but in fact a dict can have multiple elements with the same hash.
c, d = C(), C()
x = {c: 'c', d: 'd'}
print(x)
# {<__main__.C object at 0x7f0824087b80>: 'c', <__main__.C object at 0x7f0823ae2d60>: 'd'}
# note that the dict has 2 elements
I experimented a little more and found that if I override the __eq__ method such that all the instances of the class compare equal, then the dict only allows one instance.
class D:
    def __hash__(self):
        return 42
    def __eq__(self, other):
        return True
p, q = D(), D()
y = {p: 'p', q: 'q'}
print(y)
# {<__main__.D object at 0x7f0823a9af40>: 'q'}
# note that the dict only has 1 element
So I am curious to know how a dict can have multiple elements with the same hash.
Here is everything about Python dicts that I was able to put together (probably more than anyone would like to know; but the answer is comprehensive). A shout out to Duncan for pointing out that Python dicts use slots and leading me down this rabbit hole.
Python dictionaries are implemented as hash tables.
Hash tables must allow for hash collisions i.e. even if two keys have same hash value, the implementation of the table must have a strategy to insert and retrieve the key and value pairs unambiguously.
Python dict uses open addressing to resolve hash collisions (explained below) (see dictobject.c:296-297).
The Python hash table is just a contiguous block of memory (sort of like an array, so you can do an O(1) lookup by index).
Each slot in the table can store one and only one entry. This is important.
Each entry in the table is actually a combination of three values: <hash, key, value>. This is implemented as a C struct (see dictobject.h:51-56).
The figure below is a logical representation of a python hash table. In the figure below, 0, 1, ..., i, ... on the left are indices of the slots in the hash table (they are just for illustrative purposes and are not stored along with the table obviously!).
# Logical model of Python Hash table
-+-----------------+
0| <hash|key|value>|
-+-----------------+
1| ... |
-+-----------------+
.| ... |
-+-----------------+
i| ... |
-+-----------------+
.| ... |
-+-----------------+
n| ... |
-+-----------------+
When a new dict is initialized it starts with 8 slots. (see dictobject.h:49)
When adding entries to the table, we start with some slot i that is based on the hash of the key (CPython uses an initial i = hash(key) & mask, where mask = PyDict_MINSIZE - 1, but that's not really important). Just note that the initial slot i that is checked depends on the hash of the key.
If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>). But what if that slot is occupied!? Most likely because another entry has the same hash (hash collision!)
If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean the == comparison, not the is comparison) of the entry in the slot against the hash and key of the entry to be inserted (dictobject.c:337, 344-345). If both match, then it thinks the entry already exists, gives up, and moves on to the next entry to be inserted. If either the hash or the key doesn't match, it starts probing.
Probing just means it searches slot by slot to find an empty slot. Technically we could just go one by one, i+1, i+2, ... and use the first available one (that's linear probing). But for reasons explained beautifully in the comments (see dictobject.c:33-126), CPython uses random probing. In random probing, the next slot is picked in a pseudo-random order. The entry is added to the first empty slot. For this discussion, the actual algorithm used to pick the next slot is not really important (see dictobject.c:33-126 for the probing algorithm). What is important is that the slots are probed until the first empty slot is found.
The same thing happens for lookups: it starts with the initial slot i (where i depends on the hash of the key). If the hash and the key both don't match the entry in the slot, it starts probing until it finds a slot with a match. If an empty slot is reached (or all slots are exhausted), it reports a failure.
BTW, the dict will be resized if it is two-thirds full. This avoids slowing down lookups. (see dictobject.h:64-65)
There you go! The Python implementation of dict checks for both hash equality of two keys and the normal equality (==) of the keys when inserting items. So in summary, if there are two keys, a and b and hash(a)==hash(b), but a!=b, then both can exist harmoniously in a Python dict. But if hash(a)==hash(b) and a==b, then they cannot both be in the same dict.
Because we have to probe after every hash collision, one side effect of too many hash collisions is that the lookups and insertions will become very slow (as Duncan points out in the comments).
I guess the short answer to my question is, "Because that's how it's implemented in the source code ;)"
While this is good to know (for geek points?), I am not sure how it can be used in real life. Because unless you are trying to explicitly break something, why would two objects that are not equal have the same hash?
For a detailed description of how Python's hashing works see my answer to Why is early return slower than else?
Basically it uses the hash to pick a slot in the table. If there is a value in the slot and the hash matches, it compares the items to see if they are equal.
If the hash matches but the items aren't equal, then it tries another slot. There's a formula to pick this (which I describe in the referenced answer), and it gradually pulls in unused parts of the hash value; but once it has used them all up, it will eventually work its way through all slots in the hash table. That guarantees eventually we either find a matching item or an empty slot. When the search finds an empty slot, it inserts the value or gives up (depending whether we are adding or getting a value).
The important thing to note is that there are no lists or buckets: there is just a hash table with a particular number of slots, and each hash is used to generate a sequence of candidate slots.
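To make the probing story concrete, here is a heavily simplified, educational sketch of open addressing (it uses linear probing and never resizes, unlike CPython's real perturb-based scheme, so treat it as an illustration only):
class TinyDict(object):
    def __init__(self):
        self.slots = [None] * 8              # each slot holds (hash, key, value) or None

    def _find_slot(self, key):
        h = hash(key)
        i = h & (len(self.slots) - 1)        # initial slot comes from the hash
        while True:
            entry = self.slots[i]
            if entry is None:
                return i                     # empty slot: not found / place to insert
            if entry[0] == h and entry[1] == key:
                return i                     # hash AND key match: same key
            i = (i + 1) % len(self.slots)    # collision or different key: keep probing

    def __setitem__(self, key, value):
        self.slots[self._find_slot(key)] = (hash(key), key, value)

    def __getitem__(self, key):
        entry = self.slots[self._find_slot(key)]
        if entry is None:
            raise KeyError(key)
        return entry[2]

class C(object):
    def __hash__(self):
        return 42

c, d = C(), C()
t = TinyDict()
t[c] = 'c'
t[d] = 'd'                                   # same hash, but c != d, so d lands in the next slot
print(t[c], t[d])                            # c d -- both coexist, just like in a real dict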
Edit: the answer below describes one possible way to deal with hash collisions; it is, however, not how Python does it. The Python wiki referenced below is also incorrect. The best source, given by @Duncan below, is the implementation itself: https://github.com/python/cpython/blob/master/Objects/dictobject.c I apologize for the mix-up.
It stores a list (or bucket) of elements at the hash then iterates through that list until it finds the actual key in that list. A picture says more than a thousand words:
Here you see John Smith and Sandra Dee both hash to 152. Bucket 152 contains both of them. When looking up Sandra Dee it first finds the list in bucket 152, then loops through that list until Sandra Dee is found and returns 521-6955.
The following is wrong and is only here for context: on Python's wiki you can find (pseudo?) code for how Python performs the lookup.
There's actually several possible solutions to this problem, check out the wikipedia article for a nice overview: http://en.wikipedia.org/wiki/Hash_table#Collision_resolution
Hash tables, in general, have to allow for hash collisions! You will get unlucky and two things will eventually hash to the same value. Underneath, there is a list of items that share that same hash key. Usually there is only one thing in that list, but in this case it'll keep stacking them into the same one. The only way it knows they are different is through the equality operator.
When this happens, your performance will degrade over time, which is why you want your hash function to be as "random as possible".
In the thread I did not see what exactly Python does with instances of user-defined classes when we put them into a dictionary as keys. Let's read some documentation: it declares that only hashable objects can be used as keys. All immutable built-in objects are hashable, and so are instances of user-defined classes by default.
User-defined classes have __cmp__() and
__hash__() methods by default; with them, all objects
compare unequal (except with themselves) and
x.__hash__() returns a result derived from id(x).
So if you have a constant __hash__ in your class but do not provide any __cmp__ or __eq__ method, then all your instances are still unequal as far as the dictionary is concerned.
On the other hand, if you provide a __cmp__ or __eq__ method but no __hash__, your instances are still unequal in terms of the dictionary, because they keep the default identity-based hash (note that this is Python 2 behaviour; in Python 3, defining __eq__ without __hash__ makes the class unhashable, so it cannot be used as a dict key at all).
class A(object):
    def __hash__(self):
        return 42

class B(object):
    def __eq__(self, other):
        return True

class C(A, B):
    pass

dict_a = {A(): 1, A(): 2, A(): 3}
dict_b = {B(): 1, B(): 2, B(): 3}
dict_c = {C(): 1, C(): 2, C(): 3}

print(dict_a)
print(dict_b)
print(dict_c)
Output
{<__main__.A object at 0x7f9672f04850>: 1, <__main__.A object at 0x7f9672f04910>: 3, <__main__.A object at 0x7f9672f048d0>: 2}
{<__main__.B object at 0x7f9672f04990>: 2, <__main__.B object at 0x7f9672f04950>: 1, <__main__.B object at 0x7f9672f049d0>: 3}
{<__main__.C object at 0x7f9672f04a10>: 3}
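For completeness, here is how the situation looks under Python 3, where a class that defines __eq__ without __hash__ becomes unhashable (the class E below is just a minimal illustration, not part of the original answer):
class E(object):
    def __eq__(self, other):
        return True

print(E.__hash__)     # None -- Python 3 disables the inherited hash
try:
    {E(): 1}
except TypeError as exc:
    print(exc)        # unhashable type: 'E'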
