What does set() in Python use to test equality between objects? - python

I have written some code in Python with a class called product that overrides the magic methods __eq__ and __hash__. Now I need to build a set that removes duplicates from a list based on the ID of the product. As the output of this code shows, the hashes of the two objects are the same, yet when I make a set of those two objects its length is 2, not 1.
But when I change the __eq__ method to this
def __eq__(self, b) -> bool:
    if self.id == b.id:
        return True
    return False
and use it with the same hash function, it works and the length of the set is 1. So I am confused about whether the set data structure uses the __eq__ method or the __hash__ method to test for equality.

Equality tests can be expensive, so the set starts by comparing hashes. If the hashes are not equal, then the check ends. If the hashes are equal, the set then tests for equality. If it only used __eq__, it might have to do a lot of unnecessary work, but if it only used __hash__, there would be no way to resolve a hash collision.
Here's a simple example of using equality to resolve a hash collision. In CPython, small integers are their own hashes, except for -1:
>>> hash(-1)
-2
>>> hash(-2)
-2
>>> s = set()
>>> s.add(-1)
>>> -2 in s
False
Here's an example of the set skipping the equality check because the hashes aren't equal. Let's subclass int so that it returns a new hash every second:
>>> import time
>>> class TimedInt(int):
...     def __hash__(self):
...         return int(time.time())
...
>>> a = TimedInt(5)
>>> a == 5
True
>>> a == a
True
>>> s = set()
>>> s.add(a) # Now wait a few seconds...
>>> a in s
False
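Tying this back to the question, here is a minimal sketch of a class that deduplicates by ID. The question's full class is not shown, so this Product class, its name field, and the eq_calls counter are assumptions made purely for illustration. The counter lets you see that __eq__ only runs when two hashes match.

```python
class Product:
    """Hypothetical stand-in for the question's class; only `id` decides equality."""

    eq_calls = 0  # illustration only: counts how often __eq__ is invoked

    def __init__(self, id, name):
        self.id = id
        self.name = name

    def __eq__(self, other):
        Product.eq_calls += 1
        if not isinstance(other, Product):
            return NotImplemented
        return self.id == other.id

    def __hash__(self):
        # Must agree with __eq__: products with equal ids hash equally.
        return hash(self.id)


products = [Product(1, "pen"), Product(1, "pen (duplicate)"), Product(2, "ink")]
unique = set(products)
print(len(unique))               # 2
print(Product.eq_calls >= 1)     # True: the two id-1 products had to be compared
```

Both methods are needed: __hash__ narrows the search, and __eq__ makes the final decision.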

Related

How does Python guarantee all objects in a set are unique? [duplicate]

I'm using set() and the __hash__ method of a Python class to prevent objects with the same hash from being added to a set twice. According to the Python data model documentation, set() considers objects with the same hash to be the same object and adds them only once.
But it behaves differently, as shown below:
class MyClass(object):
    def __hash__(self):
        return 0
result = set()
result.add(MyClass())
result.add(MyClass())
print(len(result)) # len = 2
With string values, however, it works as expected:
result.add('aida')
result.add('aida')
print(len(result)) # len = 1
My question is: why are objects with the same hash not treated as the same object in a set?
Your reading is incorrect. The __eq__ method is used for equality checks. The documentation only states that the __hash__ value must also be the same for two objects a and b for which a == b (i.e. a.__eq__(b)) is true.
This is a common logic mistake: a == b being true implies that hash(a) == hash(b) is also true. However, an implication is not an equivalence: the converse, that hash(a) == hash(b) implies a == b, does not follow.
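A short sketch of both directions, using only built-in numbers. The first part relies on the language guarantee that numbers comparing equal hash equally; the second relies on a CPython detail (hash(-1) is -2), so treat that part as implementation-specific.

```python
# Forward direction: equal objects have equal hashes, so they collapse in a set.
print(1 == 1.0 == True)                        # True
print(hash(1) == hash(1.0) == hash(True))      # True
print(len({1, 1.0, True}))                     # 1

# Converse does not hold: on CPython, -1 and -2 share a hash but are not equal.
a, b = -1, -2
print(a == b)                                  # False
print(len({a, b}))                             # 2
```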
To make all instances of MyClass compare equal to each other, you need to provide an __eq__ method for them; otherwise Python will compare their identities instead. This might do:
class MyClass(object):
    def __hash__(self):
        return 0
    def __eq__(self, other):
        # another object is equal to self iff
        # it is an instance of MyClass
        return isinstance(other, MyClass)
Now:
>>> result = set()
>>> result.add(MyClass())
>>> result.add(MyClass())
>>> len(result)
1
In reality you'd base the __hash__ on those properties of your object that are used for __eq__ comparison, for example:
class Person:
    def __init__(self, name, ssn):
        self.name = name
        self.ssn = ssn
    def __eq__(self, other):
        return isinstance(other, Person) and self.ssn == other.ssn
    def __hash__(self):
        # use the hash of self.ssn since that is used
        # for equality checks as well
        return hash(self.ssn)
p = Person('Foo Bar', 123456789)
q = Person('Fake Name', 123456789)
print(len({p, q}))  # 1
Sets need two methods to make an object hashable: __hash__ and __eq__. Two instances must return the same hash value when they are considered equal. An instance is considered already present in a set if both the hash is present in the set and the instance is considered equal to one of the instances with that same hash in the set.
Your class doesn't implement __eq__, so the default object.__eq__ is used instead, which only returns true if obj1 is obj2 is also true. In other words, two instances are only considered equal if they are the exact same instance.
Matching hashes alone do not make two instances duplicates as far as a set is concerned; even objects with different hashes can end up in the same hash table slot, because the slot is derived from the hash modulo the table size.
Add a custom __eq__ method that returns True when two instances are supposed to be equal:
def __eq__(self, other):
    if not isinstance(other, type(self)):
        return False
    # all instances of this class are considered equal to one another
    return True
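The lookup rule described above can be modelled with a toy container. This is a deliberately simplified sketch: ToySet and its bucket lists are made up for illustration, and CPython's real set uses open addressing rather than lists of buckets. It only shows the order of the checks: slot from the hash modulo the table size, then hash comparison, then identity or __eq__.

```python
class ToySet:
    """Simplified bucket-based model of a hash set (not CPython's actual layout)."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, item):
        # Different hashes can share a bucket because of the modulus.
        return self.buckets[hash(item) % len(self.buckets)]

    def __contains__(self, item):
        h = hash(item)
        for stored_hash, stored in self._bucket(item):
            # Cheap checks first; __eq__ runs only when the hashes match.
            if stored_hash == h and (stored is item or stored == item):
                return True
        return False

    def add(self, item):
        if item not in self:
            self._bucket(item).append((hash(item), item))

    def __len__(self):
        return sum(len(bucket) for bucket in self.buckets)


s = ToySet()
s.add(1)
s.add(9)      # 9 % 8 == 1: same bucket as 1, but a different hash, so it is kept
s.add(1.0)    # same hash as 1 and equal to it, so it is treated as a duplicate
print(len(s))            # 2
print(9 in s, 17 in s)   # True False
```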

How does one make Python objects hashable when there is nothing to distinguish them?

Let's say I want to use a set() to store a bunch of objects whose only distinction is that they exist and are not other instances of the same class. Otherwise, they are not distinguishable, e.g., no def __eq__(self, other): return self.qux == other.qux, because that qux is the same (or random) for all of them. How do you define an __eq__ and __hash__ function for that class?
You don't need to implement either __eq__ or __hash__.
User-defined classes have __eq__() and __hash__() methods by
default; with them, all objects compare unequal (except with
themselves) and x.__hash__() returns an appropriate value such that
x == y implies both that x is y and hash(x) == hash(y).
Source: Data model
The default is something like:
class OnlyExists:
    def __eq__(self, other):
        return self is other
    def __hash__(self):
        return id(self)
Because an instance is equal only to itself, instances can only be found by identity. Providing a distinct hash per instance (i.e. not just returning the same hash value for every instance) means that the instances don't all end up in the same "bucket", which would be a catastrophic collision and make all dictionary/set lookups degrade to O(n).
>>> class OnlyExists:
...     pass
...
>>> a = OnlyExists()
>>> b = OnlyExists()
>>> s = {a, b}
>>> len(s)
2
>>> a in s
True
>>> b in s
True
>>> OnlyExists() in s
False
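To make the "same bucket" concern concrete, here is a rough sketch that counts equality comparisons while a set is filled. The Counted, ConstantHash and IdentityHash classes and the comparisons_when_filling helper are hypothetical names invented for this illustration; the exact counts depend on the interpreter, so none are claimed here.

```python
class Counted:
    """Hypothetical base class that counts how often __eq__ runs."""

    eq_calls = 0

    def __eq__(self, other):
        type(self).eq_calls += 1
        return self is other


class ConstantHash(Counted):
    eq_calls = 0

    def __hash__(self):
        return 0  # every instance collides with every other instance


class IdentityHash(Counted):
    eq_calls = 0

    def __hash__(self):
        return id(self)  # distinct for objects that are alive at the same time


def comparisons_when_filling(cls, n=200):
    """Build a set of n distinct instances and report how many __eq__ calls it took."""
    cls.eq_calls = 0
    items = [cls() for _ in range(n)]
    filled = set(items)
    assert len(filled) == n  # all instances are distinct, so none are dropped
    return cls.eq_calls


bad = comparisons_when_filling(ConstantHash)
good = comparisons_when_filling(IdentityHash)
print(bad > good)  # True: colliding hashes force many more comparisons
```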

Python set() and __hash__ confusion

One of the classes I defined is used in a set() to filter out equal objects. But it doesn't work as I expect, so I'm obviously misunderstanding something.
class Foo(object):
    def __hash__(self):
        return 7
x = set()
x.add(Foo())
assert len(x) == 1
x.add(Foo())
assert len(x) == 1 # AssertionError
I expect the set to consist of only one element, but it has two. Why is that?
Hash collisions are known to occur in sets (hash tables): no practical hash function yields a unique hash for every possible item. When a collision does occur, Python falls back to checking the equality of the values with __eq__ to decide whether they are the same.
>>> class Foo(object):
...     def __hash__(self):
...         return 7
...     def __eq__(self, other):
...         return True
...
>>> x = set()
>>> x.add(Foo())
>>> assert len(x) == 1
>>> x.add(Foo())
>>> assert len(x) == 1
This is one reason why you see frightening runtimes over here, but note that you can expect O(1) average-case membership checks in sets, even though the worst case is O(N) (when everything is a hash collision). The worst case is very unlikely to occur in practice, thanks to Python's hash table implementation.
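One caveat about the Foo example above: an __eq__ that returns True unconditionally also makes Foo() == 42 true. A hedged sketch of a more conservative variant follows; the class name SafeFoo is made up for illustration, and returning NotImplemented for foreign types is a common convention rather than something the original answer prescribes.

```python
class SafeFoo:
    def __hash__(self):
        return 7

    def __eq__(self, other):
        if not isinstance(other, SafeFoo):
            # Let Python try the other operand, then fall back to identity.
            return NotImplemented
        return True


x = {SafeFoo(), SafeFoo()}
print(len(x))            # 1
print(SafeFoo() == 42)   # False
print(7 in x)            # False, even though hash(7) == 7
```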

When is a python object's hash computed and why is the hash of -1 different?

Following on from this question, I'm interested to know when a Python object's hash is computed:
1. At an instance's __init__ time,
2. The first time __hash__() is called,
3. Every time __hash__() is called, or
4. Any other opportunity I might be missing?
May this vary depending on the type of the object?
Why does hash(-1) == -2 whilst other integers are equal to their hash?
The hash is generally computed each time it's used, as you can quite easily check yourself (see below).
Of course, any particular object is free to cache its hash. For example, CPython strings do this, but tuples don't (see e.g. this rejected bug report for reasons).
The hash value -1 signals an error in CPython. This is because C doesn't have exceptions, so it needs to use the return value. When a Python object's __hash__ returns -1, CPython will actually silently change it to -2.
See for yourself:
class HashTest(object):
    def __hash__(self):
        print('Yes! __hash__ was called!')
        return -1
hash_test = HashTest()
# All of these will print out 'Yes! __hash__ was called!':
print('__hash__ call #1')
hash_test.__hash__()
print('__hash__ call #2')
hash_test.__hash__()
print('hash call #1')
hash(hash_test)
print('hash call #2')
hash(hash_test)
print('Dict creation')
dct = {hash_test: 0}
print('Dict get')
dct[hash_test]
print('Dict set')
dct[hash_test] = 0
print('__hash__ return value:')
print(hash_test.__hash__()) # prints -1
print('Actual hash value:')
print(hash(hash_test)) # prints -2
From here:
The hash value -1 is reserved (it’s used to flag errors in the C implementation).
If the hash algorithm generates this value, we simply use -2 instead.
Since an integer's hash is the integer itself, the hash of -1 is simply changed to -2 right away.
It is easy to see that option #3 holds for user defined objects. This allows the hash to vary if you mutate the object, but if you ever use the object as a dictionary key you must be sure to prevent the hash ever changing.
>>> class C:
...     def __hash__(self):
...         print("__hash__ called")
...         return id(self)
...
>>> inst = C()
>>> hash(inst)
__hash__ called
43795408
>>> hash(inst)
__hash__ called
43795408
>>> d = { inst: 42 }
__hash__ called
>>> d[inst]
__hash__ called
42
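The warning above about a key's hash changing can be illustrated with a small sketch. The Key class is hypothetical and exists only to show the failure mode: once the hash changes, the dictionary looks in the wrong place and the entry becomes unreachable by lookup.

```python
class Key:
    """Hypothetical mutable key whose hash depends on its current value."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Key) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


k = Key(1)
d = {k: "stored"}
print(k in d)   # True

k.value = 2     # the hash changes while the key is inside the dict
print(k in d)   # False: the lookup now uses the new hash
print(len(d))   # 1: the entry still exists, it just cannot be found by lookup
```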
Strings use option #2: they calculate the hash value once and cache the result. This is safe because strings are immutable, so the hash can never change. If you subclass str, however, the result might not be immutable, so the subclass's __hash__ method is called every time. Tuples are usually thought of as immutable, so you might think the hash could be cached, but in fact a tuple's hash depends on the hashes of its contents, and those might include mutable values.
For @max who doesn't believe that subclasses of str can modify the hash:
>>> class C(str):
...     def __init__(self, s):
...         self._n = 1
...     def __hash__(self):
...         return str.__hash__(self) + self._n
...
>>> x = C('hello')
>>> hash(x)
-717693723
>>> x._n = 2
>>> hash(x)
-717693722
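The point about tuples depending on their contents can be checked directly. A minimal sketch; is_hashable is a helper name introduced here for illustration only.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable((1, "two", 3.0)))   # True: every element is hashable
print(is_hashable((1, [2, 3])))       # False: the list inside is unhashable
```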

Python != operation vs "is not"

In a comment on this question, I saw a statement that recommended using
result is not None
vs
result != None
What is the difference? And why might one be recommended over the other?
== is an equality test. It checks whether the right-hand side and the left-hand side are equal objects (according to their __eq__ or __cmp__ methods).
is is an identity test. It checks whether the right-hand side and the left-hand side are the very same object. No method calls are made, so objects can't influence the is operation.
You use is (and is not) for singletons, like None, where you don't care about objects that might want to pretend to be None or where you want to protect against objects breaking when being compared against None.
First, let me go over a few terms. If you just want your question answered, scroll down to "Answering your question".
Definitions
Object identity: When you create an object, you can assign it to a variable. You can then also assign it to another variable. And another.
>>> button = Button()
>>> cancel = button
>>> close = button
>>> dismiss = button
>>> print(cancel is close)
True
In this case, cancel, close, and dismiss all refer to the same object in memory. You only created one Button object, and all three variables refer to this one object. We say that cancel, close, and dismiss all refer to identical objects; that is, they refer to one single object.
Object equality: When you compare two objects, you usually don't care whether they refer to the exact same object in memory. With object equality, you can define your own rules for how two objects compare. When you write if a == b:, you are essentially saying if a.__eq__(b):. This lets you define an __eq__ method on a so that you can use your own comparison logic.
Rationale for equality comparisons
Rationale: Two objects have the exact same data, but are not identical. (They are not the same object in memory.)
Example: Strings
>>> greeting = "It's a beautiful day in the neighbourhood."
>>> a = unicode(greeting)
>>> b = unicode(greeting)
>>> a is b
False
>>> a == b
True
Note: I use unicode strings here because Python is smart enough to reuse regular strings without creating new ones in memory.
Here, I have two unicode strings, a and b. They have the exact same content, but they are not the same object in memory. However, when we compare them, we want them to compare equal. What's happening here is that the unicode object has implemented the __eq__ method.
class unicode(object):
    # ...
    def __eq__(self, other):
        if len(self) != len(other):
            return False
        for i, j in zip(self, other):
            if i != j:
                return False
        return True
Note: __eq__ on unicode is definitely implemented more efficiently than this.
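The session above is Python 2, where unicode is a built-in type. A minimal Python 3 sketch of the same distinction, using lists (chosen here only because copying a list reliably creates a new object):

```python
a = [1, 2, 3]
b = list(a)   # a new list with the same contents
c = a         # a second name for the same object

print(a == b)  # True: same contents
print(a is b)  # False: two distinct objects
print(a is c)  # True: one object, two names
```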
Rationale: Two objects have different data, but are considered the same object if some key data is the same.
Example: Most types of model data
>>> import datetime
>>> a = Monitor()
>>> a.make = "Dell"
>>> a.model = "E770s"
>>> a.owner = "Bob Jones"
>>> a.warranty_expiration = datetime.date(2030, 12, 31)
>>> b = Monitor()
>>> b.make = "Dell"
>>> b.model = "E770s"
>>> b.owner = "Sam Johnson"
>>> b.warranty_expiration = datetime.date(2005, 8, 22)
>>> a is b
False
>>> a == b
True
Here, I have two Dell monitors, a and b. They have the same make and model, but they neither hold the same data nor are the same object in memory. Still, when we compare them, we want them to compare equal. What's happening here is that the Monitor object implemented the __eq__ method.
class Monitor(object):
    # ...
    def __eq__(self, other):
        return self.make == other.make and self.model == other.model
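If such objects also need to go into a set or serve as dictionary keys, note that in Python 3 a class defining __eq__ without __hash__ becomes unhashable. A sketch of one way to keep the two consistent; the constructor signature and the choice to hash the (make, model) pair are assumptions for illustration, not part of the original answer.

```python
class Monitor:
    def __init__(self, make, model, owner=None):
        self.make = make
        self.model = model
        self.owner = owner

    def __eq__(self, other):
        if not isinstance(other, Monitor):
            return NotImplemented
        return (self.make, self.model) == (other.make, other.model)

    def __hash__(self):
        # Hash exactly the fields that __eq__ compares.
        return hash((self.make, self.model))


a = Monitor("Dell", "E770s", owner="Bob Jones")
b = Monitor("Dell", "E770s", owner="Sam Johnson")
print(a == b)        # True
print(a is b)        # False
print(len({a, b}))   # 1
```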
Answering your question
When comparing to None, always use is not. None is a singleton in Python - there is only ever one instance of it in memory.
By comparing identity, this can be performed very quickly. Python checks whether the object you're referring to has the same memory address as the global None object - a very, very fast comparison of two numbers.
By comparing equality, Python has to look up whether your object has an __eq__ method. If it does not, it examines each superclass looking for an __eq__ method. If it finds one, Python calls it. This is especially bad if the __eq__ method is slow and doesn't immediately return when it notices that the other object is None.
Did you not implement __eq__? Then Python will probably find the __eq__ method on object and use that instead - which just checks for object identity anyway.
When comparing most other things in Python, you will be using !=.
Consider the following:
class Bad(object):
    def __eq__(self, other):
        return True
c = Bad()
c is None # False, equivalent to id(c) == id(None)
c == None # True, equivalent to c.__eq__(None)
None is a singleton, and therefore identity comparison will always work, whereas an object can fake the equality comparison via .__eq__().
>>> () is ()
True
>>> 1 is 1
True
>>> (1,) == (1,)
True
>>> (1,) is (1,)
False
>>> a = (1,)
>>> b = a
>>> a is b
True
Some objects are singletons, and thus is with them is equivalent to ==. Most are not.
