python set operations on custom classes - union definition / lookup - python

I have a class that overrides __eq__ and __hash__, and I use sets of its instances for easy lookups (among other reasons).
class Foo:
    def __init__(self, name, value):
        self.name = name
        self.value = value
    def __eq__(self, other):
        return self.value == other.value
    def __hash__(self):
        return self.value
    def __repr__(self):
        return "{} {}".format(self.name, self.value)
mylist = [Foo(None, 1), Foo(None, 2), Foo(None, 3)]
reference = [Foo("a", 1), Foo("b", 2), Foo("c", 3)]
Apparently, when I perform an operation like a union, Python keeps the elements of the first set in the result:
print(set(mylist) | set(reference)) # {None 1, None 2, None 3}
print(set(reference) | set(mylist)) # {a 1, b 2, c 3}
I could not find any documentation on this behaviour.
Is there a formal definition for this?
Or is it just undefined which set the interpreter takes on unions?
EDIT To make it clear:
A union on two sets is mathematically a symmetric operation, the behavior here is not symmetric. Can I rely on it?

I think the answer is in your class definition.
__eq__ compares only value, and __hash__ returns value. As a result, when the two sets are united, each element of one set is equal to its counterpart in the other set (their values are equal), so the result keeps only one object of each equal pair; in your output that is the one from the first set.
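A minimal sketch of that observation, reusing the Foo class from the question. Keeping the left operand's objects is what CPython is observed to do; treating it as an implementation detail rather than a documented guarantee is an assumption here, so the last lines show one hypothetical way (the by_value dict is not from the question) to choose the surviving object explicitly.

```python
class Foo:
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __hash__(self):
        return self.value

    def __repr__(self):
        return "{} {}".format(self.name, self.value)


mylist = [Foo(None, 1), Foo(None, 2), Foo(None, 3)]
reference = [Foo("a", 1), Foo("b", 2), Foo("c", 3)]

left_first = set(mylist) | set(reference)
right_first = set(reference) | set(mylist)

# Equal objects are merged, so both results contain three elements.
print(len(left_first), len(right_first))

# Which object of each equal pair survives depends on the operand order.
left_names = {foo.name for foo in left_first}
right_names = {foo.name for foo in right_first}
print(left_names)
print(right_names)

# Hypothetical explicit choice: key the objects by value and let
# later entries (here: those from reference) overwrite earlier ones.
by_value = {foo.value: foo for foo in mylist + reference}
preferred = set(by_value.values())
print(preferred)
```

If the identity of the surviving object matters, an explicit merge like the last step avoids depending on operand order.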

Your __eq__ implementation is based only on the value attribute.
Changing it to the following implementation, which also compares name, should solve the problem:
def __eq__(self, other):
    return (self.value == other.value) and (self.name == other.name)

Related

Python: Accessing dict with hashable object fails

I am using a hashable object as a key to a dictionary. The objects are hashable and I can store key-value-pairs in the dict, but when I create a copy of the same object (that gives me the same hash), I get a KeyError.
Here is some small example code:
class Object:
    def __init__(self, x): self.x = x
    def __hash__(self): return hash(self.x)
o1 = Object(1.)
o2 = Object(1.)
hash(o1) == hash(o2) # This is True
data = {}
data[o1] = 2.
data[o2] # Desired: This should output 2.
In my scenario above, how can I achieve that data[o2] also returns 2.?
You need to implement both __hash__ and __eq__:
class Object:
    def __init__(self, x): self.x = x
    def __hash__(self): return hash(self.x)
    def __eq__(self, other): return self.x == other.x if isinstance(other, self.__class__) else NotImplemented
Per Python documentation:
if a class does not define an __eq__() method it should not define a __hash__() operation either
After matching the hash, Python's dictionary compares the keys using __eq__. Without a custom __eq__ that comparison falls back to identity, so the dictionary concludes the keys are different; that's why you're getting a KeyError instead of the expected value.
You can use the __eq__ magic method to implement an equality check on your object.
def __eq__(self, other):
    if isinstance(other, Object):
        return self.x == other.x
    return NotImplemented
You can learn more about magic methods from this link.
So, as stated before, your object needs to implement __eq__ (equality, ==). If you want to understand why:
Sometimes the hashes of different objects are the same; this is called a collision.
A dictionary handles that by testing whether the objects are equal. If they are not, the dictionary has to manage the collision. How it does that is an implementation detail and can vary a lot. A dummy implementation would be a list of (key, value) tuples.
Under the hood, such a dummy implementation may look like this:
dico[key] = [(object1, value), (object2, value)]
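To make the dummy implementation concrete, here is a toy sketch of a hash table that keeps a list of (key, value) tuples per bucket. The names ToyDict and Key are hypothetical, and this is not how CPython's dict is actually implemented; it only illustrates why equality is checked after the hash has selected a bucket.

```python
class ToyDict:
    """Toy hash table using one list of (key, value) pairs per bucket."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash only selects the bucket; it does not identify the key.
        return self._buckets[hash(key) % len(self._buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:          # equality decides "same key"
                bucket[i] = (existing, value)
                return
        bucket.append((key, value))      # collision or new key: append

    def __getitem__(self, key):
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        raise KeyError(key)


class Key:
    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return hash(self.x)

    def __eq__(self, other):
        if isinstance(other, Key):
            return self.x == other.x
        return NotImplemented


toy = ToyDict()
toy[Key(1.0)] = 2.0
toy[Key(9.0)] = 3.0      # hash 9 and hash 1 land in the same bucket (mod 8)
found = toy[Key(1.0)]    # a fresh, equal Key finds the stored value
print(found)
```

Without Key.__eq__, the lookup with a fresh Key(1.0) would fall back to identity comparison and raise KeyError, which mirrors the behaviour in the question.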

How does Python guarantees all objects in a set are unique? [duplicate]

I'm using set() and the __hash__ method of a Python class to prevent adding objects with the same hash to a set. According to the Python data model documentation, set() considers objects with the same hash to be the same object and adds them only once.
But it behaves differently, as shown below:
class MyClass(object):
    def __hash__(self):
        return 0
result = set()
result.add(MyClass())
result.add(MyClass())
print(len(result)) # len = 2
While in case of string value, it works correctly.
result.add('aida')
result.add('aida')
print(len(result)) # len = 1
My question is: why are objects with the same hash not treated as the same object in a set?
Your reading is incorrect. The __eq__ method is used for equality checks. The documents just state that the __hash__ value must also be the same for 2 objects a and b for which a == b (i.e. a.__eq__(b)) is true.
This is a common logic mistake: a == b being true implies that hash(a) == hash(b) is also true. However, an implication is not an equivalence: hash(a) == hash(b) being true does not in turn mean that a == b.
To make all instances of MyClass compare equal to each other, you need to provide an __eq__ method for them; otherwise Python will compare their identities instead. This might do:
class MyClass(object):
    def __hash__(self):
        return 0
    def __eq__(self, other):
        # another object is equal to self, iff
        # it is an instance of MyClass
        return isinstance(other, MyClass)
Now:
>>> result = set()
>>> result.add(MyClass())
>>> result.add(MyClass())
>>> len(result)
1
In reality you'd base the __hash__ on those properties of your object that are used for __eq__ comparison, for example:
class Person:
    def __init__(self, name, ssn):
        self.name = name
        self.ssn = ssn
    def __eq__(self, other):
        return isinstance(other, Person) and self.ssn == other.ssn
    def __hash__(self):
        # use the hashcode of self.ssn since that is used
        # for equality checks as well
        return hash(self.ssn)
p = Person('Foo Bar', 123456789)
q = Person('Fake Name', 123456789)
print(len({p, q})) # 1
Sets need two methods to make an object hashable: __hash__ and __eq__. Two instances must return the same hash value when they are considered equal. An instance is considered already present in a set if both the hash is present in the set and the instance is considered equal to one of the instances with that same hash in the set.
Your class doesn't implement __eq__, so the default object.__eq__ is used instead, which only returns true if obj1 is obj2 is also true. In other words, two instances are only considered equal if they are the exact same instance.
Just because their hashes match, doesn't make them unique as far as a set is concerned; even objects with different hashes can end up in the same hash table slot, as the modulus of the hash against the table size is used.
Add a custom __eq__ method that returns True when two instances are supposed to be equal:
def __eq__(self, other):
    if not isinstance(other, type(self)):
        return False
    # all instances of this class are considered equal to one another
    return True

Special Values of a Class

Right now I am creating a class which represents a closed interval. Its core functionality is to provide an intersect method.
class Interval:
    def __init__(self, a, b):
        # check a <= b otherwise swap
        self.a = a
        self.b = b
    def intersect(self, other):
        a = self.a if self.a > other.a else other.a
        b = self.b if self.b < other.b else other.b
        if b < a:
            # return some value representing an empty interval, providing the intersect method
            ...
        return Interval(a, b)
It should be possible to represent special values like all points [-oo,oo] or the empty set {}, which should still support the intersect method. My current approach is to create a new class, but this seems kinda tedious.
class EmptyInterval:
    def intersect(self, other):
        return self
Assuming those special values' intersect methods take precedence, I'd prepend a check to the Interval class' method:
class Interval:
    ...
    def intersect(self, other):
        if not isinstance(other, Interval):
            return other.intersect(self)
        ...
To clarify - the following should be legal:
a = Interval(1, 2)
b = Interval(3, 4)
c = a.intersect(b) # resulting in an empty interval
c.intersect(a) # resulting again in an empty interval
Is there some elegant / more pythonic / less nauseating ugly way to implement such a behavior?
First I thought of inheritance, but that seems quite unfitting because of the precedence those special values should have; i.e. I do not know how to implement it via inheritance.
Define a couple of special functions in your class Interval (this requires import math):
    @staticmethod
    def everything():
        return Interval(-math.inf, math.inf)
    @staticmethod
    def nothing():
        return Interval(math.nan, math.nan)
You may find it more natural to write nothing() like this:
return Interval(0, 0)
or this:
return Interval(math.inf, math.inf)
It rather depends on your other code, and what you think is the most natural way to represent the empty interval. Note that any less or greater comparison with NAN will return false, so this may have some impact on which way you decide to represent the empty interval (for example is nothing().intersect(nothing()) supposed to be true or false?).
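A minimal sketch of how these factory methods could fit together with intersect, assuming NaN bounds are chosen as the empty representation. The is_empty helper is hypothetical and not part of the answer; it is there because NaN comparisons always return False, so the empty case has to be tested explicitly.

```python
import math


class Interval:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    @staticmethod
    def everything():
        return Interval(-math.inf, math.inf)

    @staticmethod
    def nothing():
        return Interval(math.nan, math.nan)

    def is_empty(self):
        # Hypothetical helper: NaN bounds mark the empty interval.
        return math.isnan(self.a) or math.isnan(self.b)

    def intersect(self, other):
        # The empty interval takes precedence over everything else.
        if self.is_empty() or other.is_empty():
            return Interval.nothing()
        a = max(self.a, other.a)
        b = min(self.b, other.b)
        if b < a:
            return Interval.nothing()
        return Interval(a, b)


a = Interval(1, 2)
b = Interval(3, 4)
c = a.intersect(b)                        # disjoint, so empty
d = c.intersect(a)                        # empty again
e = a.intersect(Interval.everything())    # unchanged bounds
print(c.is_empty(), d.is_empty(), (e.a, e.b))
```

With this layout no second class is needed, at the cost of one explicit emptiness check at the top of intersect.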
Maybe this could be another solution:
Instead of passing a,b separately, I could pass a tuple (a,b). Further I could declare a couple of singletons as class variables. During instantiation I'd pass that singleton and would only have to check whether that value is one of the singletons and act accordingly.
class Interval:
    EMPTY = object()
    EVERYTHING = object()
    def __init__(self, bounds):
        self.bounds = bounds
    def intersect(self, other):
        if self.bounds is self.EMPTY or other.bounds is self.EMPTY:
            return Interval(self.EMPTY)
        ...
        if b < a:
            return Interval(self.EMPTY)
        return Interval((a, b))
I guess this may be less error-prone than John's answer, because of the general behavior that math.inf and/or math.nan impose. It would also allow strictly forbidding those values from being passed, as Interval(math.nan, 1) would be nonsensical.
But it may be more effort to implement in a more complex setting.

How to make a subclass of tuple hashable in Python?

I have a class which is a subclass of tuple. I want to use instances of that class as elements of a set, but I get the error that it is an unhashable type. I guess this is because I've overridden the __eq__ and __ne__ methods. What should I do to restore my type's hashability? I'm using Python 3.2.
Objects that compare equal should have the same hash value.
So it's a good idea to base the hash on the properties you are using to compare equality.
Adrien's example would be better like this:
class test(tuple):
    def __eq__(self,comp):
        return self[0] == comp[0]
    def __ne__(self,comp):
        return self[0] != comp[0]
    def __hash__(self):
        return hash((self[0],))
Simply leverage the hash of the tuple containing the stuff we care about for equality
You will need to make your type hashable, which means implementing the __hash__() method in your class deriving from tuple.
For example:
class test(tuple):
    def __eq__(self,comp):
        return self[0] == comp[0]
    def __ne__(self,comp):
        return self[0] != comp[0]
    def __hash__(self):
        return hash(self[0])
and this is what it looks like now:
>>> set([test([1,]),test([2,]),test([3,])])
{(1,), (2,), (3,)}
>>> hash(test([1,]))
1
note: you should absolutely read the documentation for the __hash__() function, in order to understand the relationship between the comparison operators and the hash computation.
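The guess in the question matches documented Python 3 behaviour: a class that overrides __eq__ without defining __hash__ gets its __hash__ set to None. A small sketch, with the hypothetical class names Unhashable and Restored, shows both the problem and the fix discussed above:

```python
class Unhashable(tuple):
    # Overriding __eq__ without __hash__ sets __hash__ to None in Python 3.
    def __eq__(self, other):
        return self[0] == other[0]


class Restored(tuple):
    def __eq__(self, other):
        return self[0] == other[0]

    def __ne__(self, other):
        return self[0] != other[0]

    # Hash only the field that __eq__ compares, so equal objects hash alike.
    def __hash__(self):
        return hash((self[0],))


print(Unhashable.__hash__)

try:
    {Unhashable([1, "x"])}
    unhashable_error = None
except TypeError as exc:
    unhashable_error = exc
print(unhashable_error)

# The first two compare equal (same first item), so the set keeps one of them.
items = {Restored([1, "x"]), Restored([1, "y"]), Restored([2, "z"])}
print(len(items))
```

If the subclass should keep the parent's hashing unchanged, the data model documentation describes assigning __hash__ = tuple.__hash__ in the class body instead; that is only consistent when __eq__ still agrees with tuple equality.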

What's a correct and good way to implement __hash__()?

What's a correct and good way to implement __hash__()?
I am talking about the function that returns a hashcode that is then used to insert objects into hashtables aka dictionaries.
As __hash__() returns an integer and is used for "binning" objects into hashtables I assume that the values of the returned integer should be uniformly distributed for common data (to minimize collisions).
What's a good practice to get such values? Are collisions a problem?
In my case I have a small class which acts as a container class holding some ints, some floats and a string.
An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.
Here's an example of using a key for hash and equality:
class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)
    def __hash__(self):
        return hash(self.__key())
    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented
Also, the documentation of __hash__ has more information, that may be valuable in some particular circumstances.
John Millikin proposed a solution similar to this:
class A(object):
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c
    def __eq__(self, othr):
        return (isinstance(othr, type(self))
                and (self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
    def __hash__(self):
        return hash((self._a, self._b, self._c))
The problem with this solution is that the hash(A(a, b, c)) == hash((a, b, c)). In other words, the hash collides with that of the tuple of its key members. Maybe this does not matter very often in practice?
Update: the Python docs now recommend using a tuple as in the example above. Note that the documentation states:
The only required property is that objects which compare equal have the same hash value
Note that the opposite is not true. Objects which do not compare equal may have the same hash value. Such a hash collision will not cause one object to replace another when used as a dict key or set element as long as the objects do not also compare equal.
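A short sketch of that point, using a hypothetical Point class whose hash is deliberately constant so that every instance collides. Unequal objects still coexist as dict keys, and only an equal key replaces an existing entry:

```python
class Point:
    """Hypothetical class with a deliberately poor hash (always 0)."""

    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        if isinstance(other, Point):
            return self.x == other.x
        return NotImplemented

    def __hash__(self):
        return 0  # every instance collides with every other


d = {Point(1): "one", Point(2): "two"}   # same hash, not equal: two entries
d[Point(1)] = "uno"                      # equal to an existing key: replaces it
print(len(d), d[Point(1)], d[Point(2)])
```

The constant hash is correct but slow, since every lookup degrades to a sequence of equality checks.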
Outdated/bad solution
The Python documentation on __hash__ suggests to combine the hashes of the sub-components using something like XOR, which gives us this:
class B(object):
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c
    def __eq__(self, othr):
        if isinstance(othr, type(self)):
            return ((self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
        return NotImplemented
    def __hash__(self):
        return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
                hash((self._a, self._b, self._c)))
Update: as Blckknght points out, changing the order of a, b, and c could cause problems. I added an additional ^ hash((self._a, self._b, self._c)) to capture the order of the values being hashed. This final ^ hash(...) can be removed if the values being combined cannot be rearranged (for example, if they have different types and therefore the value of _a will never be assigned to _b or _c, etc.).
Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that
for c in some_string:
    hash = 101 * hash + ord(c)
worked surprisingly well for a wide variety of strings. I've found that similar polynomial techniques work well for computing a hash of disparate subfields.
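A runnable sketch of that polynomial scheme follows. The function name poly_hash, the starting value 0 and the 64-bit mask are assumptions added here; the snippet above does not say how the accumulator is initialised or bounded.

```python
def poly_hash(text, multiplier=101, bits=64):
    """Polynomial string hash: h = multiplier * h + ord(ch), kept to `bits` bits."""
    mask = (1 << bits) - 1
    h = 0  # assumed starting value
    for ch in text:
        # Masking keeps the result in a fixed range; Python ints would
        # otherwise grow without bound.
        h = (multiplier * h + ord(ch)) & mask
    return h


print(poly_hash(""))
print(poly_hash("a"))
print(poly_hash("ab"))
print(poly_hash("ab") != poly_hash("ba"))  # order of characters matters
```

Because each step multiplies the running value before adding the next character, reordering the input generally changes the result, which is the property the XOR-based combination lacks.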
A good way to implement hash (as well as list, dict, tuple) is to make the object have a predictable order of items by making it iterable using __iter__. So to modify an example from above:
class A:
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c
    def __iter__(self):
        yield "a", self._a
        yield "b", self._b
        yield "c", self._c
    def __hash__(self):
        return hash(tuple(self))
    def __eq__(self, other):
        return (isinstance(other, type(self))
                and tuple(self) == tuple(other))
(here __eq__ is not required for hash, but it's easy to implement).
Now add some members to see how it works:
a = 2; b = 2.2; c = 'cat'
hash(A(a, b, c)) # -5279839567404192660
dict(A(a, b, c)) # {'a': 2, 'b': 2.2, 'c': 'cat'}
list(A(a, b, c)) # [('a', 2), ('b', 2.2), ('c', 'cat')]
tuple(A(a, b, c)) # (('a', 2), ('b', 2.2), ('c', 'cat'))
Things only fall apart if you try to put non-hashable members in the object model:
hash(A(a, b, [1])) # TypeError: unhashable type: 'list'
I can try to answer the second part of your question.
The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you'll get collisions on insertion.
In addition, I would think that collisions would be resolved by the collection internally, and there are many methods to resolve collisions. The simplest (and worst) is, given an entry to insert at index i, add 1 to i until you find an empty spot and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find!
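The "add 1 to i until you find an empty spot" strategy is usually called linear probing. A toy sketch, with the hypothetical class name ProbingTable, a fixed size and no deletion support:

```python
class ProbingTable:
    """Toy open-addressing table using linear probing (fixed size, no deletion)."""

    _EMPTY = object()

    def __init__(self, size=8):
        self._slots = [self._EMPTY] * size

    def _probe(self, key):
        # Yield slot indices starting at the key's home slot, wrapping around.
        size = len(self._slots)
        start = hash(key) % size
        for step in range(size):
            yield (start + step) % size

    def put(self, key, value):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is self._EMPTY or slot[0] == key:
                self._slots[i] = (key, value)
                return i          # report where the entry ended up
        raise RuntimeError("table is full")

    def get(self, key):
        for i in self._probe(key):
            slot = self._slots[i]
            if slot is self._EMPTY:
                break             # an empty slot ends the probe sequence
            if slot[0] == key:
                return slot[1]
        raise KeyError(key)


table = ProbingTable(size=8)
first = table.put(3, "a")     # home slot is 3
second = table.put(11, "b")   # 11 % 8 is also 3, so it moves on to slot 4
print(first, second, table.get(11))
```

Retrieval repeats the same walk, which is why long runs of occupied slots make some lookups slow.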
Other collision resolution methods reduce the retrieval time by moving entries in the hash table when an item is inserted, to spread things out. This increases the insertion time but assumes you read more than you insert. There are also methods that try to branch different colliding entries out so that entries do not cluster in one particular spot.
Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.
In short, depending on what you're using the hash code for you may have to implement your own collision resolution method. If you're not storing them in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If so, you can make sure your container is bigger than it needs to be (the bigger the better of course) depending on your memory concerns.
Here are some links if you're interested more:
coalesced hashing on wikipedia
Wikipedia also has a summary of various collision resolution methods:
Also, "File Organization And Processing" by Tharp covers a lot of collision resolution methods extensively. IMO it's a great reference for hashing algorithms.
A very good explanation on when and how implement the __hash__ function is on programiz website:
Just a screenshot to provide an overview:
(Retrieved 2019-12-13)
As for a personal implementation of the method, the above mentioned site provides an example that matches the answer of millerdev.
class Person:
    def __init__(self, age, name):
        self.age = age
        self.name = name
    def __eq__(self, other):
        return self.age == other.age and self.name == other.name
    def __hash__(self):
        print('The hash is:')
        return hash((self.age, self.name))
person = Person(23, 'Adam')
print(hash(person))
Depends on the size of the hash value you return. It's simple logic that if you need to return a 32bit int based on the hash of four 32bit ints, you're gonna get collisions.
I would favor bit operations. Like, the following C pseudo code:
int a;
int b;
int c;
int d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0) | (d & 0x000F000F);
Such a system could work for floats too, if you simply took them as their bit value rather than actually representing a floating-point value, maybe better.
For strings, I've got little/no idea.
@dataclass(frozen=True) (Python 3.7)
This awesome new feature, among other good things, automatically defines a __hash__ and __eq__ method for you, making it just work as usually expected in dicts and sets:
dataclass_cheat.py
from dataclasses import dataclass, FrozenInstanceError
@dataclass(frozen=True)
class MyClass1:
    n: int
    s: str
@dataclass(frozen=True)
class MyClass2:
    n: int
    my_class_1: MyClass1
d = {}
d[MyClass1(n=1, s='a')] = 1
d[MyClass1(n=2, s='a')] = 2
d[MyClass1(n=2, s='b')] = 3
d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] = 4
d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] = 5
d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] = 6
assert d[MyClass1(n=1, s='a')] == 1
assert d[MyClass1(n=2, s='a')] == 2
assert d[MyClass1(n=2, s='b')] == 3
assert d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] == 4
assert d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] == 5
assert d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] == 6
# Due to `frozen=True`
o = MyClass1(n=1, s='a')
try:
    o.n = 2
except FrozenInstanceError as e:
    pass
else:
    raise AssertionError('expected FrozenInstanceError')
As we can see in this example, the hashes are being calculated based on the contents of the objects, and not simply on the addresses of instances. This is why something like:
d = {}
d[MyClass1(n=1, s='a')] = 1
assert d[MyClass1(n=1, s='a')] == 1
works even though the second MyClass1(n=1, s='a') is a completely different instance from the first with a different address.
frozen=True is needed here: with the default eq=True and without frozen=True, the dataclass sets __hash__ to None and its instances are not hashable. Otherwise users could inadvertently make containers inconsistent by modifying objects after they are used as keys. Further documentation: https://docs.python.org/3/library/dataclasses.html
Tested on Python 3.10.7, Ubuntu 22.10.
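As a small illustration of the frozen=True requirement, the following sketch compares a default dataclass with a frozen one. The class names Mutable and Frozen are hypothetical and not taken from the answer above.

```python
from dataclasses import dataclass


@dataclass
class Mutable:
    n: int


@dataclass(frozen=True)
class Frozen:
    n: int


# A default dataclass defines __eq__ but has __hash__ set to None.
try:
    hash(Mutable(1))
    mutable_hashable = True
except TypeError:
    mutable_hashable = False

# A frozen dataclass hashes by field values, so equal instances hash alike.
frozen_equal_hash = hash(Frozen(1)) == hash(Frozen(1))

print(mutable_hashable, frozen_equal_hash)
```

The dataclasses documentation also describes an unsafe_hash=True option for mutable classes, which leaves it to the author to ensure that instances are not modified while they serve as keys.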
