I have written some code in Python which has a class called product and overrides the magic methods __eq__ and __hash__. Now I need to make a set which should remove duplicates from a list based on the ID of the product. As you can see from the output of this code, the hashes of the two objects are the same, yet when I make a set of those two objects the length is 2, not 1.
But when I change the __eq__ method of the code to this:
def __eq__(self, b) -> bool:
    if self.id == b.id:
        return True
    return False
and use it with the same hash function, it works and the length of the set is 1. So I am confused: does the set data structure use the __eq__ method to test for equality, or the __hash__ method?
Equality tests can be expensive, so the set starts by comparing hashes. If the hashes are not equal, then the check ends. If the hashes are equal, the set then tests for equality. If it only used __eq__, it might have to do a lot of unnecessary work, but if it only used __hash__, there would be no way to resolve a hash collision.
Here's a simple example of using equality to resolve a hash collision. Small integers are their own hashes, except for -1, which CPython reserves as an error code:
>>> hash(-1)
-2
>>> hash(-2)
-2
>>> s = set()
>>> s.add(-1)
>>> -2 in s
False
Here's an example of the set skipping an equality check because the hashes aren't equal. Let's subclass int so it returns a new hash every second:
>>> import time
>>> class TimedInt(int):
...     def __hash__(self):
...         return int(time.time())
...
>>> a = TimedInt(5)
>>> a == 5
True
>>> a == a
True
>>> s = set()
>>> s.add(a) # Now wait a few seconds...
>>> a in s
False
Let's say I want to use a set() to store a bunch of objects whose only distinguishing feature is that they exist and are not other instances of the same class. Otherwise they are indistinguishable: no def __eq__(self, other): return self.qux == other.qux, because that qux is the same (or random) for all of them. How do you define __eq__ and __hash__ for such a class?
You don't need to implement either __eq__ or __hash__.
User-defined classes have __eq__() and __hash__() methods by
default; with them, all objects compare unequal (except with
themselves) and x.__hash__() returns an appropriate value such that
x == y implies both that x is y and hash(x) == hash(y).
Source: Data model
The default behaves something like:
class OnlyExists:
    def __eq__(self, other):
        return self is other
    def __hash__(self):
        return id(self)
Because it's unequal to everything else, instances can only be found by identity. Giving a per-instance hash (i.e. not just returning the same hash value for every instance) means that the instances don't all end up in the same "bucket", which would be a catastrophic collision and would make all dictionary/set searches fall to O(n).
>>> class OnlyExists:
...     pass
...
>>> a = OnlyExists()
>>> b = OnlyExists()
>>> s = {a, b}
>>> len(s)
2
>>> a in s
True
>>> b in s
True
>>> OnlyExists() in s
False
I have a class that looks more or less like this:
class Something():
    def __init__(self, a=None, b=None):
        self.a = a
        self.b = b
I want to be able to sort instances in a list; normally I'd just implement a method like this:
def __lt__(self, other):
    return (self.a, self.b) < (other.a, other.b)
But this will raise an error in the following case:
sorted([Something(1, None), Something(1, 1)])
What I want is for None values to be treated as greater than any other value, i.e. the following output:
[Something(1,1),Something(1,None)]
The first thing that comes to my mind is to change __lt__ to:
def __lt__(self, other):
    if self.a and other.a:
        if self.a != other.a:
            return self.a < other.a
    elif self.a is None:
        return True
    elif other.a is None:
        return False
    if self.b and other.b:
        if self.b != other.b:
            return self.b < other.b
    elif self.b is None:
        return True
    return False
This would give me the correct results, but it's ugly, and Python usually has a simpler way. I also don't want to repeat it for every attribute I sort on in my full class (omitted here to keep the problem clear).
So what is the pythonic way of solving this?
Note: I also tried the following, but I'm assuming something even better is possible:
def __lt__(self, other):
    sorting_attributes = ['a', 'b']
    for attribute in sorting_attributes:
        self_value = getattr(self, attribute)
        other_value = getattr(other, attribute)
        if self_value and other_value:
            if self_value != other_value:
                return self_value < other_value
        elif self_value is None:
            return True
        elif other_value is None:
            return False
I'm really trying to internalize the Zen of Python, and I know my code is ugly, so how do I fix it?
A completely different design I thought of later (posted separately because it's so different it should really be evaluated independently):
Map all your attributes to tuples, where the first element of every tuple is a bool based on the None-ness of the attribute, and the second is the attribute value itself. None/non-None mismatches short-circuit on the bool representing None-ness, preventing the TypeError; everything else falls back to comparing the actual values:
def __lt__(self, other):
    def _key(attr):
        # attr is None makes None sort as greater than everything;
        # use attr is not None instead to make it less than everything
        return (attr is None, attr)
    return (_key(self.a), _key(self.b)) < (_key(other.a), _key(other.b))
Probably slightly slower than my other solution in the case where no None/non-None pair occurs, but much simpler code. It also has the advantage of continuing to raise TypeErrors when mismatched types other than None/non-None arise, rather than potentially misbehaving. I'd definitely call this one my Pythonic solution, even if it is slightly slower in the common case.
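A quick check of the ordering, assuming the question's Something picks up this __lt__ (plus a __repr__ for display, not shown):
>>> sorted([Something(1, None), Something(1, 1), Something(None, 0)])
[Something(1, 1), Something(1, None), Something(None, 0)]
The first pair of keys, (False, 1) versus (True, None), decides on the bool alone, so None is never compared to an int.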
An easy way to do this, assuming the attributes are numeric, is to convert None to infinity, i.e. float('inf'):
def __lt__(self, other):
    def convert(i):
        return float('inf') if i is None else i
    return [convert(i) for i in (self.a, self.b)] < [convert(i) for i in (other.a, other.b)]
A solution for the general case (where there may not be a convenient "bigger than any value" solution, and you don't want the code to grow more complex as the number of attributes increases), which still operates as fast as possible in the presumed common case of no None values. It does assume TypeError means None was involved, so if you're likely to have mismatched types besides None, this gets more complicated, but frankly, a class design like that is painful to contemplate. This works for any scenario with two or more keys (so attrgetter returns a tuple) and only requires changing the names used to construct the attrgetter to add or remove fields to compare.
import operator

def __lt__(self, other, _key=operator.attrgetter('a', 'b')):
    # Get the keys once for both inputs (avoids repeated attribute lookup)
    sattrs = _key(self)
    oattrs = _key(other)
    try:
        return sattrs < oattrs  # Fast path for no Nones or only paired Nones
    except TypeError:
        for sattr, oattr in zip(sattrs, oattrs):
            # Only care if exactly one is None; until then the pairs must be
            # equal, or the TypeError wouldn't have occurred because the
            # tuple comparison would have short-circuited
            if (sattr is None) ^ (oattr is None):
                # Exactly one is None, so if it's the right side, self is lesser
                return oattr is None
        # TypeError implied we should see a mismatch, so assert this to be sure
        # we didn't have a non-None related type mismatch
        assert False, "TypeError raised, but no None/non-None pair seen"
A useful feature of this design is that under no circumstances are rich comparisons invoked for any given attribute more than once; the failed attempt at the fast path proves that there must (assuming the invariant that types are either mutually comparable or None holds) be a run of zero or more attribute pairs with equal values, followed by a None/non-None mismatch. Since everything before the mismatch is known to be equal, we don't need to invoke potentially expensive rich comparisons again; we just do cheap identity tests to find the None/non-None mismatch and return based on which side was None.
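For instance, assuming the question's Something carries this __lt__ with keys 'a' and 'b':
>>> Something(1, 2) < Something(2, None)   # fast path: 1 < 2 decides before b is touched
True
>>> Something(1, 1) < Something(1, None)   # TypeError caught; None treated as greater
True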
What's a correct and good way to implement __hash__()?
I am talking about the function that returns a hashcode that is then used to insert objects into hashtables aka dictionaries.
As __hash__() returns an integer and is used for "binning" objects into hashtables I assume that the values of the returned integer should be uniformly distributed for common data (to minimize collisions).
What's a good practice to get such values? Are collisions a problem?
In my case I have a small class which acts as a container class holding some ints, some floats and a string.
An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.
Here's an example of using a key for hash and equality:
class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented
Also, the documentation of __hash__ has more information, that may be valuable in some particular circumstances.
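For example, assuming A also gets a plain __init__ that stores attr_a, attr_b, and attr_c (not shown above), equal-keyed instances deduplicate as expected:
>>> len({A(1, 2.5, 'x'), A(1, 2.5, 'x')})
1
>>> A(1, 2.5, 'x') == A(1, 2.5, 'y')
False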
John Millikin proposed a solution similar to this:
class A(object):
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        return (isinstance(othr, type(self))
                and (self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))

    def __hash__(self):
        return hash((self._a, self._b, self._c))
The problem with this solution is that hash(A(a, b, c)) == hash((a, b, c)): the hash collides with that of the tuple of its key members. Maybe this does not matter very often in practice?
Update: the Python docs now recommend to use a tuple as in the example above. Note that the documentation states
The only required property is that objects which compare equal have the same hash value
Note that the opposite is not true. Objects which do not compare equal may have the same hash value. Such a hash collision will not cause one object to replace another when used as a dict key or set element as long as the objects do not also compare equal.
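A small sketch of both points, using the class A proposed above: the instance hashes exactly like its key tuple, yet the two never compare equal, so both can coexist as dict keys:
x = A(1, 2, 3)
assert hash(x) == hash((1, 2, 3))        # the collision described above
d = {x: 'instance', (1, 2, 3): 'tuple'}
assert len(d) == 2                       # __eq__ keeps them distinct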
Outdated/bad solution
The Python documentation on __hash__ suggests to combine the hashes of the sub-components using something like XOR, which gives us this:
class B(object):
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        if isinstance(othr, type(self)):
            return ((self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
        return NotImplemented

    def __hash__(self):
        return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
                hash((self._a, self._b, self._c)))
Update: as Blckknght points out, changing the order of a, b, and c could cause problems. I added an additional ^ hash((self._a, self._b, self._c)) to capture the order of the values being hashed. This final ^ hash(...) can be removed if the values being combined cannot be rearranged (for example, if they have different types and therefore the value of _a will never be assigned to _b or _c, etc.).
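The order problem is easy to see directly: XOR is commutative, so swapping member values yields an identical combined hash, while the added tuple term is order-sensitive:
# Plain XOR of the member hashes is order-insensitive...
assert hash(1) ^ hash(2) ^ hash(3) == hash(2) ^ hash(1) ^ hash(3)
# ...while the tuple hash distinguishes the orderings (in CPython)
assert hash((1, 2, 3)) != hash((2, 1, 3))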
Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that
for c in some_string:
    hash = 101 * hash + ord(c)
worked surprisingly well for a wide variety of strings. I've found that similar polynomial techniques work well for computing a hash of disparate subfields.
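A minimal Python rendering of that polynomial technique, for illustration; the 64-bit mask is my own addition to keep the value bounded:
def poly_hash(s, multiplier=101):
    h = 0
    for c in s:
        # Polynomial rolling hash: scale the previous hash, mix in the next char
        h = (multiplier * h + ord(c)) & 0xFFFFFFFFFFFFFFFF
    return h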
A good way to implement hash (as well as conversion to list, dict, or tuple) is to give the object a predictable order of items by making it iterable with __iter__. So, to modify an example from above:
class A:
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __iter__(self):
        yield "a", self._a
        yield "b", self._b
        yield "c", self._c

    def __hash__(self):
        return hash(tuple(self))

    def __eq__(self, other):
        return (isinstance(other, type(self))
                and tuple(self) == tuple(other))
(here __eq__ is not required for hash, but it's easy to implement).
Now plug in some member values to see how it works:
a = 2; b = 2.2; c = 'cat'
hash(A(a, b, c)) # -5279839567404192660
dict(A(a, b, c)) # {'a': 2, 'b': 2.2, 'c': 'cat'}
list(A(a, b, c)) # [('a', 2), ('b', 2.2), ('c', 'cat')]
tuple(A(a, b, c)) # (('a', 2), ('b', 2.2), ('c', 'cat'))
Things only fall apart if you try to put non-hashable members in the object:
hash(A(a, b, [1])) # TypeError: unhashable type: 'list'
I can try to answer the second part of your question.
The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you'll get collisions on insertion.
In addition, I would think that collisions would be resolved by the collection internally, and there are many methods to resolve collisions. The simplest (and worst) is, given an entry to insert at index i, add 1 to i until you find an empty spot and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find!
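A bare-bones sketch of that simplest scheme (linear probing with open addressing; the names here are illustrative, not any real library's API):
def probe_insert(table, key):
    # Fixed-size table where None marks an empty slot
    i = hash(key) % len(table)
    while table[i] is not None:   # collision: step forward, wrapping around
        i = (i + 1) % len(table)
    table[i] = key

table = [None] * 8
for k in (3, 11, 19):             # all map to index 3 in a table of 8
    probe_insert(table, k)        # 3 lands at 3, 11 at 4, 19 at 5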
Other collision resolution methods reduce the retrieval time by moving entries in the hash table when an item is inserted to spread things out. This increases the insertion time but assumes you read more than you insert. There are also methods that try to branch different colliding entries out so that entries don't cluster in one particular spot.
Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.
In short, depending on what you're using the hash code for you may have to implement your own collision resolution method. If you're not storing them in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If so, you can make sure your container is bigger than it needs to be (the bigger the better of course) depending on your memory concerns.
Here are some links if you're interested more:
Coalesced hashing on Wikipedia
Wikipedia also has a summary of various collision resolution methods.
Also, "File Organization And Processing" by Tharp covers a lot of collision resolution methods extensively. IMO it's a great reference for hashing algorithms.
A very good explanation of when and how to implement the __hash__ function is on the Programiz website (retrieved 2019-12-13).
As for a personal implementation of the method, the above-mentioned site provides an example that matches the answer of millerdev:
class Person:
    def __init__(self, age, name):
        self.age = age
        self.name = name

    def __eq__(self, other):
        return self.age == other.age and self.name == other.name

    def __hash__(self):
        print('The hash is:')
        return hash((self.age, self.name))

person = Person(23, 'Adam')
print(hash(person))
It depends on the size of the hash value you return. It's simple logic: if you need to return a 32-bit int based on the hash of four 32-bit ints, you're going to get collisions.
I would favor bit operations, like the following C pseudo code:
int a;
int b;
int c;
int d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0) | (d & 0x000F000F);
Such a system could work for floats too, if you simply took them as their bit value rather than actually representing a floating-point value; maybe better.
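In Python, "taking a float as its bit value" can be done with the struct module; a sketch (how you then mask and combine the bits is left to taste):
import struct

def float_bits(x):
    # Reinterpret the 8 bytes of a double as an unsigned 64-bit integer
    return struct.unpack('<Q', struct.pack('<d', x))[0]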
For strings, I've got little/no idea.
@dataclass(frozen=True) (Python 3.7)
This awesome new feature, among other good things, automatically defines a __hash__ and __eq__ method for you, making it just work as usually expected in dicts and sets:
dataclass_cheat.py
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class MyClass1:
    n: int
    s: str

@dataclass(frozen=True)
class MyClass2:
    n: int
    my_class_1: MyClass1

d = {}
d[MyClass1(n=1, s='a')] = 1
d[MyClass1(n=2, s='a')] = 2
d[MyClass1(n=2, s='b')] = 3
d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] = 4
d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] = 5
d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] = 6
assert d[MyClass1(n=1, s='a')] == 1
assert d[MyClass1(n=2, s='a')] == 2
assert d[MyClass1(n=2, s='b')] == 3
assert d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] == 4
assert d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] == 5
assert d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] == 6

# Due to `frozen=True`
o = MyClass1(n=1, s='a')
try:
    o.n = 2
except FrozenInstanceError:
    pass
else:
    raise AssertionError('expected FrozenInstanceError')
As we can see in this example, the hashes are being calculated based on the contents of the objects, and not simply on the addresses of instances. This is why something like:
d = {}
d[MyClass1(n=1, s='a')] = 1
assert d[MyClass1(n=1, s='a')] == 1
works even though the second MyClass1(n=1, s='a') is a completely different instance from the first with a different address.
frozen=True is mandatory; without it the class is not hashable, since otherwise users could inadvertently make containers inconsistent by modifying objects after they are used as keys. Further documentation: https://docs.python.org/3/library/dataclasses.html
Tested on Python 3.10.7, Ubuntu 22.10.
I'm trying to write some Python code that includes union/intersection of sets that potentially can be very large. Much of the time, these sets will be essentially set(xrange(1<<32)) or something of the kind, but often there will be ranges of values that do not belong in the set (say, 'bit 5 cannot be clear'), or extra values thrown in. For the most part, the set contents can be expressed algorithmically.
I can go in and do the dirty work to subclass set and create something, but I feel like this must be something that's been done before, and I don't want to spend days on wheel reinvention.
Oh, and just to make it harder, once I've created the set, I need to be able to iterate over it in random order. Quickly. Even if the set has a billion entries. (And that billion-entry set had better not actually take up gigabytes, because I'm going to have a lot of them.)
Is there code out there? Anyone have neat tricks? Am I asking for the moon?
You say:
For the most part, the set contents can be expressed algorithmically.
How about writing a class which presents the entire set API but determines set inclusion algorithmically, plus a number of classes which wrap other sets to perform the union and intersection algorithmically?
For example, if you had a set a and set b which are instances of these pseudo sets:
>>> u = Union(a, b)
And then you use u with the full set API, which will turn around and query a and b using the correct logic. All the set methods could be designed to return these pseudo unions/intersections automatically so the whole process is transparent.
Edit: Quick example with a very limited API:
class Base(object):
    def union(self, other):
        return Union(self, other)
    def intersection(self, other):
        return Intersection(self, other)

class RangeSet(Base):
    def __init__(self, low, high):
        self.low = low
        self.high = high
    def __contains__(self, value):
        return value >= self.low and value < self.high

class Union(Base):
    def __init__(self, *sets):
        self.sets = sets
    def __contains__(self, value):
        return any(value in x for x in self.sets)

class Intersection(Base):
    def __init__(self, *sets):
        self.sets = sets
    def __contains__(self, value):
        return all(value in x for x in self.sets)

a = RangeSet(0, 10)
b = RangeSet(5, 15)
u = a.union(b)
i = a.intersection(b)

print 3 in u
print 7 in u
print 12 in u
print 3 in i
print 7 in i
print 12 in i
Running gives you:
True
True
True
False
True
False
You are trying to make a set containing all the integer values from 0 to 4,294,967,295. A byte is 8 bits, which gets you to 255; 99.9999940628% of your values are over one byte in size. A crude minimum size for your set, even if you are able to overcome the syntactic issues, is 4 billion bytes, or 4 GB.
You are never going to be able to hold an instance of that set in less than a GB of memory. Even with compression, it's likely to be a tough squeeze. You are going to have to get much more clever with your math. You may be able to take advantage of some properties of the set; after all, it's a very special set. What are you trying to do?
If you are using Python 3.0, you can subclass collections.Set.
This sounds like it might overlap with linear programming. In linear programming you are trying to find some optimal case where you add constraints to a set of values (typically integers) which initially can be very large. There are various libraries listed at http://wiki.python.org/moin/NumericAndScientific/Libraries that mention integer and linear programming, but nothing jumps out as being obviously what you want.
I would avoid subclassing set, since clearly you can usefully reuse no part of set's implementation. I would even avoid subclassing collections.Set, since the latter requires you to supply a __len__ -- functionality which you appear not to need otherwise, and which just can't be done effectively in the general case (it's going to be O(N), which, with the kind of sizes you're talking about, is far too slow). You're unlikely to find some existing implementation that matches your use case well enough to be worth reusing, because your requirements are very specific and even peculiar -- the concept of "random iterating where an occasional duplicate is OK", for example, is a really unusual one.
If your specs are complete (you only need union, intersection, and random iteration, plus occasional additions and removals of single items), implementing a special purpose class that fills those specs is not a crazy undertaking. If you have more specs that you have not explicitly mentioned, it will be trickier, but it's hard to guess without hearing all the specs. So for example, something like:
import random

class AbSet(object):
    def __init__(self, predicate, maxitem=1<<32):
        # set of all ints >= 0 and < maxitem satisfying the predicate
        self.maxitem = maxitem
        self.predicate = predicate
        self.added = set()
        self.removed = set()

    def copy(self):
        x = type(self)(self.predicate, self.maxitem)
        x.added = set(self.added)
        x.removed = set(self.removed)
        return x

    def __contains__(self, item):
        if item in self.removed: return False
        if item in self.added: return True
        return (0 <= item < self.maxitem) and self.predicate(item)

    def __iter__(self):
        # random endless iteration
        while True:
            x = random.randrange(self.maxitem)
            if x in self: yield x

    def add(self, item):
        if item < 0 or item >= self.maxitem: raise ValueError
        if item not in self:
            self.removed.discard(item)
            self.added.add(item)

    def discard(self, item):
        if item < 0 or item >= self.maxitem: raise ValueError
        if item in self:
            self.removed.add(item)
            self.added.discard(item)

    def union(self, o):
        pred = lambda v: self.predicate(v) or o.predicate(v)
        x = type(self)(pred, max(self.maxitem, o.maxitem))
        toadd = [v for v in (self.added | o.added) if not pred(v)]
        torem = [v for v in (self.removed | o.removed) if pred(v)]
        x.added = set(toadd)
        x.removed = set(torem)
        return x

    def intersection(self, o):
        pred = lambda v: self.predicate(v) and o.predicate(v)
        x = type(self)(pred, min(self.maxitem, o.maxitem))
        toadd = [v for v in (self.added & o.added) if not pred(v)]
        torem = [v for v in (self.removed & o.removed) if pred(v)]
        x.added = set(toadd)
        x.removed = set(torem)
        return x
I'm not entirely certain about the logic determining added and removed upon union and intersection, but I hope this is a good base for you to work from.
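A hypothetical usage sketch (the predicate and the specific values are made up for illustration):
import itertools

evens = AbSet(lambda v: v % 2 == 0)  # all even ints below 2**32
evens.add(7)                         # an exception to the predicate
evens.discard(2)                     # carve out one even value
assert 7 in evens and 2 not in evens and 4 in evens
sample = list(itertools.islice(iter(evens), 3))  # three random members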