Hashing Custom Objects - python

The hash() function in Python maps immutable objects to hash values. However, I cannot understand the behavior of hash() for objects of user-defined classes. Some resources say that if a user-defined class does not define the __hash__() and __eq__() methods, its objects cannot be hashed. Others claim the opposite.
In other words, what is the role of the __eq__() and __hash__() methods when hashing custom objects?

If you don't implement __hash__, hash() will use the default implementation (based on the object's identity). If you don't implement __eq__, the default identity comparison will be used when you compare two instances.
class C:
    pass

class D:
    def __hash__(self):
        return 1
    def __eq__(self, other):
        return True

print(hash(C()))   # changing for every C instance
print(C() == C())  # False since objects are different
print(hash(D()))   # always 1
print(D() == D())  # always True

Basically, __hash__ should be quick, and act as a "triage" calculation: if two hashes differ, the objects are definitely not equal.
__eq__ should be the function that tells you precisely whether the objects are definitely equal or not. This function may have to perform a lot of checks (for instance, if you define the equality of your objects as the equality of all their member fields, and there are a lot of them).
The purpose of these two functions is to have a quick way of saying "no, they are not equal" (the hash function), since comparisons are used a lot, and most of the time two objects are not equal.
Instead of executing a lot of expensive __eq__ calls, you execute a lot of quick __hash__ calls, and only if both hashes match do you execute __eq__ to confirm the equality or not.
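As a rough illustration (the Point class below is my own made-up example, not from the question), a set lookup calls __hash__ first and only falls back to __eq__ when the hashes match:

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __hash__(self):
        # cheap "triage": different hashes mean definitely not equal
        print("hash called")
        return hash((self.x, self.y))

    def __eq__(self, other):
        # potentially expensive check, only run when the hashes match
        print("eq called")
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

s = {Point(1, 2)}
print(Point(3, 4) in s)  # hashes differ, so __eq__ is normally skipped
print(Point(1, 2) in s)  # hashes match, __eq__ confirms the equality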

Related

Is there a hash of a class instance in Python?

Let's suppose I have a class like this:
class MyClass:
    def __init__(self, a):
        self._a = a
And I construct such instances:
obj1 = MyClass(5)
obj2 = MyClass(12)
obj3 = MyClass(5)
Is there a general way to hash my objects such that objects constructed with same values have equal hashes? In this case:
myhash(obj1) != myhash(obj2)
myhash(obj1) == myhash(obj3)
By general I mean a Python function that can work with objects created by any class I can define. For different classes and the same values, the hash function should of course return different results; otherwise this question would just be about hashing several arguments instead.
def myhash(obj):
    items = sorted(obj.__dict__.items(), key=lambda it: it[0])
    return hash((type(obj),) + tuple(items))
This solution obviously has limitations:
It assumes that all fields in __dict__ are important.
It assumes that __dict__ is present, e.g. this won't work with __slots__.
It assumes that all values are hashable.
It breaks the Liskov substitution principle.
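For what it's worth, a quick sanity check with the MyClass instances from the question (within the limitations above; hash inequality for different values is only "almost always", since collisions are possible):

obj1 = MyClass(5)
obj2 = MyClass(12)
obj3 = MyClass(5)

print(myhash(obj1) != myhash(obj2))  # True (barring a hash collision)
print(myhash(obj1) == myhash(obj3))  # True: same class, same field values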
The question is badly formed for a couple of reasons:
Hashes don't test equality, just inequality. That is, they guarantee that hash(a) != hash(b) implies a != b, but the reverse does not hold. For example, checking "aKey" in myDict will do a linear search through all keys in myDict that have the same hash as "aKey".
You seem to want to do something with storage. Note that the hash of "aKey" will change between runs (string hashing is randomized by default), so don't write it to a file. See the bottom of the __hash__ documentation for more information.
In general, you need to think carefully about subclasses, hashes, and equality. There is a pitfall here, which is why even the official documentation quietly sidesteps what the hash of an instance means. Do note that each instance has a __dict__ for its instance variables and a __class__ with more information.
Hope this helps those who come after you.
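To make the collision point concrete, here is a small made-up sketch (the Key class is hypothetical): every key lands in the same hash bucket, so lookups have to fall back on == to tell the keys apart:

class Key:
    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 0  # force every Key into the same hash bucket

    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name

d = {Key("a"): 1, Key("b"): 2}
print(d[Key("b")])  # 2: the hashes collide, so == decides which key matches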

Why does Python's bool builtin only look at the class-level __bool__ method [duplicate]

The documentation clearly states that
When this method (__bool__) is not defined, __len__() is called, if it is defined, and the object is considered true if its result is nonzero. If a class defines neither __len__() nor __bool__(), all its instances are considered true.
The parenthetical "(__bool__)" is my insertion; the emphasis is mine, but the text is actually there. The fact that the class must contain the method is readily tested by:
class A:
    pass

a = A()
a.__bool__ = (lambda self: False).__get__(a, type(a))
print(bool(A()), bool(a))
The result is True True, as the documentation claims. Overriding __len__ yields the same result:
b = A()
b.__len__ = (lambda self: 0).__get__(b, type(b))
print(bool(A()), bool(b))
This works exactly as the documentation claims it will. However, I find the reasoning behind this to be a little counter-intuitive. I understand that the bool builtin does not look at the methods of the instance, but I do not understand why. Does someone with a knowledge of the internal workings know why only the class-level __bool__ and __len__ methods affect truthiness while instance-level methods are ignored?
The reason is how special methods are looked up.
For custom classes, implicit invocations of special methods are only guaranteed to work correctly if defined on an object’s type, not in the object’s instance dictionary.
...
The rationale behind this behaviour lies with a number of special methods such as __hash__() and __repr__() that are implemented by all objects, including type objects. If the implicit lookup of these methods used the conventional lookup process, they would fail when invoked on the type object itself.
...
In addition to bypassing any instance attributes in the interest of correctness, implicit special method lookup generally also bypasses the __getattribute__() method even of the object’s metaclass.
...
Bypassing the __getattribute__() machinery in this fashion provides significant scope for speed optimisations within the interpreter, at the cost of some flexibility in the handling of special methods (the special method must be set on the class object itself in order to be consistently invoked by the interpreter).
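In practice, the fix is simply to define __bool__ (or __len__) on the class rather than on the instance; a minimal sketch:

class Falsy:
    def __bool__(self):
        return False

f = Falsy()
f.__bool__ = lambda: True  # ignored: special methods are looked up on the type
print(bool(f))             # False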

Python builtin functions aren't really functions, right?

I was just thinking about Python's dict "function" and starting to realize that dict isn't really a function at all. For example, if we do dir(dict), we get all sorts of methods that aren't included in the usual namespace of a user-defined function. Extending that thought, it's similar with dir(list) and dir(len). They aren't functions, but really types. But then I'm confused about the documentation page, http://docs.python.org/2/library/functions.html, which clearly says functions. (I guess it should really just say builtin callables.)
So what gives? (It's starting to seem that the distinction between classes and functions is trivial.)
It's a callable, as are classes in general. Calling dict() is effectively a call to the dict constructor. It is like when you define your own class (C, say) and you call C() to instantiate it.
One way that dict is special, compared to, say, sum, is that though both are callable, and both are implemented in C (in CPython, anyway), dict is a type; that is, isinstance(dict, type) == True. This means that you can use dict as the base class for other types; you can write:
class MyDictSubclass(dict):
    pass
but not
class MySumSubclass(sum):
    pass
This can be useful to make classes that behave almost like a builtin object, but with some enhancements. For instance, you can define a subclass of tuple that implements + as vector addition instead of concatenation:
class Vector(tuple):
    def __add__(self, other):
        return Vector(x + y for x, y in zip(self, other))
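For example (my own quick check of the class above):

v = Vector((1, 2, 3)) + Vector((10, 20, 30))
print(v)  # (11, 22, 33)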
Which brings up another interesting point. type is also implemented in C. It's also callable. Like dict (and unlike sum) it's an instance of type; isinstance(type, type) == True. Because of this weird, seemingly impossible cycle, type can be used to make new classes of classes (called metaclasses). You can write:
class MyTypeSubclass(type):
    pass

class MyClass(object):
    __metaclass__ = MyTypeSubclass
or, in Python 3:
class MyClass(metaclass=MyTypeSubclass):
    pass
Which gives the interesting result that isinstance(MyClass, MyTypeSubclass) == True. How this is useful is a bit beyond the scope of this answer, though.
dict() is a constructor for a dict instance. When you do dir(dict) you're looking at the attributes of class dict. When you write a = dict() you're setting a to a new instance of type dict.
I'm assuming here that dict() is what you're referring to as the "dict function". Or are you calling an indexed instance of dict, e.g. a['my_key'] a function?
Note that calling dir on the constructor dict.__init__
dir(dict.__init__)
gives you what you would expect, including the same stuff as you'd get for any other function. Since a call to the dict() constructor results in a call to dict.__init__(instance), that explains where those function attributes went. (Of course there's a little extra behind-the-scenes work in any constructor, but that's the same for dicts as for any object.)

python compare 2 similar objects with duck typing

Maybe my design is totally out of whack, but if I have two derived-class objects that are comparable, class D1 will basically always be > class D2 (say, comparing Ivy Bridge to a 286). How would I implement class D1's comparison to reflect that without using isinstance(other, D2)?
I saw this:
Comparing two objects
and
If duck-typing in Python, should you test isinstance?
I could add a "type" attribute, and then compare the types, but then I might as well use isinstance. The easiest way would be to use isinstance... Any better suggestions?
I would ask myself "what is it about D1 that makes it always greater than D2?" In other words, do they have some common attribute that it would make sense to base the comparison on? If there is no good answer to this question, it might be worth asking whether creating comparisons for these two objects actually makes sense.
IF, after considering these things, you still think that doing the comparison is a good idea, then just use isinstance. There's a reason it still exists in the language -- and Python is constantly deprecating things that are considered bad practice, which implies that isinstance isn't always a bad thing.
The problem is when isinstance is used to do type checking unnecessarily. In other words, users often use it in a "Look before you leap" context which is completely unnecessary.
if not isinstance(arg, Foo):
    raise ValueError("I want a Foo")
In this case, if the user doesn't put something that looks enough like a Foo into the function, it will raise an exception anyway. Why restrict it to only Foo objects? In your case, however, it seems like the type of the objects actually matters from a conceptual standpoint. This is why isinstance exists (in my opinion).
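To illustrate that point (Foo and its frobnicate() method are hypothetical names, not from the question):

class Foo:
    def frobnicate(self):
        return "frobnicated"

# "Look before you leap": rejects anything that isn't literally a Foo
def process_lbyl(arg):
    if not isinstance(arg, Foo):
        raise ValueError("I want a Foo")
    return arg.frobnicate()

# Duck-typed: anything with a frobnicate() method works, and anything
# without one raises an AttributeError on its own anyway
def process_duck(arg):
    return arg.frobnicate()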
I would do something like this:
class D1(object):
    def __val_cmp(self, other):
        # compare by attributes here
        if self.attr < other.attr:
            return -1
        elif self.attr > other.attr:
            return 1
        return 0

    def __cmp__(self, other):
        greater = isinstance(other, type(self))
        lesser = isinstance(self, type(other))
        if greater and lesser:
            # same type, so compare by attributes
            return self.__val_cmp(other)
        elif greater:
            return 1
        elif lesser:
            return -1
        else:
            # other type is not a parent or child type, so just compare by attributes
            return self.__val_cmp(other)
If D2 is a subtype of D1, instances of D2 will always compare less than instances of D1.
If D0 is a parent type of D1, instances of D0 will always compare greater than instances of D1.
If you compare an instance of D1 to another instance of D1, the comparison will be by the class's attributes.
If you compare an instance of D1 to an instance of an unknown class, the comparison will be by the class's attributes.
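Note that __cmp__ only exists in Python 2. A rough Python 3 sketch of the same ordering logic (the attr attribute is a stand-in for whatever you actually compare on):

from functools import total_ordering

@total_ordering
class D1:
    def __init__(self, attr):
        self.attr = attr

    def _type_rank(self, other):
        # 1: other is a subclass instance, so self ranks higher
        # -1: other is a parent-class instance, so self ranks lower
        # 0: same type or unrelated type, fall back to attributes
        greater = isinstance(other, type(self))
        lesser = isinstance(self, type(other))
        if greater and not lesser:
            return 1
        if lesser and not greater:
            return -1
        return 0

    def __eq__(self, other):
        return self._type_rank(other) == 0 and self.attr == other.attr

    def __lt__(self, other):
        rank = self._type_rank(other)
        return rank == -1 or (rank == 0 and self.attr < other.attr)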

Python: detect duplicates using a set

I have a large number of objects I need to store in memory for processing in Python. Specifically, I'm trying to remove duplicates from a large set of objects. I want to consider two objects "equal" if a certain instance variable in the object is equal. So, I assumed the easiest way to do this would be to insert all my objects into a set, and override the __hash__ method so that it hashes the instance variable I'm concerned with.
So, as a test I tried the following:
class Person:
    def __init__(self, n, a):
        self.name = n
        self.age = a

    def __hash__(self):
        return hash(self.name)

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)

myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000))  # try adding a duplicate
for p in myset:
    print(p)
Here, I define a Person class, and any two instances of Person with the same name variable are to be equal, regardless of the value of any other instance variable. Unfortunately, this outputs:
baz:30
foo:10
bar:20
foo:1000
Note that foo appears twice, so this program failed to notice duplicates. Yet the expression hash(Person("foo", 10)) == hash(Person("foo", 1000)) is True. So why doesn't this properly detect duplicate Person objects?
You forgot to also define __eq__().
If a class does not define a __cmp__() or __eq__() method it should not define a __hash__() operation either; if it defines __cmp__() or __eq__() but not __hash__(), its instances will not be usable in hashed collections. If a class defines mutable objects and implements a __cmp__() or __eq__() method, it should not implement __hash__(), since hashable collection implementations require that an object's hash value is immutable (if the object's hash value changes, it will be in the wrong hash bucket).
A set obviously will have to deal with hash collisions. If the hash of two objects matches, the set will compare them using the == operator to make sure they are really equal. In your case, this will only yield True if the two objects are the same object (the standard implementation for user-defined classes).
Long story short: Also define __eq__() to make it work.
A hash function alone is not enough to distinguish objects; you also have to implement the comparison function (i.e. __eq__).
A hash function effectively says "A maybe equals B" or "A not equals B (for sure)".
If it says "maybe equals" then equality has to be checked anyway to make sure, which is why you also need to implement __eq__.
Nevertheless, defining __hash__ will significantly speed things up by making "A not equal B (for sure)" an O(1) operation.
The hash function must however always follow the "hash rule":
"hash rule": equal things must hash to the same value
(justification: or else we'd say "A not equals B (for sure)" when that is not the case)
For example you could hash everything by def __hash__(self): return 1. This would still be correct, but it would be inefficient because you'd have to check __eq__ each time, which may be a long process if you have complicated large data structures (e.g. with large lists, dictionaries, etc.).
Do note that you technically already follow the "hash rule" by ignoring age in your implementation (def __hash__(self): return hash(self.name)). If Bob is a person of age 20 and Bob is another person of age 30 and they are different people (likely, unless this is some sort of keeps-track-of-people-over-time-as-they-age program), then they will hash to the same value and have to be compared with __eq__. This is perfectly fine, but I would implement it like so:
def __hash__(self):
    return hash((self.name, self.age))
Do note that your way is still correct. It would however have been a coding error to use hash((self.name, self.age)) in a world where Person("Bob", 20) and Person("Bob", 30) were actually the same person, because the hash function would say they're different while the equals function would say they're equal (and __eq__ would never even get a chance to run, since the differing hashes already separate them).
You also need the __eq__() method.
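Putting it together, a minimal sketch of the fix (keeping the question's rule that two Person objects are equal when their names are equal):

class Person:
    def __init__(self, n, a):
        self.name = n
        self.age = a

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, Person) and self.name == other.name

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)

myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000))  # now recognized as a duplicate of foo:10
print(len(myset))  # 3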
