Python: detect duplicates using a set

I have a large number of objects I need to store in memory for processing in Python. Specifically, I'm trying to remove duplicates from a large set of objects. I want to consider two objects "equal" if a certain instance variable in the object is equal. So, I assumed the easiest way to do this would be to insert all my objects into a set, and override the __hash__ method so that it hashes the instance variable I'm concerned with.
So, as a test I tried the following:
class Person:
    def __init__(self, n, a):
        self.name = n
        self.age = a
    def __hash__(self):
        return hash(self.name)
    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)

myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000)) # try adding a duplicate
for p in myset: print(p)
Here, I define a Person class, and any two instances of Person with the same name variable are to be equal, regardless of the value of any other instance variable. Unfortunately, this outputs:
baz:30
foo:10
bar:20
foo:1000
Note that foo appears twice, so this program failed to notice duplicates. Yet the expression hash(Person("foo", 10)) == hash(Person("foo", 1000)) is True. So why doesn't this properly detect duplicate Person objects?

You forgot to also define __eq__().
If a class does not define a __cmp__() or __eq__() method it should not define a __hash__() operation either; if it defines __cmp__() or __eq__() but not __hash__(), its instances will not be usable in hashed collections. If a class defines mutable objects and implements a __cmp__() or __eq__() method, it should not implement __hash__(), since hashable collection implementations require that an object's hash value is immutable (if the object's hash value changes, it will be in the wrong hash bucket).

A set obviously will have to deal with hash collisions. If the hash of two objects matches, the set will compare them using the == operator to make sure they are really equal. In your case, this will only yield True if the two objects are the same object (the standard implementation for user-defined classes).
Long story short: Also define __eq__() to make it work.
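For example, the question's Person class starts deduplicating as soon as __eq__ is defined alongside __hash__ (same names and test data as in the question):

```python
class Person:
    def __init__(self, n, a):
        self.name = n
        self.age = a

    def __hash__(self):
        # Hash only the attribute that defines identity.
        return hash(self.name)

    def __eq__(self, other):
        # Two Persons are "equal" iff their names match; this is what
        # the set calls when two hashes collide.
        return isinstance(other, Person) and self.name == other.name

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)

myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000))  # duplicate name: the set keeps the first "foo"

print(len(myset))  # 3
```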

A hash function alone is not enough to distinguish objects; you also have to implement the comparison method (i.e. __eq__).

A hash function effectively says "A maybe equals B" or "A not equals B (for sure)".
If it says "maybe equals" then equality has to be checked anyway to make sure, which is why you also need to implement __eq__.
Nevertheless, defining __hash__ will significantly speed things up by making "A not equal B (for sure)" an O(1) operation.
The hash function must however always follow the "hash rule":
"hash rule": equal things must hash to the same value
(justification: or else we'd say "A not equals B (for sure)" when that is not the case)
For example you could hash everything by def __hash__(self): return 1. This would still be correct, but it would be inefficient because you'd have to check __eq__ each time, which may be a long process if you have complicated large data structures (e.g. with large lists, dictionaries, etc.).
Do note that you technically follow the "hash rule" by ignoring age in your implementation, def __hash__(self): return hash(self.name). If Bob is a person of age 20 and Bob is another person of age 30 and they are different people (likely, unless this is some sort of keeps-track-of-people-over-time-as-they-age program), then they will hash to the same value and have to be compared with __eq__. This is perfectly fine, but I would implement it like so:
def __hash__(self):
    return hash((self.name, self.age))
Do note that your way is still correct. It would however have been a coding error to use hash( (self.name, self.age) ) in a world where Person("Bob", age=20) and Person("Bob", age=30) were actually the same person, because the hash function would be saying they're different while the equals function would not (but be ignored).

You also need the __eq__() method.

Related

In Python, can an object become hashable after it is created?

I want to create recursive data structures via a class Wrapper, and I want to be able to do something like this:
A = Wrapper()
A.assign_value((A, 5))
my_set = set()
my_set.add(A)
This requires A to be hashable, but one of the requirements is (according to docs):
it has a hash value which never changes during its lifetime
The Wrapper object can become hashable, and then used freely in sets and as dictionary keys, so long as assign_value can only be used once, but I'm not sure if that meets the definition, since the value changes over its lifetime (even though the value is guaranteed not to change any more). Also, I'm unsure how I would implement a __hash__ function that sometimes indicates "this object is not actually hashable", if it's even possible. It seems the very existence of a valid __hash__ function indicates that an object is hashable, regardless of what it returns when called, in standard Python.
My current idea is to set the __hash__ function in the instance rather than the class, like this:
class Wrapper():
    def __init__(self):
        self.assigned = False
    def _hash(self):
        return hash((Wrapper, self.val))
    def _eq(self, a):
        if isinstance(a, Wrapper) and a.assigned:
            return self.val == a.val
        return NotImplemented
    def assign_value(self, val):
        if self.assigned:
            raise NotImplementedError()
        self.assigned = True
        self.val = val
        self.__hash__ = self._hash
        self.__eq__ = self._eq
I am realizing that the hash function will enter an infinite loop, constantly trying to compute hash(A), in my given example. __eq__ will also face this problem, though in that case it can be resolved the same way recursive lists are handled in Python. Setting a default hash value (like hash(Wrapper)) to be used in the computation for Wrapper objects does seem to fix the infinite recursion problem in theory though, making the whole thing still technically possible. I could also just say def _hash(self): return 0, despite the performance cost for sets and dicts.
Would this work as I expect? Is it the best way to do this? Most importantly, should I be doing this in the first place?

Is there a hash of a class instance in Python?

Let's suppose I have a class like this:
class MyClass:
    def __init__(self, a):
        self._a = a
And I construct such instances:
obj1 = MyClass(5)
obj2 = MyClass(12)
obj3 = MyClass(5)
Is there a general way to hash my objects such that objects constructed with same values have equal hashes? In this case:
myhash(obj1) != myhash(obj2)
myhash(obj1) == myhash(obj3)
By general I mean a Python function that can work with objects created by any class I can define. For different classes and same values the hash function must return different results, of course; otherwise this question would be about hashing of several arguments instead.
def myhash(obj):
    items = sorted(obj.__dict__.items(), key=lambda it: it[0])
    return hash((type(obj),) + tuple(items))
This solution obviously has limitations:
It assumes that all fields in __dict__ are important.
It assumes that __dict__ is present, e.g. this won't work with __slots__.
It assumes that all values are hashable.
It breaks the Liskov substitution principle.
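As a usage sketch, here is the myhash helper applied to the MyClass instances from the question (both redefined so the snippet is self-contained):

```python
class MyClass:
    def __init__(self, a):
        self._a = a

def myhash(obj):
    # Sort attributes by name so dict ordering cannot affect the result,
    # and mix in the type so different classes with the same fields differ.
    items = sorted(obj.__dict__.items(), key=lambda it: it[0])
    return hash((type(obj),) + tuple(items))

obj1 = MyClass(5)
obj2 = MyClass(12)
obj3 = MyClass(5)

print(myhash(obj1) == myhash(obj3))  # True: same class, same values
print(myhash(obj1) == myhash(obj2))  # False: different values
```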
The question is badly formed for a couple of reasons:
Hashes don't test equality, just inequality. That is, they guarantee that hash(a) != hash(b) implies a != b, but the reverse does not hold. For example, checking "aKey" in myDict will do a linear search through all keys in myDict that have the same hash as "aKey".
You seem to want to do something with storage. Note that the hash of "aKey" will change between runs, so don't write it to a file. See the bottom of the __hash__ documentation for more information.
In general, you need to think carefully about subclasses, hashes, and equality. There is a pit here, so even the official documentation quietly sidesteps what the hash of instance means. Do note that each instance has a __dict__ for local variables and the __class__ with more information.
Hope this helps those who come after you.

Hashing Custom Objects

The hash() function in Python maps immutable objects to hash values. However, I cannot understand its behavior for objects of user-defined classes. Some resources say that if a user-defined class does not contain __hash__() and __eq__() methods, its objects cannot be hashed. Others claim the opposite.
In other words, what is the role of the __eq__() and __hash__() methods in hashing custom objects?
If you don't implement __hash__, hash() will use the default implementation, which is derived from the object's identity. If you don't implement __eq__, the default identity comparison will be used when you compare two instances.
class C:
    pass

class D:
    def __hash__(self):
        return 1
    def __eq__(self, other):
        return True

print(hash(C()))   # changing for every C instance
print(C() == C())  # False since objects are different
print(hash(D()))   # always 1
print(D() == D())  # always True
Basically, 'hash' should be quick, and act as a "triage" calculation to know if two objects are equal or not.
The 'eq' should be precisely the function that tells whether the objects are definitely equal or not. This function may have to perform a lot of checks (for instance, if you define the equality of your objects by the equality of all their member fields, and there are a lot of them).
The purpose of these two functions is to have a quick way of saying "no, they are not equal" (the hash function), since the comparisons are often used a lot, and most often two objects are not supposed to be "equals".
Instead of executing a lot of "eq" functions, you execute a lot of quick "hash" functions, and if both the hashes match, you execute "eq" to confirm the equality or not.
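The order described above can be made visible by instrumenting both methods with counters. This is an illustrative sketch (the class and counter names are made up for the demo); small integers are used as keys so the set lookups behave deterministically:

```python
class Traced:
    hash_calls = 0
    eq_calls = 0

    def __init__(self, key):
        self.key = key

    def __hash__(self):
        Traced.hash_calls += 1
        return hash(self.key)

    def __eq__(self, other):
        Traced.eq_calls += 1
        return self.key == other.key

s = {Traced(1), Traced(2), Traced(3)}
Traced.hash_calls = Traced.eq_calls = 0

Traced(100) in s        # hash lands in an empty slot: no __eq__ needed
print(Traced.eq_calls)  # 0 -- the hash alone ruled out equality
Traced(1) in s          # hash matches an entry: __eq__ confirms it
print(Traced.eq_calls)  # 1
```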

How is the method __lt__ invoked and processed by calling the method sort() in Python?

Executing this code:
class Person(object):
    def __init__(self, name):
        self.name = name
        try:
            lastBlank = name.rindex(' ')
            self.lastName = name[lastBlank+1:]
        except:
            self.lastName = name
    def __lt__(self, other):
        if self.lastName == other.lastName:
            return self.name < other.name
        return self.lastName < other.lastName
    def __str__(self):
        return self.name

me = Person('Michael Guttag')
him = Person('Barack Hussein Obama')
her = Person('Madonna')
pList = [me, him, her]
pList.sort() # invoke __lt__()
for p in pList:
    print(p)
and it outputs:
Michael Guttag
Madonna
Barack Hussein Obama
In the book where it exemplifies class and operator overloading (or polymorphism) it says:
In addition to providing the syntactic convenience of writing infix expressions that use <, this overloading provides automatic access to any polymorphic method defined using __lt__. The built-in method sort is one such method. So, for example, if pList is a list composed of elements of type Person, the call pList.sort() will sort that list using the __lt__ method defined in class Person.
I don't understand how sort() is a polymorphic method (actually the concept of polymorphism is still unclear to me after lots of researching), and how the method __lt__ does the sorting to give such an output. I need a step-by-step guide. Thank you!
The rich comparison methods - __lt__, __le__, __gt__, __ge__ and __eq__ - implement the various comparison operations. All of them compare two objects (one of them being self). Together, they let you answer questions like "out of these two objects, which, if either, is bigger?" There are quite a few sorting algorithms that let you take a list of objects, and provided you can answer questions like that one on every pair of them, will let you put the whole list in the right order. For example, one of the easier to understand ones is called selection sort - it works like this:
Compare the first two items. Take the smallest (if they're equal, pick the first one), call it s
Compare s to the next item that you haven't checked yet, and if that item is smaller, it is now s (otherwise, s is still the old s)
Do Step 2 repeatedly until you have no more items you haven't checked. This means s is now the smallest item in your list.
Move s to the start of the list, shifting all the others up to fill the 'gap' where s was.
Consider the rest of the list (ie, excluding the first item), and run this whole process on that.
Keep going - excluding the first two, then the first three, and so on - until the remaining list only has one item. The whole list is now sorted.
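The steps above can be sketched in Python (a plain selection sort, not the algorithm list.sort actually uses). Note that the only thing it ever asks of the elements is <, which is exactly what __lt__ provides:

```python
def selection_sort(items):
    items = list(items)  # work on a copy
    for start in range(len(items) - 1):
        # Find the smallest item in the unsorted part (steps 1-3).
        smallest = start
        for i in range(start + 1, len(items)):
            if items[i] < items[smallest]:  # the only comparison needed
                smallest = i
        # Move it to the front of the unsorted part (step 4).
        items[start], items[smallest] = items[smallest], items[start]
    return items

print(selection_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```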
Python's list.sort method runs a more complex, but faster, algorithm called Timsort. It achieves the same thing: given that you can know how any two elements are ordered relative to each other, it will sort the whole list for you.
The details of how Timsort works aren't important - the point of using an existing method instead of writing your own is so you can just believe that it does work. One thing that does matter is that nowhere in the documentation does it guarantee that it will use exactly < to compare each pair of elements - it is a good idea to implement at least __eq__ as well so that sort can know if two Persons happen to be equal. You could write it like this:
def __eq__(self, other):
    return self.name == other.name
Once you've done that, there is a bit of builtin magic called functools.total_ordering that can fill in all the rest of the comparison methods for you. You can use it like this:
import functools

@functools.total_ordering
class Person:
    ...
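Filled in, a minimal version of the decorated class might look like this: total_ordering derives __le__, __gt__ and __ge__ from the __lt__ and __eq__ you supply (the last-name logic is omitted here to keep the sketch short):

```python
import functools

@functools.total_ordering
class Person:
    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        return self.name < other.name

    def __eq__(self, other):
        return self.name == other.name

a = Person("Anna")
b = Person("Bob")
print(a < b)   # True, from __lt__ directly
print(a >= b)  # False, derived by total_ordering
```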

python compare 2 similar objects with duck typing

Maybe my design is totally out of whack, but if I have 2 derived class objects that are comparable, but class D1 will basically always > class D2. (Say comparing Ivy Bridge to 286). How would I implement class D1's comparison to reflect that without using isinstance(D2)?
I saw this:
Comparing two objects
and
If duck-typing in Python, should you test isinstance?
I could add a "type" attribute, and then compare the types, but then I might as well use isinstance. The easiest way would be to use isinstance... Any better suggestions?
I would ask myself "what is it about D1 that makes it always greater than D2?" In other words, do they have some common attribute that it would make sense to base the comparison on? If there is no good answer to this question, it might be worth asking whether creating comparisons for these two objects actually makes sense.
IF, after considering these things, you still think that doing the comparison is a good idea, then just use isinstance. There's a reason it still exists in the language -- and python is constantly deprecating things that are considered bad practice which implies that isinstance isn't always a bad thing.
The problem is when isinstance is used to do type checking unnecessarily. In other words, users often use it in a "Look before you leap" context which is completely unnecessary.
if not isinstance(arg, Foo):
    raise ValueError("I want a Foo")
In this case, if the user doesn't put something that looks enough like a Foo into the function, it will raise an exception anyway. Why restrict it to only Foo objects? In your case however, it seems like the type of the objects actually matter from a conceptual standpoint. This is why isinstance exists (in my opinion).
I would do something like this:
class D1(object):
    def __val_cmp(self, other):
        # compare by attributes here
        if self.attr < other.attr:
            return -1
        elif self.attr > other.attr:
            return 1
        return 0
    def __cmp__(self, other):
        greater = isinstance(other, type(self))
        lesser = isinstance(self, type(other))
        if greater and lesser:
            # same type so compare by attributes
            return self.__val_cmp(other)
        elif greater:
            return 1
        elif lesser:
            return -1
        else:
            # other type is not a parent or child type, so just compare by attributes
            return self.__val_cmp(other)
If D2 is a subtype of D1, instances of D2 will always compare less than instances of D1.
If D0 is a parent type of D1, instances of D0 will always compare greater than instances of D1.
If you compare an instance of D1 to another instance of D1, the comparison will be by the class's attributes.
If you compare an instance of D1 to an instance of an unknown class, the comparison will be by the class's attributes.
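Note that __cmp__ only exists in Python 2. A rough Python 3 sketch of the same idea using the rich comparison methods might look like this (the attribute name attr is illustrative, and only __lt__ and __eq__ are shown):

```python
class D1:
    def __init__(self, attr):
        self.attr = attr

    def _rank(self, other):
        # (other is same-or-subtype of us, we are same-or-subtype of other)
        return (isinstance(other, type(self)), isinstance(self, type(other)))

    def __lt__(self, other):
        greater, lesser = self._rank(other)
        if greater == lesser:
            # same type (or unrelated): fall back to comparing attributes
            return self.attr < other.attr
        return lesser  # self is the subtype -> self sorts below other

    def __eq__(self, other):
        greater, lesser = self._rank(other)
        return greater == lesser and self.attr == other.attr

class D2(D1):
    pass

print(D2(100) < D1(1))  # True: subtype always sorts below its parent type
print(D1(1) < D1(2))    # True: same type, compared by attribute
```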

Categories

Resources