Why do Sets allow multiple of the same object with changing hashCodes? - python

According to my understanding, sets can only have one of each object in them. But I have found the following example where a set has two of the same object
class myObject:
    def __init__(self, x):
        self.x = x
    def set(self, x):
        self.x = x
    def __hash__(self):
        return self.x
    def __eq__(self, o):
        return self.x == o.x
    def __str__(self):
        return str(self.x)
    def __repr__(self):
        return str(self.x)
When I run the following:
x = myObject(1)
mySet = {x}
x.set(2)
mySet.add(x)
print(mySet)
x.set(3)
print(mySet)
I get the following output:
{2, 2}
{3, 3}
If I remove the __str__ and __repr__ methods it shows there are two objects in the set with the same memory address:
{<__main__.myObject object at 0x10e3a10d0>, <__main__.myObject object at 0x10e3a10d0>}
I am aware that Python doesn't allow things like lists to be hashed, because their hash could change and cause an error similar to the one shown above. Why does Python allow this for my class but not for lists? Surely Python should have some way of managing changing hashes.
I have tested the same example in Java and the same thing happens. Why do these languages allow this?

The docs address hashing directly:
What is a hash: https://docs.python.org/3/reference/datamodel.html#object.__hash__
What is hashable (IMPORTANT): https://docs.python.org/3/glossary.html#term-hashable
"An object is hashable if it has a hash value which never changes during its lifetime..."
Looking at the set object itself after you have modified x.
s = set()
x = myObject(2)
Then look at the set member's hash:
Then:
x.set(4)
No change. In fact, if you continue to use that set in other places (e.g. fs = frozenset(s)) you will continue to pass around the old hash.
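A minimal sketch of what is going on, assuming CPython (the class name Item and the small integer hashes are chosen only for illustration). A set files each entry under the hash the object had at insertion time and does not rehash it afterwards, so mutating the object leaves a stale entry behind:

```python
class Item:
    """Illustrative stand-in for myObject: hash and equality follow a mutable field."""

    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return self.x

    def __eq__(self, other):
        return isinstance(other, Item) and self.x == other.x


x = Item(1)
s = {x}                     # entry is stored under hash 1
x.x = 2                     # hash(x) is now 2, but the set is never told

found_before_add = x in s   # the lookup probes under hash 2 and misses the old entry
s.add(x)                    # so add() stores the same object a second time
size_after_add = len(s)

# Both entries refer to the very same object.
same_object = all(member is x for member in s)
```

Python has no way to notice the mutation, which is why the docs require that a hash never change while the object lives in a hash-based container. Lists sidestep the problem by not defining __hash__ at all.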

Related

Python: Accessing dict with hashable object fails

I am using a hashable object as a key to a dictionary. The objects are hashable and I can store key-value-pairs in the dict, but when I create a copy of the same object (that gives me the same hash), I get a KeyError.
Here is some small example code:
class Object:
    def __init__(self, x): self.x = x
    def __hash__(self): return hash(self.x)
o1 = Object(1.)
o2 = Object(1.)
hash(o1) == hash(o2) # This is True
data = {}
data[o1] = 2.
data[o2] # Desired: This should output 2.
In my scenario above, how can I achieve that data[o2] also returns 2.?
You need to implement both __hash__ and __eq__:
class Object:
    def __init__(self, x): self.x = x
    def __hash__(self): return hash(self.x)
    def __eq__(self, other): return self.x == other.x if isinstance(other, self.__class__) else NotImplemented
Per Python documentation:
if a class does not define an __eq__() method it should not define a __hash__() operation either
After finding a matching hash, Python's dictionary compares the keys using __eq__. Without a custom __eq__ the comparison falls back to identity, so o1 and o2 are treated as different keys; that's why you get a KeyError.
You can use the __eq__ magic method to implement an equality check on your object.
def __eq__(self, other):
    if isinstance(other, Object):
        return self.x == other.x
    return NotImplemented
You can learn more about magic methods from this link.
So, as stated before, your object needs to implement __eq__ (equality, ==). If you want to understand why:
Sometimes the hashes of different objects are the same; this is called a collision.
A dictionary handles that by testing whether the objects are equal. If they are not, the dictionary has to manage the collision. How it does that is an implementation detail and can vary a lot. A dummy implementation would be a list of (key, value) tuples.
Under the hood, a dummy implementation may look like this:
dico[key] = [(object1, value), (object2, value)]
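The dummy implementation above can be sketched as a toy hash table. This is only a teaching aid under stated assumptions (ToyDict and NoEq are hypothetical names, and CPython's real dict uses open addressing rather than bucket lists), but it shows why a matching hash alone is not enough to find a key:

```python
class ToyDict:
    """Hypothetical hash table that resolves collisions with per-bucket lists."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash only selects the bucket; it does not identify the key.
        return self.buckets[hash(key) % len(self.buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k is key or k == key:        # identity first, then __eq__
                bucket[i] = (k, value)
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        for k, v in self._bucket(key):
            if k is key or k == key:
                return v
        raise KeyError(key)


class NoEq:
    """Defines __hash__ only, like the Object class in the question."""

    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return hash(self.x)


d = ToyDict()
d[1] = 'a'
d[9] = 'b'              # hash(9) % 8 == hash(1) % 8: a collision in bucket 1

o1, o2 = NoEq(1.0), NoEq(1.0)
d[o1] = 2.0
try:
    d[o2]               # same hash, but default equality is identity
    missing = False
except KeyError:
    missing = True
```

Both colliding integer keys stay retrievable because equality tells them apart, while o2 is not found because nothing says it equals o1.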

Change the underlying data representation with the descriptor protocol

Suppose I have an existing class, for example doing some mathematical stuff:
import math

class Vector:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def norm(self):
        return math.sqrt(math.pow(self.x, 2) + math.pow(self.y, 2))
Now, for some reason, I'd like Python not to store the members x and y like ordinary attributes. Instead, I want Python to store them internally as strings, or in a dedicated buffer, maybe for interoperability with some C code. So (for the string case) I built the following descriptor:
class MyStringMemory(object):
    def __init__(self, convert):
        self.convert = convert
    def __get__(self, obj, objtype):
        print('Read')
        return self.convert(self.prop)
    def __set__(self, obj, val):
        print('Write')
        self.prop = str(val)
    def __delete__(self, obj):
        print('Delete')
And I wrap the existing vector class in a new class where members x and y become MyStringMemory:
class StringVector(Vector):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    x = MyStringMemory(float)
    y = MyStringMemory(float)
Finally, some driving code:
v = StringVector(1, 2)
print(v.norm())
v.x, v.y = 10, 20
print(v.norm())
After all, I replaced the internal representation of x and y to be strings without any change in the original class, but still with its full functionality.
I just wonder: Will that concept work universally or do I run into serious pitfalls? As I said, the main idea is to store the data into a specific buffer location that is later on accessed by a C code.
Edit: The intention of what I'm doing is as follows. Currently, I have a nicely working program where some physical objects, all of type MyPhysicalObj interact with each other. The code inside the objects is vectorized with Numpy. Now I'd also like to vectorize some code over all objects. For example, each object has an energy that is computed by a complicated vectorized code per-object. Now I'd like to sum up all energies. I can iterate over all objects and sum up, but that's slow. So I'd rather have that property energy for each object automatically stored into a globally predefined buffer, and I can just use numpy.sum over that buffer.
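There is one pitfall regarding Python descriptors.
With your code, every instance references the same values, because they are stored on the descriptor objects themselves (StringVector.__dict__['x'].prop and StringVector.__dict__['y'].prop), and those descriptors are shared by all instances of the class: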
v1 = StringVector(1, 2)
print('current StringVector "x": ', StringVector.__dict__['x'].prop)
v2 = StringVector(3, 4)
print('current StringVector "x": ', StringVector.__dict__['x'].prop)
print(v1.x)
print(v2.x)
will have the following output:
Write
Write
current StringVector "x": 1
Write
Write
current StringVector "x": 3
Read
3.0
Read
3.0
I suppose this is not what you want. To store a unique value per object, inside the object itself, make the following changes:
class MyNewStringMemory(object):
    def __init__(self, convert, name):
        self.convert = convert
        self.name = '_' + name
    def __get__(self, obj, objtype):
        print('Read')
        return self.convert(getattr(obj, self.name))
    def __set__(self, obj, val):
        print('Write')
        setattr(obj, self.name, str(val))
    def __delete__(self, obj):
        print('Delete')
class StringVector(Vector):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    x = MyNewStringMemory(float, 'x')
    y = MyNewStringMemory(float, 'y')
v1 = StringVector(1, 2)
v2 = StringVector(3, 4)
print(v1.x, type(v1.x))
print(v1._x, type(v1._x))
print(v2.x, type(v2.x))
print(v2._x, type(v2._x))
Output:
Write
Write
Write
Write
Read
Read
1.0 <class 'float'>
1 <class 'str'>
Read
Read
3.0 <class 'float'>
3 <class 'str'>
Also, you could definitely save the data in a centralized store from the descriptor's __set__ method.
Refer to this document: https://docs.python.org/3/howto/descriptor.html
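As a rough sketch of the centralized-store idea, the descriptor below keeps every instance's value in one shared buffer. All names here (BufferedFloat, PhysicalObj, _slot, allocate) are hypothetical, and the standard-library array module stands in for the NumPy buffer mentioned in the question; slots are never reclaimed, which a real design would have to handle:

```python
from array import array


class BufferedFloat:
    """Hypothetical descriptor storing each instance's value in one shared buffer."""

    def __init__(self):
        self.buffer = array('d')          # contiguous C doubles

    def allocate(self):
        # Reserve a slot in the buffer and return its index.
        self.buffer.append(0.0)
        return len(self.buffer) - 1

    def __get__(self, obj, objtype=None):
        if obj is None:                   # accessed on the class: expose the descriptor
            return self
        return self.buffer[obj._slot]

    def __set__(self, obj, value):
        self.buffer[obj._slot] = float(value)


class PhysicalObj:
    energy = BufferedFloat()

    def __init__(self, energy):
        self._slot = type(self).energy.allocate()   # per-instance index into the buffer
        self.energy = energy


objs = [PhysicalObj(e) for e in (1.5, 2.5, 4.0)]
total = sum(PhysicalObj.energy.buffer)              # one pass over the shared buffer

objs[1].energy = 10                                 # writes go straight to the buffer
total_after = sum(PhysicalObj.energy.buffer)
```

Unlike the shared self.prop in the question, only the storage is shared here; each instance still reads and writes its own slot.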
If you need a generic converter ('convert') like the one you wrote, this is the way to go.
The biggest downside will be performance when you need to create a lot of instances (I assume you might, since the class is called Vector). This will be slow, since Python class instantiation is slow.
In that case you might consider using namedtuple; the docs show a scenario similar to yours.
As a side note: if possible, why not create a dict with the string representations of x and y in the __init__ method, and then keep using x and y as normal variables without all the converting?

Capturing the external modification of a mutable python object serving as an instance class variable

I am trying to track the external modification of entries of a mutable Python object (e.g., a list or dictionary). This ability is particularly helpful in the following two situations:
1) When one would like to avoid the assignment of unwanted values to the mutable python object. Here's a simple example where x must be a list of integers only:
class foo(object):
    def __init__(self,x):
        self.x = x
    def __setattr__(self,attr_name,attr_value):
        # x must be a list of integers only
        if attr_name == 'x' and not isinstance(attr_value,list):
            raise TypeError('x must be a list!')
        elif attr_name == 'x' and len([a for a in attr_value if not isinstance(a,int)]) > 0:
            raise TypeError('x must be a list of integers only')
        self.__dict__[attr_name] = attr_value
# The following works fine and it throws an error because x has a non-integer entry
f = foo(x = ['1',2,3])
# The following assigns an authorized list to x
f = foo(x = [1,2,3])
# However, the following does not throw any error.
#** I'd like my code to throw an error whenever a non-integer value is assigned to an element of x
f.x[0] = '1'
print 'f.x = ',f.x
2) When one needs to update a number of other variables after modifying the mutable Python object. Here's an example, where x is a dictionary and x_vals needs to get updated whenever any changes (such as deleting an entry or assigning a new value for a particular key) are made to x :
class foo(object):
    def __init__(self,x,y = None):
        self.set_x(x)
        self.y = y
    def set_x(self,x):
        """
        x has to be a dictionary
        """
        if not isinstance(x,dict):
            raise TypeError('x must be a dictionary')
        self.__dict__['x'] = x
        self.find_x_vals()
    def find_x_vals(self):
        """
        NOTE: self.x_vals needs to get updated each time one modifies x
        """
        self.x_vals = self.x.values()
    def __setattr__(self,name,value):
        # Any Changes made to x --> NOT SURE HOW TO CODE THIS PART! #
        if name == 'x' or ...:
            raise AttributeError('Use set_x to make changes to x!')
        else:
            self.__dict__[name] = value
if __name__ == '__main__':
    f = foo(x={'a':1, 'b':2, 'c':3}, y = True)
    print f.x_vals
    # I'd like this to throw an error asking to use set_x so self.x_vals
    # gets updated too
    f.x['a'] = 5
    # checks if x_vals was updated
    print f.x_vals
    # I'd like this to throw an error asking to use set_x so self.x_vals gets updated too
    del f.x['a']
    print f.x_vals
You could make x_vals a property like this:
@property
def x_vals(self):
    return self.x.values()
This keeps x_vals up to date each time you access it. It would even be faster, because you wouldn't have to update it each time you change x.
If your only problem is keeping x_vals up to date, this solves it and saves you the hassle of subclassing.
You cannot use property, because the thing you are trying to protect is mutable, and property only helps with the getting, setting, and deleting of the object itself, not that object's internal state.
What you could do is create a dict subclass (or just a look-alike if you only need a couple of the dict abilities) to manage access. Then your custom class could manage the __getitem__, __setitem__, and __delitem__ methods.
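A minimal sketch of the subclassing approach, written for Python 3 and applied to the list-of-integers case from the question. IntList is a hypothetical name, and only construction, item assignment, and append are guarded; extend, insert, and += would need the same treatment:

```python
class IntList(list):
    """Hypothetical list subclass that rejects non-integer elements."""

    @staticmethod
    def _check(value):
        if not isinstance(value, int):
            raise TypeError('%r is not an integer' % (value,))
        return value

    def __init__(self, iterable=()):
        super().__init__(self._check(v) for v in iterable)

    def __setitem__(self, index, value):
        if isinstance(index, slice):
            # Slice assignment carries several values; validate each one.
            value = [self._check(v) for v in value]
        else:
            self._check(value)
        super().__setitem__(index, value)

    def append(self, value):
        super().append(self._check(value))


x = IntList([1, 2, 3])
x[0] = 5                      # accepted
try:
    x[0] = '1'                # rejected before the list is touched
    error = None
except TypeError as exc:
    error = str(exc)
```

Because the check lives in the container itself, it still applies after the attribute has been handed out, which is exactly what __setattr__ on the owning class cannot do.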
Update for question revision
My original answer is still valid: whether you use property or __getattribute__ [1], you still have the basic problem that once you hand over the retrieved attribute you have no control over what happens to it nor what it does.
You have two options to work around this:
1) create subclasses of the classes you want to protect, and put the restrictions in them (from my original answer), or
2) create a generic wrapper to act as a gateway.
A very rough example of the gateway wrapper:
class Gateway():
    "use this to wrap an object and provide restrictions to its data"
    def __init__(self, obj, valid_key=None, valid_value=None):
        self.obj = obj
        self.valid_key = valid_key
        self.valid_value = valid_value
    def __setitem__(self, name, value):
        """
        a dictionary can have any value for name, any value for value
        a list will have an integer for name, any value for value
        """
        valid_key = self.valid_key
        valid_value = self.valid_value
        if valid_key is not None:
            if not valid_key(name):
                raise Exception('%r not allowed as key/index' % type(name))
        if valid_value is not None:
            if not valid_value(value):
                raise Exception('%r not allowed as value' % value)
        self.obj[name] = value
and a simple example:
huh = Gateway([1, 2, 3], valid_value=lambda x: isinstance(x, int))
huh[0] = '1'
Traceback (most recent call last):
...
Exception: '1' not allowed as value
To use Gateway you will need to override more methods, such as append (for list).
[1] Using __getattribute__ is not advised, as it is the piece that controls all aspects of attribute lookup. It is easy to get wrong.

Overriding __eq__ and __hash__ to compare a dict attribute of two instances

I'm struggling to understand how to correctly compare objects based on an underlying dict attribute that each instance possesses.
Since I'm overriding __eq__, do I need to override __hash__ as well? I haven't a firm grasp on when/where to do so and could really use some help.
I created a simple example below to illustrate the maximum recursion exception that I've run into. A RegionalCustomerCollection organizes account IDs by geographical region. RegionalCustomerCollection objects are said to be equal if the regions and their respective accountids are. Essentially, all items() should be equal in content.
from collections import defaultdict
class RegionalCustomerCollection(object):
    def __init__(self):
        self.region_accountids = defaultdict(set)
    def get_region_accountid(self, region_name=None):
        return self.region_accountids.get(region_name, None)
    def set_region_accountid(self, region_name, accountid):
        self.region_accountids[region_name].add(accountid)
    def __eq__(self, other):
        if (other == self):
            return True
        if isinstance(other, RegionalCustomerCollection):
            return self.region_accountids == other.region_accountids
        return False
    def __repr__(self):
        return ', '.join(["{0}: {1}".format(region, acctids)
                          for region, acctids
                          in self.region_accountids.items()])
Let's create two object instances and populate them with some sample data:
>>> a = RegionalCustomerCollection()
>>> b = RegionalCustomerCollection()
>>> a.set_region_accountid('northeast',1)
>>> a.set_region_accountid('northeast',2)
>>> a.set_region_accountid('northeast',3)
>>> a.set_region_accountid('southwest',4)
>>> a.set_region_accountid('southwest',5)
>>> b.set_region_accountid('northeast',1)
>>> b.set_region_accountid('northeast',2)
>>> b.set_region_accountid('northeast',3)
>>> b.set_region_accountid('southwest',4)
>>> b.set_region_accountid('southwest',5)
Now let's try to compare the two instances and generate the recursion exception:
>>> a == b
...
RuntimeError: maximum recursion depth exceeded while calling a Python object
Your object shouldn't return a hash because it's mutable. If you put this object into a dictionary or set and then change it afterward, you may never be able to find it again.
In order to make an object unhashable, you need to do the following:
class MyClass(object):
    __hash__ = None
This will ensure that the object is unhashable.
>>> m = MyClass()
>>> hash(m)
TypeError: unhashable type: 'MyClass'
Does this answer your question? I'm suspecting not because you were explicitly looking for a hash function.
As far as the RuntimeError you're receiving, it's because of the following line:
if self == other:
    return True
That gets you into an infinite recursion loop. Try the following instead:
if self is other:
    return True
You don't need to override __hash__ to compare two objects (you only need it if instances must be usable as set members or dictionary keys, in which case objects that compare equal must also hash equal).
Also, you have infinite recursion here:
def __eq__(self, other):
    if (other == self):
        return True
    if isinstance(other, RegionalCustomerCollection):
        return self.region_accountids == other.region_accountids
    return False
If both objects are of type RegionalCustomerCollection then you'll have infinite recursion since == calls __eq__.
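Putting both answers together, a corrected version might look like the sketch below (Python 3 syntax, trimmed to the methods the comparison needs). Returning NotImplemented for foreign types is a common convention rather than something the original code did, and __hash__ = None is spelled out for clarity even though Python 3 sets it implicitly when a class defines __eq__ without __hash__:

```python
from collections import defaultdict


class RegionalCustomerCollection:
    def __init__(self):
        self.region_accountids = defaultdict(set)

    def set_region_accountid(self, region_name, accountid):
        self.region_accountids[region_name].add(accountid)

    def __eq__(self, other):
        if self is other:                  # identity test: never calls __eq__ again
            return True
        if isinstance(other, RegionalCustomerCollection):
            return self.region_accountids == other.region_accountids
        return NotImplemented              # let Python try the other operand

    # Mutable object: explicitly unhashable.
    __hash__ = None


a = RegionalCustomerCollection()
b = RegionalCustomerCollection()
for region, accountid in [('northeast', 1), ('northeast', 2), ('southwest', 4)]:
    a.set_region_accountid(region, accountid)
    b.set_region_accountid(region, accountid)

equal_before = (a == b)
b.set_region_accountid('southwest', 5)
equal_after = (a == b)
```

The comparison now bottoms out in the built-in dict and set equality, so there is nothing left to recurse on.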

How to remove duplicates in set for objects?

I have a set of objects:
import random

class Test(object):
    def __init__(self):
        self.i = random.randint(1,10)
res = set()
for i in range(0,1000):
    res.add(Test())
print(len(res))  # prints 1000
How can I remove duplicates from a set of objects?
Thanks for the answers, it works:
class Test(object):
    def __init__(self, i):
        self.i = i
        # self.i = random.randint(1,10)
        # self.j = random.randint(1,20)
    def __keys(self):
        t = ()
        for key in self.__dict__:
            t = t + (self.__dict__[key],)
        return t
    def __eq__(self, other):
        return isinstance(other, Test) and self.__keys() == other.__keys()
    def __hash__(self):
        return hash(self.__keys())
res = set()
res.add(Test(2))
...
res.add(Test(8))
result: [2,8,3,4,5,6,7]
But how can I preserve the order? Sets do not support ordering. Can I use a list instead of a set, for example?
Your objects must be hashable (i.e. must have __eq__() and __hash__() defined) for sets to work properly with them:
class Test(object):
    def __init__(self):
        self.i = random.randint(1, 10)
    def __eq__(self, other):
        return self.i == other.i
    def __hash__(self):
        return self.i
An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.
Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.
If you have several attributes, hash and compare a tuple of them (thanks, delnan):
class Test(object):
    def __init__(self):
        self.i = random.randint(1, 10)
        self.k = random.randint(1, 10)
        self.j = random.randint(1, 10)
    def __eq__(self, other):
        return (self.i, self.k, self.j) == (other.i, other.k, other.j)
    def __hash__(self):
        return hash((self.i, self.k, self.j))
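On Python 3.7 or later, the same tuple-based __eq__ and __hash__ can be generated by the standard dataclasses module. This is an alternative sketch rather than what the answer used, and Triple is a hypothetical name; frozen=True also blocks attribute assignment, which keeps the hash stable for the object's lifetime:

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Triple:
    # Equality and hashing are derived from these fields, in order.
    i: int
    k: int
    j: int


unique = {Triple(1, 2, 3), Triple(1, 2, 3), Triple(4, 5, 6)}

try:
    Triple(1, 2, 3).i = 9       # frozen instances reject assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

The two equal Triple(1, 2, 3) objects collapse into a single set member, just as with the hand-written methods.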
Your first question is already answered by Pavel Anossov.
But you have another question:
But how can I preserve the order? Sets do not support ordering. Can I use a list instead of a set, for example?
You can use a list, but there are a few downsides:
You get the wrong interface.
You don't get automatic handling of duplicates. You have to explicitly write if foo not in res: res.append(foo). Obviously, you can wrap this up in a function instead of writing it repeatedly, but it's still extra work.
It's going to be a lot less efficient if the collection can get large. Basically, adding a new element, checking whether an element already exists, etc. are all going to be O(N) instead of O(1).
What you want is something that works like an ordered set. Or, equivalently, like a list that doesn't allow duplicates.
If you do all your adds first, and then all your lookups, and you don't need lookups to be fast, you can get around this by first building a list, then using unique_everseen from the itertools recipes to remove duplicates.
Or you could just keep a set and a list of elements in order (or a list plus a set of elements seen so far). But that can get a bit complicated, so you might want to wrap it up.
Ideally, you want to wrap it up in a type that has exactly the same API as set. Something like an OrderedSet akin to collections.OrderedDict.
Fortunately, if you scroll to the bottom of that docs page, you'll see that exactly what you want already exists; there's a link to an OrderedSet recipe at ActiveState.
So, copy it, paste it into your code, then just change res = set() to res = OrderedSet(), and you're done.
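If you only need the distinct items in first-seen order and not a full set API, a plain dict can play the ordered-set role on Python 3.7 or later, where dicts are guaranteed to preserve insertion order. A small sketch (ordered_unique is a hypothetical helper name, and the items must be hashable):

```python
def ordered_unique(items):
    """Return the distinct items in first-seen order."""
    # dict keys are unique and keep insertion order, so duplicates are
    # dropped while the first occurrence keeps its position.
    return list(dict.fromkeys(items))


result = ordered_unique([2, 8, 3, 2, 4, 8, 5])
# result == [2, 8, 3, 4, 5]
```

This works for the Test objects above as well, as long as they define __eq__ and __hash__.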
I think you can easily do what you want with a list as you asked in your first post since you defined the eq operator :
l = []
if Test(0) not in l:
    l.append(Test(0))
My 2 cts ...
Pavel Anossov's answer is great for allowing your class to be used in a set with the semantics you want. However, if you want to preserve the order of your items, you'll need a bit more. Here's a function that de-duplicates a list, as long as the list items are hashable:
def dedupe(lst):
    seen = set()
    results = []
    for item in lst:
        if item not in seen:
            seen.add(item)
            results.append(item)
    return results
A slightly more idiomatic version would be a generator, rather than a function that returns a list. This gets rid of the results variable, using yield rather than appending the unique values to it. I've also renamed the lst parameter to iterable, since it will work just as well on any iterable object (such as another generator).
def dedupe(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item
