In Python, can an object become hashable after it is created?

I want to create recursive data structures via a class Wrapper, and I want to be able to do something like this:
A = Wrapper()
A.assign_value((A, 5))
my_set = set()
my_set.add(A)
This requires A to be hashable, but one of the requirements is (according to docs):
it has a hash value which never changes during its lifetime
The Wrapper object can become hashable, and then be used freely in sets and as a dictionary key, as long as assign_value can only be used once. But I'm not sure that meets the definition, since the hash value changes during the object's lifetime (even though it is guaranteed never to change again). I'm also unsure how I would implement a __hash__ function that sometimes indicates "this object is not actually hashable", if that's even possible. In standard Python, the very existence of a valid __hash__ method seems to mark an object as hashable, regardless of what it returns when called.
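As a side note (not part of the original question): raising TypeError from a class-level __hash__ is one recognised way to signal "not hashable right now"; set.add() and dict insertion simply propagate the exception, just as they do for genuinely unhashable types. A minimal sketch:

class Wrapper:
    def __init__(self):
        self.assigned = False

    def __hash__(self):
        if not self.assigned:
            raise TypeError("Wrapper is unhashable until a value is assigned")
        return 0  # constant hash: always valid, though slow in large sets/dicts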
My current idea is to set the __hash__ function on the instance rather than the class, like this:
class Wrapper():
    def __init__(self):
        self.assigned = False

    def _hash(self):
        return hash((Wrapper, self.val))

    def _eq(self, a):
        if isinstance(a, Wrapper) and a.assigned:
            return self.val == a.val
        return NotImplemented

    def assign_value(self, val):
        if self.assigned:
            raise NotImplementedError()
        self.assigned = True
        self.val = val
        # attach the dunders to the instance only once a value exists
        self.__hash__ = self._hash
        self.__eq__ = self._eq
I am realizing that in my example the hash function will enter infinite recursion, endlessly trying to compute hash(A). __eq__ faces the same problem, though there it can be resolved the same way Python handles recursive lists. Substituting a default hash value (like hash(Wrapper)) for nested Wrapper objects in the computation does seem to fix the infinite recursion in theory, so the whole thing is still technically possible. I could also just write def _hash(self): return 0, despite the performance cost for sets and dicts.
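For concreteness, the sentinel idea might look like this (a sketch of my own, a drop-in replacement for _hash above, assuming val is a flat tuple whose items may include Wrapper instances directly):

def _hash(self):
    # replace nested Wrapper instances with a fixed sentinel to break the recursion
    parts = tuple(hash(Wrapper) if isinstance(v, Wrapper) else v for v in self.val)
    return hash((Wrapper, parts))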
Would this work as I expect? Is it the best way to do this? Most importantly, should I be doing this in the first place?

Related

Cannot overwrite __contains__ with a lambda

I would like to create a class whose instances can be used with the in operator, with the membership condition passed to the object in __init__. An example:
class Set:
    def __init__(self, contains):
        self.__contains__ = contains  # or setattr; doesn't matter

top = Set(lambda _: True)
bottom = Set(lambda _: False)
The problem with this is that 3 in top raises TypeError: argument of type 'Set' is not iterable, even though top.__contains__(3) returns True as expected.
What's more, if I modify the code as such:
class Set:
    def __init__(self, contains):
        self.__contains__ = contains

    def __contains__(self, x):
        return False

top = Set(lambda _: True)
then 3 in top will return False, whereas top.__contains__(3) returns True as expected, again.
What is happening here? I am on Python 3.9.2.
(Note: the same happens with other methods that are part of the data model, such as __gt__, __eq__ , etc.)
That's because magic methods are looked up on the class, not the instance. The interpreter circumvents the usual attribute-getting mechanisms when performing "overloadable" operations.
It seems to be this way because of how it was originally implemented in CPython, for example because of how type slots work (not the __slots__ slots; that's a different thing): how +, *, and other operators work on a value is decided by its class, not on a per-instance basis.
There's a performance benefit to this: looking up a dunder method can involve a dictionary lookup, or worse, some dynamic computations with __getattr__/__getattribute__. However, I don't know if this is the main reason it is this way.
I wasn't able to find a detailed written description, but there's a talk by Armin Ronacher on YouTube going quite in depth on this.
__contains__ is an instance method that takes a self arg.
https://docs.python.org/3/reference/datamodel.html#object.__contains__
For objects that don’t define __contains__(), the membership test first tries iteration via __iter__(), then the old sequence iteration protocol via __getitem__()
So I think what is happening in each case is:
The problem with this is that 3 in top raises TypeError: argument of type 'Set' is not iterable, even though top.__contains__(3) returns True as expected.
class Set:
    def __init__(self, contains):
        self.__contains__ = contains  # or setattr; doesn't matter

top = Set(lambda _: True)
Your Set class doesn't have a __contains__ method; only the instance has it. So Python doesn't recognise Set objects as implementing this protocol and falls back to the membership search via __iter__... but your Set class is not iterable either.
3 in top will return False, whereas top.__contains__(3) returns True as expected, again.
class Set:
    def __init__(self, contains):
        self.__contains__ = contains

    def __contains__(self, x):
        return False

top = Set(lambda _: True)
This time your Set class does have the __contains__ method, so Python will try to use it. We can see from the behaviour that 3 in top differs from top.__contains__(3). What actually happens for 3 in top is that Python does something like Set.__contains__(top, 3).
So depending on how you call it, you get either the method on the class or the lambda you overrode on the instance.
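A minimal sketch of the usual workaround, given this lookup rule: keep __contains__ on the class and have it delegate to a per-instance attribute (the attribute name _contains here is my own):

class Set:
    def __init__(self, contains):
        self._contains = contains  # plain attribute, not a dunder

    def __contains__(self, x):     # defined on the class, so `in` finds it
        return self._contains(x)

top = Set(lambda _: True)
print(3 in top)             # True
print(top.__contains__(3))  # True -- both paths now agree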

Design __eq__ that compares __dict__ of self and other safe from RecursionError

I've stumbled upon a really weird Python 3 issue, the cause of which I do not understand.
I'd like to compare my objects by checking if all their attributes are equal.
Some of the child classes will have fields that contain references to methods bound to self, and that causes a RecursionError.
Here's the PoC:
class A:
    def __init__(self, field):
        self.methods = [self.method]
        self.field = field

    def __eq__(self, other):
        if type(self) != type(other):
            return False
        return self.__dict__ == other.__dict__

    def method(self):
        pass

first = A(field='foo')
second = A(field='bar')
print(first == second)
Running the code above in Python 3 raises RecursionError and I'm not sure why. It seems that A.__eq__ is used to compare the functions kept in self.methods. So my first question is: why? Why is the object's __eq__ called to compare a bound method of that object?
The second question is: what kind of filter on __dict__ should I use to protect __eq__ from this issue? In the PoC above, self.method is kept simply in a list, but sometimes it may sit in another structure. The filtering would have to cover all the possible containers that can hold the self-reference.
One clarification: I do need to keep the self.method function in a self.methods field. The use case here is similar to unittest.TestCase._cleanups: a stack of methods that are to be called after the test is finished. The framework must be able to run the following code:
# obj is a child instance of the A class
obj.methods.append(obj.child_method)
for method in obj.methods:
    method()
Another clarification: the only code I can change is the __eq__ implementation.
"Why the object's __eq__ is called to compare bound function of that object?":
Because bound methods compare by the following algorithm:
Is the self bound to each method equal?
If so, is the function implementing the method the same?
Step 1 causes your infinite recursion; in comparing the __dict__, it eventually ends up comparing the bound methods, and to do so, it has to compare the objects to each other again, and now you're right back where you started, and it continues forever.
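A minimal illustration of that two-step comparison (my own sketch; with the default identity-based __eq__ there is no recursion, and old and new Python versions agree here):

class P:
    def m(self):
        pass

a, b = P(), P()
print(a.m == a.m)  # True: same instance and same underlying function
print(a.m == b.m)  # False: bound to different instances
print(a.m is a.m)  # False: each attribute access builds a new bound-method object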
The only "solution"s I can come up with off-hand are:
Something like the reprlib.recursive_repr decorator (which would be extremely hacky, since you'd be heuristically determining if you're comparing for bound method related reasons based on whether __eq__ was re-entered), or
A wrapper for any bound methods you store that replaces equality testing of the respective selfs with identity testing.
The wrapper for bound methods isn't terrible at least. You'd basically just make a simple wrapper of the form:
import types

class IdentityComparableMethod:
    __slots__ = '_method',

    def __new__(cls, method):
        # Using __new__ prevents reinitialization, part of immutability contract
        # that justifies defining __hash__
        self = super().__new__(cls)
        self._method = method
        return self

    def __getattr__(self, name):
        '''Attribute access should match bound method's'''
        return getattr(self._method, name)

    def __eq__(self, other):
        '''Comparable to other instances, and normal methods'''
        if not isinstance(other, (IdentityComparableMethod, types.MethodType)):
            return NotImplemented
        return (self.__self__ is other.__self__ and
                self.__func__ is other.__func__)

    def __hash__(self):
        '''Hash identically to the method'''
        return hash(self._method)

    def __call__(self, *args, **kwargs):
        '''Delegate to method'''
        return self._method(*args, **kwargs)

    def __repr__(self):
        return '{0.__class__.__name__}({0._method!r})'.format(self)
then when storing bound methods, wrap them in that class, e.g.:
self.methods = [IdentityComparableMethod(self.method)]
You may want to make methods itself enforce this via additional magic (so it only stores functions or IdentityComparableMethods), but that's the basic idea.
Other answers address more targeted filtering, this is just a way to make that filtering unnecessary.
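As a quick check (my own sketch, reusing the question's A class), the wrapper does break the recursion:

class A:
    def __init__(self, field):
        self.methods = [IdentityComparableMethod(self.method)]
        self.field = field

    def __eq__(self, other):
        if type(self) != type(other):
            return False
        return self.__dict__ == other.__dict__

    def method(self):
        pass

print(A(field='foo') == A(field='bar'))  # False, and no RecursionError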
Performance note: I didn't heavily optimize for performance; __getattr__ is the simplest way of reflecting all the attributes of the underlying method. If you want comparisons to go faster, you can fetch out __self__ during initialization and cache it on self directly to avoid __getattr__ calls, changing the __slots__ and __new__ declaration to:
__slots__ = '_method', '__self__'

def __new__(cls, method):
    # Using __new__ prevents reinitialization, part of immutability contract
    # that justifies defining __hash__
    self = super().__new__(cls)
    self._method = method
    self.__self__ = method.__self__
    return self
That makes a pretty significant difference in comparison speed; in local %timeit tests, the first == second comparison dropped from 2.77 μs to 1.05 μs. You could cache __func__ as well if you like, but since it's the fallback comparison, it's less likely to be checked at all (and you'd slow construction a titch for an optimization you're less likely to use).
Alternatively, instead of caching, you can just manually define @property accessors for __self__ and __func__, which are slower than raw attributes (comparison ran in 1.41 μs), but incur no construction-time cost at all (so if no comparison is ever run, you don't pay the lookup cost).
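That property-based variant might look like this (a sketch showing only the affected parts of IdentityComparableMethod; the remaining methods stay as defined above):

class IdentityComparableMethod:
    __slots__ = '_method',

    # ... __new__, __eq__, __hash__, __call__ and __repr__ unchanged ...

    @property
    def __self__(self):
        return self._method.__self__

    @property
    def __func__(self):
        return self._method.__func__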
The reason why self.methods = [self.method] and then performing __eq__ ends up creating a recursion error is nicely explained in one of the comments on this question by @Aran-Fey:
self.getX == other.getX compares two bound methods. Bound methods are considered equal if the method is the same, and the instances they're bound to are equal. So comparing two bound methods also compares the instances, which calls the __eq__ method again, which compares bound methods again, etc
One way to resolve it is to perform a key-wise comparison on self.__dict__ and ignore the methods key:
class A:
    def __init__(self, field):
        self.methods = [self.method]
        self.field = field

    def __eq__(self, other):
        # Iterate through all keys
        for key in self.__dict__:
            # Perform comparison on values except the 'methods' key
            if key != 'methods':
                if self.__dict__[key] != other.__dict__[key]:
                    return False
        return True

    def method(self):
        pass

first = A(field='foo')
second = A(field='bar')
print(first == second)
The output will be False
Edit:
I think the == causes the error. You can install deepdiff and modify your code to:
import deepdiff

class A:
    def __init__(self, field):
        self.methods = [self.method]
        self.field = field

    def __eq__(self, other):
        if type(self) != type(other):
            return False
        return deepdiff.DeepDiff(self.__dict__, other.__dict__) == {}

    def method(self):
        pass
Then,
A(field='foo') == A(field='bar') returns False
and
A(field='foo') == A(field='foo') returns True
Original Answer:
Try replacing
self.methods = [self.method]
with
self.methods = [A.method]
And the result is False
The issue you're running into is being caused by a very old bug in CPython. The good news is that it has already been fixed for Python 3.8 (which will soon be getting its first beta release).
To understand the issue, you need to understand how the equality check for methods from Python 2.5 through 3.7 worked. A bound method has a self and a func attribute. In the versions of Python where this bug is an issue, a comparison of two bound methods would compare both the func and the self values for Python-level equality (using the C-API equivalent to the Python == operator). With your class, this leads to infinite recursion, since the objects want to compare the bound methods stored in their methods lists, and the bound methods need to compare their self attributes.
The fixed code uses an identity comparison, rather than an equality comparison, for the self attribute of bound method objects. This has additional benefits, as methods of "equal" but not identical objects will no longer be considered equal when they shouldn't be. The motivating example was a set of callbacks. You might want your code to avoid calling the same callback several times if it was registered multiple times, but you wouldn't want to incorrectly skip a callback just because it was bound to an equal (but not identical) object. For instance, if the append methods of two distinct empty containers are both registered, you wouldn't want the two bound methods to compare equal:
class MyContainer(list):  # inherits == operator from list, so empty containers are equal
    def append(self, value):
        super().append(value)

callbacks = []

def register_callback(cb):
    if cb not in callbacks:  # this does an == test against all previously registered callbacks
        callbacks.append(cb)

def do_callbacks(*args):
    for cb in callbacks:
        cb(*args)

container1 = MyContainer()
register_callback(container1.append)
container2 = MyContainer()
register_callback(container2.append)

do_callbacks('foo')
print(container1 == container2)  # this should be True, if both callbacks got run
The print call at the end of the code will output False on versions before Python 3.8, but on 3.8, thanks to the bug fix, it will print True, as it should.
I'll post the solution I came up with (inspired by @devesh-kumar-singh's answer), however it does seem bittersweet.
def __eq__(self, other):
    if type(self) != type(other):
        return False
    for key in self.__dict__:
        try:
            if self.__dict__[key] != other.__dict__[key]:
                # if one of the attributes is different, the objects are as well
                return False
        except RecursionError:
            # We stumbled upon an attribute that is somehow bound to self
            pass
    return True
The benefit over @tianbo-ji's solution is that it's faster if we find a difference in __dict__ values before we stumble upon a bound method. But if we don't, it's an order of magnitude slower.

return something other than the object on creation of object

I want to make a class that knows not to instantiate itself based on input parameters. For a simple example, if I want to create an object that can only exist if one of its input parameters is > 1, then
foo = new_object(0.1)
should return None to foo rather than the object.
It strikes me as an elegant way to create objects, as it means I need no code outside the class to decide whether to create it or not.
Is there a way to do this, or equally useful, would this be bad practice, and why?
You'll need to override __new__ -- make sure it takes the same arguments as __init__:
class Test(object):
    def __init__(self, value):
        self.value = value

    def __new__(cls, value):
        if value > 1:
            return object.__new__(cls)
        return None

    def __repr__(self):
        return "Test value %d" % self.value

t1 = Test(2)
print(repr(t1))

t2 = Test(1)
print(repr(t2))
Python has support for returning objects of different types from __new__ but it's a fairly rare practice.
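For illustration (my own minimal example, not from the answer above): when __new__ returns something that is not an instance of cls, Python also skips calling __init__ on it:

class Weird:
    def __new__(cls, value):
        if value > 1:
            return super().__new__(cls)
        return "not an instance"  # __init__ will not run for this return value

    def __init__(self, value):
        print("__init__ called")
        self.value = value

w1 = Weird(2)  # prints "__init__ called"
w2 = Weird(0)  # no __init__ call
print(w2)      # not an instance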
In your use-case, if you are choosing between
if value <= 1:
    foo = None
else:
    foo = Test(value)
and
foo = Test(value)  # will be None if value <= 1
and this is something you have to do many times, then I would definitely consider having the class do it.
In those cases where you don't have control over new_object you can make your own factory function:
def maybe_foo(value):
    if value > 1:
        return new_object(value)
    return None
You can override __new__() to effectively turn object instantiation in to a factory-like operation like you want to do here. I sometimes like to use __new__() of an abstract base class as a factory for the concrete subclasses as long as the list of concrete subclasses can be limited and known. Just make sure it is the best solution for your problem, as it probably isn't...
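That base-class-as-factory pattern might look roughly like this (a sketch with hypothetical class names):

class Shape:
    def __new__(cls, sides):
        if cls is Shape:  # dispatch only when the base class is constructed directly
            subclass = {3: Triangle, 4: Square}.get(sides, Polygon)
            return super().__new__(subclass)
        return super().__new__(cls)

    def __init__(self, sides):
        self.sides = sides

class Triangle(Shape): pass
class Square(Shape): pass
class Polygon(Shape): pass

print(type(Shape(3)).__name__)  # Triangle
print(type(Shape(7)).__name__)  # Polygon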
Quite obviously this would be a bad practice, for the very simple reason that nobody does it like this. Calling a constructor is supposed to construct the object instance, not selectively decide if it wants to or not. You're not supposed to need to check for failure when constructing objects. So it has a quite high "wtf quota", which is never a good idea.
That said, I'm not sure if it's even possible, since __init__() is run on the instance after it's already been created (and doesn't end with return self). This being Python, I'm sure something can be wrangled. My point is that doing so is a bad idea.

is it possible to overwrite "self" to point to another object inside self.method in python?

class Wrapper(object):
    def __init__(self, o):
        # get wrapped object and do something with it
        self.o = o

    def fun(self, *args, **kwargs):
        self = self.o  # here want to swap
        # or some low level C api like
        # some_assign(self, self.o)
        # so that it swaps id() mem addr to self.o
        return self.fun(*args, **kwargs)  # and now it's class A

class A(object):
    def fun(self):
        return 'A.fun'

a = A()
w = Wrapper(a)
print(type(w))   # Wrapper
print(w.fun())   # some operation after which I want to lose Wrapper
print(a is w)    # this prints False and I'd like True :)
                 # so that this is the original A() object
Is there any way to do this in Python?
Assigning to self inside a method simply rebinds the local variable self to the new object. Generally, an assignment to a bare name never changes any objects, it just rebinds the name on the left-hand side to point to the object on the right-hand side.
So what you would need to do is modify the object self points to to match the object self.o points to. This is only possible if both A and Wrapper are new-style classes and none of them defines __slots__:
self.__class__ = self.o.__class__
self.__dict__ = self.o.__dict__
This will work in CPython, but I'm not sure about the other Python implementations. And even in CPython, it's a terrible idea to do this.
(Note that the is condition in the last line of your code will still be False, but I think this does what you intend.)
No, you can't. This would require pass-by-reference. You can change the local variable (more precisely, the parameter) self just fine, but doing that does not influence whatever location the reference passed as an argument came from (such as your w).
And given the circumstances (implicitly passed self), it's not even possible to apply the usual hacks (such as using a single-element list and mutating x[0]). Even if such tricks would work (or if there's an even more obscure hack that can do this), they'd be highly discouraged. It goes against everything Python programmers are used to. Just make the Wrapper object act as if it were replaced (i.e. forward everything to self.o). That won't make identity checks succeed, but it's by far the simplest, cleanest and most maintainable solution.
Note: For the sake of experimenting, there is a nonstandard and absolutely nonportable PyPy extension which can do this (replacing an object completely): __pypy__.become. Needless to say, it'd be extremely ill-advised to use it. Find another solution.
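A sketch of that forwarding approach (my own illustration, with a simplified Wrapper and the A class from the question):

class A(object):
    def fun(self):
        return 'A.fun'

class Wrapper(object):
    def __init__(self, o):
        self.o = o

    def __getattr__(self, name):      # called only when normal lookup fails
        return getattr(self.o, name)  # forward everything else to the wrapped object

a = A()
w = Wrapper(a)
print(w.fun())  # 'A.fun' -- forwarded; note that `a is w` is still False

One caveat: dunder methods won't be forwarded this way, because, as discussed earlier on this page, the interpreter looks them up on the class rather than the instance.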

Python: detect duplicates using a set

I have a large number of objects I need to store in memory for processing in Python. Specifically, I'm trying to remove duplicates from a large set of objects. I want to consider two objects "equal" if a certain instance variable in the object is equal. So, I assumed the easiest way to do this would be to insert all my objects into a set, and override the __hash__ method so that it hashes the instance variable I'm concerned with.
So, as a test I tried the following:
class Person:
    def __init__(self, n, a):
        self.name = n
        self.age = a

    def __hash__(self):
        return hash(self.name)

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)

myset = set()
myset.add(Person("foo", 10))
myset.add(Person("bar", 20))
myset.add(Person("baz", 30))
myset.add(Person("foo", 1000))  # try adding a duplicate

for p in myset:
    print(p)
Here, I define a Person class, and any two instances of Person with the same name variable are to be equal, regardless of the value of any other instance variable. Unfortunately, this outputs:
baz:30
foo:10
bar:20
foo:1000
Note that foo appears twice, so this program failed to notice duplicates. Yet the expression hash(Person("foo", 10)) == hash(Person("foo", 1000)) is True. So why doesn't this properly detect duplicate Person objects?
You forgot to also define __eq__().
If a class does not define a __cmp__() or __eq__() method it should not define a __hash__() operation either; if it defines __cmp__() or __eq__() but not __hash__(), its instances will not be usable in hashed collections. If a class defines mutable objects and implements a __cmp__() or __eq__() method, it should not implement __hash__(), since hashable collection implementations require that an object's hash value is immutable (if the object's hash value changes, it will be in the wrong hash bucket).
A set obviously will have to deal with hash collisions. If the hash of two objects matches, the set will compare them using the == operator to make sure they are really equal. In your case, this will only yield True if the two objects are the same object (the standard implementation for user-defined classes).
Long story short: Also define __eq__() to make it work.
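Concretely, the fix might look like this (a sketch; the __eq__ shown is my addition to the question's class):

class Person:
    def __init__(self, n, a):
        self.name = n
        self.age = a

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, Person) and self.name == other.name

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)

myset = set()
myset.add(Person("foo", 10))
myset.add(Person("foo", 1000))  # duplicate by name
print(len(myset))  # 1 -- the duplicate is now detected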
A hash function alone is not enough to distinguish objects; you also have to implement the comparison function (i.e. __eq__).
A hash function effectively says "A maybe equals B" or "A not equals B (for sure)".
If it says "maybe equals" then equality has to be checked anyway to make sure, which is why you also need to implement __eq__.
Nevertheless, defining __hash__ will significantly speed things up by making "A not equal B (for sure)" an O(1) operation.
The hash function must however always follow the "hash rule":
"hash rule": equal things must hash to the same value
(justification: or else we'd say "A not equals B (for sure)" when that is not the case)
For example you could hash everything by def __hash__(self): return 1. This would still be correct, but it would be inefficient because you'd have to check __eq__ each time, which may be a long process if you have complicated large data structures (e.g. with large lists, dictionaries, etc.).
Do note that you technically follow the "hash rule" by ignoring age in your implementation (def __hash__(self): return hash(self.name)). If Bob is a person of age 20 and Bob is another person of age 30 and they are different people (likely, unless this is some sort of keeps-track-of-people-over-time-as-they-age program), then they will hash to the same value and have to be compared with __eq__. This is perfectly fine, but I would implement it like so:
def __hash__(self):
    return hash((self.name, self.age))
Do note that your way is still correct. It would however have been a coding error to use hash((self.name, self.age)) in a world where Person("Bob", age=20) and Person("Bob", age=30) were actually the same person, because the hash function would say they're different while the equals function would say they're the same (and the equals function would be ignored, since objects in different hash buckets are never compared).
You also need the __eq__() method.
