Understanding python object membership for sets

If I understand correctly, an object's __cmp__() method is called against each element of a collection to determine whether the object is a member of (is 'in') that collection.
However, this does not seem to be the case for sets:
class MyObject(object):
    def __init__(self, data):
        self.data = data
    def __cmp__(self, other):
        return self.data - other.data
a = MyObject(5)
b = MyObject(5)
print a in [b]  # evaluates to True, as I'd expect
print a in set([b])  # evaluates to False
How is object membership tested in a set, then?

Adding a __hash__ method to your class yields this:
class MyObject(object):
    def __init__(self, data):
        self.data = data
    def __cmp__(self, other):
        return self.data - other.data
    def __hash__(self):
        return hash(self.data)
a = MyObject(5)
b = MyObject(5)
print a in [b] # True
print a in set([b]) # Also True!

>>> xs = []
>>> set([xs])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
There you are. Sets use hashes, much like dicts. This helps performance enormously (membership tests are O(1) on average, and many other operations depend on membership tests), and it also fits the semantics of sets well: set items must be unique, equal items must produce the same hash, and items with different hashes are guaranteed to be distinct, while identical hashes indicate (well, in theory) duplicates.
Since the default __hash__ is just derived from id (which is rather stupid imho), two instances of a class that inherits object's __hash__ will never hash to the same value (well, unless the address space is larger than the range of the hash).

As others pointed out, your objects don't have a __hash__, so they use the default id-based hash. You can override it as Nathon suggested, BUT read the docs about __hash__, specifically the points about when you should and should not do that.
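One of the points in the docs is that an object's hash should not change while the object is stored in a set or dict. A minimal sketch of what can go wrong otherwise, written for Python 3 (the Box class is hypothetical and only serves as an illustration):

```python
class Box:
    """Hypothetical mutable class whose hash depends on mutable state."""

    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        return isinstance(other, Box) and self.data == other.data

    def __hash__(self):
        return hash(self.data)


b = Box(5)
s = {b}
found_before = b in s   # True: b is looked up under hash(5)

b.data = 6              # the hash changes while b is still stored in s
found_after = b in s    # False: the set now looks under hash(6)

print(found_before, found_after)
```

The object is still in the set, but it can no longer be found, which is why a value-based __hash__ is usually reserved for objects that are treated as immutable.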

A set is a hash table, much like a dict, so the "in" operator checks whether the object exists as a key in that table. Since your object doesn't implement a hash function, the default hash function for objects, which is based on the object's id, is used. So even though a and b are equivalent, they're not the same object, and that's what's being tested.
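The examples above are Python 2 (__cmp__ no longer exists in Python 3). As a rough Python 3 sketch of the same idea, the three hypothetical classes below show the identity-based default, what happens when only __eq__ is defined, and the combination a set actually needs:

```python
class NoHash:
    """Inherits object's identity-based __eq__ and __hash__."""

    def __init__(self, data):
        self.data = data


class EqOnly:
    """Defines __eq__ only; Python 3 then sets __hash__ to None."""

    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        return isinstance(other, EqOnly) and self.data == other.data


class EqAndHash:
    """Defines both, so equal instances land in the same hash bucket."""

    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        return isinstance(other, EqAndHash) and self.data == other.data

    def __hash__(self):
        return hash(self.data)


identity_based = NoHash(5) in {NoHash(5)}        # False: different objects
value_based = EqAndHash(5) in {EqAndHash(5)}     # True: same hash, equal

try:
    {EqOnly(5)}
    eq_only_error = None
except TypeError as exc:
    eq_only_error = str(exc)                     # instances are unhashable

print(identity_based, value_based, eq_only_error)
```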

Related

Why is the dictionary key being converted to an inherited class type?

My code looks something like this:
class SomeClass(str):
    pass
some_dict = {'s':42}
>>> type(some_dict.keys()[0])
str
>>> s = SomeClass('s')
>>> some_dict[s] = 40
>>> some_dict # expected: Two different keys-value pairs
{'s': 40}
>>> type(some_dict.keys()[0])
str
Why did Python convert the object s to the string "s" while updating the dictionary some_dict?

Whilst the hash value is related, it is not the main factor.
It is equality that is more important here. That is, objects may have the same hash value and not be equal, but equal objects must have the same hash value (though this is not strictly enforced). Otherwise you will end up with some strange bugs when using dict and set.
Since you have not defined the __eq__ method on SomeClass you inherit the one on str. Python's builtins are built to allow subclassing, so __eq__ returns true, if the object would otherwise be equal were it not for them having different types. eg. 's' == SomeClass('s') is true. Thus it is right and proper that 's' and SomeClass('s') are equivalent as keys to a dictionary.
To get the behaviour you want you must redefine the __eq__ dunder method to take into account type. However, when you define a custom equals, python stops giving you an automatic __hash__ dunder method, and you must redefine it as well. But in this case we can just reuse str.__hash__.
class SomeClass(str):
    def __eq__(self, other):
        return (
            type(self) is SomeClass
            and type(other) is SomeClass
            and super().__eq__(other)
        )
    __hash__ = str.__hash__
d = {'s': 1}
d[SomeClass('s')] = 2
assert len(d) == 2
print(d)
prints: {'s': 1, 's': 2} (in insertion order on Python 3.7+; older versions may print the two keys in either order)

This is a really good question. When you put a (key, value) pair into a dict, the dict uses the hash function to get the hash value of the key and checks whether that hash is already present. If it is, the dict compares the new key with the stored key that has the same hash. If the two objects are equal (__eq__(self, other) returns True), it updates the value of the existing entry, which is why your code behaves this way.
Since SomeClass does not override anything, 's' and SomeClass('s') have the same hash code and 's'.__eq__(SomeClass('s')) returns True.
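A small Python 3 sketch of that lookup, assuming a hypothetical subclass Plain that overrides nothing. It also shows that the dict keeps the key object it already had and only replaces the value:

```python
class Plain(str):
    """Hypothetical str subclass that overrides nothing."""


key = Plain('s')
same_hash = hash('s') == hash(key)   # True: __hash__ is inherited from str
equal = 's' == key                   # True: __eq__ is inherited from str

d = {'s': 42}
d[key] = 40                          # same hash and equal: existing entry is updated

stored_key_type = type(next(iter(d)))
print(d, stored_key_type)
```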

Python instances stored in shelves change after closing it

I think the best way to explain the situation is with an example:
>>> class Person:
...     def __init__(self, brother=None):
...         self.brother = brother
...
>>> bob = Person()
>>> alice = Person(brother=bob)
>>> import shelve
>>> db = shelve.open('main.db', writeback=True)
>>> db['bob'] = bob
>>> db['alice'] = alice
>>> db['bob'] is db['alice'].brother
True
>>> db['bob'] == db['alice'].brother
True
>>> db.close()
>>> db = shelve.open('main.db',writeback=True)
>>> db['bob'] is db['alice'].brother
False
>>> db['bob'] == db['alice'].brother
False
The expected output for both comparisons is True again. However, pickle (which is used by shelve) seems to be re-instantiating bob and alice.brother separately. How can I "fix" this using shelve/pickle? Is it possible for db['alice'].brother to point to db['bob'] or something similar? Notice I do not want only to compare both, I need both to actually be the same.
As suggested by Blckknght I tried pickling the entire dictionary at once, but the problem persists since it seems to pickle each key separately.

I believe that the issue you're seeing comes from the way the shelve module stores its values. Each value is pickled independently of the other values in the shelf, which means that if the same object is inserted as a value under multiple keys, the identity will not be preserved between the keys. However, if a single value has multiple references to the same object, the identity will be maintained within that single value.
Here's an example:
import shelve
a = object() # an arbitrary object
db = shelve.open("text.db")
db['a'] = a
db['another_a'] = a
db['two_a_references'] = [a, a]
db.close()
db = shelve.open("text.db") # reopen the db
print(db['a'] is db['another_a']) # prints False
print(db['two_a_references'][0] is db['two_a_references'][1]) # prints True
The first print tries to confirm the identity of two versions of the object a that were inserted in the database, one under the key 'a' directly, and another under 'another_a'. It doesn't work because the separate values are pickled separately, and so the identity between them was lost.
The second print tests whether the two references to a that were stored under the key 'two_a_references' were maintained. Because the list was pickled in one go, the identity is kept.
So to address your issue you have a few options. One approach is to avoid testing for identity and rely on an __eq__ method in your various object types to determine if two objects are semantically equal, even if they are not the same object. Another would be to bundle all your data into a single object (e.g. a dictionary) which you'd then save with pickle.dump and restore with pickle.load rather than using shelve (or you could adapt this recipe for a persistent dictionary, which is linked from the shelve docs, and does pretty much that).
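The second option can be sketched with pickle alone. This is only an illustration: plain dicts stand in for the Person instances (an assumption made so the snippet does not depend on a class being importable), and the data is round-tripped in memory with pickle.dumps/pickle.loads instead of being written to a file:

```python
import pickle

bob = {'name': 'bob'}
alice = {'name': 'alice', 'brother': bob}

# Pickled separately (roughly what shelve does per key): the link is lost.
bob_alone = pickle.loads(pickle.dumps(bob))
alice_alone = pickle.loads(pickle.dumps(alice))
separate_identity = alice_alone['brother'] is bob_alone        # False

# Pickled together as one object graph: the link survives.
people = pickle.loads(pickle.dumps({'bob': bob, 'alice': alice}))
shared_identity = people['alice']['brother'] is people['bob']  # True

print(separate_identity, shared_identity)
```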

The appropriate way, in Python, is to implement the __eq__ and __ne__ functions inside of the Person class, like this:
class Person(object):
    def __eq__(self, other):
        return (isinstance(other, self.__class__)
                and self.__dict__ == other.__dict__)
    def __ne__(self, other):
        return not self.__eq__(other)
Generally, that should be sufficient, but if these are truly database objects and have a primary key, it would be more efficient to check that attribute instead of self.__dict__.

Problem
To preserve identity with shelve you need to preserve identity with pickle (read this).
Solution
This class keeps all of its instances in a class-level dict and, when unpickling, returns the existing instance if the identity matches. You should be able to subclass from it.
>>> class PickleWithIdentity(object):
...     identity = None
...     identities = dict() # maybe use weakreference dict here
...     def __reduce__(self):
...         if self.identity is None:
...             self.identity = os.urandom(10) # do not use id() because it is only 4 bytes and not random
...             self.identities[self.identity] = self
...         return open_with_identity, (self.__class__, self.__dict__), self.__dict__
...
>>> def open_with_identity(cls, dict):
...     if dict['identity'] in cls.identities:
...         return cls.identities[dict['identity']]
...     return cls()
...
>>> p = PickleWithIdentity()
>>> p.asd = 'asd'
>>> import pickle
>>> import os
>>> pickle.loads(pickle.dumps(p))
<__main__.PickleWithIdentity object at 0x02D2E870>
>>> pickle.loads(pickle.dumps(p)) is p
True
Further problems can occur because the state may be overwritten:
>>> p.asd
'asd'
>>> ps = pickle.dumps(p)
>>> p.asd = 123
>>> pickle.loads(ps)
<__main__.PickleWithIdentity object at 0x02D2E870>
>>> p.asd
'asd'

Redeclaration of the method "in" within a class

I am creating an abstract data type that implements a doubly linked list (not sure that's the correct translation). I have created a __len__ method to calculate its length correctly and a __repr__ method to represent it correctly, but now I want to create a method so that when the user writes something like:
if foo in liste_adt:
it returns the correct answer. I don't know what to use, because __in__ does not work.
Thank you,

Are you looking for __contains__?
object.__contains__(self, item)
Called to implement membership test operators. Should return true if item is in self, false otherwise. For mapping objects, this should consider the keys of the mapping rather than the values or the key-item pairs.
For objects that don’t define __contains__(), the membership test first tries iteration via __iter__(), then the old sequence iteration protocol via __getitem__(), see this section in the language reference.
Quick example:
>>> class Bar:
...     def __init__(self, iterable):
...         self.list = list(iterable)
...     def __contains__(self, item):
...         return item in self.list
>>>
>>> b = Bar([1,2,3])
>>> b.list
[1, 2, 3]
>>> 4 in b
False
>>> 2 in b
True
Note: when you have this kind of doubt, the answer can usually be found in the Data Model section of The Python Language Reference.

Since the data structure is a linked list, it is necessary to iterate over it to check membership. Implementing an __iter__() method would make both if in and for in work. If there is a more efficient way for checking membership, implement that in __contains__().
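A minimal sketch of that suggestion, assuming a simple node-based layout (the Node and DoublyLinkedList classes and the append method are hypothetical, not the asker's actual code). Only __iter__ is defined, and both the in test and for loops fall back to it:

```python
class Node:
    """One element of the list, linked to its neighbours."""

    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None


class DoublyLinkedList:
    def __init__(self, values=()):
        self.head = None
        self.tail = None
        self.size = 0
        for value in values:
            self.append(value)

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node
        self.size += 1

    def __len__(self):
        return self.size

    def __iter__(self):
        # Walk the list from head to tail, yielding stored values.
        node = self.head
        while node is not None:
            yield node.value
            node = node.next


liste_adt = DoublyLinkedList([1, 2, 3])
print(2 in liste_adt, 4 in liste_adt)   # membership uses __iter__
print([value for value in liste_adt])   # so does iteration
```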

Is there a way to check if two object contain the same values in each of their variables in python?

How do I check if two instances of a
class FooBar(object):
    def __init__(self, param):
        self.param = param
        self.param_2 = self.function_2(param)
        self.param_3 = self.function_3()
are identical? By identical I mean they have the same values in all of their variables.
a = FooBar(param)
b = FooBar(param)
I thought of
if a == b:
    print "a and b are identical!"
Will this do it without side effects?
The background for my question is unit testing. I want to achieve something like:
self.failUnlessEqual(self.my_object.a_function(), another_object)

If you want the == to work, then implement the __eq__ method in your class to perform the rich comparison.
If all you want to do is compare the equality of all attributes, you can do that succinctly by comparison of __dict__ in each object:
class MyClass:
    def __eq__(self, other):
        return self.__dict__ == other.__dict__

For an arbitrary object, the == operator will only return true if the two objects are the same object (i.e. if they refer to the same address in memory).
To get more 'bespoke' behaviour, you'll want to override the rich comparison operators, in this case specifically __eq__. Try adding this to your class:
def __eq__(self, other):
    if self.param == other.param \
            and self.param_2 == other.param_2 \
            and self.param_3 == other.param_3:
        return True
    else:
        return False
(the comparison of all params could be neatened up here, but I've left them in for clarity).
Note that if the parameters are themselves objects you've defined, those objects will have to define __eq__ in a similar way for this to work.
Another point to note is that if you try to compare a FooBar object with another type of object in the way I've done above, python will try to access the param, param_2 and param_3 attributes of the other type of object which will throw an AttributeError. You'll probably want to check the object you're comparing with is an instance of FooBar with isinstance(other, FooBar) first. This is not done by default as there may be situations where you would like to return True for comparison between different types.
See AJ's answer for a tidier way to simply compare all parameters that also shouldn't throw an attribute error.
For more information on the rich comparison see the python docs.
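One way to add the type check mentioned above is sketched below, for Python 3. The constructor is simplified to take the three values directly (an assumption made to keep the example self-contained), and the method returns NotImplemented for foreign types so that Python falls back to its default comparison instead of raising AttributeError:

```python
class FooBar:
    def __init__(self, param, param_2, param_3):
        # Simplified, hypothetical constructor: values are passed in directly.
        self.param = param
        self.param_2 = param_2
        self.param_3 = param_3

    def __eq__(self, other):
        if not isinstance(other, FooBar):
            return NotImplemented   # let Python try the other operand
        return (
            (self.param, self.param_2, self.param_3)
            == (other.param, other.param_2, other.param_3)
        )


a = FooBar(1, 'x', [3])
b = FooBar(1, 'x', [3])

same_values = a == b                       # True
different_values = a == FooBar(2, 'x', [3])  # False
other_type = a == "not a FooBar"           # False, and no AttributeError

print(same_values, different_values, other_type)
```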

For Python 3.7 onwards you can also use dataclass to check exactly what you want very easily. For example:
from dataclasses import dataclass
@dataclass
class FooBar:
    param: str
    param2: float
    param3: int
a = FooBar("test_text", 2.0, 3)
b = FooBar("test_text", 2.0, 3)
print(a == b)
would print True

According to Learning Python by Lutz, the "==" operator tests value equivalence, comparing all nested objects recursively. The "is" operator tests whether two objects are the same object, i.e. of the same address in memory (same pointer value).
Except for the caching/reuse of small integers and simple strings, two objects such as x = [1,2] and y = [1,2] are equal ("==") in value, but y "is" x returns False. The same is typically true of two floats x = 3.567 and y = 3.567 created separately (for example, on separate lines in the interactive interpreter). This means their addresses are different, or in other words, hex(id(x)) != hex(id(y)).
For instances of a user-defined class, we have to override the __eq__() method to make two class A objects like x = A(1,[2,3]) and y = A(1,[2,3]) "==" in content. By default, "==" on such instances resorts to comparing identity only, and id(x) != id(y) in this case, so x != y.
In summary, if x "is" y, then x == y (for a well-behaved __eq__), but the opposite is not necessarily true.
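These claims can be checked with a short Python 3 sketch. The classes A and AWithEq are hypothetical; the last two lines show float('nan'), a well-known case where an object is identical to itself but not equal to itself:

```python
x = [1, 2]
y = [1, 2]
lists_equal = x == y      # True: same contents
lists_same = x is y       # False: two separate list objects


class A:
    """No __eq__: comparison falls back to identity."""

    def __init__(self, n, items):
        self.n = n
        self.items = items


class AWithEq(A):
    """Adds a value-based __eq__."""

    def __eq__(self, other):
        if not isinstance(other, AWithEq):
            return NotImplemented
        return (self.n, self.items) == (other.n, other.items)


default_equal = A(1, [2, 3]) == A(1, [2, 3])             # False
custom_equal = AWithEq(1, [2, 3]) == AWithEq(1, [2, 3])  # True

nan = float('nan')
nan_same = nan is nan     # True
nan_equal = nan == nan    # False: NaN never compares equal, even to itself

print(lists_equal, lists_same, default_equal, custom_equal, nan_same, nan_equal)
```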

If this is something you want to use in your tests where you just want to verify fields of simple object to be equal, look at compare from testfixtures:
from testfixtures import compare
compare(a, b)

To avoid the possibility of adding or removing attributes to the model and forgetting to do the appropriate changes to your __eq__ function, you can define it as follows.
def __eq__(self, other):
    if self.__class__ == other.__class__:
        fields = [field.name for field in self._meta.fields]
        for field in fields:
            if not getattr(self, field) == getattr(other, field):
                return False
        return True
    else:
        raise TypeError('Comparing object is not of the same type.')
In this way, all the object attributes are compared. Now you can check for attribute equality either with object.__eq__(other) or object == other.

Reuse existing objects for immutable objects?

In Python, how is it possible to reuse existing equal immutable objects (like is done for str)? Can this be done just by defining a __hash__ method, or does it require more complicated measures?

If you want to create via the class constructor and have it return a previously created object then you will need to provide a __new__ method (because by the time you get to __init__ the object has already been created).
Here is a simple example - if the value used to initialise has been seen before then a previously created object is returned rather than a new one created:
class Cached(object):
    """Simple example of immutable object reuse."""
    def __init__(self, i):
        self.i = i
    def __new__(cls, i, _cache={}):
        try:
            return _cache[i]
        except KeyError:
            # you must call __new__ on the base class
            x = super(Cached, cls).__new__(cls)
            x.__init__(i)
            _cache[i] = x
            return x
Note that for this example you can use anything to initialise as long as it's hashable. And just to show that objects really are being reused:
>>> a = Cached(100)
>>> b = Cached(200)
>>> c = Cached(100)
>>> a is b
False
>>> a is c
True

There are two 'software engineering' solutions to this that don't require any low-level knowledge of Python. They apply in the following scenarios:
First Scenario: Objects of your class are 'equal' if they are constructed with the same constructor parameters, and equality won't change over time after construction. Solution: Use a factory that hashes the constructor parameters:
class MyClass:
    def __init__(self, someint, someotherint):
        self.a = someint
        self.b = someotherint
cachedict = { }
def construct_myobject(someint, someotherint):
    if (someint, someotherint) not in cachedict:
        cachedict[(someint, someotherint)] = MyClass(someint, someotherint)
    return cachedict[(someint, someotherint)]
This approach essentially limits the instances of your class to one unique object per distinct input pair. There are obvious drawbacks as well: not all types are easily hashable and so on.
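As a variation on the same factory idea (not part of the original answer), the standard library's functools.lru_cache can keep the cache for you. This is only a sketch, and it has the same limitation: all arguments must be hashable.

```python
from functools import lru_cache


class MyClass:
    def __init__(self, someint, someotherint):
        self.a = someint
        self.b = someotherint


@lru_cache(maxsize=None)   # unbounded cache keyed on the call arguments
def construct_myobject(someint, someotherint):
    return MyClass(someint, someotherint)


first = construct_myobject(1, 2)
second = construct_myobject(1, 2)   # served from the cache
third = construct_myobject(2, 1)    # different arguments, new object

print(first is second, first is third)
```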
Second Scenario: Objects of your class are mutable and their 'equality' may change over time. Solution: define a class-level registry of equal instances:
class MyClass:
    registry = { }
    def __init__(self, someint, someotherint, third):
        MyClass.registry[id(self)] = (someint, someotherint)
        self.someint = someint
        self.someotherint = someotherint
        self.third = third
    def __eq__(self, other):
        return MyClass.registry[id(self)] == MyClass.registry[id(other)]
    def update(self, someint, someotherint):
        MyClass.registry[id(self)] = (someint, someotherint)
In this example, objects with the same someint, someotherint pair are equal, while the third parameter does not factor in. The trick is to keep the parameters in registry in sync. As an alternative to update, you could override __getattr__ and __setattr__ for your class instead; this would ensure that any assignment foo.someint = y is kept in sync with your class-level dictionary. See an example here.

I believe you would have to keep a dict {args: object} of instances already created, then override the class' __new__ method to check in that dictionary, and return the relevant object if it already existed. Note that I haven't implemented or tested this idea. Of course, strings are handled at the C level.
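A rough Python 3 sketch of that idea follows. The Interned class and its args attribute are hypothetical, and a weakref.WeakValueDictionary is used (an assumption, not something the answer specifies) so that cached instances can still be garbage collected once nothing else refers to them:

```python
import weakref


class Interned:
    """Hypothetical class that reuses instances built from equal arguments."""

    # Maps (class, args) to the live instance created for those arguments.
    _instances = weakref.WeakValueDictionary()

    def __new__(cls, *args):
        key = (cls, args)
        try:
            return cls._instances[key]
        except KeyError:
            self = super().__new__(cls)
            self.args = args            # state is set here, not in __init__
            cls._instances[key] = self
            return self


a = Interned(1, 2)
b = Interned(1, 2)
c = Interned(2, 1)

print(a is b, a is c)
```

State is assigned in __new__ rather than __init__ because Python calls __init__ again every time the constructor returns an instance of the class, including a cached one.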
