Python instances stored in shelves change after closing it - python

I think the best way to explain the situation is with an example:
>>> class Person:
... def __init__(self, brother=None):
... self.brother = brother
...
>>> bob = Person()
>>> alice = Person(brother=bob)
>>> import shelve
>>> db = shelve.open('main.db', writeback=True)
>>> db['bob'] = bob
>>> db['alice'] = alice
>>> db['bob'] is db['alice'].brother
True
>>> db['bob'] == db['alice'].brother
True
>>> db.close()
>>> db = shelve.open('main.db',writeback=True)
>>> db['bob'] is db['alice'].brother
False
>>> db['bob'] == db['alice'].brother
False
The expected output for both comparisons is True again. However, pickle (which is used by shelve) seems to be re-instantiating bob and alice.brother separately. How can I "fix" this using shelve/pickle? Is it possible for db['alice'].brother to point to db['bob'] or something similar? Notice I do not want only to compare both, I need both to actually be the same.
As suggested by Blckknght I tried pickling the entire dictionary at once, but the problem persists since it seems to pickle each key separately.

I believe that the issue you're seeing comes from the way the shelve module stores its values. Each value is pickled independently of the other values in the shelf, which means that if the same object is inserted as a value under multiple keys, the identity will not be preserved between the keys. However, if a single value has multiple references to the same object, the identity will be maintained within that single value.
Here's an example:
a = object() # an arbitrary object
db = shelve.open("text.db")
db['a'] = a
db['another_a'] = a
db['two_a_references'] = [a, a]
db.close()
db = shelve.open("text.db") # reopen the db
print(db['a'] is db['another_a']) # prints False
print(db['two_a_references'][0] is db['two_a_references'][1]) # prints True
The first print tries to confirm the identity of two versions of the object a that were inserted in the database, one under the key 'a' directly, and another under 'another_a'. It doesn't work because the separate values are pickled separately, and so the identity between them was lost.
The second print tests whether the two references to a that were stored under the key 'two_a_references' were maintained. Because the list was pickled in one go, the identity is kept.
So to address your issue you have a few options. One approach is to avoid testing for identity and rely on an __eq__ method in your various object types to determine if two objects are semantically equal, even if they are not the same object. Another would be to bundle all your data into a single object (e.g. a dictionary) which you'd then save with pickle.dump and restore with pickle.load rather than using shelve (or you could adapt this recipe for a persistent dictionary, which is linked from the shelve docs, and does pretty much that).

The appropriate way, in Python, is to implement the __eq__ and __ne__ functions inside of the Person class, like this:
class Person(object):
def __eq__(self, other):
return (isinstance(other, self.__class__)
and self.__dict__ == other.__dict__)
def __ne__(self, other):
return not self.__eq__(other)
Generally, that should be sufficient, but if these are truly database objects and have a primary key, it would be more efficient to check that attribute instead of self.__dict__.

Problem
To preserve identity with shelve you need to preserve identity with pickleread this.
Solution
This class saves all the objects on its class site and restores them if the identity is the same. You should be able to subclass from it.
>>> class PickleWithIdentity(object):
identity = None
identities = dict() # maybe use weakreference dict here
def __reduce__(self):
if self.identity is None:
self.identity = os.urandom(10) # do not use id() because it is only 4 bytes and not random
self.identities[self.identity] = self
return open_with_identity, (self.__class__, self.__dict__), self.__dict__
>>> def open_with_identity(cls, dict):
if dict['identity'] in cls.identities:
return cls.identities[dict['identity']]
return cls()
>>> p = PickleWithIdentity()
>>> p.asd = 'asd'
>>> import pickle
>>> import os
>>> pickle.loads(pickle.dumps(p))
<__main__.PickleWithIdentity object at 0x02D2E870>
>>> pickle.loads(pickle.dumps(p)) is p
True
Further problems can occur because the state may be overwritten:
>>> p.asd
'asd'
>>> ps = pickle.dumps(p)
>>> p.asd = 123
>>> pickle.loads(ps)
<__main__.PickleWithIdentity object at 0x02D2E870>
>>> p.asd
'asd'

Related

Converting a python object with properties to a dictionary

I would like to know how to convert a python object from a dictionary (using python3 btw). I realize that this question has been asked (and answered) already (here). However, in my case the object is given entirely in terms of #property values, for example:
class Test:
#property
def value(self):
return 1.0
Regarding conversion to a dictionary: The __dict__ dictionary of the Test class is empty, and consequently, the vars
function does not work as expected:
>>> vars(Test())
{}
Still, I can use gettattr(Test(), 'value'), to obtain 1.0, so
the value is present.
Note: The reason I am coming up with this apparently contrived example is that I am trying to convert a cython cdef class (containing parameters) to a dictionary. The recommended way to wrap c structures with properties using cython is indeed based on properties.
I think you could use dir:
a = Test()
dir(a)
Output:
['__doc__', '__module__', 'value']
So you could maybe do something like:
d = {}
for attr in dir(a):
if not attr.startswith("__"):
d[attr] = getattr(a, attr)
Output:
d = {'value': 1.0}
Maybe you could abuse that:
In [10]: type(Test().__class__.__dict__['value']) is property
Out[10]: True
So you check the class of the object and if it has attribute of type property.
Here is how I would do it:
t = Test()
dictionary = {attr_name: getattr(t, attr_name)
for attr_name, method in t.__class__.__dict__.items()
if isinstance(method, property)}
It is even worse that that. You could imagine to build an instance __dict__ at init time, but that would not solve anything, except for read_only constant properties. Because the value in the dict will be a copy of the property at the time it was taken, and will not reflect future changes.

Compute class and instance hash

I need to compute a "hash" which allows me to uniquely identify an object, both it's contents and the parent class.
By comparing these "hashes" i want to be able to tell if an object has changed since the last time it was scanned.
I have found plenty of examples on how to make an object hashable, however not so much about how to compute a hash of the parent class.
It is important to note comparisons are made during different executions. I say this because I think comparing the id() of the object since the id/address of the object might be different for different executions.
I thought of resorting to inspect but I fear it might not be very efficient, also I am not very sure how that would work if the object's parent class is inheriting from another class.
If I had access to the actual memory raw data where the instance and the class' code is stored, I could just compute the hash of that.
Any ideas?
General idea is to serialize object and then take a hash. Then, the only question is to find a good library. Let's try dill:
>>>import dill
>>>class a():
pass
>>>b = a()
>>>b.x = lambda x:1
>>> hash(dill.dumps(b))
2997524124252182619
>>> b.x = lambda x:2
>>> hash(dill.dumps(b))
5848593976013553511
>>> a.y = lambda x: len(x)
>>> hash(dill.dumps(b))
-906228230367168187
>>> b.z = lambda x:2
>>> hash(dill.dumps(b))
5114647630235753811
>>>
Looks good?
dill: https://github.com/uqfoundation
To detect if an object has changed, you could generate a hash of its JSON representation and compare to the latest hash generated by the same method.
import json
instance.foo = 5
hash1 = hash(json.dumps(instance.__dict__, sort_keys=True))
instance.foo = 6
hash2 = hash(json.dumps(instance.__dict__, sort_keys=True))
hash1 == hash2
>> False
instance.foo = 5
hash3 = hash(json.dumps(instance.__dict__, sort_keys=True))
hash1 == hash3
>> True
Or, since json.dumps gives us a string, you can simply compare them instead of generating a hash.
import json
instance.foo = 5
str1 = json.dumps(instance.__dict__, sort_keys=True)
instance.foo = 6
str2 = json.dumps(instance.__dict__, sort_keys=True)
str1 == str2
>> False

Can I just reassign list instead of using persistentlist in ZODB

In this document, a method of handling mutable list in ZODB is introduced, which is "to use the mutable attribute as though it were immutable" by reassigning the list. I tried to create a simple but rather long inverted index in an OOBTree structure, where the values are lists. Inspired by the above-mentioned method, without using Persistentlist or anything for that matter, I simply replaced "append" with reassigning the list, and it worked just fine, and the resulting database file is ideally small. I checked and apparently there was nothing wrong with the index, but I just can't feel sure about it. Can it really be this easy? Any advice would be greatly appreciated.
The following is some fragments of code I wrote to create an inverted index with ZODB:
#......
root['idx']=OOBTree()
myindex = root['idx']
i=0
#.....
doc = cursor.first()
while doc:
docdict = {}
for word in doc.words():
docdict[word]=docdict.setdefault(word,0)+1
for (word,freq) in docdict.iteritems():
i+=1
if word in myindex:
myindex[word]= myindex[word]+ [(doc[0],freq)]
# This seems to work just fine.
# This is where I'm confused. Can this really work all right?
else:
myindex[word]=[(doc[0],freq)] # This list is not a PersistentList.
if i % 200000 == 0: transaction.savepoint(True)
doc =cursor.next()
docs.close()
transaction.commit()
#......
Yes, it can really be this easy.
The Persistent baseclass overrides the __setattr__ hook to detect when changes have been made that need to be committed to the database. Changes to mutable attributes, such as a list, do not use the __setattr__ hook and are thus not detected.
It's easiest to show this with a demo class:
>>> class Demo(object):
... def __setattr__(self, name, value):
... print 'Setting self.{} to {!r}'.format(name, value)
... super(Demo, self).__setattr__(name, value)
...
>>> d = Demo()
>>> d.foo = 'bar'
Setting self.foo to 'bar'
>>> d.foo = 'baz'
Setting self.foo to 'baz'
>>> d.foo
'baz'
>>> d.spam = []
Setting self.spam to []
>>> d.spam.append(42)
>>> d.spam
[42]
Note how the d.spam.append() call did not result in printed output; the .append() call works directly on the list, not on the parent Demo() instance.
Reassignment, on the other hand, is detected:
>>> d.spam = d.spam
Setting self.spam to [42]
Another way to signal to Persistent that something changed is by setting the _p_changed flag directly:
myindex._p_changed = True
If you are ever in doubt that your change is being detected, you can also test that flag:
assert myindex._p_changed, "Oops, no change detected?"
An AssertionError will be raised if myindex._p_changed is False (or perhaps 0); which is only the case if no changes were detected since the beginning of the current transaction.

Understanding python object membership for sets

If I understand correctly, the __cmp__() function of an object is called in order to evaluate all objects in a collection while determining whether an object is a member, or 'in', the collection.
However, this does not seem to be the case for sets:
class MyObject(object):
def __init__(self, data):
self.data = data
def __cmp__(self, other):
return self.data-other.data
a = MyObject(5)
b = MyObject(5)
print a in [b] //evaluates to True, as I'd expect
print a in set([b]) //evaluates to False
How is an object membership tested in a set, then?
Adding a __hash__ method to your class yields this:
class MyObject(object):
def __init__(self, data):
self.data = data
def __cmp__(self, other):
return self.data - other.data
def __hash__(self):
return hash(self.data)
a = MyObject(5)
b = MyObject(5)
print a in [b] # True
print a in set([b]) # Also True!
>>> xs = []
>>> set([xs])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
There you are. Sets use hashes, very similar to dicts. This help performance extremely (membership tests are O(1), and many other operations depend on membership tests), and it also fits the semantics of sets well: Set items must be unique, and different items will produce different hashes, while same hashes indicate (well, in theory) duplicates.
Since the default __hash__ is just id (which is rather stupid imho), two instances of a class that inherits object's __hash__ will never hash to the same value (well, unless adress space is larger than the sizeof the hash).
As others pointed, your objects don't have a __hash__ so they use the default id as a hash, and you can override it as Nathon suggested, BUT read the docs about __hash__, specifically the points about when you should and should not do that.
A set uses a dict behind the scenes, so the "in" statement is checking whether the object exists as a key in the dict. Since your object doesn't implement a hash function, the default hash function for objects uses the object's id. So even though a and b are equivalent, they're not the same object, and that's what's being tested.

Accessing the name of an instance in Python for printing

So as part of problem 17.6 in "Think Like a Computer Scientist", I've written a class called Kangaroo:
class Kangaroo(object):
def __init__(self, pouch_contents = []):
self.pouch_contents = pouch_contents
def __str__(self):
'''
>>> kanga = Kangaroo()
>>> kanga.put_in_pouch('olfactory')
>>> kanga.put_in_pouch(7)
>>> kanga.put_in_pouch(8)
>>> kanga.put_in_pouch(9)
>>> print kanga
"In kanga's pouch there is: ['olfactory', 7, 8, 9]"
'''
return "In %s's pouch there is: %s" % (object.__str__(self), self.pouch_contents)
def put_in_pouch(self, other):
'''
>>> kanga = Kangaroo()
>>> kanga.put_in_pouch('olfactory')
>>> kanga.put_in_pouch(7)
>>> kanga.put_in_pouch(8)
>>> kanga.put_in_pouch(9)
>>> kanga.pouch_contents
['olfactory', 7, 8, 9]
'''
self.pouch_contents.append(other)
What's driving me nuts is that I'd like to be able to write a string method that would pass the unit test underneath __str__ as written. What I'm getting now instead is:
In <__main__.Kangaroo object at 0x4dd870>'s pouch there is: ['olfactory', 7, 8, 9]
Bascially, what I'm wondering if there is some function that I can perform on kanga = Kangaroo such that the output of the function is those 5 characters, i.e. function(kanga) -> "kanga".
Any ideas?
Edit:
Reading the first answer has made me realize that there is a more concise way to ask my original question. Is there a way to rewrite __init__ such that the following code is valid as written?
>>> somename = Kangaroo()
>>> somename.name
'somename'
To put your request into perspective, please explain what name you would like attached to the object created by this code:
marsupials = []
marsupials.append(Kangaroo())
This classic essay by the effbot gives an excellent explanation.
To answer the revised question in your edit: No.
Now that you've come clean in a comment and said that the whole purpose of this naming exercise was to distinguish between objects for debugging purposes associated with your mutable default argument:
In CPython implementations of Python at least, at any given time, all existing objects have a unique ID, which may be obtained by id(obj). This may be sufficient for your debugging purposes. Note that if an object is deleted, that ID (which is a memory address) can be re-used by a subsequently created object.
I wasn't going to post this but if you only want this for debugging then here you go:
import sys
class Kangaroo(object):
def __str__(self):
flocals = sys._getframe(1).f_locals
for ident in flocals:
if flocals[ident] is self:
name = ident
break
else:
name = 'roo'
return "in {0}'s pouch, there is {1}".format(name, self.pouch_contents)
kang = Kangaroo()
print kang
This is dependent on CPython (AFAIK) and isn't suitable for production code. It wont work if the instance is in any sort of container and may fail for any reason at any time. It should do the trick for you though.
It works by getting the f_locals dictionary out of the stack frame that represents the namespace where print kang is called. The keys of f_locals are the names of the variables in the frame so we just loop through it and test if each entry is self. If so, we break. If break is not executed, then we didn't find an entry and the loops else clause assigns the value 'roo' as requested.
If you want to get it out of a container of some sort, you need to extend this to look through any containers in f_locals. You could either return the key if it's a dictlike container or the index if it's something like a tuple or list.
class Kangaroo(object):
def __init__(self, pouch_contents=None, name='roo'):
if pouch_contents is None:
self.pouch_contents = [] # this isn't shared with all other instances
else:
self.pouch_contents = pouch_contents
self.name = name
...
kanga = Kangaroo(name='kanga')
Note that it's good style not to put spaces around = in the arguments
What you want is basically impossible in Python, even with the suggested "hacks". For example,
what would the following code print?
>>> kanga1 = kanga2 = kanga3 = Kangaroo()
>>> kanga2.name
???
>>> kanga3.name
???
or
>>> l = [Kangaroo()]
>>> l[0].name
???
If you want "named" objects, just supply a name to your object
def __init__(self, name):
self.name = name
More explicit (which we like with Python) and consistent in all cases. Sure you can do something like
>>> foo = Kangaroo("bar")
>>> foo.name
'bar'
but foo will be just one of the possibly many labels the instance has. The name is explicit and permanent. You can even enforce unique naming if you want (while you can reuse a variable as much as you want for different objects)
I hadn't seen aaronasterling's hackish answer when I started working on this, but in any case here's a hackish answer of my own:
class Kangaroo(object):
def __init__(self, pouch_contents = ()):
self.pouch_contents = list(pouch_contents)
def __str__(self):
if not hasattr(self, 'name'):
for k, v in globals().iteritems():
if id(v) == id(self):
self.name = k
break
else:
self.name = 'roo'
return "In %s's pouch there is: %s" % (self.name, self.pouch_contents)
kanga = Kangaroo()
print kanga
You can break this by looking at it funny (it works as written, but it will fail as part of a doctest), but it works. I'm more concerned with what's possible than with what's practical at this point in my learning experience, so I'm glad to see that there are at least two different ways to do a thing I figured should be possible to do. Even if they're bad ways.

Categories

Resources