Compute class and instance hash - Python

I need to compute a "hash" which allows me to uniquely identify an object, both its contents and its parent class.
By comparing these "hashes" I want to be able to tell whether an object has changed since the last time it was scanned.
I have found plenty of examples of how to make an object hashable, but not much about how to compute a hash of the parent class.
It is important to note that comparisons are made across different executions. I mention this because comparing id() values will not work: the id/address of an object may differ between executions.
I thought of resorting to inspect, but I fear it might not be very efficient, and I am also not sure how that would work if the object's parent class inherits from another class.
If I had access to the raw memory where the instance and the class's code are stored, I could just compute the hash of that.
Any ideas?

General idea is to serialize object and then take a hash. Then, the only question is to find a good library. Let's try dill:
>>> import dill
>>> class a():
...     pass
...
>>> b = a()
>>> b.x = lambda x: 1
>>> hash(dill.dumps(b))
2997524124252182619
>>> b.x = lambda x:2
>>> hash(dill.dumps(b))
5848593976013553511
>>> a.y = lambda x: len(x)
>>> hash(dill.dumps(b))
-906228230367168187
>>> b.z = lambda x:2
>>> hash(dill.dumps(b))
5114647630235753811
>>>
Looks good?
dill: https://github.com/uqfoundation
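One caveat for the cross-execution requirement: in Python 3, the built-in hash() of a bytes object is salted per interpreter process (PYTHONHASHSEED), so hash(dill.dumps(b)) will not match between runs. A hashlib digest is stable. A minimal sketch using stdlib pickle for illustration (swap in dill.dumps for objects, such as lambdas, that plain pickle cannot serialize); stable_hash and the class A are illustrative names:

```python
import hashlib
import pickle


class A:
    def __init__(self):
        self.x = 1


def stable_hash(obj):
    # hashlib digests are reproducible across interpreter runs, unlike the
    # built-in hash(), which salts str/bytes hashing per process.
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


b = A()
h1 = stable_hash(b)
b.x = 2
h2 = stable_hash(b)
print(h1 == h2)  # False: the instance state changed
```

Note that this assumes the serializer's output is deterministic for your objects, which holds for simple attribute data but should be verified for exotic types.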

To detect if an object has changed, you could generate a hash of its JSON representation and compare to the latest hash generated by the same method.
import json
instance.foo = 5
hash1 = hash(json.dumps(instance.__dict__, sort_keys=True))
instance.foo = 6
hash2 = hash(json.dumps(instance.__dict__, sort_keys=True))
hash1 == hash2
>> False
instance.foo = 5
hash3 = hash(json.dumps(instance.__dict__, sort_keys=True))
hash1 == hash3
>> True
Or, since json.dumps gives us a string, you can simply compare the strings directly instead of generating a hash.
import json
instance.foo = 5
str1 = json.dumps(instance.__dict__, sort_keys=True)
instance.foo = 6
str2 = json.dumps(instance.__dict__, sort_keys=True)
str1 == str2
>> False

Related

Why do Python's getattr(obj, 'method') and obj.method give different results? How do the parentheses matter?

My program calculated only the sha256 file hash, and I decided to expand the number of supported algorithms. So I started to use getattr() instead of a direct call, and the hashes changed.
It took me a while to figure out where the problem was; here's a simple example with a string (the differences are in the parentheses):
>>> import hashlib
>>> text = 'this is nonsence'.encode()
# unique original
>>> hash1 = hashlib.sha256()
>>> hash1.update(text)
>>> print(hash1.hexdigest())
ea85e601f8e91dbdeeb46b507ff108152575c816089c2d0489313b42461aa502
# pathetic parody
>>> hash2 = getattr(hashlib,'sha256')
>>> hash2().update(text)
>>> print(hash2().hexdigest())
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
# solution
>>> hash3 = getattr(hashlib,'sha256')()
>>> hash3.update(text)
>>> print(hash3.hexdigest())
ea85e601f8e91dbdeeb46b507ff108152575c816089c2d0489313b42461aa502
Can someone please explain why hash1 is not equal to hash2() but is equal to hash3?
Did I miss something? Because to me they look the same:
>>> print(hash1)
<sha256 HASH object @ 0x0000027D76700F50>
>>> print(hash2())
<sha256 HASH object @ 0x0000027D76FD7470>
>>> print(hash3)
<sha256 HASH object @ 0x0000027D76D92BF0>
>>> print(type(hash1))
<class '_hashlib.HASH'>
>>> print(type(hash2()))
<class '_hashlib.HASH'>
>>> print(type(hash3))
<class '_hashlib.HASH'>
In fact, getattr(obj, 'method') and obj.method give the same result, but in case #2, you're using it wrong.
When you call the function hashlib.sha256, it returns a new HASH object; that is what you're dealing with in cases #1 and #3. In case #2, however, hash2 is the function hashlib.sha256 itself, not a HASH object, and that doesn't change when you call it later, meaning:
When you do hash2().update(text), a new HASH object is created, updated, and immediately thrown away.
When you do hash2().hexdigest(), the result is the same as hashlib.sha256().hexdigest(), i.e. the digest of empty input.
For comparison, case #2 is practically the same as this:
>>> list().append(0) # Create new list object and append 0
>>> list() # Create new list object
[]
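For the original goal of selecting a hash algorithm by name, the fix is to create the HASH object once and reuse it; digest here is a hypothetical helper name, not part of hashlib:

```python
import hashlib


def digest(data, algo='sha256'):
    # Call getattr(hashlib, algo) ONCE and keep the resulting HASH object;
    # calling getattr(hashlib, algo)() twice creates two independent objects.
    h = getattr(hashlib, algo)()
    h.update(data)
    return h.hexdigest()


text = 'this is nonsence'.encode()
print(digest(text) == hashlib.sha256(text).hexdigest())  # True
print(digest(text, 'md5') == hashlib.md5(text).hexdigest())  # True
```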

Are functions first class objects in python?

I am working through a tutorial on Python. It explains how functions are first-class objects in Python.
def foo():
    pass

print(foo.__class__)
print(issubclass(foo.__class__, object))
The output that I get for the above code is
<type 'function'>
True
This program is supposed to demonstrate that functions are first-class objects in Python. My questions are as follows:
How does the above code prove that functions are first-class objects?
What are the attributes of a first-class object?
What does function.__class__ signify? It returns a tuple <type, function>, which doesn't mean much to me.
Here's what Guido says about first class objects in his blog:
One of my goals for Python was to make it so that all objects were "first class." By this, I meant that I wanted all objects that could be named in the language (e.g., integers, strings, functions, classes, modules, methods, etc.) to have equal status. That is, they can be assigned to variables, placed in lists, stored in dictionaries, passed as arguments, and so forth.
The whole blog post is worth reading.
In the example you posted, the tutorial may be making the point that first-class objects are generally descendants of the object class.
First-class simply means that functions can be treated as values: you can assign them to variables, return them from functions, and pass them in as parameters. That is, you can write code like:
>>> def say_hi():
...     print "hi"
...
>>> def say_bye():
...     print "bye"
...
>>> f = say_hi
>>> f()
hi
>>> f = say_bye
>>> f()
bye
This is useful as you can now assign functions to variables like any ordinary variable:
>>> for f in (say_hi, say_bye):
...     f()
...
hi
bye
Or write higher order functions (that take functions as parameters):
>>> def call_func_n_times(f, n):
...     for i in range(n):
...         f()
...
>>> call_func_n_times(say_hi, 3)
hi
hi
hi
>>> call_func_n_times(say_bye, 2)
bye
bye
__class__ tells you what type of object you have. E.g., if you define a list object in Python, a = [1, 2, 3], then a.__class__ will be <type 'list'>. If you have a datetime (from datetime import datetime and then d = datetime.now()), then the type of the d instance will be <type 'datetime.datetime'>. They were just showing that in Python a function is not a brand-new concept; it is just an ordinary object of <type 'function'>.
You proved that functions are first class objects because you were allowed to pass foo as an argument to a method.
The attributes of first-class objects were nicely summarised in this post: https://stackoverflow.com/a/245208/3248346
Depending on the language, this can imply:
being expressible as an anonymous literal value
being storable in variables
being storable in data structures
having an intrinsic identity (independent of any given name)
being comparable for equality with other entities
being passable as a parameter to a procedure/function
being returnable as the result of a procedure/function
being constructible at runtime
being printable
being readable
being transmissible among distributed processes
being storable outside running processes
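Several of the properties above can be demonstrated in a few lines; the function names here are illustrative:

```python
def square(n):
    return n * n

# storable in variables and data structures
ops = {'square': square, 'negate': lambda n: -n}

# passable as parameters and returnable as results (higher-order functions)
def compose(f, g):
    def h(x):
        return f(g(x))
    return h

negated_square = compose(ops['negate'], ops['square'])
print(negated_square(3))  # -9
```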
Regarding your third question, <type 'function'> isn't a tuple. Python's tuple notation is (a,b), not angle brackets.
foo.__class__ returns a class object, that is, an object which represents the class to which foo belongs; class objects happen to produce descriptive strings in the interpreter, in this case telling you that the class of foo is the type called 'function'. (Classes and types are basically the same in modern Python.)
It doesn't mean a whole lot other than that, like any other object, functions have a type:
>>> x = 1
>>> x.__class__
<type 'int'>
>>> y = "bar"
>>> y.__class__
<type 'str'>
>>> def foo(): pass
...
>>> foo.__class__
<type 'function'>
Regarding your comment on I.K.'s answer, f_at_2() in the following would be the "method":
>>> def f_at_2(f):
...     return f(2)
...
>>> def foo(n):
...     return n ** n
...
>>> def bar(n):
...     return n * n
...
>>> def baz(n):
...     return n / 2
...
>>> funcs = [foo, bar, baz]
>>> for f in funcs:
...     print f.func_name, f_at_2(f)
...
foo 4
bar 4
baz 1
A method is a function of/in a class, but the concept also applies to a function (outside of a class). The functions (as objects) are contained in a data structure and passed to another object.

Python instances stored in shelves change after closing it

I think the best way to explain the situation is with an example:
>>> class Person:
... def __init__(self, brother=None):
... self.brother = brother
...
>>> bob = Person()
>>> alice = Person(brother=bob)
>>> import shelve
>>> db = shelve.open('main.db', writeback=True)
>>> db['bob'] = bob
>>> db['alice'] = alice
>>> db['bob'] is db['alice'].brother
True
>>> db['bob'] == db['alice'].brother
True
>>> db.close()
>>> db = shelve.open('main.db',writeback=True)
>>> db['bob'] is db['alice'].brother
False
>>> db['bob'] == db['alice'].brother
False
The expected output for both comparisons is True again. However, pickle (which is used by shelve) seems to be re-instantiating bob and alice.brother separately. How can I "fix" this using shelve/pickle? Is it possible for db['alice'].brother to point to db['bob'] or something similar? Note that I do not merely want the two to compare equal; I need them to actually be the same object.
As suggested by Blckknght I tried pickling the entire dictionary at once, but the problem persists since it seems to pickle each key separately.
I believe that the issue you're seeing comes from the way the shelve module stores its values. Each value is pickled independently of the other values in the shelf, which means that if the same object is inserted as a value under multiple keys, the identity will not be preserved between the keys. However, if a single value has multiple references to the same object, the identity will be maintained within that single value.
Here's an example:
a = object() # an arbitrary object
db = shelve.open("text.db")
db['a'] = a
db['another_a'] = a
db['two_a_references'] = [a, a]
db.close()
db = shelve.open("text.db") # reopen the db
print(db['a'] is db['another_a']) # prints False
print(db['two_a_references'][0] is db['two_a_references'][1]) # prints True
The first print tries to confirm the identity of two versions of the object a that were inserted in the database, one under the key 'a' directly, and another under 'another_a'. It doesn't work because the separate values are pickled separately, and so the identity between them was lost.
The second print tests whether the two references to a that were stored under the key 'two_a_references' were maintained. Because the list was pickled in one go, the identity is kept.
So to address your issue you have a few options. One approach is to avoid testing for identity and rely on an __eq__ method in your various object types to determine if two objects are semantically equal, even if they are not the same object. Another would be to bundle all your data into a single object (e.g. a dictionary) which you'd then save with pickle.dump and restore with pickle.load rather than using shelve (or you could adapt this recipe for a persistent dictionary, which is linked from the shelve docs, and does pretty much that).
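The second suggestion, bundling everything into one object and pickling it in a single pass, can be sketched with the Person class from the question:

```python
import pickle


class Person:
    def __init__(self, brother=None):
        self.brother = brother


bob = Person()
alice = Person(brother=bob)

# Pickling both objects in ONE dump lets pickle's memo table record the
# shared reference, so it is restored as a single object.
blob = pickle.dumps({'bob': bob, 'alice': alice})
restored = pickle.loads(blob)
print(restored['bob'] is restored['alice'].brother)  # True
```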
The appropriate way, in Python, is to implement the __eq__ and __ne__ methods inside the Person class, like this:
class Person(object):
    def __eq__(self, other):
        return (isinstance(other, self.__class__)
                and self.__dict__ == other.__dict__)

    def __ne__(self, other):
        return not self.__eq__(other)
Generally, that should be sufficient, but if these are truly database objects and have a primary key, it would be more efficient to check that attribute instead of self.__dict__.
Problem
To preserve identity with shelve, you need to preserve identity with pickle.
Solution
This class keeps a registry of all its instances on the class itself and, when unpickling, returns the already-existing instance if the identity matches. You should be able to subclass it.
>>> class PickleWithIdentity(object):
...     identity = None
...     identities = dict()  # maybe use a weak-reference dict here
...     def __reduce__(self):
...         if self.identity is None:
...             self.identity = os.urandom(10)  # do not use id(): it is only 4 bytes and not random
...             self.identities[self.identity] = self
...         return open_with_identity, (self.__class__, self.__dict__), self.__dict__
...
>>> def open_with_identity(cls, dict):
...     if dict['identity'] in cls.identities:
...         return cls.identities[dict['identity']]
...     return cls()
...
>>> p = PickleWithIdentity()
>>> p.asd = 'asd'
>>> import pickle
>>> import os
>>> pickle.loads(pickle.dumps(p))
<__main__.PickleWithIdentity object at 0x02D2E870>
>>> pickle.loads(pickle.dumps(p)) is p
True
Further problems can occur because the state may be overwritten:
>>> p.asd
'asd'
>>> ps = pickle.dumps(p)
>>> p.asd = 123
>>> pickle.loads(ps)
<__main__.PickleWithIdentity object at 0x02D2E870>
>>> p.asd
'asd'

dict does not reference elements? Python 2.7 changed behavior

Given the example:
>>> import gc
>>> d = { 1 : object() }
>>> gc.get_referrers(d[1])
[] # Python 2.7
[{1: <object object at 0x003A0468>}] # Python 2.5
Why is d not listed as a referrer to the object?
EDIT: Although the dict d references the object, why is the dictionary not listed?
The doc mentions that:
This function will only locate those containers which support garbage
collection; extension types which do refer to other objects but do not
support garbage collection will not be found.
Seems that dictionaries do not support it.
And here is why:
The garbage collector tries to avoid tracking simple containers which
can’t be part of a cycle. In Python 2.7, this is now true for tuples
and dicts containing atomic types (such as ints, strings, etc.).
Transitively, a dict containing tuples of atomic types won’t be
tracked either. This helps reduce the cost of each garbage collection
by decreasing the number of objects to be considered and traversed by
the collector.
— From What's new in Python 2.7
It seems that object() is considered an atomic type, and trying this with an instance of a user-defined class (that is, not a bare object) confirms it, as your code now works:
# Python 2.7
>>> class A(object): pass
>>> r = A()
>>> d = {1: r}
>>> del r
>>> gc.get_referrers(d[1])
[{1: <__main__.A instance at 0x0000000002663708>}]
See also issue 4688.
This is a change in how objects are tracked in Python 2.7; tuples and dictionaries containing only atomic types (including instances of object()), which would never require cycle breaking, are not listed anymore.
See http://bugs.python.org/issue4688; this was implemented to avoid performance issues when creating loads of tuples or dictionaries.
The work-around is to add an object to your dictionary that does need tracking (here r is the object() value from the question, stored as d = {1: r}):
>>> r = object()
>>> d = {1: r}
>>> gc.is_tracked(d)
False
>>> class Foo(object): pass
...
>>> d['_'] = Foo()
>>> gc.is_tracked(d)
True
>>> d in gc.get_referrers(r)
True
Once tracked, a dictionary only goes back to being untracked after a gc collection cycle:
>>> del d['_']
>>> gc.is_tracked(d)
True
>>> d in gc.get_referrers(r)
True
>>> gc.collect()
0
>>> gc.is_tracked(d)
False
>>> d in gc.get_referrers(r)
False
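The same optimization can be observed directly with gc.is_tracked; this is CPython-specific behavior:

```python
import gc

# A dict holding only atomic values (ints, strings) starts out untracked.
d1 = {1: "a", 2: 3}
print(gc.is_tracked(d1))  # False


class A:
    pass

# A dict holding an instance of a user-defined class is tracked, because
# such instances can participate in reference cycles.
d2 = {1: A()}
print(gc.is_tracked(d2))  # True
```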

Python: How do I pass a string by reference?

From this link: How do I pass a variable by reference?, we know that Python will copy a string (an immutable type) when it is passed to a function as a parameter, but I think this wastes memory if the string is huge. In many cases we need functions to wrap some operations on strings, so I want to know how to do this more efficiently.
Python does not make copies of objects (this includes strings) passed to functions:
>>> def foo(s):
... return id(s)
...
>>> x = 'blah'
>>> id(x) == foo(x)
True
If you need to "modify" a string in a function, return the new string and assign it back to the original name:
>>> def bar(s):
... return s + '!'
...
>>> x = 'blah'
>>> x = bar(x)
>>> x
'blah!'
Unfortunately, this can be very inefficient when making small changes to large strings, because the large string gets copied each time. The Pythonic way of dealing with this is to hold the pieces in a list and join them together once you have all of them.
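The join idiom mentioned above looks like this:

```python
# Build a large string piecewise: collect the parts in a list and join once
# at the end, instead of repeated +=, which may copy the growing string on
# every iteration (quadratic cost in the worst case).
parts = []
for i in range(1000):
    parts.append(str(i))
big = "".join(parts)  # single allocation of the final string
print(len(big))  # 2890 characters (the digits of 0..999 concatenated)
```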
Python does pass a string by reference. Notice that two strings with the same content can be considered identical:
a = 'hello'
b = 'hello'
a is b  # True
When b is assigned a value that already exists in memory (CPython interns some strings), it reuses the same string object. Notice another fact: if the string is created dynamically, i.e. with string operations such as concatenation, the new variable will reference a new instance of the same string:
c = 'hello'
d = 'he'
d += 'llo'
c is d # False
That being said, creating a new string allocates a new string in memory and returns a reference to it, while using an already-created string reuses the same instance. Therefore, passing a string as a function parameter passes it by reference; in other words, it passes the address in memory of the string.
And now to the point you were looking for: if you change the string inside the function, the string outside the function remains the same, and that stems from string immutability. Changing a string means allocating a new string in memory.
a = 'a'
b = a # b will hold a reference to string a
a += 'a'
a is b # False
Bottom line:
You cannot really change a string, the same as in perhaps every other programming language (but don't quote me on that).
When you pass a string as an argument, you pass a reference. When you change its value, you make the variable point to another place in memory, while other variables that point to the old address naturally keep the old value (reference) they held.
Hope the explanation was clear enough.
In [7]: strs="abcd"
In [8]: id(strs)
Out[8]: 164698208
In [9]: def func(x):
   ...:     print id(x)
   ...:     x = x.lower()  # perform some operation on the string object; it returns a new object
   ...:     print id(x)
   ...:
In [10]: func(strs)
164698208 # same as strs, i.e it actually passes the same object
164679776 # new object is returned if we perform an operation
# That's why they are called immutable
But operations on strings always return a new string object.
def modify_string(t):
    the_string = t[0]
    # do stuff

modify_string(["my very long string"])
If you want to potentially change the value of something passed in, wrap it in a dict or a list:
This doesn't change s:
def x(s):
    s += 1
This does change s:
def x(s):
    s[0] += 1
This is the only way to "pass by reference".
Wrapping the string in a class will make it behave as if passed by reference:
class refstr:
    "wrap string in an object, so it is passed by reference rather than by value"
    def __init__(self, s=""):
        self.s = s
    def __add__(self, s):
        self.s += s
        return self
    def __str__(self):
        return self.s

def fn(s):
    s += " world"

s = refstr("hello")
fn(s)  # s gets modified: += calls __add__, which mutates the shared refstr object in place
print(s)  # prints 'hello world'
Just pass it in as you would any other parameter. The contents won't get copied, only the reference will.
