Is the pickling process deterministic?

Does Pickle always produce the same output for a certain input value? I suppose there could be a gotcha when pickling dictionaries that have the same contents but different insert/delete histories. My goal is to create a "signature" of function arguments, using Pickle and SHA1, for a memoize implementation.

I suppose there could be a gotcha when pickling dictionaries that have the same contents but different insert/delete histories.
Right:
>>> pickle.dumps({1: 0, 9: 0}) == pickle.dumps({9: 0, 1: 0})
False
See also: pickle.dumps not suitable for hashing
My goal is to create a "signature" of function arguments, using Pickle and SHA1, for a memoize implementation.
There are a number of fundamental problems with this. It's impossible to come up with an object-to-string transformation that maps equality correctly—think of the problem of object identity:
>>> a = object()
>>> b = object()
>>> a == b
False
>>> pickle.dumps(b) == pickle.dumps(a)
True
Depending on your exact requirements, you may be able to transform object hierarchies into ones that you could then hash:
def hashablize(obj):
    """Convert a container hierarchy into one that can be hashed.

    Don't use this with recursive structures! Also, this won't be
    useful if you pass dictionaries with keys that don't have a
    total order. Actually, maybe you're best off not using this
    function at all."""
    try:
        hash(obj)
    except TypeError:
        if isinstance(obj, dict):
            return tuple((k, hashablize(v)) for (k, v) in sorted(obj.iteritems()))
        elif hasattr(obj, '__iter__'):
            return tuple(hashablize(o) for o in obj)
        else:
            raise TypeError("Can't hashablize object of type %r" % type(obj))
    else:
        return obj
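To tie this back to the stated goal, here is a minimal sketch of building a SHA1 signature on top of hashablize (the signature helper is my own name for illustration, and this assumes Python 2 since hashablize uses iteritems):

import hashlib
import pickle

def signature(*args, **kwargs):
    # Normalize dict ordering via hashablize, then pickle the
    # normalized structure and hash the resulting byte string.
    normalized = hashablize((args, kwargs))
    return hashlib.sha1(pickle.dumps(normalized)).hexdigest()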

What do you mean by same output? You should normally get the same value back from a round trip (pickling -> unpickling), but I don't think the serialized format itself is guaranteed to be the same under all conditions. Certainly, it may change between platforms and Python versions.
Within one run of your program, using pickling for memoization should be fine - I have used this scheme several times without trouble, but that was for quite simple problems. One problem is that this does not cover every useful case (functions come to mind: you cannot pickle them, so if your function takes a callable argument, this scheme won't work).
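For illustration, a minimal memoize sketch along these lines (the memoize name and cache layout are mine; it inherits every caveat above, including unpicklable arguments):

import pickle

def memoize(func):
    cache = {}
    def wrapper(*args, **kwargs):
        # Key on the pickled arguments; sort kwargs for a stable order.
        key = pickle.dumps((args, sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper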

Related

If an object is unhashable it's an error to ask if it's a dictionary key

If an object is unhashable, like a set, it is an error to ask whether it's a key of a dictionary:
>>> key = set()
>>> key in {}
Traceback (most recent call last):
TypeError: unhashable type: 'set'
I'm just wondering why this is.
edit: I understand that the object needs to be hashed, but not why it needs to be an error if it can't.
That's because the first thing Python needs to do in order to check whether it's in the dict is to attempt to hash the object.
As an alternate design decision, Python could have handled this case: the dict.__contains__ implementation could catch the TypeError and return False. But this provides less information to the user than leaving the exception unhandled, and is therefore arguably less useful.
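A sketch of what that alternate design could look like, as a hypothetical subclass rather than a change to dict itself:

class LenientDict(dict):
    """Hypothetical: membership tests treat unhashable keys as absent."""
    def __contains__(self, key):
        try:
            return super(LenientDict, self).__contains__(key)
        except TypeError:
            return False

print(set() in LenientDict())  # False instead of TypeError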
That's because checking for dictionary membership requires that the object be hashed first and then compared using __eq__.
Consider the following toy example:
class Key(object):
    def __hash__(self):
        print('calling hash')
        return 1
    def __eq__(self, other):
        print('calling eq')
        return True

dct = {1: 2}
Key() in dct
# calling hash
# calling eq
In your example, the set() does not get past the hash stage, and an error is correctly raised, rather than passed silently.
Making a set a key in a dictionary is iffy because set is a mutable data structure.
What happens when you add some things to the set? Should the set continue to be treated as the same key in the dict, or should some other set with its original contents be treated as that key instead?
To use a set-like structure as a key, use an immutable frozenset instead. You cannot change the value of a frozenset after instantiating it, so Python can safely allow it to be a key in a dict.
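For example:
>>> d = {frozenset([1, 2]): 'value'}
>>> d[frozenset([2, 1])]  # equal contents, equal hash, same key
'value'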
It's more useful that way.
key in d is supposed to be equivalent to
any(key == k for k in d)
which can actually be True for unhashable objects:
d = {frozenset(): 1}
key = set()
print(any(key == k for k in d)) # prints True
So trying to always return False for unhashable objects would create an inconsistency. In theory, we could try to fall back to that for loop, but that would lead to silent O(len(d)) performance degradation and defeat the benefits of a dict for little gain.
Also, even if key in d were always False for an unhashable key, or if it had some kind of fallback, most instances where such a test even happens would still probably be due to bugs rather than deliberate use. An exception is more useful than a False result.

Python: how does a dict with mixed key type work?

I understand that the following is valid in Python: foo = {'a': 0, 1: 2, some_obj: 'c'}
However, I wonder how the internals work. Does it treat everything (object, string, number, etc.) as an object? Does it type-check to determine how to compute the hash code for a given key?
Types aren't used the same way in Python as in statically typed languages. A hashable object is simply one with a valid __hash__ method. The interpreter simply calls that method; there is no type checking or anything. From there on out, standard hash map principles apply: for an object to fulfill the contract, it must implement both __hash__ and __eq__.
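For instance, here's a minimal sketch of a user-defined key type honoring that contract (the Point class is hypothetical):

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        # Hash exactly the state that __eq__ compares
        return hash((self.x, self.y))
    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

d = {Point(1, 2): 'found'}
print(Point(1, 2) in d)  # True: hashes match, then __eq__ confirms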
You can answer this by opening a Python interactive prompt and trying several of these keys:
>>> hash('a')
12416037344
>>> hash(2)
2
>>> hash(object())
8736272801544
Does it treat everything (object, string, number, etc.) as object?
You are simply using the hash function to map each dictionary key to an integer. This integer is then used to index into the underlying array. Assuming a dictionary starts off with a pre-allocated size of 8, we use the modulus operator (the remainder) to fit the hash into an appropriate slot:
>>> hash('a')
12416037344
>>> hash(object()) % 8
2
So in this particular case, the hashed object is placed at index 2 of the underlying array. Of course there can be collisions, and depending on the underlying implementation these are resolved by strategies such as chaining (an array of buckets) or probing for another free slot.
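A quick way to provoke a collision (in CPython, small integers hash to themselves, which makes this easy to see):
>>> hash(8) % 8
0
>>> hash(16) % 8  # different key, same slot: a collision
0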
Note that items that aren't hashable cannot be used as dictionary keys:
>>> hash({})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
Proof:
>>> d = {}
>>> d[{}] = 5
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
Everything in Python is an object, and every object which has a __hash__ method can be used as a dictionary key. How exactly the method (tries to) return a unique integer for each possible key is thus specific to each class, and somewhat private (to use the term carelessly for humorous effect). See https://wiki.python.org/moin/DictionaryKeys for details.
(There are a couple of other methods your class needs to support before it is completely hashable. Again, see the linked exposition.)
I think it would work as long as the object supports the __hash__ method and __hash__(self) returns the same value for two objects for which __eq__ returns True. Hashable objects are explained here.
E.g. try the following:
a = []
a.__hash__ == None # True
aa = 'xyz'
aa.__hash__ == None # False
a = (1,1,) # A Tuple, hashable
a.__hash__ == None # False
Hope that helps
It might be easier to ignore dictionaries for a moment, and just think of sets. You can probably see how a set s might consist of {'a', 1, some_object}, right? And that if you tried to add 1 to this set, it wouldn't do anything (since 1 is already there)?
Now, suppose you try to add another_object to s. another_object is not 1 or 'a', so to see if it can be added to s, Python will check whether another_object is equal to some_object. Understanding object equality is a whole other subject, but suffice it to say, there is a sensible way to go about it. If another_object == some_object is true, then s will remain unchanged. Otherwise, s will consist of {'a', 1, some_object, another_object}.
Hopefully, this makes sense to you. If it does, then just think of dictionaries as special sets. The keys of the dictionary are the entries of the set, and the values of the dictionary just hold a single value for each key. Saying d['and now'] = 'completely different' is just the same thing as deleting 'and now' from the set, and then adding it back again with 'completely different' as its associated value. (This isn't technically how a dict handles __setitem__, but it can be helpful to think of a dict as working this way. In reality, sets are more like crippled dicts than dicts are like extra-powerful sets.)
Of course, with dicts you should bear in mind that only hashable objects are allowed in a dict. But that is itself a different subject, and not really the crux of your question, AFAICT

How to distinguish between a sequence and a mapping

I would like to perform an operation on an argument based on the fact that it might be a map-like object or a sequence-like object. I understand that no strategy is going to be 100% reliable for type-like checking, but I'm looking for a robust solution.
Based on this answer, I know how to determine whether something is a sequence and I can do this check after checking if the object is a map.
def ismap(arg):
    # How to implement this?

def isseq(arg):
    return hasattr(arg, "__iter__")

def operation(arg):
    if ismap(arg):
        # Do something with a dict-like object
    elif isseq(arg):
        # Do something with a sequence-like object
    else:
        # Do something else

Because a sequence can be seen as a map where keys are integers, should I just try to find a key that is not an integer? Or maybe I could look at the string representation? or...?
UPDATE
I selected SilentGhost's answer because it looks like the most "correct" one, but for my needs, here is the solution I ended up implementing:
if hasattr(arg, 'keys') and hasattr(arg, '__getitem__'):
    # Do something with a map
elif hasattr(arg, '__iter__'):
    # Do something with a sequence/iterable
else:
    # Do something else
Essentially, I don't want to rely on an ABC because there are many custom classes that behave like sequences and dictionaries but still do not extend the Python collections ABCs (see Manoj's comment). I thought the keys attribute (mentioned by someone who removed his/her answer) was a good enough check for mappings.
Classes extending the Sequence and Mapping ABCs will work with this solution as well.
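For example, a class that merely quacks like a mapping passes the hasattr check without extending any ABC (FakeMapping is a hypothetical illustration; in modern Python the ABCs live in collections.abc):

class FakeMapping(object):
    def keys(self):
        return ['a']
    def __getitem__(self, key):
        return 1

obj = FakeMapping()
print(hasattr(obj, 'keys') and hasattr(obj, '__getitem__'))  # True

from collections.abc import Mapping
print(isinstance(obj, Mapping))  # False: Mapping requires inheritance or registration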
>>> from collections import Mapping, Sequence
>>> isinstance('ac', Sequence)
True
>>> isinstance('ac', Mapping)
False
>>> isinstance({3:42}, Mapping)
True
>>> isinstance({3:42}, Sequence)
False
collections abstract base classes (ABCs)
Sequences have an __add__ method that implements the + operator. Maps do not have that method, since adding to a map requires both a key and a value, and the + operator only has one right-hand side.
So you may try:
def ismap(arg):
    return isseq(arg) and not hasattr(arg, "__add__")
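A quick check of the assumption behind this heuristic:
>>> hasattr([], '__add__'), hasattr((), '__add__'), hasattr('', '__add__')
(True, True, True)
>>> hasattr({}, '__add__')
False
One caveat: a set also has __iter__ and no __add__, so this heuristic would misclassify a set as a map.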

Verifying that an object in python adheres to a specific structure

Is there some simple method that can check if an input object to some function adheres to a specific structure? For example, I want only a dictionary of string keys and values that are a list of integers.
One method would be to write a recursive function that you pass the object into, iterating over it and checking at each level that it is what you expect. But I feel that there should be a more elegant way to do it than this in Python.
Why would you expect Python to provide an "elegant way" to check types, since the whole idea of type-checking is so utterly alien to the Pythonic way of conceiving the world and interacting with it?! Normally in Python you'd use duck typing -- so "an integer" might equally well be an int, a long, a gmpy.mpz -- types with no relation to each other except they all implement the same core signature... just as "a dict" might be any implementation of mapping, and so forth.
The new-in-2.6-and-later concept of "abstract base classes" provides a more systematic way to implement and verify duck typing, and 3.0-and-later function annotations let you interface with such a checking system (third-party, since Python adopts no such system for the foreseeable future). For example, this recipe provides a 3.0-and-later way to perform "kinda but not quite" type checking based on function annotations -- though I doubt it goes anywhere as deep as you desire, but then, it's early times for function annotations, and most of us Pythonistas feel so little craving for such checking that we're unlikely to run flat out to implement such monumental systems in lieu of actually useful code;-).
Short answer, no, you have to create your own function.
Long answer: it's not Pythonic to do what you're asking. There might be some special cases (e.g., marshalling a dict to XML-RPC), but by and large, assume the objects will act like what they're documented to be. If they don't, let the AttributeError bubble up. If you are OK with coercing values, then use str() and int() to convert them. They could, after all, implement __str__, __add__, etc. in ways that make them not descendants of int/str, but still usable.
def dict_of_string_and_ints(obj):
    assert isinstance(obj, dict)
    for key, value in obj.iteritems():  # py2.4
        assert isinstance(key, basestring)
        assert isinstance(value, list)
        assert sum(isinstance(x, int) for x in value) == len(value)
Since Python emphasizes things just working, your best bet is to just assert as you go and trust the users of your library to feed you proper data. Let exceptions happen, if you must; that's on your clients for not reading your docstring.
In your case, something like this:
def myfunction(arrrrrgs):
    assert isinstance(arrrrrgs, dict), "Need a dictionary!"
    for key in arrrrrgs:
        assert type(key) is str, "Need a string!"
        val = arrrrrgs[key]
        assert type(val) is list, "Need a list!"
And so forth.
Really, it isn't worth the effort. Express yourself clearly in the docstring and let your program blow up, or throw well-placed exceptions to guide the late-night debugger.
I will take a shot and propose a helper function that can do something like that for you in a more generic and elegant way:
def check_type(value, type_def):
    """
    Validate an object instance <value> against a type template <type_def>
    presented as a simplified object.

    E.g., if value is a list of dictionaries that have string keys and
    integer values:
    >>> check_type(value, [{'': 0}])
    If value is a list of dictionaries, with no restriction on keys/values:
    >>> check_type(value, [{}])
    """
    if type(value) != type(type_def):
        return False
    if hasattr(value, '__iter__'):
        if len(type_def) == 0:
            return True
        type_def_val = iter(type_def).next()  # Python 2; use next(iter(type_def)) on 3.x
        for key in value:
            if not check_type(key, type_def_val):
                return False
        if type(value) is dict:
            if not check_type(value.values(), type_def.values()):
                return False
    return True
The docstring shows sample usage, but you can always go pretty deep, e.g.
>>> check_type({1:['a', 'b'], 2:['c', 'd']}, {0:['']})
True
>>> check_type({1:['a', 'b'], 2:['c', 3]}, {0:['']})
False
P.S. Feel free to modify it if you want one-by-one tuple validation (e.g. validation against ([], '', {0: 0}), which is not handled as you might expect right now).

Using non-hashable Python objects as keys in dictionaries

Python doesn't allow non-hashable objects to be used as keys in other dictionaries. As pointed out by Andrey Vlasovskikh, there is a nice workaround for the special case of using non-nested dictionaries as keys:
frozenset(a.items())  # Can be put in the dictionary instead
Is there a method of using arbitrary objects as keys in dictionaries?
Example:
How would this be used as a key?
{"a":1, "b":{"c":10}}
It is extremely rare that you will actually have to use something like this in your code. If you think this is the case, consider changing your data model first.
Exact use case
The use case is caching calls to an arbitrary keyword-only function. Each key in the dictionary is a string (the name of the argument) and the objects can be quite complicated, consisting of layered dictionaries, lists, tuples, etc.
Related problems
This sub-problem has been split off from the problem here. Solutions here deal with the case where the dictionaries are not layered.
Based off solution by Chris Lutz again.
import collections

def hashable(obj):
    if isinstance(obj, collections.Hashable):
        items = obj
    elif isinstance(obj, collections.Mapping):
        items = frozenset((k, hashable(v)) for k, v in obj.iteritems())
    elif isinstance(obj, collections.Iterable):
        items = tuple(hashable(item) for item in obj)
    else:
        raise TypeError(type(obj))
    return items
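Under Python 2 (which this iteritems-based code targets), the question's example then works as a key regardless of insertion order:
>>> a = hashable({"a": 1, "b": {"c": 10}})
>>> b = hashable({"b": {"c": 10}, "a": 1})
>>> a == b and hash(a) == hash(b)
True
>>> {a: 'cached'}[b]
'cached'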
Don't. I agree with Andrey's comment on the previous question that it doesn't make sense to have dictionaries as keys, and especially not nested ones. Your data model is obviously quite complex, and dictionaries are probably not the right answer. You should try some OO instead.
Based off solution by Chris Lutz. Note that this doesn't handle objects that are changed by iteration, such as streams, nor does it handle cycles.
import collections

def make_hashable(obj):
    """WARNING: This function only works on a limited subset of objects.

    Make a range of objects hashable. Accepts embedded dictionaries,
    lists or tuples (including namedtuples)."""
    if isinstance(obj, collections.Hashable):
        # Fine to be hashed without any changes
        return obj
    elif isinstance(obj, collections.Mapping):
        # Convert into a frozenset instead
        items = list(obj.items())
        for i, item in enumerate(items):
            items[i] = make_hashable(item)
        return frozenset(items)
    elif isinstance(obj, collections.Iterable):
        # Convert into a tuple instead
        ret = [type(obj)]
        for i, item in enumerate(obj):
            ret.append(make_hashable(item))
        return tuple(ret)
    # Use the id of the object
    return id(obj)
I agree with Lennart Regebro that you don't. However, I often find it useful to cache calls to functions, callable objects and/or Flyweight objects, since they may take keyword arguments.
But if you really want it, try pickle.dumps (or cPickle on Python 2.6) as a quick and dirty hack. It is much faster than any of the answers that use recursive calls to make items immutable, and strings are hashable.
import pickle
hashable_str = pickle.dumps(unhashable_object)
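For example, with the question's nested dict (subject to the determinism caveats from the first question above - equal dicts with different insertion histories may pickle differently):

import pickle

unhashable_object = {"a": 1, "b": {"c": 10}}
hashable_str = pickle.dumps(unhashable_object)
cache = {hashable_str: 'expensive result'}
print(pickle.dumps({"a": 1, "b": {"c": 10}}) in cache)  # True here, though not guaranteed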
If you really must, make your objects hashable. Subclass whatever you want to put in as a key, and provide a __hash__ function which returns a unique key for this object.
To illustrate:
>>> ("a",).__hash__()
986073539
>>> {'a': 'b'}.__hash__()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object is not callable
If your hash is not unique enough you will get collisions. May be slow as well.
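A sketch of that suggestion (HashableDict is a hypothetical name; note that it breaks if you mutate the instance after using it as a key, or if its values are themselves unhashable):

class HashableDict(dict):
    def __hash__(self):
        # Content-based hash; dict already supplies content-based __eq__
        return hash(frozenset(self.items()))

d = {HashableDict(a=1): 'value'}
print(HashableDict(a=1) in d)  # True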
I totally disagree with the comments & answers saying that this shouldn't be done for data-model purity reasons.
A dictionary associates an object with another object, using the former one as a key. Dictionaries can't be used as keys because they're not hashable. This doesn't make it any less meaningful/practical/necessary to map dictionaries to other objects.
As I understand the Python binding system, you can bind any dictionary to a number of variables (or the reverse, depending on your terminology), which means that these variables all know the same unique 'pointer' to that dictionary. Wouldn't it be possible to use that identifier as the hashing key?
If your data model ensures/enforces that you can't have two dictionaries with the same content used as keys, then that seems to be a safe technique to me.
I should add that I have no idea whatsoever of how that can/should be done though.
I'm not entirely sure whether this should be an answer or a comment. Please correct me if needed.
With recursion!
def make_hashable(h):
    items = list(h.items())
    for i, (key, value) in enumerate(items):
        # Recursively convert nested dict values so every item is hashable
        if type(value) == dict:
            items[i] = (key, make_hashable(value))
    return frozenset(items)
You can add other type tests for any other mutable types you want to make hashable. It shouldn't be hard.
I encountered this issue when using a decorator that caches the results of previous calls based on call signature. I do not agree with the comments/answers here to the effect of "you should not do this", but I think it is important to recognize the potential for surprising and unexpected behavior when going down this path. My thought is that since instances are both mutable and hashable, and it does not seem practical to change that, there is nothing inherently wrong with creating hashable equivalents of non-hashable types or objects. But of course that is only my opinion.
For anyone who requires Python 2.5 compatibility, the below may be useful. I based it on the earlier answer.
from itertools import imap  # Python 2

tuplemap = lambda f, data: tuple(imap(f, data))

def make_hashable(obj):
    u"Returns a deep, non-destructive conversion of the given object to an equivalent hashable object"
    if isinstance(obj, list):
        return tuplemap(make_hashable, iter(obj))
    elif isinstance(obj, dict):
        return frozenset(tuplemap(make_hashable, obj.iteritems()))
    elif hasattr(obj, '__hash__') and callable(obj.__hash__):
        try:
            obj.__hash__()
        except:
            if hasattr(obj, '__iter__') and callable(obj.__iter__):
                return tuplemap(make_hashable, iter(obj))
            else:
                raise NotImplementedError, 'object of type %s cannot be made hashable' % (type(obj),)
        else:
            return obj
    elif hasattr(obj, '__iter__') and callable(obj.__iter__):
        return tuplemap(make_hashable, iter(obj))
    else:
        raise NotImplementedError, 'object of type %s cannot be made hashable' % (type(obj),)
