What's the point of using [object instance].__self__?

I was checking the code of the toolz library's groupby function in Python and I found this:
def groupby(key, seq):
    """ Group a collection by a key function
    """
    if not callable(key):
        key = getter(key)
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[key(item)](item)
    rv = {}
    for k, v in d.items():
        rv[k] = v.__self__
    return rv
Is there any reason to use rv[k] = v.__self__ instead of rv[k] = v?

This is a somewhat confusing trick to save a small amount of time:
We are creating a defaultdict whose factory returns the bound append method of a fresh list instance, via [].append. Then we can just do d[key(item)](item) instead of d[key(item)].append(item), as we would if the defaultdict contained plain lists. If we don't look up append every time, we gain a small amount of time.
But now the dict contains bound methods instead of the lists themselves, so we have to get the original list instance back via __self__.
__self__ is an attribute of bound methods that refers back to the original instance. You can verify that with this, for example:
>>> a = []
>>> a.append.__self__ is a
True

This is a somewhat convoluted, but possibly more efficient, approach to creating and using a defaultdict of lists.
First, remember that the default factory is lambda: [].append. This means: create a new list, and store its bound append method in the dictionary. That saves you a method bind on every further append to the same key, and the garbage collection that follows. For example, the following more standard approach is less efficient:
d = collections.defaultdict(list)
for item in seq:
    d[key(item)].append(item)
The problem then becomes how to get the original lists back out of the dictionary, since the reference is not stored explicitly. Luckily, bound methods have a __self__ attribute which does just that. Here, [].append.__self__ is a reference to the original [].
As a side note, the last loop could be a comprehension:
return {k: v.__self__ for k, v in d.items()}
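If you want to check the speed claim yourself, here is a minimal, hypothetical micro-benchmark sketch (the data and repetition count are invented for illustration; actual numbers will vary by machine and Python version):

import collections
import timeit

seq = list(range(100000))

def with_lists():
    # standard approach: look up .append on every item
    d = collections.defaultdict(list)
    for item in seq:
        d[item % 10].append(item)
    return dict(d)

def with_bound_append():
    # toolz-style approach: call the stored bound method directly
    d = collections.defaultdict(lambda: [].append)
    for item in seq:
        d[item % 10](item)
    return {k: v.__self__ for k, v in d.items()}

print(timeit.timeit(with_lists, number=20))
print(timeit.timeit(with_bound_append, number=20))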

Related

Python defaultdict(default) vs dict.get(key, default)

Suppose I want to create a dict (or dict-like object) that returns a default value if I attempt to access a key that's not in the dict.
I can do this either by using a defaultdict:
from collections import defaultdict
foo = defaultdict(lambda: "bar")
print(foo["hello"]) # "bar"
or by using a regular dict and always using dict.get(key, default) to retrieve values:
foo = dict()
print(foo.get("hello", "bar")) # "bar"
print(foo["hello"]) # KeyError (as expected)
Other than the obvious ergonomic overhead of having to remember to use .get() with a default value instead of the expected bracket syntax, what's the difference between these 2 approaches?
Aside from the ergonomics of having .get everywhere, one important difference is that if you look up a missing key in a defaultdict, it inserts a new element into itself rather than just returning the default. The most important implications of this are:
Later iterations will see every key that was ever looked up in the defaultdict
As more ends up stored in the dictionary, more memory is typically used
A mutation of the default value is stored in the defaultdict; with .get the default is lost unless it is stored explicitly
from collections import defaultdict

default_foo = defaultdict(list)
dict_foo = dict()

for i in range(1024):
    default_foo[i]
    dict_foo.get(i, [])

print(len(default_foo.items()))  # 1024
print(len(dict_foo.items()))  # 0

# Defaults in defaultdicts can be mutated, whereas with .get mutations are lost
default_foo[1025].append("123")
dict_foo.get(1025, []).append("123")

print(default_foo[1025])  # ["123"]
print(dict_foo.get(1025, []))  # []
The difference here really comes down to how you want your program to handle a KeyError.
foo = dict()

def do_stuff_with_foo():
    print(foo["hello"])
    # Do something here

if __name__ == "__main__":
    try:
        foo["hello"]  # The key exists and has a value
    except KeyError:
        # The first code snippet does this
        foo["hello"] = "bar"
        do_stuff_with_foo()
        # The second code snippet does this
        exit(-1)
It comes down to a question: do we want to stop the program entirely, do we want the user to fill in a value for foo["hello"], or do we want to use a default value?
The first approach is a more compact way to write foo.get("hello", "bar").
But the kicker is whether that is what we really want to happen.

Why has the value of the parameter changed?

This is my code:
def getGraphWave(G, d, maxk, p):
    data = dict()
    output_wavelets = {2:33, 5:77, ...}
    print(len(output_wavelets))
    k = [10, 20]
    for i in k:
        S = graphwave(i, output_wavelets)
        # size = avgSize(G, S, p, 200)
        size = IC(d, S, p)
        data[i] = size + i
    return data
output_wavelets is a dict, and its length is 2000.
However, when running the following code:
def graphwave(k, output_wavelets):
    S = []
    print(len(output_wavelets))
    for i in range(k):
        Seed = max(output_wavelets, key=output_wavelets.get)
        S.append(Seed)
        output_wavelets.pop(Seed)
    return S
In getGraphWave(G, d, maxk, p), graphwave(k, output_wavelets) runs twice in the loop. Why does print(len(output_wavelets)) inside graphwave() print 2000 the first time and 1991 the second?
I thought output_wavelets was not being changed between the calls. How can I make output_wavelets always keep its original contents?
When you call graphwave(i, output_wavelets) this passes a reference to output_wavelets into the function, not a copy of output_wavelets. This means that when the function modifies the output_wavelets dictionary it is modifying the original dictionary.
The output_wavelets.pop(Seed) line removes items from the dictionary, so is modifying the output_wavelets dictionary that you passed in. That is why it is getting smaller!
There are various ways which you could fix this. The simplest (but probably not the most efficient) would be to use the copy.copy() function to make a copy of your dictionary at the start of the graphwave function, and edit the copy rather than the original.
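A minimal sketch of that fix (assuming the rest of graphwave stays as posted; a shallow copy is enough here because the function only removes keys and never mutates the stored values):

import copy

def graphwave(k, output_wavelets):
    output_wavelets = copy.copy(output_wavelets)  # work on a copy, not the caller's dict
    S = []
    for i in range(k):
        Seed = max(output_wavelets, key=output_wavelets.get)
        S.append(Seed)
        output_wavelets.pop(Seed)
    return S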
First you need to understand how argument passing works in Python. What happens depends on the kind of object you pass to the function:
if you pass a list, dict, or any other mutable object, the function can modify it in place;
but if you pass a tuple, string, or any other immutable object, it cannot be changed.
In your case you can copy the existing dict and then modify the copy, like:
temp_output_wavelets = copy.deepcopy(output_wavelets)

Python 3 changing value of dictionary key in for loop not working

I have python 3 code that is not working as expected:
def addFunc(x,y):
    print (x+y)

def subABC(x,y,z):
    print (x-y-z)

def doublePower(base,exp):
    print(2*base**exp)

def RootFunc(inputDict):
    for k,v in inputDict.items():
        if v[0]==1:
            d[k] = addFunc(*v[1:])
        elif v[0] ==2:
            d[k] = subABC(*v[1:])
        elif v[0]==3:
            d[k] = doublePower(*v[1:])

d={"s1_7":[1,5,2],"d1_6":[2,12,3,3],"e1_3200":[3,40,2],"s2_13":[1,6,7],"d2_30":[2,42,2,10]}
RootFunc(d)
# test to make sure key var assignment works
print(d)
I get:
{'d2_30': None, 's2_13': None, 's1_7': None, 'e1_3200': None, 'd1_6': None}
I expected:
{'d2_30': 30, 's2_13': 13, 's1_7': 7, 'e1_3200': 3200, 'd1_6': 6}
What's wrong?
Semi related: I know dictionaries are unordered but is there any reason why python picked this order? Does it run the keys through a randomizer?
print does not return a value; it returns None. So every time you call your functions, they print to standard output and return None. Try changing all the print statements to return, like so:
def addFunc(x,y):
    return x+y
This will give the value x+y back to whatever called the function.
Another problem with your code (unless you meant to do this) is that you define a dictionary d, and then inside your function you work on that global d rather than on the dictionary that was passed in as inputDict:
def RootFunc(inputDict):
    for k,v in inputDict.items():
        if v[0]==1:
            d[k] = addFunc(*v[1:])
Are you planning to always change d and not the dictionary that you are iterating over, inputDict?
There may be other issues as well (accepting a variable number of arguments within your functions, for instance), but it's good to address one problem at a time.
Additional Notes on Functions:
Here's some sort-of pseudocode that attempts to convey how functions are often used:
def sample_function(some_data):
    modified_data = []
    for element in some_data:
        processed = do_some_processing(element)  # do some processing
        modified_data.append(processed)  # add processed crap to modified_data
    return modified_data
Functions are considered 'black boxes', which means you structure them so that you can dump some data into them, they always do the same stuff, and you can call them over and over again. They will either return values or yield values, or update some value or attribute or something (the latter are called 'side effects'). For the moment, just pay attention to the return statement.
Another interesting thing is that functions have 'scope' which means that when I just defined it with a fake-name for the argument, I don't actually have to have a variable called "some_data". I can pass whatever I want to the function, but inside the function I can refer to the fake name and create other variables that really only matter within the context of the function.
Now, if we run my function above, it will go ahead and process the data:
sample_function(my_data_set)
But this is often kind of pointless, because the function is supposed to return something and I didn't do anything with what it returned. What I should do is assign the function's return value to some variable so I can keep the processed information.
my_modified_data = sample_function(my_data_set)
This is a really common way to use functions and you'll probably see it again.
One Simple Way to Approach Your Problem:
Taking all this into consideration, here is one way to solve your problem that comes from a really common programming paradigm:
def RootFunc(inputDict):
    temp_dict = {}
    for k,v in inputDict.items():
        if v[0]==1:
            temp_dict[k] = addFunc(*v[1:])
        elif v[0] ==2:
            temp_dict[k] = subABC(*v[1:])
        elif v[0]==3:
            temp_dict[k] = doublePower(*v[1:])
    return temp_dict

inputDict={"s1_7":[1,5,2],"d1_6":[2,12,3,3],"e1_3200":[3,40,2],"s2_13":[1,6,7],"d2_30":[2,42,2,10]}
final_dict = RootFunc(inputDict)
As erewok stated, you are using print and not return, which is likely the source of your error. As far as the ordering is concerned: dictionaries are unordered, and according to the Python docs the order is not random but arbitrary, because dictionaries are implemented as hash tables.
Excerpt from the Python documentation: [...] A mapping object maps hashable values to arbitrary objects. Mappings are mutable objects. There is currently only one standard mapping type, the dictionary. [...]
The key point here is that the order of the elements is not really random. I have often noticed that the order stays the same no matter how I construct a dictionary on the same values, whether with a lambda or by creating it outright; the order has always remained the same, so it can't be random, but it is definitely arbitrary. (Note that since CPython 3.7, dictionaries preserve insertion order, so this caveat mainly applies to older versions.)

is this the right way to delete object inside dict

I wrote a class inheriting from dict, with a member method to remove objects.
class RoleCOList(dict):
    def __init__(self):
        dict.__init__(self)

    def recycle(self):
        '''
        remove roles that have not been accessed for too long
        '''
        checkTime = time.time() - 60*30
        l = [k for k, v in self.items() if v.lastAccess < checkTime]
        for x in l:
            self.pop(x)
Isn't that inefficient? I used two loops (a list comprehension plus a for loop), but I couldn't find another way.
At the SciPy conference last year, I attended a talk where the speaker said that any() and all() are fast ways to do a task in a loop. It makes sense; a for loop rebinds the loop variable on each iteration, whereas any() and all() simply consume the value.
Clearly, you use any() when you want to run a function that always returns a false value, such as None; that way the whole loop runs to the end.
checkTime = time.time() - 60*30
# use any() as a fast way to run a loop
# The .__delitem__() method always returns None, so this runs the whole loop
lst = [k for k in self.keys() if self[k].lastAccess < checkTime]
any(self.__delitem__(k) for k in lst)
What about this?
_ = [self.pop(k) for k, v in list(self.items()) if v.lastAccess < checkTime]
(Materialize items() into a list first; in Python 3, popping while iterating over the live view raises a RuntimeError.)
Since you don't need the list you generated, you could use generators and a snippet from this consume recipe. In particular, use collections.deque to run through a generator for you.
checkTime = time.time() - 60*30
# Create a generator for all the values you will age off
age_off = (self.pop(k) for k in list(self.keys()) if self[k].lastAccess < checkTime)
# Let deque handle iteration (in one shot, with little memory footprint)
collections.deque(age_off, maxlen=0)
Since the dictionary is changed while age_off iterates, take a snapshot of the keys first: in Python 2 self.keys() already returns a list, while in Python 3 you need list(self.keys()). (Iterating with self.iteritems() would raise a RuntimeError.)
My (completely unreadable) solution:
from operator import delitem
map(lambda k: delitem(self,k), filter(lambda k: self[k].lastAccess<checkTime, iter(self)))
but at least it should be quite time- and memory-efficient ;-) (Note that this relies on Python 2's eager map(); in Python 3 the lazy map object would never be consumed.)
If performance is an issue, and if you will have large volumes of data, you might want to look into using a Python front-end for a system like memcached or redis; those can handle expiring old data for you.
http://memcached.org/
http://pypi.python.org/pypi/python-memcached/
http://redis.io/
https://github.com/andymccurdy/redis-py

Hashing a dictionary?

For caching purposes I need to generate a cache key from GET arguments which are present in a dict.
Currently I'm using sha1(repr(sorted(my_dict.items()))) (sha1() is a convenience method that uses hashlib internally) but I'm curious if there's a better way.
Using sorted(d.items()) isn't enough to get us a stable repr. Some of the values in d could be dictionaries too, and their keys will still come out in an arbitrary order. As long as all the keys are strings, I prefer to use:
json.dumps(d, sort_keys=True)
That said, if the hashes need to be stable across different machines or Python versions, I'm not certain that this is bulletproof. You might want to add the separators and ensure_ascii arguments to protect yourself from any changes to the defaults there. I'd appreciate comments.
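A minimal sketch of that approach (the dict d below is made up; separators and ensure_ascii are pinned explicitly, as suggested above):

import hashlib
import json

d = {"b": {"y": 2, "x": 1}, "a": 1}
# canonical form: sorted keys, fixed separators, ASCII-only output
canonical = json.dumps(d, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
print(hashlib.sha1(canonical.encode()).hexdigest())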
If your dictionary is not nested, you could make a frozenset with the dict's items and use hash():
hash(frozenset(my_dict.items()))
This is much less computationally intensive than generating the JSON string or representation of the dictionary.
UPDATE: Please see the comments below, why this approach might not produce a stable result.
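For instance, the nested case fails outright, because the inner dict is unhashable (and, separately, hash() values for strings vary between interpreter sessions when hash randomization is enabled):

d = {"a": {"b": 1}}
try:
    hash(frozenset(d.items()))
except TypeError as e:
    print(e)  # unhashable type: 'dict'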
EDIT: If all your keys are strings, then before continuing to read this answer, please see Jack O'Connor's significantly simpler (and faster) solution (which also works for hashing nested dictionaries).
Although an answer has been accepted, the title of the question is "Hashing a dictionary", and the answer is incomplete as regards that title. (As regards the body of the question, the answer is complete.)
Nested Dictionaries
If one searches Stack Overflow for how to hash a dictionary, one might stumble upon this aptly titled question, and leave unsatisfied if one is attempting to hash multiply nested dictionaries. The answer above won't work in this case, and you'll have to implement some sort of recursive mechanism to retrieve the hash.
Here is one such mechanism:
import copy

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that contains
    only other hashable types (including any lists, tuples, sets, and
    dictionaries).
    """
    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
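For instance (hypothetical data; key order should not affect the result):

d1 = {"a": [1, 2], "b": {"c": 3}}
d2 = {"b": {"c": 3}, "a": [1, 2]}
print(make_hash(d1) == make_hash(d2))  # True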
Bonus: Hashing Objects and Classes
The hash() function works great when you hash classes or instances. However, here is one issue I found with hash, as regards objects:
class Foo(object): pass
foo = Foo()
print (hash(foo)) # 1209812346789
foo.a = 1
print (hash(foo)) # 1209812346789
The hash is the same, even after I've altered foo. This is because the identity of foo hasn't changed, so the hash is the same. If you want foo to hash differently depending on its current definition, the solution is to hash off whatever is actually changing. In this case, the __dict__ attribute:
class Foo(object): pass
foo = Foo()
print (make_hash(foo.__dict__)) # 1209812346789
foo.a = 1
print (make_hash(foo.__dict__)) # -78956430974785
Alas, when you attempt to do the same thing with the class itself:
print (make_hash(Foo.__dict__)) # TypeError: unhashable type: 'dict_proxy'
The class __dict__ property is not a normal dictionary:
print (type(Foo.__dict__)) # type <'dict_proxy'>
Here is a similar mechanism as previous that will handle classes appropriately:
import copy

DictProxyType = type(object.__dict__)

def make_hash(o):
    """
    Makes a hash from a dictionary, list, tuple or set to any level, that
    contains only other hashable types (including any lists, tuples, sets, and
    dictionaries). In the case where other kinds of objects (like classes) need
    to be hashed, pass in a collection of object attributes that are pertinent.
    For example, a class can be hashed in this fashion:

        make_hash([cls.__dict__, cls.__name__])

    A function can be hashed like so:

        make_hash([fn.__dict__, fn.__code__])
    """
    if type(o) == DictProxyType:
        o2 = {}
        for k, v in o.items():
            if not k.startswith("__"):
                o2[k] = v
        o = o2

    if isinstance(o, (set, tuple, list)):
        return tuple([make_hash(e) for e in o])
    elif not isinstance(o, dict):
        return hash(o)

    new_o = copy.deepcopy(o)
    for k, v in new_o.items():
        new_o[k] = make_hash(v)

    return hash(tuple(frozenset(sorted(new_o.items()))))
You can use this to return a hash tuple of however many elements you'd like:
# -7666086133114527897
print (make_hash(func.__code__))
# (-7666086133114527897, 3527539)
print (make_hash([func.__code__, func.__dict__]))
# (-7666086133114527897, 3527539, -509551383349783210)
print (make_hash([func.__code__, func.__dict__, func.__name__]))
NOTE: all of the above code assumes Python 3.x. I did not test it in earlier versions, although I assume make_hash() will work in, say, 2.7.2, as long as func.__code__ is replaced with func.func_code.
The code below avoids using the Python hash() function because it will not provide hashes that are consistent across restarts of Python (see hash function in Python 3.3 returns different results between sessions). make_hashable() will convert the object into nested tuples and make_hash_sha256() will also convert the repr() to a base64 encoded SHA256 hash.
import hashlib
import base64

def make_hash_sha256(o):
    hasher = hashlib.sha256()
    hasher.update(repr(make_hashable(o)).encode())
    return base64.b64encode(hasher.digest()).decode()

def make_hashable(o):
    if isinstance(o, (tuple, list)):
        return tuple((make_hashable(e) for e in o))
    if isinstance(o, dict):
        return tuple(sorted((k, make_hashable(v)) for k, v in o.items()))
    if isinstance(o, (set, frozenset)):
        return tuple(sorted(make_hashable(e) for e in o))
    return o
o = dict(x=1,b=2,c=[3,4,5],d={6,7})
print(make_hashable(o))
# (('b', 2), ('c', (3, 4, 5)), ('d', (6, 7)), ('x', 1))
print(make_hash_sha256(o))
# fyt/gK6D24H9Ugexw+g3lbqnKZ0JAcgtNW+rXIDeU2Y=
Here is a clearer solution.
def freeze(o):
    if isinstance(o, dict):
        return frozenset({k: freeze(v) for k, v in o.items()}.items())
    if isinstance(o, list):
        return tuple([freeze(v) for v in o])
    return o

def make_hash(o):
    """
    Makes a hash out of anything that contains only list, dict and hashable
    types, including string and numeric types.
    """
    return hash(freeze(o))
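A quick, made-up usage check (key order should not matter, since freeze() turns dict items into a frozenset):

nested = {"a": [1, 2, {"b": 3}], "c": "text"}
reordered = {"c": "text", "a": [1, 2, {"b": 3}]}
print(make_hash(nested) == make_hash(reordered))  # True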
MD5 HASH
The method that gave me the most stable results was using md5 hashes and json.dumps:
from typing import Dict, Any
import hashlib
import json

def dict_hash(dictionary: Dict[str, Any]) -> str:
    """MD5 hash of a dictionary."""
    dhash = hashlib.md5()
    # We need to sort arguments so {'a': 1, 'b': 2} is
    # the same as {'b': 2, 'a': 1}
    encoded = json.dumps(dictionary, sort_keys=True).encode()
    dhash.update(encoded)
    return dhash.hexdigest()
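For example (illustrative data):

print(dict_hash({'a': 1, 'b': 2}) == dict_hash({'b': 2, 'a': 1}))  # True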
While hash(frozenset(x.items())) and hash(tuple(sorted(x.items()))) work, they do a lot of work allocating and copying all the key-value pairs. A hash function really should avoid a lot of memory allocation.
A little bit of math can help here. The problem with most hash functions is that they assume that order matters. To hash an unordered structure, you need a commutative operation. Multiplication doesn't work well as any element hashing to 0 means the whole product is 0. Bitwise & and | tend towards all 0's or 1's. There are two good candidates: addition and xor.
from functools import reduce
from operator import xor

class hashable(dict):
    def __hash__(self):
        return reduce(xor, map(hash, self.items()), 0)

    # Alternative
    def __hash__(self):
        return sum(map(hash, self.items()))
One point: xor works, in part, because dict guarantees keys are unique. And sum works because Python truncates the result to the native word size.
If you want to hash a multiset, sum is preferable. With xor, {a} would hash to the same value as {a, a, a} because x ^ x ^ x = x.
If you really need the guarantees that SHA makes, this won't work for you. But to use a dictionary in a set, this will work fine; Python containers are resilient to some collisions, and the underlying hash functions are pretty good.
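A short illustration (made-up values) of using such a dict as a set element:

d1 = hashable({'a': 1, 'b': 2})
d2 = hashable({'b': 2, 'a': 1})
print(hash(d1) == hash(d2))  # True: key order doesn't matter
s = {d1}  # now usable as a set element or dict key
print(d2 in s)  # True, since d1 == d2 and their hashes match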
Updated from 2013 reply...
None of the above answers seem reliable to me. The reason is the use of items(). As far as I know, this comes out in a machine-dependent order.
How about this instead?
import hashlib

def dict_hash(the_dict, *ignore):
    if ignore:  # Sometimes you don't care about some items
        interesting = the_dict.copy()
        for item in ignore:
            if item in interesting:
                interesting.pop(item)
        the_dict = interesting
    result = hashlib.sha1(
        ('%s' % sorted(the_dict.items())).encode()  # encode for Python 3
    ).hexdigest()
    return result
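For example (hypothetical keys; the ignored key does not affect the digest):

h1 = dict_hash({'a': 1, 'b': 2, 'timestamp': 123}, 'timestamp')
h2 = dict_hash({'a': 1, 'b': 2, 'timestamp': 456}, 'timestamp')
print(h1 == h2)  # True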
Use DeepHash from the DeepDiff module:
from deepdiff import DeepHash

obj = {'a': '1', 'b': '2'}
hashes = DeepHash(obj)[obj]
To preserve key order, instead of hash(str(dictionary)) or hash(json.dumps(dictionary)), I prefer this quick-and-dirty solution:
from pprint import pformat
h = hash(pformat(dictionary))
It works even for types like datetime that are not JSON serializable.
You can use the maps library to do this. Specifically, maps.FrozenMap
import maps
fm = maps.FrozenMap(my_dict)
hash(fm)
To install maps, just do:
pip install maps
It handles the nested dict case too:
import maps
fm = maps.FrozenMap.recurse(my_dict)
hash(fm)
Disclaimer: I am the author of the maps library.
You could use the third-party frozendict module to freeze your dict and make it hashable.
from frozendict import frozendict
my_dict = frozendict(my_dict)
For handling nested objects, you could go with:
import collections.abc

def make_hashable(x):
    if isinstance(x, collections.abc.Hashable):
        return x
    elif isinstance(x, collections.abc.Sequence):
        return tuple(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Set):
        return frozenset(make_hashable(xi) for xi in x)
    elif isinstance(x, collections.abc.Mapping):
        return frozendict({k: make_hashable(v) for k, v in x.items()})
    else:
        raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))
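A made-up usage example (the Hashable check runs first, so already-hashable leaves pass straight through):

nested = {"a": [1, {"b": 2}]}
same = {"a": [1, {"b": 2}]}
print(hash(make_hashable(nested)) == hash(make_hashable(same)))  # True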
If you want to support more types, use functools.singledispatch (Python 3.7+):
import collections.abc
import functools
from frozendict import frozendict

@functools.singledispatch
def make_hashable(x):
    raise TypeError("Don't know how to make {} objects hashable".format(type(x).__name__))

@make_hashable.register
def _(x: collections.abc.Hashable):
    return x

@make_hashable.register
def _(x: collections.abc.Sequence):
    return tuple(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Set):
    return frozenset(make_hashable(xi) for xi in x)

@make_hashable.register
def _(x: collections.abc.Mapping):
    return frozendict({k: make_hashable(v) for k, v in x.items()})

# add your own types here
One way to approach the problem is to make a tuple of the dictionary's items:
hash(tuple(my_dict.items()))
This is not a general solution (i.e. only trivially works if your dict is not nested), but since nobody here suggested it, I thought it might be useful to share it.
One can use a (third-party) immutables package and create an immutable 'snapshot' of a dict like this:
from immutables import Map
map = dict(a=1, b=2)
immap = Map(map)
hash(immap)
This seems to be faster than, say, stringification of the original dict.
I learned about this from this nice article.
For nested structures with string keys at the top-level dict, you can pickle the object (protocol 5) and hash the resulting bytes. If you need safety, you can use a safe serializer.
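A minimal sketch of that idea (note that pickle output is not guaranteed stable across Python versions, so this is best kept to caching within a single environment):

import hashlib
import pickle

def pickle_hash(obj):
    # hash the pickled bytes; suitable for caching within one environment
    return hashlib.sha256(pickle.dumps(obj, protocol=5)).hexdigest()

print(pickle_hash({"a": {"b": [1, 2, 3]}}))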
I do it like this:
hash(str(my_dict))
