Why aren't Python dicts unified? - python

After reading this question, I noticed that S. Lott might have liked to use an “ordered defaultdict”, but it doesn't exist. Now, I wonder: Why do we have so many dict classes in Python?
dict
blist.sorteddict
collections.OrderedDict
collections.defaultdict
weakref.WeakKeyDictionary
weakref.WeakValueDictionary
others?
Why not have something like this,
dict(initializer=[], sorted=False, ordered=False, default=None,
weak_keys=False, weak_values=False)
that unifies everything, and provides every useful combination?

One issue is that making this change would break backward-compatibility, due to this type of constructor usage that exists now:
>>> dict(one=1, two=2)
{'two': 2, 'one': 1}

Those extra options don't come for free. Since 99.9% of Python is built on dict, it is very important to make it as minimal and fast as possible.

Because the implementations differ a lot. You'd basically end up with a dict factory that returns an instance of a _dict (a very fast, low-overhead dictionary, i.e. the current dict), ordereddict, defaultdict, ... class. Also, you could not initialize dictionaries with keyword arguments anymore; programs relying on this would fail:
>>> dict(sorted=42)
{'sorted': 42}
# Your proposal would lead to an empty dictionary here (breaking compatibility)
Besides, when it's reasonable, the various classes already inherit from each other:
>>> collections.defaultdict.__bases__
(<type 'dict'>,)
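To make the inheritance point concrete, here is a small sketch that checks which of the standard-library dictionary classes are real dict subclasses. It assumes Python 3, where the abstract base classes live in collections.abc; the weakref dictionaries are the notable exception, since they implement the mapping interface without inheriting from dict.

```python
import collections
import collections.abc
import weakref

# These classes reuse the built-in hash table by subclassing dict.
dict_subclasses = [
    cls
    for cls in (collections.defaultdict, collections.OrderedDict, collections.Counter)
    if issubclass(cls, dict)
]

# The weakref dictionaries are mappings, but they are not dict subclasses.
weak_is_dict = issubclass(weakref.WeakKeyDictionary, dict)
weak_is_mapping = issubclass(
    weakref.WeakKeyDictionary, collections.abc.MutableMapping
)

print([cls.__name__ for cls in dict_subclasses])
print(weak_is_dict, weak_is_mapping)
```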

This is why languages have "mixins".
You could try to invent something like the following by defining the right bunch of classes.
class defaultdict( dict, unordered, default_init ): pass
class OrderedDict( dict, ordered, nodefault_init ): pass
class WeakKeyDict( dict, ordered, nodefault_init, weakkey ): pass
class KeyValueDict( dict, ordered, nodefault_init, weakvalue ): pass
Then, once you have those "unified" dictionaries, your applications look like this
groups= defaultdict( list )
No real change to the app, is there?
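The "ordered defaultdict" from the question can also be approximated without mixins. The sketch below is only an illustration: the class name OrderedDefaultDict is made up here, and it relies on dict subclasses calling __missing__ for absent keys. It does not try to handle copying, pickling, or a custom repr.

```python
from collections import OrderedDict


class OrderedDefaultDict(OrderedDict):
    """Hypothetical OrderedDict that creates missing values like defaultdict."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        if self.default_factory is None:
            raise KeyError(key)
        value = self[key] = self.default_factory()
        return value


groups = OrderedDefaultDict(list)
groups["b"].append(1)
groups["a"].append(2)
groups["b"].append(3)
print(list(groups.items()))
```

Application code still reads like the defaultdict version, which is the point being made above.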

Related

Custom class which is a dict, but initialized without dict copy?

For legibility purposes, I would like to have a custom class that behaves exactly like a dict (but carries a meaningful type instead of the more general dict type):
class Derivatives(dict):
    "Dictionary that represents the derivatives."
Now, is there a way of building new objects of this class in a way that does not involve copies? The naive usage
derivs = Derivatives({var: 1}) # var is a Python object
in fact creates a copy of the dictionary passed as an argument, which I would like to avoid, for efficiency reasons.
I tried to bypass the copy but then the class of the dict cannot be changed, in CPython:
class Derivatives(dict):
    def __new__(cls, init_dict):
        init_dict.__class__ = cls  # Fails with __class__ assignment: only for heap types
        return init_dict
I would like to have both the ability to give an explicit class name to the dictionaries that the program manipulates and an efficient way of building such dictionaries (instead of being forced to copy a Python dict). Is this doable efficiently in Python?
PS: The use case is maybe 100,000 creations of single-key Derivatives, where the key is a variable (not a string, so no keyword initialization). This is actually not slow, so "efficiency reasons" here means more something like "elegance": there is ideally no need to waste time doing a copy when the copy is not needed. So, in this particular case the question is more about the elegance/clarity that Python can bring here than about running speed.
By inheriting from dict you get three possible constructor signatures (barring the {} literal):
class dict(**kwarg)
class dict(mapping, **kwarg)
class dict(iterable, **kwarg)
This means that, in order to instantiate your instance you must do one of the following:
Pass the variables as keywords D(x=1) which are then packed into an intermediate dictionary anyway.
Create a plain dictionary and pass it as a mapping.
Pass an iterable of (key,value) pairs.
So in all three of these cases you will need to create intermediate objects to satisfy the dict constructor.
The third option, for a single pair, would look like D(((var,1),)), which I strongly recommend against for readability's sake.
So if you want your class to inherit from a dictionary, using Derivatives({var: 1}) is your most efficient and most readable option.
As a personal note: if you will have thousands of single-pair dictionaries, I'm not sure a dict is the best basis in the first place; you may want to reconsider the basis of your class.
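As a rough illustration of the constructor options discussed above, the following sketch builds the same single-key Derivatives in several ways. The Variable class is a hypothetical stand-in for the question's variable objects. The last variant (create an empty instance, then assign the item) is not mentioned in the answer; it is shown only as one further possibility that needs no intermediate container.

```python
class Derivatives(dict):
    """Dictionary that represents the derivatives."""


class Variable:
    """Hypothetical stand-in for the question's variable objects."""


var = Variable()

plain = {var: 1}
from_mapping = Derivatives(plain)      # copies the entries of a plain dict
from_pairs = Derivatives([(var, 1)])   # builds from an iterable of pairs

filled_later = Derivatives()           # no intermediate dict or list
filled_later[var] = 1

print(from_mapping == from_pairs == filled_later)
```

In every variant the key object itself is shared rather than copied; only the table that refers to it is new.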
TL;DR: There's no general-purpose way to do it unless you do it in C.
Long answer:
The dict class is implemented in C. Thus, there is no way to access its internal properties (most importantly, its internal hash table) unless you use C.
In C, you could simply copy the pointer representing the hash table into your object without having to iterate over the dict (key, value) pairs and insert them into your object. (Of course, it's a bit more complicated than this. Note that I omit memory management details).
Longer answer:
I'm not sure why you are concerned about efficiency.
Python passes arguments as references. It rarely ever copies unless you explicitly tell it to.
I read in the comments that you can't use named parameters, as the keys are actual Python objects. That leaves me to understand that you're worried about copying the dict keys (and maybe values). However, even the dictionary keys are not copied, and passed by reference! Consider this code:
class Test:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __hash__(self):
        return self.x
t = Test(1, 2)
print(t.y) # prints 2
d = {t: 1}
print(d[t]) # prints 1
keys = list(d.keys())
keys[0].y = 10
print(t.y) # prints 10! No copying was made when inserting object into dictionary.
Thus, the only remaining area of concern is iterating through the dict and inserting the values in your Derivatives class. This is unavoidable, unless you can somehow set the internal hash table of your class to the dict's internal hash table. There is no way to do this in pure python, as the dict class is implemented in C (as mentioned above).
Note that others have suggested using generators. This seems like a good idea too, say if you were reading the derivatives from a file or generating them with a simple formula. It would avoid creating the dict object in the first place. However, there will be no noticeable improvement in efficiency if the generators are just wrappers around lists (or any other data structure that can contain an arbitrary set of values).
Your best bet is to stick with your original method. Generators are great, but they can't efficiently represent an arbitrary set of values (which might be the case in your scenario). It's also not worth it to do it in C.
EDIT: It might be worth it to do it in C, after all!
I'm not too big on the details of the Python C API, but consider defining a class in C, for example, DerivativesBase (deriving from dict). All you do is define an __init__ function in C for DerivativesBase that takes a dict as a parameter and copies the hash table pointer from the dict into your DerivativesBase object. Then, in Python, your Derivatives class derives from DerivativesBase and implements the bulk of the functionality.

Sampling dictionaries in Python 3.x

In Python 3, dict_values, dict_keys and dict_items do not support indexing
my_dict = {'a': 0, 'b': 1, 'c': 2}
All of the queries below fail:
my_dict.keys()[1]
my_dict.values()[1]
my_dict.items()[1]
for that reason.
Sometimes I just want to get a random sample of what's in my dictionary. I know I can convert their output to lists. Do they have any other getter methods that do not require creating another data structure? (I would also imagine that converting them to a list would create a copy, which may not work well for huge dictionaries).
The view types are explained under Dictionary view objects, and are also guaranteed to be subclasses of collections.abc.KeysView and friends. Basically, this means you can only count on them having __contains__, __iter__, and __len__.
They don't directly support indexing because their ordering can be invalidated.* But practically, in any implementation of Python, they're only actually invalidated if you mutate the dictionary. Which means you can safely do things like this:
next(itertools.islice(my_dict.keys(), i, None))
Basically, the same way you'd index a set, or any other non-iterator iterable.
* The actual rules as to what behavior is documented have changed a few times. The current version actually says "They provide a dynamic view on the dictionary’s entries, which means that when the dictionary changes, the view reflects these changes," which implies the practical rule can now be relied on. But even if you're using an older version that, e.g., explicitly only guarantees consistency between adjacent calls to keys, values, items, and related functions, unless you're worried about someone writing a new implementation of Python 2.6 or 3.1 or something, there's no reason to worry about that.
Of course you probably want to wrap that up in a function that's more readable. In fact, I'd do it in two steps. First, use the nth function from the itertools recipes:
import itertools
def nth(iterable, n, default=None):
    return next(itertools.islice(iterable, n, None), default)
Then wrap up the key indexing:
def getkey(mapping, index, default=None):
    return nth(mapping.keys(), index, default)
What if you want a random sample? Well, dictionary views are Sized, as are dictionaries themselves, so you can always use randrange:
import random
def choosekey(mapping):
    return getkey(mapping, random.randrange(len(mapping)))
If you just want a key, value or item, use next() and iter():
next(iter(my_dict))
next(iter(my_dict.values()))
next(iter(my_dict.items()))
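Putting the view-based ideas above together, here is a self-contained sketch. The helper names nth_key and sample_keys are made up for illustration. Both walk the keys once, so each call is O(n) in the size of the dictionary, but neither copies all the keys into a list. The 'b' result assumes Python 3.7 or later, where dicts preserve insertion order.

```python
import itertools
import random


def nth_key(mapping, index):
    """Return the key at position `index` in iteration order."""
    # islice skips the first `index` keys; next() takes the following one.
    return next(itertools.islice(mapping.keys(), index, None))


def sample_keys(mapping, k):
    """Return k distinct random keys, visiting the keys only once."""
    wanted = set(random.sample(range(len(mapping)), k))
    return [key for i, key in enumerate(mapping) if i in wanted]


my_dict = {'a': 0, 'b': 1, 'c': 2}
print(nth_key(my_dict, 1))      # 'b' (insertion order, Python 3.7+)
print(sample_keys(my_dict, 2))  # two distinct keys, chosen at random
```

Only the k sampled keys end up in a new list; the position set holds integers, not dictionary entries.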

Wrapping an instance in a subclass's constructor before returning

I hope the title isn't too wrong or misleading; I'm not entirely sure what the name is for this kind of thing.
Basically, I've been doing a lot of work with dictionaries (and some subclasses thereof: defaultdict and OrderedDict, for instance) and am trying to just make a few helper functions which will do some of this lifting for me, while still operating across a number of different types of dictionary. Ideally, I'd like to do this quickly (that is, with some amount of optimization) and of course elegantly. Here's an example of what I have now:
def filter_dict(old_dict, keep_keys):
    return {k:v for k,v in old_dict.items() if k in keep_keys}
This works well, but if old_dict is anything other than a basic dictionary, I lose that.
Okay, so I can do something like type(old_dict)({.....}) to wrap the newly created dictionary in old_dict's class's constructor, but that still doesn't work for defaultdicts, which take as a first argument a function returning default value. If it were second argument, I could just pass some *args or something to a constructor, but this complicates things.
I guess my question is: can I get a partial application to a constructor? Can I get the guts of that instance and call its class constructor with everything intact except for the items? In Haskell, I'd do something like reverse = lambda f, x, *args: f(*args, x) but that's not even allowed in Python.
What am I missing? Some introspection into the classes? Tinkering with __new__? Factory... somethingoranother? I feel really dumb here trying to get something pretty but functional (har).
Thanks in advance for any insight.
I'd copy the original object, then change the copy. In your example:
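def filter_dict(old_dict, keep_keys):
    res = old_dict.copy()
    # Iterate over a snapshot of the keys: removing entries while iterating
    # over the dict itself raises RuntimeError in Python 3.
    for k in list(res.keys()):
        if k not in keep_keys:
            res.pop(k)
    return res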
It sounds like your helper functions only know about the things they want to change in old_dict, not the things they want to keep the same. So, the safest way seems to be to rely on the copy method (or the copy module) and then remove or change what you don't want. Otherwise you need some way to recall exactly what arguments were used to create the object, and I don't know if a general method for doing this exists.
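A variation on the copy-then-modify idea, sketched under the assumption that the mapping supports copy.copy(), clear() and item assignment (true for dict, defaultdict and OrderedDict). The function name filter_dict_refill is made up here. Instead of deleting unwanted keys one by one, it empties the copy and refills it, which can be cheaper when only a few keys are kept.

```python
import copy
from collections import OrderedDict, defaultdict


def filter_dict_refill(old_dict, keep_keys):
    """Return a filtered shallow copy that keeps old_dict's type and settings."""
    res = copy.copy(old_dict)  # same class, same default_factory, etc.
    res.clear()
    for key, value in old_dict.items():
        if key in keep_keys:
            res[key] = value
    return res


dd = defaultdict(list, a=[1], b=[2], c=[3])
filtered = filter_dict_refill(dd, {'a', 'c'})
print(type(filtered).__name__, dict(filtered))

od = OrderedDict([('x', 1), ('y', 2), ('z', 3)])
print(filter_dict_refill(od, {'z', 'x'}))
```

The original mapping is left untouched, and the result still behaves like the input type (for example, the filtered defaultdict keeps creating lists for missing keys).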

Hashing an immutable dictionary in Python

Short version: What's the best hashing algorithm for a multiset implemented as a dictionary of unordered items?
I'm trying to hash an immutable multiset (which is a bag or multiset in other languages: like a mathematical set except that it can hold more than one of each element) implemented as a dictionary. I've created a subclass of the standard library class collections.Counter, similar to the advice here: Python hashable dicts, which recommends a hash function like so:
class FrozenCounter(collections.Counter):
    # ...
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
Creating the full tuple of items takes up a lot of memory (relative to, say, using a generator) and hashing will occur in an extremely memory intensive part of my application. More importantly, my dictionary keys (multiset elements) probably won't be order-able.
I'm thinking of using this algorithm:
def __hash__(self):
    # XOR the hashes of the (element, count) pairs; XOR-ing the tuples
    # themselves would raise TypeError.
    return functools.reduce(lambda a, b: a ^ hash(b), self.items(), 0)
I figure using bitwise XOR means order doesn't matter for the hash value unlike in the hashing of a tuple? I suppose I could semi-implement the Python tuple-hashing algorithm on the unordered stream of tuples of my data. See https://github.com/jonashaag/cpython/blob/master/Include/tupleobject.h (search in the page for the word 'hash') -- but I barely know enough C to read it.
Thoughts? Suggestions? Thanks.
(If you're wondering why I'm messing around with trying to hash a multiset: The input data for my problem are sets of multisets, and within each set of multisets, each multiset must be unique. I'm working on a deadline and I'm not an experienced coder, so I wanted to avoid inventing new algorithms where possible. It seems like the most Pythonic way to make sure I have unique of a bunch of things is to put them in a set(), but the things must be hashable.)
What I've gathered from the comments
Both @marcin and @senderle gave pretty much the same answer: use hash(frozenset(self.items())). This makes sense because items() "views" are set-like. @marcin was first but I gave the check mark to @senderle because of the good research on the big-O running times for different solutions. @marcin also reminds me to include an __eq__ method -- but the one inherited from dict will work just fine. This is how I'm implementing everything -- further comments and suggestions based on this code are welcome:
class FrozenCounter(collections.Counter):
    # Edit: A previous version of this code included a __slots__ definition.
    # But, from the Python documentation: "When inheriting from a class without
    # __slots__, the __dict__ attribute of that class will always be accessible,
    # so a __slots__ definition in the subclass is meaningless."
    # http://docs.python.org/py3k/reference/datamodel.html#notes-on-using-slots
    # ...
    def __hash__(self):
        "Implements hash(self) -> int"
        if not hasattr(self, '_hash'):
            self._hash = hash(frozenset(self.items()))
        return self._hash
Since the dictionary is immutable, you can create the hash when the dictionary is created and return it directly. My suggestion would be to create a frozenset from items (in 3+; iteritems in 2.7), hash it, and store the hash.
To provide an explicit example:
>>> frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems())
frozenset([(3, 2), (1, 3), (4, 1), (2, 1)])
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems()))
-3071743570178645657
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 4]).iteritems()))
-6559486438209652990
To clarify why I prefer a frozenset to a tuple of sorted items: a frozenset doesn't have to sort the items, and so the initial hash completes in O(n) time rather than O(n log n) time. This can be seen from the frozenset_hash and set_next implementations.
See also this great answer from Raymond Hettinger describing his implementation of the frozenset hash function. There he explicitly explains how the hash function avoids having to sort values to get a stable, order insensitive value.
Have you considered hash(tuple(sorted(hash(x) for x in self.items())))? That way, you are only sorting integers, so the items themselves don't have to be orderable. (The tuple() call is needed because sorted() returns a list, which is not hashable.)
You could also xor the element hashes together, but frankly I don't know how well that would work (would you have a lot of collisions?). Speaking of collisions, don't you have to implement the __eq__ method?
Alternatively, similar to my answer here, hash(frozenset(self.items())).
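The frozenset approach recommended in the answers can be exercised with a stripped-down sketch. This version omits the hash caching shown earlier and does not enforce immutability, so it is only safe if the counters are never modified after being hashed. The mixed string and integer keys are chosen to show that the elements need not be orderable.

```python
import collections


class FrozenCounter(collections.Counter):
    """Sketch of a hashable multiset; treat instances as read-only."""

    def __hash__(self):
        # frozenset hashing is order-insensitive and needs no sorting.
        return hash(frozenset(self.items()))


a = FrozenCounter(['x', 1, 1])
b = FrozenCounter([1, 'x', 1])   # same multiset, different input order
c = FrozenCounter(['x', 1])      # a different multiset

unique = {a, b, c}
print(len(unique))
```

Because a and b compare equal and hash equally, the set keeps only one of them, which is the uniqueness check the question was after.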

How to glob for iterable element

I have a python dictionary that contains iterables, some of which are lists, but most of which are other dictionaries. I'd like to do glob-style assignment similar to the following:
myiter['*']['*.txt']['name'] = 'Woot'
That is, for each element in myiter, look up all elements with keys ending in '.txt' and then set their 'name' item to 'Woot'.
I've thought about sub-classing dict and using the fnmatch module. But, it's unclear to me what the best way of accomplishing this is.
The best way, I think, would be not to do it -- '*' is a perfectly valid key in a dict, so myiter['*'] has a perfectly well defined meaning and usefulness, and subverting that can definitely cause problems. How to "glob" over keys which are not strings, including the exclusively integer "keys" (indices) in elements which are lists and not mappings, is also quite a design problem.
If you nevertheless must do it, I would recommend taking full control by subclassing the abstract base class collections.MutableMapping (collections.abc.MutableMapping in Python 3), and implementing the needed methods (__len__, __iter__, __getitem__, __setitem__, __delitem__, and, for better performance, also overriding others such as __contains__, which the ABC does implement on the base of the others, but slowly) in terms of a contained dict. Subclassing dict instead, as per other suggestions, would require you to override a huge number of methods to avoid inconsistent behavior between the use of "keys containing wildcards" in the methods you do override, and in those you don't.
Whether you subclass collections.MutableMapping, or dict, to make your Globbable class, you have to make a core design decision: what does yourthing[somekey] return when yourthing is a Globbable?
Presumably it has to return a different type when somekey is a string containing wildcards, versus anything else. In the latter case, one would imagine, just what is actually at that entry; but in the former, it can't just return another Globbable -- otherwise, what would yourthing[somekey] = 'bah' do in the general case? For your single "slick syntax" example, you want it to set a somekey entry in each of the items of yourthing (a HUGE semantic break with the behavior of every other mapping in the universe;-) -- but then, how would you ever set an entry in yourthing itself?!
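For reference, a minimal sketch of the "subclass the ABC around a contained dict" approach might look like the following. The class name GlobMapping and its glob method are assumptions made for illustration. To sidestep the design problem described above, wildcard lookup is an explicit method here, and plain indexing keeps its ordinary meaning, so '*' remains usable as a normal key.

```python
from collections.abc import MutableMapping
from fnmatch import fnmatch


class GlobMapping(MutableMapping):
    """Hypothetical mapping wrapper with an explicit wildcard lookup."""

    def __init__(self, *args, **kwargs):
        self._data = dict(*args, **kwargs)

    # The five abstract methods, all delegating to the contained dict.
    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    # Optional override: faster than the ABC's default implementation.
    def __contains__(self, key):
        return key in self._data

    def glob(self, pattern):
        """Return a plain dict of the entries whose string keys match."""
        return {
            k: v
            for k, v in self._data.items()
            if isinstance(k, str) and fnmatch(k, pattern)
        }


files = GlobMapping({'a.txt': 1, 'b.jpg': 2, 3: 'integer key'})
files['*'] = 'an ordinary key'
print(files.glob('*.txt'))
```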
Let's see if the Zen of Python has anything to say about this "slick syntax" for which you yearn...:
>>> import this
...
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Consider for a moment the alternative of losing the "slick syntax" (and all the huge semantic headaches it necessarily implies) in favor of clarity and simplicity (using Python 2.7-and-better syntax here, just for the dict comprehension -- use an explicit dict(...) call instead if you're stuck with 2.6 or earlier), e.g.:
import fnmatch
def match(s, pat):
    try: return fnmatch.fnmatch(s, pat)
    except TypeError: return False
def sel(ds, pat):
    return [d[k] for d in ds for k in d if match(k, pat)]
def set(ds, k, v):
    for d in ds: d[k] = v
so your assignment might become
set(sel(sel([myiter], '*'), '*.txt'), 'name', 'Woot')
(the selection with '*' being redundant if all , I'm just omitting it). Is this so horrible as to be worth the morass of issues I've mentioned above in order to use instead
myiter['*']['*.txt']['name'] = 'Woot'
...? By far the clearest and best-performing way, of course, remains the even-simpler
def match(k, v, pat):
    try:
        if fnmatch.fnmatch(k, pat):
            return isinstance(v, dict)
    except TypeError:
        return False
for k, v in myiter.items():
    if match(k, v, '*'):
        for sk, sv in v.items():
            if match(sk, sv, '*.txt'):
                sv['name'] = 'Woot'
but if you absolutely crave conciseness and compactness, despising the Zen of Python's koan "Sparse is better than dense", you can at least obtain them without the various nightmares I mentioned as needed to achieve your ideal "syntax sugar".
The best way is to subclass dict and use the fnmatch module.
subclass dict: adding functionality you want in an object-oriented way.
fnmatch module: reuse of existing functionality.
You could use fnmatch for functionality to match on dictionary keys although you would have to compromise syntax slightly, especially if you wanted to do this on a nested dictionary. Perhaps a custom dictionary-like class with a search method to return wildcard matches would work well.
Here is a VERY BASIC example that comes with a warning that this is NOT RECURSIVE and will not handle nested dictionaries:
from fnmatch import fnmatch
class GlobDict(dict):
    def glob(self, match):
        """@match should be a glob style pattern match (e.g. '*.txt')"""
        return dict([(k,v) for k,v in self.items() if fnmatch(k, match)])
# Start with a basic dict
basic_dict = {'file1.jpg':'image', 'file2.txt':'text', 'file3.mpg':'movie',
              'file4.txt':'text'}
# Create a GlobDict from it
glob_dict = GlobDict( **basic_dict )
# Then get glob-style results!
globbed_results = glob_dict.glob('*.txt')
# => {'file4.txt': 'text', 'file2.txt': 'text'}
As for what way is the best? The best way is the one that works. Don't try to optimize a solution before it's even created!
Following the principle of least magic, perhaps just define a recursive function, rather than subclassing dict:
import fnmatch
def set_dict_with_pat(it,key_patterns,value):
    if len(key_patterns)>1:
        for key in it:
            if fnmatch.fnmatch(key,key_patterns[0]):
                set_dict_with_pat(it[key],key_patterns[1:],value)
    else:
        for key in it:
            if fnmatch.fnmatch(key,key_patterns[0]):
                it[key]=value
Which could be used like this:
myiter=({'dir1':{'a.txt':{'name':'Roger'},'b.notxt':{'name':'Carl'}},'dir2':{'b.txt':{'name':'Sally'}}})
set_dict_with_pat(myiter,['*','*.txt','name'],'Woot')
print(myiter)
# {'dir2': {'b.txt': {'name': 'Woot'}}, 'dir1': {'b.notxt': {'name': 'Carl'}, 'a.txt': {'name': 'Woot'}}}
