Sampling dictionaries in Python 3.x

In Python 3, dict_keys, dict_values and dict_items do not support indexing:
my_dict = {'a': 0, 'b': 1, 'c': 2}
All of the queries below fail for that reason:
my_dict.keys()[1]
my_dict.values()[1]
my_dict.items()[1]
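For example, in a recent CPython 3 session the failure looks like this (older 3.x releases word the TypeError slightly differently):
>>> my_dict = {'a': 0, 'b': 1, 'c': 2}
>>> my_dict.keys()[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'dict_keys' object is not subscriptable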
Sometimes I just want to get a random sample of what's in my dictionary. I know I can convert their output to lists. Do they have any other getter methods that do not require creating another data structure? (I would also imagine that converting them to a list would create a copy, which may not work well for huge dictionaries.)

These view types are documented under Dictionary view objects, and are also guaranteed to be subclasses of collections.abc.KeysView and friends. Basically, this means you can only count on them having __contains__, __iter__, and __len__.
They don't directly support indexing because their ordering can be invalidated.* But practically, in any implementation of Python, they're only actually invalidated if you mutate the dictionary. Which means you can safely do things like this:
next(itertools.islice(my_dict.keys(), i, None))
Basically, the same way you'd index a set, or any other non-iterator iterable.
* The actual rules as to what behavior is documented have changed a few times. The current version actually says "They provide a dynamic view on the dictionary’s entries, which means that when the dictionary changes, the view reflects these changes," which implies the practical rule can now be relied on. But even if you're using an older version that, e.g., explicitly only guarantees consistency between adjacent calls to keys, values, items, and related functions, unless you're worried about someone writing a new implementation of Python 2.6 or 3.1 or something, there's no reason to worry about that.
Of course you probably want to wrap that up in a function that's more readable. In fact, I'd do it in two steps. First, use the nth function from the itertools recipes:
def nth(iterable, n, default=None):
    return next(itertools.islice(iterable, n, None), default)
Then wrap up the key indexing:
def getkey(mapping, index, default=None):
    return nth(mapping.keys(), index, default)
What if you want a random sample? Well, dictionary views are Sized, as are dictionaries themselves, so you can always use randrange:
def choosekey(mapping):
    return getkey(mapping, random.randrange(len(mapping)))
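Putting the pieces together, here is the whole thing as a runnable sketch (the demo dict and printed values are just illustrative):
import itertools
import random

def nth(iterable, n, default=None):
    # n-th item of any iterable, without materializing a list
    return next(itertools.islice(iterable, n, None), default)

def getkey(mapping, index, default=None):
    return nth(mapping.keys(), index, default)

def choosekey(mapping):
    # dict views (and dicts) are Sized, so len() is cheap
    return getkey(mapping, random.randrange(len(mapping)))

my_dict = {'a': 0, 'b': 1, 'c': 2}
print(getkey(my_dict, 1))   # 'b' (insertion order, guaranteed in 3.7+)
print(choosekey(my_dict))   # a uniformly random key, no list copy made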

If you just want a key, value or item, use next() and iter():
next(iter(my_dict))
next(iter(my_dict.values()))
next(iter(my_dict.items()))
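For example, a quick interactive check (insertion order is guaranteed as of Python 3.7):
>>> my_dict = {'a': 0, 'b': 1, 'c': 2}
>>> next(iter(my_dict))
'a'
>>> next(iter(my_dict.values()))
0
>>> next(iter(my_dict.items()))
('a', 0)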

Related

Why does the **kwargs mapping compare equal with a differently ordered OrderedDict?

According to PEP 468:
Starting in version 3.6 Python will preserve the order of keyword arguments as passed to a function. To accomplish this the collected kwargs will now be an ordered mapping. Note that this does not necessarily mean OrderedDict.
In that case, why does this ordered mapping fail to respect equality comparison with Python's canonical ordered mapping type, the collections.OrderedDict:
>>> from collections import OrderedDict
>>> data = OrderedDict(zip('xy', 'xy'))
>>> def foo(**kwargs):
...     return kwargs == data
...
>>> foo(x='x', y='y')  # expected result: True
True
>>> foo(y='y', x='x')  # expected result: False
True
Although iteration order is now preserved, kwargs seems to behave just like a normal dict for comparisons. Python has had a C-implemented ordered dict since 3.5, so it could conceivably have been used directly (or, if performance was still a concern, a faster implementation using a thin subclass of the 3.6 compact dict).
Why doesn't the ordered mapping received by a function respect ordering in equality comparisons?
Regardless of what an “ordered mapping” means, as long as it’s not necessarily an OrderedDict, OrderedDict’s == won’t take its order into account. From the docs:
Equality tests between OrderedDict objects are order-sensitive and are implemented as list(od1.items())==list(od2.items()). Equality tests between OrderedDict objects and other Mapping objects are order-insensitive like regular dictionaries. This allows OrderedDict objects to be substituted anywhere a regular dictionary is used.
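A quick interactive illustration of that documented behavior:
>>> from collections import OrderedDict
>>> a = OrderedDict([('x', 1), ('y', 2)])
>>> b = OrderedDict([('y', 2), ('x', 1)])
>>> a == b          # OrderedDict vs OrderedDict: order-sensitive
False
>>> a == dict(b)    # OrderedDict vs plain dict: order-insensitive
True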
"Ordered mapping" only means the mapping has to preserve order. It doesn't mean that order has to be part of the mapping's == relation.
The purpose of PEP 468 is just to preserve the ordering information. Having order be part of == would produce backward incompatibility without any real benefit to any of the use cases that motivated PEP 468. Using OrderedDict would also be more expensive (since OrderedDict still maintains its own separate linked list to track order, and it can't abandon that linked list without sacrificing big-O efficiency in popitem and move_to_end).
The answer to your first 'why' is that this feature is implemented using a plain dict in CPython. As @Ryan's answer points out, this means that comparisons won't be order-sensitive.
The second 'why' here is why this doesn't use an OrderedDict.
Using an OrderedDict was the initial plan, as stated in the first draft of PEP 468. The idea, as stated in this reply, was to collect some perf data to show the effect of plugging in the OrderedDict, since this was a point of contention when the idea had been floated around before. The author of the PEP even alluded to the order-preserving dict being another option in the final reply on that thread.
After that, the conversation on the topic seems to have died down until Python 3.6 came along. The new dict had the nice side effect of implementing PEP 468 out of the box (as this Python-dev thread states). The specific message in that thread also notes that the author wanted the term OrderedDict changed to "ordered mapping". (This is also when the first new commit on PEP 468 since the original was made.)
As far as I can tell, this rewording was done in order to allow other implementations to provide this feature as they see fit. CPython and PyPy already had a dict that easily implemented PEP 468, other implementations might opt for an OrderedDict, others could go for another form of an ordered mapping.
That does open the door for a problem, though. It means that, theoretically, an implementation of Python 3.6 that used an OrderedDict to implement this feature would have order-sensitive kwargs comparisons, while others (such as CPython) wouldn't. (In Python 3.7, all dicts are required to be insertion-ordered, so the point is probably moot: all implementations would just use dict for **kwargs.)
Though it does seem like an issue, it really isn't. As @user2357112 pointed out, there's no guarantee on ==. PEP 468 only guarantees order. As far as I can tell, == is basically implementation-defined.
In short, it compares equal in CPython because kwargs in CPython is a dict and it's a dict because after 3.6 the whole thing just worked.
Just to add: if you do want to make this check without relying on an implementation detail (one which, in Python 3.7, is no longer just an implementation detail), just do
>>> from collections import OrderedDict
>>> data = OrderedDict(zip('xy', 'xy'))
>>> def foo(**kwargs):
...     return OrderedDict(kwargs) == data
...
since comparisons between two OrderedDict objects are guaranteed to be order-sensitive.

Custom class which is a dict, but initialized without dict copy?

For legibility purposes, I would like to have a custom class that behaves exactly like a dict (but carries a meaningful type instead of the more general dict type):
class Derivatives(dict):
    "Dictionary that represents the derivatives."
Now, is there a way of building new objects of this class in a way that does not involve copies? The naive usage
derivs = Derivatives({var: 1}) # var is a Python object
in fact creates a copy of the dictionary passed as an argument, which I would like to avoid, for efficiency reasons.
I tried to bypass the copy, but the class of a dict cannot be changed, at least in CPython:
class Derivatives(dict):
    def __new__(cls, init_dict):
        init_dict.__class__ = cls  # Fails with "__class__ assignment: only for heap types"
        return init_dict
I would like to have both the ability to give an explicit class name to the dictionaries that the program manipulates and an efficient way of building such dictionaries (instead of being forced to copy a Python dict). Is this doable efficiently in Python?
PS: The use case is maybe 100,000 creations of single-key Derivatives, where the key is a variable (not a string, so no keyword initialization). This is actually not slow, so "efficiency reasons" here means more something like "elegance": there is ideally no need to waste time doing a copy when the copy is not needed. So, in this particular case the question is more about the elegance/clarity that Python can bring here than about running speed.
By inheriting from dict you are given three possibilities for constructor arguments (barring the {} literal):
class dict(**kwarg)
class dict(mapping, **kwarg)
class dict(iterable, **kwarg)
This means that, in order to instantiate your instance you must do one of the following:
Pass the variables as keywords, D(x=1), which are then packed into an intermediate dictionary anyway.
Create a plain dictionary and pass it as a mapping.
Pass an iterable of (key,value) pairs.
So in all three of these cases you will need to create intermediate objects to satisfy the dict constructor.
For a single pair, the third option would look like D(((var, 1),)), which I highly recommend against for readability's sake.
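For concreteness, here is a sketch of the three forms (D stands in for Derivatives, and var for an arbitrary hashable object):
class D(dict):
    "Stand-in for the Derivatives class."

var = object()            # any hashable, non-string key
d1 = D(x=1)               # keyword form: string keys only
d2 = D({var: 1})          # mapping form: copies the intermediate dict
d3 = D(((var, 1),))       # iterable-of-pairs form
assert d2 == d3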
So if you want your class to inherit from a dictionary, using Derivatives({var: 1}) is your most efficient and most readable option.
As a personal note: if you will have thousands of single-pair dictionaries, I'm not sure a dict is the best basis in the first place; you may want to reconsider the basis of your class.
TL;DR: There's no general-purpose way to do it unless you do it in C.
Long answer:
The dict class is implemented in C. Thus, there is no way to access its internal properties (most importantly, its internal hash table) unless you use C.
In C, you could simply copy the pointer representing the hash table into your object without having to iterate over the dict's (key, value) pairs and insert them into your object. (Of course, it's a bit more complicated than this; note that I omit memory-management details.)
Longer answer:
I'm not sure why you are concerned about efficiency.
Python passes arguments by reference. It rarely ever copies unless you explicitly tell it to.
I read in the comments that you can't use named parameters, as the keys are actual Python objects. That leads me to understand that you're worried about copying the dict keys (and maybe values). However, even the dictionary keys are not copied; they are passed by reference! Consider this code:
class Test:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __hash__(self):
        return self.x

t = Test(1, 2)
print(t.y)  # prints 2
d = {t: 1}
print(d[t])  # prints 1
keys = list(d.keys())
keys[0].y = 10
print(t.y)  # prints 10! No copying was made when inserting object into dictionary.
Thus, the only remaining area of concern is iterating through the dict and inserting the values in your Derivatives class. This is unavoidable, unless you can somehow set the internal hash table of your class to the dict's internal hash table. There is no way to do this in pure python, as the dict class is implemented in C (as mentioned above).
Note that others have suggested using generators. This seems like a good idea too -- say, if you were reading the derivatives from a file or generating them with a simple formula. It would avoid creating the dict object in the first place. However, there will be no noticeable improvement in efficiency if the generators are just wrappers around lists (or any other data structure that can contain an arbitrary set of values).
Your best bet is to stick with your original method. Generators are great, but they can't efficiently represent an arbitrary set of values (which might be the case in your scenario). It's also not worth it to do it in C.
EDIT: It might be worth it to do it in C, after all!
I'm not too big on the details of the Python C API, but consider defining a class in C, for example, DerivativesBase (deriving from dict). All you do is define an __init__ function in C for DerivativesBase that takes a dict as a parameter and copies the hash table pointer from the dict into your DerivativesBase object. Then, in Python, your Derivatives class derives from DerivativesBase and implements the bulk of the functionality.

Why does Python's dict.keys() return a list and not a set?

I would've expected Python's keys method to return a set instead of a list, since keys most closely resemble the kind of guarantees that the keys of a hashmap give: specifically, they are unique and not sorted, like a set. However, this method returns a list:
>>> d = {}
>>> d.keys().__class__
<type 'list'>
Is this just a mistake in the Python API or is there some other reason I am missing?
One reason is that dict.keys() predates the introduction of sets into the language.
Note that the return type of dict.keys() has changed in Python 3: the function now returns a "set-like" view rather than a list.
For set-like views, all of the operations defined for the abstract base class collections.abc.Set are available (for example, ==, <, or ^).
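For example, a quick interactive check in Python 3 (the exact iteration order of the set results is arbitrary):
>>> a = {'x': 1, 'y': 2}
>>> b = {'y': 20, 'z': 30}
>>> a.keys() & b.keys()      # intersection, like sets
{'y'}
>>> a.keys() ^ b.keys()      # symmetric difference
{'x', 'z'}
>>> a.keys() == {'x', 'y'}   # compares equal to a real set
True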
In Python 2, it's less efficient to construct a set than a list, and we don't want to assume that the user of the return value will want to search within the result; iteration is just as likely.
In Python 3, it's no longer a list: it's a set-like view, and it iterates in insertion order, which is guaranteed as of Python 3.7.

Given an arbitrary collection, is there a way to tell if it is ordered?

Here's what I have so far:
def is_ordered(collection):
    if isinstance(collection, set):
        return False
    if isinstance(collection, list):
        return True
    if isinstance(collection, dict):
        return False
    raise Exception("unknown collection")
Is there a much better way to do this?
NB: I do mean ordered and not sorted.
Motivation:
I want to iterate over an ordered collection. e.g.
def most_important(priorities):
    for p in priorities:
        print p
In this case the fact that priorities is ordered is important; what kind of collection it is is not. I'm trying to live by duck typing here, and I have frequently been dissuaded from type checking by Pythonistas.
If the collection is truly arbitrary (meaning it can be of any class whatsoever), then the answer has to be no.
Basically, there are two possible approaches:
know about every possible class that can be presented to your method, and whether it's ordered;
test the collection yourself by inserting into it every possible combination of keys, and seeing whether the ordering is preserved.
The latter is clearly infeasible. The former is along the lines of what you already have, except that you have to know about every derived class such as collections.OrderedDict; checking for dict is not enough.
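For example, the is_ordered function from the question misclassifies collections.OrderedDict, precisely because an OrderedDict passes the isinstance(collection, dict) check:
>>> from collections import OrderedDict
>>> isinstance(OrderedDict(a=1), dict)   # an OrderedDict is a dict...
True
>>> is_ordered(OrderedDict(a=1))         # ...so the dict branch wrongly fires
False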
Frankly, I think the whole is_ordered check is a can of worms. Why do you want to do this anyway?
Update: In essence, you are trying to unittest the argument passed to you. Stop doing that, and unittest your own code. Test your consumer (make sure it works with ordered collections), and unittest the code that calls it, to ensure it is getting the right results.
In a statically-typed language you would simply restrict yourself to specific types. If you really want to replicate that, simply specify the only types you accept, and test for those. Raise an exception if anything else is passed. It's not Pythonic, but it reliably achieves what you want.
Well, you have two possible approaches:
Anything with an append method is almost certainly ordered; and
If it only has an add method, you can try adding a nonce-value, then iterating over the collection to see if the nonce appears at the end (or, perhaps at one end); you could try adding a second nonce and doing it again just to be more confident.
Of course, this won't work where e.g. the collection is empty, or there is an ordering function that doesn't result in addition at the ends.
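A hedged sketch of that heuristic (appears_ordered is a hypothetical helper; it mutates its argument, so pass a throwaway copy, and note a plain set can still fool it by chance):
def appears_ordered(collection):
    # Hypothetical heuristic, not a guarantee.
    if hasattr(collection, 'append'):   # list-like: almost certainly ordered
        return True
    if hasattr(collection, 'add'):      # set-like: probe with a nonce value
        nonce = object()
        collection.add(nonce)
        items = list(collection)
        # An ordered set-like type should keep the nonce at one end.
        return items[-1] is nonce or items[0] is nonce
    return False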
Probably a better solution is simply to specify that your code requires ordered collections, and only pass it ordered collections.
I think that enumerating the 90% case is about as good as you're going to get (if using Python 3, replace basestring with str). You'll probably also want to consider how you would handle generator expressions and their ilk (again, if using Py3, skip the xrangor):
import collections

generator = type((i for i in xrange(0)))
enumerator = type(enumerate(range(0)))
xrangor = type(xrange(0))
is_ordered = lambda seq: isinstance(seq, (tuple, list, collections.OrderedDict,
                                          basestring, generator, enumerator, xrangor))
If your callers start using itertools, then you'll also need to add itertools types as returned by islice, imap, groupby. But the sheer number of these special cases really starts to point to a code smell.
What if the list is not ordered, e.g. [1,3,2]?

Hashing an immutable dictionary in Python

Short version: What's the best hashing algorithm for a multiset implemented as a dictionary of unordered items?
I'm trying to hash an immutable multiset (which is a bag or multiset in other languages: like a mathematical set except that it can hold more than one of each element) implemented as a dictionary. I've created a subclass of the standard library class collections.Counter, similar to the advice here: Python hashable dicts, which recommends a hash function like so:
class FrozenCounter(collections.Counter):
    # ...
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
Creating the full tuple of items takes up a lot of memory (relative to, say, using a generator), and hashing will occur in an extremely memory-intensive part of my application. More importantly, my dictionary keys (multiset elements) probably won't be orderable.
I'm thinking of using this algorithm:
def __hash__(self):
    # XOR the hash of each (element, count) item; a ^ hash(b), since the
    # items themselves are tuples and can't be XORed directly
    return functools.reduce(lambda a, b: a ^ hash(b), self.items(), 0)
I figure using bitwise XOR of the item hashes means order doesn't matter for the hash value, unlike in the hashing of a tuple? I suppose I could semi-implement the Python tuple-hashing algorithm on the unordered stream of tuples of my data. See https://github.com/jonashaag/cpython/blob/master/Include/tupleobject.h (search in the page for the word 'hash') -- but I barely know enough C to read it.
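A quick sanity check that the XOR approach really is order-insensitive (xor_hash is a hypothetical stand-in for the __hash__ body above; XOR is commutative and associative, so reduction order can't matter):
>>> import functools
>>> def xor_hash(items):
...     return functools.reduce(lambda a, b: a ^ hash(b), items, 0)
...
>>> xor_hash([('a', 1), ('b', 2)]) == xor_hash([('b', 2), ('a', 1)])
True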
Thoughts? Suggestions? Thanks.
(If you're wondering why I'm messing around with trying to hash a multiset: The input data for my problem are sets of multisets, and within each set of multisets, each multiset must be unique. I'm working on a deadline and I'm not an experienced coder, so I wanted to avoid inventing new algorithms where possible. It seems like the most Pythonic way to make sure a bunch of things are unique is to put them in a set(), but the things must be hashable.)
What I've gathered from the comments
Both @marcin and @senderle gave pretty much the same answer: use hash(frozenset(self.items())). This makes sense because items() "views" are set-like. @marcin was first, but I gave the check mark to @senderle because of the good research on the big-O running times of the different solutions. @marcin also reminds me to include an __eq__ method -- but the one inherited from dict will work just fine. This is how I'm implementing everything -- further comments and suggestions based on this code are welcome:
import collections

class FrozenCounter(collections.Counter):
    # Edit: A previous version of this code included a __slots__ definition.
    # But, from the Python documentation: "When inheriting from a class without
    # __slots__, the __dict__ attribute of that class will always be accessible,
    # so a __slots__ definition in the subclass is meaningless."
    # http://docs.python.org/py3k/reference/datamodel.html#notes-on-using-slots
    # ...
    def __hash__(self):
        "Implements hash(self) -> int"
        if not hasattr(self, '_hash'):
            self._hash = hash(frozenset(self.items()))
        return self._hash
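A quick usage check of the class above (the strings are illustrative; the specific hash values vary between runs and builds):
>>> fc1 = FrozenCounter('abracadabra')
>>> fc2 = FrozenCounter('arbadacarba')   # same letters, different order
>>> fc1 == fc2
True
>>> hash(fc1) == hash(fc2)
True
>>> len({fc1, fc2})                      # hashable, so usable in a set
1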
Since the dictionary is immutable, you can create the hash when the dictionary is created and return it directly. My suggestion would be to create a frozenset from the items (items() in Python 3+; iteritems() in 2.7), hash it, and store the hash.
To provide an explicit example:
>>> frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems())
frozenset([(3, 2), (1, 3), (4, 1), (2, 1)])
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems()))
-3071743570178645657
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 4]).iteritems()))
-6559486438209652990
To clarify why I prefer a frozenset to a tuple of sorted items: a frozenset doesn't have to sort the items, and so the initial hash completes in O(n) time rather than O(n log n) time. This can be seen from the frozenset_hash and set_next implementations.
See also this great answer from Raymond Hettinger describing his implementation of the frozenset hash function. There he explicitly explains how the hash function avoids having to sort values to get a stable, order insensitive value.
Have you considered hash(tuple(sorted(hash(x) for x in self.items())))? That way, you only sort integers rather than items (note the tuple() call: sorted() returns a list, which is not itself hashable).
You could also XOR the element hashes together, but frankly I don't know how well that would work (would you have a lot of collisions?). Speaking of collisions, don't you have to implement the __eq__ method?
Alternatively, similar to my answer here, hash(frozenset(self.items())).
