Python: Understanding dictionary view objects

I've been trying to understand the built-in view objects returned by .items(), .values(), and .keys() in Python 3, or equivalently by .viewitems(), .viewvalues(), and .viewkeys() in Python 2.7. There are other threads on the subject, but none (not even the docs) seems to describe how they work internally.
The main gain here seems to be efficiency, compared to the list copies returned in Python 2. They are often compared to a window onto the dictionary's items (like in this thread).
But what is that window, and why is it more efficient?
The only thing I can see is that the view objects seem to be set-like objects, which are generally faster for membership testing. But is this the only factor?
Code sample
>>> example_dict = {'test':'test'}
>>> example_dict.items()
dict_items([('test', 'test')])
>>> type(example_dict.items())
<class 'dict_items'>
So, my question is regarding this dict_items class. How does that work internally?

Dict views store a reference to their parent dict, and they translate operations on the view to corresponding operations on the dict.
Iteration over a dict view is more efficient than building a list and iterating over that, because building a list takes time and memory that you don't have to spend with the view. The old way, Python would iterate over the dict's underlying storage to build a new list, and then you would iterate over the list. Iterating over a dict view uses an iterator that walks through the dict's underlying storage directly, skipping the unnecessary list step.
Dict views also support efficient containment tests and setlike intersection/difference/etc. operations, because they get to perform direct hash lookups on the underlying dict instead of iterating through a list and checking equality element by element.
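For example (a toy dict, just to illustrate):
>>> d = {'a': 1, 'b': 2}
>>> 'a' in d.keys()          # direct hash lookup on the underlying dict
True
>>> ('b', 2) in d.items()    # hash lookup on the key, then an equality check on the value
True
>>> d.keys() & {'a', 'c'}    # set-like intersection, no intermediate list
{'a'}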
If you want to see the concrete implementation used by CPython, you can take a look in the official repository, but this implementation is subject to change. It has changed, repeatedly.

One of the main advantages is that views are dynamic:
>>> di={1:'one',2:'two',3:'three'}
>>> view=di.viewitems()
>>> view
dict_items([(1, 'one'), (2, 'two'), (3, 'three')])
>>> di[2]='new two'
>>> view
dict_items([(1, 'one'), (2, 'new two'), (3, 'three')])
Therefore you do not need to regenerate the item, key, or value list (as you would with Python 2's dict.items()) if the dictionary changes.
Think of the Python 2 dict.items() as a type of copy of the dict -- the way it was when the copy was made.
Think of Python 3 dict.items() or the Python 2 equivalent of dict.viewitems() as an up-to-date copy of the way the dict is now. (Same with .viewkeys(), .viewvalues() obviously.)
The Python 3.6 documentation has good examples of why and when you would use them.
Value views are not set-like, since dicts can have duplicate values. Key views are set-like, and items views are set-like for dicts with hashable values.
Note: in Python 3, views replace the lists that Python 2's .keys(), .values(), and .items() returned. Anyone relying on dict.keys() or dict.values() being a static snapshot of a dict's previous state may be in for a surprise.
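The same dynamism in Python 3, where the plain .keys()/.values()/.items() methods return views:
>>> d = {1: 'one'}
>>> keys = d.keys()
>>> keys
dict_keys([1])
>>> d[2] = 'two'
>>> keys
dict_keys([1, 2])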

Related

Why exactly are .values() and .keys() considered O(1)?

Couldn't track down a solid enough explanation of why dictionary methods such as .values() and .keys() are considered to be O(1) in big-O notation. (Not sure if .items() is also considered O(1).)
It is likely that the reference you found to .keys() and .values() (and .items()) being O(1) emphasized the performance because it is a contrast to Python 2, where those functions returned lists and required O(N) time to copy references to all the relevant objects out of the dictionary.
Iterating on the view objects returned by those methods in Python 3 will still take O(N) time, as there's no way to avoid visiting each item, since that's the whole point of iteration. The keys and items views do offer O(1) membership tests (e.g. (somekey, somevalue) in somedict.items()), which is a lot more efficient than searching for an item in a list.
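For a rough sense of the difference, here is a sketch using timeit (the dict size and repetition counts are arbitrary; absolute timings will vary by machine):
import timeit

d = {i: str(i) for i in range(100_000)}

# Hash-based membership test on the view: O(1) per lookup
print(timeit.timeit("(99_999, '99999') in d.items()", globals=globals(), number=10_000))

# Building a list first: an O(n) copy plus an O(n) linear scan per lookup
print(timeit.timeit("(99_999, '99999') in list(d.items())", globals=globals(), number=100))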
I am not versed in Python, but I found this:
Dictionary view objects
The objects returned by dict.keys(), dict.values() and
dict.items() are view objects. They provide a dynamic view on the
dictionary’s entries, which means that when the dictionary changes,
the view reflects these changes.
Dictionary views can be iterated over to yield their respective data,
[...]
Which means that dict.keys() and the like don't return a new collection, but just a thin wrapper that can iterate over the dictionary. So getting the view is O(1); iterating over its elements is not.

Why do `key in dict` and `key in dict.keys()` have the same output?

I tried to search for a key in a dictionary, but I forgot to call the keys() method. I still got the expected answer.
Why is the result the same for these two expressions?
key in dict
and
key in dict.keys()
To understand why key in dct returns the same result as key in dct.keys() one needs to look in the past. Historically in Python 2, one would test the existence of a key in dictionary dct with dct.has_key(key). This was changed for Python 2.2, when the preferred way became key in dct, which basically did the same thing:
In a minor related change, the in operator now works on dictionaries, so key in dict is now equivalent to dict.has_key(key)
The behaviour of in is implemented internally in terms of the __contains__ dunder method. Its behaviour is documented in the Python language reference - 3 Data Model:
object.__contains__(self, item)
Called to implement membership test operators. Should return true if item is in self, false otherwise. For mapping objects, this should consider the keys of the mapping rather than the values or the key-item pairs.
For objects that don’t define __contains__(), the membership test first tries iteration via __iter__(), then the old sequence iteration protocol via __getitem__(), see this section in the language reference.
(emphasis mine; dictionaries in Python are mapping objects)
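As an illustration, here is a minimal made-up class that defines __contains__ and thereby decides for itself what in means:
class EvenNumbers:
    def __contains__(self, item):
        # Membership here means "item is an even integer"
        return isinstance(item, int) and item % 2 == 0

print(4 in EvenNumbers())  # True
print(3 in EvenNumbers())  # False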
In Python 3, the has_key method was removed altogether, and now the correct way to test for the existence of a key is solely key in dct, as documented.
In contrast with the two above, key in dct.keys() has never been the correct way of testing whether a key exists in a dictionary.
The result of both your examples is indeed the same, however key in dct.keys() is slightly slower on Python 3 and is abysmally slow on Python 2.
key in dct returns True if the key is found as a key in dct, in an almost-constant-time operation: it does not matter whether there are two keys or a million, because its time complexity is constant in the average case (O(1)).
dct.keys() creates a list of all keys in Python 2, and a view of the keys in Python 3; both of these objects understand key in x. With Python 2 it works like for any iterable: the values are iterated over, and True is returned as soon as one value is equal to the given value (here key).
In practice, in Python 2 you'd find key in dct.keys() much slower than key in dct: key in dct.keys() scales linearly with the number of keys, i.e. its time complexity is O(n), because both dct.keys(), which builds a list of all keys, and key in key_list are O(n) operations.
In Python 3, key in dct.keys() won't be much slower than key in dct, as the view does not build a list of the keys and the access is still O(1). In practice, however, it will be slower by at least a constant factor, and it is 7 more characters, so there is usually no reason to use it, even on Python 3.
The Python data model dictates that membership tests are normally implemented as an iteration through a sequence, unless a container object supplies the special method __contains__.
As mentioned further in the document, for objects that do not implement the __contains__ special method, the membership test first tries iteration via __iter__(), then the old sequence iteration protocol via __getitem__().
It's important to know that dict.keys() returns a dictionary view in Python 3.x and a sequence (more precisely, a list) in Python 2.x. A membership test on a sequence/list has O(n) complexity, whereas for a dictionary-like object implemented as a hash map, or for a dictionary view, which supports operations like membership testing and iteration, it has O(1) complexity.
So for Python 2.x there is a distinct difference in what the two do, which might impact performance, whereas for Python 3.x the only overhead is an extra function call.
In any case, it is preferable to use the membership test directly on the dict object, rather than on the dictionary view or the sequence returned by dict.keys().
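To see that the Python 3 overhead really is just a small constant, a quick sketch (timings illustrative only):
import timeit

d = {i: i for i in range(1_000)}

print(timeit.timeit('500 in d', globals=globals(), number=1_000_000))
print(timeit.timeit('500 in d.keys()', globals=globals(), number=1_000_000))  # extra method call and view creation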

Sampling dictionaries in Python 3.x

In Python 3, dict_values, dict_keys and dict_items do not support indexing
my_dict = {'a': 0, 'b': 1, 'c': 2}
All of the queries below fail for that reason:
my_dict.keys()[1]
my_dict.values()[1]
my_dict.items()[1]
Sometimes I just want to get a random sample of what's in my dictionary. I know I can convert their output to lists. Do they have any other getter methods that do not require creating another data structure? (I would also imagine that converting them to a list would create a copy, which may not work well for huge dictionaries.)
The view types are documented under Dictionary view objects, and are also guaranteed to be subclasses of collections.abc.KeysView and friends. Basically, this means you can only count on them having __contains__, __iter__, and __len__.
They don't directly support indexing because their ordering can be invalidated.* But practically, in any implementation of Python, they're only actually invalidated if you mutate the dictionary. Which means you can safely do things like this:
next(itertools.islice(my_dict.keys(), i, None))
Basically, the same way you'd index a set, or any other non-iterator iterable.
* The actual rules as to what behavior is documented have changed a few times. The current version actually says "They provide a dynamic view on the dictionary’s entries, which means that when the dictionary changes, the view reflects these changes," which implies the practical rule can now be relied on. But even if you're using an older version that, e.g., explicitly only guarantees consistency between adjacent calls to keys, values, items, and related functions, unless you're worried about someone writing a new implementation of Python 2.6 or 3.1 or something, there's no reason to worry about that.
Of course you probably want to wrap that up in a function that's more readable. In fact, I'd do it in two steps. First, use the nth function from the itertools recipes:
import itertools

def nth(iterable, n, default=None):
    return next(itertools.islice(iterable, n, None), default)
Then wrap up the key indexing:
def getkey(mapping, index, default=None):
    return nth(mapping.keys(), index, default)
What if you want a random sample? Well, dictionary views are Sized, as are dictionaries themselves, so you can always use randrange:
import random

def choosekey(mapping):
    return getkey(mapping, random.randrange(len(mapping)))
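Usage looks like this (assuming Python 3.7+, where iteration order is guaranteed insertion order; the randomly chosen key will of course vary):
>>> my_dict = {'a': 0, 'b': 1, 'c': 2}
>>> getkey(my_dict, 1)
'b'
>>> choosekey(my_dict)
'c'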
If you just want a key, value or item, use next() and iter():
next(iter(my_dict))
next(iter(my_dict.values()))
next(iter(my_dict.items()))

Why does Python's dict.keys() return a list and not a set?

I would've expected Python's keys method to return a set instead of a list, since it most closely resembles the guarantees that the keys of a hashmap give: they are unique and unsorted, like a set. However, this method returns a list:
>>> d = {}
>>> d.keys().__class__
<type 'list'>
Is this just a mistake in the Python API or is there some other reason I am missing?
One reason is that dict.keys() predates the introduction of sets into the language.
Note that the return type of dict.keys() has changed in Python 3: the function now returns a "set-like" view rather than a list.
For set-like views, all of the operations defined for the abstract base class collections.abc.Set are available (for example, ==, <, or ^).
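For example:
>>> a = {'x': 1, 'y': 2}
>>> b = {'y': 3, 'z': 4}
>>> a.keys() ^ b.keys()   # symmetric difference (set display order may vary)
{'x', 'z'}
>>> a.keys() < (a.keys() | b.keys())   # proper-subset test
True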
In Python 2, it's less efficient to construct a set than a list.
You also don't want to assume that the user of the return value will only want to search within the result; iteration is just as likely.
In Python 3, it's no longer a list. It's a set-like view, and it's an ordered iterable because the dict's insertion order is guaranteed (as of Python 3.7).

Hashing an immutable dictionary in Python

Short version: What's the best hashing algorithm for a multiset implemented as a dictionary of unordered items?
I'm trying to hash an immutable multiset (which is a bag or multiset in other languages: like a mathematical set except that it can hold more than one of each element) implemented as a dictionary. I've created a subclass of the standard library class collections.Counter, similar to the advice here: Python hashable dicts, which recommends a hash function like so:
import collections

class FrozenCounter(collections.Counter):
    # ...
    def __hash__(self):
        return hash(tuple(sorted(self.items())))
Creating the full tuple of items takes up a lot of memory (relative to, say, using a generator) and hashing will occur in an extremely memory intensive part of my application. More importantly, my dictionary keys (multiset elements) probably won't be order-able.
I'm thinking of using this algorithm:
import functools

def __hash__(self):
    # XOR the hash of each (element, count) pair; XOR is commutative,
    # so the result is insensitive to iteration order.
    return functools.reduce(lambda a, b: a ^ hash(b), self.items(), 0)
I figure that using bitwise XOR means order doesn't matter for the hash value, unlike in the hashing of a tuple? I suppose I could semi-implement the Python tuple-hashing algorithm on the unordered stream of tuples of my data. See https://github.com/jonashaag/cpython/blob/master/Include/tupleobject.h (search the page for the word 'hash') -- but I barely know enough C to read it.
Thoughts? Suggestions? Thanks.
(If you're wondering why I'm messing around with trying to hash a multiset: The input data for my problem are sets of multisets, and within each set of multisets, each multiset must be unique. I'm working on a deadline and I'm not an experienced coder, so I wanted to avoid inventing new algorithms where possible. It seems like the most Pythonic way to make sure I have a unique bunch of things is to put them in a set(), but the things must be hashable.)
What I've gathered from the comments
Both @marcin and @senderle gave pretty much the same answer: use hash(frozenset(self.items())). This makes sense because items() "views" are set-like. @marcin was first, but I gave the check mark to @senderle because of the good research on the big-O running times of the different solutions. @marcin also reminds me to include an __eq__ method -- but the one inherited from dict will work just fine. This is how I'm implementing everything -- further comments and suggestions based on this code are welcome:
import collections

class FrozenCounter(collections.Counter):
    # Edit: A previous version of this code included a __slots__ definition.
    # But, from the Python documentation: "When inheriting from a class without
    # __slots__, the __dict__ attribute of that class will always be accessible,
    # so a __slots__ definition in the subclass is meaningless."
    # http://docs.python.org/py3k/reference/datamodel.html#notes-on-using-slots
    # ...
    def __hash__(self):
        "Implements hash(self) -> int"
        if not hasattr(self, '_hash'):
            self._hash = hash(frozenset(self.items()))
        return self._hash
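For example, with the class above (the __eq__ inherited from dict is what lets a set deduplicate the two equal counters):
>>> fc1 = FrozenCounter([1, 1, 2])
>>> fc2 = FrozenCounter([2, 1, 1])
>>> hash(fc1) == hash(fc2)
True
>>> len({fc1, fc2})
1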
Since the dictionary is immutable, you can create the hash when the dictionary is created and return it directly. My suggestion would be to create a frozenset from items (in 3+; iteritems in 2.7), hash it, and store the hash.
To provide an explicit example:
>>> frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems())
frozenset([(3, 2), (1, 3), (4, 1), (2, 1)])
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 3, 4]).iteritems()))
-3071743570178645657
>>> hash(frozenset(Counter([1, 1, 1, 2, 3, 4]).iteritems()))
-6559486438209652990
To clarify why I prefer a frozenset to a tuple of sorted items: a frozenset doesn't have to sort the items, and so the initial hash completes in O(n) time rather than O(n log n) time. This can be seen from the frozenset_hash and set_next implementations.
See also this great answer from Raymond Hettinger describing his implementation of the frozenset hash function. There he explicitly explains how the hash function avoids having to sort values to get a stable, order insensitive value.
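A quick demonstration of that order-insensitivity:
>>> hash(frozenset([('a', 3), ('b', 1)])) == hash(frozenset([('b', 1), ('a', 3)]))
True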
Have you considered hash(tuple(sorted(hash(x) for x in self.items())))? That way, you are only sorting integers rather than full item tuples.
You could also XOR the element hashes together, but frankly I don't know how well that would work (would you have a lot of collisions?). Speaking of collisions, don't you have to implement the __eq__ method?
Alternatively, similar to my answer here, hash(frozenset(self.items())).
