I'm a bit confused about what can/can't be used as a key for a python dict.
dicked = {}
dicked[None] = 'foo' # None ok
dicked[(1,3)] = 'baz' # tuple ok
import sys
dicked[sys] = 'bar' # wow, even a module is ok !
dicked[(1,[3])] = 'qux' # oops, not allowed
So a tuple is an immutable type but if I hide a list inside of it, then it can't be a key.. couldn't I just as easily hide a list inside a module?
I had some vague idea that the key has to be "hashable" but I'm just going to admit my own ignorance about the technical details; I don't know what's really going on here. What would go wrong if you tried to use lists as keys, with the hash as, say, their memory location?
There's a good article on the topic in the Python wiki: Why Lists Can't Be Dictionary Keys. As explained there:
What would go wrong if you tried to use lists as keys, with the hash as, say, their memory location?
It can be done without really breaking any of the requirements, but it leads to unexpected behavior. Lists are generally treated as if their value is derived from the values of their contents, for instance when checking (in-)equality. Many would - understandably - expect that you can use any list [1, 2] to look up the same key, whereas in fact you would have to keep around exactly the same list object. Lookup by value breaks as soon as a list used as a key is modified, and lookup by identity requires you to keep around exactly the same list - which isn't required for any other common list operation (at least none I can think of).
Other objects such as modules and plain object instances make a much bigger deal out of their object identity anyway (when was the last time you had two distinct module objects called sys?), and are compared by identity. Therefore, it's less surprising - or even expected - that they also compare by identity when used as dict keys.
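To make that concrete, here is a minimal sketch using a hypothetical list subclass (plain lists are not hashable, so the subclass below stands in for "hash by memory location"):

class IdList(list):
    def __hash__(self):
        return id(self)          # hash by identity; equality is still by content

d = {}
key = IdList([1, 2])
d[key] = 'found'
d[key]              # works, but only because it is the exact same object
d[IdList([1, 2])]   # KeyError: equal content, but a different identity and therefore a different hash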
Why can't I use a list as a dict key in python?
>>> d = {repr([1,2,3]): 'value'}
>>> d
{'[1, 2, 3]': 'value'}
(For anybody who stumbles on this question looking for a way around it:)
As explained by others here, you indeed cannot. You can however use the list's string representation instead, if you really want to key on your list.
Just found that you can convert the list into a tuple, then use it as a key.
d = {tuple([1,2,3]): 'value'}
The issue is that tuples are immutable, and lists are not. Consider the following
d = {}
li = [1,2,3]
d[li] = 5
li.append(4)
What should d[li] return? Is it the same list? What about d[[1,2,3]]? It has the same values, but is a different list.
Ultimately, there is no satisfactory answer. For example, if the only key that works is the original key, then if you have no reference to that key, you can never again access the value. With every other allowed key, you can construct a key without a reference to the original.
If both of my suggestions work, then you have very different keys that return the same value, which is more than a little surprising. If only the original contents work, then your key will quickly go bad, since lists are made to be modified.
Here's an answer http://wiki.python.org/moin/DictionaryKeys
What would go wrong if you tried to use lists as keys, with the hash as, say, their memory location?
Looking up different lists with the same contents would produce different results, even though comparing lists with the same contents would indicate them as equivalent.
What about Using a list literal in a dictionary lookup?
Because lists are mutable: dict keys (and set members) need to be hashable, and hashing mutable objects is a bad idea because a hash value computed from an instance's attributes would change whenever those attributes change.
In this answer, I will give some concrete examples, hopefully adding value on top of the existing answers. Every insight applies to the elements of the set datastructure as well.
Example 1: hashing a mutable object where the hash value is based on a mutable characteristic of the object.
>>> class stupidlist(list):
...     def __hash__(self):
...         return len(self)
...
>>> stupid = stupidlist([1, 2, 3])
>>> d = {stupid: 0}
>>> stupid.append(4)
>>> stupid
[1, 2, 3, 4]
>>> d
{[1, 2, 3, 4]: 0}
>>> stupid in d
False
>>> stupid in d.keys()
False
>>> stupid in list(d.keys())
True
After mutating stupid, it cannot be found in the dict any longer because the hash changed. Only a linear scan over the list of the dict's keys finds stupid.
Example 2: ... but why not just a constant hash value?
>>> class stupidlist2(list):
...     def __hash__(self):
...         return id(self)
...
>>> stupidA = stupidlist2([1, 2, 3])
>>> stupidB = stupidlist2([1, 2, 3])
>>>
>>> stupidA == stupidB
True
>>> stupidA in {stupidB: 0}
False
That's not a good idea either, because equal objects should hash identically so that you can find them in a dict or set.
Example 3: ... ok, what about constant hashes across all instances?!
>>> class stupidlist3(list):
...     def __hash__(self):
...         return 1
...
>>> stupidC = stupidlist3([1, 2, 3])
>>> stupidD = stupidlist3([1, 2, 3])
>>> stupidE = stupidlist3([1, 2, 3, 4])
>>>
>>> stupidC in {stupidD: 0}
True
>>> stupidC in {stupidE: 0}
False
>>> d = {stupidC: 0}
>>> stupidC.append(5)
>>> stupidC in d
True
Things seem to work as expected, but think about what's happening: when all instances of your class produce the same hash value, you will have a hash collision whenever two or more instances are used as keys in a dict or are present in a set.
Finding the right instance with my_dict[key] or key in my_dict (or item in my_set) needs to perform as many equality checks as there are instances of stupidlist3 in the dict's keys (in the worst case). At this point, the purpose of the dictionary - O(1) lookup - is completely defeated. This is demonstrated in the following timings (done with IPython).
Some Timings for Example 3
>>> lists_list = [[i] for i in range(1000)]
>>> stupidlists_set = {stupidlist3([i]) for i in range(1000)}
>>> tuples_set = {(i,) for i in range(1000)}
>>> l = [999]
>>> s = stupidlist3([999])
>>> t = (999,)
>>>
>>> %timeit l in lists_list
25.5 µs ± 442 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit s in stupidlists_set
38.5 µs ± 61.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit t in tuples_set
77.6 ns ± 1.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
As you can see, the membership test in our stupidlists_set is even slower than a linear scan over the whole lists_list, while you have the expected super fast lookup time (factor 500) in a set without loads of hash collisions.
TL;DR: you can use tuple(yourlist) as a dict key, because tuples are immutable and hashable.
The simple answer to your question is that the list class does not implement the __hash__ method, which is required for any object that is to be used as a key in a dictionary. The reason __hash__ is not implemented the same way it is in, say, the tuple class (based on the contents of the container) is that a list is mutable: editing the list would require the hash to be recalculated, which may mean the list is now located in the wrong bucket within the underlying hash table. Note that since you cannot modify a tuple (it is immutable), it doesn't run into this problem.
As a side note, the actual implementation of the dict object's lookup is based on Algorithm D from Knuth Vol. 3, Sec. 6.4. If you have that book available, it might be a worthwhile read. In addition, if you're really, really interested, you may like to take a peek at the developer comments on the actual implementation of dictobject here; they go into great detail about exactly how it works. There is also a Python lecture on the implementation of dictionaries which you may be interested in; it goes through the definition of a key and what a hash is in the first few minutes.
Your answer can be found here:
Why Lists Can't Be Dictionary Keys
Newcomers to Python often wonder why, while the language includes both
a tuple and a list type, tuples are usable as dictionary keys, while
lists are not. This was a deliberate design decision, and can best be
explained by first understanding how Python dictionaries work.
Source & more info: http://wiki.python.org/moin/DictionaryKeys
A dictionary is a hash map: it stores a mapping from your keys to your values by converting each key into a hash and mapping that hash to the value.
Something like (pseudo code):
{key : val}
hash(key) -> val
If you are wondering which options can be used as a key for your dictionary: anything that is hashable (i.e. it can be converted to a hash and holds a static value, in other words it is immutable, so the hashed key stays valid) is eligible. Lists and sets, however, can change on the go, so their hash(key) would also need to change just to stay in sync with the list or set - which is exactly what a key must not do.
You can try:
hash(<your key here>)
If it works fine, it can be used as a key for your dictionary; otherwise, convert it to something hashable.
In short (a quick sketch follows):
Convert the list with tuple(<your list>).
Convert the list with str(<your list>).
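For example (a quick sketch of the two conversions just mentioned; the names are only illustrative):

my_list = [1, 2, 3]
d = {}
d[tuple(my_list)] = 'via tuple'   # works as long as every element is itself hashable
d[str(my_list)] = 'via str'       # always works, but the key is now the text '[1, 2, 3]'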
Simply put, dict keys need to be immutable (to be exact, hashable), and lists are mutable (to be exact, lists do not provide a __hash__ method).
Here an immutable object (unchangeable object) is an object whose state cannot be modified after it is created. This is in contrast to a mutable object (changeable object), which can be modified after it is created.
According to the Python 2.7.2 documentation:
An object is hashable if it has a hash value which never changes
during its lifetime (it needs a __hash__() method), and can be
compared to other objects (it needs an __eq__() or __cmp__() method).
Hashable objects which compare equal must have the same hash value.
Hashability makes an object usable as a dictionary key and a set
member, because these data structures use the hash value internally.
All of Python’s immutable built-in objects are hashable, while no
mutable containers (such as lists or dictionaries) are. Objects which
are instances of user-defined classes are hashable by default; they
all compare unequal, and their hash value is their id().
A tuple is immutable in the sense that you cannot add, remove or replace its elements, but the elements themselves may be mutable. A list's hash value would have to depend on the hash values of its elements, and so it would change whenever you change the elements.
Using id's for list hashes would imply that all lists compare differently, which would be surprising and inconvenient.
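For example, a tuple that holds a list is itself unhashable, because a content-based hash could not stay stable:

>>> hash((1, 2, 3)) == hash((1, 2, 3))   # any equal tuple hashes the same
True
>>> hash((1, [2, 3]))                    # a tuple holding a list is unhashable
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'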
Related
Recently I have been learning more about hashing in Python and I came around this blog where it states that:
Suppose a Python program has 2 lists. If we need to think about
comparing those two lists, what will you do? Compare each element?
Well, that sounds fool-proof but slow as well!
Python has a much smarter way to do this. When a tuple is constructed
in a program, Python Interpreter calculates its hash in its memory. If
a comparison occurs between 2 tuples, it simply compares the hash
values and it knows if they are equal!
So I am really confused about these statements.
First when we do:
[1, 2, 3] == [1, 2, 3], how does this equality check work? Does it calculate the hash values and then compare them?
Second, what's the difference between:
[1, 2, 3] == [1, 2, 3] and (1, 2, 3) == (1, 2, 3)?
Because when I tried to measure the execution time with timeit, I got this result:
$ python3.5 -m timeit '[1, 2, 3] == [1, 2, 3]'
10000000 loops, best of 3: 0.14 usec per loop
$ python3.5 -m timeit '(1, 2, 3) == (1, 2, 3)'
10000000 loops, best of 3: 0.0301 usec per loop
So why is there a difference in time, from 0.14 µs for the list down to 0.03 µs for the tuple? Why is the tuple comparison faster than the list comparison?
Well, part of your confusion is that the blog post you're reading is just wrong. About multiple things. Try to forget that you ever read it (except to remember the site and the author's name so you'll know to avoid them in the future).
It is true that tuples are hashable and lists are not, but that isn't relevant to their equality-testing functions. And it's certainly not true that "it simply compares the hash values and it knows if they are equal!" Hash collisions happen, and ignoring them would lead to horrible bugs, and fortunately Python's developers are not that stupid. In fact, it's not even true that Python computes the hash value at initialization time.*
There actually is one significant difference between tuples and lists (in CPython, as of 3.6), but it usually doesn't make much difference: Lists do an extra check for unequal length at the beginning as an optimization, but the same check turned out to be a pessimization for tuples,** so it was removed from there.
Another, often much more important, difference is that tuple literals in your source get compiled into constant values, and separate copies of the same tuple literal get folded into the same constant object; that doesn't happen with lists, for obvious reasons.
In fact, that's what you're really testing with your timeit. On my laptop, comparing the tuples takes 95ns, while comparing the lists takes 169ns—but breaking it down, that's actually 93ns for the comparison, plus an extra 38ns to create each list. To make it a fair comparison, you have to move the creation to a setup step, and then compare already-existing values inside the loop. (Or, of course, you may not want to be fair—you're discovering the useful fact that every time you use a tuple constant instead of creating a new list, you're saving a significant fraction of a microsecond.)
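A fairer benchmark (a sketch; the absolute numbers will differ per machine and Python version) moves construction into timeit's setup step so that only the comparison itself is measured:

$ python3 -m timeit -s 'a = [1, 2, 3]; b = [1, 2, 3]' 'a == b'
$ python3 -m timeit -s 'a = (1, 2, 3); b = (1, 2, 3)' 'a == b'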
Other than that, they basically do the same thing. Translating the C source into Python-like pseudocode (and removing all the error handling, and the stuff that makes the same function work for <, and so on):
for i in range(min(len(v), len(w))):
    if v[i] != w[i]:
        break
else:
    return len(v) == len(w)
return False
The list equivalent is like this:
if len(v) != len(w):
    return False
for i in range(min(len(v), len(w))):
    if v[i] != w[i]:
        break
else:
    return True
return False
* In fact, unlike strings, tuples don't even cache their hashes; if you call hash over and over, it'll keep re-computing it. See issue 9685, where a patch to change that was rejected because it slowed down some benchmarks and didn't speed up anything anyone could find.
** Not because of anything inherent about the implementation, but because people often compare lists of different lengths, but rarely do with tuples.
The answer was given in that article too :)
Here is a demonstration:
>>> l1=[1,2,3]
>>> l2=[1,2,3]
>>>
>>> hash(l1) #since list is not hashable
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> t=(1,2,3)
>>> t2=(1,2,3)
>>> hash(t)
2528502973977326415
>>> hash(t2)
2528502973977326415
>>>
In the above, calling hash() on a list raises a TypeError because lists are not hashable; to test two lists for equality, Python compares their contents element by element, which takes more time.
In the case of tuples, the article claims the hash value is calculated when the tuple is constructed, and two equal tuples have the same hash value, so Python would only need to compare the hash values of the tuples, which would make it much faster than the list comparison.
From the given article:
Python has a much smarter way to do this. When a tuple is constructed
in a program, Python Interpreter calculates its hash in its memory. If
a comparison occurs between 2 tuples, it simply compares the hash
values and it knows if they are equal!
With Python 3:
>>> from collections import OrderedDict
>>> d1 = OrderedDict([('foo', 'bar')])
>>> d2 = OrderedDict([('foo', 'bar')])
I wanted to check for equality:
>>> d1 == d2
True
>>> d1.keys() == d2.keys()
True
But:
>>> d1.values() == d2.values()
False
Do you know why values are not equal?
I've tested this with Python 3.4 and 3.5.
Following this question, I posted on the Python-Ideas mailing list to get additional details:
https://mail.python.org/pipermail/python-ideas/2015-December/037472.html
In Python 3, dict.keys() and dict.values() return special iterable classes - respectively a collections.abc.KeysView and a collections.abc.ValuesView. The first one inherits its __eq__ method from Set, the second uses the default object.__eq__, which tests on object identity.
In python3, d1.values() and d2.values() are collections.abc.ValuesView objects:
>>> d1.values()
ValuesView(OrderedDict([('foo', 'bar')]))
Don't compare them as an object, convert them to lists and then compare them:
>>> list(d1.values()) == list(d2.values())
True
Investigating why it works for comparing keys: in _collections_abc.py of CPython, KeysView inherits from Set while ValuesView does not:
class KeysView(MappingView, Set):
class ValuesView(MappingView):
Tracing for __eq__ in ValuesView and its parents:
MappingView ==> Sized ==> ABCMeta ==> type ==> object.
__eq__ is implemented only in object and not overridden.
On the other hand, KeysView inherits __eq__ directly from Set.
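You can confirm this from the outside without reading the source (a quick sketch):

>>> from collections import abc
>>> d = {'foo': 'bar'}
>>> isinstance(d.keys(), abc.Set)     # keys views behave like sets
True
>>> isinstance(d.values(), abc.Set)   # values views do not
False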
Unfortunately, both current answers don't address why this is but focus on how this is done. That mailing list discussion was amazing, so I'll sum things up:
For odict.keys/dict.keys and odict.items/dict.items:
odict.keys (subclass of dict.keys) supports comparison due to its conformance to collections.abc.Set (it's a set-like object). This is possible due to the fact that keys inside a dictionary (ordered or not) are guaranteed to be unique and hashable.
odict.items (subclass of dict.items) also supports comparison, for the same reason .keys does. An items view is allowed to do this since it raises the appropriate error if one of the items (specifically, the second element representing the value) is not hashable; uniqueness is guaranteed, though (due to keys being unique):
>>> od = OrderedDict({'a': []})
>>> set() & od.items()
TypeErrorTraceback (most recent call last)
<ipython-input-41-a5ec053d0eda> in <module>()
----> 1 set() & od.items()
TypeError: unhashable type: 'list'
For both of these views (keys and items), the comparison uses a simple function called all_contained_in (pretty readable) that uses the objects' __contains__ method to check membership of the elements in the views involved.
Now, about odict.values/dict.values:
As noticed, odict.values (subclass of dict.values [shocker]) doesn't compare like a set-like object. This is because the values of a values view cannot be represented as a set; the reasons are two-fold:
Most importantly, the view might contain duplicates which cannot be dropped.
The view might contain non-hashable objects (which, on its own, isn't sufficient reason not to treat the view as set-like).
As stated in a comment by @user2357112 and by @abarnett in the mailing list, odict.values/dict.values is a multiset, a generalization of sets that allows multiple instances of its elements.
Trying to compare these is not as trivial as comparing keys or items, due to the inherent duplication, the ordering, and the fact that you probably need to take into consideration the keys that correspond to those values. Should dict_values that look like this:
>>> {1:1, 2:1, 3:2}.values()
dict_values([1, 1, 2])
>>> {1:1, 2:1, 10:2}.values()
dict_values([1, 1, 2])
actually be equal even though the values correspond to different keys? Maybe? Maybe not? It isn't straightforward either way and would lead to inevitable confusion.
The point to be made, though, is that it isn't trivial to compare these as it is with keys and items. To sum up, here is another comment from @abarnett on the mailing list:
If you're thinking we could define what multisets should do, despite not having a standard multiset type or an ABC for them, and apply that to values views, the next question is how to do that in better than quadratic time for non-hashable values. (And you can't assume ordering here, either.) Would having a values view hang for 30 seconds and then come back with the answer you intuitively wanted instead of giving the wrong answer in 20 millis be an improvement? (Either way, you're going to learn the same lesson: don't compare values views. I'd rather learn that in 20 millis.)
Basically, I need to make a lookup table with non-consecutive integer IDs. I'm wondering whether, in terms of lookup speed, I'm generally better off using a dict with integer keys anyway, or using a very long list with a lot of empty indices. It seems to me that a list might still be faster, as Python should know exactly where to look, but I'm wondering if there are any backend processes with the dict to compensate and whether the additional memory requirements for those empty list slots would negate the (probably) more easily traversed list's speed gains. Are there any alternatives to lists and dicts that might be better suited to this?
I have seen this question, but it doesn't totally answer mine: Dictionary access speed comparison with integer key against string key
ETA: I'm implementing lookup tables like this twice in my program. One instance sees a max id of 5,000 with 70-100 objects populated; the other has a max id of 750 with 20-30 populated.
To answer your question about dict vs list you'd have to give exact information about the number of elements, the number of missing elements, etc., so that we could estimate the memory usage of the two data structures and try to predict and/or check their performance.
In general a dict of N key-value pairs requires quite a bit more memory than a list with N values:
The dict must keep track of the keys
The dict is never more than 2/3 full; when it would exceed that threshold, the allocated memory is doubled (this is required to keep operations on the dict at O(1) amortized time).
However there is an alternative to these data structures which should provide very good performance: blist. The blist package provides an interface that matches that of list, only it is implemented using B-trees. It can efficiently handle sparse lists. Most operations take either O(1) or O(log n) time, so they are quite efficient.
For example you could first create a sparse blist doing:
from blist import blist
seq = blist([None])
seq *= 2**30 # create a 2**30 element blist. Instantaneous!
And then you can set only the indices that have a value:
for i, value in zip(indices, values):
    seq[i] = value
The full documentation is here.
Note that blist provides other efficient operations such as:
Concatenating two blists takes O(log n) time
Taking an [i:j] slice takes O(log n) time
Inserting an element at a given index takes O(log n) operations
Popping an element (from every position) takes O(log n) operations
Since you gave some numbers, here's how they compare to dicts:
>>> from blist import blist
>>> b = blist([None])
>>> b *= 5000
>>> for i in range(100):b[i] = i
...
>>> b.__sizeof__()
2660
>>> d = dict()
>>> for i in range(100):d[i] = i
...
>>> d.__sizeof__()
6216
>>> b = blist([None])
>>> b *= 750
>>> for i in range(30):b[i] = i
...
>>> b.__sizeof__()
1580
>>> d = dict()
>>> for i in range(30):d[i] = i
...
>>> d.__sizeof__()
1608
In both cases a blist takes less memory (in the first case less than half the memory of the equivalent dict). Note that the memory taken by a blist also depends on whether or not the indices are contiguous (contiguous is better). However, even using random indices:
>>> b = blist([None])
>>> b *= 5000
>>> import random
>>> for i in range(100):b[random.randint(0, 4999)] = i
...
>>> b.__sizeof__()
2916
It's still much better than the dict.
Even lookup times are better:
In [1]: from blist import blist
...: import random
...:
In [2]: b = blist([None])
In [3]: b *= 5000
In [4]: for i in range(100):b[random.randint(0, 4999)] = i
In [5]: %timeit b[0]
10000000 loops, best of 3: 50.7 ns per loop
In [6]: d = dict()
In [7]: for i in range(100):d[random.randint(0, 4999)] = i
In [10]: %timeit d[1024] # 1024 is an existing key in this dictionary
10000000 loops, best of 3: 70.7 ns per loop
In [11]: %timeit b[1024]
10000000 loops, best of 3: 50.7 ns per loop
Note that a list takes about 47 ns to look up an index on my machine, so blist is really, really close to a plain list in terms of lookup performance on small lists like the ones you have.
Lists:
1. append and pop from the end of list are fast
2. insert and pop from the beginning of a list are slow (there is a heavy shift operation behind these two functions)
3. it is better to use collections.deque for the second case.
Dictionaries:
4. Access operations are faster compared to lists
Looping through dictionaries and lists:
Dictionaries use the iteritems() method (items() in Python 3) to retrieve the key and its corresponding value at the same time.
Lists use enumerate() for the same purpose (a short sketch follows after these notes).
Notes:
If your question is just about looping speed, there is no tangible difference between iteritems() and enumerate()
iteritems() is removed in Python 3.x.
Building pairs with zip() is a comparatively heavy operation; avoid it for this purpose.
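A short Python 3 sketch of the points above (items() replaces iteritems(), and collections.deque gives cheap operations at the left end):

from collections import deque

d = {'a': 1, 'b': 2}
for key, value in d.items():          # Python 3 replacement for iteritems()
    print(key, value)

lst = ['x', 'y', 'z']
for index, value in enumerate(lst):   # index and value at the same time
    print(index, value)

dq = deque(['x', 'y', 'z'])
dq.appendleft('w')                    # O(1); lst.insert(0, 'w') would shift every element
dq.popleft()                          # O(1) as well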
I think there is no general answer to this question. It depends on the distribution of the integer ids, the available memory and your performance requirements. The rules are:
a list lookup is faster, because you do not have to compute the hash of the key.
a dict may be more compact if the greatest key is large, because a list needs a slot for every index up to its largest one
if your largest key is very large (about 2^30) you will waste a lot of memory, and the system will start swapping, which greatly degrades performance
Here is what could be a rule of thumb :
if there are "few" empty values of if you know the largest key will be "reasonably" low (relative to the memory you accept to spend on that) => use a list
if the following requirement is not verified and you do not have strong performance requirement => use a dict
if none of the 2 preceding assumptions are true you will have to try some hash functions optimizations - I detail it below
The theory behind a dict is an array whose index is the result of a hash function applied to the key. Python's algorithm is well optimized, but it is a generalist one. If you know that your keys have a special distribution, you could try to find a hash function specially adapted to it. You can find pointers to go further in the Wikipedia article on hash functions or in the good old standard C library hash.
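If you want to check the trade-off for your own numbers, here is a rough sketch (exact sizes vary across Python versions and platforms, and getsizeof does not count the stored objects themselves):

import sys

max_id, populated = 5000, 100                 # the numbers given in the question
step = max_id // populated
sparse_list = [None] * (max_id + 1)
lookup_dict = {}
for i in range(populated):
    sparse_list[i * step] = object()
    lookup_dict[i * step] = object()

print(sys.getsizeof(sparse_list), sys.getsizeof(lookup_dict))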
EDIT: as #BrenBarn pointed out, the original didn't make sense.
Given a list of dicts (courtesy of csv.DictReader--they all have str keys and values) it'd be nice to remove duplicates by stuffing them all in a set, but this can't be done directly since dict isn't hashable. Some existing questions touch on how to fake __hash__() for sets/dicts but don't address which way should be preferred.
# i. concise but ugly round trip
filtered = [eval(x) for x in {repr(d) for d in pile_o_dicts}]
# ii. wordy but avoids round trip
filtered = []
keys = set()
for d in pile_o_dicts:
    key = str(d)
    if key not in keys:
        keys.add(key)
        filtered.append(d)
# iii. introducing another class for this seems Java-like?
filtered = {hashable_dict(x) for x in pile_o_dicts}
# iv. something else entirely
In the spirit of the Zen of Python what's the "obvious way to do it"?
Based on your example code, I take your question to be something slightly different from what you literally say. You don't actually want to override __hash__() -- you just want to filter out duplicates in linear time, right? So you need to ensure the following for each dictionary: 1) every key-value pair is represented, and 2) they are represented in a stable order. You could use a sorted tuple of key-value pairs, but instead, I would suggest using frozenset. frozensets are hashable, and they avoid the overhead of sorting, which should improve performance (as this answer seems to confirm). The downside is that they take up more memory than tuples, so there is a space/time tradeoff here.
Also, your code uses sets to do the filtering, but that doesn't make a lot of sense. There's no need for that ugly eval step if you use a dictionary:
filtered = {frozenset(d.iteritems()):d for d in pile_o_dicts}.values()
Or in Python 3, assuming you want a list rather than a dictionary view:
filtered = list({frozenset(d.items()):d for d in pile_o_dicts}.values())
These are both a bit clunky. For readability, consider breaking it into two lines:
dict_o_dicts = {frozenset(d.iteritems()):d for d in pile_o_dicts}
filtered = dict_o_dicts.values()
The alternative is an ordered tuple of tuples:
filtered = {tuple(sorted(d.iteritems())):d for d in pile_o_dicts}.values()
And a final note: don't use repr for this. Dictionaries that evaluate as equal can have different representations:
>>> d1 = {str(i):str(i) for i in range(300)}
>>> d2 = {str(i):str(i) for i in range(299, -1, -1)}
>>> d1 == d2
True
>>> repr(d1) == repr(d2)
False
The artfully named pile_o_dicts can be converted to a canonical form by sorting their items lists:
groups = {}
for d in pile_o_dicts:
    k = tuple(sorted(d.items()))
    groups.setdefault(k, []).append(d)
This will group identical dictionaries together.
FWIW, the technique of using sorted(d.items()) is currently used in the standard library for functools.lru_cache() in order to recognize function calls that have the same keyword arguments. IOW, this technique is tried and true :-)
If the dicts all have the same keys, you can use a namedtuple
>>> from collections import namedtuple
>>> nt = namedtuple('nt', pile_o_dicts[0])
>>> set(nt(**d) for d in pile_o_dicts)
I have a problem which requires a reversable 1:1 mapping of keys to values.
That means sometimes I want to find the value given a key, but at other times I want to find the key given the value. Both keys and values are guaranteed unique.
x = D[y]
y == D.inverse[x]
The obvious solution is to simply invert the dictionary every time I want a reverse lookup. Inverting a dictionary is very easy - there's a recipe here - but for a large dictionary it can be very slow.
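(For reference, the straightforward inversion is a one-liner, assuming the values are hashable; the problem is only that rebuilding it after every change is wasteful:)

inverse = {value: key for key, value in D.items()}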
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
So is there a better structure I can use?
My application requires that this should be very fast and use as little memory as possible.
The structure must be mutable, and it's strongly desirable that mutating the object should not cause it to be slower (e.g. to force a complete re-index)
We can guarantee that either the key or the value (or both) will be an integer
It's likely that the structure will be needed to store thousands or possibly millions of items.
Keys & values are guaranteed to be unique, i.e. len(set(x)) == len(x) for x in [D.keys(), D.values()]
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
Not really. Have you measured that? Since both dictionaries would use references to the same objects as keys and values, the memory spent would be just the dictionary structure. That's a lot less than twice, and it is a fixed amount regardless of your data size.
What I mean is that the actual data wouldn't be copied. So you'd spend little extra memory.
Example:
a = "some really really big text spending a lot of memory"
number_to_text = {1: a}
text_to_number = {a: 1}
Only a single copy of the "really big" string exists, so you end up spending just a little more memory. That's generally affordable.
I can't imagine a solution where you'd have the key-lookup speed when looking up by value without spending at least enough memory to store a reverse-lookup hash table (which is exactly what's being done in your "unite two dicts" solution).
class TwoWay:
    def __init__(self):
        self.d = {}
    def add(self, k, v):
        self.d[k] = v
        self.d[v] = k
    def remove(self, k):
        self.d.pop(self.d.pop(k))
    def get(self, k):
        return self.d[k]
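Usage looks like this (note that keys and values share one namespace, so a key must never collide with a value):

tw = TwoWay()
tw.add(1, 'one')
tw.get(1)        # 'one'
tw.get('one')    # 1
tw.remove(1)     # removes both directions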
The other alternative is to make a new class which unites two dictionaries, one for each > kind of lookup. That would most likely use up twice as much memory as a single dict.
Not really, since they would just be holding two references to the same data. In my mind, this is not a bad solution.
Have you considered an in-memory database lookup? I am not sure how it will compare in speed, but lookups in relational databases can be very fast.
Here is my own solution to this problem: http://github.com/spenthil/pymathmap/blob/master/pymathmap.py
The goal is to make it as transparent to the user as possible. The only introduced significant attribute is partner.
OneToOneDict subclasses dict - I know that isn't generally recommended, but I think I have the common use cases covered. The backend is pretty simple: it (dict1) keeps a weakref to a 'partner' OneToOneDict (dict2) which is its inverse. When dict1 is modified, dict2 is updated accordingly as well, and vice versa.
From the docstring:
>>> dict1 = OneToOneDict()
>>> dict2 = OneToOneDict()
>>> dict1.partner = dict2
>>> assert(dict1 is dict2.partner)
>>> assert(dict2 is dict1.partner)
>>> dict1['one'] = '1'
>>> dict2['2'] = '1'
>>> dict1['one'] = 'wow'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1['one'] = '1'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1.update({'three': '3', 'four': '4'})
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict3 = OneToOneDict({'4':'four'})
>>> assert(dict3.partner is None)
>>> assert(dict3 == {'4':'four'})
>>> dict1.partner = dict3
>>> assert(dict1.partner is not dict2)
>>> assert(dict2.partner is None)
>>> assert(dict1.partner is dict3)
>>> assert(dict3.partner is dict1)
>>> dict1.setdefault('five', '5')
>>> dict1['five']
'5'
>>> dict1.setdefault('five', '0')
>>> dict1['five']
'5'
When I get some free time, I intend to make a version that doesn't store things twice. No clue when that'll be though :)
Assuming that you have a key with which you look up a more complex mutable object, just make the key a property of that object. It does seem you might be better off thinking about the data model a bit.
"We can guarantee that either the key or the value (or both) will be an integer"
That's weirdly written -- "key or the value (or both)" doesn't feel right. Either they're all integers, or they're not all integers.
It sounds like they're all integers.
Or, it sounds like you're thinking of replacing the target object with an integer value so you only have one copy referenced by an integer. This is a false economy. Just keep the target object. All Python objects are -- in effect -- references. Very little actual copying gets done.
Let's pretend that you simply have two integers and can do a lookup on either one of the pair. One way to do this is to use heap queues or the bisect module to maintain ordered lists of integer key-value tuples.
See http://docs.python.org/library/heapq.html#module-heapq
See http://docs.python.org/library/bisect.html#module-bisect
You have one heapq of (key, value) tuples. Or, if your underlying object is more complex, (key, object) tuples.
You have another heapq of (value, key) tuples. Or, if your underlying object is more complex, (otherkey, object) tuples.
An "insert" becomes two inserts, one to each heapq-structured list.
A key lookup is in one queue; a value lookup is in the other queue. Do the lookups using bisect(list,item).
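A minimal sketch of that idea with bisect (pairs is a hypothetical stand-in for your data; only exact matches are handled):

import bisect

pairs = [(1, 100), (2, 200), (3, 300)]          # hypothetical (key, value) data
by_key = sorted(pairs)                           # ordered (key, value) tuples
by_value = sorted((v, k) for k, v in pairs)      # ordered (value, key) tuples

def value_for_key(key):
    i = bisect.bisect_left(by_key, (key,))
    if i < len(by_key) and by_key[i][0] == key:
        return by_key[i][1]
    raise KeyError(key)

def key_for_value(value):
    i = bisect.bisect_left(by_value, (value,))
    if i < len(by_value) and by_value[i][0] == value:
        return by_value[i][1]
    raise KeyError(value)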
It so happens that I find myself asking this question all the time (yesterday in particular). I agree with the approach of making two dictionaries. Do some benchmarking to see how much memory it's taking. I've never needed to make it mutable, but here's how I abstract it, if it's of any use:
class BiDict(list):
    def __init__(self, *pairs):
        super(BiDict, self).__init__()   # start empty; the pairs are appended below
        self._first_access = {}
        self._second_access = {}
        for pair in pairs:
            self._first_access[pair[0]] = pair[1]
            self._second_access[pair[1]] = pair[0]
            self.append(pair)
    def _get_by_first(self, key):
        return self._first_access[key]
    def _get_by_second(self, key):
        return self._second_access[key]
    # You'll have to do some overrides to make it mutable
    # Methods such as append, __add__, __del__, __iadd__
    # to name a few will have to maintain ._*_access

class Constants(BiDict):
    # An implementation expecting an integer and a string
    get_by_name = BiDict._get_by_second
    get_by_number = BiDict._get_by_first
t = Constants(
    (1, 'foo'),
    (5, 'bar'),
    (8, 'baz'),
)
>>> print t.get_by_number(5)
bar
>>> print t.get_by_name('baz')
8
>>> print t
[(1, 'foo'), (5, 'bar'), (8, 'baz')]
How about using sqlite? Just create a :memory: database with a two-column table. You can even add indexes, then query by either one. Wrap it in a class if it's something you're going to use a lot.
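A minimal sketch of that approach with the standard library's sqlite3 module (the table and column names are just illustrative):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE mapping (key INTEGER PRIMARY KEY, value TEXT UNIQUE)')
conn.executemany('INSERT INTO mapping VALUES (?, ?)', [(1, 'foo'), (5, 'bar')])

# lookups in either direction; the PRIMARY KEY and UNIQUE constraints index both columns
value, = conn.execute('SELECT value FROM mapping WHERE key = ?', (1,)).fetchone()
key, = conn.execute('SELECT key FROM mapping WHERE value = ?', ('bar',)).fetchone()
print(value, key)   # foo 5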