How does the hash table in a Python set work?

As far as I know, set in Python works via a hash table to achieve O(1) lookup complexity. Since it is a hash table, every entry in a set must be hashable (which usually means immutable).
So this piece of code raises an exception:
>>> {dict()}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
Because dict is not hashable. But we can create our own class inherited from dict and implement the __hash__ magic method. I created my own in this way:
>>> class D(dict):
...     def __hash__(self):
...         return 3
...
I know it will not behave properly, but I just wanted to experiment with it. So I checked that I can now use this type in a set:
>>> {D()}
{{}}
>>> {D(name='ali')}
{{'name': 'ali'}}
So far so good, but I thought that this way of implementing the __hash__ magic method would screw up lookups in the set, because every instance of D has the same hash value.
>>> d1 = D(n=1)
>>> d2 = D(n=2)
>>> hash(d1), hash(d2)
(3, 3)
>>>
>>> {d1, d2}
{{'n': 2}, {'n': 1}}
But the surprise for me was this:
>>> d3 = D()
>>> d3 in {d1, d2}
False
I expected the result to be True, because the hash of d3 is 3 and there are already values in our set with the same hash value. How does the set work internally?

To be usable in sets and dicts, a __hash__ method must guarantee that if x == y, then hash(x) == hash(y). But that's a one-sided implication. It's not at all required that if hash(x) == hash(y) then x == y must be true. Indeed, that's impossible to achieve in general: for example, there are an unbounded number of distinct Python ints but only a finite number of hash codes, so there must be distinct ints that share the same hash value.
That your hashes are all the same is fine. They only tell the set/dict where to start looking. All objects in the container with the same hash are then compared, one by one, for equality, until success, or until all such objects have been tried without success.
However, while making all hashes the same doesn't hurt correctness, it's a disaster for performance: it effectively turns the set/dict into an exceptionally slow way to do an O(n) linear search.
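You can watch the equality fallback happen in the interpreter. Here is a quick sketch using the D class from the question: d3 has the right hash, but it compares unequal to every member, while a fresh D(n=1) is found because it compares equal to d1:
>>> d1, d2, d3 = D(n=1), D(n=2), D()
>>> s = {d1, d2}
>>> d3 in s       # hash matches (3), but d3 == d1 and d3 == d2 are both False
False
>>> D(n=1) in s   # a distinct object, but equal to d1, so the lookup succeeds
True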

Related

Why does {}.values() == {}.values() return False?

With Python 3:
>>> from collections import OrderedDict
>>> d1 = OrderedDict([('foo', 'bar')])
>>> d2 = OrderedDict([('foo', 'bar')])
I wanted to check for equality:
>>> d1 == d2
True
>>> d1.keys() == d2.keys()
True
But:
>>> d1.values() == d2.values()
False
Do you know why values are not equal?
I've tested this with Python 3.4 and 3.5.
Following this question, I posted on the Python-Ideas mailing list to have additional details:
https://mail.python.org/pipermail/python-ideas/2015-December/037472.html
In Python 3, dict.keys() and dict.values() return special view classes - respectively a collections.abc.KeysView and a collections.abc.ValuesView. The first inherits its __eq__ method from collections.abc.Set; the second uses the default object.__eq__, which tests object identity.
In Python 3, d1.values() and d2.values() are collections.abc.ValuesView objects:
>>> d1.values()
ValuesView(OrderedDict([('foo', 'bar')]))
Don't compare the view objects directly; convert them to lists and then compare those:
>>> list(d1.values()) == list(d2.values())
True
Investigating why comparing keys works: in CPython's _collections_abc.py, KeysView inherits from Set while ValuesView does not:
class KeysView(MappingView, Set):
class ValuesView(MappingView):
Tracing for __eq__ in ValuesView and its parents:
MappingView ==> Sized ==> ABCMeta ==> type ==> object.
__eq__ is implemented only in object and not overridden.
On the other hand, KeysView inherits __eq__ directly from Set.
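You can verify this method resolution directly in the interpreter (a quick check of CPython's collections.abc, not part of the original answer):
>>> from collections.abc import Set, KeysView, ValuesView
>>> KeysView.__eq__ is Set.__eq__
True
>>> ValuesView.__eq__ is object.__eq__
True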
Unfortunately, neither of the current answers addresses why this is; they focus on how it is done. That mailing list discussion was amazing, so I'll sum things up:
For odict.keys/dict.keys and odict.items/dict.items:
odict.keys (subclass of dict.keys) supports comparison due to its conformance to collections.abc.Set (it's a set-like object). This is possible due to the fact that keys inside a dictionary (ordered or not) are guaranteed to be unique and hashable.
odict.items (subclass of dict.items) also supports comparison, for the same reason .keys does. An items view is allowed to behave like a set because it raises the appropriate error if one of its items (specifically, the second element, representing the value) is not hashable; uniqueness is guaranteed anyway, because keys are unique:
>>> od = OrderedDict({'a': []})
>>> set() & od.items()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
For both of these views (keys, items), the comparison uses a simple function called all_contained_in (pretty readable) that uses the objects' __contains__ method to check for membership of the elements in the views involved.
Now, about odict.values/dict.values:
As noticed, odict.values (subclass of dict.values [shocker]) doesn't compare like a set-like object. This is because the values of a values view cannot be represented as a set; the reasons are two-fold:
Most importantly, the view might contain duplicates which cannot be dropped.
The view might contain non-hashable objects (which, on its own, isn't sufficient reason not to treat the view as set-like).
As stated in a comment by @user2357112 and by @abarnett in the mailing list, odict.values/dict.values is a multiset, a generalization of sets that allows multiple instances of its elements.
Trying to compare these is not as trivial as comparing keys or items, due to the inherent duplication, the ordering, and the fact that you probably need to take into consideration the keys that correspond to those values. Should dict_values that look like this:
>>> {1:1, 2:1, 3:2}.values()
dict_values([1, 1, 2])
>>> {1:1, 2:1, 10:2}.values()
dict_values([1, 1, 2])
actually be equal, even though the keys those values correspond to aren't the same? Maybe? Maybe not? It isn't straightforward either way and would lead to inevitable confusion.
The point, though, is that it isn't trivial to compare values views the way it is for keys and items. To sum up, another comment from @abarnett on the mailing list:
If you're thinking we could define what multisets should do, despite not having a standard multiset type or an ABC for them, and apply that to values views, the next question is how to do that in better than quadratic time for non-hashable values. (And you can't assume ordering here, either.) Would having a values view hang for 30 seconds and then come back with the answer you intuitively wanted instead of giving the wrong answer in 20 millis be an improvement? (Either way, you're going to learn the same lesson: don't compare values views. I'd rather learn that in 20 millis.)
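If you do need to compare two values views, spell out the semantics yourself. A sketch of two common choices, assuming the values are hashable for the Counter variant:
>>> from collections import Counter
>>> a = {1: 1, 2: 1, 3: 2}.values()
>>> b = {1: 1, 2: 1, 10: 2}.values()
>>> list(a) == list(b)        # order-sensitive comparison
True
>>> Counter(a) == Counter(b)  # order-insensitive multiset comparison
True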

Why does the 'in' keyword claim it needs an iterable object?

>>> non_iterable = 1
>>> 5 in non_iterable
Traceback (most recent call last):
File "<input>", line 1, in <module>
TypeError: 'int' object is not iterable
>>> class also_non_iterable:
...     def __contains__(self, thing):
...         return True
...
>>> 5 in also_non_iterable()
True
>>> from collections.abc import Iterable
>>> isinstance(also_non_iterable(), Iterable)
False
Is there a reason in keyword claims to want an iterable object when what it truly wants is an object that implements __contains__?
It claims to want an iterable because, if the object's class does not implement __contains__, then in falls back to iterating through the object and checking whether any value it yields is equal to the left-hand operand.
An example to show that:
>>> class C:
...     def __iter__(self):
...         return iter([1, 2, 3, 4])
>>>
>>> c = C()
>>> 2 in c
True
>>> 5 in c
False
This is explained in the documentation -
For user-defined classes which define the __contains__() method, x in y is true if and only if y.__contains__(x) is true.
For user-defined classes which do not define __contains__() but do define __iter__(), x in y is true if some value z with x == z is produced while iterating over y. If an exception is raised during the iteration, it is as if in raised that exception.
Is there a reason in keyword claims to want an iterable object when what it truly wants is an object that implements __contains__?
x in thing and for x in thing are very closely related. Almost everything that supports x in thing follows the rule that x in thing is true if and only if a for loop over thing will find an element equal to x. In particular, if an object supports iteration but not __contains__, Python will use iteration as a fallback for in tests.
The error message could say it needs __contains__, but that would be about as wrong as the current message, since __contains__ isn't strictly necessary. It could say it needs a container, but it's not immediately clear what counts as a container. For example, dicts support in, but calling them containers is questionable. The current message, which says it needs an iterable, is about as accurate as the other options. Its advantage is that in practice, "is it iterable" is a better check than "is it a container" or "does it support __contains__" for determining whether actual objects support in.
There are two different uses of in:
testing whether a container has a value (the right-hand operand implements __contains__)
traversing a sequence (the right-hand operand of a for loop is iterable)
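For completeness, here is a small sketch (the class name is made up) of the last fallback in the chain, the old __getitem__ protocol mentioned in the documentation: with no __contains__ and no __iter__, in indexes the object from 0 upward until it finds a match or hits IndexError:
>>> class HasGetitem:
...     def __getitem__(self, i):
...         return [10, 20, 30][i]  # IndexError past the end stops the scan
...
>>> 20 in HasGetitem()
True
>>> 99 in HasGetitem()
False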

Checking for NaN presence in a container

NaN is handled perfectly when I check for its presence in a list or a set. But I don't understand how. [UPDATE: no it's not; it is reported as present if the identical instance of NaN is found; if only non-identical instances of NaN are found, it is reported as absent.]
I thought presence in a list is tested by equality, so I expected NaN to not be found since NaN != NaN.
hash(NaN) and hash(0) are both 0. How do dictionaries and sets tell NaN and 0 apart?
Is it safe to check for NaN presence in an arbitrary container using the in operator? Or is it implementation dependent?
My question is about Python 3.2.1; but if there are any changes existing/planned in future versions, I'd like to know that too.
NaN = float('nan')
print(NaN != NaN) # True
print(NaN == NaN) # False
list_ = [1, 2, NaN]
print(NaN in list_) # True; works fine but how?
set_ = {1, 2, NaN}
print(NaN in set_) # True; hash(NaN) is some fixed integer, so no surprise here
print(hash(0)) # 0
print(hash(NaN)) # 0
set_ = {1, 2, 0}
print(NaN in set_) # False; works fine, but how?
Note that if I add an instance of a user-defined class to a list, and then check for containment, the instance's __eq__ method is called (if defined) - at least in CPython. That's why I assumed that list containment is tested using operator ==.
EDIT:
Per Roman's answer, it would seem that __contains__ for list, tuple, set, dict behaves in a very strange way:
def __contains__(self, x):
    for element in self:
        if x is element:
            return True
        if x == element:
            return True
    return False
I say 'strange' because I didn't see it explained in the documentation (maybe I missed it), and I think this is something that shouldn't be left as an implementation choice.
Of course, one NaN object may not be identical (in the sense of id) to another NaN object. (This is not really surprising; Python doesn't guarantee such identity. In fact, I never saw CPython share an instance of NaN created in different places, even though it shares an instance of a small number or a short string.) This means that testing for NaN presence in a built-in container is not well-defined.
This is very dangerous, and very subtle. Someone might run the very code I showed above, and incorrectly conclude that it's safe to test for NaN membership using in.
I don't think there is a perfect workaround to this issue. One very safe approach is to ensure that NaNs are never added to built-in containers. (It's a pain to check for that all over the code...)
Another alternative is to watch out for cases where in might have NaN on the left side, and in such cases test for NaN membership separately, using math.isnan(). In addition, other operations (e.g., set intersection) also need to be avoided or rewritten.
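A minimal sketch of such a value-based check (the helper name is mine), which finds NaN regardless of identity:
import math

def contains_nan(container):
    # Compare by value with math.isnan(), not by identity or ==.
    return any(isinstance(x, float) and math.isnan(x) for x in container)

nan1, nan2 = float('nan'), float('nan')
print(nan2 in [nan1])        # False - a different NaN object
print(contains_nan([nan1]))  # True - found by value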
Question #1: why is NaN found in a container when it's an identical object.
From the documentation:
For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).
This is precisely what I observe with NaN, so everything is fine. Why this rule? I suspect it's because a dict/set wants to honestly report that it contains a certain object if that object is actually in it (even if __eq__() for whatever reason chooses to report that the object is not equal to itself).
Question #2: why is the hash value for NaN the same as for 0?
From the documentation:
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. hash() should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to somehow mix together (e.g. using exclusive or) the hash values for the components of the object that also play a part in comparison of objects.
Note that the requirement is only in one direction; objects that have the same hash do not have to be equal! At first I thought it's a typo, but then I realized that it's not. Hash collisions happen anyway, even with default __hash__() (see an excellent explanation here). The containers handle collisions without any problem. They do, of course, ultimately use the == operator to compare elements, hence they can easily end up with multiple values of NaN, as long as they are not identical! Try this:
>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> d = {}
>>> d[nan1] = 1
>>> d[nan2] = 2
>>> d[nan1]
1
>>> d[nan2]
2
So everything works as documented. But... it's very very dangerous! How many people knew that multiple values of NaN could live alongside each other in a dict? How many people would find this easy to debug?..
I would recommend making NaN an instance of a subclass of float that doesn't support hashing and hence cannot be accidentally added to a set/dict. I'll submit this to python-ideas.
Finally, I found a mistake in the documentation here:
For user-defined classes which do not define __contains__() but do define __iter__(), x in y is true if some value z with x == z is produced while iterating over y. If an exception is raised during the iteration, it is as if in raised that exception.
Lastly, the old-style iteration protocol is tried: if a class defines __getitem__(), x in y is true if and only if there is a non-negative integer index i such that x == y[i], and all lower integer indices do not raise IndexError. (If any other exception is raised, it is as if in raised that exception.)
You may notice that there is no mention of is here, unlike with built-in containers. I was surprised by this, so I tried:
>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> class Cont:
...     def __iter__(self):
...         yield nan1
...
>>> c = Cont()
>>> nan1 in c
True
>>> nan2 in c
False
As you can see, the identity is checked first, before == - consistent with the built-in containers. I'll submit a report to fix the docs.
I can't reproduce your tuple/set cases using float('nan') instead of NaN.
So I assume that it worked only because id(NaN) == id(NaN), i.e. there is no interning for NaN objects:
>>> NaN = float('NaN')
>>> id(NaN)
34373956456
>>> id(float('NaN'))
34373956480
And
>>> NaN is NaN
True
>>> NaN is float('NaN')
False
I believe tuple/set lookups have some optimization related to comparison of identical objects.
Answering your question: it seems to be unsafe to rely on the in operator when checking for the presence of NaN. I'd recommend using None, if possible.
Just a comment: __eq__ has nothing to do with the is operator, and during lookups the comparison of objects' ids seems to happen prior to any value comparison:
>>> class A(object):
...     def __eq__(*args):
...         print '__eq__'
...
>>> A() == A()
__eq__ # as expected
>>> A() is A()
False # `is` checks only ids
>>> A() in [A()]
__eq__ # as expected
False
>>> a = A()
>>> a in [a]
True # surprise!

NaNs as key in dictionaries

Can anyone explain the following behaviour to me?
>>> import numpy as np
>>> {np.nan: 5}[np.nan]
5
>>> {np.float64(np.nan): 5}[np.float64(np.nan)]
KeyError: nan
Why does it work in the first case, but not in the second?
Additionally, I found that the following DOES work:
>>> a = np.float64(np.nan)
>>> {a: 5}[a]
5
The problem here is that NaN is not equal to itself, as defined in the IEEE standard for floating point numbers:
>>> float("nan") == float("nan")
False
When a dictionary looks up a key, it roughly does this:
Compute the hash of the key to be looked up.
For each key in the dict with the same hash, check if it matches the key to be looked up. This check consists of
a. Checking for object identity: If the key in the dictionary and the key to be looked up are the same object as indicated by the is operator, the key was found.
b. If the identity check failed, check for equality using the __eq__ method.
The first example succeeds, since np.nan and np.nan are the same object, so it does not matter that they don't compare equal:
>>> numpy.nan is numpy.nan
True
In the second case, np.float64(np.nan) and np.float64(np.nan) are not the same object -- the two constructor calls create two distinct objects:
>>> numpy.float64(numpy.nan) is numpy.float64(numpy.nan)
False
Since the objects also do not compare equal, the dictionary concludes the key is not found and raises a KeyError.
You can even do this:
>>> a = float("nan")
>>> b = float("nan")
>>> {a: 1, b: 2}
{nan: 1, nan: 2}
In conclusion, it seems a saner idea to avoid NaN as a dictionary key.
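If NaN-like keys cannot be avoided entirely, one hedged workaround (the names here are illustrative, not a standard recipe) is to normalize every NaN to a single sentinel before it touches the dict:
import math

_NAN_KEY = object()  # one canonical stand-in for every NaN

def normalize_key(k):
    if isinstance(k, float) and math.isnan(k):
        return _NAN_KEY
    return k

d = {}
d[normalize_key(float('nan'))] = 1
d[normalize_key(float('nan'))] = 2     # overwrites, as you'd expect
print(d[normalize_key(float('nan'))])  # 2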
Please note this is not the case anymore in Python 3.6:
>>> d = float("nan") #object nan
>>> d
nan
>>> c = {"a": 3, d: 4}
>>> c["a"]
3
>>> c[d]
4
In this example c is a dictionary that contains the value 3 associated to the key "a" and the value 4 associated to the key NaN.
The way Python 3.6 looks up a key in a dictionary internally is as follows. First, it compares the two pointers that represent the underlying variables. If they point to the same object, the two objects are considered the same (well, technically we are comparing one object with itself). Otherwise, their hashes are compared; if the hashes differ, the two objects are considered different. Only if equality has still not been decided are the objects' comparison methods called (they are "manually" compared, so to speak).
This means that although IEEE754 specifies that NAN isn't equal to itself:
>>> d == d
False
When looking up a key in a dictionary, the underlying pointers of the variables are the first thing compared. Because they point to the same NaN object, the dictionary returns 4.
Note also that not all NaN objects are exactly the same:
>>> e = float("nan")
>>> e == d
False
>>> c[e]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: nan
>>> c[d]
4
So, to summarize: dictionaries prioritize performance by first checking whether the underlying objects are identical, with hash comparison and equality comparison as fallbacks. Moreover, not every NaN represents the same underlying object.
One has to be very careful when using NaNs as dictionary keys: adding such a key makes the underlying value impossible to reach unless you depend on the identity behaviour described here. This behaviour may change in the future (somewhat unlikely, but possible). Proceed with care.

A data-structure for 1:1 mappings in python?

I have a problem which requires a reversable 1:1 mapping of keys to values.
That means sometimes I want to find the value given a key, but at other times I want to find the key given the value. Both keys and values are guaranteed unique.
x = D[y]
y == D.inverse[x]
The obvious solution is to simply invert the dictionary every time I want a reverse lookup. Inverting a dictionary is very easy (there's a recipe here), but for a large dictionary it can be very slow.
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
So is there a better structure I can use?
My application requires that this should be very fast and use as little memory as possible.
The structure must be mutable, and it's strongly desirable that mutating the object should not cause it to be slower (e.g. to force a complete re-index)
We can guarantee that either the key or the value (or both) will be an integer
It's likely that the structure will be needed to store thousands or possibly millions of items.
Keys & values are guaranteed to be unique, i.e. len(set(x)) == len(x) for x in [D.keys(), D.values()]
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
Not really. Have you measured that? Since both dictionaries would use references to the same objects as keys and values, the memory spent would be just the dictionary structure itself. That's a lot less than twice the memory, regardless of the size of your data.
What I mean is that the actual data wouldn't be copied. So you'd spend little extra memory.
Example:
a = "some really really big text spending a lot of memory"
number_to_text = {1: a}
text_to_number = {a: 1}
Only a single copy of the "really big" string exists, so you end up spending just a little more memory. That's generally affordable.
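You can check this cheaply with sys.getsizeof, which reports a container's own overhead but not the data it references (a rough sketch):
import sys

a = "some really really big text spending a lot of memory" * 10000
number_to_text = {1: a}
text_to_number = {a: 1}

print(sys.getsizeof(a))               # large: the one real copy of the data
print(sys.getsizeof(number_to_text))  # small: just the dict structure
print(sys.getsizeof(text_to_number))  # small: just the dict structure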
I can't imagine a solution where you'd have the key lookup speed when looking by value, if you don't spend at least enough memory to store a reverse lookup hash table (which is exactly what's being done in your "unite two dicts" solution).
class TwoWay:
    def __init__(self):
        self.d = {}

    def add(self, k, v):
        self.d[k] = v
        self.d[v] = k

    def remove(self, k):
        self.d.pop(self.d.pop(k))

    def get(self, k):
        return self.d[k]
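Usage looks like this. Note that the single-dict trick stores keys and values in one namespace, so it assumes no key ever equals a value:
>>> tw = TwoWay()
>>> tw.add('foo', 42)
>>> tw.get('foo')
42
>>> tw.get(42)
'foo'
>>> tw.remove('foo')
>>> tw.d
{}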
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely use up twice as much memory as a single dict.
Not really, since they would just be holding two references to the same data. In my mind, this is not a bad solution.
Have you considered an in-memory database lookup? I am not sure how it will compare in speed, but lookups in relational databases can be very fast.
Here is my own solution to this problem: http://github.com/spenthil/pymathmap/blob/master/pymathmap.py
The goal is to make it as transparent to the user as possible. The only significant attribute introduced is partner.
OneToOneDict subclasses dict - I know that isn't generally recommended, but I think I have the common use cases covered. The backend is pretty simple: it (dict1) keeps a weakref to a 'partner' OneToOneDict (dict2), which is its inverse. When dict1 is modified, dict2 is updated accordingly as well, and vice versa.
From the docstring:
>>> dict1 = OneToOneDict()
>>> dict2 = OneToOneDict()
>>> dict1.partner = dict2
>>> assert(dict1 is dict2.partner)
>>> assert(dict2 is dict1.partner)
>>> dict1['one'] = '1'
>>> dict2['2'] = '1'
>>> dict1['one'] = 'wow'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1['one'] = '1'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1.update({'three': '3', 'four': '4'})
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict3 = OneToOneDict({'4':'four'})
>>> assert(dict3.partner is None)
>>> assert(dict3 == {'4':'four'})
>>> dict1.partner = dict3
>>> assert(dict1.partner is not dict2)
>>> assert(dict2.partner is None)
>>> assert(dict1.partner is dict3)
>>> assert(dict3.partner is dict1)
>>> dict1.setdefault('five', '5')
>>> dict1['five']
'5'
>>> dict1.setdefault('five', '0')
>>> dict1['five']
'5'
When I get some free time, I intend to make a version that doesn't store things twice. No clue when that'll be though :)
Assuming that you have a key with which you look up a more complex mutable object, just make the key a property of that object. It does seem you might be better off thinking about the data model a bit.
"We can guarantee that either the key or the value (or both) will be an integer"
That's weirdly written -- "key or the value (or both)" doesn't feel right. Either they're all integers, or they're not all integers.
It sounds like they're all integers.
Or, it sounds like you're thinking of replacing the target object with an integer value so you only have one copy referenced by an integer. This is a false economy. Just keep the target object. All Python objects are -- in effect -- references. Very little actual copying gets done.
Let's pretend that you simply have two integers and can do a lookup on either one of the pair. One way to do this is to use heap queues or the bisect module to maintain ordered lists of integer key-value tuples.
See http://docs.python.org/library/heapq.html#module-heapq
See http://docs.python.org/library/bisect.html#module-bisect
You keep one sorted list of (key, value) tuples - or, if your underlying object is more complex, (key, object) tuples.
You keep another sorted list of (value, key) tuples - or, if your underlying object is more complex, (otherkey, object) tuples.
An "insert" becomes two inserts, one into each sorted list.
A key lookup is done in one list and a value lookup in the other, using bisect(list, item).
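A sketch of that idea using plain sorted lists and the bisect module (the function names are mine; bisect.insort keeps each list sorted on insert):
import bisect

by_key = []    # sorted list of (key, value) tuples
by_value = []  # sorted list of (value, key) tuples

def insert(key, value):
    bisect.insort(by_key, (key, value))
    bisect.insort(by_value, (value, key))

def lookup_by_key(key):
    i = bisect.bisect_left(by_key, (key,))  # (key,) sorts before (key, anything)
    if i < len(by_key) and by_key[i][0] == key:
        return by_key[i][1]
    raise KeyError(key)

def lookup_by_value(value):
    i = bisect.bisect_left(by_value, (value,))
    if i < len(by_value) and by_value[i][0] == value:
        return by_value[i][1]
    raise KeyError(value)

insert(1, 'foo')
insert(5, 'bar')
print(lookup_by_key(5))        # bar
print(lookup_by_value('foo'))  # 1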
It so happens that I find myself asking this question all the time (yesterday in particular). I agree with the approach of making two dictionaries. Do some benchmarking to see how much memory it's taking. I've never needed to make it mutable, but here's how I abstract it, if it's of any use:
class BiDict(list):
    def __init__(self, *pairs):
        # Initialize the underlying list empty; each pair is appended
        # exactly once in the loop below.
        super(BiDict, self).__init__()
        self._first_access = {}
        self._second_access = {}
        for pair in pairs:
            self._first_access[pair[0]] = pair[1]
            self._second_access[pair[1]] = pair[0]
            self.append(pair)

    def _get_by_first(self, key):
        return self._first_access[key]

    def _get_by_second(self, key):
        return self._second_access[key]

    # You'll have to do some overrides to make it mutable.
    # Methods such as append, __add__, __del__, __iadd__,
    # to name a few, will have to maintain ._*_access.

class Constants(BiDict):
    # An implementation expecting an integer and a string
    get_by_name = BiDict._get_by_second
    get_by_number = BiDict._get_by_first

t = Constants(
    (1, 'foo'),
    (5, 'bar'),
    (8, 'baz'),
)
>>> print t.get_by_number(5)
bar
>>> print t.get_by_name('baz')
8
>>> print t
[(1, 'foo'), (5, 'bar'), (8, 'baz')]
How about using sqlite? Just create a :memory: database with a two-column table. You can even add indexes, then query by either one. Wrap it in a class if it's something you're going to use a lot.
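A sketch of that approach (the table and column names are mine): an in-memory database with an index on each column, so you can query from either side:
import sqlite3

conn = sqlite3.connect(':memory:')
# PRIMARY KEY indexes k; UNIQUE indexes v, so both lookups are fast.
conn.execute('CREATE TABLE mapping (k INTEGER PRIMARY KEY, v TEXT UNIQUE)')
conn.executemany('INSERT INTO mapping VALUES (?, ?)',
                 [(1, 'foo'), (5, 'bar'), (8, 'baz')])

def by_key(k):
    row = conn.execute('SELECT v FROM mapping WHERE k = ?', (k,)).fetchone()
    if row is None:
        raise KeyError(k)
    return row[0]

def by_value(v):
    row = conn.execute('SELECT k FROM mapping WHERE v = ?', (v,)).fetchone()
    if row is None:
        raise KeyError(v)
    return row[0]

print(by_key(5))        # bar
print(by_value('baz'))  # 8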
