Related
As far as I know, set in python works via a hash-table to achieve O(1) look-up complexity. While it is hash-table, every entry in a set must be hashable (or immutable).
So This peace of code raises exception:
>>> {dict()}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'
Because dict is not hashable. But we can create our own class inherited from dict and implement the __hash__ magic method. I created my own in this way:
>>> class D(dict):
... def __hash__(self):
... return 3
...
I know it should not work properly but I just wanted to experiment with it. So I checked that I can now use this type in set:
>>> {D()}
{{}}
>>> {D(name='ali')}
{{'name': 'ali'}}
So far so good, but I thought that this way of implementing the __hash__ magic method would screw up the look up in set. Because every object of the D has the same hash value.
>>> d1 = D(n=1)
>>> d2 = D(n=2)
>>> hash(d1), hash(d2)
(3, 3)
>>>
>>> {d1, d2}
{{'n': 2}, {'n': 1}}
But the surprise for me was this:
>>> d3 = D()
>>> d3 in {d1, d2}
False
I expected the result to be True, because hash of d3 is 3 and currently there are values in our set with the same hash value. How does the set works internally?
To be usable in sets and dicts, a __hash__ method must guarantee that if x == y, then hash(x) == hash(y). But that's a one-sided implication. It's not at all required that if hash(x) == hash(h) then x == y must be true. Indeed, that's impossible to achieve in general (for example, there are an unbounded number of distinct Python ints, but only a finite number of hash codes - there must be distinct ints that have the same hash value).
That your hashes are all the same is fine. They only tell the set/dict where to start looking. All objects in the container with the same hash are then compared, one by one, for equality, until success, or until all such objects have been tried without success.
However, while making all hashes the same doesn't hurt correctness, it's a disaster for performance: it effectively turns the set/dict into an exceptionally slow way to do an O(n) linear search.
Two string variables are set to the same value. s1 == s2 always returns True, but s1 is s2 sometimes returns False.
If I open my Python interpreter and do the same is comparison, it succeeds:
>>> s1 = 'text'
>>> s2 = 'text'
>>> s1 is s2
True
Why is this?
is is identity testing, and == is equality testing. What happens in your code would be emulated in the interpreter like this:
>>> a = 'pub'
>>> b = ''.join(['p', 'u', 'b'])
>>> a == b
True
>>> a is b
False
So, no wonder they're not the same, right?
In other words: a is b is the equivalent of id(a) == id(b)
Other answers here are correct: is is used for identity comparison, while == is used for equality comparison. Since what you care about is equality (the two strings should contain the same characters), in this case the is operator is simply wrong and you should be using == instead.
The reason is works interactively is that (most) string literals are interned by default. From Wikipedia:
Interned strings speed up string
comparisons, which are sometimes a
performance bottleneck in applications
(such as compilers and dynamic
programming language runtimes) that
rely heavily on hash tables with
string keys. Without interning,
checking that two different strings
are equal involves examining every
character of both strings. This is
slow for several reasons: it is
inherently O(n) in the length of the
strings; it typically requires reads
from several regions of memory, which
take time; and the reads fills up the
processor cache, meaning there is less
cache available for other needs. With
interned strings, a simple object
identity test suffices after the
original intern operation; this is
typically implemented as a pointer
equality test, normally just a single
machine instruction with no memory
reference at all.
So, when you have two string literals (words that are literally typed into your program source code, surrounded by quotation marks) in your program that have the same value, the Python compiler will automatically intern the strings, making them both stored at the same memory location. (Note that this doesn't always happen, and the rules for when this happens are quite convoluted, so please don't rely on this behavior in production code!)
Since in your interactive session both strings are actually stored in the same memory location, they have the same identity, so the is operator works as expected. But if you construct a string by some other method (even if that string contains exactly the same characters), then the string may be equal, but it is not the same string -- that is, it has a different identity, because it is stored in a different place in memory.
The is keyword is a test for object identity while == is a value comparison.
If you use is, the result will be true if and only if the object is the same object. However, == will be true any time the values of the object are the same.
One last thing to note is you may use the sys.intern function to ensure that you're getting a reference to the same string:
>>> from sys import intern
>>> a = intern('a')
>>> a2 = intern('a')
>>> a is a2
True
As pointed out in previous answers, you should not be using is to determine equality of strings. But this may be helpful to know if you have some kind of weird requirement to use is.
Note that the intern function used to be a built-in on Python 2, but it was moved to the sys module in Python 3.
is is identity testing and == is equality testing. This means is is a way to check whether two things are the same things, or just equivalent.
Say you've got a simple person object. If it is named 'Jack' and is '23' years old, it's equivalent to another 23-year-old Jack, but it's not the same person.
class Person(object):
def __init__(self, name, age):
self.name = name
self.age = age
def __eq__(self, other):
return self.name == other.name and self.age == other.age
jack1 = Person('Jack', 23)
jack2 = Person('Jack', 23)
jack1 == jack2 # True
jack1 is jack2 # False
They're the same age, but they're not the same instance of person. A string might be equivalent to another, but it's not the same object.
This is a side note, but in idiomatic Python, you will often see things like:
if x is None:
# Some clauses
This is safe, because there is guaranteed to be one instance of the Null Object (i.e., None).
If you're not sure what you're doing, use the '=='.
If you have a little more knowledge about it you can use 'is' for known objects like 'None'.
Otherwise, you'll end up wondering why things doesn't work and why this happens:
>>> a = 1
>>> b = 1
>>> b is a
True
>>> a = 6000
>>> b = 6000
>>> b is a
False
I'm not even sure if some things are guaranteed to stay the same between different Python versions/implementations.
From my limited experience with Python, is is used to compare two objects to see if they are the same object as opposed to two different objects with the same value. == is used to determine if the values are identical.
Here is a good example:
>>> s1 = u'public'
>>> s2 = 'public'
>>> s1 is s2
False
>>> s1 == s2
True
s1 is a Unicode string, and s2 is a normal string. They are not the same type, but they are the same value.
I think it has to do with the fact that, when the 'is' comparison evaluates to false, two distinct objects are used. If it evaluates to true, that means internally it's using the same exact object and not creating a new one, possibly because you created them within a fraction of 2 or so seconds and because there isn't a large time gap in between it's optimized and uses the same object.
This is why you should be using the equality operator ==, not is, to compare the value of a string object.
>>> s = 'one'
>>> s2 = 'two'
>>> s is s2
False
>>> s2 = s2.replace('two', 'one')
>>> s2
'one'
>>> s2 is s
False
>>>
In this example, I made s2, which was a different string object previously equal to 'one' but it is not the same object as s, because the interpreter did not use the same object as I did not initially assign it to 'one', if I had it would have made them the same object.
The == operator tests value equivalence. The is operator tests object identity, and Python tests whether the two are really the same object (i.e., live at the same address in memory).
>>> a = 'banana'
>>> b = 'banana'
>>> a is b
True
In this example, Python only created one string object, and both a and b refers to it. The reason is that Python internally caches and reuses some strings as an optimization. There really is just a string 'banana' in memory, shared by a and b. To trigger the normal behavior, you need to use longer strings:
>>> a = 'a longer banana'
>>> b = 'a longer banana'
>>> a == b, a is b
(True, False)
When you create two lists, you get two objects:
>>> a = [1, 2, 3]
>>> b = [1, 2, 3]
>>> a is b
False
In this case we would say that the two lists are equivalent, because they have the same elements, but not identical, because they are not the same object. If two objects are identical, they are also equivalent, but if they are equivalent, they are not necessarily identical.
If a refers to an object and you assign b = a, then both variables refer to the same object:
>>> a = [1, 2, 3]
>>> b = a
>>> b is a
True
Reference: Think Python 2e by Allen B. Downey
I believe that this is known as "interned" strings. Python does this, so does Java, and so do C and C++ when compiling in optimized modes.
If you use two identical strings, instead of wasting memory by creating two string objects, all interned strings with the same contents point to the same memory.
This results in the Python "is" operator returning True because two strings with the same contents are pointing at the same string object. This will also happen in Java and in C.
This is only useful for memory savings though. You cannot rely on it to test for string equality, because the various interpreters and compilers and JIT engines cannot always do it.
Actually, the is operator checks for identity and == operator checks for equality.
From the language reference:
Types affect almost all aspects of object behavior. Even the importance of object identity is affected in some sense: for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed. E.g., after a = 1; b = 1, a and b may or may not refer to the same object with the value one, depending on the implementation, but after c = []; d = [], c and d are guaranteed to refer to two different, unique, newly created empty lists. (Note that c = d = [] assigns the same object to both c and d.)
So from the above statement we can infer that the strings, which are immutable types, may fail when checked with "is" and may succeed when checked with "is".
The same applies for int and tuple which are also immutable types.
is will compare the memory location. It is used for object-level comparison.
== will compare the variables in the program. It is used for checking at a value level.
is checks for address level equivalence
== checks for value level equivalence
is is identity testing and == is equality testing (see the Python documentation).
In most cases, if a is b, then a == b. But there are exceptions, for example:
>>> nan = float('nan')
>>> nan is nan
True
>>> nan == nan
False
So, you can only use is for identity tests, never equality tests.
The basic concept, we have to be clear, while approaching this question, is to understand the difference between is and ==.
"is" is will compare the memory location. if id(a)==id(b), then a is b returns true else it returns false.
So, we can say that is is used for comparing memory locations. Whereas,
== is used for equality testing which means that it just compares only the resultant values. The below shown code may acts as an example to the above given theory.
Code
In the case of string literals (strings without getting assigned to variables), the memory address will be same as shown in the picture. so, id(a)==id(b). The remaining of this is self-explanatory.
Two string variables are set to the same value. s1 == s2 always returns True, but s1 is s2 sometimes returns False.
If I open my Python interpreter and do the same is comparison, it succeeds:
>>> s1 = 'text'
>>> s2 = 'text'
>>> s1 is s2
True
Why is this?
is is identity testing, and == is equality testing. What happens in your code would be emulated in the interpreter like this:
>>> a = 'pub'
>>> b = ''.join(['p', 'u', 'b'])
>>> a == b
True
>>> a is b
False
So, no wonder they're not the same, right?
In other words: a is b is the equivalent of id(a) == id(b)
Other answers here are correct: is is used for identity comparison, while == is used for equality comparison. Since what you care about is equality (the two strings should contain the same characters), in this case the is operator is simply wrong and you should be using == instead.
The reason is works interactively is that (most) string literals are interned by default. From Wikipedia:
Interned strings speed up string
comparisons, which are sometimes a
performance bottleneck in applications
(such as compilers and dynamic
programming language runtimes) that
rely heavily on hash tables with
string keys. Without interning,
checking that two different strings
are equal involves examining every
character of both strings. This is
slow for several reasons: it is
inherently O(n) in the length of the
strings; it typically requires reads
from several regions of memory, which
take time; and the reads fills up the
processor cache, meaning there is less
cache available for other needs. With
interned strings, a simple object
identity test suffices after the
original intern operation; this is
typically implemented as a pointer
equality test, normally just a single
machine instruction with no memory
reference at all.
So, when you have two string literals (words that are literally typed into your program source code, surrounded by quotation marks) in your program that have the same value, the Python compiler will automatically intern the strings, making them both stored at the same memory location. (Note that this doesn't always happen, and the rules for when this happens are quite convoluted, so please don't rely on this behavior in production code!)
Since in your interactive session both strings are actually stored in the same memory location, they have the same identity, so the is operator works as expected. But if you construct a string by some other method (even if that string contains exactly the same characters), then the string may be equal, but it is not the same string -- that is, it has a different identity, because it is stored in a different place in memory.
The is keyword is a test for object identity while == is a value comparison.
If you use is, the result will be true if and only if the object is the same object. However, == will be true any time the values of the object are the same.
One last thing to note is you may use the sys.intern function to ensure that you're getting a reference to the same string:
>>> from sys import intern
>>> a = intern('a')
>>> a2 = intern('a')
>>> a is a2
True
As pointed out in previous answers, you should not be using is to determine equality of strings. But this may be helpful to know if you have some kind of weird requirement to use is.
Note that the intern function used to be a built-in on Python 2, but it was moved to the sys module in Python 3.
is is identity testing and == is equality testing. This means is is a way to check whether two things are the same things, or just equivalent.
Say you've got a simple person object. If it is named 'Jack' and is '23' years old, it's equivalent to another 23-year-old Jack, but it's not the same person.
class Person(object):
def __init__(self, name, age):
self.name = name
self.age = age
def __eq__(self, other):
return self.name == other.name and self.age == other.age
jack1 = Person('Jack', 23)
jack2 = Person('Jack', 23)
jack1 == jack2 # True
jack1 is jack2 # False
They're the same age, but they're not the same instance of person. A string might be equivalent to another, but it's not the same object.
This is a side note, but in idiomatic Python, you will often see things like:
if x is None:
# Some clauses
This is safe, because there is guaranteed to be one instance of the Null Object (i.e., None).
If you're not sure what you're doing, use the '=='.
If you have a little more knowledge about it you can use 'is' for known objects like 'None'.
Otherwise, you'll end up wondering why things doesn't work and why this happens:
>>> a = 1
>>> b = 1
>>> b is a
True
>>> a = 6000
>>> b = 6000
>>> b is a
False
I'm not even sure if some things are guaranteed to stay the same between different Python versions/implementations.
From my limited experience with Python, is is used to compare two objects to see if they are the same object as opposed to two different objects with the same value. == is used to determine if the values are identical.
Here is a good example:
>>> s1 = u'public'
>>> s2 = 'public'
>>> s1 is s2
False
>>> s1 == s2
True
s1 is a Unicode string, and s2 is a normal string. They are not the same type, but they are the same value.
I think it has to do with the fact that, when the 'is' comparison evaluates to false, two distinct objects are used. If it evaluates to true, that means internally it's using the same exact object and not creating a new one, possibly because you created them within a fraction of 2 or so seconds and because there isn't a large time gap in between it's optimized and uses the same object.
This is why you should be using the equality operator ==, not is, to compare the value of a string object.
>>> s = 'one'
>>> s2 = 'two'
>>> s is s2
False
>>> s2 = s2.replace('two', 'one')
>>> s2
'one'
>>> s2 is s
False
>>>
In this example, I made s2, which was a different string object previously equal to 'one' but it is not the same object as s, because the interpreter did not use the same object as I did not initially assign it to 'one', if I had it would have made them the same object.
The == operator tests value equivalence. The is operator tests object identity, and Python tests whether the two are really the same object (i.e., live at the same address in memory).
>>> a = 'banana'
>>> b = 'banana'
>>> a is b
True
In this example, Python only created one string object, and both a and b refers to it. The reason is that Python internally caches and reuses some strings as an optimization. There really is just a string 'banana' in memory, shared by a and b. To trigger the normal behavior, you need to use longer strings:
>>> a = 'a longer banana'
>>> b = 'a longer banana'
>>> a == b, a is b
(True, False)
When you create two lists, you get two objects:
>>> a = [1, 2, 3]
>>> b = [1, 2, 3]
>>> a is b
False
In this case we would say that the two lists are equivalent, because they have the same elements, but not identical, because they are not the same object. If two objects are identical, they are also equivalent, but if they are equivalent, they are not necessarily identical.
If a refers to an object and you assign b = a, then both variables refer to the same object:
>>> a = [1, 2, 3]
>>> b = a
>>> b is a
True
Reference: Think Python 2e by Allen B. Downey
I believe that this is known as "interned" strings. Python does this, so does Java, and so do C and C++ when compiling in optimized modes.
If you use two identical strings, instead of wasting memory by creating two string objects, all interned strings with the same contents point to the same memory.
This results in the Python "is" operator returning True because two strings with the same contents are pointing at the same string object. This will also happen in Java and in C.
This is only useful for memory savings though. You cannot rely on it to test for string equality, because the various interpreters and compilers and JIT engines cannot always do it.
Actually, the is operator checks for identity and == operator checks for equality.
From the language reference:
Types affect almost all aspects of object behavior. Even the importance of object identity is affected in some sense: for immutable types, operations that compute new values may actually return a reference to any existing object with the same type and value, while for mutable objects this is not allowed. E.g., after a = 1; b = 1, a and b may or may not refer to the same object with the value one, depending on the implementation, but after c = []; d = [], c and d are guaranteed to refer to two different, unique, newly created empty lists. (Note that c = d = [] assigns the same object to both c and d.)
So from the above statement we can infer that the strings, which are immutable types, may fail when checked with "is" and may succeed when checked with "is".
The same applies for int and tuple which are also immutable types.
is will compare the memory location. It is used for object-level comparison.
== will compare the variables in the program. It is used for checking at a value level.
is checks for address level equivalence
== checks for value level equivalence
is is identity testing and == is equality testing (see the Python documentation).
In most cases, if a is b, then a == b. But there are exceptions, for example:
>>> nan = float('nan')
>>> nan is nan
True
>>> nan == nan
False
So, you can only use is for identity tests, never equality tests.
The basic concept, we have to be clear, while approaching this question, is to understand the difference between is and ==.
"is" is will compare the memory location. if id(a)==id(b), then a is b returns true else it returns false.
So, we can say that is is used for comparing memory locations. Whereas,
== is used for equality testing which means that it just compares only the resultant values. The below shown code may acts as an example to the above given theory.
Code
In the case of string literals (strings without getting assigned to variables), the memory address will be same as shown in the picture. so, id(a)==id(b). The remaining of this is self-explanatory.
Here, I have the following:
>>> import numpy as np
>>> q = np.nan
>>> q == np.nan
False
>>> q is np.nan
True
>>> q in (np.nan, )
True
So, the question is: why nan is not equal to nan, but is nan?
(UNIQUE) And why 'in' returns True?
I don't seem to be able to trace down the implementation of nan. It leads me to C:\Python33\lib\site-packages\numpy\core\umath.pyd (row NAN = nan), but from there there is no traceable way to find out what nan actually is.
The creators of numpy decided that it made most sense that most comparisons to nan, including ==, should yield False. You can do this in Python by defining a __eq__(self, other) method for your object. This behaviour was chosen simply because it is the most useful, for various purposes. After all, the fact that one entry has a missing value, and another entry also has a missing value, does not imply that those two entries are equal. It just implies that you don't know whether they are equal or not, and it's therefore best not to treat them as if they are (e.g. when you join two tables together by pairing up corresponding rows).
is on the other hand is a Python keyword which cannot be overwritten by numpy. It tests whether two objects are the same thing. nan is the same object as nan. This is also useful behaviour to have anyway, because often you will want to e.g. get rid of all entries which don't have a value, which you can achieve with is not nan.
nan in (nan,) returns True because as you probably know, (nan,) is a tuple with only one element, nan, and when Python checks if an object is in a tuple, it is checking whether that object is or == any object in the tuple.
NaN is handled perfectly when I check for its presence in a list or a set. But I don't understand how. [UPDATE: no it's not; it is reported as present if the identical instance of NaN is found; if only non-identical instances of NaN are found, it is reported as absent.]
I thought presence in a list is tested by equality, so I expected NaN to not be found since NaN != NaN.
hash(NaN) and hash(0) are both 0. How do dictionaries and sets tell NaN and 0 apart?
Is it safe to check for NaN presence in an arbitrary container using in operator? Or is it implementation dependent?
My question is about Python 3.2.1; but if there are any changes existing/planned in future versions, I'd like to know that too.
NaN = float('nan')
print(NaN != NaN) # True
print(NaN == NaN) # False
list_ = (1, 2, NaN)
print(NaN in list_) # True; works fine but how?
set_ = {1, 2, NaN}
print(NaN in set_) # True; hash(NaN) is some fixed integer, so no surprise here
print(hash(0)) # 0
print(hash(NaN)) # 0
set_ = {1, 2, 0}
print(NaN in set_) # False; works fine, but how?
Note that if I add an instance of a user-defined class to a list, and then check for containment, the instance's __eq__ method is called (if defined) - at least in CPython. That's why I assumed that list containment is tested using operator ==.
EDIT:
Per Roman's answer, it would seem that __contains__ for list, tuple, set, dict behaves in a very strange way:
def __contains__(self, x):
for element in self:
if x is element:
return True
if x == element:
return True
return False
I say 'strange' because I didn't see it explained in the documentation (maybe I missed it), and I think this is something that shouldn't be left as an implementation choice.
Of course, one NaN object may not be identical (in the sense of id) to another NaN object. (This not really surprising; Python doesn't guarantee such identity. In fact, I never saw CPython share an instance of NaN created in different places, even though it shares an instance of a small number or a short string.) This means that testing for NaN presence in a built-in container is undefined.
This is very dangerous, and very subtle. Someone might run the very code I showed above, and incorrectly conclude that it's safe to test for NaN membership using in.
I don't think there is a perfect workaround to this issue. One, very safe approach, is to ensure that NaN's are never added to built-in containers. (It's a pain to check for that all over the code...)
Another alternative is watch out for cases where in might have NaN on the left side, and in such cases, test for NaN membership separately, using math.isnan(). In addition, other operations (e.g., set intersection) need to also be avoided or rewritten.
Question #1: why is NaN found in a container when it's an identical object.
From the documentation:
For container types such as list, tuple, set, frozenset, dict, or
collections.deque, the expression x in y is equivalent to any(x is e
or x == e for e in y).
This is precisely what I observe with NaN, so everything is fine. Why this rule? I suspect it's because a dict/set wants to honestly report that it contains a certain object if that object is actually in it (even if __eq__() for whatever reason chooses to report that the object is not equal to itself).
Question #2: why is the hash value for NaN the same as for 0?
From the documentation:
Called by built-in function hash() and for operations on members of
hashed collections including set, frozenset, and dict. hash()
should return an integer. The only required property is that objects
which compare equal have the same hash value; it is advised to somehow
mix together (e.g. using exclusive or) the hash values for the
components of the object that also play a part in comparison of
objects.
Note that the requirement is only in one direction; objects that have the same hash do not have to be equal! At first I thought it's a typo, but then I realized that it's not. Hash collisions happen anyway, even with default __hash__() (see an excellent explanation here). The containers handle collisions without any problem. They do, of course, ultimately use the == operator to compare elements, hence they can easily end up with multiple values of NaN, as long as they are not identical! Try this:
>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> d = {}
>>> d[nan1] = 1
>>> d[nan2] = 2
>>> d[nan1]
1
>>> d[nan2]
2
So everything works as documented. But... it's very very dangerous! How many people knew that multiple values of NaN could live alongside each other in a dict? How many people would find this easy to debug?..
I would recommend to make NaN an instance of a subclass of float that doesn't support hashing and hence cannot be accidentally added to a set/dict. I'll submit this to python-ideas.
Finally, I found a mistake in the documentation here:
For user-defined classes which do not define __contains__() but do
define __iter__(), x in y is true if some value z with x == z is
produced while iterating over y. If an exception is raised during the
iteration, it is as if in raised that exception.
Lastly, the old-style iteration protocol is tried: if a class defines
__getitem__(), x in y is true if and only if there is a non-negative
integer index i such that x == y[i], and all lower integer indices do
not raise IndexError exception. (If any other exception is raised, it
is as if in raised that exception).
You may notice that there is no mention of is here, unlike with built-in containers. I was surprised by this, so I tried:
>>> nan1 = float('nan')
>>> nan2 = float('nan')
>>> class Cont:
... def __iter__(self):
... yield nan1
...
>>> c = Cont()
>>> nan1 in c
True
>>> nan2 in c
False
As you can see, the identity is checked first, before == - consistent with the built-in containers. I'll submit a report to fix the docs.
I can't repro you tuple/set cases using float('nan') instead of NaN.
So i assume that it worked only because id(NaN) == id(NaN), i.e. there is no interning for NaN objects:
>>> NaN = float('NaN')
>>> id(NaN)
34373956456
>>> id(float('NaN'))
34373956480
And
>>> NaN is NaN
True
>>> NaN is float('NaN')
False
I believe tuple/set lookups has some optimization related to comparison of the same objects.
Answering your question - it seam to be unsafe to relay on in operator while checking for presence of NaN. I'd recommend to use None, if possible.
Just a comment. __eq__ has nothing to do with is statement, and during lookups comparison of objects' ids seem to happen prior to any value comparisons:
>>> class A(object):
... def __eq__(*args):
... print '__eq__'
...
>>> A() == A()
__eq__ # as expected
>>> A() is A()
False # `is` checks only ids
>>> A() in [A()]
__eq__ # as expected
False
>>> a = A()
>>> a in [a]
True # surprise!