'is' operator behaves differently when comparing strings with spaces - python

I've started learning Python (python 3.3) and I was trying out the is operator. I tried this:
>>> b = 'is it the space?'
>>> a = 'is it the space?'
>>> a is b
False
>>> c = 'isitthespace'
>>> d = 'isitthespace'
>>> c is d
True
>>> e = 'isitthespace?'
>>> f = 'isitthespace?'
>>> e is f
False
It seems like the space and the question mark make the is behave differently. What's going on?
EDIT: I know I should be using ==, I just wanted to know why is behaves like this.

Warning: this answer is about the implementation details of a specific python interpreter. comparing strings with is==bad idea.
Well, at least for cpython3.4/2.7.3, the answer is "no, it is not the whitespace". Not only the whitespace:
Two string literals will share memory if they are either alphanumeric or reside on the same block (file, function, class or single interpreter command)
An expression that evaluates to a string will result in an object that is identical to the one created using a string literal, if and only if it is created using constants and binary/unary operators, and the resulting string is shorter than 21 characters.
Single characters are unique.
Examples
Alphanumeric string literals always share memory:
>>> x='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> y='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
>>> x is y
True
Non-alphanumeric string literals share memory if and only if they share the enclosing syntactic block:
(interpreter)
>>> x='`!##$%^&*() \][=-. >:"?<a'; y='`!##$%^&*() \][=-. >:"?<a';
>>> z='`!##$%^&*() \][=-. >:"?<a';
>>> x is y
True
>>> x is z
False
(file)
x='`!##$%^&*() \][=-. >:"?<a';
y='`!##$%^&*() \][=-. >:"?<a';
z=(lambda : '`!##$%^&*() \][=-. >:"?<a')()
print(x is y)
print(x is z)
Output: True and False
For simple binary operations, the compiler is doing very simple constant propagation (see peephole.c), but with strings it does so only if the resulting string is shorter than 21 charcters. If this is the case, the rules mentioned earlier are in force:
>>> 'a'*10+'a'*10 is 'a'*20
True
>>> 'a'*21 is 'a'*21
False
>>> 'aaaaaaaaaaaaaaaaaaaaa' is 'aaaaaaaa' + 'aaaaaaaaaaaaa'
False
>>> t=2; 'a'*t is 'aa'
False
>>> 'a'.__add__('a') is 'aa'
False
>>> x='a' ; x+='a'; x is 'aa'
False
Single characters always share memory, of course:
>>> chr(0x20) is ' '
True

To expand on Ignacio’s answer a bit: The is operator is the identity operator. It is used to compare object identity. If you construct two objects with the same contents, then it is usually not the case that the object identity yields true. It works for some small strings because CPython, the reference implementation of Python, stores the contents separately, making all those objects reference to the same string content. So the is operator returns true for those.
This however is an implementation detail of CPython and is generally neither guaranteed for CPython nor any other implementation. So using this fact is a bad idea as it can break any other day.
To compare strings, you use the == operator which compares the equality of objects. Two string objects are considered equal when they contain the same characters. So this is the correct operator to use when comparing strings, and is should be generally avoided if you do not explicitely want object identity (example: a is False).
If you are really interested in the details, you can find the implementation of CPython’s strings here. But again: This is implementation detail, so you should never require this to work.

The is operator relies on the id function, which is guaranteed to be unique among simultaneously existing objects. Specifically, id returns the object's memory address. It seems that CPython has consistent memory addresses for strings containing only characters a-z and A-Z.
However, this seems to only be the case when the string has been assigned to a variable:
Here, the id of "foo" and the id of a are the same. a has been set to "foo" prior to checking the id.
>>> a = "foo"
>>> id(a)
4322269384
>>> id("foo")
4322269384
However, the id of "bar" and the id of a are different when checking the id of "bar" prior to setting a equal to "bar".
>>> id("bar")
4322269224
>>> a = "bar"
>>> id(a)
4322268984
Checking the id of "bar" again after setting a equal to "bar" returns the same id.
>>> id("bar")
4322268984
So it seems that cPython keeps consistent memory addresses for strings containing only a-zA-Z when those strings are assigned to a variable. It's also entirely possible that this is version dependent: I'm running python 2.7.3 on a macbook. Others might get entirely different results.

In fact your code amounts to comparing objects id (i.e. their physical address). So instead of your is comparison:
>>> b = 'is it the space?'
>>> a = 'is it the space?'
>>> a is b
False
You can do:
>>> id(a) == id(b)
False
But, note that if a and b were directly in the comparison it would work.
>>> id('is it the space?') == id('is it the space?')
True
In fact, in an expression there's sharing between the same static strings. But, at the program scale there's only sharing for word-like strings (so neither spaces nor punctuations).
You should not rely on this behavior as it's not documented anywhere and is a detail of implementation.

Two or more identical strings of consecutive alphanumeric (only) characters are stored in one structure, thus they share their memory reference. There are posts about this phenomenon all over the internet since the 1990's. It has evidently always been that way. I have never seen a reasonable guess as to why that's the case. I only know that it is. Furthermore, if you split and re-join alphanumeric strings to remove spaces between words, the resulting identical alphanumeric strings do NOT share a reference, which I find odd. See below:
Add any non-alphanumeric value identically to both strings, and they instantly become copies, but not shared references.
a ="abbacca"; b = "abbacca"; a is b => True
a ="abbacca "; b = "abbacca "; a is b => False
a ="abbacca?"; b = "abbacca?"; a is b => False
~Dr. C.

'is' operator compare the actual object.
c is d should also be false. My guess is that python make some optimization and in that case, it is the same object.

Related

Why 'is' operator is different when adding a value to a variable [duplicate]

This question already has answers here:
"is" operator behaves unexpectedly with integers
(11 answers)
Closed last month.
in shell environment,
x = 250
y = 250
x + 100 is y + 100
x + 100 is y + 100 result is False.
Why is the result value "False" while id(x), id(y) is same?
Also
x = 250
y = 250
x is y
x += 10
y += 10
x is y
last statement result is False. It is same problem.
I thought both questions would be True.
At a high-level, is checks whether the values refer to the same reference, while == calls a method of the objects to compare them
As a general rule, only use is when comparing against singletons
None
True
False
and == everywhere else
#Mark Ransom's comment gets to why this can have unexpected effects (such as numbers sometimes comparing the same, not not every time) .. small numbers have a special priority and are cached, though to my knowledge this is implementation-specific (and so it may vary between different interpreters and versions of Python) and should not be relied upon
>>> a = 100
>>> b = 100
>>> c = 1000
>>> d = 1000
>>> a is b
True
>>> c is d
False
"The is operator does not match the values of the variables, but the instances themselves."
Source : Understanding the "is" operator
To sum up, the values are equal but the instances isn't the same.
I recommend this lecture from Python official documentation : https://docs.python.org/3/tutorial/classes.html
is checks object identity. It is only True if you are comparing the same object on both sides. Just because two integers have the same value doesn't mean they have to be the same object. In fact, they usually aren't.
CPython optimizes a small set of integers (-5 through +256) to always be singletons. That is, whenever an integer is created, python checks whether the integer is in a small range of cached integers and uses those when possible. This is pretty quick - for small integers, which are the most commonly used integers, you don't need to create new objects. But its not scalable. Outside that range, python will generally create a new integer even if the integer already exists. The cost of lookup is too much.
The python compiler may also reuse immutables such as integers and strings that it knows about in a given compilation unit. This lookup is done once at compile and is relatively fast considering everything that goes on in that compile step.
But you can't rely on any of this. This is just how python happens to be implemented and the moment and it may change in the future. There are several singletons you can count on: None, True and False will always work.

Is there something about a string with an exclamation mark that prevent Python from string interning? [duplicate]

I learnt that in some immutable classes, __new__ may return an existing instance - this is what the int, str and tuple types sometimes do for small values.
But why do the following two snippets differ in the behavior?
With a space at the end:
>>> a = 'string '
>>> b = 'string '
>>> a is b
False
Without a space:
>>> c = 'string'
>>> d = 'string'
>>> c is d
True
Why does the space bring the difference?
This is a quirk of how the CPython implementation chooses to cache string literals. String literals with the same contents may refer to the same string object, but they don't have to. 'string' happens to be automatically interned when 'string ' isn't because 'string' contains only characters allowed in a Python identifier. I have no idea why that's the criterion they chose, but it is. The behavior may be different in different Python versions or implementations.
From the CPython 2.7 source code, stringobject.h, line 28:
Interning strings (ob_sstate) tries to ensure that only one string
object with a given value exists, so equality tests can be one pointer
comparison. This is generally restricted to strings that "look like"
Python identifiers, although the intern() builtin can be used to force
interning of any string.
You can see the code that does this in Objects/codeobject.c:
/* Intern selected string constants */
for (i = PyTuple_Size(consts); --i >= 0; ) {
PyObject *v = PyTuple_GetItem(consts, i);
if (!PyString_Check(v))
continue;
if (!all_name_chars((unsigned char *)PyString_AS_STRING(v)))
continue;
PyString_InternInPlace(&PyTuple_GET_ITEM(consts, i));
}
Also, note that interning is a separate process from the merging of string literals by the Python bytecode compiler. If you let the compiler compile the a and b assignments together, e.g. by placing them in a module or an if True:, you would find that a and b would be the same string.
This behavior is not consistent, and as others have mentioned depends on the variant of Python being executed. For a deeper discussion, see this question.
If you want to make sure that the same object is being used you can force the interning of strings by the appropriately named intern:
intern(...)
intern(string) -> string
``Intern'' the given string. This enters the string in the (global)
table of interned strings whose purpose is to speed up dictionary lookups.
Return the string itself or the previously interned string object with the
same value.
>>> a = 'string '
>>> b = 'string '
>>> id(a) == id(b)
False
>>> a = intern('string ')
>>> b = intern('string ')
>>> id(a) == id(b)
True
Note in Python3, you have to explicitly import intern from sys import intern.

Wrong boolean result in simple expression

I am getting a strange result in this boolean in python. I keep getting the wrong result.
string = '94070'
string[0:2] is '95' or string[0:2] is '94'
returns False, but when I hardcode in the value '94', it works
'94' is '95' or '94' is '94'
returns True. I've checked the data types and they are both of type 'str' so I'm not sure what is going on here.
Use == instead of is. In Python, the is operator does an object identity check. The == operator checks two objects (which may be different objects) to see whether they contain the same contents.
is is an identity test (is this the exact same object?), not equality test. While is works coincidentally, as an implementation detail for some things that aren't logically singletons, it shouldn't be used like this; use value equality testing with ==.
Your test of '94' is '94' can work due to a couple of related possibilities:
Python often coalesces constant literals in a function (sometimes only on a single line)
String literals are often interned by Python, so the same string literal expressed anywhere in the code references a common copy of that string
When you slice off bits of a string, interning isn't involved, so the identity test fails.
Use is to see if two arguments refer to the same object and == to see if they have the same value.
>>> a = 'this is some text.'
>>> b = 'this is some text.'
>>> a == b
True
>>> a is b
False
>>> a = 'this is some text.'
>>> b = a
>>> a == b
True
>>> a is b
True

python reference about floating point number [duplicate]

This question already has answers here:
Why id function behaves differently with integer and float?
(6 answers)
Closed 7 years ago.
Before question, Here are sample code.
Take a look at those first,please.
>>> id(1)
1636939440
>>> a = 1
>>> b = 1
>>> c = 1
>>> id(a)
1636939440
>>> id(b)
1636939440
>>> id(c)
1636939440
>>> id("hello")
43566560
>>> a = "hello"
>>> b = "hello"
>>> c = "hello"
>>> id(a)
43566560
>>> id(b)
43566560
>>> id(c)
43566560
>>> id(3.14)
34312864
>>> a = 3.14
>>> b = 3.14
>>> c = 3.14
>>> id(a)
34312864
>>> id(b)
34312600
>>> id(c)
34312432
As you see above, in terms of Integer and String, Python variable references
the object the same way. But floating point number works in different way.
Why is that? Is there any special reason for that?
For small integers and strings Python uses internal memory optimization. Since any variable in Python is a reference to memory object, Python puts such small values into the memory only once. Then, whenever the same value is assigned to any other variable, it makes that variable point to the object already kept in memory. This works for strings and integers as they are immutable and if the variable value changes, effectively it's the reference used by this variable that is changed, the object in memory with original value is not itself affected.
First of all, floating point numbers are not 'small', and, second, the same 3.14 in memory depending on calculations might be kept as 3.14123123456789 and 3.14123987654321 (just example numbers to explain). So these two values are two different objects, but during calculations and displaying the meaningful part looks the same, i.e. 3.14 (in fact there's obviously many more possible values in memory for the same floating point number). That's why reusing the same floating point number object in memory is problematic and doesn't worth it after all.
See more on how floating point numbers are kept in memory here:
http://floating-point-gui.de/
http://docs.python.org/2/tutorial/floatingpoint.html
Also, there's a big article on floating point numbers at Oracle docs.
Mutable and immutable.
Strings, tuples and bytes are immutable, whilst lists and byte arrays are mutable. Read more about the concept here: Data models in Python.

Intern and efficient memory use of string in python [duplicate]

I learnt that in some immutable classes, __new__ may return an existing instance - this is what the int, str and tuple types sometimes do for small values.
But why do the following two snippets differ in the behavior?
With a space at the end:
>>> a = 'string '
>>> b = 'string '
>>> a is b
False
Without a space:
>>> c = 'string'
>>> d = 'string'
>>> c is d
True
Why does the space bring the difference?
This is a quirk of how the CPython implementation chooses to cache string literals. String literals with the same contents may refer to the same string object, but they don't have to. 'string' happens to be automatically interned when 'string ' isn't because 'string' contains only characters allowed in a Python identifier. I have no idea why that's the criterion they chose, but it is. The behavior may be different in different Python versions or implementations.
From the CPython 2.7 source code, stringobject.h, line 28:
Interning strings (ob_sstate) tries to ensure that only one string
object with a given value exists, so equality tests can be one pointer
comparison. This is generally restricted to strings that "look like"
Python identifiers, although the intern() builtin can be used to force
interning of any string.
You can see the code that does this in Objects/codeobject.c:
/* Intern selected string constants */
for (i = PyTuple_Size(consts); --i >= 0; ) {
PyObject *v = PyTuple_GetItem(consts, i);
if (!PyString_Check(v))
continue;
if (!all_name_chars((unsigned char *)PyString_AS_STRING(v)))
continue;
PyString_InternInPlace(&PyTuple_GET_ITEM(consts, i));
}
Also, note that interning is a separate process from the merging of string literals by the Python bytecode compiler. If you let the compiler compile the a and b assignments together, e.g. by placing them in a module or an if True:, you would find that a and b would be the same string.
This behavior is not consistent, and as others have mentioned depends on the variant of Python being executed. For a deeper discussion, see this question.
If you want to make sure that the same object is being used you can force the interning of strings by the appropriately named intern:
intern(...)
intern(string) -> string
``Intern'' the given string. This enters the string in the (global)
table of interned strings whose purpose is to speed up dictionary lookups.
Return the string itself or the previously interned string object with the
same value.
>>> a = 'string '
>>> b = 'string '
>>> id(a) == id(b)
False
>>> a = intern('string ')
>>> b = intern('string ')
>>> id(a) == id(b)
True
Note in Python3, you have to explicitly import intern from sys import intern.

Categories

Resources