This question already has answers here:
About the changing id of an immutable string
(5 answers)
Closed 4 years ago.
Why does "hello" is "hello" produce True in Python?
I read the following here:
If two string literals are equal, they have been put to same
memory location. A string is an immutable entity. No harm can
be done.
So there is one and only one place in memory for every Python string? Sounds pretty strange. What's going on here?
Python (like Java, C, C++, .NET) uses string pooling / interning. The interpreter realises that "hello" is the same as "hello", so it optimizes and uses the same location in memory.
Another goodie: "hell" + "o" is "hello" ==> True
So there is one and only one place in memory for every Python string?
No, only ones the interpreter has decided to optimise, which is a decision based on a policy that isn't part of the language specification and which may change in different CPython versions.
eg. on my install (2.6.2 Linux):
>>> 'X'*10 is 'X'*10
True
>>> 'X'*30 is 'X'*30
False
similarly for ints:
>>> 2**8 is 2**8
True
>>> 2**9 is 2**9
False
So don't rely on 'string' is 'string': even just looking at the C implementation it isn't safe.
Literal strings are probably grouped based on their hash or something similar. Two of the same literal strings will be stored in the same memory, and any references both refer to that.
Memory Code
-------
| myLine = "hello"
| /
|hello <
| \
| myLine = "hello"
-------
The is operator returns true if both arguments are the same object. Your result is a consequence of this, and the quoted bit.
In the case of string literals, these are interned, meaning they are compared to known strings. If an identical string is already known, the literal takes that value, instead of an alternative one. Thus, they become the same object, and the expression is true.
The Python interpreter/compiler parses the string literals, i.e. the quoted list of characters. When it does this, it can detect "I've seen this string before", and use the same representation as last time. It can do this since it knows that strings defined in this way cannot be changed.
Why is it strange. If the string is immutable it makes a lot of sense to only store it once. .NET has the same behavior.
I think if any two variables (not just strings) contain the same value, the value will be stored only once not twice and both the variables will point to the same location. This saves memory.
Related
This question already has answers here:
Does Python do slice-by-reference on strings?
(2 answers)
Closed 2 years ago.
While going through the mutable and immutable topic in depth. I found that the variable address is different when you simply call "s" v/s when you call it by index. Why is this so?
s = "hello!"
print(id(s))
print(id(s[0:5]))
s[0:5] creates a new, temporary string. That, of course, has a different ID. One "normal" case would be to assign this expression to a variable; the separate object is ready for that assignment.
Also note that a common way to copy a sequence is with a whole-sequence slice, such as
s_copy = s[:]
It's creating a new string, as Python strings are immutable (cannot be changed once created)
However, many implementations of Python will tag the same object if they find they're creating the same string, which makes for a good example. Here's what happens in a CPython 3.9.1 shell (which I suspect is one of the most common at time of writing)
>>> s = "hello!"
>>> print(id(s))
4512419376
>>> print(id(s[0:5])) # new string from slice of original
4511329712
>>> print(id(s[0:5])) # refers to the newly-created string
4511329712
>>> print(id(s[0:6])) # references original string
4512419376
>>> print(id(s[:])) # trivial slice also refers to original
4512419376
In general you should not rely on this behavior and only consider it an example.
Check for string equality with ==!
See also Why does comparing strings using either '==' or 'is' sometimes produce a different result?
Strings are immutable and thus by creating a new substring by slicing, the new String has a new address in memory. Because the old String cannot be altered due to it being immutable.
I am from c background and a beginner in python. I want to know how strings are actually stored in memory in case of python.
I did something like
s="foo"
id(s)=140542718184424
id(s[0])= 140542719027040
id(s[1])= 140542718832152
id(s[2])= 140542718832152
I did not understand how each character is getting stored in memory and and why id of s is not equal to id of s[0] (like it use to be in c) and why id of s1 and s2 are same?
Python has no characters. Indexing into a string creates a new string, which (like every other object) promptly vanquishes if you don't keep a reference to it around. So the id()s in your example can't be compared with each other, an object's id is only unique as long as the object lives. In particular, id(s[0]) != id(s) because the former is a new (temporary) object, and id(s[1]) == id(s[2]) because after the first operand is evaluated, the first temporary string is destroyed and the second temporary string is allocated to the previously freed memory. The latter is an implementation detail and a coincidence and cannot be relied on.
Reasoning about string memory is further complicated by implementation details like small strings (along with integers, some tuples, and more) being interned, so some_str is other_str may be true for equal strings that come from different sources (e.g. from indexing into a string with different indices).
This article is a good reading which explains how strings are stored. Briefly:
When working with empty strings or ASCII strings of one character Python uses string interning. Interned strings act as singletons, that is, if you have two identical strings that are interned, there is only one copy of them in the memory.
Python does not UTF-8 internally to provide constant access to substrings:
s = 'hello world'
s[0]
s[7]
both do not require to scan the string from the initial char (or, more correctly, the first substring of length 1) to the i-th position.
This is why Python uses the three kinds of internal representations for Unicode strings with 1, 2 or 4 byte(s) per char (Latin-1, UCS-2, UCS-4 encoding) and does not use the space-optimised UTF-8.
This is implementation dependent, but some implementations (not only of Python, other languages too) may keep a moderate-size set of constant values around for expected frequent use. In Python's case those might be values like True, None, 'o', 1, 2, etc. This way, when one of those common values is needed, there is no overhead to create it--just refer to the existing value.
This question already has answers here:
"is" operator behaves unexpectedly with integers
(11 answers)
Closed 9 years ago.
Using Python2.7, if I try to compare the identity of two numbers, I don't get the same results for int and long.
int
>>> a = 5
>>> b = 5
>>> a is b
True
long
>>> a = 885763521873651287635187653182763581276358172635812763
>>> b = 885763521873651287635187653182763581276358172635812763
>>> a is b
False
I have a few related questions:
Why is the behavior different between the two?
Am I correct to generalize this behavior to all ints and all longs?
Is this CPython specific?
This isn't a difference between int and long. CPython interns small ints (from -5 to 256)
>>> a = 257
>>> b = 257
>>> a is b
False
Why is the behavior different between the two?
When you create a new value in CPython, you essentially create an entire new object containing that value. For small integer values (from -5 to 256), CPython decides to reuse old objects you have created earlier. This is called interning and it happens for performance reasons – it is considered cheaper to keep around old objects instead of creating new ones.
Am I correct to generalize this behavior to all ints and all longs?
No. As I said, it only happens for small ints. It also happens for short strings. (Where short means less than 7 characters.) Do not rely on this. It is only there for performance reasons, and your program shouldn't depend on the interning of values.
Is this CPython specific?
Not at all. Although the specifics may vary, many other platforms do interning. Notable in this aspect is the JVM, since in Java, the == operator means the same thing as the Python is operator. People learning Java are doing stuff like
String name1 = "John";
String name2 = "John";
if (name1 == name2) {
System.out.println("They have the same name!");
}
which of course is a bad habit, because if the names were longer, they would be in different objects and the comparison would be false, just like in Python if you would use the is operator.
Why is the behavior different between the two?
This is because only some ints are really interned. It cannot be done on really all values.
Am I correct to generalize this behavior to all ints and all longs?
In the case of int, no. Some are interned, some not.
AFAICT, for longs, you may be right, but I am not sure. I tested it with two 4Ls, and they were not identical.
Is this CPython specific?
Yes. It is not specified in the language specs, so this behaviour may (and will) differ on other implementations.
Try this way then you will be able to compare the data
a = 885763521873651287635187653182763581276358172635812763
b = 885763521873651287635187653182763581276358172635812763
if a == b:
print 'both are same'
Then you will get result
both are same
While I know that there is the possibility:
>>> a = "abc"
>>> result = a[-1]
>>> a = a[:-1]
Now I also know that strings are immutable and therefore something like this:
>>> a.pop()
c
is not possible.
But is this really the preferred way?
Strings are "immutable" for good reason: It really saves a lot of headaches, more often than you'd think. It also allows python to be very smart about optimizing their use. If you want to process your string in increments, you can pull out part of it with split() or separate it into two parts using indices:
a = "abc"
a, result = a[:-1], a[-1]
This shows that you're splitting your string in two. If you'll be examining every byte of the string, you can iterate over it (in reverse, if you wish):
for result in reversed(a):
...
I should add this seems a little contrived: Your string is more likely to have some separator, and then you'll use split:
ans = "foo,blah,etc."
for a in ans.split(","):
...
Not only is it the preferred way, it's the only reasonable way. Because strings are immutable, in order to "remove" a char from a string you have to create a new string whenever you want a different string value.
You may be wondering why strings are immutable, given that you have to make a whole new string every time you change a character. After all, C strings are just arrays of characters and are thus mutable, and some languages that support strings more cleanly than C allow mutable strings as well. There are two reasons to have immutable strings: security/safety and performance.
Security is probably the most important reason for strings to be immutable. When strings are immutable, you can't pass a string into some library and then have that string change from under your feet when you don't expect it. You may wonder which library would change string parameters, but if you're shipping code to clients you can't control their versions of the standard library, and malicious clients may change out their standard libraries in order to break your program and find out more about its internals. Immutable objects are also easier to reason about, which is really important when you try to prove that your system is secure against particular threats. This ease of reasoning is especially important for thread safety, since immutable objects are automatically thread-safe.
Performance is surprisingly often better for immutable strings. Whenever you take a slice of a string, the Python runtime only places a view over the original string, so there is no new string allocation. Since strings are immutable, you get copy semantics without actually copying, which is a real performance win.
Eric Lippert explains more about the rationale behind immutable of strings (in C#, not Python) here.
The precise wording of the question makes me think it's impossible.
return to me means you have a function, which you have passed a string as a parameter.
You cannot change this parameter. Assigning to it will only change the value of the parameter within the function, not the passed in string. E.g.
>>> def removeAndReturnLastCharacter(a):
c = a[-1]
a = a[:-1]
return c
>>> b = "Hello, Gaukler!"
>>> removeAndReturnLastCharacter(b)
!
>>> b # b has not been changed
Hello, Gaukler!
Yes, python strings are immutable and any modification will result in creating a new string. This is how it's mostly done.
So, go ahead with it.
I decided to go with a for loop and just avoid the item in question, is it an acceptable alternative?
new = ''
for item in str:
if item == str[n]:
continue
else:
new += item
In Python, the following code produces an error:
a = 'abc'
b = 1
print(a + b)
(The error is "TypeError: cannot concatenate 'str' and 'int' objects").
Why does the Python interpreter not automatically try using the str() function when it encounters concatenation of these types?
The problem is that the conversion is ambiguous, because + means both string concatenation and numeric addition. The following question would be equally valid:
Why does the Python interpreter not automatically try using the int() function when it encounters addition of these types?
This is exactly the loose-typing problem that unfortunately afflicts Javascript.
There's a very large degree of ambiguity with such operations. Suppose that case instead:
a = '4'
b = 1
print(a + b)
It's not clear if a should be coerced to an integer (resulting in 5), or if b should be coerced to a string (resulting in '41'). Since type juggling rules are transitive, passing a numeric string to a function expecting numbers could get you in trouble, especially since almost all arithmetic operators have overloaded operations for strings too.
For instance, in Javascript, to make sure you deal with integers and not strings, a common practice is to multiply a variable by one; in Python, the multiplication operator repeats strings, so '41' * 1 is a no-op. It's probably better to just ask the developer to clarify.
The short answer would be because Python is a strongly typed language.
This was a design decision made by Guido. It could have been one way or another really, concatenating str and int to str or int.
The best explanation, is still the one given by guido, you can check it here
The other answers have provided pretty good explanations, but have failed to mention that this feature is known a Strong Typing. Languages that perform implicit conversions are Weakly Typed.
Because Python does not perform type conversion when concatenating strings. This behavior is by design, and you should get in the habit of performing explicit type conversions when you need to coerce objects into strings or numbers.
Change your code to:
a = 'abc'
b = 1
print(a + str(b))
And you'll see the desired result.
Python would have to know what's in the string to do it correctly. There's an ambiguous case: what should '5' + 5 generate? A number or a string? That should certainly throw an error. Now to determine whether that situation holds, python would have to examine the string to tell. Should it do that every time you try to concatenate or add two things? Better to just let the programmer convert the string explicitly.
More generally, implicit conversions like that are just plain confusing! They're hard to predict, hard to read, and hard to debug.
That's just how they decided to design the language. Probably the rationale is that requiring explicit conversions to string reduces the likelihood of unintended behavior (e.g. integer addition if both operands happen to be ints instead of strings).
tell python that the int is a list to disambiguate the '+' operation.
['foo', 'bar'] + [5]
this returns: ['foo', 'bar', 5]