Why are CPython ints unique and long not? [duplicate]

Why are CPython ints unique and long not? [duplicate] - python

This question already has answers here:
"is" operator behaves unexpectedly with integers
(11 answers)
Closed 9 years ago.
Using Python2.7, if I try to compare the identity of two numbers, I don't get the same results for int and long.
int
>>> a = 5
>>> b = 5
>>> a is b
True
long
>>> a = 885763521873651287635187653182763581276358172635812763
>>> b = 885763521873651287635187653182763581276358172635812763
>>> a is b
False
I have a few related questions:
Why is the behavior different between the two?
Am I correct to generalize this behavior to all ints and all longs?
Is this CPython specific?

This isn't a difference between int and long. CPython interns small ints (from -5 to 256)
>>> a = 257
>>> b = 257
>>> a is b
False

Why is the behavior different between the two?
When you create a new value in CPython, you essentially create an entire new object containing that value. For small integer values (from -5 to 256), CPython decides to reuse old objects you have created earlier. This is called interning and it happens for performance reasons – it is considered cheaper to keep around old objects instead of creating new ones.
Am I correct to generalize this behavior to all ints and all longs?
No. As I said, it only happens for small ints. It also happens for short strings. (Where short means less than 7 characters.) Do not rely on this. It is only there for performance reasons, and your program shouldn't depend on the interning of values.
Is this CPython specific?
Not at all. Although the specifics may vary, many other platforms do interning. Notable in this aspect is the JVM, since in Java, the == operator means the same thing as the Python is operator. People learning Java are doing stuff like
String name1 = "John";
String name2 = "John";
if (name1 == name2) {
System.out.println("They have the same name!");
}
which of course is a bad habit, because if the names were longer, they would be in different objects and the comparison would be false, just like in Python if you would use the is operator.

Why is the behavior different between the two?
This is because only some ints are really interned. It cannot be done on really all values.
Am I correct to generalize this behavior to all ints and all longs?
In the case of int, no. Some are interned, some not.
AFAICT, for longs, you may be right, but I am not sure. I tested it with two 4Ls, and they were not identical.
Is this CPython specific?
Yes. It is not specified in the language specs, so this behaviour may (and will) differ on other implementations.

Try this way then you will be able to compare the data
a = 885763521873651287635187653182763581276358172635812763
b = 885763521873651287635187653182763581276358172635812763
if a == b:
print 'both are same'
Then you will get result
both are same

Related

Python min() of arguments with different types

I was surprised recently to find that you can take the min() of arguments of different types in Python and not get a ValueError.
min(3, "blah") ==> 3
min(300, 'zzz') ==> 300
The documentation is unclear on this - it just says min() takes "the smallest of the arguments". How is it actually determining which element is the smallest?

It determines this by comparing them using the usual rules. If objects are of different types and can't be compared sensibly (because neither of them implement the required special methods, or the implementation doesn't work with the type of the other object) then they are given a consistent order by type; e.g., all integers are less than all strings. Try it: 1 < "1"
(By the way, Booleans are implemented as a subclass of integers, and can be compared with numbers, so they'll sort False as 0 and True as 1.)
It was implemented this way so that if you sort a list containing various types, like types will be sorted together. In Python 3 forward, however, this was changed and you can no longer implicitly compare dissimilar types.

In Python 2, there existed an arbitrary but predictable comparison of values with different types. I think it was something like lexicographic comparison of the type name (ints < floats < strs < tuples ).

It is correct according to Python's definition of ordering:
>>> 3 < "blah"
True
>>> 300 < 'zzz'
True
The rule is:
If both are numbers, they are converted to a common type. Otherwise, objects of different types always compare unequal, and are ordered consistently but arbitrarily.

Python Remove last char from string and return it

While I know that there is the possibility:
>>> a = "abc"
>>> result = a[-1]
>>> a = a[:-1]
Now I also know that strings are immutable and therefore something like this:
>>> a.pop()
c
is not possible.
But is this really the preferred way?

Strings are "immutable" for good reason: It really saves a lot of headaches, more often than you'd think. It also allows python to be very smart about optimizing their use. If you want to process your string in increments, you can pull out part of it with split() or separate it into two parts using indices:
a = "abc"
a, result = a[:-1], a[-1]
This shows that you're splitting your string in two. If you'll be examining every byte of the string, you can iterate over it (in reverse, if you wish):
for result in reversed(a):
...
I should add this seems a little contrived: Your string is more likely to have some separator, and then you'll use split:
ans = "foo,blah,etc."
for a in ans.split(","):
...

Not only is it the preferred way, it's the only reasonable way. Because strings are immutable, in order to "remove" a char from a string you have to create a new string whenever you want a different string value.
You may be wondering why strings are immutable, given that you have to make a whole new string every time you change a character. After all, C strings are just arrays of characters and are thus mutable, and some languages that support strings more cleanly than C allow mutable strings as well. There are two reasons to have immutable strings: security/safety and performance.
Security is probably the most important reason for strings to be immutable. When strings are immutable, you can't pass a string into some library and then have that string change from under your feet when you don't expect it. You may wonder which library would change string parameters, but if you're shipping code to clients you can't control their versions of the standard library, and malicious clients may change out their standard libraries in order to break your program and find out more about its internals. Immutable objects are also easier to reason about, which is really important when you try to prove that your system is secure against particular threats. This ease of reasoning is especially important for thread safety, since immutable objects are automatically thread-safe.
Performance is surprisingly often better for immutable strings. Whenever you take a slice of a string, the Python runtime only places a view over the original string, so there is no new string allocation. Since strings are immutable, you get copy semantics without actually copying, which is a real performance win.
Eric Lippert explains more about the rationale behind immutable of strings (in C#, not Python) here.

The precise wording of the question makes me think it's impossible.
return to me means you have a function, which you have passed a string as a parameter.
You cannot change this parameter. Assigning to it will only change the value of the parameter within the function, not the passed in string. E.g.
>>> def removeAndReturnLastCharacter(a):
c = a[-1]
a = a[:-1]
return c
>>> b = "Hello, Gaukler!"
>>> removeAndReturnLastCharacter(b)
!
>>> b # b has not been changed
Hello, Gaukler!

Yes, python strings are immutable and any modification will result in creating a new string. This is how it's mostly done.
So, go ahead with it.

I decided to go with a for loop and just avoid the item in question, is it an acceptable alternative?
new = ''
for item in str:
if item == str[n]:
continue
else:
new += item

Why does Python not perform type conversion when concatenating strings?

In Python, the following code produces an error:
a = 'abc'
b = 1
print(a + b)
(The error is "TypeError: cannot concatenate 'str' and 'int' objects").
Why does the Python interpreter not automatically try using the str() function when it encounters concatenation of these types?

The problem is that the conversion is ambiguous, because + means both string concatenation and numeric addition. The following question would be equally valid:
Why does the Python interpreter not automatically try using the int() function when it encounters addition of these types?
This is exactly the loose-typing problem that unfortunately afflicts Javascript.

There's a very large degree of ambiguity with such operations. Suppose that case instead:
a = '4'
b = 1
print(a + b)
It's not clear if a should be coerced to an integer (resulting in 5), or if b should be coerced to a string (resulting in '41'). Since type juggling rules are transitive, passing a numeric string to a function expecting numbers could get you in trouble, especially since almost all arithmetic operators have overloaded operations for strings too.
For instance, in Javascript, to make sure you deal with integers and not strings, a common practice is to multiply a variable by one; in Python, the multiplication operator repeats strings, so '41' * 1 is a no-op. It's probably better to just ask the developer to clarify.

The short answer would be because Python is a strongly typed language.
This was a design decision made by Guido. It could have been one way or another really, concatenating str and int to str or int.
The best explanation, is still the one given by guido, you can check it here

The other answers have provided pretty good explanations, but have failed to mention that this feature is known a Strong Typing. Languages that perform implicit conversions are Weakly Typed.

Because Python does not perform type conversion when concatenating strings. This behavior is by design, and you should get in the habit of performing explicit type conversions when you need to coerce objects into strings or numbers.
Change your code to:
a = 'abc'
b = 1
print(a + str(b))
And you'll see the desired result.

Python would have to know what's in the string to do it correctly. There's an ambiguous case: what should '5' + 5 generate? A number or a string? That should certainly throw an error. Now to determine whether that situation holds, python would have to examine the string to tell. Should it do that every time you try to concatenate or add two things? Better to just let the programmer convert the string explicitly.
More generally, implicit conversions like that are just plain confusing! They're hard to predict, hard to read, and hard to debug.

That's just how they decided to design the language. Probably the rationale is that requiring explicit conversions to string reduces the likelihood of unintended behavior (e.g. integer addition if both operands happen to be ints instead of strings).

tell python that the int is a list to disambiguate the '+' operation.
['foo', 'bar'] + [5]
this returns: ['foo', 'bar', 5]

Why is int(50)<str(5) in python 2.x?

In python 3, int(50)<'2' causes a TypeError, and well it should. In python 2.x, however, int(50)<'2' returns True (this is also the case for other number formats, but int exists in both py2 and py3). My question, then, has several parts:
Why does Python 2.x (< 3?) allow this behavior?
(And who thought it was a good idea to allow this to begin with???)
What does it mean that an int is less than a str?
Is it referring to ord / chr?
Is there some binary format which is less obvious?
Is there a difference between '5' and u'5' in this regard?

It works like this1.
>>> float() == long() == int() < dict() < list() < str() < tuple()
True
Numbers compare as less than containers. Numeric types are converted to a common type and compared based on their numeric value. Containers are compared by the alphabetic value of their names.2
From the docs:
CPython implementation detail: Objects of different types except numbers are ordered by >their type names; objects of the same types that don’t support proper comparison are >ordered by their address.
Objects of different builtin types compare alphabetically by the name of their type int starts with an 'i' and str starts with an s so any int is less than any str..
I have no idea.
A drunken master.
It means that a formal order has been introduced on the builtin types.
It's referring to an arbitrary order.
No.
No. strings and unicode objects are considered the same for this purpose. Try it out.
In response to the comment about long < int
>>> int < long
True
You probably meant values of those types though, in which case the numeric comparison applies.
1 This is all on Python 2.6.5
2 Thank to kRON for clearing this up for me. I'd never thought to compare a number to a dict before and comparison of numbers is one of those things that's so obvious that it's easy to overlook.

The reason why these comparisons are allowed, is sorting. Python 2.x can sort lists containing mixed types, including strings and integers -- integers always appear first. Python 3.x does not allow this, for the exact reasons you pointed out.
Python 2.x:
>>> sorted([1, '1'])
[1, '1']
>>> sorted([1, '1', 2, '2'])
[1, 2, '1', '2']
Python 3.x:
>>> sorted([1, '1'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unorderable types: str() < int()

(And who thought it was a good idea to allow this to begin with???)
I can imagine that the reason might be to allow object from different types to be stored in tree-like structures, which use comparisons internally.

As Aaron said. Breaking it up into your points:
Because it makes sort do something halfway usable where it otherwise would make no sense at all (mixed lists). It's not a good idea generally, but much in Python is designed for convenience over strictness.
Ordered by type name. This means things of the same type group together, where they can be sorted. They should probably be grouped by type class, such as numbers together, but there's no proper type class framework. There may be a few more specific rules in there (probably is one for numeric types), I'd have to check the source.
One is string and the other is unicode. They may have a direct comparison operation, however, but it's conceivable a non-comparable type would get grouped between them, causing a mess. I don't know if there's code to avoid this.
So, it doesn't make sense in the general case, but occasionally it's helpful.
from random import shuffle
letters=list('abcdefgh')
ints=range(8)
both=ints+letters
shuffle(ints)
shuffle(letters)
shuffle(both)
print sorted(ints+letters)
print sorted(both)
Both print the ints first, then the letters.
As a rule, you don't want to mix types randomly within a program, and apparently Python 3 prevents it where Python 2 tries to make vague sense where none exists. You could still sort by lambda a,b: cmp(repr(a),repr(b)) (or something better) if you really want to, but it appears the language developers agreed it's impractical default behaviour. I expect it varies which gives the least surprise, but it's a lot harder to detect a problem in the Python 2 sense.

Python: Why does ("hello" is "hello") evaluate as True? [duplicate]

This question already has answers here:
About the changing id of an immutable string
(5 answers)
Closed 4 years ago.
Why does "hello" is "hello" produce True in Python?
I read the following here:
If two string literals are equal, they have been put to same
memory location. A string is an immutable entity. No harm can
be done.
So there is one and only one place in memory for every Python string? Sounds pretty strange. What's going on here?

Python (like Java, C, C++, .NET) uses string pooling / interning. The interpreter realises that "hello" is the same as "hello", so it optimizes and uses the same location in memory.
Another goodie: "hell" + "o" is "hello" ==> True

So there is one and only one place in memory for every Python string?
No, only ones the interpreter has decided to optimise, which is a decision based on a policy that isn't part of the language specification and which may change in different CPython versions.
eg. on my install (2.6.2 Linux):
>>> 'X'*10 is 'X'*10
True
>>> 'X'*30 is 'X'*30
False
similarly for ints:
>>> 2**8 is 2**8
True
>>> 2**9 is 2**9
False
So don't rely on 'string' is 'string': even just looking at the C implementation it isn't safe.

Literal strings are probably grouped based on their hash or something similar. Two of the same literal strings will be stored in the same memory, and any references both refer to that.
Memory Code
-------
| myLine = "hello"
| /
|hello <
| \
| myLine = "hello"
-------

The is operator returns true if both arguments are the same object. Your result is a consequence of this, and the quoted bit.
In the case of string literals, these are interned, meaning they are compared to known strings. If an identical string is already known, the literal takes that value, instead of an alternative one. Thus, they become the same object, and the expression is true.

The Python interpreter/compiler parses the string literals, i.e. the quoted list of characters. When it does this, it can detect "I've seen this string before", and use the same representation as last time. It can do this since it knows that strings defined in this way cannot be changed.

Why is it strange. If the string is immutable it makes a lot of sense to only store it once. .NET has the same behavior.

I think if any two variables (not just strings) contain the same value, the value will be stored only once not twice and both the variables will point to the same location. This saves memory.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.