Python: Identical strings (or numbers) with unique ids?


Python is wonderfully optimized, but I have a case where I'd like to work around it. It seems that for small numbers and strings, Python will automatically collapse multiple objects into one. For example:
>>> a = 1
>>> b = 1
>>> id(a) == id(b)
True
>>> a = str(a)
>>> b = str(b)
>>> id(a) == id(b)
True
>>> a += 'foobar'
>>> b += 'foobar'
>>> id(a) == id(b)
False
>>> a = a[:-6]
>>> b = b[:-6]
>>> id(a) == id(b)
True
I have a case where I'm comparing objects based on their Python ids. This is working really well except for the few cases where I run into small numbers. Does anyone know how to turn off this optimization for specific strings and integers? Something akin to an anti-intern()?

You shouldn't be relying on these objects to be different objects at all. There's no way to turn this behavior off without modifying and recompiling Python, and which particular objects it applies to is subject to change without notice.

You can't turn it off without re-compiling your own version of CPython.
But if you want to have "separate" versions of the same small integers, you can do that by maintaining your own id (for example a uuid4) associated with the object.
Since ints and strings are immutable, there's no obvious reason to do this - if you can't modify the object at all, you shouldn't care whether you have the "original" or a copy because there is no use-case where it can make any difference.
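The "maintain your own id" idea above could be sketched roughly as follows. The Tagged class and its token attribute are hypothetical names invented for this illustration; only uuid.uuid4() comes from the standard library.

```python
import uuid

class Tagged:
    """Hypothetical wrapper pairing a value with its own identity token."""

    def __init__(self, value):
        self.value = value
        self.token = uuid.uuid4()  # random token, independent of id()

a = Tagged(1)
b = Tagged(1)
print(a.value == b.value)  # True: the wrapped values are equal
print(a.token == b.token)  # False: each wrapper carries its own token
```

The tokens stay distinct no matter how the interpreter caches the wrapped integers.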
Related: How to create the int 1 at two different memory locations?

Sure, it can be done, but it's never really a good idea:
Z = 1

class MyString(str):
    def __init__(self, *args):
        global Z
        # str is immutable, so the value is set in __new__; nothing to pass on here.
        super(MyString, self).__init__()
        self.i = Z  # per-instance serial number
        Z += 1

>>> a = MyString("1")
>>> b = MyString("1")
>>> a is b
False
By the way, to check whether two objects have the same id, just use a is b instead of id(a) == id(b).

The Python documentation on id() says
Return the “identity” of an object. This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
CPython implementation detail: This is the address of the object in memory.
So it's guaranteed to be unique, and it must be intended as a way to tell whether two variables are bound to the same object.
In a comment on StackOverflow here, Alex Martelli says the CPython implementation is not the authoritative Python, and other correct implementations of Python can and do behave differently in some ways - and that the Python Language Reference (PLR) is the closest thing Python has to a definitive specification.
In the PLR section on objects it says much the same:
Every object has an identity, a type and a value. An object’s identity never changes once it has been created; you may think of it as the object’s address in memory. The ‘is‘ operator compares the identity of two objects; the id() function returns an integer representing its identity (currently implemented as its address).
The language reference doesn't say it's guaranteed to be unique. It also says (re: the object's lifetime):
Objects are never explicitly destroyed; however, when they become unreachable they may be garbage-collected. An implementation is allowed to postpone garbage collection or omit it altogether — it is a matter of implementation quality how garbage collection is implemented, as long as no objects are collected that are still reachable.
and:
CPython implementation detail: CPython currently uses a reference-counting scheme with (optional) delayed detection of cyclically linked garbage, which collects most objects as soon as they become unreachable, but is not guaranteed to collect garbage containing circular references. See the documentation of the gc module for information on controlling the collection of cyclic garbage. Other implementations act differently and CPython may change. Do not depend on immediate finalization of objects when they become unreachable (ex: always close files).
This isn't actually an answer; I was hoping this would end up somewhere conclusive. But I don't want to delete it now that I've quoted and cited.
I'll go with turning your premise around: "python will automatically collapse multiple objects into one" - no, it won't. They were never multiple objects, and they can't be, because they have the same id().
If id() is Python's definitive answer on whether two objects are the same or different, your premise is incorrect - this isn't an optimization, it's a fundamental part of Python's view on the world.

This version accounts for wim's concerns about more aggressive interning in the future. It will use more memory, which is why I discarded it originally, but it is probably more future-proof.
>>> class Wrapper(object):
...     def __init__(self, obj):
...         self.obj = obj
...
>>> a = 1
>>> b = 1
>>> aWrapped = Wrapper(a)
>>> bWrapped = Wrapper(b)
>>> aWrapped is bWrapped
False
>>> aUnWrapped = aWrapped.obj
>>> bUnwrapped = bWrapped.obj
>>> aUnWrapped is bUnwrapped
True
Or a version that works like the pickle answer (wrap + pickle = wrapple):
class Wrapple(object):
    def __init__(self, obj):
        self.obj = obj

    @staticmethod
    def dumps(obj):
        return Wrapple(obj)

    def loads(self):
        return self.obj

aWrapped = Wrapple.dumps(a)
aUnWrapped = Wrapple.loads(aWrapped)

Well, seeing as no one posted a response that was useful, I'll just let you know what I ended up doing.
First, some friendly advice to someone who might read this one day. This is not recommended for normal use, so if you're contemplating it, ask yourself if you have a really good reason. There are good reasons, but they are rare, and if someone says there aren't, they just aren't thinking hard enough.
In the end, I just used pickle.dumps() on all the objects and passed the output in instead of the real object. On the other side I checked the id and then used pickle.loads() to restore the object. The nice part of this solution was that it works for all types, including None and Booleans.
>>> import pickle
>>> a = 1
>>> b = 1
>>> a is b
True
>>> aPickled = pickle.dumps(a)
>>> bPickled = pickle.dumps(b)
>>> aPickled is bPickled
False
>>> aUnPickled = pickle.loads(aPickled)
>>> bUnPickled = pickle.loads(bPickled)
>>> aUnPickled is bUnPickled
True
>>> aUnPickled
1

Related

understanding python id() uniqueness

Python documentation for id() function states the following:
This is an integer which is guaranteed to be unique and constant for
this object during its lifetime. Two objects with non-overlapping
lifetimes may have the same id() value.
CPython implementation detail: This is the address of the object in memory.
Although, the snippet below shows that id's are repeated. Since I didn't explicitly del the objects, I presume they are all alive and unique (I do not know what non-overlapping means).
>>> g = [0, 1, 0]
>>> for h in g:
...     print(h, id(h))
...
0 10915712
1 10915744
0 10915712
>>> a=0
>>> b=1
>>> c=0
>>> d=[a, b,c]
>>> for e in d:
...     print(e, id(e))
...
0 10915712
1 10915744
0 10915712
>>> id(a)
10915712
>>> id(b)
10915744
>>> id(c)
10915712
>>>
How can the id values for different objects be the same? Is it so because the value 0 (object of class int) is a constant and the interpreter/C compiler optimizes?
If I were to do a = c, then I understand c to have the same id as a since c would just be a reference to a (alias). I expected the objects a and c to have different id values otherwise, but, as shown above, they have the same values.
What's happening? Or am I looking at this the wrong way?
I would expect the id's for user-defined class' objects to ALWAYS be unique even if they have the exact same member values.
Could someone explain this behavior? (I looked at the other questions that ask uses of id(), but they steer in other directions)
EDIT (09/30/2019):
To extend what I already wrote: I ran Python interpreters in separate terminals and checked the id of 0 in each of them. The values were exactly the same across multiple instances of the same interpreter. Python 2 and Python 3 gave different values from each other, but every instance of the same Python 2 interpreter gave the same id.
My question is because the id()'s documentation doesn't state any such optimizations, which seems misleading (I don't expect every quirk to be noted, but some note alongside the CPython note would be nice)...
EDIT 2 (09/30/2019):
The question stems from wanting to understand this behavior and to know whether there are any hooks for optimizing user-defined classes in a similar way (for example, by modifying the __eq__ method to identify whether two objects are the same, so that perhaps they would point to the same address in memory, i.e. have the same id, or by using some metaclass properties).

Ids are guaranteed to be unique for the lifetime of the object. If an object gets deleted, a new object can acquire the same id. CPython will delete items immediately when their refcount drops to zero. The garbage collector is only needed to break up reference cycles.
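As a small sketch of the "unique for the lifetime of the object" guarantee: while every object is kept alive (here by a list), no two of them can share an id. What happens to an id after its object is freed is implementation-specific, so no result is claimed for the second check.

```python
# Keep 1000 objects alive at once; their lifetimes overlap, so ids must differ.
live_objects = [object() for _ in range(1000)]
distinct_ids = {id(obj) for obj in live_objects}
print(len(distinct_ids) == len(live_objects))  # True: no id is shared

# Once an object is freed, its id MAY be handed to a later object.
# Whether that happens is an implementation detail, so nothing is asserted here.
freed_id = id(object())
reused = id(object()) == freed_id
print("id reused after free:", reused)
```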
CPython may also cache and re-use certain immutable objects like small integers and strings defined by literals that are valid identifiers. This is an implementation detail that you should not rely upon. It is generally considered improper to use is checks on such objects.
There are certain exceptions to this rule, for example, using an is check on possibly-interned strings as an optimization before comparing them with the normal == operator is fine. The dict builtin uses this strategy for lookups to make them faster for identifiers.
a is b or a == b # This is OK
If the string happens to be interned, then the above can return true with a simple id comparison instead of a slower character-by-character comparison, but it still returns true if and only if a == b (because if a is b then a == b must also be true). However, a good implementation of .__eq__() would already do an is check internally, so at best you would only avoid the overhead of calling the .__eq__().
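A minimal sketch of the identity-then-equality pattern described above. The helper name fast_equal is made up for this example; sys.intern() is the standard-library call that returns the canonical interned copy of a string.

```python
import sys

def fast_equal(a, b):
    """Identity check first, then fall back to ordinary equality."""
    return a is b or a == b

# sys.intern() returns the canonical interned copy, so two equal interned
# strings are the same object.
left = sys.intern("".join(["some", "_", "name"]))
right = sys.intern("some_name")
print(left is right)             # True
print(fast_equal(left, right))   # True, decided by the identity check alone

# Equal but distinct objects still compare equal through the == fallback.
print(fast_equal([1, 2], [1, 2]))  # True
```

The result is the same as a plain a == b; only the amount of work can differ.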
Thanks for the answer, would you elaborate around the uniqueness for user-defined objects, are they always unique?
The id of any object (be it user-defined or not) is unique for the lifetime of the object. It's important to distinguish objects from variables. It's possible to have two or more variables refer to the same object.
>>> a = object()
>>> b = a
>>> c = object()
>>> a is b
True
>>> a is c
False
Caching optimizations mean that you are not always guaranteed to get a new object in cases where one might naively think one should, but this does not in any way violate the uniqueness guarantee of IDs. Builtin types like int and str may have some caching optimizations, but they follow exactly the same rules: if two objects are live at the same time and their IDs are the same, then they are the same object.
Caching is not unique to builtin types. You can implement caching for your own objects.
>>> def the_one(it=object()):
...     return it
...
>>> the_one() is the_one()
True
Even user-defined classes can cache instances. For example, this class only makes one instance of itself.
>>> class TheOne:
...     _the_one = None
...     def __new__(cls):
...         if not cls._the_one:
...             cls._the_one = super().__new__(cls)
...         return cls._the_one
...
>>> TheOne() is TheOne() # There can be only one TheOne.
True
>>> id(TheOne()) == id(TheOne()) # This is what an is-check does.
True
Note that each construction expression evaluates to an object with the same id as the other. But this id is unique to the object. Both expressions reference the same object, so of course they have the same id.
The above class only keeps one instance, but you could also cache some other number. Perhaps recently used instances, or those configured in a way you expect to be common (as ints do), etc.
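Caching "some other number" of instances might look like the following sketch, which keeps one instance per distinct constructor argument. The Color class and its _instances dictionary are hypothetical names used only for illustration.

```python
class Color:
    """Hypothetical class that keeps one instance per distinct name."""

    _instances = {}

    def __new__(cls, name):
        try:
            return cls._instances[name]  # reuse the cached instance
        except KeyError:
            instance = super().__new__(cls)
            instance.name = name
            cls._instances[name] = instance
            return instance

print(Color("red") is Color("red"))   # True: same cached object
print(Color("red") is Color("blue"))  # False: different cache entries
```

As with TheOne, equal ids here simply mean both expressions evaluate to the same cached object.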

Questions about python dictionary equality [duplicate]

Curiously:
>>> a = 123
>>> b = 123
>>> a is b
True
>>> a = 123.
>>> b = 123.
>>> a is b
False
It seems that a is b is more or less defined as id(a) == id(b). It is easy to introduce bugs this way:
basename, ext = os.path.splitext(fname)
if ext is '.mp3':
    # do something
else:
    # do something else
Some fnames unexpectedly ended up in the else block. The fix is simple, we should use ext == '.mp3' instead, but nonetheless if ext is '.mp3' on the surface seems like a nice pythonic way to write this and it's more readable than the "correct" way.
Since strings are immutable, what are the technical details of why it's wrong? When is an identity check better, and when is an equality check better?
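For reference, the fix mentioned in the question could be sketched like this. The helper name is_mp3 is made up for the example; os.path.splitext() builds a new string for the extension, so only a value comparison is reliable.

```python
import os.path

def is_mp3(fname):
    """Compare the extension by value, not by identity."""
    _, ext = os.path.splitext(fname)
    return ext == ".mp3"

print(is_mp3("song.mp3"))   # True
print(is_mp3("notes.txt"))  # False
```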

They are fundamentally different.
== compares by calling the __eq__ method
is returns true if and only if the two references are to the same object
So, in comparison with, say, Java:
is is the same as == for objects
== is the same as equals for objects

As far as I can tell, is checks for object identity equivalence. As there's no compulsory "string interning", two strings that just happen to have the same characters in sequence are, typically, not the same string object.
When you extract a substring from a string (or, really, any subsequence from a sequence), you will end up with two different objects, containing the same value(s).
So, use is when and only when you are comparing object identities. Use == when comparing values.

Simple rule for determining if to use is or == in Python
Here is an easy rule (unless you want to go to theory in Python interpreter or building frameworks doing funny things with Python objects):
Use is only for None comparison.
if foo is None
Otherwise use ==.
if x == 3
Then you are on the safe side. The rationale for this is already explained in the above comments. Don't use is if you are not 100% sure why you are doing it.
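A short sketch of the rule in practice, using a hypothetical function scale() with an optional argument. Checking is None keeps a legitimate value such as 0 from being mistaken for "not supplied", which a plain truthiness test would do.

```python
def scale(value, factor=None):
    """Multiply value by factor, defaulting to 2 when no factor is given."""
    if factor is None:  # identity check: was the argument supplied at all?
        factor = 2
    return value * factor

print(scale(3))          # 6: default factor used
print(scale(3, 0))       # 0: an explicit 0 is respected
print(scale(3, 2) == 6)  # True: values are compared with ==
```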

It would be also useful to define a class like this to be used as the default value for constants used in your API. In this case, it would be more correct to use is than the == operator.
class Sentinel(object):
    """A constant object that does not change even when copied."""

    def __deepcopy__(self, memo):
        # Always return the same object because this is essentially a constant.
        return self

    def __copy__(self):
        # called via copy.copy(x)
        return self
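A possible usage sketch for such a sentinel follows. The class is repeated so the snippet runs on its own; MISSING and describe() are hypothetical names chosen for the example.

```python
import copy

class Sentinel(object):
    """A constant object that does not change even when copied."""

    def __deepcopy__(self, memo):
        return self

    def __copy__(self):
        return self

MISSING = Sentinel()  # hypothetical module-level default marker

def describe(value=MISSING):
    # "is" distinguishes "no argument" from any real value, including None.
    if value is MISSING:
        return "no argument given"
    return "got %r" % (value,)

print(describe())                                     # no argument given
print(describe(None))                                 # got None
print(copy.deepcopy(MISSING) is MISSING)              # True
print(copy.deepcopy({"k": MISSING})["k"] is MISSING)  # True
```

Because copies return the same object, the identity check keeps working even after the sentinel has travelled through copy.copy() or copy.deepcopy().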

PyCharm warns you when you use is with a literal, and Python 3.8 and later emit a warning of their own: SyntaxWarning: "is" with a literal. Did you mean "=="?. So, when comparing with a literal, always use ==. Otherwise, you may prefer using is in order to compare objects through their references.

why some types of data refer to the same memory location

I have a question: why do only str and bool variables with the same value refer to one memory location?
a = 'something'
b = 'something'
if a is b: print('True') # True
But we did not write a = b anywhere. Hence the interpreter saw that the strings are equal to each other and made both names refer to one memory cell.
Of course, if we assign a new value to either of these two variables, there will be no conflict; that variable will simply refer to another memory location:
b = 'something more'
if a is b: print('True') # False
With the boolean type the same thing happens:
a = True
b = True
if a is b: print('True') # True
I first thought that this happens with all immutable types. But no: there is one more immutable type, tuple, and it behaves differently. When we assign the same value to two variables, they refer to different memory cells. Why does this happen only with tuples, among the immutable types?
a = (1,9,8)
b = (1,9,8)
if a is b: print('True') # False

In Python, == checks for value equality, while is checks whether it's the same object, basically like so: id(object) == id(object)
Python has some builtin singletons which it starts off with (I'm guessing lower integers and some commonly used strings)
So, if you dig deeper into your statement
a = 'something'
b = 'something'
id(a)
# 139702804094704
id(b)
# 139702804094704
a is b
# True
But if you change it a bit:
a = 'something else'
b = 'something else'
id(a)
# 139702804150640
id(b)
# 139702804159152
a is b
# False
We're getting False because Python uses different memory location for a and b this time, unlike before.
My guess is with tuples (and someone correct me if I'm mistaken) Python allocates different memory every time you create one.

Why do some types cache values? Because you shouldn't be able to notice the difference!
is is a very specialized operator. Nearly always you should use == instead, which will do exactly what you want.
The cases where you want to use is instead of == basically are when you're dealing with objects that have overloaded the behavior of == to not mean what you want it to mean, or where you're worried that you might be dealing with such objects.
If you're not sure whether you're dealing with such objects or not, you're probably not, which means that == is always right and you don't have to ever use is.
It can be a matter of "style points" to use is with known singleton objects, like None, but there's nothing wrong with using == there (again, in the absence of a pathological implementation of ==).
If you're dealing with potentially untrustworthy objects, then you should never do anything that may invoke a method that they control.... and that's a good place to use is. But almost nobody is doing that, and those who do should be aware of the zillion other ways a malicious object could cause problems.
If an object implements == incorrectly then you can get all kinds of weird problems. In the course of debugging those problems, of course you can and should use is! But that shouldn't be your normal way of comparing objects in code you write.
The one other case where you might want to use is rather than == is as a performance optimization, if the object you're dealing with implements == in a particularly expensive way. This is not going to be the case very often at all, and most of the time there are better ways to reduce the number of times you have to compare two objects (e.g. by comparing hash codes instead) which will ultimately have a much better effect on performance without bringing correctness into question.
If you use == wherever you semantically want an equality comparison, then you will never even notice when some types sneakily reuse instances on you.
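To make the "pathological ==" case concrete, here is a minimal sketch with a made-up class AlwaysEqual whose __eq__ answers True for everything. Only the identity check gives a trustworthy answer for it.

```python
class AlwaysEqual:
    """Hypothetical object with a misbehaving equality method."""

    def __eq__(self, other):
        return True  # claims to equal anything at all

suspect = AlwaysEqual()
print(suspect == None)   # True: the overloaded == is fooled
print(suspect is None)   # False: identity cannot be overridden
print(suspect == 42)     # True again, for any operand
```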

What is an object reference in Python?

An introductory Python textbook defined 'object reference' as follows, but I didn't understand it:
An object reference is nothing more than a concrete representation of the object’s identity (the memory address where the object is stored).
The textbook tried illustrating this by using an arrow to show an object reference as some sort of relation going from a variable a to an object 1234 in the assignment statement a = 1234.
From what I gathered off of Wikipedia, the (object) reference of a = 1234 would be an association between a and 1234 where a was "pointing" to 1234 (feel free to clarify "reference vs. pointer"), but it has been a bit difficult to verify as (1) I'm teaching myself Python, (2) many search results talk about references for Java, and (3) not many search results are about object references.
So, what is an object reference in Python? Thanks for the help!

Whatever is associated with a variable name has to be stored in the program's memory somewhere. An easy way to think of this is that every byte of memory has an index number. For simplicity's sake, let's imagine a simple computer where these index numbers go from 0 (the first byte) upwards to however many bytes there are.
Say we have a sequence of 37 bytes, that a human might interpret as some words:
"The Owl and the Pussy-cat went to sea"
The computer is storing them in a contiguous block, starting at some index-position in memory. This index-position is most often called an "address". Obviously this address is absolutely just a number, the byte-number of the memory these letters are residing in.
#12000 The Owl and the Pussy-cat went to sea
So at address 12000 is a T, at 12001 an h, 12002 an e ... up to the last a at 12036.
I am labouring the point here because it's fundamental to every programming language. That 12000 is the "address" of this string. It's also a "reference" to its location. For most intents and purposes an address is a pointer is a reference. Different languages have differing syntactic handling of these, but essentially they're the same thing - dealing with a block of data at a given number.
Python and Java try to hide this addressing as much as possible, where languages like C are quite happy to expose pointers for exactly what they are.
The take-away from this, is that an object reference is the number of where the data is stored in memory. (As is a pointer.)
Now, most programming languages distinguish between simple types: characters and numbers, and complex types: strings, lists and other compound-types. This is where the reference to an object makes a difference.
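In Python the practical distinction is between immutable objects (numbers, strings, tuples) and mutable ones (lists, dicts). Assigning to a name never changes an object; it only makes the name refer to a different one. Imagine the following sequence in Python: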
>>> a = 3
>>> b = a
>>> b
3
>>> b = 4
>>> b
4
>>> a
3 # <-- original has not changed
After b = 4, the names a and b refer to different objects: the assignment rebound b and left the object 3, and therefore a, untouched. But with a mutable type:
>>> s = [ 1, 2, 3 ]
>>> t = s
>>> t
[1, 2, 3]
>>> t[1] = 8
>>> t
[1, 8, 3]
>>> s
[1, 8, 3] # <-- original HAS changed
We assigned t to be s, but obviously in this case t is s - they share the same memory. Wait, what! Here we have found out that both s and t are a reference to the same object - they simply share (point to) the same address in memory.
Python strings are immutable, so they behave like numbers here: assigning a new string to k rebinds k and leaves j alone:
>>> j = 'Pussycat'
>>> k = j
>>> k
'Pussycat'
>>> k = 'Owl'
>>> j
'Pussycat' # <-- Original has not changed
Whereas in C strings are definitely handled as complex types, and would behave like the Python list example.
The upshot of all this, is that when objects that are handled by reference are modified, all references-to this object "see" the change. So if the object is passed to a function that modifies it (i.e.: the content of memory holding the data is changed), the change is reflected outside that function too.
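An immutable object such as a number cannot be modified at all. It is passed to a function by reference just like a list, but an operation such as number += 1 creates a new object and rebinds only the function's local name, so the change is not seen in the original.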
For example:
def fnA( my_list ):
    my_list.append( 'A' )

a_list = [ 'B' ]
fnA( a_list )
print( str( a_list ) )
# prints ['B', 'A']   <-- a_list was changed inside the function
But:
def fnB( number ):
    number += 1

x = 3
fnB( x )
print( x )
# prints 3   <-- x was NOT changed inside the function
So keeping in mind that every name referring to a mutable object sees the changes made to it, while an immutable object can only be replaced and never changed, it's fairly obvious why the two kinds operate differently.
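One way to check what a function actually receives, and what += does to its local name, is to compare identities with id(). This is only a sketch; the function name increment is made up for the example.

```python
def increment(number):
    received = id(number)  # identity of the object the caller passed in
    number += 1            # builds a new int and rebinds the local name only
    return received, id(number)

x = 3
received, rebound = increment(x)
print(received == id(x))    # True: the function got the very same object
print(received == rebound)  # False: += bound the local name to a new object
print(x)                    # 3: the caller's name still refers to 3
```

So the integer is not copied on the way in; the caller's variable is unaffected because the function never changes the object, it only points its own name elsewhere.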

Objects are things. Generally, they're what you see on the right hand side of an assignment.
Variable names (often just called "names") are references to the actual object. When a name is on the right hand side of an assignment¹, the object that it references is automatically looked up and used. The result of the expression on the right hand side is an object. The name on the left hand side of the assignment becomes a reference to this (possibly new) object.
Note, you can have object references that aren't explicit names if you are working with container objects (like lists or dictionaries):
a = [] # the name a is a reference to a list.
a.append(12345) # the container list holds a reference to an integer object
In a similar way, multiple names can refer to the same object:
a = []
b = a
We can demonstrate that they are the same object by looking at the id of a and b and noting that they are the same. Or, we can look at the "side-effects" of mutating the object referenced by a or b (if we mutate one, we mutate both because they reference the same object).
a.append(1)
print(a, b)  # look mom, both are [1]!
¹ More accurately, when a name is used in an expression.

In Python, strictly speaking, names are only references to objects and behave as labels. The assignment operator only binds an object to a name. Objects stay in memory until they are garbage collected.

Ok, first things first.
Remember, there are two types of objects in Python.
Mutable: values can be changed. E.g. dictionaries, lists and user-defined objects (unless defined immutable).
Immutable: values can't be changed. E.g. tuples, numbers, booleans and strings.
Now, when Python says PASS BY OBJECT REFERENCE, just remember that
If the underlying object is mutable, then any modifications done will persist.
and,
If the underlying object is immutable, then it cannot be modified in place; any apparent modification creates a new object, so it will not persist.
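The two cases can be sketched side by side with the same += operator applied to a list and to a tuple. The function names are made up for this illustration.

```python
def extend_list(items):
    items += [4]   # list: mutated in place, so the caller sees the change

def extend_tuple(items):
    items += (4,)  # tuple: a new tuple is built and bound to the local name only

numbers = [1, 2, 3]
extend_list(numbers)
print(numbers)  # [1, 2, 3, 4]

frozen = (1, 2, 3)
extend_tuple(frozen)
print(frozen)   # (1, 2, 3)
```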

why id(A()) == id(A()) is different to A() is A()?

I am very confused with the python code below:
>>> class A(): pass
...
>>> id(A()) == id(A())
True
>>> id(A())
19873304
>>> id(A())
19873304
>>> A() is A()
False
>>> a = A()
>>> b = A()
>>> id (a) == id (b)
False
>>> a is b
False
>>> id (a)
19873304
>>> id (b)
20333272
>>> def f():
...     print(id(A()))
...     print(id(A()))
...
>>> f()
20333312
20333312
I can't tell clearly what Python is doing when creating objects.
Can anyone tell me more about what happened? Thanks!

When you say
print(id(A()) == id(A()))
you are creating an object of type A and passing it to the id function. When the function returns, there are no references left to the object created for the parameter. So its reference count drops to zero and it becomes ready for garbage collection.
When you do id(A()) in the same expression again, you are creating another object of the same type. Now, Python might reuse the memory location that was used for the previous object (if it has already been garbage collected). Otherwise it will create the new object at a different location. So the id may or may not be the same.
If you take
print(A() is A())
we create an object of type A and compare it against another object of type A. The first object is still referenced within this expression, so it will not be marked for garbage collection, and the two objects will always be different.
Suggestion: Never do anything like this in production code.
Quoting from the docs,
Due to automatic garbage-collection, free lists, and the dynamic
nature of descriptors, you may notice seemingly unusual behaviour in
certain uses of the is operator, like those involving comparisons
between instance methods, or constants. Check their documentation for
more info.

Two different objects can be at the same location in memory, if one of them is freed before the other is created.
That is to say -- if you allocate an object, take its id, and then have no more reference to it, it can then be freed, so another object can get the same id.
By contrast, if you retain a reference to the first object, any subsequent object allocated will necessarily have a different id.
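A brief sketch of the two situations. The first two results follow from the uniqueness guarantee; the last line depends on whether the interpreter reuses the freed memory, so no output is claimed for it.

```python
class A:
    pass

# Both objects are kept alive by names, so their ids must differ.
first = A()
second = A()
print(id(first) != id(second))  # True

# Both temporaries are alive while "is" is evaluated, so they are distinct.
print(A() is A())               # False

# Here each temporary may be freed before the next one is created, so the
# ids MAY coincide. The result is implementation-dependent.
print(id(A()) == id(A()))
```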

A() creates a temporary object that is discarded as soon as id() returns,
so the next A() can get the same id (the one that was just freed, although this behavior is probably not guaranteed). Thus when you print them they have the same id.
A() is A()
has to keep two temporary objects alive at the same time, each with a different id.

Since id() is based on object pointer, it is only guaranteed to be unique if both objects are in memory. In some situations, the second instance of A probably reused the same spot in memory of the first (that had already been garbage-collected).
