This would be similar to the java.lang.Object.hashCode() method.
I need to store objects I have no control over in a set, and make sure that values are overwritten only when two objects are actually the same object, not merely equal in value.
id(x)
will do the trick for you. But I'm curious: what's wrong with a set of objects (which does combine objects by value)?
For your particular problem I would probably keep a set of ids or of wrapper objects. A wrapper object holds one reference and compares by x == y <==> x.ref is y.ref.
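A minimal sketch of such a wrapper (the class name IdentityWrapper is made up for illustration):

import collections.abc  # not required; just to note the wrapper is hashable

class IdentityWrapper(object):
    def __init__(self, obj):
        self.ref = obj
    def __eq__(self, other):
        # Equal exactly when both wrap the very same object
        return isinstance(other, IdentityWrapper) and self.ref is other.ref
    def __hash__(self):
        return id(self.ref)

seen = set()
x, y = [1, 2, 3], [1, 2, 3]        # equal values, distinct objects
seen.add(IdentityWrapper(x))
print(IdentityWrapper(x) in seen)  # True: same underlying object
print(IdentityWrapper(y) in seen)  # False: equal value only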
It's also worth noting that Python objects have a hash function as well. This function is necessary to put an object into a set or dictionary. It is allowed to collide for different objects, though good hash implementations try to make that unlikely.
That's what "is" is for.
Instead of testing "if a == b", which tests for the same value,
test "if a is b", which tests whether they are the same object.
As ilya n mentions, id(x) produces a unique identifier for an object.
But your question is confusing, since Java's hashCode method doesn't give a unique identifier. Java's hashCode works like most hash functions: it always returns the same value for the same object, two objects that are equal always get equal codes, and unequal hash codes imply unequal objects. In particular, two different and unequal objects can get the same value.
This is confusing because cryptographic hash functions are quite different from this, and more like (though not exactly) the "unique id" that you asked for.
The Python equivalent of Java's hashCode method is hash(x).
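A quick illustration of the difference between hash() and id() (the tuples are built at runtime so they are guaranteed to be distinct objects):

t1 = tuple([1, 2, 3])        # a fresh object
t2 = tuple([1, 2, 3])        # another fresh, equal object
print(t1 == t2)              # True: same value
print(hash(t1) == hash(t2))  # True: equal objects must hash equal
print(id(t1) == id(t2))      # False: two distinct live objects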
You don't have to compare objects before placing them in a set; set() semantics already take care of this.
class A(object):
    a = 10
    b = 20

    def __hash__(self):
        return hash((self.a, self.b))
a1 = A()
a2 = A()
a3 = A()
a4 = a1
s = set([a1,a2,a3,a4])
s
=> set([<__main__.A object at 0x222a8c>, <__main__.A object at 0x220684>, <__main__.A object at 0x22045c>])
Note: you really don't have to override __hash__ to get this behaviour :-)
Python documentation for id() function states the following:
This is an integer which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
CPython implementation detail: This is the address of the object in memory.
However, the snippet below shows that ids are repeated. Since I didn't explicitly del the objects, I presume they are all alive and unique (I do not know what non-overlapping means here).
>>> g = [0, 1, 0]
>>> for h in g:
... print(h, id(h))
...
0 10915712
1 10915744
0 10915712
>>> a=0
>>> b=1
>>> c=0
>>> d=[a, b,c]
>>> for e in d:
... print(e, id(e))
...
0 10915712
1 10915744
0 10915712
>>> id(a)
10915712
>>> id(b)
10915744
>>> id(c)
10915712
>>>
How can the id values for different objects be the same? Is it so because the value 0 (object of class int) is a constant and the interpreter/C compiler optimizes?
If I were to do a = c, then I understand c to have the same id as a since c would just be a reference to a (alias). I expected the objects a and c to have different id values otherwise, but, as shown above, they have the same values.
What's happening? Or am I looking at this the wrong way?
I would expect the ids for objects of user-defined classes to ALWAYS be unique even if they have the exact same member values.
Could someone explain this behavior? (I looked at the other questions that ask uses of id(), but they steer in other directions)
EDIT (09/30/2019):
To extend what I already wrote: I ran Python interpreters in separate terminals and checked the ids for 0 in all of them; they were exactly the same across instances of a given interpreter, i.e. multiple instances of the same interpreter had the same id for 0. Python 2 vs Python 3 had different values, but any one Python 2 interpreter gave the same id values.
My question arises because the id() documentation doesn't mention any such optimizations, which seems misleading (I don't expect every quirk to be noted, but some note alongside the CPython remark would be nice)...
EDIT 2 (09/30/2019):
The question stems from wanting to understand this behavior and to know whether there are hooks to optimize user-defined classes in a similar way (by modifying the __eq__ method to identify whether two objects are the same; perhaps they would then point to the same address in memory, i.e. have the same id? Or by using some metaclass properties).
Ids are guaranteed to be unique for the lifetime of the object. If an object gets deleted, a new object can acquire the same id. CPython will delete items immediately when their refcount drops to zero. The garbage collector is only needed to break up reference cycles.
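You can often observe that reuse directly in CPython: in the expression below, the first temporary object dies before the second is created, so the second frequently lands at the same address (common, but not guaranteed):

>>> id(object()) == id(object())
True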
CPython may also cache and re-use certain immutable objects like small integers and strings defined by literals that are valid identifiers. This is an implementation detail that you should not rely upon. It is generally considered improper to use is checks on such objects.
There are certain exceptions to this rule, for example, using an is check on possibly-interned strings as an optimization before comparing them with the normal == operator is fine. The dict builtin uses this strategy for lookups to make them faster for identifiers.
a is b or a == b # This is OK
If the string happens to be interned, then the above can return true with a simple id comparison instead of a slower character-by-character comparison, but it still returns true if and only if a == b (because if a is b then a == b must also be true). However, a good implementation of .__eq__() would already do an is check internally, so at best you would only avoid the overhead of calling .__eq__().
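If you want interning explicitly, sys.intern is available; a small sketch (the strings are built at runtime so they would otherwise be distinct objects):

import sys

s1 = sys.intern("".join(["not", " an", " identifier"]))
s2 = sys.intern("".join(["not", " an", " identifier"]))
print(s1 is s2)  # True: both names refer to the single interned copy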
Thanks for the answer! Could you elaborate on the uniqueness for user-defined objects; are they always unique?
The id of any object (be it user-defined or not) is unique for the lifetime of the object. It's important to distinguish objects from variables. It's possible to have two or more variables refer to the same object.
>>> a = object()
>>> b = a
>>> c = object()
>>> a is b
True
>>> a is c
False
Caching optimizations mean that you are not always guaranteed to get a new object in cases where one might naively think one should, but this does not in any way violate the uniqueness guarantee of IDs. Builtin types like int and str may have some caching optimizations, but they follow exactly the same rules: If they are live at the same time, and their IDs are the same, then they are the same object.
Caching is not unique to builtin types. You can implement caching for your own objects.
>>> def the_one(it=object()):
...     return it
...
>>> the_one() is the_one()
True
Even user-defined classes can cache instances. For example, this class only makes one instance of itself.
>>> class TheOne:
...     _the_one = None
...     def __new__(cls):
...         if not cls._the_one:
...             cls._the_one = super().__new__(cls)
...         return cls._the_one
...
>>> TheOne() is TheOne() # There can be only one TheOne.
True
>>> id(TheOne()) == id(TheOne()) # This is what an is-check does.
True
Note that each construction expression evaluates to an object with the same id as the other. But this id is unique to the object. Both expressions reference the same object, so of course they have the same id.
The above class keeps only one instance, but you could also cache some other number: perhaps recently used instances, or those configured in ways you expect to be common (as int does), etc.
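A sketch of what caching by configuration could look like (the class and its cache attribute are made up for illustration):

class CachedPoint:
    _cache = {}

    def __new__(cls, x, y):
        key = (x, y)
        if key not in cls._cache:
            cls._cache[key] = super().__new__(cls)
        return cls._cache[key]

    def __init__(self, x, y):
        # __init__ runs again on the cached instance; harmless here,
        # since it just reassigns the same values.
        self.x, self.y = x, y

print(CachedPoint(0, 0) is CachedPoint(0, 0))  # True: same cached instance
print(CachedPoint(0, 0) is CachedPoint(1, 1))  # False: different keys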
I come from Java where even mutable objects can be "hashable".
And I am playing with Python 3.x these days just for fun.
Here is the definition of hashable in Python (from the Python glossary).
hashable
An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() method). Hashable objects which compare equal must have the same hash value.
Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.
All of Python’s immutable built-in objects are hashable; mutable containers (such as lists or dictionaries) are not. Objects which are instances of user-defined classes are hashable by default. They all compare unequal (except with themselves), and their hash value is derived from their id().
I read it and I am thinking...
Still... why didn't they make even mutable objects hashable in Python, e.g. using the same default hashing mechanism as for user-defined objects, i.e. as described by the last two sentences above?
Objects which are instances of user-defined classes are hashable by default. They all compare unequal (except with themselves), and their hash value is derived from their id().
This feels somewhat weird... so user-defined mutable objects are hashable (via this default mechanism) but built-in mutable objects are not. Doesn't this just complicate things? I don't see what benefit it brings; could someone explain?
In Python, mutable objects can be hashable, but it is generally not a good idea, because equality is usually defined in terms of those mutable attributes, and that can lead to all sorts of crazy behavior.
If built-in mutable objects were hashed based on identity, like the default hashing mechanism for user-defined objects, their hash would be inconsistent with their equality, and that is absolutely a problem. User-defined objects, however, by default both compare and hash based on identity, so the two stay consistent; it isn't as bad a situation, although this state of affairs isn't very useful.
Note: if you implement __eq__ in a user-defined class without also defining __hash__, then __hash__ is set to None, making the class unhashable.
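You can verify this directly:

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)
    # no __hash__ defined alongside __eq__, so __hash__ becomes None

print(Point.__hash__)      # None
try:
    {Point(1, 2)}          # putting it in a set needs a hash
except TypeError as e:
    print(e)               # unhashable type: 'Point'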
So, from the Python 3 data model documentation:
User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).
A class that overrides __eq__() and does not define __hash__() will have its __hash__() implicitly set to None. When the __hash__() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking isinstance(obj, collections.abc.Hashable).
Calculating a hash value is like giving an identity to an object, which simplifies comparing objects. Comparison by hash value is generally faster than comparison by value: for an object you compare its attributes, for a collection you compare its items, recursively…
If an object is mutable, you need to recalculate its hash value after each change. If the object compared equal to another one, after a change it becomes unequal. So mutable objects must be compared by value, not by hash; it makes no sense to compare mutable objects by hash value.
Edit: Java HashCode
Typically, hashCode() just returns the object's address in memory if you don't override it.
See the reference about the hashCode function.
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)
So, the Java hashCode function works the same as the default Python __hash__ function.
In Java, if you use a mutable object in a HashSet, for instance, the HashSet stops working properly: because the hashCode depends on the state of the object, the object can no longer be retrieved, so the check for containment fails.
From reading other comments/answers, it seems that what you're not buying is that the hash of a mutable entity has to change when it mutates, and that you think you can just hash by id instead, so I'll try to elaborate on this point.
To quote you:
#kindall Hm... Who says that the hash value has to come from the values in the list? And that if you e.g. add a new value you have to rehash the list, get a new hash value, etc.. In other languages that's not how it is... this is my point. In other languages the hash value just comes from the id (or is the id itself, just like for user-defined mutable Python objects)... And OK... I just feel it makes things a bit too complicated in Python (especially for beginners... not for me).
This isn't exactly false (although I don't know which "other" languages you are referencing); you could do that, but there are some pretty dire consequences:
class HashableList(list):
    def __hash__(self):
        return id(self)

x = HashableList([1, 2, 3])
y = HashableList([1, 2, 3])

our_set = {x}

print("Is x in our_set? ", x in our_set)
print("Is y in our_set? ", y in our_set)
print("Are x and y equal? ", x == y)
This (unexpectedly) outputs:
Is x in our_set? True
Is y in our_set? False <-- potentially confusing
Are x and y equal? True
This means that the hash is not consistent with equality, which is just downright confusing.
You might counter with "well, just hash by the contents then", but I think you already understand that if the contents change then you get other undesirable behavior (for example):
class HashableListByContents(list):
    def __hash__(self):
        return sum(hash(x) for x in self)

a = HashableListByContents([1, 2, 3])
b = HashableListByContents([1, 2, 3])

our_set = {a}

print('Is a in our_set? ', a in our_set)
print('Is b in our_set? ', b in our_set)
print('Are a and b equal? ', a == b)
This outputs:
Is a in our_set? True
Is b in our_set? True
Are a and b equal? True
So far so good! But...
a.append(2)
print('Is a still in our set? ', a in our_set)
This outputs:
Is a still in our set? False <-- potentially confusing
I am not a Python beginner, so I would not presume to know what would or would not confuse one, but either way this seems confusing to me (at best). My two cents: it's simply incorrect to hash mutable objects. We even have functional purists who claim mutable objects themselves are incorrect, period! Python won't stop you from doing any of what you described, because it would never force a paradigm like that, but it's really asking for trouble no matter which route you go down.
HTH!
This is more a theoretical question than practical.
Suppose I have a Python set s filled with instances of some class C, where C has some custom equality relation. I also have on hand an instance x of C, and I know that there is an object x' in s such that x == x', although id(x) != id(x').
How can I get a reference to the object x'?
Things I've tried already:
s & {x} / {x} & s. In both cases, the expression evaluates to a set containing the object x, not x'. In the CPython source for set intersection, when the sets to intersect have different sizes, the result set takes its objects from the smaller set.
s.pop(), which gives me no control over which element is removed.
As motivation, s might be a cache of objects. Rather than storing duplicate instances of a class in memory, we'd rather reuse the instance we've already created. So we create the potentially-duplicate instance, check if we've created an identical instance before, and if so, use that instance instead.
I know that I can accomplish similar behaviour with a dictionary, but I'm specifically curious about whether it can be done with just the set interface.
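For reference, the dictionary version I have in mind is just a mapping from each object to itself, so the lookup hands back the stored instance (assuming C defines __hash__ consistently with its __eq__):

cache = {}

def canonical(obj):
    # setdefault stores obj the first time; later calls with an equal
    # object return the instance that was stored first (the x' I want).
    return cache.setdefault(obj, obj)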
In Python:
len(a) can be replaced by a.__len__()
str(a) or repr(a) can be replaced by a.__str__() or a.__repr__()
== is __eq__, + is __add__, etc.
Is there a similar method to get id(a)? If not, is there any workaround to get a unique id of a Python object without using id()?
Edit: additional question: if not, is there any reason not to define an __id__()?
No, this behavior cannot be changed. id() is used to get "an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime" (source). No other special meaning is given to this integer (in CPython it is the address of the memory location where the object is stored, but this cannot be relied upon in portable Python).
Since there is no special meaning for the return value of id(), it makes no sense to allow you to return a different value instead.
Further, while you could guarantee that id() would return unique integers for your own objects, you could not possibly satisfy the global uniqueness constraint, since your object cannot possibly have knowledge of all other living objects. It would be possible (and likely) that one of your special values clashes with the identity of another object alive in the runtime. This would not be an acceptable scenario.
If you need a return value that has some special meaning then you should define a method where appropriate and return a useful value from it.
An object isn't aware of its own name (it can have many), let alone of any unique ID associated with it. So, in short: no. The reason that __len__ and co. work is that they are bound to the object already; an object is not bound to its ID.
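That said, if you control the class, you can hand out identifiers of your own without touching id(); a sketch (the attribute name uid is made up):

import itertools

class Tagged:
    _counter = itertools.count()

    def __init__(self):
        # Unlike id(), these serial numbers are never reused,
        # even after an instance is garbage-collected.
        self.uid = next(self._counter)

a, b = Tagged(), Tagged()
print(a.uid, b.uid)  # 0 1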
I have several Python scripts running this simple code in parallel:
test_id = id('test')
Is test_id unique or not?
http://docs.python.org/library/functions.html#id
Return the “identity” of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
CPython implementation detail: This is the address of the object.
So yes, the IDs are unique.
However, since Python strings are immutable and string literals may be interned, id('test') may be the same for every occurrence of 'test', since 'test' is 'test' is likely to be True.
What do you mean unique? Unique among what?
It is just an identifier for the piece of memory used by the parameter's value. For immutable objects with the same value, it is often the same:
>>> id('foo') == id('fo' + 'o')
True
In CPython, id is the pointer to the object in memory.
>>> a = [1,2,3]
>>> b = a
>>> id(a) == id(b)
True
So, if you have multiple references to the same object (and, in some corner cases, small strings are created only once, as are integers smaller than 257), it will not be unique.
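You can see the small-integer cache at work (a CPython detail; the exact behaviour can vary between versions):

>>> a = 256
>>> b = 256
>>> a is b   # 256 is inside CPython's small-int cache
True
>>> a = 257
>>> b = 257
>>> a is b   # 257 usually is not, when assigned on separate lines
False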
It might help if you explained what you are trying to do; it isn't really typical to use the id() builtin for anything, least of all strings, unless you really know what you're doing.
Python docs nicely describe the id() builtin function:
This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same id() value.
As I read this, the return values of id() are really only guaranteed to be unique within one interpreter instance, and even then only while the objects' lifetimes overlap. Saving these ids for later use, sending them over sockets, etc. seems not useful. Again, I don't think this is really for people who don't already know that they need it.
If you want to generate IDs which are unique across multiple program instances, you might check out the uuid module.
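For example (uuid is in the standard library):

import uuid

# uuid4() gives a random 128-bit identifier that is unique across
# processes and machines for all practical purposes, so it is safe
# to store or send over a socket, unlike id() values.
token = uuid.uuid4()
print(token)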
It also occurs to me that you might be trying to produce hashes from Python objects.
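If that's the goal, hashlib over some stable serialization is the usual route; a sketch (using repr() as the serialization is an assumption that only holds for objects with stable, value-based reprs):

import hashlib

def stable_digest(obj):
    # Assumes repr(obj) is stable and value-based, e.g. tuples of
    # primitives; this is not true for arbitrary objects.
    return hashlib.sha256(repr(obj).encode("utf-8")).hexdigest()

print(stable_digest((1, 2, "three")))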
Probably there is some approach to your problem that will be cleaner than trying to use the id() function; maybe the problem needs reformulating.