Is there a best practice to determine the equality of two arbitrary Python objects?
Let's say I write a container for some sort of object and I need to figure out whether new objects are equal to the old ones stored in the container.
The problem is I cannot use "is", since this only checks whether the variables are bound to the very same object (but we might have a deep copy of an object, which in my view is equal to its original). I cannot use "==" either, since some of these objects return an element-wise result, like numpy arrays.
Is there a best practice to determine the equality of any kind of objects?
For instance would
repr(objectA)==repr(objectB)
suffice?
Or is it common to use:
numpy.all(objectA==objectB)
Which probably fails if objectA == objectB evaluates to "[]"
Cheers,
Robert
EDIT:
Ok, regarding the 3rd comment, I elaborate more on
"What's your definition of "equal objects"?"
In the strong sense I don't have any definition of equality; I would rather let the objects decide whether they are equal or not. The problem is, as far as I understand, there is no well-agreed standard for eq or ==, respectively. The statement can return arrays or all kinds of things.
What I have in mind is some operator, let's call it SEQ (strong equality), in between eq and "is".
SEQ is superior to eq in the sense that it will always evaluate to a single boolean value (for numpy arrays that could mean all elements are equal, for example) and determine if the objects consider themselves equal or not. But SEQ would be inferior to "is" in the sense that objects that are distinct in memory can be equal as well.
I suggest you write a custom recursive equality-checker, something like this:
from collections.abc import Sequence, Mapping, Set
import numpy as np

def nested_equal(a, b):
    """
    Compare two objects recursively by element, handling numpy objects.
    Assumes hashable items are not mutable in a way that affects equality.
    """
    # Objects of different classes are never considered equal.
    if a.__class__ != b.__class__:
        return False
    # for types that implement their own custom strict equality checking
    seq = getattr(a, "seq", None)
    if seq and callable(seq):
        return seq(b)
    # Check equality according to type.
    if isinstance(a, (str, bytes)):
        return a == b
    if isinstance(a, np.ndarray):
        # array_equal returns a single bool and is False on a shape mismatch
        return np.array_equal(a, b)
    if isinstance(a, Sequence):
        # zip() alone would silently ignore the tail of the longer sequence
        return len(a) == len(b) and all(nested_equal(x, y) for x, y in zip(a, b))
    if isinstance(a, Mapping):
        if set(a.keys()) != set(b.keys()):
            return False
        return all(nested_equal(a[k], b[k]) for k in a.keys())
    if isinstance(a, Set):
        return a == b
    return a == b
The assumption that hashable objects are not mutable in a way that affects equality is rather safe, since it would break dicts and sets if such objects were used as keys.
So many tutorials have stated that the == comparison operator is for value equality, like in this answer, quote:
== is for value equality. Use it when you would like to know if two objects have the same value.
is is for reference equality. Use it when you would like to know if two references refer to the same object.
However, I found that the Python doc says that:
x==y calls x.__eq__(y). By default, object implements __eq__() by using is, returning NotImplemented in the case of a false comparison: True if x is y else NotImplemented.
It seems the default behavior of the == operator is to compare reference equality, like the is operator, which contradicts what these tutorials say.
So what exactly should I use == for? Value equality or reference equality? Or does it just depend on how you implement the __eq__ method?
I think the doc of Value comparisons has illustrated this question clearly:
The operators <, >, ==, >=, <=, and != compare the values of two objects. The value of an object is a rather abstract notion in Python. Comparison operators implement a particular notion of what the value of an object is. One can think of them as defining the value of an object indirectly, by means of their comparison implementation.
The behavior of the default equality comparison, that instances with different identities are always unequal, may be in contrast to what types will need that have a sensible definition of object value and value-based equality. Such types will need to customize their comparison behavior, and in fact, a number of built-in types have done that.
The default behavior for equality comparison (== and !=) is based on the identity of the objects. Hence, equality comparison of instances with the same identity results in equality, and equality comparison of instances with different identities results in inequality. A motivation for this default behavior is the desire that all objects should be reflexive (i.e. x is y implies x == y).
It also includes a list that describes the comparison behavior of the most important built-in types like numbers, strings and sequences, etc.
It solely depends on what __eq__ does. The default __eq__ of type object behaves like is. Some builtin datatypes use their own implementation. For example, two lists are equal if all their values are equal. You just have to know this.
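A minimal sketch of that difference, using two hypothetical classes (Plain and Tagged are names made up for this illustration): Plain inherits the identity-based default from object, while Tagged defines its own value-based __eq__.

```python
class Plain:
    """Inherits object.__eq__, so == falls back to identity."""

    def __init__(self, tag):
        self.tag = tag


class Tagged:
    """Defines value-based equality on the tag attribute."""

    def __init__(self, tag):
        self.tag = tag

    def __eq__(self, other):
        return isinstance(other, Tagged) and self.tag == other.tag


p1, p2 = Plain("a"), Plain("a")
t1, t2 = Tagged("a"), Tagged("a")

plain_equal = (p1 == p2)        # default __eq__: two distinct objects
plain_same = (p1 == p1)         # default __eq__: the very same object
tagged_equal = (t1 == t2)       # custom __eq__: compares the tags
tagged_identical = (t1 is t2)   # identity is unaffected by __eq__
```

Only Tagged treats two separately created instances with the same data as equal; identity stays False in both cases.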
object implements __eq__() by using is, but many classes in the standard library implement __eq__() using value equality. E.g.:
>>> l1 = [1, 2, 3]
>>> l2 = [1, 2, 3]
>>> l3 = l1
>>> l1 is l2
False
>>> l1 == l2
True
>>> l1 is l3
True
In your own classes, you can implement __eq__() as it makes sense, e.g.:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y
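One possible refinement, offered as a sketch rather than the only correct approach: comparing a Point with an object that has no x or y attribute raises AttributeError, and a class that defines __eq__ without __hash__ becomes unhashable. The hypothetical SafePoint below returns NotImplemented for foreign types and defines a matching __hash__.

```python
class SafePoint:
    """Hypothetical variant of Point with defensive equality and hashing."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, SafePoint):
            # Let Python try the other operand, then fall back to "not equal".
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Equal points must have equal hashes to work in sets and dicts.
        return hash((self.x, self.y))


same = SafePoint(1, 2) == SafePoint(1, 2)
different = SafePoint(1, 2) == SafePoint(1, 3)
foreign = SafePoint(1, 2) == "not a point"   # no AttributeError
unique = {SafePoint(1, 2), SafePoint(1, 2)}  # duplicates collapse
```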
To add to your thought process, here is a simple definition:
The equality operator == compares the values of both operands and checks for value equality, whereas the is operator checks whether both operands refer to the same object (present in the same memory location).
In a nutshell, is checks whether two references point to the same object, and == checks whether two objects have the same value.
For example:
a = [1, 2, 3]
b = a        # a and b point to the same object
c = list(a)  # c points to a different object
if a == b:
    print('#')     # output: #
if a is b:
    print('##')    # output: ##
if a == c:
    print('###')   # output: ###
if a is c:
    print('####')  # no output, as c and a point to different objects
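one = 1
a = one
b = one
if a == b:  # compares values, i.e. works like: if 1 == 1
    print('equal')
if a is 1:  # compares identities: is a the very same object as this literal 1?
    print('identical')
Thus, a is 1 is not guaranteed to work, because its operands need not point to the same location in memory. CPython happens to cache small integers, so the test can come out True there, but that is an implementation detail (and recent versions emit a SyntaxWarning for is with a literal). Use == to compare numbers.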
Why is the in operator defined for generators?
>>> def foo():
...     yield 42
...
>>>
>>> f = foo()
>>> 10 in f
False
What are the possible use cases?
I know that range(...) objects have a __contains__ function defined so that we can do stuff like this:
>>> r = range(10)
>>> 4 in r
True
>>> r.__contains__
<method-wrapper '__contains__' of range object at 0x7f82bd51cc00>
But f above doesn't have a __contains__ method.
"What are the possible use cases?" To check if the generator will produce some value.
Dunder methods serve as hooks for the particular syntax they are associated with. __contains__ isn't some kind of one-to-one mapping to x in y. The language ultimately defines the semantics of these operators.
From the documentation of membership testing, we see there are several ways for x in y to be evaluated, depending on various properties of the objects involved. I've highlighted the relevant one for generator objects, which do not define a __contains__ but are iterable, i.e., they define an __iter__ method:
The operators in and not in test for membership. x in s evaluates to
True if x is a member of s, and False otherwise. x not in s returns
the negation of x in s. All built-in sequences and set types support
this as well as dictionary, for which in tests whether the dictionary
has a given key. For container types such as list, tuple, set,
frozenset, dict, or collections.deque, the expression x in y is
equivalent to any(x is e or x == e for e in y).
For the string and bytes types, x in y is True if and only if x is a
substring of y. An equivalent test is y.find(x) != -1. Empty strings
are always considered to be a substring of any other string, so "" in "abc" will return True.
For user-defined classes which define the __contains__() method, x in
y returns True if y.__contains__(x) returns a true value, and False
otherwise.
For user-defined classes which do not define __contains__() but do
define __iter__(), x in y is True if some value z, for which the
expression x is z or x == z is true, is produced while iterating over
y. If an exception is raised during the iteration, it is as if in
raised that exception.
Lastly, the old-style iteration protocol is tried: if a class defines
__getitem__(), x in y is True if and only if there is a non-negative integer index i such that x is y[i] or x == y[i], and no lower integer
index raises the IndexError exception. (If any other exception is
raised, it is as if in raised that exception).
The operator not in is defined to have the inverse truth value of in.
To summarize, x in y will be defined for objects that:
1. are strings or bytes, in which case it is defined as a substring relationship
2. are of types that define __contains__
3. are of types that are iterable, i.e. that define __iter__
4. support the old-style iteration protocol (which relies on __getitem__)
Generators fall into 3.
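The cases above can be sketched with three hypothetical classes (WithContains, WithIter and WithGetitem are illustrative names, not part of any library) plus a generator, each supporting in through a different hook:

```python
class WithContains:
    """Membership is decided by __contains__ alone."""

    def __contains__(self, item):
        return item == "x"


class WithIter:
    """No __contains__: Python iterates and compares each value."""

    def __iter__(self):
        return iter(["a", "b"])


class WithGetitem:
    """Old-style protocol: indices 0, 1, 2, ... until IndexError."""

    def __getitem__(self, index):
        if index < 3:
            return index * 10
        raise IndexError(index)


def gen():
    yield 1
    yield 2


via_contains = "x" in WithContains()
via_iter = "b" in WithIter()
via_getitem_hit = 20 in WithGetitem()
via_getitem_miss = 25 in WithGetitem()
via_generator = 2 in gen()
generator_has_contains = hasattr(gen(), "__contains__")
```

The generator has no __contains__ of its own, yet in still works on it because it is iterable.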
A broader point: you really shouldn't use the dunder methods directly unless you really understand what they are doing. Even then, it may be best to avoid it.
It usually isn't worth trying to be credible or succinct by using something to the effect of:
x.__lt__(y)
Instead of:
x < y
You should at least understand that this might happen:
>>> (1).__lt__(3.)
NotImplemented
>>>
And if you are just naively doing stuff like filter((1).__lt__, iterable) then you've probably got a bug.
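As a hedged illustration of the safer route, the sketch below uses operator.lt with functools.partial instead of the bound dunder method. The sample values are made up; the point is that operator.lt goes through the full < machinery, including the fallback to the other operand's reflected method.

```python
import operator
from functools import partial

values = [0, 2, 3.0, 0.5]

# The dunder method gives up on a float operand instead of comparing.
raw = (1).__lt__(3.0)

# operator.lt(1, x) is the same as 1 < x, so the float fallback is applied.
kept = list(filter(partial(operator.lt, 1), values))
```

Here raw is the NotImplemented sentinel rather than a boolean, whereas kept contains exactly the values greater than 1.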
Generators are iterable, and so have an .__iter__ method, which can be used to check membership.
How this behaves is described in the Expressions docs on Membership test operations:
For user-defined classes which do not define __contains__() but do define __iter__(), x in y is True if some value z, for which the expression x is z or x == z is true, is produced while iterating over y. If an exception is raised during the iteration, it is as if in raised that exception.
Emphasis mine!
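This consumes the generator up to and including the first match (or entirely, if there is no match), and your example, which yields only 42, simply never produces the tested value of 10. In the following session, the second test returns False because the first one has already consumed the 10: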
>>> def foo():
...     yield 5
...     yield 10
...
>>> f = foo()
>>> 10 in f
True
>>> 10 in f
False
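A small sketch of how much of the generator a membership test uses up (numbers is a hypothetical generator written for this illustration): iteration stops at the first match, so later values are still available afterwards.

```python
def numbers():
    yield 5
    yield 10
    yield 15


g = numbers()
first = 10 in g      # iteration stops as soon as 10 is produced
leftover = list(g)   # whatever came after the match is still there
second = 10 in g     # the generator is exhausted by now
```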
I have written some code in Python which has a class called product, and I have overridden the magic methods __eq__ and __hash__. Now I need to make a set which removes duplicates from the list based on the ID of the product. As you can see from the output of this code, the hashes of two objects are the same, yet when I make a set of those two objects the length is 2, not 1.
But when I change the __eq__ method of the code to this
def __eq__(self, b) -> bool:
    if self.id == b.id:
        return True
    return False
and use it with the same hash function, it works and the length of the set is 1. So I am confused whether the set data structure uses the __eq__ method or the __hash__ method to test for equality.
Equality tests can be expensive, so the set starts by comparing hashes. If the hashes are not equal, then the check ends. If the hashes are equal, the set then tests for equality. If it only used __eq__, it might have to do a lot of unnecessary work, but if it only used __hash__, there would be no way to resolve a hash collision.
Here's a simple example of using equality to resolve a hash collision. In CPython, small integers are their own hashes, except for -1:
>>> hash(-1)
-2
>>> hash(-2)
-2
>>> s = set()
>>> s.add(-1)
>>> -2 in s
False
Here's an example of the set skipping an equality check because the hashes aren't equal. Let's subclass int so it returns a new hash every second:
>>> import time
>>> class TimedInt(int):
...     def __hash__(self):
...         return int(time.time())
...
>>> a = TimedInt(5)
>>> a == 5
True
>>> a == a
True
>>> s = set()
>>> s.add(a) # Now wait a few seconds...
>>> a in s
False
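Applied to the question, a minimal sketch might look like this. The Product class here is a guess at what the asker's class looks like, and the eq_calls counter and AlwaysDifferent class exist only for illustration: a set needs both a matching hash and a true __eq__ before it treats two objects as duplicates.

```python
class Product:
    eq_calls = 0  # illustration only: counts how often __eq__ runs

    def __init__(self, id, name):
        self.id = id
        self.name = name

    def __eq__(self, other):
        Product.eq_calls += 1
        if not isinstance(other, Product):
            return NotImplemented
        return self.id == other.id

    def __hash__(self):
        return hash(self.id)


class AlwaysDifferent:
    """Same hash for every instance, but identity-based equality."""

    def __hash__(self):
        return 42


a = Product(1, "pen")
b = Product(1, "pen (duplicate)")
c = Product(2, "ink")

unique = {a, b, c}                                  # a and b collapse into one
colliding = {AlwaysDifferent(), AlwaysDifferent()}  # equal hashes, both kept
```

Equal hashes alone are not enough, as the colliding set shows, and __eq__ is only consulted once two hashes match.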
In Python, it is known that in checks for membership in iterables (lists, dictionaries, etc.) and looks for substrings in strings. My question is about how in is implemented to achieve all of the following: 1) test for membership, 2) test for substrings and 3) access to the next element in a for-loop. For example, when for i in myList: or if i in myList: is executed, does in call myList.__next__()? If it does call it, how then does it work with strings, given that str objects are not iterators (as checked in Python 2.7) and so do not have the next() method? If a detailed discussion of in's implementation is not possible, I would appreciate it if a gist of it were supplied here.
A class can define how the in operator works on instances of that class by defining a __contains__ method.
The Python data model documentation says:
For objects that don’t define __contains__(), the membership test first tries iteration via __iter__(), then the old sequence iteration protocol via __getitem__(), see this section in the language reference.
Section 6.10.2, "Membership test operations", of the Python language reference has this to say:
The operators in and not in test for membership. x in s evaluates to True if x is a member of s, and False otherwise. x not in s returns the negation of x in s. All built-in sequences and set types support this as well as dictionary, for which in tests whether the dictionary has a given key. For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).
For the string and bytes types, x in y is True if and only if x is a substring of y. An equivalent test is y.find(x) != -1. Empty strings are always considered to be a substring of any other string, so "" in "abc" will return True.
For user-defined classes which define the __contains__() method, x in y returns True if y.__contains__(x) returns a true value, and False otherwise.
For user-defined classes which do not define __contains__() but do define __iter__(), x in y is True if some value z with x == z is produced while iterating over y. If an exception is raised during the iteration, it is as if in raised that exception.
Lastly, the old-style iteration protocol is tried: if a class defines __getitem__(), x in y is True if and only if there is a non-negative integer index i such that x == y[i], and all lower integer indices do not raise IndexError exception. (If any other exception is raised, it is as if in raised that exception).
The operator not in is defined to have the inverse true value of in.
As a comment indicates above, the expression operator in is distinct from the keyword in which forms a part of the for statement. In the Python grammar, the in is "hardcoded" as a part of the syntax of for:
for_stmt ::= "for" target_list "in" expression_list ":" suite
["else" ":" suite]
So in the context of a for statement, in doesn't behave as an operator, it's simply a syntactic marker to separate the target_list from the expression_list.
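The distinction can be observed with a small sketch. Tracker is a hypothetical class that records which special method each construct calls: the for statement goes through __iter__, while the in operator prefers __contains__ when it exists.

```python
class Tracker:
    """Records which special methods are invoked (illustration only)."""

    def __init__(self, items):
        self.items = list(items)
        self.calls = []

    def __iter__(self):
        self.calls.append("__iter__")
        return iter(self.items)

    def __contains__(self, item):
        self.calls.append("__contains__")
        return item in self.items


t = Tracker([1, 2, 3])

for _ in t:           # the for statement asks for an iterator
    pass
after_for = list(t.calls)

found = 2 in t        # the in operator calls __contains__ directly
after_in = list(t.calls)

# A str is iterable but is not itself an iterator: it has no __next__,
# while the object returned by iter() does.
str_has_contains = hasattr(str, "__contains__")
str_has_next = hasattr("abc", "__next__")
str_iterator_has_next = hasattr(iter("abc"), "__next__")
```

This also covers the string part of the question: str implements __contains__ for substring tests, and a for loop over a string works on the separate iterator returned by iter().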
Python has a __contains__ special method that is used when you do item in collection.
For example, here's a class that "__contains__" all even numbers:
>>> class EvenNumbers:
...     def __contains__(self, item):
...         return item % 2 == 0
...
>>> en = EvenNumbers()
>>> 2 in en
True
>>> 3 in en
False
>>>
In a comment on this question, I saw a statement that recommended using
result is not None
vs
result != None
What is the difference? And why might one be recommended over the other?
== is an equality test. It checks whether the right hand side and the left hand side are equal objects (according to their __eq__ or __cmp__ methods.)
is is an identity test. It checks whether the right hand side and the left hand side are the very same object. No method calls are made, so objects can't influence the is operation.
You use is (and is not) for singletons, like None, where you don't care about objects that might want to pretend to be None or where you want to protect against objects breaking when being compared against None.
First, let me go over a few terms. If you just want your question answered, scroll down to "Answering your question".
Definitions
Object identity: When you create an object, you can assign it to a variable. You can then also assign it to another variable. And another.
>>> button = Button()
>>> cancel = button
>>> close = button
>>> dismiss = button
>>> print(cancel is close)
True
In this case, cancel, close, and dismiss all refer to the same object in memory. You only created one Button object, and all three variables refer to this one object. We say that cancel, close, and dismiss all refer to identical objects; that is, they refer to one single object.
Object equality: When you compare two objects, you usually don't care whether they refer to the exact same object in memory. With object equality, you can define your own rules for how two objects compare. When you write if a == b:, you are essentially saying if a.__eq__(b):. This lets you define an __eq__ method on a so that you can use your own comparison logic.
Rationale for equality comparisons
Rationale: Two objects have the exact same data, but are not identical. (They are not the same object in memory.)
Example: Strings
>>> greeting = "It's a beautiful day in the neighbourhood."
>>> a = unicode(greeting)
>>> b = unicode(greeting)
>>> a is b
False
>>> a == b
True
Note: I use unicode strings here because Python is smart enough to reuse regular strings without creating new ones in memory.
Here, I have two unicode strings, a and b. They have the exact same content, but they are not the same object in memory. However, when we compare them, we want them to compare equal. What's happening here is that the unicode object has implemented the __eq__ method.
class unicode(object):
    # ...
    def __eq__(self, other):
        if len(self) != len(other):
            return False
        for i, j in zip(self, other):
            if i != j:
                return False
        return True
Note: __eq__ on unicode is definitely implemented more efficiently than this.
Rationale: Two objects have different data, but are considered the same object if some key data is the same.
Example: Most types of model data
>>> import datetime
>>> a = Monitor()
>>> a.make = "Dell"
>>> a.model = "E770s"
>>> a.owner = "Bob Jones"
>>> a.warranty_expiration = datetime.date(2030, 12, 31)
>>> b = Monitor()
>>> b.make = "Dell"
>>> b.model = "E770s"
>>> b.owner = "Sam Johnson"
>>> b.warranty_expiration = datetime.date(2005, 8, 22)
>>> a is b
False
>>> a == b
True
Here, I have two Dell monitors, a and b. They have the same make and model. However, they neither have the same data nor are the same object in memory. However, when we compare them, we want them to compare equal. What's happening here is that the Monitor object implemented the __eq__ method.
class Monitor(object):
    # ...
    def __eq__(self, other):
        return self.make == other.make and self.model == other.model
Answering your question
When comparing to None, always use is not. None is a singleton in Python - there is only ever one instance of it in memory.
By comparing identity, this can be performed very quickly. Python checks whether the object you're referring to has the same memory address as the global None object - a very, very fast comparison of two numbers.
By comparing equality, Python has to look up whether your object has an __eq__ method. If it does not, it examines each superclass looking for an __eq__ method. If it finds one, Python calls it. This is especially bad if the __eq__ method is slow and doesn't immediately return when it notices that the other object is None.
Did you not implement __eq__? Then Python will probably find the __eq__ method on object and use that instead - which just checks for object identity anyway.
When comparing most other things in Python, you will be using !=.
Consider the following:
class Bad(object):
    def __eq__(self, other):
        return True

c = Bad()
c is None  # False, equivalent to id(c) == id(None)
c == None  # True, equivalent to c.__eq__(None)
None is a singleton, and therefore identity comparison will always work, whereas an object can fake the equality comparison via .__eq__().
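Faking equality is not the only risk. A type with element-wise comparison, in the spirit of numpy arrays, may not return a boolean at all. The sketch below uses a hypothetical pure-Python Vector class (not a real library type) to show how != None can give a misleading answer where is not None cannot.

```python
class Vector:
    """Hypothetical container with element-wise comparison operators."""

    def __init__(self, *values):
        self.values = list(values)

    def __eq__(self, other):
        # Element-wise: returns a list of flags, not a single bool.
        if isinstance(other, Vector):
            return [a == b for a, b in zip(self.values, other.values)]
        return [a == other for a in self.values]

    def __ne__(self, other):
        return [not flag for flag in self.__eq__(other)]


empty = Vector()
full = Vector(1, 2)

ne_empty = (empty != None)       # an empty list, which is falsy
ne_full = (full != None)         # one flag per element
is_empty = (empty is not None)   # always a plain bool
```

An if empty != None: check would silently skip its body here, even though empty is clearly not None.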
>>> () is ()
True
>>> 1 is 1
True
>>> (1,) == (1,)
True
>>> (1,) is (1,)
False
>>> a = (1,)
>>> b = a
>>> a is b
True
Some objects are singletons, and thus is with them is equivalent to ==. Most are not.