In Python, is there a good "magic method" to represent object similarity?

I want instances of my custom class to be able to compare themselves to one another for similarity. This is different from the __cmp__ method, which is used for determining the sorting order of objects.
Is there a magic method that makes sense for this? Is there any standard syntax for doing this?
How I imagine this could look:
>>> x = CustomClass("abc")
>>> y = CustomClass("abd")
>>> z = CustomClass("xyz")
>>> x.__<???>__(y)
0.75
>>> x <?> y
0.75
>>> x.__<???>__(z)
0.0
>>> x <?> z
0.0
Where <???> is the magic method name and <?> is the operator.

Take a look at the numeric types emulation in the datamodel and pick an operator hook that suits you.
I don't think there is currently an operator that is an exact match though, so you'll end up surprising some poor hapless future maintainer (it could even be you) who discovers you overloaded a standard operator.
For a Levenshtein distance I'd just use a regular method instead; I'd find a one.similarity(other) method a lot clearer when reading the code.
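For instance, a minimal sketch of that shape (the CustomClass here and the normalization by the longer length are made up for illustration):

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class CustomClass(object):
    def __init__(self, text):
        self.text = text
    def similarity(self, other):
        # 1.0 for identical values, falling toward 0.0 as they diverge
        longest = max(len(self.text), len(other.text)) or 1
        return 1.0 - float(levenshtein(self.text, other.text)) / longest

CustomClass("abc").similarity(CustomClass("abd"))  # 0.666...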

Well, you could override __eq__ to mean both boolean logical equality and 'fuzzy' similarity, by returning a sufficiently weird result from __eq__:
class FuzzyBool(object):
    def __init__(self, quality, tolerance=0):
        self.quality, self._tolerance = quality, tolerance
    def __nonzero__(self):
        return self.quality <= self._tolerance
    __bool__ = __nonzero__  # Python 3 spelling of __nonzero__
    def tolerance(self, tolerance):
        return FuzzyBool(self.quality, tolerance)
    def __repr__(self):
        return "sorta %s" % bool(self)

class ComparesFuzzy(object):
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return FuzzyBool(abs(self.value - other.value))
    def __hash__(self):
        return hash((ComparesFuzzy, self.value))
>>> a = ComparesFuzzy(1)
>>> b = ComparesFuzzy(2)
>>> a == b
sorta False
>>> (a == b).tolerance(3)
sorta True
The default behavior of the comparator should be that it is truthy only if the compared values are exactly equal, so that normal equality is unaffected.
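With the tolerance defaulting to 0, you can check that exact equality still reads as expected:

>>> ComparesFuzzy(1) == ComparesFuzzy(1)
sorta True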

No, there is not. You can write a regular method, but I don't think there is any intuitive operator to overload that would do what you're looking for. And, to avoid confusion, I would avoid operator overloading unless it is obviously intuitive.
I would simply define a method and call it as x.similarity(y).

I don't think there is a magic method (and corresponding operator) that would make sense for this in any context.
However, if, with a bit of fantasy, your instances can be seen as vectors, then checking for similarity could be analogous to calculating the scalar product. It would then make sense to use __mul__ and the multiplication sign for this (unless you have already defined a product for CustomClass instances).
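A minimal sketch of that idea (the Vec class is made up; with * as a dot product, a larger result means more similar):

class Vec(object):
    def __init__(self, *components):
        self.components = components
    def __mul__(self, other):
        # dot product standing in for "similarity"
        return sum(a * b for a, b in zip(self.components, other.components))

Vec(1, 0) * Vec(1, 0)  # 1 (parallel vectors)
Vec(1, 0) * Vec(0, 1)  # 0 (orthogonal vectors)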

No magic function/operator for that.
When I think of "similarity" for ints and floats, I think of the difference being lower than a certain threshold. Perhaps that's something you might use?
E.g. being able to calculate the "difference" between your objects might be a natural fit for the __sub__ method.

In the example you've cited, I would use difflib. This conducts spell-check-like comparisons between strings. But in general, if you really are comparing objects rather than strings, then I agree with the others; you should probably create something context-specific.
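For instance (scores are SequenceMatcher's ratio, always in [0.0, 1.0]):

import difflib

difflib.SequenceMatcher(None, "abc", "abd").ratio()  # ~0.67
difflib.SequenceMatcher(None, "abc", "xyz").ratio()  # 0.0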

Is it really necessary to hash the same for classes that compare the same?

Reading this answer, it seems that if __eq__ is defined in a custom class, __hash__ needs to be defined as well. This is understandable.
However, it is not clear why, effectively, a == b should imply self.__hash__() == other.__hash__().
Imagining a class like this:
class Foo:
    ...
    self.Name
    self.Value
    ...
    def __eq__(self, other):
        return self.Value == other.Value
    ...
    def __hash__(self):
        return id(self.Name)
This way class instances could be compared by value, which could be the only reasonable use, but considered identical by name.
This way set could not contain multiple instances with equal name, but comparison would still work.
What could be the problem with such definition?
The reason for defining __eq__, __lt__ and others by Value is to be able to sort instances by Value and to use functions like max. For example, the class should represent a physical output of a device (say, a heating element). Each of these outputs has a unique Name. The Value is the power of the output device. To find the optimal combination of heating elements to turn on, it is useful to be able to compare them by power (Value). In a set or dictionary, however, it should not be possible to have multiple outputs with the same name. Of course, different outputs with different names might easily have equal power.
The problem is that this does not make sense: a hash is used to do efficient bucketing of objects. Consequently, when you have a set, which is implemented as a hash table, each hash points to a bucket, which is usually a list of elements. To check whether an element is in the set (or another hash-based container), you go to the bucket pointed to by its hash and then iterate over all elements in that list, comparing them one by one.
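In sketch form, a hash-based lookup works roughly like this (a toy model for illustration, not CPython's actual implementation):

def set_contains(buckets, x):
    # the hash only selects a bucket; actual membership is decided by ==
    bucket = buckets[hash(x) % len(buckets)]
    return any(x == y for y in bucket)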
In other words: the hash is not supposed to be a comparator (it can, and sometimes will, give false positives). In particular, in your example, your set will not work: it will not recognize duplicates, because they do not compare equal to each other.
class Foo:
    def __eq__(self, other):
        return self.Value == other.Value
    def __hash__(self):
        return id(self.Name)

a = set()
el = Foo()
el.Name = 'x'
el.Value = 1
el2 = Foo()
el2.Name = 'x'
el2.Value = 2
a.add(el)
a.add(el2)
print len(a)  # should be 1, right? Well, it is 2
Actually, it is even worse than that: if you have two objects with the same values but different names, they are not recognized as the same either.
class Foo:
    def __eq__(self, other):
        return self.Value == other.Value
    def __hash__(self):
        return id(self.Name)

a = set()
el = Foo()
el.Name = 'x'
el.Value = 2
el2 = Foo()
el2.Name = 'a'
el2.Value = 2
a.add(el)
a.add(el2)
print len(a)  # should be 1, right? Well, it is 2 again
Doing it properly instead (so that "if a == b, then hash(a) == hash(b)") gives:
class Foo:
    def __eq__(self, other):
        return self.Name == other.Name
    def __hash__(self):
        return id(self.Name)

a = set()
el = Foo()
el.Name = 'x'
el.Value = 1
el2 = Foo()
el2.Name = 'x'
el2.Value = 2
a.add(el)
a.add(el2)
print len(a)  # is really 1
Update
There is also a non-deterministic part, which is hard to reproduce reliably, but essentially a hash does not uniquely determine a bucket. Usually it is something like

bucket_id = hash(object) % size_of_allocated_memory

Consequently, things that have different hashes can still end up in the same bucket. As a result, you can get two elements considered equal to each other (inside the set) due to equality of Values even though their Names differ, as well as the other way around, depending on the actual internal implementation, memory constraints, etc.
In general, there are many more examples where things can go wrong, because a hash is defined as a function h : X -> Z such that x == y => h(x) == h(y), and people implementing containers, authorization protocols, and other tools are free to assume this property. If you break it, every single tool using hashes can break. Furthermore, it can break over time: a perfectly valid update to an underlying library (relying on the above assumption) can make your code stop working, effectively exploiting your violation of the contract.
Update 2
Finally, to solve your issue: you simply should not define your __eq__ and __lt__ operators to handle sorting. They are about actual comparison of elements, which should be compatible with the rest of the behaviours. All you have to do is define a separate key (or comparator) and use it in your sorting routines (sorted() in Python accepts a key function, and in Python 2 also a cmp function; you do not need to rely on <, > etc.). The other way around is to have valid <, >, = defined on values, and, to keep names unique, keep a set of... well... names, not of the objects themselves. Whichever path you choose, the crucial element here is:
equality and hashing have to be compatible, that's all.
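A minimal sketch of the first option (the class and names are made up): __eq__ and __hash__ agree on Name, while the sorting routines get an explicit key:

import operator

class Output:
    def __init__(self, name, value):
        self.name, self.value = name, value
    def __eq__(self, other):
        return self.name == other.name  # identity follows Name
    def __hash__(self):
        return hash(self.name)          # consistent with __eq__

outputs = {Output('heater1', 500), Output('heater1', 750), Output('heater2', 300)}
len(outputs)                                                # 2: duplicate name collapsed
ranked = sorted(outputs, key=operator.attrgetter('value'))  # order by power instead
strongest = max(outputs, key=operator.attrgetter('value'))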
It is actually possible to implement your class as in the question (eq by Value, hash by Name) and not run into problems. However, you would have to be 100% sure that no two different objects can ever produce the same hash. Consider the following example:
class Foo:
    def __init__(self, name, value):
        self.name = name
        self.value = value
    def __eq__(self, other):
        return self.value == other.value
    def __hash__(self):
        return hash(self.name[0])

s = set()
s.add(Foo('a', 1))
s.add(Foo('b', 1))
print(len(s))  # output: 2
But you have a problem if a hash collision occurs:
s.add(Foo('abc', 1))
print(len(s)) # output: 2
In order to prevent this, you would have to know exactly how the hashes are generated (which, if you rely on functions like id or hash, might vary between implementations!) as well as the values of the attribute(s) used to generate the hash (name in this example). That's why ruling out the possibility of a hash collision is very difficult, if not impossible. It's basically begging for unexpected things to happen.
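For contrast, the conventional pattern sidesteps collisions as a correctness concern entirely: derive __eq__ and __hash__ from the same attribute tuple, so objects that compare equal hash equally by construction. A minimal sketch:

class Foo:
    def __init__(self, name, value):
        self.name = name
        self.value = value
    def __eq__(self, other):
        return (self.name, self.value) == (other.name, other.value)
    def __hash__(self):
        return hash((self.name, self.value))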

Python3 style sorting -- old cmp method functionality in new key mechanism?

I read about the wrapper function for converting a cmp-style comparison into a key-style comparison in Python 3, where the cmp capability was removed.
I'm having a heck of a time wrapping my head around how a straight key-style sorted() in Python 3, which (at least as I understand it) passes just one item to the key function, can let you properly compare, for instance, two IPs for ordering. Or ham calls.
Whereas with cmp there was nothing to it: sorted() and sort() called you with the two IPs, you looked at the appropriate portions, made your decision, done.
def ipCompare(dqA, dqB):
    ...

ipList = sorted(ipList, cmp=ipCompare)
Same thing with ham radio calls. The sorting isn't alphabetic; the calls are generally letter(s)+number(s)+letter(s); the first sorting priority is the number portion, then the first letter(s), then the last letter(s).
Using cmp... no sweat.
def callCompare(callA, callB):
    ...

hamlist = sorted(hamlist, cmp=callCompare)
With Python3... without going through the hoop jumping of the wrapper... and being passed one item... I think... how can that be done?
And if the wrapper is absolutely required... then why remove cmp within Python3 in the first place?
I'm sure I'm missing something. I just can't see it. :/
OK, now I know what I was missing. Solutions for IPs are given in the answers below. Here's a key function I came up with for sorting ham calls of the common prefix, region, postfix form:

import re

def callKey(txtCall):
    # '([0-9]+)' rather than '([0-9]*)' so the split happens at the digit
    # run itself; newer versions of re.split would split on empty matches
    l = re.split('([0-9]+)', txtCall.upper(), 1)
    return l[1], l[0], l[2]

hamList = ['N4EJI', 'W1AW', 'AA7AS', 'na1a']
sortedHamList = sorted(hamList, key=callKey)
sortedHamList result is ['na1a','W1AW','N4EJI','AA7AS']
Detail:
AA7AS comes out of callKey() as 7,AA,AS
N4EJI comes out of callKey() as 4,N,EJI
W1AW comes out of callKey() as 1,W,AW
na1a comes out of callKey() as 1,NA,A
First, if you haven't read the Sorting HOWTO, definitely read that; it explains a lot that may not be obvious at first.
For your first example, two IPv4 addresses, the answer is pretty simple.
To compare two addresses, one obvious thing to do is convert them both from dotted-four strings into tuples of 4 ints, then just compare the tuples:
def cmp_ip(ip1, ip2):
    ip1 = map(int, ip1.split('.'))
    ip2 = map(int, ip2.split('.'))
    return cmp(ip1, ip2)
An even better thing to do is convert them to some kind of object that represents an IP address and has comparison operators. In 3.4+, the stdlib has such an object built in; let's pretend 2.7 did as well:
def cmp_ip(ip1, ip2):
    return cmp(ipaddress.ip_address(ip1), ipaddress.ip_address(ip2))
It should be obvious that these are both even easier as key functions:
def key_ip(ip):
    # list() matters on Python 3, where map() returns an iterator
    return list(map(int, ip.split('.')))

def key_ip(ip):
    return ipaddress.ip_address(ip)
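A quick check with made-up addresses shows why the key matters; a plain string sort gets the order wrong:

ips = ['10.0.0.2', '9.255.255.255', '10.0.0.10']
sorted(ips)              # ['10.0.0.10', '10.0.0.2', '9.255.255.255']
sorted(ips, key=key_ip)  # ['9.255.255.255', '10.0.0.2', '10.0.0.10']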
For your second example, ham radio callsigns: in order to write a cmp function, you have to be able to break each callsign into its letters, numbers, letters portions, then compare the numbers, then compare the first letters, then compare the second letters. In order to write a key function, you have to be able to break a callsign down into the same portions, then return a tuple of (numbers, first letters, second letters). Again, the key function is actually easier, not harder.
And really, this is the case for most examples anyone was able to come up with. Most complicated comparisons ultimately come down to a complicated conversion into some sequence of parts, and then simple lexicographical comparison of that sequence.
That's why cmp functions were deprecated way back in 2.4 and finally removed in 3.0.
Of course there are some cases where a cmp function is easier to read—most of the examples people try to come up with turn out to be wrong, but there are some. And there's also code which has been working for 20 years and nobody wants to rethink it in new terms for no benefit. For those cases, you've got cmp_to_key.
There's actually another reason cmp was deprecated, on top of this one, and maybe a third.
Early Python 2.x types had a __cmp__ method, which was used for handling all of the comparison operators. Later they grew the six rich comparison methods __lt__, __eq__, etc. as a replacement. This allows for more flexibility; e.g., you can have types that aren't totally ordered. So, when old-style code compared a < b, it was actually doing a.__cmp__(b) < 0, which maps in a pretty obvious way to a cmp argument. But with rich comparisons, a < b does a.__lt__(b), which doesn't. This confused a lot of people over the years, and removing both __cmp__ and the cmp argument to sort functions removed that confusion.
Meanwhile, if you read the Sorting HOWTO, you'll notice that before we had key functions, the usual way to do this kind of thing efficiently was decorate-sort-undecorate (DSU). Notice that it's blindingly obvious how to map a good key function to a good DSU sort and vice versa, but it's definitely not obvious with a cmp function. I don't remember anyone explicitly mentioning this on the py3k list, but I suspect people may have had it in their heads when deciding whether to finally kill cmp for good.
To use the new key argument, simply map each element to another object that already implements a well-ordered comparison, such as a tuple or list (e.g. a sequence of integers). These types work well because they are compared element by element, in sequence.
def ip_as_components(ip):
    return list(map(int, ip.split('.')))  # list() so the keys compare on Python 3

sorted_ips = sorted(ips, key=ip_as_components)
The ordering of each of the components is the same as the individual tests found in a traditional compare-and-then-compare-by function.
For the ham ordering, it might look like:

import re

def ham_to_components(ham_code):
    # decompose into prefix letters, region number, postfix letters,
    # returned in comparison-priority order (number first)
    prefix_letters, numbers, postfix_letters = re.split('([0-9]+)', ham_code.upper(), 1)
    return (numbers, prefix_letters, postfix_letters)
The key approach (similar to "order by" found in other languages) is generally the simpler and more natural construct, assuming the original types are not already well ordered. The main drawback is that mixed-direction ordering (e.g. ascending then descending) can be tricky, but it is solvable by negating numeric components, returning nested structures, or sorting in multiple stable passes.
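For the mixed-direction case, the usual trick is to negate numeric components inside the key (made-up data):

pairs = [('b', 1), ('a', 1), ('c', 2)]
# number descending, name ascending
sorted(pairs, key=lambda p: (-p[1], p[0]))  # [('c', 2), ('a', 1), ('b', 1)]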
In Py3.0, the cmp parameter was removed entirely (as part of a larger effort to simplify and unify the language, eliminating the conflict between rich comparisons and the __cmp__() magic method).
If you absolutely need sorted() with a custom "cmp", functools.cmp_to_key can be used trivially:

import functools

sorted_ips = sorted(ips, key=functools.cmp_to_key(ip_compare))
According to the official docs (https://docs.python.org/3/howto/sorting.html#the-old-way-using-the-cmp-parameter):
When porting code from Python 2.x to 3.x, the situation can arise when you have the user supplying a comparison function and you need to convert that to a key function. The following wrapper makes that easy to do:
def cmp_to_key(mycmp):
    'Convert a cmp= function into a key= function'
    class K:
        def __init__(self, obj, *args):
            self.obj = obj
        def __lt__(self, other):
            return mycmp(self.obj, other.obj) < 0
        def __gt__(self, other):
            return mycmp(self.obj, other.obj) > 0
        def __eq__(self, other):
            return mycmp(self.obj, other.obj) == 0
        def __le__(self, other):
            return mycmp(self.obj, other.obj) <= 0
        def __ge__(self, other):
            return mycmp(self.obj, other.obj) >= 0
        def __ne__(self, other):
            return mycmp(self.obj, other.obj) != 0
    return K
To convert to a key function, just wrap the old comparison function:
>>> def reverse_numeric(x, y):
... return y - x
>>> sorted([5, 2, 4, 1, 3], key=cmp_to_key(reverse_numeric))
[5, 4, 3, 2, 1]

How to perform __eq__, __gt__ etc on int object in Python?

I am working on a class that requires multiple rules to validate against, e.g. whether a certain pattern appears more than, exactly, or fewer than a certain number of times. I have the output of the regular expression I am validating in a list and am checking the length of that list. Now how do I call one of these methods (__gt__, __eq__, etc.) dynamically?
My approach:
func = getattr(len(re.findall(r'\d+', 'password1')), method_name)
# method_name can be any one of '__eq__', '__gt__', etc.
if func(desired_length):
    print ":validated"
else:
    raise Exception("Not sufficient complexity")
For example, for method_name='__eq__' and desired_length=1, the above will evaluate to True. For method_name='__gt__' and desired_length=1, it will evaluate to False for this string, since only one run of digits appears (it would be True if a number appeared more than once).
But I realize int objects don't really implement these methods (at least not in Python 2). Is there any way I can achieve this?
Rather than using getattr on the int instance here, maybe you should consider the operator module. Then you can grab the comparison operator and pass the int instance and the desired length. A full example would be something like this:
import operator
import re

method_name = '__eq__'
desired_length = 1

func = getattr(operator, method_name)
n_ints = len(re.findall(r'\d+', 'password1'))
if func(n_ints, desired_length):
    print('Yeah Buddy!')
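A variant sketch, if you'd rather not spell out dunder names at all (the RULES mapping is made up for illustration):

import operator
import re

RULES = {'eq': operator.eq, 'gt': operator.gt, 'ge': operator.ge}

n_ints = len(re.findall(r'\d+', 'password1'))
RULES['eq'](n_ints, 1)  # True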

Is there a way in Python to utilize infinite subsets of R or C?

I understand that it is intrinsically impossible for computers to store infinite sets (aside from using generators to produce countably infinite sets), but I was wondering if there's a way to represent, say, the set of complex numbers with |z| < 1. I know I could do this with comprehensions if a package had a "set of all complex numbers" object, but my initial searches have come up empty.
I presume the better way to deal with such sets is to test inclusion for a given number (i.e., given z, is |z| < 1?) rather than to try to have some kind of set object, but I just thought I'd ask. Thanks!
You can easily create a class to abstract away the < test:
>>> class complex_subset(object):
...     def __init__(self, norm_below):
...         self.norm_below = norm_below
...     def __contains__(self, item):
...         return abs(complex(item)) < self.norm_below
...
>>> complex_below_norm_1=complex_subset(norm_below=1)
>>> 0 in complex_below_norm_1
True
>>> 3 in complex_below_norm_1
False
>>> 0.5+0.5j in complex_below_norm_1
True
and of course, you can generalize complex_subset with keyword __init__ arguments to define your __contains__ method.
If you want to be able to compare complex_subsets with each other, you have to write the appropriate __eq__, __gt__ and __lt__ methods.
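For disks centered at the origin, the subset order follows directly from the radii, so a minimal sketch (the class name is made up) could be:

import functools

@functools.total_ordering
class ComplexDisk(object):
    """Open disk {z : |z| < radius} around the origin."""
    def __init__(self, radius):
        self.radius = radius
    def __contains__(self, z):
        return abs(complex(z)) < self.radius
    def __eq__(self, other):
        return self.radius == other.radius
    def __lt__(self, other):
        # proper subset; valid only because both disks share a center
        return self.radius < other.radius

ComplexDisk(1) < ComplexDisk(2)  # True: the unit disk sits inside the radius-2 disk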

Under what circumstances is __rmul__ called?

Say I have a list l. Under what circumstances is l.__rmul__(other) called?
I basically understood the documentation, but I would also like to see an example to clarify its usages beyond any doubt.
When Python attempts to multiply two objects, it first tries to call the left object's __mul__() method. If the left object doesn't have a __mul__() method (or the method returns NotImplemented, indicating it doesn't work with the right operand in question), then Python wants to know if the right object can do the multiplication. If the right operand is the same type as the left, Python knows it can't, because if the left object can't do it, another object of the same type certainly can't either.
If the two objects are different types, though, Python figures it's worth a shot. However, it needs some way to tell the right object that it is the right object in the operation, in case the operation is not commutative. (Multiplication is, of course, but not all operators are, and in any case * is not always used for multiplication!) So it calls __rmul__() instead of __mul__().
As an example, consider the following two statements:
print "nom" * 3
print 3 * "nom"
In the first case, Python calls the string's __mul__() method. The string knows how to multiply itself by an integer, so all is well. In the second case, the integer does not know how to multiply itself by a string, so its __mul__() returns NotImplemented and the string's __rmul__() is called. It knows what to do, and you get the same result as the first case.
Now we can see that __rmul__() allows all of the string's special multiplication behavior to be contained in the str class, such that other types (such as integers) do not need to know anything about strings to be able to multiply by them. A hundred years from now (assuming Python is still in use) you will be able to define a new type that can be multiplied by an integer in either order, even though the int class has known nothing of it for more than a century.
By the way, the string class's __mul__() has a bug in some versions of Python. If it doesn't know how to multiply itself by an object, it raises a TypeError instead of returning NotImplemented. That means you can't multiply a string by a user-defined type even if the user-defined type has an __rmul__() method, because the string never lets it have a chance. The user-defined type has to go first (e.g. Foo() * 'bar' instead of 'bar' * Foo()) so its __mul__() is called. They seem to have fixed this in Python 2.7 (I tested it in Python 3.2 also), but Python 2.6.6 has the bug.
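A tiny way to probe that behavior on a given interpreter (the class is made up):

class Tripler(object):
    def __rmul__(self, other):
        return other * 3

'bar' * Tripler()  # 'barbarbar' where str.__mul__ returns NotImplemented;
                   # raises TypeError on interpreters with the bug (e.g. 2.6.6)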
Binary operators by their nature have two operands. Each operand may be on either the left or the right side of an operator. When you overload an operator for some type, you can specify for which side of the operator the overloading is done. This is useful when invoking the operator on two operands of different types. Here's an example:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __str__(self):
        return "Foo [%s]" % self.val

class Bar(object):
    def __init__(self, val):
        self.val = val
    def __rmul__(self, other):
        return Bar(self.val * other.val)
    def __str__(self):
        return "Bar [%s]" % self.val

f = Foo(4)
b = Bar(6)
obj = f * b   # Bar [24]
obj2 = b * f  # ERROR
Here, obj will be a Bar with val = 24, but the assignment to obj2 generates an error because Bar has no __mul__ and Foo has no __rmul__.
I hope this is clear enough.
One caveat: __mul__() and __rmul__() don't have to implement the same operation. For example, some Point-class exercises define __mul__() as a dot product, so p1 * p2 yields the scalar x1*x2 + y1*y2, while __rmul__() performs element-wise multiplication and returns a point with x = x1*x2 and y = y1*y2. Which method runs depends only on operand order and types, so inconsistent meanings like this are easy to misread.
