How to test for approximate equality of generic classes - python

I'm trying to find out if two classes are equivalent, ignoring types parameters. Say I have
from typing import Generic, TypeVar
T = TypeVar('T')
class A(Generic[T]):
    pass

class B(Generic[T], A[T]):
    pass

class X:
    pass
I'd like all the entries within each of the following rows to compare as equivalent:
Generic, Generic[T]
A, A[T], A[str], A[int]
B, B[T], B[str], B[int]
X
None of is, ==, isinstance, type, or __class__ does the job. Comparing __name__ is fragile, since someone could define another class with the same name.
For bonus points*, I'd also be interested in an additional way to test equivalence of
A, A[T], A[str], A[int], B, B[T], B[str], B[int]
*not a bounty :p
(The context is that I'd like to find all the subclasses of a class other than Generic)

To recover A from A[T], you can use the __origin__ attribute: for a subscripted form like A[T] it is A, while for the bare class A it is None.
def compare(a, b):
    if hasattr(a, "__origin__") and hasattr(b, "__origin__"):
        a_origin = a.__origin__ or a
        b_origin = b.__origin__ or b
        return a_origin == b_origin
    else:
        return a == b
compare(A, A[int]) # True
compare(A, B[int]) # False
compare(A, A) # True
compare(X, X) # True
According to the linked comment, __origin__ should be available for Union, Optional, Generic, Callable, and Tuple.
It's worth noting that this is an implementation detail. Using this is opening yourself up to the risk of the implementation changing without warning.
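Since Python 3.8 there is a supported way to do this without touching __origin__ directly: typing.get_origin(). A minimal sketch, assuming Python 3.8+ (the compare helper is written here for illustration, it is not part of typing):

```python
from typing import Generic, TypeVar, get_origin  # get_origin: Python 3.8+

T = TypeVar('T')

class A(Generic[T]):
    pass

class X:
    pass

def compare(a, b):
    # get_origin(A[int]) returns A; for a plain class it returns None,
    # so fall back to the object itself.
    return (get_origin(a) or a) == (get_origin(b) or b)

print(compare(A, A[int]))      # True
print(compare(A[str], A[int])) # True
print(compare(A, X))           # False
```

Unlike __origin__, get_origin() is documented API, so it will not change without warning.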

Related

What part of the implementation in Enum shape the behavior of `isinstance`?

isinstance(x, E), when applied to an Enum subclass E, checks whether the first argument x is a member of E. For example, with the following definition:
from enum import Enum

class E(Enum):
    X = 1
    Y = 2
    Z = 3
isinstance(E.X, E) returns True, but isinstance(1, E) returns False. I'm puzzled how the Enum implementation makes this work: I don't even see __instancecheck__ being overridden. How does the Enum implementation make isinstance calls work in this way?
There is nothing special done at runtime to handle isinstance. The special behaviour for enums is all done during class construction when you derive from Enum.
Eg.
>>> from enum import Enum
>>> class E(Enum):
...     X = 1
...     Y = 2
...
>>>
>>> type(E.X)
<enum 'E'>
>>> E.X == 1
False
Enum has EnumMeta as its metaclass, and the special handling is in EnumMeta.__new__ which builds the E enum class. What it does, among other things, is to replace all the attributes you define as ints with instances of E with the given value.
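As an illustration only (this is not the real enum source), a toy metaclass can reproduce the effect described above: rewrite the plain class attributes into instances of the class being built. The names ToyEnumMeta and Color are invented for this sketch:

```python
class ToyEnumMeta(type):
    # Toy version of what EnumMeta.__new__ does: replace each plain
    # attribute with an instance of the class being constructed.
    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        for attr, value in list(ns.items()):
            if not attr.startswith('_'):
                member = object.__new__(cls)  # bypass __init__
                member.name = attr
                member.value = value
                setattr(cls, attr, member)
        return cls

class Color(metaclass=ToyEnumMeta):
    RED = 1
    GREEN = 2

print(isinstance(Color.RED, Color))  # True: members are instances of Color
print(Color.RED == 1)                # False: members are not plain ints
```

So isinstance works on real enums for the ordinary reason: each member really is an instance of its enum class; no __instancecheck__ override is needed.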

Python: require two parameters of a generic method to have the same concrete type

How can I test that the parameters of a function have the same concrete type?
Example:
from typing import Generic, TypeVar

T = TypeVar('T')

class A(Generic[T]):
    def __init__(self, p: T):
        self._p = p

    def get_data(self) -> T:
        return self._p

class B(Generic[T]):
    def __init__(self, p: T):
        self._p = p

    def inc(self, v: A[T]):
        self._p = self._p + v.get_data()

class I(Generic[T]):
    def __init__(self, a: A[T], b: B[T]):
        self._a = a
        self._b = b

    def execute(self):
        self._b.inc(self._a)
Ok:
I(A(1), B(2)).execute()
Failed:
I(A(1), B("2")).execute()
So to me it looks like you want to use type variables / generics? That is, when a function is called, that type variable is then bound for that particular call.
For example, we could have a function head() which optionally returns the head item of a homogeneous list (Python doesn't have this restriction for lists, but for now imagine we want to have it):
from typing import TypeVar, Optional, List
T = TypeVar("T", int, float, str)
def head(lst: List[T]) -> Optional[T]:
    if lst:
        return lst[0]
    return None
print(head([1, 2]))
print(head(["koi"]))
print(head([]))
Now, any of str, float or int are allowed as the item type for these lists, but each list must be homogeneous. So the type signature basically says that "head takes a list of strings, ints or floats, and whatever the type of the items in the list is, the return value (if present) will be of that type too".
Note that this is quite different from union types, eg. you could have
T = Union[str, float, int]

def head(lst: List[T]) -> Optional[T]:
    ...
but it wouldn't then work like you want; according to types, returning an int would still be ok even if list contained only strings.
Update: as kindly mentioned by @MisterMiyagi in the comment below, this still requires the user to be aware of the types that are allowed.
For your use case (with classes), this should be fine (trying to use minimal example):
from typing import TypeVar, Generic

T = TypeVar("T")

class A(Generic[T]):
    def a(self, a: T, b: T) -> T:
        return a

print(A[int]().a(1, 2))
print(A[float]().a(32.1, 2))
print(A[str]().a("koi", "bar"))
# print(A[int]().a(1, "bar")) -- does not pass type checker
Crucial difference here is that you still need to pass types when you actually call them, but they don't need to be present when you define your generic class.
Check out https://mypy.readthedocs.io/en/stable/generics.html for more information on those.

specify comparator function in python dictionary

Is there a way to pass in a custom equality comparison function when creating a python dictionary so that it doesn't use the default __eq__ or hash comparators? I'm hoping there is a way to do so but wasn't able to find it so far.
edit: I am looking for a way to provide different definitions of object equality for classes that I defined. Like:
class A:
    def __init__(self, a, b):
        self.a = a
        self.b = b

def c1(a1, a2):  # assume both are A objects
    return a1.a == a2.a

def c2(a1, a2):
    return a1.b == a2.b

# is this possible?
d1 = dict(cmp=c1)
d2 = dict(cmp=c2)
I know I can override __eq__ etc in my class definition but I can only override it once. In Java I can use TreeMap and I am looking for the equivalent in Python.
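There is no such dict constructor argument, but a common workaround is a thin wrapper that passes each key through a caller-supplied key function before storing it. A minimal sketch (KeyedDict is an invented name, and this is nowhere near a full MutableMapping implementation):

```python
class KeyedDict:
    """Dict wrapper that uses a caller-supplied key function instead of
    the keys' own __eq__/__hash__ (sketch, not a full mapping type)."""
    def __init__(self, key):
        self._key = key          # e.g. lambda x: x.a
        self._data = {}

    def __setitem__(self, k, v):
        # Store the original key alongside the value so it can be recovered.
        self._data[self._key(k)] = (k, v)

    def __getitem__(self, k):
        return self._data[self._key(k)][1]

    def __contains__(self, k):
        return self._key(k) in self._data

class A:
    def __init__(self, a, b):
        self.a = a
        self.b = b

d1 = KeyedDict(key=lambda x: x.a)   # compare by .a only, like c1
d1[A(1, 'x')] = 'first'
print(A(1, 'y') in d1)  # True: equal under the c1-style comparison
```

Different KeyedDict instances can take different key functions, which gives the effect of the hypothetical dict(cmp=...) in the question.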

Custom IDE-compatible static-types in Python

For the sake of nicer design and OOP, I would like to create a custom IDE-compatible static type. For instance, consider the following idealized class:
class IntOrIntString(Union[int, str]):
    @staticmethod
    def is_int_string(item):
        try:
            int(item)
            return True
        except (TypeError, ValueError):
            return False

    def __instancecheck__(self, instance):
        # I know __instancecheck__ is declared in the metaclass.
        # It's written here for the sake of the argument.
        return isinstance(instance, int) or (isinstance(instance, str) and self.is_int_string(instance))

    @staticmethod
    def as_integer(item):
        return int(item)
Now, this is a silly class, I know, but it serves as a simple example. Defining such class has the following advantages:
It allows for static type-checking in the IDE (e.g. def parse(s: IntOrIntString): ...).
It allows dynamic type-checking (e.g. isinstance(item, IntOrIntString)).
It can be used to better encapsulate type-related static functions (e.g. integer = IntOrIntString.as_integer(item)).
However, this code won't run because Union[int, str] can not be subclassed - I get:
TypeError: Cannot subclass typing.Union
So, I tried to work-around this by creating this "type" by referring to it as an instance of Union (which it actually is). Meaning:
IntOrIntString = Union[int, str]
IntOrIntString.as_integer = lambda item: int(item)
...
but that didn't work either as I get the error message
AttributeError: '_Union' object has no attribute 'as_integer'
Any thoughts on how that could be accomplished, or, perhaps, justifications for why it shouldn't be possible to accomplish?
I use python 3.6, but that's not set in stone because I could change the version if needed. The IDE I use is PyCharm.
Thanks
Edit: Two more possible examples for where this is useful:
The type AnyNumber that can accept any number that I wish. Maybe starting with float and int, but can be extended to support any number-like type I want such as int-strings, or single-item iterables. Such extension is immediately system-wide, which is a huge bonus. As an example, consider the function
def func(n: AnyNumber):
    n = AnyNumber.get_as_float(n)
    # The rest of the function is implemented just for float.
    ...
Working with pandas, you can usually perform similar operations on Series, DataFrame and Index, so suppose that there's a "type-class" like above called SeriesContainer that simplifies the usage - allows me to handle all the data-types uniformly by invoking SeriesContainer.as_series_collection(...), or SeriesContainer.as_data_frame(...) depending on the usage.
If I were you, I would avoid creating such classes, since they create unnecessary type ambiguity. Instead, to take your example, in order to achieve the objective of differentiating between a regular string and an int string, this is how I would go about it. First, make a (non-static) intString class:
from typing import Union

class intString(object):
    def __init__(self, item: str):
        try:
            int(item)
        except ValueError:
            print("error message")
            exit(1)
        self.val = item

    def __int__(self):
        return int(self.val)
(It might be better to inherit from str, but I'm not sure how to do it correctly and it's not material to the issue).
Let's say we have the following three variables:
regular_string = "3"
int_string = intString(regular_string)
int_literal = 3
Now we can use the built in python tools to achieve our three objectives:
static type checking:
def foo(f: Union[int, intString]):
    pass

foo(regular_string)  # Warning
foo(3)               # No warnings
foo(int_string)      # No warnings
You will notice that here we have stricter type checking than what you were proposing - even though the first string can be cast into an intString, the IDE will recognize that it isn't one before runtime and warn you.
Dynamic type checking:
print(isinstance(regular_string, (intString, int))) # <<False
print(isinstance(int_string, (intString, int))) # <<True
print(isinstance(int_literal, (intString, int))) # <<True
Notice that isinstance returns true if any of the items in the tuple match any of its parent classes or its own class.
I'm not sure that I understood how this relates to encapsulation, honestly. But since we defined the int operator in the intString class, we have polymorphism with ints as desired:
for i in [intString("4"), 5, intString("77"), "5"]:
    print(int(i))
will print 4, 5, 77 and 5, as expected.
I'm sorry if I got too hung up on this specific example, but I just found it hard to imagine a situation where merging different types like this would be useful, since I believe that the three advantages you brought up can be achieved in a more pythonic manner.
I suggest you take a look at https://docs.python.org/3/library/typing.html#newtype for more basic functionality relating to defining new types.
A couple thoughts. First, Union[int, str] includes all strings, even strings like "9.3" and "cat", which don't look like an int.
If you're okay with this, you could do something like the following:
intStr = Union[int, str]
isinstance(5, intStr.__args__) # True
isinstance(5.3, intStr.__args__) # False
isinstance("5.3", intStr.__args__) # True
isinstance("howdy", intStr.__args__) # True
Note that when using a Union type, or a type with an origin of Union, you have to use .__args__ for isinstance() to work, as isinstance() doesn't work with straight up Unions. It can't differentiate Unions from generic types.
I'm assuming, though, that intStr shouldn't include all strings, but only a subset of strings. In this case, why not separate the type-checking methods from the type hinting?
def intStr_check(x):
    "checks if x is an instance of intStr"
    if isinstance(x, int):
        return True
    elif isinstance(x, str):
        try:
            x = int(x)
            return True
        except ValueError:
            return False
    else:
        return False
Then simply use that function in place of isinstance() when checking if the type is an intStr.
Note that your original method had an error, being that int(3.14) does not throw an error and would have passed your check.
Now that we've gotten isinstance() out of the way, if for parsing purposes you need to differentiate intStr objects from Union[int,str] objects, you could use the NewType from typing like so:
from typing import NewType
IntStr = NewType("IntStr", Union[int,str])
def some_func(a: IntStr):
    if intStr_check(a):
        return int(a) + 1
    else:
        raise ValueError("Argument must be an intStr (an int or string of an int)")

some_num = IntStr("9")
print(some_func(some_num))  # 10
There's no need to create an as_integer() function or method, as it's exactly the same as int(), which is more concise and readable.
My opinion on style: nothing should be done simply for the sake of OOP. Sure, sometimes you need to store state and update parameters, but in cases where that's unnecessary, I believe OOP tends to lead to more verbose code, and potentially more headaches maintaining mutable state and avoiding unintended side effects. Hence, I prefer to declare new classes only when necessary.
EDIT: Since you insist on reusing the function name isinstance, you can overwrite isinstance to add additional functionality like so:
from typing import NewType, Union, _GenericAlias

isinstance_original = isinstance

def intStr_check(x):
    "checks if x is an instance of intStr"
    if isinstance_original(x, int):
        return True
    elif isinstance_original(x, str):
        try:
            x = int(x)
            return True
        except ValueError:
            return False
    else:
        return False

def isinstance(x, t):
    if t == 'IntStr':  # run intStr_check
        return intStr_check(x)
    elif type(t) == _GenericAlias:  # check Union types
        try:
            check = False
            for i in t.__args__:
                check = check or isinstance_original(x, i)
                if check:
                    break
            return check
        except AttributeError:
            return isinstance_original(x, t)
    else:  # regular isinstance
        return isinstance_original(x, t)
# Some tests
assert isinstance("4", 'IntStr') == True
assert isinstance("4.2", 'IntStr') == False
assert isinstance("4h", 'IntStr') == False
assert isinstance(4, 'IntStr') == True
assert isinstance(4.2, int) == False
assert isinstance(4, int) == True
assert isinstance("4", int) == False
assert isinstance("4", str) == True
assert isinstance(4, Union[str,int]) == True
assert isinstance(4, Union[str,float]) == False
Just be careful not to run isinstance_original = isinstance multiple times.
You could still use IntStr = NewType("IntStr", Union[int,str]) for static type checking, but since you're in love with OOP, you could also do something like the following:
class IntStr:
    "an integer or a string of an integer"
    def __init__(self, value):
        self.value = value
        if not isinstance(self.value, 'IntStr'):
            raise ValueError(f"could not convert {type(self.value)} to IntStr (an int or string of int): {self.value}")

    def check(self):
        return isinstance(self.value, 'IntStr')

    def as_integer(self):
        return int(self.value)

    def __call__(self):
        return self.value
# Some tests
try:
    a = IntStr("4.2")
except ValueError:
    print("it works")

a = IntStr("4")
print(f"a == {a()}")
assert a.as_integer() + 1 == 5
assert isinstance(a, IntStr) == True
assert isinstance(a(), str) == True
assert a.check() == True
a.value = 4.2
assert a.check() == False

What's a correct and good way to implement __hash__()?

What's a correct and good way to implement __hash__()?
I am talking about the function that returns a hashcode that is then used to insert objects into hashtables aka dictionaries.
As __hash__() returns an integer and is used for "binning" objects into hashtables I assume that the values of the returned integer should be uniformly distributed for common data (to minimize collisions).
What's a good practice to get such values? Are collisions a problem?
In my case I have a small class which acts as a container class holding some ints, some floats and a string.
An easy, correct way to implement __hash__() is to use a key tuple. It won't be as fast as a specialized hash, but if you need that then you should probably implement the type in C.
Here's an example of using a key for hash and equality:
class A:
    def __key(self):
        return (self.attr_a, self.attr_b, self.attr_c)

    def __hash__(self):
        return hash(self.__key())

    def __eq__(self, other):
        if isinstance(other, A):
            return self.__key() == other.__key()
        return NotImplemented
Also, the documentation of __hash__ has more information, that may be valuable in some particular circumstances.
John Millikin proposed a solution similar to this:
class A(object):
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        return (isinstance(othr, type(self))
                and (self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))

    def __hash__(self):
        return hash((self._a, self._b, self._c))
The problem with this solution is that hash(A(a, b, c)) == hash((a, b, c)). In other words, the hash collides with that of the tuple of its key members. Maybe this does not matter very often in practice?
Update: the Python docs now recommend to use a tuple as in the example above. Note that the documentation states
The only required property is that objects which compare equal have the same hash value
Note that the opposite is not true. Objects which do not compare equal may have the same hash value. Such a hash collision will not cause one object to replace another when used as a dict key or set element as long as the objects do not also compare equal.
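A small sketch showing that colliding-but-unequal keys still coexist in a dict (Clash is an invented class that deliberately forces every instance onto the same hash value):

```python
class Clash:
    """Two distinct values that deliberately share a hash."""
    def __init__(self, v):
        self.v = v

    def __hash__(self):
        return 42  # every instance collides

    def __eq__(self, other):
        return isinstance(other, Clash) and self.v == other.v

d = {Clash(1): 'one', Clash(2): 'two'}
print(len(d))       # 2 -- the collision is resolved, not overwritten
print(d[Clash(1)])  # 'one'
```

The dict falls back on __eq__ to tell colliding keys apart, which is why equal hashes only cost performance, not correctness.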
Outdated/bad solution
The Python documentation on __hash__ suggests to combine the hashes of the sub-components using something like XOR, which gives us this:
class B(object):
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __eq__(self, othr):
        if isinstance(othr, type(self)):
            return ((self._a, self._b, self._c) ==
                    (othr._a, othr._b, othr._c))
        return NotImplemented

    def __hash__(self):
        return (hash(self._a) ^ hash(self._b) ^ hash(self._c) ^
                hash((self._a, self._b, self._c)))
Update: as Blckknght points out, changing the order of a, b, and c could cause problems. I added an additional ^ hash((self._a, self._b, self._c)) to capture the order of the values being hashed. This final ^ hash(...) can be removed if the values being combined cannot be rearranged (for example, if they have different types and therefore the value of _a will never be assigned to _b or _c, etc.).
Paul Larson of Microsoft Research studied a wide variety of hash functions. He told me that
for c in some_string:
    hash = 101 * hash + ord(c)
worked surprisingly well for a wide variety of strings. I've found that similar polynomial techniques work well for computing a hash of disparate subfields.
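The same polynomial scheme can be sketched as a standalone Python function (poly_hash is an invented helper; the modulus just keeps the running value within 64 bits, since Python ints are unbounded):

```python
def poly_hash(s, multiplier=101):
    """Polynomial string hash of the kind described above,
    with multiplier 101 as per Larson."""
    h = 0
    for c in s:
        h = (multiplier * h + ord(c)) % (2 ** 64)
    return h

print(poly_hash("hello"))
```

Each character shifts the accumulated hash by the multiplier before being mixed in, so both the characters and their order affect the result.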
A good way to implement hash (as well as list, dict, tuple) is to make the object have a predictable order of items by making it iterable using __iter__. So to modify an example from above:
class A:
    def __init__(self, a, b, c):
        self._a = a
        self._b = b
        self._c = c

    def __iter__(self):
        yield "a", self._a
        yield "b", self._b
        yield "c", self._c

    def __hash__(self):
        return hash(tuple(self))

    def __eq__(self, other):
        return (isinstance(other, type(self))
                and tuple(self) == tuple(other))
(here __eq__ is not required for hash, but it's easy to implement).
Now add some mutable members to see how it works:
a = 2; b = 2.2; c = 'cat'
hash(A(a, b, c)) # -5279839567404192660
dict(A(a, b, c)) # {'a': 2, 'b': 2.2, 'c': 'cat'}
list(A(a, b, c)) # [('a', 2), ('b', 2.2), ('c', 'cat')]
tuple(A(a, b, c)) # (('a', 2), ('b', 2.2), ('c', 'cat'))
things only fall apart if you try to put non-hashable members in the object model:
hash(A(a, b, [1])) # TypeError: unhashable type: 'list'
I can try to answer the second part of your question.
The collisions will probably result not from the hash code itself, but from mapping the hash code to an index in a collection. So for example your hash function could return random values from 1 to 10000, but if your hash table only has 32 entries you'll get collisions on insertion.
In addition, I would think that collisions would be resolved by the collection internally, and there are many methods to resolve collisions. The simplest (and worst) is, given an entry to insert at index i, add 1 to i until you find an empty spot and insert there. Retrieval then works the same way. This results in inefficient retrievals for some entries, as you could have an entry that requires traversing the entire collection to find!
Other collision resolution methods reduce the retrieval time by moving entries in the hash table when an item is inserted to spread things out. This increases the insertion time but assumes you read more than you insert. There are also methods that try to branch different colliding entries out so that entries don't cluster in one particular spot.
Also, if you need to resize the collection you will need to rehash everything or use a dynamic hashing method.
In short, depending on what you're using the hash code for you may have to implement your own collision resolution method. If you're not storing them in a collection, you can probably get away with a hash function that just generates hash codes in a very large range. If so, you can make sure your container is bigger than it needs to be (the bigger the better of course) depending on your memory concerns.
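The linear-probing scheme described above can be sketched as follows (LinearProbeTable is an invented name; it has a fixed size and no deletion or resizing, so it is an illustration rather than a usable container):

```python
class LinearProbeTable:
    """Minimal open-addressing hash table with linear probing,
    as described above (fixed size, no resizing or deletion)."""
    def __init__(self, size=8):
        self.slots = [None] * size

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        # On collision, step to the next slot until we find an empty
        # one or a slot already holding this key.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        # Retrieval probes the same way: an empty slot means "absent".
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)

t = LinearProbeTable()
t.put('a', 1)
t.put('b', 2)
print(t.get('a'))  # 1
```

Note the worst case mentioned above: a badly clustered table can force a probe across nearly every slot before the lookup resolves.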
Here are some links if you're interested more:
coalesced hashing on wikipedia
Wikipedia also has a summary of various collision resolution methods.
Also, "File Organization And Processing" by Tharp covers a lot of collision resolution methods extensively. IMO it's a great reference for hashing algorithms.
A very good explanation of when and how to implement the __hash__ function is on the programiz website (retrieved 2019-12-13).
As for a personal implementation of the method, the above-mentioned site provides an example that matches the answer of millerdev.
class Person:
    def __init__(self, age, name):
        self.age = age
        self.name = name

    def __eq__(self, other):
        return self.age == other.age and self.name == other.name

    def __hash__(self):
        print('The hash is:')
        return hash((self.age, self.name))

person = Person(23, 'Adam')
print(hash(person))
Depends on the size of the hash value you return. It's simple logic that if you need to return a 32bit int based on the hash of four 32bit ints, you're gonna get collisions.
I would favor bit operations. Like, the following C pseudo code:
int a;
int b;
int c;
int d;
int hash = (a & 0xF000F000) | (b & 0x0F000F00) | (c & 0x00F000F0) | (d & 0x000F000F);
Such a system could work for floats too, perhaps even better, if you simply took their bit value rather than the floating-point value they represent.
For strings, I've got little/no idea.
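For floats, taking "their bit value" as suggested above is straightforward in Python with the struct module; float_bits below is an invented helper that reinterprets the IEEE-754 bytes of a double as an unsigned integer:

```python
import struct

def float_bits(x):
    """Reinterpret a float's IEEE-754 double bytes as an unsigned
    64-bit int, so it can be fed into bit operations like the ones
    in the C pseudocode above."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

a, b = float_bits(1.5), float_bits(2.25)
# Mix the two bit patterns with masks, in the spirit of the C example.
print(hex((a & 0xFFFF0000FFFF0000) | (b & 0x0000FFFF0000FFFF)))
```

This side-steps the usual caveats of hashing floats by value (e.g. rounding), at the cost of treating numerically equal values like 1.0 and 1 + 0.0 identically only when their bit patterns match.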
@dataclass(frozen=True) (Python 3.7)
This awesome new feature, among other good things, automatically defines a __hash__ and __eq__ method for you, making it just work as usually expected in dicts and sets:
dataclass_cheat.py
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class MyClass1:
    n: int
    s: str

@dataclass(frozen=True)
class MyClass2:
    n: int
    my_class_1: MyClass1

d = {}
d[MyClass1(n=1, s='a')] = 1
d[MyClass1(n=2, s='a')] = 2
d[MyClass1(n=2, s='b')] = 3
d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] = 4
d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] = 5
d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] = 6

assert d[MyClass1(n=1, s='a')] == 1
assert d[MyClass1(n=2, s='a')] == 2
assert d[MyClass1(n=2, s='b')] == 3
assert d[MyClass2(n=1, my_class_1=MyClass1(n=1, s='a'))] == 4
assert d[MyClass2(n=2, my_class_1=MyClass1(n=1, s='a'))] == 5
assert d[MyClass2(n=2, my_class_1=MyClass1(n=2, s='a'))] == 6

# Due to `frozen=True`
o = MyClass1(n=1, s='a')
try:
    o.n = 2
except FrozenInstanceError:
    pass
else:
    raise AssertionError('expected FrozenInstanceError')
As we can see in this example, the hashes are being calculated based on the contents of the objects, and not simply on the addresses of instances. This is why something like:
d = {}
d[MyClass1(n=1, s='a')] = 1
assert d[MyClass1(n=1, s='a')] == 1
works even though the second MyClass1(n=1, s='a') is a completely different instance from the first with a different address.
frozen=True is mandatory: without it the class is not hashable, since users could otherwise inadvertently make containers inconsistent by modifying objects after they are used as keys. Further documentation: https://docs.python.org/3/library/dataclasses.html
Tested on Python 3.10.7, Ubuntu 22.10.
