Is avoiding expensive __init__ a good reason to use __new__?

In my project, we have a class based on set. It can be initialised from a string, or an iterable (e.g. a tuple) of strings, or other custom classes. When initialised with an iterable, it converts each item to a particular custom class if it is not one already.
Because it can be initialised from a variety of data structures, a lot of the methods that operate on this class (such as __and__) are liberal in what they accept and just convert their arguments to this class (i.e. initialise a new instance). We are finding this rather slow when the argument is already an instance of the class and has a lot of members (it iterates through them all and checks that each is the right type).
I was thinking that to avoid this, we could add a __new__ method to the class and, if the argument passed in is already an instance of the class, return it directly. Would this be a reasonable use of __new__?

Adding a __new__ method will not solve your problem. From the documentation for __new__:
If __new__() returns an instance of cls, then the new instance's __init__() method will be invoked like __init__(self[, ...]), where self is the new instance and the remaining arguments are the same as were passed to __new__().
In other words, returning the same instance will not prevent Python from calling __init__.
You can verify this quite easily:
In [20]: class A:
    ...:     def __new__(cls, arg):
    ...:         if isinstance(arg, cls):
    ...:             print('here')
    ...:             return arg
    ...:         return super().__new__(cls)
    ...:     def __init__(self, values):
    ...:         self.values = list(values)
In [21]: a = A([1,2,3])
In [22]: A(a)
here
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-c206e38274e0> in <module>()
----> 1 A(a)
<ipython-input-20-5a7322f37287> in __init__(self, values)
6 return super().__new__(cls)
7 def __init__(self, values):
----> 8 self.values = list(values)
TypeError: 'A' object is not iterable
You may be able to make this work if you did not implement __init__ at all, but only __new__. I believe this is what tuple does.
Also, that behaviour would be acceptable only if your class is immutable (tuple, for example, is), because then handing back the same instance is sensible. If it is mutable, you are asking for hidden bugs.
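For instance, here is a minimal sketch of the __new__-only approach (the class name Frozen and its attribute are illustrative, not from the question). Because __init__ is not overridden, object.__init__ ignores the extra constructor argument, so returning an existing instance really does skip re-initialisation:

class Frozen:
    def __new__(cls, values):
        if isinstance(values, cls):
            return values  # reuse; safe only because instances are never mutated
        self = super().__new__(cls)
        self._values = tuple(values)
        return self

a = Frozen([1, 2, 3])
assert Frozen(a) is a  # no conversion cost the second time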
A more sensible approach is to do what set does: the __*__ operations work only on other sets, but set also provides named methods that work with any iterable:
In [30]: set([1,2,3]) & [1,2]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-30-dfd866b6c99b> in <module>()
----> 1 set([1,2,3]) & [1,2]
TypeError: unsupported operand type(s) for &: 'set' and 'list'
In [31]: set([1,2,3]) & set([1,2])
Out[31]: {1, 2}
In [32]: set([1,2,3]).intersection([1,2])
Out[32]: {1, 2}
In this way the user can choose between speed and flexibility of the API.
A simpler approach is the one proposed by unutbu: use isinstance instead of duck-typing when implementing the operations.
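A minimal sketch combining both ideas (the class name TypedSet and the helper to_item are mine, standing in for your project's per-item conversion):

def to_item(x):
    # Stand-in for the project's expensive per-item conversion.
    return x

class TypedSet(set):
    def __init__(self, iterable=()):
        # unutbu's isinstance fast path: items of an existing TypedSet
        # are already converted, so copy them without re-checking.
        if isinstance(iterable, TypedSet):
            set.__init__(self, iterable)
        else:
            set.__init__(self, (to_item(x) for x in iterable))

    def __and__(self, other):
        # Strict operator, mirroring set: demand another TypedSet.
        if not isinstance(other, TypedSet):
            return NotImplemented
        return self.intersection(other)

    def intersection(self, other):
        # Liberal named method: accept any iterable; the conversion
        # cost is paid only on this explicit path.
        if not isinstance(other, TypedSet):
            other = TypedSet(other)
        result = TypedSet()
        set.update(result, (x for x in self if x in other))
        return result

s = TypedSet('abc')
print(s & TypedSet('ab'))          # fast: both operands already converted
print(s.intersection(['a', 'b']))  # flexible: plain iterables accepted
# s & ['a', 'b'] raises TypeError, just like set([1, 2]) & [1, 2]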


Object is enumerable but not indexable?

Problem summary and question
I'm trying to look at some of the data inside an object that can be enumerated over but not indexed. I'm still newish to python, but I don't understand how this is possible.
If you can enumerate it, why can't you access the index through the same way enumerate does? And if not, is there a way to access the items individually?
The actual example
import tensorflow_datasets as tfds

train_validation_split = tfds.Split.TRAIN.subsplit([6, 4])

(train_data, validation_data), test_data = tfds.load(
    name="imdb_reviews",
    split=(train_validation_split, tfds.Split.TEST),
    as_supervised=True)
Take a select subset of the dataset
foo = train_data.take(5)
I can iterate over foo with enumerate:
[In] for i, x in enumerate(foo):
         print(i)
which generates the expected output:
0
1
2
3
4
But then, when I try to index into it foo[0] I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-44-2acbea6d9862> in <module>
----> 1 foo[0]
TypeError: 'TakeDataset' object does not support indexing
Python only allows these things if the class has methods for them:
__getitem__ is required for the [] syntax.
__iter__ and __next__ [1] are required to iterate.
Any class can define one without defining the other. __getitem__ is usually not defined when it would be inefficient.
[1] __next__ is required on the class returned by __iter__.
This is a result of foo being iterable, but not having a __getitem__ method. You can use itertools.islice to get the nth element of an iterable like so:
import itertools

def nth(iterable, n, default=None):
    "Returns the nth item or a default value"
    return next(itertools.islice(iterable, n, None), default)
In Python, instances of custom classes can support iteration through the special (or "dunder") __iter__ method. Perhaps this class implements __iter__ but not __getitem__.
Dunder overview: https://dbader.org/blog/python-dunder-methods
Specs for an __iter__ method: https://docs.python.org/3/library/stdtypes.html#typeiter
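To make this concrete, here is a minimal sketch of a class in the same situation (the class name Stream is made up): it defines __iter__ but not __getitem__, so enumerate works while indexing fails.

class Stream:
    """Iterable but not indexable, like TakeDataset."""
    def __init__(self, items):
        self._items = list(items)
    def __iter__(self):
        return iter(self._items)  # enables for-loops and enumerate()
    # No __getitem__, so Stream(...)[0] raises TypeError.

s = Stream([10, 20, 30])
for i, x in enumerate(s):
    print(i, x)
print(nth(s, 1))  # 20, using the islice helper defined above
# s[0]            # TypeError: 'Stream' object is not subscriptable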

extending built-in python dict class

I want to create a class that would extend dict's functionalities. This is my code so far:
class Masks(dict):
    def __init__(self, positive=[], negative=[]):
        self['positive'] = positive
        self['negative'] = negative
I want to have two predefined arguments in the constructor: a list of positive and a list of negative masks. With the code above I can run
m = Masks()
and a new masks-dictionary object is created - that's fine. But I'd like to be able to create a masks object just like I can with a dict:
d = dict(one=1, two=2)
But this fails with Masks:
>>> n = Masks(one=1, two=2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'two'
I should probably call the parent constructor somewhere in Masks.__init__. I tried it with **kwargs, passing them into the parent constructor, but something still went wrong. Could someone point out what I should add here?
You must call the superclass __init__ method. And if you want to be able to use the Masks(one=1, ..) syntax then you have to use **kwargs:
In [1]: class Masks(dict):
   ...:     def __init__(self, positive=(), negative=(), **kwargs):
   ...:         super(Masks, self).__init__(**kwargs)
   ...:         self['positive'] = list(positive)
   ...:         self['negative'] = list(negative)
   ...:
In [2]: m = Masks(one=1, two=2)
In [3]: m['one']
Out[3]: 1
A general note: do not subclass built-ins!!!
It seems an easy way to extend them but it has a lot of pitfalls that will bite you at some point.
A safer way to extend a built-in is to use delegation, which gives better control over the subclass's behaviour and avoids many pitfalls of inheriting from built-ins. (Note that by implementing __getattr__ it is possible to avoid explicitly reimplementing many methods.)
Inheritance should be used as a last resort when you want to pass the object into some code that does explicit isinstance checks.
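A minimal sketch of the delegation approach (the wrapped-attribute name _data is my choice, not from the answer):

class Masks:
    def __init__(self, positive=(), negative=(), **kwargs):
        self._data = dict(kwargs)
        self._data['positive'] = list(positive)
        self._data['negative'] = list(negative)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __getattr__(self, name):
        # Forward anything not defined here (get, keys, items, ...)
        # to the wrapped dict.
        return getattr(self._data, name)

m = Masks(one=1, two=2)
print(m['one'], m.get('two'))  # 1 2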
Since all you want is a regular dict with predefined entries, you can use a factory function.
def mask(*args, **kw):
    """Create a mask dict using the same signature as dict(),
    defaulting 'positive' and 'negative' to empty lists.
    """
    d = dict(*args, **kw)
    d.setdefault('positive', [])
    d.setdefault('negative', [])
    return d
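Usage is then just like dict, with the defaults filled in:

>>> m = mask(one=1, two=2)
>>> m['one'], m['positive'], m['negative']
(1, [], [])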

What's the difference between class variables of different types?

Firstly, there is class A with two class variables and two instance variables:
In [1]: def fun(x, y): return x + y

In [2]: class A:
   ...:     cvar = 1
   ...:     cfun = fun
   ...:     def __init__(self):
   ...:         self.ivar = 100
   ...:         self.ifun = fun
We can see that both the class variable and the instance variable of int type work fine:
In [3]: a = A()
In [4]: a.ivar, a.cvar
Out[4]: (100, 1)
However, things change if we check the function-type variables:
In [5]: a.ifun, a.cfun
Out[5]:
(<function __main__.fun>,
<bound method A.fun of <__main__.A instance at 0x25f90e0>>)
In [6]: a.ifun(1,2)
Out[6]: 3
In [7]: a.cfun(1,2)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/future/<ipython-input-7-39aa8db2389e> in <module>()
----> 1 a.cfun(1,2)
TypeError: fun() takes exactly 2 arguments (3 given)
I know that Python translates a.cfun(1,2) into A.cfun(a,1,2), which is why the error is raised.
My question is: since both cvar and cfun are class variables, why does Python treat them differently?
Actually, a function assigned to a class member remains a function:
def x(): pass

class A:
    f = x
    e = None
    g = None

print(A.__dict__['f'])
# <function x at 0x10e0a6e60>
It's converted on the fly to a method object when you retrieve it from an instance:
print(A().f)
# <bound method A.x of <__main__.A instance at 0x1101ddea8>>
http://docs.python.org/2/reference/datamodel.html#the-standard-type-hierarchy "User-defined methods":
User-defined method objects may be created when getting an attribute of a class (perhaps via an instance of that class), if that attribute is a user-defined function object, an unbound user-defined method object, or a class method object... Note that the transformation from function object to (unbound or bound) method object happens each time the attribute is retrieved from the class or instance.
This conversion only occurs to functions assigned to a class, not to an instance. Note that this has been changed in Python 3, where Class.fun returns a normal function, not an "unbound method".
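A quick check of that Python 3 behaviour:

def fun(x, y):
    return x + y

class A:
    cfun = fun

print(A.cfun)    # <function fun at 0x...>: a plain function in Python 3
print(A().cfun)  # <bound method fun of <__main__.A object at 0x...>>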
As to your question why this is needed: a method object is essentially a closure that contains a function along with its execution context ("self"). Imagine you've got an object and use its method as a callback somewhere. In many other languages you have to pass both object and method pointers, or create a closure manually. For example, in JavaScript:
myListener = new Listener()
something.onSomeEvent = myListener.listen // won't work!
something.onSomeEvent = function() { myListener.listen() } // works
Python manages that for us behind the scenes:
myListener = Listener()
something.onSomeEvent = myListener.listen  # works
On the other side, sometimes it's practical to have "bare" functions or "foreign" methods in a class:
def __init__(..., dir, ...):
    self.strip = str.lstrip if dir == 'ltr' else str.rstrip
    ...

def foo(self, arg):
    self.strip(arg)
The convention above (class vars => methods, instance vars => functions) provides a convenient way to have both.
Needless to say, like everything else in Python, it's possible to change this behavior, i.e. to write a class that doesn't convert its functions to methods and returns them as-is.
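One way to do that (my example, not from the answer) is staticmethod, which suppresses the binding step:

def fun(x, y):
    return x + y

class A:
    cfun = staticmethod(fun)  # attribute lookup returns the bare function

a = A()
print(a.cfun)        # <function fun at 0x...>, not a bound method
print(a.cfun(1, 2))  # 3: no implicit self is inserted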

Iterating class object

It's not a real-world program, but I would like to know why this can't be done.
I was thinking about the numpy.r_ object and tried to do something similar, just by making a class and not instantiating it.
A simple version (with some flaws) for integers could be:
class r_:
    @classmethod
    def __getitem__(clc, sl):
        try:
            return range(sl)
        except TypeError:
            sl = sl.start, sl.stop, sl.step
            return range(*(i for i in sl if i is not None))
but as I try to do r_[1:10] I receive TypeError: 'type' object is not subscriptable.
Of course the code works with r_.__getitem__(slice(1,10)) but that's not what I want.
Is there something I can do in this case instead of using r_()[1:10]?
The protocol for resolving obj[index] is to look for a __getitem__ method in the type of obj, not to look it up directly on obj (which would normally fall back to the type if obj had no instance attribute named __getitem__).
This can be easily verified.
>>> class Foo(object):
...     pass
...
>>> def __getitem__(self, index):
...     return index
...
>>> f = Foo()
>>> f.__getitem__ = __getitem__
>>> f[3]
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    f[3]
TypeError: 'Foo' object does not support indexing
>>> Foo.__getitem__ = __getitem__
>>> f[3]
3
I don't know exactly why it works this way, but I would guess that at least part of the reason is precisely to prevent what you're trying to do: it would be surprising if every class that defined __getitem__ to make its instances indexable accidentally became indexable itself. In the overwhelming majority of cases, code that tries to index a class is a bug, so if its __getitem__ happened to be able to return something, that bug would go uncaught.
Why don't you just call the class something else, and bind an instance of it to the name r_? Then you'd be able to do r_[1:10].
What you are trying to do is like list[1:5] or set[1:5] =) The special __getitem__ method only works on instances.
What one would normally do is just create a single ("singleton") instance of the class:
class r_class(object):
    ...

r_ = r_class()
Now you can do:
r_[1:5]
You can also use metaclasses, but that may be more than is necessary.
"No, my question was about getitem in the class, not in the instance"
Then you do need metaclasses.
class r_meta(type):
    def __getitem__(cls, key):
        return range(key)

class r_(object, metaclass=r_meta):
    pass
Demo:
>>> r_[5]
range(0, 5)
If you pass in r_[1:5] you will get a slice object; do help(slice) for more info. You can access its values with something like key.stop if isinstance(key, slice) else key.
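For example, a small sketch that handles both forms in the metaclass (the defaults chosen for omitted slice parts are my own):

class r_meta(type):
    def __getitem__(cls, key):
        if isinstance(key, slice):
            # Fill in range's defaults for omitted slice parts.
            start = 0 if key.start is None else key.start
            step = 1 if key.step is None else key.step
            return range(start, key.stop, step)
        return range(key)

class r_(object, metaclass=r_meta):
    pass

print(list(r_[5]))    # [0, 1, 2, 3, 4]
print(list(r_[1:5]))  # [1, 2, 3, 4]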
Define __getitem__() as a normal method in r_'s metaclass.
The reason for this behavior lies in the way special methods like __getitem__() are looked up.
Attributes are looked up first in the object's __dict__ and, if not found there, in the class's __dict__. That's why e.g. this works:
>>> class Test1(object):
...     x = 'hello'
...
>>> t = Test1()
>>> t.__dict__
{}
>>> t.x
'hello'
Methods that are defined in the class body are stored in the class __dict__:
>>> class Test2(object):
...     def foo(self):
...         print 'hello'
...
>>> t = Test2()
>>> t.foo()
hello
>>> Test2.foo()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unbound method foo() must be called with Test2 instance as first argument (got nothing instead)
So far there's nothing surprising here. When it comes to special methods however, Python's behavior is a little (or very) different:
>>> class Test3(object):
...     def __getitem__(self, key):
...         return 1
...
>>> t = Test3()
>>> t.__getitem__('a key')
1
>>> Test3['a key']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'type' object is unsubscriptable
The error messages are very different. With Test2, Python complains about an unbound method call, whereas with Test3 it complains about the unsubscriptability.
If you try to invoke a special method - by way of using its associated operator - on an object, Python doesn't try to find it in the object's __dict__ but goes straight to the __dict__ of the object's class, which, if the object is itself a class, is a metaclass. So that's where you have to define it:
>>> class Test4(object):
...     class __metaclass__(type):
...         def __getitem__(cls, key):
...             return 1
...
>>> Test4['a key']
1
There's no other way. To quote PEP20: There should be one-- and preferably only one --obvious way to do it.

Understanding python object membership for sets

If I understand correctly, the __cmp__() function of an object is called to compare it against the objects in a collection when determining whether it is a member of ('in') the collection.
However, this does not seem to be the case for sets:
class MyObject(object):
    def __init__(self, data):
        self.data = data
    def __cmp__(self, other):
        return self.data - other.data

a = MyObject(5)
b = MyObject(5)
print a in [b]       # evaluates to True, as I'd expect
print a in set([b])  # evaluates to False
How is object membership tested in a set, then?
Adding a __hash__ method to your class yields this:
class MyObject(object):
    def __init__(self, data):
        self.data = data
    def __cmp__(self, other):
        return self.data - other.data
    def __hash__(self):
        return hash(self.data)

a = MyObject(5)
b = MyObject(5)
print a in [b]       # True
print a in set([b])  # Also True!
>>> xs = []
>>> set([xs])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
There you are. Sets use hashes, very similar to dicts. This helps performance enormously (membership tests are O(1), and many other operations depend on membership tests), and it also fits the semantics of sets well: set items must be unique, and different items will produce different hashes, while identical hashes indicate (well, in theory) duplicates.
Since the default __hash__ is just id (which is rather stupid imho), two instances of a class that inherits object's __hash__ will never hash to the same value (well, unless the address space is larger than the size of the hash).
As others pointed out, your objects don't have a __hash__, so they use the default id as a hash; you can override it as Nathon suggested, BUT read the docs about __hash__, specifically the points about when you should and should not do that.
A set uses a dict behind the scenes, so the in statement checks whether the object exists as a key in that dict. Since your object doesn't implement a hash function, the default hash for objects uses the object's id. So even though a and b are equivalent, they're not the same object, and that's what's being tested.
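A small sketch of my own (in the question's Python 2 style) showing the two-step check that set membership performs: hash to find the bucket, then equality to confirm.

class Tracked(object):
    def __init__(self, data):
        self.data = data
    def __hash__(self):
        return 0  # deliberately constant: every instance lands in one bucket
    def __eq__(self, other):
        print 'comparing %r and %r' % (self.data, other.data)
        return self.data == other.data

s = set([Tracked(1), Tracked(2)])
print Tracked(2) in s  # prints the comparison(s), then True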
