Redeclaration of the method "in" within a class - python

I am creating an abstract data type that implements a doubly linked list (not sure that's the correct translation). I have already written a __len__ method that computes its length correctly and a __repr__ method that represents it properly, but now I want to add a method so that when the user writes something like:
if foo in liste_adt
it returns the correct answer. I don't know what to use, because __in__ does not work.
Thank you,

Are you looking for __contains__?
object.__contains__(self, item)
Called to implement membership test operators. Should return true if item is in self, false otherwise. For mapping objects, this should consider the keys of the mapping rather than the values or the key-item pairs.
For objects that don’t define __contains__(), the membership test first tries iteration via __iter__(), then the old sequence iteration protocol via __getitem__(), see this section in the language reference.
Quick example:
>>> class Bar:
...     def __init__(self, iterable):
...         self.list = list(iterable)
...     def __contains__(self, item):
...         return item in self.list
>>>
>>> b = Bar([1, 2, 3])
>>> b.list
[1, 2, 3]
>>> 4 in b
False
>>> 2 in b
True
Note: usually when you have this kind of doubt, the reference you want is the Data Model section of The Python Language Reference.

Since the data structure is a linked list, you have to iterate over it to check membership anyway. Implementing an __iter__() method makes both the in operator and for loops work. If there is a more efficient way of checking membership, implement that in __contains__().
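A minimal sketch of that idea, assuming a node-based doubly linked list (the Node and DoublyLinkedList names here are illustrative, not from the question):

class Node:
    # Illustrative names; the original ADT's internals are unknown.
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

class DoublyLinkedList:
    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            node.prev = self.tail
            self.tail.next = node
            self.tail = node

    def __iter__(self):
        # Walk the chain from head to tail, yielding stored values.
        current = self.head
        while current is not None:
            yield current.value
            current = current.next

    def __contains__(self, item):
        # Linear scan; for a linked list there is no faster option.
        return any(value == item for value in self)

With __iter__ defined, both "for x in lst" and "x in lst" work; __contains__ here just makes the membership test explicit.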

|= vs update with a subclass of collections.abc.Set

I need to subclass set so I subclassed collections.abc.Set, as suggested here: https://stackoverflow.com/a/6698723/211858.
Please find my simple implementation below.
It essentially wraps a set of integers.
I generate a list of 10,000 MySet instances, each consisting of 100 random integers.
I would like to take the union of these wrapped sets.
I have two implementations below.
For some reason, the first using update is very fast, yet the second using |= is slow.
The tqdm wrapper is to conduct nonrigorous benchmarks.
Is there some way to correct the definition of the class to fix this performance issue?
Thanks!
I'm on Python 3.10.5.
from collections.abc import Iterable, Iterator, Set
from tqdm import tqdm


class MySet(Set):
    def __init__(self, integers: Iterable[int]) -> None:
        self.data: set[int] = set(integers)

    def __len__(self) -> int:
        return len(self.data)

    def __iter__(self) -> Iterator[int]:
        return iter(self.data)

    def __contains__(self, x: object) -> bool:
        if isinstance(x, int):
            return x in self.data
        else:
            raise NotImplemented

    def my_func(self):
        ...

    def my_other_func(self):
        ...

# %%
import random

# Make some mock data
my_sets: list[MySet] = [
    MySet(random.sample(range(1_000_000), 100)) for _ in range(10_000)
]

# %%
universe: set[int] = set()
universe2: set[int] = set()

# %%
# Nearly instant
for my_set in tqdm(my_sets):
    universe.update(my_set)

# %%
# Takes well over 5 minutes on my laptop
for my_set in tqdm(my_sets):
    universe2 |= my_set
Conclusion: the change that requires the least code is to implement the __ior__ method.
What happens when there is no __ior__ implementation:
1. When the in-place or operation (|=) runs for the first time, universe2 is a set and my_set is a MySet. Since set does not recognize the MySet class, the in-place or operation degenerates into a plain binary or operation.
2. As in point 1, set's own binary or fails (returns NotImplemented), so Python tries to call the __ror__ method of MySet.
3. Because MySet defines no __ror__ method of its own, Python falls back to collections.abc.Set, whose __ror__ is the same function as its __or__ and returns a result of type MySet. You can find it in the _collections_abc.py file:
class Set(Collection):
    ...
    @classmethod
    def _from_iterable(cls, it):
        '''Construct an instance of the class from any iterable input.

        Must override this method if the class constructor signature
        does not accept an iterable for an input.
        '''
        return cls(it)
    ...
    def __or__(self, other):
        if not isinstance(other, Iterable):
            return NotImplemented
        chain = (e for s in (self, other) for e in s)
        return self._from_iterable(chain)

    __ror__ = __or__
    ...
For the subsequent in-place or operations, the first __ror__ call has already turned universe2 into a MySet, and neither MySet nor collections.abc.Set has an __ior__ method, so collections.abc.Set.__or__ is called repeatedly and a full copy is made on every loop iteration. This is the root cause of the slowness of the second loop. Therefore, as long as an __ior__ method is implemented so that subsequent operations avoid the copying, performance improves greatly.
Suggestions for a better implementation: the abstract class collections.abc.Set represents an immutable set, which is why it does not implement the in-place operation methods. If you need your subclass to support in-place operations, consider inheriting from collections.abc.MutableSet and implementing the add and discard abstract methods. MutableSet implements the in-place operation methods such as __ior__ in terms of these two abstract methods (which is still not as efficient as the built-in set, so it is better to implement them yourself):
class MutableSet(Set):
    ...
    def __ior__(self, it):
        for value in it:
            self.add(value)
        return self
    ...
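As a concrete illustration of the conclusion above, here is a minimal sketch (an addition, not part of the original answer) that adds __ior__ directly to the question's MySet so that, once universe2 has become a MySet, |= updates the wrapped set in place instead of copying it:

from collections.abc import Iterable, Iterator, Set


class MySet(Set):
    def __init__(self, integers: Iterable[int]) -> None:
        self.data: set[int] = set(integers)

    def __len__(self) -> int:
        return len(self.data)

    def __iter__(self) -> Iterator[int]:
        return iter(self.data)

    def __contains__(self, x: object) -> bool:
        return x in self.data

    def __ior__(self, it):
        # Added for illustration: update the wrapped set in place,
        # so no new MySet copy is built on each iteration.
        self.data.update(it)
        return self

The first |= still goes through Set.__ror__ and builds one MySet copy, but every subsequent iteration hits __ior__ and stays in place, which removes the per-loop copying described above.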
Correction: there were some mistakes in older versions of this answer, which are corrected here. I hope those who read the old answer see this:
Mistake 1:
If necessary, you can also implement the __ior__ method, but it is not recommended to implement it on its own when neither __or__ nor __ror__ is implemented, because Python will try to call __ior__ when it cannot find their implementations, which would turn a non-in-place operation into an in-place one and may lead to unexpected results.
Correction: the binary or operation does not call the __ior__ method when __or__ and __ror__ are missing.
Mistake 2:
Generally speaking, a binary operation between instances of different types might be expected to return the type of the left operand, as with set and frozenset:
>>> {1} | frozenset({2})
{1, 2}
>>> frozenset({2}) | {1}
frozenset({1, 2})
Correction: this is not always true. For example, the __ror__ method of collections.abc.Set also returns an instance of its own subtype rather than the type of the left operand.
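A quick way to see this in the interpreter, reusing the MySet class from the question (assuming it was defined at the top level of the session; output is illustrative):

>>> # set.__or__ returns NotImplemented here, so MySet.__ror__ (from Set) runs
>>> result = {1} | MySet([2, 3])
>>> type(result)
<class '__main__.MySet'>
>>> sorted(result)
[1, 2, 3]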

In python, when two lists are appended, what is called on the second list to get the items?

I'm in a slightly tricky situation with a list subclass.
The list class I'm using overrides __iter__ and __getitem__ in order to return something slightly different from what's actually stored internally (the required processing is expensive, so we only want to do it when an item is first accessed directly).
My issue comes up in the following use case. Let's say, for the use of this example, that the overridden methods turn the internal values into strings.
>>> myList = MyList([1, 2, 3])
>>> standardList = list(["a", "b", "c"])
>>>
>>> for item in myList:
...     print item
"1"
"2"
"3"
>>> newList = standardList + myList
>>> for item in newList:
...     print item
"a"
"b"
"c"
1
2
3
So what I'm after here is how the values are pulled from myList when standardList + myList is run, so that I can ensure that newList has had the relevant modifications made.
Incidentally, I'm well aware that I could get this working if I overrode MyList.__add__ and then did myList + standardList, but this is in a module that is used elsewhere, and I'd rather ensure that it works the right way in both directions.
Thanks
To ensure it works in both directions you should override both __add__ and __radd__ in your MyList class. Quoting from the data model page (7th bullet point)*:
Exception to the previous item: if the left operand is an instance of
a built-in type or a new-style class, and the right operand is an
instance of a proper subclass of that type or class and overrides the
base’s __rop__() method, the right operand’s __rop__() method is tried
before the left operand’s __op__() method.
So, your code will look like:
class MyList(list):
    def __radd__(self, other):
        return MyList(list.__add__(other, self))
*Note that the proper subclass requirement mentioned in the Python 2 docs is actually a documentation bug, and it was fixed in Python 3:
If the right operand’s type is a subclass of the left operand’s type
and that subclass provides the reflected method for the operation,
this method will be called before the left operand’s non-reflected
method. This behavior allows subclasses to override their ancestors’
operations.
You can simply pass the result of the list addition to the constructor of MyList, like:
newList = MyList(standardList + myList)
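Putting it together, a minimal sketch (Python 3 syntax, not from the answers above) where the deferred processing is the stringification used in the question's example; the _process helper is hypothetical:

class MyList(list):
    def _process(self, value):
        # Stand-in for the deferred per-item processing from the question.
        return str(value)

    def __iter__(self):
        for value in list.__iter__(self):
            yield self._process(value)

    def __getitem__(self, index):
        return self._process(list.__getitem__(self, index))

    def __add__(self, other):
        # myList + standardList: keep the result a MyList so items stay processed.
        return MyList(list.__add__(self, other))

    def __radd__(self, other):
        # standardList + myList: tried first because MyList subclasses list.
        return MyList(list.__add__(other, self))


myList = MyList([1, 2, 3])
standardList = ["a", "b", "c"]
print(list(standardList + myList))  # ['a', 'b', 'c', '1', '2', '3']
print(list(myList + standardList))  # ['1', '2', '3', 'a', 'b', 'c']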

Is there a reason to prefer list or tuple for __slots__?

You can define __slots__ in new-style python classes using either list or tuple (or perhaps any iterable?). The type persists after instances are created.
Given that tuples are always a little more efficient than lists and are immutable, is there any reason why you would not want to use a tuple for __slots__?
>>> class foo(object):
...     __slots__ = ('a',)
...
>>> class foo2(object):
...     __slots__ = ['a']
...
>>> foo().__slots__
('a',)
>>> foo2().__slots__
['a']
First, tuples aren't any more efficient than lists; they both support the exact same fast iteration mechanism from C API code, and use the same code for both indexing and iterating from Python.
More importantly, the __slots__ mechanism doesn't actually use the __slots__ member except during construction. This may not be that clearly explained by the documentation, but if you read all of the bullet points carefully enough the information is there.
And really, it has to be true. Otherwise, this wouldn't work:
class Foo(object):
    __slots__ = (x for x in ['a', 'b', 'c'] if x != 'b')
… and, worse, this would:
slots = ['a', 'b', 'c']
class Foo(object):
    __slots__ = slots
foo = Foo()
slots.append('d')
foo.d = 4
For further proof:
>>> a = ['a', 'b']
>>> class Foo(object):
...     __slots__ = a
...
>>> del Foo.__slots__
>>> foo = Foo()
>>> foo.d = 3
AttributeError: 'Foo' object has no attribute 'd'
>>> foo.__dict__
AttributeError: 'Foo' object has no attribute '__dict__'
>>> foo.__slots__
AttributeError: 'Foo' object has no attribute '__slots__'
So, that __slots__ member in Foo is really only there for documentation and introspection purposes, which means there is no performance issue or behavior issue, just a stylistic one.
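For the curious, the attribute storage is actually implemented by member descriptors that are created on the class at definition time, which is why the __slots__ attribute itself is dispensable afterwards. A quick check (Python 3 output, not part of the original answer):

>>> class Foo(object):
...     __slots__ = ['a', 'b']
...
>>> Foo.a
<member 'a' of 'Foo' objects>
>>> type(Foo.a)
<class 'member_descriptor'>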
According to the Python docs:
This class variable can be assigned a string, iterable, or sequence of
strings with variable names used by instances.
So, you can define it using any iterable. Which one you use is up to you, but in terms of which to "prefer", I would use a list.
First, let's consider what the preferred choice would be if performance were not an issue; it would be the same decision you make between lists and tuples anywhere in Python code. I would say a list, because a tuple is designed to carry semantic structure: it should mean something that a value is stored as the first item rather than the second. For example, if you stored the X of an (X, Y) coordinate tuple as the second item, you would have completely changed the semantic value of the structure. If you rearrange the names in the __slots__ list, you haven't semantically changed anything. Therefore, in this case, you should use a list.
Now, about performance. First, this is probably premature optimization. I don't know of a performance difference between lists and tuples here, and I would guess there isn't one anyway. But even if there were, it would only come into play if the __slots__ variable were accessed many times.
I haven't actually looked at the code for when __slots__ is accessed, but I ran the following test:
print('Defining slotter..')

class Slotter(object):
    def __iter__(self):
        print('Looking for slots')
        yield 'A'
        yield 'B'
        yield 'C'

print('Defining Mine..')

class Mine(object):
    __slots__ = Slotter()

print('Creating first mine...')
m1 = Mine()
m1.A = 1
m1.B = 2

print('Creating second mine...')
m2 = Mine()
m2.A = 1
m2.C = 2
Basically, I use a custom class so that I can see exactly when the slots variable is actually iterated. You'll see that it is done exactly once, when the class is defined.
Defining slotter..
Defining Mine..
Looking for slots
Creating first mine...
Creating second mine...
Unless there is a case that I'm missing where the __slots__ variable is iterated again, I think that the performance difference can be declared negligible at worst.

Assign object properties to list in a set order

How can I iterate over an object and assign all its properties to a list?
From
a = []
class A(object):
    def __init__(self):
        self.myinstatt1 = 'one'
        self.myinstatt2 = 'two'
to
a =['one','two']
Don't create a full-fledged class if you just want to store a bunch of attributes and return a list for your API to consume. Use a namedtuple instead. Here is an example.
>>> import collections
>>> Point = collections.namedtuple('Point', ['x', 'y'])
>>> p = Point(1, 2)
>>> p
Point(x=1, y=2)
If your API just expects a sequence (not specifically a list), you can pass p directly. If it needs a list specifically, it is trivial to convert the Point object to a list.
>>> list(p)
[1, 2]
You can even subclass the newly created Point class and add more methods (the documentation has details). If namedtuple doesn't meet your needs, consider subclassing the collections.abc.Sequence abstract base class or using it as a mixin.
One approach is to make your class behave like a list by implementing some or all of the container API. Depending on how the external API you're using works, you might only need to implement __iter__. If it needs more, you could always pass it list(a), which will build a list using an iterator.
Here's an example of how easy it can be to add an __iter__ method:
class A(object):
    def __init__(self):
        self.myAttr1 = "one"
        self.myAttr2 = "two"

    def __iter__(self):
        yield self.myAttr1
        yield self.myAttr2
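Usage, following on from that definition (a quick check, not part of the original answer):

>>> a = A()
>>> list(a)
['one', 'two']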

Understanding python object membership for sets

If I understand correctly, the __cmp__() method of an object is called in order to compare it against the objects in a collection when determining whether the object is a member of ('in') the collection.
However, this does not seem to be the case for sets:
class MyObject(object):
    def __init__(self, data):
        self.data = data
    def __cmp__(self, other):
        return self.data - other.data

a = MyObject(5)
b = MyObject(5)

print a in [b]        # evaluates to True, as I'd expect
print a in set([b])   # evaluates to False
How is an object membership tested in a set, then?
Adding a __hash__ method to your class yields this:
class MyObject(object):
    def __init__(self, data):
        self.data = data
    def __cmp__(self, other):
        return self.data - other.data
    def __hash__(self):
        return hash(self.data)

a = MyObject(5)
b = MyObject(5)

print a in [b]       # True
print a in set([b])  # Also True!
>>> xs = []
>>> set([xs])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
There you are. Sets use hashes, very much like dicts. This helps performance enormously (membership tests are O(1), and many other set operations depend on membership tests), and it also fits the semantics of sets well: set items must be unique, and different items will produce different hashes, while identical hashes indicate (in theory, at least) duplicates.
Since the default __hash__ is just id (which is rather stupid, imho), two instances of a class that inherits object's __hash__ will never hash to the same value (well, unless the address space is larger than the size of the hash).
As others pointed out, your objects don't define __hash__, so they fall back to the default id-based hash; you can override it as Nathon suggested, BUT read the docs about __hash__, especially the points about when you should and should not do that.
A set uses a dict behind the scenes, so the in statement checks whether the object exists as a key in that dict. Since your object doesn't implement a hash function, the default hash for objects uses the object's id. So even though a and b are equivalent, they are not the same object, and that is what gets tested.
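For readers on Python 3, where __cmp__ no longer exists, the same point holds with __eq__ and __hash__: equal objects must hash equal for set membership to work. A small sketch, not from the original answers:

class MyObject:
    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        return isinstance(other, MyObject) and self.data == other.data

    def __hash__(self):
        # Must be consistent with __eq__: equal objects need equal hashes.
        return hash(self.data)


a = MyObject(5)
b = MyObject(5)
print(a in [b])  # True: list membership only needs ==
print(a in {b})  # True: set membership hashes first, then compares with ==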
