Why does CPython call len(iterable) when executing func(*iterable)?

Recently I've been writing a download program that uses the HTTP Range field to download many blocks at the same time. I wrote a Python class to represent the Range (the HTTP Range header is a closed interval):
class ClosedRange:
    def __init__(self, begin, end):
        self.begin = begin
        self.end = end

    def __iter__(self):
        yield self.begin
        yield self.end

    def __str__(self):
        return '[{0.begin}, {0.end}]'.format(self)

    def __len__(self):
        return self.end - self.begin + 1
The __iter__ magic method is there to support tuple unpacking:
header = {'Range': 'bytes={}-{}'.format(*the_range)}
And len(the_range) is the number of bytes in that Range.
Now I found that 'bytes={}-{}'.format(*the_range) occasionally causes a MemoryError. After some debugging I found that the CPython interpreter will try to call len(iterable) when executing func(*iterable), and (may) allocate memory based on the length. On my machine, the MemoryError appears when len(the_range) is greater than 1 GB.
Here is a simplified example:
class C:
    def __iter__(self):
        yield 5

    def __len__(self):
        print('__len__ called')
        return 1024**3

def f(*args):
    return args
>>> c = C()
>>> f(*c)
__len__ called
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
>>> # BTW, list(the_range) has the same problem.
>>> list(c)
__len__ called
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
So my questions are:
Why does CPython call len(iterable)? From this question I see you won't know an iterator's length until you iterate through it. Is this an optimization?
Can the __len__ method return a 'fake' length (i.e. not the real number of elements in memory) of an object?

Why does CPython call len(iterable)? From this question I see you won't know an iterator's length until you iterate through it. Is this an optimization?
When Python (assuming Python 3) executes f(*c), the opcode CALL_FUNCTION_EX is used:
0 LOAD_GLOBAL 0 (f)
2 LOAD_GLOBAL 1 (c)
4 CALL_FUNCTION_EX 0
6 POP_TOP
Since c is an iterable, PySequence_Tuple is called to convert it to a tuple. That function calls PyObject_LengthHint to estimate the length of the new tuple; because the __len__ method is defined on c, it gets called, and its return value is used to allocate memory for the new tuple. The allocation (malloc) fails, and a MemoryError is finally raised.
/* Guess result size and allocate space. */
n = PyObject_LengthHint(v, 10);
if (n == -1)
    goto Fail;
result = PyTuple_New(n);
Can the __len__ method return a 'fake' length (i.e. not the real number of elements in memory) of an object?
In this scenario, yes.
When the return value of __len__ is smaller than needed, Python resizes the new tuple object to fit while filling it. If it is larger than needed, Python allocates extra memory up front, but _PyTuple_Resize is called at the end to reclaim the over-allocated space.
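For completeness, the same length hint is observable from Python via operator.length_hint() (Python 3.4+), and one way to avoid the huge allocation in the original ClosedRange is to keep the byte count out of __len__. This is only a sketch of that idea, not part of the answer above:

import operator

class ClosedRange:
    def __init__(self, begin, end):
        self.begin = begin
        self.end = end

    def __iter__(self):
        yield self.begin
        yield self.end

    def __length_hint__(self):
        # hint used when building the argument tuple: the two items __iter__ yields
        return 2

    @property
    def size(self):
        # byte count, moved out of __len__ so unpacking never sees it
        return self.end - self.begin + 1

r = ClosedRange(0, 2**31)
print(operator.length_hint(r))    # 2
print('bytes={}-{}'.format(*r))   # bytes=0-2147483648, no huge allocation
print(r.size)                     # 2147483649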

Related

Equality Comparison with NumPy Instance Invokes `__bool__`

I have defined a class whose __ge__ method returns an instance of itself, and whose __bool__ method is not allowed to be invoked (similar to a Pandas Series).
Why is X.__bool__ invoked during np.int8(0) <= x, but not for any of the other examples? Who is invoking it? I have read the Data Model docs but I haven’t found my answer there.
import numpy as np
import pandas as pd

class X:
    def __bool__(self):
        print(f"{self}.__bool__")
        assert False

    def __ge__(self, other):
        print(f"{self}.__ge__")
        return X()

x = X()
np.int8(0) <= x
# Console output:
# <__main__.X object at 0x000001BAC70D5C70>.__ge__
# <__main__.X object at 0x000001BAC70D5D90>.__bool__
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "<stdin>", line 4, in __bool__
# AssertionError
0 <= x
# Console output:
# <__main__.X object at 0x000001BAC70D5C70>.__ge__
# <__main__.X object at 0x000001BAC70D5DF0>
x >= np.int8(0)
# Console output:
# <__main__.X object at 0x000001BAC70D5C70>.__ge__
# <__main__.X object at 0x000001BAC70D5D30>
pd_ge = pd.Series.__ge__

def ge_wrapper(self, other):
    print("pd.Series.__ge__")
    return pd_ge(self, other)

pd.Series.__ge__ = ge_wrapper

pd_bool = pd.Series.__bool__

def bool_wrapper(self):
    print("pd.Series.__bool__")
    return pd_bool(self)

pd.Series.__bool__ = bool_wrapper
np.int8(0) <= pd.Series([1,2,3])
# Console output:
# pd.Series.__ge__
# 0 True
# 1 True
# 2 True
# dtype: bool
I suspect that np.int8.__le__ is defined so that instead of returning NotImplemented and letting X.__ge__ take over, it instead tries to return something like not (np.int8(0) > x), and then np.int8.__gt__ returns NotImplemented. Once X.__gt__(x, np.int8(0)) returns an instance of X rather than a Boolean value, we need to call x.__bool__() in order to compute the value of not x.
(Still trying to track down where int8.__gt__ is defined to confirm.)
(Update: not quite. int8 uses a single generic rich comparison function that simply converts the value to a 0-dimensional array, then returns the result of PyObject_RichCompare on the array and x.)
I did find this function that appears to ultimately implement np.int8.__le__:
static NPY_INLINE int
rational_le(rational x, rational y) {
return !rational_lt(y,x);
}
It's not clear to me how we avoid getting to this function when one of the arguments (like X) is not a NumPy type. I think I give up.
TL;DR
X.__array_priority__ = 1000
The biggest hint is that it works with a pd.Series.
First I tried having X inherit from pd.Series. This worked (i.e. __bool__ no longer called).
To determine whether NumPy is using an isinstance check or duck-typing approach, I removed the explicit inheritance and added (based on this answer):
@property
def __class__(self):
    return pd.Series
The operation no longer worked (i.e. __bool__ was called).
So now I think we can conclude NumPy is using a duck-typing approach. So I checked to see what attributes are being accessed on X.
I added the following to X:
def __getattribute__(self, item):
    print("getattr", item)
    return object.__getattribute__(self, item)
Again instantiating X as x, and invoking np.int8(0) <= x, we get:
getattr __array_priority__
getattr __array_priority__
getattr __array_priority__
getattr __array_struct__
getattr __array_interface__
getattr __array__
getattr __array_prepare__
<__main__.X object at 0x000002022AB5DBE0>.__ge__
<__main__.X object at 0x000002021A73BE50>.__bool__
getattr __array_struct__
getattr __array_interface__
getattr __array__
Traceback (most recent call last):
File "<stdin>", line 32, in <module>
np.int8(0) <= x
File "<stdin>", line 21, in __bool__
assert False
AssertionError
Ah-ha! What is __array_priority__? Who cares, really. With a little digging, all we need to know is that NDFrame (from which pd.Series inherits) sets this value to 1000.
If we add X.__array_priority__ = 1000, it works! __bool__ is no longer called.
What made this so difficult (I believe) is that the NumPy code didn't show up in the call stack because it is written in C. I could investigate further if I tried out the suggestion here.
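Based on the finding above, here is a minimal sketch of the fix (it assumes NumPy is installed; the object reprs will differ on your machine):

import numpy as np

class X:
    __array_priority__ = 1000    # the fix: the same value NDFrame uses

    def __bool__(self):
        print(f"{self}.__bool__")
        assert False

    def __ge__(self, other):
        print(f"{self}.__ge__")
        return X()

x = X()
result = np.int8(0) <= x    # prints only X.__ge__; __bool__ is not called
print(type(result))         # <class '__main__.X'>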

How does list.append work?

alist = []

def show(*args, **kwargs):
    alist.append(*args, **kwargs)
    print(alist)
>>> show('tiger')
['tiger']
>>> show('tiger','cat')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in show
TypeError: append() takes exactly one argument (2 given)
>>> show('tiger','cat', {'name':'tom'})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in show
TypeError: append() takes exactly one argument (3 given)
Since the append method of alist only accepts one argument, why isn't a syntax error raised for the line alist.append(*args, **kwargs) in the definition of show?
It's not a syntax error because the syntax is perfectly fine and that function may or may not raise an error depending on how you call it.
The way you're calling it:
alist = []

def show(*args, **kwargs):
    alist.append(*args, **kwargs)
    print(alist)
>>> show('tiger')
['tiger']
>>> show('tiger','cat')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in show
TypeError: append() takes exactly one argument (2 given)
A different way:
alist = []

def show(*args, **kwargs):
    alist.append(*args, **kwargs)
    print(alist)
>>> show('tiger')
['tiger', 'tiger']
>>> class L: pass
...
>>> alist = L()
>>> alist.append = print
>>> show('tiger','cat')
tiger cat
<__main__.L object at 0x000000A45DBCC048>
Python objects are strongly typed. The names that bind to them are not. Nor are function arguments. Given Python's dynamic nature it would be extremely difficult to statically predict what type a variable at a given source location will be at execution time, so the general rule is that Python doesn't bother trying.
In your specific example, alist is not in the local scope. Therefore it can be modified after your function definition was executed and the changes will be visible to your function, cf. code snippets below.
So, in accord with the general rule: predicting whether or not alist will be a list when you call .append? Near-impossible. In particular, the interpreter cannot predict that this will be an error.
Here is some code just to drive home the point that static type checking is by all practical means impossible in Python. It uses non-local variables as in your example.
funcs = []
for a in [1, "x", [2]]:
    def b():
        def f():
            print(a)
        return f
    funcs.append(b())

for f in funcs:
    f()
Output:
[2] # value of a at definition time (of f): 1
[2] # value of a at definition time (of f): 'x'
[2] # value of a at definition time (of f): [2]
And similarly for non-global non-local variables:
funcs = []
for a in [1, "x", [2]]:
    def b(a):
        def f():
            print(a)
        a = a+a
        return f
    funcs.append(b(a))

for f in funcs:
    f()
Output:
2 # value of a at definition time (of f): 1
xx # value of a at definition time (of f): 'x'
[2, 2] # value of a at definition time (of f): [2]
It's not a syntax error because it's resolved at runtime. Syntax errors are caught during parsing: things like unmatched brackets or malformed statements. A wrong number of arguments is only detected when the call actually happens (and this isn't even a missing-argument case: *args means any number of arguments).
show has no way of knowing what you'll pass it at runtime, and since you are expanding your args variable inside show, any number of arguments could come in, and it's valid syntax! list.append takes one argument: one tuple, one list, one int, one string, one custom class, etc. What you are passing it is some number of elements, depending on the input. If you remove the *, it's all dandy, because args itself is one element, e.g. alist.append(args).
All this means that your show function is faulty: it is equipped to handle args only when it has length 1. If it's 0 you also get a TypeError at the point append is called; if it's more than 1 it's broken, but you won't know until you run it with the bad input.
You could loop over the elements in args (and kwargs) and add them one by one.
alist = []

def show(*args, **kwargs):
    for a in args:
        alist.append(a)
    for kv in kwargs.items():
        alist.append(kv)
    print(alist)
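As a variant of the loop above (a sketch, not from the original answer): list.extend accepts a single iterable of any length, so each group of arguments can also be added in one call.

alist = []

def show(*args, **kwargs):
    alist.extend(args)              # extend takes one iterable argument
    alist.extend(kwargs.items())    # (key, value) pairs for the keyword arguments
    print(alist)

show('tiger', 'cat', name='tom')    # ['tiger', 'cat', ('name', 'tom')]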

"yield from iterable" vs "return iter(iterable)"

When wrapping an (internal) iterator one often has to reroute the __iter__ method to the underlying iterable. Consider the following example:
import collections

class FancyNewClass(collections.Iterable):
    def __init__(self):
        self._internal_iterable = [1, 2, 3, 4, 5]
        # ...

    # variant A
    def __iter__(self):
        return iter(self._internal_iterable)

    # variant B
    def __iter__(self):
        yield from self._internal_iterable
Is there any significant difference between variant A and B?
Variant A returns an iterator object that has been queried via iter() from the internal iterable. Variant B returns a generator object that returns values from the internal iterable. Is one or the other preferable for some reason? In collections.abc the yield from version is used. The return iter() variant is the pattern that I have used until now.
The only significant difference is what happens when an exception is raised from within the iterable. Using return iter() your FancyNewClass will not appear on the exception traceback, whereas with yield from it will. It is generally a good thing to have as much information on the traceback as possible, although there could be situations where you want to hide your wrapper.
Other differences:
return iter has to load the name iter from globals - this is potentially slow (although unlikely to significantly affect performance) and could be messed with (although anyone who overwrites globals like that deserves what they get).
With yield from you can insert other yield expressions before and after (although you could equally use itertools.chain).
As presented, the yield from form discards any generator return value (i.e. the value attached to raise StopIteration(value)). You can fix this by writing return (yield from iterator) instead; see the sketch below.
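A small sketch of that last point (the inner/discards/forwards names are hypothetical, not from the linked test): the value attached to the inner generator's StopIteration only survives if the wrapper re-returns it.

def inner():
    yield 1
    yield 2
    return 'inner finished'        # becomes StopIteration('inner finished')

def discards():
    yield from inner()             # return value of inner() is thrown away

def forwards():
    return (yield from inner())    # re-returned as this generator's value

def run(gen):
    # exhaust the generator and report its return value
    try:
        while True:
            next(gen)
    except StopIteration as exc:
        return exc.value

print(run(discards()))             # None
print(run(forwards()))             # inner finished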
Here's a test comparing the disassembly of the two approaches and also showing exception tracebacks: http://ideone.com/1YVcSe
Using return iter():
3 0 LOAD_GLOBAL 0 (iter)
3 LOAD_FAST 0 (it)
6 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
9 RETURN_VALUE
Traceback (most recent call last):
File "./prog.py", line 12, in test
File "./prog.py", line 10, in i
RuntimeError
Using return (yield from):
5 0 LOAD_FAST 0 (it)
3 GET_ITER
4 LOAD_CONST 0 (None)
7 YIELD_FROM
8 RETURN_VALUE
Traceback (most recent call last):
File "./prog.py", line 12, in test
File "./prog.py", line 5, in bar
File "./prog.py", line 10, in i
RuntimeError

Python: What's wrong with this Fibonacci function?

I tried to write a simple Python function which should return the list of Fibonacci numbers up to some specified maximum, but I am getting this error. I can't seem to find out what I am doing wrong.
def fib(a, b, n):
    f = a+b
    if (f > n):
        return []
    return [f].extend(fib(b, f, n))
>>>fib(0,1,10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lvl2.py", line 35, in fib
return [f].extend(fib(b,f,n))
File "lvl2.py", line 35, in fib
return [f].extend(fib(b,f,n))
File "lvl2.py", line 35, in fib
return [f].extend(fib(b,f,n))
File "lvl2.py", line 35, in fib
return [f].extend(fib(b,f,n))
TypeError: 'NoneType' object is not iterable
list.extend extends a list in place and returns None. That means [f].extend(...) evaluates to None, so each recursive call returns None, and the level above then tries to extend with that None, which raises the TypeError. You can use the + operator to concatenate two lists instead.
However, your code isn't particularly Pythonic. You should use a generator for infinite sequences, or, as a slight improvement over your code:
def fib(a, b, n):
    data = []
    f = a+b
    if (f > n):
        return data
    data.append(f)
    data.extend(fib(b, f, n))
    return data
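A quick usage check of the corrected recursive version (not in the original answer):

>>> fib(0, 1, 10)
[1, 2, 3, 5, 8]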
An example using generators for infinite sequences:
def fibgen(a, b):
    while True:
        a, b = b, a + b
        yield b
You can create the generator with fibgen(0, 1) and pull off successive values with next(...) (gen.next() in Python 2).
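For example, a sketch of how the generator reproduces the original "all Fibonacci numbers up to n" behaviour (fib_upto is a hypothetical helper name, not from the answer):

import itertools

def fibgen(a, b):          # as defined above
    while True:
        a, b = b, a + b
        yield b

def fib_upto(n):
    # take values from the infinite generator while they stay <= n
    return list(itertools.takewhile(lambda x: x <= n, fibgen(0, 1)))

print(fib_upto(10))   # [1, 2, 3, 5, 8]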
You may be interested in an especially neat Fibonacci implementation, though it only works in Python 3.2 and higher:
import functools

@functools.lru_cache(maxsize=None)
def fib(n):
    return fib(n-1) + fib(n-2) if n > 1 else n   # fib(0) == 0, fib(1) == 1
The point of the lru_cache decorator is to memoise the recursive calls. In other words, it is slow to evaluate e.g. fib(20) naively, because you repeat a lot of effort, so instead we cache the values as they are computed.
It is still probably more efficient to do
from itertools import islice

def nth(iterable, n, default=None):
    "Returns the nth item or a default value"
    return next(islice(iterable, n, None), default)

nth(fibgen(0, 1), n)   # the nth Fibonacci number
as above, because it doesn't have the space overhead of the large cache.

Need to add an element at the start of an iterator in python

I have a program as follows:
a = reader.next()
if *some condition holds*:
    # Do some processing and continue the iteration
else:
    # Append the variable a back to the iterator
    # That is, nullify the operation a = reader.next()
How do I add an element to the start of the iterator?
(Or is there an easier way to do this?)
EDIT: OK, let me put it this way: I need the next element in the iterator without removing it.
How do I do this?
You're looking for itertools.chain:
import itertools
values = iter([1,2,3]) # the iterator
value = 0 # the value to prepend to the iterator
together = itertools.chain([value], values) # there it is
list(together)
# -> [0, 1, 2, 3]
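To address the edit ("peek at the next element without removing it"), chain can also put an already consumed value back in front. A small sketch building on the answer above:

import itertools

values = iter([1, 2, 3])
first = next(values)                         # consume one element to look at it
print(first)                                 # 1
values = itertools.chain([first], values)    # put it back at the front
print(list(values))                          # [1, 2, 3]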
Python iterators, as such, have very limited functionality -- no "appending" or anything like that. You'll need to wrap the generic iterator in a wrapper adding that functionality. E.g.:
class Wrapper(object):
    def __init__(self, it):
        self.it = it
        self.pushedback = []

    def __iter__(self):
        return self

    def next(self):
        if self.pushedback:
            return self.pushedback.pop()
        else:
            return self.it.next()

    def pushback(self, val):
        self.pushedback.append(val)
This is Python 2.5 (should work in 2.6 too) -- slight variants advised for 2.6 and mandatory for 3.any (use next(self.it) instead of self.it.next() and define __next__ instead of next).
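A sketch of the Python 3 variant just described:

class Wrapper:
    def __init__(self, it):
        self.it = it
        self.pushedback = []

    def __iter__(self):
        return self

    def __next__(self):                  # Python 3 spells this __next__
        if self.pushedback:
            return self.pushedback.pop()
        return next(self.it)             # builtin next() instead of .next()

    def pushback(self, val):
        self.pushedback.append(val)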
Edit: the OP now says what they need is "peek ahead without consuming". Wrapping is still the best option, but an alternative is:
import itertools
...
o, peek = itertools.tee(o)
if isneat(peek.next()): ...
this doesn't advance o (remember to advance it if and when you decide you DO want to;-).
By design (as a general development concept), iterators are intended to be read-only, and any attempt to change them would break things.
Alternatively, you could read the iterator backwards and add the element at the end (which is actually the start :) )?
This isn't exactly what you asked for, but if you have control over the generator and you don't need to "peek" before the value is generated (and any side effects have occurred), you can use the generator's send method to tell the generator to repeat the last value it yielded:
>>> def a():
... for x in (1,2,3):
... rcvd = yield x
... if rcvd is not None:
... yield x
...
>>> gen = a()
>>> gen.send("just checking")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't send non-None value to a just-started generator
>>> gen.next()
1
>>> gen.send("just checking")
1
>>> gen.next()
2
>>> gen.next()
3
>>> gen.next()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration
