Is Python `list.extend(iterator)` guaranteed to be lazy? - python

Summary
Suppose I have an iterator that, as elements are consumed from it, performs some side effect, such as modifying a list. If I define a list l and call l.extend(iterator), is it guaranteed that extend will push elements onto l one-by-one, as elements from the iterator are consumed, as opposed to kept in a buffer and then pushed on all at once?
My experiments
I did a quick test in Python 3.7 on my computer, and list.extend seems to be lazy based on that test. (See code below.) Is this guaranteed by the spec, and if so, where in the spec is that mentioned?
(Also, feel free to criticize me and say "this is not Pythonic, you fool!"--though I would appreciate it if you also answer the question if you want to criticize me. Part of why I'm asking is for my own curiosity.)
Say I define an iterator that pushes onto a list as it runs:
l = []
def iterator(k):
    for i in range(5):
        print([j in k for j in range(5)])
        yield i
l.extend(iterator(l))
Here are examples of non-lazy (i.e. buffered) vs. lazy possible extend implementations:
def extend_nonlazy(l, iterator):
    l += list(iterator)

def extend_lazy(l, iterator):
    for i in iterator:
        l.append(i)
Results
Here's what happens when I run both known implementations of extend.
Non-lazy:
l = []
extend_nonlazy(l, iterator(l))
# output
[False, False, False, False, False]
[False, False, False, False, False]
[False, False, False, False, False]
[False, False, False, False, False]
[False, False, False, False, False]
# l = [0, 1, 2, 3, 4]
Lazy:
l = []
extend_lazy(l, iterator(l))
[False, False, False, False, False]
[True, False, False, False, False]
[True, True, False, False, False]
[True, True, True, False, False]
[True, True, True, True, False]
My own experimentation shows that native list.extend seems to work like the lazy version, but my question is: does the Python spec guarantee that?

I don't think the issue is lazy vs. non-lazy: in both slice assignment and list.extend, all the elements of the iterator are needed, and (in the normal case) they are consumed at once. The issue you raised is more important: are these operations atomic or not? See one definition of "atomicity" on Wikipedia:
Atomicity guarantees that each transaction is treated as a single "unit", which either succeeds completely, or fails completely.
Have a look at this example (CPython 3.6.8):
>>> def new_iterator(): return (1/(i-2) for i in range(5))
>>> L = []
>>> L[:] = new_iterator()
Traceback (most recent call last):
...
ZeroDivisionError: division by zero
>>> L
[]
The slice assignment failed because of the exception (i == 2 => 1/(i - 2) raises ZeroDivisionError) and the list was left unchanged. Hence, the slice assignment operation is atomic.
Now, the same example with extend:
>>> L.extend(new_iterator())
Traceback (most recent call last):
...
ZeroDivisionError: division by zero
>>> L
[-0.5, -1.0]
When the exception was raised, the first two elements had already been appended to the list. The extend operation is not atomic, since a failure does not leave the list unchanged.
Should the extend operation be atomic or not? Frankly, I have no idea, but as noted in wim's answer, the real issue is that this is not clearly stated in the documentation (and worse, the documentation asserts that extend is equivalent to the slice assignment, which is not true in the reference implementation).
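If you do want extend to behave atomically, one option (a minimal sketch of my own, not from the documentation) is to materialize the iterable before touching the list, so that a failure leaves the list unchanged:

def extend_atomically(lst, iterable):
    # Materialize first: if this raises, lst has not been touched yet.
    items = list(iterable)
    lst.extend(items)

L = []
try:
    extend_atomically(L, (1 / (i - 2) for i in range(5)))
except ZeroDivisionError:
    pass
print(L)  # [] -- the failed call left L unchanged

Of course this gives up the "lazy" behaviour the question asked about, which is exactly the trade-off between the two implementations shown in the question.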

Is Python list.extend(iterator) guaranteed to be lazy?
No. On the contrary, it's documented that
l.extend(iterable)
is equivalent to
l[len(l):] = iterable
In CPython, such a slice assignment will first convert a generator on the right hand side into a list anyway (see here), i.e. it's consuming the iterable all at once.
The example shown in your question is, strictly speaking, contradicting the documentation. I've filed a documentation bug, but it was promptly closed by Raymond Hettinger.
As an aside, there are less convoluted ways to demonstrate the discrepancy. Just define a failing generator:
def gen():
    yield 1
    yield 2
    yield 3
    uh-oh  # NameError when the generator gets this far
Now L.extend(gen()) will modify L, but L[:] = gen() will not.
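Here is a quick demonstration of that discrepancy, run on CPython (my own sketch; I use an explicit raise instead of the NameError trick above):

def gen():
    yield 1
    yield 2
    yield 3
    raise RuntimeError("uh-oh")

L = []
try:
    L.extend(gen())
except RuntimeError:
    pass
print(L)  # [1, 2, 3] -- extend kept the elements yielded before the failure

L = []
try:
    L[:] = gen()
except RuntimeError:
    pass
print(L)  # [] -- the slice assignment left the list unchanged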

Related

Python: Different methods to initialize 2D arrays gives different outputs [duplicate]

This question already has answers here:
List of lists changes reflected across sublists unexpectedly
(17 answers)
I was solving a few questions involving dynamic programming. I initialized the dp table as -
n = 3
dp = [[False]*n]*n
print(dp)
#Output - [[False, False, False], [False, False, False], [False, False, False]]
Followed by which I set the diagonal elements to True using -
for i in range(n):
    dp[i][i] = True
print(dp)
#Output - [[True, True, True], [True, True, True], [True, True, True]]
However, the above sets every value in dp to True. But when I initialize dp as -
dp = [[False]*n for i in range(n)]
Followed by setting diagonal elements to True, I get the correct output - [[True, False, False], [False, True, False], [False, False, True]]
So how exactly does the star operator generate values of the list?
When you do dp = [[False]*n]*n, you get a list containing the same inner list n times: when you modify one row, all of them appear modified. That's why, with that loop over n, you seemingly modify all n^2 elements.
You can check it like this:
[id(x) for x in dp]
> [1566380391432, 1566380391432, 1566380391432] # you'll see the same value repeated
With dp = [[False]*n for i in range(n)] you are creating different lists n times. Let's try again for this dp:
[id(x) for x in dp]
[1566381807176, 1566381801160, 1566381795912] # three different values
In general, opt to use * to expand immutable data types, and use for ... to expand mutable data types (like lists).
Your issue is that, in the first example, you are not actually creating multiple separate lists.
To explain, let's go through the first example step by step.
First, [False]*3 creates a new list containing the value False three times.
Next, you wrap it in another list. Note that the inner list is not copied; only a reference to it is stored.
Then, multiplying that outer list by 3 creates a list holding three references to the same inner list. Since these are only references, changing the list through one of them changes what you see through the others too.
That is why assigning dp[i][i] = True actually sets element i of all three rows, since all three rows are the same list. Doing this for every i therefore sets every value in every row (of which there is really only one) to True.
The second option actually creates 3 separate lists, so the code works properly.
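To see both behaviours side by side, here is a small self-contained sketch (mine, not from the answers above):

n = 3
aliased = [[False] * n] * n                  # one inner list referenced n times
distinct = [[False] * n for _ in range(n)]   # n independent inner lists

print(aliased[0] is aliased[1])    # True  -- same object
print(distinct[0] is distinct[1])  # False -- separate objects

for i in range(n):
    aliased[i][i] = True
    distinct[i][i] = True

print(aliased)   # [[True, True, True], [True, True, True], [True, True, True]]
print(distinct)  # [[True, False, False], [False, True, False], [False, False, True]]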

How can I check that a Python list contains only True and then only False using one or two lines?

I would like to only allow lists where the first contiguous group of elements are True and then all of the remaining elements are False. I want lists like these examples to return True:
[True]
[False]
[True, False]
[True, False, False]
[True, True, True, False]
And lists like these to return False:
[False, True]
[True, False, True]
I am currently using this function, but I feel like there is probably a better way of doing this:
def my_function(x):
    n_trues = sum(x)
    should_be_true = x[:n_trues]  # get the first n items
    should_be_false = x[n_trues:len(x)]  # get the remaining items
    # return True only if all of the first n elements are True and the remaining
    # elements are all False
    return all(should_be_true) and all([not element for element in should_be_false])
Testing:
test_cases = [[True], [False],
              [True, False],
              [True, False, False],
              [True, True, True, False],
              [False, True],
              [True, False, True]]
print([my_function(test_case) for test_case in test_cases])
# expected output: [True, True, True, True, True, False, False]
Is it possible to use a comprehension instead to make this a one/two line function? I know I could not define the two temporary lists and instead put their definitions in place of their names on the return line, but I think that would be too messy.
Method 1
You could use itertools.groupby. This would avoid doing multiple passes over the list and would also avoid creating the temp lists in the first place:
from itertools import groupby

def check(x):
    status = list(k for k, g in groupby(x))
    return len(status) <= 2 and (status[0] is True or status[-1] is False)
This assumes that your input is non-empty and already all boolean. If that's not always the case, adjust accordingly:
def check(x):
    status = list(k for k, g in groupby(map(bool, x)))
    return status and len(status) <= 2 and (status[0] or not status[-1])
If you want to have empty arrays evaluate to True, either special case it, or complicate the last line a bit more:
return not status or (len(status) <= 2 and (status[0] or not status[-1]))
Method 2
You can also do this in one pass using an iterator directly. This relies on the fact that any and all are guaranteed to short-circuit:
def check(x):
    iterator = iter(x)
    # process the true elements
    all(iterator)
    # check that there are no true elements left
    return not any(iterator)
Personally, I think method 1 is total overkill. Method 2 is much nicer and simpler, and achieves the same goals faster. It also stops immediately if the test fails, rather than having to process the whole group. It also doesn't allocate any temporary lists at all, even for the group aggregation. Finally, it handles empty and non-boolean inputs out of the box.
Since I'm writing on mobile, here's an IDEOne link for verification: https://ideone.com/4MAYYa
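As a quick check (my own run, not part of the original answer), Method 2 gives the expected results on the test cases from the question:

def check(x):
    iterator = iter(x)
    all(iterator)              # consume the leading block of True values (and the first False)
    return not any(iterator)   # succeed only if no True value remains afterwards

test_cases = [[True], [False],
              [True, False],
              [True, False, False],
              [True, True, True, False],
              [False, True],
              [True, False, True]]
print([check(tc) for tc in test_cases])
# [True, True, True, True, True, False, False]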

Numpy search for elements of an array in a subset

Suppose I have numpy arrays
a = np.array([1,3,5,7,9,11,13])
b = np.array([3,5,7,11,13])
and I want to create a boolean array of the size of a where each entry is True or False depending on whether the element of a is also in b.
So in this case, I want
a_b = np.array([False,True,True,True,False,True,True]).
I can do this when b consists of one element with a == b[0]. Is there a quick way to do this when b has length greater than 1?
Use numpy.in1d:
In [672]: np.in1d([1,2,3,4], [1,2])
Out[672]: array([ True, True, False, False], dtype=bool)
For your data:
In [674]: np.in1d(a, b)
Out[674]: array([False, True, True, True, False, True, True], dtype=bool)
This is available in version 1.4.0 or later according to the docs. The docs also describe how the operation might look in pure Python:
in1d can be considered as an element-wise function version of the python keyword in, for 1-D sequences. in1d(a, b) is roughly equivalent to np.array([item in b for item in a]).
The docs for this function are worthwhile to read as there is the invert keyword argument and the assume_unique keyword argument -- each of which can be quite useful in some situations.
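For example, here is a quick look at the invert keyword mentioned above (my example, not from the original answer):

import numpy as np

a = np.array([1, 3, 5, 7, 9, 11, 13])
b = np.array([3, 5, 7, 11, 13])

print(np.in1d(a, b))               # [False  True  True  True False  True  True]
print(np.in1d(a, b, invert=True))  # [ True False False False  True False False]
# invert=True is roughly equivalent to ~np.in1d(a, b), and per the docs it can be faster;
# assume_unique=True lets in1d skip de-duplication when you know both inputs are unique.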
I also found it interesting to create my own version using np.vectorize and operator.contains:
from operator import contains
v_in = np.vectorize(lambda x,y: contains(y, x), excluded={1,})
and then:
In [696]: v_in([1,2,3, 2], [1, 2])
Out[696]: array([ True, True, False, True], dtype=bool)
Because operator.contains flips the arguments, I needed the lambda to make the calling convention match your use case -- but you could skip this if it was okay to call with b first then a.
But note that you need to use the excluded option for vectorize since you want whichever argument represents the b sequence (the sequence to check for membership within) to actually remain as a sequence (so if you chose not to flip the contains arguments with the lambda then you would want to exclude index 0 not 1).
The way with in1d will surely be much faster and is a much better way since it relies on a well-known built-in. But it's good to know how to do these tricks with operator and vectorize sometimes.
You could even create a Python Infix recipe instance for this and then use v_in as an "infix" operation:
v_in = Infix(np.vectorize(lambda x,y: contains(y, x), excluded={1,}))
# even easier: v_in = Infix(np.in1d)
and example usage:
In [702]: v_in([1, 2, 3, 2], [1, 2])
Out[702]: array([ True, True, False, True], dtype=bool)
In [704]: [1, 2, 3, 2] <<v_in>> [1, 2]
Out[704]: array([ True, True, False, True], dtype=bool)
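For completeness, here is a minimal version of the Infix recipe that supports the <<v_in>> spelling used above (a sketch based on the well-known ActiveState recipe; the original has a few more conveniences, such as a | spelling):

class Infix:
    """Wrap a two-argument function so it can be written as a <<op>> b."""
    def __init__(self, function):
        self.function = function
    def __rlshift__(self, other):
        # a << op : capture the left operand
        return Infix(lambda x, other=other: self.function(other, x))
    def __rshift__(self, other):
        # (a << op) >> b : call with the right operand
        return self.function(other)
    def __call__(self, value1, value2):
        # allow ordinary function-call usage, e.g. v_in(a, b)
        return self.function(value1, value2)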

Using list comprehension to compare elements of two arrays

How can I use a list comprehension in Python to check whether two arrays have the same elements or not?
I did the following:
>>> aa=[12,3,13];
>>> bb=[3,13,12];
>>> pp=[True for x in aa for y in bb if y==x]
>>> pp
[True, True, True]
>>> bb=[3,13,123];
>>> pp=[True for x in aa for y in bb if y==x]
>>> pp
[True, True]
I also want the output to include False when there is no match, rather than just the two Trues as in the latter case, but I don't know how to do it.
Finally, I want to get a single True/False value (True if all comparisons are true, False if any of them is false) rather than a list of True and/or False values. I know a simple loop over pp (the list of True and False values) would be enough for that, but I am sure there is a more Pythonic way to do it.
You are testing every element of each list against every element of the other list, finding all combinations that are True. Apart from being inefficient, this is also the incorrect approach.
Use membership testing instead, and check that all these tests are True with the all() function:
all(el in bb for el in aa)
all() returns True if each element of the iterable you give it is True, False otherwise.
This won't quite test if the lists are equivalent; you need to test for the length as well:
len(aa) == len(bb) and all(el in bb for el in aa)
To make this a little more efficient for longer bb lists, create a set() from that list first:
def equivalent(aa, bb):
    if len(aa) != len(bb):
        return False
    bb_set = set(bb)
    return all(el in bb_set for el in aa)
This still doesn't deal with duplicate numbers very well; [1, 1, 2] is considered equivalent to [1, 2, 2] with this approach. You underspecified what should happen in such corner cases; the only strict equivalence test would be to sort both inputs:
len(aa) == len(bb) and sorted(aa) == sorted(bb)
where we first test for length to avoid having to sort in case the lengths differ.
If duplicates are allowed, whatever the length of the input, you can forgo loops altogether and just use sets:
not set(aa).symmetric_difference(bb)
to test if they have the same unique elements.
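To see the duplicate corner case mentioned above (my example, not from the original answer):

aa = [1, 1, 2]
bb = [1, 2, 2]

print(set(aa) == set(bb))        # True  -- the sets see the same unique elements
print(sorted(aa) == sorted(bb))  # False -- the strict, duplicate-aware comparison
print(len(aa) == len(bb) and all(el in set(bb) for el in aa))  # True -- also fooled by duplicates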
set(aa) == set(bb)
This has the same effect as, but may be slightly faster than:
not set(aa).symmetric_difference(bb)
If you need [1, 1] to not be equivalent to [1], do:
sorted(aa) == sorted(bb)
"You are testing every element of each list against every element of
the other list, finding all combinations that are True. Apart from
inefficient, this is also the incorrect approach."
I agree with the above statement, the below code lists the False values as well but I don't think you really need this.
>>> bp = [y==x for x in aa for y in bb]
>>> bp
[False, False, True, True, False, False, False, True, False]
>>> False in bp
True

itertools and strided list assignment

Given a list, e.g. x = [True]*20, I want to assign False to every other element.
x[::2] = False
raises TypeError: must assign iterable to extended slice
So I naively assumed you could do something like this:
x[::2] = itertools.repeat(False)
or
x[::2] = itertools.cycle([False])
However, as far as I can tell, this results in an infinite loop. Why is there an infinite loop? Is there an alternative approach that does not involve knowing the number of elements in the slice before assignment?
EDIT: I understand that x[::2] = [False] * (len(x) // 2) works in this case, or you can come up with an expression for the multiplier on the right side in the more general case. I'm trying to understand what causes itertools to cycle indefinitely and why list assignment behaves differently from numpy array assignment. I think there must be something fundamental about Python I'm misunderstanding. I was also thinking originally there might be performance reasons to prefer itertools to a list comprehension or to creating another n-element list.
What you are attempting to do in this code is not what you think (I suspect).
For instance:
x[::2] returns a slice containing every other element of x (the ones at even indices); since x has 20 elements, the slice has 10 elements, but x[::2] = False tries to assign a non-iterable to it.
To make that kind of assignment work with the code you have, you would need:
x = [True]*20
x[::2] = [False]*10
which assigns an iterable of size 10 to a slice of size 10.
Why work in the dark with the number of elements? Use
len(x[::2])
which would be equal to 10, and then use
x[::2] = [False]*len(x[::2])
you could also do something like:
x = [True if (index & 0x1 == 0) else False for index, element in enumerate(x)]
EDIT: Due to OP edit
The documentation on cycle says it "Repeats indefinitely", which means it will continuously cycle through the iterator it has been given.
repeat behaves similarly; its documentation states that it
"Runs indefinitely unless the times argument is specified,"
which has not been done in the question's code. Thus both lead to infinite loops.
Regarding the comment about itertools being faster: yes, the itertools functions are generally faster than other implementations because they are optimised to be as fast as their authors could make them.
However if you do not want to recreate a list you can use generator expressions such as the following:
x = (True if (index & 0x1 == 0) else False for index, element in enumerate(x))
which do not store all of their elements in memory but produce them as they are needed; however, a generator can only be consumed once.
for instance:
x = [True]*20
print(x)
y = (True if (index & 0x1 == 0) else False for index, element in enumerate(x))
print ([a for a in y])
print ([a for a in y])
will print x, then the elements of the generator y, then an empty list, because the generator has been exhausted.
As Mark Tolonen pointed out in a concise comment, the reason your itertools attempts cycle indefinitely is that, for the list assignment, Python needs to know the length of the right-hand side.
Now to really dig in...
When you say:
x[::2] = itertools.repeat(False)
The left-hand side (x[::2]) is a list slice, and you are assigning to it the itertools.repeat(False) iterable, which will iterate forever since it wasn't given a times argument (as per the docs).
If you dig into the list assignment code in the cPython implementation, you'll find the unfortunately/painfully named function list_ass_slice, which is at the root of a lot of list assignment stuff. In that code you'll see this segment:
v_as_SF = PySequence_Fast(v, "can only assign an iterable");
if (v_as_SF == NULL)
    goto Error;
n = PySequence_Fast_GET_SIZE(v_as_SF);
Here it is trying to get the length (n) of the iterable you are assigning to the list. However, before even getting there it is getting stuck on PySequence_Fast, where it ends up trying to convert your iterable to a list (with PySequence_List), within which it ultimately creates an empty list and tries to simply extend it with your iterable.
To extend the list with the iterable, it uses listextend(), and in there you'll see the root of the problem:
/* Run iterator to exhaustion. */
for (;;) {
and there you go.
Or at least I think so... :) It was an interesting question, so I thought I'd have some fun and dig through the source to see what was up, and that's where I ended up.
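A Python-level way to observe that the right-hand side is materialized (and its length checked) before anything is assigned, under CPython (my own sketch):

x = [True] * 20
try:
    # A finite generator of the wrong size: it is run to exhaustion, its length (3)
    # is compared to the slice length (10), and the assignment fails cleanly.
    x[::2] = (False for _ in range(3))
except ValueError as exc:
    print(exc)  # e.g. "attempt to assign sequence of size 3 to extended slice of size 10"
print(x.count(False))  # 0 -- x was not modified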
As to the different behaviour with numpy arrays, it will simply be a difference in how the numpy.array assignments are handled.
Note that using itertools.repeat doesn't work in numpy, but it doesn't hang up (I didn't check the implementation to figure out why):
>>> import numpy, itertools
>>> x = numpy.ones(10,dtype='bool')
>>> x[::2] = itertools.repeat(False)
>>> x
array([ True, True, True, True, True, True, True, True, True, True], dtype=bool)
>>> #but the scalar assignment does work as advertised...
>>> x = numpy.ones(10,dtype='bool')
>>> x[::2] = False
>>> x
array([False, True, False, True, False, True, False, True, False, True], dtype=bool)
Try this:
l = len(x)
x[::2] = itertools.repeat(False, l // 2 if l % 2 == 0 else l // 2 + 1)
Your original solution ends up in an infinite loop because that's what repeat is supposed to do, from the documentation:
Make an iterator that returns object over and over again. Runs indefinitely unless the times argument is specified.
The slice x[::2] is exactly len(x) // 2 elements long here (since len(x) is even), so you could achieve what you want with:
x[::2] = [False] * (len(x) // 2)
The itertools.repeat and itertools.cycle functions are designed to yield values indefinitely. However, you can specify a count for repeat(), like this:
x[::2] = itertools.repeat(False, len(x) // 2)
The right hand side of an extended slice assignment needs to be an iterable of the right size (ten, in this case).
Here is it with a regular list on the righthand side:
>>> x = [True] * 20
>>> x[::2] = [False] * 10
>>> x
[False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True]
And here it is with itertools.repeat on the righthand side.
>>> from itertools import repeat
>>> x = [True] * 20
>>> x[::2] = repeat(False, 10)
>>> x
[False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True]
