Trouble understanding python generators and iterable arguments - python

I am supposed to write a generator that, given a list of iterable arguments, produces the 1st element of the 1st argument, the 1st element of the 2nd argument, the 1st element of the 3rd argument, then the 2nd element of the 1st argument, and so on.
So:
''.join([v for v in alternate('abcde', 'fg', 'hijk')]) == 'afhbgicjdke'
My function works for string arguments like this but I encounter a problem when I try and use a given test case that goes like this
def hide(iterable):
    for v in iterable:
        yield v

''.join([v for v in alternate(hide('abcde'), hide('fg'), hide('hijk'))]) == 'afhbgicjdke'
Here is my generator:
def alternate(*args):
    for i in range(10):
        for arg in args:
            arg_num = 0
            for thing in arg:
                if arg_num == i:
                    yield thing
                arg_num += 1
Can I change something in this to get it to work as described or is there something fundamentally wrong with my function?
EDIT: as part of the assignment, I am not allowed to use itertools

Something like this works OK:
def alternate(*iterables):
    iterators = [iter(iterable) for iterable in iterables]
    sentinel = object()
    keep_going = True
    while keep_going:
        keep_going = False
        for iterator in iterators:
            maybe_yield = next(iterator, sentinel)
            if maybe_yield is not sentinel:
                keep_going = True
                yield maybe_yield

print(''.join(alternate('abcde', 'fg', 'hijk')))
The trick is realizing that when an iterator is exhausted, next will return the sentinel value. As long as at least one of the iterators produces a real value (not the sentinel) in a pass, we need to keep going; once every iterator returns the sentinel, they are all exhausted and the loop stops. If the sentinel was not returned from next, then the value is good and we yield it.
Note that if the number of iterables is large, this implementation is sub-optimal (it'd be better to store the iterators in a data structure that supports O(1) removal and to remove an iterator as soon as it is detected to be exhausted -- a collections.OrderedDict could probably be used for this purpose, but I'll leave that as an exercise for the interested reader).
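One minimal sketch of that idea (under the hypothetical name alternate_pruning; it rebuilds a plain list of survivors each round rather than using an OrderedDict, so exhausted iterators are never probed again):

```python
def alternate_pruning(*iterables):
    # Sketch: drop exhausted iterators so later rounds only touch
    # the ones that are still producing values.
    iterators = [iter(iterable) for iterable in iterables]
    sentinel = object()
    while iterators:
        alive = []
        for iterator in iterators:
            value = next(iterator, sentinel)
            if value is not sentinel:
                alive.append(iterator)  # still producing; keep for next round
                yield value
        iterators = alive

print(''.join(alternate_pruning('abcde', 'fg', 'hijk')))  # afhbgicjdke
```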
If we want to open things up to the standard library, itertools can help here too:
from itertools import zip_longest, chain  # izip_longest in Python 2

def alternate2(*iterables):
    sentinel = object()
    result = chain.from_iterable(zip_longest(*iterables, fillvalue=sentinel))
    return (item for item in result if item is not sentinel)
Here, I return a generator expression, which is slightly different from writing a generator function, but not by much :-). Again, this can be slightly inefficient if there are many iterables and one of them is much longer than the others (consider the case where you have 100 iterables of length 1 and 1 iterable of length 101 -- this will run in effectively 101 * 101 steps, whereas the iteration should really be doable in about 101 * 2 + 1 steps).

There are several things that can be improved in your code. The one causing your problem is the most serious: you actually iterate several times over each of your arguments, doing nothing with the intermediate values in each pass.
That takes place when you run for thing in arg for each value of i.
Besides being a tremendous waste of resources, this does not work with iterators (which are what your hide function returns), since they are exhausted after being iterated over once. That is in contrast to sequences, which can be iterated over again and again (like the strings you are using for testing).
(Another wrong thing is hardcoding 10 as the longest sequence length you'd ever have -- in Python you iterate over generators and sequences regardless of their size.)
Anyway, the fix is to make sure you iterate over each of your arguments just once. The built-in zip can do that -- or, for your use case, itertools.zip_longest (izip_longest in Python 2.x) can retrieve the values you want from your args in a single for loop:
from itertools import zip_longest  # izip_longest in Python 2

def alternate(*args):
    sentinel = object()
    for values in zip_longest(*args, fillvalue=sentinel):
        for value in values:
            if value is not sentinel:
                yield value
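For reference, here is that same approach as a self-contained Python 3 snippet (zip_longest is the Python 3 name for izip_longest), checked against both of the question's test cases:

```python
from itertools import zip_longest

def alternate(*args):
    sentinel = object()
    # zip_longest pads shorter arguments with the sentinel,
    # which we filter back out before yielding.
    for values in zip_longest(*args, fillvalue=sentinel):
        for value in values:
            if value is not sentinel:
                yield value

def hide(iterable):
    for v in iterable:
        yield v

assert ''.join(alternate('abcde', 'fg', 'hijk')) == 'afhbgicjdke'
assert ''.join(alternate(hide('abcde'), hide('fg'), hide('hijk'))) == 'afhbgicjdke'
```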

If you want to pass only iterators (this won't work with plain strings), use the following code:
def alternate(*args):
    for i in range(10):
        for arg in args:
            arg_num = i
            for thing in arg:
                if arg_num == i:
                    yield thing
                    break
                else:
                    arg_num += 1
This is just your original code with a small change.
When you use plain strings, each call to alternate iterates over the strings from the beginning, so you can start counting from 0 (arg_num = 0) on every pass.
But when you create iterators by calling hide(), only a single iterator instance exists for each string, and it remembers its position. So you have to change arg_num = 0 to arg_num = i, and you also need to add the break statement.
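The difference can be seen directly: a string can be looped over repeatedly from the start, while a generator (like the one hide returns) keeps its position and is exhausted after one pass:

```python
def hide(iterable):
    for v in iterable:
        yield v

s = 'abc'
assert list(s) == ['a', 'b', 'c']
assert list(s) == ['a', 'b', 'c']  # strings restart from the beginning each time

g = hide('abc')
assert list(g) == ['a', 'b', 'c']
assert list(g) == []  # the generator is now exhausted
```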

Consider the code:
def test(data):
    for row in data:
        print("first loop")
    for row in data:
        print("second loop")
When data is an iterator, for example a list iterator or a generator expression*, this does not work:
>>> test(iter([1, 2]))
first loop
first loop
>>> test((_ for _ in [1, 2]))
first loop
first loop
This prints first loop a few times, since data is non-empty. However, it does not print second loop. Why does iterating over data work the first time, but not the second time? How can I make it work a second time?
Aside from for loops, the same problem appears to occur with any kind of iteration: list/set/dict comprehensions, passing the iterator to list(), sum() or reduce(), etc.
On the other hand, if data is another kind of iterable, such as a list or a range (which are both sequences), both loops run as expected:
>>> test([1, 2])
first loop
first loop
second loop
second loop
>>> test(range(2))
first loop
first loop
second loop
second loop
* More examples:
file objects
generators created from an explicit generator function
filter, map, and zip objects (in 3.x)
enumerate objects
csv.readers
various iterators defined in the itertools standard library
For general theory and terminology explanation, see What are iterator, iterable, and iteration?.
To detect whether the input is an iterator or a "reusable" iterable, see Ensure that an argument can be iterated twice.
An iterator can only be consumed once. For example:
lst = [1, 2, 3]
it = iter(lst)
next(it)
# => 1
next(it)
# => 2
next(it)
# => 3
next(it)
# => StopIteration
When the iterator is supplied to a for loop instead, that last StopIteration will cause it to exit the first time. Trying to use the same iterator in another for loop will cause StopIteration again immediately, because the iterator has already been consumed.
A simple way to work around this is to save all the elements to a list, which can be traversed as many times as needed. For example:
data = list(data)
If the iterator would iterate over many elements, however, it's a better idea to create independent iterators using tee():
import itertools
it1, it2 = itertools.tee(data, 2) # create as many as needed
Now each one can be iterated over in turn:
for e in it1:
    print("first loop")
for e in it2:
    print("second loop")
Iterators (e.g. from calling iter, from generator expressions, or from generator functions which yield) are stateful and can only be consumed once.
This is explained in Óscar López's answer, however, that answer's recommendation to use itertools.tee(data) instead of list(data) for performance reasons is misleading.
In most cases, where you want to iterate through the whole of data and then iterate through the whole of it again, tee takes more time and uses more memory than simply consuming the whole iterator into a list and then iterating over it twice. According to the documentation:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
tee may be preferred if you will only consume the first few elements of each iterator, or if you will alternate between consuming a few elements from one iterator and then a few from the other.
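A small sketch of the alternating pattern where tee shines (its internal buffer only ever holds the gap between the two iterators):

```python
import itertools

data = (n * n for n in range(6))  # a one-shot generator: 0, 1, 4, 9, 16, 25
it1, it2 = itertools.tee(data)

# Alternate between the two independent iterators:
assert next(it1) == 0
assert next(it1) == 1
assert next(it2) == 0  # it2 replays buffered elements, not the generator
assert next(it2) == 1
assert next(it2) == 4  # it2 now pulls a fresh element; it1 sees it buffered
assert next(it1) == 4
```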
Once an iterator is exhausted, it will not yield any more.
>>> it = iter([3, 1, 2])
>>> for x in it: print(x)
...
3
1
2
>>> for x in it: print(x)
...
>>>
How do I loop over an iterator twice?
It is usually impossible. (Explained later.) Instead, do one of the following:
Collect the iterator into something that can be looped over multiple times.
items = list(iterator)
for item in items:
    ...
Downside: This costs memory.
Create a new iterator. It usually takes only a microsecond to make a new iterator.
for item in create_iterator():
    ...
for item in create_iterator():
    ...
Downside: Iteration itself may be expensive (e.g. reading from disk or network).
Reset the "iterator". For example, with file iterators:
with open(...) as f:
    for item in f:
        ...
    f.seek(0)
    for item in f:
        ...
Downside: Most iterators cannot be "reset".
Philosophy of an Iterator
Typically, though not technically1:
Iterable: A for-loopable object that represents data. Examples: list, tuple, str.
Iterator: A pointer to some element of an iterable.
If we were to define a sequence iterator, it might look something like this:
class SequenceIterator:
    index: int
    items: Sequence  # Sequences can be randomly indexed via items[index].

    def __next__(self):
        """Increment index, and return the latest item."""
The important thing here is that typically, an iterator does not store any actual data inside itself.
Iterators usually model a temporary "stream" of data. That data source is consumed by the process of iteration. This is a good hint as to why one cannot loop over an arbitrary source of data more than once. We need to open a new temporary stream of data (i.e. create a new iterator) to do that.
Exhausting an Iterator
What happens when we extract items from an iterator, starting with the current element of the iterator, and continuing until it is entirely exhausted? That's what a for loop does:
iterable = "ABC"
iterator = iter(iterable)
for item in iterator:
    print(item)
Let's support this functionality in SequenceIterator by telling the for loop how to extract the next item:
class SequenceIterator:
    def __next__(self):
        item = self.items[self.index]
        self.index += 1
        return item
Hold on. What if index goes past the last element of items? We should raise a safe exception for that:
class SequenceIterator:
    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration  # Safely says, "no more items in iterator!"
        self.index += 1
        return item
Now, the for loop knows when to stop extracting items from the iterator.
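Putting the pieces together into a runnable sketch (adding the __init__ and __iter__ that the snippets above leave implicit):

```python
class SequenceIterator:
    def __init__(self, items):
        self.items = items  # any sequence: str, list, tuple...
        self.index = 0

    def __iter__(self):
        return self  # an iterator's __iter__ returns itself

    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration  # no more items
        self.index += 1
        return item

assert list(SequenceIterator("ABC")) == ['A', 'B', 'C']
```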
What happens if we now try to loop over the iterator again?
iterable = "ABC"
iterator = iter(iterable)
# iterator.index == 0
for item in iterator:
    print(item)
# iterator.index == 3
for item in iterator:
    print(item)
# iterator.index == 3
Since the second loop starts from the current iterator.index, which is 3, it does not have anything else to print and so iterator.__next__ raises the StopIteration exception, causing the loop to end immediately.
1 Technically:
Iterable: An object that returns an iterator when __iter__ is called on it.
Iterator: An object that one can repeatedly call __next__ on in a loop in order to extract items. Furthermore, calling __iter__ on it should return itself.
More details here.
Why doesn't iterating work the second time for iterators?
It does "work", in the sense that the for loop in the examples does run. It simply performs zero iterations. This happens because the iterator is "exhausted"; it has already iterated over all of the elements.
Why does it work for other kinds of iterables?
Because, behind the scenes, a new iterator is created for each loop, based on that iterable. Creating the iterator from scratch means that it starts at the beginning.
This happens because iterating requires an iterable. If an iterable was already provided, it will be used as-is; but otherwise, a conversion is necessary, which creates a new object.
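This conversion is observable with iter directly: a list hands out a fresh iterator on every call, while an iterator simply returns itself:

```python
lst = [1, 2]
assert iter(lst) is not iter(lst)  # a new iterator object each time

it = iter(lst)
assert iter(it) is it  # calling iter on an iterator returns the same object
```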
Given an iterator, how can we iterate twice over the data?
By caching the data; starting over with a new iterator (assuming we can re-create the initial condition); or, if the iterator was specifically designed for it, seeking or resetting the iterator. Relatively few iterators offer seeking or resetting.
Caching
The only fully general approach is to remember what elements were seen (or determine what elements will be seen) the first time and iterate over them again. The simplest way is by creating a list or tuple from the iterator:
elements = list(iterator)
for element in elements:
    ...
for element in elements:
    ...
Since the list is a non-iterator iterable, each loop will create a new iterator that iterates over all the elements. If the iterator is already "part way through" an iteration when we do this, the list will only contain the "following" elements:
abstract = (x for x in range(10))  # represents integers from 0 to 9 inclusive
next(abstract)  # skips the 0
concrete = list(abstract)  # makes a list with the rest
for element in concrete:
    print(element)  # starts at 1, because the list does
for element in concrete:
    print(element)  # also starts at 1, because a new iterator is created
A more sophisticated way is using itertools.tee. This essentially creates a "buffer" of elements from the original source as they're iterated over, and then creates and returns several custom iterators that work by remembering an index, fetching from the buffer if possible, and appending to the buffer (using the original iterable) when necessary. (In the reference implementation of modern Python versions, tee is implemented in C rather than in pure Python.)
from itertools import tee

concrete = list(range(10))  # `tee` works on any iterable, iterator or not
x, y = tee(concrete, 2)  # the second argument is the number of instances
for element in x:
    print(element)
    if element == 3:
        break
for element in y:
    print(element)  # starts over at 0, taking 0, 1, 2, 3 from a buffer
Starting over
If we know and can recreate the starting conditions for the iterator when the iteration started, that also solves the problem. This is implicitly what happens when iterating multiple times over a list: the "starting conditions for the iterator" are just the contents of the list, and all the iterators created from it give the same results. For another example, if a generator function does not depend on an external state, we can simply call it again with the same parameters:
def powers_of(base, *range_args):
    for i in range(*range_args):
        yield base ** i

exhaustible = powers_of(2, 1, 12)
for value in exhaustible:
    print(value)
print('exhausted')
for value in exhaustible:  # no results from here
    print(value)
# Want the same values again? Then call the same generator function again:
print('replenished')
for value in powers_of(2, 1, 12):
    print(value)
Seekable or resettable iterators
Some specific iterators may make it possible to "reset" iteration to the beginning, or even to "seek" to a specific point in the iteration. In general, iterators need to have some kind of internal state in order to keep track of "where" they are in the iteration. Making an iterator "seekable" or "resettable" simply means allowing external access to, respectively, modify or re-initialize that state.
Nothing in Python disallows this, but in many cases it's not feasible to provide a simple interface; in most other cases, it just isn't supported even though it might be trivial. For generator functions, on the other hand, the internal state in question is quite complex, and it protects itself against modification.
The classic example of a seekable iterator is an open file object created using the built-in open function. The state in question is a position within the underlying file on disk; the .tell and .seek methods allow us to inspect and modify that position value - e.g. .seek(0) will set the position to the beginning of the file, effectively resetting the iterator. Similarly, csv.reader is a wrapper around a file; seeking within that file will therefore affect the subsequent results of iteration.
In all but the simplest, deliberately-designed cases, rewinding an iterator will be difficult to impossible. Even if the iterator is designed to be seekable, this leaves the question of figuring out where to seek to - i.e., what the internal state was at the desired point in the iteration. In the case of the powers_of generator shown above, that's straightforward: just modify i. For a file, we'd need to know what the file position was at the beginning of the desired line, not just the line number. That's why the file interface provides .tell as well as .seek.
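For example, with a temporary file (using readline rather than a for loop, since text-mode files disable tell while next-based iteration is in progress):

```python
import tempfile

with tempfile.TemporaryFile('w+') as f:
    f.write('first\nsecond\n')
    f.seek(0)                         # reset to the start of the file
    assert f.readline() == 'first\n'
    pos = f.tell()                    # remember where the second line begins
    assert f.readline() == 'second\n'
    f.seek(pos)                       # rewind to that exact position
    assert f.readline() == 'second\n'
```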
Here's a re-worked example of powers_of representing an unbounded sequence, designed to be seekable, rewindable and resettable via an exponent property:
class PowersOf:
    def __init__(self, base):
        self._exponent = 0
        self._base = base

    def __iter__(self):
        return self

    def __next__(self):
        result = self._base ** self._exponent
        self._exponent += 1
        return result

    @property
    def exponent(self):
        return self._exponent

    @exponent.setter
    def exponent(self, new_value):
        if not isinstance(new_value, int):
            raise TypeError("must set with an integer")
        if new_value < 0:
            raise ValueError("can't set to negative value")
        self._exponent = new_value
Examples:
pot = PowersOf(2)
for i in pot:
    if i > 1000:
        break
    print(i)
pot.exponent = 5  # jump to this point in the (unbounded) sequence
print(next(pot))  # 32
print(next(pot))  # 64
Technical detail
Iterators vs. iterables
Recall that, briefly:
"iteration" means looking at each element in turn, of some abstract, conceptual sequence of values. This can include:
using a for loop
using a comprehension or generator expression
unpacking an iterable, including calling a function with * or ** syntax
constructing a list, tuple, etc. from another iterable
"iterable" means an object that represents such a sequence. (What the Python documentation calls a "sequence" is in fact more specific than that - basically it also needs to be finite and ordered.). Note that the elements do not need to be "stored" - in memory, disk or anywhere else; it is sufficient that we can determine them during the process of iteration.
"iterator" means an object that represents a process of iteration; in some sense, it keeps track of "where we are" in the iteration.
Combining the definitions, an iterable is something that represents elements that can be examined in a specified order; an iterator is something that allows us to examine elements in a specified order. Certainly an iterator "represents" those elements - since we can find out what they are, by examining them - and certainly they can be examined in a specified order - since that's what the iterator enables. So, we can conclude that an iterator is a kind of iterable - and Python's definitions agree.
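The standard library's abstract base classes agree with this conclusion: every Iterator is an Iterable, but not the other way around:

```python
from collections.abc import Iterable, Iterator

it = iter([1, 2, 3])
assert isinstance(it, Iterator)
assert isinstance(it, Iterable)   # an iterator is also an iterable

lst = [1, 2, 3]
assert isinstance(lst, Iterable)
assert not isinstance(lst, Iterator)  # but a list is not an iterator
```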
How iteration works
In order to iterate, we need an iterator; but in normal cases (i.e. except in poorly written user-defined code), any iterable is acceptable. Behind the scenes, Python will convert other iterables to corresponding iterators; the logic for this is available via the built-in iter function. To iterate, Python repeatedly asks the iterator for a "next element" until the iterator raises StopIteration. The logic for this is available via the built-in next function.
Generally, when iter is given a single argument that already is an iterator, that same object is returned unchanged. But if it's some other kind of iterable, a new iterator object will be created. This directly leads to the problem in the OP. User-defined types can break both of these rules, but they probably shouldn't.
The iterator protocol
Python roughly defines an "iterator protocol" that specifies how it decides whether a type is an iterable (or specifically an iterator), and how types can provide the iteration functionality. The details have changed slightly over the years, but the modern setup works like so:
Anything that has an __iter__ or a __getitem__ method is an iterable. Anything that defines an __iter__ method and a __next__ method is specifically an iterator. (Note in particular that if there is a __getitem__ and a __next__ but no __iter__, the __next__ has no particular meaning, and the object is a non-iterator iterable.)
Given a single argument, iter will attempt to call the __iter__ method of that argument, verify that the result has a __next__ method, and return that result. (It does not ensure the presence of an __iter__ method on the result. Such objects can often be used in places where an iterator is expected, but will fail if e.g. iter is called on them.) If there is no __iter__, it will look for __getitem__, and use that to create an instance of a built-in iterator type. That iterator is roughly equivalent to:
class Iterator:
    def __init__(self, bound_getitem):
        self._index = 0
        self._bound_getitem = bound_getitem

    def __iter__(self):
        return self

    def __next__(self):
        try:
            result = self._bound_getitem(self._index)
        except IndexError:
            raise StopIteration
        self._index += 1
        return result
Given a single argument, next will attempt to call the __next__ method of that argument, allowing any StopIteration to propagate.
With all of this machinery in place, it is possible to implement a for loop in terms of while. Specifically, a loop like
for element in iterable:
    ...
will approximately translate to:
iterator = iter(iterable)
while True:
    try:
        element = next(iterator)
    except StopIteration:
        break
    ...
except that the iterator is not actually assigned any name (the syntax here is to emphasize that iter is only called once, and is called even if there are no iterations of the ... code).

class PowersOf:
def __init__(self, base):
self._exponent = 0
self._base = base
def __iter__(self):
return self
def __next__(self):
result = self._base ** self._exponent
self._exponent += 1
return result
#property
def exponent(self):
return self._exponent
#exponent.setter
def exponent(self, value):
if not isinstance(new_value, int):
raise TypeError("must set with an integer")
if new_value < 0:
raise ValueError("can't set to negative value")
self._exponent = new_value
Examples:
pot = PowersOf(2)
for i in pot:
if i > 1000:
break
print(i)
pot.exponent = 5 # jump to this point in the (unbounded) sequence
print(next(pot)) # 32
print(next(pot)) # 64
Technical detail
Iterators vs. iterables
Recall that, briefly:
"iteration" means looking at each element in turn, of some abstract, conceptual sequence of values. This can include:
using a for loop
using a comprehension or generator expression
unpacking an iterable, including calling a function with * or ** syntax
constructing a list, tuple, etc. from another iterable
"iterable" means an object that represents such a sequence. (What the Python documentation calls a "sequence" is in fact more specific than that - basically it also needs to be finite and ordered.). Note that the elements do not need to be "stored" - in memory, disk or anywhere else; it is sufficient that we can determine them during the process of iteration.
"iterator" means an object that represents a process of iteration; in some sense, it keeps track of "where we are" in the iteration.
Combining the definitions, an iterable is something that represents elements that can be examined in a specified order; an iterator is something that allows us to examine elements in a specified order. Certainly an iterator "represents" those elements - since we can find out what they are, by examining them - and certainly they can be examined in a specified order - since that's what the iterator enables. So, we can conclude that an iterator is a kind of iterable - and Python's definitions agree.
How iteration works
In order to iterate, we need an iterator. When we iterate in Python, an iterator is needed; but in normal cases (i.e. except in poorly written user-defined code), any iterable is permissible. Behind the scenes, Python will convert other iterables to corresponding iterators; the logic for this is available via the built-in iter function. To iterate, Python repeatedly asks the iterator for a "next element" until the iterator raises a StopException. The logic for this is available via the built-in next function.
Generally, when iter is given a single argument that already is an iterator, that same object is returned unchanged. But if it's some other kind of iterable, a new iterator object will be created. This directly leads to the problem in the OP. User-defined types can break both of these rules, but they probably shouldn't.
The iterator protocol
Python roughly defines an "iterator protocol" that specifies how it decides whether a type is an iterable (or specifically an iterator), and how types can provide the iteration functionality. The details have changed a slightly over the years, but the modern setup works like so:
Anything that has an __iter__ or a __getitem__ method is an iterable. Anything that defines an __iter__ method and a __next__ method is specifically an iterator. (Note in particular that if there is a __getitem__ and a __next__ but no __iter__, the __next__ has no particular meaning, and the object is a non-iterator iterable.)
Given a single argument, iter will attempt to call the __iter__ method of that argument, verify that the result has a __next__ method, and return that result. It does not ensure the presence of an __iter__ method on the result. Such objects can often be used in places where an iterator is expected, but will fail if e.g. iter is called on them.) If there is no __iter__, it will look for __getitem__, and use that to create an instance of a built-in iterator type. That iterator is roughly equivalent to
class Iterator:
def __init__(self, bound_getitem):
self._index = 0
self._bound_getitem = bound_getitem
def __iter__(self):
return self
def __next__(self):
try:
result = self._bound_getitem(self._index)
except IndexError:
raise StopIteration
self._index += 1
return result
Given a single argument, next will attempt to call the __next__ method of that argument, allowing any StopIteration to propagate.
With all of this machinery in place, it is possible to implement a for loop in terms of while. Specifically, a loop like
for element in iterable:
...
will approximately translate to:
iterator = iter(iterable)
while True:
try:
element = next(iterator)
except StopIteration:
break
...
except that the iterator is not actually assigned any name (the syntax here is to emphasize that iter is only called once, and is called even if there are no iterations of the ... code).

python grouper object could not print anything after using itemgetter [duplicate]

Consider the code:
def test(data):
    for row in data:
        print("first loop")
    for row in data:
        print("second loop")
When data is an iterator, for example a list iterator or a generator expression*, this does not work:
>>> test(iter([1, 2]))
first loop
first loop
>>> test((_ for _ in [1, 2]))
first loop
first loop
This prints first loop a few times, since data is non-empty. However, it does not print second loop. Why does iterating over data work the first time, but not the second time? How can I make it work a second time?
Aside from for loops, the same problem appears to occur with any kind of iteration: list/set/dict comprehensions, passing the iterator to list(), sum() or reduce(), etc.
On the other hand, if data is another kind of iterable, such as a list or a range (which are both sequences), both loops run as expected:
>>> test([1, 2])
first loop
first loop
second loop
second loop
>>> test(range(2))
first loop
first loop
second loop
second loop
* More examples:
file objects
generators created from an explicit generator function
filter, map, and zip objects (in 3.x)
enumerate objects
csv.readers
various iterators defined in the itertools standard library
For general theory and terminology explanation, see What are iterator, iterable, and iteration?.
To detect whether the input is an iterator or a "reusable" iterable, see Ensure that an argument can be iterated twice.
An iterator can only be consumed once. For example:
lst = [1, 2, 3]
it = iter(lst)
next(it)
# => 1
next(it)
# => 2
next(it)
# => 3
next(it)
# => StopIteration
When the iterator is supplied to a for loop instead, that last StopIteration will cause it to exit the first time. Trying to use the same iterator in another for loop will cause StopIteration again immediately, because the iterator has already been consumed.
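This exhaustion is easy to observe directly with next and a sentinel default (a small sketch; the names are illustrative):

```python
it = iter([1, 2])                            # a fresh list iterator
consumed = [next(it, None), next(it, None)]  # yields 1, then 2
extra = next(it, None)                       # exhausted: the default is returned instead
print(consumed, extra)                       # [1, 2] None
```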
A simple way to work around this is to save all the elements to a list, which can be traversed as many times as needed. For example:
data = list(data)
If the iterator would iterate over many elements, however, it's a better idea to create independent iterators using tee():
import itertools
it1, it2 = itertools.tee(data, 2) # create as many as needed
Now each one can be iterated over in turn:
for e in it1:
    print("first loop")
for e in it2:
    print("second loop")
Iterators (e.g. from calling iter, from generator expressions, or from generator functions which yield) are stateful and can only be consumed once.
This is explained in Óscar López's answer; however, that answer's recommendation to use itertools.tee(data) instead of list(data) for performance reasons is misleading.
In most cases, where you want to iterate through the whole of data and then iterate through the whole of it again, tee takes more time and uses more memory than simply consuming the whole iterator into a list and then iterating over it twice. According to the documentation:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
tee may be preferred if you will only consume the first few elements of each iterator, or if you will alternate between consuming a few elements from one iterator and then a few from the other.
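For example, tee works well when the two copies are consumed in lockstep, because it only needs to buffer the gap between them (an illustrative sketch):

```python
from itertools import tee

source = iter(range(1_000_000))  # a large, one-shot iterator
a, b = tee(source, 2)
# Alternate between the two copies; tee only buffers the gap between them,
# so the auxiliary storage here stays tiny despite the large source.
pairs = [(next(a), next(b)) for _ in range(3)]
print(pairs)  # [(0, 0), (1, 1), (2, 2)]
```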
Once an iterator is exhausted, it will not yield any more.
>>> it = iter([3, 1, 2])
>>> for x in it: print(x)
...
3
1
2
>>> for x in it: print(x)
...
>>>
How do I loop over an iterator twice?
It is usually impossible. (Explained later.) Instead, do one of the following:
Collect the iterator into something that can be looped over multiple times.
items = list(iterator)
for item in items:
    ...
Downside: This costs memory.
Create a new iterator. It usually takes only a microsecond to make a new iterator.
for item in create_iterator():
    ...
for item in create_iterator():
    ...
Downside: Iteration itself may be expensive (e.g. reading from disk or network).
Reset the "iterator". For example, with file iterators:
with open(...) as f:
    for item in f:
        ...
    f.seek(0)
    for item in f:
        ...
Downside: Most iterators cannot be "reset".
Philosophy of an Iterator
Typically, though not technically1:
Iterable: A for-loopable object that represents data. Examples: list, tuple, str.
Iterator: A pointer to some element of an iterable.
If we were to define a sequence iterator, it might look something like this:
class SequenceIterator:
    index: int
    items: Sequence  # Sequences can be randomly indexed via items[index].

    def __next__(self):
        """Increment index, and return the latest item."""
The important thing here is that typically, an iterator does not store any actual data inside itself.
Iterators usually model a temporary "stream" of data. That data source is consumed by the process of iteration. This is a good hint as to why one cannot loop over an arbitrary source of data more than once. We need to open a new temporary stream of data (i.e. create a new iterator) to do that.
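For example, a generator function opens a fresh "stream" on every call, while each individual stream is consumed exactly once (a small illustrative sketch):

```python
def stream():
    # Each call opens a fresh stream over the same conceptual data.
    yield from (n * n for n in range(3))

s1, s2 = stream(), stream()
assert list(s1) == [0, 1, 4]  # consumes the first stream...
assert list(s1) == []         # ...which is now spent
assert list(s2) == [0, 1, 4]  # the second stream is independent
```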
Exhausting an Iterator
What happens when we extract items from an iterator, starting with the current element of the iterator, and continuing until it is entirely exhausted? That's what a for loop does:
iterable = "ABC"
iterator = iter(iterable)
for item in iterator:
    print(item)
Let's support this functionality in SequenceIterator by telling the for loop how to extract the next item:
class SequenceIterator:
    def __next__(self):
        item = self.items[self.index]
        self.index += 1
        return item
Hold on. What if index goes past the last element of items? We should raise a safe exception for that:
class SequenceIterator:
    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration  # Safely says, "no more items in iterator!"
        self.index += 1
        return item
Now, the for loop knows when to stop extracting items from the iterator.
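Putting the fragments together, a complete, runnable version might look like this (the attribute names match the sketch above):

```python
class SequenceIterator:
    def __init__(self, items):
        self.items = items  # any randomly-indexable sequence
        self.index = 0

    def __iter__(self):
        return self  # an iterator is its own iterator

    def __next__(self):
        try:
            item = self.items[self.index]
        except IndexError:
            raise StopIteration  # safely says, "no more items in iterator!"
        self.index += 1
        return item

first = list(SequenceIterator("ABC"))  # ['A', 'B', 'C']
it = SequenceIterator("ABC")
list(it)           # consumes the iterator...
second = list(it)  # ...so this is []
```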
What happens if we now try to loop over the iterator again?
iterable = "ABC"
iterator = iter(iterable)
# iterator.index == 0
for item in iterator:
    print(item)
# iterator.index == 3
for item in iterator:
    print(item)
# iterator.index == 3
Since the second loop starts from the current iterator.index, which is 3, it does not have anything else to print and so iterator.__next__ raises the StopIteration exception, causing the loop to end immediately.
1 Technically:
Iterable: An object that returns an iterator when __iter__ is called on it.
Iterator: An object that one can repeatedly call __next__ on in a loop in order to extract items. Furthermore, calling __iter__ on it should return itself.
More details here.
Why doesn't iterating work the second time for iterators?
It does "work", in the sense that the for loop in the examples does run. It simply performs zero iterations. This happens because the iterator is "exhausted"; it has already iterated over all of the elements.
Why does it work for other kinds of iterables?
Because, behind the scenes, a new iterator is created for each loop, based on that iterable. Creating the iterator from scratch means that it starts at the beginning.
This happens because iterating requires an iterator. If an iterator was already provided, it will be used as-is; but otherwise, a conversion is necessary, which creates a new object.
Given an iterator, how can we iterate twice over the data?
By caching the data; starting over with a new iterator (assuming we can re-create the initial condition); or, if the iterator was specifically designed for it, seeking or resetting the iterator. Relatively few iterators offer seeking or resetting.
Caching
The only fully general approach is to remember what elements were seen (or determine what elements will be seen) the first time and iterate over them again. The simplest way is by creating a list or tuple from the iterator:
elements = list(iterator)
for element in elements:
    ...
for element in elements:
    ...
Since the list is a non-iterator iterable, each loop will create a new iterator that iterates over all the elements. If the iterator is already "part way through" an iteration when we do this, the list will only contain the "following" elements:
abstract = (x for x in range(10))  # represents integers from 0 to 9 inclusive
next(abstract)                     # skips the 0
concrete = list(abstract)          # makes a list with the rest
for element in concrete:
    print(element)  # starts at 1, because the list does
for element in concrete:
    print(element)  # also starts at 1, because a new iterator is created
A more sophisticated way is using itertools.tee. This essentially creates a "buffer" of elements from the original source as they're iterated over, and then creates and returns several custom iterators that work by remembering an index, fetching from the buffer if possible, and appending to the buffer (using the original iterable) when necessary. (In the reference implementation of modern Python versions, this is implemented in C rather than in native Python code.)
from itertools import tee

concrete = list(range(10))  # `tee` works on any iterable, iterator or not
x, y = tee(concrete, 2)     # the second argument is the number of instances.
for element in x:
    print(element)
    if element == 3:
        break
for element in y:
    print(element)  # starts over at 0, taking 0, 1, 2, 3 from a buffer
Starting over
If we know and can recreate the starting conditions for the iterator when the iteration started, that also solves the problem. This is implicitly what happens when iterating multiple times over a list: the "starting conditions for the iterator" are just the contents of the list, and all the iterators created from it give the same results. For another example, if a generator function does not depend on an external state, we can simply call it again with the same parameters:
def powers_of(base, *range_args):
    for i in range(*range_args):
        yield base ** i

exhaustible = powers_of(2, 1, 12)
for value in exhaustible:
    print(value)
print('exhausted')
for value in exhaustible:  # no results from here
    print(value)
# Want the same values again? Then call the generator function again:
print('replenished')
for value in powers_of(2, 1, 12):
    print(value)
Seekable or resettable iterators
Some specific iterators may make it possible to "reset" iteration to the beginning, or even to "seek" to a specific point in the iteration. In general, iterators need to have some kind of internal state in order to keep track of "where" they are in the iteration. Making an iterator "seekable" or "resettable" simply means allowing external access to, respectively, modify or re-initialize that state.
Nothing in Python disallows this, but in many cases it's not feasible to provide a simple interface; in most other cases, it just isn't supported even though it might be trivial. For generator functions, on the other hand, the internal state in question is quite complex, and protects itself against modification.
The classic example of a seekable iterator is an open file object created using the built-in open function. The state in question is a position within the underlying file on disk; the .tell and .seek methods allow us to inspect and modify that position value - e.g. .seek(0) will set the position to the beginning of the file, effectively resetting the iterator. Similarly, csv.reader is a wrapper around a file; seeking within that file will therefore affect the subsequent results of iteration.
In all but the simplest, deliberately-designed cases, rewinding an iterator will be difficult to impossible. Even if the iterator is designed to be seekable, this leaves the question of figuring out where to seek to - i.e., what the internal state was at the desired point in the iteration. In the case of the powers_of generator shown above, that's straightforward: just modify i. For a file, we'd need to know what the file position was at the beginning of the desired line, not just the line number. That's why the file interface provides .tell as well as .seek.
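For example, with a real file we can remember a position with .tell and return to it later with .seek (a small sketch using a temporary file; the contents are illustrative):

```python
import tempfile

with tempfile.TemporaryFile(mode="w+") as f:
    f.write("alpha\nbeta\ngamma\n")
    f.seek(0)
    first = f.readline()  # 'alpha\n'
    marker = f.tell()     # position at the start of the second line
    rest = list(f)        # ['beta\n', 'gamma\n']
    f.seek(marker)        # rewind to the remembered position...
    again = list(f)       # ...and the same lines come back
```

(Note that on text-mode files, .tell is only reliable between whole reads like .readline; mixing it with buffered iteration via next can be restricted.)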
Here's a re-worked example of powers_of representing an unbounded sequence, and designed to be seekable, rewindable and resettable via an exponent property:
class PowersOf:
    def __init__(self, base):
        self._exponent = 0
        self._base = base

    def __iter__(self):
        return self

    def __next__(self):
        result = self._base ** self._exponent
        self._exponent += 1
        return result

    @property
    def exponent(self):
        return self._exponent

    @exponent.setter
    def exponent(self, new_value):
        if not isinstance(new_value, int):
            raise TypeError("must set with an integer")
        if new_value < 0:
            raise ValueError("can't set to negative value")
        self._exponent = new_value
Examples:
pot = PowersOf(2)
for i in pot:
    if i > 1000:
        break
    print(i)
pot.exponent = 5  # jump to this point in the (unbounded) sequence
print(next(pot))  # 32
print(next(pot))  # 64
Technical detail
Iterators vs. iterables
Recall that, briefly:
"iteration" means looking at each element in turn, of some abstract, conceptual sequence of values. This can include:
using a for loop
using a comprehension or generator expression
unpacking an iterable, including calling a function with * or ** syntax
constructing a list, tuple, etc. from another iterable
"iterable" means an object that represents such a sequence. (What the Python documentation calls a "sequence" is in fact more specific than that - basically it also needs to be finite and ordered.) Note that the elements do not need to be "stored" - in memory, on disk or anywhere else; it is sufficient that we can determine them during the process of iteration.
"iterator" means an object that represents a process of iteration; in some sense, it keeps track of "where we are" in the iteration.
Combining the definitions, an iterable is something that represents elements that can be examined in a specified order; an iterator is something that allows us to examine elements in a specified order. Certainly an iterator "represents" those elements - since we can find out what they are, by examining them - and certainly they can be examined in a specified order - since that's what the iterator enables. So, we can conclude that an iterator is a kind of iterable - and Python's definitions agree.
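Python's abstract base classes in collections.abc make this relationship explicit and easy to check:

```python
from collections.abc import Iterable, Iterator

assert issubclass(Iterator, Iterable)       # every iterator is an iterable
assert isinstance([1, 2, 3], Iterable)      # a list is an iterable...
assert not isinstance([1, 2, 3], Iterator)  # ...but not an iterator
assert isinstance(iter([1, 2, 3]), Iterator)
```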
How iteration works
In order to iterate, Python needs an iterator; but in normal cases (i.e. except in poorly written user-defined code), any iterable is permissible. Behind the scenes, Python will convert other iterables to corresponding iterators; the logic for this is available via the built-in iter function. To iterate, Python repeatedly asks the iterator for a "next element" until the iterator raises StopIteration. The logic for this is available via the built-in next function.
Generally, when iter is given a single argument that already is an iterator, that same object is returned unchanged. But if it's some other kind of iterable, a new iterator object will be created. This directly leads to the problem in the OP. User-defined types can break both of these rules, but they probably shouldn't.
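Both behaviors are easy to verify directly:

```python
data = [1, 2, 3]
it = iter(data)
assert iter(data) is not data          # a list is converted to a new iterator
assert iter(data) is not iter(data)    # ...a fresh one on each call
assert iter(it) is it                  # an iterator is returned unchanged
```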
The iterator protocol
Python roughly defines an "iterator protocol" that specifies how it decides whether a type is an iterable (or specifically an iterator), and how types can provide the iteration functionality. The details have changed slightly over the years, but the modern setup works like so:
Anything that has an __iter__ or a __getitem__ method is an iterable. Anything that defines an __iter__ method and a __next__ method is specifically an iterator. (Note in particular that if there is a __getitem__ and a __next__ but no __iter__, the __next__ has no particular meaning, and the object is a non-iterator iterable.)
Given a single argument, iter will attempt to call the __iter__ method of that argument, verify that the result has a __next__ method, and return that result. (It does not ensure the presence of an __iter__ method on the result. Such objects can often be used in places where an iterator is expected, but will fail if e.g. iter is called on them.) If there is no __iter__, it will look for __getitem__, and use that to create an instance of a built-in iterator type. That iterator is roughly equivalent to
class Iterator:
    def __init__(self, bound_getitem):
        self._index = 0
        self._bound_getitem = bound_getitem

    def __iter__(self):
        return self

    def __next__(self):
        try:
            result = self._bound_getitem(self._index)
        except IndexError:
            raise StopIteration
        self._index += 1
        return result
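To see this __getitem__ fallback in action, here is a minimal sketch (the class name is illustrative):

```python
class Squares:
    # No __iter__; only __getitem__, indexed from 0 upward.
    def __getitem__(self, index):
        if index >= 5:
            raise IndexError  # tells the fallback iterator to stop
        return index * index

result = list(Squares())  # iter() builds the fallback iterator for us
print(result)             # [0, 1, 4, 9, 16]
```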
Given a single argument, next will attempt to call the __next__ method of that argument, allowing any StopIteration to propagate.
With all of this machinery in place, it is possible to implement a for loop in terms of while. Specifically, a loop like
for element in iterable:
    ...
will approximately translate to:
iterator = iter(iterable)
while True:
    try:
        element = next(iterator)
    except StopIteration:
        break
    ...
except that the iterator is not actually assigned any name (the syntax here is to emphasize that iter is only called once, and is called even if there are no iterations of the ... code).

How to extract a string from lines of a file in Python and use the value as a float? [duplicate]

Consider the code:
def test(data):
for row in data:
print("first loop")
for row in data:
print("second loop")
When data is an iterator, for example a list iterator or a generator expression*, this does not work:
>>> test(iter([1, 2]))
first loop
first loop
>>> test((_ for _ in [1, 2]))
first loop
first loop
This prints first loop a few times, since data is non-empty. However, it does not print second loop. Why does iterating over data work the first time, but not the second time? How can I make it work a second time?
Aside from for loops, the same problem appears to occur with any kind of iteration: list/set/dict comprehensions, passing the iterator to list(), sum() or reduce(), etc.
On the other hand, if data is another kind of iterable, such as a list or a range (which are both sequences), both loops run as expected:
>>> test([1, 2])
first loop
first loop
second loop
second loop
>>> test(range(2))
first loop
first loop
second loop
second loop
* More examples:
file objects
generators created from an explicit generator function
filter, map, and zip objects (in 3.x)
enumerate objects
csv.readers
various iterators defined in the itertools standard library
For general theory and terminology explanation, see What are iterator, iterable, and iteration?.
To detect whether the input is an iterator or a "reusable" iterable, see Ensure that an argument can be iterated twice.
An iterator can only be consumed once. For example:
lst = [1, 2, 3]
it = iter(lst)
next(it)
# => 1
next(it)
# => 2
next(it)
# => 3
next(it)
# => StopIteration
When the iterator is supplied to a for loop instead, that last StopIteration will cause it to exit the first time. Trying to use the same iterator in another for loop will cause StopIteration again immediately, because the iterator has already been consumed.
A simple way to work around this is to save all the elements to a list, which can be traversed as many times as needed. For example:
data = list(data)
If the iterator would iterate over many elements, however, it's a better idea to create independent iterators using tee():
import itertools
it1, it2 = itertools.tee(data, 2) # create as many as needed
Now each one can be iterated over in turn:
for e in it1:
print("first loop")
for e in it2:
print("second loop")
Iterators (e.g. from calling iter, from generator expressions, or from generator functions which yield) are stateful and can only be consumed once.
This is explained in Óscar López's answer, however, that answer's recommendation to use itertools.tee(data) instead of list(data) for performance reasons is misleading.
In most cases, where you want to iterate through the whole of data and then iterate through the whole of it again, tee takes more time and uses more memory than simply consuming the whole iterator into a list and then iterating over it twice. According to the documentation:
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
tee may be preferred if you will only consume the first few elements of each iterator, or if you will alternate between consuming a few elements from one iterator and then a few from the other.
Once an iterator is exhausted, it will not yield any more.
>>> it = iter([3, 1, 2])
>>> for x in it: print(x)
...
3
1
2
>>> for x in it: print(x)
...
>>>
How do I loop over an iterator twice?
It is usually impossible. (Explained later.) Instead, do one of the following:
Collect the iterator into a something that can be looped over multiple times.
items = list(iterator)
for item in items:
...
Downside: This costs memory.
Create a new iterator. It usually takes only a microsecond to make a new iterator.
for item in create_iterator():
...
for item in create_iterator():
...
Downside: Iteration itself may be expensive (e.g. reading from disk or network).
Reset the "iterator". For example, with file iterators:
with open(...) as f:
for item in f:
...
f.seek(0)
for item in f:
...
Downside: Most iterators cannot be "reset".
Philosophy of an Iterator
Typically, though not technically1:
Iterable: A for-loopable object that represents data. Examples: list, tuple, str.
Iterator: A pointer to some element of an iterable.
If we were to define a sequence iterator, it might look something like this:
class SequenceIterator:
index: int
items: Sequence # Sequences can be randomly indexed via items[index].
def __next__(self):
"""Increment index, and return the latest item."""
The important thing here is that typically, an iterator does not store any actual data inside itself.
Iterators usually model a temporary "stream" of data. That data source is consumed by the process of iteration. This is a good hint as to why one cannot loop over an arbitrary source of data more than once. We need to open a new temporary stream of data (i.e. create a new iterator) to do that.
Exhausting an Iterator
What happens when we extract items from an iterator, starting with the current element of the iterator, and continuing until it is entirely exhausted? That's what a for loop does:
iterable = "ABC"
iterator = iter(iterable)
for item in iterator:
print(item)
Let's support this functionality in SequenceIterator by telling the for loop how to extract the next item:
class SequenceIterator:
def __next__(self):
item = self.items[self.index]
self.index += 1
return item
Hold on. What if index goes past the last element of items? We should raise a safe exception for that:
class SequenceIterator:
def __next__(self):
try:
item = self.items[self.index]
except IndexError:
raise StopIteration # Safely says, "no more items in iterator!"
self.index += 1
return item
Now, the for loop knows when to stop extracting items from the iterator.
What happens if we now try to loop over the iterator again?
iterable = "ABC"
iterator = iter(iterable)
# iterator.index == 0
for item in iterator:
print(item)
# iterator.index == 3
for item in iterator:
print(item)
# iterator.index == 3
Since the second loop starts from the current iterator.index, which is 3, it does not have anything else to print and so iterator.__next__ raises the StopIteration exception, causing the loop to end immediately.
1 Technically:
Iterable: An object that returns an iterator when __iter__ is called on it.
Iterator: An object that one can repeatedly call __next__ on in a loop in order to extract items. Furthermore, calling __iter__ on it should return itself.
More details here.
Why doesn't iterating work the second time for iterators?
It does "work", in the sense that the for loop in the examples does run. It simply performs zero iterations. This happens because the iterator is "exhausted"; it has already iterated over all of the elements.
Why does it work for other kinds of iterables?
Because, behind the scenes, a new iterator is created for each loop, based on that iterable. Creating the iterator from scratch means that it starts at the beginning.
This happens because iterating requires an iterable. If an iterable was already provided, it will be used as-is; but otherwise, a conversion is necessary, which creates a new object.
Given an iterator, how can we iterate twice over the data?
By caching the data; starting over with a new iterator (assuming we can re-create the initial condition); or, if the iterator was specifically designed for it, seeking or resetting the iterator. Relatively few iterators offer seeking or resetting.
Caching
The only fully general approach is to remember what elements were seen (or determine what elements will be seen) the first time and iterate over them again. The simplest way is by creating a list or tuple from the iterator:
elements = list(iterator)
for element in elements:
...
for element in elements:
...
Since the list is a non-iterator iterable, each loop will create a new iterable that iterates over all the elements. If the iterator is already "part way through" an iteration when we do this, the list will only contain the "following" elements:
abstract = (x for x in range(10)) # represents integers from 0 to 9 inclusive
next(abstract) # skips the 0
concrete = list(abstract) # makes a list with the rest
for element in concrete:
    print(element) # starts at 1, because the list does
for element in concrete:
    print(element) # also starts at 1, because a new iterator is created
A more sophisticated way is using itertools.tee. This essentially creates a "buffer" of elements from the original source as they're iterated over, and then creates and returns several custom iterators that work by remembering an index, fetching from the buffer if possible, and appending to the buffer (using the original iterable) when necessary. (In the reference implementation of modern Python versions, this is implemented in C rather than in pure Python code.)
from itertools import tee
concrete = list(range(10)) # `tee` works on any iterable, iterator or not
x, y = tee(concrete, 2) # the second argument is the number of instances.
for element in x:
    print(element)
    if element == 3:
        break
for element in y:
    print(element) # starts over at 0, taking 0, 1, 2, 3 from a buffer
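For illustration, the buffering idea can be sketched in pure Python. Here tee_sketch is a hypothetical, simplified stand-in; the real itertools.tee is more careful and more efficient:

```python
from collections import deque

def tee_sketch(iterable, n=2):
    """Simplified illustration of the buffering idea behind itertools.tee."""
    source = iter(iterable)
    buffers = [deque() for _ in range(n)]  # one pending-element buffer per instance

    def instance(buffer):
        while True:
            if not buffer:                 # nothing buffered for us:
                try:
                    value = next(source)   # pull one element from the source
                except StopIteration:
                    return
                for b in buffers:          # and share it with every instance
                    b.append(value)
            yield buffer.popleft()

    return tuple(instance(b) for b in buffers)

x, y = tee_sketch(range(5))
print(list(x))  # [0, 1, 2, 3, 4]
print(list(y))  # [0, 1, 2, 3, 4] -- replayed from y's buffer
```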
Starting over
If we know and can recreate the starting conditions for the iterator when the iteration started, that also solves the problem. This is implicitly what happens when iterating multiple times over a list: the "starting conditions for the iterator" are just the contents of the list, and all the iterators created from it give the same results. For another example, if a generator function does not depend on an external state, we can simply call it again with the same parameters:
def powers_of(base, *range_args):
    for i in range(*range_args):
        yield base ** i

exhaustible = powers_of(2, 1, 12)
for value in exhaustible:
    print(value)
print('exhausted')
for value in exhaustible: # no results from here
    print(value)
# Want the same values again? Then call the generator function again:
print('replenished')
for value in powers_of(2, 1, 12):
    print(value)
Seekable or resettable iterators
Some specific iterators may make it possible to "reset" iteration to the beginning, or even to "seek" to a specific point in the iteration. In general, iterators need to have some kind of internal state in order to keep track of "where" they are in the iteration. Making an iterator "seekable" or "resettable" simply means allowing external access to, respectively, modify or re-initialize that state.
Nothing in Python disallows this, but in many cases it's not feasible to provide a simple interface; in most other cases, it just isn't supported even though it might be trivial. For generator functions, on the other hand, the internal state is quite complex, and it protects itself against modification.
The classic example of a seekable iterator is an open file object created using the built-in open function. The state in question is a position within the underlying file on disk; the .tell and .seek methods allow us to inspect and modify that position value - e.g. .seek(0) will set the position to the beginning of the file, effectively resetting the iterator. Similarly, csv.reader is a wrapper around a file; seeking within that file will therefore affect the subsequent results of iteration.
In all but the simplest, deliberately-designed cases, rewinding an iterator will be difficult or impossible. Even if the iterator is designed to be seekable, this leaves the question of figuring out where to seek to - i.e., what the internal state was at the desired point in the iteration. In the case of the powers_of generator shown above, that's straightforward: just modify i. For a file, we'd need to know what the file position was at the beginning of the desired line, not just the line number. That's why the file interface provides .tell as well as .seek.
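For example, with an ordinary text file (sketched here with a temporary file so the example is self-contained; note that .readline, unlike next, keeps .tell usable on text-mode files):

```python
import tempfile

with tempfile.TemporaryFile('w+') as f:
    f.write('first\nsecond\nthird\n')
    f.seek(0)                 # rewind to the beginning before iterating

    print(f.readline())       # 'first\n'
    marker = f.tell()         # remember the position at the start of line 2
    print(f.read())           # consumes the rest of the file

    f.seek(marker)            # seek back: iteration resumes at 'second'
    print(f.readline())       # 'second\n'
    f.seek(0)                 # .seek(0) effectively resets the iterator
    print(f.readline())       # 'first\n'
```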
Here's a re-worked example of powers_of representing an unbound sequence, and designed to be seekable, rewindable and resettable via an exponent property:
class PowersOf:
    def __init__(self, base):
        self._exponent = 0
        self._base = base

    def __iter__(self):
        return self

    def __next__(self):
        result = self._base ** self._exponent
        self._exponent += 1
        return result

    @property
    def exponent(self):
        return self._exponent

    @exponent.setter
    def exponent(self, new_value):
        if not isinstance(new_value, int):
            raise TypeError("must set with an integer")
        if new_value < 0:
            raise ValueError("can't set to negative value")
        self._exponent = new_value
Examples:
pot = PowersOf(2)
for i in pot:
    if i > 1000:
        break
    print(i)
pot.exponent = 5 # jump to this point in the (unbounded) sequence
print(next(pot)) # 32
print(next(pot)) # 64
Technical detail
Iterators vs. iterables
Recall that, briefly:
"iteration" means looking at each element in turn, of some abstract, conceptual sequence of values. This can include:
using a for loop
using a comprehension or generator expression
unpacking an iterable, including calling a function with * or ** syntax
constructing a list, tuple, etc. from another iterable
"iterable" means an object that represents such a sequence. (What the Python documentation calls a "sequence" is in fact more specific than that - basically it also needs to be finite and ordered.) Note that the elements do not need to be "stored" - in memory, disk or anywhere else; it is sufficient that we can determine them during the process of iteration.
"iterator" means an object that represents a process of iteration; in some sense, it keeps track of "where we are" in the iteration.
Combining the definitions, an iterable is something that represents elements that can be examined in a specified order; an iterator is something that allows us to examine elements in a specified order. Certainly an iterator "represents" those elements - since we can find out what they are, by examining them - and certainly they can be examined in a specified order - since that's what the iterator enables. So, we can conclude that an iterator is a kind of iterable - and Python's definitions agree.
How iteration works
When we iterate in Python, an iterator is needed behind the scenes; but in normal cases (i.e. except in poorly written user-defined code), any iterable is permissible. Python will convert other iterables to corresponding iterators; the logic for this is available via the built-in iter function. To iterate, Python repeatedly asks the iterator for a "next element" until the iterator raises StopIteration. The logic for this is available via the built-in next function.
Generally, when iter is given a single argument that already is an iterator, that same object is returned unchanged. But if it's some other kind of iterable, a new iterator object will be created. This directly leads to the problem in the OP. User-defined types can break both of these rules, but they probably shouldn't.
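Both rules can be checked directly: iter on an iterator returns the same object, while iter on a list returns a fresh iterator each time.

```python
items = [1, 2, 3]
it = iter(items)

print(iter(it) is it)               # True: an iterator's __iter__ returns itself
print(iter(items) is iter(items))   # False: each call builds a new iterator
```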
The iterator protocol
Python roughly defines an "iterator protocol" that specifies how it decides whether a type is an iterable (or specifically an iterator), and how types can provide the iteration functionality. The details have changed slightly over the years, but the modern setup works like so:
Anything that has an __iter__ or a __getitem__ method is an iterable. Anything that defines an __iter__ method and a __next__ method is specifically an iterator. (Note in particular that if there is a __getitem__ and a __next__ but no __iter__, the __next__ has no particular meaning, and the object is a non-iterator iterable.)
Given a single argument, iter will attempt to call the __iter__ method of that argument, verify that the result has a __next__ method, and return that result. It does not ensure the presence of an __iter__ method on the result. (Such objects can often be used in places where an iterator is expected, but will fail if e.g. iter is called on them.) If there is no __iter__, it will look for __getitem__, and use that to create an instance of a built-in iterator type. That iterator is roughly equivalent to
class Iterator:
    def __init__(self, bound_getitem):
        self._index = 0
        self._bound_getitem = bound_getitem

    def __iter__(self):
        return self

    def __next__(self):
        try:
            result = self._bound_getitem(self._index)
        except IndexError:
            raise StopIteration
        self._index += 1
        return result
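To see the __getitem__ fallback in action, here is a hypothetical class that defines no __iter__ at all, yet is still iterable:

```python
class Squares:
    """No __iter__; iteration falls back to __getitem__ with 0, 1, 2, ..."""
    def __init__(self, n):
        self._n = n

    def __getitem__(self, index):
        if index >= self._n:
            raise IndexError(index)  # IndexError signals the end of iteration
        return index * index

print(list(Squares(5)))  # [0, 1, 4, 9, 16]
print(list(Squares(5)))  # works again: iter() builds a fresh iterator each time
```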
Given a single argument, next will attempt to call the __next__ method of that argument, allowing any StopIteration to propagate.
With all of this machinery in place, it is possible to implement a for loop in terms of while. Specifically, a loop like
for element in iterable:
    ...
will approximately translate to:
iterator = iter(iterable)
while True:
    try:
        element = next(iterator)
    except StopIteration:
        break
    ...
except that the iterator is not actually assigned any name (the syntax here is to emphasize that iter is only called once, and is called even if there are no iterations of the ... code).

Why does len() not support iterators?

Many of Python's built-in functions (any(), all(), sum() to name some) take iterables but why does len() not?
One could always use sum(1 for i in iterable) as an equivalent, but why is it len() does not take iterables in the first place?
Many iterables are produced by generators, which don't have a well-defined length. Take the following, which iterates forever:
def sequence(i=0):
    while True:
        i += 1
        yield i
Basically, to have a well defined length, you need to know the entire object up front. Contrast that to a function like sum. You don't need to know the entire object at once to sum it -- Just take one element at a time and add it to what you've already summed.
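The contrast is easy to demonstrate: sum happily consumes a generator one element at a time, while len on the same generator raises a TypeError:

```python
gen = (i * i for i in range(5))
print(sum(gen))       # 30 -- consumed one element at a time

gen = (i * i for i in range(5))
try:
    len(gen)
except TypeError:
    print('generators have no len()')
```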
Be careful with idioms like sum(1 for i in iterable): often they will simply exhaust iterable so you can't use it anymore. Or, it could be slow to get the i'th element if there is a lot of computation involved. It might be worth asking yourself why you need to know the length a priori. That might give you some insight into what type of data structure to use (frequently list and tuple work just fine), or you may be able to perform your operation without needing to call len.
This is an iterable:
def forever():
    while True:
        yield 1
Yet, it has no length. If you want to find the length of a finite iterable, the only way to do so, by definition of what an iterable is (something you can repeatedly call to get the next element until you reach the end), is to expand the iterable out fully, e.g.:
len(list(the_iterable))
As mgilson pointed out, you might want to ask yourself - why do you want to know the length of a particular iterable? Feel free to comment and I'll add a specific example.
If you want to keep track of how many elements you have processed, instead of doing:
num_elements = len(the_iterable)
for element in the_iterable:
    ...
do:
num_elements = 0
for element in the_iterable:
    num_elements += 1
    ...
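A slightly tidier variation on the same counting idea (not from the original answer) uses enumerate to maintain the running count:

```python
the_iterable = (c for c in 'abcde')  # any iterable works here

num_elements = 0  # stays 0 if the iterable turns out to be empty
for num_elements, element in enumerate(the_iterable, start=1):
    pass  # process element; num_elements is the count so far

print(num_elements)  # 5
```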
If you want a memory-efficient way of seeing how many elements end up being in a comprehension, you might be tempted to write:
num_relevant = len(x for x in xrange(100000) if x%14==0)
but that raises a TypeError, because generator expressions don't support len. And it wouldn't be efficient to build the whole list just to count it:
num_relevant = len([x for x in xrange(100000) if x%14==0])
sum would probably be the most handy way, but it looks quite weird and it isn't immediately clear what you're doing:
num_relevant = sum(1 for _ in (x for x in xrange(100000) if x%14==0))
So, you should probably write your own function:
def exhaustive_len(iterable):
    length = 0
    for _ in iterable:
        length += 1
    return length

exhaustive_len(x for x in xrange(100000) if x%14==0)
The long name is to help remind you that it does consume the iterable, for example, this won't work as you might think:
def yield_numbers():
    yield 1; yield 2; yield 3; yield 5; yield 7

the_nums = yield_numbers()
total_nums = exhaustive_len(the_nums)
for num in the_nums:
    print num
because exhaustive_len has already consumed all the elements.
EDIT: Ah in that case you would use exhaustive_len(open("file.txt")), as you have to process all lines in the file one-by-one to see how many there are, and it would be wasteful to store the entire file in memory by calling list.
