3 questions about generators and iterators in Python

Everyone says you lose the benefit of generators if you put the result into a list.
But you need a list, or a sequence, to even have a generator to begin with, right? So, for example, if I need to go through the files in a directory, don't I have to make them into a list first, like with os.listdir()? If so, then how is that more efficient? (I am always working with strings and files, so I really hate that all the examples use range and integers, but I digress)
Taking it a step further, the mere presence of the yield keyword is supposed to make a generator. So if I do:
for x in os.listdir():
    yield x
Is a list still being created? Or is os.listdir() itself now also magically a generator? Is it possible that, os.listdir() not having been called yet, that there really isn't a list here yet?
Finally, we are told that iterators need iter() and next() methods. But doesn’t that also mean they need an index? If not, what is next() operating on? How does it know what is next without an index? Before 3.6, dict keys had no order, so how did that iteration work?

No.
See, there's no list here:
def one():
    while True:
        yield 1
Index and next() are two independent tools to perform an iteration. Again, if you have an object such that its iterator's next() always returns 1, you don't need any indices.
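Coming back to the directory example from the question, here is a minimal sketch. The function names listed_names and scanned_names are made up for illustration, and the code assumes Python 3.6 or later, where os.scandir() can be used as a context manager. It shows two things: the body of a generator function does not start running until the first next(), and os.listdir() still builds its full list once it is finally called, whereas os.scandir() hands back an iterator.

```python
import os

def listed_names(path):
    # The body of a generator function does not run until the first next(),
    # so os.listdir() is not called when listed_names() itself is called.
    # Once it does run, os.listdir() still builds the complete list.
    for name in os.listdir(path):
        yield name

def scanned_names(path):
    # os.scandir() returns an iterator of directory entries, so this loop
    # does not need a list of names to exist first.
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name
```

Calling listed_names() on a directory that does not exist raises nothing by itself; the FileNotFoundError only appears on the first next(), which is when os.listdir() actually runs.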
In deeper detail...
See, technically, you can always associate a list and an index with any generator or iterator: simply write down all its returned values and you'll get an at most countable set of values a₀, a₁, ... But those are merely a mathematical formalism that need not have anything in common with how a real generator works. For instance, say you have a generator that always yields one. You can count how many ones you have got from it so far, and call that an index. You can write down all those ones, comma-separated, and call that a list. Do those two objects correctly describe your generator's output so far? Apparently so. Are they in the least bit important to the generator itself? Not really.
Of course, a real generator will probably have a state. You can call it an index, provided you are willing to call something an index even when it is not a non-negative integer: if the generator works deterministically, you can write down all its states, number them, and call the current state's number the index. It will always have a source of its states and returned values. So indices and lists can be regarded as abstractions that describe an object's behaviour. But they are not necessarily concrete implementation details that are actually used.
Consider an unbuffered file reader. It retrieves a single byte from the disk and immediately yields it. There is no real list in memory, only the file contents on the disk (and there may not even be that, if our file reader is connected to a network socket instead of a real disk drive, and the Oracle of Delphi is at the connection's other end). You can call the file position an index, until you read from stdin, which is only forward-traversable, so indexing it makes no real physical sense. The same goes for network connections over an unreliable protocol, BTW.
Something like this.
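The byte-at-a-time reader described above could be sketched roughly as follows. The name read_bytes is hypothetical, and io.BytesIO stands in for a real file or socket so the example is self-contained.

```python
import io

def read_bytes(stream):
    # The generator keeps no list and no index of its own; the only state
    # is whatever the stream tracks internally.
    while True:
        byte = stream.read(1)
        if not byte:  # an empty result signals the end of the stream
            return
        yield byte

# An in-memory stream used purely for illustration.
chunks = list(read_bytes(io.BytesIO(b"abc")))
```

Here the list only exists because the last line asks for one; a plain for loop over read_bytes(...) would handle each byte and then let it go.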

1) This is wrong; a list is just the easiest example for explaining a generator. If you think of the eight queens problem and return each position as soon as the program finds it, I can't see a result list anywhere. Note that the Python standard library often offers an iterator as an alternative (islice() vs. slice()), and an easy example that cannot be represented by a list is itertools.cycle().
In consequence 2 and 3 are also wrong.
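As a small illustration of the itertools.cycle() point, assuming nothing beyond the standard library: cycle() never ends, so there is no list that could hold "all" of its output, yet islice() can still take a finite piece of it.

```python
from itertools import cycle, islice

# cycle("ab") yields a, b, a, b, ... forever; islice() stops after five items.
first_five = list(islice(cycle("ab"), 5))
```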

Related

Can I input a List of strings to a function that takes a string as input (python) [duplicate]

cleanest way to call one function on a list of items

In python 2, I used map to apply a function to several items, for instance, to remove all items matching a pattern:
map(os.remove,glob.glob("*.pyc"))
Of course I ignore the return code of os.remove, I just want all files to be deleted. It created a temp instance of a list for nothing, but it worked.
With Python 3, as map returns an iterator and not a list, the above code does nothing.
I found a workaround: since os.remove returns None, I use any to force iteration over the full list, without creating a list (better performance)
any(map(os.remove,glob.glob("*.pyc")))
But it seems a bit hazardous, especially when applying it to methods that return something. Is there another way to do that with a one-liner and without creating an unnecessary list?

The change from map() (and many other functions from 2.7 to 3.x) returning an iterator instead of a list is a memory-saving technique. In most cases, there is no performance penalty to writing out the loop more formally (it may even be preferred for readability).
I would provide an example, but @vaultah nailed it in the comments; it is still a one-liner:
for x in glob.glob("*.pyc"): os.remove(x)
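To make the laziness concrete, here is a minimal sketch that works in a temporary directory so nothing real is deleted. The helper names demo_lazy_map and demo_any_stops_early are invented for this example, and deque(..., maxlen=0) is just one way to drain an iterator without keeping results. The second function shows why the any() trick is hazardous: any() stops at the first truthy return value.

```python
import glob
import os
import tempfile
from collections import deque

def demo_lazy_map():
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.pyc", "b.pyc", "keep.py"):
            open(os.path.join(tmp, name), "w").close()
        pattern = os.path.join(tmp, "*.pyc")

        # Building the map object calls glob.glob() but not os.remove().
        pending = map(os.remove, glob.glob(pattern))
        before = len(glob.glob(pattern))  # both .pyc files still exist

        # Draining the iterator is what performs the deletions.
        deque(pending, maxlen=0)
        after = sorted(os.listdir(tmp))
        return before, after

def demo_any_stops_early():
    seen = []

    def record(item):
        seen.append(item)
        return True  # a truthy result makes any() stop right away

    any(map(record, ["x", "y", "z"]))
    return seen
```

With os.remove the any() version happens to work because every call returns None; with a function that can return something truthy, the remaining items are silently skipped.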

What is a "Physically Stored Sequence" in Python?

I am currently reading Learning Python, 5th Edition - by Mark Lutz and have come across the phrase "Physically Stored Sequence".
From what I've learnt so far, a sequence is an object that contains items that can be indexed in sequential order from left to right e.g. Strings, Tuples and Lists.
So with regard to a "Physically Stored Sequence", would that be a sequence that is referenced by a variable for use later on in a program? Or am I not getting it?
Thank you in advance for your answers.

A Physically Stored Sequence is best explained by contrast. It is one type of "iterable" with the main example of the other type being a "generator."
A generator is an iterable, meaning you can iterate over it as in a "for" loop, but it does not actually store anything; it merely produces values when requested. Examples of this would be a pseudo-random number generator, the iterators in the itertools module, or any function you write yourself using yield. Those sorts of things can be the subject of a "for" loop but do not actually "contain" any data.
A physically stored sequence then is an iterable which does contain its data. Examples include most data structures in Python, like lists. It doesn't matter in the Python parlance if the items in the sequence have any particular reference count or anything like that (e.g. the None object exists only once in Python, so [None, None] does not exactly "store" it twice).
A key feature of physically stored sequences is that you can usually iterate over them multiple times, and sometimes get items other than the "first" one (the one you get from next() on a fresh iterator over the iterable).
All that said, this phrase is not very common--certainly not something you'd expect to see or use as a workaday Python programmer.
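The contrast can be seen in a few lines. This is only a sketch; the names stored and produced are made up for illustration.

```python
stored = ["a", "b", "c"]  # a list physically holds its items

def produced():
    # A generator function: each item is computed when it is requested.
    for ch in "abc":
        yield ch

gen = produced()
first_pass = list(gen)
second_pass = list(gen)  # the generator is already exhausted

list_twice = (list(stored), list(stored))  # a list can be re-iterated
second_item = stored[1]  # and indexed directly

try:
    gen[1]
    generator_indexable = True
except TypeError:  # generators do not support indexing
    generator_indexable = False
```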

Why python for loops don't default to one iteration for single objects

This may seem like an odd question, but why doesn't Python "iterate" through a single object by default?
I feel it would increase the resilience of for loops for low level programming/simple scripts.
At the same time it promotes sloppiness in defining data structures properly though. It also clashes with strings being iterable by character.
E.g.
x = 2
for a in x:
    print(a)
As opposed to:
x = [2]
for a in x:
    print(a)
Are there any reasons?
FURTHER INFO: I am writing a function that takes a column/multiple columns from a database and puts it into a list of lists. It would just be visually "nice" to have a number instead of a single element list without putting type sorting into the function (probably me just being OCD again though)
Pardon the slightly ambiguous question; this is a "why is it so?" question, not a "how to?". But in my ignorant world, I would prefer integers to be iterable for the case of the above-mentioned function. So why is it not implemented? Is it because adding an __iter__ to the integer object would be an extra strain on computing?
Discussion Points
Is an __iter__ too much of a drain on machine resources?
Do programmers want an exception to be thrown as they expect integers to be non-iterable
It brings up the idea of if you can't do it already, why not just let it, since people in the status quo will keep doing what they've been doing unaffected (unless of course the previous point is what they want); and
From a set theory perspective I guess strictly a set does not contain itself and it may not be mathematically "pure".

Python cannot iterate over an object that is not 'iterable'.
The 'for' loop calls iter() on the object, which normally uses the object's __iter__() method to get an iterator, and then calls next() on that iterator repeatedly to extract the elements.
Non-iterable data types such as int don't have these methods, so there is no way to extract elements from them.
This Stack Overflow question on checking whether an object is iterable is a great resource.
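Roughly, what the 'for' loop does can be written out by hand. This is a sketch of the protocol, not the interpreter's actual implementation; manual_for and is_iterable are hypothetical helper names.

```python
def manual_for(iterable):
    """Collect items the way a for loop would fetch them."""
    iterator = iter(iterable)  # raises TypeError for non-iterables
    items = []
    while True:
        try:
            item = next(iterator)
        except StopIteration:  # the iterator signals that it is finished
            break
        items.append(item)
    return items

def is_iterable(obj):
    # Asking iter() is the most direct check: it either works or raises.
    try:
        iter(obj)
    except TypeError:
        return False
    return True
```

With these, manual_for([2]) gives back the single element, while is_iterable(2) is False because int offers nothing for iter() to use.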

The problem is with the definition of "single object". Is "foo" a single object (Hint: it is an iterable with three strings)? Is [[1, 2, 3]][0] a single object (It is only one object, with 3 elements)?

The short answer is that there is no generalizable way to do it. However, you can write functions that have knowledge of your problem domain and can do conversions for you. I don't know your specific case, but suppose you want to handle an integer or list of integers transparently. You can write your own generator:
def col_iter(item):
    # Treat a bare integer as a one-element sequence.
    if isinstance(item, int):
        yield item
    else:
        for i in item:
            yield i
x = 2
for a in col_iter(x):
    print(a)
y = [1,2,3,4]
for a in col_iter(y):
    print(a)
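A variation on the same idea, in case the columns may hold strings as well: strings are themselves iterable (character by character), so they need to be singled out. The name as_items is hypothetical, and treating str and bytes as single values is an assumption about what the caller wants.

```python
from collections.abc import Iterable

def as_items(value):
    # Strings and bytes are iterable, but here they count as single values.
    if isinstance(value, (str, bytes)) or not isinstance(value, Iterable):
        yield value
    else:
        yield from value
```

So list(as_items(2)) and list(as_items("foo")) each contain one item, while list(as_items([1, 2])) contains two.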

The only thing I can think of is that Python for loops look for something to iterate through, not just a value. If you think about it, what would the value of "a" be? If you want it to be the number 2, then you don't need the for loop in the first place. If you want it to go through 0, 1 (or 1, 2), then you want for a in range(x): (or range(1, x + 1)). I'm not positive that's the answer you're looking for, but it's what I've got.

Difference between two "contains" operations for python lists

I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list:
    if i == thingIAmLookingFor:
        return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list:
    # do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to know which, if either, is preferred.

In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
    print("Found it")
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
    print("Found it")
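For a self-contained version of both forms, here is a sketch with a hypothetical Data type standing in for the objects from the linked question; only the attribute name n comes from the text above.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects in the linked question.
Data = namedtuple("Data", "n")
my_list = [Data(1), Data(5), Data(9)]

# Both forms stop scanning as soon as a match is found.
found_with_in = 5 in (data.n for data in my_list)
found_with_any = any(data.n == 5 for data in my_list)
missing = any(data.n == 7 for data in my_list)
```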

The "if x in thing:" form is strongly preferred, not just because it takes less code, but because it also works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on data types that are stored in a more searchable form, e.g. sets or dictionary keys.

The if thing in somelist is the preferred and fastest way.
Under the hood, that use of the in operator translates to somelist.__contains__(thing), whose implementation is equivalent to: any((x is thing or x == thing) for x in somelist).
Note the condition tests identity and then equality.
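The identity-before-equality order is visible with a value that is not equal to itself. A small sketch using NaN:

```python
nan = float("nan")

equal_to_itself = nan == nan          # NaN never compares equal
same_object = nan in [nan]            # found anyway, via the identity test
other_object = float("nan") in [nan]  # a different NaN object is not found
```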

for i in list:
    if i == thingIAmLookingFor:
        return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python-level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary, thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.
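The efficiency point about sets can be made visible by counting comparisons instead of timing them. This is only a sketch: Probe and eq_calls_for are made-up names, and the exact counts depend on the interpreter, so the example only relies on the list needing more comparisons than the set.

```python
class Probe:
    """Hypothetical wrapper that counts how often == is evaluated."""

    def __init__(self, value):
        self.value = value
        self.eq_calls = 0

    def __eq__(self, other):
        self.eq_calls += 1
        return self.value == other

    def __hash__(self):
        # Hash like the wrapped value so set lookups land in the same bucket.
        return hash(self.value)

def eq_calls_for(container, value):
    probe = Probe(value)
    found = probe in container
    return found, probe.eq_calls

numbers = range(1000)
list_found, list_calls = eq_calls_for(list(numbers), 999)
set_found, set_calls = eq_calls_for(set(numbers), 999)
```

Both lookups find the value, but the list has to compare against its elements one by one until it reaches the match, while the set goes to the matching hash first.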
