Why can a zip() variable in Python be iterated over only once? [duplicate]

This question already has answers here:
The zip() function in Python 3
(2 answers)
Closed last month.
I seem to have found a weird behaviour in Python and I do not know whether it is a known issue or something wrong that I am doing. Please explain.
We know that we can zip two lists in Python to combine them into tuples, and we can then iterate over them easily. But when I try to iterate over the same zipped variable more than once, Python doesn't seem to do it and ends up giving empty lists []. The first pass works, but any later one doesn't.
Example:
lis1=[1,2,3,4,5]
lis2=['a','b','a','b','a']
zip_variable=zip(lis1,lis2)
op1=[val2 for (val1,val2) in zip_variable if val1<4]
op2=[val1 for (val1,val2) in zip_variable if val2=='a']
op3=[val1 for (val1,val2) in zip_variable if val2=='b']
print(op1,"\n",op2,"\n",op3)
Output:
['a','b','a']
[]
[]
I have a fix, which is to make a separate variable for the same zip each time, i.e. as below:
lis1=[1,2,3,4,5]
lis2=['a','b','a','b','a']
zip_variable1=zip(lis1,lis2)
zip_variable2=zip(lis1,lis2)
zip_variable3=zip(lis1,lis2)
op1=[val2 for (val1,val2) in zip_variable1 if val1<4]
op2=[val1 for (val1,val2) in zip_variable2 if val2=='a']
op3=[val1 for (val1,val2) in zip_variable3 if val2=='b']
print(op1,"\n",op2,"\n",op3)
Output:
['a','b','a']
[1,3,5]
[2,4]
This solution is always possible if we don't care about memory.
But the main question is: why does this happen?

zip() returns an iterator in Python 3. It produces only one tuple at a time from the source iterables, as needed, and when those have been iterated over, zip() has nothing more to yield. This approach reduces memory needs and can improve performance as well (especially if you don't actually ever request all the zipped tuples).
If you need the same sequence again, either call zip() again or convert its result to a list, e.g. list(zip(...)).
You could also use itertools.tee() to create "copies" of a zip() iterator. However, behind the scenes, this stores any items that haven't been requested by all iterators. If you're going to do that, you might as well just use a list to begin with.
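A minimal sketch of both options, reusing the lists from the question (the variable names here are just for illustration):
import itertools

lis1 = [1, 2, 3, 4, 5]
lis2 = ['a', 'b', 'a', 'b', 'a']

# Option 1: materialise the pairs once, then iterate the list as often as you like
pairs = list(zip(lis1, lis2))
op1 = [val2 for (val1, val2) in pairs if val1 < 4]
op2 = [val1 for (val1, val2) in pairs if val2 == 'a']

# Option 2: itertools.tee() gives independent iterators over the same zip,
# but it buffers pending items internally, so it saves little over the list
z1, z2 = itertools.tee(zip(lis1, lis2))
op1_again = [val2 for (val1, val2) in z1 if val1 < 4]
op2_again = [val1 for (val1, val2) in z2 if val2 == 'a']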

Because the zip() function returns an iterator, and this kind of object can only be iterated over once.
If you want to iterate over the same zip multiple times, I recommend creating a list or a tuple from it (list(zip(a, b)) or tuple(zip(a, b))).

Related

What is the meaning of this code segment?

I am trying to implement a function in Python which takes an iterable as input and loops through it to perform some operation. I was confused about how to handle different iterables (for example, lists and dictionaries cannot be looped over in exactly the same way), so I looked at the statistics library in Python and found that it handles this situation like this:
def variance(data, xbar=None):
    if iter(data) is data:  # <----- (1)
        data = list(data)
    ...
They then handle data as a list everywhere.
So my questions are:
What is the meaning of (1); and
Is this the right approach, given that it makes a new list out of data every time? Can't they simply use the iterator to loop through the data?
iter(something) returns an iterator object that returns the elements of something. If something is already an iterator, it simply returns it unchanged. So
if iter(data) is data:
is a way of telling whether data is an iterator object. If it is, it converts it to a list of all the elements.
It's doing this because the code after that needs a real list of the elements. There are things you can do with a list that you can't do with an iterator, such as access specific elements, insert/delete elements, and loop over it multiple times. Iterators can only be processed sequentially.
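A quick sketch of what that check looks like in practice (needs_list is a hypothetical helper, named only for this example):
data_list = [1, 2, 3]
data_iter = iter(data_list)

print(iter(data_list) is data_list)   # False: a list hands out a fresh iterator each time
print(iter(data_iter) is data_iter)   # True: an iterator's __iter__ returns itself

def needs_list(data):
    if iter(data) is data:    # data is already an iterator (e.g. a generator)
        data = list(data)     # materialise it so it can be indexed and re-used
    return data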

similar list.append statements returning different results

I have the two snippets below, which to me are basically the same, but the first one gives a list with a generator inside rather than the values, while the second one works fine.
I just want to know why this happens, what a generator is, and how it is used.
newer_list.append([sum(i)] for i in new_list)
for i in new_list:
    newer_list.append([sum(i)])
The first one passes a generator expression, ([sum(i)] for i in new_list), to append, while the second one just loops, appending each sum.
It is possible you wanted something like newer_list.extend([sum(i) for i in new_list]), where extend concatenates lists instead of just appending, and the whole thing is wrapped in brackets so it's a list comprehension instead of a generator.
A generator is a way for Python to avoid storing everything in memory. The expression ([sum(i)] for i in new_list) is a recipe for producing the items of a list; instead of storing that list, Python stores only the code it would need to run, which has a smaller memory footprint.
To turn a generator into a list, you can do list([sum(i)] for i in new_list), or in this case simply use a list comprehension: [[sum(i)] for i in new_list].
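A small sketch of the difference, using made-up sample data for new_list:
new_list = [[1, 2], [3, 4]]

wrong = []
wrong.append([sum(i)] for i in new_list)   # appends the generator object itself
print(wrong)                               # [<generator object <genexpr> at 0x...>]

right = []
for i in new_list:
    right.append([sum(i)])                 # appends each one-element list
print(right)                               # [[3], [7]]

# the loop is equivalent to a list comprehension:
also_right = [[sum(i)] for i in new_list]
print(also_right)                          # [[3], [7]]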

Can I input a List of strings to a function that takes a string as input (python) [duplicate]

In Python 2, I used map to apply a function to several items, for instance to remove all files matching a pattern:
map(os.remove,glob.glob("*.pyc"))
Of course I ignore the return value of os.remove; I just want all the files to be deleted. It created a temporary list for nothing, but it worked.
With Python 3, as map returns an iterator and not a list, the above code does nothing.
I found a workaround: since os.remove returns None, I use any to force iteration over the full iterator, without creating a list (better performance):
any(map(os.remove,glob.glob("*.pyc")))
But it seems a bit hazardous, especially when applying it to functions that return something. Is there another way to do this as a one-liner without creating an unnecessary list?
The change to map() (and many other built-ins from 2.7 to 3.x) returning an iterator instead of a list is a memory-saving measure. For most cases, there is no performance penalty in writing out the loop more formally (it may even be preferred for readability).
I would provide an example, but #vaultah nailed it in the comments: still a one-liner:
for x in glob.glob("*.pyc"): os.remove(x)
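To see why the any(map(...)) trick is fragile, here is a sketch with a made-up function that returns a truthy value (delete_and_report is hypothetical, not part of any library):
def delete_and_report(path):
    # hypothetical stand-in for a function that does work and returns something truthy
    print("processing", path)
    return path

paths = ["a.pyc", "b.pyc", "c.pyc"]

# any() short-circuits at the first truthy return value, so only "a.pyc" is processed
any(map(delete_and_report, paths))

# the explicit loop always visits every item
for p in paths:
    delete_and_report(p)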


Does it pay off to use a generator as input to sorted() instead of a list-comprehension [duplicate]

This question already has answers here:
sorted() using Generator Expressions Rather Than Lists
Closed 10 years ago.
We all know using generators instead of instantiating lists all the time saves time and memory, especially if we use comprehensions a lot.
Here's a question though, consider the following code:
output = SomeExpensiveCallEgDatabase()
results = [result[0] for result in output]
return sorted(results)
The call to sorted will return a sorted list of the results. Would it be better or worse to declare results as below and then call sorted?
results = (result[0] for result in output)
My guess is the call to sorted() would traverse the generator and instantiate a list itself in order to run quicksort or mergesort on it. So there would be no advantage in using the generator here. Is this assumption correct?
I believe your assumption to be true, since there is no easy way of ordering the collection without first having the whole list in memory (at least certainly not with the default sorting algorithm, TimSort if I'm not mistaken).
Check this out:
sorted() using Generator Expressions Rather Than Lists
To create the new list, the built-in sorted() function uses PySequence_List:
PyObject* PySequence_List(PyObject *o)
Return value: New reference.
Return a list object with the same contents as the arbitrary sequence o. The returned list is guaranteed to be new.
Pros and cons of both approaches:
Memory-wise:
With the generator version, the list built by sorted() is the only list that ever exists, so only one full list is held in memory at any given time.
This makes the generator version more efficient memory-wise.
Speed:
Here the version with the whole list wins.
To create a new list from a generator, an empty list must be created (or at best one holding the first element), and each following element appended to it, with whatever resizing steps that may trigger.
To create a new list from an existing list, the size is known beforehand, so it can be allocated at once and each entry assigned (there may be other optimizations at work here, but I can't back that up).
So regarding speed, the list wins.
The answer to "which is best" comes down to the most common answer in any field of engineering: it depends.
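For concreteness, a minimal sketch of the two variants being compared, with a small literal list standing in for whatever SomeExpensiveCallEgDatabase() returns:
output = [(3, 'x'), (1, 'y'), (2, 'z')]   # stand-in for the expensive call

# list-comprehension version: two full lists exist (results and the sorted copy)
results_list = [row[0] for row in output]
print(sorted(results_list))               # [1, 2, 3]

# generator version: the list built inside sorted() is the only full list created
results_gen = (row[0] for row in output)
print(sorted(results_gen))                # [1, 2, 3]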
No, you are still creating a brand new list with sorted(). Something like
output = SomeExpensiveCallEgDatabase()
results = [result[0] for result in output]
results.sort()
return results
would be closer to the generator version.
I believe it's better to use the generator version because some future version of Python may be able to take advantage of this to work more efficiently. It's always nice to get a speed up for free.
Yes, you are correct (although I believe the sorting routine is still called tim-sort, after uncle timmy <wink-ly y'rs>)
