I'm creating a Python generator to loop over a sentence. "this is a test" should return
this
is
a
test
Question 1: What's the problem with the implementation below? It only returns "this" and doesn't loop. How do I fix it?
def my_sentense_with_generator(sentence):
    index = 0
    words = sentence.split()
    current = index
    yield words[current]
    index += 1
for i in my_sentense_with_generator('this is a test'):
    print(i)
>> this
Question 2: Another way of implementing it is below. It works, but I'm confused about the purpose of using 'for' here. I was taught that a generator is used in lieu of a "for loop" so that Python doesn't have to build up the whole list upfront, which takes much less memory and time (link). But this solution uses a for loop to construct the generator. Does that defeat the purpose of a generator?
def my_sentense_with_generator(sentence):
    for w in sentence.split():
        yield w
The purpose of a generator is not to avoid defining a loop; it is to generate the elements only when they are needed (and not when the generator is constructed).
In your 1st example, you need a loop in the generator as well. Otherwise the generator is only able to generate a single element, then it is exhausted.
NB. In the generator below, str.split creates a list, so there is no memory benefit in using a generator; the whole thing could be replaced by the iterator iter(sentence.split())
def my_sentence_with_generator(sentence):
    words = sentence.split()
    for word in words:
        yield word

for i in my_sentence_with_generator('this is a test'):
    print(i)
output:
this
is
a
test
The loop in the generator defines the elements of the generator; it will pause at a yield until something requests the next element. So you also need one loop outside the generator to request the elements.
Example of a partial collection of the elements:
g = my_sentence_with_generator('this is a test')
next(g), next(g)
output: ('this', 'is')
Example of the utility of a generator:

def count():
    '''this generator can yield a quadrillion numbers'''
    for i in range(1_000_000_000_000_000):
        yield i

# we instantiate the generator
c = count()
# we collect only 3 elements, this consumes very little memory
next(c), next(c), next(c)
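To make the memory point concrete, here is a small sketch (my own addition) comparing the size of the generator object with a materialized list; exact byte counts vary by Python version:

import sys

c = count()
print(sys.getsizeof(c))                     # e.g. ~200 bytes, no matter how many numbers it can yield
print(sys.getsizeof(list(range(100_000))))  # a real list must store every element, ~800 KB here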
str.split returns a list, so you're not going to avoid creating a list if you call it within your generator. To avoid that list, you'd either need to keep the original string in memory until you're done with the results, or create roughly twice as many strings as you'd otherwise need; without one of those, it's impossible to figure out what to yield next.
As an example, this is what a version of str.split might look like as a generator:
def split_string(sentence, sep=None):
    # Creates a list with a maximum of two elements:
    split_list = sentence.split(sep, maxsplit=1)
    while split_list:
        yield split_list[0]
        if len(split_list) == 1:
            # No more separators to be found in the rest of the string, so we return
            return
        split_list = split_list[1].split(sep, maxsplit=1)
This creates a lot of short-lived lists, but they will never have more than 2 elements each. It's not very practical, and likely to be much less performant than just calling str.split once, but hopefully it gives you a better understanding of how generators work.
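For example, a quick check of the generator above:

g = split_string('this is a test')
print(next(g))   # this
print(list(g))   # ['is', 'a', 'test'] -- the remaining words, produced on demand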
I am iterating over two different generators using two different for loops, but I can see that the iteration through one generator expression affects the order of iteration of the other generator expression.
Though I understand (and hope) that this should be impossible, I am not sure why I am experiencing this weird behaviour.
we = KeyedVectors.load_word2vec_format('../input/nlpword2vecembeddingspretrained/GoogleNews-vectors-negative300.bin', binary=True)
data1=(['this','is','an','example','text1'],['this','is','an','example','text2'],....)
data2=(['test data1','test data2'],['test data3','test data4'],....)
txt_emb=(sum([we[token] for token in doc if token in we.key_to_index])/len(doc) for doc in data1)
phr_emb=([sum([we[token] for token in phrase.split(' ') if token in we.key_to_index])/len(phrase.split(' ')) for phrase in phrases]for phrases in data2)
for i in txt_emb:
    print(i)
    break

for j in phr_emb:
    print(j)
    break
txt_emb :
([-0.06002714 0.00999211 0.0358354 ....],..........[0.07940271 -0.02072765 -0.03981323...])
phr_emb:
([array([-0.13269043,0.03266907,...]),array([0.04994202,0.15716553,...])],
[array([-0.06970215,0.01029968,...]),array([0.02503967,0.13970947,...])],.......)
Here txt_emb is a generator expression in which each item is a list.
phr_emb is a generator expression in which each item is a list containing a varying number of arrays (say 2-6).
When I iterate txt_emb first, as in the code above, I get the first element (the list at index 0) of txt_emb, as expected. Similarly, when I iterate through phr_emb, I expect to get its first element (the list at index 0), but I get the second element (the list at index 1).
Similarly, if I continue to iterate txt_emb, I get the third element (the list at index 2) rather than the element at index 1, even though I have iterated txt_emb only once before this.
I face similar issues when I zip the two generator expressions txt_emb and phr_emb and try to iterate through the result.
I am running all this in a Kaggle notebook. But if I iterate both generator expressions separately, in different cells of the notebook, then I get the elements in order, as expected.
From your comments, this appears to be an issue with your two generator expressions pulling data from some other, shared iterator (probably another generator expression). When the first generator expression advances, it takes data from this third iterator, which makes it unavailable to the second generator expression.
You can recreate this issue with simpler code like this:
data = range(10) # our underlying data
data_iterator = iter(data) # our shared iterator, which could be a generator expression
doubles = (x * 2 for x in data_iterator) # first generator expression
squares = (x * x for x in data_iterator) # second generator expression
print(next(doubles), next(doubles), next(doubles)) # prints 0 2 4
print(next(squares), next(squares), next(squares)) # prints 9 16 25, not 0 1 4 as you might expect
If you take a few values from one of the generator expressions, the corresponding value will be skipped in the other one. That's because they're each advancing the shared data_iterator in the background, which only goes over each value from the list once.
The solution is either to create separate iterators for each of the generator expressions (e.g. multiple versions of data_iterator), or, if that's awkward or time-consuming to recompute, to dump it into a sequence like a list, so it can be iterated over repeatedly.
For instance, we could dump data_iterator into data_list, like this, and then build the generator expressions off the list:
data = range(10)
data_iterator = iter(data)
data_list = list(data_iterator) # this can be iterated upon repeatedly
doubles = (x * 2 for x in data_list)
squares = (x * x for x in data_list)
print(next(doubles), next(doubles), next(doubles)) # prints 0 2 4
print(next(squares), next(squares), next(squares)) # prints 0 1 4 as expected
Now, storing the data in a list like that may take more memory than you want. One of the nice things about generators and generator expressions is the lazy computation that they allow. If you want to maintain that lazy approach and only need a few values to be available to one generator before the other, because they're being consumed mostly in parallel (e.g. by zip), then itertools.tee may be just what you need.
import itertools
data = range(10)
data_iterator = iter(data)
data_it1, data_it2 = itertools.tee(data_iterator) # two iterators that will yield the same results
doubles = (x * 2 for x in data_it1)
squares = (x * x for x in data_it2)
for d, s in zip(doubles, squares):  # consume the values in parallel
    print(d, s)
The iterators that tee returns can still be used if you're planning on fully consuming one generator before starting on the other, but they're a lot less efficient in that situation (just dumping the whole intermediate iterator into a list is probably better).
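For example, a minimal sketch (my own) showing that buffering: fully consuming one of the tee'd iterators forces tee to hold every value the other one hasn't seen yet.

import itertools

a, b = itertools.tee(iter(range(5)))
print(list(a))  # [0, 1, 2, 3, 4] -- tee now buffers all 5 values internally for b
print(list(b))  # [0, 1, 2, 3, 4]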
Is there any way to use itertools' product function so that it returns each combination of the lists step by step?
For example:
itertools.product(*mylist)
-> the solution should return the first combination of the lists, after that the second one, etc.
As #ggorlen has explained, itertools.product(...) returns an iterator. For example, if you have
import itertools
mylist = [('Hello','Hi'),('Andy','Betty')]
iterator = itertools.product(*mylist)
next(iterator) or iterator.__next__() will evaluate to ('Hello', 'Andy') the first time you call it. When you call next(iterator) again, it will return ('Hello', 'Betty'), then ('Hi', 'Andy'), and finally ('Hi', 'Betty'), before raising StopIteration.
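Continuing with the iterator defined above, a quick demonstration:

print(next(iterator))  # ('Hello', 'Andy')
print(next(iterator))  # ('Hello', 'Betty')
print(next(iterator))  # ('Hi', 'Andy')
print(next(iterator))  # ('Hi', 'Betty')
# a further next(iterator) raises StopIteration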
You can also convert an iterator into a list with list(iterator) if you are more comfortable with a list, but if you just need the first few values and mylist is big, this would be really inefficient, and it might be worth taking the time to familiarise yourself with iterators.
Do consider whether you are really just iterating through iterator, though. If so, just use

for combination in iterator:
    pass  # Body of loop
Even if you just need the first n elements and n is large, you can use

for combination in itertools.islice(iterator, n):
    pass  # Body of loop
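For instance, a small sketch that lazily takes only the first two combinations (the inputs here are made up for illustration):

from itertools import islice, product

pairs = product('ab', '12')
for combination in islice(pairs, 2):
    print(combination)  # ('a', '1'), then ('a', '2')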
The question is how to use a list comprehension / lambda + map (in order to push the task of performing the actual loop down to the underlying C) when the loop references and updates things outside of itself?
My example is this:
words = []
wordCount = {}
for i in tqdm_notebook(range(0, len(sentences_wiki))):
    sentences_wiki[i]
    for j in range(0, len(sentences_wiki[i])):
        word = sentences_wiki[i][j]
        if word not in words:
            words.append(word)
            wordCount[word] = 1
        else:
            wordCount[word] = wordCount[word] + 1
Note: sentences_wiki is an array of sentences, each of which is an array of words.
As an attempt, I ended up with the following non-functional example:
def blah(listy_words, words, wordCount):
    if word not in listy_words:
        words.append(word)
        wordCount[word] = 1
    else:
        wordCount[word] = wordCount[word] + 1
    return(words)

words = []
wordCount = {}
a = map(lambda p: list(map(blah(p, words, wordCount), sentences_wiki[i])), sentences_wiki)
p = list(a)
You shouldn't use map or a list comprehension just for the side effects of the function you're applying. You should only use it when the resulting list contains meaningful data. In your case, you'd be creating a big nested list of lists containing a whole bunch of repeated references to the same words list you created at the global level. That's not useful at all.
Furthermore, your entire reason for making the change seems to be based on the premise that using map or a comprehension is sure to be faster. That's probably not true. They may be about the same speed, or may be slower. I think it's very unlikely that anything you can do will make the comprehension/map version faster than the explicit loop. The main reason is that function calls in Python are pretty slow, and so the need to pack some of the loop logic into a function makes that part slower than it was with the explicit loops.
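If you want to check that claim on your own data, here is a minimal timing sketch (my own illustration; the exact numbers depend on your machine and Python version, but the map-with-lambda version is typically no faster than the plain loop):

import timeit

data = list(range(10_000))

def with_loop():
    out = []
    for x in data:
        out.append(x + 1)
    return out

def with_map():
    return list(map(lambda x: x + 1, data))

print(timeit.timeit(with_loop, number=100))
print(timeit.timeit(with_map, number=100))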
What often can be faster is using builtin functions or types to do the iteration for you in C without ever reaching back out to Python code. In your example, you want to be counting the words in your list of lists, so using collections.Counter is probably a good idea. I'd probably do something like this, eliminating the inner loop while keeping the outer one:
from collections import Counter

word_counts = Counter()
for sentence in sentences_wiki:
    word_counts.update(sentence)
words = list(word_counts)  # get a list of the keys, if you really need it separate from the counts
As Patrick Haugh commented, it's even possible to eliminate both loops using itertools if you want to create the counter in one line:
import itertools
from collections import Counter
word_counts = Counter(itertools.chain.from_iterable(sentences_wiki))
words = list(word_counts)
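A quick check of the one-liner with toy data (sentences_wiki here is a made-up stand-in):

sentences_wiki = [['the', 'cat'], ['the', 'dog', 'and', 'the', 'cat']]
word_counts = Counter(itertools.chain.from_iterable(sentences_wiki))
print(word_counts.most_common(2))  # [('the', 3), ('cat', 2)]
print(list(word_counts))           # ['the', 'cat', 'dog', 'and']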
I am supposed to write a generator that, given a list of iterable arguments, produces the 1st element from the 1st argument, the 1st element from the 2nd argument, the 1st element from the 3rd argument, the 2nd element from the 1st argument, and so on.
So
''.join([v for v in alternate('abcde','fg','hijk')]) == 'afhbgicjdke'
My function works for string arguments like this, but I run into a problem when I try to use a given test case that goes like this:
def hide(iterable):
    for v in iterable:
        yield v

''.join([v for v in alternate(hide('abcde'), hide('fg'), hide('hijk'))]) == 'afhbgicjdke'
Here is my generator:
def alternate(*args):
    for i in range(10):
        for arg in args:
            arg_num = 0
            for thing in arg:
                if arg_num == i:
                    yield thing
                arg_num += 1
Can I change something in this to get it to work as described or is there something fundamentally wrong with my function?
EDIT: as part of the assignment, I am not allowed to use itertools
Something like this works OK:
def alternate(*iterables):
    iterators = [iter(iterable) for iterable in iterables]
    sentinel = object()
    keep_going = True
    while keep_going:
        keep_going = False
        for iterator in iterators:
            maybe_yield = next(iterator, sentinel)
            if maybe_yield is not sentinel:
                keep_going = True
                yield maybe_yield

print(''.join(alternate('abcde', 'fg', 'hijk')))
The trick is realizing that when an iterator is exhausted, next will return the sentinel value. As long as at least one of the iterators returns something other than the sentinel, we need to keep going; once every iterator is exhausted, we stop. If the sentinel was not returned from next, then the value is good and we need to yield it.
Note that if the number of iterables is large, this implementation is sub-optimal (It'd be better to store the iterables in a data-structure that supports O(1) removal and to remove an iterable as soon as it is detected to be exhausted -- a collections.OrderedDict could probably be used for this purpose, but I'll leave that as an exercise for the interested reader).
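For the curious, here is one sketch of that exercise (my own; it uses a plain dict, which is insertion-ordered since Python 3.7, instead of OrderedDict, and drops exhausted iterators so they are never polled again):

def alternate_pruning(*iterables):
    iterators = dict(enumerate(iter(it) for it in iterables))
    sentinel = object()
    while iterators:
        exhausted = []
        for key, iterator in iterators.items():
            value = next(iterator, sentinel)
            if value is sentinel:
                exhausted.append(key)  # can't delete while iterating the dict
            else:
                yield value
        for key in exhausted:
            del iterators[key]

print(''.join(alternate_pruning('abcde', 'fg', 'hijk')))  # afhbgicjdke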
If we want to open things up to the standard library, itertools can help here too:
from itertools import zip_longest, chain

def alternate2(*iterables):
    sentinel = object()
    result = chain.from_iterable(zip_longest(*iterables, fillvalue=sentinel))
    return (item for item in result if item is not sentinel)
Here, I return a generator expression, which is slightly different from writing a generator function, but really not much :-). Again, this can be slightly inefficient if there are a lot of iterables and one of them is much longer than the others (consider the case where you have 100 iterables of length 1 and 1 iterable of length 101 -- this will run in effectively 101 * 101 steps, whereas you should really be able to accomplish the iteration in about 101 * 2 + 1 steps).
There are several things that can be improved in your code. What is causing you problems is the most wrong of them all: you are actually iterating several times over each of your arguments, and essentially doing nothing with the intermediate values in each pass.
That takes place when you iterate for thing in arg for each value of i.
While that is a tremendous waste of resources on any account, it also does not work with iterators (which are what you get from your hide function), since they are exhausted after you iterate over their elements once. That is in contrast with sequences, which can be iterated, and re-iterated, several times over (like the strings you are using for the test).
(Another wrong thing is hardcoding 10 as the longest sequence length you'd ever have; in Python you iterate over generators and sequences no matter their size.)
Anyway, the fix for that is to make sure you iterate over each of your arguments just once. The built-in zip can do that, or, for your use case, itertools.zip_longest (izip_longest in Python 2.x) can retrieve the values you want from your args in a single for structure:
from itertools import zip_longest

def alternate(*args):
    sentinel = object()
    for values in zip_longest(*args, fillvalue=sentinel):
        for value in values:
            if value is not sentinel:
                yield value
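This version handles both plain strings and the hide iterators from the question:

print(''.join(alternate('abcde', 'fg', 'hijk')))                    # afhbgicjdke
print(''.join(alternate(hide('abcde'), hide('fg'), hide('hijk'))))  # afhbgicjdke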
If you want to pass only iterators (this won't work with plain strings), use the following code:
def alternate(*args):
    for i in range(10):
        for arg in args:
            arg_num = i
            for thing in arg:
                if arg_num == i:
                    yield thing
                    break
                else:
                    arg_num += 1
This is just your original code with a small change.
When you use plain strings, every pass over an argument starts a fresh iteration over the string, so you can count from 0 each time (arg_num = 0).
But when you create iterators by calling hide(), only one single iterator instance is created for each string, and it resumes where it left off, so you have to keep track of your position in the iterators. That is why you change arg_num = 0 to arg_num = i, and also need to add the break statement.
Many of Python's built-in functions (any(), all(), sum(), to name a few) take iterables, but why does len() not?
One could always use sum(1 for i in iterable) as an equivalent, but why doesn't len() take iterables in the first place?
Many iterables are defined by generators, which don't have a well-defined len. Take the following, which iterates forever:
def sequence(i=0):
    while True:
        i += 1
        yield i
Basically, to have a well-defined length, you need to know the entire object up front. Contrast that with a function like sum: you don't need to know the entire object at once to sum it -- just take one element at a time and add it to what you've already summed.
Be careful with idioms like sum(1 for i in iterable): often it will just exhaust iterable so you can't use it anymore, or it could be slow to get the i-th element if there is a lot of computation involved. It might be worth asking yourself why you need to know the length a priori. This might give you some insight into what type of data structure to use (frequently list and tuple work just fine), or you may be able to perform your operation without needing to call len.
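A quick demonstration of that exhaustion pitfall:

it = iter([10, 20, 30])
print(sum(1 for _ in it))  # 3
print(list(it))            # [] -- the iterator has been consumed by the count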
This is an iterable:
def forever():
    while True:
        yield 1
Yet, it has no length. If you want to find the length of a finite iterable, the only way to do so, by definition of what an iterable is (something you can repeatedly call to get the next element until you reach the end) is to expand the iterable out fully, e.g.:
len(list(the_iterable))
As mgilson pointed out, you might want to ask yourself - why do you want to know the length of a particular iterable? Feel free to comment and I'll add a specific example.
If you want to keep track of how many elements you have processed, instead of doing:
num_elements = len(the_iterable)
for element in the_iterable:
    ...
do:
num_elements = 0
for element in the_iterable:
    num_elements += 1
    ...
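An equivalent idiom (a sketch of my own) uses enumerate with start=1; the initial assignment keeps num_elements defined even when the iterable is empty:

num_elements = 0
for num_elements, element in enumerate(the_iterable, start=1):
    ...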
If you want a memory-efficient way of seeing how many elements end up in a comprehension, you would like to write, for example:
num_relevant = len(x for x in range(100000) if x % 14 == 0)
It wouldn't be efficient to do this (you don't need the whole list):
num_relevant = len([x for x in range(100000) if x % 14 == 0])
sum would probably be the most handy way, but it looks quite weird and it isn't immediately clear what you're doing:
num_relevant = sum(1 for _ in (x for x in range(100000) if x % 14 == 0))
So, you should probably write your own function:
def exhaustive_len(iterable):
    length = 0
    for _ in iterable:
        length += 1
    return length

exhaustive_len(x for x in range(100000) if x % 14 == 0)
The long name is to help remind you that it does consume the iterable, for example, this won't work as you might think:
def yield_numbers():
    yield 1; yield 2; yield 3; yield 5; yield 7

the_nums = yield_numbers()
total_nums = exhaustive_len(the_nums)
for num in the_nums:
    print(num)
because exhaustive_len has already consumed all the elements.
EDIT: Ah in that case you would use exhaustive_len(open("file.txt")), as you have to process all lines in the file one-by-one to see how many there are, and it would be wasteful to store the entire file in memory by calling list.