I am trying to loop over a directory and load all of its files. I've tried using one generator to load the files and another one to generate batches and call the first generator when it runs out of memory.
def file_gen(b):
    # iterate over my directory and load two audio files at a time
    for n in range(len(b)):
        path_ = os.path.join(os.path.join(path,'Mixtures'), 'Dev')
        os.chdir(os.path.join(path_,b[n]))
        y, _ = librosa.load('mixture.wav', sr=rate)
        path_vox = os.path.join(os.path.join(path,'Sources'), 'Dev')
        os.chdir(os.path.join(path_vox,b[n]))
        x, _ = librosa.load('vocals.wav', sr=rate)
        yield y, x
list_titles = os.listdir(os.path.join(os.path.join(path,'Mixtures'),'Dev'))
gen_file = file_gen(list_titles)
# second generator
def memory_test():
    memory = 0
    if memory == 0:
        a, b = next(gen_file)
        a, _ = mag_phase(spectrogram(a))
        b, _ = mag_phase(spectrogram(b))
        # calculate how many batches I can generate from the file
        memory = a.shape[1]/(n_frames*(time_len-overlap) + time_len)
    for n in range(memory):
        yield memory
        memory = memory -1
test = memory_test()
The second generator is where the problem is. Ideally, I would like both generators to iterate indefinitely (the first one should go back to the beginning of the list when it reaches the end).
Thank you!
itertools.cycle()
One way you could do this is to use itertools.cycle(), which saves a copy of each item the generator produces and, once the generator is exhausted, replays the saved items over and over (see the itertools documentation).
If you choose to do that, you will consume a lot of additional memory storing those results.
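As a small illustration of that trade-off, here is a sketch using a hypothetical load_items generator (a stand-in, not the asker's file_gen) that records every item its body actually produces. cycle() runs the generator only once and afterwards replays its saved copies:

```python
import itertools

calls = []  # records each item the generator body actually produces

def load_items():
    # hypothetical stand-in for an expensive loader such as file_gen
    for name in ["a", "b", "c"]:
        calls.append(name)
        yield name

looped = itertools.cycle(load_items())
first_seven = list(itertools.islice(looped, 7))
print(first_seven)  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
print(calls)        # ['a', 'b', 'c'] -> the generator body ran only once
```

The second and third passes come from the copies held by cycle(), which is where the extra memory goes.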
except StopIteration
As an alternative method, you could wrap the next() call on your generator in try: and except StopIteration: and re-create the generator there in order to reset it back to the beginning. Generators always raise StopIteration if you call __next__ on an exhausted generator.
Edit: I originally linked to a wrapper function here but the code in that example actually doesn't work. Below is code that I have tested to work which is hopefully helpful. My answer here is based on the same concept.
def Primes(max): # primary generator
    number = 1
    while number < max:
        number += 1
        if check_prime(number):
            yield number
primes = Primes(100)
def primer(): # this acts as a loop and resets your generator
    global primes
    try:
        ok = next(primes)
        return ok
    except StopIteration:
        primes = Primes(100)
        ok = next(primes)
        return ok
while True: # this is the actual loop continuing forever
    primer()
You'll notice that the generator can't reset itself from the inside, so the wrapper function has to re-create it. We also couldn't use a standard for loop, because a for loop always catches StopIteration itself before you can, by design.
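The same reset idea can be sketched without a global variable: a wrapper generator re-creates the underlying generator each time it finishes, so nothing is cached. The names restarting and count_to are made up for this illustration, and the sketch assumes the generator function can simply be called again to start over.

```python
import itertools

def restarting(make_gen, *args, **kwargs):
    """Yield from make_gen(*args, **kwargs) forever, re-creating it when exhausted."""
    while True:
        produced = False
        for item in make_gen(*args, **kwargs):
            produced = True
            yield item
        if not produced:
            return  # avoid spinning forever on an empty generator

def count_to(limit):
    # hypothetical example generator
    for i in range(1, limit + 1):
        yield i

forever = restarting(count_to, 3)
first_eight = list(itertools.islice(forever, 8))
print(first_eight)  # [1, 2, 3, 1, 2, 3, 1, 2]
```

Because the wrapper is itself a generator, it can be consumed with next() or a for loop like any other, at the cost of redoing the underlying work on every pass.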
I have a problem where I need to read a large text file and conditionally send data from it to two different database tables using Python. I would like to use a generator pipeline to avoid loading all the data into memory.
The logic of what I am trying to do is equivalent to turning a generator that yields integers into one for odd numbers and another for even numbers, then simultaneously writing them to separate files.
I can split into two generators with itertools.tee as follows:
def generate_numbers(limit):
    # My real-life function has this form.
    for i in range(limit):
        yield i
# tee() takes the number of copies as a positional argument
numbers1, numbers2 = itertools.tee(generate_numbers(10), 2)
evens = (num for num in numbers1 if num % 2 == 0)
odds = (num for num in numbers2 if num % 2 != 0)
This produces the expected result:
>>> list(evens)
[0, 2, 4, 6, 8]
>>> list(odds)
[1, 3, 5, 7, 9]
The tee documentation warns that lots of temporary data will be stored if one iterator uses most or all of the data before the other starts. This is what happens when I write the data (from fresh iterators) to files as below.
def write_to_file(filename, numbers):
    # My real-life function takes an iterable as an argument.
    with open(filename, 'wt') as outfile:
        for i in numbers:
            outfile.write(f"{i}\n")
write_to_file('evens.txt', evens)
write_to_file('odds.txt', odds)
Is it possible to consume the generators simultaneously? The tee documentation also warns that it isn't thread safe. Can this be done with asyncio?
Alternatively, is there another approach? Would chunking the data help? My main constraints are that I don't want to hold all the data for a single table in memory and that my consuming functions expect an iterable of items, rather than an individual item.
Similar questions
This question: Separate odd and even lines in a generator with python is very similar to mine. The accepted answer suggests passing through the input file twice. I may end up doing this, but it would be nice to do a single pass.
There is another answer that opens both output files at once for writing and then processes each item one-by-one. This isn't suitable for me as my real-life consuming function expects to read all the items in an iterator.
Edit: After posting, I found this answer from 2015: https://stackoverflow.com/a/28030261/3508733, which suggests that you can't do it with itertools.tee, because of the memory issue. Is there an alternative way?
After reading more about itertools, I've found a way that works for me. I still create two generators with tee as before, but then I split the data into chunks with itertools.islice. That way I can alternate between generators without letting one get too far ahead of the other.
# Create two generators using tee
numbers1, numbers2 = itertools.tee(generate_numbers(10), 2)
evens = (num for num in numbers1 if num % 2 == 0)
odds = (num for num in numbers2 if num % 2 != 0)
def append_to_file(filename, numbers):
    # Append to file, instead of writing the whole file at once
    with open(filename, 'at') as f:
        for num in numbers:
            f.write(f"{num}\n")
# Use islice to move through both generators in chunks
chunksize = 2
while True:
    odds_chunk = list(itertools.islice(odds, chunksize))
    append_to_file('/tmp/odds.txt', odds_chunk)
    evens_chunk = list(itertools.islice(evens, chunksize))
    append_to_file('/tmp/evens.txt', evens_chunk)
    if odds_chunk == evens_chunk == []:
        break
In my real-life case, I expect that a chunk size of a few thousand will be a good balance between memory use and reducing round-trips to the database.
Based on the chunking method in my earlier answer and comments from @KellyBundy about unbalanced inputs, I have modified the code. The following meets my requirements, even if it isn't technically generators all the way through.
✅ Doesn't hold more than chunksize items in memory at once
✅ Downstream function receives a generator
✅ Single pass through the data
❌ Data are temporarily materialized (but only in chunks)
import itertools
def generate_numbers(limit):
    # My real-life function has this form.
    for i in range(limit):
        yield i
def append_to_file(filename, numbers):
    # My real-life function takes an iterable as an argument.
    with open(filename, 'at') as f:
        for num in numbers:
            f.write(f"{num}\n")
numbers = generate_numbers(10)
chunksize = 2
while True:
    evens_chunk = []
    odds_chunk = []
    for num in itertools.islice(numbers, chunksize):
        # Could use `match-case` in Python >= 3.10
        if num % 2 == 0:
            evens_chunk.append(num)
        else:
            odds_chunk.append(num)
    if evens_chunk:
        evens_gen = (num for num in evens_chunk)
        append_to_file('evens.txt', evens_gen)
    if odds_chunk:
        odds_gen = (num for num in odds_chunk)
        append_to_file('odds.txt', odds_gen)
    if odds_chunk == evens_chunk == []:
        break
If the downstream functions don't require a generator as input, the loop can be simplified further:
while True:
    evens_chunk = []
    odds_chunk = []
    for num in itertools.islice(numbers, chunksize):
        if num % 2 == 0:
            evens_chunk.append(num)
        else:
            odds_chunk.append(num)
    append_to_file('evens.txt', evens_chunk)
    append_to_file('odds.txt', odds_chunk)
    if odds_chunk == evens_chunk == []:
        break
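The chunk-and-split loop can also be packaged as a reusable generator. The following is only a sketch: split_in_chunks and its predicate parameter are names invented here, not part of itertools. Note the explicit iter() call, which makes islice continue from where the previous chunk stopped even when the input is a re-iterable object such as a range.

```python
import itertools

def split_in_chunks(items, predicate, chunksize):
    """Yield (matching, non_matching) lists built from at most chunksize items."""
    iterator = iter(items)
    while True:
        chunk = list(itertools.islice(iterator, chunksize))
        if not chunk:
            return  # input exhausted
        matching = [x for x in chunk if predicate(x)]
        non_matching = [x for x in chunk if not predicate(x)]
        yield matching, non_matching

pairs = list(split_in_chunks(range(10), lambda n: n % 2 == 0, 4))
print(pairs)  # [([0, 2], [1, 3]), ([4, 6], [5, 7]), ([8], [9])]
```

Each pair can then be handed to the two consuming functions in turn, so at most chunksize items are held in memory at any moment.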
I have made a function: generatesequence (shown below)
def generatesequence(start: float, itera: float = 1, stop: float = None):
    """
    Generate a sequence, that can have a stopping point, starting point.
    """
    __num = start
    # if sequence has a stopping point
    if stop != None:
        # if stop is negative
        if stop < 0:
            # while num is greater than stop (0 < 5, but 0 > -5)
            while __num >= stop:
                # yield __num variable (yield = return without exiting function)
                yield __num
                # add iter to __num
                __num += itera
        else:
            while __num <= stop:
                yield __num
                __num += itera
    else:
        # if sequence has no stopping point, run forever
        while True:
            yield __num
            __num += itera
I have also made a Sequence Class (also shown below)
class Sequence:
    def __init__(self, start, itera, stop):
        self.sequence = generatesequence(start, itera, stop)
        self.sequencelength = iterlen(self.sequence)
        print(self.sequencelength)
    def printself(self):
        for i in range(self.sequencelength):
            print(next(self.sequence))
However, when I run printself on a Sequence instance, it gives me a StopIteration error. How can I fix this?
You don't need to do that with a generator, you can just do the following:
def printself(self):
    for i in self.sequence:
        print(i)
This way you don't need to calculate the length of the generator beforehand
Calculating the length of a generator defeats the whole purpose of using a generator, and it also explains the StopIteration.
Unlike a list or a similar data structure, which takes O(n) memory, a generator takes O(1) space and cannot know its length without iterating through its items one by one.
By calculating the length, you have advanced the generator from start to end, so it is now exhausted.
When you call next() on the generator afterwards, it raises StopIteration.
The whole purpose of generators and the like is to save memory for iterables that you know will be iterated at most once. You cannot do two or more full iterations over a generator. To do that, call list() on the generator beforehand and save the values in a list or a similar data structure, or simply recreate the generator after it has been used up (iterated over).
In short, to fix the bug, remove the line in the __init__ method that computes the length of the generator, and loop using the "for i in generator_name:" syntax.
Alternatively, you can write a method that creates the generator and call it to recreate the generator whenever and wherever you need it.
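One way to sketch the "recreate the generator" suggestion is to store the parameters instead of the generator and build a fresh generator in __iter__. The generate_sequence function below is a simplified stand-in for the asker's generatesequence (it assumes a positive step), so treat it as an illustration rather than a drop-in replacement.

```python
def generate_sequence(start, step=1, stop=None):
    # simplified stand-in for generatesequence: assumes step > 0
    num = start
    while stop is None or num <= stop:
        yield num
        num += step

class Sequence:
    def __init__(self, start, step, stop):
        # keep the parameters, not a half-consumed generator
        self.start, self.step, self.stop = start, step, stop

    def __iter__(self):
        # a fresh generator for every loop, so the object can be iterated repeatedly
        return generate_sequence(self.start, self.step, self.stop)

    def printself(self):
        for value in self:
            print(value)

seq = Sequence(0, 2, 6)
print(list(seq))  # [0, 2, 4, 6]
print(list(seq))  # [0, 2, 4, 6] again, because each pass builds a new generator
```

With this layout no length has to be computed up front, and printself can be called any number of times.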
Recently I have been using yield in Python, and I find generator functions very useful. My question is: is there something that can move the imaginary cursor in a generator object backwards? Just as next(genfun) advances to and outputs the next item in the container, I would like to know whether there is any function, something like previous(genfun), that moves back to the previous item in the container.
Actual Working
def wordbyword():
    words = ["a","b","c","d","e"]
    for word in words:
        yield word
getword = wordbyword()
next(getword)
next(getword)
Output
a
b
What I would like to see and achieve is
def wordbyword():
    words = ["a","b","c","d","e"]
    for word in words:
        yield word
getword = wordbyword()
next(getword)
next(getword)
previous(getword)
Expected Output
a
b
a
This may sound silly, but is there some way to get a previous for a generator, and if not, why not? Why can't we decrement the iterator, or am I ignorant of an existing method? Please shed some light. What would be the closest way to implement what I have described here?
No, there is no function to go back in a generator. The reason is that Python does not natively store the previous value of a generator, and since it does not store it, the only way to get it back would be to recalculate it, which is not always possible.
For example, if your generator is a time-sensitive function, such as
from datetime import datetime
def time_sensitive_generator():
    yield datetime.now()
You will have no way to recalculate the previous value in this generator function.
Of course, this is only one of the many possible cases that a previous value cannot be calculated, but that is the idea.
If you do not store the value yourself, it will be lost forever.
As already said, there is no such function since the entire point of a generator is to have a small memory footprint. You would need to store the result.
You could automate the storing of previous results. One use-case of generators is when you have a conceptually infinite list (e.g. that of prime numbers) for which you only need an initial segment. You could write a generator that builds up these initial segments as a side effect. Have an optional history parameter that the generator appends to while it is yielding. For example:
def wordbyword(history = None):
    words = ["a","b","c","d","e"]
    for word in words:
        if isinstance(history,list): history.append(word)
        yield word
If you use the generator without an argument, getword = wordbyword(), it will work like an ordinary generator, but if you pass it a list, that list will store the growing history:
hist = []
getword = wordbyword(hist)
print(next(getword)) #a
print(next(getword)) #b
print(hist) #['a','b']
Iterating over a generator object consumes its elements, so there is nothing to go back to after using next. You could convert the generator to a list and implement your own next and previous
index = 0
def next(lst):
    global index
    index += 1
    if index > len(lst):
        raise StopIteration
    return lst[index - 1]
def previous(lst):
    global index
    index -= 1
    if index == 0:
        raise StopIteration
    return lst[index - 1]
getword = list(wordbyword())
print(next(getword)) # a
print(next(getword)) # b
print(previous(getword)) # a
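The same list-based idea can be sketched without a global index and without shadowing the built-in next. The class name ListCursor is made up for this illustration; like the code above, it materializes the whole generator up front, so it gives up the memory advantage of a generator.

```python
class ListCursor:
    """Walk back and forth over a list built from an iterable."""

    def __init__(self, iterable):
        self._items = list(iterable)  # materializes everything up front
        self._index = -1              # position of the most recently returned item

    def next(self):
        if self._index + 1 >= len(self._items):
            raise StopIteration
        self._index += 1
        return self._items[self._index]

    def previous(self):
        if self._index <= 0:
            raise StopIteration  # nothing before the first item
        self._index -= 1
        return self._items[self._index]

def wordbyword():
    for word in ["a", "b", "c", "d", "e"]:
        yield word

cursor = ListCursor(wordbyword())
steps = [cursor.next(), cursor.next(), cursor.previous()]
print(steps)  # ['a', 'b', 'a']
```

Keeping the position on the instance means several cursors can walk the same data independently.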
One option is to wrap wordbyword with a class that has a custom __next__ method. In this way, you can still use the built-in next function to consume the generator on-demand, but the class will store all the past results from the next calls and make them accessible via a previous attribute:
class save_last:
    def __init__(self, f_gen):
        self.f_gen = f_gen
        self._previous = []
    def __next__(self):
        self._previous.append(n:=next(self.i_gen))
        return n
    def __call__(self, *args, **kwargs):
        self.i_gen = self.f_gen(*args, **kwargs)
        return self
    @property
    def previous(self):
        if len(self._previous) < 2:
            raise Exception
        return self._previous[-2]
@save_last
def wordbyword():
    words = ["a","b","c","d","e"]
    for word in words:
        yield word
getword = wordbyword()
print(next(getword))
print(next(getword))
print(getword.previous)
Output:
a
b
a
TL;DR is what I'm trying to do too complicated for a yield-based generator?
I have a python application where I need to repeat an expensive test on a list of objects, one at a time, and then mangle those that pass. I expect several objects to pass, but I do not want to create a list of all those that pass, as mangle will alter the state of some of the other objects. There is no requirement to test in any particular order. Then rinse and repeat until some stop condition.
My first simple implementation was this, which runs logically correctly
while not stop_condition:
    for object in object_list:
        if test(object):
            mangle(object)
            break
    else:
        handle_no_tests_passed()
Unfortunately, for object in object_list: always restarts at the beginning of the list, where the objects probably haven't been changed, while there are objects at the end of the list ready to test. Picking them at random would be slightly better, but I would rather carry on where I left off from the previous for/in call. I still want the for/in call to terminate when it has traversed the entire list.
This sounded like a job for yield, but I tied my brain in knots failing to make it do what I wanted. I can use it in the simple cases, iterating over a range or returning filtered records from some source, but I couldn't find out how to make it save state and restart reading from its source.
I can often do things the long wordy way with classes, but fail to understand how to use the alleged simplifications like yield. Here is a solution that does exactly what I want.
class CyclicSource:
    def __init__(self, source):
        self.source = source
        self.pointer = 0
    def __iter__(self):
        # reset how many we've done, but not where we are
        self.done_this_call = 0
        return self
    def __next__(self):
        ret_val = self.source[self.pointer]
        if self.done_this_call >= len(self.source):
            raise StopIteration
        self.done_this_call += 1
        self.pointer += 1
        self.pointer %= len(self.source)
        return ret_val
source = list(range(5))
q = CyclicSource(source)
print('calling once, aborted early')
count = 0
for i in q:
    count += 1
    print(i)
    if count>=2:
        break
else:
    print('ran off first for/in')
print('calling again')
for i in q:
    print(i)
else:
    print('ran off second for/in')
which demonstrates the desired behaviour
calling once, aborted early
0
1
calling again
2
3
4
0
1
ran off second for/in
Finally, the question. Is it possible to do what I want with the simplified generator syntax using yield, or does maintaining state between successive for/in calls require the full class syntax?
Your use of the __iter__ method causes your iterator to be reset. This actually goes quite counter to regular behaviour of an iterator; the __iter__ method should just return self, nothing more. You rely on a side effect of for applying iter() to your iterator each time you create a for i in q: loop. This makes your iterator work, but the behaviour is surprising and will trip up future maintainers. I'd prefer that effect to be split out to a separate .reset() method, for example.
You can reset a generator too, using generator.send() to signal it to reset:
def cyclic_source(source):
    pointer = 0
    done_this_call = 0
    while done_this_call < len(source):
        ret_val = source[pointer]
        done_this_call += 1
        pointer = (pointer + 1) % len(source)
        reset = yield ret_val
        if reset is not None:
            done_this_call = 0
            yield # pause again for next iteration sequence
Now you can 'reset' your count back to zero:
q = cyclic_source(source)
for count, i in enumerate(q):
    print(i)
    if count == 1:
        break
else:
    print('ran off first for/in')
print('explicitly resetting the generator')
q.send(True)
for i in q:
    print(i)
else:
    print('ran off second for/in')
This is, however, rather counter to readability. I'd instead use an infinite generator created with itertools.cycle() and limit the number of iterations with itertools.islice():
from itertools import cycle, islice
q = cycle(source)
for count, i in enumerate(islice(q, len(source))):
    print(i)
    if count == 1:
        break
else:
    print('ran off first for/in')
for i in islice(q, len(source)):
    print(i)
else:
    print('ran off second for/in')
q will produce values from source in an endless loop. islice() cuts off iteration after len(source) elements. But because q is reused, it is still maintaining the iteration state.
If you must have a dedicated iterator, stick to a class object and make an iterable, so have it return a new iterator each time __iter__ is called:
from itertools import cycle, islice
class CyclicSource:
    def __init__(self, source):
        self.length = len(source)
        self.source = cycle(source)
    def __iter__(self):
        return islice(self.source, self.length)
This keeps state in the cycle() iterator still, but simply creates a new islice() object each time you create an iterator for this. It basically encapsulates the islice() approach above.
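To check that this class gives the behaviour asked for in the question, here is a short usage sketch. It repeats the class so that it runs on its own and collects values in lists (first and second, names chosen here) instead of printing inside the loops.

```python
from itertools import cycle, islice

class CyclicSource:
    def __init__(self, source):
        self.length = len(source)
        self.source = cycle(source)

    def __iter__(self):
        # a new islice each time, drawing from the shared cycle() state
        return islice(self.source, self.length)

q = CyclicSource(list(range(5)))

first = []
for i in q:
    first.append(i)
    if len(first) >= 2:
        break  # abort early, as in the question

second = list(q)  # a full pass that continues where the first loop stopped
print(first)   # [0, 1]
print(second)  # [2, 3, 4, 0, 1]
```

The second pass starts at 2 and wraps around, matching the output listed in the question.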
for x in records:
    data = {}
    for y in sObjectName.describe()['fields']:
        data[y['name']] = x[y['name']]
    ls.append(adapter.insert_posts(collection, data))
I want to execute the code ls.append(adapter.insert_posts(collection, x)) in batches of 500, where x should contain 500 data dicts. I could build a list of 500 data dicts using a double for loop and then insert it, in the following way. Is there a better way to do it?
for x in records:
    for i in xrange(0,len(records)/500):
        for j in xrange(0,500):
            l=[]
            data = {}
            for y in sObjectName.describe()['fields']:
                data[y['name']] = x[y['name']]
                #print data
            #print data
            l.append(data)
        ls.append(adapter.insert_posts(collection, data))
    for i in xrange(0,len(records)%500):
        l=[]
        data = {}
        for y in sObjectName.describe()['fields']:
            data[y['name']] = x[y['name']]
            #print data
        #print data
        l.append(data)
    ls.append(adapter.insert_posts(collection, data))
The general structure I use looks like this:
worklist = [...]
batchsize = 500
for i in range(0, len(worklist), batchsize):
    batch = worklist[i:i+batchsize] # the result might be shorter than batchsize at the end
    # do stuff with batch
Note that we're using the step argument of range to simplify the batch processing considerably.
If you're working with sequences, the solution by @nneonneo is about as performant as you can get. If you want a solution which works with arbitrary iterables, you can look into some of the itertools recipes, e.g. grouper:
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)
I tend to not use this one because it "fills" the last group with None so that it is the same length as the others. I usually define my own variant which doesn't have this behavior:
def grouper2(iterable, n):
    iterable = iter(iterable)
    while True:
        tup = tuple(itertools.islice(iterable, 0, n))
        if tup:
            yield tup
        else:
            break
This yields tuples of the requested size. This is generally good enough, but, for a little fun we can write a generator which returns lazy iterables of the correct size if we really want to...
The "best" solution here I think depends a bit on the problem at hand -- particularly the size of the groups and objects in the original iterable and the type of the original iterable. Generally, these last 2 recipes will find less use because they're more complex and rarely needed. However, If you're feeling adventurous and in the mood for a little fun, read on!
The only real modification that we need to get a lazy iterable instead of a tuple is the ability to "peek" at the next value in the islice to see if there is anything there. Here I just peek at the value. If it's missing, StopIteration will be raised, which will stop the generator just as if it had ended normally. If it's there, I put it back using itertools.chain:
def grouper3(iterable, n):
    iterable = iter(iterable)
    while True:
        group = itertools.islice(iterable, n)
        item = next(group) # raises StopIteration if the group doesn't yield anything
        yield itertools.chain((item,), group)
Careful though, this last function only "works" if you completely exhaust each iterable yielded before moving on to the next one. In the extreme case where you don't exhaust any of the iterables, e.g. list(grouper3(..., n)), you'll get "m" iterables which yield only 1 item, not n (where "m" is the "length" of the input iterable). This behavior could actually be useful sometimes, but not typically. We can fix that too if we use the itertools "consume" recipe (which also requires importing collections in addition to itertools):
def grouper4(iterable, n):
    iterable = iter(iterable)
    group = []
    while True:
        collections.deque(group, maxlen=0) # consume all of the last group
        group = itertools.islice(iterable, n)
        item = next(group) # raises StopIteration if the group doesn't yield anything
        group = itertools.chain((item,), group)
        yield group
Of course, list(grouper4(..., n)) will return empty iterables -- Any value not pulled from the "group" before the next invocation of next (e.g. when the for loop cycles back to the start) will never get yielded.
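The snippets above target Python 2. Under Python 3.7 and later (PEP 479), a StopIteration that escapes from inside a generator body is converted into a RuntimeError, so the bare next(group) in grouper3 no longer ends the generator quietly. A minimal Python 3 sketch of the same peek-and-chain idea, under the made-up name grouper3_py3, catches the exception and returns instead:

```python
import itertools

def grouper3_py3(iterable, n):
    iterator = iter(iterable)
    while True:
        group = itertools.islice(iterator, n)
        try:
            item = next(group)  # peek at the first item of the next group
        except StopIteration:
            return  # input exhausted: end the generator cleanly
        yield itertools.chain((item,), group)

# Each group is fully consumed before the next one is requested.
groups = [list(g) for g in grouper3_py3("ABCDEFG", 3)]
print(groups)  # [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
```

The caveat about exhausting each group before moving on applies here unchanged. Also note that itertools.izip_longest, used in the grouper recipe, is named itertools.zip_longest in Python 3.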
I like @nneonneo's and @mgilson's answers, but doing this over and over again is tedious. The bottom of the itertools page in the Python 3 documentation mentions the library more-itertools (I know this question was about Python 2 and this is a Python 3 library, but some might find this useful). The following seems to do what you ask:
from more_itertools import chunked # Note: you might also want to look at ichunked
for batch in chunked(records, 500):
    # Do the work--`batch` is a list of 500 records (or less for the last batch).
Maybe something like this?
l = []
for ii, x in enumerate(records):
    data = {}
    for y in sObjectName.describe()['fields']:
        data[y['name']] = x[y['name']]
    l.append(data)
    if not ii % 500:
        ls.append(adapter.insert_posts(collection, l))
        l = []
I think one particular scenario is not covered here. Let's say the batch size is 100 and your list has 103 elements; the above answer might miss the last 3 elements.
list = [.....] # 103 elements
total_size = len(list)
batch_size_count=100
for start_index in range(0, total_size, batch_size_count):
    list[start_index : start_index+batch_size_count] #Slicing operation
Each sliced list above can be passed to a method call, so that all the elements get processed.
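As a quick check of the 103-element case, here is a sketch that wraps the slicing loop in a small generator. The name batches and the stand-in records list are assumptions made for this example.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items; the last one may be shorter."""
    for start_index in range(0, len(items), batch_size):
        yield items[start_index:start_index + batch_size]

records = list(range(103))  # stand-in data: 103 elements
sizes = [len(batch) for batch in batches(records, 100)]
print(sizes)  # [100, 3]
```

Because slicing past the end of a list simply returns what is left, the final batch of 3 elements is produced without any special handling.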