How to execute a for loop in batches? - python

for x in records:
    data = {}
    for y in sObjectName.describe()['fields']:
        data[y['name']] = x[y['name']]
    ls.append(adapter.insert_posts(collection, data))
I want to execute ls.append(adapter.insert_posts(collection, x)) in batches of 500, where x contains 500 data dicts. I could build a list of 500 data dicts using a double for loop and then insert it, as shown below. Is there a better way to do it?
for x in records:
    for i in xrange(0, len(records)/500):
        for j in xrange(0, 500):
            l = []
            data = {}
            for y in sObjectName.describe()['fields']:
                data[y['name']] = x[y['name']]
            #print data
            l.append(data)
        ls.append(adapter.insert_posts(collection, data))
    for i in xrange(0, len(records) % 500):
        l = []
        data = {}
        for y in sObjectName.describe()['fields']:
            data[y['name']] = x[y['name']]
        #print data
        l.append(data)
    ls.append(adapter.insert_posts(collection, data))

The general structure I use looks like this:
worklist = [...]
batchsize = 500
for i in range(0, len(worklist), batchsize):
    batch = worklist[i:i+batchsize]  # the result might be shorter than batchsize at the end
    # do stuff with batch
Note that we're using the step argument of range to simplify the batch processing considerably.
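Applied to the question's setup, a minimal sketch might look like this (it assumes, as the question's code suggests, that adapter.insert_posts accepts a list of dicts):
# Sketch only: records, sObjectName, adapter and collection come from the question's context.
fields = sObjectName.describe()['fields']
batchsize = 500
ls = []
for i in range(0, len(records), batchsize):
    batch = [{f['name']: rec[f['name']] for f in fields}
             for rec in records[i:i + batchsize]]
    ls.append(adapter.insert_posts(collection, batch))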

If you're working with sequences, the solution by @nneonneo is about as performant as you can get. If you want a solution which works with arbitrary iterables, you can look into some of the itertools recipes, e.g. grouper:
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)
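For instance, with the Python 2 izip_longest the recipe uses:
list(grouper('ABCDEFG', 3, 'x'))
# -> [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]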
I tend to not use this one because it "fills" the last group with None so that it is the same length as the others. I usually define my own variant which doesn't have this behavior:
def grouper2(iterable, n):
    iterable = iter(iterable)
    while True:
        tup = tuple(itertools.islice(iterable, 0, n))
        if tup:
            yield tup
        else:
            break
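For example:
import itertools

list(grouper2('ABCDEFG', 3))
# -> [('A', 'B', 'C'), ('D', 'E', 'F'), ('G',)]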
This yields tuples of the requested size. This is generally good enough, but, for a little fun we can write a generator which returns lazy iterables of the correct size if we really want to...
The "best" solution here I think depends a bit on the problem at hand -- particularly the size of the groups and objects in the original iterable and the type of the original iterable. Generally, these last 2 recipes will find less use because they're more complex and rarely needed. However, If you're feeling adventurous and in the mood for a little fun, read on!
The only real modification that we need to get a lazy iterable instead of a tuple is the ability to "peek" at the next value in the islice to see if there is anything there. Here I just peek at the value -- if it's missing, StopIteration will be raised, which will stop the generator just as if it had ended normally. If it's there, I put it back using itertools.chain:
def grouper3(iterable, n):
    iterable = iter(iterable)
    while True:
        group = itertools.islice(iterable, n)
        item = next(group)  # raises StopIteration if the group doesn't yield anything
        yield itertools.chain((item,), group)
Careful though, this last function only "works" if you completely exhaust each iterable yielded before moving on to the next one. In the extreme case where you don't exhaust any of the iterables, e.g. list(grouper3(..., n)), you'll get "m" iterables which yield only 1 item, not n (where "m" is the "length" of the input iterable). This behavior could actually be useful sometimes, but not typically. We can fix that too if we use the itertools "consume" recipe (which also requires importing collections in addition to itertools):
def grouper4(iterable, n):
    iterable = iter(iterable)
    group = []
    while True:
        collections.deque(group, maxlen=0)  # consume all of the last group
        group = itertools.islice(iterable, n)
        item = next(group)  # raises StopIteration if the group doesn't yield anything
        group = itertools.chain((item,), group)
        yield group
Of course, list(grouper4(..., n)) will return empty iterables -- Any value not pulled from the "group" before the next invocation of next (e.g. when the for loop cycles back to the start) will never get yielded.
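A quick illustration of that caveat with grouper3 (under the Python 2 / pre-3.7 semantics this answer assumes, where the bare StopIteration simply ends the generator):
# Exhausting each group before asking for the next one works as intended:
[list(g) for g in grouper3(range(7), 3)]
# -> [[0, 1, 2], [3, 4, 5], [6]]

# Materializing all of the groups first does not -- each group then yields only
# the single item that was "peeked" when it was created:
[list(g) for g in list(grouper3(range(7), 3))]
# -> [[0], [1], [2], [3], [4], [5], [6]]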

I like @nneonneo's and @mgilson's answers, but doing this over and over again is tedious. The bottom of the itertools page in Python 3 mentions the library more-itertools (I know this question was about Python 2 and this is a Python 3 library, but some might find this useful). The following seems to do what you ask:
from more_itertools import chunked  # Note: you might also want to look at ichunked

for batch in chunked(records, 500):
    # Do the work -- `batch` is a list of 500 records (or fewer for the last batch).
    ...

Maybe something like this?
l = []
for ii, x in enumerate(records):
    data = {}
    for y in sObjectName.describe()['fields']:
        data[y['name']] = x[y['name']]
    l.append(data)
    if (ii + 1) % 500 == 0:  # flush a full batch of 500
        ls.append(adapter.insert_posts(collection, l))
        l = []
if l:  # flush the final, possibly short batch
    ls.append(adapter.insert_posts(collection, l))

I think one particular scenario is not covered here. Let's say the batch size is 100 and your list size is 103; the approach above might miss the last 3 elements.
my_list = [.....]  # 103 elements
total_size = len(my_list)
batch_size_count = 100
for start_index in range(0, total_size, batch_size_count):
    my_list[start_index : start_index + batch_size_count]  # slicing operation
Each slice produced above can be sent to the method call so that all of the elements are processed.
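A quick check that the slicing covers the trailing elements (a throwaway example with a 103-element list):
data = list(range(103))
batches = [data[i:i + 100] for i in range(0, len(data), 100)]
print(len(batches), len(batches[-1]))  # -> 2 3  (the final batch holds the last 3 elements)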

Related

Stopping a Python iterator/generator after a given number of times

I have a generator with many elements, say
long_generator = (i**2 for i in range(10**1000))
I would like to extract the first n elements (obviously without running the generator to the end): what is a pythonic way to do this?
The function iter has an optional second parameter, a sentinel that stops iteration when the returned value equals it:
numbers = iter(lambda:next(long_generator), 81) # but this assumes we know the results
So would there be an equivalent based on the number of "iterations" instead?
I came up with the following function:
def first_elements(iterable, n: int):
    """Iterates over n elements of an iterable"""
    for _ in range(n):
        yield next(iterable)
And you could get a list as follows: first_10 = list(first_elements(long_generator, 10))
Is there some built-in or better/more elegant way?
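For what it's worth, itertools.islice is the built-in that covers exactly this case:
from itertools import islice

long_generator = (i**2 for i in range(10**1000))
first_10 = list(islice(long_generator, 10))  # [0, 1, 4, ..., 81]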

How to get a zip of all characters in a string. zip misses out on final characters and itertools.zip_longest adds none [duplicate]

This question already has answers here:
Python generator that groups another iterable into groups of N [duplicate]
(9 answers)
Closed 1 year ago.
I am passing the result of itertools.zip_longest to itertools.product, however I get errors when it gets to the end and finds None.
The error I get is:
Error: (<class 'TypeError'>, TypeError('sequence item 0: expected str instance, NoneType found',), <traceback object>)
If I use zip instead of itertools.zip_longest then I don't get all the items.
Here is the code I am using to generate the zip:
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    print(args)
    #return zip(*args)
    return itertools.zip_longest(*args)

sCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~`!##$%^&*()_-+={[}]|\"""':;?/>.<,"

for x in grouper(sCharacters, 4):
    print(x)
Here is the output. The first one is from itertools.zip_longest and the second is from plain zip. You can see the first has None items at the end and the second is missing the final item, the comma ','.
How can I get a zip of all characters in a string without the None at the end?
Or how can I avoid this error?
Thanks for your time.
I've had to solve this in a performance critical case before, so here is the fastest code I've found for doing this (works no matter the values in iterable):
from itertools import zip_longest

def grouper(n, iterable):
    fillvalue = object()  # Guaranteed unique sentinel, cannot exist in iterable
    for tup in zip_longest(*(iter(iterable),) * n, fillvalue=fillvalue):
        if tup[-1] is fillvalue:
            yield tuple(v for v in tup if v is not fillvalue)
        else:
            yield tup
The above is, as far as I can tell, unbeatable when the input is long enough and the chunk sizes are small enough. For cases where the chunk size is fairly large, it can lose out to this even uglier approach, but usually not by much:
from future_builtins import map  # Only on Py2, and required there
from itertools import islice, repeat, starmap, takewhile
from operator import truth  # Faster than bool when guaranteed non-empty call

def grouper(n, iterable):
    '''Returns a generator yielding n sized groups from iterable

    For iterables not evenly divisible by n, the final group will be undersized.
    '''
    # Can add tests to special case other types if you like, or just
    # use tuple unconditionally to match `zip`
    rettype = ''.join if type(iterable) is str else tuple
    # Keep islicing n items and converting to groups until we hit an empty slice
    return takewhile(truth, map(rettype, starmap(islice, repeat((iter(iterable), n)))))
Either approach seamlessly leaves the final element incomplete if there aren't sufficient items to complete the group. It runs extremely fast because literally all of the work is pushed to the C layer in CPython after "set up", so however long the iterable is, the Python level work is the same, only the C level work increases. That said, it does a lot of C work, which is why the zip_longest solution (which does much less C work, and only trivial Python level work for all but the final chunk) usually beats it.
The slower, but more readable equivalent code to option #2 (but skipping the dynamic return type in favor of just tuple) is:
def grouper(n, iterable):
    iterable = iter(iterable)
    while True:
        x = tuple(islice(iterable, n))
        if not x:
            return
        yield x
Or more succinctly with Python 3.8+'s walrus operator:
def grouper(n, iterable):
    iterable = iter(iterable)
    while x := tuple(islice(iterable, n)):
        yield x
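Used on the question's kind of data, the final group simply comes out short, with no None padding to filter out before handing the chunks to product():
list(grouper(4, "abcdefg"))
# -> [('a', 'b', 'c', 'd'), ('e', 'f', 'g')]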
The length of sCharacters is 93 (note: 92 % 4 == 0), so since zip outputs a sequence whose length is that of the shortest input sequence, it will miss the last element.
Beware: the Nones added by itertools.zip_longest are artificial values which may not be the desired behaviour for everyone. That's why zip just ignores unnecessary, additional values.
EDIT:
To be able to use zip you could append some whitespace to your string:
n = 4
sCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~`!##$%^&*()_-+={[}]|\"""':;?/>.<,"
if len(sCharacters) % n > 0:
    sCharacters = sCharacters + (" " * (n - len(sCharacters) % n))
EDIT2:
To obtain the missing tail when using zip, use code like this:
tail = '' if len(sCharacters)%n == 0 else sCharacters[-(len(sCharacters)%n):]

python deque understanding

I came across a bit of code in StackOverflow that raised two questions about the way deque works. I don't have enough reputation to ask "in situ", therefore this question:
from collections import deque
from itertools import islice
def sliding_window(iterable, size=2, step=1, fillvalue=None):
    if size < 0 or step < 1:
        raise ValueError
    it = iter(iterable)
    q = deque(islice(it, size), maxlen=size)
    if not q:
        return  # empty iterable or size == 0
    q.extend(fillvalue for _ in range(size - len(q)))  # pad to size
    while True:
        yield iter(q)  # iter() to avoid accidental outside modifications
        q.append(next(it))
        q.extend(next(it, fillvalue) for _ in range(step - 1))
The code computes a sliding window of a given size over a sequence.
The steps I don't understand are first:
q = deque(islice(it, size), maxlen=size)
What is the use of maxlen here? Isn't islice always going to output an iterable of at most length size?
And second:
yield iter(q) # iter() to avoid accidental outside modifications
Why do we need to transform it to an iterator to avoid "accidental outside modifications"?
To answer the second part of the question: everything in Python is passed by reference, so in the case of the above generator q is a reference to the original deque held by the function, and any method that amends the deque would break the original generation algorithm. When you surround q with iter(), what you effectively yield is an iterator. You can take elements from an iterator (read), but you cannot change the elements themselves or amend their sequence (write is not allowed). So it's good practice to protect against accidental damage to the container held internally by the generator.
To answer the first part of your question: maxlen matters for the later q.append and q.extend calls inside the loop, not for the initial islice. With maxlen set, the deque never exceeds that size as new items are added -- the oldest items are silently discarded, which is exactly what a sliding window needs.
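A minimal illustration of the maxlen behaviour:
from collections import deque

d = deque([1, 2, 3], maxlen=3)
d.append(4)   # the oldest item (1) is silently dropped
print(d)      # deque([2, 3, 4], maxlen=3)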

Nicest, efficient way to get result tuple of sequence items fulfilling and not fulfilling condition

(This is professional best practise/ pattern interest, not home work request)
INPUT: any unordered sequence or generator of items, and a function myfilter(item) that returns True if the filter condition is fulfilled.
OUTPUT: a (filter_true, filter_false) tuple of sequences of the original type which contain the elements partitioned according to the filter, in original sequence order.
How would you express this without doing double filtering, or should I use double filtering? Maybe a filter and a loop/generator/list comprehension with next could be the answer?
Should I drop the requirement of keeping the type, or change the requirement to give a tuple of tuples/generators as the result? I cannot easily return a generator for generator input, or can I? (The requirements are self-made.)
Here is a test of the best candidate at the moment, offering two streams instead of a tuple:
import itertools as it
from sympy.ntheory import isprime as myfilter
mylist = xrange(1000001,1010000,2)
left,right = it.tee((myfilter(x), x) for x in mylist)
filter_true = (x for p,x in left if p)
filter_false = (x for p,x in right if not p)
print 'Hundred primes and non-primes odd numbers'
print '\n'.join(" Prime %i, not prime %i" %
                (next(filter_true), next(filter_false))
                for i in range(100))
Here is a way to do it which only calls myfilter once for each item and will also work if mylist is a generator:
import itertools as it
left,right = it.tee((myfilter(x), x) for x in mylist)
filter_true = (x for p,x in left if p)
filter_false = (x for p,x in right if not p)
Let's suppose that your problem is not memory but CPU: myfilter is heavy and you don't want to iterate over and filter the original dataset twice. Here are some single-pass ideas.
The simple and versatile version (memory-hungry):
filter_true = []
filter_false = []
for item in items:
    if myfilter(item):
        filter_true.append(item)
    else:
        filter_false.append(item)
The memory-friendly version (doesn't work with generators unless used with list(items)):
while items:
    item = items.pop()  # note: pop() consumes from the end, so output order is reversed
    if myfilter(item):
        filter_true.append(item)
    else:
        filter_false.append(item)
The generator-friendly version:
while True:
    try:
        item = next(items)
        if myfilter(item):
            filter_true.append(item)
        else:
            filter_false.append(item)
    except StopIteration:
        break
The easy way (but less efficient) is to tee the iterable and filter both of them:
import itertools

left, right = itertools.tee(mylist)
filter_true = (x for x in left if myfilter(x))
filter_false = (x for x in right if not myfilter(x))
This is less efficient than the optimal solution, because myfilter will be called repeatedly for each element. That is, if you have tested an element in left, you shouldn't have to re-test it in right because you already know the answer. If you require this optimisation, it shouldn't be hard to implement: have a look at the implementation of tee for clues. You'll need a deque for each returned iterable which you stock with the elements of the original sequence that should go in it but haven't been asked for yet.
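A minimal sketch of that deque-based optimisation (names like partition are just for illustration, this is not the actual tee implementation), which calls myfilter exactly once per element and buffers results until the other stream asks for them:
from collections import deque

def partition(pred, iterable):
    it = iter(iterable)
    true_buf, false_buf = deque(), deque()

    def stream(my_buf, other_buf, want):
        while True:
            if my_buf:                       # serve anything already buffered for us
                yield my_buf.popleft()
                continue
            try:
                item = next(it)
            except StopIteration:
                return
            if bool(pred(item)) == want:     # ours: yield immediately
                yield item
            else:                            # theirs: stash for the other stream
                other_buf.append(item)

    return stream(true_buf, false_buf, True), stream(false_buf, true_buf, False)

filter_true, filter_false = partition(myfilter, mylist)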
I think your best bet will be constructing two separate generators:
filter_true = (x for x in mylist if myfilter(x))
filter_false = (x for x in mylist if not myfilter(x))

How to write a pager for Python iterators?

I'm looking for a way to "page through" a Python iterator. That is, I would like to wrap a given iterator iter, together with a page_size, in another iterator that would return the items from iter as a series of "pages". Each page would itself be an iterator with up to page_size iterations.
I looked through itertools and the closest thing I saw is itertools.islice. In some ways, what I'd like is the opposite of itertools.chain -- instead of chaining a series of iterators together into one iterator, I'd like to break an iterator up into a series of smaller iterators. I was expecting to find a paging function in itertools but couldn't locate one.
I came up with the following pager class and demonstration.
class pager(object):
    """
    Takes the iterable iter and page_size to create an iterator that "pages through" iter.
    That is, pager returns a series of page iterators, each returning up to page_size items from iter.
    """
    def __init__(self, iter, page_size):
        self.iter = iter
        self.page_size = page_size

    def __iter__(self):
        return self

    def next(self):
        # if self.iter has not been exhausted, return the next slice
        # I'm using a technique from
        # https://stackoverflow.com/questions/1264319/need-to-add-an-element-at-the-start-of-an-iterator-in-python
        # to check for iterator completion by cloning self.iter into 3 copies:
        # 1) self.iter gets advanced to the next page
        # 2) peek is used to check on whether self.iter is done
        # 3) iter_for_return is to create an independent page of the iterator to be used by caller of pager
        self.iter, peek, iter_for_return = itertools.tee(self.iter, 3)
        try:
            next_v = next(peek)
        except StopIteration:  # catch the exception and then raise it
            raise StopIteration
        else:
            # consume the page from the iterator so that the next page is up in the next iteration
            # is there a better way to do this?
            for i in itertools.islice(self.iter, self.page_size):
                pass
            return itertools.islice(iter_for_return, self.page_size)
iterator_size = 10
page_size = 3
my_pager = pager(xrange(iterator_size), page_size)

# skip a page, then print out the rest, and then show the first page
page1 = my_pager.next()
for page in my_pager:
    for i in page:
        print i
    print "----"
print "skipped first page: ", list(page1)
I'm looking for some feedback and have the following questions:
Is there a pager already in itertools that I'm overlooking?
Cloning self.iter 3 times seems kludgy to me. One clone is to check whether self.iter has any more items. I decided to go with a technique Alex Martelli suggested (aware that he wrote of a wrapping technique). The second clone was to enable the returned page to be independent of the internal iterator (self.iter). Is there a way to avoid making 3 clones?
Is there a better way to deal with the StopIteration exception beside catching it and then raising it again? I am tempted to not catch it at all and let it bubble up.
Thanks!
-Raymond
Look at grouper(), from the itertools recipes.
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
Why aren't you using this?
def grouper(page_size, iterable):
    page = []
    for item in iterable:
        page.append(item)
        if len(page) == page_size:
            yield page
            page = []
    if page:  # avoid yielding an empty final page when the input divides evenly
        yield page
"Each page would itself be an iterator with up to page_size" items. Each page is a simple list of items, which is iterable. You could use yield iter(page) to yield the iterator instead of the object, but I don't see how that improves anything.
It throws a standard StopIteration at the end.
What more would you want?
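A quick check of the grouper above:
for page in grouper(3, range(10)):
    print(page)
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]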
I'd do it like this:
def pager(iterable, page_size):
    args = [iter(iterable)] * page_size
    fillvalue = object()
    for group in izip_longest(fillvalue=fillvalue, *args):
        yield (elem for elem in group if elem is not fillvalue)
That way, None can be a legitimate value that the iterator spits out. Only the single fillvalue object is filtered out, and it cannot possibly be an element of the iterable.
Based on the pointer to the itertools recipe for grouper(), I came up with the following adaption of grouper() to mimic Pager. I wanted to filter out any None results and wanted to return an iterator rather than a tuple (though I suspect that there might be little advantage in doing this conversion)
# based on http://docs.python.org/library/itertools.html#recipes
def grouper2(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    for item in izip_longest(fillvalue=fillvalue, *args):
        yield iter(filter(None, item))
I'd welcome feedback on how what I can do to improve this code.
def group_by(iterable, size):
    """Group an iterable into lists that don't exceed the size given.

    >>> list(group_by([1,2,3,4,5], 2))
    [[1, 2], [3, 4], [5]]
    """
    sublist = []
    for index, item in enumerate(iterable):
        if index > 0 and index % size == 0:
            yield sublist
            sublist = []
        sublist.append(item)
    if sublist:
        yield sublist
more_itertools.chunked will do exactly what you're looking for:
>>> import more_itertools
>>> list(more_itertools.chunked([1, 2, 3, 4, 5, 6], 3))
[[1, 2, 3], [4, 5, 6]]
If you want the chunking without creating temporary lists, you can use more_itertools.ichunked.
That library also has lots of other nice options for efficiently grouping, windowing, slicing, etc.
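For example, ichunked yields sub-iterables instead of lists:
>>> from more_itertools import ichunked
>>> [list(chunk) for chunk in ichunked(range(6), 3)]
[[0, 1, 2], [3, 4, 5]]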
