I'm looking for a way to "page through" a Python iterator. That is, given an iterator iter and a page_size, I would like to wrap iter with another iterator that returns its items as a series of "pages". Each page would itself be an iterator with up to page_size items.
I looked through itertools and the closest thing I saw is itertools.islice. In some ways, what I'd like is the opposite of itertools.chain -- instead of chaining a series of iterators together into one iterator, I'd like to break an iterator up into a series of smaller iterators. I was expecting to find a paging function in itertools but couldn't locate one.
I came up with the following pager class and demonstration.
import itertools

class pager(object):
    """
    Takes the iterable iter and page_size to create an iterator that "pages through" iter.
    That is, pager returns a series of page iterators, each returning up to page_size items from iter.
    """
    def __init__(self, iter, page_size):
        self.iter = iter
        self.page_size = page_size

    def __iter__(self):
        return self

    def next(self):
        # if self.iter has not been exhausted, return the next slice
        # I'm using a technique from
        # https://stackoverflow.com/questions/1264319/need-to-add-an-element-at-the-start-of-an-iterator-in-python
        # to check for iterator completion by cloning self.iter into 3 copies:
        # 1) self.iter gets advanced to the next page
        # 2) peek is used to check on whether self.iter is done
        # 3) iter_for_return is an independent page of the iterator to be used by the caller of pager
        self.iter, peek, iter_for_return = itertools.tee(self.iter, 3)
        try:
            next_v = next(peek)
        except StopIteration:  # catch the exception and then raise it
            raise StopIteration
        else:
            # consume the page from the iterator so that the next page is up in the next iteration
            # is there a better way to do this?
            for i in itertools.islice(self.iter, self.page_size):
                pass
            return itertools.islice(iter_for_return, self.page_size)
iterator_size = 10
page_size = 3

my_pager = pager(xrange(iterator_size), page_size)

# skip a page, then print out the rest, and then show the first page
page1 = my_pager.next()

for page in my_pager:
    for i in page:
        print i
    print "----"

print "skipped first page: ", list(page1)
I'm looking for some feedback and have the following questions:
Is there something already in itertools that serves as a pager that I'm overlooking?
Cloning self.iter 3 times seems kludgy to me. One clone is to check whether self.iter has any more items -- I decided to go with a technique Alex Martelli suggested (aware that he also wrote of a wrapping technique). The second clone is to let the returned page be independent of the internal iterator (self.iter). Is there a way to avoid making 3 clones?
Is there a better way to deal with the StopIteration exception beside catching it and then raising it again? I am tempted to not catch it at all and let it bubble up.
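For what it's worth, here is a minimal sketch of what next() would look like with the exception simply allowed to bubble up (same behavior as above, assuming the rest of the class is unchanged):

    def next(self):
        self.iter, peek, iter_for_return = itertools.tee(self.iter, 3)
        next(peek)  # if self.iter is exhausted, StopIteration propagates on its own
        for i in itertools.islice(self.iter, self.page_size):
            pass
        return itertools.islice(iter_for_return, self.page_size)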
Thanks!
-Raymond
Look at grouper(), from the itertools recipes.
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
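Applied to the paging use case, it should give something like this (short pages are padded with the fillvalue, None by default):

>>> list(grouper(range(10), 3))
[(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, None, None)]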
Why aren't you using this?
def grouper(page_size, iterable):
    page = []
    for item in iterable:
        page.append(item)
        if len(page) == page_size:
            yield page
            page = []
    if page:  # don't yield an empty final page when the length divides evenly
        yield page
"Each page would itself be an iterator with up to page_size" items. Each page is a simple list of items, which is iterable. You could use yield iter(page) to yield the iterator instead of the object, but I don't see how that improves anything.
It throws a standard StopIteration at the end.
What more would you want?
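For example, this should behave as follows:

>>> list(grouper(3, range(10)))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]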
I'd do it like this:
from itertools import izip_longest

def pager(iterable, page_size):
    args = [iter(iterable)] * page_size
    fillvalue = object()
    for group in izip_longest(fillvalue=fillvalue, *args):
        yield (elem for elem in group if elem is not fillvalue)
That way, None can be a legitimate value that the iterator spits out. Only the single object fillvalue is filtered out, and it cannot possibly be an element of the iterable.
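A quick check (Python 2, since izip_longest is used) that falsy values, including None, survive paging:

>>> [list(page) for page in pager([1, None, 0, '', 5], 2)]
[[1, None], [0, ''], [5]]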
Based on the pointer to the itertools recipe for grouper(), I came up with the following adaptation of grouper() to mimic pager. I wanted to filter out any None results and to return an iterator rather than a tuple (though I suspect there might be little advantage in doing this conversion).
# based on http://docs.python.org/library/itertools.html#recipes
from itertools import izip_longest

def grouper2(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    for item in izip_longest(fillvalue=fillvalue, *args):
        yield iter(filter(None, item))
I'd welcome feedback on what I can do to improve this code.
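One possible refinement, picking up the sentinel idea from the answer above, so that legitimate falsy values (0, '', and even None) are not dropped by filter(None, ...). This is a sketch only, and grouper2b is just a made-up name:

from itertools import izip_longest

def grouper2b(n, iterable):
    sentinel = object()  # unique marker; cannot appear in iterable
    args = [iter(iterable)] * n
    for group in izip_longest(fillvalue=sentinel, *args):
        yield (item for item in group if item is not sentinel)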
def group_by(iterable, size):
    """Group an iterable into lists that don't exceed the size given.

    >>> list(group_by([1, 2, 3, 4, 5], 2))
    [[1, 2], [3, 4], [5]]
    """
    sublist = []
    for index, item in enumerate(iterable):
        if index > 0 and index % size == 0:
            yield sublist
            sublist = []
        sublist.append(item)
    if sublist:
        yield sublist
more_itertools.chunked will do exactly what you're looking for:
>>> import more_itertools
>>> list(more_itertools.chunked([1, 2, 3, 4, 5, 6], 3))
[[1, 2, 3], [4, 5, 6]]
If you want the chunking without creating temporary lists, you can use more_itertools.ichunked.
That library also has lots of other nice options for efficiently grouping, windowing, slicing, etc.
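For example, ichunked yields lazy sub-iterables instead of lists; assuming a reasonably recent more_itertools, it should work like this:

>>> import more_itertools
>>> for chunk in more_itertools.ichunked(range(6), 3):
...     print(list(chunk))
...
[0, 1, 2]
[3, 4, 5]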
I've created two enumeration methods, one that returns a list and the other a generator:
def enum_list(sequence, start=0):
    lst = []
    num = start
    for sequence_item in sequence:
        lst.append((num, sequence_item))
        num += 1
    return lst

def enum_generator(sequence, start=0):
    num = start
    for sequence_item in sequence:
        yield (num, sequence_item)
        num += 1
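For reference, both should produce the same pairs as the built-in enumerate; a quick check:

>>> enum_list("abc", start=1)
[(1, 'a'), (2, 'b'), (3, 'c')]
>>> list(enum_generator("abc", start=1))
[(1, 'a'), (2, 'b'), (3, 'c')]
>>> list(enumerate("abc", start=1))
[(1, 'a'), (2, 'b'), (3, 'c')]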
A few questions on this:
(1) Is changing a list to a generator as simple as doing:
# build via list
l = list()
for item in items:
    l.append(item)

# build via iterator
# l = list()      # (1) <== delete this line
for item in items:
    yield item    # (2) change l.append(...) to yield ...
(2) Is "lazy evaluation" the only reason to use a generator, or are there other reasons as well?
(1) Yes -- a generator is created simply by using yield in your iteration instead of building the list.
(2) Lazy evaluation is the main reason, but generators are also used to model stacks and queues, since they can only be iterated over once. The same property is exploited by context managers, which yield the context.
An additional difference in your case is that a list is built completely before use, while a generator is evaluated at each next() call, so the generator function can examine its context and produce a different result for each yield, depending on external conditions that change over time.
Consider pseudocode:
def alloted_time():
    while True:
        if len(global_queue) > 10:
            yield 5
        else:
            yield 10
If the queue is large, allot 5 minutes for the next person, else 10.
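A sketch of how this could play out, assuming global_queue is a list defined at module level (or in the same interactive session) so the generator can see it:

>>> global_queue = []
>>> slots = alloted_time()
>>> next(slots)
10
>>> global_queue.extend(range(20))
>>> next(slots)
5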
I am passing the result of itertools.zip_longest to itertools.product, however I get errors when it gets to the end and finds None.
The error I get is:
Error: (, TypeError('sequence item 0: expected str instance, NoneType found',), )
If I use zip instead of itertools.zip_longest then I don't get all the items.
Here is the code I am using to generate the zip:
import itertools

def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    print(args)
    #return zip(*args)
    return itertools.zip_longest(*args)

sCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~`!##$%^&*()_-+={[}]|\"""':;?/>.<,"

for x in grouper(sCharacters, 4):
    print(x)
Here is the output. The first is from itertools.zip_longest and the second from plain zip. You can see that the first pads with None items and the second is missing the final item, the comma ','.
How can I get a zip of all the characters in the string without the None values at the end, or otherwise avoid this error?
Thanks for your time.
I've had to solve this in a performance-critical case before, so here is the fastest code I've found for doing this (it works no matter what values are in the iterable):
from itertools import zip_longest

def grouper(n, iterable):
    fillvalue = object()  # Guaranteed unique sentinel, cannot exist in iterable
    for tup in zip_longest(*(iter(iterable),) * n, fillvalue=fillvalue):
        if tup[-1] is fillvalue:
            yield tuple(v for v in tup if v is not fillvalue)
        else:
            yield tup
The above is, as far as I can tell, unbeatable when the input is long enough and the chunk sizes are small enough. For cases where the chunk size is fairly large, it can lose out to this even uglier version, but usually not by much:
from future_builtins import map  # Only on Py2, and required there
from itertools import islice, repeat, starmap, takewhile
from operator import truth  # Faster than bool when guaranteed non-empty call

def grouper(n, iterable):
    '''Returns a generator yielding n sized groups from iterable

    For iterables not evenly divisible by n, the final group will be undersized.
    '''
    # Can add tests to special case other types if you like, or just
    # use tuple unconditionally to match `zip`
    rettype = ''.join if type(iterable) is str else tuple
    # Keep islicing n items and converting to groups until we hit an empty slice
    return takewhile(truth, map(rettype, starmap(islice, repeat((iter(iterable), n)))))
Either approach seamlessly leaves the final element incomplete if there aren't sufficient items to complete the group. It runs extremely fast because literally all of the work is pushed to the C layer in CPython after "set up", so however long the iterable is, the Python level work is the same, only the C level work increases. That said, it does a lot of C work, which is why the zip_longest solution (which does much less C work, and only trivial Python level work for all but the final chunk) usually beats it.
The slower, but more readable equivalent code to option #2 (but skipping the dynamic return type in favor of just tuple) is:
def grouper(n, iterable):
    iterable = iter(iterable)
    while True:
        x = tuple(islice(iterable, n))
        if not x:
            return
        yield x
Or more succinctly with Python 3.8+'s walrus operator:
def grouper(n, iterable):
    iterable = iter(iterable)
    while x := tuple(islice(iterable, n)):
        yield x
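For instance, with the tuple-based variants the result should look like this (the version that special-cases str would give 'abcd', 'efgh', 'ij' instead):

>>> list(grouper(4, "abcdefghij"))
[('a', 'b', 'c', 'd'), ('e', 'f', 'g', 'h'), ('i', 'j')]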
The length of sCharacters is 93 (note that 92 % 4 == 0). Since zip outputs a sequence whose length matches the shortest input sequence, it misses the last element.
Beware: the Nones added by itertools.zip_longest are artificial values, which may not be the desired behaviour for everyone. That's why zip simply ignores unnecessary additional values.
EDIT:
to be able to use zip you could append some whitespace to your string:
n = 4
sCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~`!##$%^&*()_-+={[}]|\"""':;?/>.<,"
if len(sCharacters) % n > 0:
    sCharacters = sCharacters + (" " * (n - len(sCharacters) % n))
EDIT2:
to obtain the missing tail when using zip use code like this:
tail = '' if len(sCharacters)%n == 0 else sCharacters[-(len(sCharacters)%n):]
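Putting the two edits together, one rough way to get every character with plain zip and then pick up the leftover tail might be:

n = 4
groups = list(zip(*[iter(sCharacters)] * n))  # full groups of n only
tail = '' if len(sCharacters) % n == 0 else sCharacters[-(len(sCharacters) % n):]
if tail:
    groups.append(tuple(tail))  # e.g. (',',) for the 93rd character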
I came across a bit of code in StackOverflow that raised two questions about the way deque works. I don't have enough reputation to ask "in situ", therefore this question:
from collections import deque
from itertools import islice

def sliding_window(iterable, size=2, step=1, fillvalue=None):
    if size < 0 or step < 1:
        raise ValueError
    it = iter(iterable)
    q = deque(islice(it, size), maxlen=size)
    if not q:
        return  # empty iterable or size == 0
    q.extend(fillvalue for _ in range(size - len(q)))  # pad to size
    while True:
        yield iter(q)  # iter() to avoid accidental outside modifications
        q.append(next(it))
        q.extend(next(it, fillvalue) for _ in range(step - 1))
The code computes a sliding window of a given size over a sequence.
The steps I don't understand are first:
q = deque(islice(it, size), maxlen=size)
What is the use of maxlen here? Isn't islice always going to output an iterable of at most length size?
And second:
yield iter(q) # iter() to avoid accidental outside modifications
why do we need to transform q to an iterator to avoid "accidental outside modifications"?
To answer the second part of the question: the generator yields q, a reference to the deque it holds internally, so any caller that mutated the deque would break the generator's algorithm. By wrapping q in iter(), what gets yielded is an iterator: the caller can read elements from it, but cannot change the elements or their order (no writes). So it's good practice to protect the container held internally by the generator from accidental damage.
To answer the first part of your question: maxlen doesn't matter for the initial islice (which indeed yields at most size items), but for the later append() and extend() calls -- it guarantees the deque never grows beyond size, because the oldest items are discarded automatically as new ones are added.
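A small demonstration of that behaviour:

>>> from collections import deque
>>> d = deque([1, 2, 3], maxlen=3)
>>> d.append(4)
>>> d
deque([2, 3, 4], maxlen=3)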
for x in records:
    data = {}
    for y in sObjectName.describe()['fields']:
        data[y['name']] = x[y['name']]
    ls.append(adapter.insert_posts(collection, data))
I want to execute the code ls.append(adapter.insert_posts(collection, x)) in batches of size 500, where x should contain 500 data dicts. I could create a list of 500 data dicts using a double for loop and a list and then insert it, as in the following. Is there a better way to do it?
for x in records:
    for i in xrange(0, len(records)/500):
        for j in xrange(0, 500):
            l = []
            data = {}
            for y in sObjectName.describe()['fields']:
                data[y['name']] = x[y['name']]
                #print data
            l.append(data)
        ls.append(adapter.insert_posts(collection, data))
    for i in xrange(0, len(records)%500):
        l = []
        data = {}
        for y in sObjectName.describe()['fields']:
            data[y['name']] = x[y['name']]
            #print data
        l.append(data)
        ls.append(adapter.insert_posts(collection, data))
The general structure I use looks like this:
worklist = [...]
batchsize = 500

for i in range(0, len(worklist), batchsize):
    batch = worklist[i:i+batchsize]  # the result might be shorter than batchsize at the end
    # do stuff with batch
Note that we're using the step argument of range to simplify the batch processing considerably.
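With a toy worklist, the batches should come out like this:

>>> worklist = list(range(7))
>>> [worklist[i:i + 3] for i in range(0, len(worklist), 3)]
[[0, 1, 2], [3, 4, 5], [6]]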
If you're working with sequences, the solution by #nneonneo is about as performant as you can get. If you want a solution which works with arbitrary iterables, you can look into some of the itertools recipes. e.g. grouper:
import itertools

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)
I tend to not use this one because it "fills" the last group with None so that it is the same length as the others. I usually define my own variant which doesn't have this behavior:
def grouper2(iterable, n):
    iterable = iter(iterable)
    while True:
        tup = tuple(itertools.islice(iterable, 0, n))
        if tup:
            yield tup
        else:
            break
This yields tuples of the requested size. This is generally good enough, but, for a little fun we can write a generator which returns lazy iterables of the correct size if we really want to...
The "best" solution here I think depends a bit on the problem at hand -- particularly the size of the groups and objects in the original iterable and the type of the original iterable. Generally, these last 2 recipes will find less use because they're more complex and rarely needed. However, If you're feeling adventurous and in the mood for a little fun, read on!
The only real modification that we need to get a lazy iterable instead of a tuple is the ability to "peek" at the next value in the islice to see if there is anything there. here I just peek at the value -- If it's missing, StopIteration will be raised which will stop the generator just as if it had ended normally. If it's there, I put it back using itertools.chain:
def grouper3(iterable, n):
    iterable = iter(iterable)
    while True:
        group = itertools.islice(iterable, n)
        item = next(group)  # raises StopIteration if the group doesn't yield anything
        yield itertools.chain((item,), group)
Careful though, this last function only "works" if you completely exhaust each iterable yielded before moving on to the next one. In the extreme case where you don't exhaust any of the iterables, e.g. list(grouper3(..., n)), you'll get "m" iterables which yield only 1 item, not n (where "m" is the "length" of the input iterable). This behavior could actually be useful sometimes, but not typically. We can fix that too if we use the itertools "consume" recipe (which also requires importing collections in addition to itertools):
def grouper4(iterable, n):
    iterable = iter(iterable)
    group = []
    while True:
        collections.deque(group, maxlen=0)  # consume all of the last group
        group = itertools.islice(iterable, n)
        item = next(group)  # raises StopIteration if the group doesn't yield anything
        group = itertools.chain((item,), group)
        yield group
Of course, list(grouper4(..., n)) will return empty iterables -- Any value not pulled from the "group" before the next invocation of next (e.g. when the for loop cycles back to the start) will never get yielded.
I like #nneonneo's and #mgilson's answers, but doing this over and over again is tedious. The bottom of the itertools page in Python 3 mentions the library more-itertools (I know this question was about Python 2 and this is a Python 3 library, but some might find it useful). The following seems to do what you ask:
from more_itertools import chunked  # Note: you might also want to look at ichunked

for batch in chunked(records, 500):
    # Do the work -- `batch` is a list of 500 records (or less for the last batch).
Maybe something like this?
l = []
for ii, x in enumerate(records):
    data = {}
    for y in sObjectName.describe()['fields']:
        data[y['name']] = x[y['name']]
    l.append(data)
    if not ii % 500:
        ls.append(adapter.insert_posts(collection, l))
        l = []
I think one particular case is not covered here. Let's say the batch size is 100 and your list size is 103; the above answer might miss the last 3 elements.
lst = [...]  # 103 elements
total_size = len(lst)
batch_size_count = 100

for start_index in range(0, total_size, batch_size_count):
    lst[start_index : start_index + batch_size_count]  # slicing operation
The sliced list above can be passed to each method call to complete the execution for all the elements.
Here is a seemingly simple problem: given a list of iterators that yield sequences of integers in ascending order, write a concise generator that yields only the integers that appear in every sequence.
After reading a few papers last night, I decided to hack up a completely minimal full text indexer in Python, as seen here (though that version is quite old now).
My problem is with the search() function, which must iterate over each posting list and yield only the document IDs that appear on every list. As you can see from the link above, my current non-recursive 'working' attempt is terrible.
Example:
postings = [[1, 100, 142, 322, 12312],
            [2, 100, 101, 322, 1221],
            [100, 142, 322, 956, 1222]]
Should yield:
[100, 322]
There is at least one elegant recursive function solution to this, but I'd like to avoid that if possible. However, a solution involving nested generator expressions, itertools abuse, or any other kind of code golf is more than welcome. :-)
It should be possible to arrange for the function to only require as many steps as there are items in the smallest list, and without sucking the entire set of integers into memory. In future, these lists may be read from disk, and larger than available RAM.
For the past 30 minutes I've had an idea on the tip of my tongue, but I can't quite get it into code. Remember, this is just for fun!
import heapq, itertools

def intersect(*its):
    for key, values in itertools.groupby(heapq.merge(*its)):
        if len(list(values)) == len(its):
            yield key
>>> list(intersect(*postings))
[100, 322]
def postings(posts):
    sets = (set(l) for l in posts)
    return sorted(reduce(set.intersection, sets))
... you could try and take advantage of the fact that the lists are ordered, but since reduce, generator expressions and set are all implemented in C, you'll probably have a hard time doing better than the above with logic implemented in python.
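For example (Python 2, where reduce is a builtin; on Python 3 you would import it from functools):

>>> postings([[1, 100, 142, 322, 12312],
...           [2, 100, 101, 322, 1221],
...           [100, 142, 322, 956, 1222]])
[100, 322]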
This solution will compute the intersection of your iterators. It works by advancing the iterators one step at a time and looking for the same value in all of them. When found, such values are yielded -- this makes the intersect function a generator itself.
import operator

def intersect(sequences):
    """Compute intersection of sequences of increasing integers.

    >>> list(intersect([[1, 100, 142, 322, 12312],
    ...     [2, 100, 101, 322, 1221],
    ...     [100, 142, 322, 956, 1222]]))
    [100, 322]
    """
    iterators = [iter(seq) for seq in sequences]
    last = [iterator.next() for iterator in iterators]
    indices = range(len(iterators) - 1)
    while True:
        # The while loop stops when StopIteration is raised. The
        # exception will also stop the iteration by our caller.
        if reduce(operator.and_, [l == last[0] for l in last]):
            # All iterators contain last[0]
            yield last[0]
            last = [iterator.next() for iterator in iterators]
        # Now go over the iterators once and advance them as
        # necessary. To stop as soon as the smallest iterator is
        # exhausted we advance each iterator only once per iteration
        # in the while loop.
        for i in indices:
            if last[i] < last[i+1]:
                last[i] = iterators[i].next()
            if last[i] > last[i+1]:
                last[i+1] = iterators[i+1].next()
If these are really long (or even infinite) sequences, and you don't want to load everything into a set in advance, you can implement this with a 1-item lookahead on each iterator.
EndOfIter = object()  # Sentinel value

class PeekableIterator(object):
    def __init__(self, it):
        self.it = it
        self._peek = None
        self.next()  # pump iterator to get first value

    def __iter__(self):
        return self

    def next(self):
        cur = self._peek
        if cur is EndOfIter:
            raise StopIteration()
        try:
            self._peek = self.it.next()
        except StopIteration:
            self._peek = EndOfIter
        return cur

    def peek(self):
        return self._peek


def contained_in_all(seqs):
    if not seqs:
        return  # No items
    iterators = [PeekableIterator(iter(seq)) for seq in seqs]
    first, rest = iterators[0], iterators[1:]
    for item in first:
        candidates = list(rest)
        while candidates:
            if any(c.peek() is EndOfIter for c in candidates):
                return  # Exhausted an iterator
            candidates = [c for c in candidates if c.peek() < item]
            for c in candidates:
                c.next()
        # Out of the loop when the first items of the remaining iterators are all >= item.
        if all(it.peek() == item for it in rest):
            yield item
Usage:
>>> print list(contained_in_all(postings))
[100, 322]
What about this:
import heapq

def inalliters(iterators):
    heap = [(iterator.next(), iterator) for iterator in iterators]
    heapq.heapify(heap)
    maximal = max(heap)[0]
    while True:
        value, iterator = heapq.heappop(heap)
        if maximal == value:
            yield value
        nextvalue = iterator.next()
        heapq.heappush(heap, (nextvalue, iterator))
        maximal = max(maximal, nextvalue)

postings = [iter([1, 100, 142, 322, 12312]),
            iter([2, 100, 101, 322, 1221]),
            iter([100, 142, 322, 956, 1222])]

print [x for x in inalliters(postings)]
I haven't tested it very thoroughly (just ran your example), but I believe the basic idea is sound.
I want to show that there's an elegant solution which only iterates forward once. Sorry, I don't know Python well enough, so I use fictional classes. This one reads input, an array of iterators, and writes to output on the fly, without ever going back or using any array functions.
def intersect (input, output)
    do:
        min = input[0]
        bingo = True
        for i in input:
            if (i.cur != min.cur):
                bingo = False      # not every iterator is sitting on the same value
            if (i.cur < min.cur):
                min = i            # track the iterator with the smallest current value
        if bingo:
            output.push(min.cur)   # all iterators agree on this value
    while (min.step())
This one runs in O(n*m), where n is the sum of all iterator lengths and m is the number of lists. It can be made O(n*log m) by using a heap to find the minimum instead of the linear scan over input.
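A rough Python translation of the same idea -- my own sketch, not the answerer's code; intersect_sorted is a made-up name, and it tracks each iterator's current value in a list rather than using the fictional .cur/.step() methods:

def intersect_sorted(iterators):
    # Yield the values common to all strictly ascending iterators,
    # advancing each input only forward, one step at a time.
    iterators = list(iterators)
    try:
        current = [next(it) for it in iterators]
    except StopIteration:
        return  # at least one input is empty, so the intersection is empty
    while True:
        smallest = min(range(len(current)), key=current.__getitem__)
        if current.count(current[smallest]) == len(current):
            yield current[smallest]  # every iterator is sitting on the same value
        try:
            current[smallest] = next(iterators[smallest])  # step the laggard
        except StopIteration:
            return  # one input is exhausted: no further common values are possible

With the example postings lists from the question, list(intersect_sorted([iter(p) for p in postings])) should give [100, 322]; swapping the min() scan for a heap would give the O(n*log m) variant mentioned above.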
def intersection(its):
    if not its:
        return
    vs = [next(it) for it in its]
    m = max(vs)
    while True:
        v, i = min((v, i) for i, v in enumerate(vs))
        if v == m:
            yield m
        vs[i] = next(its[i])
        m = max(m, vs[i])