Joining a set of ordered-integer yielding Python iterators - python

Here is a seemingly simple problem: given a list of iterators that yield sequences of integers in ascending order, write a concise generator that yields only the integers that appear in every sequence.
After reading a few papers last night, I decided to hack up a completely minimal full text indexer in Python, as seen here (though that version is quite old now).
My problem is with the search() function, which must iterate over each posting list and yield only the document IDs that appear on every list. As you can see from the link above, my current non-recursive 'working' attempt is terrible.
Example:
postings = [[1, 100, 142, 322, 12312],
            [2, 100, 101, 322, 1221],
            [100, 142, 322, 956, 1222]]
Should yield:
[100, 322]
There is at least one elegant recursive function solution to this, but I'd like to avoid that if possible. However, a solution involving nested generator expressions, itertools abuse, or any other kind of code golf is more than welcome. :-)
It should be possible to arrange for the function to only require as many steps as there are items in the smallest list, and without sucking the entire set of integers into memory. In future, these lists may be read from disk, and larger than available RAM.
For the past 30 minutes I've had an idea on the tip of my tongue, but I can't quite get it into code. Remember, this is just for fun!

import heapq, itertools

def intersect(*its):
    for key, values in itertools.groupby(heapq.merge(*its)):
        if len(list(values)) == len(its):
            yield key

>>> list(intersect(*postings))
[100, 322]

def postings(posts):
    # reduce is a builtin on Python 2; on Python 3 use functools.reduce
    sets = (set(l) for l in posts)
    return sorted(reduce(set.intersection, sets))
You could try to take advantage of the fact that the lists are ordered, but since reduce, generator expressions and set are all implemented in C, you'll probably have a hard time doing better than the above with logic implemented in Python.

This solution will compute the intersection of your iterators. It works by advancing the iterators one step at a time and looking for the same value in all of them. When found, such values are yielded -- this makes the intersect function a generator itself.
import operator

def intersect(sequences):
    """Compute intersection of sequences of increasing integers.

    >>> list(intersect([[1, 100, 142, 322, 12312],
    ...                 [2, 100, 101, 322, 1221],
    ...                 [100, 142, 322, 956, 1222]]))
    [100, 322]
    """
    iterators = [iter(seq) for seq in sequences]
    last = [iterator.next() for iterator in iterators]
    indices = range(len(iterators) - 1)
    while True:
        # The while loop stops when StopIteration is raised. The
        # exception will also stop the iteration by our caller.
        if reduce(operator.and_, [l == last[0] for l in last]):
            # All iterators contain last[0]
            yield last[0]
            last = [iterator.next() for iterator in iterators]
        # Now go over the iterators once and advance them as
        # necessary. To stop as soon as the smallest iterator is
        # exhausted we advance each iterator only once per iteration
        # in the while loop.
        for i in indices:
            if last[i] < last[i+1]:
                last[i] = iterators[i].next()
            if last[i] > last[i+1]:
                last[i+1] = iterators[i+1].next()

If these are really long (or even infinite) sequences, and you don't want to load everything into a set in advance, you can implement this with a 1-item lookahead on each iterator.
EndOfIter = object()  # Sentinel value

class PeekableIterator(object):
    def __init__(self, it):
        self.it = it
        self._peek = None
        self.next()  # pump iterator to get first value

    def __iter__(self):
        return self

    def next(self):
        cur = self._peek
        if cur is EndOfIter:
            raise StopIteration()
        try:
            self._peek = self.it.next()
        except StopIteration:
            self._peek = EndOfIter
        return cur

    def peek(self):
        return self._peek

def contained_in_all(seqs):
    if not seqs: return  # No items
    iterators = [PeekableIterator(iter(seq)) for seq in seqs]
    first, rest = iterators[0], iterators[1:]
    for item in first:
        candidates = list(rest)
        while candidates:
            if any(c.peek() is EndOfIter for c in candidates): return  # Exhausted an iterator
            candidates = [c for c in candidates if c.peek() < item]
            for c in candidates: c.next()
        # Out of the loop once the first items in the remaining iterators are all >= item.
        if all(it.peek() == item for it in rest):
            yield item
Usage:
>>> print list(contained_in_all(postings))
[100, 322]

What about this:
import heapq

def inalliters(iterators):
    heap = [(iterator.next(), iterator) for iterator in iterators]
    heapq.heapify(heap)
    maximal = max(heap)[0]
    while True:
        value, iterator = heapq.heappop(heap)
        if maximal == value: yield value
        nextvalue = iterator.next()
        heapq.heappush(heap, (nextvalue, iterator))
        maximal = max(maximal, nextvalue)

postings = [iter([1, 100, 142, 322, 12312]),
            iter([2, 100, 101, 322, 1221]),
            iter([100, 142, 322, 956, 1222])]
print [x for x in inalliters(postings)]
I haven't tested it very thoroughly (just ran your example), but I believe the basic idea is sound.

I want to show that there's an elegant solution which only iterates forward once. Sorry, I don't know Python well enough, so I use fictional classes. This one reads input, an array of iterators, and writes to output on the fly without ever going back or using any array function.
def intersect (input, output)
    do:
        min = input[0]
        bingo = True
        for i in input:
            if (i.cur < min.cur):
                bingo = False
                min = i
        if bingo:
            output.push(min.cur)
    while (min.step())
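Here is a rough Python rendering of that single-forward-pass idea (my own sketch and naming, not the answerer's code): keep one current value per iterator, yield when they all agree, and otherwise advance only whichever iterator currently holds the smallest value.
def intersect_forward(iterables):
    # Sketch of the 'advance the minimum' idea; assumes each input yields
    # strictly increasing integers. intersect_forward is my own name.
    iterators = [iter(it) for it in iterables]
    try:
        current = [next(it) for it in iterators]
    except StopIteration:        # an empty input means an empty intersection
        return
    while True:
        lo, hi = min(current), max(current)
        if lo == hi:             # all iterators agree on this value
            yield lo
        i = current.index(lo)    # advance only the smallest cursor
        try:
            current[i] = next(iterators[i])
        except StopIteration:    # one input ran out: nothing more in common
            return
On the example postings from the question, list(intersect_forward(postings)) gives [100, 322].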

This one runs in O(n*m) where n is the sum of all iterator lengths, and m is the number of lists. It can be made O(n*logm) by using a heap in line 6.
def intersection(its):
    if not its: return
    vs = [next(it) for it in its]
    m = max(vs)
    while True:
        v, i = min((v, i) for i, v in enumerate(vs))
        if v == m:
            yield m
        vs[i] = next(its[i])
        m = max(m, vs[i])
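For reference, here is a sketch of the heap variant mentioned above (my own interpretation and naming): a heap of (value, index) pairs replaces the linear min() scan, so each step costs O(log m) instead of O(m).
import heapq

def intersection_logm(its):
    # Sketch only; assumes each input yields strictly increasing integers.
    its = [iter(it) for it in its]
    if not its:
        return
    try:
        heap = [(next(it), i) for i, it in enumerate(its)]
    except StopIteration:            # an empty input means an empty intersection
        return
    heapq.heapify(heap)
    m = max(v for v, _ in heap)      # largest current value seen so far
    while True:
        v, i = heapq.heappop(heap)   # smallest current value
        if v == m:                   # every iterator has reached this value
            yield m
        try:
            nxt = next(its[i])
        except StopIteration:        # one list is exhausted: nothing more in common
            return
        heapq.heappush(heap, (nxt, i))
        m = max(m, nxt)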

Related

Python code to find the max value via recursion

I have a question about my Python code to find the max value within a list. The working code is as follows:
def large(x):
    if len(x) == 1:
        return x.pop(0)
    else:
        temp = x.pop(0)
        previous = large(x)
        if previous >= temp:
            return previous
        else:
            return temp
But before that, I tried:
def large(x):
    if len(x) == 1:
        return x.pop(0)
    else:
        temp = x.pop(0)
        if large(x) >= temp:
            return large(x)
        else:
            return temp
And it returns the following error message:
<ipython-input-74-dd8676a7c4e6> in large(x)
3 return x.pop(0)
4 else:
----> 5 temp = x.pop(0)
6 if large(x) >= temp:
7 return large(x)
IndexError: pop from empty list
The toy data would be:
inputlist = [6,1,3,2,3,4,5,6]
large(inputlist)
Thank you for your help in advance. I can't find the cause of this error; to me, these two pieces of code look exactly the same.
The problem with
if large(x) >= temp:
return large(x)
is that you end up calling large(x) (and therefore pop) more than once, which is removing elements from the list.
Personally, I would more go for this style than using a mutating function like pop.
def large(x):
    if len(x) == 1:
        return x[0]
    remainder = large(x[1:])
    return x[0] if x[0] > remainder else remainder
The same solution as OneCricketeer's, but without unnecessarily creating list slices on every recursive call.
It also handles an empty list.
def large(x):
    def rec(y):
        try:
            v = next(y)
        except StopIteration:
            return None
        r = rec(y)
        if r is None:
            return v
        return v if v > r else r
    return rec(iter(x))

inputlist = [6,1,3,2,3,4,5,6]
print(large(inputlist))
print(large([]))
which produces
6
None
This does not answer why the original is incorrect. Rather, it lays down a 'standard pattern' that can be used for implementing a number of recursive problems.
I am wondering how I should eliminate the elements in each round with an index rather than pop?
Don't "eliminate" the elements :-)
Many recursive-friendly problems operate by reducing the range each step. This includes finding the max value (or any operation that can be expressed as a fold), a binary search, a top-down merge sort, etc. Many of these problems are themselves expressed in pseudo-code using arrays and sub-problem reduction by adjusting the ranges of each recursive call. In the case of a max/binary-search this also avoids any mutations to the original object.
Thus, a recursive max function can be written as the following. Note that this form of passing in the working state is Tail-Call Friendly. While I find this form easier for expressing certain problems, it does not really matter in Python, as [C]Python does not support Tail-Call Optimizations^.
def my_max(lst, m=None, i=0):
    # base case: return result of work
    if len(lst) == i:
        return m
    # Compute max through here
    c = lst[i]
    m = c if m is None or c > m else m
    # Call recursive function increasing the index by 1.
    # This is how the problem advances.
    return my_max(lst, m, i + 1)
The above example also uses default arguments instead of a helper method. Here is an alternative that uses the recursive result (which is often how recursive functions are introduced) as well as a discrete helper method.
def my_max(lst):
    # Wrapper can ensure helper pre-conditions.
    # In this case that is a non-empty list, per the base-case check.
    if not lst:
        return None
    return my_max_helper(lst, 0)

def my_max_helper(lst, i):
    # base case: last item in list returns itself
    if len(lst) - 1 == i:
        return lst[i]
    c = lst[i]
    m = my_max_helper(lst, i + 1)
    return c if c > m else m
In both cases temporary variables are used to avoid duplicate expressions; while sometimes merely a stylistic choice, this consistency would have mitigated the original issue by avoiding the unexpected side effect of the additional pop mutation.
The above methods should be called with a list or other sequence that supports O(1) indexed lookups. In particular, the 'index' approach is not suitable for, and will not work with, generator objects. There are other answers that cover this; just be wary of potential list slices like h,*t=l or l[1:], which can lead to bad performance bounds.
^There are modules in Python which can emulate TCO through spring-boarding.
Since this is a recursion exercise, and not something we'd do in system code, I'd go with descriptive code over efficient code and do something like:
def largest(array):
    if array:
        head, *tail = array
        if tail and head < (result := largest(tail)):
            return result
        return head
    return None

if __name__ == "__main__":
    from random import choices

    array = choices(range(100), k=10)
    print(array, '->', largest(array))
OUTPUT
> python3 test.py
[46, 67, 0, 22, 23, 20, 30, 7, 87, 50] -> 87
> python3 test.py
[83, 77, 61, 53, 7, 65, 68, 43, 44, 47] -> 83
> python3 test.py
[36, 99, 47, 93, 60, 43, 56, 90, 53, 44] -> 99
>
If you really need to be efficient, I'd recommend doing so safely. Specifically, not exposing an API with special arguments that the caller is not supposed to use, e.g.:
def my_max(lst, m=None, i=0):
Callers can supply values for these extra arguments that will make your code fail, and ultimately blame it on you. Ditto for exposing internal functions that the caller might call instead of the intended one:
def my_max(lst, m=None, i=0):
def my_max_helper(lst, i):
Accidentally calling my_max_helper() with a bogus value for the poorly named i argument. Instead, I'd consider nesting your functions to avoid such calling errors:
def largest(array):
    def largest_recursive(array, index):
        a = array[index]
        if len(array) - index != 1:
            if (b := largest_recursive(array, index + 1)) > a:
                return b
        return a

    if array:
        return largest_recursive(array, 0)
    return None

Finite permutations of a list python

I have a list and would like to generate a finite number of permutations, with no permutation repeated.
itertools.permutations(x)
gives all possible orderings, but I only need a specific number of permutations. (My initial list contains ~200 elements, so 200! permutations would take an unreasonable amount of time, and I don't need all of them.)
What I have done so far:
def createList(My_List):
    New_List = random.sample(My_List, len(My_List))
    return New_List

def createManyList(Nb_of_Lists):
    list_of_list = []
    for i in range(0, Nb_of_Lists):
        list_of_list.append(createList())
    return list_of_list
It's working, but my list_of_list is not guaranteed to contain unique permutations, or at least I have no guarantee of it.
Is there any way around this? Thanks
Just use islice, which allows you to take a number of elements from an iterable:
from itertools import permutations, islice
n_elements = 1000
list(islice(permutations(x), 0, 1000))
This will return a list of (the first) 1000 permutations.
The reason this works is that permutations returns an iterator, which is an object that generates values to return as they are needed, not immediately. Therefore, the process goes something like this:
The calling function (in this case, list) asks for the next value from islice
islice checks if 1000 values have been returned; if not, it asks for the next value from permutations
permutations returns the next value, in order
Because of this, the full list of permutations never needs to be generated; we take only as many as we want.
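A quick way to convince yourself of this (a small sketch of my own, using the ~200-element size from the question): only the requested permutations are ever produced, so the 200! total never matters.
from itertools import islice, permutations

x = list(range(200))                        # stand-in for the real list
first_three = list(islice(permutations(x), 3))
print(len(first_three))                     # 3 -- only three permutations were generated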
You can do:
i = 0
while i < Nb_of_Lists:
    new_list = createList(My_List)
    if new_list not in list_of_list:
        list_of_list.append(new_list)
        i += 1   # only count permutations we haven't seen before
This will check if that permutation was already used.
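A variant of the same idea (a sketch; unique_shuffles is my own name) keeps the orderings it has already produced in a set of tuples, so the membership test stays cheap and uniqueness is guaranteed. With ~200 elements a random collision is vanishingly unlikely anyway, but the check costs little.
import random

def unique_shuffles(items, k):
    # Return k distinct random orderings of items (assumes k is far below
    # the total number of possible permutations, as in the question).
    seen = set()
    result = []
    while len(result) < k:
        perm = tuple(random.sample(items, len(items)))
        if perm not in seen:        # skip any ordering we've already produced
            seen.add(perm)
            result.append(list(perm))
    return result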
You don't need to roll your own permutation. You just need to halt the generator once you get enough:
# python 2.7
import random
import itertools

def createList(My_List):
    New_List = random.sample(My_List, len(My_List))
    return New_List

x = createList(xrange(20))

def getFirst200():
    for i, result in enumerate(itertools.permutations(x)):
        if i == 200:
            raise StopIteration
        yield result

print list(getFirst200())  # print first 200 of the result
This is faster and more memory-efficient than the 'generate the full set, then take the first 200' approach.

How to check if an ordered non-consecutive subsequence is in array? Python

I'd be surprised if this hasn't been asked yet.
Let's say I have an array [5,6,7,29,34] and I want to check if the sequence 5,6,7 appears in it (which it does). Order does matter.
How would I do this?
Just for fun, here is a quick (very quick) and dirty (very dirty) solution (that is somewhat flawed, so don't really use this):
>>> str([5,6,7]).strip('[]') in str([5,6,7,29,34])
True
The RightWay™ is likely to use list.index() to find candidate matches for the first element and then verify the full match with slicing and list equality:
>>> def issubsequence(sub, seq):
...     i = -1
...     while True:
...         try:
...             i = seq.index(sub[0], i+1)      # locate first element
...         except ValueError:
...             return False
...         if seq[i : i+len(sub)] == sub:      # verify full match
...             return True

>>> issubsequence([5, 6, 7], [5, 6, 7, 29, 34])
True
>>> issubsequence([5, 20, 7], [5, 6, 7, 29, 34])
False
Edit: The OP clarified in a comment that the subsequence must be in order but need not be in consecutive positions. That has a different and much more complicated solution which was already answered here: How do you check if one array is a subsequence of another?
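For reference, the usual short idiom for that in-order-but-not-contiguous check looks roughly like this (a sketch; is_subsequence is my name for it):
def is_subsequence(sub, seq):
    # Each element of sub must appear in seq in order, but not necessarily
    # in adjacent positions. 'x in it' consumes the iterator up to and including x.
    it = iter(seq)
    return all(x in it for x in sub)

>>> is_subsequence([5, 7, 34], [5, 6, 7, 29, 34])
True
>>> is_subsequence([7, 5], [5, 6, 7, 29, 34])
False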
Here is a good solution:
def is_sublist(a, b):
    if not a: return True
    if not b: return False
    return b[:len(a)] == a or is_sublist(a, b[1:])
As mentioned by Stefan Pochmann this can be rewritten as:
def is_sublist(a, b):
    return b[:len(a)] == a or bool(b) and is_sublist(a, b[1:])
Here's a solution that works (efficiently!) on any pair of iterable objects:
import collections
import itertools

def consume(iterator, n=None):
    """Advance the iterator n-steps ahead. If n is None, consume entirely."""
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(itertools.islice(iterator, n, n), None)

def is_slice(seq, subseq):
    """Returns whether subseq is a contiguous subsequence of seq."""
    subseq = tuple(subseq)  # len(subseq) is needed, so make it a tuple
    seq_window = itertools.tee(seq, len(subseq))
    for steps, it in enumerate(seq_window):
        # advance each iterator to point to subsequent values in seq
        consume(it, n=steps)
    # Python 2: itertools.izip keeps this lazy (use zip on Python 3)
    return any(subseq == seq_slice for seq_slice in itertools.izip(*seq_window))
consume comes from itertools recipes.
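A quick check against the example from the earlier question (my own usage example; note the code above is written for Python 2, where itertools.izip exists):
>>> is_slice([5, 6, 7, 29, 34], [5, 6, 7])
True
>>> is_slice([5, 6, 7, 29, 34], [6, 29])
False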

How to execute a for loop in batches?

for x in records:
    data = {}
    for y in sObjectName.describe()['fields']:
        data[y['name']] = x[y['name']]
    ls.append(adapter.insert_posts(collection, data))
I want to execute the code ls.append(adapter.insert_posts(collection, data)) in batches of 500, where the argument should contain 500 data dicts. I could create a list of 500 data dicts using a double for loop and a list, and then insert it. I could do that in the following way; is there a better way to do it?
for x in records:
    for i in xrange(0, len(records)/500):
        for j in xrange(0, 500):
            l = []
            data = {}
            for y in sObjectName.describe()['fields']:
                data[y['name']] = x[y['name']]
                #print data
            #print data
            l.append(data)
        ls.append(adapter.insert_posts(collection, data))
    for i in xrange(0, len(records)%500):
        l = []
        data = {}
        for y in sObjectName.describe()['fields']:
            data[y['name']] = x[y['name']]
            #print data
        #print data
        l.append(data)
        ls.append(adapter.insert_posts(collection, data))
The general structure I use looks like this:
worklist = [...]
batchsize = 500

for i in range(0, len(worklist), batchsize):
    batch = worklist[i:i+batchsize]  # the result might be shorter than batchsize at the end
    # do stuff with batch
Note that we're using the step argument of range to simplify the batch processing considerably.
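For instance (a small illustration with a toy worklist), a 7-item list and a batchsize of 3 produce slices of 3, 3, and 1:
worklist = list(range(7))
batchsize = 3
for i in range(0, len(worklist), batchsize):
    print(worklist[i:i + batchsize])
# [0, 1, 2]
# [3, 4, 5]
# [6]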
If you're working with sequences, the solution by #nneonneo is about as performant as you can get. If you want a solution which works with arbitrary iterables, you can look into some of the itertools recipes. e.g. grouper:
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)
I tend to not use this one because it "fills" the last group with None so that it is the same length as the others. I usually define my own variant which doesn't have this behavior:
def grouper2(iterable, n):
    iterable = iter(iterable)
    while True:
        tup = tuple(itertools.islice(iterable, 0, n))
        if tup:
            yield tup
        else:
            break
This yields tuples of the requested size. This is generally good enough, but, for a little fun we can write a generator which returns lazy iterables of the correct size if we really want to...
The "best" solution here I think depends a bit on the problem at hand -- particularly the size of the groups and objects in the original iterable and the type of the original iterable. Generally, these last 2 recipes will find less use because they're more complex and rarely needed. However, If you're feeling adventurous and in the mood for a little fun, read on!
The only real modification that we need to get a lazy iterable instead of a tuple is the ability to "peek" at the next value in the islice to see if there is anything there. here I just peek at the value -- If it's missing, StopIteration will be raised which will stop the generator just as if it had ended normally. If it's there, I put it back using itertools.chain:
def grouper3(iterable, n):
    iterable = iter(iterable)
    while True:
        group = itertools.islice(iterable, n)
        item = next(group)  # raises StopIteration if the group doesn't yield anything
        yield itertools.chain((item,), group)
Careful though, this last function only "works" if you completely exhaust each iterable yielded before moving on to the next one. In the extreme case where you don't exhaust any of the iterables, e.g. list(grouper3(..., n)), you'll get "m" iterables which yield only 1 item, not n (where "m" is the "length" of the input iterable). This behavior could actually be useful sometimes, but not typically. We can fix that too if we use the itertools "consume" recipe (which also requires importing collections in addition to itertools):
def grouper4(iterable, n):
    iterable = iter(iterable)
    group = []
    while True:
        collections.deque(group, maxlen=0)  # consume all of the last group
        group = itertools.islice(iterable, n)
        item = next(group)  # raises StopIteration if the group doesn't yield anything
        group = itertools.chain((item,), group)
        yield group
Of course, list(grouper4(..., n)) will return empty iterables -- Any value not pulled from the "group" before the next invocation of next (e.g. when the for loop cycles back to the start) will never get yielded.
I like #nneonneo's and #mgilson's answers, but doing this over and over again is tedious. The bottom of the itertools page in python3 mentions the library more-itertools (I know this question was about python2 and this is a python3 library, but some might find it useful). The following seems to do what you ask:
from more_itertools import chunked  # Note: you might also want to look at ichunked

for batch in chunked(records, 500):
    # Do the work -- `batch` is a list of 500 records (or less for the last batch).
Maybe something like this?
l = []
for ii, x in enumerate(records):
    data = {}
    for y in sObjectName.describe()['fields']:
        data[y['name']] = x[y['name']]
    l.append(data)
    if (ii + 1) % 500 == 0:      # flush a full batch of 500
        ls.append(adapter.insert_posts(collection, l))
        l = []
if l:                            # don't forget the final partial batch
    ls.append(adapter.insert_posts(collection, l))
I think one particular scenario is not covered here. Let's say the batch size is 100 and your list size is 103; the above answer might miss the last 3 elements.
list = [.....]  # 103 elements
total_size = len(list)
batch_size_count = 100

for start_index in range(0, total_size, batch_size_count):
    list[start_index : start_index + batch_size_count]  # slicing operation
The sliced list above can be sent to each method call to complete the execution for all the elements.

How to write a pager for Python iterators?

I'm looking for a way to "page through" a Python iterator. That is, I would like to wrap a given iterator iter and page_size with another iterator that would return the items from iter as a series of "pages". Each page would itself be an iterator with up to page_size iterations.
I looked through itertools and the closest thing I saw is itertools.islice. In some ways, what I'd like is the opposite of itertools.chain -- instead of chaining a series of iterators together into one iterator, I'd like to break an iterator up into a series of smaller iterators. I was expecting to find a paging function in itertools but couldn't locate one.
I came up with the following pager class and demonstration.
import itertools

class pager(object):
    """
    Takes the iterable iter and page_size to create an iterator that "pages through" iter.
    That is, pager returns a series of page iterators, each returning up to page_size items from iter.
    """
    def __init__(self, iter, page_size):
        self.iter = iter
        self.page_size = page_size

    def __iter__(self):
        return self

    def next(self):
        # if self.iter has not been exhausted, return the next slice
        # I'm using a technique from
        # https://stackoverflow.com/questions/1264319/need-to-add-an-element-at-the-start-of-an-iterator-in-python
        # to check for iterator completion by cloning self.iter into 3 copies:
        # 1) self.iter gets advanced to the next page
        # 2) peek is used to check on whether self.iter is done
        # 3) iter_for_return is to create an independent page of the iterator to be used by the caller of pager
        self.iter, peek, iter_for_return = itertools.tee(self.iter, 3)
        try:
            next_v = next(peek)
        except StopIteration:  # catch the exception and then raise it
            raise StopIteration
        else:
            # consume the page from the iterator so that the next page is up in the next iteration
            # is there a better way to do this?
            for i in itertools.islice(self.iter, self.page_size): pass
            return itertools.islice(iter_for_return, self.page_size)

iterator_size = 10
page_size = 3

my_pager = pager(xrange(iterator_size), page_size)

# skip a page, then print out the rest, and then show the first page
page1 = my_pager.next()

for page in my_pager:
    for i in page:
        print i
    print "----"

print "skipped first page: ", list(page1)
I'm looking for some feedback and have the following questions:
Is there a pager already in itertools that I'm overlooking?
Cloning self.iter 3 times seems kludgy to me. One clone is to check whether self.iter has any more items. I decided to go with a technique Alex Martelli suggested (aware that he wrote of a wrapping technique). The second clone was to enable the returned page to be independent of the internal iterator (self.iter). Is there a way to avoid making 3 clones?
Is there a better way to deal with the StopIteration exception beside catching it and then raising it again? I am tempted to not catch it at all and let it bubble up.
Thanks!
-Raymond
Look at grouper(), from the itertools recipes.
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
Why aren't you using this?
def grouper(page_size, iterable):
    page = []
    for item in iterable:
        page.append(item)
        if len(page) == page_size:
            yield page
            page = []
    yield page
"Each page would itself be an iterator with up to page_size" items. Each page is a simple list of items, which is iterable. You could use yield iter(page) to yield the iterator instead of the object, but I don't see how that improves anything.
It throws a standard StopIteration at the end.
What more would you want?
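For what it's worth, a quick run (my own check, not part of the original answer) shows the shape of the output; note that when the input length is an exact multiple of page_size, the final yield produces an empty page:
>>> list(grouper(3, range(7)))
[[0, 1, 2], [3, 4, 5], [6]]
>>> list(grouper(3, range(6)))
[[0, 1, 2], [3, 4, 5], []]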
I'd do it like this:
from itertools import izip_longest  # Python 2 (use zip_longest on Python 3)

def pager(iterable, page_size):
    args = [iter(iterable)] * page_size
    fillvalue = object()
    for group in izip_longest(fillvalue=fillvalue, *args):
        yield (elem for elem in group if elem is not fillvalue)
That way, None can be a legitimate value that the iterator spits out. Only the single object fillvalue is filtered out, and it cannot possibly be an element of the iterable.
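A quick demonstration (my own example, run under Python 2 where izip_longest lives in itertools) that None values pass through while only the padding is dropped:
>>> [list(page) for page in pager([1, None, 3, 4, 5], 2)]
[[1, None], [3, 4], [5]]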
Based on the pointer to the itertools recipe for grouper(), I came up with the following adaptation of grouper() to mimic pager. I wanted to filter out any None results and wanted to return an iterator rather than a tuple (though I suspect there might be little advantage in doing this conversion).
# based on http://docs.python.org/library/itertools.html#recipes
def grouper2(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    for item in izip_longest(fillvalue=fillvalue, *args):
        yield iter(filter(None, item))
I'd welcome feedback on what I can do to improve this code.
def group_by(iterable, size):
    """Group an iterable into lists that don't exceed the size given.

    >>> group_by([1,2,3,4,5], 2)
    [[1, 2], [3, 4], [5]]
    """
    sublist = []
    for index, item in enumerate(iterable):
        if index > 0 and index % size == 0:
            yield sublist
            sublist = []
        sublist.append(item)
    if sublist:
        yield sublist
more_itertools.chunked will do exactly what you're looking for:
>>> import more_itertools
>>> list(more_itertools.chunked([1, 2, 3, 4, 5, 6], 3))
[[1, 2, 3], [4, 5, 6]]
If you want the chunking without creating temporary lists, you can use more_itertools.ichunked.
That library also has lots of other nice options for efficiently grouping, windowing, slicing, etc.
