How does heapq.merge() work with infinite generators? - python

I want to understand how heapq.merge() works with infinite generators. Consider this example:
>>> from heapq import merge
>>> from itertools import count
>>> m = merge(count(0, 2), count(1, 2))
>>> for _ in range(10):
...     print(next(m))
...
0
1
2
3
4
5
6
7
8
9
The docs state that it does not pull the data into memory all at once. But how does it consume each of the infinite generators?

A very simple implementation of such a function could look like the following. Note, though, that for the sake of simplicity this does not handle any special (and not-so-special) cases like empty or exhausted iterables.
import heapq

def merge(*iterables):
    # Take the first value from each iterable, remembering which iterable it came from.
    heap = [(next(it), i) for i, it in enumerate(iterables)]
    heapq.heapify(heap)
    while heap:
        val, i = heapq.heappop(heap)
        yield val
        # Refill the heap with the next value from the same iterable we just yielded from.
        heapq.heappush(heap, (next(iterables[i]), i))
It works like this:
1. Get the first element from each sorted iterable, together with that iterable's index in the list, and put these pairs on a heap.
2. Pop the smallest pair off the heap and yield its value.
3. Push onto the heap the next element from the iterable whose index was just popped, then repeat from step 2.
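For example, feeding the simplified version above the same two infinite count() generators from the question produces the same stream as heapq.merge, one value at a time (a quick sketch):

from itertools import count

m = merge(count(0, 2), count(1, 2))   # the simplified merge defined above
print([next(m) for _ in range(10)])   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]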
The actual implementation is a bit more involved, but seems to work roughly along the same lines. You can get the location of your local source with heapq.__file__, which on my system is /usr/lib/python3.6/heapq.py, and check yourself.

Related

Finite permutations of a list python

I have a list and would like to generate a finite number of permutations with no repeated elements.
itertools.permutations(x)
gives all possible orderings, but I only need a specific number of permutations. (My initial list contains ~200 elements, so 200! permutations would take an unreasonable amount of time, and I don't need all of them.)
what I have done so far
import random

def createList(My_List):
    New_List = random.sample(My_List, len(My_List))
    return New_List

def createManyList(My_List, Nb_of_Lists):
    list_of_list = []
    for i in range(0, Nb_of_Lists):
        list_of_list.append(createList(My_List))
    return list_of_list
It's working, but my list_of_list will not contain unique permutations, or at least I have no guarantee about it.
Is there any way around to do so? Thanks
Just use islice, which allows you to take a number of elements from an iterable:
from itertools import permutations, islice
n_elements = 1000
list(islice(permutations(x), n_elements))
This will return a list of (the first) 1000 permutations.
The reason this works is that permutations returns an iterator, which is an object that generates values to return as they are needed, not immediately. Therefore, the process goes something like this:
The calling function (in this case, list) asks for the next value from islice
islice checks if 1000 values have been returned; if not, it asks for the next value from permutations
permutations returns the next value, in order
Because of this, the full list of permutations never needs to be generated; we take only as many as we want.
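You can check the laziness yourself: slicing a few permutations out of a 200-element list returns instantly, even though materialising all 200! permutations would be impossible. A small sketch:

from itertools import permutations, islice

x = list(range(200))
first_three = list(islice(permutations(x), 3))  # returns immediately; the full set is never built
print(len(first_three))   # 3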
You can do:
i = 0
list_of_list = []
while i < Nb_of_Lists:
    candidate = createList(My_List)
    if candidate not in list_of_list:
        list_of_list.append(candidate)
        i += 1
This will check whether that permutation was already used, and only append it (and advance the counter) when it is new.
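As a variant (not part of the original answer): if the elements of My_List are hashable, tracking the permutations already produced in a set avoids rescanning the whole result list on every attempt. The function name here is purely illustrative:

import random

def create_unique_lists(my_list, nb_of_lists):
    seen = set()            # permutations generated so far, stored as tuples
    result = []
    while len(result) < nb_of_lists:
        candidate = tuple(random.sample(my_list, len(my_list)))
        if candidate not in seen:
            seen.add(candidate)
            result.append(list(candidate))
    return result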
You don't need to roll your own permutations. You just need to halt the generator once you have enough:
# python 2.7
import random
import itertools

def createList(My_List):
    New_List = random.sample(My_List, len(My_List))
    return New_List

x = createList(xrange(20))

def getFirst200():
    for i, result in enumerate(itertools.permutations(x)):
        if i == 200:
            return  # stop the generator once 200 permutations have been yielded
        yield result

print list(getFirst200())  # print first 200 of the result
This is faster and more memory efficient than the "generate the full set, then take the first 200" approach.

Python iterator and zip

With x = [1,2,3,4], I can get an iterator from i = iter(x).
With this iterator, I can use the zip function to create tuples of two items.
>>> i = iter(x)
>>> zip(i,i)
[(1, 2), (3, 4)]
I can even use this syntax to get the same result.
>>> zip(*[i] * 2)
[(1, 2), (3, 4)]
How does this work? How do zip(i, i) and zip(*[i] * 2) behave with an iterator?
An iterator is like a stream of items. You can only look at the items in the stream one at a time and you only ever have access to the first element. To look at something in the stream, you need to remove it from the stream and once you take something from the top of the stream, it's gone from the stream for good.
When you call zip(i, i), zip first looks at the first stream and takes an item out. Then it looks at the second stream (which happens to be the same stream as the first one) and takes an item out. Then it makes a tuple out of those two items and repeats this over and over until there is nothing left in the stream.
Maybe it's easier to see if I were to write the zip function in pure python (with only 2 arguments for simplicity). It would look something like this[1]:
def zip(a, b):
    out = []
    try:
        while True:
            item1 = next(a)
            item2 = next(b)
            out.append((item1, item2))
    except StopIteration:
        return out
Now imagine the case that you are talking about where a and b are the same object. In that case, we just call next twice on the iterator (i in your example case) which will just take the first two items from i in sequence and pack them into a tuple.
Once we've understood why zip(i, i) behaves the way it does, zip(*([i] * 2)) isn't too hard. Let's read the expression from the inside out...
[i] * 2
That just creates a new list (of length 2) where both of the elements are references to the iterator i. So it's the same thing as zip(*[i, i]) (it's just more convenient to write when you want to repeat something many more than 2 times). * unpacking is a common idiom in python and you can find more information in the python tutorial. The gist of it is that python takes the iterable and "unpacks" it as if each item of the iterable was a separate positional argument to the function. So:
zip(*[i, i])
does the same thing as:
zip(i, i)
And now Bob's our uncle. We've just come full-circle since zip(i, i) is where this discussion started.
[1] This example code is simplified in more ways than just the afore-mentioned restriction to 2 arguments. For example, the real zip is going to call iter on its input arguments so that it works for any iterable (not just iterators), but this should be enough to get the point across...
Every time you take an item from an iterator, it moves forward; it never "rewinds." So zip(i, i) gets the first item from i, then the second item from i, and returns that as a tuple. It continues to do this for each available pair, until the iterator is exhausted.
zip(*[i]*2) creates a list of [i, i] by multiplying i by 2, then unpacks it with the * at the far left, which, in effect, sends two arguments i and i to zip, producing the same result as the first snippet.
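The same trick is commonly used to chunk a sequence into fixed-size groups, since every tuple that zip builds consumes n consecutive items from the single shared iterator. A small sketch (wrapped in list() so the result is also visible on Python 3):

x = [1, 2, 3, 4, 5, 6]
i = iter(x)
print(list(zip(*[i] * 3)))   # [(1, 2, 3), (4, 5, 6)]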

How to get a zip of all characters in a string. zip misses out on final characters and itertools.zip_longest adds none [duplicate]

I am passing the result of itertools.zip_longest to itertools.product, however I get errors when it gets to the end and finds None.
The error I get is:
Error: TypeError('sequence item 0: expected str instance, NoneType found',)
If I use zip instead of itertools.zip_longest then I don't get all the items.
Here is the code I am using to generate the zip:
def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    print(args)
    #return zip(*args)
    return itertools.zip_longest(*args)

sCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~`!##$%^&*()_-+={[}]|\"""':;?/>.<,"

for x in grouper(sCharacters, 4):
    print(x)
Here is the output: the first run uses itertools.zip_longest and the second plain zip. You can see the first is padded with None items, and the second is missing the final item, the comma ','.
How can I get a zip of all characters in a string without the None values at the end?
Or how can I avoid this error?
Thanks for your time.
I've had to solve this in a performance-critical case before, so here is the fastest code I've found for doing this (it works no matter what values are in the iterable):
from itertools import zip_longest

def grouper(n, iterable):
    fillvalue = object()  # Guaranteed unique sentinel, cannot exist in iterable
    for tup in zip_longest(*(iter(iterable),) * n, fillvalue=fillvalue):
        if tup[-1] is fillvalue:
            yield tuple(v for v in tup if v is not fillvalue)
        else:
            yield tup
The above is, as far as I can tell, unbeatable when the input is long enough and the chunk sizes are small enough. For cases where the chunk size is fairly large, it can lose out to this even uglier approach, but usually not by much:
from future_builtins import map  # Only on Py2, and required there
from itertools import islice, repeat, starmap, takewhile
from operator import truth  # Faster than bool when guaranteed non-empty call

def grouper(n, iterable):
    '''Returns a generator yielding n sized groups from iterable

    For iterables not evenly divisible by n, the final group will be undersized.
    '''
    # Can add tests to special case other types if you like, or just
    # use tuple unconditionally to match `zip`
    rettype = ''.join if type(iterable) is str else tuple
    # Keep islicing n items and converting to groups until we hit an empty slice
    return takewhile(truth, map(rettype, starmap(islice, repeat((iter(iterable), n)))))
Either approach seamlessly leaves the final group undersized if there aren't sufficient items to complete it. It runs extremely fast because literally all of the work is pushed to the C layer in CPython after "set up", so however long the iterable is, the Python-level work is the same; only the C-level work increases. That said, it does a lot of C work, which is why the zip_longest solution (which does much less C work, and only trivial Python-level work for all but the final chunk) usually beats it.
The slower, but more readable equivalent code to option #2 (but skipping the dynamic return type in favor of just tuple) is:
def grouper(n, iterable):
    iterable = iter(iterable)
    while True:
        x = tuple(islice(iterable, n))
        if not x:
            return
        yield x
Or more succinctly with Python 3.8+'s walrus operator:
def grouper(n, iterable):
    iterable = iter(iterable)
    while x := tuple(islice(iterable, n)):
        yield x
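A quick usage example with the last version above (the sentinel-based grouper yields the same groups):

print(list(grouper(4, "abcdefghij")))
# [('a', 'b', 'c', 'd'), ('e', 'f', 'g', 'h'), ('i', 'j')]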
The length of sCharacters is 93 (note that 92 % 4 == 0). Since zip stops as soon as the shortest input sequence is exhausted, it drops the final, incomplete group and therefore misses the last element.
Beware that the Nones added by itertools.zip_longest are artificial padding values, which may not be the desired behaviour for everyone. That's why zip simply ignores the leftover values.
EDIT:
To be able to use zip, you could pad your string with whitespace:
n = 4
sCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789~`!##$%^&*()_-+={[}]|\"""':;?/>.<,"
if len(sCharacters) % n > 0:
    sCharacters = sCharacters + (" " * (n - len(sCharacters) % n))
EDIT2:
To obtain the missing tail when using zip, use code like this:
tail = '' if len(sCharacters)%n == 0 else sCharacters[-(len(sCharacters)%n):]
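Putting these pieces together, a sketch that uses plain zip for the complete groups and then appends the leftover characters as a final, shorter group (assuming sCharacters is the original, unpadded string):

n = 4
groups = list(zip(*[iter(sCharacters)] * n))   # only the complete groups of n
tail = '' if len(sCharacters) % n == 0 else sCharacters[-(len(sCharacters) % n):]
if tail:
    groups.append(tuple(tail))                 # e.g. (',',) for the 93-character string
for g in groups:
    print(g)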

index into a circular array

I have a circular array. I created it with the following:
from itertools import cycle
myArray = ['a','b','c','d']
pool = cycle(myArray)
Now I want to print the nth item in pool, where n could be larger than 4. Normally this would be a simple use of the modulo function, but I assume Python has a method that knows the number of elements in the pool (4 in this example) and applies the modulo automatically.
For example the 1st and 5th item is 'a'. So I'm hoping for, logically, the equivalent of pool[0] and pool[4] giving me 'a'.
Is there such a method?
No, there's no built-in method to accomplish what you're attempting to do. As suggested earlier, you could use zip, but that would involve indexing into the result based on your sequence, as well as generating n elements out to the item you want.
Sometimes the simplest approach is the clearest. Use modulo to accomplish what you're after.
def fetch_circular(n):
    myArray = ['a', 'b', 'c', 'd']
    return myArray[n % len(myArray)]
I think you may be confusing arrays with generators.
The modulo function of an array is the way to go, in terms of performance.
cycle is a function which generates elements as they are requested. It is not a Cycle class with convenient methods. You can see the equivalent implementation in the documentation, and you'll probably understand the idea behind it:
https://docs.python.org/2/library/itertools.html#itertools.cycle
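A sketch of that roughly equivalent generator, following the itertools documentation, makes it clear that cycle hands out elements one at a time and does not support indexing at all:

def cycle(iterable):
    # cycle('ABCD') --> A B C D A B C D ...
    saved = []
    for element in iterable:
        yield element
        saved.append(element)
    while saved:
        for element in saved:
            yield element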
A list is definitely the way to go, but if you actually had a cycle object and wanted the nth item wrapping around, you could use islice:
from itertools import cycle, islice

myArray = ['a', 'b', 'c', 'd']
pool = cycle(myArray)
print(next(islice(pool, 4, 5)))  # item at index 4 (the 5th item), wrapping around
a
Note that once you call next(islice(...)) you have started consuming the cycle; if you actually want to be constantly rotating, you may actually want a collections.deque.
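If rotating is really what you want, here is a small sketch with collections.deque; rotate(-n) shifts the deque left by n positions, wrapping around its length:

from collections import deque

d = deque(['a', 'b', 'c', 'd'])
d.rotate(-4)      # rotate left by 4: a full rotation on a 4-element deque
print(d[0])       # 'a' -- the 5th item, counting from 1 and wrapping around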
Your pool object is already an iterator that will keep looping through myArray forever, so all you need to do is zip a range with pool, like this:
>>> pool = cycle(myArray)
>>> for i, item in zip(range(10), pool):
...     print i, item
...
0 a
1 b
2 c
3 d
4 a
5 b
6 c
7 d
8 a
9 b
>>>

Interleaving multiple iterables randomly while preserving their order in python

Inspired by this earlier stack overflow question I have been considering how to randomly interleave iterables in python while preserving the order of elements within each iterable. For example:
>>> def interleave(*iterables):
...     "Return the source iterables randomly interleaved"
...     <insert magic here>
>>> interleave(xrange(1, 5), xrange(5, 10), xrange(10, 15))
[1, 5, 10, 11, 2, 6, 3, 12, 4, 13, 7, 14, 8, 9]
The original question asked to randomly interleave two lists, a and b, and the accepted solution was:
>>> c = [x.pop(0) for x in random.sample([a]*len(a) + [b]*len(b), len(a)+len(b))]
However, this solution works for only two lists (though it can easily be extended) and relies on the fact that a and b are lists so that pop() and len() can be called on them, meaning it cannot be used with arbitrary iterables. It also has the unfortunate side effect of emptying the source lists a and b.
Alternate answers given for the original question take copies of the source lists to avoid modifying them, but this strikes me as inefficient, especially if the source lists are sizeable. The alternate answers also make use of len() and therefore cannot be used on mere iterables.
I wrote my own solution that works for any number of input lists and doesn't modify them:
def interleave(*args):
    iters = [i for i, b in ((iter(a), a) for a in args) for _ in xrange(len(b))]
    random.shuffle(iters)
    return map(next, iters)
but this solution also relies on the source arguments being lists so that len() can be used on them.
So, is there an efficient way to randomly interleave iterables in python, preserving the original order of elements, which doesn't require knowledge of the length of the iterables ahead of time and doesn't take copies of the iterables?
Edit: Please note that, as with the original question, I don't need the randomisation to be fair.
Here is one way to do it using a generator:
import random
def interleave(*args):
    iters = map(iter, args)
    while iters:
        it = random.choice(iters)
        try:
            yield next(it)
        except StopIteration:
            iters.remove(it)
print list(interleave(xrange(1, 5), xrange(5, 10), xrange(10, 15)))
Not if you want it to be "fair".
Imagine you have a list containing one million items and another containing just two items. A "fair" randomization would have the first element from the short list occurring at about index 300000 or so.
a,a,a,a,a,a,a,...,a,a,a,b,a,a,a,....
^
But there's no way to know in advance until you know the length of the lists.
If you just take from each list with 50% (1/n) probability then it can be done without knowing the lengths of the lists but you'll get something more like this:
a,a,b,a,b,a,a,a,a,a,a,a,a,a,a,a,...
^ ^
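A quick simulation (a sketch, not part of the original answers) illustrates the point: when a single 'b' is interleaved into a long run of 'a's by picking one of the two sources with equal probability at every step, the 'b' lands near the very front almost every time instead of near the middle.

import random

def mean_b_position(long_len=10000, trials=1000):
    positions = []
    for _ in range(trials):
        remaining_a, index = long_len, 0
        # Keep emitting 'a's until the coin flip (or exhaustion of 'a') picks 'b'.
        while random.random() < 0.5 and remaining_a:
            remaining_a -= 1
            index += 1
        positions.append(index)   # index at which 'b' was emitted
    return sum(positions) / float(trials)

print(mean_b_position())   # typically around 1, far from the "fair" expectation of ~5000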
I am satisfied that the solution provided by aix meets the requirements of the question. However, after reading the comments by Mark Byers I wanted to see just how "unfair" the solution was.
Furthermore, sometime after I wrote this question, stack overflow user EOL posted another solution to the original question which yields a "fair" result. EOL's solution is:
>>> a.reverse()
>>> b.reverse()
>>> [(a if random.randrange(0, len(a)+len(b)) < len(a) else b).pop()
... for _ in xrange(len(a)+len(b))]
I also further enhanced my own solution so that it does not rely on its arguments supporting len() but does make copies of the source iterables:
def interleave(*args):
    iters = sum(([iter(list_arg)]*len(list_arg) for list_arg in map(list, args)), [])
    random.shuffle(iters)
    return map(next, iters)
or, written differently:
def interleave(*args):
    iters = [i for i, j in ((iter(k), k) for k in map(list, args)) for _ in j]
    random.shuffle(iters)
    return map(next, iters)
I then tested the accepted solution to the original question, written by F.J and reproduced in my question above, to the solutions of aix, EOL and my own. The test involved interleaving a list of 30000 elements with a single element list (the sentinel). I repeated the test 1000 times and the following table shows, for each algorithm, the minimum, maximum and mean index of the sentinel after interleaving, along with the total time taken. We would expect a "fair" algorithm to produce a mean of approx. 15,000:
algo      min    max     mean    total_seconds
----      ---    ---     ----    -------------
F.J         5  29952  14626.3            152.1
aix         0      8      0.9             27.5
EOL        45  29972  15091.0             61.2
srgerg     23  29978  14961.6             18.6
As can be seen from the results, the algorithms of F.J, EOL and srgerg each produce ostensibly "fair" results (at least under the given test conditions). However, aix's algorithm always placed the sentinel within the first 10 elements of the result. I repeated the experiment several times with similar results.
So Mark Byers is proved correct. If a truly random interleaving is desired, the length of the source iterables will need to be known ahead of time, or copies will need to be made so the length can be determined.
