I have the following code
import itertools

number_list = (i for i in range(5))
permutations = (num for num in itertools.product(number_list, repeat=9))
This is generating an iterator called permutations which will hold all permutations of 9 characters within the number_list if I'm not mistaken. This can get pretty large for a big number_list.
I can iterate through permutations with next(permutations), but the problem is that it's sequential. I would like to be able to draw a random item from any part of the iterator. If it were a list, I could simply use random.choice(), but for a big number_list I don't have nearly enough memory or time for that.
I could also just use next() to store a list of X items and then shuffle them, but that won't work either: the whole thing can get so incredibly big that any such window would contain only similar outputs, so it wouldn't really be "random".
I was wondering: if it isn't possible to draw a random item from the iterator, is there an algorithm that lets me create an iterator which outputs a random item on each next() call, but which, by the time it is exhausted, will have gone through all the permutations without repeating any?
The final idea would be an iterator that spits out a random permutation of n characters from a list of i elements, with both n and i able to grow arbitrarily large without memory constraints, while guaranteeing that when the iterator eventually finishes (no matter when, even if it takes years in theory), all possible permutations will have been exhausted without repetition.
Firstly, your code does not generate permutations; it draws with replacement. Secondly, iterators (as the name suggests) are meant to ITERATE through some collection, not to jump to random places in it (of course, you can write your own __next__ method which does whatever you want; whether you want to call the resulting object an iterator is a philosophical question). Thirdly, producing random samples with replacement is a much-studied and widely implemented problem. See for example: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html
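If random access is all you need, one workable sketch is to treat each element of the product as a k-digit base-n number and decode a random index on demand (random_draw here is a hypothetical helper, not a library function):

import random

def random_draw(n, k):
    """Return one random k-tuple over range(n), drawn with replacement."""
    index = random.randrange(n ** k)  # pick a position in the product
    digits = []
    for _ in range(k):
        index, d = divmod(index, n)   # peel off one base-n digit
        digits.append(d)
    return tuple(reversed(digits))

print(random_draw(5, 9))  # e.g. (3, 0, 4, 1, 1, 2, 0, 4, 3)

Exhausting every element exactly once in random order would additionally require iterating a random permutation of range(n ** k); for huge n ** k that calls for something like format-preserving encryption rather than a shuffled list in memory.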
Related
In Python 2.7, I would like to verify whether a given list of elements is included in a longer nested list when comparing, let's say, only the first two elements.
Let's say we have a big list of nested elements (this big_list will have over 10k elements, so looping for every comparison is very inefficient and I'd like to avoid it). For this example, let's say we only have 4 nested lists in big_list:
big_list = ((2,3,5,6,7), (4,5,6,7,8), (6,7,8,8), (8,4,2,7))
If I have a single list, say (4,5,11,11,11), I am looking for an operation that returns True when compared against big_list, since the second tuple in big_list starts with (4,5,...) and matches the first two elements of my single_list. Essentially I want to know whether the first two elements of a single list (e.g. (4,5,11,11,11)) appear in my big list, regardless of the numbers that follow (e.g. 11,11, ...).
The operation should also return False if another single_list (e.g. (4,8,11,11,11)) does not match the first two elements of anything in big_list.
I hope this is clearer. Any help?
Thanks in advance,
Since you have a huge list, each search by looping is O(n); to avoid iterating over the whole thing every time, you can do a constant-time lookup using a set.
tup_truth_set = {tup[:2] for tup in big_list}  # set of the two-element prefixes of interest
then you would simply do something like this to check in constant time:
tuple_of_interest[:2] in tup_truth_set
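Putting it together with the data from the question:

big_list = ((2, 3, 5, 6, 7), (4, 5, 6, 7, 8), (6, 7, 8, 8), (8, 4, 2, 7))
tup_truth_set = {tup[:2] for tup in big_list}

print((4, 5, 11, 11, 11)[:2] in tup_truth_set)  # True:  (4, 5) is a prefix in big_list
print((4, 8, 11, 11, 11)[:2] in tup_truth_set)  # False: (4, 8) is not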
I don't think you can avoid a loop over your list. Even if you don't write the loop yourself and there were some built-in function (that I am not aware of) which does what you are asking, I am pretty sure it would loop over the list in the background. So I suggest a single line of code to do it, loop included, obviously:
(4,5,11,11,11)[:2] in [i[:2] for i in big_list]
lt = 1000 #list primes to ...
remaining = list(range(2, lt + 1)) #remaining primes
for c in remaining: #current "prime" being tested
    for t in remaining[0: remaining.index(c)]: #test divisor
        if c % t == 0 and c != t:
            if c in remaining:
                remaining.remove(c)
If you don't need context:
How can I either re-run the same target-list value, or use something other than for that reads the expression list every iteration?
If you need context:
I am currently creating a program that lists primes from 2 to a given value (lt). I have a list, remaining, that starts as all integers from 2 to the given value. One at a time, the program takes a value c from the list and tests it for divisibility by each smaller number t on the list. If c is divisible by t, c is removed from the list. In theory, only primes remain by the end, but I have run into a problem: because I am removing items from the list, and for reads remaining only once, for skips values in remaining and thus leaves composites in the list.
What you're trying to do is almost never the right answer (and it's definitely not the right answer here, for reasons I'll get to later), which is why Python doesn't give you a way to do it automatically. In fact, it's illegal to delete from or insert into a list while you're iterating over it, even if CPython and other Python implementations usually don't check for that error.
But there is a way you can simulate what you want, with a little verbosity:
for i in range(remaining.index(c)):
    if i >= remaining.index(c): break
    t = remaining[i]
Now we're not iterating over remaining, we're iterating over its indices. So, if we remove values, we'll be iterating over the indices of the modified list. (Of course we're not really relying on the range there, since the if…break tests the same thing; if you prefer for i in itertools.count():, that will work too.)
And, depending on what you want to do, you can expand it in different ways, such as:
end = remaining.index(c)
for i in range(end):
    if i >= end: break
    t = remaining[i]
    # possibly subtract from end within the loop
    # so we don't have to recalculate remaining.index(c)
… and so on.
However, as I mentioned at the top, this is really not what you want to be doing. If you look at your code, it's not only looping over all the primes less than c, it's calling a bunch of functions inside that loop that also loop over either all the primes less than c or your entire list (that's how index, remove, and in work for lists), meaning you're turning linear work into quadratic work.
The simplest way around this is to stop trying to mutate the original list to remove composite numbers, and instead build a set of primes as you go along. You can search, add, and remove from a set in constant time. And you can just iterate your list in the obvious way because you're no longer mutating it.
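A minimal sketch of that idea; here the found primes go in a plain list, since we only iterate them for the divisibility test (a set works the same way when membership lookups are needed), but the key point stands: the sequence being looped over is never mutated.

lt = 1000
primes = []
for c in range(2, lt + 1):
    if all(c % p for p in primes):  # no smaller prime divides c
        primes.append(c)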
Finally, this isn't actually implementing a proper prime sieve, but a much less efficient algorithm that for some reason everyone has been teaching as a Scheme example for decades and more recently translating into other languages. See The Genuine Sieve of Eratosthenes for details, or this project for sample code in Python and Ruby that shows how to implement a proper sieve and a bit of commentary on performance tradeoffs.
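For reference, a genuine sieve crosses off multiples instead of trial-dividing each candidate; a minimal sketch (not the linked project's code):

def sieve(limit):
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for m in range(n * n, limit + 1, n):  # n*n, n*n + n, ...
                is_prime[m] = False
    return [n for n in range(limit + 1) if is_prime[n]]

print(sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]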
(In the following, I ignore the XY problem of finding primes using a "mutable for".)
It's not entirely trivial to design an iteration over a sequence with well-defined (and efficient) behavior when the sequence is modified. In your case, where the sequence is merely being depleted, one reasonable thing to do is to use a list but "delete" elements by replacing them with a special value. (This makes it easy to preserve the current iteration position and avoids the cost of shifting the subsequent elements.)
To make it efficient to skip the deleted elements (both for the outer iteration and any inner iterations like in your example), the special value should be (or contain) a count of any following deleted elements. Note that there is a special case of deleting the current element, where for maximum efficiency you must move the cursor while you still know how far to move.
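A simplified sketch of the tombstone idea, using a bare sentinel and skipping the skip-count optimization described above:

DELETED = object()  # unique sentinel standing in for removed items

lt = 1000
remaining = list(range(2, lt + 1))
for i, c in enumerate(remaining):
    if c is DELETED:
        continue
    for t in remaining[:i]:
        if t is not DELETED and c % t == 0:
            remaining[i] = DELETED  # overwrite in place instead of list.remove()
            break
primes = [n for n in remaining if n is not DELETED]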
I was told that if you'd like to iterate over the list multiple times, it's probably better to use range. This is because xrange has to generate an integer object every time you access an index, whereas range is a static list and the integers are already "there" to use.
So I deduce that the list created by the range method remains in memory long enough to be re-iterated. When will that list be destroyed?
In Python 2, range(n) returns a list of numbers from 0 up to n-1.
If you code
for i in range(1000000):
then your program creates a list of a million integers and throws it away when the for statement ends. That list has to be constructed in memory, and that can be expensive if your code does it often.
If you code
for i in xrange(1000000):
then your program doesn't create a list of a million integers: instead you get a sort of iterator. But it will still be destroyed at the end of the for statement. The xrange() function dates from Python 1, before the language had iterators as first-class constructs, but the idea is much the same. It is not a data structure but a chunk of code that returns the next number when you ask for it, instead of generating them all beforehand.
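A quick way to see the difference in Python 2 (the exact sizes are illustrative and platform-dependent):

import sys

r = range(1000000)    # a real list, built up front
xr = xrange(1000000)  # a lazy object that computes each value on demand

print(sys.getsizeof(r))   # several megabytes of pointers on a 64-bit build
print(sys.getsizeof(xr))  # a few dozen bytes, regardless of the range
print(r[10] == xr[10])    # True: both produce the same numbers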
So your deduction "that the list created by the range method remains in memory enough to be re-iterated" is incorrect. (Though I think you meant xrange not range. Either way, it's wrong.)
If you want a range()-generated list to persist in memory, do this:
myrange = range(1000000)
Then you can do
for i in myrange:
as many times as you like and you will pay the overhead of creating a list of a million integers only once.
If your range is of the order of dozens or hundreds, not millions, and you need to ask this question, then you should not be fretting about efficiency.
I have a list which I shuffle with the Python built-in shuffle function (random.shuffle).
However, the Python reference states:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.
Now, I wonder what this "rather small len(x)" means. 100, 1000, 10000,...
TL;DR: It "breaks" on lists with over 2080 elements, but don't worry too much :)
Complete answer:
First of all, notice that "shuffling" a list can be understood (conceptually) as generating all possible permutations of the elements of the lists, and picking one of these permutations at random.
Then, you must remember that all self-contained computerised random number generators are actually "pseudo" random. That is, they are not actually random, but rely on a series of factors to try to generate a number that is hard to guess in advance or to reproduce on purpose. Among these factors is usually the previously generated number. So, in practice, if you use a random generator continuously a certain number of times, you'll eventually start getting the same sequence all over again (this is the "period" that the documentation refers to).
Finally, the docstring on Lib/random.py (the random module) says that "The period [of the random number generator] is 2**19937-1."
So, given all that, if your list is such that there are 2**19937 or more permutations, some of these will never be obtained by shuffling the list. You'd (again, conceptually) generate all permutations of the list, then generate a random number x, and pick the xth permutation. Next time, you generate another random number y, and pick the yth permutation. And so on. But, since there are more permutations than you'll get random numbers (because, at most after 2**19937-1 generated numbers, you'll start getting the same ones again), you'll start picking the same permutations again.
So, you see, it's not exactly a matter of how long your list is (though that does enter into the equation). Also, 2**19937-1 is quite a long number. But, still, depending on your shuffling needs, you should bear all that in mind. As a simple case (and with a quick calculation), for a list without repeated elements, 2081 elements would yield 2081! permutations, which is more than 2**19937.
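If you want to derive that boundary yourself, here's a quick sketch that builds the factorial incrementally so it stays fast:

period = 2 ** 19937 - 1  # Mersenne Twister period, from the docs
f, n = 1, 1              # invariant: f == n!
while f * (n + 1) <= period:
    n += 1
    f *= n
print(n)  # 2080: the longest list whose permutations all fit within the period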
I wrote that comment in the Python source originally, so maybe I can clarify ;-)
When the comment was introduced, Python's Wichmann-Hill generator had a much shorter period, and we couldn't even generate all the permutations of a deck of cards.
The period is astronomically larger now, and 2080 is correct for the current upper bound. The docs could be beefed up to say more about that - but they'd get awfully tedious.
There's a very simple explanation: A PRNG of period P has P possible starting states. The starting state wholly determines the permutation produced. Therefore a PRNG of period P cannot generate more than P distinct permutations (and that's an absolute upper bound - it may not be achieved). That's why comparing N! to P is the correct computation here. And, indeed:
>>> import math
>>> math.factorial(2080) > 2**19937 - 1
False
>>> math.factorial(2081) > 2**19937 - 1
True
What they mean is that the number of permutations of n objects (written n!) grows absurdly fast.
Basically n! = n × (n-1) × ... × 1; for example, 5! = 5 × 4 × 3 × 2 × 1 = 120, which means there are 120 possible ways of shuffling a 5-item list.
The same Python documentation page gives 2**19937-1 as the period, which is about 4.3 × 10^6001. Based on the Wikipedia page on factorials, I guess 2000! should be around that. (Sorry, I didn't find the exact figure.)
So basically there are so many possible permutations the shuffle can draw from that there's probably no real reason to worry about the ones it can't produce.
But if it really is an issue (a pesky customer asking for a guarantee of randomness, perhaps?), you could also offload the task to some third party; see http://www.random.org/ for example.
I'm looking for a way to reverse a generator object. I know how to reverse sequences:
from itertools import imap  # Python 2

foo = imap(seq.__getitem__, xrange(len(seq)-1, -1, -1))
But is something similar possible with a generator as the input and a reversed generator as the output (len(seq) stays the same, so the value from the original sequence can be used)?
You cannot reverse a generator in any generic way except by casting it to a sequence and creating an iterator from that. Later terms of a generator cannot necessarily be known until the earlier ones have been calculated.
Even worse, you can't know whether your generator will ever hit StopIteration until you hit it, so there's no way to know whether there will even be a first term in your sequence.
The best you could do would be to write a reversed_iterator function:
def reversed_iterator(it):  # renamed from `iter` to avoid shadowing the built-in
    return reversed(list(it))
EDIT: You could also, of course, replace reversed in this with your imap-based iterative version, to save one list creation.
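For example:

squares = (x * x for x in range(5))
print(list(reversed_iterator(squares)))  # [16, 9, 4, 1, 0]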
reversed(list(input_generator)) is probably the easiest way.
There's no way to get a generator's values in "reverse" order without gathering all of them into a sequence first, because generating the second item could very well rely on the first having been generated.
You have to walk through the generator anyway to get the first item so you might as well make a list. Try
reversed(list(g))
where g is a generator.
reversed(tuple(g))
would work as well (I didn't check to see if there is a significant difference in performance).
def reverseGenerator(gen):
    new = [i for i in gen]
    if not new:  # base case: nothing left to yield (the original raised IndexError here)
        return
    yield new[-1]  # the last element comes out first
    new.pop()
    yield from reverseGenerator(i for i in new)
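For example (note that this recursive version rebuilds a list at every level and nests one generator per element, so long inputs will hit Python's recursion limit):

print(list(reverseGenerator(i for i in range(5))))  # [4, 3, 2, 1, 0]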