Maximal Length of List to Shuffle with Python random.shuffle?

I have a list which I shuffle with Python's built-in shuffle function (random.shuffle).
However, the Python reference states:
Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.
Now, I wonder what this "rather small len(x)" means: 100, 1000, 10000, ...?

TL;DR: It "breaks" on lists with over 2080 elements, but don't worry too much :)
Complete answer:
First of all, notice that "shuffling" a list can be understood (conceptually) as generating all possible permutations of the elements of the list, and picking one of these permutations at random.
Then, you must remember that all self-contained computerised random number generators are actually "pseudo" random. That is, they are not actually random, but rely on a series of factors to try and generate a number that is hard to guess in advance, or to purposefully reproduce. Among these factors is usually the previously generated number. So, in practice, if you use a random generator continuously a certain number of times, you'll eventually start getting the same sequence all over again (this is the "period" that the documentation refers to).
Finally, the docstring on Lib/random.py (the random module) says that "The period [of the random number generator] is 2**19937-1."
So, given all that, if your list is such that there are 2**19937 or more permutations, some of these will never be obtained by shuffling the list. You'd (again, conceptually) generate all permutations of the list, then generate a random number x, and pick the xth permutation. Next time, you generate another random number y, and pick the yth permutation. And so on. But, since there are more permutations than you'll get random numbers (because, at most after 2**19937-1 generated numbers, you'll start getting the same ones again), you'll start picking the same permutations again.
So, you see, it's not exactly a matter of how long your list is (though that does enter into the equation). Also, 2**19937-1 is quite a long number. But, still, depending on your shuffling needs, you should bear all that in mind. In a simple case (and with a quick calculation), for a list without repeated elements, 2081 elements would yield 2081! permutations, which is more than 2**19937.

I wrote that comment in the Python source originally, so maybe I can clarify ;-)
When the comment was introduced, Python's Wichmann-Hill generator had a much shorter period, and we couldn't even generate all the permutations of a deck of cards.
The period is astronomically larger now, and 2080 is correct for the current upper bound. The docs could be beefed up to say more about that - but they'd get awfully tedious.
There's a very simple explanation: A PRNG of period P has P possible starting states. The starting state wholly determines the permutation produced. Therefore a PRNG of period P cannot generate more than P distinct permutations (and that's an absolute upper bound - it may not be achieved). That's why comparing N! to P is the correct computation here. And, indeed:
>>> import math
>>> math.factorial(2080) > 2**19937 - 1
False
>>> math.factorial(2081) > 2**19937 - 1
True

What they mean is that the number of permutations of n objects (written n!) grows absurdly fast.
Basically n! = n × (n-1) × ... × 1; for example, 5! = 5 × 4 × 3 × 2 × 1 = 120, which means there are 120 possible ways of shuffling a 5-item list.
On the same Python documentation page they give 2^19937-1 as the period, which is roughly 4.3 × 10^6001. Based on the Wikipedia page on factorials, I'd guess 2000! should be around that. (Sorry, I didn't find the exact figure.)
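A quick way to pin down the exact crossover with the standard library (my own check, not part of the original answer):
import math

period = 2**19937 - 1  # Mersenne Twister period quoted in the random module docs
n = 1
while math.factorial(n) <= period:
    n += 1
print(n)  # 2081 -- the smallest list length whose factorial exceeds the period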
So basically there are so many possible permutations for the shuffle to pick from that there's probably no real reason to worry about the ones it will never produce.
But if it really is an issue (a pesky customer asking for a guarantee of randomness, perhaps?), you could also offload the task to some third party; see http://www.random.org/ for example.

Related

Recursive python matching algorithm based on subsets working too slowly

I'm building a web app to match high school students considering a gap year to students who have taken a gap year, based on interest as denoted by tags. A prototype is up at covidgapyears.com. I have never written a matching/recommendation algorithm, so though people have suggested things like collaborative filtering and association rule mining, or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset (few hundred users right now, few thousand soon). So I wrote my own alg using common sense.
It essentially takes in a list of tags that the student is interested in, then searches for an exact match of those tags with someone who has taken a gap year and registered with the site (who also selected tags on registration). An exactMatch, as given below, is when the tags the user specifies are ALL contained by some profile (i.e., are a subset). If it can't find an exact match with ALL of the user's inputted tags, it will check all (n-1)-length subsets of the tags list itself to see if any less selective queries have matches. It does this recursively until at least 3 matches are found. While it works fine for small tag selections (up to 5-7), it gets slow for larger tag selections (7-13), taking several seconds to return a result. When 11-13 tags are selected, it hits a Heroku error due to worker timeout.
I did some tests by putting variables inside the algorithm to count computations, and it seems that when it goes a bit deep into the recursive stack, it checks a few hundred subsets each time (to see if there's an exactMatch for that subset, and if there is, adds it to the results list to output), and the total number of computations roughly doubles as you add one more tag (it went 54, 150, 270, 500, 1000, 1900, 3400 operations for more and more tags). It is true that there are a few hundred subsets at each depth. But exactMatches is O(1) as I've written it (no iteration), and aside from the other O(1) operations like IF, the FOR inside the subset loop will, at most, be gone through around 10 times. This agrees with the measured result of a few thousand computations each time.
This did not surprise me, as selecting and iterating over all subsets seems like something that could get expensive quickly, but my question is about why it's so slow despite only doing a few thousand computations. I know my computer operates in GHz and I expect web servers are similar, so surely a few thousand computations would be near-instantaneous? What am I missing, and how can I improve this algorithm? Any other approaches I should look into?
import itertools

counter = 0  # global debug counter (the question text says it was made global to count operations)

# takes in a list of length n and returns a list of all combos of subsets of depth n
def arbSubsets(seq, n):
    return list(itertools.combinations(seq, len(seq) - n))

# takes in a tagsList and checks Gapper.objects.all to see if any gapper has all those tags
def exactMatches(tagsList):
    tagsSet = set(tagsList)
    exactMatches = []
    for gapper in Gapper.objects.all():
        gapperSet = set(gapper.tags.names())
        if tagsSet.issubset(gapperSet):
            exactMatches.append(gapper)
    return exactMatches

# takes in a tagsList that has been cleaned to remove any tags that NO gappers have,
# and then checks gapper objects to find the optimal match
def matchGapper(tagsList, depth, results):
    global counter
    # handles the case where we're only given tags contained by no gappers
    if depth == len(tagsList):
        return []
    # counter variable is to measure complexity for debugging
    counter += 1
    # we don't want too many results or it stops feeling tailored
    upper_limit_results = 3
    # now we must check subsets for a match
    subsets = arbSubsets(tagsList, depth)
    for subset in subsets:
        counter += 1
        matches = exactMatches(subset)
        if matches:
            for match in matches:
                counter += 1
                # now we need to check because we might be adding depth-2 results on top of
                # depth-1 results, which we didn't do before, to make sure we have at least 3 results
                if match not in results:
                    # don't want to show too many or it doesn't feel tailored anymore
                    counter += 1
                    if len(results) > upper_limit_results:
                        break
                    results.append(match)
    # always give at least 3 results
    if len(results) > 2:
        return results
    else:
        # check one level deeper (less specific) into the tags if there aren't enough matching gappers
        counter += 1
        return matchGapper(tagsList, depth + 1, results)

# this is the list of matches we then return to the user
matches = matchGapper(tagsList, 0, [])
It doesn't seem that you are doing only a few hundred computation steps. In fact, you have a few hundred options at each depth, so you should not add but multiply the number of steps at each depth to estimate the complexity of your solution.
Additionally, the statement "or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset" is also not true. Although these algorithms may be overkill for some very simple cases, they are still valid and will work for them.
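To make the first point concrete, here is a small back-of-the-envelope check (my own illustration, not part of either post): arbSubsets(tagsList, depth) yields one subset for every combination of size len(tagsList) - depth, and each subset costs a full pass over Gapper.objects.all() inside exactMatches.
import math

n = 13  # number of selected tags in the slow case from the question
for depth in range(n):
    # number of subsets matchGapper examines if the recursion reaches this depth
    print(depth, math.comb(n, n - depth))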
Okay, so after much fiddling with timers I've figured it out. There are a few functions at play when matching: exactMatches, matchGapper and arbSubset. When I put the counter into a global variable and measured operations (counted as lines of my own code being executed), it came in at around 2-10K for large inputs (around 10 tags).
It is true that arbSubset, which returns a list of subsets, at first seems like a plausible bottleneck. But if you look closely, we are 1) handling small numbers of tags (order of 10-50) and, more importantly, 2) only calling arbSubset when we recurse matchGapper, which only happens a max of about 10 times, since tagsList can only be around 10 long (order of 10-50, as above). And when I checked the time it took to generate arbSubsets, it was on the order of 2e-5 seconds. And so the total time spent on generating the subsets of arbitrary size is only about 2e-4 seconds. In other words, not the source of the 5-30 second waiting time in the web app.
And so with that aside, knowing that arbSubset is only called on the order of 10 times, and is fast at that, and knowing that there are only around a max of 10K computations taking place in my code it starts to become clear that I must be using some out-of-the-box function, I don't know--like set() or .issubset() or something like that--that takes a nontrivial amount of time to compute, and is executed many times. Adding some counters in some more places, it becomes clear that exactMatch() accounts for around 95-99% of all computations that take place (as would be expected if we have to check all combinations of subsets of various sizes for exactMatches).
So the problem, at this point, is reduced to the fact that exactMatch takes around 0.02s (empirically) as implemented, and is called several thousand times. And so we can either try to make it faster by a couple of orders of magnitude (it's already pretty optimal), or take another approach that doesn't involve finding matches using subsets. A friend of mine suggested creating a dict with all the combinations of tags (so 2^len(tagsList) keys) and setting them equal to lists of registered profiles with that exact combination. This way, querying is just traversing a (huge) dict, which can be done fast. Any other suggestions are welcome.
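A rough sketch of that dict idea (my own illustration; profiles and profile.tags are hypothetical stand-ins for the registered Gapper objects and their tag names):
from collections import defaultdict
from itertools import combinations

def build_tag_index(profiles):
    # map every combination of a profile's tags to the profiles that have all of them
    index = defaultdict(list)
    for profile in profiles:
        tags = sorted(profile.tags)  # hypothetical attribute holding the tag names
        for r in range(1, len(tags) + 1):
            for combo in combinations(tags, r):
                index[frozenset(combo)].append(profile)
    return index

# Lookup is then a single dict access instead of a scan over every profile:
# matches = build_tag_index(all_profiles).get(frozenset(selected_tags), [])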

Random item from iterator?

I have the following code
import itertools

number_list = (i for i in range(5))
permutations = (num for num in itertools.product(number_list, repeat=9))
This is generating an iterator called permutations which will hold all permutations of 9 characters within the number_list if I'm not mistaken. This can get pretty large for a big number_list.
I can iterate through permutations with next(permutations), but the problem is that it's sequential. I would like to be able to draw a random item from any part of the iterator. If it were a list, I could simply use random.choice(), but for a big number_list I don't have nearly enough memory or time for that.
I could also just use next() and store a list of X items and then randomize them, but that won't work either, because it can get so incredibly big that the outputs would be so similar it wouldn't really be "random".
I was wondering, if it isn't possible to draw a random item from the iterator, is there an algorithm which allows me to create an iterator which will output a random set with next(), but that when it ends it will have gone through the entire permutations without repeating?
The final idea would be having an iterator that would spit a random permutation of n characters out of a list of i elements, being able to get both n and i to arbitrarily large numbers without memory constraints, making sure that when the whole iterator ends up finishing (doesn't matter when, even if it finished after years in theory), all possible permutations would be exhausted without repetitions.
Firstly, your code does not generate permutations, but draws with replacement. Secondly, iterators (as the name suggests) are meant to ITERATE through some collection, not to jump to random places in it (of course, you can write your own __next__ function which does whatever you want; whether you want to call the resulting object an iterator is a philosophical question). Thirdly, producing random samples with replacement is a much-studied and widely implemented problem. See for example: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html
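For what it's worth, drawing a single random element of that product is easy with the standard library; here is a minimal sketch of sampling with replacement (it does not solve the "exhaust everything without repeats" part of the question):
import random

number_list = range(5)
random_draw = tuple(random.choices(number_list, k=9))  # one random 9-tuple, drawn with replacement
print(random_draw)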

Is there a form of 'for' that evaluates the "expression list" every time?

lt = 1000  # list primes to ...
remaining = list(range(2, lt + 1))  # remaining primes
for c in remaining:  # current "prime" being tested
    for t in remaining[0:remaining.index(c)]:  # test divisor
        if c % t == 0 and c != t:
            if c in remaining:
                remaining.remove(c)
If you don't need context:
How can I either re-run the same target-list value, or use something other than for that reads the expression list every iteration?
If you need context:
I am currently creating a program that lists primes from 2 to a given value (lt). I have a list 'remaining' that starts as all integers from 2 to the given value. One at a time, it tests a value on the list 'c' and tests for divisibility one by one by all smaller numbers on the list 't'. If 'c' is divisible by 't', it removes it from the list. By the end of the program, in theory, only primes remain but I have run into the problem that because I am removing items from the list, and for only reads remaining once, for is skipping values in remaining and thus leaving composites in the list.
What you're trying to do is almost never the right answer (and it's definitely not the right answer here, for reasons I'll get to later), which is why Python doesn't give you a way to do it automatically. In fact, it's illegal to delete from or insert into a list while you're iterating over it, even if CPython and other Python implementations usually don't check for that error.
But there is a way you can simulate what you want, with a little verbosity:
for i in range(remaining.index(c)):
    if i >= remaining.index(c): break
    t = remaining[i]
Now we're not iterating over remaining, we're iterating over its indices. So, if we remove values, we'll be iterating over the indices of the modified list. (Of course we're not really relying on the range there, since the if…break tests the same thing; if you prefer for i in itertools.count():, that will work too.)
And, depending on what you want to do, you can expand it in different ways, such as:
end = remaining.index(c)
for i in range(end):
    if i >= end: break
    t = remaining[i]
    # possibly subtract from end within the loop
    # so we don't have to recalculate remaining.index(c)
… and so on.
However, as I mentioned at the top, this is really not what you want to be doing. If you look at your code, it's not only looping over all the primes less than c, it's calling a bunch of functions inside that loop that also loop over either all the primes less than c or your entire list (that's how index, remove, and in work for lists), meaning you're turning linear work into quadratic work.
The simplest way around this is to stop trying to mutate the original list to remove composite numbers, and instead build a set of primes as you go along. You can search, add, and remove from a set in constant time. And you can just iterate your list in the obvious way because you're no longer mutating it.
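For example, a minimal sketch of that approach (my own reading of it, not the answer's exact code):
lt = 1000
primes = set()
for c in range(2, lt + 1):
    # test c only against the primes found so far; nothing is mutated while being iterated
    if all(c % p != 0 for p in primes):
        primes.add(c)
print(sorted(primes)[:10])  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]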
Finally, this isn't actually implementing a proper prime sieve, but a much less efficient algorithm that for some reason everyone has been teaching as a Scheme example for decades and more recently translating into other languages. See The Genuine Sieve of Eratosthenes for details, or this project for sample code in Python and Ruby that shows how to implement a proper sieve and a bit of commentary on performance tradeoffs.
(In the following, I ignore the XY problem of finding primes using a "mutable for".)
It's not entirely trivial to design an iteration over a sequence with well-defined (and efficient) behavior when the sequence is modified. In your case, where the sequence is merely being depleted, one reasonable thing to do is to use a list but "delete" elements by replacing them with a special value. (This makes it easy to preserve the current iteration position and avoids the cost of shifting the subsequent elements.)
To make it efficient to skip the deleted elements (both for the outer iteration and any inner iterations like in your example), the special value should be (or contain) a count of any following deleted elements. Note that there is a special case of deleting the current element, where for maximum efficiency you must move the cursor while you still know how far to move.
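A rough sketch of that "tombstone" idea (an illustration only, without the run-length optimization described above):
DELETED = object()  # sentinel marking a "deleted" slot

def live(seq):
    # yield (index, value) pairs, skipping deleted slots
    for i, value in enumerate(seq):
        if value is not DELETED:
            yield i, value

remaining = list(range(2, 20))
for i, c in live(remaining):
    for _, t in live(remaining[:i]):
        if c % t == 0:
            remaining[i] = DELETED  # "delete" c without shifting the later elements
            break
print([v for v in remaining if v is not DELETED])  # the primes below 20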

Possible error in very large numbers in the Fibonacci sequence (Python 2.7.6 integer math)

Edited to put questions in bold.
I wrote the following Python code (using Python 2.7.6) to calculate the Fibonacci sequence. It doesn't use any extra libraries, just the core python modules.
I was wondering if there was a limit to how many terms of the sequence I could calculate, perhaps due to the absurd length of the resulting integers, or if there would be a point where Python no longer performed the calculations accurately.
Also, the fibopt(n) function sometimes seems to return the term below the one requested (e.g. the 99th instead of the 100th), but it always works for lower terms (1st, 2nd, 10th, 15th). Why is that?
def fibopt(n):  # Returns term "n" of the Fibonacci sequence.
    f = [0, 1]  # List containing the first two numbers in the Fibonacci sequence.
    x = 0  # Empty integer to store the next value in the sequence. Not really necessary.
    optnum = 2  # Number of calculated entries in the sequence. Starts at 2 (0, 1).
    while optnum < n:  # Until the "n"th value in the sequence has been calculated.
        if optnum % 10000 == 0:
            print "Calculating index number %s." % optnum  # Notify the user for every 10000th value calculated. This is useful because the program can take a very long time to calculate higher values (e. g. about 15 minutes on an i7-4790 for the 10000000th value).
        x = [f[-1] + f[-2]]  # Calculate the next value in the sequence based on the previous two. This could be built into the next line.
        f.extend(x)  # Append that value to the sequence. This could be f.extend([f[-1] + f[-2]]) instead.
        optnum += 1  # Increment the counter for number of values calculated by 1.
        del f[:-2]  # Remove all values from the table except for the last two. Without this, the integers become so long that they fill 16 GB of RAM in seconds.
    return f[:n]  # Returns the requested term of the sequence.

def fib(n):  # Similar to fibopt(n), but returns all of the terms in the sequence up to and including term "n". Can use a lot of memory very quickly.
    f = [0, 1]
    x = 0
    while len(f) < n:
        x = [f[-1] + f[-2]]
        f.extend(x)
    return f[:n]
The good news is: integer math in Python is easy -- there are no overflows.
As long as your integers can fit within a C long, Python will use that. Once you go past that, it will auto-promote to arbitrary-precision integers (which means it'll be slower and use more memory, but the calculations will remain correct).
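For instance, a quick check in a Python 2 interpreter (an illustration of the promotion; the exact type names differ in Python 3, where everything is int):
>>> import sys
>>> type(sys.maxint)
<type 'int'>
>>> type(sys.maxint + 1)
<type 'long'>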
The only limits are:
The amount of memory addressable by the Python process. If you're using 32-bit Python, you need to be able to fit all of your data within 2 gigabytes of RAM (go past that and your program will fail with MemoryError). If you're using 64-bit Python, your physical RAM + swapfile is the theoretical limit.
The time you're willing to wait while calculations are being performed. The larger your ints, the slower the calculations are. If you ever hit your swap space, your program will reach continental drift levels of slow.
If you go to the Python 2.7 documentation, there is a section on Fibonacci numbers. In that section, the output ends at an arbitrary point; it is not the elongated answer we would all want to view. It shortens it.
If this does not answer your question, please see: 4.6 Defining Functions.
If you have downloaded the interpreter, the manuals come preinstalled. You can go online if necessary to www.python.org, or you can view your manual to see that the Fibonacci numbers end at an "arbitrary" point, i.e. the entire numerical value is not shown.
Seth
P.S. If you have any questions on where to find this section in your manual, please see The Python Tutorial/4. More Control Flow Tools/4.6 Defining Functions. I hope this helps a bit.
Python integers can express arbitrary-length values and will not be automatically converted to float. You can check by simply creating a very large number and checking its type:
>>> type(2**(2**25))
<class 'int'> # long in Python 2.x
fibopt returns f[:n], and that is a list. You seem to expect it to return a single term, so either the expectation (the comment at the top of fibopt) or the implementation must change.
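If a single term is really what's wanted, here is a minimal sketch of one way to do it (not the original poster's code) that also keeps only the last two values in memory:
def fib_term(n):
    # returns the n-th Fibonacci number, 0-indexed: fib_term(0) == 0, fib_term(10) == 55
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a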

Time Complexity Python Script

I am writing a small script that guesses numeric passwords (including ones with leading zeros). The script works fine but I am having trouble understanding what the worst case time complexity would be for this algorithm. Any insight on the complexity of this implementation would be great, thanks.
import hashlib
import itertools

def bruteforce(cipherText):
    for pLen in itertools.count():
        for password in itertools.product("0123456789", repeat=pLen):
            if hashlib.sha256("".join(password)).hexdigest() == cipherText:
                return "".join(password)
First, it's always possible that you're going to find a hash collision before you find the right password. And, for a long-enough input string, this is guaranteed. So, really, the algorithm is constant time: it will complete in about 2^256 steps no matter what the input is.
But this isn't very helpful when you're asking about how it scales with more reasonable N, so let's assume we had some upper limit that was low enough where hash collisions weren't relevant.
Now, if the password is length N, the outer loop will run N times.*
* I'm assuming here that the password really will be purely numeric. Otherwise, of course, it'll fail to find the right answer at N and keep going until it finds a hash collision.
How long does the inner loop take? Well, the main thing it does is iterate over each element in product("0123456789", repeat=pLen). That just iterates the Cartesian product of the 10-character digit string with itself pLen times; in other words, there are 10^pLen elements in the product.
Since 10**pLen is greater than sum(10**i for i in range(pLen)) (e.g., 100000 > 11111), we can ignore all but the last time through the outer loop, so that 10**pLen is the total number of inner loops.
The stuff that it does inside each inner loop is all linear on pLen (joining a string, hashing a string) or constant (comparing two hashes), so there are (10^pLen)*pLen total steps.
So, the worst-case complexity is exponential: O(10^N). Since 10^N = 2^(log2(10)·N), this is the kind of growth usually lumped together with O(2^N) as "exponential time"; the larger base only changes the constant factor in the exponent.
If you want to combine the two together, you could call this O(2^min(log2(10)*N, 256)). Which, again, is constant-time (since the asymptote is still 2^256), but shows how it's practically exponential if you only care about smaller N.
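For concreteness, here is a quick way to exercise the script (a hypothetical example, using Python 2 string handling to match the code above):
>>> import hashlib
>>> target = hashlib.sha256("0042").hexdigest()
>>> bruteforce(target)
'0042'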
