random.shuffle(lst_shuffle, random.random)
I know the latter part is an optional argument. But what does it do exactly? I don't understand what it means.
This is from the docs.
random.random()
Return the next random floating point
number in the range [0.0, 1.0).
I also see this; is this what the range [0.0, 1.0) means?
Pseudorandom number generators
Most, if not all programming languages have libraries that include a pseudo-random
number generator. This generator usually returns a random number between 0 and 1 (not
including 1). In a perfect generator all numbers have the same probability of being selected but
in the pseudo generators some numbers have zero probability.
Existing answers do a good job of addressing the question's specifics, but I think it's worth mentioning a side issue: why you're particularly likely to want to pass an alternative "random generator" to shuffle as opposed to other functions in the random module. Quoting the docs:
Note that for even rather small
len(x), the total number of
permutations of x is larger than the
period of most random number
generators; this implies that most
permutations of a long sequence can
never be generated.
The phrase "random number generators" here refers to what may be more pedantically called pseudo-random number generators -- generators that give a good imitation of randomness, but are entirely algorithmic, and therefore are known not to be "really random". Any such algorithmic approach will have a "period" -- it will start repeating itself eventually.
Python's random module uses a particularly good and well-studied pseudo-random generator, the Mersenne Twister, with a period of 2**19937-1 -- a number that has more than six thousand digits when written out in decimal, as len(str(2**19937-1)) will confirm;-). On my laptop I can generate about 5 million such numbers per second:
$ python -mtimeit -s'import random' 'random.random()'
1000000 loops, best of 3: 0.214 usec per loop
Assuming a much faster machine, able to generate a billion such numbers per second, the cycle would take about 10**5985 years to repeat -- and the best current estimate for the age of the Universe is a bit less than 1.5*10**10 years. It would thus take an almost-unimaginable number of Universe-lifetimes to reach the point of repetition;-). Making the computation parallel wouldn't help much; there are estimated to be about 10**80 atoms in the Universe, so even if you were able to run such a billion-per-second generator on each atom in the Universe, it would still take well over 10**5800 Universe-lifetimes to start repeating.
So, you might be justified in suspecting that this worry about repetition is a tiny little bit of a theoretical, rather than practical, issue;-).
Nevertheless, factorials (which count the permutations of a sequence of length N) also grow pretty fast. The Mersenne Twister, for example, might be able to produce all permutations of a sequence of length 2080, but definitely not of one of length 2081 or higher. Were it not for the "lifetime of the Universe" issue, the docs' worry about "even rather small len(x)" would be justified -- we know that many possible permutations can never be reached by shuffling with such a pseudo-RNG, as soon as we have a reasonably long sequence, so one might worry about what kind of bias we're actually introducing with even a few shuffles! :-)
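Both figures are easy to check; here is a small sketch of my own that confirms the digit count of the period and finds the largest length whose factorial still fits inside it:

import math

# Digit count of the Mersenne Twister period (see len(str(...)) above).
print(len(str(2**19937 - 1)))  # 6002

# Largest n such that n! still fits within the period 2**19937 - 1.
period_bits = 19937  # log2 of the period, near enough
n, log2_fact = 1, 0.0
while log2_fact + math.log2(n + 1) <= period_bits:
    log2_fact += math.log2(n + 1)
    n += 1
print(n)  # 2080: 2080! < 2**19937 - 1 < 2081!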
os.urandom mediates access to whatever sources of physical randomness the OS provides -- CryptGenRandom on Windows, /dev/urandom on Linux, etc. os.urandom gives sequences of bytes, but with the help of struct it's easy to make them into random numbers:
>>> import os, struct
>>> n = struct.calcsize('I')
>>> def s2i(s): return struct.unpack('I', s)[0]
...
>>> maxi = float(s2i(b'\xff' * n) + 1)   # one more than the largest n-byte value
>>> def rnd(): return s2i(os.urandom(n)) / maxi
Now we can call random.shuffle(somelist, rnd) and worry less about bias;-).
Unfortunately, measurement shows that this approach to RNG is about 50 times slower than calls to random.random() -- this could be an important practical consideration if we're going to need many random numbers (and if we don't, the worry about possible bias may be misplaced;-). The os.urandom approach is also hard to use in predictable, repeatable ways (e.g., for testing purposes), while with random.random() you need only call random.seed with a fixed value at the start of the test to guarantee reproducible behavior.
In practice, therefore, os.urandom is only used when you need "cryptographic quality" random numbers - ones that a determined attacker can't predict - and are therefore willing to pay the practical price for using it instead of random.random.
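To make the reproducibility point concrete, here is a minimal sketch (my own; the seed value 1234 is arbitrary):

import random

# Seeding the default PRNG makes shuffles repeatable, unlike os.urandom-based ones.
data = list(range(10))
random.seed(1234)
random.shuffle(data)
first_run = data[:]

data = list(range(10))
random.seed(1234)
random.shuffle(data)
assert data == first_run  # same seed, same permutation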
The second argument is used to specify which random number generator to use. This could be useful if you need/have something "better" than random.random. Security-sensitive applications might need to use a cryptographically secure random number generator.
The difference between random.random and random.random() is that the first one is a reference to the function that produces simple random numbers, and the second one actually calls that function.
If you had another random number generator you wanted to use, you could say
random.shuffle(x, my_random_number_function)
As to what random.random (the default generator) is doing, it uses an algorithm called the Mersenne Twister to create a seemingly random floating point number between 0 and 1 (not including 1), with all numbers in that interval being equally likely.
That the interval is from 0 to 1 is just a convention.
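For instance, here is a minimal sketch of plugging in your own generator -- I'm using a separately seeded random.Random instance as a stand-in for my_random_number_function (note that the random parameter of shuffle was deprecated in Python 3.9 and removed in 3.11, so this only works on older versions):

import random

# A separately seeded generator standing in for my_random_number_function.
my_rng = random.Random(42)

deck = list(range(10))
random.shuffle(deck, my_rng.random)  # accepted on Python < 3.11; the parameter was removed in 3.11
print(deck)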
The second argument is the function which is called to produce random numbers that are in turn used to shuffle the sequence (first argument). The default function used if you don't provide your own is random.random.
You might want to provide this parameter if you want to customize how shuffle is performed.
And your customized function will have to return numbers in range [0.0, 1.0) - 0.0 included, 1.0 excluded.
The docs go on saying:
The optional argument random is a
0-argument function returning a random
float in [0.0, 1.0); by default, this
is the function random().
It means that you can either specify your own random number generator function, or tell the module to use the default random function. The second option is almost always the best choice, because Python uses a pretty good PRNG.
The function it expects is supposed to return a floating point pseudo-random number in the range [0.0, 1.0), which means 0.0 is included and 1.0 is not (i.e. 0.9999 is a valid return value, but 1.0 is not). Each number in this range should in theory be returned with equal probability (i.e. this is a uniform distribution).
From the example:
>>> random.random() # Random float x, 0.0 <= x < 1.0
0.37444887175646646
It generates a random floating point number between 0 and 1.
The shuffle function depends on an RNG (Random Number Generator), which defaults to random.random. The second argument is there so you can provide your own RNG instead of the default.
UPDATE:
The second argument is a random number generator that generates a new, random number in the range [0.0, 1.0) each time you call it.
Here's an example for you:
import random

def a():
    return 0.0

def b():
    return 0.999999999999

arr = [1, 2, 3]

random.shuffle(arr)
print(arr)  # e.g. [1, 3, 2] (unseeded, so this varies)
arr.sort()
print(arr)  # prints [1, 2, 3]

random.shuffle(arr)
print(arr)  # e.g. [3, 2, 1] (varies)
arr.sort()

random.shuffle(arr, a)
print(arr)  # prints [2, 3, 1]
arr.sort()
random.shuffle(arr, a)
print(arr)  # prints [2, 3, 1]
arr.sort()

random.shuffle(arr, b)
print(arr)  # prints [1, 2, 3]
arr.sort()
random.shuffle(arr, b)
print(arr)  # prints [1, 2, 3]
So if the function always returns the same value, you always get the same permutation. If the function returns random values each time it's called, you get a random permutation.
Related
I'm looking for a pseudo-random number generator (an algorithm where you input a seed number and it outputs a different 'random-looking' number, and the same seed will always generate the same output) for numbers between 1 and 95^1,312,000.
I would use the Linear Feedback Shift Register (LFSR) PRNG, but if I did, I would have to convert the seed number (which could be up to 1.2 million digits long in base-10) into a binary number, which would be so massive that I think it would take too long to compute.
In response to a similar question, the Feistel cipher was recommended, but I didn't understand the vocabulary of the wiki page for that method (I'm going into 10th grade so I don't have a degree in encryption), so if you could use layman's terms, I would strongly appreciate it.
Is there an efficient way of doing this which won't take until the end of time, or is this problem impossible?
Edit: I forgot to mention that the PRNG sequence needs to have a full period. My mistake.
A simple way to do this is to use a linear congruential generator with modulus m = 95^1312000.
The formula for the generator is x_(n+1) = a*x_n + c (mod m). By the Hull-Dobell Theorem, it will have full period if and only if gcd(m,c) = 1 and 95 divides a-1. Furthermore, if you want good second values (right after the seed) even for very small seeds, a and c should be fairly large. Also, your code can't store these values as literals (they would be much too big). Instead, you need to be able to reliably produce them on the fly. After a bit of trial and error to make sure gcd(m,c) = 1, I hit upon:
import random

def get_book(n):
    random.seed(1941)  # Borges' "The Library of Babel" was published in 1941
    m = 95**1312000
    a = 1 + 95 * random.randint(1, m//100)
    c = random.randint(1, m - 1)  # math.gcd(c, m) == 1
    return (a*n + c) % m
For example:
>>> book = get_book(42)
>>> book % 10**100
4779746919502753142323572698478137996323206967194197332998517828771427155582287891935067701239737874
shows the last 100 digits of "book" number 42. Given Python's built-in support for large integers, the code runs surprisingly fast (it takes less than 1 second to grab a book on my machine).
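As a quick sanity check of the Hull-Dobell conditions quoted above (my own sketch, regenerating a and c exactly as get_book does): since m = (5*19)**1312000 is odd, the divisible-by-4 clause is vacuous, and gcd(c, m) = 1 is equivalent to c being divisible by neither 5 nor 19.

import random

random.seed(1941)
m = 95**1312000
a = 1 + 95 * random.randint(1, m//100)
c = random.randint(1, m - 1)

assert (a - 1) % 95 == 0           # a-1 divisible by every prime factor of m (5 and 19)
assert c % 5 != 0 and c % 19 != 0  # equivalent to gcd(c, m) == 1, since m = (5*19)**1312000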
If you have a method that can produce a pseudo-random digit, then you can concatenate as many together as you want. It will be just as repeatable as the underlying PRNG.
However, you'll probably run out of memory scaling that up to millions of digits and attempting to do arithmetic. Normally stuff on that scale isn't done on "numbers". It's done on byte vectors, or something similar.
For an introduction to Python course, I'm looking at generating a random floating point number in Python, and I have seen a standard recommended code of
import random
lower = 5
upper = 10
range_width = upper - lower
x = random.random() * range_width + lower
for a random floating-point number from 5 up to but not including 10.
It seems to me that the same effect could be achieved by:
import random
x = random.randrange(5, 10) + random.random()
Since that would give an integer of 5, 6, 7, 8, or 9, and then tack a fractional part onto it.
The question I have is would this second code still give a fully even probability distribution, or would it not keep the full randomness of the first version?
According to the documentation, yes, random() does indeed produce a uniform distribution.
random(), which generates a random float uniformly in the semi-open range [0.0, 1.0). Python uses the Mersenne Twister as the core generator.
So both code examples should be fine. To shorten your code, you can equally do:
random.uniform(5, 10)
Note that uniform(a, b) is simply a + (b - a) * random() so the same as your first example.
The second example depends on the version of Python you're using.
Prior to 3.2, randrange() could produce a slightly uneven distribution.
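A minimal usage sketch of the equivalent forms discussed above:

import random

# Three ways to draw a float in [5.0, 10.0).
print(random.uniform(5, 10))
print(5 + (10 - 5) * random.random())
print(random.randrange(5, 10) + random.random())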
There is a difference. Your second method is theoretically superior, although in practice it only matters for large ranges. Indeed, both methods will give you a uniform distribution. But only the second method can return all values in the range that are representable as a floating point number.
Since your range is so small, there is no appreciable difference. But still there is a difference, which you can see by considering a larger range. If you take a random real number between 0 and 1, you get a floating-point representation with a given number of bits. Now suppose your range is, say, in the order of 2**32. By multiplying the original random number by this range, you lose 32 bits of precision in the result. Put differently, there will be gaps between the values that this method can return. The gaps are still there when you multiply by 4: You have lost the two least significant bits of the original random number.
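The gap argument is easy to see in code; here is a small sketch of my own at the 2**53 boundary:

import random
import sys

hi = 2 / sys.float_info.epsilon                      # 9007199254740992.0 == 2**53
mult = [random.random() * hi for _ in range(5)]
add = [random.randrange(int(hi)) + random.random() for _ in range(5)]

print(all(x == int(x) for x in mult))                # True: the products always land on integers
print(add)                                           # smaller draws keep (coarse) fractional parts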
The two methods can give different results, but you'll only notice the difference in fairly extreme situations (with very wide ranges). For instance, if you generate random numbers between 0 and 2/sys.float_info.epsilon (9007199254740992.0, or a little more than 9 quadrillion), you'll notice that the version using multiplication will never give you any floats with fractional values. If you increase the maximum bound to 4/sys.float_info.epsilon, you won't get any odd integers, only even ones. That's because the 64-bit floating point type Python uses doesn't have enough precision to represent all integers at the upper end of that range, and it's trying to maintain a uniform distribution (so it omits small odd integers and fractional values even though those can be represented in parts of the range).
The second version of the calculation will give extra precision to the smaller random numbers generated. For instance, if you're generating numbers between 0 and 2/sys.float_info.epsilon and the randrange call returned 0, you can use the full precision of the random call to add a fractional part to the number. On the other hand if the randrange returned the largest number in the range (2/sys.float_info.epsilon - 1), very little of the precision of the fraction would be used (the number will round to the nearest integer without any fractional part remaining).
Adding a fractional value also can't help you deal with ranges that are too large for every integer to be represented. If randrange returns only even numbers, adding a fraction usually won't make odd numbers appear (it can in some parts of the range, but not for others, and the distribution may be very uneven). Even for ranges where all integers can be represented, the odds of a specific floating point number appearing will not be entirely uniform, since the smaller numbers can be more precisely represented. Large but imprecise numbers will be more common than smaller but more precisely represented ones.
Is there any difference whatsoever between using random.randrange to pick 5 digits individually, like this:
a=random.randrange(0,10)
b=random.randrange(0,10)
c=random.randrange(0,10)
d=random.randrange(0,10)
e=random.randrange(0,10)
print (a,b,c,d,e)
...and picking the 5-digit number at once, like this:
x=random.randrange(0, 100000)
print (x)
Any random-number-generator differences (if any --- see the section on Randomness) are minuscule compared to the utility and maintainability drawbacks of the digit-at-a-time method.
For starters, generating each digit would require a lot more code to handle perfectly normal calls like randrange(0, 1024) or randrange(0, 2**32), where the digits do not arise with equal probability. For example, on the closed range [0, 1023] (requiring 4 digits), the first digit of the four can never be anything other than 0 or 1. The last digit is slightly more likely to be a 0, 1, 2, or 3. And so on.
Trying to cover all the bases would rapidly make that code slower, more bug-prone, and more brittle than it already is. (The number of annoying little details you've encountered just posting this question should give you an idea what lies farther down that path.)
...and all that grief is before you consider how easily random.randrange handles non-zero start values, the step parameter, and negative arguments.
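To make the randrange(0, 1024) example above concrete, a quick exhaustive count (my own sketch) shows exactly the skew described:

from collections import Counter

# Count the leading and trailing digits of every value randrange(0, 1024) can return.
first = Counter(f"{n:04d}"[0] for n in range(1024))
last = Counter(f"{n:04d}"[-1] for n in range(1024))

print(first)  # Counter({'0': 1000, '1': 24}) -- the leading digit is only ever 0 or 1
print(last)   # 0-3 appear 103 times each, 4-9 only 102 times each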
Randomness Problems
If your RNG is good, your alternative method should produce "equally random" results (assuming you've handled all the problems I mentioned above). However, if your RNG is biased, then the digit-at-a-time method will probably increase its effect on your outputs.
For demonstration purposes, assume your absurdly biased RNG has an off-by-one error, so that it never produces the last value of the given range:
The call randrange(0, 2**32) will never produce 2**32 - 1 (4,294,967,295), but the remaining 4-billion-plus values will appear in very nearly their expected probability. Its output over millions of calls would be very hard to distinguish from a working pseudo-random number generator.
Producing the ten digits of that same supposedly-random number individually will subject each digit to that same off-by-one error, resulting in a ten-digit output that consists entirely of the digits [0,8], with no 9s present... ever. This is vastly "less random" than generating the whole number at once.
Conversely, the digit-at-a-time method will never be better than the RNG backing it, even when the range requested is very small. That method might magnify any RNG bias, or just repeat that bias, but it will never reduce it.
Yes, no and no.
Yes: probabilities multiply, so the digit sequences have the same probability
prob(a) and prob(b) = prob(a) * prob(b)
Since each digit has a 0.1 chance of appearing, the probability of two particular digits in order is 0.1**2, or 0.01, which is the probability of any particular number between 0 and 99 inclusive.
No: be careful with the upper bound in your second number.
To cover all five-digit combinations, the second call must be randrange(0, 100000); an upper bound with one zero too few (10000) would give at most four digits.
No: the output will not be the same
The second form will not print leading digits; you could print("%05d"%x) to get all the digits. Also, the first form has spaces in the output, so you could instead print("%d%d%d%d%d"%(a,b,c,d,e)).
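Putting those two formatting suggestions together, a small sketch:

import random

a, b, c, d, e = (random.randrange(0, 10) for _ in range(5))
print("%d%d%d%d%d" % (a, b, c, d, e))   # digits concatenated without spaces

x = random.randrange(0, 100000)
print("%05d" % x)                       # zero-padded to five digits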
I'm not a maths major or a CS major; I just fool around with Python (usually making scripts for simulations/theorycrafting on video games) and I discovered just how bad random.randint is performance-wise. It's got me wondering why random.randint or random.randrange are used/made the way they are. I made a function that produces (for all intents and actual purposes) identical results to random.randint:
import random

big_bleeping_float = (2**64 - 2)/(2**64 - 2)

def fastrandint(start, stop):
    return start + int(random.random() * (stop - start + big_bleeping_float))
There is a massive 180% speed boost using that to generate an integer in the range (inclusive) 0-65 compared to random.randrange(0, 66), the next fastest method.
>>> timeit.timeit('random.randint(0, 66)', setup='from numpy import random', number=10000)
0.03165552873121058
>>> timeit.timeit('random.randint(0, 65)', setup='import random', number=10000)
0.022374771118336412
>>> timeit.timeit('random.randrange(0, 66)', setup='import random', number=10000)
0.01937231027605435
>>> timeit.timeit('fastrandint(0, 65)', setup='import random; from fasterthanrandomrandom import fastrandint', number=10000)
0.0067909916844523755
Furthermore, an adaptation of this function as an alternative to random.choice is 75% faster, and I'm sure adding larger-than-one stepped ranges would be faster too (although I didn't test that). For almost double the speed boost of the fastrandint function, you can simply write it inline:
>>> timeit.timeit('int(random.random() * (65 + big_bleeping_float))', setup='import random; big_bleeping_float= (2**64 - 2)/(2**64 - 2)', number=10000)
0.0037642723021917845
So in summary: why am I wrong that my function is better, why is it faster if it is better, and is there a yet even faster way to do what I'm doing?
randint calls randrange which does a bunch of range/type checks and conversions and then uses _randbelow to generate a random int. _randbelow again does some range checks and finally uses random.
So if you remove all the checks for edge cases and some function call overhead, it's no surprise your fastrandint is quicker.
random.randint() and others are calling into random.getrandbits(), which may be less efficient than direct calls to random(), but for good reason.
It is actually more correct to use a randint that calls into random.getrandbits(), as it can be done in an unbiased manner.
You can see that using random.random to generate values in a range ends up being biased, since there are only M floating point values between 0 and 1 (for M pretty large). Take an N that doesn't divide M and write M = k*N + r with 0 < r < N. Then, at best, using random.random() * (N+1), r of the possible values come out with probability (k+1)/M and N-r of them with probability k/M. (This is at best, by the pigeonhole principle -- in practice I'd expect the bias to be even worse.)
Note that this bias is only noticeable when:
- you take a large number of samples, and
- N is a large fraction of M, the number of floats in [0.0, 1.0).
So it probably won't matter to you, unless you know you need unbiased values - such as for scientific computing etc.
In contrast, a value from randint(0,N) can be unbiased by using rejection sampling from repeated calls to random.getrandbits(). Of course managing this can introduce additional overhead.
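Here is a rough sketch of that rejection-sampling idea (my own illustration; CPython's getrandbits-based _randbelow does essentially this internally):

import random

def randbelow_unbiased(n):
    """Unbiased integer in range(n) via getrandbits plus rejection."""
    k = n.bit_length()
    r = random.getrandbits(k)
    while r >= n:             # reject and redraw; expected fewer than 2 draws
        r = random.getrandbits(k)
    return r

print(randbelow_unbiased(66))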
Aside
If you end up using a custom random implementation, note the following. From the Python 3 docs:
Almost all module functions depend on the basic function random(), which
generates a random float uniformly in the semi-open range [0.0, 1.0).
This suggests that randint and others may be implemented using random.random. If this is the case, I would expect them to be slower, incurring at least one additional function call of overhead per call.
Looking at the code referenced in https://stackoverflow.com/a/37540577/221955 you can see that this will happen if the random implementation doesn't provide a getrandbits() function.
This is probably rarely a problem, but randint(0, 10**1000) works while fastrandint(0, 10**1000) crashes. The slower runtime is probably the price you need to pay to have a function that works for all possible cases...
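A quick illustration of that failure mode (my own sketch):

import random

big = 10**1000
print(random.randint(0, big).bit_length())   # works: pure integer arithmetic throughout
try:
    int(random.random() * (big + 1.0))       # converting 10**1000 to a float overflows
except OverflowError as err:
    print("fastrandint-style arithmetic fails:", err)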
More specifically, numpy:
In [24]: a=np.random.RandomState(4)
In [25]: a.rand()
Out[25]: 0.9670298390136767
In [26]: a.get_state()
Out[26]:
('MT19937',
array([1248735455, ..., 1532921051], dtype=uint32),
 2,
 0,
 0.0)
octave:
octave:17> rand('state',4)
octave:18> rand()
ans = 0.23605
octave:19> rand('seed',4)
octave:20> rand()
ans = 0.12852
Octave claims to perform the same algorithm (Mersenne Twister with a period of 2^19937 - 1).
Anybody know why the difference?
Unfortunately the MT19937 generator in Octave does not allow you to initialise it using a single 32-bit integer the way np.random.RandomState(4) does. If you use rand("seed",4), this actually switches to an earlier version of the PRNG used previously in Octave, which is not MT19937 at all, but rather the Fortran RANDLIB.
It is possible to get the same numbers in NumPy and Octave, but you have to hack around the random seed generation algorithm in Octave and write your own function to construct the state vector out of the initial 32-bit integer seed. I am not an Octave guru, but with several Internet searches on bit manipulation functions and integer classes in Octave/Matlab I was able to write the following crude script to implement the seeding:
function state = mtstate(seed)
  state = uint32(zeros(625,1));
  state(1) = uint32(seed);
  for i=1:623,
    tmp = uint64(1812433253)*uint64(bitxor(state(i),bitshift(state(i),-30)))+i;
    state(i+1) = uint32(bitand(tmp,uint64(intmax('uint32'))));
  end
  state(625) = 1;
Use it like this:
octave:9> rand('state',mtstate(4));
octave:10> rand(1,5)
ans =
0.96703 0.54723 0.97268 0.71482 0.69773
Just for comparison with NumPy:
>>> a = numpy.random.RandomState(4)
>>> a.rand(5)
array([ 0.96702984, 0.54723225, 0.97268436, 0.71481599, 0.69772882])
The numbers (or at least the first five of them) match.
Note that the default random number generator in Python, provided by the random module, is also MT19937, but it uses a different seeding algorithm, so random.seed(4) produces a completely different state vector and hence the PRN sequence is then different.
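A small sketch of that last point (assuming NumPy is installed):

import random
import numpy as np

random.seed(4)
np_rng = np.random.RandomState(4)

# Both are MT19937 underneath, but the seeding schemes differ,
# so the streams do not match.
print(random.random())
print(np_rng.rand())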
If you look at the details of the Mersenne Twister algorithm, there are lots of parameters that affect the actual numbers produced. I don't think Python and Octave are trying to produce the same sequence of numbers.
It looks like numpy is returning the raw random integers, whereas octave is normalising them to floats between 0 and 1.0.