I have a big script in Python. I inspired myself in other people's code so I ended up using the numpy.random module for some things (for example for creating an array of random numbers taken from a binomial distribution) and in other places I use the module random.random.
Can someone please tell me the major differences between the two?
Looking at the doc webpage for each of the two it seems to me that numpy.random just has more methods, but I am unclear about how the generation of the random numbers is different.
The reason why I am asking is because I need to seed my main program for debugging purposes. But it doesn't work unless I use the same random number generator in all the modules that I am importing, is this correct?
Also, I read here, in another post, a discussion about NOT using numpy.random.seed(), but I didn't really understand why this was such a bad idea. I would really appreciate if someone explain me why this is the case.
You have made many correct observations already!
Unless you'd like to seed both of the random generators, it's probably simpler in the long run to choose one generator or the other. But if you do need to use both, then yes, you'll also need to seed them both, because they generate random numbers independently of each other.
For numpy.random.seed(), the main difficulty is that it is not thread-safe - that is, it's not safe to use if you have many different threads of execution, because it's not guaranteed to work if two different threads are executing the function at the same time. If you're not using threads, and if you can reasonably expect that you won't need to rewrite your program this way in the future, numpy.random.seed() should be fine. If there's any reason to suspect that you may need threads in the future, it's much safer in the long run to do as suggested, and to make a local instance of the numpy.random.Random class. As far as I can tell, random.random.seed() is thread-safe (or at least, I haven't found any evidence to the contrary).
The numpy.random library contains a few extra probability distributions commonly used in scientific research, as well as a couple of convenience functions for generating arrays of random data. The random.random library is a little more lightweight, and should be fine if you're not doing scientific research or other kinds of work in statistics.
Otherwise, they both use the Mersenne twister sequence to generate their random numbers, and they're both completely deterministic - that is, if you know a few key bits of information, it's possible to predict with absolute certainty what number will come next. For this reason, neither numpy.random nor random.random is suitable for any serious cryptographic uses. But because the sequence is so very very long, both are fine for generating random numbers in cases where you aren't worried about people trying to reverse-engineer your data. This is also the reason for the necessity to seed the random value - if you start in the same place each time, you'll always get the same sequence of random numbers!
As a side note, if you do need cryptographic level randomness, you should use the secrets module, or something like Crypto.Random if you're using a Python version earlier than Python 3.6.
From Python for Data Analysis, the module numpy.random supplements the Python random with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.
By contrast, Python's built-in random module only samples one value at a time, while numpy.random can generate very large sample faster. Using IPython magic function %timeit one can see which module performs faster:
In [1]: from random import normalvariate
In [2]: N = 1000000
In [3]: %timeit samples = [normalvariate(0, 1) for _ in xrange(N)]
1 loop, best of 3: 963 ms per loop
In [4]: %timeit np.random.normal(size=N)
10 loops, best of 3: 38.5 ms per loop
The source of the seed and the distribution profile used are going to affect the outputs - if you are looking for cryptgraphic randomness, seeding from os.urandom() will get nearly real random bytes from device chatter (ie ethernet or disk) (ie /dev/random on BSD)
this will avoid you giving a seed and so generating determinisitic random numbers. However the random calls then allow you to fit the numbers to a distribution (what I call scientific random ness - eventually all you want is a bell curve distribution of random numbers, numpy is best at delviering this.
SO yes, stick with one generator, but decide what random you want - random, but defitniely from a distrubtuion curve, or as random as you can get without a quantum device.
It surprised me the randint(a, b) method exists in both numpy.random and random, but they have different behaviors for the upper bound.
random.randint(a, b) returns a random integer N such that a <= N <= b. Alias for randrange(a, b+1). It has b inclusive. random documentation
However if you call numpy.random.randint(a, b), it will return low(inclusive) to high(exclusive). Numpy documentation
Related
From the docs:
random.seed(a=None, version=2) Initialize the random number generator.
If a is omitted or None, the current system time is used. If
randomness sources are provided by the operating system, they are used
instead of the system time (see the os.urandom() function for details
on availability).
But...if it's truly random...(and I thought I read it uses Mersenne, so it's VERY random)...what's the point in seeding it? Either way the outcome is unpredictable...right?
The default is probably best if you want different random numbers with each run. If for some reason you need repeatable random numbers, in testing for instance, use a seed.
The module actually seeds the generator (with OS-provided random data from urandom if possible, otherwise with the current date and time) when you import the module, so there's no need to manually call seed().
(This is mentioned in the Python 2.7 documentation but, for some reason, not the 3.x documentation. However, I confirmed in the 3.x source that it's still done.)
If the automatic seeding weren't done, you'd get the same sequence of numbers every time you started your program, same as if you manually use the same seed every time.
But...if it's truly random
No, it's pseudo random. If it uses Mersenne Twister, that too is a PRNG.
It's basically an algorithm that generates the exact same sequence of pseudo random numbers out of a given seed. Generating truly random numbers requires special hardware support, it's not something you can do by a pure algorithm.
You might not need to seed it since it seeds itself on first use, unless you have some other or better means of providing a seed than what is time based.
If you use the random numbers for things that are not security related, a time based seed is normally fine. If you use if for security/cryptography, note what the docs say: "and is completely unsuitable for cryptographic purposes"
If you want to reproduce your results, you seed the generator with a known value so you get the same sequence every time.
A Mersenne twister, the random number generator, used by Python is seeded by the operating system served random numbers by default on those platforms where it is possible (Unixen, Windows); however on other platforms the seed defaults to the system time which means very repeatable values if the system time has a bad precision. On such systems seeding with known better random values is thus beneficial. Note that on Python 3 specifically, if version 2 is used, you can pass in any str, bytes, or bytearray to seed the generator; thus taking use of the Mersenne twister's large state better.
Another reason to use a seed value is indeed to guarantee that you get the same sequence of random numbers again and again - by reusing the known seed. Quoting the docs:
Sometimes it is useful to be able to reproduce the sequences given by
a pseudo random number generator. By re-using a seed value, the same
sequence should be reproducible from run to run as long as multiple
threads are not running.
Most of the random module’s algorithms and seeding functions are
subject to change across Python versions, but two aspects are
guaranteed not to change:
If a new seeding method is added, then a backward compatible seeder will be offered.
The generator’s random() method will continue to produce the same sequence when the compatible seeder is given the same seed.
For this however, you mostly want to use the random.Random instances instead of using module global methods (the multiple threads issue, etc).
Finally also note that the random numbers produced by Mersenne twister are unsuitable for cryptographical use; whereas they appear very random, it is possible to fully recover the internal state of the random generator by observing only some hundreds of values from the generator. For cryptographical algorithms, you want to use the SystemRandom class.
In most cases I would say there is no need to care about. But if someone is really willing to do something wired and (s)he could roughly figure out your system time when your code was running, they might be able to brute force replay your random numbers and see which series fits. But I would say this is quite unlikely in most cases.
In order to make random simulations we run reproducible later, my colleagues and I often explicitly seed the random or numpy.random modules' random number generators using the random.seed and np.random.seed methods. Seeding with an arbitrary constant like 42 is fine if we're just using one of those modules in a program, but sometimes, we use both random and np.random in the same program. I'm unsure whether there are any best practices I should be following about how to seed the two RNGs together.
In particular, I'm worried that there's some sort of trap we could fall into where the two RNGs together behave in a "non-random" way, such as both generating the exact same sequence of random numbers, or one sequence trailing the other by a few values (e.g. the kth number from random is always the k+20th number from np.random), or the two sequences being related to each other in some other mathematical way. (I realise that pseudo-random number generators are all imperfect simulations of true randomness, but I want to avoid exacerbating this with poor seed choices.)
With this objective in mind, are there any particular ways we should or shouldn't seed the two RNGs? I've used, or seen colleagues use, a few different tactics, like:
Using the same arbitrary seed:
random.seed(42)
np.random.seed(42)
Using two different arbitrary seeds:
random.seed(271828)
np.random.seed(314159)
Using a random number from one RNG to seed the other:
random.seed(42)
np.random.seed(random.randint(0, 2**32))
... and I've never noticed any strange outcomes from any of these approaches... but maybe I've just missed them. Are there any officially blessed approaches to this? And are there any possible traps that I can spot and raise the alarm about in code review?
I will discuss some guidelines on how multiple pseudorandom number generators (PRNGs) should be seeded. I assume you're not using random-behaving numbers for information security purposes (if you are, only a cryptographic generator is appropriate and this advice doesn't apply).
To reduce the risk of correlated pseudorandom numbers, you can use PRNG algorithms, such as SFC and other so-called "counter-based" PRNGs (Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3", 2011), that support independent "streams" of pseudorandom numbers. There are other strategies as well, and I explain more about this in "Seeding Multiple Processes".
If you can use NumPy 1.17, note that that version introduced a new PRNG system and added SFC (SFC64) to its repertoire of PRNGs. For NumPy-specific advice on parallel pseudorandom generation, see "Parallel Random Number Generation" in the NumPy documentation.
You should avoid seeding PRNGs (especially several at once) with timestamps.
You mentioned this question in a comment, when I started writing this answer. The advice there is not to seed multiple instances of the same kind of PRNG. This advice, however, doesn't apply as much if the seeds are chosen to be unrelated to each other, or if a PRNG with a very big state (such as Mersenne Twister) or a PRNG that gives each seed its own nonoverlapping pseudorandom number sequence (such as SFC) is used. The accepted answer there (at the time of this writing) demonstrates what happens when multiple instances of .NET's System.Random, with sequential seeds, are used, but not necessarily what happens with PRNGs of a different design, PRNGs of multiple designs, or PRNGs initialized with unrelated seeds. Moreover, .NET's System.Random is a poor choice for a PRNG precisely because it allows only seeds no more than 32 bits long (so the number of pseudorandom sequences it can produce is limited), and also because it has implementation bugs (if I understand correctly) that have been preserved for backward compatibility.
My intention is to create a guideline on how to do reproducible computations in Python (if possible, regardless the environment, the operating system, etc.). However the issue of generating random numbers keeps coming back to my mind. I am struggling to find a bulletproof way (if there is one).
Standard way how to make the output of random generators reproducible is to use
import random
random.seed()
As far as I know, the automatic choice of the seed is dependent on the system. (See random.seed's documentation in Python.)
A better way is therefore to use specific number to seed the generator.
import random
random.seed(0)
However, there are libraries that does not use built-in random but rather use numpy.random. Therefore you also need to seed numpy's generator.
import numpy
numpy.random.seed(0)
Built-in random works as a singleton and I suppose numpy.random works the same way. It means that you set the seed once and then it is used everywhere.
I would like to create a code snippet which you could use at the beginning of your code and which would ensure the computational reproducibility in terms of random generators.
Is there any better way then combining both generators and setting both seeds and possibly even keeping the reproducibility across operating systems?
And are you familiar with any widely used pseudo-random generators that should be added to built-in random and to numpy random generators in order to make the snippet as general as possible?
I need to compute something like 10^8 uniformly distributed numbers in [0,1) for a Monte Carlo simulation. I can see two approaches to get these - compute all random numbers I need in one go, e.g. by using
numpy.random.random_sample(however_many_I_need)
or repeatedly call
numpy.random.random_sample()
Is there any difference in speed or quality of the random numbers between the two approaches?
Why not time them and see?
import timeit
timeit.timeit("np.random.random_sample()",
setup="import numpy as np",
number=int(1E8))
14.27977508635054
timeit.timeit("np.random.random_sample(int(1E8))",
setup="import numpy as np",
number=1)
1.4685100695671025
As to the quality, both results will be as pseudo-random as the other - they come from the same generator. If you need something more secure it might be worth looking elsewhere, but if this is for a simple Monte Carlo problem I don't think you really need to.
PS timeit is great
10^8 is a large number. As with all things numpy, this will be much faster if you pre-generate the numbers in one go, since you avoid the python function call overhead. This also applies for other operations you may want to do - additions, subtractions, multiplications, divisions, exponentiation, filtering, and lots of others
On the other hand, it doesn't help much if you pre-generate the numbers and the proceed to access each one individually from python. Make sure you can complete the simulation using matrix/vector operations.
As for the quality, there's no difference between the two methods you mention. If you need cryptographically secure random numbers, you should check #MayurPatel's answer. This is only needed if you need the random sequence to be difficult to guess for an attacker. For a monte carlo simulation you're probably more interested in statistical soundness, and numpy's random is enough
Would you consider using os.urandom()? I believe this is the highest quality you will get natively in python; but it may not be as fast as some other methods.
From the docs:
random.seed(a=None, version=2) Initialize the random number generator.
If a is omitted or None, the current system time is used. If
randomness sources are provided by the operating system, they are used
instead of the system time (see the os.urandom() function for details
on availability).
But...if it's truly random...(and I thought I read it uses Mersenne, so it's VERY random)...what's the point in seeding it? Either way the outcome is unpredictable...right?
The default is probably best if you want different random numbers with each run. If for some reason you need repeatable random numbers, in testing for instance, use a seed.
The module actually seeds the generator (with OS-provided random data from urandom if possible, otherwise with the current date and time) when you import the module, so there's no need to manually call seed().
(This is mentioned in the Python 2.7 documentation but, for some reason, not the 3.x documentation. However, I confirmed in the 3.x source that it's still done.)
If the automatic seeding weren't done, you'd get the same sequence of numbers every time you started your program, same as if you manually use the same seed every time.
But...if it's truly random
No, it's pseudo random. If it uses Mersenne Twister, that too is a PRNG.
It's basically an algorithm that generates the exact same sequence of pseudo random numbers out of a given seed. Generating truly random numbers requires special hardware support, it's not something you can do by a pure algorithm.
You might not need to seed it since it seeds itself on first use, unless you have some other or better means of providing a seed than what is time based.
If you use the random numbers for things that are not security related, a time based seed is normally fine. If you use if for security/cryptography, note what the docs say: "and is completely unsuitable for cryptographic purposes"
If you want to reproduce your results, you seed the generator with a known value so you get the same sequence every time.
A Mersenne twister, the random number generator, used by Python is seeded by the operating system served random numbers by default on those platforms where it is possible (Unixen, Windows); however on other platforms the seed defaults to the system time which means very repeatable values if the system time has a bad precision. On such systems seeding with known better random values is thus beneficial. Note that on Python 3 specifically, if version 2 is used, you can pass in any str, bytes, or bytearray to seed the generator; thus taking use of the Mersenne twister's large state better.
Another reason to use a seed value is indeed to guarantee that you get the same sequence of random numbers again and again - by reusing the known seed. Quoting the docs:
Sometimes it is useful to be able to reproduce the sequences given by
a pseudo random number generator. By re-using a seed value, the same
sequence should be reproducible from run to run as long as multiple
threads are not running.
Most of the random module’s algorithms and seeding functions are
subject to change across Python versions, but two aspects are
guaranteed not to change:
If a new seeding method is added, then a backward compatible seeder will be offered.
The generator’s random() method will continue to produce the same sequence when the compatible seeder is given the same seed.
For this however, you mostly want to use the random.Random instances instead of using module global methods (the multiple threads issue, etc).
Finally also note that the random numbers produced by Mersenne twister are unsuitable for cryptographical use; whereas they appear very random, it is possible to fully recover the internal state of the random generator by observing only some hundreds of values from the generator. For cryptographical algorithms, you want to use the SystemRandom class.
In most cases I would say there is no need to care about. But if someone is really willing to do something wired and (s)he could roughly figure out your system time when your code was running, they might be able to brute force replay your random numbers and see which series fits. But I would say this is quite unlikely in most cases.