So, I'm working on a project that uses the Monte Carlo method, and I was studying the importance of the seed for pseudo-random number generation.
While experimenting with Python's numpy random, I was trying to understand how changing the seed affects the randomness, but I found something peculiar, at least to me. Using numpy.random.get_state() I saw that every time I run the script the seed starts out different, changes once, but then keeps the same value for the rest of the script, as shown in this code, which compares the states from two consecutive samplings:
import numpy as np

rand_state = [0]
for i in range(5):
    rand_state_i = np.random.get_state()[1]
    # printing only 3 state numbers, but comparing all of them
    print(np.random.rand(), rand_state_i[:3], all(rand_state_i == rand_state))
    rand_state = rand_state_i
# Print:
# 0.9721364306537633 [2147483648 2240777606 2786125948] False
# 0.0470329351113805 [3868808884 608863200 2913530561] False
# 0.4471038484385019 [3868808884 608863200 2913530561] True
# 0.2690477632739811 [3868808884 608863200 2913530561] True
# 0.7279016433547768 [3868808884 608863200 2913530561] True
So, my question is: how does the seed keep the same value while returning different random values for each sample? Does numpy use other or more "data" to generate random numbers than what is present in numpy.random.get_state()?
You're only looking at part of the state. The big array of 624 integers isn't the whole story.
The Mersenne Twister only updates its giant internal state array once every 624 calls. The rest of the time, it just reads an element of that array, feeds it through a "tempering" pass, and outputs the tempered result. It only updates the array on the first call, or once it's read every element.
To keep track of the last element it read, the Mersenne Twister has an additional position variable that you didn't account for. It's at index 2 in the get_state() tuple. You'll see it increment in steps of 2 in your loop, because np.random.rand() has to fetch 2 32-bit integers to build a single double-precision floating point output.
(NumPy also maintains some additional state that's not really part of the Mersenne Twister state, to generate normally distributed values more efficiently. You'll find this in indexes 3 and 4 of the get_state() tuple.)
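You can watch that position counter move with a quick check like this (a minimal sketch, not part of the answer above; the seed value 0 is an arbitrary choice):
import numpy as np

np.random.seed(0)
np.random.rand()                      # first call fills the 624-word state array
for _ in range(3):
    print(np.random.get_state()[2])   # index 2 of the tuple is the position counter
    np.random.rand()                  # each double consumes two 32-bit words
# prints the position advancing in steps of 2 (e.g. 2, 4, 6),
# while get_state()[1], the 624-integer array, stays the same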
Related
Is it possible to generate in pseudo-random ORDER all the numbers from 0 .. N, without repeating any number AND w/o keeping track of what numbers were already generated
For example, the opposite, a non-random rule, would be:
- generate all ODD values
- generate all EVEN values
Does
np.random.choice(range(1000000), 1000000, replace=False)
materialize the range?
Yes, it's possible.
You could create a custom LCG for your given N or the next power of two greater than your N, but the quality of the random numbers is quite bad.
A better method is to create a seeded hash function that is reversible for every power of two, and hash all numbers from 0 to next_pow_2(N) while rejecting numbers greater than N. This article explains it quite well: https://andrew-helmer.github.io/permute/
The above method works best if N isn't that small (N > 2^14 would be advisable for the implementation in the linked article), because creating good hash functions for a small input width is very hard.
Note that while these methods work, you should really consider just shuffling an array of numbers 0 to N, as that is usually faster than the above methods.
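For reference, a minimal sketch of that shuffle baseline (the helper name shuffled_range is illustrative):
import numpy as np

def shuffled_range(n, seed=None):
    # Yield 0..n-1 exactly once each, in a pseudo-random order.
    # This does materialize the full range, which is the trade-off noted above.
    rng = np.random.default_rng(seed)
    for x in rng.permutation(n):
        yield int(x)

print(list(shuffled_range(10, seed=42)))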
Shuffle all the numbers in the range, then pick them off in the shuffled order.
More work would be to develop a format-preserving encryption that only produces numbers in the required range, and then encrypt the numbers 0, 1, 2, 3, ... Because encryption is a one-to-one mapping for a given key, different inputs are guaranteed to produce different outputs.
Whatever method you use, you will only be able to output as many unique numbers as there are in the initial range, obviously. After that, the numbers will start to repeat.
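To make the permute-and-reject idea from the two answers above concrete, here is a rough sketch that permutes the next power-of-two range with a small Feistel network and discards out-of-range values. The keyed-hash round function, the round count, and the helper names are illustrative assumptions, not a vetted cipher, and the statistical quality for small N is poor, as already noted:
import hashlib

def feistel_permute(i, n_bits, key, rounds=4):
    # Balanced Feistel network over n_bits-bit integers (n_bits must be even).
    # Each round is invertible whatever the round function is, so the whole
    # construction is a bijection on [0, 2**n_bits).
    half = n_bits // 2
    mask = (1 << half) - 1
    left, right = i >> half, i & mask
    for r in range(rounds):
        digest = hashlib.blake2b(f"{key}:{r}:{right}".encode(), digest_size=8).digest()
        left, right = right, left ^ (int.from_bytes(digest, "big") & mask)
    return (left << half) | right

def pseudo_random_order(n, key=0):
    # Permute [0, next power of two) and skip outputs >= n.
    n_bits = max((n - 1).bit_length(), 2)
    n_bits += n_bits % 2                  # keep the Feistel halves equal
    for i in range(1 << n_bits):
        x = feistel_permute(i, n_bits, key)
        if x < n:
            yield x

print(list(pseudo_random_order(10, key=7)))  # each of 0..9 exactly once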
Introduction - legacy NumPy
The legacy NumPy code for initializing MT19937 instances (the same as on Wikipedia) ensures that different seed values lead to different initial states (at least when a single int is provided). Let's check the first 3 numbers in the PRNG's state:
np.random.seed(3)
np.random.get_state()[1][:3]
# gives array([ 3, 1142332464, 3889748055], dtype=uint32)
np.random.seed(7)
np.random.get_state()[1][:3]
# gives array([ 7, 4097098180, 3572822661], dtype=uint32)
However, this method is criticized for two reasons:
- the seed size is limited by the underlying type, uint32
- similar seeds may result in similar random numbers
The former can be solved by allowing a sequence of ints to be provided (which is indeed implemented, but how?), but the latter is harder to address. The legacy implementation was written with this property in mind [^1].
Introduction - new NumPy random
In the new implementation, the provided seed value is hashed first and then used to feed the initial state of the MT19937. This hashing ensures that
the similarity of the seed values doesn't matter: two similar seed values produce different initial states with the same probability as dissimilar seed values. Above we saw that, for adjacent seed values, the first state variable (out of 600+) is similar, whereas in the new implementation not a single similar value can be found (with high probability), except for the first one for some reason:
prng = np.random.Generator(np.random.MT19937(3))
prng.bit_generator.state["state"]["key"][:3]
# gives array([2147483648, 2902887791, 607385081], dtype=uint32)
prng = np.random.Generator(np.random.MT19937(7))
prng.bit_generator.state["state"]["key"][:3]
# gives array([2147483648, 3939563265, 4185785210], dtype=uint32)
Two different seed values (ints of any length) may result in the same initial state with a probability of $2^{-128}$ (by default).
If the problem of similar seeds had already been solved by Matsumoto et al. [^1], then there was no need to use a hash function, which introduces the state-collision problem.
Question
Given the new implementation in NumPy, is there a good practice that ensures different initial states for the MT19937 instances and passes quality requirements when it comes to similar seed values? I am looking for an initialization method that consumes at least 64 bits.
How about modifying the generate_state output of the SeedSequence class: if two ints are given, replace the first 2 states (maybe except the first one) with the given seed values themselves:
class secure_SeedSequence(np.random.SeedSequence):
    def __init__(self, seed1: np.uint32, seed2: np.uint32):
        self.seed1 = seed1
        self.seed2 = seed2

    def generate_state(self, n_words, dtype):
        ss = np.random.SeedSequence([self.seed1, self.seed2])
        states = ss.generate_state(n_words, dtype)
        states[1] = self.seed1
        states[2] = self.seed2
        return states

ss_a = secure_SeedSequence(3, 1)
prng_a = np.random.Generator(np.random.MT19937(ss_a))
prng_a.bit_generator.state["state"]["key"][:6]
# gives [2147483648 3 1 354512857 3507208499 1943065218]

ss_b = secure_SeedSequence(3, 2)
prng_b = np.random.Generator(np.random.MT19937(ss_b))
prng_b.bit_generator.state["state"]["key"][:6]
# gives [2147483648 3 2 2744275888 1746192816 3474647608]
Here secure_SeedSequence consumes 2*32 = 64 bits, prng_a and prng_b are in different states, and apart from the first 3 state variables, none of the state variables are alike. According to Wikipedia, the first 2 output numbers may have some correlation with the first 2 state variables, but after generating 624 random numbers the next internal state won't reflect the initial seeds anymore. To avoid this problem, the code can be improved by skipping the first 2 random numbers.
Workaround
One can claim that the chance that two MT19937 instances will have the same state after providing different entropy to their SeedSequence is arbitrarily low; by default it is $2^{-128}$. But I am looking for a solution that guarantees with 100% probability that the initial states are different, not only with probability $1-2^{-32\cdot N}$.
Moreover, my concern with this calculation is that although the chance of getting garbage streams is low, once we have them they produce garbage output forever. Therefore, if a stream of length $M$ is generated and $N$ streams/PRNGs are used, then by selecting $M$ numbers from this $M \times N$ 2D array, the chance that a number is garbage tends to 1.
Why did I ask it here?
- it is strongly related to a specific implementation in NumPy
- the chances that I get an answer are highest here
- I think this is a common issue, and I hope others have already investigated this topic deeply
[^1]: Makoto Matsumoto et al., Common Defects in Initialization of Pseudorandom Number Generators, around equation 30.
Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations there is no array equal to another?
Within one array the elements can be duplicates. The array just has to differ from every other array in at least one of its elements.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don’t know.
One easy answer would be to write the arrays to a file and check whether they were generated already, but the I/O operations on an ever-growing file take far too much time.
This is a difficult request, since one of the properties of an RNG is that it will repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
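As a rough illustration of the disk-backed lookup idea, here is a sketch that swaps the hand-rolled on-disk tree for sqlite3's built-in B-tree index; this substitution and the table/file names are my own assumptions, not what the answer prescribes:
import hashlib
import sqlite3

db = sqlite3.connect("seen_packs.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (digest BLOB PRIMARY KEY)")

def record_if_new(pack):
    # Hash the pack and let the PRIMARY KEY index do the fast duplicate lookup.
    digest = hashlib.sha256(repr(tuple(pack)).encode()).digest()
    try:
        db.execute("INSERT INTO seen (digest) VALUES (?)", (digest,))
        db.commit()
        return True    # not seen before
    except sqlite3.IntegrityError:
        return False   # duplicate pack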
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions", the best way to generate guaranteed-unique numbers in the range is probably to use an encryption with the correct output size. Unless you want 64-bit (DES) or 128-bit (AES) output, you will need some sort of format-preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously, as a final step, you need to convert the encryption output to a 10-digit base-(max+1) number and put the digits into an array.
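As a sketch of that final conversion step (assuming the unique integers come from the encrypt-a-counter scheme described above; the helper name int_to_pack is illustrative):
def int_to_pack(x, max_value, length=10):
    # Write x as `length` digits in base max_value + 1, least significant digit first.
    base = max_value + 1
    pack = []
    for _ in range(length):
        x, digit = divmod(x, base)
        pack.append(digit)
    return pack

print(int_to_pack(1234567890, max_value=9))  # -> [0, 9, 8, 7, 6, 5, 4, 3, 2, 1]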
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see the limits highlighted by @Prune.
Note that as the number of requested packs approaches the number of possible unique packs, it takes longer and longer to find a new pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random

## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()

    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)

    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------

for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
- for a given amount of state and number of iterations, what's the probability of seeing at least one collision
- for a given amount of state, how many iterations can you generate while staying below a given probability of seeing at least one collision
- for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2 * -log1p(-p))

log2(birthday_state_size(1e12, 1e-6))  # => ~100
So as long as you have more than 100 uniform bits of state in each pack, everything should be fine. For example, two or more Python floats are OK (2 * 53 = 106 bits), as are 10 integers with >= 1000 distinct values each (10 * log2(1000) ≈ 100 bits).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
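For comparison, option 1 from the list above can be sketched with the usual approximation $p \approx 1 - e^{-n^2/(2H)}$ (the function name is illustrative):
from math import expm1

def birthday_collision_probability(iterations, state_bits):
    # P(at least one collision) ~= 1 - exp(-n**2 / (2 * 2**state_bits))
    return -expm1(-(iterations ** 2) / (2.0 * 2.0 ** state_bits))

print(birthday_collision_probability(1e12, 100))  # roughly 4e-7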
I am using random.seed to generate pseudo-random numbers over a certain number of iterations. However, with this method, for the same number of iterations, it generates the same initial value each time. I am wondering if there is a way to write the code so that it generates 4 different random initial values that lie in different locations of the parameter range. For example, my code looks like this:
import random
N=10
random.seed(N)
vx1 = [random.uniform(-3,3) for i in range(N)]
And each time this will generate the starting value vx1[0] = 0.428. Is there a way to write the code so that it generates four different initial values of vx1? So the initial value of vx1 could equal 0.428 or 3 other values. Then each initial value would also be followed by the remaining 9 random numbers in the range.
I think you have a fundamental misunderstanding as to what random.seed does. "Random" number generators are actually deterministic systems that generate pseudo-random numbers. The seed is a label for a reproducible initial state. The whole point of it is that the same sequence of numbers will be generated for the same seed.
If you want to create a reproducible sequence of 1,000,000 numbers, use a seed:
s = 10
N = 1000000
random.seed(s)
vx1 = [random.uniform(-3, 3) for i in range(N)]
If you want to generate a different sequence every time, use a different seed every time. The simplest way to do this is to just not call seed:
N = 1000000
vx1 = [random.uniform(-3, 3) for i in range(N)]
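For the specific "4 different initial values" part of the question, a minimal sketch is to loop over four distinct seeds (the seed values below are arbitrary choices):
import random

seeds = [10, 20, 30, 40]                # any four distinct seeds
runs = {}
for s in seeds:
    random.seed(s)
    runs[s] = [random.uniform(-3, 3) for _ in range(10)]

for s in seeds:
    print(s, runs[s][0])                # four different, reproducible starting values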
I am aware of the numpy.random.rand() command; however, there don't seem to be any arguments that let you adjust the uniform interval from which the numbers are drawn to something other than [0,1).
I considered using a for loop, i.e. initializing a zero array of the needed size, using numpy.random.uniform(a, b, N) to generate N random numbers in the interval (a, b), and then putting these into the initialized array. I am not aware of this function being able to create an array of arbitrary dimension, like rand above. This is clearly inelegant, although my main concern is the run time. I presume this method would have a much higher run time than using the appropriate random number generator from the start.
Edit and additional thought: the interval I am working in is [0, pi/8), which is less than 1. Strictly speaking, I won't be affecting the randomness of the generated numbers if I just rescale, but multiplying each generated random number would clearly add computational time, I presume proportional to the number of elements.
np.random.uniform accepts a low and a high:
In [11]: np.random.uniform(-3, 3, 7) # 7 numbers between -3 and 3
Out[11]: array([ 2.68365104, -0.97817374, 1.92815971, -2.56190434, 2.48954842, -0.16202127, -0.37050593])
numpy.random.uniform accepts a size argument to which you can pass the shape of your array as a tuple. To generate an MxN array, use
np.random.uniform(low, high, size=(M, N))
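For the interval from the question, for instance, a 3x4 array on [0, pi/8) could be drawn in one call (the shape here is an arbitrary choice):
import numpy as np

samples = np.random.uniform(0.0, np.pi / 8, size=(3, 4))
print(samples.shape)                                    # (3, 4)
print(samples.min() >= 0.0, samples.max() < np.pi / 8)  # True True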