How to ensure good, different initial NumPy MT19937 states? - python

Introduction - legacy NumPy
The legacy NumPy code for initializing MT19937 instances (the same algorithm as on Wikipedia) ensures that different seed values lead to different initial states, at least when a single int is provided. Let's check the first 3 numbers of the PRNG's state:
np.random.seed(3)
np.random.get_state()[1][:3]
# gives array([ 3, 1142332464, 3889748055], dtype=uint32)
np.random.seed(7)
np.random.get_state()[1][:3]
# gives array([ 7, 4097098180, 3572822661], dtype=uint32)
However, this method is criticized for 2 reasons:
seed size is limited by the underlying type, uint32
similar seeds may result in similar random numbers
The former can be solved by providing a sequence of ints (which is indeed implemented, but how?); the latter is harder to address. The legacy initialization code was written with this property in mind [^1].
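For reference, a minimal sketch of the sequence-of-ints seeding mentioned above: passing a list instead of a single int routes the legacy MT19937 through its array-based initialization, so the effective seed is no longer limited to a single uint32 (output values not shown here).
np.random.seed([3, 7])                 # two 32-bit words of seed material
print(np.random.get_state()[1][:3])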
Introduction - new NumPy random
In the new implementation, the provided seed value is hashed first, and the result is used to build the initial state of the MT19937. This hashing ensures that the similarity of seed values doesn't matter: two similar seed values produce different initial states with the same probability as two dissimilar seed values. Previously we saw that, for adjacent seed values, the first state variable (out of 600+) is similar. In the new implementation, by contrast, no similar values can be found (with high probability), except for the first one for some reason:
prng = np.random.Generator(np.random.MT19937(3))
prng.bit_generator.state["state"]["key"][:3]
# gives array([2147483648, 2902887791, 607385081], dtype=uint32)
prng = np.random.Generator(np.random.MT19937(7))
prng.bit_generator.state["state"]["key"][:3]
# gives array([2147483648, 3939563265, 4185785210], dtype=uint32)
Two different seed values (ints of any length) may result in the same initial state with a probability of $2^{-128}$ (by default).
If the problem of similar seeds had already been solved by Matsumoto et al. [^1], then there was no need to use a hash function, which introduces the state-collision problem.
Question
Given the new implementation in NumPy, is there a good practice that guarantees different initial states of the MT19937 instances and still passes quality requirements for similar seed values? I am looking for an initialization method that consumes at least 64 bits.
How about modifying the generate_state output of the SeedSequence class: if two ints are given, replace the first 2 states (maybe except the first one) with the given seed values themselves:
class secure_SeedSequence(np.random.SeedSequence):
    def __init__(self, seed1: np.uint32, seed2: np.uint32):
        self.seed1 = seed1
        self.seed2 = seed2

    def generate_state(self, n_words, dtype=np.uint32):
        # hash the two seeds as usual, then overwrite two of the
        # output words with the raw seed values themselves
        ss = np.random.SeedSequence([self.seed1, self.seed2])
        states = ss.generate_state(n_words, dtype)
        states[1] = self.seed1
        states[2] = self.seed2
        return states
ss_a = secure_SeedSequence(3, 1)
prng_a = np.random.Generator(np.random.MT19937(ss_a))
print(prng_a.bit_generator.state["state"]["key"][:6])
# gives [2147483648 3 1 354512857 3507208499 1943065218]
ss_b = secure_SeedSequence(3, 2)
prng_b = np.random.Generator(np.random.MT19937(ss_b))
print(prng_b.bit_generator.state["state"]["key"][:6])
# gives [2147483648 3 2 2744275888 1746192816 3474647608]
Here secure_SeedSequence consumes 2*32 = 64 bits, prng_a and prng_b are in different states, and except for the first 3 state variables, none of the state variables are alike. According to Wikipedia, the first 2 generated numbers may have some correlation with the first 2 state variables, but after generating 624 random numbers the internal state no longer reflects the initial seeds. To avoid this problem, the code can be improved by skipping the first 2 random numbers, as sketched below.
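A minimal sketch of that improvement, reusing the secure_SeedSequence above (how many raw words each draw consumes is an implementation detail, so this is only illustrative):
prng_a = np.random.Generator(np.random.MT19937(secure_SeedSequence(3, 1)))
_ = prng_a.integers(0, 2**32, size=2, dtype=np.uint32)   # draw and discard two 32-bit values
# subsequent output no longer comes directly from the injected seed words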
Workaround
One can claim that the chance that two MT19937 instances end up in the same state after providing different entropy to their SeedSequences is arbitrarily low; by default, it is $2^{-128}$. But I am looking for a solution that guarantees with 100% probability that the initial states are different, not only with probability $1-2^{-32\cdot N}$.
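For a sense of scale (a rough union-bound estimate added here, not a figure from NumPy's documentation): with the default 128-bit pool, the chance that any two out of $N$ independently seeded generators collide in their initial state is at most about $\binom{N}{2}\,2^{-128} \approx N^2/2^{129}$; for example, $N=10^6$ gives roughly $1.5\times10^{-27}$.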
Moreover, my concern with this calculation is that although the chance of getting a garbage stream is low, once we have one it produces garbage output forever. Therefore, if streams of length $M$ are generated and $N$ streams/PRNGs are used, then when selecting $M$ numbers from this $M \times N$ 2D array, the chance that at least one selected number is garbage tends to 1.
Why did I ask it here?
this is strongly related to a given implementation in NumPy
the chance that I get an answer is highest here
I think this is a common issue, and I hope others have already investigated this topic deeply
[^1]: Makoto Matsumoto et al., Common Defects in Initialization of Pseudorandom Number Generators, around equation 30.

Related

How Does Numpy Random Seed Changes?

So, I'm working on a project that uses the Monte Carlo method, and I was studying the importance of the seed for pseudo-random number generation.
While doing experiments with Python's numpy random, I was trying to understand how changing the seed affects the randomness, but I found something peculiar, at least to me. Using numpy.random.get_state() I saw that every time I run the script the seed starts out different, changes once, but then keeps the same value for the entire script, as shown in this code, which compares the states of two consecutive samplings:
import numpy as np

rand_state = [0]
for i in range(5):
    rand_state_i = np.random.get_state()[1]
    # printing only 3 state numbers, but comparing all of them
    print(np.random.rand(), rand_state_i[:3], all(rand_state_i == rand_state))
    rand_state = rand_state_i
# Print:
# 0.9721364306537633 [2147483648 2240777606 2786125948] False
# 0.0470329351113805 [3868808884 608863200 2913530561] False
# 0.4471038484385019 [3868808884 608863200 2913530561] True
# 0.2690477632739811 [3868808884 608863200 2913530561] True
# 0.7279016433547768 [3868808884 608863200 2913530561] True
So, my question is: how does the seed keep the same value while returning different random values for each sampling? Does numpy use other or more "data" to generate random numbers than what is present in numpy.random.get_state()?
You're only looking at part of the state. The big array of 624 integers isn't the whole story.
The Mersenne Twister only updates its giant internal state array once every 624 calls. The rest of the time, it just reads an element of that array, feeds it through a "tempering" pass, and outputs the tempered result. It only updates the array on the first call, or once it's read every element.
To keep track of the last element it read, the Mersenne Twister has an additional position variable that you didn't account for. It's at index 2 in the get_state() tuple. You'll see it increment in steps of 2 in your loop, because np.random.rand() has to fetch 2 32-bit integers to build a single double-precision floating point output.
(NumPy also maintains some additional state that's not really part of the Mersenne Twister state, to generate normally distributed values more efficiently. You'll find this in indexes 3 and 4 of the get_state() tuple.)
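A quick illustration of that position counter (added here, using the legacy global RandomState): element 2 of the get_state() tuple advances by 2 for every np.random.rand() call.
import numpy as np

np.random.seed(0)
np.random.rand()                            # first call fills the 624-word array and resets the position
for _ in range(3):
    pos_before = np.random.get_state()[2]   # index of the next word to be read
    np.random.rand()
    pos_after = np.random.get_state()[2]
    print(pos_before, "->", pos_after)      # e.g. 2 -> 4, 4 -> 6, 6 -> 8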

How to generate unique(!) arrays/lists/sequences of uniformly distributed random

Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations there is no array equal to another?
Within one array, the elements can be duplicates. An array just has to differ from the other arrays in at least one of its elements.
Is there any numpy method for this? Is there some special algorithm that works differently, by exploring some space for the random generation? I don't know.
One easy answer would be to write the arrays to a file and check whether they were already generated, but the I/O operations on an ever-growing file take way too much time.
This is a difficult request, since one of the properties of an RNG is that it will repeat values purely by chance.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions" then the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64 bit (DES) or 128 bit (AES) output then you will need some sort of format preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see limits as highlighted by #Prune.
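This is not the answerer's exact construction, but a minimal sketch of the counter-plus-encryption idea under stated assumptions: a toy 4-round Feistel permutation over 64-bit values (SHA-256 as the round function), with cycle-walking so the result stays below (max_value + 1) ** pack_length, then conversion to base max_value + 1 digits. For serious use, substitute a real format-preserving cipher.
import hashlib

def _round(half, key, r):
    # toy round function: 32 bits derived from the key, the round number and the right half
    digest = hashlib.sha256(key + bytes([r]) + half.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big")

def feistel64(x, key, rounds=4):
    # a balanced Feistel network is a permutation of [0, 2**64) for any round function
    left, right = x >> 32, x & 0xFFFFFFFF
    for r in range(rounds):
        left, right = right, left ^ _round(right, key, r)
    return (left << 32) | right

def counter_to_pack(counter, key, pack_length=10, max_value=9):
    modulus = (max_value + 1) ** pack_length
    x = feistel64(counter, key)
    while x >= modulus:                      # cycle-walk back into [0, modulus)
        x = feistel64(x, key)
    digits = []
    for _ in range(pack_length):             # write x as pack_length base-(max_value+1) digits
        x, d = divmod(x, max_value + 1)
        digits.append(d)
    return digits

key = b"any fixed secret"
print(counter_to_pack(0, key))
print(counter_to_pack(1, key))               # guaranteed different from pack 0
Encrypting the counters 0, 1, 2, ... with a fixed key yields distinct packs with no lookup table at all.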
Note that as the number of requested packs approaches the number of unique packs this takes longer and longer to find a pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random

## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()

    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)

    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------
for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
for a given amount of state and number of iterations what's the probability of seeing at least one collision
for a given amount of state how many iterations can you generate to stay below a given probability of seeing at least one collision
for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2 * -log1p(-p))

log2(birthday_state_size(1e12, 1e-6))  # => ~100
So as long as you have more than 100 uniform bits of state in each pack, everything should be fine. For example, two or more Python floats are OK (2 * 53 bits), as are 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.

Fitness function with multiple weights in DEAP

I'm learning to use the Python DEAP module and I have created a minimising fitness function and an evaluation function. The code I am using for the fitness function is below:
ct.create("FitnessFunc", base.Fitness, weights=(-0.0001, -100000.0))
Notice the very large difference in weights. This is because the DEAP documentation for Fitness says:
The weights can also be used to vary the importance of each objective one against another. This means that the weights can be any real number and only the sign is used to determine if a maximization or minimization is done.
To me, this says that you can prioritise one weight over another by making it larger.
I'm using algorithms.eaSimple (with a HallOfFame) to evolve and the best individuals in the population are selected with tools.selTournament.
The evaluation function returns abs(sum(input)), len(input).
After running, I take the values from the HallOfFame and evaluate them, however, the output is something like the following (numbers at end of line added by me):
(154.2830144, 3) 1
(365.6353634, 4) 2
(390.50576340000003, 3) 3
(390.50576340000003, 14) 4
(417.37616340000005, 4) 5
The thing that is confusing me is that I thought that the documentation stated that the larger second weight meant that len(input) would have a larger influence and would result in an output like so:
(154.2830144, 3) 1
(365.6353634, 4) 2
(390.50576340000003, 3) 3
(417.37616340000005, 4) 5
(390.50576340000003, 14) 4
Notice that lines 4 and 5 are swapped. This is because the heavily weighted second value of line 4 is much larger than that of line 5.
It appears that the fitness is actually evaluated based on the first element first, and then the second element is only considered if there is a tie between the first elements. If this is the case, then what is the purpose of setting a weight other than -1 or +1?
From a Pareto-optimality standpoint, neither of the two solutions A=(390.50576340000003, 14) and B=(417.37616340000005, 4) is superior to the other, regardless of the weights: in terms of the weighted fitness values, always f1(A) > f1(B) and f2(A) < f2(B), so neither dominates the other (source).
If they are on the same frontier, the winner can be selected based on a secondary metric: the density of solutions surrounding each solution on the frontier, which does account for the weights (weighted crowding distance). That is what happens if you select an appropriate operator, such as selNSGA2. The selTournament operator you are using selects on the basis of the first objective only:
def selTournament(individuals, k, tournsize, fit_attr="fitness"):
    chosen = []
    for i in xrange(k):
        aspirants = selRandom(individuals, tournsize)
        chosen.append(max(aspirants, key=attrgetter(fit_attr)))
    return chosen
If you still want to use selTournament, you can consider updating your evaluation function to return a single output: the weighted sum of the objectives. This approach would fail in the case of a non-convex objective space, though (page 12 here for details).
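A hedged sketch of that single-objective alternative (the names W_SUM and W_LEN are placeholders of mine, reusing the magnitudes from the question): fold both objectives into one scalar so the single comparison made by selTournament is meaningful.
from deap import base, creator

# one minimised objective instead of two
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))

W_SUM, W_LEN = 0.0001, 100000.0     # relative importance of the two terms

def evaluate(individual):
    # DEAP expects a tuple, so return the weighted sum as a 1-tuple
    return (W_SUM * abs(sum(individual)) + W_LEN * len(individual),)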

Difference of random numbers generation in matlab and python

I am a newbie in Python. I am trying to implement a genetic algorithm that I previously implemented in MatLab. I need to create several chromosomes (individuals) for my initial population. In MatLab I did this with the rand function, and it gave me plenty of unique (or at least different enough) initial individuals. But here in Python I have tried different methods from random, and among 50 individuals I get only 3 to 4 unique chromosomes. This is my Python code in __main__.py:
for i in range(pop_size):
    popSpace.append(Chromosom(G=mG, M=mM))
    sum_Q += popSpace[i].Q
And my Chromosom class:
class Chromosom:
    def __init__(self, G, M):
        self.Q = 0
        self.V = []
        self.M = M
        self.G = G
        self.Chr = [0] * len(self.M)   # must come after self.M is assigned
        self.randomChromosom()
        self.updateQ_Qs()

    def randomChromosom(self):
        # set each gene to 1 with probability 0.5
        for m in range(len(self.M)):
            if random.random() < 0.5:
                self.Chr[m] = 1
            else:
                self.Chr[m] = 0
I have also tried getting random bits, but the results were still the same. For example, I used print(str(main.mRand.getrandbits(6))) to watch the results in the console and realized that there were too many duplicated numbers. Is there a method to create more unique random numbers? In MatLab, the same code with the rand function worked well (though rather slowly). Having such a homogeneous initial population causes poor results in the next steps (I should also mention that the randomness problem causes similar mutations, too). My problem is that there are so many similar chromosomes. For example, I got several 01111001s, which seems strange considering their probability of occurrence.
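For what it's worth, duplicates are expected here. Assuming 8-bit chromosomes (as the 01111001 example suggests), there are only 2**8 = 256 possible chromosomes, so a 50-individual population will almost surely contain repeats regardless of the language. A quick birthday-problem check:
from math import prod

n_possible, n_individuals = 2**8, 50
p_all_unique = prod((n_possible - i) / n_possible for i in range(n_individuals))
print(f"P(at least one duplicate) = {1 - p_all_unique:.3f}")   # roughly 0.99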

Generating an NxM array of uniformly distributed random numbers over a stated interval (not [0,1)) in numpy

I am aware of the numpy.random.rand() command; however, there doesn't seem to be any argument allowing you to adjust the uniform interval from which the numbers are chosen to something other than [0,1).
I considered using a for loop, i.e., initiating a zero array of the needed size and using numpy.random.uniform(a,b,N) to generate N random numbers in the interval (a,b), then putting these into the initiated array. I am not aware of this function being able to create an array of arbitrary dimension, like rand above. This is clearly inelegant, although my main concern is the run time. I presume this method would have a much higher run time than using the appropriate random number generator from the start.
Edit and additional thought: the interval I am working in is [0, pi/8), which is less than 1. Strictly speaking, I won't be affecting the randomness of the generated numbers if I just rescale, but multiplying each generated random number would clearly add computational time, I presume by a factor on the order of the number of elements.
np.random.uniform accepts a low and a high:
In [11]: np.random.uniform(-3, 3, 7) # 7 numbers between -3 and 3
Out[11]: array([ 2.68365104, -0.97817374, 1.92815971, -2.56190434, 2.48954842, -0.16202127, -0.37050593])
numpy.random.uniform accepts a size argument to which you can pass the shape of your array as a tuple. To generate an MxN array, use
np.random.uniform(low,high, size=(M,N))
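Applied to the interval from the question, [0, pi/8) (M and N below are just example dimensions):
import numpy as np

M, N = 4, 3
samples = np.random.uniform(0.0, np.pi / 8, size=(M, N))
print(samples.shape)                                    # (4, 3)
print(samples.min() >= 0.0, samples.max() < np.pi / 8)  # True True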
