I'm using sequential seeds (1, 2, 3, 4, ...) to generate random numbers in a simulation. Does the fact that the seeds are near each other make the generated pseudo-random numbers similar as well?
I don't think it changes anything, but I'm using Python.
Edit: I have done some tests and the numbers don't look similar. But I'm afraid that the similarity cannot be noticed just by looking at the numbers. Is there any theoretical feature of random number generation that guarantees that different seeds give completely independent pseudo-random numbers?
There will definitely be a correlation between the seed and the random numbers generated, by definition. The question is whether the randomization algorithm is sufficient to produce results that seem uncorrelated, and you should study up on the methods for evaluating randomness to answer that question.
You are right to be concerned though. Here are the results from Microsoft's C++ rand function with seed values from 0 to 9:
38 7719 21238 2437 8855 11797 8365 32285 10450 30612
41 18467 6334 26500 19169 15724 11478 29358 26962 24464
45 29216 24198 17795 29484 19650 14590 26431 10705 18316
48 7196 9294 9091 7031 23577 17702 23503 27217 12168
51 17945 27159 386 17345 27504 20815 20576 10960 6020
54 28693 12255 24449 27660 31430 23927 17649 27472 32640
58 6673 30119 15745 5206 2589 27040 14722 11216 26492
61 17422 15215 7040 15521 6516 30152 11794 27727 20344
64 28170 311 31103 25835 10443 497 8867 11471 14195
68 6151 18175 22398 3382 14369 3609 5940 27982 8047
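To eyeball the same idea in Python (a quick sketch of my own, not the generator from the table above), you can print the first few outputs of random.Random for seeds 0 to 9:
import random

for seed in range(10):
    rng = random.Random(seed)  # a separate generator per seed
    print(seed, [rng.randint(0, 32767) for _ in range(10)])
Python's Mersenne Twister scrambles the seed during initialisation, so you should not expect the obvious first-column pattern that Microsoft's rand() shows above, but printing the streams is still a useful sanity check.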
If you are worried about sequential seeds, then don't use sequential seeds. Set up a master RNG, with a known seed, and then take successive outputs from that master RNG to seed the various child RNGs as needed.
Because you know the initial seed for the master RNG, the whole simulation can be run again, exactly as before, if required.
masterSeed <- 42
masterRNG <- new Random(masterSeed)
childRNGs[] <- array of child RNGs
foreach childRNG in childRNGs
childRNG.setSeed(masterRNG.next())
endforeach
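In Python, that pattern might look roughly like this (a sketch using random.Random instances; the class and method names in the pseudocode above are not a real API):
import random

MASTER_SEED = 42  # known seed, so the whole run is reproducible
master_rng = random.Random(MASTER_SEED)

# Seed each child generator from successive outputs of the master RNG.
child_rngs = [random.Random(master_rng.getrandbits(32)) for _ in range(10)]

# Each child can now drive one part of the simulation independently.
print([rng.random() for rng in child_rngs])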
I have found measurable, but small, correlations in random numbers generated from the Mersenne Twister when using sequential seeds for multiple simulations, the results of which are averaged to yield final results. In Python on Linux, the correlations go away if I use seeds generated by the system random source (not pseudorandom numbers) via random.SystemRandom(). I store SystemRandom numbers in files and read them when a seed is needed in a simulation.
To generate seeds:
import random

myrandom = random.SystemRandom()  # uses the OS entropy source, not the Mersenne Twister
x = myrandom.random()  # yields a number in [0, 1)
# dump x out to file...
Then, when seeds are needed:
import random

# read x from file...
newseed = int(x * (2**31))  # produce an integer seed in [0, 2**31)
random.seed(newseed)
nextran = random.random()
nextran = random.random()
# ...
First: define similarity. Next: code a similarity test. Then: check for similarity.
With only a vague description of similarity it is hard to check for it.
What kind of simulation are you doing?
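As one concrete (and entirely illustrative) notion of similarity, you could compute the correlation between the streams produced by adjacent seeds:
import random

def stream(seed, n=10000):
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

def pearson(xs, ys):
    # Plain Pearson correlation coefficient, no external libraries.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

print(pearson(stream(1), stream(2)))  # should be close to 0 for unrelated-looking streams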
For simulation purposes your argument is valid (depending on the type of simulation), but if you implement this anywhere other than a simulation, it could be easily attacked: any security property of the environment that depends on the generated random numbers would be undermined by predictable seeding.
If you are simulating the outcome of a machine to judge whether it is harmful to society or not, then your results will not be acceptable: that kind of simulation requires maximum randomness in every way possible, and I would not trust this reasoning.
To quote the documentation from the random module:
General notes on the underlying Mersenne Twister core generator:
The period is 2**19937-1.
It is one of the most extensively tested generators in existence.
I'd be more worried about my code being broken than my RNG not being random enough. In general, your gut feelings about randomness are going to be wrong; the human mind is really good at finding patterns, even when they don't exist.
As long as you know your results aren't going to be 'secure' due to your lack of random seeding, you should be fine.
I found some articles online that mentioned Ax.dev's capability to cope with a constrained search space (e.g. dimension_x + dimension_y <= bound). However, in my experience Ax.dev ignores/violates all constraints. I have tried several different constraints on the Hartmann6d example. I assume Ax.dev models the constraints as soft constraints (not sure though, it might as well be my coding skills...). So, my first question is: does Ax.dev's SearchSpace use parameter_constraints as soft or hard constraint(s)?
My second problem:
from ax import *
# number of parameters
# ...
c0 = SumConstraint(parameters=[ some parameters ], bound= some boundary)
# c1 ...
space = SearchSpace(parameters=[ parameters ], parameter_constraints=[c0, c1])
exp = SimpleExperiment(
    name='EXPERIMENT5',
    search_space=space,
    evaluation_function=black_box_function,
    objective_name='BLABLA',
    minimize=False,
)
sobol = Models.SOBOL(exp.search_space)
for i in range(10):
    exp.new_trial(generator_run=sobol.gen(1))
    exp.trials[len(exp.trials) - 1].run()
returns
SearchSpaceExhausted: Rejection sampling error (specified maximum draws (100000) exhausted, without finding sufficiently many (1) candidates). This likely means that there are no new points left in the search space.
I have not been able to find useful information about this, despite all the promising articles online touting Ax.dev's benefits (such as a constrained parameter space!) :(
meta-comment: this is probably a better question for GitHub issues (there isn't much Ax help/documentation on Stack Overflow to my knowledge, but their GitHub issues are rich and generally have a lot of developer/community support).
Does Ax.dev SearchSpace use parameter_constraints as soft or hard constraint(s)?
I think parameter constraints are hard constraints (pretty sure that's the case at least for Sobol sampling, but I'm not sure about Bayesian models).
Outcome constraints, on the other hand, are treated as soft constraints (penalties).
SearchSpaceExhausted: Rejection sampling error
Search: https://github.com/facebook/Ax/issues?q=is%3Aissue+sort%3Aupdated-desc+specified+maximum+draws+is%3Aclosed
--> https://github.com/facebook/Ax/issues/694
--> https://github.com/facebook/Ax/issues/694#issuecomment-987353936
Since it's on master, first I install the latest version in a conda environment:
pip install 'git+https://github.com/facebook/Ax.git#egg=ax-platform'
The relevant imports are:
from ax.modelbridge.generation_strategy import GenerationStrategy, GenerationStep
from ax.modelbridge.registry import Models
Then based on 1A. Manually configured generation strategy, I change the first GenerationStep model_kwargs from:
model_kwargs={"seed": 999}
to
model_kwargs={
    "seed": 999,
    "fallback_to_sample_polytope": True,
}
With the full generation strategy (gs) given by:
gs = GenerationStrategy(
    steps=[
        # 1. Initialization step (does not require pre-existing data and is well-suited for
        # initial sampling of the search space)
        GenerationStep(
            model=Models.SOBOL,
            num_trials=5,  # How many trials should be produced from this generation step
            min_trials_observed=3,  # How many trials need to be completed to move to next model
            max_parallelism=5,  # Max parallelism for this step
            model_kwargs={
                "seed": 999,
                "fallback_to_sample_polytope": True,
            },  # Any kwargs you want passed into the model
            model_gen_kwargs={},  # Any kwargs you want passed to `modelbridge.gen`
        ),
        # 2. Bayesian optimization step (requires data obtained from previous phase and learns
        # from all data available at the time of each new candidate generation call)
        GenerationStep(
            model=Models.GPEI,
            num_trials=-1,  # No limitation on how many trials should be produced from this step
            max_parallelism=3,  # Parallelism limit for this step, often lower than for Sobol
            # More on parallelism vs. required samples in BayesOpt:
            # https://ax.dev/docs/bayesopt.html#tradeoff-between-parallelism-and-total-number-of-trials
        ),
    ]
)
Finally, in the case of this issue, and as mentioned there, pass the generation strategy to the client:
AxClient(generation_strategy=gs)
Or in the case of the Loop API:
optimize(..., generation_strategy=gs)
Seems to work well for my use-case; thank you! I'll try to update the other relevant issues soon.
I have looked at both random and secrets and found that secrets is "cryptographically secure". Every Stack Overflow source says it's the closest to true random. So I thought to use it for generating a population. However, it didn't give very random results at all; rather, it gave predictable results.
The first characteristic I tested was gender, 4 options to be exact, and I mapped it all out...
# code may not function as it's typed on mobile without a computer to test on
import secrets
import multiprocessing

def gen(*args):
    gender = ["Male", "Female", "X", "XXY"]
    rng = secrets.choice(gender)
    return rng

with multiprocessing.Pool(processes=4) as pool:
    id_ = [i for i in range(2000000000)]
    out = pool.map(gen, id_)
# Do stuff with the data
When I process the data through other functions that determine the percentage of one gender relative to the others, it is always 25 ± 1%. I was expecting the occasional 100% of one gender and 0 of the others, but that never happened.
I also tried the same thing with random; it produced similar results but somehow took twice as long.
I also changed the gender list to have one each of X and XXY, while having 49 of each of the other two, and it gave the predictable result of 1% X and 1% XXY.
I don't have much experience with RNG in computers aside from the term entropy... Does Python have any native or PyPI packages that produce entropy or chaotic numbers?
Is the secrets module supposed to act in a somewhat predictable way?
I think you might be conflating some different ideas here.
The secrets.choice function is going to randomly select 1 of the 4 gender options you have provided every time it is called, which in your example is 2000000000 times. The likelihood of getting 100% of any option after randomly selecting from a list of 4 options 2000000000 times is practically zero in any reasonably implemented randomness generator.
If I am understanding your question correctly, this is actually pretty strong evidence that the secrets.choice function is behaving as expected and providing an even distribution of the options provided to it. The variance should drop to zero as your N approaches infinity.
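To make "practically zero" and the shrinking variance concrete, here is a small back-of-the-envelope sketch (my own numbers, matching the 2000000000 draws in the question):
from math import sqrt, log10

n = 2000000000  # number of draws
p = 1 / 4       # probability of any single option

# Chance that every draw lands on one particular option is (1/4)**n;
# even the base-10 exponent of that probability is astronomically negative.
print("log10 of P(all draws identical):", n * log10(p))

# Standard deviation of the observed share around 25%:
print("std dev of observed proportion:", sqrt(p * (1 - p) / n))
That standard deviation works out to roughly 0.001%, so seeing 25 ± 1% every time is exactly what an unbiased generator should produce.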
What are the use cases for passing different numbers to random.seed()?
import random
random.seed(0)
random.random()
For example, using random.seed(17) or random.seed(9001) instead of always using random.seed(0). Either way, the numbers returned are the same reproducible "pseudo" random numbers that can be used for testing.
import random
random.seed(17)
random.random()
Why not always use random.seed(0)?
The seed is saying "random, but always the same randomness". If you want to randomize, e.g. search results, but not on every search, you could pass the current day.
If you want to randomize per user, you could use a user ID, and so on.
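For example, a rough sketch of both ideas (the result list and user ID are just placeholders):
import datetime
import random

results = ["result_a", "result_b", "result_c", "result_d"]  # hypothetical search results

# Same ordering for everyone all day long, but a fresh ordering tomorrow:
daily_rng = random.Random(datetime.date.today().toordinal())
daily_rng.shuffle(results)

# Stable per-user ordering: the same user always sees the same shuffle.
user_id = 12345  # hypothetical user ID
user_rng = random.Random(user_id)
user_rng.shuffle(results)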
An application should specify its own seed (e.g., with random.seed()) only if it needs reproducible "randomness"; examples include unit tests, games that display a "code" based on the seed to players, and simulations. Specifying a seed this way is not appropriate where information security is involved. See also my article on this matter.
Say I have some python code:
import random
r=random.random()
Where is the value of r seeded from in general?
And what if my OS has no randomness source, where is it seeded from then?
Why isn't this recommended for cryptography? Is there some way to know what the random number is?
Follow da code.
To see where the random module "lives" in your system, you can just do in a terminal:
>>> import random
>>> random.__file__
'/usr/lib/python2.7/random.pyc'
That gives you the path to the .pyc ("compiled") file, which is usually located side by side to the original .py where readable code can be found.
Let's see what's going on in /usr/lib/python2.7/random.py:
You'll see that it creates an instance of the Random class and then (at the bottom of the file) "promotes" that instance's methods to module functions. Neat trick. When the random module is imported anywhere, a new instance of that Random class is created, its values are then initialized and the methods are re-assigned as functions of the module, making it quite random on a per-import (erm... or per-python-interpreter-instance) basis.
_inst = Random()
seed = _inst.seed
random = _inst.random
uniform = _inst.uniform
triangular = _inst.triangular
randint = _inst.randint
The only thing that this Random class does in its __init__ method is seeding it:
class Random(_random.Random):
    ...
    def __init__(self, x=None):
        self.seed(x)
    ...
_inst = Random()
seed = _inst.seed
So... what happens if x is None (no seed has been specified)? Well, let's check that self.seed method:
def seed(self, a=None):
    """Initialize internal state from hashable object.
    None or no argument seeds from current time or from an operating
    system specific randomness source if available.
    If a is not None or an int or long, hash(a) is used instead.
    """
    if a is None:
        try:
            a = long(_hexlify(_urandom(16)), 16)
        except NotImplementedError:
            import time
            a = long(time.time() * 256)  # use fractional seconds
    super(Random, self).seed(a)
    self.gauss_next = None
The docstring already tells what's going on... This method tries to use the default randomness source provided by the OS, and if there's none, then it'll use the current time as the seed value.
But, wait... What the heck is that _urandom(16) thingy then?
Well, the answer lies at the beginning of this random.py file:
from os import urandom as _urandom
from binascii import hexlify as _hexlify
Tadaaa... The seed is a 16-byte number that came from os.urandom.
Let's say we're in a civilized OS, such as Linux (with a real random number generator). The seed used by the random module is the same as doing:
>>> long(binascii.hexlify(os.urandom(16)), 16)
46313715670266209791161509840588935391L
The reason why specifying a seed value is considered not so great is that the random functions are not really "random"... They're just a very weird sequence of numbers. But that sequence will be the same given the same seed. You can try this yourself:
>>> import random
>>> random.seed(1)
>>> random.randint(0,100)
13
>>> random.randint(0,100)
85
>>> random.randint(0,100)
77
No matter when or how or even where you run that code (as long as the algorithm used to generate the random numbers remains the same), if your seed is 1, you will always get the integers 13, 85, 77... which kind of defeats the purpose (see this article about pseudorandom number generation). On the other hand, there are use cases where this can actually be a desirable feature.
That's why it is considered "better" to rely on the operating system's random number generator. Those values are usually calculated from hardware interrupts, which are very, very random (they include interrupts for hard drive reads, keystrokes typed by the human user, moving a mouse around...). In Linux, that OS generator is /dev/random. Or, being a tad picky, /dev/urandom (that's what Python's os.urandom actually uses internally). The difference is that (as mentioned before) /dev/random uses hardware interrupts to generate the random sequence. If there are no interrupts, /dev/random could be exhausted and you might have to wait a little bit until you can get the next random number. /dev/urandom uses /dev/random internally, but it guarantees that it will always have random numbers ready for you.
If you're using Linux, just do cat /dev/random in a terminal (and prepare to hit Ctrl+C, because it will start outputting really, really random stuff):
borrajax#borrajax:/tmp$ cat /dev/random
_+�_�?zta����K�����q�ߤk��/���qSlV��{�Gzk`���#p$�*C�F"�B9��o~,�QH���ɭ�f��̬po�2o�(=��t�0�p|m�e
���-�5�߁ٵ�ED�l�Qt�/��,uD�w&m���ѩ/��;��5Ce�+�M����
~ �4D��XN��?ס�d��$7Ā�kte▒s��ȿ7_���- �d|����cY-�j>�
�b}#�W<դ���8���{�1»
. 75���c4$3z���/̾�(�(���`���k�fC_^C
Python uses the OS random generator or the current time as a seed. This means that the only place where I could imagine a potential weakness with Python's random module is when it's used:
In an OS without an actual random number generator, and
In a device where time.time is always reporting the same time (has a broken clock, basically)
If you are concerned about the actual randomness of the random module, you can either go directly to os.urandom or use the random number generator in the pycrypto cryptographic library. Those are probably more random, in the sense that they draw more directly on the operating system's entropy sources.
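For instance, a minimal sketch of going straight to the OS entropy pool (in modern Python 3 the secrets module wraps the same source):
import os
import secrets

raw = os.urandom(16)            # 16 bytes straight from the OS CSPRNG
print(raw.hex())

print(secrets.token_hex(16))    # convenience wrapper around the same source (Python 3.6+)
print(secrets.randbelow(100))   # unpredictable integer in [0, 100), suitable where security matters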
I need to make a Python app that generates a random prime number between 10^300 and 10^301. I did it with the code below, but it's very slow. Any solution?
import random , math
check_prime = 0
print "Please wait ..."
def is_prime(n):
    import math
    n = abs(n)
    i = 2
    while i <= math.sqrt(n):
        if n % i == 0:
            return False
        i += 1
    return True

while check_prime == 0:
    randomnumber = random.randrange(math.pow(10, 300), math.pow(10, 301) - 1)
    if is_prime(randomnumber):
        print randomnumber
        break
First things first: Don't use math.pow(), as it is just a wrapper for the C floating-point function, and your numbers are WAY too big to be accurately represented as floats. Use Python's exponentiation operator, which is **.
Second: If you are on a platform which has a version of gmpy, use that for your primality testing.
Third: As eumiro notes, you might be dealing with too big of a problem space for there to be any truly fast solution.
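For instance, a small sketch of the first fix (exact integer bounds; the primality test itself still needs replacing, as other answers discuss):
import random

low = 10**300                            # exact 301-digit integer, no float rounding
high = 10**301
candidate = random.randrange(low, high)  # random integer in [low, high)
print(len(str(candidate)))               # 301 digits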
If you need something quick and dirty, simply use Fermat's little theorem.
def isPrime(p):
    if p == 2:
        return True
    if not (p & 1):               # even numbers other than 2 are not prime
        return False
    return pow(2, p - 1, p) == 1  # Fermat test with base 2
Although this works quite well for random numbers, it will fail for numbers known as "pseudoprimes" (which are quite scarce).
But if you need something foolproof and simple enough to implement, I'd suggest you read up on the Miller-Rabin primality test.
PS: pow(a, b, c) takes O(log b) time in Python.
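For reference, here is a minimal probabilistic Miller-Rabin sketch (my own illustration, not the answerer's code; it uses random witness bases):
import random

def is_probable_prime(n, rounds=40):
    # Miller-Rabin probabilistic primality test.
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # this base witnesses that n is composite
    return True
Each round cuts the chance of a false positive by at least a factor of 4, so 40 rounds is more than enough in practice for 300-digit candidates.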
Take a look at the Miller-Rabin test; I am sure you will find some implementations in Python on the internet.
As explained in the comments, you can forget about building a list of all primes between 10^300 and 10^301 and picking one randomly - there's orders of magnitudes too many primes in that range for that to work.
The only practical approach, as far as I can see, is to randomly pick a number from the range, then test if it is prime. Repeat until you find a prime.
For the primality testing, that's a whole subject by itself (libraries [in the book sense] have been written about it). Basically you first do a few quick tests to discard numbers that are obviously not prime, then you bring out the big guns and use one of the common primality tests (Miller-Rabin (iterated), AKS, ...). I don't know enough to recommend one, but that's really a topic for research-level maths, so you should probably head over to https://math.stackexchange.com/ .
See e.g. this question for a starting point and some simple recipes:
How do you determine if a number is prime or composite?
About your code:
The code you post basically does just what I described; however, it uses an extremely naive primality test (trial division by all numbers 2...sqrt(n)). That's why it's so slow. If you use a better primality test, it will be orders of magnitude faster.
Since you are handling extremely large numbers, you should really build on what clever people have already done for you. Given that you need to know for sure that your number is prime (so Miller-Rabin is out), and you want a random prime (not one of a specific structure), I would suggest AKS. I am not sure how to optimize the choice of numbers under test, but just choosing a random number might be okay given the speed of the primality test.
If you want to generate a large number of primes, or generate them quickly, pure Python may not be the way to go. One option would be to use a Python OpenSSL wrapper and use OpenSSL's facility for generating RSA private keys (which are in fact a pair of primes), or some of OpenSSL's other primality-related functions. Another way to achieve speed would be a C extension implementing one of the tests below...
If OpenSSL isn't an option, your two choices (as many of the comments have mentioned) are the Miller-Rabin test and the AKS test. The main differences: AKS is deterministic and is guaranteed to give no false results, whereas Miller-Rabin is probabilistic and may occasionally give false positives, but the longer you run it, the lower that probability becomes (the odds are 1/4**k for k rounds of testing). You would think AKS would obviously be the way to go, except that it's much slower: O(log(n)**12) compared to Miller-Rabin's O(k*log(n)**3). (For comparison, the scan test you presented will take O(n**.5), so either of these will be much faster for large numbers.)
If it would be useful, I can paste in a Miller-Rabin implementation I have, but it's rather long.
It is probably better to use a different algorithm, but you could optimise this code pretty easily as well.
This function will be twice as fast:
def is_prime(n):
    import math
    n = abs(n)
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:  # even numbers other than 2 are not prime
        return False
    i = 3
    while i <= math.sqrt(n):
        if n % i == 0:
            return False
        i += 2      # only test odd divisors
    return True