I have looked at both random and secrets and found that secrets is "cryptographically secure". Every Stack Overflow source says it's the closest to true randomness. So I thought to use it for generating a population. However, it didn't give very random results at all; rather, it gave predictable results.
The first characteristic I tested was gender, 4 options to be exact, and mapped it all out...
# code may not function as it's typed on mobile without a computer to test on
import secrets
import multiprocessing

def gen(*args):
    gender = ["Male", "Female", "X", "XXY"]
    return secrets.choice(gender)

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        id_ = range(2000000000)  # one task per simulated person
        out = pool.map(gen, id_)
    # Do stuff with the data
When I process the data through other functions that determine the percentage of each gender, it is always 25 ± 1%. I was expecting the occasional 100% of one gender and 0% of the others, but that never happened.
I also tried the same thing with random; it produced similar results but somehow took twice as long.
I also changed the gender list to have one each of X and XXY, while having 49 each of the other two, and it gave the predictable result of 1% X and 1% XXY.
I don't have much experience with RNGs in computers aside from the term entropy... Does Python have any native or PyPI packages that produce entropy or chaotic numbers?
Is the secrets module supposed to act in a somewhat predictable way?
I think you might be conflating some different ideas here.
The secrets.choice function randomly selects 1 of the 4 gender options you provided every time it is called, which in your example is 2000000000 times. The likelihood of getting 100% of any single option after randomly selecting from a list of 4 options 2000000000 times is practically zero in any reasonably implemented randomness generator: for a given option it is (1/4)**2000000000.
If I am understanding your question correctly, this is actually pretty strong evidence that the secrets.choice function is behaving as expected and providing an even distribution of the options provided to it. The variance should drop to zero as your N approaches infinity.
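As a sanity check, here is a minimal sketch (with far smaller N than yours, so it runs quickly) showing how the spread around 25% per option shrinks as the sample grows:

import secrets
from collections import Counter

gender = ["Male", "Female", "X", "XXY"]

for n in (100, 10_000, 1_000_000):
    counts = Counter(secrets.choice(gender) for _ in range(n))
    # each option's share drifts toward 25% as n grows
    print(n, {g: round(100 * counts[g] / n, 1) for g in gender})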
Introduction
Today I found some weird behaviour in Python while running experiments with exponentiation, and I was wondering if someone here knows what's happening. In my experiments, I was trying to check which is faster in Python: int**int or float**float. To check that, I ran some small snippets, and I found really weird behaviour.
Weird results
My first approach was just to write some for loops and prints to check which one is faster. The snippet I used is this one:
import time

EXPERIMENTS = 1_000_000  # iteration count; the original post does not show this value

# Run powers outside a method
ti = time.time()
for i in range(EXPERIMENTS):
    x = 2**2
tf = time.time()
print(f"int**int took {tf-ti:.5f} seconds")

ti = time.time()
for i in range(EXPERIMENTS):
    x = 2.**2.
tf = time.time()
print(f"float**float took {tf-ti:.5f} seconds")
After running it, I got:
int**int took 0.03004 seconds
float**float took 0.03070 seconds
Cool, it seems that data types do not affect the execution time. However, since I try to be a clean coder, I refactored the repeated logic into a function, power_time:
import time

# Run powers in a method
def power_time(base, exponent):
    ti = time.time()
    for i in range(EXPERIMENTS):
        x = base ** exponent
    tf = time.time()
    return tf - ti

print(f"int**int took {power_time(2, 2):.5f} seconds")
print(f"float**float took {power_time(2., 2.):.5f} seconds")
And what a surprise when I got these results:
int**int took 0.20140 seconds
float**float took 0.05051 seconds
The refactor barely affected the float case, but it multiplied the time required for the int case by ~7.
Conclusions and questions
Apparently, running something in a method can slow down your process depending on your data types, and that's really weird to me.
Also, if I run the same experiments but replace ** with * or +, the weird results disappear and all the approaches give more or less the same timings.
Does someone know why this is happening? Am I missing something?
Apparently, running something in a method can slow down your process depending on your data types, and that's really weird to me.
It would be really weird if it were not like this! You can write your own class that has its own ** operator (by implementing the __pow__(self, other) method), and you could, for example, sleep for 1 s in there. Why should that take as long as raising a float to the power of another?
So, yeah, Python is a dynamically typed language, so the operations performed on data depend on the type of that data, and things can generally take different amounts of time.
In your first example, the difference never arises because a) most probably the values get cached: right after parsing, it's clear that 2**2 is a constant and does not need to be evaluated on every loop iteration (CPython folds such constant expressions at compile time). Even if that were not the case, b) the time it costs to run a loop in Python is hundreds of times what it takes to actually execute the math here; again, dynamically typed, dynamically named.
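You can see that folding directly with the dis module; a quick sketch (the exact CPython bytecode shown varies by version):

import dis

# The compiler folds 2**2 to the constant 4, so the loop body never
# performs an exponentiation at runtime
dis.dis(compile("x = 2**2", "<demo>", "exec"))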
base**exponent is a whole different story. Nothing about it is constant, so there's actually going to be a calculation every iteration.
Now, the ** operator (__pow__ in the Python data model) for Python's built-in float type is specified to do float exponentiation (which is implemented in highly optimized C and assembly), as exponentiation can be done elegantly on floating-point numbers. Look for nb_power in CPython's floatobject.c. So, for the float case, the actual calculation is "free" for all that matters: your loop is limited by how much effort it takes to resolve all the names, types and functions to call, not by doing the actual math, which is trivial.
The ** operator on Python's built-in int type is not as neatly optimized. It's a lot more complicated: it needs to do checks like "if the exponent is negative, return a float", and it does not do elementary math that your computer can execute with a single instruction; it handles arbitrary-length integers (remember, a Python integer uses as many bytes as it needs; you can store numbers larger than 64 bits in a Python integer!), which comes with allocations and deallocations. (I encourage you to read long_pow in CPython's longobject.c; it is about 200 lines.)
All in all, integer exponentiation is expensive in Python because of Python's type system.
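If you want to confirm this without the refactoring detour, the standard timeit module lets you force non-constant operands via its setup argument; a rough sketch (absolute numbers will differ per machine):

import timeit

# Names defined in setup cannot be constant-folded away
print(timeit.timeit("base ** exponent", setup="base, exponent = 2, 2"))
print(timeit.timeit("base ** exponent", setup="base, exponent = 2.0, 2.0"))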
What are the use cases for handing different numbers to random.seed()?
import random
random.seed(0)
random.random()
For example, using random.seed(17) or random.seed(9001) instead of always using random.seed(0). Each seed reproducibly returns the same "pseudo" random numbers, which can be used for testing.
import random
random.seed(17)
random.random()
Why not always use random.seed(0)?
The seed says "random, but always the same randomness". If you want to randomize, e.g., search results, but not for every single search, you could pass the current day.
If you want to randomize per user, you could use a user ID, and so on.
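A minimal sketch of both ideas (the result list and user ID here are made up for illustration):

import datetime
import random

results = ["result_a", "result_b", "result_c"]  # hypothetical search results

# Same ordering for everyone, changing once per day
daily_rng = random.Random(datetime.date.today().toordinal())
daily_rng.shuffle(results)

# Or a stable per-user ordering
user_id = 1234  # hypothetical
user_rng = random.Random(user_id)
user_rng.shuffle(results)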
An application should specify its own seed (e.g., with random.seed()) only if it needs reproducible "randomness"; examples include unit tests, games that display a "code" based on the seed to players, and simulations. Specifying a seed this way is not appropriate where information security is involved. See also my article on this matter.
I'd like to be able to test the efficiency of an if statement against a dictionary-based case-statement hack in Python. Since there is no case statement, I am currently using the dictionary method. It looks like so:
self.options = {"0": self.racerOne,
"1": self.racerTwo,
"2": self.racerThree,
"3": self.racerFour,
"0f": self.racerFinish,
"1f": self.racerFinish,
"2f": self.racerFinish,
"3f": self.racerFinish,
"x": self.emptyLine,
"t": self.raceTime,
"G": self.letsGo,
"CD": self.countDown,
"C": self.emptyLine,
}
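(Dispatching through a table like this typically looks like the sketch below; the key variable and the fallback choice are illustrative.)

# Look up the handler for a parsed key; fall back to a default for unknown keys
handler = self.options.get(key, self.emptyLine)
handler()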
Real-world data will vary, but I have a way to run a controlled test that reads 688 lines of streaming data over 6.4 seconds.
I have also read this post: Python Dictionary vs If Statement Speed. I am going to check out the cProfile method as well.
Does anyone have any other suggestions on how I can accurately measure an if statement compared to the dictionary option? By efficient I guess I mean using the least processing power and keeping up with the stream better.
Over those 6.4 seconds I read each line of streaming data, parse it, evaluate it, then visually display it in real time. I don't think there will be much difference running my application on a Windows or OS X system, but it also has to run on a Raspberry Pi, where processing power is limited.
Thanks in advance.
It sounds as though your major areas to optimize will not be this statement.
However, out of curiosity, I examined it anyway. The intuitive answer, which is given in the question you linked to, is that Python dictionaries are implemented as hash tables, so lookup should scale at around O(1) with the number of items. If chains like the one you showed will scale at O(n), as the tests are evaluated sequentially. Running 1000 random numbers through functions using each approach, with anywhere from 2 to 1000 choices, I got the following timings (y scale in seconds per choice, log scale; if chains in blue, dict lookup in green; x scale is the number of possible choices).
As can be seen, lookup is fast, and much faster than long if statement chains.
Even for short chains, in this code, lookup is still faster or around the same.
But note the times here. We're talking about times in the sub-microsecond range per choice on my computer: around 600ns for a low number of choices. At that point, the overhead may be from things as simple as function calls. On the other hand, if you have a huge number of possible choices, the best thing to use should be pretty clear.
The code for the above is below. It uses numpy to keep track of all the times. For simple timing issues like this, it's usually easiest to just use time.time() to get a time value before and after whatever you want to do. For something very fast, you'll need to loop through multiple times and take an average of the times.
I should add the caveat that the way I created the if-statement chains was mildly evil. It's possible that by building them with exec the function was somehow not optimized in the same way: I'm not an expert on Python internals.
import numpy as np
import time

def createdictfun(n):
    optdict = {x: 2 * x for x in range(0, n)}
    def dictfun(x):
        return optdict[x]
    return dictfun

def createiffun(n):
    s = "def iffun(x):\n"
    s += "    if x == 0: return 0\n"
    for i in range(1, n):
        s += "    elif x == {}: return {}\n".format(i, i * 2)
    s += "    else: raise ValueError\n"
    namespace = {}
    exec(s, namespace)  # build the if-chain function from generated source
    return namespace["iffun"]

ntry = 10
maxchoice = 1000
trialsize = 1000
ifvals = np.zeros((maxchoice, 2))
dictvals = np.zeros((maxchoice, 2))
ns = np.arange(1, maxchoice)
for n in ns:
    ift = np.zeros(ntry)
    dictt = np.zeros(ntry)
    vals = np.random.randint(0, n, size=trialsize)
    iffun = createiffun(n)
    dictfun = createdictfun(n)
    for trial in range(0, ntry):
        ts = time.time()
        for x in vals:
            iffun(x)
        ift[trial] = time.time() - ts
        ts = time.time()
        for x in vals:
            dictfun(x)
        dictt[trial] = time.time() - ts
    ifvals[n, 0] = np.mean(ift) / trialsize
    ifvals[n, 1] = np.std(ift) / trialsize
    dictvals[n, 0] = np.mean(dictt) / trialsize
    dictvals[n, 1] = np.std(dictt) / trialsize
    print(n, end=" ")
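As an aside, the standard timeit module packages this loop-and-average pattern for you; a rough equivalent of one dictionary measurement might be:

import timeit

# Total time for a million lookups in a 10-entry dict; numbers vary by machine
t = timeit.timeit("lookup[3]",
                  setup="lookup = {x: 2 * x for x in range(10)}",
                  number=1_000_000)
print(t / 1_000_000, "seconds per lookup")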
I'm using sequential seeds (1,2,3,4,...) for generation of random numbers in a simulation. Does the fact that the seeds are near each other make the generated pseudo-random numbers similar as well?
I think it doesn't change anything, but I'm using Python.
Edit: I have done some tests and the numbers don't look similar. But I'm afraid the similarity might not be noticeable just by looking at the numbers. Is there any theoretical property of random number generation that guarantees that different seeds give completely independent pseudo-random numbers?
There will definitely be a correlation between the seed and the random numbers generated, by definition. The question is whether the randomization algorithm is sufficient to produce results that seem uncorrelated, and you should study up on the methods for evaluating randomness to answer that question.
You are right to be concerned though. Here are the results from Microsoft's C++ rand function with seed values from 0 to 9:
38 7719 21238 2437 8855 11797 8365 32285 10450 30612
41 18467 6334 26500 19169 15724 11478 29358 26962 24464
45 29216 24198 17795 29484 19650 14590 26431 10705 18316
48 7196 9294 9091 7031 23577 17702 23503 27217 12168
51 17945 27159 386 17345 27504 20815 20576 10960 6020
54 28693 12255 24449 27660 31430 23927 17649 27472 32640
58 6673 30119 15745 5206 2589 27040 14722 11216 26492
61 17422 15215 7040 15521 6516 30152 11794 27727 20344
64 28170 311 31103 25835 10443 497 8867 11471 14195
68 6151 18175 22398 3382 14369 3609 5940 27982 8047
If you are worried about sequential seeds, then don't use sequential seeds. Set up a master RNG, with a known seed, and then take successive outputs from that master RNG to seed the various child RNGs as needed.
Because you know the initial seed for the master RNG, the whole simulation can be run again, exactly as before, if required.
import random

master_seed = 42
master_rng = random.Random(master_seed)

# Seed each child RNG from successive outputs of the master RNG
child_rngs = [random.Random() for _ in range(10)]  # as many children as needed
for child_rng in child_rngs:
    child_rng.seed(master_rng.getrandbits(32))
I have found measurable, but small, correlations in random numbers generated from the Mersenne Twister when using sequential seeds for multiple simulations whose results are averaged to yield final results. In Python on Linux, the correlations go away if I use seeds generated by the system random function (non-pseudo-random numbers) via random.SystemRandom(). I store SystemRandom numbers in files and read them when a seed is needed in a simulation.
To generate seeds:
import random

myrandom = random.SystemRandom()
x = myrandom.random()  # yields a number in [0, 1)
# dump x out to a file...
Then, when seeds are needed:
import random

# read x back from the file...
newseed = int(x * (2**31))  # produce a nonnegative 31-bit integer
random.seed(newseed)
nextran = random.random()
nextran = random.random()
# ...
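If the file round-trip isn't essential, a simpler sketch is to seed straight from the OS entropy pool (note that random.seed() with no argument already uses OS randomness where available):

import os
import random

# Seed the global generator from 8 bytes of OS-provided entropy
random.seed(int.from_bytes(os.urandom(8), "big"))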
First: define similarity. Next: code a similarity test. Then: check for similarity.
With only a vague description of similarity it is hard to check for it.
What kind of simulation are you doing?
For simulation purposes your argument may be valid (depending on the type of simulation), but if you implement this in an environment other than simulation, it could easily be exploited wherever the generated random numbers carry security implications.
If you are simulating something whose outcome matters, for instance whether a machine is harmful to society or not, such results will not be acceptable. That kind of work requires maximum randomness in every way possible, and I would never trust your reasoning.
To quote the documentation from the random module:
General notes on the underlying Mersenne Twister core generator:
The period is 2**19937-1.
It is one of the most extensively tested generators in existence.
I'd be more worried about my code being broken than my RNG not being random enough. In general, your gut feelings about randomness are going to be wrong: the human mind is really good at finding patterns, even if they don't exist.
As long as you know your results aren't going to be 'secure' due to your lack of random seeding, you should be fine.
I need to make a Python app that generates a random prime number between 10^300 and 10^301. I did it with this, but it's very slow. Any solution?
import random, math

print("Please wait ...")

def is_prime(n):
    n = abs(n)
    i = 2
    while i <= math.sqrt(n):
        if n % i == 0:
            return False
        i += 1
    return True

while True:
    # note: math.pow returns a float, which is the problem addressed below
    randomnumber = random.randrange(math.pow(10, 300), math.pow(10, 301) - 1)
    if is_prime(randomnumber):
        print(randomnumber)
        break
First things first: don't use math.pow(), as it is just a wrapper for the C floating-point function, and your numbers are WAY too big to be accurately represented as floats. Use Python's exponentiation operator, which is ** (see the short demo after this list).
Second: If you are on a platform which has a version of gmpy, use that for your primality testing.
Third: As eumiro notes, you might be dealing with too big of a problem space for there to be any truly fast solution.
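A short demo of the precision problem with math.pow (the reason ** matters here):

import math

# A C double has 53 bits of mantissa, so 10**300 cannot be represented exactly
print(int(math.pow(10, 300)) == 10**300)  # False: the float result is rounded
print(10**300)  # exact, arbitrary-precision integer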
If you need something quick and dirty, simply use Fermat's little theorem.
def isPrime(p):
    if p == 2:
        return True
    if not (p & 1):  # even numbers other than 2 are composite
        return False
    return pow(2, p - 1, p) == 1
Although this works quite well for random numbers, it will fail for numbers known as "pseudoprimes" (quite scarce).
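For instance, 341 = 11 * 31 is the smallest base-2 Fermat pseudoprime, and the check above is fooled by it:

print(isPrime(341))  # True, although 341 = 11 * 31 is composite
print(isPrime(7))    # True, correctly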
But if you need something foolproof and simple enough to implement, I'd suggest you read up on the Miller-Rabin primality test.
PS: pow(a, b, c) takes O(log b) modular multiplications in Python.
Take a look at the Miller-Rabin test; I am sure you will find some implementations in Python on the internet.
As explained in the comments, you can forget about building a list of all primes between 10^300 and 10^301 and picking one randomly; there are orders of magnitude too many primes in that range for that to work.
The only practical approach, as far as I can see, is to randomly pick a number from the range, then test if it is prime. Repeat until you find a prime.
For the primality testing, that's a whole subject by itself (libraries [in the book sense] have been written about it). Basically you first do a few quick tests to discard numbers that are obviously not prime, then you bring out the big guns and use one of the common primality tests (Miller-Rabin (iterated), AKS, ...). I don't know enough to recommend one, but that's really a topic for research-level maths, so you should probably head over to https://math.stackexchange.com/ .
See e.g. this question for a starting point and some simple recipes:
How do you determine if a number is prime or composite?
About your code:
The code you posted basically does just what I described; however, it uses an extremely naive primality test (trial division by all numbers 2...sqrt(n)). That's why it's so slow. If you use a better primality test, it will be orders of magnitude faster.
Since you are handling extremely large numbers, you should really build on what clever people have already done for you. Given that you need to know for sure that your number is prime (so Miller-Rabin is out), and you want a random prime (not one of a specific structure), I would suggest AKS. I am not sure how to optimize which numbers you test, but just choosing a random number might be okay given the speed of the primality test.
If you want to generate a large number of primes, or generate them quickly, pure Python may not be the way to go. One option would be to use a Python OpenSSL wrapper and use OpenSSL's facility for generating RSA private keys (which are in fact a pair of primes), or some of OpenSSL's other primality-related functions. Another way to achieve speed would be a C extension implementing one of the tests below...
If OpenSSL isn't an option, your two choices (as many of the comments have mentioned) are the Miller-Rabin test and the AKS test. The main difference: AKS is deterministic and guaranteed to give no false results, whereas Miller-Rabin is probabilistic and may occasionally give false positives, but the longer you run it, the lower that probability becomes (the odds are 1/4**k for k rounds of testing). You would think AKS would obviously be the way to go, except that it's much slower: O(log(n)**12) compared to Miller-Rabin's O(k*log(n)**3). (For comparison, the scan test you presented will take O(n**.5), so either of these will be much faster for large numbers.)
If it would be useful, I can paste in a Miller-Rabin implementation I have, but it's rather long.
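For reference, a compact sketch of the standard Miller-Rabin algorithm (not the implementation mentioned above; k sets the error bound of at most 4**-k per composite):

import random

def miller_rabin(n, k=40):
    # Probabilistic primality test: a composite n passes with probability <= 4**-k
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**r with d odd
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(k):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True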
It is probably better to use a different algorithm, but you could optimise this code pretty easily as well.
This function will be twice as fast:
import math

def is_prime(n):
    n = abs(n)
    if n == 2:
        return True
    if n < 2 or n % 2 == 0:  # reject 0, 1 and even numbers up front
        return False
    i = 3
    while i <= math.sqrt(n):
        if n % i == 0:
            return False
        i += 2  # test odd divisors only
    return True