What is pseudo random? - python

I was reading the docs for the random module and noticed they say pseudo-random. Doesn't pseudo mean false? So I was wondering what it means when it says that.
For example:
import random
print(random.randint(1, 2))
print(random.randint(1, 3))
Does this mean that the first print statement has a 50% chance of printing 1 and a 50% chance of printing 2,
and that the second has a 33% chance of printing 1, a 33% chance of printing 2, and so on?
If not, how are the pseudo-random numbers generated?

To produce true randomness requires specialized hardware that measures random events, such as radioactive decay (random) or brownian motion (also essentially random). Most computers obviously don't have these, so instead you have to use a really complex, evenly distributed, hard to predict 'pseudorandom' algorithm that starts with a number determined by, for example, the current timestamp. Such algorithms are plenty good enough for standard use cases needing 'randomness' as long as you're careful to not seed two random number generators with the same timestamp (start them at the same time on different threads, for example), which will make them do identical things. A common example of such a random number generator is Mersenne Twister: http://en.wikipedia.org/wiki/Mersenne_twister
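As a quick demonstration of the seeding point above: two generators given the same seed march in lockstep. A minimal sketch using Python's random.Random:

import random

# Two generators seeded identically produce identical sequences.
a = random.Random(12345)
b = random.Random(12345)
print([a.randint(1, 100) for _ in range(5)])
print([b.randint(1, 100) for _ in range(5)])   # same five numbers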
A site that offers truly random values, explains a lot about randomness and pseudorandomness and has some yummy statistics about its randomness: http://www.random.org/ (see Learn More and Statistics) (It actually seems that it relies on measuring tiny fluctuations in a chaotic system, e.g. atmospheric noise, but the statistics show that it is so much like true randomness you can't tell it apart!)
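And to answer the probability part of the question directly: yes, random.randint(a, b) is uniform, so the stated 50% and 33% chances are right (up to the quality of the PRNG). A quick tally (a sketch, not from the original posts) makes this visible:

from collections import Counter
import random

# Tally 100,000 draws of randint(1, 2); each value should appear
# roughly 50,000 times, e.g. Counter({1: 50123, 2: 49877}).
print(Counter(random.randint(1, 2) for _ in range(100_000)))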

Related

Why using a time-based pseudo-random number is not cryptographically secure?

It is well known that pseudo-random numbers are not cryptographically secure.
An extremely basic way I can think of to generate a pseudo-random number would be to get the timestamp at the time the code runs and return the least significant digits.
For example, the outcome of import time; time.time_ns() / 100 % 1000 returns a number between 0 and 1000 that should be almost impossible to predict unless you know exactly the time at which the code ran (with nanosecond precision) and all the overhead execution times of the code.
We could then use one or more numbers generated this way to drive a chaotic function (such as a logistic map) and generate numbers that should be extremely hard to predict.
One extremely naïve implementation could be:
import time

def random():
    return time.time_ns() / 100 % 1000 / 1000

def logistic():
    r = 3.9 + random() / 10
    N = 1000 + int(random() * 100)
    x = random()
    for _ in range(N):
        x = r * x * (1 - x)
    return x

print(logistic())
However, I'm quite sure that no one would consider this to be cryptographically secure.
Why is this? How could one predict or attack such a method?
EDIT:
My question is theoretical: I want to understand why building a true RNG is so difficult. I would never use an RNG I wrote myself in real code, even less so if it had to be cryptographically secure. However, it would be nice to have a bit more detail on WHY this result is so hard to achieve that hundreds or thousands of researchers have spent their careers working on the topic.
It is well known that pseudo-random numbers are not cryptographically secure.
Is it really? All cryptographic systems I know do use pseudo-random generators. There are two major points for a cryptographically secure pseudo-random sequence:
the probability of any value should be the same (as much as possible) to keep entropy high. If, on a 16-bit number, the generation algorithm consistently sets 8 bits to zero, you only have an 8-bit generator...
knowledge of a number of consecutive values must not allow anyone to predict the next one(s) - this one is really the tricky part...
Relying on the nanoseconds part of the time blindly assumes that the internal clock of the system has no preferred values for the low-order bits... and nothing ensures that!
Common systems rely on the randomness of the time only to build a seed.
And when it comes to security and cryptography, the rule is: do not roll your own unless you are an established specialist (and if you are, you already know that any new algorithm or implementation should be carefully reviewed by peers). The devil is always hidden in the details, and something that looks very clever at first sight can introduce a major flaw. Here, relying on the randomness of the system clock is not secure.
The fact is that building good algorithms and implementations is very hard, and getting others to trust them takes even more time. There is nothing bad in experimenting with new ideas, and studying how they are validated is even more interesting, and you would learn a lot. But my advice is to not use your brand-new algorithm for anything other than tests, and never for mission-critical operations.
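In Python terms, "don't roll your own" usually means reaching for the standard library's vetted CSPRNG. A minimal sketch using the secrets module (not part of the original answer):

import secrets

# secrets draws from the OS CSPRNG and is the right tool when security matters.
print(secrets.token_hex(16))      # 32 hex characters of key/token material
print(secrets.randbelow(1000))    # uniform integer in [0, 1000)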
For cryptography, it is not only desirable that individual numbers are hard to predict but also that multiple numbers are hard to predict – that is, numbers should (appear to) be independent.
Notably, they should be independent even if an attacker knows the algorithm.
That is problematic with time based "randomness", since by design the next time is after the previous time. Worse, there is a direct relation "how much" one number is after the other – namely how much time has elapsed since fetching the previous number.
For example, drawing numbers in any predictable manner such as a loop gives a significant correlation between numbers:
>>> import time
>>>
>>> def random():
... return time.time_ns()/100 % 1000
...
>>> # some example samples
>>> [random() for _ in range(5)]
[220.0, 250.0, 250.0, 260.0, 260.0]
>>> [random() for _ in range(5)]
[370.0, 390.0, 400.0, 410.0, 410.0]
>>> # how much spread there is in samples
>>> import statistics
>>> [statistics.stdev(random() for _ in range(5)) for _ in range(5)]
[16.431676725154983, 11.40175425099138, 11.40175425099138, 8.366600265340756, 11.40175425099138]
>>> [statistics.stdev(random() for _ in range(5)) for _ in range(5)]
[16.431676725154983, 11.40175425099138, 11.40175425099138, 11.40175425099138, 11.40175425099138]
Notably, the critical part is actually not that the spread is low but that it is predictable.
The other problem is that there really is not much that can be done about it: The "randomness" contained in time is inherently limited.
On one end, time randomness is constrained by resolution. As can be seen by the output, my system clock only has sufficient resolution for multiples of 10 – so it can only draw 100 distinct numbers.
On the other end, time randomness is constrained by the program duration. A program that, say, draws during 1 ms only can draw randomness from that duration – that's only 1 000 000 distinct nanoseconds.
Notably, no algorithm can increase that randomness – one can shift it or even it out across some range, but not create more randomness from it.
Now, in practice one could say that it is still somewhat difficult to predict the values drawn this way. The point however is that other means are more difficult to predict.
Consider that some attacker knew your algorithm and tried to brute-force a result:
With a true random generator for actual 3*[0, 1000) they must try 1000*1000*1000 numbers.
With the time based random generator for seemingly 3*[0, 1000) they must try roughly 100*5 numbers (low resolution, low spread).
That means time based randomness makes the approach 2_000_000 times less robust against a brute force attack.
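If you want to see the resolution limit on your own machine, here is a quick check in the spirit of the code above (counts will vary with your clock's granularity):

import time

# Count distinct outputs of the time-based "random" in a tight loop;
# on many systems, coarse clock resolution keeps this well below the
# nominal 1000 possible values.
seen = {time.time_ns() / 100 % 1000 for _ in range(100_000)}
print(len(seen))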

Best practices for seeding random and numpy.random in the same program

In order to make random simulations we run reproducible later, my colleagues and I often explicitly seed the random or numpy.random modules' random number generators using the random.seed and np.random.seed methods. Seeding with an arbitrary constant like 42 is fine if we're just using one of those modules in a program, but sometimes, we use both random and np.random in the same program. I'm unsure whether there are any best practices I should be following about how to seed the two RNGs together.
In particular, I'm worried that there's some sort of trap we could fall into where the two RNGs together behave in a "non-random" way, such as both generating the exact same sequence of random numbers, or one sequence trailing the other by a few values (e.g. the kth number from random is always the k+20th number from np.random), or the two sequences being related to each other in some other mathematical way. (I realise that pseudo-random number generators are all imperfect simulations of true randomness, but I want to avoid exacerbating this with poor seed choices.)
With this objective in mind, are there any particular ways we should or shouldn't seed the two RNGs? I've used, or seen colleagues use, a few different tactics, like:
Using the same arbitrary seed:
random.seed(42)
np.random.seed(42)
Using two different arbitrary seeds:
random.seed(271828)
np.random.seed(314159)
Using a random number from one RNG to seed the other:
random.seed(42)
np.random.seed(random.randint(0, 2**32))
... and I've never noticed any strange outcomes from any of these approaches... but maybe I've just missed them. Are there any officially blessed approaches to this? And are there any possible traps that I can spot and raise the alarm about in code review?
I will discuss some guidelines on how multiple pseudorandom number generators (PRNGs) should be seeded. I assume you're not using random-behaving numbers for information security purposes (if you are, only a cryptographic generator is appropriate and this advice doesn't apply).
To reduce the risk of correlated pseudorandom numbers, you can use PRNG algorithms that support independent "streams" of pseudorandom numbers, such as SFC, or so-called "counter-based" PRNGs (Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3", 2011). There are other strategies as well, and I explain more about this in "Seeding Multiple Processes".
If you can use NumPy 1.17, note that that version introduced a new PRNG system and added SFC (SFC64) to its repertoire of PRNGs. For NumPy-specific advice on parallel pseudorandom generation, see "Parallel Random Number Generation" in the NumPy documentation.
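For the concrete case in the question, the pattern blessed by that documentation (assuming NumPy 1.17 or later) is to derive every generator from a single SeedSequence. A sketch that also seeds the stdlib random module from the same root:

import random
from numpy.random import SeedSequence, default_rng

# One root seed for the whole program (requires NumPy >= 1.17).
root = SeedSequence(42)

# Spawn statistically independent child sequences, one per generator.
np_child, py_child = root.spawn(2)

rng = default_rng(np_child)                         # NumPy generator
random.seed(int(py_child.generate_state(1)[0]))     # stdlib generator

print(rng.integers(0, 10, size=3), random.randint(0, 9))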
You should avoid seeding PRNGs (especially several at once) with timestamps.
You mentioned this question in a comment, when I started writing this answer. The advice there is not to seed multiple instances of the same kind of PRNG. This advice, however, doesn't apply as much if the seeds are chosen to be unrelated to each other, or if a PRNG with a very big state (such as Mersenne Twister) or a PRNG that gives each seed its own nonoverlapping pseudorandom number sequence (such as SFC) is used. The accepted answer there (at the time of this writing) demonstrates what happens when multiple instances of .NET's System.Random, with sequential seeds, are used, but not necessarily what happens with PRNGs of a different design, PRNGs of multiple designs, or PRNGs initialized with unrelated seeds. Moreover, .NET's System.Random is a poor choice for a PRNG precisely because it allows only seeds no more than 32 bits long (so the number of pseudorandom sequences it can produce is limited), and also because it has implementation bugs (if I understand correctly) that have been preserved for backward compatibility.

Method for finding best result from (almost) random data

So I'm working on a calculator for a game I play (for fun), which takes various abilities with different cooldowns, usage times, the percentage of the time they may be used, etc.
So far I am doing this by enumerating numbers in base-N, where N is however many abilities I have, so for example, assuming I have 5 abilities used over 4 seconds:
0000: 60 damage (using ability 0, trying to use it again but failing - so returns ability damage of 0)
0001: 60 damage
Skip a few ...
0101: 200 damage
and again ...
4444: 70 damage.
Process terminates. - Hope that made sense.
Problem is, doing this by brute force works well with small durations (like above) and numbers of abilities; however, at much longer durations and numbers of abilities it ends up analyzing trillions of simulations, which means brute force is no longer an option.
Question is, considering the data is mostly random, are there any heuristic algorithms that (although they may not return the optimum) will return a relatively good result?
Thanks for any responses :)
Let me rephrase to make sure I understand correctly: you want to find the best sequencing of skills, given their individual damage and cooldowns, such that only one skill is used at each time, and no skill is used more often than its cooldown allows. If so, it is a kind of a scheduling problem and one way to approach would be through linear programming.
The rough idea is to introduce n_skills * simulation_length variables x[skill][time], each constrained between 0 and 1, with the interpretation of "use skill skill at time time if x[skill][time] == 1, don't use if == 0". Now you optimize the sum of all variables weighted by the damage their skill does, sum(x[skill][:] * damage[skill] for skill in skills), under additional linear constraints (explained through numpy-like pseudocode):
for each time t, sum(x[:][t]) <= 1 (at each time you can use at most one ability)
for each ability a and time t0, sum(x[a][t0-cooldown(a):t0+cooldown(a)]) <= 1 (within the period of its cooldown, you can use an ability at most once)
Now the tricky part is that while this will give you a solution that is optimal in some sense, it will most likely not be physical; that is, you'll get fractional xs. This is where the heuristic part kicks in: you have to find some way to "round" the solution to integers, losing objective value in the process, to make it physically (game-wise) meaningful. One way is to keep only the entries with x[a][t] == 1 and round all other numbers down to zero. It will give a meaningful solution, but it may not be very satisfying (i.e. your character will do almost nothing). Given that my model for the problem is quite simple, I would expect there are some theoretical results on how to do a good rounding.
While I can suggest the scipy package for solving the linear program once it's formulated, the whole problem of building the constraint matrix and rounding the results (even trivially) is not a beginner-level programming task.
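To make the formulation above concrete, here is a rough sketch using scipy.optimize.linprog, with made-up damage and cooldown values; the constraint rows and the naive rounding follow the description above:

import numpy as np
from scipy.optimize import linprog

damage   = [60, 100, 40]    # hypothetical damage per ability
cooldown = [1, 3, 2]        # hypothetical cooldown in ticks
T = 6                       # simulation length in ticks
S = len(damage)

def idx(s, t):              # flatten x[skill][time] into one variable vector
    return s * T + t

# linprog minimizes, so negate the damage to maximize it.
c = np.zeros(S * T)
for s in range(S):
    for t in range(T):
        c[idx(s, t)] = -damage[s]

A, b = [], []
# at each tick, use at most one ability
for t in range(T):
    row = np.zeros(S * T)
    for s in range(S):
        row[idx(s, t)] = 1
    A.append(row); b.append(1)
# within any cooldown-length window, use each ability at most once
for s in range(S):
    for t0 in range(T):
        row = np.zeros(S * T)
        for t in range(t0, min(t0 + cooldown[s], T)):
            row[idx(s, t)] = 1
        A.append(row); b.append(1)

res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=(0, 1))
x = res.x.reshape(S, T)
schedule = (x > 0.99).astype(int)   # naive rounding: keep only (near-)ones
print("upper bound on total damage:", -res.fun)
print(schedule)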

Is random.expovariate equivalent to a Poisson Process

I read somewhere that the Python library function random.expovariate produces intervals equivalent to Poisson process events.
Is that really the case or should I impose some other function on the results?
On a strict reading of your question, yes, that is what random.expovariate does.
expovariate gives you random floating point numbers, exponentially distributed. In a Poisson process the size of the interval between consecutive events is exponential.
However, there are two other ways I could imagine modelling Poisson processes:
Just generate random numbers, uniformly distributed and sort them.
Generate integers which have a Poisson distribution (i.e. they are distributed like the number of events within a fixed interval in a Poisson process). Use numpy.random.poisson to do this.
Of course all three things are quite different. The right choice depends on your application.
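For completeness, a minimal sketch of the expovariate approach: summing exponentially distributed inter-arrival gaps yields the event times of a Poisson process (the rate and count here are arbitrary):

import random

rate = 15.0                 # mean arrival rate: 15 events per second
t, arrival_times = 0.0, []
for _ in range(10):
    t += random.expovariate(rate)    # exponential inter-arrival gap
    arrival_times.append(t)
print(arrival_times)        # event times of a simulated Poisson process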
https://stackoverflow.com/a/10250877/1587329 gives a nice explanation of why this works (not only in Python), and some code. In short, you can simulate the first 10 events in a Poisson process with an average rate of 15 arrivals per second like this:
import random
for i in range(10):
    print(random.expovariate(15))

python lottery suggestion

I know Python offers the random module to do some simple lottery tasks. Let's say random.shuffle() is a good one.
However, I want to build my own simple one. What should I look into? Are there any specific mathematical philosophies behind lotteries?
Let's take the simplest situation: 100 names, and generate 20 names randomly.
I don't want to use shuffle, since I want to learn to build one myself.
I need some advice to start. Thanks.
You can generate your own pseudo-random numbers -- there's a huge amount of theory behind that, start for example here -- and of course you won't be able to compete with Python's random "Mersenne twister" (explained halfway down the large wikipedia page I pointed you to), in either quality or speed, but for purposes of understanding, it's a good endeavor. Or, you can get physically-random numbers, for example from /dev/random or /dev/urandom on Linux machines (Windows machines have their own interfaces for that, too) -- the former insists on real physical entropy and can block while it waits for it, the latter has better performance.
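For example, in Python you can tap the OS source directly via os.urandom (a small sketch; the mapping to a range is illustrative and carries a slight modulo bias):

import os

raw = os.urandom(4)                      # 4 bytes from the OS CSPRNG
n = int.from_bytes(raw, "big") % 100     # illustrative mapping to [0, 100)
print(n)                                 # note: % introduces a tiny modulo bias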
Once you do have (or borrow from random;-) a pseudo-random (or really random) number generator, picking 20 items at random from 100 is still an interesting problem. While shuffling is a more general approach, a more immediately understandable one might be, assuming your myrand(N) function returns a random or pseudorandom int between 0 included and N excluded:
def pickfromlist(howmany, thelist):
    result = []
    listcopy = list(thelist)
    while listcopy and len(result) < howmany:
        i = myrand(len(listcopy))
        result.append(listcopy.pop(i))
    return result
Definitely not maximally efficient, but, I hope, maximally clear!-) In words: as long as required and feasible, pick one random item out of the remaining ones (the auxiliary list listcopy gives us the "remaining ones" at any step, and gets modified by .pop without altering the input parameter thelist, since it's a shallow copy).
See the Fisher-Yates Shuffle, described also in Knuth's The Art of Computer Programming.
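For reference, a compact sketch of the Fisher-Yates shuffle, with a stand-in myrand so it runs as-is:

import random

def myrand(n):
    # stand-in for the hypothetical myrand(n) above: uniform int in [0, n)
    return random.randrange(n)

def fisher_yates(items):
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = myrand(i + 1)       # choose from the not-yet-fixed prefix a[:i+1]
        a[i], a[j] = a[j], a[i]
    return a

names = ["name%d" % i for i in range(100)]
winners = fisher_yates(names)[:20]   # shuffle, then take the first 20
print(winners)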
I praise your desire to do this on your own.
Back in the 1950's, random numbers were unavailable to most people without a supercomputer (of the time). The RAND Corporation published a book called A Million Random Digits with 100,000 Normal Deviates which had, literally, just that: random numbers. It was awesome because it enabled laypeople to use high-quality random numbers for research purposes.
Now, back to your question.
I recommend you read the instructions on how to use the book (yes, it comes with instructions) and try to implement that in your Python code. This will not be efficient or elegant, but you will understand the implications of the algorithm you ultimately settle for. I love the part that instructs you to
open the book to an unselected page of the digit table and blindly choose a five-digit number; this number with the first number reduced modulo 2 determines the starting line; the two digits to the right of the initially selected five-digit number are reduced modulo 50 to determine the starting column in the starting line
It was an art to read that table of numbers!
To be sure, I'm not encouraging you to reinvent the wheel for production code. I'm encouraging you to learn about the art of randomness by implementing a clever, if not very efficient, random number generator.
My work requires that I use high-quality random numbers, on limited occasions I have found the site www.random.org a very good source of both insight and material. From their website:
RANDOM.ORG offers true random numbers to anyone on the Internet. The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs. People use RANDOM.ORG for holding drawings, lotteries and sweepstakes, to drive games and gambling sites, for scientific applications and for art and music.
Now, go and implement your own lottery.
You can use: random.sample
Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement.
For a more low-level approach, use `random.choice` in a loop:
Return a random element from the non-empty sequence seq.
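A usage sketch for the 100-names/20-winners case, showing both the one-call and the loop approach:

import random

names = ["name%d" % i for i in range(100)]   # hypothetical population

# High level: 20 unique winners in one call (sampling without replacement).
winners = random.sample(names, 20)

# Lower level: random.choice in a loop, removing each pick so no name repeats.
pool, picked = list(names), []
while len(picked) < 20:
    name = random.choice(pool)
    pool.remove(name)
    picked.append(name)
print(winners, picked)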
The pseudo-random generator (PRNG) in Python is pretty good. If you want to go even more low-level, you can implement your own. Start with reading this article. The mathematical name for lottery is "sampling without replacement". Google that for information - here's a good link.
The main shortcoming of software-based methods of generating lottery numbers is the fact that all random numbers generated by software are pseudo-random.
This may not be a problem for your simple application, but you did ask about a 'specific mathematical philosophy'. You will have noticed that all commercial lottery systems use physical methods: balls with numbers.
And behind the scenes, the numbers generated by physical lottery systems will be carefully scrutinised for indications of non-randomness, and steps taken to eliminate it.
As I say, this may not be a consideration for your simple application, but the overriding requirement of a true lottery (the "specific mathematical philosophy") should be mathematically demonstrable randomness.
