Why is using a time-based pseudo-random number not cryptographically secure? - python

It is well known that pseudo-random numbers are not cryptographically secure.
An extremely basic way I can think of to generate a pseudo-random number is to take the timestamp at the moment the code runs and return its least significant digits.
For example, import time; time.time_ns() / 100 % 1000 returns a number between 0 and 1000 that should be almost impossible to predict unless you know exactly when the code ran (with nanosecond precision) and all the overhead execution times of the code.
We could then use one or more numbers generated this way to drive a chaotic function (such as a logistic map) and produce numbers that should be extremely hard to predict.
One extremely naïve implementation could be:
import time

def random():
    return time.time_ns() / 100 % 1000 / 1000

def logistic():
    r = 3.9 + random() / 10
    N = 1000 + int(random() * 100)
    x = random()
    for _ in range(N):
        x = r * x * (1 - x)
    return x

print(logistic())
However, I'm quite sure that no one would consider this to be cryptographically secure.
Why is this? How could one predict or attack such a method?
EDIT:
My question is a theoretical one: I want to understand why building a true RNG is so difficult. I would never use an RNG I wrote myself in real code, even less if it had to be cryptographically secure. However, it would be nice to have a bit more detail on WHY it is so hard to achieve this result that hundreds or thousands of researchers have spent their careers working on the topic.

It is well known that pseudo-random numbers are not cryptographically secure.
Is it really? All cryptographic systems I know of do use pseudo-random generators. There are two major points for a cryptographically secure pseudo-random sequence:
the probability of any value should be the same (as much as possible) to keep entropy high. If, on a 16-bit number, the generation algorithm consistently sets 8 bits to zero, you only have an 8-bit generator...
knowledge of a number of consecutive values shall not allow one to predict the next one(s) - this is really the tricky part...
Relying on the nanoseconds part of the time blindly assumes that the internal clock of the system will not have preferred values for the low-order bits... and nothing ensures that!
Common systems rely on the hardware randomness of the time only to build a seed.
And when it comes to security and cryptography, the rule is: do not roll your own unless you are an established specialist (and if you are, you already know that any new algorithm or implementation should be carefully reviewed by peers). Hell always hides in the details, and something that looks very clever at first sight can introduce a major flaw. Here, relying on the randomness of the system clock is not secure.
The fact is that building good algorithms and implementations is very hard, and getting others to trust them takes even more time. There is nothing wrong with experimenting with new ideas, and studying how they are validated is even more interesting - you would learn a lot. But my advice is to not use your brand-new algo for anything other than tests, and never for mission-critical operations.
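In Python, "don't roll your own" concretely means reaching for primitives that are already vetted. A minimal sketch of the standard-library option, the secrets module, which draws from the OS CSPRNG:

import secrets

print(secrets.randbelow(1000))          # an unpredictable int in [0, 1000), like the question's range
print(secrets.token_bytes(16).hex())    # 16 bytes suitable for keys or nonces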

For cryptography, it is not only desirable that individual numbers are hard to predict but also that multiple numbers are hard to predict – that is, numbers should (appear to) be independent.
Notably, they should be independent even if an attacker knows the algorithm.
That is problematic with time based "randomness", since by design the next time is after the previous time. Worse, there is a direct relation "how much" one number is after the other – namely how much time has elapsed since fetching the previous number.
For example, drawing numbers in any predictable manner such as a loop gives a significant correlation between numbers:
>>> import time
>>>
>>> def random():
...     return time.time_ns()/100 % 1000
...
>>> # some example samples
>>> [random() for _ in range(5)]
[220.0, 250.0, 250.0, 260.0, 260.0]
>>> [random() for _ in range(5)]
[370.0, 390.0, 400.0, 410.0, 410.0]
>>> # how much spread there is in samples
>>> import statistics
>>> [statistics.stdev(random() for _ in range(5)) for _ in range(5)]
[16.431676725154983, 11.40175425099138, 11.40175425099138, 8.366600265340756, 11.40175425099138]
>>> [statistics.stdev(random() for _ in range(5)) for _ in range(5)]
[16.431676725154983, 11.40175425099138, 11.40175425099138, 11.40175425099138, 11.40175425099138]
Notably, the critical part is actually not that the spread is low but that it is predictable.
The other problem is that there really is not much that can be done about it: The "randomness" contained in time is inherently limited.
On one end, time randomness is constrained by resolution. As can be seen by the output, my system clock only has sufficient resolution for multiples of 10 – so it can only draw 100 distinct numbers.
On the other end, time randomness is constrained by the program duration. A program that, say, draws during 1 ms only can draw randomness from that duration – that's only 1 000 000 distinct nanoseconds.
Notably, no algorithm can increase that randomness – one can shift it or even it out across some range, but not create more randomness from it.
Now, in practice one could say that it is still somewhat difficult to predict the values drawn this way. The point however is that other means are more difficult to predict.
Consider that some attacker knew your algorithm and tried to brute-force a result:
With a true random generator for actual 3*[0, 1000) they must try 1000*1000*1000 numbers.
With the time based random generator for seemingly 3*[0, 1000) they must try roughly 100*5 numbers (low resolution, low spread).
That means time based randomness makes the approach 2_000_000 times less robust against a brute force attack.
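To make the brute-force comparison concrete, here is a rough sketch of where those numbers come from. It is illustrative only: it uses integer division instead of the question's float expression, and assumes a clock that only advances in 1000 ns steps (the multiples-of-10 resolution visible in the samples above).

STEP = 1000   # assumed clock granularity in nanoseconds

# Low resolution: over one full period of the formula, only 100 outputs exist.
per_draw = {(ns // 100) % 1000 for ns in range(0, 100_000, STEP)}
print(len(per_draw))      # 100

# Low spread: if the attacker knows the draws happened within ~5 microseconds
# of each other, each draw is confined to a handful of those values.
clustered = {(ns // 100) % 1000 for ns in range(0, 5_000, STEP)}
print(len(clustered))     # 5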

Related

Method for finding best result from (almost) random data

So I'm working on a calculator for a game I play - for fun - which takes various abilities with different cooldowns, usage times, a percentage at which they may be used, etc...
So far I am doing this by analyzing numbers in base however-many-abilities-I-have, so for example, assuming I have 5 abilities used over 4 seconds:
0000: 60 damage (using ability 0, trying to use it again but failing - so returns ability damage of 0)
0001: 60 damage
Skip a few ...
0101: 200 damage
and again ...
4444: 70 damage.
Process terminates. - Hope that made sense.
Problem is, doing this by brute force works well with small times (like above) and numbers of abilities; however, at much higher times and numbers of abilities it ends up analyzing trillions of simulations, which means brute force is no longer an option.
Question is, considering the data is mostly random, are there any heuristic algorithms that (although they may not return the optimum) will return a relatively good result?
Thanks for any responses :)
Let me rephrase to make sure I understand correctly: you want to find the best sequencing of skills, given their individual damage and cooldowns, such that only one skill is used at each time, and no skill is used more often than its cooldown allows. If so, it is a kind of scheduling problem, and one way to approach it would be through linear programming.
The rough idea is to introduce n_skills * simulation_length variables x[skill][time], each constrained between 0 and 1, with the interpretation of "use skill skill at time time if x[skill][time] == 1, don't use if == 0". Now you optimize the sum of all variables weighted by the damage their skill does, sum(x[skill][:] * damage[skill] for skill in skills), under additional linear constraints (explained through numpy-like pseudocode):
for each time t, sum(x[:][t]) <= 1 (at each time you can use at most one ability)
for each ability a and time t0, sum(x[a][t0-cooldown(a):t0+cooldown(a)]) <= 1 (within the period of its cooldown, you can only use your ability at most once)
Now the tricky part is that while this will give you a solution that is optimal in some sense, it will most likely not be physical, that is you'll get fractional xs. This is where the heuristic part kicks in, you have to find some way to "round" the solution to integers, losing objective value in the process, to make it physically (game-ally) meaningful. One way is to only keep x[a][t] == 1, and round all other numbers down to zero. It will give a meaningful solution, but it may not be very satisfying (ie. your character will do almost nothing). Given that my model for the problem is quite simple, I would expect there are some theoretical results on how to give a good rounding.
While I can suggest the scipy package for solving the linear program once it's formulated, the whole problem of building the constraint matrix and rounding the results (even trivially) is not a beginner-level programming task.
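A minimal sketch of that LP relaxation with scipy.optimize.linprog. The damage and cooldown values are made up, and the cooldown constraint is encoded as "at most one use in any window of cooldown steps", which is one way of reading the constraint above:

import numpy as np
from scipy.optimize import linprog

damage = [60, 80, 200]       # hypothetical damage per skill
cooldown = [1, 2, 4]         # hypothetical cooldowns, in time steps
S, T = len(damage), 8        # number of skills, simulated time steps
n = S * T                    # one variable x[s*T + t] in [0, 1] per (skill, time)

c = -np.repeat(damage, T).astype(float)   # linprog minimizes, so negate damage

A, b = [], []
# At each time step, use at most one skill in total.
for t in range(T):
    row = np.zeros(n)
    row[t::T] = 1.0          # selects x[s*T + t] for every skill s
    A.append(row)
    b.append(1.0)
# Each skill is used at most once inside any window of `cooldown` steps.
for s in range(S):
    for t0 in range(T - cooldown[s] + 1):
        row = np.zeros(n)
        row[s * T + t0 : s * T + t0 + cooldown[s]] = 1.0
        A.append(row)
        b.append(1.0)

res = linprog(c, A_ub=np.array(A), b_ub=np.array(b), bounds=[(0, 1)] * n)
print("upper bound on total damage:", -res.fun)
print(np.round(res.x.reshape(S, T), 2))   # fractional solution; still needs rounding

The rounding step described above then turns the fractional matrix into an actual, playable schedule.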

all (2^m−2)/2 possible ways to partition list

Each sample is an array of features (ints). I need to split my samples into two separate groups by figuring out what the best feature, and the best splitting value for that feature, is. By "best", I mean the split that gives me the greatest entropy difference between the pre-split set and the weighted average of the entropy values on the left and right sides. I need to try all (2^m−2)/2 possible ways to partition these items into two nonempty lists (where m is the number of distinct values (all samples with the same value for that feature are moved together as a group))
The following is extremely slow so I need a more reasonable/ faster way of doing this.
sorted_by_feature is a list of (feature_value, 0_or_1) tuples.
same_vals = {}
for ele in sorted_by_feature:
    if ele[0] not in same_vals:
        same_vals[ele[0]] = [ele]
    else:
        same_vals[ele[0]].append(ele)
l = same_vals.keys()
orderings = list(itertools.permutations(l))
for ordering in orderings:
    list_tups = []
    for dic_key in ordering:
        list_tups += same_vals[dic_key]
    left_1 = 0
    left_0 = 0
    right_1 = num_one
    right_0 = num_zero
    for index, tup in enumerate(list_tups):
        # 0's or 1's on the left +/- 1;
        # calculate entropy on left/right, calculate entropy drop, etc.
Trivial details (continuing the code above):
        if index == len(sorted_by_feature) - 1:
            break
        if tup[1] == 1:
            left_1 += 1
            right_1 -= 1
        if tup[1] == 0:
            left_0 += 1
            right_0 -= 1
        # only calculate entropy if values to left and right of split are different
        if list_tups[index][0] != list_tups[index+1][0]:
tl;dr
You're asking for a miracle. No programming language can help you out of this one. Use better approaches than what you're considering doing!
Your Solution has Exponential Time Complexity
Let's assume a perfect algorithm: one that can give you a new partition in constant O(1) time. In other words, no matter what the input, a new partition can be generated in a guaranteed constant amount of time.
Let's in fact go one step further and assume that your algorithm is only CPU-bound and is operating under ideal conditions. Under ideal circumstances, a high-end CPU can process upwards of 100 billion instructions per second. Since this algorithm takes O(1) time, we'll say, oh, that every new partition is generated in a hundred billionth of a second. So far so good?
Now you want this to perform well. You say you want this to be able to handle an input of size m. You know that means you need about pow(2,m) iterations of your algorithm - that's the number of partitions you need to generate - and since generating each partition takes a constant amount of time O(1), the total time is just pow(2,m) times O(1). Let's take a quick look at the numbers here:
m = 20 means your time taken is pow(2,20)*10^-11 seconds = 0.00001 seconds. Not bad.
m = 40 means your time taken is pow(2,40)*10^-11 seconds ≈ 1 trillion / 100 billion = 10 seconds. Also not bad, but note how small m = 40 is. In the vast panopticon of numbers, 40 is nothing. And remember we're assuming ideal conditions.
m = 100 means pow(2,100)*10^-11 seconds ≈ 10^19 seconds - hundreds of billions of years! What happened?
You're a victim of algorithmic theory. Simply put, a solution that has exponential time complexity - any solution that takes 2^m steps to complete - cannot be sped up by better programming. Generating or producing pow(2,m) outputs is always going to take time proportional to pow(2,m).
Note further that 100 billion instructions/sec is an ideal for high-end desktop computers - your CPU also has to worry about processes other than this program you're running, in which case kernel interrupts and context switches eat into processing time (especially when you're running a few thousand system processes, which you no doubt are). Your CPU also has to read and write from disk, which is I/O bound and takes a lot longer than you think. Interpreted languages like Python also eat into processing time since each line is dynamically converted to bytecode, forcing additional resources to be devoted to that. You can benchmark your code right now and I can pretty much guarantee your numbers will be way higher than the simplistic calculations I provide above. Even worse: storing 2^40 permutations requires 1000 GBs of memory. Do you have that much to spare? :)
Switching to a lower-level language, using generators, etc. is all a pointless affair: they're not the main bottleneck, which is simply the large and unreasonable time complexity of your brute force approach of generating all partitions.
What You Can Do Instead
Use a better algorithm. Generating pow(2,m) partitions and investigating all of them is an unrealistic ambition. You want, instead, to consider a dynamic programming approach. Instead of walking through the entire space of possible partitions, you want to only consider walking through a reduced space of optimal solutions only. That is what dynamic programming does for you. An example of it at work in a problem similar to this one: unique integer partitioning.
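For flavour, here is the classic dynamic-programming treatment of counting integer partitions - not your splitting problem, just an illustration of how DP reuses sub-results instead of enumerating every partition explicitly:

def count_partitions(n):
    # ways[k] = number of ways to write k as a sum of the parts considered so far
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]

print(count_partitions(10))   # 42 partitions of 10, found without listing them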
Dynamic programming approaches work best on problems that can be formulated as linearized directed acyclic graphs (Google it if you're not sure what I mean!).
If a dynamic approach is out, consider investing in parallel processing with a GPU instead. Your computer already has a GPU - it's what your system uses to render graphics - and GPUs are built to be able to perform large numbers of calculations in parallel. A parallel calculation is one in which different workers can do different parts of the same calculation at the same time - the net result can then be joined back together at the end. If you can figure out a way to break this into a series of parallel calculations - and I think there is good reason to suggest you can - there are good tools for GPU interfacing in Python.
Other Tips
Be very explicit on what you mean by best. If you can provide more information on what best means, we folks on Stack Overflow might be of more assistance and write such an algorithm for you.
Using a bare-metal compiled language might help reduce the amount of real time your solution takes in ordinary situations, but the difference in this case is going to be marginal. Compiled languages are useful when you have to do operations like searching through an array efficiently, since there's no instruction-compilation overhead at each iteration. They're not all that more useful when it comes to generating new partitions, because that's not something that removing the dynamic bytecode generation barrier actually affects.
A couple of minor improvements I can see:
Use try/except instead of if not in to avoid a double lookup of keys
if ele[0] not in same_vals:
    same_vals[ele[0]] = [ele]
else:
    same_vals[ele[0]].append(ele)

# Should be changed to

try:
    same_vals[ele[0]].append(ele)  # Most of the time this will work
except KeyError:
    same_vals[ele[0]] = [ele]
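For what it's worth, the standard library can also do this grouping for you; a collections.defaultdict avoids both the double lookup and the try/except:

from collections import defaultdict

same_vals = defaultdict(list)
for ele in sorted_by_feature:
    same_vals[ele[0]].append(ele)   # missing keys start out as empty lists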
Don't explicitly convert a generator to a list if you don't have to. I don't immediately see any need for the cast to a list, which only slows things down:
orderings = list(itertools.permutations(l))
for ordering in orderings:
# Should be changed to
for ordering in itertools.permutations(l):
But, like I said, these are only minor improvements. If you really needed this to be faster, consider using a different language.

What is pseudo random?

I was reading the docs for the random module and noticed it said pseudo-random, and I thought: doesn't pseudo mean false? So I was wondering what it means when it says that.
For Example:
import random
print(random.randint(1, 2))
print(random.randint(1, 3))
Does this still mean that the first print statement has a 50% chance of printing 1 and a 50% chance of printing 2, and that the second print statement has a 33% chance of printing 1, a 33% chance of printing 2, etc.?
If not, then how are the pseudo-random numbers generated?
To produce true randomness requires specialized hardware that measures random events, such as radioactive decay (random) or Brownian motion (also essentially random). Most computers obviously don't have these, so instead you have to use a really complex, evenly distributed, hard-to-predict 'pseudorandom' algorithm that starts with a number determined by, for example, the current timestamp. Such algorithms are plenty good enough for standard use cases needing 'randomness', as long as you're careful not to seed two random number generators with the same timestamp (start them at the same time on different threads, for example), which will make them do identical things. A common example of such a random number generator is the Mersenne Twister: http://en.wikipedia.org/wiki/Mersenne_twister
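The seeding caveat is easy to see for yourself: two generators given the same seed produce exactly the same sequence.

import random

a = random.Random(12345)    # two independent generators...
b = random.Random(12345)    # ...seeded with the same value

print([a.randint(1, 100) for _ in range(5)])
print([b.randint(1, 100) for _ in range(5)])   # identical lists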
A site that offers truly random values, explains a lot about randomness and pseudorandomness and has some yummy statistics about its randomness: http://www.random.org/ (see Learn More and Statistics) (It actually seems that it relies on measuring tiny fluctuations in a chaotic system, e.g. atmospheric noise, but the statistics show that it is so much like true randomness you can't tell it apart!)

What's more random, hashlib or urandom?

I'm working on a project with a friend where we need to generate a random hash. Before we had time to discuss, we both came up with different approaches and because they are using different modules, I wanted to ask you all what would be better--if there is such a thing.
hashlib.sha1(str(random.random())).hexdigest()
or
os.urandom(16).encode('hex')
Typing this question out has got me thinking that the second method is better. Simple is better than complex. If you agree, how reliable is this for 'randomly' generating hashes? How would I test this?
This solution:
os.urandom(16).encode('hex')
is the best since it uses the OS to generate randomness which should be usable for cryptographic purposes (depends on the OS implementation).
random.random() generates pseudo-random values.
Hashing a random value does not add any new randomness.
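For what it's worth, on Python 3.6+ the secrets module draws from the same OS source, so a modern equivalent of os.urandom(16).encode('hex') is a one-liner:

import secrets

print(secrets.token_hex(16))   # 16 random bytes, hex-encoded (32 characters)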
random.random() is a pseudo-random generator, which means the numbers are generated from a deterministic sequence: if you call random.seed(some_number), then from that point on the generated sequence will always be the same.
os.urandom() gets the random numbers from the OS's RNG, which uses an entropy pool to collect real random numbers, usually from random hardware events; there are even dedicated entropy-generator devices for systems where a lot of random numbers are needed.
On Unix systems there are traditionally two random number generators: /dev/random and /dev/urandom. Reads from the first block if there is not enough entropy available, whereas when you read /dev/urandom and there is not enough entropy data available, it falls back to a pseudo-RNG and doesn't block.
So which to use depends on what you need: if you just need some evenly distributed random numbers, the built-in PRNG should be sufficient. For cryptographic use it's always better to use real random numbers.
The second solution clearly has more entropy than the first. Assuming the quality of the source of the random bits would be the same for os.urandom and random.random:
In the second solution you are fetching 16 bytes = 128 bits worth of randomness
In the first solution you are fetching a floating point value which has roughly 52 bits of randomness (IEEE 754 double, ignoring subnormal numbers, etc...). Then you hash it around, which, of course, doesn't add any randomness.
More importantly, the quality of the randomness coming from os.urandom is expected and documented to be much better than the randomness coming from random.random. os.urandom's docstring says "suitable for cryptographic use".
Testing randomness is notoriously difficult - however, I would choose the second method, but ONLY (or, only as far as comes to mind) for this case, where the hash is seeded by a random number.
The whole point of hashes is to create a number that is vastly different based on slight differences in input. For your use case, the randomness of the input should do. If, however, you wanted to hash a file and detect one eensy byte's difference, that's when a hash algorithm shines.
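That "vastly different output for a slightly different input" property is easy to demonstrate:

import hashlib

print(hashlib.sha1(b"some file contents").hexdigest())
print(hashlib.sha1(b"some file contentz").hexdigest())   # one byte changed, unrelated-looking digest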
I'm just curious, though: why use a hash algorithm at all? It seems that you're looking for a purely random number, and there are lots of libraries that generate uuid's, which have far stronger guarantees of uniqueness than random number generators.
If you want a unique identifier (UUID), then you should use:
import uuid
uuid.uuid4().hex
https://docs.python.org/3/library/uuid.html

python lottery suggestion

I know Python offers the random module to do a simple lottery. Let's say random.shuffle() is a good one.
However, I want to build my own simple one. What should I look into? Are there any specific mathematical philosophies behind lotteries?
Let's say the simplest situation: 100 names, and generate 20 names randomly.
I don't want to use shuffle, since I want to learn to build one myself.
I need some advice to start. Thanks.
You can generate your own pseudo-random numbers -- there's a huge amount of theory behind that, start for example here -- and of course you won't be able to compete with Python's random "Mersenne twister" (explained halfway down the large wikipedia page I pointed you to), in either quality or speed, but for purposes of understanding, it's a good endeavor. Or, you can get physically-random numbers, for example from /dev/random or /dev/urandom on Linux machines (Windows machines have their own interfaces for that, too) -- the former offers stronger physical randomness, the latter better performance.
Once you do have (or borrow from random;-) a pseudo-random (or really random) number generator, picking 20 items at random from 100 is still an interesting problem. While shuffling is a more general approach, a more immediately understandable one might be, assuming your myrand(N) function returns a random or pseudorandom int between 0 included and N excluded:
def pickfromlist(howmany, thelist):
    result = []
    listcopy = list(thelist)
    while listcopy and len(result) < howmany:
        i = myrand(len(listcopy))
        result.append(listcopy.pop(i))
    return result
Definitely not maximally efficient, but, I hope, maximally clear!-) In words: as long as required and feasible, pick one random item out of the remaining ones (the auxiliary list listcopy gives us the "remaining ones" at any step, and gets modified by .pop without altering the input parameter thelist, since it's a shallow copy).
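If you want to stay in the spirit of the question and supply your own myrand as well, a throwaway sketch could be a tiny linear congruential generator plugged into pickfromlist above (the constants are the classic glibc-style ones; this is for learning only, never for anything security-related):

import time

_state = time.time_ns() & 0x7FFFFFFF        # seed from the clock, as discussed

def myrand(n):
    # Return a pseudo-random int in [0, n) -- note the slight modulo bias.
    global _state
    _state = (1103515245 * _state + 12345) % (2 ** 31)
    return _state % n

names = ["name%d" % i for i in range(100)]
print(pickfromlist(20, names))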
See the Fisher-Yates Shuffle, described also in Knuth's The Art of Computer Programming.
I praise your desire to do this on your own.
Back in the 1950s, random numbers were unavailable to most people without a supercomputer (of the time). The RAND Corporation published a book called A Million Random Digits with 100,000 Normal Deviates which contained, literally, just that: random numbers. It was awesome because it enabled laypeople to use high-quality random numbers for research purposes.
Now, back to your question.
I recommend you read the instructions on how to use the book (yes, it comes with instructions) and try to implement that in your Python code. This will not be efficient or elegant, but you will understand the implications of the algorithm you ultimately settle for. I love the part that instructs you to
open the book to an unselected page of the digit table and blindly choose a five-digit number; this number with the first number reduced modulo 2 determines the starting line; the two digits to the right of the initially selected five-digit number are reduced modulo 50 to determine the starting column in the starting line
It was an art to read that table of numbers!
To be sure, I'm not encouraging you to reinvent the wheel for production code. I'm encouraging you to learn about the art of randomness by implementing a clever, if not very efficient, random number generator.
My work requires that I use high-quality random numbers; on limited occasions I have found the site www.random.org a very good source of both insight and material. From their website:
RANDOM.ORG offers true random numbers to anyone on the Internet. The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs. People use RANDOM.ORG for holding drawings, lotteries and sweepstakes, to drive games and gambling sites, for scientific applications and for art and music.
Now, go and implement your own lottery.
You can use: random.sample
Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement.
For a more low-level approach, use random.choice in a loop:
Return a random element from the non-empty sequence seq.
The pseudo-random generator (PRNG) in Python is pretty good. If you want to go even more low-level, you can implement your own. Start by reading this article. The mathematical name for a lottery is "sampling without replacement". Google that for information - here's a good link.
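A minimal sketch of the standard-library route for the asker's exact scenario (100 names, draw 20), using both suggestions above:

import random

names = ["name%d" % i for i in range(100)]

winners = random.sample(names, 20)      # 20 distinct names, no repeats
print(winners)

# Equivalent low-level loop with random.choice, removing each winner as drawn
pool, picked = list(names), []
while len(picked) < 20:
    choice = random.choice(pool)
    pool.remove(choice)
    picked.append(choice)
print(picked)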
The main shortcoming of software-based methods of generating lottery numbers is the fact that all random numbers generated by software are pseudo-random.
This may not be a problem for your simple application, but you did ask about a 'specific mathematical philosophy'. You will have noticed that all commercial lottery systems use physical methods: balls with numbers.
And behind the scenes, the numbers generated by physical lottery systems will be carefully scrutinised for indications of non-randomness, and steps taken to eliminate it.
As I say, this may not be a consideration for your simple application, but the overriding requirement of a true lottery (the 'specific mathematical philosophy') should be mathematically demonstrable randomness.
