Generating crypto-secure strings for OAuth tokens

I want to generate tokens and keys that are random strings. What is the acceptable method to generate them?
Is generating two UUIDs via standard library functions and concatenating them acceptable?

os.urandom provides access to the operating system's random number generator.
EDIT: If you are using Linux and are very concerned about security, you should use /dev/random directly. This call will block until sufficient entropy is available.

Computers (without special hardware) can only generate pseudo-random data. After a while, all pseudo-random number generators will start to repeat themselves. The amount of data a generator can produce before repeating itself is called its period.
A very popular pseudo-random number generator (also used in Python's random module) is the Mersenne Twister. But it is deemed unsuitable for cryptographic purposes, because it is fairly easy to predict the next iteration after observing only a relatively small number of iterates.
See the Wikipedia page on cryptographically secure pseudo-random number generators for a list of algorithms that seem suitable.
Operating systems like FreeBSD, OpenBSD and OS X use the Yarrow algorithm for their urandom devices. So on those systems using os.urandom might be OK, because Yarrow is well regarded as being cryptographically secure.
Of course what you need to use depends to a large degree on how high your requirements are; how secure do you want it to be? In general I would advise you to use published and tested implementations of algorithms. Writing your own implementation is too easy to get wrong.
Edit: Computers can gather random data by watching e.g. the times at which interrupts arrive. This however does not supply a large amount of random data, and it is therefore often used to seed a PRNG.
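As a minimal sketch of the os.urandom advice above, assuming Python 3.6 or later: the standard library's secrets module draws from the operating system's generator and returns text-safe tokens. The helper names make_token and make_hex_key are made up for illustration. On the UUID idea from the question, note that a version 4 UUID carries 122 random bits, since 6 bits are fixed for the version and variant.

```python
import secrets


def make_token(nbytes=32):
    """Return a URL-safe token built from nbytes of OS-provided randomness."""
    return secrets.token_urlsafe(nbytes)


def make_hex_key(nbytes=32):
    """Return a hexadecimal key; each byte becomes two hex digits."""
    return secrets.token_hex(nbytes)


token = make_token()   # hypothetical OAuth access token
key = make_hex_key()   # hypothetical client secret
```

Both functions take the number of random bytes, not the length of the resulting string, so the output is longer than nbytes characters.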

Related

In what situations do we specify a static seed value when using the random module in Python? [duplicate]

From the docs:
random.seed(a=None, version=2) Initialize the random number generator.
If a is omitted or None, the current system time is used. If
randomness sources are provided by the operating system, they are used
instead of the system time (see the os.urandom() function for details
on availability).
But...if it's truly random...(and I thought I read it uses Mersenne, so it's VERY random)...what's the point in seeding it? Either way the outcome is unpredictable...right?

The default is probably best if you want different random numbers with each run. If for some reason you need repeatable random numbers, in testing for instance, use a seed.

The module actually seeds the generator (with OS-provided random data from urandom if possible, otherwise with the current date and time) when you import the module, so there's no need to manually call seed().
(This is mentioned in the Python 2.7 documentation but, for some reason, not the 3.x documentation. However, I confirmed in the 3.x source that it's still done.)
If the automatic seeding weren't done, you'd get the same sequence of numbers every time you started your program, same as if you manually use the same seed every time.

But...if it's truly random
No, it's pseudo random. If it uses Mersenne Twister, that too is a PRNG.
It's basically an algorithm that generates the exact same sequence of pseudo random numbers out of a given seed. Generating truly random numbers requires special hardware support; it's not something you can do with a pure algorithm.
You might not need to seed it since it seeds itself on first use, unless you have some other or better means of providing a seed than what is time based.
If you use the random numbers for things that are not security related, a time based seed is normally fine. If you use it for security/cryptography, note what the docs say: "and is completely unsuitable for cryptographic purposes"

If you want to reproduce your results, you seed the generator with a known value so you get the same sequence every time.
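A small sketch of that point, using a private random.Random instance so the module-level generator is left alone. The function name draw and the seed values are arbitrary choices for illustration.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator with a known seed
    return [rng.randint(1, 100) for _ in range(n)]


first = draw(42)
second = draw(42)  # same seed, so the same sequence as `first`
other = draw(43)   # a different seed starts a different sequence
```

Calling draw with the same seed in a later run of the program gives the same list again, which is what makes a test or simulation repeatable.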

The Mersenne Twister, the random number generator used by Python, is seeded by default with random numbers served by the operating system on platforms where that is possible (Unixen, Windows); on other platforms the seed defaults to the system time, which means very repeatable values if the system time has poor precision. On such systems, seeding with known better random values is thus beneficial. Note that on Python 3 specifically, if version 2 is used, you can pass in any str, bytes, or bytearray to seed the generator, thus making better use of the Mersenne Twister's large state.
Another reason to use a seed value is indeed to guarantee that you get the same sequence of random numbers again and again - by reusing the known seed. Quoting the docs:
Sometimes it is useful to be able to reproduce the sequences given by
a pseudo random number generator. By re-using a seed value, the same
sequence should be reproducible from run to run as long as multiple
threads are not running.
Most of the random module’s algorithms and seeding functions are
subject to change across Python versions, but two aspects are
guaranteed not to change:
If a new seeding method is added, then a backward compatible seeder will be offered.
The generator’s random() method will continue to produce the same sequence when the compatible seeder is given the same seed.
For this however, you mostly want to use the random.Random instances instead of using module global methods (the multiple threads issue, etc).
Finally, also note that the random numbers produced by the Mersenne Twister are unsuitable for cryptographic use; although they appear very random, it is possible to fully recover the internal state of the generator by observing only some hundreds of values from it. For cryptographic algorithms, you want to use the SystemRandom class.
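A minimal sketch of the SystemRandom suggestion: it exposes the same methods as random.Random but takes its values from the operating system, and its seed() call is ignored. The alphabet and the helper name secure_string are assumptions for illustration.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits


def secure_string(length=32):
    """Build a string from characters chosen by the OS random source."""
    sysrand = random.SystemRandom()
    sysrand.seed(1234)  # has no effect: SystemRandom cannot be seeded
    return "".join(sysrand.choice(ALPHABET) for _ in range(length))


value = secure_string()
```

Because the seed is ignored, two calls do not reproduce each other, which is the property wanted for keys and the opposite of what a reproducible simulation needs.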

In most cases I would say there is no need to care about it. But if someone is really willing to do something weird and could roughly figure out your system time when your code was running, they might be able to brute-force replay your random numbers and see which series fits. But I would say this is quite unlikely in most cases.

Best practices for seeding random and numpy.random in the same program

In order to make random simulations we run reproducible later, my colleagues and I often explicitly seed the random or numpy.random modules' random number generators using the random.seed and np.random.seed methods. Seeding with an arbitrary constant like 42 is fine if we're just using one of those modules in a program, but sometimes, we use both random and np.random in the same program. I'm unsure whether there are any best practices I should be following about how to seed the two RNGs together.
In particular, I'm worried that there's some sort of trap we could fall into where the two RNGs together behave in a "non-random" way, such as both generating the exact same sequence of random numbers, or one sequence trailing the other by a few values (e.g. the kth number from random is always the k+20th number from np.random), or the two sequences being related to each other in some other mathematical way. (I realise that pseudo-random number generators are all imperfect simulations of true randomness, but I want to avoid exacerbating this with poor seed choices.)
With this objective in mind, are there any particular ways we should or shouldn't seed the two RNGs? I've used, or seen colleagues use, a few different tactics, like:
Using the same arbitrary seed:
random.seed(42)
np.random.seed(42)
Using two different arbitrary seeds:
random.seed(271828)
np.random.seed(314159)
Using a random number from one RNG to seed the other:
random.seed(42)
np.random.seed(random.randint(0, 2**32 - 1))
... and I've never noticed any strange outcomes from any of these approaches... but maybe I've just missed them. Are there any officially blessed approaches to this? And are there any possible traps that I can spot and raise the alarm about in code review?

I will discuss some guidelines on how multiple pseudorandom number generators (PRNGs) should be seeded. I assume you're not using random-behaving numbers for information security purposes (if you are, only a cryptographic generator is appropriate and this advice doesn't apply).
To reduce the risk of correlated pseudorandom numbers, you can use PRNG algorithms, such as SFC and other so-called "counter-based" PRNGs (Salmon et al., "Parallel Random Numbers: As Easy as 1, 2, 3", 2011), that support independent "streams" of pseudorandom numbers. There are other strategies as well, and I explain more about this in "Seeding Multiple Processes".
If you can use NumPy 1.17, note that that version introduced a new PRNG system and added SFC (SFC64) to its repertoire of PRNGs. For NumPy-specific advice on parallel pseudorandom generation, see "Parallel Random Number Generation" in the NumPy documentation.
You should avoid seeding PRNGs (especially several at once) with timestamps.
You mentioned this question in a comment, when I started writing this answer. The advice there is not to seed multiple instances of the same kind of PRNG. This advice, however, doesn't apply as much if the seeds are chosen to be unrelated to each other, or if a PRNG with a very big state (such as Mersenne Twister) or a PRNG that gives each seed its own nonoverlapping pseudorandom number sequence (such as SFC) is used. The accepted answer there (at the time of this writing) demonstrates what happens when multiple instances of .NET's System.Random, with sequential seeds, are used, but not necessarily what happens with PRNGs of a different design, PRNGs of multiple designs, or PRNGs initialized with unrelated seeds. Moreover, .NET's System.Random is a poor choice for a PRNG precisely because it allows only seeds no more than 32 bits long (so the number of pseudorandom sequences it can produce is limited), and also because it has implementation bugs (if I understand correctly) that have been preserved for backward compatibility.
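One way to get seeds that are unrelated to each other from a single recorded value is to hash that value together with a label per generator. This is only an illustrative sketch, not something the answer above prescribes; derive_seed, MASTER_SEED and the labels are hypothetical names, and NumPy 1.17+ users may prefer the facilities described in the NumPy documentation mentioned above.

```python
import hashlib
import random


def derive_seed(master_seed, label):
    """Hash (master_seed, label) down to a 32-bit integer seed."""
    material = f"{master_seed}:{label}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:4], "big")  # value in [0, 2**32 - 1]


MASTER_SEED = 42  # the one value that gets recorded for reproducibility
stdlib_seed = derive_seed(MASTER_SEED, "random")
numpy_seed = derive_seed(MASTER_SEED, "numpy")

stdlib_rng = random.Random(stdlib_seed)
# numpy_seed fits the range np.random.seed accepts, so it could be
# passed as np.random.seed(numpy_seed) in a program that uses NumPy.
```

Only MASTER_SEED has to be written down; the per-generator seeds can be recomputed from it at any time.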

What's more random, hashlib or urandom?

I'm working on a project with a friend where we need to generate a random hash. Before we had time to discuss, we both came up with different approaches and because they are using different modules, I wanted to ask you all what would be better--if there is such a thing.
hashlib.sha1(str(random.random())).hexdigest()
or
os.urandom(16).encode('hex')
Typing this question out has got me thinking that the second method is better. Simple is better than complex. If you agree, how reliable is this for 'randomly' generating hashes? How would I test this?

This solution:
os.urandom(16).encode('hex')
is the best since it uses the OS to generate randomness which should be usable for cryptographic purposes (depends on the OS implementation).
random.random() generates pseudo-random values.
Hashing a random value does not add any new randomness.
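The snippets in the question are Python 2 spellings; bytes objects in Python 3 have no 'hex' codec via .encode. A rough Python 3 equivalent of both approaches, assuming Python 3.5+ for bytes.hex(), with the variable names strong and weak chosen here for illustration:

```python
import hashlib
import os
import random

# Second approach: 16 bytes from the OS, shown as 32 hex digits.
strong = os.urandom(16).hex()

# First approach: the hash only rearranges the limited randomness
# of a single float from the Mersenne Twister; it adds none.
weak = hashlib.sha1(str(random.random()).encode("ascii")).hexdigest()
```

The SHA-1 digest is longer (40 hex digits against 32), but length is not randomness: the set of possible weak values is limited by the floats random.random() can return.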

random.random() is a pseudo-random generator, which means the numbers are generated from a sequence. If you call random.seed(some_number), then after that the generated sequence will always be the same.
os.urandom() gets its random numbers from the OS's RNG, which uses an entropy pool to collect real random numbers, usually from random events on hardware devices. There even exist special hardware entropy generators for systems where a lot of random numbers are generated.
On Unix systems there are traditionally two random number generators: /dev/random and /dev/urandom. Calls to the first block if there is not enough entropy available, whereas when you read /dev/urandom and there is not enough entropy data available, it uses a pseudo-RNG and doesn't block.
So the use usually depends on what you need: if you need a few, equally distributed random numbers, then the built-in PRNG should be sufficient. For cryptographic use it's always better to use real random numbers.

The second solution clearly has more entropy than the first. Assuming the quality of the source of the random bits would be the same for os.urandom and random.random:
In the second solution you are fetching 16 bytes = 128 bits worth of randomness
In the first solution you are fetching a floating point value which has roughly 52 bits of randomness (IEEE 754 double, ignoring subnormal numbers, etc...). Then you hash it around, which, of course, doesn't add any randomness.
More importantly, the quality of the randomness coming from os.urandom is expected and documented to be much better than the randomness coming from random.random. os.urandom's docstring says "suitable for cryptographic use".

Testing randomness is notoriously difficult. However, I would choose the second method, but ONLY (or, only as far as comes to mind) for this case, where the hash is seeded by a random number.
The whole point of hashes is to create a number that is vastly different based on slight differences in input. For your use case, the randomness of the input should do. If, however, you wanted to hash a file and detect one eensy byte's difference, that's when a hash algorithm shines.
I'm just curious, though: why use a hash algorithm at all? It seems that you're looking for a purely random number, and there are lots of libraries that generate UUIDs, which have far stronger guarantees of uniqueness than random number generators.

if you want a unique identifier (uuid), then you should use
import uuid
uuid.uuid4().hex
https://docs.python.org/3/library/uuid.html

python lottery suggestion

I know Python offers the random module to do some simple lottery. Let's say random.shuffle() is a good one.
However, I want to build my own simple one. What should I look into? Are there any specific mathematical philosophies behind a lottery?
Let's say, the simplest situation: 100 names, and pick 20 names randomly.
I don't want to use shuffle, since I want to learn to build one myself.
I need some advice to get started. Thanks.

You can generate your own pseudo-random numbers -- there's a huge amount of theory behind that, start for example here -- and of course you won't be able to compete with Python's random "Mersenne twister" (explained halfway down the large wikipedia page I pointed you to), in either quality or speed, but for purposes of understanding, it's a good endeavor. Or, you can get physically-random numbers, for example from /dev/random or /dev/urandom on Linux machines (Windows machines have their own interfaces for that, too) -- one has more pushy physical randomness, the other one has better performance.
Once you do have (or borrow from random;-) a pseudo-random (or really random) number generator, picking 20 items at random from 100 is still an interesting problem. While shuffling is a more general approach, a more immediately understandable one might be, assuming your myrand(N) function returns a random or pseudorandom int between 0 included and N excluded:
    def pickfromlist(howmany, thelist):
        result = []
        listcopy = list(thelist)
        while listcopy and len(result) < howmany:
            i = myrand(len(listcopy))
            result.append(listcopy.pop(i))
        return result
Definitely not maximally efficient, but, I hope, maximally clear!-) In words: as long as required and feasible, pick one random item out of the remaining ones (the auxiliary list listcopy gives us the "remaining ones" at any step, and gets modified by .pop without altering the input parameter thelist, since it's a shallow copy).

See the Fisher-Yates Shuffle, described also in Knuth's The Art of Computer Programming.
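A sketch of how a partial Fisher-Yates shuffle could draw the 20 names: only the first k positions are shuffled, then returned. The function name draw_winners and the rand_below parameter are made up for illustration; rand_below(n) is assumed to return an int from 0 up to but not including n, and defaults to random.randrange so that a home-made generator can be swapped in.

```python
import random


def draw_winners(names, k, rand_below=random.randrange):
    """Pick k distinct items by shuffling only the first k positions."""
    pool = list(names)        # copy, so the caller's list is untouched
    k = min(k, len(pool))
    for i in range(k):
        # Choose uniformly among the items not yet fixed (positions i..end).
        j = i + rand_below(len(pool) - i)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]


names = [f"name{n:03d}" for n in range(100)]
winners = draw_winners(names, 20)
```

Each step swaps one not-yet-chosen item into place, so the loop runs k times rather than once per name.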

I praise your desire to do this on your own.
Back in the 1950s, random numbers were unavailable to most people without a supercomputer (of the time). The RAND Corporation published a book called A Million Random Digits with 100,000 Normal Deviates which had, literally, just that: random numbers. It was awesome because it enabled laypeople to use high-quality random numbers for research purposes.
Now, back to your question.
I recommend you read the instructions on how to use the book (yes, it comes with instructions) and try to implement that in your Python code. This will not be efficient or elegant, but you will understand the implications of the algorithm you ultimately settle for. I love the part that instructs you to
open the book to an unselected page of
the digit table and blindly choose a
five-digit number; this number with
the first number reduced modulo 2
determines the starting line; the two
digits to the right of the initially
selected five-digit number are reduced
modulo 50 to determine the starting
column in the starting line
It was an art to read that table of numbers!
To be sure, I'm not encouraging you to reinvent the wheel for production code. I'm encouraging you to learn about the art of randomness by implementing a clever, if not very efficient, random number generator.
My work requires that I use high-quality random numbers. On limited occasions I have found the site www.random.org a very good source of both insight and material. From their website:
RANDOM.ORG offers true random numbers
to anyone on the Internet. The
randomness comes from atmospheric
noise, which for many purposes is
better than the pseudo-random number
algorithms typically used in computer
programs. People use RANDOM.ORG for
holding drawings, lotteries and
sweepstakes, to drive games and
gambling sites, for scientific
applications and for art and music.
Now, go and implement your own lottery.

You can use: random.sample
Return a k length list of unique
elements chosen from the population
sequence. Used for random sampling
without replacement.
For a more low-level approach, use random.choice in a loop:
Return a random element from the
non-empty sequence seq.
The pseudo-random generator (PRNG) in Python is pretty good. If you want to go even more low-level, you can implement your own. Start with reading this article. The mathematical name for lottery is "sampling without replacement". Google that for information - here's a good link.
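For illustration, both suggestions applied to the 100-names case from the question. The list of names and the helper sample_with_choice are invented for this sketch.

```python
import random

names = [f"name{n:03d}" for n in range(100)]

# Direct route: 20 distinct names, sampled without replacement.
winners = random.sample(names, 20)


def sample_with_choice(population, k):
    """Lower-level route: repeated random.choice on a shrinking copy."""
    remaining = list(population)
    picked = []
    while remaining and len(picked) < k:
        item = random.choice(remaining)
        remaining.remove(item)  # removes the first equal element
        picked.append(item)
    return picked


manual_winners = sample_with_choice(names, 20)
```

Removing each picked name from the working copy is what turns repeated choice into sampling without replacement.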

The main shortcoming of software-based methods of generating lottery numbers is the fact that all random numbers generated by software are pseudo-random.
This may not be a problem for your simple application, but you did ask about a 'specific mathematical philosophy'. You will have noticed that all commercial lottery systems use physical methods: balls with numbers.
And behind the scenes, the numbers generated by physical lottery systems will be carefully scrutinised for indications of non-randomness and steps taken to eliminate it.
As I say, this may not be a consideration for your simple application, but the overriding requirement of a true lottery (the 'specific mathematical philosophy') should be mathematically demonstrable randomness.
