Simple enough question:
I'm using Python's random module to generate random integers. What is the suggested value to pass to random.seed()? Currently I am letting it default to the current time, but this is not ideal. It seems like a string literal constant (similar to a password) would also not be ideal/strong.
Suggestions?
Thanks,
-aj
UPDATE:
The reason I am generating random integers is for generation of test data. The numbers do not need to be reproducible.
According to the documentation for random.seed:
If x is omitted or None, current system time is used; current system time is also used to initialize the generator when the module is first imported. If randomness sources are provided by the operating system, they are used instead of the system time (see the os.urandom() function for details on availability).
If you don't pass something to seed, it will try to use operating-system provided randomness sources instead of the time, which is always a better bet. This saves you a bit of work, and is about as good as it's going to get. Regarding availability, the docs for os.urandom tell us:
On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom.
Cross-platform random seeds are the big win here; you can safely omit a seed and trust that it will be random enough on almost every platform you'll use Python on. Even if Python falls back to the time, there's probably only a millisecond window (or less) to guess the seed. I don't think you'll run into any trouble using the current time anyway -- even then, it's only a fallback.
For most cases using current time is good enough. Occasionally you need to use a fixed number to generate pseudo random numbers for comparison purposes.
Setting the seed is for repeatability, not security. If anything, you make the system less secure by having a fixed seed than one that is constantly changing.
Perhaps it is not a problem in your case, but one problem with using the system time as the seed is that someone who knows roughly when your system was started may be able to guess your seed (by trial) after seeing a few numbers from the sequence.
eg, don't use system time as the seed for your online poker game
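To make that concrete, here is a rough sketch of the guess-by-trial attack. It assumes the victim seeded with a whole-second timestamp and that the attacker knows the start time to within an hour; recover_seed, the timestamps, and the "card" range are all made up for illustration.

```python
import random

def recover_seed(observed, approx_time, window=3600):
    """Try every whole-second seed near approx_time until the output matches."""
    for candidate in range(approx_time - window, approx_time + window + 1):
        rng = random.Random(candidate)
        if [rng.randrange(1, 53) for _ in observed] == observed:
            return candidate
    return None

# The "victim" seeds with a timestamp the attacker only knows roughly.
secret_seed = 1_700_000_123          # hypothetical start time, in seconds
victim = random.Random(secret_seed)
observed = [victim.randrange(1, 53) for _ in range(10)]   # ten leaked draws

guess = recover_seed(observed, approx_time=1_700_000_000)
print("recovered seed:", guess)
```

A two-hour window is only about 7,200 candidates, so the search finishes almost instantly.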
If you are using random for generating test data I would like to suggest that reproducibility can be important.
Just think of a use case: for data set X you get some weird behaviour (e.g. a crash). It turns out that data set X has some feature that was not so apparent in the other data sets Y and Z, and it uncovers a bug which had escaped your test suites. Now knowing the seed is useful, so that you can reproduce the bug precisely and fix it.
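One way to get both fresh data on every run and reproducibility is to pick a random seed, log it, and build the data from it. This is only a sketch; make_test_data and the value ranges are hypothetical.

```python
import random
import secrets

def make_test_data(seed, count=5):
    """Build a small data set from a private generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 1000) for _ in range(count)]

# Pick a fresh seed for every run, but record it so a failure can be replayed.
seed = secrets.randbits(64)
print("test data seed:", seed)

data = make_test_data(seed)
replayed = make_test_data(seed)   # same seed, same data set
```

If a run crashes, the logged seed is all you need to regenerate the exact same data set.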
From the docs:
random.seed(a=None, version=2) Initialize the random number generator.
If a is omitted or None, the current system time is used. If
randomness sources are provided by the operating system, they are used
instead of the system time (see the os.urandom() function for details
on availability).
But...if it's truly random...(and I thought I read it uses Mersenne, so it's VERY random)...what's the point in seeding it? Either way the outcome is unpredictable...right?
The default is probably best if you want different random numbers with each run. If for some reason you need repeatable random numbers, in testing for instance, use a seed.
The module actually seeds the generator (with OS-provided random data from urandom if possible, otherwise with the current date and time) when you import the module, so there's no need to manually call seed().
(This is mentioned in the Python 2.7 documentation but, for some reason, not the 3.x documentation. However, I confirmed in the 3.x source that it's still done.)
If the automatic seeding weren't done, you'd get the same sequence of numbers every time you started your program, same as if you manually use the same seed every time.
But...if it's truly random
No, it's pseudo random. If it uses Mersenne Twister, that too is a PRNG.
It's basically an algorithm that generates the exact same sequence of pseudo random numbers out of a given seed. Generating truly random numbers requires special hardware support, it's not something you can do by a pure algorithm.
You might not need to seed it since it seeds itself on first use, unless you have some other or better means of providing a seed than what is time based.
If you use the random numbers for things that are not security related, a time based seed is normally fine. If you use it for security/cryptography, note what the docs say: "and is completely unsuitable for cryptographic purposes"
If you want to reproduce your results, you seed the generator with a known value so you get the same sequence every time.
The Mersenne Twister, the random number generator used by Python, is seeded by default from operating-system-provided random numbers on those platforms where that is possible (Unixen, Windows); on other platforms the seed defaults to the system time, which means very repeatable values if the system clock has poor precision. On such systems, seeding with known better random values is thus beneficial. Note that on Python 3 specifically, if version 2 is used, you can pass in any str, bytes, or bytearray to seed the generator, thus making better use of the Mersenne Twister's large state.
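As a small illustration of seeding with bytes or a string on Python 3 (a sketch only; the 32-byte size and the "experiment-1" label are arbitrary choices, not anything the docs prescribe):

```python
import os
import random

# Seed a private generator with 256 bits straight from the OS.
rng = random.Random()
rng.seed(os.urandom(32), version=2)
value = rng.random()

# A str seed works too, and gives the same sequence on every run.
a = random.Random("experiment-1")
b = random.Random("experiment-1")
same = [a.random() for _ in range(3)] == [b.random() for _ in range(3)]
```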
Another reason to use a seed value is indeed to guarantee that you get the same sequence of random numbers again and again - by reusing the known seed. Quoting the docs:
Sometimes it is useful to be able to reproduce the sequences given by
a pseudo random number generator. By re-using a seed value, the same
sequence should be reproducible from run to run as long as multiple
threads are not running.
Most of the random module’s algorithms and seeding functions are
subject to change across Python versions, but two aspects are
guaranteed not to change:
If a new seeding method is added, then a backward compatible seeder will be offered.
The generator’s random() method will continue to produce the same sequence when the compatible seeder is given the same seed.
For this however, you mostly want to use the random.Random instances instead of using module global methods (the multiple threads issue, etc).
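A minimal sketch of why separate instances help: each random.Random object carries its own state, so draws made elsewhere cannot shift its sequence. The names and seeds below are made up for the example.

```python
import random

users_rng = random.Random(1234)    # hypothetical per-purpose generators
orders_rng = random.Random(5678)

interleaved = []
for _ in range(5):
    interleaved.append(users_rng.randint(1, 100))
    orders_rng.random()   # draws from another instance...
    random.random()       # ...or from the module-level generator

# Replaying the same seed on its own yields the same numbers.
replay = random.Random(1234)
alone = [replay.randint(1, 100) for _ in range(5)]
```

With the module-level functions, any other code (or thread) calling random.random() would advance the shared state and change what you get.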
Finally also note that the random numbers produced by Mersenne twister are unsuitable for cryptographical use; whereas they appear very random, it is possible to fully recover the internal state of the random generator by observing only some hundreds of values from the generator. For cryptographical algorithms, you want to use the SystemRandom class.
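For completeness, a short sketch of the OS-backed alternatives, assuming Python 3.6+ for the secrets module. The six-digit PIN and 16-byte token sizes are just examples.

```python
import random
import secrets

# SystemRandom draws from the OS (os.urandom); it has no seed to recover.
sys_rng = random.SystemRandom()
pin = sys_rng.randrange(10**6)      # a number from 0 to 999999
pin_text = f"{pin:06d}"

# The secrets module wraps the same source for common security uses.
token = secrets.token_hex(16)       # 16 random bytes as 32 hex characters
```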
In most cases I would say there is no need to care about this. But if someone is really determined to do something weird and can roughly figure out the system time when your code was running, they might be able to brute-force replay your random numbers and see which series fits. But I would say this is quite unlikely in most cases.
I was wondering if there was any way to generate a random number, from 1 to 9, without using external libraries, even if they are included with Python. This is a dumb reason, but my editor doesn't allow any libraries, so I need a way to get randomness without libraries.
You need something to start with. Random numbers can be spawned from the last few digits of the milliseconds value of the system's timestamp. Then you can manipulate them a little and ta-da: a different random number every time.
You can implement a random number generator in plain Python, but they all need a seed. The reason Xorshift always returned the same sequence to @yuwe is that it's always getting the same seed. Same seed => same sequence of pseudo-random numbers.
Getting a suitable seed is not possible without resorting to external entropy sources, be it the current time in microseconds, the current process ID, the number of bytes sent over the network since the last reboot, mouse movements, what have you.
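For illustration, here is an import-free sketch of a 32-bit xorshift generator (the 13/17/5 shift variant) mapped onto 1..9. The helper names are made up. Using id(object()) as a seed is only my assumption of a weak, best-effort stand-in when nothing else is available; it is a memory address in CPython, not a real entropy source.

```python
def xorshift32(state):
    """Advance a non-zero 32-bit xorshift state by one step."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state

def digits_1_to_9(seed, count):
    """Return `count` pseudo-random integers in 1..9 for the given seed."""
    state = (seed & 0xFFFFFFFF) or 1   # the state must never be zero
    out = []
    for _ in range(count):
        state = xorshift32(state)
        out.append(state % 9 + 1)      # slight modulo bias, fine for a toy
    return out

# Fixed seed: the same digits every time.
fixed = digits_1_to_9(2463534242, 5)

# Weak seed that usually differs between runs (assumption, see above).
varying = digits_1_to_9(id(object()), 5)
```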
import string,random,platform,os,sys

def rPass():
    sent = os.urandom(random.randrange(900,7899))
    print sent,"\n"
    intsent=0
    for i in sent:
        intsent += ord(i)
    print intsent
    intset=0

rPass()
I need help figuring out total possible outputs for the bytecode section of this algorithm. Don't worry about the for loop and the ord stuff that's for down the line. -newbie crypto guy out.
I won't worry about the loop and the ord stuff, so let's just throw that out and look at the rest.
Also, I don't understand "I need help figuring out total possible outputs for the unicode section of this algorithm", because there is no Unicode section of the algorithm, or in fact any Unicode anything anywhere in your code. But I can help you figure out the total possible outputs of the whole thing. Which we'll do by simplifying it step by step.
First:
li=[]
for a in range(900,7899):
    li.append(a)
This is exactly equivalent to:
li = range(900, 7899)
Meanwhile:
li[random.randint(0,7000)]
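Because li is exactly 6999 elements long, this is very nearly the same as random.choice(li). (Strictly, randint(0,7000) can also return 6999 or 7000, which are past the end of li and would raise IndexError; the valid indexes are 0 through 6998.)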
And, putting the last two together, this means it's equivalent to:
random.choice(range(900,7899))
… which is equivalent to:
random.randrange(900,7899)
But wait, what about that random.shuffle(li, random.random)? Well (ignoring the fact that random.random is already the default for the second parameter), the choice is already random-but-not-cryptographically-so, and adding another shuffle doesn't change that. If someone is trying to mathematically predict your RNG, adding one more trivial shuffle with the same RNG will not make it any harder to predict (while adding a whole lot more work based on the results may make a timing attack easier).
In fact, even if you used a subset of li instead of the whole thing, there's no way that could make your code more unpredictable. You'd have a smaller range of values to brute-force through, for no benefit.
So, your whole thing reduces to this:
sent = os.urandom(random.randrange(900, 7899))
The possible output is: any byte string between 900 and 7898 bytes long (randrange excludes its upper bound).
The length is random, and roughly evenly distributed, but it's not random in a cryptographically-unpredictable sense. Fortunately, that's not likely to matter, because presumably the attacker can see how many bytes he's dealing with instead of having to predict it.
The content is random, both evenly distributed and cryptographically unpredictable, at least to the extent that your system's urandom is.
And that's all there is to say about it.
However, the fact that you've made it much harder to read, write, maintain, and think through gives you a major disadvantage, with no compensating disadvantage to your attacker.
So, just use the one-liner.
I think in your followup questions, you're asking how many possible values there are for 900-7898 bytes of random data.
Well, how many values are there for 900 bytes? 256**900. How many for 901? 256**901. So, the answer is:
sum(256**i for i in range(900, 7899))
… which is about 2**63184, or 10**19020.
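If you want to check those magnitudes yourself, Python's big integers make it easy. This is just a verification sketch; it avoids printing the number itself, which is some 19,000 digits long.

```python
import math

# Count every possible byte string of length 900 through 7898.
total = sum(256**i for i in range(900, 7899))

bits = total.bit_length()                  # one more than the power of two
decimal_exponent = int(math.log10(total))  # power of ten

print(bits - 1, decimal_exponent)
```

The total is dominated by the longest length: 256**7898 is exactly 2**63184, and all the shorter lengths together add less than half a percent.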
So, 63184 bits of security sounds pretty impressive, right? Probably not. If your algorithm has no flaws in it, 100 bits is more than you could ever need. If your algorithm is flawed (and of course it is, because they all are), blindly throwing thousands more bits at it won't help.
Also, remember, the whole point of crypto is that you want cracking to be 2**N slower than legitimate decryption, for some large N. So, making legitimate decryption much slower makes your scheme much worse. This is why every real-life working crypto scheme uses a few hundred bits of key, salt, etc. (Yes, public-key encryption uses a few thousand bits for its keys, but that's because its keys aren't randomly distributed. And generally, all you do with those keys is to encrypt a randomly-generated session/document key of a few hundred bits.)
One last thing: I know you said to ignore the ord, but…
First you can write that whole part as intsent=sum(bytearray(sent)).
But, more importantly, if all you're doing with this buffer is summing it up, you're using a lot of entropy to generate a single number with a lot less entropy. (This should be obvious once you think about it. If you have two separate bytes, there are 65536 possibilities; if you add them together, there are only 511, the sums 0 through 510.)
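A quick count makes the two-byte case concrete. Enumerating every pair shows 511 distinct sums, 0 through 510, and the last lines show the sum(bytearray(...)) shortcut on a real buffer.

```python
import os

# Every ordered pair of byte values versus the distinct sums they produce.
pairs = 256 * 256
distinct_sums = len({a + b for a in range(256) for b in range(256)})
print(pairs, distinct_sums)

# Summing a buffer collapses it the same way.
sent = os.urandom(900)
intsent = sum(bytearray(sent))    # somewhere between 0 and 900 * 255
```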
Also, by generating a few thousand one-byte random numbers and adding them up, that's basically a very close approximation of a normal or gaussian distribution. (If you're a D&D player, think of how 3D6 gives 10 and 11 more often than 3 and 18… and how that's more true for 3D6 than for 2D6… and then consider 6000D6.) But then, by making the number of bytes range from 900 to 7898, you're flattening it back toward a uniform distribution from 900*127.5 to 7898*127.5. At any rate, if you can describe the distribution you're trying to get, you can probably generate that directly, without wasting all this urandom entropy and computation.
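You can see the clustering with a small experiment. The sketch below sums 6000 urandom bytes, 200 times over (both numbers are arbitrary picks), and looks at where the sums land.

```python
import os
import statistics

def byte_sum(n):
    """Sum n bytes from the OS random source (each byte is 0..255)."""
    return sum(os.urandom(n))

samples = [byte_sum(6000) for _ in range(200)]
mean = statistics.mean(samples)
spread = statistics.pstdev(samples)

# The possible range is 0 to 1,530,000, yet the sums hug 6000 * 127.5.
print(round(mean), round(spread))
```

The standard deviation comes out at a few thousand, a tiny slice of the possible range. That is the 6000D6 effect.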
It's worth noting that there are very few cryptographic applications that can possibly make use of this much entropy. Even things like generating SSL certs use on the order of 128-1024 bits, not 64K bits.
You say:
trying to kill the password.
If you're trying to encrypt a password so it can be, say, stored on disk or sent over the network, this is almost always the wrong approach. You want to use some kind of zero-knowledge proof—store hashes of the password, or use challenge-response instead of sending data, etc. If you want to build a "keep me logged in feature", do that by actually keeping the user logged in (create and store a session auth token, rather than storing the password). See the Wikipedia article password for the basics.
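As a rough sketch of the store-a-hash-and-a-session-token approach, using only the standard library: the iteration count, salt size, and function names here are my assumptions for illustration, not a vetted configuration.

```python
import hashlib
import hmac
import os
import secrets

ITERATIONS = 200_000   # assumed work factor; tune it for your hardware

def hash_password(password, salt=None):
    """Return (salt, digest) using salted PBKDF2-HMAC-SHA256."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, expected):
    """Re-hash the attempt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

# Store only the salt and digest, never the password itself.
salt, stored = hash_password("correct horse")

# "Keep me logged in": hand out a random session token instead.
session_token = secrets.token_urlsafe(32)
```

Note that the salt is 128 bits and the token 256 bits, nowhere near thousands of bytes.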
Occasionally, you do need to encrypt and store passwords. For example, maybe you're building a "password locker" program for a user to store a bunch of passwords in. Or a client to a badly-designed server (or a protocol designed in the 70s). Or whatever. If you need to do this, you want one layer of encryption with a relatively small key (remember that a typical password is itself only about 256 bits long, and has less than 64 bits of actual information, so there is absolutely no benefit from using a key thousands of times as long as it is). The only way to make it more secure is to use a better algorithm, but really, the encryption algorithm will almost never be the best attack surface (unless you've tried to design one yourself); put your effort into the weakest areas of the infrastructure, not the strongest.
You ask:
Also is urandom's output codependent on the assembler it's working with?
Well… there is no assembler it's working with, and I can't think of anything else you could be referring to that makes any sense.
All that urandom is dependent on is your OS's entropy pool and PRNG. As the docs say, urandom just reads /dev/urandom (Unix) or calls CryptGenRandom (Windows).
If you want to know exactly how that works on your system, man urandom or look up CryptGenRandom in MSDN. But all of the major OS's can generate enough entropy and mix it well enough that you basically don't have to worry about this at all. Under the covers, they all effectively have some pool of entropy, and some cryptographically-secure PRNG to "stretch" that pool, and some kernel device (linux, Windows) or user-space daemon (OS X) that gathers whatever entropy it can get from unpredictable things like user actions to mix it into the pool.
So, what is that dependent on? Assuming you don't have any apps wasting huge amounts of entropy, and your machine hasn't been compromised, and your OS doesn't have a major security flaw… it's basically not dependent on anything. Or, to put it another way, it's dependent on those three assumptions.
To quote the linux man page, /dev/urandom is good enough for "everything except long-lived GPG/SSL/SSH keys". (And on many systems, if someone tries to run a program that, like your code, reads thousands of bytes of urandom, or tries to kill the entropy-seeding daemon, or whatever, it'll be logged, and hopefully the user/sysadmin can deal with it.)
hmmmm python goes through an interpreter of its own so i'm not sure how that plays in
It doesn't. Obviously, calling urandom(8) does a bunch of extra work before and after the syscall that reads 8 bytes from /dev/urandom, compared with what you'd do in, say, a C program… but the actual syscall is identical. So the urandom device can't even tell the difference between the two.
but I'm simply asking if urandom will produce different results on a different architecture.
Well, yes, obviously. For example, Linux and OS X use entirely different CSPRNGs and different ways of accumulating entropy. But the whole point is that it's supposed to be different, even on an identical machine, or at a different time on the same machine. As long as it produces "good enough" results on every platform, that's all that matters.
For instance would a processor\assembler\interpreter cause a fingerprint specific to said architecture, which is within reason stochastically predictable?
As mentioned above, the interpreter ultimately makes the same syscall as compiled code would.
As for an assembler… there probably isn't any assembler involved anywhere. The relevant parts of the Python interpreter, the random device, the entropy-gathering service or driver, etc. are most likely written in C. And even if they were hand-coded in assembly, the whole point of coding in assembly is that you pretty much directly control the machine code that gets generated, so different assemblers wouldn't make any difference.
The processor might leave a "fingerprint" in some sense. For example, I'll bet that if you knew the RNG algorithm, and controlled its state directly, you could write code that could distinguish an x86 vs. an x86_64, or maybe even one generation of i7 vs. another, based on timing. But I'm not sure what good that would do you. The algorithm will still generate the same results from the same state. And the actual attacks used against RNGs are about attacking the algorithm, the entropy accumulator, and/or the entropy estimator.
At any rate, I'm willing to bet large sums of money that you're safer relying on urandom than on anything you come up with yourself. If you need something better (and you don't), implement—or, better, find a well-tested implementation of—Fortuna or BBS, or buy a hardware entropy-generating device.
I need to generate a controlled sequence of pseudo-random numbers, given an initial parameter. For that I'm using the standard python random generator, seeded by this parameter. I'd like to make sure that I will generate the same sequence across systems (Operating system, but also Python version).
In summary: Does Python ensure the reproducibility / portability of its pseudo-random number generator across implementations and versions?
No, it doesn't. There's no such promise in the random module's documentation.
What the docs do contain is this remark:
Changed in version 2.3: MersenneTwister replaced Wichmann-Hill as the default generator
So a different RNG was used prior to Python 2.3.
So far, I've been using numpy.random.RandomState for reproducible pseudo-randomness, though it too does not make the formal promise you're after.
If you want full reproducibility, you might want to include a copy of random's source in your program, or hack together a "P²RNG" (pseudo-pseudo-RNG) from hashlib.
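To give an idea of what such a hashlib-based generator could look like, here is one possible sketch: hash the seed together with a counter and turn each digest into a float. The class name and construction are my own illustration, not an established recipe, and it makes no claim to statistical or cryptographic quality.

```python
import hashlib

class HashRNG:
    """Toy reproducible generator: SHA-256 over 'seed:counter'."""

    def __init__(self, seed):
        self._key = str(seed).encode("utf-8")
        self._counter = 0

    def random(self):
        """Return a float in [0.0, 1.0) built from 53 bits of the digest."""
        data = self._key + b":" + str(self._counter).encode("ascii")
        block = hashlib.sha256(data).digest()
        self._counter += 1
        return (int.from_bytes(block[:8], "big") >> 11) / (1 << 53)

    def randrange(self, start, stop):
        """Return an int in [start, stop); meant for small ranges only."""
        return start + int(self.random() * (stop - start))

first = HashRNG(317)
second = HashRNG(317)
run_a = [first.randrange(1, 10) for _ in range(5)]
run_b = [second.randrange(1, 10) for _ in range(5)]
```

Because the output depends only on SHA-256 and simple integer arithmetic, it should not change between Python versions or platforms the way random's helper functions have.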
Not necessarily.
As described in the documentation, the random module has used the Mersenne twister to generate random numbers since version 2.3, but used Wichmann-Hill before that.
(If a seed is not provided, the method of obtaining the seed also does depend on the operating system, the Python version, and factors such as the system time).
@reubano - 3.2 changed the integer functions in random, to produce more evenly distributed (which inevitably means different) output.
That change was discussed in Issue9025, where the team discuss whether they have an obligation to stick to the previous output, even when it was defective. They conclude that they do not. The docs for the module guarantee consistency for random.random() - one might assume that the functions which call it (like random.randrange()) are implicitly covered under that guarantee, but that doesn't seem to be the case.
Just as a heads up: in addition to the 2.3 change, Python 3 gives different numbers from Python 2.x for randrange and probably other functions, even if the numbers from random.random are similar.
I just found out that there is also a difference between python3.7 and python3.8.
The following code behaves the same on both versions:
from random import Random
seed = 317
rand = Random(seed)
rand.getrandbits(64)
but if you use from _random import Random instead, it behaves differently.