One-time pad with pad seeded via 'passphrase' - Python

Looking for a theoretical discussion here. I personally would (and will continue to) use GPG or just SCP for simply getting a file somewhere where only I can decrypt it or only I can download it. Still a discussion of where the following falls short (and by how much) would help my curiosity.
Suppose I want to encrypt a file locally, put it on the internet, and be able to grab it later. I want to make sure that only people with a certain password/phrase can decrypt the file ... and I insist on incorporating a one-time pad.
Assuming it's only used to encrypt a message once, if one were to use a very random passphrase (e.g. Diceware) to seed the pad in a reproducible way, would this be a problem? In python, I would do something like random.seed("hurt coaster lemon swab lincoln") and then generate my pad. I would use the same seed for encryption and decryption.
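Concretely, the scheme I have in mind looks something like this (message stands for my file's bytes; I'm describing the idea, not claiming it's safe):

import random

random.seed("hurt coaster lemon swab lincoln")
pad = bytes(random.randrange(256) for _ in range(len(message)))
ciphertext = bytes(m ^ p for m, p in zip(message, pad))
# Decryption: re-seed with the same passphrase, regenerate the pad, XOR again.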
There are warnings all over the place about how this Mersenne Twister RNG is not suitable for security/cryptography purposes. I see that it has a very long period, and IIUC, that random.seed allows me to choose 16 bytes worth of different seeds (Python: where is random.random() seeded?).
I've heard that the numbers in an OTP should be "truly random", but even if somebody saw, say, the 1st 100 characters of my pad, how much would that help them in determining what the seed of my RNG was (in hopes of decoding the rest)? I suppose they could brute force the seed by generating pads from every possible random seed and seeing which ones match my first 100 random letters. Still, there are quite a few random seeds to try, right?
So, how dangerous is this? And is there a reasonable way to figure out the seed of a sequence generated by common RNGs by peeking at a little bit of the sequence?

A one-time pad's key is truly-random data of the same size as the plaintext, by definition. If you're producing it some other way (e.g. by seeding a PRNG), it isn't a one-time pad, and it doesn't have the one-time pad's unbreakability property.
One-time pads are actually a special type of stream cipher. There are other stream ciphers too, and yes, they can be quite secure if used properly. But stream ciphers can also be completely insecure if used improperly, and your idea of making up your own cipher based on a non-cryptographic PRNG is improper usage from the start.
One-time pads are used when the key must be impossible to brute-force even if the attacker has unlimited computing power. Based on your description, you're just looking for something that's infeasible to brute-force by any realistic attacker, and that's what any other decent cipher will give you. And unless you're protecting nuclear launch codes or something, that's all you need.
Forget the faux-OTP and Mersenne Twister idea and just use something like AES, with something like bcrypt or scrypt to derive the key from your passphrase.
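For example, here is a minimal sketch using recent versions of the third-party cryptography package; the scrypt parameters and the salt+nonce layout are illustrative choices, not a vetted recipe:

import os
from cryptography.hazmat.primitives.kdf.scrypt import Scrypt
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_file(passphrase: bytes, plaintext: bytes) -> bytes:
    salt = os.urandom(16)   # random salt, stored alongside the ciphertext
    key = Scrypt(salt=salt, length=32, n=2**14, r=8, p=1).derive(passphrase)
    nonce = os.urandom(12)  # AES-GCM nonce, also stored
    ct = AESGCM(key).encrypt(nonce, plaintext, None)
    return salt + nonce + ct

def decrypt_file(passphrase: bytes, blob: bytes) -> bytes:
    salt, nonce, ct = blob[:16], blob[16:28], blob[28:]
    key = Scrypt(salt=salt, length=32, n=2**14, r=8, p=1).derive(passphrase)
    return AESGCM(key).decrypt(nonce, ct, None)  # raises if tampered with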
Regarding your specific question about determining the RNG's sequence: the Mersenne Twister's internal state can be determined by observing 2496 bytes (624 32-bit words) of its output. And in a stream cipher, it's easy to determine the keystream given the plaintext and ciphertext. This means that if an attacker has your ciphertext and can determine the first 2496 bytes of your plaintext, he knows the RNG state and can use it to produce the rest of the keystream and decrypt the whole message.
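To spell out the keystream-recovery step (known_plaintext and ciphertext here are placeholders for whatever the attacker has):

# XOR stream cipher: ciphertext = plaintext XOR keystream, so any known
# plaintext immediately reveals that stretch of the keystream.
keystream = bytes(p ^ c for p, c in zip(known_plaintext, ciphertext))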
2496 bytes is not feasible to brute-force, but a sophisticated attacker may be able to significantly narrow down the possibilities using intelligent guessing about the content of your plaintext, such as what you might have written about, or what file formats the data is likely to be in and the known structure of those file formats. This is known as cribbing, and can provide enough of a starting point that the remaining brute-force attack becomes feasible.
Even better is if the attacker can trick you into incorporating some specific content into your plaintext. Then he doesn't even have to guess.

Related

Is there a generator version of `sample` in Python?

NetLogo argues that one of its important features is that it activates agents from an agentset in pseudo-random order. If one wanted to do something similar in Python one might do the following.
from random import sample
for agent in sample(agentset, len(agentset)):
    <do something with agent>
I believe that would work fine. The problem is that sample returns a list. If agentset is large, one is essentially duplicating it. (I don't want to use shuffle or pop since these modify the original agentset.)
Ideally, I would like a version of sample that acts as a generator and yields values when requested. Is there such a function? If not, any thoughts about how to write one--without either modifying the original set or duplicating it?
Thanks.
The algorithms underlying sample require memory proportional to the size of the sample. (One algorithm is rejection sampling, and the other is a partial shuffle.) Neither can do what you're looking for.
What you're looking for requires different techniques, such as format-preserving encryption. A format-preserving cipher is essentially a keyed bijection from [0, n) to [0, n) (or equivalently, from any finite set to itself). Using format-preserving encryption, your generator would look like (pseudocode)
def shuffle_generator(sequence):
    key = get_random_key()
    cipher = FormatPreservingCipher(key, len(sequence))
    for i in range(len(sequence)):
        yield sequence[cipher.encrypt(i)]
This would be a lot slower than a traditional shuffle, but it would achieve your stated goal.
I am not aware of any good format-preserving encryption libraries for Python. (pyffx exists, but testing shows that either it's terrible, or it has severe limitations that aren't clearly documented. Either way, it doesn't seem to be usable for this.) Your best option may be to wrap a library implemented in a different language.
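If you just want to experiment, a toy stand-in for such a cipher can be hand-rolled from a small Feistel network plus cycle-walking; the names mirror the pseudocode above, but treat this as an unvetted sketch of the idea, not production crypto:

import hashlib
import os

class FeistelPermutation:
    """Toy keyed bijection on range(n): balanced Feistel + cycle-walking."""

    def __init__(self, key, n):
        self.key = key
        self.n = n
        # Split the index into two halves wide enough to cover [0, n).
        self.half = ((n - 1).bit_length() + 1) // 2 or 1
        self.mask = (1 << self.half) - 1

    def _round(self, r, value):
        # Keyed round function; assumes the half-width fits in 8 bytes.
        data = self.key + bytes([r]) + value.to_bytes(8, 'big')
        return int.from_bytes(hashlib.sha256(data).digest()[:8], 'big') & self.mask

    def _permute(self, x):
        left, right = x >> self.half, x & self.mask
        for r in range(4):
            left, right = right, left ^ self._round(r, right)
        return (left << self.half) | right

    def encrypt(self, i):
        j = self._permute(i)
        while j >= self.n:          # cycle-walk back into [0, n)
            j = self._permute(j)
        return j

def shuffle_generator(sequence):
    cipher = FeistelPermutation(os.urandom(16), len(sequence))
    for i in range(len(sequence)):
        yield sequence[cipher.encrypt(i)]

Each yielded index costs a few SHA-256 calls instead of a pointer swap; that is the price paid for O(1) memory.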

Algorithm to generate a 12-byte hash from Web URLs

I am crawling some websites for special items and storing them in MongoDB server. To avoid duplicate items, I am using the hash value of the item link. Here is my code to generate the hash from the link:
import hashlib
from bson.objectid import ObjectId

def gen_objectid(link):
    """Generates objectid from given link"""
    return ObjectId(hashlib.shake_128(str(link).encode('utf-8')).digest(12))
I have no idea how the shake_128 algorithm works. That is where my question comes in.
Is it okay to use this method? Can I safely assume that the probability of a collision is negligible?
What is the better way to do this?
shake_128 is one of the SHA-3 family of algorithms, chosen as the result of a contest to be the next generation of secure hash algorithms. They are not widely used, since SHA-2 is still considered good enough in most cases. Since these algorithms are designed for cryptographically secure hashing, this should be overkill for what you are doing. Also, shake_128 is an extendable-output function (XOF): the 128 in the name refers to its security strength, not its output size, so digest(12) really does give you 12 bytes. That's 2^96 = 7.9e28 different hashes, and by the birthday bound you'd need on the order of 2^48 = 2.8e14 links before a collision becomes likely. I think you will be just fine. If anything, I would say you could use a faster hashing algorithm since you don't need cryptographic security in this case.
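If you want to sanity-check that, the standard birthday-bound approximation is easy to compute (the 96 bits corresponds to digest(12)):

import math

def collision_probability(num_items, hash_bits=96):
    # Birthday approximation: p is about 1 - exp(-k**2 / 2**(bits + 1));
    # expm1 keeps precision for tiny probabilities.
    return -math.expm1(-num_items**2 / 2**(hash_bits + 1))

print(collision_probability(10**9))   # about 6e-12 for a billion links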

Crypto analysis of my algo

import string, random, platform, os, sys

def rPass():
    sent = os.urandom(random.randrange(900, 7899))
    print sent, "\n"
    intsent = 0
    for i in sent:
        intsent += ord(i)
    print intsent
    intset = 0

rPass()
I need help figuring out the total possible outputs for the bytecode section of this algorithm. Don't worry about the for loop and the ord stuff; that's for down the line. -newbie crypto guy out.
I won't worry about the loop and the ord stuff, so let's just throw that out and look at the rest.
Also, I don't understand "I need help figuring out the total possible outputs for the bytecode section of this algorithm", because there is no bytecode section of the algorithm; nothing in your code involves Python bytecode at all. But I can help you figure out the total possible outputs of the whole thing, which we'll do by simplifying it step by step.
First:
li = []
for a in range(900, 7899):
    li.append(a)
This is exactly equivalent to:
li = range(900, 7899)
Meanwhile:
li[random.randint(0,7000)]
Because li happens to be exactly 6999 elements long, this is essentially the same as random.choice(li). (Strictly speaking, randint(0, 7000) can also produce the out-of-range indices 6999 and 7000 and raise IndexError; randint(0, len(li) - 1) would be exact.)
And, putting the last two together, this means it's equivalent to:
random.choice(range(900,7899))
… which is equivalent to:
random.randrange(900,7899)
But wait, what about that random.shuffle(li, random.random)? Well (ignoring the fact that random.random is already the default for the second parameter), the choice is already random-but-not-cryptographically-so, and adding another shuffle doesn't change that. If someone is trying to mathematically predict your RNG, adding one more trivial shuffle with the same RNG will not make it any harder to predict (while adding a whole lot more work based on the results may make a timing attack easier).
In fact, even if you used a subset of li instead of the whole thing, there's no way that could make your code more unpredictable. You'd have a smaller range of values to brute-force through, for no benefit.
So, your whole thing reduces to this:
sent = os.urandom(random.randrange(900, 7899))
The possible output is: any byte string between 900 and 7898 bytes long.
The length is random, and roughly evenly distributed, but it's not random in a cryptographically-unpredictable sense. Fortunately, that's not likely to matter, because presumably the attacker can see how many bytes he's dealing with instead of having to predict it.
The content is random, both evenly distributed and cryptographically unpredictable, at least to the extent that your system's urandom is.
And that's all there is to say about it.
However, the fact that you've made it much harder to read, write, maintain, and think through gives you a major disadvantage, with no compensating disadvantage to your attacker.
So, just use the one-liner.
I think in your followup questions, you're asking how many possible values there are for 900-7898 bytes of random data.
Well, how many values are there for 900 bytes? 256**900. How many for 901? 256**901. So, the answer is:
sum(256**i for i in range(900, 7899))
… which is about 2**63184, or 10**19020.
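You can verify those figures with Python's arbitrary-precision integers:

import math

total = sum(256**i for i in range(900, 7899))
print(total.bit_length())                              # 63185, i.e. about 2**63184
print(math.floor(total.bit_length() * math.log10(2)))  # about 10**19020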
So, 63184 bits of security sounds pretty impressive, right? Probably not. If your algorithm has no flaws in it, 100 bits is more than you could ever need. If your algorithm is flawed (and of course it is, because they all are), blindly throwing thousands more bits at it won't help.
Also, remember, the whole point of crypto is that you want cracking to be 2**N slower than legitimate decryption, for some large N. So, making legitimate decryption much slower makes your scheme much worse. This is why every real-life working crypto scheme uses a few hundred bits of key, salt, etc. (Yes, public-key encryption uses a few thousand bits for its keys, but that's because its keys aren't randomly distributed. And generally, all you do with those keys is encrypt a randomly-generated session/document key of a few hundred bits.)
One last thing: I know you said to ignore the ord, but…
First, you can write that whole part as intsent = sum(bytearray(sent)).
But, more importantly, if all you're doing with this buffer is summing it up, you're using a lot of entropy to generate a single number with a lot less entropy. (This should be obvious once you think about it. If you have two separate bytes, there are 65536 possibilities; if you add them together, there are only 511.)
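You can check the two-byte case directly:

# 256 * 256 = 65536 equally likely pairs, but only 511 distinct sums.
print(len({a + b for a in range(256) for b in range(256)}))   # 511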
Also, by generating a few thousand one-byte random numbers and adding them up, you get a very close approximation of a normal (gaussian) distribution. (If you're a D&D player, think of how 3d6 gives 10 and 11 more often than 3 and 18, how that's more true for 3d6 than for 2d6, and then consider 6000d6.) But then, by making the number of bytes range from 900 to 7898, you're flattening it back toward a uniform distribution from 900*127.5 to 7898*127.5. At any rate, if you can describe the distribution you're trying to get, you can probably generate it directly, without wasting all this urandom entropy and computation.
It's worth noting that there are very few cryptographic applications that can possibly make use of this much entropy. Even things like generating SSL certs use on the order of 128-1024 bits, not 64K bits.
You say:
trying to kill the password.
If you're trying to encrypt a password so it can be, say, stored on disk or sent over the network, this is almost always the wrong approach. You want some kind of zero-knowledge proof: store hashes of the password, or use challenge-response instead of sending the data, etc. If you want to build a "keep me logged in" feature, do that by actually keeping the user logged in (create and store a session auth token rather than storing the password). See the Wikipedia article on passwords for the basics.
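As a sketch of the challenge-response idea (hypothetical names; stored_key stands for a key derived from the password with a proper KDF, never the raw password):

import hashlib
import hmac
import os

# What the server keeps: a password-derived key (illustrative parameters).
stored_key = hashlib.pbkdf2_hmac('sha256', b'the password', b'per-user salt', 100000)

def make_response(key, challenge):
    # Prove knowledge of the key without ever transmitting it.
    return hmac.new(key, challenge, hashlib.sha256).digest()

challenge = os.urandom(16)                       # server -> client
response = make_response(stored_key, challenge)  # client -> server
# The server recomputes the HMAC with its copy of the key and compares
# in constant time:
assert hmac.compare_digest(response, make_response(stored_key, challenge))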
Occasionally, you do need to encrypt and store passwords. For example, maybe you're building a "password locker" program for a user to store a bunch of passwords in. Or a client to a badly-designed server (or a protocol designed in the 70s). Or whatever. If you need to do this, you want one layer of encryption with a relatively small key (remember that a typical password is itself only about 256 bits long, and has less than 64 bits of actual information, so there is absolutely no benefit from using a key thousands of times that long). The only way to make it more secure is to use a better algorithm, but really, the encryption algorithm will almost never be the best attack surface (unless you've tried to design one yourself); put your effort into the weakest areas of the infrastructure, not the strongest.
You ask:
Also is urandom's output codependent on the assembler it's working with?
Well… there is no assembler it's working with, and I can't think of anything else you could be referring to that makes any sense.
All that urandom is dependent on is your OS's entropy pool and PRNG. As the docs say, urandom just reads /dev/urandom (Unix) or calls CryptGenRandom (Windows).
If you want to know exactly how that works on your system, man urandom or look up CryptGenRandom in MSDN. But all of the major OSes can generate enough entropy and mix it well enough that you basically don't have to worry about this at all. Under the covers, they all effectively have some pool of entropy, some cryptographically-secure PRNG to "stretch" that pool, and some kernel device (Linux, Windows) or user-space daemon (OS X) that gathers whatever entropy it can get from unpredictable things like user actions to mix into the pool.
So, what is that dependent on? Assuming you don't have any apps wasting huge amounts of entropy, and your machine hasn't been compromised, and your OS doesn't have a major security flaw… it's basically not dependent on anything. Or, to put it another way, it's dependent on those three assumptions.
To quote the linux man page, /dev/urandom is good enough for "everything except long-lived GPG/SSL/SSH keys". (And on many systems, if someone tries to run a program that, like your code, reads thousands of bytes of urandom, or tries to kill the entropy-seeding daemon, or whatever, it'll be logged, and hopefully the user/sysadmin can deal with it.)
hmmmm python goes through an interpreter of its own so i'm not sure how that plays in
It doesn't. Obviously, calling urandom(8) from Python does a bunch of extra stuff before and after the syscall that, say, a C program wouldn't do, but the actual syscall to read 8 bytes from /dev/urandom is identical, so the urandom device can't even tell the difference between the two.
but I'm simply asking if urandom will produce different results on a different architecture.
Well, yes, obviously. For example, Linux and OS X use entirely different CSPRNGs and different ways of accumulating entropy. But the whole point is that it's supposed to be different, even on an identical machine, or at a different time on the same machine. As long as it produces "good enough" results on every platform, that's all that matters.
For instance would a processor\assembler\interpreter cause a fingerprint specific to said architecture, which is within reason stochastically predictable?
As mentioned above, the interpreter ultimately makes the same syscall as compiled code would.
As for an assembler… there probably isn't any assembler involved anywhere. The relevant parts of the Python interpreter, the random device, the entropy-gathering service or driver, etc. are most likely written in C. And even if they were hand-coded in assembly, the whole point of coding in assembly is that you pretty much directly control the machine code that gets generated, so different assemblers wouldn't make any difference.
The processor might leave a "fingerprint" in some sense. For example, I'll bet that if you knew the RNG algorithm, and controlled its state directly, you could write code that could distinguish an x86 from an x86_64, or maybe even one generation of i7 from another, based on timing. But I'm not sure what good that would do you. The algorithm will still generate the same results from the same state. And the actual attacks used against RNGs are about attacking the algorithm, the entropy accumulator, and/or the entropy estimator.
At any rate, I'm willing to bet large sums of money that you're safer relying on urandom than on anything you come up with yourself. If you need something better (and you don't), implement—or, better, find a well-tested implementation of—Fortuna or BBS, or buy a hardware entropy-generating device.

Generating crypto-secure strings for OAuth tokens

I want to generate tokens and keys that are random strings. What is the acceptable method to generate them?
Is generating two UUIDs via standard library functions and concatenating them acceptable?
os.urandom provides access to the operating system's random number generator.
EDIT: If you are using Linux and are very concerned about security, you should use /dev/random directly. Reads from it will block until sufficient entropy is available.
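(A later note: on Python 3.6+ the secrets module wraps the same OS CSPRNG and is the idiomatic way to mint such tokens; the 32-byte length below is a sensible default, not a requirement.)

import secrets

token = secrets.token_urlsafe(32)   # 32 random bytes as URL-safe base64 text
api_key = secrets.token_hex(32)     # 32 random bytes as 64 hex characters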
Computers (without special hardware) can only generate pseudo-random data. After a while, every pseudo-random number generator will start to repeat itself. The amount of data it can generate before repeating itself is called the period.
A very popular pseudo-random number generator (also used by Python's random module) is the Mersenne Twister. But it is deemed unsuitable for cryptographic purposes, because it is fairly easy to predict its future output after observing only a relatively small number of outputs.
See the Wikipedia page on cryptographically secure pseudo-random number generators for a list of algorithms that seem suitable.
Operating systems like FreeBSD, OpenBSD and OS X use the Yarrow algorithm for their urandom devices. So on those systems using os.urandom might be OK, because it is well-regarded as being cryptographically secure.
Of course what you need to use depends to a large degree on how high your requirements are; how secure do you want it to be? In general I would advise you to use published and tested implementations of algorithms. Writing your own implementation is too easy to get wrong.
Edit: Computers can gather random data by watching e.g. the times at which interrupts arrive. This however does not supply a large amount of random data, and it is therefore often used to seed a PRNG.

What's more random, hashlib or urandom?

I'm working on a project with a friend where we need to generate a random hash. Before we had time to discuss, we both came up with different approaches and because they are using different modules, I wanted to ask you all what would be better--if there is such a thing.
hashlib.sha1(str(random.random())).hexdigest()
or
os.urandom(16).encode('hex')
Typing this question out has got me thinking that the second method is better. Simple is better than complex. If you agree, how reliable is this for 'randomly' generating hashes? How would I test this?
This solution:
os.urandom(16).encode('hex')
is the best, since it uses the OS to generate randomness, which should be usable for cryptographic purposes (depending on the OS implementation).
random.random() generates pseudo-random values.
Hashing a random value does not add any new randomness.
random.random() is a pseudo-random generator, which means the numbers are generated deterministically from an internal state. If you call random.seed(some_number), then after that the generated sequence will always be the same.
os.urandom() gets its random numbers from the OS's RNG, which uses an entropy pool to collect real randomness, usually from random hardware events; there even exist dedicated hardware entropy generators for systems where a lot of random numbers are needed.
On Unix systems there are traditionally two random number generators: /dev/random and /dev/urandom. Reads from the first block if there is not enough entropy available, whereas /dev/urandom, when there is not enough entropy data available, falls back to a pseudo-RNG and doesn't block.
So which to use depends on what you need: if you need a few evenly distributed random numbers, the built-in PRNG should be sufficient; for cryptographic use it's always better to use real random numbers.
The second solution clearly has more entropy than the first. Assuming the quality of the source of the random bits would be the same for os.urandom and random.random:
In the second solution you are fetching 16 bytes = 128 bits worth of randomness
In the first solution you are fetching a floating point value which has roughly 52 bits of randomness (IEEE 754 double, ignoring subnormal numbers, etc...). Then you hash it around, which, of course, doesn't add any randomness.
More importantly, the quality of the randomness coming from os.urandom is expected and documented to be much better than the randomness coming from random.random. os.urandom's docstring says "suitable for cryptographic use".
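In Python 3 terms (the .encode('hex') spelling above is Python 2; bytes.hex() is the modern equivalent):

import hashlib
import os
import random

strong = os.urandom(16).hex()   # 128 bits straight from the OS CSPRNG

# Hashing a float with ~52 bits of randomness cannot create more than
# ~52 bits of unpredictability, no matter how long the digest looks:
weak = hashlib.sha1(str(random.random()).encode()).hexdigest()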
Testing randomness is notoriously difficult; however, I would choose the second method, but only for cases like this one, where the hash input is itself a random number.
The whole point of hashes is to create a number that is vastly different based on slight differences in input. For your use case, the randomness of the input should do. If, however, you wanted to hash a file and detect one eensy byte's difference, that's when a hash algorithm shines.
I'm just curious, though: why use a hash algorithm at all? It seems that you're looking for a purely random number, and there are lots of libraries that generate UUIDs, which have far stronger guarantees of uniqueness than random number generators.
If you want a unique identifier (UUID), then you should use:
import uuid
uuid.uuid4().hex
https://docs.python.org/3/library/uuid.html
