Most examples on stackoverflow and the web at large put the CBC initialization vector (IV) as a "prefix" to ciphertext, e.g.:
bytesToSend = bytesOfIV + bytesOfCipherText
And then on the other side, a variety of clever/simple/flexible means are used to separate them, the slicing operator in python and assuming a 16 byte IV being particularly prevalent (and simple):
bytesRead = get bytes from source
IV = bytesRead[:16]
cipherText = bytesRead[16:]
This approach has stood the test of time and it's fine but it also is very limiting in that both sides (specifically, the listener part of both sides) must know exactly what they should do with the inbound bytes. In fact, short of slicing off the IV and using the shared secret, there's almost no other operation that can be performed.
I am examining a design wherein inbound messages have two sections: one that can be encrypted and one that is NEVER encrypted. Clearly, sensitive information is NOT put into the unencrypted section, but because it is in plaintext, a listener can actually do something intelligent with the content prior to decryption. Example: message routing, identifiers for multiple encryption keys, etc. And of course, this could include the IV. Simplified example:
{
neverEncrypted: {
cipher: "AES/CBC/PKCS5PADDING",
IV: "121866AFBBDF2906C0D942C1FA7C4DAB",
other: [ "hello", 2, 3 ],
timestamp: date
},
data: (binary cipherText WITHOUT the IV prefix)
}
This approach certainly "works" and IV is not assumed to be secret so it can live in neverEncrypted. I also like the fact that one does not have to get "clever" about the concatenation and splitting of material, esp. assumptions about block size initialization, etc. This is particularly true if a listener needs to listen for material encrypted with 2 (or more) ciphers.
The question is: Is there any special advantage to the "prefix" approach re. security or is it more just a nice method of transferring both pieces of information without requiring any additional fancy data structure decoding?
In terms of security, no, there are no direct benefits from prefixing the IV to the ciphertext (other than the ease of HMACing which I talk about below). The design you've outlined is just as secure (provided a correct implementation) as prefixing the IV.
There are definitely benefits in efficiency, however. Consider the resource cost of parsing and structuring a JSON object in memory versus the cost of splitting a stream of bytes. The latter is astronomically faster. You wouldn't notice this though unless you're processing very large amounts of data.
As long as your crypto implementation is correct, how you transfer parameters isn't really that important as long as both endpoints know what to do with the data.
Note that your suggested method might prove more difficult to HMAC your ciphertext properly as you need to apply the HMAC to the IV and ciphertext, hence why having an IV prefixed ciphertext is easier.
Related
Although it is not considered safe, I need a Python library that always generates for the same plaintext the same ciphertext using asymmetric encryption scheme.
Meaning that given a plaintext m and a public key k when encrypting m using k I will always get a constant ciphertext c.
It will be even better if there is a way to use the Python library "cryptography" to do so.
You can't find this because public-key encryption cannot possibly be deterministic. Any deterministic public-key encryption scheme is subject to a very simple attack: given a ciphertext, guess what the plaintext might be, and verify the guess by encrypting with the public key. Anyone can carry out this attack since the encryption key is public.
This is completely different from public-key signatures, which can be deterministic because being able to tell that two signatures are from the same message doesn't change anything about the signature's validity. With encryption, being able to tell that two plaintexts are the same message does break the whole purpose of encryption.
There is one scenario in which public-key encryption could be deterministic, and that's if the plaintexts are randomly generated, or derived from randomly generated data, and it's impossible to guess potential plaintexts. However, with such input restrictions, you shouldn't look for an “asymmetric encryption” scheme, but for a lower-level primitive: a trapdoor permutation. This is not a directly usable primitive, but it can be a building block of a cryptographic mechanism (such as a public-key encryption mechanism). So you can't expect libraries to offer that as their interface. Furthermore, typical protocols are not generic in the way they might use a trapdoor permutation. So your protocol definition would call for a specific primitive, not for a “deterministic asymmetric encryption” primitive.
If you think you need “deterministic asymmetric encryption”, you're designing your own cryptographic scheme, and it does not stand a chance of being correct. Don't do that. If you need help solving a problem, ask about your actual problem instead of the dead end you've reached trying to solve it.
Is there a systematic way to run Python 3.x with all strings defaulting to bytes? I am finding that when "crossing boundaries" for example talking to msgpack, Elixir, or ZeroMQ, I'm having to do all sorts of contortions constantly figuring out whether strings or bytes will be returned. It's a complete pain and adds a layer of cognitive friction over and above my problem.
For example I have
import argparse
parser.add_argument("--nodename")
args = parser.parse_args()
and then to get the nodename I need to do
str(args.nodename)
However zeroMQ wants bytes, and I'm going to use the nodename everywhere I use zeroMQ. So I make it bytes up front with
nodename.encode()
But now every time I want to use it with a string, say for concatenation, I cannot do so because I have to encode the string first. And half the libraries take perfectly good bytes data type and return them to you as strings, at which time you have to convert them back again to bytes if you want to send them outside Python. For a "glue language" this is a total disaster. I'm having to do this encode decode dance whenever I cross the boundary, and the worst is that it does not seem consistent across libraries whether they co-opt you to strings or bytes if you send them bytes.
In Python 3 is there an option to forego Unicode-by-default (since it does after all say, "by default", suggesting it can be changed), or is the answer "stick with 2.7".
In short, no. And you really don't want to try. You mention contortions but don't give specific examples, so it's hard to offer specific advice.
Neither, in this author's humble opinion, do you want to stick with Python 2.7, but if you don't need bugfixes and language updates after 2020 it won't matter.
The point is precisely that all translation between bytes and text should take place at the boundaries of your code. Decode (from whatever external representation is used) on input, encode (to whatever encoding you wish or need to use) on output. Python 3 is written to enforce this distinction, but understanding the separation should give you proper control and reduce your frustrations.
In Python 3, opening a file in text mode causes readline and friends to produce Unicode strings. You can specify the encoding when you open the file if you wish. Opening a file in binary mode causes them to produce bytestrings, to which you will have to apply your own decoding to make sense of them as text.
Whether the Python API for a particular system returns bytes or text is up to its author, and calling Python 3 functions that expect strings with bytestring arguments is likely to lead to confusion and unhappiness. All external communications (network, files, etc.) must necessarily take place in terms of bytestrings, so be clear what is text (decoding on input and encoding on output) and deal with the outside world exclusively in bytestrings.
There are always, of course, difficult corner cases. I don't envy the maintainers of the email package, who have to deal with messages containing 6-bit encoded bytestreams themselves potentially containing attachments in multiple different encodings. But then I don't usually have to work in such complex environments, and hopefully neither do you.
I am trying to learn how to defend against security attacks on websites. The link below shows a good tutorial, but I am puzzled by one statement:
In http://google-gruyere.appspot.com/part3#3__client_state_manipulation , under "Cookie manipulation", Gruyere says Pythons hash is insecure since it hashes from left-to-right.
The Gruyere application is using this to encrypt data:
# global cookie_secret; only use positive hash values
h_data = str(hash(cookie_secret + c_data) & 0x7FFFFFF)
c_data is a username; cookie_secret is a static string (which is just '' by default)
I understand that in more secure hash functions, one change generates a whole new result, but I don't understand why this insecure, because different c_data generates whole different hashes!
EDIT: How would one go about beating a hash like this?
What the comment may be trying to get at is that for most hash functions, if you are given HASH(m) then it is easy to calculate HASH(m . x), for any x (where . is concatenation).
Therefore, if you are user ro, and the server sends you HASH(secret . ro), then you can easily calculate HASH(secret . root), and login as a different user.
I think that's just a bad explanation there. Python's hash() is insecure because it's easy to find collisions, but "hashes from left to right" has nothing to do with why it's easy to find collisions. Cryptographically secure hashes also process data strictly in sequence; they're likely to operate on data 128 or 256 bits at a time rather than one byte at a time, but that's just a detail of the implementation.
(It should be said that hash() being insecure is not a bug in Python, because that's not what it's for. It's an exposed detail of the implementation of Python's dictionaries as hash tables, and you generally don't want a secure hash function for your hash table, because that would slow it down so much that it would defeat the purpose. Python does provide secure hash functions in the hashlib module.)
(The use of an insecure hash is not the only problem with the code you show, but it is by far the most important problem.)
Python's default hashing algorithm (for all types, but it has the most severe consequences for strings as those are commonly hashed for security) is geared towards running fast and playing nice with the implementation of dicts. It's not a cryptographic hashing function, you shouldn't use it for security. Use hashlib for this.
The python built-in hash function is not intended for secure, cryptographic hashing. It's intention is to facilitate storing Python objects into dictionaries efficiently.
The internal hash implementations are too predictable (too many collisions) for secure uses. For example, the following assertions are all true:
hash('a') < hash('b')
hash('b') < hash('c')
hash('c') < hash('d')
This sequential nature makes for great dictionary storage behaviour, for which it was designed.
To create a secure hash, use the hashlib library instead.
One would go about "beating" a hash like that by appending their data to the end of the string being hashed and predicting the hash function output. Let me illustrate this:
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> data = 'root|admin|author'
>>> str(hash('' + data) & 0x7FFFFFF);
'116042699'
>>> data = 'root|admin|authos'
>>> str(hash('' + data) & 0x7FFFFFF);
'116042698'
>>>
Empty string ('') is the cookie secret you mentioned to be an empty string. In this particular example, though not really exploitable, one can see that the hash changed by 1 and the last byte of data changed "by one" too. Now, this example is not really an exploit (omitting the fact that creating a username of the anything_here|admin format makes that user admin) because there's some data after the username (left to right) so even if you create a username that's very close to the one being attacked then the rest of the string changes the hash in a completely undesirable manner. However, if the cookie was in the form of 105770185|user07 instead of 105770185|user07||author then you'd easily create a user "user08" or "user06" and computepredict the hash (hometask: what's the hash for "user08"? ).
I have been given this hypothetical problem:
"Osama returns from the dead and wants revenge. He now wants to communicate with his sleeper cells around the world and plan an attack. But he has to make sure know one else gets it and hence, will like to send it in an encrypted form. He's recruited you for the job. Design a system with encryption and decryption modules for the text message."
I am currently considering following scheme for encryption/decryption:
Now I want to know about the best PKC and SKC and Hash function for implementing above scheme.I did a bit of research over net on best algorithm and narrowed my algorithm choices to following:
Hash:MD5
PKC:RSA or Diffie-Hellman
SKC:DSA
Can you please suggest if there is something that I am missing or any better/new algorithem available.
I am planning to implement this in python.
EDIT:
After reading replies I think I should go with followings:
Hash:SHA-2
PKC:ECC
SKC:AES
Any advice on python library that will provide these algorithm.
The short answer is: it cannot be secure if you do it yourself.
Cryptography provides some basic tools, such as symmetric encryption or digital signatures. Assembling those tools into a communication protocol is devilishly difficult, and when I invoke the name of the Devil I mean it: it looks easy, but there are many details, and it is known that the Devil hides in the details.
Your problem here is akin to "secure emailing", and there are two main protocols for that: OpenPGP and CMS (as used in S/MIME). Look them up: you will see that the solution to your problem is not easy. For the implementation, just use an existing library, e.g. M2Crypto.
MD5 has been known to have weaknesses since 1996, and is considered to be utterly broken since 2004. You should try to get some more up-to-date sources.
If you really want "better" algorithms:
Hash: any of SHA-2 family (sha-224/256/384/512)
PKC: ECC (Elliptic Curve Cryptography)
Other related info: Elliptic curve Diffie–Hellman, Elliptic Curve DSA
Also read Applied Cryptography and other books by Bruce Schneier
For your SKC, AES is a more modern standard than DSA, though DSA isn't broken in the same way that MD5 is.
Strictly speaking, the question does not require sender authentication or nonrepudiation, so the digital signature is not necessary either.
Alot of asymmetric cryptography is vulnerable to quantum cryptography. If you're Osama your're going up against the NSA and it's a safe bet they've got a quantum computer sitting around somewhere.
The only system that is able to provide perfect secrecy is the One Time Pad and it is this reason that satalites use use it.
It is essentially a chunk of random data that you XOR a message with to produce a ciphertext with and XOR with the key key again to produce the plaintext.
**Pros:**Eliptic
Perfect secrecy
Cons:
Key is the length of the message.
Unable to re-use a key without revealing the plaintext of both messages
Use a fallback of AES in feedback mode to encrypt your messageselliptic curve crypto for document signing and HMAC to ensure message integrity.
Pycrypto is a python library with everything you would need to implement this yourself.
I'm a bit confused that the argument to crypto functions is a string. Should I simply wrap non-string arguments with str() e.g.
hashlib.sha256(str(user_id)+str(expiry_time))
hmac.new(str(random.randbits(256)))
(ignore for the moment that random.randbits() might not be cryptographically good).
edit: I realise that the hmac example is silly because I'm not storing the key anywhere!
Well, usually hash-functions (and cryptographic functions generally) work on bytes. The Python strings are basically byte-strings. If you want to compute the hash of some object you have to convert it to a string representation. Just make sure to apply the same operation later if you want to check if the hash is correct. And make sure that your string representation doesn't contain any changing data that you don't want to be checked.
Edit: Due to popular request a short reminder that Python unicode strings don't contain bytes but unicode code points. Each unicode code point contains multiple bytes (2 or 4, depending on how the Python interpreter was compiled). Python strings only contain bytes. So Python strings (type str) are the type most similar to an array of bytes.
You can.
However, for the HMAC, you actually want to store the key somewhere. Without the key, there is no way for you to verify the hash value later. :-)
Oh and Sha256 isn't really an industrial strength cryptographic function (although unfortunately it's used quite commonly on many sites). It's not a real way to protect passwords or other critical data, but more than good enough for generating temporal tokens
Edit: As mentioned Sha256 needs at least some salt. Without salt, Sha256 has a low barrier to being cracked with a dictionary attack (time-wise) and there are plenty of Rainbow tables to use as well. Personally I'd not use anything less than Blowfish for passwords (but that's because I'm paranoid)