I'm a bit confused that the argument to crypto functions is a string. Should I simply wrap non-string arguments with str() e.g.
hashlib.sha256(str(user_id)+str(expiry_time))
hmac.new(str(random.getrandbits(256)))
(ignore for the moment that random.getrandbits() might not be cryptographically good).
edit: I realise that the hmac example is silly because I'm not storing the key anywhere!
Well, usually hash functions (and cryptographic functions generally) work on bytes. Python 2 strings (str) are basically byte strings. If you want to compute the hash of some object, you have to convert it to a string representation. Just make sure to apply the same operation later if you want to check whether the hash is correct, and make sure that your string representation doesn't contain any changing data that you don't want to be checked.
Edit: Due to popular request, a short reminder that Python unicode strings don't contain bytes but unicode code points. Each unicode code point is stored as multiple bytes (2 or 4, depending on how the Python interpreter was compiled). Plain Python 2 strings (type str) only contain bytes, so str is the type most similar to an array of bytes.
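A minimal sketch of that point, using the field names from the question (the separator is my own choice; the explicit encode makes the byte conversion work the same on Python 2 and 3):
import hashlib

user_id = 42
expiry_time = 1700000000
# Build one canonical string from the fields you want covered, then turn it
# into bytes before hashing (a separator avoids ambiguity between fields).
message = "%d|%d" % (user_id, expiry_time)
token = hashlib.sha256(message.encode("utf-8")).hexdigest()
# Later, recompute the hash from the exact same string representation and compare.
assert hashlib.sha256(message.encode("utf-8")).hexdigest() == token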
You can.
However, for the HMAC, you actually want to store the key somewhere. Without the key, there is no way for you to verify the hash value later. :-)
Oh, and SHA-256 on its own isn't really industrial-strength password protection (although unfortunately it's used for that quite commonly on many sites). It's not a real way to protect passwords or other critical data, but it's more than good enough for generating temporary tokens.
Edit: As mentioned, SHA-256 needs at least some salt. Without salt, SHA-256 has a low barrier to being cracked with a dictionary attack (time-wise), and there are plenty of rainbow tables to use as well. Personally I'd not use anything less than Blowfish for passwords (but that's because I'm paranoid).
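A minimal sketch of the salted approach using only the standard library (PBKDF2 here stands in for the Blowfish/bcrypt suggestion; the password and parameters are illustrative):
import hashlib
import os

# A random per-user salt defeats precomputed rainbow tables, and a high
# iteration count slows down dictionary attacks.
salt = os.urandom(16)
digest = hashlib.pbkdf2_hmac('sha256', b'the password', salt, 100000)
# Store salt and digest together; to verify a login, recompute with the
# stored salt and compare the results.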
Related
I am publishing the same data (topic, key & value) from a Python confluent_kafka based producer vs. a Java Apache Kafka client based producer, but when the messages are checked on Kafka they end up in different partitions.
I was expecting that, by default, both libraries would use the same hash method (murmur2) on the key and determine the same partition when publishing a message to Kafka, but it looks like that is not happening.
Is there a flag or option that needs to be set on the Python library so that it uses the same algorithm and generates the same Kafka partition as the Java library, or is there another Python library that should be used to achieve this?
I found a way to force the confluent_kafka Producer to use the murmur2 algorithm to determine the partition. You can set the parameter below:
'partitioner': 'murmur2_random'
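A minimal sketch of where that setting goes (the broker address, topic and key are placeholders):
from confluent_kafka import Producer

# 'murmur2_random' hashes non-None keys with the same murmur2 algorithm as
# the Java client and assigns messages with None keys to a random partition.
producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'partitioner': 'murmur2_random',
})
producer.produce('my-topic', key='user-123', value='some payload')
producer.flush()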
I've had the same problem. You can do two things:
Change the partitioner on the Python side to "murmur2_random". However, this might be a bit fragile.
Forward your messed-up topic to another topic that is written by a single language.
Example solution
In the latest version of kafka-python (v2.0.2), the default partitioning algorithm for the producer is the same as the default Java partitioning algorithm (murmur2).
https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
"The default partitioner implementation hashes each non-None key using the same murmur2 algorithm as the java client so that messages"
By default, both Python and Java should convert the keys to strings and then encode them to bytes using UTF-8. Finally, the bytes are used to compute the murmur2 hash.
In theory, the same string should result in the same UTF-8 encoding on pretty much any machine/environment. So once the key strings are the same, we should always get the same murmur2 hash and therefore the same partition, regardless of whether the partition is calculated in Java or Python. That's my understanding, anyway.
A problem that I have seen is in the string conversion. In my case, the partition key is a dictionary object, and converting the dictionary to a string resulted in different whitespace in the string across Java and Python.
For example:
import json
partition_key = json.dumps({"id":123456})
print(partition_key)
# Output:
# {"id": 123456}
# This result is different from the string we get from Java.
# Using the widely used Jackson serializer in Java we get:
# {"id":123456}
# Notice the extra space in the string generated by Python.
The way hashing works, even a one-character difference in a string will result in a completely different hash. If you have a small number of partitions, you could get lucky and get the same partition for similar strings, but it's not guaranteed.
To fix the issue with extra spaces in the Python dict -> string conversion, we can use the separators parameter of json.dumps. For example:
partition_key = json.dumps({"id":123456}, separators=(',', ':'))
print(partition_key)
# Output:
# {"id":123456}
# Notice there is now no space between the key and the value.
Once I did this, I got the same partitioning across both Java and Python.
Fixing the extra space fixed the problem for me, but there may be other differences for other kinds of keys. In general, I think the approach of ensuring that the Java and Python implementations convert the keys to the EXACT same string, and then use the same encoding and partitioning algorithms, should be a pretty general solution.
Of course, having simpler keys (e.g. not dictionaries) is probably a helpful design principle, but this is not always in your control. A canonical-serialization helper along those lines is sketched below.
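A minimal sketch of such a helper (the function name is made up; the Java side would have to serialize its keys the same canonical way):
import json

def make_partition_key(obj):
    # Canonical JSON: no extra whitespace, keys sorted, then UTF-8 encoded.
    # If both producers hash identical bytes, murmur2 puts them in the same partition.
    return json.dumps(obj, separators=(',', ':'), sort_keys=True).encode('utf-8')

print(make_partition_key({"id": 123456}))  # b'{"id":123456}'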
Is there a systematic way to run Python 3.x with all strings defaulting to bytes? I am finding that when "crossing boundaries", for example talking to msgpack, Elixir, or ZeroMQ, I'm having to do all sorts of contortions, constantly figuring out whether strings or bytes will be returned. It's a complete pain and adds a layer of cognitive friction on top of my actual problem.
For example I have
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--nodename")
args = parser.parse_args()
and then to get the nodename I need to do
str(args.nodename)
However zeroMQ wants bytes, and I'm going to use the nodename everywhere I use zeroMQ. So I make it bytes up front with
nodename.encode()
But now every time I want to use it with a string, say for concatenation, I cannot do so, because I have to encode the string first. And half the libraries take a perfectly good bytes value and return it to you as a string, at which point you have to convert it back to bytes if you want to send it outside Python. For a "glue language" this is a total disaster. I'm having to do this encode/decode dance whenever I cross the boundary, and the worst part is that it doesn't even seem consistent across libraries whether they coerce you to strings or bytes when you send them bytes.
In Python 3, is there an option to forgo Unicode-by-default (since it does, after all, say "by default", suggesting it can be changed), or is the answer "stick with 2.7"?
In short, no. And you really don't want to try. You mention contortions but don't give specific examples, so it's hard to offer specific advice.
Neither, in this author's humble opinion, do you want to stick with Python 2.7, but if you don't need bugfixes and language updates after 2020 it won't matter.
The point is precisely that all translation between bytes and text should take place at the boundaries of your code. Decode (from whatever external representation is used) on input, encode (to whatever encoding you wish or need to use) on output. Python 3 is written to enforce this distinction, but understanding the separation should give you proper control and reduce your frustrations.
In Python 3, opening a file in text mode causes readline and friends to produce Unicode strings. You can specify the encoding when you open the file if you wish. Opening a file in binary mode causes them to produce bytestrings, to which you will have to apply your own decoding to make sense of them as text.
Whether the Python API for a particular system returns bytes or text is up to its author, and calling Python 3 functions that expect strings with bytestring arguments is likely to lead to confusion and unhappiness. All external communications (network, files, etc.) must necessarily take place in terms of bytestrings, so be clear what is text (decoding on input and encoding on output) and deal with the outside world exclusively in bytestrings.
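A minimal sketch of that discipline (the sock object here is hypothetical; substitute your ZeroMQ socket, file, or whatever sits at the boundary):
def send_text(sock, text):
    # Encode on output: the outside world only ever sees bytes.
    sock.send(text.encode('utf-8'))

def recv_text(sock):
    # Decode on input: the rest of the program only ever sees str.
    return sock.recv().decode('utf-8')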
There are always, of course, difficult corner cases. I don't envy the maintainers of the email package, who have to deal with messages containing 6-bit encoded bytestreams themselves potentially containing attachments in multiple different encodings. But then I don't usually have to work in such complex environments, and hopefully neither do you.
I am trying to learn how to defend against security attacks on websites. The link below shows a good tutorial, but I am puzzled by one statement:
In http://google-gruyere.appspot.com/part3#3__client_state_manipulation, under "Cookie manipulation", Gruyere says Python's hash() is insecure since it hashes from left to right.
The Gruyere application is using this to encrypt data:
# global cookie_secret; only use positive hash values
h_data = str(hash(cookie_secret + c_data) & 0x7FFFFFF)
c_data is a username; cookie_secret is a static string (which is just '' by default).
I understand that in more secure hash functions, one change generates a whole new result, but I don't understand why this is insecure, because different c_data generates completely different hashes!
EDIT: How would one go about beating a hash like this?
What the comment may be trying to get at is that for most hash functions, if you are given HASH(m) then it is easy to calculate HASH(m . x), for any x (where . is concatenation).
Therefore, if you are user ro, and the server sends you HASH(secret . ro), then you can easily calculate HASH(secret . root), and login as a different user.
I think that's just a bad explanation there. Python's hash() is insecure because it's easy to find collisions, but "hashes from left to right" has nothing to do with why it's easy to find collisions. Cryptographically secure hashes also process data strictly in sequence; they're likely to operate on data 128 or 256 bits at a time rather than one byte at a time, but that's just a detail of the implementation.
(It should be said that hash() being insecure is not a bug in Python, because that's not what it's for. It's an exposed detail of the implementation of Python's dictionaries as hash tables, and you generally don't want a secure hash function for your hash table, because that would slow it down so much that it would defeat the purpose. Python does provide secure hash functions in the hashlib module.)
(The use of an insecure hash is not the only problem with the code you show, but it is by far the most important problem.)
Python's default hashing algorithm (for all types, but it has the most severe consequences for strings as those are commonly hashed for security) is geared towards running fast and playing nice with the implementation of dicts. It's not a cryptographic hashing function, you shouldn't use it for security. Use hashlib for this.
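For the cookie case in the question, a rough sketch of what that could look like with hmac and hashlib (the secret and the cookie data here are placeholders):
import hashlib
import hmac

cookie_secret = b'a-long-random-secret-key'   # must be secret and non-empty
c_data = b'username|some|cookie|fields'
h_data = hmac.new(cookie_secret, c_data, hashlib.sha256).hexdigest()

# Verify with hmac.compare_digest rather than == to avoid timing leaks.
expected = hmac.new(cookie_secret, c_data, hashlib.sha256).hexdigest()
print(hmac.compare_digest(h_data, expected))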
The Python built-in hash function is not intended for secure, cryptographic hashing. Its intention is to facilitate storing Python objects in dictionaries efficiently.
The internal hash implementations are too predictable (too many collisions) for secure uses. For example, the following assertions are all true:
hash('a') < hash('b')
hash('b') < hash('c')
hash('c') < hash('d')
This sequential nature makes for great dictionary storage behaviour, for which it was designed.
To create a secure hash, use the hashlib library instead.
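A minimal sketch of the difference (sha256 is just one of several algorithms hashlib offers):
import hashlib

s = 'some cookie data'
# Built-in hash(): fast, designed for dict keys, and not secure.
print(hash(s))
# hashlib: a real cryptographic digest; note it wants bytes, not str.
print(hashlib.sha256(s.encode('utf-8')).hexdigest())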
One would go about "beating" a hash like that by appending their data to the end of the string being hashed and predicting the hash function output. Let me illustrate this:
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> data = 'root|admin|author'
>>> str(hash('' + data) & 0x7FFFFFF);
'116042699'
>>> data = 'root|admin|authos'
>>> str(hash('' + data) & 0x7FFFFFF);
'116042698'
>>>
The empty string ('') is the cookie_secret you mentioned, which defaults to an empty string. In this particular example, though not really exploitable, one can see that the hash changed by 1 when the last byte of the data changed "by one" too. Now, this example is not really an exploit (omitting the fact that creating a username of the form anything_here|admin makes that user an admin), because there is some data after the username (left to right), so even if you create a username that's very close to the one being attacked, the rest of the string changes the hash in a way you can't control. However, if the cookie was in the form 105770185|user07 instead of 105770185|user07||author, then you could easily create a user "user08" or "user06" and compute/predict the hash (homework: what's the hash for "user08"?).
Possible Duplicate:
Is it possible to decrypt md5 hashes?
I used m = md5.new(); m.update("aaa"); m.digest() to form an MD5 hash of the data "aaa". How do I get the data back using Python?
You cannot decode an MD5 hash, as hashing is a process that is best thought of as one-way encoding: what is hashed cannot be de-hashed; one can only determine what was hashed, either by examining a list of known hashes, or by hashing a set of inputs and matching the resulting hashes against the hash you are trying to "decode".
Quoting Wikipedia, the key features of such a hashing algorithm are:
it is infeasible to find a message that has a given hash,
it is infeasible to modify a message without changing its hash,
it is infeasible to find two different messages with the same hash.
The most common uses of such algorithms today are:
Storing passwords
Verifying the contents of files.
If you want to two-way encrypt the data, you need to look at other cryptographic libraries for Python (As usual, Stackoverflow has a recommendation).
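For the second use case (verifying the contents of files), a minimal sketch could look like this (the filename and the expected checksum are placeholders):
import hashlib

def file_sha256(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        # Read in chunks so large files don't have to fit in memory.
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

print(file_sha256('downloaded.iso') == 'published-checksum-goes-here')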
You can't. That's the point - a hash is one-way, it's not the same as an encryption.
I don't know about Python - but hash functions are irreversible.
First of all, note that hash functions provide a constant-length output, meaning that information is thrown away (you can hash a 3 MB file and still only get a result of less than 1 kB).
Additionally, hash functions are made to be irreversible; if you need encryption, don't use hashing, use encryption. A major application of hashing is ensuring that, when database info has leaked (and it contained only hashes), the passwords themselves have not been compromised (there are more examples, but this is the most obvious one).
If you want to break a hash, such as a password hash, then you need a very large lookup table. John the Ripper is commonly used to break passwords using a dictionary; this is a very good method, especially if it's a salted password hash.
Another approach is using a rainbow table; however, these take a long time to generate. There are free rainbow tables accessible online.
Here is a Python script to perform an MD5 brute-force attack.
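A minimal sketch of the idea (lowercase-only and very short inputs, purely for illustration):
import hashlib
import itertools
import string

def brute_force_md5(target_hex, max_len=4):
    # Try every lowercase candidate up to max_len characters and compare
    # its MD5 digest against the target hash.
    for length in range(1, max_len + 1):
        for chars in itertools.product(string.ascii_lowercase, repeat=length):
            candidate = ''.join(chars)
            if hashlib.md5(candidate.encode()).hexdigest() == target_hex:
                return candidate
    return None

print(brute_force_md5(hashlib.md5(b'aaa').hexdigest()))  # prints 'aaa'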
To add to everyone else's point, MD5 is a one-way hash. The common usage is to hash two input values and if the hashed values match, then the input should be the same. Going from an MD5 hashed value to the hash input is nonsensical. What you are probably after is a symmetric encryption algorithm - see two-way keyed encryption/hash algorithm for a good discussion on the subject.
In general, the answers from BlueRaja and Sean are correct. MD5 (and other hash functions) are one-way, you can't reverse the process.
However, if the input data is small, you can try to brute-force it: search for a piece of data (another, or the same one) having the same hash.
Hashes map a bunch of data to a finite (albeit large) set of numeric values/strings.
It is a many-to-one mapping, so that decoding a hash is not only "difficult" in the cryptographic sense, but also conceptually impossible in that even if you could, you would get an infinite set of possible input strings.
Are there any security exploits that could occur in this scenario:
eval(repr(unsanitized_user_input), {"__builtins__": None}, {"True":True, "False":False})
where unsanitized_user_input is a str object. The string is user-generated and could be nasty. Assuming our web framework hasn't failed us, it's a real honest-to-god str instance from the Python builtins.
If this is dangerous, can we do anything to the input to make it safe?
We definitely don't want to execute anything contained in the string.
See also:
Funny blog post about eval safety
Previous Question
Blog: Fast deserialization in Python
The larger context which is (I believe) not essential to the question is that we have thousands of these:
repr([unsanitized_user_input_1,
unsanitized_user_input_2,
unsanitized_user_input_3,
unsanitized_user_input_4,
...])
in some cases nested:
repr([[unsanitized_user_input_1,
unsanitized_user_input_2],
[unsanitized_user_input_3,
unsanitized_user_input_4],
...])
which are themselves converted to strings with repr(), put in persistent storage, and eventually read back into memory with eval.
Eval deserialized the strings from persistent storage much faster than pickle and simplejson. The interpreter is Python 2.5 so json and ast aren't available. No C modules are allowed and cPickle is not allowed.
It is indeed dangerous, and the safest alternative is ast.literal_eval (see the ast module in the standard library). You can of course build and alter an AST to provide e.g. evaluation of variables and the like before you eval the resulting AST (once it's down to literals).
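A minimal sketch of the literal_eval route (requires Python 2.6+, so it doesn't fit the 2.5 constraint in the question, but it is the right tool where available):
import ast

# literal_eval only accepts Python literals (strings, numbers, tuples,
# lists, dicts, booleans, None), so nothing in the stored text can execute.
stored = repr(["unsanitized input", 'with "quotes"', "and\nnewlines"])
values = ast.literal_eval(stored)
print(values)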
The possible exploit of eval starts with any object it can get its hands on (say True here) and goes via .__class__ to its type object, etc., up to object, then gets its subclasses... basically it can get to ANY object type and wreak havoc. I could be more specific, but I'd rather not do it in a public forum (the exploit is well known, but considering how many people still ignore it, revealing it to wannabe script kiddies could make things worse... just avoid eval on unsanitized user input and live happily ever after!-).
If you can prove beyond doubt that unsanitized_user_input is a str instance from the Python built-ins with nothing tampered, then this is always safe. In fact, it'll be safe even without all those extra arguments since eval(repr(astr)) = astr for all such string objects. You put in a string, you get back out a string. All you did was escape and unescape it.
This all leads me to think that eval(repr(x)) isn't what you want--no code will ever be executed unless someone gives you an unsanitized_user_input object that looks like a string but isn't, but that's a different question--unless you're trying to copy a string instance in the slowest way possible :D.
With everything as you describe, it is technically safe to eval repr'd strings; however, I'd avoid doing it anyway, as it's asking for trouble:
There could be some weird corner case where your assumption that only repr'd strings are stored breaks down (e.g. a bug or a different pathway into the storage that doesn't call repr), and it instantly becomes a code injection exploit where it might otherwise be unexploitable.
Even if everything is OK now, assumptions might change at some point, and unsanitised data may get stored in that field by someone unaware of the eval code.
Your code may get reused (or worse, copy+pasted) into a situation you didn't consider.
As Alex Martelli pointed out, in Python 2.6 and higher there is ast.literal_eval, which will safely handle both strings and other simple datatypes like tuples. This is probably the safest and most complete solution.
Another possibility, however, is to use the string-escape codec. This is much faster than eval (about 10 times, according to timeit), is available in earlier versions than literal_eval, and should do what you want:
>>> s = 'he\nllo\' wo"rld\0\x03\r\n\tabc'
>>> repr(s)[1:-1].decode('string-escape') == s
True
(The [1:-1] is to strip the outer quotes repr adds.)
Generally, you should never allow anyone to post code.
So called "paid professional programmers" have a hard-enough time writing code that actually works.
Accepting code from the anonymous public -- without benefit of formal QA -- is the worst of all possible scenarios.
Professional programmers -- without good, solid formal QA -- will make a hash of almost any web site. Indeed, I'm reverse engineering some unbelievably bad code from paid professionals.
The idea of allowing a non-professional -- unencumbered by QA -- to post code is truly terrifying.
repr([unsanitized_user_input_1,
unsanitized_user_input_2,
...
... unsanitized_user_input is a str object
You shouldn't have to serialise strings to store them in a database.
If these are all strings, as you mentioned, why can't you just store the strings in a db.StringListProperty?
The nested entries might be a bit more complicated, but why is this the case? When you have to resort to eval to get data out of the database, you're probably doing something wrong.
Couldn't you store each unsanitized_user_input_x as its own db.StringProperty row, and group them by a reference field?
Either of those may not be applicable, since I've no idea what you're trying to achieve, but my point is: can you not structure the data in a way where you don't have to rely on eval (and also rely on it not being a security issue)?
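A rough sketch of the first suggestion, assuming the old Google App Engine db API that StringListProperty comes from (the model name and field are made up):
from google.appengine.ext import db

class UserInputBatch(db.Model):
    # Store the raw strings themselves instead of an eval-able repr blob.
    inputs = db.StringListProperty()

batch = UserInputBatch(inputs=["user input one", "user input two"])
batch.put()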