what's the difference between the HMAC signature and hashing directly? - python

Just out of curiosity, really. For example, in Python,
hashlib.sha1("key" + "data").hexdigest() != hmac.new("key", "data", hashlib.sha1).hexdigest()
Is there some logical distinction I'm missing between the two actions?

hashlib.sha1 simply gives you the SHA-1 hash of the content "keydata" that you pass as a parameter (note that you are just concatenating the two strings). The hmac call gives you a keyed hash of the string "data", using the string "key" as the key and SHA-1 as the hash function. The fundamental difference between the two calls is that the HMAC can only be reproduced if you know the key, so a valid HMAC also tells you something about who generated it. A plain SHA-1 hash can only be used to detect that content has not changed; anyone can recompute it, so it says nothing about who produced it.
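To make the distinction concrete, here is a minimal Python 3 sketch. It assumes bytes inputs (Python 3's hashlib and hmac reject str), reuses the placeholder key and data from the question, and introduces a hypothetical helper named verify purely for illustration.

```python
import hashlib
import hmac

key = b"key"
data = b"data"

# Plain hash of the concatenation: anyone holding the bytes can recompute it.
plain = hashlib.sha1(key + data).hexdigest()

# Keyed hash: only someone who knows `key` can produce or check this tag.
tag = hmac.new(key, data, hashlib.sha1).hexdigest()


def verify(key, data, received_tag):
    """Hypothetical helper: recompute the HMAC and compare in constant time."""
    expected = hmac.new(key, data, hashlib.sha1).hexdigest()
    return hmac.compare_digest(expected, received_tag)


print(plain != tag)                     # the two constructions give different values
print(verify(key, data, tag))           # checking with the right key
print(verify(b"other key", data, tag))  # checking with a different key
```

The receiver needs the same key to run the check, which is what ties the tag to whoever holds that key.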

I found the answer in the manual.
https://en.wikipedia.org/wiki/Hmac#Design_principles

Related

Use of .digest() in hashing?

What is the use of .digest() in this statement? Why do we use it? I searched Google (and the documentation too), but I still can't figure it out.
train_hashes = [hashlib.sha1(x).digest() for x in train_dataset]
What I found is that it converts to a string. Am I right or wrong?
The .digest() method returns the actual digest the hash is designed to produce.
It is a separate method because the hashing API is designed to accept data in multiple pieces:
hash = hashlib.sha1()
for chunk in large_amount_of_data:
    hash.update(chunk)
final_digest = hash.digest()
The above code creates a hashing object without passing any initial data in, then uses the hash.update() method to feed in chunks of data in a loop. This helps avoid having to load all of the data into memory at once, so you can hash anything between 1 byte and the entire Google index, if you ever had access to something that large.
If hashlib.sha1(x) produced the digest directly you could never add additional data to hash first. Moreover, there is also an alternative method of accessing the digest, as a hexadecimal string using the hash.hexdigest() method (equivalent to hash.digest().hex(), but more convenient).
The code you found uses the fact that the constructor of the hash object also accepts data; since that's all of the data that you wanted to hash, you can call .digest() immediately.
The module documentation covers it this way:
There is one constructor method named for each type of hash. All return a hash object with the same simple interface. For example: use sha256() to create a SHA-256 hash object. You can now feed this object with bytes-like objects (normally bytes) using the update() method. At any point you can ask it for the digest of the concatenation of the data fed to it so far using the digest() or hexdigest() methods.
(bold emphasis mine).
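As a small Python 3 sketch of the points above, the following compares incremental hashing with the one-shot constructor form, and .digest() with .hexdigest(). The two chunks are made-up stand-ins for data arriving in pieces.

```python
import hashlib

chunks = [b"hello ", b"world"]  # stand-in for data arriving in pieces

# Incremental: create an empty hash object and feed it chunk by chunk.
h = hashlib.sha1()
for chunk in chunks:
    h.update(chunk)

# One-shot: pass all of the data to the constructor.
one_shot = hashlib.sha1(b"".join(chunks))

print(h.digest() == one_shot.digest())   # same digest either way
print(type(h.digest()).__name__)         # digest() returns raw bytes
print(type(h.hexdigest()).__name__)      # hexdigest() returns a str of hex digits
print(h.digest().hex() == h.hexdigest())
```

So .digest() does not convert to a string; it returns the raw bytes of the hash (20 of them for SHA-1), and .hexdigest() is the string form.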

Python str id hash

I'm trying to convert a user access log into a pure binary format, which requires converting each string into an int using some hash method. The "id -> string value" mapping would then be stored somewhere for later reverse lookup.
Since I'm using Python, and to save some processing time, instead of bringing in hashlib to calculate a hash, can I simply use
string_hash = id(intern(some_string))
as the hash method? Are there any basic differences to be aware of compared to MD5 / SHA1? Is the probability of collisions noticeably higher than with MD5 / SHA1?
Doesn't work. id is not guaranteed to be consistent across interpreter executions; in CPython, it's the memory location of the object. Even if it were consistent, it doesn't have enough bytes for collision resistance. Why not just keep using the strings? ASCII or Unicode, strings can be serialized easily.
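If an integer id is still wanted, one option is to derive it from a real hash so that it is the same on every run. This is only a sketch: stable_id is a hypothetical helper, the log lines are made up, and truncating the digest to 8 bytes is an assumption chosen to fit a 64-bit field, which makes collisions more likely than with the full digest.

```python
import hashlib


def stable_id(text, num_bytes=8):
    """Hypothetical helper: map a string to an int that is identical across runs."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    # Keep only the first num_bytes of the digest (an assumption, see above).
    return int.from_bytes(digest[:num_bytes], "big")


# Build the "id -> string value" mapping for later reverse lookup.
id_to_string = {}
for value in ["GET /index.html", "POST /login"]:
    id_to_string[stable_id(value)] = value

print(id_to_string[stable_id("POST /login")])
```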

How To Create a Unique Key For A Dictionary In Python

What is the best way to generate a unique key for the contents of a dictionary? My intention is to store each dictionary in a document store along with a unique id or hash, so that I don't have to load the whole dictionary from the store to check whether it already exists. Dictionaries with the same keys and values should generate the same id or hash.
I have the following code:
import hashlib
a={'name':'Danish', 'age':107}
b={'age':107, 'name':'Danish'}
print str(a)
print hashlib.sha1(str(a)).hexdigest()
print hashlib.sha1(str(b)).hexdigest()
The last two print statements generate the same string. Is this a good implementation, or are there any pitfalls with this approach? Is there a better way to do this?
Update
Combining suggestions from the answers below, the following might be a good implementation
import hashlib
a={'name':'Danish', 'age':107}
b={'age':107, 'name':'Danish'}
def get_id_for_dict(d):
    unique_str = ''.join(["'%s':'%s';" % (key, val) for (key, val) in sorted(d.items())])
    return hashlib.sha1(unique_str).hexdigest()
print get_id_for_dict(a)
print get_id_for_dict(b)
I prefer serializing the dict as JSON and hashing that:
import hashlib
import json
a={'name':'Danish', 'age':107}
b={'age':107, 'name':'Danish'}
# Python 2
print hashlib.sha1(json.dumps(a, sort_keys=True)).hexdigest()
print hashlib.sha1(json.dumps(b, sort_keys=True)).hexdigest()
# Python 3
print(hashlib.sha1(json.dumps(a, sort_keys=True).encode()).hexdigest())
print(hashlib.sha1(json.dumps(b, sort_keys=True).encode()).hexdigest())
Returns:
71083588011445f0e65e11c80524640668d3797d
71083588011445f0e65e11c80524640668d3797d
No - you can't rely on a particular order of elements when converting a dictionary to a string.
You can, however, convert it to a sorted list of (key, value) tuples, convert that to a string, and compute a hash like this:
a_sorted_list = [(key, a[key]) for key in sorted(a.keys())]
print hashlib.sha1( str(a_sorted_list) ).hexdigest()
It's not foolproof, as the formatting of a list or a tuple converted to a string can change in some future major Python version, but I think it can be good enough.
A possible option would be using a serialized representation of the list that preserves order. I am not sure whether the default list-to-string mechanism imposes any kind of order, but it wouldn't surprise me if it were interpreter-dependent. So I'd basically build something akin to urlencode that sorts the keys beforehand.
Not that I believe that your method would fail, but I'd rather play with predictable things and avoid undocumented and/or unpredictable behavior. It's true that despite being "unordered", dictionaries end up having an order that may even be consistent, but the point is that you shouldn't take that for granted.
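Pulling the suggestions in this thread together, a Python 3 sketch might look like the following. dict_id is a hypothetical helper name, and pinning the separators is an extra assumption meant to keep the serialized form fixed; because of it, the digest will not match the one printed for the default json.dumps output earlier in the thread. It only works for dictionaries whose contents are JSON-serializable.

```python
import hashlib
import json


def dict_id(d):
    """Hypothetical helper: hash a canonical JSON form of a dictionary."""
    # sort_keys makes key order irrelevant; separators pins the exact layout.
    canonical = json.dumps(d, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


a = {'name': 'Danish', 'age': 107}
b = {'age': 107, 'name': 'Danish'}
c = {'name': 'Danish', 'age': 108}

print(dict_id(a) == dict_id(b))  # same keys and values, different insertion order
print(dict_id(a) == dict_id(c))  # one value differs
```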

Hash function that protects against collisions, not attacks. (Produces a random UUID-size result space)

I'm using SHA1 to hash down larger strings so that they can be used as keys in a database.
I'm trying to produce a UUID-size string from the original string that is random enough and big enough to protect against collisions, but much smaller than the original string.
I'm not using this for anything security related.
Example:
# Take a very long string, hash it down to a smaller string behind the scenes and use
# the hashed key as the database primary key instead
def _get_database_key(very_long_key):
    return hashlib.sha1(very_long_key).digest()
Is SHA1 a good algorithm to be using for this purpose? Or is there something else that is more appropriate?
Python has a uuid library, based on RFC 4122.
The version that uses SHA1 is UUIDv5, so the code would be something like this:
import uuid
uuid.uuid5(uuid.NAMESPACE_OID, 'your string here')
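As a rough Python 3 sketch of how that could replace the function in the question: database_key is a hypothetical helper, and using NAMESPACE_OID simply follows the snippet above (any fixed namespace UUID would do, as long as it never changes).

```python
import uuid


def database_key(very_long_key):
    """Hypothetical helper: derive a UUID-size key from a long string via UUIDv5."""
    return uuid.uuid5(uuid.NAMESPACE_OID, very_long_key)


k1 = database_key("x" * 10000)
k2 = database_key("x" * 10000)

print(k1 == k2)        # the same input always maps to the same UUID
print(k1.version)      # the UUID version field
print(len(k1.bytes))   # size of the raw key in bytes
```

A UUIDv5 is a truncated SHA-1 digest with a few bits reserved for the version and variant fields, so it is slightly smaller than the raw 20-byte SHA-1 digest used in the question.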

Convert SHA256 hash string to SHA256 hash object in Python

I have a string which is a SHA256 hash, and I want to pass it to a Python script which will convert it to a SHA256 object. If I do this:
my_hashed_string = ...  # my hashed string here
m = hashlib.sha256()
m.update(my_hashed_string)
it will just hash my hash. I don't want to hash twice; it's already been hashed. I just want Python to parse my original hashed string as a hash object. How do I do this?
Unfortunately, the hash alone isn't enough information to reconstruct the hash object. The hash algorithm is stateful: it depends on both its internal structures and on further input in order to generate hashes for subsequent input. The hash itself is only a small cross-section of the algorithm's data and cannot by itself be used to generate hashes of additional data.
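What can be done depends on the goal, so the following Python 3 sketch rests on two assumptions about it. If the hex string only needs to be stored or compared, it can be turned back into raw digest bytes without rehashing. If hashing needs to continue later, a live hash object has to be kept around, and its copy() method can snapshot the state. The input bytes here are made up.

```python
import hashlib

m = hashlib.sha256()
m.update(b"first part")

# Hex string -> raw digest bytes: no hash object involved, nothing is rehashed.
hex_string = m.hexdigest()
raw = bytes.fromhex(hex_string)
print(raw == m.digest())

# Continuing a hash requires the live object; copy() snapshots its state.
snapshot = m.copy()
snapshot.update(b" second part")
print(snapshot.digest() == hashlib.sha256(b"first part second part").digest())
```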
