Deterministic, recursive hashing in Python

Python 3's built-in hash() isn't deterministic (hash(None) varies from run to run), and doesn't even make a best effort to generate unique IDs with high probability (hash(-1) == hash(-2) is True).
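For example, in a stock CPython interpreter (the -1 collision exists because -1 is reserved as an error code in the C-level hash API):

print(hash(-1) == hash(-2))  # True: CPython remaps hash(-1) to -2
print(hash('a string'))      # differs between runs unless PYTHONHASHSEED is set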
Is there some other hash function that works well as a checksum (i.e. negligible probability of two data structures hashing to the same value, and returns the same result each run of python), and supports all of python's built-in datatypes, including None?
Ideally it would be in the standard library. I can pickle the object or get a string representation, but that seems unnecessarily hacky, and string representations of floats are probably very bad checksums.
I found the cryptographic hashes (md5, sha256) in the standard library, but they only operate on byte strings.
Haskell seems to get this almost right in its standard library... but "Nothing :: Maybe Int" and 0 both hash to 0, so it's not perfect there either.

You can use any hash from hashlib on a pickled object.
However, pickle.dumps is not suitable for hashing: pickle is not guaranteed to produce canonical output, so equal objects can serialize to different byte strings.
Instead, you can use JSON with sorted keys together with hashlib:
hashlib.md5(json.dumps(data, sort_keys=True).encode('utf-8')).hexdigest()
Taken from: https://stackoverflow.com/a/10288255/3858507, according to AndrewWagner's comment.
By the way, and only for reference since it introduces security vulnerabilities, the PYTHONHASHSEED environment variable can be used to disable hash randomization throughout your application.
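Putting that together, here is a minimal sketch of a deterministic checksum helper (the name checksum is made up here; it assumes the data is JSON-serializable, which covers dicts, lists, strings, numbers, booleans, and None, but not arbitrary objects):

import hashlib
import json

def checksum(data):
    """Deterministic SHA-256 checksum of a JSON-serializable structure."""
    # sort_keys canonicalizes dict ordering; separators strip whitespace
    canonical = json.dumps(data, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

print(checksum({'a': [1, 2.5, None], 'b': 'text'}))  # same value every run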

Related

Are there any variable length hash functions available for Python?

I am looking for a hash function that can generate a digest of a specified bit-size for a cryptographic signature scheme. A related question (https://crypto.stackexchange.com/questions/3558/are-there-hash-algorithms-with-variable-length-output) on the Cryptography SE specifies that algorithms exist for this particular purpose.
Are there any Python libraries that I can use for this?
Currently, my scheme just pads a SHA-256 output to the desired size. I have also tried the Python SHA-3 library pysha3 1.0.2; however, it only offers a few predefined digest sizes.
I want a hash function that takes the desired digest size as a parameter and hashes the message accordingly (if possible).
As a cursory answer: you might be interested in the built-in BLAKE2 functions in hashlib in Python 3.6+.
It only outputs up to 64 bytes, but is "faster than MD5, SHA-1, SHA-2, and SHA-3, yet is at least as secure as the latest standard SHA-3".
Hopefully this is long enough and you don't need external libraries!
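For example, blake2b takes a digest_size parameter anywhere from 1 to 64 bytes:

import hashlib

# digest_size may be any value from 1 to 64 bytes for blake2b
h = hashlib.blake2b(b'message to hash', digest_size=20)
print(h.hexdigest())  # 40 hex characters = 20 bytes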
Any extendable-output function (XOF) can be used to obtain a digest of a given size. From Wikipedia:
Extendable-output functions (XOFs) are cryptographic hashes which can output an arbitrarily large number of random-looking bits.
Among the functions standardized under SHA-3 are SHAKE128 and SHAKE256. They follow from the general properties of the sponge construction: a sponge function can generate an arbitrary length of output. The 128 and 256 in their names indicate their maximum security levels (in bits), as described in Sections A.1 and A.2 of FIPS 202.
In Python, first install the PyCryptodome library:
pip install pycryptodome
A hash of say 20 bytes can be generated as follows:
from Crypto.Hash import SHAKE256
from binascii import hexlify
shake = SHAKE256.new()
shake.update(b'Some data')
print(hexlify(shake.read(20)))
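Note that since Python 3.6 the standard library's hashlib also ships SHAKE, so PyCryptodome isn't strictly required; digest() and hexdigest() take the desired length in bytes:

import hashlib

shake = hashlib.shake_256(b'Some data')
print(shake.hexdigest(20))  # 20-byte digest, printed as hex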

Algorithm to generate a 12-byte hash from web URLs

I am crawling some websites for special items and storing them in a MongoDB server. To avoid duplicate items, I am using the hash value of the item link. Here is my code to generate the hash from the link:
import hashlib
from bson.objectid import ObjectId

def gen_objectid(link):
    """Generates an ObjectId from the given link."""
    return ObjectId(hashlib.shake_128(str(link).encode('utf-8')).digest(12))
I have no idea how the shake_128 algorithm works. That is where my question comes in.
Is it okay to use this method? Can I safely assume that the probability of a collision is negligible?
What is the better way to do this?
shake_128 is one of the SHA-3 family of algorithms, chosen as the result of a contest to be the next generation of secure hash algorithms. They are not widely used, since SHA-2 is still considered good enough in most cases. Since these algorithms are designed for cryptographically secure hashing, this should be overkill for what you are doing. One note on the name: the 128 in shake_128 refers to its security level in bits, not its output size; SHAKE is an extendable-output function, so .digest(12) really does give you 12 bytes (96 bits). That still gives you 2^96 ≈ 7.9e28 different hashes, and a collision only becomes likely after around 2^48 items (the birthday bound), so I think you will be just fine. If anything, I would say you could use a faster hashing algorithm, since you don't need cryptographic security in this case.
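As one such alternative (a sketch, not the asker's original code): blake2b from hashlib is designed to be faster than SHA-3 and lets you request exactly 12 bytes, which is what ObjectId expects:

import hashlib
from bson.objectid import ObjectId

def gen_objectid(link):
    # blake2b with digest_size=12 yields exactly the 12 bytes ObjectId needs
    digest = hashlib.blake2b(str(link).encode('utf-8'), digest_size=12).digest()
    return ObjectId(digest)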

What is the difference between crypto hashes and hashtable hashes in Python?

What is a cryptographic hash, and what are some algorithms? How is it different from a normal hash in Python? How can I determine which to use?
EX:
Cryptographic hash function
hello--aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
helld--44d634fa6b81353bc3ed424879ffd013501ade53
hash function
hash("hello") -1267296259
hash("helld") -1267296266
Please help me
Cryptographic hash functions are different from hashtable hash functions. One main difference is that cryptographic hash functions are designed to resist collisions and to be practically irreversible. Hashtable hash functions like hash() are faster, and are designed for quickly locating items in memory and for comparing items.
Consider two different scenarios. If you want to store passwords in a database, you should use something like PBKDF2: it is more secure, and deliberately slow to compute, in order to resist brute-force attacks. In another case you just want a set of items and to check whether an item exists in that set; there you can simply store a 32-bit or 64-bit hash of each item (e.g. of each class) and compare hashes quickly instead of the items themselves.
For example, for the string "hello", it is much faster to compute and store 1267296259 (a 32-bit integer) than the slower but more secure aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d.
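A minimal sketch contrasting the two (the salt and iteration count are illustrative, not recommendations):

import hashlib
import os

# Cryptographic: deliberately slow, suitable for storing passwords
salt = os.urandom(16)
pw_hash = hashlib.pbkdf2_hmac('sha256', b'hunter2', salt, 100000)

# Hashtable-style: fast, for in-memory lookups only (varies between runs)
seen = {hash('hello'), hash('helld')}
print(hash('hello') in seen)  # True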

Using Python hash() values as a reference number

I'm writing a scrapy spider that collects news articles from various online newspapers. The sites in question update at least once a day, and I'm going to run the spider just as often, so I need some way to filter out duplicates (i.e. articles I've already scraped).
In other cases it'd be as simple as comparing reference numbers, but newspaper articles don't have any reference numbers. I was wondering if it'd be possible to hash the title using Python's hash() function and use the resulting value as a stand-in for an actual reference number, just for comparison purposes?
On the surface it seems possible, but what do you guys think?
Yes, you can do that, but I'd not use hash() for this, as hash() is optimised for a different task and can lead too easily to collisions on larger texts (different inputs resulting in the same hash value).
Use a cryptographic hashing scheme instead; the hashlib module gives you access to MD5 and other algorithms and produce output that is far less likely to produce collisions.
For your purposes, MD5 will do just fine:
article_hash = hashlib.md5(scraped_info.encode('utf-8')).hexdigest()
This has the added advantage that the MD5 hash is always going to be calculated the same regardless of OS or system architecture; hash() can offer no such guarantee.
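As a usage sketch (the scraped_info argument and the in-memory seen set are assumptions for illustration; in practice you would persist the hashes, e.g. in your database):

import hashlib

seen_hashes = set()

def is_new_article(scraped_info):
    article_hash = hashlib.md5(scraped_info.encode('utf-8')).hexdigest()
    if article_hash in seen_hashes:
        return False
    seen_hashes.add(article_hash)
    return True

print(is_new_article('Some headline'))  # True
print(is_new_article('Some headline'))  # False: duplicate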

What's more random, hashlib or urandom?

I'm working on a project with a friend where we need to generate a random hash. Before we had time to discuss, we both came up with different approaches and because they are using different modules, I wanted to ask you all what would be better--if there is such a thing.
hashlib.sha1(str(random.random()).encode()).hexdigest()
or
os.urandom(16).hex()
Typing this question out has got me thinking that the second method is better. Simple is better than complex. If you agree, how reliable is this for 'randomly' generating hashes? How would I test this?
This solution:
os.urandom(16).hex()
is the best since it uses the OS to generate randomness which should be usable for cryptographic purposes (depends on the OS implementation).
random.random() generates pseudo-random values.
Hashing a random value does not add any new randomness.
random.random() is a pseudo-random generator, meaning the numbers are generated from a sequence: if you call random.seed(some_number), the sequence generated after that will always be the same.
os.urandom() gets its random numbers from the OS's RNG, which uses an entropy pool to collect real randomness, usually from random hardware events; there are even dedicated entropy-generator devices for systems that need a lot of random numbers.
On Unix systems there are traditionally two random number generators: /dev/random and /dev/urandom. Reads from the first block if not enough entropy is available, whereas /dev/urandom falls back to a pseudo-RNG instead of blocking when entropy runs low.
So the choice usually depends on what you need: if you need a few equally distributed random numbers, the built-in PRNG should be sufficient; for cryptographic use it's always better to use real random numbers.
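For what it's worth, on Python 3.6+ the standard-library secrets module wraps the OS RNG for exactly this kind of token generation:

import secrets

print(secrets.token_hex(16))  # 16 random bytes as 32 hex characters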
The second solution clearly has more entropy than the first. Assuming the quality of the source of the random bits would be the same for os.urandom and random.random:
In the second solution you are fetching 16 bytes = 128 bits worth of randomness
In the first solution you are fetching a floating point value which has roughly 52 bits of randomness (IEEE 754 double, ignoring subnormal numbers, etc...). Then you hash it around, which, of course, doesn't add any randomness.
More importantly, the quality of the randomness coming from os.urandom is expected and documented to be much better than the randomness coming from random.random. os.urandom's docstring says "suitable for cryptographic use".
Testing randomness is notoriously difficult - however, I would choose the second method, but ONLY (or, only as far as comes to mind) for this case, where the hash is seeded by a random number.
The whole point of hashes is to create a number that is vastly different based on slight differences in input. For your use case, the randomness of the input should do. If, however, you wanted to hash a file and detect one eensy byte's difference, that's when a hash algorithm shines.
I'm just curious, though: why use a hash algorithm at all? It seems that you're looking for a purely random number, and there are lots of libraries that generate UUIDs, which have far stronger guarantees of uniqueness than random number generators.
If you want a unique identifier (UUID), then you should use:
import uuid
uuid.uuid4().hex
https://docs.python.org/3/library/uuid.html
