Cheap mapping of string to small fixed-length string - python

Just for debugging purposes I would like to map a big string (a session_id, which is difficult to visualize) to a short, let's say 6-character, "hash". This hash does not need to be secure in any way, just cheap to compute and of fixed, reduced length (an MD5 hexdigest is too long). The input string can have any length.
How would you implement this "cheap_hash" in Python so that it is not expensive to compute? It should generate something like this:
def compute_cheap_hash(txt, length=6):
    # do some computation
    return cheap_hash

print(compute_cheap_hash("SDFSGSADSADFSasdfgsadfSDASAFSAGAsaDSFSA2345435adfdasgsaed"))
# desired output, something like:
# aBxr5u

I can't recall if MD5 is uniformly distributed, but it is designed to change a lot even for the smallest difference in the input.
Don't trust my math, but the collision chance on the first 6 hex digits of the MD5 hexdigest is 1 in 16^6 for a given pair of inputs, which is about 1 in 17 million.
So you can just use cheap_hash = lambda txt: hashlib.md5(txt.encode()).hexdigest()[:6] (the .encode() is needed in Python 3, where hashlib wants bytes).
After that you can call cheap_hash(any_input) anywhere you need it.
PS: Any algorithm can be used; MD5 is slightly cheaper to compute but hashlib.sha256 is also a popular choice.
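Putting that together, a minimal runnable sketch (Python 3; the explicit utf-8 encoding is my addition):
import hashlib

def cheap_hash(txt, length=6):
    # MD5 is fine here: this is a debugging aid, not a security measure.
    # .encode() is needed because hashlib operates on bytes, not str.
    return hashlib.md5(txt.encode("utf-8")).hexdigest()[:length]

print(cheap_hash("SDFSGSADSADFSasdfgsadfSDASAFSAGAsaDSFSA2345435adfdasgsaed"))
# prints the first 6 hex digits of the MD5 digest (illustrative output only)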

import hashlib

def cheaphash(string, length=6):
    # Hash once, then slice to the requested length; refuse requests
    # longer than the full hexdigest (64 characters for SHA-256).
    digest = hashlib.sha256(string.encode()).hexdigest()
    if length < len(digest):
        return digest[:length]
    else:
        raise ValueError("Length of {y} requested when the hash is only {x} characters long.".format(x=len(digest), y=length))
This should do what you need; it simply uses the hashlib module (imported at the top), hashing once and slicing the hexdigest to the requested length.

I found this similar question: https://stackoverflow.com/a/6048639/647991
So here is the function:
import hashlib

def compute_cheap_hash(txt, length=6):
    # This is just a hash for debugging purposes.
    # It does not need to be unique, just fast and short.
    h = hashlib.sha1(txt.encode())  # .encode() needed on Python 3
    return h.hexdigest()[:length]

Related

Hashing Algorithm to use for short unique content ID

I was wondering what would be the best hashing algorithm to use to create short + unique IDs for a list of content items. Each content item is an ASCII file on the order of 100-500 KB.
The requirements I have are:
Must be as short as possible; I have very limited space to store the IDs and would like to keep them to, say, < 10 characters each (when represented as ASCII)
Must be unique, i.e. no collisions, or at least a negligible chance of collisions
I don't need it to be cryptographically secure
I don't need it to be overly fast (each content item is pretty small)
I am trying to implement this in Python, so preferably an algorithm that has a Python implementation.
In lieu of any other recommendation, I've currently decided on the following approach. I use the BLAKE2 hashing algorithm to create a cryptographically secure hash of the file contents, to minimise the chance of collisions. I then base64-encode it to map it to an ASCII character set, of which I just take the first 8 characters.
Under the assumption that these characters are perfectly randomised, that gives 64^8 possible values the hash can take. I predict the upper limit to the number of content items I'll ever have is 50k, which gives a probability of at least one collision of about 0.00044%, which I think is acceptably low for my use case (I can always go up to 9 or 10 characters if needed in the future).
import hashlib
import base64

def get_hash(byte_content, size=8):
    # digest_size is in bytes; size * 3 bytes comfortably covers the
    # `size` base64 characters kept below (3 bytes -> 4 characters)
    hash_bytes = hashlib.blake2b(byte_content, digest_size=size * 3).digest()
    hash64 = base64.b64encode(hash_bytes).decode("utf-8")[:size]
    return hash64

# Example of use
get_hash(b"some random binary object")
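As a sanity check on the 0.00044% figure, the standard birthday approximation can be computed directly (the snippet below is mine, not part of the original question):
from math import exp

N = 64 ** 8                            # distinct 8-character base64 ids
n = 50_000                             # expected number of content items
p = 1 - exp(-n * (n - 1) / (2 * N))    # birthday-bound approximation
print("{:.6%}".format(p))              # ~0.000444%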

Shard/mod function

Let's say I have n machines and I need to allocate data across those machines as uniformly as possible. Let's use 5 for this example. And the data we have will look like:
id  state  name     date
1   'DE'   'Aaron'  2014-01-01
To shard on the id, I could do a function like:
machine_num = id % n
To shard on a string, I suppose the most basic way would be something like string-to-binary-to-number:
name_as_num = int(''.join(format(ord(i), 'b') for i in name), 2)
machine_num = name_as_num % n
Or even simpler:
machine_num = ord(name[0]) % n
What would be an example of how a date or timestamp could be sharded? And what might be a better function to shard a string (or even numeric) field than the ones I'm using above?
Since hash functions are meant to produce numbers that are evenly distributed, you can use the hash function for your purpose:
machine_num = hash(name) % n
Works for datetime objects too:
machine_num = hash(datetime(2019, 10, 2, 12, 0, 0)) % n
But as @jasonharper pointed out in the comments, the hash value of a given object is only guaranteed to be consistent within a single run of a program, so if you need the distribution to stay consistent across runs, you will have to write your own hashing function, like the ones in your question.
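To see that caveat in action, here is a small illustrative demonstration (my own, not from the original answer); each fresh interpreter spawns with a new hash seed:
import subprocess
import sys

cmd = [sys.executable, "-c", "print(hash('Aaron') % 5)"]
# Two fresh interpreter runs; with hash randomization (the default in
# Python 3), the two printed shard numbers will usually differ.
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())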
Without further knowledge about the structure and distribution of the keys used for shard operations, a hash function is a good approach. The Python standard library provides, in the zlib module, the simple functions adler32 and crc32, which take bytes (actually anything with the buffer interface) and return an unsigned 32-bit integer to which modulo can then be applied to get the machine number.
CRC and Adler are fast algorithms, but the documentation says that "Since the algorithm is designed for use as a checksum algorithm, it is not suitable for use as a general hash algorithm." So the distribution may not be optimally uniform.
Cryptographic hashes (slower, but with better distribution) are available through the hashlib module. They return their digest as a byte sequence, which can be converted to an integer with int.from_bytes.
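A minimal sketch of both options (the function names are mine); either one is stable across runs, unlike the built-in hash:
import hashlib
import zlib

def shard_crc32(key, n):
    # crc32 returns an unsigned 32-bit integer, consistent across runs
    return zlib.crc32(key.encode("utf-8")) % n

def shard_sha256(key, n):
    # better distribution, slower; digest bytes converted to an integer
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n

print(shard_crc32("Aaron", 5), shard_sha256("2014-01-01", 5))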

Using hashlib.sha256 to create a unique id; is this guaranteed to be unique?

I am trying to create a unique record id using the following function:
import hashlib
from base64 import b64encode

def make_uid(salt, pepper, key):
    s = b64encode(salt)
    p = b64encode(pepper)
    k = b64encode(key)
    return hashlib.sha256(s + p + k).hexdigest()
Where pepper is set like this:
uuid_pepper = uuid.uuid4()
pepper = str(uuid_pepper).encode('ascii')
And salt and key are the same values for every request.
My question is: because of the unique nature of the pepper, will make_uid in this instance always return a unique value, or is there a chance that it can create a duplicate?
The suggested answer is different because I'm not asking about the uniqueness of various UUID types; I'm wondering whether it's at all possible for a SHA256 hash to create a collision between two distinct inputs.
I think what you want to know is whether SHA256 is guaranteed to generate a unique hash result. The answer is yes and no. I got the following result from my research; it is not 100% accurate, but close.
In theory, SHA256 will collide. It has 2^256 possible outputs, so if we hash 2^256 + 1 distinct inputs, there must be a collision (by the pigeonhole principle). Even worse, by the birthday bound, the probability of a collision within about 2^130 hashes is already 99%.
But you probably won't generate one during your lifetime. Assume we have a computer that can calculate 10,000 hashes per second. It would take this computer about 4 * 10^27 years to finish 2^130 hashes. To give an idea of how large that number is: it is about 2 * 10^22 times as long as humans have existed on earth. Even if you had been hashing nonstop since the first day humans were on earth, the probability of a collision would still be vanishingly small.
Hope that answers your question.
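For what it's worth, the arithmetic above checks out; a rough back-of-the-envelope calculation (mine, not part of the original answer):
hashes_needed = 2 ** 130
rate = 10_000                        # hashes per second
seconds_per_year = 365 * 24 * 3600
years = hashes_needed / (rate * seconds_per_year)
print("%.1e years" % years)          # ~4.3e+27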

Hash function that protects against collisions, not attacks. (Produces a random UUID-size result space)

Using SHA1 to hash down larger strings so that they can be used as keys in a database.
Trying to produce a UUID-size string from the original string that is random enough and big enough to protect against collisions, but much smaller than the original string.
Not using this for anything security related.
Example:
import hashlib

# Take a very long string, hash it down to a smaller string behind the
# scenes, and use the hashed key as the database primary key instead.
def _get_database_key(very_long_key):
    return hashlib.sha1(very_long_key).digest()
Is SHA1 a good algorithm to be using for this purpose? Or is there something else that is more appropriate?
Python has a uuid library, based on RFC 4122.
The version that uses SHA1 is UUIDv5, so the code would be something like this:
import uuid
uuid.uuid5(uuid.NAMESPACE_OID, 'your string here')
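Note that uuid5 is deterministic for a given namespace and name, which is exactly what you want for a derived database key. A quick usage illustration (my own):
import uuid

key = uuid.uuid5(uuid.NAMESPACE_OID, "your string here")
print(key)        # same input always produces the same UUID
print(key.hex)    # 32 hex characters, if a bare string key is preferred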

using strings as python dictionaries (memory management)

I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use python but I'm beginning to think that python is not the best choice for this task. Am I wrong about it? is there a reasonable way to do it with python?
If I understand correctly, python dictionaries use open addressing which means that the key itself is also saved in the array. If this is indeed the case, it means that a python dictionary allows efficient lookup but is VERY bad in memory usage, thus if I have millions of sentences, they are all saved in the dictionary which is horrible since it exceeds the available memory - making the python dictionary an impractical solution.
Can someone confirm the preceding paragraph?
One solution that comes into mind is explicitly using a hash function (either use the builtin hash function, implement one or use the hashlib module) and instead of inserting ht[s]+=1, insert:
ht[hash(s)]+=1
This way the key stored in the array is an int (that will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? any other Pythonic solutions?
Thanks!
Yes, dicts store the key in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5: it produces a 16-byte digest, so collisions are unlikely.
Try BerkeleyDB for a disk based approach.
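A sketch of the MD5-as-integer-key idea (Python 3 syntax; the helper name md5_key and the sample sentences are mine):
import hashlib
from collections import defaultdict

def md5_key(sentence):
    # 16-byte MD5 digest packed into a single 128-bit integer key
    return int.from_bytes(hashlib.md5(sentence.encode("utf-8")).digest(), "big")

ht = defaultdict(int)
for s in ["a sentence", "another sentence", "a sentence"]:
    ht[md5_key(s)] += 1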
Python dicts are indeed monsters in memory. You can hardly operate on millions of keys when storing anything larger than integers. Consider the following code:
import random

d = {}
for x in xrange(5000000):  # it's 5 million
    d[x] = random.getrandbits(BITS)
With BITS = 64 it takes 510 MB of my RAM; for BITS = 128, 550 MB; for BITS = 256, 650 MB; for BITS = 512, 830 MB. Increasing the number of iterations to 10 million roughly doubles the memory usage. However, consider this snippet:
for x in xrange(5000000):  # it's 5 million
    d[x] = (random.getrandbits(64), random.getrandbits(64))
It takes 1.1 GB of my memory. Conclusion? If you want to keep two 64-bit integers, pack them into one 128-bit integer, like this:
for x in xrange(5000000):  # it's still 5 million
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)
It'll cut memory usage roughly in half.
It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 million keys when the values are just integers. You have the right idea with hashes, but you probably want to keep a pointer to each sentence as well, so that in case of a collision you can investigate (compare the sentences char by char and probably print them out). You could encode the pointer as an integer, for example by packing the file number and the offset into it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:
hashes = {}
collisions = {}
for s in sentences:
    ptr_value = pointer(s)   # encode (file number, offset) as an integer
    hash_value = hash(s)     # any integer hash of the sentence
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
So at the end you will have a collisions dictionary where the key is a pointer to a sentence and the value is a list of pointers to the sentences it collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
Perhaps pass the keys through MD5 first: http://docs.python.org/library/md5.html
I'm not sure exactly how large the data set you are comparing across is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter. Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document against many, I would suggest looking into Bloom filters; you can encode them however you find most efficient for your problem.
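For the curious, a hand-rolled Bloom filter sketch (the class name, bit-array size, and hash count here are illustrative assumptions, not tuned values):
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # carve num_hashes independent 8-byte slices out of one digest
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 8:(i + 1) * 8]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("some sentence")
print("some sentence" in bf)     # True
print("another sentence" in bf)  # False, with high probability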
