I was wondering what would be the best hashing algorithm to use to create short, unique IDs for a list of content items. Each content item is an ASCII file on the order of 100-500 KB.
The requirements I have are:
Must be as short as possible; I have very limited space to store the IDs and would like to keep them to, say, fewer than 10 characters each (when represented as ASCII)
Must be unique, i.e. no collisions or at least a negligible chance of collisions
I don't need it to be cryptographically secure
I don't need it to be overly fast (each content item is pretty small)
I am trying to implement this in Python, so preferably an algorithm that has a Python implementation.
In lieu of any other recommendation, I've currently decided on the following approach. I take the BLAKE2 hashing algorithm to create a hash of the file contents, so as to minimise the chance of collisions. I then base64-encode it to map it to an ASCII character set, of which I just take the first 8 characters.
Under the assumption that these characters are effectively random, that gives 64^8 combinations the ID can take. I predict the upper limit on the number of content items I'll ever have is 50k, which gives a probability of at least one collision of about 0.00044%, which I think is acceptably low for my use case (I can always go up to 9 or 10 characters if needed in the future).
import hashlib
import base64

def get_hash(byte_content, size=8):
    # size * 3 digest bytes base64-encode to size * 4 characters,
    # of which only the first `size` are kept.
    hash_bytes = hashlib.blake2b(byte_content, digest_size=size * 3).digest()
    hash64 = base64.b64encode(hash_bytes).decode("utf-8")[:size]
    return hash64

# Example of use
get_hash(b"some random binary object")
Related
I am trying to create a unique record id using the following function:
import hashlib
from base64 import b64encode

def make_uid(salt, pepper, key):
    s = b64encode(salt)
    p = b64encode(pepper)
    k = b64encode(key)
    return hashlib.sha256(s + p + k).hexdigest()
Where pepper is set like this:
import uuid

uuid_pepper = uuid.uuid4()
pepper = str(uuid_pepper).encode('ascii')
And salt and key are the same values for every request.
My question is, because of the unique nature of the pepper, will make_uid in this instance always return a unique value, or is there a chance that it can create a duplicate?
The suggested answer is different because I'm not asking about the uniqueness of various UUID types; I'm wondering whether it's at all possible for a SHA-256 hash to create a collision between two distinct inputs.
I think what you want to know is whether SHA-256 is guaranteed to generate a unique hash result. The answer is yes and no. The following comes from my research; it's not 100% accurate, but close.
In theory, SHA-256 can collide. It has 2^256 possible outputs, so if we hash 2^256 + 1 distinct inputs, there must be a collision (by the pigeonhole principle). Worse, by the birthday bound, the probability of a collision after about 2^130 hashes is already around 99%.
But you probably won't generate one during your lifetime. Assume we have a computer that can calculate 10,000 hashes per second. It would take that computer about 4 × 10^27 years to finish 2^130 hashes. To put that number in perspective, it is roughly 2 × 10^22 times longer than humans have existed on Earth. Even if you had been hashing continuously since the first day humans appeared, the probability of a collision would still be vanishingly small.
Hope that answers your question.
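For what it's worth, here is a quick empirical sketch reusing the question's make_uid (the salt and key values below are arbitrary examples of mine, not anything from the question). Duplicates don't show up in practice because every call gets a fresh uuid4 pepper, so the inputs themselves never repeat, and SHA-256 collisions between distinct inputs are astronomically unlikely:

import uuid

salt = b"example-salt"   # assumed constant values, as described in the question
key = b"example-key"

uids = {make_uid(salt, str(uuid.uuid4()).encode('ascii'), key) for _ in range(100_000)}
print(len(uids))  # expected: 100000, i.e. no duplicates observed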
I am creating an application related to files, and I was looking for ways to compute checksums for them. I want to know which is the best hashing method for calculating file checksums (MD5, SHA-1, or something else) based on these criteria:
The checksum should be unique. I know that's theoretical, but I still want the probability of collisions to be very, very small.
I can tell whether two files are equal by comparing their checksums.
Speed (not very important, but still).
Please feel free to be as elaborate as possible.
It depends on your use case.
If you're only worried about accidental collisions, both MD5 and SHA-1 are fine, and MD5 is generally faster. In fact, MD4 is also sufficient for most use cases, and usually even faster… but it isn't as widely implemented. (In particular, it isn't in hashlib.algorithms_guaranteed… although it should be in hashlib.algorithms_available on most stock Mac, Windows, and Linux builds.)
On the other hand, if you're worried about intentional attacks (i.e., someone deliberately crafting a bogus file that matches your hash), you have to consider the value of what you're protecting. MD4 is almost definitely not sufficient, MD5 is probably not sufficient, but SHA-1 is borderline. At present, Keccak (which will soon be SHA-3) is believed to be the best bet, but you'll want to stay on top of this, because things change every year.
The Wikipedia page on Cryptographic hash function has a table that's usually updated pretty frequently. To understand the table:
To generate a collision against MD4 requires only 3 rounds, while MD5 requires about 2 million, and SHA-1 requires 15 trillion. That's enough that it would cost a few million dollars (at today's prices) to generate a collision. That may or may not be good enough for you, but it's not good enough for NIST.
Also, remember that "generally faster" isn't nearly as important as "tested faster on my data and platform". With that in mind, in 64-bit Python 3.3.0 on my Mac, I created a 1MB random bytes object, then did this:
In [173]: md4 = hashlib.new('md4')
In [174]: md5 = hashlib.new('md5')
In [175]: sha1 = hashlib.new('sha1')
In [180]: %timeit md4.update(data)
1000 loops, best of 3: 1.54 ms per loop
In [181]: %timeit md5.update(data)
100 loops, best of 3: 2.52 ms per loop
In [182]: %timeit sha1.update(data)
100 loops, best of 3: 2.94 ms per loop
As you can see, md4 is significantly faster than the others.
Tests using hashlib.md5() instead of hashlib.new('md5'), and using bytes with less entropy (runs of 1-8 string.ascii_letters separated by spaces) didn't show any significant differences.
And, for the hash algorithms that came with my installation, as tested below, nothing beat md4.
for x in hashlib.algorithms_available:
    h = hashlib.new(x)
    print(x, timeit.timeit(lambda: h.update(data), number=100))
If speed is really important, there's a nice trick you can use to improve on this: Use a bad, but very fast, hash function, like zlib.adler32, and only apply it to the first 256KB of each file. (For some file types, the last 256KB, or the 256KB nearest the middle without going over, etc. might be better than the first.) Then, if you find a collision, generate MD4/SHA-1/Keccak/whatever hashes on the whole file for each file.
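Here is a minimal sketch of that two-tier idea (the 256KB prefix and the choice of MD4 for the full hash are just the assumptions from the paragraph above; MD4 may be missing from some builds, in which case fall back to MD5 or SHA-1):

import hashlib
import zlib

def quick_key(path, prefix_size=256 * 1024):
    # Cheap first pass: adler32 over only the first 256KB of the file.
    with open(path, 'rb') as f:
        return zlib.adler32(f.read(prefix_size))

def full_key(path, algorithm='md4', bufsize=8192):
    # Expensive second pass, used only when two files share a quick_key.
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()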
Finally, since someone asked in a comment how to hash a file without reading the whole thing into memory:
def hash_file(path, algorithm='md5', bufsize=8192):
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        # Read and hash the file in chunks instead of loading it all at once.
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()
If squeezing out every bit of performance is important, you'll want to experiment with different values for bufsize on your platform (powers of two from 4KB to 8MB). You also might want to experiment with using raw file handles (os.open and os.read), which may sometimes be faster on some platforms.
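If you do benchmark buffer sizes, a throwaway loop like this is usually all you need (the test file path is a placeholder, and hash_file is the function above):

import timeit

# Powers of two from 4KB (2**12) to 8MB (2**23).
for bufsize in (2 ** n for n in range(12, 24)):
    t = timeit.timeit(lambda: hash_file('some_test_file', 'md5', bufsize), number=10)
    print(bufsize, t)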
The collision possibilities with a hash of sufficiently many bits are, theoretically, quite small:
Assuming random hash values with a uniform distribution, a collection of n different data blocks and a hash function that generates b bits, the probability p that there will be one or more collisions is bounded by the number of pairs of blocks multiplied by the probability that a given pair will collide, i.e.
p <= n(n - 1)/2 * 1/2^b
And, at the time of writing, no SHA-1 (160-bit) collision had been observed. Assuming one exabyte (10^18 bytes) of data in 8KB blocks, the theoretical chance of a collision is about 10^-20, a very, very small chance.
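Plugging the numbers from that example into the bound above (a quick sketch of my own, not part of the quoted text):

# One exabyte of data in 8KB blocks, hashed with a 160-bit function such as SHA-1.
n = 10 ** 18 // 8192   # number of blocks, about 1.2e14
b = 160
print(n * (n - 1) / 2 / 2 ** b)   # about 5e-21, i.e. on the order of 10**-20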
A useful shortcut is to eliminate files known to be different from each other through short-circuiting.
For example, in outline:
Read the first X blocks of all files of interest;
Set aside the ones that have the same hash for the first X blocks as potentially containing the same file data;
For each file with the first X blocks that are unique, you can assume the entire file is unique vs all other tested files -- you do not need to read the rest of that file;
With the remaining files, read more blocks until you prove the signatures are the same or different.
With X blocks of sufficient size, 95%+ of the files will be correctly discriminated into unique files in the first pass. This is much faster than blindly reading the entire file and calculating the full hash for each and every file.
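A rough sketch of that first pass (the block count, block size, and hash choice here are arbitrary assumptions):

import hashlib
from collections import defaultdict

def prefix_digest(path, blocks=4, blocksize=8192):
    # Hash only the first few blocks of the file.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for _ in range(blocks):
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
    return h.digest()

def candidate_groups(paths):
    # Files with a unique prefix digest cannot be duplicates of anything;
    # only groups with more than one member need further reading.
    groups = defaultdict(list)
    for path in paths:
        groups[prefix_digest(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]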
MD5 tends to work great for checksums, as does SHA-1. Both have a very small probability of collisions, although SHA-1 has a slightly smaller collision probability since it uses more bits.
If you are really worried about it, you could use both checksums (one MD5 and one SHA-1). The chance that both match while the files differ is infinitesimally small (still not 100% impossible, but very, very unlikely). That said, this seems like bad form and is by far the slowest option.
Typically (read: in every instance I have ever encountered), an MD5 or a SHA-1 match is sufficient to assume uniqueness.
There is no way to 100% guarantee uniqueness short of a byte-by-byte comparison.
I created a small duplicate-file-remover script a few days back. It reads the content of each file, creates a hash for it, and then compares it with the next file; even if the names are different, the checksums of identical files will be the same.
import hashlib
import os

hash_table = {}
dups = []
path = "C:\\images"

for img in os.listdir(path):
    img_path = os.path.join(path, img)
    _file = open(img_path, "rb")
    content = _file.read()
    _file.close()
    md5 = hashlib.md5(content)
    _hash = md5.hexdigest()
    if _hash in hash_table:
        dups.append(img)
    else:
        hash_table[_hash] = img
Just for debugging purposes I would like to map a big string (a session_id, which is difficult to visualize) to a, let's say, 6 character "hash". This hash does not need to be secure in any way, just cheap to compute, and of fixed and reduced length (md5 is too long). The input string can have any length.
How would you implement this "cheap_hash" in python so that it is not expensive to compute? It should generate something like this:
def compute_cheap_hash(txt, length=6):
    # do some computation
    return cheap_hash
print compute_cheap_hash("SDFSGSADSADFSasdfgsadfSDASAFSAGAsaDSFSA2345435adfdasgsaed")
aBxr5u
I can't recall if MD5 is uniformly distributed, but it is designed to change a lot even for the smallest difference in the input.
Don't trust my math, but I guess the chance that two given inputs share the same first 6 hex digits of the MD5 hexdigest is 1 in 16^6, which is about 1 in 17 million.
So you can just cheap_hash = lambda input: hashlib.md5(input).hexdigest()[:6].
After that you can use hash = cheap_hash(any_input) anywhere.
PS: Any algorithm can be used; MD5 is slightly cheaper to compute but hashlib.sha256 is also a popular choice.
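One caveat: on Python 3, hashlib digests operate on bytes, not str, so a slightly fuller version of that one-liner needs an encode step, e.g.:

import hashlib

def cheap_hash(text, length=6):
    # Encode first; hashlib.md5 rejects str objects on Python 3.
    return hashlib.md5(text.encode('utf-8')).hexdigest()[:length]

print(cheap_hash("SDFSGSADSADFSasdfgsadfSDASAFSAGAsaDSFSA2345435adfdasgsaed"))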
def cheaphash(string, length=6):
    # Note: hashlib.sha256 expects bytes, so encode str input first.
    full_hash = hashlib.sha256(string).hexdigest()
    if length < len(full_hash):
        return full_hash[:length]
    else:
        raise Exception("Length too long. Length of {y} when hash length is {x}.".format(x=len(full_hash), y=length))
This should do what you need it to do, it simply uses the hashlib module, so make sure to import it before using this function.
I found this similar question: https://stackoverflow.com/a/6048639/647991
So here is the function:
import hashlib

def compute_cheap_hash(txt, length=6):
    # This is just a hash for debugging purposes.
    # It does not need to be unique, just fast and short.
    hash = hashlib.sha1()
    hash.update(txt)
    return hash.hexdigest()[:length]
I am using the following method to create a random code for users as part of a booking process:
User.objects.make_random_password()
When the users turn up at the venue, they will present the password.
Is it safe to assume that two people won't end up with the same code?
Thanks
No, it's not safe to assume that two people can't have the same code. Random doesn't mean unique. It may be unlikely and rare, depending on the length you specify and number of users you are dealing with. But you can't rely on its uniqueness.
It depends on how many users you have, the password length you choose, and how you use User.objects.make_random_password(). For the defaults, the chance is essentially zero, IMO.
This method is implemented using get_random_string(). From the django github repo:
def get_random_string(length=12,
                      allowed_chars='abcdefghijklmnopqrstuvwxyz'
                                    'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'):
    """
    Returns a securely generated random string.
    The default length of 12 with the a-z, A-Z, 0-9 character set returns
    a 71-bit value. log_2((26+26+10)^12) =~ 71 bits
    """
    if not using_sysrandom:
        # This is ugly, and a hack, but it makes things better than
        # the alternative of predictability. This re-seeds the PRNG
        # using a value that is hard for an attacker to predict, every
        # time a random string is required. This may change the
        # properties of the chosen random sequence slightly, but this
        # is better than absolute predictability.
        random.seed(
            hashlib.sha256(
                "%s%s%s" % (
                    random.getstate(),
                    time.time(),
                    settings.SECRET_KEY)
            ).digest())
    return ''.join([random.choice(allowed_chars) for i in range(length)])
According to github, the current code uses a 12 character password from a string of 62 characters (lower- and uppercase letters and numbers) by default. This makes for 62**12 or 3226266762397899821056 (3.22e21) possible different passwords. This is much larger than the current world population (around 7e9).
The letters are picked from this list of characters by the random.choice() function. The question now becomes how likely it is that the repeated calling of random.choice() returns the same sequence twice?
As you can see from the implementation of get_random_string(), the code tries hard to avoid predictability. When not using the OS's pseudo-random value generator (which on Linux and *BSD gathers real randomness from e.g. the times at which ethernet packets or keypresses arrive), it re-seeds the random module's Mersenne Twister predictable PRNG at each call with a combination of the current random state, the current time and (presumably constant) secret key.
So for two identical passwords to be generated, both the state of the random generator (which is about 8 kiB in python) and the time at which they are generated (measured in seconds since the epoch, as per time.time()) have to be identical. If the system's time is well-maintained and you are running one instance of the password generation program, the chance of that happening is essentially zero. If you start two or more instances of this password generating program at exactly the same time and with exactly the same seed for the PRNG, and combine their output, one would expect some passwords to appear more than once.
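To put a rough number on it, here is a quick sketch of the birthday bound for the default 12-character, 62-symbol passwords (the 100,000 figure is just an assumed number of bookings):

import math

space = 62 ** 12      # ~3.2e21 possible default passwords
bookings = 100_000    # assumed number of codes handed out
p = -math.expm1(-bookings * (bookings - 1) / (2 * space))
print(p)              # about 1.5e-12: effectively zero, but not exactly zero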
I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use python but I'm beginning to think that python is not the best choice for this task. Am I wrong about it? is there a reasonable way to do it with python?
If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the table. If this is indeed the case, it means that a Python dictionary allows efficient lookup but is very bad in memory usage: if I have millions of sentences, they are all stored in the dictionary, which is horrible since it may exceed the available memory, making the Python dictionary an impractical solution.
Can someone confirm the previous paragraph?
One solution that comes to mind is explicitly using a hash function (either the built-in hash function, one I implement myself, or the hashlib module) and, instead of ht[s] += 1, inserting:
ht[hash(s)]+=1
This way the key stored in the array is an int (that will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? any other Pythonic solutions?
Thanks!
Yes, a dict stores its keys in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5: its digest is 16 bytes (128 bits), so a collision is unlikely.
Try BerkeleyDB for a disk-based approach.
Python dicts are indeed memory monsters. You can hardly operate on millions of keys when storing anything larger than integers. Consider the following code:
for x in xrange(5000000):  # it's 5 millions
    d[x] = random.getrandbits(BITS)
For BITS = 64 it takes 510MB of my RAM, for BITS = 128 550MB, for BITS = 256 650MB, and for BITS = 512 830MB. Increasing the number of iterations to 10 million roughly doubles the memory usage. However, consider this snippet:
for x in xrange(5000000):  # it's 5 millions
    d[x] = (random.getrandbits(64), random.getrandbits(64))
It takes 1.1GB of my memory. Conclusion? If you want to keep two 64-bit integers, pack them into one 128-bit integer, like this:
for x in xrange(5000000):  # it's still 5 millions
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)
It'll roughly halve the memory usage.
It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 million keys when the values are just integers. You have a good idea with hashes, but you probably want to keep a pointer to each sentence, so that in case of a collision you can investigate (compare the sentences char by char and probably print them out). You could encode the pointer as an integer, for example by packing the file number and the offset into it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:
hashes = {}
collisions = {}

for s in sentences:
    ptr_value = pointer(s)   # make it an integer
    hash_value = hash(s)     # make it an integer
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
So at the end you will have a collisions dictionary where each key is a pointer to a sentence and the value is a list of pointers that it collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
Perhaps pass the keys through MD5: http://docs.python.org/library/md5.html
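For example (using hashlib rather than the long-deprecated md5 module; sentences is the iterable from the question):

import hashlib
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    # Store the 16-byte digest as the key instead of the full sentence text.
    ht[hashlib.md5(s.encode('utf-8')).digest()] += 1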
I'm not sure exactly how large the data set you are comparing across is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter ... Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document against many, I would suggest looking into Bloom filters; you can encode them however you find most efficient for your problem.
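If you do go the Bloom filter route, here is a very small sketch of the idea (the sizes and the way bit positions are derived are arbitrary choices of mine; for real use, a tested library is a better bet):

import hashlib

class TinyBloom:
    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from slices of one md5 digest.
        digest = hashlib.md5(item.encode('utf-8')).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 3:i * 3 + 3], 'big') % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report True for an item never added (false positive),
        # but never False for an item that was added.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

Usage is simply bf = TinyBloom(); bf.add(sentence); sentence in bf, keeping in mind that a True answer can occasionally be a false positive.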