I am creating an application related to files, and I was looking for ways to compute checksums for them. I want to know which hashing method is best for calculating file checksums, MD5 or SHA-1 or something else, based on these criteria:
The checksum should be unique. I know it's theoretical, but I still want the probability of collisions to be very, very small.
I can compare two files for equality by checking whether their checksums are equal.
Speed (not very important, but still).
Please feel free to be as elaborative as possible.
It depends on your use case.
If you're only worried about accidental collisions, both MD5 and SHA-1 are fine, and MD5 is generally faster. In fact, MD4 is also sufficient for most use cases, and usually even faster… but it isn't as widely implemented. (In particular, it isn't in hashlib.algorithms_guaranteed… although it should be in hashlib.algorithms_available on most stock Mac, Windows, and Linux builds.)
On the other hand, if you're worried about intentional attacks—i.e., someone intentionally crafting a bogus file that matches your hash—you have to consider the value of what you're protecting. MD4 is almost definitely not sufficient, MD5 is probably not sufficient, but SHA-1 is borderline. At present, Keccak (which will soon be SHA-3) is believed to be the best bet, but you'll want to stay on top of this, because things change every year.
The Wikipedia page on Cryptographic hash function has a table that's usually updated pretty frequently. To understand the table:
Generating a collision against MD4 requires only 3 rounds, while MD5 requires about 2 million and SHA-1 about 15 trillion. That's enough that it would cost a few million dollars (at today's prices) to generate a SHA-1 collision. That may or may not be good enough for you, but it's not good enough for NIST.
Also, remember that "generally faster" isn't nearly as important as "tested faster on my data and platform". With that in mind, in 64-bit Python 3.3.0 on my Mac, I created a 1MB random bytes object, then did this:
In [173]: md4 = hashlib.new('md4')
In [174]: md5 = hashlib.new('md5')
In [175]: sha1 = hashlib.new('sha1')
In [180]: %timeit md4.update(data)
1000 loops, best of 3: 1.54 ms per loop
In [181]: %timeit md5.update(data)
100 loops, best of 3: 2.52 ms per loop
In [182]: %timeit sha1.update(data)
100 loops, best of 3: 2.94 ms per loop
As you can see, md4 is significantly faster than the others.
Tests using hashlib.md5() instead of hashlib.new('md5'), and using bytes with less entropy (runs of 1-8 string.ascii_letters separated by spaces) didn't show any significant differences.
And, for the hash algorithms that came with my installation, as tested below, nothing beat md4.
import hashlib, timeit
# `data` is the 1MB random bytes object from the timings above
for x in hashlib.algorithms_available:
    h = hashlib.new(x)
    print(x, timeit.timeit(lambda: h.update(data), number=100))
If speed is really important, there's a nice trick you can use to improve on this: Use a bad, but very fast, hash function, like zlib.adler32, and only apply it to the first 256KB of each file. (For some file types, the last 256KB, or the 256KB nearest the middle without going over, etc. might be better than the first.) Then, if you find a collision, generate MD4/SHA-1/Keccak/whatever hashes on the whole file for each file.
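A sketch of that two-tier idea (the function names and the use of adler32 over a 256KB head are illustrative choices, not from the original answer):

```python
import hashlib
import zlib

def quick_key(path, chunk_size=256 * 1024):
    """Cheap first-pass key: (bytes read, adler32) over the first 256KB."""
    with open(path, "rb") as f:
        head = f.read(chunk_size)
    return (len(head), zlib.adler32(head))

def full_digest(path, algorithm="md5", bufsize=8192):
    """Strong second-pass hash over the whole file, read in blocks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()
```

Only files whose quick_key values collide need the much slower full_digest pass.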
Finally, since someone asked in a comment how to hash a file without reading the whole thing into memory:
def hash_file(path, algorithm='md5', bufsize=8192):
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()
If squeezing out every bit of performance is important, you'll want to experiment with different values for bufsize on your platform (powers of two from 4KB to 8MB). You also might want to experiment with using raw file handles (os.open and os.read), which may sometimes be faster on some platforms.
The collision probability with a hash of a sufficient number of bits is, theoretically, quite small:
Assuming random hash values with a uniform distribution, a collection of n different data blocks and a hash function that generates b bits, the probability p that there will be one or more collisions is bounded by the number of pairs of blocks multiplied by the probability that a given pair will collide, i.e.
p ≤ n(n − 1)/2 × 1/2^b
And, so far, no SHA-1 collisions (160 bits) have been observed. Assuming one exabyte (10^18 bytes) of data in 8KB blocks, the theoretical chance of a collision is about 10^-20 -- a very, very small chance.
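That figure can be checked directly against the birthday bound p ≤ n(n − 1)/2 × 1/2^b (a rough sanity check, not part of the quoted source):

```python
n = 10**18 // 8192   # number of 8KB blocks in one exabyte of data
b = 160              # SHA-1 digest size in bits

pairs = n * (n - 1) // 2
p = pairs / 2.0**b   # upper bound on the probability of any collision
# p comes out on the order of 10^-20, matching the figure above
```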
A useful shortcut is to eliminate files known to be different from each other through short-circuiting.
For example, in outline:
Read the first X blocks of all files of interest;
Sort the ones that share the same hash for the first X blocks as potentially the same file data;
For each file with the first X blocks that are unique, you can assume the entire file is unique vs all other tested files -- you do not need to read the rest of that file;
With the remaining files, read more blocks until you prove the signatures are the same or different.
With X blocks of sufficient size, 95%+ of the files will be correctly discriminated into unique files in the first pass. This is much faster than blindly reading the entire file and calculating the full hash for each and every file.
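A minimal sketch of that outline (the function name, the 64KB head size, and the use of MD5 for the first pass are my choices, not the answer's):

```python
import hashlib
import os
from collections import defaultdict

def find_possible_duplicates(paths, head_size=64 * 1024):
    """Bucket files by (size, hash of the first head_size bytes).
    A file alone in its bucket is unique; only multi-file buckets
    need any further reading."""
    buckets = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            head = f.read(head_size)
        key = (os.path.getsize(path), hashlib.md5(head).hexdigest())
        buckets[key].append(path)
    return [group for group in buckets.values() if len(group) > 1]
```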
MD5 tends to work great for checksums, as does SHA-1. Both have a very small probability of collisions, though SHA-1's is slightly smaller since it uses more bits.
If you are really worried about it, you could use both checksums (one MD5 and one SHA-1). The chance that both match while the files differ is infinitesimally small (still not 100% impossible, but very, very unlikely). However, this seems like bad form and is by far the slowest solution.
Typically (read: in every instance I have ever encountered) an MD5 or a SHA-1 match is sufficient to assume uniqueness.
There is no way to 100% guarantee uniqueness short of a byte-by-byte comparison.
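For that guaranteed byte-by-byte comparison, the standard library already has one (a small illustration, not from the original answer):

```python
import filecmp

def definitely_equal(path_a, path_b):
    # shallow=False forces a real content comparison instead of
    # trusting os.stat() signatures
    return filecmp.cmp(path_a, path_b, shallow=False)
```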
I created a small duplicate-file remover script a few days back. It reads the content of each file and creates a hash for it, then compares it with the next file; even when the names differ, the checksums are the same for identical content.
import hashlib
import os

hash_table = {}
dups = []
path = "C:\\images"
for img in os.listdir(path):
    img_path = os.path.join(path, img)
    with open(img_path, "rb") as _file:
        content = _file.read()
    _hash = hashlib.md5(content).hexdigest()
    if _hash in hash_table:
        dups.append(img)
    else:
        hash_table[_hash] = img
Related
I was wondering what would be the best hashing algorithm to use to create short, unique IDs for a list of content items. Each content item is an ASCII file on the order of 100-500KB.
The requirements I have are:
Must be as short as possible; I have very limited space to store the IDs and would like to keep them to, say, < 10 characters each (when represented as ASCII)
Must be unique, i.e. no collisions, or at least a negligible chance of collisions
I don't need it to be cryptographically secure
I don't need it to be overly fast (each content item is pretty small)
I am trying to implement this in Python, so preferably an algorithm that has a Python implementation.
In lieu of any other recommendation, I've currently decided on the following approach: I use the BLAKE2 hashing algorithm to create a cryptographically secure hash of the file contents, to minimise the chance of collisions. I then base64-encode it to map it to an ASCII character set, and take the first 8 characters.
Under the assumption that those characters are perfectly randomised, that gives 64^8 combinations the ID can take. I predict the upper limit on the number of content items I'll ever have is 50k, which gives a probability of at least one collision of 0.00044% -- acceptably low for my use case, I think (and I can always go up to 9 or 10 characters if needed in the future).
import hashlib
import base64

def get_hash(byte_content, size=8):
    hash_bytes = hashlib.blake2b(byte_content, digest_size=size * 3).digest()
    hash64 = base64.b64encode(hash_bytes).decode("utf-8")[:size]
    return hash64

# Example of use
get_hash(b"some random binary object")
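The 0.00044% figure can be reproduced with the same birthday arithmetic (a sanity check on the estimate above):

```python
n = 50_000       # predicted upper limit on content items
space = 64 ** 8  # distinct 8-character base64 IDs

p = n * (n - 1) / 2 / space
# p is about 4.4e-6, i.e. roughly 0.00044%
```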
EDIT: Python 2.7.8
I have two files. p_m has a few hundred records that contain acceptable values in column 2. p_t has tens of millions of records in which I want to make sure that column 14 is from the set of acceptable values already mentioned. So in the first while loop I'm reading in all the acceptable values, making a set (for de-duping), and then turning that set into a list (I didn't benchmark to see if a set would have been faster than a list, actually...). I got it down to about as few lines as possible in the second loop, but I don't know if they are the FASTEST few lines (I'm using the [14] index twice because exceptions are so very rare that I didn't want to bother with an assignment to a variable). Currently it takes about 40 minutes to do a scan. Any ideas on how to improve that?
import sets

def contentScan(p_m, p_t):
    """ """
    vcont = sets.Set()
    i = 0
    h = open(p_m, "rb")
    while(True):
        line = h.readline()
        if not line:
            break
        i += 1
        vcont.add(line.split("|")[2])
    h.close()
    vcont = list(vcont)
    vcont.sort()
    i = 0
    h = open(p_t, "rb")
    while(True):
        line = h.readline()
        if not line:
            break
        i += 1
        if line.split("|")[14] not in vcont:
            print "%s is not defined in the matrix." % line.split("|")[14]
            return 1
    h.close()
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Checking for membership in a set of a few hundred items is much faster than checking for membership in the equivalent list. However, given your staggering 40-minutes running time, the difference may not be that meaningful. E.g:
ozone:~ alex$ python -mtimeit -s'a=list(range(300))' '150 in a'
100000 loops, best of 3: 3.56 usec per loop
ozone:~ alex$ python -mtimeit -s'a=set(range(300))' '150 in a'
10000000 loops, best of 3: 0.0789 usec per loop
so if you're checking "tens of millions of times" using the set should save you tens of seconds -- better than nothing, but barely measurable.
The same consideration applies for other very advisable improvements, such as turning the loop structure:
h = open(p_t, "rb")
while(True):
    line = h.readline()
    if not line:
        break
    ...
h.close()
into a much-sleeker:
with open(p_t, 'rb') as h:
    for line in h:
        ...
again, this won't save you as much as a microsecond per iteration -- so, over, say, 50 million lines, that's less than one of those 40 minutes. Ditto for the removal of the completely unused i += 1 -- it makes no sense for it to be there, but taking it away will make little difference.
One answer focused on the cost of the split operation. That depends on how many fields per record you have, but, for example:
ozone:~ alex$ python -mtimeit -s'a="xyz|"*20' 'a.split("|")[14]'
1000000 loops, best of 3: 1.81 usec per loop
so, again, whatever optimization here could save you maybe at most a microsecond per iteration -- again, another minute shaved off, if that.
Really, the key issue here is why reading and checking, e.g., 50 million records should take as much as 40 minutes -- 2400 seconds -- 48 microseconds per line; and no doubt still more than 40 microseconds per line even with all the optimizations mentioned here and in other answers and comments.
So once you have applied all the optimizations (and confirmed the code is still just too slow), try profiling the program -- per e.g http://ymichael.com/2014/03/08/profiling-python-with-cprofile.html -- to find out exactly where all of the time is going.
Also, just to make sure the problem isn't simply I/O to some peculiarly slow disk, do a run with the meaty part of the big loop "commented out" -- just reading the big file and doing no processing or checking at all. That will tell you the "irreducible" I/O overhead. If I/O is responsible for the bulk of your elapsed time, you can't do much to improve things in code (though changing the open to open(thefile, 'rb', HUGE_BUFFER_SIZE) might help a bit), and you may want to consider improving the hardware setup instead -- defragment the disk, use a local rather than remote filesystem, and so on.
The list lookup was the issue (as you correctly noticed). Searching a list has O(n) time complexity, where n is the number of items stored in the list. On the other hand, finding a value in a hashtable (which is what the Python dictionary actually is) has O(1) complexity. Since you have hundreds of items in the list, a list lookup is about two orders of magnitude more expensive than a dictionary lookup. This is in line with the 34x improvement you saw when replacing the list with the dict.
To further reduce execution time by 5-10x, you can use a Python JIT. I personally like PyPy: http://pypy.org/features.html. You do not need to modify your script; just install pypy and run:
pypy [your_script.py]
EDIT: Made more pythony.
EDIT 2: Using set builtin rather than dict.
Based on the comments, I decided to try using a dict instead of a list to store the acceptable values against which I'd be checking the big file (I did keep a watchful eye on .split but did not change it). Based on just changing the list to a dict, I saw an immediate and HUGE improvement in execution time.
Using timeit and running 5 iterations over a million-line file, I get 884.2 seconds for the list-based checker, and 25.4 seconds for the dict-based checker! So like a 34x improvement for changing 2 or 3 lines.
Thanks all for the inspiration! Here's the solution I landed on:
def contentScan(p_m, p_t):
    """ """
    vcont = set()
    with open(p_m, 'rb') as h:
        for line in h:
            vcont.add(line.split("|")[2])
    with open(p_t, "rb") as h:
        for line in h:
            if line.split("|")[14] not in vcont:
                print "%s is not defined in the matrix." % line.split("|")[14]
                return 1
    print "PASS All variable_content_id values exist in the matrix."
    return 0
Yes, it's not optimal at all. split is EXPENSIVE as hell (it creates a new list, creates N strings, and appends them all to the list). Instead, scan for the 14th "|", then scan for the 15th "|" (starting from the 14th's position), and take line[pos14 + 1:pos15].
I'm pretty sure you can make it run 2-10x faster with this small change. To go further, you don't even need to extract the string: loop through the valid strings and, for each one, start at pos14 + 1 and compare char by char while the chars match. If you end up at a "|" for one of the strings, it's a match. It will also help a bit to sort the valid-strings list by frequency in the data file. But not creating a list with dozens of strings on every step is far more important.
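A hedged sketch of that no-split extraction (the function name is mine; field index 14 is the one the question checks):

```python
def field14(line, sep="|"):
    """Return field index 14 without building the full split() list."""
    pos = -1
    for _ in range(14):               # skip the 14 separators before field 14
        pos = line.find(sep, pos + 1)
        if pos == -1:
            return None               # fewer than 15 fields
    end = line.find(sep, pos + 1)
    return line[pos + 1:end] if end != -1 else line[pos + 1:]
```

field14(line) gives the same value as line.split("|")[14] while creating only one small string.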
Here are your tests:
generator (you can adjust it to make up some realistic data);
no-op (just reading);
the original solution;
the no-split solution.
I wasn't able to get the code to format here, so it's in this gist:
https://gist.github.com/iced/16fe907d843b71dd7b80
Test conditions: VBox with 2 cores and 1GB RAM running ubuntu 14.10 with latest updates. Each variation was executed 10 times, with rebooting VBox before each run and throwing off lowest and highest run time.
Results:
no_op: 25.025
original: 26.964
no_split: 25.102
original - no_op: 1.939
no_split - no_op: 0.077
In this particular case, though, the optimization is useless, because the majority of the time was spent in I/O; I was unable to find a test setup that made I/O less than 70% of the total. In any case, split IS expensive and should be avoided when it's not needed.
PS. Yes, I understand that with even 1K of good items it's far better to use a hash (actually, it's better from the point at which hash computation becomes faster than lookup -- probably around 100 elements); my point is that split is expensive in this case.
I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences.
The naive way is something like this:
from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    ht[s] += 1
I usually use Python, but I'm beginning to think that Python is not the best choice for this task. Am I wrong about that? Is there a reasonable way to do it with Python?
If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the array. If this is indeed the case, it means that a Python dictionary allows efficient lookup but is VERY bad in memory usage; if I have millions of sentences, they would all be saved in the dictionary, which is horrible since it could exceed the available memory -- making the Python dictionary an impractical solution.
Can someone confirm the previous paragraph?
One solution that comes to mind is to use a hash function explicitly (either use the builtin hash function, implement one, or use the hashlib module) and, instead of inserting ht[s] += 1, insert:
ht[hash(s)]+=1
This way the key stored in the array is an int (that will be hashed again) instead of the full sentence.
Will that work? Should I expect collisions? any other Pythonic solutions?
Thanks!
Yes, a dict stores its keys in memory. If your data fits in memory, this is the easiest approach.
A hash should work. Try MD5: it's a 16-byte (128-bit) value, so collisions are unlikely.
Try BerkeleyDB for a disk-based approach.
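A small sketch of that MD5-keyed counting (the helper name and UTF-8 encoding are my assumptions):

```python
import hashlib
from collections import defaultdict

def sentence_key(s):
    # store a 128-bit integer digest instead of the sentence itself
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

ht = defaultdict(int)
for s in ["the cat sat", "the cat sat", "a dog ran"]:
    ht[sentence_key(s)] += 1
```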
Python dicts are indeed memory monsters. You can hardly operate with millions of keys when storing anything larger than integers. Consider the following code:
for x in xrange(5000000): # it's 5 millions
    d[x] = random.getrandbits(BITS)
With BITS = 64 it takes 510MB of my RAM; with 128, 550MB; with 256, 650MB; with 512, 830MB. Increasing the number of iterations to 10 million roughly doubles the memory usage. However, consider this snippet:
for x in xrange(5000000): # it's 5 millions
    d[x] = (random.getrandbits(64), random.getrandbits(64))
It takes 1.1GB of my memory. Conclusion? If you want to keep two 64-bits integers, use one 128-bits integer, like this:
for x in xrange(5000000): # it's still 5 millions
    d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)
It'll reduce memory usage by two.
It depends on your actual memory limit and the number of sentences, but you should be safe using dictionaries with 10-20 million keys when using just integers. You have a good idea with hashes, but you probably want to keep a pointer to each sentence, so that in case of a collision you can investigate (compare the sentences char by char and probably print them out). You could create the pointer as an integer, for example by encoding the file number and offset in it. If you don't expect a massive number of collisions, you can simply set up another dictionary for storing only the collisions, for example:
hashes = {}
collisions = {}
for s in sentences:
    ptr_value = pointer(s)  # make it an integer
    hash_value = hash(s)    # make it an integer
    if hash_value in hashes:
        collisions.setdefault(hashes[hash_value], []).append(ptr_value)
    else:
        hashes[hash_value] = ptr_value
At the end you will have a collisions dictionary where each key is a pointer to a sentence and the value is an array of the pointers it collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
Perhaps pass the keys through MD5: http://docs.python.org/library/md5.html
I'm not sure exactly how large a data set you are comparing across, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter. Another avenue to consider would be something simple like cosine similarity or edit distance between documents, but if you are trying to compare one document against many, a Bloom filter -- which you can encode however you find most efficient for your problem -- is the better fit.
I'm going to have one small dictionary (between 5 and 20 keys) that will be referenced up to a hundred times or so per page load, in Python 2.5.
I'm starting to name the keys it will be looking up, and I was wondering if there is a key naming convention I could follow to help dict lookup times.
I had to test ;-)
Using
f1: integer key, 1
f2: short string, "one"
f3: long string, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
as one of the keys into a dictionary of length 4, iterating 10,000,000 times and measuring the times, I get this result:
<function f1 at 0xb779187c>
f1 3.64
<function f2 at 0xb7791bfc>
f2 3.48
<function f3 at 0xb7791bc4>
f3 3.65
I.e., no difference...
My code
There may be sensible names for them that just so happen to produce names whose hashes aren't clashing. However, CPython dicts are already one of the most optimized data structures in the known universe, producing few collisions for most inputs, working well with the hash schemes of other builtin types, resolving clashes very fast, etc. It's extremely unlikely you'll see any benefit at all even if you found something, especially since a hundred lookups aren't really that many.
Take, for example, this timeit benchmark run on my 4-year-old desktop machine (sporting a laughably low-budget dual-core CPU at 3.1 GHz):
...>python -mtimeit --setup="d = {chr(i)*100: i for i in range(15)};\
k = chr(7)*100" "d[k]"
1000000 loops, best of 3: 0.222 usec per loop
And those strings are a dozen times larger than everything that's remotely sensible to type out manually as a variable name. Cutting down the length from 100 to 10 leads to 0.0778 microseconds per lookup. Now measure your page's load speed and compare those (alternatively, just ponder how long the work you're actually doing when building the page will take); and take into account caching, framework overhead, and all these things.
Nothing you do in this regard can make a difference performance-wise, period, full stop.
Because the Python string hash function iterates over the chars (at least if this is still applicable), I'd opt for short strings.
To add another aspect:
For very small dictionaries under heavy timing constraints, the time to compute hashes might be a substantial fraction of the overall time. Therefore, for (say) 5 elements, it might be faster to use an array and a sequential search (of course, wrapped up in some MiniDictionary object), maybe even augmented by a binary search. This might find the element in 2-3 comparisons, which may or may not be faster than hash computation plus one compare.
The break-even point depends on the hash speed, the average number of elements, and the number of hash collisions to expect, so some measurement is required, and there is no one-size-fits-all answer.
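A minimal sketch of such a MiniDictionary (linear scan over parallel lists; illustrative only, and worth benchmarking against a plain dict before use):

```python
class MiniDict:
    """Tiny mapping that skips hashing entirely: for a handful of keys,
    a sequential search can compete with computing a hash."""
    def __init__(self, items):
        self._keys = [k for k, _ in items]
        self._values = [v for _, v in items]

    def __getitem__(self, key):
        for i, k in enumerate(self._keys):
            if k == key:              # one comparison per stored key
                return self._values[i]
        raise KeyError(key)
```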
Python dictionaries have a fast path for string keys, so use these (rather than, say, tuples). The hash value of a string is cached in that string, so it's more important that the strings remain the same ones than their actual value; string constants (i.e., strings that appear verbatim in the program and are not the result of a calculation) always remain exactly the same, so as long as you use those, there's no need to worry.
This problem might be relatively simple, but I'm given two text files. One contains the passwords, encrypted via crypt.crypt in Python. The other contains over 400k normal dictionary words.
The assignment: I'm given three functions that transform strings -- one produces all the different permutations of a word's capitalization, one transforms a letter to a look-alike number (e.g., G -> 6, B -> 8), and one reverses a string. Given the 10-20 encrypted passwords in the password file, what is the most efficient way, in Python, to run those functions over every dictionary word in the words file? It is given that each encrypted password is the encryption of some dictionary word transformed in one of those ways.
Here is the function which checks if a given string, when encrypted, is the same as the encrypted password passed in:
def check_pass(plaintext, encrypted):
    crypted_pass = crypt.crypt(plaintext, encrypted)
    if crypted_pass == encrypted:
        return True
    else:
        return False
Thanks in advance.
Without knowing details about the underlying hash algorithm and possible weaknesses of the algorithm all you can do is to run a brute-force attack trying all possible transformations of the words in your password list.
The only way to speed up such a brute-force attack is to get more powerful hardware and to split the task and run the cracker in parallel.
On my slow laptop, crypt.crypt takes about 20 microseconds:
$ python -mtimeit -s'import crypt' 'crypt.crypt("foobar", "zappa")'
10000 loops, best of 3: 21.8 usec per loop
so, the brute force approach (really the only sensible one) is "kinda" feasible. By applying your transformation functions you'll get (ballpark estimate) about 100 transformed words per dictionary word (mostly from the capitalization changes), so, about 40 million transformed words out of your whole dictionary. At 20 microseconds each, that will take about 800 seconds, call it 15 minutes, for the effort of trying to crack one of the passwords that doesn't actually correspond to any of the variations; expected time about half that, to crack a password that does correspond.
So, if you have 10 passwords to crack, and they all correspond to a transformed dictionary word, you should be done in an hour or two. Is that OK? Because there isn't much else you can do except distribute this embarrassingly parallel problem over as many nodes and cores as you can grasp (oh, and use a faster machine in the first place -- that might buy you perhaps a factor of two or thereabouts).
There is no deep optimization trick that you can add, so the general logic will be that of a triple-nested loop: one level loops over the encrypted passwords, one over the words in the dictionary, one over the variants of each dictionary word. There isn't much difference regarding how you nest things (except the loop on the variants must come within the loop on the words, for simplicity). I recommend encapsulating "give me all variants of this word" as a generator (for simplicity, not for speed) and otherwise minimizing the number of function calls (e.g. there is no reason to use that check_pass function since the inline code is just as clear, and will be microscopically faster).
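The triple-nested loop might be sketched like this (the variants generator is only a placeholder for the assignment's three real transform functions, and check is any predicate such as the check_pass above):

```python
import hashlib

def variants(word):
    """Placeholder for the real transforms (capitalization permutations,
    look-alike digit substitution, reversal); only a few are sketched."""
    yield word
    yield word[::-1]
    yield word.lower()
    yield word.upper()
    yield word.capitalize()

def crack(encrypted_passwords, dictionary_words, check):
    """Outer loop over words, then variants, then the short password list."""
    cracked = {}
    for word in dictionary_words:
        for candidate in variants(word):
            for enc in encrypted_passwords:
                if enc not in cracked and check(candidate, enc):
                    cracked[enc] = candidate
    return cracked
```

With crypt available, check would simply be lambda cand, enc: crypt.crypt(cand, enc) == enc.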