I need to compare and get the difference of 2 large binary files (up to 100 MB).
For ASCII format I can use this:
import difflib
file1 = open('large1.txt', 'r')
file2 = open('large2.txt', 'r')
diff = difflib.ndiff(file1.readlines(), file2.readlines())
difference = ''.join(x[2:] for x in diff if x.startswith('- '))
print(difference)
How would one make it work for binary files? I tried different encodings and binary read mode, but nothing has worked yet.
EDIT: I use .vcl binary files.
difflib will be extremely slow for large files; 100 MB would be categorized as very large. From the docs:
Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.
If you can tolerate the slowness, try difflib.SequenceMatcher; it works for almost any type of data.
This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
Python docs - class difflib.SequenceMatcher
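As a rough sketch of applying it to binary data (the file names are placeholders, and this will still be slow on 100 MB inputs), you can read both files as bytes and walk the opcodes:

import difflib

# Read both files as raw bytes; byte values are hashable, so
# SequenceMatcher accepts them as sequence elements.
with open('large1.vcl', 'rb') as f1, open('large2.vcl', 'rb') as f2:
    data1 = f1.read()
    data2 = f2.read()

sm = difflib.SequenceMatcher(None, data1, data2, autojunk=False)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag != 'equal':
        # Byte ranges that differ between the two files
        print(tag, 'file1[%d:%d]' % (i1, i2), 'file2[%d:%d]' % (j1, j2))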
Related
I have a collection of a few thousand strings (DNA sequences). I want to trim this down to a couple of hundred (exact number not critical) by excluding sequences that are very similar.
I can do this via matching using the "Levenshtein" module. It works, but it's pretty slow, and I'm pretty sure there must be a much faster way. The code here is the same approach but applied to words, to make it more testable; for me with this cutoff it takes about 10 seconds and collects ~1000 words.
import Levenshtein as lev
import random

# Word list as a stand-in for the DNA sequences
f = open("/usr/share/dict/words", 'r')
txt = f.read().splitlines()
f.close()

cutoff = .5
collected = []
while len(txt) > 0:
    a = random.choice(txt)
    collected.append(a)
    # Keep only the words that are NOT similar to the one just collected
    txt = list(filter(lambda b: lev.ratio(a, b) < cutoff, txt))
I've tried a few different variations and some other matching modules (Jellyfish) without getting significantly faster.
Maybe you could use locality-sensitive hashing. You could hash all strings into buckets so that all strings in one bucket are very similar to each other. Then you can pick just one string from each bucket.
I've never used this personally (I don't know how well it would work in your case), it's just an idea.
related: Python digest/hash for string similarity
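A rough sketch of that bucketing idea (the 3-character shingles and the 4-hash MinHash signature are arbitrary choices and would need tuning; txt is the word list from the question):

import random

random.seed(0)
SALTS = [random.getrandbits(32) for _ in range(4)]   # 4 min-hashes per string

def shingles(s, k=3):
    # Overlapping character k-grams of the string
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def signature(s):
    # One min-hash per salt; similar strings tend to get the same signature
    return tuple(min(hash((salt, sh)) for sh in shingles(s)) for salt in SALTS)

buckets = {}
for word in txt:
    buckets.setdefault(signature(word), word)   # keep one word per bucket

collected = list(buckets.values())

With such a short signature, only strings that share most of their shingles land in the same bucket, so the behaviour is much coarser than a lev.ratio cutoff.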
I'm not quite sure what your application aims at, but when working with sequences you might want to consider using Biopython. An alternative way to calculate the distance between two DNA snippets is an alignment score (which is related to Levenshtein by non-constant alignment weights). Using Biopython you could do a multiple sequence alignment and create a phylogenetic tree. If you'd like a faster solution, use BLAST, which is a heuristic approach. These would be consistent solutions, while yours depends on the random choice of the input sequences.
Concerning your initial question, there's no 'simple' solution to speed up your code.
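As a minimal sketch of the alignment-score idea with Biopython (default scoring parameters, which would need tuning for real DNA data):

from Bio import Align

# Pairwise aligner with default match/mismatch/gap scores
aligner = Align.PairwiseAligner()
score = aligner.score("ACCGGT", "ACGGT")
print(score)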
I have over 65 million numeric values stored in a text file. I need to compute the maximum, minimum, average, standard deviation, as well as the 25, 50, and 75 percentiles.
Normally I would use the code below, but I need a more efficient way to compute these metrics because I cannot store all the values in a list. How can I calculate these values more effectively in Python?
import numpy as np

np.average(mylist)
np.min(mylist)
np.max(mylist)
np.std(mylist)
np.percentile(mylist, 25)
np.percentile(mylist, 50)
np.percentile(mylist, 75)
maxx = float('-inf')
minx = float('+inf')
sumz = 0
for index, p in enumerate(open("foo.txt", "r")):
    maxx = max(maxx, float(p))
    minx = min(minx, float(p))
    sumz += float(p)
index += 1          # enumerate starts at 0, so the line count is index + 1

my_max = maxx
my_min = minx
my_avg = sumz / index
Use a binary file. Then you can use numpy.memmap to map it to memory and perform all sorts of algorithms, even if the dataset is larger than RAM.
You can even use numpy.memmap to create a memory-mapped array and read your data in from the text file; you can work on it, and when you are done, you also have the data in binary format.
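A rough sketch of that conversion, assuming one value per line in foo.txt (the output file name and dtype are arbitrary choices):

import numpy as np

n = sum(1 for _ in open("foo.txt"))        # number of values, one per line
mm = np.memmap("foo.dat", dtype=np.float64, mode="w+", shape=(n,))
with open("foo.txt") as f:
    for i, line in enumerate(f):
        mm[i] = float(line)
mm.flush()

# The memmapped array supports the usual numpy reductions without holding
# the whole text file in memory; note that np.percentile still needs to
# sort a copy of the data.
print(mm.min(), mm.max(), mm.mean(), mm.std())
print(np.percentile(mm, [25, 50, 75]))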
I think you are on the right track by iterating over the file and keeping track of max and min values. To calculate the std, you should keep a sum of squares inside the loop: sum_of_squares += z**2. You can then calculate std = sqrt(sum_of_squares / n - (sumz / n)**2) after the loop; see the formula here (but this formula might suffer from numerical problems). For performance, you might want to iterate over the file in decent-sized chunks of data.
To calculate the median and percentiles in a 'continuous' way, you could build up a histogram inside your loop. After the loop, you can get approximate percentiles and the median by converting the histogram to the CDF; the error will depend on the number of bins.
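A single-pass sketch of this suggestion (the histogram's value range of 0 to 1e6 and the bin count are assumptions that would need to match your data):

from math import sqrt

nbins, lo, hi = 10000, 0.0, 1e6
hist = [0] * nbins
n = 0
minx, maxx = float("inf"), float("-inf")
sumz = sum_sq = 0.0

with open("foo.txt") as f:
    for line in f:
        z = float(line)
        n += 1
        minx, maxx = min(minx, z), max(maxx, z)
        sumz += z
        sum_sq += z * z
        b = int((z - lo) / (hi - lo) * nbins)
        hist[min(max(b, 0), nbins - 1)] += 1

mean = sumz / n
std = sqrt(sum_sq / n - mean * mean)      # may be numerically unstable

def approx_percentile(q):
    # Walk the cumulative histogram until we pass the q-th fraction
    target = q * n
    cum = 0
    for i, count in enumerate(hist):
        cum += count
        if cum >= target:
            return lo + (hi - lo) * (i + 0.5) / nbins
    return hi

print(minx, maxx, mean, std)
print([approx_percentile(q) for q in (0.25, 0.5, 0.75)])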
As Antti Haapala says, the easiest and most efficient way to do this will be to stick with numpy, and just use a memmapped binary file instead of a text file. Yes, converting from one format to the other will take a bit of time—but it'll almost certainly save more time than it costs (because you can use numpy vectorized operations instead of loops), and it will also make your code a lot simpler.
If you can't do that, Python 3.4 will come with a statistics module. A backport to 2.6+ will hopefully be available at some point after the PEP is finalized; at present I believe you can only get stats, the earlier module it's based on, which requires 3.1+. Unfortunately, while stats does do single-pass algorithms on iterators, it doesn't have any convenient way to run multiple algorithms in parallel on the same iterator, so you have to be clever with itertools.tee and zip to force it to interleave the work instead of pulling the whole thing into memory.
And of course there are plenty of other modules out there if you search PyPI for "stats" and/or "statistics" and/or "statistical".
Either way, using a pre-built module will mean someone's already debugged all the problems you're going to run into, and they may have also optimized the code (maybe even ported it to C) to boot.
To get the percentiles, sort the text file using a command line program. Use the line count (index in your program) to find the line numbers of the percentiles (index // 4, etc.) Then retrieve those lines from the file.
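A rough sketch of that approach on a Unix-like system (sorted.txt is a placeholder name, and sort -g assumes GNU sort):

import subprocess

# Numeric sort of the whole file on disk, done by the external sort tool
subprocess.check_call("sort -g foo.txt > sorted.txt", shell=True)

n = sum(1 for _ in open("sorted.txt"))     # total line count
wanted = {n // 4: "25%", n // 2: "50%", 3 * n // 4: "75%"}
with open("sorted.txt") as f:
    for i, line in enumerate(f):
        if i in wanted:
            print(wanted[i], float(line))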
Most of these operations can be expressed easily in terms of simple arithmetic. In that case, it can actually (surprisingly) be quite efficient to compute simple statistics directly from the Linux command line using awk and sed, e.g. as in this post: http://www.unixcl.com/2008/09/sum-of-and-group-by-using-awk.html.
If you need to generalize to more advanced operations, like weighted percentiles, then I'd recommend using Python Pandas (notably the HDFStore capabilities for later retrieval). I've used Pandas with a DataFrame of over 25 million records before (10 columns by 25 million distinct rows). If you're more memory constrained, you could read the data in chunks, calculate partial contributions from each chunk, store the intermediate results, and then finish off the calculation by loading just those intermediate results, in a serialized, map-reduce-like fashion.
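A rough sketch of the chunked approach with pandas (the single-column layout of foo.txt and the chunk size are assumptions):

import pandas as pd

total = count = sum_sq = 0.0
for chunk in pd.read_csv("foo.txt", header=None, names=["value"],
                         chunksize=1000000):
    v = chunk["value"]
    count += len(v)
    total += v.sum()
    sum_sq += (v ** 2).sum()

mean = total / count
std = (sum_sq / count - mean ** 2) ** 0.5     # population std, single pass
print(count, mean, std)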
I am creating an application related to files, and I was looking for ways to compute checksums for them. I want to know what's the best hashing method to calculate file checksums, MD5 or SHA-1 or something else, based on these criteria:
The checksum should be unique. I know it's theoretical, but I still want the probability of collisions to be very, very small.
I can treat two files as equal if their checksums are equal.
Speed (not very important, but still a consideration).
Please feel free to be as elaborate as possible.
It depends on your use case.
If you're only worried about accidental collisions, both MD5 and SHA-1 are fine, and MD5 is generally faster. In fact, MD4 is also sufficient for most use cases, and usually even faster… but it isn't as widely implemented. (In particular, it isn't in hashlib.algorithms_guaranteed… although it should be in hashlib.algorithms_available on most stock Mac, Windows, and Linux builds.)
On the other hand, if you're worried about intentional attacks—i.e., someone intentionally crafting a bogus file that matches your hash—you have to consider the value of what you're protecting. MD4 is almost definitely not sufficient, MD5 is probably not sufficient, but SHA-1 is borderline. At present, Keccak (which will soon be SHA-3) is believed to be the best bet, but you'll want to stay on top of this, because things change every year.
The Wikipedia page on Cryptographic hash function has a table that's usually updated pretty frequently. To understand the table:
Generating a collision against MD4 requires only 3 rounds, while MD5 requires about 2 million and SHA-1 requires 15 trillion. That's enough that it would cost a few million dollars (at today's prices) to generate a collision. That may or may not be good enough for you, but it's not good enough for NIST.
Also, remember that "generally faster" isn't nearly as important as "tested faster on my data and platform". With that in mind, in 64-bit Python 3.3.0 on my Mac, I created a 1MB random bytes object, then did this:
In [173]: md4 = hashlib.new('md4')
In [174]: md5 = hashlib.new('md5')
In [175]: sha1 = hashlib.new('sha1')
In [180]: %timeit md4.update(data)
1000 loops, best of 3: 1.54 ms per loop
In [181]: %timeit md5.update(data)
100 loops, best of 3: 2.52 ms per loop
In [182]: %timeit sha1.update(data)
100 loops, best of 3: 2.94 ms per loop
As you can see, md4 is significantly faster than the others.
Tests using hashlib.md5() instead of hashlib.new('md5'), and using bytes with less entropy (runs of 1-8 string.ascii_letters separated by spaces) didn't show any significant differences.
And, for the hash algorithms that came with my installation, as tested below, nothing beat md4.
for x in hashlib.algorithms_available:
    h = hashlib.new(x)
    print(x, timeit.timeit(lambda: h.update(data), number=100))
If speed is really important, there's a nice trick you can use to improve on this: Use a bad, but very fast, hash function, like zlib.adler32, and only apply it to the first 256KB of each file. (For some file types, the last 256KB, or the 256KB nearest the middle without going over, etc. might be better than the first.) Then, if you find a collision, generate MD4/SHA-1/Keccak/whatever hashes on the whole file for each file.
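A rough sketch of that two-level scheme (paths is whatever file list you have; the 256KB prefix is the choice described above):

import zlib
from collections import defaultdict

def quick_key(path, prefix=256 * 1024):
    # Cheap, collision-prone key: adler32 over just the first 256KB
    with open(path, 'rb') as f:
        return zlib.adler32(f.read(prefix))

buckets = defaultdict(list)
for p in paths:
    buckets[quick_key(p)].append(p)

# Only buckets with more than one file need a real, whole-file hash
suspects = [b for b in buckets.values() if len(b) > 1]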
Finally, since someone asked in a comment how to hash a file without reading the whole thing into memory:
def hash_file(path, algorithm='md5', bufsize=8192):
    h = hashlib.new(algorithm)
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.digest()
If squeezing out every bit of performance is important, you'll want to experiment with different values for bufsize on your platform (powers of two from 4KB to 8MB). You also might want to experiment with using raw file handles (os.open and os.read), which may sometimes be faster on some platforms.
The collision probability with a hash of sufficiently many bits is, theoretically, quite small:
Assuming random hash values with a uniform distribution, a collection of n different data blocks and a hash function that generates b bits, the probability p that there will be one or more collisions is bounded by the number of pairs of blocks multiplied by the probability that a given pair will collide, i.e.
p <= n(n - 1) / 2 * 2^(-b)
And, so far, SHA-1 collisions with 160 bits have been unobserved. Assuming one exabyte (10^18) of data, in 8KB blocks, the theoretical chance of a collision is 10^-20 -- a very very small chance.
A useful shortcut is to eliminate files known to be different from each other through short-circuiting.
For example, in outline:
Read the first X blocks of all files of interest;
Group the ones that have the same hash for the first X blocks as potentially containing the same file data;
For each file with the first X blocks that are unique, you can assume the entire file is unique vs all other tested files -- you do not need to read the rest of that file;
With the remaining files, read more blocks until you prove the signatures are the same or different.
With X blocks of sufficient size, 95%+ of the files will be correctly discriminated into unique files in the first pass. This is much faster than blindly reading the entire file and calculating the full hash for each and every file.
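A rough sketch of that first pass (the block count, block size, and use of MD5 are arbitrary choices):

import hashlib
from collections import defaultdict

def prefix_hash(path, blocks=4, blocksize=64 * 1024):
    # Hash only the first few blocks of the file
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for _ in range(blocks):
            chunk = f.read(blocksize)
            if not chunk:
                break
            h.update(chunk)
    return h.digest()

def first_pass(paths):
    groups = defaultdict(list)
    for p in paths:
        groups[prefix_hash(p)].append(p)
    unique = [g[0] for g in groups.values() if len(g) == 1]
    # Files sharing a prefix hash still need deeper comparison
    needs_more_work = [g for g in groups.values() if len(g) > 1]
    return unique, needs_more_work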
MD5 tends to work great for checksums... same with SHA-1... both have a very small probability of collisions, although I think SHA-1 has a slightly smaller collision probability since it uses more bits.
If you are really worried about it, you could use both checksums (one MD5 and one SHA-1); the chance that both match while the files differ is infinitesimally small (still not 100% impossible, but very, very unlikely)... (this seems like bad form and is by far the slowest solution)
Typically (read: in every instance I have ever encountered) an MD5 or an SHA-1 match is sufficient to assume uniqueness.
There is no way to 100% guarantee uniqueness short of a byte-by-byte comparison.
I created a small duplicate-file remover script a few days back, which reads the content of each file, creates a hash for it, and then compares it with the next file; even if the names differ, the checksum will be the same for identical content.
import hashlib
import os

hash_table = {}   # maps checksum -> first file seen with that content
dups = []         # files whose content has already been seen

path = "C:\\images"
for img in os.listdir(path):
    img_path = os.path.join(path, img)
    with open(img_path, "rb") as _file:
        content = _file.read()
    _hash = hashlib.md5(content).hexdigest()
    if _hash in hash_table:
        dups.append(img)
    else:
        hash_table[_hash] = img
Let's say I need to save a matrix (each line corresponds to one row) that can be loaded from Fortran later. What method should I prefer? Is converting everything to strings the only approach?
You can save the values in binary format as well. Please see the documentation for the struct standard module; it has a pack function for converting Python objects into binary data.
For example:
import struct
value = 3.141592654
data = struct.pack('d', value)
open('file.ext', 'wb').write(data)
You can convert each element of your matrix and write to a file. Fortran should be able to load that binary data. You can speed up the process by converting a row as a whole, like this:
row_data = struct.pack('d' * len(matrix_row), *matrix_row)
Please note that 'd' * len(matrix_row) is a constant for your matrix size, so you need to calculate that format string only once.
I don't know Fortran, so it's hard to tell what is easy for you to do on that side for parsing.
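As a minimal sketch of writing a whole matrix this way (the file name is a placeholder; on the Fortran side you would also need to know the shape and byte order of the doubles):

import struct

matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
row_fmt = 'd' * len(matrix[0])          # compute the format string once

with open('matrix.bin', 'wb') as f:
    for row in matrix:
        f.write(struct.pack(row_fmt, *row))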
It sounds like your options are either saving the doubles in plaintext (meaning, 'converting' them to strings) or in binary (using struct and the like). Which one is better depends on your situation.
I would go with the plaintext solution, as it means the files will be easily readable, and you won't have to mess with various details (endianness, default double sizes).
But, there are cases where binary is better (for example, if you have a really big list of doubles and space is of importance, or if it is easier for you to parse it and you need the optimization) - but this is likely not your case.
You can use JSON
import json
matrix = [[2.3452452435, 3.34134], [4.5, 7.9]]
data = json.dumps(matrix)
open('file.ext', 'w').write(data)
File content will look like:
[[2.3452452435, 3.3413400000000002], [4.5, 7.9000000000000004]]
If legibility and ease of access is important (and file size is reasonable), Fortran can easily parse a simple array of numbers, at least if it knows the size of the matrix beforehand (with something like READ(FILE_ID, '2(F)'), I think):
1.234 5.6789e4
3.1415 9.265358978
42 ...
Two nested for loops in your Python code can easily write your matrix in this form.
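A minimal sketch of those two loops (the format width is an arbitrary choice):

matrix = [[1.234, 5.6789e4], [3.1415, 9.265358978]]

with open('matrix.txt', 'w') as f:
    for row in matrix:              # outer loop: one line per row
        for x in row:               # inner loop: one value per column
            f.write('%.10g ' % x)
        f.write('\n')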
What is the easiest way to see if two files are the same content-wise in Python?
One thing I can do is md5 each file and compare. Is there a better way?
Yes, I think hashing the file would be the best way if you have to compare several files and store hashes for later comparison. As hashes can clash, a byte-by-byte comparison may be done depending on the use case.
Generally a byte-by-byte comparison would be sufficient and efficient, which the filecmp module already does, along with other things.
See http://docs.python.org/library/filecmp.html
e.g.
>>> import filecmp
>>> filecmp.cmp('file1.txt', 'file1.txt')
True
>>> filecmp.cmp('file1.txt', 'file2.txt')
False
Speed consideration:
Usually, if only two files have to be compared, hashing them and comparing the hashes will be slower than a simple byte-by-byte comparison done efficiently. For example, the code below tries to time hashing vs. byte-by-byte comparison.
Disclaimer: this is not the best way of timing or comparing two algorithms, and it needs improvement, but it does give a rough idea. If you think it should be improved, do tell me and I will change it.
import random
import string
import hashlib
import time

def getRandText(N):
    return "".join([random.choice(string.printable) for i in xrange(N)])

N = 1000000
randText1 = getRandText(N)
randText2 = getRandText(N)

def cmpHash(text1, text2):
    hash1 = hashlib.md5()
    hash1.update(text1)
    hash1 = hash1.hexdigest()

    hash2 = hashlib.md5()
    hash2.update(text2)
    hash2 = hash2.hexdigest()

    return hash1 == hash2

def cmpByteByByte(text1, text2):
    return text1 == text2

for cmpFunc in (cmpHash, cmpByteByByte):
    st = time.time()
    for i in range(10):
        cmpFunc(randText1, randText2)
    print cmpFunc.func_name, time.time() - st
and the output is
cmpHash 0.234999895096
cmpByteByByte 0.0
I'm not sure if you want to find duplicate files or just compare two single files. If the latter, the above approach (filecmp) is better; if the former, the following approach is better.
There are lots of duplicate-file detection questions here. Assuming the files are not very small and that performance is important, you can:
Compare file sizes first, discarding all that don't match
If file sizes match, compare using the biggest hash you can handle, hashing chunks of files to avoid reading the whole big file
Here's an answer with Python implementations (I prefer the one by nosklo, BTW); a rough sketch of the idea follows.
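A rough sketch of the size-then-hash idea, not a polished implementation (the choice of SHA-256 and the chunk size are placeholders):

import hashlib
import os
from collections import defaultdict

def chunked_hash(path, chunksize=1024 * 1024):
    # Hash the file in chunks so the whole file is never held in memory
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue                      # unique size -> unique content
        for p in same_size:
            duplicates[chunked_hash(p)].append(p)
    return [group for group in duplicates.values() if len(group) > 1]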