See if two files have the same content in Python [duplicate]

Possible Duplicates:
Finding duplicate files and removing them.
In Python, is there a concise way of comparing whether the contents of two text files are the same?
What is the easiest way to see if two files are the same content-wise in Python?
One thing I can do is md5 each file and compare. Is there a better way?

Yes, I think hashing the files is the best approach if you have to compare several files and store the hashes for later comparison. Since hashes can collide, a byte-by-byte comparison may still be needed, depending on the use case.
Generally a byte-by-byte comparison is sufficient and efficient, and the filecmp module already does that (plus other things).
See http://docs.python.org/library/filecmp.html
e.g.
>>> import filecmp
>>> filecmp.cmp('file1.txt', 'file1.txt')
True
>>> filecmp.cmp('file1.txt', 'file2.txt')
False
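Note that by default filecmp.cmp() does a "shallow" comparison: if the os.stat() signatures (size and modification time) of the two files match, it reports them as equal without reading their contents. Pass shallow=False to force an actual content comparison:
>>> filecmp.cmp('file1.txt', 'file2.txt', shallow=False)
False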
Speed consideration:
Usually, if only two files have to be compared, hashing them and comparing the hashes will be slower than a simple byte-by-byte comparison done efficiently. For example, the code below tries to time hash vs byte-by-byte comparison.
Disclaimer: this is not the best way of timing or comparing two algorithms and it needs improvement, but it does give a rough idea. If you think it should be improved, tell me and I will change it.
import random
import string
import hashlib
import time

def getRandText(N):
    return "".join([random.choice(string.printable) for i in xrange(N)])

N = 1000000
randText1 = getRandText(N)
randText2 = getRandText(N)

def cmpHash(text1, text2):
    hash1 = hashlib.md5()
    hash1.update(text1)
    hash1 = hash1.hexdigest()

    hash2 = hashlib.md5()
    hash2.update(text2)
    hash2 = hash2.hexdigest()

    return hash1 == hash2

def cmpByteByByte(text1, text2):
    return text1 == text2

for cmpFunc in (cmpHash, cmpByteByByte):
    st = time.time()
    for i in range(10):
        cmpFunc(randText1, randText2)
    print cmpFunc.func_name, time.time() - st
and the output is
cmpHash 0.234999895096
cmpByteByByte 0.0

I'm not sure whether you want to find duplicate files or just compare two individual files. If the latter, the above approach (filecmp) is better; if the former, the following approach is better.
There are lots of duplicate-file detection questions here. Assuming the files are not very small and that performance is important, you can:
Compare file sizes first, discarding all that don't match.
If file sizes match, compare using the biggest hash you can handle, hashing the files in chunks to avoid reading each whole big file into memory (a rough sketch follows below).
Here is an answer with Python implementations (I prefer the one by nosklo, BTW).
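A rough sketch of that size-then-chunked-hash idea (the helper names here are just illustrative, not taken from any of the linked answers):
import hashlib
import os
from collections import defaultdict

def chunked_hash(path, chunk_size=1 << 20):
    # Hash the file 1 MB at a time so huge files never sit in memory at once.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    # Group by size first; only same-sized files can possibly be duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    by_hash = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            by_hash[chunked_hash(p)].append(p)
    return [g for g in by_hash.values() if len(g) > 1]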

Related

Determine difference of 2 large binary files?

I need to compare and get the difference of 2 large binary files (up to 100 MB).
For ASCII format I can use this:
import difflib
file1 = open('large1.txt', 'r')
file2 = open('large2.txt', 'r')
diff = difflib.ndiff(file1.readlines(), file2.readlines())
difference = ''.join(x[2:] for x in diff if x.startswith('- '))
print(difference)
How would one make it work for binary files? I tried different encodings and binary read mode, but nothing has worked yet.
EDIT: I use .vcl binary files.
difflib will be extremely slow for large files, and 100 MB definitely counts as very large...
Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.
If you can tolerate the slowness, try difflib.SequenceMatcher; it works for almost all types of data.
This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
Python docs - class difflib.SequenceMatcher
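For what it's worth, a minimal sketch of pointing SequenceMatcher at the raw bytes of two files (the file names are placeholders, and this will still be very slow on ~100 MB inputs):
import difflib

with open('large1.vcl', 'rb') as f1, open('large2.vcl', 'rb') as f2:
    data1, data2 = f1.read(), f2.read()

matcher = difflib.SequenceMatcher(None, data1, data2, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != 'equal':
        print(tag, 'bytes', i1, '-', i2, 'vs', j1, '-', j2)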

Faster similarity clustering in python

I have a collection of a few thousand strings (DNA sequences). I want to trim this down to a couple of hundred (exact number not critical) by excluding sequences that are very similar.
I can do this via matching using the "Levenshtein" module. It works, but it's pretty slow, and I'm pretty sure there must be a much faster way. The code here is the same approach but applied to words, to make it more testable; for me with this cutoff it takes about 10 seconds and collects ~1000 words.
import Levenshtein as lev
import random

f = open("/usr/share/dict/words", 'r')
txt = f.read().splitlines()
f.close()

cutoff = .5
collected = []
while len(txt) > 0:
    a = random.choice(txt)
    collected.append(a)
    txt = filter(lambda b: lev.ratio(a, b) < cutoff, txt)
I've tried a few different variations and some other matching modules (Jellyfish) without getting significantly faster.
Maybe you could use locality-sensitive hashing. You could hash all strings into buckets so that all strings in one bucket are very similar to each other. Then you can pick just one from each bucket.
I've never used this personally (I don't know how well it works in your case); it's just an idea.
related: Python digest/hash for string similarity
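A rough, untested sketch of that idea using MinHash over character shingles (every parameter value here is made up and would need tuning for real DNA sequences):
import random

def shingles(s, k=4):
    # Character k-grams; very short strings fall back to the string itself.
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def minhash_signature(sh, seeds):
    # The minimum of a seeded hash over the shingle set approximates set similarity.
    return tuple(min(hash((seed, g)) for g in sh) for seed in seeds)

def lsh_dedupe(strings, num_hashes=16, bands=4):
    seeds = [random.random() for _ in range(num_hashes)]
    rows = num_hashes // bands
    seen_bands = set()
    collected = []
    for s in strings:
        sig = minhash_signature(shingles(s), seeds)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        # If any band matches a previously kept string, treat s as a near-duplicate.
        if any(k in seen_bands for k in keys):
            continue
        collected.append(s)
        seen_bands.update(keys)
    return collected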
I'm not quite sure what your application aims at, but when working with sequences you might want to consider using Biopython. An alternative way to calculate the distance between two DNA snippets is an alignment score (which is related to Levenshtein distance by non-constant alignment weights). Using Biopython you could do a multiple sequence alignment and create a phylogenetic tree. If you'd like a faster solution, use BLAST, which is a heuristic approach. These would be consistent solutions, whereas yours depends on the random choice of the input sequences.
Concerning your initial question, there's no 'simple' solution to speed up your code.
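If Biopython is an option, a minimal pairwise-alignment sketch could look like this (using the pairwise2 module with its simplest scoring, which just counts identical characters; a real substitution matrix would be more appropriate for DNA):
from Bio import pairwise2

alignments = pairwise2.align.globalxx("ACCGGT", "ACGT")
print(pairwise2.format_alignment(*alignments[0]))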

Is there any simple sequence alignment module or package for python?

Dealing with plain-text sequence files (FASTA sequences in most cases) is not very efficient. I really want to deal with Python objects (str or so) instead of FASTA files. What I need is simply:
>>> s1 = Seq('atgctttccg....act')
>>> s2 = Seq( 'tactttccg....tat')
>>> result = align(s1, s2, scoring_matrix)
>>> result.identity, result.score, result.expect
(79.37, 1086, 9e-105)
>>> result.alignment
('atgctttccg....act--','-tactttccg....tat')
Thus, I can also avoid parsing output files repeatedly, which is boring, time-consuming and error-prone. I don't expect high performance. I'm planning to write a Python extension implementing the Smith-Waterman algorithm, but I'm wondering:
1. Is there an existing module for my need?
2. Any recommended reading for common optimization of a Smith-Waterman alignment implementation?
Any suggestions appreciated.
I use Biopython to handle/parse FASTA files. It is great. It won't be hard to implement the Smith-Waterman algorithm in Python, but it will be slow. Good luck.
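In case no existing module fits, here is a minimal pure-Python Smith-Waterman sketch (simple match/mismatch/gap scores, no affine gaps; the scoring values are illustrative only):
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    # Fill the scoring matrix, remembering the best-scoring cell.
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            H[i][j] = score
            if score > best:
                best, best_pos = score, (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return best, ''.join(reversed(out_a)), ''.join(reversed(out_b))

print(smith_waterman('atgctttccgact', 'tactttccgtat'))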

Fastest way to read a list of numbers from a file

I have found a few similar questions here on Stack Overflow, but I believe I could benefit from advice specific to my case.
I must store around 80 thousand lists of real-valued numbers in a file and read them back later.
First, I tried cPickle, but the reading time wasn't appealing:
>>> stmt = """
with open('pickled-data.dat') as f:
    data = cPickle.load(f)
"""
>>> timeit.timeit(stmt, 'import cPickle', number=1)
3.8195440769195557
Then I found out that storing the numbers as plain text allows faster reading (makes sense, since cPickle must worry about a lot of things):
>>> stmt = """
data = []
with open('text-data.dat') as f:
    for line in f:
        data.append([float(x) for x in line.split()])
"""
>>> timeit.timeit(stmt, number=1)
1.712096929550171
This is a good improvement, but I think I could still optimize it somehow, since programs written in other languages can read similar data from files considerably faster.
Any ideas?
If numpy arrays are workable, numpy.fromfile will likely be the fastest option to read the files (here's a somewhat related question I asked just a couple days ago)
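If all the lists happen to have the same length, a minimal tofile/fromfile sketch would look something like this (the file name and shape are placeholders); for ragged lists you would need to store the lengths separately, much like the struct version below:
import numpy as np

data = np.random.rand(80000, 10)                         # placeholder shape
data.tofile('numbers.bin')                               # raw, flat binary dump
loaded = np.fromfile('numbers.bin').reshape(80000, 10)   # default dtype is float64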
Alternatively, it seems like you could do a little better with struct, though I haven't tested it:
import struct

def write_data(f, data):
    # Record count first, then each record as: length prefix + that many floats.
    f.write(struct.pack('i', len(data)))
    for lst in data:
        f.write(struct.pack('i%df' % len(lst), len(lst), *lst))

def read_data(f):
    def read_record(f):
        nelem = struct.unpack('i', f.read(4))[0]
        return list(struct.unpack('%df' % nelem, f.read(nelem * 4)))  # if tuples are OK, remove the `list`
    nrec = struct.unpack('i', f.read(4))[0]
    return [read_record(f) for i in range(nrec)]
This assumes that storing the data as 4-byte floats is good enough. If you want a real double precision number, change the format statements from f to d and change nelem*4 to nelem*8. There might be some minor portability issues here (endianness and sizeof datatypes for example).
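A quick usage sketch for the functions above (the file name is a placeholder):
with open('records.bin', 'wb') as f:
    write_data(f, [[1.0, 2.0, 3.0], [4.5]])
with open('records.bin', 'rb') as f:
    print(read_data(f))    # -> [[1.0, 2.0, 3.0], [4.5]]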

Python: Set with only existence check?

I have a set of lots of big, long strings that I want to do existence lookups for. I never need the whole string to be saved. As far as I can tell, set() actually stores the strings, which is eating up a lot of my memory.
Does such a data structure exist?
done = hash_only_set()
while len(queue) > 0:
    item = queue.pop()
    if item not in done:
        process(item)
        done.add(item)
(My queue is constantly being filled by other threads so I have no way of dedupping it at the start).
It's certainly possible to keep a set of only hashes:
done = set()
while len(queue) > 0:
    item = queue.pop()
    h = hash(item)
    if h not in done:
        process(item)
        done.add(h)
Notice that because of hash collisions, there is a chance that you consider an item done even though it isn't.
If you cannot accept this risk, you really need to save the full strings to be able to tell whether you have seen it before. Alternatively: perhaps the processing itself would be able to tell?
Yet alternatively: if you cannot accept to keep the strings in memory, keep them in a database, or create files in a directory with the same name as the string.
You can use a data structure called Bloom Filter specifically for this purpose. A Python implementation can be found here.
EDIT: Important notes:
False positives are possible in this data structure, i.e. a check for the existence of a string could return a positive result even though it was not stored.
False negatives (getting a negative result for a string that was stored) are not possible.
That said, the chances of this happening can be brought to a minimum if used properly and so I consider this data structure to be very useful.
If you use a secure hash function (like SHA-256, found in the hashlib module) to hash the strings, it's very unlikely that you will find a duplicate (and if you find one, you can probably win a prize, as with most cryptographic hash functions).
The builtin __hash__() method does not guarantee you won't have duplicates (and since it only uses 32 bits, it's very likely you'll find some).
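A minimal sketch of that idea, keeping only fixed-size SHA-256 digests in the set instead of the strings themselves:
import hashlib

done = set()

def seen(item):
    # 32 bytes per entry, no matter how long the original string is.
    digest = hashlib.sha256(item.encode('utf-8')).digest()
    if digest in done:
        return True
    done.add(digest)
    return False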
You need to know the whole string to have 100% certainty. If you have lots of strings with similar prefixes you could save space by using a trie to store the strings. If your strings are long you could also save space by using a large hash function like SHA-1 to make the possibility of hash collisions so remote as to be irrelevant.
If you can make the process() function idempotent - i.e. calling it twice on an item is only a performance issue - then the problem becomes a lot simpler and you can use lossy data structures, such as Bloom filters.
You would have to think about how to do the lookup, since there are two methods that the set needs, __hash__ and __eq__.
The hash is a "loose part" that you can take away, but __eq__ is not a part you can drop: you need both strings to do the comparison.
If you only need negative confirmation (this item is not part of the set), you could fill a set collection you implemented yourself with your strings, then "finalize" the set by removing all strings except those with collisions (those are kept around for __eq__ tests), and promise not to add more objects to your set. Now you have an exclusive test available: you can tell that an object is not in your set, but you can't be certain whether "obj in Set == True" is a false positive or not.
Edit: This is basically a bloom filter that was cleverly linked, but a bloom filter might use more than one hash per element which is really clever.
Edit2: This is my 3-minute bloom filter:
class BloomFilter(object):
    """
    Let's make a bloom filter
    http://en.wikipedia.org/wiki/Bloom_filter

    __contains__ has false positives, but never false negatives
    """
    def __init__(self, hashes=(hash,)):
        self.hashes = hashes
        self.data = set()
    def __contains__(self, obj):
        return all((h(obj) in self.data) for h in self.hashes)
    def add(self, obj):
        self.data.update(h(obj) for h in self.hashes)
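A quick usage sketch (the second hash function is just an arbitrary salted variant, added for illustration):
bf = BloomFilter(hashes=(hash, lambda obj: hash(('salt', obj))))
bf.add('some long string')
print('some long string' in bf)   # True
print('another string' in bf)     # almost certainly False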
As has been hinted already, if the answers offered here (most of which break down in the face of hash collisions) are not acceptable, you would need to use a lossless representation of the strings.
Python's zlib module provides built-in string compression and could be used to pre-process the strings before you put them in your set. Note, however, that the strings would need to be quite long (which you hint they are) and have low entropy in order to save much memory. Other compression options might provide better space savings, and some Python-based implementations can be found here.
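A tiny sketch of the zlib idea: compress each string before adding it to the set. Because the compression is lossless, membership tests stay exact:
import zlib

done = set()

def remember(item):
    done.add(zlib.compress(item.encode('utf-8')))

def already_seen(item):
    return zlib.compress(item.encode('utf-8')) in done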
