Detecting duplicate files with Python

I'm trying to write a Python script that sorts through files (photos, videos), checks the metadata of each, and finds and moves all duplicates to a separate directory. I got stuck on the metadata-checking part. I tried os.stat, but comparing its results doesn't return True for duplicate files. Ideally, I should be able to do something like:
if os.stat("original.jpg") == os.stat("duplicate.jpg"):
    shutil.copy("duplicate.jpg", "C:\\Duplicate Folder")
Pointers anyone?

There are a few things you can do. You can compare the contents or a hash of each file, or you can check a few select properties from the os.stat result, e.g.:
import os

def is_duplicate(file1, file2):
    stat1, stat2 = os.stat(file1), os.stat(file2)
    return stat1.st_size == stat2.st_size and stat1.st_mtime == stat2.st_mtime
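For example, plugged into the snippet from the question (the file names and the duplicate folder are the question's own examples):

import shutil

if is_duplicate("original.jpg", "duplicate.jpg"):
    shutil.copy("duplicate.jpg", "C:\\Duplicate Folder")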

A basic loop using a set to keep track of already encountered files:
import glob
import hashlib

uniq = set()
for fname in glob.glob('*.txt'):
    with open(fname, "rb") as f:
        sig = hashlib.sha256(f.read()).digest()
        if sig not in uniq:
            uniq.add(sig)
            print(fname)
        else:
            print(fname, "(duplicate)")
Please note that, as with any hash function, there is a slight chance of collision, i.e. two different files having the same digest. Depending on your needs, this may or may not be acceptable.
According to Thomas Pornin in another answer:
"For instance, with SHA-256 (n=256) and one billion messages (p=10^9) then the probability [of collision] is about 4.3*10^-60."
Given your need, if you have to check additional properties in order to identify "true" duplicates, change the sig = ... line to whatever suits you. For example, if you need to check for "same content" and "same owner" (st_uid as returned by os.stat()), write:
sig = (hashlib.sha256(f.read()).digest(),
       os.stat(fname).st_uid)
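If you go the hashing route, here is a minimal sketch of the original goal (moving duplicates into a separate folder). The directory paths and function names are illustrative only, and the file is hashed in chunks so large photos and videos are never fully loaded into memory:

import hashlib
import os
import shutil

def file_digest(path, chunk_size=1 << 20):
    # Hash the file in chunks instead of reading it all at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()

def move_duplicates(source_dir, duplicate_dir):
    # The first occurrence of each digest stays; later ones are moved away.
    os.makedirs(duplicate_dir, exist_ok=True)
    seen = set()
    for name in os.listdir(source_dir):
        path = os.path.join(source_dir, name)
        if not os.path.isfile(path):
            continue
        sig = file_digest(path)
        if sig in seen:
            shutil.move(path, os.path.join(duplicate_dir, name))
        else:
            seen.add(sig)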

If two files have the same MD5 digest, they are almost certainly exact duplicates (barring a hash collision).
from hashlib import md5

with open(file1, "r") as original:
    original_md5 = md5(original.read()).hexdigest()
with open(file2, "r") as duplicate:
    duplicate_md5 = md5(duplicate.read()).hexdigest()

if original_md5 == duplicate_md5:
    do_stuff()
In your example you're using jpg files; in that case you want to call open with its second argument equal to "rb". For that, see the documentation for open.

os.stat offers information about a file's metadata and attributes, including its creation time. That is not a good approach for finding out whether two files are the same.
For instance: two files can be identical and still have different creation times, so comparing stats will fail here. Sylvain Leroux's approach is the best one when combining performance and accuracy, since it is very rare for two different files to have the same hash.
So, unless you have an incredibly large amount of data and a missed duplicate would be fatal to your system, this is the way to go.
If that is your case (it doesn't seem to be), then the only way you can be 100% sure two files are the same is to iterate over them and compare byte by byte.
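For completeness, a sketch of that byte-by-byte comparison (the chunk size is an arbitrary choice):

import os

def files_identical(path1, path2, chunk_size=1 << 16):
    # Cheap size check first, then compare the contents chunk by chunk.
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            b1, b2 = f1.read(chunk_size), f2.read(chunk_size)
            if b1 != b2:
                return False
            if not b1:  # both files exhausted at the same point
                return True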

Related

Is there a faster way to judge whether two pictures are different?

I am using this to tell whether the current screenshot is different from the last screenshot. Right now I use:
with open('last_screenshot.bmp', 'rb+') as f:
    org = f.read()
with open('now_screenshot.bmp', 'rb+') as f:
    new = f.read()

if org == new:
    print("The pictures are same")
Is there a faster way to do this?
You won't get anywhere comparing raw pixels. Your options:
extract features using a descriptor (HOG, SIFT, SURF, ORB), match them, and see how many were matched; example
calculate perceptual hashes and compute the Hamming distance between them (see the sketch after this list); here's an example
use a pretrained embedder; when it comes to images it's pretty easy, just take the activation of the penultimate layer; it can be Inception, VGG, etc.
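A minimal sketch of the perceptual-hash option, assuming the third-party Pillow and imagehash packages are installed (the threshold is an arbitrary starting point, not a recommendation from the answer):

from PIL import Image
import imagehash

def looks_same(path1, path2, threshold=5):
    # Near-identical images produce hashes with a small Hamming distance.
    h1 = imagehash.average_hash(Image.open(path1))
    h2 = imagehash.average_hash(Image.open(path2))
    return (h1 - h2) <= threshold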
Well, you could iterate over the files chunk by chunk instead of reading the entire thing into memory.
Alternatively, use filecmp or shell out to cmp(1).
You can use filecmp.cmp(..., shallow=False), which ships with the standard library. This won't read the whole files into memory but instead reads them in chunks and short-circuits as soon as a differing chunk is encountered. Example usage:
import filecmp

if filecmp.cmp('last_screenshot.bmp', 'now_screenshot.bmp', shallow=False):
    print('Files compare equal')

Merge two large text files by common row to one mapping file

I have two text files that have similar formatting. The first (732KB):
>lib_1749;size=599;
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTAGACTGTCACTGACACTGATGCTCGAAAGTGTGGGTATCAAACA
--
>lib_2235;size=456;
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
>lib_13686;size=69;
TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCAGGCAAGTCAGAAGTGAAAGCCCGGGGCTCAACCCCGGGGCTGGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGACTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
The second (5.26GB):
>Stool268_1 HWI-ST155_0605:1:1101:1194:2070#CTGTCTCTCCTA
TACGGAGGATGCGAGCGTTATCCGGATTTACTGGGTTTAAAGGGAGCGCAGACGGGACGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCCTAAAACTGCTAGCGGTGAAATGCTTAGATATCGGGAGGAACTCCGGTTGCGAAGGCAGCATACTGGACTGCAACTGACGCTGATGCTCGAAAGTGTGGGTATCAAACAGG
--
Note the key difference is the header for each entry (lib_1749 vs. Stool268_1). What I need is to create a mapping file between the headers of one file and the headers of the second using the sequence (e.g., TACGGAGGATGCGAGCGTTATCCGGAT...) as a key.
Note, as one final complication, that the mapping is not going to be 1-to-1: there will be multiple entries of the form Stool****** for each entry of lib****. This is because the length of the key in the first file was trimmed to 200 characters, but in the second file it can be longer.
For smaller files I would just do something like this in Python, but I often have trouble because these files are so big and cannot be read into memory all at once. Usually I try Unix utilities, but in this case I cannot think of how to accomplish this.
Thank you!
In my opinion, the easiest way would be to use BLAST+...
Set up the larger file as a BLAST database and use the smaller file as the query...
Then just write a small script to analyse the output, i.e. take the top hit or two to create the mapping file.
BTW, you might find SequenceServer (Google it) helpful in setting up a custom BLAST database and your BLAST environment...
BioPython should be able to read in large FASTA files.
from Bio import SeqIO
from collections import defaultdict

mapping = defaultdict(list)
for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    stool_seq = str(stool_record.seq)
    for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
        lib_seq = str(lib_record.seq)
        if stool_seq.startswith(lib_seq):
            mapping[lib_record.id.split(';')[0]].append(stool_record.id)
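Re-parsing the small file for every record of the 5.26 GB file will be slow. If the trimmed key length in the first file really is a fixed 200 characters, as the question suggests, a dictionary lookup avoids the inner loop entirely. A sketch under that assumption (file names follow the snippet above):

from Bio import SeqIO
from collections import defaultdict

# Build a lookup of trimmed sequence -> lib header from the small (732 KB) file.
lib_by_seq = {}
for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
    lib_by_seq[str(lib_record.seq)] = lib_record.id.split(';')[0]

KEY_LEN = 200  # the question states the first file's sequences were trimmed to 200 characters

mapping = defaultdict(list)
# Stream the large file one record at a time; only the mapping grows in memory.
for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    key = str(stool_record.seq)[:KEY_LEN]
    if key in lib_by_seq:
        mapping[lib_by_seq[key]].append(stool_record.id)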

Python - Import txt in a sequential pattern

In the directory I have, say, 30 txt files, each containing two columns of numbers with roughly 6000 numbers in each column. What I want to do is import the first 3 txt files, process the data, which gives me the desired output, and then move on to the next 3 txt files.
The directory looks like:
file0a
file0b
file0c
file1a
file1b
file1c ... and so on.
I don't want to import all of the txt files simultaneously; I want to import the first 3, process the data, then the next 3, and so forth. I was thinking of making a dictionary, though I have a feeling this might involve writing each file name in the dictionary, which would take far too long.
EDIT:
For those that are interested, I think I have come up with a workaround. Any feedback would be greatly appreciated, since I'm not sure if this is the quickest way to do things or the most Pythonic.
import glob
import numpy as np

def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i + n]

Data = []
txt_files = glob.iglob("./*.txt")
for data in txt_files:
    d = np.loadtxt(data, dtype=np.float64)
    Data.append(d)

Data_raw_all = list(chunks(Data, 3))
Here the list Data holds all of the text files from the directory, and Data_raw_all uses the function chunks to group the elements of Data into sets of 3. This way, selecting one element of Data_raw_all selects the corresponding 3 text files in the directory.
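A variant of this that avoids loading every file before chunking is to group the sorted file names first and only load each group of three when you reach it. A sketch assuming the same numpy-based loading:

import glob
import numpy as np

def chunks(seq, n):
    # Yield successive n-sized groups from seq.
    for i in range(0, len(seq), n):
        yield seq[i:i + n]

txt_files = sorted(glob.glob("./*.txt"))
for group in chunks(txt_files, 3):
    # Only these three files are in memory at any one time.
    data = [np.loadtxt(name, dtype=np.float64) for name in group]
    # ... process `data`, then move on to the next group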
First of all, I have nothing original to include here and I definitely do not want to claim credit for it at all because it all comes from the Python Cookbook 3rd Ed and from this wonderful presentation on generators by David Beazley (one of the co-authors of the aforementioned Cookbook). However, I think you might really benefit from the examples given in the slideshow on generators.
What Beazley does is chain a bunch of generators together in order to do the following:
yields filenames matching a given filename pattern.
yields open file objects from a sequence of filenames.
concatenates a sequence of generators into a single sequence
greps a series of lines for those that match a regex pattern
All of these code examples are located here. The beauty of this method is that the chained generators simply chew up the next pieces of information: they don't load all files into memory in order to process all the data. It's really a nice solution.
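To give a flavour of that pattern, here is a condensed sketch of four chained generators; the function names and regex are written from scratch here in the spirit of the presentation, not copied from it:

import fnmatch
import os
import re

def gen_find(pattern, top):
    # Yield file paths under `top` matching a glob pattern.
    for dirpath, _, filenames in os.walk(top):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)

def gen_open(paths):
    # Yield open file objects, one at a time.
    for path in paths:
        with open(path) as f:
            yield f

def gen_cat(sources):
    # Concatenate the lines of a sequence of files into one stream.
    for src in sources:
        for line in src:
            yield line

def gen_grep(pattern, lines):
    # Yield only the lines matching a regex.
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))

# Chain them: nothing is read until the final loop pulls items through.
paths = gen_find("*.txt", ".")
files = gen_open(paths)
lines = gen_cat(files)
for line in gen_grep(r"\d+", lines):
    pass  # process matching lines here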
Anyway, if you read through the slideshow, I believe it will give you a blueprint for exactly what you want to do: you just have to change it for the information you are seeking.
In short, check out the slideshow linked above and follow along and it should provide a blueprint for solving your problem.
I'm presuming you want to hardcode as few of the file names as possible. Therefore most of this code is for generating the filenames. The files are then opened with a with statement.
Example code:
from itertools import count

root = "UVF2CNa"
for n in count(1):
    for char in "abc":
        first_part = "{}{}{}".format(root, n, char)
        try:
            with open(first_part + "i") as i, \
                 open(first_part + "j") as j, \
                 open(first_part + "k") as k:
                # do stuff with files i, j and k here
                pass
        except FileNotFoundError:
            # deal with this however (e.g. break once the files run out)
            pass

Python Memory error solutions if permanent access is required

First, I am aware of the number of Python memory error questions on SO, but so far none has matched my use case.
I am currently trying to parse a bunch of textfiles (~6k files with ~30 GB) and store each unique word. Yes, I am building a wordlist, no I am not planning on doing evil things with it, it is for the university.
I implemented the list of found words as a set (created with words = set([]), used with words.add(word)) and I am just adding every found word to it, considering that the set mechanics should remove all duplicates.
This means that I need permanent access to the whole set for this to work (Or at least I see no alternative, since the whole list has to be checked for duplicates on every insert).
Right now, I am running into a MemoryError about 25% of the way through, when the script uses about 3.4 GB of my RAM. I am on 32-bit Linux, so I know where that limitation comes from, and my PC only has 4 GB of RAM, so even 64-bit would not help here.
I know that the complexity is probably terrible (probably O(n) on each insert, although I don't know how Python sets are implemented (trees?)), but it is still (probably) faster and (definitely) more memory efficient than adding each word to a primitive list and removing duplicates afterwards.
Is there any way to get this to run? I expect about 6-10 GB of unique words, so using my current RAM is out of the question, and upgrading my RAM is currently not possible (and does not scale too well once I start letting this script loose on larger amounts of files).
My only idea at the moment is caching on disk (which will slow the process down even more), or writing temporary sets to disk and merging them afterwards, which will take even more time, and the complexity would be horrible indeed. Is there even a solution that will not result in horrible runtimes?
For the record, this is my full source. As it was written for personal use only, it is pretty horrible, but you get the idea.
import os
import sys

words=set([])
lastperc = 0
current = 1
argl = 0

print "Searching for .txt-Files..."
for _, _, f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            argl = argl + 1

print "Found " + str(argl) + " Files. Beginning parsing process..."
print "0% 50% 100%"
for r, _, f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            fobj = open(os.path.join(r, file), "r")
            for line in fobj:
                line = line.strip()
                word, sep, remains = line.partition(" ")
                if word != "":
                    words.add(word)
                word, sep, remains = remains.partition(" ")
                while sep != "":
                    words.add(word)
                    word, sep, remains2 = remains.partition(" ")
                    remains = remains2
                if remains != "":
                    words.add(remains)
            newperc = int(float(current)/argl*100)
            if newperc-lastperc > 0:
                for i in range(newperc-lastperc):
                    sys.stdout.write("=")
                    sys.stdout.flush()
                lastperc = newperc
            current = current + 1

print ""
print "Done. Set contains " + str(len(words)) + " different words. Sorting..."
sorteddic = sorted(words, key=str.lower)
print "Sorted. Writing to File"
print "0% 50% 100%"
lastperc = 0
current = 1
sdicl = len(sorteddic) - 1
fobj = open(sys.argv[1], "w")
for element in sorteddic:
    fobj.write(element + "\n")
    newperc = int(float(current)/sdicl*100)
    if newperc-lastperc > 0:
        for i in range(newperc-lastperc):
            sys.stdout.write("=")
            sys.stdout.flush()
        lastperc = newperc
    current = current + 1

print ""
print "Done. Enjoy your wordlist."
Thanks for your help and ideas.
You're probably going to need to store the keys on disk. A key-value store like Redis might fit the bill.
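A minimal sketch of that idea, assuming a Redis server running locally and the third-party redis package; Redis sets deduplicate on the server side, outside your process's memory (the key name is arbitrary):

import redis

r = redis.Redis(host="localhost", port=6379)

def add_word(word):
    # SADD is a no-op if the member is already in the set.
    r.sadd("wordlist", word)

def unique_word_count():
    return r.scard("wordlist")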
Do you really mean 6-10 GB of unique words? Is this English text? Surely even counting proper nouns and names there shouldn't be more than a few million unique words.
Anyway, what I would do is process one file at a time, or even one section (say, 100k) of a file at a time, and build a unique wordlist just for that portion. Then just union all the sets as a post-processing step.
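If even the union of the per-chunk sets is too large for RAM, one way to finish the job on disk is an external sorted merge: dump each in-memory set as a sorted "run" file, then stream-merge the runs. A sketch of that variant (file naming and chunking policy are arbitrary here):

import heapq
import itertools
import tempfile

run_paths = []

def flush_chunk(words):
    # Write one in-memory set out as a sorted run file, then forget it.
    handle = tempfile.NamedTemporaryFile("w", suffix=".run", delete=False)
    with handle as out:
        for w in sorted(words):
            out.write(w + "\n")
    run_paths.append(handle.name)

def merge_runs(out_path):
    # Stream-merge the sorted runs; groupby drops duplicates on the fly.
    files = [open(p) for p in run_paths]
    try:
        merged = (line.rstrip("\n") for line in heapq.merge(*files))
        with open(out_path, "w") as out:
            for word, _ in itertools.groupby(merged):
                out.write(word + "\n")
    finally:
        for f in files:
            f.close()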
My inclination is towards a database table, but if you want to stay within a single framework, check out PyTables: http://www.pytables.org/moin
The first thing I'd try would be to restrict words to lower-case characters – as Tyler Eaves pointed out, this will probably reduce the set size enough to fit into memory. Here's some very basic code to do this:
import os
import fnmatch
import re

def find_files(path, pattern):
    for root, dirs, files in os.walk(path):
        for f in fnmatch.filter(files, pattern):
            yield os.path.join(root, f)

words = set()
for file_name in find_files(".", "*.txt"):
    with open(file_name) as f:
        data = f.read()
        words.update(re.findall(r"\w+", data.lower()))
A few more comments:
I would usually expect the dictionary to grow rapidly at the beginning; very few new words should be found late in the process, so your extrapolation might severely overestimate the final size of the word list.
Sets are very efficient for this purpose. They are implemented as hash tables, and adding a new word has an amortised complexity of O(1).
Hash your keys into a codespace that is smaller and more manageable. Key the hash to a file containing the keys with that hash. The table of hashes is much smaller, and the individual key files are much smaller.
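A rough sketch of that bucketing idea; the bucket count and file naming are arbitrary choices, not part of the answer:

import hashlib
import os

NUM_BUCKETS = 1024  # arbitrary; pick it so each bucket file stays small

def bucket_path(word):
    # A stable hash maps each word to one of NUM_BUCKETS small files.
    h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    return "bucket_{:04d}.txt".format(h % NUM_BUCKETS)

def add_word(word):
    # Only the word's own (small) bucket file is scanned for duplicates.
    path = bucket_path(word)
    if os.path.exists(path):
        with open(path) as f:
            if any(line.rstrip("\n") == word for line in f):
                return
    with open(path, "a") as f:
        f.write(word + "\n")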

python zipfile module doesn't seem to be compressing my files

I made a little helper function:
import zipfile

def main(archive_list=[], zfilename='default.zip'):
    print zfilename
    zout = zipfile.ZipFile(zfilename, "w")
    for fname in archive_list:
        print "writing: ", fname
        zout.write(fname)
    zout.close()

if __name__ == '__main__':
    main()
The problem is that my files are NOT being COMPRESSED! The output files are the same size and, effectively, just the extension is changed to ".zip" (from ".xls" in this case).
I'm running Python 2.5 on WinXP SP2.
This is because ZipFile requires you to specify the compression method. If you don't specify it, it assumes the compression method to be zipfile.ZIP_STORED, which only stores the files without compressing them. You need to specify the method to be zipfile.ZIP_DEFLATED. You will need to have the zlib module installed for this (it is usually installed by default).
import zipfile

def main(archive_list=[], zfilename='default.zip'):
    print zfilename
    zout = zipfile.ZipFile(zfilename, "w", zipfile.ZIP_DEFLATED)  # <--- this is the change you need to make
    for fname in archive_list:
        print "writing: ", fname
        zout.write(fname)
    zout.close()

if __name__ == '__main__':
    main()
Update: as per the documentation (Python 3.7), the value of the compression argument should be specified to override the default, which is ZIP_STORED. The available options are ZIP_DEFLATED, ZIP_BZIP2 or ZIP_LZMA, and the corresponding libraries zlib, bz2 or lzma must be available.
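For reference, a small Python 3 sketch showing both arguments together (the file names are placeholders):

import zipfile

# ZIP_DEFLATED needs zlib; compresslevel (Python 3.7+) trades speed for size.
with zipfile.ZipFile("default.zip", "w",
                     compression=zipfile.ZIP_DEFLATED,
                     compresslevel=9) as zout:
    zout.write("report.xls")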
There is a really easy way to compress to zip format: use shutil.make_archive from the standard library.
For example:
import shutil

# First argument: base name of the archive to create (without .zip);
# third argument: the directory whose contents will be compressed.
shutil.make_archive(file_name, 'zip', directory_to_compress)
See the shutil documentation for more extensive details.
Hope this is going to be useful to someone.
I tested all zip modes and benchmarked them on two data sets, the first small (~30 MB) and the other large (~1.5 GB). They consisted of various types of files so it would be as close to a real-life scenario as possible. I ran two kinds of tests on each dataset: a "proportional" one and a "complete" one. Both tests were repeated 3 times, one after another, to get an average. Those results may differ depending on your machine, but I think it's still a good place to start.
I did the tests with two methods because I'm trying to make my own specialized backup solution.
The proportional method creates more zip files, but it allows me to transfer smaller packages of data if necessary, e.g. replacing only things that changed. It's more complicated than that, but it is not important right now.
The complete method is just straight-up compressing the whole folder.
Compression ratio calculation:
size_difference = source_size - compressed_size
compression_ratio = (size_difference * 100.0) / source_size
Basically, the higher that number, the better.
Each zip archive was initialized like this:
# Mode tests
with zipfile.ZipFile(target_zip, 'w', compression_method) as ziph:
# Level tests
with zipfile.ZipFile(target_zip, 'w', compression_method, compresslevel=level) as ziph:
The results: it seems that no matter the method, the most optimal compression mode is ZIP_DEFLATED.
The only mode that produced a smaller archive was ZIP_LZMA, but it was only a fraction of a percent smaller and took about 8x longer for large data sets.
Furthermore, I tried different levels of compression with the same data sets and methods, except this time there was only one run per level.
It looks like ZIP_DEFLATED and ZIP_BZIP2 have similar compression capabilities, but the latter is much slower. For large data sets a compression level of 1 or 2 should suffice; increasing it further has no significant effect on the final file size. If the workload demands a lot of "small" zip files, it is better to use level 9: it gives a high compression ratio but takes about the same amount of time as level 1.
