Python - Duplicate File Finder using defaultdict

I'm experimenting with different ways to identify duplicate files, based on file content, by looping through the top-level directory where folders A-Z exist. Within folders A-Z there is one additional layer of folders named after the current date. Finally, within the dated folders there are between several thousand and several million (<3 million) files in various formats.
Using the script below I was able to process roughly 800,000 files in about 4 hours. However, running it over a larger data set of roughly 13,000,000 files total, it consistently breaks on the letter "I", which contains roughly 1.5 million files.
Given the size of the data I'm dealing with, I'm considering outputting the content directly to a text file and then importing it into MySQL or something similar for further processing. Please let me know if I'm going down the right track, or if you feel a modified version of the script below should be able to handle 13+ million files.
Question - How can I modify the script below to handle 13+ million files?
Error traceback:
Traceback (most recent call last):
  File "C:/Users/"user"/PycharmProjects/untitled/dups.py", line 28, in <module>
    for subdir, dirs, files in os.walk(path):
  File "C:\Python34\lib\os.py", line 379, in walk
    yield from walk(new_path, topdown, onerror, followlinks)
  File "C:\Python34\lib\os.py", line 372, in walk
    nondirs.append(name)
MemoryError
My code:
import hashlib
import os
import datetime
from collections import defaultdict

def hash(filepath):
    hash = hashlib.md5()
    blockSize = 65536
    with open(filepath, 'rb') as fpath:
        block = fpath.read(blockSize)
        while len(block) > 0:
            hash.update(block)
            block = fpath.read(blockSize)
    return hash.hexdigest()

directory = "\\\\path\\to\\files\\"
directories = [name for name in os.listdir(directory) if os.path.isdir(os.path.join(directory, name))]
outFile = open("\\path\\output.txt", "w", encoding='utf8')

for folder in directories:
    sizeList = defaultdict(list)
    path = directory + folder
    print("Start time: " + str(datetime.datetime.now()))
    print("Working on folder: " + folder)
    # Walk through one level of directories
    for subdir, dirs, files in os.walk(path):
        for file in files:
            filePath = os.path.join(subdir, file)
            sizeList[os.stat(filePath).st_size].append(filePath)
    print("Hashing " + str(len(sizeList)) + " Files")
    ## Hash remaining files
    fileList = defaultdict(list)
    for fileSize in sizeList.values():
        if len(fileSize) > 1:
            for dupSize in fileSize:
                fileList[hash(dupSize)].append(dupSize)
    ## Write remaining hashed files to file
    print("Writing Output")
    for fileHash in fileList.values():
        if len(fileHash) > 1:
            for hashOut in fileHash:
                outFile.write(hashOut + " ~ " + str(os.stat(hashOut).st_size) + '\n')
            outFile.write('\n')

outFile.close()
print("End time: " + str(datetime.datetime.now()))

Disclaimer: I don't know if this is a solution.
I looked at your code, and I realized the error is provoked by .walk. Now, it's true that this might be because of too much info being processed (so maybe an external DB would help matters, though the added operations might hinder your speed). But other than that, .listdir (which is called by .walk) is really terrible when you handle a huge amount of files. Hopefully this is resolved in Python 3.5, because it implements the way better scandir, so if you're willing* to try the latest (and I do mean latest, it was released, what, 8 days ago?), that might help.
Other than that you can try to trace bottlenecks and garbage collection to maybe figure it out.
*you can also just install it with pip using your current python, but where's the fun in that?
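To make the scandir suggestion concrete, here is a minimal sketch of the size-grouping pass rebuilt on os.scandir (standard library from Python 3.5). It is an illustration rather than a drop-in replacement for the script above: the path is a placeholder, and the gain comes from DirEntry objects carrying cached stat information, so grouping by size avoids a separate os.stat call per file.
import os
from collections import defaultdict

def sizes_by_scandir(path):
    """Group file paths under `path` by size using os.scandir (Python 3.5+)."""
    size_map = defaultdict(list)
    stack = [path]
    while stack:
        current = stack.pop()
        for entry in os.scandir(current):
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
            elif entry.is_file(follow_symlinks=False):
                # entry.stat() is usually served from scandir's cache on Windows
                size_map[entry.stat().st_size].append(entry.path)
    return size_map

size_map = sizes_by_scandir("\\\\path\\to\\files\\A")  # placeholder path, as in the question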

Related

Concatenating fasta files in folder into single file in python

I have multiple FASTA sequence files stored in a folder within my current working directory (called "Sequences") and am trying to combine all the sequences into a single file to run a MUSCLE multiple sequence alignment on.
This is what I have so far, and it is functional up until the output_fas.close(), where I get the error message FileNotFoundError: [Errno 2] No such file or directory: './Sequences'
Here is the code:
import os

os.getcwd()  # current directory
DIR = input("\nInput folder path containing FASTA files to combine into one FASTA file: ")
os.chdir(DIR)
FILE_NAME = input("\nWhat would you like to name your output file (e.g. combo.fas)? Note: "
                  "Please add the .fas extension: ")
output_fas = open(FILE_NAME, 'w')
file_count = 0
for f in os.listdir(DIR):
    if f.endswith(".fasta"):
        file_count += 1
        fh = open(os.path.join(DIR, f))
        for line in fh:
            output_fas.write(line)
        fh.close()
output_fas.close()
print(str(file_count) + " FASTA files were merged into one file, which can be found here: " + DIR)
When I input the directory, I input it as './Sequences', which successfully changes the directory.
Not quite sure what to do. I adjusted the code before and it successfully created the new file with all the sequences concatenated together; however, it ran continuously, would not end, and had multiple repeats of each sequence.
Appreciate the help!
The error should occur before the output_fas.close(), and should be seen at the os.listdir(DIR) call. The problem is that DIR becomes meaningless as soon as you execute the os.chdir(DIR) command. DIR was provided as a relative path, and os.chdir(DIR) changes to the new directory, making the old relative path no longer correct relative to the new directory.
If you're going to use os.chdir(DIR), then never use DIR again, and just change your loop to:
# Use with statement for guaranteed deterministic close at end of block & to avoid need
# for explicit close
with open(FILE_NAME, 'w') as output_fas:
    file_count = 0
    for f in os.listdir():  # Remove DIR to list the current directory
        if f.endswith(".fasta"):
            file_count += 1
            # Use a with for the same reason as above
            with open(f) as fh:  # Don't join to DIR because f is already correctly in the current directory
                output_fas.writelines(fh)  # writelines does the loop of write calls for you
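Alternatively, you can sidestep os.chdir entirely and keep absolute paths throughout. A minimal sketch of that approach with pathlib (the './Sequences' folder and 'combo.fas' name are placeholders taken from the question):
from pathlib import Path

seq_dir = Path("./Sequences").resolve()  # placeholder folder from the question
out_path = seq_dir / "combo.fas"         # placeholder output name

file_count = 0
with open(out_path, "w") as output_fas:
    for fasta in sorted(seq_dir.glob("*.fasta")):
        output_fas.write(fasta.read_text())
        file_count += 1
print(str(file_count) + " FASTA files were merged into " + str(out_path))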

Moving Files: Matching Partial File/Directory Criteria (lastName, firstName) - Glob, Shutil

EDIT: ANSWER - Below is the answer to the question. I will leave all subsequent text there just to show you how difficult I made such an easy task...
from pathlib import Path
import shutil

base = "C:/Users/Kenny/Documents/Clients"

for file in Path("C:/Users/Kenny/Documents/Scans").iterdir():
    name = file.stem.split('-')[0].rstrip()
    subdir = Path(base, name)
    if subdir.exists():
        dest = Path(subdir, file.name)
        shutil.move(file, dest)
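For clarity, the split on '-' leans on the 'lastName, firstName - [contentVariable]' naming convention described in the preface below; a quick illustration with a hypothetical file name:
from pathlib import Path

# hypothetical file name, following the "lastName, firstName - [content]" convention
f = Path("C:/Users/Kenny/Documents/Scans/Smith, John - Tax Return.pdf")
print(f.stem.split('-')[0].rstrip())  # -> "Smith, John", the matching client folder name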
Preface:
I'm trying to write code that will move hundreds of PDF files from a :/Scans folder into another directory based on the matching client's name. This question is linked below - a very kind person, Elis Byberi, helped me correct my original code. I'm encountering another problem though...
To see our discussion and a similar question discussed:
-Python- Move All PDF Files in Folder to NewDirectory Based on Matching Names, Using Glob or Shutil
Python move files from directories that match given criteria to new directory
Question: How can you move all of the named files in :/Scans to their appropriately matched folder in :/Clients.
Background: Here is a breakdown of my file folders to give you a better idea of what I'm trying to do.
Within :/Scans folder I have thousands of PDF files, manually renamed (I tried writing a program to auto-rename.. didn't work) based on client and content, such that the folder encloses PDFs labeled as follows:
lastName, firstName - [contentVariable]
(repeat the above 100,000x)
Within the :/C drive of my computer I have a folder named 'Clients' with sub-folders for each and every client, named similar to the pattern above, as 'lastName, firstName'
EDIT: The code below will move the entire Scans folder to the Clients folder, which is close, but not exactly what I need to be doing. I only need to move the files within Scans to the corresponding Client folder names.
import glob
import shutil
import os

source = "C:/Users/Kenny/Documents/Scans"
dest = "C:/Users/Kenny/Documents/Clients"

os.chdir("C:/Users/Kenny/Documents/Clients")
pattern = '*,*'

for x in glob.glob(pattern):
    fileName = os.path.join(source, x)
    print(fileName)
    shutil.move(source, dest)
EDIT 2 - CLOSE!: The code below will move all the files in Scans to the Clients folder, which is close, but not exactly what I need to be doing. I need to get each file into the correct corresponding client folder within the Clients folder.
This is a step forward from moving the entire Scans folder, I would think.
source = "C:/Users/Kenny/Documents/Scans"
dest = "C:/Users/Kenny/Documents/Clients"

for (dirpath, dirnames, filenames) in walk(source):
    for file in filenames:
        shutil.move(path.join(dirpath, file), dest)
I have the following code below as well, and I am aware it does not do what I want it to do, so I am definitely missing something..
import glob
import shutil
import os

path = "C:/Users/Kenny/Documents/Scans"
dirs = os.listdir(path)
for file in dirs:
    print(file)

dest_dir = "C:/Users/Kenny/Documents/Clients/{^w, $w}?"
for file in glob.glob(r'C:Users/Kenny/Documents/Clients/{^w, $w}?'):
    print(file)
    shutil.move(file, dest_dir)
1) Should I use os.scandir instead of os.listdir?
2) Am I moving in the correct direction if I modify the code as such:
import glob
import shutil
import os

path = "C:/Users/Kenny/Documents/Scans"
dirs = os.scandir(path)
for file in dirs:
    print(file)

dest_dir = "C:/Users/Kenny/Documents/Clients/*"
for file in glob.glob(r'C:Users/Kenny/Documents/Clients, *'):
    dest_dir = os.path.join(file, glob.glob)
    shutil.move(file, dest_dir)
Note: within for file in glob.glob(r'C:Users/Kenny/Documents/Clients/{^w, $w}?') I have tried replacing 'Clients/{^w, $w}?' with just 'Clients/*'.
For the above, I only need the file in :/Scans, written as "lastName, firstName - [content]", to be matched and moved to /Clients/[lastName, firstName] --- the [content] does not matter. But there are both greedy and non-greedy expressions... which is why I'm unsure about using * or {^w, $w}? -- because we have clients with the same last names, but different first names.
The following error is generated when running the first command:
[screenshots: Error 1, Error 2]
The following error (though, there is no error?) is generated when running the second command:
[screenshot: Error 3]
EDIT/POSSIBLE ANSWER
Have not yet tested this, but fnmatch.fnmatch(filename, pattern) can be used to test whether the filename string matches the pattern string, returning True or False (fnmatch.translate(pattern) converts the pattern into a regular expression that does the same job).
From here perhaps you could write a conditional statement...
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        shutil.move(source, destination)
or
for file in os.listdir('.'):
    if fnmatch.fnmatch(file, '*.txt'):
        shutil.move(file.join(eachFile, source), destination)
I have not tested the two snippets above. I have no idea if they work, but editing allows others to see how my train of thought is progressing.
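To make that train of thought concrete, here is a minimal, untested sketch along those lines (the paths are the ones from the question; the '*, * - *' pattern and the split on ' - ' are assumptions based on the 'lastName, firstName - [content]' naming convention):
import fnmatch
import os
import shutil

source = "C:/Users/Kenny/Documents/Scans"
clients = "C:/Users/Kenny/Documents/Clients"

for file in os.listdir(source):
    if fnmatch.fnmatch(file, "*, * - *"):        # "lastName, firstName - [content]"
        client = file.split(" - ")[0].rstrip()   # "lastName, firstName"
        dest_dir = os.path.join(clients, client)
        if os.path.isdir(dest_dir):              # move only into an existing client folder
            shutil.move(os.path.join(source, file), os.path.join(dest_dir, file))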

IOError: [Errno 2] No such file or directory: when the name was made by looping over existing files

I'm trying to have the bottom part of the code iterate over some files. These files are meant to correspond to one another and are differentiated by a number, so the counter changes the number part of the file name.
The file names are generated by looking through the given files, selecting the ones containing certain things in the title, and ordering them using the count.
This code works independently, in its own (lonely) folder, and prints the correct files in the correct order. However, when I use this in my main code, where file_1 and file_2 are referenced (in the decoder and encoder parts of the code), I get the error in the title. There is no way there is any typo or that the files don't exist, because Python generated these names itself from existing file names.
import os

count = 201
while 205 > count:
    indir = 'absolute_path/models'
    for root, dirs, filenames in os.walk(indir):
        for f in filenames:
            if 'test-decoder' in f:
                if f.endswith(".model"):
                    if str(count) in f:
                        file_1 = f
                        print(file_1)

    indir = 'absolute_path/models'
    for root, dirs, filenames in os.walk(indir):
        for f in filenames:
            if 'test-encoder' in f:
                if f.endswith(".model"):
                    if str(count) in f:
                        file_2 = f
                        print(file_2)

    decoder1.load_state_dict(
        torch.load(open(file_1, 'rb')))
    encoder1.load_state_dict(
        torch.load(open(file_2, 'rb')))
    print(getBlueScore(encoder1, decoder1, pairs, src, tgt))
    print_every = 10
    print(file_1 + file_2)
    count = count + 1
I then need to use these files two by two.
It's very possible that you are running into issues with variable scoping, but without being able to see your entire code it's hard to know for sure.
If you know what the model files should be called, might I suggest this code:
for i in range(201, 205):
    e = 'absolute_path/models/test_encoder_%d.model' % i
    d = 'absolute_path/models/test_decoder_%d.model' % i
    if os.path.exists(e) and os.path.exists(d):
        encoder1.load_state_dict(torch.load(open(e, 'rb')))
        decoder1.load_state_dict(torch.load(open(d, 'rb')))
Instead of relying on the existence of substrings in a path name, which could lead to errors, this forces only the files you want to open to be opened. Also, it gets rid of any possible scoping issues.
We could clean it up a bit more but you get the idea.
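One possible cleanup, shown as a sketch (the test_encoder/test_decoder file names are the assumptions from the snippet above, and encoder1/decoder1 are the models from the question's surrounding code):
import os
import torch

def load_model_pair(encoder, decoder, index, base='absolute_path/models'):
    """Load one matching encoder/decoder checkpoint pair, if both files exist."""
    e = os.path.join(base, 'test_encoder_%d.model' % index)
    d = os.path.join(base, 'test_decoder_%d.model' % index)
    if not (os.path.exists(e) and os.path.exists(d)):
        return False
    encoder.load_state_dict(torch.load(e))  # torch.load accepts a path directly
    decoder.load_state_dict(torch.load(d))
    return True

for i in range(201, 205):
    if load_model_pair(encoder1, decoder1, i):
        print('loaded checkpoint pair %d' % i)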

Python split a large file to multiple file

I'm new to Python and have a small bit of code that reads a list of files into a Windows batch file to execute. The trouble is, the code is (correctly) writing up to 700 lines per batch file, which would take an age to process on one PC. I'd like to break each batch file into, say, 100 lines per batch script, and then run each batch file across 7 PCs, for example, but my lack of Python knowledge is hampering me a little. I have:
import fnmatch
import os

def find_files(directory, pattern):
    # directory = raw_input("Enter a directory to search for Userlists: ")
    directory = "c:\\test\\"
    os.chdir(directory)
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename

for files in find_files('a-zA-Z0-9', '*.btm'):
    # print files
    lines = "call " + files + "\n"
    print lines
    with open("c://Extract.bat", "a+") as infile:
        infile.write(lines)  # the with block closes the file; no explicit close needed
I've tried using (islice(f,n)) but it failed to write anything to 'extract.bat'. Any help would, as always, be really appreciated.
Many thanks
A quite simple way would be to put this in your loop:
i += 1  # don't forget to initialize it before the loop, though
with open("c://Extract" + str(i/700) + ".bat", "a+") as infile:
This gives you a new file named Extract0.bat, Extract1.bat, ... every 700 lines.
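Putting the pieces together, here is a minimal sketch (reusing the question's c:\test\ directory and *.btm pattern; the chunk size of 100 matches the goal stated above, and i // chunk is integer division under both Python 2 and 3):
import fnmatch
import os

def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                yield os.path.join(root, basename)

chunk = 100  # lines per generated batch file
for i, btm in enumerate(find_files("c:\\test\\", "*.btm")):
    # i // chunk is 0 for the first 100 files, 1 for the next 100, and so on
    with open("c://Extract%d.bat" % (i // chunk), "a+") as outfile:
        outfile.write("call " + btm + "\n")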

Reading all files in all directories [duplicate]

I have the code working to read in the values of a single text file but am having difficulties reading all files from all directories and putting all of the contents together.
Here is what I have:
filename = '*'
filesuffix = '*'
location = os.path.join('Test', filename + "." + filesuffix)
Document = filename
thedictionary = {}

with open(location) as f:
    file_contents = f.read().lower().split(' ')  # split line on spaces to make a list
    for position, item in enumerate(file_contents):
        if item in thedictionary:
            thedictionary[item].append(position)
        else:
            thedictionary[item] = [position]

wordlist = (thedictionary, Document)
#print wordlist
#print thedictionary
Note that I am trying to stick the wildcard * in for the filename as well as for the filesuffix. I get the following error:
"IOError: [Errno 2] No such file or directory: 'Test/*.*'"
I am not sure if this is even the right way to do it, but it seems that if I somehow get the wildcards working, it should work.
I have gotten this example to work: Python - reading files from directory file not found in subdirectory (which is there)
Which is a little different - but don't know how to update it to read all files. I am thinking that in this initial set of code:
previous_dir = os.getcwd()
os.chdir('testfilefolder')
# add something here?
for filename in os.listdir('.'):
I would need to add something like an outer for loop, but I don't quite know what to put in it...
Any thoughts?
Python doesn't support wildcards directly in filenames to the open() call. You'll need to use the glob module instead to load files from a single level of subdirectories, or use os.walk() to walk an arbitrary directory structure.
Opening all text files in all subdirectories, one level deep:
import glob
import os

for filename in glob.iglob(os.path.join('Test', '*', '*.txt')):
    with open(filename) as f:
        pass  # one file open; handle it, and the next loop iteration presents a new file
Opening all text files in an arbitrary nesting of directories:
import os
import fnmatch

for dirpath, dirs, files in os.walk('Test'):
    for filename in fnmatch.filter(files, '*.txt'):
        with open(os.path.join(dirpath, filename)) as f:
            pass  # one file open; handle it, and the next loop iteration presents a new file
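For the question's actual goal (one combined word-position dictionary across every file), the second approach can be joined with the dictionary code from the question; a minimal sketch, assuming the 'Test' directory and '*.txt' suffix used in the snippets above:
import fnmatch
import os

thedictionary = {}
for dirpath, dirs, files in os.walk('Test'):
    for filename in fnmatch.filter(files, '*.txt'):
        with open(os.path.join(dirpath, filename)) as f:
            # record every position at which each word occurs, file by file
            for position, item in enumerate(f.read().lower().split(' ')):
                thedictionary.setdefault(item, []).append(position)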
