I am currently writing Python code to visit each directory and delete files with a specified extension. However, I only want the code to delete a file if two files with the same name but different extensions are present.
i.e. I only want mytext.txt deleted if both mytext.txt and mytext.docx are present in the same folder; if only mytext.txt is present, then I want the code to skip that folder.
I have added the following lines to remove the files with extension no matter the condition:
for item in test:
    if item.endswith('.txt'):
        os.remove(os.path.join(pathforRemove, item))
If 'f1.txt', 'f2.png', 'f2.txt', 'f3.png', 'f4.txt'
are your files:
from collections import defaultdict

test = ['f1.txt', 'f2.png', 'f2.txt', 'f3.png', 'f4.txt']

# construct a filename-to-extensions map
fname_to_ext = defaultdict(set)
pairs = [(s[:s.rfind('.')], s[s.rfind('.'):]) for s in test]
for fname, ext in pairs:
    fname_to_ext[fname].add(ext)

for fname, exts in fname_to_ext.items():
    if len(exts) > 1 and '.txt' in exts:
        print('deleting:', fname + '.txt')
        # os.remove(os.path.join(pathforRemove, fname + '.txt'))
This prints:
deleting: f2.txt
You can slightly modify your code to check this condition by storing the files in a dict with each value as a list. Then we gather all the values from the dictionary whose lists have at least two entries and which end in '.txt'. Once we have all those values, we delete them.
import os
from collections import defaultdict

recs = defaultdict(list)
for item in test:
    name = os.path.basename(item)
    no_ext = name.split('.')[0]
    recs[no_ext].append(name)

to_delete = [val for v in recs.values() for val in v
             if len(v) >= 2 and val.endswith('.txt')]

for item in to_delete:
    os.remove(os.path.join(pathforRemove, item))
You may try this code snippet to see if it fulfills your ask:
import os

rootDir = '/test-dir-traverse'
extensionToBeRetained = 'docx'
extensionToBeRemoved = 'txt'

for dirName, subdirList, fileList in os.walk(rootDir):
    print('Found directory: %s' % dirName)
    for fnameToBeRemoved in fileList:
        print('\t%s' % fnameToBeRemoved)
        for fname in fileList:
            if (fnameToBeRemoved.endswith(extensionToBeRemoved)
                    and fname.endswith(extensionToBeRetained)
                    and fnameToBeRemoved[:-len(extensionToBeRemoved)] == fname[:-len(extensionToBeRetained)]):
                print('Deleting file : {}'.format(fnameToBeRemoved))
You can adjust the file extensions and extend this further.
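As an alternative sketch of the same pair-check, pathlib makes the condition compact. The folder path and extensions here are placeholders, not taken from the original script:

```python
from pathlib import Path

def remove_paired(folder, remove_ext=".txt", keep_ext=".docx"):
    # Delete each <name>.txt only when a sibling <name>.docx exists;
    # folders containing an unpaired .txt are left untouched
    folder = Path(folder)
    for txt in folder.glob("*" + remove_ext):
        if txt.with_suffix(keep_ext).exists():
            txt.unlink()
```

To cover subdirectories, call it once per directory from os.walk, or swap glob for rglob.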
I am struggling with grouping files in a directory and returning only the file with the maximum id in each group.
There are following files in the directory:
FileA_212456.txt
FileA_234567.txt
FileB_88912.txt
FileB_891234.txt
FileC_829103.txt
FileC_821234.txt
...
The expected results is:
FileA_234567.txt
FileB_891234.txt
FileC_821234.txt ...
I tried the code below, splitting the file name by "_" and using [1] as an id to sort by and return max(id), but I am not sure how to group them in a dictionary. Is there a better way to accomplish this?
import os

directory = '/directory'
dictionary = {}
for file in os.listdir(directory):
    id = file.split('_')[1].split('.')[0]
    file_name = file.split('_')[0]
    dictionary[id] = file_name
print([max(k) for k in dictionary.items()])
The dictionary should be organized the other way round:
key should be filename (without id)
ids should be created (if filename key doesn't exist) or updated when a greater value is found
like this (with a hardcoded list so it's self-contained):
files = """FileA_212456.txt
FileA_234567.txt
FileB_88912.txt
FileB_891234.txt
FileC_829103.txt
FileC_821234.txt""".splitlines()

dictionary = {}
for file in files:
    ident = int(file.split('_')[1].split('.')[0])
    file_name = file.split('_')[0]
    if file_name not in dictionary:
        dictionary[file_name] = ident  # first time
    else:
        dictionary[file_name] = max(dictionary[file_name], ident)

for k, v in dictionary.items():
    print("{}_{}.txt".format(k, v))
the result is:
FileA_234567.txt
FileB_891234.txt
FileC_829103.txt
I would say go with an if and elif statement to check whether the current loop iteration has a bigger number. Also a few changes:
"id" is a builtin in Python, so I would name it something else.
Make sure to convert the "id" to an int to be able to compare correctly, or else you're just comparing strings.
This one is extra, but I imported collections to easily sort the dictionary by "file_name".
Here is the code:
import os
import collections

directory = './directory'
dictionary = {}
for file in os.listdir(directory):
    fileID = int(file.split('_')[1].split('.')[0])
    file_name = file.split('_')[0]
    if file_name not in dictionary:
        dictionary[file_name] = fileID
    elif dictionary[file_name] < fileID:
        dictionary[file_name] = fileID

dictionary = collections.OrderedDict(sorted(dictionary.items()))
print(dictionary)
for x in dictionary.keys():
    print(f"{x}_{dictionary[x]}.txt")
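Another sketch of the same idea, using itertools.groupby on the sorted names instead of a dictionary (the file list is hardcoded here so the snippet is self-contained):

```python
from itertools import groupby

files = ["FileA_212456.txt", "FileA_234567.txt", "FileB_88912.txt",
         "FileB_891234.txt", "FileC_829103.txt", "FileC_821234.txt"]

def base(fname):
    # part before the underscore, e.g. "FileA"
    return fname.split('_')[0]

def ident(fname):
    # numeric id between the underscore and the extension
    return int(fname.split('_')[1].split('.')[0])

# groupby needs its input sorted by the same key it groups on
latest = [max(group, key=ident)
          for _, group in groupby(sorted(files, key=base), key=base)]
print(latest)  # ['FileA_234567.txt', 'FileB_891234.txt', 'FileC_829103.txt']
```

Note that 829103 is the true maximum for FileC, matching the max-based answers above.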
I have the tree above. I need to search the directories and files from the tree recursively and return them as a dictionary in the following form ->
key: directories/name of file and value: first line of file
eg: key:1/2/5/test5 value:first line of test 5
So far, I have created the following code:
def search(root):
    items = os.listdir(root)
    for element in items:
        if os.path.isfile(element):
            with open(element) as file:
                one_line = file.readline()
                print(one_line)
        elif os.path.isdir(element):
            search(os.path.join(root, element))
The problem is that my code only searches the directories. Please help me understand where I'm wrong and how to solve it. Massive appreciation for any help, thank you!
Your code is almost correct. It has to be adjusted a little, though.
More specifically,
element is a file or directory name (not a path). If it is a subdirectory, or a file inside a subdirectory, both os.path.isfile(element) and os.path.isdir(element) will always be False. Hence, replace them with os.path.isfile(os.path.join(root, element)) and os.path.isdir(os.path.join(root, element)) respectively.
Similarly, with open(element) should be replaced by with open(os.path.join(root,element)).
When reading the file's first line, you have to store the path and that line in a dictionary.
That dictionary has to be updated when calling the recursive function in elif os.path.isdir(element).
See below for the complete snippet:
import os

def search(root):
    my_dict = {}  # this is the final dictionary to be populated
    for element in os.listdir(root):
        if os.path.isfile(os.path.join(root, element)):
            try:
                with open(os.path.join(root, element)) as file:
                    # populate the dictionary
                    my_dict[os.path.join(root, element)] = file.readline()
            except UnicodeDecodeError:
                # ignore decode errors (some files cannot be read as text)
                pass
        elif os.path.isdir(os.path.join(root, element)):
            # update the current dictionary with the one from the recursive call
            my_dict.update(search(os.path.join(root, element)))
    return my_dict

print(search('.'))
It prints a dictionary like below:
{
"path/file.csv": "name,surname,grade",
"path/to/file1.txt": "this is the first line of file 1",
"path/to/file2.py": "import os"
}
For the sake of readability, os.path.join(root, element) can be stored in a variable, then:
import os

def search(root):
    my_dict = {}  # this is the final dictionary to be populated
    for element in os.listdir(root):
        path = os.path.join(root, element)
        if os.path.isfile(path):
            with open(path) as file:
                my_dict[path] = file.readline()
        elif os.path.isdir(path):
            my_dict.update(search(path))
    return my_dict

print(search('.'))
You can use os.walk
The following function will not include empty folders.
import os

def get_tree(startpath):
    tree = {}
    for root, dirs, files in os.walk(startpath):
        for file in files:
            path = os.path.join(root, file)
            with open(path, 'r') as f:
                first_line = f.readline()
            tree[path] = first_line
    return tree
The output will be like this:
{
file_path : first_line_of_the_file,
file_path2 : first_line_of_the_file2,
...
}
So I'm dealing with a script that needs to zip all files that share the same name into a single archive. For example, the folder structure looks like this...
001.flt
001.hdr
001.prj
002.flt
002.hdr
002.prj
...
700.flt
700.hdr
700.prj
In order to get the files zipped, I have a script that can handle a single file type but does not recognize ["*.flt", "*.hdr", "*.prj"].
Is there a workaround for getting the script to recognize the file names and group them by name as well? I would like each group to be zipped as
001.zip, 002.zip....
meaning each zip file contains the different file extensions:
001.zip(
001.hdr,
001.prj,
001.flt
)
import zipfile, sys, os, glob

inDir = r"\test\DEM"
outDir = r"\test\DEM_out"
filetype = "*.flt"

def zipfiletypeInDir(inDir, outDir):
    # Check that input directory exists
    if not os.path.exists(inDir):
        print("Input directory %s does not exist!" % inDir)
        return False
    print("Zipping filetype(s) in folder %s to output folder %s" % (inDir, outDir))
    # Loop through "filetype" in input directory; glob will match pathnames
    for inShp in glob.glob(os.path.join(inDir, filetype)):
        # Build the filename of the output zip file
        outZip = os.path.join(outDir, os.path.splitext(os.path.basename(inShp))[0] + ".zip")
        # Zip the "filetype"
        zipfiletype(inShp, outZip)
    return True

def zipfiletype(infiletype, newZipFN):
    print('Starting to Zip ' + infiletype + ' to ' + newZipFN)
    # Delete output zipfile if it already exists
    if os.path.exists(newZipFN):
        print('Deleting ' + newZipFN)
        os.remove(newZipFN)
    # Output zipfile still exists, exit
    if os.path.exists(newZipFN):
        print('Unable to Delete ' + newZipFN)
        return False
    # Open zip file object
    zipobj = zipfile.ZipFile(newZipFN, 'w')
    # Loop through "filetype" components
    for infile in glob.glob(infiletype.lower().replace(filetype, "*.flt")):
        # Skip .zip file extension
        if os.path.splitext(infile)[1].lower() != ".zip":
            print("Zipping %s" % infile)
            # Zip the "filetype" component
            zipobj.write(infile, os.path.basename(infile), zipfile.ZIP_DEFLATED)
    zipobj.close()
    return True

if __name__ == "__main__":
    zipfiletypeInDir(inDir, outDir)
    print("done!")
If the possible duplicate I provided doesn't answer your question....
One way would be to iterate over all the file names and make a dictionary grouping all the files with the same name.
In [54]: import collections, os, zipfile

In [55]: zips = collections.defaultdict(list)

In [56]: for f in os.listdir():
    ...:     name, ext = os.path.splitext(f)
    ...:     zips[name].append(f)
Then iterate over the dictionary; creating a new zip file for each key and adding each key's files to it.
In [57]: outdir = r'zips'

In [58]: for k, v in zips.items():
    ...:     zname = k + '.zip'
    ...:     fpath = os.path.join(outdir, zname)
    ...:     # print(fpath)
    ...:     with zipfile.ZipFile(fpath, 'w') as z:
    ...:         for name in v:
    ...:             z.write(name)
I found what I was looking for. This script identifies the names of the files and groups them based on that with the iterator.
# group files into separate zip folders from a single directory based on
# individual file names
import fnmatch, os, glob, zipfile

# edit data folders for in and out variables
path = r"D:\Users\in_path"
out_path = r"C:\Users\out_path"

# create variables used in iterator
obj = os.listdir(path)
my_iterator = obj.__iter__()

# iterate each file name as '%s.*'
for obj in my_iterator:
    # define the name of the file for the rest of the iterator to use
    name = os.path.splitext(obj)[0]
    print(name)
    # create a zip folder to store the data that is being compressed
    zip_path = os.path.join(out_path, name + '.zip')
    # create variable 'zip' that directs the data into the compressed folder
    zip = zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED)
    os.chdir(path)
    # files are written to the folder with glob.glob
    for files in glob.glob('%s.*' % name):
        zip.write(os.path.join(path, files), files)
    # print each iteration of files being written
    print('All files written to %s' % zip_path)
    zip.close()
import os, unicodecsv as csv

# open and store the csv file
IDs = {}
with open('labels.csv', 'rb') as csvfile:
    timeReader = csv.reader(csvfile, delimiter=',')
    # build dictionary with associated IDs
    for row in timeReader:
        IDs[row[0]] = row[1]

# move files
path = 'train/'
tmpPath = 'train2/'
for oldname in os.listdir(path):
    # ignore files in path which aren't in the csv file
    if oldname in IDs:
        try:
            os.rename(os.path.join(path, oldname), os.path.join(tmpPath, IDs[oldname]))
        except:
            print 'File ' + oldname + ' could not be renamed to ' + IDs[oldname] + '!'
I am trying to sort my files according to this csv file. But the file contains many ids with the same name. Is there a way to move files with the same name into one folder, or to add a number to a file name if a file with the same name already exists in the directory?
Example-
id name
001232131hja1.jpg golden_retreiver
0121221122ld.jpg black_hound
0232113222kl.jpg golden_retreiver
0213113jjdsh.jpg alsetian
05hkhdsk1233a.jpg black_hound
I actually want to move all the files having id corresponding to golden_retreiver to one folder and so on.
Based on what you describe, here is my approach:
import csv
import os

SOURCE_ROOT = 'train'
DEST_ROOT = 'train2'

with open('labels.csv') as infile:
    next(infile)  # Skip the header row
    reader = csv.reader(infile)
    seen = set()
    for dogid, breed in reader:
        # Create a new directory if needed
        if breed not in seen:
            os.mkdir(os.path.join(DEST_ROOT, breed))
            seen.add(breed)
        src = os.path.join(SOURCE_ROOT, dogid + '.jpg')
        dest = os.path.join(DEST_ROOT, breed, dogid + '.jpg')
        try:
            os.rename(src, dest)
        except WindowsError as e:
            print e
Notes
For every line in the data file, I create the breed directory at the destination. I use the set seen to make sure that I only create each directory once.
After that, it is a trivial matter of moving the files into place.
One possible move error: the file does not exist in the source dir. In that case, the code just prints out the error and ignores it.
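On the asker's side question about clashing names: a common pattern is to probe for a free name and append a counter before the extension. The helper below is a hypothetical sketch, not part of the answer above:

```python
import os

def unique_dest(folder, filename):
    # Return a path inside `folder` that does not collide with an
    # existing file, appending _1, _2, ... before the extension
    base, ext = os.path.splitext(filename)
    candidate = os.path.join(folder, filename)
    n = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, "{}_{}{}".format(base, n, ext))
        n += 1
    return candidate
```

You would then move with os.rename(src, unique_dest(dest_folder, name)) instead of renaming blindly.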
Background:
My target is to find duplicate files in two different folders (without subfolders). To do that, I use the following Python script:
### Check whether all archives still exist or whether files were deleted during the check
def listfiles(path):
    files = []
    for dirName, subdirList, fileList in os.walk(path):
        dir = dirName.replace(path, '')
        for fname in fileList:
            if fname.endswith("_GIS.7z"):
                files.append(os.path.join(dir, fname))
    return files

x = listfiles(root)
y = listfiles(backupfolderGIS)

#q = [filename for filename in x if filename not in y]
files_only_in_x = set(x) - set(y)
files_only_in_y = set(y) - set(x)
files_only_in_either = set(x) ^ set(y)
files_in_both = set(x) & set(y)
all_files = set(x) | set(y)

print "All files:"
print all_files
print " "
print "Only in the staging folder:"
print files_only_in_x
print " "
print "Only in the backup folder:"
print files_only_in_y
print " "
print "Only in one of the two folders:"
print files_only_in_either
print " "
print "In both folders:"
print files_in_both
print " "
The relevant output variable/list is files_in_both (folders); it shows me the duplicates. If I print it, it looks like set(['NameoftheProject_GIS.7z', 'NameofanotherProject_GIS.7z']).
Question:
How can I use this output/information (about duplicate files in the directories) to delete or move them? Here, for example, the files NameoftheProject_GIS.7z and NameofanotherProject_GIS.7z in folder backupfolderGIS / list files_in_both.
os.walk recursively checks all folders and subfolders starting from the root dir you pass. You want to check two different folders (without subfolders), so just search each folder with glob; if you want to move files, you can use shutil.move:
from glob import iglob
from os import path
from shutil import move

pt1, pt2 = "/path_1", "/path_2"

dupe = set(map(path.basename, iglob(path.join(pt1, "*_GIS.7z")))).intersection(
    map(path.basename, iglob(path.join(pt2, "*_GIS.7z"))))

for fle in dupe:
    # move(src, dest)
    move(path.join(pt1, fle), "wherever")
Or to delete use os.remove:
import os

for fle in dupe:
    os.remove(path.join(pt1, fle))
If you want to move/delete the file from pt2 then pass that to path.join in place of pt1.
You could also use str.endswith with os.listdir:
dupe = set(fname for fname in os.listdir(pt1) if fname.endswith("_GIS.7z")).intersection(
    fname for fname in os.listdir(pt2) if fname.endswith("_GIS.7z"))
To avoid repeating you can put it in a function:
from shutil import move
from os import path, listdir

def listfiles(path, end):
    return set(fname for fname in listdir(path) if fname.endswith(end))

for fle in listfiles(pt1, "_GIS.7z").intersection(listfiles(pt2, "_GIS.7z")):
    move(path.join(pt1, fle), "wherever")
Now if you did want to check all folders for files with the same basename and do something with the duplicate names, you would need to keep a record of the full paths. You can group all common files by basename using a defaultdict:
from os import path, walk
from collections import defaultdict

def listfiles(pth, end):
    files = defaultdict(list)
    for dirName, subdirList, fileList in walk(pth):
        for fname in fileList:
            if fname.endswith(end):
                files[fname].append(path.join(dirName, fname))
    return files
You will get a dict where the keys are the basenames and the values are lists of files with the full path to each. Any list with more than one value means you have at least two files with the same name, but remember that having the same basename does not mean the files are actually identical.
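To confirm that files sharing a basename really are byte-identical, you can compare content hashes. This is a sketch layered on the basename-to-paths dict described above, using hashlib with chunked reads so large archives are not loaded into memory whole:

```python
import hashlib

def file_digest(fpath, chunk=65536):
    # SHA-256 of a file, read in fixed-size chunks
    h = hashlib.sha256()
    with open(fpath, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def true_duplicates(files):
    # `files` maps basename -> list of full paths (as from listfiles above);
    # keep only groups whose contents are byte-identical
    dupes = {}
    for name, paths in files.items():
        if len(paths) > 1 and len({file_digest(p) for p in paths}) == 1:
            dupes[name] = paths
    return dupes
```

Files with equal hashes are, for practical purposes, duplicates; groups with differing hashes merely share a name.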