My script logs information about all the unique file types in a directory and its subdirectories. While building a unique list of file extensions, the current code seems to treat jpg, Jpg and JPG as the same, so it only includes one of them in the list. How can I include all three (or more) variants?
for root, dirs, files in os.walk(SourceDIR, topdown=False):
    for fl in files:
        currentFile=os.path.join(root, fl)
        ext=fl[fl.rfind('.')+1:]
        if ext!='':
            if DirLimiter in currentFile:
                List.append(currentFile)
                directory1=os.path.basename(os.path.normpath(currentFile[:currentFile.rfind(DirLimiter)]))
                directory2=(currentFile[len(SourceDIR):currentFile.rfind('\\'+directory1+DirLimiter)])
                directory=directory2+'\\'+directory1
                if directory not in dirList:
                    dirCount+=1
                    dirList.append(directory)
                if ext not in extList:
                    extList.append(ext)
The full script is in this question on stackexchange: Recurse through directories and log files by file type in python
Thanks to JennaK. On further investigation I found that the input for the jpg report actually had both JPG and jpg in the file, as below.
> 44;X:\scratch\project\Input\Foreshore and Jetties Package 3\487679 - Jetty\IMG_1630.JPG;3755267
> 45;X:\scratch\project\Input\Foreshore and Jetties Package 3\487679 - Jetty\IMG_1633.JPG;2447135
> 1;X:\scratch\project\Input\649701 - Hill Close\2263.jpg;405328
> 2;X:\scratch\project\Input\649701 - Hill Close\2264.jpg;372770
So it first collected the details of all the JPG files, then the jpg files, and put them in a single report, which is actually more convenient than having two reports. I guess I programmed better than I thought :-)
No. For a list, the in operator tests for equality, and two strings are only equal when they use the same case.
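A quick check shows this behaviour. The sample extension values below are made up for illustration and are not taken from the script:

```python
# Hypothetical extension values, chosen only to illustrate case-sensitive membership.
ext_list = []
for ext in ["jpg", "Jpg", "JPG", "jpg"]:
    if ext not in ext_list:  # `in` compares with ==, which is case-sensitive for str
        ext_list.append(ext)

print(ext_list)
```

All three spellings end up in the list; only the exact repeat of "jpg" is skipped.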
You could use a set here and store all directory.lower() values in it. Sets are also (a lot) faster than lists for membership testing:
directories = set()
extensions = set()

for root, dirs, files in os.walk(SourceDIR, topdown=False):
    # ...
        # no need to use `directory.lower() in directories`, just update the set:
        directories.add(directory.lower())
        # ...
        extensions.add(ext.lower())
The dirCount variable is easily derived later on:
dirCount = len(directories)
You also want to look into the functions provided by os.path some more, in particular the os.path.splitext(), os.path.relpath() and os.path.join() functions.
Your file handling in the loop can be simplified a lot:
for fl in files:
    filename = os.path.join(root, fl)
    base, ext = os.path.splitext(filename)
    if ext:
        List.append(filename)
        directory = os.path.relpath(filename, SourceDIR)
        directories.add(directory.lower())
        extensions.add(ext)
Note that I use just os.path.relpath() here; your os.path.basename() and os.path.normpath() dance plus delimiters, etc. was needlessly complicated.
Now, reading between the lines, it seems that you only want to consider extensions to be equal whatever the case of just that part.
In that case, build yourself a new filename from the result of os.path.splitext():
base, ext = os.path.splitext(filename)
normalized_filename = base + ext.lower()
Now normalized_filename is the filename with the extension lowered, so you can use that value in the sets as needed.
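As a rough sketch of that idea: the helper name normalize_extension and the sample file names are my own assumptions, picked for illustration only.

```python
import os.path

def normalize_extension(filename):
    """Return filename with only its extension lower-cased (hypothetical helper)."""
    base, ext = os.path.splitext(filename)
    return base + ext.lower()

# Made-up sample names; the base name keeps its case, only the extension changes.
names = ["IMG_1630.JPG", "Photo.Jpg", "2263.jpg", "README"]
normalized = [normalize_extension(n) for n in names]
print(normalized)
```

Names without an extension pass through unchanged, because os.path.splitext() returns an empty string for ext in that case.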
Related
I have a folder structure as shown below
There are several subfolders with duplicate names. All I want is that, when a duplicate subfolder name is encountered, it is prefixed with the parent folder name.
e.g.
DIR2>SUBDIR1 should be renamed to DIR2>DIR2_SUBDIR1. When the folder is renamed to DIR2_SUBDIR1, the files inside this folder should also get the parent folder name as a prefix.
e.g. DIR2>SUBDIR1>subdirtst2.txt should now become DIR2>DIR2_SUBDIR1>DIR2_subdirtst2.txt
What have I done so far?
I have simply added all the folder names to a list; after this I am not able to figure out any elegant way to do this task.
import os

list_dir=[]
for root, dirs, files in os.walk(os.getcwd()):
    for file in files:
        if file.endswith(".txt"):
            path_file = os.path.join(root)
            print(path_file)
            list_dir.append(path_file)
The following snippet should be able to achieve what you desire. I've written it in a way that clearly shows what is being done, so I'm sure there might be tweaks to make it more efficient or elegant.
import os

cwd = os.getcwd()
to_be_renamed = set()
for rootdir in next(os.walk(cwd))[1]:
    if to_be_renamed == set():
        to_be_renamed = set(next(os.walk(os.path.join(cwd, rootdir)))[1])
    else:
        to_be_renamed &= set(next(os.walk(os.path.join(cwd, rootdir)))[1])

for rootdir in next(os.walk(cwd))[1]:
    subdirs = next(os.walk(os.path.join(cwd, rootdir)))[1]
    for s in subdirs:
        if s in to_be_renamed:
            srcpath = os.path.join(cwd, rootdir, s)
            dstpath = os.path.join(cwd, rootdir, rootdir+'_'+s)
            # First rename files
            for f in next(os.walk(srcpath))[2]:
                os.rename(os.path.join(srcpath, f), os.path.join(srcpath, rootdir+'_'+f))
            # Now rename dir
            os.rename(srcpath, dstpath)
            print('Renamed', s, 'and files')
Here, cwd stores the path to the dir that contains DIR1, DIR2 and DIR3. The first loop checks all immediate subdirectories of these 'root directories' and creates a set of duplicated subdirectory names by repeatedly taking their intersection (&).
Then it runs another loop, checks if the subdirectory is to be renamed and finally uses the os.rename function to rename it and all the files it contains.
os.walk() returns a 3-tuple with path to the directory, the directories in it, and the files in it, at each step. It 'walks' the tree in either a top-down or bottom-up manner, and doesn't stop at one iteration.
So, the built-in next() function is used to fetch the first result (that of the current dir), after which either [1] or [2] is used to get the directories or the files respectively.
If you want to rename not just files, but all items in the subdirectories being renamed, then replace next(os.walk(srcpath))[2] with os.listdir(srcpath). This list contains both files and directories.
NOTE: The reason I'm computing the list of duplicated names first in a separate loop is so that the first occurrence is not left unchanged. Renaming in the same loop will miss that first one.
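One caveat about the first loop: taking intersections keeps a name only while every directory seen so far contains it. If "duplicate" should instead mean "occurs under at least two of the top-level directories", a count-based variant might fit better. This is a different technique from the snippet above, and the helper name duplicated_subdir_names and the throwaway DIR1/DIR2/DIR3 layout are assumptions for illustration:

```python
import os
import tempfile
from collections import Counter

def duplicated_subdir_names(top):
    """Return subdirectory names found under more than one child of `top`."""
    counts = Counter()
    for parent in next(os.walk(top))[1]:
        # Count each immediate subdirectory name once per parent directory.
        counts.update(next(os.walk(os.path.join(top, parent)))[1])
    return {name for name, n in counts.items() if n > 1}

# Throwaway layout in a temporary directory, only to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    for parts in (("DIR1", "SUBDIR1"), ("DIR2", "SUBDIR1"),
                  ("DIR2", "OTHER"), ("DIR3", "UNIQUE")):
        os.makedirs(os.path.join(tmp, *parts))
    duplicates = duplicated_subdir_names(tmp)

print(duplicates)
```

The resulting set could be used in place of to_be_renamed in the second loop; the renaming logic itself would stay the same.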
I use os.listdir and it works fine, but I get sub-directories in the list also, which is not what I want: I need only files.
What function do I need to use for that?
I looked also at os.walk and it seems to be what I want, but I'm not sure of how it works.
You need to filter out directories; os.listdir() lists all names in a given path. You can use os.path.isdir() for this:
basepath = '/path/to/directory'
for fname in os.listdir(basepath):
    path = os.path.join(basepath, fname)
    if os.path.isdir(path):
        # skip directories
        continue
Note that this only filters out directories after following symlinks. fname is not necessarily a regular file, it could also be a symlink to a file. If you need to filter out symlinks as well, you'd need to use not os.path.islink() first.
On a modern Python version (3.5 or newer), an even better option is to use the os.scandir() function; this produces DirEntry() instances. In the common case, this is faster as the direntry loaded already has cached enough information to determine if an entry is a directory or not:
basepath = '/path/to/directory'
for entry in os.scandir(basepath):
    if entry.is_dir():
        # skip directories
        continue
    # use entry.path to get the full path of this entry, or use
    # entry.name for the base filename
You can use entry.is_file(follow_symlinks=False) if only regular files (and not symlinks) are needed.
os.walk() does the same work under the hood; unless you need to recurse down subdirectories, you don't need to use os.walk() here.
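Putting the scandir() pieces together, a minimal sketch could look like the following. The function name regular_files and the temporary directory contents are assumptions made for the example:

```python
import os
import tempfile

def regular_files(basepath):
    """Return sorted names of regular files directly inside basepath (symlinks excluded)."""
    with os.scandir(basepath) as entries:
        return sorted(e.name for e in entries if e.is_file(follow_symlinks=False))

# Throwaway directory with two files and one subdirectory, just to try it out.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "subdir"))
    for name in ("b.txt", "a.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = regular_files(tmp)

print(found)
```

Using os.scandir() as a context manager requires Python 3.6 or newer; on 3.5 the plain for loop shown earlier works instead.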
Here is a nice little one-liner in the form of a list comprehension:
[f for f in os.listdir(your_directory) if os.path.isfile(os.path.join(your_directory, f))]
This will return a list of filenames within the specified your_directory.
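import os
directoryOfChoice = "C:\\"  # Replace with a directory of your choice!
# os.listdir() returns bare names, so join them with the directory before testing
paths = (os.path.join(directoryOfChoice, name) for name in os.listdir(directoryOfChoice))
files = list(filter(os.path.isfile, paths))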
P.S: os.getcwd() returns the current directory.
for fname in os.listdir('.'):
    if os.path.isdir(fname):
        pass  # do your stuff here for directory
    else:
        pass  # do your stuff here for regular file
The solution with os.walk() would be:
for r, d, f in os.walk('path/to/dir'):
    for name in f:
        # This prints every file under the given directory, including subdirectories
        print(os.path.join(r, name))
Even though this is an older post, for the sake of completeness let me add the pathlib library, introduced in Python 3.4, which provides an object-oriented style of handling directories and files. To get all files in a directory, you can use:
from pathlib import Path

def get_list_of_files_in_dir(directory: str, file_types: str = '*') -> list:
    return [f for f in Path(directory).glob(file_types) if f.is_file()]
Following your example, you could use it like this:
mypath = '/path/to/directory'
files = get_list_of_files_in_dir(mypath)
If you only want a subset of files depending on the file extension (e.g. "only csv files"), you can use:
files = get_list_of_files_in_dir(mypath, '*.csv')
Note that, per PEP 471, the DirEntry method signature is is_dir(*, follow_symlinks=True), so:
from os import scandir

folder = '/home/myfolder/'
for entry in scandir(folder):
    if entry.is_dir():
        # do code or skip
        continue
    myfile = folder + entry.name
    # do something with myfile
This is part of a program I'm writing. The goal is to extract all the GPX files, say at G:\ (specified with -e G:\ at the command line). It would create an 'Exports' folder and dump all files with matching extensions there, recursively that is. Works great, a friend helped me write it!! Problem: empty directories and subdirectories for dirs that did not contain GPX files.
import argparse, shutil, os

def ignore_list(path, files):  # This ignore list is specified in the function below.
    ret = []
    for fname in files:
        fullFileName = os.path.normpath(path) + os.sep + fname
        if not os.path.isdir(fullFileName) \
           and not fname.endswith('gpx'):
            ret.append(fname)
        elif os.path.isdir(fullFileName) \
           and len(os.listdir(fullFileName)) == 0:  # This isn't doing what it's supposed to.
            ret.append(fname)
    return ret

def gpxextract(src, dest):
    shutil.copytree(src, dest, ignore=ignore_list)
Later in the program we have the call that handles extractpath:
if args.extractpath:
    path = args.extractpath
    gpxextract(path, 'Exports')
So the above extraction does work. But the len function call above is designed to prevent the creation of empty dirs and does not. I know the best way is to os.rmdir somehow after the export, and while there's no error, the folders remain.
So how can I successfully prune this Exports folder so that only dirs with GPXs will be in there? :)
If I understand you correctly, you want to delete empty folders? If that is the case, you can do a bottom-up delete operation, which will fail for any folders that are not empty. Something like:
for root, dirs, files in os.walk('G:/', topdown=False):
    for dn in dirs:
        pth = os.path.join(root, dn)
        try:
            os.rmdir(pth)
        except OSError:
            pass
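To see the bottom-up pruning in action without touching a real drive, here is a small sketch that runs against a temporary directory. The function name prune_empty_dirs and the sample folder names are made up for the example:

```python
import os
import tempfile

def prune_empty_dirs(top):
    """Remove empty directories below top, deepest first; top itself is kept."""
    removed = []
    for root, dirs, files in os.walk(top, topdown=False):
        for name in dirs:
            path = os.path.join(root, name)
            try:
                os.rmdir(path)  # raises OSError if the directory is not empty
                removed.append(os.path.relpath(path, top))
            except OSError:
                pass
    return removed

# Throwaway tree: one chain of empty folders and one folder holding a GPX file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "empty", "nested"))
    os.makedirs(os.path.join(tmp, "tracks"))
    open(os.path.join(tmp, "tracks", "ride.gpx"), "w").close()
    removed = prune_empty_dirs(tmp)
    remaining = sorted(os.listdir(tmp))

print(removed, remaining)
```

Because the walk is bottom-up, a directory that only contained empty directories is itself empty by the time its parent is processed, so whole empty chains disappear in one pass.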
I want to copy multiple files from one directory, renaming them in increments of 500. For example, the first 500 files in C:\Pics (with random original names) would be renamed 500-1000 and placed in a new directory called 500; files 1000-1500 would go into directory 1000, and so on.
The current code does not rename the files but instead puts each one in a new directory with the correct number. This was just a start. I believe the code below is a good starting point; can anyone help me modify it to get the desired results?
import os, glob

target = 'C:\Pics'
prefix = 'p0'
os.chdir(target)
allfiles = os.listdir(target)
count = 500
for filename in allfiles:
    if not glob.glob('*.jpg'): continue
    dirname = prefix + str(count)
    target = os.path.join(dirname, filename)
    os.renames(filename, target)
    count +=1
os.listdir and glob.glob are similar functions. They both return lists of files/dirs, so they don't belong in the same loop (at least not the way you're trying to use them). The main difference is that os.listdir just takes a directory and returns basically *.* from it (minus . and ..), whereas glob.glob expects a "globbing pattern", which can contain * ? [] in a restricted regex-like format. The function you might be thinking of here (instead of glob.glob) is fnmatch.fnmatch, which applies a globbing pattern to a single file name.
os.listdir(path)
Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order. It does not include the special entries '.' and '..' even if they are present in the directory.
Availability: Unix, Windows.
Changed in version 2.3: On Windows NT/2k/XP and Unix, if path is a Unicode object, the result will be a list of Unicode objects. Undecodable filenames will still be returned as string objects.
glob.glob(pathname)
Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. pathname can be either absolute (like /usr/src/Python-1.5/Makefile) or relative (like ../../Tools/*/*.gif), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell).
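To make the difference concrete, here is a small sketch of filtering an already-fetched listing with fnmatch. The names list is made up and stands in for whatever os.listdir() returned:

```python
import fnmatch

# Made-up directory listing, standing in for the result of os.listdir().
names = ["a.jpg", "b.JPG", "notes.txt", "c.jpg"]

# fnmatchcase() is always case-sensitive; fnmatch.fnmatch() first normalises
# case according to the platform's rules (so it ignores case on Windows).
jpgs = [n for n in names if fnmatch.fnmatchcase(n, "*.jpg")]
print(jpgs)
```

glob.glob('*.jpg'), by contrast, goes to the file system itself and returns every match in the current directory, which is why calling it once per file inside the loop does not test the current filename.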
Sorry, too lazy to actually mock up files and test this, but then I'd be doing all the work for you. But this should work (or be darn close to what I think you're aiming at). ;)
import os
import fnmatch
import os.path

target = r'C:\Pics'  # raw string so the backslash is taken literally
os.chdir(target)
allfiles = os.listdir(target)
count = 500
for filename in allfiles:
    if not fnmatch.fnmatch(filename, '*.jpg'):
        continue
    if count % 500 == 0:
        # start a new directory every 500 files
        dirname = 'p%04d' % count
        if not os.path.exists(dirname):
            os.mkdir(dirname)
    target = os.path.join(dirname, '%d.jpg' % count)
    os.rename(filename, target)
    count += 1
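The naming scheme in that loop can be checked without touching any files. The helper below is a hypothetical restatement of it (the name destination and the step parameter are mine), mapping a running count to the directory and file name the loop would use:

```python
def destination(count, step=500):
    """Return (directory name, file name) for the file numbered `count`."""
    bucket = count - count % step  # round down to the start of the current batch
    return 'p%04d' % bucket, '%d.jpg' % count

for sample in (500, 999, 1000):
    print(sample, destination(sample))
```

Files 500 to 999 land in p0500, and file 1000 starts p1000, which matches what the count % 500 == 0 check does in the loop.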
One can use os.listdir('somedir') to get all the entries under somedir. However, what I want is just the regular files (excluding directories), like the result of find . -type f in a shell.
I know one can use [path for path in os.listdir('somedir') if not os.path.isdir('somedir/'+path)] to achieve a similar result, as in this related question: How to list only top level directories in Python?. I am just wondering whether there is a more succinct way to do so.
You could use os.walk, which returns a tuple of path, folders and files:
files = next(os.walk('somedir'))[2]
I have a couple of ways of doing such tasks. I cannot comment on how succinct they are. FWIW, here they are:
1. The code below picks up all files that end with .txt. You may want to remove the ".endswith" part.
import os

for root, dirs, files in os.walk('./'):  # current directory in terminal
    for file in files:
        if file.endswith('.txt'):
            # here you can do whatever you want to with the file.
            pass
2. This code assumes that the path is passed to the function. It appends all .txt files to a list and, if there are subdirectories in the path, it also appends the files found in those subdirectories to subfiles.
def readFilesNameList(self, path):
    basePath = path
    allfiles = []
    subfiles = []
    for root, dirs, files in os.walk(basePath):
        for f in files:
            if f.endswith('.txt'):
                allfiles.append(os.path.join(root, f))
                if root != basePath:
                    subfiles.append(os.path.join(root, f))
    return allfiles, subfiles
I know the code is just skeletal in nature, but I think you can get the general picture.
Post if you find the succinct way! :)
The earlier os.walk answer is perfect if you only want the files in the top-level directory. If you want subdirectories' files too, though (a la find), you need to process each directory, e.g.:
def find_files(path):
    for prefix, _, files in os.walk(path):
        for name in files:
            yield os.path.join(prefix, name)
Now list(find_files('.')) is a list of the same thing find . -type f -print would have given you (the list is because find_files is a generator, in case that's not obvious).
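For comparison, roughly the same recursive listing can be sketched with pathlib. This is an alternative to the os.walk generator, not a replacement for it; the name find_files_pathlib and the temporary tree are assumptions for the example:

```python
import os
import tempfile
from pathlib import Path

def find_files_pathlib(path):
    """Yield every regular file below path, recursively (pathlib variant)."""
    for p in Path(path).rglob("*"):
        if p.is_file():
            yield p

# Throwaway tree with one top-level file and one nested file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "a", "b"))
    open(os.path.join(tmp, "top.txt"), "w").close()
    open(os.path.join(tmp, "a", "b", "deep.txt"), "w").close()
    found = sorted(p.relative_to(tmp).as_posix() for p in find_files_pathlib(tmp))

print(found)
```

Unlike the os.walk version, this yields Path objects rather than strings, so callers that need plain strings would wrap each result in str().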