Apply procedure to files in many subdirectories - python

I am trying to apply a procedure to thousands of files spread across many subdirectories.
I was thinking of using os.listdir() first to list all subdirectories, then looking in each subdirectory and applying my procedure. My directory tree is as follows:
Directory -> subdir1 -> file, file, file, .....
             subdir2 -> file, file, file, .....
             subdir3 -> file, file, file, .....
             subdir4 -> file, file, file, .....
             subdir5 -> file, file, file, .....
I can get the list of subdirectories with os.listdir(), but not the files inside them. Do you have an idea how to proceed?
Thanks
EDIT:
When using MikeH's method, in my case:
import os
from astropy.io import fits

ROOT_DIR = './'
for dirName, subdirList, fileList in os.walk(ROOT_DIR):
    for fname in fileList:
        hdul = fits.open(fname)
I get the error:
FileNotFoundError: [Errno 2] No such file or directory: 'lte08600-2.00+0.5.Alpha=+0.50.PHOENIX-ACES-AGSS-COND-2011-HiRes.fits'
And indeed, if I check the path of the file with print(os.path.abspath(fname)), I can see that the path is wrong: it is missing the subdirectory, e.g. /root/dir/fname instead of /root/dir/subdir/fname.
What is wrong here?
EDIT2:
That's it, I found out what was wrong: I have to join the path to the file, writing os.path.join(dirName, fname) instead of just fname each time.
Thanks !

Something like this should work for you:
import os

ROOT_DIR = './'
for dirName, subdirList, fileList in os.walk(ROOT_DIR):
    for fname in fileList:
        # the fully qualified file name is ROOT_DIR/dirName/fname
        performFunction(dirName, fname)
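If pathlib is available, the same traversal can be sketched with Path.rglob, which yields paths that already include the subdirectory (the tiny throwaway tree built below is only there to make the sketch runnable):

```python
import tempfile
from pathlib import Path

# Throwaway tree standing in for Directory/subdirN/... so the sketch runs.
root = Path(tempfile.mkdtemp())
for sub, name in [("subdir1", "a.fits"), ("subdir2", "b.fits")]:
    (root / sub).mkdir()
    (root / sub / name).write_text("data")

# rglob walks every subdirectory; each yielded Path already includes the
# subdirectory, so there is no separate join step before opening the file.
fits_files = sorted(root.rglob("*.fits"))
for p in fits_files:
    print(p)
```

Each printed path is the full location of the file, which avoids the FileNotFoundError described in the EDIT above.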

Related

Python: Finding files in directory but ignoring folders and their contents

So my program search_file.py is trying to look for .log files in the directory it is currently placed in. I used the following code to do so:
import os

# This is to get the directory that the program is currently running in
dir_path = os.path.dirname(os.path.realpath(__file__))
# for loop is meant to scan through the current directory the program is in
for root, dirs, files in os.walk(dir_path):
    for file in files:
        # Check if file ends with .log; if so, print the file name
        if file.endswith('.log'):
            print(file)
My current directory is as follows:
search_file.py
sample_1.log
sample_2.log
extra_file (this is a folder)
And within the extra_file folder we have:
extra_sample_1.log
extra_sample_2.log
Now, when the program runs and prints the files out it also takes into account the .log files in the extra_file folder. But I do not want this. I only want it to print out sample_1.log and sample_2.log. How would I approach this?
Try this:
import os

files = os.listdir()
for file in files:
    if file.endswith('.log'):
        print(file)
The problem in your code is that os.walk traverses the whole directory tree, not just your current directory. os.listdir returns a list of all filenames in a directory, defaulting to the current directory, which is what you are looking for.
os.walk documentation
os.listdir documentation
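The difference can be seen in a small sketch that rebuilds the question's layout in a throwaway directory (names taken from the question):

```python
import os
import tempfile

# Throwaway copy of the question's layout: one top-level log file and
# one inside an extra_file subfolder.
dir_path = tempfile.mkdtemp()
open(os.path.join(dir_path, "sample_1.log"), "w").close()
os.mkdir(os.path.join(dir_path, "extra_file"))
open(os.path.join(dir_path, "extra_file", "extra_sample_1.log"), "w").close()

# os.listdir stays at the top level...
top_level = sorted(os.listdir(dir_path))
# ...while os.walk also descends into extra_file.
walked = sorted(f for _, _, files in os.walk(dir_path) for f in files)

print(top_level)
print(walked)
```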
By default, os.walk does a root-first traversal of the tree, so you know the first emitted tuple is the good stuff. So just ask for the first one. And since you don't really care about root or dirs, use _ as the "don't care" variable name:
# get the root files list
_, _, files = next(os.walk(dir_path))
for file in files:
    # Check if file ends with .log; if so, print the file name
    if file.endswith('.log'):
        print(file)
It's also common to use glob:
import os
from glob import glob

dir_path = os.path.dirname(os.path.realpath(__file__))
for file in glob(os.path.join(dir_path, "*.log")):
    print(file)
This runs the risk of matching a directory whose name ends in ".log", so you could also add a check using os.path.isfile(file).
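That extra check might look like this (a sketch; the throwaway directory below deliberately contains a folder whose name ends in .log):

```python
import os
import tempfile
from glob import glob

# Throwaway directory with a real .log file and a directory whose name
# also ends in ".log" -- the case the caveat above is about.
dir_path = tempfile.mkdtemp()
open(os.path.join(dir_path, "sample_1.log"), "w").close()
os.mkdir(os.path.join(dir_path, "folder.log"))

# glob matches both entries; os.path.isfile drops the directory.
logs = [f for f in glob(os.path.join(dir_path, "*.log")) if os.path.isfile(f)]
print(logs)
```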

Python loop through directories

I am trying to use the python library os to loop through all the subdirectories in a root directory, target files with a specific name, and rename them.
Just to make it clear this is my tree structure
My python file is located at the root level.
What I am trying to do is target the directory 942ba, loop through all its subdirectories, locate the file 000000 and rename it to 000000.csv.
The current code I have is as follows:
import os

root = '<path-to-dir>/942ba956-8967-4bec-9540-fbd97441d17f/'
for dirs, subdirs, files in os.walk(root):
    for f in files:
        print(dirs)
        if f == '000000':
            dirs = dirs.strip(root)
            f_new = f + '.csv'
            os.rename(os.path.join(r'{}'.format(dirs), f), os.path.join(r'{}'.format(dirs), f_new))
But this is not working, because when I run my code, for some reason it strips characters from the subdirectory names.
Can anyone help me understand how to solve this issue?
A more efficient way to iterate through the folders and only select the files you are looking for is below:
source_folder = '<path-to-dir>/942ba956-8967-4bec-9540-fbd97441d17f/'
files = [os.path.normpath(os.path.join(root, f))
         for root, dirs, files in os.walk(source_folder)
         for f in files if '000000' in f and not f.endswith('.gz')]
for file in files:
    os.rename(file, f"{file}.csv")
The list comprehension stores the full path to the files you are looking for. You can change the condition inside the comprehension to anything you need. I use this code snippet a lot to find just images of certain type, or remove unwanted files from the selected files.
In the for loop, files are renamed adding the .csv extension.
I would use glob to find the files.
import os, glob

zdir = '942ba956-8967-4bec-9540-fbd97441d17f'
files = glob.glob('*{}/000000'.format(zdir))
for fly in files:
    os.rename(fly, '{}.csv'.format(fly))

os.listdir not listing files in directory

For some reason os.listdir isn't working for me. I have 6 .xlsx files inside input_dir, but it creates an empty list instead of a list of the 6 files. If I move the .xlsx files back to where the script is, one directory down, and update the input_dir path, it then finds all 6 files; but I need the 6 files to be one directory up, in their own folder. When I move them into their own folder and update the input_dir path, it doesn't find them at all.
import openpyxl as xl
import os
import pandas as pd
import xlsxwriter

input_dir = 'C:\\Users\\work\\comparison'
files = [file for file in os.listdir(input_dir)
         if os.path.isfile(file) and file.endswith(".xlsx")]
for file in files:
    input_file = os.path.join(input_dir, file)
    wb1 = xl.load_workbook(input_file)
    ws1 = wb1.worksheets[0]
When you move the files into input_dir, the following line creates an empty list:
files = [file for file in os.listdir(input_dir)
         if os.path.isfile(file) and file.endswith(".xlsx")]
This is because you are checking os.path.isfile(file) instead of os.path.isfile(os.path.join(input_dir, file)).
When the files are present in the same directory as the script, it is able to find them and creates the list correctly.
Alternatively, you could try using glob.glob which accepts a file path pattern and returns full path to the file in the iterator.
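A glob.glob version of the fix might look like this (a sketch with throwaway files standing in for the .xlsx workbooks):

```python
import glob
import os
import tempfile

# Throwaway stand-in for 'C:\\Users\\work\\comparison'.
input_dir = tempfile.mkdtemp()
open(os.path.join(input_dir, "report.xlsx"), "w").close()
open(os.path.join(input_dir, "notes.txt"), "w").close()

# glob.glob returns full paths, so each result can be passed straight to
# load_workbook with no isfile/join bookkeeping.
files = glob.glob(os.path.join(input_dir, "*.xlsx"))
print([os.path.basename(f) for f in files])
```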
The problem comes from os.path.isfile(file): os.listdir(input_dir) returns a list of the filenames inside the input_dir directory, but without their path. Hence your error, as os.path.isfile(file) will look in your current directory, which obviously doesn't contain any of those filenames.
You can easily correct this by changing it to os.path.isfile(os.path.join(input_dir, file)), but a prettier solution is to simply delete this part of the code (if os.listdir returns a filename, then it is necessarily in your directory and there is no need to check):
files = [file for file in os.listdir(input_dir) if file.endswith(".xlsx")]

How to copy folder structure under another directory?

I have some questions related to copying a folder structure. In fact, I need to convert pdf files to text files. The folder structure of the place I import the pdfs from looks like:
D:/f/subfolder1/subfolder2/a.pdf
And I would like to create the exact same folder structure under "D:/g/subfolder1/subfolder2/", but without the pdf file, since I need to put the converted text file there instead. So after the conversion function it gives me:
D:/g/subfolder1/subfolder2/a.txt
I would also like to add a check to make sure that the same folder structure does not already exist under "D:/g/" before creating it.
Here is my current code. So how can I create the same folder structure without the file?
Thank you!
import converter as c
import os

inputpath = 'D:/f/'
outputpath = 'D:/g/'
for root, dirs, files in os.walk(yourpath, topdown=False):
    for name in files:
        with open("D:/g/"+ , mode="w") as newfile:
            newfile.write(c.convert_pdf_to_txt(os.path.join(root, name)))
For me the following works fine:
Iterate over the existing folders
Build the structure for the new folders based on the existing ones
Check if the new folder structure does not already exist
If so, create the new folder without files
Code:
import os

inputpath = 'D:/f/'
outputpath = 'D:/g/'
for dirpath, dirnames, filenames in os.walk(inputpath):
    structure = os.path.join(outputpath, dirpath[len(inputpath):])
    if not os.path.isdir(structure):
        os.mkdir(structure)
    else:
        print("Folder already exists!")
Documentation:
os.walk
os.mkdir
os.path.isdir
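As a side note, the check-then-mkdir pair above can be collapsed with os.makedirs(..., exist_ok=True), which also creates missing parents. A runnable sketch, using throwaway directories in place of 'D:/f/' and 'D:/g/':

```python
import os
import tempfile

# Throwaway directories standing in for 'D:/f/' and 'D:/g/'.
inputpath = tempfile.mkdtemp()
outputpath = tempfile.mkdtemp()
os.makedirs(os.path.join(inputpath, "subfolder1", "subfolder2"))

for dirpath, dirnames, filenames in os.walk(inputpath):
    structure = os.path.join(outputpath, os.path.relpath(dirpath, inputpath))
    # exist_ok=True replaces the isdir check, and makedirs (unlike mkdir)
    # also creates any missing intermediate directories.
    os.makedirs(structure, exist_ok=True)

print(os.path.isdir(os.path.join(outputpath, "subfolder1", "subfolder2")))
```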
How about using shutil.copytree()?
import os
import shutil

def ig_f(dir, files):
    return [f for f in files if os.path.isfile(os.path.join(dir, f))]

shutil.copytree(inputpath, outputpath, ignore=ig_f)
The directory you want to create should not exist before calling this function. You can add a check for that.
Taken from shutil.copytree without files
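With that existence check added, a runnable sketch (using throwaway directories in place of the real input and output paths) might look like:

```python
import os
import shutil
import tempfile

# Throwaway stand-ins for the real input and output trees.
base = tempfile.mkdtemp()
inputpath = os.path.join(base, "f")
outputpath = os.path.join(base, "g")
os.makedirs(os.path.join(inputpath, "subfolder1", "subfolder2"))
open(os.path.join(inputpath, "subfolder1", "subfolder2", "a.pdf"), "w").close()

def ig_f(dir, files):
    # Report every regular file as "ignored", so only directories are copied.
    return [f for f in files if os.path.isfile(os.path.join(dir, f))]

# The guard: copytree raises if the destination already exists.
if not os.path.isdir(outputpath):
    shutil.copytree(inputpath, outputpath, ignore=ig_f)
```

After running, outputpath mirrors the directory structure of inputpath but contains no files.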
A minor tweak to your code for skipping pdf files:
for root, dirs, files in os.walk('.', topdown=False):
    for name in files:
        if name.find(".pdf") >= 0: continue
        with open("D:/g/"+ , mode="w") as newfile:
            newfile.write(c.convert_pdf_to_txt(os.path.join(root, name)))

Removing randomly generated file extensions from .jpg files using python

I recently recovered a folder that I had accidentally deleted. It has .jpg and .tar.gz files. However, all of the files now have some sort of hash extension appended to them, and it is different for every file. There are more than 600 files in the folders. Example names would be:
IMG001.jpg.3454637876876978068
IMG002.jpg.2345447786787689769
IMG003.jpg.3454356457657757876
and
folder1.tar.gz.45645756765876
folder2.tar.gz.53464575678588
folder3.tar.gz.42345435647567
I would like a script that could go through them in turn (maybe I can specify the extension, or it can have two passes, one through the .jpg files and the other through the .tar.gz files) and clean up the last part of the file name, starting from the . right before the number. So the final file names would end in .jpg and .tar.gz.
What I have so far in python:
import os

def scandirs(path):
    for root, dirs, files in os.walk(path):
        for currentFile in files:
            os.path.splitext(currentFile)

scandirs(r'C:\Users\ad\pics')
Obviously it doesn't work. I would appreciate any help. I would also consider using a bash script, but I do not know how to do that.
import shutil
shutil.move(currentFile, os.path.splitext(currentFile)[0])
at least I think ...
Here is how I would do it, using regular expressions:
import os
import re

pattern = re.compile(r'^(.*)\.\d+$')

def scandirs(path):
    for root, dirs, files in os.walk(path):
        for currentFile in files:
            match = pattern.match(currentFile)
            if match:
                os.rename(
                    os.path.join(root, currentFile),
                    os.path.join(root, match.group(1))
                )

scandirs('C:/Users/ad/pics')
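To see what the pattern does in isolation, here is a small sketch using sample names from the question:

```python
import re

pattern = re.compile(r'^(.*)\.\d+$')

# The greedy (.*) keeps everything up to the last dot-plus-digits run,
# so multi-part extensions like .tar.gz survive intact.
names = ["IMG001.jpg.3454637876876978068", "folder1.tar.gz.45645756765876"]
cleaned = [pattern.match(n).group(1) for n in names]
print(cleaned)
```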
Since you tagged with bash I will give you an answer that will remove the last extension for all files/directories in a directory:
for f in *; do
    mv "$f" "${f%.*}"
done
