dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]
# one of the subdirs contains a huge number of files
files = [os.path.join(file, f) for file in subdirs for f in os.listdir(file)]
The code ran smoothly the first few times, finishing in under 30 seconds, but across repeated runs the time grew to 11 minutes, and now it does not even finish in 11 minutes. The problem is in the third line, and I suspect os.listdir is responsible.
EDIT: I just want to read the file paths so they can be passed as arguments to a multiprocessing function. RAM is not an issue either: it is ample, and the program uses less than a tenth of it.
The problem may be that os.listdir(dir_) reads the entire directory and returns a list of all the files and subdirectories in dir_ in one go. This can take a long time if the directory is very large or if the system is under heavy load.
Instead, use either the method below, which iterates lazily with os.scandir(), or the os.walk() method.
import os

dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]
# Create an empty list to store the file paths
files = []
for subdir in subdirs:
    # Use os.scandir() to iterate over the files and directories in the subdirectory
    with os.scandir(subdir) as entries:
        for entry in entries:
            # Check if the entry is a regular file
            if entry.is_file():
                # Add the file path to the list
                files.append(entry.path)
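The os.walk() route could look roughly like this (a sketch; `collect_files` is a hypothetical helper name, and note that unlike the scandir version it recurses into every level of subdirectory):

```python
import os

def collect_files(dir_):
    """Recursively collect every file path under dir_ using os.walk."""
    files = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory,
    # so no separate listing of subdirectories is needed
    for dirpath, dirnames, filenames in os.walk(dir_):
        for name in filenames:
            files.append(os.path.join(dirpath, name))
    return files
```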
I am writing a simple python script that looks in the subfolders of the selected subfolder for files and summarizes which extensions are used and how many.
I am not really familiar with os.walk and I am really stuck with the "for file in files" section
for file in files:
    total_file_count += 1
    # Get the file extension
    extension = file.split(".")[-1]
    # If the extension is not in the dictionary, add it
    if extension not in file_counts[subfolder]:
        file_counts[subfolder][extension] = 1
    # If the extension is already in the dictionary, increase the count by 1
    else:
        file_counts[subfolder][extension] += 1
I thought a for loop was the best option for summarizing the files and extensions, but it only keeps the results for the last subfolder and outputs the files that are in that last folder.
Does anybody have a fix or a different approach for it?
FULL CODE:
import os

# Set file path using / {End with /}
root_path = "C:/Users/me/Documents/"
# Initialize variables to keep track of file counts
total_file_count = 0
file_counts = {}
# Iterate through all subfolders and files using os.walk
for root, dirs, files in os.walk(root_path):
    # Get current subfolder name
    subfolder = root.split("/")[-1]
    print(subfolder)
    # Initialize a count for each file type
    file_counts[subfolder] = {}
    # Iterate through all files in the subfolder
    for file in files:
        total_file_count += 1
        # Get the file extension
        extension = file.split(".")[-1]
        # If the extension is not in the dictionary, add it
        if extension not in file_counts[subfolder]:
            file_counts[subfolder][extension] = 1
        # If the extension is already in the dictionary, increase the count by 1
        else:
            file_counts[subfolder][extension] += 1
# Print total file count
print(f"There are a total of {total_file_count} files.")
# Print the file counts for each subfolder
for subfolder, counts in file_counts.items():
    print(f"In the {subfolder} subfolder:")
    for extension, count in counts.items():
        print(f"There are {count} .{extension} files")
Thank you in advance :)
If I understand correctly, you want to count the extensions in ALL subfolders of the given folder, but are only getting one folder. If that is indeed the problem, then the issue is this loop:
for root, dirs, files in os.walk(root_path):
    # Get current subfolder name
    subfolder = root.split("/")[-1]
    print(subfolder)
You are iterating through os.walk, but you keep overwriting the subfolder variable. So while it will print out every subfolder, it will only remember the LAST subfolder it encounters, leading to the code reporting only one subfolder.
Solution 1: Fix the loop
If you want to stick with os.walk, you just need to fix the loop. First things first: define files as a real variable. Don't rely on the temporary variable from the loop. You actually already have this: file_counts!
Then, you need some way to save the files. I see that you want to split this up by subfolder, so what we can do is use file_counts to map each subfolder to its list of files (you are trying to do this, but are fundamentally misunderstanding some Python behavior; see my note below about this).
So now, we have a dictionary mapping each subfolder to a list of files! We would just need to iterate through this and count the extensions. The final code looks something like this:
total_file_count = 0
file_counts = {}
extension_counts = {}
# Iterate through all subfolders and files using os.walk
for root, dirs, files in os.walk(root_path):
    subfolder = root.split("/")[-1]
    file_counts[subfolder] = files
    extension_counts[subfolder] = {}
# Iterate through all subfolders, and then through all files
for subfolder in file_counts:
    for file in file_counts[subfolder]:
        total_file_count += 1
        extension = file.split(".")[-1]
        if extension not in extension_counts[subfolder]:
            extension_counts[subfolder][extension] = 1
        else:
            extension_counts[subfolder][extension] += 1
Solution 2: Use glob
Instead of os.walk, you can use the glob module, which returns a list of all files and directories matching your search. It is a powerful tool that uses wildcard matching, and you can read about it in the Python documentation.
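A rough sketch of the glob approach (the recursive `**` pattern needs Python 3.5+; `count_extensions` is a hypothetical helper name):

```python
import glob
import os

def count_extensions(root_path):
    """Count files per extension under root_path using a recursive glob."""
    counts = {}
    # '**' matches any number of directory levels when recursive=True
    for path in glob.glob(os.path.join(root_path, "**", "*"), recursive=True):
        if os.path.isfile(path):
            extension = path.split(".")[-1]
            counts[extension] = counts.get(extension, 0) + 1
    return counts
```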
Note
In your code, you write
# Initialize a count for each file type
file_counts[subfolder] = {}
This feels like a MATLAB coding scheme. First, subfolder is a variable, not a vector, so this would only initialize a count for a single file type (and even if it were a list, you would get an unhashable-type error). Second, this seems to stem from the idea that repeatedly assigning to a variable in a loop builds a list instead of overwriting it, which is not true. If you want to do that, you need to initialize an empty list and use .append().
Note 2: Electric Boogaloo
There are two big ways to improve this code, and here are some hints:
Look into default dictionaries. They will make your code less redundant
Do you REALLY need to save the numbers and THEN count? What if you counted directly?
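As a sketch of both hints combined (a defaultdict removes the if/else, and counting happens directly inside the walk; `count_by_subfolder` is a hypothetical name):

```python
import os
from collections import defaultdict

def count_by_subfolder(root_path):
    """Count extensions per subfolder in one pass, with no intermediate lists."""
    # missing keys start at 0, so no 'if extension not in ...' branch is needed
    extension_counts = defaultdict(lambda: defaultdict(int))
    total_file_count = 0
    for root, dirs, files in os.walk(root_path):
        subfolder = os.path.basename(root)
        for file in files:
            total_file_count += 1
            extension_counts[subfolder][file.split(".")[-1]] += 1
    return total_file_count, extension_counts
```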
Rather than using os.walk, you could use the rglob and glob methods of the Path object. E.g.,
from pathlib import Path

root_path = "C:/Users/me/Documents/"
# get a list of all the directories within root (and recursively within those subdirectories)
dirs = [d for d in Path(root_path).rglob("*") if d.is_dir()]
dirs.append(Path(root_path))  # append root directory
# loop through all directories
for curdir in dirs:
    # get suffixes (i.e., extensions) of all files in the directory
    suffixes = set(s.suffix for s in curdir.glob("*") if s.is_file())
    print(f"In the {curdir}:")
    # loop through the suffixes
    for suffix in suffixes:
        # get all the files in the current directory with that extension
        suffiles = curdir.glob(f"*{suffix}")
        print(f"There are {len(list(suffiles))} {suffix} files")
I am new to Python, and I am trying to use shutil to move files from one directory to another. I understand how to do this for one file or for the entire directory, but how can I do it if I want to move only some of the files? For example, if I have a directory of 50 files and I only want to move 25 of them, is there a way to specify them instead of doing
shutil.move(source, destination)
25 times?
shutil.move() takes a single file or directory for an argument, so you can't move more than one at a time. However, this is what loops are for!
Basically, first generate a list of files in the directory using os.listdir(), then loop through half the list, moving each file, like so:
import os, shutil
srcPath = './oldPath/'
destPath = './newPath/'
files = os.listdir(srcPath)
for file in files[:len(files)//2]:
    shutil.move(srcPath + file, destPath + file)
You didn't mention what to do if there was an odd number of files which didn't divide evenly, so I rounded down. You can round up by adding 1 after the integer division.
One caveat with that code: it will move half the items in the directory, including subdirectories. If there are only files, this has no effect, but if there are subdirectories you don't want to move, you'll want to remove them from the files list first.
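To handle that caveat, here is a sketch that filters out subdirectories first, so only regular files are moved (`move_half_of_files` is a hypothetical helper name):

```python
import os
import shutil

def move_half_of_files(src_path, dest_path):
    """Move the first half of the regular files in src_path into dest_path."""
    # keep only regular files, skipping any subdirectories
    files = [f for f in os.listdir(src_path)
             if os.path.isfile(os.path.join(src_path, f))]
    for file in files[:len(files) // 2]:
        shutil.move(os.path.join(src_path, file),
                    os.path.join(dest_path, file))
```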
Specify the files you want to move in a collection such as a list; then, if you are on Python 3.4 or later, you can also use pathlib's Path class to move each file.
from pathlib import Path
SRC_DIR = "/src-dir"
DST_DIR = "/dst-dir"
FILES_TO_MOVE = ["file1", "file2", "file3", ..]
for file in FILES_TO_MOVE:
    Path(f"{SRC_DIR}/{file}").rename(f"{DST_DIR}/{file}")
https://docs.python.org/3.4/library/pathlib.html#pathlib.Path.rename
I have a Python program where I am calculating the number of files within different directories, but I wanted to know if it was possible to use a text file containing a list of different directory locations to change the cwd within my program?
Input: Would be a text file that has different folder locations that contains various files.
I have my program set up to return the total number of files in a given folder and write that count to a count text file located in each folder the program is run on.
You can use the os module in Python.
import os

# dirs will store the list of directories, populated from your text file
dirs = []
text_file = open(your_text_file, "r")
for line in text_file.readlines():
    # strip the trailing newline from each line
    dirs.append(line.strip())
# Now simply loop over the dirs list
for directory in dirs:
    # Change directory
    os.chdir(directory)
    # Print cwd
    print(os.getcwd())
    # Print number of files in cwd
    print(len([name for name in os.listdir(directory)
               if os.path.isfile(os.path.join(directory, name))]))
Yes.
import os

start_dir = os.getcwd()
indexfile = open(dir_index_file, "r")
for targetdir in indexfile.readlines():
    os.chdir(targetdir.strip())  # strip the trailing newline
    # Do your stuff here
    os.chdir(start_dir)
Do bear in mind that if your program dies halfway through, it will leave you in a different working directory from the one you started in. That is confusing for users and can occasionally be dangerous (especially if they don't notice it has happened and start deleting files they expect to be there, in which case they might get the wrong file). You might want to consider whether there's a way to achieve what you want without changing the working directory.
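One way to guard against dying halfway through is a try/finally that always restores the starting directory. A sketch, assuming the per-directory work replaces the placeholder comment (`process_dirs` is a hypothetical name):

```python
import os

def process_dirs(dir_list):
    """Visit each directory, restoring the original cwd even if the work fails."""
    start_dir = os.getcwd()
    for targetdir in dir_list:
        os.chdir(targetdir)
        try:
            pass  # do your stuff here
        finally:
            # runs even when the work above raises an exception
            os.chdir(start_dir)
```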
EDIT:
To do the latter: rather than changing directory, use os.listdir() to get the files in each directory of interest:
import os

indexfile = open(dir_index_file, "r")
for targetdir in indexfile.readlines():
    targetdir = targetdir.strip()  # drop the trailing newline
    contents = os.listdir(targetdir)
    numfiles = len(contents)
    countfile = open(os.path.join(targetdir, "count.txt"), "w")
    countfile.write(str(numfiles))
    countfile.close()
Note that this will count files and directories, not just files. If you only want files, you'll have to go through the list returned by os.listdir, checking whether each item is a file with os.path.isfile().
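That filtering could be wrapped up as a small sketch (`count_files` is a hypothetical helper name):

```python
import os

def count_files(targetdir):
    """Count only the regular files in targetdir, ignoring subdirectories."""
    return len([name for name in os.listdir(targetdir)
                if os.path.isfile(os.path.join(targetdir, name))])
```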
Using IronPython 2.6 (I'm new), I'm trying to write a program that opens a file, saves it at a series of locations, and then opens/manipulates/re-saves those. It will be run by an upper-level program on a loop, and this entire procedure is designed to catch/preserve corrupted saves so my company can figure out why this glitch of corruption occasionally happens.
I've currently worked out the Open/Save to locations parts of the script and now I need to build a function that opens, checks for corruption, and (if corrupted) moves the file into a subfolder (with an iterative renaming applied, for copies) or (if okay), modifies the file and saves a duplicate, where the process is repeated on the duplicate, sans duplication.
I tell this all for context to the root problem. In my situation, what is the most Pythonic, consistent, and Windows/Unix-friendly way to move a (corrupted) file into a subfolder while also renaming it, based on the number of pre-existing copies of the file already in that subfolder?
In other words:
In a folder structure built as:
C:\Folder\test.txt
C:\Folder\Subfolder
C:\Folder\Subfolder\test.txt
C:\Folder\Subfolder\test01.txt
C:\Folder\Subfolder\test02.txt
C:\Folder\Subfolder\test03.txt
How do I move test.txt so that the structure becomes:
C:\Folder\Subfolder
C:\Folder\Subfolder\test.txt
C:\Folder\Subfolder\test01.txt
C:\Folder\Subfolder\test02.txt
C:\Folder\Subfolder\test03.txt
C:\Folder\Subfolder\test04.txt
In an automated way, so that I can loop my program overnight and have it stack up the corrupted text files I need to save? Note: they're not text files in practice, just an example.
Assuming you are going to use the convention of incrementally suffixing numbers to the files:
import os.path
import shutil

def store_copy(file_to_copy, destination):
    filename, extension = os.path.splitext(os.path.basename(file_to_copy))
    existing_files = [i for i in os.listdir(destination) if i.startswith(filename)]
    new_file_name = "%s%02d%s" % (filename, len(existing_files), extension)
    shutil.copy2(file_to_copy, os.path.join(destination, new_file_name))
There's a fail case if you have subdirectories or files in destination whose names overlap with the source file. For example, if your file is named 'example.txt' and the destination contains 'example_A.txt' as well as 'example.txt' and 'example01.txt', the count will be wrong. If that's a possibility, you'd have to change the test in the existing_files line to something more sophisticated.
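One "more sophisticated" test could be a regular expression that matches only the bare name or the name plus exactly two digits. A sketch (`next_copy_name` is a hypothetical helper, and the pattern assumes the two-digit suffix convention above):

```python
import os
import re

def next_copy_name(filename, extension, destination):
    """Pick the next free name like 'test02.txt' based on existing numbered copies."""
    # match 'test.txt' or 'testNN.txt' exactly, so 'test_A.txt' is not counted
    pattern = re.compile(re.escape(filename) + r"(\d{2})?" + re.escape(extension) + r"$")
    existing = [i for i in os.listdir(destination) if pattern.match(i)]
    return "%s%02d%s" % (filename, len(existing), extension)
```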
This is my first time hacking together bits and pieces of code to form a utility that I need (I'm a designer by trade) and, though I feel I'm close, I'm having trouble getting the following to work.
I routinely need to zip up files with a .COD extension that are inside of a directory structure I've created. As an example, the structure may look like this:
(single root folder) -> (multiple folders) -> (two folders) -> (one folder) -> COD files
I need to ZIP up all the COD files into COD.zip and place that zip file one directory above where the files currently are. Folder structure would look like this when done for example:
EXPORT folder -> 9800 folder -> 6 folder -> OTA folder (+ new COD.zip) -> COD files
My issues -
First, the COD.zip it creates seems to be sized appropriately for the COD files within it, but when I unzip it there is only one .cod inside, even though the ZIP's file size is that of all the CODs zipped together.
Second, I need the COD files to be zipped without any folder structure, directly within COD.zip. Currently, my script preserves an entire directory structure (starting with "users/myusername/etc etc").
Any help would be greatly appreciated - and explanations even better as I'm trying to learn :)
Thanks.
import os, glob, fnmatch, zipfile

def scandirs(path):
    for currentFile in glob.glob(os.path.join(path, '*')):
        if os.path.isdir(currentFile):
            scandirs(currentFile)
        if fnmatch.fnmatch(currentFile, '*.cod'):
            cod = zipfile.ZipFile("COD.zip", "a")
            cod.write(currentFile)

scandirs(os.getcwd())
For problem #1, I think your problem is probably this section:
cod = zipfile.ZipFile("COD.zip","a")
cod.write(currentFile)
You're creating a new zip (and possibly overwriting the existing one) every time you go to write a new file. Instead you want to create the zip once per directory and then repeatedly append to it (see example below).
For problem #2, your issue is that you probably need to flatten the filename when you write it to the archive. One approach would be to use os.chdir to CD into each directory in scandirs as you look at it. An easier approach is to use the os.path module to split up the file path and grab the basename (the filename without the path) and then you can use the 2nd parameter to cod.write to change the filename that gets put into the actual zip (see example below).
import os, os.path, glob, fnmatch, zipfile

def scandirs(path):
    # zip file goes at current path, then up one dir, then COD.zip
    zip_file_path = os.path.join(path, os.path.pardir, "COD.zip")
    # NOTE: will result in some empty zips at the moment for dirs that contain no .cod files
    cod = zipfile.ZipFile(zip_file_path, "a")
    for currentFile in glob.glob(os.path.join(path, '*')):
        if os.path.isdir(currentFile):
            scandirs(currentFile)
        if fnmatch.fnmatch(currentFile, '*.cod'):
            cod.write(currentFile, os.path.basename(currentFile))
    cod.close()
    if not cod.namelist():  # zip is empty
        os.remove(zip_file_path)

scandirs(os.getcwd())
So create the zip file once, repeatedly append to it while flattening the filenames, then close it. You also need to make sure you call close or you may not get all your files written.
I don't have a good way to test this locally at the moment, so feel free to try it and report back. I'm sure I probably broke something. ;-)
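On Python 2.7+/3.2+, ZipFile also works as a context manager, which makes the close automatic. A sketch of that pattern (`zip_flat` is a hypothetical helper, separate from the scandirs code above):

```python
import os
import zipfile

def zip_flat(zip_path, file_paths):
    """Append files to the archive at zip_path, flattening each name to its basename."""
    # the with block guarantees close() even if an exception is raised mid-write
    with zipfile.ZipFile(zip_path, "a") as cod:
        for path in file_paths:
            cod.write(path, os.path.basename(path))
```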
The following code has the same effect but is more reusable and does not create multiple zip files.
import os, glob, zipfile

def scandirs(path, pattern):
    result = []
    for file in glob.glob(os.path.join(path, pattern)):
        if os.path.isdir(file):
            result.extend(scandirs(file, pattern))
        else:
            result.append(file)
    return result

zfile = zipfile.ZipFile('yourfile.zip', 'w')
for file in scandirs(yourbasepath, '*.COD'):
    print('Processing file: ' + file)
    # use one of the two write calls below, not both
    zfile.write(file)  # keeps folder structure
    zfile.write(file, os.path.split(file)[1])  # no folder structure
zfile.close()