Copy random files by 5 each to different folder - python

I have a big list of files that I was able to put in a random order (with a custom "random number" column). (I even saved the list of these files in a txt file for some reason.)
But now I need to put them into... let's see... 740 files divided by 5...
into 148 new folders. OK, I can make the 148 new folders with an extDir utility, but how can I copy each group of 5 files into one of the 148 folders separately,
so files 1-5 go to dir1,
files 6-10 go to dir2,
11-15 to dir3,
etc.
Yes, I tried to do it manually, but got lost. Also, I need to repeat the operation with different files about ten times. I tried to use Python for this, but I am a beginning programmer.
All I have is the text file listing all the files in the folder, and now I need to separate it into "modules" of 5 files and copy each module into a different, ascending folder.

Assuming the files are stored in the directory files, the following code iterates over all files in the directory and moves them to dirX, incrementing X every five files.
import os
import shutil

filecounter = 0
dircounter = 0
directory = "files"
for file in os.listdir(directory):
    absoluteFilename = os.path.join(directory, file)
    if filecounter % 5 == 0:  # increment dir counter every five files processed
        dircounter += 1
        os.mkdir(os.path.join(directory, "dir" + str(dircounter)))
    targetfile = os.path.join(directory, "dir" + str(dircounter), file)  # builds absolute target filename
    shutil.move(absoluteFilename, targetfile)
    filecounter += 1
This uses the modulo operator to increment the dircounter every five files.
Note that the order of the files is arbitrary (see os.listdir). You might have to sort the list beforehand.
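Since the question mentions a text file listing the filenames in the desired random order, a variant that reads that list and copies (rather than moves) the files in groups of five might look like the sketch below. The path arguments and the helper name `copy_in_groups` are hypothetical placeholders, not part of the original code.

```python
import os
import shutil

def copy_in_groups(list_path, source_dir, target_root, group_size=5):
    """Copy the files named in list_path (one filename per line, already
    in the desired order) from source_dir into target_root/dir1, dir2, ...,
    group_size files per folder."""
    with open(list_path) as f:
        filenames = [line.strip() for line in f if line.strip()]
    for index, name in enumerate(filenames):
        dirnumber = index // group_size + 1  # files 0-4 -> dir1, 5-9 -> dir2, ...
        target_dir = os.path.join(target_root, "dir" + str(dirnumber))
        os.makedirs(target_dir, exist_ok=True)  # create the folder on first use
        shutil.copy(os.path.join(source_dir, name), target_dir)
```

A call such as `copy_in_groups("filelist.txt", "files", "sorted")` would then fill `sorted/dir1` through `sorted/dir148`, leaving the originals in place so the operation can be repeated with a different list.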


os.listdir getting slower over different runs

dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]
# one of subdirs contains a huge number of files
files = [os.path.join(file, f) for file in subdirs for f in os.listdir(file)]
The code ran smoothly the first few times, finishing in under 30 seconds, but over different runs of the same code the time increased to 11 minutes, and now it does not even finish in 11 minutes. The problem is in the 3rd line, and I suspect os.listdir is responsible.
EDIT: Just want to read the files so that it can be sent as argument to a multiprocessing function. RAM is also not an issue as RAM is ample and not even 1/10th of RAM is used by the program
This might be because os.listdir(dir_) reads the entire directory and returns a list of all the files and subdirectories in dir_. That can take a long time if the directory is very large or if the system is under heavy load.
Instead, use the method below, or use the os.walk() method.
dir_ = "/path/to/folder/with/huge/number/of/files"
subdirs = [os.path.join(dir_, file) for file in os.listdir(dir_)]

# Create an empty list to store the file paths
files = []
for subdir in subdirs:
    # Use os.scandir() to iterate over the files and directories in the subdirectory
    with os.scandir(subdir) as entries:
        for entry in entries:
            # Check if the entry is a regular file
            if entry.is_file():
                # Add the file path to the list
                files.append(entry.path)
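The os.walk() alternative mentioned above could be sketched as follows; it visits every subdirectory in a single traversal and collects the paths of regular files, with no separate subdirs list needed:

```python
import os

def collect_files(dir_):
    """Return the paths of all regular files anywhere under dir_,
    using a single os.walk() traversal."""
    files = []
    for root, subdirs, filenames in os.walk(dir_):
        # filenames only ever contains regular files, so no is_file() check is needed
        for name in filenames:
            files.append(os.path.join(root, name))
    return files
```

Note that, unlike the scandir version above, this descends into nested subdirectories at any depth, which may or may not be what you want.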

Simple Python program that checks how many files there are in each subfolder and which extensions the files have

I am writing a simple Python script that looks in the subfolders of the selected folder for files and summarizes which extensions are used and how many times they occur.
I am not really familiar with os.walk, and I am really stuck on the "for file in files" section:
`
for file in files:
    total_file_count += 1
    # Get the file extension
    extension = file.split(".")[-1]
    # If the extension is not in the dictionary, add it
    if extension not in file_counts[subfolder]:
        file_counts[subfolder][extension] = 1
    # If the extension is already in the dictionary, increase the count by 1
    else:
        file_counts[subfolder][extension] += 1
`
I thought a for loop was the best option for summarizing the files and extensions, but it only takes the last subfolder and outputs the files that are in that last folder.
Does anybody have a fix or a different approach for it?
FULL CODE:
`
import os

# Set file path using / {End with /}
root_path = "C:/Users/me/Documents/"
# Initialize variables to keep track of file counts
total_file_count = 0
file_counts = {}
# Iterate through all subfolders and files using os.walk
for root, dirs, files in os.walk(root_path):
    # Get current subfolder name
    subfolder = root.split("/")[-1]
    print(subfolder)
    # Initialize a count for each file type
    file_counts[subfolder] = {}
    # Iterate through all files in the subfolder
    for file in files:
        total_file_count += 1
        # Get the file extension
        extension = file.split(".")[-1]
        # If the extension is not in the dictionary, add it
        if extension not in file_counts[subfolder]:
            file_counts[subfolder][extension] = 1
        # If the extension is already in the dictionary, increase the count by 1
        else:
            file_counts[subfolder][extension] += 1
# Print total file count
print(f"There are a total of {total_file_count} files.")
# Print the file counts for each subfolder
for subfolder, counts in file_counts.items():
    print(f"In the {subfolder} subfolder:")
    for extension, count in counts.items():
        print(f"There are {count} .{extension} files")
`
Thank you in advance :)
If I understand correctly, you want to count the extensions in ALL subfolders of the given folder, but are only getting one folder. If that is indeed the problem, then the issue is this loop
for root, dirs, files in os.walk(root_path):
    # Get current subfolder name
    subfolder = root.split("/")[-1]
    print(subfolder)
You are iterating through os.walk, but you keep overwriting the subfolder variable. So while it will print out every subfolder, it will only remember the LAST subfolder it encounters, leading to the code returning only one subfolder.
Solution 1: Fix the loop
If you want to stick with os.walk, you just need to fix the loop. First things first: define files as a real variable; don't rely on the temporary variable from the loop. You actually already have this: file_counts!
Then, you need some way to save the files. I see that you want to split this up by subfolder, so what we can do is use file_counts to map each subfolder to its list of files (you are trying to do this, but are fundamentally misunderstanding some Python code; see my note below about this).
So now, we have a dictionary mapping each subfolder to a list of files! We would just need to iterate through this and count the extensions. The final code looks something like this:
file_counts = {}
extension_counts = {}
total_file_count = 0
# Iterate through all subfolders and files using os.walk
for root, dirs, files in os.walk(root_path):
    subfolder = root.split("/")[-1]
    file_counts[subfolder] = files
    extension_counts[subfolder] = {}
# Iterate through all subfolders, and then through all files
for subfolder in file_counts:
    for file in file_counts[subfolder]:
        total_file_count += 1
        extension = file.split(".")[-1]
        if extension not in extension_counts[subfolder]:
            extension_counts[subfolder][extension] = 1
        else:
            extension_counts[subfolder][extension] += 1
Solution 2: Use glob
Instead of os.walk, you can use the glob module, which will return a list of all files and directories wherever you search. It is a powerful tool that uses wildcard matching, and you can read about it here
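A minimal sketch of that glob approach, assuming the same per-extension counts as above are wanted; the recursive `**` pattern only works with `recursive=True`, and the helper name `count_extensions` is a placeholder:

```python
import glob
import os

def count_extensions(root_path):
    """Count file extensions for everything under root_path using glob."""
    counts = {}
    # "**" with recursive=True matches files at any nesting depth
    for path in glob.glob(os.path.join(root_path, "**", "*"), recursive=True):
        if os.path.isfile(path):
            # splitext gives e.g. ".txt"; strip the leading dot
            extension = os.path.splitext(path)[1].lstrip(".")
            counts[extension] = counts.get(extension, 0) + 1
    return counts
```

This version counts across all subfolders combined; grouping by subfolder would need the per-directory bookkeeping shown in Solution 1.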
Note
In your code, you write
# Initialize a count for each file type
file_counts[subfolder] = {}
which feels like a MATLAB coding scheme. First, subfolder is a variable, not a vector, so this only initializes the counts for a single subfolder (and even if it were a list, you would get an unhashable-type error). Second, this seems to stem from the idea that continuously assigning to a variable in a loop builds a list instead of overwriting, which is not true. If you want to do that, you need to initialize an empty list and use .append().
Note 2: Electric Boogaloo
There are two big ways to improve this code, and here are hints:
Look into default dictionaries (collections.defaultdict). They will make your code less redundant.
Do you REALLY need to save the filenames and THEN count? What if you counted directly?
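As a sketch of both hints combined, a collections.defaultdict removes the need for the if/else, and counting directly during the walk removes the intermediate file lists entirely (the helper name `count_by_subfolder` is a placeholder):

```python
import os
from collections import defaultdict

def count_by_subfolder(root_path):
    """Count extensions per subfolder in one pass over os.walk()."""
    # Missing keys are created on first access, initialized to 0
    extension_counts = defaultdict(lambda: defaultdict(int))
    for root, dirs, files in os.walk(root_path):
        subfolder = os.path.basename(root)
        for file in files:
            extension = file.split(".")[-1]
            extension_counts[subfolder][extension] += 1  # no if/else needed
    return extension_counts
```

Using os.path.basename instead of `root.split("/")[-1]` also keeps the code working on Windows, where os.walk returns backslash-separated paths.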
Rather than using os.walk, you could use the rglob and glob methods of Path objects. E.g.,
from pathlib import Path

root_path = "C:/Users/me/Documents/"
# Get a list of all the directories within root (and recursively within those subdirectories)
dirs = [d for d in Path(root_path).rglob("*") if d.is_dir()]
dirs.append(Path(root_path))  # append root directory
# Loop through all directories
for curdir in dirs:
    # Get suffixes (i.e., extensions) of all files in the directory
    suffixes = set(s.suffix for s in curdir.glob("*") if s.is_file())
    print(f"In {curdir}:")
    # Loop through the suffixes
    for suffix in suffixes:
        # Get all the files in the current directory with that extension
        suffiles = curdir.glob(f"*{suffix}")
        print(f"There are {len(list(suffiles))} {suffix} files")

Moving half of the files from one directory into another

I am new to Python, and I am trying to use shutil to move files from one directory to another. I understand how to do this for one file or for the entire directory, but how can I do it if I only want to move some of the files? For example, if I have a directory of 50 files and I only want to move half of them, 25, is there a way to specify them instead of calling
shutil.move(source, destination)
25 times?
shutil.move() takes a single file or directory for an argument, so you can't move more than one at a time. However, this is what loops are for!
Basically, first generate a list of files in the directory using os.listdir(), then loop through half the list, moving each file, like so:
import os, shutil

srcPath = './oldPath/'
destPath = './newPath/'
files = os.listdir(srcPath)
for file in files[:len(files)//2]:
    shutil.move(srcPath + file, destPath + file)
You didn't mention what to do if there is an odd number of files that doesn't divide evenly, so I rounded down. You can round up by adding 1 after the integer division.
One caveat with that code: it will move half the items in the directory, including subdirectories. If there are only files, this has no effect, but if there are subdirectories and you don't want to move them, you'll want to remove them from the files list first.
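Filtering out the subdirectories first could be sketched like this; the helper name `move_half_files` is a placeholder, and sorting is added so that "the first half" is deterministic, since os.listdir order is arbitrary:

```python
import os
import shutil

def move_half_files(srcPath, destPath):
    """Move the first half of the regular files (not subdirectories)
    from srcPath to destPath."""
    # Keep only regular files, in a deterministic (sorted) order
    files = sorted(f for f in os.listdir(srcPath)
                   if os.path.isfile(os.path.join(srcPath, f)))
    for file in files[:len(files) // 2]:
        shutil.move(os.path.join(srcPath, file), os.path.join(destPath, file))
```

Using os.path.join instead of string concatenation also means the source path no longer needs a trailing slash.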
Specify the files you want to move in a collection such as a list; then, on Python 3.4 or later, you can also use pathlib's Path class to move each file.
from pathlib import Path

SRC_DIR = "/src-dir"
DST_DIR = "/dst-dir"
FILES_TO_MOVE = ["file1", "file2", "file3", ...]

for file in FILES_TO_MOVE:
    Path(f"{SRC_DIR}/{file}").rename(f"{DST_DIR}/{file}")
https://docs.python.org/3.4/library/pathlib.html#pathlib.Path.rename

Is there a way to automatically generate an empty array for each iteration of a for loop?

I am trying to create a separate array for each pass of the for loop, in order to store the values of 'signal' that are generated by the wavfile.read function.
Some background on how the code works / how I'd like it to work:
I have the following file path:
Root directory
    Labeled directory
        Irrelevant multiple directories
            Multiple .wav files stored in these subdirectories
    Labeled directory
        Irrelevant multiple directories
            Multiple .wav files stored in these subdirectories
Now, for each Labeled directory, I'd like to create an array that holds the values of all the .wav files contained in its respective subdirectories.
This is what I attempted:
import os
from scipy.io import wavfile

count = 0
for label in df.index:
    for path, directories, files in os.walk('voxceleb1/wav_dev_files/' + label):
        for file in files:
            if file.endswith('.wav'):
                count = count + 1
                rate, signal = wavfile.read(os.path.join(path, file))
    print(count)
Above is a snapshot of dataframe df
Ultimately, the reason for these arrays is that I would like to calculate the average duration of the wav files contained in each labeled subdirectory and add this as a column vector to the dataframe.
Note that the index of the dataframe corresponds to the directory names. I appreciate any and all help!
The code snippet you've posted can be simplified and modernized a bit. Here's what I came up with:
I've got the following directory structure:
I'm using text files instead of wav files in my example, because I don't have any wav files on hand.
In my root, I have A and B (these are supposed to be your "labeled directories"). A has two text files. B has one immediate text file and one subfolder with another text file inside (this is meant to simulate your "irrelevant multiple directories").
The code:
def main():
    from pathlib import Path

    root_path = Path("./root/")
    labeled_directories = [path for path in root_path.iterdir() if path.is_dir()]
    txt_path_lists = []
    # Generate lists of txt paths
    for labeled_directory in labeled_directories:
        txt_path_list = list(labeled_directory.glob("**/*.txt"))
        txt_path_lists.append(txt_path_list)
    # Print the lists of txt paths
    for txt_path_list in txt_path_lists:
        print(txt_path_list)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The output:
[WindowsPath('root/A/a_one.txt'), WindowsPath('root/A/a_two.txt')]
[WindowsPath('root/B/b_one.txt'), WindowsPath('root/B/asdasdasd/b_two.txt')]
As you can see, we generated two lists of text file paths, one for each labeled directory. The glob pattern I used (**/*.txt) handles multiple nested directories, and recursively finds all text files. All you have to do is change the extension in the glob pattern to have it find .wav files instead.
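Back to the original goal of a mean duration per labeled directory: the standard-library wave module can derive each file's duration from its frame count and frame rate. Here is a sketch, under the assumption that the files are plain PCM .wav files that wave can open (the helper name `mean_wav_duration` is a placeholder):

```python
import wave
from pathlib import Path

def mean_wav_duration(labeled_dir):
    """Return the mean duration in seconds of all .wav files anywhere
    under labeled_dir, or 0.0 if there are none."""
    durations = []
    for wav_path in Path(labeled_dir).glob("**/*.wav"):
        with wave.open(str(wav_path), "rb") as w:
            # duration = number of frames / frames per second
            durations.append(w.getnframes() / w.getframerate())
    return sum(durations) / len(durations) if durations else 0.0
```

Calling this once per labeled directory gives exactly the per-label values that could then be assigned as a new dataframe column.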

Python Iterate over Folders and combine csv files inside

Windows OS - I've got several hundred subdirectories and each subdirectory contains 1 or more .csv files. All the files are identical in structure. I'm trying to loop through each folder and concat all the files in each subdirectory into a new file combining all the .csv files in that subdirectory.
example:
folder1 -> file1.csv, file2.csv, file3.csv -->> file1.csv, file2.csv, file3.csv, combined.csv
folder2 -> file1.csv, file2.csv -->> file1.csv, file2.csv, combined.csv
Very new to coding and getting lost in this. Tried using os.walk but completely failed.
The generator produced by os.walk yields three items on each iteration: the path of the current directory in the walk, a list of the names of the subdirectories that will be traversed next, and a list of the names of the files contained in the current directory.
If for whatever reason you don't want to walk certain file paths, you should remove entries from what I called sub below (the list of sub directories contained in root). This will prevent os.walk from traversing any paths you removed.
My code does not prune the walk. Be sure to update this if you don't want to traverse an entire file subtree.
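Pruning could be sketched like this; the key detail is that the sub list must be modified in place (`sub[:] = ...`), because os.walk keeps a reference to the original list, and a plain rebinding `sub = ...` would have no effect on the traversal:

```python
import os

def walk_pruned(root_path, skip_names):
    """Walk root_path, skipping any directory whose name is in skip_names.
    Returns the list of directory paths actually visited."""
    visited = []
    for root, sub, files in os.walk(root_path):
        # In-place assignment: os.walk will not descend into removed entries
        sub[:] = [d for d in sub if d not in skip_names]
        visited.append(root)
    return visited
```

For this problem, pruning could, for example, keep the walk out of directories whose csv files you don't want combined.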
The following outline should work for this although I haven't been able to test this on Windows. I have no reason to think it'll behave differently.
import os
import sys

def write_files(sources, combined):
    # Want the first header
    with open(sources[0], 'r') as first:
        combined.write(first.read())
    for i in range(1, len(sources)):
        with open(sources[i], 'r') as s:
            # Ignore the rest of the headers
            next(s, None)
            for line in s:
                combined.write(line)

def concatenate_csvs(root_path):
    for root, sub, files in os.walk(root_path):
        filenames = [os.path.join(root, filename) for filename in files
                     if filename.endswith('.csv')]
        if not filenames:  # skip directories that contain no .csv files
            continue
        combined_path = os.path.join(root, 'combined.csv')
        with open(combined_path, 'w+') as combined:
            write_files(filenames, combined)

if __name__ == '__main__':
    path = sys.argv[1]
    concatenate_csvs(path)
