It's late at night for me and I'm banging my head against the wall as to why I can't figure this out.
I'm trying to split a directory containing 100,000 folders into 4 subdirectories with 25,000 folders in each.
Here is the code I have:
import os
import shutil
from alive_progress import alive_bar

# Set the directory you want to separate
src_dir = r'C:\Users\Administrator\Desktop\base'
# Set the number of directories you want in each subdirectory
num_dirs_per_subdir = 25000
# Set the base name for the subdirectories
subdir_base_name = '25k-Split'
# Calculate the number of subdirectories needed
num_subdirs = len(os.listdir(src_dir)) // num_dirs_per_subdir
# Iterate over the subdirectories
for i in range(num_subdirs):
    # Create the subdirectory path
    subdir_path = os.path.join(src_dir, f'{subdir_base_name}_{i}')
    # Create the subdirectory
    os.mkdir(subdir_path)
    # Get the directories to move
    dirs_to_move = os.listdir(src_dir)[i*num_dirs_per_subdir:(i+1)*num_dirs_per_subdir]
    # Iterate over the directories to move
    with alive_bar(1000, force_tty=True) as bar:
        for directory in dirs_to_move:
            # Construct the source and destination paths
            src_path = os.path.join(src_dir, directory)
            dst_path = os.path.join(subdir_path, directory)
            bar()
            # Move the directory
            shutil.move(src_path, dst_path)
            bar()
I of course receive the following error:
Cannot move a directory 'C:\Users\Administrator\Desktop\base\25k-Split_0' into itself 'C:\Users\Administrator\Desktop\base\25k-Split_0\25k-Split_0'
Any help greatly appreciated.
You have 4 bugs:
You don't calculate the number of subdirectories needed correctly.
Change
num_subdirs = len(os.listdir(src_dir)) // num_dirs_per_subdir
to
num_subdirs = -(-len(os.listdir(src_dir)) // num_dirs_per_subdir)  # ceiling division
If you have 1 directory and want 25,000 directories per subdirectory, how many subdirectories do you need? 1, not 0. (Simply adding 1 instead would create an extra, empty subdirectory whenever the count divides evenly, as it does here with 100,000 / 25,000, so use ceiling division.)
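A quick sanity check of the ceiling-division idiom:
# -(-a // b) rounds a/b up using only floor division
assert -(-1 // 25000) == 1        # 1 directory still needs 1 subdirectory
assert -(-100000 // 25000) == 4   # an exact multiple gets no extra, empty one
assert -(-100001 // 25000) == 5   # one leftover directory needs a 5th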
You need to check if the subdirectory already exists:
# Create the subdirectory path
subdir_path = os.path.join(src_dir, f'{subdir_base_name}_{i}')
if os.path.exists(subdir_path):
    raise RuntimeError(f"{subdir_path} already exists")
# Create the subdirectory
os.mkdir(subdir_path)
You should give the target directory to shutil.move:
shutil.move(src_path, subdir_path)
You recalculate the directory list every time, which includes the subdirectories:
# outside loop
directories = os.listdir(src_dir)
# ...
dirs_to_move = directories[i*num_dirs_per_subdir:(i+1)*num_dirs_per_subdir]
I believe issues #2 and #4 are the main problems.
The problem is this line:
dirs_to_move = os.listdir(src_dir)[...
You keep fetching the directory list each time you go through the outer loop range(num_subdirs). After you handle the first subdirectory, the second iteration of the loop also picks up the subdirectory you just created.
Delete the line above from inside the first loop and calculate directories to move outside the loops only once. Then index into it to get the list of dirs to move without refetching the directory list again, like this:
all_dirs = os.listdir(src_dir)
# Iterate over the subdirectories
for i in range(num_subdirs):
    dir_index = i * num_dirs_per_subdir
    dirs_to_move = all_dirs[dir_index : dir_index + num_dirs_per_subdir]
...
Your logic also doesn't work if the number of directories doesn't divide evenly by num_dirs_per_subdir. Here is how you can fix that:
start_index = i * num_dirs_per_subdir
end_index = start_index + num_dirs_per_subdir
if end_index > len(all_dirs):
    dirs_to_move = all_dirs[start_index:]
else:
    dirs_to_move = all_dirs[start_index : end_index]
...
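Putting the fixes together, a minimal corrected sketch of the whole script (progress bar omitted for brevity; treat it as a starting point rather than a drop-in replacement):
import os
import shutil

src_dir = r'C:\Users\Administrator\Desktop\base'
num_dirs_per_subdir = 25000
subdir_base_name = '25k-Split'

# snapshot the listing once, so the new subdirectories are never included
all_dirs = [d for d in os.listdir(src_dir)
            if os.path.isdir(os.path.join(src_dir, d))]

# ceiling division, so a partial final batch still gets its own subdirectory
num_subdirs = -(-len(all_dirs) // num_dirs_per_subdir)

for i in range(num_subdirs):
    subdir_path = os.path.join(src_dir, f'{subdir_base_name}_{i}')
    if os.path.exists(subdir_path):
        raise RuntimeError(f"{subdir_path} already exists")
    os.mkdir(subdir_path)
    # slicing clamps at the end of the list, so no bounds check is needed
    for directory in all_dirs[i * num_dirs_per_subdir:(i + 1) * num_dirs_per_subdir]:
        shutil.move(os.path.join(src_dir, directory), subdir_path)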
I have to rewrite a MATLAB script in Python, as apparently what I want to achieve is done much more efficiently in Python.
The first task is to read all the images into Python using OpenCV while maintaining the folder structure. For example, if the parent folder has 50 subfolders and each subfolder has 10 images, then the images variable in Python should look very much like a cell array in MATLAB. I read that Python lists can provide this cell-like behaviour without importing anything, so that's good, I guess.
For example, below is how I coded it in Matlab:
path = '/home/university/Matlab/att_faces';
subjects = dir(path);
subjects = subjects(~strncmpi('.', {subjects.name}, 1)); %remove the '.' and '..' subfolders
img = cell(numel(subjects),1); %initialize the cell equal to number of subjects
for i = 1:numel(subjects)
    path_now = fullfile(path, subjects(i).name);
    contents = dir([path_now, '/*.pgm']);
    for j = 1:numel(contents)
        img{i}{j} = imread(fullfile(path_now, contents(j).name));
        disp([i,j]);
    end
end
The above img will have 50 cells, and each cell will store 10 images. img{1} will be all images belonging to subject 1, and so on.
I'm trying to replicate this in Python but am failing. This is what I have so far:
import cv2
import os
import glob

path = '/home/university/Matlab/att_faces'
sub_f = os.listdir(path)
images = []
for n in sub_f:
    path_now = os.path.join(path, sub_f[n], '*.pgm')
    images[n] = [cv2.imread(file) for file in glob.glob(path_now)]
It's not exactly what I am looking for; some help would be appreciated. Please ignore silly mistakes, as it is my first day writing Python.
Thanks
The first problem is that n isn't a number or an index; it is a string containing the path name. To get the index, you can use enumerate, which gives (index, value) pairs.
Second, unlike in MATLAB, you can't assign to indexes that don't exist. You need to pre-allocate your image list or, better yet, append to it.
Third, it is better not to use the variable name file, since in Python 2 it is a built-in type, so it can confuse people.
So with preallocating, this should work:
images = [None]*len(sub_f)
for n, cursub in enumerate(sub_f):
    path_now = os.path.join(path, cursub, '*.pgm')
    images[n] = [cv2.imread(fname) for fname in glob.glob(path_now)]
Using append, this should work:
images = []
for cursub in sub_f:
    path_now = os.path.join(path, cursub, '*.pgm')
    images.append([cv2.imread(fname) for fname in glob.glob(path_now)])
That being said, there is an easier way to do this. You can use the pathlib module to simplify this.
So something like this should work:
from pathlib import Path

mypath = Path('/home/university/Matlab/att_faces')
images = []
for subdir in mypath.iterdir():
    images.append([cv2.imread(str(curfile)) for curfile in subdir.glob('*.pgm')])
This loops over the subdirectories, then globs each one.
This can even be done in a nested list comprehension:
images = [[cv2.imread(str(curfile)) for curfile in subdir.glob('*.pgm')]
          for subdir in mypath.iterdir()]
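As a quick check of the resulting structure (assuming the att_faces layout from the question, where each image loads as a numpy array):
# images[0] holds all images for the first subject,
# images[0][0] is that subject's first image
print(len(images))          # number of subjects, e.g. 50
print(images[0][0].shape)   # height x width x channels of one image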
It should be the following:
import os
import cv2

path = '/home/university/Matlab/att_faces'
sub_f = os.listdir(path)
print(sub_f)  #--- this will print all the files present in this directory ---

#--- this is a list to which you will append all the images ---
images = []

#--- iterate through every file in the directory and read the files that end with the .pgm extension ---
#--- after reading, append each image to the list ---
for n in sub_f:
    if n.endswith('.pgm'):
        path_now = os.path.join(path, n)
        print(path_now)
        images.append(cv2.imread(path_now, 1))
import cv2
import os
import glob

path = '/home/university/Matlab/att_faces'
sub_f = os.listdir(path)
images = []

# read the images
for folder in sub_f:
    path_now = os.path.join(path, folder, '*.pgm')
    images.append([cv2.imread(file) for file in glob.glob(path_now)])

# display the images
for folder in images:
    for image in folder:
        cv2.imshow('image', image)
        cv2.waitKey(0)
cv2.destroyAllWindows()
I have code that finds every file in the directory with a certain extension,
and I want this script to be applied to every shapefile:
import geopandas as gpd

pst = gpd.read_file(r'C:\Users\user\Desktop\New folder1\PST')  # not needed in the final version, because the path will come from the loop
dbound = gpd.read_file(r'C:\Users\user\Desktop\New folder1\DBOUND')  # same here
dbound.reset_index(inplace=True)
dbound = dbound.rename(columns={'index': 'fid'})
wdp = gpd.sjoin(pst, dbound, how="inner", op='within')  # each DBOUND and PST from every subfolder
wdp['DEC_ID'] = wdp['fid']
This is the list, grouped_shapefiles, that contains the paths to the shapefiles:
[['C:\\Users\\user\\Desktop\\eff\\20194\\DBOUND\\DBOUND.shp',
'C:\\Users\\user\\Desktop\\eff\\20194\\PST\\PST.shp'],
['C:\\Users\\user\\Desktop\\eff\\20042\\DBOUND\\DBOUND.shp',
'C:\\Users\\user\\Desktop\\eff\\20042\\PST\\PST.shp'],
['C:\\Users\\user\\Desktop\\eff\\20161\\DBOUND\\DBOUND.shp',
'C:\\Users\\user\\Desktop\\eff\\20161\\PST\\PST.shp'],
['C:\\Users\\user\\Desktop\\eff\\20029\\DBOUND\\DBOUND.shp',
'C:\\Users\\user\\Desktop\\eff\\20029\\PST\\PST.shp'],
['C:\\Users\\user\\Desktop\\eff\\20008\\DBOUND\\DBOUND.shp',
'C:\\Users\\user\\Desktop\\eff\\20008\\PST\\PST.shp']]
and I want something like this:
results = []
for group in grouped_shapefiles:
    # here goes the script above, which I need help connecting into the loop
    # and then the export step, like the line that follows
    # o = a path
    out = o + r'\result.shp'  # it would be nice to add the folder name to the output name so it is unique
    data2.to_file(out)
How can I do that?
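A minimal sketch of how the loop could look, assuming each inner list is ordered [DBOUND, PST] as shown above, and with o as a hypothetical output directory:
import os
import geopandas as gpd

o = r'C:\Users\user\Desktop\eff\output'  # hypothetical output directory

for dbound_path, pst_path in grouped_shapefiles:
    pst = gpd.read_file(pst_path)
    dbound = gpd.read_file(dbound_path)
    dbound.reset_index(inplace=True)
    dbound = dbound.rename(columns={'index': 'fid'})
    wdp = gpd.sjoin(pst, dbound, how="inner", op='within')
    wdp['DEC_ID'] = wdp['fid']
    # use the numbered parent folder (e.g. 20194) to make each output name unique
    folder_name = os.path.basename(os.path.dirname(os.path.dirname(pst_path)))
    wdp.to_file(os.path.join(o, 'result_{}.shp'.format(folder_name)))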
I needed a way to pull 10% of the files in a folder, at random, for sampling after every "run." Luckily, my current files are numbered numerically and sequentially. So my current method is to list the file names, parse the numerical portion, pull the max and min values, count the number of files and multiply by 0.1, then use random.sample to get a random 10% sample. I also write these names to a .txt and then use shutil.copy to move the actual files.
Obviously, this does not work if I have an outlier, e.g. a file 345.txt among other files running from 513.txt to 678.txt. I was wondering if there is a direct way to simply pull a number of files from a folder at random? I have looked it up and cannot find a better method.
Thanks.
Using numpy.random.choice(array, N) you can select N items at random from an array.
import numpy as np
import os

# list all files in the current dir
files = [f for f in os.listdir('.') if os.path.isfile(f)]
# select 10% of the files at random; replace=False avoids picking the same file twice
random_files = np.random.choice(files, int(len(files)*.1), replace=False)
I was unable to get the other methods to work easily with my code, but I came up with this.
import os
import shutil
from random import choice

output_folder = 'C:/path/to/folder'
# 'files' and 'subdir' come from the surrounding code (e.g. an os.walk loop);
# note that choice() may pick the same file more than once
for x in range(int(len(files) * .1)):
    to_copy = choice(files)
    shutil.copy(os.path.join(subdir, to_copy), output_folder)
This will give you the list of names in the folder, with mypath being the path to the folder.
from os import listdir
from os.path import isfile, join
from random import shuffle
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
shuffle(onlyfiles)  # shuffle() shuffles in place and returns None
small_list = onlyfiles[:len(onlyfiles) // 10]
This should work
You can use the following strategy:
Use files = os.listdir(path) to get all the files in the directory as a list of names.
Next, count your files with count = len(files).
Using that count you can get a random item index like this: random_position = random.randrange(count).
Repeat step 3 and save the values in a list until you have enough positions (count/10 in your case).
After that you can get the required file names with files[random_position].
Use a for loop for iterating; see the sketch below.
Hope this helps!
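A minimal sketch of that strategy (the path is a placeholder):
import os
import random

path = '.'  # placeholder folder
files = os.listdir(path)
count = len(files)

# collect unique random positions until we have 10% of the files
positions = set()
while len(positions) < count // 10:
    positions.add(random.randrange(count))

sample = [files[i] for i in positions]
print(sample)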
Based on Karl's solution (which did not work for me under Windows 10, Python 3.x), I came up with this:
import numpy as np
import os
# List all files in the dir
files = os.listdir("C:/Users/.../Myfiles")
# Select 50% of the files randomly; replace=False avoids duplicates
random_files = np.random.choice(files, int(len(files)*.5), replace=False)
# Get the remaining files
other_files = [x for x in files if x not in random_files]
# Do something with the files
for x in random_files:
print(x)
Following is the code:
import os

def get_size(path):
    total_size = 0
    for root, dirs, files in os.walk(path):
        for f in files:
            fp = os.path.join(root, f)
            total_size += os.path.getsize(fp)
    return total_size

for root, dirs, files in os.walk('F:\House'):
    print(root, get_size(root))
OUTPUT:
F:\House 21791204366
F:\House\house md 1832264906
F:\House\house md\house M D 1 1101710538
F:\House\Season 2 3035002265
F:\House\Season 3 3024588888
F:\House\Season 4 2028970391
F:\House\Season 5 3063415301
F:\House\Season 6 2664657424
F:\House\Season 7 3322229429
F:\House\Season 8 2820075762
I need only the subdirectories directly under the main directory, with their sizes. My code goes all the way down to the last directory level and prints a size for every level.
As an example:
F:\House 21791204366
F:\House\house md 1832264906
F:\House\house md\house M D 1 1101710538
It has printed the size for house md as well as for house M D 1 (which is a subdirectory of house md), but I only want it down to the house md subdirectory level.
DESIRED OUTPUT:
I need the size of each subdirectory one level below the main directory (specified by the user), not the sub-subdirectories (but their sizes should be included in their parent directories).
How do I go about it ?
To print the size of each immediate subdirectory and the total size for the parent directory similar to du -bcs */ command:
#!/usr/bin/env python3.6
"""Usage: du-bcs <parent-dir>"""
import os
import sys

if len(sys.argv) != 2:
    sys.exit(__doc__)  # print usage

parent_dir = sys.argv[1]
total = 0
for entry in os.scandir(parent_dir):
    if entry.is_dir(follow_symlinks=False):  # directory
        size = get_tree_size_scandir(entry)
        # print the size of each immediate subdirectory
        print(size, entry.name, sep='\t')
    elif entry.is_file(follow_symlinks=False):  # regular file
        size = entry.stat(follow_symlinks=False).st_size
    else:
        continue
    total += size
print(total, parent_dir, sep='\t')  # print the total size for the parent dir
where get_tree_size_scandir() returns the total size of a directory tree (the original answer links to reference implementations, with text in Russian and code in Python, C, C++, and bash).
The size of a directory here is the apparent size of all regular files in it and its subdirectories recursively. It doesn't count the size for the directory entries themselves or the actual disk usage for the files. Related: why is the output of du often so different from du -b.
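The linked implementation isn't reproduced here, but a minimal sketch of what get_tree_size_scandir() could look like, matching the apparent-size semantics described above:
import os

def get_tree_size_scandir(path):
    """Return the apparent size of all regular files under path, recursively."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            total += get_tree_size_scandir(entry.path)
        elif entry.is_file(follow_symlinks=False):
            total += entry.stat(follow_symlinks=False).st_size
    return total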
Instead of using os.walk in your get_size function, you can use listdir in conjunction with isdir:
for f in os.listdir(path):
    fp = os.path.join(path, f)
    if not os.path.isdir(fp):
        # Do your stuff
        total_size += os.path.getsize(fp)
    ...
os.walk will visit the entire directory tree, whereas listdir will only visit the files in the current directory.
However, be aware that this will not add the size of the subdirectories to the directory size. So if "Season 1" has 5 files of 100MB each, and 5 directories of 100 MB each, then the size reported by your function will be 500MB only.
Hint: Use recursion if you want the size of subdirectories to get added as well.
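For instance, a minimal recursive sketch along those lines (using the question's F:\House as the parent directory):
import os

def dir_size(path):
    """Apparent size of path, including its subdirectories, via recursion."""
    total = 0
    for name in os.listdir(path):
        fp = os.path.join(path, name)
        if os.path.isdir(fp):
            total += dir_size(fp)  # recurse into the subdirectory
        else:
            total += os.path.getsize(fp)
    return total

# size of each immediate subdirectory only, as the question asks
parent = r'F:\House'
for name in os.listdir(parent):
    fp = os.path.join(parent, name)
    if os.path.isdir(fp):
        print(fp, dir_size(fp))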
I would like to search and print directories under C:\, for example, but only list those in the 1st and 2nd levels down whose names contain SP30070156-1.
What is the most efficient way to do this using Python 2, without the script walking through the entire directory tree (there are so many subdirectories in my case that it would take a very long time)?
typical directory names are as follow:
Rooty Hill SP30068539-1 3RD Split Unit AC Project
Oxford Falls SP30064418-1 Upgrade SES MSB
Queanbeyan SP30066062-1 AC
You can try to create a function based on os.walk(). Something like this should get you started:
import os

def walker(base_dir, level=1, string=None):
    results = []
    for root, dirs, files in os.walk(base_dir):
        _root = root.replace(base_dir + '\\', '')  # you may need to remove the "+ '\\'"
        if _root.count('\\') < level:
            if string is None:
                results.append(dirs)
            else:
                if string in dirs:
                    results.append(dirs)
    return results
Then you can just call it with string='SP30070156-1' and level=1, then level=2.
Not sure if it's going to be faster than 40s, though.
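A hypothetical call (the base directory is made up; note also that string in dirs tests for an exact directory name, so a partial match such as just the job number would need a substring test like any(string in d for d in dirs)):
# hypothetical usage: collect matches one and two levels down
first_level = walker(r'C:\jobs', level=1, string='Queanbeyan SP30066062-1 AC')
second_level = walker(r'C:\jobs', level=2, string='Queanbeyan SP30066062-1 AC')
print(first_level, second_level)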
Here is the code I used. The method is quick to list, and if filtered by a keyword it is even quicker:
import os

MAX_DEPTH = 1

class Found(Exception):
    pass

#folders = ['U:\I-Project Works\PPM 20003171\PPM 11-12 NSW', 'U:\I-Project Works\PPM 20003171\PPM 11-12 QLD']
folders = ['U:\I-Project Works\PPM 20003171\PPM 11-12 NSW']

try:
    for stuff in folders:
        for root, dirs, files in os.walk(stuff, topdown=True):
            for dir in dirs:
                if "SP30070156-1" in dir:
                    sp_path = root + "\\" + dir
                    print(sp_path)
                    raise Found  # stop walking once the directory is found
            if root.count(os.sep) - stuff.count(os.sep) == MAX_DEPTH - 1:
                del dirs[:]  # prune the walk below MAX_DEPTH
except Found:
    print("found")