Pulling random files out of a folder for sampling

Pulling random files out of a folder for sampling - python

I needed a way to pull 10% of the files in a folder, at random, for sampling after every "run." Luckily, my current files are numbered numerically, and sequentially. So my current method is to list file names, parse the numerical portion, pull max and min values, count the number of files and multiply by .1, then use random.sample to get a "random [10%] sample." I also write these names to a .txt then use shutil.copy to move the actual files.
Obviously, this does not work if I have an outlier, i.e. if I have a file 345.txt among other files from 513.txt - 678.txt. I was wondering if there was a direct way to simply pull a number of files from a folder, randomly? I have looked it up and cannot find a better method.
Thanks.

Using numpy.random.choice(array, N) you can select N items at random from an array.
import numpy as np
import os
# list all files in dir
files = [f for f in os.listdir('.') if os.path.isfile(f)]
# select 0.1 of the files randomly
random_files = np.random.choice(files, int(len(files)*.1))

I was unable to get the other methods to work easily with my code, but I came up with this.
output_folder = 'C:/path/to/folder'
for x in range(int(len(files) *.1)):
to_copy = choice(files)
shutil.copy(os.path.join(subdir, to_copy), output_folder)

This will give you the list of names in the folder with mypath being the path to the folder.
from os import listdir
from os.path import isfile, join
from random import shuffle
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
shuffled = shuffle(onlyfiles)
small_list = shuffled[:len(shuffled)/10]
This should work

You can use following strategy:
Use list = os.listdir(path) to get all your files in the directory as list of paths.
Next, count your files with range = len(list) function.
Using rangenumber you can get random item number like that random_position = random.randrange(1, range)
Repeat step 3 and save values in a list until you get enough positions (range/10 in your case)
After that you can get required files names like that list[random_position]
Use cycle for for iterating.
Hope this helps!

Based on Karl's solution (which did not work for me under Win 10, Python 3.x), I came up with this:
import numpy as np
import os
# List all files in dir
files = os.listdir("C:/Users/.../Myfiles")
# Select 0.5 of the files randomly
random_files = np.random.choice(files, int(len(files)*.5))
# Get the remaining files
other_files = [x for x in files if x not in random_files]
# Do something with the files
for x in random_files:
print(x)

Related

I want to make place to collect all path of folder

This is my code.
folder_out = []
for a in range(1,80):
folder_letter = "/content/drive/MyDrive/project/Dataset/data/"
folder_out[a] = os.path.join(folder_letter, str(a))
folder_out.append(folder_out[a])
and this is an error
and this what I want

You are using the os method wrong, you want to use os.listdir(Your directory here) to get a list of all directories
import os
dir = os.listdir("/content/drive/MyDrive/project/Dataset/data/")
for f in dir:
print(f)
If you just want a list of all directories, just use os.listdir("/content/drive/MyDrive/project/Dataset/data/")

It's simply pointless to create a variable. They are unnecessary: You can store everything in lists, dictionaries and so on. Creating a new variables inside loop is very very bad practice.
Code correction: save in list instead and access them using loops or slicing.
import os
folder_out = []
for a in range(1,80):
folder_letter = "/content/drive/MyDrive/project/Dataset/data/"
folder= os.path.join(folder_letter, str(a))
folder_out.append(folder)
print(folder_out)
Gives list of folder names.
['/content/drive/MyDrive/project/Dataset/data/1', '/content/drive/MyDrive/project/Dataset/data/2', '/content/drive/MyDrive/project/Dataset/data/3', '/content/drive/MyDrive/project/Dataset/data/4', '/content/drive/MyDrive/project/Dataset/data/5', '/content/drive/MyDrive/project/Dataset/data/6', '/content/drive/MyDrive/project/Dataset/data/7', '/content/drive/MyDrive/project/Dataset/data/8', '/content/drive/MyDrive/project/Dataset/data/9',.....]
If you want to iterate over them.
for elment in folder_out:
print(elment)
Which gives #
element 1
elem2nt 2...
Like
for x in folder_out:
print(f"folder_out{c}: {x}")
c= c+1
Gives what you want
folder_out0: /content/drive/MyDrive/project/Dataset/data/1
folder_out1: /content/drive/MyDrive/project/Dataset/data/2
folder_out2: /content/drive/MyDrive/project/Dataset/data/3
folder_out3: /content/drive/MyDrive/project/Dataset/data/4
folder_out4: /content/drive/MyDrive/project/Dataset/data/5
folder_out5: /content/drive/MyDrive/project/Dataset/data/6
folder_out6: /content/drive/MyDrive/project/Dataset/data/7
folder_out7: /content/drive/MyDrive/project/Dataset/data/8
folder_out8: /content/drive/MyDrive/project/Dataset/data/9
folder_out9: /content/drive/MyDrive/project/Dataset/data/10
folder_out10: /content/drive/MyDrive/project/Dataset/data/11
folder_out11: /content/drive/MyDrive/project/Dataset/data/12
folder_out12: /content/drive/MyDrive/project/Dataset/data/13
folder_out13: /content/drive/MyDrive/project/Dataset/data/14
folder_out14: /content/drive/MyDrive/project/Dataset/data/15
folder_out15: /content/drive/MyDrive/project/Dataset/data/16
If you want to create a folder for each path:
import os
for x in folder_out:
os.mkdir(x)
which will create 79 empty folders

batch copy files, perform operation and copy more files

I want to copy files from a directory in batch mode, perform an operation on the copied files and then copy more files. To do this I have managed this code
import os
import sys
from shutil import copy2
_, _, filenames = next(os.walk("src/"))
print(filenames)
number_of_files = len(filenames)
batch_number = 2
i = 0
while i < number_of_files:
i += 1
j = i + batch_number
print(filenames[i:j])
and its output is
['file_02', 'file_03']
['file_03', 'file_04']
['file_04', 'file_010']
['file_010', 'file_01']
['file_01', 'file_06']
['file_06', 'file_08']
['file_08', 'file_09']
['file_09', 'file_07']
['file_07']
[]
What I want is:
['file_01', 'file_02']
['file_03', 'file_04']
['file_05', 'file_06']
['file_07', 'file_08']
['file_09', 'file_10']
What would be the best way to go about doing this?

Be careful, os.walk doesn't provide sorting in numerical way.
You can use sort() method. And pass each time to certain list, where you sort the content numerically
you_file_list.sort(key=int)
Since you case contain file name, you can put file_ to XX number in filenames list.

Opening files from directory in specific order

I have a folder that contains around 500 images that I am rotating at a random angle from 0 to 360. The files are named 00i.jpeg where i = 0 then i = 1. For example I have an image named 009.jpeg and one named 0052.jpeg and another one 00333.jpeg. My code below works as is does rotate the image, but how the files are being read through is not stepping correctly.
I would think I would need some sort of stepping code chunk that starts at 0 and adds one each time, but I'm not sure where I would put that. os.listdir doesn't allow me to do that because (from my understanding) it just lists the files out. I tried using os.walk but I cannot use cv2.imread. I receive a SystemError: <built-in function imread> returned NULL without setting an error error.
Any suggestions?
import cv2
import imutils
from random import randrange
import os
os.chdir("C:\\Users\\name\\Desktop\\training\\JPEG")
j = 0
for infile in os.listdir("C:\\Users\\name\\Desktop\\training\\JPEG"):
filename = 'testing' + str(j) + '.jpeg'
i = randrange(360)
image = cv2.imread(infile)
rotation_output = imutils.rotate_bound(image, angle=i)
os.chdir("C:\\Users\\name\\Desktop\\rotate_test")
cv2.imwrite("C:\\Users\\name\\Desktop\\rotate_test\\" + filename, rotation_output)
os.chdir("C:\\Users\\name\\Desktop\\training\\JPEG")
j = j + 1
print(infile)
000.jpeg
001.jpeg
0010.jpeg
00100.jpeg
...
Needs to be:
print(infile)
000.jpeg
001.jpeg
002.jpeg
003.jpeg
...

Get a list of files first, then use sort with key where the key is an integer version of the file name without extension.
files = os.listdir("C:\\Users\\name\\Desktop\\training\\JPEG")
files.sort(key=lambda x:int(x.split('.')[0]))
for infile in files:
...
Practical example:
files = ['003.jpeg','000.jpeg','001.jpeg','0010.jpeg','00100.jpeg','002.jpeg']
files.sort(key=lambda x:int(x.split('.')[0]))
print(files)
Output
['000.jpeg', '001.jpeg', '002.jpeg', '003.jpeg', '0010.jpeg', '00100.jpeg']

How to randomly sample files from a filesystem in Python

Is there a performant way to sample files from a file system until you hit a target sample size in Python?
For example, let's say I have 10 million files in an arbitrarily nested folder structure and I want a sample of 20,000 files.
Currently, for small-ish flat directories of ~100k or so, I can do something like this:
import os
import random
sample_size = 20_000
sample = random.sample(list(os.scandir(path)), sample_size)
for direntry in sample:
print(direntry.path)
However, this doesn't scale up well. So, I thought maybe put the random check in the loop. This sort of works, but has the problem of if the number of files in the directory is close the sample_size, it may not pick up the full target sample_size and I would need to keep track of which files were included in the sample and then keep looping until I fill up the sample bucket.
import os
import random
sample_size = 20_000
count = 0
for direntry in os.scandir(path):
if random.randint(0, 10) < 5:
continue
print(direntry.path)
count += 1
if count >= sample_size:
print("reached sample_size")
break
Any ideas on how to randomly sample a large selection of files from a large directory structure?

Use iterators/generators so you won't keep all files in memory. And use Reservoir sampling to pick selected samples from the basically a stream of file names.
Code
from pathlib import Path
import random
pathlist = Path("C:/Users/XXX/Documents").glob('**/*.py')
nof_samples = 10
rc = []
for k, path in enumerate(pathlist):
if k < nof_samples:
rc.append(str(path)) # because path is object not string
else:
i = random.randint(0, k)
if i < nof_samples:
rc[i] = str(path)
print(len(rc))
print(rc)

Reading images while maintaining folder structure

I have to write a matlab script in python as apparently what I want to achieve is done much more efficiently in Python.
So the first task is to read all images into python using opencv while maintaining folder structure. For example if the parent folder has 50 sub folders and each sub folder has 10 images then this is how the images variable should look like in python, very much like a cell in matlab. I read that python lists can perform this cell like behaviour without importing anything, so thats good I guess.
For example, below is how I coded it in Matlab:
path = '/home/university/Matlab/att_faces';
subjects = dir(path);
subjects = subjects(~strncmpi('.', {subjects.name}, 1)); %remove the '.' and '..' subfolders
img = cell(numel(subjects),1); %initialize the cell equal to number of subjects
for i = 1: numel(subjects)
path_now = fullfile(path, subjects(i).name);
contents = dir([path_now, '/*.pgm']);
for j = 1: numel(contents)
img{i}{j} = imread(fullfile(path_now,contents(j).name));
disp([i,j]);
end
end
The above img will have 50 cells and each cell will have stored 10 images. img{1} will be all images belonging to subject 1 and so on.
Im trying to replicate this in python but am failing, this is what I have I got so far:
import cv2
import os
import glob
path = '/home/university/Matlab/att_faces'
sub_f = os.listdir(path)
images = []
for n in sub_f:
path_now = os.path.join(path, sub_f[n], '*.pgm')
images[n] = [cv2.imread(file) for file in glob.glob(path_now)]
Its not exactly what I am looking for, some help would be appreciated. Please ignore silly mistakes as it is my first day writing in python.
Thanks
edit: directory structure:

The first problem is that n isn't a number or index, it is a string containing the path name. To get the index, you can use enumerate, which gives index, value pairs.
Second, unlike in MATLAB you can't assign to indexes that don't exist. You need to pre-allocate your image array or, better yet, append to it.
Third, it is better not to use the variable file since in python 2 it is a built-in data type so it can confuse people.
So with preallocating, this should work:
images = [None]*len(sub_f)
for n, cursub in enumerate(sub_f):
path_now = os.path.join(path, cursub, '*.pgm')
images[n] = [cv2.imread(fname) for fname in glob.glob(path_now)]
Using append, this should work:
for cursub in sub_f
path_now = os.path.join(path, cursub, '*.pgm')
images.append([cv2.imread(fname) for fname in glob.glob(path_now)])
That being said, there is an easier way to do this. You can use the pathlib module to simplify this.
So something like this should work:
from pathlib import Path
mypath = Path('/home/university/Matlab/att_faces')
images = []
for subdir in mypath.iterdir():
images.append([cv2.imread(str(curfile)) for curfile in subdir.glob('*.pgm')])
This loops over the subdirectories, then globs each one.
This can even be done in a nested list comprehension:
images = [[cv2.imread(str(curfile)) for curfile in subdir.glob('*.pgm')]
for subdir in mypath.iterdir()]

It should be the following:
import os
path = '/home/university/Matlab/att_faces'
sub_f = os.listdir(path)
print(sub_f) #--- this will print all the files present in this directory ---
#--- this a list to which you will append all the images ---
images = []
#--- iterate through every file in the directory and read those files that end with .pgm format ---
#--- after reading it append it to the list ---
for n in sub_f:
if n.endswith('.pgm'):
path_now = os.path.join(path, n)
print(path_now)
images.append(cv2.imread(path_now, 1))

import cv2
import os
import glob
path = '/home/university/Matlab/att_faces'
sub_f = os.listdir(path)
images = []
#read the images
for folder in sub_f:
path_now = os.path.join(path, folder, '*.pgm')
images.append([cv2.imread(file) for file in glob.glob(path_now)])
#display the images
for folder in images:
for image in folder:
cv2.imshow('image',image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pulling random files out of a folder for sampling - python

Using numpy.random.choice(array, N) you can select N items at random from an array. import numpy as np import os # list all files in dir files = [f for f in os.listdir('.') if os.path.isfile(f)] # select 0.1 of the files randomly random_files = np.random.choice(files, int(len(files)*.1))

I was unable to get the other methods to work easily with my code, but I came up with this. output_folder = 'C:/path/to/folder' for x in range(int(len(files) *.1)): to_copy = choice(files) shutil.copy(os.path.join(subdir, to_copy), output_folder)

Related

I want to make place to collect all path of folder

batch copy files, perform operation and copy more files

Opening files from directory in specific order

How to randomly sample files from a filesystem in Python

Reading images while maintaining folder structure

Categories

Resources