How to randomly sample files from a filesystem in Python - python

Is there a performant way to sample files from a file system until you hit a target sample size in Python?
For example, let's say I have 10 million files in an arbitrarily nested folder structure and I want a sample of 20,000 files.
Currently, for small-ish flat directories of ~100k or so, I can do something like this:
import os
import random

sample_size = 20_000
sample = random.sample(list(os.scandir(path)), sample_size)
for direntry in sample:
    print(direntry.path)
However, this doesn't scale up well. So I thought I'd move the random check into the loop. This sort of works, but if the number of files in the directory is close to sample_size, it may not collect the full target sample, and I would need to keep track of which files were already included and keep looping until the sample bucket is full.
import os
import random

sample_size = 20_000
count = 0
for direntry in os.scandir(path):
    if random.randint(0, 10) < 5:
        continue
    print(direntry.path)
    count += 1
    if count >= sample_size:
        print("reached sample_size")
        break
Any ideas on how to randomly sample a large selection of files from a large directory structure?

Use iterators/generators so you don't keep all the file names in memory, and use reservoir sampling to pick the samples from what is essentially a stream of file names.
Code
from pathlib import Path
import random

pathlist = Path("C:/Users/XXX/Documents").glob('**/*.py')
nof_samples = 10

rc = []
for k, path in enumerate(pathlist):
    if k < nof_samples:
        rc.append(str(path))  # because path is an object, not a string
    else:
        i = random.randint(0, k)
        if i < nof_samples:
            rc[i] = str(path)

print(len(rc))
print(rc)
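For the nested-tree case in the question (millions of files, a sample of 20,000), the same reservoir logic can be fed by os.walk instead of pathlib's glob. This is a minimal sketch; root_dir and the helper name sample_files are placeholders, not from the original post:
import os
import random

def sample_files(root_dir, sample_size=20_000):
    reservoir = []
    k = 0  # number of files seen so far
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if k < sample_size:
                reservoir.append(full_path)
            else:
                i = random.randint(0, k)
                if i < sample_size:
                    reservoir[i] = full_path
            k += 1
    return reservoir
Only the reservoir itself (at most sample_size paths) is ever held in memory, and every file in the tree ends up in the sample with equal probability.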

Related

How to get average size of a directory with given skip step

I have a directory with lots of files. I want to check every nth (or some fixed number of) files for their size, then extrapolate to the total file count in that directory.
I tried something, but my precision and syntax are bad. I'm not asking you to fix my code; it's just an example of what doesn't work well.
I'm on Python 2.7
def get_size2(path):
    files = os.listdir(path)
    filesCount = len(files)
    samples = 5.0
    step = math.ceil(filesCount / samples)
    files = files[0::step]
    reminderCount = filesCount - len(files)
    reminderStep = float(reminderCount / len(files)) + 1
    total_size = 0
    for f in files:
        fp = os.path.join(path, f)
        if not os.path.islink(fp):
            total_size += os.path.getsize(fp) * reminderStep
    return int(total_size)
It's hard to fully understand what you are trying to do from the code given, but I think you want to estimate the total directory size from the average size found in a sub-sample.
You can iterate through the files with a fixed increment by passing a third (step) parameter to range in a for loop:
for count in range(0, len(files), samples):
    print(f"On count: {count}")
Also, I'm a bit lost by the reminderCount and reminderStep variables.
Essentially you want to compute the average size of the files you have viewed (the total size you have seen, divided by the number of files you have looked at). You can multiply that average file size by the number of files in the directory to extrapolate the expected directory size from the sample. Turning the above logic into a function simplifies the problem to the following:
import os
import math

def get_size2(path):
    files = os.listdir(path)
    filesCount = len(files)
    samples = 1
    files_counted = 0
    total_size = 0
    for count in range(0, len(files), samples):
        files_counted += 1
        f = files[count]
        fp = os.path.join(path, f)
        if not os.path.islink(fp):
            total_size += os.path.getsize(fp)
    return int(total_size / files_counted) * filesCount

def main():
    print(f'{get_size2("./test/path")}')

if __name__ == "__main__":
    main()
This attempts to keep as many of the variables and as much of the structure as you posted, while adjusting the logic of the example. There are changes that I would recommend to the code, such as passing the sample size as a parameter.
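For example, a minimal sketch of the same function with the sampling step passed in as a parameter; the default step of 5 is an illustrative assumption, and like the snippet above it assumes Python 3 rather than the 2.7 mentioned in the question:
import os

def get_size2(path, step=5):
    # step: look at every step-th file, i.e. roughly 1 in `step` files
    files = os.listdir(path)
    files_count = len(files)
    files_counted = 0
    total_size = 0
    for index in range(0, files_count, step):
        fp = os.path.join(path, files[index])
        if not os.path.islink(fp):
            total_size += os.path.getsize(fp)
            files_counted += 1
    if files_counted == 0:
        return 0
    # average sampled file size, extrapolated to every file in the directory
    return int(total_size / files_counted * files_count)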

batch copy files, perform operation and copy more files

I want to copy files from a directory in batch mode, perform an operation on the copied files and then copy more files. To do this I have put together this code
import os
import sys
from shutil import copy2

_, _, filenames = next(os.walk("src/"))
print(filenames)
number_of_files = len(filenames)
batch_number = 2

i = 0
while i < number_of_files:
    i += 1
    j = i + batch_number
    print(filenames[i:j])
and its output is
['file_02', 'file_03']
['file_03', 'file_04']
['file_04', 'file_010']
['file_010', 'file_01']
['file_01', 'file_06']
['file_06', 'file_08']
['file_08', 'file_09']
['file_09', 'file_07']
['file_07']
[]
What I want is:
['file_01', 'file_02']
['file_03', 'file_04']
['file_05', 'file_06']
['file_07', 'file_08']
['file_09', 'file_10']
What would be the best way to go about doing this?
Be careful: os.walk does not return file names sorted numerically.
You can use the sort() method, sorting the contents numerically:
your_file_list.sort(key=int)
Since your names contain the file_ prefix, you would need a key that extracts the XX number from each name in filenames rather than passing the raw name to int.
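Putting both pieces together, a minimal sketch might look like this; the file_NN naming pattern (and the lambda that strips the file_ prefix) is assumed from the question's output, not something the answer specified:
import os

_, _, filenames = next(os.walk("src/"))
batch_number = 2

# Sort by the numeric part of names like "file_01" or "file_010"
filenames.sort(key=lambda name: int(name.split('_')[1]))

# Step through the list in non-overlapping batches of batch_number
for i in range(0, len(filenames), batch_number):
    batch = filenames[i:i + batch_number]
    print(batch)
    # ... copy the batch, perform the operation, then continue ...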

Find duplicate images in fastest way

I have 2 image folders containing 10k and 35k images. Each image is approximately (2k, 2k) in size.
I want to remove the images which are exact duplicates.
The variation between different images is just a change in some pixels.
I have tried dHashing, pHashing and aHashing, but as they are lossy image-hashing techniques they give the same hash for non-duplicate images too.
I also tried writing Python code that just subtracts images; a pair whose resulting difference array is zero everywhere is treated as a duplicate.
But the time for a single comparison is 0.29 seconds, and for all 350 million combinations that is really huge.
Is there a way to do it faster without also flagging non-duplicate images?
I am open to doing it in any language (C, C++) or any approach (distributed computing, multithreading) that can solve my problem accurately.
Apologies if I added some irrelevant approaches, as I am not from a computer science background.
Below is the code I used for the Python approach -
# Assuming skimage.io for imread; path1 and path2 are lists of image file paths
import os
import timeit
import numpy as np
from skimage import io

start = timeit.default_timer()
dict = {}
for i in path1:
    img1 = io.imread(i)
    base1 = os.path.basename(i)
    for j in path2:
        img2 = io.imread(j)
        base2 = os.path.basename(j)
        if np.array_equal(img1, img2):
            err = img1.astype('float') - img2.astype('float')
            is_all_zero = np.all((err == 0))
            if is_all_zero:
                dict[base1] = base2
        else:
            continue
stop = timeit.default_timer()
print('Time: ', stop - start)
Use lossy hashing as a prefiltering step, before a complete comparison. You can also generate thumbnail images (say 12 x 8 pixels), and compare for similarity.
The idea is to perform quick rejection of very different images.
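As a concrete illustration of that prefiltering idea (not from the original answer), here is a minimal sketch: each image is reduced to a tiny 12 x 8 thumbnail that serves as a grouping key, and the exact pixel comparison only runs inside groups that share a thumbnail. The folder arguments and the names find_exact_duplicates and thumb_key are illustrative assumptions:
import os
from collections import defaultdict

import cv2
import numpy as np

def find_exact_duplicates(folder1, folder2, thumb_size=(12, 8)):
    def thumb_key(path):
        # Tiny lossy fingerprint: identical images always share it,
        # and most different images do not
        img = cv2.imread(path)
        small = cv2.resize(img, thumb_size, interpolation=cv2.INTER_AREA)
        return small.tobytes()

    # Group every file from both folders by its thumbnail fingerprint
    groups = defaultdict(lambda: ([], []))
    for idx, folder in enumerate((folder1, folder2)):
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            groups[thumb_key(path)][idx].append(path)

    # Exact comparison only within groups, instead of across all 350M pairs
    duplicates = {}
    for paths1, paths2 in groups.values():
        for p1 in paths1:
            img1 = cv2.imread(p1)
            for p2 in paths2:
                if np.array_equal(img1, cv2.imread(p2)):
                    duplicates[os.path.basename(p1)] = os.path.basename(p2)
    return duplicates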
You could look up answers on how to find and delete duplicate files in general (not only images). Then you can use, for example, fdupes, or find an alternative tool: https://alternativeto.net/software/fdupes/
This code checks if there are any duplicates in a folder (it's a bit slow though):
import image_similarity_measures
from image_similarity_measures.quality_metrics import rmse, psnr
from sewar.full_ref import rmse, psnr
import cv2
import os
import time

def check(path_orginal, path_new):  # give r strings
    original = cv2.imread(path_orginal)
    new = cv2.imread(path_new)
    return rmse(original, new)

def folder_check(folder_path):
    i = 0
    file_list = os.listdir(folder_path)
    print(file_list)
    duplicate_dict = {}
    for file in file_list:
        # print(file)
        file_path = os.path.join(folder_path, file)
        for file_compare in file_list:
            print(i)
            i += 1
            file_compare_path = os.path.join(folder_path, file_compare)
            if file_compare != file:
                similarity_score = check(file_path, file_compare_path)
                # print(str(similarity_score))
                if similarity_score == 0.0:
                    print(file, file_compare)
                    duplicate_dict[file] = file_compare
                    file_list.remove(str(file))
    return duplicate_dict

start_time = time.time()
print(folder_check(r"C:\Users\Admin\Linear-Regression-1\image-similarity-measures\input1"))
end_time = time.time()
stamp = end_time - start_time
print(stamp)

Reading images while maintaining folder structure

I have to rewrite a Matlab script in Python, as apparently what I want to achieve is done much more efficiently in Python.
So the first task is to read all images into Python using OpenCV while maintaining the folder structure. For example, if the parent folder has 50 sub-folders and each sub-folder has 10 images, then the images variable in Python should behave very much like a cell array in Matlab. I read that Python lists can provide this cell-like behaviour without importing anything, so that's good I guess.
For example, below is how I coded it in Matlab:
path = '/home/university/Matlab/att_faces';
subjects = dir(path);
subjects = subjects(~strncmpi('.', {subjects.name}, 1)); % remove the '.' and '..' subfolders
img = cell(numel(subjects), 1); % initialize the cell equal to number of subjects
for i = 1:numel(subjects)
    path_now = fullfile(path, subjects(i).name);
    contents = dir([path_now, '/*.pgm']);
    for j = 1:numel(contents)
        img{i}{j} = imread(fullfile(path_now, contents(j).name));
        disp([i, j]);
    end
end
The above img will have 50 cells and each cell will have stored 10 images. img{1} will be all images belonging to subject 1 and so on.
I'm trying to replicate this in Python but am failing; this is what I have got so far:
import cv2
import os
import glob

path = '/home/university/Matlab/att_faces'
sub_f = os.listdir(path)
images = []
for n in sub_f:
    path_now = os.path.join(path, sub_f[n], '*.pgm')
    images[n] = [cv2.imread(file) for file in glob.glob(path_now)]
It's not exactly what I am looking for; some help would be appreciated. Please ignore silly mistakes, as it is my first day writing Python.
Thanks
edit: directory structure:
The first problem is that n isn't a number or index, it is a string containing the path name. To get the index, you can use enumerate, which gives index, value pairs.
Second, unlike in MATLAB you can't assign to indexes that don't exist. You need to pre-allocate your image array or, better yet, append to it.
Third, it is better not to use the variable name file, since in Python 2 it is a built-in type, so it can confuse people.
So with preallocating, this should work:
images = [None] * len(sub_f)
for n, cursub in enumerate(sub_f):
    path_now = os.path.join(path, cursub, '*.pgm')
    images[n] = [cv2.imread(fname) for fname in glob.glob(path_now)]
Using append, this should work:
for cursub in sub_f:
    path_now = os.path.join(path, cursub, '*.pgm')
    images.append([cv2.imread(fname) for fname in glob.glob(path_now)])
That being said, there is an easier way to do this. You can use the pathlib module to simplify this.
So something like this should work:
from pathlib import Path

mypath = Path('/home/university/Matlab/att_faces')
images = []
for subdir in mypath.iterdir():
    images.append([cv2.imread(str(curfile)) for curfile in subdir.glob('*.pgm')])
This loops over the subdirectories, then globs each one.
This can even be done in a nested list comprehension:
images = [[cv2.imread(str(curfile)) for curfile in subdir.glob('*.pgm')]
          for subdir in mypath.iterdir()]
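If you also want to keep track of which sub-folder (subject) each group of images came from, a dictionary keyed by the sub-directory name is a small variation on the same idea; this is a sketch, not part of the original answer:
from pathlib import Path
import cv2

mypath = Path('/home/university/Matlab/att_faces')
# images_by_subject['s1'] will hold all images found in the s1 sub-folder
images_by_subject = {
    subdir.name: [cv2.imread(str(curfile)) for curfile in subdir.glob('*.pgm')]
    for subdir in mypath.iterdir() if subdir.is_dir()
}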
It should be the following:
import os
import cv2

path = '/home/university/Matlab/att_faces'
sub_f = os.listdir(path)
print(sub_f)  # --- this will print all the files present in this directory ---

# --- this is a list to which you will append all the images ---
images = []

# --- iterate through every file in the directory and read those files that end with the .pgm format ---
# --- after reading each one, append it to the list ---
for n in sub_f:
    if n.endswith('.pgm'):
        path_now = os.path.join(path, n)
        print(path_now)
        images.append(cv2.imread(path_now, 1))
import cv2
import os
import glob

path = '/home/university/Matlab/att_faces'
sub_f = os.listdir(path)
images = []

# read the images
for folder in sub_f:
    path_now = os.path.join(path, folder, '*.pgm')
    images.append([cv2.imread(file) for file in glob.glob(path_now)])

# display the images
for folder in images:
    for image in folder:
        cv2.imshow('image', image)
        cv2.waitKey(0)
cv2.destroyAllWindows()

Pulling random files out of a folder for sampling

I needed a way to pull 10% of the files in a folder, at random, for sampling after every "run." Luckily, my current files are numbered sequentially. So my current method is to list the file names, parse the numerical portion, pull the max and min values, count the files and multiply by 0.1, then use random.sample to get a "random 10% sample." I also write these names to a .txt and then use shutil.copy to move the actual files.
Obviously, this does not work if I have an outlier, e.g. a file 345.txt among other files running from 513.txt to 678.txt. I was wondering if there is a direct way to simply pull a number of files from a folder at random? I have looked it up and cannot find a better method.
Thanks.
Using numpy.random.choice(array, N) you can select N items at random from an array (note that by default the selection is with replacement; pass replace=False if the same file should not be picked twice).
import numpy as np
import os
# list all files in dir
files = [f for f in os.listdir('.') if os.path.isfile(f)]
# select 0.1 of the files randomly
random_files = np.random.choice(files, int(len(files)*.1))
I was unable to get the other methods to work easily with my code, but I came up with this:
from random import choice
import os
import shutil

# files and subdir come from the surrounding code (not shown)
output_folder = 'C:/path/to/folder'
for x in range(int(len(files) * .1)):
    to_copy = choice(files)
    shutil.copy(os.path.join(subdir, to_copy), output_folder)
This will give you the list of names in the folder, with mypath being the path to the folder.
from os import listdir
from os.path import isfile, join
from random import shuffle

onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
shuffle(onlyfiles)  # shuffle() works in place and returns None
small_list = onlyfiles[:len(onlyfiles) // 10]
This should work.
You can use the following strategy (a rough sketch follows below):
Use files = os.listdir(path) to get all the files in the directory as a list of names.
Next, count your files with count = len(files).
Using that count you can get a random item number like random_position = random.randrange(0, count).
Repeat step 3 and save the values in a list until you have enough positions (count / 10 in your case).
After that you can get the required file names with files[random_position].
Use a for loop for iterating.
Hope this helps!
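As a rough sketch of that strategy (the names files, count, target and positions here are illustrative, not from the answer):
import os
import random

files = os.listdir(path)
count = len(files)
target = count // 10

# Collect distinct random positions; a set avoids picking the same index twice
positions = set()
while len(positions) < target:
    positions.add(random.randrange(0, count))

sampled_names = [files[p] for p in positions]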
Based on Karl's solution (which did not work for me under Win 10, Python 3.x), I came up with this:
import numpy as np
import os
# List all files in dir
files = os.listdir("C:/Users/.../Myfiles")
# Select 0.5 of the files randomly
random_files = np.random.choice(files, int(len(files)*.5))
# Get the remaining files
other_files = [x for x in files if x not in random_files]
# Do something with the files
for x in random_files:
    print(x)
