Speed up the creation of pathnames - Python

I have 2 folders. The first (called A) contains images named in the form subject_incrementalNumber.jpg (where incrementalNumber runs from 0 to X).
I process each image in folder A, extract some pieces from it, and save each piece in folder B with the name subject (the same as the original image in folder A)_incrementalNumber (the same as in folder A)_anotherIncrementalNumber (which distinguishes one piece from another).
Finally, I delete the processed image from folder A.
A
    subjectA_0.jpg
    subjectA_1.jpg
    subjectA_2.jpg
    ...
    subjectB_0.jpg
B
    subjectA_0_0.jpg
    subjectA_0_1.jpg
    subjectA_1_0.jpg
    subjectA_2_0.jpg
    ...
Every time I download a new image of one subject and save it in folder A, I have to compute a new pathname for it (I have to find the minimum incrementalNumber available for that subject). The problem is that when I process an image I delete it from folder A and store only the pieces in folder B, so I have to find the minimum number available across both folders.
Right now I use the following function to create the pathname:
import os
import fnmatch

output_name = chooseName( subject, folderA, folderB )

# Create an incremental file name.
# If the name already exists, try with an incremental number (0, 1, etc.)
def chooseName( owner, dest_files, faces_files ):
    # find the min number available in both folders
    v1 = seekVersion_folderA( owner, dest_files )
    v2 = seekVersion_folderB( owner, faces_files )
    # select the max of those 2
    version = max( v1, v2 )
    # create name
    base = dest_files + os.sep + owner + "_"
    fname = base + str(version) + ".jpg"
    return fname
# Seek the next number available in folder A (max existing + 1)
def seekVersion_folderA( owner, dest_files ):
    # keep only this subject's files, e.g. "subjectA_3.jpg"
    res = [x for x in dest_files if fnmatch.fnmatch(x, owner + '_*.jpg')]
    # extract the incremental number from each name
    numbers = [int(x[x.find("_")+1:-len(".jpg")]) for x in res]
    if len( numbers ) == 0: return 0
    else: return max(numbers)+1
# Seek the next number available in folder B (max existing + 1)
def seekVersion_folderB( owner, faces_files ):
    # keep only this subject's pieces, e.g. "subjectA_3_0.jpg"
    res = [x for x in faces_files if fnmatch.fnmatch(x, owner + '_*_*.jpg')]
    # extract the first incremental number from each name
    numbers = [int(x[x.find("_")+1:x.rfind("_")]) for x in res]
    if len( numbers ) == 0: return 0
    else: return max(numbers)+1
It works, but this process takes about 10 seconds per image, and since I have a lot of images this is too inefficient.
Is there any workaround to make it faster?

As specified, this is indeed a hard problem with no magic shortcuts. In order to find the minimum available number you need to use trial and error, exactly as you are doing. Whilst the implementation could be sped up, there is a fundamental limitation in the algorithm.
I think I would relax the constraints to the problem a little. I would be prepared to choose numbers that weren't the minimum available. I would store a hidden file in the directory which contained the last number used when creating a file. Every time you come to create another one, read this number from the file, increment it by 1, and see if that name is available. If so you are good to go, if not, start counting up from there. Remember to update the file when you do settle on a name.
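A minimal sketch of that idea (the hidden file name and the helper are my own choices, and keeping one counter per subject is an assumption, since the question numbers files per subject):

import os

def next_name(owner, folder):
    # Hidden file that remembers the last number used for this subject.
    counter_file = os.path.join(folder, "." + owner + "_last")
    last = -1
    if os.path.exists(counter_file):
        with open(counter_file) as f:
            last = int(f.read())
    version = last + 1
    # Count up from there until we find a free name.
    while os.path.exists(os.path.join(folder, "%s_%d.jpg" % (owner, version))):
        version += 1
    with open(counter_file, "w") as f:
        f.write(str(version))
    return os.path.join(folder, "%s_%d.jpg" % (owner, version))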
If no humans are reading these names, then you may be better off using randomly generated names.
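With the standard library's uuid module this is essentially a one-liner (a sketch, not the answerer's code):

import uuid

owner = "subjectA"  # the subject prefix from the question
fname = owner + "_" + uuid.uuid4().hex + ".jpg"
print(fname)  # e.g. subjectA_3f2c...e1.jpg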

I've found another solution: use the hash of the file as a unique file name.
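A sketch of that idea with hashlib (the variable names and the choice of SHA-1 are mine, not from the original answer):

import hashlib
import os

folderA = "A"                  # destination folder from the question
downloaded = "incoming.jpg"    # hypothetical freshly downloaded file
owner = "subjectA"

with open(downloaded, "rb") as f:
    digest = hashlib.sha1(f.read()).hexdigest()

# Identical downloads collapse to the same name, which may even be desirable.
fname = os.path.join(folderA, owner + "_" + digest + ".jpg")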

Related

Load images into a list in a specific order

I have a folder with images that follow this name scheme:
FEATURED_0.png
FEATURED_1.png
FEATURED_2.png
When I want to load them into a list usable by Pillow I do this:
for filename in glob.glob("FEATURED*.png"):
    image = Image.open(filename)
    featured_list.append(image)
This works pretty OK, but it has a flaw: it loads the images in whatever order glob returns them, regardless of their number.
I already tried loading FEATURED_0 first no matter what, moving it into a list, and then checking whether the next image has a number greater by 1, but that failed miserably.
So back to my question: is there a way to load images into a list in Python, but in a specific order?
glob doesn't have an option to sort filenames. os.listdir() and os.walk() can't sort them either.
You have to use sorted() for this:
for filename in sorted(glob.glob("FEATURED*.png")):
But sorted() without parameters will put _10 before _2 because it compares strings char by char: first it compares _ with _, then 1 with 2, and skips the rest.
You may need to use sorted(..., key=your_function) with a your_function that extracts the number from the filename and converts it to an integer - sorted will then compare only this integer.
def number(filename):
    return int(filename[9:-4])

data = ['FEATURED_0.png', 'FEATURED_1.png', 'FEATURED_2.png', 'FEATURED_10.png']

print(list(sorted(data)))              # `_10` before `_2`
print(list(sorted(data, key=number)))  # `_2` before `_10`
Result
# `_10` before `_2`
['FEATURED_0.png', 'FEATURED_1.png', 'FEATURED_10.png', 'FEATURED_2.png']
# `_2` before `_10`
['FEATURED_0.png', 'FEATURED_1.png', 'FEATURED_2.png', 'FEATURED_10.png']
def number(filename):
    return int(filename[9:-4])

for filename in sorted(glob.glob("FEATURED*.png"), key=number):
    # ... code ...
You may also write it with a lambda:
sorted( data, key=(lambda x:int(x[9:-4])) )
You can create a function which returns something more complex - e.g. a tuple (extension, number) - so that you first get all .jpg files (sorted by number) and then all .png files (sorted by number).
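For example, a key of that shape might look like this (my own sketch of the idea, not from the original answer):

def ext_number(filename):
    name, ext = filename.rsplit('.', 1)
    return (ext, int(name.split('_')[1]))

data = ['A_2.png', 'A_10.jpg', 'A_1.png', 'A_3.jpg']
print(sorted(data, key=ext_number))
# ['A_3.jpg', 'A_10.jpg', 'A_1.png', 'A_2.png']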
EDIT:
If you have filenames with different words but they all have the structure word_number.extension, then you can use split("_") to get the word and split(".") to get the number and extension. You may then need to use the tuple (word, int(number)) to sort first by word and then by number.
def number(filename):
    word, rest = filename.split('_')
    number, ext = rest.split('.')
    return (word, int(number))

data = [
    'FEATURED_0.png', 'TEST_10.png',
    'FEATURED_1.png', 'TEST_1.png',
    'ANOTHER_10.png', 'FEATURED_2.png',
    'ANOTHER_1.png', 'FEATURED_10.png',
]

print(list(sorted(data)))              # `_10` before `_2`
print(list(sorted(data, key=number)))  # `_2` before `_10`
If you want to start with 0, load each subsequent number, and stop when you hit a number that's not there, use a while loop:
num = 0
base = "FEATURED_%d.png"
while os.path.isfile(base % num):
    featured_list.append(Image.open(base % num))
    num += 1

Memory issues with a list of lists [closed]

I am having some memory issues and I am wondering if there is any way I can free up some memory in the code below. I have tried using a generator expression rather than a list comprehension, but then the uniqueness check no longer works, since a generator can only be consumed once.
The list of lists (combinations) causes me to run out of memory and the program does not finish.
The end result would be 729 lists in this list, with each list containing 6 WindowsPath elements that point to images. I have tried storing the lists as strings in a text file, but I could not get that to work; I tried using a pandas DataFrame, but I could not get that to work either.
I need to figure out a different solution. The output right now is exactly what I need; the memory is the only issue.
from pathlib import Path
from random import choice
from itertools import product
from PIL import Image
import sys

def combine(arr):
    return list(product(*arr))

def generate(x):
    #set new value for name
    name = int(x)
    #Turn name into string for file name
    img_name = str(name)
    #Pick 1 random from each directory, add to list.
    a_paths = [choice(k) for k in layers]
    #if the length of the list of unique combinations is equal to the number of total combinations, this function stops
    if len(combinations) == len(combine(layers)):
        print("Done")
        sys.exit()
    else:
        #If combination exists, generate new list
        if any(j == a_paths for j in combinations) == True:
            print("Redo")
            generate(name)
        #Else, initialize new image, paste layers + save image, add combination to list, and generate new list
        else:
            #initialize image
            img = Image.new("RGBA", (648, 648))
            png_info = img.info
            #For each path in the list, paste on top of previous, sets image to be saved
            for path in a_paths:
                layer = Image.open(str(path), "r")
                img.paste(layer, (0, 0), layer)
            print(str(name) + ' - Unique')
            img.save(img_name + '.png', **png_info)
            combinations.append(a_paths)
            name = name - 1
            generate(name)

'''
Main method
'''
global layers
layers = [list(Path(directory).glob("*.png")) for directory in ("dir1/", "dir2/", "dir3/", "dir4/", "dir5/", "dir6/")]
#name will dictate the name of the file output(.png image) it is equal to the number of combinations of the image layers
global name
name = len(combine(layers))
#combinations is the list of lists that will store all unique combinations of images
global combinations
combinations = []
#calling recursive function
generate(name)
Let's start with an MRE version of your code (i.e. something that I can run without needing a bunch of PNGs -- all we're concerned with here is how to go through the images without hitting recursion limits):
from random import choice
from itertools import product

def combine(arr):
    return list(product(*arr))

def generate(x):
    # set new value for name
    name = int(x)
    # Turn name into string for file name
    img_name = str(name)
    # Pick 1 random from each directory, add to list.
    a_paths = [choice(k) for k in layers]
    # if the length of the list of unique combinations is equal to the number of total combinations, this function stops
    if len(combinations) == len(combine(layers)):
        print("Done")
        return
    else:
        # If combination exists, generate new list
        if any(j == a_paths for j in combinations) == True:
            print("Redo")
            generate(name)
        # Else, initialize new image, paste layers + save image, add combination to list, and generate new list
        else:
            # initialize image
            img = []
            # For each path in the list, paste on top of previous, sets image to be saved
            for path in a_paths:
                img.append(path)
            print(str(name) + ' - Unique')
            print(img_name + '.png', img)
            combinations.append(a_paths)
            name = name - 1
            generate(name)

'''
Main method
'''
global layers
layers = [
    [f"{d}{f}.png" for f in ("foo", "bar", "baz", "ola", "qux")]
    for d in ("dir1/", "dir2/", "dir3/", "dir4/", "dir5/", "dir6/")
]
# name will dictate the name of the file output(.png image) it is equal to the number of combinations of the image layers
global name
name = len(combine(layers))
# combinations is the list of lists that will store all unique combinations of images
global combinations
combinations = []
# calling recursive function
generate(name)
When I run this I get some output that starts with:
15625 - Unique
15625.png ['dir1/qux.png', 'dir2/bar.png', 'dir3/bar.png', 'dir4/foo.png', 'dir5/baz.png', 'dir6/foo.png']
15624 - Unique
15624.png ['dir1/baz.png', 'dir2/qux.png', 'dir3/foo.png', 'dir4/foo.png', 'dir5/foo.png', 'dir6/foo.png']
15623 - Unique
15623.png ['dir1/ola.png', 'dir2/qux.png', 'dir3/bar.png', 'dir4/ola.png', 'dir5/ola.png', 'dir6/bar.png']
...
and ends with a RecursionError. I assume this is what you mean when you say you "ran out of memory" -- in reality it doesn't seem like I'm anywhere close to running out of memory (maybe this would behave differently if I had actual images?), but Python's stack depth is finite and this function seems to be recursing into itself arbitrarily deep for no particularly good reason.
Since you're trying to eventually generate all the possible combinations, you already have a perfectly good solution, which you're even already using -- itertools.product. All you have to do is iterate through the combinations that it gives you. You don't need recursion and you don't need global variables.
from itertools import product
from typing import List

def generate(layers: List[List[str]]) -> None:
    for name, a_paths in enumerate(product(*layers), 1):
        # initialize image
        img = []
        # For each path in the list, paste on top of previous,
        # sets image to be saved
        for path in a_paths:
            img.append(path)
        print(f"{name} - Unique")
        print(f"{name}.png", img)
    print("Done")

'''
Main method
'''
layers = [
    [f"{d}{f}.png" for f in ("foo", "bar", "baz", "ola", "qux")]
    for d in ("dir1/", "dir2/", "dir3/", "dir4/", "dir5/", "dir6/")
]
# calling iterative function
generate(layers)
Now we get all of the combinations -- the naming starts at 1 and goes all the way to 15625:
1 - Unique
1.png ['dir1/foo.png', 'dir2/foo.png', 'dir3/foo.png', 'dir4/foo.png', 'dir5/foo.png', 'dir6/foo.png']
2 - Unique
2.png ['dir1/foo.png', 'dir2/foo.png', 'dir3/foo.png', 'dir4/foo.png', 'dir5/foo.png', 'dir6/bar.png']
3 - Unique
3.png ['dir1/foo.png', 'dir2/foo.png', 'dir3/foo.png', 'dir4/foo.png', 'dir5/foo.png', 'dir6/baz.png']
...
15623 - Unique
15623.png ['dir1/qux.png', 'dir2/qux.png', 'dir3/qux.png', 'dir4/qux.png', 'dir5/qux.png', 'dir6/baz.png']
15624 - Unique
15624.png ['dir1/qux.png', 'dir2/qux.png', 'dir3/qux.png', 'dir4/qux.png', 'dir5/qux.png', 'dir6/ola.png']
15625 - Unique
15625.png ['dir1/qux.png', 'dir2/qux.png', 'dir3/qux.png', 'dir4/qux.png', 'dir5/qux.png', 'dir6/qux.png']
Done
Replacing the actual image-generating code back into my mocked-out version is left as an exercise for the reader.
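For completeness, here is a sketch of how the original PIL code might slot back in (untested, since it needs the actual layer PNGs on disk; the directory names are the ones from the question):

from itertools import product
from pathlib import Path
from typing import List

from PIL import Image

def generate(layers: List[List[Path]]) -> None:
    for name, a_paths in enumerate(product(*layers), 1):
        # initialize image
        img = Image.new("RGBA", (648, 648))
        png_info = img.info
        # For each path in the list, paste on top of the previous layer
        for path in a_paths:
            layer = Image.open(str(path), "r")
            img.paste(layer, (0, 0), layer)
        img.save(f"{name}.png", **png_info)
    print("Done")

layers = [list(Path(d).glob("*.png")) for d in
          ("dir1/", "dir2/", "dir3/", "dir4/", "dir5/", "dir6/")]
generate(layers)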
If you wanted to randomize the order of the combinations, it'd be pretty reasonable to do:
from random import shuffle

...

combinations = list(product(*layers))
shuffle(combinations)
for name, a_paths in enumerate(combinations, 1):
    ...
This uses more memory (since now you're building a list of the product instead of iterating through a generator), but the number of images you're working with isn't actually that large, so this is fine as long as you aren't adding a level of recursion for each image.

How to use multiprocessing for a 3-argument function in Python

I have 3 types of files, with around 500 files of each type. I have to give these files to a function. How can I use multiprocessing for this?
The files are rgb_image: 15.png, 16.png, 17.png ...; depth_img: 15.png, 16.png, 17.png ...; and mat: 15.mat, 16.mat, 17.mat ... I have to use the 3 matching files (the rgb 15.png, the depth 15.png, and 15.mat) as arguments to the function. The starting names of the files can vary, but they follow this format.
The code is as follows:
import os
import re

import pandas as pd

# (excerpt: color_lists, depth_lists, mat_lists, writer and wb are defined elsewhere)

def depth_rgb_registration(rgb, depth, mat):
    # the required operation is performed here;
    # gait_list1 (a list) is the output of this function
    ...

def display_fun(mat, selected_depth, selected_color, excel):
    for idx, color_img in enumerate(color_lists):
        for i in range(len(depth_lists)):
            if color_img.split('.')[0] == depth_lists[i].split('.')[0]:
                rgb = os.path.join(selected_color, color_img)
                depth = os.path.join(selected_depth, sorted(depth_lists)[i])
                m = sorted(mat_lists)[idx]
                mat2 = os.path.join(mat, m)
                abc = color_img.split('.')[0]
                gait_list1 = []
                fnum = int("".join([str(i) for i in re.findall("(\d+)", abc)]))
                gait_list1.append(fnum)
                depth_rgb_registration(rgb, depth, mat2)
                gait_list2.append(gait_list1)  # Output gait_list1 from the function above
    data1 = pd.DataFrame(gait_list2)
    data1.to_excel(writer, index=False)
    wb.save(excel)
In the above code, display_fun is the main function, which is called from the other code.
In this function, color_img, depth_img, and mat are three different types of files from the folders. These three files are given as arguments to the depth_rgb_registration function. In that function, some required values are stored in gait_list1, which is then stored in an Excel file for every set of files.
The loop above works, but it takes around 20-30 minutes to run depending on the number of files.
So I wanted to use multiprocessing to reduce the overall time.
I tried multiprocessing after looking at some examples, but I am not able to understand how I can give these 3 files as arguments. I know using a dictionary here is not correct, which I have done below, but what could be an alternative?
Even if it is asynchronous multiprocessing, that is fine. I even thought of using a GPU to run the function, but as I read, extra time would go into loading the data onto the GPU. Any suggestions?
def display_fun2(mat, selected_depth, selected_color, results, excel):
    path3 = selected_depth
    path4 = selected_color
    path5 = mat
    rgb_depth_pairs = defaultdict(list)
    for rgb in path4.iterdir():
        rgb_depth_pairs[rgb.stem].append(rgb)
    included_extensions = ['png']
    images = [fn for ext in included_extensions for fn in path3.glob(f'*.{ext}')]
    for image in images:
        rgb_depth_pairs[image.stem].append(image)
    for mat in path5.iterdir():
        rgb_depth_pairs[mat.stem].append(mat)
    rgb_depth_pairs = [item for item in rgb_depth_pairs.items() if len(item) == 3]
    with Pool() as p:
        p.starmap_async(process_pairs, rgb_depth_pairs)
    gait_list2.append(gait_list1)
    data1 = pd.DataFrame(gait_list2)
    data1.to_excel(writer, index=False)
    wb.save(excel)

def depth_rgb_registration(rgb, depth, mat):
    # required operation for one set of files
    ...
I did not look at the code in detail (it was too long), but provided that the combinations of arguments to be sent to your 3-argument function can be evaluated independently (outside of the function itself), you can simply use Pool.starmap:
For example:
from multiprocessing import Pool

def myfunc(a, b, c):
    return 100*a + 10*b + c

myargs = [(2,3,1), (1,2,4), (5,3,2), (4,6,1), (1,3,8), (3,4,1)]

p = Pool(2)
print(p.starmap(myfunc, myargs))
returns:
[231, 124, 532, 461, 138, 341]
Alternatively, if your function can be recast as a function which accepts a single argument (the tuple) and expands from this into the separate variables that it needs, then you can use Pool.map:
def myfunc(t):
    a, b, c = t  # unpack the tuple and carry on
    return 100*a + 10*b + c

...

print(p.map(myfunc, myargs))
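Applied to the question's setup, a sketch might look like the following (the directory names come from the question, but the pairing helper and the idea of returning the gait list from the worker are my own assumptions; the key point is building the argument triples up front and letting starmap fan them out):

import os
from multiprocessing import Pool

def depth_rgb_registration(rgb, depth, mat):
    # placeholder for the asker's real processing; assumed here to
    # return the gait list for one set of files
    return [rgb, depth, mat]

def build_args(rgb_dir, depth_dir, mat_dir):
    # pair files across the three folders by their shared stem, e.g. "15"
    triples = []
    for fname in sorted(os.listdir(rgb_dir)):
        stem = os.path.splitext(fname)[0]
        depth = os.path.join(depth_dir, stem + ".png")
        mat = os.path.join(mat_dir, stem + ".mat")
        if os.path.exists(depth) and os.path.exists(mat):
            triples.append((os.path.join(rgb_dir, fname), depth, mat))
    return triples

if __name__ == "__main__":
    myargs = build_args("rgb_image/", "depth_img/", "mat/")
    with Pool() as p:
        gait_list2 = p.starmap(depth_rgb_registration, myargs)
    # gait_list2 now holds one result per file triple, in input order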

What data structure is good for maintaining file paths?

I'm working on the "Longest Absolute File Path" problem on LeetCode. This is a simple problem that asks: what is the length of the longest absolute file path in a given directory? My working solution is as follows; the file directory is given as a string.
def lengthLongestPath(self, input):
    """
    :type input: str, the file directory
    :rtype: int
    """
    current_folder_path = [""] * 40
    longest_file_path_size = 0
    for item in input.split("\n"):
        num_tabs = item.count("\t")
        print(num_tabs)
        if "." not in item:
            current_folder_path[num_tabs] = item.lstrip("\t")
        else:
            absolute_file_path = "/".join(current_folder_path[:num_tabs] + [item.lstrip("\t")])
            print(item)
            print(num_tabs, absolute_file_path, current_folder_path)
            longest_file_path_size = max(len(absolute_file_path), longest_file_path_size)
    return longest_file_path_size
This works. However, note that the line current_folder_path = [""] * 40 is very inelegant. It is there to remember the current file path. I wonder if there is a way to remove it.
The problem statement does not address some fine points. It is very unclear what path may correspond to the string
a\n\t\tb
Is it a//b or plain illegal? If the former, do we need to normalize it?
I guess it is safe to assume that such paths are illegal. In other words, the path depth only grows by 1, and current_folder_path in fact functions like a stack. You don't need to preinitialize it: just push the name when num_tabs exceeds its size, and pop as necessary.
As a side note, since join is linear in the current accumulated length, the entire algorithm seems quadratic, which violates the time complexity requirement.
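A sketch of the stack-based version (my own illustration of the suggestion above; it tracks accumulated prefix lengths instead of joining strings, which also addresses the quadratic-join side note):

def lengthLongestPath(input):
    # stack[i] = length of the path prefix at depth i, including slashes
    stack = []
    longest = 0
    for item in input.split("\n"):
        depth = item.count("\t")
        name = item.lstrip("\t")
        # pop back to the parent of this entry
        del stack[depth:]
        prefix = stack[-1] if stack else 0
        if "." in name:  # a file: candidate for the answer
            longest = max(longest, prefix + len(name))
        else:            # a directory: push its prefix length (+1 for "/")
            stack.append(prefix + len(name) + 1)
    return longest

# Example: "dir\n\tsubdir\n\t\tfile.ext" -> len("dir/subdir/file.ext") == 19
print(lengthLongestPath("dir\n\tsubdir\n\t\tfile.ext"))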

Data stored in directories based on date. How to get all data from date1 until date2?

I have a Python app that will be storing a large amount of data as custom Python objects through jsonpickle.
Currently my project has a data directory with this structure:
data/
    year/
        ...
        month/
            ...
            day/
                ...
                A/
                    data_file_1
                    ...
                    data_file_n
                B/
                    data_file_1
                    ...
                    data_file_n
Here I was just representing multiple potential dirs or files as '...'.
I would like my user to be able to specify a start date and an end date, from which I will parse all the data.
Currently my data set is quite small, so moving it around is no problem. Furthermore, the data doesn't need to be human readable, so the format doesn't really matter at this stage.
Is there an easier way to store this data so that I can get it whenever I need and update it when necessary? Another library, package, or just a better directory layout?
If not, then my question is a bit more specific.
My solution so far has been:
import os
...

def get_data(path, dates, data):
    """
    :param path: str representing the current path being searched.
    :param dates: list of tuples representing the (min, max) dates to be considered.
    :param data: empty list, used for collecting the data of the files.
    """
    if len(dates) <= 0:
        # This case occurs when the subdirectories under the current path
        # don't have any constraints, so I can grab all the files without worry.
        for dirpath, dirnames, filenames in os.walk(path):
            for file in filenames:
                data.append(dirpath + '/' + file)
    else:
        min, max = dates[0]
        dirpath, dirs, files = next(os.walk(path))
        for dir in dirs:
            value = int(dir)
            if value > min and value < max:
                # unconstrained case
                get_data(path + '/' + dir, [], data)
            elif value == min:
                # TODO recurse with boundary case minimum
                pass
            elif value == max:
                # TODO recurse with boundary case maximum
                pass
However, these boundary cases have stumped me. Say I am given some arbitrarily determined dates, for example:
# from 8/21/2011 until 12/7/2014
dates = [(2011, 2014), (8, 12), (1, 7)]
The problem is then: how should I set up the dates to be passed into the recursive method in the boundary cases?
Am I missing a simple solution to this problem?
Use datetime objects to represent the dates, increment with day deltas.
Pseudocode:
import os.path
import datetime

def f(t0, t1):
    t = t0
    while t != t1:
        # year/month/day components must be strings for os.path.join
        day_path = os.path.join(str(t.year), str(t.month), str(t.day))
        if os.path.exists(day_path):
            # Do what you need inside the day dir.
            ...
        t += datetime.timedelta(days=1)
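For example, to cover the question's 8/21/2011 until 12/7/2014 range (a usage sketch; I'm assuming f is called from inside the data/ directory so the relative year/month/day paths resolve):

import datetime

start = datetime.date(2011, 8, 21)
end = datetime.date(2014, 12, 7)
# t1 is exclusive in f's loop, so step one day past the end date
f(start, end + datetime.timedelta(days=1))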
