Memory issues with a list of lists [closed] - python

I am running into memory issues and I am wondering if there is any way I can free up some memory in the code below. I have tried using a generator expression rather than a list comprehension, but then I cannot check for unique combinations, since the already-produced values are gone once their memory is freed.
The list of lists (combinations) causes me to run out of memory and the program does not finish.
The end result would be 729 lists in this list, with each list containing 6 WindowsPath elements that point to images. I have tried storing the lists as strings in a text file, and I have tried a pandas DataFrame, but I could not get either approach to work.
I need to figure out a different solution. The output right now is exactly what I need; memory is the only issue.
from pathlib import Path
from random import choice
from itertools import product
from PIL import Image
import sys

def combine(arr):
    return list(product(*arr))

def generate(x):
    # set new value for name
    name = int(x)
    # Turn name into string for file name
    img_name = str(name)
    # Pick 1 random from each directory, add to list.
    a_paths = [choice(k) for k in layers]
    # if the length of the list of unique combinations is equal to the number of total combinations, this function stops
    if len(combinations) == len(combine(layers)):
        print("Done")
        sys.exit()
    else:
        # If combination exists, generate new list
        if any(j == a_paths for j in combinations) == True:
            print("Redo")
            generate(name)
        # Else, initialize new image, paste layers + save image, add combination to list, and generate new list
        else:
            # initialize image
            img = Image.new("RGBA", (648, 648))
            png_info = img.info
            # For each path in the list, paste on top of previous, sets image to be saved
            for path in a_paths:
                layer = Image.open(str(path), "r")
                img.paste(layer, (0, 0), layer)
            print(str(name) + ' - Unique')
            img.save(img_name + '.png', **png_info)
            combinations.append(a_paths)
            name = name - 1
            generate(name)

'''
Main method
'''
global layers
layers = [list(Path(directory).glob("*.png")) for directory in ("dir1/", "dir2/", "dir3/", "dir4/", "dir5/", "dir6/")]
# name will dictate the name of the file output (.png image); it is equal to the number of combinations of the image layers
global name
name = len(combine(layers))
# combinations is the list of lists that will store all unique combinations of images
global combinations
combinations = []
# calling recursive function
generate(name)

Let's start with an MRE (minimal reproducible example) version of your code (i.e. something that I can run without needing a bunch of PNGs -- all we're concerned with here is how to go through the images without hitting recursion limits):
from random import choice
from itertools import product

def combine(arr):
    return list(product(*arr))

def generate(x):
    # set new value for name
    name = int(x)
    # Turn name into string for file name
    img_name = str(name)
    # Pick 1 random from each directory, add to list.
    a_paths = [choice(k) for k in layers]
    # if the length of the list of unique combinations is equal to the number of total combinations, this function stops
    if len(combinations) == len(combine(layers)):
        print("Done")
        return
    else:
        # If combination exists, generate new list
        if any(j == a_paths for j in combinations) == True:
            print("Redo")
            generate(name)
        # Else, initialize new image, paste layers + save image, add combination to list, and generate new list
        else:
            # initialize image
            img = []
            # For each path in the list, paste on top of previous, sets image to be saved
            for path in a_paths:
                img.append(path)
            print(str(name) + ' - Unique')
            print(img_name + '.png', img)
            combinations.append(a_paths)
            name = name - 1
            generate(name)

'''
Main method
'''
global layers
layers = [
    [f"{d}{f}.png" for f in ("foo", "bar", "baz", "ola", "qux")]
    for d in ("dir1/", "dir2/", "dir3/", "dir4/", "dir5/", "dir6/")
]
# name will dictate the name of the file output (.png image); it is equal to the number of combinations of the image layers
global name
name = len(combine(layers))
# combinations is the list of lists that will store all unique combinations of images
global combinations
combinations = []
# calling recursive function
generate(name)
When I run this I get some output that starts with:
15625 - Unique
15625.png ['dir1/qux.png', 'dir2/bar.png', 'dir3/bar.png', 'dir4/foo.png', 'dir5/baz.png', 'dir6/foo.png']
15624 - Unique
15624.png ['dir1/baz.png', 'dir2/qux.png', 'dir3/foo.png', 'dir4/foo.png', 'dir5/foo.png', 'dir6/foo.png']
15623 - Unique
15623.png ['dir1/ola.png', 'dir2/qux.png', 'dir3/bar.png', 'dir4/ola.png', 'dir5/ola.png', 'dir6/bar.png']
...
and ends with a RecursionError. I assume this is what you mean when you say you "ran out of memory" -- in reality it doesn't seem like I'm anywhere close to running out of memory (maybe this would behave differently if I had actual images?), but Python's stack depth is finite and this function seems to be recursing into itself arbitrarily deep for no particularly good reason.
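You can check that cap yourself; CPython's default recursion limit is about 1000 frames, and each nested generate() call consumes one:
import sys
print(sys.getrecursionlimit())  # typically 1000 in CPython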
Since you're trying to eventually generate all the possible combinations, you already have a perfectly good solution, which you're even already using -- itertools.product. All you have to do is iterate through the combinations that it gives you. You don't need recursion and you don't need global variables.
from itertools import product
from typing import List

def generate(layers: List[List[str]]) -> None:
    for name, a_paths in enumerate(product(*layers), 1):
        # initialize image
        img = []
        # For each path in the list, paste on top of previous,
        # sets image to be saved
        for path in a_paths:
            img.append(path)
        print(f"{name} - Unique")
        print(f"{name}.png", img)
    print("Done")

'''
Main method
'''
layers = [
    [f"{d}{f}.png" for f in ("foo", "bar", "baz", "ola", "qux")]
    for d in ("dir1/", "dir2/", "dir3/", "dir4/", "dir5/", "dir6/")
]
# calling iterative function
generate(layers)
Now we get all of the combinations -- the naming starts at 1 and goes all the way to 15625:
1 - Unique
1.png ['dir1/foo.png', 'dir2/foo.png', 'dir3/foo.png', 'dir4/foo.png', 'dir5/foo.png', 'dir6/foo.png']
2 - Unique
2.png ['dir1/foo.png', 'dir2/foo.png', 'dir3/foo.png', 'dir4/foo.png', 'dir5/foo.png', 'dir6/bar.png']
3 - Unique
3.png ['dir1/foo.png', 'dir2/foo.png', 'dir3/foo.png', 'dir4/foo.png', 'dir5/foo.png', 'dir6/baz.png']
...
15623 - Unique
15623.png ['dir1/qux.png', 'dir2/qux.png', 'dir3/qux.png', 'dir4/qux.png', 'dir5/qux.png', 'dir6/baz.png']
15624 - Unique
15624.png ['dir1/qux.png', 'dir2/qux.png', 'dir3/qux.png', 'dir4/qux.png', 'dir5/qux.png', 'dir6/ola.png']
15625 - Unique
15625.png ['dir1/qux.png', 'dir2/qux.png', 'dir3/qux.png', 'dir4/qux.png', 'dir5/qux.png', 'dir6/qux.png']
Done
Replacing the actual image-generating code back into my mocked-out version is left as an exercise for the reader.
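For reference, here is roughly how that restoration might look (a sketch, untested against real image directories; the RGBA mode and 648x648 size are taken from the question):
from itertools import product
from pathlib import Path
from typing import List
from PIL import Image

def generate(layers: List[List[Path]]) -> None:
    for name, a_paths in enumerate(product(*layers), 1):
        # initialize a blank canvas
        img = Image.new("RGBA", (648, 648))
        # paste each layer on top of the previous one
        for path in a_paths:
            layer = Image.open(str(path), "r")
            img.paste(layer, (0, 0), layer)
        print(f"{name} - Unique")
        img.save(f"{name}.png")
    print("Done")

layers = [
    list(Path(directory).glob("*.png"))
    for directory in ("dir1/", "dir2/", "dir3/", "dir4/", "dir5/", "dir6/")
]
generate(layers)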
If you wanted to randomize the order of the combinations, it'd be pretty reasonable to do:
from random import shuffle
...
combinations = list(product(*layers))
shuffle(combinations)
for name, a_paths in enumerate(combinations, 1):
    ...
This uses more memory (since now you're building a list of the product instead of iterating through a generator), but the number of images you're working with isn't actually that large, so this is fine as long as you aren't adding a level of recursion for each image.
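Assembled with the mocked-out version above, that shuffled variant would look something like this (a sketch; the mock layers are the same as before):
from itertools import product
from random import shuffle
from typing import List

def generate(layers: List[List[str]]) -> None:
    # materialize the full product so it can be shuffled in place
    combinations = list(product(*layers))
    shuffle(combinations)
    for name, a_paths in enumerate(combinations, 1):
        print(f"{name} - Unique")
        print(f"{name}.png", list(a_paths))
    print("Done")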

Related

Load images into list in a specific order

I have a folder with images that follow this name scheme:
FEATURED_0.png
FEATURED_1.png
FEATURED_2.png
When I want to load them into a list usable by Pillow I do this:
for filename in glob.glob("FEATURED*.png"):
    image = Image.open(filename)
    featured_list.append(image)
This works pretty well, but has a flaw: it loads the images in whatever order glob returns them, regardless of their number.
I already tried loading FEATURED_0 first no matter what, moving it into a list, and then checking whether the next image's number was one greater, but that failed miserably.
So back to my question. Is there a way to load Images into a list in python, but in a specific order?
glob doesn't have an option to sort filenames; os.listdir() and os.walk() can't sort them either. You have to sort the list yourself with sorted():
for filename in sorted(glob.glob("FEATURED*.png")):
But sorted() without parameters will put _10 before _2, because it compares strings character by character: first it compares _ with _, then 1 with 2, and skips the rest.
You need to use sorted(..., key=your_function), where your_function extracts the number from the filename and converts it to an integer; sorted will then use only this integer to compare names.
def number(filename):
    return int(filename[9:-4])

data = ['FEATURED_0.png', 'FEATURED_1.png', 'FEATURED_2.png', 'FEATURED_10.png']

print(list(sorted(data)))             # `_10` before `_2`
print(list(sorted(data, key=number))) # `_2` before `_10`
Result
# `_10` before `_2`
['FEATURED_0.png', 'FEATURED_1.png', 'FEATURED_10.png', 'FEATURED_2.png']
# `_2` before `_10`
['FEATURED_0.png', 'FEATURED_1.png', 'FEATURED_2.png', 'FEATURED_10.png']
def number(filename):
    return int(filename[9:-4])

for filename in sorted(glob.glob("FEATURED*.png"), key=number):
    # ... code ...
You may also write it with a lambda:
sorted(data, key=lambda x: int(x[9:-4]))
You can also create a function which returns something more complex, e.g. a tuple (extension, number); then you get all the .jpg files first (sorted by number) and all the .png files next (sorted by number), as in the sketch below.
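A minimal sketch of that idea (the helper name ext_number is mine, not from the original answer):
def ext_number(filename):
    # 'FEATURED_10.png' -> ('png', 10): group by extension, then sort numerically
    name, ext = filename.rsplit('.', 1)
    return (ext, int(name.rsplit('_', 1)[1]))

data = ['FEATURED_2.png', 'FEATURED_10.jpg', 'FEATURED_1.png', 'FEATURED_2.jpg']
print(sorted(data, key=ext_number))
# ['FEATURED_2.jpg', 'FEATURED_10.jpg', 'FEATURED_1.png', 'FEATURED_2.png']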
EDIT:
If you have filenames with different words but all with the structure word_number.extension, then you can use split('_') to get the word and split('.') to get the number and extension. You may then need to sort by the tuple (word, int(number)) to sort first by word and next by number.
def number(filename):
    word, rest = filename.split('_')
    number, ext = rest.split('.')
    return (word, int(number))

data = [
    'FEATURED_0.png', 'TEST_10.png',
    'FEATURED_1.png', 'TEST_1.png',
    'ANOTHER_10.png', 'FEATURED_2.png',
    'ANOTHER_1.png', 'FEATURED_10.png',
]

print(list(sorted(data)))             # `_10` before `_2`
print(list(sorted(data, key=number))) # `_2` before `_10`
If you want to start with 0, load each subsequent number, and stop when you hit a number that's not there, use a while loop:
num = 0
base = "FEATURED_%d.png"
while os.path.isfile(base % num):
    featured_list.append(Image.open(base % num))
    num += 1

Parse list of strings for speed

Background
I have a function called get_player_path that takes in a list of strings, player_file_list, and an int value, total_players. For the sake of example I have reduced the list of strings and also set the int value to a very small number.
Each string in player_file_list has the form year-date/player_id/some_random_file.file_extension or
year-date/player_id/IDATs/some_random_number/some_random_file.file_extension
Issue
What I am essentially trying to achieve here is to go through this list and store every unique year-date/player_id path in a set until its length reaches the value of total_players.
My current approach does not seem the most efficient to me and I am wondering if I can speed up my function get_player_path in any way.
Code
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_file = player_file.split("/")
        file_path = f"{player_file[0]}/{player_file[1]}/"
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)

player_file_list = [
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]

print(get_player_path(player_file_list, 2))
Output
['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']
Let's analyze your function first:
your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.
Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.
Your approach is fine if you're just writing a Python script. If you really needed performance with huge lists, you would probably be using some other language. Nonetheless, if you really care about performance (or just want to learn new stuff), you could try one of the following approaches:
replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice (see the sketch below).
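For illustration, a minimal dict-based trie sketch (the helper names are mine; each node's children are visited in order, so the traversal yields the stored paths already sorted):
def trie_insert(trie, word):
    # one nested dict per character; '' marks the end of a stored word
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node[''] = True

def trie_walk(node, prefix=''):
    # visiting each node's few keys in order yields the words in sorted order
    for ch in sorted(node):
        if ch == '':
            yield prefix
        else:
            yield from trie_walk(node[ch], prefix + ch)

def get_player_path(player_file_list, total_players):
    trie, seen = {}, set()
    for player_file in player_file_list:
        parts = player_file.split("/")
        file_path = f"{parts[0]}/{parts[1]}/"
        if file_path not in seen:
            seen.add(file_path)
            trie_insert(trie, file_path)
            if len(seen) == total_players:
                break
    return list(trie_walk(trie))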
Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)  # <--- len(date) + len('/') + 1
        file_path = player_file[:end]    # <---
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
    ...
EDIT: fixed the LENGTH initialization, I had forgotten to add 1
I'll leave this solution here, which can be further improved; hope it helps.
player_file_list = (
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)

def get_player_path(l, n):
    pfl = set()
    for i in l:
        i = "/".join(i.split("/")[0:2])
        if i not in pfl:
            pfl.add(i)
            if len(pfl) == n:
                return pfl
    if n > len(pfl):
        print("not enough matches")
        return

print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}
Use a dict so that you don't have to sort; since your list is already sorted and dicts preserve insertion order (Python 3.7+), the keys come out in order. If you still need to sort, you can always use sorted in the return statement. Add import re and replace your function as follows:
def get_player_path(player_file_list, total_players):
    dct = {re.search(r'^\w+-\w+-\w+/\w+', pf).group(): 1 for pf in player_file_list}
    return [k for i, k in enumerate(dct.keys()) if i < total_players]
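Applied to the question's list, this should print the first two unique date/id prefixes (note there is no trailing slash here, unlike the original output):
print(get_player_path(player_file_list, 2))
# ['2020-10-27/31001804320549', '2020-10-28/31001804320548']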

How to convert a multidimensional array to a two-dimensional array

My code fetches values from text files and creates matrices as a multidimensional array, but the problem is that the code creates an array with more than two dimensions, which I can't manipulate. I need a two-dimensional array; how do I do that?
Explanation of my code's algorithm:
Goal of the code:
My code fetches values from a specific folder. Each folder contains 7 txt files generated from one user; in this way, multiple folders contain data from multiple users.
Step 1: Start the first for loop, controlled by the number of folders, and store the path of the current folder in the variable 'path'.
Step 2: Open the path and fetch the data of the 7 txt files using a second for loop. After fetching, close the second loop and execute the rest of the code.
Step 3: Concatenate the data of the 7 txt files into one 1D array.
Step 4 (here the problem arises): Store the 1D array of each folder in a 2D array. End the first for loop.
Code:
import numpy as np
import os

f_path = 'Result'
array_control_var = 0
# fetch directory paths
for (path, dirs, file) in os.walk(f_path):
    if path == f_path:
        continue
    f_path_1 = path + '\\page_1.txt'
    # Get data from page_1 individually because it contains string-type data
    pgno_1 = np.array(np.loadtxt(f_path_1, dtype='U', delimiter=','))
    # only for page_2.txt
    f_path_2 = path + '\\page_2.txt'
    with open(f_path_2) as f:
        str_arr = ','.join([l.strip() for l in f])
    pgno_2 = np.asarray(str_arr.split(','), dtype=int)
    # using a loop, fetch data from the remaining text files (data type = int)
    for j in range(3, 8):
        # store the file path in a variable
        txt_file_path = path + '\\page_' + str(j) + '.txt'
        if os.path.exists(txt_file_path):
            # generate a variable name that auto-increments with the loop
            foo = 'pgno_' + str(j)
        else:
            break
        # pass the variable name as a string and store the value
        exec(foo + " = np.array(np.loadtxt(txt_file_path, dtype='i', delimiter=','))")
    # z = np.array([pgno_2, pgno_3, pgno_4, pgno_5, pgno_6, pgno_7])
    # merge all arrays from page 2 onward into a single one-dimensional array
    f_array = np.concatenate((pgno_2, pgno_3, pgno_4, pgno_5, pgno_6, pgno_7), axis=0)
    # on the first iteration of the loop, assign this value directly
    if array_control_var == 0:
        main_f_array = f_array
    else:
        # here the problem arises
        main_f_array = np.array([main_f_array, f_array])
    array_control_var += 1
print(main_f_array)
Currently my code generates an array like this (for 3 folders):
[
array([[0,0,0],[0,0,0]]),
array([0,0,0])
]
Note: I don't know how many dimensions it has.
But I want:
[
array(
[0,0,0]
[0,0,0]
[0,0,0])
]
I tried to write recursive code that flattens the list of lists into one list. It gives the desired output for your case, but I did not try it for many other inputs (and it is buggy for certain cases, such as list = [0, [[0, 0, 0], [0, 0, 0]], [0, 0, 0]])...
flat = []

def main():
    list = [[[0, 0, 0], [0, 0, 0]], [0, 0, 0]]
    recFlat(list)
    print(flat)

def recFlat(Lists):
    if len(Lists) == 0:
        return Lists
    head, tail = Lists[0], Lists[1:]
    if isinstance(head, (list,)):
        recFlat(head)
        return recFlat(tail)
    else:
        return flat.append(Lists)

if __name__ == '__main__':
    main()
My idea behind the code was to traverse the head of each list, and check whether it is an instance of a list or an element. If the head is an element, this means I have a flat list and I can return the list. Else, I should recursively traverse more.
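As an aside for the original numpy use case: an alternative sketch that avoids the nesting entirely is to collect each folder's 1D array in a plain Python list and stack once after the loop (this assumes every f_array has the same length):
import numpy as np

rows = []
for f_array in ([0, 0, 0], [0, 0, 0], [0, 0, 0]):  # stand-ins for each folder's 1D array
    rows.append(f_array)

# np.vstack builds one 2D array of shape (number_of_folders, row_length)
main_f_array = np.vstack(rows)
print(main_f_array)
# [[0 0 0]
#  [0 0 0]
#  [0 0 0]]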

Check if files in dir are the same

I have a folder of 5000+ images in jpeg/png etc. How can I check if any of the images are the same? The images were collected through web scraping and have been sequentially renamed, so I cannot compare file names.
I am currently checking if the hashes are the same however this is a very long process. I am currently using:
def sameIm(file_name1, file_name2):
    hash = imagehash.average_hash(Image.open(path + file_name1))
    otherhash = imagehash.average_hash(Image.open(path + file_name2))
    return (hash == otherhash)
Then I use nested loops. Comparing 1 image to 5000+ others takes about 5 minutes, so comparing each to each would take days to compute.
Is there a faster way to do this in Python? I was thinking of parallel processing, but would that still take a long time? Or is there another way to compare files which is faster?
Thanks
There is indeed a much faster way of doing this:
import collections
import glob
import os

def dupDetector(dirpath, ext):
    hashes = collections.defaultdict(list)
    for fpath in glob.glob(os.path.join(dirpath, "*.{}".format(ext))):
        h = imagehash.average_hash(Image.open(fpath))
        hashes[h].append(fpath)
    for h, fpaths in hashes.items():
        if len(fpaths) == 1:
            print(fpaths[0], "is one of a kind")
            continue
        print("The following files are duplicates of each other (with the hash {}): \n\t{}".format(h, '\n\t'.join(fpaths)))
Using the dictionary with the file hash as a key gives you O(1) lookups, which means you don't need to do the pairwise comparisons. You therefore go from a quadratic runtime to a linear runtime (yay!)
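A hypothetical invocation, assuming a folder named images/ full of .jpg files (with the imagehash and PIL imports from the question):
from PIL import Image
import imagehash

dupDetector("images", "jpg")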
Why not compute the hash only once per image?
hashes = [imagehash.average_hash(Image.open(path + fn)) for fn in file_names]
def compare_hashes(hash1, hash2):
    return hash1 == hash2
One solution is to keep using the hash but store it in a list of tuples (or a dict, I don't know which is more efficient here) where the first element is the name of the image and the second is the hash. It should take approximately the same 5 minutes.
If you have 5000 images:
You compare the value of the first element of the list to the 4999 others.
Then the second to the 4998 others (as you already checked the first one).
Then the third...
This "just" makes you do n²/2 comparisons (where n is the number of images); see the sketch below.
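A minimal sketch of that idea (path and file_names are assumed from the question; itertools.combinations generates exactly those n(n-1)/2 unordered pairs):
from itertools import combinations

from PIL import Image
import imagehash

# compute each hash once, paired with its file name
name_hashes = [(fn, imagehash.average_hash(Image.open(path + fn))) for fn in file_names]

# compare every unordered pair exactly once
for (name1, hash1), (name2, hash2) in combinations(name_hashes, 2):
    if hash1 == hash2:
        print(name1, "and", name2, "look like duplicates")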
Just use a map structure to calculate the hash of each image, then store each hash as a key and the name of the image as the value. As a result you would have a collection of unique image names.
def get_hash(filename):
    return imagehash.average_hash(Image.open(path + filename))

def get_unique_images(filenames):
    hashes = {}
    for filename in filenames:
        image_hash = get_hash(filename)
        hashes[image_hash] = filename
    return hashes.values()

Speed up the creation of pathname

I've got 2 folders. The first (called A) contains some images named in the form subject_incrementalNumber.jpg (where incrementalNumber goes from 0 to X).
Then I process each image contained in folder A, extract some pieces from it, and save each piece in folder B with the name subject(the same as the original image in folder A)_incrementalNumber(the same as in folder A)_anotherIncrementalNumber(which distinguishes one piece from another).
Finally, I delete the processed image from folder A.
A
subjectA_0.jpg
subjectA_1.jpg
subjectA_2.jpg
...
subjectB_0.jpg
B
subjectA_0_0.jpg
subjectA_0_1.jpg
subjectA_1_0.jpg
subjectA_2_0.jpg
...
Every time I download a new image of a subject and save it in folder A, I have to compute a new pathname for this image (I have to find the minimum incrementalNumber available for that specific subject). The problem is that when I process an image I delete it from folder A and store only the pieces in folder B, so I have to find the minimum number available across both folders.
Now I use the following function to create the pathname
import fnmatch
import os

# Create incremental file.
# If the name already exists, try with an incremental number (0, 1, etc.)
def chooseName(owner, dest_files, faces_files):
    # find the min number available in both folders
    v1 = seekVersion_folderA(owner, dest_files)
    v2 = seekVersion_folderB(owner, faces_files)
    # select the max of those 2
    version = max(v1, v2)
    # create the name
    base = dest_files + os.sep + owner + "_"
    fname = base + str(version) + ".jpg"
    return fname

# Seek the min number available in folderA
def seekVersion_folderA(owner, dest_files):
    def f(x):
        if fnmatch.fnmatch(x, owner + '_*.jpg'): return x
    res = filter(f, dest_files)
    def g(x): return int(x[x.find("_")+1:-len(".jpg")])
    numbers = list(map(g, res))
    if len(numbers) == 0: return 0
    else: return int(max(numbers)) + 1

# Seek the min number available in folderB
def seekVersion_folderB(owner, faces_files):
    def f(x):
        if fnmatch.fnmatch(x, owner + '_*_*.jpg'): return x
    res = filter(f, faces_files)
    def g(x): return int(x[x.find("_")+1:x.rfind("_")])
    numbers = list(map(g, res))
    if len(numbers) == 0: return 0
    else: return int(max(numbers)) + 1

output_name = chooseName(subject, folderA, folderB)
It works, but this process takes about 10 seconds for each image, and since I have a lot of images this is too inefficient.
Is there any workaround to make it faster?
As specified, this is indeed a hard problem with no magic shortcuts. In order to find the minimum available number you need to use trial and error, exactly as you are doing. Whilst the implementation could be sped up, there is a fundamental limitation in the algorithm.
I think I would relax the constraints to the problem a little. I would be prepared to choose numbers that weren't the minimum available. I would store a hidden file in the directory which contained the last number used when creating a file. Every time you come to create another one, read this number from the file, increment it by 1, and see if that name is available. If so you are good to go, if not, start counting up from there. Remember to update the file when you do settle on a name.
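A minimal sketch of that counter-file idea (the hidden file name .last_number is my choice; this assumes a single writer, i.e. no locking):
import os

COUNTER_FILE = ".last_number"  # hidden file storing the last number used

def next_name(directory, subject):
    counter_path = os.path.join(directory, COUNTER_FILE)
    # read the last number used, starting from 0 if the file doesn't exist yet
    try:
        with open(counter_path) as fh:
            num = int(fh.read()) + 1
    except FileNotFoundError:
        num = 0
    # count up from there until a free name is found
    while os.path.isfile(os.path.join(directory, f"{subject}_{num}.jpg")):
        num += 1
    # remember the number we settled on
    with open(counter_path, "w") as fh:
        fh.write(str(num))
    return os.path.join(directory, f"{subject}_{num}.jpg")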
If no humans are reading these names, then you may be better off using randomly generated names.
I've found another solution: use the hash of the file as a unique file name.
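For example, something like this (a sketch using hashlib.sha1 over the file contents; identical files then collapse to the same name automatically):
import hashlib

def hash_name(src_path):
    # name the file after the SHA-1 digest of its contents
    with open(src_path, "rb") as fh:
        digest = hashlib.sha1(fh.read()).hexdigest()
    return digest + ".jpg"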
