Inefficient loading of a lot of small files - python

I'm having trouble loading a lot of small image files (approx. 90k PNG images) into a single 3D np.array.
The current solution takes a couple of hours, which is unacceptable.
The images are size 64x128.
I have a pd.DataFrame called labels with the names of the images, and I want to import those images in the same order as in the labels variable.
My current solution is:
dataset = np.empty([1, 64, 128], dtype=np.int32)
for file_name in labels['file_name']:
    array = cv.imread(f'{IMAGES_PATH}/{file_name}.png', cv.COLOR_BGR2GRAY)
    dataset = np.append(dataset, [array[:]], axis=0)
From what I have timed, the most time-consuming operation is dataset = np.append(dataset, [array[:]], axis=0), which takes around 0.4 s per image.
Is there any better way to import such files and store them in a np.array?
I was thinking about multiprocessing, but I want the labels and dataset to be in the same order.

Game developers typically concatenate bunches of small images into a single big file and then use sizes and offsets to slice out the currently needed piece. Here's an example of how this can be done with ImageMagick:
montage -mode concatenate -tile 1x *.png out.png
But then again, it will not get around reading the 90k small files in the first place. And ImageMagick has its own peculiarities, which may or may not surface in your case.
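Once such a strip exists, a single frame can be cut out by offset. A rough sketch, assuming the 64x128 images from the question were stacked vertically by the montage command above, so that image i occupies rows i*64 to (i+1)*64:

import cv2 as cv

# one tall grayscale image holding all frames, produced by the montage command
strip = cv.imread('out.png', cv.IMREAD_GRAYSCALE)

def get_frame(i, frame_height=64):
    # slice out the i-th 64x128 frame by row offset
    return strip[i * frame_height:(i + 1) * frame_height, :]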
Also, I hadn't originally noticed that your problem is with np.append(dataset, [array[:]], axis=0).
That is a very bad line. Appending in a loop is never performant code, because it reallocates and copies the whole array on every iteration.
Either preallocate the array and write into it, or use NumPy's functions for concatenating many arrays at once:
dataset = np.empty([int(90e3), 64, 128], dtype=np.int32)
for i, file_name in enumerate(labels['file_name']):
    array = cv.imread(f'{IMAGES_PATH}/{file_name}.png', cv.COLOR_BGR2GRAY)
    dataset[i, :, :] = array
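If decoding the PNGs themselves remains the bottleneck after the append is fixed, a process pool can be layered on top without losing the ordering, since Pool.map returns results in the same order as its input iterable. A minimal sketch, reusing IMAGES_PATH and labels from the question (note that imread's second argument should be an IMREAD_* flag such as cv.IMREAD_GRAYSCALE rather than a color-conversion code):

import multiprocessing as mp
import cv2 as cv
import numpy as np

def read_gray(file_name):
    # decode one image as a single-channel (grayscale) array
    return cv.imread(f'{IMAGES_PATH}/{file_name}.png', cv.IMREAD_GRAYSCALE)

if __name__ == '__main__':
    with mp.Pool() as pool:
        # Pool.map preserves input order, so row i of the result
        # still matches row i of labels
        images = pool.map(read_gray, labels['file_name'])
    dataset = np.stack(images).astype(np.int32)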

Related

Loading image files as an ndarray in an efficient way

I have a bunch of images in a directory and want to make a single ndarray of those images, like CIFAR-10. I have written a brute-force way to do so. However, it gets really slow when the number of images is large.
def dataLoader(data_path):
    data = np.empty([1, 1200, 900, 3])  # size of an image is 1200*900 with RGB
    i = 0
    for filename in os.listdir(data_path):
        if filename.endswith(".jpg") or filename.endswith(".png"):
            target = cv2.imread(os.path.join(data_path, filename))
            data = np.concatenate([data, target.reshape([1, 1200, 900, 3])], axis=0)
            print(i, end='\r')
            i += 1
    return data
I am checking the progress by printing the loop count. It is fairly quick for the first 50 iterations, but it gets slower and slower as it iterates. I suppose this is due to the NumPy concatenation. Is there a better way to do this?
I would also really appreciate a method to save the ndarray so that I don't need to rebuild it every time. Currently, I'm using numpy.save.
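The preallocation idea from the answer above applies here as well; a rough sketch, assuming 1200x900 RGB images as in the question and using numpy.save so the array is not rebuilt on every run (the directory name is a placeholder):

import os
import cv2
import numpy as np

def dataLoader(data_path):
    files = [f for f in sorted(os.listdir(data_path))
             if f.endswith((".jpg", ".png"))]
    # allocate the full array once; uint8 matches what cv2.imread returns
    data = np.empty([len(files), 1200, 900, 3], dtype=np.uint8)
    for i, filename in enumerate(files):
        data[i] = cv2.imread(os.path.join(data_path, filename))
    return data

data = dataLoader("images/")      # "images/" is a placeholder directory
np.save("dataset.npy", data)      # persist the array ...
data = np.load("dataset.npy")     # ... and reload it instantly next time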

Import Multipage TIFF as separate images in Python

I'm currently working on some image processing in Python 2. I'm saving the images as multipage .tif files, each of which contains 4 images of resolution 1920x1080. I've imported the TIFF file using skimage.io.imread. Once I've done this, I'm left with an ndarray with shape (1080, 1920, 4). How should I break this into four separate 1920x1080 NumPy arrays that I can then use for image processing?
You can use numpy.dsplit():
arrays = numpy.dsplit(a, a.shape[2])
This will get you a list with the arrays you want.
Optionally you could just use slicing:
arrays = [a[:,:,n] for n in range(a.shape[2])]
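For example, assuming a is the (1080, 1920, 4) array from skimage.io.imread, the two approaches differ only in whether the trailing singleton axis is kept:

import numpy as np

a = np.zeros((1080, 1920, 4))                      # stand-in for the imported multipage TIFF

pages = np.dsplit(a, a.shape[2])                   # 4 arrays, each of shape (1080, 1920, 1)
frames = [a[:, :, n] for n in range(a.shape[2])]   # 4 arrays, each of shape (1080, 1920)

print(pages[0].shape, frames[0].shape)             # (1080, 1920, 1) (1080, 1920)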

How can I write to a png/tiff file patch-by-patch?

I want to create a png or tiff image file from a very large h5py dataset that cannot be loaded into memory all at once. So, I was wondering if there is a way in python to write to a png or tiff file in patches? (I can load the h5py dataset in slices to a numpy.ndarray).
I've tried using the Pillow library with PIL.Image.paste, giving the box coordinates, but for large images it runs out of memory.
Basically, I'm wondering if there's a way to do something like:
for y in range(0, height, patch_size):
    for x in range(0, width, patch_size):
        y2 = min(y + patch_size, height)
        x2 = min(x + patch_size, width)
        # image_arr is an h5py dataset that cannot be loaded completely
        # in memory, so load it in slices
        image_file.write(image_arr[y:y2, x:x2], box=(y, x, y2, x2))
I'm looking for a way to do this, without having the whole image loaded into memory. I've tried the pillow library, but it loads/keeps all the data in memory.
Edit: This question is not about h5py, but rather about how extremely large images (that cannot be loaded into memory) can be written out to a file in patches - similar to how large text files can be constructed by writing to them line by line.
Try tifffile.memmap:
from tifffile import memmap

image_file = memmap('temp.tif', shape=(height, width), dtype=image_arr.dtype,
                    bigtiff=True)

for y in range(0, height, patch_size):
    for x in range(0, width, patch_size):
        y2 = min(y + patch_size, height)
        x2 = min(x + patch_size, width)
        image_file[y:y2, x:x2] = image_arr[y:y2, x:x2]

image_file.flush()
This creates an uncompressed BigTIFF file with one strip. Memory-mapped tiles are not implemented yet. Not sure how many libraries can handle that kind of file, but you can always read directly from the strip using the metadata in the TIFF tags.
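If the file was written as above, it can later be reopened without pulling it into memory, since tifffile.memmap can also map an existing uncompressed TIFF (here 'temp.tif' is the file created above):

from tifffile import memmap

# reopen the previously written file; pixels stay on disk until sliced
image_file = memmap('temp.tif')
patch = image_file[:256, :256]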
Short answer to "if there is a way in Python to write to a png or tiff file in patches?". Well, yes - everything is possible in Python, given enough time and skill to implement it. On the other hand, NO, there is no ready-made solution for this - because it doesn't appear to be very useful.
I don't know about TIFF, and a comment here says it is limited to 4GB, so this format is likely not a good candidate. PNG has no practical limit and can be written in chunks, so it is doable in theory - on the condition that at least one scan line of your resulting image fits into memory.
If you really want to go ahead with this, here is the info that you need:
A PNG file consists of a few metadata chunks and a series of image data chunks. The latter are independent of each other and you can therefore construct a big image out of several smaller images (each of which contains a whole number of rows, a minimum of one row) by simply concatenating their image data chunks (IDAT) together and adding the needed metadata chunks (you can pick those from the first small image, except for the IHDR chunk - that one will need to be constructed to contain the final image size).
So, here is how I'd do it, if I had to (NOTE you will need some understanding of Python's bytes type and the methods of converting byte sequences to and from Python data types to pull this off):
Find how many rows I can fit into memory and make that the height of my "small image chunk". The width is the width of the entire final image; let's call those width and small_height.
Go through my giant data set in h5py one chunk at a time (width * small_height), convert it to PNG and save it to disk in a temporary file, or, if your image conversion library allows it, directly to a bytes string in memory. Then process the byte data as follows and delete it at the end:
-- On the first iteration: walk through the PNG data one record at a time (see the PNG spec: http://www.libpng.org/pub/png/spec/1.2/png-1.2-pdg.html; it is in length-tag-value form, and it is very easy to write code that efficiently walks over the file record by record; a rough sketch of such chunk walking follows these steps), and save ALL the records into my target file, except: modify IHDR to have the final image size and skip the IEND record.
-- On all subsequent iterations: scan through the PNG data, pick only the IDAT records, and write those out to the output file.
Append an IEND record to the target file.
All done - you should now have a valid humongous PNG. I wonder who or what could read that, though.
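A rough sketch of the record walking described above, using the chunk layout from the PNG spec (4-byte big-endian length, 4-byte type, data, 4-byte CRC over type plus data); the overall splitting and IDAT-concatenation strategy is the answer's, this only shows the walking and writing of chunks:

import struct
import zlib

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'

def iter_chunks(png_bytes):
    # yield (chunk_type, chunk_data) for every chunk after the 8-byte signature
    pos = len(PNG_SIGNATURE)
    while pos < len(png_bytes):
        length, = struct.unpack('>I', png_bytes[pos:pos + 4])
        chunk_type = png_bytes[pos + 4:pos + 8]
        data = png_bytes[pos + 8:pos + 8 + length]
        pos += 12 + length          # length + type + data + CRC
        yield chunk_type, data

def write_chunk(out, chunk_type, data):
    # a chunk is: length, type, data, CRC over type + data
    out.write(struct.pack('>I', len(data)))
    out.write(chunk_type)
    out.write(data)
    out.write(struct.pack('>I', zlib.crc32(chunk_type + data) & 0xffffffff))

def patch_ihdr_height(ihdr_data, new_height):
    # IHDR starts with width (4 bytes) and height (4 bytes); the rest is copied as-is
    return ihdr_data[:4] + struct.pack('>I', new_height) + ihdr_data[8:]

Following the steps above, the first iteration would write the signature plus every chunk (with IHDR patched and IEND skipped), later iterations would write only the b'IDAT' chunks, and a final write_chunk(out, b'IEND', b'') would close the file.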

OpenCV with Python - reading images in a loop

I'm using OpenCV's imread function to read my images into Python, for further processing as a NumPy array later in the pipeline. I know that OpenCV uses BGR instead of RGB, and have accounted for it where required. But one thing that stumps me is why I get differing outputs for the following two scenarios.
Reading an image directly into a single variable works fine. The plotted image (using matplotlib.pyplot) reproduces my .tiff/.png input correctly.
img_train = cv2.imread('image.png')
plt.imshow(img_train)
plt.show()
When I use cv2.imread in a loop (for reading from a directory of such images - which is my ultimate goal here), I create an array as follows:
files = [f for f in listdir(mypath) if isfile(join(mypath, f))]
img_train = np.empty([len(files), height, width, channel])
for n in range(0, len(files)):
    img_train[n] = cv2.imread(join(mypath, files[n]))
    plt.imshow(img_train[n])
    plt.show()
When I try to cross check and plot the image obtained thus, I get a very different output. Why so? How do I rectify this so that it looks more like my input, like in the first case? Am I reading the arrays correctly in the second case, or is it flawed?
Otherwise, is it something that stems from Matplotlib's plotting function? I do not know how to cross check for this case, though.
Any advice appreciated.
Extremely trivial solution:
np.empty creates an array of dtype float64 by default.
Changing this to uint8 - the dtype that cv2.imread returns, as in the first case - made the plots come out correctly.
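In other words, declaring the array with the dtype that cv2.imread returns avoids the implicit float conversion. A minimal sketch along the lines of the loop above, assuming mypath, height, width and channel as in the question:

import os
import cv2
import numpy as np
import matplotlib.pyplot as plt

files = [f for f in os.listdir(mypath) if os.path.isfile(os.path.join(mypath, f))]
img_train = np.empty([len(files), height, width, channel], dtype=np.uint8)  # uint8, not float
for n in range(len(files)):
    img_train[n] = cv2.imread(os.path.join(mypath, files[n]))

# convert BGR to RGB so matplotlib shows the expected colours
plt.imshow(cv2.cvtColor(img_train[0], cv2.COLOR_BGR2RGB))
plt.show()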

Best dtype for creating large arrays with numpy

I am looking to store pixel values from satellite imagery into an array. I've been using
np.empty((image_width, image_length))
and it worked for smaller subsets of an image, but when using it on the entire image (3858 x 3743) the code terminates very quickly and all I get is an array of zeros.
I load the image values into the array using a loop and opening the image with gdal
img = gdal.Open(os.path.join(fn + "\{0}".format(fname))).ReadAsArray()
but when I include print img_array I end up with just zeros.
I have tried almost every single dtype that I could find in the numpy documentation but keep getting the same result.
Is numpy unable to load this many values or is there a way to optimize the array?
I am working with 8-bit tiff images that contain NDVI (decimal) values.
Thanks
Not certain what type of images you are trying to read, but in the case of RADARSAT-2 images you can do the following:
dataset = gdal.Open("RADARSAT_2_CALIB:SIGMA0:" + inpath + "product.xml")
S_HH = dataset.GetRasterBand(1).ReadAsArray()
S_VV = dataset.GetRasterBand(2).ReadAsArray()
# gets the intensity (Intensity = re**2+imag**2), and amplitude = sqrt(Intensity)
self.image_HH_I = numpy.real(S_HH)**2+numpy.imag(S_HH)**2
self.image_VV_I = numpy.real(S_VV)**2+numpy.imag(S_VV)**2
But that is specific to that type of image (in this case each image contains several bands, so I need to read each band separately with GetRasterBand(i) and then call ReadAsArray()). If there is a specific GDAL driver for the type of images you want to read in, life gets very easy.
If you give some more info on the type of images you want to read in, I can maybe help more specifically.
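For a plain single-band GeoTIFF, for example, the generic GTiff driver is picked up automatically and ReadAsArray returns an array whose dtype follows the file, so no manual dtype guessing is needed. A small sketch (the file name is a placeholder):

from osgeo import gdal
import numpy as np

ds = gdal.Open("ndvi_scene.tif")            # hypothetical 8-bit GeoTIFF
arr = ds.GetRasterBand(1).ReadAsArray()     # dtype matches the file, e.g. uint8

print(arr.shape, arr.dtype)
ndvi = arr.astype(np.float32) / 255.0       # rescale 8-bit values to decimal NDVI if needed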
Edit: Did you try something like this? (Not sure if that will work on TIFF, or how many bytes the header is, hence the something:)
A = open(filename, "rb")
B = numpy.fromfile(A, dtype='uint8')[something:].reshape(3858, 3743)
C = B * 1.0
A.close()
Edit: The problem was solved by using 64-bit Python instead of 32-bit, due to memory errors around 2 GB with the 32-bit Python version.
