How to load a very large .mat file in chunks?

OK, so the code is shown below.
X1 is the loaded hyperspectral image, with dimensions 512x512x91.
What I am trying to do is crop 64x64x91 sub-arrays from it with a stride of 2. This gives me a total of 49952 crops, each of size 64x64x91, but when I run the for loop I get a memory error.
My system has 8 GB of RAM.
data_images_0=np.zeros((49952,256,256,91))
k=0
for i in range(0,512-64,2):
    r=64
    print(k)
    for j in range (0,512-64,2):
        #print(k)
        data_images_0[k,:,:,:]=X1[i:i+r,j:j+r,:]
        k=k+1
I have a hyperspectral image loaded from a .mat file, with dimensions 512x512x91. I want to use chunks of this image as the input to my CNN, for example crops of 64x64x91. The problem is that once I create the crops from the original image, I cannot hold them: loading all the crops at once gives me a memory error.
Is there something I can do to load my cropped data in batches so that I don't get this memory error?
Should I convert my data into some other format, or approach the problem in some other way?

You are looking for the matfile function. It lets you access the array on your hard disk and load only parts of it.
Say your picture is named pic; then you can do something like
data = matfile("filename.mat");
part = data.pic(1:64,1:64,:);
%Do something
Then only the (1:64,1:64,:) part of the variable pic will be loaded into part.
As always, it should be noted that working from the hard disk is not exactly fast and should be avoided where possible. On the other hand, if your variable is too large to fit in memory, there is no other way around it (apart from buying more memory).

I think you might want to use the matfile function, which opens a .mat file without pulling its entire content into RAM. It reads a header from your .mat file that contains information about the stored elements, such as size and data type. Imagine your .mat file hyperspectralimg.mat contains the matrix myImage. You would proceed like this:
filename = 'hyperspectralimg.mat';
img = matfile(filename);
A = doStuff2MyImg(img.myImage(1:64,1:64,:)); % Do stuff to your imageparts
img.myImage(1:64,1:64,:) = A; %Return changes to your file
This is a brief example of how you can use it, in case you haven't used matfile before. If you have already used it and it doesn't work, let us know. As a general recommendation, upload code snippets and data samples with your questions; it helps.
A quick comment regarding the tags: if your question concerns MATLAB, then don't tag it with python and similar things.

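You can use a NumPy memory map (numpy.memmap). It plays a role similar to MATLAB's matfile: the array stays on disk and only the parts you index are read into memory. Note that it maps a raw binary file, so the data first has to be saved in that form.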
https://numpy.org/doc/stable/reference/generated/numpy.memmap.html
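As a rough illustration of how this could apply to the question: holding all 49952 crops of 64x64x91 as float64 would take about 49952 × 64 × 64 × 91 × 8 bytes ≈ 149 GB, so the crops cannot all be materialized on an 8 GB machine. A minimal sketch, assuming the cube has been saved once as a .npy file, is to memory-map it and cut the crops batch by batch. The function name `crop_batches`, the file name, and the small stand-in cube are all hypothetical.

```python
import os
import tempfile

import numpy as np


def crop_batches(cube, crop=64, stride=2, batch_size=32):
    """Yield batches of crops with shape (n, crop, crop, bands)."""
    rows, cols, _ = cube.shape
    # Top-left corners of every crop, in row-major order.
    # Unlike the loop in the question, this includes the last valid position.
    corners = [(i, j)
               for i in range(0, rows - crop + 1, stride)
               for j in range(0, cols - crop + 1, stride)]
    for start in range(0, len(corners), batch_size):
        chunk = corners[start:start + batch_size]
        # Only this batch is copied into RAM; the cube itself stays on disk.
        yield np.stack([cube[i:i + crop, j:j + crop, :] for i, j in chunk])


# Demo with a small stand-in cube (assumption); the real X1 is (512, 512, 91).
path = os.path.join(tempfile.mkdtemp(), "cube.npy")
X1 = np.random.rand(80, 80, 5).astype(np.float32)
np.save(path, X1)

cube = np.load(path, mmap_mode="r")  # memory-mapped, read-only
batches = list(crop_batches(cube, crop=64, stride=2, batch_size=16))
```

In training code one would normally iterate over the generator and feed each batch to the network instead of collecting the batches in a list; the list here only keeps the demo short.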

Related

Fastest way of loading a large number (600) of large images (42M px) in Python

I am processing images with OpenCV and saving them in a NumPy array with numpy.save. I end up with this kind of file: a 600 (number of images) x 5248 (height) x 7936 (width) x 3 (channels) .npy file that weighs around 70 GB.
I then need to load this file in another program and show the images at a fast pace. If I'm not mistaken, loading a 70 GB file is not viable, regardless of the RAM size.
Hence my question: should I save the images in multiple smaller arrays? If so, how can I decide on the right number of images per array? As for the loading, should I use multiprocessing or multithreading?
Also, is there perhaps a better file type in which to store the images?

NumPy memmap slow loading small chunk from large file on first read only

I am using NumPy memmap to load a small amount of data from various locations throughout a large binary file (memmap'd, reshaped, flipped around, and then around 2000x1000 points loaded from around a 2 GB binary file). There are five 2 GB files each with its own memory map object.
The memory maps are created all very quickly. And the slice of data from the first several files pulls out very quickly. But, then, it suddenly stops on the fourth and fifth file. Memory usage remains low, so, it does not appear to be reading the whole file into memory, but, I/O access from the process is high. It could easily take ten or fifteen minutes for this to clear, and then everything proceeds as expected. Subsequent access through all of the memory maps is extremely rapid, including loading data that was not previously touched. Memory usage remains low throughout. Closing Python and re-running, the problem does not reoccur until reboot (caching maybe?).
I'm on Windows 10 with Python 2.7. Any thoughts for troubleshooting?
EDIT: There was a request in the comments for file format type and example code. Unfortunately, I cannot provide exact details; however, I can say this much. The file format contains just int16 binary values for a 3D array which can be reshaped by the format [n1, n2, n3] where n* are the length for each dimension. However, the files are split at 2GB. So, they are loaded in like this:
memmaps = []
for filename in filelist:
    memmaps.append(np.memmap(filename, dtype=np.int16, mode='r'))
    # Integer division, so the shape stays an int on Python 3 as well
    memmaps[-1] = memmaps[-1].reshape([len(memmaps[-1])//n2//n3, n2, n3])
    memmaps[-1] = np.transpose(memmaps[-1], [2,1,0])
This certainly isn't the cleanest code in the world, but it generally works, except for this seemingly random slow down. The user has a slider which allows them to plot a slice from this array as
image = np.zeros([n2, n1], dtype=np.int16)
#####
c = 0
for d in memmaps:
    image[:,c:(c+d.shape[2])] = d[slice,:,:]
    c = c + d.shape[2]
I'm leaving out a lot of detail, but I think this captures the most relevant information.
EDIT 2: Also, I am open to alternative approaches to handling this problem. My end goal is real time interactive plotting of an arbitrary and relatively small chunk of 2D data as an image from a large 3D dataset that may be split across multiple binary files. I'm presently using pyqtgraph with fairly reasonable results, except for this random problem.

Reducing RAM overloading when handling big matrices in python

I am currently working in a lab that uses IPython Notebook with Python 2.7 for data processing. We work on pictures taken by a 285*384 pixel camera, with different parameters changing according to what we want to observe. We therefore need to deal with big matrices, and as the data processing progresses, the accumulated matrix allocations fill up the RAM and swap, so we cannot go any further.
The typical initial data matrix is of size 100*285*384*16. We then have to allocate numerous other matrices: first to calculate the temporal average of this matrix (of size 285*384*16, 100 being the temporal dimension), then to fit the data linearly, which gives 2 matrices of size 100*285*384*16 (2 estimated parameters are needed for the linear fit), then to calculate the average and the standard deviation of those fits, and so on. So we allocate a lot of big matrices, which fills up the RAM and swap. We also display some pictures associated with some of these matrices.
Of course we could deallocate matrices as we go further in the data processing, but we need to be able to change the code and see the results of old calculations without having to rerun all the code (calculations are sometimes pretty long). All results depend on the previous ones, so we need to keep the data in memory.
I would like to know whether there is some way to extend the swap memory (onto the "physical" memory of a disk, for example) or to bypass our RAM limitations with a smarter way of coding. Otherwise I would use a server at my institute that has 32 GB of RAM, but it would cost us time and convenience not to be able to do this on our own computers. The crash occurs on both Macintosh and Windows. Because of the RAM limitations for Python on Windows I will probably try Linux, but the 4 GB of RAM in our computers will still be filled at some point.
I would really appreciate any help with this problem; I haven't found any answers online so far. Thank you in advance for your help.

You can drastically reduce your RAM requirements by storing the images on disk in HDF5 format, using compression with pytables. Depending on your specific data, you may even gain significant performance compared to an all-in-RAM approach.
The trick is to use the very fast blosc compression included in pytables.
As an example, this code creates a file containing multiple numpy arrays using blosc compression:
import tables
import numpy as np
img1 = np.arange(200*300*100)
img2 = np.arange(200*300*100)*10
h5file = tables.open_file("image_store.h5", mode = "w", title = "Example images",
                          filters=tables.Filters(complevel=5, complib='blosc'))
h5file.create_carray('/', 'image1', obj=img1, title = 'The image number 1')
h5file.create_carray('/', 'image2', obj=img2, title = 'The image number 2')
h5file.flush() # This makes sure everything is flushed to disk
h5file.close() # Closes the file, previous flush is redundant here.
and the following code snippet loads the two arrays back in RAM:
h5file = tables.open_file("image_store.h5") # By default it is a read-only open
img1 = h5file.root.image1[:] # Load image1 into RAM by using "slicing"
img2 = h5file.root.image2.read() # Load image2 into RAM
Finally, if a single array is too big to fit in RAM, you can save and read it chunk by chunk using the conventional slicing notation. You create a (chunked) pytables array on disk with a preset size and type and then fill it in chunks, like this:
# h5file must be open in a writable mode ("w" or "a") for this to work
h5file.create_carray('/', 'image_big', title = 'Big image',
                     atom=tables.Atom.from_dtype(np.dtype('uint16')),
                     shape=(200, 300, 400))
h5file.root.image_big[:100] = 1
h5file.root.image_big[100:200] = 2
h5file.flush()
Note that this time you don't provide a numpy array to pytables (obj keyword) but you create an empty array, and therefore you need to specify shape and type (atom).
For more info you can check out the official pytables documentation:
PyTables Documentation

Constructing high resolution images in Python

Say I have some huge amount of data stored in an HDF5 data file (size: 20k x 20k, if not more) and I want to create an image from all of this data using Python. Obviously, this much data cannot be opened and stored in the memory without an error. Therefore, is there some other library or method that would not require all of the data to be dumped into the memory and then processed into an image (like how the libraries: Image, matplotlib, numpy, etc. handle it)?
Thanks.
This question comes from a similar question I asked: "Generating pcolormesh images from very large data sets saved in H5 files with Python". However, I think that the question I posed here covers a broader range of applications.
EDIT (7.6.2013)
Allow me to clarify my question further: In the first question (the link), I was using the easiest method I could think of to generate an image from a large collection of data stored in multiple files. This method was to import the data, generate a pcolormesh plot using matplotlib, and then save a high resolution image from this plot. But there are obvious memory limitations to this approach. I can only import about 10 data sets from the files before I reach a memory error.
In that question, I was asking if there is a better method to patch together the data sets (that are saved in HDF5 files) into a single image without importing all of the data into the memory of the computer. (I will likely require 100s of these data sets to be patched together into a single image.) Also, I need to do everything in Python to make it automated (as this script will need to be run very often for different data sets).
The real question I discovered while trying to get this to work using various libraries is: How can I work with high resolution images in Python? For example, if I have a very high resolution PNG image, how can I manipulate it with Python (crop, split, run through an fft, etc.)? In my experience, I have always run into memory issues when trying to import high resolution images (think ridiculously high resolution pictures from a microscope or telescope (my application is a microscope)). Are there any libraries designed to handle such images?
Or, conversely, how can I generate a high resolution image from a massive amount of data saved in a file with Python? Again the data file could be arbitrarily large (5-6 Gigabytes if not larger).
But in my actual application, my question is: Is there a library or some kind of technique that would allow me to take all of the data sets that I receive from my device (which are saved in HDF5) and patch them together to generate an image from all of them? Or I could save all of the data sets in a single (very large) HDF5 file. Then how could I import this one file and then create an image from its data?
I do not care about displaying the data in some interactive plot. The resolution of the plot is not important. I can easily use a lower resolution for it, but I must be able to generate and save a high resolution image from the data.
Hope this clarifies my question. Feel free to ask any other questions about my question.

You say it "obviously can't be stored in memory", but the following calculations say otherwise.
20,000 * 20,000 pixels * 4 channels = 1.6GB (assuming 1 byte per channel)
Most reasonably modern computers have 8GB to 16GB of memory so handling 1.6GB shouldn't be a problem.
However, in order to handle the patchworking you need to do, you could stream each pixel from one file into the other. This assumes the format is a lossless bitmap using a linear encoding format like BMP or TIFF. Simply read each file and append to your result file.
You may need to get a bit clever if the files are different sizes or patched together in some type of grid. In that case, you'd need to calculate the total dimensions of the resulting image and offset the file writing pointer.
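A minimal sketch of the "offset the file writing pointer" idea, assuming an uncompressed, row-major, single-channel raw output file with no header. The helper name `write_tile` and the tiny 2x2 tiles are hypothetical; a real BMP or TIFF would additionally need its header written.

```python
import os
import tempfile


def write_tile(out, tile_rows, x0, y0, total_width, bytes_per_pixel=1):
    """Write one tile (a list of row byte strings) at pixel offset (x0, y0)."""
    for dy, row in enumerate(tile_rows):
        # Byte position of pixel (x0, y0 + dy) in a row-major raw image.
        out.seek(((y0 + dy) * total_width + x0) * bytes_per_pixel)
        out.write(row)


# Demo: stitch four 2x2 tiles into one 4x4 raw image, one tile at a time.
total_width, total_height = 4, 4
path = os.path.join(tempfile.mkdtemp(), "mosaic.raw")
with open(path, "wb") as out:
    out.truncate(total_width * total_height)  # preallocate the output file
    corners = [(0, 0), (2, 0), (0, 2), (2, 2)]
    for value, (x0, y0) in enumerate(corners, start=1):
        tile = [bytes([value, value])] * 2  # 2 rows of 2 pixels each
        write_tile(out, tile, x0, y0, total_width)

with open(path, "rb") as f:
    mosaic = f.read()
```

Only one tile is held in memory at any time, so the size of the final image is limited by disk space rather than RAM.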

Direct access to a single pixel using Python

Is there any way with Python to directly get (only get, no modify) a single pixel (to get its RGB color) from an image (compressed format if possible) without having to load it in RAM nor processing it (to spare the CPU)?
More details:
My application is meant to have a huge database of images, and only of images.
So what I chose is to directly store images on harddrive, this will avoid the additional workload of a DBMS.
However I would like to optimize some more, and I'm wondering if there's a way to directly access a single pixel from an image (the only action on images that my application does), without having to load it in memory.
Does PIL pixel access allow that? Or is there another way?
The encoding of images is my own choice, so I can change whenever I want. Currently I'm using PNG or JPG. I can also store in raw, but I would prefer to keep images a bit compressed if possible. But I think harddrives are cheaper than CPU and RAM, so even if images must stay RAW in order to do that, I think it's still a better bet.
Thank you.
UPDATE
So, as I feared, it seems that it's impossible to do with variable compression formats such as PNG.
I'd like to refine my question:
Is there a constant compression format (not necessarily specific to an image format, I'll access it programmatically), which would allow to access any part by just reading the headers?
Technically, how to efficiently (read: fast and non blocking) access a byte from a file with Python?
SOLUTION
Thanks to all. I have successfully implemented the functionality I described by using run-length encoding on every row and padding every row to the length of the longest encoded row.
This way, by prepending a header that describes the fixed size of each row, I can easily access a row by first using file.readline() to get the header data, then file.seek(headersize + fixedsize*y, 0), where y is the row currently selected.
Files are compressed, and in memory I only fetch a single row, and my application doesn't even need to uncompress it because I can compute where the pixel is exactly by just iterating over every RLE values. So it is also very easy on CPU cycles.
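The actual file layout used in this solution is not shown, so the following is only a sketch of the general idea under simplifying assumptions: single-channel pixels, runs stored as (count, value) byte pairs with counts up to 255, rows padded with zero bytes, and a one-line text header holding the fixed row size. All function names are hypothetical.

```python
import os
import tempfile


def rle_encode(row):
    """Encode a row of byte values as (count, value) pairs, count <= 255."""
    out = bytearray()
    i = 0
    while i < len(row):
        run = 1
        while i + run < len(row) and row[i + run] == row[i] and run < 255:
            run += 1
        out += bytes([run, row[i]])
        i += run
    return bytes(out)


def write_image(path, rows):
    encoded = [rle_encode(r) for r in rows]
    fixed = max(len(e) for e in encoded)  # every row is padded to this size
    with open(path, "wb") as f:
        f.write(f"{fixed}\n".encode("ascii"))  # header: fixed row size
        for e in encoded:
            f.write(e.ljust(fixed, b"\x00"))  # zero-count runs act as padding


def get_pixel(path, x, y):
    with open(path, "rb") as f:
        header = f.readline()
        fixed = int(header.decode("ascii").strip())
        f.seek(len(header) + fixed * y, 0)  # jump straight to row y
        row = f.read(fixed)
    pos = 0
    # Walk the runs without decompressing the row.
    for k in range(0, len(row), 2):
        count, value = row[k], row[k + 1]
        if x < pos + count:
            return value
        pos += count
    raise IndexError("x is outside the row")


# Demo: three rows of four pixels each.
path = os.path.join(tempfile.mkdtemp(), "image.rle")
write_image(path, [bytes([5, 5, 5, 9]), bytes([1, 2, 3, 4]), bytes([7, 7, 7, 7])])
```

The padding is the price of the fixed row size: a single poorly compressible row makes every row in the file as long as that one.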

If you want to keep a compressed file format, you can break each image up into smaller rectangles and store them separately. Using a fixed size for the rectangles will make it easier to calculate which one you need. When you need the pixel value, calculate which rectangle it's in, open that image file, and offset the coordinates to get the proper pixel.
This doesn't completely optimize access to a single pixel, but it can be much more efficient than opening an entire large image.
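The lookup described above comes down to two integer divisions. A minimal sketch, assuming fixed 256x256 tiles and a hypothetical file naming scheme:

```python
def locate_pixel(x, y, tile_w=256, tile_h=256):
    """Return ((tile_col, tile_row), (local_x, local_y)) for a pixel."""
    tile_col, local_x = divmod(x, tile_w)
    tile_row, local_y = divmod(y, tile_h)
    return (tile_col, tile_row), (local_x, local_y)


def tile_filename(image_id, tile_col, tile_row):
    # Hypothetical naming scheme; any fixed scheme works.
    return f"{image_id}_{tile_row}_{tile_col}.png"


tile, local = locate_pixel(1000, 300)
name = tile_filename("img42", *tile)
```

The tile file can then be opened with whatever image library is in use and queried at the local coordinates, so only one small tile is decoded per lookup.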

In order to evaluate a file, you have to load it into memory. However, you might be able to read only parts of a file, depending on the file format. For example, a PNG file starts with a fixed signature of 8 bytes, but because of compression the chunks that follow have variable sizes. If you store all the pixels in a raw format instead, you can access each pixel directly, because you can calculate its address in the file from the appropriate offset. What PNG or JPEG will do with the raw data is impossible to predict.
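For the raw-format case, the offset calculation can be sketched as follows, assuming a row-major layout with 3 bytes per pixel and an optional fixed-size header. The function name `read_pixel` and the file name are hypothetical.

```python
import os
import tempfile


def read_pixel(path, x, y, width, channels=3, header_size=0):
    """Read one pixel from a row-major, uncompressed raw image file."""
    offset = header_size + (y * width + x) * channels
    with open(path, "rb") as f:
        f.seek(offset)  # jump to the pixel; nothing else is read
        return tuple(f.read(channels))


# Demo: a 3x2 RGB image where each pixel stores (x, y, 255).
width, height = 3, 2
path = os.path.join(tempfile.mkdtemp(), "image.raw")
with open(path, "wb") as f:
    for y in range(height):
        for x in range(width):
            f.write(bytes([x, y, 255]))

pixel = read_pixel(path, 2, 1, width)
```

Only `channels` bytes are read per lookup, whatever the size of the image.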
Depending on the structure of the files you might be able to compute efficient hashes. I suppose there is loads of research, if you want to really get into this, for example: Link
"This paper introduces a novel image indexing technique that may be called an image hash function. The algorithm uses randomized signal processing strategies for a non-reversible compression of images into random binary strings, and is shown to be robust against image changes due to compression, geometric distortions, and other attacks"
