Reducing RAM overload when handling big matrices in Python

I am currently working in a lab that uses IPython Notebook with Python 2.7 for data processing. We work on pictures taken by a 285*384 pixel camera, with different parameters changing according to what we want to observe. We therefore need to deal with big matrices, and as the data processing progresses, the accumulation of matrix allocations fills up the RAM and swap, so we cannot go any further.
The typical initial data matrix has size 100*285*384*16. We then have to allocate numerous other matrices: the temporal average corresponding to this matrix (of size 285*384*16, 100 being the temporal dimension), then a linear fit of the data, which gives two 100*285*384*16 matrices (two estimated parameters are needed for the linear fit), then the average and standard deviation of those fits, and so on. So we allocate a lot of big matrices, which fills up the RAM and swap. We also display some pictures associated with some of these matrices.
Of course we could deallocate matrices as the processing goes on, but we need to be able to change the code and look at the results of old calculations without having to rerun everything (calculations are sometimes pretty long). All results depend on the previous ones, so we need to keep the data in memory.
I would like to know whether there is some way to extend the swap memory (onto the physical storage of a disk, for example) or to bypass our RAM limitations with a smarter way of coding. Otherwise I could use a server at my laboratory institute that has 32 GB of RAM, but it would be a loss of time and convenience for us not to be able to do this on our own computers. The crash occurs on both Macintosh and Windows, and given the RAM limitations of Python on Windows I will probably try Linux, but the 4 GB of RAM of our computers will still be overfilled at some point.
I would really appreciate any help with this problem; I haven't found any answers on the net so far. Thank you in advance for your help.

You can drastically reduce your RAM requirements by storing the images to disk in HDF5 format with compression, using PyTables. Depending on your specific data, this can give significant performance gains compared to an all-in-RAM approach.
The trick is to use the blazing fast blosc compression included in PyTables.
As an example, this code creates a file containing multiple NumPy arrays using blosc compression:
import tables
import numpy as np
img1 = np.arange(200*300*100)
img2 = np.arange(200*300*100)*10
h5file = tables.open_file("image_store.h5", mode="w", title="Example images",
                          filters=tables.Filters(complevel=5, complib='blosc'))
h5file.create_carray('/', 'image1', obj=img1, title = 'The image number 1')
h5file.create_carray('/', 'image2', obj=img2, title = 'The image number 2')
h5file.flush() # This makes sure everything is flushed to disk
h5file.close() # Closes the file, previous flush is redundant here.
and the following code snippet loads the two arrays back into RAM:
h5file = tables.open_file("image_store.h5") # By default it is a read-only open
img1 = h5file.root.image1[:]     # Load image1 into RAM using "slicing"
img2 = h5file.root.image2.read() # Load image2 into RAM using read()
Finally, if a single array is too big to fit in RAM, you can save and read it chunk by chunk using the conventional slicing notation. You create a (chunked) PyTables array on disk with a preset size and type, and then fill it in chunks in this way:
h5file.create_carray('/', 'image_big', title='Big image',
                     atom=tables.Atom.from_dtype(np.dtype('uint16')),
                     shape=(200, 300, 400))
h5file.root.image_big[:100] = 1
h5file.root.image_big[100:200] = 2
h5file.flush()
Note that this time you don't provide a NumPy array to PyTables (the obj keyword); instead you create an empty array, and therefore you need to specify its shape and type (atom).
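Reading the big array back works the same way: slicing the on-disk carray pulls only the requested slab into RAM, whether the file is still open or is reopened later. A minimal sketch using the image_big array created above:
first_chunk = h5file.root.image_big[:100]      # only this slab is read into RAM
second_chunk = h5file.root.image_big[100:200]  # the next slab, read on demand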
For more info you can check out the official pytables documentation:
PyTables Documentation

Related

Unexpected behaviour when chunking with multiple netcdf files in xarray/dask

I'm working with a set of 468 netcdf files summing up to 12GB in total. Each file has only one global snapshot of a geophysical variable, i.e. for each file the data shape is (1, 1801, 3600) corresponding to dimensions ('time', 'latitude', 'longitude').
My RAM is 8 GB, so I need chunking. I'm creating an xarray dataset using xarray.open_mfdataset, and I have found that using the chunks parameter when calling xarray.open_mfdataset or rechunking afterwards with the .chunk method has totally different outcomes. A similar issue was reported here without getting any response.
From the xarray documentation, chunking when calling xarray.open_dataset or when rechunking with .chunk should be exactly equivalent...
http://xarray.pydata.org/en/stable/dask.html
...but it doesn't seem so. I share my examples here.
1) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude) HAVING THE TIME DIMENSION UNCHUNKED.
import xarray as xr
from dask.diagnostics import ProgressBar

data1 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks={'longitude': 400, 'latitude': 200}) \
          .chunk({'time': -1})
data1.t2m.data
with ProgressBar():
    data1.std('time').compute()
[########################################] | 100% Completed | 5min 44.1s
In this case everything works fine.
2) CHUNKING WITH METHOD .chunk ALONG THE SPATIAL DIMENSIONS (longitude, latitude) HAVING THE TIME DIMENSION UNCHUNKED.
data2 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested') \
          .chunk({'time': -1, 'longitude': 400, 'latitude': 200})
data2.t2m.data
As this image shows, the chunking now apparently looks exactly the same as in 1). However...
with ProgressBar():
    data2.std('time').compute()
[##################################### ] | 93% Completed | 1min 50.8s
...the computation of the std could not finish; the Jupyter notebook kernel died without any message because the memory limit was exceeded, as I could check by monitoring with htop... This likely implies that the chunking was not actually taking place and that the whole unchunked dataset was being loaded into memory.
3) CHUNKING WHEN CALLING xarray.open_mfdataset ALONG THE SPATIAL DIMENSIONS (longitude, latitude) AND LEAVING THE TIME DIMENSION CHUNKED BY DEFAULT (ONE CHUNK PER FILE).
In theory this case should be much slower than 1), since the computation of the std is done along the time dimension and thus many more chunks are generated unnecessarily (421420 chunks now vs 90 chunks in (1)).
data3 = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                          concat_dim='time', combine='nested',
                          chunks={'longitude': 400, 'latitude': 200})
data3.t2m.data
with ProgressBar():
    data3.std('time').compute()
[########################################] | 100% Completed | 5min 51.2s
However, there are no memory problems, and the computation takes almost the same amount of time as in case 1). This again suggests that the .chunk method is not working properly.
Does anyone know if this makes sense, or how to solve this issue? I need to be able to change the chunking depending on the specific computation I have to do.
Thanks
PS: I'm using xarray version 0.15.1
I would need to be able to change the chunking depending on the specific computation I need to do.
Yes, computations will be highly sensitive to chunk structure.
Chunking as early as possible in a computation (ideally when you're reading in the data) is best, because it keeps the overall computation simpler.
In general I recommend larger chunk sizes. See https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
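In practice that means specifying (large) chunks directly in open_mfdataset, as in your working case 1), rather than relying on a later .chunk call to fix the layout. A rough sketch of that pattern, reusing the path from the question; the chunk sizes here are only illustrative, not tuned values:
import xarray as xr
from dask.diagnostics import ProgressBar

# Chunk at read time with larger spatial chunks; each file still contributes
# one time step, so the time dimension is merged with a single rechunk.
data = xr.open_mfdataset('/data/cds_downloads/2m_temperature/*.nc',
                         concat_dim='time', combine='nested',
                         chunks={'longitude': 900, 'latitude': 900})
data = data.chunk({'time': -1})   # one chunk along time, as in case 1)

with ProgressBar():
    result = data.std('time').compute()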

NumPy memmap slow loading small chunk from large file on first read only

I am using NumPy memmap to load a small amount of data from various locations throughout a large binary file (memmap'd, reshaped, flipped around, and then around 2000x1000 points loaded from around a 2 GB binary file). There are five 2 GB files each with its own memory map object.
The memory maps are created all very quickly. And the slice of data from the first several files pulls out very quickly. But, then, it suddenly stops on the fourth and fifth file. Memory usage remains low, so, it does not appear to be reading the whole file into memory, but, I/O access from the process is high. It could easily take ten or fifteen minutes for this to clear, and then everything proceeds as expected. Subsequent access through all of the memory maps is extremely rapid, including loading data that was not previously touched. Memory usage remains low throughout. Closing Python and re-running, the problem does not reoccur until reboot (caching maybe?).
I'm on Windows 10 with Python 2.7. Any thoughts for troubleshooting?
EDIT: There was a request in the comments for the file format type and example code. Unfortunately, I cannot provide exact details; however, I can say this much: the file format contains just int16 binary values for a 3D array which can be reshaped to [n1, n2, n3], where n* is the length of each dimension. However, the files are split at 2 GB. So they are loaded in like this:
memmaps = []
for filename in filelist:
    memmaps.append(np.memmap(filename, dtype=np.int16, mode='r'))
    memmaps[-1] = memmaps[-1].reshape([len(memmaps[-1])/n2/n3, n2, n3])
    memmaps[-1] = np.transpose(memmaps[-1], [2,1,0])
This certainly isn't the cleanest code in the world, but it generally works, except for this seemingly random slow down. The user has a slider which allows them to plot a slice from this array as
image = np.zeros([n2, n1], dtype=np.int16)
#####
c = 0
for d in memmaps:
    image[:,c:(c+d.shape[2])] = d[slice,:,:]
    c = c + d.shape[2]
I'm leaving out a lot of detail, but I think this captures the most relevant information.
EDIT 2: Also, I am open to alternative approaches to handling this problem. My end goal is real time interactive plotting of an arbitrary and relatively small chunk of 2D data as an image from a large 3D dataset that may be split across multiple binary files. I'm presently using pyqtgraph with fairly reasonable results, except for this random problem.

Fast slicing .h5 files using h5py

I am working with .h5 files and have little experience with them.
In a script I wrote, I load data from an .h5 file. The shape of the resulting array is [3584, 3584, 75], where 3584 is the number of pixels and 75 is the number of time frames. Loading the data and printing the shape takes 180 ms (measured using os.times()).
If I now want to look at the data at a specific time frame I use the following piece of code:
data_1 = data[:, :, 1]
The slicing takes a lot of time (1.76 s). I understand that my 2D array is huge, but at some point I would like to loop over time, and that will take very long since I'm performing this slice inside the for loop.
Is there a more effective/less time consuming way of slicing the time frames or handling this type of data?
Thank you!
Note: I'm making assumptions here since I'm unfamiliar with .h5 files and the Python code that accesses them.
I think that what is happening is that when you "load" the array, you're not actually loading an array. Instead, I think that an object is constructed on top of the file. It probably reads in dimensions and information related to how the file is organized, but it doesn't read the whole file.
That object mimics an array so well that when you later perform the slice operation, the normal Python slice operation can be executed, but at that point the actual data is being read. That's why the slice takes so long compared to "loading" all the data.
I arrive at this conclusion because of the following.
If you're reading 75 frames of 3584x3584 pixels, I'm assuming they're uncompressed (H5 seems to be just raw dumps of data), and in that case 75 * 3584 * 3584 = 963,379,200, which is around 918 MB of data. Couple that with you "reading" this in 180 ms and we get this calculation:
918 MB / 180 ms = 5.1 GB/second reading speed
Note that this number is for 1-byte pixels, which is also unlikely.
This speed thus seems highly unlikely, as even the best SSDs today stay well below 1 GB/sec.
It seems much more plausible that an object is just constructed on top of the file, and the slice operation incurs the cost of reading at least one frame's worth of data.
If we divide the speed by 75 to get the per-frame speed, we get 68 MB/sec for 1-byte pixels, and with 24- or 32-bit pixels we get up to 270 MB/sec reading speed. Much more plausible.
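A quick way to check this hypothesis is to time the open and the slice separately: with h5py, opening the file and getting the dataset only reads metadata, and the actual disk I/O happens at the slice. A minimal sketch, where the file and dataset names are placeholders rather than anything taken from the question:
import time
import h5py

with h5py.File("frames.h5", "r") as f:
    t0 = time.time()
    dset = f["data"]              # lazy handle: only metadata is read here
    print("open dataset: %.3f s, shape %s" % (time.time() - t0, dset.shape))

    t0 = time.time()
    frame = dset[:, :, 1]         # the real disk read happens at this slice
    print("read one frame: %.3f s" % (time.time() - t0))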

Constructing high resolution images in Python

Say I have a huge amount of data stored in an HDF5 data file (size: 20k x 20k, if not more) and I want to create an image from all of it using Python. Obviously, this much data cannot be opened and held in memory without an error. Therefore, is there some other library or method that would not require all of the data to be dumped into memory and then processed into an image (the way libraries such as Image, matplotlib, and numpy handle it)?
Thanks.
This question comes from a similar question I asked: Generating pcolormesh images from very large data sets saved in H5 files with Python. But I think that the question I posed here covers a broader range of applications.
EDIT (7.6.2013)
Allow me to clarify my question further: In the first question (the link), I was using the easiest method I could think of to generate an image from a large collection of data stored in multiple files. This method was to import the data, generate a pcolormesh plot using matplotlib, and then save a high resolution image from this plot. But there are obvious memory limitations to this approach. I can only import about 10 data sets from the files before I reach a memory error.
In that question, I was asking if there is a better method to patch together the data sets (that are saved in HDF5 files) into a single image without importing all of the data into the memory of the computer. (I will likely require 100s of these data sets to be patched together into a single image.) Also, I need to do everything in Python to make it automated (as this script will need to be run very often for different data sets).
The real question I discovered while trying to get this to work using various libraries is: How can I work with high resolution images in Python? For example, if I have a very high resolution PNG image, how can I manipulate it with Python (crop, split, run through an fft, etc.)? In my experience, I have always run into memory issues when trying to import high resolution images (think ridiculously high resolution pictures from a microscope or telescope (my application is a microscope)). Are there any libraries designed to handle such images?
Or, conversely, how can I generate a high resolution image from a massive amount of data saved in a file with Python? Again the data file could be arbitrarily large (5-6 Gigabytes if not larger).
But in my actual application, my question is: Is there a library or some kind of technique that would allow me to take all of the data sets that I receive from my device (which are saved in HDF5) and patch them together to generate an image from all of them? Or I could save all of the data sets in a single (very large) HDF5 file. Then how could I import this one file and then create an image from its data?
I do not care about displaying the data in some interactive plot. The resolution of the plot is not important. I can easily use a lower resolution for it, but I must be able to generate and save a high resolution image from the data.
Hope this clarifies my question. Feel free to ask any other questions about my question.
You say it "obviously can't be stored in memory", but the following calculations say otherwise.
20,000 * 20,000 pixels * 4 channels = 1.6GB
Most reasonably modern computers have 8GB to 16GB of memory so handling 1.6GB shouldn't be a problem.
However, in order to handle the patchworking you need to do, you could stream each pixel from one file into the other. This assumes the format is a lossless bitmap using a linear encoding such as BMP or TIFF: simply read each file and append it to your result file.
You may need to get a bit clever if the files are different sizes or are patched together in some kind of grid. In that case, you'd need to calculate the total dimensions of the resulting image and offset the file-writing pointer accordingly.
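A minimal sketch of that tiling idea, under the assumption that each HDF5 file holds one tile in a dataset called "data" and that the tiles form a regular grid; the file names, the dataset name, and the grid layout are made up for illustration. The output is memory-mapped so the full mosaic never has to sit in RAM at once:
import numpy as np
import h5py

tile = 5000                               # hypothetical tile size: each file holds a 5000x5000 block
grid = [["tile_0_0.h5", "tile_0_1.h5"],   # hypothetical 2x2 grid of tile files
        ["tile_1_0.h5", "tile_1_1.h5"]]

# Memory-mapped output file: written to disk, not held in RAM
mosaic = np.memmap("mosaic.dat", dtype=np.uint8, mode="w+",
                   shape=(tile * len(grid), tile * len(grid[0])))

for i, row in enumerate(grid):
    for j, fname in enumerate(row):
        with h5py.File(fname, "r") as f:
            # copy this tile into its offset in the big image
            mosaic[i*tile:(i+1)*tile, j*tile:(j+1)*tile] = f["data"][:]

mosaic.flush()  # the mosaic can later be read back block by block and saved as an image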

Handling very large netCDF files in python

I am trying to work with data from very large netCDF files (~400 GB each). Each file has a few variables, all much larger than the system memory (e.g. 180 GB vs 32 GB of RAM). I am trying to use numpy and netCDF4-python to do some operations on these variables by copying one slice at a time and operating on that slice. Unfortunately, it is taking a really long time just to read each slice, which is killing the performance.
For example, one of the variables is an array of shape (500, 500, 450, 300). I want to operate on the slice [:,:,0], so I do the following:
import netCDF4 as nc
f = nc.Dataset('myfile.ncdf','r+')
myvar = f.variables['myvar']
myslice = myvar[:,:,0]
But the last step takes a really long time (~5 min on my system). If, for example, I saved a variable of shape (500, 500, 300) in the netCDF file, then a read operation of the same size takes only a few seconds.
Is there any way I can speed this up? An obvious path would be to transpose the array so that the indices I am selecting come first. But for such a large file this would not be possible to do in memory, and it seems even slower to attempt it given that a simple operation already takes a long time. What I would like is a quick way to read a slice of a netCDF file, in the fashion of Fortran's get_vara interface, or some way of efficiently transposing the array.
You can transpose netCDF variables too large to fit in memory by using the nccopy utility, which is documented here:
http://www.unidata.ucar.edu/netcdf/docs/guide_nccopy.html
The idea is to "rechunk" the file by specifying what shapes of chunks (multidimensional tiles) you want for the variables. You can specify how much memory to use as a buffer and how much to use for chunk caches, but it's not clear how to split memory optimally between these uses, so you may have to just try some examples and time them. Rather than completely transposing a variable, you probably want to "partially transpose" it, by specifying chunks that have a lot of data along the two big dimensions of your slice and only a few values along the other dimensions.
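If you'd rather stay in Python instead of shelling out to nccopy, the same rechunking idea can be sketched with netCDF4-python's chunksizes argument to createVariable. This is only an illustration under assumptions taken from the question (variable name, shape, and chunk shape), not a drop-in replacement for nccopy:
import netCDF4 as nc

src = nc.Dataset('myfile.ncdf', 'r')
dst = nc.Dataset('myfile_rechunked.nc', 'w')

var = src.variables['myvar']                # assumed shape (500, 500, 450, 300)
for name, dim in src.dimensions.items():
    dst.createDimension(name, len(dim))

# Chunks are large along the first two dimensions and thin along the third,
# so a [:, :, k] slice touches far fewer chunks.
out = dst.createVariable('myvar', var.dtype, var.dimensions,
                         chunksizes=(100, 100, 1, 300))

for i in range(var.shape[0]):               # copy one contiguous slab at a time to limit RAM use
    out[i, :, :, :] = var[i, :, :, :]

src.close()
dst.close()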
This is a comment, not an answer, but I can't comment on the above, sorry.
I understand that you want to process myvar[:,:,i], with i in range(450). In that case, you are going to do something like:
for i in range(450):
    myslice = myvar[:,:,i]
    do_something(myslice)
and the bottleneck is in accessing myslice = myvar[:,:,i]. Have you tried comparing how long it takes to access moreslices = myvar[:,:,0:n]? That would be contiguous data, and maybe you can save time that way. You would choose n as large as your memory allows, then process the next chunk of data, moreslices = myvar[:,:,n:2*n], and so on.
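A minimal sketch of that blocked-read pattern, keeping the placeholders from the answer above (do_something stands for the per-slice processing, and n should be chosen so one block fits comfortably in RAM):
import netCDF4 as nc

f = nc.Dataset('myfile.ncdf', 'r')
myvar = f.variables['myvar']

n = 10                                    # number of slices per block; tune to your RAM
for start in range(0, 450, n):
    block = myvar[:, :, start:start+n]    # one larger read instead of n small strided ones
    for i in range(block.shape[2]):
        do_something(block[:, :, i])      # same per-slice work as in the loop above
f.close()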
