Fast slicing .h5 files using h5py

Fast slicing .h5 files using h5py - python

I am working with .h5 files with little experience.
In a script I wrote I load in data from an .h5 file. The shape of the resulting array is: [3584, 3584, 75]. Here the values 3584 denotes the number of pixels, and 75 denotes the number of time frames. Loading the data and printing the shape takes 180 ms. I obtain this time using os.times().
If I now want to look at the data at a specific time frame I use the following piece of code:
data_1 = data[:, :, 1]
The slicing takes up a lot of time (1.76 s). I understand that my 2D array is huge but at some point I would like to loop over time which will take very long as I'm performing this slice within the for loop.
Is there a more effective/less time consuming way of slicing the time frames or handling this type of data?
Thank you!

Note: I'm making assumptions here since I'm unfamiliar with .H5 files and the Python code the accesses them.
I think that what is happening is that when you "load" the array, you're not actually loading an array. Instead, I think that an object is constructed on top of the file. It probably reads in dimensions and information related to how the file is organized, but it doesn't read the whole file.
That object mimicks an array so good that when you later on perform the slice operation, the normal Python slice operation can be executed, but at this point the actual data is being read. That's why the slice takes so long time compared to "loading" all the data.
I arrive at this conclusion because of the following.
If you're reading 75 frames of 3584x3584 pixels, I'm assuming they're uncompressed (H5 seems to be just raw dumps of data), and in that case, 75 * 3.584 * 3.584 = 963.379.200, this is around 918MB of data. Couple that with you "reading" this in 180ms, we get this calculation:
918MB / 180ms = 5.1GB/second reading speed
Note, this number is for 1-byte pixels, which is also unlikely.
This speed thus seems highly unlikely, as even the best SSDs today reach way below 1GB/sec.
It seems much more plausible that an object is just constructed on top of the file and the slice operation incurs the cost of reading at least 1 frame worth of data.
If we divide the speed by 75 to get per-frame speed, we get 68MB/sec speed for 1-byte pixels, and with 24 or 32-bit pixels we get up to 270MB/sec reading speeds. Much more plausible.

Related

numpy.load gets stuck, how to debug?

I have a function that generates data as a numpy array, but first checks if the data was previously generated and just loads it.
if os.path.exists(SUBJECT_DATA_PATH):
print('Found prepared patches. Loading..')
vol = np.load(SUBJECT_DATA_PATH)
return vol
This is fast when dealing with ~10 arrays it must load, but for some reason, when I increase the number around 50, the loading gets increasingly slow, up to a point where it seems to freeze...
The size of the arrays is very manageable and I can generate the 50 arrays without any problem. If I don't load the data, and generate it on the fly instead, it is done in seconds. There seem to be something about np.load() that gets VERY slow quickly...
Any obvious hints?

NumPy memmap slow loading small chunk from large file on first read only

I am using NumPy memmap to load a small amount of data from various locations throughout a large binary file (memmap'd, reshaped, flipped around, and then around 2000x1000 points loaded from around a 2 GB binary file). There are five 2 GB files each with its own memory map object.
The memory maps are created all very quickly. And the slice of data from the first several files pulls out very quickly. But, then, it suddenly stops on the fourth and fifth file. Memory usage remains low, so, it does not appear to be reading the whole file into memory, but, I/O access from the process is high. It could easily take ten or fifteen minutes for this to clear, and then everything proceeds as expected. Subsequent access through all of the memory maps is extremely rapid, including loading data that was not previously touched. Memory usage remains low throughout. Closing Python and re-running, the problem does not reoccur until reboot (caching maybe?).
I'm on Windows 10 with Python 2.7. Any thoughts for troubleshooting?
EDIT: There was a request in the comments for file format type and example code. Unfortunately, I cannot provide exact details; however, I can say this much. The file format contains just int16 binary values for a 3D array which can be reshaped by the format [n1, n2, n3] where n* are the length for each dimension. However, the files are split at 2GB. So, they are loaded in like this:
memmaps = []
for filename in filelist:
memmaps.append(np.memmap(filename, dtype=np.int16, mode='r'))
memmaps[-1] = memmaps[-1].reshape([len(memmaps[-1])/n2/n3, n2, n3])
memmaps[-1] = np.transpose(memmaps[-1], [2,1,0])
This certainly isn't the cleanest code in the world, but it generally works, except for this seemingly random slow down. The user has a slider which allows them to plot a slice from this array as
image = np.zeros([n2, n1], dtype=np.int16)
#####
c = 0
for d in memmaps:
image[:,c:(c+d.shape[2])] = d[slice,:,:]
c = c + d.shape[2]
I'm leaving out a lot of detail, but I think this captures the most relevant information.
EDIT 2: Also, I am open to alternative approaches to handling this problem. My end goal is real time interactive plotting of an arbitrary and relatively small chunk of 2D data as an image from a large 3D dataset that may be split across multiple binary files. I'm presently using pyqtgraph with fairly reasonable results, except for this random problem.

hdf5 Matrix Reading with python

I have a huge sequence (1000000) of small matrices (32x32) stored in a hdf5 file, each one with a label.
Each of this matrices represent a sensor data for a specific time.
I want to obtain the evolution for each pixel in for a small time slice, different for each x,y position in the matrix.
This is taking more time than I expect.
def getPixelSlice (self,xpixel,ypixel,initphoto,endphoto):
#obtain h5 keys inside time range between initphoto and endphoto
valid=np.where(np.logical_and(self.photoList>=initphoto,self.photoList<endphoto))
#look at pixel data in valid frames
evolution = []
#for each valid frame, obtain the data, and append the target pixel to the list.
for frame in valid[0]:
data = self.h5f[str(self.photoList[frame])]
evolution.append(data[ypixel][xpixel])
return evolution,valid

So, there is a problem here that took me a while to sort out for a similar application. Due to the physical limitations of hard drives, the data are stored in such a way that with a three dimensional array it will always be easier to read in one orientation than another. It all depends on what order you stored the data in.
How you handle this problem depends on your application. My specific application can be characterized as "write few, read many". In this case, it makes the most sense to store the data in the order that I expect to read it. To do this, I use PyTables and specify a "chunkshape" that is the same as one of my timeseries. So, in your case it would be (1,1,1000000). I'm not sure if that size is too large or not, though, so you may need to break it down a bit farther, say (1,1,10000) or something like that.
For more info see PyTables Optimization Tips.
For applications where you intend to read in a specific orientation many times, it is crucial that you choose an appropriate chuck shape for your HDF5 arrays.

Store large dictionary to file in Python

I have a dictionary with many entries and a huge vector as values. These vectors can be 60.000 dimensions large and I have about 60.000 entries in the dictionary. To save time, I want to store this after computation. However, using a pickle led to a huge file. I have tried storing to JSON, but the file remains extremely large (like 10.5 MB on a sample of 50 entries with less dimensions). I have also read about sparse matrices. As most entries will be 0, this is a possibility. Will this reduce the filesize? Is there any other way to store this information? Or am I just unlucky?
Update:
Thank you all for the replies. I want to store this data as these are word counts. For example, when given sentences, I store the amount of times word 0 (at location 0 in the array) appears in the sentence. There are obviously more words in all sentences than appear in one sentence, hence the many zeros. Then, I want to use this array tot train at least three, maybe six classifiers. It seemed easier to create the arrays with word counts and then run the classifiers over night to train and test. I use sklearn for this. This format was chosen to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go, in this case, please let me know. I am very much aware that I have much to learn in coding efficiently!
I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).
Update 2:
Thank you all for the tips. John Mee was right by not needing to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!

See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file.
Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
If you are using numpy arrays, it can be very efficient, as both klepto and joblib understand how to use minimal state representation for an array. If you indeed have most elements of the arrays as zeros, then by all means, convert to sparse matrices... and you will find huge savings in storage size of the array.
As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) -- where HDF5 is coming soon. It can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed).
klepto gives you the option to store the dictionary with "all-in-one" file or "one-entry-per" file, and also can leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel.

With 60,000 dimensions do you mean 60,000 elements? if this is the case and the numbers are 1..10 then a reasonably compact but still efficient approach is to use a dictionary of Python array.array objects with 1 byte per element (type 'B').
The size in memory should be about 60,000 entries x 60,000 bytes, totaling 3.35Gb of data.
That data structure is pickled to about the same size to disk too.

Handling very large netCDF files in python

I am trying to work with data from very large netCDF files (~400 Gb each). Each file has a few variables, all much larger than the system memory (e.g. 180 Gb vs 32 Gb RAM). I am trying to use numpy and netCDF4-python do some operations on these variables by copying a slice at a time and operating on that slice. Unfortunately, it is taking a really long time just to read each slice, which is killing the performance.
For example, one of the variables is an array of shape (500, 500, 450, 300). I want to operate on the slice [:,:,0], so I do the following:
import netCDF4 as nc
f = nc.Dataset('myfile.ncdf','r+')
myvar = f.variables['myvar']
myslice = myvar[:,:,0]
But the last step takes a really long time (~5 min on my system). If for example I saved a variable of shape (500, 500, 300) on the netcdf file, then a read operation of the same size will take only a few seconds.
Is there any way I can speed this up? An obvious path would be to transpose the array so that the indices that I am selecting would come up first. But in such a large file this would not be possible to do in memory, and it seems even slower to attempt it given that a simple operation already takes a long time. What I would like is a quick way to read a slice of a netcdf file, in the fashion of the Fortran's interface get_vara function. Or some way of efficiently transposing the array.

You can transpose netCDF variables too large to fit in memory by using the nccopy utility, which is documented here:
http://www.unidata.ucar.edu/netcdf/docs/guide_nccopy.html
The idea is to "rechunk" the file by specifying what shapes of chunks (multidimensional tiles)
you want for the variables. You can specify how much memory to use as a buffer and how much to
use for chunk caches, but it's not clear how to use memory optimally between these uses, so you
may have to just try some examples and time them. Rather than completely transpose a variable,
you probably want to "partially transpose" it, by specifying chunks that have a lot of data along
the 2 big dimensions of your slice and have only a few values along the other dimensions.

This is a comment, not an answer, but I can't comment on the above, sorry.
I understand that you want to process myvar[:,:,i], with i in range(450). In that case, you are going to do something like:
for i in range(450):
myslice = myvar[:,:,i]
do_something(slice)
and the bottleneck is in accessing myslice = myvar[:,:,i]. Have you tried comparing how long it takes to access moreslices = myvar[:,:,0:n]? It would be contiguos data, and maybe you can save time with that. You would choose n as large as your memory affords it, and then process the next chunk of data moreslices = myvar[:,:,n:2n] and so on.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.