HDF5 matrix reading with Python

I have a huge sequence (1,000,000) of small matrices (32x32) stored in an HDF5 file, each one with a label.
Each of these matrices represents the sensor data for a specific time.
I want to obtain the evolution of each pixel over a small time slice, different for each x,y position in the matrix.
This is taking more time than I expected.
def getPixelSlice(self, xpixel, ypixel, initphoto, endphoto):
    # obtain h5 keys inside the time range between initphoto and endphoto
    valid = np.where(np.logical_and(self.photoList >= initphoto, self.photoList < endphoto))

    # look at pixel data in valid frames
    evolution = []

    # for each valid frame, obtain the data and append the target pixel to the list
    for frame in valid[0]:
        data = self.h5f[str(self.photoList[frame])]
        evolution.append(data[ypixel][xpixel])

    return evolution, valid

So, there is a problem here that took me a while to sort out for a similar application. Due to the physical limitations of hard drives, the data are stored in such a way that, with a three-dimensional array, it will always be easier to read in one orientation than another. It all depends on what order you stored the data in.
How you handle this problem depends on your application. My specific application can be characterized as "write few, read many". In this case, it makes the most sense to store the data in the order that I expect to read it. To do this, I use PyTables and specify a "chunkshape" that matches one of my time series. So, in your case it would be (1,1,1000000). I'm not sure whether that size is too large, though, so you may need to break it down a bit further, say to (1,1,10000) or something like that.
For more info see PyTables Optimization Tips.
For applications where you intend to read in a specific orientation many times, it is crucial that you choose an appropriate chunk shape for your HDF5 arrays.
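For illustration, here is a minimal sketch of that layout with PyTables (the file name, array name, and query values are placeholders, not taken from the question): the frames are stored as one (y, x, t) CArray whose chunks run along the time axis, so one pixel's time slice becomes a contiguous read.

import tables

NFRAMES, NY, NX = 1_000_000, 32, 32

with tables.open_file("frames.h5", mode="w") as h5f:
    # chunk along the time axis so a single pixel's time series touches few chunks
    stack = h5f.create_carray(
        h5f.root, "stack",
        atom=tables.Float32Atom(),
        shape=(NY, NX, NFRAMES),
        chunkshape=(1, 1, 10_000),
    )
    # ... fill stack[:, :, t] = frame as each 32x32 frame arrives ...

ypixel, xpixel, initphoto, endphoto = 7, 5, 1000, 2000   # example query
with tables.open_file("frames.h5", mode="r") as h5f:
    evolution = h5f.root.stack[ypixel, xpixel, initphoto:endphoto]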

Related

Access one chunk in Zarr

Zarr saves an array on disk in chunks; each chunk is a separate file. Is there a way to access only one chosen chunk (file)?
Can it be determined which chunks are empty without loading the whole array into memory?
I'm not aware of any way to find chunk size except hitting the FS yourself. Zarr abstracts over that. Maybe you'll have to explain what you're up to.
The project I'm currently working on uses Zarr to store meteorological data. We keep the data in a three-dimensional array of shape (t, x, y). Alongside the data, we have an array of shape (t), effectively a bitmask that records which slots are filled. So when data comes in, we write
data[t] = [...]
ready[t] = 1
That way, when querying for data, we know at which time slots to expect data and which slots are empty.
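A rough sketch of that pattern with the zarr v2 API (the store path, shapes, and chunking below are illustrative, not the project's actual layout):

import numpy as np
import zarr

root = zarr.open_group("weather.zarr", mode="a")
data = root.require_dataset("data", shape=(1000, 64, 64), chunks=(10, 64, 64), dtype="f4")
ready = root.require_dataset("ready", shape=(1000,), chunks=(1000,), dtype="u1")

def write_slot(t, frame):
    data[t] = frame     # store the (x, y) field for time slot t
    ready[t] = 1        # mark the slot as filled

def filled_slots():
    return np.nonzero(ready[:])[0]   # time slots that actually contain data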
It's possible to see which chunks are filled by looking at the keys method of the underlying chunk_store; only chunks that have been written will appear as keys.
The corresponding values of these keys contain the data of that chunk, but it will be compressed. If you want more than that, I would encourage you to raise an issue over at the Zarr repo.
I don't think there is a general solution to know which chunks are initialized for any storage type, but for DirectoryStore it is possible to list the filesystem to find out which chunks are initialized. This is how zarr does it to compute the nchunks_initialized property.
I suppose you could take some inspiration from there to list all initialized chunks and then compute which slice each of them corresponds to in the array.
While zarr has no object representing a single chunk, you can compute each chunk's beginning and end along every axis from the array dimensions and chunk dimensions. If you want to load the chunks one by one for efficiency reasons, you can compute their indices and slice the zarr Array to get a numpy array as a working area.
Since I had similar needs, I built some function as helpers to do just that, you can look them up at https://github.com/maxime915/msi_zarr_analysis/blob/126c1115bd43e8813d2f002673491c6ef25e37db/msi_zarr_analysis/utils/iter_chunks.py if you want some inspiration.
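For example, a small helper along those lines (a sketch, not the functions from the linked repository) can enumerate chunk-aligned slices from an array's .shape and .chunks:

import itertools

import zarr

def iter_chunk_slices(arr):
    # one range of chunk-start offsets per axis
    ranges = (range(0, size, chunk) for size, chunk in zip(arr.shape, arr.chunks))
    for starts in itertools.product(*ranges):
        yield tuple(slice(s, min(s + c, size))
                    for s, c, size in zip(starts, arr.chunks, arr.shape))

z = zarr.zeros((100, 37), chunks=(32, 16), dtype="f8")
for sl in iter_chunk_slices(z):
    block = z[sl]   # loads exactly one chunk-aligned region as a numpy array
    # ... process block ...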

Proper method to save serialized data incrementally

This must be a very standard problem that also must have a standard solution:
What is the correct way to incrementally save feature vectors extracted from data, rather than accumulating all vectors from the entire dataset and then saving them all at once?
In more detail:
I have written a script for extracting custom text features (e.g. next_token, prefix-3, is_number) from text documents. After extraction is done, I end up with one big list of scipy sparse vectors. Finally, I pickle that list so I can store it space-efficiently and load it time-efficiently when I want to train a model. But the problem is that I am limited by my RAM here. I can make that list of vectors only so big before it, or the pickling step, exceeds my RAM.
Of course, incrementally appending string representations of these vectors would be possible. One could accumulate k vectors, append them to a text file, and clear the list again for the next k vectors. But storing the vectors as strings would be space-inefficient and would require parsing the representations upon loading. That does not sound like a good solution.
I could also pickle sets of k vectors and end up with a whole bunch of pickle files of k vectors each. But that sounds messy.
So this must be a standard problem with a more elegant solution. What is the right method to solve this? Is there maybe even some existing functionality in scikit-learn for this kind of thing that I have overlooked?
I found this: How to load one line at a time from a pickle file?
But it does not work with Python 3.
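One workable pattern (a sketch, not necessarily the canonical answer) is to append each batch of k vectors to a single file with repeated pickle.dump calls and stream them back with repeated pickle.load, which does work under Python 3:

import pickle

def append_batch(path, batch):
    # append one batch of k vectors to the end of a single file
    with open(path, "ab") as f:
        pickle.dump(batch, f, protocol=pickle.HIGHEST_PROTOCOL)

def iter_batches(path):
    # stream the batches back one at a time, so only one batch is in RAM
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

During extraction you would call append_batch("features.pkl", vectors[i:i+k]) and clear the list; during training you iterate with for batch in iter_batches("features.pkl").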

Store large dictionary to file in Python

I have a dictionary with many entries and a huge vector as each value. These vectors can be 60,000 dimensions large, and I have about 60,000 entries in the dictionary. To save time, I want to store this after computation. However, using pickle led to a huge file. I have tried storing to JSON, but the file remains extremely large (around 10.5 MB for a sample of 50 entries with fewer dimensions). I have also read about sparse matrices. As most entries will be 0, this is a possibility. Will this reduce the file size? Is there any other way to store this information? Or am I just unlucky?
Update:
Thank you all for the replies. I want to store this data because these are word counts. For example, given a set of sentences, I store the number of times word 0 (at location 0 in the array) appears in each sentence. There are obviously more words overall than appear in one sentence, hence the many zeros. Then I want to use this array to train at least three, maybe six, classifiers. It seemed easier to create the arrays with word counts and then run the classifiers overnight to train and test. I use sklearn for this. This format was chosen to be consistent with other feature-vector formats, which is why I am approaching the problem this way. If this is not the way to go in this case, please let me know. I am very much aware that I have much to learn about coding efficiently!
I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).
Update 2:
Thank you all for the tips. John Mee was right that I did not need to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!
See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file.
Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
If you are using numpy arrays, it can be very efficient, as both klepto and joblib understand how to use minimal state representation for an array. If you indeed have most elements of the arrays as zeros, then by all means, convert to sparse matrices... and you will find huge savings in storage size of the array.
As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) -- where HDF5 is coming soon. It can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed).
klepto gives you the option to store the dictionary as an "all-in-one" file or as one file per entry, and it can also leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel.
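To illustrate the sparse-matrix suggestion concretely (using scipy rather than klepto; the variable names and file name here are just examples), the count vectors can be stacked into a CSR matrix and saved in scipy's compressed .npz format, which only stores the nonzero entries:

import numpy as np
from scipy import sparse

# stand-in for the real dictionary: mostly-zero word-count vectors
counts = {0: np.zeros(60000), 1: np.zeros(60000)}
counts[0][17] = 3

keys = sorted(counts)
matrix = sparse.csr_matrix(np.vstack([counts[k] for k in keys]))
sparse.save_npz("wordcounts.npz", matrix)    # only the nonzero entries are stored

matrix2 = sparse.load_npz("wordcounts.npz")  # rows follow the order of `keys`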
With 60,000 dimensions, do you mean 60,000 elements? If this is the case and the numbers are 1..10, then a reasonably compact but still efficient approach is to use a dictionary of Python array.array objects with 1 byte per element (type 'B').
The size in memory should be about 60,000 entries x 60,000 bytes, totaling about 3.35 GiB of data.
That data structure pickles to about the same size on disk, too.
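A minimal sketch of that idea (keys and sizes are illustrative): each vector is an array.array of unsigned bytes, so a 60,000-element vector costs about 60 kB both in memory and when pickled.

import array
import pickle

counts = {}
counts["sentence-0"] = array.array("B", [0] * 60000)   # 60,000 one-byte counters
counts["sentence-0"][17] = 3

with open("counts.pkl", "wb") as f:
    pickle.dump(counts, f, protocol=pickle.HIGHEST_PROTOCOL)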

Direct access to a single pixel using Python

Is there any way with Python to directly get (only get, not modify) a single pixel (to read its RGB color) from an image (in a compressed format if possible) without having to load the whole image into RAM or process it (to spare the CPU)?
More details:
My application is meant to have a huge database of images, and only of images.
So what I chose is to directly store images on harddrive, this will avoid the additional workload of a DBMS.
However I would like to optimize some more, and I'm wondering if there's a way to directly access a single pixel from an image (the only action on images that my application does), without having to load it in memory.
Does PIL pixel access allow that? Or is there another way?
The encoding of the images is my own choice, so I can change it whenever I want. Currently I'm using PNG or JPG. I can also store them raw, but I would prefer to keep the images a bit compressed if possible. But I think hard drives are cheaper than CPU and RAM, so even if the images must stay raw in order to do that, I think it's still a better bet.
Thank you.
UPDATE
So, as I feared, it seems that it's impossible to do with variable compression formats such as PNG.
I'd like to refine my question:
Is there a constant compression format (not necessarily specific to an image format; I'll access it programmatically) that would allow accessing any part by just reading the headers?
Technically, how do I efficiently (read: fast and non-blocking) access a byte inside a file with Python?
SOLUTION
Thanks to all. I have successfully implemented the functionality I described by using run-length encoding on every row and padding every row to the length of the longest row.
This way, by prepending a header that describes the fixed number of columns for each row, I can easily access a row by first calling file.readline() to get the header data, then file.seek(headersize + fixedsize*y, 0), where y is the currently selected row.
Files are compressed, and in memory I only fetch a single row. My application doesn't even need to decompress it, because I can compute exactly where the pixel is by just iterating over the RLE values. So it is also very easy on CPU cycles.
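A rough sketch of that access pattern (the header format and row encoding here are assumptions, not the asker's exact layout):

def read_row(path, y):
    with open(path, "rb") as f:
        header = f.readline()                   # e.g. b"rowsize=4096\n"
        headersize = f.tell()                   # everything after this is row data
        fixedsize = int(header.split(b"=")[1])  # fixed, padded length of every row
        f.seek(headersize + fixedsize * y, 0)   # jump straight to row y
        return f.read(fixedsize)                # one RLE-encoded, padded row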
If you want to keep a compressed file format, you can break each image up into smaller rectangles and store them separately. Using a fixed size for the rectangles will make it easier to calculate which one you need. When you need the pixel value, calculate which rectangle it's in, open that image file, and offset the coordinates to get the proper pixel.
This doesn't completely optimize access to a single pixel, but it can be much more efficient than opening an entire large image.
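A sketch of that approach with PIL (the tile size and file-naming scheme are assumptions): each image is pre-split into fixed-size tiles, and a pixel lookup only opens the one small tile that contains it.

from PIL import Image

TILE = 256   # assumed fixed tile size

def get_pixel(name, x, y):
    tx, ty = x // TILE, y // TILE                     # which tile the pixel falls in
    with Image.open(f"{name}_{ty}_{tx}.png") as tile:
        return tile.getpixel((x % TILE, y % TILE))    # offset within that tile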
In order to evaluate a file, you have to load it into memory. However, you might be able to read only parts of a file, depending on the file format. For example, the PNG format specifies a header of 8 bytes, but because of compression the chunks are of variable size. If you stored all the pixels in a raw format, you could directly access each pixel, because you can calculate its address in the file from the appropriate offset. What PNG or JPEG will do with the raw data is impossible to predict.
Depending on the structure of the files, you might be able to compute efficient hashes. There is plenty of research if you really want to get into this; for example:
"This paper introduces a novel image indexing technique that may be called an image hash function. The algorithm uses randomized signal processing strategies for a non-reversible compression of images into random binary strings, and is shown to be robust against image changes due to compression, geometric distortions, and other attacks"

Is it possible to store multidimensional arrays of arbitrary shape in a PyTables cell?

PyTables supports the creation of tables from user-defined classes that inherit from the IsDescription class. This includes support for multidimensional cells, as in the following example from the documentation:
class Particle(IsDescription):
    name        = StringCol(itemsize=16)    # 16-character string
    lati        = Int32Col()                # integer
    longi       = Int32Col()                # integer
    pressure    = Float32Col(shape=(2, 3))  # array of floats (single-precision)
    temperature = Float64Col(shape=(2, 3))  # array of doubles (double-precision)
However, is it possible to store an arbitrarily-shaped multidimensional array in a single cell? Following the above example, something like pressure = Float32Col(shape=(x, y)) where x and y are determined upon the insertion of each row.
If not, what is the preferred approach? Storing each (arbitrarily-shaped) multidimensional array in a CArray with a unique name and then storing those names in a master index table? The application I'm imagining is storing images and associated metadata, which I'd like to be able to both query and use numexpr on.
Any pointers toward PyTables best practices are much appreciated!
The long answer is "yes, but you probably don't want to."
PyTables probably doesn't support it directly, but HDF5 does support the creation of nested variable-length datatypes, allowing ragged arrays in multiple dimensions. Should you wish to go down that path, you'll want to use h5py and browse through the HDF5 User's Guide, Datatypes chapter, section 6.4.3.2.3, "Variable-length Datatypes". (I'd link it, but they apparently chose not to put anchors that deep.)
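Note that h5py's vlen_dtype (h5py 2.9+) exposes variable-length 1-D sequences rather than nested variable-length types, so a common workaround for arbitrary 2-D shapes is to store each array flattened plus its shape. A sketch of that workaround (file and dataset names are made up):

import h5py
import numpy as np

vlen_f32 = h5py.vlen_dtype(np.float32)

with h5py.File("ragged.h5", "w") as f:
    data = f.create_dataset("pressure_data", shape=(2,), dtype=vlen_f32)
    shapes = f.create_dataset("pressure_shape", shape=(2, 2), dtype="i8")
    for i, arr in enumerate([np.random.rand(2, 3), np.random.rand(5, 7)]):
        data[i] = arr.ravel().astype(np.float32)   # flattened values
        shapes[i] = arr.shape                      # original (x, y)

with h5py.File("ragged.h5", "r") as f:
    x, y = f["pressure_shape"][1]
    arr1 = f["pressure_data"][1].reshape(x, y)     # recover the 5x7 array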
Personally, the way that I would arrange the data you've got is into groups of datasets, not into a single table. That is, something like:
/particles/particlename1/pressure
/particles/particlename1/temperature
/particles/particlename2/pressure
/particles/particlename2/temperature
and so on. The lat and long values would be attributes on the /particles/particlename group rather than datasets, though having small datasets for them is perfectly fine too.
If you want to be able to do searches based on the lat and long, then having a dataset with the lat/long/name columns would be good. And if you wanted to get really fancy, there's an HDF5 datatype for references, allowing you to store a pointer to a dataset, or even to a subset of a dataset.
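A minimal sketch of that group-per-particle layout in h5py (particle names, coordinates, and array shapes are made up):

import h5py
import numpy as np

with h5py.File("particles.h5", "w") as f:
    particles = {"particlename1": (48.2, 16.4), "particlename2": (59.3, 18.1)}
    for name, (lat, lon) in particles.items():
        g = f.create_group(f"/particles/{name}")
        g.attrs["lati"] = lat                    # small scalars as group attributes
        g.attrs["longi"] = lon
        g.create_dataset("pressure", data=np.random.rand(4, 6))      # any shape
        g.create_dataset("temperature", data=np.random.rand(2, 3))   # per particle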
The short answer is "no", and I think it's a "limitation" of HDF5 rather than PyTables.
I think the reason is that each unit of storage (the compound dataset) must have a well-defined size, which it obviously cannot if one or more components can change size. Note that it is entirely possible to resize and extend a dataset in HDF5 (PyTables makes heavy use of this), but not the units of data within that array.
I suspect the best thing to do is either:
a) make it a well-defined size and provide a flag for overflow. This works well if the largest reasonable size is still pretty small and you are okay with tail events being thrown out. Note that you might be able to get rid of the unused disk space with HDF5 compression.
b) do as you suggest and create a new CArray in the same file, then just read it in when required. (To keep things tidy, you might want to put these all under their own group.)
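A sketch of option (b) with PyTables (the file name, array-naming scheme, and index-table columns are assumptions): each arbitrarily-shaped array goes into its own CArray under a group, and a small table records the name and shape so it can be queried.

import numpy as np
import tables

class ImageIndex(tables.IsDescription):
    name   = tables.StringCol(32)   # name of the CArray holding the image
    height = tables.Int32Col()
    width  = tables.Int32Col()

with tables.open_file("images.h5", "w") as h5f:
    group = h5f.create_group("/", "images")
    index = h5f.create_table("/", "index", ImageIndex)
    for i, img in enumerate([np.random.rand(48, 64), np.random.rand(32, 32)]):
        arr_name = "arr_%d" % i
        h5f.create_carray(group, arr_name, obj=img.astype(np.float32))
        row = index.row
        row["name"], row["height"], row["width"] = arr_name, img.shape[0], img.shape[1]
        row.append()
    index.flush()

with tables.open_file("images.h5", "r") as h5f:
    rec = h5f.root.index[0]                                        # look up in the master table
    img0 = h5f.root.images._f_get_child(rec["name"].decode())[:]   # arbitrary shape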
HDF5 actually has an API which is designed (and optimized) for storing images in an HDF5 file. I don't think it's exposed in PyTables.
