Slicing Ragged Array - python

I have a bunch of sliced images (32 x 32 in shape) stored each with a corresponding string id (file name and such). The slices and their ids are grouped in a bunch of arrays part of a final large array. This array is ragged and non-standard, but I'd like to be able to efficiently access the slices inside.
Let's say that I have 500 slices. The shape should be (500, 2) on the surface because the first grouping of slices and ids is not a standard array with given shape.
What I would like to be able to do is extract the sliced images themselves from the final array. Normally, I could collect everything by slicing the big array like big_array[:][:][0] but the ragged nesting has made the array appear 1D with shape (500, ).
The only way around this is to use a clunky for loop, but I'm pretty sure everything I've been doing up until this point has been a terrible way of storing the data.
I need to keep the ids associated with each slice because I'm training a model with this, and if something goes wrong, I'd like to be able to reference the origins of the slice which has undergone some processing.
The only other way around this is to store the ids and slices separately, but that is also a lot of hassle since I have to save them in separate files.
What's the correct way to store this thing?
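One possible arrangement, offered as a sketch rather than a definitive answer: keep the slices in a regular (N, 32, 32) array and the ids in a parallel 1-D string array, then write both to a single file with np.savez so that index i links a slice to its id. The file name, id format and random data below are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in data: 500 slices of 32 x 32 and one string id per slice.
rng = np.random.default_rng(0)
slices = rng.random((500, 32, 32), dtype=np.float32)
ids = np.array([f"image_{i:04d}.png" for i in range(500)])

# Both arrays go into one .npz file; the shared index links a slice to its id.
path = os.path.join(tempfile.mkdtemp(), "slices.npz")
np.savez(path, slices=slices, ids=ids)

with np.load(path) as data:
    loaded_slices = data["slices"]  # regular array, shape (500, 32, 32)
    loaded_ids = data["ids"]        # shape (500,), fixed-width unicode strings

# Ordinary slicing works again, and the id of any slice is one lookup away.
batch = loaded_slices[10:20]
origin = loaded_ids[10]
```

Because neither array has object dtype, nothing is ragged and no for loop is needed to pull out the images.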

Related

Fastest Way to Create Large Numpy Arrays?

I'm working on creating a bunch of data (10M rows or more) and am trying to understand more deeply about the best way to work with numpy arrays as quickly as possible.
I'll have 1M rows of each class of data, which I read in from different sources (async). When I'm done reading, I want to combine them into a single numpy array. I'll know the final array is 10M (precisely).
I'm thinking I have the following paths:
1. Create a global numpy array of the entire size and copy in
2. Create a global numpy array and a numpy array for each source and concat together at the end
3. Create a null global numpy array and add each row to the global array (I think this is the slowest)
However, I'm not sure how to do #1 - numpy.copyto seems to always start with index 0.
Is there another model I should be going with here?
If I use "views", I'm not clear how to copy it to the final array. I'm, of course, familiar with views for DBs, but not for numpy.
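A minimal sketch of the first option (allocate once, copy in), assuming every source delivers the same known number of rows. The sizes are scaled down and the names store_block, combined and rows_per_source are hypothetical. Slicing the destination gives a view, which is how np.copyto (or plain assignment) can target any offset rather than index 0.

```python
import numpy as np

# Hypothetical sizes, scaled down from 10 sources x 1M rows for illustration.
n_sources, rows_per_source, n_cols = 10, 1_000, 4

# Allocate the final array once; np.empty skips zero-filling.
combined = np.empty((n_sources * rows_per_source, n_cols), dtype=np.float64)

def store_block(out, source_index, block):
    """Copy one source's rows into its reserved range of `out`."""
    start = source_index * rows_per_source
    stop = start + rows_per_source
    # out[start:stop] is a view, so writing to it writes into `out` itself.
    # np.copyto(out[start:stop], block) would do the same thing.
    out[start:stop] = block

# Stand-in for data arriving from each source, in arbitrary order.
for source_index in reversed(range(n_sources)):
    block = np.full((rows_per_source, n_cols), float(source_index))
    store_block(combined, source_index, block)
```

Each source only touches its own row range, so the order in which sources finish does not matter.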

Python - pandas dataframe or array of dataclass instances for reading in data?

I'm relatively new to data analysis using Python and I'm trying to determine the most practical and useful way to read in my data so that I can index into it and use it in calculations. I have many images in the form of np.arrays that each have a corresponding set of data such as x- and y-coordinates, size, filter number, etc. I just want to make sure each set of data is grouped together with its corresponding image. My first thought was sticking the data in an np.array of dataclass instances (where each element of the array is an instance that contains all my data). My second thought was a pandas dataframe.
My gut is telling me that using a dataframe makes more sense. Do np.arrays store nicely inside dataframes? What are the pros/cons to each method and which would be best if I will need to be pulling data from them often, and I always need to make sure the data can be matched with its corresponding image?
What variables I have to read in: x_coord - float, y_coord - float, filter - int, image - np.ndarray.
I've been trying to stick the image arrays into a pandas dataframe but when indexing into it using .loc it is extremely slow to run the Jupyter Notebook cell. It was also very slow to populate the dataframe using .from_dict(). I'm guessing dataframes weren't meant to hold np.ndarrays?
My biggest concerns are the bookkeeping and ease of indexing - What can I do to always make sure I can retrieve the metadata for the corresponding image? In what form should my data be in so I can easily extract an image and its metadata, or all images with the same filter number, etc.
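One layout that may help with the bookkeeping, sketched under the assumption that all images share the same shape: keep only the scalar metadata in the dataframe and the images in a separate 3-D array, with the row position as the link between them. The values and the tiny 4 x 4 images are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 6 images of identical shape plus per-image metadata.
images = np.arange(6 * 4 * 4, dtype=np.float64).reshape(6, 4, 4)
meta = pd.DataFrame({
    "x_coord": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
    "y_coord": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
    "filter": [1, 2, 1, 3, 2, 1],
})
# Row i of `meta` describes images[i]; the default RangeIndex is the link.

# One image together with its metadata.
i = 2
image, info = images[i], meta.iloc[i]

# All images taken with filter 1: select rows, then index the image stack.
rows = meta.index[meta["filter"] == 1].to_numpy()
filter1_images = images[rows]
```

This keeps the dataframe free of object-dtype cells, which is a plausible cause of the slow .loc lookups described above. The link is positional, so it only holds while the dataframe keeps its original row order and index.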

Access one chunk in Zarr

Zarr saves an array on disk in chunks, each chunk is a separate file. Is there a way to access only one, chosen chunk (file)?
Can it be determined which chunks are empty without loading the whole array into memory?
I'm not aware of any way to find chunk size except hitting the FS yourself. Zarr abstracts over that. Maybe you'll have to explain what you're up to.
The project I'm currently working on uses Zarr to store meteorological data. We keep the data in a 3 dimensional array of shape (t, x, y). Alongside the data, we have an array of shape (t), effectively a bitmask to record which slots are filled. So when data comes in, we write
data[t] = [...]
ready[t] = 1
So when querying for data we know at what timeslots to expect data, and which slots are empty.
It's possible to see what chunks are filled by looking at the keys method of the underlying chunk_store. Only keys with data will be filled.
The corresponding values of these keys will contain the data of that chunk, but it will be compressed. If you want more than that, I would encourage you to raise an issue over at the Zarr repo.
I don't think there is a general solution to know which chunks are initialized for any storage type, but for DirectoryStore, it is possible to list the filesystem to know which chunks are initialized. This is how zarr does it to compute the nchunks_initialized property.
I suppose you could get some inspiration from there to list all initialized chunks and then compute which slice it corresponds to in the array.
While there is no object for a chunk in zarr, you can compute their beginning and end along each axis from the array dimensions and chunk dimensions. If you want to load the chunks one by one for efficiency reasons, you can compute their indices and slice the zarr Array to get a numpy array as a working area.
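The boundary computation described above can be sketched in plain Python. This is not the linked helper, only an illustration: iter_chunk_slices is a hypothetical name, and a NumPy array stands in for the zarr Array (whose shape and chunks attributes would supply the two arguments).

```python
import itertools

import numpy as np

def iter_chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk of an array with the given shape."""
    per_axis = []
    for size, step in zip(shape, chunks):
        # Chunk k along an axis covers [k * step, min((k + 1) * step, size)).
        per_axis.append(
            [slice(start, min(start + step, size)) for start in range(0, size, step)]
        )
    # The Cartesian product across axes enumerates every chunk exactly once.
    yield from itertools.product(*per_axis)

# A NumPy array stands in for the zarr Array; its shape and chunk shape
# are made-up values.
data = np.arange(5 * 4 * 6).reshape(5, 4, 6)
chunk_shape = (2, 4, 4)

for region in iter_chunk_slices(data.shape, chunk_shape):
    block = data[region]  # load only this chunk-aligned region
```

Edge chunks come out smaller than chunk_shape when the array size is not a multiple of the chunk size along an axis.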
Since I had similar needs, I built some helper functions to do just that. You can look them up at https://github.com/maxime915/msi_zarr_analysis/blob/126c1115bd43e8813d2f002673491c6ef25e37db/msi_zarr_analysis/utils/iter_chunks.py if you want some inspiration.

How to save the n-d numpy array data and read it quickly next time?

Here is my question:
I have a 3-d numpy array Data which in the shape of (1000, 100, 100).
I want to save it as a .txt or .csv file. How can I achieve that?
My first attempt was to reshape it into a 1-d array of length 1000*100*100, convert it to a pandas.DataFrame, and then save that as a .csv file.
When I wanted to load it the next time, I would reshape it back into a 3-d array.
I think there must be an easier method.
If you need to re-read it quickly into numpy, you could just use the cPickle module (in Python 3, this is the standard pickle module).
This is going to be much faster than parsing it back from an ASCII dump (although only a Python program will be able to re-read it). As a bonus, a single call can dump more than one matrix (i.e. any data structure built from core Python objects and numpy arrays).
Note that parsing a floating point value from an ASCII string is a quite complex and slow operation (if implemented correctly down to ulp).
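The answer above suggests pickle. As an alternative binary route, and not what that answer proposes, NumPy's own np.save and np.load keep the shape and dtype, so no reshaping is needed on reload. A small sketch with a scaled-down stand-in array and a temporary file path chosen for illustration:

```python
import os
import tempfile

import numpy as np

# Stand-in for the (1000, 100, 100) array from the question, scaled down.
data = np.random.default_rng(0).random((10, 100, 100))

path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, data)       # binary format; shape and dtype stored in the header
restored = np.load(path)  # comes back as a 3-d array, no reshape needed
```

Like pickle, the .npy file is not human-readable, so this only fits if the .txt or .csv requirement can be relaxed.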

Is it possible to store multidimensional arrays of arbitrary shape in a PyTables cell?

PyTables supports the creation of tables from user-defined classes that inherit from the IsDescription class. This includes support for multidimensional cells, as in the following example from the documentation:
from tables import IsDescription, StringCol, Int32Col, Float32Col, Float64Col

class Particle(IsDescription):
    name = StringCol(itemsize=16)             # 16-character string
    lati = Int32Col()                         # integer
    longi = Int32Col()                        # integer
    pressure = Float32Col(shape=(2, 3))       # array of floats (single-precision)
    temperature = Float64Col(shape=(2, 3))    # array of doubles (double-precision)
However, is it possible to store an arbitrarily-shaped multidimensional array in a single cell? Following the above example, something like pressure = Float32Col(shape=(x, y)) where x and y are determined upon the insertion of each row.
If not, what is the preferred approach? Storing each (arbitrarily-shaped) multidimensional array in a CArray with a unique name and then storing those names in a master index table? The application I'm imagining is storing images and associated metadata, which I'd like to be able to both query and use numexpr on.
Any pointers toward PyTables best practices are much appreciated!
The long answer is "yes, but you probably don't want to."
PyTables probably doesn't support it directly, but HDF5 does support creation of nested variable-length datatypes, allowing ragged arrays in multiple dimensions. Should you wish to go down that path, you'll want to use h5py and browse through the Datatypes chapter of the HDF5 User's Guide. See section 6.4.3.2.3, Variable-length Datatypes. (I'd link it, but they apparently chose not to put anchors that deep.)
Personally, the way that I would arrange the data you've got is into groups of datasets, not into a single table. That is, something like:
/particles/particlename1/pressure
/particles/particlename1/temperature
/particles/particlename2/pressure
/particles/particlename2/temperature
and so on. The lat and long values would be attributes on the /particles/particlename group rather than datasets, though having small datasets for them is perfectly fine too.
If you want to be able to do searches based on the lat and long, then having a dataset with the lat/long/name columns would be good. And if you wanted to get really fancy, there's an HDF5 datatype for references, allowing you to store a pointer to a dataset, or even to a subset of a dataset.
The short answer is "no", and I think it's a "limitation" of hdf5 rather than pytables.
I think the reason is that each unit of storage (the compound dataset) must have a well-defined size, which it obviously will not have if one or more components can change size. Note that it is entirely possible to resize and extend a dataset in hdf5 (pytables makes heavy use of this), but not the units of data within that array.
I suspect the best thing to do is either:
a) make it a well-defined size and provide a flag for overflow. This works well if the largest reasonable size is still pretty small and you are okay with tail events being thrown out. Note that you might be able to get rid of the unused disk space with hdf5 compression.
b) do as you suggest and create a new CArray in the same file, then just read that in when required. (To keep things tidy, you might want to put these all under their own group.)
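Option a) can be illustrated with NumPy alone. This is a sketch under assumptions, not PyTables code: MAX_SHAPE and to_fixed_cell are hypothetical names, and the cell, true shape and overflow flag would each map to a fixed-size column in the table.

```python
import numpy as np

MAX_SHAPE = (4, 4)  # hypothetical largest shape a cell is allowed to hold

def to_fixed_cell(array, max_shape=MAX_SHAPE):
    """Pad or crop a 2-D array to max_shape; report stored shape and overflow."""
    cell = np.zeros(max_shape, dtype=array.dtype)
    rows = min(array.shape[0], max_shape[0])
    cols = min(array.shape[1], max_shape[1])
    cell[:rows, :cols] = array[:rows, :cols]
    overflow = array.shape[0] > max_shape[0] or array.shape[1] > max_shape[1]
    return cell, (rows, cols), overflow

small = np.ones((2, 3), dtype=np.float32)
large = np.ones((5, 6), dtype=np.float32)

cell_a, shape_a, overflow_a = to_fixed_cell(small)  # fits: padded with zeros
cell_b, shape_b, overflow_b = to_fixed_cell(large)  # too big: cropped, flagged

# Reading back: cut the stored cell down to the recorded shape.
recovered = cell_a[:shape_a[0], :shape_a[1]]
```

The zero padding is the unused space that compression might reclaim on disk.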
HDF5 actually has an API designed and optimized for storing images in an hdf5 file. I don't think it's exposed in pytables.
