Save dictionary of dictionary in .mat format - python

Suppose data saved in a .mat file needs to be annotated, or the metadata of some file needs to be stored:
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
mdic = {"data": x,
        "classes": {'name_of_class_1': 0,
                    'name_of_class_2': 1
                    }
        }
When I save it with scipy.io.savemat() and load it back with scipy.io.loadmat(), I get a structure that is not very readable:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Fri Jan 20 16:52:17 2023',
 '__version__': '1.0',
 '__globals__': [],
 'data': array([[1, 2, 3],
        [4, 5, 6]], dtype=int32),
 'classes': array([[(array([[0]]), array([[1]]))]],
       dtype=[('name_of_class_1', 'O'), ('name_of_class_2', 'O')])}
Is there a better way to store dict of dict/jsons in .mat files?
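One option worth knowing about (a sketch, not from this thread; the file name is hypothetical): scipy >= 1.6 added a simplify_cells keyword to loadmat that unwraps the structured arrays back into plain nested dicts on load.
import numpy as np
from scipy.io import savemat, loadmat

x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
mdic = {"data": x,
        "classes": {"name_of_class_1": 0,
                    "name_of_class_2": 1}}

savemat("annotated.mat", mdic)  # nested dict becomes a MATLAB struct
back = loadmat("annotated.mat", simplify_cells=True)
print(back["classes"])  # a nested dict again, e.g. {'name_of_class_1': 0, ...}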


Extract an ndarray from a np.void array

1. the npy file I used: https://github.com/mangomangomango0820/DataAnalysis/blob/master/NumPy/NumPyEx/NumPy_Ex1_3Dscatterplt.npy
2. after loading the npy file,
import numpy as np

data = np.load('NumPy_Ex1_3Dscatterplt.npy')
'''
[([ 2, 2, 1920, 480],) ([ 1, 3, 1923, 480],)
......
([ 3, 3, 1923, 480],)]
⬆️ data.shape, (69,)
⬆️ data.dtype, [('f0', '<i8', (4,))]
⬆️ type(data), <class 'numpy.ndarray'>
⬆️ type(data[0]), <class 'numpy.void'>
'''
You can see that each row, e.g. data[0], has type <class 'numpy.void'>.
I wish to get an ndarray based on the data above, looking like this ⬇️
[[ 2 2 1920 480]
...
[ 3 3 1923 480]]
The way I did it is ⬇️
all = np.array([data[i][0] for i in range(data.shape[0])])
'''
[[ 2 2 1920 480]
...
[ 3 3 1923 480]]
'''
I am wondering if there's a smarter way to process the numpy.void class data and achieve the expected results.
Your data is a structured array, with a compound dtype.
https://numpy.org/doc/stable/user/basics.rec.html
I can recreate it with:
In [130]: dt = np.dtype([("f0", "<i8", (4,))])
In [131]: x = np.array(
...: [([2, 2, 1920, 480],), ([1, 3, 1923, 480],), ([3, 3, 1923, 480],)], dtype=dt
...: )
In [132]: x
Out[132]:
array([([ 2, 2, 1920, 480],), ([ 1, 3, 1923, 480],),
([ 3, 3, 1923, 480],)], dtype=[('f0', '<i8', (4,))])
This is a 1d array with one field, and the field itself contains 4 elements.
Fields are accessed by name:
In [133]: x["f0"]
Out[133]:
array([[ 2, 2, 1920, 480],
[ 1, 3, 1923, 480],
[ 3, 3, 1923, 480]])
This has integer dtype with shape (3,4).
Accessing fields by name applies to more complex structured arrays as well.
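For instance (a small illustration, not from the question's data), the same name-based access works with two fields:
import numpy as np

dt2 = np.dtype([("pos", "<i8", (2,)), ("score", "<f8")])
z = np.array([([1, 2], 0.5), ([3, 4], 0.75)], dtype=dt2)
print(z["pos"])    # (2, 2) integer array
print(z["score"])  # (2,) float array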
Using the tolist approach from the other answer:
In [134]: x.tolist()
Out[134]:
[(array([ 2, 2, 1920, 480]),),
(array([ 1, 3, 1923, 480]),),
(array([ 3, 3, 1923, 480]),)]
In [135]: np.array(x.tolist()) # (3,1,4) shape
Out[135]:
array([[[ 2, 2, 1920, 480]],
[[ 1, 3, 1923, 480]],
[[ 3, 3, 1923, 480]]])
In [136]: np.vstack(x.tolist()) # (3,4) shape
Out[136]:
array([[ 2, 2, 1920, 480],
[ 1, 3, 1923, 480],
[ 3, 3, 1923, 480]])
The documentation also suggests using:
In [137]: import numpy.lib.recfunctions as rf
In [138]: rf.structured_to_unstructured(x)
Out[138]:
array([[ 2, 2, 1920, 480],
[ 1, 3, 1923, 480],
[ 3, 3, 1923, 480]])
An element of a structured array displays as a tuple, though its type is a generic np.void.
There is an older class, recarray, that is similar but adds attribute-style access to the fields:
In [146]: y=x.view(np.recarray)
In [147]: y
Out[147]:
rec.array([([ 2, 2, 1920, 480],), ([ 1, 3, 1923, 480],),
([ 3, 3, 1923, 480],)],
dtype=[('f0', '<i8', (4,))])
In [148]: y.f0
Out[148]:
array([[ 2, 2, 1920, 480],
[ 1, 3, 1923, 480],
[ 3, 3, 1923, 480]])
In [149]: type(y[0])
Out[149]: numpy.record
I often refer to elements of structured arrays as records.
Here is the trick:
data_clean = np.array(data.tolist())
print(data_clean)
print(data_clean.shape)
Output
[[[ 2 2 1920 480]]
...............
[[ 3 3 1923 480]]]
(69, 1, 4)
If you don't like the extra size-1 dimension in the middle, you can squeeze it out like this:
data_sqz = data_clean.squeeze()
print(data_sqz)
print(data_sqz.shape)
Output
...
[ 3 3 1923 480]]
(69, 4)

Binning variable with rolling window on xarray

I have an xarray.Dataset with temperature data and want to calculate the binned temperature for every element of the array using a 7-day rolling window.
I have data in this form:
import xarray as xr

ds = xr.Dataset(
    {'t2m': (['time', 'lat', 'lon'], t2m)},
    coords={
        'lon': lon,
        'lat': lat,
        'time': time,
    }
)
And then I use the rolling() method and apply a function on each window array:
r = ds.t2m.\
    chunk({'time': 10}).\
    rolling(time=7)

window_results = []
for label, arr_window in tqdm(r):
    max_temp = arr_window.max(dim=...).values
    min_temp = arr_window.min(dim=...).values
    if not np.isnan(max_temp):
        bins = np.arange(min_temp, max_temp, 2)
        buckets = np.digitize(arr_window.isel(time=-1),
                              bins=bins)
        buckets_arr = xr.DataArray(buckets,
                                   dims={
                                       'lat': arr_window.lat.values,
                                       'lon': arr_window.lon.values
                                   })
        buckets_arr = buckets_arr.assign_coords({'time': label})
        window_results.append(buckets_arr)
At the end, I get a list of each timestep with a window-calculation of binned arrays:
ds_concat = xr.concat(window_results, dim='time')
ds_concat
>> <xarray.DataArray (time: 18, lat: 10, lon: 10)>
array([[[1, 2, 2, ..., 2, 2, 3],
[1, 3, 3, ..., 1, 1, 2],
[2, 3, 2, ..., 1, 2, 3],
...,
[2, 2, 2, ..., 2, 2, 2],
[2, 2, 2, ..., 1, 2, 2],
[2, 2, 3, ..., 2, 3, 2]],
...
This code yields the results I am looking for, but I believe there must be a better alternative for applying this same process, using either apply_ufunc or dask. I am also using a dask.distributed.Client, so I am looking for a way to optimize my code to run fast.
Any help is appreciated.
I finally figured it out! Hope this can help someone with the same problem.
One of the coolest features of dask is dask.delayed. I can re-write the loop above using a lazy function:
import dask
import numpy as np
import xarray as xr

@dask.delayed
def create_bucket_window(arr, label):
    max_temp = arr.max(dim=...).values
    min_temp = arr.min(dim=...).values
    if not np.isnan(max_temp):
        bins = np.arange(min_temp, max_temp, 2)
        buckets = np.digitize(arr.isel(time=-1),
                              bins=bins)
        buckets_arr = xr.DataArray(buckets,
                                   dims={
                                       'lat': arr.lat.values,
                                       'lon': arr.lon.values
                                   })
        buckets_arr = buckets_arr.assign_coords({'time': label})
        return buckets_arr
and then:
window_results = []
for label, arr_window in tqdm(r):
    bucket_array = create_bucket_window(arr=arr_window,
                                        label=label)
    window_results.append(bucket_array)
Once I do this, dask will lazily build these arrays and only evaluate them when needed:
dask.compute(*window_results)
And there you will have a collection of results!
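A small follow-up sketch (not spelled out in the answer): the computed results can be concatenated just as in the original loop:
computed = dask.compute(*window_results)
# NaN-only windows return None and would need filtering before concat
ds_concat = xr.concat([c for c in computed if c is not None], dim='time')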

How to save a Python list of arrays of varied dimensions to mat file [duplicate]

This question already has an answer here:
creating Matlab cell arrays in python
(1 answer)
Closed 2 years ago.
I am going to save a list of arrays into a file that can be read in Matlab. The arrays in the list are 3-dimensional but of varied shapes, so I cannot put them into a single large array.
Originally I thought I could save the list in a pickle file and then read the file in Matlab. Later I found that Matlab does not support reading pickle files. I also tried using scipy.io.savemat to save the list to a mat file, but the inconsistent array dimensions cause saving problems.
Does anyone have ideas of how to solve the problem? It should be noted that the list is very large in memory (>4 GB).
If you don't need a single file, you can iterate through the list and savemat each entry into a separate file. Then iterate through that directory and load each file into a cell array element.
You can also zip the directory you store these in, to get one file to pass around.
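A minimal sketch of the one-file-per-array idea (list_of_arrays and the file names are hypothetical):
import numpy as np
from scipy.io import savemat

# stand-in for the asker's list of 3-d arrays of varied shapes
list_of_arrays = [np.zeros((2, 3, 4)), np.ones((5, 3, 4))]

# one .mat file per array; each loads cleanly regardless of shape
for i, a in enumerate(list_of_arrays):
    savemat(f'part_{i:04d}.mat', {'a': a})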
savemat is the right tool for saving numpy arrays in a MATLAB format. I'd suggest making the desired structure in MATLAB, saving it, and examining the result of loadmat; then duplicate that layout when going the other way.
Also loadmat the file you savemat to get a better idea of how it maps Python objects onto MATLAB ones.
Arrays may become order-F 2d arrays, cells may become object dtype arrays, and structs may become structured arrays.
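For the asker's case (3-d arrays of varied shapes), a minimal sketch of one layout that round-trips as a MATLAB cell column (the example values and file name are mine):
import numpy as np
from scipy.io import savemat

arrs = [np.zeros((2, 3, 4)), np.ones((5, 3, 4))]  # hypothetical data

# a (k, 1) object array becomes a k-by-1 cell array in MATLAB
cell = np.empty((len(arrs), 1), dtype=object)
for i, a in enumerate(arrs):
    cell[i, 0] = a

savemat('cells.mat', {'x': cell})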
Create a .mat from a list:
In [180]: io.savemat('test.mat', {'x':[np.arange(12).reshape(3,4), np.arange(3), 4]})
reload it:
In [181]: data = io.loadmat('test.mat')
In [182]: data
Out[182]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Sat Mar 21 20:02:02 2020',
'__version__': '1.0',
'__globals__': [],
'x': array([[array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]]),
array([[0, 1, 2]]), array([[4]])]], dtype=object)}
It has one named variable:
In [183]: data['x']
Out[183]:
array([[array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]]),
array([[0, 1, 2]]), array([[4]])]], dtype=object)
The shape is 2d (like all MATLAB) and object dtype (to hold a mix of items):
In [185]: data['x'].shape
Out[185]: (1, 3)
Within that is a 2d array:
In [186]: data['x'][0,0].shape
Out[186]: (3, 4)
If I check the flags, I see data['x'] is F-contiguous (but the source passed to savemat was default C-contiguous).
Note also the change in shape of the 1d array and scalar - again following MATLAB conventions.
In Octave:
>> data = load('test.mat')
data =
  scalar structure containing the fields:
    x =
    {
      [1,1] =
          0    1    2    3
          4    5    6    7
          8    9   10   11
      [1,2] =
        0  1  2
      [1,3] = 4
    }
we get a cell variable.
If I wanted a numpy matrix that didn't get changed in this transfer, I'd have to start with something like:
In [188]: np.arange(12).reshape(3,4,order='F')
Out[188]:
array([[ 0, 3, 6, 9],
[ 1, 4, 7, 10],
[ 2, 5, 8, 11]])
numpy is, by default, C order (row-major), with the first dimension being the outermost. MATLAB is the opposite: F (Fortran) order, column-major, with the last dimension outermost.

Read .mat files in python whose content is a table

I was wondering whether I can read a .mat file that contains a MATLAB table in Python. Is that possible?
I have read this post, but not much is mentioned there.
So far I have tried to read my .mat containing the table in this way:
import tables
from scipy.io import loadmat
from scipy.io import whosmat
x = loadmat('CurrentProto.mat')
print(x)
but I cannot access the elements there; this is what I get with print(x):
{'__header__': b'MATLAB 5.0 MAT-file, Platform: PCWIN64, Created on: Mon Jul 29 09:47:17 2019', '__version__': '1.0', '__globals__': [], 'None': MatlabOpaque([(b'CurrentProto', b'MCOS', b'table', array([[3707764736],
[ 2],
[ 1],
[ 1],
[ 1],
[ 1]], dtype=uint32))],
dtype=[('s0', 'O'), ('s1', 'O'), ('s2', 'O'), ('arr', 'O')]), '__function_workspace__': array([[ 0, 1, 73, ..., 0, 0, 0]], dtype=uint8)}
Is there a way of reading my table from the .mat file, or do I have to save it in a different format within Matlab?
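One common workaround (an assumption, not from this thread; file and variable names are hypothetical): scipy's loadmat cannot decode the MCOS objects that back a MATLAB table, so convert the table to a struct in MATLAB first (table2struct is a standard MATLAB function), re-save, and read the struct from Python:
# Assumes the data was re-saved in MATLAB first, e.g.:
#   s = table2struct(CurrentProto);
#   save('CurrentProtoStruct.mat', 's');
from scipy.io import loadmat

x = loadmat('CurrentProtoStruct.mat', simplify_cells=True)  # scipy >= 1.6
print(x['s'])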

storing numpy object array of equal-size ndarrays to a .mat file using scipy.io.savemat

I am trying to create .mat data files using python. The matlab code expects the data to have a certain format, where two-dimensional ndarrays of non-uniform sizes are stored as objects in a column vector. So, in my case, there would be k numpy arrays of shape (m_i, n) - with different m_i for each array - stored in a numpy array with dtype=object of shape (k, 1). I then add this object array to a dictionary and pass it to scipy.io.savemat().
This works fine so long as the m_i are indeed different. If all k arrays happen to have the same number of rows m_i, the behaviour becomes strange. First of all, it requires very explicit assignment to a numpy array of dtype=object that has been initialised to the final size k, otherwise numpy simply creates a three-dimensional array. But even when I have the correct format in python and store it to a .mat file using savemat, there is some kind of problem in the translation to the matlab format.
When I reload the data from the .mat file using scipy.io.loadmat, I find that I still have an object array of shape (k, 1), which still has elements of shape (m, n). However, each element is no longer an int or a float but is instead a numpy array of shape (1, 1) that has to be further indexed to access the contained int or float. So an individual element of an object vector that was supposed to be a numpy array of shape (2, 4) would look something like this:
[array([[array([[0.82374894]]), array([[0.50730055]]),
array([[0.36721625]]), array([[0.45036349]])],
[array([[0.26119276]]), array([[0.16843872]]),
array([[0.28649524]]), array([[0.64239569]])]], dtype=object)]
This also poses a problem for the matlab code that I am trying to build my data files for. It runs fine for the arrays of objects that have different shapes but will break when there are arrays containing arrays of the same shape.
I know this is a rather obscure and possibly unavoidable issue but I figured I would see if anyone else has encountered it and found a fix. Thanks.
I'm not quite clear about the problem. Let me try to recreate your case:
In [58]: from scipy.io import loadmat, savemat
In [59]: A = np.empty((2,1), object)
In [61]: A[0,0]=np.arange(4).reshape(2,2)
In [62]: A[1,0]=np.arange(6).reshape(3,2)
In [63]: A
Out[63]:
array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3],
[4, 5]])]], dtype=object)
In [64]: B=A[[0,0],:]
In [65]: B
Out[65]:
array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3]])]], dtype=object)
As I explained earlier today, creating an object dtype array from arrays of matching size requires special handling. np.array(...) tries to create a higher dimensional array. https://stackoverflow.com/a/56243305/901925
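A minimal sketch of that special handling (the example values are mine): preallocate the object array and fill it, so np.array cannot merge the equal-sized pieces into one 3d array:
import numpy as np

parts = [np.zeros((2, 4)), np.ones((2, 4))]  # equal shapes on purpose

# naive construction merges them into one 3d numeric array:
print(np.array(parts).shape)  # (2, 2, 4), not an object array

# preallocate-and-fill keeps them as separate objects:
obj = np.empty((len(parts), 1), dtype=object)
for i, p in enumerate(parts):
    obj[i, 0] = p
print(obj.shape, obj[0, 0].shape)  # (2, 1) (2, 4)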
Saving:
In [66]: savemat('foo.mat', {'A':A, 'B':B})
Loading:
In [74]: loadmat('foo.mat')
Out[74]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:20:42 2019',
'__version__': '1.0',
'__globals__': [],
'A': array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3],
[4, 5]])]], dtype=object),
'B': array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3]])]], dtype=object)}
In [75]: _74['A'][1,0]
Out[75]:
array([[0, 1],
[2, 3],
[4, 5]])
Your problem case looks like it's an object dtype array containing numbers:
In [89]: C = np.arange(4).reshape(2,2).astype(object)
In [90]: C
Out[90]:
array([[0, 1],
[2, 3]], dtype=object)
In [91]: savemat('foo1.mat', {'C': C})
In [92]: loadmat('foo1.mat')
Out[92]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:39:31 2019',
'__version__': '1.0',
'__globals__': [],
'C': array([[array([[0]]), array([[1]])],
[array([[2]]), array([[3]])]], dtype=object)}
Evidently savemat has converted the integer objects into 2d MATLAB compatible arrays. In MATLAB everything, even scalars, is at least 2d.
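If a plain MATLAB matrix is what's wanted, one sketch of a fix on the saving side (reusing C from the session above; the file name is hypothetical) is to cast the numeric object array back to a numeric dtype before saving:
savemat('foo1_fixed.mat', {'C': C.astype(np.int64)})  # loads back as a plain (2, 2) int matrix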
===
And in Octave, the object dtype arrays all produce cells, and the 2d numeric arrays produce matrices:
>> load foo.mat
>> A
A =
{
  [1,1] =
     0   1
     2   3
  [2,1] =
     0   1
     2   3
     4   5
}
>> B
B =
{
  [1,1] =
     0   1
     2   3
  [2,1] =
     0   1
     2   3
}
>> load foo1.mat
>> C
C =
{
  [1,1] = 0
  [2,1] = 2
  [1,2] = 1
  [2,2] = 3
}
Python: Issue reading in str from MATLAB .mat file using h5py and NumPy is a relatively recent SO question that showed there's a difference between Octave's HDF5 output and MATLAB's.
