Suppose data saved in a .mat file needs to be annotated, or the metadata of some file needs to be stored:
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
mdic = {"data": x,
        "classes": {'name_of_class_1': 0,
                    'name_of_class_2': 1}}
When I save it with scipy.io.savemat() and load it back with scipy.io.loadmat(), I get a structure that is not very readable:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Fri Jan 20 16:52:17 2023',
'__version__': '1.0',
'__globals__': [],
'data': array([[1, 2, 3],
[4, 5, 6]], dtype=int32),
'classes': array([[(array([[0]]), array([[1]]))]],
dtype=[('name_of_class_1', 'O'), ('name_of_class_2', 'O')])}
Is there a better way to store a dict of dicts / JSON-like data in .mat files?
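One option, assuming SciPy 1.5 or newer (an assumption, since the question doesn't state a version), is loadmat's simplify_cells flag, which unpacks the MATLAB structs back into plain nested dicts on load:

```python
# Sketch: savemat stores the nested dict as a MATLAB struct; loading with
# simplify_cells=True (SciPy >= 1.5) unpacks it back into ordinary dicts.
import numpy as np
from scipy.io import savemat, loadmat

x = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
mdic = {"data": x,
        "classes": {"name_of_class_1": 0,
                    "name_of_class_2": 1}}

savemat("annotated.mat", mdic)
loaded = loadmat("annotated.mat", simplify_cells=True)
# loaded["classes"] is now an ordinary dict again
print(loaded["classes"]["name_of_class_1"])
```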
Hello, I'm stuck on getting a good conversion of a MATLAB matrix to a pandas DataFrame.
I converted it, but I end up with one row containing lists of lists; those lists of lists are normally my rows.
import pandas as pd
import numpy as np
from scipy.io import loadmat

Data_mat = loadmat('senet50-ferplus-logits.mat')
Data_mat.keys() gives me this output:
dict_keys(['__header__', '__version__', '__globals__', 'images', 'wavLogits'])
I'd like to convert images and wavLogits to a DataFrame.
Following this post, I applied its solution:
cardio_df = pd.DataFrame(np.hstack((Data_mat['images'], Data_mat['wavLogits'])))
The output is not what I want. How can I get the DataFrame into a good format?
[UPDATE] Data_mat["images"] contains:
array([[(array([[array(['A.J._Buckley/test/Y8hIVOBuels_0000001.wav'], dtype='<U41'),
array(['A.J._Buckley/test/Y8hIVOBuels_0000002.wav'], dtype='<U41'),
array(['A.J._Buckley/test/Y8hIVOBuels_0000003.wav'], dtype='<U41'),
...,
array(['Zulay_Henao/train/s4R4hvqrhFw_0000007.wav'], dtype='<U41'),
array(['Zulay_Henao/train/s4R4hvqrhFw_0000008.wav'], dtype='<U41'),
array(['Zulay_Henao/train/s4R4hvqrhFw_0000009.wav'], dtype='<U41')]],
dtype=object), array([[ 1, 2, 3, ..., 153484, 153485, 153486]], dtype=int32), array([[ 1, 1, 1, ..., 1251, 1251, 1251]], dtype=uint16), array([[array(['Y8hIVOBuels'], dtype='<U11'),
array(['Y8hIVOBuels'], dtype='<U11'),
array(['Y8hIVOBuels'], dtype='<U11'), ...,
array(['s4R4hvqrhFw'], dtype='<U11'),
array(['s4R4hvqrhFw'], dtype='<U11'),
array(['s4R4hvqrhFw'], dtype='<U11')]], dtype=object), array([[1, 2, 3, ..., 7, 8, 9]], dtype=uint8), array([[array(['A.J._Buckley/1.6/Y8hIVOBuels/1/01.jpg'], dtype='<U37')],
[array(['A.J._Buckley/1.6/Y8hIVOBuels/1/02.jpg'], dtype='<U37')],
[array(['A.J._Buckley/1.6/Y8hIVOBuels/1/03.jpg'], dtype='<U37')],
...,
[array(['Zulay_Henao/1.6/s4R4hvqrhFw/9/16.jpg'], dtype='<U36')],
[array(['Zulay_Henao/1.6/s4R4hvqrhFw/9/17.jpg'], dtype='<U36')],
[array(['Zulay_Henao/1.6/s4R4hvqrhFw/9/18.jpg'], dtype='<U36')]],
dtype=object), array([[1.00000e+00],
[1.00000e+00],
[1.00000e+00],
...,
[1.53486e+05],
[1.53486e+05],
[1.53486e+05]], dtype=float32), array([[3, 3, 3, ..., 1, 1, 1]], dtype=uint8))]],
dtype=[('name', 'O'), ('id', 'O'), ('sp', 'O'), ('video', 'O'), ('track', 'O'), ('denseFrames', 'O'), ('denseFramesWavIds', 'O'), ('set', 'O')])
This is what I'd do to convert a .mat file into a pandas DataFrame automagically:
import numpy as np
import pandas as pd
import scipy.io

mat = scipy.io.loadmat('file.mat')
mat = {k: v for k, v in mat.items() if k[0] != '_'}
df = pd.DataFrame({k: np.array(v).flatten() for k, v in mat.items()})
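A caveat worth knowing, shown here with a synthetic file and made-up variable names since I don't have the original data: this approach only works when every variable flattens to the same number of elements, because DataFrame columns must have equal length.

```python
import numpy as np
import pandas as pd
import scipy.io

# Build a small .mat file so the snippet is self-contained.
scipy.io.savemat('file.mat', {'a': np.array([[1, 2], [3, 4]]),
                              'b': np.array([[5.0, 6.0], [7.0, 8.0]])})

mat = scipy.io.loadmat('file.mat')
mat = {k: v for k, v in mat.items() if k[0] != '_'}  # drop __header__ etc.
df = pd.DataFrame({k: np.array(v).flatten() for k, v in mat.items()})
print(df.shape)
```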
I am trying to create .mat data files using python. The matlab code expects the data to have a certain format, where two-dimensional ndarrays of non-uniform sizes are stored as objects in a column vector. So, in my case, there would be k numpy arrays of shape (m_i, n) - with different m_i for each array - stored in a numpy array with dtype=object of shape (k, 1). I then add this object array to a dictionary and pass it to scipy.io.savemat().
This works fine so long as the m_i are indeed different. If all k arrays happen to have the same number of rows m_i, the behaviour becomes strange. First of all, it requires very explicit assignment to a numpy array of dtype=object that has been initialised to the final size k, otherwise numpy simply creates a three-dimensional array. But even when I have the correct format in python and store it to a .mat file using savemat, there is some kind of problem in the translation to the matlab format.
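For reference, the "very explicit assignment" described above can be sketched like this (the array sizes are made up for illustration):

```python
import numpy as np

# k blocks with the SAME number of rows: np.array(blocks) would build a
# 3-d numeric array, so preallocate an object array and fill it instead.
blocks = [np.arange(8.0).reshape(4, 2), np.arange(8.0, 16.0).reshape(4, 2)]
cell = np.empty((len(blocks), 1), dtype=object)
for i, b in enumerate(blocks):
    cell[i, 0] = b

# cell now has shape (k, 1) with each element a (m, n) array,
# the layout savemat expects for a MATLAB cell column vector.
```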
When I reload the data from the .mat file using scipy.io.loadmat, I find that I still have an object array of shape (k, 1), which still has elements of shape (m, n). However, each element is no longer an int or a float but is instead a numpy array of shape (1, 1) that has to be further indexed to access the contained int or float. So an individual element of an object vector that was supposed to be a numpy array of shape (2, 4) would look something like this:
[array([[array([[0.82374894]]), array([[0.50730055]]),
array([[0.36721625]]), array([[0.45036349]])],
[array([[0.26119276]]), array([[0.16843872]]),
array([[0.28649524]]), array([[0.64239569]])]], dtype=object)]
This also poses a problem for the matlab code that I am trying to build my data files for. It runs fine for the arrays of objects that have different shapes but will break when there are arrays containing arrays of the same shape.
I know this is a rather obscure and possibly unavoidable issue but I figured I would see if anyone else has encountered it and found a fix. Thanks.
I'm not quite clear about the problem. Let me try to recreate your case:
In [58]: from scipy.io import loadmat, savemat
In [59]: A = np.empty((2,1), object)
In [61]: A[0,0]=np.arange(4).reshape(2,2)
In [62]: A[1,0]=np.arange(6).reshape(3,2)
In [63]: A
Out[63]:
array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3],
[4, 5]])]], dtype=object)
In [64]: B=A[[0,0],:]
In [65]: B
Out[65]:
array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3]])]], dtype=object)
As I explained earlier today, creating an object dtype array from arrays of matching size requires special handling. np.array(...) tries to create a higher dimensional array. https://stackoverflow.com/a/56243305/901925
Saving:
In [66]: savemat('foo.mat', {'A':A, 'B':B})
Loading:
In [74]: loadmat('foo.mat')
Out[74]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:20:42 2019',
'__version__': '1.0',
'__globals__': [],
'A': array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3],
[4, 5]])]], dtype=object),
'B': array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3]])]], dtype=object)}
In [75]: _74['A'][1,0]
Out[75]:
array([[0, 1],
[2, 3],
[4, 5]])
Your problem case looks like an object dtype array containing numbers:
In [89]: C = np.arange(4).reshape(2,2).astype(object)
In [90]: C
Out[90]:
array([[0, 1],
[2, 3]], dtype=object)
In [91]: savemat('foo1.mat', {'C': C})
In [92]: loadmat('foo1.mat')
Out[92]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:39:31 2019',
'__version__': '1.0',
'__globals__': [],
'C': array([[array([[0]]), array([[1]])],
[array([[2]]), array([[3]])]], dtype=object)}
Evidently savemat has converted the integer objects into 2d MATLAB-compatible arrays. In MATLAB everything, even a scalar, is at least 2d.
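One workaround, if the goal is to avoid that per-element wrapping (my suggestion, not part of the original answer): cast the numeric object array back to a numeric dtype before saving.

```python
import numpy as np
from scipy.io import savemat, loadmat

C = np.arange(4).reshape(2, 2).astype(object)
savemat('foo1_fixed.mat', {'C': C.astype(float)})  # numeric dtype, not object
out = loadmat('foo1_fixed.mat')['C']
# Comes back as a plain 2x2 numeric matrix instead of a cell of 1x1 arrays.
```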
===
And in Octave, the object dtype arrays all produce cells, and the 2d numeric arrays produce matrices:
>> load foo.mat
>> A
A =
{
[1,1] =
0 1
2 3
[2,1] =
0 1
2 3
4 5
}
>> B
B =
{
[1,1] =
0 1
2 3
[2,1] =
0 1
2 3
}
>> load foo1.mat
>> C
C =
{
[1,1] = 0
[2,1] = 2
[1,2] = 1
[2,2] = 3
}
Python: Issue reading in str from MATLAB .mat file using h5py and NumPy
is a relatively recent SO question that showed there's a difference between Octave's HDF5 output and MATLAB's.
I want to use numpy to implement the following data structure. Currently I use a Python dictionary, but it's hard to do the vector operations; I have to add vectors many times, so I want to use numpy to simplify the work. The length of hosts will vary during program execution. Is it possible to do this with numpy structured arrays, given that the list lengths are mutable? I'm not familiar with them; I just want to know whether it's possible, so that I don't waste time.
{
"0" :{
"coordinates": [100, 100],
"neighbours": [1, 40],
"hosts":[],
"v-capacity":20,
"v-immature":0,
"v-state":[20, 0, 0, 0]
},
"1" :{
"coordinates": [200, 100],
"neighbours": [0, 2, 41],
"hosts":[],
"v-capacity":20,
"v-immature":0,
"v-state":[20, 0, 0, 0]
},
What you show is a dictionary whose values are also dictionaries. Some values of the nested dictionaries are scalars, others are lists. The neighbours lists vary in length.
I can picture creating a structured array with fields corresponding to the inner dictionary keys.
The coordinates and v-state fields could even have inner dimensions of (2,) and (4,).
But for the variable-length neighbours or hosts, the best we can do is define those fields as having object dtype, which stores the respective lists elsewhere in memory. Math on that kind of array is limited.
But before you get too deep into structured arrays, explore creating a set of arrays to store this data, one row per item in the outer dictionary.
coordinates = np.array([[100, 100], [200, 100]])
neighbors = np.array([[1, 40], [0, 2, 41]], dtype=object)  # ragged: modern NumPy requires explicit object dtype
Make sure you understand what those expressions produce.
In [537]: coordinates
Out[537]:
array([[100, 100],
[200, 100]])
In [538]: neighbors
Out[538]: array([[1, 40], [0, 2, 41]], dtype=object)
Here's an example of a structured array that can hold these arrays:
In [539]: dt=np.dtype([('coordinates',int,(2,)),('neighbors',object)])
In [540]: arr = np.zeros((2,), dtype=dt)
In [541]: arr
Out[541]:
array([([0, 0], 0), ([0, 0], 0)],
dtype=[('coordinates', '<i4', (2,)), ('neighbors', 'O')])
In [543]: arr['coordinates']=coordinates
In [544]: arr['neighbors']=neighbors
In [545]: arr
Out[545]:
array([([100, 100], [1, 40]), ([200, 100], [0, 2, 41])],
dtype=[('coordinates', '<i4', (2,)), ('neighbors', 'O')])
In [546]: arr['neighbors']
Out[546]: array([[1, 40], [0, 2, 41]], dtype=object)
Notice that this is basically a packaging convenience. It stores the arrays in one place, but you still have to perform your math/vector operations on the individual fields.
In [547]: coordinates.sum(axis=1)
Out[547]: array([200, 300]) # sum across columns of a 2d array
In [548]: neighbors.sum()
Out[548]: [1, 40, 0, 2, 41] # sum (concatenate) of lists
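The session above, condensed into a copy-pastable script (same field names as defined there):

```python
import numpy as np

# Structured array with a fixed-size 'coordinates' field and an
# object-dtype 'neighbors' field for the variable-length lists.
dt = np.dtype([('coordinates', int, (2,)), ('neighbors', object)])
arr = np.zeros((2,), dtype=dt)
arr['coordinates'] = [[100, 100], [200, 100]]
arr['neighbors'][0] = [1, 40]
arr['neighbors'][1] = [0, 2, 41]

# Math still happens per field, not on the structured array as a whole.
print(arr['coordinates'].sum(axis=1))
```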
I have a given array:
array = [(u'Andrew', -3, 3, 100.032) (u'Bob', -4, 4, 103.323) (u'Joe', -5, 5, 154.324)]
that is generated from another process (which I cannot control) that takes a CSV table and outputs this numpy array. I now need to assign the dtypes of the columns to do further analysis.
How can I do this?
Thank you
Is this what you need?
new_array = np.array(array, dtype = [("name", object),
("N1", int),
("N2", int),
("N3", float)])
where name and N1-3 are column names I gave.
It gives:
array([(u'Andrew', -3, 3, 100.032), (u'Bob', -4, 4, 103.323),
(u'Joe', -5, 5, 154.324)],
dtype=[('name', 'O'), ('N1', '<i8'), ('N2', '<i8'), ('N3', '<f8')])
You can sort on "N1", for instance:
new_array.sort(order="N1")
new_array
array([(u'Joe', -5, 5, 154.324), (u'Bob', -4, 4, 103.323),
(u'Andrew', -3, 3, 100.032)],
dtype=[('name', 'O'), ('N1', '<i8'), ('N2', '<i8'), ('N3', '<f8')])
Hope this helps.
recarr = np.rec.fromrecords(array)
Optionally set field names:
recarr = np.rec.fromrecords(array, names="name, idata, idata2, fdata")
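A self-contained sketch of the np.rec.fromrecords route, using the sample rows from the question (the field names are ones I chose, as in the line above):

```python
import numpy as np

records = [('Andrew', -3, 3, 100.032),
           ('Bob', -4, 4, 103.323),
           ('Joe', -5, 5, 154.324)]
# fromrecords infers a dtype per column (string, int, int, float here).
recarr = np.rec.fromrecords(records, names="name, idata, idata2, fdata")

# Fields are addressable by name, on the array and on individual records.
print(recarr.idata)
```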