Convert mat file to pandas dataframe problem

Hello, I'm stuck trying to convert a MATLAB matrix to a pandas DataFrame.
The conversion runs, but I end up with a single row whose cells each hold a list of lists. Those inner lists should be my rows.
import pandas as pd
import numpy as np
from scipy.io import loadmat
Data_mat = loadmat('senet50-ferplus-logits.mat')
Data_mat.keys() gives me this output:
dict_keys(['__header__', '__version__', '__globals__', 'images', 'wavLogits'])
I'd like to convert images and wavLogits to a DataFrame.
Following this post, I applied its solution:
cardio_df = pd.DataFrame(np.hstack((Data_mat['images'], Data_mat['wavLogits'])))
The output is the single-row DataFrame described above.
How can I get the DataFrame into a proper format?
[UPDATE] Data_mat["images"] contains:
array([[(array([[array(['A.J._Buckley/test/Y8hIVOBuels_0000001.wav'], dtype='<U41'),
array(['A.J._Buckley/test/Y8hIVOBuels_0000002.wav'], dtype='<U41'),
array(['A.J._Buckley/test/Y8hIVOBuels_0000003.wav'], dtype='<U41'),
...,
array(['Zulay_Henao/train/s4R4hvqrhFw_0000007.wav'], dtype='<U41'),
array(['Zulay_Henao/train/s4R4hvqrhFw_0000008.wav'], dtype='<U41'),
array(['Zulay_Henao/train/s4R4hvqrhFw_0000009.wav'], dtype='<U41')]],
dtype=object), array([[ 1, 2, 3, ..., 153484, 153485, 153486]], dtype=int32), array([[ 1, 1, 1, ..., 1251, 1251, 1251]], dtype=uint16), array([[array(['Y8hIVOBuels'], dtype='<U11'),
array(['Y8hIVOBuels'], dtype='<U11'),
array(['Y8hIVOBuels'], dtype='<U11'), ...,
array(['s4R4hvqrhFw'], dtype='<U11'),
array(['s4R4hvqrhFw'], dtype='<U11'),
array(['s4R4hvqrhFw'], dtype='<U11')]], dtype=object), array([[1, 2, 3, ..., 7, 8, 9]], dtype=uint8), array([[array(['A.J._Buckley/1.6/Y8hIVOBuels/1/01.jpg'], dtype='<U37')],
[array(['A.J._Buckley/1.6/Y8hIVOBuels/1/02.jpg'], dtype='<U37')],
[array(['A.J._Buckley/1.6/Y8hIVOBuels/1/03.jpg'], dtype='<U37')],
...,
[array(['Zulay_Henao/1.6/s4R4hvqrhFw/9/16.jpg'], dtype='<U36')],
[array(['Zulay_Henao/1.6/s4R4hvqrhFw/9/17.jpg'], dtype='<U36')],
[array(['Zulay_Henao/1.6/s4R4hvqrhFw/9/18.jpg'], dtype='<U36')]],
dtype=object), array([[1.00000e+00],
[1.00000e+00],
[1.00000e+00],
...,
[1.53486e+05],
[1.53486e+05],
[1.53486e+05]], dtype=float32), array([[3, 3, 3, ..., 1, 1, 1]], dtype=uint8))]],
dtype=[('name', 'O'), ('id', 'O'), ('sp', 'O'), ('video', 'O'), ('track', 'O'), ('denseFrames', 'O'), ('denseFramesWavIds', 'O'), ('set', 'O')])

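So this is what I'd do to convert a mat file into a pandas DataFrame automagically. It drops the metadata keys and flattens each variable into one column, so it assumes every variable holds the same number of elements.
import numpy as np
import pandas as pd
import scipy.io
mat = scipy.io.loadmat('file.mat')
mat = {k: v for k, v in mat.items() if k[0] != '_'}
df = pd.DataFrame({k: np.array(v).flatten() for k, v in mat.items()})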
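For a MATLAB struct like the images variable in the question, flattening alone leaves a single (1, 1) record, so the fields have to be unpacked first. Here is a minimal sketch of that idea. The helper name unpack_struct and the synthetic two-field array are my own assumptions for illustration, and it assumes every cell is non-empty:

```python
import numpy as np

def unpack_struct(struct):
    """Map each field of a (1, 1) struct array, as loadmat returns it, to a 1-D array."""
    record = struct[0, 0]
    columns = {}
    for name in struct.dtype.names:
        values = np.asarray(record[name]).ravel()
        if values.dtype == object:
            # Cell arrays of strings arrive as object arrays of 1-element arrays.
            values = np.array([np.asarray(v).ravel()[0] for v in values])
        columns[name] = values
    return columns

# Synthetic stand-in for Data_mat['images'] (hypothetical values, two fields only).
images = np.empty((1, 1), dtype=[("name", "O"), ("id", "O")])
names = np.empty((1, 3), dtype=object)
for i, path in enumerate(["a/1.wav", "a/2.wav", "b/1.wav"]):
    names[0, i] = np.array([path])
images["name"][0, 0] = names
images["id"][0, 0] = np.array([[1, 2, 3]], dtype=np.int32)

columns = unpack_struct(images)
print(columns["name"], columns["id"])
```

With the real file you would pass Data_mat['images'] instead of the synthetic array, and then hand the columns that share the same length to pd.DataFrame. Fields with a different number of elements would need their own DataFrame.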

Related

genfromtxt read data of different types as array or arrays

I am trying to import data from a text file with a varying number of columns and insert it into an array of arrays. I know that the first column will always be a string and the next three columns will be integers, but so far I have only managed to read the file as an array of tuples.
I have tried using dtype=(object,int,int,int):
from io import StringIO
import numpy as np
new_string = StringIO("01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10")
new_result = np.genfromtxt(new_string, dtype=(object,int,int,int), encoding="unicode"
, delimiter=",")
print("File data:",new_result )
output:
File data: [('01/23/2020', 32, 0, 2) ("01/31/2020' ", 436, 0, 10)]
I want the output to look like this
[['01/23/2020' 32 0 2]
['01/31/2020' 436 0 10]]
so that
new_result == np.array( [['01/23/2020',32,0,2],
['01/31/2020', 436, 0, 10]],dtype=object)
will be True

This should work for your problem
import numpy as np
example_string = "01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10"
example_string_filtered = example_string.replace(' ','').replace("'",'')
newline_split = example_string_filtered.split('\n')
result = []
for line in newline_split:
    line_split = line.split(',')
    result.append([line_split[0], int(line_split[1]), int(line_split[2]), int(line_split[3])])
result = np.array(result, dtype='O')
print(result)
result:
[['01/23/2020' 32 0 2]
 ['01/31/2020' 436 0 10]]

Specifying a dtype like that produces a structured array:
https://numpy.org/doc/stable/user/basics.rec.html
In [40]: new_string = StringIO("01/23/2020, 32, 0, 2 \n01/31/2020' ,436 ,0 ,10")
...: new_result = np.genfromtxt(new_string, dtype=(object,int,int,int), encoding="unicode"
...: , delimiter=",")
This is a 1d array, with a compound dtype. The print display just shows the elements, or records, as tuples, but the repr display shows the dtype as well:
In [41]: new_result
Out[41]:
array([(b'01/23/2020', 32, 0, 2), (b"01/31/2020' ", 436, 0, 10)],
dtype=[('f0', 'O'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
In [42]: new_result.dtype
Out[42]: dtype([('f0', 'O'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
Fields are accessed by name:
In [43]: new_result['f0']
Out[43]: array([b'01/23/2020', b"01/31/2020' "], dtype=object)
In [44]: new_result['f1']
Out[44]: array([ 32, 436])
The main structured array doc page suggests using a recfunctions function to convert dtypes:
In [46]: import numpy.lib.recfunctions as rf
Unfortunately the object field gives that function problems:
In [48]: arr = rf.structured_to_unstructured(new_result, dtype=object)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [48], in <cell line: 1>()
----> 1 arr = rf.structured_to_unstructured(new_result, dtype=object)
File <__array_function__ internals>:5, in structured_to_unstructured(*args, **kwargs)
File ~\anaconda3\lib\site-packages\numpy\lib\recfunctions.py:980, in structured_to_unstructured(arr, dtype, copy, casting)
978 with suppress_warnings() as sup: # until 1.16 (gh-12447)
979 sup.filter(FutureWarning, "Numpy has detected")
--> 980 arr = arr.view(flattened_fields)
982 # next cast to a packed format with all fields converted to new dtype
983 packed_fields = np.dtype({'names': names,
984 'formats': [(out_dtype, dt.shape) for dt in dts]})
File ~\anaconda3\lib\site-packages\numpy\core\_internal.py:494, in _view_is_safe(oldtype, newtype)
491 return
493 if newtype.hasobject or oldtype.hasobject:
--> 494 raise TypeError("Cannot change data-type for object array.")
495 return
Let's try the dtype=None option (and clean up the string a bit):
In [49]: new_string = StringIO("01/23/2020, 32, 0, 2 \n01/31/2020 ,436 ,0 ,10")
...: new_result = np.genfromtxt(new_string, dtype=None, encoding="unicode"
...: , delimiter=",")
In [50]: new_result
Out[50]:
array([('01/23/2020', 32, 0, 2), ('01/31/2020 ', 436, 0, 10)],
dtype=[('f0', '<U11'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
Same as your case except the string dtype field.
But that doesn't help; it must be the target dtype that the function doesn't like (or both):
In [51]: arr = rf.structured_to_unstructured(new_result, dtype=object)
...
TypeError: Cannot change data-type for object array.
But we can convert the numeric fields, producing a 2d int array:
In [52]: arr = rf.structured_to_unstructured(new_result[['f1','f2','f3']], dtype=int)
In [53]: arr
Out[53]:
array([[ 32, 0, 2],
[436, 0, 10]])
Assigning fields to object array
In [65]: new_string = "01/23/2020, 32, 0, 2 \n01/31/2020, 436 ,0 ,10".splitlines()
...: new_result = np.genfromtxt(new_string, dtype='O,i,i,i', encoding="unicode"
...: , delimiter=",")
In [66]: new_result
Out[66]:
array([(b'01/23/2020', 32, 0, 2), (b'01/31/2020', 436, 0, 10)],
dtype=[('f0', 'O'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4')])
Create a target array:
In [67]: arr = np.empty((2,4),object)
In [68]: for i,f in enumerate(new_result.dtype.fields):
...:     arr[:,i] = new_result[f]
...:
In [69]: arr
Out[69]:
array([[b'01/23/2020', 32, 0, 2],
[b'01/31/2020', 436, 0, 10]], dtype=object)
Many of the recfunctions do something like this - create a target array, and copy data by field name. Usually a structured array has many more records than fields, so this iteration by field is relatively efficient.
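The create-a-target-and-copy-by-field pattern can be wrapped in a small helper. This is only a sketch of that pattern: fields_to_object_array is a name I made up, not a recfunctions function, and the sample records use a plain string field instead of the object field from the question:

```python
import numpy as np

def fields_to_object_array(arr):
    """Copy a 1-D structured array into a 2-D object array, one column per field."""
    names = arr.dtype.names
    out = np.empty((arr.shape[0], len(names)), dtype=object)
    for col, name in enumerate(names):
        # One assignment per field, however many records there are.
        out[:, col] = arr[name]
    return out

records = np.array(
    [("01/23/2020", 32, 0, 2), ("01/31/2020", 436, 0, 10)],
    dtype=[("f0", "U10"), ("f1", "i4"), ("f2", "i4"), ("f3", "i4")],
)
table = fields_to_object_array(records)
print(table.shape)
```

The loop runs once per field rather than once per record, which is why this stays cheap for long arrays with few fields.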
unpack
If you specify unpack=True, the result is a separate array for each column/field:
In [74]: new_string = "01/23/2020, 32, 0, 2 \n01/31/2020, 436 ,0 ,10".splitlines()
...: new_result = np.genfromtxt(new_string, dtype='O,i,i,i', unpack=True
...: , delimiter=",")
In [75]: new_result
Out[75]:
[array([b'01/23/2020', b'01/31/2020'], dtype=object),
array([ 32, 436], dtype=int32),
array([0, 0], dtype=int32),
array([ 2, 10], dtype=int32)]
They can then be concatenated with stack:
In [77]: np.stack(new_result, axis=1)
Out[77]:
array([[b'01/23/2020', 32, 0, 2],
[b'01/31/2020', 436, 0, 10]], dtype=object)

Extract an ndarray from a np.void array

The npy file I used:
https://github.com/mangomangomango0820/DataAnalysis/blob/master/NumPy/NumPyEx/NumPy_Ex1_3Dscatterplt.npy
After loading the npy file:
data = np.load('NumPy_Ex1_3Dscatterplt.npy')
'''
[([ 2, 2, 1920, 480],) ([ 1, 3, 1923, 480],)
......
([ 3, 3, 1923, 480],)]
⬆️ data.shape, (69,)
⬆️ data.dtype, [('f0', '<i8', (4,))]
⬆️ type(data), <class 'numpy.ndarray'>
⬆️ type(data[0]), <class 'numpy.void'>
'''
You can see that each row, e.g. data[0], has type <class 'numpy.void'>.
I would like to get an ndarray from the data above that looks like this:
[[ 2 2 1920 480]
...
[ 3 3 1923 480]]
The way I did it is:
result = np.array([data[i][0] for i in range(data.shape[0])])
'''
[[ 2 2 1920 480]
...
[ 3 3 1923 480]]
'''
I am wondering if there's a smarter way to process the numpy.void class data and achieve the expected results.

Your data is a structured array, with a compound dtype.
https://numpy.org/doc/stable/user/basics.rec.html
I can recreate it with:
In [130]: dt = np.dtype([("f0", "<i8", (4,))])
In [131]: x = np.array(
...: [([2, 2, 1920, 480],), ([1, 3, 1923, 480],), ([3, 3, 1923, 480],)], dtype=dt
...: )
In [132]: x
Out[132]:
array([([ 2, 2, 1920, 480],), ([ 1, 3, 1923, 480],),
([ 3, 3, 1923, 480],)], dtype=[('f0', '<i8', (4,))])
This is a 1d array with one field, and the field itself contains 4 elements.
Fields are accessed by name:
In [133]: x["f0"]
Out[133]:
array([[ 2, 2, 1920, 480],
[ 1, 3, 1923, 480],
[ 3, 3, 1923, 480]])
This has integer dtype with shape (3,4).
Accessing fields by name applies to more complex structured arrays as well.
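As a rough illustration of a more complex case, here is a structured array with three fields. The field names box, label and score and the values are hypothetical, chosen only for this sketch:

```python
import numpy as np

# Hypothetical dtype: a 4-element integer field, a string field and a float field.
dt = np.dtype([("box", "<i8", (4,)), ("label", "U8"), ("score", "<f8")])
x = np.array(
    [([2, 2, 1920, 480], "a", 0.5), ([1, 3, 1923, 480], "b", 0.9)],
    dtype=dt,
)

boxes = x["box"]             # plain integer array, one row per record
labels = x["label"]          # 1-D string array
subset = x[["box", "score"]] # still structured, restricted to two fields
print(boxes.shape, labels, subset.dtype.names)
```

Indexing with a single name gives an ordinary array, while indexing with a list of names keeps the result structured.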
Using the tolist approach from the other answer:
In [134]: x.tolist()
Out[134]:
[(array([ 2, 2, 1920, 480]),),
(array([ 1, 3, 1923, 480]),),
(array([ 3, 3, 1923, 480]),)]
In [135]: np.array(x.tolist()) # (3,1,4) shape
Out[135]:
array([[[ 2, 2, 1920, 480]],
[[ 1, 3, 1923, 480]],
[[ 3, 3, 1923, 480]]])
In [136]: np.vstack(x.tolist()) # (3,4) shape
Out[136]:
array([[ 2, 2, 1920, 480],
[ 1, 3, 1923, 480],
[ 3, 3, 1923, 480]])
The documentation also suggests using:
In [137]: import numpy.lib.recfunctions as rf
In [138]: rf.structured_to_unstructured(x)
Out[138]:
array([[ 2, 2, 1920, 480],
[ 1, 3, 1923, 480],
[ 3, 3, 1923, 480]])
An element of a structured array displays as a tuple, though its type is the generic np.void.
There is an older class, recarray, that is similar but adds attribute access to fields:
In [146]: y=x.view(np.recarray)
In [147]: y
Out[147]:
rec.array([([ 2, 2, 1920, 480],), ([ 1, 3, 1923, 480],),
([ 3, 3, 1923, 480],)],
dtype=[('f0', '<i8', (4,))])
In [148]: y.f0
Out[148]:
array([[ 2, 2, 1920, 480],
[ 1, 3, 1923, 480],
[ 3, 3, 1923, 480]])
In [149]: type(y[0])
Out[149]: numpy.record
I often refer to elements of structured arrays as records.

Here is the trick
data_clean = np.array(data.tolist())
print(data_clean)
print(data_clean.shape)
Output
[[[ 2 2 1920 480]]
...............
[[ 3 3 1923 480]]]
(69, 1, 4)
If you don't like the extra length-1 dimension in the middle, you can squeeze it out like this:
data_sqz = data_clean.squeeze()
print(data_sqz)
print(data_sqz.shape)
Output
...
[ 3 3 1923 480]]
(69, 4)
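One caveat worth sketching: squeeze() without arguments removes every length-1 axis, so a file with a single record would lose its row axis as well. The one-record array below is made up for illustration and reuses the dtype reported in the question:

```python
import numpy as np

dt = np.dtype([("f0", "<i8", (4,))])
single = np.array([([2, 2, 1920, 480],)], dtype=dt)  # one record only

clean = np.array(single.tolist())
print(clean.shape)                  # (1, 1, 4)
print(clean.squeeze().shape)        # (4,): the row axis is gone too
print(clean.squeeze(axis=1).shape)  # (1, 4): only the middle axis is dropped
print(single["f0"].shape)           # (1, 4): field access needs no squeeze
```

Passing axis=1 keeps the result two-dimensional regardless of how many records there are.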

Read .mat files in python whose content is a table

I was wondering whether I can read in Python a .mat file that contains a Table, is that possible?
I have read this post, but not much is mentioned there.
So far I have tried to read my .mat that contains the table in this way
import tables
from scipy.io import loadmat
from scipy.io import whosmat
x = loadmat('CurrentProto.mat')
print(x)
but I cannot address the elements there. This is what print(x) gives me:
{'__header__': b'MATLAB 5.0 MAT-file, Platform: PCWIN64, Created on: Mon Jul 29 09:47:17 2019', '__version__': '1.0', '__globals__': [], 'None': MatlabOpaque([(b'CurrentProto', b'MCOS', b'table', array([[3707764736],
[ 2],
[ 1],
[ 1],
[ 1],
[ 1]], dtype=uint32))],
dtype=[('s0', 'O'), ('s1', 'O'), ('s2', 'O'), ('arr', 'O')]), '__function_workspace__': array([[ 0, 1, 73, ..., 0, 0, 0]], dtype=uint8)}
Is there a way to read the table in my .mat file, or do I have to save it in a different format within MATLAB?

Convert List of Lists to Numpy Array without carrying over the list

I am trying to convert a list that mixes lists and single numbers into a numpy array.
So far I have tried this, but the lists still carry over into the array:
ll = [['119', '222', '219', '293'], '4', ['179', '124', '500', '235'], '7']
arrays = np.array(ll)
The output is:
array([list(['119', '222', '219', '293']), '4', list(['179', '124', '500', '235']), '7'], dtype=object)
My desired output is something like this:
[(array([ 119, 222, 219, 293]), 4), (array([ 179, 124, 500, 235]), 7)]
Is there a way to do this? I have been trying to get this for the last two days.

Since you want to group every two elements as a tuple, and then convert the first element of each tuple to a numpy array, you can use a list comprehension with zip:
[(np.array(i, dtype=int), int(j)) for i, j in zip(ll[::2], ll[1::2])]
# Result
[(array([119, 222, 219, 293]), 4), (array([179, 124, 500, 235]), 7)]
Notice that I specify a dtype in the numpy array constructor to cast the array to integers.
If you're concerned about making two copies of the list here, you can also simply use range based indexing:
[(np.array(ll[i], dtype=int), int(ll[i+1])) for i in range(0, len(ll), 2)]

You could make a structured array:
In [96]: ll = [['119', '222', '219', '293'], '4', ['179', '124', '500', '235'], '7']
In [97]: dt = np.dtype('4i,i')
In [98]: arr = np.zeros(2, dtype=dt)
In [99]: arr
Out[99]:
array([([0, 0, 0, 0], 0), ([0, 0, 0, 0], 0)],
dtype=[('f0', '<i4', (4,)), ('f1', '<i4')])
In [100]: arr['f0']=ll[::2]
In [101]: arr['f1']=ll[1::2]
In [102]: arr
Out[102]:
array([([119, 222, 219, 293], 4), ([179, 124, 500, 235], 7)],
dtype=[('f0', '<i4', (4,)), ('f1', '<i4')])
and extracted out to a list:
In [103]: arr.tolist()
Out[103]:
[(array([119, 222, 219, 293], dtype=int32), 4),
(array([179, 124, 500, 235], dtype=int32), 7)]
Or a 2x2 object dtype array:
In [104]: np.array(arr.tolist(),dtype=object)
Out[104]:
array([[array([119, 222, 219, 293], dtype=int32), 4],
[array([179, 124, 500, 235], dtype=int32), 7]], dtype=object)
In [105]: _.shape
Out[105]: (2, 2)

It looks like you want individual elements to be numpy arrays, not the whole thing. So you'll have to assign those particular elements directly:
ll[0] = np.array(ll[0])
ll[2] = np.array(ll[2])
You could also loop through, find the lists, and convert them if you don't want to write individual lines.
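A minimal sketch of that loop, assuming the list from the question. The cast to int is my addition so the values match the desired output, and converted is just an illustrative name:

```python
import numpy as np

ll = [['119', '222', '219', '293'], '4', ['179', '124', '500', '235'], '7']

# Turn every inner list into an integer array and every lone string into an int.
converted = [
    np.array(item, dtype=int) if isinstance(item, list) else int(item)
    for item in ll
]
print(converted)
```

This keeps the flat layout of ll. Pairing each array with the number that follows it would still need the zip step from the first answer.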

Numpy - assign column data types (dtype) to existing array

I have a given array:
array = [(u'Andrew', -3, 3, 100.032) (u'Bob', -4, 4, 103.323) (u'Joe', -5, 5, 154.324)]
It is generated by another process (which I cannot control) that reads a CSV table and outputs this numpy array. I now need to assign dtypes to the columns for further analysis.
How can I do this?
Thank you

Is this what you need?
new_array = np.array(array, dtype = [("name", object),
("N1", int),
("N2", int),
("N3", float)])
where name and N1 to N3 are the column names I chose.
It gives:
array([(u'Andrew', -3, 3, 100.032), (u'Bob', -4, 4, 103.323),
(u'Joe', -5, 5, 154.324)],
dtype=[('name', 'O'), ('N1', '<i8'), ('N2', '<i8'), ('N3', '<f8')])
You can sort on "N1", for instance:
new_array.sort(order="N1")
new_array
array([(u'Joe', -5, 5, 154.324), (u'Bob', -4, 4, 103.323),
(u'Andrew', -3, 3, 100.032)],
dtype=[('name', 'O'), ('N1', '<i8'), ('N2', '<i8'), ('N3', '<f8')])
Hope this helps.

recarr = np.rec.fromrecords(array)
Optionally set field names:
recarr = np.rec.fromrecords(array, names="name, idata, idata2, fdata")
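A short sketch of what that call gives you, assuming the rows arrive as a list of tuples like the ones in the question (rows is my own stand-in for the array variable):

```python
import numpy as np

rows = [("Andrew", -3, 3, 100.032), ("Bob", -4, 4, 103.323), ("Joe", -5, 5, 154.324)]

# fromrecords infers one dtype per column; names labels the fields.
recarr = np.rec.fromrecords(rows, names="name, idata, idata2, fdata")

print(recarr.dtype)
print(recarr.name)   # fields are available as attributes
print(recarr.idata)
```

The column dtypes are inferred from the values, so an explicit dtype list as in the other answer is the better fit when you need specific types.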
