Pandas series to array conversion is getting me arrays of array objects - python

I have a Pandas series; here are the first two rows:
X.head(2)
Each row holds a 1D array; the column header is mels_flatten:
mels_flatten
0 [0.0171469795289, 0.0173154008662, 0.395695541...
1 [0.0471267533454, 0.0061760868171, 0.005647608...
I want to store the values in a single array to feed to a classifier model.
np.vstack(X.values)
or
np.array(X.values)
Both return the following:
array([[ array([ 1.71469795e-02, 1.73154009e-02, 3.95695542e-01, ...,
2.35955651e-04, 8.64118460e-04, 7.74663408e-04])],
[ array([ 0.04712675, 0.00617609, 0.00564761, ..., 0.00277199,
0.00205229, 0.00043118])],
I am not sure how to process an array of array objects.
My expected result is:
array([[ 1.71469795e-02, 1.73154009e-02, 3.95695542e-01, ...,
2.35955651e-04, 8.64118460e-04, 7.74663408e-04],
[ 0.04712675, 0.00617609, 0.00564761, ..., 0.00277199,
0.00205229, 0.00043118],
I have tried np.concatenate and np.resize, as some other posts suggested, with no luck.

I find it likely that not all of your 1d arrays are the same length, i.e. your series is not compatible with a rectangular 2d array.
Consider the following dummy example:
import pandas as pd
import numpy as np
X = pd.Series([np.array([1,2,3]),np.array([4,5,6])])
# 0 [1, 2, 3]
# 1 [4, 5, 6]
# dtype: object
np.vstack(X.values)
# array([[1, 2, 3],
# [4, 5, 6]])
As the above demonstrates, a collection of 1d arrays (or lists) of the same size will be nicely stacked into a 2d array. Check the size of your arrays, and you'll probably find that there are some discrepancies:
>>> X.apply(len)
0 3
1 3
dtype: int64
If X.apply(len).unique() returns an array with more than one element, you have proof of the problem. In the above rectangular case:
>>> X.apply(len).unique()
array([3])
In a non-conforming example:
>>> Y = pd.Series([np.array([1,2,3]),np.array([4,5])])
>>> np.array(Y.values)
array([array([1, 2, 3]), array([4, 5])], dtype=object)
>>> Y.apply(len).unique()
array([3, 2])
As you can see, the nested array result is coupled to the non-unique length of items inside the original array.
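The length check above can be folded into a small helper. This is only a sketch: stack_series is a hypothetical name, not a pandas or NumPy function, and it assumes every row is a 1D array or list.

```python
import numpy as np
import pandas as pd


def stack_series(series):
    """Stack a Series of equal-length 1D arrays into one 2D array.

    Hypothetical helper: raises ValueError when the rows differ in
    length, because a ragged collection cannot form a rectangle.
    """
    lengths = sorted(series.apply(len).unique().tolist())
    if len(lengths) != 1:
        raise ValueError(f"rows have differing lengths: {lengths}")
    return np.vstack(series.values)


X = pd.Series([np.array([1, 2, 3]), np.array([4, 5, 6])])
stacked = stack_series(X)
print(stacked.shape)  # (2, 3)

Y = pd.Series([np.array([1, 2, 3]), np.array([4, 5])])
try:
    stack_series(Y)
    ragged_detected = False
except ValueError as err:
    ragged_detected = True
    print(err)
```

Failing loudly on ragged input is a design choice for this sketch; padding or truncating the rows would be alternatives, depending on what the classifier expects.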

Python xarray - vectorized indexing

I'm trying to understand vectorized indexing in xarray by following this example from the docs:
import xarray as xr
import numpy as np
da = xr.DataArray(np.arange(12).reshape((3, 4)), dims=['x', 'y'],
coords={'x': [0, 1, 2], 'y': ['a', 'b', 'c', 'd']})
ind_x = xr.DataArray([0, 1], dims=['x'])
ind_y = xr.DataArray([0, 1], dims=['y'])
The output of the array da is as follows:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
So far so good. The example then shows two ways of indexing: orthogonal (not what I'm interested in here) and vectorized (what I want). For vectorized indexing the following is shown:
In [37]: da[ind_x, ind_x] # vectorized indexing
Out[37]:
<xarray.DataArray (x: 2)>
array([0, 5])
Coordinates:
y (x) <U1 'a' 'b'
* x (x) int64 0 1
The result seems to be what I want, but this feels very strange to me. ind_x (which in theory refers to dims=['x']) is being passed twice but somehow is capable of indexing what appears to be both in the x and y dims. As far as I understand the x dim would be the rows and y dim would be the columns, is that correct? How come the same ind_x is capable of accessing both the rows and the cols?
This seems to be the concept I need for my problem, but I can't understand how it works or how to extend it to more dimensions. I was expecting this result to be given by da[ind_x, ind_y]; however, surprisingly enough, that seems to yield orthogonal indexing.
Having the example use ind_x twice is probably a little confusing: the name of the indexer's dimension doesn't actually have to match a dimension of the array at all for the indexing to work! Observe:
ind_a = xr.DataArray([0, 1], dims=["a"])
da[ind_a, ind_a]
Gives:
<xarray.DataArray (a: 2)>
array([0, 5])
Coordinates:
x (a) int32 0 1
y (a) <U1 'a' 'b'
Dimensions without coordinates: a
The same goes for the orthogonal example:
ind_a = xr.DataArray([0, 1], dims=["a"])
ind_b = xr.DataArray([0, 2], dims=["b"])
da[ind_a, ind_b]
Result:
<xarray.DataArray (a: 2, b: 2)>
array([[0, 2],
[4, 6]])
Coordinates:
x (a) int32 0 1
y (b) <U1 'a' 'c'
Dimensions without coordinates: a, b
The difference is purely in terms of "labeling", as in this case you end up with dimensions without coordinates.
Fancy indexing
Generally stated, I personally do not find "fancy indexing" the most intuitive concept. I did find this example in NEP 21 pretty clarifying: https://numpy.org/neps/nep-0021-advanced-indexing.html
Specifically, this:
Consider indexing a 2D array by two 1D integer arrays, e.g., x[[0, 1], [0, 1]]:
Outer indexing is equivalent to combining multiple integer indices with itertools.product(). The result in this case is another 2D
array with all combinations of indexed elements, e.g.,
np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]])
Vectorized indexing is equivalent to combining multiple integer
indices with zip(). The result in this case is a 1D array containing
the diagonal elements, e.g., np.array([x[0, 0], x[1, 1]]).
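The two behaviors quoted above can be reproduced in plain NumPy. This is a minimal sketch; the array x and the index lists rows and cols are made up for illustration.

```python
import itertools

import numpy as np

x = np.arange(12).reshape(3, 4)
rows = [0, 1]
cols = [0, 1]

# Vectorized (pointwise) indexing: the index lists are paired like zip().
vectorized = x[rows, cols]
via_zip = np.array([x[i, j] for i, j in zip(rows, cols)])

# Outer (orthogonal) indexing: every combination, like itertools.product().
outer = x[np.ix_(rows, cols)]
via_product = np.array(
    [x[i, j] for i, j in itertools.product(rows, cols)]
).reshape(len(rows), len(cols))

print(vectorized)  # [0 5]
print(outer)       # [[0 1]
                   #  [4 5]]
```

Plain NumPy fancy indexing is vectorized by default, and np.ix_ is the usual way to ask for the outer product of the indexers instead.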
Back to xarray
da[ind_x, ind_y]
Can also be written as:
da.isel(x=ind_x, y=ind_y)
The dimensions are implicit in the order. However, xarray still attempts to broadcast (based on dimension labels), so da[ind_y] mismatches and results in an error. da[ind_a] and da[ind_b] both work.
More dimensions
The dims you provide for the indexer are what determines the shape of the output, not the dimensions of the array you're indexing.
If you want to select single values along the dimensions (so we're zip()-ing through the indexes simultaneously), just make sure that your indexers share the dimension, here for a 3D array:
da = xr.DataArray(
data=np.arange(3 * 4 * 5).reshape(3, 4, 5),
coords={
"x": [1, 2, 3],
"y": ["a", "b", "c", "d"],
"z": [1.0, 2.0, 3.0, 4.0, 5.0],
},
dims=["x", "y", "z"],
)
ind_along_x = xr.DataArray([0, 1], dims=["new_index"])
ind_along_y = xr.DataArray([0, 2], dims=["new_index"])
ind_along_z = xr.DataArray([0, 3], dims=["new_index"])
da[ind_along_x, ind_along_y, ind_along_z]
Note that the values of the indexers do not have to be the same -- that would be a pretty severe limitation, after all.
Result:
<xarray.DataArray (new_index: 2)>
array([ 0, 33])
Coordinates:
x (new_index) int32 1 2
y (new_index) <U1 'a' 'c'
z (new_index) float64 1.0 4.0
Dimensions without coordinates: new_index
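For contrast, here is a sketch that runs the same three indexers once with a shared dimension name and once with distinct names. It assumes xarray is installed; the names new_index, a, b and c are arbitrary labels chosen for illustration, and the coordinates are left out to keep it short.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(3 * 4 * 5).reshape(3, 4, 5),
    dims=["x", "y", "z"],
)

# Indexers that share one dimension name: pointwise (zip-like) selection.
ind_x = xr.DataArray([0, 1], dims=["new_index"])
ind_y = xr.DataArray([0, 2], dims=["new_index"])
ind_z = xr.DataArray([0, 3], dims=["new_index"])
pointwise = da[ind_x, ind_y, ind_z]

# Indexers with distinct dimension names: outer (product-like) selection.
ind_a = xr.DataArray([0, 1], dims=["a"])
ind_b = xr.DataArray([0, 2], dims=["b"])
ind_c = xr.DataArray([0, 3], dims=["c"])
outer = da[ind_a, ind_b, ind_c]

print(pointwise.dims, pointwise.values)  # ('new_index',) [ 0 33]
print(outer.dims, outer.shape)           # ('a', 'b', 'c') (2, 2, 2)
```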

iterating a filtered Numpy array whilst maintaining index information

I am attempting to pass filtered values from a Numpy array into a function.
I need to pass only the values above a certain threshold, together with their index positions within the Numpy array.
I am trying to avoid iterating over the entire array in Python by using Numpy's own filtering; the arrays I am dealing with hold about 20k values, of which potentially only very few are relevant.
import numpy as np
somearray = np.array([1,2,3,4,5,6])
arrayindex = np.nonzero(somearray > 4)
for i in arrayindex:
somefunction(arrayindex[0], somearray[arrayindex[0]])
This threw up errors of logic not being able to handle multiple values,
this led me to testing it through print statement to see what was going on.
for cell in arrayindex:
print(f"index {cell}")
print(f"data {somearray[cell]}")
I expected an output of
index 4
data 5
index 5
data 6
But instead i get
index [4 5]
data [5 6]
I have looked through different methods of iterating through numpy arrays, such as nditer, but none seems to let me do the filtering of values outside of the for loop.
Is there a solution to my quandary?
Oh, I am aware that it is generally frowned upon to loop through a numpy array; however, the function that I am passing these values to is complex, triggering certain events and uploading data to a database depending on the data's location within the array.
Thanks.
import numpy as np
somearray = np.array([1,2,3,4,5,6])
arrayindex = [idx for idx, val in enumerate(somearray) if val > 4]
for i in range(0, len(arrayindex)):
somefunction(arrayindex[i], somearray[arrayindex[i]])
for i in range(0, len(arrayindex)):
print("index", arrayindex[i])
print("data", somearray[arrayindex[i]])
You need to have a clear idea of what nonzero produces, and pay attention to the difference between indexing with a list(s) and with a tuple.
===
In [110]: somearray = np.array([1,2,3,4,5,6])
...: arrayindex = np.nonzero(somearray > 4)
nonzero produces a tuple of arrays, one per dimension (this becomes more obvious with 2d arrays):
In [111]: arrayindex
Out[111]: (array([4, 5]),)
It can be used directly as an index:
In [113]: somearray[arrayindex]
Out[113]: array([5, 6])
In this 1d case you could take the array out of the tuple, and iterate on it:
In [114]: for i in arrayindex[0]:print(i, somearray[i])
4 5
5 6
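For the 1d case, the same loop can be written with np.flatnonzero, which returns the index array directly instead of a one-element tuple. A minimal sketch; somefunction here is a hypothetical stand-in for the asker's real callback.

```python
import numpy as np


def somefunction(index, value):
    # Hypothetical stand-in for the asker's real callback.
    return f"index {index} data {value}"


somearray = np.array([1, 2, 3, 4, 5, 6])

# The filtering happens inside NumPy; only the matches are looped over.
indices = np.flatnonzero(somearray > 4)
values = somearray[indices]

results = [somefunction(int(i), int(v)) for i, v in zip(indices, values)]
for line in results:
    print(line)
# index 4 data 5
# index 5 data 6
```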
argwhere does a 'transpose', which could also be used for iteration
In [115]: idxs = np.argwhere(somearray>4)
In [116]: idxs
Out[116]:
array([[4],
[5]])
In [117]: for i in idxs: print(i,somearray[i])
[4] [5]
[5] [6]
idxs has shape (2,1), so each i is an array of shape (1,), which results in the brackets in the display. Occasionally that's useful, but nonzero is used more (often by its other name, np.where).
2d
The argwhere documentation has a 2d example:
In [119]: x=np.arange(6).reshape(2,3)
In [120]: np.argwhere(x>1)
Out[120]:
array([[0, 2],
[1, 0],
[1, 1],
[1, 2]])
In [121]: np.nonzero(x>1)
Out[121]: (array([0, 1, 1, 1]), array([2, 0, 1, 2]))
In [122]: x[np.nonzero(x>1)]
Out[122]: array([2, 3, 4, 5])
While nonzero can be used to index the array, argwhere elements can't.
In [123]: for ij in np.argwhere(x>1):
...: print(ij,x[ij])
...:
...
IndexError: index 2 is out of bounds for axis 0 with size 2
The problem is that ij is a 1d array, which, like a list, is used to index along a single dimension. numpy distinguishes between lists and tuples when indexing. (Earlier versions fudged the difference, but current versions are taking a more rigorous approach.)
So we need to change the list into a tuple. One way is to unpack it:
In [124]: for i,j in np.argwhere(x>1):
...: print(i,j,x[i,j])
...:
...:
0 2 2
1 0 3
1 1 4
1 2 5
I could have used: print(ij,x[tuple(ij)]) in [123].
I should also have used unpacking in the [117] iteration:
In [125]: for i, in idxs: print(i,somearray[i])
4 5
5 6
or somearray[tuple(i)]
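Another way to get tuple indices for any number of dimensions is to zip the arrays that nonzero returns. This is a sketch with a made-up array x; the name hits is just for illustration.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)
mask = x > 1

# zip(*nonzero) yields one tuple of integers per selected element,
# and a tuple is a valid multi-dimensional index into x.
hits = []
for idx in zip(*np.nonzero(mask)):
    hits.append((tuple(map(int, idx)), int(x[idx])))

for position, value in hits:
    print(position, value)
# (0, 2) 2
# (1, 0) 3
# (1, 1) 4
# (1, 2) 5
```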

Converting a row from an astropy table to a numpy array

I have an astropy table
<Table length=3>
a b
int64 int64
----- -----
1 3
2 5
4 7
And I would like to convert a row to a numpy array. But when I try, I end up with an array with no dimensions
In: np.array(mytable[0]).shape
Out: ()
and if I do
myrow = mytable[0]
myrow[0]
I get the error
IndexError: too many indices for array
Is there something like t[0].values I could do, that would return array([1, 3]) ?
When you slice a row from a table in Astropy and convert it to an ndarray, you get a 0D structured array back, which is why the shape attribute is empty. For a general solution, numpy provides a structured_to_unstructured function (in numpy.lib.recfunctions) that works for more than just a single row slice as well.
>>> import numpy as np
>>> from numpy.lib import recfunctions as rfn  # not always loaded by "import numpy"
>>> rfn.structured_to_unstructured(np.array(t[0]))
array([1, 3])
>>> rfn.structured_to_unstructured(np.array(t[1:]))
array([[2, 5],
[4, 7]])
The Table.Row object provides an iterator over the values, so you could do:
>>> np.array(list(t[0]))
array([1, 3])
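The 0D structured array can be reproduced without astropy. In this sketch the structured array table is a made-up stand-in for np.array(mytable), and the tolist() route is an extra alternative that is not part of the answer above.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Stand-in for np.array(mytable): a structured array with fields a and b.
table = np.array([(1, 3), (2, 5), (4, 7)], dtype=[("a", "i8"), ("b", "i8")])

# One row wrapped in np.array is 0-dimensional, hence the empty shape.
row = np.array(table[0])
print(row.shape)  # ()

# tolist() on a 0D structured array gives a plain tuple of the fields.
row_values = np.array(row.tolist())
print(row_values)  # [1 3]

# structured_to_unstructured handles multi-row slices in one call.
rest = rfn.structured_to_unstructured(table[1:])
print(rest)
```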

storing numpy object array of equal-size ndarrays to a .mat file using scipy.io.savemat

I am trying to create .mat data files using python. The matlab code expects the data to have a certain format, where two-dimensional ndarrays of non-uniform sizes are stored as objects in a column vector. So, in my case, there would be k numpy arrays of shape (m_i, n) - with different m_i for each array - stored in a numpy array with dtype=object of shape (k, 1). I then add this object array to a dictionary and pass it to scipy.io.savemat().
This works fine so long as the m_i are indeed different. If all k arrays happen to have the same number of rows m_i, the behaviour becomes strange. First of all, it requires very explicit assignment to a numpy array of dtype=object that has been initialised to the final size k, otherwise numpy simply creates a three-dimensional array. But even when I have the correct format in python and store it to a .mat file using savemat, there is some kind of problem in the translation to the matlab format.
When I reload the data from the .mat file using scipy.io.loadmat, I find that I still have an object array of shape (k, 1), which still has elements of shape (m, n). However, each element is no longer an int or a float but is instead a numpy array of shape (1, 1) that has to be further indexed to access the contained int or float. So an individual element of an object vector that was supposed to be a numpy array of shape (2, 4) would look something like this:
[array([[array([[0.82374894]]), array([[0.50730055]]),
array([[0.36721625]]), array([[0.45036349]])],
[array([[0.26119276]]), array([[0.16843872]]),
array([[0.28649524]]), array([[0.64239569]])]], dtype=object)]
This also poses a problem for the matlab code that I am trying to build my data files for. It runs fine for the arrays of objects that have different shapes but will break when there are arrays containing arrays of the same shape.
I know this is a rather obscure and possibly unavoidable issue but I figured I would see if anyone else has encountered it and found a fix. Thanks.
I'm not quite clear about the problem. Let me try to recreate your case:
In [58]: from scipy.io import loadmat, savemat
In [59]: A = np.empty((2,1), object)
In [61]: A[0,0]=np.arange(4).reshape(2,2)
In [62]: A[1,0]=np.arange(6).reshape(3,2)
In [63]: A
Out[63]:
array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3],
[4, 5]])]], dtype=object)
In [64]: B=A[[0,0],:]
In [65]: B
Out[65]:
array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3]])]], dtype=object)
As I explained earlier today, creating an object dtype array from arrays of matching size requires special handling. np.array(...) tries to create a higher dimensional array. https://stackoverflow.com/a/56243305/901925
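A minimal sketch of that special handling, with two made-up blocks of identical shape: passing the list to np.array merges them into one 3d array, while preallocating the object array and assigning each element keeps them separate.

```python
import numpy as np

blocks = [np.arange(4).reshape(2, 2), np.arange(4, 8).reshape(2, 2)]

# Naive construction merges equal-shaped blocks into one 3d array.
merged = np.array(blocks, dtype=object)
print(merged.shape)  # (2, 2, 2)

# Preallocate the (k, 1) object array, then assign element by element.
cells = np.empty((len(blocks), 1), dtype=object)
for k, block in enumerate(blocks):
    cells[k, 0] = block
print(cells.shape, cells[0, 0].shape)  # (2, 1) (2, 2)
```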
Saving:
In [66]: savemat('foo.mat', {'A':A, 'B':B})
Loading:
In [74]: loadmat('foo.mat')
Out[74]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:20:42 2019',
'__version__': '1.0',
'__globals__': [],
'A': array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3],
[4, 5]])]], dtype=object),
'B': array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3]])]], dtype=object)}
In [75]: _74['A'][1,0]
Out[75]:
array([[0, 1],
[2, 3],
[4, 5]])
Your problem case looks like it's an object dtype array containing numbers:
In [89]: C = np.arange(4).reshape(2,2).astype(object)
In [90]: C
Out[90]:
array([[0, 1],
[2, 3]], dtype=object)
In [91]: savemat('foo1.mat', {'C': C})
In [92]: loadmat('foo1.mat')
Out[92]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:39:31 2019',
'__version__': '1.0',
'__globals__': [],
'C': array([[array([[0]]), array([[1]])],
[array([[2]]), array([[3]])]], dtype=object)}
Evidently savemat has converted the integer objects into 2d MATLAB compatible arrays. In MATLAB everything, even scalars, is at least 2d.
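One way to catch this before saving is to check what the object array actually holds. This is only a sketch: find_scalar_cells is a hypothetical helper, not part of scipy or NumPy, and it assumes that every cell is meant to be an ndarray.

```python
import numpy as np


def find_scalar_cells(cells):
    """Return the indices of object-array cells that are not ndarrays."""
    return [
        idx for idx, element in np.ndenumerate(cells)
        if not isinstance(element, np.ndarray)
    ]


# An object array of numbers, like C above: every cell is a bare scalar.
C = np.arange(4).reshape(2, 2).astype(object)
print(find_scalar_cells(C))  # [(0, 0), (0, 1), (1, 0), (1, 1)]

# An object array of arrays, like A above: nothing is flagged.
A = np.empty((2, 1), dtype=object)
A[0, 0] = np.arange(4).reshape(2, 2)
A[1, 0] = np.arange(6).reshape(3, 2)
print(find_scalar_cells(A))  # []

# Converting C to a numeric dtype turns it back into an ordinary matrix.
C_numeric = C.astype(float)
print(C_numeric.dtype)  # float64
```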
===
And in Octave, the object dtype arrays all produce cells, and the 2d numeric arrays produce matrices:
>> load foo.mat
>> A
A =
{
[1,1] =
0 1
2 3
[2,1] =
0 1
2 3
4 5
}
>> B
B =
{
[1,1] =
0 1
2 3
[2,1] =
0 1
2 3
}
>> load foo1.mat
>> C
C =
{
[1,1] = 0
[2,1] = 2
[1,2] = 1
[2,2] = 3
}
Python: Issue reading in str from MATLAB .mat file using h5py and NumPy
is a relatively recent SO question that showed there's a difference between the HDF5 files written by Octave and those written by MATLAB.

Access columns and rows of numpy.ndarray

I am currently struggling with extracting certain columns and rows from a matrix stored as a numpy.ndarray.
I have a list to which I've appended these numpy.ndarrays.
This list is stored in a variable named data.
print(data[0].shape)
outputs this:
(400, 288)
According to the documentation, I understand this to mean that the matrix has 400 rows and 288 columns.
How do I extract each of the 288 columns separately?
Example:
>> import numpy as np
>> data = np.random.rand(3,3)
>> print(data)
[[ 0.97522481 0.57583658 0.68582806]
[ 0.88509883 0.22261933 0.84307038]
[ 0.59397925 0.51592125 0.54346909]]
How do I print the columns of this 3x3 matrix separately, the first being
[0.97522481 , 0.88509883, 0.59397925 ]
without outputting the others?
Is this what you are looking for?
import numpy as np
arr = np.array([[1, 2],
[3, 4],
[5, 6]])
print(arr.shape)
# (3, 2)
print(list(arr.T))
# [array([1, 3, 5]), array([2, 4, 6])]
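To pull out a single column without building the whole list, slice along the second axis. A small sketch with a made-up 3x4 array standing in for data[0]:

```python
import numpy as np

data = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

# All rows of column 0; the result is a 1d array.
first_column = data[:, 0]
print(first_column)  # [0 4 8]

# Iterating over the transpose visits one column at a time.
columns = []
for j, col in enumerate(data.T):
    columns.append(col)
    print(j, col)
```

The same slicing works for rows: data[0, :] or simply data[0] gives the first row.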
