Suppress numpy array creation protocol for numpy arrays of objects - Python

I am trying to build a library which reads complex HDF5 data files in python.
I am running into a problem where an HDF5 Dataset sometimes implements the default array protocol, such that when a numpy array is created from it, it casts to its natural array type.
In [8]: ds
Out[8]: <HDF5 dataset "two_by_zero_empty_matrix": shape (2,), type "<u8">
In [9]: ds.value
Out[9]: array([2, 0], dtype=uint64)
This Dataset object implements the numpy array protocol, and when the dataset consists of numbers, it supplies a default array type.
In [10]: np.array(ds)
Out[10]: array([2, 0], dtype=uint64)
However, if the dataset doesn't consist of numbers, but some other objects, as you would expect, it just uses a numpy array of type np.object:
In [43]: ds2
Out[43]: <HDF5 dataset "somecells": shape (2, 3), type "|O8">
In [44]: np.array(ds2)
Out[44]:
array([[<HDF5 object reference>, <HDF5 object reference>,
<HDF5 object reference>],
[<HDF5 object reference>, <HDF5 object reference>,
<HDF5 object reference>]], dtype=object)
This behavior might seem convenient, but in my case it's actually inconvenient, since it interferes with my recursive traversal of the data file. Working around it turns out to be difficult, since there are a lot of different possible data types, which have to be special-cased a little differently depending on whether they are children of objects or arrays of numbers.
My question is this: is there a way to suppress the default array creation protocol, such that I could create an object array out of dataset objects that want to cast to their natural duck types?
That is, I want something like: np.array(ds, dtype=object), which will produce an array of [<Dataset object of type int>, dtype=object] and not [3 4 5, dtype=int].
But np.array(ds, dtype=np.object) throws IOError: Can't read data (No appropriate function for conversion path)
I tried in earnest to google documentation about how the numpy array protocol works, and found a lot, but it doesn't appear that anyone considered the possibility that someone might want this behavior.

I can understand where the Out[44] is coming from. It's an array containing pointers to objects, in this case h5py references to objects on the file (I think).
With np.array(ds, dtype=object), are you trying to create something other than the 'normal' array([2, 0], dtype=uint64) that you get with np.array(ds)?
But what is the parallel array? A single element array with a pointer to ds? Or a 2 element array with pointers to 2 and 0 somewhere on the file? What if they aren't <HDF5 object reference>?
In numpy, without any h5py stuff, I can create an object array from a list of values:
In [104]: np.array([2,0], dtype=object)
Out[104]: array([2, 0], dtype=object)
Or I can start with an empty array (filled with None) and assign values:
In [105]: x=np.empty((2,), dtype=object)
In [106]: x[0]=2
In [107]: x[1]=0
In [108]: x
Out[108]: array([2, 0], dtype=object)
I guess you could try:
x[0] = ds[0]
or
x[:] = ds[:]
Or make a single element object array
x = np.empty((), dtype=object)
x[()] = ds
I don't have an h5py test file open in my IPython session to test this.
But I can do something weird like make an object array that contains itself. I can work with it, but I can't display it without getting a recursion error.
In [118]: x=np.empty((),dtype=object)
In [119]: x[()]=x
In [120]: x1=x[()]
In [121]: x1==x
Out[121]: True
I have a small h5py file open on another terminal:
In [315]: list(f.keys())
Out[315]: ['d', 'x', 'y']
In [317]: f['d'] # the group
Out[317]: <HDF5 group "/d" (2 members)>
x is a string:
In [318]: f['x'] # a single element (a string)
Out[318]: <HDF5 dataset "x": shape (), type "|O4">
In [330]: f['x'].value
Out[330]: 'astring'
In [331]: np.array(f['x'])
Out[331]: array('astring', dtype=object)
y is an array:
In [320]: f['y'][:]
Out[320]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [321]: f['y'].value
Out[321]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [322]: np.array(f['y'])
Out[322]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [323]: timeit np.array(f['y'])
1000 loops, best of 3: 364 µs per loop
In [324]: timeit f['y'].value
1000 loops, best of 3: 380 µs per loop
So access with value and array is equivalent.
Access as object array gives the same sort of error as you got.
In [325]: np.array(f['y'],dtype=object)
...
OSError: can't read data (Dataset: Read failed)
Conversion to float works fine:
In [326]: np.array(f['y'],dtype=float)
Out[326]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
And the assignment to a predefined object array works:
In [327]: x=np.empty((),dtype=object)
In [328]: x[()]=f['y']
In [329]: x
Out[329]: array(<HDF5 dataset "y": shape (10,), type "<i4">, dtype=object)
Trying to create a 10 element array to take y:
In [332]: y1=np.empty((10,),dtype=object)
In [333]: y1[:]=f['y']
...
OSError: can't read data (Dataset: Read failed)
In [334]: y1[:]=f['y'].value
In [335]: y1
Out[335]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
y1[:]=f['y'][:] also works
I can't assign the dataset to all of y1 (same error as when I tried np.array(f['y'], dtype=object)). But I can assign its values. I can even assign the dataset to one element of y1:
In [338]: y1[-1]=f['y']
In [339]: y1
Out[339]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8,
<HDF5 dataset "y": shape (10,), type "<i4">], dtype=object)
I keep coming back to the basic idea that an object array is just a collection of pointers, essentially a list in an array wrapper.
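Following that idea, for the recursive traversal in the question I'd be inclined to branch on the h5py types myself rather than letting np.array decide. A minimal sketch, assuming h5py is installed ('data.h5' and the printing logic are hypothetical placeholders):
import h5py

def traverse(obj, depth=0):
    indent = '  ' * depth
    if isinstance(obj, h5py.Group):           # File is a Group subclass
        for name, child in obj.items():
            print(f'{indent}{name}:')
            traverse(child, depth + 1)
    elif isinstance(obj, h5py.Dataset):
        if obj.dtype.kind == 'O':             # object data, e.g. HDF5 references
            print(f'{indent}object dataset, shape {obj.shape}')
        else:                                 # numeric data: safe to read directly
            print(f'{indent}numeric dataset: {obj[()]}')

with h5py.File('data.h5', 'r') as f:
    traverse(f)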

Related

Python xarray index return

I have a multidimensional xarray like:
testarray = xr.DataArray([[1,-2,3],[4,5,-6]])
and I want to get the indices for a specific condition, e.g. where testarray is smaller than 0.
So the expected result should be an array like:
result = [[1,2],[0,1]]
Or any other format that lets me get these indices for further calculations. I can't imagine that there is no option within xarray for such an elementary problem, but I can't find it. Things like
testarray.where(testarray<0)
do some very suspicious stuff. What's the use of an array that's the same but with NaNs where the condition is not met?
Thanks a lot for your help :)
To get the indices, you could use np.argwhere:
In [3]: da= xr.DataArray([[1,-2,3],[4,5,-6]])
In [4]: da
Out[4]:
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 1, -2, 3],
[ 4, 5, -6]])
Dimensions without coordinates: dim_0, dim_1
In [14]: da.where(da<0,0)
Out[14]:
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 0, -2, 0],
[ 0, 0, -6]])
Dimensions without coordinates: dim_0, dim_1
# Note you'd need to handle the case of a 0 value here
In [13]: np.argwhere(da.where(da<0,0).values)
Out[13]:
array([[0, 1],
[1, 2]])
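A variant that, I believe, sidesteps the zero-value caveat is to build the boolean mask first and pass its underlying values straight to np.argwhere:
import numpy as np
import xarray as xr

da = xr.DataArray([[1, -2, 3], [4, 5, -6]])
# (da < 0) is a boolean DataArray; argwhere on its values
# returns the indices of the True entries directly
np.argwhere((da < 0).values)
# array([[0, 1],
#        [1, 2]])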
I agree this would be a useful function to have natively in xarray; I'm not sure of the best way of doing it at the moment. Open to ideas!

How to save a Python list of arrays of varied dimensions to mat file [duplicate]

This question already has an answer here:
creating Matlab cell arrays in python
(1 answer)
Closed 2 years ago.
I am going to save a list of arrays into a file that can be read in Matlab. The arrays in the list are 3-dimensional but of varied shapes, so I cannot put them into a single large array.
Originally I thought I could save the list into a pickle file and then read the file in Matlab. Later I found Matlab does not support reading pickle files. I also tried using scipy.io.savemat to save the list to a mat file, but the inconsistent array dimensions cause saving problems.
Does anyone have ideas of how to solve this problem? It should be noted that the list is very large in memory (>4 GB).
If you don't need a single file, you can iterate through the list and savemat each entry into a separate file. Then iterate through that directory and load each file into a cell array element.
You can also zip the directory you store these in, to get one file to pass around.
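A sketch of that per-file approach from the Python side (the directory and variable names are placeholders, not from the original answer):
import os
import numpy as np
from scipy import io

arrays = [np.zeros((2, 3, 4)), np.ones((5, 3, 2))]   # stand-ins for the real list
os.makedirs('mats', exist_ok=True)
for i, arr in enumerate(arrays):
    # one .mat file per list entry; MATLAB can dir() the folder
    # and load each file into a cell array element
    io.savemat(os.path.join('mats', f'entry_{i:04d}.mat'), {'data': arr})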
savemat is the right tool for saving numpy arrays in a MATLAB format. I'd suggest making the desired structure in MATLAB, saving it, and loading it with loadmat; then duplicate that layout when going the other way.
Also loadmat the savemat file to get a better idea of how it maps python objects onto MATLAB ones.
Arrays may become order-F 2d arrays, cells may become object dtype arrays, and structs may become structured arrays.
Create a .mat from a list:
In [180]: io.savemat('test.mat', {'x':[np.arange(12).reshape(3,4), np.arange(3), 4]})
reload it:
In [181]: data = io.loadmat('test.mat')
In [182]: data
Out[182]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Sat Mar 21 20:02:02 2020',
'__version__': '1.0',
'__globals__': [],
'x': array([[array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]]),
array([[0, 1, 2]]), array([[4]])]], dtype=object)}
It has one named variable
In [183]: data['x']
Out[183]:
array([[array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]]),
array([[0, 1, 2]]), array([[4]])]], dtype=object)
The shape is 2d (like all MATLAB) and object dtype (to hold a mix of items):
In [185]: data['x'].shape
Out[185]: (1, 3)
Within that is a 2d array:
In [186]: data['x'][0,0].shape
Out[186]: (3, 4)
If I check the flags, I see data['x'] is F contiguous (but the source used in the savemat was the default C contiguous).
Note also the change in shape of the 1d array and scalar - again following MATLAB conventions.
In Octave:
>> data = load('test.mat')
data =
scalar structure containing the fields:
x =
{
[1,1] =
0 1 2 3
4 5 6 7
8 9 10 11
[1,2] =
0 1 2
[1,3] = 4
}
we get a cell variable.
If I wanted a numpy matrix that didn't get changed in this transfer, I'd have to start with something like:
In [188]: np.arange(12).reshape(3,4,order='F')
Out[188]:
array([[ 0, 3, 6, 9],
[ 1, 4, 7, 10],
[ 2, 5, 8, 11]])
numpy is, by default, C order: row-major, with the first dimension being the outermost. MATLAB is the opposite: F (Fortran) order, column-major, with the last dimension outermost.
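A quick way to see that distinction in numpy (my addition, not part of the original session):
import numpy as np

a_c = np.arange(12).reshape(3, 4)    # default: C order, row-major
a_f = np.asfortranarray(a_c)         # same values, column-major storage
print(a_c.flags['C_CONTIGUOUS'])     # True
print(a_f.flags['F_CONTIGUOUS'])     # True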

storing numpy object array of equal-size ndarrays to a .mat file using scipy.io.savemat

I am trying to create .mat data files using python. The matlab code expects the data to have a certain format, where two-dimensional ndarrays of non-uniform sizes are stored as objects in a column vector. So, in my case, there would be k numpy arrays of shape (m_i, n) - with different m_i for each array - stored in a numpy array with dtype=object of shape (k, 1). I then add this object array to a dictionary and pass it to scipy.io.savemat().
This works fine so long as the m_i are indeed different. If all k arrays happen to have the same number of rows m_i, the behaviour becomes strange. First of all, it requires very explicit assignment to a numpy array of dtype=object that has been initialised to the final size k, otherwise numpy simply creates a three-dimensional array. But even when I have the correct format in python and store it to a .mat file using savemat, there is some kind of problem in the translation to the matlab format.
When I reload the data from the .mat file using scipy.io.loadmat, I find that I still have an object array of shape (k, 1), which still has elements of shape (m, n). However, each element is no longer an int or a float but is instead a numpy array of shape (1, 1) that has to be further indexed to access the contained int or float. So an individual element of an object vector that was supposed to be a numpy array of shape (2, 4) would look something like this:
[array([[array([[0.82374894]]), array([[0.50730055]]),
array([[0.36721625]]), array([[0.45036349]])],
[array([[0.26119276]]), array([[0.16843872]]),
array([[0.28649524]]), array([[0.64239569]])]], dtype=object)]
This also poses a problem for the matlab code that I am trying to build my data files for. It runs fine for the arrays of objects that have different shapes but will break when there are arrays containing arrays of the same shape.
I know this is a rather obscure and possibly unavoidable issue but I figured I would see if anyone else has encountered it and found a fix. Thanks.
I'm not quite clear about the problem. Let me try to recreate your case:
In [58]: from scipy.io import loadmat, savemat
In [59]: A = np.empty((2,1), object)
In [61]: A[0,0]=np.arange(4).reshape(2,2)
In [62]: A[1,0]=np.arange(6).reshape(3,2)
In [63]: A
Out[63]:
array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3],
[4, 5]])]], dtype=object)
In [64]: B=A[[0,0],:]
In [65]: B
Out[65]:
array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3]])]], dtype=object)
As I explained earlier today, creating an object dtype array from arrays of matching size requires special handling. np.array(...) tries to create a higher dimensional array. https://stackoverflow.com/a/56243305/901925
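A small demonstration of that point (my sketch, not part of the original session):
import numpy as np

a1, a2 = np.zeros((2, 2)), np.ones((2, 2))
print(np.array([a1, a2]).shape)      # (2, 2, 2): stacked into one 3d array
A = np.empty((2, 1), dtype=object)   # pre-allocate, then assign explicitly
A[0, 0], A[1, 0] = a1, a2
print(A.shape)                       # (2, 1): each element its own 2d array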
Saving:
In [66]: savemat('foo.mat', {'A':A, 'B':B})
Loading:
In [74]: loadmat('foo.mat')
Out[74]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:20:42 2019',
'__version__': '1.0',
'__globals__': [],
'A': array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3],
[4, 5]])]], dtype=object),
'B': array([[array([[0, 1],
[2, 3]])],
[array([[0, 1],
[2, 3]])]], dtype=object)}
In [75]: _74['A'][1,0]
Out[75]:
array([[0, 1],
[2, 3],
[4, 5]])
Your problem case looks like it's an object dtype array containing numbers:
In [89]: C = np.arange(4).reshape(2,2).astype(object)
In [90]: C
Out[90]:
array([[0, 1],
[2, 3]], dtype=object)
In [91]: savemat('foo1.mat', {'C': C})
In [92]: loadmat('foo1.mat')
Out[92]:
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue May 21 11:39:31 2019',
'__version__': '1.0',
'__globals__': [],
'C': array([[array([[0]]), array([[1]])],
[array([[2]]), array([[3]])]], dtype=object)}
Evidently savemat has converted the integer objects into 2d MATLAB compatible arrays. In MATLAB everything, even scalars, is at least 2d.
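If that round trip is the problem, one possible workaround (my suggestion, assuming the data really is numeric) is to cast the object array back to a numeric dtype before saving:
import numpy as np
from scipy.io import savemat

C = np.arange(4).reshape(2, 2).astype(object)   # as in In [89]
# a numeric dtype saves as a plain 2x2 MATLAB matrix, not a cell
savemat('foo1_fixed.mat', {'C': C.astype(np.int64)})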
===
And in Octave, the object dtype arrays all produce cells, and the 2d numeric arrays produce matrices:
>> load foo.mat
>> A
A =
{
[1,1] =
0 1
2 3
[2,1] =
0 1
2 3
4 5
}
>> B
B =
{
[1,1] =
0 1
2 3
[2,1] =
0 1
2 3
}
>> load foo1.mat
>> C
C =
{
[1,1] = 0
[2,1] = 2
[1,2] = 1
[2,2] = 3
}
Python: Issue reading in str from MATLAB .mat file using h5py and NumPy is a relatively recent SO question that showed there's a difference between Octave's HDF5 output and MATLAB's.

Pandas series to array conversion is getting me arrays of array objects

I have a Pandas series, and here are the first two rows:
X.head(2)
It has a 1D array in each row; the column header is mels_flatten:
mels_flatten
0 [0.0171469795289, 0.0173154008662, 0.395695541...
1 [0.0471267533454, 0.0061760868171, 0.005647608...
I want to store the values in a single array to feed to a classifier model.
np.vstack(X.values)
or
np.array(X.values)
both returns following
array([[ array([ 1.71469795e-02, 1.73154009e-02, 3.95695542e-01, ...,
2.35955651e-04, 8.64118460e-04, 7.74663408e-04])],
[ array([ 0.04712675, 0.00617609, 0.00564761, ..., 0.00277199,
0.00205229, 0.00043118])],
I am not sure how to process an array of array objects.
My expected result is:
array([[ 1.71469795e-02, 1.73154009e-02, 3.95695542e-01, ...,
2.35955651e-04, 8.64118460e-04, 7.74663408e-04]],
[ 0.04712675, 0.00617609, 0.00564761, ..., 0.00277199,
0.00205229, 0.00043118]],
I have tried np.concatenate and np.resize as some other posts suggested, with no luck.
I find it likely that not all of your 1d arrays are the same length, i.e. your series is not compatible with a rectangular 2d array.
Consider the following dummy example:
import pandas as pd
import numpy as np
X = pd.Series([np.array([1,2,3]),np.array([4,5,6])])
# 0 [1, 2, 3]
# 1 [4, 5, 6]
# dtype: object
np.vstack(X.values)
# array([[1, 2, 3],
# [4, 5, 6]])
As the above demonstrates, a collection of 1d arrays (or lists) of the same size will be nicely stacked into a 2d array. Check the size of your arrays, and you'll probably find that there are some discrepancies:
>>> X.apply(len)
0 3
1 3
dtype: int64
If X.apply(len).unique() returns an array with more than one element, you'll see the proof of the problem. In the above rectangular case:
>>> X.apply(len).unique()
array([3])
In a non-conforming example:
>>> Y = pd.Series([np.array([1,2,3]),np.array([4,5])])
>>> np.array(Y.values)
array([array([1, 2, 3]), array([4, 5])], dtype=object)
>>> Y.apply(len).unique()
array([3, 2])
As you can see, the nested array result is coupled to the non-unique length of items inside the original array.
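If the lengths really do differ, you have to decide how to make the data rectangular yourself. One common option (a sketch, assuming zero-padding is acceptable for your classifier) is to pad the shorter rows:
import numpy as np
import pandas as pd

Y = pd.Series([np.array([1, 2, 3]), np.array([4, 5])])
width = int(Y.apply(len).max())
padded = np.zeros((len(Y), width))     # short rows get trailing zeros
for i, row in enumerate(Y):
    padded[i, :len(row)] = row
# padded -> array([[1., 2., 3.],
#                  [4., 5., 0.]])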

Numpy multi-dimensional slicing with multiple boolean arrays

I'm trying to use individual 1-dimensional boolean arrays to slice a multi-dimension array. For some reason, this code doesn't work:
>>> a = np.ones((100, 200, 300, 2))
>>> a.shape
(100, 200, 300, 2)
>>> m1 = np.asarray([True]*200)
>>> m2 = np.asarray([True]*300)
>>> m2[-1] = False
>>> a[:,m1,m2,:]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (200,) (299,)
>>> m2 = np.asarray([True]*300) # try again with all 300 dimensions True
>>> a[:,m1,m2,:]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (200,) (300,)
But this works just fine:
>>> a = np.asarray([[[1, 2], [3, 4], [5, 6]], [[11, 12], [13, 14], [15, 16]]])
>>> a.shape
(2, 3, 2)
>>> m1 = np.asarray([True, False, True])
>>> m2 = np.asarray([True, False])
>>> a[:,m1,m2]
array([[ 1, 5],
[11, 15]])
Any idea of what I might be doing wrong in the first example?
Short answer: The number of True elements in m1 and m2 must match, unless one of them has only one True term.
Also distinguish between 'diagonal' indexing and 'rectangular' indexing. This is about indexing, not slicing. The dimensions with : are just along for the ride.
Initial ideas
I can get your first case working with:
In [137]: a=np.ones((100,200,300,2))
In [138]: m1=np.ones((200,),bool)
In [139]: m2=np.ones((300,),bool)
In [140]: m2[-1]=False
In [141]: I,J=np.ix_(m1,m2)
In [142]: a[:,I,J,:].shape
Out[142]: (100, 200, 299, 2)
np.ix_ turns the 2 boolean arrays into broadcastable index arrays
In [143]: I.shape
Out[143]: (200, 1)
In [144]: J.shape
Out[144]: (1, 299)
Note that this picks 200 'rows' in one dimension, and 299 in the other.
I'm not sure why this kind of reworking of the arrays is needed in this case, but not in the 2nd:
In [154]: b=np.arange(2*3*2).reshape((2,3,2))
In [155]: n1=np.array([True,False,True])
In [156]: n2=np.array([True,False])
In [157]: b[:,n1,n2]
Out[157]:
array([[ 0, 4], # shape (2,2)
[ 6, 10]])
Taking the same ix_ strategy produces the same values but a different shape:
In [164]: b[np.ix_(np.arange(b.shape[0]),n1,n2)]
# or I,J=np.ix_(n1,n2);b[:,I,J]
Out[164]:
array([[[ 0],
[ 4]],
[[ 6],
[10]]])
In [165]: _.shape
Out[165]: (2, 2, 1)
Both cases use all rows of the 1st dimension. The ix_ one picks 2 'rows' of the 2nd dim and 1 column of the last, resulting in the (2,2,1) shape. The other picks the b[:,0,0] and b[:,2,0] terms, resulting in the (2,2) shape.
(see my addenda as to why both are simply broadcasting).
These are all cases of advanced indexing, with boolean and numeric indexes. One can study the docs, or one can play around. Sometimes it's more fun to do the latter. :)
(I knew that ix_ was good for adding the necessary np.newaxis to arrays so they can be broadcast together, but didn't realize that it worked with boolean arrays as well - it uses np.nonzero() to convert booleans to indices.)
Resolution
Underlying this is, I think, a confusion over 2 modes of indexing, which might be called 'diagonal' and 'rectangular' (or element-by-element selection versus block selection). To illustrate, look at a small 2d array:
In [73]: M=np.arange(6).reshape(2,3)
In [74]: M
Out[74]:
array([[0, 1, 2],
[3, 4, 5]])
and 2 simple numeric indexes
In [75]: m1=np.arange(2); m2=np.arange(2)
They can be used 2 ways:
In [76]: M[m1,m2]
Out[76]: array([0, 4])
and
In [77]: M[m1[:,None],m2]
Out[77]:
array([[0, 1],
[3, 4]])
The 1st picks 2 points, the M[0,0] and M[1,1]. This kind of indexing lets us pick out the diagonals of an array.
The 2nd picks 2 rows and, from those, 2 columns. This is the kind of indexing that np.ix_ produces, a 'rectangular' form of indexing.
Change m2 to 3 values:
In [78]: m2=np.arange(3)
In [79]: M[m1[:,None],m2] # returns a 2x3
Out[79]:
array([[0, 1, 2],
[3, 4, 5]])
In [80]: M[m1,m2] # produces an error
...
ValueError: shape mismatch: objects cannot be broadcast to a single shape
But if m2 has just one element, we don't get the broadcast error - because the size 1 dimension can be expanded during broadcasting:
In [81]: m2=np.arange(1)
In [82]: M[m1,m2]
Out[82]: array([0, 3])
Now change the index arrays to boolean, each matching the length of the respective dimensions, 2 and 3.
In [91]: m1=np.ones(2,bool); m2=np.ones(3,bool)
In [92]: M[m1,m2]
...
ValueError: shape mismatch: objects cannot be broadcast to a single shape
In [93]: m2[2]=False # m1 and m2 each have 2 True elements
In [94]: M[m1,m2]
Out[94]: array([0, 4])
In [95]: m2[0]=False # m2 has 1 True element
In [96]: M[m1,m2]
Out[96]: array([1, 4])
With 2 and 3 True terms we get an error, but with 2 and 2 or 2 and 1 it runs - just as though we'd used the indices of the True elements: np.nonzero(m2).
To apply this to your examples: in the first, m1 and m2 have 200 and 299 True elements, so a[:,m1,m2,:] fails because of the mismatch in the number of True terms.
In the 2nd, they have 2 and 1 True terms, with nonzero indices of [0,2] and [0], which can be broadcast to [0,0]. So it runs.
http://docs.scipy.org/doc/numpy-1.10.0/reference/arrays.indexing.html
explains boolean array indexing in terms of nonzero and ix_.
Combining multiple Boolean indexing arrays or a Boolean with an integer indexing array can best be understood with the obj.nonzero() analogy. The function ix_ also supports boolean arrays and will work without any surprises.
Addenda
On further thought, the distinction between 'diagonal' and 'block/rectangular' indexing might be more my mental construct than numpy's. Underlying both is the concept of broadcasting.
Take the n1 and n2 booleans, and get their nonzero equivalents:
In [107]: n1
Out[107]: array([ True, False, True], dtype=bool)
In [108]: np.nonzero(n1)
Out[108]: (array([0, 2], dtype=int32),)
In [109]: n2
Out[109]: array([ True, False], dtype=bool)
In [110]: np.nonzero(n2)
Out[110]: (array([0], dtype=int32),)
Now try broadcasting in the 'diagonal' and 'rectangular' modes:
In [105]: np.broadcast_arrays(np.array([0,2]),np.array([0]))
Out[105]: [array([0, 2]),
array([0, 0])]
In [106]: np.broadcast_arrays(np.array([0,2])[:,None],np.array([0]))
Out[106]:
[array([[0],
[2]]),
array([[0],
[0]])]
One produces (2,) arrays, the other (2,1).
This might be a simple workaround:
a[:,m1,:,:][:,:,m2,:]
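For the arrays in the question, this chained indexing should give the same shape as the ix_ approach (a quick check under the setup above):
import numpy as np

a = np.ones((100, 200, 300, 2))
m1 = np.ones(200, bool)
m2 = np.ones(300, bool)
m2[-1] = False
# each boolean index is applied in its own indexing step,
# so no broadcasting between m1 and m2 is involved
print(a[:, m1, :, :][:, :, m2, :].shape)   # (100, 200, 299, 2)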
