Python xarray index return

I have a multidimensional xarray like:
testarray = xr.DataArray([[1,-2,3],[4,5,-6]])
and I want to get the indices for a specific condition, e.g. where testarray is smaller than 0.
So the expected result should be an array like:
result = [[1,2],[0,1]]
Or any other format that lets me get these indices for further calculations. I can't imagine that there is no option within xarray for such an elementary problem, but I can't find it. Things like
testarray.where(testarray<0)
do some very 'suspicious' stuff. What's the use of an array that's the same but with NaNs where the condition is not met?
Thanks a lot for your help :)

To get the indices, you could use np.argwhere:
In [3]: da = xr.DataArray([[1,-2,3],[4,5,-6]])
In [4]: da
Out[4]:
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 1, -2,  3],
       [ 4,  5, -6]])
Dimensions without coordinates: dim_0, dim_1
In [14]: da.where(da<0, 0)
Out[14]:
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[ 0, -2,  0],
       [ 0,  0, -6]])
Dimensions without coordinates: dim_0, dim_1
# Note you'd need to handle the case of a 0 value here
In [13]: np.argwhere(da.where(da<0, 0).values)
Out[13]:
array([[0, 1],
       [1, 2]])
I agree this would be a useful function to have natively in xarray; I'm not sure of the best way of doing it natively at the moment. Open to ideas!
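As a side note, one way to avoid that zero-value caveat entirely is to pass the boolean mask itself to np.argwhere; a minimal sketch, assuming only numpy and xarray as above:
import numpy as np
import xarray as xr
testarray = xr.DataArray([[1, -2, 3], [4, 5, -6]])
# argwhere on the boolean mask directly -- no fill value needed, so a
# legitimate 0 in the data can't be mistaken for "condition not met"
indices = np.argwhere((testarray < 0).values)
# array([[0, 1],
#        [1, 2]])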

Related

How to reshape matrix with numpy without using explicit values for argument?

I am trying to create a function and calculate the inner product using numpy.
I got the function to work; however, I am using explicit numbers in my np.reshape call, and I need it to work based on the input.
My code looks like this:
import numpy as np
X = np.array([[1,2],[3,4]])
Z = np.array([[1,4],[2,5],[3,6]])
# Calculating S
def calculate_S(X, n, m):
    assert n == X.shape[0]
    n, d1 = X.shape
    m, d2 = X.shape
    S = np.diag(np.inner(X, X))
    return S

n, m = X.shape[0], Z.shape[0]  # assumed definitions; the original snippet calls calculate_S with undefined n, m
S = calculate_S(X, n, m)
S = S.reshape(2, 1)
print(S)
output:
[[ 5]
 [25]]
So the output is correct; however, instead of specifying 2,1, I need those values to be placed there automatically based on the shape of my matrix.
How do I do that?
In [163]: X = np.array([[1,2],[3,4]])
In [164]: np.inner(X,X)
Out[164]:
array([[ 5, 11],
       [11, 25]])
In [165]: np.diag(np.inner(X,X))
Out[165]: array([ 5, 25])
reshape with -1 gets around having to specify the 2:
In [166]: np.diag(np.inner(X,X)).reshape(-1,1)
Out[166]:
array([[ 5],
       [25]])
another way of adding a dimension:
In [167]: np.diag(np.inner(X,X))[:,None]
Out[167]:
array([[ 5],
       [25]])
You can get the "diagonal" directly with:
In [175]: np.einsum('ij,ij->i',X,X)
Out[175]: array([ 5, 25])
another:
In [177]: (X[:,None,:] @ X[:,:,None])[:,0,:]
Out[177]:
array([[ 5],
       [25]])
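Folding the reshape(-1,1) idea back into the asker's function, a minimal sketch (assuming the intent is one inner product per row of X) that avoids hard-coding the shape:
import numpy as np

def calculate_S(X):
    # row-wise inner products, returned as a column vector; works for any (n, d) input
    return np.einsum('ij,ij->i', X, X).reshape(-1, 1)

X = np.array([[1, 2], [3, 4]])
print(calculate_S(X))
# [[ 5]
#  [25]]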

Python xarray - vectorized indexing

I'm trying to understand vectorized indexing in xarray by following this example from the docs:
import xarray as xr
import numpy as np
da = xr.DataArray(np.arange(12).reshape((3, 4)), dims=['x', 'y'],
                  coords={'x': [0, 1, 2], 'y': ['a', 'b', 'c', 'd']})
ind_x = xr.DataArray([0, 1], dims=['x'])
ind_y = xr.DataArray([0, 1], dims=['y'])
The output of the array da is as follows:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
So far so good. The example shows two ways of indexing: orthogonal (not what I'm interested in here) and vectorized (what I want). For vectorized indexing the following is shown:
In [37]: da[ind_x, ind_x] # vectorized indexing
Out[37]:
<xarray.DataArray (x: 2)>
array([0, 5])
Coordinates:
    y        (x) <U1 'a' 'b'
  * x        (x) int64 0 1
The result seems to be what I want, but this feels very strange to me. ind_x (which in theory refers to dims=['x']) is being passed twice, yet is somehow capable of indexing what appear to be both the x and y dims. As far as I understand, the x dim would be the rows and the y dim would be the columns; is that correct? How come the same ind_x is capable of accessing both the rows and the cols?
This seems to be the concept I need for my problem, but can't understand how it works or how to extend it to more dimensions. I was expecting this result to be given by da[ind_x, ind_y] however that seems to yield the orthogonal indexing surprisingly enough.
Having the example with ind_x being used twice is probably a little confusing: actually, the dimension of the indexer doesn't have to matter at all for the indexing behavior! Observe:
ind_a = xr.DataArray([0, 1], dims=["a"])
da[ind_a, ind_a]
Gives:
<xarray.DataArray (a: 2)>
array([0, 5])
Coordinates:
    x        (a) int32 0 1
    y        (a) <U1 'a' 'b'
Dimensions without coordinates: a
The same goes for the orthogonal example:
ind_a = xr.DataArray([0, 1], dims=["a"])
ind_b = xr.DataArray([0, 1], dims=["b"])
da[ind_a, ind_b]
Result:
<xarray.DataArray (a: 2, b: 2)>
array([[0, 2],
       [4, 6]])
Coordinates:
    x        (a) int32 0 1
    y        (b) <U1 'a' 'c'
Dimensions without coordinates: a, b
The difference is purely in terms of "labeling", as in this case you end up with dimensions without coordinates.
Fancy indexing
Generally stated, I personally do not find "fancy indexing" the most intuitive concept. I did find this example in NEP 21 pretty clarifying: https://numpy.org/neps/nep-0021-advanced-indexing.html
Specifically, this:
Consider indexing a 2D array by two 1D integer arrays, e.g., x[[0, 1], [0, 1]]:
Outer indexing is equivalent to combining multiple integer indices with itertools.product(). The result in this case is another 2D array with all combinations of indexed elements, e.g., np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]]).
Vectorized indexing is equivalent to combining multiple integer indices with zip(). The result in this case is a 1D array containing the diagonal elements, e.g., np.array([x[0, 0], x[1, 1]]).
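A quick plain-numpy sketch of that difference (np.ix_ gives the outer/product behavior; plain integer arrays give the vectorized/zip behavior):
import numpy as np
x = np.arange(4).reshape(2, 2)    # [[0, 1], [2, 3]]
x[[0, 1], [0, 1]]                 # vectorized: zipped pairs -> array([0, 3])
x[np.ix_([0, 1], [0, 1])]         # outer: all combinations -> array([[0, 1], [2, 3]])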
Back to xarray
da[ind_x, ind_y]
Can also be written as:
da.isel(x=ind_x, y=ind_y)
The dimensions are implicit in the order. However, xarray still attempts to broadcast (based on dimension labels), so da[ind_y] mismatches and results in an error. da[ind_a] and da[ind_b] both work.
More dimensions
The dims you provide for the indexer are what determines the shape of the output, not the dimensions of the array you're indexing.
If you want to select single values along the dimensions (so we're zip()-ing through the indexes simultaneously), just make sure that your indexers share the dimension, here for a 3D array:
da = xr.DataArray(
    data=np.arange(3 * 4 * 5).reshape(3, 4, 5),
    coords={
        "x": [1, 2, 3],
        "y": ["a", "b", "c", "d"],
        "z": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    dims=["x", "y", "z"],
)
ind_along_x = xr.DataArray([0, 1], dims=["new_index"])
ind_along_y = xr.DataArray([0, 2], dims=["new_index"])
ind_along_z = xr.DataArray([0, 3], dims=["new_index"])
da[ind_along_x, ind_along_y, ind_along_z]
Note that the values of the indexers do not have to be the same -- that would be a pretty severe limitation, after all.
Result:
<xarray.DataArray (new_index: 2)>
array([ 0, 33])
Coordinates:
    x        (new_index) int32 1 2
    y        (new_index) <U1 'a' 'c'
    z        (new_index) float64 1.0 4.0
Dimensions without coordinates: new_index

numpy.r_: place the 1's in the middle

I am reading the numpy.r_ docs, and it seems I cannot place the 1's at the middle position.
For example,
a = np.array( [[3,4,5],[ 33,44,55]])
b = np.array( [[-3,-4,-5],[ -33,-44,-55]])
np.r_['0,3,1',a,b]
First, the shape (2,3) of a is upgraded to (1,2,3), and the same happens for b. Then the two shapes are joined along axis 0: (1,2,3) + (1,2,3) gives (2,2,3), the final shape of the result; note that only the first number is summed, because of the '0' in '0,3,1'.
Now the question: according to the docs, I can upgrade the shape of a to (1,2,3) or (2,3,1), but how can I upgrade it to (2,1,3)?
In [381]: a = np.array( [[3,4,5],[ 33,44,55]])
...:
...: b = np.array( [[-3,-4,-5],[ -33,-44,-55]])
...:
...: np.r_['0,3,1',a,b]
Out[381]:
array([[[  3,   4,   5],
        [ 33,  44,  55]],
       [[ -3,  -4,  -5],
        [-33, -44, -55]]])
Your question should have displayed this result. It helps the reader visualize the action, and better understand your question. Not everyone can run your sample (I couldn't when I first read it on my phone).
You can do the same concatenation with stack (or even np.array((a,b))):
In [382]: np.stack((a,b))
Out[382]:
array([[[  3,   4,   5],
        [ 33,  44,  55]],
       [[ -3,  -4,  -5],
        [-33, -44, -55]]])
stack with axis produces what you want (again, a good question would display the desired result):
In [383]: np.stack((a,b), axis=1)
Out[383]:
array([[[  3,   4,   5],
        [ -3,  -4,  -5]],
       [[ 33,  44,  55],
        [-33, -44, -55]]])
We can add the dimension to a by itself with:
In [384]: np.expand_dims(a,1)
Out[384]:
array([[[ 3,  4,  5]],
       [[33, 44, 55]]])
In [385]: _.shape
Out[385]: (2, 1, 3)
a[:,None] and a.reshape(2,1,3) also do it.
As you found, I can't do the same with np.r_:
In [413]: np.r_['0,3,0',a].shape
Out[413]: (2, 3, 1)
In [414]: np.r_['0,3,1',a].shape
Out[414]: (1, 2, 3)
In [415]: np.r_['0,3,-1',a].shape
Out[415]: (1, 2, 3)
Even looking at the code it is hard to tell how r_ is handling this 3rd parameter. It looks like it uses the ndmin parameter to expand the arrays (which prepends new axes if needed), and then some sort of transpose to move the new axis.
This could be classed as a bug in r_, but it's been around so long that I doubt anyone will care. It's more useful for expanding "slices" than for fancy concatenation.
While the syntax of np.r_ may be convenient on occasion, it isn't an essential function. It's just another front end to np.concatenate (with the added arange/linspace functionality).
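For completeness, a minimal sketch of getting the middle-axis result without np.r_, combining the expand_dims idea from above with plain concatenate:
import numpy as np
a = np.array([[3, 4, 5], [33, 44, 55]])
b = np.array([[-3, -4, -5], [-33, -44, -55]])
# upgrade each (2, 3) array to (2, 1, 3), then join along the new middle axis
out = np.concatenate([a[:, None, :], b[:, None, :]], axis=1)
out.shape   # (2, 2, 3), same as np.stack((a, b), axis=1)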

For a given condition, get indices of values in 2D tensor A, use those to index a 3D tensor B

For a given 2D tensor I want to retrieve all indices where the value is 1. I expected to be able to simply use torch.nonzero(a == 1).squeeze(), which would return tensor([1, 3, 2]). However, instead, torch.nonzero(a == 1) returns a 2D tensor (that's okay), with two values per row (that's not what I expected). The returned indices should then be used to index the second dimension (index 1) of a 3D tensor, again returning a 2D tensor.
import torch
a = torch.Tensor([[12, 1, 0, 0],
                  [4, 9, 21, 1],
                  [10, 2, 1, 0]])
b = torch.rand(3, 4, 8)
print('a_size', a.size())
# a_size torch.Size([3, 4])
print('b_size', b.size())
# b_size torch.Size([3, 4, 8])
idxs = torch.nonzero(a == 1)
print('idxs_size', idxs.size())
# idxs_size torch.Size([3, 2])
print(b.gather(1, idxs))
Evidently, this does not work, leading to a RuntimeError:
RuntimeError: invalid argument 4: Index tensor must have same
dimensions as input tensor at
C:\w\1\s\windows\pytorch\aten\src\TH/generic/THTensorEvenMoreMath.cpp:453
It seems that idxs is not what I expect it to be, nor can I use it the way I thought. idxs is
tensor([[0, 1],
        [1, 3],
        [2, 2]])
but reading through the documentation I don't understand why I also get back the row indices in the resulting tensor. Now, I know I can get the correct idxs by slicing idxs[:, 1] but then still, I cannot use those values as indices for the 3D tensor because the same error as before is raised. Is it possible to use the 1D tensor of indices to select items across a given dimension?
You could simply slice them and pass them as the indices, as in:
In [193]: idxs = torch.nonzero(a == 1)
In [194]: c = b[idxs[:, 0], idxs[:, 1]]
In [195]: c
Out[195]:
tensor([[0.3411, 0.3944, 0.8108, 0.3986, 0.3917, 0.1176, 0.6252, 0.4885],
        [0.5698, 0.3140, 0.6525, 0.7724, 0.3751, 0.3376, 0.5425, 0.1062],
        [0.7780, 0.4572, 0.5645, 0.5759, 0.5957, 0.2750, 0.6429, 0.1029]])
Alternatively, an even simpler & my preferred approach would be to just use torch.where() and then directly index into the tensor b as in:
In [196]: b[torch.where(a == 1)]
Out[196]:
tensor([[0.3411, 0.3944, 0.8108, 0.3986, 0.3917, 0.1176, 0.6252, 0.4885],
        [0.5698, 0.3140, 0.6525, 0.7724, 0.3751, 0.3376, 0.5425, 0.1062],
        [0.7780, 0.4572, 0.5645, 0.5759, 0.5957, 0.2750, 0.6429, 0.1029]])
A bit more explanation about the above torch.where() approach: it works based on the concept of advanced indexing, i.e. indexing into the tensor with a tuple of sequence objects (a tuple of tensors, a tuple of lists, a tuple of tuples, etc.).
# some input tensor
In [207]: a
Out[207]:
tensor([[12.,  1.,  0.,  0.],
        [ 4.,  9., 21.,  1.],
        [10.,  2.,  1.,  0.]])
For basic slicing, we would need a tuple of integer indices:
In [212]: a[(1, 2)]
Out[212]: tensor(21.)
To achieve the same using advanced indexing, we would need a tuple of sequence objects:
# adv. indexing using a tuple of lists
In [213]: a[([1,], [2,])]
Out[213]: tensor([21.])
# adv. indexing using a tuple of tuples
In [215]: a[((1,), (2,))]
Out[215]: tensor([21.])
# adv. indexing using a tuple of tensors
In [214]: a[(torch.tensor([1,]), torch.tensor([2,]))]
Out[214]: tensor([21.])
And here the returned tensor has one dimension fewer than the input tensor.
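A small sketch of that dimension rule for the pattern used here (two 1-D index tensors consuming the first two dimensions of a 3-D tensor):
import torch
b = torch.rand(3, 4, 8)
rows = torch.tensor([0, 1, 2])
cols = torch.tensor([1, 3, 2])
# the two index tensors are zipped pairwise; the result keeps one "pair"
# dimension plus the trailing feature dimension
b[rows, cols].shape   # torch.Size([3, 8])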
Assuming that b's three dimensions are batch_size x sequence_length x features (b x s x feats), the expected results can be achieved as follows.
import torch
a = torch.Tensor([[12, 1, 0, 0],
                  [4, 9, 21, 1],
                  [10, 2, 1, 0]])
b = torch.rand(3, 4, 8)
print(b.size())
# b x s x feats
idxs = torch.nonzero(a == 1)[:, 1]
print(idxs.size())
# b
c = b[torch.arange(b.size(0)), idxs]
print(c.size())
# b x feats
import torch
a = torch.Tensor([[12, 1, 0, 0],
                  [4, 9, 21, 1],
                  [10, 2, 1, 0]])
b = torch.rand(3, 4, 8)
print('a_size', a.size())
# a_size torch.Size([3, 4])
print('b_size', b.size())
# b_size torch.Size([3, 4, 8])
#idxs = torch.nonzero(a == 1, as_tuple=True)
idxs = torch.nonzero(a == 1)
#print('idxs_size', idxs.size())
print(torch.index_select(b,1,idxs[:,1]))
As a supplement to kmario23's solution, you can achieve the same results with
b[torch.nonzero(a==1,as_tuple=True)]
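(torch.nonzero(a == 1, as_tuple=True) returns the same tuple of 1-D index tensors as torch.where(a == 1), so this is the same advanced-indexing pattern as above, just spelled differently.)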

suppress numpy array creation protocol for numpy arrays of objects

I am trying to build a library which reads complex HDF5 data files in python.
I am running into a problem where an HDF5 Dataset somehow implements the default array protocol (sometimes), such that when a numpy array is created from it, it is cast to the particular array type.
In [8]: ds
Out[8]: <HDF5 dataset "two_by_zero_empty_matrix": shape (2,), type "<u8">
In [9]: ds.value
Out[9]: array([2, 0], dtype=uint64)
This Dataset object, implements the numpy array protocol, and when the dataset consists of numbers, it supplies a default array type.
In [10]: np.array(ds)
Out[10]: array([2, 0], dtype=uint64)
However, if the dataset doesn't consist of numbers, but some other objects, as you would expect, it just uses a numpy array of type np.object:
In [43]: ds2
Out[43]: <HDF5 dataset "somecells": shape (2, 3), type "|O8">
In [44]: np.array(ds2)
Out[44]:
array([[<HDF5 object reference>, <HDF5 object reference>,
        <HDF5 object reference>],
       [<HDF5 object reference>, <HDF5 object reference>,
        <HDF5 object reference>]], dtype=object)
This behavior might seem convenient, but in my case it's actually inconvenient, since it interferes with my recursive traversal of the data file. Working around it turns out to be difficult, since there are a lot of different possible data types which have to be special-cased a little differently depending on whether they are children of objects or arrays of numbers.
My question is this: is there a way to suppress the default array creation protocol, such that I could create an object array out of dataset objects that want to cast to their natural duck types?
That is, I want something like: np.array(ds, dtype=object), which will produce an array of [<Dataset object of type int>, dtype=object] and not [3 4 5, dtype=int].
But np.array(ds, dtype=np.object) throws IOError: Can't read data (No appropriate function for conversion path)
I tried in earnest to google some documentation about how the numpy array protocol works, and found a lot, but it doesn't really appear that anyone considered the possibility that someone might want this behavior.
I can understand where the Out[44] is coming from. It's an array containing pointers to objects, in this case h5py references to objects on the file (I think).
With np.array(ds, dtype=object), are you trying to create something more like that, rather than the 'normal' array([2, 0], dtype=uint64) that you get with np.array(ds)?
But what is the parallel array? A single element array with a pointer to ds? Or a 2 element array with pointers to 2 and 0 somewhere on the file? What if they aren't <HDF5 object reference>?
In numpy, without any h5py stuff, I can create an object array from a list of values:
In [104]: np.array([2,0], dtype=object)
Out[104]: array([2, 0], dtype=object)
Or I can start with an empty array (filled with None) and assign values:
In [105]: x=np.empty((2,), dtype=object)
In [106]: x[0]=2
In [107]: x[1]=0
In [108]: x
Out[108]: array([2, 0], dtype=object)
I guess you could try:
x[0] = ds[0]
or
x[:] = ds[:]
Or make a single element object array
x = np.empty((), dtype=object)
x[()] = ds
I don't have an h5py test file open in my IPython session to test this.
But I can do something weird like make an object array that contains itself. I can work with it, but I can't display it without getting a recursion error.
In [118]: x=np.empty((),dtype=object)
In [119]: x[()]=x
In [120]: x1=x[()]
In [121]: x1==x
Out[121]: True
I have a small h5py file open on another terminal:
In [315]: list(f.keys())
Out[315]: ['d', 'x', 'y']
In [317]: f['d'] # the group
Out[317]: <HDF5 group "/d" (2 members)>
x is a string:
In [318]: f['x'] # a single element (a string)
Out[318]: <HDF5 dataset "x": shape (), type "|O4">
In [330]: f['x'].value
Out[330]: 'astring'
In [331]: np.array(f['x'])
Out[331]: array('astring', dtype=object)
y is an array:
In [320]: f['y'][:]
Out[320]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [321]: f['y'].value
Out[321]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [322]: np.array(f['y'])
Out[322]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [323]: timeit np.array(f['y'])
1000 loops, best of 3: 364 µs per loop
In [324]: timeit f['y'].value
1000 loops, best of 3: 380 µs per loop
So access with value and array is equivalent.
Access as object array gives the same sort of error as you got.
In [325]: np.array(f['y'],dtype=object)
...
OSError: can't read data (Dataset: Read failed)
Conversion to float works fine:
In [326]: np.array(f['y'],dtype=float)
Out[326]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
And the assignment to a predefined object array works:
In [327]: x=np.empty((),dtype=object)
In [328]: x[()]=f['y']
In [329]: x
Out[329]: array(<HDF5 dataset "y": shape (10,), type "<i4">, dtype=object)
Trying to create a 10 element array to take y:
In [332]: y1=np.empty((10,),dtype=object)
In [333]: y1[:]=f['y']
...
OSError: can't read data (Dataset: Read failed)
In [334]: y1[:]=f['y'].value
In [335]: y1
Out[335]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
y1[:]=f['y'][:] also works
I can't assign the dataset to y1 (same error as when I tried np.array(f['y'], dtype=object)). But I can assign its values. I can even assign the dataset to one element of y1:
In [338]: y1[-1]=f['y']
In [339]: y1
Out[339]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8,
       <HDF5 dataset "y": shape (10,), type "<i4">], dtype=object)
I keep coming back to the basic idea that an object array is just a collection of pointers, essentially a list in an array wrapper.
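Pulling those observations together, a minimal sketch of a helper that stores arbitrary objects (h5py Datasets included) in an object array via element assignment, so np.array's conversion protocol is never invoked (wrap_in_object_array is a hypothetical name, not an h5py or numpy API):
import numpy as np

def wrap_in_object_array(items):
    # element-by-element assignment into a pre-allocated object array
    # bypasses np.array()'s conversion protocol entirely
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        out[i] = item
    return out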
