When building a DataArray, I can conveniently select along some coordinate:
import xarray as xr
d = xr.DataArray([1, 2, 3],
                 coords={'c': ['a', 'b', 'c']},
                 dims=['c'])
d.sel(c='a')
and even along multiple values on that coordinate:
d.sel(c=['a', 'b'])
However, this fails to work once the coordinate is part of a multi-index dimension:
d = xr.DataArray([1, 2, 3],
                 coords={'c': ('multi_index', ['a', 'b', 'c']),
                         'd': ('multi_index', ['x', 'y', 'z'])},
                 dims=['multi_index'])
d.sel(c='a') # error
d.sel(c=['a', 'b']) # error
with the error ValueError: dimensions or multi-index levels ['c'] do not exist.
Another error message I see while trying to do this is ValueError: Vectorized selection is not available along level variable.
It seems like one can only select along dimensions.
This becomes difficult when a single dimension contains a lot of metadata and one would only like to select based on the values of a single metadata-coordinate.
Is there a suggested workaround other than positionally indexing things by hand?
swap_dims provides one workaround:
In [8]: d.swap_dims({'multi_index': 'c'}).sel(c=['a', 'b'])
Out[8]:
<xarray.DataArray (c: 2)>
array([1, 2])
Coordinates:
  * c        (c) <U1 'a' 'b'
    d        (c) <U1 'a' 'b'
where 'c' becomes the dimension instead of 'multi_index'.
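Scalar selection should work the same way once 'c' is the dimension, for example:
d.swap_dims({'multi_index': 'c'}).sel(c='a')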
If you want to select based on both 'c' and 'd' in an arbitrary manner, a MultiIndex may be appropriate. set_index does this:
In [12]: d.set_index(multi_index=['c', 'd'])
Out[12]:
<xarray.DataArray (multi_index: 3)>
array([1, 2, 3])
Coordinates:
  * multi_index  (multi_index) MultiIndex
  - c            (multi_index) object 'a' 'b' 'c'
  - d            (multi_index) object 'x' 'y' 'z'
In [13]: d.set_index(multi_index=['c', 'd']).sel(c='b')
Out[13]:
<xarray.DataArray (d: 1)>
array([2])
Coordinates:
  * d        (d) object 'y'
(selecting a scalar on level 'c' drops that level and renames the dimension to the remaining level 'd').
However, vectorized selection is not yet supported for a MultiIndex (it will complain ValueError: Vectorized selection is not available along level variable).
Maybe the first option is better for your use case.
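A boolean mask along the dimension is another possible workaround that avoids reshaping the indexes entirely; a minimal sketch, assuming an xarray version recent enough to have DataArray.isin (>= 0.10.9):
# keep only the entries whose 'c' coordinate is in the wanted set
d.where(d.c.isin(['a', 'b']), drop=True)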
I'm trying to understand vectorized indexing in xarray by following this example from the docs:
import xarray as xr
import numpy as np
da = xr.DataArray(np.arange(12).reshape((3, 4)), dims=['x', 'y'],
                  coords={'x': [0, 1, 2], 'y': ['a', 'b', 'c', 'd']})
ind_x = xr.DataArray([0, 1], dims=['x'])
ind_y = xr.DataArray([0, 1], dims=['y'])
The output of the array da is as follows:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
So far so good. Now the example shows two ways of indexing: orthogonal (which I'm not interested in here) and vectorized (what I want). For vectorized indexing the following is shown:
In [37]: da[ind_x, ind_x] # vectorized indexing
Out[37]:
<xarray.DataArray (x: 2)>
array([0, 5])
Coordinates:
y (x) <U1 'a' 'b'
* x (x) int64 0 1
The result seems to be what I want, but this feels very strange to me. ind_x (which in theory refers to dims=['x']) is being passed twice, yet it is somehow capable of indexing what appears to be both the x and y dims. As far as I understand, the x dim would be the rows and the y dim would be the columns; is that correct? How come the same ind_x is capable of accessing both the rows and the cols?
This seems to be the concept I need for my problem, but I can't understand how it works or how to extend it to more dimensions. I was expecting this result to be given by da[ind_x, ind_y]; however, that yields orthogonal indexing, surprisingly enough.
Having the example use ind_x twice is probably a little confusing: actually, the indexer's dimension name doesn't have to match the array's dimensions at all! Observe:
ind_a = xr.DataArray([0, 1], dims=["a"])
da[ind_a, ind_a]
Gives:
<xarray.DataArray (a: 2)>
array([0, 5])
Coordinates:
x (a) int32 0 1
y (a) <U1 'a' 'b'
Dimensions without coordinates: a
The same goes for the orthogonal example:
ind_a = xr.DataArray([0, 1], dims=["a"])
ind_b = xr.DataArray([0, 1], dims=["b"])
da[ind_a, ind_b]
Result:
<xarray.DataArray (a: 2, b: 2)>
array([[0, 2],
[4, 6]])
Coordinates:
x (a) int32 0 1
y (b) <U1 'a' 'c'
Dimensions without coordinates: a, b
The difference is purely in terms of "labeling", as in this case you end up with dimensions without coordinates.
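If you want the new dimension to carry coordinates, I believe you can attach them to the indexer itself and they propagate to the result (the label values [10, 20] here are arbitrary):
ind_a = xr.DataArray([0, 1], dims=["a"], coords={"a": [10, 20]})
da[ind_a, ind_a]  # same values as before, but now with an 'a' coordinate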
Fancy indexing
Generally stated, I personally do not find "fancy indexing" the most intuitive concept. I did find this example in NEP 21 pretty clarifying: https://numpy.org/neps/nep-0021-advanced-indexing.html
Specifically, this:
Consider indexing a 2D array by two 1D integer arrays, e.g., x[[0, 1], [0, 1]]:
Outer indexing is equivalent to combining multiple integer indices with itertools.product(). The result in this case is another 2D array with all combinations of indexed elements, e.g., np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]]).
Vectorized indexing is equivalent to combining multiple integer indices with zip(). The result in this case is a 1D array containing the diagonal elements, e.g., np.array([x[0, 0], x[1, 1]]).
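To make that concrete with plain numpy (a quick sketch using the same values as da above):
import numpy as np
x = np.arange(12).reshape(3, 4)
# outer/orthogonal: every combination of rows [0, 1] and columns [0, 1]
x[np.ix_([0, 1], [0, 1])]  # array([[0, 1], [4, 5]])
# vectorized: the index arrays are zipped pairwise -> "diagonal" elements
x[[0, 1], [0, 1]]          # array([0, 5])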
Back to xarray
da[ind_x, ind_y]
Can also be written as:
da.isel(x=ind_x, y=ind_y)
The dimensions are implicit in the order. However, xarray still attempts to broadcast (based on dimension labels), so da[ind_y] mismatches and results in an error. da[ind_a] and da[ind_b] both work.
More dimensions
The dims you provide for the indexer are what determines the shape of the output, not the dimensions of the array you're indexing.
If you want to select single values along the dimensions (so we're zip()-ing through the indexes simultaneously), just make sure that your indexers share the dimension, here for a 3D array:
da = xr.DataArray(
    data=np.arange(3 * 4 * 5).reshape(3, 4, 5),
    coords={
        "x": [1, 2, 3],
        "y": ["a", "b", "c", "d"],
        "z": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    dims=["x", "y", "z"],
)
ind_along_x = xr.DataArray([0, 1], dims=["new_index"])
ind_along_y = xr.DataArray([0, 2], dims=["new_index"])
ind_along_z = xr.DataArray([0, 3], dims=["new_index"])
da[ind_along_x, ind_along_y, ind_along_z]
Note that the values of the indexers do not have to be the same -- that would be a pretty severe limitation, after all.
Result:
<xarray.DataArray (new_index: 2)>
array([ 0, 33])
Coordinates:
x (new_index) int32 1 2
y (new_index) <U1 'a' 'c'
z (new_index) float64 1.0 4.0
Dimensions without coordinates: new_index
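Equivalently, with isel the dimensions are named explicitly instead of relying on positional order:
da.isel(x=ind_along_x, y=ind_along_y, z=ind_along_z)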
I am having trouble figuring out some basic usage patterns of xarray. Here's something that I used to be able to do easily in numpy (setting elements where a particular condition is satisfied in another array):
import numpy as np
q_index = np.array([
    [0, 1, 2, 3, 4, 5],
    [1, 5, 3, 2, 0, 4],
])
# any element not yet specified
q_kinds = np.full_like(q_index, 'other', dtype=object)
# any element with q-index 0 should be classified as 'gamma'
q_kinds[q_index == 0] = 'gamma'
# q_kinds is now:
# [['gamma' 'other' 'other' 'other' 'other' 'other']
# ['other' 'other' 'other' 'other' 'gamma' 'other']]
# afterwards I do some other things to fill in some (but not all)
# of the 'other' elements with different labels
But I don't see any reasonable way to do this masked assignment in xarray:
import xarray as xr
ds = xr.Dataset()
ds.coords['q-index'] = (['layer', 'q'], [
    [0, 1, 2, 3, 4, 5],
    [1, 5, 3, 2, 0, 4],
])
ds['q-kinds'] = xr.full_like(ds.coords['q-index'], 'other', dtype=object)
# any element with q-index == 0 should be classified as 'gamma'
# Attempt 1:
# 'IndexError: 2-dimensional boolean indexing is not supported.'
ds['q-kinds'][ds.coords['q-index'] == 0] = 'gamma'
# Attempt 2:
# Under 'More advanced indexing', the docs show that you can
# use isel with DataArrays to do pointwise indexing, but...
ds['q-kinds'].isel(
    # ...I don't know how to compute these index arrays from q-index...
    layer=xr.DataArray([1, 0]),
    q=xr.DataArray([5, 0]),
    # ...and the docs also clearly state that isel does not support mutation.
)[...] = 'gamma'  # FIXME ineffective
"xy-problem" style answers are okay. It seems to me that maybe the way you're supposed to build an array like this is to start with an array that (somehow) describes just the 'gamma' elements (and likewise an array for each other classification), use the immutable APIs to (somehow) merge/combine them, do something to make sure the data is dense along the q dimension, and then .fillna('other'). Or something like that. I really don't know.
You're very close! Instead of boolean indexing, you can use xarray.where() with three arguments:
>>> xr.where(ds.coords['q-index'] == 0, 'gamma', ds['q-kinds'])
<xarray.DataArray (layer: 2, q: 6)>
array([['gamma', 'other', 'other', 'other', 'other', 'other'],
       ['other', 'other', 'other', 'other', 'gamma', 'other']], dtype=object)
Coordinates:
q-index (layer, q) int64 0 1 2 3 4 5 1 5 3 2 0 4
Dimensions without coordinates: layer, q
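Since xr.where returns a new object rather than modifying anything in place, assign the result back if you want it stored on the Dataset:
ds['q-kinds'] = xr.where(ds.coords['q-index'] == 0, 'gamma', ds['q-kinds'])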
Or equivalently, instead of using .isel() for assignment, you can use a dictionary inside [], e.g.,
>>> indexer = dict(layer=xr.DataArray([1, 0]), q=xr.DataArray([5, 0]))
>>> ds['q-kinds'][indexer] = 'gamma'
Note that it's important to create the DataArray objects explicitly inside the dictionary, so that they are both created with the same default dimension name dim_0:
>>> indexer
{'layer': <xarray.DataArray (dim_0: 2)>
array([1, 0])
Dimensions without coordinates: dim_0, 'q': <xarray.DataArray (dim_0: 2)>
array([5, 0])
Dimensions without coordinates: dim_0}
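If you'd rather not rely on the shared default name, you can give both indexers an explicit common dimension; the name 'points' below is arbitrary:
indexer = dict(layer=xr.DataArray([1, 0], dims=['points']),
               q=xr.DataArray([5, 0], dims=['points']))
ds['q-kinds'][indexer] = 'gamma'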
If you pass lists or 1D numpy arrays directly, they are assumed to be along independent dimensions, so you would end up with "outer" style indexing instead:
>>> indexer = dict(layer=[1, 0], q=[5, 0])
>>> ds['q-kinds'][indexer] = 'gamma'
>>> ds['q-kinds']
<xarray.DataArray 'q-kinds' (layer: 2, q: 6)>
array([['gamma', 'other', 'other', 'other', 'other', 'gamma'],
['gamma', 'other', 'other', 'other', 'other', 'gamma']], dtype=object)
Coordinates:
q-index (layer, q) int64 0 1 2 3 4 5 1 5 3 2 0 4
Dimensions without coordinates: layer, q
Suppose I have the following dataframe:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Bar=[1, 2, 3, 4]))
i.e.:
   Bar Foo
0    1   A
1    2   A
2    3   B
3    4   B
Then I create a pandas.GroupBy object:
g = df.groupby('Foo')
How can I get, from g, the fact that g is grouped by a column originally named Foo?
If I do g.groups I get:
{'A': Int64Index([0, 1], dtype='int64'),
'B': Int64Index([2, 3], dtype='int64')}
That tells me the values that the Foo column takes ('A' and 'B') but not the original column name.
Now, I can just do something like:
g.first().index.name
But it seems odd that there's not an attribute of g with the group name in it, so I feel like I must be missing something. In particular, if g was grouped by multiple columns, then the above doesn't work:
df = pd.DataFrame(dict(Foo=['A', 'A', 'B', 'B'], Baz=['C', 'D', 'C', 'D'], Bar=[1, 2, 3, 4]))
g = df.groupby(['Foo', 'Baz'])
g.first().index.name # returns None, because it's a MultiIndex
g.first().index.names # returns ['Foo', 'Baz']
For context, I am trying to do some plotting with a grouped dataframe, and I want to be able to label each facet (which is plotting a single group) with the name of that group as well as the group label.
Is there a better way?
Query the names attribute of the GroupBy object's grouper (a BaseGrouper) to get the list of grouping names:
df.groupby('Foo').grouper.names
Which gives,
['Foo']
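The same should work when grouping by multiple columns, covering the MultiIndex case from the question (using the second df defined there):
df.groupby(['Foo', 'Baz']).grouper.names
# ['Foo', 'Baz']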
I find the behavior of the groupby method on a DataFrame object unexpected.
Let me explain with an example.
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
data1 = df['data1']
data1
# Out[14]:
# 0 1.989430
# 1 -0.250694
# 2 -0.448550
# 3 0.776318
# 4 -1.843558
# Name: data1, dtype: float64
data1 does not have the 'key1' column anymore.
So I would expect to get an error if I applied the following operation:
grouped = data1.groupby(df['key1'])
But I don't, and I can further apply the mean method on grouped to get the expected result.
grouped.mean()
# Out[13]:
# key1
# a -0.034941
# b 0.163884
# Name: data1, dtype: float64
However, the above operation does group data1 using the 'key1' column of df.
How can this happen? Does the interpreter store information about the originating DataFrame (df in this case) with the created DataFrame/Series (data1 in this case)?
Thank you.
It is only syntactic sugar; check the pandas groupby docs, which describe selecting a column (Series) for aggregation separately:
This is mainly syntactic sugar for the alternative and much more verbose
s = df['data1'].groupby(df['key1']).mean()
print (s)
key1
a 0.565292
b 0.106360
Name: data1, dtype: float64
Although the grouping columns are typically from the same dataframe or series, they don't have to be.
Your statement data1.groupby(df['key1']) is equivalent to data1.groupby(['a', 'a', 'b', 'b', 'a']). In fact, you can inspect the actual groups:
>>> data1.groupby(['a', 'a', 'b', 'b', 'a']).groups
{'a': [0, 1, 4], 'b': [2, 3]}
This means that your groupby on data1 will have a group a using rows 0, 1, and 4 from data1 and a group b using rows 2 and 3.
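You can see the same thing without any DataFrame at all; a minimal standalone sketch:
import pandas as pd
data1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name='data1')
# rows 0, 1, 4 form group 'a'; rows 2, 3 form group 'b'
data1.groupby(['a', 'a', 'b', 'b', 'a']).mean()
# a    2.666667
# b    3.500000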
I have some Numpy code which I'm trying to decipher. There's a line v1 = v1[:, a1.tolist()] which takes a numpy array a1 and converts it into a list. I'm confused as to what v1[:, a1.tolist()] actually does. I know that v1 is now being set to an array of columns selected from v1 by the selection [:, a1.tolist()], but what's getting selected? More precisely, what is [:, a1.tolist()] doing?
The syntax you observed is easier to understand if you split it in two parts:
1. Using a list as index
With numpy the meaning of
a[[1,2,3]]
is
[a[1], a[2], a[3]]
In other words, using a list as an index builds a list whose entries are the elements selected by each index in turn.
2. Selecting a column with [:,x]
The meaning of
a2[:, x]
is
[a2[0][x],
a2[1][x],
a2[2][x],
...
a2[n-1][x]]
I.e. it selects one column from a matrix.
Summing up
The meaning of
a[:, [1, 3, 5]]
is therefore
[[a[ 0 ][1], a[ 0 ][3], a[ 0 ][5]],
[a[ 1 ][1], a[ 1 ][3], a[ 1 ][5]],
...
[a[n-1][1], a[n-1][3], a[n-1][5]]]
In other words, a copy of a with a selection of columns (or a duplication and reordering of them; the elements in the list of indexes don't need to be distinct or sorted).
Assuming a simple example like a 2D array, v1[:, a1.tolist()] would select all rows of v1, but only the columns given by the values of a1.
Simple example:
>>> x
array([['a', 'b', 'c'],
       ['d', 'f', 'g']],
      dtype='|S1')
>>> x[:,[0]]
array([['a'],
       ['d']],
      dtype='|S1')
>>> x[:,[0, 1]]
array([['a', 'b'],
       ['d', 'f']],
      dtype='|S1')
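And, illustrating the duplication/reordering point from above:
>>> x[:, [1, 1, 0]]
array([['b', 'b', 'a'],
       ['f', 'f', 'd']],
      dtype='|S1')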