How to reshape xarray dataset by collapsing coordinate - python

I currently have a dataset that when opened with xarray contains three coordinates x, y, band. The band coordinate has temperature and dewpoint each at 4 different time intervals, meaning there are 8 total bands. Is there a way to reshape this so that I could have x, y, band, time such that the band coordinate is now only length 2 and the time coordinate would be length 4?
I thought I could add a new coordinate named time and then move the bands into it, but
ds = ds.assign_coords(time=[1,2,3,4])
returns ValueError: cannot add coordinates with new dimensions to a DataArray.

You can re-assign the "band" coordinate to a MultiIndex:
In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import xarray as xr

In [4]: da = xr.DataArray(np.random.random((4, 4, 8)), dims=['x', 'y', 'band'])
In [5]: da.coords['band'] = pd.MultiIndex.from_arrays(
   ...:     [
   ...:         [1, 1, 1, 1, 2, 2, 2, 2],
   ...:         pd.to_datetime(['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01'] * 2),
   ...:     ],
   ...:     names=['band_stacked', 'time'],
   ...: )
In [6]: da
Out[6]:
<xarray.DataArray (x: 4, y: 4, band: 8)>
array([[[2.55228052e-01, 6.71680777e-01, 8.76158643e-01, 5.23808010e-01,
         8.56941412e-01, 2.75757101e-01, 7.88877551e-02, 1.54739786e-02],
        [3.70350510e-01, 1.90604842e-02, 2.17871931e-01, 9.40704074e-01,
         4.28769745e-02, 9.24407375e-01, 2.81715762e-01, 9.12889594e-01],
        [7.36529770e-02, 1.53507827e-01, 2.83341417e-01, 3.00687140e-01,
         7.41822972e-01, 6.82413237e-01, 7.92126231e-01, 4.84821281e-01],
        [5.24897891e-01, 4.69537663e-01, 2.47668326e-01, 7.56147251e-02,
         6.27767921e-01, 2.70630355e-01, 5.44669493e-01, 3.53063860e-01]],
       ...
       [[1.56513994e-02, 8.49568142e-01, 3.67268562e-01, 7.28406400e-01,
         2.82383223e-01, 5.00901504e-01, 9.99643260e-01, 1.16446139e-01],
        [9.98980637e-01, 2.45060112e-02, 8.12423749e-01, 4.49895624e-01,
         6.64880037e-01, 8.73506549e-01, 1.79186788e-01, 1.94347924e-01],
        [6.32000394e-01, 7.60414128e-01, 4.90153658e-01, 3.40693056e-01,
         5.19820559e-01, 4.49398587e-01, 1.90339730e-01, 6.38101614e-02],
        [7.64102189e-01, 6.79961676e-01, 7.63165470e-01, 6.23766131e-02,
         5.62677420e-01, 3.85784911e-01, 4.43436365e-01, 2.44385584e-01]]])
Coordinates:
  * band          (band) MultiIndex
  - band_stacked  (band) int64 1 1 1 1 2 2 2 2
  - time          (band) datetime64[ns] 2020-01-01 2021-01-01 ... 2023-01-01
Dimensions without coordinates: x, y
Then you can expand the dimensionality by unstacking, renaming the remaining level back to "band":
In [7]: unstacked = da.unstack('band').rename({'band_stacked': 'band'})

In [8]: unstacked
Out[8]:
<xarray.DataArray (x: 4, y: 4, band: 2, time: 4)>
array([[[[2.55228052e-01, 6.71680777e-01, 8.76158643e-01,
          5.23808010e-01],
         [8.56941412e-01, 2.75757101e-01, 7.88877551e-02,
          1.54739786e-02]],
        ...
        [[7.64102189e-01, 6.79961676e-01, 7.63165470e-01,
          6.23766131e-02],
         [5.62677420e-01, 3.85784911e-01, 4.43436365e-01,
          2.44385584e-01]]]])
Coordinates:
  * band     (band) int64 1 2
  * time     (time) datetime64[ns] 2020-01-01 2021-01-01 2022-01-01 2023-01-01
Dimensions without coordinates: x, y
Another, more manual option is to reshape in numpy and create a new DataArray. Note that this manual reshape is much faster for larger arrays, and that it assumes the stacked band dimension is ordered band-major (all four time steps for band 1, then all four for band 2), matching the MultiIndex above:
In [9]: reshaped = xr.DataArray(
   ...:     da.data.reshape((4, 4, 2, 4)),
   ...:     dims=['x', 'y', 'band', 'time'],
   ...:     coords={
   ...:         'time': pd.to_datetime(['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01']),
   ...:         'band': [1, 2],
   ...:     },
   ...: )
Note that if your data is chunked with dask (and assuming you'd like to keep it that way), your options are a bit more limited - see the dask docs on reshaping dask arrays. The first (MultiIndex + unstack) approach does work with dask arrays as long as they are not chunked along the stacked dimension. See this question for an example.
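For instance, a minimal sketch (assuming dask is installed; da is the array from above, chunked only along x and y so the stacked "band" dimension stays in a single chunk):
In [10]: da_chunked = da.chunk({'x': 2, 'y': 2})  # 'band' remains one chunk

In [11]: lazy = da_chunked.unstack('band').rename({'band_stacked': 'band'})  # stays lazy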

Related

Why is the result of a Pandas' melt Fortran contiguous and not C-contiguous?

I ran into some pandas melt behavior that undermines my mental model of that function and I wonder if somebody could explain why this is sane/logical/desirable behavior.
The following snippet melts down a dataframe and then converts the result into a numpy array. Since I'm melting all columns, I would have expected the result to be similar to what np.ndarray.ravel() would do, i.e., create a 1D view into the data and add a column with the respective column names (var names). However - to my surprise - melt actually makes a copy of the data and reorders it as F-contiguous. Why is F-contiguity a good idea here?
import numpy as np
import pandas as pd

expected_flat = np.arange(100 * 3)
expected_full = expected_flat.reshape(100, 3)
# expected_full is view into flat array
assert expected_full.base is expected_flat
assert expected_flat.flags["C_CONTIGUOUS"]
test_df = pd.DataFrame(
    expected_flat.reshape(100, 3),
    columns=["a", "b", "c"],
)
# test_df, too, is a view into flat array
reconstructed = test_df.to_numpy()
assert reconstructed.base is expected_flat
flatten_melt = test_df.melt(var_name="col", value_name="foobar")
flatten_melt_numpy = flatten_melt.foobar.to_numpy()
# flatten_melt is NOT a view and reordered
assert flatten_melt_numpy.base is not expected_flat
assert not np.allclose(flatten_melt_numpy, expected_flat)
# the confusing part is that the array is now F-contiguous
reconstructed_melt = flatten_melt_numpy.reshape(100, 3, order="F")
assert np.allclose(reconstructed_melt, expected_full)
Construct a frame from a pair of "series":
In [322]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [323]: df
Out[323]:
   a  b
0  1  4
1  2  5
2  3  6
In [324]: arr = df.to_numpy()
In [325]: arr
Out[325]:
array([[1, 4],
       [2, 5],
       [3, 6]])
In [326]: arr.flags
Out[326]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
In [327]: arr.strides
Out[327]: (8, 24)
The resulting array is F_CONTIGUOUS: the first-axis stride (8 bytes, one int64) is smaller than the second-axis stride (24 bytes), so elements within a column are adjacent in memory.
If I make a frame from a 2d array, the values are stored in the same order as the input, in this case order 'C':
In [328]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [329]: df1
Out[329]:
   a  b
0  1  2
1  3  4
2  5  6
In [330]: df1.to_numpy().strides
Out[330]: (16, 8)
Create it from an order-'F' array, and the result is the same as in the first case:
In [332]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2, order="F"), columns=["a", "b"])
In [333]: df1
Out[333]:
   a  b
0  1  4
1  2  5
2  3  6
In [334]: df1.to_numpy().strides
Out[334]: (8, 24)
melt
Going back to the frame created from the order-'C' array:
In [335]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [336]: df2 = df1.melt()
In [337]: df2
Out[337]:
  variable  value
0        a      1
1        a      3
2        a      5
3        b      2
4        b      4
5        b      6
Notice how the value column is a vertical concatenation of the 'a' and 'b' columns. This is what the method examples show. I don't use pivot enough to know if this is a natural interpretation of that or not.
Converting the melted frame back to numpy, the result is again order 'F':
In [338]: df2.to_numpy()
Out[338]:
array([['a', 1],
       ['a', 3],
       ['a', 5],
       ['b', 2],
       ['b', 4],
       ['b', 6]], dtype=object)
In [339]: _.strides
Out[339]: (8, 48)
In df1 both columns are int dtype, and can be stored as a 2d array:
In [340]: df1.dtypes
Out[340]:
a    int64
b    int64
dtype: object
df2's columns have different dtypes - object (string) and int - so they are stored as separate arrays. to_numpy constructs an object-dtype array from them, and it comes out order 'F':
In [341]: df2.dtypes
Out[341]:
variable    object
value        int64
dtype: object
We get a hint of this storage from:
In [352]: df1._mgr
Out[352]:
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: int64
In [353]: df2._mgr
Out[353]:
BlockManager
Items: Index(['variable', 'value'], dtype='object')
Axis 1: RangeIndex(start=0, stop=6, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 6, dtype: object
NumericBlock: slice(1, 2, 1), 1 x 6, dtype: int64
How a dataframe stores its values is a complex subject, and I have not read a comprehensive description. I've only gathered bits and pieces from experimenting like this.
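A minimal sketch of the mechanism as I read the BlockManager output above (my own illustration, not pandas internals code): a block stores its values as an n_columns x n_rows array, so reassembling a frame's values amounts to transposing a C-contiguous stack of columns, and the transpose of a C-contiguous array is F-contiguous:
import numpy as np

# each block holds its values as (n_columns, n_rows), C-contiguous
blocks = np.stack([np.arange(6), np.arange(6) * 10])
assert blocks.flags["C_CONTIGUOUS"]

# the frame's (n_rows, n_columns) values are the transpose of that stack
values = blocks.T
assert values.flags["F_CONTIGUOUS"]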

Python xarray - vectorized indexing

I'm trying to understand vectorized indexing in xarray by following this example from the docs:
import xarray as xr
import numpy as np
da = xr.DataArray(np.arange(12).reshape((3, 4)), dims=['x', 'y'],
                  coords={'x': [0, 1, 2], 'y': ['a', 'b', 'c', 'd']})
ind_x = xr.DataArray([0, 1], dims=['x'])
ind_y = xr.DataArray([0, 1], dims=['y'])
The output of the array da is as follows:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
So far so good. Now the example shows two ways of indexing: orthogonal (not of interest in this case) and vectorized (what I want). For the vectorized indexing the following is shown:
In [37]: da[ind_x, ind_x] # vectorized indexing
Out[37]:
<xarray.DataArray (x: 2)>
array([0, 5])
Coordinates:
y (x) <U1 'a' 'b'
* x (x) int64 0 1
The result seems to be what I want, but this feels very strange to me. ind_x (which in theory refers to dims=['x']) is being passed twice, yet somehow it is capable of indexing along both the x and the y dims. As far as I understand, the x dim would be the rows and the y dim would be the columns, is that correct? How come the same ind_x is capable of accessing both the rows and the cols?
This seems to be the concept I need for my problem, but I can't understand how it works or how to extend it to more dimensions. I was expecting this result to be given by da[ind_x, ind_y]; surprisingly enough, however, that seems to yield the orthogonal indexing.
Having the example with ind_x being used twice is probably a little confusing: actually, the dimension of the indexer doesn't have to matter at all for the indexing behavior! Observe:
ind_a = xr.DataArray([0, 1], dims=["a"])
da[ind_a, ind_a]
Gives:
<xarray.DataArray (a: 2)>
array([0, 5])
Coordinates:
x (a) int32 0 1
y (a) <U1 'a' 'b'
Dimensions without coordinates: a
The same goes for the orthogonal example:
ind_a = xr.DataArray([0, 1], dims=["a"])
ind_b = xr.DataArray([0, 1], dims=["b"])
da[ind_a, ind_b]
Result:
<xarray.DataArray (a: 2, b: 2)>
array([[0, 2],
       [4, 6]])
Coordinates:
x (a) int32 0 1
y (b) <U1 'a' 'c'
Dimensions without coordinates: a, b
The difference is purely in terms of "labeling", as in this case you end up with dimensions without coordinates.
Fancy indexing
Generally stated, I personally do not find "fancy indexing" the most intuitive concept. I did find this example in NEP 21 pretty clarifying: https://numpy.org/neps/nep-0021-advanced-indexing.html
Specifically, this:
Consider indexing a 2D array by two 1D integer arrays, e.g., x[[0, 1], [0, 1]]:
Outer indexing is equivalent to combining multiple integer indices with itertools.product(). The result in this case is another 2D array with all combinations of indexed elements, e.g., np.array([[x[0, 0], x[0, 1]], [x[1, 0], x[1, 1]]]).
Vectorized indexing is equivalent to combining multiple integer indices with zip(). The result in this case is a 1D array containing the diagonal elements, e.g., np.array([x[0, 0], x[1, 1]]).
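A small runnable check of both behaviors in plain numpy (my own illustration of the NEP's example, not from the xarray docs):
import numpy as np

x = np.arange(12).reshape(3, 4)
# vectorized ("fancy") indexing zips the index arrays -> diagonal elements
assert (x[[0, 1], [0, 1]] == np.array([x[0, 0], x[1, 1]])).all()
# outer indexing (np.ix_) takes the cartesian product of the index arrays
assert (x[np.ix_([0, 1], [0, 1])] == np.array([[x[0, 0], x[0, 1]],
                                               [x[1, 0], x[1, 1]]])).all()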
Back to xarray
da[ind_x, ind_y]
Can also be written as:
da.isel(x=ind_x, y=ind_y)
The dimensions are implicit in the order. However, xarray still attempts to broadcast (based on dimension labels), so da[ind_y] mismatches and results in an error. da[ind_a] and da[ind_b] both work.
More dimensions
The dims you provide for the indexer are what determine the shape of the output, not the dimensions of the array you're indexing.
If you want to select single values along the dimensions (so we're zip()-ing through the indexes simultaneously), just make sure that your indexers share the same dimension name; here for a 3D array:
da = xr.DataArray(
    data=np.arange(3 * 4 * 5).reshape(3, 4, 5),
    coords={
        "x": [1, 2, 3],
        "y": ["a", "b", "c", "d"],
        "z": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    dims=["x", "y", "z"],
)
ind_along_x = xr.DataArray([0, 1], dims=["new_index"])
ind_along_y = xr.DataArray([0, 2], dims=["new_index"])
ind_along_z = xr.DataArray([0, 3], dims=["new_index"])
da[ind_along_x, ind_along_y, ind_along_z]
Note that the values of the indexers do not have to be the same - that would be a pretty severe limitation, after all.
Result:
<xarray.DataArray (new_index: 2)>
array([ 0, 33])
Coordinates:
x (new_index) int32 1 2
y (new_index) <U1 'a' 'c'
z (new_index) float64 1.0 4.0
Dimensions without coordinates: new_index
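To connect this back to numpy: the same selection is what plain vectorized (zipped) indexing produces, which we can verify directly (a quick check using the da and indexers defined above):
# numpy's fancy indexing zips the three index arrays -> elements (0,0,0) and (1,2,3)
expected = da.values[[0, 1], [0, 2], [0, 3]]
result = da[ind_along_x, ind_along_y, ind_along_z]
assert (result.values == expected).all()  # array([ 0, 33])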

Is there a built-in function in xarray to remove outliers from a dataset?

I have a spatio-temporal .nc file that I opened as an xarray dataset, and I would like to remove the values that exceed the 99th percentile. Is there any easy/straightforward way to drop those values?
The information about my Dataset is:
Dimensions: (latitude: 204, longitude: 180, time: 985)
Coordinates:
* longitude (longitude) float32 -69.958336 -69.875 ... -55.124996 -55.04166
* latitude (latitude) float32 -38.041668 -38.12501 ... -54.87501 -54.95834
* time (time) datetime64[ns] 1997-09-06 1997-09-14 ... 2019-09-06
Data variables:
chl (time, latitude, longitude) float64 nan nan nan ... nan nan nan
You can create your own function:
import xarray as xr
import numpy as np

# perc -> percentile that defines the exclusion threshold
# dim  -> dimension to which the filtering is applied
def replace_outliers(data, dim=0, perc=0.99):
    # calculate the percentile threshold
    threshold = data[dim].quantile(perc)
    # mask outliers: .where replaces values above the threshold with nan
    mask = data[dim].where(abs(data[dim]) <= threshold)
    # replace the nans with the max among the remaining values
    max_value = mask.max().values
    mask = mask.fillna(max_value)
    data[dim] = mask
    return data
Testing
data = np.random.randint(1, 5, [3, 3, 3])
# create an outlier
data[0, 0, 0] = 100
temp = xr.DataArray(data.copy())
print(temp[0])
Out:
array([[100,   1,   2],
       [  4,   4,   4],
       [  1,   4,   3]])
Apply the function (note that perc must be a fraction in [0, 1], as quantile expects):
temp = replace_outliers(temp, dim=0, perc=0.99)
print(temp[0])
Out:
array([[4, 1, 2],
       [4, 4, 4],
       [1, 4, 3]])
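For the dataset in the question there is also a shorter route using only built-ins: quantile plus where (a sketch; 'file.nc' is a hypothetical path, and this masks outliers with NaN rather than dropping them):
import xarray as xr

ds = xr.open_dataset('file.nc')  # hypothetical path to the .nc file
threshold = ds['chl'].quantile(0.99)                  # 99th percentile over all dims
ds['chl'] = ds['chl'].where(ds['chl'] <= threshold)   # values above it -> NaN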

Stack xarray DataArray

I have N 1D xr.DataArrays, each with one array coordinate b and one scalar coordinate a. I want to combine them into a 2D DataArray with array coordinates b, a. How can I do this? I have tried:
x1 = xr.DataArray(np.arange(0,3)[...,np.newaxis], coords=[('b', np.arange(3,6)),('a', [10])]).squeeze()
x2 = xr.DataArray(np.arange(0,3)[...,np.newaxis], coords=[('b', np.arange(3,6)),('a', [11])]).squeeze()
xcombined = xr.concat([x1, x2])
xcombined
Results in:
<xarray.DataArray (concat_dims: 2, b: 3)>
array([[0, 1, 2],
       [0, 1, 2]])
Coordinates:
  * b        (b) int64 3 4 5
    a        (concat_dims) int64 10 11
Dimensions without coordinates: concat_dims
Now I'd like to select a particular 'a':
xcombined.sel(a=10)
However, this raises:
ValueError: dimensions or multi-index levels ['a'] do not exist
If you supply dim to concat, this works:
xcombined = xr.concat([x1, x2], dim='a')
And then:
xcombined.sel(a=10)
<xarray.DataArray (b: 3)>
array([0, 1, 2])
Coordinates:
  * b        (b) int64 3 4 5
    a        int64 10
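The same pattern generalizes to N arrays (a sketch with hypothetical values): collect them in a list and concatenate along 'a' in one call; each input's scalar 'a' coordinate becomes one entry of the new dimension's index:
import numpy as np
import xarray as xr

arrays = [
    xr.DataArray(np.arange(3), dims=['b'],
                 coords={'b': np.arange(3, 6), 'a': a_val})
    for a_val in (10, 11, 12)
]
combined = xr.concat(arrays, dim='a')
combined.sel(a=11)  # select by the 'a' coordinate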

Interpolating between values of a DataArray

I would like to know the most efficient or most elegant way to interpolate between values of a DataArray. Ideally, this should be usable for an arbitrary number of dimensions, but good solutions for low dimensions such as 2D and 3D would also be useful.
I am aware of the 'method' keyword for the sel method and have the feeling that it is going to be part of the answer, but I find the solution I came up with not very elegant. (I do not know about its efficiency.) Let me illustrate this solution:
>>> import xarray as xr
>>> arr = xr.DataArray([[12, 32], [14, 34]],
...                    dims=['x', 'y'], coords={'x': [1, 3], 'y': [2, 4]})
>>> arr
<xarray.DataArray (x: 2, y: 2)>
array([[12, 32],
       [14, 34]])
Coordinates:
* x (x) int64 1 3
* y (y) int64 2 4
>>> arry = ((arr.sel(x=2, method='pad') * (2 - arr.coords['x'].sel(x=2, method='pad')) +
... arr.sel(x=2, method='bfill') * (arr.coords['x'].sel(x=2, method='bfill') - 2)) /
... (arr.coords['x'].sel(x=2, method='bfill') - arr.coords['x'].sel(x=2, method='pad')))
>>> arry
<xarray.DataArray (y: 2)>
array([ 13., 33.])
Coordinates:
* y (y) int64 2 4
>>> ((arry.sel(y=3, method='pad') * (3 - arry.coords['y'].sel(y=3, method='pad')) +
... arry.sel(y=3, method='bfill') * (arry.coords['y'].sel(y=3, method='bfill') - 3)) /
... (arry.coords['y'].sel(y=3, method='bfill') - arry.coords['y'].sel(y=3, method='pad')))
<xarray.DataArray ()>
array(23.0)
As for why I find this approach sub-optimal: the calculation of arry can involve a large number of unnecessary arithmetic operations when the y index is long (and not just length 2 as in this toy example).
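A note for later readers: newer xarray versions (assuming scipy is installed, since it performs the actual interpolation) provide a built-in interp method that does this n-dimensional linear interpolation directly, for an arbitrary number of dimensions:
>>> arr.interp(x=2, y=3).item()  # bilinear interpolation by default
23.0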
