xarray RMSE of each XY cell from the median - python

I've written a leave-one-out spatial interpolation routine using xarray, where the interpolation is run n-1 times across the input data. Each iteration of the interpolation is stored in an xarray Dataset under an "IterN" dimension. Here's an example:
grids
Out[2]:
<xarray.Dataset>
Dimensions: (IterN: 84, northing: 53, easting: 54)
Coordinates:
* easting (easting) float64 5.648e+05 5.653e+05 ... 5.907e+05 5.912e+05
* northing (northing) float64 4.333e+06 4.334e+06 ... 4.359e+06 4.359e+06
* IterN (IterN) int64 0 1 2 3 4 5 6 7 8 9 ... 75 76 77 78 79 80 81 82 83
Data variables:
Data (IterN, northing, easting) float64 2.065 2.018 ... 0.3037 0.2913
I can compute some statistics over the dataset using the built-in functions, for example the median here:
median = grids.median(dim="IterN")
median
Out[5]:
<xarray.Dataset>
Dimensions: (northing: 53, easting: 54)
Coordinates:
* easting (easting) float64 5.648e+05 5.653e+05 ... 5.907e+05 5.912e+05
* northing (northing) float64 4.333e+06 4.334e+06 ... 4.359e+06 4.359e+06
Data variables:
Data (northing, easting) float64 2.111 2.061 2.011 ... 0.3104 0.2992
I would like to compute the root mean square error of the iteration sets against something like the median above. I can't see a built-in function for this, and I'm struggling to see how to implement this answer.
The desired output is an RMSE grid in the same XY dimensions as the other grids.
Any help would be greatly appreciated!

Does the following go in the direction you mean?
import numpy as np

median = grids.median(dim="IterN")          # per-cell median across iterations
error = grids - median                      # deviation of each iteration from the median
rmse = np.sqrt((error**2).mean("IterN"))    # RMSE grid with dims (northing, easting)
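To check the shapes end-to-end, here is a self-contained sketch with random stand-in data (the sizes are taken from the question; the values are synthetic):

import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
grids = xr.Dataset(
    {"Data": (("IterN", "northing", "easting"), rng.normal(size=(84, 53, 54)))},
    coords={"IterN": np.arange(84)},
)
median = grids.median(dim="IterN")
rmse = np.sqrt(((grids - median) ** 2).mean("IterN"))
assert rmse.Data.dims == ("northing", "easting")   # same XY grid as the input iterations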

Related

How to convert netCDFs with unusual dimensions to a standard netCDF (time, lat, lon) (python)

I have multiple netCDF files that I eventually want to merge. An example netCDF is as follows.
import xarray as xr
import numpy as np
import cftime
rain_nc = xr.open_dataset('filepath.nc', decode_times=False)
print(rain_nc)
<xarray.Dataset>
Dimensions:  (land: 67209, tstep: 248)
Dimensions without coordinates: land, tstep
Data variables:
    lon      (land) float32 ...
    lat      (land) float32 ...
    timestp  (tstep) int32 ...
    time     (tstep) int32 ...
    Rainf    (tstep, land) float32 ...
The dimension 'land' is a count from 1 to 67209, and 'tstep' is a count from 1 to 248.
The variables 'lat' and 'lon' are latitude and longitude values with a shape of (67209,).
The variable 'time' is the time in seconds since the start of the month (the netCDF is a month long).
Next, I've swapped the dimensions from 'tstep' to 'time', converted the time values for later merging, and set 'lon', 'lat' and 'time' as coordinates:
rain_nc = rain_nc.swap_dims({'tstep':'time'})
rain_nc = rain_nc.set_coords(['lon', 'lat', 'time'])
rain_nc['time'] = cftime.num2date(rain_nc['time'], units='seconds since 2016-01-01 00:00:00', calendar = 'standard')
rain_nc['time'] = cftime.date2num(rain_nc['time'], units='seconds since 1970-01-01 00:00:00', calendar = 'standard')
This has left me with the following Dataset:
print(rain_nc)
<xarray.Dataset>
Dimensions:  (land: 67209, time: 248)
Coordinates:
    lon      (land) float32 ...
    lat      (land) float32 ...
  * time     (time) float64 ...
Dimensions without coordinates: land
Data variables:
    timestp  (time) int32 ...
    Rainf    (time, land) float32 ...
print(rain_nc['land'])
<xarray.DataArray 'land' (land: 67209)>
array([    0,     1,     2, ..., 67206, 67207, 67208])
Coordinates:
    lon      (land) float32 ...
    lat      (land) float32 ...
Dimensions without coordinates: land
The 'Rainf' variable I am interested in is as follows:
<xarray.DataArray 'Rainf' (time: 248, land: 67209)>
[16667832 values with dtype=float32]
Coordinates:
lon (land) float32 -179.75 -179.75 -179.75 ... 179.75 179.75
179.75
lat (land) float32 71.25 70.75 68.75 68.25 ... -16.25 -16.75
-19.25
* time (time) float64 1.452e+09 1.452e+09 ... 1.454e+09 1.454e+09
Dimensions without coordinates: land
Attributes:
title: Rainf
units: kg/m2s
long_name: Mean rainfall rate over the \nprevious 3 hours
actual_max: 0.008114143
actual_min: 0.0
Fill_value: 1e+20
From here I would like to create a netCDF with the dimensions (time, lat, lon) and the variable Rainf.
I have tried creating a new netCDF (and altering this one), but passing the Rainf variable does not work: it has a shape of (248, 67209), while a shape of (248, 67209, 67209) would apparently be needed, even though the current 'land' dimension of 'Rainf' carries both a lat and a lon coordinate. Is it possible to reshape this variable to have time, lat, and lon dimensions?
In the end, it seems that what you want is to reshape the "land" dimension into ("lat", "lon") dimensions.
So, you have some DataArray similar to this:
import numpy as np
import xarray as xr

# Set sizes and the 1-D coordinates of the flattened grid
lon_size, lat_size = 50, 80
lon, lat = [arr.flatten() for arr in np.meshgrid(range(lon_size), range(lat_size))]
land_size = lon_size * lat_size
time_size = 100

da = xr.DataArray(
    dims=("time", "land"),
    data=np.random.randn(time_size, land_size),
    coords=dict(
        time=np.arange(time_size),
        lon=("land", lon),
        lat=("land", lat),
    ),
)
which looks like this:
>>> da
<xarray.DataArray (time: 100, land: 4000)>
array([[...]])
Coordinates:
* time (time) int64 0 1 ... 98 99
lon (land) int64 0 1 ... 48 49
lat (land) int64 0 0 ... 79 79
Dimensions without coordinates: land
First, we'll use the .set_index() method to tell xarray that the "land" index should be built from the "lon" and "lat" coordinates:
>>> da.set_index(land=("lon", "lat"))
<xarray.DataArray (time: 100, land: 4000)>
array([[...]])
Coordinates:
* time (time) int64 0 1 ... 98 99
* land (land) MultiIndex
- lon (land) int64 0 1 ... 48 49
- lat (land) int64 0 0 ... 79 79
The dimensions are still ("time", "land"), but now "land" is a MultiIndex.
Note that if you try to write to netCDF at this point, you'll get the following error:
>>> da.set_index(land=("lon", "lat")).to_netcdf("data.nc")
NotImplementedError: variable 'land' is a MultiIndex, which cannot yet be serialized to netCDF files (https://github.com/pydata/xarray/issues/1077). Use reset_index() to convert MultiIndex levels into coordinate variables instead.
It tells you to use the .reset_index() method, but that's not what you want here: it would just take you back to the original state of da.
What you want instead is the .unstack() method:
>>> da.set_index(land=("lon", "lat")).unstack("land")
<xarray.DataArray (time: 100, lon: 50, lat: 80)>
array([[[...]]])
Coordinates:
* time (time) int64 0 1 ... 98 99
* lon (lon) int64 0 1 ... 48 49
* lat (lat) int64 0 1 ... 78 79
It effectively removes the "land" dimension and gives the desired output.
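Since the question asks for a (time, lat, lon) ordering and a netCDF on disk, a transpose and a write-out could follow; a sketch (the output filename is illustrative):

result = da.set_index(land=("lon", "lat")).unstack("land").transpose("time", "lat", "lon")
result.to_netcdf("rainf_gridded.nc")   # serializes fine now that no MultiIndex is left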

Can dimensions be passed to a variable within an XArray Dataset?

I'm currently working on a project that involves clipping xarray datasets based on a certain grid of latitude and longitude values. To do this, I've imported a file with the specified lats and lons, and added dimensions to the dataset to give it the necessary lat/lon information. However, I also want to pass the dimensions to a data variable (called 'precip') within the dataset. I know that I'm able to pass the dimensions to the array of values in the variable, but can they be passed to the variable itself?
My code is below:
precip = obs_dataset_full.precip.expand_dims(lon=350, lat=450)
precip.coords['Longitude'] = (-1 * (360 - obs_dataset_full.Longitude))
precip.coords['Latitude'] = obs_dataset_full.Latitude
precip
With output as such:
<xarray.Dataset>
Dimensions:    (dim_0: 350, dim_1: 451, lat: 450, lon: 350)
Coordinates:
  * dim_0      (dim_0) int64 0 1 2 3 4 5 ... 345 346 347 348 349
  * dim_1      (dim_1) int64 0 1 2 3 4 5 ... 446 447 448 449 450
    Longitude  (lon) float64 -105.7 -105.7 ... -78.34 -78.26
    Latitude   (lat) float64 35.04 35.08 35.11 ... 51.52 51.56
Data variables:
    precip     (dim_0, dim_1) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 nan
Attributes: (0)
Specifically, I want the data variable precip to also possess dimensions of lat and lon, as the dataset does.
Thanks in advance!

Slicing longitude coordinate in xarray which crosses the 0

I have an xarray Dataset with a longitude coordinate running from 0.5 to 359.5, like the following:
Dimensions: (bnds: 2, lat: 40, lev: 35, lon: 31, member_id: 1)
Coordinates:
lev_bnds (lev, bnds) float64 ...
lon_bnds (lon, bnds) float64 ...
lat_bnds (lat, bnds) float64 ...
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 359.5
* lev (lev) float64 2.5 10.0 20.0 32.5 ... 5e+03 5.5e+03 6e+03 6.5e+03
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... -53.5 -52.5 -51.5 -50.5
* member_id (member_id) object 'r1i1p1f1'
Dimensions without coordinates: bnds
Data variables:
so (member_id, lev, lat, lon) float32 nan nan nan ...
The area I'm interested in is between 60W and 30E, which probably corresponds to longitudes 300.5 to 30.5. Is there any way to slice the dataset between these coordinates?
I tried to use isel(slice(-60, 30)), but it's not possible to go from negative to positive numbers in the slice function.
I know I can just split the data into two smaller parts (300.5-359.5 and 0.5-30.5), but I was wondering if there is a better way.
Thank you!
As you correctly point out, isel currently can't select from both the start and the end of a dimension in a single pass.
If you combine it with roll (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.roll.html), you can move the points you want into a contiguous region and then select the ones you need.
NB: I couldn't be sure from your example, but it looks like you may want sel rather than isel, given that you may be selecting by coordinate value rather than by position.
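For concreteness, here is a minimal sketch of the roll-then-select idea, assuming a full 1-degree grid (360 longitudes from 0.5 to 359.5); the counts would need adjusting for the 31-point grid in your example:

n_west = 60                                       # points in 300.5..359.5 (west of 0)
n_east = 31                                       # points in 0.5..30.5 (east of 0)
rolled = ds.roll(lon=n_west, roll_coords=True)    # wrap the 300.5..359.5 band to the front
subset = rolled.isel(lon=slice(0, n_west + n_east))   # contiguous 300.5..30.5 region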

Efficiently mask and calculate means for multiple groups in `xr.Dataset` xarray

I have two xr.Dataset objects. One is a continuous map of some variable (here precipitation). The other is a categorical map of a set of regions
['region_1', 'region_2', 'region_3', 'region_4'].
I want to calculate the mean precip in each region at each timestep by masking by region/time and then outputting a dataframe looking like the below.
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
I have some code, but it runs very slowly for the real datasets. Can anyone help me optimize it?
A minimal reproducible example
Initialising our objects: two variables of the same shape. The real region object will have been read from a shapefile and will have more than two regions.
import xarray as xr
import pandas as pd
import numpy as np
def make_dataset(
    variable_name='precip',
    size=(30, 30),
    start_date='2008-01-01',
    end_date='2010-01-01',
    lonmin=-180.0,
    lonmax=180.0,
    latmin=-55.152,
    latmax=75.024,
):
    # create 1D lat/lon coordinates
    lat_len, lon_len = size
    longitudes = np.linspace(lonmin, lonmax, lon_len)
    latitudes = np.linspace(latmin, latmax, lat_len)
    dims = ["lat", "lon"]
    coords = {"lat": latitudes, "lon": longitudes}

    # add time dimension
    times = pd.date_range(start_date, end_date, name="time", freq="M")
    size = (len(times), size[0], size[1])
    dims.insert(0, "time")
    coords["time"] = times

    # create random values
    var = np.random.randint(100, size=size)
    return xr.Dataset({variable_name: (dims, var)}, coords=coords), size
ds, size = make_dataset()
# create dummy regions (not contiguous but doesn't matter for this example)
region_ds = xr.ones_like(ds).rename({'precip': 'region'})
array = np.random.choice([0, 1, 2, 3], size=size)
region_ds = region_ds * array
# create a dictionary explaining what the regions are
region_lookup = {
0: 'region_1',
1: 'region_2',
2: 'region_3',
3: 'region_4',
}
What do these objects look like?
In[]: ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
Data variables:
precip (time, lat, lon) int64 51 92 14 71 60 20 82 ... 16 33 34 98 23 53
In[]: region_ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
Data variables:
region (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
Current Implementation
In order to calculate the mean of the variable in ds for each of the regions ['region_1', 'region_2', ...] in region_ds at each time, I currently loop over each REGION and then over each TIMESTEP in the da object. This gets very slow as the dataset grows (more pixels and more timesteps). Is there a more efficient / vectorized use of numpy / xarray that will get me my desired result faster?
def drop_nans_and_flatten(dataArray: xr.DataArray) -> np.ndarray:
    """flatten the array and drop nans from that array. Useful for plotting histograms.

    Arguments:
    ---------
    : dataArray (xr.DataArray)
        the DataArray of your value you want to flatten
    """
    # drop NaNs and flatten
    return dataArray.values[~np.isnan(dataArray.values)]
#
da = ds.precip
region_da = region_ds.region
valid_region_ids = [k for k in region_lookup.keys()]
# initialise empty lists
region_names = []
datetimes = []
mean_values = []
for valid_region_id in valid_region_ids:
    for time in da.time.values:
        region_names.append(region_lookup[valid_region_id])
        datetimes.append(time)
        # mean over all non-nan pixels in that region at that time
        mean_values.append(
            da.sel(time=time).where(region_da == valid_region_id).mean().values
        )

df = pd.DataFrame(
    {
        "datetime": datetimes,
        "region_name": region_names,
        "mean_value": mean_values,
    }
)
The output:
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
In [7]: df.tail()
Out[7]:
datetime region_name mean_value
43 2009-08-31 region_4 50.83111111111111
44 2009-09-30 region_4 48.40888888888889
45 2009-10-31 region_4 51.56148148148148
46 2009-11-30 region_4 48.961481481481485
47 2009-12-31 region_4 48.36296296296296
In [20]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
datetime 96 non-null datetime64[ns]
region_name 96 non-null object
mean_value 96 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 2.4+ KB
In [21]: df.describe()
Out[21]:
datetime region_name mean_value
count 96 96 96
unique 24 4 96
top 2008-10-31 00:00:00 region_1 48.88984800150122
freq 4 24 1
first 2008-01-31 00:00:00 NaN NaN
last 2009-12-31 00:00:00 NaN NaN
Any help would be very much appreciated! Thank you.
It's hard to avoid iterating to generate the masks for the regions given how they are defined, but once you have those constructed (e.g. with the code below), I think the following would be pretty efficient:
regions = xr.concat(
    [(region_ds.region == region_id).expand_dims(region=[region])
     for region_id, region in region_lookup.items()],
    dim='region'
)
result = ds.precip.where(regions).mean(['lat', 'lon'])
This generates a DataArray with 'time' and 'region' dimensions, where the value at each point is the mean at a given time over a given region. It would be straightforward to extend this to an area-weighted average if that were desired too.
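For example, an area-weighted version could use cosine-of-latitude weights via xarray's weighted reductions; a sketch, assuming a recent enough xarray (.weighted() was added in v0.15.1):

weights = np.cos(np.deg2rad(ds.lat))    # proportional to grid-cell area on a regular grid
result_weighted = ds.precip.where(regions).weighted(weights).mean(['lat', 'lon'])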
An alternative option that generates the same result would be:
regions = xr.DataArray(
    list(region_lookup.keys()),
    coords=[list(region_lookup.values())],
    dims=['region']
)
result = ds.precip.where(regions == region_ds.region).mean(['lat', 'lon'])
Here regions is basically just a DataArray representation of the region_lookup dictionary.
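Either way, result can be brought into the tabular shape shown in the question with pandas (a sketch; the renames just match the desired column names):

df = (
    result.rename("mean_value")
    .to_dataframe()
    .reset_index()
    .rename(columns={"time": "datetime", "region": "region_name"})
)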

How can I find the nearest distance between points in a 2D array and a 3D matrix grid?

I have buoy data as an array of longitudes and latitudes for 30 days, and I would like to find the closest distance between the buoys' locations and the 0% sea-ice concentration for each day. The sea-ice concentration data is a 3D matrix, but the space dimensions are in x/y coordinates, not latitudes and longitudes.
I have converted all the concentrations above 0% to NaN. I am now not sure how to locate the closest latitude/longitude points of the sea ice to each point along the buoy trajectory.
This is my ice dataset:
Dimensions: (time: 363, x: 2528, y: 2656)
Coordinates:
y (y) int16 1 2 3 4 5 6 ... 2652 2653 2654 2655 2656
x (x) int16 1 2 3 4 5 6 ... 2524 2525 2526 2527 2528
time (time) datetime64[ns] 2017-01-01T12:00:00 ... 2017-12-30T12:00:00
longitude (time, y, x) float32 dask.array
latitude (time, y, x) float32 dask.array
Data variables:
sea_ice_concentration (time, y, x) float32 dask.array<shape=(363, 2656, 2528), chunksize=(1, 2656, 2528)>
land (time, y, x) int8 dask.array<shape=(363, 2656, 2528), chunksize=(1, 2656, 2528)>
You could try using Dijkstra's shortest path algorithm. The link contains an example in Python which could be your starting point.
You will then need to convert your results, which will be in x/y dimensions, to latitude/longitude.
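If a straight-line (great-circle) distance is enough, a simpler alternative to the shortest-path approach is a nearest-neighbour search over the latitude/longitude of the 0%-concentration cells, e.g. with scipy's cKDTree; a sketch for a single day, calling the ice dataset ds and using illustrative 1-D arrays buoy_lat and buoy_lon:

import numpy as np
from scipy.spatial import cKDTree

def to_xyz(lat, lon):
    # degrees -> Cartesian coordinates on the unit sphere
    lat, lon = np.deg2rad(lat), np.deg2rad(lon)
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

day = ds.isel(time=0)                                 # one day of the ice dataset
mask = day.sea_ice_concentration.values == 0          # 0% cells (everything else is NaN)
tree = cKDTree(to_xyz(day.latitude.values[mask], day.longitude.values[mask]))

chord, _ = tree.query(to_xyz(buoy_lat, buoy_lon))     # 3D chord length to nearest 0% cell
dist_km = 6371.0 * 2 * np.arcsin(chord / 2)           # chord -> great-circle distance in km

Note that, unlike the shortest-path approach, this measures straight-line separation and ignores obstacles such as land.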
