How to set the coordinates of the output of xarray.assign?

How to set the coordinates of the output of xarray.assign? - python

I've been trying to create two new variables based on the latitude coordinate of a data point in an xarray dataset. However, I can only seem to assign new coordinates. The data set looks like this:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
What I've attempted so far is this:
def get_latitude_band(latitude):
return np.select(
condlist=
[abs(latitude) < 23.45,
abs(latitude) < 35,
abs(latitude) < 66.55],
choicelist=
["tropical",
"sub_tropical",
"temperate"],
default="frigid"
)
def get_hemisphere(latitude):
return np.select(
[latitude > 0, latitude <=0],
["north", "south"]
)
mhw_data = mhw_data \
.assign(climate_zone=get_latitude_band(mhw_data.lat)) \
.assign(hemisphere=get_hemisphere(mhw_data.lat)) \
.reset_index(["hemisphere", "climate_zone"]) \
.reset_coords()
print(mhw_data)
Which is getting me close:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412, hemisphere: 180, climate_zone: 180)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Dimensions without coordinates: hemisphere, climate_zone
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
hemisphere_ (hemisphere) object 'south' 'south' ... 'north' 'north'
climate_zone_ (climate_zone) object 'frigid' 'frigid' ... 'frigid' 'frigid'
...
However, I want to then stack the DataSet and convert it to a DataFrame. I am unable to do so, and I think it is because the new variables hemisphere_ and climate_zone_ do not have time, lat, lon coordinates:
stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
results in a KeyError on "lon".
So my question is: How do I assign new variables to the xarray dataset that maintain the original coordinates of time, lat and lon?

To assign a new variable or coordinate, xarray needs to know what the dimensions are called. There are a number of ways to define a DataArray or Coordinate, but the one closest to what you're currently using is to provide a tuple of (dim_names, array):
mhw_data = mhw_data.assign_coords(
climate_zone=(('lat', ), get_latitude_band(mhw_data.lat)),
hemisphere=(('lat', ), get_hemisphere(mhw_data.lat)),
)
Here I'm using da.assign_coords, which will define climate_zone and hemisphere as non-dimension coordinates, which you can think of as additional metadata about latitude and about your data, but which aren't proper data in themselves. This will also allow them to be preserved when sending individual arrays to pandas.
For stacking, converting to pandas will stack automatically. The following will return a DataFrame with variables/non-dimension coordinates as columns and dimensions as a MultiIndex:
stacked = mhw_data.to_dataframe()
Alternatively, if you want a Series indexed by (lat, lon, time) for just one of these coordinates you can always use expand_dims:
(
mhw_data.climate_zone
.expand_dims(lon=mhw_data.lon, time=mhw_data.time)
.to_series()
)

The two possible solutions I've worked out for myself are as follow:
First, stack the xarray data into pandas DataFrames, and then create new columns:
df = None
variables = list(mhw_data.data_vars)
for var in tqdm(variables):
stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
if df is None:
df = stacked
else:
df = pd.concat([df, stacked], axis=1)
df.reset_index(inplace=True)
df.columns = list(mhw_data.variables)
df["climate_zone"] = df["lat"].swifter.apply(get_latitude_band)
df["hemisphere"] = df["lat"].swifter.apply(get_hemisphere)
Create new xarray.DataArrays for each variable you'd like to add, then add them to the dataset:
# calculate climate zone and hemisphere from latitude.
latitudes = mhw_data.lat.values.reshape(-1, 1)
zones = get_latitude_band(latitudes)
hemispheres = get_hemisphere(latitudes)
# Take advantage of numpy broadcasting to get our data to lign up with the xarray shape.
shape = tuple(mhw_data.sizes.values())
base = np.zeros(shape)
zones = zones + base
hemispheres = hemispheres + base
# finally, create two new DataArrays and assign them as variables in the dataset.
zone_xarray = xr.DataArray(data=zones, coords=mhw_data.coords, dims=mhw_data.dims)
hemi_xarray = xr.DataArray(data=hemispheres, coords=mhw_data.coords, dims=mhw_data.dims)
mhw_data["zone"] = zone_xarray
mhw_data["hemisphere"] = hemi_xarray
# ... call the code to stack and convert to pandas (shown in method 1) ...#
My intuition is that method 1 is faster and more memory efficient because there are no repeated values that need broadcasting into a large 3-dimensional array. I did not test this, however.
Also, my intuition is that there is a less cumbersome xarray native way of accomplishing the same goal, but I could not find it.
One thing is certain, method 1 is far more concise due to the fact that there is no need to create intermediate arrays or reshape data.

Related

Need help interpolating data into specific points in netcdf files

I need help interpolating data onto nodes in a series of netcdf files which have oceanographic data. The files have 3 dimensions (latitude (12), longitude (20), time (24)) and variables (u, v) . Of all the data points, there are four nodes which do not have current velocity data (u, v) although they should have data, as they are at sea but register as land. I am trying to interpolate data onto these nodes but have no idea how.
EDIT: The four points with missing data are already in the coordinates but are have NaN values. The other points would keep the original data.
I am OK with Pandas but I know that this probably requires numpy and/or xarray and I am not well versed in either. I cannot get to nodes using the coordinates to interpolate the data I require. Can this be done at all?
print(data)
<xarray.Dataset>
Dimensions: (latitude: 12, time: 24, longitude: 20)
Coordinates:
* latitude (latitude) float32 40.92 41.0 41.08 41.17 ...
41.67 41.75 41.83
* time (time) datetime64[ns] 2017-03-03T00:30:00 ...
2017-03-03T23:30:00
* longitude (longitude) float32 1.417 1.5 1.583 1.667 ...
2.833 2.917 3.0
Data variables:
vo (time, latitude, longitude) float32 ...
uo (time, latitude, longitude) float32 ...
thetao (time, latitude, longitude) float32 ...
zos (time, latitude, longitude) float32 ...
Attributes: (12/22)
Conventions: CF-1.0
source: CMEMS IBI-MFC...
print(data.latitude)
<xarray.DataArray 'latitude' (latitude: 12)>
array([40.916668, 41. , 41.083332, 41.166668, 41.25 ,
41.333332, 41.416668, 41.5 , 41.583332, 41.666668, 41.75 , 41.833332], dtype=float32)
Coordinates:
* latitude (latitude) float32 40.92 41.0 41.08 41.17 ... 41.67
41.75 41.83
Attributes:
standard_name: latitude
long_name: Latitude
units: degrees_north
axis: Y
unit_long: Degrees North
step: 0.08333f
_CoordinateAxisType: Lat
_ChunkSizes: 361
valid_min: 40.916668
valid_max: 41.833332
print(data.longitude)
<xarray.DataArray 'longitude' (longitude: 20)>
array([1.416666, 1.499999, 1.583333, 1.666666, 1.749999, 1.833333, 1.916666, 1.999999, 2.083333, 2.166666, 2.249999, 2.333333, 2.416666, 2.499999,2.583333, 2.666666, 2.749999, 2.833333, 2.916666, 2.999999],
dtype=float32)
Coordinates:
* longitude (longitude) float32 1.417 1.5 1.583 1.667 ... 2.833 2.917 3.0
Attributes:
standard_name: longitude
long_name: Longitude
units: degrees_east
axis: X
unit_long: Degrees East
step: 0.08333f
_CoordinateAxisType: Lon
_ChunkSizes: 289
valid_min: 1.4166658
valid_max: 2.999999

The goal of the question is to in-fill cells, which are on land/missing in the raw files, but should really be in the sea. In some cases, you might want something sophisticated to do this. For example, if there was a sharp coastal gradient.
But the easiest way to solve it is to use nearest neighbour to replace missing values with the nearest neighbour. That will of course replace more than you need. So, you will then need to apply some kind of land-sea mask to your data. The workflow below, using my package nctoolkit, should do the job
import nctoolkit as nc
# read in the file and set missing values to nn
ds = nc.open_data("infile.nc")
ds.fill_na()
# create the mask to apply. This should only have one time step
# I'm going to assume in this case that it is a file with temperature that has the correct land-sea division
ds_mask = nc.open_data("mask.nc")
# Ensure sea values are 1. Land values should be nan
ds_mask.compare(">-1000")
# multiply the dataset by the mask to set land values to missing
ds.multiply(ds_mask)
# plot the results
ds.plot()

Add projection to rioxarray dataset in Python

I've downloaded a netcdf from the Climate Data Store and would like to write a CRS to it, so I can clip it for a shapefile. However, I get an error when assigning a CRS.
Below my script and what is being printed. I receive this error after trying to write a crs:
MissingSpatialDimensionError: y dimension (lat) not found. Data variable: lon_bnds
# load netcdf with xarray
dset = xr.open_dataset(netcdf_fn)
print(dset)
# add projection system to nc
dset = dset.rio.write_crs("EPSG:4326", inplace=True)
# mask CMIP6 data with shapefile
dset_shp = dset.rio.clip(shp.geometry.apply(mapping), shp.crs)
dset
Out[44]:
<xarray.Dataset>
Dimensions: (time: 1825, bnds: 2, lat: 2, lon: 1)
Coordinates:
* time (time) object 2021-01-01 12:00:00 ... 2025-12-31 12:00:00
* lat (lat) float64 0.4712 1.414
* lon (lon) float64 31.25
spatial_ref int32 0
Dimensions without coordinates: bnds
Data variables:
time_bnds (time, bnds) object ...
lat_bnds (lat, bnds) float64 0.0 0.9424 0.9424 1.885
lon_bnds (lon, bnds) float64 ...
pr (time, lat, lon) float32 ...
Attributes: (12/48)
Conventions: CF-1.7 CMIP-6.2
activity_id: ScenarioMIP
branch_method: standard
branch_time_in_child: 60225.0
branch_time_in_parent: 60225.0
comment: none
...
title: CMCC-ESM2 output prepared for CMIP6
variable_id: pr
variant_label: r1i1p1f1
license: CMIP6 model data produced by CMCC is licensed und...
cmor_version: 3.6.0
tracking_id: hdl:21.14100/0c6732f7-2cdd-4296-99a0-7952b7ca911e

When you call the rioxarray accessor ds.rio.clip using a xr.Dataset rather than a xr.DataArray, rioxarray needs to guess which variables in the dataset should be clipped. The method docstring gives the following warning:
Warning:
Clips variables that have dimensions ‘x’/’y’. Others are appended as is.
So the issue you're running into is that rioxarray sees four variables in your dataset:
Data variables:
time_bnds (time, bnds) object ...
lat_bnds (lat, bnds) float64 0.0 0.9424 0.9424 1.885
lon_bnds (lon, bnds) float64 ...
pr (time, lat, lon) float32 ...
Of these, lat_bnds, lon_bnds, and pr all have x or y dimensions which could conceivably be clipped. Rather than making some arbitrary choice about what to do in this situation, rioxarray is raising an error with the message MissingSpatialDimensionError: y dimension (lat) not found. Data variable: lon_bnds. This indicates that when processing the variable lon_bnds, it's not sure what to do, because it can find an x dimension but not a y dimension.
To address this, you have two options. The first is to call clip on the pr array only. This is probably the right call - generally I'd recommend only doing data processing with Arrays (not Datasets) whenever possible unless you really know you want to map an operation across all variables in the dataset. Calling clip on pr would look like this:
clipped = dset.pr.rio.clip(shp.geometry.apply(mapping), shp.crs)
Alternatively, you could resolve the issue of having data_variables that really should be coordinates. You can use the method set_coordsto reclassify the non-data data_variables as non-dimension coordinates. In this case:
dset = dset.set_coords(['time_bnds', 'lat_bnds', 'lon_bnds'])
I'm not sure if this will completely resolve your issue - it's possible that rioxarray will still raise this error when processing coordinates. You could always drop the bounds, too. But the first method of only calling this on a single variable will work.

Converting 3D xarray dataset to dataframe

I have imported a xarray dataset like this and extracted the values at coordinates defined by zones from a csv file, and a time period defined by a date range (30 days of a (lon,lat) grid with some environmental values for every coordinates).
from xgrads import open_CtlDataset
ds_Snow = open_CtlDataset(path + 'file')
ds_Snow = ds_Snow.sel(lat = list(set(zones['lat'])), lon = list(set(zones['lon'])),
time = period, method = 'nearest')
When i look for the information of ds_Snow, this is what I get :
Dimensions: (lat: 12, lon: 12, time: 30)
Coordinates:
* time (time) datetime64[ns] 2000-09-01 2000-09-02 ... 2000-09-30
* lat (lat) float32 3.414e+06 3.414e+06 3.414e+06 ... 3.414e+06 3.414e+06
* lon (lon) float32 6.873e+05 6.873e+05 6.873e+05 ... 6.873e+05 6.873e+05
Data variables:
spre (time, lat, lon) float32 dask.array<chunksize=(1, 12, 12), meta=np.ndarray>
Attributes:
title: SnowModel
undef: -9999.0 type : <class 'xarray.core.dataset.Dataset'>
I would like to make it a dataframe, respecting the initial dimensions (time, lat, lon).
So I did this :
df_Snow = ds_Snow.to_dataframe()
But here are the dimensions of the dataframe :
print(df_Snow)
lat lon time
3414108.0 687311.625 2000-09-01 0.0
2000-09-02 0.0
2000-09-03 0.0
2000-09-04 0.0
2000-09-05 0.0
... ...
2000-09-26 0.0
2000-09-27 0.0
2000-09-28 0.0
2000-09-29 0.0
2000-09-30 0.0
[4320 rows x 1 columns]
It looks like all the data just got put in a single column.
I have tried giving the dimensions orders as some documentation explained :
df_Snow = ds_Snow.to_dataframe(dim_order = ['time', 'lat', 'lon'])
But it does not change anything, and I can't seem to find an answer in forums or the documentation. I would like to know a way to keep the array configuration in the dataframe.
EDIT : I found a solution
Instead of converting the xarray, I chose to build my dataframe with pd.Series of each attributes like this :
ds_Snow = ds_Snow.sel(lat = list(set(station_list['lat_utm'])),lon = list(set(station_list['lon_utm'])), time = Ind_Run_ERA5_Land, method = 'nearest')
time = pd.Series(ds_Snow.coords["time"].values)
lon = pd.Series(ds_Snow.coords["lon"].values)
lat = pd.Series(ds_Snow.coords["lat"].values)
spre = pd.Series(ds_Snow['spre'].values[:,0,0])
frame = { 'spre': spre, 'time': time, 'lon' : lon, 'lat' : lat}
df_Snow = pd.DataFrame(frame)

This is the expected behaviour. From the docs:
The DataFrame is indexed by the Cartesian product of index coordinates (in the form of a pandas.MultiIndex). Other coordinates are included as columns in the DataFrame.
There is only one variable, spre, in the dataset. The other properties, the 'coordinates' have become the index. Since there were several coordinates (lat, lon, and time), the DataFrame has a hierarchical MultiIndex.
You can either get the index data through tools like get_level_values or, if you want to change how the DataFrame is indexed, you can use reset_index().

Using XArray.isel to access data in GRIB2 file from a specific location?

I'm trying to access the data in a GRIB2 file at a specific longitude and latitude. I have been following along with this tutorial (https://www.youtube.com/watch?v=yLoudFv3hAY) approximately 2:52 but my GRIB file is formatted differently to the example and uses different variables
import xarray as xr
import pygrib
ds=xr.open_dataset('testdata.grb2', engine='cfgrib', filter_by_keys={'typeOfLevel': 'heightAboveGround', 'topLevel':2})
ds
This prints:
<xarray.Dataset>
Dimensions: (latitude: 361, longitude: 720)
Coordinates:
time datetime64[ns] ...
step timedelta64[ns] ...
heightAboveGround float64 ...
* latitude (latitude) float64 90.0 89.5 89.0 ... -89.0 -89.5 -90.0
* longitude (longitude) float64 0.0 0.5 1.0 1.5 ... 358.5 359.0 359.5
valid_time datetime64[ns] ...
Data variables:
t2m (latitude, longitude) float32 ...
sh2 (latitude, longitude) float32 ...
r2 (latitude, longitude) float32 ...
I then try to use imshow to index along the latitude and longitude (t2m?) dimension using:
t0_ds = ds.isel(t2m={200,200})
which gives this error:
ValueError: Dimensions {'t2m'} do not exist. Expected one or more of Frozen({'latitude': 361, 'longitude': 720})
Obviously there is an error in the way I'm using isel but I have tried many variations and I can't find much information about this particular error

You can access the closest datapoint to a specific latitude/longitude using:
lat = #yourlatitude
lon = #yourlongitude
ds_loc = ds.sel(latitude = lat, longitude = lon, method = 'nearest')
isel is used to access point by index, ie:
ds_loc = ds.isel(latitude = 200)
will return a subset along the 200th latitude value.

How can I find the indices equivalent to a specific selection of xarray?

I have an xarray dataset.
<xarray.Dataset>
Dimensions: (lat: 92, lon: 172, time: 183)
Coordinates:
* lat (lat) float32 4.125001 4.375 4.625 ... 26.624994 26.874996
* lon (lon) float32 nan nan nan ... 24.374996 24.624998 24.875
* time (time) datetime64[ns] 2003-09-01 2003-09-02 ... 2004-03-01
Data variables:
swnet (time, lat, lon) float32 dask.array<shape=(183, 92, 172), chunksize=(1, 92, 172)>
Find the nearest lat-long
df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')
Need to find
The indices of this particular location. Basically, the row-column in the grid. What would be the easiest way to go about it?
Tried
nearestlat=df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')['lat'].values
nearestlon=df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')['lon'].values
rowlat=np.where(df['lat'].values==nearestlat)[0][0]
collon=np.where(df['lon'].values==nearestlon)[0][0]
But I am not sure if this is the right way to go about it. How can I do this 'correctly'?

I agree that finding the index associated with a .sel operation is trickier than one would expect!
This code works:
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')
ilon = list(ds.lon.values).index(ds.sel(lon=250.0, method='nearest').lon)
ilat = list(ds.lat.values).index(ds.sel(lat=45.0, method='nearest').lat)
print(' lon index=',ilon,'\n','lat index=', ilat)
producing:
lon index= 20
lat index= 12
And just in case one is wondering why one might want to do this, we use this for investigating time stacks of images, where we are interested in selecting the image immediately preceding the image on a specified date:
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')
ilon = list(ds.time.values).index(ds.sel(time='2013-06-01 00:00:00', method='nearest').time)
print(idx)
which produces
848

I think that something like that should work:
ds = xr.tutorial.open_dataset('air_temperature')
idx = ds.indexes["time"].get_loc('2013-06-01 00:00:00', method="nearest")
print(idx)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.