Add projection to rioxarray dataset in Python


I've downloaded a netcdf from the Climate Data Store and would like to write a CRS to it, so I can clip it for a shapefile. However, I get an error when assigning a CRS.
Below are my script and its output. I receive this error after trying to write a CRS:
MissingSpatialDimensionError: y dimension (lat) not found. Data variable: lon_bnds
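import rioxarray  # registers the .rio accessor on xarray objects
import xarray as xr
from shapely.geometry import mapping
# load netcdf with xarray
dset = xr.open_dataset(netcdf_fn)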
print(dset)
# add projection system to nc
dset = dset.rio.write_crs("EPSG:4326", inplace=True)
# mask CMIP6 data with shapefile
dset_shp = dset.rio.clip(shp.geometry.apply(mapping), shp.crs)
dset
Out[44]:
<xarray.Dataset>
Dimensions: (time: 1825, bnds: 2, lat: 2, lon: 1)
Coordinates:
* time (time) object 2021-01-01 12:00:00 ... 2025-12-31 12:00:00
* lat (lat) float64 0.4712 1.414
* lon (lon) float64 31.25
spatial_ref int32 0
Dimensions without coordinates: bnds
Data variables:
time_bnds (time, bnds) object ...
lat_bnds (lat, bnds) float64 0.0 0.9424 0.9424 1.885
lon_bnds (lon, bnds) float64 ...
pr (time, lat, lon) float32 ...
Attributes: (12/48)
Conventions: CF-1.7 CMIP-6.2
activity_id: ScenarioMIP
branch_method: standard
branch_time_in_child: 60225.0
branch_time_in_parent: 60225.0
comment: none
...
title: CMCC-ESM2 output prepared for CMIP6
variable_id: pr
variant_label: r1i1p1f1
license: CMIP6 model data produced by CMCC is licensed und...
cmor_version: 3.6.0
tracking_id: hdl:21.14100/0c6732f7-2cdd-4296-99a0-7952b7ca911e

When you call the rioxarray accessor ds.rio.clip on an xr.Dataset rather than an xr.DataArray, rioxarray needs to guess which variables in the dataset should be clipped. The method docstring gives the following warning:
Warning:
Clips variables that have dimensions ‘x’/’y’. Others are appended as is.
So the issue you're running into is that rioxarray sees four variables in your dataset:
Data variables:
time_bnds (time, bnds) object ...
lat_bnds (lat, bnds) float64 0.0 0.9424 0.9424 1.885
lon_bnds (lon, bnds) float64 ...
pr (time, lat, lon) float32 ...
Of these, lat_bnds, lon_bnds, and pr all have x or y dimensions which could conceivably be clipped. Rather than making some arbitrary choice about what to do in this situation, rioxarray is raising an error with the message MissingSpatialDimensionError: y dimension (lat) not found. Data variable: lon_bnds. This indicates that when processing the variable lon_bnds, it's not sure what to do, because it can find an x dimension but not a y dimension.
To address this, you have two options. The first is to call clip on the pr array only. This is probably the right call - generally I'd recommend only doing data processing with Arrays (not Datasets) whenever possible unless you really know you want to map an operation across all variables in the dataset. Calling clip on pr would look like this:
clipped = dset.pr.rio.clip(shp.geometry.apply(mapping), shp.crs)
Alternatively, you could resolve the issue of having data_variables that really should be coordinates. You can use the method set_coords to reclassify the non-data data_variables as non-dimension coordinates. In this case:
dset = dset.set_coords(['time_bnds', 'lat_bnds', 'lon_bnds'])
I'm not sure if this will completely resolve your issue - it's possible that rioxarray will still raise this error when processing coordinates. You could always drop the bounds, too. But the first method of only calling this on a single variable will work.
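As a minimal sketch of the set_coords option, the snippet below builds a small synthetic dataset shaped like the one in the question (all values are invented) and shows that the bounds variables move out of data_vars. It only demonstrates the reclassification; whether rioxarray's clip then skips them is not checked here.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the CMIP6 file; all values are invented.
dset = xr.Dataset(
    data_vars={
        "pr": (("time", "lat", "lon"), np.zeros((3, 2, 1), dtype="float32")),
        "time_bnds": (("time", "bnds"), np.zeros((3, 2))),
        "lat_bnds": (("lat", "bnds"), np.array([[0.0, 0.9424], [0.9424, 1.885]])),
        "lon_bnds": (("lon", "bnds"), np.array([[30.625, 31.875]])),
    },
    coords={
        "time": np.arange(3),
        "lat": np.array([0.4712, 1.414]),
        "lon": np.array([31.25]),
    },
)

# Reclassify the bounds as non-dimension coordinates.
dset_fixed = dset.set_coords(["time_bnds", "lat_bnds", "lon_bnds"])

# Only the actual data variable is left in data_vars.
data_var_names = list(dset_fixed.data_vars)
print(data_var_names)
```

The bounds are still available afterwards, for example as dset_fixed["lon_bnds"], so nothing is lost by the reclassification.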

Related

Need help interpolating data into specific points in netcdf files

I need help interpolating data onto nodes in a series of netcdf files which have oceanographic data. The files have 3 dimensions (latitude (12), longitude (20), time (24)) and variables (u, v) . Of all the data points, there are four nodes which do not have current velocity data (u, v) although they should have data, as they are at sea but register as land. I am trying to interpolate data onto these nodes but have no idea how.
EDIT: The four points with missing data are already in the coordinates but have NaN values. The other points would keep the original data.
I am OK with Pandas but I know that this probably requires numpy and/or xarray and I am not well versed in either. I cannot get to nodes using the coordinates to interpolate the data I require. Can this be done at all?
print(data)
<xarray.Dataset>
Dimensions: (latitude: 12, time: 24, longitude: 20)
Coordinates:
* latitude (latitude) float32 40.92 41.0 41.08 41.17 ...
41.67 41.75 41.83
* time (time) datetime64[ns] 2017-03-03T00:30:00 ...
2017-03-03T23:30:00
* longitude (longitude) float32 1.417 1.5 1.583 1.667 ...
2.833 2.917 3.0
Data variables:
vo (time, latitude, longitude) float32 ...
uo (time, latitude, longitude) float32 ...
thetao (time, latitude, longitude) float32 ...
zos (time, latitude, longitude) float32 ...
Attributes: (12/22)
Conventions: CF-1.0
source: CMEMS IBI-MFC...
print(data.latitude)
<xarray.DataArray 'latitude' (latitude: 12)>
array([40.916668, 41. , 41.083332, 41.166668, 41.25 ,
41.333332, 41.416668, 41.5 , 41.583332, 41.666668, 41.75 , 41.833332], dtype=float32)
Coordinates:
* latitude (latitude) float32 40.92 41.0 41.08 41.17 ... 41.67
41.75 41.83
Attributes:
standard_name: latitude
long_name: Latitude
units: degrees_north
axis: Y
unit_long: Degrees North
step: 0.08333f
_CoordinateAxisType: Lat
_ChunkSizes: 361
valid_min: 40.916668
valid_max: 41.833332
print(data.longitude)
<xarray.DataArray 'longitude' (longitude: 20)>
array([1.416666, 1.499999, 1.583333, 1.666666, 1.749999, 1.833333, 1.916666, 1.999999, 2.083333, 2.166666, 2.249999, 2.333333, 2.416666, 2.499999,2.583333, 2.666666, 2.749999, 2.833333, 2.916666, 2.999999],
dtype=float32)
Coordinates:
* longitude (longitude) float32 1.417 1.5 1.583 1.667 ... 2.833 2.917 3.0
Attributes:
standard_name: longitude
long_name: Longitude
units: degrees_east
axis: X
unit_long: Degrees East
step: 0.08333f
_CoordinateAxisType: Lon
_ChunkSizes: 289
valid_min: 1.4166658
valid_max: 2.999999

The goal of the question is to in-fill cells that are flagged as land (missing) in the raw files but should really be sea. In some cases you might want something more sophisticated to do this, for example if there is a sharp coastal gradient.
But the easiest approach is to replace each missing value with its nearest neighbour. That will of course fill more cells than you need, so you will then need to apply some kind of land-sea mask to your data. The workflow below, using my package nctoolkit, should do the job:
import nctoolkit as nc
# read in the file and set missing values to nn
ds = nc.open_data("infile.nc")
ds.fill_na()
# create the mask to apply. This should only have one time step
# I'm going to assume in this case that it is a file with temperature that has the correct land-sea division
ds_mask = nc.open_data("mask.nc")
# Ensure sea values are 1. Land values should be nan
ds_mask.compare(">-1000")
# multiply the dataset by the mask to set land values to missing
ds.multiply(ds_mask)
# plot the results
ds.plot()
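To illustrate the idea behind this workflow (fill with the nearest neighbour, then mask out land), here is a rough NumPy-only sketch. It is not how nctoolkit implements it: fill_nearest is a hypothetical helper, the arrays are invented, and distance is measured in grid-index space, which assumes a roughly regular grid.

```python
import numpy as np

def fill_nearest(field):
    """Replace NaNs in a 2D array with the value of the nearest non-NaN cell."""
    filled = field.copy()
    valid = ~np.isnan(field)
    valid_idx = np.argwhere(valid)  # (n_valid, 2) array of row/col indices
    for row, col in np.argwhere(~valid):
        # squared distance (in grid indices) from this cell to every valid cell
        dist2 = (valid_idx[:, 0] - row) ** 2 + (valid_idx[:, 1] - col) ** 2
        near_row, near_col = valid_idx[np.argmin(dist2)]
        filled[row, col] = field[near_row, near_col]
    return filled

# Hypothetical velocity field for one time step; NaN marks missing cells.
u = np.array([
    [1.0, 2.0, np.nan, np.nan],
    [3.0, np.nan, np.nan, np.nan],
    [4.0, 5.0, 6.0, np.nan],
])
# Hypothetical land-sea mask: 1 over sea, NaN over land.
sea_mask = np.array([
    [1.0, 1.0, 1.0, np.nan],
    [1.0, 1.0, np.nan, np.nan],
    [1.0, 1.0, 1.0, np.nan],
])

# Fill everything, then multiply by the mask so land cells become NaN again.
u_filled = fill_nearest(u) * sea_mask
print(u_filled)
```

For (time, latitude, longitude) data the same function would be applied to each time step in turn.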

How to remove xarray dimension after adding another without deleting the data variables

I have data from ECMWF which when read into xarray looks like this
Dimensions: (time: 424, step: 12, latitude: 3, longitude: 2)
Coordinates:
number int64 0
* time (time) datetime64[ns] 1990-03-01T06:00:00 ... 1993-04-22T18:0...
* step (step) timedelta64[ns] 01:00:00 02:00:00 ... 11:00:00 12:00:00
surface float64 0.0
* latitude (latitude) float64 41.0 40.75 40.5
* longitude (longitude) float64 -96.92 -96.67
valid_time (time, step) datetime64[ns] 1990-03-01T07:00:00 ... 1993-04-2...
Data variables:
i10fg (time, step, latitude, longitude) float32 4.876 4.637 ... 3.959
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2022-04-23T22:18 GRIB to CDM+CF via cfgrib-0.9.9...
I've stacked the time and step dimensions to make a new one called fcst. I then added another dimension called valid and set that equal to the coordinate valid_time. Both valid and fcst are the same length, but I now want to drop the fcst dimension. However, when I do this, the data variable also gets deleted. Does anyone know how to fix this? Here is my sample code. There might be a better way to do what I'm doing, but I'm still pretty new to xarray!
ds = ds.stack(fcst=("time", "step")).transpose("fcst", "latitude", "longitude")
ds.expand_dims(valid=ds['valid_time']).drop_dims('fcst')
which leaves me with
<xarray.Dataset>
Dimensions: (valid: 5088, latitude: 3, longitude: 2)
Coordinates:
* valid (valid) datetime64[ns] 1990-03-01T07:00:00 ... 1993-04-23T06:0...
number int64 0
surface float64 0.0
* latitude (latitude) float64 41.0 40.75 40.5
* longitude (longitude) float64 -96.92 -96.67
Data variables:
*empty*
I've tried setting valid as the index and then dropping fcst, but it still deletes the data variables.
Any help would be appreciated!

All variables in an xarray Dataset must be indexed by named dimensions. You can use ds.reset_index to drop any labeled coordinates associated with a dimension, but this isn't what you want. You can't simply do away with a dimension without losing the variables which are indexed by this dimension.
From the xarray docs on data structures:
dimension names are always present in the xarray data model: if you do not provide them, defaults of the form dim_N will be created.
Instead, you can swap a non-indexing coordinate for an indexing coordinate using swap_dims. This will switch valid_time and fcst in your array, such that valid_time becomes the dimension which indexes fcst and any variables previously indexed by fcst (i10fg in your case).
So the answer is, after the stack but without expand_dims or drop:
ds.swap_dims({'fcst': 'valid_time'}).drop('fcst')
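A small self-contained sketch of swap_dims, assuming a toy dataset that stands in for the stacked one: here fcst is a plain dimension without labels and valid_time is a non-index coordinate along it. The values are invented and the stacking step itself is left out.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the stacked dataset; values are invented.
valid_time = pd.to_datetime([
    "1990-03-01T07:00",
    "1990-03-01T08:00",
    "1990-03-01T09:00",
    "1990-03-01T10:00",
])
ds = xr.Dataset(
    {"i10fg": (("fcst", "latitude", "longitude"), np.arange(24.0).reshape(4, 3, 2))},
    coords={
        "valid_time": ("fcst", valid_time),
        "latitude": [41.0, 40.75, 40.5],
        "longitude": [-96.92, -96.67],
    },
)

# valid_time replaces fcst as the dimension; the data variable is kept.
swapped = ds.swap_dims({"fcst": "valid_time"})
print(swapped["i10fg"].dims)
```

Because valid_time is now an indexing coordinate, label-based selection such as swapped.sel(valid_time="1990-03-01T08:00") becomes possible.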

How to set the coordinates of the output of xarray.assign?

I've been trying to create two new variables based on the latitude coordinate of a data point in an xarray dataset. However, I can only seem to assign new coordinates. The data set looks like this:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
What I've attempted so far is this:
def get_latitude_band(latitude):
    return np.select(
        condlist=
        [abs(latitude) < 23.45,
         abs(latitude) < 35,
         abs(latitude) < 66.55],
        choicelist=
        ["tropical",
         "sub_tropical",
         "temperate"],
        default="frigid"
    )
def get_hemisphere(latitude):
    return np.select(
        [latitude > 0, latitude <=0],
        ["north", "south"]
    )
mhw_data = mhw_data \
    .assign(climate_zone=get_latitude_band(mhw_data.lat)) \
    .assign(hemisphere=get_hemisphere(mhw_data.lat)) \
    .reset_index(["hemisphere", "climate_zone"]) \
    .reset_coords()
print(mhw_data)
Which is getting me close:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412, hemisphere: 180, climate_zone: 180)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Dimensions without coordinates: hemisphere, climate_zone
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
hemisphere_ (hemisphere) object 'south' 'south' ... 'north' 'north'
climate_zone_ (climate_zone) object 'frigid' 'frigid' ... 'frigid' 'frigid'
...
However, I want to then stack the DataSet and convert it to a DataFrame. I am unable to do so, and I think it is because the new variables hemisphere_ and climate_zone_ do not have time, lat, lon coordinates:
stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
results in a KeyError on "lon".
So my question is: How do I assign new variables to the xarray dataset that maintain the original coordinates of time, lat and lon?

To assign a new variable or coordinate, xarray needs to know what the dimensions are called. There are a number of ways to define a DataArray or Coordinate, but the one closest to what you're currently using is to provide a tuple of (dim_names, array):
mhw_data = mhw_data.assign_coords(
climate_zone=(('lat', ), get_latitude_band(mhw_data.lat)),
hemisphere=(('lat', ), get_hemisphere(mhw_data.lat)),
)
Here I'm using da.assign_coords, which will define climate_zone and hemisphere as non-dimension coordinates, which you can think of as additional metadata about latitude and about your data, but which aren't proper data in themselves. This will also allow them to be preserved when sending individual arrays to pandas.
For stacking, converting to pandas will stack automatically. The following will return a DataFrame with variables/non-dimension coordinates as columns and dimensions as a MultiIndex:
stacked = mhw_data.to_dataframe()
Alternatively, if you want a Series indexed by (lat, lon, time) for just one of these coordinates you can always use expand_dims:
(
mhw_data.climate_zone
.expand_dims(lon=mhw_data.lon, time=mhw_data.time)
.to_series()
)
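The following sketch puts the pieces above together on a small synthetic dataset (variable name and values are placeholders). As a simplification it uses np.where for the hemisphere label instead of the question's get_hemisphere, and it passes plain NumPy arrays to the helper.

```python
import numpy as np
import xarray as xr

# Small synthetic dataset; names and values are placeholders.
ds = xr.Dataset(
    {"evapr": (("time", "lat", "lon"), np.zeros((2, 4, 2), dtype="float32"))},
    coords={
        "time": np.array(["2000-01-01", "2000-02-01"], dtype="datetime64[ns]"),
        "lat": np.array([-70.0, -30.0, 10.0, 50.0]),
        "lon": np.array([0.5, 1.5]),
    },
)

def get_latitude_band(latitude):
    return np.select(
        [abs(latitude) < 23.45, abs(latitude) < 35, abs(latitude) < 66.55],
        ["tropical", "sub_tropical", "temperate"],
        default="frigid",
    )

# Attach the labels as non-dimension coordinates along "lat".
ds = ds.assign_coords(
    climate_zone=(("lat",), get_latitude_band(ds.lat.values)),
    hemisphere=(("lat",), np.where(ds.lat.values > 0, "north", "south")),
)

# One row per (time, lat, lon) combination; the labels are broadcast along.
df = ds.to_dataframe()
flat = df.reset_index()
print(flat.head())
```

The labels are stored once per latitude in the Dataset and only repeated when the DataFrame is built.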

The two possible solutions I've worked out for myself are as follows.
First, stack the xarray data into pandas DataFrames, and then create new columns:
df = None
variables = list(mhw_data.data_vars)
for var in tqdm(variables):
    stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
    if df is None:
        df = stacked
    else:
        df = pd.concat([df, stacked], axis=1)
df.reset_index(inplace=True)
df.columns = list(mhw_data.variables)
df["climate_zone"] = df["lat"].swifter.apply(get_latitude_band)
df["hemisphere"] = df["lat"].swifter.apply(get_hemisphere)
Second, create new xarray.DataArrays for each variable you'd like to add, then add them to the dataset:
# calculate climate zone and hemisphere from latitude.
latitudes = mhw_data.lat.values.reshape(-1, 1)
zones = get_latitude_band(latitudes)
hemispheres = get_hemisphere(latitudes)
# Take advantage of numpy broadcasting to get our data to line up with the xarray shape.
# np.broadcast_to works for string arrays as well as numeric ones.
dims = tuple(mhw_data.sizes)
shape = tuple(mhw_data.sizes.values())
zones = np.broadcast_to(zones, shape)
hemispheres = np.broadcast_to(hemispheres, shape)
# finally, create two new DataArrays and assign them as variables in the dataset.
zone_xarray = xr.DataArray(data=zones, coords=mhw_data.coords, dims=dims)
hemi_xarray = xr.DataArray(data=hemispheres, coords=mhw_data.coords, dims=dims)
mhw_data["zone"] = zone_xarray
mhw_data["hemisphere"] = hemi_xarray
# ... call the code to stack and convert to pandas (shown in method 1) ...#
My intuition is that method 1 is faster and more memory efficient because there are no repeated values that need broadcasting into a large 3-dimensional array. I did not test this, however.
Also, my intuition is that there is a less cumbersome xarray native way of accomplishing the same goal, but I could not find it.
One thing is certain, method 1 is far more concise due to the fact that there is no need to create intermediate arrays or reshape data.

Clip NetCDF files by non-associated coordinates

Look at this NetCDF EURO-CORDEX file:
<xarray.Dataset>
Dimensions: (bnds: 2, rlat: 412, rlon: 424, time: 1800)
Coordinates:
lat (rlat, rlon) float64 ...
lon (rlat, rlon) float64 ...
* rlat (rlat) float64 -23.38 -23.26 -23.16 ... 21.61 21.73 21.83
* rlon (rlon) float64 -28.38 -28.26 -28.16 ... 17.93 18.05 18.16
* time (time) object 2011-01-01 12:00:00 ... 2015-12-30 12:00:00
Dimensions without coordinates: bnds
Data variables:
pr (time, rlat, rlon) float32 ...
rotated_pole |S1 ...
time_bnds (time, bnds) object ...
When I attempt to clip it using xarray (in Python script) or nco (in the command line) by lat and lon coordinates, it is presumably not working as lat and lon are not dimensions. I read these are rather coordinates not associated with any variable. How can I associate them or make them dimensions? Do the asterisks by the coordinates rlat and rlon suggest something I am missing?
Note that I can achieve the clipping with xarray.Dataset.where; nevertheless, I am interested in manipulating the dimensions/coordinates so that the usual clipping operations would work. Thanks!

You should be able to clip this using CDO:
cdo sellonlatbox,lonmin,lonmax,latmin,latmax infile outfile
CDO will figure out the smallest possible rectangle in the original dataset that contains the specified box. This seems to be a rotated polar grid, so you will likely get some grid cells outside the box.
NCO will also work, but you will need to crop using rlon and rlat. So, you will need to select an appropriate rlon and rlat that will result in the desired lon and lat.
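For comparison, the where-based approach mentioned in the question can be sketched in xarray as follows. This is not CDO or NCO, just a toy illustration on a synthetic grid with 1D rlat/rlon dimensions and invented 2D lat/lon coordinates; the box limits are arbitrary.

```python
import numpy as np
import xarray as xr

# Synthetic rotated-style grid; all numbers are invented for illustration.
rlat = np.linspace(-2.0, 2.0, 5)
rlon = np.linspace(-3.0, 3.0, 7)
rlon2d, rlat2d = np.meshgrid(rlon, rlat)
ds = xr.Dataset(
    {"pr": (("rlat", "rlon"), np.ones((5, 7), dtype="float32"))},
    coords={
        "rlat": rlat,
        "rlon": rlon,
        "lat": (("rlat", "rlon"), 50.0 + rlat2d + 0.1 * rlon2d),
        "lon": (("rlat", "rlon"), 10.0 + rlon2d - 0.1 * rlat2d),
    },
)

# Boolean mask on the 2D geographic coordinates.
inside = (ds.lat >= 49.5) & (ds.lat <= 50.5) & (ds.lon >= 8.9) & (ds.lon <= 11.1)

# drop=True trims rlat/rlon rows and columns that contain no selected cell.
clipped = ds.where(inside, drop=True)
print(clipped.sizes)
```

Cells inside the trimmed rectangle but outside the box are set to NaN rather than removed, since the result still has to be rectangular in rlat and rlon.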

I need to calculate a global average in a netCDF CF file using weighted latitudes using xarray then convert to pandas

I need to calculate a global mean time series from a 3D (time, lat, lon) netCDF CF file and then convert it to a pandas DataFrame. I need to weight each latitude by cos(lat). I have been using numpy to do the averages, but converting the weighted array to a pandas DataFrame doesn't work.
ds=xr.open_dataset('sample_data.nc')
data=ds.tas
start_time='1980-01-01'
end_time='2018-12-31'
time_slice = slice(start_time, end_time)
nrows=len(data.lat.values)
ncols=len(data.lon.values)
t=len(data.time.values)
weights=np.zeros([len(data.lat.values)])
latsr = np.deg2rad(data.lat.values).reshape((nrows,1))
weight_matrix=np.repeat(np.cos(latsr),ncols,axis=1)
wghtpr=np.zeros_like(data)
for i in range(0, t):
    wghtpr[i,:,:]=data[i,:,:]*weight_matrix
new_data=wghtpr
wtdata=np.average(new_data,axis=1)
da=np.average(wtdata,axis=1)
This ends up with a NumPy array with no "name".
If I print ds, I get:
<xarray.Dataset>
Dimensions: (bnds: 2, lat: 361, lon: 576, time: 477)
Coordinates:
* time (time) datetime64[ns] 1980-01-16T12:00:00 ... 2019-09-16
* lat (lat) float64 -90.0 -89.5 -89.0 -88.5 ... 88.5 89.0 89.5 90.0
* lon (lon) float64 0.0 0.625 1.25 1.875 ... 357.5 358.1 358.8 359.4
height float64 ...
Dimensions without coordinates: bnds
Data variables:
time_bnds (time, bnds) datetime64[ns] ...
lat_bnds (lat, bnds) float64 ...
lon_bnds (lon, bnds) float64 ...
tas (time, lat, lon) float32 244.15399 244.15399 ... 267.52875
Attributes:
institution: Global Modeling and Assimilation Office, NASA Goddard Sp...
institute_id: NASA-GMAO
experiment_id: MERRA-2
source: MERRA-2 Monthly tavgM_2d_slv_Nx
model_id: GEOS-5
references: http://gmao.gsfc.nasa.gov/research/merra/, http://gmao.g...
tracking_id: e77fd4de-19c2-45ad-afe2-ce3f6c1eb148
mip_specs: CMIP5
source_id: MERRA-2
product: reanalysis
frequency: mon
creation_date: 2015-10-11T23:12:34Z
history: 2015-10-11T23:12:34Z CMOR rewrote data to comply with CF...
Conventions: CF-1.4
project_id: CREATE-IP
table_id: Table Amon_ana (10 March 2011) c3ffdce87438d8df0839620ee...
title: Reanalysis output prepared for CREATE-IP.
modeling_realm: atmos
cmor_version: 2.9.1
doi: http://dx.doi.org/10.5067/AP1B0BA5PD2K
contact: MERRA-2, Steven Pawson (steven.pawson-1#nasa.gov)

To make use of xarray's broadcasting and alignment, you may do the weighting like this:
import numpy as np
import xarray as xr
ds = xr.open_dataset('sample_data.nc')
data = ds.tas
# weights proportional to grid-cell area: cos(latitude in radians)
weights = np.cos(np.deg2rad(data.lat))
weights.name = 'weights'
# weighted() broadcasts the weights over lon and normalises by their sum,
# so this is a true weighted mean rather than the mean of data * weights
weighted_mean = data.weighted(weights).mean(['lat', 'lon'])
# to pandas
df = weighted_mean.to_dataframe(name='tas')
Hope this helps.
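As a sanity check on the weighting, the sketch below runs a cos(lat)-weighted mean on a small synthetic field (random values standing in for ds.tas) and compares it against the explicit formula sum(w * x) / sum(w). The grid sizes and dates are arbitrary assumptions.

```python
import numpy as np
import xarray as xr

# Synthetic (time, lat, lon) field standing in for ds.tas; values are random.
rng = np.random.default_rng(0)
tas = xr.DataArray(
    250.0 + 30.0 * rng.random((3, 7, 6)),
    coords={
        "time": np.array(["2000-01-16", "2000-02-15", "2000-03-16"], dtype="datetime64[ns]"),
        "lat": np.linspace(-90.0, 90.0, 7),
        "lon": np.linspace(0.0, 300.0, 6),
    },
    dims=("time", "lat", "lon"),
    name="tas",
)

# cos(latitude) weights, one per latitude row.
weights = np.cos(np.deg2rad(tas.lat))
weights.name = "weights"

global_mean = tas.weighted(weights).mean(("lat", "lon"))

# Manual check: sum(w * x) / sum(w), with w broadcast over lon.
w2d = np.broadcast_to(weights.values[:, None], (7, 6))
manual = (tas.values * w2d).sum(axis=(1, 2)) / w2d.sum()

# A named DataFrame indexed by time.
df = global_mean.to_dataframe(name="tas")
print(df)
```

Note that a plain mean of data * weights divides by the number of cells instead of the sum of the weights, so it would not agree with the manual formula.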

An alternative is to use CDO. To get the global mean you should only need to do the following:
cdo fldmean 'sample_data.nc' out.nc
If on Linux, you can also use my Python package nctoolkit which uses CDO as a backend (https://nctoolkit.readthedocs.io/en/latest/installing.html). Calculating the global mean and then converting it to pandas would require the following:
import nctoolkit as nc
data = nc.open_data("sample_data.nc")
data.spatial_mean()
pd_ts = data.to_dataframe()
Plotting the time series would require:
data.plot()
