How to use HDF5 dimension scales in h5py - python

HDF5 has the concept of dimension scales, as explained on the HDF5 and h5py websites. However, the explanations both use terse or generic examples and so I don't really understand how to use dimension scales. Namely, given a dataset f['coordinates'] in some HDF5 file f = h5py.File('data.h5'):
>>> f['coordinates'].value
array([[ 52.60636111,  4.38963889],
       [ 52.57877778,  4.43422222],
       [ 52.58319444,  4.42811111],
       ...,
       [ 52.62269444,  4.43130556],
       [ 52.62711111,  4.42519444],
       [ 52.63152778,  4.41905556]])
I'd like to make it clear that the first column is the latitude and the second is the longitude. Are dimension scales used for this? Or are they used to indicate that the unit is degrees? Or both?
Perhaps another concrete example can illustrate the use of dimension scales better? If you have one, please share it, even if you are not using h5py.

Specifically for this question, the best answer is probably to use attributes:
f['coordinates'].attrs['columns'] = ['latitude', 'longitude']
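For completeness, here is a minimal sketch of that attribute round trip (the file and dataset names follow the question; the values are made up):
import numpy as np
import h5py

# hypothetical re-creation of the question's file, just to show the attribute round trip
with h5py.File("data.h5", "w") as f:
    coords = f.create_dataset("coordinates", data=np.array([[52.606, 4.390],
                                                            [52.579, 4.434]]))
    coords.attrs["columns"] = ["latitude", "longitude"]

with h5py.File("data.h5", "r") as f:
    print(f["coordinates"].attrs["columns"])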
But dimension scales are useful for other things. I'll show what they're for, how you could use them in a way similar to attributes, and how you might actually use your f['coordinates'] dataset as a scale for some other dataset.
Dimension scales
I agree that those documentation pages are not as clear as they could be, because they launch into complicated possibilities and get mired in technical details before they actually explain the basic concepts. I think some simple examples should make things clear.
First, suppose you've kept track of the temperature outside over the course of a day — maybe measuring it every hour on the hour, for a total of 24 measurements. You might think of this as two columns of data: one for the hour, and one for the temperature. You could store this as a single dataset of shape 24x2. But time and temperature have different units, and are really different datatypes. So it might make more sense to store time and temperature as separate datasets — probably named "time" and "temperature", each of shape 24. But you'd also need to be a little more clear about what these are and how they're related to each other. That relationship is what "dimension scales" are really for.
If you imagine plotting the temperature as a function of time, you might label the horizontal axis as "Time (hour of day)", and the scale for the horizontal axis would be the hours themselves, telling you the horizontal position at which to plot each temperature. You could store this information through h5py like this:
with h5py.File("temperatures.h5", "w") as f:
time = f.create_dataset("time", data=...)
time.make_scale("hour of day")
temp = f.create_dataset("temperature", data=...)
temp.dims[0].label = "Time"
temp.dims[0].attach_scale(time)
Note that the argument to make_scale is specific information about that particular time dataset — in this case, the units we used to measure time — whereas the label is the more general concept of that dimension. Also note that it's actually more standard to attach unit information as attributes, but I like this approach more for specifying the unit of a scale because of its simplicity.
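For example, here is a minimal sketch of that more conventional, attribute-based way of recording the unit (the data values here are made up):
import h5py

with h5py.File("temperatures.h5", "w") as f:
    time = f.create_dataset("time", data=list(range(24)))
    time.make_scale("time")              # generic scale name
    time.attrs["units"] = "hour of day"  # the unit recorded as an attribute instead

    temp = f.create_dataset("temperature", data=[20.0] * 24)
    temp.dims[0].label = "Time"
    temp.dims[0].attach_scale(time)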
Now suppose we measured the temperatures in three different places — say, Los Angeles, Chicago, and New York. Our array of temperatures would then have shape 24x3. We would still need the time scale for dim[0], but now we also have dim[1] to deal with.
with h5py.File("temperatures.h5", "w") as f:
time = f.create_dataset("time", data=...)
time.make_scale("hour of day")
cities = f.create_dataset("cities",
data=[s.encode() for s in ["Los Angeles", "Chicago", "New York"]]
)
cities.make_scale("city")
temp = f.create_dataset("temperature", data=...)
temp.dims[0].label = "Time"
temp.dims[0].attach_scale(time)
temp.dims[1].label = "Location"
temp.dims[1].attach_scale(cities)
It might be more useful to store the latitude and longitude, instead of city names. You can actually attach both types of scale to the same dimension. Just add code like this at the bottom of that last code block:
latlong = f.create_dataset("latlong",
data=[[34.0522, 118.2437], [41.8781, 87.6298], [40.7128, 74.0060]]
)
latlong.make_scale("latitude and longitude (degrees)")
temp.dims[1].attach_scale(latlong)
Finally, you can access these labels and scales like this:
with h5py.File("temperatures.h5", "r") as f:
print('Datasets:', list(f))
print('Temperature dimension labels:', [dim.label for dim in f['temperature'].dims])
print('Temperature dim[1] scales:', f['temperature'].dims[1].keys())
latlong = f['temperature'].dims[1]['latitude and longitude (degrees)'][:]
print(latlong)
The output looks like this:
Datasets: ['cities', 'latlong', 'temperature', 'time']
Temperature dimension labels: ['Time', 'Location']
Temperature dim[1] scales: ['city', 'latitude and longitude (degrees)']
[[ 34.0522 118.2437]
 [ 41.8781  87.6298]
 [ 40.7128  74.006 ]]
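And to close the loop on the original question: you could treat the f['coordinates'] dataset itself as a scale for some other per-location dataset, in the same way latlong was used above. A sketch, with made-up measurement values:
import numpy as np
import h5py

with h5py.File("data.h5", "w") as f:
    coords = f.create_dataset("coordinates", data=np.array([
        [52.60636111, 4.38963889],
        [52.57877778, 4.43422222],
        [52.58319444, 4.42811111],
    ]))
    coords.make_scale("latitude and longitude (degrees)")

    # hypothetical measurements, one value per coordinate pair
    measurements = f.create_dataset("measurements", data=np.random.rand(3))
    measurements.dims[0].label = "Location"
    measurements.dims[0].attach_scale(coords)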

@mike's answer is very helpful for understanding.
Since you gave a geospatial example with latitude and longitude, I might personally create your dataset like this, where the "lat" and "lon" are each separate 1D datasets:
import numpy as np
import h5py

# assuming you already have stored the coordinates in that 2-column array...
coords = np.array(
    [[52.60636111, 4.38963889], [52.57877778, 4.43422222], [52.58319444, 4.42811111]]
)
heights = np.random.rand(3, 3)

with h5py.File("dem.h5", "w") as hf:
    lat = hf.create_dataset("lat", data=coords[:, 0])
    lat.make_scale("latitude")
    lat.attrs["units"] = "degrees north"

    lon = hf.create_dataset("lon", data=coords[:, 1])
    lon.make_scale("longitude")
    lon.attrs["units"] = "degrees east"

    h = hf.create_dataset("height", data=heights)
    h.attrs['units'] = "meters"
    h.dims[0].attach_scale(lat)
    h.dims[1].attach_scale(lon)
The reason: this layout lets you open and use the file with xarray much more easily:
In [1]: import xarray as xr
In [2]: ds1 = xr.open_dataset("dem.h5")
In [3]: ds1
Out[3]:
<xarray.Dataset>
Dimensions: (lat: 3, lon: 3)
Coordinates:
* lat (lat) float64 52.61 52.58 52.58
* lon (lon) float64 4.39 4.434 4.428
Data variables:
height (lat, lon) float64 ...
In [4]: ds1['lat']
Out[4]:
<xarray.DataArray 'lat' (lat: 3)>
array([52.606361, 52.578778, 52.583194])
Coordinates:
* lat (lat) float64 52.61 52.58 52.58
Attributes:
units: degrees north
In [5]: ds1['lon']
Out[5]:
<xarray.DataArray 'lon' (lon: 3)>
array([4.389639, 4.434222, 4.428111])
Coordinates:
* lon (lon) float64 4.39 4.434 4.428
Attributes:
units: degrees east
In [6]: ds1['height']
Out[6]:
<xarray.DataArray 'height' (lat: 3, lon: 3)>
array([[0.832685, 0.24167 , 0.831189],
[0.294826, 0.779141, 0.280573],
[0.980254, 0.593158, 0.634342]])
Coordinates:
* lat (lat) float64 52.61 52.58 52.58
* lon (lon) float64 4.39 4.434 4.428
Attributes:
units: meters
The little bit of extra metadata you add (including the "units" attributes) pays off when you want to do calculations or plot the data:
In [9]: ds1.height.mean(dim="lat")
Out[9]:
<xarray.DataArray 'height' (lon: 3)>
array([0.70258813, 0.5379896 , 0.58203484])
Coordinates:
* lon (lon) float64 4.39 4.434 4.428
In [10]: ds1.height.plot.imshow()

Related

Add projection to rioxarray dataset in Python

I've downloaded a netCDF file from the Climate Data Store and would like to write a CRS to it, so I can clip it with a shapefile. However, I get an error.
Below are my script and what is being printed. I receive this error after writing the CRS:
MissingSpatialDimensionError: y dimension (lat) not found. Data variable: lon_bnds
# load netcdf with xarray
dset = xr.open_dataset(netcdf_fn)
print(dset)
# add projection system to nc
dset = dset.rio.write_crs("EPSG:4326", inplace=True)
# mask CMIP6 data with shapefile
dset_shp = dset.rio.clip(shp.geometry.apply(mapping), shp.crs)
dset
Out[44]:
<xarray.Dataset>
Dimensions: (time: 1825, bnds: 2, lat: 2, lon: 1)
Coordinates:
* time (time) object 2021-01-01 12:00:00 ... 2025-12-31 12:00:00
* lat (lat) float64 0.4712 1.414
* lon (lon) float64 31.25
spatial_ref int32 0
Dimensions without coordinates: bnds
Data variables:
time_bnds (time, bnds) object ...
lat_bnds (lat, bnds) float64 0.0 0.9424 0.9424 1.885
lon_bnds (lon, bnds) float64 ...
pr (time, lat, lon) float32 ...
Attributes: (12/48)
Conventions: CF-1.7 CMIP-6.2
activity_id: ScenarioMIP
branch_method: standard
branch_time_in_child: 60225.0
branch_time_in_parent: 60225.0
comment: none
...
title: CMCC-ESM2 output prepared for CMIP6
variable_id: pr
variant_label: r1i1p1f1
license: CMIP6 model data produced by CMCC is licensed und...
cmor_version: 3.6.0
tracking_id: hdl:21.14100/0c6732f7-2cdd-4296-99a0-7952b7ca911e
When you call the rioxarray accessor ds.rio.clip on an xr.Dataset rather than an xr.DataArray, rioxarray needs to guess which variables in the dataset should be clipped. The method docstring gives the following warning:
Warning:
Clips variables that have dimensions ‘x’/’y’. Others are appended as is.
So the issue you're running into is that rioxarray sees four variables in your dataset:
Data variables:
time_bnds (time, bnds) object ...
lat_bnds (lat, bnds) float64 0.0 0.9424 0.9424 1.885
lon_bnds (lon, bnds) float64 ...
pr (time, lat, lon) float32 ...
Of these, lat_bnds, lon_bnds, and pr all have x or y dimensions which could conceivably be clipped. Rather than making some arbitrary choice about what to do in this situation, rioxarray is raising an error with the message MissingSpatialDimensionError: y dimension (lat) not found. Data variable: lon_bnds. This indicates that when processing the variable lon_bnds, it's not sure what to do, because it can find an x dimension but not a y dimension.
To address this, you have two options. The first is to call clip on the pr array only. This is probably the right call - generally I'd recommend doing data processing with DataArrays (not Datasets) unless you really know you want to map an operation across all variables in the dataset. Calling clip on pr would look like this:
clipped = dset.pr.rio.clip(shp.geometry.apply(mapping), shp.crs)
Alternatively, you could resolve the issue of having data variables that really should be coordinates. You can use the method set_coords to reclassify these data variables as non-dimension coordinates. In this case:
dset = dset.set_coords(['time_bnds', 'lat_bnds', 'lon_bnds'])
I'm not sure whether this will completely resolve your issue - it's possible that rioxarray will still raise this error when processing coordinates. You could always drop the bounds variables, too (see the sketch below). But the first method, calling clip on a single variable, will work.
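For reference, dropping the bounds variables before clipping might look like this (a sketch continuing from the question's snippet, where dset, shp and mapping already exist):
# drop the *_bnds variables so only pr carries spatial dimensions
dset_no_bnds = dset.drop_vars(["time_bnds", "lat_bnds", "lon_bnds"])
dset_shp = dset_no_bnds.rio.clip(shp.geometry.apply(mapping), shp.crs)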

How to set the coordinates of the output of xarray.assign?

I've been trying to create two new variables based on the latitude coordinate of a data point in an xarray dataset. However, I can only seem to assign new coordinates. The data set looks like this:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
What I've attempted so far is this:
def get_latitude_band(latitude):
    return np.select(
        condlist=[
            abs(latitude) < 23.45,
            abs(latitude) < 35,
            abs(latitude) < 66.55,
        ],
        choicelist=[
            "tropical",
            "sub_tropical",
            "temperate",
        ],
        default="frigid",
    )

def get_hemisphere(latitude):
    return np.select(
        [latitude > 0, latitude <= 0],
        ["north", "south"]
    )
mhw_data = mhw_data \
    .assign(climate_zone=get_latitude_band(mhw_data.lat)) \
    .assign(hemisphere=get_hemisphere(mhw_data.lat)) \
    .reset_index(["hemisphere", "climate_zone"]) \
    .reset_coords()
print(mhw_data)
Which is getting me close:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412, hemisphere: 180, climate_zone: 180)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Dimensions without coordinates: hemisphere, climate_zone
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
hemisphere_ (hemisphere) object 'south' 'south' ... 'north' 'north'
climate_zone_ (climate_zone) object 'frigid' 'frigid' ... 'frigid' 'frigid'
...
However, I then want to stack the Dataset and convert it to a DataFrame. I am unable to do so, and I think it is because the new variables hemisphere_ and climate_zone_ do not have time, lat, lon coordinates:
stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
results in a KeyError on "lon".
So my question is: How do I assign new variables to the xarray dataset that maintain the original coordinates of time, lat and lon?
To assign a new variable or coordinate, xarray needs to know what the dimensions are called. There are a number of ways to define a DataArray or Coordinate, but the one closest to what you're currently using is to provide a tuple of (dim_names, array):
mhw_data = mhw_data.assign_coords(
    climate_zone=(('lat',), get_latitude_band(mhw_data.lat)),
    hemisphere=(('lat',), get_hemisphere(mhw_data.lat)),
)
Here I'm using assign_coords, which will define climate_zone and hemisphere as non-dimension coordinates. You can think of these as additional metadata about latitude and about your data, but they aren't proper data in themselves. This will also allow them to be preserved when sending individual arrays to pandas.
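A nice side effect (a small sketch reusing the names above): non-dimension coordinates can drive grouped reductions directly, for example averaging by climate zone or hemisphere:
# group by the new non-dimension coordinates and reduce over everything else
zonal_means = mhw_data.groupby("climate_zone").mean()
hemispheric_means = mhw_data.evapr.groupby("hemisphere").mean()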
For stacking, converting to pandas will stack automatically. The following will return a DataFrame with variables/non-dimension coordinates as columns and dimensions as a MultiIndex:
stacked = mhw_data.to_dataframe()
Alternatively, if you want a Series indexed by (lat, lon, time) for just one of these coordinates you can always use expand_dims:
(
mhw_data.climate_zone
.expand_dims(lon=mhw_data.lon, time=mhw_data.time)
.to_series()
)
The two possible solutions I've worked out for myself are as follows.
First, stack the xarray data into pandas DataFrames, and then create new columns:
df = None
variables = list(mhw_data.data_vars)
for var in tqdm(variables):
    stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
    if df is None:
        df = stacked
    else:
        df = pd.concat([df, stacked], axis=1)
df.reset_index(inplace=True)
df.columns = list(mhw_data.variables)
df["climate_zone"] = df["lat"].swifter.apply(get_latitude_band)
df["hemisphere"] = df["lat"].swifter.apply(get_hemisphere)
Second, create new xarray.DataArrays for each variable you'd like to add, then add them to the dataset:
# calculate climate zone and hemisphere from latitude.
latitudes = mhw_data.lat.values.reshape(-1, 1)
zones = get_latitude_band(latitudes)
hemispheres = get_hemisphere(latitudes)

# Take advantage of numpy broadcasting to make our data line up with the full xarray shape.
shape = tuple(mhw_data.sizes.values())
zones = np.broadcast_to(zones, shape)
hemispheres = np.broadcast_to(hemispheres, shape)

# finally, create two new DataArrays and assign them as variables in the dataset.
zone_xarray = xr.DataArray(data=zones, coords=mhw_data.coords, dims=mhw_data.dims)
hemi_xarray = xr.DataArray(data=hemispheres, coords=mhw_data.coords, dims=mhw_data.dims)
mhw_data["zone"] = zone_xarray
mhw_data["hemisphere"] = hemi_xarray

# ... then call the code to stack and convert to pandas (shown in method 1) ...
My intuition is that method 1 is faster and more memory efficient because there are no repeated values that need broadcasting into a large 3-dimensional array. I did not test this, however.
Also, my intuition is that there is a less cumbersome xarray native way of accomplishing the same goal, but I could not find it.
One thing is certain: method 1 is far more concise, because there is no need to create intermediate arrays or reshape data.

Why latitudes and longitudes are two dimensional arrays in netcdf file?

I have a netCDF file which contains temperature data over some location. The data shape is 1450x900.
I am creating search functionality in my app, to locate temperature data by lat, lon values.
So I extracted the lat and lon coordinate data from the netCDF file, but I was expecting that they would be 1D arrays; instead I got 2D arrays of shape 1450x900 for both coordinates.
So my question: why are they 2D arrays, instead of 1450 latitude values and 900 longitude values? Don't 1450 lat values and 900 lon values describe the whole grid?
Let's say we have a 4x5 grid; the indices locating the rightmost, bottom-most point will be [4, 5]. So my indices for x will be [1, 2, 3, 4] and for y: [1, 2, 3, 4, 5]. Nine indices in total are enough to locate any point on that grid (consisting of 20 cells). So why do the lat (x) and lon (y) coordinates in the netCDF file contain 20 values each (40 in total), instead of 4 and 5 values respectively (9 in total)? I hope you see what confuses me.
Is it possible to somehow map those 2D arrays and "downgrade" them to 1450 latitude values and 900 longitude values? Or is it OK as it is right now? How can I use those values for my purpose? Do I need to zip the lat and lon arrays?
here are the shapes:
>>> DS = xarray.open_dataset('file.nc')
>>> DS.tasmin.shape
(31, 1450, 900)
>>> DS.projection_x_coordinate.shape
(900,)
>>> DS.projection_y_coordinate.shape
(1450,)
>>> DS.latitude.shape
(1450, 900)
>>> DS.longitude.shape
(1450, 900)
Note that projection_x_coordinate and projection_y_coordinate are easting/northing values, not latitudes/longitudes.
Here is the metadata of the file, if needed:
Dimensions: (bnds: 2, projection_x_coordinate: 900, projection_y_coordinate: 1450, time: 31)
Coordinates:
* time (time) datetime64[ns] 2018-12-01T12:00:00 ....
* projection_y_coordinate (projection_y_coordinate) float64 -1.995e+0...
* projection_x_coordinate (projection_x_coordinate) float64 -1.995e+0...
latitude (projection_y_coordinate, projection_x_coordinate) float64 ...
longitude (projection_y_coordinate, projection_x_coordinate) float64 ...
Dimensions without coordinates: bnds
Data variables:
tasmin (time, projection_y_coordinate, projection_x_coordinate) float64 ...
transverse_mercator int32 ...
time_bnds (time, bnds) datetime64[ns] ...
projection_y_coordinate_bnds (projection_y_coordinate, bnds) float64 ...
projection_x_coordinate_bnds (projection_x_coordinate, bnds) float64 ...
Attributes:
comment: Daily resolution gridded climate observations
creation_date: 2019-08-21T21:26:02
frequency: day
institution: Met Office
references: doi: 10.1002/joc.1161
short_name: daily_mintemp
source: HadUK-Grid_v1.0.1.0
title: Gridded surface climate observations data for the UK
version: v20190808
Conventions: CF-1.5
Your data adheres to version 1.5 of the Climate and Forecast conventions.
The document describing this version of the conventions is here, although the relevant section is essentially unchanged across many versions of the conventions.
See section 5.2:
5.2. Two-Dimensional Latitude, Longitude, Coordinate Variables
The latitude and longitude coordinates of a horizontal grid that was
not defined as a Cartesian product of latitude and longitude axes, can
sometimes be represented using two-dimensional coordinate variables.
These variables are identified as coordinates by use of the coordinates attribute.
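A quick way to convince yourself that a single 1-D latitude array cannot describe such a grid is to check that latitude actually varies along both dimensions (a sketch using the question's file name):
import xarray as xr

DS = xr.open_dataset('file.nc')
# On a plain latitude/longitude grid every column of `latitude` would be identical;
# on this projected grid the columns differ.
print(DS.latitude.isel(projection_x_coordinate=0).values[:3])
print(DS.latitude.isel(projection_x_coordinate=-1).values[:3])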
It looks like you are using the HadOBS 1km resolution gridded daily minimum temperature, and this file in particular:
http://dap.ceda.ac.uk/thredds/fileServer/badc/ukmo-hadobs/data/insitu/MOHC/HadOBS/HadUK-Grid/v1.0.1.0/1km/tasmin/day/v20190808/tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc (warning: >300MB download)
As it states, the data is on a transverse mercator grid.
If you look at output from ncdump -h <filename> you will also see the following description of the grid expressed by means of attributes of the transverse_mercator dummy variable:
int transverse_mercator ;
transverse_mercator:grid_mapping_name = "transverse_mercator" ;
transverse_mercator:longitude_of_prime_meridian = 0. ;
transverse_mercator:semi_major_axis = 6377563.396 ;
transverse_mercator:semi_minor_axis = 6356256.909 ;
transverse_mercator:longitude_of_central_meridian = -2. ;
transverse_mercator:latitude_of_projection_origin = 49. ;
transverse_mercator:false_easting = 400000. ;
transverse_mercator:false_northing = -100000. ;
transverse_mercator:scale_factor_at_central_meridian = 0.9996012717 ;
and you will also see that the coordinate variables projection_x_coordinate and projection_y_coordinate have units of metres.
The grid in question is the Ordnance Survey UK grid using numeric grid references.
See for example this description of the OS grid (from Wikipedia).
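In fact, the 2-D latitude/longitude arrays are just the 1-D easting/northing coordinates transformed into geographic coordinates. Here is a sketch with pyproj, assuming the grid is the British National Grid (EPSG:27700), which matches the transverse_mercator attributes above:
import numpy as np
import xarray as xr
from pyproj import Transformer

ds = xr.open_dataset("tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc")

# build the full 2-D grid of eastings/northings from the 1-D projection coordinates
x2d, y2d = np.meshgrid(ds.projection_x_coordinate.values,
                       ds.projection_y_coordinate.values)

# transform projected coordinates to longitude/latitude
transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)
lon2d, lat2d = transformer.transform(x2d, y2d)

# these should agree closely with the latitude/longitude variables in the file
print(np.abs(lat2d - ds.latitude.values).max())
print(np.abs(lon2d - ds.longitude.values).max())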
If you wish to express the data on a regular longitude-latitude grid then you will need to do some type of interpolation. I see that you are using xarray. You can combine this with pyresample to do the interpolation. Here is an example:
import xarray as xr
import numpy as np
from pyresample.geometry import SwathDefinition
from pyresample.kd_tree import resample_nearest, resample_gauss
ds = xr.open_dataset("tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc")
# Define a target grid. For sake of example, here is one with just
# 3 longitudes and 4 latitudes.
lons = np.array([-2.1, -2., -1.9])
lats = np.array([51.7, 51.8, 51.9, 52.0])
# The target grid is regular (1-d lon, lat coordinates) but we will need
# a 2d version (similar to the input grid), so use numpy.meshgrid to produce this.
lon2d, lat2d = np.meshgrid(lons, lats)
origin_grid = SwathDefinition(lons=ds.longitude, lats=ds.latitude)
target_grid = SwathDefinition(lons=lon2d, lats=lat2d)
# get a numpy array for the first timestep
data = ds.tasmin[0].to_masked_array()
# nearest neighbour interpolation example
# Note that radius_of_influence has units metres
interpolated = resample_nearest(origin_grid, data, target_grid, radius_of_influence=1000)
# GIVES:
# array([[5.12490065, 5.02715332, 5.36414835],
# [5.08337723, 4.96372838, 5.00862833],
# [6.47538931, 5.53855722, 5.11511239],
# [6.46571817, 6.17949381, 5.87357538]])
# gaussian weighted interpolation example
# Note that radius_of_influence and sigmas both have units metres
interpolated = resample_gauss(origin_grid, data, target_grid, radius_of_influence=1000, sigmas=1000)
# GIVES:
# array([[5.20432465, 5.07436805, 5.39693221],
# [5.09069187, 4.8565934 , 5.08191639],
# [6.4505963 , 5.44018209, 5.13774416],
# [6.47345359, 6.2386732 , 5.62121948]])
I figured out an answer myself.
As it turns out, the 2D lat/lon arrays are used to define the "grid" of the location.
In other words, if we zip the lat/lon values and project them onto a map, we get a "curved grid" (one that accounts for the Earth's curvature) over the location, which is then used as the grid reference for the data.
Hope it's clear for anyone interested.

How can I find the indices equivalent to a specific selection of xarray?

I have an xarray dataset.
<xarray.Dataset>
Dimensions: (lat: 92, lon: 172, time: 183)
Coordinates:
* lat (lat) float32 4.125001 4.375 4.625 ... 26.624994 26.874996
* lon (lon) float32 nan nan nan ... 24.374996 24.624998 24.875
* time (time) datetime64[ns] 2003-09-01 2003-09-02 ... 2004-03-01
Data variables:
swnet (time, lat, lon) float32 dask.array<shape=(183, 92, 172), chunksize=(1, 92, 172)>
Find the nearest lat-long
df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')
Need to find
The indices of this particular location. Basically, the row-column in the grid. What would be the easiest way to go about it?
Tried
nearestlat=df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')['lat'].values
nearestlon=df.sel(time='2003-09-01', lon=6.374997, lat=16.375006, method='nearest')['lon'].values
rowlat=np.where(df['lat'].values==nearestlat)[0][0]
collon=np.where(df['lon'].values==nearestlon)[0][0]
But I am not sure if this is the right way to go about it. How can I do this 'correctly'?
I agree that finding the index associated with a .sel operation is trickier than one would expect!
This code works:
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')
ilon = list(ds.lon.values).index(ds.sel(lon=250.0, method='nearest').lon)
ilat = list(ds.lat.values).index(ds.sel(lat=45.0, method='nearest').lat)
print(' lon index=',ilon,'\n','lat index=', ilat)
producing:
lon index= 20
lat index= 12
And just in case one is wondering why one might want to do this, we use this for investigating time stacks of images, where we are interested in selecting the image immediately preceding the image on a specified date:
import xarray as xr
ds = xr.tutorial.open_dataset('air_temperature')
itime = list(ds.time.values).index(ds.sel(time='2013-06-01 00:00:00', method='nearest').time)
print(itime)
which produces
848
I think something like this should work:
ds = xr.tutorial.open_dataset('air_temperature')
idx = ds.indexes["time"].get_loc('2013-06-01 00:00:00', method="nearest")
print(idx)
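Note that the method argument of pandas' Index.get_loc has been removed in recent pandas versions, so the snippet above may not work everywhere. A plain NumPy nearest-index lookup avoids that dependency (a sketch):
import numpy as np
import xarray as xr

ds = xr.tutorial.open_dataset('air_temperature')
ilat = int(np.abs(ds.lat.values - 45.0).argmin())
ilon = int(np.abs(ds.lon.values - 250.0).argmin())
print(ilat, ilon)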

Method for Time Slicing in Netcdf similar to xarray

I have a NetCDF data file containing sea ice concentration
from netCDF4 import Dataset
ds = Dataset('file.nic', 'r')
ds.variables.keys()
>>odict_keys(['latitude', 'longitude', 'seaice_conc', 'seaice_source', 'time'])
ds.dimensions.keys()
>>odict_keys(['latitude', 'longitude', 'time'])
Question: In this dataset, time is stored as days since 2001-01-01 00:00:00. Let's say I want seaice_conc for a particular time, e.g. 1990-12-01. How do I approach this without using xarray or writing another function to calculate the days difference?
Is it possible to do it like in xarray, e.g.:
import xarray as xr
ds1 = xr.open_dataset('file.nc')
seaice_data = ds1['seaice_conc'].sel(time = '1990-12-01')
To give further info on the dataset, it looks like this:
ds1.seaice_conc
<xarray.DataArray 'seaice_conc' (time: 1968, latitude: 240, longitude:
1440)>
[680140800 values with dtype=float32]
Coordinates:
* latitude (latitude) float32 89.875 89.625 89.375 89.125 88.875 88.625
...
* longitude (longitude) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625
...
* time (time) datetime64[ns] 1850-01-15 1850-02-15 1850-03-15 ...
Attributes:
short_name: concentration
long_name: Sea_Ice_Concentration
standard_name: Sea_Ice_Concentration
units: Percent
Another thing I'm confused about: netCDF4 says that time is stored as days since 2001-01-01, but xarray shows me the exact date in yyyy-mm-dd format instead of the 'days since...' definition. Why is that?
Thanks!
The easiest approach I could find is
from netCDF4 import date2index
from datetime import datetime
timeindex = date2index(datetime(1990,12,1),ds.variables['time'])
seaice_data = ds.variables['seaice_conc'][timeindex,:,:]
netCDF4.Dataset is indeed a lower-level library than xarray; if it could do everything that xarray already does, there would be no need for xarray, right?
Still, there is a useful function num2date in netCDF4, which can make your life easier when managing the date units. Approximately:
from netCDF4 import Dataset, num2date
import datetime
import numpy as np
ds = Dataset('file.nic', 'r')
your_date = datetime.datetime(1990,12,1)
select_time = np.argmax(num2date(ds.variables['time'][:],ds.variables['time'].units) == your_date)
seaice_data = ds.variables['seaice_conc'][select_time,:,:]
I admit it is still more code than xarray.
You can do what you are trying to do in Xarray.
For Question 1. It looks like your dates are all on the 15th of each month. Selecting just one time point should work like this.
ds1['seaice_conc'].sel(time='1990-12-15')
Another way you can do this is to use the method='nearest' keyword argument.
ds1['seaice_conc'].sel(time='1990-12-01', method='nearest')
Finally, you may consider reindexing your time axis to the first of each month.
ds1['seaice_conc'].resample(time='MS').mean('time').sel(time='1990-12-01')
A bonus answer, you can select time slices with a similar approach:
ds1['seaice_conc'].sel(time=slice('1990-01-01', '1991-12-31'))
The Xarray documentation includes a section on datetime indexing.
For Question 2: Xarray automatically decodes time coordinates when you use open_dataset. You can turn this off with the decode_times argument, but that doesn't seem like what you want to do here.
This is also discussed in the Xarray documentation.
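To see the difference yourself, you could compare the decoded and raw time values (a sketch using the question's file name):
import xarray as xr

# decoded (default): time appears as datetime64 values
ds_decoded = xr.open_dataset('file.nc')
print(ds_decoded.time.values[:3])

# not decoded: time stays numeric and the "days since ..." units remain an attribute
ds_raw = xr.open_dataset('file.nc', decode_times=False)
print(ds_raw.time.values[:3], ds_raw.time.attrs.get('units'))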
