Why are latitudes and longitudes two-dimensional arrays in a netCDF file? - python

I have a netCDF file that contains temperature data over some location. The data shape is 1450x900.
I am creating search functionality in my app to locate temperature data by lat/lon values.
So I extracted the lat and lon coordinate data from the netCDF file. I was expecting them to be 1D arrays, but instead got 2D arrays of shape 1450x900 for both coordinates.
So my question: why are they 2D arrays, instead of 1450 latitude values and 900 longitude values? Don't 1450 lat values and 900 lon values describe the whole grid?
Let's say we have a 4x5 grid; the indices locating the rightmost, bottom-most point of the grid will be [4, 5]. So my indices for x will be [1, 2, 3, 4] and for y [1, 2, 3, 4, 5]. Nine indices in total are enough to locate any point on that grid (consisting of 20 cells). So why do the lat (x) and lon (y) coordinates in the netCDF file contain 20 values each (40 in total), instead of 4 and 5 values respectively (9 in total)? I hope you see what confuses me.
Is it possible to somehow map those 2D arrays and "downgrade" them to 1450 latitude values and 900 longitude values? Or is it OK as it is right now? How can I use those values for my purpose? Do I need to zip the lat and lon arrays?
Here are the shapes:
>>> DS = xarray.open_dataset('file.nc')
>>> DS.tasmin.shape
(31, 1450, 900)
>>> DS.projection_x_coordinate.shape
(900,)
>>> DS.projection_y_coordinate.shape
(1450,)
>>> DS.latitude.shape
(1450, 900)
>>> DS.longitude.shape
(1450, 900)
Note that projection_x_coordinate and projection_y_coordinate are easting/northing values, not lat/lons.
Here is the metadata of the file, if needed:
Dimensions: (bnds: 2, projection_x_coordinate: 900, projection_y_coordinate: 1450, time: 31)
Coordinates:
* time (time) datetime64[ns] 2018-12-01T12:00:00 ....
* projection_y_coordinate (projection_y_coordinate) float64 -1.995e+0...
* projection_x_coordinate (projection_x_coordinate) float64 -1.995e+0...
latitude (projection_y_coordinate, projection_x_coordinate) float64 ...
longitude (projection_y_coordinate, projection_x_coordinate) float64 ...
Dimensions without coordinates: bnds
Data variables:
tasmin (time, projection_y_coordinate, projection_x_coordinate) float64 ...
transverse_mercator int32 ...
time_bnds (time, bnds) datetime64[ns] ...
projection_y_coordinate_bnds (projection_y_coordinate, bnds) float64 ...
projection_x_coordinate_bnds (projection_x_coordinate, bnds) float64 ...
Attributes:
comment: Daily resolution gridded climate observations
creation_date: 2019-08-21T21:26:02
frequency: day
institution: Met Office
references: doi: 10.1002/joc.1161
short_name: daily_mintemp
source: HadUK-Grid_v1.0.1.0
title: Gridded surface climate observations data for the UK
version: v20190808
Conventions: CF-1.5

Your data adheres to version 1.5 of the Climate and Forecast conventions.
The document describing this version of the conventions is here, although the relevant section is essentially unchanged across many versions of the conventions.
See section 5.2:
5.2. Two-Dimensional Latitude, Longitude, Coordinate Variables
The latitude and longitude coordinates of a horizontal grid that was
not defined as a Cartesian product of latitude and longitude axes, can
sometimes be represented using two-dimensional coordinate variables.
These variables are identified as coordinates by use of the coordinates attribute.
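To see why such a grid cannot be described by two 1D vectors, here is a minimal sketch. It does not use the real transverse Mercator projection: as a stand-in, it simply rotates a small regular grid by 20 degrees (the angle, the grid size and the names `lat2d`/`lon2d` are assumptions for illustration only). The point is that once the grid axes are not aligned with lines of latitude and longitude, latitude changes along every row, so each cell needs its own lat/lon pair.

```python
import numpy as np

# Regular 1D axes in the grid's own (projected) coordinate system.
x = np.arange(5, dtype=float)   # 5 columns
y = np.arange(4, dtype=float)   # 4 rows
x2d, y2d = np.meshgrid(x, y)    # both have shape (4, 5)

# Stand-in for a map projection: rotate the grid by 20 degrees.
# (Assumption for illustration only; the real file uses transverse Mercator.)
theta = np.deg2rad(20.0)
lon2d = x2d * np.cos(theta) - y2d * np.sin(theta)
lat2d = x2d * np.sin(theta) + y2d * np.cos(theta)

# In projected coordinates every row shares the same y value ...
rows_share_y = np.allclose(y2d, y2d[:, :1])
# ... but after the transformation latitude changes along each row,
# so no single vector of 4 latitudes can describe the grid.
rows_share_lat = np.allclose(lat2d, lat2d[:, :1])

print(lat2d.shape, lon2d.shape)        # (4, 5) (4, 5)
print(rows_share_y, rows_share_lat)    # True False
```

This mirrors the file in the question: projection_x_coordinate and projection_y_coordinate are the compact 1D axes, while latitude and longitude have to be stored per cell.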
It looks like you are using the HadOBS 1km resolution gridded daily minimum temperature, and this file in particular:
http://dap.ceda.ac.uk/thredds/fileServer/badc/ukmo-hadobs/data/insitu/MOHC/HadOBS/HadUK-Grid/v1.0.1.0/1km/tasmin/day/v20190808/tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc (warning: >300MB download)
As it states, the data is on a transverse mercator grid.
If you look at output from ncdump -h <filename> you will also see the following description of the grid expressed by means of attributes of the transverse_mercator dummy variable:
int transverse_mercator ;
transverse_mercator:grid_mapping_name = "transverse_mercator" ;
transverse_mercator:longitude_of_prime_meridian = 0. ;
transverse_mercator:semi_major_axis = 6377563.396 ;
transverse_mercator:semi_minor_axis = 6356256.909 ;
transverse_mercator:longitude_of_central_meridian = -2. ;
transverse_mercator:latitude_of_projection_origin = 49. ;
transverse_mercator:false_easting = 400000. ;
transverse_mercator:false_northing = -100000. ;
transverse_mercator:scale_factor_at_central_meridian = 0.9996012717 ;
and you will also see that the coordinate variables projection_x_coordinate and projection_y_coordinate have units of metres.
The grid in question is the Ordnance Survey UK grid using numeric grid references.
See for example this description of the OS grid (from Wikipedia).
If you wish to express the data on a regular longitude-latitude grid then you will need to do some type of interpolation. I see that you are using xarray. You can combine this with pyresample to do the interpolation. Here is an example:
import xarray as xr
import numpy as np
from pyresample.geometry import SwathDefinition
from pyresample.kd_tree import resample_nearest, resample_gauss
ds = xr.open_dataset("tasmin_hadukgrid_uk_1km_day_20181201-20181231.nc")
# Define a target grid. For sake of example, here is one with just
# 3 longitudes and 4 latitudes.
lons = np.array([-2.1, -2., -1.9])
lats = np.array([51.7, 51.8, 51.9, 52.0])
# The target grid is regular (1-d lon, lat coordinates) but we will need
# a 2d version (similar to the input grid), so use numpy.meshgrid to produce this.
lon2d, lat2d = np.meshgrid(lons, lats)
origin_grid = SwathDefinition(lons=ds.longitude, lats=ds.latitude)
target_grid = SwathDefinition(lons=lon2d, lats=lat2d)
# get a numpy array for the first timestep
data = ds.tasmin[0].to_masked_array()
# nearest neighbour interpolation example
# Note that radius_of_influence has units metres
interpolated = resample_nearest(origin_grid, data, target_grid, radius_of_influence=1000)
# GIVES:
# array([[5.12490065, 5.02715332, 5.36414835],
# [5.08337723, 4.96372838, 5.00862833],
# [6.47538931, 5.53855722, 5.11511239],
# [6.46571817, 6.17949381, 5.87357538]])
# gaussian weighted interpolation example
# Note that radius_of_influence and sigmas both have units metres
interpolated = resample_gauss(origin_grid, data, target_grid, radius_of_influence=1000, sigmas=1000)
# GIVES:
# array([[5.20432465, 5.07436805, 5.39693221],
# [5.09069187, 4.8565934 , 5.08191639],
# [6.4505963 , 5.44018209, 5.13774416],
# [6.47345359, 6.2386732 , 5.62121948]])

I figured out an answer myself.
As it turns out, the 2D lat/lon arrays are used to define the "grid" over some location.
In other words, if we zip the lat and lon values and project them onto a map, we get a "curved grid" (that is, the Earth's curvature is taken into account) over the location, which is then used to create the grid reference of the location.
I hope that's clear for anyone interested.
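For the search use case in the question, one possible approach is to look up the nearest grid cell directly in the 2D lat/lon arrays. The following is only a sketch under stated assumptions: the helper name `nearest_cell` is hypothetical, the grid values are made up rather than read from the HadUK-Grid file, and the distance is a simple flat approximation (longitude differences scaled by the cosine of the latitude), which should be adequate for picking a cell on a 1 km grid but is not a geodesic distance.

```python
import numpy as np

def nearest_cell(lat2d, lon2d, lat_pt, lon_pt):
    """Return (row, col) of the grid cell whose centre is closest to the point."""
    # Scale longitude differences by cos(latitude) so that east-west and
    # north-south offsets are roughly comparable in distance.
    dlat = lat2d - lat_pt
    dlon = (lon2d - lon_pt) * np.cos(np.deg2rad(lat_pt))
    dist_sq = dlat**2 + dlon**2
    return np.unravel_index(np.argmin(dist_sq), lat2d.shape)

# Synthetic 3x4 curvilinear grid (made-up values, not from the HadUK file).
lon2d, lat2d = np.meshgrid(np.linspace(-3.0, 0.0, 4), np.linspace(50.0, 52.0, 3))
lat2d = lat2d + 0.05 * (lon2d + 3.0)   # tilt the rows so the grid is not regular

iy, ix = nearest_cell(lat2d, lon2d, lat_pt=51.1, lon_pt=-1.1)
print(iy, ix)
```

With the real dataset you would presumably pass `DS.latitude.values` and `DS.longitude.values`, then read the time series with something like `DS.tasmin[:, iy, ix]`, since the data variable is ordered (time, projection_y_coordinate, projection_x_coordinate).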

Related

Need help interpolating data into specific points in netcdf files

I need help interpolating data onto nodes in a series of netcdf files which have oceanographic data. The files have 3 dimensions (latitude (12), longitude (20), time (24)) and variables (u, v) . Of all the data points, there are four nodes which do not have current velocity data (u, v) although they should have data, as they are at sea but register as land. I am trying to interpolate data onto these nodes but have no idea how.
EDIT: The four points with missing data are already in the coordinates but have NaN values. The other points would keep the original data.
I am OK with Pandas but I know that this probably requires numpy and/or xarray and I am not well versed in either. I cannot get to nodes using the coordinates to interpolate the data I require. Can this be done at all?
print(data)
<xarray.Dataset>
Dimensions: (latitude: 12, time: 24, longitude: 20)
Coordinates:
* latitude (latitude) float32 40.92 41.0 41.08 41.17 ...
41.67 41.75 41.83
* time (time) datetime64[ns] 2017-03-03T00:30:00 ...
2017-03-03T23:30:00
* longitude (longitude) float32 1.417 1.5 1.583 1.667 ...
2.833 2.917 3.0
Data variables:
vo (time, latitude, longitude) float32 ...
uo (time, latitude, longitude) float32 ...
thetao (time, latitude, longitude) float32 ...
zos (time, latitude, longitude) float32 ...
Attributes: (12/22)
Conventions: CF-1.0
source: CMEMS IBI-MFC...
print(data.latitude)
<xarray.DataArray 'latitude' (latitude: 12)>
array([40.916668, 41. , 41.083332, 41.166668, 41.25 ,
41.333332, 41.416668, 41.5 , 41.583332, 41.666668, 41.75 , 41.833332], dtype=float32)
Coordinates:
* latitude (latitude) float32 40.92 41.0 41.08 41.17 ... 41.67
41.75 41.83
Attributes:
standard_name: latitude
long_name: Latitude
units: degrees_north
axis: Y
unit_long: Degrees North
step: 0.08333f
_CoordinateAxisType: Lat
_ChunkSizes: 361
valid_min: 40.916668
valid_max: 41.833332
print(data.longitude)
<xarray.DataArray 'longitude' (longitude: 20)>
array([1.416666, 1.499999, 1.583333, 1.666666, 1.749999, 1.833333, 1.916666, 1.999999, 2.083333, 2.166666, 2.249999, 2.333333, 2.416666, 2.499999,2.583333, 2.666666, 2.749999, 2.833333, 2.916666, 2.999999],
dtype=float32)
Coordinates:
* longitude (longitude) float32 1.417 1.5 1.583 1.667 ... 2.833 2.917 3.0
Attributes:
standard_name: longitude
long_name: Longitude
units: degrees_east
axis: X
unit_long: Degrees East
step: 0.08333f
_CoordinateAxisType: Lon
_ChunkSizes: 289
valid_min: 1.4166658
valid_max: 2.999999
The goal of the question is to in-fill cells that are on land/missing in the raw files but should really be in the sea. In some cases you might want something sophisticated to do this, for example if there is a sharp coastal gradient.
But the easiest way to solve it is to replace each missing value with its nearest neighbour. That will of course replace more than you need, so you will then need to apply some kind of land-sea mask to your data. The workflow below, using my package nctoolkit, should do the job:
import nctoolkit as nc
# read in the file and set missing values to nn
ds = nc.open_data("infile.nc")
ds.fill_na()
# create the mask to apply. This should only have one time step
# I'm going to assume in this case that it is a file with temperature that has the correct land-sea division
ds_mask = nc.open_data("mask.nc")
# Ensure sea values are 1. Land values should be nan
ds_mask.compare(">-1000")
# multiply the dataset by the mask to set land values to missing
ds.multiply(ds_mask)
# plot the results
ds.plot()
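If you prefer to see what this fill-then-mask idea does on a plain array, here is a rough NumPy-only sketch of the same two steps. It is not how nctoolkit implements it: the function name `fill_nearest` is hypothetical, the 3x3 field and mask are made up, and "nearest" is measured in grid-index space, which assumes a regularly spaced grid. Ties between equally distant neighbours are resolved by taking the first one in row-major order.

```python
import numpy as np

def fill_nearest(field):
    """Replace NaNs in a 2D array with the value of the nearest valid cell."""
    filled = field.copy()
    valid = ~np.isnan(field)
    valid_idx = np.argwhere(valid)        # (n_valid, 2) row/col indices
    for iy, ix in np.argwhere(~valid):
        # Squared index-space distance from this gap to every valid cell.
        d2 = ((valid_idx - (iy, ix)) ** 2).sum(axis=1)
        ny, nx = valid_idx[d2.argmin()]
        filled[iy, ix] = field[ny, nx]
    return filled

# Made-up 3x3 field with one missing "sea" cell and one genuine land cell.
u = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, np.nan]])
sea_mask = np.array([[1.0, 1.0, 1.0],
                     [1.0, 1.0, 1.0],
                     [1.0, 1.0, np.nan]])   # NaN marks true land

# Fill every gap, then multiply by the mask to blank out real land again.
u_filled = fill_nearest(u) * sea_mask
print(u_filled)
```

For the 12x20 grid in the question this loop is cheap; you would apply it to each time step of uo and vo in turn.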

Temperature and Salinity from CMEMS netcdf file for specific geographical location

I want to get the temperature ['thetao'] and salinity ['so'] of the sea surface (just the top layer) for specific geographical location.
I found guidance for how to do this on this website.
import netCDF4 as nc
import numpy as np
fn = "\\...\...\Downloads\global-analysis-forecast-phy-001-024_1647367066622.nc" # path to netcdf file
ds = nc.Dataset(fn) # read as netcdf dataset
print(ds)
print(ds.variables.keys()) # get all variable names
temp = ds.variables['thetao']
sal = ds.variables['so']
lat,lon = ds.variables['latitude'], ds.variables['longitude']
# extract lat/lon values (in degrees) to numpy arrays
latvals = lat[:]; lonvals = lon[:]
# a function to find the index of the grid point closest
# (in squared distance) to the given lat/lon value.
def getclosest_ij(lats,lons,latpt,lonpt):
    # find squared distance of every point on grid
    dist_sq = (lats-latpt)**2 + (lons-lonpt)**2
    # 1D index of minimum dist_sq element
    minindex_flattened = dist_sq.argmin()
    # Get 2D index for latvals and lonvals arrays from 1D index
    return np.unravel_index(minindex_flattened, lats.shape)
iy_min, ix_min = getclosest_ij(latvals, lonvals, 50., -140)
print(iy_min)
print(ix_min)
# Read values out of the netCDF file for temperature and salinity
print('%7.4f %s' % (temp[0,0,iy_min,ix_min], temp.units))
print('%7.4f %s' % (sal[0,0,iy_min,ix_min], sal.units))
Some details on the nc-file I am using:
dimensions(sizes): time(1), depth(1), latitude(2041), longitude(4320)
variables(dimensions): float32 depth(depth), float32 latitude(latitude), int16 thetao(time, depth, latitude, longitude), float32 time(time), int16 so(time, depth, latitude, longitude), float32 longitude(longitude)
groups:
dict_keys(['depth', 'latitude', 'thetao', 'time', 'so', 'longitude'])
I am getting this error:
dist_sq = (lats-latpt)**2 + (lons-lonpt)**2
ValueError: operands could not be broadcast together with shapes (2041,) (4320,)
I suspect there is an issue with the shapes of the arrays. In the example on the website (link above), Lat and Lon are 2D with shape (x, y), whereas this NC file has only (2041,) for Latitude and (4320,) for Longitude.
How can I solve this?
It's because lats and lons are 1D vectors of different sizes, so the two squared-difference terms cannot be broadcast against each other.
I usually do this when the coordinates are in WGS84 degrees:
lonm,latm = np.meshgrid(lons,lats)
dmat = (np.cos(latm*np.pi/180.0)*(lonm-lonpt)*60.*1852)**2+((latm-latpt)*60.*1852)**2
Now you can find the closest point:
kkd = np.where(dmat==np.nanmin(dmat))
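As a side note, when latitude and longitude are plain 1D axes (a Cartesian-product grid, as in this file), you can also search each axis on its own and skip the 2D arrays entirely. The sketch below compares that with the meshgrid route, using the simple squared-degree distance from the question rather than the metre-scaled one above. The axes are made up (1-degree spacing) to keep the example small; they are not the CMEMS coordinates.

```python
import numpy as np

# Made-up 1D axes standing in for the file's latitude/longitude vectors.
lats = np.linspace(-80.0, 90.0, 171)     # 1-degree spacing
lons = np.linspace(-180.0, 179.0, 360)   # 1-degree spacing
latpt, lonpt = 50.2, -139.6

# Option 1: the axes form a Cartesian product, so search each axis on its own.
iy = np.abs(lats - latpt).argmin()
ix = np.abs(lons - lonpt).argmin()

# Option 2: build 2D arrays with meshgrid and reuse the original function's logic.
lonm, latm = np.meshgrid(lons, lats)
dist_sq = (latm - latpt) ** 2 + (lonm - lonpt) ** 2
iy2, ix2 = np.unravel_index(dist_sq.argmin(), dist_sq.shape)

print(iy, ix, lats[iy], lons[ix])
print((iy, ix) == (iy2, ix2))
```

Either pair of indices can then be used as in the question, for example `temp[0, 0, iy, ix]`. Option 1 avoids allocating 2041x4320 arrays, which may matter for the full-resolution file.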

How to set the coordinates of the output of xarray.assign?

I've been trying to create two new variables based on the latitude coordinate of a data point in an xarray dataset. However, I can only seem to assign new coordinates. The data set looks like this:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
What I've attempted so far is this:
def get_latitude_band(latitude):
    return np.select(
        condlist=
            [abs(latitude) < 23.45,
             abs(latitude) < 35,
             abs(latitude) < 66.55],
        choicelist=
            ["tropical",
             "sub_tropical",
             "temperate"],
        default="frigid"
    )
def get_hemisphere(latitude):
    return np.select(
        [latitude > 0, latitude <=0],
        ["north", "south"]
    )
mhw_data = mhw_data \
    .assign(climate_zone=get_latitude_band(mhw_data.lat)) \
    .assign(hemisphere=get_hemisphere(mhw_data.lat)) \
    .reset_index(["hemisphere", "climate_zone"]) \
    .reset_coords()
print(mhw_data)
Which is getting me close:
<xarray.Dataset>
Dimensions: (lon: 360, lat: 180, time: 412, hemisphere: 180, climate_zone: 180)
Coordinates:
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
* time (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Dimensions without coordinates: hemisphere, climate_zone
Data variables:
evapr (time, lat, lon) float32 ...
lhtfl (time, lat, lon) float32 ...
...
hemisphere_ (hemisphere) object 'south' 'south' ... 'north' 'north'
climate_zone_ (climate_zone) object 'frigid' 'frigid' ... 'frigid' 'frigid'
...
However, I want to then stack the DataSet and convert it to a DataFrame. I am unable to do so, and I think it is because the new variables hemisphere_ and climate_zone_ do not have time, lat, lon coordinates:
stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
results in a KeyError on "lon".
So my question is: How do I assign new variables to the xarray dataset that maintain the original coordinates of time, lat and lon?
To assign a new variable or coordinate, xarray needs to know what the dimensions are called. There are a number of ways to define a DataArray or Coordinate, but the one closest to what you're currently using is to provide a tuple of (dim_names, array):
mhw_data = mhw_data.assign_coords(
climate_zone=(('lat', ), get_latitude_band(mhw_data.lat)),
hemisphere=(('lat', ), get_hemisphere(mhw_data.lat)),
)
Here I'm using da.assign_coords, which will define climate_zone and hemisphere as non-dimension coordinates, which you can think of as additional metadata about latitude and about your data, but which aren't proper data in themselves. This will also allow them to be preserved when sending individual arrays to pandas.
For stacking, converting to pandas will stack automatically. The following will return a DataFrame with variables/non-dimension coordinates as columns and dimensions as a MultiIndex:
stacked = mhw_data.to_dataframe()
Alternatively, if you want a Series indexed by (lat, lon, time) for just one of these coordinates you can always use expand_dims:
(
mhw_data.climate_zone
.expand_dims(lon=mhw_data.lon, time=mhw_data.time)
.to_series()
)
The two possible solutions I've worked out for myself are as follows.
Method 1: stack the xarray data into pandas DataFrames, and then create new columns:
import pandas as pd
import swifter  # registers the .swifter accessor on pandas objects
from tqdm import tqdm
df = None
variables = list(mhw_data.data_vars)
for var in tqdm(variables):
    stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
    if df is None:
        df = stacked
    else:
        df = pd.concat([df, stacked], axis=1)
df.reset_index(inplace=True)
df.columns = list(mhw_data.variables)
df["climate_zone"] = df["lat"].swifter.apply(get_latitude_band)
df["hemisphere"] = df["lat"].swifter.apply(get_hemisphere)
Method 2: create new xarray.DataArrays for each variable you'd like to add, then add them to the dataset:
# calculate climate zone and hemisphere from latitude.
latitudes = mhw_data.lat.values.reshape(-1, 1)
zones = get_latitude_band(latitudes)
hemispheres = get_hemisphere(latitudes)
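# Take advantage of numpy broadcasting to get our data to line up with the xarray shape.
# The labels are strings, so use np.broadcast_to here: adding an array of zeros
# to a string array fails with a type error.
shape = tuple(mhw_data.sizes.values())
zones = np.broadcast_to(zones, shape)
hemispheres = np.broadcast_to(hemispheres, shape)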
# finally, create two new DataArrays and assign them as variables in the dataset.
zone_xarray = xr.DataArray(data=zones, coords=mhw_data.coords, dims=mhw_data.dims)
hemi_xarray = xr.DataArray(data=hemispheres, coords=mhw_data.coords, dims=mhw_data.dims)
mhw_data["zone"] = zone_xarray
mhw_data["hemisphere"] = hemi_xarray
# ... call the code to stack and convert to pandas (shown in method 1) ...#
My intuition is that method 1 is faster and more memory efficient because there are no repeated values that need broadcasting into a large 3-dimensional array. I did not test this, however.
Also, my intuition is that there is a less cumbersome xarray native way of accomplishing the same goal, but I could not find it.
One thing is certain, method 1 is far more concise due to the fact that there is no need to create intermediate arrays or reshape data.

applying a generalized additive model to an xarray

I have a netCDF file which I have read with xarray. The array contains times, latitude, longitude and only one data variable (i.e. index values).
# read the netCDF files
with xr.open_mfdataset('wet_tropics.nc') as wet:
    print(wet)
Out[]:
<xarray.Dataset>
Dimensions: (time: 1437, x: 24, y: 20)
Coordinates:
* y (y) float64 -1.878e+06 -1.878e+06 -1.878e+06 -1.878e+06 ...
* x (x) float64 1.468e+06 1.468e+06 1.468e+06 1.468e+06 ...
* time (time) object '2013-03-29T00:22:28.500000000' ...
Data variables:
index_values (time, y, x) float64 dask.array<shape=(1437, 20, 24), chunksize=(1437, 20, 24)>
So far, so good.
Now I need to apply a generalized additive model to each grid cell in the array. The model I want to use comes from Facebook Prophet (https://facebook.github.io/prophet/) and I have successfully applied it to a pandas array of data before. For example:
cns_ap['y'] = cns_ap['av_index'] # Prophet requires specific names 'y' and 'ds' for column names
cns_ap['ds'] = cns_ap['Date']
cns_ap['cap'] = 1
m1 = Prophet(weekly_seasonality=False, # disables weekly_seasonality
daily_seasonality=False, # disables daily_seasonality
growth='logistic', # logistic because indices have a maximum
yearly_seasonality=4, # fourier transform. int between 1-10
changepoint_prior_scale=0.5).fit(cns_ap)
future1 = m1.make_future_dataframe(periods=60, # 5 year prediction
freq='M', # monthly predictions
include_history=True) # fits model to all historical data
future1['cap'] = 1 # sets cap at maximum index value
forecast1 = m1.predict(future1)
# m1.plot_components(forecast1, plot_cap=False);
# m1.plot(forecast1, plot_cap=False, ylabel='CNS index', xlabel='Year');
The problem is that now I have to:
1) iterate through every cell of the netCDF file,
2) get all the values for that cell through time,
3) apply the GAM (using fbprophet), and then export and plot the results.
The question: do you have any ideas on how to loop through the raster and get the index_values of each pixel for all times, so that I can run the GAM?
I think that a nested for loop would be feasible, although I don't know how to make one that goes through every cell.
Any help is appreciated.
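One way the three steps above could be sketched is with `np.ndindex`, which yields every (y, x) index pair so that no explicit nested loop is needed. This is only an illustration under assumptions: the array is random made-up data standing in for `wet.index_values.values`, and `fit_cell` is a hypothetical placeholder that fits a straight line with `np.polyfit`. It is not Prophet; the Prophet fit on a per-cell DataFrame would go where the placeholder is.

```python
import numpy as np

# Made-up stand-in for wet.index_values.values with shape (time, y, x).
rng = np.random.default_rng(0)
n_time, n_y, n_x = 50, 4, 3
t = np.arange(n_time, dtype=float)
data = 0.01 * t[:, None, None] + rng.normal(0.0, 0.001, size=(n_time, n_y, n_x))

def fit_cell(series, t):
    """Placeholder model: slope of a least-squares line (swap in Prophet here)."""
    slope, _intercept = np.polyfit(t, series, deg=1)
    return slope

# Visit every (y, x) cell and fit the model to that cell's full time series.
slopes = np.empty((n_y, n_x))
for iy, ix in np.ndindex(n_y, n_x):
    slopes[iy, ix] = fit_cell(data[:, iy, ix], t)

print(slopes.shape)
```

For the 20x24 grid in the question that is 480 model fits, so a plain loop like this is probably fine; the result array has one value per cell and can be plotted as a map.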

How to use HDF5 dimension scales in h5py

HDF5 has the concept of dimension scales, as explained on the HDF5 and h5py websites. However, the explanations both use terse or generic examples and so I don't really understand how to use dimension scales. Namely, given a dataset f['coordinates'] in some HDF5 file f = h5py.File('data.h5'):
>>> f['coordinates'][:]
array([[ 52.60636111, 4.38963889],
[ 52.57877778, 4.43422222],
[ 52.58319444, 4.42811111],
...,
[ 52.62269444, 4.43130556],
[ 52.62711111, 4.42519444],
[ 52.63152778, 4.41905556]])
I'd like to make it clear that the first column is the latitude and the second is the longitude. Are dimension scales used for this? Or are they used to indicate that the unit is degrees? Or both?
Perhaps another concrete example can illustrate the use of dimension scales better? If you have one, please share it, even if you are not using h5py.
Specifically for this question, the best answer is probably to use attributes:
f['coordinates'].attrs['columns'] = ['latitude', 'longitude']
But dimension scales are useful for other things. I'll show what they're for, how you could use them in a way similar to attributes, and how you might actually use your f['coordinates'] dataset as a scale for some other dataset.
Dimension scales
I agree that those documentation pages are not as clear as they could be, because they launch into complicated possibilities and mire in technical details before they actually explain the basic concepts. I think some simple examples should make things clear.
First, suppose you've kept track of the temperature outside over the course of a day — maybe measuring it every hour on the hour, for a total of 24 measurements. You might think of this as two columns of data: one for the hour, and one for the temperature. You could store this as a single dataset of shape 24x2. But time and temperature have different units, and are really different datatypes. So it might make more sense to store time and temperature as separate datasets — probably named "time" and "temperature", each of shape 24. But you'd also need to be a little more clear about what these are and how they're related to each other. That relationship is what "dimension scales" are really for.
If you imagine plotting the temperature as a function of time, you might label the horizontal axis as "Time (hour of day)", and the scale for the horizontal axis would be the hours themselves, telling you the horizontal position at which to plot each temperature. You could store this information through h5py like this:
with h5py.File("temperatures.h5", "w") as f:
    time = f.create_dataset("time", data=...)
    time.make_scale("hour of day")
    temp = f.create_dataset("temperature", data=...)
    temp.dims[0].label = "Time"
    temp.dims[0].attach_scale(time)
Note that the argument to make_scale is specific information about that particular time dataset — in this case, the units we used to measure time — whereas the label is the more general concept of that dimension. Also note that it's actually more standard to attach unit information as attributes, but I like this approach more for specifying the unit of a scale because of its simplicity.
Now, suppose we measured the temperatures in three different places — say, Los Angeles, Chicago, and New York. Now, our array of temperatures would have shape 24x3. We would still need the time scale for dim[0], but now we also have dim[1] to deal with.
with h5py.File("temperatures.h5", "w") as f:
    time = f.create_dataset("time", data=...)
    time.make_scale("hour of day")
    cities = f.create_dataset("cities",
        data=[s.encode() for s in ["Los Angeles", "Chicago", "New York"]]
    )
    cities.make_scale("city")
    temp = f.create_dataset("temperature", data=...)
    temp.dims[0].label = "Time"
    temp.dims[0].attach_scale(time)
    temp.dims[1].label = "Location"
    temp.dims[1].attach_scale(cities)
It might be more useful to store the latitude and longitude, instead of city names. You can actually attach both types of scale to the same dimension. Just add code like this at the bottom of that last code block:
    latlong = f.create_dataset("latlong",
        data=[[34.0522, 118.2437], [41.8781, 87.6298], [40.7128, 74.0060]]
    )
    latlong.make_scale("latitude and longitude (degrees)")
    temp.dims[1].attach_scale(latlong)
Finally, you can access these labels and scales like this:
with h5py.File("temperatures.h5", "r") as f:
    print('Datasets:', list(f))
    print('Temperature dimension labels:', [dim.label for dim in f['temperature'].dims])
    print('Temperature dim[1] scales:', f['temperature'].dims[1].keys())
    latlong = f['temperature'].dims[1]['latitude and longitude (degrees)'][:]
    print(latlong)
The output looks like this:
Datasets: ['cities', 'latlong', 'temperature', 'time']
Temperature dimension labels: ['Time', 'Location']
Temperature dim[1] scales: ['city', 'latitude and longitude (degrees)']
[[ 34.0522 118.2437]
[ 41.8781 87.6298]
[ 40.7128 74.006 ]]
@mike's answer is very helpful for understanding.
Since you gave a geospatial example with latitude and longitude, I might personally create your dataset like this, where the "lat" and "lon" are each separate 1D datasets:
import numpy as np
import h5py
# assuming you already have stored the coordinates in that 2-column array...
coords = np.array(
[[52.60636111, 4.38963889], [52.57877778, 4.43422222], [52.58319444, 4.42811111]]
)
heights = np.random.rand(3, 3)
with h5py.File("dem.h5", "w") as hf:
    lat = hf.create_dataset("lat", data=coords[:, 0])
    lat.make_scale("latitude")
    lat.attrs["units"] = "degrees north"
    lon = hf.create_dataset("lon", data=coords[:, 1])
    lon.make_scale("longitude")
    lon.attrs["units"] = "degrees east"
    h = hf.create_dataset("height", data=heights)
    h.attrs['units'] = "meters"
    h.dims[0].attach_scale(lat)
    h.dims[1].attach_scale(lon)
The reason: this will let you use it with xarray much more easily:
In [1]: import xarray as xr
In [2]: ds1 = xr.open_dataset("dem.h5")
In [3]: ds1
Out[3]:
<xarray.Dataset>
Dimensions: (lat: 3, lon: 3)
Coordinates:
* lat (lat) float64 52.61 52.58 52.58
* lon (lon) float64 4.39 4.434 4.428
Data variables:
height (lat, lon) float64 ...
In [4]: ds1['lat']
Out[4]:
<xarray.DataArray 'lat' (lat: 3)>
array([52.606361, 52.578778, 52.583194])
Coordinates:
* lat (lat) float64 52.61 52.58 52.58
Attributes:
units: degrees north
In [5]: ds1['lon']
Out[5]:
<xarray.DataArray 'lon' (lon: 3)>
array([4.389639, 4.434222, 4.428111])
Coordinates:
* lon (lon) float64 4.39 4.434 4.428
Attributes:
units: degrees east
In [6]: ds1['height']
Out[6]:
<xarray.DataArray 'height' (lat: 3, lon: 3)>
array([[0.832685, 0.24167 , 0.831189],
[0.294826, 0.779141, 0.280573],
[0.980254, 0.593158, 0.634342]])
Coordinates:
* lat (lat) float64 52.61 52.58 52.58
* lon (lon) float64 4.39 4.434 4.428
Attributes:
units: meters
The small amount of extra metadata you add (including the "units" as attributes) pays off when you want to play around with calculations, or plot the data:
In [9]: ds1.height.mean(dim="lat")
Out[9]:
<xarray.DataArray 'height' (lon: 3)>
array([0.70258813, 0.5379896 , 0.58203484])
Coordinates:
* lon (lon) float64 4.39 4.434 4.428
In [10]: ds1.height.plot.imshow()
