Calculate mean using 'year + month' combination in xarray - Python

I have a netcdf file with daily data for 5 years (2011 to 2015). I want to calculate monthly averages for the data using XArray in Python.
netcdf file:////test/Combined.nc {
dimensions:
latitude = 681;
longitude = 841;
time = 1826;
variables:
double latitude(latitude=681);
:_FillValue = NaN; // double
:name = "latitude";
:long_name = "latitude";
:units = "degrees_north";
:standard_name = "latitude";
double longitude(longitude=841);
:_FillValue = NaN; // double
:name = "longitude";
:long_name = "longitude";
:units = "degrees_east";
:standard_name = "longitude";
long time(time=1826);
:name = "time";
:long_name = "time";
:standard_name = "time";
:units = "days since 2011-01-01 00:00:00";
:calendar = "proleptic_gregorian";
float PET(time=1826, latitude=681, longitude=841);
:_FillValue = -999.0f; // float
:name = "PET";
:long_name = "Potential evapotranspiration";
:units = "mm";
:standard_name = "PET";
:var_name = "PET";
}
What I tried was using groupby to calculate monthly averages:
import numpy as np
import xarray as xr
ds = xr.open_dataset("c:\\test\\Combined.nc")
ds_avg = ds.PET.groupby('time.month').mean(dim='time')
ds_avg.to_netcdf("C:\\test\\Combined_avg.nc")
But the problem with the above code is that it spits out a file with combined monthly averages (from 2011 to 2015). That means I get only 12 months in the result file, which is not what I want. I want to calculate the monthly average for January 2011, February 2011, March 2011, and so on through December 2015, so that I get 12 * 5 months in the result file. That means the groupby should happen not on 'time.month' but on 'time.year' plus 'time.month'. How do I do that?
Thanks

You should use resample (see the resample docs) with a one-month frequency. Then:
ds_avg = ds.resample(time='1M').mean()
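Applied to the question's file, the full pipeline might look like this (a sketch reusing the question's paths and variable name):
import xarray as xr
ds = xr.open_dataset("c:\\test\\Combined.nc")
# one label per year-month pair: 2011-01, 2011-02, ..., 2015-12 (60 in total)
ds_avg = ds.PET.resample(time='1M').mean()
ds_avg.to_netcdf("C:\\test\\Combined_avg.nc")
# an alternative sketch, if you specifically want a year-month groupby:
# ds_avg = ds.PET.groupby(ds.time.dt.strftime('%Y-%m')).mean('time')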
If you are interested in any other similar (simple) manipulations, have a look at this notebook we have set up for the ERA-NUTS dataset.
Another example using another dataset:
<xarray.Dataset>
Dimensions: (bnds: 2, latitude: 61, longitude: 91, time: 218)
Coordinates:
* longitude (longitude) float32 -22.5 -21.75 -21.0 -20.25 ... 43.5 44.25 45.0
* latitude (latitude) float32 72.0 71.25 70.5 69.75 ... 28.5 27.75 27.0
* time (time) datetime64[ns] 2000-01-16T15:00:00 ... 2018-01-01T03:00:00
Dimensions without coordinates: bnds
Data variables:
time_bnds (time, bnds) datetime64[ns] ...
ssrdc (time, latitude, longitude) float64 ...
ssrd (time, latitude, longitude) float64 ...
And then applying the resample:
In [13]: d.resample(time = '1Y').mean()
Out[13]:
<xarray.Dataset>
Dimensions: (latitude: 61, longitude: 91, time: 19)
Coordinates:
* time (time) datetime64[ns] 2000-12-31 2001-12-31 ... 2018-12-31
* longitude (longitude) float32 -22.5 -21.75 -21.0 -20.25 ... 43.5 44.25 45.0
* latitude (latitude) float32 72.0 71.25 70.5 69.75 ... 28.5 27.75 27.0
Data variables:
ssrdc (time, latitude, longitude) float64 5.033e+05 ... 1.908e+05
ssrd (time, latitude, longitude) float64 4.229e+05 ... 1.909e+05

Related

Merging multiple observational nc files based on station attributes

I am trying to merge multiple nc files containing physical oceanographic data for different depths at different latitudes and longitudes.
I am using ds = xr.open_mfdataset to do this, but the files are not merging correctly, and when I try to plot them it seems there is only one resulting value for the merged files.
This is the code I am using:
import xarray as xr
import matplotlib.pyplot as plt

## Combining using concat_dim and nested method
ds = xr.open_mfdataset("33HQ20150809*.nc", concat_dim=['latitude'], combine="nested")
ds.to_netcdf('geotraces2015_combined.nc')
df = xr.open_dataset("geotraces2015_combined.nc")
## Setting up values. Oxygen values are transposed so they match the same shape as lat and pressure.
oxygen = df['oxygen'].values.transpose()
## Plotting using contourf
fig = plt.figure()
ax = fig.add_subplot(111)
plt.contourf(oxygen, cmap='inferno')
plt.gca().invert_yaxis()
cbar = plt.colorbar(label='Oxygen Concentration (umol kg-1)')
You can download the nc files from here under CTD
https://cchdo.ucsd.edu/cruise/33HQ20150809
This is what each file looks like:
<xarray.Dataset>
Dimensions: (pressure: 744, time: 1, latitude: 1, longitude: 1)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 ... 741.0 742.0 743.0
* time (time) datetime64[ns] 2015-08-12T18:13:00
* latitude (latitude) float32 60.25
* longitude (longitude) float32 -179.1
Data variables: (12/19)
pressure_QC (pressure) int16 ...
temperature (pressure) float64 ...
temperature_QC (pressure) int16 ...
salinity (pressure) float64 ...
salinity_QC (pressure) int16 ...
oxygen (pressure) float64 ...
... ...
CTDNOBS (pressure) float64 ...
CTDETIME (pressure) float64 ...
woce_date (time) int32 ...
woce_time (time) int16 ...
station |S40 ...
cast |S40 ...
Attributes:
EXPOCODE: 33HQ20150809
Conventions: COARDS/WOCE
WOCE_VERSION: 3.0
...
Another file would look like this:
<xarray.Dataset>
Dimensions: (pressure: 179, time: 1, latitude: 1, longitude: 1)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 ... 176.0 177.0 178.0
* time (time) datetime64[ns] 2015-08-18T19:18:00
* latitude (latitude) float32 73.99
* longitude (longitude) float32 -168.8
Data variables: (12/19)
pressure_QC (pressure) int16 ...
temperature (pressure) float64 ...
temperature_QC (pressure) int16 ...
salinity (pressure) float64 ...
salinity_QC (pressure) int16 ...
oxygen (pressure) float64 ...
... ...
CTDNOBS (pressure) float64 ...
CTDETIME (pressure) float64 ...
woce_date (time) int32 ...
woce_time (time) int16 ...
station |S40 ...
cast |S40 ...
Attributes:
EXPOCODE: 33HQ20150809
Conventions: COARDS/WOCE
WOCE_VERSION: 3.0
EDIT: This is my new approach which is still not working:
I'm trying to use preprocess to set_coords, squeeze, and expand_dims, following Michael's approach:
def preprocess(ds):
    return ds.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')

ds = xr.open_mfdataset('33HQ20150809*.nc', concat_dim='station', combine='nested', preprocess=preprocess)
But I'm still having the same problem...
Solution: First, I had to identify the coordinate with a unique value; in my case it was 'station'. Then I used preprocess to apply the set_coords, squeeze, and expand_dims functions to each file, following Michael's answer.
import pandas as pd
import numpy as np
import os
import netCDF4
import pathlib
import matplotlib.pyplot as plt
import xarray as xr

def preprocess(ds):
    return ds.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')

ds = xr.open_mfdataset('filename*.nc', preprocess=preprocess, parallel=True)
ds = ds.sortby('latitude').transpose()
ds.oxygen.plot.contourf(x="latitude", y="pressure")
plt.gca().invert_yaxis()
The xarray data model requires that all data dimensions be perpendicular and complete. In other words, every combination of every coordinate along each dimension will be present in the data array (either as data or NaNs).
You can work with observational data such as yours using xarray, but you have to be careful with the indices to ensure you don't explode the data dimensionality. Specifically, whenever data is not truly a dimension of the data, but is simply an observation or attribute tied to a station or monitor, you should think of this more as a data variable than a coordinate. In your case, your dimensions seem to be station ID and pressure level (which does not have a full set of observations for each station, but is a dimension of the data). On the other hand, time, latitude, and longitude are attributes of each station, and should not be treated as dimensions.
I'll generate some random data that looks like yours:
import numpy as np
import pandas as pd
import xarray as xr

def generate_random_station():
    station_id = "{:09d}".format(np.random.randint(0, int(1e9)))
    time = np.random.choice(pd.date_range("2015-08-01", "2015-08-31", freq="H"))
    plevs = np.arange(np.random.randint(1, 1000)).astype(float)
    lat = np.random.random() * 10 + 30
    lon = np.random.random() * 10 - 80
    ds = xr.Dataset(
        {
            "salinity": (('pressure', ), np.sin(plevs / 200 + lat)),
            "woce_date": (("time", ), [time]),
            "station": station_id,
        },
        coords={
            "pressure": plevs,
            "time": [time],
            "latitude": [lat],
            "longitude": [lon],
        },
    )
    return ds
This ends up looking like the following:
In [11]: single = generate_random_station()
In [12]: single
Out[12]:
<xarray.Dataset>
Dimensions: (pressure: 37, time: 1, latitude: 1, longitude: 1)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 4.0 ... 33.0 34.0 35.0 36.0
* time (time) datetime64[ns] 2015-08-21T01:00:00
* latitude (latitude) float64 39.61
* longitude (longitude) float64 -72.19
Data variables:
salinity (pressure) float64 0.9427 0.941 0.9393 ... 0.8726 0.8702 0.8677
woce_date (time) datetime64[ns] 2015-08-21T01:00:00
station <U9 '233136181'
The problem is the latitude, longitude, and time coords aren't really dimensions which can be used to index a larger array. They aren't evenly spaced, and each combination of lat/lon/time does not have a station at it. Because of this, we need to be extra careful to make sure that when we combine the data, the lat/lon/time dimensions are not expanded.
To do this, we'll squeeze these dimensions, and expand the datasets along a new dimension, station:
In [13]: single.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')
Out[13]:
<xarray.Dataset>
Dimensions: (pressure: 37, station: 1)
Coordinates:
* station (station) <U9 '233136181'
* pressure (pressure) float64 0.0 1.0 2.0 3.0 4.0 ... 33.0 34.0 35.0 36.0
time datetime64[ns] 2015-08-21T01:00:00
latitude float64 39.61
longitude float64 -72.19
Data variables:
salinity (station, pressure) float64 0.9427 0.941 0.9393 ... 0.8702 0.8677
woce_date (station) datetime64[ns] 2015-08-21T01:00:00
This can be done to all of your datasets, then they can be concatenated along the "station" dimension:
In [14]: all_stations = xr.concat(
...: [
...: generate_random_station()
...: .set_coords('station')
...: .squeeze(["latitude", "longitude", "time"])
...: .expand_dims('station')
...: for i in range(10)
...: ],
...: dim="station",
...: )
This results in a dataset indexed by pressure level and station:
In [15]: all_stations
Out[15]:
<xarray.Dataset>
Dimensions: (pressure: 657, station: 10)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 ... 653.0 654.0 655.0 656.0
* station (station) <U9 '197171488' '089978445' ... '107555081' '597650083'
time (station) datetime64[ns] 2015-08-19T06:00:00 ... 2015-08-24T15...
latitude (station) float64 37.96 34.3 38.74 39.28 ... 37.72 33.89 36.46
longitude (station) float64 -74.28 -73.59 -78.33 ... -76.6 -76.47 -77.96
Data variables:
salinity (station, pressure) float64 0.2593 0.2642 0.269 ... 0.8916 0.8893
woce_date (station) datetime64[ns] 2015-08-19T06:00:00 ... 2015-08-24T15...
You can now plot along the latitude and pressure level dimensions:
In [16]: all_stations.salinity.plot.contourf(x="latitude", y="pressure")

Calculating BIOCLIM variables using Xarray and UKCP18 - Multivariable indexing

I am currently generating several bioclimatic variables (climatic derivatives) to apply to some biodiversity work using UKCP18 data. I am generating bioclimatic variable "Bio 19": Precipitation of the Coldest Quarter (https://pubs.usgs.gov/ds/691/ds691.pdf) which uses tas and pr.
The task involves finding the 3-month rolling sum of mean temperatures, identifying the minimum of these (constituting the coldest quarter) and then extracting the total precipitation over that 3-month rolling period to obtain "Bio 19".
My issue: I can find the coldest quarter (using tas) without problem, but the pr data is dropped by Xarray in that operation, along with the time indexing. This means I cannot know which period to extract rainfall from, because the data is not linked in that way across variables (using my method).
Example code:
# previous code here
...
# calculate rolling sums over 3-month period(s)
ds_rolling_3mos_sum = ds_monthly.rolling(time=3, center=True).sum().dropna('time')
ds_rolling_3mos_sum
<xarray.Dataset>
Dimensions: (time: 12, grid_latitude: 606, grid_longitude: 484)
Coordinates:
* time (time) object 1981-01-01 00:00:00 ... 1981-12...
* grid_latitude (grid_latitude) float64 -4.683 -4.647 ... 8.063
* grid_longitude (grid_longitude) float64 353.9 354.0 ... 364.3
latitude (grid_latitude, grid_longitude) float64 dask.array<chunksize=(606, 484), meta=np.ndarray>
longitude (grid_latitude, grid_longitude) float64 dask.array<chunksize=(606, 484), meta=np.ndarray>
Data variables:
pr (time, grid_latitude, grid_longitude) float32 dask.array<chunksize=(2, 606, 484), meta=np.ndarray>
rotated_latitude_longitude (time) float64 -6.442e+09 ... -6.442e+09
grid_latitude_bnds (time, grid_latitude) float64 dask.array<chunksize=(2, 606), meta=np.ndarray>
grid_longitude_bnds (time, grid_longitude) float64 dask.array<chunksize=(2, 484), meta=np.ndarray>
tas (time, grid_latitude, grid_longitude) float32 dask.array<chunksize=(2, 606, 484), meta=np.ndarray>
Find the coldest quarter:
ds_rolling_3mos_sum_tas_min = ds_rolling_3mos_sum.tas.min('time')
I now have neither the time index information (which I could use to obtain the correct month for rainfall in a .sel()) nor any connected pr data to access for this variable.
<xarray.DataArray 'tas' (grid_latitude: 606, grid_longitude: 484)>
dask.array<nanmin-aggregate, shape=(606, 484), dtype=float32, chunksize=(606, 484), chunktype=numpy.ndarray>
Coordinates:
* grid_latitude (grid_latitude) float64 -4.683 -4.647 -4.611 ... 8.027 8.063
* grid_longitude (grid_longitude) float64 353.9 354.0 354.0 ... 364.3 364.3
latitude (grid_latitude, grid_longitude) float64 dask.array<chunksize=(606, 484), meta=np.ndarray>
longitude (grid_latitude, grid_longitude) float64 dask.array<chunksize=(606, 484), meta=np.ndarray>
I've tried using a .where statement and a few other hands-on techniques, but nothing has worked to date. I feel like there's a recognised Xarray action I'm not aware of! Please help!
This sounds like a case for xarray's advanced indexing! Get excited - this is one of the most fun & powerful features of xarray in my opinion :)
Here's a quick MRE I'll use for this example:
import xarray as xr, numpy as np, pandas as pd
time = pd.date_range('1980-12-01', '1982-01-01', freq='MS')
x = np.arange(-4, 8, 0.25)
y = np.arange(353, 364, 0.25)
tas = np.random.random(size=(len(time), len(y), len(x))) * 100 + 260
pr = np.exp(np.random.random(size=(len(time), len(y), len(x)))) * 100
ds = xr.Dataset(
{'pr': (('time', 'lat', 'lon'), pr), 'tas': (('time', 'lat', 'lon'), tas)},
coords={'lat': y, 'lon': x, 'time': time},
)
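The answer below indexes into ds_rolling_3mos_sum; mirroring the question's code, it can be rebuilt from this MRE as:
ds_rolling_3mos_sum = ds.rolling(time=3, center=True).sum().dropna('time')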
Your dataset of rolling 3mo sums contains the temperature information about which locations within the precip array you want to select. So what we can do is use the idxmin method to build a 2D DataArray of times, indexed by lat/lon, which give the center of the coldest quarter:
In [24]: # find the time of the min rolled tas for each lat/lon
...: time_of_min_tas_by_cell = ds_rolling_3mos_sum.tas.idxmin(dim='time')
In [25]: time_of_min_tas_by_cell
Out[25]:
<xarray.DataArray 'time' (lat: 44, lon: 48)>
array([['1981-06-01T00:00:00.000000000', '1981-12-01T00:00:00.000000000',
'1981-06-01T00:00:00.000000000', ...,
'1981-03-01T00:00:00.000000000', '1981-04-01T00:00:00.000000000',
'1981-10-01T00:00:00.000000000'],
...,
['1981-07-01T00:00:00.000000000', '1981-09-01T00:00:00.000000000',
'1981-11-01T00:00:00.000000000', ...,
'1981-07-01T00:00:00.000000000', '1981-12-01T00:00:00.000000000',
'1981-09-01T00:00:00.000000000']], dtype='datetime64[ns]')
Coordinates:
* lat (lat) float64 353.0 353.2 353.5 353.8 ... 363.0 363.2 363.5 363.8
* lon (lon) float64 -4.0 -3.75 -3.5 -3.25 -3.0 ... 6.75 7.0 7.25 7.5 7.75
This can be used to index into the precip array directly to find the rolling 3mo precip total at the center of the min temp quarter:
In [30]: ds_rolling_3mos_sum.pr.sel(time=time_of_min_tas_by_cell)
Out[30]:
<xarray.DataArray 'pr' (lat: 44, lon: 48)>
array([[449.50779525, 531.90472182, 747.26749901, ..., 405.24357679,
610.08658199, 488.83666056],
[487.65599173, 567.01802137, 380.9979117 , ..., 613.84289448,
442.8228211 , 629.50269312],
[432.48761645, 444.76568124, 480.11564481, ..., 464.74424834,
543.97169369, 491.91926534],
...,
[488.68368642, 455.70782431, 363.25961252, ..., 457.72558376,
529.17600183, 438.6763931 ],
[370.4485618 , 491.65565156, 391.47992765, ..., 689.95878533,
585.65987576, 407.78032041],
[576.67281438, 551.36298132, 389.643589 , ..., 366.8810199 ,
526.52862773, 593.30879779]])
Coordinates:
* lat (lat) float64 353.0 353.2 353.5 353.8 ... 363.0 363.2 363.5 363.8
* lon (lon) float64 -4.0 -3.75 -3.5 -3.25 -3.0 ... 6.75 7.0 7.25 7.5 7.75
time (lat, lon) datetime64[ns] 1981-06-01 1981-12-01 ... 1981-09-01
xarray knows to reshape the result into an array indexed by (lat, lon) because the time indices are also indexed by (lat, lon). So it collapses across time, matching the indexer's lat/lon values to the source array's dims. Cool, right?

xarray slice function alternative for calculating average along a dimension

I'm using Xarray and netCDF meteorological data. I have the usual dimensions time, latitude and longitude and two main variables: the wind speed (time, lat, lon) and a latitudinal position (time, lon).
<xarray.Dataset>
Dimensions: (lon: 53, time: 25873, lat: 20)
Coordinates:
* lon (lon) float64 -80.0 -77.5 -75.0 -72.5 ... 45.0 47.5 50.0
* time (time) datetime64[ns] 1950-01-31 ... 2020-12-01
* lat (lat) float32 70.0 67.5 65.0 62.5 ... 27.5 25.0 22.5
Data variables:
uwnd (time, lat, lon) float32 -0.0625 0.375 ... -1.812 -2.75
positions (time, lon) float64 40.0 40.0 45.0 ... 70.0 70.0 70.0
For each time, lon, I'd like to calculate a latitudinal average around the positions.
If I do a loop, I would do this (for a +-2.5° latitude average):
for i in ds.lon.values:
    for t in ds.time.values:
        pos = ds.positions.sel(lon=i, time=t).values
        wind_averaged.loc[t, i] = ds.uwnd.sel(lon=i, time=t).sel(lat=slice(pos + 2.5, pos - 2.5)).mean('lat')
This is obviously very bad, and I wanted to use slice() like this:
wind_averaged = ds.uwnd.sel(lat=slice(2.5 + ds.positions.values, ds.positions.values - 2.5)).mean('lat')
but it gives an error:
cannot use non-scalar arrays in a slice for xarray indexing
Is there any alternative way to do what I want without two for loops, using xarray's power?
Thanks
I believe you are looking for the multidimensional groupby. If I understand correctly, there is a tutorial for this problem here: https://xarray.pydata.org/en/stable/examples/multidimensional-coords.html
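As a further sketch (my own, not from the tutorial; it assumes the dataset layout shown in the question, with uwnd(time, lat, lon) and positions(time, lon)), you can also avoid both loops by broadcasting a boolean mask and taking a masked mean:
# mask is True where lat falls within +-2.5 degrees of the position;
# it broadcasts to (time, lat, lon) because lat has dims (lat,)
# and positions has dims (time, lon)
mask = (ds.lat >= ds.positions - 2.5) & (ds.lat <= ds.positions + 2.5)
# the masked mean over lat leaves a (time, lon) result
wind_averaged = ds.uwnd.where(mask).mean('lat')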

Merging datasets with xarray causes variables to become NaN

I want to represent two datasets in the same plot, so I am merging them using xarray. This is what they look like:
ds1
<xarray.Dataset>
Dimensions: (time: 1, lat: 1037, lon: 1345)
Coordinates:
* lat (lat) float32 37.7 37.7 37.69 37.69 37.69 ... 35.01 35.01 35.0 35.0
* time (time) datetime64[ns] 2021-11-23
* lon (lon) float32 -9.001 -8.999 -8.996 -8.993 ... -5.507 -5.504 -5.501
Data variables:
CHL (time, lat, lon) float32 ...
ds2
<xarray.Dataset>
Dimensions: (time: 1, lat: 852, lon: 1168)
Coordinates:
* time (time) datetime64[ns] 2021-11-23
* lat (lat) float32 35.0 35.0 35.01 35.01 35.01 ... 37.29 37.29 37.3 37.3
* lon (lon) float32 -5.501 -5.498 -5.494 -5.491 ... -1.507 -1.503 -1.5
Data variables:
CHL (time, lat, lon) float32 ...
So then I use:
ds3 = xr.merge([ds1,ds2])
It works for the dimensions, but my variable CHL becomes nan:
<xarray.Dataset>
Dimensions: (lat: 1887, lon: 2513, time: 1)
Coordinates:
* lat (lat) float64 35.0 35.0 35.0 35.0 35.01 ... 37.69 37.69 37.7 37.7
* lon (lon) float64 -9.001 -8.999 -8.996 -8.993 ... -1.507 -1.503 -1.5
* time (time) datetime64[ns] 2021-11-23
Data variables:
CHL (time, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan
So when I plot this dataset I have the following result:
I assume those white stripes are caused by the variable CHL becoming nan...
Any ideas of what could be happening? Thank you!
I don't think that any values become NaNs. Rather, I think that the latitude coordinates simply differ. Because you do an outer join (the default for xr.merge), xarray has to fill up the matrix at places where there is no information about the values. The default for the fill_value seems to be NaN.
So the question is, what values would you expect in these locations?
One possibility could be to fill the missing places by interpolation. In several dimensions this might be tricky, but as far as I see you are just placing two images next to each other with no overlap in the lon dimension.
In that case, xarray lets you interpolate the lat dimension easily:
ds3["CHL"].interpolate_na(dim="lat", method="linear")

How to convert netCDFs with unusual dimensions to a standard netCDF (time, lat, lon) (python)

I have multiple netCDF files that I eventually want to merge. An example netCDF is as follows.
import xarray as xr
import numpy as np
import cftime
rain_nc = xr.open_dataset('filepath.nc', decode_times=False)
print(rain_nc)
<xarray.Dataset>
Dimensions: (land: 67209, tstep: 248)
Dimensions without coordinates: land, tstep
Data variables:
lon (land) float32 ...
lat (land) float32 ...
timestp (tstep) int32 ...
time (tstep) int32 ...
Rainf (tstep, land) float32 ...
The dimension 'land' is a count of numbers 1 to 67209, and 'tstep' is a count from 1 to 248.
The variables 'lat' and 'lon' are latitude and longitude values with a shape of (67209,).
The variable 'time' is the time in seconds since the start of the month (the netCDF is a month long).
Next I've swapped the dimensions from 'tstep' to 'time', converted the time values for later merging, and set lon, lat, and time as coordinates.
rain_nc = rain_nc.swap_dims({'tstep':'time'})
rain_nc = rain_nc.set_coords(['lon', 'lat', 'time'])
rain_nc['time'] = cftime.num2date(rain_nc['time'], units='seconds since 2016-01-01 00:00:00', calendar = 'standard')
rain_nc['time'] = cftime.date2num(rain_nc['time'], units='seconds since 1970-01-01 00:00:00', calendar = 'standard')
this has left me with the following Dataset:
print(rain_nc)
<xarray.Dataset>
Dimensions: (land: 67209, time: 248)
Coordinates:
lon (land) float32
lat (land) float32
* time (time) float64
Dimensions without coordinates: land
Data variables:
timestp (time) int32
Rainf (time, land) float32
print(rain_nc['land'])
<xarray.DataArray 'land' (land: 67209)>
array([ 0, 1, 2,..., 67206, 67207, 67208])
Coordinates:
lon (land) float32 ...
lat (land) float32 ...
Dimensions without coordinates: land
The Rainf variable I am interested in is as follows:
<xarray.DataArray 'Rainf' (time: 248, land: 67209)>
[16667832 values with dtype=float32]
Coordinates:
lon (land) float32 -179.75 -179.75 -179.75 ... 179.75 179.75 179.75
lat (land) float32 71.25 70.75 68.75 68.25 ... -16.25 -16.75 -19.25
* time (time) float64 1.452e+09 1.452e+09 ... 1.454e+09 1.454e+09
Dimensions without coordinates: land
Attributes:
title: Rainf
units: kg/m2s
long_name: Mean rainfall rate over the \nprevious 3 hours
actual_max: 0.008114143
actual_min: 0.0
Fill_value: 1e+20
From here I would like to create a netCDF with the dimensions (time, lat, lon) and the variable Rainf.
I have tried creating a new netCDF (or altering this one), but passing the Rainf variable does not work, as it has a shape of (248, 67209) when a shape of (248, 67209, 67209) is expected, even though the current 'land' dimension of 'Rainf' has lat and lon coordinates. Is it possible to reshape this variable to have time, lat, and lon dimensions?
In the end, it seems that what you want is to reshape the "land" dimension into the ("lat", "lon") ones.
So, you have some DataArray similar to this:
import numpy as np
import xarray as xr

# Setting sizes and coordinates
lon_size, lat_size = 50, 80
lon, lat = [arr.flatten() for arr in np.meshgrid(range(lon_size), range(lat_size))]
land_size = lon_size * lat_size
time_size = 100

da = xr.DataArray(
    dims=("time", "land"),
    data=np.random.randn(time_size, land_size),
    coords=dict(
        time=np.arange(time_size),
        lon=("land", lon),
        lat=("land", lat),
    ),
)
which looks like this:
>>> da
<xarray.DataArray (time: 100, land: 4000)>
array([[...]])
Coordinates:
* time (time) int64 0 1 ... 98 99
lon (land) int64 0 1 ... 48 49
lat (land) int64 0 0 ... 79 79
Dimensions without coordinates: land
First, we'll use the .set_index() method to tell xarray that the "land" index should be represented from the "lon" and "lat" coordinates:
>>> da.set_index(land=("lon", "lat"))
<xarray.DataArray (time: 100, land: 4000)>
array([[...]])
Coordinates:
* time (time) int64 0 1 ... 98 99
* land (land) MultiIndex
- lon (land) int64 0 1 ... 48 49
- lat (land) int64 0 0 ... 79 79
The dimensions are still ("time", "land"), but now "land" is a MultiIndex.
Note that if you try to write to NETCDF at this point you'll have the following error:
>>> da.set_index(land=("lon", "lat")).to_netcdf("data.nc")
NotImplementedError: variable 'land' is a MultiIndex, which cannot yet be serialized to netCDF files (https://github.com/pydata/xarray/issues/1077). Use reset_index() to convert MultiIndex levels into coordinate variables instead.
It tells you to use the .reset_index() method. But that's not what you want here, because it will just go back to the original da state.
What you want from now is to use the .unstack() method:
>>> da.set_index(land=("lon", "lat")).unstack("land")
<xarray.DataArray (time: 100, lon: 50, lat: 80)>
array([[[...]]])
Coordinates:
* time (time) int64 0 1 ... 98 99
* lon (lon) int64 0 1 ... 48 49
* lat (lat) int64 0 1 ... 78 79
It effectively kills the "land" dimension and gives the desired output.
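Applied to the question's data, the final step might look like this (a sketch with assumed names, reusing the rain_nc object from the question; note that unstack fills any (lat, lon) combination with no land point with NaN):
rain_3d = rain_nc['Rainf'].set_index(land=("lat", "lon")).unstack("land")
# reorder to the conventional (time, lat, lon) layout and write out
rain_3d.transpose("time", "lat", "lon").to_netcdf("Rainf_time_lat_lon.nc")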
