I have an xarray dataset with a longitude coordinate running from 0.5 to 359.5, like the following:
Dimensions: (bnds: 2, lat: 40, lev: 35, lon: 31, member_id: 1)
Coordinates:
lev_bnds (lev, bnds) float64 ...
lon_bnds (lon, bnds) float64 ...
lat_bnds (lat, bnds) float64 ...
* lon (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 359.5
* lev (lev) float64 2.5 10.0 20.0 32.5 ... 5e+03 5.5e+03 6e+03 6.5e+03
* lat (lat) float64 -89.5 -88.5 -87.5 -86.5 ... -53.5 -52.5 -51.5 -50.5
* member_id (member_id) object 'r1i1p1f1'
Dimensions without coordinates: bnds
Data variables:
so (member_id, lev, lat, lon) float32 nan nan nan ...
The area I'm interested in is from 60W to 30E, which probably corresponds to longitudes 300.5 to 30.5. Is there any way to slice the dataset between these coordinates?
I tried to use isel(lon=slice(-60, 30)), but it's not possible to go from negative to positive numbers in the slice function.
I know I can just split the data into two small ones (300.5-359.5 and 0.5-30.5), but I was wondering if there is a better way.
Thank you!
As you correctly point out, isel currently can't select from both the start and the end of a dimension in a single pass.
Combined with roll (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.roll.html), though, you can move the points you want into a contiguous region and then select the ones you need; see the sketch below.
NB: I couldn't be sure from your example, but it looks like you may want sel rather than isel, since you seem to be selecting by coordinate value rather than by integer position.
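A minimal sketch of the roll-then-select approach, assuming the dataset above is named ds and that lon covers the full 0.5..359.5 range:

import xarray as xr

# Count the points at each end of the wanted region:
# 300.5..359.5 (west of the prime meridian) and 0.5..30.5 (east of it).
n_west = int((ds.lon >= 300.5).sum())
n_east = int((ds.lon <= 30.5).sum())

# Roll the western points to the front so the region becomes contiguous
# (roll_coords=True rolls the coordinate labels along with the data),
# then take the first n_west + n_east positions with isel.
rolled = ds.roll(lon=n_west, roll_coords=True)
subset = rolled.isel(lon=slice(0, n_west + n_east))

An alternative with the same effect is to re-centre the coordinate, ds.assign_coords(lon=(ds.lon + 180) % 360 - 180).sortby('lon'), after which a single sel(lon=slice(-60, 30.5)) works.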
I am trying to merge multiple nc files containing physical oceanographic data for different depths at different latitudes and longitudes.
I am using ds = xr.open_mfdataset to do this, but the files are not merging correctly, and when I try to plot them there seems to be only one resulting value for the merged files.
This is the code I am using:
## Combining using concat_dim and the nested method
import xarray as xr
import matplotlib.pyplot as plt

ds = xr.open_mfdataset("33HQ20150809*.nc", concat_dim=['latitude'], combine="nested")
ds.to_netcdf('geotraces2015_combined.nc')
df = xr.open_dataset("geotraces2015_combined.nc")

## Setting up values. Oxygen values are transposed so they match the shape of lat and pressure.
oxygen = df['oxygen'].values.transpose()

## Plotting using contourf
fig = plt.figure()
ax = fig.add_subplot(111)
plt.contourf(oxygen, cmap='inferno')
plt.gca().invert_yaxis()
cbar = plt.colorbar(label='Oxygen Concentration (umol kg-1)')
You can download the nc files from here under CTD
https://cchdo.ucsd.edu/cruise/33HQ20150809
This is what each file looks like:
<xarray.Dataset>
Dimensions: (pressure: 744, time: 1, latitude: 1, longitude: 1)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 ... 741.0 742.0 743.0
* time (time) datetime64[ns] 2015-08-12T18:13:00
* latitude (latitude) float32 60.25
* longitude (longitude) float32 -179.1
Data variables: (12/19)
pressure_QC (pressure) int16 ...
temperature (pressure) float64 ...
temperature_QC (pressure) int16 ...
salinity (pressure) float64 ...
salinity_QC (pressure) int16 ...
oxygen (pressure) float64 ...
... ...
CTDNOBS (pressure) float64 ...
CTDETIME (pressure) float64 ...
woce_date (time) int32 ...
woce_time (time) int16 ...
station |S40 ...
cast |S40 ...
Attributes:
EXPOCODE: 33HQ20150809
Conventions: COARDS/WOCE
WOCE_VERSION: 3.0
...
Another file would look like this:
<xarray.Dataset>
Dimensions: (pressure: 179, time: 1, latitude: 1, longitude: 1)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 ... 176.0 177.0 178.0
* time (time) datetime64[ns] 2015-08-18T19:18:00
* latitude (latitude) float32 73.99
* longitude (longitude) float32 -168.8
Data variables: (12/19)
pressure_QC (pressure) int16 ...
temperature (pressure) float64 ...
temperature_QC (pressure) int16 ...
salinity (pressure) float64 ...
salinity_QC (pressure) int16 ...
oxygen (pressure) float64 ...
... ...
CTDNOBS (pressure) float64 ...
CTDETIME (pressure) float64 ...
woce_date (time) int32 ...
woce_time (time) int16 ...
station |S40 ...
cast |S40 ...
Attributes:
EXPOCODE: 33HQ20150809
Conventions: COARDS/WOCE
WOCE_VERSION: 3.0
EDIT: This is my new approach, which is still not working:
I'm trying to use preprocess to apply set_coords, squeeze, and expand_dims, following Michael's approach:
def preprocess(ds):
    return ds.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')
ds = xr.open_mfdataset('33HQ20150809*.nc', concat_dim='station', combine='nested', preprocess=preprocess)
But I'm still having the same problem...
Solution: First, I had to identify the coordinate with the unique value; in my case it was 'station'. Then I used preprocess to apply squeeze, set_coords, and expand_dims to each file, following Michael's answer.
import pandas as pd
import numpy as np
import os
import netCDF4
import pathlib
import matplotlib.pyplot as plt
import xarray as xr

def preprocess(ds):
    return ds.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')

ds = xr.open_mfdataset('filename*.nc', preprocess=preprocess, parallel=True)
ds = ds.sortby('latitude').transpose()
ds.oxygen.plot.contourf(x="latitude", y="pressure")
plt.gca().invert_yaxis()
The xarray data model requires that all data dimensions be orthogonal and complete. In other words, every combination of coordinates along each dimension will be present in the data array (either as data or as NaNs).
You can work with observational data such as yours in xarray, but you have to be careful with the indices to make sure you don't explode the data's dimensionality. Specifically, whenever something is not truly a dimension of the data but is simply an observation or attribute tied to a station or monitor, you should think of it as a data variable rather than a coordinate. In your case, your dimensions seem to be station ID and pressure level (which does not have a full set of observations for each station, but is a dimension of the data). On the other hand, time, latitude, and longitude are attributes of each station, and should not be treated as dimensions.
I'll generate some random data that looks like yours:
import numpy as np
import pandas as pd
import xarray as xr

def generate_random_station():
    station_id = "{:09d}".format(np.random.randint(0, int(1e9)))
    time = np.random.choice(pd.date_range("2015-08-01", "2015-08-31", freq="H"))
    plevs = np.arange(np.random.randint(1, 1000)).astype(float)
    lat = np.random.random() * 10 + 30
    lon = np.random.random() * 10 - 80
    ds = xr.Dataset(
        {
            "salinity": (('pressure', ), np.sin(plevs / 200 + lat)),
            "woce_date": (("time", ), [time]),
            "station": station_id,
        },
        coords={
            "pressure": plevs,
            "time": [time],
            "latitude": [lat],
            "longitude": [lon],
        },
    )
    return ds
This ends up looking like the following:
In [11]: single = generate_random_station()
In [12]: single
Out[12]:
<xarray.Dataset>
Dimensions: (pressure: 37, time: 1, latitude: 1, longitude: 1)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 4.0 ... 33.0 34.0 35.0 36.0
* time (time) datetime64[ns] 2015-08-21T01:00:00
* latitude (latitude) float64 39.61
* longitude (longitude) float64 -72.19
Data variables:
salinity (pressure) float64 0.9427 0.941 0.9393 ... 0.8726 0.8702 0.8677
woce_date (time) datetime64[ns] 2015-08-21T01:00:00
station <U9 '233136181'
The problem is the latitude, longitude, and time coords aren't really dimensions which can be used to index a larger array. They aren't evenly spaced, and each combination of lat/lon/time does not have a station at it. Because of this, we need to be extra careful to make sure that when we combine the data, the lat/lon/time dimensions are not expanded.
To do this, we'll squeeze these dimensions, and expand the datasets along a new dimension, station:
In [13]: single.set_coords('station').squeeze(["latitude", "longitude", "time"]).expand_dims('station')
Out[13]:
<xarray.Dataset>
Dimensions: (pressure: 37, station: 1)
Coordinates:
* station (station) <U9 '233136181'
* pressure (pressure) float64 0.0 1.0 2.0 3.0 4.0 ... 33.0 34.0 35.0 36.0
time datetime64[ns] 2015-08-21T01:00:00
latitude float64 39.61
longitude float64 -72.19
Data variables:
salinity (station, pressure) float64 0.9427 0.941 0.9393 ... 0.8702 0.8677
woce_date (station) datetime64[ns] 2015-08-21T01:00:00
This can be done for all of your datasets, and then they can be concatenated along the "station" dimension:
In [14]: all_stations = xr.concat(
...: [
...: generate_random_station()
...: .set_coords('station')
...: .squeeze(["latitude", "longitude", "time"])
...: .expand_dims('station')
...: for i in range(10)
...: ],
...: dim="station",
...: )
This results in a dataset indexed by pressure level and station:
In [15]: all_stations
Out[15]:
<xarray.Dataset>
Dimensions: (pressure: 657, station: 10)
Coordinates:
* pressure (pressure) float64 0.0 1.0 2.0 3.0 ... 653.0 654.0 655.0 656.0
* station (station) <U9 '197171488' '089978445' ... '107555081' '597650083'
time (station) datetime64[ns] 2015-08-19T06:00:00 ... 2015-08-24T15...
latitude (station) float64 37.96 34.3 38.74 39.28 ... 37.72 33.89 36.46
longitude (station) float64 -74.28 -73.59 -78.33 ... -76.6 -76.47 -77.96
Data variables:
salinity (station, pressure) float64 0.2593 0.2642 0.269 ... 0.8916 0.8893
woce_date (station) datetime64[ns] 2015-08-19T06:00:00 ... 2015-08-24T15...
You can now plot along the latitude and pressure level dimensions:
In [16]: all_stations.salinity.plot.contourf(x="latitude", y="pressure")
I am currently generating several bioclimatic variables (climatic derivatives) to apply to some biodiversity work using UKCP18 data. I am generating bioclimatic variable "Bio 19": Precipitation of the Coldest Quarter (https://pubs.usgs.gov/ds/691/ds691.pdf) which uses tas and pr.
The task involves finding the 3-month rolling sum of mean temperatures, identifying the minimum of these (constituting the coldest quarter) and then extracting the total precipitation over that 3-month rolling period to obtain "Bio 19".
My issue: I can find the coldest quarter (using tas) without problem, but the pr data, along with the time index, is dropped by xarray in the process. That means I cannot know which period to extract the rainfall from, because with my method the data is not linked across variables in that way.
Example code:
# previous code here
...
# calculate rolling climatological mean over 3 month period(s)
ds_rolling_3mos_sum = ds_monthly.rolling(time=3, center=True).sum().dropna('time')
ds_rolling_3mos_sum
<xarray.Dataset>
Dimensions: (time: 12, grid_latitude: 606, grid_longitude: 484)
Coordinates:
* time (time) object 1981-01-01 00:00:00 ... 1981-12...
* grid_latitude (grid_latitude) float64 -4.683 -4.647 ... 8.063
* grid_longitude (grid_longitude) float64 353.9 354.0 ... 364.3
latitude (grid_latitude, grid_longitude) float64 dask.array<chunksize=(606, 484), meta=np.ndarray>
longitude (grid_latitude, grid_longitude) float64 dask.array<chunksize=(606, 484), meta=np.ndarray>
Data variables:
pr (time, grid_latitude, grid_longitude) float32 dask.array<chunksize=(2, 606, 484), meta=np.ndarray>
rotated_latitude_longitude (time) float64 -6.442e+09 ... -6.442e+09
grid_latitude_bnds (time, grid_latitude) float64 dask.array<chunksize=(2, 606), meta=np.ndarray>
grid_longitude_bnds (time, grid_longitude) float64 dask.array<chunksize=(2, 484), meta=np.ndarray>
tas (time, grid_latitude, grid_longitude) float32 dask.array<chunksize=(2, 606, 484), meta=np.ndarray>
Find the coldest quarter:
ds_rolling_3mos_sum_tas_min = ds_rolling_3mos_sum.tas.min('time')
I now have neither the time index information (which I could use to obtain the correct month for rainfall in a .sel()) nor any connected pr data to access for this variable.
<xarray.DataArray 'tas' (grid_latitude: 606, grid_longitude: 484)>
dask.array<nanmin-aggregate, shape=(606, 484), dtype=float32, chunksize=(606, 484), chunktype=numpy.ndarray>
Coordinates:
* grid_latitude (grid_latitude) float64 -4.683 -4.647 -4.611 ... 8.027 8.063
* grid_longitude (grid_longitude) float64 353.9 354.0 354.0 ... 364.3 364.3
latitude (grid_latitude, grid_longitude) float64 dask.array<chunksize=(606, 484), meta=np.ndarray>
longitude (grid_latitude, grid_longitude) float64 dask.array<chunksize=(606, 484), meta=np.ndarray>
I've tried using a .where statement and a few other hands-on techniques, but nothing has worked so far. I feel like there's a recognised xarray operation I'm not aware of! Please help!
This sounds like a case for xarray's advanced indexing! Get excited - this is one of the most fun & powerful features of xarray in my opinion :)
Here's a quick MRE I'll use for this example:
import xarray as xr, numpy as np, pandas as pd

time = pd.date_range('1980-12-01', '1982-01-01', freq='MS')
x = np.arange(-4, 8, 0.25)
y = np.arange(353, 364, 0.25)
tas = np.random.random(size=(len(time), len(y), len(x))) * 100 + 260
pr = np.exp(np.random.random(size=(len(time), len(y), len(x)))) * 100
ds = xr.Dataset(
    {'pr': (('time', 'lat', 'lon'), pr), 'tas': (('time', 'lat', 'lon'), tas)},
    coords={'lat': y, 'lon': x, 'time': time},
)

# rolling 3-month sums, as computed in the question
ds_rolling_3mos_sum = ds.rolling(time=3, center=True).sum().dropna('time')
Your dataset of rolling 3-month sums contains the temperature information that tells you which locations within the precip array you want to select. So we can use the idxmin method to build a 2D DataArray of times, indexed by lat/lon, giving the center of the coldest quarter:
In [24]: # find the time of the min rolled tas for each lat/lon
...: time_of_min_tas_by_cell = ds_rolling_3mos_sum.tas.idxmin(dim='time')
In [25]: time_of_min_tas_by_cell
Out[25]:
<xarray.DataArray 'time' (lat: 44, lon: 48)>
array([['1981-06-01T00:00:00.000000000', '1981-12-01T00:00:00.000000000',
'1981-06-01T00:00:00.000000000', ...,
'1981-03-01T00:00:00.000000000', '1981-04-01T00:00:00.000000000',
'1981-10-01T00:00:00.000000000'],
...,
['1981-07-01T00:00:00.000000000', '1981-09-01T00:00:00.000000000',
'1981-11-01T00:00:00.000000000', ...,
'1981-07-01T00:00:00.000000000', '1981-12-01T00:00:00.000000000',
'1981-09-01T00:00:00.000000000']], dtype='datetime64[ns]')
Coordinates:
* lat (lat) float64 353.0 353.2 353.5 353.8 ... 363.0 363.2 363.5 363.8
* lon (lon) float64 -4.0 -3.75 -3.5 -3.25 -3.0 ... 6.75 7.0 7.25 7.5 7.75
This can be used to index into the precip array directly to find the rolling 3mo precip total at the center of the min temp quarter:
In [30]: ds_rolling_3mos_sum.pr.sel(time=time_of_min_tas_by_cell)
Out[30]:
<xarray.DataArray 'pr' (lat: 44, lon: 48)>
array([[449.50779525, 531.90472182, 747.26749901, ..., 405.24357679,
610.08658199, 488.83666056],
[487.65599173, 567.01802137, 380.9979117 , ..., 613.84289448,
442.8228211 , 629.50269312],
[432.48761645, 444.76568124, 480.11564481, ..., 464.74424834,
543.97169369, 491.91926534],
...,
[488.68368642, 455.70782431, 363.25961252, ..., 457.72558376,
529.17600183, 438.6763931 ],
[370.4485618 , 491.65565156, 391.47992765, ..., 689.95878533,
585.65987576, 407.78032041],
[576.67281438, 551.36298132, 389.643589 , ..., 366.8810199 ,
526.52862773, 593.30879779]])
Coordinates:
* lat (lat) float64 353.0 353.2 353.5 353.8 ... 363.0 363.2 363.5 363.8
* lon (lon) float64 -4.0 -3.75 -3.5 -3.25 -3.0 ... 6.75 7.0 7.25 7.5 7.75
time (lat, lon) datetime64[ns] 1981-06-01 1981-12-01 ... 1981-09-01
xarray knows to reshape the result into an array indexed by (lat, lon) because the time indices are themselves indexed by (lat, lon). So it collapses across time, matching the indexer's lat/lon values to the source array's dims. Cool, right?
I'm using xarray with netCDF meteorological data. I have the usual dimensions time, latitude, and longitude, and two main variables: the wind speed (time, lat, lon) and a latitudinal position (time, lon).
<xarray.Dataset>
Dimensions: (lon: 53, time: 25873, lat: 20)
Coordinates:
* lon (lon) float64 -80.0 -77.5 -75.0 -72.5 ... 45.0 47.5 50.0
* time (time) datetime64[ns] 1950-01-31 ... 2020-12-01
* lat (lat) float32 70.0 67.5 65.0 62.5 ... 27.5 25.0 22.5
Data variables:
uwnd (time, lat, lon) float32 -0.0625 0.375 ... -1.812 -2.75
positions (time, lon) float64 40.0 40.0 45.0 ... 70.0 70.0 70.0
For each (time, lon), I'd like to calculate a latitudinal average of the wind around that position.
With a loop, I would do this (for a +-2.5 degree latitude average):
for i in ds.lon.values:
    for t in ds.time.values:
        pos = ds.positions.sel(lon=i, time=t).values
        wind_averaged.loc[t, i] = (
            ds.uwnd.sel(lon=i, time=t)
            .sel(lat=slice(pos + 2.5, pos - 2.5))
            .mean('lat')
        )
This is obviously very bad, so I wanted to use slice() like this:
wind_averaged = ds.uwnd.sel(lat=slice(2.5 + ds.positions.values, ds.positions.values - 2.5)).mean('lat')
but it gives an error because you "cannot use non-scalar arrays in a slice for xarray indexing".
Is there any alternative that does what I want without the two for loops, using xarray's power?
Thanks
I believe you are looking for the multidimensional groupby. If I understand correctly, there is a tutorial for this problem here: https://xarray.pydata.org/en/stable/examples/multidimensional-coords.html
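As an aside, a loop-free alternative to the linked groupby approach is to broadcast a boolean mask from the positions variable and average over it with where. A minimal sketch, assuming ds is the dataset shown in the question:

# lat has dims (lat,), positions has dims (time, lon); comparing them
# broadcasts to a (lat, time, lon) boolean mask of the +-2.5 degree band
mask = (ds.lat >= ds.positions - 2.5) & (ds.lat <= ds.positions + 2.5)

# where() keeps uwnd inside the band and sets NaN elsewhere;
# mean('lat') then skips the NaNs, leaving a (time, lon) result
wind_averaged = ds.uwnd.where(mask).mean('lat')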
I want to represent two datasets in the same plot, so I am merging them using xarray. This is what they look like:
ds1
<xarray.Dataset>
Dimensions: (time: 1, lat: 1037, lon: 1345)
Coordinates:
* lat (lat) float32 37.7 37.7 37.69 37.69 37.69 ... 35.01 35.01 35.0 35.0
* time (time) datetime64[ns] 2021-11-23
* lon (lon) float32 -9.001 -8.999 -8.996 -8.993 ... -5.507 -5.504 -5.501
Data variables:
CHL (time, lat, lon) float32 ...
ds2
<xarray.Dataset>
Dimensions: (time: 1, lat: 852, lon: 1168)
Coordinates:
* time (time) datetime64[ns] 2021-11-23
* lat (lat) float32 35.0 35.0 35.01 35.01 35.01 ... 37.29 37.29 37.3 37.3
* lon (lon) float32 -5.501 -5.498 -5.494 -5.491 ... -1.507 -1.503 -1.5
Data variables:
CHL (time, lat, lon) float32 ...
So then I use:
ds3 = xr.merge([ds1,ds2])
It works for the dimensions, but my variable CHL becomes nan:
<xarray.Dataset>
Dimensions: (lat: 1887, lon: 2513, time: 1)
Coordinates:
* lat (lat) float64 35.0 35.0 35.0 35.0 35.01 ... 37.69 37.69 37.7 37.7
* lon (lon) float64 -9.001 -8.999 -8.996 -8.993 ... -1.507 -1.503 -1.5
* time (time) datetime64[ns] 2021-11-23
Data variables:
CHL (time, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan
So when I plot this dataset I have the following result:
I assume those white stripes are caused by the variable CHL becoming nan...
Any ideas of what could be happening? Thank you!
I don't think that any values become NaN. Rather, I think that the latitude coordinates simply differ. Because you are doing an outer join (the default for xr.merge), xarray has to fill the array at places where it has no information about the values, and the default fill_value is NaN.
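A tiny self-contained illustration of this behaviour, with made-up values:

import xarray as xr

# two tiles whose lat values differ slightly and whose lon ranges don't overlap
a = xr.Dataset({'CHL': (('lat', 'lon'), [[1.0, 2.0]])},
               coords={'lat': [35.0], 'lon': [-9.0, -8.0]})
b = xr.Dataset({'CHL': (('lat', 'lon'), [[3.0, 4.0]])},
               coords={'lat': [35.001], 'lon': [-7.0, -6.0]})

print(xr.merge([a, b]).CHL.values)
# [[ 1.  2. nan nan]
#  [nan nan  3.  4.]]   <- NaN wherever neither tile has data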
So the question is, what values would you expect in these locations?
One possibility could be to fill the missing places by interpolation. In several dimensions this might be tricky, but as far as I can see you are just placing two images next to each other with no overlap in the lon dimension.
In that case, xarray lets you interpolate along the lat dimension easily:
ds3["CHL"] = ds3["CHL"].interpolate_na(dim="lat", method="linear")
I have precipitation data in a netCDF file which I downloaded from the CMIP5 database. I was able to make a subset of the file, and the attributes I obtained are given below. These data have a 2.5 x 3.75 degree spatial resolution, and I now need to convert them to 0.05 degree spatial resolution. Can anyone help me by showing how to do this in Python?
Please keep in mind that I am using Python 3.7 on a Windows machine; CDO and NCO are not well suited to Windows. The data properties are here.
Dimensions: (bnds: 2, lat: 15, lon: 13, time: 122)
Coordinates:
* time (time) float64 15.0 45.0 75.0 ... 3.585e+03 3.615e+03 3.645e+03
* lat (lat) float64 -42.5 -40.0 -37.5 -35.0 ... -15.0 -12.5 -10.0 -7.5
* lon (lon) float64 112.5 116.2 120.0 123.8 ... 146.2 150.0 153.8 157.5
Dimensions without coordinates: bnds
Data variables:
time_bnds (time, bnds) float64 ...
lat_bnds (lat, bnds) float64 ...
lon_bnds (lon, bnds) float64 ...
pr (time, lat, lon) float32 ...
I would be grateful for any help. Thanks in advance.
I can propose a solution along these lines, using some random data, in which I regrid the data from one resolution to another.
#!/usr/bin/env ipython
# ---------------------
import numpy as np
from netCDF4 import Dataset

# -----------------------------
ntime, nlon, nlat = 10, 10, 10
lonin = np.linspace(0., 1., nlon)
latin = np.linspace(0., 1., nlat)
dataout = np.random.random((ntime, nlat, nlon))
unout = 'seconds since 2018-01-01 00:00:00'
# ---------------------
# make data:
ncout = Dataset('in.nc', 'w', format='NETCDF3_CLASSIC')
ncout.createDimension('lon', nlon)
ncout.createDimension('lat', nlat)
ncout.createDimension('time', None)
ncout.createVariable('lon', 'float32', ('lon',))
ncout.variables['lon'][:] = lonin
ncout.createVariable('lat', 'float32', ('lat',))
ncout.variables['lat'][:] = latin
ncout.createVariable('time', 'float64', ('time',))
ncout.variables['time'].setncattr('units', unout)
ncout.variables['time'][:] = np.linspace(0, 3600 * ntime, ntime)
ncout.createVariable('randomdata', 'float32', ('time', 'lat', 'lon'))
ncout.variables['randomdata'][:] = dataout
ncout.close()
# ----------------------
# regrid:
from scipy.interpolate import griddata

lonout = np.linspace(0., 1., 20)
latout = np.linspace(0., 1., 20)

ncout = Dataset('out.nc', 'w', format='NETCDF3_CLASSIC')
ncout.createDimension('lon', np.size(lonout))
ncout.createDimension('lat', np.size(latout))
ncout.createDimension('time', None)
ncout.createVariable('lon', 'float32', ('lon',))
ncout.variables['lon'][:] = lonout
ncout.createVariable('lat', 'float32', ('lat',))
ncout.variables['lat'][:] = latout
ncout.createVariable('time', 'float64', ('time',))
ncout.variables['time'].setncattr('units', unout)
ncout.variables['time'][:] = np.linspace(0, 3600 * ntime, ntime)
ncout.createVariable('randomdata', 'float32', ('time', 'lat', 'lon'))

# interpolate each time step from the coarse grid onto the fine grid
ncin = Dataset('in.nc')
lonin = ncin.variables['lon'][:]
latin = ncin.variables['lat'][:]
lonmin, latmin = np.meshgrid(lonin, latin)
lonmout, latmout = np.meshgrid(lonout, latout)
for itime in range(np.size(ncin.variables['time'][:])):
    zout = griddata(
        (lonmin.flatten(), latmin.flatten()),
        ncin.variables['randomdata'][itime, :, :].flatten(),
        (lonmout, latmout),
        method='linear',
    )
    ncout.variables['randomdata'][itime, :] = zout
ncin.close()
ncout.close()
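Since the question shows xarray-style data anyway, here is a shorter alternative sketch using DataArray.interp for the same linear interpolation. The filename and the 0.05-degree target grid are assumptions based on the subset shown in the question:

import numpy as np
import xarray as xr

ds = xr.open_dataset('subset.nc')  # hypothetical filename for the subset above

# build the 0.05-degree target grid spanning the existing coordinates
new_lat = np.arange(float(ds.lat.min()), float(ds.lat.max()) + 0.05, 0.05)
new_lon = np.arange(float(ds.lon.min()), float(ds.lon.max()) + 0.05, 0.05)

# linear interpolation onto the fine grid (note: this refines the grid,
# but it cannot add real spatial detail beyond the original resolution)
pr_fine = ds.pr.interp(lat=new_lat, lon=new_lon, method='linear')
pr_fine.to_netcdf('pr_0.05deg.nc')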