I'm currently working on a project that involves clipping Xarray datasets based on a certain grid of latitude and longitude values. To do this, I've imported a file with the specified lats and lons, and added dimensions onto the dataset to give it the necessary lat/lon information. However, I also want to pass the dimensions to a data variable (called 'precip') within the dataset. I know that I'm able to pass the dimensions to the array of values in the variable, but can they be passed to the variable itself?
My code is below:
precip = obs_dataset_full.precip.expand_dims(lon=350, lat=450)
precip.coords['Longitude'] = (-1 * (360 - obs_dataset_full.Longitude))
precip.coords['Latitude'] = obs_dataset_full.Latitude
precip
With output as such:
<xarray.Dataset>
Dimensions:    (dim_0: 350, dim_1: 451, lat: 450, lon: 350)
Coordinates:
  * dim_0      (dim_0) int64 0 1 2 3 4 5 ... 345 346 347 348 349
  * dim_1      (dim_1) int64 0 1 2 3 4 5 ... 446 447 448 449 450
    Longitude  (lon) float64 -105.7 -105.7 ... -78.34 -78.26
    Latitude   (lat) float64 35.04 35.08 35.11 ... 51.52 51.56
Data variables:
    precip     (dim_0, dim_1) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 nan
Attributes: (0)
Specifically, I want the data variable precip to also possess dimensions of lat and lon, as the dataset does.
Thanks in advance!
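For reference: expand_dims adds new axes rather than relabelling the existing ones, so one option is to rename the placeholder dims on the variable itself. A minimal sketch, assuming dim_0 corresponds to lon and dim_1 to lat (note that the repr above shows dim_1 as 451 but lat as 450, so that size mismatch would need resolving first):
# relabel the placeholder dims so the lon/lat coordinates can attach to precip
# (assumes dim_0 <-> lon and dim_1 <-> lat; the sizes must match for this to work)
precip = obs_dataset_full.precip.rename({'dim_0': 'lon', 'dim_1': 'lat'})
precip.coords['Longitude'] = -1 * (360 - obs_dataset_full.Longitude)
precip.coords['Latitude'] = obs_dataset_full.Latitude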
I'm using Xarray and netCDF meteorological data. I have the usual dimensions time, latitude and longitude and two main variables: the wind speed (time, lat, lon) and a latitudinal position (time, lon).
<xarray.Dataset>
Dimensions: (lon: 53, time: 25873, lat: 20)
Coordinates:
* lon (lon) float64 -80.0 -77.5 -75.0 -72.5 ... 45.0 47.5 50.0
* time (time) datetime64[ns] 1950-01-31 ... 2020-12-01
* lat (lat) float32 70.0 67.5 65.0 62.5 ... 27.5 25.0 22.5
Data variables:
uwnd (time, lat, lon) float32 -0.0625 0.375 ... -1.812 -2.75
positions (time, lon) float64 40.0 40.0 45.0 ... 70.0 70.0 70.0
For each time, lon, I'd like to calculate a latitudinal average around the positions.
If I do a loop, I would do this (for a +-2.5° latitude average):
for i in ds.lon.values:
    for t in ds.time.values:
        pos = ds.positions.sel(lon=i, time=t).values
        wind_averaged.loc[t, i] = ds.uwnd.sel(lon=i, time=t).sel(
            lat=slice(pos + 2.5, pos - 2.5)).mean('lat')
This is obviously very bad and I wanted to use slice() like this:
wind_averaged = ds.uwnd.sel(
    lat=slice(ds.positions.values + 2.5, ds.positions.values - 2.5)).mean('lat')
but it gives an error because I
cannot use non-scalar arrays in a slice for xarray indexing
Is there any alternative that uses Xarray's power to do what I want without two for loops?
Thanks
I believe you are looking for the multidimensional groupby. If I understand correctly, there is a tutorial for this problem here: https://xarray.pydata.org/en/stable/examples/multidimensional-coords.html
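If the groupby route doesn't fit, a mask-based alternative (a sketch, assuming the uwnd and positions names from the question) avoids both loops by letting xarray broadcast:
# boolean mask that is True where lat is within +-2.5 deg of the position at
# each (time, lon); xarray broadcasts lat against the (time, lon) positions
mask = (ds.lat >= ds.positions - 2.5) & (ds.lat <= ds.positions + 2.5)
wind_averaged = ds.uwnd.where(mask).mean('lat')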
I am trying to calculate the distribution of a variable in an xarray Dataset. I can achieve what I am looking for by converting it to a pandas dataframe as follows:
import numpy as np
import pandas as pd
import xarray as xr

lon = np.linspace(0, 10, 11)
lat = np.linspace(0, 10, 11)
time = np.linspace(0, 10, 1000)
temperature = 3 * np.random.randn(len(lat), len(lon), len(time))
ds = xr.Dataset(
    data_vars=dict(
        temperature=(["lat", "lon", "time"], temperature),
    ),
    coords=dict(
        lon=lon,
        lat=lat,
        time=time,
    ),
)
bin_t = np.linspace(-10, 10, 21)
DS = ds.to_dataframe()
DS.loc[:, 'temperature_bin'] = pd.cut(
    DS['temperature'], bin_t, labels=(bin_t[0:-1] + bin_t[1:]) * 0.5)
DS_stats = DS.reset_index().groupby(['lat', 'lon', 'temperature_bin']).count()
ds_stats = DS_stats.to_xarray()
<xarray.Dataset>
Dimensions: (lat: 11, lon: 11, temperature_bin: 20)
Coordinates:
* lat (lat) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
* lon (lon) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
* temperature_bin (temperature_bin) float64 -9.5 -8.5 -7.5 ... 7.5 8.5 9.5
Data variables:
time (lat, lon, temperature_bin) int64 0 1 8 13 18 ... 9 5 3 0
temperature (lat, lon, temperature_bin) int64 0 1 8 13 18 ... 9 5 3 0
Is there a way to generate ds_stats without converting to a dataframe? I have tried to use groupby_bins but this does not preserve coordinates.
print(ds.groupby_bins('temperature', bin_t).count())
<xarray.Dataset>
Dimensions: (temperature_bins: 20)
Coordinates:
* temperature_bins (temperature_bins) object (-10.0, -9.0] ... (9.0, 10.0]
Data variables:
temperature (temperature_bins) int64 121 315 715 1677 ... 709 300 116
Using xhistogram may be helpful.
With the same definitions as you had set above,
from xhistogram import xarray as xhist
ds_stats = xhist.histogram(ds.temperature, bins=bin_t, dim=['time'])
should do the trick.
The one difference is that it returns a DataArray, not a Dataset, so if you want to do it for multiple variables, you'll have to do it separately for each one and then recombine, I believe.
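For example, a sketch of that recombination (the variable list here is illustrative):
import xarray as xr
from xhistogram import xarray as xhist

# histogram each variable over time separately, then merge the named
# DataArrays back into a single Dataset; ds and bin_t as defined above
ds_stats = xr.merge(
    [xhist.histogram(ds[name], bins=bin_t, dim=['time']) for name in ['temperature']]
)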
I'm currently using the Python module rioxarray to clip an Xarray dataset based on a specific geometry to produce a latitude/longitude grid of coordinates. My data is below:
obs_dataset_full
<xarray.Dataset>
Dimensions:      (Lat: 451, Lon: 350, lat: 450, lon: 350)
Coordinates:
  * Lon          (Lon) int64 0 1 2 3 4 5 ... 345 346 347 348 349
  * Lat          (Lat) int64 0 1 2 3 4 5 ... 446 447 448 449 450
    Longitude    (lon) float64 -105.7 -105.7 ... -78.34 -78.26
    Latitude     (lat) float64 35.04 35.08 35.11 ... 51.52 51.56
    spatial_ref  int64 0
Data variables:
    precip_var   (lon, lat, Lon, Lat) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 nan
Attributes:
    grid_mapping:  spatial_ref
Note: The Lon/Lat dimensions are irrelevant; I'm trying to utilize the lon/lat, which have the actual coordinates.
obs_dataset_full.rio.write_crs('EPSG:4326', inplace=True)
obs_dataset_cropped = obs_dataset_full.rio.clip(geometries=cropping_geometries, crs='EPSG:4326')
When I run this code, I get the following error:
DimensionMissingCoordinateError: lon missing coordinates.
Both the obs_dataset_full dataset and the precip_var data array have the appropriate coordinates, and the rioxarray documentation page is not particularly clear as to what this exception entails. Any help is much appreciated!
The issue you are facing is that rioxarray expects your spatial dimensions and coordinates to have the same name. I would recommend using the rename method in xarray to rename the dimensions and coordinates so that both are longitude and latitude, or x and y.
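A sketch of that fix, assuming the names shown in the question (renaming the Longitude/Latitude coordinates onto their lon/lat dimensions turns them into proper dimension coordinates, which rioxarray can then find):
# give the spatial dims and coordinates matching names, then point rioxarray at them
obs_dataset_full = obs_dataset_full.rename({'Longitude': 'lon', 'Latitude': 'lat'})
obs_dataset_full = obs_dataset_full.rio.set_spatial_dims(x_dim='lon', y_dim='lat')
obs_dataset_full.rio.write_crs('EPSG:4326', inplace=True)
obs_dataset_cropped = obs_dataset_full.rio.clip(geometries=cropping_geometries, crs='EPSG:4326')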
I have multiple netCDF files that I eventually want to merge. An example netCDF is as follows.
import xarray as xr
import numpy as np
import cftime
rain_nc = xr.open_dataset('filepath.nc', decode_times=False)
print(rain_nc)
<xarray.Dataset>
Dimensions: (land: 67209, tstep: 248)
Dimensions without coordinates: land, tstep
Data variables:
    lon      (land) float32 ...
    lat      (land) float32 ...
    timestp  (tstep) int32 ...
    time     (tstep) int32 ...
    Rainf    (tstep, land) float32 ...
The dimension 'land' is a count of numbers 1 to 67209, and 'tstep' is a count from 1 to 248.
The variables 'lat' and 'lon' are latitude and longitude values with a shape of (67209,).
The variable 'time' is the time in seconds since the start of the month (the netCDF file is a month long).
Next, I've swapped the dimensions from 'tstep' to 'time', converted the time for later merging, and set lon, lat, and time as coordinates.
rain_nc = rain_nc.swap_dims({'tstep': 'time'})
rain_nc = rain_nc.set_coords(['lon', 'lat', 'time'])
rain_nc['time'] = cftime.num2date(rain_nc['time'], units='seconds since 2016-01-01 00:00:00', calendar='standard')
rain_nc['time'] = cftime.date2num(rain_nc['time'], units='seconds since 1970-01-01 00:00:00', calendar='standard')
This has left me with the following Dataset:
print(rain_nc)
<xarray.Dataset>
Dimensions: (land: 67209, time: 248)
Coordinates:
    lon      (land) float32 ...
    lat      (land) float32 ...
  * time     (time) float64 ...
Dimensions without coordinates: land
Data variables:
    timestp  (time) int32 ...
    Rainf    (time, land) float32 ...
print(rain_nc['land'])
<xarray.DataArray 'land' (land: 67209)>
array([ 0, 1, 2,..., 67206, 67207, 67208])
Coordinates:
lon (land) float32 ...
lat (land) float32 ...
Dimensions without coordinates: land
The Rainf variable I am interested in is as follows:
<xarray.DataArray 'Rainf' (time: 248, land: 67209)>
[16667832 values with dtype=float32]
Coordinates:
    lon      (land) float32 -179.75 -179.75 -179.75 ... 179.75 179.75 179.75
    lat      (land) float32 71.25 70.75 68.75 68.25 ... -16.25 -16.75 -19.25
* time (time) float64 1.452e+09 1.452e+09 ... 1.454e+09 1.454e+09
Dimensions without coordinates: land
Attributes:
title: Rainf
units: kg/m2s
long_name: Mean rainfall rate over the \nprevious 3 hours
actual_max: 0.008114143
actual_min: 0.0
Fill_value: 1e+20
From here I would like to create a netCDF file with the dimensions (time, lat, lon) and the variable Rainf.
I have tried creating a new netCDF (or altering this one), but passing the Rainf variable does not work: it has a shape of (248, 67209), whereas a (time, lat, lon) layout would need (248, 67209, 67209), even though the current 'land' dimension of Rainf carries both a lat and a lon coordinate. Is it possible to reshape this variable to have time, lat, and lon dimensions?
In the end, it seems that what you want is to reshape the "land" dimension into the ("lat", "lon") ones.
So, you have some DataArray similar to this:
import numpy as np
import xarray as xr

# Setting sizes and coordinates
lon_size, lat_size = 50, 80
lon, lat = [arr.flatten() for arr in np.meshgrid(range(lon_size), range(lat_size))]
land_size = lon_size * lat_size
time_size = 100
da = xr.DataArray(
    dims=("time", "land"),
    data=np.random.randn(time_size, land_size),
    coords=dict(
        time=np.arange(time_size),
        lon=("land", lon),
        lat=("land", lat),
    ),
)
which looks like this:
>>> da
<xarray.DataArray (time: 100, land: 4000)>
array([[...]])
Coordinates:
* time (time) int64 0 1 ... 98 99
lon (land) int64 0 1 ... 48 49
lat (land) int64 0 0 ... 79 79
Dimensions without coordinates: land
First, we'll use the .set_index() method to tell xarray that the "land" index should be represented from the "lon" and "lat" coordinates:
>>> da.set_index(land=("lon", "lat"))
<xarray.DataArray (time: 100, land: 4000)>
array([[...]])
Coordinates:
* time (time) int64 0 1 ... 98 99
* land (land) MultiIndex
- lon (land) int64 0 1 ... 48 49
- lat (land) int64 0 0 ... 79 79
The dimensions are still ("time", "land"), but now "land" is a MultiIndex.
Note that if you try to write to netCDF at this point, you'll get the following error:
>>> da.set_index(land=("lon", "lat")).to_netcdf("data.nc")
NotImplementedError: variable 'land' is a MultiIndex, which cannot yet be serialized to netCDF files (https://github.com/pydata/xarray/issues/1077). Use reset_index() to convert MultiIndex levels into coordinate variables instead.
It tells you to use the .reset_index() method. But that's not what you want here, because it will just go back to the original da state.
What you want instead is to use the .unstack() method:
>>> da.set_index(land=("lon", "lat")).unstack("land")
<xarray.DataArray (time: 100, lon: 50, lat: 80)>
array([[[...]]])
Coordinates:
* time (time) int64 0 1 ... 98 99
* lon (lon) int64 0 1 ... 48 49
* lat (lat) int64 0 1 ... 78 79
It effectively kills the "land" dimension and gives the desired output.
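Applied to the dataset from the question, that would look something like this sketch (note that unstacking scattered land points yields a full lat/lon grid, with NaNs wherever no land cell exists):
# build a (lat, lon) MultiIndex from the coordinates, then unstack it;
# assumes each (lat, lon) pair occurs at most once along "land", and the
# (lat, lon) order gives the desired (time, lat, lon) layout after unstacking
rain_latlon = rain_nc.set_index(land=("lat", "lon")).unstack("land")
rain_latlon.to_netcdf("rainf_latlon.nc")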
I have two xr.Dataset objects. One is a continuous map of some variable (here precipitation). The other is a categorical map of a set of regions
['region_1', 'region_2', 'region_3', 'region_4'].
I want to calculate the mean precip in each region at each timestep by masking by region/time and then outputting a dataframe looking like the below.
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
I have some code, but it runs very slowly for the real datasets. Can anyone help me optimize it?
A minimal reproducible example
Initialising our objects: two variables of the same shape. In reality, the region object will have been read from a shapefile and will have more than two regions.
import xarray as xr
import pandas as pd
import numpy as np
def make_dataset(
    variable_name='precip',
    size=(30, 30),
    start_date='2008-01-01',
    end_date='2010-01-01',
    lonmin=-180.0,
    lonmax=180.0,
    latmin=-55.152,
    latmax=75.024,
):
    # create 2D lat/lon dimensions
    lat_len, lon_len = size
    longitudes = np.linspace(lonmin, lonmax, lon_len)
    latitudes = np.linspace(latmin, latmax, lat_len)
    dims = ["lat", "lon"]
    coords = {"lat": latitudes, "lon": longitudes}

    # add time dimension
    times = pd.date_range(start_date, end_date, name="time", freq="M")
    size = (len(times), size[0], size[1])
    dims.insert(0, "time")
    coords["time"] = times

    # create values
    var = np.random.randint(100, size=size)
    return xr.Dataset({variable_name: (dims, var)}, coords=coords), size
ds, size = make_dataset()
# create dummy regions (not contiguous, but that doesn't matter for this example)
region_ds = xr.ones_like(ds).rename({'precip': 'region'})
array = np.random.choice([0, 1, 2, 3], size=size)
region_ds = region_ds * array

# create a dictionary explaining what the regions are
region_lookup = {
    0: 'region_1',
    1: 'region_2',
    2: 'region_3',
    3: 'region_4',
}
What do these objects look like?
In[]: ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
Data variables:
precip (time, lat, lon) int64 51 92 14 71 60 20 82 ... 16 33 34 98 23 53
In[]: region_ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
Data variables:
region (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
Current Implementation
In order to calculate the mean of the variable in ds in each of the regions ['region_1', 'region_2', ...] in region_ds at each time, I need to loop over the TIME and the REGION.
I loop over each REGION, and then each TIMESTEP in the da object. This gets slow as the dataset grows (more pixels and more timesteps). Is there a more efficient / vectorized use of numpy / xarray that will get me the desired result faster?
def drop_nans_and_flatten(dataArray: xr.DataArray) -> np.ndarray:
    """Flatten the array and drop NaNs from it. Useful for plotting histograms.

    Arguments:
    ---------
    dataArray (xr.DataArray):
        the DataArray holding the values you want to flatten
    """
    # drop NaNs and flatten
    return dataArray.values[~np.isnan(dataArray.values)]
da = ds.precip
region_da = region_ds.region
valid_region_ids = list(region_lookup.keys())

# initialise empty lists
region_names = []
datetimes = []
mean_values = []

for valid_region_id in valid_region_ids:
    for time in da.time.values:
        region_names.append(region_lookup[valid_region_id])
        datetimes.append(time)
        # extract all non-nan values for that time-region
        mean_values.append(
            da.sel(time=time).where(region_da == valid_region_id).mean().values
        )

df = pd.DataFrame(
    {
        "datetime": datetimes,
        "region_name": region_names,
        "mean_value": mean_values,
    }
)
The output:
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
In [7]: df.tail()
Out[7]:
datetime region_name mean_value
43 2009-08-31 region_4 50.83111111111111
44 2009-09-30 region_4 48.40888888888889
45 2009-10-31 region_4 51.56148148148148
46 2009-11-30 region_4 48.961481481481485
47 2009-12-31 region_4 48.36296296296296
In [20]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
datetime 96 non-null datetime64[ns]
region_name 96 non-null object
mean_value 96 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 2.4+ KB
In [21]: df.describe()
Out[21]:
datetime region_name mean_value
count 96 96 96
unique 24 4 96
top 2008-10-31 00:00:00 region_1 48.88984800150122
freq 4 24 1
first 2008-01-31 00:00:00 NaN NaN
last 2009-12-31 00:00:00 NaN NaN
Any help would be very much appreciated! Thank you.
It's hard to avoid iterating to generate the masks for the regions given how they are defined, but once you have those constructed (e.g. with the code below), I think the following would be pretty efficient:
regions = xr.concat(
    [(region_ds.region == region_id).expand_dims(region=[region])
     for region_id, region in region_lookup.items()],
    dim='region'
)
result = ds.precip.where(regions).mean(['lat', 'lon'])
This generates a DataArray with 'time' and 'region' dimensions, where the value at each point is the mean at a given time over a given region. It would be straightforward to extend this to an area-weighted average if that were desired too.
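For instance, a sketch of that area weighting with cosine-of-latitude weights:
import numpy as np

# weight each grid cell by the cosine of its latitude before averaging
weights = np.cos(np.deg2rad(ds.lat))
result = ds.precip.where(regions).weighted(weights).mean(['lat', 'lon'])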
An alternative option that generates the same result would be:
regions = xr.DataArray(
    list(region_lookup.keys()),
    coords=[list(region_lookup.values())],
    dims=['region']
)
result = ds.precip.where(regions == region_ds.region).mean(['lat', 'lon'])
Here regions is basically just a DataArray representation of the region_lookup dictionary.
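To get from result back to the dataframe layout shown in the question, something like this sketch should work:
# name the values column, then flatten the (region, time) index into columns
df = (
    result.rename('mean_value')
    .to_dataframe()
    .reset_index()
    .rename(columns={'time': 'datetime', 'region': 'region_name'})
)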