xarray groupby coordinates and non-coordinate variables - python

I am trying to calculate the distribution of a variable in an xarray Dataset. I can achieve what I am looking for by converting the xarray object to a pandas dataframe as follows:
import numpy as np
import pandas as pd
import xarray as xr

lon = np.linspace(0, 10, 11)
lat = np.linspace(0, 10, 11)
time = np.linspace(0, 10, 1000)
temperature = 3 * np.random.randn(len(lat), len(lon), len(time))
ds = xr.Dataset(
    data_vars=dict(
        temperature=(["lat", "lon", "time"], temperature),
    ),
    coords=dict(
        lon=lon,
        lat=lat,
        time=time,
    ),
)
bin_t = np.linspace(-10, 10, 21)

DS = ds.to_dataframe()
DS.loc[:, 'temperature_bin'] = pd.cut(
    DS['temperature'], bin_t, labels=(bin_t[0:-1] + bin_t[1:]) * 0.5
)
DS_stats = DS.reset_index().groupby(['lat', 'lon', 'temperature_bin']).count()
ds_stats = DS_stats.to_xarray()
<xarray.Dataset>
Dimensions: (lat: 11, lon: 11, temperature_bin: 20)
Coordinates:
* lat (lat) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
* lon (lon) float64 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
* temperature_bin (temperature_bin) float64 -9.5 -8.5 -7.5 ... 7.5 8.5 9.5
Data variables:
time (lat, lon, temperature_bin) int64 0 1 8 13 18 ... 9 5 3 0
temperature (lat, lon, temperature_bin) int64 0 1 8 13 18 ... 9 5 3 0
Is there a way to generate ds_stats without converting to a dataframe? I have tried to use groupby_bins but this does not preserve coordinates.
print(ds.groupby_bins('temperature',bin_t).count())
<xarray.Dataset>
Dimensions: (temperature_bins: 20)
Coordinates:
* temperature_bins (temperature_bins) object (-10.0, -9.0] ... (9.0, 10.0]
Data variables:
temperature (temperature_bins) int64 121 315 715 1677 ... 709 300 116

Using xhistogram may be helpful.
With the same definitions you set above,
from xhistogram import xarray as xhist
ds_stats = xhist.histogram(ds.temperature, bins=bin_t, dim=['time'])
should do the trick.
The one difference is that it returns a DataArray, not a Dataset, so if you want to do this for multiple variables, I believe you'll have to do it separately for each one and then recombine.
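For the multi-variable case, a minimal sketch of that recombination (assuming ds holds further data variables; the per-variable rename and the xr.merge call are my own glue, not part of xhistogram's API):
# one histogram DataArray per variable, merged back into a Dataset
hists = [
    xhist.histogram(ds[name], bins=bin_t, dim=['time']).rename(name)
    for name in ds.data_vars
]
ds_stats = xr.merge(hists)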

Related

Merging datasets with xarray makes variables become NaN

I want to represent two datasets in the same plot, so I am merging them using xarray. This is how they look:
ds1
<xarray.Dataset>
Dimensions: (time: 1, lat: 1037, lon: 1345)
Coordinates:
* lat (lat) float32 37.7 37.7 37.69 37.69 37.69 ... 35.01 35.01 35.0 35.0
* time (time) datetime64[ns] 2021-11-23
* lon (lon) float32 -9.001 -8.999 -8.996 -8.993 ... -5.507 -5.504 -5.501
Data variables:
CHL (time, lat, lon) float32 ...
ds2
<xarray.Dataset>
Dimensions: (time: 1, lat: 852, lon: 1168)
Coordinates:
* time (time) datetime64[ns] 2021-11-23
* lat (lat) float32 35.0 35.0 35.01 35.01 35.01 ... 37.29 37.29 37.3 37.3
* lon (lon) float32 -5.501 -5.498 -5.494 -5.491 ... -1.507 -1.503 -1.5
Data variables:
CHL (time, lat, lon) float32 ...
So then I use:
ds3 = xr.merge([ds1,ds2])
It works for the dimensions, but my variable CHL becomes nan:
<xarray.Dataset>
Dimensions: (lat: 1887, lon: 2513, time: 1)
Coordinates:
* lat (lat) float64 35.0 35.0 35.0 35.0 35.01 ... 37.69 37.69 37.7 37.7
* lon (lon) float64 -9.001 -8.999 -8.996 -8.993 ... -1.507 -1.503 -1.5
* time (time) datetime64[ns] 2021-11-23
Data variables:
CHL (time, lat, lon) float32 nan nan nan nan nan ... nan nan nan nan
So when I plot this dataset, the result shows white stripes.
I assume those white stripes are caused by the variable CHL becoming NaN...
Any ideas of what could be happening? Thank you!
I don't think that any values become NaNs. Rather, I think that the latitude coordinates simply differ. Because you do an outer join (the default for xr.merge), xarray has to fill the grid at places where it has no information about the values; the default fill_value is NaN.
So the question is, what values would you expect in these locations?
One possibility could be to fill the missing places by interpolation. In several dimensions this might be tricky, but as far as I see you are just placing two images next to each other with no overlap in the lon dimension.
In that case, xarray lets you interpolate the lat dimension easily:
ds3["CHL"].interpolate_na(dim="lat", method="linear")

How would I parse a text file into a 3D array in Python?

I have a text file with 82355 rows. I want to define this as density(82355) and parse it into a three-dimensional array called rho(5,91,181), where
5 is for the number of days,
91 is for the number of latitudes,
181 is for the number of longitudes.
This is an example of the text file with the first 11 values. There are 5 days with latitudes from -90 to 90 in two-degree steps. For each latitude, there are 181 rows, corresponding to longitudes from 0 to 360 in two-degree steps.
Latitude Longitude rho
0 -90.0 0.0 3.396760e-12
1 -90.0 2.0 3.397140e-12
2 -90.0 4.0 3.397510e-12
3 -90.0 6.0 3.397870e-12
4 -90.0 8.0 3.398470e-12
5 -90.0 10.0 3.399060e-12
6 -90.0 12.0 3.399810e-12
7 -90.0 14.0 3.400560e-12
8 -90.0 16.0 3.401440e-12
9 -90.0 18.0 3.402310e-12
10 -90.0 20.0 3.403200e-12
I'm confused about how to start parsing this text file with 82355 rows into a 3D array called rho(5,91,181) in Python. Does anyone have any recommendations?
The file itself is quite easy to parse, as it seems to be in a standard whitespace-separated format. You only need to specify how the day is calculated (below I assume the days are stacked sequentially in the file), then go over each row and put the rho value into the right place in the array:
import numpy as np
import pandas as pd

df = pd.read_csv('path_to_data_file', sep=r'[ \t]+')

rho = np.zeros((5, 91, 181))
rows_per_day = 91 * 181  # assuming the days are stacked sequentially

for i, e in df.iterrows():
    day = i // rows_per_day
    lat = int((e['Latitude'] + 90) / 2)
    long = int(e['Longitude'] / 2)
    rho[day, lat, long] = e['rho']
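If the rows are guaranteed to be ordered by day, then latitude, then longitude (as the example suggests), a hedged alternative is to skip the Python loop and reshape directly:
# assumes strict (day, latitude, longitude) row order:
# 5 days x 91 latitudes x 181 longitudes = 82355 rows
rho = df['rho'].to_numpy().reshape(5, 91, 181)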

Can dimensions be passed to a variable within an XArray Dataset?

I'm currently working on a project that involves clipping Xarray datasets based on a certain grid of latitude and longitude values. To do this, I've imported a file with the specified lats and lons, and added dimensions onto the dataset to give it the necessary lat/lon information. However, I also want to pass the dimensions to a data variable (called 'precip') within the dataset. I know that I'm able to pass the dimensions to the array of values in the variable, but can they be passed to the variable itself?
My code is below:
precip = obs_dataset_full.precip.expand_dims(lon = 350, lat = 450)
precip.coords['Longitude'] = (-1 * (360 - obs_dataset_full.Longitude))
precip.coords['Latitude'] = obs_dataset_full.Latitude
precip
With output as such:
<xarray.Dataset>
Dimensions:    (dim_0: 350, dim_1: 451, lat: 450, lon: 350)
Coordinates:
  * dim_0      (dim_0) int64 0 1 2 3 4 5 ... 345 346 347 348 349
  * dim_1      (dim_1) int64 0 1 2 3 4 5 ... 446 447 448 449 450
    Longitude  (lon) float64 -105.7 -105.7 ... -78.34 -78.26
    Latitude   (lat) float64 35.04 35.08 35.11 ... 51.52 51.56
Data variables:
    precip     (dim_0, dim_1) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 nan
Attributes: (0)
Specifically, I want the data variable precip to also possess dimensions of lat and lon, as the dataset does.
Thanks in advance!
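A hedged sketch of one possible direction: rename the positional dimensions so that precip itself carries lat and lon, assuming dim_0 corresponds to lon, dim_1 to lat, and the coordinate arrays match the dimension lengths (the output above shows dim_1: 451 against lat: 450, so that mismatch would need resolving first):
# hypothetical: rename dim_0/dim_1 and attach the coordinate values
precip = obs_dataset_full.precip.rename({'dim_0': 'lon', 'dim_1': 'lat'})
precip = precip.assign_coords(
    lon=(-1 * (360 - obs_dataset_full.Longitude)).values,
    lat=obs_dataset_full.Latitude.values,
)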

Efficiently mask and calculate means for multiple groups in `xr.Dataset` xarray

I have two xr.Dataset objects. One is a continuous map of some variable (here precipitation). The other is a categorical map of a set of regions
['region_1', 'region_2', 'region_3', 'region_4'].
I want to calculate the mean precip in each region at each timestep by masking by region/time and then outputting a dataframe like the one below.
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
I have some code but it runs very slowly for the real datasets. Can anyone help me optimize?
A minimal reproducible example
Initialising our objects: two variables of the same shape. The region object will have been read from a shapefile and will have more than two regions.
import xarray as xr
import pandas as pd
import numpy as np

def make_dataset(
    variable_name='precip',
    size=(30, 30),
    start_date='2008-01-01',
    end_date='2010-01-01',
    lonmin=-180.0,
    lonmax=180.0,
    latmin=-55.152,
    latmax=75.024,
):
    # create 1D lat/lon coordinates
    lat_len, lon_len = size
    longitudes = np.linspace(lonmin, lonmax, lon_len)
    latitudes = np.linspace(latmin, latmax, lat_len)
    dims = ["lat", "lon"]
    coords = {"lat": latitudes, "lon": longitudes}
    # add time dimension
    times = pd.date_range(start_date, end_date, name="time", freq="M")
    size = (len(times), size[0], size[1])
    dims.insert(0, "time")
    coords["time"] = times
    # create values
    var = np.random.randint(100, size=size)
    return xr.Dataset({variable_name: (dims, var)}, coords=coords), size

ds, size = make_dataset()

# create dummy regions (not contiguous but doesn't matter for this example)
region_ds = xr.ones_like(ds).rename({'precip': 'region'})
array = np.random.choice([0, 1, 2, 3], size=size)
region_ds = region_ds * array

# create a dictionary explaining what the regions are
region_lookup = {
    0: 'region_1',
    1: 'region_2',
    2: 'region_3',
    3: 'region_4',
}
What do these objects look like?
In[]: ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
Data variables:
precip (time, lat, lon) int64 51 92 14 71 60 20 82 ... 16 33 34 98 23 53
In[]: region_ds
Out[]:
<xarray.Dataset>
Dimensions: (lat: 30, lon: 30, time: 24)
Coordinates:
* lat (lat) float64 -55.15 -50.66 -46.17 -41.69 ... 66.05 70.54 75.02
* time (time) datetime64[ns] 2008-01-31 2008-02-29 ... 2009-12-31
* lon (lon) float64 -180.0 -167.6 -155.2 -142.8 ... 155.2 167.6 180.0
Data variables:
region (time, lat, lon) float64 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0
Current Implementation
In order to calculate the mean of the variable in ds in each of the regions ['region_1', 'region_2', ...] in region_ds at each time, I loop over each region and then over each timestep in the da object. This is pretty slow as the dataset gets larger (more pixels and more timesteps). Is there a more efficient / vectorized use of numpy / xarray that will get me my desired result faster?
def drop_nans_and_flatten(dataArray: xr.DataArray) -> np.ndarray:
    """Flatten the array and drop NaNs from it. Useful for plotting histograms.

    Arguments:
    ---------
    : dataArray (xr.DataArray)
        the DataArray of the values you want to flatten
    """
    # drop NaNs and flatten
    return dataArray.values[~np.isnan(dataArray.values)]

da = ds.precip
region_da = region_ds.region
valid_region_ids = list(region_lookup.keys())

# initialise empty lists
region_names = []
datetimes = []
mean_values = []

for valid_region_id in valid_region_ids:
    for time in da.time.values:
        region_names.append(region_lookup[valid_region_id])
        datetimes.append(time)
        # extract the mean of all non-nan values for that time-region
        mean_values.append(
            da.sel(time=time).where(region_da == valid_region_id).mean().values
        )

df = pd.DataFrame(
    {
        "datetime": datetimes,
        "region_name": region_names,
        "mean_value": mean_values,
    }
)
The output:
In [6]: df.head()
Out[6]:
datetime region_name mean_value
0 2008-01-31 region_1 51.77333333333333
1 2008-02-29 region_1 44.87555555555556
2 2008-03-31 region_1 50.88444444444445
3 2008-04-30 region_1 48.50666666666667
4 2008-05-31 region_1 47.653333333333336
In [7]: df.tail()
Out[7]:
datetime region_name mean_value
43 2009-08-31 region_4 50.83111111111111
44 2009-09-30 region_4 48.40888888888889
45 2009-10-31 region_4 51.56148148148148
46 2009-11-30 region_4 48.961481481481485
47 2009-12-31 region_4 48.36296296296296
In [20]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 3 columns):
datetime 96 non-null datetime64[ns]
region_name 96 non-null object
mean_value 96 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 2.4+ KB
In [21]: df.describe()
Out[21]:
datetime region_name mean_value
count 96 96 96
unique 24 4 96
top 2008-10-31 00:00:00 region_1 48.88984800150122
freq 4 24 1
first 2008-01-31 00:00:00 NaN NaN
last 2009-12-31 00:00:00 NaN NaN
Any help would be very much appreciated! Thank you.
It's hard to avoid iterating to generate the masks for the regions given how they are defined, but once you have those constructed (e.g. with the code below), I think the following would be pretty efficient:
regions = xr.concat(
    [(region_ds.region == region_id).expand_dims(region=[region])
     for region_id, region in region_lookup.items()],
    dim='region'
)
result = ds.precip.where(regions).mean(['lat', 'lon'])
This generates a DataArray with 'time' and 'region' dimensions, where the value at each point is the mean at a given time over a given region. It would be straightforward to extend this to an area-weighted average if that were desired too.
An alternative option that generates the same result would be:
regions = xr.DataArray(
    list(region_lookup.keys()),
    coords=[list(region_lookup.values())],
    dims=['region']
)
result = ds.precip.where(regions == region_ds.region).mean(['lat', 'lon'])
Here regions is basically just a DataArray representation of the region_lookup dictionary.
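Either way, the resulting (region, time) DataArray can be converted into the long-format dataframe shown in the question; a minimal sketch (the column renames match the names from the question):
df = (
    result.rename('mean_value')  # sets the value column name
    .to_dataframe()
    .reset_index()
    .rename(columns={'time': 'datetime', 'region': 'region_name'})
)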

XArray - Multidimensional Binning & Array Reduction on Sample Dataset of 4 x4 to 2 x 2

I am trying to reduce an xarray Dataset from 4 x 4 to 2 x 2 along both dimensions, and I haven't had any luck with it so far.
These are the steps I followed. I want to bin or group based on latitude and longitude.
a = np.array(np.random.randint(1, 90 + 1, (4, 4)), dtype=np.float64)
b = np.array(np.random.randint(1, 360 + 1, (4, 4)), dtype=np.float64)
c = np.random.random_sample(16).reshape(4, 4)
dsa = xr.Dataset()
dsa['CloudFraction'] = (('x', 'y'), c)
dsa.coords['latitude'] = (('x', 'y'), a)
dsa.coords['longitude'] = (('x', 'y'), b)
dsa
<xarray.Dataset>
Dimensions: (x: 4, y: 4)
Coordinates:
latitude (x, y) float64 23.0 16.0 53.0 1.0 ... 82.0 65.0 45.0 88.0
longitude (x, y) float64 219.0 13.0 276.0 69.0 ... 156.0 277.0 16.0
Dimensions without coordinates: x, y
Data variables:
CloudFraction (x, y) float64 0.1599 0.05671 0.8624 ... 0.7757 0.7572
We can realize the (moving) binning for your example by using the rolling method.
A simple way to bin along the x axis is
In [14]: dsa.rolling(x=2).mean().isel(x=slice(1, None, 2))
Out[14]:
<xarray.Dataset>
Dimensions: (x: 2, y: 4)
Coordinates:
latitude (x, y) float64 9.0 61.0 58.0 57.0 23.0 38.0 10.0 75.0
longitude (x, y) float64 198.0 177.0 303.0 71.0 163.0 213.0 55.0 102.0
Dimensions without coordinates: x, y
Data variables:
CloudFraction (x, y) float64 0.2882 0.7061 0.9226 ... 0.5084 0.2377 0.6352
This actually computes the moving average with a window size of 2, and then subsamples with a stride of 2.
Since the mean operation is linear, you can do the same thing along the y axis sequentially, as in the sketch below.
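A minimal sketch applying the same rolling-and-subsample step to both axes:
# reduce the 4 x 4 grid to 2 x 2 in two sequential passes
binned = (
    dsa.rolling(x=2).mean().isel(x=slice(1, None, 2))
       .rolling(y=2).mean().isel(y=slice(1, None, 2))
)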
The above operation wastes a little computation, since we only use half of the computed values.
In order to avoid this, we can use the construct method instead:
In [18]: dsa.rolling(x=2).construct('tmp').isel(x=slice(1, None, 2)).mean('tmp')
Out[18]:
<xarray.Dataset>
Dimensions: (x: 2, y: 4)
Coordinates:
latitude (x, y) float64 9.0 61.0 58.0 57.0 23.0 38.0 10.0 75.0
longitude (x, y) float64 198.0 177.0 303.0 71.0 163.0 213.0 55.0 102.0
Dimensions without coordinates: x, y
Data variables:
CloudFraction (x, y) float64 0.2882 0.7061 0.9226 ... 0.5084 0.2377 0.6352
For the details of the rolling method, see the official documentation:
http://xarray.pydata.org/en/stable/computation.html#rolling-window-operations
Personally, I think it would be nice if xarray had a bin method for this purpose.
If you don't mind contributing, let's discuss it on the GitHub issue page.
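A hedged aside: more recent xarray releases also provide a coarsen method that performs this block reduction directly, assuming your version includes it:
binned = dsa.coarsen(x=2, y=2).mean()  # block-average 4 x 4 down to 2 x 2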
