slow performance using xarray.DataArray.quantile() on large dataset - python

I am using xarray in Python (Spyder) to read large NetCDF files and process them.
import xarray as xr
ds = xr.open_dataset('my_file.nc')
ds has the following dimensions and variables:
<xarray.Dataset>
Dimensions: (time: 62215, points: 2195)
Coordinates:
* time (time) datetime64[ns] 1980-04-01 ... 2021-09-30T21:00:00
Dimensions without coordinates: points
Data variables:
longitude (time, points) float32 ...
latitude (time, points) float32 ...
hs (time, points) float32 ...
I want to calculate the 95th percentile of the variable hs for each specific point, and generate a new variable to the dataset:
hs_95 (points) float32
I do this with one line of code:
ds['hs_95'] = ds.hs.quantile(0.95, dim='time')
Where ds.hs is a xr.DataArray.
But it takes a very long time to run. Is there anything I can do to make it run faster? Is xarray the most convenient to use for this application?

Can you try skipna=False in the xarray.DataArray.quantile() method? This could help a bit.

Migrating my comment into an answer...
xarray loads data from netCDFs lazily, only reading in the parts of the data which are requested for an operation. So the first time you work with the data, you'll be getting the read time + the quantile time. The quantiling may still be slow, but for a real benchmark you should first load the dataset with xr.Dataset.load(), e.g.:
ds = ds.load()
or alternatively, you can load the data and close the file object together with xr.load_dataset(filepath).
That said, you should definitely heed @tekiz's great advice to use skipna=False if you can - the performance improvement can be on the order of 100x if you don't have to skip NaNs when quantiling (if you're sure you don't have NaNs).
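Putting both suggestions together, a minimal sketch of the workflow might look like this (assuming hs really contains no NaNs):
import xarray as xr
# Read the whole file into memory and close it, so the quantile timing
# is not dominated by lazy disk reads.
ds = xr.load_dataset('my_file.nc')
# skipna=False avoids the much slower NaN-aware quantile path.
ds['hs_95'] = ds.hs.quantile(0.95, dim='time', skipna=False)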

Related

Rechunk DataArray to calculate 90% quantile over chunked time dimension

I am trying to calculate the 90th percentile of a variable over a period of 16 years. The data is stored in netCDF files (1 month per file --> 12 files/year * 16 years).
I pre-processed the data and took the daily max and monthly mean of the variable of interest. So, bottom line, the folder consists of 192 files that each contain one value (the monthly mean of the daily max).
The data was opened using following command:
ds = xr.open_mfdataset(f"{folderdir}/*.nc", chunks={"time":1})
Trying to calculate the quantile (from some data variable, which was extracted from the ds: data_variable = ds["data_variable"]) with the following code:
q90 = data_variable.quantile(0.95, "time"), yields the following error message:
ValueError: dimension time on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single dask array chunk along this dimension, i.e., .chunk(dict(time=-1)), or pass allow_rechunk=True in dask_gufunc_kwargs but beware that this may significantly increase memory usage.
I tried to rechunk as explained in the error message, applying data_variable.chunk(dict(time=-1)).quantile(0.95, 'time'), with no success (I got the exact same error).
Further, I tried to rechunk in the following way: data_variable.chunk({'time': 1}), which was also not successful.
Printing out data_variable.chunk() actually shows that the chunk size in the time dimension is supposed to be 1, so I don't understand where I made a mistake.
P.S.: I didn't try allow_rechunk=True in dask_gufunc_kwargs, since I don't know where to pass that argument.
Thanks for the help,
Max
P.S.: Printing out data_variable yields the following (to be clear, the data variable is 'wsgsmax' here):
<xarray.DataArray 'wsgsmax' (time: 132, y: 853, x: 789)>
dask.array<concatenate, shape=(132, 853, 789), dtype=float32, chunksize=(1, 853, 789), chunktype=numpy.ndarray>
Coordinates:
* time (time) datetime64[ns] 1995-01-16T12:00:00 ... 2005-12-16T12:00:00
lon (y, x) float32 dask.array<chunksize=(853, 789), meta=np.ndarray>
lat (y, x) float32 dask.array<chunksize=(853, 789), meta=np.ndarray>
* x (x) float32 0.0 2.5e+03 5e+03 ... 1.965e+06 1.968e+06 1.97e+06
* y (y) float32 0.0 2.5e+03 5e+03 ... 2.125e+06 2.128e+06 2.13e+06
height float32 10.0
Attributes:
standard_name: wind_speed_of_gust
long_name: Maximum Near Surface Wind Speed Of Gust
units: m s-1
grid_mapping: Lambert_Conformal
cell_methods: time: maximum
FA_name: CLSRAFALES.POS
par: 228
lvt: 105
lev: 10
tri: 2
chunk({"time": 1} will produce as many chunks as there are time steps.
Each chunk will have a size of 1.
Printing out data_variable.chunk() actually shows that the chunk size in the time dimension is supposed to be 1, so I don't understand where I made a mistake.

To compute percentiles, dask needs to load the full time series into memory, so it forbids chunking over the "time" dimension.
So what you want is either chunk({"time": len(ds.time)}) or to use the shorthand chunk({"time": -1}) directly.
I don't understand why data_variable.chunk(dict(time=-1)).quantile(0.95, 'time') would not work, though.
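For reference, a hedged sketch of the whole pattern (the path is illustrative, and the quantile level follows the 90% in the title):
import xarray as xr
ds = xr.open_mfdataset("folderdir/*.nc")
data_variable = ds["data_variable"]
# A single chunk along time is required for quantile over that dimension.
q90 = data_variable.chunk({"time": -1}).quantile(0.9, dim="time").compute()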

Optimize plane of array (POA) irradiance calculation using WRF (netCDF) data

I need to calculate the plane of array (POA) irradiance using Python's pvlib package (https://pvlib-python.readthedocs.io/en/stable/). For this I would like to use the output data from the WRF model (GHI, DNI, DHI). The output data is in netCDF format, which I open using the netCDF4 package, and then I extract the necessary variables using the wrf-python package.
With that I get an xarray.Dataset with the variables I will use. I then use the xarray.Dataset.to_dataframe() method to transform it into a pandas DataFrame, and then I transform the DataFrame into a NumPy array using dataframe.values. Then I run a loop where, in each iteration, I calculate the POA for one grid point using the function irradiance.get_total_irradiance (https://pvlib-python.readthedocs.io/en/stable/auto_examples/plot_ghi_transposition.html).
That's the way I've been doing it so far; however, I have over 160000 grid points in the WRF domain, the data is hourly, and it spans 365 days, which is a very large amount of data. I believe that if pvlib could work directly with an xarray.Dataset it could be faster, but I could only do it this way, transforming the data into a numpy.array and looping through the rows. Could anyone tell me how I can optimize this calculation? The code I developed is very time-consuming.
If anyone can help me with this I would appreciate it. Maybe an improvement to the code, or another way to calculate the POA from the WRF data...
I'm providing the code I've built so far:
from pvlib import location
from pvlib import irradiance
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import netCDF4
import wrf
Getting WRF data
variaveis = ['T2',
             'U10',
             'V10',
             'SWDDNI',
             'SWDDIF',
             'SWDOWN']
netcdf_data = netCDF4.Dataset('wrfout_d02_2003-11-01_00_00_00')
first = True
for v in variaveis:
    var = wrf.getvar(netcdf_data, v, timeidx=wrf.ALL_TIMES)
    if first:
        met_data = var
        first = False
    else:
        met_data = xr.merge([met_data, var])

met_data = xr.Dataset.reset_coords(met_data, ['XTIME'], drop=True)
met_data['T2'] = met_data['T2'] - 273.15
WS10 = (met_data['U10']**2 + met_data['V10']**2)**0.5
met_data['WS10'] = WS10
df = (met_data[['SWDDIF', 'SWDDNI', 'SWDOWN', 'T2', 'WS10']]
      .to_dataframe()
      .reset_index()
      .drop(columns=['south_north', 'west_east']))
df.rename(columns={'SWDOWN': 'ghi',
                   'SWDDNI': 'dni',
                   'SWDDIF': 'dhi',
                   'T2': 'temp_air',
                   'WS10': 'wind_speed',
                   'XLAT': 'lat',
                   'XLONG': 'lon',
                   'Time': 'time'}, inplace=True)
df.set_index(['time'], inplace=True)
df = df[df.ghi>0]
df.index = df.index.tz_localize('America/Recife')
Function to get POA irradiance
def get_POA_irradiance(lon, lat, date, dni, dhi, ghi, tilt=10, surface_azimuth=0):
    site_location = location.Location(lat, lon, tz='America/Recife')
    # Get solar azimuth and zenith to pass to the transposition function
    solar_position = site_location.get_solarposition(times=date)
    # Use the get_total_irradiance function to transpose the GHI to POA
    POA_irradiance = irradiance.get_total_irradiance(
        surface_tilt=tilt,
        surface_azimuth=surface_azimuth,
        dni=dni,
        ghi=ghi,
        dhi=dhi,
        solar_zenith=solar_position['apparent_zenith'],
        solar_azimuth=solar_position['azimuth'])
    # Return DataFrame with only GHI and POA
    return pd.DataFrame({'lon': lon,
                         'lat': lat,
                         'GHI': ghi,
                         'POA': POA_irradiance['poa_global']}, index=[date])
Loop in each row (time) of the array
from tqdm import tqdm  # missing from the imports above

array = df.reset_index().values
list_poa = []

def loop_POA():
    for i in tqdm(range(len(array) - 1)):
        POA = get_POA_irradiance(lon=array[i, 6],
                                 lat=array[i, 7],
                                 dni=array[i, 2],
                                 dhi=array[i, 1],
                                 ghi=array[i, 3],
                                 date=str(array[i, 0]))
        list_poa.append(POA)
    return list_poa

list_poa = loop_POA()
poa_final = pd.concat(list_poa)
Thanks both for a good question and for using pvlib! You're right that pvlib is intended for modeling single locations and is not designed for use with xarray datasets, although some functions might coincidentally work with them.
I strongly suspect that the majority of the runtime you're seeing is for the solar position calculations. You could switch to a faster method (see the method options here), as the default solar position method is very accurate but also quite slow when calculating bulk positions. Installing numba will help, but it still might be too slow for you, so you might check the other models (ephemeris, pyephem). There are also some fast but low-precision methods, but you will need to change your code a bit to use them. See the list under "Correlations and analytical expressions for low precision solar position calculations" here.
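For example, the method can be forwarded through Location.get_solarposition, reusing the names from the question's function (a sketch; 'nrel_numba' additionally requires numba to be installed):
# Faster solar position calculation, passed through to pvlib.solarposition.get_solarposition
solar_position = site_location.get_solarposition(times=date, method='ephemeris')
# or, with numba installed:
# solar_position = site_location.get_solarposition(times=date, method='nrel_numba')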
Like Michael Delgado suggests in the comments, parallel processing is an option. But that can be a headache in Python. You will probably want multiprocessing, not multithreading.
Another idea is to use atlite, a python package designed for this kind of spatial modeling. But its solar modeling capabilities are not nearly as detailed as pvlib, so it might not be useful for your case.
One other note: I don't know if the WRF data are interval averages or instantaneous values, but if you care about accuracy you should handle them differently for transposition. See this example.
Edit to add: after looking at your code again, there might be another significant speedup to be had. Are you calling get_POA_irradiance for single combinations of position and timestamp? If so, that is unnecessary and very slow. It would be much faster to pass in the full time series for each location, i.e. scalar lat/lon but vector irradiance.
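To make that last point concrete, here is a rough sketch of the "one call per location" idea, reusing the df built in the question (columns lat, lon, ghi, dni, dhi with a time index); the helper name is illustrative, not part of pvlib:
def poa_for_location(lat, lon, point_df, tilt=10, surface_azimuth=0):
    site = location.Location(lat, lon, tz='America/Recife')
    # One solar-position calculation for the whole time series of this grid point
    solpos = site.get_solarposition(times=point_df.index)
    poa = irradiance.get_total_irradiance(
        surface_tilt=tilt,
        surface_azimuth=surface_azimuth,
        dni=point_df['dni'],
        ghi=point_df['ghi'],
        dhi=point_df['dhi'],
        solar_zenith=solpos['apparent_zenith'],
        solar_azimuth=solpos['azimuth'])
    return pd.DataFrame({'lon': lon, 'lat': lat,
                         'GHI': point_df['ghi'],
                         'POA': poa['poa_global']}, index=point_df.index)

poa_final = pd.concat(
    poa_for_location(lat, lon, group)
    for (lat, lon), group in df.groupby(['lat', 'lon'])
)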

Apply Mann Whitney U test on multidimensional array and replace single values of variable of xarray data array in Python?

I'm new to Python and need some help with xarray.
I have two 3-dimensional data arrays (rlon, rlat, time) for future and past climate. I want to compute the Mann-Whitney U test for each grid point to analyse the significance of temperature change in the future compared to the past. I already got the Mann-Whitney U test working by selecting a time series from one grid point each of the historical and future data. Example:
import numpy as np
import xarray as xr
import scipy.stats as sts
#selecting time period and grid point of past and future data
tp = fileHis['tas']
tf = fileFut['tas']
gridpoint_past=tp.sel(rlon=-6.375, rlat=1.375, time=slice('1999-01-01', '1999-01-31'))
gridpoint_future=tf.sel(rlon=-6.375, rlat=1.375, time=slice('2099-01-01', '2099-01-31'))
# Mann-Whitney U test
result=sts.mannwhitneyu(gridpoint_past, gridpoint_future, alternative='two-sided')
print('pvalue =',result[1])
Output:
pvalue = 0.05922372345359562
My problem now is that I need to do this for each grid point and each month, and in the end I would like to have a data array with p-values for each grid point and each month of a year.
I was thinking about looping through all rlat, rlon and months and running the Mann-Whitney U test for each, unless there is a better way to do it?
And how can I write the p-values one by one into a new data array with the same rlat, rlon dimensions?
I was trying this, but it does not work:
I created a data array pvalue_mon, which has the same rlat, rlon as tp and tf and has 12 months as time steps.
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=th.time.dt.month.isin([1])) = result[1]
SyntaxError: can't assign to function call
or this:
pvalue_mon.sel(rlon=-6.375, rlat=1.375, time=pvalue_mon.time.dt.month.isin([1])).update(result[1])
TypeError: 'numpy.float64' object is not iterable
How can I replace a single value of an existing variable?
Instead of using the .sel() function, try using .loc[ ] as described here:
http://xarray.pydata.org/en/stable/indexing.html#assigning-values-with-indexing
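A minimal sketch of what that label-based assignment could look like, using the names from the question (untested, but following the assignment pattern from that page):
# .sel() returns a new object, so assigning to it fails; .loc with the same
# label-based indexers writes into the existing DataArray instead.
pvalue_mon.loc[dict(rlon=-6.375, rlat=1.375,
                    time=pvalue_mon.time.dt.month.isin([1]))] = result[1]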

Method for Time Slicing in Netcdf similar to xarray

I have a NetCDF data file containing sea ice concentration
from netCDF4 import Dataset
ds = Dataset('file.nic', 'r')
ds.variables.keys()
>>odict_keys(['latitude', 'longitude', 'seaice_conc', 'seaice_source', 'time'])
ds.dimensions.keys()
>>odict_keys(['latitude', 'longitude', 'time'])
Question: In this dataset, time is stored as days since 2001-01-01 00:00:00. Let's say I want seaice_conc for a particular time = 1990-12-01; how do I approach this without using xarray or writing another function to calculate the days difference?
Is it possible to do it like in xarray, e.g.:
import xarray as xr
ds1 = xr.open_dataset('file.nc')
seaice_data = ds1['seaice_conc'].sel(time = '1990-12-01')
To give further info on the dataset, it looks like this:
ds1.seaice_conc
<xarray.DataArray 'seaice_conc' (time: 1968, latitude: 240, longitude:
1440)>
[680140800 values with dtype=float32]
Coordinates:
* latitude (latitude) float32 89.875 89.625 89.375 89.125 88.875 88.625
...
* longitude (longitude) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625
...
* time (time) datetime64[ns] 1850-01-15 1850-02-15 1850-03-15 ...
Attributes:
short_name: concentration
long_name: Sea_Ice_Concentration
standard_name: Sea_Ice_Concentration
units: Percent
Another thing I'm confused about is that netCDF4 says time is stored in days since 2001-01-01, but xarray shows me the exact date in yyyy-mm-dd format instead of the 'days since...' definition?
Thanks!
The easiest approach I could find is
from netCDF4 import date2index
from datetime import datetime
timeindex = date2index(datetime(1990,12,1),ds.variables['time'])
seaice_data = ds.variables['seaice_conc'][timeindex,:,:]
netCDF4.Dataset is indeed a lower-level library than xarray; if it could do everything that xarray already does, there would be no need for xarray, right?
Still, there is a useful function num2date in netCDF4, which can make your life easier when managing the date units. Approximately:
from netCDF4 import Dataset, num2date
import datetime
import numpy as np
ds = Dataset('file.nic', 'r')
your_date = datetime.datetime(1990,12,1)
select_time = np.argmax(num2date(ds.variables['time'][:],ds.variables['time'].units) == your_date)
seaice_data = ds.variables['seaice_conc'][select_time,:,:]
I admit it is still more code than xarray.
You can do what you are trying to do in Xarray.
For Question 1. It looks like your dates are all on the 15th of each month. Selecting just one time point should work like this.
ds1['seaice_conc'].sel(time='1990-12-15')
Another way you can do this is to use the method='nearest' keyword argument.
ds1['seaice_conc'].sel(time='1990-12-01', method='nearest')
Finally, you may consider reindexing your time axis to the first of each month.
ds1['seaice_conc'].resample(time='MS').mean('time').sel(time='1990-12-01')
A bonus answer, you can select time slices with a similar approach:
ds1['seaice_conc'].sel(time=slice('1990-01-01', '1991-12-31'))
The Xarray documentation includes a section on datetime indexing
For Question 2. Xarray automatically decodes coordinate variables when you use open_dataset. You can turn this off with the decode_times argument but that doesn't seem like what you want to do here.
This is also discussed in the Xarray documentation.
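For illustration, turning the decoding off keeps the raw numeric time values and their units attribute:
import xarray as xr
# time stays as plain numbers with a units attribute such as
# 'days since 2001-01-01 00:00:00' instead of datetime64 values
ds_raw = xr.open_dataset('file.nc', decode_times=False)
print(ds_raw['time'].attrs.get('units'))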

How to use HDF5 dimension scales in h5py

HDF5 has the concept of dimension scales, as explained on the HDF5 and h5py websites. However, the explanations both use terse or generic examples and so I don't really understand how to use dimension scales. Namely, given a dataset f['coordinates'] in some HDF5 file f = h5py.File('data.h5'):
>>> f['coordinates'].value
array([[ 52.60636111, 4.38963889],
[ 52.57877778, 4.43422222],
[ 52.58319444, 4.42811111],
...,
[ 52.62269444, 4.43130556],
[ 52.62711111, 4.42519444],
[ 52.63152778, 4.41905556]])
I'd like to make it clear that the first column is the latitude and the second is the longitude. Are dimension scales used for this? Or are they used to indicate that the unit is degrees? Or both?
Perhaps another concrete example can illustrate the use of dimension scales better? If you have one, please share it, even if you are not using h5py.
Specifically for this question, the best answer is probably to use attributes:
f['coordinates'].attrs['columns'] = ['latitude', 'longitude']
But dimension scales are useful for other things. I'll show what they're for, how you could use them in a way similar to attributes, and how you might actually use your f['coordinates'] dataset as a scale for some other dataset.
Dimension scales
I agree that those documentation pages are not as clear as they could be, because they launch into complicated possibilities and mire in technical details before they actually explain the basic concepts. I think some simple examples should make things clear.
First, suppose you've kept track of the temperature outside over the course of a day — maybe measuring it every hour on the hour, for a total of 24 measurements. You might think of this as two columns of data: one for the hour, and one for the temperature. You could store this as a single dataset of shape 24x2. But time and temperature have different units, and are really different datatypes. So it might make more sense to store time and temperature as separate datasets — probably named "time" and "temperature", each of shape 24. But you'd also need to be a little more clear about what these are and how they're related to each other. That relationship is what "dimension scales" are really for.
If you imagine plotting the temperature as a function of time, you might label the horizontal axis as "Time (hour of day)", and the scale for the horizontal axis would be the hours themselves, telling you the horizontal position at which to plot each temperature. You could store this information through h5py like this:
with h5py.File("temperatures.h5", "w") as f:
time = f.create_dataset("time", data=...)
time.make_scale("hour of day")
temp = f.create_dataset("temperature", data=...)
temp.dims[0].label = "Time"
temp.dims[0].attach_scale(time)
Note that the argument to make_scale is specific information about that particular time dataset — in this case, the units we used to measure time — whereas the label is the more general concept of that dimension. Also note that it's actually more standard to attach unit information as attributes, but I like this approach more for specifying the unit of a scale because of its simplicity.
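For comparison, the attribute-based convention mentioned above would look something like this inside the with block (a small illustrative addition, with hypothetical unit strings):
    # Store the units as plain HDF5 attributes on the datasets
    time.attrs["units"] = "hour of day"
    temp.attrs["units"] = "degrees Celsius"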
Now, suppose we measured the temperatures in three different places — say, Los Angeles, Chicago, and New York. Now, our array of temperatures would have shape 24x3. We would still need the time scale for dim[0], but now we also have dim[1] to deal with.
with h5py.File("temperatures.h5", "w") as f:
time = f.create_dataset("time", data=...)
time.make_scale("hour of day")
cities = f.create_dataset("cities",
data=[s.encode() for s in ["Los Angeles", "Chicago", "New York"]]
)
cities.make_scale("city")
temp = f.create_dataset("temperature", data=...)
temp.dims[0].label = "Time"
temp.dims[0].attach_scale(time)
temp.dims[1].label = "Location"
temp.dims[1].attach_scale(cities)
It might be more useful to store the latitude and longitude, instead of city names. You can actually attach both types of scale to the same dimension. Just add code like this at the bottom of that last code block:
latlong = f.create_dataset("latlong",
data=[[34.0522, 118.2437], [41.8781, 87.6298], [40.7128, 74.0060]]
)
latlong.make_scale("latitude and longitude (degrees)")
temp.dims[1].attach_scale(latlong)
Finally, you can access these labels and scales like this:
with h5py.File("temperatures.h5", "r") as f:
print('Datasets:', list(f))
print('Temperature dimension labels:', [dim.label for dim in f['temperature'].dims])
print('Temperature dim[1] scales:', f['temperature'].dims[1].keys())
latlong = f['temperature'].dims[1]['latitude and longitude (degrees)'][:]
print(latlong)
The output looks like this:
Datasets: ['cities', 'latlong', 'temperature', 'time']
Temperature dimension labels: ['Time', 'Location']
Temperature dim[1] scales: ['city', 'latitude and longitude (degrees)']
[[ 34.0522 118.2437]
[ 41.8781 87.6298]
[ 40.7128 74.006 ]]
@mike's answer is very helpful for understanding.
Since you gave a geospatial example with latitude and longitude, I might personally create your dataset like this, where the "lat" and "lon" are each separate 1D datasets:
import numpy as np
import h5py
# assuming you already have stored the coordinates in that 2-column array...
coords = np.array(
    [[52.60636111, 4.38963889], [52.57877778, 4.43422222], [52.58319444, 4.42811111]]
)
heights = np.random.rand(3, 3)

with h5py.File("dem.h5", "w") as hf:
    lat = hf.create_dataset("lat", data=coords[:, 0])
    lat.make_scale("latitude")
    lat.attrs["units"] = "degrees north"
    lon = hf.create_dataset("lon", data=coords[:, 1])
    lon.make_scale("longitude")
    lon.attrs["units"] = "degrees east"
    h = hf.create_dataset("height", data=heights)
    h.attrs['units'] = "meters"
    h.dims[0].attach_scale(lat)
    h.dims[1].attach_scale(lon)
The reason: this will let you use it with xarray much more easily:
In [1]: import xarray as xr
In [2]: ds1 = xr.open_dataset("dem.h5")
In [3]: ds1
Out[3]:
<xarray.Dataset>
Dimensions: (lat: 3, lon: 3)
Coordinates:
* lat (lat) float64 52.61 52.58 52.58
* lon (lon) float64 4.39 4.434 4.428
Data variables:
height (lat, lon) float64 ...
In [4]: ds1['lat']
Out[4]:
<xarray.DataArray 'lat' (lat: 3)>
array([52.606361, 52.578778, 52.583194])
Coordinates:
* lat (lat) float64 52.61 52.58 52.58
Attributes:
units: degrees north
In [5]: ds1['lon']
Out[5]:
<xarray.DataArray 'lon' (lon: 3)>
array([4.389639, 4.434222, 4.428111])
Coordinates:
* lon (lon) float64 4.39 4.434 4.428
Attributes:
units: degrees east
In [6]: ds1['height']
Out[6]:
<xarray.DataArray 'height' (lat: 3, lon: 3)>
array([[0.832685, 0.24167 , 0.831189],
[0.294826, 0.779141, 0.280573],
[0.980254, 0.593158, 0.634342]])
Coordinates:
* lat (lat) float64 52.61 52.58 52.58
* lon (lon) float64 4.39 4.434 4.428
Attributes:
units: meters
The little bit of extra metadata you add (including the "units" attributes) pays off when you want to play around with calculations or plot the data:
In [9]: ds1.height.mean(dim="lat")
Out[9]:
<xarray.DataArray 'height' (lon: 3)>
array([0.70258813, 0.5379896 , 0.58203484])
Coordinates:
* lon (lon) float64 4.39 4.434 4.428
In [10]: ds1.height.plot.imshow()
