I am trying to calculate the 90th percentile of a variable over a period of 16 years. The data is stored in netCDF files, one file per month (12 files/year * 16 years).
I pre-processed the data and took the daily max and monthly mean of the variable of interest. So, bottom line, the folder consists of 192 files that each contain one value (the monthly mean of the daily max).
The data was opened using following command:
ds = xr.open_mfdataset(f"{folderdir}/*.nc", chunks={"time":1})
Trying to calculate the quantile (from a data variable extracted from the dataset: data_variable = ds["data_variable"]) with the following code:
q90 = data_variable.quantile(0.95, "time")
yields the following error message:
ValueError: dimension time on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single dask array chunk along this dimension, i.e., .chunk(dict(time=-1)), or pass allow_rechunk=True in dask_gufunc_kwargs but beware that this may significantly increase memory usage.
I tried to rechunk as explained in the error message, by applying data_variable.chunk(dict(time=-1)).quantile(0.95, 'time'), with no success (I got the exact same error).
I also tried to rechunk in the following way: data_variable.chunk({'time': 1}), which was not successful either.
Printing out data_variable.chunk() actually shows that the chunk size along the time dimension is 1, so I don't understand where I made a mistake.
PS: I didn't try allow_rechunk=True in dask_gufunc_kwargs, since I don't know where to pass that argument.
Thanks for the help,
Max
PS: Printing out data_variable yields the following (to be clear, the data variable here is 'wsgsmax'):
<xarray.DataArray 'wsgsmax' (time: 132, y: 853, x: 789)>
dask.array<concatenate, shape=(132, 853, 789), dtype=float32, chunksize=(1, 853, 789), chunktype=numpy.ndarray>
Coordinates:
* time (time) datetime64[ns] 1995-01-16T12:00:00 ... 2005-12-16T12:00:00
lon (y, x) float32 dask.array<chunksize=(853, 789), meta=np.ndarray>
lat (y, x) float32 dask.array<chunksize=(853, 789), meta=np.ndarray>
* x (x) float32 0.0 2.5e+03 5e+03 ... 1.965e+06 1.968e+06 1.97e+06
* y (y) float32 0.0 2.5e+03 5e+03 ... 2.125e+06 2.128e+06 2.13e+06
height float32 10.0
Attributes:
standard_name: wind_speed_of_gust
long_name: Maximum Near Surface Wind Speed Of Gust
units: m s-1
grid_mapping: Lambert_Conformal
cell_methods: time: maximum
FA_name: CLSRAFALES.POS
par: 228
lvt: 105
lev: 10
tri: 2
chunk({"time": 1} will produce as many chunks as there are time steps.
Each chunk will have a size of 1.
You write that printing out data_variable.chunk() shows a chunk size of 1 along the time dimension, but that is exactly the problem: to compute percentiles, dask needs to load the full time series into memory, so it forbids chunking over the "time" dimension.
So what you want is either chunk({"time": len(ds.time)}) or, more directly, the shorthand chunk({"time": -1}).
I don't understand why data_variable.chunk(dict(time=-1)).quantile(0.95, 'time') would not work, though.
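For completeness, a minimal sketch of the full chain (using the variable name 'wsgsmax' from your printout; note that chunk() returns a new object, so its result has to be used or assigned):
import xarray as xr
ds = xr.open_mfdataset(f"{folderdir}/*.nc", chunks={"time": 1})
data_variable = ds["wsgsmax"].chunk({"time": -1})  # one single chunk along time
q90 = data_variable.quantile(0.95, dim="time")     # still lazy at this point
q90 = q90.compute()                                # triggers the actual computation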
Related
I am using xarray in Python (Spyder) to read large netCDF files and process them.
import xarray as xr
ds = xr.open_dataset('my_file.nc')
ds has the following dimensions and variables:
<xarray.Dataset>
Dimensions: (time: 62215, points: 2195)
Coordinates:
* time (time) datetime64[ns] 1980-04-01 ... 2021-09-30T21:00:00
Dimensions without coordinates: points
Data variables:
longitude (time, points) float32 ...
latitude (time, points) float32 ...
hs (time, points) float32 ...
I want to calculate the 95th percentile of the variable hs for each point and add it as a new variable to the dataset:
hs_95 (points) float32
I do this with one line of code:
ds['hs_95'] = ds.hs.quantile(0.95, dim='time')
where ds.hs is an xr.DataArray.
But it takes a very long time to run. Is there anything I can do to make it run faster? Is xarray the most convenient to use for this application?
Can you try skipna=False in the xarray.DataArray.quantile() method? It could help a bit.
Migrating my comment into an answer...
xarray loads data from netCDFs lazily, only reading in the parts of the data which are requested for an operation. So the first time you work with the data, you'll be getting the read time + the quantile time. The quantiling may still be slow, but for a real benchmark you should first load the dataset with xr.Dataset.load(), e.g.:
ds = ds.load()
Alternatively, you can load the data and close the file object in one step with xr.load_dataset(filepath).
That said, you should definitely heed @tekiz's great advice to use skipna=False if you can: the performance improvement can be on the order of 100x if you don't have to skip NaNs when computing the quantile (provided you're sure you don't have NaNs).
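Putting the two suggestions together, a minimal sketch (assuming the file fits in memory and hs really has no NaNs):
import xarray as xr
ds = xr.load_dataset('my_file.nc')  # eager read, so the quantile isn't paying the lazy-read cost
ds['hs_95'] = ds.hs.quantile(0.95, dim='time', skipna=False)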
I have a binary file in which data segments are interspersed. I know the location (byte offset) of every data segment, as well as its size and the type of the data points (float32, meaning that every data point is encoded by 4 bytes). I want to read those data segments into an array-like structure (for example, a numpy array or pandas DataFrame), but I am having trouble doing so. I've tried numpy's memmap, but it short-circuits on the last data segment, and numpy's fromfile just gives me bizarre results.
Sample of the code:
begin=datadf["$BEGINDATA"][0] #datadf is pandas.df that has where data begins and its size
buf.seek(begin) #buf is the file that is opened in rb mode
size=datadf["$DATASIZE"][0]+1 #same as the above
data=buf.read(size) #this should get me that data segment, but in binary
Is there a way to reliably convert this binary data to float32?
For further clarification, I'm including a printout of the first 10 data points.
buf.seek(begin)
print(buf.read(40)) #10 points of float32 (4bytes) means 40
>>> b'\xa5\x10[@\x00\x00\x88@a\xf3\xf7A\x00\x00\x88@&\x93\x9bA\x00\x00\x88@\x00\x00\x00@\xfc\xcd\x08?\x1c\xe2\xbe?\x03\xf9\xa4?'
In case it's of any value: while there are 4 bytes (32-bit width) for each float, every value is capped at a maximum of 10 000.
If you want a numpy.ndarray, you can just use numpy.frombuffer
>>> import numpy as np
>>> data = b'\xa5\x10[@\x00\x00\x88@a\xf3\xf7A\x00\x00\x88@&\x93\x9bA\x00\x00\x88@\x00\x00\x00@\xfc\xcd\x08?\x1c\xe2\xbe?\x03\xf9\xa4?'
>>> np.frombuffer(data, dtype=np.float32)
array([ 3.422891 , 4.25 , 30.993837 , 4.25 , 19.44685 ,
4.25 , 2. , 0.5343931, 1.4912753, 1.2888492],
dtype=float32)
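To tie this back to the seek/read code in the question, a minimal sketch (assuming buf, begin and size are defined as above and the data is little-endian):
import numpy as np
buf.seek(begin)                           # jump to the start of the segment
raw = buf.read(size)                      # raw bytes of the whole segment
raw = raw[:len(raw) // 4 * 4]             # guard in case size isn't a multiple of 4
values = np.frombuffer(raw, dtype='<f4')  # little-endian float32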
Is there an xarray way of computing quantiles on a DataArray.rolling window? The listed available methods include mean and median, but nothing on quantiles/percentiles. I was wondering if this could somehow be done even though there is no direct way.
Currently, I am migrating the xarray data to a pandas.DataFrame, where I apply the rolling().quantile() sequence. After that, I take the values of the new DataFrame and build an xarray.DataArray from it. The reproducible code:
import xarray as xr
import pandas as pd
import numpy as np
times = np.arange(0, 30)
locs = ['A', 'B', 'C', 'D']
signal = xr.DataArray(np.random.rand(len(times), len(locs)),
                      coords=[times, locs], dims=['time', 'locations'])
window = 5
df = pd.DataFrame(data=signal.data)
roll = df.rolling(window=window, center=True, axis=0).quantile(.25).dropna()
window_array = xr.DataArray(roll.values,
                            coords=[np.arange(0, signal.time.shape[0] - window + 1), signal.locations],
                            dims=['time', 'locations'])
Any hint on how to stick to xarray as much as possible is welcome.
Let us consider the same problem, only smaller in size (10 time instances, 2 locations).
Here is the output of the first method (via pandas):
<xarray.DataArray (time: 8, locations: 2)>
array([[0.404362, 0.076203],
[0.353639, 0.076203],
[0.387167, 0.102917],
[0.525404, 0.298231],
[0.755646, 0.298231],
[0.460749, 0.414935],
[0.104887, 0.498813],
[0.104887, 0.420935]])
Coordinates:
* time (time) int32 0 1 2 3 4 5 6 7
* locations (locations) <U1 'A' 'B'
Note that the 'time' dimension is smaller, due to calling dropna() after the rolling quantile. The new dimension size is basically len(times) - window + 1. Now, the output of the proposed method (via construct):
<xarray.DataArray (time: 10, locations: 2)>
array([[0.438426, 0.127881],
[0.404362, 0.076203],
[0.353639, 0.076203],
[0.387167, 0.102917],
[0.525404, 0.298231],
[0.755646, 0.298231],
[0.460749, 0.414935],
[0.104887, 0.498813],
[0.104887, 0.420935],
[0.112651, 0.60338 ]])
Coordinates:
* time (time) int32 0 1 2 3 4 5 6 7 8 9
* locations (locations) <U1 'A' 'B'
It seems like the dimensions are still (time, locations), with the size of the former equal to 10, not 8. In this example, since center=True, the two results are the same if you remove the first and the last rows of the second array. Shouldn't the DataArray have a new dimension, tmp?
Also, this method (with bottleneck installed) takes longer than the one initially proposed via pandas. For example, on a case study of 1000 times x 2 locations, the pandas run takes 0.015 s, while the construct one takes 1.25 s.
You can use the construct method of the rolling object, which generates a new DataArray with the rolling dimension.
signal.rolling(time=window, center=True).construct('tmp').quantile(.25, dim='tmp')
Above, I constructed a DataArray with an additional tmp dimension and computed the quantile along that dimension.
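If you want to reproduce the pandas result exactly (complete windows only), one option is to let incomplete edge windows become NaN and then drop them; a sketch, assuming the data itself contains no NaNs:
roll_q25 = (signal.rolling(time=window, center=True)
                  .construct('tmp')
                  .quantile(0.25, dim='tmp', skipna=False)  # NaN for incomplete edge windows
                  .dropna('time'))                          # mimics pandas' dropna()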
I have two netCDF files: file1.nc and file2.nc
The only difference is that file1.nc contains one variable, 'rho', which I want to append to file2.nc, but with its shape modified. The original file2.nc does not contain 'rho'.
I am using Python module netCDF4.
import netCDF4 as ncd
file1data=ncd.Dataset('file1.nc')
file1data.variables['rho']
<class 'netCDF4._netCDF4.Variable'> float64 rho(ocean_time, s_rho, eta_rho, xi_rho)
long_name: density anomaly
units: kilogram meter-3
time: ocean_time
grid: grid
location: face
coordinates: lon_rho lat_rho s_rho ocean_time
field: density, scalar, series
_FillValue: 1e+37
unlimited dimensions: ocean_time
current shape = (2, 15, 1100, 1000)
filling on
So rho has a shape of [2, 15, 1100, 1000], but when adding it to file2.nc I only want to add the data of the second time step, i.e. rho[1, :, :, :]. This will result in 'rho' in file2.nc having a shape of [15, 1100, 1000]. But I have been unable to do so.
I have been trying code like this:
file1data=ncd.Dataset('file1.nc')
rho2=file1data.variables['rho']
file2data=ncd.Dataset('file2.nc','r+') # I also tried with 'w' option; it does not work
file2data.createVariable('rho','float64')
file2data.variables['rho']=rho2 # to copy rho2's attributes
file2data.variables['rho'][:]=rho2[-1, :, :, :] # to modify rho's shape in file2.nc
file2data.close()
What am I missing here?
You have not specified the dimensions of the variable rho in your second netCDF file.
What you are doing is:
file2data.createVariable('rho','float64')
while it is supposed to be
file2data.createVariable('rho','float64',('ocean_time', 's_rho', 'eta_rho', 'xi_rho'))
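A minimal sketch of the whole copy, assuming file2.nc already defines the s_rho, eta_rho and xi_rho dimensions and that you really want only the second (last) time step, without an ocean_time dimension:
import netCDF4 as ncd
src = ncd.Dataset('file1.nc')
dst = ncd.Dataset('file2.nc', 'r+')
rho_src = src.variables['rho']
rho_dst = dst.createVariable('rho', 'f8', ('s_rho', 'eta_rho', 'xi_rho'),
                             fill_value=rho_src._FillValue)
rho_dst.setncatts({k: rho_src.getncattr(k) for k in rho_src.ncattrs()
                   if k != '_FillValue'})  # copy attributes; _FillValue must be set at creation time
rho_dst[:] = rho_src[-1, :, :, :]          # data of the second (last) time step only
src.close()
dst.close()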
I have a netCDF file which I have read with xarray. The dataset contains time, latitude, longitude and only one data variable (i.e. index values).
# read the netCDF files
with xr.open_mfdataset('wet_tropics.nc') as wet:
print(wet)
Out[]:
<xarray.Dataset>
Dimensions: (time: 1437, x: 24, y: 20)
Coordinates:
* y (y) float64 -1.878e+06 -1.878e+06 -1.878e+06 -1.878e+06 ...
* x (x) float64 1.468e+06 1.468e+06 1.468e+06 1.468e+06 ...
* time (time) object '2013-03-29T00:22:28.500000000' ...
Data variables:
index_values (time, y, x) float64 dask.array<shape=(1437, 20, 24), chunksize=(1437, 20, 24)>
So far, so good.
Now I need to apply a generalized additive model to each grid cell in the array. The model I want to use comes from Facebook Prophet (https://facebook.github.io/prophet/) and I have successfully applied it to a pandas DataFrame before. For example:
cns_ap['y'] = cns_ap['av_index'] # Prophet requires specific names 'y' and 'ds' for column names
cns_ap['ds'] = cns_ap['Date']
cns_ap['cap'] = 1
m1 = Prophet(weekly_seasonality=False,       # disables weekly_seasonality
             daily_seasonality=False,        # disables daily_seasonality
             growth='logistic',              # logistic because indices have a maximum
             yearly_seasonality=4,           # fourier transform. int between 1-10
             changepoint_prior_scale=0.5).fit(cns_ap)
future1 = m1.make_future_dataframe(periods=60,            # 5 year prediction
                                   freq='M',              # monthly predictions
                                   include_history=True)  # fits model to all historical data
future1['cap'] = 1 # sets cap at maximum index value
forecast1 = m1.predict(future1)
# m1.plot_components(forecast1, plot_cap=False);
# m1.plot(forecast1, plot_cap=False, ylabel='CNS index', xlabel='Year');
The problem is that now I have to
1) iterate through every cell of the netCDF file,
2) get all the values for that cell through time,
3) apply the GAM (using fbprophet), and then export and plot the results.
The question: do you have any ideas on how to loop through the raster and get the index_values of each pixel for all times, so that I can run the GAM?
I think that a nested for loop would be feasible, although I don't know how to make one that goes through every cell.
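Something like this rough sketch is what I have in mind (assuming the wet dataset from above is still open, and that the time coordinate needs converting to datetimes for Prophet), though I'm not sure it's the right approach:
import pandas as pd
for iy in range(wet.sizes['y']):
    for ix in range(wet.sizes['x']):
        cell = wet['index_values'].isel(y=iy, x=ix)  # one pixel, all time steps
        cns_ap = cell.to_dataframe().reset_index()[['time', 'index_values']]
        cns_ap = cns_ap.rename(columns={'time': 'ds', 'index_values': 'y'})
        cns_ap['ds'] = pd.to_datetime(cns_ap['ds'])  # time values are stored as strings here
        cns_ap['cap'] = 1
        # ... fit Prophet on cns_ap exactly as in the example above ...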
Any help is appreciated