Goal
I would like to parallelize a loop with dask that uses a library function inside the loop. This function, mhw.detect(), calculates some statistics on a slice of a numpy array. None of the slices of the array depend on the other slices, so I was hoping that dask could be used to compute them in parallel and store them all in the same output array.
Code
The flow of the code I am working on is:
import numpy as np
import marineHeatWaves as mhw
from dask import delayed
# Create fake input data
lat_size, long_size = 100, 100
data = np.random.random_integers(0, 30, size=(10_000, long_size, lat_size)) # size = (time, longitude, latitude)
time = np.arange(730_000, 740_000) # time in ordinal days
# Initialize an empty array to hold the output
output_array = np.empty(data.shape)
# loop through each pixel in the data array
for idx_lat in range(lat_size):
    for idx_long in range(long_size):
        # Extract a slice of data
        data_slice = data[:, idx_lat, idx_long]
        # Use the library function to calculate the stats for the pixel
        # `library_output` is a dictionary that has a numpy array inside it
        _, library_output = delayed(mhw.detect)(time, data_slice)
        # Update the output array with the calculated values from the library
        output_array[:, idx_lat, idx_long] = library_output['seas']
Previous efforts
When I run this code I get the error TypeError: Delayed objects of unspecified length are not iterable. Another Stack Overflow post discusses this issue and resolves it by converting the output of the delayed function to a delayed object. However, because I didn't create the output object myself, I am not sure I can convert it to a delayed object.
I've also tried wrapping the last line in da.from_delayed(), as in output_array[:, idx_lat, idx_long] = da.from_delayed(library_output['seas']), and initializing output_array with da.empty(data.shape). I get the same error, though, since I think the code doesn't make it past the line with the library function, delayed(mhw.detect)(time, data_slice).
Is it possible to parallelize this? Is this approach of asking dask to compute all the slices in parallel and put them together in an output array even a reasonable approach?
Full Traceback
TypeError Traceback (most recent call last)
/home/rwegener/mhw-ocetrac-census/notebooks/ejoliver_subset_MUR.ipynb Cell 44' in <cell line: 10>()
13 data_slice = data[:, idx_lat, idx_long]
14 # Use the library function to calculate the stats for the pixel
---> 15 _, point_clim = delayed(mhw.detect)(time_ordinal, data_slice)
16 # Update the output array with the calculated values from the library
17 output_array[:, idx_lat, idx_long] = point_clim['seas']
File ~/.conda/envs/dask/lib/python3.10/site-packages/dask/delayed.py:581, in Delayed.__iter__(self)
579 def __iter__(self):
580 if self._length is None:
--> 581 raise TypeError("Delayed objects of unspecified length are not iterable")
582 for i in range(self._length):
583 yield self[i]
TypeError: Delayed objects of unspecified length are not iterable
Update
Using .apply_along_axis() as suggested:
import dask.array

# Create fake input data
lat_size, long_size = 100, 100
data = np.random.randint(0, 30, size=(10_000, long_size, lat_size)) # size = (time, longitude, latitude)
data = dask.array.from_array(data, chunks=(-1, 100, 100))
time = np.arange(730_000, 740_000) # time in ordinal days
# Initialize an empty array to hold the output
output_array = np.empty(data.shape)
# define a wrapper to rearrange arguments
def func1d(arr, time, shape=(10000,)):
    print(arr.shape)
    return mhw.detect(time, arr)

res = dask.array.apply_along_axis(func1d, 0, data, time=time)
With the output:
(1,)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/homes/metogra/rwegener/mhw-ocetrac-census/notebooks/ejoliver_subset_MUR.ipynb Cell 48' in <cell line: 15>()
12 print(arr.shape)
13 return mhw.detect(time, arr)
---> 15 res = dask.array.apply_along_axis(func1d, 0, data, time=time)
File ~/.conda/envs/dask/lib/python3.10/site-packages/dask/array/routines.py:508, in apply_along_axis(func1d, axis, arr, dtype, shape, *args, **kwargs)
506 if shape is None or dtype is None:
507 test_data = np.ones((1,), dtype=arr.dtype)
--> 508 test_result = np.array(func1d(test_data, *args, **kwargs))
509 if shape is None:
510 shape = test_result.shape
/homes/metogra/rwegener/mhw-ocetrac-census/notebooks/ejoliver_subset_MUR.ipynb Cell 48' in func1d(arr, time, shape)
11 def func1d(arr, time, shape=(10000,)):
12 print(arr.shape)
---> 13 return mhw.detect(time, arr)
File ~/.conda/envs/dask/lib/python3.10/site-packages/marineHeatWaves-0.28-py3.10.egg/marineHeatWaves.py:280, in detect(t, temp, climatologyPeriod, pctile, windowHalfWidth, smoothPercentile, smoothPercentileWidth, minDuration, joinAcrossGaps, maxGap, maxPadLength, coldSpells, alternateClimatology, Ly)
278 tt = tt[tt>=0] # Reject indices "before" the first element
279 tt = tt[tt<TClim] # Reject indices "after" the last element
--> 280 thresh_climYear[d-1] = np.nanpercentile(tempClim[tt.astype(int)], pctile)
281 seas_climYear[d-1] = np.nanmean(tempClim[tt.astype(int)])
282 # Special case for Feb 29
IndexError: index 115 is out of bounds for axis 0 with size 1
Rather than using delayed, this seems like a good case for dask.array.
You can create a dask array by partitioning the input numpy array:
data = dask.array.from_array(data, chunks=(-1, 10, 10))
Now you can call mhw.detect using dask.array.map_blocks alongside np.apply_along_axis within each block:
# define a wrapper to rearrange arguments
def func1d(arr, time):
    return mhw.detect(time, arr)

def block_func(block, **kwargs):
    return np.apply_along_axis(func1d, 0, block, **kwargs)

res = data.map_blocks(block_func, meta=data, time=time)
res = res.compute()
The map_blocks answer above works great! Additionally, apply_along_axis() was suggested and discussed in comments. I was able to get that method to work, but in order for it to function properly you need to use both the dtype and shape inputs to da.apply_along_axis(). If these aren't supplied the function can't figure out the shape of the data it should pass as an argument.
So, another solution:
import dask.array as da
# Create fake input data
lat_size, long_size = 100, 100
data = da.random.random_integers(0, 30, size=(1_000, long_size, lat_size), chunks=(-1, 10, 10)) # size = (time, longitude, latitude)
time = np.arange(730_000, 731_000) # time in ordinal days
# define a wrapper to rearrange arguments
def func1d(arr, time):
    return mhw.detect(time, arr)
result = da.apply_along_axis(func1d, 0, data, time=time, dtype=data.dtype, shape=(1000,))
result.compute()
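If, as in the original loop, only the 'seas' climatology is needed, the wrapper can unpack the detect output and return just that array, so the final dask array is purely numeric. This is a minimal sketch built on the snippet above; func1d_seas is a hypothetical name, and it assumes (as described in the question) that mhw.detect returns a (mhws, clim) pair and that clim['seas'] is a 1-D numpy array with one value per time step:
# Sketch: keep only the seasonal climatology, so the result of apply_along_axis
# is a plain numeric array of shape (time, longitude, latitude).
# Assumes mhw.detect(time, arr) returns (mhws, clim) and clim['seas'] is a
# 1-D array of length len(time), as in the question.
def func1d_seas(arr, time):
    _, clim = mhw.detect(time, arr)
    return clim['seas']

result = da.apply_along_axis(func1d_seas, 0, data, time=time,
                             dtype=float, shape=(1_000,))
seas = result.compute()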
Related
I am trying to convert a 3D numpy array to a data array; however, I am getting an error that I cannot figure out.
I have a 3D numpy array (lat, lon, and time), and I am hoping to convert it into an xarray data array with the dimensions being lat, lon, and time.
The np.random.rand is just to make a reproducible example of a 3D array:
import numpy as np
import xarray as xr

atae = np.random.rand(10,20,30) # 3d array
lat_atae = np.random.rand(10) # latitude is the same size as the first axis
lon_atae = np.random.rand(20) # longitude is the same size as second axis
time_atae = np.random.rand(30) # time is the 3rd axis

data_xr = xr.DataArray(atae, coords=[{'y': lat_atae,'x': lon_atae,'time': time_atae}],
                       dims=["y", "x", "time"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-156-8f8f8a1fc7aa> in <module>
----> 1 test = xr.DataArray(atae, coords=[{'y': lat_atae,'x': lon_atae,'time': time_atae}],
2 dims=["y", "x", "time"])
3
~/opt/anaconda3/lib/python3.8/site-packages/xarray/core/dataarray.py in __init__(self, data, coords, dims, name, attrs, indexes, fastpath)
408 data = _check_data_shape(data, coords, dims)
409 data = as_compatible_data(data)
--> 410 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
411 variable = Variable(dims, data, attrs, fastpath=True)
412 indexes = dict(
~/opt/anaconda3/lib/python3.8/site-packages/xarray/core/dataarray.py in _infer_coords_and_dims(shape, coords, dims)
104 and len(coords) != len(shape)
105 ):
--> 106 raise ValueError(
107 f"coords is not dict-like, but it has {len(coords)} items, "
108 f"which does not match the {len(shape)} dimensions of the "
ValueError: coords is not dict-like, but it has 1 items, which does not match the 3 dimensions of the data
How do I convert this numpy array into an xarray data array?
You don't need to provide a list for coords; the dictionary alone is enough:
data_xr = xr.DataArray(atae,
                       coords={'y': lat_atae, 'x': lon_atae, 'time': time_atae},
                       dims=["y", "x", "time"])
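For reference, here is a short, self-contained usage sketch (reusing the fake arrays from the question): with the corrected constructor, the named dimensions and coordinates support label-based selection.
import numpy as np
import xarray as xr

atae = np.random.rand(10, 20, 30)
lat_atae, lon_atae, time_atae = np.random.rand(10), np.random.rand(20), np.random.rand(30)

data_xr = xr.DataArray(atae,
                       coords={'y': lat_atae, 'x': lon_atae, 'time': time_atae},
                       dims=["y", "x", "time"])

print(data_xr.dims)                      # ('y', 'x', 'time')
subset = data_xr.sel(time=time_atae[0])  # select one time slice by coordinate value
print(subset.shape)                      # (10, 20)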
I have been struggling for the last 4 days trying to understand a Python error:
IndexError: index 206893 is out of bounds for axis 0 with size 206893
which occurs when applying griddata with the "nearest" interpolation method, using the following lines:
# Create a matrix where I will store the first interpolated file
tempnew = np.ones((np.asarray(w1[0,0,:,:]).shape))*np.nan

# The lon, lat coordinate points of the original grid
lonl, latl = np.meshgrid(lon, lat)
points = np.vstack((np.array(lonl).flatten(), np.array(latl).flatten())).transpose()

# The values of the original file
values = np.array([np.asarray(temp[0,0,:,:])]).flatten()

# The dimensions of the grid that I want to interpolate to
lons = np.array(nav_lon)
lats = np.array(nav_lat)
X, Y = np.meshgrid(lons, lats)

# Interpolation
tempnew = griddata(points, values, (X, Y), method="nearest", fill_value=-3)
Here are the dimensions of each of the variables used above:
#tempnew.shape: (728, 312) #(Dimensions of tempnew is (lats,lons))
#lat.shape: (661,) #(original latitude)
#lon.shape: (313,) #(original longitude)
#points.shape: (206893, 2)
#values.shape: (206893,)
#X.shape: (728, 312)
#Y.shape: (728, 312)
Can you help me? I would like to note that the original file is on a regular (A-type) grid, whereas the grid I want to interpolate to is not regular (a C-grid).
The error looks like this:
In [36]: tempnew = sp.interpolate.griddata(points,values, (X,Y), method = "nearest
...: ",fill_value=-3)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-36-0d0b46a3542f> in <module>
----> 1 tempnew = sp.interpolate.griddata(points,values, (X,Y), method =
"nearest",fill_value=-3)
~/software/anaconda3/envs/mhw/lib/python3.7/site-packages/scipy/interpolate/ndgriddata.py in
griddata(points, values, xi, method, fill_value, rescale)
217 elif method == 'nearest':
218 ip = NearestNDInterpolator(points, values, rescale=rescale)
--> 219 return ip(xi)
220 elif method == 'linear':
221 ip = LinearNDInterpolator(points, values, fill_value=fill_value,
~/software/anaconda3/envs/mhw/lib/python3.7/site-packages/scipy/interpolate/ndgriddata.py in
__call__(self, *args)
79 xi = self._scale_x(xi)
80 dist, i = self.tree.query(xi)
---> 81 return self.values[i]
82
83
IndexError: index 206893 is out of bounds for axis 0 with size 206893
Thanks in advance,
Sofi
I encountered this error in my own code using the scipy.interpolate.NearestNDInterpolator class, and the error message it returns is not very clear. In the end, I found that one of the values I was feeding into the interpolant was 1e184, which triggered this error. After resetting that value to 0.0, my Python script ran successfully.
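Building on that observation, one way to guard against this (a sketch, not part of the original code; it assumes points and values were built as in the question) is to mask non-finite or implausibly large source values before interpolating:
import numpy as np
from scipy.interpolate import griddata

# Keep only finite source values below an arbitrary illustrative cutoff (1e30 here).
ok = np.isfinite(values) & (np.abs(values) < 1e30)
print("dropping", np.count_nonzero(~ok), "suspect source points")

tempnew = griddata(points[ok], values[ok], (X, Y), method="nearest", fill_value=-3)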
I want to parallelize the numpy.bincount function using the apply_ufunc API of xarray and the following code is what I've tried:
import numpy as np
import xarray as xr
da = xr.DataArray(np.random.rand(2, 16, 32),
                  dims=['time', 'y', 'x'],
                  coords={'time': np.array(['2019-04-18', '2019-04-19'],
                                           dtype='datetime64'),
                          'y': np.arange(16), 'x': np.arange(32)})

f = xr.DataArray(da.data.reshape((2, 512)), dims=['time', 'idx'])

x = da.x.values
y = da.y.values
r = np.sqrt(x[np.newaxis, :]**2 + y[:, np.newaxis]**2)

nbins = 4
if x.max() > y.max():
    ri = np.linspace(0., y.max(), nbins)
else:
    ri = np.linspace(0., x.max(), nbins)

ridx = np.digitize(np.ravel(r), ri)

func = lambda a, b: np.bincount(a, weights=b)
xr.apply_ufunc(func, xr.DataArray(ridx, dims=['idx']), f)
but I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-203-974a8f0a89e8> in <module>()
12
13 func = lambda a, b: np.bincount(a, weights=b)
---> 14 xr.apply_ufunc(func, xr.DataArray(ridx,dims=['idx']), f)
~/anaconda/envs/uptodate/lib/python3.6/site-packages/xarray/core/computation.py in apply_ufunc(func, *args, **kwargs)
979 signature=signature,
980 join=join,
--> 981 exclude_dims=exclude_dims)
982 elif any(isinstance(a, Variable) for a in args):
983 return variables_ufunc(*args)
~/anaconda/envs/uptodate/lib/python3.6/site-packages/xarray/core/computation.py in apply_dataarray_ufunc(func, *args, **kwargs)
208
209 data_vars = [getattr(a, 'variable', a) for a in args]
--> 210 result_var = func(*data_vars)
211
212 if signature.num_outputs > 1:
~/anaconda/envs/uptodate/lib/python3.6/site-packages/xarray/core/computation.py in apply_variable_ufunc(func, *args, **kwargs)
558 raise ValueError('unknown setting for dask array handling in '
559 'apply_ufunc: {}'.format(dask))
--> 560 result_data = func(*input_data)
561
562 if signature.num_outputs == 1:
<ipython-input-203-974a8f0a89e8> in <lambda>(a, b)
11 ridx = np.digitize(np.ravel(r), ri)
12
---> 13 func = lambda a, b: np.bincount(a, weights=b)
14 xr.apply_ufunc(func, xr.DataArray(ridx,dims=['idx']), f)
ValueError: object too deep for desired array
I am kind of lost as to where the error is coming from, and any help would be greatly appreciated.
The issue is that np.bincount only accepts 1D arrays, so applying it across a 2D weights array needs something like np.apply_along_axis. The catch is that apply_along_axis iterates over 1D slices of the first argument to the applied function and not any of the others. If I understand your use-case correctly, you actually want to iterate over 1D slices of the weights (weights in the np.bincount signature), not the integer array (x in the np.bincount signature).
One way to work around this is to write a thin wrapper function around np.bincount that simply switches the order of the arguments:
def wrapped_bincount(weights, x):
    return np.bincount(x, weights=weights)
We can then use np.apply_along_axis with this function for your use-case:
def apply_bincount_along_axis(x, weights, axis=-1):
    return np.apply_along_axis(wrapped_bincount, axis, weights, x)
Finally, we can wrap this new function for use with xarray using apply_ufunc, noting that it can be automatically parallelized with dask (also note that we do not need to provide an axis argument, because xarray will automatically move the input core dimension dim to the last position in the weights array before applying the function):
def xbincount(x, weights):
    if len(x.dims) != 1:
        raise ValueError('x must be one-dimensional')
    dim, = x.dims
    nbins = x.max() + 1
    return xr.apply_ufunc(apply_bincount_along_axis, x, weights,
                          input_core_dims=[[dim], [dim]],
                          output_core_dims=[['bin']], dask='parallelized',
                          output_dtypes=[float], output_sizes={'bin': nbins})
Applying this function to your example then looks like:
xbincount(ridx, f)
<xarray.DataArray (time: 2, bin: 5)>
array([[ 0. , 7.934821, 34.066872, 51.118065, 152.769169],
[ 0. , 11.692989, 33.262936, 44.993856, 157.642972]])
Dimensions without coordinates: time, bin
As desired it also works with dask arrays:
xbincount(ridx, f.chunk({'time': 1}))
<xarray.DataArray (time: 2, bin: 5)>
dask.array<shape=(2, 5), dtype=float64, chunksize=(1, 5)>
Dimensions without coordinates: time, bin
I know this is a bit late, but here is an alternative for computing bincount with multiple sets of weights. Please refer to @spencerkclark's answer for information about parallelizing the function.
A WARNING before using this: the function bincount_2d_SLOW below is only meant to demonstrate the idea! Do not use it directly in your code; it is very slow!
I will explain at the end why the idea behind the function can greatly speed up your code relative to the solution posted by @spencerkclark, but only if you are computing the bincount multiple times with the same set of groups.
The idea of the code is that while we can't use np.bincount with 2d weights, we can convert 2d weights into 1d data that is directly usable by np.bincount.
The way we do this is:
1. We repeat our grouping column along the 2nd dimension of the weights, so the grouping and weights have the same shape.
2. We adjust the grouping values along the 2nd dimension, so that each set of weights has its own unique grouping values. This way, we group along each set of weights separately.
3. We flatten the data, so it's 1d. Now we can run np.bincount.
4. Finally, we reshape the result.
def bincount_2d_SLOW(x, weights=None):
    if weights is None:
        return np.bincount(x)
    if len(weights.shape) == 1:
        return np.bincount(x, weights=weights)
    n_groups = x.max() + 1
    n_dims = weights.shape[1]
    # Expand x to the same number of dimensions as weights
    repeated_x = np.tile(x, (n_dims, 1)).T
    # Take Kronecker product, so bincount works separately along each dimension
    repeated_x = repeated_x + n_groups * np.arange(n_dims)
    # Flatten
    repeated_x = repeated_x.flatten()
    # Compute bincount
    return np.bincount(repeated_x, weights=weights.flatten()).reshape((n_dims, n_groups)).T
Here is why the idea of this function can speed up your code: if you are computing the bincount many times using the same set of groups, you can pre-compute the tiled-and-flattened groupings, and suddenly the code is incredibly fast. Here is an alternative function (I also added an option to specify n_groups, which can speed up the code even more):
def bincount_2d(x, weights=None, n_groups=None):
    if weights is None:
        return np.bincount(x)
    if len(weights.shape) == 1:
        return np.bincount(x, weights=weights)
    n_dims = weights.shape[1]
    if n_groups is None:
        n_groups = (x.max() + 1) // n_dims
    return np.bincount(x, weights=weights.flatten()).reshape((n_dims, n_groups)).T
In some testing I did, bincount_2d_SLOW is about 1/3 slower than apply_bincount_along_axis. But bincount_2d was about 2x faster than apply_bincount_along_axis when I didn't specify n_groups, and when I did specify n_groups, it was about 3x faster.
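To make the pre-computation idea concrete, here is a small usage sketch with made-up data (groups and weights are hypothetical names): the tiled, offset, flattened grouping is built once and then reused for every new set of weights.
import numpy as np

rng = np.random.default_rng(0)
groups = rng.integers(0, 10, size=1_000)   # one group label (0..9) per sample
weights = rng.random((1_000, 3))           # three sets of weights per sample

n_groups = 10                              # group labels are 0..9
n_dims = weights.shape[1]

# Pre-compute the tiled, offset, flattened grouping once ...
repeated = (np.tile(groups, (n_dims, 1)).T + n_groups * np.arange(n_dims)).flatten()

# ... then reuse it whenever a new weights array with the same groups comes along.
result = bincount_2d(repeated, weights=weights, n_groups=n_groups)
print(result.shape)                        # (10, 3): one row per group, one column per weight set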
I was trying to calculate the trends of temperature
ntimes, ny, nx = tempF.shape
print tempF.shape

trend = MA.zeros((ny,nx), dtype=float)
print trend.shape

for y in range(ny):
    for x in range(nx):
        trend[y,x] = numpy.polyfit(tdum, tempF[:,y,x], 1)

print trend
the result is
(24, 241, 480)
(241, 480)
ValueErrorTraceback (most recent call last)
<ipython-input-31-4ac068601e48> in <module>()
12 for y in range (0,ny):
13 for x in range (0,nx):
---> 14 trend[y,x] = numpy.polyfit(tdum, tempF[:,y,x],1)
15
16
/home/charcoalp/anaconda2/envs/pyn_test/lib/python2.7/site-packages/numpy/ma/core.pyc in __setitem__(self, indx, value)
3272 if _mask is nomask:
3273 # Set the data, then the mask
-> 3274 _data[indx] = dval
3275 if mval is not nomask:
3276 _mask = self._mask = make_mask_none(self.shape, _dtype)
ValueError: setting an array element with a sequence.
I've only been using Python for a few days; can anyone help me? Thank you.
When you create the ny by nx zeros ndarray, you can specify which type you want to store in its elements. If you want to store a 1x2 array of float values in each cell (polyfit with degree=1 returns an array of two floats: slope and intercept), you can choose the following dtype instead of float:
trend = numpy.zeros((ny,nx), dtype='2f')
After that, you can store the polyfit results as elements of the trend ndarray.
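A minimal end-to-end sketch of that suggestion, with small fake data standing in for the real temperature array (the variable sizes here are just illustrative):
import numpy as np

ntimes, ny, nx = 24, 5, 6
tdum = np.arange(ntimes, dtype=float)          # time axis for the fit
tempF = np.random.rand(ntimes, ny, nx)         # fake temperature data

# '2f' stores a length-2 float sub-array per cell; NumPy expands it into a third axis
trend = np.zeros((ny, nx), dtype='2f')
print(trend.shape)                             # (5, 6, 2)

for y in range(ny):
    for x in range(nx):
        trend[y, x] = np.polyfit(tdum, tempF[:, y, x], 1)

print(trend[0, 0])                             # [slope, intercept] for the first grid cell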
I am trying to create a quiver plot from a NetCDF file in Python using this code:
import matplotlib.pyplot as plt
import numpy as np
import netCDF4
ncfile = netCDF4.Dataset('30JUNE2012_0300UTC.cdf', 'r')
dbZ = ncfile.variables['MAXDBZF']
data = dbZ[0,0]
U = ncfile.variables['UNEW'][:]
V = ncfile.variables['VNEW'][:]
x, y= np.arange(0,2*np.pi,.2), np.arange(0,2*np.pi,.2)
X,Y = np.meshgrid(x,y)
plt.quiver(X,Y,U,V)
plt.show()
and I am getting the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-109-b449c540a7ea> in <module>()
11 X,Y = np.meshgrid(x,y)
12
---> 13 plt.quiver(X,Y,U,V)
14
15 plt.show()
/Users/felishalawrence/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.pyc in quiver(*args, **kw)
3152 ax.hold(hold)
3153 try:
-> 3154 ret = ax.quiver(*args, **kw)
3155 draw_if_interactive()
3156 finally:
/Users/felishalawrence/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.pyc in quiver(self, *args, **kw)
4162 if not self._hold:
4163 self.cla()
-> 4164 q = mquiver.Quiver(self, *args, **kw)
4165
4166 self.add_collection(q, autolim=True)
/Users/felishalawrence/anaconda/lib/python2.7/site-packages/matplotlib/quiver.pyc in __init__(self, ax, *args, **kw)
415 """
416 self.ax = ax
--> 417 X, Y, U, V, C = _parse_args(*args)
418 self.X = X
419 self.Y = Y
/Users/felishalawrence/anaconda/lib/python2.7/site-packages/matplotlib/quiver.pyc in _parse_args(*args)
377 nr, nc = 1, U.shape[0]
378 else:
--> 379 nr, nc = U.shape
380 if len(args) == 2: # remaining after removing U,V,C
381 X, Y = [np.array(a).ravel() for a in args]
ValueError: too many values to unpack
What does this error mean?
ValueError: too many values to unpack is raised because line 379 (inside matplotlib's quiver.py) tries to unpack U.shape into two variables (nr, nc), and U.shape contains more values than there are variables to receive them.
Look above at line 377: that branch correctly assigns two values (1 and U.shape[0]) to nr and nc, but line 379 unpacks U.shape directly into two variables. If U.shape has more than 2 values you will get this error. U.shape is a tuple, and the unpacking only works when the number of values matches the number of variables (in this case two). I would print out U.shape and check that it holds the expected number of values. If U.shape can hold more than two values, your code needs to handle that; for example, if U.shape turns out to be a tuple of 3 values, you will need 3 variables to hold them, like so:
nr, nc, blah = U.shape
Consider the following:
a,b,c = ["a","b","c"] #works
print a
print b
print c
a, b = ["a","b","c"] #will result in error because 3 values are trying to be assigned to only 2 variables
The results from the above code:
a
b
c
Traceback (most recent call last):
File "None", line 7, in <module>
ValueError: too many values to unpack
So you see it's just a matter of having enough values to assign to all of the variables that are requesting a value.
Probably more useful for solving future problems than the author's, but still:
The problem was likely that the netCDF file had a time dimension, so U and V were 3-dimensional arrays. You should choose a time slice or aggregate the data across the time dimension.
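A sketch of what that fix could look like in the question's script (assuming UNEW and VNEW are laid out as (time, y, x); index 0 is just an illustrative choice of time step):
# Select a single time step so U and V are 2-D before calling quiver,
# or aggregate over time instead (e.g. U = ncfile.variables['UNEW'][:].mean(axis=0)).
U = ncfile.variables['UNEW'][0]
V = ncfile.variables['VNEW'][0]

plt.quiver(X, Y, U, V)
plt.show()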