Poor parallelization using dask - python

I have a 2D grid on which there is a path. I want to calculate the distance from each point of the grid to each point on the path, then do some operations on those grids. I am using dask.dataframe and dask.array for this task.
The code is:
import numpy as np
import dask.dataframe as dd
import dask.array as da
x = np.linspace(-60, 60, 10000)
xv, yv = da.meshgrid(x, x, sparse=True)
path = da.from_array(np.random.rand(100, 2))
npath = path.shape[0]
h = 100.0
# function to calculate distance to point
def dist_to_point(x, y, p):
    x_dist = x - p[0]
    y_dist = y - p[1]
    dist = da.sqrt(x_dist**2 + y_dist**2)
    # distance measured from height h above the grid
    d2 = da.sqrt(dist**2 + h**2)
    return dd.from_dask_array(d2)
distances = [dist_to_point(xv, yv, path[i, :]) for i in range(npath)]
distances_grid = dd.multi.concat(distances, axis=1, ignore_index=True)
So distances_grid should be the concatenation of [grid distance to point 1, grid distance to point 2, ..., grid distance to point 100].
Now suppose I want to get the max across all dataframes. I apply this:
l_max = distances_grid.map_partitions(lambda x: x.groupby(level=0, axis=1).max())
The dask graph for this does not look to me like proper parallelization of the tasks. Can anyone point out what I am doing wrong or how I can improve this? My final application will be on 100000x100000 grids, hence the use of dask.

In case anyone runs into this: I solved it by broadcasting the arrays and avoiding the for loop altogether. The code I ended up using is
import numpy as np
import dask.array as da
x = da.from_array(np.linspace(-60, 60, 10000), chunks=1000)
xv, yv = da.meshgrid(x, x, sparse=True)
path = da.from_array(np.random.rand(10, 2))
h = 100.0
ngrid = x.shape[0]
xd = x[:, np.newaxis] - path[:, 0]
yd = x[:, np.newaxis] - path[:, 1]
# squared Euclidean distance at height h = 100
z = xd**2 + yd[:, np.newaxis]**2 + h**2
distances_grid = z**0.5
l_max = distances_grid.max(axis=2)
This gave me a nicer graph which I am able to balance even more by changing the sizes of the chunks.
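The broadcasting trick can be checked on a small grid with plain NumPy before moving to dask. The sketch below is only an illustration under assumed sizes (a 50-point axis and 7 path points, both made up); it compares a per-point loop with the broadcast version and confirms they agree.

```python
import numpy as np

# Small stand-ins for the grid axis, the path and the height (illustrative values).
x = np.linspace(-60, 60, 50)
path = np.random.default_rng(0).random((7, 2))
h = 100.0

# Loop version: one (ngrid, ngrid) distance field per path point.
# Rows follow y and columns follow x, as in np.meshgrid(x, x).
loop = np.stack(
    [np.sqrt((x[None, :] - px) ** 2 + (x[:, None] - py) ** 2 + h ** 2)
     for px, py in path],
    axis=2,
)

# Broadcast version: shapes (1, ngrid, npath) and (ngrid, 1, npath)
# combine into (ngrid, ngrid, npath) without a Python loop.
xd = x[None, :, None] - path[None, None, :, 0]
yd = x[:, None, None] - path[None, None, :, 1]
broadcast = np.sqrt(xd ** 2 + yd ** 2 + h ** 2)

# Maximum over the path axis, one value per grid point.
l_max = broadcast.max(axis=2)
```

With dask.array the same indexing applies; the chunk sizes then decide how the (ngrid, ngrid, npath) array is split into tasks.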

Related

Integrating 2D data over an irregular grid in python

I have a 2D function which is sampled irregularly over a domain, and I want to calculate the volume underneath the surface. The data is organised in terms of [x,y,z]. Taking a simple example:
import numpy as np
def f(x,y):
    return np.cos(10*x*y) * np.exp(-x**2 - y**2)
datrange1 = np.linspace(-5,5,1000)
datrange2 = np.linspace(-0.5,0.5,1000)
ar = []
for x in datrange1:
    for y in datrange2:
        ar += [[x,y, f(x,y)]]
# xrange2 and yrange2 are not defined in the post
for x in xrange2:
    for y in yrange2:
        ar += [[x,y, f(x,y)]]
val_arr1 = np.array(ar)
# axis=0 keeps the [x, y, z] rows intact and sorts them row-wise
data = np.unique(val_arr1, axis=0)
xlist, ylist, zlist = data.T
where np.unique sorts the data in the first column then the second. The data is arranged in this way as I need to sample more heavily around the origin as there is a sharp feature that must be resolved.
Now I wondered about constructing a 2D interpolating function using scipy.interpolate.interp2d, then integrating over this using dblquad. As it turns out, this is not only inelegant and slow, but also kicks out the error:
RuntimeWarning: No more knots can be added because the number of B-spline
coefficients already exceeds the number of data points m.
Is there a better way to integrate data arranged in this fashion, or a way to overcome this error?
If you can sample the data with high enough resolution around the feature of interest, then more sparsely everywhere else, the problem definition then becomes how to define the area under each sample. This is easy with regular rectangular samples, and could likely be done stepwise in increments of resolution around the origin. The approach I went after is to generate the 2D Voronoi cells for each sample in order to determine their area. I pulled most of the code from this answer, as it had almost all the components needed already.
import numpy as np
from scipy.spatial import Voronoi
#taken from: # https://stackoverflow.com/questions/28665491/getting-a-bounded-polygon-coordinates-from-voronoi-cells
#computes voronoi regions bounded by a bounding box
def square_voronoi(xy, bbox): #bbox: (min_x, max_x, min_y, max_y)
    # Select points inside the bounding box
    points_center = xy[np.where((bbox[0] <= xy[:,0]) * (xy[:,0] <= bbox[1]) * (bbox[2] <= xy[:,1]) * (xy[:,1] <= bbox[3]))]
    # Mirror points
    points_left = np.copy(points_center)
    points_left[:, 0] = bbox[0] - (points_left[:, 0] - bbox[0])
    points_right = np.copy(points_center)
    points_right[:, 0] = bbox[1] + (bbox[1] - points_right[:, 0])
    points_down = np.copy(points_center)
    points_down[:, 1] = bbox[2] - (points_down[:, 1] - bbox[2])
    points_up = np.copy(points_center)
    points_up[:, 1] = bbox[3] + (bbox[3] - points_up[:, 1])
    points = np.concatenate((points_center, points_left, points_right, points_down, points_up,), axis=0)
    # Compute Voronoi
    vor = Voronoi(points)
    # Filter regions (center points should* be guaranteed to have a valid region)
    # center points should come first and not change in size
    regions = [vor.regions[vor.point_region[i]] for i in range(len(points_center))]
    vor.filtered_points = points_center
    vor.filtered_regions = regions
    return vor
#also stolen from: https://stackoverflow.com/questions/28665491/getting-a-bounded-polygon-coordinates-from-voronoi-cells
def area_region(vertices):
    # Polygon's signed area
    A = 0
    for i in range(0, len(vertices) - 1):
        s = (vertices[i, 0] * vertices[i + 1, 1] - vertices[i + 1, 0] * vertices[i, 1])
        A = A + s
    return np.abs(0.5 * A)
def f(x,y):
    return np.cos(10*x*y) * np.exp(-x**2 - y**2)
#sampling could easily be shaped to sample origin more heavily
sample_x = np.random.rand(1000) * 10 - 5 #same range as example linspace
sample_y = np.random.rand(1000) - .5
sample_xy = np.array([sample_x, sample_y]).T
vor = square_voronoi(sample_xy, (-5,5,-.5,.5)) #using bbox from samples
points = vor.filtered_points
sample_areas = np.array([area_region(vor.vertices[verts+[verts[0]],:]) for verts in vor.filtered_regions])
sample_z = np.array([f(p[0], p[1]) for p in points])
volume = np.sum(sample_z * sample_areas)
I haven't exactly tested this, but the principle should work, and the math checks out.
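Since the answer is untested, a quick check of the polygon-area step may be useful. The sketch below is a vectorized variant of the shoelace formula used in area_region, not the answer's own code; the function name polygon_area and the two test polygons are made up for illustration. Unlike area_region, it closes the polygon itself, so the first vertex does not need to be repeated.

```python
import numpy as np

def polygon_area(vertices):
    """Shoelace area of a simple polygon given as an (n, 2) array of ordered vertices."""
    x, y = np.asarray(vertices, dtype=float).T
    # np.roll pairs each vertex with the next one and wraps the last to the first.
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Two shapes with known areas (1 and 6).
unit_square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])
triangle = np.array([[0, 0], [4, 0], [0, 3]])

square_area = polygon_area(unit_square)
triangle_area = polygon_area(triangle)
```

A similar check on the full pipeline is that the Voronoi cell areas should sum to the area of the bounding box.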

Calculate distance between neighbors efficiently

I have geographically scattered data without any kind of pattern, and I need to create an image where the value of each pixel is the average of the data points that lie less than X meters from that pixel.
For this I use the library scipy.spatial to generate a KDTree with the data (cKDTree). Once the data structure is generated, I locate the pixel geographically and locate the geographic points that are closest.
# Generate scattered data points
coord_cart = [
    [
        feat.geometry().GetY(),
        feat.geometry().GetX(),
        feat.GetField(feature),
    ] for feat in layer
]
# Create KDTree structure
tree = cKDTree(coord_cart)
# Get raster image dimensions
pixel_size = 5
source_layer = shapefile.GetLayer()
x_min, x_max, y_min, y_max = source_layer.GetExtent()
x_res = int((x_max - x_min) / pixel_size)
y_res = int((y_max - y_min) / pixel_size)
# Create grid
x = np.linspace(x_min, x_max, x_res)
y = np.linspace(y_min, y_max, y_res)
X, Y = np.meshgrid(x, y)
grid = np.array(zip(Y.ravel(), X.ravel()))
# Get points that are less than 10 meters away
inds = tree.query_ball_point(grid, 10)
# inds is an np.array of lists of different length, so I need to convert it into an array of n_points x maximum number of neighbors
ll = np.array([len(l) for l in inds])
maxlen = max(ll)
arr = np.zeros((len(ll), maxlen), int)
# I don't know why but inds is an array of list, so I convert it into an array of array to use grid[inds]
# I THINK THIS IS A LITTLE INEFFICIENT
for i in range(len(inds)):
    inds[i].extend([i] * (maxlen - len(inds[i])))
    arr[i] = np.array(inds[i], dtype=int)
# AND THIS DOESN'T WORK
d = np.linalg.norm(grid - grid[inds])
Is there a better way to do this? I'm trying to use IDW to perform the interpolation between the points. I found this snippet, which uses a function that gets the N nearest points, but it does not work for me because I need the value of a pixel to be 0 when there is no point within a radius R.
d, inds = tree.query(zip(xt, yt, zt), k = 10)
w = 1.0 / d**2
air_idw = np.sum(w * air.flatten()[inds], axis=1) / np.sum(w, axis=1)
air_idw.shape = lon_curv.shape
Thanks in advance!
This may be one of the cases where KDTrees are not a good solution. This is because you are mapping to a grid, which is a very simple structure meaning there is nothing to gain from the KDTree's sophistication. Nearest grid point and distance can be found by simple arithmetic.
Below is a simple example implementation. I'm using a Gaussian kernel but changing that to IDW if you prefer should be straight-forward.
import numpy as np
from scipy import stats
def rasterize(coords, feature, gu, cutoff, kernel=stats.norm(0, 2.5).pdf):
    # compute overlap (filter size / grid unit)
    ovlp = int(np.ceil(cutoff/gu))
    # compute raster dimensions
    mn, mx = coords.min(axis=0), coords.max(axis=0)
    reso = np.ceil((mx - mn) / gu).astype(int)
    base = (mx + mn - reso * gu) / 2
    # map coordinates to raster, the residual is the distance
    grid_res = coords - base
    grid_coords = np.rint(grid_res / gu).astype(int)
    grid_res -= gu * grid_coords
    # because of overlap we must add neighboring grid points to the nearest
    gcovlp = np.c_[-ovlp:ovlp+1, np.zeros(2*ovlp+1, dtype=int)]
    grid_coords = (gcovlp[:, None, None, :] + gcovlp[None, :, None, ::-1]
                   + grid_coords).reshape(-1, 2)
    # the corresponding residuals have the same offset with opposite sign
    gdovlp = -gu * (gcovlp+1/2)
    grid_res = (gdovlp[:, None, None, :] + gdovlp[None, :, None, ::-1]
                + grid_res).reshape(-1, 2)
    # discard off fov grid points and points outside the cutoff
    valid, = np.where(((grid_coords>=0) & (grid_coords<=reso)).all(axis=1) & (
        np.einsum('ij,ij->i', grid_res, grid_res) <= cutoff*cutoff))
    grid_res = grid_res[valid]
    feature = feature[valid // (2*ovlp+1)**2]
    # flatten grid so we can use bincount
    grid_flat = np.ravel_multi_index(grid_coords[valid].T, reso+1)
    return np.bincount(
        grid_flat,
        feature * kernel(np.sqrt(np.einsum('ij,ij->i', grid_res, grid_res))),
        (reso + 1).prod()).reshape(reso+1)
gu = 5
cutoff = 10
coords = np.random.randn(10_000, 2) * (100, 20)
coords[:, 1] += 80 * np.sin(coords[:, 0] / 40)
feature = np.random.uniform(0, 1000, (10_000,))
from timeit import timeit
print(timeit("rasterize(coords, feature, gu, cutoff)", globals=globals(), number=100)*10, 'ms')
pic = rasterize(coords, feature, gu, cutoff)
import pylab
pylab.pcolor(pic, cmap=pylab.cm.jet)
pylab.colorbar()
pylab.show()
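For the IDW variant mentioned in the answer, the weights also need to be normalised per pixel, and pixels without any point inside the radius should stay 0, as the question requires. The sketch below is a brute-force illustration of that rule, not an adaptation of rasterize; the function name idw_raster, the power parameter and the sample points are all assumptions, and the pairwise distance matrix makes it suitable for small inputs only.

```python
import numpy as np

def idw_raster(coords, values, grid_x, grid_y, radius, power=2.0):
    """Brute-force IDW on a regular grid; pixels with no point within radius stay 0."""
    gx, gy = np.meshgrid(grid_x, grid_y)
    pixels = np.column_stack([gx.ravel(), gy.ravel()])
    # Pairwise distances, shape (n_pixels, n_points).
    d = np.linalg.norm(pixels[:, None, :] - coords[None, :, :], axis=2)
    inside = d <= radius
    # Clip tiny distances so a point on a pixel centre does not divide by zero.
    w = np.where(inside, 1.0 / np.maximum(d, 1e-12) ** power, 0.0)
    wsum = w.sum(axis=1)
    out = np.zeros(len(pixels))
    has_neighbour = wsum > 0
    # Weighted mean where at least one point is in range, 0 elsewhere.
    out[has_neighbour] = (w @ values)[has_neighbour] / wsum[has_neighbour]
    return out.reshape(gx.shape)

# Two made-up data points and a one-row grid.
coords = np.array([[0.0, 0.0], [10.0, 0.0]])
values = np.array([100.0, 200.0])
grid_x = np.array([0.0, 5.0, 10.0, 50.0])
grid_y = np.array([0.0])
img = idw_raster(coords, values, grid_x, grid_y, radius=10.0)
```

The pixel halfway between the two points gets their plain average, and the pixel at x = 50 has no neighbour within the radius and keeps the value 0.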

Poincare Section of a system of second order odes

This is the first time I am trying to write Poincare section code in Python.
I borrowed the piece of code from here:
https://github.com/williamgilpin/rk4/blob/master/rk4_demo.py
and I have tried to run it for my system of second order coupled odes. The problem is that I do not see what I was expecting to. Actually, I need the Poincare section when x=0 and px>0.
I believe that my implementation is not the best out there. I would like to:
Improve the way that the initial conditions are chosen.
Apply the correct conditions (x=0 and px>0) in order to acquire the correct Poincare section.
Create one plot with all the collected poincare section data, not four separate ones.
I would appreciate any help.
This is the code:
from matplotlib.pyplot import *
from scipy import *
from numpy import *
import numpy as np  # the code below also uses the np. prefix
# a simple Runge-Kutta integrator for multiple dependent variables and one independent variable
def rungekutta4(yprime, time, y0):
    # yprime is a list of functions, y0 is a list of initial values of y
    # time is a list of t-values at which solutions are computed
    #
    # Dependency: numpy
    N = len(time)
    y = array([thing*ones(N) for thing in y0]).T
    for ii in range(N-1):
        dt = time[ii+1] - time[ii]
        k1 = dt*yprime(y[ii], time[ii])
        k2 = dt*yprime(y[ii] + 0.5*k1, time[ii] + 0.5*dt)
        k3 = dt*yprime(y[ii] + 0.5*k2, time[ii] + 0.5*dt)
        k4 = dt*yprime(y[ii] + k3, time[ii+1])
        y[ii+1] = y[ii] + (k1 + 2.0*(k2 + k3) + k4)/6.0
    return y
# Miscellaneous functions
n= 1.0/3.0
kappa1 = 0.1
kappa2 = 0.1
kappa3 = 0.1
def total_energy(valpair):
    (x, y, px, py) = tuple(valpair)
    return .5*(px**2 + py**2) + (1.0/(1.0*(n+1)))*(kappa1*np.absolute(x)**(n+1)+kappa2*np.absolute(y-x)**(n+1)+kappa3*np.absolute(y)**(n+1))
def pqdot(valpair, tval):
    # input: [x, y, px, py], t
    # takes a pair of x and y values and returns \dot{p} according to the Hamiltonian
    (x, y, px, py) = tuple(valpair)
    return np.array([px, py, -kappa1*np.sign(x)*np.absolute(x)**n+kappa2*np.sign(y-x)*np.absolute(y-x)**n, kappa2*np.sign(y-x)*np.absolute(y-x)**n-kappa3*np.sign(y)*np.absolute(y)**n]).T
def findcrossings(data, data1):
    # returns indices in 1D data set where the data crossed zero. Useful for generating Poincare map at 0
    prb = list()
    for ii in range(len(data)-1):
        if (((data[ii] > 0) and (data[ii+1] < 0)) or ((data[ii] < 0) and (data[ii+1] > 0))) and data1[ii] > 0:
            prb.append(ii)
    return array(prb)
t = linspace(0, 1000.0, 100000)
print ("step size is " + str(t[1]-t[0]))
# Representative initial conditions for E=1
E = 1
x0=0
y0=0
init_cons = [[x0, y0, np.sqrt(2*E-(1.0*i/10.0)*(1.0*i/10.0)-2.0/(n+1)*(kappa1*np.absolute(x0)**(n+1)+kappa2*np.absolute(y0-x0)**(n+1)+kappa3*np.absolute(y0)**(n+1))), 1.0*i/10.0] for i in range(-10,11)]
outs = list()
for con in init_cons:
    outs.append( rungekutta4(pqdot, t, con) )
# plot the results
fig1 = figure(1)
for ii in range(4):
    subplot(2, 2, ii+1)
    plot(outs[ii][:,1],outs[ii][:,3])
    ylabel("py")
    xlabel("y")
    title("Full trajectory projected onto the plane")
fig1.suptitle('Full trajectories E = 1', fontsize=10)
# Plot Poincare sections at x=0 and px>0
fig2 = figure(2)
for ii in range(4):
    subplot(2, 2, ii+1)
    xcrossings = findcrossings(outs[ii][:,0], outs[ii][:,3])
    yints = [.5*(outs[ii][cross, 1] + outs[ii][cross+1, 1]) for cross in xcrossings]
    pyints = [.5*(outs[ii][cross, 3] + outs[ii][cross+1, 3]) for cross in xcrossings]
    plot(yints, pyints,'.')
    ylabel("py")
    xlabel("y")
    title("Poincare section x = 0")
fig2.suptitle('Poincare Sections E = 1', fontsize=10)
show()
You need to compute the derivatives of the Hamiltonian correctly. The derivative of |y-x|^n with respect to x is
n*(x-y)*|x-y|^(n-2)=n*sign(x-y)*|x-y|^(n-1)
and the derivative with respect to y is almost, but not exactly, the same (your code treats the two as identical):
n*(y-x)*|x-y|^(n-2)=n*sign(y-x)*|x-y|^(n-1).
Note the sign difference. With this correction you can take larger time steps, and with correct linear interpolation probably even larger ones, to obtain the images.
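The sign correction can be verified numerically. The sketch below is one possible corrected right-hand side, written from the potential in total_energy rather than copied from the answer; the helper name potential and the test point are assumptions. It compares the force terms against central finite differences of the potential.

```python
import numpy as np

n = 1.0 / 3.0
kappa1 = kappa2 = kappa3 = 0.1

def potential(x, y):
    # V = (k1|x|^(n+1) + k2|y-x|^(n+1) + k3|y|^(n+1)) / (n+1), as in total_energy
    return (kappa1 * abs(x) ** (n + 1) + kappa2 * abs(y - x) ** (n + 1)
            + kappa3 * abs(y) ** (n + 1)) / (n + 1)

def pqdot(state, t):
    x, y, px, py = state
    coupling = kappa2 * np.sign(y - x) * abs(y - x) ** n
    # dpx/dt = -dV/dx and dpy/dt = -dV/dy; the coupling enters with opposite signs.
    dpx = -kappa1 * np.sign(x) * abs(x) ** n + coupling
    dpy = -kappa3 * np.sign(y) * abs(y) ** n - coupling
    return np.array([px, py, dpx, dpy])

# Central finite differences of V as an independent check of the forces
# (the test point is arbitrary but kept away from x = 0, y = 0 and x = y).
x0, y0, eps = 0.3, -0.7, 1e-6
fx = -(potential(x0 + eps, y0) - potential(x0 - eps, y0)) / (2 * eps)
fy = -(potential(x0, y0 + eps) - potential(x0, y0 - eps)) / (2 * eps)
rhs = pqdot([x0, y0, 0.1, 0.2], 0.0)
```

If the coupling term is given the same sign in both momentum equations, as in the question, the comparison for dpy/dt fails.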
I changed the integration of the ODE to
from scipy.integrate import odeint
t = linspace(0, 1000.0, 2000+1)
...
E_kin = E-total_energy([x0,y0,0,0])
init_cons = [[x0, y0, (2*E_kin-py**2)**0.5, py] for py in np.linspace(-10,10,8)]
outs = [ odeint(pqdot, con, t, atol=1e-9, rtol=1e-8) for con in init_cons[:8] ]
Obviously the number and parametrization of initial conditions may change.
The computation and display of the zero-crossings was changed to
def refine_crossing(a,b):
    tf = -a[0]/a[2]
    while abs(b[0])>1e-6:
        b = odeint(pqdot, a, [0,tf], atol=1e-8, rtol=1e-6)[-1]
        # Newton step using that b[0]=x(tf) and b[2]=x'(tf)
        tf -= b[0]/b[2]
    return [ b[1], b[3] ]
# Plot Poincare sections at x=0 and px>0
fig2 = figure(2)
for ii in range(8):
    #subplot(4, 2, ii+1)
    xcrossings = findcrossings(outs[ii][:,0], outs[ii][:,3])
    ycrossings = [ refine_crossing(outs[ii][cross], outs[ii][cross+1]) for cross in xcrossings]
    yints, pyints = array(ycrossings).T
    plot(yints, pyints,'.')
    ylabel("py")
    xlabel("y")
    title("Poincare section x = 0")
and evaluating the result of a longer integration interval
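The answer also mentions "correct linear interpolation" as a cheaper alternative to the Newton refinement. A minimal sketch of that idea follows; the function name interpolate_crossing and the two sample states are made up. Instead of averaging the two samples around a crossing, it weights them by how far each is from x = 0.

```python
import numpy as np

def interpolate_crossing(a, b, index=0):
    """Linearly interpolate between states a and b to where component `index` is zero."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    s = a[index] / (a[index] - b[index])  # fraction of the step, between 0 and 1
    return a + s * (b - a)

# Two consecutive samples of (x, y, px, py) that straddle x = 0 (made-up numbers).
a = [-0.02, 1.00, 0.5, 0.30]
b = [0.06, 1.04, 0.5, 0.26]
crossing = interpolate_crossing(a, b)
```

Here the crossing lies a quarter of the way from a to b, so the midpoint average used in the question would be noticeably off.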

calculating the curl of u and v wind components in satellite data - Python

I am not sure how to take derivatives of the u and v components of the wind in satellite data. I thought I could use numpy.gradient in this way:
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
GridSat = Dataset('analysis_20040713_v11l30flk.nc4','r',format='NETCDF4')
missing_data = -9999.0
lat = GridSat.variables['lat']
lat = lat[:]
lat[np.where(lat==missing_data)] = np.nan
lat[np.where(lat > 90.0)] = np.nan
lon = GridSat.variables['lon']
lon = lon[:]
lon[np.where(lon==missing_data)] = np.nan
uwind_data = GridSat.variables['uwnd']
uwind = GridSat.variables['uwnd'][:]
uwind_sf = uwind_data.scale_factor
uwind_ao = uwind_data.add_offset
miss_uwind = uwind_data.missing_value
uwind[np.where(uwind==miss_uwind)] = np.nan
vwind_data = GridSat.variables['vwnd']
vwind = GridSat.variables['vwnd'][:]
vwind_sf = vwind_data.scale_factor
vwind_ao = vwind_data.add_offset
miss_vwind = vwind_data.missing_value
vwind[np.where(vwind==miss_vwind)] = np.nan
uwind = uwind[2,:,:]
vwind = vwind[2,:,:]
dx = 28400.0 # meters calculated from the 0.25 degree spatial gridding
dy = 28400.0 # meters calculated from the 0.25 degree spatial gridding
dv_dx, dv_dy = np.gradient(vwind, [dx,dy])
du_dx, du_dy = np.gradient(uwind, [dx,dy])
File "<ipython-input-229-c6a5d5b09224>", line 1, in <module>
np.gradient(vwind, [dx,dy])
File "/Users/anaconda/lib/python2.7/site-packages/nump/lib/function_base.py", line 1040, in gradient
out /= dx[axis]
ValueError: operands could not be broadcast together with shapes (628,1440) (2,) (628,1440)
Honestly, I am not sure how to calculate central differences of satellite data with 0.25 x 0.25 degree spacing. I don't think my dx and dy are correct either. I would really appreciate it if someone had a good idea for approaching these types of calculations with satellite data. Thank you!!
As @moarningsun commented, changing how you call np.gradient should correct the ValueError:
dv_dx, dv_dy = np.gradient(vwind, dx,dy)
du_dx, du_dy = np.gradient(uwind, dx,dy)
How you got vwind from the file is not particularly important, especially since we don't have access to that file. The shape of vwind would have been useful, though we can guess that from the error message. The reference in the error to a (2,) array is to [dx,dy]. When you get broadcasting errors, check the shapes of the various arguments.
The np.gradient code is straightforward, complicated only by the fact that it can handle 1D, 2D, 3D and higher-dimensional data. Basically it is doing calculations like
(z[:,2:]-z[:,:-2])/2
(z[2:,:]-z[:-2,:])/2
for the inner values, and 1 item steps for the boundary values.
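One detail worth checking is the order of the returned derivatives: np.gradient returns one array per axis, in axis order. In the question the arrays have shape (628, 1440), which suggests (latitude, longitude) ordering, although that reading is an assumption. Under that assumption the first result is the y-derivative. The sketch below uses a made-up solid-body rotation on a small regular grid in metres, with hypothetical spacings, to show the call and the vertical component of the curl.

```python
import numpy as np

# Hypothetical regular grid in metres; rows follow y, columns follow x.
dx, dy = 1000.0, 2000.0
ny, nx = 6, 8
y = np.arange(ny) * dy
x = np.arange(nx) * dx
X, Y = np.meshgrid(x, y)          # both have shape (ny, nx)

# Solid-body rotation u = -omega*y, v = omega*x has vorticity 2*omega everywhere.
omega = 1e-4
u = -omega * Y
v = omega * X

# One derivative per axis, in axis order: axis 0 (y) first, then axis 1 (x).
du_dy, du_dx = np.gradient(u, dy, dx)
dv_dy, dv_dx = np.gradient(v, dy, dx)

# Vertical component of the curl.
vorticity = dv_dx - du_dy
```

On a real latitude-longitude grid the east-west spacing in metres shrinks with the cosine of the latitude, so a single constant dx is only a rough approximation.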
I'll leave the question of deriving a curl from the gradients (or not) to others.
As mentioned, there is the issue of having to implement a discrete curl operator of some kind. This is presumably a routine concern in atmospheric physics so you could check a textbook on that.
Another approach might be to fit a spline to the data so that you can use continuous operations. For example
bspl = scipy.interpolate.SmoothBivariateSpline(x,y,z,s=0)
s here is a smoothing factor which you should play with; if the data are very precise, s=0 gives the best results; if they have substantial scatter, you will want some smoothing. Now you can compute the curl directly:
curl = bspl.integral(x0,x1,y0,y1) / ((x1-x0)*(y1-y0))
EDIT:
The above expression does not give the curl, but the basic idea is sound.
The code below can be run on the MATLAB wind dataset; the file wind.mat is in
http://bioinformatics.intec.ugent.be/MotifSuite/INCLUSive_for_users/CPU_64/Matlab_Compiler_Runtime/v79/toolbox/matlab/demos/
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d
import scipy.io as sio
def curl(x,y,z,u,v,w):
    dx = x[0,:,0]
    dy = y[:,0,0]
    dz = z[0,0,:]
    dummy, dFx_dy, dFx_dz = np.gradient (u, dx, dy, dz, axis=[1,0,2])
    dFy_dx, dummy, dFy_dz = np.gradient (v, dx, dy, dz, axis=[1,0,2])
    dFz_dx, dFz_dy, dummy = np.gradient (w, dx, dy, dz, axis=[1,0,2])
    rot_x = dFz_dy - dFy_dz
    rot_y = dFx_dz - dFz_dx
    rot_z = dFy_dx - dFx_dy
    l = np.sqrt(np.power(u,2.0) + np.power(v,2.0) + np.power(w,2.0))
    m1 = np.multiply(rot_x,u)
    m2 = np.multiply(rot_y,v)
    m3 = np.multiply(rot_z,w)
    tmp1 = (m1 + m2 + m3)
    tmp2 = np.multiply(l,2.0)
    av = np.divide(tmp1, tmp2)
    return rot_x, rot_y, rot_z, av
mat = sio.loadmat('wind.mat')
x = mat['x']; y = mat['y']; z = mat['z']
u = mat['u']; v = mat['v']; w = mat['w']
rot_x, rot_y, rot_z, av = curl(x,y,z,u,v,w)
# plot a small area of the wind
i=5;j=7;k=8;S = 3
x1 = x[i-S:i+S, j-S:j+S, k-S:k+S];
y1 = y[i-S:i+S, j-S:j+S, k-S:k+S];
z1 = z[i-S:i+S, j-S:j+S, k-S:k+S];
u1 = u[i-S:i+S, j-S:j+S, k-S:k+S];
v1 = v[i-S:i+S, j-S:j+S, k-S:k+S];
w1 = w[i-S:i+S, j-S:j+S, k-S:k+S];
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') no longer works in recent Matplotlib
ax.view_init(elev=47, azim=-145)
ax.quiver(x1, y1, z1, u1, v1, w1, length=0.05, color = 'black')
i=5;j=7;k=8;
x0=x[i,j,k]
y0=y[i,j,k]
z0=z[i,j,k]
cx0=rot_x[i,j,k]
cy0=rot_y[i,j,k]
cz0=rot_z[i,j,k]
ax.quiver(x0, y0, z0, 0, cy0, cz0, length=1.0, color = 'blue')
plt.show()

Best way to interpolate a numpy.ndarray along an axis

I have 4-dimensional data, say for the temperature, in a numpy.ndarray.
The shape of the array is (ntime, nheight_in, nlat, nlon).
I have corresponding 1D arrays for each of the dimensions that tell me which time, height, latitude, and longitude a certain value corresponds to. For this example I need height_in, which gives the height in metres.
Now I need to bring the data onto a different height dimension, height_out, with a different length.
The following seems to do what I want:
ntime, nheight_in, nlat, nlon = t_in.shape
nheight_out = len(height_out)
t_out = np.empty((ntime, nheight_out, nlat, nlon))
for time in range(ntime):
    for lat in range(nlat):
        for lon in range(nlon):
            t_out[time, :, lat, lon] = np.interp(
                height_out, height_in, t_in[time, :, lat, lon]
            )
But with 3 nested loops, and lots of switching between python and numpy, I don't think this is the best way to do it.
Any suggestions on how to improve this? Thanks
scipy's interp1d can help:
import numpy as np
from scipy.interpolate import interp1d
ntime, nheight_in, nlat, nlon = (10, 20, 30, 40)
heights = np.linspace(0, 1, nheight_in)
t_in = np.random.normal(size=(ntime, nheight_in, nlat, nlon))
f_out = interp1d(heights, t_in, axis=1)
nheight_out = 50
new_heights = np.linspace(0, 1, nheight_out)
t_out = f_out(new_heights)
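If a SciPy-free version is preferred, or an independent check against np.interp is wanted, the same linear interpolation can be vectorized with np.searchsorted. The sketch below is an illustration under the assumption that height_in is 1D and strictly increasing; the function name interp_axis1 and the array sizes are made up. Outside the input range it holds the end values, which is the default behaviour of np.interp.

```python
import numpy as np

def interp_axis1(height_out, height_in, data):
    """Linear interpolation of `data` along axis 1 for 1D, increasing height_in."""
    # Index of the upper neighbour for every output height, clipped to a valid interval.
    hi = np.clip(np.searchsorted(height_in, height_out), 1, len(height_in) - 1)
    lo = hi - 1
    # Fractional position between the two neighbours; clipping to [0, 1]
    # holds the end values outside the input range.
    frac = np.clip((height_out - height_in[lo]) / (height_in[hi] - height_in[lo]), 0.0, 1.0)
    frac = frac[None, :, None, None]
    return (1.0 - frac) * data[:, lo] + frac * data[:, hi]

rng = np.random.default_rng(0)
height_in = np.sort(rng.uniform(0, 1000, 20))
height_out = np.linspace(-50, 1050, 15)
t_in = rng.normal(size=(3, 20, 4, 5))
t_out = interp_axis1(height_out, height_in, t_in)

# Reference: the question's np.interp call for a single column.
reference = np.interp(height_out, height_in, t_in[1, :, 2, 3])
```

Because the indices and weights are computed once for all columns, the three nested Python loops from the question are not needed.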
I was looking for a similar function that works with irregularly spaced coordinates, and ended up writing my own function. As far as I see, the interpolation is handled nicely and the performance in terms of memory and speed is also quite good. I thought I'd share it here in case anyone else comes across this question looking for a similar function:
import numpy as np
import warnings
def interp_along_axis(y, x, newx, axis, inverse=False, method='linear'):
    """ Interpolate vertical profiles, e.g. of atmospheric variables
    using vectorized numpy operations
    This function assumes that the x-coordinate increases monotonically
    ps:
    * Updated to work with irregularly spaced x-coordinate.
    * Updated to work with irregularly spaced newx-coordinate
    * Updated to easily inverse the direction of the x-coordinate
    * Updated to fill with nans outside extrapolation range
    * Updated to include a linear interpolation method as well
      (it was initially written for a cubic function)
    Peter Kalverla
    March 2018
    --------------------
    More info:
    Algorithm from: http://www.paulinternet.nl/?page=bicubic
    It approximates y = f(x) = ax^3 + bx^2 + cx + d
    where y may be an ndarray input vector
    Returns f(newx)
    The algorithm uses the derivative f'(x) = 3ax^2 + 2bx + c
    and uses the fact that:
    f(0) = d
    f(1) = a + b + c + d
    f'(0) = c
    f'(1) = 3a + 2b + c
    Rewriting this yields expressions for a, b, c, d:
    a = 2f(0) - 2f(1) + f'(0) + f'(1)
    b = -3f(0) + 3f(1) - 2f'(0) - f'(1)
    c = f'(0)
    d = f(0)
    These can be evaluated at two neighbouring points in x and
    as such constitute the piecewise cubic interpolator.
    """
    # View of x and y with axis as first dimension
    if inverse:
        _x = np.moveaxis(x, axis, 0)[::-1, ...]
        _y = np.moveaxis(y, axis, 0)[::-1, ...]
        _newx = np.moveaxis(newx, axis, 0)[::-1, ...]
    else:
        _y = np.moveaxis(y, axis, 0)
        _x = np.moveaxis(x, axis, 0)
        _newx = np.moveaxis(newx, axis, 0)
    # Sanity checks
    if np.any(_newx[0] < _x[0]) or np.any(_newx[-1] > _x[-1]):
        # raise ValueError('This function cannot extrapolate')
        warnings.warn("Some values are outside the interpolation range. "
                      "These will be filled with NaN")
    if np.any(np.diff(_x, axis=0) < 0):
        raise ValueError('x should increase monotonically')
    if np.any(np.diff(_newx, axis=0) < 0):
        raise ValueError('newx should increase monotonically')
    # Cubic interpolation needs the gradient of y in addition to its values
    if method == 'cubic':
        # For now, simply use a numpy function to get the derivatives
        # This produces the largest memory overhead of the function and
        # could alternatively be done in passing.
        ydx = np.gradient(_y, axis=0, edge_order=2)
    # This will later be concatenated with a dynamic '0th' index
    ind = [i for i in np.indices(_y.shape[1:])]
    # Allocate the output array
    original_dims = _y.shape
    newdims = list(original_dims)
    newdims[0] = len(_newx)
    newy = np.zeros(newdims)
    # set initial bounds
    i_lower = np.zeros(_x.shape[1:], dtype=int)
    i_upper = np.ones(_x.shape[1:], dtype=int)
    x_lower = _x[0, ...]
    x_upper = _x[1, ...]
    for i, xi in enumerate(_newx):
        # Start at the 'bottom' of the array and work upwards
        # This only works if x and newx increase monotonically
        # Update bounds where necessary and possible
        needs_update = (xi > x_upper) & (i_upper+1<len(_x))
        # print x_upper.max(), np.any(needs_update)
        while np.any(needs_update):
            i_lower = np.where(needs_update, i_lower+1, i_lower)
            i_upper = i_lower + 1
            # index with a tuple; recent NumPy rejects a list of index arrays
            x_lower = _x[tuple([i_lower]+ind)]
            x_upper = _x[tuple([i_upper]+ind)]
            # Check again
            needs_update = (xi > x_upper) & (i_upper+1<len(_x))
        # Express the position of xi relative to its neighbours
        xj = (xi-x_lower)/(x_upper - x_lower)
        # Determine where there is a valid interpolation range
        within_bounds = (_x[0, ...] < xi) & (xi < _x[-1, ...])
        if method == 'linear':
            f0, f1 = _y[tuple([i_lower]+ind)], _y[tuple([i_upper]+ind)]
            a = f1 - f0
            b = f0
            newy[i, ...] = np.where(within_bounds, a*xj+b, np.nan)
        elif method=='cubic':
            f0, f1 = _y[tuple([i_lower]+ind)], _y[tuple([i_upper]+ind)]
            df0, df1 = ydx[tuple([i_lower]+ind)], ydx[tuple([i_upper]+ind)]
            a = 2*f0 - 2*f1 + df0 + df1
            b = -3*f0 + 3*f1 - 2*df0 - df1
            c = df0
            d = f0
            newy[i, ...] = np.where(within_bounds, a*xj**3 + b*xj**2 + c*xj + d, np.nan)
        else:
            raise ValueError("invalid interpolation method "
                             "(choose 'linear' or 'cubic')")
    if inverse:
        newy = newy[::-1, ...]
    return np.moveaxis(newy, 0, axis)
And this is a small example to test it:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d as scipy1d
# toy coordinates and data
nx, ny, nz = 25, 30, 10
x = np.arange(nx)
y = np.arange(ny)
z = np.tile(np.arange(nz), (nx,ny,1)) + np.random.randn(nx, ny, nz)*.1
testdata = np.random.randn(nx,ny,nz) # x,y,z
# Desired z-coordinates (must be between bounds of z)
znew = np.tile(np.linspace(2,nz-2,50), (nx,ny,1)) + np.random.randn(nx, ny, 50)*0.01
# Inverse the coordinates for testing
z = z[..., ::-1]
znew = znew[..., ::-1]
# Now use own routine
ynew = interp_along_axis(testdata, z, znew, axis=2, inverse=True)
# Check some random profiles
for i in range(5):
    randx = np.random.randint(nx)
    randy = np.random.randint(ny)
    checkfunc = scipy1d(z[randx, randy], testdata[randx,randy], kind='cubic')
    checkdata = checkfunc(znew)
    fig, ax = plt.subplots()
    ax.plot(testdata[randx, randy], z[randx, randy], 'x', label='original data')
    ax.plot(checkdata[randx, randy], znew[randx, randy], label='scipy')
    ax.plot(ynew[randx, randy], znew[randx, randy], '--', label='Peter')
    ax.legend()
plt.show()
Following the convention of numpy.interp, one can assign the left/right boundary values to the points outside the range by adding these lines after within_bounds = ...
out_lbound = (xi <= _x[0,...])
out_rbound = (_x[-1,...] <= xi)
and
newy[i, out_lbound] = _y[0, out_lbound]
newy[i, out_rbound] = _y[-1, out_rbound]
after newy[i, ...] = ....
If I have understood the strategy used by @Peter9192 correctly, these changes follow the same approach. I have checked it a little, but some unusual cases might still not work properly.
