I have a function D(x,y,z) in which I want to evaluate (via interpolation) planes within the z, y, and z axis. i.e. I want the output of my interpolations to be a 2D plane holding one of the values fixed, D(x,y,0) for example.
I have created an interpolating function via scipy using some given values of D, D_values, for my input values of x,y,z.
from scipy.interpolate import RegularGridInterpolator as rgi
D_interp=rgi((x_positions,y_positions,z_positions), D_values)
Now I can get any point interpolated by just calling
D_interpolated=D_interp(xi,yi,zi)
I understand how I can evaluate individual points from this, but how would I interpolate a plane? For example, in my case, D_values is of size 345x155x303 and I want to interpolate 345x155 planes all along the z axis corresponding to the x and y input values, at z=0, z=1, z=2, etc.
My attempt at a solution is to feed in the x_positions, y_positions vectors individually into D_interp keeping z fixed, but this just gets me a set of D values evaluated at specific positions, rather than organized into a grid like the planar output I'd actually like. Syntax doesn't allow me to call something like
Plane=D_interp(x_positions,y_positions,0)
so I was not quite sure about the syntax of calling this function to have planar output.
any help appreciated
Thanks,
The typical approach to combining multiple arrays with different sizes corresponding to different dimensions in numpy and scipy is to use broadcasting. Here is a sample problem to illustrate the application:
x_positions = np.linspace(0, 10, 101)
y_positions = np.linspace(-10, 10, 201)
z_positions = np.linspace(-5, 5, 101)
D_values = np.sin(2 * np.pi * x_positions[:, None, None] * y_positions[:, None] / 100) + np.cos(2 * np.pi * y_positions[:, None] * z_positions / 50)
This is similar to the D_values array you describe in your problem, where each of the bins in the different directions correspond to the *_positions arrays. I used broadcasting to turn x_positions into a (101, 1, 1)-shaped array, y_positions into a (201, 1)-shaped array and left z_positions as a (101,)-shaped array. The result is that D_values is a (101, 201, 101)-shaped array. The reshaped versions of the input arrays did not copy any data!
You can call your interpolator using the same idea that I used to create a sample D_values.
D_interp = rgi((x_positions, y_positions, z_positions), D_values)
Let's say you want to fix z = 0. All that scipy requires is that the inputs broadcast together. Scalars broadcast with everything, so you can just do
x_interp = np.linspace(0.05, 0.95, 200)
y_interp = np.linspace(-9.95, 9.95, 400)
z_interp = 0
D_xy_interp = D_interp((x_interp[:, None], y_interp, z_interp))
The advantage to doing this over creating a mesh is that you don't have to copy any data around and create extra 200x400 input arrays. Another advantage is that you have better control over the output. In this case, D_xy_interp has shape (len(x_interp), len(y_interp)). That's because in general, the shape of the output will be the broadcasted shape of the input. You can see that when we created D_values, and you can see it here. Since 0 is a scalar, it does not contribute to the shape. But I could also make a (400, 200) shaped array instead:
D_interp((x_interp, y_interp[:, None], z_interp))
Or even a (100, 4, 100, 2) shaped array:
D_interp((x_interp.reshape(-1, 2), y_interp.reshape(-1, 4, 1, 1), z_interp))
In either case, let's verify that the interpolator did it's job. We can compare the interpolated values to a much finer sampling of the function that created D_values:
D_xy_values = np.sin(2 * np.pi * x_interp[:, None] * y_interp / 100) + np.cos(2 * np.pi * y_interp * z_interp / 50)
fig, ax = plt.subplots(subplot_kw={'projection': '3d'})
ax.plot_surface(x_interp[:, None], y_interp, D_xy_interp, label='Interp')
ax.plot_surface(x_interp[:, None], y_interp, D_xy_values, label='Values')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()
At the moment it doesn't look like you can add legends to 3D plots:
.
The two plots are virtually indistinguishable. With the default color cycler, you will see the surface chance from blue to orange as you rotate it. Here is an analytical verification:
>>> np.sqrt(np.mean((D_xy_values - D_xy_interp)**2))
4.707625623185639e-05
Related
I have an array X with dimension mxn, for every row m I want to get a correlation with a vector y with dimension n.
In Matlab this would be possible with the corr function corr(X,y). For Python however this does not seem possible with the np.corrcoef function:
import numpy as np
X = np.random.random([1000, 10])
y = np.random.random(10)
np.corrcoef(X,y).shape
Which results in shape (1001, 1001). But this will fail when the dimension of X is large. In my case, there is an error:
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 5.93 TiB for an array with shape (902630, 902630) and data type float64
Since the X.shape[0] dimension is 902630.
My question is, how can I only get the row wise correlations with the vector resulting in shape (1000,) of all correlations?
Of course this could be done via a list comprehension:
np.array([np.corrcoef(X[i, :], y)[0,1] for i in range(X.shape[0])])
Currently I am therefore using numba with a for loop running through the >900000 elemens. But I think there could be a much more efficient matrix operation function for this problem.
EDIT:
Pandas provides with the corrwith function also a method for this problem:
X_df = pd.DataFrame(X)
y_s = pd.Series(y)
X_df.corrwith(y_s)
The implementation allows for different correlation type calculations, but does not seem to be implemmented as a matrix operation and is therefore really slow. Probably there is a more efficient implementation.
This should work to compute the correlation coefficient for each row with a specified y in a vectorized manner.
X = np.random.random([1000, 10])
y = np.random.random(10)
r = (len(y) * np.sum(X * y[None, :], axis=-1) - (np.sum(X, axis=-1) * np.sum(y))) / (np.sqrt((len(y) * np.sum(X**2, axis=-1) - np.sum(X, axis=-1) ** 2) * (len(y) * np.sum(y**2) - np.sum(y)**2)))
print(r[0], np.corrcoef(X[0], y))
0.4243951, 0.4243951
I have a 3-d array of shape=(3, 60000, 10) which needs to be 2-D so as to be able to visualize it when clustering.
I was planning on implementing the k-means clustering from scikit-learn to the 3-d array and read that it only takes in 2-D shape , I just wanted some advice as to whether there is a right way to do it ? I was planning on making it (60000,30) , but wanted a clarification before I go ahead.
How I read it is that you have 10 features each consisting of 3d data. Do you intend to cluster all 10 features? If so reshape it such that you have 600000 x 3 points (assuming you want to separate in space). For example this
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt, numpy as np
# 3x points
data = np.random.rand(100, 3, 10) + np.arange(10) # add arbitrary offset for "difference" in real data
data = np.moveaxis(data, -1, 1).reshape(-1, 3)
n_clus = 10 # cluster in 10 --> fill in with your goal in mind
km = KMeans(n_clusters = n_clus).fit(data)
fig, ax = plt.subplots(subplot_kw = dict(projection = '3d'))
colors = plt.cm.tab20(np.linspace(0, 1, n_clus))
ax.scatter(*data.T, c = colors[km.labels_])
fig.show()
Yields
(600000 , 30) is probably not a great idea. K-means clustering uses a distance metrics to define clusters, Euclidean distance normally, but when you increase number of variables in the second dimension you fall into a curse of dimensionality where results of clustering will stop making sense.
You can of course try (600000, 30) and see if it works, but if it doesn't, you'll need to do reduce dimensionality, for example by doing a PCA and use principal components to do clustering.
EDIT
I'll probably try and explain what I mean by dimensionality and the issues it causes since there appears to be some confusion.
A 2d array of size (100, 2) is a 2-dimensional data, i.e. it's 100 observations of 2 variables. The trend line between those points would be a 1d object (line) and you can plot it on a 2d plane. Similarly, a (100, 3) array is 3-dimensional with a trendline being a 2d plane and you can plot those points on a 3d chart.
Then (100, 100) array is 100-dimensional. A trend would be a 99-dimensional hyperplane and you cannot visualise even in principle. Now let's see what issues this causes. Let's define a simple function calculating Euclidean distance:
def distance(x, y):
return sum((i - j)**2 for i, j in zip(x, y))**0.5
The function takes two iterables as arguments and calculates Euclidean distance between those. Now let's try with something simple:
v1 = (1, 1)
v2 = (2, 2)
v3 = (100, 100)
v4 = (120, 120)
>> distance(v1, v2)
Out: 1.4142135623730951
>> distance(v1, v3)
Out: 140.0071426749364
>> distance(v1, v4)
Out: 168.2914139223983
If we make these tuples 3 dimensional keeping the same values in all dimensions, distances become respectively: 1.73, 171.47, 206.11.
Now for the fun part - let's add a bunch of dimensions filled with "1"s:
v1 = [1, 1, 1] + list(1 for i in range(47))
v2 = [2, 2, 2] + list(1 for i in range(47))
v2 = [100, 100, 100] + list(1 for i in range(47))
v4 = [120, 120, 120] + list(1 for i in range(47))
>>> distance(v1, v2)
171.47302994931886
>>> distance(v1, v3)
175.16278143486988
>>> distance(v1, v4)
206.11404610069638
So here we increased dimensions without adding additional information to separate variables an suddenly what appeared as two distinct clusters are not so defined any more, in fact v1, v2 and v3 appear more like they belong together and v4 being an outsider.
This will also happen in most cases, unless the higher dimensions continue the pattern of the first three, i.e. (1, 1, 1...), (2, 2, 2,..), (100, 100, 100...), (120, 120, 120,...). But in most cases you will see distances shrink and clusters become indistinguishable.
I am using scipy.interp2d to interpolate over a bunch of coordinates of points:
from scipy.interpolate import interp2d
import numpy as np
grid_x = np.linspace(0, 1, 10)
grid_y = np.linspace(0, 1, 10)
grid_z = np.zeros((grid_size_kc,grid_size_kp))*5
function = RectBivariateSpline(x = grid_x, y = grid_y, z = grid_z)
I would like to interpolate over two matrices of coordinates
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
So that it would return the interpolated values at x=1,y=5, x=2, y=6, x=3, y=7, etc. Right now, I am simply looping over all potential values but this slows down my code quite a bit and I am trying to use vectorized operations to make things quicker.
Ideally, this would return an array of size 10x10 with all the interpolated values.
Thank you for your help!
I found an easy answer to this, when you call your 'function', juste write
function(x, y, grid = False)
Hope this will be useful for somebody.
I have generated a numpy array of (x, y) values as a N x N grid.
grid = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))[0]
grid.shape // (50, 50, 1)
I have a function that takes two parameters and returns 3 values.
i.e. (x, y) -> (a, b, c)
How to I apply the function over the 2d numpy array to get a 3d numpy array?
If your function really takes two parameters you probably want to map not 2d to 3d, but rather 2xMxN to 3xMxN. For this change your first line to something like
gridx, gridy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
or even use the more economical ix_ which has the advantage of not swapping axes
gridy, gridx = np.ix_(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
If your function f does not handle array arguments then as #Jacques Gaudin points out np.vectorize is probably what you want. Be warned that vectorize is primarily a convenience function it doesn't make things faster. It does useful things like broadcasting which is why using ix_ actually works
f_wrapped = np.vectorize(f)
result = f_wrapped(gridy, gridx)
Note that result in your case is a 3-tuple of 50 x 50 arrays, i.e. is grouped by output. This is convenient if you want to chain vectorized functions. If you want all in one big array just convert result to array and optionally use transpose to rearrange axes, e.g.
triplets_last = np.array(result).transpose((1, 2, 0))
If I understand correctly, you are after the np.vectorize decorator. By using it you can apply a function over a meshgrid. Your function should take only one parameter though as you do not pass the coordinates but the value at the coordinates (unless the values are tulpes with two elements).
import numpy as np
grid = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))[0]
#np.vectorize
def func(a):
return (a, a**.5, a**2)
res = np.array(list(func(grid)))
print(res.shape)
print(res)
What I want to do is rather simple but I havent found a straightforward approach thus far:
I have a 3D rectilinear grid with float values (therefore 3 coordinate axes -1D numpy arrays- for the centers of the grid cells and a 3D numpy array with the corresponding shape with a value for each cell center) and I want to interpolate (or you may call it subsample) this entire array to a subsampled array (e.g. size factor of 5) with linear interpolation.
All the approaches I've seen this far involve 2D and then 1D interpolation or VTK tricks which Id rather not use (portability).
Could someone suggest an approach that would be the equivalent of taking 5x5x5 cells at the same time in the 3D array, averaging and returning an array 5times smaller in each direction?
Thank you in advance for any suggestions
EDIT:
Here's what the data looks like, 'd' is a 3D array representing a 3D grid of cells. Each cell has a scalar float value (pressure in my case) and 'x','y' and 'z' are three 1D arrays containing the spatial coordinates of the cells of every cell (see the shapes and how the 'x' array looks like)
In [42]: x.shape
Out[42]: (181L,)
In [43]: y.shape
Out[43]: (181L,)
In [44]: z.shape
Out[44]: (421L,)
In [45]: d.shape
Out[45]: (181L, 181L, 421L)
In [46]: x
Out[46]:
array([-0.410607 , -0.3927568 , -0.37780656, -0.36527296, -0.35475321,
-0.34591168, -0.33846866, -0.33219107, -0.32688467, -0.3223876 ,
...
0.34591168, 0.35475321, 0.36527296, 0.37780656, 0.3927568 ,
0.410607 ])
What I want to do is create a 3D array with lets say a shape of 90x90x210 (roughly downsize by a factor of 2) by first subsampling the coordinates from the axes on arrays with the above dimensions and then 'interpolating' the 3D data to that array. Im not sure whether 'interpolating' is the right term though. Downsampling? Averaging?
Here's an 2D slice of the data:
Here is an example of 3D interpolation on an irregular grid using scipy.interpolate.griddata.
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
def func(x, y, z):
return x ** 2 + y ** 2 + z ** 2
# Nx, Ny, Nz = 181, 181, 421
Nx, Ny, Nz = 18, 18, 42
subsample = 2
Mx, My, Mz = Nx // subsample, Ny // subsample, Nz // subsample
# Define irregularly spaced arrays
x = np.random.random(Nx)
y = np.random.random(Ny)
z = np.random.random(Nz)
# Compute the matrix D of shape (Nx, Ny, Nz).
# D could be experimental data, but here I'll define it using func
# D[i,j,k] is associated with location (x[i], y[j], z[k])
X_irregular, Y_irregular, Z_irregular = (
x[:, None, None], y[None, :, None], z[None, None, :])
D = func(X_irregular, Y_irregular, Z_irregular)
# Create a uniformly spaced grid
xi = np.linspace(x.min(), x.max(), Mx)
yi = np.linspace(y.min(), y.max(), My)
zi = np.linspace(y.min(), y.max(), Mz)
X_uniform, Y_uniform, Z_uniform = (
xi[:, None, None], yi[None, :, None], zi[None, None, :])
# To use griddata, I need 1D-arrays for x, y, z of length
# len(D.ravel()) = Nx*Ny*Nz.
# To do this, I broadcast up my *_irregular arrays to each be
# of shape (Nx, Ny, Nz)
# and then use ravel() to make them 1D-arrays
X_irregular, Y_irregular, Z_irregular = np.broadcast_arrays(
X_irregular, Y_irregular, Z_irregular)
D_interpolated = interpolate.griddata(
(X_irregular.ravel(), Y_irregular.ravel(), Z_irregular.ravel()),
D.ravel(),
(X_uniform, Y_uniform, Z_uniform),
method='linear')
print(D_interpolated.shape)
# (90, 90, 210)
# Make plots
fig, ax = plt.subplots(2)
# Choose a z value in the uniform z-grid
# Let's take the middle value
zindex = Mz // 2
z_crosssection = zi[zindex]
# Plot a cross-section of the raw irregularly spaced data
X_irr, Y_irr = np.meshgrid(sorted(x), sorted(y))
# find the value in the irregular z-grid closest to z_crosssection
z_near_cross = z[(np.abs(z - z_crosssection)).argmin()]
ax[0].contourf(X_irr, Y_irr, func(X_irr, Y_irr, z_near_cross))
ax[0].scatter(X_irr, Y_irr, c='white', s=20)
ax[0].set_title('Cross-section of irregular data')
ax[0].set_xlim(x.min(), x.max())
ax[0].set_ylim(y.min(), y.max())
# Plot a cross-section of the Interpolated uniformly spaced data
X_unif, Y_unif = np.meshgrid(xi, yi)
ax[1].contourf(X_unif, Y_unif, D_interpolated[:, :, zindex])
ax[1].scatter(X_unif, Y_unif, c='white', s=20)
ax[1].set_title('Cross-section of downsampled and interpolated data')
ax[1].set_xlim(x.min(), x.max())
ax[1].set_ylim(y.min(), y.max())
plt.show()
In short: doing interpolation in each dimension separately is the right way to go.
You can simply average every 5x5x5 cube and return the results. However, if your data is supposed to be continuous, you should understand that is not good subsampling practice, as it will likely induce aliasing. (Also, you can't reasonably call it "interpolation"!)
Good resampling filters need to be wider than the resampling factor in order to avoid aliasing. Since you are downsampling, you should also realize that your resampling filter needs to be scaled according to the destination resolution, not the original resolution -- in order to interpolate properly, it will likely need to be 4 or 5 times as wide as your 5x5x5 cube. This is a lot of samples -- 20*20*20 is way more than 5*5*5...
So, the reason why practical implementations of resampling typically filter each dimension separately is that it is more efficient. By taking 3 passes, you can evaluate your filter using far fewer multiply/accumulate operations per output sample.