What I want to do is rather simple but I havent found a straightforward approach thus far:
I have a 3D rectilinear grid with float values (therefore 3 coordinate axes -1D numpy arrays- for the centers of the grid cells and a 3D numpy array with the corresponding shape with a value for each cell center) and I want to interpolate (or you may call it subsample) this entire array to a subsampled array (e.g. size factor of 5) with linear interpolation.
All the approaches I've seen this far involve 2D and then 1D interpolation or VTK tricks which Id rather not use (portability).
Could someone suggest an approach that would be the equivalent of taking 5x5x5 cells at the same time in the 3D array, averaging and returning an array 5times smaller in each direction?
Thank you in advance for any suggestions
EDIT:
Here's what the data looks like, 'd' is a 3D array representing a 3D grid of cells. Each cell has a scalar float value (pressure in my case) and 'x','y' and 'z' are three 1D arrays containing the spatial coordinates of the cells of every cell (see the shapes and how the 'x' array looks like)
In [42]: x.shape
Out[42]: (181L,)
In [43]: y.shape
Out[43]: (181L,)
In [44]: z.shape
Out[44]: (421L,)
In [45]: d.shape
Out[45]: (181L, 181L, 421L)
In [46]: x
Out[46]:
array([-0.410607 , -0.3927568 , -0.37780656, -0.36527296, -0.35475321,
-0.34591168, -0.33846866, -0.33219107, -0.32688467, -0.3223876 ,
...
0.34591168, 0.35475321, 0.36527296, 0.37780656, 0.3927568 ,
0.410607 ])
What I want to do is create a 3D array with lets say a shape of 90x90x210 (roughly downsize by a factor of 2) by first subsampling the coordinates from the axes on arrays with the above dimensions and then 'interpolating' the 3D data to that array. Im not sure whether 'interpolating' is the right term though. Downsampling? Averaging?
Here's an 2D slice of the data:
Here is an example of 3D interpolation on an irregular grid using scipy.interpolate.griddata.
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
def func(x, y, z):
return x ** 2 + y ** 2 + z ** 2
# Nx, Ny, Nz = 181, 181, 421
Nx, Ny, Nz = 18, 18, 42
subsample = 2
Mx, My, Mz = Nx // subsample, Ny // subsample, Nz // subsample
# Define irregularly spaced arrays
x = np.random.random(Nx)
y = np.random.random(Ny)
z = np.random.random(Nz)
# Compute the matrix D of shape (Nx, Ny, Nz).
# D could be experimental data, but here I'll define it using func
# D[i,j,k] is associated with location (x[i], y[j], z[k])
X_irregular, Y_irregular, Z_irregular = (
x[:, None, None], y[None, :, None], z[None, None, :])
D = func(X_irregular, Y_irregular, Z_irregular)
# Create a uniformly spaced grid
xi = np.linspace(x.min(), x.max(), Mx)
yi = np.linspace(y.min(), y.max(), My)
zi = np.linspace(y.min(), y.max(), Mz)
X_uniform, Y_uniform, Z_uniform = (
xi[:, None, None], yi[None, :, None], zi[None, None, :])
# To use griddata, I need 1D-arrays for x, y, z of length
# len(D.ravel()) = Nx*Ny*Nz.
# To do this, I broadcast up my *_irregular arrays to each be
# of shape (Nx, Ny, Nz)
# and then use ravel() to make them 1D-arrays
X_irregular, Y_irregular, Z_irregular = np.broadcast_arrays(
X_irregular, Y_irregular, Z_irregular)
D_interpolated = interpolate.griddata(
(X_irregular.ravel(), Y_irregular.ravel(), Z_irregular.ravel()),
D.ravel(),
(X_uniform, Y_uniform, Z_uniform),
method='linear')
print(D_interpolated.shape)
# (90, 90, 210)
# Make plots
fig, ax = plt.subplots(2)
# Choose a z value in the uniform z-grid
# Let's take the middle value
zindex = Mz // 2
z_crosssection = zi[zindex]
# Plot a cross-section of the raw irregularly spaced data
X_irr, Y_irr = np.meshgrid(sorted(x), sorted(y))
# find the value in the irregular z-grid closest to z_crosssection
z_near_cross = z[(np.abs(z - z_crosssection)).argmin()]
ax[0].contourf(X_irr, Y_irr, func(X_irr, Y_irr, z_near_cross))
ax[0].scatter(X_irr, Y_irr, c='white', s=20)
ax[0].set_title('Cross-section of irregular data')
ax[0].set_xlim(x.min(), x.max())
ax[0].set_ylim(y.min(), y.max())
# Plot a cross-section of the Interpolated uniformly spaced data
X_unif, Y_unif = np.meshgrid(xi, yi)
ax[1].contourf(X_unif, Y_unif, D_interpolated[:, :, zindex])
ax[1].scatter(X_unif, Y_unif, c='white', s=20)
ax[1].set_title('Cross-section of downsampled and interpolated data')
ax[1].set_xlim(x.min(), x.max())
ax[1].set_ylim(y.min(), y.max())
plt.show()
In short: doing interpolation in each dimension separately is the right way to go.
You can simply average every 5x5x5 cube and return the results. However, if your data is supposed to be continuous, you should understand that is not good subsampling practice, as it will likely induce aliasing. (Also, you can't reasonably call it "interpolation"!)
Good resampling filters need to be wider than the resampling factor in order to avoid aliasing. Since you are downsampling, you should also realize that your resampling filter needs to be scaled according to the destination resolution, not the original resolution -- in order to interpolate properly, it will likely need to be 4 or 5 times as wide as your 5x5x5 cube. This is a lot of samples -- 20*20*20 is way more than 5*5*5...
So, the reason why practical implementations of resampling typically filter each dimension separately is that it is more efficient. By taking 3 passes, you can evaluate your filter using far fewer multiply/accumulate operations per output sample.
Related
I have a function D(x,y,z) in which I want to evaluate (via interpolation) planes within the z, y, and z axis. i.e. I want the output of my interpolations to be a 2D plane holding one of the values fixed, D(x,y,0) for example.
I have created an interpolating function via scipy using some given values of D, D_values, for my input values of x,y,z.
from scipy.interpolate import RegularGridInterpolator as rgi
D_interp=rgi((x_positions,y_positions,z_positions), D_values)
Now I can get any point interpolated by just calling
D_interpolated=D_interp(xi,yi,zi)
I understand how I can evaluate individual points from this, but how would I interpolate a plane? For example, in my case, D_values is of size 345x155x303 and I want to interpolate 345x155 planes all along the z axis corresponding to the x and y input values, at z=0, z=1, z=2, etc.
My attempt at a solution is to feed in the x_positions, y_positions vectors individually into D_interp keeping z fixed, but this just gets me a set of D values evaluated at specific positions, rather than organized into a grid like the planar output I'd actually like. Syntax doesn't allow me to call something like
Plane=D_interp(x_positions,y_positions,0)
so I was not quite sure about the syntax of calling this function to have planar output.
any help appreciated
Thanks,
The typical approach to combining multiple arrays with different sizes corresponding to different dimensions in numpy and scipy is to use broadcasting. Here is a sample problem to illustrate the application:
x_positions = np.linspace(0, 10, 101)
y_positions = np.linspace(-10, 10, 201)
z_positions = np.linspace(-5, 5, 101)
D_values = np.sin(2 * np.pi * x_positions[:, None, None] * y_positions[:, None] / 100) + np.cos(2 * np.pi * y_positions[:, None] * z_positions / 50)
This is similar to the D_values array you describe in your problem, where each of the bins in the different directions correspond to the *_positions arrays. I used broadcasting to turn x_positions into a (101, 1, 1)-shaped array, y_positions into a (201, 1)-shaped array and left z_positions as a (101,)-shaped array. The result is that D_values is a (101, 201, 101)-shaped array. The reshaped versions of the input arrays did not copy any data!
You can call your interpolator using the same idea that I used to create a sample D_values.
D_interp = rgi((x_positions, y_positions, z_positions), D_values)
Let's say you want to fix z = 0. All that scipy requires is that the inputs broadcast together. Scalars broadcast with everything, so you can just do
x_interp = np.linspace(0.05, 0.95, 200)
y_interp = np.linspace(-9.95, 9.95, 400)
z_interp = 0
D_xy_interp = D_interp((x_interp[:, None], y_interp, z_interp))
The advantage to doing this over creating a mesh is that you don't have to copy any data around and create extra 200x400 input arrays. Another advantage is that you have better control over the output. In this case, D_xy_interp has shape (len(x_interp), len(y_interp)). That's because in general, the shape of the output will be the broadcasted shape of the input. You can see that when we created D_values, and you can see it here. Since 0 is a scalar, it does not contribute to the shape. But I could also make a (400, 200) shaped array instead:
D_interp((x_interp, y_interp[:, None], z_interp))
Or even a (100, 4, 100, 2) shaped array:
D_interp((x_interp.reshape(-1, 2), y_interp.reshape(-1, 4, 1, 1), z_interp))
In either case, let's verify that the interpolator did it's job. We can compare the interpolated values to a much finer sampling of the function that created D_values:
D_xy_values = np.sin(2 * np.pi * x_interp[:, None] * y_interp / 100) + np.cos(2 * np.pi * y_interp * z_interp / 50)
fig, ax = plt.subplots(subplot_kw={'projection': '3d'})
ax.plot_surface(x_interp[:, None], y_interp, D_xy_interp, label='Interp')
ax.plot_surface(x_interp[:, None], y_interp, D_xy_values, label='Values')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()
At the moment it doesn't look like you can add legends to 3D plots:
.
The two plots are virtually indistinguishable. With the default color cycler, you will see the surface chance from blue to orange as you rotate it. Here is an analytical verification:
>>> np.sqrt(np.mean((D_xy_values - D_xy_interp)**2))
4.707625623185639e-05
I have a 3-d array of shape=(3, 60000, 10) which needs to be 2-D so as to be able to visualize it when clustering.
I was planning on implementing the k-means clustering from scikit-learn to the 3-d array and read that it only takes in 2-D shape , I just wanted some advice as to whether there is a right way to do it ? I was planning on making it (60000,30) , but wanted a clarification before I go ahead.
How I read it is that you have 10 features each consisting of 3d data. Do you intend to cluster all 10 features? If so reshape it such that you have 600000 x 3 points (assuming you want to separate in space). For example this
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt, numpy as np
# 3x points
data = np.random.rand(100, 3, 10) + np.arange(10) # add arbitrary offset for "difference" in real data
data = np.moveaxis(data, -1, 1).reshape(-1, 3)
n_clus = 10 # cluster in 10 --> fill in with your goal in mind
km = KMeans(n_clusters = n_clus).fit(data)
fig, ax = plt.subplots(subplot_kw = dict(projection = '3d'))
colors = plt.cm.tab20(np.linspace(0, 1, n_clus))
ax.scatter(*data.T, c = colors[km.labels_])
fig.show()
Yields
(600000 , 30) is probably not a great idea. K-means clustering uses a distance metrics to define clusters, Euclidean distance normally, but when you increase number of variables in the second dimension you fall into a curse of dimensionality where results of clustering will stop making sense.
You can of course try (600000, 30) and see if it works, but if it doesn't, you'll need to do reduce dimensionality, for example by doing a PCA and use principal components to do clustering.
EDIT
I'll probably try and explain what I mean by dimensionality and the issues it causes since there appears to be some confusion.
A 2d array of size (100, 2) is a 2-dimensional data, i.e. it's 100 observations of 2 variables. The trend line between those points would be a 1d object (line) and you can plot it on a 2d plane. Similarly, a (100, 3) array is 3-dimensional with a trendline being a 2d plane and you can plot those points on a 3d chart.
Then (100, 100) array is 100-dimensional. A trend would be a 99-dimensional hyperplane and you cannot visualise even in principle. Now let's see what issues this causes. Let's define a simple function calculating Euclidean distance:
def distance(x, y):
return sum((i - j)**2 for i, j in zip(x, y))**0.5
The function takes two iterables as arguments and calculates Euclidean distance between those. Now let's try with something simple:
v1 = (1, 1)
v2 = (2, 2)
v3 = (100, 100)
v4 = (120, 120)
>> distance(v1, v2)
Out: 1.4142135623730951
>> distance(v1, v3)
Out: 140.0071426749364
>> distance(v1, v4)
Out: 168.2914139223983
If we make these tuples 3 dimensional keeping the same values in all dimensions, distances become respectively: 1.73, 171.47, 206.11.
Now for the fun part - let's add a bunch of dimensions filled with "1"s:
v1 = [1, 1, 1] + list(1 for i in range(47))
v2 = [2, 2, 2] + list(1 for i in range(47))
v2 = [100, 100, 100] + list(1 for i in range(47))
v4 = [120, 120, 120] + list(1 for i in range(47))
>>> distance(v1, v2)
171.47302994931886
>>> distance(v1, v3)
175.16278143486988
>>> distance(v1, v4)
206.11404610069638
So here we increased dimensions without adding additional information to separate variables an suddenly what appeared as two distinct clusters are not so defined any more, in fact v1, v2 and v3 appear more like they belong together and v4 being an outsider.
This will also happen in most cases, unless the higher dimensions continue the pattern of the first three, i.e. (1, 1, 1...), (2, 2, 2,..), (100, 100, 100...), (120, 120, 120,...). But in most cases you will see distances shrink and clusters become indistinguishable.
My goal is to interpolate the discretized continuous 2D Fourier transform of a function. The problem seems to be that the frequencies in each dimension are not output in strictly ascending order (see here).
The fft.fft2 function accepts a 2D array, where in my case the array (let's call it A) is structured such that A[i][j] = fun(x[i], y[j]), fun being the function to be transformed. After applying fft.fft2 to A, output is an array F of the same dimensions as the original array, such that the frequency coordinate corresponding to F[i][j] is (w_x[i], w_y[j]), where w_x = fft.fftfreq(F.shape[0]) and w_y = fft.fftfreq(F.shape[1]), both of these being 1D arrays which are not in ascending order.
Over wx and wy I am wanting to interpolate F (say to a function finterp) such that the interpolated value is returned upon calling finterp(w_x, w_y), w_x and w_y being within the domain of wx and range of wy, but otherwise arbitrary. I've looked into the varieties of interpolation available through scipy.interpolate, but it doesn't seem to me that any of them can deal with this type of data structure (the coordinate axes being defined as out-of-order 1D arrays and the function values being in a 2D array).
This is a little abstract, so here I've made up a simple example which is similar in structure to the above. Suppose we are wishing to construct a continuous function f(x, y) = x + y over the region x = [-1, 1] and y = [-1, 1] given the following data:
import numpy as np
# note that below z[i][j] corresponds to what we want f(x[i], y[j]) to be
x = np.array([0, 1, -1])
y = np.array([0, 1, -1])
z = np.array([0, 1, -1],[1, 2, 0],[-1, 0, -2])
z[i][j] we know corresponds to the function evaluated at x[i], y[j]. How can one either (a) interpolate this data directly, given its original structure, or (b) rearrange the data so that x and y are in ascending order, and the arranged z is such that z[i][j] is equal to the function evaluated at the rearranged x[i], y[j]?
The following code shows how to use fftshift to change the output of fft2 and fftfreq so that the frequency axes are monotonically increasing. After applying fftshift, you can use the arrays for interpolation. I've added display of the arrays so that you can verify that the data itself is unchanged. The origin is shifted from the top-left corner to the middle of the array, moving the negative frequencies from the right side to the left side.
import numpy as np
import matplotlib.pyplot as pp
x = np.array([0, 1, -1])
y = np.array([0, 1, -1])
z = np.array([[0, 1, -1],[1, 2, 0],[-1, 0, -2]])
f = np.fft.fft2(z)
w_x = np.fft.fftfreq(f.shape[0])
w_y = np.fft.fftfreq(f.shape[1])
pp.figure()
pp.imshow(np.abs(f))
pp.xticks(np.arange(0,len(w_x)), np.round(w_x,2))
pp.yticks(np.arange(0,len(w_y)), np.round(w_y,2))
f = np.fft.fftshift(f)
w_x = np.fft.fftshift(w_x)
w_y = np.fft.fftshift(w_y)
pp.figure()
pp.imshow(np.abs(f))
pp.xticks(np.arange(0,len(w_x)), np.round(w_x,2))
pp.yticks(np.arange(0,len(w_y)), np.round(w_y,2))
pp.show()
An alternative approach is to not use fftfreq to determine your frequencies, but compute them by hand. The FFT, by default, computes the DFT for k=[0..N-1]. Because of the periodicity, with the DFT at k equal to the DFT at k+N and k-N, its output is often interpreted to have k=[N//2...(N-1)//2] instead (but arranged differently to match k=[0..N-1]); this is the k that fftfreq returns (it returns k/N).
Thus, you can instead say
N = f.shape[0]
w_x = np.linspace(0, N, N, endpoint=False) / N
Now you don't have any negative frequencies, and instead have frequencies in the range [0,N-1]/N.
I have trained a machine learning binary classifier on a 100x85 array in sklearn. I would like to be able to vary 2 of the features in the array, say column 0 and column 1, and generate contour or surface graph, showing how the predicted probability of falling in one category varies across the surface.
It seems reasonable to me that I would use something like the following:
X = 100 x 85 array of data used for training set
clf = Trained 2-class classifier
x = np.array(X)
y = np.array(X)
x[:,0] = np.linspace(0, 100, 100)
y[:,1] = np.linspace(0, 100, 100)
xx, yy = meshgrid(x,y)
The next step would be to use
clf.predict_proba(<input arrays>)
followed by plotting, but using meshgrid results in two 8500x8500 matrices that can't be used in my classifier.
How do I get the necessary 100x85 vector at each point in the grid to use pred_proba with my classifier?
Thanks for any help you can provide.
As #wflynny says above, you need to give np.meshgrid two 1D arrays. We can use X.shape to create your x and y arrays, like this:
X=np.zeros((100,85)) # just to get the right shape here
print X.shape
# (100, 85)
x=np.arange(X.shape[0])
y=np.arange(X.shape[1])
print x.shape
# (100,)
print y.shape
# (85,)
xx,yy=np.meshgrid(x,y,indexing='ij')
print xx.shape
#(100, 85)
print yy.shape
#(100, 85)
considering I have a 3D histogram or for simplicity a 3D numpy array of shape (X,Y,Z)
import numpy as np
array = np.random.random((100,100,100))
What is the best way, using numpy or scipy to obtain array's values' indexes of which satisfy a sphere conditions?
(index_x**2 + index_y**2 + index_z**2) <= radius**2
Obvioulsy, in the later condition, the array center is (0, 0, 0). In general the condition will be
((index_x-center_x)**2 + (index_y-center_y)**2 +(index_z-center_z)**2) <= radius**2
The problem is easy to solve using simply a python loop, but I need that to be optimized.
many thanks for your help
You can first efficiently get the indexes with ogrid() and then obtain the indexes that satisfy your condition with nonzero().
Getting the indexes can be obtained with nonzero() like so:
indexes = numpy.transpose((x**2+y**2+z**2 <= radius**2).nonzero()) # transpose() might be unnecessary: it depends on your needs
where the indexes arrays are obtained efficiently with ogrid():
x, y, z = numpy.ogrid[:100, :100, :100]
or, for an arbitrary shape for your input data array:
x, y, z = ogrid[tuple(slice(None, dim) for dim in data.shape)]
Just for making #EOL nice approach more general, one can define a center within the shape of the array
array = np.random.random((100,100,100))
center = (30,10,25)
radius = 5.0
x, y, z = np.ogrid[-center[0]:array.shape[0]-center[0],-center[1] :array.shape[1]-center[1], -center[2]:array.shape[2]-center[2]]
indexes = numpy.transpose((x**2+y**2+z**2 <= radius**2).nonzero())