3D interpolation between two clouds of points - python

I want to interpolate a set of temperatures, defined on each node of a mesh from a CFD simulation, onto a different mesh.
The original data are in a csv file (X1,Y1,Z1,T1) and I want to find new T2 values on an X2,Y2,Z2 mesh.
Of the many possibilities that SciPy provides, which is the most suitable for this application? What are the differences between a linear and a nearest-node approach?
Thank you for your time.
EDIT
Here is an example:
import numpy as np
from scipy.interpolate import griddata
from scipy.interpolate import LinearNDInterpolator
data = np.array([
[ -3.5622760653000E-02, 8.0497122655290E-02, 3.0788827491158E-01],
[ -3.5854682326000E-02, 8.0591522802259E-02, 3.0784350432341E-01],
[ -2.8168760240000E-02, 8.0819296043557E-02, 3.0988532075795E-01],
[ -2.8413346037000E-02, 8.0890746063578E-02, 3.1002054434659E-01],
[ -2.8168663383000E-02, 8.0981744777379E-02, 3.1015319609412E-01],
[ -3.4150537103000E-02, 8.1385114641365E-02, 3.0865343388355E-01],
[ -3.4461673349000E-02, 8.1537336777452E-02, 3.0858242919307E-01],
[ -3.4285601228000E-02, 8.1655884824782E-02, 3.0877386496235E-01],
[ -2.1832991391000E-02, 8.0380712111108E-02, 3.0867371621337E-01],
[ -2.1933870390000E-02, 8.0335713699008E-02, 3.0867959866155E-01]])
temp = np.array([1.4285955811000E+03,
1.4281038818000E+03,
1.4543135986000E+03,
1.4636379395000E+03,
1.4624763184000E+03,
1.3410919189000E+03,
1.3400545654000E+03,
1.3505817871000E+03,
1.2361110840000E+03,
1.2398562012000E+03])
linInter= LinearNDInterpolator(data, temp)
print (linInter(np.array([[-2.8168760240000E-02, 8.0819296043557E-02, 3.0988532075795E-01]])))
This code works, but I have a dataset of 10 million points to be interpolated onto a dataset of the same size.
The problem is that this operation is very slow for all of my points: is there a way to improve my code?
I used LinearNDInterpolator because it seems to be faster than NearestNDInterpolator (LinearVSNearest).

One solution would be to use RegularGridInterpolator (if your grid is regular). Another approach I can think of is to reduce your data size by taking intervals:
step = 4 # you can increase this based on your data size (eg 100)
m = ((data.argsort(0) % step)==0).any(1)
linInter= LinearNDInterpolator(data[m], temp[m])
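A complementary idea, sketched here under assumptions rather than measured on the real data: build the interpolator once, evaluate the target mesh in chunks so memory stays bounded, and fall back to the nearest node wherever the linear interpolant returns NaN (points outside the convex hull of the source mesh). The function name `interpolate_in_chunks`, the chunk size and the synthetic test data are all hypothetical.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator, NearestNDInterpolator


def interpolate_in_chunks(src_xyz, src_t, dst_xyz, chunk_size=100_000):
    """Linear interpolation with a nearest-node fallback, evaluated chunk by chunk."""
    lin = LinearNDInterpolator(src_xyz, src_t)    # triangulates the source mesh once
    near = NearestNDInterpolator(src_xyz, src_t)  # nearest-node lookup for the fallback
    out = np.empty(len(dst_xyz))
    for start in range(0, len(dst_xyz), chunk_size):
        block = dst_xyz[start:start + chunk_size]
        vals = lin(block)
        outside = np.isnan(vals)                  # NaN marks points outside the convex hull
        if outside.any():
            vals[outside] = near(block[outside])
        out[start:start + chunk_size] = vals
    return out


# Synthetic stand-in for (X1, Y1, Z1, T1) and (X2, Y2, Z2); not the real CFD data.
rng = np.random.default_rng(0)
src_xyz = rng.random((2000, 3))
src_t = src_xyz @ np.array([1.0, 2.0, 3.0])
dst_xyz = rng.random((500, 3))
t2 = interpolate_in_chunks(src_xyz, src_t, dst_xyz, chunk_size=128)
```

Linear interpolation blends the values at the vertices of the enclosing tetrahedron, so the result varies continuously; the nearest-node approach copies the value of the closest source node, which is piecewise constant but defined everywhere.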

Related

Numpy applying a time interval sequence to a multidimensional ndarray (such as coordinates)

EDIT: added prefix / suffix values to the interval arrays to make them the same length as their corresponding data arrays, as per @user1319128's suggestion, and indeed interp does the job. For sure his solution was workable and good. I just couldn't see it because I was tired and stupid.
I am sure this is a fairly mundane application, but so far I have failed to find or come up with a way to do this without doing it outside of numpy. Maybe my brain just needs a rest; anyway, here is the problem with an example and the solution requirements.
I have two arrays with different lengths and I want to apply the time intervals common to both of them, so that the result is versions of these arrays that are all the same length and whose values relate to each other at the same row (if that makes sense). In the example below I have named this functionality "apply_timeintervals_to_array". The example code:
import numpy as np
from colorsys import hsv_to_rgb
num_xy = 20
num_colors = 12
#xy = np.random.rand(num_xy, 2) * 1080
xy = np.array([[ 687.32758344, 956.05651214],
[ 226.97671414, 698.48071588],
[ 648.59878864, 175.4882185 ],
[ 859.56600997, 487.25205922],
[ 794.43015178, 16.46114312],
[ 884.7166732 , 634.59100322],
[ 878.94218682, 835.12886098],
[ 965.47135726, 542.09202328],
[ 114.61867445, 601.74092126],
[ 134.02663822, 334.27221884],
[ 940.6589034 , 245.43354493],
[ 285.87902276, 550.32600784],
[ 785.00104142, 993.19960822],
[1040.49576307, 486.24009511],
[ 165.59409198, 156.79786175],
[1043.54280058, 313.09073855],
[ 645.62878826, 100.81909068],
[ 625.78003257, 252.17917611],
[1056.77009875, 793.02218098],
[ 2.93152052, 596.9795026 ]])
xy_deltas = np.sum((xy[1:] - xy[:-1])**2, axis=-1)
xy_ti = np.concatenate(([0.0],
(xy_deltas) / np.sum(xy_deltas)))
colors_ti = np.concatenate((np.linspace(0, 1, num_colors),
[1.0]))
common_ti = np.unique(np.sort(np.concatenate((xy_ti,
colors_ti))))
common_colors = (np.array(tuple(hsv_to_rgb(t, 0.9, 0.9) for t
in np.concatenate(([0.0],
common_ti,
[1.0]))))
* 255).astype(int)[1:-1]
common_xy = apply_timeintervals_to_array(common_ti, xy)
So one could then use the common arrays for additional computations or for rendering.
The question is what could accomplish the "apply_timeintervals_to_array" functionality, or alternatively a better way to generate the same data.
I hope this is clear enough, let me know if it isn't. Thank you in advance.
I think numpy.interp should meet your expectations. For example, if I have a 2D array of length 20 and would like to interpolate it at common_ti values of length 30, the code would be as follows.
xy = np.arange(0,400,10).reshape(20,2)
xy_ti = np.arange(20)/19
common_ti = np.linspace(0,1,30)
x=np.interp(common_ti,xy_ti,xy[:,0]) # interpolate the first column
y=np.interp(common_ti,xy_ti,xy[:,1]) #interpolate the second column
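The per-column calls above can be wrapped into the helper the question asks for. This is only a sketch: the question's call takes two arguments, while this hypothetical version takes the source times as a third, and it assumes those times are increasing (np.interp requires that), so the example builds them with a cumulative sum.

```python
import numpy as np


def apply_timeintervals_to_array(common_ti, src_ti, values):
    """Resample each column of `values`, sampled at times `src_ti`, at `common_ti`."""
    values = np.asarray(values, dtype=float)
    columns = [np.interp(common_ti, src_ti, values[:, k])
               for k in range(values.shape[1])]
    return np.column_stack(columns)


xy = np.arange(0, 400, 10, dtype=float).reshape(20, 2)
seg = np.sum((xy[1:] - xy[:-1]) ** 2, axis=-1)
# Cumulative, normalized times running from 0 to 1 (an assumption of this sketch).
xy_ti = np.concatenate(([0.0], np.cumsum(seg) / np.sum(seg)))
common_ti = np.linspace(0, 1, 30)
common_xy = apply_timeintervals_to_array(common_ti, xy_ti, xy)
```

The result has one row per entry of common_ti, so it lines up row by row with any other array resampled on the same times.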

Problem in the plotted array, which is the dft of a signal

I have an array containing the sample values of a signal (121 samples). However, when I plot its Discrete Fourier Transform, I get this plot:
This is the related part of my code:
sp = np.fft.fft(flow)
n = np.arange(len(flow))
timeStep = 1
freq = np.fft.fftfreq(n.shape[-1], d=1)
plt.plot(freq, sp.real)
According to the plot, the curve has two values at every frequency, which is neither sensible nor possible. When I print the arrays, everything looks OK. Can anyone help me? Thanks a lot.
P.S.:
The real part of sp matrix is:
[ 4.62700000e+04 -2.64892524e+04 4.94317914e+03 8.58381182e+03
-2.05540197e+03 1.53516262e+03 -1.30716540e+04 1.74769311e+04
-1.13435074e+04 -3.79140600e+03 6.94722233e+03 -2.55937762e+03
2.62187832e+03 -7.91539720e+03 1.07849088e+04 -1.86067707e+02
-8.81467635e+03 5.39181241e+03 4.67386587e+03 -1.16464162e+04
2.25400000e+03 3.43226092e+02 -2.18100065e+03 -6.91513328e+03
7.67106151e+02 6.32196523e+03 -1.11715436e+04 3.84865629e+03
4.89120922e+03 -3.04642885e+03 -1.75000000e+02 2.98504637e+03
2.46837686e+03 -2.87114353e+03 -5.14905071e+02 4.95859846e+03
-2.79387832e+03 -3.71433195e+03 5.20579454e+03 3.77109275e+01
-1.31300000e+03 -2.36758839e+02 4.66440953e+03 4.50017683e+03
-8.51326995e+03 9.20006771e+03 3.47394048e+03 -7.50148888e+03
4.57289385e+03 2.52869599e+03 -3.16622233e+03 -2.08767047e+03
9.15962695e+02 1.44698611e+03 -8.07662141e+03 6.76627369e+03
-8.90969316e+03 6.48281486e+03 -3.46137363e+03 -3.44706367e+03
6.48400000e+03 -3.44706367e+03 -3.46137363e+03 6.48281486e+03
-8.90969316e+03 6.76627369e+03 -8.07662141e+03 1.44698611e+03
9.15962695e+02 -2.08767047e+03 -3.16622233e+03 2.52869599e+03
4.57289385e+03 -7.50148888e+03 3.47394048e+03 9.20006771e+03
-8.51326995e+03 4.50017683e+03 4.66440953e+03 -2.36758839e+02
-1.31300000e+03 3.77109275e+01 5.20579454e+03 -3.71433195e+03
-2.79387832e+03 4.95859846e+03 -5.14905071e+02 -2.87114353e+03
2.46837686e+03 2.98504637e+03 -1.75000000e+02 -3.04642885e+03
4.89120922e+03 3.84865629e+03 -1.11715436e+04 6.32196523e+03
7.67106151e+02 -6.91513328e+03 -2.18100065e+03 3.43226092e+02
2.25400000e+03 -1.16464162e+04 4.67386587e+03 5.39181241e+03
-8.81467635e+03 -1.86067707e+02 1.07849088e+04 -7.91539720e+03
2.62187832e+03 -2.55937762e+03 6.94722233e+03 -3.79140600e+03
-1.13435074e+04 1.74769311e+04 -1.30716540e+04 1.53516262e+03
-2.05540197e+03 8.58381182e+03 4.94317914e+03 -2.64892524e+04]
The flow is:
[ 0. 0. 0. ... 0. 2611. 2984.]
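One likely cause, assuming the figure was drawn as a connected line: np.fft.fftfreq returns the non-negative frequencies first and the negative ones afterwards, so plt.plot joins the last positive frequency to the first negative one with a straight line across the figure. A minimal sketch of reordering both arrays with np.fft.fftshift before plotting follows; the signal here is a synthetic placeholder, not the real flow data.

```python
import numpy as np

# Placeholder signal with 121 samples; the actual `flow` values are not reproduced here.
t = np.arange(121)
flow = np.sin(2 * np.pi * 5 * t / 121) + 0.5

sp = np.fft.fft(flow)
freq = np.fft.fftfreq(flow.size, d=1.0)

# Reorder so that frequency increases monotonically from negative to positive.
freq_sorted = np.fft.fftshift(freq)
sp_sorted = np.fft.fftshift(sp)

# With matplotlib imported as plt:
# plt.plot(freq_sorted, sp_sorted.real)
```

For a real-valued signal the real part of the spectrum is symmetric about zero frequency, so plotting only the non-negative half (for example with np.fft.rfft and np.fft.rfftfreq) is another option.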

Result of 3D FFT using pyculib is wrong

I use pyculib to perform a 3D FFT on a matrix in Anaconda 3.5. I just followed the example code posted on the website, but I found something interesting that I don't understand.
The 3D FFT of a matrix computed with pyculib is correct only when the matrix is created with numpy.arange.
Here is the code:
from pyculib.fft.binding import Plan, CUFFT_C2C
import numpy as np
from numba import cuda
data = np.random.rand(26, 256, 256).astype(np.complex64)
orig = data.copy()
d_data = cuda.to_device(data)
fftplan = Plan.three(CUFFT_C2C, *data.shape)
fftplan.forward(d_data, d_data)
fftplan.inverse(d_data, d_data)
d_data.copy_to_host(data)
n = data.size  # total number of elements (26*256*256); cuFFT transforms are unnormalized
result = data / n
np.allclose(orig, result.real)
The comparison returns False, and the difference between orig and result is not negligibly small.
I tried some other data sets (not random numbers) and got the same wrong results.
I also tested without the inverse FFT:
from pyculib.fft.binding import Plan, CUFFT_C2C
import numpy as np
from numba import cuda
from scipy.fftpack import fftn,ifftn
data = np.random.rand(26,256,256).astype(np.complex64)
orig = data.copy()
orig_fft = fftn(orig)
d_data = cuda.to_device(data)
fftplan = Plan.three(CUFFT_C2C, *data.shape)
fftplan.forward(d_data, d_data)
d_data.copy_to_host(data)
np.allclose(orig_fft, data)
The result is also wrong.
The test code on the website uses numpy.arange to create the matrix, so I tried:
n = 26*256*256
data = np.arange(n, dtype=np.complex64).reshape(26,256,256)
And the FFT result of this matrix is right.
Could anyone help to point out why?
I don't use CUDA, but I think your problem is numerical in nature. The difference lies in the two data sets you are using. random.rand has a dynamic range of 0 to 1, and arange has a range of 0 to 26*256*256. The FFT attempts to resolve spatial frequency components on the order of (range of values) / (number of points). For arange, this becomes unity, and the FFT is numerically accurate. For rand, this is 1/(26*256*256) ≈ 5.9e-7.
Just running FFT/IFFT on your numpy arrays without using CUDA shows similar differences.
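To check this on your own machine, a small CPU-only sketch like the one below can help; it reports the size of the round-trip error instead of only the True/False answer from np.allclose. The helper name `max_abs_error` is made up for illustration, and the precision NumPy computes in depends on its version, so no expected output is given.

```python
import numpy as np


def max_abs_error(reference, candidate):
    """Largest absolute element-wise difference between two arrays."""
    return float(np.max(np.abs(np.asarray(reference) - np.asarray(candidate))))


rng = np.random.default_rng(0)
orig = rng.random((26, 256, 256)).astype(np.complex64)

# NumPy's ifftn already divides by the number of elements, unlike the cuFFT plan.
roundtrip = np.fft.ifftn(np.fft.fftn(orig))
err = max_abs_error(orig, roundtrip)
print(f"max |orig - ifftn(fftn(orig))| = {err:.3e}")
```

The same helper can be applied to the array copied back from the GPU, which makes it easier to tell a precision issue from a genuinely different result.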

Does fancyimpute's SoftImpute require normalized data?

The page https://pypi.python.org/pypi/fancyimpute has the line
# Instead of solving the nuclear norm objective directly, instead
# induce sparsity using singular value thresholding
X_filled_softimpute = SoftImpute().complete(X_incomplete_normalized)
which suggests that I need to normalize the input data. However, I did not find any details on the internet about what exactly is meant by that. Do I have to normalize my data beforehand, and what exactly is expected?
Yes, you should definitely normalize the data. Consider the following example:
from fancyimpute import SoftImpute
import numpy as np
v=np.random.normal(100,0.5,(5,3))
v[2,1:3]=np.nan
v[0,0]=np.nan
v[3,0]=np.nan
SoftImpute().complete(v)
The result is
array([[ 81.78428587, 99.69638878, 100.67626769],
[ 99.82026281, 100.09077899, 99.50273223],
[ 99.70946085, 70.98619873, 69.57668189],
[ 81.82898539, 99.66269922, 100.95263318],
[ 99.14285815, 100.10809651, 99.73870089]])
Note that the places where I put nan are completely off. However, if instead you run
from fancyimpute import SoftImpute
import numpy as np
v=np.random.normal(0,1,(5,3))
v[2,1:3]=np.nan
v[0,0]=np.nan
v[3,0]=np.nan
SoftImpute().complete(v)
(same code as before, the only difference is that v is normalized) you get the following reasonable result:
array([[ 0.07705556, -0.53449412, -0.20081351],
[ 0.9709198 , -1.19890962, -0.25176222],
[ 0.41839224, -0.11786451, 0.03231515],
[ 0.21374759, -0.66986997, 0.78565414],
[ 0.30004524, 1.28055845, 0.58625942]])
Thus, when you are using SoftImpute, don't forget to normalize your data (you can do that by shifting every column to mean 0 and scaling it to standard deviation 1).
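The column-wise normalization described above can be sketched in plain NumPy, ignoring the missing entries when computing the statistics. The helper names `standardize_columns` and `restore_columns` are hypothetical and not part of fancyimpute; the SoftImpute call is left as a comment and written the same way as in the example above.

```python
import numpy as np


def standardize_columns(x):
    """Scale every column to mean 0 and std 1, ignoring NaN entries."""
    mean = np.nanmean(x, axis=0)
    std = np.nanstd(x, axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (x - mean) / std, mean, std


def restore_columns(z, mean, std):
    """Undo standardize_columns, e.g. after the missing values were filled."""
    return z * std + mean


rng = np.random.default_rng(0)
v = rng.normal(100, 0.5, (5, 3))
v[2, 1:3] = np.nan
v[0, 0] = np.nan
v[3, 0] = np.nan

v_norm, mean, std = standardize_columns(v)
# v_filled_norm = SoftImpute().complete(v_norm)   # imputation on normalized data
# v_filled = restore_columns(v_filled_norm, mean, std)
```

Keeping the mean and std around lets you map the imputed values back to the original units afterwards.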

problem with hierarchical clustering in Python

I am doing a hierarchical clustering of a 2-dimensional matrix using the correlation distance metric (i.e. 1 - Pearson correlation). My code is the following (the data is in a variable called "data"):
from hcluster import *
Y = pdist(data, 'correlation')
cluster_type = 'average'
Z = linkage(Y, cluster_type)
dendrogram(Z)
The error I get is:
ValueError: Linkage 'Z' contains negative distances.
What causes this error? The matrix "data" that I use is simply:
[[ 156.651968 2345.168618]
[ 158.089968 2032.840106]
[ 207.996413 2786.779081]
[ 151.885804 2286.70533 ]
[ 154.33665 1967.74431 ]
[ 150.060182 1931.991169]
[ 133.800787 1978.539644]
[ 112.743217 1478.903191]
[ 125.388905 1422.3247 ]]
I don't see how pdist could ever produce negative numbers when taking 1 - Pearson correlation. Any ideas on this?
Thank you.
There are some lovely floating point problems going on. If you look at the results of pdist, you'll find there are very small negative numbers (-2.22044605e-16) in them. Essentially, they should be zero. You can use numpy's clip function to deal with it if you would like.
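A minimal sketch of the clipping step, written with the SciPy equivalents (scipy.spatial.distance.pdist and scipy.cluster.hierarchy.linkage) instead of hcluster; that swap is an assumption made for the example. Each row here has only two entries, so every pairwise Pearson correlation is exactly +1 or -1 in exact arithmetic, which is why round-off can push a distance slightly below zero.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

data = np.array([
    [156.651968, 2345.168618],
    [158.089968, 2032.840106],
    [207.996413, 2786.779081],
    [151.885804, 2286.70533],
    [154.33665, 1967.74431],
    [150.060182, 1931.991169],
    [133.800787, 1978.539644],
    [112.743217, 1478.903191],
    [125.388905, 1422.3247],
])

Y = pdist(data, 'correlation')
Y = np.clip(Y, 0.0, None)   # round-off negatives such as -2.2e-16 become exactly 0
Z = linkage(Y, 'average')
```

The clipped condensed distance vector can then be passed on to dendrogram(Z) as in the question.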
If you are getting the error
KeyError: -428
and your code is along the lines of
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
from scipy.cluster.hierarchy import ward, dendrogram
linkage_matrix = ward(dist) #define the linkage_matrix using ward clustering pre-computed distances
fig, ax = plt.subplots(figsize=(35, 20),dpi=400) # set size
ax = dendrogram(linkage_matrix, orientation="right",labels=queries);
it is due to a mismatch in the indexes of queries.
You might want to update the call to
ax = dendrogram(linkage_matrix, orientation="right",labels=list(queries));
