I want to interpolate a set of temperatures, defined on each node of a CFD simulation mesh, onto a different mesh.
Data from the original set are in a CSV file (X1, Y1, Z1, T1), and I want to find new T2 values on an X2, Y2, Z2 mesh.
Of the many possibilities that SciPy provides, which is the most suitable for this application? And what are the differences between a linear and a nearest-node approach?
Thank you for your time.
EDIT
Here is an example:
import numpy as np
from scipy.interpolate import griddata
from scipy.interpolate import LinearNDInterpolator
data = np.array([
[ -3.5622760653000E-02, 8.0497122655290E-02, 3.0788827491158E-01],
[ -3.5854682326000E-02, 8.0591522802259E-02, 3.0784350432341E-01],
[ -2.8168760240000E-02, 8.0819296043557E-02, 3.0988532075795E-01],
[ -2.8413346037000E-02, 8.0890746063578E-02, 3.1002054434659E-01],
[ -2.8168663383000E-02, 8.0981744777379E-02, 3.1015319609412E-01],
[ -3.4150537103000E-02, 8.1385114641365E-02, 3.0865343388355E-01],
[ -3.4461673349000E-02, 8.1537336777452E-02, 3.0858242919307E-01],
[ -3.4285601228000E-02, 8.1655884824782E-02, 3.0877386496235E-01],
[ -2.1832991391000E-02, 8.0380712111108E-02, 3.0867371621337E-01],
[ -2.1933870390000E-02, 8.0335713699008E-02, 3.0867959866155E-01]])
temp = np.array([1.4285955811000E+03,
1.4281038818000E+03,
1.4543135986000E+03,
1.4636379395000E+03,
1.4624763184000E+03,
1.3410919189000E+03,
1.3400545654000E+03,
1.3505817871000E+03,
1.2361110840000E+03,
1.2398562012000E+03])
linInter= LinearNDInterpolator(data, temp)
print (linInter(np.array([[-2.8168760240000E-02, 8.0819296043557E-02, 3.0988532075795E-01]])))
This code works, but I have a dataset of 10 million points to be interpolated onto a dataset of the same size.
The problem is that this operation is very slow for all of my points: is there a way to improve my code?
I used LinearNDInterpolator because it seems to be faster than NearestNDInterpolator (LinearVSNearest).
One solution would be to use RegularGridInterpolator (if your grid is regular). Another approach I can think of is to reduce your data size by taking intervals:
step = 4  # you can increase this based on your data size (e.g. 100)
m = ((data.argsort(0) % step) == 0).any(1)   # boolean mask used to subsample the scattered points
linInter = LinearNDInterpolator(data[m], temp[m])
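For reference, here is a minimal sketch of the RegularGridInterpolator route mentioned above, assuming your source points actually lie on a regular grid; the axis vectors x, y, z, the gridded array T1 and the target array below are hypothetical stand-ins, not your CSV data:
import numpy as np
from scipy.interpolate import RegularGridInterpolator
# hypothetical regular source grid: three axis vectors plus temperatures reshaped to (nx, ny, nz)
x = np.linspace(-0.04, -0.02, 50)
y = np.linspace(0.080, 0.082, 50)
z = np.linspace(0.307, 0.311, 50)
T1 = np.random.rand(50, 50, 50)   # stand-in for the gridded temperature values
interp = RegularGridInterpolator((x, y, z), T1, method='linear',
                                 bounds_error=False, fill_value=np.nan)
# target mesh nodes as an (N, 3) array of X2, Y2, Z2 coordinates
targets = np.column_stack([np.random.uniform(-0.04, -0.02, 1000),
                           np.random.uniform(0.080, 0.082, 1000),
                           np.random.uniform(0.307, 0.311, 1000)])
T2 = interp(targets)   # vectorized evaluation scales to millions of points
The speed-up over LinearNDInterpolator comes from skipping the Delaunay triangulation of the scattered points, which is the expensive step for large 3D data sets.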
I have a 1D signal array. This array holds information about some features that I want to analyze with np.fft.
As an example I tried the following:
My function should be the simple sine wave lambda x: sin(x). In theory, when I put an input array through this function I get a signal array which, when transformed with an FFT, should tell me that the main component of that signal was (in pseudocode) signal = 1*sin(x).
So far I couldn't get any wiser from any of the answers here, so I put this question up.
Now my question: How do I get the "raw" sine component weights from my signal ?
This is where I'm stuck:
>>> y = f(x)
>>> fqs = np.fft.fft(y)
>>> fqs
array([ 3.07768354+0.00000000e+00j, 3.68364588+8.32272378e-16j,
8.73514635-7.15951776e-15j, -7.34287625+1.04901868e-14j,
-2.15156054+5.10742080e-15j, -1.1755705 +4.87611209e-16j,
-0.78767676+3.40334406e-16j, -0.58990993+4.25167217e-16j,
-0.476018 -3.43242308e-16j, -0.40636656+1.13055751e-15j,
-0.36327126+1.55440604e-16j, -0.33804202-1.07128132e-16j,
-0.32634218+2.76861429e-16j, -0.32634218+8.99298797e-16j,
-0.33804202+5.02435797e-16j, -0.36327126-1.55440604e-16j,
-0.40636656-3.06536611e-16j, -0.476018 -4.57882679e-17j,
-0.58990993+4.31587904e-16j, -0.78767676+9.75500354e-16j,
-1.1755705 -4.87611209e-16j, -2.15156054-1.87113952e-15j,
-7.34287625+1.79193327e-15j, 8.73514635-6.76648711e-15j,
3.68364588-6.60371698e-15j])
>>> np.abs(_)
array([3.07768354, 3.68364588, 8.73514635, 7.34287625, 2.15156054,
1.1755705 , 0.78767676, 0.58990993, 0.476018 , 0.40636656,
0.36327126, 0.33804202, 0.32634218, 0.32634218, 0.33804202,
0.36327126, 0.40636656, 0.476018 , 0.58990993, 0.78767676,
1.1755705 , 2.15156054, 7.34287625, 8.73514635, 3.68364588])
Where do I find my 1*sin(x)?
Even though your x variable is not shown here, I think you're not generating a periodic function. This works fine for me:
import numpy as np
x=np.linspace(0,np.pi*2,100,endpoint=False)
y=np.sin(x)
yf=np.fft.rfft(y)
The output (the first values of yf) is:
(-1.5265566588595902e-16+0j)
(-1.8485213360008856e-14-50.0j)
(5.8988036787285649e-15-3.4015634637549994e-16j)
(-1.0781745022416177e-14-3.176912458933349e-15j)
(6.9770353907875146e-15-3.6920723832369405e-15j)
The only non-zero imaginary part is at mode 1, and its magnitude is N/2 = 50, which corresponds to your 1*sin(x).
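As a small follow-up sketch (not part of the answer above), the raw sine coefficient can be read off by normalizing the rfft output by N/2, which is the standard FFT amplitude scaling:
import numpy as np
N = 100
x = np.linspace(0, 2*np.pi, N, endpoint=False)
y = np.sin(x)                     # signal = 1*sin(x)
yf = np.fft.rfft(y)
amps = 2*np.abs(yf)/N             # factor 2 because rfft keeps only the positive-frequency half
amps[0] /= 2                      # the DC bin must not be doubled
print(np.argmax(amps), amps[1])   # -> 1 and ~1.0: one cycle per window, amplitude 1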
I am trying to learn how to sample truncated distributions. To begin with, I decided to try a simple example I found here: example
I didn't really understand the division by the CDF, so I decided to tweak the algorithm a bit. The distribution being sampled is an exponential distribution for values x > 0. Here is an example Python code:
# Sample exponential distribution for the case x > 0
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def pdf(x):
    return x*np.exp(-x)

xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()   # random-walk proposal
    xs = x
    if a > 0.:
        xs = a
    A = pdf(xs)/pdf(x)           # Metropolis acceptance ratio
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x

x = np.linspace(0, 15, 1000)
plt.plot(x, pdf(x))
plt.hist([x for x in xvec if x != 0], bins=150, density=True)  # density=True replaces the deprecated normed=True
plt.show()
And the output is:
The code above seems to work fine only when using the condition if a > 0.:, i.e. positive x; choosing another condition (e.g. if a > 0.5:) produces wrong results.
Since my final goal is to sample a 2D Gaussian PDF on a truncated interval, I tried extending the simple exponential example (see the code below). Unfortunately, since the simple case didn't work, I assume that the code given below would also yield wrong results.
I assume that all this can be done using the advanced tools of Python. However, since my primary aim was to understand the underlying principle, I would greatly appreciate your help in understanding my mistake.
Thank you for your help.
EDIT:
# code updated according to the answer of CrazyIvan
import numpy as np
from scipy.stats import multivariate_normal

RANGE = 100000
a = 2.06072E-02
b = 1.10011E+00
a_range = [0.001, 0.5]
b_range = [0.01, 2.5]
cov = [[3.1313994E-05, 1.8013737E-03], [1.8013737E-03, 1.0421529E-01]]
x = a
y = b
j = 0
for i in range(RANGE):
    a_t, b_t = np.random.multivariate_normal([a, b], cov)
    # accept if within bounds - all that is needed to truncate
    if a_range[0] < a_t < a_range[1] and b_range[0] < b_t < b_range[1]:
        print(a_t, b_t)  # the original printed dx, dy, which are undefined here
EDIT:
I changed the code by normalizing the analytic PDF according to this scheme, following the answers given by @CrazyIvan and @Leandro Caniglia, for the case where the bottom of the PDF is removed: that is, dividing by (1 - CDF(0.5)), since my accept condition is x > 0.5. This again seems to show some discrepancies. Again the mystery prevails...
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def pdf(x):
    return x*np.exp(-x)

# included the corresponding cdf
def cdf(x):
    return 1. - np.exp(-x) - x*np.exp(-x)

xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()
    xs = x
    if a > 0.5:
        xs = a
    A = pdf(xs)/pdf(x)
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x

x = np.linspace(0, 15, 1000)
# new part: norm the analytic pdf to fix the area
plt.plot(x, pdf(x)/(1. - cdf(0.5)))
plt.hist([x for x in xvec if x != 0], bins=200, density=True)  # density=True replaces the deprecated normed=True
plt.savefig("test_exp.png")
plt.show()
It seems that this can be cured by choosing a larger shift size:
shift = 15.
a = x + np.random.normal()*shift
which is in general a tuning issue of the Metropolis–Hastings algorithm. See the graph below:
I also checked shift=150
The bottom line is that changing the shift size definitely improves the convergence. The mystery is why, since the Gaussian is unbounded.
You say you want to learn the basic idea of sampling a truncated distribution, but your source is a blog post about
the Metropolis–Hastings algorithm? Do you actually need this "method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult"? Taking this as your starting point is like learning English by reading Shakespeare.
Truncated normal
For a truncated normal, basic rejection sampling is all you need: generate samples from the original distribution and reject those outside the bounds. As Leandro Caniglia noted, you should not expect the truncated distribution to have the same PDF except on a shorter interval; that is plainly impossible, because the area under the graph of a PDF is always 1. If you cut stuff off from the sides, there has to be more in the middle: the PDF gets rescaled.
It's quite inefficient to gather samples one by one, when you need 100000. I would grab 100000 normal samples at once, accept only those that fit; then repeat until I have enough. Example of sampling truncated normal between amin and amax:
import numpy as np

n_samples = 100000
amin, amax = -1, 2
samples = np.zeros((0,))  # empty for now
while samples.shape[0] < n_samples:
    s = np.random.normal(0, 1, size=(n_samples,))
    accepted = s[(s >= amin) & (s <= amax)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples]  # we probably got more than needed, so discard extra ones
And here is the comparison with the PDF curve, rescaled by division by cdf(amax) - cdf(amin) as explained above.
import matplotlib.pyplot as plt
from scipy.stats import norm
_ = plt.hist(samples, bins=50, density=True)
t = np.linspace(-2, 3, 500)
plt.plot(t, norm.pdf(t)/(norm.cdf(amax) - norm.cdf(amin)), 'r')
plt.show()
Truncated multivariate normal
Now we want to keep the first coordinate between amin and amax, and the second between bmin and bmax. Same story, except there will be a 2-column array and the comparison with bounds is done in a relatively sneaky way:
(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)
This means: subtract amin, bmin from each row and keep only the rows where both results are nonnegative (meaning we had a >= amin and b >= bmin). Also do a similar thing with amax, bmax. Accept only the rows that meet both criteria.
n_samples = 10
amin, amax = -1, 2
bmin, bmax = 0.2, 2.4
mean = [0.3, 0.5]
cov = [[2, 1.1], [1.1, 2]]
samples = np.zeros((0, 2))  # 2 columns now
while samples.shape[0] < n_samples:
    s = np.random.multivariate_normal(mean, cov, size=(n_samples,))
    accepted = s[(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples, :]
Not going to plot, but here are some values: naturally, within bounds.
array([[ 0.43150033, 1.55775629],
[ 0.62339265, 1.63506963],
[-0.6723598 , 1.58053835],
[-0.53347361, 0.53513105],
[ 1.70524439, 2.08226558],
[ 0.37474842, 0.2512812 ],
[-0.40986396, 0.58783193],
[ 0.65967087, 0.59755193],
[ 0.33383214, 2.37651975],
[ 1.7513789 , 1.24469918]])
To compute the truncated density function pdf_t from the entire density function pdf, do the following:
Let [a, b] be the truncation interval; (x axis)
Let A := cdf(a) and B := cdf(b); (cdf = non-truncated cumulative distribution function)
Then pdf_t(x) := pdf(x) / (B - A) if x in [a, b] and 0 elsewhere.
In cases where a = -infinity (resp. b = +infinity), take A := 0 (resp. B := 1).
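In code, these three steps could look like the following sketch (scipy.stats.norm is used purely as an example of a non-truncated pdf/cdf pair, and the name truncated_pdf is hypothetical):
import numpy as np
from scipy.stats import norm

def truncated_pdf(x, pdf, cdf, a=-np.inf, b=np.inf):
    # pdf_t(x) = pdf(x) / (B - A) inside [a, b], and 0 elsewhere
    A = cdf(a) if np.isfinite(a) else 0.0
    B = cdf(b) if np.isfinite(b) else 1.0
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), pdf(x) / (B - A), 0.0)

# example: standard normal truncated to [0.5, +infinity)
t = np.linspace(-1, 3, 9)
print(truncated_pdf(t, norm.pdf, norm.cdf, a=0.5))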
As for the "mystery" you see
please note that your blue curve is wrong. It is not the pdf of your truncated distribution, it is just the pdf of the non-truncated one, scaled by the correct amount (division by 1-cdf(0.5)). The actual truncated pdf curve starts with a vertical line on x = 0.5 which goes up until it reaches your current blue curve. In other words, you only scaled the curve but forgot to truncate it, in this case to the left. Such a truncation corresponds to the "0 elsewhere" part of step 3 in the algorithm above.
I have 2 matrices:
#for example
rotation = matrix([[ 0.61782155, 0.78631834, 0. ],
[ 0.78631834, -0.61782155, 0. ],
[ 0. , 0. , -1. ]])
translation = matrix([[-0.33657291],
[ 1.04497454],
[ 0. ]])
vtkinputpath = "/hello/world/vtkfile.vtk"
vtkoutputpath = "/hello/world/vtkrotatedfile.vtk"
interpolation = "linear"
I have a VTK file which contains a 3D image, and I want to create a Python function that rotates/translates it with interpolation.
import vtk
def rotate(vtkinputpath, vtkoutputpath, rotation, translation, interpolation):
...
I'm trying to take inspiration from the transformJ plugin sources (see here to understand how it works)
I wanted to use vtk.vtkTransform, but I don't really understand how it works: these examples are not close enough to what I want to do. This is what I did with it:
reader = vtk.vtkXMLImageDataReader()
reader.SetFileName(vtkinputpath)
reader.Update()
transform = reader.vtkTransform()
transform.RotateX(rotation[0])
transform.RotateY(rotation[1])
transform.RotateZ(rotation[2])
transform.Translate(translation[0], translation[1], translation[2])
#and I don't know how I can choose the parameter of the interpolation
But that cannot work...
I saw here that the function RotateWXYZ() exists:
# create a transform that rotates the cone
transform = vtk.vtkTransform()
transform.RotateWXYZ(45,0,1,0)
transformFilter=vtk.vtkTransformPolyDataFilter()
transformFilter.SetTransform(transform)
transformFilter.SetInputConnection(source.GetOutputPort())
transformFilter.Update()
But I don't understand what the lines do.
My main problem is that I cannot find the VTK documentation for Python...
Can you recommend a documentation website for VTK in Python? Or can you at least explain how vtkTransform (RotateWXYZ()) works?
Please, I'm totally lost, nothing works.
I'm not sure there is specific Python documentation, but this can be useful to understand how RotateWXYZ works: http://www.vtk.org/doc/nightly/html/classvtkTransform.html#a9a6bcc6b824fb0a9ee3a9048aa6b262c
To create the transform you want, you can combine the rotation and translation matrices into a single 4x4 matrix: put the rotation matrix in rows and columns 0, 1 and 2, put the translation vector in the rightmost column, and set the bottom row to 0, 0, 0, 1. Here's some more info about this. For example:
0.61782155 0.78631834 0 -0.33657291
0.78631834 -0.61782155 0 1.04497454
0 0 -1 0
0 0 0 1
Then you can directly set the matrix to vtkTransform using SetMatrix:
transform = vtk.vtkTransform()
matrix = [0.61782155, 0.78631834, 0, -0.33657291,
          0.78631834, -0.61782155, 0, 1.04497454,
          0, 0, -1, 0,
          0, 0, 0, 1]
transform.SetMatrix(matrix)
EDIT: Edited to complete the values in the matrix variable.
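Not part of the original answer, but here is a hypothetical end-to-end sketch of how such a 4x4 matrix could be applied to image data with a chosen interpolation, reusing the vtkinputpath, vtkoutputpath and interpolation variables from the question. It assumes the input is an image dataset readable by vtkXMLImageDataReader (a legacy .vtk file would need a different reader) and uses vtkImageReslice, which resamples an image through a transform:
import vtk

reader = vtk.vtkXMLImageDataReader()
reader.SetFileName(vtkinputpath)
reader.Update()

transform = vtk.vtkTransform()
transform.SetMatrix([0.61782155, 0.78631834, 0, -0.33657291,
                     0.78631834, -0.61782155, 0, 1.04497454,
                     0, 0, -1, 0,
                     0, 0, 0, 1])

reslice = vtk.vtkImageReslice()
reslice.SetInputConnection(reader.GetOutputPort())
# the reslice transform maps output coordinates back to input coordinates,
# so pass the inverse to move the image "forward" by the matrix above
reslice.SetResliceTransform(transform.GetInverse())
if interpolation == "linear":
    reslice.SetInterpolationModeToLinear()
else:
    reslice.SetInterpolationModeToNearestNeighbor()
reslice.Update()

writer = vtk.vtkXMLImageDataWriter()
writer.SetFileName(vtkoutputpath)
writer.SetInputConnection(reslice.GetOutputPort())
writer.Write()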
I am doing hierarchical clustering of a 2-dimensional matrix using the correlation distance metric (i.e. 1 - Pearson correlation). My code is the following (the data is in a variable called "data"):
from hcluster import *
Y = pdist(data, 'correlation')
cluster_type = 'average'
Z = linkage(Y, cluster_type)
dendrogram(Z)
The error I get is:
ValueError: Linkage 'Z' contains negative distances.
What causes this error? The matrix "data" that I use is simply:
[[ 156.651968 2345.168618]
[ 158.089968 2032.840106]
[ 207.996413 2786.779081]
[ 151.885804 2286.70533 ]
[ 154.33665 1967.74431 ]
[ 150.060182 1931.991169]
[ 133.800787 1978.539644]
[ 112.743217 1478.903191]
[ 125.388905 1422.3247 ]]
I don't see how pdist could ever produce negative numbers when taking 1 - Pearson correlation. Any ideas on this?
Thank you.
There are some lovely floating point problems going on. If you look at the results of pdist, you'll find there are very small negative numbers (-2.22044605e-16) in them. Essentially, they should be zero. You can use numpy's clip function to deal with it if you would like.
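For example, a minimal sketch of that fix, reusing the question's data and its hcluster-style imports (scipy.spatial.distance.pdist and scipy.cluster.hierarchy provide the same functions today):
import numpy as np
import matplotlib.pyplot as plt
from hcluster import *

data = np.array([[156.651968, 2345.168618],
                 [158.089968, 2032.840106],
                 [207.996413, 2786.779081],
                 [151.885804, 2286.70533 ],
                 [154.33665 , 1967.74431 ],
                 [150.060182, 1931.991169],
                 [133.800787, 1978.539644],
                 [112.743217, 1478.903191],
                 [125.388905, 1422.3247  ]])

Y = pdist(data, 'correlation')
Y = np.clip(Y, 0, None)   # snap tiny negative round-off (e.g. -2.22e-16) to exactly 0
Z = linkage(Y, 'average')
dendrogram(Z)
plt.show()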
If you were getting the error
KeyError: -428
and your code was along the lines of
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
from scipy.cluster.hierarchy import ward, dendrogram
linkage_matrix = ward(dist) #define the linkage_matrix using ward clustering pre-computed distances
fig, ax = plt.subplots(figsize=(35, 20),dpi=400) # set size
ax = dendrogram(linkage_matrix, orientation="right",labels=queries);
It is due to a mismatch in the indexes of queries.
You might want to update it to:
ax = dendrogram(linkage_matrix, orientation="right",labels=list(queries));