I am having trouble with a somewhat unusual interpolation problem. I have some (x, y) data points scattered along a curve which I don't know a priori, and I want to reconstruct this curve as well as possible by fitting my points with minimum square error. I thought of using scipy.interpolate.splrep for this purpose (but maybe there are better options you would advise). The additional difficulty in my case is that I want to constrain the spline curve to pass through some specific points of my original data. I assume that playing with knots and weights could do the trick, but I don't know how (I am avoiding going into spline interpolation theory beyond basic fitting procedures). Also, for some undisclosed reason, when I try to set the knots in my splrep call I get the same error as in this post, which keeps complicating things. The following is my sample code:
from __future__ import division
import numpy as np
import scipy.interpolate as spi
import matplotlib.pylab as plt
# Some surrogate sample data
f = lambda x : x**2 - x/2.
x = np.arange(0.,20.,0.1)
y = f(4*(x + np.random.normal(size=np.size(x))))
# I want to use spline interpolation with a least-squares fitting criterion, making sure though that the spline starts
# from the origin (or, in general, passes through a precise point of my dataset).
# In my case, for example, I would like the spline to originate from the point at x=0. So I attempted to include x=0 as the first knot...
# but it won't work, nor am I sure this is the right procedure...
fy = spi.splrep(x,y)
fy = spi.splrep(x,y,t=fy[0])
yy = spi.splev(x,fy)
plt.plot(x,y,'-',x,yy,'--')
plt.show()
which, despite the fact that I am passing knots computed from a first call of splrep, gives me:
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/fitpack.py", line 289, in splrep
res = _impl.splrep(x, y, w, xb, xe, k, task, s, t, full_output, per, quiet)
File "/usr/lib64/python2.7/site-packages/scipy/interpolate/_fitpack_impl.py", line 515, in splrep
raise _iermess[ier][1](_iermess[ier][0])
ValueError: Error on input data
You can use the weights argument of splrep: give the points you need fixed very large weights. This is a workaround for sure, so keep an eye on the fit quality and stability.
Setting high weights for specific points is indeed a working solution, as suggested by @ev-br. In addition, because there is no direct way to match derivatives at the extrema of the curve, the same rationale can be applied there as well. Say you want the derivative at y[0] and y[-1] to match the derivative of your data points; then you also add large weights for y[1] and y[-2], i.e.
weights = np.ones(len(x))
weights[[0,-1]] = 100 # Promote spline interpolant through first and last point
weights[[1,-2]] = 50 # Make spline interpolant derivative tend to derivatives at first/last point
fy = spi.splrep(x,y,w=weights,s=0.1)
yy = spi.splev(x,fy)
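A quick sanity check (a sketch, reusing x, y, weights, and fy from the snippet above): with a large enough weight the spline should pass almost exactly through the pinned point.
# the residual at the pinned point should shrink as its weight grows relative to the smoothing factor s
print(y[0], spi.splev(x[0], fy))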
I am trying to interpolate a set of ordered pairs using Lagrange interpolation (scipy.interpolate.lagrange); I have done this before without incident.
This time, however, I keep getting a "Division by zero" error and the interpolating polynomial comes out with infinite coefficients.
I am aware data points must not be repeated due to the internal workings of Lagrange's Method, and they are not repeated.
Here is my code and the offending ordered pair, in numpy vector format.
Code:
x = out["x"].round(decimals=3)
x = np.array(x)
y = out["y"].round(decimals=3)
y = np.array(y)
print(x)
print(y)
pol = lagrange(x,y)
print(pol)
Ordered pair:
[273.324 285.579 309.292 279.573 297.427 290.681 276.621 293.586 283.463
284.674 273.904 288.064 280.125 294.269 288.51 285.898 273.419 273.023
281.754 281.546 283.21 303.399 297.392 293.359 306.404 356.285 302.487
280.586 299.487 302.487]
[ 0. 5.414 6.202 0. 9.331 11.52 0. 10.495 5.439 4.709
0. 4.916 0. 10.508 6.736 5.25 0. 0. 6.53 4.305
5.124 6.753 10.175 10.545 5.98 9.147 11.137 0. 8.764 9.57 ]
Lots of thanks in advance.
Why Lagrange Interpolation did not work for you.
You have the value 302.487 twice in your array x. I.e. you did repeat it.
Why Lagrange Interpolation is not what you want.
As Tim Roberts pointed out, Lagrange interpolation is really not made for this many points. The problem is that polynomials of high degree tend to overfit. Check out the following example from the Wikipedia article on overfitting.
Figure 2. Noisy (roughly linear) data is fitted to a linear function and a polynomial function. Although the polynomial function is a perfect fit, the linear function can be expected to generalize better: if the two functions were used to extrapolate beyond the fitted data, the linear function should make better predictions.
Alternative Regression
There are at least two valid alternatives. One of them is what the Wikipedia article recommends: if you know roughly what type of function your data comes from, use regression to fit a function of that type to the data. In the example above that's a linear function. If you want to do that, check out scipy's curve_fit.
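For illustration, a minimal curve_fit sketch (the straight-line model and its parameter names are assumptions; substitute whatever functional form your data suggests, with x and y being the arrays from the question):
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # hypothetical model: a straight line; replace with the form your data suggests
    return a * x + b

params, cov = curve_fit(model, x, y)
y_fit = model(x, *params)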
Alternative Spline Interpolation
Another alternative is spline interpolation. Again, from the Wikipedia article on spline interpolation:
Instead of fitting a single, high-degree polynomial to all of the values at once, spline interpolation fits low-degree polynomials to small subsets of the values, for example, fitting nine cubic polynomials between each of the pairs of ten points, instead of fitting a single degree-ten polynomial to all of them. Spline interpolation is often preferred over polynomial interpolation because the interpolation error can be made small even when using low-degree polynomials for the spline. Spline interpolation also avoids the problem of Runge's phenomenon, in which oscillation can occur between points when interpolating using high-degree polynomials.
There are just two little technical details that I want to point out. One: your points need to be ordered, so I did that for you. Two: scipy's UnivariateSpline has a smoothing parameter s that you need to choose. If you pick it small, the spline sticks to the data like you're used to with Lagrange interpolation; if you make it bigger, it becomes smoother and hopefully generalizes better. Below I picked two different values for you to look at, but you should probably play around with it yourself. I included a very small one so you can see that it can do what you're used to from Lagrange interpolation, but I wouldn't recommend it. Also, you probably should use more data, preprocess it, etc., but that's not what the question was about.
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline
idx = np.argsort(x)
x = x[idx]
y = y[idx]
for s in [10, 60]:
    t = np.linspace(np.min(x), np.max(x), 10**4)
    f = UnivariateSpline(x, y, s=s)
    plt.scatter(x, y)
    plt.plot(t, f(t))
    plt.title(f'{s=}')
    plt.show()
I have a cloud of data points (x,y) that I would like to interpolate and smooth.
Currently, I am using scipy :
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter
spl = interp1d(Cloud[:,1], Cloud[:,0]) # interpolation
x = np.linspace(Cloud[:,1].min(), Cloud[:,1].max(), 1000)
smoothed = savgol_filter(spl(x), 21, 1) #smoothing
This is working pretty well, except that I would like to give some weights to the data points passed to interp1d. Any suggestion for another function that handles this?
Basically, I thought that I could just multiply the occurrence of each point of the cloud according to its weight, but that is not very efficient, as it greatly increases the number of points to interpolate and slows down the algorithm...
The default interp1d uses linear interpolation, i.e., it simply computes a line between two points. A weighted interpolation does not make much sense mathematically in such a scenario - there is only one way in Euclidean space to draw a straight line between two points.
Depending on your goal, you can look into other methods of interpolation, e.g., B-splines. Then you can use scipy's scipy.interpolate.splrep and set the w argument:
w - Strictly positive rank-1 array of weights the same length as x and y. The weights are used in computing the weighted least-squares spline fit. If the errors in the y values have standard-deviation given by the vector d, then w should be 1/d. Default is ones(len(x)).
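A minimal sketch of how that might look for the Cloud array from the question (the uniform weights are a placeholder - raise the entries for the points you trust more; sorting is needed because splrep expects increasing x):
import numpy as np
from scipy.interpolate import splrep, splev

order = np.argsort(Cloud[:, 1])             # splrep wants increasing x
xs, ys = Cloud[order, 1], Cloud[order, 0]
weights = np.ones(len(xs))                  # placeholder weights; increase entries you trust more
tck = splrep(xs, ys, w=weights, s=len(xs))  # s trades smoothness against fidelity
x_new = np.linspace(xs.min(), xs.max(), 1000)
smoothed = splev(x_new, tck)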
I have a moderate-size data set, namely 20000 x 2 floats in a two-column matrix. The first column is the x column, which represents the distance to the origin point along a trajectory; the other column is the y column, which represents the work done on the object. This data set comes from lab operations, so it's fairly arbitrary. I've already turned this structure into a numpy array. I want to plot y vs x in a figure with a smooth curve, so I hoped the following code could help me:
x_smooth = np.linspace(x.min(),x.max(), 20000)
y_smooth = spline(x, y, x_smooth)
plt.plot(x_smooth, y_smooth)
plt.show()
However, when my program executes the line y_smooth = spline(x, y, x_smooth), it takes a very long time, say 10 minutes, and sometimes it even exhausts my memory so that I have to restart my machine. I tried to reduce the chunk number to 200 and 2000 and neither works. Then I checked the official scipy reference for scipy.interpolate.spline here. It says that spline is deprecated in v0.19, but I'm not using the new version. If spline has been deprecated for quite a while, how do I use the equivalent BSpline now? If spline is still functioning, what causes the slow performance?
One portion of my data could look like this:
13.202 0.0
13.234738 -0.051354643759
12.999116 0.144464320836
12.86252 0.07396528119
13.1157 0.10019738758
13.357109 -0.30288563381
13.234004 -0.045792536285
12.836279 0.0362257166275
12.851597 0.0542649286915
13.110691 0.105297378401
13.220619 -0.0182963209185
13.092143 0.116647353635
12.545676 -0.641112204849
12.728248 -0.147460703493
12.874176 0.0755861585235
12.746764 -0.111583725833
13.024995 0.148079528382
13.106033 0.119481137144
13.327233 -0.197666132456
13.142423 0.0901867159545
Several issues here. First and foremost, the spline fitting you're trying to use is global. This means that you're solving a system of linear equations of size 20000 at construction time (evaluations are only weakly sensitive to the dataset size, though). This explains why the spline construction is slow.
scipy.interpolate.spline, furthermore, does linear algebra with full matrices --- hence memory consumption. This is precisely why it's deprecated from scipy 0.19.0 on.
The recommended replacement, available in scipy 0.19.0, is the BSpline/ make_interp_spline combo:
>>> spl = make_interp_spline(x, y, k=3) # returns a BSpline object
>>> y_new = spl(x_new) # evaluate
Notice it is not BSpline(x, y, k): BSpline objects do not know anything about the data or fitting or interpolation.
If you are using older scipy versions, your options are:
CubicSpline(x, y) for cubic splines
splrep(x, y, s=0) / splev combo (a minimal sketch follows).
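For reference, the splrep / splev combo might look like this (a sketch; x and y are the data arrays from the question, which must be sorted and strictly increasing for s=0, and x_new is the evaluation grid):
from scipy.interpolate import splrep, splev

tck = splrep(x, y, s=0)    # s=0 requests pure interpolation through the (sorted) data
y_new = splev(x_new, tck)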
However, you may want to think if you really need twice continuously differentiable functions. If only once differentiable functions are smooth enough for your purposes, then you can use local spline interpolations, e.g. Akima1DInterpolator or PchipInterpolator:
In [1]: import numpy as np
In [2]: from scipy.interpolate import pchip, splmake
In [3]: x = np.arange(1000)
In [4]: y = x**2
In [5]: %timeit pchip(x, y)
10 loops, best of 3: 58.9 ms per loop
In [6]: %timeit splmake(x, y)
1 loop, best of 3: 5.01 s per loop
Here splmake is what spline uses under the hood, and it's also deprecated.
Most interpolation methods in SciPy are function-generating, i.e. they return a function which you can then evaluate on your x data. For example, using the CubicSpline method, which connects all points with piecewise cubic polynomials, would be
from scipy.interpolate import CubicSpline
spline = CubicSpline(x, y)
y_smooth = spline(x_smooth)
Based on your description I think that you correctly want to use BSpline. To do so, follow the pattern above, i.e.
from scipy.interpolate import BSpline
order = 2 # smoothness order
spline = BSpline(x, y, order)
y_smooth = spline(x_smooth)
Since you have such an amount of data, it is probably very noisy. I'd suggest using a bigger spline order, which relates to the number of knots used for interpolation.
In both cases, your knots, i.e. x and y, should be sorted. These are 1D interpolations (since you are using only x_smooth as input). You can sort them using np.argsort. In short:
from scipy.interpolate import BSpline
sort_idx = np.argsort(x)
x_sorted = x[sort_idx]
y_sorted = y[sort_idx]
order = 20 # smoothness order
spline = BSpline(x_sorted, y_sorted, order)
y_smooth = spline(x_smooth)
plt.plot(x_sorted, y_sorted, '.')
plt.plot(x_smooth, y_smooth, '-')
plt.show()
My problem can be generalized to how to smoothly plot 2D graphs when the data points are in random order. Since you are only dealing with two columns of data, if you sort your data by the independent variable, at least your data points will be connected in order, and that's how matplotlib connects your data points.
@Dawid Laszuk has provided one solution to sort the data by the independent variable, and I'll show mine here:
plotting_columns = []
for i in range(len(x)):
    plotting_columns.append(np.array([x[i], y[i]]))
plotting_columns.sort(key=lambda pair: pair[0])
plotting_columns = np.array(plotting_columns)
A traditional sort() with an appropriate key also does the sorting job efficiently here.
But that's just the first step. The following steps are not hard either: to smooth your graph, you also want to keep your independent variable in linear ascending order with an identical step interval, so
x_smooth = np.linspace(x.min(), x.max(), num_steps)
is enough to do the job. Usually, if you have plenty of data points, for example more than 10000 (where correctness and accuracy are not human-verifiable anyway), you just want to plot the significant points to display the trend, so smoothing only x is enough and you can simply plt.plot(x_smooth, y).
You will notice that x_smooth will generate many x values that have no corresponding y value. If you want to maintain correctness, you need to use line-fitting functions. As @ev-br demonstrated in his answer, spline functions are expensive by design, so you might want to try a simpler trick. I smoothed my graph without using those functions, in a few simple steps.
First, round your values so that your data will not vary too much in small intervals. (You can skip this step)
You can change one line when you constructing the plotting_columns as:
plotting_columns.append(np.around(np.array([x[i], y[i]]), decimals=4))
After doing this, you can filter out the points that you don't want to plot by keeping only the points close to the x_smooth values:
new_plots = []
error = 0.05  # hypothetical tolerance around each x_smooth value
for i in range(len(x_smooth)):
    if x_smooth[i] - error <= plotting_columns[:, 0][i] < x_smooth[i] + error:
        new_plots.append(plotting_columns[i])
    else:
        pass  # drop (or aggregate) the points that fall between the intervals
This is how I solved my problems.
I've generated a bunch of data for the (x,y,z) coordinates of a planet as it orbits around the Sun. Now I want to fit an ellipse through this data.
What I tried to do:
I created a dummy ellipse based on five parameters: the semi-major axis and eccentricity, which define the size and shape, and the three Euler angles that rotate the ellipse around. Since my data is not always centered at the origin, I also need to translate the ellipse, requiring three additional variables (dx, dy, dz).
Once I initialise this function with these eight variables I get back N number of points that lie on this ellipse. (N = number of data points I am plotting the ellipse through)
I calculate the deviation of these dummy points from the actual data and then I minimise this deviation using some minimisation method to find the best fitting values for these eight variables.
My problem is with the very last part: minimising the deviation and finding the variables' values.
To minimise the deviation I use scipy.optimize.minimize to try to approximate the best-fitting variables, but it just doesn't do a good enough job:
Here is an image of what one of my best fits looks like and that's with a very generously accurate initial guess. (blue = data, red = fit)
Here is the entire code. (No data required, it generates its own phony data)
In short, I use this scipy function:
initial_guess = [0.3,0.2,0.1,0.7,3,0.0,-0.1,0.0]
bnds = ((0.2, 0.5), (0.1, 0.3), (0, 2*np.pi), (0, 2*np.pi), (0, 2*np.pi), (-0.5,0.5), (-0.5,0.5), (-0.3,0.3)) #reasonable bounds for the variables
result = optimize.minimize(deviation, initial_guess, args=(data,), method='L-BFGS-B', bounds=bnds, tol=1e-8) # perform minimisation
semi_major,eccentricity,inclination,periapsis,longitude,dx,dy,dz = result["x"]
To minimize this error (or deviation) function:
def deviation(variables, data):
    """
    This function calculates the cumulative separation between the ellipse fit points and data points and returns it
    """
    num_pts = len(data[:,0])
    semi_major,eccentricity,inclination,periapsis,longitude,dx,dy,dz = variables
    dummy_ellipse = generate_ellipse(num_pts,semi_major,eccentricity,inclination,periapsis,longitude,dz,dy,dz)
    deviations = np.zeros(len(data[:,0]))
    pair_deviations = np.zeros(len(data[:,0]))
    # Calculate separation between each pair of points
    for j in range(len(data[:,0])):
        for i in range(len(data[:,0])):
            pair_deviations[i] = np.sqrt((data[j,0]-dummy_ellipse[i,0])**2 + (data[j,1]-dummy_ellipse[i,1])**2 + (data[j,2]-dummy_ellipse[i,2])**2)
        deviations[j] = min(pair_deviations) # only pick the closest point to the data point j.
    total_deviation = sum(deviations)
    return total_deviation
(My code may be a bit messy & inefficient, I'm new to this)
I may be making some logical error in my coding, but I think it comes down to the scipy.optimize.minimize function. I don't know enough about how it works and what to expect of it. I was also recommended to try Markov chain Monte Carlo when dealing with this many variables. I did take a look at emcee, but it's a little above my head right now.
First, you have a typo in your objective function that prevents optimization of one of the variables:
dummy_ellipse = generate_ellipse(...,dz,dy,dz)
should be
dummy_ellipse = generate_ellipse(...,dx,dy,dz)
Also, taking the sqrt out and minimizing the sum of squared Euclidean distances makes it numerically somewhat easier for the optimizer.
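Concretely, that amounts to dropping np.sqrt in the inner loop of the objective from the question (min() over squared distances still picks the same closest point):
pair_deviations[i] = (data[j,0]-dummy_ellipse[i,0])**2 + (data[j,1]-dummy_ellipse[i,1])**2 + (data[j,2]-dummy_ellipse[i,2])**2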
Your objective function is also not everywhere differentiable because of the min(), while the BFGS solver assumes differentiability, so its performance will be suboptimal.
Also, approaching the problem from an analytical geometry perspective may help: an ellipse in 3d is defined as the solution of two equations
f1(x,y,z,p) = 0
f2(x,y,z,p) = 0
Where p are the parameters of the ellipse. Now, to fit the parameters to a data set, you could try to minimize
F(p) = sum_{j=1}^N [f1(x_j,y_j,z_j,p)**2 + f2(x_j,y_j,z_j,p)**2]
where the sum goes over data points.
Even better, in this problem formulation you could use optimize.leastsq, which may be more efficient in least squares problems.
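A rough sketch of that formulation (f1 and f2 are hypothetical placeholders here - derive them from your own ellipse model - and scipy.optimize.leastsq then minimises the sum of their squares over the data):
import numpy as np
from scipy.optimize import leastsq

def residuals(p, data):
    # data is an (N, 3) array of (x, y, z) samples; f1 and f2 are the two implicit
    # equations above (placeholders -- not defined in this sketch)
    x, y, z = data[:, 0], data[:, 1], data[:, 2]
    return np.concatenate([f1(x, y, z, p), f2(x, y, z, p)])

# p_opt, ier = leastsq(residuals, initial_guess, args=(data,))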
I have done some work in Python, but I'm new to scipy. I'm trying to use the methods from the interpolate library to come up with a function that will approximate a set of data.
I've looked up some examples to get started, and could get the sample code below working in Python(x,y):
import numpy as np
from scipy.interpolate import interp1d, Rbf
import pylab as P
# show the plot (empty for now)
P.clf()
P.show()
# generate random input data
original_data = np.linspace(0, 1, 10)
# random noise to be added to the data
noise = (np.random.random(10)*2 - 1) * 1e-1
# calculate f(x)=sin(2*PI*x)+noise
f_original_data = np.sin(2 * np.pi * original_data) + noise
# create interpolator
rbf_interp = Rbf(original_data, f_original_data, function='gaussian')
# Create new sample data (for input), calculate f(x)
#using different interpolation methods
new_sample_data = np.linspace(0, 1, 50)
rbf_new_sample_data = rbf_interp(new_sample_data)
# draw all results to compare
P.plot(original_data, f_original_data, 'o', ms=6, label='f_original_data')
P.plot(new_sample_data, rbf_new_sample_data, label='Rbf interp')
P.legend()
The plot is displayed as follows:
Now, is there any way to get a polynomial expression representing the interpolated function created by Rbf (i.e. the method created as rbf_interp)?
Or, if this is not possible with Rbf, any suggestions using a different interpolation method, another library, or even a different tool are also welcome.
Rbf uses whatever basis function you ask for. It is of course a global model, so yes, there is a functional expression for the result, but you will probably not like it, since it is a sum over many Gaussians. You have:
rbf.nodes # the factors for each of the RBF (probably gaussians)
rbf.xi # the centers.
rbf.epsilon # the width of the gaussian, but remember that the Norm plays a role too
So with these things you can compute the distances to the centers in rbf.xi, then plug those distances, together with the factors in rbf.nodes and rbf.epsilon, into the Gaussian (or whatever function you asked it to use). (You can check the Python code of __call__ and _call_norm.)
So you get something like sum(rbf.nodes[i] * gaussian(rbf.epsilon, norm(x - center)) for i, center in enumerate(rbf.xi.T)), to give a half-code/half-formula sketch; the RBF's functional form is written in the documentation, and you can also check the Python code.
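As a rough sketch for the 1-D case (assuming the gaussian kernel, i.e. exp(-(r/epsilon)**2), and the rbf_interp object from the question; the exact kernel expression may differ between SciPy versions, so check against rbf_interp's own output):
import numpy as np

def manual_rbf_eval(rbf, x_new):
    # distance from each evaluation point to each RBF center, then the weighted kernel sum
    r = np.abs(x_new[:, None] - rbf.xi.ravel()[None, :])  # 1-D data, so the norm is |x - center|
    return np.exp(-(r / rbf.epsilon) ** 2) @ rbf.nodes

# should closely match rbf_interp(new_sample_data)
# print(manual_rbf_eval(rbf_interp, new_sample_data))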
The answer is no, there is no "nice" way to write down the formula, or at least not in a short way. Some types of interpolations, like RBF and Loess, do not directly search for a parametric mathematical function to fit to the data and instead they calculate the value of each new data point separately as a function of the other points.
These interpolations are guaranteed to always give a good fit for your data (such as in your case), and the reason for this is that to describe them you need a very large number of parameters (basically all your data points). Think of it this way: you could interpolate linearly by connecting consecutive data points with straight lines. You could fit any data this way and then describe the function in a mathematical form, but it would take a large number of parameters (at least as many as the number of points). Actually what you are doing right now is pretty much a smoothed version of that.
If you want the formula to be short, this means you want to describe the data with a mathematical function that does not have many parameters (specifically the number of parameters should be much lower than the number of data points). Such examples are logistic functions, polynomial functions and even the sine function (that you used to generate the data). Obviously, if you know which function generated the data that will be the function you want to fit.
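For instance, since the sample data in the question was generated from sin(2*pi*x) plus noise, a compact parametric fit could look like this (a sketch using scipy.optimize.curve_fit; the amplitude/phase parametrization is an assumption, and original_data, f_original_data, new_sample_data are the arrays from the question):
import numpy as np
from scipy.optimize import curve_fit

def model(x, amplitude, phase):
    # two-parameter sine matching the known generating function
    return amplitude * np.sin(2 * np.pi * x + phase)

params, _ = curve_fit(model, original_data, f_original_data, p0=[1.0, 0.0])
fitted = model(new_sample_data, *params)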
RBF likely stands for Radial Basis Function. I wouldn't be surprised if scipy.interpolate.Rbf was the function you're looking for.
However, I doubt you'll be able to find a polynomial expression to represent your result.
If you want to try different interpolation methods, check the corresponding SciPy documentation, which gives links to RBF, splines, and so on.
I don't think SciPy's RBF will give you the actual function, but one thing you could do is sample the function that SciPy's RBF gave you (e.g. at 100 points) and then use Lagrange interpolation with those points. This will generate a polynomial function for you. Here is an example of how this would look. If you do not want to use Lagrange interpolation, you can also use Newton's divided difference method to generate a polynomial function.
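A minimal sketch of that idea (it reuses rbf_interp from the question; only a handful of sample points are used here, because a very high-degree Lagrange polynomial is numerically unstable):
import numpy as np
from scipy.interpolate import lagrange

sample_x = np.linspace(0, 1, 8)      # a modest number of samples of the RBF interpolant
sample_y = rbf_interp(sample_x)
poly = lagrange(sample_x, sample_y)  # degree-7 polynomial through the sampled points
print(poly)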
My answer is based on numpy only:
import matplotlib.pyplot as plt
import numpy as np
x_data = [324, 531, 806, 1152, 1576, 2081, 2672, 3285, 3979, 4736]
y_data = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
x = np.array(x_data)
y = np.array(y_data)
model = np.poly1d(np.polyfit(x, y, 2))
ynew = model(x)
plt.plot(x, y, 'o', x, ynew, '-' , )
plt.ylabel( str(model).strip())
plt.show()