Fitting "quadratic" surface to data points in 3D - python

I am trying to fit a quadratic plane to a cloud of data points in python. My plane function is of the form
f(x,y,z) = a*x**2 + b*y**2 + c*x*y + d*x + e*y + f - z
Currently, my data points do not have errors associated with them; however, some errors can be assumed if necessary. Following suggestions from here, I work out the vertical distance from a point p0=(x0,y0,z0) (which corresponds to one of my data points) to a point on the plane p=(x,y,z) using this method. I then end up with
def vertical_distance(params, p0):
    *** snip ***
    numerator = f + a*x**2 + b*y**2 + c*x*y - x0*(2*a*x - c*y - d) - y0*(2*b*y - c*x - e) + z0
    denominator = sqrt((2*a*x + c*y + d)**2 + (2*b*y + c*x + e)**2 + (-1)**2)
    return numerator/denominator
Ultimately, I think it is the vertical_distance function that I need to minimise. I can happily feed a list of starting parameters (params) and the array of data points to it in two dimensions; however, I am unsure how to achieve this in 3D. ODRPACK seems to only allow data containing x and y, i.e. two dimensions. Furthermore, how do I implement the points on the plane (p) in the minimising routine? I guess that during the fit operation the points vary according to the parameter optimisation, and thus with the exact equation of the plane at that very moment.

I guess that "quadratic surface" would be a more correct term than "plane".
The problem is then to fit z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f
to a given set of points P.
To do that via optimization you need to formulate a residual function (for instance, the vertical distance).
For each 3D point p in P the residual is
|p_2 - (a*p_0^2 + b*p_1^2 + c*p_0*p_1 + d*p_0 + e*p_1 + f)|
You need to minimize all the residuals, i.e. the sum of their squares, by varying the parameters a…f.
The following code should, technically, solve the above problem. But such a fitting problem can be multi-extremal, and the routine may fail to find the right set of parameters without a good starting point or a globalized search.
import numpy
import scipy.optimize

P = numpy.random.rand(10, 3)  # given point set: one (x, y, z) point per row

def quadratic(x, y, a, b, c, d, e, f):
    # quadratic surface
    return a*x**2 + b*y**2 + c*x*y + d*x + e*y + f

def residual(params, points):
    # total residual
    residuals = [
        p[2] - quadratic(p[0], p[1],
                         params[0], params[1], params[2],
                         params[3], params[4], params[5])
        for p in points]
    return numpy.linalg.norm(residuals)

result = scipy.optimize.minimize(residual,
                                 (1, 1, 0, 0, 0, 0),  # starting point
                                 args=(P,))
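If the optimizer converges, the fitted coefficients are returned in result.x. As a quick sanity check (a usage sketch of my own, assuming the code above has run), you can compare the fitted surface against the z values of the points:

a, b, c, d, e, f = result.x
predicted_z = [quadratic(p[0], p[1], a, b, c, d, e, f) for p in P]
print(numpy.column_stack((P[:, 2], predicted_z)))  # actual z next to fitted z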

Minimizing this error function, using NumPy

Background
I've been working for some time on attempting to solve the (notoriously painful) Time Difference of Arrival (TDoA) multi-lateration problem, in 3-dimensions and using 4 nodes. If you're unfamiliar with the problem, it is to determine the coordinates of some signal source (X,Y,Z), given the coordinates of n nodes, the time of arrival of the signal at each node, and the velocity of the signal v.
My solution is as follows:
For each node, we write (X - x_i)**2 + (Y - y_i)**2 + (Z - z_i)**2 = (v*(t_i - T))**2
Where (x_i, y_i, z_i) are the coordinates of the ith node, and T is the time of emission.
We now have 4 equations in 4 unknowns. Four nodes are obviously insufficient. We could try to solve this system directly; however, that seems next to impossible given the highly nonlinear nature of the problem (and, indeed, I've tried many direct techniques... and failed). Instead, we simplify this to a linear problem by considering all i/j possibilities, subtracting equation i from equation j. We obtain n(n-1)/2 = 6 equations of the form:
2*(x_j - x_i)*X + 2*(y_j - y_i)*Y + 2*(z_j - z_i)*Z + 2*v**2*(t_i - t_j)*T = v**2*(t_i**2 - t_j**2) + (x_j**2 + y_j**2 + z_j**2) - (x_i**2 + y_i**2 + z_i**2)
which looks like X*v_1 + Y*v_2 + Z*v_3 + T*v_4 = b. We now try to apply standard linear least squares, where the solution is the vector x satisfying A^T A x = A^T b. Unfortunately, if you were to try feeding this into any standard linear least squares algorithm, it'll choke up. So, what do we do now?
...
The time of arrival of the signal at node i is given (of course) by:
sqrt( (X-x_i)**2 + (Y-y_i)**2 + (Z-z_i)**2 ) / v
This formula implicitly assumes that the time of emission, T, is 0. If we have T = 0, we can drop the T column in matrix A and the problem is greatly simplified. Indeed, NumPy's linalg.lstsq() then gives a surprisingly accurate & precise result.
...
So, what I do is normalize the input times by subtracting the earliest time from each of them. All I have to do then is determine the dt that I can add to each time such that the residual sum of squared errors for the point found by linear least squares is minimized.
I define the error for some dt as the squared difference between the arrival time predicted for the point found by feeding the input times + dt to the least squares algorithm and the (normalized) input time, summed over all 4 nodes.
for (x_i, y_i, z_i), time in zip(nodes, times):
    error += (math.sqrt((X - x_i)**2 + (Y - y_i)**2 + (Z - z_i)**2) / v - time)**2
My problem:
I was able to do this somewhat satisfactorily by brute force: I started at dt = 0 and moved by some step, up to some maximum number of iterations or until some minimum RSS error was reached, and that was the dt I added to the normalized times to obtain a solution. The resulting solutions were very accurate and precise, but quite slow.
In practice, I'd like to be able to solve this in real time, and therefore a far faster solution will be needed. I began with the assumption that the error function (that is, dt vs error as defined above) would be highly nonlinear-- offhand, this made sense to me.
Since I don't have an actual, mathematical function, I can automatically rule out methods that require differentiation (e.g. Newton-Raphson). The error function will always be positive, so I can rule out bisection, etc. Instead, I try a simple approximation search. Unfortunately, that failed miserably. I then tried Tabu search, followed by a genetic algorithm, and several others. They all failed horribly.
So, I decided to do some investigating. As it turns out, the plot of the error function vs dt looks a bit like a square root, only shifted right depending on how far the signal source is from the nodes:
Where dt is on horizontal axis, error on vertical axis
And, in hindsight, of course it does! I defined the error function to involve square roots, so, at least to me, this seems reasonable.
What to do?
So, my issue now is, how do I determine the dt corresponding to the minimum of the error function?
My first (very crude) attempt was to get some points on the error graph (as above), fit them using numpy.polyfit, then feed the results to numpy.roots. That root corresponds to the dt. Unfortunately, this failed too. I tried fitting with various degrees, and also with various numbers of points, up to a ridiculous number of points such that I may as well just use brute force.
How can I determine the dt corresponding to the minimum of this error function?
Since we're dealing with high velocities (radio signals), it's important that the results be precise and accurate, as minor variances in dt can throw off the resulting point.
I'm sure that there's some infinitely simpler approach buried in what I'm doing here; however, ignoring everything else, how do I find dt?
My requirements:
Speed is of utmost importance
I have access only to pure Python and NumPy in the environment where this will be run
EDIT:
Here's my code. Admittedly, a bit messy. Here, I'm using the polyfit technique. It will "simulate" a source for you, and compare results:
from numpy import poly1d, linspace, set_printoptions, array, linalg, triu_indices, roots, polyfit
from dataclasses import dataclass
from random import randrange
import math

@dataclass
class Vertexer:
    receivers: list

    # Defaults
    c = 299792

    # Receivers:
    #     [x_1, y_1, z_1]
    #     [x_2, y_2, z_2]
    #     [x_3, y_3, z_3]
    # Solved:
    #     [x, y, z]

    def error(self, dt, times):
        solved = self.linear([time + dt for time in times])
        error = 0
        for time, receiver in zip(times, self.receivers):
            error += ((math.sqrt( (solved[0] - receiver[0])**2 +
                                  (solved[1] - receiver[1])**2 +
                                  (solved[2] - receiver[2])**2 ) / self.c) - time)**2
        return error

    def linear(self, times):
        X = array(self.receivers)
        t = array(times)
        x, y, z = X.T
        i, j = triu_indices(len(x), 1)
        A = 2 * (X[i] - X[j])
        b = self.c**2 * (t[j]**2 - t[i]**2) + (X[i]**2).sum(1) - (X[j]**2).sum(1)
        solved, residuals, rank, s = linalg.lstsq(A, b, rcond=None)
        return solved

    def find(self, times):
        # Normalize times
        times = [time - min(times) for time in times]
        # Fit the error function
        y = []
        x = []
        dt = 1E-10
        for i in range(50000):
            x.append(self.error(dt * i, times))
            y.append(dt * i)
        p = polyfit(array(x), array(y), 2)
        r = roots(p)
        return self.linear([time + r for time in times])
# SIMPLE CODE FOR SIMULATING A SIGNAL
# Pick nodes to be at random locations
x_1 = randrange(10); y_1 = randrange(10); z_1 = randrange(10)
x_2 = randrange(10); y_2 = randrange(10); z_2 = randrange(10)
x_3 = randrange(10); y_3 = randrange(10); z_3 = randrange(10)
x_4 = randrange(10); y_4 = randrange(10); z_4 = randrange(10)
# Pick source to be at random location
x = randrange(1000); y = randrange(1000); z = randrange(1000)
# Set velocity
c = 299792 # km/s
# Generate simulated source
t_1 = math.sqrt( (x - x_1)**2 + (y - y_1)**2 + (z - z_1)**2 ) / c
t_2 = math.sqrt( (x - x_2)**2 + (y - y_2)**2 + (z - z_2)**2 ) / c
t_3 = math.sqrt( (x - x_3)**2 + (y - y_3)**2 + (z - z_3)**2 ) / c
t_4 = math.sqrt( (x - x_4)**2 + (y - y_4)**2 + (z - z_4)**2 ) / c
print('Actual:', x, y, z)
myVertexer = Vertexer([[x_1, y_1, z_1],[x_2, y_2, z_2],[x_3, y_3, z_3],[x_4, y_4, z_4]])
solution = myVertexer.find([t_1, t_2, t_3, t_4])
print(solution)
It seems like the Bancroft method applies to this problem? Here's a pure NumPy implementation.
# Implementation of the Bancroft method, following
# https://gssc.esa.int/navipedia/index.php/Bancroft_Method
import numpy as np

M = np.diag([1, 1, 1, -1])

def lorentz_inner(v, w):
    return np.sum(v * (w @ M), axis=-1)

B = np.array(
    [
        [x_1, y_1, z_1, c * t_1],
        [x_2, y_2, z_2, c * t_2],
        [x_3, y_3, z_3, c * t_3],
        [x_4, y_4, z_4, c * t_4],
    ]
)
one = np.ones(4)
a = 0.5 * lorentz_inner(B, B)
B_inv_one = np.linalg.solve(B, one)
B_inv_a = np.linalg.solve(B, a)
for Lambda in np.roots(
    [
        lorentz_inner(B_inv_one, B_inv_one),
        2 * (lorentz_inner(B_inv_one, B_inv_a) - 1),
        lorentz_inner(B_inv_a, B_inv_a),
    ]
):
    x, y, z, c_t = M @ np.linalg.solve(B, Lambda * one + a)
    print("Candidate:", x, y, z, c_t / c)
My answer might have (glaring) mistakes, as I had not heard of the term TDOA before this afternoon. Please double-check that the method is right.
I could not find a solution to your original problem of finding the dt corresponding to the minimum error. My answer also deviates from the requirement that no third-party library other than NumPy be used (I used SymPy, and largely reused the code from here). However, I am still posting this, thinking that somebody someday might find it useful if all they are interested in is finding the X, Y, Z of the source emitter. This method also does not take into account real-life complications such as white noise or measurement errors, the curvature of the earth, and so on.
Your initial test conditions are as below.
from random import randrange
import math
# SIMPLE CODE FOR SIMULATING A SIGNAL
# Pick nodes to be at random locations
x_1 = randrange(10); y_1 = randrange(10); z_1 = randrange(10)
x_2 = randrange(10); y_2 = randrange(10); z_2 = randrange(10)
x_3 = randrange(10); y_3 = randrange(10); z_3 = randrange(10)
x_4 = randrange(10); y_4 = randrange(10); z_4 = randrange(10)
# Pick source to be at random location
x = randrange(1000); y = randrange(1000); z = randrange(1000)
# Set velocity
c = 299792 # km/s
# Generate simulated source
t_1 = math.sqrt( (x - x_1)**2 + (y - y_1)**2 + (z - z_1)**2 ) / c
t_2 = math.sqrt( (x - x_2)**2 + (y - y_2)**2 + (z - z_2)**2 ) / c
t_3 = math.sqrt( (x - x_3)**2 + (y - y_3)**2 + (z - z_3)**2 ) / c
t_4 = math.sqrt( (x - x_4)**2 + (y - y_4)**2 + (z - z_4)**2 ) / c
print('Actual:', x, y, z)
My solution is as below.
import sympy as sym
X,Y,Z = sym.symbols('X,Y,Z', real=True)
f = sym.Eq((x_1 - X)**2 +(y_1 - Y)**2 + (z_1 - Z)**2 , (c*t_1)**2)
g = sym.Eq((x_2 - X)**2 +(y_2 - Y)**2 + (z_2 - Z)**2 , (c*t_2)**2)
h = sym.Eq((x_3 - X)**2 +(y_3 - Y)**2 + (z_3 - Z)**2 , (c*t_3)**2)
i = sym.Eq((x_4 - X)**2 +(y_4 - Y)**2 + (z_4 - Z)**2 , (c*t_4)**2)
print("Solved coordinates are ", sym.solve([f,g,h,i],X,Y,Z))
The print statement from your initial conditions gave
Actual: 111 553 110
and the solution that almost instantly came out was
Solved coordinates are [(111.000000000000, 553.000000000000, 110.000000000000)]
Sorry again if something is totally amiss.

Smooth a curve in Python while preserving the value and slope at the end points

I actually have two solutions to this problem; both are applied below to a test case. The thing is that neither of them is perfect: the first one only takes into account the two end points, and the other one can't be made "arbitrarily smooth": there is a limit to the amount of smoothing one can achieve (the one I am showing).
I am sure there is a better solution, one that kind of goes from the first solution to the second and all the way to no smoothing at all. It may already be implemented somewhere. Maybe by solving a minimization problem with an arbitrary number of equidistributed splines?
Thank you very much for your help
PS: the seed used is a challenging one.
import matplotlib.pyplot as plt
from scipy import interpolate
from scipy.signal import savgol_filter
import numpy as np
import random
def scipy_bspline(cv, n=100, degree=3):
    """ Calculate n samples on a bspline
        cv :     Array of control vertices
        n :      Number of samples to return
        degree:  Curve degree
    """
    cv = np.asarray(cv)
    count = cv.shape[0]
    degree = np.clip(degree, 1, count-1)
    kv = np.clip(np.arange(count+degree+1)-degree, 0, count-degree)
    # Return samples
    max_param = count - degree
    spl = interpolate.BSpline(kv, cv, degree)
    return spl(np.linspace(0, max_param, n))

def round_up_to_odd(f):
    return int(np.ceil(f / 2.) * 2 + 1)
def generateRandomSignal(n=1000, seed=None):
    """
    Parameters
    ----------
    n : integer, optional
        Number of points in the signal. The default is 1000.

    Returns
    -------
    sig : numpy array
    """
    np.random.seed(seed)
    print("Seed was:", seed)
    steps = np.random.choice(a=[-1, 0, 1], size=(n-1))
    roughSig = np.concatenate([np.array([0]), steps]).cumsum(0)
    sig = savgol_filter(roughSig, round_up_to_odd(n/10), 6)
    return sig
# Generate a random signal to illustrate my point
n = 1000
t = np.linspace(0, 10, n)
seed = 45136  # Challenging seed
sig = generateRandomSignal(n=1000, seed=seed)
sigInit = np.copy(sig)
# Add noise to the signal
mean = 0
std = sig.max()/3.0
num_samples = n // 5
idxMin = n // 2 - 100
idxMax = idxMin + num_samples
tCut = t[idxMin+1:idxMax]
noise = np.random.normal(mean, std, size=num_samples-1) + 2*std*np.sin(2.0*np.pi*tCut/0.4)
sig[idxMin+1:idxMax] += noise
# Define filtering range enclosing the noisy area of the signal
idxMin -= 20
idxMax += 20
# Extreme filtering solution
# Spline between first and last points, the points in between have no influence
sigTrim = np.delete(sig, np.arange(idxMin,idxMax))
tTrim = np.delete(t, np.arange(idxMin,idxMax))
f = interpolate.interp1d(tTrim, sigTrim, kind='quadratic')
sigSmooth1 = f(t)
# My attempt. Not bad but not perfect because there is a limit in the maximum
# amount of smoothing we can add (degree=len(tSlice) is the maximum)
# If I could do degree=10*len(tSlice) and converging to the first solution
# I would be done!
sigSlice = sig[idxMin:idxMax]
tSlice = t[idxMin:idxMax]
cv = np.stack((tSlice, sigSlice)).T
p = scipy_bspline(cv, n=len(tSlice), degree=len(tSlice))
tSlice = p.T[0]
sigSliceSmooth = p.T[1]
sigSmooth2 = np.copy(sig)
sigSmooth2[idxMin:idxMax] = sigSliceSmooth
# Plot
plt.figure()
plt.plot(t, sig, label="Signal")
plt.plot(t, sigSmooth1, label="Solution 1")
plt.plot(t, sigSmooth2, label="Solution 2")
plt.plot(t[idxMin:idxMax], sigInit[idxMin:idxMax], label="What I'd want (kind of, smoother will be even better actually)")
plt.plot([t[idxMin],t[idxMax]], [sig[idxMin],sig[idxMax]],"o")
plt.legend()
plt.show()
Yes, a minimization is a good way to approach this smoothing problem.
Least squares problem
Here is a suggestion for a least squares formulation: let s[0], ..., s[N] denote the N+1 samples of the given signal to smooth, and let L and R be the desired slopes to preserve at the left and right endpoints. Find the smoothed signal u[0], ..., u[N] as the minimizer of
min_u (1/2) sum_n (u[n] - s[n])² + (λ/2) sum_n (u[n+1] - 2 u[n] + u[n-1])²
subject to
s[0] = u[0], s[N] = u[N] (value constraints),
L = u[1] - u[0], R = u[N] - u[N-1] (slope constraints),
where in the minimization objective, the sums are over n = 1, ..., N-1 and λ is a positive parameter controlling the smoothing strength. The first term tries to keep the solution close to the original signal, and the second term penalizes u for bending to encourage a smooth solution.
The slope constraints require that
u[1] = L + u[0] = L + s[0] and u[N-1] = u[N] - R = s[N] - R. So we can consider the minimization as over only the interior samples u[2], ..., u[N-2].
Finding the minimizer
The minimizer satisfies the Euler–Lagrange equations
(u[n] - s[n]) / λ + (u[n+2] - 4 u[n+1] + 6 u[n] - 4 u[n-1] + u[n-2]) = 0
for n = 2, ..., N-2.
An easy way to find an approximate solution is by gradient descent: initialize u = np.copy(s), set u[1] = L + s[0] and u[N-1] = s[N] - R, and do 100 iterations or so of
u[2:-2] -= 0.05 * ((u - s)[2:-2] / λ + np.convolve(u, [1, -4, 6, -4, 1])[4:-4])
But with some more work, it is possible to do better than this by solving the E–L equations directly. For each n, move the known quantities to the right-hand side: s[n], and also the endpoints u[0] = s[0], u[1] = L + s[0], u[N-1] = s[N] - R, u[N] = s[N]. Then you will have a linear system "A u = b", where matrix A has rows like
0, ..., 0, 1, -4, (6 + 1/λ), -4, 1, 0, ..., 0.
Finally, solve the linear system to find the smoothed signal u. You could use numpy.linalg.solve to do this if N is not too large, or if N is large, try an iterative method like conjugate gradients.
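For concreteness, here is a minimal sketch of that direct solve (my own arrangement of the pentadiagonal system described above, with lam standing for λ; it assumes the value and slope constraints stated earlier and a signal long enough to have at least two interior unknowns):

import numpy as np

def smooth_preserving_ends(s, lam, L, R):
    # Solve the Euler-Lagrange system A u = b for the interior samples u[2..N-2].
    s = np.asarray(s, dtype=float)
    N = len(s) - 1
    u0, u1 = s[0], s[0] + L            # value and left-slope constraints
    uNm1, uN = s[N] - R, s[N]          # right-slope and value constraints
    m = N - 3                          # number of unknowns u[2], ..., u[N-2]
    # Pentadiagonal matrix with rows ..., 1, -4, 6 + 1/lam, -4, 1, ...
    A = (np.diag(np.full(m, 6.0 + 1.0 / lam))
         + np.diag(np.full(m - 1, -4.0), 1) + np.diag(np.full(m - 1, -4.0), -1)
         + np.diag(np.full(m - 2, 1.0), 2) + np.diag(np.full(m - 2, 1.0), -2))
    # Right-hand side: s[n]/lam, plus the known boundary samples moved over
    b = s[2:N - 1] / lam
    b[0] += 4 * u1 - u0
    b[1] -= u1
    b[-1] += 4 * uNm1 - uN
    b[-2] -= uNm1
    u = np.empty_like(s)
    u[0], u[1], u[N - 1], u[N] = u0, u1, uNm1, uN
    u[2:N - 1] = np.linalg.solve(A, b)
    return u

# e.g. on the question's signal, preserving its end values and end slopes:
# sig2 = smooth_preserving_ends(sig, lam=1e4, L=sig[1]-sig[0], R=sig[-1]-sig[-2])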
You can apply a simple smoothing method and plot the smoothed curves with different smoothness values to see which one works best.
def smoothing(data, smoothness=0.5):
    last = data[0]
    new_data = [data[0]]
    for datum in data[1:]:
        new_value = smoothness * last + (1 - smoothness) * datum
        new_data.append(new_value)
        last = datum
    return new_data
You can plot this curve for multiple values of smoothness and pick the curve which suits your needs. You can also apply this method only to a range of values in the actual curve by defining a start and an end, as in the short sketch below.
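For instance (a small usage sketch of my own, reusing the idxMin/idxMax window from the question's code; the 0.8 value is just an illustration):

# smooth only the noisy window of sig, leaving the rest of the signal untouched
sig[idxMin:idxMax] = smoothing(sig[idxMin:idxMax], smoothness=0.8)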

Plotting horizontal hyperbola/circle using fsolve, numpy, and matplotlib

I was recently trying to plot a nonlinear decision boundary, and the function ended up being a partially horizontal hyperbola, where there were multiple y-values for a given x. Although I got it to work, I know there has to be a more pythonic or numpythonic way of plotting this line.
Background: The problem was a perceptron classifier on a set of inputs that were not linearly separable. To deal with this, the inputs were mapped through a general hyperbola function to increase the dimensionality to 5, so that they become separable by a hyperplane. The equation for the decision boundary that will be plotted is
d(x, y) = w0 + w1*x*x + w2*y*y + w3*x*y + w4*x + w5*y
Through the course of the perceptron's gradient descent, the values for w0-w5 are found, and the boundary is the set of x, y values where d(x, y) = 0.
Current implementation: I got it to work, but I think it is hacky. I first have to create an array of the given size so that I can append values to it, and I have to delete the initialized value the first time I append my found value. I then sweep through the space on my graph and find a y-value, first by guessing high, second by guessing low, in order to find both possible y-values. I put these found values at the front and back of D, in order to plot this using matplotlib.
D = np.array([[0], [0]])
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
a_iter, b_iter = 0, 0 # used as initial guess for numeric solver
for xx in range(x_min, x_max):
    # used to print top and bottom sides of hyperbola
    yya = fsolve(lambda yy: W[:,0] + W[:,1]*xx**2 + W[:,2]*yy**2 + W[:,3]*xx*yy + W[:,4]*xx + W[:,5]*yy, max(a_iter, 7))
    yyb = fsolve(lambda yy: W[:,0] + W[:,1]*xx**2 + W[:,2]*yy**2 + W[:,3]*xx*yy + W[:,4]*xx + W[:,5]*yy, b_iter)
    a_iter = yya
    b_iter = yyb
    # add these points to a single matrix for printing
    dda = np.array([[xx], [yya]])
    ddb = np.array([[xx], [yyb]])
    D = np.concatenate((dda, D), axis=1)
    if xx == x_min: # delete initial [0; 0]
        D = dda
    D = np.concatenate((D, ddb), axis=1)
I know there has to be a better way to do this. Any insight is appreciated.
Edit: Apologies, I realize that without an image this is really difficult to understand. The main issues, finding multiple roots and populating a numpy array, are a bit generic. I don't have enough rep to post images, but the link is below
nonlinear classifier
If you want to plot an implicit equation curve, you can use pyplot.contour(); here is an example:
import numpy as np
import matplotlib.pyplot as pl

np.random.seed(1)
w = np.random.randn(6)

def f(x, y, w):
    return w[0] + w[1]*x**2 + w[2]*y**2 + w[3]*x*y + w[4]*x + w[5]*y

X, Y = np.mgrid[-2:2:100j, -2:2:100j]
pl.contour(X, Y, f(X, Y, w), levels=[0])
There are parameterized options too, for instance a trig one, with branches centered at 0 and pi:
t = np.linspace(-np.pi/3, np.pi/3, 200) # 0 centered branch
y = 1/np.cos(t)
x = 1*np.tan(t)
plt.plot(x, y) # (default blue)
Out[94]: [<matplotlib.lines.Line2D at 0xe26e6a0>]
t = np.linspace(np.pi-np.pi/3, np.pi+np.pi/3, 200) # pi centered branch
y = 1/np.cos(t)
x = 1*np.tan(t)
plt.plot(x, y) # (default orange)
Out[96]: [<matplotlib.lines.Line2D at 0xf68e780>]
SymPy ought to be up to finding the full denormalized, rotated, offset parameterization of the hyperbola from the bivariate polynomial's w coefficients
(or you could continue the hackage with a fit).
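As a simpler illustration of that SymPy route (my own sketch, not part of the answer above): rather than deriving the rotated parameterization, you can solve the conic for y, which gives the two branches as explicit expressions in x.

import sympy as sp

x, y = sp.symbols('x y', real=True)
w = sp.symbols('w0:6')                # symbolic stand-ins for the fitted weights
d = w[0] + w[1]*x**2 + w[2]*y**2 + w[3]*x*y + w[4]*x + w[5]*y
branches = sp.solve(sp.Eq(d, 0), y)   # the two quadratic-formula branches y(x), assuming w2 != 0
# each branch can be passed to sp.lambdify and evaluated on a numpy grid of x values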

Jacobian in scipy.optimize.minimize form

My goal is to optimize a least-squares fit of a 4th-degree polynomial function with some constraints, so I want to use scipy.optimize.minimize(..., method='SLSQP', ...). In optimization, it is always good to pass the Jacobian to the method.
I am not sure, however, how to design my 'jac' function.
My objective function is designed like this:
def least_squares(args_pol, x, y):
    a, b, c, d, e = args_pol
    return ((y - (a*x**4 + b*x**3 + c*x**2 + d*x + e))**2).sum()
where x and y are numpy arrays containing the coordinates of the points. I found in the documentation that the 'jacobian' of scipy.optimize.minimize is the gradient of the objective function, and thus an array of its first derivatives.
For args_pol it is easy to find the first derivatives, for example
db = (2*(a*x**4 + b*x**3 + c*x**2 + d*x + e - y)*x**3).sum()
but for each x[i] in my numpy array x the derivative is
dx_i = 2*(a*x[i]**4 + b*x[i]**3 + c*x[i]**2 + d*x[i] + e - y[i]) * (4*a*x[i]**3 + 3*b*x[i]**2 + 2*c*x[i] + d)
and so on for each y[i]. Thus, the reasonable way seems to be to compute each derivative as numpy arrays dx and dy.
My question is: what form of result should my gradient function return? For example, should it look like
return np.array([[da, db, dc, dd, de], [dx[1], dx[2], .... dx[len(x)-1]],
[dy[1], dy[2],..........dy[len(y)-1]]])
or should it look like
return np.array([da, db, dc, dd, de, dx, dy])
Thanks for any explanations.
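For what it is worth, a sketch of my own (not from the question): scipy.optimize.minimize calls jac with the same arguments as the objective and expects back the gradient with respect to the optimization variables only, i.e. one value per entry of args_pol. Here x and y are fixed data, so no dx/dy entries are returned.

import numpy as np

def least_squares_grad(args_pol, x, y):
    # gradient of least_squares with respect to (a, b, c, d, e) only
    a, b, c, d, e = args_pol
    r = a*x**4 + b*x**3 + c*x**2 + d*x + e - y   # residuals
    return np.array([(2*r*x**4).sum(),   # d/da
                     (2*r*x**3).sum(),   # d/db
                     (2*r*x**2).sum(),   # d/dc
                     (2*r*x).sum(),      # d/dd
                     (2*r).sum()])       # d/de

# e.g. scipy.optimize.minimize(least_squares, x0, args=(x, y),
#                              jac=least_squares_grad, method='SLSQP')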

Fit a curve for data made up of two distinct regimes

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but I'd like to fit with a continuous line ideally - which should look like two lines with a smooth curve joining them around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize's curve_fit, trying a function that is the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
but despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
    one = a*x + b
    two = c*x + d
    return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
    """ hyperbola(x) with parameters
        a/b = asymptotic slope
        c = curvature at vertex
        d = offset to vertex
        e = vertical offset
    """
    return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e

def rot_hyperbola(x, a, b, c, d, e, th):
    pars = a, b, c, 0, 0  # do the shifting after rotation
    xd = x - d
    hsin = hyperbola(xd, *pars)*np.sin(th)
    xcos = xd*np.cos(th)
    return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data, which is linear, and an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
    return a*x + b + c*np.exp((x-d)/e)

e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
I researched this a little. Applied Linear Regression by Sanford Weisberg and the Correlation and Regression lecture by Steiger had some good info on it. They all, however, lack the right model: the piecewise function should be y = th0 + th1*x + th2*max(0, x - gamma), which is what the code below fits.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit

dfseg = pd.read_csv('segreg.csv')

def err(w):
    th0 = w['th0'].value
    th1 = w['th1'].value
    th2 = w['th2'].value
    gamma = w['gamma'].value
    fit = th0 + th1*dfseg.Temp + th2*np.maximum(0, dfseg.Temp - gamma)
    return fit - dfseg.C

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)
b0 = mi.params['th0']; b1 = mi.params['th1']; b2 = mi.params['th2']
gamma = int(mi.params['gamma'].value)

import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma, gamma), data=dfseg).fit()
print(reslin.summary())

x0 = np.array(range(0, gamma, 1))
x1 = np.array(range(0, 80 - gamma, 1))
y0 = b0 + b1*x0
y1 = (b0 + b1 * float(gamma) + (b1 + b2)*x1)
plt.scatter(dfseg.Temp, dfseg.C)
plt.plot(x0, y0)
plt.plot(x1 + gamma, y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph
I used @user423805's answer (found via the google groups thread https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function, or adding (b1 + b2) as in @user423805's answer, I used the same linear-spline calculation for both the minimizer and the end usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, two gamma params):
#
# for 0 < x < gamma0      : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x          : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
    if len(thresholds) < 2:
        print("Error: expected at least two thresholds")
        return None
    applicable_gammas = list(filter(lambda gamma: x > gamma, gammas))
    # base result
    y = thresholds[0] + (thresholds[1] * x)
    # additional factors calculated depending on x value
    for i in range(0, len(applicable_gammas)):
        y = y + (thresholds[i + 2] * (x - applicable_gammas[i]))
    return y

def least_splines_calc_array(x_array, thresholds, gammas):
    y_array = list(map(lambda x: least_splines_calc(x, thresholds, gammas), x_array))
    return y_array
def err(params, x, data):
    th0 = params['th0'].value
    th1 = params['th1'].value
    th2 = params['th2'].value
    th3 = params['th3'].value
    gamma1 = params['gamma1'].value
    gamma2 = params['gamma2'].value
    thresholds = np.array([th0, th1, th2, th3])
    gammas = np.array([gamma1, gamma2])
    fit = least_splines_calc_array(x, thresholds, gammas)
    return np.array(fit) - np.array(data)

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('th3', 0.0), ('gamma1', 9.), ('gamma2', 9.3))  # NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
mi = lmfit.minimize(err, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and an array of gammas, so that least_splines_calc can be reused to plot the linear-spline regression.
Reference: While there are various places that explain least splines (I think @user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code, despite similar equations), the one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton): https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines.
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta. Such an hyperbola can be written y = beta_o + beta_1*(x - x_o) + beta_2* SQRT[(x - x_o)^2 + delta^2/4], where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.
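A minimal sketch of that transition hyperbola, fitted with curve_fit (my own code; the parameter names follow the excerpt above, and the starting values are placeholders to adapt to your data):

import numpy as np
from scipy.optimize import curve_fit

def watts_bacon(x, beta0, x0, theta1, theta2, delta):
    # y = beta_o + beta_1*(x - x_o) + beta_2*sqrt((x - x_o)^2 + delta^2/4)
    beta1 = (theta1 + theta2) / 2.0
    beta2 = (theta2 - theta1) / 2.0
    return beta0 + beta1*(x - x0) + beta2*np.sqrt((x - x0)**2 + delta**2/4.0)

# p0 = (y at the crossover, x at the crossover, left slope, right slope, transition width)
# popt, pcov = curve_fit(watts_bacon, x, y, p0=(30., 5000., 0.02, 0.2, 200.))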
There is a straightforward method (not iterative, no initial guess) on pp. 12-13 of https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data comes from scanning the figure published by IanRoberts in his question. Scanning for the coordinates of the pixels is not accurate, so don't be surprised by some additional deviation.
Note that the abscissa and ordinate scales have been divided by 1000.
The equations of the two segments and the approximate values of the five parameters are written on the figure above.
