Equivalent to numpy.linalg.lstsq that allows weighting - Python

I am fitting a 2d polynomial with the numpy function linalg.lstsq:
coeffs = np.array([y*0+1, y, x, x**2, y**2]).T
coeff_r, r, rank, s = np.linalg.lstsq(coeffs, values)
Some points that I am trying to fit are more reliable than others.
Is there a way to weigh the points differently?
Thanks

lstsq is enough for this; the weights can be applied to the equations. That is, if in an overdetermined system
3*a + 2*b = 9
2*a + 3*b = 4
5*a - 4*b = 2
you care about the first equation more than about the others, multiply it by some number greater than 1. For example, by 5:
15*a + 10*b = 45
2*a + 3*b = 4
5*a - 4*b = 2
Mathematically, the system is the same, but the least squares solution will be different because it minimizes the sum of squares of the residuals, and the residual of the 1st equation got multiplied by 5.
Here is an example based on your code (with small adjustments to make it more NumPythonic). First, unweighted fit:
import numpy as np
x, y = np.meshgrid(np.arange(0, 3), np.arange(0, 3))
x = x.ravel()
y = y.ravel()
values = np.sqrt(x+y+2) # some values to fit
functions = np.stack([np.ones_like(y), y, x, x**2, y**2], axis=1)
coeff_r = np.linalg.lstsq(functions, values, rcond=None)[0]
values_r = functions.dot(coeff_r)
print(values_r - values)
This displays the residuals as
[ 0.03885814 -0.00502763 -0.03383051 -0.00502763 0.00097465 0.00405298
-0.03383051 0.00405298 0.02977753]
Now I give the 1st data point greater weight.
weights = np.ones_like(x)
weights[0] = 5
coeff_r = np.linalg.lstsq(functions*weights[:, None], values*weights, rcond=None)[0]
values_r = functions.dot(coeff_r)
print(values_r - values)
Residuals:
[ 0.00271103 -0.01948647 -0.04828936 -0.01948647 0.00820407 0.0112824
-0.04828936 0.0112824 0.03700695]
The first residual is now an order of magnitude smaller, of course at the expense of the other residuals.
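If you do this often, the row scaling can be wrapped in a small helper (weighted_lstsq is my own name, not a NumPy function; it reuses functions, values and weights from the example above):
import numpy as np

def weighted_lstsq(A, b, weights):
    """Least squares with per-equation weights: scale each row of A and entry of b."""
    w = np.asarray(weights, dtype=float)
    return np.linalg.lstsq(A * w[:, None], b * w, rcond=None)[0]

coeff_r = weighted_lstsq(functions, values, weights)  # same result as the weighted fit above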

Related

Smooth a curve in Python while preserving the value and slope at the end points

I actually have two solutions to this problem; both are applied below to a test case. The trouble is that neither is perfect: the first one only takes the two end points into account, and the second one cannot be made "arbitrarily smooth": there is a limit to the amount of smoothness one can achieve (the one I am showing).
I am sure there is a better solution, one that can go from the first solution to the other and all the way to no smoothing at all. It may already be implemented somewhere. Maybe solving a minimization problem with an arbitrary number of equidistributed splines?
Thank you very much for your help.
P.S.: the seed used is a challenging one.
import matplotlib.pyplot as plt
from scipy import interpolate
from scipy.signal import savgol_filter
import numpy as np

def scipy_bspline(cv, n=100, degree=3):
    """ Calculate n samples on a b-spline
        cv : array of control vertices
        n : number of samples to return
        degree : curve degree
    """
    cv = np.asarray(cv)
    count = cv.shape[0]
    degree = np.clip(degree, 1, count-1)
    kv = np.clip(np.arange(count+degree+1)-degree, 0, count-degree)
    # Return samples (non-periodic curve)
    max_param = count - degree
    spl = interpolate.BSpline(kv, cv, degree)
    return spl(np.linspace(0, max_param, n))

def round_up_to_odd(f):
    return int(np.ceil(f / 2.) * 2 + 1)

def generateRandomSignal(n=1000, seed=None):
    """
    Parameters
    ----------
    n : integer, optional
        Number of points in the signal. The default is 1000.

    Returns
    -------
    sig : numpy array
    """
    np.random.seed(seed)
    print("Seed was:", seed)
    steps = np.random.choice(a=[-1, 0, 1], size=(n-1))
    roughSig = np.concatenate([np.array([0]), steps]).cumsum(0)
    sig = savgol_filter(roughSig, round_up_to_odd(n/10), 6)
    return sig

# Generate a random signal to illustrate my point
n = 1000
t = np.linspace(0, 10, n)
seed = 45136  # Challenging seed
sig = generateRandomSignal(n=1000, seed=seed)
sigInit = np.copy(sig)

# Add noise to the signal
mean = 0
std = sig.max()/3.0
num_samples = n//5
idxMin = n//2 - 100
idxMax = idxMin + num_samples
tCut = t[idxMin+1:idxMax]
noise = np.random.normal(mean, std, size=num_samples-1) + 2*std*np.sin(2.0*np.pi*tCut/0.4)
sig[idxMin+1:idxMax] += noise

# Define filtering range enclosing the noisy area of the signal
idxMin -= 20
idxMax += 20

# Extreme filtering solution
# Spline between first and last points, the points in between have no influence
sigTrim = np.delete(sig, np.arange(idxMin, idxMax))
tTrim = np.delete(t, np.arange(idxMin, idxMax))
f = interpolate.interp1d(tTrim, sigTrim, kind='quadratic')
sigSmooth1 = f(t)

# My attempt. Not bad but not perfect because there is a limit to the maximum
# amount of smoothing we can add (degree=len(tSlice) is the maximum).
# If I could do degree=10*len(tSlice) and converge to the first solution,
# I would be done!
sigSlice = sig[idxMin:idxMax]
tSlice = t[idxMin:idxMax]
cv = np.stack((tSlice, sigSlice)).T
p = scipy_bspline(cv, n=len(tSlice), degree=len(tSlice))
tSlice = p.T[0]
sigSliceSmooth = p.T[1]
sigSmooth2 = np.copy(sig)
sigSmooth2[idxMin:idxMax] = sigSliceSmooth

# Plot
plt.figure()
plt.plot(t, sig, label="Signal")
plt.plot(t, sigSmooth1, label="Solution 1")
plt.plot(t, sigSmooth2, label="Solution 2")
plt.plot(t[idxMin:idxMax], sigInit[idxMin:idxMax],
         label="What I'd want (kind of, smoother would be even better actually)")
plt.plot([t[idxMin], t[idxMax]], [sig[idxMin], sig[idxMax]], "o")
plt.legend()
plt.show()
Yes, a minimization is a good way to approach this smoothing problem.
Least squares problem
Here is a suggestion for a least squares formulation: let s[0], ..., s[N] denote the N+1 samples of the given signal to smooth, and let L and R be the desired slopes to preserve at the left and right endpoints. Find the smoothed signal u[0], ..., u[N] as the minimizer of
min_u (1/2) sum_n (u[n] - s[n])² + (λ/2) sum_n (u[n+1] - 2 u[n] + u[n-1])²
subject to
s[0] = u[0], s[N] = u[N] (value constraints),
L = u[1] - u[0], R = u[N] - u[N-1] (slope constraints),
where in the minimization objective, the sums are over n = 1, ..., N-1 and λ is a positive parameter controlling the smoothing strength. The first term tries to keep the solution close to the original signal, and the second term penalizes u for bending to encourage a smooth solution.
The slope constraints require that
u[1] = L + u[0] = L + s[0] and u[N-1] = u[N] - R = s[N] - R. So we can consider the minimization as over only the interior samples u[2], ..., u[N-2].
Finding the minimizer
The minimizer satisfies the Euler–Lagrange equations
(u[n] - s[n]) / λ + (u[n+2] - 4 u[n+1] + 6 u[n] - 4 u[n-1] + u[n-2]) = 0
for n = 2, ..., N-2.
An easy way to find an approximate solution is by gradient descent: initialize u = np.copy(s), set u[1] = L + s[0] and u[N-1] = s[N] - R, and do 100 iterations or so of
u[2:-2] -= 0.05 * ((u - s)[2:-2] / λ + np.convolve(u, [1, -4, 6, -4, 1])[4:-4])
But with some more work, it is possible to do better than this by solving the E–L equations directly. For each n, move the known quantities to the right-hand side: s[n] and also the endpoints u[0] = s[0], u[1] = L + s[0], u[N-1] = s[N] - R, u[N] = s[N]. Then you will have a linear system "A u = b", and the matrix A has rows like
0, ..., 0, 1, -4, (6 + 1/λ), -4, 1, 0, ..., 0.
Finally, solve the linear system to find the smoothed signal u. You could use numpy.linalg.solve to do this if N is not too large, or if N is large, try an iterative method like conjugate gradients.
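For reference, a direct solve might look roughly like this (a sketch of the system described above; smooth_preserving_endpoints and lam are my own names, with lam standing for λ):
import numpy as np

def smooth_preserving_endpoints(s, L, R, lam=1000.0):
    """Sketch: solve the Euler-Lagrange system A u = b described above.
    s   : samples s[0..N] of the signal to smooth
    L,R : slopes to preserve at the ends (u[1]-u[0] and u[N]-u[N-1])
    lam : the smoothing parameter λ (larger = smoother)
    """
    s = np.asarray(s, dtype=float)
    N = len(s) - 1
    u = s.copy()
    u[0], u[N] = s[0], s[N]              # value constraints
    u[1], u[N - 1] = s[0] + L, s[N] - R  # slope constraints
    m = N - 3                            # interior unknowns u[2..N-2]
    stencil = np.array([1.0, -4.0, 6.0 + 1.0 / lam, -4.0, 1.0])
    A = np.zeros((m, m))
    b = s[2:N - 1] / lam                 # data term on the right-hand side
    for i in range(m):                   # row for u[n] with n = i + 2
        for k, offset in enumerate(range(-2, 3)):
            j = i + offset
            if 0 <= j < m:
                A[i, j] = stencil[k]
            else:                        # known neighbour u[0], u[1], u[N-1] or u[N]
                b[i] -= stencil[k] * u[i + 2 + offset]
    u[2:N - 1] = np.linalg.solve(A, b)
    return u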
You can apply a simple smoothing method and plot the smoothed curves with different smoothness values to see which one works best.
def smoothing(data, smoothness=0.5):
    last = data[0]
    new_data = [data[0]]
    for datum in data[1:]:
        new_value = smoothness * last + (1 - smoothness) * datum
        new_data.append(new_value)
        last = datum
    return new_data
You can plot this curve for multiple values of smoothness and pick the curve which suits your needs. You can also apply this method only on a range of values in the actual curve by defining a start and an end.
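As a rough usage sketch (assuming sig and t from the question's script):
import matplotlib.pyplot as plt

plt.figure()
plt.plot(t, sig, label="Signal")
for s in (0.5, 0.9, 0.99):
    plt.plot(t, smoothing(sig, smoothness=s), label="smoothness=%.2f" % s)
plt.legend()
plt.show()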

Calculating the derivative of points in Python

I want to calculate the derivative of points; a few internet posts suggested using the np.diff function. However, I compared np.diff against manually calculated results (I chose a random polynomial equation and differentiated it) to see if I would end up with the same results. I used the equation Y = (X^3) + (X^2) + 7, and the results I ended up with were different. Any ideas why? Is there any other method to calculate the derivative?
In the problem I am trying to solve, I have received data points of a fitted spline function (not the original data that needs to be fitted by a spline, but the points of the already fitted spline). The x-values are at equal intervals. I only have the points and no equation; what I need is to calculate the first, second and third derivatives, i.e. dy/dx, d2y/dx2, d3y/dx3. Any ideas on how to do this? Thanks in advance.
import numpy as np

xval = [1,2,3,4,5]
yval = []
yval_dashList = []

#selected a polynomial equation
def calc_Y(X):
    Y = (X**3) + (X**2) + 7
    return(Y)

#calculate y values using the equation
for i in xval:
    yval.append(calc_Y(i))
#output: yval = [9,19,43,87,157]

#manually differentiated the equation or use sympy library (sym.diff(x**3 + x**2 + 7))
def calc_diffY(X):
    yval_dash = 3*(X**2) + 2**X
    return yval_dash

#store differentiated y-values in a list
for i in xval:
    yval_dashList.append(calc_diffY(i))
#output: yval_dashList = [5,16,35,64,107]

#use numpy diff method on the y values (yval)
numpyDiff = np.diff(yval)
#output: [10 24 44 70]
The values from the numpy diff method, [10, 24, 44, 70], are different from yval_dashList = [5, 16, 35, 64, 107].
The idea behind what you are trying to do is correct, but there are a couple of points to make it work as intended:
There is a typo in calc_diffY(X), the derivative of X**2 is 2*X, not 2**X:
def calc_diffY(X):
    yval_dash = 3*(X**2) + 2*X
By doing this you don't obtain much better results:
yval_dash = [5, 16, 33, 56, 85]
numpyDiff = [10. 24. 44. 70.]
To calculate the numerical derivative you should use a difference quotient, which is an approximation of the derivative:
numpyDiff = np.diff(yval)/np.diff(xval)
The approximation becomes better and better as the points become denser.
The difference between your points on the x axis is 1, so you end up in this situation (in blue the analytical derivative, in red the numerical):
If you reduce the difference in your x points to 0.1, you get this, which is much better:
Just to add something to this, have a look at this image showing the effect of reducing the distance of the points on which the derivative is numerically calculated, taken from Wikipedia:
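If you want to see the effect of point density numerically rather than from the plots, here is a small sketch of my own using the same polynomial as the question:
import numpy as np

def f(x):
    return x**3 + x**2 + 7

def f_prime(x):
    return 3*x**2 + 2*x

for dx, npts in ((1.0, 5), (0.1, 41)):
    x = np.linspace(1, 5, npts)
    y = f(x)
    num = np.diff(y) / np.diff(x)             # difference quotient
    exact = f_prime((x[:-1] + x[1:]) / 2)     # analytical derivative at the midpoints
    print("dx = %.1f  max error = %.4f" % (dx, np.abs(num - exact).max()))
# the error shrinks roughly like dx**2 (0.25 for dx=1, 0.0025 for dx=0.1)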
I like #lgsp's answer. I will add that you can directly estimate the derivative without having to worry about how much space there is between the values. This just uses the symmetric formula for calculating finite differences, described at this wikipedia page.
Take note, though, of the way delta is specified. I found that when it is too small, higher-order estimates fail. There's probably not a 100% generic value that will always work well!
Also, I simplified your code by taking advantage of numpy broadcasting over arrays to eliminate for loops.
import numpy as np

# select a polynomial equation
def f(x):
    y = x**3 + x**2 + 7
    return y

# manually differentiate the equation
def f_prime(x):
    return 3*x**2 + 2*x

# numerically estimate the first three derivatives
def d1(f, x, delta=1e-10):
    return (f(x + delta) - f(x - delta)) / (2 * delta)

def d2(f, x, delta=1e-5):
    return (d1(f, x + delta, delta) - d1(f, x - delta, delta)) / (2 * delta)

def d3(f, x, delta=1e-2):
    return (d2(f, x + delta, delta) - d2(f, x - delta, delta)) / (2 * delta)

# demo output
# note that the functions operate in parallel on numpy arrays -- no for loops!
xval = np.array([1, 2, 3, 4, 5])
print('y  = ', f(xval))
print("y' = ", f_prime(xval))
print('d1 = ', d1(f, xval))
print('d2 = ', d2(f, xval))
print('d3 = ', d3(f, xval))
And the outputs:
y = [ 9 19 43 87 157]
y' = [ 5 16 33 56 85]
d1 = [ 5.00000041 16.00000132 33.00002049 56.00000463 84.99995374]
d2 = [ 8.0000051 14.00000116 20.00000165 25.99996662 32.00000265]
d3 = [6. 6. 6. 6. 5.99999999]

Linear regression with a forced zero intercept

I am using the least squares method below to calculate coefficients for:
#Estimate coefficients of linear equation y = a + b*x
def calc_coefficients(_x, _y):
    x, y = np.mean(_x), np.mean(_y)
    xy = np.mean(_x*_y)
    x2, y2 = np.mean(_x**2), np.mean(_y**2)
    n = len(_x)
    b = (xy - x*y) / (x2 - x**2)
    a = y - b*x
    sig_b = np.sqrt((y2 - y**2)/(x2 - x**2) - b**2) / np.sqrt(n)
    sig_a = sig_b * np.sqrt(x2 - x**2)
    return a, b, sig_a, sig_b
example data:
_x = [0.009412743, 0.014965211, 0.013263312, 0.013529132, 0.009989368, 0.013932615, 0.020849682, 0.010953529, 0.003608903, 0.007220992, 0.012750529, 0.021608436, 0.031742052, 0.022482958, 0.021137599, 0.018703295, 0.021633681, 0.019866029, 0.020260629, 0.034433715, 0.009241074, 0.012027059]
_y = [0.294158677, 0.359935335, 0.313484808, 0.301917271, 0.169190763, 0.486254864, 0.305846328, 0.347077387, 0.188928817, 0.422194367, 0.41157232, 0.39281496, 0.497935681, 0.34763333, 0.281712023, 0.352045535, 0.339958296, 0.395932086, 0.359905526, 0.450004349, 0.395200865, 0.365162443]
However, I need a (y-intercept) to be zero. (y = bx).
I have tried using:
np.linalg.lstsq(_x, _y)
but I get this error:
LinAlgError: 1-dimensional array given. Array must be two-dimensional
What is the best method to fit the data for y = bx?
The error is because you pass a 1-dimensional array, which should have been a two-dimensional array of shape (n, 1) - so, a matrix with 1 column. You could just do x.reshape(-1, 1) but here is a way to do least squares fit with an arbitrary set of degrees of x:
import numpy as np
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([3, 6, 5, 7, 9, 1])
degrees = [1] # list of degrees of x to use
matrix = np.stack([x**d for d in degrees], axis=-1) # stack them like columns
coeff = np.linalg.lstsq(matrix, y, rcond=None)[0] # lstsq returns some additional info we ignore
print("Coefficients", coeff)
fit = np.dot(matrix, coeff)
print("Fitted curve/line", fit)
The matrix you pass to lstsq should have columns of the form f(x) where f runs through the terms you allow in the model. So if it's a general linear model, you'll have x**0 and x**1. With zero intercept forced, it's just x**1. In general these don't have to be powers of x, either.
Output for degrees = [1], model y = bx
Coefficients [ 1.41818182]
Fitted curve/line [ 0. 1.41818182 2.83636364 4.25454545 5.67272727 7.09090909]
Output for degrees = [0, 1], model y = a + bx
Coefficients [ 5.0952381 0.02857143]
Fitted curve/line [ 5.0952381 5.12380952 5.15238095 5.18095238 5.20952381 5.23809524]
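Applied to the zero-intercept model from the question, the degrees = [1] case reduces to a single-column matrix; a minimal sketch (assuming _x and _y are the cleaned-up lists shown in the question):
import numpy as np

x = np.asarray(_x)
y = np.asarray(_y)
matrix = x.reshape(-1, 1)                      # single column: the x**1 term, i.e. y = b*x
b = np.linalg.lstsq(matrix, y, rcond=None)[0]
fit = matrix.dot(b)
print("slope b =", b[0])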

Fit a curve for data made up of two distinct regimes

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but I'd like to fit with a continuous line ideally - which should look like two lines with a smooth curve joining them around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize curve_fit and trying a function which included the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
but despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
    one = a*x + b
    two = c*x + d
    return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
    """ hyperbola(x) with parameters
        a/b = asymptotic slope
          c = curvature at vertex
          d = offset to vertex
          e = vertical offset
    """
    return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e

def rot_hyperbola(x, a, b, c, d, e, th):
    pars = a, b, c, 0, 0  # do the shifting after rotation
    xd = x - d
    hsin = hyperbola(xd, *pars)*np.sin(th)
    xcos = xd*np.cos(th)
    return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data, which is linear and an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
    return a*x + b + c*np.exp((x-d)/e)
e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
I researched this a little; Applied Linear Regression by Sanford Weisberg and the Correlation and Regression lecture by Steiger had some good info on it. They all, however, lack the right model: the piecewise function should be
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit

dfseg = pd.read_csv('segreg.csv')

def err(w):
    th0 = w['th0'].value
    th1 = w['th1'].value
    th2 = w['th2'].value
    gamma = w['gamma'].value
    fit = th0 + th1*dfseg.Temp + th2*np.maximum(0, dfseg.Temp - gamma)
    return fit - dfseg.C

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)

b0 = mi.params['th0']; b1 = mi.params['th1']; b2 = mi.params['th2']
gamma = int(mi.params['gamma'].value)

import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma, gamma), data=dfseg).fit()
print(reslin.summary())

x0 = np.array(range(0, gamma, 1))
x1 = np.array(range(0, 80 - gamma, 1))
y0 = b0 + b1*x0
y1 = b0 + b1*float(gamma) + (b1 + b2)*x1

plt.scatter(dfseg.Temp, dfseg.C)
plt.plot(x0, y0)
plt.plot(x1 + gamma, y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph
I used #user423805 's answer (found via google groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function or adding (b1 + b2) in #user423805 's answer, I used the same linear spline calculation for both the minimizer and end-usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, two gamma params):
#
# for 0 < x < gamma0 : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
    if len(thresholds) < 2:
        print("Error: expected at least two thresholds")
        return None
    applicable_gammas = [gamma for gamma in gammas if x > gamma]
    # base result
    y = thresholds[0] + (thresholds[1] * x)
    # additional factors calculated depending on x value
    for i in range(len(applicable_gammas)):
        y = y + (thresholds[i + 2] * (x - applicable_gammas[i]))
    return y

def least_splines_calc_array(x_array, thresholds, gammas):
    return np.array([least_splines_calc(x, thresholds, gammas) for x in x_array])

def err(params, x, data):
    th0 = params['th0'].value
    th1 = params['th1'].value
    th2 = params['th2'].value
    th3 = params['th3'].value
    gamma1 = params['gamma1'].value
    gamma2 = params['gamma2'].value
    thresholds = np.array([th0, th1, th2, th3])
    gammas = np.array([gamma1, gamma2])
    fit = least_splines_calc_array(x, thresholds, gammas)
    return np.array(fit) - np.array(data)

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('th3', 0.0), ('gamma1', 9.), ('gamma2', 9.3))  # NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
mi = lmfit.minimize(err, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and gammas, so you can re-use least_splines_calc to plot the linear splines regression.
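That post-fit step might look roughly like this (a sketch, reusing mi, dfseg and least_splines_calc_array from above):
import matplotlib.pyplot as plt

thresholds = np.array([mi.params[k].value for k in ('th0', 'th1', 'th2', 'th3')])
gammas = np.array([mi.params[k].value for k in ('gamma1', 'gamma2')])

x_plot = np.linspace(dfseg.Temp.min(), dfseg.Temp.max(), 200)
plt.scatter(dfseg.Temp, dfseg.C)
plt.plot(x_plot, least_splines_calc_array(x_plot, thresholds, gammas))
plt.show()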
Reference: While there are various places that explain least splines (I think #user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code despite similar equations), the one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton): https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines.
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta. Such an hyperbola can be written y = beta_o + beta_1*(x - x_o) + beta_2* SQRT[(x - x_o)^2 + delta^2/4], where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.
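Here is a rough sketch of fitting that form with scipy.optimize.curve_fit (the function name bent_hyperbola and the starting values are my own, chosen from the ~5000 threshold in the question and the slope guesses in the earlier answer; x and y are the data arrays from the question):
import numpy as np
from scipy.optimize import curve_fit

def bent_hyperbola(x, beta0, beta1, beta2, x0, delta):
    # y = beta_o + beta_1*(x - x_o) + beta_2*sqrt((x - x_o)^2 + delta^2/4)
    return beta0 + beta1*(x - x0) + beta2*np.sqrt((x - x0)**2 + delta**2/4)

# beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2 per the paper;
# theta_1 ~ 0.02 and theta_2 ~ 0.2 roughly match the piecewise-linear guess above,
# beta0 ~ 0.02*5000 + 30 = 130 is the first line evaluated at the assumed crossing,
# and delta is a loose guess for the width of the transition
theta1, theta2 = 0.02, 0.2
p0 = (130.0, (theta1 + theta2)/2, (theta2 - theta1)/2, 5000.0, 500.0)
params, cov = curve_fit(bent_hyperbola, x, y, p0=p0)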
There is a straightforward method (not iterative, no initial guess) pp.12-13 in https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data comes from scanning the figure published by IanRoberts in his question. Scanning for the coordinates of the pixels is not accurate, so don't be surprised by some additional deviation.
Note that the abscissa and ordinate scales have been divided by 1000.
The equations of the two segments, together with the approximate values of the five parameters, are written on the figure above.

Simultaneous fitting to N datasets in Python

I have a single function that I want to fit to a number of different datasets, all with the same number of points. For example, I might want to fit a polynomial to all rows of an image. Is there an efficient and vectorized way of doing this with scipy or other packages, or do I have to resort to a single loop (or use multiprocessing to speed it up a bit)?
You can use numpy.linalg.lstsq:
import numpy as np

# independent variable
x = np.arange(100)

# some sample outputs with random noise
y1 = 3*x**2 + 2*x + 4 + np.random.randn(100)
y2 = x**2 - 4*x + 10 + np.random.randn(100)

# coefficient matrix, where each column corresponds to a term in your function
# this one is a simple quadratic polynomial: 1, x, x**2
a = np.vstack((np.ones(100), x, x**2)).T

# result matrix, where each column is one set of outputs
b = np.vstack((y1, y2)).T

solutions, residuals, rank, s = np.linalg.lstsq(a, b, rcond=None)

# each column in solutions is the coefficients of terms
# for the corresponding output
for i, solution in enumerate(zip(*solutions), 1):
    print("y%d = %.1f + (%.1f)x + (%.1f)x^2" % ((i,) + solution))

# outputs:
# y1 = 4.4 + (2.0)x + (3.0)x^2
# y2 = 9.8 + (-4.0)x + (1.0)x^2
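If you also want the fitted values, they come straight from the same matrices (a small follow-up using a, b and solutions from above):
# column i of `fitted` is the fitted curve for column i of `b` (i.e. y1, y2, ...)
fitted = a.dot(solutions)
residual_norms = np.linalg.norm(fitted - b, axis=0)
print(fitted.shape, residual_norms)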
