Fit quadratic function to data using MSE - python

So my idea (borrowed from the neural network folks) is that if I have a data set D, I can fit a quadratic curve to it by first calculating the derivative of the error with respect to the parameters (a, b, and c), and then doing a small update that reduces the error. My problem is that the following code doesn't actually manage to fit the curve. For linear functions a similar approach works, but the quadratic case fails for some reason. Can you see what I have done wrong (either a wrong assumption or just an implementation error)?
EDIT: The question was not specific enough: the following code does not handle bias (offset) in the data very well. For some reason it updates the a and b parameters in such a way that c gets left behind. This method is similar to what is done in robotics (finding a path using Jacobians) and in neural networks (finding parameters based on the error), so it is not an unreasonable algorithm; the question is why this specific implementation does not produce the results I would expect.
In the following Python code, I use math as m, and MSE is a function that calculates the Mean Squared Error between two arrays. Other than that, the code is self-contained.
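A minimal MSE of that kind (a sketch only; the exact implementation isn't shown here) would be:
def MSE(data, estimate):
    # mean of the squared differences between two equal-length sequences
    return sum((d - e) ** 2 for d, e in zip(data, estimate)) / len(data)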
Code:
def quadraticRegression(data, dErr):
    a = 1 #Starting values
    b = 1
    c = 1
    a_momentum = 0 #Momentum to counter steady state error
    b_momentum = 0
    c_momentum = 0
    estimate = [a*x**2 + b*x + c for x in range(len(data))] #Estimate curve
    error = MSE(data, estimate) #Get errors 'n stuff
    errorOld = 0
    lr = 0.0000000001 #learning rate
    while abs(error - errorOld) > dErr:
        #Fit a (dE/da)
        deda = sum([ 2*x**2 * (a*x**2 + b*x + c - data[x]) for x in range(len(data)) ])/len(data)
        correction = deda*lr
        a_momentum = (a_momentum)*0.99 + correction*0.1 #0.99 is to slow down momentum when correction speed changes
        a = a - correction - a_momentum
        #fit b (dE/db)
        dedb = sum([ 2*x*(a*x**2 + b*x + c - data[x]) for x in range(len(data))])/len(data)
        correction = dedb*lr
        b_momentum = (b_momentum)*0.99 + correction*0.1
        b = b - correction - b_momentum
        #fit c (dE/dc)
        dedc = sum([ 2*(a*x**2 + b*x + c - data[x]) for x in range(len(data))])/len(data)
        correction = dedc*lr
        c_momentum = (c_momentum)*0.99 + correction*0.1
        c = c - correction - c_momentum
        #Update model and find errors
        estimate = [a*x**2 + b*x + c for x in range(len(data))]
        errorOld = error
        print(error)
        error = MSE(data, estimate)
    return a, b, c, error

To me it looks like your code works correctly! At least the algorithm is correct. I've changed your code to use numpy for fast computation instead of pure Python. I've also tuned a few parameters, such as the momentum and the learning rate, and implemented MSE.
Then I used matplotlib to draw an animated plot. In the animation you can see that your regression really does try to fit the curve to the data. Although at the last iteration, when fitting sin(x) for x in [0, 2*pi], it looks like a linear approximation, it is still as close to the data points as a quadratic curve can get. For sin(x) with x in [0, pi] it looks like an ideal approximation (it starts fitting properly from around the 12th iteration).
The i-th frame of the animation simply runs the regression with dErr = 0.7 ** (i + 15).
My animation is a bit slow if you just run the script, but if you add the parameter save, i.e. python script.py save, it will render and save the plot animation to line.gif. If you run the script without parameters, it will animate the plotting/fitting on your screen in real time.
The full code follows the graphics; it needs a couple of Python modules, installed by running python -m pip install numpy matplotlib once.
Next is sin(x) for x in (0, pi):
Next is sin(x) for x in (0, 2 * pi):
Next is abs(x) for x in (-1, 1):
# Needs: python -m pip install numpy matplotlib
import math, sys
import numpy as np, matplotlib.pyplot as plt, matplotlib.animation as animation
from matplotlib.animation import FuncAnimation

x_range = (0., math.pi, 0.1) # (xmin, xmax, xstep)
y_range = (-0.2, 1.2) # (ymin, ymax)
num_iterations = 50

def f(x):
    return np.sin(x)

def derr(iteration):
    return 0.7 ** (iteration + 15)

def MSE(a, b):
    return (np.abs(np.array(a) - np.array(b)) ** 2).mean()

def quadraticRegression(*, x, data, dErr):
    x, data = np.array(x), np.array(data)
    assert x.size == data.size, (x.size, data.size)
    a = 1 #Starting values
    b = 1
    c = 1
    a_momentum = 0.1 #Momentum to counter steady state error
    b_momentum = 0.1
    c_momentum = 0.1
    estimate = a*x**2 + b*x + c #Estimate curve
    error = MSE(data, estimate) #Get errors 'n stuff
    errorOld = 0.
    lr = 10. ** -4 #learning rate
    while abs(error - errorOld) > dErr:
        #Fit a (dE/da)
        deda = np.sum(2*x**2 * (a*x**2 + b*x + c - data))/len(data)
        correction = deda*lr
        a_momentum = (a_momentum)*0.99 + correction*0.1 #0.99 is to slow down momentum when correction speed changes
        a = a - correction - a_momentum
        #fit b (dE/db)
        dedb = np.sum(2*x*(a*x**2 + b*x + c - data))/len(data)
        correction = dedb*lr
        b_momentum = (b_momentum)*0.99 + correction*0.1
        b = b - correction - b_momentum
        #fit c (dE/dc)
        dedc = np.sum(2*(a*x**2 + b*x + c - data))/len(data)
        correction = dedc*lr
        c_momentum = (c_momentum)*0.99 + correction*0.1
        c = c - correction - c_momentum
        #Update model and find errors
        estimate = a*x**2 + b*x + c
        errorOld = error
        #print(error)
        error = MSE(data, estimate)
    return a, b, c, error

fig, ax = plt.subplots()
fig.set_tight_layout(True)
x = np.arange(x_range[0], x_range[1], x_range[2])
#ax.scatter(x, x + np.random.normal(0, 3.0, len(x)))
line0, line1 = None, None
do_save = len(sys.argv) > 1 and sys.argv[1] == 'save'

def g(x, derr):
    a, b, c, error = quadraticRegression(x = x, data = f(x), dErr = derr)
    return a * x ** 2 + b * x + c

def dummy(x):
    return np.ones_like(x, dtype = np.float64) * 100.

def update(i):
    global line0, line1
    de = derr(i)
    if line0 is None:
        assert line1 is None
        line0, = ax.plot(x, f(x), 'r-', linewidth=2)
        line1, = ax.plot(x, g(x, de), 'r-', linewidth=2, color = 'blue')
        ax.set_ylim(y_range[0], y_range[1])
    if do_save:
        sys.stdout.write(str(i) + ' ')
        sys.stdout.flush()
    label = 'iter {0} derr {1}'.format(i, round(de, math.ceil(-math.log(de) / math.log(10)) + 2))
    line1.set_ydata(g(x, de))
    ax.set_xlabel(label)
    return line1, ax

if __name__ == '__main__':
    anim = FuncAnimation(fig, update, frames = np.arange(0, num_iterations), interval = 200)
    if do_save:
        anim.save('line.gif', dpi = 200, writer = 'imagemagick')
    else:
        plt.show()
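As an extra sanity check (not part of the animation script above), the gradient-descent result can be compared against numpy's closed-form least-squares fit; a short sketch, assuming the same x and f(x) as in the script:
# Compare the iterative fit with numpy's direct least-squares quadratic fit
a_gd, b_gd, c_gd, err_gd = quadraticRegression(x=x, data=f(x), dErr=1e-6)
a_ls, b_ls, c_ls = np.polyfit(x, f(x), 2)          # exact least-squares coefficients
err_ls = MSE(f(x), np.polyval([a_ls, b_ls, c_ls], x))
print('gradient descent:', a_gd, b_gd, c_gd, 'MSE', err_gd)
print('np.polyfit      :', a_ls, b_ls, c_ls, 'MSE', err_ls)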

Related

Python, Four-bar linkage angle-time plot

I'm trying to plot the angle vs. time curve for the output angle of a four-bar linkage (angle fi4 in the image below). This angle is calculated using the solution from https://scholar.cu.edu.eg/?q=anis/files/week04-mdp206-position_analysis-draft.pdf, page 23.
I'm now trying to plot fi_4(t) and am getting some strange results. The diagram shows the input angle fi2 in blue and the output angle fi4 in red. Why is fi2 fluctuating over time? Shouldn't fi4 follow some sort of sine curve?
Am I missing something here?
Four-bar linkage:
The code:
from __future__ import division
import math
import numpy as np
import matplotlib.pyplot as plt
# Input
#lengths of links (tube testing machine actual lengths)
a = 45.5 #mm
b = 250 #mm
c = 140 #mm
d = 244.244 #mm
# Solution for fi2 being a time function, f(time) = angle
f = 16.7/60 #/s
omega = 2 * np.pi * f #rad/s
t = np.linspace(0, 50, 100)
y = a * np.sin(omega * t)
x = a * np.cos(omega * t)
fi2 = np.arctan(y/x)
# Solution of the vector loop equation
#https://scholar.cu.edu.eg/?q=anis/files/week04-mdp206-position_analysis-draft.pdf
K1 = d/a
K2 = d/c
K3 = (a**2 - b**2 + c**2 + d**2)/(2*a*c)
A = np.cos(fi2) - K1 - K2*np.cos(fi2) + K3
B = -2*np.sin(fi2)
C = K1 - (K2+1)*np.cos(fi2) + K3
fi4_1 = 2*np.arctan((-B+np.sqrt(B**2 - 4*A*C))/(2*A))
fi4_2 = 2*np.arctan((-B-np.sqrt(B**2 - 4*A*C))/(2*A))
# Plot the fi2 time diagram and fi4 time diagram
plt.plot(t, np.degrees(fi2), color = 'blue')
plt.plot(t, np.degrees(fi4_2), color = 'red')
plt.show()
Diagram:
First, linspace(0, 50, 100) makes the motion too fast for the sampling (only 100 points over 50 s). Replace it with:
t = np.linspace(0, 5, 100)
Second, all the calculations involving the bare np.arctan() are incorrect. You should use np.arctan2(y, x), which determines the correct quadrant (unlike anything based on y/x where the respective signs of x and y are lost). So:
fi2 = np.arctan2(y, x) # not: np.arctan(y/x)
...
fi4_1 = 2 * np.arctan2(-B + np.sqrt(B**2 - 4*A*C), 2*A)
fi4_2 = 2 * np.arctan2(-B - np.sqrt(B**2 - 4*A*C), 2*A)
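A quick illustration of the quadrant problem (the values here are chosen just for demonstration):
import numpy as np
y, x = -1.0, -1.0          # a point in the third quadrant
print(np.arctan(y / x))    # 0.785..., wrongly folded into the first quadrant
print(np.arctan2(y, x))    # -2.356... (= -3*pi/4), the correct angle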
Putting some labels on your plots and showing both solutions for θ_4:
plt.plot(t, np.degrees(fi2) % 360, color = 'k', label=r'$θ_2$')
plt.plot(t, np.degrees(fi4_1) % 360, color = 'b', label=r'$θ_{4_1}$')
plt.plot(t, np.degrees(fi4_2) % 360, color = 'r', label=r'$θ_{4_2}$')
plt.xlabel('t [s]')
plt.ylabel('degrees')
plt.legend()
plt.show()
With these mods, we get:
BTW, do you want to see an amazingly lazy way of solving problems like these? Much more inefficient than your code, but much easier to derive (e.g. for other structures) without trying to express the closed form of your solution:
from scipy.optimize import fsolve
def polar(r, theta):
    return r * np.array((np.cos(theta), np.sin(theta)))

def f(th34, th2):
    th3, th4 = th34  # solve simultaneously for theta_3 and theta_4
    pb_23 = polar(a, th2) + polar(b, th3)  # point B based on links a, b
    pb_14 = polar(d, 0) + polar(c, th4)    # point B based on links d, c
    return pb_23 - pb_14                   # error: difference of the two

def solve(th2):
    th4_1 = np.array([fsolve(f, [0, -1.5], args=(th2_k,))[1] for th2_k in th2])
    th4_2 = np.array([fsolve(f, [0, 1.5], args=(th2_k,))[1] for th2_k in th2])
    return th4_1, th4_2
Application:
t = np.linspace(0, 5, 100)
th2 = omega * t
th4_1, th4_2 = solve(th2)
twopi = 2 * np.pi
np.allclose(th4_1 % twopi, fi4_1 % twopi)
# True
np.allclose(th4_2 % twopi, fi4_2 % twopi)
# True
Depending on the structure of your mechanism (e.g. 5 links), you may have more than two solutions, and of course more angles, so you'd have to adapt the code above. But you get the idea.
Be warned: fsolve iterates to find a suitable (close enough) solution, so as I said, it is much slower than your closed form.
Update (some clarification/explanation):
The function f computes the position of the point B in two different ways (via R2-R3 and via R1-R4) and returns the difference (as a vector). We solve for the difference to be zero.
That function takes two arguments: one 2-dimensional variable (th34, which is an array [th3, th4]) and one parameter th2; the parameter is constant during one run of fsolve.
The values [0, -1.5] and [0, 1.5] are initialization values (guesses) for th34 (th3 and th4). We call fsolve twice to get the two possible solutions.
All angles refer to your figure. I use th for θ (theta, not phi), but I kept the original fi4_1 and fi4_2 names for comparison.
Modulo 2*pi, th4_1 should be equal to fi4_1 etc., which is tested by np.allclose to account for numerical rounding errors.
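For a single crank angle the two calls look like this (just an illustrative check, reusing f, a, b, c, d from above; the 1.0 rad value is arbitrary):
th2_test = 1.0                                   # arbitrary crank angle in radians
sol_1 = fsolve(f, [0, -1.5], args=(th2_test,))   # first branch: [th3, th4]
sol_2 = fsolve(f, [0, 1.5], args=(th2_test,))    # second branch
print(sol_1[1], sol_2[1])                        # the two possible theta_4 values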

Something wrong with Sigmoid curve for Logistic Regression

I'm trying to use logistic regression on the popularity of hit songs on Spotify from 2010-2019, based on their duration and durability, whose data are collected from a .csv file. Basically, since the popularity values of each song are numerical, I have converted each of them to a binary 0 or 1. If the popularity value of a song is 70 or less, I replace its value with 0, and with 1 if it is above 70. For some reason, even though the rest of my code is pretty standard for creating a sigmoid function, the end result is a straight line instead of a sigmoid curve.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('top10s [SubtitleTools.com] (2).csv')
BPM = df.bpm
BPM = np.array(BPM)
Energy = df.nrgy
Energy = np.array(Energy)
Dance = df.dnce
Dance = np.array(Dance)
dB = df.dB
dB = np.array(dB)
Live = df.live
Live = np.array(Live)
Valence = df.val
Valence = np.array(Valence)
Acous = df.acous
Acous = np.array(Acous)
Speech = df.spch
Speech = np.array(Speech)
df.loc[df['popu'] <= 70, 'popu'] = 0
df.loc[df['popu'] > 70, 'popu'] = 1
def Logistic_Regression(X, y, iterations, alpha):
    ones = np.ones((X.shape[0], ))
    X = np.vstack((ones, X))
    X = X.T
    b = np.zeros(X.shape[1])
    for i in range(iterations):
        z = np.dot(X, b)
        p_hat = sigmoid(z)
        gradient = np.dot(X.T, (y - p_hat))
        b = b + alpha * gradient
        if (i % 1000 == 0):
            print('LL, i ', log_likelihood(X, y, b), i)
    return b

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(X, y, b):
    z = np.dot(X, b)
    LL = np.sum(y*z - np.log(1 + np.exp(z)))
    return LL

def LR1():
    Dur = df.dur
    Dur = np.array(Dur)
    Pop = df.popu
    Pop = [int(i) for i in Pop]; Pop = np.array(Pop)
    plt.figure(figsize=(10,8))
    colormap = np.array(['r', 'b'])
    plt.scatter(Dur, Pop, c = colormap[Pop], alpha = .4)
    b = Logistic_Regression(Dur, Pop, iterations = 8000, alpha = 0.00005)
    print('Done')
    p_hat = sigmoid(np.dot(Dur, b[1]) + b[0])
    idxDur = np.argsort(Dur)
    plt.plot(Dur[idxDur], p_hat[idxDur])
    plt.show()
LR1()
df
Your logreg params aren't coming out correctly, so something is wrong in your gradient descent.
If I do
from sklearn.linear_model import LogisticRegression
df = pd.DataFrame({'popu': [0,1,0,1,1,0,0,1,0,0],
                   'dur':  [217,283,200,295,221,176,206,260,217,213]})
logreg = LogisticRegression()
logreg.fit(np.array(df.dur).reshape(-1, 1), np.array(df.popu))
print(logreg.coef_)
print(logreg.intercept_)
I get [0.86473507, -189.79655798]
whereas your params (b) come out [0.012136874150412973 -0.2430389407767768] for this data.
Plot of your vs scikit logregs here
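For reference, a sketch of how such a comparison plot can be produced (not from the original answer; it assumes Dur, Pop and the fitted b from the question's Logistic_Regression, plus the logreg model above):
idx = np.argsort(Dur)
plt.scatter(Dur, Pop, alpha=.4)
plt.plot(Dur[idx], sigmoid(b[0] + b[1]*Dur[idx]), label='gradient descent fit')
plt.plot(Dur[idx], logreg.predict_proba(Dur[idx].reshape(-1, 1))[:, 1], label='sklearn LogisticRegression')
plt.legend()
plt.show()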

Curve fitting of complex data

I want to fit a complex data set with two functions which share the same parameters. For this I used
def funcReal(x, a, b, c, d):
    return np.real((a + 1j*b)*(np.exp(1j*k*x - kappa1*x) - np.exp(kappa2*x)) + (c + 1j*d)*(np.exp(-1j*k*x - kappa1*x) - np.exp(-kappa2*x)))

def funcImag(x, a, b, c, d):
    return np.imag((a + 1j*b)*(np.exp(1j*k*x - kappa1*x) - np.exp(kappa2*x)) + (c + 1j*d)*(np.exp(-1j*k*x - kappa1*x) - np.exp(-kappa2*x)))

poptReal, pcovReal = curve_fit(funcReal, x, yReal)
poptImag, pcovImag = curve_fit(funcImag, x, yImag)
Here funcReal is the real part of my model, funcImag the imaginary part, yReal the real part of the data and yImag the imaginary part of the data.
However, the two fits do not give me the same parameters for the real and the imaginary part.
My question: is there a package or a method with which I can realize a joint fit over multiple data sets and multiple functions with shared parameters?
To fit both the complex function given above, we can treat the real and imaginary components as a coordinate point, or as a vector. Since curve_fit doesn't care about the order at which data points are inserted in the vectors x (independent data) and y (dependent data), we can simply split the complex data and stack the real and imaginary components using hstack. See the example below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
k = 2 * np.pi    # assumed value; k is used below but not defined in the question's snippet
kappa1 = np.pi
kappa2 = -0.01

def long_function(x, a, b, c, d):
    return (a + 1j*b)*(np.exp(1j*k*x - kappa1*x) - np.exp(kappa2*x)) + (c + 1j*d)*(np.exp(-1j*k*x - kappa1*x) - np.exp(-kappa2*x))

def funcBoth(x, a, b, c, d):
    N = len(x)
    x_real = x[:N//2]
    x_imag = x[N//2:]
    y_real = np.real(long_function(x_real, a, b, c, d))
    y_imag = np.imag(long_function(x_imag, a, b, c, d))
    return np.hstack([y_real, y_imag])
# Create an independent variable with 100 measurements
N = 100
x = np.linspace(0, 10, N)
# True values of the dependent variable
y = long_function(x, a=1.1, b=0.3, c=-0.2, d=0.23)
# Add uniform complex noise (real + imaginary)
noise = (np.random.rand(N) + 1j * np.random.rand(N) - 0.5 - 0.5j) * 0.1
yNoisy = y + noise
# Split the measurements into a real and imaginary part
yReal = np.real(yNoisy)
yImag = np.imag(yNoisy)
yBoth = np.hstack([yReal, yImag])
# Find the best-fit solution
poptBoth, pcovBoth = curve_fit(funcBoth, np.hstack([x, x]), yBoth)
# Compute the best-fit solution
yFit = long_function(x, *poptBoth)
print(poptBoth)
# Plot the results
plt.figure(figsize=(9, 4))
plt.subplot(121)
plt.plot(x, np.real(yNoisy), "k.", label="Noisy y")
plt.plot(x, np.real(y), "r--", label="True y")
plt.plot(x, np.real(yFit), label="Best fit")
plt.ylabel("Real part of y")
plt.xlabel("x")
plt.legend()
plt.subplot(122)
plt.plot(x, np.imag(yNoisy), "k.")
plt.plot(x, np.imag(y), "r--")
plt.plot(x, np.imag(yFit))
plt.ylabel("Imaginary part of y")
plt.xlabel("x")
plt.tight_layout()
plt.show()
Result:
The best-fit parameters that were found in this example were a = 1.14, b = 0.375, c = -0.236, and d = 0.163, which are close enough to the true parameter values given the amplitude of the noise that I inserted here.
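If you also want uncertainty estimates for the shared parameters (not shown above), the covariance matrix returned by curve_fit can be used in the usual way:
# one-sigma uncertainties of a, b, c, d from the joint fit
perr = np.sqrt(np.diag(pcovBoth))
for name, value, err in zip('abcd', poptBoth, perr):
    print(f'{name} = {value:.3f} +/- {err:.3f}')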

Poincare Section of a system of second order odes

It is the first time I am trying to write a Poincare section code in Python.
I borrowed the piece of code from here:
https://github.com/williamgilpin/rk4/blob/master/rk4_demo.py
and I have tried to run it for my system of second-order coupled ODEs. The problem is that I do not see what I was expecting to see. Specifically, I need the Poincare section at x = 0 and px > 0.
I believe that my implementation is not the best out there. I would like to:
Improve the way that the initial conditions are chosen.
Apply the correct conditions (x=0 and px>0) in order to acquire the correct Poincare section.
Create one plot with all the collected poincare section data, not four separate ones.
I would appreciate any help.
This is the code:
from matplotlib.pyplot import *
from scipy import *
from numpy import *
import numpy as np

# a simple Runge-Kutta integrator for multiple dependent variables and one independent variable
def rungekutta4(yprime, time, y0):
    # yprime is a list of functions, y0 is a list of initial values of y
    # time is a list of t-values at which solutions are computed
    #
    # Dependency: numpy
    N = len(time)
    y = array([thing*ones(N) for thing in y0]).T
    for ii in range(N-1):
        dt = time[ii+1] - time[ii]
        k1 = dt*yprime(y[ii], time[ii])
        k2 = dt*yprime(y[ii] + 0.5*k1, time[ii] + 0.5*dt)
        k3 = dt*yprime(y[ii] + 0.5*k2, time[ii] + 0.5*dt)
        k4 = dt*yprime(y[ii] + k3, time[ii+1])
        y[ii+1] = y[ii] + (k1 + 2.0*(k2 + k3) + k4)/6.0
    return y

# Miscellaneous functions
n = 1.0/3.0
kappa1 = 0.1
kappa2 = 0.1
kappa3 = 0.1

def total_energy(valpair):
    (x, y, px, py) = tuple(valpair)
    return .5*(px**2 + py**2) + (1.0/(1.0*(n+1)))*(kappa1*np.absolute(x)**(n+1)+kappa2*np.absolute(y-x)**(n+1)+kappa3*np.absolute(y)**(n+1))

def pqdot(valpair, tval):
    # input: [x, y, px, py], t
    # takes a pair of x and y values and returns \dot{p} according to the Hamiltonian
    (x, y, px, py) = tuple(valpair)
    return np.array([px, py, -kappa1*np.sign(x)*np.absolute(x)**n+kappa2*np.sign(y-x)*np.absolute(y-x)**n, kappa2*np.sign(y-x)*np.absolute(y-x)**n-kappa3*np.sign(y)*np.absolute(y)**n]).T

def findcrossings(data, data1):
    # returns indices in 1D data set where the data crossed zero. Useful for generating Poincare map at 0
    prb = list()
    for ii in range(len(data)-1):
        if (((data[ii] > 0) and (data[ii+1] < 0)) or ((data[ii] < 0) and (data[ii+1] > 0))) and data1[ii] > 0:
            prb.append(ii)
    return array(prb)

t = linspace(0, 1000.0, 100000)
print("step size is " + str(t[1]-t[0]))

# Representative initial conditions for E=1
E = 1
x0 = 0
y0 = 0
init_cons = [[x0, y0, np.sqrt(2*E-(1.0*i/10.0)*(1.0*i/10.0)-2.0/(n+1)*(kappa1*np.absolute(x0)**(n+1)+kappa2*np.absolute(y0-x0)**(n+1)+kappa3*np.absolute(y0)**(n+1))), 1.0*i/10.0] for i in range(-10,11)]
outs = list()
for con in init_cons:
    outs.append(rungekutta4(pqdot, t, con))

# plot the results
fig1 = figure(1)
for ii in range(4):
    subplot(2, 2, ii+1)
    plot(outs[ii][:,1], outs[ii][:,3])
    ylabel("py")
    xlabel("y")
    title("Full trajectory projected onto the plane")
fig1.suptitle('Full trajectories E = 1', fontsize=10)

# Plot Poincare sections at x=0 and px>0
fig2 = figure(2)
for ii in range(4):
    subplot(2, 2, ii+1)
    xcrossings = findcrossings(outs[ii][:,0], outs[ii][:,3])
    yints = [.5*(outs[ii][cross, 1] + outs[ii][cross+1, 1]) for cross in xcrossings]
    pyints = [.5*(outs[ii][cross, 3] + outs[ii][cross+1, 3]) for cross in xcrossings]
    plot(yints, pyints, '.')
    ylabel("py")
    xlabel("y")
    title("Poincare section x = 0")
fig2.suptitle('Poincare Sections E = 1', fontsize=10)
show()
You need to compute the derivatives of the Hamiltonian correctly. The derivative of |y-x|^n with respect to x is
n*(x-y)*|x-y|^(n-2) = n*sign(x-y)*|x-y|^(n-1)
and the derivative with respect to y is almost, but not exactly (as in your code), the same:
n*(y-x)*|x-y|^(n-2) = n*sign(y-x)*|x-y|^(n-1),
note the sign difference. With this correction you can take larger time steps, and with correct linear interpolation probably even larger ones, to obtain the images.
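In code, that sign correction only changes the last component of pqdot (the py equation); a sketch of the corrected right-hand side, keeping the question's variable names:
def pqdot(valpair, tval):
    # input: [x, y, px, py], t ; output: [xdot, ydot, pxdot, pydot]
    (x, y, px, py) = tuple(valpair)
    return np.array([px,
                     py,
                     -kappa1*np.sign(x)*np.absolute(x)**n + kappa2*np.sign(y-x)*np.absolute(y-x)**n,
                     -kappa2*np.sign(y-x)*np.absolute(y-x)**n - kappa3*np.sign(y)*np.absolute(y)**n])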
I changed the integration of the ODE to
from scipy.integrate import odeint  # odeint is used below

t = linspace(0, 1000.0, 2000+1)
...
E_kin = E - total_energy([x0, y0, 0, 0])
init_cons = [[x0, y0, (2*E_kin - py**2)**0.5, py] for py in np.linspace(-10, 10, 8)]
outs = [odeint(pqdot, con, t, atol=1e-9, rtol=1e-8) for con in init_cons[:8]]
Obviously the number and parametrization of initial conditions may change.
The computation and display of the zero-crossings was changed to
def refine_crossing(a, b):
    tf = -a[0]/a[2]
    while abs(b[0]) > 1e-6:
        b = odeint(pqdot, a, [0, tf], atol=1e-8, rtol=1e-6)[-1]
        # Newton step using that b[0]=x(tf) and b[2]=x'(tf)
        tf -= b[0]/b[2]
    return [b[1], b[3]]
# Plot Poincare sections at x=0 and px>0
fig2 = figure(2)
for ii in range(8):
    #subplot(4, 2, ii+1)
    xcrossings = findcrossings(outs[ii][:,0], outs[ii][:,3])
    ycrossings = [refine_crossing(outs[ii][cross], outs[ii][cross+1]) for cross in xcrossings]
    yints, pyints = array(ycrossings).T
    plot(yints, pyints, '.')
    ylabel("py")
    xlabel("y")
    title("Poincare section x = 0")
and evaluating the result of a longer integration interval

Fit a curve for data made up of two distinct regimes

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but I'd like to fit with a continuous line ideally - which should look like two lines with a smooth curve joining them around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize curve_fit and trying a function which included the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
but despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
    one = a*x + b
    two = c*x + d
    return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
    """ hyperbola(x) with parameters
        a/b = asymptotic slope
         c  = curvature at vertex
         d  = offset to vertex
         e  = vertical offset
    """
    return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e

def rot_hyperbola(x, a, b, c, d, e, th):
    pars = a, b, c, 0, 0  # do the shifting after rotation
    xd = x - d
    hsin = hyperbola(xd, *pars)*np.sin(th)
    xcos = xd*np.cos(th)
    return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. That's because it's not a good descriptor of your data, which is piecewise linear, and an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
    return a*x + b + c*np.exp((x-d)/e)

e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
I researched this a little: Applied Linear Regression by Sanford Weisberg and the Correlation and Regression lecture by Steiger have some good information on it. However, they lack the right model for this case. The piecewise function should be C = th0 + th1*Temp + th2*max(0, Temp - gamma), which is what the following code fits:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit

dfseg = pd.read_csv('segreg.csv')

def err(w):
    th0 = w['th0'].value
    th1 = w['th1'].value
    th2 = w['th2'].value
    gamma = w['gamma'].value
    fit = th0 + th1*dfseg.Temp + th2*np.maximum(0, dfseg.Temp-gamma)
    return fit - dfseg.C

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)
b0 = mi.params['th0']; b1 = mi.params['th1']; b2 = mi.params['th2']
gamma = int(mi.params['gamma'].value)

import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma, gamma), data=dfseg).fit()
print(reslin.summary())

x0 = np.array(range(0, gamma, 1))
x1 = np.array(range(0, 80-gamma, 1))
y0 = b0 + b1*x0
y1 = b0 + b1*float(gamma) + (b1 + b2)*x1
plt.scatter(dfseg.Temp, dfseg.C)
plt.plot(x0, y0)
plt.plot(x1+gamma, y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph
I used #user423805 's answer (found via google groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function or adding (b1 + b2) in #user423805 's answer, I used the same linear spline calculation for both the minimizer and end-usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, two gamma params):
#
# for 0 < x < gamma0 : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
    if len(thresholds) < 2:
        print("Error: expected at least two thresholds")
        return None
    applicable_gammas = list(filter(lambda gamma: x > gamma, gammas))
    # base result
    y = thresholds[0] + (thresholds[1] * x)
    # additional terms, depending on the x value
    for i in range(0, len(applicable_gammas)):
        y = y + (thresholds[i + 2] * (x - applicable_gammas[i]))
    return y

def least_splines_calc_array(x_array, thresholds, gammas):
    y_array = list(map(lambda x: least_splines_calc(x, thresholds, gammas), x_array))
    return y_array

def err(params, x, data):
    th0 = params['th0'].value
    th1 = params['th1'].value
    th2 = params['th2'].value
    th3 = params['th3'].value
    gamma1 = params['gamma1'].value
    gamma2 = params['gamma2'].value
    thresholds = np.array([th0, th1, th2, th3])
    gammas = np.array([gamma1, gamma2])
    fit = least_splines_calc_array(x, thresholds, gammas)
    return np.array(fit) - np.array(data)

p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('th3', 0.0), ('gamma1', 9.), ('gamma2', 9.3))  # NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
mi = lmfit.minimize(err, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into an array of thresholds and gammas, and re-use least_splines_calc_array to plot the linear-splines regression.
Reference: While there's various places that explain least splines (I think #user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code despite similar equations) , the one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton) : https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines. Excerpt below:
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta.
Such an hyperbola can be written y = beta_o + beta_1*(x - x_o) + beta_2*SQRT[(x - x_o)^2 + delta^2/4], where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.
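A minimal sketch of that model with scipy's curve_fit (the function and variable names are mine; x and y are the data arrays loaded earlier, and the initial guesses are rough values based on the two-line fit above):
def watts_bacon(x, beta0, beta1, beta2, x0, delta):
    # hyperbolic transition between two straight-line regimes (Watts & Bacon, 1974)
    return beta0 + beta1*(x - x0) + beta2*np.sqrt((x - x0)**2 + delta**2/4)

guess = (130., 0.11, 0.09, 5000., 500.)   # beta_o, beta_1, beta_2, x_o, delta (tune to your data)
popt, pcov = curve_fit(watts_bacon, x, y, p0=guess)
theta1, theta2 = popt[1] - popt[2], popt[1] + popt[2]   # recover the asymptote slopes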
There is a straightforward method (not iterative, no initial guess) described on pp. 12-13 of https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data come from scanning the figure published by IanRoberts in his question. Scanning for the pixel coordinates is not accurate, so don't be surprised by some additional deviation.
Note that the abscissa and ordinate scales have been divided by 1000.
The equations of the two segments, together with the approximate values of the five parameters, are written on the figure above.
