Logarithm function-approximation algorithm - python

I created a function to estimate the parameters of a logarithmic function.
My aim is to predict future values of data points that follow a logarithmic trend. What matters most is that the algorithm fits the most recent points better than the whole data set, since it is the prediction that counts. I currently use the Mean Squared Error to optimize my parameters, but I do not know how to weight it so that my most recent data points count more than the first ones.
Here is my equation:
y = C * log( a * x + b )
Here is my code:
import numpy as np
from sklearn.metrics import mean_squared_error

def approximate_log_function(x, y):
    C = np.arange(0.01, 1, step=0.01)
    a = np.arange(0.01, 1, step=0.01)
    b = np.arange(0.01, 1, step=0.01)

    min_mse = 9999999999
    parameters = [0, 0, 0]

    for i in np.array(np.meshgrid(C, a, b)).T.reshape(-1, 3):
        y_estimation = i[0] * np.log(i[1] * np.array(x) + i[2])
        mse = mean_squared_error(y, y_estimation)
        if mse < min_mse:
            min_mse = mse
            parameters = [i[0], i[1], i[2]]

    return (min_mse, parameters)
In the image below, the orange curve is my data and the blue line is my fitted curve. The fit drifts away from the data towards the end, which is exactly what I would like to avoid in order to improve the prediction from my function.
[figure: logarithm function graph]
My question is twofold:
Is this actually the best way to do it, or would another function be better suited, such as the increasing form of an exponential decay, y = C * (1 - e^(-k*t)) with k > 0?
How can I change my code so that the last values carry more weight in the fit than the first ones?

In non-linear least squares it is common to take the inverse of the y values as weights, which essentially suppresses outliers. You can expand on that idea by adding a function that computes the weight based on the x position.
def xWeightA(x):
    container = []
    for k in range(len(x)):
        if k < int(0.9 * len(x)):
            container.append(1)
        else:
            container.append(1.2)
    return container
def approximate_log_function(x, y):
    C = np.arange(0.01, 1, step=0.01)
    a = np.arange(0.01, 1, step=0.01)
    b = np.arange(0.01, 1, step=0.01)

    min_mse = 9999999999
    parameters = [0, 0, 0]
    LocalWeight = xWeightA(x)

    for i in np.array(np.meshgrid(C, a, b)).T.reshape(-1, 3):
        y_estimation = LocalWeight * i[0] * np.log(i[1] * np.array(x) + i[2])
        mse = mean_squared_error(y, y_estimation)
        if mse < min_mse:
            min_mse = mse
            parameters = [i[0], i[1], i[2]]

    return (min_mse, parameters)
Also, it looks like you're evaluating the objective over the complete parameter grid, which makes the code take too much time to find the minimum (at least on my machine). You can use curve_fit or polyfit as suggested, but if the goal is to write the optimizer yourself, try adding an early break or a random search through the grid. Hope it helps.
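For completeness, here is a minimal sketch (not the answer's method) of how scipy.optimize.curve_fit can apply recency weighting directly through its sigma argument: points with smaller sigma count more, so an exponentially decaying sigma makes the most recent samples dominate the fit. The decay rate and the starting guess p0 below are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def log_model(x, C, a, b):
    return C * np.log(a * x + b)

def weighted_log_fit(x, y, decay=0.02):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Older points get a larger sigma (less weight), newer points a smaller sigma.
    sigma = np.exp(-decay * np.arange(len(x)))
    popt, pcov = curve_fit(log_model, x, y, p0=[1.0, 0.1, 1.0],
                           sigma=sigma, bounds=(0, np.inf))
    return popt

# Example usage on synthetic data following y = 0.5 * log(0.3 * x + 2):
x = np.arange(1, 200)
y = 0.5 * np.log(0.3 * x + 2) + np.random.normal(0, 0.02, x.size)
print(weighted_log_fit(x, y))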

Related

Smooth a curve in Python while preserving the value and slope at the end points

I actually have two solutions to this problem; both are applied below to a test case. The thing is that neither of them is perfect: the first one only takes the two end points into account, and the other one cannot be made "arbitrarily smooth": there is a limit to the amount of smoothness one can achieve (the one I am showing).
I am sure there is a better solution, one that goes continuously from the first solution to the second and all the way to no smoothing at all. It may already be implemented somewhere. Maybe solving a minimization problem with an arbitrary number of equidistributed splines?
Thank you very much for your help.
PS: the seed used is a challenging one.
import sys

import matplotlib.pyplot as plt
from scipy import interpolate
from scipy.signal import savgol_filter
import numpy as np


def scipy_bspline(cv, n=100, degree=3):
    """ Calculate n samples on a (non-periodic) b-spline
    cv :     Array of control vertices
    n  :     Number of samples to return
    degree:  Curve degree
    """
    cv = np.asarray(cv)
    count = cv.shape[0]
    degree = np.clip(degree, 1, count - 1)
    kv = np.clip(np.arange(count + degree + 1) - degree, 0, count - degree)
    # Return samples
    max_param = count - degree
    spl = interpolate.BSpline(kv, cv, degree)
    return spl(np.linspace(0, max_param, n))


def round_up_to_odd(f):
    return int(np.ceil(f / 2.) * 2 + 1)


def generateRandomSignal(n=1000, seed=None):
    """
    Parameters
    ----------
    n : integer, optional
        Number of points in the signal. The default is 1000.

    Returns
    -------
    sig : numpy array
    """
    np.random.seed(seed)
    print("Seed was:", seed)
    steps = np.random.choice(a=[-1, 0, 1], size=(n - 1))
    roughSig = np.concatenate([np.array([0]), steps]).cumsum(0)
    sig = savgol_filter(roughSig, round_up_to_odd(n / 10), 6)
    return sig


# Generate a random signal to illustrate my point
n = 1000
t = np.linspace(0, 10, n)
seed = 45136  # Challenging seed
sig = generateRandomSignal(n=1000, seed=seed)
sigInit = np.copy(sig)

# Add noise to the signal
mean = 0
std = sig.max() / 3.0
num_samples = n // 5
idxMin = n // 2 - 100
idxMax = idxMin + num_samples
tCut = t[idxMin + 1:idxMax]
noise = np.random.normal(mean, std, size=num_samples - 1) + 2 * std * np.sin(2.0 * np.pi * tCut / 0.4)
sig[idxMin + 1:idxMax] += noise

# Define filtering range enclosing the noisy area of the signal
idxMin -= 20
idxMax += 20

# Extreme filtering solution
# Spline between first and last points, the points in between have no influence
sigTrim = np.delete(sig, np.arange(idxMin, idxMax))
tTrim = np.delete(t, np.arange(idxMin, idxMax))
f = interpolate.interp1d(tTrim, sigTrim, kind='quadratic')
sigSmooth1 = f(t)

# My attempt. Not bad but not perfect because there is a limit in the maximum
# amount of smoothing we can add (degree=len(tSlice) is the maximum)
# If I could do degree=10*len(tSlice) and converge to the first solution
# I would be done!
sigSlice = sig[idxMin:idxMax]
tSlice = t[idxMin:idxMax]
cv = np.stack((tSlice, sigSlice)).T
p = scipy_bspline(cv, n=len(tSlice), degree=len(tSlice))
tSlice = p.T[0]
sigSliceSmooth = p.T[1]
sigSmooth2 = np.copy(sig)
sigSmooth2[idxMin:idxMax] = sigSliceSmooth

# Plot
plt.figure()
plt.plot(t, sig, label="Signal")
plt.plot(t, sigSmooth1, label="Solution 1")
plt.plot(t, sigSmooth2, label="Solution 2")
plt.plot(t[idxMin:idxMax], sigInit[idxMin:idxMax], label="What I'd want (kind of, smoother would be even better actually)")
plt.plot([t[idxMin], t[idxMax]], [sig[idxMin], sig[idxMax]], "o")
plt.legend()
plt.show()
sys.exit()
Yes, a minimization is a good way to approach this smoothing problem.
Least squares problem
Here is a suggestion for a least squares formulation: let s[0], ..., s[N] denote the N+1 samples of the given signal to smooth, and let L and R be the desired slopes to preserve at the left and right endpoints. Find the smoothed signal u[0], ..., u[N] as the minimizer of
min_u (1/2) sum_n (u[n] - s[n])² + (λ/2) sum_n (u[n+1] - 2 u[n] + u[n-1])²
subject to
s[0] = u[0], s[N] = u[N] (value constraints),
L = u[1] - u[0], R = u[N] - u[N-1] (slope constraints),
where in the minimization objective, the sums are over n = 1, ..., N-1 and λ is a positive parameter controlling the smoothing strength. The first term tries to keep the solution close to the original signal, and the second term penalizes u for bending to encourage a smooth solution.
The slope constraints require that
u[1] = L + u[0] = L + s[0] and u[N-1] = u[N] - R = s[N] - R. So we can consider the minimization as over only the interior samples u[2], ..., u[N-2].
Finding the minimizer
The minimizer satisfies the Euler–Lagrange equations
(u[n] - s[n]) / λ + (u[n+2] - 4 u[n+1] + 6 u[n] - 4 u[n-1] + u[n-2]) = 0
for n = 2, ..., N-2.
An easy way to find an approximate solution is by gradient descent: initialize u = np.copy(s), set u[1] = L + s[0] and u[N-1] = s[N] - R, and do 100 iterations or so of
u[2:-2] -= 0.05 * ((u - s)[2:-2] / λ + np.convolve(u, [1, -4, 6, -4, 1])[4:-4])
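As a concrete sketch of that iteration (an assumption of how one might package it; lam plays the role of λ, and its value, like the step size 0.05, is something to tune):
import numpy as np

def smooth_gradient_descent(s, L, R, lam=10.0, iters=100, step=0.05):
    """Approximate minimizer by gradient descent over the interior samples.
    s: signal to smooth, L/R: slopes to preserve at the ends, lam: smoothing strength."""
    u = np.asarray(s, dtype=float).copy()
    u[1] = L + s[0]       # left slope constraint
    u[-2] = s[-1] - R     # right slope constraint
    for _ in range(iters):
        # (u - s)/lam plus the fourth difference of u, evaluated on the interior samples
        fourth_diff = np.convolve(u, [1, -4, 6, -4, 1])[4:-4]
        u[2:-2] -= step * ((u - s)[2:-2] / lam + fourth_diff)
    return u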
But with some more work, it is possible to do better than this by solving the E–L equations directly. For each n, move the known quantities to the right-hand side: s[n] and also the endpoints u[0] = s[0], u[1] = L + s[0], u[N-1] = s[N] - R, u[N] = s[N]. Then you will have a linear system "A u = b", and the matrix A has rows like
0, ..., 0, 1, -4, (6 + 1/λ), -4, 1, 0, ..., 0.
Finally, solve the linear system to find the smoothed signal u. You could use numpy.linalg.solve to do this if N is not too large, or if N is large, try an iterative method like conjugate gradients.
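Sketched below is one way the direct solve could look (an assumption, not code from the answer): build the pentadiagonal matrix for the interior unknowns, move the constrained boundary samples to the right-hand side, and call numpy.linalg.solve. Here lam plays the role of λ, and the sketch assumes the signal is long enough that the boundary adjustments do not overlap.
import numpy as np

def smooth_direct(s, L, R, lam=10.0):
    """Solve the Euler-Lagrange system A u = b for the interior samples u[2..N-2].
    Assumes the signal has at least 8 samples or so."""
    s = np.asarray(s, dtype=float)
    N = len(s) - 1
    m = N - 3                        # number of interior unknowns u[2], ..., u[N-2]
    # Known boundary samples fixed by the constraints
    u0, u1 = s[0], L + s[0]
    uN, uNm1 = s[N], s[N] - R

    # Pentadiagonal matrix with rows ..., 1, -4, (6 + 1/lam), -4, 1, ...
    A = np.zeros((m, m))
    for k in range(m):
        A[k, k] = 6.0 + 1.0 / lam
        if k >= 1:
            A[k, k - 1] = A[k - 1, k] = -4.0
        if k >= 2:
            A[k, k - 2] = A[k - 2, k] = 1.0

    # Right-hand side: s[n]/lam plus the boundary contributions
    b = s[2:N - 1] / lam
    b[0] -= (u0 - 4.0 * u1)          # n = 2 couples to u[0] and u[1]
    b[1] -= u1                       # n = 3 couples to u[1]
    b[-1] -= (uN - 4.0 * uNm1)       # n = N-2 couples to u[N] and u[N-1]
    b[-2] -= uNm1                    # n = N-3 couples to u[N-1]

    u = np.copy(s)
    u[1], u[N - 1] = u1, uNm1
    u[2:N - 1] = np.linalg.solve(A, b)
    return u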
You can apply a simple smoothing method and plot the smoothed curves for different smoothness values to see which one works best.
def smoothing(data, smoothness=0.5):
    last = data[0]
    new_data = [data[0]]
    for datum in data[1:]:
        new_value = smoothness * last + (1 - smoothness) * datum
        new_data.append(new_value)
        last = datum
    return new_data
You can plot this curve for several values of smoothness and pick the one that suits your needs. You can also apply this method only to a range of values in the actual curve by defining a start and end index, as sketched below.
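For illustration, a small usage sketch of that idea on the noisy slice of the signal from the question (t, sig, idxMin and idxMax as defined there; the smoothness values are arbitrary examples):
import numpy as np
import matplotlib.pyplot as plt

plt.figure()
plt.plot(t, sig, label="Signal")
for smoothness in (0.5, 0.8, 0.95):
    smoothed = np.copy(sig)
    # Smooth only the noisy range, leave the rest of the signal untouched
    smoothed[idxMin:idxMax] = smoothing(sig[idxMin:idxMax], smoothness)
    plt.plot(t, smoothed, label="smoothness=%.2f" % smoothness)
plt.legend()
plt.show()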

Difference in result between fmin and fminsearch in Matlab and Python

My objective is to perform an Inverse Laplace Transform on some decay data (NMR T2 decay via CPMG). For that, we were provided with the CONTIN algorithm. This algorithm was adapted to Matlab by Iari-Gabriel Marino, and it works very well. I want to adapt this code into Python. The core of the problem is with scipy.optimize.fmin, which is not minimizing the mean square deviation (MSD) in any way similar to Matlab's fminsearch. The latter results in a good minimization, while the former doesn't.
I have gone line by line through my adapted Python code and the original Matlab. I checked every matrix and every output, and used this to identify that the critical point is in fmin. I also tried scipy.optimize.minimize and other minimization algorithms, but none gave even remotely satisfactory results.
I have made two MWEs, for Python and Matlab, to make it reproducible for everyone. The example data were obtained from the documentation of the Matlab function. Apologies if this is long code, but I don't really know how to shorten it without sacrificing readability and clarity. I tried to have the lines match as closely as possible. I am using Python 3.7.3, scipy 1.3.0, numpy 1.16.2, and Matlab R2018b, on Windows 8.1. It's a relatively recent Anaconda install (<2 months old).
My code:
import numpy as np
from scipy.optimize import fmin
import matplotlib.pyplot as plt


def msd(g, y, A, alpha, R, w, constraints):
    """ msd: mean square deviation. This is the function to be minimized by fmin"""
    if 'zero_at_extremes' in constraints:
        g[0] = 0
        g[-1] = 0
    if 'g>0' in constraints:
        g = np.abs(g)

    r = np.diff(g, axis=0, n=2)
    yfit = A @ g
    # Sum of weighted square residuals
    VAR = np.sum(w * (y - yfit) ** 2)
    # Regularizer
    REG = alpha ** 2 * np.sum((r - R @ g) ** 2)
    # Output to be minimized
    return VAR + REG


# Objective: match this distribution
g0 = np.array([0, 0, 10.1625, 25.1974, 21.8711, 1.6377, 7.3895, 8.736, 1.4256, 0, 0]).reshape((-1, 1))
s0 = np.logspace(-3, 6, len(g0)).reshape((-1, 1))
t = np.linspace(0.01, 500, 100).reshape((-1, 1))
sM, tM = np.meshgrid(s0, t)
A = np.exp(-tM / sM)
np.random.seed(1)
# Creates data from the initial distribution with some random noise.
data = (A @ g0) + 0.07 * np.random.rand(t.size).reshape((-1, 1))

# Parameters and function start
alpha = 1E-2  # regularization parameter
s = np.logspace(-3, 6, 20).reshape((-1, 1))  # x of the ILT
g0 = np.ones(s.size).reshape((-1, 1))  # guess of y of ILT
y = data  # noisy data
options = {'maxiter': 1e8, 'maxfun': 1e8}  # for the fmin function
constraints = ['g>0', 'zero_at_extremes']  # constraints for the MSD function
R = np.zeros((len(g0) - 2, len(g0)), order='F')  # Regularizer
w = np.ones(y.reshape(-1, 1).size).reshape((-1, 1))  # Weights
sM, tM = np.meshgrid(s, t, indexing='xy')
A = np.exp(-tM / sM)
g0 = g0 * y.sum() / (A @ g0).sum()  # Makes a "better guess" for the distribution, according to algorithm

print('msd of input data:\n', msd(g0, y, A, alpha, R, w, constraints))

for i in range(5):  # Just for testing. If this is extremely high, ~1000, it's still bad.
    g = fmin(func=msd,
             x0=g0,
             args=(y, A, alpha, R, w, constraints),
             **options,
             disp=True)[:, np.newaxis]
    msdfit = msd(g, y, A, alpha, R, w, constraints)
    if 'zero_at_extremes' in constraints:
        g[0] = 0
        g[-1] = 0
    if 'g>0' in constraints:
        g = np.abs(g)
    g0 = g

print('New guess', g)
print('Final msd of g', msdfit)

# Visualize the fit
plt.plot(s, g, label='Initial approximation')
plt.plot(np.logspace(-3, 6, 11), np.array([0, 0, 10.1625, 25.1974, 21.8711, 1.6377, 7.3895, 8.736, 1.4256, 0, 0]), label='Distribution to match')
plt.xscale('log')
plt.legend()
plt.show()
Matlab:
% Objective: match this distribution
g0 = [0 0 10.1625 25.1974 21.8711 1.6377 7.3895 8.736 1.4256 0 0]';
s0 = logspace(-3,6,length(g0))';
t = linspace(0.01,500,100)';
[sM,tM] = meshgrid(s0,t);
A = exp(-tM./sM);
rng(1);
% Creates data from the initial distribution with some random noise.
data = A*g0 + 0.07*rand(size(t));

% Parameters and function start
alpha = 1e-2; % regularization parameter
s = logspace(-3,6,20)'; % x of the ILT
g0 = ones(size(s)); % initial guess of y of ILT
y = data; % noisy data
options = optimset('MaxFunEvals',1e8,'MaxIter',1e8); % options for fminsearch
constraints = {'g>0','zero_at_the_extremes'}; % constraints for MSD
R = zeros(length(g0)-2,length(g0));
w = ones(size(y(:)));
[sM,tM] = meshgrid(s,t);
A = exp(-tM./sM);
g0 = g0*sum(y)/sum(A*g0); % Makes a "better guess" for the distribution
disp('msd of input data:')
disp(msd(g0, y, A, alpha, R, w, constraints))

for k = 1:5
    [g,msdfit] = fminsearch(@msd,g0,options,y,A,alpha,R,w,constraints);
    if ismember('zero_at_the_extremes',constraints)
        g(1) = 0;
        g(end) = 0;
    end
    if ismember('g>0',constraints)
        g = abs(g);
    end
    g0 = g;
end
disp('New guess')
disp(g)
disp('Final msd of g')
disp(msdfit)

% Visualize the fit
semilogx(s, g)
hold on
semilogx(logspace(-3,6,11), [0 0 10.1625 25.1974 21.8711 1.6377 7.3895 8.736 1.4256 0 0])
legend('First approximation', 'Distribution to match')
hold off

function out = msd(g,y,A,alpha,R,w,constraints)
    % msd: The mean square deviation; this is the function
    % that has to be minimized by fminsearch
    % Constraints and any 'a priori' knowledge
    if ismember('zero_at_the_extremes',constraints)
        g(1) = 0;
        g(end) = 0;
    end
    if ismember('g>0',constraints)
        g = abs(g); % must be g(i)>=0 for each i
    end
    r = diff(diff(g(1:end))); % second derivative of g
    yfit = A*g;
    % Sum of weighted square residuals
    VAR = sum(w.*(y-yfit).^2);
    % Regularizer
    REG = alpha^2 * sum((r-R*g).^2);
    % Output to be minimized
    out = VAR+REG;
end
[figures omitted: the optimization result in Python and the optimization result in Matlab]
I have checked the MSD of g0 before starting, and both give the value 2651. After minimization, Python goes up to 4547 while Matlab goes down to 0.1381.
I think the problem is one of the following: it's in my implementation, that is, I am using fmin wrong, or there's some other step I got wrong, but I can't figure out what. The fact that the MSD increases when it should decrease under minimization is damning. Reading the documentation, the scipy implementation is different from Matlab's (Matlab uses the Nelder-Mead method described in Lagarias et al., per its documentation, while scipy uses the original Nelder-Mead). Maybe that matters significantly? Or perhaps my initial guess is just too bad for scipy's algorithm?
So, quite a long time since I posted this, but I wanted to share what I ended up learning and doing.
The Inverse Laplace Transform for CPMG data is a bit of a misnomer, and it's more properly called just inversion. The general problem is solving a Fredholm integral of the first kind. One way of doing this is the Tikhonov regularization method. Turns out, you can describe this problem quite easily using numpy, and solve it with a scipy package, so I don't have to "reinvent" the wheel with this.
I used the solution shown in this post, and the names here reflect that solution.
import numpy as np
from scipy.optimize import nnls


def tikhonov_regularized_inversion(
    kernel: np.ndarray, alpha: float, data: np.ndarray
) -> np.ndarray:
    data = data.reshape(-1, 1)
    I = alpha * np.eye(*kernel.shape)
    C = np.concatenate([kernel, I], axis=0)
    d = np.concatenate([data, np.zeros_like(data)])
    x, _ = nnls(C, d.flatten())
    return x
Here, kernel is a matrix containing all the possible exponential decay curves, and my solution judges the contribution of each decay curve in the data I received. First, I stack my data as a column, then pad it with zeros, creating the vector d. I then stack my kernel on top of a diagonal matrix containing the regularization parameter alpha along the diagonal, of the same size as the kernel. Last, I call the convenient nnls, a non negative least square solver in scipy.optimize. This is because there's no reason to have a negative contribution, only no contribution.
This solved my problem, it's quick and convenient.
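A short usage sketch, tying this back to the variables from the question code above (A as the kernel of decay curves, y as the noisy data, alpha and s as defined there):
g_tikhonov = tikhonov_regularized_inversion(A, alpha, y)

plt.plot(s.flatten(), g_tikhonov, label='Tikhonov + NNLS inversion')
plt.xscale('log')
plt.legend()
plt.show()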

LMS batch gradient descent with NumPy

I'm trying to write some very simple LMS batch gradient descent, but I believe I'm doing something wrong with the gradient. The orders of magnitude of the features and of the initial values for theta are very different, so either theta[2] doesn't move (e.g. if alpha = 1e-8) or theta[1] shoots off (e.g. if alpha = .01).
import numpy as np

y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]])
x = np.concatenate((np.ones((5, 1), dtype=int), x), axis=1)
theta = np.array([[0.], [.1], [50.]])
alpha = .01

for i in range(1, 1000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
    print(((h - y) ** 2).sum(), theta.squeeze().tolist())
The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
You can perform the scaling in various ways; for now, let us just scale the features by their L^1 norms because it's simple
import numpy as np

y = np.array([[400], [330], [369], [232], [540]])
x_orig = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]])
x_orig = np.concatenate((np.ones((5, 1), dtype=int), x_orig), axis=1)
x_norm = np.sum(x_orig, axis=0)
x = x_orig / x_norm
That is, the sum of every column in x is 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly.
theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.
alpha = .1
for i in range(1, 100000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled to relate to the original unscaled features.
print (((h - y)**2).sum(), theta.squeeze()/x_norm)
# Prints 1444.14443271 [ -7.04344646e+01 6.38435468e-02 1.03435881e+02]
At this point, let's cheat and check our results
theta, error, _, _ = np.linalg.lstsq(x_orig, y)
print(error, theta)
# Prints [ 1444.1444327] [[ -7.04346018e+01]
# [ 6.38433756e-02]
# [ 1.03436047e+02]]
A general introductory reference on feature scaling is this Stanford lecture.
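As a side note (not part of the original answer), a common alternative to the L^1 scaling above is z-score standardization of the non-constant columns; a minimal sketch reusing x_orig and y from above, with an illustrative learning rate:
# Standardize the two feature columns to zero mean and unit variance,
# run the same batch gradient descent, then map the parameters back.
x_std = x_orig.astype(float).copy()
mu = x_std[:, 1:].mean(axis=0)
sd = x_std[:, 1:].std(axis=0)
x_std[:, 1:] = (x_std[:, 1:] - mu) / sd

theta_std = np.zeros((3, 1))
alpha_std = 0.01
for _ in range(100000):
    h = np.dot(x_std, theta_std)
    gradient = np.sum((h - y) * x_std, axis=0, keepdims=True).transpose()
    theta_std -= alpha_std * gradient

# Undo the standardization to compare with the lstsq result above.
slopes = theta_std[1:, 0] / sd
intercept = theta_std[0, 0] - np.sum(mu * slopes)
print(intercept, slopes)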

Fit a curve for data made up of two distinct regimes

I'm looking for a way to plot a curve through some experimental data. The data shows a small linear regime with a shallow gradient, followed by a steep linear regime after a threshold value.
My data is here: http://pastebin.com/H4NSbxqr
I could fit the data with two lines relatively easily, but I'd like to fit with a continuous line ideally - which should look like two lines with a smooth curve joining them around the threshold (~5000 in the data, shown above).
I attempted this using scipy.optimize.curve_fit, trying a function that is the sum of a straight line and an exponential:
y = a*x + b + c*np.exp((x-d)/e)
but despite numerous attempts, it didn't find a solution.
If anyone has any suggestions please, either on the choice of fitting distribution / method or the curve_fit implementation, they would be greatly appreciated.
If you don't have a particular reason to believe that linear + exponential is the true underlying cause of your data, then I think a fit to two lines makes the most sense. You can do this by making your fitting function the maximum of two lines, for example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def two_lines(x, a, b, c, d):
    one = a*x + b
    two = c*x + d
    return np.maximum(one, two)
Then,
x, y = np.genfromtxt('tmp.txt', unpack=True, delimiter=',')
pw0 = (.02, 30, .2, -2000) # a guess for slope, intercept, slope, intercept
pw, cov = curve_fit(two_lines, x, y, pw0)
crossover = (pw[3] - pw[1]) / (pw[0] - pw[2])
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-')
If you really want a continuous and differentiable solution, it occurred to me that a hyperbola has a sharp bend to it, but it has to be rotated. It was a bit difficult to implement (maybe there's an easier way), but here's a go:
def hyperbola(x, a, b, c, d, e):
    """ hyperbola(x) with parameters
    a/b = asymptotic slope
     c  = curvature at vertex
     d  = offset to vertex
     e  = vertical offset
    """
    return a*np.sqrt((b*c)**2 + (x-d)**2)/b + e

def rot_hyperbola(x, a, b, c, d, e, th):
    pars = a, b, c, 0, 0  # do the shifting after rotation
    xd = x - d
    hsin = hyperbola(xd, *pars)*np.sin(th)
    xcos = xd*np.cos(th)
    return e + hyperbola(xcos - hsin, *pars)*np.cos(th) + xcos - hsin
Run it as
h0 = 1.1, 1, 0, 5000, 100, .5
h, hcov = curve_fit(rot_hyperbola, x, y, h0)
plt.plot(x, y, 'o', x, two_lines(x, *pw), '-', x, rot_hyperbola(x, *h), '-')
plt.legend(['data', 'piecewise linear', 'rotated hyperbola'], loc='upper left')
plt.show()
I was also able to get the line + exponential to converge, but it looks terrible. This is because it's not a good descriptor of your data, which is linear, and an exponential is very far from linear!
def line_exp(x, a, b, c, d, e):
    return a*x + b + c*np.exp((x-d)/e)
e0 = .1, 20., .01, 1000., 2000.
e, ecov = curve_fit(line_exp, x, y, e0)
If you want to keep it simple, there's always a polynomial or spline (piecewise polynomials)
from scipy.interpolate import UnivariateSpline
s = UnivariateSpline(x, y, s=x.size) #larger s-value has fewer "knots"
plt.plot(x, s(x))
I researched this a little; Applied Linear Regression by Sanford Weisberg and the Correlation and Regression lecture by Steiger have some good info on it. However, they all lack the right model; the piecewise function should be:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import lmfit

dfseg = pd.read_csv('segreg.csv')


def err(w):
    th0 = w['th0'].value
    th1 = w['th1'].value
    th2 = w['th2'].value
    gamma = w['gamma'].value
    fit = th0 + th1*dfseg.Temp + th2*np.maximum(0, dfseg.Temp - gamma)
    return fit - dfseg.C


p = lmfit.Parameters()
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('gamma', 40.))
mi = lmfit.minimize(err, p)
lmfit.printfuncs.report_fit(mi.params)

b0 = mi.params['th0']; b1 = mi.params['th1']; b2 = mi.params['th2']
gamma = int(mi.params['gamma'].value)

import statsmodels.formula.api as smf
reslin = smf.ols('C ~ 1 + Temp + I((Temp-%d)*(Temp>%d))' % (gamma, gamma), data=dfseg).fit()
print(reslin.summary())

x0 = np.array(range(0, gamma, 1))
x1 = np.array(range(0, 80 - gamma, 1))
y0 = b0 + b1*x0
y1 = (b0 + b1*float(gamma) + (b1 + b2)*x1)

plt.scatter(dfseg.Temp, dfseg.C)
# (plt.hold is no longer needed; matplotlib keeps the axes by default)
plt.plot(x0, y0)
plt.plot(x1 + gamma, y1)
plt.show()
Result
[[Variables]]
th0: 78.6554456 +/- 3.966238 (5.04%) (init= 0)
th1: -0.15728297 +/- 0.148250 (94.26%) (init= 0)
th2: 0.72471237 +/- 0.179052 (24.71%) (init= 0)
gamma: 38.3110177 +/- 4.845767 (12.65%) (init= 40)
The data
"","Temp","C"
"1",8.5536,86.2143
"2",10.6613,72.3871
"3",12.4516,74.0968
"4",16.9032,68.2258
"5",20.5161,72.3548
"6",21.1613,76.4839
"7",24.3929,83.6429
"8",26.4839,74.1935
"9",26.5645,71.2581
"10",27.9828,78.2069
"11",32.6833,79.0667
"12",33.0806,71.0968
"13",33.7097,76.6452
"14",34.2903,74.4516
"15",36,56.9677
"16",37.4167,79.8333
"17",43.9516,79.7097
"18",45.2667,76.9667
"19",47,76
"20",47.1129,78.0323
"21",47.3833,79.8333
"22",48.0968,73.9032
"23",49.05,78.1667
"24",57.5,81.7097
"25",59.2,80.3
"26",61.3226,75
"27",61.9194,87.0323
"28",62.3833,89.8
"29",64.3667,96.4
"30",65.371,88.9677
"31",68.35,91.3333
"32",70.7581,91.8387
"33",71.129,90.9355
"34",72.2419,93.4516
"35",72.85,97.8333
"36",73.9194,92.4839
"37",74.4167,96.1333
"38",76.3871,89.8387
"39",78.0484,89.4516
Graph: [figure omitted]
I used @user423805's answer (found via this Google Groups thread: https://groups.google.com/forum/#!topic/lmfit-py/7I2zv2WwFLU ) but noticed it had some limitations when trying to use three or more segments.
Instead of applying np.maximum in the minimizer error function or adding (b1 + b2) as in @user423805's answer, I used the same linear spline calculation for both the minimizer and the end usage:
# least_splines_calc works like this for an example with three segments
# (four threshold params, two gamma params):
#
# for 0      < x < gamma0 : y = th0 + (th1 * x)
# for gamma0 < x < gamma1 : y = th0 + (th1 * x) + (th2 * (x - gamma0))
# for gamma1 < x          : y = th0 + (th1 * x) + (th2 * (x - gamma0)) + (th3 * (x - gamma1))
#
def least_splines_calc(x, thresholds, gammas):
    if len(thresholds) < 2:
        print("Error: expected at least two thresholds")
        return None

    applicable_gammas = [gamma for gamma in gammas if x > gamma]

    # base result
    y = thresholds[0] + (thresholds[1] * x)

    # additional factors calculated depending on x value
    for i in range(0, len(applicable_gammas)):
        y = y + (thresholds[i + 2] * (x - applicable_gammas[i]))

    return y


def least_splines_calc_array(x_array, thresholds, gammas):
    y_array = [least_splines_calc(x, thresholds, gammas) for x in x_array]
    return y_array


def err(params, x, data):
    th0 = params['th0'].value
    th1 = params['th1'].value
    th2 = params['th2'].value
    th3 = params['th3'].value
    gamma1 = params['gamma1'].value
    gamma2 = params['gamma2'].value

    thresholds = np.array([th0, th1, th2, th3])
    gammas = np.array([gamma1, gamma2])

    fit = least_splines_calc_array(x, thresholds, gammas)
    return np.array(fit) - np.array(data)


p = lmfit.Parameters()
# NOTE: the 9. / 9.3 were guesses specific to my data, you will need to change these
p.add_many(('th0', 0.), ('th1', 0.0), ('th2', 0.0), ('th3', 0.0), ('gamma1', 9.), ('gamma2', 9.3))

mi = lmfit.minimize(err, p, args=(np.array(dfseg.Temp), np.array(dfseg.C)))
After minimization, convert the params found by the minimizer into arrays of thresholds and gammas, then re-use least_splines_calc to plot the linear spline regression, as sketched below.
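A minimal post-processing sketch of that step (assuming the parameter and function names from the snippet above; the plotting grid is illustrative):
# Rebuild the threshold/gamma arrays from the fitted lmfit parameters and
# evaluate the spline on a dense grid for plotting.
fitted_thresholds = np.array([mi.params[k].value for k in ('th0', 'th1', 'th2', 'th3')])
fitted_gammas = np.array([mi.params[k].value for k in ('gamma1', 'gamma2')])

x_plot = np.linspace(dfseg.Temp.min(), dfseg.Temp.max(), 200)
y_plot = least_splines_calc_array(x_plot, fitted_thresholds, fitted_gammas)

plt.scatter(dfseg.Temp, dfseg.C)
plt.plot(x_plot, y_plot)
plt.show()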
Reference: while there are various places that explain least splines (I think @user423805 used http://www.statpower.net/Content/313/Lecture%20Notes/Splines.pdf , which has the (b1 + b2) addition I disagree with in its sample code despite similar equations), the one that made the most sense to me was this one (by Rob Schapire / Zia Khan at Princeton): https://www.cs.princeton.edu/courses/archive/spring07/cos424/scribe_notes/0403.pdf - section 2.2 goes into linear splines (the excerpt was an image and is omitted here).
If you're looking to join what appears to be two straight lines with a hyperbola having a variable radius at/near the intersection of the two lines (which are its asymptotes), I urge you to look hard at Using an Hyperbola as a Transition Model to Fit Two-Regime Straight-Line Data, by Donald G. Watts and David W. Bacon, Technometrics, Vol. 16, No. 3 (Aug., 1974), pp. 369-373.
The formula is drop dead simple, nicely adjustable, and works like a charm. From their paper (in case you can't access it):
As a more useful alternative form we consider an hyperbola for which:
(i) the dependent variable y is a single valued function of the independent variable x,
(ii) the left asymptote has slope theta_1,
(iii) the right asymptote has slope theta_2,
(iv) the asymptotes intersect at the point (x_o, beta_o),
(v) the radius of curvature at x = x_o is proportional to a quantity delta. Such an hyperbola can be written y = beta_o + beta_1*(x - x_o) + beta_2* SQRT[(x - x_o)^2 + delta^2/4], where beta_1 = (theta_1 + theta_2)/2 and beta_2 = (theta_2 - theta_1)/2.
delta is the adjustable parameter that allows you to either closely follow the lines right to the intersection point or smoothly merge from one line to the other.
Just solve for the intersection point (x_o, beta_o), and plug into the formula above.
BTW, in general, if line 1 is y_1 = b_1 + m_1 *x and line 2 is y_2 = b_2 + m_2 * x, then they intersect at x* = (b_2 - b_1) / (m_1 - m_2) and y* = b_1 + m_1 * x*. So, to connect with the formalism above, x_o = x*, beta_o = y* and the two m_*'s are the two thetas.
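For reference, a minimal sketch of that formula as a curve_fit model (the parameter names follow the paper's notation; the starting guesses are illustrative and assume x and y are loaded as in the first answer):
import numpy as np
from scipy.optimize import curve_fit

# Watts-Bacon transition hyperbola: asymptotic slopes theta1 and theta2 meeting
# at (x0, beta0), with delta controlling how sharp the transition is.
def watts_bacon(x, beta0, theta1, theta2, x0, delta):
    beta1 = (theta1 + theta2) / 2.0
    beta2 = (theta2 - theta1) / 2.0
    return beta0 + beta1 * (x - x0) + beta2 * np.sqrt((x - x0) ** 2 + delta ** 2 / 4.0)

# Illustrative starting guesses: value at the crossover, shallow slope,
# steep slope, threshold near 5000, and a moderate transition width.
p0 = [130.0, 0.02, 0.2, 5000.0, 100.0]
popt, pcov = curve_fit(watts_bacon, x, y, p0=p0)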
There is a straightforward method (not iterative, no initial guess required) on pp. 12-13 of https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf
The data comes from scanning the figure published by IanRoberts in his question. Reading pixel coordinates from a scan is not accurate, so don't be surprised by some additional deviation.
Note that the abscissa and ordinate scales have been divided by 1000.
The equations of the two segments and the approximate values of the five parameters are given in the accompanying figure (omitted here).

Failure of non linear fit to sine curve

I've been trying to fit the amplitude, frequency and phase of a sine curve given some generated two dimensional toy data. (Code at the end)
To get estimates for the three parameters, I first perform an FFT. I use the values from the FFT as initial guesses for the actual frequency and phase and then fit for them (row by row). I wrote my code such that I input which bin of the FFT I want the frequency to be in, so I can check if the fitting is working well. But there's some pretty strange behaviour. If my input bin is say 3.1 (a non integral bin, so the FFT won't give me the right frequency) then the fit works wonderfully. But if the input bin is 3 (so the FFT outputs the exact frequency) then my fit fails, and I'm trying to understand why.
Here's the output when I give the input bins (in the X and Y direction) as 3.0 and 2.1 respectively (the plot on the right is data minus fit): [plots omitted]
Here's the output when I give the input bins as 3.0 and 2.0: [plots omitted]
Question: Why does the non linear fit fail when I input the exact frequency of the curve?
Code:
#! /usr/bin/python
# For the purposes of this code, it's easier to think of the X-Y axes as transposed,
# so the X axis is vertical and the Y axis is horizontal

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optimize
import itertools
import sys

PI = np.pi


# Function which accepts parameters to define a sine curve
# Used for the non linear fit
def sineFit(t, a, f, p):
    return a * np.sin(2.0 * PI * f * t + p)


xSize = 18
ySize = 60
npt = xSize * ySize

# Get frequency bin from user input
xFreq = float(sys.argv[1])
yFreq = float(sys.argv[2])

xPeriod = xSize / xFreq
yPeriod = ySize / yFreq

# NOTE: the original post omits the definitions of sineGen and xDat.
# sineGen should be an (xSize, ySize) array, and xDat the sample positions
# along each row (described below as a linspace from 0 to xSize-1).

# Generate the 2D sine curve
for jj in range(0, xSize):
    for ii in range(0, ySize):
        sineGen[jj, ii] = np.cos(2.0 * PI * (ii / xPeriod + jj / yPeriod))

# Compute 2dim FFT as well as freq bins along each axis
fftData = np.fft.fft2(sineGen)
fftMean = np.mean(fftData)
fftRMS = np.std(fftData)
xFreqArr = np.fft.fftfreq(fftData.shape[1])  # Frequency bins along x
yFreqArr = np.fft.fftfreq(fftData.shape[0])  # Frequency bins along y

# Find peak of FFT, and position of peak
maxVal = np.amax(np.abs(fftData))
maxPos = np.where(np.abs(fftData) == maxVal)

# Iterate through peaks in the FFT
# For this example, number of loops will always be only one
prevPhase = -1000
for col, row in zip(maxPos[0], maxPos[1]):
    # Initial guesses for fit parameters from FFT
    init_phase = np.angle(fftData[col, row])
    init_amp = 2.0 * maxVal / npt
    init_freqY = yFreqArr[col]
    init_freqX = xFreqArr[row]
    cntr = 0
    if prevPhase == -1000:
        prevPhase = init_phase
    guess = [init_amp, init_freqX, prevPhase]

    # Fit each row of the 2D sine curve independently
    for rr in sineGen:
        (amp, freq, phs), pcov = optimize.curve_fit(sineFit, xDat, rr, guess)
        # xDat is a linspace array, containing a list of numbers from 0 to xSize-1

        # Subtract fit from original data and plot
        fitData = sineFit(xDat, amp, freq, phs)
        sub1 = rr - fitData

        # Plot
        fig1 = plt.figure()
        ax1 = fig1.add_subplot(121)
        p1, = ax1.plot(rr, 'g')
        p2, = ax1.plot(fitData, 'b')
        plt.legend([p1, p2], ["data", "fit"])

        ax2 = fig1.add_subplot(122)
        p3, = ax2.plot(sub1)
        plt.legend([p3], ['residual1'])

        fig1.tight_layout()
        plt.show()

        cntr += 1
        prevPhase = phs  # Update guess for phase of sine curve
I've tried to distill the important parts of your question into this answer.
First of all, try fitting a single block of data, not an array. Once you are confident that your model is sufficient you can move on.
Your fit is only going to be as good as your model, if you move on to something not "sine"-like you'll need to adjust accordingly.
Fitting is an "art", in that the initial conditions can greatly change the convergence of the error function. In addition there may be more than one minimum in your fits, so you often have to worry about the uniqueness of your proposed solution.
While you were on the right track with your FFT idea, I think your implementation wasn't quite correct. The code below should be a great toy system. It generates random data of the type f(x) = a0*sin(a1*x+a2). Sometimes a random initial guess will work, sometimes it will fail spectacularly. However, using the FFT guess for the frequency, the convergence should always work for this system. An example output: [figure omitted]
import numpy as np
import pylab as plt
import scipy.optimize as optimize


# This is your target function
def sineFit(t, params):
    a, f, p = params
    return a * np.sin(2.0 * np.pi * f * t + p)


# This is our "error" function
def err_func(p0, X, Y, target_function):
    err = ((Y - target_function(X, p0)) ** 2).sum()
    return err


# Try out different parameters, sometimes the random guess works
# sometimes it fails. The FFT solution should always work for this problem
inital_args = np.random.random(3)

X = np.linspace(0, 10, 1000)
Y = sineFit(X, inital_args)

# Use a random inital guess
inital_guess = np.random.random(3)

# Fit
sol = optimize.fmin(err_func, inital_guess, args=(X, Y, sineFit))

# Plot the fit
Y2 = sineFit(X, sol)
plt.figure(figsize=(15, 10))
plt.subplot(211)
plt.title("Random Inital Guess: Final Parameters: %s" % sol)
plt.plot(X, Y)
plt.plot(X, Y2, 'r', alpha=.5, lw=10)

# Use an improved "fft" guess for the frequency
# this will be the max in k-space
timestep = X[1] - X[0]
guess_k = np.argmax(np.abs(np.fft.rfft(Y)))
guess_f = np.fft.fftfreq(X.size, timestep)[guess_k]
inital_guess[1] = guess_f

# Guess the amplitude by taking the max of the absolute values
inital_guess[0] = np.abs(Y).max()

sol = optimize.fmin(err_func, inital_guess, args=(X, Y, sineFit))
Y2 = sineFit(X, sol)

plt.subplot(212)
plt.title("FFT Guess : Final Parameters: %s" % sol)
plt.plot(X, Y)
plt.plot(X, Y2, 'r', alpha=.5, lw=10)
plt.show()
The problem is due to a bad initial guess of the phase, not the frequency. While cycling through the rows of sineGen (inner loop), you use the fit result of the previous row as the initial guess for the next row, which does not always work. If you instead determine the phase from an FFT of the current row and use that as the initial guess, the fit will succeed.
You could change the inner loop as follows:
for n, rr in enumerate(sineGen):
    fftx = np.fft.fft(rr)
    fftx = fftx[:len(fftx) // 2]
    idx = np.argmax(np.abs(fftx))
    init_phase = np.angle(fftx[idx])
    print(fftx[idx], init_phase)
...
Also you need to change
def sineFit(t, a, f, p):
    return a * np.sin(2.0 * np.pi * f*t + p)
to
def sineFit(t, a, f, p):
    return a * np.cos(2.0 * np.pi * f*t + p)
since phase=0 means that the imaginary part of the fft is zero and thus the function is cosine like.
By the way, your sample above is still lacking the definitions of sineGen and xDat.
Without understanding much of your code, according to http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat, sub1, guess2)
should become:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat, sub1, p0=guess2)
Assuming that tDat and sub1 are x and y, that should do the trick. But, once again, it is quite difficult to understand such complex code with so many interlinked variables and no comments at all. Code should always be built from the bottom up, meaning that you don't run a loop of fits when a single one is not working, and you don't add noise until the code fits the non-noisy examples... Good luck!
By "nothing fancy" I meant something like removing EVERYTHING that is not related to the fit, and doing a simplified mock example such as:
import numpy as np
import scipy.optimize as optimize


def sineFit(t, a, f, p):
    return a * np.sin(2.0 * np.pi * f*t + p)


# Create arrays of x and y with given parameters
x = np.asarray(range(100))
y = sineFit(x, 1, 0.05, 0)

# Give a guess and fit, printing the result of the fitted values
guess = [1., 0.05, 0.]
print(optimize.curve_fit(sineFit, x, y, guess)[0])
The result of this is exactly the answer:
[1. 0.05 0.]
But if you change guess not too much, just enough:
# Give a guess and fit, printing result of the fitted values
guess = [1., 0.06, 0.]
print(optimize.curve_fit(sineFit, x, y, guess)[0])
the result gives absurdly wrong numbers:
[ 0.00823701 0.06391323 -1.20382787]
Can you explain this behavior?
You can use curve_fit with a series of trigonometric functions; this is usually very robust and adjustable to the precision that you need just by increasing the number of terms... here is an example:
import numpy as np
import matplotlib.pyplot as plt
from numpy import sin, cos, linspace
from scipy.optimize import curve_fit


def f(x, a0, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12,
         c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11, c12):
    return a0 + s1*sin(1*x) + c1*cos(1*x) \
              + s2*sin(2*x) + c2*cos(2*x) \
              + s3*sin(3*x) + c3*cos(3*x) \
              + s4*sin(4*x) + c4*cos(4*x) \
              + s5*sin(5*x) + c5*cos(5*x) \
              + s6*sin(6*x) + c6*cos(6*x) \
              + s7*sin(7*x) + c7*cos(7*x) \
              + s8*sin(8*x) + c8*cos(8*x) \
              + s9*sin(9*x) + c9*cos(9*x) \
              + s10*sin(10*x) + c10*cos(10*x) \
              + s11*sin(11*x) + c11*cos(11*x) \
              + s12*sin(12*x) + c12*cos(12*x)


# Normalize x so that the data span maps onto a fixed angular range before fitting.
# (The original snippet was garbled here; this reconstruction assumes that intent.)
norm_factor = np.pi / 2. / (x.max() - x.min())
x_norm = x * norm_factor

popt, pcov = curve_fit(f, x_norm, y)

x_fit = linspace(x_norm.min(), x_norm.max(), 1000)
y_fit = f(x_fit, *popt)

plt.plot(x_fit / norm_factor, y_fit)
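A small self-contained usage sketch of this approach; the synthetic two-regime data below is purely illustrative, since the snippet above expects x and y to exist already:
# Demo: fit the trigonometric series to synthetic data with two linear regimes.
rng = np.random.default_rng(0)
x = np.linspace(0, 10000, 500)
y = np.where(x < 5000, 0.02 * x + 30, 0.2 * x - 870) + rng.normal(0, 20, x.size)

norm_factor = np.pi / 2. / (x.max() - x.min())
x_norm = x * norm_factor

popt, pcov = curve_fit(f, x_norm, y)

x_fit = linspace(x_norm.min(), x_norm.max(), 1000)
plt.plot(x, y, '.', alpha=0.3)
plt.plot(x_fit / norm_factor, f(x_fit, *popt))
plt.show()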
