Linear regression minimizing errors only above the linear

Linear regression minimizing errors only above the linear - python

I have a dataset that resembles the data created in the MWE below:
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
plt.grid()
That is, the errors around the line I want are mostly negative for x>0 and mostly positive for x<0. What I'm looking for is a way to estimate the coefficient K (which in this case is -2).
Really I think the way to do it would be to minimize the error only of the points that fall above the line for x<0 and below the line for x>0, but I'm not sure how to go about it effectively in Python, since everything I can think of involves iterative processes which are slow in Python.

Basically you want to include something that can account for the mean variable in your data generating model. You can do this by modeling a discontinuity at the point x=0 by including a variable in your model that is 0 where x < 0 and 1 where x > 0.
We can even just include the "mean" variable itself and get the same model (with a different interpretation for the second coefficient). Here is a linear model that recovers the correct value for the slope of this discontinuous line. Note that this assumes the slope is the same on the right side of 0 as the left side.
from sklearn.linear_model import LinearRegression
X = np.array([x, mean]).T
reg = LinearRegression().fit(X, y)
print(reg.coef_)

Here is my attempt where I A) fit all data to a straight line, and then B) separate data depending on two criteria: whether x is greater than or less than zero and whether predicted Y is above or below that straight line, and finally C) fit the separated data. The slope is here -2.417 and will vary from run to run depending on the random data.
from matplotlib import pyplot as plt
import numpy as np
sz=100
x = np.linspace(-1, 1, sz)
mean = -np.sign(x)
noise = np.random.randn(*x.shape)
K = -2
y_true = K*x
y = y_true + mean + noise
plt.scatter(x, y, label="Data with error")
plt.plot(x, y_true, "-", label="True line")
###############################
# new section for calculatiing new line
allDataFirstOrderParameters = np.polyfit(x, y, 1)
allDataFirstOrderErrors = y - np.polyval(allDataFirstOrderParameters, x)
newX = []
newY = []
for i in range(len(x)):
if x[i] < 0 and allDataFirstOrderErrors[i] < 0:
newX.append(x[i])
newY.append(y[i])
if x[i] > 0 and allDataFirstOrderErrors[i] > 0:
newX.append(x[i])
newY.append(y[i])
newX = np.array(newX)
newY = np.array(newY)
newFirstOrderParameters = np.polyfit(newX, newY, 1)
print("New Parameters", newFirstOrderParameters)
plotNewX = np.linspace(min(x), max(x))
plotNewY = np.polyval(newFirstOrderParameters, plotNewX)
plt.plot(plotNewX, plotNewY, label="New line")
plt.legend()
plt.show()

Related

Linear fit constrained to go through the first point in Python

I have two numpy 1-d arrays of equal length, x and a response variable y. I'd like to do a least-squares fit that forces the resulting line to go through the point (x_0, y_0). What's the easiest way to pull this off?
Note: some optimization solutions have been proposed. These are good in theory, but the line must go exactly through the point – even if the fit for the rest of the points is worse.

One way to do it would be to center your data on (x_0, y_0), run linear regression (specifying no intercept), and then transform the predictions back to the original scale.
Here's an example:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randn(100)
y = np.random.randn(100) + x
plt.plot(x, y, "o")
plt.plot(x[0], y[0], "o") # x_0, y_0 is the orange dot
Next, use linear regression without an intercept to fit a line that goes through the transformed data.
from sklearn.linear_model import LinearRegression
lm = LinearRegression(fit_intercept = False)
# center data on x_0, y_0
y2 = y - y[0]
x2 = x - x[0]
# fit model
lm.fit(x2.reshape(-1, 1), y2)
Last, predict the line and plot it back on the original scale
# predict line
preds = lm.predict(np.arange(-5, 5, 0.1).reshape(-1,1))
# plot on original scale
plt.plot(x, y, "o")
plt.plot(x[0], y[0], "o")
# add x_0 and y_0 back to the predictions
plt.plot(np.arange(-5, 5, 0.1) + x[0], preds + y[0])

Least squares not working for a set of y's

I am trying to run a least square algorithm using numpy and is having trouble. Can someone please tell me what I am doing wrong in the given code? When I set y to be y = np.power(X, 1) + np.random.rand(20)*3 or some other reasonable function of x, everything is working fine. But for that particular y defined by those given y values, the plot I am getting is senseless.
Is this some kind of numerical problem?
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
X = np.arange(1,21)
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
#y = np.power(X, 1) + np.random.rand(20)*3
w = np.linalg.lstsq(X.reshape(20, 1), y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0], 'blue')
plt.show()

Are you sure there is a linear relationship between what you are fitting and the y variable data?
Using the code (y = np.power(X, 1) + np.random.rand(20)*3) from your example, you have a linear relationship built into the y variable itself (with some noise) which allows your plot to track relatively well with the linear equation.
X = np.arange(1,21)
#y = np.power(X, 1) + np.random.rand(20)*3
w = np.linalg.lstsq(X.reshape(20, 1), y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X*w[0], 'blue')
plt.show()
However, when you alternate to something like your y variable
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
You end up with something less easy to fit.
Looking at the documentation, if you are attempting to something that fits this set of values, you will need to build in a constant component in which case lstsq does not do by default.
The docs state for lstsq
Return the least-squares solution to a linear matrix equation.
Solves the equation a x = b
If you really want to fit the data to a linear equation, running code like the below will give you something that almost matches your original data. However, the data behind this process seems to have polynomial/exponential driver which would make polyfit better.
X = np.arange(1,21)
y = np.array([-0.00454712, -0.00457764, -0.0045166 , -0.00442505, -0.00427246,
-0.00411987, -0.00378418, -0.003479 , -0.00314331, -0.00259399,
-0.00213623, -0.00146484, -0.00082397, -0.00030518, 0.00027466,
0.00076294, 0.00146484, 0.00192261, 0.00247192, 0.00314331])
#y = np.power(X, 1) + np.random.rand(20)*3
X2 = np.vstack([X, np.ones(len(X))]).T
w = np.linalg.lstsq(X2, y)[0]
plt.plot(X, y, 'red')
plt.plot(X, X.dot(w[0])+w[1], 'blue')
plt.show()

How does one implement a subsampled RBF (Radial Basis Function) in Numpy?

I was trying to implement a Radial Basis Function in Python and Numpy as describe by CalTech lecture here. The mathematics seems clear to me so I find it strange that its not working (or it seems to not work). The idea is simple, one chooses a subsampled number of centers for each Gaussian form a kernal matrix and tries to find the best coefficients. i.e. solve Kc = y where K is the guassian kernel (gramm) matrix with least squares. For that I did:
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
but when I try to plot my interpolation with the original data they don't look at all alike:
with 100 random centers (from the data set). I also tried 10 centers which produces essentially the same graph as so does using every data point in the training set. I assumed that using every data point in the data set should more or less perfectly copy the curve but it didn't (overfit). It produces:
which doesn't seem correct. I will provide the full code (that runs without error):
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from scipy.interpolate import Rbf
import matplotlib.pyplot as plt
## Data sets
def get_labels_improved(X,f):
N_train = X.shape[0]
Y = np.zeros( (N_train,1) )
for i in range(N_train):
Y[i] = f(X[i])
return Y
def get_kernel_matrix(x,W,S):
beta = get_beta_np(S)
#beta = 0.5*tf.pow(tf.div( tf.constant(1.0,dtype=tf.float64),S), 2)
Z = -beta*euclidean_distances(X=x,Y=W,squared=True)
K = np.exp(Z)
return K
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = low_x + (high_x - low_x) * np.random.rand(N,1)
# f(x) = 2*(2(cos(x)^2 - 1)^2 -1
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = get_labels_improved(X , f)
K = 2 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 100
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X,Y=subsampled_data_points,squared=True))
#(C,_,_,_) = np.linalg.lstsq(K,Y_train)
C = np.dot( np.linalg.pinv(Kern), Y )
Y_pred = np.dot( Kern , C )
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()
Since the plots look strange I decided to read the docs for the ploting functions but I couldn't find anything obvious that was wrong.

Scaling of interpolating functions
The main problem is unfortunate choice of standard deviation of the functions used for interpolation:
stddev = 100
The features of your functions (its humps) are of size about 1. So, use
stddev = 1
Order of X values
The mess of red lines is there because plt from matplotlib connects consecutive data points, in the order given. Since your X values are in random order, this results in chaotic left-right movements. Use sorted X:
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
Efficiency issues
Your get_labels_improved method is inefficient, looping over the elements of X. Use Y = f(X), leaving the looping to low-level NumPy internals.
Also, the computation of least-squared solution of an overdetermined system should be done with lstsq instead of computing the pseudoinverse (computationally expensive) and multiplying by it.
Here is the cleaned-up code; using 30 centers gives a good fit.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
import matplotlib.pyplot as plt
N = 5000
low_x =-2*np.pi
high_x=2*np.pi
X = np.sort(low_x + (high_x - low_x) * np.random.rand(N,1), axis=0)
f = lambda x: 2*np.power( 2*np.power( np.cos(x) ,2) - 1, 2) - 1
Y = f(X)
K = 30 # number of centers for RBF
indices=np.random.choice(a=N,size=K) # choose numbers from 0 to D^(1)
subsampled_data_points=X[indices,:] # M_sub x D
stddev = 1
beta = 0.5*np.power(1.0/stddev,2)
Kern = np.exp(-beta*euclidean_distances(X=X, Y=subsampled_data_points,squared=True))
C = np.linalg.lstsq(Kern, Y)[0]
Y_pred = np.dot(Kern, C)
plt.plot(X, Y, 'o', label='Original data', markersize=1)
plt.plot(X, Y_pred, 'r', label='Fitted line', markersize=1)
plt.legend()
plt.show()

Failure of non linear fit to sine curve

I've been trying to fit the amplitude, frequency and phase of a sine curve given some generated two dimensional toy data. (Code at the end)
To get estimates for the three parameters, I first perform an FFT. I use the values from the FFT as initial guesses for the actual frequency and phase and then fit for them (row by row). I wrote my code such that I input which bin of the FFT I want the frequency to be in, so I can check if the fitting is working well. But there's some pretty strange behaviour. If my input bin is say 3.1 (a non integral bin, so the FFT won't give me the right frequency) then the fit works wonderfully. But if the input bin is 3 (so the FFT outputs the exact frequency) then my fit fails, and I'm trying to understand why.
Here's the output when I give the input bins (in the X and Y direction) as 3.0 and 2.1 respectively:
(The plot on the right is data - fit)
Here's the output when I give the input bins as 3.0 and 2.0:
Question: Why does the non linear fit fail when I input the exact frequency of the curve?
Code:
#! /usr/bin/python
# For the purposes of this code, it's easier to think of the X-Y axes as transposed,
# so the X axis is vertical and the Y axis is horizontal
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optimize
import itertools
import sys
PI = np.pi
# Function which accepts paramters to define a sin curve
# Used for the non linear fit
def sineFit(t, a, f, p):
return a * np.sin(2.0 * PI * f*t + p)
xSize = 18
ySize = 60
npt = xSize * ySize
# Get frequency bin from user input
xFreq = float(sys.argv[1])
yFreq = float(sys.argv[2])
xPeriod = xSize/xFreq
yPeriod = ySize/yFreq
# arrays should be defined here
# Generate the 2D sine curve
for jj in range (0, xSize):
for ii in range(0, ySize):
sineGen[jj, ii] = np.cos(2.0*PI*(ii/xPeriod + jj/yPeriod))
# Compute 2dim FFT as well as freq bins along each axis
fftData = np.fft.fft2(sineGen)
fftMean = np.mean(fftData)
fftRMS = np.std(fftData)
xFreqArr = np.fft.fftfreq(fftData.shape[1]) # Frequency bins along x
yFreqArr = np.fft.fftfreq(fftData.shape[0]) # Frequency bins along y
# Find peak of FFT, and position of peak
maxVal = np.amax(np.abs(fftData))
maxPos = np.where(np.abs(fftData) == maxVal)
# Iterate through peaks in the FFT
# For this example, number of loops will always be only one
prevPhase = -1000
for col, row in itertools.izip(maxPos[0], maxPos[1]):
# Initial guesses for fit parameters from FFT
init_phase = np.angle(fftData[col,row])
init_amp = 2.0 * maxVal/npt
init_freqY = yFreqArr[col]
init_freqX = xFreqArr[row]
cntr = 0
if prevPhase == -1000:
prevPhase = init_phase
guess = [init_amp, init_freqX, prevPhase]
# Fit each row of the 2D sine curve independently
for rr in sineGen:
(amp, freq, phs), pcov = optimize.curve_fit(sineFit, xDat, rr, guess)
# xDat is an linspace array, containing a list of numbers from 0 to xSize-1
# Subtract fit from original data and plot
fitData = sineFit(xDat, amp, freq, phs)
sub1 = rr - fitData
# Plot
fig1 = plt.figure()
ax1 = fig1.add_subplot(121)
p1, = ax1.plot(rr, 'g')
p2, = ax1.plot(fitData, 'b')
plt.legend([p1,p2], ["data", "fit"])
ax2 = fig1.add_subplot(122)
p3, = ax2.plot(sub1)
plt.legend([p3], ['residual1'])
fig1.tight_layout()
plt.show()
cntr += 1
prevPhase = phs # Update guess for phase of sine curve

I've tried to distill the important parts of your question into this answer.
First of all, try fitting a single block of data, not an array. Once you are confident that your model is sufficient you can move on.
Your fit is only going to be as good as your model, if you move on to something not "sine"-like you'll need to adjust accordingly.
Fitting is an "art", in that the initial conditions can greatly change the convergence of the error function. In addition there may be more than one minima in your fits, so you often have to worry about the uniqueness of your proposed solution.
While you were on the right track with your FFT idea, I think your implementation wasn't quite correct. The code below should be a great toy system. It generates random data of the type f(x) = a0*sin(a1*x+a2). Sometimes a random initial guess will work, sometimes it will fail spectacularly. However, using the FFT guess for the frequency the convergence should always work for this system. An example output:
import numpy as np
import pylab as plt
import scipy.optimize as optimize
# This is your target function
def sineFit(t, (a, f, p)):
return a * np.sin(2.0*np.pi*f*t + p)
# This is our "error" function
def err_func(p0, X, Y, target_function):
err = ((Y - target_function(X, p0))**2).sum()
return err
# Try out different parameters, sometimes the random guess works
# sometimes it fails. The FFT solution should always work for this problem
inital_args = np.random.random(3)
X = np.linspace(0, 10, 1000)
Y = sineFit(X, inital_args)
# Use a random inital guess
inital_guess = np.random.random(3)
# Fit
sol = optimize.fmin(err_func, inital_guess, args=(X,Y,sineFit))
# Plot the fit
Y2 = sineFit(X, sol)
plt.figure(figsize=(15,10))
plt.subplot(211)
plt.title("Random Inital Guess: Final Parameters: %s"%sol)
plt.plot(X,Y)
plt.plot(X,Y2,'r',alpha=.5,lw=10)
# Use an improved "fft" guess for the frequency
# this will be the max in k-space
timestep = X[1]-X[0]
guess_k = np.argmax( np.fft.rfft(Y) )
guess_f = np.fft.fftfreq(X.size, timestep)[guess_k]
inital_guess[1] = guess_f
# Guess the amplitiude by taking the max of the absolute values
inital_guess[0] = np.abs(Y).max()
sol = optimize.fmin(err_func, inital_guess, args=(X,Y,sineFit))
Y2 = sineFit(X, sol)
plt.subplot(212)
plt.title("FFT Guess : Final Parameters: %s"%sol)
plt.plot(X,Y)
plt.plot(X,Y2,'r',alpha=.5,lw=10)
plt.show()

The problem is due to a bad initial guess of the phase, not the frequency. While cycling through the rows of genSine (inner loop) you use the fit result of the previous line as initial guess for the next row which does not work always. If you determine the phase from an fft of the current row and use that as initial guess the fit will succeed.
You could change the inner loop as follows:
for n,rr in enumerate(sineGen):
fftx = np.fft.fft(rr)
fftx = fftx[:len(fftx)/2]
idx = np.argmax(np.abs(fftx))
init_phase = np.angle(fftx[idx])
print fftx[idx], init_phase
...
Also you need to change
def sineFit(t, a, f, p):
return a * np.sin(2.0 * np.pi * f*t + p)
to
def sineFit(t, a, f, p):
return a * np.cos(2.0 * np.pi * f*t + p)
since phase=0 means that the imaginary part of the fft is zero and thus the function is cosine like.
Btw. your sample above is still lacking definitions of sineGen and xDat.

Without understanding much of your code, according to http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat,
sub1, guess2)
should become:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat,
sub1, p0=guess2)
Assuming that tDat and sub1 are x and y, that should do the trick. But, once again, it is quite difficult to understand such a complex code with so many interlinked variables and no comments at all. A code should always be build from bottom up, meaning that you don't do a loop of fits when a single one is not working, you don't add noise until the code works to fit the non-noisy examples... Good luck!

By "nothing fancy" I meant something like removing EVERYTHING that is not related with the fit, and doing a simplified mock example such as:
import numpy as np
import scipy.optimize as optimize
def sineFit(t, a, f, p):
return a * np.sin(2.0 * np.pi * f*t + p)
# Create array of x and y with given parameters
x = np.asarray(range(100))
y = sineFit(x, 1, 0.05, 0)
# Give a guess and fit, printing result of the fitted values
guess = [1., 0.05, 0.]
print optimize.curve_fit(sineFit, x, y, guess)[0]
The result of this is exactly the answer:
[1. 0.05 0.]
But if you change guess not too much, just enough:
# Give a guess and fit, printing result of the fitted values
guess = [1., 0.06, 0.]
print optimize.curve_fit(sineFit, x, y, guess)[0]
the result gives absurdly wrong numbers:
[ 0.00823701 0.06391323 -1.20382787]
Can you explain this behavior?

You can use curve_fit with a series of trigonometric functions, usually very robust and ajustable to the precision that you need just by increasing the number of terms... here is an example:
from scipy import sin, cos, linspace
def f(x, a0,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12):
return a0 + s1*sin(1*x) + c1*cos(1*x) \
+ s2*sin(2*x) + c2*cos(2*x) \
+ s3*sin(3*x) + c3*cos(3*x) \
+ s4*sin(4*x) + c4*cos(4*x) \
+ s5*sin(5*x) + c5*cos(5*x) \
+ s6*sin(6*x) + c6*cos(6*x) \
+ s7*sin(7*x) + c7*cos(7*x) \
+ s8*sin(8*x) + c8*cos(8*x) \
+ s9*sin(9*x) + c9*cos(9*x) \
+ s10*sin(9*x) + c10*cos(9*x) \
+ s11*sin(9*x) + c11*cos(9*x) \
+ s12*sin(9*x) + c12*cos(9*x)
from scipy.optimize import curve_fit
pi/2. / (x.max() - x.min())
x_norm *= norm_factor
popt, pcov = curve_fit(f, x_norm, y)
x_fit = linspace(x_norm.min(), x_norm.max(), 1000)
y_fit = f(x_fit, *popt)
plt.plot( x_fit/x_norm, y_fit )

Python: two-curve gaussian fitting with non-linear least-squares

My knowledge of maths is limited which is why I am probably stuck. I have a spectra to which I am trying to fit two Gaussian peaks. I can fit to the largest peak, but I cannot fit to the smallest peak. I understand that I need to sum the Gaussian function for the two peaks but I do not know where I have gone wrong. An image of my current output is shown:
The blue line is my data and the green line is my current fit. There is a shoulder to the left of the main peak in my data which I am currently trying to fit, using the following code:
import matplotlib.pyplot as pt
import numpy as np
from scipy.optimize import leastsq
from pylab import *
time = []
counts = []
for i in open('/some/folder/to/file.txt', 'r'):
segs = i.split()
time.append(float(segs[0]))
counts.append(segs[1])
time_array = arange(len(time), dtype=float)
counts_array = arange(len(counts))
time_array[0:] = time
counts_array[0:] = counts
def model(time_array0, coeffs0):
a = coeffs0[0] + coeffs0[1] * np.exp( - ((time_array0-coeffs0[2])/coeffs0[3])**2 )
b = coeffs0[4] + coeffs0[5] * np.exp( - ((time_array0-coeffs0[6])/coeffs0[7])**2 )
c = a+b
return c
def residuals(coeffs, counts_array, time_array):
return counts_array - model(time_array, coeffs)
# 0 = baseline, 1 = amplitude, 2 = centre, 3 = width
peak1 = np.array([0,6337,16.2,4.47,0,2300,13.5,2], dtype=float)
#peak2 = np.array([0,2300,13.5,2], dtype=float)
x, flag = leastsq(residuals, peak1, args=(counts_array, time_array))
#z, flag = leastsq(residuals, peak2, args=(counts_array, time_array))
plt.plot(time_array, counts_array)
plt.plot(time_array, model(time_array, x), color = 'g')
#plt.plot(time_array, model(time_array, z), color = 'r')
plt.show()

This code worked for me providing that you are only fitting a function that is a combination of two Gaussian distributions.
I just made a residuals function that adds two Gaussian functions and then subtracts them from the real data.
The parameters (p) that I passed to Numpy's least squares function include: the mean of the first Gaussian function (m), the difference in the mean from the first and second Gaussian functions (dm, i.e. the horizontal shift), the standard deviation of the first (sd1), and the standard deviation of the second (sd2).
import numpy as np
from scipy.optimize import leastsq
import matplotlib.pyplot as plt
######################################
# Setting up test data
def norm(x, mean, sd):
norm = []
for i in range(x.size):
norm += [1.0/(sd*np.sqrt(2*np.pi))*np.exp(-(x[i] - mean)**2/(2*sd**2))]
return np.array(norm)
mean1, mean2 = 0, -2
std1, std2 = 0.5, 1
x = np.linspace(-20, 20, 500)
y_real = norm(x, mean1, std1) + norm(x, mean2, std2)
######################################
# Solving
m, dm, sd1, sd2 = [5, 10, 1, 1]
p = [m, dm, sd1, sd2] # Initial guesses for leastsq
y_init = norm(x, m, sd1) + norm(x, m + dm, sd2) # For final comparison plot
def res(p, y, x):
m, dm, sd1, sd2 = p
m1 = m
m2 = m1 + dm
y_fit = norm(x, m1, sd1) + norm(x, m2, sd2)
err = y - y_fit
return err
plsq = leastsq(res, p, args = (y_real, x))
y_est = norm(x, plsq[0][0], plsq[0][2]) + norm(x, plsq[0][0] + plsq[0][1], plsq[0][3])
plt.plot(x, y_real, label='Real Data')
plt.plot(x, y_init, 'r.', label='Starting Guess')
plt.plot(x, y_est, 'g.', label='Fitted')
plt.legend()
plt.show()

You can use Gaussian mixture models from scikit-learn:
from sklearn import mixture
import matplotlib.pyplot
import matplotlib.mlab
import numpy as np
clf = mixture.GMM(n_components=2, covariance_type='full')
clf.fit(yourdata)
m1, m2 = clf.means_
w1, w2 = clf.weights_
c1, c2 = clf.covars_
histdist = matplotlib.pyplot.hist(yourdata, 100, normed=True)
plotgauss1 = lambda x: plot(x,w1*matplotlib.mlab.normpdf(x,m1,np.sqrt(c1))[0], linewidth=3)
plotgauss2 = lambda x: plot(x,w2*matplotlib.mlab.normpdf(x,m2,np.sqrt(c2))[0], linewidth=3)
plotgauss1(histdist[1])
plotgauss2(histdist[1])
You can also use the function below to fit the number of Gaussian you want with ncomp parameter:
from sklearn import mixture
%pylab
def fit_mixture(data, ncomp=2, doplot=False):
clf = mixture.GMM(n_components=ncomp, covariance_type='full')
clf.fit(data)
ml = clf.means_
wl = clf.weights_
cl = clf.covars_
ms = [m[0] for m in ml]
cs = [numpy.sqrt(c[0][0]) for c in cl]
ws = [w for w in wl]
if doplot == True:
histo = hist(data, 200, normed=True)
for w, m, c in zip(ws, ms, cs):
plot(histo[1],w*matplotlib.mlab.normpdf(histo[1],m,np.sqrt(c)), linewidth=3)
return ms, cs, ws

coeffs 0 and 4 are degenerate - there is absolutely nothing in the data that can decide between them. you should use a single zero level parameter instead of two (ie remove one of them from your code). this is probably what is stopping your fit (ignore the comments here saying this is not possible - there are clearly at least two peaks in that data and you should certainly be able to fit to that).
(it may not be clear why i am suggesting this, but what is happening is that coeffs 0 and 4 can cancel each other out. they can both be zero, or one could be 100 and the other -100 - either way, the fit is just as good. this "confuses" the fitting routine, which spends its time trying to work out what they should be, when there is no single right answer, because whatever value one is, the other can just be the negative of that, and the fit will be the same).
in fact, from the plot, it looks like there may be no need for a zero level at all. i would try dropping both of those and seeing how the fit looks.
also, there is no need to fit coeffs 1 and 5 (or the zero point) in the least squares. instead, because the model is linear in those you could calculate their values each loop. this will make things faster, but is not critical. i just noticed you say your maths is not so good, so probably ignore this one.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Linear regression minimizing errors only above the linear - python

Related

Linear fit constrained to go through the first point in Python

Least squares not working for a set of y's

How does one implement a subsampled RBF (Radial Basis Function) in Numpy?

Failure of non linear fit to sine curve

Python: two-curve gaussian fitting with non-linear least-squares

Categories

Resources