Say I want to fit a sine function using scipy.optimize.curve_fit, without knowing any of the function's parameters in advance. To get the frequency I do a Fourier transform, and I guess all the other parameters - amplitude, phase, and offset. When I run my program I do get a fit, but it does not make sense. What is the problem? Any help will be appreciated.
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
ampl = 1
freq = 24.5
phase = np.pi/2
offset = 0.05
t = np.arange(0,10,0.001)
func = np.sin(2*np.pi*t*freq + phase) + offset
fastfft = np.fft.fft(func)
freq_array = np.fft.fftfreq(len(t),t[0]-t[1])
max_value_index = np.argmax(abs(fastfft))
frequency = abs(freq_array[max_value_index])
def fit(a, f, p, o, t):
return a * np.sin(2*np.pi*t*f + p) + o
guess = (0.9, frequency, np.pi/4, 0.1)
params, fit = sp.optimize.curve_fit(fit, t, func, p0=guess)
a, f, p, o = params
fitfunc = lambda t: a * np.sin(2*np.pi*t*f + p) + o
plt.plot(t, func, 'r-', t, fitfunc(t), 'b-')
The main problem in your program is a misunderstanding of how scipy.optimize.curve_fit is designed and what it assumes about the fit function:
ydata = f(xdata, *params) + eps
This means that the fit function must take the array of x values as its first argument, followed by the fit parameters (in the same order as in p0), and must return an array of y values. Here is an example of how to do this:
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize
#t has to be the first parameter of the fit function
def fit(t, a, f, p, o):
return a * np.sin(2*np.pi*t*f + p) + o
ampl = 1
freq = 2
phase = np.pi/2
offset = 0.5
t = np.arange(0,10,0.01)
#is the same as fit(t, ampl, freq, phase, offset)
func = np.sin(2*np.pi*t*freq + phase) + offset
fastfft = np.fft.fft(func)
freq_array = np.fft.fftfreq(len(t), t[1]-t[0])  # note: the sample spacing should be positive
max_value_index = np.argmax(abs(fastfft))
frequency = abs(freq_array[max_value_index])
guess = (0.9, frequency, np.pi/4, 0.1)
#renamed the covariance matrix
params, pcov = scipy.optimize.curve_fit(fit, t, func, p0=guess)
a, f, p, o = params
#calculate the fit plot using the fit function
plt.plot(t, func, 'r-', t, fit(t, *params), 'b-')
plt.show()
As you can see, I have also changed the way the fit curve for the plot is calculated. You don't need another function - just reuse the fit function with the parameter list that the fit procedure returns.
The other problem was that you named the covariance matrix fit, overwriting the previously defined function fit. I fixed that as well.
P.S.: Of course now you only see one curve, because the perfect fit covers your data points.
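If you do want to see both curves, a quick tweak (a minimal sketch reusing the names from the example above) is to add a little noise to the synthetic data before fitting:
# Add Gaussian noise so data and fit are distinguishable in the plot
rng = np.random.default_rng(0)
func_noisy = func + rng.normal(scale=0.1, size=len(t))
params, pcov = scipy.optimize.curve_fit(fit, t, func_noisy, p0=guess)
plt.plot(t, func_noisy, 'r-', t, fit(t, *params), 'b-')
plt.show()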
I am trying to plot the output of the function below. The function calculates the probability density for the given parameters. The plot should be bell shaped like the normal distribution, but not necessarily symmetric. How should I plot this? Any suggestions?
I tried plt.plot(x, y, 'b'), but this didn't work.
import math
import numpy as np
from scipy import fftpack
from scipy.special import gamma
import matplotlib.pyplot as plt
u=1
T=1
N=4096
du=0.001534
dt=0.001
mu=0.5
eta=0.25
args=(5,30,35,0.5)
scale=1
#### 1.1 Defining the parameters of the model
def cf_log_cgmy(u, lnS, T, mu ,half_etasq, C, G, M, Y):
    omega = -C*gamma(-Y)*(np.power(M-1,Y) - np.power(M,Y) + np.power(G+1,Y) - np.power(G,Y))
    phi_CGMY = C*T*gamma(-Y)*(np.power(M-1j*u,Y) - np.power(M,Y) + np.power(G+1j*u,Y) - np.power(G,Y))
phi = 1j*u*(lnS + (mu+omega-half_etasq)*T) + phi_CGMY - half_etasq*np.power(u,2)
return np.exp(scale*phi)
#### 1.2 From the characteristic function to the probability density function
def cf_to_pdf(cf,du,N):
vec = np.linspace(0,N-1,N)
u = du* (vec-N/2)
res = np.fft.ifftshift(fftpack.fft(np.fft.fftshift(cf(u))))
x = np.linspace(-(np.pi)/du,(np.pi)/du,N, endpoint=False)
y = np.real(res*du/(2*np.pi))
return x , y
def dist_cgmy(args, du, N, dt,mu, eta):
cf = lambda u: cf_log_cgmy(u,0,dt,mu,0.5*np.power(eta,2),args[0],args[1],args[2],args[3])
return cf_to_pdf(cf ,du ,N)
dist_cgmy(args, du, N, dt,mu, eta)
## The output is below
(array([-2047.97435045, -2046.97436297, -2045.9743755 , ...,
2044.97438802, 2045.9743755 , 2046.97436297]),
array([-2.70244620e-08, 2.68320697e-08, -2.66397090e-08, ...,
2.76018306e-08, -2.74093422e-08, 2.72168861e-08]))
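Since dist_cgmy already returns the grid and the density as a tuple, presumably the plot only needs the return value to be unpacked first. A minimal sketch (the x-limits are chosen by eye, since the density is sharply peaked near zero on this wide grid):
x, y = dist_cgmy(args, du, N, dt, mu, eta)
plt.plot(x, y, 'b')
plt.xlim(-1, 1)  # zoom in; the grid extends to roughly +/- pi/du
plt.xlabel('x')
plt.ylabel('probability density')
plt.show()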
I am trying to apply an exponential fit to my data to determine the point at which the value drops by 1/e. When plotted, the fit favours smaller values and does not capture the true relationship.
import numpy as np
import matplotlib
matplotlib.use("TkAgg")  # set the TkAgg backend explicitly, otherwise a low-level error is raised
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit
def autoCorrelation(sample, longTime, temp, plotTau=False):
# compute empirical autocovariance with lag tau averaged over time longTime
sample.takeTimeStep(timesteps=1500) # 1500 timesteps to let sample reach equilibrium
M = np.zeros(longTime)
for tau in range(longTime):
M[tau] = sample.calcMagnetisation()
sample.takeTimeStep()
M_ave = np.average(M) #time - average
M = (M - M_ave)
autocorrelation = np.correlate(M, M, mode='full')
autocorrelation /= autocorrelation.max() # normalise such that max autocorrelation is 1
autocorrelationArray = autocorrelation[int(len(autocorrelation)/2):]
x = np.arange(0, len(autocorrelationArray), 1)
# apply exponential fit
    def exponential(x, a, b):
        return a * np.exp(-b * x)
    popt, pcov = curve_fit(exponential, x, np.absolute(autocorrelationArray))
    yy = exponential(x, *popt)
plt.plot(x, np.absolute(autocorrelationArray), 'o', x, yy)
    plt.title('Exponential Fit of Magnetisation Autocorrelation against Time for Temperature = ' + str(temp) + ' J/k')
plt.xlabel('Time / Number of Iterations ')
plt.ylabel('Magnetisation Autocorrelation')
plt.show()
# prints tau_e value b from exponential a * np.exp(-b * x)
print('tau_e is ' + str(1/popt[1])) # units converted to time steps by taking reciprocal
if __name__ == '__main__':
#plot autocorrelation against time
longTime = 100
temp = [1, 2, 2.3, 2.6, 3, 4]
for T in temp:
magnet = Ising(30, T) # (N, temp)
        autoCorrelation(magnet, longTime, T)  # pass the current temperature, not the whole list
Note: Ising is a class in another .py file containing the functions takeTimeStep and calcMagnetisation.
I expected greater values of tau_e.
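One thing worth checking (an assumption on my part, since the data itself is not shown): curve_fit starts from p0 = (1, 1) by default, and a poor starting point can drag the fit towards small values. A sketch of supplying an explicit guess inside autoCorrelation, with a taken from the normalisation and b from a crude log estimate at the first lag:
    ac = np.absolute(autocorrelationArray)
    b0 = -np.log(max(ac[1], 1e-12))  # crude decay-rate estimate from lag 1
    popt, pcov = curve_fit(exponential, x, ac, p0=(1.0, b0))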
I want to fit a complex data set with two functions that share the same parameters. For this I used
def funcReal(x,a,b,c,d):
return np.real((a + 1j*b)*(np.exp(1j*k*x - kappa1*x) - np.exp(kappa2*x)) + (c + 1j*d)*(np.exp(-1j*k*x - kappa1*x) - np.exp(-kappa2*x)))
def funcImag(x,a,b,c,d):
    return np.imag((a + 1j*b)*(np.exp(1j*k*x - kappa1*x) - np.exp(kappa2*x)) + (c + 1j*d)*(np.exp(-1j*k*x - kappa1*x) - np.exp(-kappa2*x)))
poptReal, pcovReal = curve_fit(funcReal, x, yReal)
poptImag, pcovImag = curve_fit(funcImag, x, yImag)
Here funcReal is the real part of my model, funcImag the imaginary part, yReal the real part of the data, and yImag the imaginary part of the data.
However, the two fits do not give me the same parameters for the real and imaginary parts.
My question: is there a package or a method that lets me perform simultaneous fits of multiple data sets and multiple functions with shared parameters?
To fit both components of the complex function given above, we can treat the real and imaginary parts as separate data points. Since curve_fit doesn't care about the order in which data points appear in the vectors x (independent data) and y (dependent data), we can simply split the complex data and stack the real and imaginary components using hstack. See the example below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
k = np.pi  # wavenumber; k was not defined in the question, so a value is assumed here
kappa1 = np.pi
kappa2 = -0.01
def long_function(x, a, b, c, d):
return (a + 1j*b)*(np.exp(1j*k*x - kappa1*x) - np.exp(kappa2*x)) + (c + 1j*d)*(np.exp(-1j*k*x - kappa1*x) - np.exp(-kappa2*x))
def funcBoth(x, a, b, c, d):
N = len(x)
x_real = x[:N//2]
x_imag = x[N//2:]
y_real = np.real(long_function(x_real, a, b, c, d))
y_imag = np.imag(long_function(x_imag, a, b, c, d))
return np.hstack([y_real, y_imag])
# Create an independent variable with 100 measurements
N = 100
x = np.linspace(0, 10, N)
# True values of the dependent variable
y = long_function(x, a=1.1, b=0.3, c=-0.2, d=0.23)
# Add uniform complex noise (real + imaginary)
noise = (np.random.rand(N) + 1j * np.random.rand(N) - 0.5 - 0.5j) * 0.1
yNoisy = y + noise
# Split the measurements into a real and imaginary part
yReal = np.real(yNoisy)
yImag = np.imag(yNoisy)
yBoth = np.hstack([yReal, yImag])
# Find the best-fit solution
poptBoth, pcovBoth = curve_fit(funcBoth, np.hstack([x, x]), yBoth)
# Compute the best-fit solution
yFit = long_function(x, *poptBoth)
print(poptBoth)
# Plot the results
plt.figure(figsize=(9, 4))
plt.subplot(121)
plt.plot(x, np.real(yNoisy), "k.", label="Noisy y")
plt.plot(x, np.real(y), "r--", label="True y")
plt.plot(x, np.real(yFit), label="Best fit")
plt.ylabel("Real part of y")
plt.xlabel("x")
plt.legend()
plt.subplot(122)
plt.plot(x, np.imag(yNoisy), "k.")
plt.plot(x, np.imag(y), "r--")
plt.plot(x, np.imag(yFit))
plt.ylabel("Imaginary part of y")
plt.xlabel("x")
plt.tight_layout()
plt.show()
Result:
The best-fit parameters that were found in this example were a = 1.14, b = 0.375, c = -0.236, and d = 0.163, which are close enough to the true parameter values given the amplitude of the noise that I inserted here.
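As a quick follow-up, the covariance matrix returned alongside the parameters can be used to quantify that agreement: the square roots of its diagonal entries are the one-sigma uncertainties of the fitted parameters.
perr = np.sqrt(np.diag(pcovBoth))
for name, value, err in zip("abcd", poptBoth, perr):
    print(f"{name} = {value:+.3f} +/- {err:.3f}")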
I've been trying to fit the amplitude, frequency and phase of a sine curve given some generated two dimensional toy data. (Code at the end)
To get estimates for the three parameters, I first perform an FFT. I use the values from the FFT as initial guesses for the actual frequency and phase and then fit for them (row by row). I wrote my code such that I input which bin of the FFT I want the frequency to be in, so I can check if the fitting is working well. But there's some pretty strange behaviour. If my input bin is say 3.1 (a non integral bin, so the FFT won't give me the right frequency) then the fit works wonderfully. But if the input bin is 3 (so the FFT outputs the exact frequency) then my fit fails, and I'm trying to understand why.
Here's the output when I give the input bins (in the X and Y direction) as 3.0 and 2.1 respectively:
(The plot on the right is data - fit)
Here's the output when I give the input bins as 3.0 and 2.0:
Question: Why does the non linear fit fail when I input the exact frequency of the curve?
Code:
#! /usr/bin/python
# For the purposes of this code, it's easier to think of the X-Y axes as transposed,
# so the X axis is vertical and the Y axis is horizontal
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optimize
import sys
PI = np.pi
# Function which accepts parameters to define a sine curve
# Used for the non-linear fit
def sineFit(t, a, f, p):
return a * np.sin(2.0 * PI * f*t + p)
xSize = 18
ySize = 60
npt = xSize * ySize
# Get frequency bin from user input
xFreq = float(sys.argv[1])
yFreq = float(sys.argv[2])
xPeriod = xSize/xFreq
yPeriod = ySize/yFreq
# NOTE: the arrays sineGen and xDat should be defined here (their definitions are missing from this snippet)
# Generate the 2D sine curve
for jj in range (0, xSize):
for ii in range(0, ySize):
sineGen[jj, ii] = np.cos(2.0*PI*(ii/xPeriod + jj/yPeriod))
# Compute 2dim FFT as well as freq bins along each axis
fftData = np.fft.fft2(sineGen)
fftMean = np.mean(fftData)
fftRMS = np.std(fftData)
xFreqArr = np.fft.fftfreq(fftData.shape[1]) # Frequency bins along x
yFreqArr = np.fft.fftfreq(fftData.shape[0]) # Frequency bins along y
# Find peak of FFT, and position of peak
maxVal = np.amax(np.abs(fftData))
maxPos = np.where(np.abs(fftData) == maxVal)
# Iterate through peaks in the FFT
# For this example, number of loops will always be only one
prevPhase = -1000
for col, row in zip(maxPos[0], maxPos[1]):
# Initial guesses for fit parameters from FFT
init_phase = np.angle(fftData[col,row])
init_amp = 2.0 * maxVal/npt
init_freqY = yFreqArr[col]
init_freqX = xFreqArr[row]
cntr = 0
if prevPhase == -1000:
prevPhase = init_phase
guess = [init_amp, init_freqX, prevPhase]
# Fit each row of the 2D sine curve independently
for rr in sineGen:
(amp, freq, phs), pcov = optimize.curve_fit(sineFit, xDat, rr, guess)
        # xDat is a linspace array containing the numbers 0 to xSize-1
# Subtract fit from original data and plot
fitData = sineFit(xDat, amp, freq, phs)
sub1 = rr - fitData
# Plot
fig1 = plt.figure()
ax1 = fig1.add_subplot(121)
p1, = ax1.plot(rr, 'g')
p2, = ax1.plot(fitData, 'b')
plt.legend([p1,p2], ["data", "fit"])
ax2 = fig1.add_subplot(122)
p3, = ax2.plot(sub1)
plt.legend([p3], ['residual1'])
fig1.tight_layout()
plt.show()
cntr += 1
prevPhase = phs # Update guess for phase of sine curve
I've tried to distill the important parts of your question into this answer.
First of all, try fitting a single block of data, not an array. Once you are confident that your model is sufficient you can move on.
Your fit is only going to be as good as your model, if you move on to something not "sine"-like you'll need to adjust accordingly.
Fitting is an "art", in that the initial conditions can greatly change the convergence of the error function. In addition, there may be more than one minimum in your fit, so you often have to worry about the uniqueness of your proposed solution.
While you were on the right track with your FFT idea, I think your implementation wasn't quite correct. The code below should be a great toy system. It generates random data of the type f(x) = a0*sin(a1*x+a2). Sometimes a random initial guess will work, sometimes it will fail spectacularly. However, using the FFT guess for the frequency, convergence should always work for this system. An example output:
import numpy as np
import pylab as plt
import scipy.optimize as optimize
# This is your target function
def sineFit(t, params):
    a, f, p = params  # amplitude, frequency, phase packed into one vector
    return a * np.sin(2.0*np.pi*f*t + p)
# This is our "error" function
def err_func(p0, X, Y, target_function):
err = ((Y - target_function(X, p0))**2).sum()
return err
# Try out different parameters, sometimes the random guess works
# sometimes it fails. The FFT solution should always work for this problem
initial_args = np.random.random(3)
X = np.linspace(0, 10, 1000)
Y = sineFit(X, initial_args)
# Use a random initial guess
initial_guess = np.random.random(3)
# Fit
sol = optimize.fmin(err_func, initial_guess, args=(X, Y, sineFit))
# Plot the fit
Y2 = sineFit(X, sol)
plt.figure(figsize=(15,10))
plt.subplot(211)
plt.title("Random Inital Guess: Final Parameters: %s"%sol)
plt.plot(X,Y)
plt.plot(X,Y2,'r',alpha=.5,lw=10)
# Use an improved "fft" guess for the frequency
# this will be the max in k-space
timestep = X[1]-X[0]
guess_k = np.argmax(np.abs(np.fft.rfft(Y)))  # peak bin of the magnitude spectrum
guess_f = np.fft.fftfreq(X.size, timestep)[guess_k]
initial_guess[1] = guess_f
# Guess the amplitude by taking the max of the absolute values
initial_guess[0] = np.abs(Y).max()
sol = optimize.fmin(err_func, initial_guess, args=(X, Y, sineFit))
Y2 = sineFit(X, sol)
plt.subplot(212)
plt.title("FFT Guess : Final Parameters: %s"%sol)
plt.plot(X,Y)
plt.plot(X,Y2,'r',alpha=.5,lw=10)
plt.show()
The problem is due to a bad initial guess for the phase, not the frequency. While cycling through the rows of sineGen (the inner loop) you use the fit result of the previous row as the initial guess for the next row, which does not always work. If you determine the phase from an FFT of the current row and use that as the initial guess, the fit will succeed.
You could change the inner loop as follows:
for n, rr in enumerate(sineGen):
    fftx = np.fft.fft(rr)
    fftx = fftx[:len(fftx)//2]
    idx = np.argmax(np.abs(fftx))
    init_phase = np.angle(fftx[idx])
    print(fftx[idx], init_phase)
    ...
...
Also you need to change
def sineFit(t, a, f, p):
return a * np.sin(2.0 * np.pi * f*t + p)
to
def sineFit(t, a, f, p):
return a * np.cos(2.0 * np.pi * f*t + p)
since phase=0 means that the imaginary part of the fft is zero and thus the function is cosine like.
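A quick self-contained check of that statement: the FFT peak of a pure cosine is (to numerical precision) real, while the peak of a pure sine is purely imaginary.
import numpy as np
n = np.arange(64)
print(np.fft.fft(np.cos(2*np.pi*4*n/64))[4])  # approx (32+0j): real peak
print(np.fft.fft(np.sin(2*np.pi*4*n/64))[4])  # approx (0-32j): imaginary peak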
Btw. your sample above is still lacking definitions of sineGen and xDat.
Without understanding much of your code, according to http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat, sub1, guess2)
should become:
(amp2, freq2, phs2), pcov = optimize.curve_fit(sineFit, tDat, sub1, p0=guess2)
Assuming that tDat and sub1 are x and y, that should do the trick. But, once again, it is quite difficult to understand such complex code, with so many interlinked variables and no comments at all. Code should always be built from the bottom up, meaning that you don't loop over fits when a single one is not working, and you don't add noise until the code fits the noise-free examples... Good luck!
By "nothing fancy" I meant something like removing EVERYTHING that is not related with the fit, and doing a simplified mock example such as:
import numpy as np
import scipy.optimize as optimize
def sineFit(t, a, f, p):
return a * np.sin(2.0 * np.pi * f*t + p)
# Create arrays of x and y with given parameters
x = np.arange(100)
y = sineFit(x, 1, 0.05, 0)
# Give a guess and fit, printing the fitted values
guess = [1., 0.05, 0.]
print(optimize.curve_fit(sineFit, x, y, guess)[0])
The result of this is exactly the answer:
[1. 0.05 0.]
But if you change the guess not too much, just enough:
# Give a slightly different guess and fit, printing the fitted values
guess = [1., 0.06, 0.]
print(optimize.curve_fit(sineFit, x, y, guess)[0])
the result gives absurdly wrong numbers:
[ 0.00823701 0.06391323 -1.20382787]
Can you explain this behavior?
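Regarding that last question, a sketch of one way to see what is going on: plot the sum of squared residuals as a function of the frequency alone. The error landscape is full of local minima roughly one FFT bin apart, so an initial guess that is off by even a small amount can trap the gradient-based fit far from the true frequency.
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100)
y = np.sin(2.0 * np.pi * 0.05 * x)  # target from the example above (a=1, p=0)
freqs = np.linspace(0.01, 0.12, 500)
sse = [np.sum((y - np.sin(2.0 * np.pi * f * x))**2) for f in freqs]
plt.plot(freqs, sse)
plt.xlabel('frequency')
plt.ylabel('sum of squared residuals')
plt.show()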
You can use curve_fit with a series of trigonometric functions, which is usually very robust and adjustable to the precision that you need just by increasing the number of terms... here is an example:
import matplotlib.pyplot as plt
from numpy import sin, cos, linspace, pi
def f(x, a0,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12):
return a0 + s1*sin(1*x) + c1*cos(1*x) \
+ s2*sin(2*x) + c2*cos(2*x) \
+ s3*sin(3*x) + c3*cos(3*x) \
+ s4*sin(4*x) + c4*cos(4*x) \
+ s5*sin(5*x) + c5*cos(5*x) \
+ s6*sin(6*x) + c6*cos(6*x) \
+ s7*sin(7*x) + c7*cos(7*x) \
+ s8*sin(8*x) + c8*cos(8*x) \
           + s9*sin(9*x) + c9*cos(9*x) \
           + s10*sin(10*x) + c10*cos(10*x) \
           + s11*sin(11*x) + c11*cos(11*x) \
           + s12*sin(12*x) + c12*cos(12*x)
from scipy.optimize import curve_fit
# rescale x so the data span covers a fraction of the slowest term's period
norm_factor = pi/2. / (x.max() - x.min())
x_norm = x * norm_factor
popt, pcov = curve_fit(f, x_norm, y)
x_fit = linspace(x_norm.min(), x_norm.max(), 1000)
y_fit = f(x_fit, *popt)
plt.plot(x_fit/norm_factor, y_fit)
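The snippet above assumes x and y already hold your data; a self-contained usage sketch with made-up data (just to show the call sequence) might look like:
import numpy as np
x = np.linspace(0, 20, 300)
y = 2.0*np.sin(0.3*x) + 0.5*np.cos(1.1*x) + 0.1*np.random.randn(300)
norm_factor = np.pi/2. / (x.max() - x.min())
popt, pcov = curve_fit(f, x * norm_factor, y)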
My knowledge of maths is limited, which is probably why I am stuck. I have a spectrum to which I am trying to fit two Gaussian peaks. I can fit the largest peak, but I cannot fit the smallest. I understand that I need to sum the Gaussian functions for the two peaks, but I do not know where I have gone wrong. An image of my current output is shown:
The blue line is my data and the green line is my current fit. There is a shoulder to the left of the main peak in my data which I am currently trying to fit, using the following code:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import leastsq
from pylab import *
time = []
counts = []
for i in open('/some/folder/to/file.txt', 'r'):
segs = i.split()
time.append(float(segs[0]))
    counts.append(float(segs[1]))
time_array = arange(len(time), dtype=float)
counts_array = arange(len(counts), dtype=float)
time_array[0:] = time
counts_array[0:] = counts
def model(time_array0, coeffs0):
a = coeffs0[0] + coeffs0[1] * np.exp( - ((time_array0-coeffs0[2])/coeffs0[3])**2 )
b = coeffs0[4] + coeffs0[5] * np.exp( - ((time_array0-coeffs0[6])/coeffs0[7])**2 )
c = a+b
return c
def residuals(coeffs, counts_array, time_array):
return counts_array - model(time_array, coeffs)
# 0 = baseline, 1 = amplitude, 2 = centre, 3 = width
peak1 = np.array([0,6337,16.2,4.47,0,2300,13.5,2], dtype=float)
#peak2 = np.array([0,2300,13.5,2], dtype=float)
x, flag = leastsq(residuals, peak1, args=(counts_array, time_array))
#z, flag = leastsq(residuals, peak2, args=(counts_array, time_array))
plt.plot(time_array, counts_array)
plt.plot(time_array, model(time_array, x), color = 'g')
#plt.plot(time_array, model(time_array, z), color = 'r')
plt.show()
This code worked for me, provided that you are only fitting a function that is a combination of two Gaussian distributions.
I just made a residuals function that adds two Gaussian functions and then subtracts them from the real data.
The parameters (p) that I passed to SciPy's least-squares function include: the mean of the first Gaussian function (m), the difference between the means of the first and second Gaussian functions (dm, i.e. the horizontal shift), the standard deviation of the first (sd1), and the standard deviation of the second (sd2).
import numpy as np
from scipy.optimize import leastsq
import matplotlib.pyplot as plt
######################################
# Setting up test data
def norm(x, mean, sd):
    # vectorised Gaussian probability density
    return 1.0/(sd*np.sqrt(2*np.pi)) * np.exp(-(x - mean)**2/(2*sd**2))
mean1, mean2 = 0, -2
std1, std2 = 0.5, 1
x = np.linspace(-20, 20, 500)
y_real = norm(x, mean1, std1) + norm(x, mean2, std2)
######################################
# Solving
m, dm, sd1, sd2 = [5, 10, 1, 1]
p = [m, dm, sd1, sd2] # Initial guesses for leastsq
y_init = norm(x, m, sd1) + norm(x, m + dm, sd2) # For final comparison plot
def res(p, y, x):
m, dm, sd1, sd2 = p
m1 = m
m2 = m1 + dm
y_fit = norm(x, m1, sd1) + norm(x, m2, sd2)
err = y - y_fit
return err
plsq = leastsq(res, p, args = (y_real, x))
y_est = norm(x, plsq[0][0], plsq[0][2]) + norm(x, plsq[0][0] + plsq[0][1], plsq[0][3])
plt.plot(x, y_real, label='Real Data')
plt.plot(x, y_init, 'r.', label='Starting Guess')
plt.plot(x, y_est, 'g.', label='Fitted')
plt.legend()
plt.show()
You can use Gaussian mixture models from scikit-learn:
from sklearn import mixture
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
# scikit-learn expects a 2-D array of shape (n_samples, n_features)
clf = mixture.GaussianMixture(n_components=2, covariance_type='full')
clf.fit(yourdata.reshape(-1, 1))
m1, m2 = clf.means_.ravel()
w1, w2 = clf.weights_
c1, c2 = clf.covariances_.ravel()
histdist = plt.hist(yourdata, 100, density=True)
plotgauss1 = lambda x: plt.plot(x, w1*norm.pdf(x, m1, np.sqrt(c1)), linewidth=3)
plotgauss2 = lambda x: plt.plot(x, w2*norm.pdf(x, m2, np.sqrt(c2)), linewidth=3)
plotgauss1(histdist[1])
plotgauss2(histdist[1])
You can also use the function below to fit however many Gaussians you want via the ncomp parameter:
from sklearn import mixture
from scipy.stats import norm
import matplotlib.pyplot as plt
import numpy as np
def fit_mixture(data, ncomp=2, doplot=False):
    clf = mixture.GaussianMixture(n_components=ncomp, covariance_type='full')
    clf.fit(data.reshape(-1, 1))
    ml = clf.means_
    wl = clf.weights_
    cl = clf.covariances_
    ms = [m[0] for m in ml]
    cs = [np.sqrt(c[0][0]) for c in cl]
    ws = [w for w in wl]
    if doplot:
        histo = plt.hist(data, 200, density=True)
        for w, m, c in zip(ws, ms, cs):
            # cs already holds standard deviations, so pass c directly
            plt.plot(histo[1], w*norm.pdf(histo[1], m, c), linewidth=3)
    return ms, cs, ws
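A usage sketch with synthetic data (the sample must be one-dimensional here):
import numpy as np
data = np.concatenate([np.random.normal(0, 0.5, 500),
                       np.random.normal(-2, 1.0, 500)])
ms, cs, ws = fit_mixture(data, ncomp=2, doplot=True)
print('means:', ms, 'stds:', cs, 'weights:', ws)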
Coeffs 0 and 4 are degenerate - there is absolutely nothing in the data that can decide between them. You should use a single zero-level parameter instead of two (i.e. remove one of them from your code). This is probably what is stopping your fit (ignore the comments here saying this is not possible - there are clearly at least two peaks in that data, and you should certainly be able to fit them).
(It may not be clear why I am suggesting this, but what is happening is that coeffs 0 and 4 can cancel each other out. They can both be zero, or one could be 100 and the other -100 - either way the fit is just as good. This "confuses" the fitting routine, which spends its time trying to work out what they should be, when there is no single right answer: whatever value one takes, the other can just be its negative, and the fit will be the same.)
In fact, from the plot, it looks like there may be no need for a zero level at all. I would try dropping both and seeing how the fit looks.
Also, there is no need to fit coeffs 1 and 5 (or the zero point) in the least squares. Instead, because the model is linear in those, you could calculate their values on each iteration, as sketched below. This will make things faster, but is not critical. I just noticed you say your maths is not so good, so feel free to skip this one.
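To make that last point concrete, here is a sketch of the idea (not the poster's code): for any fixed centres and widths, the best-fitting amplitudes solve an ordinary linear least-squares problem, so only the nonlinear parameters need to be iterated.
import numpy as np

def best_amplitudes(t, y, c1, w1, c2, w2):
    # Design matrix whose columns are the two unit-amplitude Gaussians
    A = np.column_stack([np.exp(-((t - c1)/w1)**2),
                         np.exp(-((t - c2)/w2)**2)])
    amps, *_ = np.linalg.lstsq(A, y, rcond=None)
    return amps  # optimal amplitudes for these centres and widths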