I'm implementing a maximum likelihood estimator for discrete count data in order to do curve fitting, using the result of curve_fit as the starting point for minimize. I defined and tried these methods for multiple distributions, but for simplicity I will include just one, the log-series distribution.
At this point I have also tried the following methods from statsmodels:
statsmodels.discrete.discrete_model.fit
statsmodels.discrete.count_model.fit
statsmodels.base.model.GenericLikelihoodModel
Most curve fits tend to run into overflow errors or NaNs and zeros inside; I will detail these errors in another post.
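(For reference, a GenericLikelihoodModel subclass is typically set up along the lines below. The snippet is only an illustrative sketch of that pattern, not my actual code: the use of scipy.stats.logser and the expansion of the (x, y) value/frequency pairs into raw observations via np.repeat are assumptions.)
import numpy as np
from scipy.stats import logser
from statsmodels.base.model import GenericLikelihoodModel

class LogSeriesMLE(GenericLikelihoodModel):
    def nloglikeobs(self, params):
        p = params[0]
        # negative log-likelihood of each observed count under the log-series pmf
        return -logser.logpmf(self.endog, p)

# counts = np.repeat(x, y)                      # expand (value, frequency) pairs into raw observations
# res = LogSeriesMLE(counts).fit(start_params=[0.7])
# print(res.params)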
# Import a few packages
import numpy as np
from scipy.optimize import curve_fit
from scipy.optimize import minimize
from scipy import stats
from numpy import log
import matplotlib.pyplot as plt

# Given data
x = np.arange(1, 28, 1)
y = np.array([18899, 10427, 6280, 4281, 2736, 1835, 1158, 746, 467, 328, 201, 129, 65, 69, 39, 21, 15, 10, 3, 3, 1, 1, 1, 1, 1, 1, 1])
# Define a custom distribution (log-series shape)
def Logser(x, p):
    return (-p**x) / (x * log(1 - p))

# Doing a least-squares curve fit
def lsqfit(x, y):
    cf_result = curve_fit(Logser, x, y, p0=0.7, bounds=(0.5, 1), method='trf')
    return cf_result

param_guess = lsqfit(x, y)[0][0]
print(param_guess)
# Doing a custom MLE definition, minimized using the scipy minimize function
def MLERegression(param_guess):
    yhat = Logser(x, param_guess)  # predictions based on a parameter value
    sd = 1  # initial guess, fitting a normal error distribution around the regressed curve
    # next, we flip the Bayesian question
    # compute the PDF of the observed values, normally distributed around the mean (yhat)
    # with a standard deviation of sd
    negLL = -np.sum(stats.norm.logpdf(y, loc=yhat, scale=sd))  # negative log-likelihood
    return negLL

results = minimize(MLERegression, param_guess, method='L-BFGS-B',
                   bounds=[(0.5, 1.0)],  # L-BFGS-B expects a sequence of (min, max) pairs
                   options={'disp': True})
final_param = results['x']
print(final_param)
I've constrained the optimizer so that it gives me results similar to what I expect (a parameter value around 0.8 or 0.9); the algorithm outputs zero otherwise.
I think this is due to scaling. When I change the equation to scale * (-p**x)/(x * log(1-p)) by adding a scaling factor, I get the following values without using any bounds: p = 9.0360470735534726E-01 and scale = 5.1189277041342692E+04, which yield a good fit (plot not shown), and my fitted value for p is indeed 0.9.
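For reference, a minimal sketch of the scaled variant described above (the function name Logser_scaled and the starting values are illustrative; it reuses x, y, log and curve_fit from the code above):
# Sketch: fit the scaled log-series curve with two free parameters (p, scale)
def Logser_scaled(x, p, scale):
    return scale * (-p**x) / (x * log(1 - p))

popt, pcov = curve_fit(Logser_scaled, x, y, p0=(0.7, 1e4))
print(popt)  # roughly p ~ 0.90 and scale ~ 5.1e4, as reported above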
In GPflow one can add a fitted mean function to a GP regression. When doing this as in the basic example, the result is that there are no uncertainties due to the uncertainty in the fit of the mean. E.g. in the example below the error bars don't grow outside the range of available data, because the slope of the linear mean remains fixed at its optimized value. Is there a way to account for these uncertainties, such that the error bands grow when extrapolating?
(The question was originally stated in an issue report but moved here to be more accessible.)
import numpy as np
import matplotlib.pyplot as plt
import gpflow
from gpflow.utilities import print_summary

def f(x):
    return np.sin(3*x) + x

xtrain = np.linspace(0, 3, 50).reshape([-1, 1])
ytrain = f(xtrain) + 0.5*(np.random.randn(len(xtrain)).reshape([-1, 1]) - 0.5)

k = gpflow.kernels.SquaredExponential()
meanf = gpflow.mean_functions.Linear()
m = gpflow.models.GPR(data=(xtrain, ytrain), kernel=k, mean_function=meanf)

opt = gpflow.optimizers.Scipy()

def objective_closure():
    return -m.log_marginal_likelihood()

opt_logs = opt.minimize(objective_closure,
                        m.trainable_variables,
                        options=dict(maxiter=100))
print_summary(m)

xpl = np.linspace(-5, 10, 100).reshape(100, 1)
mean, var = m.predict_f(xpl)

plt.figure(figsize=(12, 6))
plt.plot(xtrain, ytrain, 'x')
plt.plot(xpl, mean, 'C0', lw=2)
plt.fill_between(xpl[:, 0],
                 mean[:, 0] - 1.96 * np.sqrt(var[:, 0]),
                 mean[:, 0] + 1.96 * np.sqrt(var[:, 0]),
                 color='C0', alpha=0.2)
Most of GPflow's models only optimise for the MAP estimate of the hyperparameters of the kernel, mean function and likelihood. The models do not account for uncertainty on these hyperparameters during training or prediction. While this could be limiting for certain problems, we often find that this is a sensible compromise between computational complexity and uncertainty quantification.
That being said, in your specific case (i.e. a linear mean function) we can account for uncertainty in the linear trend of the data by specifying a linear kernel function, rather than a linear mean function.
Using your snippet with this model specification:
k = gpflow.kernels.SquaredExponential() + gpflow.kernels.Linear()
meanf = gpflow.mean_functions.Zero()
m = gpflow.models.GPR(data=(xtrain, ytrain), kernel=k, mean_function=meanf)
This gives a fit with error bars that grow outside the data range (resulting plot not shown here).
I am trying to get a simple fit to my data of a decay of the form a*(x-x0)**b, where I know a and b must be negative. Since this is a power law, if I plot it on a log-log plot I should see a linear trend for the obtained data.
As such, I'm giving scipy.optimize initial guesses where a and b are negative, but it keeps ignoring them and giving me the error
OptimizeWarning: Covariance of the parameters could not be estimated
.. and giving me values for a and b that are positive. It then also does not produce a decay, but a parabola that bottoms out and begins to increase.
I have tried many different initial-parameter guesses over a large range of values (one such guess is in the code below), but none worked without producing the nonsensical result and the error. This has made me wonder whether my code is wrong, or whether there is some obvious way to feed good initial guesses into the code so that they won't be rejected.
import math
import numpy as np
import sys
import matplotlib.pyplot as plt
import scipy as sp
import scipy.optimize
from scipy.optimize import curve_fit
import numpy.polynomial.polynomial as poly

x = [1987, 1993.85, 2003, 2010.45, 2009.3, 2019.4]
t = [31, 8.6, 4.84, 1.96, 3.9, 1.875]

def model_func(x, a, b, x0):
    return a*(x - x0)**b

# curve fit
p0 = (-.0005, -.0005, 100)
opt, pcov = curve_fit(model_func, x, t, p0)
a, b, x0 = opt

# test result
x2 = np.linspace(1980, 2020, 100)
y2 = model_func(x2, a, b, x0)
coefs, cov = poly.polyfit(x, t, 2, full=True)
ffit = poly.polyval(x2, coefs)
plt.loglog(x, t, '.')
plt.loglog(x2, ffit, '--', color="#1f77b4")
print('S = (', coefs[0], ') * (t - ', coefs[2], ')^', coefs[1])
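As a quick sanity check of the log-log linearity I expect, one can fit a straight line to log(t) against log(x - x0) for a fixed trial x0 (the value x0 = 1980 below is only an illustrative guess, not part of my original attempt):
x0_guess = 1980.0                                        # illustrative trial value
xa = np.asarray(x) - x0_guess
b_est, loga_est = np.polyfit(np.log(xa), np.log(t), 1)   # slope = b, intercept = log(a)
print('b ~', b_est, ', a ~', np.exp(loga_est))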
Below I solve a second-order ODE that describes a spring-mass-dashpot system: u'' + c*u' + k*u = 0. I have no problems with the odeint solver; it correctly solves for the position u(t) over the specified time.
# Modeling a spring-mass-dashpot system
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy import integrate

# Make the following substitution to reduce the system to first order:
#   Y[0] = y(t) and Y[1] = y'(t),
# so the system becomes Y[0]' = Y[1] and Y[1]' = -c*Y[1] - k*Y[0]
# =======================================================
def eq(par, initial_cond, start_t, end_t, incr):
    # -time-grid-----------------------------------
    t = np.linspace(start_t, end_t, incr)

    # differential-eq-system----------------------
    def funct(y, t):
        ut = y[0]
        ut_dt = y[1]
        c, k = par
        # the model equations u' = Y[1], u'' = -k*Y[0] - c*Y[1] from u'' + c*u' + k*u = 0
        f0 = ut_dt
        f1 = -k*ut - c*ut_dt
        return [f0, f1]

    # integrate------------------------------------
    ds = integrate.odeint(funct, initial_cond, t)
    return (ds[:, 0], ds[:, 1], t)
# =======================================================

# parameters
c = 2.   # damping coefficient
k = 10.  # spring (stiffness) coefficient
# collect parameters in a tuple
coefs = (c, k)

# initial conditions
u0 = 6.
ud0 = 0.
y0 = [u0, ud0]

start, stop, incr = 0, 20, 100

# Solve and plot the solution
F0, F1, T = eq(coefs, y0, start, stop, incr)
plt.figure()
plt.plot(T, F0, '-b', T, F1, '-r')
plt.legend(('u0', 'u1'), loc='upper center')
plt.title('Mass-Spring System')
However, I would like to use scipy.optimize.fmin() to find the optimal fitting parameters (c, k) for this system when given simulated measurements. So I take the solution from above, where c=2 and k=10, and add random noise.
rand_i = np.random.randn(incr)   # standard-normal noise
# noise level
nl = .05
noisy_data = F0 + nl*rand_i
plt.plot(noisy_data, label="noisy_data: c=2, k=10")
plt.legend()
Next, I set up a scoring function for fmin() to minimize, using the guesses c=1, k=1 for the parameters.
from scipy.optimize import fmin

# 1. Get 'Real' Data
# ====================================================
nd = noisy_data  # solution with parameters c=2, k=10
# ====================================================

# 2. Set up Info for Model System
# ===================================================
# guess parameters
c = 1  # damping coefficient
k = 1  # spring (stiffness) coefficient
# collect parameters in a tuple
coefs = (c, k)

# initial conditions
u0 = 6.
ud0 = 0.
y0 = [u0, ud0]

# model steps
# ---------------------------------------------------
start_time = 0
end_time = 20
intervals = 100
mt = np.linspace(start_time, end_time, intervals)

# 3. Score Fit of System
# =========================================================
def score(parms):
    # a. Get the solution to the system
    F0, F1, T = eq(coefs, y0, start_time, end_time, intervals)
    # b. Pick off the model points to compare
    um = F0
    # c. Score the difference between the model (ode output) and the data points (noisy data)
    ss = lambda data, model: ((data - model)**2).sum()
    return ss(nd, um)
# ========================================================

# 4. Optimize Fit
# =======================================================
fit_score = score(coefs)
answ = fmin(score, (coefs))
The problem is that fmin doesn't find the correct parameters; it decides that the guess parameters are best, even though the score is high. Below I print the fmin solution answ and show that it is identical to the initial guess even after fmin() has been called.
print(answ==[c,k])
Does anyone know why fmin() doesn't find the correct parameters, c=2, k=10?
There is a trivial bug in your code: you define score with input parameter parms, but then refer to said variable as coefs. Fix:
def score(coefs):  # changed
    # a. Get the solution to the system
    F0, F1, T = eq(coefs, y0, start_time, end_time, intervals)
    # b. Pick off the model points to compare
    um = F0
    # c. Score the difference between the model (ode output) and the data points (noisy data)
    ss = lambda data, model: ((data - model)**2).sum()
    return ss(nd, um)
Before:
In [369]: answ
Out[369]: array([ 1., 1.])
After:
In [373]: answ
Out[373]: array([ 2.0425695 , 9.96937966])
However, note that answ==(c,k) will always be False, even for a perfect fit: you're working with floating-point numbers. Any meaningful comparison should look like max(abs(answ-[2,10])/abs(answ))<tol or something similar. (I know your original question used this to show that the values didn't change, but still.)
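For example, a relative-tolerance check could look like this (the tolerance value is arbitrary):
import numpy as np

tol = 1e-2                                                       # arbitrary relative tolerance
print(np.all(np.abs(answ - np.array([2., 10.])) / np.abs(answ) < tol))
# or, in the same spirit:
print(np.allclose(answ, [2., 10.], rtol=tol))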
I'm using SciPy instead of MATLAB in a control systems class to plot the step responses of LTI systems. It's worked great so far, but I've run into an issue with a very specific system. With this code:
from numpy import linspace, min
from scipy.signal import lti, step
from matplotlib import pyplot as p

# Create an LTI transfer function from coefficients
tf = lti([64], [1, 16, 64])

# Step response (redo it to get better resolution)
t, s = step(tf)
t, s = step(tf, T=linspace(min(t), t[-1], 200))

# Plotting stuff
p.plot(t, s)
p.xlabel('Time / s')
p.ylabel('Displacement / m')
p.show()
The code as-is displays a flat line. If I modify the final coefficient of the denominator to 64.00000001 (i.e., tf = lti([64], [1, 16, 64.0000001])) then it works as it should, showing an underdamped step response. Setting the coefficient to 63.9999999 also works. Changing all the coefficients to have explicit decimal places (i.e., tf = lti([64.0], [1.0, 16.0, 64.0])) doesn't affect anything, so I guess it's not a case of integer division messing things up.
Is this a bug in SciPy, or am I doing something wrong?
This is a limitation of the implementation of the step function. It uses a matrix exponential to find the step response, and it doesn't handle repeated poles well. (Your system has a repeated pole at -8.)
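You can verify the repeated pole directly from the denominator, since s**2 + 16*s + 64 = (s + 8)**2 (a quick check using numpy):
import numpy as np

print(np.roots([1, 16, 64]))   # [-8. -8.]  -> repeated pole at s = -8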
Instead of using step, you can use the function scipy.signal.step2:
In [253]: from scipy.signal import lti, step2
In [254]: sys = lti([64], [1, 16, 64])
In [255]: t, y = step2(sys)
In [256]: plot(t, y)
Out[256]: [<matplotlib.lines.Line2D at 0x5ec6b90>]
Can anyone help me fit a gamma distribution in Python? I have some data (X and Y coordinates), and I want to find the gamma parameters that fit this distribution. In the SciPy docs a fit method does exist, but I don't know how to use it. First, in what format must the argument "data" be, and how can I provide the second argument (the parameters), since that's what I'm looking for?
Generate some gamma data:
import scipy.stats as stats
alpha = 5
loc = 100.5
beta = 22
data = stats.gamma.rvs(alpha, loc=loc, scale=beta, size=10000)
print(data)
# [ 202.36035683 297.23906376 249.53831795 ..., 271.85204096 180.75026301
# 364.60240242]
Here we fit the data to the gamma distribution:
fit_alpha, fit_loc, fit_beta=stats.gamma.fit(data)
print(fit_alpha, fit_loc, fit_beta)
# (5.0833692504230008, 100.08697963283467, 21.739518937816108)
print(alpha, loc, beta)
# (5, 100.5, 22)
I was unsatisfied with the ss.gamma.rvs function, as it can generate negative numbers, something the gamma distribution is not supposed to produce. So I fitted the sample via expected value = mean(data) and variance = var(data) (see Wikipedia for details) and wrote a function that can yield random samples of a gamma distribution without scipy (which, as a side note, I found hard to install properly):
import random
import numpy

data = [6176, 11046, 670, 6146, 7945, 6864, 767, 7623, 7212, 9040, 3213, 6302, 10044, 10195, 9386, 7230, 4602, 6282, 8619, 7903, 6318, 13294, 6990, 5515, 9157]

# Fit a gamma distribution through the mean and variance
mean_of_distribution = numpy.mean(data)
variance_of_distribution = numpy.var(data)

def gamma_random_sample(mean, variance, size):
    """Yields random numbers following a gamma distribution defined by mean and variance."""
    g_alpha = mean*mean/variance   # shape
    g_beta = mean/variance         # rate
    for i in range(size):
        yield random.gammavariate(g_alpha, 1/g_beta)

# force integer values to get an integer sample
grs = [int(i) for i in gamma_random_sample(mean_of_distribution, variance_of_distribution, len(data))]

print("Original data: ", sorted(data))
print("Random sample: ", sorted(grs))

# Original data: [670, 767, 3213, 4602, 5515, 6146, 6176, 6282, 6302, 6318, 6864, 6990, 7212, 7230, 7623, 7903, 7945, 8619, 9040, 9157, 9386, 10044, 10195, 11046, 13294]
# Random sample: [1646, 2237, 3178, 3227, 3649, 4049, 4171, 5071, 5118, 5139, 5456, 6139, 6468, 6726, 6944, 7050, 7135, 7588, 7597, 7971, 10269, 10563, 12283, 12339, 13066]
If you want a longer example, including a discussion of estimating or fixing the support of the distribution, you can find it in https://github.com/scipy/scipy/issues/1359 and the linked mailing-list message.
Preliminary support for fixing parameters, such as the location, during fit has been added to the trunk version of scipy.
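In current scipy releases this is exposed through the floc / fscale keyword arguments of fit; a small sketch with simulated data:
import scipy.stats as stats

sample = stats.gamma.rvs(5, loc=0, scale=22, size=10000)
fit_alpha, fit_loc, fit_beta = stats.gamma.fit(sample, floc=0)   # location held fixed at 0
print(fit_alpha, fit_loc, fit_beta)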
OpenTURNS has a simple way to do this with the GammaFactory class.
First, let's generate a sample:
import openturns as ot
gammaDistribution = ot.Gamma()
sample = gammaDistribution.getSample(100)
Then fit a Gamma to it:
distribution = ot.GammaFactory().build(sample)
Then we can draw the PDF of the Gamma:
import openturns.viewer as otv
otv.View(distribution.drawPDF())
which produces a plot of the fitted PDF (figure not shown here).
More details on this topic at: http://openturns.github.io/openturns/latest/user_manual/_generated/openturns.GammaFactory.html
1): the "data" variable could be in the format of a python list or tuple, or a numpy.ndarray, which could be obtained by using:
data=numpy.array(data)
where the 2nd data in the above line should be a list or a tuple, containing your data.
2: the "parameter" variable is a first guess you could optionally provide to the fitting function as a starting point for the fitting process, so it could be omitted.
3: a note on #mondano's answer. The usage of moments (mean and variances) to work out the gamma parameters are reasonably good for large shape parameters (alpha>10), but could yield poor results for small values of alpha (See Statistical methods in the atmospheric scineces by Wilks, and THOM, H. C. S., 1958: A note on the gamma distribution. Mon. Wea. Rev., 86, 117–122.
Using Maximum Likelihood Estimators, as that implemented in the scipy module, is regarded a better choice in such cases.
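To illustrate the point, here is a small comparison on simulated data with a small shape parameter (the value alpha = 0.5 is only an illustrative choice):
# Sketch: method-of-moments vs. scipy's MLE fit for a small shape parameter
import numpy as np
import scipy.stats as stats

true_alpha, true_scale = 0.5, 2.0                      # illustrative small-alpha case
sample = stats.gamma.rvs(true_alpha, loc=0, scale=true_scale, size=5000)

# method of moments: alpha = mean^2/var, scale = var/mean
mm_alpha = sample.mean()**2 / sample.var()
mm_scale = sample.var() / sample.mean()

# maximum likelihood fit with the location fixed at zero
ml_alpha, _, ml_scale = stats.gamma.fit(sample, floc=0)

print("moments:", mm_alpha, mm_scale)
print("MLE:    ", ml_alpha, ml_scale)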