I am doing some data analysis involving fitting datasets to a Generalised Extreme Value (GEV) distribution, but I'm getting some weird results. Here's what I'm doing:
from scipy.stats import genextreme as gev
import numpy
data = [1.47, 0.02, 0.3, 0.01, 0.01, 0.02, 0.02, 0.12, 0.38, 0.02, 0.15, 0.01, 0.3, 0.24, 0.01, 0.05, 0.01, 0.0, 0.06, 0.01, 0.01, 0.0, 0.05, 0.0, 0.09, 0.03, 0.22, 0.0, 0.1, 0.0]
x = numpy.linspace(0, 2, 50)
pdf = gev.pdf(x, *gev.fit(data))
print(pdf)
And the output:
array([ 5.64759709e+05, 2.41090345e+00, 1.16591714e+00,
7.60085002e-01, 5.60415578e-01, 4.42145248e-01,
3.64144425e-01, 3.08947114e-01, 2.67889183e-01,
2.36190826e-01, 2.11002185e-01, 1.90520108e-01,
1.73548832e-01, 1.59264573e-01, 1.47081601e-01,
1.36572220e-01, 1.27416958e-01, 1.19372442e-01,
1.12250072e-01, 1.05901466e-01, 1.00208313e-01,
9.50751375e-02, 9.04240603e-02, 8.61909342e-02,
8.23224528e-02, 7.87739599e-02, 7.55077677e-02,
7.24918532e-02, 6.96988348e-02, 6.71051638e-02,
6.46904782e-02, 6.24370827e-02, 6.03295277e-02,
5.83542648e-02, 5.64993643e-02, 5.47542808e-02,
5.31096590e-02, 5.15571710e-02, 5.00893793e-02,
4.86996213e-02, 4.73819114e-02, 4.61308575e-02,
4.49415891e-02, 4.38096962e-02, 4.27311763e-02,
4.17023886e-02, 4.07200140e-02, 3.97810205e-02,
3.88826331e-02, 3.80223072e-02])
The problem is that the first value is huge, totally distorting all the results; it shows quite clearly in a plot:
I've experimented with other data, and random samples, and in some cases it works. The first value in my dataset is significantly higher than the rest, but it is a valid value so I can't just drop it.
Does anyone have any idea why this is happening?
Update
Here is another example showing the problem much more clearly:
In [1]: from scipy.stats import genextreme as gev, kstest
In [2]: data = [0.01, 0.0, 0.28, 0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.13, 0.07, 0.03,
   ...:         0.01, 0.42, 0.11, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.26, 1.32,
   ...:         0.06, 0.02, 1.57, 0.07, 1.56, 0.04]
In [3]: fit = gev.fit(data)
In [4]: kstest(data, 'genextreme', fit)
Out[4]: (0.48015007915450658, 6.966510064376763e-07)
In [5]: x = linspace(0, 2, 200)
In [6]: plot(x, gev.pdf(x, *fit))
Out[6]: [<matplotlib.lines.Line2D at 0x97590f0>]
In [7]: hist(data)
Note specifically that the kstest call in In [4] gives a p-value of about 7e-7, far below what's normally considered acceptable. Here is the plot produced:
First, I think you may want to keep your location parameter fixed at 0.
Second, you have zeros in your data; the resulting fit may have a +inf pdf at x=0, e.g. for the GEV fit or the Weibull fit.
Therefore, the fit is actually correct, but when you plot the pdf (including x=0), the resulting plot is distorted.
Third, I really think scipy should drop support for x=0 for a number of distributions such as Weibull. For x=0, R gives a helpful warning: Error in fitdistr(data, "weibull") : Weibull values must be > 0.
In [103]:
p=ss.genextreme.fit(data, floc=0)
ss.genextreme.fit(data, floc=0)
Out[103]:
(-1.372872096699608, 0, 0.011680600795499299)
In [104]:
plt.hist(data, bins=20, normed=True, alpha=0.7, label='Data')  # use density=True in newer matplotlib
plt.plot(np.linspace(5e-3, 1.6, 100),
         ss.genextreme.pdf(np.linspace(5e-3, 1.6, 100), p[0], p[1], p[2]), 'r--',
         label='GEV Fit')
plt.legend(loc='upper right')
plt.savefig('T1.png')
In [105]:
p=ss.expon.fit(data, floc=0)
ss.expon.fit(data, floc=0)
Out[105]:
(0, 0.14838807003769505)
In [106]:
plt.hist(data, bins=20, normed=True, alpha=0.7, label='Data')
plt.plot(np.linspace(0, 1.6, 100),
         ss.expon.pdf(np.linspace(0, 1.6, 100), p[0], p[1]), 'r--',
         label='Expon. Fit')
plt.legend(loc='upper right')
plt.savefig('T2.png')
In [107]:
p = ss.weibull_min.fit(data[data != 0], floc=0)   # data must be a NumPy array for the boolean indexing
ss.weibull_min.fit(data[data != 0], floc=0)
Out[107]:
(0.67366030738733995, 0, 0.10535422201164378)
In [108]:
plt.hist(data[data != 0], bins=20, normed=True, alpha=0.7, label='Data')
plt.plot(np.linspace(5e-3, 1.6, 100),
         ss.weibull_min.pdf(np.linspace(5e-3, 1.6, 100), p[0], p[1], p[2]), 'r--',
         label='Weibull_Min Fit')
plt.legend(loc='upper right')
plt.savefig('T3.png')
edit
Your second data set (which contains even more zeros) is a good example of how an MLE fit involving the location parameter can become very challenging, potentially with a lot of floating-point overflow/underflow involved:
In [122]:
#fit with location parameter fixed, scanning loc parameter from 1e-8 to 1e1
L=[] #stores the Log-likelihood
P=[] #stores the p value of K-S test
for LC in np.linspace(-8, 1, 200):
    fit = gev.fit(data, floc=10**LC)
    L.append(np.log(gev.pdf(data, *fit)).sum())
    P.append(kstest(data, 'genextreme', fit)[1])
L=np.array(L)
P=np.array(P)
In [123]:
#plot log likelihood, a lot of overflow/underflow issues! (see the zigzag line?)
plt.plot(np.linspace(-8, 1, 200), L,'-')
plt.ylabel('Log-Likelihood')
plt.xlabel('$log_{10}($'+'location parameter'+'$)$')
In [124]:
#plot p-value
plt.plot(np.linspace(-8, 1, 200), np.log10(P),'-')
plt.ylabel('$log_{10}($'+'K-S test P value'+'$)$')
plt.xlabel('$log_{10}($'+'location parameter'+'$)$')
Out[124]:
<matplotlib.text.Text at 0x107e3050>
In [128]:
# The best fit with the location parameter between 1e-8 and 1e1 has a log-likelihood of 515.18
np.linspace(-8, 1, 200)[L.argmax()]
fit = gev.fit(data, floc=10**(np.linspace(-8, 1, 200)[L.argmax()]))
np.log(gev.pdf(data, *fit)).sum()
Out[128]:
515.17663678368604
In [129]:
#The simple MLE fit is clearly bad, loglikelihood is -inf
fit0 = gev.fit(data)
np.log(gev.pdf(data, *fit0)).sum()
Out[129]:
-inf
In [133]:
#plot the fit
x = np.linspace(0.005, 2, 200)
plt.plot(x, gev.pdf(x, *fit))
plt.hist(data,normed=True, alpha=0.6, bins=20)
Out[133]:
(array([ 8.91719745, 0.8492569 , 0. , 1.27388535, 0. ,
0.42462845, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.42462845, 0. , 0. , 0.8492569 ]),
array([ 0. , 0.0785, 0.157 , 0.2355, 0.314 , 0.3925, 0.471 ,
0.5495, 0.628 , 0.7065, 0.785 , 0.8635, 0.942 , 1.0205,
1.099 , 1.1775, 1.256 , 1.3345, 1.413 , 1.4915, 1.57 ]),
<a list of 20 Patch objects>)
Edit, goodness of fit test for GEV
A side note on the KS test: you are testing the goodness-of-fit to a GEV whose parameters were estimated from the data itself. In that case the p-value is invalid, see: itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
There seem to be quite a few studies on goodness-of-fit tests for the GEV, but I haven't found any available implementations yet; a rough parametric-bootstrap sketch follows the references below.
http://onlinelibrary.wiley.com/doi/10.1029/98WR02364/abstract
http://onlinelibrary.wiley.com/doi/10.1029/91WR00077/abstract
http://www.idrologia.polito.it/~laio/articoli/16-WRR%20EDFtest.pdf
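As a rough workaround, one can approximate a more honest p-value with a parametric bootstrap: refit simulated GEV samples and compare their KS statistics to the observed one. This is only a sketch under that assumption; the function name and the number of replicates are illustrative, not from any library:

import numpy as np
from scipy.stats import genextreme as gev, kstest

def gev_ks_bootstrap(data, n_boot=500, seed=0):
    # fit the GEV and compute the observed KS statistic
    rng = np.random.default_rng(seed)
    fit = gev.fit(data)
    d_obs = kstest(data, 'genextreme', fit)[0]
    d_boot = []
    for _ in range(n_boot):
        # simulate from the fitted GEV, refit, and record the KS statistic
        sample = gev.rvs(*fit, size=len(data), random_state=rng)
        d_boot.append(kstest(sample, 'genextreme', gev.fit(sample))[0])
    # bootstrap p-value: fraction of simulated statistics at least as large as the observed one
    return d_obs, np.mean(np.array(d_boot) >= d_obs)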
Related
Good morning, everyone. I have a set of values.
Arr = np.array([0.11, 0.14, 0.22, 0.26, 0.31, 0.36, 0.44, 0.69, 0.70, 0.70, 0.70, 0.75, 0.98, 1.40])
I have constructed the CDF function in this way:
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]

def plot_ecdf(a):
    x, y = ecdf(a)
    x = np.insert(x, 0, x[0])
    y = np.insert(y, 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
ecdf_ = ecdf(Arr)
plot_ecdf(Arr)   # plot_ecdf expects the raw data, not the (x, y) tuple
Obtaining this figure:
Now I want to divide the space (y-axis) into 5 parts. To do this I am using the following function:
from scipy.stats.qmc import LatinHypercube
engine = LatinHypercube(d=1)
sample = engine.random(n=5) #Array of float64
For example, obtaining 5 randomly generated values:
0.0886183
0.450613
0.808077
0.753524
0.343108
At this point I would like to obtain the corresponding values of the CDF, as in the picture.
I also observed that the constructed CDF has a discrete set of values, which may not be optimal for my purpose.
My code does not work when I apply a linear interpolation between points.
Code
y= [0.1 ,0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.71]
x= [44.72, 43.4, 41.5, 39.9, 37.73, 36.1, 33.6, 31.5, 29.7, 26.4, 21.6, 16.8, 3.6, 0]
x_new = 25
y_new = np.interp(x_new, x, y)
print(y_new)
plt.plot(x, y, "og-", x_new, y_new, "or");
Result
Expected result
Can someone help me?
The docs say that the x values must be monotonically increasing. Yours don't.
The x-coordinate sequence is expected to be increasing, but this is
not explicitly enforced. However, if the sequence xp is
non-increasing, interpolation results are meaningless.
Note that, since NaN is unsortable, xp also cannot contain NaNs.
A simple check for xp being strictly increasing is:
np.all(np.diff(xp) > 0)
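A minimal sketch of one way to fix the example above, assuming it is acceptable to simply reorder the data: sort x into increasing order (and reorder y to match) before calling np.interp.

import numpy as np

x = [44.72, 43.4, 41.5, 39.9, 37.73, 36.1, 33.6, 31.5, 29.7, 26.4, 21.6, 16.8, 3.6, 0]
y = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.71]

order = np.argsort(x)                      # indices that put x in increasing order
x_sorted = np.asarray(x)[order]
y_sorted = np.asarray(y)[order]

y_new = np.interp(25, x_sorted, y_sorted)  # now interpolates between x=21.6 and x=26.4
print(y_new)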
I have the following code, in which DGauss is a function that generates the expected values. The two arrays, on the other hand, give me a distribution that I take as the observed values.
The code, based on the observed values, extracts a polynomial (of the seventh degree, for the moment) that describes their trend.
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
import sympy
from sympy import symbols, S

def DGauss(x, I1, I2, sigma1, sigma2):
    return I1*np.exp(-x*x/(2*sigma1*sigma1)) + I2*np.exp(-x*x/(2*sigma2*sigma2))
Pos = np.array([3.28, 3.13, 3.08, 3.03, 2.98, 2.93, 2.88, 2.83, 2.78, 2.73, 2.68,
2.63, 2.58, 2.53, 2.48, 2.43, 2.38, 2.33, 2.28, 2.23, 2.18, 2.13,
2.08, 2.03, 1.98, 1.93, 1.88, 1.83, 1.78, 1.73, 1.68, 1.63, 1.58,
1.53, 1.48, 1.43, 1.38, 1.33, 1.28, 1.23, 1.18, 1.13, 1.08, 1.03,
0.98, 0.93, 0.88, 0.83, 0.78, 0.73, 0.68, 0.63, 0.58, 0.53, 0.48,
0.43, 0.38, 0.33, 0.28, 0.23, 0.18, 0.13, 0.08, 0.03])
Val = np.array([0.00986279, 0.01529543, 0.0242624 , 0.0287456 , 0.03238484,
0.03285927, 0.03945234, 0.04615091, 0.05701618, 0.0637672 ,
0.07194268, 0.07763934, 0.08565687, 0.09615262, 0.1043281 ,
0.11350606, 0.1199406 , 0.1260062 , 0.14093328, 0.15079665,
0.16651464, 0.18065023, 0.1938894 , 0.2047541 , 0.21794024,
0.22806706, 0.23793043, 0.25164404, 0.2635118 , 0.28075974,
0.29568682, 0.30871501, 0.3311846 , 0.34648062, 0.36984661,
0.38540666, 0.40618835, 0.4283945 , 0.45002014, 0.48303911,
0.50746062, 0.53167057, 0.5548792 , 0.57835128, 0.60256181,
0.62566436, 0.65704847, 0.68289386, 0.71332794, 0.73258027,
0.769608 , 0.78769989, 0.81407275, 0.83358852, 0.85210239,
0.87109068, 0.89456217, 0.91618782, 0.93760247, 0.95680234,
0.96919757, 0.9783219 , 0.98486193, 0.9931429 ])
f = np.linspace(-9, 9, 2*len(Pos))

plt.errorbar(Pos, Val, xerr=0.02, yerr=2.7e-3, fmt='o')

popt, pcov = curve_fit(DGauss, Pos, Val)
plt.plot(f, DGauss(f, *popt), '--', label='Double Gauss')

x = Pos
y = Val
z, w = np.polyfit(x, y, 7, full=False, cov=True)
p = np.poly1d(z)
u = np.array(p)              # coefficient array, highest order first
xp = np.linspace(1, 6, 100)
_ = plt.plot(xp, p(xp), '-', color='darkviolet')

xsym = symbols('x')          # use a separate name so the data array x is not clobbered
coeffs = u[::-1]             # lowest-order coefficient first
poly = sum(S("{:7.3f}".format(v))*xsym**i for i, v in enumerate(coeffs))
eq_latex = sympy.printing.latex(poly)
print(eq_latex)
#LOOP SUGGESTED BY #Fourier
dof = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for i in dof:
    z = np.polyfit(x, y, i, full=False, cov=True)
    chi = np.sum((np.polyval(z, x) - y) ** 2)
    chinorm = chi/i
    plt.plot(chinorm)
What I would like to do now is to make a fit by varying the order of the polynomial to figure out which is the minimum order I need to have a good fit and not exceed the number of free parameters. In particular, I would like to make this fit with different orders and plot the chi-squared, which must be normalized with respect to the number of degrees of freedom.
Could someone help me kindly?
Thanks!
Based on the posted code this should work for your purpose:
chiSquares = []
dofs = 10
for i in np.arange(1, dofs+1):
    z = np.polyfit(x, y, i, full=False, cov=False)
    chi = np.sum((np.polyval(z, x) - y) ** 2) / np.std(y)  # ideally you should divide this using an error for the Val array
    chinorm = chi/i
    chiSquares.append(chinorm)
plt.plot(np.arange(1, dofs+1), chiSquares)
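As a side note, the conventional reduced chi-square divides by the squared measurement error and by the number of degrees of freedom n - (order + 1), rather than by the order itself. A hedged variant, using the yerr=2.7e-3 quoted in the question as the uncertainty on Val:

n = len(y)
sigma_y = 2.7e-3                                # measurement error on Val from the question
red_chi2 = []
for order in range(1, dofs + 1):
    z = np.polyfit(x, y, order)
    chi2 = np.sum(((np.polyval(z, x) - y) / sigma_y) ** 2)
    red_chi2.append(chi2 / (n - (order + 1)))   # dof = data points - fitted coefficients
plt.plot(range(1, dofs + 1), red_chi2, 'o-')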
If it is not evident from the plot, you can further use the F-test to check how many degrees of freedom are really needed:
import scipy.stats

n = len(y)
for d, (rss1, rss2) in enumerate(zip(chiSquares, chiSquares[1:])):
    p1 = d + 1
    p2 = d + 2
    F = ((rss1 - rss2)/(p2 - p1)) / (rss2/(n - p2))
    # F distribution with (p2 - p1, n - p2) degrees of freedom for nested models
    p = 1.0 - scipy.stats.f.cdf(F, p2 - p1, n - p2)
    print('F-stats: {:.3f}, p-value: {:.5f}'.format(F, p))
I am trying to fit a trapezoid to a set of time series using the curve_fit library from scipy.optimize. The function that I'm using to generate a trapezoid is the following:
def trapezoid(x, a, b, c, tau1, tau2):
    y = np.zeros(len(x))
    c = -np.abs(c)
    a = np.abs(a)
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y
Where a and c are the slopes, and tau1 and tau2 mark the beginning and the end of the flat phase.
And in order to fit I just use:
popt, pcov = curve_fit(trapezoid, xdata, ydata, method = 'lm')
For most of the cases it works just fine, such as in the following:
However, I'm also getting some cases on which it just fails to fit the data, where it looks like it should be doing ok:
The problem with these cases is that it sets a tau2 (end of the flat phase) smaller than tau1 (beginning of it).
Could anyone suggest a way to solve this issue? Whether by imposing a constraint or in some other way?
Example array for which the fit does not work:
array([1.2 , 1.21, 1.2 , 1.19, 1.21, 1.22, 2.47, 2.53, 2.49, 2.39, 2.28,
2.16, 2.07, 1.99, 1.91, 1.83, 1.74, 1.65, 1.57, 1.5 , 1.45, 1.41,
1.38, 1.35, 1.33, 1.29, 1.24, 1.19, 1.14, 1.11, 1.07, 1.04, 1. ,
0.95, 0.91, 0.87, 0.84, 0.8 , 0.77, 0.74, 0.72, 0.7 , 0.68, 0.66,
0.63, 0.61, 0.59, 0.57, 0.55, 0.52, 0.5 , 0.48, 0.45, 0.43, 0.41,
0.39, 0.38, 0.37, 0.37, 0.36, 0.35, 0.34, 0.34, 0.33])
Which yields: tau1: 8.45, tau2:5.99
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful for this problem. Lmfit provides a slightly higher-level interface to curve fitting, still based on the scipy optimizers, but with some better abstractions and features.
In particular for your question, lmfit parameters are Python objects that can have bounds, be fixed, or be written as simple algebraic constraints in terms of other variables. This can support imposing tau2 > tau1.
The idea is essentially to set tau2=tau1+taudiff and place a lower bound of 0 on taudiff. While you could rewrite your function to do that in the code (a sketch of that alternative follows), with lmfit you don't have to, and can put that logic in the Parameters instead.
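For reference, a minimal sketch of that plain-curve_fit alternative, assuming the xdata/ydata from the question; the reparametrized function name is illustrative:

import numpy as np
from scipy.optimize import curve_fit

def trapezoid_reparam(x, a, b, c, tau1, taudiff):
    # same model as above, but parametrized so that tau2 >= tau1 by construction
    tau2 = tau1 + np.abs(taudiff)
    y = np.zeros(len(x))
    a, c = np.abs(a), -np.abs(c)
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y

# popt, pcov = curve_fit(trapezoid_reparam, xdata, ydata, method='lm')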
Converting your script to use lmfit would give something like this:
import numpy as np
import pylab

from lmfit import Model
# use your same model function
def trapezoid(x, a, b, c, tau1, tau2):
    y = np.zeros(len(x))
    c = -np.abs(c)
    a = np.abs(a)
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y
# turn model function into lmfit Model
tmod = Model(trapezoid)
# create Parameters for this model: they will be *named* according
# to the signature of the model function, and be used as keys in
# an ordered-dictionary-derived object. Here you can also give
# initial values
params = tmod.make_params(a=1, b=2, c=0.5, tau1=5, tau2=-1)
# now you can set bounds or constraints.
# 1st, add a new variable "taudiff"
params.add('taudiff', value=0.1, min=0, vary=True)
# constrain tau2 to be taudiff + tau1 -- it is no longer a "free" variable
params['tau2'].expr = "taudiff + tau1"
# now do fit to data:
result = tmod.fit(ydata, params, x=xdata)
# print report of fit
print(result.fit_report())
# get best fit params:
for parname, param in result.params.items():
    print(parname, param.value, param.stderr, param.expr)
# get best fit array for plotting
pylab.plot(xdata, ydata)
pylab.plot(xdata, result.best_fit)
Hope that helps.
Just setting tau1, tau2 to the minimum and maximum value also works:
def trapezoid(x, a, b, c, tau1, tau2):
    y = np.zeros(len(x))
    c = -np.abs(c)
    a = np.abs(a)
    (tau1, tau2) = (min(tau1, tau2), max(tau1, tau2))
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y

x_data = np.arange(len(A))   # A is the example array from the question
popt, pcov = curve_fit(trapezoid, x_data, A, method='lm')
print(popt)
fit = trapezoid(x_data, *popt)
leads to:
Relatively new to python, mainly using it for plotting things. I am currently attempting to determine a best-fit line using the 4 parameter logistic (4PL) equation and curve_fit from scipy. There are one or two sites showing how 4PL works, but I could not get them to work for my data. Example, but similar, 4PL data below:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy.optimize as optimization
xdata = [2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]
ydata = [0.32, 0.3, 0.55, 0.60, 0.88, 0.92, 1.27, 1.21, 1.15, 1.12, 1.1, 1.1]
def fourPL(x, A, B, C, D):
    return ((A-D)/(1.0+((x/C)**(B))) + D)

guess = [0, -0.5, 0.5, 1]
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess)
params
This gives a warning (also an exponent warning with the test data, but not with the real data):
OptimizeWarning: Covariance of the parameters could not be estimated
category=OptimizeWarning)
And params just returns my initial guess. I have tried various initial guesses.
The best fit line is drawn when plotting, but is not a curve and does not go below x = 0 (I cannot find a reason negatives would mess with the 4PL model).
4PL fit plotted
I'm not sure if I am doing something incorrect with the equation, or with how the curve fit function works, or both. I have a similar issue using least squares instead of curve fit. I've tried a bunch of variations based on similar equations for the fit, but have been stuck for a while; any help in pointing me in the right direction would be much appreciated.
I'm surprised you did not get any warnings, or did not share them with us. I can't analyze this task for you scientifically, so here are just some remarks about the technical side:
Observation
When running your code, you should see some warnings like:
RuntimeWarning: invalid value encountered in power
return ((A-D)/(1.0+((x/C)**(B))) + D)
Don't ignore this!
Debugging
Add some prints to your function fourPL, probably for all the different components of your function, and look at what's happening.
Example:
def fourPL(x, A, B, C, D):
    print('1: ', (A-D))
    print('2: ', (x/C))
    print('3: ', (1.0+((x/C)**(B))))
    return ((A-D)/(1.0+((x/C)**(B))) + D)

...

params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess, maxfev=1)
# maxfev=1 -> let's just check one or a few iterations
Output:
1: -1.0
2: [ 4.60000000e+00 4.60000000e+00 4.00000000e+00 4.00000000e+00
3.40000000e+00 3.40000000e+00 2.00000000e+00 2.00000000e+00
2.00000000e-06 2.00000000e-06 -2.00000000e+00 -2.00000000e+00]
RuntimeWarning: invalid value encountered in power
print('3: ', (1.0+((x/C)**(B))))
3: [ 1.4662524 1.4662524 1.5 1.5 1.54232614
1.54232614 1.70710678 1.70710678 708.10678119 708.10678119
nan nan]
That's enough to stop. nans and infs are bad!
Theory
Now it's time for theory, and I won't do that here. But usually you should now think about the underlying theory and why these problems occur.
Is there something you missed in regard to the assumptions?
Repair (without checking theory)
Without checking the theory, and just looking over some example found within 30 seconds: hmm, are negative x-values a problem?
Let's shift x (by the minimum; hardcoded 1 here):
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1
Complete code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import scipy.optimize as optimization
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1
ydata = np.array([0.32, 0.3, 0.55, 0.60, 0.88, 0.92, 1.27, 1.21, 1.15, 1.12, 1.1, 1.1])
def fourPL(x, A, B, C, D):
return ((A-D)/(1.0+((x/C)**(B))) + D)
guess = [0, -0.5, 0.5, 1]
params, params_covariance = optimization.curve_fit(fourPL, xdata, ydata, guess)#, maxfev=1)
x_min, x_max = np.amin(xdata), np.amax(xdata)
xs = np.linspace(x_min, x_max, 1000)
plt.scatter(xdata, ydata)
plt.plot(xs, fourPL(xs, *params))
plt.show()
Output:
RuntimeWarning: divide by zero encountered in power
return ((A-D)/(1.0+((x/C)**(B))) + D)
Looks good, but it's time for another theory session: what did our linear-shift do to our results? I'm ignoring this again.
So just one warning and a nice-looking output.
If you want to remove that last warning, add some small epsilon to not have 0's in xdata:
xdata = np.array([2.3, 2.3, 2, 2, 1.7, 1.7, 1, 1, 0.000001, 0.000001, -1, -1]) + 1 + 1e-10
which will achieve the same, without any warning.
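To partially answer the earlier "what did our linear shift do" question: the fitted C now lives on the shifted axis, so to evaluate or plot on the original axis you have to apply the same shift to the inputs. A small sketch under that assumption, reusing the names from the complete code above:

shift = 1 + 1e-10                                 # the shift applied to xdata above
xs_orig = np.linspace(-1, 2.3, 1000)              # the original (unshifted) x range
plt.scatter(xdata - shift, ydata)                 # show the data on the original axis
plt.plot(xs_orig, fourPL(xs_orig + shift, *params))
plt.show()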