I have a vector, or a bunch of vectors (stored in a 2D array, by rows).
The vectors are generated as: mean = 0, std-dev = 1/sqrt(vec_len),
and before or after operations they have to be normalized back into the same form.
I want to normalize them in the complex space.
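For concreteness, the input can be generated like this (a minimal sketch; the batch size of 3 and vec_len are made up for illustration):

import torch
vec_len = 4096
# three row vectors with mean 0 and std-dev 1/sqrt(vec_len)
x = torch.randn(3, vec_len) / vec_len ** 0.5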
Here is the wrapper of the function:

@staticmethod
def fft_normalize(x, dim=DEF_DIM):
    cx = rfft(x, dim=dim)
    ....
    rv = irfft(cx_proj, dim=dim)
    return rv
Help me fill in the dots.
Here is the real-value normalization that I use.
@staticmethod
def normalize(a, dim=DEF_DIM):
    norm = torch.linalg.norm(a, dim=dim)
    # if torch.eq(norm, 0): return torch.divide(a, st.MIN)
    if dim is not None: norm = norm.unsqueeze(dim)
    return torch.divide(a, norm)
In [70]: st.normalize(x + 3)
Out[70]:
([[0.05, 0.04, 0.05, ..., 0.04, 0.04, 0.04],
[0.04, 0.04, 0.05, ..., 0.05, 0.04, 0.05],
[0.05, 0.04, 0.05, ..., 0.04, 0.05, 0.04]])
In [71]: st.normalize(x + 5)
Out[71]:
([[0.05, 0.04, 0.05, ..., 0.04, 0.04, 0.04],
[0.04, 0.04, 0.05, ..., 0.05, 0.04, 0.04],
[0.05, 0.04, 0.04, ..., 0.04, 0.05, 0.04]])
In [73]: st.normalize(x + 5).len()
Out[73]: ([1.00, 1.00, 1.00])
In [74]: st.normalize(x + 3).len()
Out[74]: ([1., 1., 1.])
In [75]: st.normalize(x).len()
Out[75]: ([1.00, 1.00, 1.00])
#bad, need normalization
In [76]: (x + 3).len()
Out[76]: ([67.13, 67.13, 67.13])
@staticmethod
def len(a, dim=DEF_DIM): return torch.linalg.norm(a, dim=dim)
I did not want to post this, so as not to influence possible better solutions. So here is one of my attempts; parts of it I borrowed from what I found.
This only works for 1D vectors ;(
@staticmethod
def fft_normalize(x, dim=DEF_DIM):  # normalize a vector x in the complex domain
    c = rfft(x, dim=dim)
    ri = torch.vstack([c.real, c.imag])
    norm = torch.abs(c)
    print(norm.shape, ri.shape)
    # norm = torch.linalg.norm(ri, dim=dim)
    # if dim is not None: norm = norm.unsqueeze(dim)
    if torch.any(torch.eq(norm, 0)): norm[torch.eq(norm, 0)] = st.MIN  # !fixme
    ri = torch.divide(ri, norm)  # 2D fails here
    c_proj = ri[0, :] + 1j * ri[1, :]
    rv = irfft(c_proj, dim=dim)
    return rv
I also adapted the solution of Thibault Cimic ... it seems to work for 1D vectors, but not for 2D:
@staticmethod
def fft_normalize(x, dim=DEF_DIM, dot_dim=None):  # normalize a vector x in the complex domain
    c = rfftn(x, dim=dim)
    c_conj = torch.conj(c)
    if dot_dim is None: dot_dim = st.dot_dims(c, c_conj)
    c_norm = torch.sqrt(torch.tensordot(c, c_conj, dims=dot_dim))
    c_proj = torch.divide(c, c_norm)
    rv = irfftn(c_proj, dim=dim)
    return rv
I'm guessing you want to normalize with the norm associated with the natural complex inner product. So is this what you're trying to do:
def fft_normalize(x, dim=DEF_DIM):  # normalize a vector x in the complex domain
    c = rfft(x, dim=dim)
    c_norm = math.sqrt(c.dot(numpy.conjugate(c)))
    c_proj = c / c_norm
    rv = irfft(c_proj, dim=dim)
    return rv
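The same idea extends to the 2D (batch of rows) case, because torch.linalg.norm accepts complex tensors directly; a sketch along those lines (the clamp stands in for the st.MIN guard from the question):

import torch
from torch.fft import rfft, irfft

def fft_normalize(x, dim=-1):
    cx = rfft(x, dim=dim)
    # complex norm along `dim`; keepdim=True so the division broadcasts per row
    norm = torch.linalg.norm(cx, dim=dim, keepdim=True)
    # guard against all-zero vectors
    norm = torch.clamp(norm, min=torch.finfo(x.dtype).tiny)
    cx_proj = cx / norm
    # pass n explicitly so odd-length inputs round-trip correctly
    return irfft(cx_proj, dim=dim, n=x.shape[dim])

With dim=-1 each row of a 2D batch is normalized independently, so the 1D and 2D cases go through the same code path.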
I have a selection of lists of variables
import numpy.random as npr
w = [0.02, 0.03, 0.05, 0.07, 0.11, 0.13, 0.17]
x = 1
y = False
z = [0.12, 0.2, 0.25, 0.05, 0.08, 0.125, 0.175]
v = npr.choice(w, x, y, z)
I want to find the probability of the value v being a particular selection of variables, e.g. False or 0.12.
How do I do this?
Here's what I've tried:
import numpy.random as npr
import math
w = [0.02, 0.03, 0.05, 0.07, 0.11, 0.13, 0.17]
x = 1
y = False
z = [0.12, 0.2, 0.25, 0.05, 0.08, 0.125, 0.175]
v = npr.choice(w, x, y, z)
from collections import Counter
c = Counter(0.02, 0.03, 0.05, 0.07, 0.11, 0.13, 0.17,1,False,0.12, 0.2, 0.25, 0.05, 0.08, 0.125, 0.175)
def probability(0.12):
    return float(c[v]/len(w,x,y,z))
for which I'm getting that 0.12 is invalid syntax.
There are several issues in the code; I think you want the following:
import numpy.random as npr
import math
from collections import Counter
def probability(v=0.12):
    return float(c[v]/len(combined))
w = [0.02, 0.03, 0.05, 0.07, 0.11, 0.13, 0.17]
x = [1]
y = [False]
z = [0.12, 0.2, 0.25, 0.05, 0.08, 0.125, 0.175]
combined = w + x + y + z
v = npr.choice(combined)
c = Counter(combined)
print(probability())
print(probability(v=0.05))
1) def probability(0.12) does not make sense; you will have to pass a variable which can also have a default value (above I use 0.12)
2) len(w, x, y, z) does not make much sense either; you probably look for a list that combines all the elements of w, x, y and z. I put all of those in the list combined.
3) One would also have to put in an additional check, in case the user passes e.g. v=12345 which is not included in combined (I leave this to you).
The above will print
0.0625
0.125
which gives the expected outcome.
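For point 3, one possible shape of that check (a sketch; raising an error is just one choice, returning 0.0 would be another):

def probability(v=0.12):
    # reject values that never occur in the combined list
    if v not in c:
        raise ValueError("{!r} does not occur in combined".format(v))
    return c[v] / len(combined)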
I have the following code, in which DGauss is a function that generates the expected values. The two arrays, on the other hand, allow me to generate a distribution that I take as the observed values.
The code, based on the observed values, extracts a polynomial (for the moment, of the seventh degree) that describes its trend.
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
def DGauss(x, I1, I2, sigma1, sigma2):
    return I1*np.exp(-x*x/(2*sigma1*sigma1)) + I2*np.exp(-x*x/(2*sigma2*sigma2))
Pos = np.array([3.28, 3.13, 3.08, 3.03, 2.98, 2.93, 2.88, 2.83, 2.78, 2.73, 2.68,
2.63, 2.58, 2.53, 2.48, 2.43, 2.38, 2.33, 2.28, 2.23, 2.18, 2.13,
2.08, 2.03, 1.98, 1.93, 1.88, 1.83, 1.78, 1.73, 1.68, 1.63, 1.58,
1.53, 1.48, 1.43, 1.38, 1.33, 1.28, 1.23, 1.18, 1.13, 1.08, 1.03,
0.98, 0.93, 0.88, 0.83, 0.78, 0.73, 0.68, 0.63, 0.58, 0.53, 0.48,
0.43, 0.38, 0.33, 0.28, 0.23, 0.18, 0.13, 0.08, 0.03])
Val = np.array([0.00986279, 0.01529543, 0.0242624 , 0.0287456 , 0.03238484,
0.03285927, 0.03945234, 0.04615091, 0.05701618, 0.0637672 ,
0.07194268, 0.07763934, 0.08565687, 0.09615262, 0.1043281 ,
0.11350606, 0.1199406 , 0.1260062 , 0.14093328, 0.15079665,
0.16651464, 0.18065023, 0.1938894 , 0.2047541 , 0.21794024,
0.22806706, 0.23793043, 0.25164404, 0.2635118 , 0.28075974,
0.29568682, 0.30871501, 0.3311846 , 0.34648062, 0.36984661,
0.38540666, 0.40618835, 0.4283945 , 0.45002014, 0.48303911,
0.50746062, 0.53167057, 0.5548792 , 0.57835128, 0.60256181,
0.62566436, 0.65704847, 0.68289386, 0.71332794, 0.73258027,
0.769608 , 0.78769989, 0.81407275, 0.83358852, 0.85210239,
0.87109068, 0.89456217, 0.91618782, 0.93760247, 0.95680234,
0.96919757, 0.9783219 , 0.98486193, 0.9931429 ])
f = np.linspace(-9,9,2*len(Pos))
plt.errorbar(Pos, Val, xerr=0.02, yerr=2.7e-3, fmt='o')
popt, pcov = curve_fit(DGauss, Pos, Val)
plt.plot(f, DGauss(f, *popt), '--', label='Double Gauss')
x = Pos
y = Val
z, w = np.polyfit(x, y, 7, full=False, cov=True)
p = np.poly1d(z)
u = np.array(p)
xp = np.linspace(1, 6, 100)
_ = plt.plot(xp, p(xp), '-', color='darkviolet')
from sympy import symbols, S
import sympy
x = symbols('x')
coeffs = u[::-1]
poly = sum(S("{:7.3f}".format(v))*x**i for i, v in enumerate(coeffs))
eq_latex = sympy.printing.latex(poly)
print(eq_latex)
# LOOP SUGGESTED BY @Fourier
dof = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for i in dof:
    z = np.polyfit(x, y, i, full=False, cov=True)
    chi = np.sum((np.polyval(z, x) - y) ** 2)
    chinorm = chi/i
    plt.plot(chinorm)
What I would like to do now is to make the fit while varying the order of the polynomial, to figure out the minimum order I need for a good fit without exceeding the number of free parameters. In particular, I would like to do this fit with different orders and plot the chi-squared, which must be normalized with respect to the number of degrees of freedom.
Could someone kindly help me?
Thanks!
Based on the posted code this should work for your purpose:
chiSquares = []
dofs = 10
for i in np.arange(1, dofs+1):
    z = np.polyfit(x, y, i, full=False, cov=False)
    chi = np.sum((np.polyval(z, x) - y) ** 2) / np.std(y)  # ideally you should divide this using an error for the Val array
    chinorm = chi/i
    chiSquares.append(chinorm)
plt.plot(np.arange(1, dofs+1), chiSquares)
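As a side note, the reduced chi-square is conventionally divided by the statistical degrees of freedom, n minus the number of fitted coefficients (a degree-i polynomial has i + 1 of them), rather than by the order i. If you want that convention, a variant of the loop (a sketch):

# variant: normalize by dof = n - (i + 1) instead of the polynomial order
n = len(y)
chiSquaresDof = []
for i in np.arange(1, dofs + 1):
    z = np.polyfit(x, y, i)
    chi = np.sum((np.polyval(z, x) - y) ** 2)
    chiSquaresDof.append(chi / (n - (i + 1)))
plt.plot(np.arange(1, dofs + 1), chiSquaresDof)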
If it is not evident from the plot, you can further use the F-test to check how many degrees of freedom are really needed:
import scipy.stats

n = len(y)
for d, (rss1, rss2) in enumerate(zip(chiSquares, chiSquares[1:])):
    p1 = d + 1
    p2 = d + 2
    F = ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))
    p = 1.0 - scipy.stats.f.cdf(F, p2 - p1, n - p2)
    print('F-stats: {:.3f}, p-value: {:.5f}'.format(F, p))
I am trying to fit a trapezoid to a set of time series using the curve_fit function from scipy.optimize. The function that I'm using to generate a trapezoid is the following:
def trapezoid(x, a, b, c, tau1, tau2):
    y = np.zeros(len(x))
    c = -np.abs(c)
    a = np.abs(a)
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y
Where a and c are the slopes, and tau1 and tau2 mark the beginning and the end of the flat phase.
And in order to fit I just use:
popt, pcov = curve_fit(trapezoid, xdata, ydata, method = 'lm')
For most of the cases it works just fine, such as in the following:
However, I'm also getting some cases on which it just fails to fit the data, where it looks like it should be doing ok:
The problem with these cases is that it sets a tau2 (end of the flat phase) smaller than tau1 (beginning of it).
Could anyone suggest a way to solve this issue? Whether by imposing a constraint or in some other way?
Example array for which the fit does not work:
array([1.2 , 1.21, 1.2 , 1.19, 1.21, 1.22, 2.47, 2.53, 2.49, 2.39, 2.28,
2.16, 2.07, 1.99, 1.91, 1.83, 1.74, 1.65, 1.57, 1.5 , 1.45, 1.41,
1.38, 1.35, 1.33, 1.29, 1.24, 1.19, 1.14, 1.11, 1.07, 1.04, 1. ,
0.95, 0.91, 0.87, 0.84, 0.8 , 0.77, 0.74, 0.72, 0.7 , 0.68, 0.66,
0.63, 0.61, 0.59, 0.57, 0.55, 0.52, 0.5 , 0.48, 0.45, 0.43, 0.41,
0.39, 0.38, 0.37, 0.37, 0.36, 0.35, 0.34, 0.34, 0.33])
Which yields: tau1 = 8.45, tau2 = 5.99.
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful for this problem. Lmfit provides a slightly higher-level interface to curve fitting, still based on the scipy optimizers, but with some better abstractions and features.
In particular for your question, lmfit parameters are Python objects that can have bounds, be fixed, or be written as simple algebraic constraints in terms of other variables. This can support imposing tau2 > tau1.
The idea is essentially to set tau2=tau1+taudiff and place a lower bound of 0 on taudiff. While you could rewrite your function to do that in the code, with lmfit you don't have to do that and can put that logic in the Parameters instead.
Converting your script to use lmfit would give something like this:
from lmfit import Model

# use your same model function
def trapezoid(x, a, b, c, tau1, tau2):
    y = np.zeros(len(x))
    c = -np.abs(c)
    a = np.abs(a)
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y

# turn the model function into an lmfit Model
tmod = Model(trapezoid)

# create Parameters for this model: they will be *named* according
# to the signature of the model function, and be used as keys in
# an ordered-dictionary-derived object. Here you can also give
# initial values
params = tmod.make_params(a=1, b=2, c=0.5, tau1=5, tau2=-1)

# now you can set bounds or constraints.
# 1st, add a new variable "taudiff"
params.add('taudiff', value=0.1, min=0, vary=True)

# constrain tau2 to be taudiff + tau1 -- it is no longer a "free" variable
params['tau2'].expr = "taudiff + tau1"

# now do the fit to data:
result = tmod.fit(ydata, params, x=xdata)

# print a report of the fit
print(result.fit_report())

# get best-fit params:
for parname, param in result.params.items():
    print(parname, param.value, param.stderr, param.expr)

# get best-fit array for plotting
pylab.plot(xdata, ydata)
pylab.plot(xdata, result.best_fit)
Hope that helps.
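If you would rather stay with plain scipy, the same reparameterization can be done by hand: fit taudiff = tau2 - tau1 with a lower bound of 0. A sketch (trapezoid_reparam is a hypothetical wrapper around the trapezoid function above, and method='trf' is used because 'lm' does not support bounds):

import numpy as np
from scipy.optimize import curve_fit

def trapezoid_reparam(x, a, b, c, tau1, taudiff):
    # same model, fitted in terms of tau1 and taudiff = tau2 - tau1 >= 0
    return trapezoid(x, a, b, c, tau1, tau1 + taudiff)

# only taudiff (the last parameter) is bounded below by 0
bounds = ([-np.inf, -np.inf, -np.inf, -np.inf, 0],
          [np.inf, np.inf, np.inf, np.inf, np.inf])
popt, pcov = curve_fit(trapezoid_reparam, xdata, ydata, method='trf', bounds=bounds)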
Just setting tau1, tau2 to the minimum and maximum value does work:
def trapezoid(x, a, b, c, tau1, tau2):
    y = np.zeros(len(x))
    c = -np.abs(c)
    a = np.abs(a)
    (tau1, tau2) = (min(tau1, tau2), max(tau1, tau2))
    y[:int(tau1)] = a*x[:int(tau1)] + b
    y[int(tau1):int(tau2)] = a*tau1 + b
    y[int(tau2):] = c*(x[int(tau2):]-tau2) + (a*tau1 + b)
    return y

x_data = np.arange(len(A))
popt, pcov = curve_fit(trapezoid, x_data, A, method='lm')
print(popt)
fit = trapezoid(x_data, *popt)
leads to:
I am doing some data analysis involving fitting datasets to a Generalised Extreme Value (GEV) distribution, but I'm getting some weird results. Here's what I'm doing:
from scipy.stats import genextreme as gev
import numpy
data = [1.47, 0.02, 0.3, 0.01, 0.01, 0.02, 0.02, 0.12, 0.38, 0.02, 0.15, 0.01, 0.3, 0.24, 0.01, 0.05, 0.01, 0.0, 0.06, 0.01, 0.01, 0.0, 0.05, 0.0, 0.09, 0.03, 0.22, 0.0, 0.1, 0.0]
x = numpy.linspace(0, 2, 20)
pdf = gev.pdf(x, *gev.fit(data))
print(pdf)
And the output:
array([ 5.64759709e+05, 2.41090345e+00, 1.16591714e+00,
7.60085002e-01, 5.60415578e-01, 4.42145248e-01,
3.64144425e-01, 3.08947114e-01, 2.67889183e-01,
2.36190826e-01, 2.11002185e-01, 1.90520108e-01,
1.73548832e-01, 1.59264573e-01, 1.47081601e-01,
1.36572220e-01, 1.27416958e-01, 1.19372442e-01,
1.12250072e-01, 1.05901466e-01, 1.00208313e-01,
9.50751375e-02, 9.04240603e-02, 8.61909342e-02,
8.23224528e-02, 7.87739599e-02, 7.55077677e-02,
7.24918532e-02, 6.96988348e-02, 6.71051638e-02,
6.46904782e-02, 6.24370827e-02, 6.03295277e-02,
5.83542648e-02, 5.64993643e-02, 5.47542808e-02,
5.31096590e-02, 5.15571710e-02, 5.00893793e-02,
4.86996213e-02, 4.73819114e-02, 4.61308575e-02,
4.49415891e-02, 4.38096962e-02, 4.27311763e-02,
4.17023886e-02, 4.07200140e-02, 3.97810205e-02,
3.88826331e-02, 3.80223072e-02])
The problem is that the first value is huge, totally distorting all the results; it shows up quite clearly in a plot:
I've experimented with other data, and random samples, and in some cases it works. The first value in my dataset is significantly higher than the rest, but it is a valid value so I can't just drop it.
Does anyone have any idea why this is happening?
Update
Here is another example showing the problem much more clearly:
In [1]: from scipy.stats import genextreme as gev, kstest
In [2]: data = [0.01, 0.0, 0.28, 0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.13, 0.07, 0.03
, 0.01, 0.42, 0.11, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.26, 1.32, 0.06, 0.02,
1.57, 0.07, 1.56, 0.04]
In [3]: fit = gev.fit(data)
In [4]: kstest(data, 'genextreme', fit)
Out[4]: (0.48015007915450658, 6.966510064376763e-07)
In [5]: x = linspace(0, 2, 200)
In [6]: plot(x, gev.pdf(x, *fit))
Out[6]: [<matplotlib.lines.Line2D at 0x97590f0>]
In [7]: hist(data)
Note specifically, line 4 shows a p-value of about 7e-7, way below what's normally considered acceptable. Here is the plot produced:
First, I think you may want to keep your location parameter fixed at 0.
Second, you have zeros in your data, so the resulting fit may have an infinite pdf at x=0, e.g. for the GEV fit or for the Weibull fit.
Therefore, the fit is actually correct, but when you plot the pdf (including x=0), the resulting plot is distorted.
Third, I really think scipy should drop support for x=0 for a bunch of distributions such as Weibull. For x=0, R gives a nice warning, Error in fitdistr(data, "weibull") : Weibull values must be > 0, which is helpful.
In [103]:
p=ss.genextreme.fit(data, floc=0)
ss.genextreme.fit(data, floc=0)
Out[103]:
(-1.372872096699608, 0, 0.011680600795499299)
In [104]:
plt.hist(data, bins=20, normed=True, alpha=0.7, label='Data')
plt.plot(np.linspace(5e-3, 1.6, 100),
ss.genextreme.pdf(np.linspace(5e-3, 1.6, 100), p[0], p[1], p[2]), 'r--',
label='GEV Fit')
plt.legend(loc='upper right')
plt.savefig('T1.png')
In [105]:
p=ss.expon.fit(data, floc=0)
ss.expon.fit(data, floc=0)
Out[105]:
(0, 0.14838807003769505)
In [106]:
plt.hist(data, bins=20, normed=True, alpha=0.7, label='Data')
plt.plot(np.linspace(0, 1.6, 100),
ss.expon.pdf(np.linspace(0, 1.6, 100), p[0], p[1]), 'r--',
label='Expon. Fit')
plt.legend(loc='upper right')
plt.savefig('T2.png')
In [107]:
p=ss.weibull_min.fit(data[data!=0], floc=0)
ss.weibull_min.fit(data[data!=0], floc=0)
Out[107]:
(0.67366030738733995, 0, 0.10535422201164378)
In [108]:
plt.hist(data[data!=0], bins=20, normed=True, alpha=0.7, label='Data')
plt.plot(np.linspace(5e-3, 1.6, 100),
ss.weibull_min.pdf(np.linspace(5e-3, 1.6, 100), p[0], p[1], p[2]), 'r--',
label='Weibull_Min Fit')
plt.legend(loc='upper right')
plt.savefig('T3.png')
Edit
Your second data set (which contains even more 0's) is a good example of when an MLE fit involving the location parameter can become very challenging, especially with a lot of floating-point overflow/underflow involved:
In [122]:
#fit with location parameter fixed, scanning loc parameter from 1e-8 to 1e1
L=[] #stores the Log-likelihood
P=[] #stores the p value of K-S test
for LC in np.linspace(-8, 1, 200):
    fit = gev.fit(data, floc=10**LC)
    L.append(np.log(gev.pdf(data, *fit)).sum())
    P.append(kstest(data, 'genextreme', fit)[1])
L=np.array(L)
P=np.array(P)
In [123]:
#plot log likelihood, a lot of overflow/underflow issues! (see the zigzag line?)
plt.plot(np.linspace(-8, 1, 200), L,'-')
plt.ylabel('Log-Likelihood')
plt.xlabel('$log_{10}($'+'location parameter'+'$)$')
In [124]:
#plot p-value
plt.plot(np.linspace(-8, 1, 200), np.log10(P),'-')
plt.ylabel('$log_{10}($'+'K-S test P value'+'$)$')
plt.xlabel('$log_{10}($'+'location parameter'+'$)$')
Out[124]:
<matplotlib.text.Text at 0x107e3050>
In [128]:
#The best fit with the location parameter between 1e-8 and 1e1 has a log-likelihood of 515.18
np.linspace(-8, 1, 200)[L.argmax()]
fit = gev.fit(data, floc=10**(np.linspace(-8, 1, 200)[L.argmax()]))
np.log(gev.pdf(data, *fit)).sum()
Out[128]:
515.17663678368604
In [129]:
#The simple MLE fit is clearly bad: the log-likelihood is -inf
fit0 = gev.fit(data)
np.log(gev.pdf(data, *fit0)).sum()
Out[129]:
-inf
In [133]:
#plot the fit
x = np.linspace(0.005, 2, 200)
plt.plot(x, gev.pdf(x, *fit))
plt.hist(data,normed=True, alpha=0.6, bins=20)
Out[133]:
(array([ 8.91719745, 0.8492569 , 0. , 1.27388535, 0. ,
0.42462845, 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0.42462845, 0. , 0. , 0.8492569 ]),
array([ 0. , 0.0785, 0.157 , 0.2355, 0.314 , 0.3925, 0.471 ,
0.5495, 0.628 , 0.7065, 0.785 , 0.8635, 0.942 , 1.0205,
1.099 , 1.1775, 1.256 , 1.3345, 1.413 , 1.4915, 1.57 ]),
<a list of 20 Patch objects>)
Edit, goodness of fit test for GEV
A side note on the KS test: you are testing the goodness of fit to a GEV with its parameters ESTIMATED FROM THE DATA itself. In such a case, the p-value is invalid; see itl.nist.gov/div898/handbook/eda/section3/eda35g.htm
There seem to be a lot of studies on the topic of goodness-of-fit tests for GEV; I haven't found any available implementations of those yet:
http://onlinelibrary.wiley.com/doi/10.1029/98WR02364/abstract
http://onlinelibrary.wiley.com/doi/10.1029/91WR00077/abstract
http://www.idrologia.polito.it/~laio/articoli/16-WRR%20EDFtest.pdf