Is there an efficient implementation in Python to evaluate the PDF of a multivariate normal distribution when there are missing values in x? I guess the idea would just be that you'd effectively reduce the dimensionality to whatever number of available data points you had for a particular vector for which you are trying to evaluate the probability. But I can't figure out if the scipy implementation has a way to ignore masked values.
e.g.,
from scipy.stats import multivariate_normal as mvnorm
import numpy as np
means = [0.0,0.0,0.0]
cov = np.array([[1.0,0.2,0.2],[0.2,1.0,0.2],[0.2,0.2,1.0]])
d = mvnorm(means,cov)
x = [0.5,-0.2,np.nan]
d.pdf(x)
yields output:
nan
(as expected)
Is there a way to efficiently evaluate the PDF for only the values that are present (in this case, effectively turning the 3D case into a bivariate case) using this implementation?
This question is a bit tricky in terms of both math and code. Let me elaborate.
First, the code. scipy.stats does not offer nan-handling as you desire. Speedy code likely requires implementing the multivariate normal distribution PDF by hand and applying it to NumPy arrays directly. Leveraging vectorization is the only way to efficiently offer this functionality for large-scale datasets. On the other hand, the nan-tolerant function nanTol_pdf() below provides the desired functionality while staying true to the multivariate normal distribution as implemented in SciPy. You might find it sufficient for your use case.
def nanTol_pdf(d, x):
    '''
    Return the multivariate normal density evaluated on the non-NaN
    entries of the input vector x, using the marginal distribution
    over those entries.
    '''
    # accept lists as well as arrays; d is a frozen multivariate normal
    x = np.asarray(x, dtype=float)
    # check presence of nan entries
    if np.isnan(x).any():
        # indices of the observed (non-NaN) entries
        subIndex = np.flatnonzero(~np.isnan(x))
        # lower-dimensional multiv. Gaussian distribution
        lowDim_mean = d.mean[subIndex]
        lowDim_cov = d.cov[np.ix_(subIndex, subIndex)]
        lowDim_d = mvnorm(lowDim_mean, lowDim_cov)
        return lowDim_d.pdf(x[subIndex])
    else:
        return d.pdf(x)
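For many vectors at once, one possible sketch groups the rows by their missing-value pattern, so that each pattern needs only one vectorized call on the matching sub-mean and sub-covariance. The function name nan_tolerant_logpdf and the grouping strategy are assumptions made here for illustration and are not part of SciPy; the log-density is returned because it is numerically more convenient.

```python
import numpy as np
from scipy.stats import multivariate_normal


def nan_tolerant_logpdf(X, mean, cov):
    """Log-density of each row of X under the marginal over its observed entries.

    Hypothetical helper (not a SciPy function). Rows without any observed
    entry are returned as NaN.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    observed = ~np.isnan(X)
    out = np.full(X.shape[0], np.nan)
    # Rows sharing a missing-value pattern are evaluated in one call.
    patterns, inverse = np.unique(observed, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)
    for k, pattern in enumerate(patterns):
        if not pattern.any():
            continue  # nothing observed in these rows: leave NaN
        rows = inverse == k
        idx = np.flatnonzero(pattern)
        out[rows] = multivariate_normal.logpdf(
            X[rows][:, idx], mean[idx], cov[np.ix_(idx, idx)]
        )
    return out


means = np.zeros(3)
cov = np.array([[1.0, 0.2, 0.2], [0.2, 1.0, 0.2], [0.2, 0.2, 1.0]])
X = np.array([[0.5, -0.2, np.nan],
              [0.5, -0.2, 0.1],
              [np.nan, np.nan, np.nan]])
logp = nan_tolerant_logpdf(X, means, cov)
```

Taking np.exp(logp) recovers density values; the first row is evaluated under the bivariate marginal of the first two components.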
Regardless, the fact that we can do it shouldn't stop us from asking whether we should.
Second, the math. Mathematically speaking, it is unclear what you are trying to achieve. In your example, SciPy returns nan because you query it with an ill-defined input vector x. Returning an undefined output, i.e. not a number (nan), seems to be the most appropriate answer. Jointly truncating the distribution d and the input vector x circumvents the numerical problem but opens up statistical questions, in particular because probability density function values cannot be interpreted as (conditional) probabilities. Moreover, the output alone conceals whether truncation was applied. Remember that nanTol_pdf() will happily return a non-negative real number as long as at least one entry in the vector is a real number. Your use case will decide whether this is reasonable.
Finally, I would suggest at least considering missing data imputation techniques before moving forward. Let me know if this helps.
I have to combine p values and get one p value.
I'm using scipy.stats.combine_pvalues function, but it is giving very small combined p value, is it normal?
e.g.:
>>> import scipy.stats
>>> p_values_list=[8.017444955844044e-06, 0.1067379119652372, 5.306374345615846e-05, 0.7234201655194492, 0.13050605094545614, 0.0066989543716175, 0.9541246420333787]
>>> test_statistic, combined_p_value = scipy.stats.combine_pvalues(p_values_list, method='fisher',weights=None)
>>> combined_p_value
4.331727536209026e-08
As you see, combined_p_value is smaller than any given p value in the p_values_list?
How can it be?
Thanks in advance,
Burcak
It is correct, because the null hypothesis being tested is that all of your p-values come from a uniform distribution (i.e. every individual null hypothesis is true). The alternative hypothesis is that at least one of the individual alternative hypotheses is true, which in your case is very plausible.
We can simulate this by drawing from a uniform distribution 1000 times, each draw having the same length as your list of p-values:
import numpy as np
from scipy.stats import combine_pvalues
from matplotlib import pyplot as plt
random_p = np.random.uniform(0,1,(1000,len(p_values_list)))
res = np.array([combine_pvalues(i,method='fisher',weights=None) for i in random_p])
plt.hist(res[:, 0])  # column 0 holds the simulated Fisher chi-square statistics
From your results, the chi-square statistic is 62.456, which is really huge and nowhere near the simulated chi-square values above.
One thing to note is that the combining you did here does not take directionality into account. If direction matters in your test, you might want to consider using Stouffer's Z along with weights. Another sanity check is to run a simulation like the one above, generating lists of p-values under the null hypothesis and seeing how they differ from what you observed.
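As a rough sketch of that check: Fisher's statistic is -2 times the sum of the log p-values and, under the null hypothesis, follows a chi-square distribution with 2k degrees of freedom for k p-values. The seed and the number of simulated draws below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import chi2, combine_pvalues

p_values_list = [8.017444955844044e-06, 0.1067379119652372,
                 5.306374345615846e-05, 0.7234201655194492,
                 0.13050605094545614, 0.0066989543716175,
                 0.9541246420333787]

# Fisher's statistic computed by hand
stat = -2.0 * np.sum(np.log(p_values_list))
dof = 2 * len(p_values_list)
p_manual = chi2.sf(stat, dof)  # upper-tail probability

# Same quantities from SciPy
stat_scipy, p_scipy = combine_pvalues(p_values_list, method='fisher')

# Statistics simulated under the null (all p-values uniform on [0, 1])
rng = np.random.default_rng(0)  # arbitrary seed
null_p = rng.uniform(size=(1000, len(p_values_list)))
null_stats = -2.0 * np.log(null_p).sum(axis=1)
frac_exceeding = np.mean(null_stats >= stat)

print(stat, p_manual, frac_exceeding)
```

The fraction of simulated statistics at least as large as the observed one is a crude empirical counterpart of the combined p-value.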
Interesting paper but maybe a bit on the statistics side
I am by no means an expert in this field, but I am interested in your question. After some reading on Wikipedia, it seems to me that combined_p_value tells you how likely it is that all p-values in the list were obtained under the same null hypothesis, which is very unlikely considering the two extremely small values.
Your set has two extremely small values: 1st and 3rd. If the thought process I described is correct, removing any of them should yield a much higher p-value, which is indeed the case:
remove 1st: p-value of 0.00010569305282803985
remove 3rd: p-value of 2.4713196031837724e-05
In conclusion, I think that this is a correct way of interpreting the meta-analysis that combine_pvalues actually describes.
Suppose I have two tensors, p1 and p2, in TensorFlow of the same shape which contain probabilities, some of which might be zero or one. Is there an elegant way of calculating the log-likelihood pointwise: p1*log(p2) + (1-p1)*log(1-p2)?
Implementing it naively using the tensorflow functions
p1*tf.log(p2) + (1-p1)*tf.log(1-p2)
risks calling 0*tf.log(0) which will give a nan.
As an initial hack (there must be a better solution) I add an epsilon inside the log:
eps = 1e-10
p1*tf.log(p2+eps) + (1-p1)*tf.log(1-p2+eps)
which prevents a log(0).
Please take a look at the CRF. It contains implementation of Log Likelihood. In particular you can take a look at the implementation
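One way to avoid the epsilon is a function that defines x*log(y) as 0 whenever x is 0. The sketch below illustrates the idea with scipy.special.xlogy and scipy.special.xlog1py on NumPy arrays rather than tensors; this is a swapped-in illustration, and TensorFlow's tf.math.xlogy is understood to follow the same convention (that correspondence is an assumption here and worth checking against the TensorFlow documentation for your version).

```python
import numpy as np
from scipy.special import xlogy, xlog1py


def pointwise_log_likelihood(p1, p2):
    """p1*log(p2) + (1-p1)*log(1-p2), with 0*log(0) treated as 0."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # xlogy(x, y) = x*log(y), and 0 where x == 0
    # xlog1py(x, y) = x*log(1 + y), and 0 where x == 0
    return xlogy(p1, p2) + xlog1py(1.0 - p1, -p2)


p1 = np.array([0.0, 1.0, 0.3])
p2 = np.array([0.0, 1.0, 0.6])
ll = pointwise_log_likelihood(p1, p2)
```

Cases such as p1 = 1 with p2 = 0 still give -inf, which is the mathematically correct value rather than a nan.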
I cannot get scipy.optimize.curve_fit to properly fit my data which is visually apparent. I know approximately what the parameter values should be and if I evaluate the function with the given parameters the calculated and experimental data appear to agree well:
However, when I use scipy.optimize.curve_fit, the output parameters with the smallest error give a much worse fit (by visual inspection). If I use the "known" parameters as my initial guess and bound the parameters to a relatively narrow window, as shown in the example of output from the fit function:
I obtain error values ~10^2 times larger, but the visual appearance of the fit seems better. The only way I can get a decent-looking fit for the data is to bound all the parameters within ~0.3 units of the "known" parameters. I plan on using this code to fit more complex data for which I will not know the parameters beforehand, so I cannot just use the calculated plot.
The relevant code is included below:
import matplotlib.pyplot as plt
import numpy as np
import csv
import scipy
from scipy.optimize import curve_fit
d_1= 2.72 #Angstroms
sig_cp_1= 0.44
sig_int_1= 1.03
d_1, sig_cp_1,sig_int_1=float(d_1),float(sig_cp_1),float(sig_int_1)
Ref=[]
Qz_F=[]
Ref_F=[]
g=open("Exp_Fresnal.csv",'rb')#"Test_Fresnal.csv", 'rb')
reader=csv.reader(g)
for row in reader:
    Qz_F.append(row[0])
    Ref.append(row[1])
    Ref_F.append(row[2])
Ref=map(lambda a:float(a),Ref)
Ref_F=map(lambda a:float(a),Ref_F)
Qz_F=map(lambda a:float(a),Qz_F)
Ref_F_arr=np.array((Ref_F))
Qz_arr=np.array((Qz_F))
x=np.array((Qz_arr,Ref_F))
def func(x,d,sig_int,sig_cp):
    return (x[1])*(abs(x[0]*d*(np.exp((-sig_int**2)*(x[0]**2)/2)/(1-np.exp(complex(0,1)*x[0]*d)*np.exp((-sig_cp**2)*(x[0]**2)/2)))))**2
DC_ref=func(x,d_1,sig_int_1,sig_cp_1)
Y=np.array((Ref))
popt, pcov=curve_fit(func,x,Y,)#p0=[2.72,1.0,0.44])
perr=np.sqrt(np.diag(pcov))
print "par=",popt;print"Var=",perr
Fit=func(x,*popt)
Ref=np.transpose(np.array([Ref]))
Qz_F=np.transpose(Qz_F)
plt.plot(Qz_F, Ref, 'bs',label='Experimental')
plt.plot(Qz_F, Fit, 'r--',label='Fit w/ DCM model')
plt.axis([0,3,10**(-10),100])
plt.yscale('log')
plt.title('Reflectivity',fontweight='bold',fontsize=15)
plt.ylabel('Reflectivity',fontsize=15)
plt.xlabel('qz /A^-1',fontsize=15)
plt.legend(loc='upper right',numpoints=1)
plt.show()
The arrays are imported from a file (which I cannot include) and there are no outlier points that would cause the fit to become this distorted. Any help is appreciated.
Edit
I included additional code and the input data to go along with the code but you will have to re-save it as a MS-Dos .CSV
@WarrenWeckesser has a really good point, but further note that the y axis is logarithmic. That apparently huge error at the right end is something like 1e-5 in magnitude, while the points on the top left have reflectivity values of around 0.1. The square error coming from the tail is simply insignificant compared to the huge terms on the left.
I'm sure curve_fit works great. If you want a better visual fit, I suggest trying a fit to log(y) with the log() of your model (either that, or weight your points during fitting); then the result might be more stable visually (and from a physical point of view). Since you're probably trying to give an overall broad-spectrum description of your system, this might be closer to what you expect (but this will inevitably lead to a less precise fit where the reflectivity is high).
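A minimal sketch of both suggestions, using synthetic data and a hypothetical two-parameter decay model (model, the parameter values and the noise level are assumptions for illustration, not the reflectivity model from the question):

```python
import numpy as np
from scipy.optimize import curve_fit


def model(x, a, b):
    # hypothetical model whose values span many orders of magnitude
    return a * np.exp(-b * x)


def log_model(x, a, b):
    # the same model, compared with the data on a log scale
    return np.log(model(x, a, b))


rng = np.random.default_rng(0)  # arbitrary seed
x = np.linspace(0.1, 3.0, 60)
true_params = (0.5, 6.0)
# synthetic data with multiplicative (relative) noise
y = model(x, *true_params) * np.exp(rng.normal(scale=0.05, size=x.size))

# Option 1: fit log(y) with the log of the model.
# The positive lower bound keeps the argument of the log positive.
popt_log, pcov_log = curve_fit(log_model, x, np.log(y), p0=[1.0, 1.0],
                               bounds=(1e-12, np.inf))

# Option 2: keep the linear scale but weight each point by its magnitude,
# so that relative rather than absolute errors are minimized.
popt_w, pcov_w = curve_fit(model, x, y, p0=[1.0, 1.0], sigma=y)

print(popt_log, popt_w)
```

Both variants give the small-valued tail a say in the fit, at the cost of a less tight fit where the values are large.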
I am working with IFFT and have a set of real and imaginary values with their respective frequencies (x-axis). The frequencies are not equidistant, I can't use a discrete IFFT, and I am unable to fit my data correctly, because the values are so jumpy at the beginning. So my plan is to "stretch out" my frequency data points on a lg-scale, fit them (with polyfit) and then return - somehow - to normal scale.
f = data[0:27,0] #x-values
re = data[0:27,5] #y-values
lgf = p.log10(f)
polylog_re = p.poly1d(p.polyfit(lgf, re, 6))
The fit definitely works better (http://imgur.com/btmC3P0), but is it possible to then transform my polynomial back into the normal x-scaling? Right now I'm using those logarithmic fits for my IFFT and take the log10 of my transformed values for plotting etc., but that probably defies all mathematical logic and results in errors.
Your fit is perfectly valid, but it is not a regular polynomial fit. By using log_10(x), you use a different model function, something like y(x) = sum_i a_i * (log_10(x))^i. If this is okay for you, you are done. If you want to do some more maths with it, I would suggest using the natural logarithm instead of the one to base 10.
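In other words, nothing has to be transformed back: the fitted polynomial is simply evaluated at log10 of whatever frequency is needed. A small sketch with made-up data (the arrays f and re below are hypothetical stand-ins for the measured values, and re_of_f is a name chosen here):

```python
import numpy as np

# hypothetical, non-equidistant frequencies and matching real parts
f = np.logspace(0, 4, 27)
re = 1.0 + 0.5 * np.log10(f) - 0.1 * np.log10(f) ** 2

# polynomial in log10(f), as in the question
poly_log = np.poly1d(np.polyfit(np.log10(f), re, 6))


def re_of_f(freq):
    # model on the normal frequency scale: y(f) = sum_i a_i * log10(f)**i
    return poly_log(np.log10(freq))


# evaluate on an equidistant frequency grid, e.g. as input for a discrete IFFT
f_lin = np.linspace(f.min(), f.max(), 500)
re_lin = re_of_f(f_lin)
```

The values re_lin live on the ordinary frequency axis, so no further log10 needs to be applied to them.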
I have a data set of complex numbers, and I'd like to be able to find parameters that best fit the data. Can you fit data in complex numbers using leastsq as implemented by scipy in python?
For example, my code is something like this:
import cmath
import numpy as np
from scipy.optimize import leastsq
def residuals(p,y,x):
    L,Rs,R1,C=p
    denominator=1+(x**2)*(C**2)*(R1**2)
    sim=complex(Rs+R1/denominator,x*L-(R1**2)*x*C/denominator)
    return(y-sim)
z=<read in data, store as complex number>
x0=np.array([1, 2, 3, 4])
res = leastsq(residuals,x0, args=(z,x))
However, residuals doesn't like working with my complex number, I get the error:
File "/tmp/tmp8_rHYR/___code___.py", line 63, in residuals
sim=complex(Rs+R1/denominator,x*L-(R1**_sage_const_2 )*x*C/denominator)
File "expression.pyx", line 1071, in sage.symbolic.expression.Expression.__complex__ (sage/symbolic/expression.cpp:7112)
TypeError: unable to simplify to complex approximation
I'm guessing that I need to work only with floats/doubles rather than complex numbers. In that case, how can I evaluate the real and imaginary parts separately and then lump them back together into a single error metric for residuals to return?
The least squares function in scipy wants a real residual returned because it is difficult to compare complex values (e.g. is 1+2j greater or less than 2+1j?). Remember the residual is essentially a measure of the quality of the set of parameters passed in, it tells leastsq how close to the true fit it is.
What you can do is add the error (y-sim) in quadrature, appending these lines after you calculate 'sim' in your residuals function:
a = y-sim
return a.real**2 + a.imag**2
So long as y and sim are both NumPy arrays of complex numbers, this will work and is relatively efficient.
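An alternative sketch, offered as a variation rather than as the method above: since leastsq squares and sums the returned residuals itself, returning the real and imaginary parts stacked into one real vector makes it minimize the sum of |y - sim|^2 directly. The frequencies, parameter values and starting guess below are synthetic and purely illustrative, and impedance is a helper name introduced here.

```python
import numpy as np
from scipy.optimize import leastsq


def impedance(p, x):
    # same expression as in the question, built with 1j so it works on arrays
    L, Rs, R1, C = p
    denominator = 1 + (x**2) * (C**2) * (R1**2)
    return (Rs + R1 / denominator) + 1j * (x * L - (R1**2) * x * C / denominator)


def residuals(p, y, x):
    diff = y - impedance(p, x)
    # one real residual per real part and one per imaginary part
    return np.concatenate([diff.real, diff.imag])


# synthetic, noise-free data from known (made-up) parameters
x = np.linspace(0.1, 10.0, 50)
p_true = np.array([0.5, 1.0, 2.0, 0.3])
z = impedance(p_true, x)

x0 = np.array([0.4, 1.2, 1.8, 0.35])  # starting guess near the true values
p_fit, ier = leastsq(residuals, x0, args=(z, x))
print(p_fit)
```

With real measurements the starting guess matters, and parameters of very different magnitudes may need rescaling before the fit behaves well.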