I want to calculate squared coefficient of variation i.e.:
For sample X, where
My work so far
I cannot simply take exponent of ln(X) because those numbers are to small and will be treated as 0. I read on wikipedia that we can use estimation of form (when sample is big):
However, when I'm using this formula, I obtain very unstable result:
vec = np.array([-750.1729, -735.0251])
np.exp(np.var(vec)) - 1
8.1818556596351e+24
Is there any opportunity to have result more accurate? Or it has to be this way, because variance is very big with respect to mean?
Related
So, even in the absence of errors due to the finite machine precision it would be mathematically incorrect to think of a scenario where finite number of points sampled from a Gaussian distribution give exactly zero mean always. One would truly need an infinite number of points for this to be exactly true.
Nonetheless, I am manually (in an ad hoc manner) trying to center the distribution so that the mean is at zero. For that I first generate a gaussian distribution, find it's mean and then shift each point with that mean. By doing this I take the mean very close to zero but then I encounter a small value close to the machine precision (of the order 10**(-17)) and I do not know how to make it exactly zero.
Here is the code I used:
import numpy as np
n=10000
X=np.random.normal(0,1,size=(2,n))[0,:]
Xm=np.mean(X)
print("Xm = ", Xm)
Y=np.random.normal(0,1,size=(2,n))[1,:]
Ym=np.mean(Ym)
print("Ym = ", Y)
for i in range(len(X)):
X[i]=X[i]-Xm
Y[i]=Y[i]-Ym
new_X=np.mean(X)
new_Y=np.mean(Y)
print(new_X)
print(new_Y)
Output:
Zreli = 0.002713682499601005
Preli = -0.0011499576497770079
-3.552713678800501e-18
2.2026824808563105e-17
I am not good with code but mathematically you could have a while loop that checks for the sum of the numbers you have to not be 0. If it isn't 0 you would add 1 to the lowest unit you allow.
I'm trying to plot a mathematical expression in python. I have a sum of functions f_i of the following type
-x/r^2*exp(-rx)+2/r^3*(1-exp(-rx))-x/r^2*exp(-r x_i)
where x_i values between 1/360 and 50. r is quite small, meaning 0.0001. I'm interested in plotting the behavior of these functions (actually I'm interested in plotting sum f_i(x) n_i, for some real n_i) as x converges to zero. I know the exact analytical expression, which I can reproduce. However, the plot for very small x tends to start oscillating and doesn't seem to converge. I'm now wondering if this has to do with the floating-point precision in python. I'm considering very small x, like 0.0000001
.0000001 isn't very small.
Check out these pages:
https://docs.python.org/3/library/stdtypes.html#typesnumeric
https://docs.python.org/3/library/sys.html#sys.float_info
Try casting your intermediate values in the computations to float before doing math.
I need to identify which statistic let me to find on digital image which line has the highest variation. I am using Variance (square units, calculated as numpy.var(x)) and Coefficient of Variation (unitless, calculated as numpy.sd(x)/numpy.mean(x)), but I got different values, as here:
v1 = line(VAR(x))
v2 = line(CV(x))
print(v1,v2)
The result:
(12,17)
Should not both find the same line?
Which one could be better to use in this case?
Coefficient of variation and variance are not supposed to choose the same array on a random data. Coefficient of variation will be sensitive to both variance and the scale of your data, whereas variance will be geared towards variation in your data.
Please see the example:
import numpy as np
x = np.random.randn(10)
x1= x+10
np.var(x), np.std(x)/np.mean(x)
(2.0571740850649021, -2.2697110381499224)
np.var(x1), np.std(x1)/np.mean(x1)
(2.0571740850649016, 0.1531035017615747)
Which one to choose depends on your application, but I'm leaning towards variance in your case.
Variance defines how much it varies from the mean (No Noise in the data) or Median(Noise in the data).
Coefficient of variation defines standarddeviation divided by mean. It always expressed in percentages.
I need to calculate PDFs of mixture of Dirichlet distribution in python. But for each mixture component there is the normalizing constant, which is the inverse beta function which has gamma function of sum of the hyper-parameters as the numerator. So even for a sum of hyper-parameters of size '60' it goes unbounded. Please suggest me a work around for this problem. What happens when I ignore the normalizing constant?
First its not the calculation of NC itself that is the problem. For a single dirichlet I have no problem . But what I have here is a mixture of product of dirichlets, so each mixture component is a product of many dirichlets each with its own NCs. So the product of these goes unbounded. Regarding my objective, I have a joint distribution of p(s,T,O), where 's' is discrete, 'T' and 'O' are the dirichlet variables i.e. a set of vectors of parameters which sum to '1'. Now since 's' is discrete and finite I have |S| set of mixture of product of dirichlet components for each 's'. Now my objective here is to find p(s|T,O). So I directly substitute a particular (T,O) and calculate the value of each p('s'|T,O). For this I need to calc the NCs. If there is only one mixture component then I can ignore the norm constant, calc. and renormalise finally, but since I have several mixture components each components will have different scaling and so I can't renormalise. This is my conundrum.
Some ideas. (1) To calculate the normalizing factor exactly, maybe you can rewrite the gamma function via gamma(a_i + 1) = a_i gamma(a_i) (a_i need not be an integer, let the base case be a_i < 1) and then you'll have sum(a_i, i, 1, n) terms in the numerator and denominator and you can reorder them so that you divide the largest term by the largest term and multiply those single ratios together instead of computing an enormous numerator and an enormous denominator and dividing those. (2) If you don't need to be exact, maybe you can apply Stirling's approximation. (3) Maybe you don't need the pdf at all. For some purposes, you just need a function which is proportional to the pdf. I believe Markov chain Monte Carlo is like that. So, what is the larger goal you are trying to achieve here?
I need to compute the mutual information, and so the shannon entropy of N variables.
I wrote a code that compute shannon entropy of certain distribution.
Let's say that I have a variable x, array of numbers.
Following the definition of shannon entropy I need to compute the probability density function normalized, so using the numpy.histogram is easy to get it.
import scipy.integrate as scint
from numpy import*
from scipy import*
def shannon_entropy(a, bins):
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
x=binedg[:-1]
g=-p*log2(p)
g[isnan(g)]=0.
return scint.simps(g,x=x)
Choosing inserting x, and carefully the bin number this function works.
But this function is very dependent on the bin number: choosing different values of this parameter I got different values.
Particularly if my input is an array of values constant:
x=[0,0,0,....,0,0,0]
the entropy of this variables obviously has to be 0, but if I choose the bin number equal to 1 I got the right answer, if I choose different values I got strange non sense (negative) answers.. what I am feeling is that numpy.histogram have the arguments normed=True or density= True that (as said in the official documentation) they should give back the histogram normalized, and probably I do some error in the moment that I swich from the probability density function (output of numpy.histogram) to the probability mass function (input of shannon entropy), I do:
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
I would like to find a way to solve these problems, I would like to have an efficient method to compute the shannon entropy independent of the bin number.
I wrote a function to compute the shannon entropy of a distribution of more variables, but I got the same error.
The code is this, where the input of the function shannon_entropydd is the array where at each position there is each variable that has to be involved in the statistical computation
def intNd(c,axes):
assert len(c.shape) == len(axes)
assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
if len(axes) == 1:
return scint.simps(c,axes[0])
else:
return intNd(scint.simps(c,axes[-1]),axes[:-1])
def shannon_entropydd(c,bins=30):
hist,ax=histogramdd(c,bins,normed=True)
for i in range(len(ax)):
ax[i]=ax[i][:-1]
p=-hist*log2(hist)
p[isnan(p)]=0
return intNd(p,ax)
I need these quantities in order to be able to compute the mutual information between certain set of variables:
M_info(x,y,z)= H(x)+H(z)+H(y)- H(x,y,z)
where H(x) is the shannon entropy of the variable x
I have to find a way to compute these quantities so if some one has a completely different kind of code that works I can switch on it, I don't need to repair this code but find a right way to compute this statistical functions!
The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.
I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.
[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf
I think that if you choose bins = 1, you will always find an entropy of 0, as there is no "uncertainty" over the possible bin the values are in ("uncertainty" is what entropy measures). You should choose an number of bins "big enough" to account for the diversity of the values that your variable can take. If you have discrete values: for binary values, you should take such that bins >= 2. If the values that can take your variable are in {0,1,2}, you should have bins >= 3, and so on...
I must say that I did not read your code, but this works for me:
import numpy as np
x = [0,1,1,1,0,0,0,1,1,0,1,1]
bins = 10
cx = np.histogram(x, bins)[0]
def entropy(c):
c_normalized = c/float(np.sum(c))
c_normalized = c_normalized[np.nonzero(c_normalized)]
h = -sum(c_normalized * np.log(c_normalized))
return h
hx = entropy(cx)