I need to calculate PDFs of a mixture of Dirichlet distributions in Python. Each mixture component has a normalizing constant, the inverse of the beta function, whose numerator is the gamma function of the sum of the hyper-parameters. So even when the hyper-parameters sum to only 60, it goes unbounded. Please suggest a workaround for this problem. What happens if I ignore the normalizing constant?
First, it's not the calculation of the normalizing constant (NC) itself that is the problem. For a single Dirichlet I have no problem. But what I have here is a mixture of products of Dirichlets, so each mixture component is a product of many Dirichlets, each with its own NC, and the product of these goes unbounded. Regarding my objective: I have a joint distribution p(s,T,O), where 's' is discrete and 'T' and 'O' are the Dirichlet variables, i.e. sets of parameter vectors that each sum to 1. Since 's' is discrete and finite, I have |S| mixtures of products of Dirichlet components, one for each 's'. My objective is to find p(s|T,O), so I substitute a particular (T,O) directly and calculate the value of each p(s|T,O). For this I need to calculate the NCs. If there were only one mixture component, I could ignore the normalizing constant, calculate, and renormalize at the end; but since I have several mixture components, each component has a different scaling, so I can't renormalize. This is my conundrum.
Some ideas. (1) To calculate the normalizing factor exactly, maybe you can rewrite the gamma function via gamma(a_i + 1) = a_i gamma(a_i) (a_i need not be an integer, let the base case be a_i < 1) and then you'll have sum(a_i, i, 1, n) terms in the numerator and denominator and you can reorder them so that you divide the largest term by the largest term and multiply those single ratios together instead of computing an enormous numerator and an enormous denominator and dividing those. (2) If you don't need to be exact, maybe you can apply Stirling's approximation. (3) Maybe you don't need the pdf at all. For some purposes, you just need a function which is proportional to the pdf. I believe Markov chain Monte Carlo is like that. So, what is the larger goal you are trying to achieve here?
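One way to act on these ideas is to keep everything in logarithms, so the huge gamma values never appear as plain floats. This is a swapped-in technique rather than something spelled out above, and it is only a minimal sketch: the function names (`dirichlet_logpdf`, `product_logpdf`, `posterior_over_s`) and the data layout (`mixtures[s]` as a list of `(weight, alphas)` pairs) are hypothetical, and every entry of every probability vector is assumed to be strictly positive.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def dirichlet_logpdf(theta, alpha):
    """Log-density of one Dirichlet; gammaln avoids overflow in the constant."""
    theta = np.asarray(theta, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    log_norm = gammaln(alpha.sum()) - gammaln(alpha).sum()
    return log_norm + np.sum((alpha - 1.0) * np.log(theta))

def product_logpdf(thetas, alphas):
    """Log-density of a product of Dirichlets: the logs simply add up."""
    return sum(dirichlet_logpdf(t, a) for t, a in zip(thetas, alphas))

def posterior_over_s(thetas, mixtures):
    """p(s | thetas), assuming mixtures[s] is a list of (weight, alphas) pairs
    (a hypothetical layout chosen for this sketch)."""
    log_joint = np.array([
        logsumexp([np.log(w) + product_logpdf(thetas, alphas)
                   for w, alphas in components])
        for components in mixtures
    ])
    # Normalize in log space, then exponentiate once at the very end.
    return np.exp(log_joint - logsumexp(log_joint))

# Hypothetical example: two values of s, each a two-component mixture.
thetas = [[0.2, 0.3, 0.5], [0.6, 0.4]]
mix_a = [(0.5, [[20.0, 30.0, 50.0], [60.0, 40.0]]),
         (0.5, [[1.0, 1.0, 1.0], [1.0, 1.0]])]
mix_b = [(0.5, [[50.0, 30.0, 20.0], [40.0, 60.0]]),
         (0.5, [[2.0, 2.0, 2.0], [2.0, 2.0]])]
print(posterior_over_s(thetas, [mix_a, mix_b]))
```

Because each component keeps its own log normalizing constant, the different scalings of the components are respected, and the only exponentiation happens after the final normalization.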
I need to approximate a function y(x) with a step function of height h, where each "high" segment has a length l_i=n_i*l_0 and every "low" segment has a length d_j=n_j*d_0, where n_i and n_j must be integers. The function is strictly positive, monotonically (but not strictly) decreasing, and continuous.
My function has been derived in sympy and is available as symbolic equation but it's acceptable to convert to numpy/scipy if beneficial.
My first approach was to solve the segments pairwise.
The end application requires the total difference, i.e. the integral between the approximation and target function, to be minimized pairwise.
Another practical constraint is for the segments to be as short as possible, with the constraint of n being an integer.
I would also need to take over any residual of the integral sum into the next calculation because the total approximation should also minimize the accumulated error.
The approach I thought about taking would involve doing a segment wise integral from x_0 to x_1 and from x_1 to x_2, find for which x_1, x_2 the sum of these integrals changes sign (or is minimized) and then find the lowest common denominator of n_i and n_j.
integral = smp.integrate(y-h,(x,x_0,x_1)) + smp.integrate(y,(x,x_1,x_2))
One approach would be to switch over to scipy.optimize.minimize at this point, however, I have read it has problems with integer values? On the other hand, I don't know how I could find a relationship for x_1(x_2) for which the integral would be close to 0 in sympy either as I just started using sympy yesterday. Any help would be hugely appreciated!
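As a rough illustration of the pairwise idea, the integer constraint can be handled by a brute-force search over small integer pairs instead of a continuous optimizer. This is only a sketch under stated assumptions: the target function `y`, the values of `h`, `l_0`, `d_0`, the search bound `n_max`, and the helper name `best_pair` are all hypothetical, and ties in the error are broken in favour of shorter segments.

```python
import itertools
import sympy as smp

x = smp.symbols("x")
y = smp.Rational(1, 2) * smp.exp(-x / 10)   # hypothetical target function
h, l_0, d_0 = 1.0, 0.1, 0.1                  # hypothetical height and unit lengths

# Antiderivative of y, converted once to a fast numeric function.
F = smp.lambdify(x, smp.integrate(y, x), "math")

def best_pair(x_0, carry=0.0, n_max=20):
    """Search integer (n_i, n_j) minimizing |carry + segment-pair integral|."""
    best = None
    for n_i, n_j in itertools.product(range(1, n_max + 1), repeat=2):
        x_1 = x_0 + n_i * l_0               # end of the "high" segment
        x_2 = x_1 + n_j * d_0               # end of the "low" segment
        # integral of (y - h) over [x_0, x_1] plus integral of y over [x_1, x_2]
        err = carry + (F(x_2) - F(x_0)) - h * (x_1 - x_0)
        key = (abs(err), n_i + n_j)         # smallest error, then shortest pair
        if best is None or key < best[0]:
            best = (key, n_i, n_j, err)
    return best[1], best[2], best[3]

n_i, n_j, err = best_pair(0.0)
print(n_i, n_j, err)
```

The returned `err` can be passed as `carry` to the next call, starting at the new `x_0 = x_2`, so that the residual accumulates across pairs.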
So, even in the absence of errors due to finite machine precision, it would be mathematically incorrect to expect a finite number of points sampled from a Gaussian distribution to always have a mean of exactly zero. One would truly need an infinite number of points for this to be exactly true.
Nonetheless, I am manually (in an ad hoc manner) trying to center the distribution so that the mean is at zero. For that I first generate Gaussian samples, find their mean, and then shift each point by that mean. This brings the mean very close to zero, but I am left with a small value close to machine precision (of the order of 10**(-17)) and I do not know how to make it exactly zero.
Here is the code I used:
import numpy as np
n=10000
# draw the samples and record their means
X=np.random.normal(0,1,size=(2,n))[0,:]
Xm=np.mean(X)
print("Xm = ", Xm)
Y=np.random.normal(0,1,size=(2,n))[1,:]
Ym=np.mean(Y)
print("Ym = ", Ym)
# shift every point by the sample mean
for i in range(len(X)):
    X[i]=X[i]-Xm
    Y[i]=Y[i]-Ym
# means after centering
new_X=np.mean(X)
new_Y=np.mean(Y)
print(new_X)
print(new_Y)
Output:
Zreli = 0.002713682499601005
Preli = -0.0011499576497770079
-3.552713678800501e-18
2.2026824808563105e-17
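For comparison, the same centering can be done without an explicit loop. A residual of this size is ordinary floating-point rounding, so it is usually more practical to compare against a tolerance than to demand an exact zero. The following is a minimal sketch; the seed and the tolerance `1e-12` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily
X = rng.normal(0.0, 1.0, size=10_000)

X_centered = X - X.mean()                # vectorized shift by the sample mean
residual = X_centered.mean()             # tiny, but generally not exactly 0.0

print(residual)
print(np.isclose(residual, 0.0, atol=1e-12))   # tolerance-based check
```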
I am not good with code, but mathematically you could have a while loop that checks whether the sum of your numbers is 0. If it isn't 0, you would add 1 to the lowest unit you allow.
I want to calculate squared coefficient of variation i.e.:
For sample X, where
My work so far
I cannot simply take the exponential of ln(X), because those numbers are too small and will be treated as 0. I read on Wikipedia that, when the sample is large, we can use an estimate of the form:
CV^2 ≈ exp(s^2) - 1, where s^2 is the sample variance of ln(X)
However, when I use this formula, I obtain a very unstable result:
vec = np.array([-750.1729, -735.0251])
np.exp(np.var(vec)) - 1
8.1818556596351e+24
Is there any way to get a more accurate result? Or does it have to be this way, because the variance is very large relative to the mean?
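One alternative to the log-normal estimate is to evaluate the plain definition Var(X)/mean(X)^2 directly from the values of ln(X), using log-sum-exp so that nothing underflows. This is a hedged sketch, not a statement about which estimator is appropriate here: the function name `squared_cv_from_logs` is hypothetical, and it uses the population variance (dividing by n), which is an assumption.

```python
import numpy as np
from scipy.special import logsumexp

def squared_cv_from_logs(log_x):
    """CV^2 = Var(X) / mean(X)^2, computed from log_x = ln(X) only.

    With the population variance, CV^2 = n * sum(x^2) / (sum x)^2 - 1,
    and both sums can be evaluated in log space.
    """
    log_x = np.asarray(log_x, dtype=float)
    n = log_x.size
    log_sum = logsumexp(log_x)             # log(sum of x_i)
    log_sum_sq = logsumexp(2.0 * log_x)    # log(sum of x_i^2)
    return float(np.exp(np.log(n) + log_sum_sq - 2.0 * log_sum) - 1.0)

vec = np.array([-750.1729, -735.0251])
print(squared_cv_from_logs(vec))           # finite, close to 1 for this sample
```

For this two-point sample one value is many orders of magnitude larger than the other, so the empirical CV^2 sits just below 1, very far from what the log-normal formula returns.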
Consider this matrix:
[.6, .7]
[.4, .3]
This is a Markov chain matrix; the columns each sum to 1. This can represent a population distribution, transition rates, etc.
To get the population at equilibrium, take the eigenvalues and eigenvectors...
From wolfram alpha, the eigenvalues and their corresponding eigenvectors are:
l1 = 1, v1 = [7/4, 1]
l2 = -1/10, v2 = [-1,1]
For the population at equilibrium, take the eigenvector that corresponds to the eigenvalue of 1, and scale it so the total = 1.
Vector = [7/4, 1]
Total = 11/4
So multiply the vector by 4/11...
4/11 * [7/4, 1] = [7/11, 4/11]
Therefore at equilibrium the first state has 7/11 of the population and the other state has 4/11.
If you take the desired eigenvector, [7/4, 1] and l2 normalize it (so all squared values sum up to 1), you get roughly [.868, .496].
That's all fine. But when you get the eigenvectors from python...
import numpy as np
mat = np.array([[.6, .7], [.4, .3]])
vals, vecs = np.linalg.eig(mat)
vecs = vecs.T  # eig returns the eigenvectors as columns; after .T each row is one eigenvector
One of the eigenvectors it spits out is the [.868, .496] one, i.e. the l2-normalized one. Now, you can pretty easily scale it again so that the values sum to 1 (instead of the squares of the values summing to 1): just do vector * 1/sum(vector). But is there a way to skip this step? Why add the computational expense to my script of summing up the vector each time I do this? Can you get numpy, scipy, etc. to spit out the l1-normalized vector instead of the l2-normalized vector? Also, is that the correct usage of the terms l1 and l2...?
Note: I have seen previous questions asking how to get the Markov steady states in this manner. My question is different: I am asking how to get numpy to spit out a vector normalized in the way I want, and I am explaining my reasoning by including the Markov part.
I think you're assuming that np.linalg.eig computes eigenvectors and eigenvalues like you would by hand. It doesn't. Under the hood, it uses a highly optimized (and famous) FORTRAN library called LAPACK. This library uses numerical techniques that are sort of out of scope, but long story short it doesn't compute the eigenvalues for a 2x2 like you would by hand. I believe it uses the QR algorithm most of the time, and sometimes QZ, or even others. It's not all that simple: I think it even chooses different algorithms based on the matrix structure/size sometimes (I'm not a LAPACK expert, so don't quote me here). What I do know is that LAPACK has been vetted over about 40 years and it is pretty darned fast, and with great speed comes great complexity.
Wolfram Alpha, on the other hand, is using Mathematica on the backend, which is a symbolic solver (i.e. not floating point arithmetic). That's why you get the "same" result as if you'd do it by hand.
Long story short, getting L1-normalized eigenvectors directly from np.linalg.eig just isn't possible: if you look at the QR algorithm, each iteration will have the L2-normalized vector (that converges to an eigenvector). You'll have trouble getting them from most numerical libraries for the simple reason that a lot of them depend on LAPACK or use similar algorithms (for instance, MATLAB outputs unit vectors as well).
At the end of the day, it doesn't really matter if the vector is normalized or not. It really just has to be in the right direction. If you need to scale it for a proportion, then do that. It'll be vectorized (i.e. fast) by numpy since it's a simple multiply.
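For the matrix in the question, the rescaling step might look like the sketch below. Picking the eigenvalue closest to 1 with `np.argmin` is an assumption made for this example, and the variable names `k`, `v`, and `steady` are illustrative.

```python
import numpy as np

mat = np.array([[0.6, 0.7], [0.4, 0.3]])   # columns sum to 1
vals, vecs = np.linalg.eig(mat)            # eigenvectors are the columns of vecs

k = np.argmin(np.abs(vals - 1.0))          # index of the eigenvalue closest to 1
v = np.real(vecs[:, k])                    # unit-length (L2) eigenvector
steady = v / v.sum()                       # rescale so the entries sum to 1

print(steady)                              # approximately [7/11, 4/11]
```

Dividing by the sum also takes care of the sign: if the library returns the eigenvector with both entries negative, the sum is negative too and the result is still a valid distribution.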
HTH.
I need to compute the mutual information, and therefore the Shannon entropy, of N variables.
I wrote code that computes the Shannon entropy of a given distribution.
Let's say that I have a variable x, an array of numbers.
Following the definition of Shannon entropy, I need the normalized probability density function, which is easy to get with numpy.histogram.
import scipy.integrate as scint
from numpy import*
from scipy import*
def shannon_entropy(a, bins):
    p,binedg= histogram(a,bins,normed=True)
    p=p/len(p)
    x=binedg[:-1]
    g=-p*log2(p)
    g[isnan(g)]=0.
    return scint.simps(g,x=x)
With a suitable input x and a carefully chosen number of bins, this function works.
But the result depends strongly on the number of bins: different values of this parameter give different entropies.
In particular, if my input is an array of constant values:
x=[0,0,0,....,0,0,0]
the entropy of this variable obviously has to be 0. If I set the number of bins to 1 I get the right answer, but with other values I get strange, nonsensical (negative) answers. My suspicion is this: numpy.histogram has the arguments normed=True or density=True that (according to the official documentation) should return the normalized histogram, and I probably make a mistake at the point where I switch from the probability density function (the output of numpy.histogram) to the probability mass function (the input to the Shannon entropy). I do:
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
I would like to find a way to solve these problems: I would like an efficient method to compute the Shannon entropy that is independent of the number of bins.
I wrote a function to compute the Shannon entropy of a joint distribution of several variables, but I got the same error.
The code is below; the input of the function shannon_entropydd is an array in which each position holds one of the variables involved in the statistical computation.
def intNd(c,axes):
    assert len(c.shape) == len(axes)
    assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
    if len(axes) == 1:
        return scint.simps(c,axes[0])
    else:
        return intNd(scint.simps(c,axes[-1]),axes[:-1])

def shannon_entropydd(c,bins=30):
    hist,ax=histogramdd(c,bins,normed=True)
    for i in range(len(ax)):
        ax[i]=ax[i][:-1]
    p=-hist*log2(hist)
    p[isnan(p)]=0
    return intNd(p,ax)
I need these quantities in order to compute the mutual information between a certain set of variables:
M_info(x,y,z)= H(x)+H(z)+H(y)- H(x,y,z)
where H(x) is the Shannon entropy of the variable x.
I have to find a way to compute these quantities, so if someone has a completely different kind of code that works, I can switch to it. I don't need to repair this code, just to find a correct way to compute these statistical functions!
The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.
I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.
[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf
I think that if you choose bins = 1, you will always find an entropy of 0, as there is no "uncertainty" over which bin the values fall in ("uncertainty" is what entropy measures). You should choose a number of bins "big enough" to account for the diversity of the values that your variable can take. If you have discrete values: for binary values, you should choose bins >= 2. If your variable takes values in {0,1,2}, you should have bins >= 3, and so on...
I must say that I did not read your code, but this works for me:
import numpy as np
x = [0,1,1,1,0,0,0,1,1,0,1,1]
bins = 10
cx = np.histogram(x, bins)[0]
def entropy(c):
    c_normalized = c/float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    h = -sum(c_normalized * np.log(c_normalized))
    return h
hx = entropy(cx)
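The same counts-based approach might be extended to the joint entropy and to the quantity H(x)+H(y)+H(z)-H(x,y,z) from the question. This is a sketch only: the function names `entropy_from_counts` and `total_correlation` are hypothetical, entropies are in bits, and the result still depends on the number of bins, which is a limitation of any histogram estimate.

```python
import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (bits) of the probability mass given by bin counts."""
    p = np.asarray(counts, dtype=float).ravel()
    p = p / p.sum()                 # counts -> probability mass per bin
    p = p[p > 0]                    # empty bins contribute nothing
    return float(-np.sum(p * np.log2(p)))

def total_correlation(columns, bins=10):
    """Sum of the marginal entropies minus the joint entropy."""
    data = np.column_stack(columns)                  # shape (samples, variables)
    joint, _ = np.histogramdd(data, bins=bins)       # joint bin counts
    marginals = [np.histogram(col, bins=bins)[0] for col in columns]
    return sum(entropy_from_counts(m) for m in marginals) - entropy_from_counts(joint)

constant = np.zeros(100)
print(entropy_from_counts(np.histogram(constant, bins=10)[0]))  # 0 for constant data

x = [0, 0, 1, 1]
print(total_correlation([x, x], bins=2))   # 1 bit: the second variable copies the first
```

Working with counts rather than densities avoids the density-to-mass conversion altogether, so a constant input gives zero entropy for any number of bins.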