I have two arrays: x is the independent variable, and counts is the number of times each value of x occurs, like a histogram. I know I can calculate the mean by defining a function:
import numpy as np

def mean(x, counts):
    return np.sum(x * counts) / np.sum(counts)
Is there a general function I can use to calculate each moment from the distribution defined by x and counts? I would also like to compute the variance.
You could use scipy.stats.moment; it calculates the n-th central moment of a data sample.
You could also define your own function, which could look something like this:
def nmoment(x, counts, c, n):
    return np.sum(counts * (x - c)**n) / np.sum(counts)
In that function, c is meant to be the point around which the moment is taken, and n is the order. So to get the variance you could do nmoment(x, counts, np.average(x, weights=counts), 2).
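For example, a quick check with made-up histogram data, reusing the nmoment function above (the x and counts values are purely illustrative):

import numpy as np

x = np.array([0, 1, 2, 3, 4])
counts = np.array([5, 20, 50, 20, 5])

mu = nmoment(x, counts, 0, 1)       # first raw moment = weighted mean (2.0 here)
var = nmoment(x, counts, mu, 2)     # second central moment = variance (0.8 here)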
from scipy import stats
stats.moment(counts, moment=2)  # second central moment of the counts array
stats.moment returns the n-th central moment of the sample you pass in. Note that it treats its argument as a plain sample, so stats.moment(counts, moment=2) is the variance of the counts array itself, not of the distribution weighted by counts; for that you would first have to expand the data, e.g. stats.moment(np.repeat(x, counts), moment=2) for integer counts.
NumPy has routines for these summary statistics now:
https://numpy.org/doc/stable/reference/routines.statistics.html
np.average
np.std
np.var
etc
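Of these, np.average is the one that takes a weights argument directly (np.var and np.std do not), so for the question's setup a weighted mean and variance could look like this sketch with illustrative data:

import numpy as np

x = np.array([0, 1, 2, 3, 4])          # illustrative values
counts = np.array([5, 20, 50, 20, 5])  # illustrative counts

mean = np.average(x, weights=counts)
var = np.average((x - mean)**2, weights=counts)   # weighted variance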
I want to sample from the binomial distribution B(n,p), but with the additional constraint that the sampled value lies in the range [a,b] (instead of the usual 0 to n range). In other words, I have to sample a value from the binomial distribution given that it lies in [a,b]. Mathematically, I can write the pmf of this distribution, f(x), in terms of the binomial pmf bin(x) = C(n,x) * p^x * (1-p)^(n-x) as
sum = 0
for i in range(a, b+1):
    sum += bin(i)

f(x) = bin(x) / sum
One way of sampling from this distribution is to sample a uniformly distributed number and apply the inverse of the CDF (obtained from the pmf). However, I don't think this is a good idea, as computing the pmf would easily get very time-consuming.
The values of n, x, a, b are quite large in my case, and this way of computing the pmf and then using a uniform random variable to generate the sample seems extremely inefficient due to the factorial terms in C(n,x).
What's a nice/efficient way to achieve this?
This is a way to collect all the values of bin in a pretty short time:
from scipy.special import comb
import numpy as np

def distribution(n, p=0.5):
    x = np.arange(n + 1)
    return comb(n, x, exact=False) * p**x * (1 - p)**(n - x)
It can be done in about a quarter of a microsecond for n=1000.
Sample run:
>>> distribution(4)
array([0.0625, 0.25 , 0.375 , 0.25 , 0.0625])
You can sum specific parts of this array like so:
>>> np.sum(distribution(4)[2:4])
0.625
Remark: for n > 1000 the middle values of this distribution require extremely large numbers in the multiplication, so a RuntimeWarning (overflow) is raised.
Bugfix
You can use scipy.stats.binom equivalently:
import numpy as np
from scipy.stats import binom

def distribution(n, p):
    return binom.pmf(np.arange(n + 1), n, p)
This does the same as the method above, quite efficiently (n=1000000 in about a third of a second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p), which computes the cumulative sum of binom.pmf. Then subtracting the a-th entry of this array from the b-th entry gives a result very close to what you expect.
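For completeness, here is a minimal sketch of how such a pmf array could be turned into actual samples from the truncated distribution; the values of n, p, a and b below are placeholders, not from the question:

import numpy as np
from scipy.stats import binom

n, p = 1000, 0.3      # placeholder parameters
a, b = 280, 320       # placeholder truncation range [a, b]

support = np.arange(a, b + 1)
pmf = binom.pmf(support, n, p)
pmf /= pmf.sum()                       # renormalize over [a, b], i.e. f(x) = bin(x)/sum

samples = np.random.choice(support, size=10000, p=pmf)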
Another way would be to use the CDF and its inverse, something like:
from scipy import stats
dist = stats.binom(100, 0.5)
# limit ourselves to [60, 100]
lo, hi = dist.cdf([60, 100])
# draw a sample
x = dist.ppf(stats.uniform(lo, hi-lo).rvs())
This should give us values in the range. Note that due to floating-point precision, this might give you values slightly outside of what you want; it gets worse above the mean of the distribution.
Note also that for large values you might as well use the normal approximation.
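A rough sketch of that idea, approximating B(n, p) by a normal with mean np and variance np(1-p) and truncating it to [a, b] with scipy.stats.truncnorm; all parameter values here are placeholders:

import numpy as np
from scipy import stats

n, p = 10**6, 0.5                      # placeholder parameters
a, b = 499000, 501000                  # placeholder truncation range
mu, sigma = n*p, np.sqrt(n*p*(1 - p))

# truncnorm expects the bounds in standard-deviation units
dist = stats.truncnorm((a - mu)/sigma, (b - mu)/sigma, loc=mu, scale=sigma)
samples = np.rint(dist.rvs(size=10000)).astype(int)   # round back to integers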
I have a 2-D array of coordinates, and each coordinate corresponds to a value z (i.e. z = f(x, y)). Now I want to divide the whole 2-D coordinate set into, for example, 100 even bins, calculate the median value of z in each bin, and then use the scipy.interpolate.griddata function to create an interpolated z surface. How can I achieve this in Python? I was thinking of using np.histogram2d, but I don't think it has a median statistic, and I have a hard time understanding how scipy.stats.binned_statistic works. Can someone help me please? Thanks.
With numpy.histogram2d you can both count the number of data points per bin and sum their values, which lets you compute the average per bin.
I would try something like this:
import numpy as np

coo = np.array([np.arange(1000), np.arange(1000)]).T   # your array of coordinates

def func(x, y):
    return x*(1 - x)*np.sin(np.pi*x) / (1.5 + np.sin(2*np.pi*y**2)**2)

z = func(coo[:, 0], coo[:, 1])

(n, ex, ey) = np.histogram2d(coo[:, 0], coo[:, 1], bins=100)               # counts per bin
(tot, ex, ey) = np.histogram2d(coo[:, 0], coo[:, 1], bins=100, weights=z)  # sum of z per bin
average = tot / n
average = np.nan_to_num(average)   # cure 0/0 in empty bins
print(average)
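If you specifically need the median rather than the mean, scipy.stats.binned_statistic_2d can compute it directly; a minimal sketch reusing the coo and z arrays from above:

from scipy import stats

# median of z in each of the 100x100 bins; empty bins come back as NaN
med, xedges, yedges, binnumber = stats.binned_statistic_2d(
    coo[:, 0], coo[:, 1], z, statistic='median', bins=100)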
You'll need one or a few functions, depending on how you want to structure things:
A function to create the bins should take in your data, determine how big each bin is, and return an array (or an array of arrays).
Happy to help with this, but I would need more information about the data.
get the median of the bins:
NumPy has a median function:
http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.median.html
Essentially, the median of an array called bin would be:
numpy.median(bin)
Note: if your bins are stacked in a 2-D array of equal-length rows, numpy.median(bins, axis=1) returns an array with the median of each bin, so you can get the median for some or all of your bins at once.
Updated
Not 100% sure about your example code, so here goes:
import numpy as np

# added some parentheses as I wasn't sure of the math; also removed the semicolons
def bincalc(x, y):
    return x*(1 - x)*np.sin(np.pi*x) / (1.5 + np.sin(2*(np.pi*y)**2)**2)

coo = np.random.rand(1000, 2)        # 1000 random (x, y) coordinates
z = bincalc(coo[:, 0], coo[:, 1])    # evaluate the function at every coordinate
z_med = np.median(z)                 # median over all the points
print(z_med)
I need to compute the mutual information, and therefore the Shannon entropy, of N variables.
I wrote code that computes the Shannon entropy of a given distribution.
Let's say that I have a variable x, an array of numbers.
Following the definition of Shannon entropy I need the normalized probability density function, which is easy to get with numpy.histogram.
import scipy.integrate as scint
from numpy import *
from scipy import *

def shannon_entropy(a, bins):
    p, binedg = histogram(a, bins, normed=True)
    p = p/len(p)
    x = binedg[:-1]
    g = -p*log2(p)
    g[isnan(g)] = 0.
    return scint.simps(g, x=x)
Choosing x and the bin number carefully, this function works.
But it is very dependent on the bin number: choosing different values of this parameter gives different results.
In particular, if my input is an array of constant values:
x=[0,0,0,....,0,0,0]
the entropy of this variable obviously has to be 0. If I choose a bin number equal to 1 I get the right answer, but with other values I get strange, nonsensical (negative) answers. My feeling is that numpy.histogram has the arguments normed=True or density=True which (as the official documentation says) should return the normalized histogram, and that I probably make an error at the point where I switch from the probability density function (the output of numpy.histogram) to the probability mass function (the input of the Shannon entropy), where I do:
p,binedg= histogram(a,bins,normed=True)
p=p/len(p)
I would like to find a way to solve these problems: an efficient method to compute the Shannon entropy that does not depend on the bin number.
I wrote a function to compute the Shannon entropy of a distribution of several variables, but I run into the same problem.
The code is below; the input of shannon_entropydd is an array in which each position holds one of the variables involved in the statistical computation:
def intNd(c, axes):
    assert len(c.shape) == len(axes)
    assert all([c.shape[i] == axes[i].shape[0] for i in range(len(axes))])
    if len(axes) == 1:
        return scint.simps(c, axes[0])
    else:
        return intNd(scint.simps(c, axes[-1]), axes[:-1])

def shannon_entropydd(c, bins=30):
    hist, ax = histogramdd(c, bins, normed=True)
    for i in range(len(ax)):
        ax[i] = ax[i][:-1]
    p = -hist*log2(hist)
    p[isnan(p)] = 0
    return intNd(p, ax)
I need these quantities in order to compute the mutual information between certain sets of variables:
M_info(x,y,z) = H(x) + H(y) + H(z) - H(x,y,z)
where H(x) is the Shannon entropy of the variable x.
I have to find a way to compute these quantities, so if someone has a completely different kind of code that works, I can switch to it; I don't need to repair this code, but to find a correct way to compute these statistical functions!
The result will depend pretty strongly on the estimated density. Can you assume a specific form for the density? You can reduce the dependence of the result on the estimate if you avoid histograms or other general-purpose estimates such as kernel density estimates. If you can give more detail about the variables involved, I can make more specific comments.
I worked with estimates of mutual information as part of the work for my dissertation [1]. There is some stuff about MI in section 8.1 and appendix F.
[1] http://riso.sourceforge.net/docs/dodier-dissertation.pdf
I think that if you choose bins = 1, you will always find an entropy of 0, as there is no "uncertainty" over the possible bin the values are in ("uncertainty" is what entropy measures). You should choose a number of bins "big enough" to account for the diversity of the values your variable can take. If you have discrete values: for binary values you should have bins >= 2, if the values your variable can take are in {0,1,2} you should have bins >= 3, and so on...
I must say that I did not read your code, but this works for me:
import numpy as np

x = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
bins = 10
cx = np.histogram(x, bins)[0]   # raw counts per bin

def entropy(c):
    c_normalized = c / float(np.sum(c))                    # turn counts into probabilities
    c_normalized = c_normalized[np.nonzero(c_normalized)]  # drop empty bins (0*log 0 = 0)
    h = -np.sum(c_normalized * np.log(c_normalized))
    return h

hx = entropy(cx)
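Building on that counting approach, one rough sketch of the mutual-information formula from the question is to feed joint and marginal histogram counts through the same kind of entropy function; note that the number of bins is still a choice you have to make:

import numpy as np

def entropy(c):
    # entropy from raw bin counts, ignoring empty bins
    p = c / float(np.sum(c))
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(x, y, z, bins=10):
    # M_info(x,y,z) = H(x) + H(y) + H(z) - H(x,y,z), estimated from histograms
    hx = entropy(np.histogram(x, bins)[0])
    hy = entropy(np.histogram(y, bins)[0])
    hz = entropy(np.histogram(z, bins)[0])
    hxyz = entropy(np.histogramdd((x, y, z), bins)[0].ravel())
    return hx + hy + hz - hxyz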
I would like to integrate a function in Python and provide the probability density (measure) used to sample values. If it's not obvious, integrating f(x)dx over [a,b] implicitly uses the uniform probability density over [a,b], and I would like to use my own probability density (e.g. exponential).
I can do it myself, using np.random.* but then
I miss the optimizations available in scipy.integrate.quad. Or maybe all those optimizations assume the uniform density?
I need to do the error estimation myself, which is not trivial. Or maybe it is? Maybe the error is just the variance of sum(f(x))/n?
Any ideas?
As unutbu said, if you have the density function, then you can just integrate the product of your function with the pdf using scipy.integrate.quad.
For the distributions that are available in scipy.stats, we can also just use the expect method.
For example
>>> import numpy as np
>>> from scipy import stats
>>> f = lambda x: x**2
>>> stats.norm.expect(f, loc=0, scale=1)
1.0000000000000011
>>> stats.norm.expect(f, loc=0, scale=np.sqrt(2))
1.9999999999999996
scipy.integrate.quad also has some predefined weight functions, although they are not normalized to be probability density functions.
The approximation error depends on the settings for the call to integrate.quad.
Just for the sake of brevity, four ways were suggested for calculating the expected value of f(x) under the probability density p(x):
Assuming p is given in closed form, use scipy.integrate.quad to integrate f(x)p(x).
Assuming p can be sampled from, draw N samples X from p, then estimate the expected value by np.mean(f(X)) and its error by np.std(f(X))/np.sqrt(N) (see the sketch after this list).
Assuming p is available in scipy.stats (e.g. stats.norm), use its expect method: stats.norm.expect(f).
Assuming we have the CDF of the distribution rather than p(x), calculate H = Inverse[CDF] and then integrate f(H(y)) over [0, 1] using scipy.integrate.quad.
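As a sketch of the second option, with a standard normal standing in for p (purely as an example):

import numpy as np
from scipy import stats

f = lambda x: x**2
N = 100000

X = stats.norm.rvs(size=N)                 # sample from p
estimate = np.mean(f(X))                   # Monte Carlo estimate of E[f(X)]
std_error = np.std(f(X)) / np.sqrt(N)      # standard error of the estimate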
Another possibility would be to integrate x -> f(H(x)), where H is the inverse of the cumulative distribution function of your probability distribution.
[This is because of a change of variable: substituting y = CDF(x) and noting that p(x) = CDF'(x) gives dy = p(x)dx, and thus int{f(x)p(x)dx} == int{f(x)dy} == int{f(H(y))dy}, with H the inverse of the CDF.]
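A minimal sketch of that change of variable, using an exponential distribution purely as an example (the exact value of E[X^2] there is 2); heavy-tailed distributions may need extra care near the endpoints:

import numpy as np
from scipy import stats
from scipy.integrate import quad

f = lambda x: x**2

# integrate f(H(y)) dy over [0, 1], where H = CDF^-1 (the ppf)
value, abserr = quad(lambda y: f(stats.expon.ppf(y)), 0, 1)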
I have a signal generated by a simulation program. Because the solver in this program has a variable time step, I have a signal with unevenly spaced data. I have two lists, a list with the signal values, and another list with the times at which each value occurred. The data could be something like this
from numpy import logspace, sin, pi

npts = 500
t = logspace(0, 1, npts)
f1 = 0.5
f2 = 0.6
sig = (1 + sin(2*pi*f1*t)) + (1 + sin(2*pi*f2*t))
I would like to be able to perform a frequency analysis on this signal using python. It seems I cannot use the fft function in numpy, because this requires evenly spaced data. Are there any standard functions which could help me find the frequencies contained in this signal?
The most common algorithm for this kind of problem is least-squares spectral analysis, better known as the Lomb-Scargle periodogram. It looks like this will be in a future release of the scipy.signal package. Maybe there is a current version, but I can't seem to find it... In addition, there is some code available from Astropython, which I will not copy in its entirety, but it essentially creates a lomb module; you can use the following code to get some values out. What you need to do is the following:
import numpy
import lomb
x = numpy.arange(10)
y = numpy.sin(x)
fx,fy, nout, jmax, prob = lomb.fasper(x,y, 6., 6.)
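As a side note, newer SciPy releases do ship a Lomb-Scargle implementation as scipy.signal.lombscargle; a minimal sketch on the question's example signal (the frequency grid here is an arbitrary choice):

import numpy as np
from scipy.signal import lombscargle

npts = 500
t = np.logspace(0, 1, npts)                 # unevenly spaced sample times
f1, f2 = 0.5, 0.6
sig = (1 + np.sin(2*np.pi*f1*t)) + (1 + np.sin(2*np.pi*f2*t))

# angular frequencies to scan; peaks should show up near 2*pi*0.5 and 2*pi*0.6
freqs = np.linspace(0.1, 10.0, 1000)
pgram = lombscargle(t, sig - sig.mean(), freqs)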
Very simple, just look up the formula for a Fourier transform, and implement it as a discrete sum over your data values:
given a set of values f(x) over some set of x, then for each frequency k,
F(k) = sum_x f(x) * exp(-i*k*x)
Choose your k's ranging from 0 up to 2*pi / (minimum separation in x),
and you can use 2*pi / max(x) as the increment size.
For a test case, use something for which you know the correct answer, e.g. a single cos(k'*x) for some k', or a Gaussian.
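A rough sketch of that discrete sum for unevenly spaced x, with the frequency grid chosen as suggested above (this is a naive O(N*K) loop, not an FFT):

import numpy as np

def dft_nonuniform(x, fx, ks):
    # F(k) = sum over samples of f(x) * exp(-i*k*x)
    return np.array([np.sum(fx * np.exp(-1j * k * x)) for k in ks])

def freq_grid(x):
    # k from 0 up to 2*pi / (minimum separation), in steps of 2*pi / max(x)
    dx = np.diff(np.sort(x))
    return np.arange(0.0, 2*np.pi / dx.min(), 2*np.pi / x.max())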
An easy way out is to interpolate to evenly spaced time intervals and then use the ordinary FFT.
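For example, a minimal sketch using np.interp followed by the regular FFT, with the resampling step taken from the finest spacing in t (reusing the t and sig arrays from the question):

import numpy as np

dt = np.diff(t).min()                        # resample at the finest original spacing
t_even = np.arange(t.min(), t.max(), dt)
sig_even = np.interp(t_even, t, sig)         # linear interpolation onto the even grid

spectrum = np.fft.rfft(sig_even)
freqs = np.fft.rfftfreq(len(t_even), d=dt)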