I've looked online and have yet to find an answer or a way to figure out the following.
I'm translating some MATLAB code to Python. In MATLAB I compute a kernel density estimate with the function:
[p,x] = ksdensity(data)
where p is the estimated density at the points x of the distribution.
SciPy has a function, but it only returns p.
Is there a way to find the probability at values of x?
That form of the ksdensity call automatically generates an arbitrary x. scipy.stats.gaussian_kde() returns a callable function that can be evaluated with any x of your choosing. The equivalent x would be np.linspace(data.min(), data.max(), 100).
import numpy as np
from scipy import stats
data = ...
kde = stats.gaussian_kde(data)
x = np.linspace(data.min(), data.max(), 100)
p = kde(x)
Another option is the kernel density estimator in the scikit-learn Python package, sklearn.neighbors.KernelDensity.
Here is a little example, similar to the one in the MATLAB documentation for ksdensity, for a mixture of two Gaussian samples:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
# similar to MATLAB ksdensity example x = [randn(30,1); 5+randn(30,1)];
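Vecvalues = np.concatenate((np.random.normal(0, 1, 30), np.random.normal(5, 1, 30)))[:, None]
# evaluation grid; the range -8..12 is an assumed choice that covers both modes
Vecpoints = np.linspace(-8, 12, 100)[:, None]
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(Vecvalues)
logkde = kde.score_samples(Vecpoints)  # log-density at each grid point
plt.plot(Vecpoints, np.exp(logkde))
plt.show()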
The plot this produces looks like:
MATLAB is orders of magnitude faster than KernelDensity when it comes to finding the optimal bandwidth. Any idea how to make KernelDensity faster? – Yuca Jul 16 '18 at 20:58
Hi, Yuca. MATLAB uses Scott's rule to estimate the bandwidth, which is fast but assumes the data come from a normal distribution. For more information, please see this post.
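For the bandwidth question in the comment above, a rule-of-thumb bandwidth can be computed in closed form instead of being searched for. The sketch below assumes one-dimensional data. It computes the Scott factor n**(-1/5) by hand, which is what scipy.stats.gaussian_kde uses by default, and converts it to a kernel standard deviation that could then be passed as the bandwidth argument of KernelDensity. The sample data and variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# illustrative bimodal sample, mirroring the MATLAB example above
data = np.concatenate((rng.normal(0, 1, 30), rng.normal(5, 1, 30)))

kde = stats.gaussian_kde(data)               # default bw_method is Scott's rule
scott_factor = len(data) ** (-1 / 5)         # n**(-1/(d+4)) with d = 1
bandwidth = scott_factor * data.std(ddof=1)  # kernel standard deviation

x = np.linspace(data.min(), data.max(), 100)
p = kde(x)                                   # density evaluated on the grid
```

Since this is a single formula evaluation, it avoids the cross-validated search entirely, at the price of the normality assumption mentioned above.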
I have a data intensive application where one of the core computations uses the percent point function of a lognormal distribution.
The code currently uses SciPy, but it is excruciatingly slow. Is there a way to compute or approximate this function more efficiently using NumPy or another package?
Is it possible that you're calling lognorm.ppf many times with different values rather than once with all the values in an array? That would slow it down a lot. When I tested it with 10^6 values, it took ~81 ms. I was able to cut that in half using spline interpolation, with an average absolute error of 2e-06 and a maximal absolute error of 0.0023. See the code below, but the gain does not seem worth it to me.
from scipy.stats import lognorm
import numpy as np
from scipy.interpolate import UnivariateSpline
s = 1.5
t = np.linspace(0,0.95,10**3)
vals = lognorm.ppf(t, s)
ppf = UnivariateSpline(t,vals,s=10e-10)
x = np.linspace(0,0.95,10**6)
error = np.abs(ppf(x)-lognorm.ppf(x, s))
%timeit ppf(x)
%timeit lognorm.ppf(x, s)
np.mean(error), np.max(error)
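Another option, offered as a sketch rather than a benchmarked result: the lognormal quantile function has the closed form scale * exp(s * Phi^-1(q)), so it can be evaluated directly with scipy.special.ndtri, the inverse of the standard normal CDF, without going through the generic scipy.stats machinery. The function name fast_lognorm_ppf is hypothetical, and loc is assumed to be 0.

```python
import numpy as np
from scipy.special import ndtri
from scipy.stats import lognorm

def fast_lognorm_ppf(q, s, scale=1.0):
    """Lognormal quantile function via the closed form scale * exp(s * ndtri(q))."""
    return scale * np.exp(s * ndtri(np.asarray(q, dtype=float)))

s = 1.5
q = np.linspace(0.001, 0.95, 10**6)
fast = fast_lognorm_ppf(q, s)
reference = lognorm.ppf(q, s)
max_abs_diff = np.max(np.abs(fast - reference))  # agreement check against SciPy
```

Unlike the spline, this introduces no interpolation error; whether it is actually faster on a given machine is worth checking with %timeit.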
Suppose I have a grid given by
import numpy as np
grid = np.linspace(0,20,1000)
I want to get a 1000-by-1 vector p so that if one were to plot points
(grid[i], p[i]) the graph would look like the density of a lognormal distribution.
Use scipy.stats to obtain the pdfs of probability distributions!
NumPy, in most (all?) cases, only supports sampling methods, not pdf calculations. What's needed depends on the use case.
Often the pdf plays no role in practical sampling-only implementations, as in this case, where sampling is reduced to normal-distribution sampling (itself often reduced to uniform sampling combined with other functions) followed by the exponential function (code):
double rk_lognormal(rk_state *state, double mean, double sigma)
{
    return exp(rk_normal(state, mean, sigma));
}
Make sure to read the docs above to learn how to use these!
Example code:
import numpy as np
import scipy.stats as spt
import matplotlib.pyplot as plt
rv = spt.lognorm(0.954)  # "frozen" RV (shape parameter fixed)
x_points = np.linspace(1, 20, 1000)  # float grid, 0 excluded
plt.scatter(x_points, rv.pdf(x_points))
plt.show()
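For the exact grid in the question, a sketch along these lines may help. It assumes the lognormal is specified by the mean mu and standard deviation sigma of the underlying normal (both values below are illustrative), which maps onto scipy.stats.lognorm as s=sigma and scale=exp(mu). The closed-form density is computed alongside as a cross-check.

```python
import numpy as np
from scipy import stats

mu, sigma = 1.0, 0.5                 # illustrative parameters of the underlying normal
grid = np.linspace(0, 20, 1000)

rv = stats.lognorm(s=sigma, scale=np.exp(mu))
p = rv.pdf(grid)                     # density on the grid, one value per grid point

# closed-form lognormal density on the strictly positive grid points
xp = grid[1:]
p_closed = np.exp(-(np.log(xp) - mu) ** 2 / (2 * sigma ** 2)) / (xp * sigma * np.sqrt(2 * np.pi))
```

Plotting grid against p then gives the density curve; p[:, None] reshapes it to 1000-by-1 if a column vector is needed.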
I'm new to Bayesian stats and I'm trying to estimate the posterior for a Poisson likelihood with a gamma prior in Python. The parameter I'm trying to estimate is the lambda parameter of the Poisson distribution. I think the posterior will take the form of a gamma distribution (conjugate prior?) but I don't want to leverage that. The only thing I'm given is the data (named "my_data"). Here's my code:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats
prior= scipy.stats.gamma.pdf(x,alpha,beta) #the parameters dont matter for now
likelihood_temp = lambda yi, a: scipy.stats.poisson.pmf(yi, a)
likelihood = lambda y, a: np.log(np.prod([likelihood_temp(data, a) for data in my_data]))
posterior=likelihood(my_data,lambda_estimate) * prior
When I try to plot the posterior I get an empty plot. I plotted the prior and it looks fine, so I think the issue is the likelihood. I took the log because the data is fairly large and I didn't want things to get unstable. Can anyone point out the issues in my code? Any help would be appreciated.
In Bayesian statistics, one goal is to calculate the posterior distribution of the parameter (lambda), given the data and the prior, over a range of possible values for lambda. In your code, you are calculating the prior over the array x, but you are using a single value of lambda to calculate the likelihood. The posterior and likelihood should be evaluated over x as well, something like:
posterior = [likelihood(my_data, lambda_i) for lambda_i in x] * prior
(assuming you are not taking the logs of the prior and likelihood)
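A minimal sketch of that grid approach, done in log space to avoid the underflow that np.prod over many pmf values runs into. The grid x and the prior parameters alpha and beta are made-up placeholders (beta is assumed to be a rate, hence scale=1/beta), and my_data is simulated here only so that the example runs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
my_data = rng.poisson(4.0, size=200)       # placeholder data
alpha, beta = 2.0, 1.0                     # assumed prior shape and rate
x = np.linspace(0.01, 10, 1000)            # grid of candidate lambda values
dx = x[1] - x[0]

log_prior = stats.gamma.logpdf(x, a=alpha, scale=1 / beta)
# sum of log-pmf over the data for every lambda on the grid
log_like = stats.poisson.logpmf(my_data[:, None], x[None, :]).sum(axis=0)

log_post = log_prior + log_like
log_post -= log_post.max()                 # stabilise before exponentiating
posterior = np.exp(log_post)
posterior /= posterior.sum() * dx          # normalise to integrate to 1 on the grid

post_mean = (x * posterior).sum() * dx
```

For this model the grid result can be checked against the conjugate answer, a gamma distribution with shape alpha + sum(my_data) and rate beta + len(my_data).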
You might want to take a look at the PyMC3 library.
I would recommend having a look at the conjugate_prior module.
You could just type:
from conjugate_prior import GammaPoisson
model = GammaPoisson(prior_a, prior_b)
model = model.update(...)
credible_interval = model.posterior(lower_bound, upper_bound)
I'm having trouble finding quantile functions for well-known probability distributions in Python, do they exist? In particular, is there an inverse normal distribution function? I couldn't find anything in either Numpy or Scipy.
Check the .ppf() method of any distribution class in scipy.stats.
This is the quantile function (also known as the percent point function or inverse CDF).
An example with the exponential distribution from scipy.stats:
# analysis libs
import scipy.stats  # import the submodule explicitly so scipy.stats.expon resolves
import numpy as np
# plotting libs
import matplotlib as mpl
import matplotlib.pyplot as plt
# Example with the exponential distribution
c = 0
lamb = 2
# Create a frozen exponential distribution instance with specified parameters
exp_obj = scipy.stats.expon(c,1/float(lamb))
x_in = np.linspace(0,1,200) # 200 numbers in [0,1], input for ppf()
y_out = exp_obj.ppf(x_in)
plt.plot(x_in,y_out) # graphically check the results of the inverse CDF
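For the inverse normal distribution asked about in the question, the same .ppf() method applies to scipy.stats.norm. A short sketch (the probabilities, loc, and scale are illustrative):

```python
import numpy as np
from scipy import stats

# quantile function (inverse CDF) of the standard normal
z_975 = stats.norm.ppf(0.975)            # approximately 1.96

# vectorised call with a non-standard mean (loc) and standard deviation (scale)
q = np.array([0.025, 0.5, 0.975])
quantiles = stats.norm.ppf(q, loc=10, scale=2)

# round trip: the CDF undoes the quantile function
round_trip = stats.norm.cdf(quantiles, loc=10, scale=2)
```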
It seems new but I've found this about numpy and quantile. Maybe you can have a look (not tested)
To get the correlation between two arrays in python, I am using:
from scipy.stats import pearsonr
x, y = [1,2,3], [1,5,7]
cor, p = pearsonr(x, y)
However, as stated in the docs, the p-value returned from pearsonr() is only meaningful with datasets larger than 500. So how can I get a p-value that is reasonable for small datasets?
My temporary solution:
After reading up on linear regression, I have come up with my own small script, which uses the Fisher transformation to get the z-score, from which the p-value is calculated:
import numpy as np
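from scipy.stats import norm  # norm.cdf plays the role of zprob, which newer SciPy no longer provides
n = len(x)
# Fisher z-transformation of the correlation, scaled by sqrt(n - 3)
z = np.log((1+cor)/(1-cor))*0.5*np.sqrt(n-3)
p = norm.cdf(-z)  # one-sided p-value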
It works. However, I am not sure whether it is more reasonable than the p-value given by pearsonr(). Is there a Python module which already has this functionality? I have not been able to find it in SciPy or Statsmodels.
Edit to clarify:
The dataset in my example is simplified. My real dataset is two arrays of 10-50 values.
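The Fisher-transformation approach described above can be written as a small self-contained function. This is a sketch, not a recommendation over pearsonr(): it uses np.arctanh, which equals 0.5*log((1+r)/(1-r)), together with the normal approximation, returns a two-sided p-value, and needs more than 3 observations. The function name and the sample arrays are hypothetical.

```python
import numpy as np
from scipy.stats import norm, pearsonr

def fisher_z_pvalue(x, y):
    """Two-sided p-value for H0: rho = 0 using the Fisher z-transformation."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    if n <= 3:
        raise ValueError("the Fisher approximation needs more than 3 observations")
    r = np.corrcoef(x, y)[0, 1]
    z = np.arctanh(r) * np.sqrt(n - 3)   # approximately standard normal under H0
    return r, 2 * norm.sf(abs(z))

rng = np.random.default_rng(0)
x = rng.normal(size=20)                  # illustrative data within the 10-50 range
y = 0.5 * x + rng.normal(size=20)

r, p_fisher = fisher_z_pvalue(x, y)
r_ref, p_ref = pearsonr(x, y)            # pearsonr's own p-value, for comparison
```

Printing p_fisher next to p_ref for arrays of the real size gives a direct feel for how far the two p-values diverge on small samples.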