I have a data intensive application where one of the core computations uses the percent point function of a lognormal distribution.
The code currently uses Scipy, but it is excruciatingly slow. Is there a way to compute/approximate this function using numpy or other package more efficiently?
Is it possible that you're calling lognorm.ppf many times with different values rather than once with all the values in an array? That would slow it down a lot. When I tested it with 10^6 values it took ~81 ms. I was able to cut that roughly in half using spline interpolation, with an average absolute error of 2e-06 and a maximum absolute error of 0.0023. See the code below, but that trade-off doesn't seem worth it to me.
from scipy.stats import lognorm
import numpy as np
from scipy.interpolate import UnivariateSpline

s = 1.5
t = np.linspace(0, 0.95, 10**3)             # coarse grid of probabilities
vals = lognorm.ppf(t, s)                    # exact ppf on the coarse grid
ppf = UnivariateSpline(t, vals, s=10e-10)   # spline approximation of the ppf

x = np.linspace(0, 0.95, 10**6)             # fine grid for the comparison
error = np.abs(ppf(x) - lognorm.ppf(x, s))

%timeit ppf(x)
%timeit lognorm.ppf(x, s)
np.mean(error), np.max(error)
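As an aside: for the default loc=0, scale=1 parameterization (an assumption about your use case), lognorm.ppf(q, s) is exactly exp(s * Phi^-1(q)), so calling scipy.special.ndtri directly skips the generic distribution machinery. A minimal sketch:
from scipy.special import ndtri
import numpy as np

def lognorm_ppf_fast(q, s):
    # exact for loc=0, scale=1; compare with np.allclose(lognorm_ppf_fast(x, s), lognorm.ppf(x, s))
    return np.exp(s * ndtri(q))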
I am doing numerical integration for discrete data in Python. The code I use looks like this:
from scipy.integrate import simpson
import numpy as np
t = np.array((0,1,2,3,4,5))
y = np.array((1,2,3,4,5,6))
res = simpson(y=y, x=t)
Now I want to find t1, for which the integral of y from t[0] to t1 equals a given percentage of res (like 50% or 70%).
For other integration methods, like scipy.integrate.trapezoid, it is easier to work out the interpolation formula explicitly because the segments are straight lines. But for scipy.integrate.simpson it gets too complicated, especially for a large array. Is there a convenient way?
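A minimal sketch of the straight-line (trapezoid) inversion described above: the running integral is piecewise linear, so it can be inverted with np.interp. For Simpson's rule this would only be an approximation.
from scipy.integrate import cumulative_trapezoid
import numpy as np

t = np.array((0, 1, 2, 3, 4, 5))
y = np.array((1, 2, 3, 4, 5, 6))

cum = cumulative_trapezoid(y, t, initial=0)  # running integral at each t
target = 0.5 * cum[-1]                       # e.g. the 50% level of the total
t1 = np.interp(target, cum, t)               # invert the monotone running integral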
Playing around with fitting data to Weibull distributions, using the Matlab wblrnd and wblfit functions and the Python scipy.stats.weibull_min.fit function, I found that Matlab outperforms Python by almost two orders of magnitude. I am looking for some help to improve the performance of the Python code.
The problem:
While converting Matlab code to Python, I came across the following code:
weibull_parameters = zeros(10000, 2)
for i = 1:10000
    data = sort(wblrnd(alpha, beta, 1, 24))
    [weibull_parameters(i, :), ~] = wblfit(data, confidence_interval, censoring_array)
end
For each of the 10,000 iterations, this code generates 24 random numbers from a Weibull distribution and then fits the resulting data vector to a Weibull distribution again.
In Python I translated this to:
from scipy.stats import weibull_min
import numpy as np
data = np.sort(alpha * np.random.default_rng().weibull(beta, (10000, 24)))
weibull_parameters = np.zeros((10000, 2))
for idx, row in enumerate(data):
    weibull_parameters[idx, :] = weibull_min.fit(row, floc=0)[2::-2]
Here I generate the full random data in one go and then iterate over the rows to get the corresponding Weibull parameters using the weibull_min.fit function. The slicing at the end is to select only the scale and shape parameters in the output and put them in the correct order.
The main problem I encountered is that the calculation performance in Python is terrible. Matlab runs this code in a few seconds; in Python it takes 1-1.5 seconds per 100 iterations (on my laptop), so the difference in performance is almost two orders of magnitude.
Is there a way that I can improve the performance in Python? Is it possible to vectorize the fitting calculation? Unfortunately I couldn't find anything online on this topic.
Note 1: Matlab allows the user to specify a confidence interval in the wblfit function; I couldn't find a way to include that in Python, so I ignored it.
Note 2: The only option I could find to include censoring was the surpyval package, but its performance was even more dreadful (about 10 seconds per 100 iterations).
Python is not known for being the fastest language out there. There are things you can do to speed it up, but you will find there is a balance between accuracy and speed.
As for ways to fit a Weibull distribution, there are several packages that do this. scipy, surpyval, lifelines, and reliability will all fit complete data; the last three can also handle censored data, which scipy cannot.
I'm the author of reliability, so I'll show you an example using this package:
from reliability.Distributions import Weibull_Distribution
from reliability.Fitters import Fit_Weibull_2P
import time
import numpy as np
rows=100
samples = 24
data_array = np.empty((rows,samples))
true_parameters = np.empty((rows,2))
for i in range(rows):
    alpha = np.random.randint(low=1, high=999) + np.random.rand()  # alpha between 1 and 1000
    beta = np.random.randint(low=1, high=10) - np.random.rand() / 2  # beta between 0.5 and 10
    true_parameters[i][0] = alpha
    true_parameters[i][1] = beta
    dist = Weibull_Distribution(alpha=alpha, beta=beta)
    data_array[i] = dist.random_samples(samples)
start_time = time.time()
parameters = np.empty((rows,2))
for i in range(rows):
    fit = Fit_Weibull_2P(failures=data_array[i], show_probability_plot=False, print_results=False)
    parameters[i][0] = fit.alpha
    parameters[i][1] = fit.beta
runtime = time.time() - start_time

# np.set_printoptions(suppress=True)  # suppresses the scientific notation used by numpy
# print('True parameters:')
# print(true_parameters)
# print('Fitted parameters:')
# print(parameters)
print('Runtime:',runtime,'seconds')
print('Runtime per iteration:',runtime/rows,'seconds')
When I run this it gives:
Runtime: 3.378781318664551 seconds
Runtime per iteration: 0.033787813186645504 seconds
Based on the times you quoted in your question, this is about twice as slow as scipy but only one third of the time taken by surpyval.
I hope this helps to show you a different way to do the same thing, but I understand it probably isn't the performance improvement you are seeking. The only way you will get a big performance improvement is to use least squares estimation in pure Python, perhaps accelerated using numba. Such an approach will likely give you results that are inferior to MLE, but as I said earlier, there is a balance between speed and accuracy, as well as between speed and coding convenience.
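To illustrate that last point, here is a rough sketch of the least-squares idea (probability-plot regression with median-rank plotting positions), vectorized over all sample sets with plain numpy. The function name and plotting-position constants are my own choices, and the estimates will generally be less accurate than MLE:
import numpy as np

def weibull_lsq(data):
    # data: array of shape (n_sets, n_samples); returns (n_sets, 2) of (alpha, beta)
    x = np.sort(data, axis=1)
    n = x.shape[1]
    ranks = np.arange(1, n + 1)
    F = (ranks - 0.3) / (n + 0.4)         # median-rank plotting positions
    yy = np.log(-np.log(1.0 - F))         # linearised CDF, shared by all rows
    xx = np.log(x)
    xbar = xx.mean(axis=1, keepdims=True)
    beta = ((xx - xbar) * (yy - yy.mean())).sum(axis=1) / ((xx - xbar) ** 2).sum(axis=1)
    alpha = np.exp(xbar[:, 0] - yy.mean() / beta)
    return np.column_stack((alpha, beta))
Because this is pure vectorized numpy it avoids the per-row Python loop entirely; for example, weibull_lsq(data) with the (10000, 24) array from the question returns all parameter pairs in one call.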
Suppose I have a grid given by
import numpy as np
grid = np.linspace(0,20,1000)
I want to get a 1000-by-1 vector p so that if one were to plot points
(grid[i], p[i]) the graph would look like the density of a lognormal distribution.
Use scipy.stats to obtain pdfs of probability distributions!
NumPy, in most (all?) cases, only supports sampling methods, not pdf calculations. What is needed surely depends on the use case.
Often the pdf plays no role in practical sampling-only implementations, as in this case, where lognormal sampling reduces to normal-distribution sampling (itself often reduced to uniform sampling combined with other functions) followed by the exponential function (code):
double rk_lognormal(rk_state *state, double mean, double sigma)
{
    return exp(rk_normal(state, mean, sigma));
}
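The same identity expressed in Python terms (a sketch of the idea, not numpy's actual internals):
import numpy as np

rng = np.random.default_rng(0)
samples = np.exp(rng.normal(loc=1.0, scale=0.5, size=10_000))  # statistically equivalent to rng.lognormal(mean=1.0, sigma=0.5, size=10_000)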
Make sure to read the docs above to learn how to use these!
Example code:
import numpy as np
import scipy.stats as spt
import matplotlib.pyplot as plt
rv = spt.lognorm(0.954) # "frozen" RV (shape-param fixed)
x_points = np.linspace(0.01, 20, 1000)  # 0 excluded: the lognormal pdf's support is x > 0
plt.scatter(x_points, rv.pdf(x_points))
plt.show()
Output: a scatter plot of the lognormal pdf over x_points.
I've looked online and have yet to find an answer or a way to figure out the following.
I'm translating some MATLAB code to Python, where in MATLAB I'm computing the kernel density estimate with the function:
[p,x] = ksdensity(data)
where p is the probability at point x in the distribution.
Scipy has a function but only returns p.
Is there a way to find the probability at values of x?
Thanks!
That form of the ksdensity call automatically generates an arbitrary x. scipy.stats.gaussian_kde() returns a callable function that can be evaluated with any x of your choosing. The equivalent x would be np.linspace(data.min(), data.max(), 100).
import numpy as np
from scipy import stats

data = ...  # your 1-D sample

kde = stats.gaussian_kde(data)                 # callable density estimate
x = np.linspace(data.min(), data.max(), 100)   # equivalent of the x that ksdensity generates
p = kde(x)                                     # density evaluated at x
Another option is the kernel density estimator in the Scikit-Learn Python package, sklearn.neighbors.KernelDensity
Here is a little example similar to the Matlab documentation for ksdensity for a Gaussian distribution:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
np.random.seed(12345)
# similar to MATLAB ksdensity example x = [randn(30,1); 5+randn(30,1)];
Vecvalues=np.concatenate((np.random.normal(0,1,30), np.random.normal(5,1,30)))[:,None]
Vecpoints=np.linspace(-8,12,100)[:,None]
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(Vecvalues)
logkde = kde.score_samples(Vecpoints)
plt.plot(Vecpoints,np.exp(logkde))
plt.show()
The plot this produces shows the bimodal density estimate, with peaks near 0 and 5, matching the Matlab ksdensity example.
Matlab is orders of magnitude faster than KernelDensity when it comes to finding the optimal bandwidth. Any idea how to make KernelDensity faster? – Yuca Jul 16 '18 at 20:58
Hi, Yuca. Matlab uses Scott's rule to estimate the bandwidth, which is fast but assumes that the data come from a normal distribution. For more information, please see this post.
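A rough sketch of that idea (assuming 1-D data like Vecvalues above): compute Scott's bandwidth yourself and pass it straight to KernelDensity instead of searching for it.
import numpy as np
from sklearn.neighbors import KernelDensity

def scott_bandwidth(x):
    # Scott's rule for 1-D data: sigma * n**(-1/5)
    x = np.asarray(x).ravel()
    return x.std(ddof=1) * len(x) ** (-1.0 / 5.0)

# kde = KernelDensity(kernel='gaussian', bandwidth=scott_bandwidth(Vecvalues)).fit(Vecvalues)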
To get the correlation between two arrays in python, I am using:
from scipy.stats import pearsonr
x, y = [1,2,3], [1,5,7]
cor, p = pearsonr(x, y)
However, as stated in the docs, the p-value returned from pearsonr() is only meaningful with datasets larger than 500. So how can I get a p-value that is reasonable for small datasets?
My temporary solution:
After reading up on linear regression, I came up with my own small script, which basically uses the Fisher transformation to get the z-score, from which the p-value is calculated:
import numpy as np
from scipy.stats import zprob  # note: removed in modern SciPy; scipy.stats.norm.sf(z) gives the same result as zprob(-z)

n = len(x)
z = 0.5 * np.log((1 + cor) / (1 - cor)) * np.sqrt(n - 3)  # Fisher z-statistic
p = zprob(-z)  # one-sided p-value
It works. However, I am not sure whether it is more reasonable than the p-value given by pearsonr(). Is there a Python module which already has this functionality? I have not been able to find it in SciPy or Statsmodels.
Edit to clarify:
The dataset in my example is simplified. My real dataset is two arrays of 10-50 values.
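For reference, a self-contained sketch of the Fisher-transform p-value described above, using scipy.stats.norm instead of the old zprob (assuming a two-sided p-value is wanted; drop the factor of 2 for one-sided):
import numpy as np
from scipy.stats import norm, pearsonr

def pearson_fisher_p(x, y):
    r, _ = pearsonr(x, y)
    n = len(x)
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - 3)  # Fisher z-statistic
    return r, 2 * norm.sf(abs(z))                         # two-sided p-value

r, p = pearson_fisher_p([1, 2, 3, 4, 6], [1, 5, 7, 8, 12])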