Playing around with fitting data to Weibull distributions, using Matlab wblrnd and wblfit functions, and Python scipy.stats.weibull_min.fit function, I found that Matlab outperforms Python by almost 2 orders of magnitude. I am looking for some help to improve the performance of the Python code.
The problem:
While converting Matlab code to Python, I came across the following code:
weibull_parameters = zeros(10000, 2)
for i = 1:10000
data = sort(wblrnd(alpha, beta, 1, 24))
[weibull_parameters(i, :), ~] = wblfit(data, confidence_interval, censoring_array)
end
This code generates 24 random numbers from a Weibull distribution and then fits the resulting data vector again to a Weibull distribution.
In Python I translated this to:
from scipy.stats import weibull_min
import numpy as np
data = np.sort(alpha * np.random.default_rng().weibull(beta, (10000, 24)))
weibull_parameters = np.zeros((10000, 2))
for idx, row in enumerate(data):
    weibull_parameters[idx, :] = weibull_min.fit(row, floc=0)[2::-2]
Here I generate the full random data in one go and then iterate over the rows to get the corresponding Weibull parameters using the weibull_min.fit function. The slicing at the end is to select only the scale and shape parameters in the output and put them in the correct order.
The main problem I encountered is that the calculation performance in Python is terrible. Matlab runs this code in a few seconds, whereas Python takes 1-1.5 seconds per 100 iterations (on my laptop), so the difference in performance is almost 2 orders of magnitude.
Is there a way that I can improve the performance in Python? Is it possible to vectorize the fitting calculation? Unfortunately, I couldn't find anything online on this topic.
Note 1: Matlab allows the user to specify a confidence interval in the wblfit function; I couldn't find a way to include that in Python, so I ignored it.
Note 2: The only option I could find to include censoring was using the surpyval package, however the performance was even more dreadful (about 10 seconds per 100 iterations)
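On the vectorization question: for complete (uncensored) data with the location fixed at 0, the Weibull maximum-likelihood equations reduce to a one-dimensional root-finding problem in the shape parameter, so all rows can be solved at once with NumPy. The following is only a sketch under those assumptions. The function name weibull_mle_rows, the starting value and the fixed iteration count are illustrative choices, it is not necessarily what wblfit or weibull_min.fit do internally, and it handles neither censoring nor confidence intervals.

```python
import numpy as np

def weibull_mle_rows(data, n_iter=25):
    """Row-wise 2-parameter Weibull MLE (location 0, no censoring).

    Returns an array of shape (rows, 2) holding (scale, shape) per row.
    """
    log_x = np.log(np.asarray(data, dtype=float))
    mean_log = log_x.mean(axis=1, keepdims=True)
    z = log_x - mean_log  # centred logs keep exp(beta * z) well scaled
    # Moment-based starting value: std(log X) = pi / (shape * sqrt(6))
    beta = np.pi / (np.sqrt(6.0) * z.std(axis=1, keepdims=True))
    for _ in range(n_iter):
        w = np.exp(beta * z)
        s0 = w.sum(axis=1, keepdims=True)
        s1 = (w * z).sum(axis=1, keepdims=True)
        s2 = (w * z * z).sum(axis=1, keepdims=True)
        g = s1 / s0 - 1.0 / beta  # profile score equation, zero at the MLE
        dg = s2 / s0 - (s1 / s0) ** 2 + 1.0 / beta ** 2
        # Newton step, guarded so the shape estimate stays positive
        beta = np.maximum(beta - g / dg, 0.5 * beta)
    mean_w = np.exp(beta * z).mean(axis=1, keepdims=True)
    scale = np.exp(mean_log + np.log(mean_w) / beta)
    return np.hstack((scale, beta))

# Hypothetical true parameters, chosen only for this demonstration
rng = np.random.default_rng(0)
samples = 50.0 * rng.weibull(2.0, size=(10000, 24))
params_mle = weibull_mle_rows(samples)
print(np.median(params_mle, axis=0))
```

Because every step is an array operation over all rows, the Python-level loop runs only n_iter times regardless of the number of rows.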
Python is not known for being the fastest language out there. There are things you can do to speed it up, but you will find there is a balance between accuracy and speed.
As for ways to fit a Weibull distribution, there are several packages to do this. The packages scipy, surpyval, lifelines, and reliability will all fit complete data. The last 3 will also handle censored data which scipy will not.
I'm the author of reliability, so I'll present you an example using this package:
from reliability.Distributions import Weibull_Distribution
from reliability.Fitters import Fit_Weibull_2P
import time
import numpy as np
rows=100
samples = 24
data_array = np.empty((rows,samples))
true_parameters = np.empty((rows,2))
for i in range(rows):
    alpha = np.random.randint(low=1,high=999) + np.random.rand() #alpha between 1 and 999
    beta = np.random.randint(low=1,high=10) - np.random.rand()/2 #beta between 0.5 and 9
    true_parameters[i][0] = alpha
    true_parameters[i][1] = beta
    dist = Weibull_Distribution(alpha=alpha,beta=beta)
    data_array[i]=dist.random_samples(samples)
start_time = time.time()
parameters = np.empty((rows,2))
for i in range(rows):
    fit = Fit_Weibull_2P(failures=data_array[i],show_probability_plot=False,print_results=False)
    parameters[i][0] = fit.alpha
    parameters[i][1] = fit.beta
runtime = time.time()-start_time
# np.set_printoptions(suppress=True) #suppresses the scientific notation used by numpy
# print('True parameters:')
# print(true_parameters)
# print('Fitted parameters:')
# print(parameters)
print('Runtime:',runtime,'seconds')
print('Runtime per iteration:',runtime/rows,'seconds')
When I run this it gives:
Runtime: 3.378781318664551 seconds
Runtime per iteration: 0.033787813186645504 seconds
Based on the times you quoted in your question, this is about twice as slow as scipy but only one third of the time taken by surpyval.
I hope this helps to show you a different way to do the same thing, but I understand it probably isn't the performance improvement you are seeking. The only way you will get a big performance improvement is to use least squares estimation in pure python, perhaps accelerated using numba. Such an approach will likely give you results that are inferior to MLE, but as I said earlier, there is a balance between speed and accuracy, as well as between speed and coding convenience.
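To make the least squares suggestion concrete, here is a minimal vectorized sketch in NumPy (no numba). It linearises the Weibull CDF as ln(-ln(1 - F)) = beta*ln(t) - beta*ln(alpha) and fits a straight line per row. The function name weibull_lsq_rows and the use of Bernard's approximation for the plotting positions are assumptions for illustration, and the sketch only covers complete (uncensored) data.

```python
import numpy as np

def weibull_lsq_rows(data):
    """Least squares (rank regression, y on x) Weibull fit for each row.

    Returns an array of shape (rows, 2) holding (alpha, beta) per row.
    """
    data = np.sort(np.asarray(data, dtype=float), axis=1)
    n = data.shape[1]
    # Bernard's approximation to the median ranks (assumed plotting position)
    ranks = (np.arange(1, n + 1) - 0.3) / (n + 0.4)
    y = np.log(-np.log(1.0 - ranks))  # shape (n,), same for every row
    x = np.log(data)                  # shape (rows, n)
    x_c = x - x.mean(axis=1, keepdims=True)
    y_c = y - y.mean()
    beta = (x_c * y_c).sum(axis=1) / (x_c ** 2).sum(axis=1)  # slope
    intercept = y.mean() - beta * x.mean(axis=1)
    alpha = np.exp(-intercept / beta)
    return np.column_stack((alpha, beta))

# Hypothetical true parameters, chosen only for this demonstration
rng = np.random.default_rng(0)
samples = 50.0 * rng.weibull(2.0, size=(10000, 24))
params_lsq = weibull_lsq_rows(samples)
print(np.median(params_lsq, axis=0))
```

There is no Python-level loop over rows here, so the cost is dominated by the sort and a few array reductions. As noted above, the estimates will generally differ from the MLE ones.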
I am re-learning introductory statistics and wanted to try implementing my own versions of the general and unpooled formulas that find the T Value. I implemented it in 2 ways, one by just replicating the formulas as is as Python Functions. The other was to use Python's ability to generate a normal distribution and use that to find the difference in means. But I noticed my values were pretty different in both versions. So my question is why is there a difference? Is it with how the function works itself?
Here's the "generate a distribution itself" method:
from numpy.random import seed
from numpy.random import normal
from scipy import stats
from datetime import datetime
import math
#Plan: Generate 2 random normal distributions of the desired criteria. And T Test them
data1 = normal(loc=65.2, scale=7.8, size=30)
data2 = normal(loc=70.3, scale=8.4, size=30)
stats.ttest_ind(a=data1, b=data2)
Ttest_indResult(statistic=-2.029830829733737, pvalue=0.04696953433513939)
As you can see, it gives a T statistic of ~-2.0298 and a p value of ~ 0.0470.
Here's my "manual version":
def pop_2_mean_pooled_t(mean1, mean2, s1, s2, n1, n2):
    dof = (n1+n2)-2
    mean_diff = mean1 - mean2
    #The N part on the right
    right_n = math.sqrt((1/n1) + (1/n2))
    #The Sp part
    sp_numerator_left = ((n1-1)*(s1**2))
    sp_numerator_right = ((n2-1)*(s2**2))
    sp = math.sqrt((sp_numerator_left + sp_numerator_right)/(dof))
    pooled_sp = sp*right_n
    t = mean_diff/pooled_sp
    p = stats.t.cdf(t, dof)
    print("T is " +str(t))
    print("p is " +str(p))
    return t, p
pop_2_mean_pooled_t(65.2, 70.3, 7.8, 8.4, 30, 30)
T is -2.4368742610942298
p is 0.00895208222413155
(-2.4368742610942298, 0.00895208222413155)
As you can see, it gives a T statistic of ~-2.437 and a p value of ~0.009.
My question is why is there a discrepancy here? My "manual version" is closer to the example I was referencing. But surely the generator one should also be?
My understanding is that if a sample is significantly large enough, it would resemble a normal distribution. Therefore, one could generate a normal distribution using code and use that to approximate the corresponding T Values. For some reason, that differed quite a bit from my "manual" version
Your thinking is basically correct (I did not check your formulae, though). What you're encountering is in the nature of the problem: the two random samples you're drawing are, well, random, and they differ between runs, so you will always get a different p-value and t-statistic.
Two suggestions from me:
increase the sample size in the first snippet to hundreds (not 30): you should get much closer to the stats from the second snippet.
keep 30 samples in the first snippet but run the simulation several times; you will learn the distributions of p-values and t-statistics and, again, you can check the values from your second snippet against the simulated distributions.
(Some conceptual flaws occur in this approach, e.g. repeated testing affects the p-value, but let us put them aside for now; the goal is to see your two sets of values converge.)
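The second suggestion can be sketched as follows: draw many pairs of 30-element samples, compute the pooled t statistic for each pair, and look at the spread. This is only an illustration. The helper name pooled_t, the seed and the number of repetitions are arbitrary choices, and the statistic is computed directly with NumPy rather than through stats.ttest_ind.

```python
import numpy as np

def pooled_t(sample1, sample2):
    """Pooled-variance t statistic for each row pair of two (reps, n) arrays."""
    n1, n2 = sample1.shape[1], sample2.shape[1]
    dof = n1 + n2 - 2
    var1 = sample1.var(axis=1, ddof=1)
    var2 = sample2.var(axis=1, ddof=1)
    sp = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / dof)
    mean_diff = sample1.mean(axis=1) - sample2.mean(axis=1)
    return mean_diff / (sp * np.sqrt(1 / n1 + 1 / n2))

rng = np.random.default_rng(42)
reps = 10_000
data1 = rng.normal(loc=65.2, scale=7.8, size=(reps, 30))
data2 = rng.normal(loc=70.3, scale=8.4, size=(reps, 30))
t_values = pooled_t(data1, data2)
lo, hi = np.percentile(t_values, [2.5, 97.5])

print("mean of simulated t:", t_values.mean())
print("central 95% of simulated t:", lo, hi)
```

Both the single simulated value from the question (about -2.03) and the value from the manual formula (about -2.44) should fall well inside this range. Separately, stats.t.cdf(t, dof) is a one-sided p-value, while ttest_ind reports a two-sided one by default, which accounts for part of the gap between the two p-values.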
Recently I wanted to demonstrate generating a continuous random variable using the universality of the Uniform. For that, I wanted to use the combination of numpy and matplotlib. However, the generated random variable seems a little bit off to me - and I don't know whether it is caused by the way in which NumPy's random uniform and vectorized works or if I am doing something fundamentally wrong here.
Let U ~ Unif(0, 1) and X = F^-1(U). Then X is a real variable with a CDF F (please note that the F^-1 here denotes the quantile function, I also omit the second part of the universality because it will not be necessary).
Let's assume that the CDF of interest to me is:
F(x) = e^x / (1 + e^x)
then:
F^-1(u) = ln(u / (1 - u))
According to the universality of the uniform, to generate a real variable, it is enough to plug U ~ Unif(0, 1) into F^-1. Therefore, I've written a very simple code snippet for that:
U = np.random.uniform(0, 1, 1000000)
def logistic(u):
    x = np.log(u / (1 - u))
    return x
logistic_transform = np.vectorize(logistic)
X = logistic_transform(U)
However, the result seems a little bit off to me - although the histogram of a generated real variable X resembles a logistic distribution (which simplified CDF I've used) - the r.v. seems to be distributed in a very unequal way - and I can't wrap my head around exactly why it is so. I would be grateful for any suggestions on that. Below are the histograms of U and X.
You have a large sample size, so you can increase the number of bins in your histogram and still get a good number of samples per bin. If you are using matplotlib's hist function, try (for example) bins=400. I get this plot, which has the symmetry that I think you expected:
Also--and this is not relevant to the question--your function logistic will handle a NumPy array without wrapping it with vectorize, so you can save a few CPU cycles by writing X = logistic(U). And you can save a few lines of code by using scipy.special.logit instead of implementing it yourself.
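As a rough check of both points, the sketch below applies the function to the whole array without np.vectorize and bins the result into 400 bins with np.histogram (used here instead of matplotlib so that it runs without a display). The seed and the symmetric range of -10 to 10 are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.uniform(0.0, 1.0, 1_000_000)

def logistic(u):
    # Works element-wise on a NumPy array, so np.vectorize is not needed
    return np.log(u / (1 - u))

X = logistic(U)

# 400 equal-width bins on a symmetric range around 0
counts, edges = np.histogram(X, bins=400, range=(-10.0, 10.0))
left, right = counts[:200].sum(), counts[200:].sum()
print("samples below 0:", left, "samples above 0:", right)
```

The two halves should hold nearly the same number of samples, which is the symmetry that a coarse histogram can hide. The same bins=400 argument can be passed to matplotlib's hist.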
Now I have 1 loop that populates a 3D NumPy matrix. I'm not exactly the best at understanding a 3D array structure even though I know it's really just a XxYxZ representation of the normal XxY that I'm used to thinking in (2D). So if you want to know what this is it is a Brownian Bridge (BB) construction used in Monte Carlo simulations for financial problems. Credit for the original code (derived from the commentary which fixed the original post by author Kenta Oono located here): https://gist.github.com/delta2323/6bb572d9473f3b523e6e. You don't really need to know anything about the math behind it; it just basically chops up a path of steps (21 in this example), begins at 0, has normally distributed shocks (hence np.random.randn) applied until it reaches the end, which is also 0. Each path is applied to a simulated price to randomly "shock it" over time, generating a potential path the asset could follow on its way to expiration. Although these are totally uncorrelated, so I suppose I would pass a V matrix in as well to correlate the paths to be correct, however, let us keep it simple:
import numpy as np
from matplotlib import pyplot
import timeit
steps = 21
underlyings = 3
sims = 131072
seed = 0 # fix the seed for replicating results
np.random.seed(seed)
def sample_path_batches(underlyings, steps, sims):
    dt = 1.0 / (steps-1)
    dt_sqrt = np.sqrt(dt)
    B = np.empty((underlyings, steps, sims), dtype=float)
    B[:,0, :] = 0 # set first step to 0
    for n in range(steps - 2):
        t = n * dt
        xi = np.random.randn(underlyings, sims) * dt_sqrt
        B[:, n + 1, :] = B[:, n, :] * (1 - dt / (1 - t)) + xi
    B[:, -1, :] = 0 # set last step to 0
    return B
start_time = timeit.default_timer()
B = sample_path_batches(underlyings, steps, sims)
print('\n' + 'Run time for ', sims, ' simulation steps * underlyings: ',
      np.round((timeit.default_timer() - start_time),3), ' seconds')
pyplot.plot(B[:,:,np.random.randint(0,sims)].T); # plot a random simulation set of paths
pyplot.show()
Run time for 131072 simulation steps * underlyings: 2.014 seconds
So anyhow, that's way too slow for my application, although my original version with a 2nd inner loop was around 15 seconds. So I've seen where people have vectorized NumPy through np.vectorize or used maps to "flatten" a loop, but I can't visualize how to actually do it myself. I'm looking for an optimal "native Python" implementation that will produce the same numbers. B is the 3D NumPy array. You can just copy and paste it and run it online if you want: https://mybinder.org/v2/gh/jupyterlab/jupyterlab-demo/HEAD?urlpath=lab/tree/demo
Any suggestions are appreciated!!! Even if it is just "restructure the loop like this, then apply np.vectorize" or whatever, I'm pretty good at taking a suggestion and making it work off a simple "new view" into how to visualize the problem. I would usually just do this type of thing in Cython (nogil / OpenMP / prange) but I'd like to know to "flatten" a loop in general, with normal math libraries built into NumPy or Pandas or whatever works.
One simple solution to speed up this code is to parallelize it using Numba. You only need to add the decorator @nb.njit('float64[:,:,::1](int64, int64, int64)', parallel=True) to the function sample_path_batches (where nb is the Numba module). Note that dtype=float must be replaced with dtype=np.float64 in the function so that Numba can compile the code correctly. parallel=True should automatically parallelize the np.random.randn call as well as the basic operation that follows it in the loop. On a 10-core machine this is 7 times faster (it takes 0.253 seconds with Numpy and 0.036 seconds with a parallel Numba implementation). If you do not see any improvement, you could also try to parallelize it manually using prange.
Additionally, you can use np.float32 types for significantly faster performance (up to 2 times faster theoretically). However, Numpy does not currently support such types for np.random.randn. Instead, np.random.default_rng().standard_normal(size=(underlyings, sims), dtype=np.float32) should be used. Unfortunately, it is probably not yet supported by Numba since Numpy added this quite recently...
If you have an Nvidia GPU, another solution is to use CUDA to execute the function on the GPU. This should be much faster. Note that Numba has specific optimized functions to generate random np.float32 values on the GPU using CUDA (see here).
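For the float32 suggestion, a plain NumPy sketch (without Numba or CUDA) could look like the following. The function name sample_path_batches_f32 and the seed argument are hypothetical. Because it uses the Generator API rather than np.random.seed and np.random.randn, it will not reproduce the exact numbers of the original function, and any speedup should be measured rather than assumed.

```python
import numpy as np

def sample_path_batches_f32(underlyings, steps, sims, seed=0):
    """Single-precision variant of the bridge construction from the question."""
    rng = np.random.default_rng(seed)
    dt = np.float32(1.0 / (steps - 1))
    dt_sqrt = np.sqrt(dt)  # stays float32
    B = np.empty((underlyings, steps, sims), dtype=np.float32)
    B[:, 0, :] = 0  # set first step to 0
    for n in range(steps - 2):
        t = n * dt
        # Generator.standard_normal can produce float32 samples directly
        xi = rng.standard_normal(size=(underlyings, sims), dtype=np.float32) * dt_sqrt
        B[:, n + 1, :] = B[:, n, :] * (1 - dt / (1 - t)) + xi
    B[:, -1, :] = 0  # set last step to 0
    return B

B32 = sample_path_batches_f32(underlyings=3, steps=21, sims=1024)
print(B32.dtype, B32.shape)
```

Keeping dt and dt_sqrt as np.float32 scalars avoids intermediate float64 arrays inside the loop.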
I am trying to understand what is the problem with the following code:
import pymc3 as pm
import theano as t
import theano.tensor as tt
X = t.shared(train_new)
features = list(map(str, range(train_new.shape[1])))
with pm.Model() as logistic_model:
    glm = pm.glm.GLM(X, targets, labels=features,
                     intercept=False, family='binomial')
    trace = pm.sample(3000, tune=3000, jobs=-1)
The dataset is by no means big: its shape is (891, 13). Here is what I concluded on my own:
the problem is surely not the hardware because the performance is the same both on my laptop and on a c4.2xlarge AWS instance;
it cannot be theano.shared because if I remove it the performance is again the same;
the problem does not appear to be in pymc3.glm.GLM because when I manually build the model (which is probably simpler than the one in GLM) the performance is just as terrible:
with pm.Model() as logistic_model:
    invlogit = lambda x: 1 / (1 + pm.math.exp(-x))
    σ = pm.HalfCauchy('σ', beta=2)
    β = pm.Normal('β', 0, sd=σ, shape=X.get_value().shape[1])
    π = invlogit(tt.dot(X, β))
    likelihood = pm.Bernoulli('likelihood', π, observed=targets)
It starts at around 200 it/s and then quickly falls to 5 it/s. Halfway through sampling, it decreases further to around 2 it/s. This is a serious problem, as the model barely converges with a couple of thousand samples. I need to draw many more samples than this situation currently allows.
This is the log:
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
99%|█████████▊| 5923/6000 [50:00<00:39, 1.97it/s]
I tried with pm.Metropolis() as step, and it was a bit faster but it didn't converge.
MWE: a file with a minimal working example showing the issue and the data is here:
https://gist.github.com/rubik/74ddad91317b4d366d3879e031e03396
A non-centered version of the model should work much better:
β_raw = pm.Normal('β_raw', 0, sd=1, shape=X.get_value().shape[1])
β = pm.Deterministic('β', β_raw * σ)
Usually your first impulse if the effective sample size is small shouldn't be to just increase the number of samples, but to try and play with the parametrization a bit.
Also, you can use tt.nnet.sigmoid instead of your custom invlogit, that might be faster/more stable.
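To see why the non-centered form describes the same prior, note that scaling a standard normal by σ gives a Normal(0, σ) variable. The NumPy sketch below checks this for one fixed, made-up value of σ. It only illustrates the reparametrization and does not involve PyMC3 or the sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.5   # hypothetical fixed scale, standing in for one draw of σ
n = 200_000

# Centered: sample β directly from Normal(0, σ)
beta_centered = rng.normal(0.0, sigma, size=n)

# Non-centered: sample β_raw from Normal(0, 1), then scale by σ
beta_raw = rng.normal(0.0, 1.0, size=n)
beta_noncentered = beta_raw * sigma

print("std centered:    ", beta_centered.std())
print("std non-centered:", beta_noncentered.std())
```

The distribution of β is unchanged. What changes is that the sampler works with β_raw, whose scale no longer depends on σ, and that is usually what makes hierarchical models of this kind easier to sample.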
I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).
My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.
Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week t the number of occasions on which you have received a book you have read is x, then I can produce a maximum likelihood estimate for the number of books in the library following https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .
Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.
To visualise the likelihood function (assuming I have understood what one is correctly) I have written the following code which I believe plots the likelihood function. The maximum is around 135 which is indeed the maximum likelihood estimate according to the MSE link above.
from __future__ import division
import random
import matplotlib.pyplot as plt
import numpy as np
#N is the true number of books. t is the number of weeks.unk is the true number of repeats found
t = 30
unk = 3
def numberrepeats(N, t):
    return t - len(set([random.randint(0,N) for i in xrange(t)]))
iters = 1000
ydata = []
for N in xrange(10,500):
    sampledunk = [numberrepeats(N,t) for i in xrange(iters)].count(unk)
    ydata.append(sampledunk/iters)
print "MLE is", np.argmax(ydata)
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()
The output looks like
My questions are these:
Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.
Finding the 95% confidence interval means finding the range of the x axis so that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer @mbatchkarov has given does not currently do this correctly.
There is now a mathematical answer at https://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .
Looks like you're ok on the first part, so I'll tackle your second and third points.
There are plenty of ways to fit smooth curves, with scipy.interpolate and splines, or with scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.
Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.
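A rolling-window smoother with numpy.convolve might look like this. The helper name rolling_mean, the window length and the synthetic noisy curve are placeholders for illustration. In practice ydata from the script would be passed in.

```python
import numpy as np

def rolling_mean(y, window=10):
    """Smooth y with a flat (boxcar) window; output has the same length as y."""
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")

# Synthetic stand-in for a noisy likelihood curve
rng = np.random.default_rng(1)
x = np.arange(10, 500)
noisy = np.exp(-((x - 135) / 80.0) ** 2) + rng.normal(0.0, 0.05, size=x.size)
smooth = rolling_mean(noisy, window=15)

print("roughness before:", np.abs(np.diff(noisy)).mean())
print("roughness after: ", np.abs(np.diff(smooth)).mean())
```

With mode="same" the input is implicitly zero-padded, so the first and last half-window of the smoothed curve are pulled towards zero and should be read with care.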
As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3
def numberrepeats(N, t, iters):
    # one row per simulated run, one column per week
    rand = np.random.randint(0, N, size=(iters, t))
    return t - np.array([len(set(r)) for r in rand])
iters = 1000
ydata = np.empty(500-10)
for N in xrange(10,500):
    sampledunk = np.count_nonzero(numberrepeats(N,t,iters) == unk)
    ydata[N-10] = sampledunk/iters
# ydata[0] corresponds to N = 10, so shift the index back to N
print "MLE is", np.argmax(ydata) + 10
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()
It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.
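If the goal is only the likelihood curve, the simulation can be avoided altogether. Assuming the model from the question (one book drawn uniformly at random with replacement each week), the probability of seeing exactly d = t - unk distinct books in t weeks is S(t, d) * N! / ((N - d)! * N^t), where the Stirling number S(t, d) does not depend on N. The sketch below (written for Python 3, with variable names chosen for illustration) therefore evaluates the likelihood only up to that constant factor.

```python
import numpy as np

t = 30        # number of weeks
unk = 3       # number of repeats observed
d = t - unk   # number of distinct books seen

# N must be at least d, otherwise d distinct books are impossible
Ns = np.arange(d, 501)

# log L(N) up to an additive constant:
#   sum_{i=0}^{d-1} log(N - i)  -  t * log(N)
log_lik = np.log(Ns[:, None] - np.arange(d)).sum(axis=1) - t * np.log(Ns)

lik = np.exp(log_lik - log_lik.max())   # rescaled so the peak equals 1
mle = Ns[np.argmax(log_lik)]
print("MLE is", mle)
```

The resulting curve is smooth by construction, and its maximum should agree with the value of around 135 mentioned in the question. Plotting Ns against lik gives the likelihood up to scale.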
A simple (numerical) way to get a confidence interval is to run your script many times and see how much your estimate varies. You can use the resulting standard deviation to calculate the confidence interval.
In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:
import numpy as np
t = 30
k = 3
def trial(N):
    return t - len(np.unique(np.random.randint(0, N, size=t)))
def trials(N, n_trials):
    return np.asarray([trial(N) for i in xrange(n_trials)])
n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])
def likelihood(results):
    # fraction of trials at each N that produced exactly k repeats
    L = (results == k).mean(-1)
    # boxcar filtering
    n = 10
    L = np.convolve(L, np.ones(n) / float(n), mode='same')
    return L
def max_likelihood_estimate(Ns, results):
    i = np.argmax(likelihood(results))
    return Ns[i]
def max_likelihood(Ns, results):
    # calculate mean from all trials
    mean = max_likelihood_estimate(Ns, results)
    # randomly subsample results to estimate std
    n_samples = 100
    sample_frac = 0.25
    estimates = np.zeros(n_samples)
    for i in xrange(n_samples):
        mask = np.random.uniform(size=results.shape[1]) < sample_frac
        estimates[i] = max_likelihood_estimate(Ns, results[:,mask])
    std = estimates.std()
    sterr = std * np.sqrt(sample_frac) # is this mathematically sound?
    ci = (mean - 1.96*sterr, mean + 1.96*sterr)
    return mean, std, sterr, ci
mean, std, sterr, ci = max_likelihood(Ns, results)
print "Max likelihood estimate: ", mean
print "Max likelihood 95% ci: ", ci
There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set as in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.
Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better to compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might help cancel out some of the roughness that remains in the curves after filtering, but I'm not sure about this.
Here is an answer to your first question and a pointer to a solution for the second:
plt.plot(xdata,ydata)
# calculate the cumulative distribution function
cdf = np.cumsum(ydata)/sum(ydata)
# get the left and right boundary of the interval that contains 95% of the probability mass
right=np.argmax(cdf>0.975)
left=np.argmax(cdf>0.025)
# indicate confidence interval with vertical lines
plt.vlines(xdata[left], 0, ydata[left])
plt.vlines(xdata[right], 0, ydata[right])
# hatch confidence interval
plt.fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)
This produces the following figure:
I'll try to answer question 3 when I have more time :)