OK, so my current curve-fitting code has a step that uses scipy.stats to determine the right distribution based on the data:
distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]

# fit each candidate distribution and record its negative log-likelihood
mles = []
for distribution in distributions:
    pars = distribution.fit(data)
    mle = distribution.nnlf(pars, data)
    mles.append(mle)

results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]

for dist in sorted(zip(distributions, mles), key=lambda d: d[1]):
    print(dist)

best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print('Best fit reached using {}, MLE value: {}'.format(best_fit[0].name, best_fit[1]))
print([mod[0].name for mod in sorted(zip(distributions, mles), key=lambda d: d[1])])
where data is a list of numeric values. This has been working great so far for fitting unimodal distributions, confirmed by a script that randomly generates values from random distributions and uses curve_fit to re-determine the parameters.
Now I would like to make the code able to handle bimodal distributions, like the example below:
Is it possible to get an MLE for a pair of models from scipy.stats in order to determine whether a particular pair of distributions is a good fit for the data? Something like:
distributions = [st.laplace, st.norm, st.expon, st.dweibull, st.invweibull, st.lognorm, st.uniform]
distributionPairs = [[modelA.name, modelB.name] for modelA in distributions for modelB in distributions]
and use those pairs to get an MLE value of that pair of distributions fitting the data?
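For concreteness, one rough way to score a single fixed pair might be to evaluate the negative log-likelihood of an equal-weight mixture of the two fitted components (a sketch only: the 50/50 weight is an assumption, and each component is fitted to the full data set rather than via a proper mixture fit):

import numpy as np
import scipy.stats as st

def pair_nll(dist_a, dist_b, data, weight=0.5):
    # fit each component to the full data set (a crude shortcut;
    # a proper mixture fit would re-estimate responsibilities)
    pars_a = dist_a.fit(data)
    pars_b = dist_b.fit(data)
    mix_pdf = weight * dist_a.pdf(data, *pars_a) + (1 - weight) * dist_b.pdf(data, *pars_b)
    return -np.sum(np.log(mix_pdf))

# e.g. score every pair and sort by negative log-likelihood
# pair_scores = sorted(((a.name, b.name, pair_nll(a, b, data))
#                       for a in distributions for b in distributions), key=lambda s: s[2])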
It's not a complete answer, but it may help you solve your problem. Let's say you know your data were generated by two densities.
A solution would be to use the k-means or EM algorithm.
Initialization.
You initialize the algorithm by assigning every observation to one density or the other, and you initialize the two densities themselves (you initialize the parameters of each density, and in your case one of the "parameters" is the family itself: Gaussian, Laplace, and so on).
Iteration.
Then, iteratively, you run the following two steps:
Step 1.
Optimize the parameters assuming that the assignment of every point is correct. You can now use any optimization solver. This step provides you with an estimate of the best two densities (with their parameters) that fit your data.
Step 2.
You classify every observation into one density or the other according to the greatest likelihood.
You repeat until convergence.
This is very well explained on this web page:
https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html
If you do not know how many densities generated your data, the problem is more difficult: you have to work with a penalized classification problem, which is a bit harder.
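As an illustration of that harder setting, a minimal sketch (assuming scikit-learn is available and restricting the components to Gaussians) is to fit mixtures with several component counts and keep the one with the lowest BIC:

import numpy as np
from sklearn.mixture import GaussianMixture

def pick_n_components(data, max_components=5):
    # fit 1..max_components Gaussian mixtures and return the count with the lowest BIC
    X = np.asarray(data).reshape(-1, 1)
    bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in range(1, max_components + 1)]
    return int(np.argmin(bics)) + 1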
Here is a coding example for an easy case: you know that your data come from 2 different Gaussians (but you don't know how many observations were generated from each density). In your case, you can adjust this code to loop over every possible pair of densities (computationally longer, but I presume it would empirically work).
import scipy.stats as st
import numpy as np

# hard-coded data generation
data = np.random.normal(-3, 1, size=1000)
data[600:] = np.random.normal(loc=3, scale=2, size=400)

# initialization
mu1 = -1
sigma1 = 1
mu2 = 1
sigma2 = 1

# criterion to stop iterating
epsilon = 0.1
stop = False

while not stop:
    # step 1: classify each observation under the density with the greater likelihood
    classification = np.zeros(len(data))
    classification[st.norm.pdf(data, mu1, sigma1) > st.norm.pdf(data, mu2, sigma2)] = 1
    mu1_old, mu2_old, sigma1_old, sigma2_old = mu1, mu2, sigma1, sigma2

    # step 2: re-fit each density on the observations assigned to it
    pars1 = st.norm.fit(data[classification == 1])
    mu1, sigma1 = pars1
    pars2 = st.norm.fit(data[classification == 0])
    mu2, sigma2 = pars2

    # stopping criterion: the parameters barely moved
    stop = ((mu1_old - mu1)**2 + (mu2_old - mu2)**2 + (sigma1_old - sigma1)**2 + (sigma2_old - sigma2)**2) < epsilon

# result
print("The first density is Gaussian:", mu1, sigma1)
print("The second density is Gaussian:", mu2, sigma2)
print("A rate of", np.mean(classification), "is classified in the first density")
Hope it helps.
Related
I am illustrating hyperopt's TPE algorithm for my master's project and can't seem to get the algorithm to converge. From what I understand from the original paper and a YouTube lecture, the TPE algorithm works in the following steps:
(in the following, x=hyperparameters and y=loss)
Start by creating a search history of [x,y], say 10 points.
Sort the hyperparameters according to their loss and divide them into two sets using some quantile γ (γ = 0.5 means the sets will be equally sized)
Make a kernel density estimation for both the poor hyperparameter group (g(x)) and good hyperparameter group (l(x))
Good estimations will have low probability in g(x) and high probability in l(x), so we propose to evaluate the function at argmin(g(x)/l(x))
Evaluate the (x, y) pair at the proposed point and repeat steps 2-5.
I have implemented this in Python on the objective function f(x) = x^2, but the algorithm fails to converge to the minimum.
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde

def objective_func(x):
    return x**2

def measure(x):
    noise = np.random.randn(len(x))*0  # noise disabled (multiplied by 0)
    return x**2 + noise

def split_measures(x_obs, y_obs, gamma=1/2):
    # split the x and y observations into two sets and return a separation threshold (y_star)
    size = int(len(x_obs)//(1/gamma))
    l = {'x': x_obs[:size], 'y': y_obs[:size]}
    g = {'x': x_obs[size:], 'y': y_obs[size:]}
    y_star = (l['y'][-1] + g['y'][0])/2
    return l, g, y_star

# sample objective function values for illustration
x_obj = np.linspace(-5, 5, 10000)
y_obj = objective_func(x_obj)

# start by sampling a parameter search history
x_obs = np.linspace(-5, 5, 10)
y_obs = measure(x_obs)

nr_iterations = 100
for i in range(nr_iterations):
    # sort observations according to loss
    sort_idx = y_obs.argsort()
    x_obs, y_obs = x_obs[sort_idx], y_obs[sort_idx]

    # split sorted observations into two groups (l and g)
    l, g, y_star = split_measures(x_obs, y_obs)

    # approximate distributions for both groups using kernel density estimation
    kde_l = gaussian_kde(l['x']).evaluate(x_obj)
    kde_g = gaussian_kde(g['x']).evaluate(x_obj)

    # define our evaluation measure for sampling a new point
    eval_measure = kde_g/kde_l

    if i % 10 == 0:
        plt.figure()
        plt.subplot(2, 2, 1)
        plt.plot(x_obj, y_obj, label='Objective')
        plt.plot(x_obs, y_obs, '*', label='Observations')
        plt.plot([-5, 5], [y_star, y_star], 'k')
        plt.subplot(2, 2, 2)
        plt.plot(x_obj, kde_l)
        plt.subplot(2, 2, 3)
        plt.plot(x_obj, kde_g)
        plt.subplot(2, 2, 4)
        plt.semilogy(x_obj, eval_measure)
        plt.draw()

    # find the point to evaluate and add the new observation
    best_search = x_obj[np.argmin(eval_measure)]
    x_obs = np.append(x_obs, [best_search])
    y_obs = np.append(y_obs, [measure(np.asarray([best_search]))])

plt.show()
I suspect this happens because we keep sampling where we are most certain, thus making l(x) more and more narrow around this point, which doesn't change where we sample at all. So where is my understanding lacking?
So, I am still learning about TPE as well, but here are the two problems in this code:
This code will only ever evaluate a few unique points, because the next location is always the single best point recommended by the kernel density functions and there is no mechanism for exploring the search space (which is what acquisition functions do, for example).
Because this code simply appends new observations to the lists of x and y, it adds a whole lot of duplicates. The duplicates lead to a skewed set of observations, which in turn leads to a very weird split; you can easily see that in the later plots. The eval_measure starts out looking similar to the objective function but diverges later on.
If you remove the duplicates in x_obs and y_obs, you fix problem no. 2. The first problem, however, can only be removed by adding some way of exploring the search space.
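A minimal sketch of both fixes, written as a drop-in replacement for the last few lines of the loop body in the question's code (the exploration rate eps is an arbitrary choice):

eps = 0.2  # exploration rate (arbitrary choice); define this once before the loop

# inside the loop, replacing the proposal/append step:
if np.random.rand() < eps:
    best_search = np.random.uniform(-5, 5)        # explore the search space
else:
    best_search = x_obj[np.argmin(eval_measure)]  # exploit the KDE ratio

x_obs = np.append(x_obs, [best_search])
y_obs = np.append(y_obs, [measure(np.asarray([best_search]))])

# drop duplicate observations before the next iteration
x_obs, unique_idx = np.unique(x_obs, return_index=True)
y_obs = y_obs[unique_idx]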
Imagine we tossed a biased coin 8 times (we don't know how biased it is), and we recorded 5 heads (H) to 3 tails (T) so far. What is the probability that the next 3 tosses will all be tails? In other words, we want the expected probability of having 5 Hs and 6 Ts after the 11th toss.
I want to build an MCMC simulation model using pyMC3 to find the Bayesian solution. There is also an analytical solution within the Bayesian approach for this problem, so I will be able to compare the results derived from the simulation with the analytical solution as well as the classical frequentist one. Let me briefly explain what I have done so far:
1. Frequentist solution:
If we consider the probability for a single toss:
E(T) = p = 3/8 = 0.375
Then, the ultimate answer is E({T,T,T}) = p^3 = (3/8)^3 ≈ 0.053.
2.1. Bayesian solution, analytical way:
Please assume the unknown parameter "p" represents the bias of the coin (here, the probability of tails).
If we consider the probability for a single toss:
E(T) = ∫_0^1 p * P(p | H=5, T=3) dp = 0.400 (I calculated the result after some algebraic manipulation)
Similarly, the ultimate answer is:
E({T,T,T}) = ∫_0^1 p^3 * P(p | H=5, T=3) dp = 1/11 ≈ 0.091.
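As a quick numerical check of these two integrals, a sketch using scipy (with a uniform prior, the posterior for the tail probability p is Beta(4, 6)):

from scipy import stats, integrate

posterior = stats.beta(4, 6)  # uniform prior, 3 observed tails, 5 observed heads
e_t = integrate.quad(lambda p: p * posterior.pdf(p), 0, 1)[0]       # ~0.400
e_ttt = integrate.quad(lambda p: p**3 * posterior.pdf(p), 0, 1)[0]  # ~0.0909
print(e_t, e_ttt)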
2.2. Bayesian solution with MCMC simulation:
When we consider the probability for a single toss, I built the model in pyMC3 as below:
Head: 0
Tail: 1
data = [0, 0, 0, 0, 0, 1, 1, 1]
import pymc3 as pm

with pm.Model() as coin_flipping:
    p = pm.Uniform('p', lower=0, upper=1)
    y = pm.Bernoulli('y', p=p, observed=data)
    trace = pm.sample(1000)

pm.traceplot(trace)
After running this code, I got a posterior mean of E(T) = 0.398, which is very close to the analytical result (0.400). I am happy so far, but this is not the ultimate answer: I need a model that simulates the probability E({T,T,T}). I would appreciate it if someone could help me with this step.
One way to do this empirically is with PyMC3's posterior predictive sampling. That is, once you have a posterior sampling, you can generate samplings from random parameterizations of the model. The pymc3.sample_posterior_predictive() method will generate new samples the size of your original observed data. Since you are only interested in three flips, we can just ignore the additional flips it generates. For example, if you wanted 10,000 random sets of predicted flips, you would do:
with pm.Model() as coin_flipping:
    # this is still uniform, but I always prefer Beta for proportions
    p = pm.Beta('p', alpha=1, beta=1)
    pm.Bernoulli('y', p=p, observed=data)

    # chains looked a bit waggly at 1K; 10K looks smoother
    trace = pm.sample(10000, random_seed=2019, chains=4)

    # note this generates (10000, 8) observations
    post_pred = pm.sample_posterior_predictive(trace, samples=10000, random_seed=2019)
To then see how frequently the next three flips come up (1, 1, 1), we can do:
np.mean(np.sum(post_pred['y'][:,:3], axis=1) == 3)
# 0.0919
Analytic Solution
In this example, since we have an analytic posterior predictive distribution (Beta-Binomial[k | n, a=4, b=6] - see the Wikipedia table of conjugate distributions for details), we can exactly calculate the probability of observing three tails in the next three flips as follows:
from scipy.special import comb, beta as beta_fn
n, k = 3, 3 # flips, tails
a, b = 4, 6 # 1 + observed tails, 1 + observed heads
comb(n, k) * beta_fn(k + a, n - k + b) / beta_fn(a, b)
# 0.09090909090909091
Note that the beta_fn is the Euler Beta function, as distinct from the Beta distribution.
I have generated random data using:
bkg= 240-140*np.random.power(3.5,50000)
I plotted the points into a histogram by using
h_all = plt.hist(all,bins=binedges,histtype='step')
My question is: provided that I know the pdf (in this case called "bkg"), can I generate a curve using scipy.optimize that fits the generated points, and what is the equation for that curve?
First of all, remark that your bkg is NOT a probability density function (pdf). Rather, it is a list of observations from a pdf. By calling matplotlib.pyplot.hist on this list of observations, you get to see a curve that approximates the (offset and scaled version of the) probability density function. If you are given this curve, it is possible to get a good estimation of the parameters needed to model this, provided you've been given the parameterized model a priori.
For example:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
offset, scale, a, nsamples = 240, -140, 3.5, 500000
bkg = offset + scale*np.random.power(a, nsamples) # values range between (offset, offset+scale), which map to 0 and 1
nbins = 100
count, bins, ignored = plt.hist(bkg, bins=nbins, histtype='stepfilled', edgecolor='none')
If now you are given the centers of these bins and the counts,
xdata = .5*(bins[1:]+bins[:-1])
ydata = count
and you are asked to find the parameters of the power distribution function that fits this data (someone told you this is the right model, and you trust that source), then you could go about it in the following manner.
First, observe that the power distribution function P(x, a) is a monotonically increasing function (i.e. P(x1, a) < P(x2, a) when 0 <= x1 < x2 <= 1). That means that the dataset given above has been flipped left-to-right, or that it represents factor*P(x, a) with factor < 0.
Next, notice that the given data are not given over the interval [0, 1], as is typical for a probability density function. That means you should rescale the given xdata to the [0, 1] interval before attempting to fit the power distribution to it. Just by observing the graph, you can figure out that the values that 0 and 1 map to are 100 and 240. However, this is just luck here, because matplotlib chose a sensible range for plotting. When you don't actually know the limits that 0 and 1 have been mapped to, you could choose the less optimal (but still very good) choice of xdata[0] - binwidth/2 and xdata[-1] + binwidth/2, or the slightly worse choice of xdata[0] and xdata[-1]. From the previous paragraph, you know that 1 maps to xdata[0] - binwidth/2 (call it a) and 0 maps to xdata[-1] + binwidth/2 (call it b). The linear map that does this is lambda x: (a - b)*x + b (simple algebra).
If you pass this [0, 1]-mapped version of the xdata to curve_fit, it'll give you a good guess for the exponent.
def get_model(nobservations, binwidth, scale, offset):
    def model(bin_centers, exponent):
        # map the bin centers back to the [0, 1] interval of the power pdf
        x = (bin_centers - offset)/scale
        # power distribution pdf: a * x**(a - 1)
        y = exponent*x**(exponent - 1)
        # scale the pdf to the histogram counts
        normed_y = nobservations * binwidth * y / np.abs(scale)
        return normed_y
    return model

binwidth = np.diff(xdata)[0]
p0, _ = curve_fit(get_model(nsamples, binwidth, scale=-xdata.ptp() - binwidth, offset=xdata[-1] + binwidth/2), xdata, ydata)
print(p0)  # prints e.g.: 3.37117679

plt.plot(xdata, get_model(nsamples, binwidth, scale=-xdata.ptp() - binwidth, offset=xdata[-1] + binwidth/2)(xdata, *p0))
At this moment, you have found a rather accurate description of the distribution that was used to generate the observations of bkg:
f(x) = offset + scale*(exponent * x**(exponent - 1))
     = (xdata[-1] + binwidth/2) + (-xdata.ptp() - binwidth)*(p0[0] * x**(p0[0] - 1))
     ~ 234.85 - 134.85*(3.37 * x**(3.37 - 1))
By the way, I'd like to point out that replicating bkg (the observations from the distribution)
as a perfect copy is something you can only do if you know the exact parameters of the distribution (240, -140 and 3.5) AND set the seed for the random number generation equal to the seed that was in effect prior to the initial call to np.random.power.
If you'd like to fit a curve to the histogram using splines, you should retrieve the knots and coefficients from the generated spline and pass those into the function bspleval, as shown here. The topic of writing out those equations is a long one, however, and there are numerous resources on the internet that you can check to understand how it's done. Needless to say, that function bspleval is what you'll need in case you want to go that route. If it were me, I'd go the route of curve fitting shown above.
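For completeness, here is a minimal spline sketch using scipy.interpolate (an alternative to the bspleval route; xdata and ydata are the bin centers and counts from above, and the smoothing factor s is an arbitrary choice):

from scipy.interpolate import splrep, splev

tck = splrep(xdata, ydata, k=3, s=len(ydata))  # knot vector, B-spline coefficients, degree
y_spline = splev(xdata, tck)                   # evaluate the fitted spline at the bin centers
plt.plot(xdata, y_spline)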
I have a list of n observations, each of which is the sum of two Weibull-distributed variables:
x[i] = t1[i] + t2[i]
t1[i] ~ Weibull(shape1, scale1)
t2[i] ~ Weibull(shape2, scale2)
My goal is:
1) Estimate the shape and scale parameters for both Weibull distributions (shape1, scale1, shape2, scale2),
2) For each observation x[i], estimate t1[i] (and t2[i] follows from this).
(Aside: Each observation x[i] is the age of cancer diagnosis, and t1[i] and t2[i] are two different time periods in the development of the tumor. The actual model involves mutation data as well, but before I try that out, I want to make sure that I can use PyMC for this simpler problem.)
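For reference, data of this form can be simulated with numpy; here is a small sketch using the parameter values quoted later in the question (shape1=1, scale1=30, shape2=6.5, scale2=10, n=60):

import numpy as np

# np.random.weibull draws with scale 1, so multiply by the scale parameter
np.random.seed(0)
n = 60
shape1, scale1 = 1.0, 30.0
shape2, scale2 = 6.5, 10.0
t1_true = scale1 * np.random.weibull(shape1, size=n)
t2_true = scale2 * np.random.weibull(shape2, size=n)
data = t1_true + t2_true  # the observed sums x[i]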
I am using PyMC2 to make these estimates, and it looks like the run converges, but to incorrect results. I do not know whether there is a problem with my PyMC model syntax, with the MCMC settings, or both. I tried adapting this advice on using Potentials to model latent variables. First I define x[i] and t1[i] for each observation:
for i in xrange(n):
    x[i] = pm.Index('x_%i'%i, x=data, index=i)  # data is a list of observations
    t1[i] = pm.Weibull('t1_%i'%i, alpha=shape1, beta=scale1)
    # Ensure that the initial guess for t1 is not more than the observed sum:
    if t1[i].value >= x[i].value:
        t1[i].value = 0.95 * x[i].value
Then I define a Deterministic for t2[i] = x[i] - t1[i]:
for i in xrange(n):
    def subtractfunc(t1=t1, x=x, ii=i):
        return x[ii] - t1[ii]
    t2[i] = pm.Lambda('t2_%i'%i, subtractfunc)
And last I define the Potential for t2[i]:
t2dist = np.empty(n, dtype=object)
for i in xrange(n):
    def weibfunc(t2=t2, shape2=shape2, scale2=scale2, ii=i):
        return pm.weibull_like(t2[ii], alpha=shape2, beta=scale2)
    t2dist[i] = pm.Potential(logp = weibfunc,
                             name = 't2dist_%i'%i,
                             parents = {'shape2':shape2, 'scale2':scale2, 't2':t2},
                             doc = 'weibull potential for t2',
                             verbose = 0,
                             cache_depth = 2)
You can see my full code here. I test by simulating 60 independent observations, with shape1 = 1, scale1 = 30, shape2 = 6.5, scale2 = 10, and I run 1e5 iterations of AdaptiveMetropolis. The results converge to a mean of shape1=1.94, scale1=37.9, shape2=0.55, scale2=36.1, and the 95% HPDs do not include the true values. This resulting distribution is not even in the right ballpark, as this histogram shows. (Blue shows the simulated data x[i] that I used, while the red shows the completely different inferred distribution from a representative iteration in the MCMC run.)
Running again with a different random seed, I get shape1=4.65, scale1=23.3, shape2=0.83, scale2=21.3. This distribution is somewhat closer to the truth. Is there some way to change the MCMC settings to consistently get decent results for this sort of problem? Any advice about using PyMC more effectively is much appreciated.
Update -- tried an "assisted" MCMC run:
I also tried assisting the MCMC run by initializing population-level parameters with values close to the truth. The results are somewhat better, but I now find a systematic bias. The histogram below shows the true distribution of observations (blue) against the fitted distribution (red). The right tail fits nicely, but the fit fails to capture the sharp peak at the left side. This bias occurs consistently, for population sizes n = 60 and 100. I am not sure if this is more of a PyMC question or a general MCMC algorithm issue.
The following code fits an oversimplified generalized linear model using statsmodels:
model = smf.glm('Y ~ 1', family=sm.families.NegativeBinomial(), data=df)
results = model.fit()
This gives the coefficient and a stderr:
             coef   stderr
Intercept  2.9471    0.120
Now I want to graphically compare the real distribution of the variable Y (histogram) with the distribution that comes from the model.
But I need the two parameters r and p to evaluate stats.nbinom(r, p) and plot it.
Is there a way to retrieve the parameters from the results of the fitting?
How can I plot the PMF?
Generalized linear models (GLM) in statsmodels currently do not estimate the extra parameter of the Negative Binomial distribution; Negative Binomial belongs to the exponential family of distributions only for a fixed shape parameter.
However, statsmodels also has Negative Binomial as a maximum likelihood model in discrete_model, which estimates all parameters.
The parameterization of the Negative Binomial for count regression is in terms of the mean or expected value, which is different from the parameterization in scipy.stats.nbinom. Actually, there are two commonly used parameterizations for Negative Binomial count regression, usually called nb1 and nb2.
Here is a quickly written script that recovers the scipy.stats.nbinom parameters, n=size and p=prob, from the estimated parameters. Once you have the parameters for the scipy.stats distribution, you can use all the available methods: rvs, pmf, and so on.
Something like this should be made available in statsmodels.
In a few example runs, I got results like this
data generating parameters 50 0.25
estimated params 51.7167511571 0.256814610633
estimated params 50.0985814878 0.249989725917
As an aside: because of the underlying exponential reparameterization, the scipy optimizers sometimes have problems converging. In those cases, either providing better starting values or using Nelder-Mead as the optimization method usually helps.
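For example, a Nelder-Mead fit call could look like this (a sketch, reusing the y, x and loglike_method defined in the script below):

res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(
    start_params=[0.1, 0.1], method='nm', maxiter=2000)  # Nelder-Mead usually needs more iterations than the default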
import numpy as np
from scipy import stats
import statsmodels.api as sm

# generate some data to check
nobs = 1000
n, p = 50, 0.25
dist0 = stats.nbinom(n, p)
y = dist0.rvs(size=nobs)
x = np.ones(nobs)

loglike_method = 'nb1'  # or use 'nb2'
res = sm.NegativeBinomial(y, x, loglike_method=loglike_method).fit(start_params=[0.1, 0.1])

print(dist0.mean())
print(res.params)

mu = res.predict()          # use this for the mean if it is not constant
mu = np.exp(res.params[0])  # shortcut, we just regress on a constant
alpha = res.params[1]

if loglike_method == 'nb1':
    Q = 1
elif loglike_method == 'nb2':
    Q = 0

size = 1. / alpha * mu**Q
prob = size / (size + mu)

print('data generating parameters', n, p)
print('estimated params          ', size, prob)

# estimated distribution
dist_est = stats.nbinom(size, prob)
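To answer the plotting part of the question, one way to compare the observed counts with the estimated PMF (a sketch, assuming matplotlib is available; y and dist_est come from the script above):

import matplotlib.pyplot as plt

k = np.arange(y.min(), y.max() + 1)
edges = np.arange(y.min() - 0.5, y.max() + 1.5)  # one unit-wide bin per integer count
plt.hist(y, bins=edges, density=True, alpha=0.5, label='observed Y')
plt.plot(k, dist_est.pmf(k), 'o-', label='estimated nbinom PMF')
plt.legend()
plt.show()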
BTW: I ran into this before but didn't have time to look at it
https://github.com/statsmodels/statsmodels/issues/106