Understanding hyperopt's TPE algorithm - Python

I am illustrating hyperopt's TPE algorithm for my master project and can't seem to get the algorithm to converge. From what I understand from the original paper and YouTube lecture, the TPE algorithm works in the following steps:
(in the following, x=hyperparameters and y=loss)
1. Start by creating a search history of [x, y], say 10 points.
2. Sort the hyperparameters according to their loss and divide them into two sets using some quantile γ (γ = 0.5 means the sets will be equally sized).
3. Make a kernel density estimation for both the poor hyperparameter group (g(x)) and the good hyperparameter group (l(x)).
4. Good estimations will have low probability in g(x) and high probability in l(x), so we propose to evaluate the function at argmin(g(x)/l(x)).
5. Evaluate an (x, y) pair at the proposed point and repeat steps 2-5.
I have implemented this in python on the objective function f(x) = x^2, but the algorithm fails to converge to the minimum.
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde

def objective_func(x):
    return x**2

def measure(x):
    noise = np.random.randn(len(x))*0
    return x**2+noise

def split_meassures(x_obs,y_obs,gamma=1/2):
    #split x and y observations into two sets and return a separation threshold (y_star)
    size = int(len(x_obs)//(1/gamma))
    l = {'x':x_obs[:size],'y':y_obs[:size]}
    g = {'x':x_obs[size:],'y':y_obs[size:]}
    y_star = (l['y'][-1]+g['y'][0])/2
    return l,g,y_star

#sample objective function values for illustration
x_obj = np.linspace(-5,5,10000)
y_obj = objective_func(x_obj)

#start by sampling a parameter search history
x_obs = np.linspace(-5,5,10)
y_obs = measure(x_obs)

nr_iterations = 100
for i in range(nr_iterations):

    #sort observations according to loss
    sort_idx = y_obs.argsort()
    x_obs,y_obs = x_obs[sort_idx],y_obs[sort_idx]

    #split sorted observations in two groups (l and g)
    l,g,y_star = split_meassures(x_obs,y_obs)

    #approximate distributions for both groups using kernel density estimation
    kde_l = gaussian_kde(l['x']).evaluate(x_obj)
    kde_g = gaussian_kde(g['x']).evaluate(x_obj)

    #define our evaluation measure for sampling a new point
    eval_measure = kde_g/kde_l

    if i%10==0:
        plt.figure()
        plt.subplot(2,2,1)
        plt.plot(x_obj,y_obj,label='Objective')
        plt.plot(x_obs,y_obs,'*',label='Observations')
        plt.plot([-5,5],[y_star,y_star],'k')
        plt.subplot(2,2,2)
        plt.plot(x_obj,kde_l)
        plt.subplot(2,2,3)
        plt.plot(x_obj,kde_g)
        plt.subplot(2,2,4)
        plt.semilogy(x_obj,eval_measure)
        plt.draw()

    #find point to evaluate and add the new observation
    best_search = x_obj[np.argmin(eval_measure)]
    x_obs = np.append(x_obs,[best_search])
    y_obs = np.append(y_obs,[measure(np.asarray([best_search]))])

plt.show()
I suspect this happens because we keep sampling where we are most certain, thus making l(x) more and more narrow around this point, which doesn't change where we sample at all. So where is my understanding lacking?

So, I am still learning about TPE as well, but here are the two problems in this code:
1. This code will only evaluate a few unique points, because the best location is calculated from the kernel density estimates and there is no mechanism for exploring the search space (which is what acquisition functions do, for example).
2. Because this code simply appends new observations to the list of x and y, it adds a lot of duplicates. The duplicates lead to a skewed set of observations, which in turn leads to a very strange split; you can easily see that in the later plots. The eval_measure starts out similar to the objective function but diverges later on.
If you remove the duplicates in x_obs and y_obs, you fix problem no. 2. The first problem, however, can only be removed by adding some way of exploring the search space, for example as sketched below.
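As an illustration, here is a minimal sketch (not hyperopt's actual implementation, though close in spirit to TPE) of how the candidate-selection step inside the loop above could be changed to add exploration: instead of taking the argmin of g/l over the fixed x_obj grid, draw random candidates from the "good" density l(x) and keep the one that minimises g(x)/l(x). The candidate count is arbitrary; l and g are the dictionaries already computed in the loop.
# sketch: replace the argmin-over-grid step inside the loop with a sampled candidate set
kde_l_dist = gaussian_kde(l['x'])
kde_g_dist = gaussian_kde(g['x'])
n_candidates = 100
candidates = kde_l_dist.resample(n_candidates)[0]   # random draws from l(x) provide exploration
scores = kde_g_dist.evaluate(candidates) / kde_l_dist.evaluate(candidates)
best_search = candidates[np.argmin(scores)]          # minimise g/l among the candidates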


Is there a way to get the probability of a prediction using XGBoostRegressor?

I have built an XGBoostRegressor model using around 200 categorical features to predict a continuous time variable.
But I would like to get both the actual prediction and the probability of that prediction as output. Is there any way to get this from the XGBoostRegressor model?
So I want both the predicted value and P(Y|X) as output. Any idea how to do this?
There is no probability in regression. In regression the only output you get is a predicted value (that's why it is called regression), so for any regressor the probability of a prediction is not available; it only exists in classification.
As mentioned before, there is no probability associated with regression.
However, you could probably add a confidence interval on that regression, to see whether or not your regression can be trusted.
One thing to note though, is that the variance might not be the same along the data.
Let's assume that you study a time-based phenomenon. Specifically, you have the temperature (y) inside an oven after a time (x) (in seconds, for instance). At x = 0 s it is at 20°C, you start heating it, and you want to know the evolution in order to predict the temperature after x seconds. The variance could be the same after 20 seconds and after 5 minutes, or it could be completely different. This is called heteroscedasticity.
If you want to use a confidence interval, you probably want to make sure that you took care of heteroscedasticity, so your interval is the same for all the data.
You can probably try to get the distribution of your known outputs, compare the prediction against that curve, and check the p-value. But that would only give you a measure of how realistic it is to get that output, without taking the input into consideration. If you know your inputs/outputs are in a specific interval, this could work.
EDIT
This is how I would do it. Obviously the outputs are your real outputs.
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from scipy.interpolate import interp1d

N = 1000  # the number of samples
mean = 0
std = 1
outputs = np.random.normal(loc=mean, scale=std, size=N)

# We want a normalised histogram (since this is a PDF, it must integrate to 1)
nbins = N / 10
n = int(N / nbins)
p, x = np.histogram(outputs, bins=n, density=True)
plt.hist(outputs, bins=n, density=True)
x = x[:-1] + (x[1] - x[0])/2  # converting bin edges to centers

# Now we want to interpolate:
# f = CubicSpline(x=x, y=p, bc_type='not-a-knot')
f = interp1d(x=x, y=p, kind='quadratic', fill_value='extrapolate')
x = np.linspace(-2.9*std, 2.9*std, 10000)
plt.plot(x, f(x))
plt.show()

# To check:
area = integrate.quad(f, x[0], x[-1])
print(area)  # (should be close to 1)
Now, the interpolation method is not great for outliers: if a predicted value is extremely far (more than 3 times the std) from your distribution, it won't work. Other than that, you can now use the PDF to get meaningful results.
It is not perfect, but it is the best I came up with in that time. I'm sure there are some better ways to do it. If your data follow a normal law, it becomes trivial.
I suggest you look into NGBoost (essentially a wrapper around XGBoost which eventually provides a probabilistic model).
Here you can find slides on how NGBoost works and the seminal NGBoost paper.
The basic idea is to assume a specific distribution for $P(Y|X=x)$ (by default the Gaussian distribution) and fit an XGBoost model to estimate the best parameters of that distribution (for the Gaussian, $\mu$ and $\sigma$). The model will split the variables' space into different regions with different distributions, i.e. the same family (e.g. Gaussian) but different parameters.
After training the model, you're provided with the method pred_dist, which returns the estimated distribution $P(Y|X=x)$ for a given set of values $x$.
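A rough usage sketch, assuming the ngboost package's NGBRegressor API and hypothetical arrays X_train, y_train, X_test:
from ngboost import NGBRegressor

ngb = NGBRegressor()               # Gaussian distribution by default
ngb.fit(X_train, y_train)          # X_train, y_train: your training data (hypothetical names)

point_pred = ngb.predict(X_test)   # point prediction, as with any regressor
dist_pred = ngb.pred_dist(X_test)  # estimated distribution P(Y|X=x) for each row
print(dist_pred.params)            # e.g. the fitted loc/scale of the Gaussian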

applying "tighter" bounds in scipy.optimize.curve_fit

I have a dataset that I am trying to fit with parameters (a,b,c,d) that are within +/- 5% of the true fitting parameters. However, when I do this with scipy.optimize.curve_fit I get the error "ValueError: 'x0' is infeasible." inside the least squares package.
If I relax my bounds, then optimize.curve_fit seems to work as desired. I have also noticed that my parameters that are larger seem to be more flexible in applying bounds (i.e. I can get these to work with tighter constraints, but never below 20%). The following code is an MWE and has two sets of bounds (variable B), one that works and one that returns an error.
# %% import modules
import IPython as IP
IP.get_ipython().magic('reset -sf')
import matplotlib.pyplot as plt
import os as os
import numpy as np
import scipy as sp
import scipy.io as sio
plt.close('all')
#%% Load the data
capacity = np.array([1.0,9.896265560165975472e-01,9.854771784232364551e-01,9.823651452282157193e-01,9.797717842323651061e-01,9.776970954356846155e-01,9.751037344398340023e-01,9.735477178423235234e-01,9.714730290456431439e-01,9.699170124481327759e-01,9.683609958506222970e-01,9.668049792531120401e-01,9.652489626556015612e-01,9.636929460580913043e-01,9.621369294605808253e-01,9.610995850622406911e-01,9.595435684647302121e-01,9.585062240663900779e-01,9.574688796680497216e-01,9.559128630705394647e-01,9.548755186721991084e-01,9.538381742738588631e-01,9.528008298755185068e-01,9.517634854771783726e-01,9.507261410788381273e-01,9.496887966804978820e-01,9.486514522821576367e-01,9.476141078838172804e-01,9.460580912863070235e-01,9.450207468879666672e-01,9.439834024896265330e-01,9.429460580912862877e-01,9.419087136929459314e-01,9.408713692946057972e-01,9.393153526970953182e-01,9.382780082987551840e-01,9.372406639004148277e-01,9.356846473029045708e-01,9.346473029045642145e-01,9.330912863070539576e-01,9.320539419087136013e-01,9.304979253112033444e-01,9.289419087136928654e-01,9.273858921161826085e-01,9.258298755186721296e-01,9.242738589211617617e-01,9.227178423236513938e-01,9.211618257261410259e-01,9.196058091286306579e-01,9.180497925311202900e-01,9.159751037344397995e-01,9.144190871369294316e-01,9.123443983402489410e-01,9.107883817427384621e-01,9.087136929460579715e-01,9.071576763485477146e-01,9.050829875518671130e-01,9.030082987551866225e-01,9.009336099585061319e-01,8.988589211618257524e-01,8.967842323651451508e-01,8.947095435684646603e-01,8.926348547717841697e-01,8.905601659751035681e-01,8.884854771784231886e-01,8.864107883817426980e-01,8.843360995850622075e-01,8.817427385892115943e-01,8.796680497925309927e-01,8.775933609958505022e-01,8.749999999999998890e-01,8.729253112033195094e-01,8.708506224066390189e-01,8.682572614107884057e-01,8.661825726141078041e-01,8.635892116182571909e-01,8.615145228215767004e-01,8.589211618257260872e-01,8.563278008298754740e-01,8.542531120331948724e-01,8.516597510373442592e-01,8.490663900414936460e-01,8.469917012448132665e-01,8.443983402489626533e-01,8.418049792531120401e-01,8.397302904564315496e-01,8.371369294605809364e-01,8.345435684647303232e-01,8.324688796680497216e-01,8.298755186721991084e-01,8.272821576763484952e-01,8.246887966804978820e-01,8.226141078838173915e-01,8.200207468879667783e-01,8.174273858921160540e-01,8.153526970954355635e-01,8.127593360995849503e-01,8.101659751037343371e-01,8.075726141078837239e-01,8.054979253112033444e-01,8.029045643153527312e-01,8.003112033195021180e-01,7.977178423236515048e-01,7.956431535269707922e-01,7.930497925311201790e-01,7.904564315352695658e-01,7.883817427385891863e-01,7.857883817427385731e-01,7.831950207468879599e-01,7.811203319502073583e-01,7.785269709543567451e-01,7.759336099585061319e-01,7.738589211618256414e-01,7.712655601659750282e-01,7.686721991701244150e-01,7.665975103734440355e-01,7.640041493775934223e-01,7.619294605809127097e-01,7.593360995850620965e-01,7.567427385892114833e-01,7.546680497925311037e-01,7.520746887966804906e-01,7.499999999999998890e-01,7.474066390041492758e-01,7.453319502074687852e-01,7.427385892116181720e-01,7.406639004149377925e-01,7.380705394190871793e-01,7.359958506224064667e-01,7.339211618257260872e-01,7.313278008298754740e-01,7.292531120331949834e-01,7.266597510373443702e-01,7.245850622406637687e-01,7.225103734439833891e-01,7.199170124481327759e-01,7.178423236514521744e-01,7.157676348547717948e-01,7.136929460580911933e-01,7.110995850622405801e-01,7.090248962655600895e-01,7.069502074688797100e-01,7.048755186721989974e-01,7.022821576763483842e-01,7.002074688796680046e-01,6.981327800829875141e-01,6.960580912863069125e-01,6.939834024896265330e-01,6.919087136929459314e-01,6.898340248962655519e-01,6.877593360995849503e-01])
cycles = np.arange(0,151)
#%% fit the capacity data
# define the empirical model to be fitted
def He_model(k,a,b,c,d):
    return a*np.exp(b*k)+c*np.exp(d*k)
# Fit the entire data set with the function
P0 = [-40,0.005,40,-0.005]
fit, tmp = sp.optimize.curve_fit(He_model, cycles,capacity, p0=P0,maxfev=10000000)
capacity_fit = He_model(cycles, fit[0], fit[1],fit[2], fit[3])
# track all four data points with a 5% bound from the best fit
b_lim = np.zeros((4,2))
for i in range(4):
    b_lim[i,0] = fit[i]-np.abs(0.2*fit[i]) # these should be 0.05
    b_lim[i,1] = fit[i]+np.abs(0.2*fit[i])
# bounds that work
B = ([b_lim[0,0],-np.inf,b_lim[2,0],-np.inf],[b_lim[0,1], np.inf, b_lim[2,1], np.inf])
# bounds that do not work, but are closer to what I want.
#B = ([b_lim[0,0],b_lim[1,0],b_lim[2,0],b_lim[3,0]],[b_lim[0,1], b_lim[1,1], b_lim[2,1], b_lim[3,1]])
fit_4_5per, tmp = sp.optimize.curve_fit(He_model, cycles,capacity , p0=P0,bounds=B)
capacity_4_5per = He_model(cycles, fit_4_5per[0], fit_4_5per[1], fit_4_5per[2], fit_4_5per[3])
#%% plot the results
plt.figure()
plt.plot(cycles,capacity,'o',fillstyle='none',label='data',markersize=4)
plt.plot(cycles,capacity_fit,'--',label='best fit')
plt.plot(cycles,capacity_4_5per,'-.',label='5 percent bounds')
plt.ylabel('capacity')
plt.xlabel('charge cycles')
plt.legend()
plt.ylim([0.70,1])
plt.tight_layout()
I understand that optimize.curve_fit may need some space to explore the data set to find the optimum spot; however, I feel that 5% should be enough for it. Maybe I am wrong? Is there a better way to apply bounds to a parameter?
The ValueError "x0 is infeasible" comes up because your initial values violate the bounds. Printing out the parameter values and bounds will show this.
Basically, you're setting the bounds too cleverly, based on the first refined values. But the refined values are different enough from your starting values that the bounds for the second call to curve_fit mean the initial values fall outside the bounds.
More importantly, what leads you to "feel that 5% should be enough"? Primarily, you should apply bounds to make sure the model makes sense, and secondarily to help the fit avoid false solutions. You're calculating the bounds based on an initial fit, so I doubt there's a strong physical justification for those bounds. Why not let the fit do its job?
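If you do want to keep the bounds, a minimal sketch of the fix, using the variables from the question: keep the bounds built around the first fit, but start the bounded call from fit instead of P0, so the initial point is guaranteed to lie inside the bounds.
# start the bounded fit from the refined parameters, which sit at the centre of the bounds
B = (b_lim[:, 0], b_lim[:, 1])
fit_bounded, cov_bounded = sp.optimize.curve_fit(He_model, cycles, capacity, p0=fit, bounds=B)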

Plotting confidence intervals for Maximum Likelihood Estimate

I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).
My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.
Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week t the number of occasions on which you have received a book you have read is x, then I can produce a maximum likelihood estimate for the number of books in the library following https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .
Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.
To visualise the likelihood function (assuming I have understood correctly what one is), I have written the following code, which I believe plots the likelihood function. The maximum is around 135, which is indeed the maximum likelihood estimate according to the Math.SE link above.
from __future__ import division
import random
import matplotlib.pyplot as plt
import numpy as np

# N is the true number of books. t is the number of weeks. unk is the true number of repeats found
t = 30
unk = 3

def numberrepeats(N, t):
    return t - len(set([random.randint(0,N) for i in xrange(t)]))

iters = 1000
ydata = []
for N in xrange(10,500):
    sampledunk = [numberrepeats(N,t) for i in xrange(iters)].count(unk)
    ydata.append(sampledunk/iters)

print "MLE is", np.argmax(ydata)

xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()
The output looks like
My questions are these:
Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.
Finding the 95% confidence interval means finding the range of the x axis such that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer @mbatchkarov has given does not currently do this correctly.
There is now a mathematical answer at https://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .
Looks like you're ok on the first part, so I'll tackle your second and third points.
There are plenty of ways to fit smooth curves, with scipy.interpolate and splines, or with scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.
Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.
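For the rolling-window option, a minimal sketch (the window width is arbitrary) applied to the ydata array your script builds could look like:
w = 9                                                     # assumed window width
kernel = np.ones(w) / w
ydata_smooth = np.convolve(ydata, kernel, mode='same')    # boxcar-smoothed likelihood curve
plt.plot(xdata, ydata_smooth, label='smoothed')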
As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt

# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3

def numberrepeats(N, t, iters):
    rand = np.random.randint(0, N, size=(iters, t))
    return t - np.array([len(set(r)) for r in rand])

iters = 1000
ydata = np.empty(500-10)
for N in xrange(10,500):
    sampledunk = np.count_nonzero(numberrepeats(N,t,iters) == unk)
    ydata[N-10] = sampledunk/iters

print "MLE is", np.argmax(ydata)

xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()
It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.
A simple (numerical) way to get a confidence interval is to run your script many times and see how much your estimate varies. You can use that standard deviation to calculate the confidence interval, as sketched below.
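As a sketch of that idea (estimate_mle() here is a hypothetical helper that reruns the whole simulation above and returns one maximum likelihood estimate):
import numpy as np

n_runs = 50
estimates = np.array([estimate_mle() for _ in range(n_runs)])  # estimate_mle() is hypothetical
mean, std = estimates.mean(), estimates.std()
ci = (mean - 1.96 * std, mean + 1.96 * std)                    # normal-approximation 95% interval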
In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:
import numpy as np

t = 30
k = 3

def trial(N):
    return t - len(np.unique(np.random.randint(0, N, size=t)))

def trials(N, n_trials):
    return np.asarray([trial(N) for i in xrange(n_trials)])

n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])

def likelihood(results):
    L = (results == k).mean(-1)
    # boxcar filtering
    n = 10
    L = np.convolve(L, np.ones(n) / float(n), mode='same')
    return L

def max_likelihood_estimate(Ns, results):
    i = np.argmax(likelihood(results))
    return Ns[i]

def max_likelihood(Ns, results):
    # calculate mean from all trials
    mean = max_likelihood_estimate(Ns, results)
    # randomly subsample results to estimate std
    n_samples = 100
    sample_frac = 0.25
    estimates = np.zeros(n_samples)
    for i in xrange(n_samples):
        mask = np.random.uniform(size=results.shape[1]) < sample_frac
        estimates[i] = max_likelihood_estimate(Ns, results[:,mask])
    std = estimates.std()
    sterr = std * np.sqrt(sample_frac) # is this mathematically sound?
    ci = (mean - 1.96*sterr, mean + 1.96*sterr)
    return mean, std, sterr, ci

mean, std, sterr, ci = max_likelihood(Ns, results)
print "Max likelihood estimate: ", mean
print "Max likelihood 95% ci: ", ci
There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set as in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.
Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better to compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might help cancel out some of the roughness in the curves that remains after filtering, but I'm not sure about this.
Here is an answer to your first question and a pointer to a solution for the second:
import numpy as np
import matplotlib.pyplot as plt

plt.plot(xdata, ydata)
# calculate the cumulative distribution function
cdf = np.cumsum(ydata)/np.sum(ydata)
# get the left and right boundary of the interval that contains 95% of the probability mass
right = np.argmax(cdf > 0.975)
left = np.argmax(cdf > 0.025)
# indicate confidence interval with vertical lines
plt.vlines(xdata[left], 0, ydata[left])
plt.vlines(xdata[right], 0, ydata[right])
# hatch confidence interval
plt.fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)
This produces the following figure:
I'll try to answer question 3 when I have more time :)

Truncated multivariate normal in SciPy?

I'm trying to automate a process that at some point needs to draw samples from a truncated multivariate normal. That is, it's an ordinary multivariate normal (i.e. Gaussian) distribution, but the variables are constrained to a cuboid. My given inputs are the mean and covariance of the full multivariate normal, but I need samples inside my box.
Up to now, I'd just been rejecting samples outside the box and resampling as necessary, but I'm starting to find that my process sometimes gives me (a) large covariances and (b) means that are close to the edges. These two events conspire against the speed of my system.
So what I'd like to do is sample the distribution correctly in the first place. Googling led only to this discussion or the truncnorm distribution in scipy.stats. The former is inconclusive and the latter seems to be for one variable. Is there any native multivariate truncated normal? And is it going to be any better than rejecting samples, or should I do something smarter?
I'm going to start working on my own solution, which would be to rotate the untruncated Gaussian to its principal axes (with an SVD decomposition or something), use a product of truncated Gaussians to sample the distribution, then rotate that sample back, and reject/resample as necessary. If the truncated sampling is more efficient, I think this should sample the desired distribution faster.
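For reference, the reject-and-resample baseline described in the question can be written as a short helper; this is the scheme that becomes slow when the box captures little probability mass under the full Gaussian.
import numpy as np

def truncated_mvn_rejection(mean, cov, lower, upper, n_samples, batch=10000):
    # draw from the full multivariate normal and keep only the points inside the box
    kept = []
    n_kept = 0
    while n_kept < n_samples:
        draws = np.random.multivariate_normal(mean, cov, size=batch)
        inside = np.all((draws >= lower) & (draws <= upper), axis=1)
        kept.append(draws[inside])
        n_kept += inside.sum()
    return np.concatenate(kept)[:n_samples]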
So, according to the Wikipedia article, sampling a multivariate truncated normal distribution (MTND) is more difficult. I ended up taking a relatively easy way out and using an MCMC sampler to relax an initial guess towards the MTND as follows.
I used emcee to do the MCMC work. I find this package phenomenally easy-to-use. It only requires a function that returns the log-probability of the desired distribution. So I defined this function
import numpy as np
from numpy.linalg import inv

def lnprob_trunc_norm(x, mean, bounds, C):
    if np.any(x < bounds[:,0]) or np.any(x > bounds[:,1]):
        return -np.inf
    else:
        return -0.5*(x-mean).dot(inv(C)).dot(x-mean)
Here, C is the covariance matrix of the multivariate normal. Then, you can run something like
S = emcee.EnsembleSampler(Nwalkers, Ndim, lnprob_trunc_norm, args = (mean, bounds, C))
pos, prob, state = S.run_mcmc(pos, Nsteps)
for given mean, bounds and C. You need an initial guess for the walkers' positions pos, which could be a ball around the mean,
pos = emcee.utils.sample_ball(mean, np.sqrt(np.diag(C)), size=Nwalkers)
or sampled from an untruncated multivariate normal,
pos = numpy.random.multivariate_normal(mean, C, size=Nwalkers)
and so on. I personally discard several thousand steps of samples first (burn-in), because it's fast, then force the remaining outliers back within the bounds, then run the MCMC sampling.
The number of steps for convergence is up to you.
Note also that emcee easily supports basic parallelization by adding the argument threads=Nthreads to the EnsembleSampler initialization. So you can make this blazing fast.
I have reimplemented an algorithm which does not depend on MCMC but creates independent and identically distributed (iid) samples from the truncated multivariate normal distribution. Having iid samples can be very useful! I used to also use emcee as described in the answer by Warrick, but for convergence the number of samples needed exploded in higher dimensions, making it impractical for my use case.
The algorithm was introduced by Botev (2016) and uses an accept-reject algorithm based on minimax exponential tilting. It was originally implemented in MATLAB but reimplementing it for Python increased the performance significantly compared to running it using the MATLAB engine in Python. It also works well and is fast at higher dimensions.
The code is available at: https://github.com/brunzema/truncated-mvn-sampler.
An Example:
import numpy as np
# TruncatedMVN is the sampler class provided by the repository linked above

d = 10  # dimensions

# random mu and cov
mu = np.random.rand(d)
cov = 0.5 - np.random.rand(d ** 2).reshape((d, d))
cov = np.triu(cov)
cov += cov.T - np.diag(cov.diagonal())
cov = np.dot(cov, cov)

# constraints
lb = np.zeros_like(mu) - 1
ub = np.ones_like(mu) * np.inf

# create truncated normal and sample from it
n_samples = 100000
tmvn = TruncatedMVN(mu, cov, lb, ub)
samples = tmvn.sample(n_samples)
Plotting the first dimension results in:
Reference:
Botev, Z. I., (2016), The normal law under linear restrictions: simulation and estimation via minimax tilting, Journal of the Royal Statistical Society Series B, 79, issue 1, p. 125-148
Simulating a truncated multivariate normal can be tricky and usually involves some conditional sampling by MCMC.
My short answer is: you can use my code (https://github.com/ralphma1203/trun_mvnt)! It implements a Gibbs sampler algorithm which can handle general linear constraints of the form lower ≤ D x ≤ upper, even when D is not of full rank and you have more constraints than the dimensionality.
import numpy as np
from trun_mvnt import rtmvn, rtmvt
########## Traditional problem, probably what you need... ##########
##### lower < X < upper #####
# So D = identity matrix
D = np.diag(np.ones(4))
lower = np.array([-1,-2,-3,-4])
upper = -lower
Mean = np.zeros(4)
Sigma = np.diag([1,2,3,4])
n = 10 # number of final samples to draw
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin)
# Numpy array n-by-p as result!
random_sample
########## Non-full rank problem (more constraints than dimension) ##########
Mean = np.array([0,0])
Sigma = np.array([1, 0.5, 0.5, 1]).reshape((2,2)) # bivariate normal
D = np.array([1,0,0,1,1,-1]).reshape((3,2)) # non-full rank problem
lower = np.array([-2,-1,-2])
upper = np.array([2,3,5])
n = 500 # want 500 final sample
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin) # Numpy array n-by-p as result!
A little late I guess, but for the record: you could use Hamiltonian Monte Carlo. A Matlab module named "HMC exact" exists. It shouldn't be too difficult to translate into Python.

Why does scipy.optimize.curve_fit produce parameters which are barely different from the guess?

I've been trying to fit some histogram data with scipy.optimize.curve_fit, but so far I haven't once been able to produce fit parameters that differ significantly from my guess parameters.
I wouldn't be terribly surprised to find that the more arcane parameters in my fit get stuck in local minima, but even linear coefficients won't move from my initial guesses!
If you've seen anything like this before, I'd love some advice. Do least-squared minimization routines just not work for certain classes of functions?
I tried this:
import numpy as np
from matplotlib.pyplot import *
from scipy.optimize import curve_fit

def grating_hist(x,frac,xmax,x0):
    # model data to be turned into a histogram
    dx = x[1]-x[0]
    z = np.linspace(0,1,20000,endpoint=True)
    grating = np.cos(frac*np.pi*z)
    norm_grating = xmax*(grating-grating[-1])/(1-grating[-1])+x0
    # produce the histogram
    bin_edges = np.append(x,x[-1]+x[1]-x[0])
    hist,bin_edges = np.histogram(norm_grating,bins=bin_edges)
    return hist

x = np.linspace(0,5,512)
p_data = [0.7,1.1,0.8]
pct = grating_hist(x,*p_data)
p_guess = [1,1,1]
p_fit,pcov = curve_fit(grating_hist,x,pct,p0=p_guess)

plot(x,pct,label='Data')
plot(x,grating_hist(x,*p_fit),label='Fit')
legend()
show()

print 'Data Parameters:', p_data
print 'Guess Parameters:', p_guess
print 'Fit Parameters:', p_fit
print 'Covariance:',pcov
and I see this: http://i.stack.imgur.com/GwXzJ.png (I'm new here, so I can't post images)
Data Parameters: [0.7, 1.1, 0.8]
Guess Parameters: [1, 1, 1]
Fit Parameters: [ 0.97600854 0.99458336 1.00366634]
Covariance: [[ 3.50047574e-06 -5.34574971e-07 2.99306123e-07]
[ -5.34574971e-07 9.78688795e-07 -6.94780671e-07]
[ 2.99306123e-07 -6.94780671e-07 7.17068753e-07]]
Whaaa? I'm pretty sure this isn't a local minimum for variations in xmax and x0, and it's a long way from the global minimum best fit. The fit parameters still don't change, even with better guesses. Different choices for curve functions (e.g. the sum of two normal distributions) do produce new parameters for the same data, so I know it's not the data itself. I also tried the same thing with scipy.optimize.leastsq itself just in case, but no dice; the parameters still don't move. If you have any thoughts on this, I'd love to hear them!
The problem you're facing is actually not due to curve_fit (or leastsq). It is due to the landscape of the objective of your optimisation problem. In your case the objective is the sum of squared residuals, which you are trying to minimise. Now, if you look closely at your objective in a small neighbourhood of your initial conditions, for example using the code below, which focuses only on the first parameter:
import numpy as np
import matplotlib.pyplot as py

p_ind = 0
eps = 1e-6
n_points = 100
frac_surroundings = np.linspace(p_guess[p_ind] - eps, p_guess[p_ind] + eps, n_points)
obj = []
temp_guess = list(p_guess)
for p in frac_surroundings:
    temp_guess[0] = p
    obj.append(((grating_hist(x, *p_data) - grating_hist(x, *temp_guess))**2.0).sum())
py.plot(frac_surroundings, obj)
py.show()
you will notice that the landscape is piecewise constant (you can easily check that the situation is the same for the other parameters). The problem is that these pieces are of the order of 10^-6, whereas the initial step of the fitting procedure is somewhere around 10^-8, hence the procedure quickly concludes that you cannot improve on the given initial condition. You could try to fix it by changing the epsfcn parameter in curve_fit, but you would quickly notice that the landscape, on top of being piecewise constant, is also very "rugged". In other words, curve_fit is simply not well suited for such a problem, which is highly non-convex and therefore difficult for gradient-based methods. Probably some stochastic optimisation methods could do a better job; that is, however, a different question/problem. One such option is sketched below.
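For example, a rough sketch using scipy.optimize.differential_evolution (one stochastic/global optimiser available in SciPy) on the same sum-of-squares objective, reusing x, pct and grating_hist from the question; the search ranges for frac, xmax and x0 are assumptions:
from scipy.optimize import differential_evolution

def sum_of_squares(p):
    # same residual-sum-of-squares objective that curve_fit minimises internally
    return ((pct - grating_hist(x, *p))**2).sum()

bounds = [(0.1, 2.0), (0.1, 2.0), (0.1, 2.0)]   # assumed search ranges for frac, xmax, x0
result = differential_evolution(sum_of_squares, bounds, seed=0)
print(result.x)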
I think it is a local minimum, or the algorithm fails for a non-trivial reason. It is far easier to fit the data to the input, instead of fitting the statistical description of the data to the statistical description of the input.
Here's a modified version of the code doing so:
z = np.linspace(0,1,20000,endpoint=True)

def grating_hist_indicator(x,frac,xmax,x0):
    # model data to be turned into a histogram
    dx = x[1]-x[0]
    grating = np.cos(frac*np.pi*z)
    norm_grating = xmax*(grating-grating[-1])/(1-grating[-1])+x0
    return norm_grating

x = np.linspace(0,5,512)
p_data = [0.7,1.1,0.8]
pct = grating_hist(x,*p_data)
pct_indicator = grating_hist_indicator(x,*p_data)
p_guess = [1,1,1]
p_fit,pcov = curve_fit(grating_hist_indicator,x,pct_indicator,p0=p_guess)

plot(x,pct,label='Data')
plot(x,grating_hist(x,*p_fit),label='Fit')
legend()
show()
