Latin hypercube sampling with Python

I would like to sample a distribution defined by a function in multiple dimensions (2,3,4):
f(x, y, ...) = ...
The distributions might be ugly and non-standard (e.g. a 3D spline on data, a sum of Gaussians, etc.). To this end I would like to uniformly sample the 2..4-dimensional space, and then accept or reject each sampled point into my sample using an additional random number.
Is there a ready-to-use Python library for this purpose?
Is there a Python library for generating points in this 2..4-dimensional space with Latin hypercube sampling, or with another uniform sampling method? Brute-force sampling with independent random numbers usually produces regions of the space that are denser or sparser than others.
If 1) and 2) don't exist, is there anybody kind enough to share their implementation for the same or a similar problem?
I'll use it in Python code, but links to other solutions are also appreciated.

I guess this is a late answer, but it may also help future visitors. I have just put up an implementation of multi-dimensional uniform Latin hypercube sampling on git. It is minimal but very easy to use: you can generate uniform random variables sampled in n dimensions using Latin hypercube sampling, provided your variables are independent. Below is an example plot comparing Monte Carlo sampling and Latin Hypercube Sampling with Multi-dimensional Uniformity (LHS-MDU) in two dimensions with zero correlation.
import lhsmdu
import matplotlib.pyplot as plt
import numpy
l = lhsmdu.sample(2,10) # Latin Hypercube Sampling of two variables, and 10 samples each.
k = lhsmdu.createRandomStandardUniformMatrix(2,10) # Monte Carlo Sampling
fig = plt.figure()
ax = fig.gca()
ax.set_xticks(numpy.arange(0,1,0.1))
ax.set_yticks(numpy.arange(0,1,0.1))
plt.scatter(k[0], k[1], color="b", label="MC")
plt.scatter(l[0], l[1], color="r", label="LHS-MDU")
plt.legend()
plt.grid()
plt.show()

Now the pyDOE library provides a tool to generate Latin-hypercube-based samples.
https://pythonhosted.org/pyDOE/randomized.html
To generate samples over n dimensions:
lhs(n, [samples, criterion, iterations])
where n is the number of dimensions and samples is the total number of sample points to generate.
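For reference, a minimal sketch of how the call could look, assuming pyDOE is installed; the bounds below are made up for illustration. lhs returns points in the unit hypercube with shape (samples, n), which you can rescale to your own box:
import numpy as np
from pyDOE import lhs

n_dim = 3                                        # 2..4 dimensions as in the question
unit_points = lhs(n_dim, samples=200, criterion='maximin')

# rescale each column from [0, 1] to hypothetical per-dimension bounds
lower = np.array([0.0, -1.0, 3.0])
upper = np.array([1.0,  1.0, 11.0])
points = lower + unit_points * (upper - lower)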

Here is an update of Sahil M's answer for Python 3 (updated from Python 2 to Python 3, plus some minor code changes so that the code matches the figure):
import lhsmdu
import matplotlib.pyplot as plt
import numpy
l = lhsmdu.sample(2,10) # Latin Hypercube Sampling of two variables, and 10 samples each.
k = lhsmdu.createRandomStandardUniformMatrix(2,10) # Monte Carlo Sampling
fig = plt.figure()
ax = fig.gca()
ax.set_xticks(numpy.arange(0,1,0.1))
ax.set_yticks(numpy.arange(0,1,0.1))
plt.scatter([k[0]], [k[1]], color="r", label="MC")
plt.scatter([l[0]], [l[1]], color="b", label="LHS-MDU")
plt.legend()
plt.grid()
plt.show()
I once encountered a Python MemoryError running this script. Any suggestions why this could happen, or how to change the script so it doesn't happen in the future?

Latin hypercube sampling is now part of SciPy, since version 1.7. See the documentation.
from scipy.stats.qmc import LatinHypercube
engine = LatinHypercube(d=2)
sample = engine.random(n=100)
It supports centering, strength, and optimization.
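To cover the accept/reject part of the question as well, here is a hedged sketch built on the same SciPy engine (SciPy >= 1.7). The target density f, the bounds, and the rough envelope estimate are placeholders, not part of the original answer:
import numpy as np
from scipy.stats import qmc

def f(pts):
    # hypothetical target density, vectorized over an (n, d) array
    return np.exp(-np.sum(pts**2, axis=1))

rng = np.random.default_rng(0)
engine = qmc.LatinHypercube(d=2, seed=rng)
unit = engine.random(n=1000)                         # points in [0, 1]^d
pts = qmc.scale(unit, l_bounds=[-3, -3], u_bounds=[3, 3])

f_max = f(pts).max()                                 # rough envelope estimate
accepted = pts[rng.uniform(0, f_max, len(pts)) < f(pts)]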

This 2-D example samples uniformly in two dimensions, keeps each grid point with constant probability (so the number of kept points is binomially distributed), selects that many points at random without replacement from the sample space, and generates a pair of vectors which you can then pass to your function f:
import numpy as np
import random
resolution = 10
keepprob = 0.5
min1, max1 = 0., 1.
min2, max2 = 3., 11.
keepnumber = np.random.binomial(resolution * resolution, keepprob,1)
array1,array2 = np.meshgrid(np.linspace(min1,max1,resolution),np.linspace(min2,max2,resolution))
randominixes = random.sample(list(range(resolution * resolution)), int(keepnumber))
randominixes.sort()
vec1Sampled,vec2Sampled = array1.flatten()[randominixes],array2.flatten()[randominixes]
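If you then want the accept/reject step described in the question, a hedged continuation could look like this; the density f is a placeholder, not part of the original answer:
# accept/reject each sampled grid point against a hypothetical target density f
def f(v1, v2):
    return np.exp(-(v1**2 + (v2 - 7.0)**2))

density = f(vec1Sampled, vec2Sampled)
u = np.random.uniform(0.0, density.max(), size=density.shape)
kept1, kept2 = vec1Sampled[u < density], vec2Sampled[u < density]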


Marginal Density Probability using np

P = np.array(
[
[0.03607908, 0.03760034, 0.00503184, 0.0205082 , 0.01051408,
0.03776221, 0.00131325, 0.03760817, 0.01770659],
[0.03750162, 0.04317351, 0.03869997, 0.03069872, 0.02176718,
0.04778769, 0.01021053, 0.00324185, 0.02475319],
[0.03770951, 0.01053285, 0.01227089, 0.0339596 , 0.02296711,
0.02187814, 0.01925662, 0.0196836 , 0.01996279],
[0.02845139, 0.01209429, 0.02450163, 0.00874645, 0.03612603,
0.02352593, 0.00300314, 0.00103487, 0.04071951],
[0.00940187, 0.04633153, 0.01094094, 0.00172007, 0.00092633,
0.02032679, 0.02536328, 0.03552956, 0.01107725]
]
)
Here's the dataset, where X corresponds to rows and Y corresponds to columns. I'm trying to figure out how to calculate the covariance and the marginal density probability (MDP) for Y (the columns). For the covariance I have the following:
np.cov(P)
array([[ 2.247e-04, 6.999e-05, 2.571e-05, -2.822e-05, 1.061e-04],
[ 6.999e-05, 2.261e-04, 9.535e-07, 8.165e-05, -2.013e-05],
[ 2.571e-05, 9.535e-07, 7.924e-05, 1.357e-05, -8.118e-05],
[-2.822e-05, 8.165e-05, 1.357e-05, 2.039e-04, -1.267e-04],
[ 1.061e-04, -2.013e-05, -8.118e-05, -1.267e-04, 2.372e-04]])
How do I get the MDP? Also, is there a way to use NumPy to select just the X and Y values and assign them to variables, where X = P's rows and Y = P's columns?
The data stored in P are a little ambiguous. In statistics, X and Y have very specific meanings. Usually, each row of a data matrix refers to one observation (i.e. data point) of some statistical object, while each column represents a feature measured on that object. Read that way, your P is the transpose of the usual layout: 9 observations (the columns) of 5 features (the rows). Such a matrix of features is referred to as a design matrix X, considered exogenous (independent), and serves as the foundation of most statistical learning algorithms. In supervised learning there is additionally a vector Y whose length equals the number of observations in X.
Your task at hand is of an unsupervised nature, as there is no true endogenous variable Y and you are interested in the distribution of X alone. This opens up additional questions. Indeed, np.cov() computes the empirical covariance matrix measuring the pairwise covariance between each of the 5 features, resulting in the symmetric 5x5 matrix you show. Asking for the marginal probability density of each feature, however, refers to the univariate distribution of that feature alone; the covariance between features is irrelevant for this task.
There are several methodologies to obtain estimates of an unknown distribution from data. Broadly speaking, they fall into two categories: parametric and non-parametric. I'll explain how each methodology works and how it can be implemented by leveraging NumPy in exactly the way you alluded to.
1. Parametric density estimation
In many cases, one assumes that the data stem from a particular parametric distribution. This distributional assumption is mostly based on convenience rather than prior knowledge. Then, estimating the unknown parameter values completely determines the distribution(s). In your case, for example, you could assume that each feature follows a univariate normal distribution.
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats
# mere numerical grid to plot densities (no statistical significance)
x_ = np.linspace(0.0, 0.055, 1000)
# estimate mean (mu) and standard deviation (sigma) for each feature (row of P)
mean_vec = np.mean(P, axis=1)
std_vec = np.std(P, axis=1)
# fitted normal density for each feature
for i in range(5):
    plt.plot(x_, stats.norm.pdf(x_, loc=mean_vec[i], scale=std_vec[i]),
             label='Feature {}'.format(i + 1))
plt.suptitle('Marginal Distribution Estimates', fontsize=21, y=1.025)
plt.title('parametric via univariate Normal densities', fontsize=14)
plt.legend(loc='upper right')
plt.show()
2. Nonparametric density estimation
Alternatively, you can use a histogram as a non-parametric estimator of the unknown probability density function of each feature. Note, however, that you still have to choose a bandwidth h that determines the width of the bins. Additionally, non-parametric tools require a larger sample size to provide accurate estimates; your sample size of 9 per feature is likely insufficient.
import numpy as np
from matplotlib import pyplot as plt
# endpoints of all bins: implies a bandwidth h = 0.00229
bins = np.linspace(0.0, 0.055, 25)
h = np.diff(bins)[0]
# histogram for each feature (9 observations per row of P)
for i in range(5):
    plt.hist(P[i, :], bins, alpha=0.5, label='Feature {}'.format(i + 1))
plt.suptitle('Marginal Distribution Estimates', fontsize=21, y=1.025)
plt.title('nonparametric via histograms (h={})'.format(round(h, 4)), fontsize=14)
plt.legend(loc='upper right')
plt.show()

cvxpy.error.DCPError: Problem does not follow DCP rules

I'm getting my head around the DCP rules. I am looking at the portfolio optimisation example provided on the CVXPY website (original code below). I had a look at some of the other questions that deal with DCP rules but couldn't find the answer I wanted.
I tried replacing the Sigma (i.e. the covariance matrix) in their code, which is randomly generated, with a covariance matrix computed from historic returns for some asset classes. Everything else is the same.
Yet I get cvxpy.error.DCPError: Problem does not follow DCP rules.
I have also added pictures of the two Sigmas (one generated randomly by the CVXPY code, the other the historic covariance array I use).
Both are 9x9 arrays, but as I mentioned, replacing the randomly generated array with the array of historic numbers gives me that error, with all other code kept the same. Any idea what's causing this issue?
# Generate data for long only portfolio optimization.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
np.random.seed(1)
n = 10
mu = np.abs(np.random.randn(n, 1))
Sigma = np.random.randn(n, n)
Sigma = Sigma.T.dot(Sigma)
# Long only portfolio optimization.
import cvxpy as cp
w = cp.Variable(n)
gamma = cp.Parameter(nonneg=True)
ret = mu.T*w
risk = cp.quad_form(w, Sigma)
prob = cp.Problem(cp.Maximize(ret - gamma*risk),
[cp.sum(w) == 1,
w >= 0])
# Compute trade-off curve.
SAMPLES = 100
risk_data = np.zeros(SAMPLES)
ret_data = np.zeros(SAMPLES)
gamma_vals = np.logspace(-2, 3, num=SAMPLES)
for i in range(SAMPLES):
    gamma.value = gamma_vals[i]
    prob.solve()
    risk_data[i] = cp.sqrt(risk).value
    ret_data[i] = ret.value
# Plot long only trade-off curve.
import matplotlib.pyplot as plt
#%matplotlib inline
#%config InlineBackend.figure_format = 'svg'
markers_on = [29, 40]
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(risk_data, ret_data, 'g-')
for marker in markers_on:
    plt.plot(risk_data[marker], ret_data[marker], 'bs')
    ax.annotate(r"$\gamma = %.2f$" % gamma_vals[marker],
                xy=(risk_data[marker] + .08, ret_data[marker] - .03))
for i in range(n):
    plt.plot(cp.sqrt(Sigma[i, i]).value, mu[i], 'ro')
plt.xlabel('Standard deviation')
plt.ylabel('Return')
plt.show()
Observation: it is possible that the variance-covariance matrix in a portfolio optimization is not positive semi-definite. In theory the covariance matrix can be shown to be positive semi-definite; however, due to floating-point rounding errors, we may actually see (slightly) negative eigenvalues. (Note: a positive semi-definite matrix has non-negative eigenvalues.)
I know of three approaches to handle this:
Use a statistical technique called shrinkage,
Perturb the diagonal a bit by adding a small constant to each diagonal element (see the sketch below),
Use a variant of the standard portfolio model based on mean adjusted returns.
For details see link.
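For the second option, here is a minimal sketch of the diagonal perturbation, assuming Sigma holds the historic 9x9 covariance array; the 1e-8 jitter size is an arbitrary choice:
import numpy as np

# check whether the historic covariance matrix is numerically PSD
eigvals = np.linalg.eigvalsh(Sigma)
print("smallest eigenvalue:", eigvals.min())

# if not, shift the whole spectrum just above zero with a small jitter
if eigvals.min() < 0:
    Sigma = Sigma + (-eigvals.min() + 1e-8) * np.eye(Sigma.shape[0])

# cp.quad_form(w, Sigma) should now pass the DCP check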

Understanding hyperopt's TPE algorithm

I am illustrating hyperopt's TPE algorithm for my master's project and can't seem to get the algorithm to converge. From what I understand from the original paper and a YouTube lecture, the TPE algorithm works in the following steps:
(in the following, x=hyperparameters and y=loss)
Start by creating a search history of [x,y], say 10 points.
Sort the hyperparameters according to their loss and divide them into two sets using some quantile γ (γ = 0.5 means the sets will be equally sized)
Make a kernel density estimation for both the poor hyperparameter group (g(x)) and good hyperparameter group (l(x))
Good candidates will have low probability under g(x) and high probability under l(x), so we propose to evaluate the function at argmin(g(x)/l(x))
Evaluate (x,y) pair at the proposed point and repeat steps 2-5.
I have implemented this in Python for the objective function f(x) = x^2, but the algorithm fails to converge to the minimum.
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde

def objective_func(x):
    return x**2

def measure(x):
    noise = np.random.randn(len(x))*0
    return x**2 + noise

def split_meassures(x_obs, y_obs, gamma=1/2):
    # split x and y observations into two sets and return a separation threshold (y_star)
    size = int(len(x_obs)//(1/gamma))
    l = {'x': x_obs[:size], 'y': y_obs[:size]}
    g = {'x': x_obs[size:], 'y': y_obs[size:]}
    y_star = (l['y'][-1] + g['y'][0])/2
    return l, g, y_star

# sample objective function values for illustration
x_obj = np.linspace(-5, 5, 10000)
y_obj = objective_func(x_obj)

# start by sampling a parameter search history
x_obs = np.linspace(-5, 5, 10)
y_obs = measure(x_obs)

nr_iterations = 100
for i in range(nr_iterations):
    # sort observations according to loss
    sort_idx = y_obs.argsort()
    x_obs, y_obs = x_obs[sort_idx], y_obs[sort_idx]
    # split sorted observations in two groups (l and g)
    l, g, y_star = split_meassures(x_obs, y_obs)
    # approximate distributions for both groups using kernel density estimation
    kde_l = gaussian_kde(l['x']).evaluate(x_obj)
    kde_g = gaussian_kde(g['x']).evaluate(x_obj)
    # define our evaluation measure for sampling a new point
    eval_measure = kde_g/kde_l
    if i % 10 == 0:
        plt.figure()
        plt.subplot(2, 2, 1)
        plt.plot(x_obj, y_obj, label='Objective')
        plt.plot(x_obs, y_obs, '*', label='Observations')
        plt.plot([-5, 5], [y_star, y_star], 'k')
        plt.subplot(2, 2, 2)
        plt.plot(x_obj, kde_l)
        plt.subplot(2, 2, 3)
        plt.plot(x_obj, kde_g)
        plt.subplot(2, 2, 4)
        plt.semilogy(x_obj, eval_measure)
        plt.draw()
    # find point to evaluate and add the new observation
    best_search = x_obj[np.argmin(eval_measure)]
    x_obs = np.append(x_obs, [best_search])
    y_obs = np.append(y_obs, [measure(np.asarray([best_search]))])
plt.show()
I suspect this happens because we keep sampling where we are most certain, which makes l(x) narrower and narrower around that point and never changes where we sample. So where is my understanding lacking?
So, I am still learning about TPE as well, but here are two problems with this code:
This code will only evaluate a few unique points, because the next location is chosen purely from the kernel density estimates and there is no mechanism for exploring the search space (which is what acquisition functions do in other Bayesian optimization methods).
Because this code simply appends new observations to the lists of x and y, it adds a lot of duplicates. The duplicates lead to a skewed set of observations, which leads to a very weird split; you can easily see that in the later plots. The eval_measure starts out similar to the objective function but diverges later on.
If you remove the duplicates in x_obs and y_obs, you solve problem 2. The first problem can only be removed by adding some way of exploring the search space, as sketched below.
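Here is a hedged sketch of both fixes inside the loop, reusing the variable names from the question's script; the 20% exploration rate is an arbitrary choice, not something prescribed by the TPE paper:
import numpy as np

# fix 2: drop duplicate observations before sorting/splitting
x_obs, unique_idx = np.unique(x_obs, return_index=True)
y_obs = y_obs[unique_idx]

# fix 1: occasionally explore the search space instead of always exploiting
if np.random.rand() < 0.2:
    best_search = np.random.uniform(-5, 5)          # random exploration
else:
    best_search = x_obj[np.argmin(eval_measure)]    # exploit argmin(g/l) as before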

Truncated multivariate normal in SciPy?

I'm trying to automate a process that at some point needs to draw samples from a truncated multivariate normal. That is, it's a multivariate normal (i.e. Gaussian) distribution, but the variables are constrained to a cuboid. My given inputs are the mean and covariance of the full multivariate normal, but I need samples inside my box.
Up to now, I'd just been rejecting samples outside the box and resampling as necessary, but I'm starting to find that my process sometimes gives me (a) large covariances and (b) means that are close to the edges. These two events conspire against the speed of my system.
So what I'd like to do is sample the distribution correctly in the first place. Googling led only to this discussion or the truncnorm distribution in scipy.stats. The former is inconclusive and the latter seems to be for one variable. Is there any native multivariate truncated normal? And is it going to be any better than rejecting samples, or should I do something smarter?
I'm going to start working on my own solution, which would be to rotate the untruncated Gaussian to its principal axes (with an SVD or similar), use a product of truncated Gaussians to sample the distribution, then rotate that sample back, and reject/resample as necessary. If the truncated sampling is more efficient, I think this should sample the desired distribution faster.
So, according to the Wikipedia article, sampling a multivariate truncated normal distribution (MTND) is more difficult. I ended up taking a relatively easy way out and using an MCMC sampler to relax an initial guess towards the MTND as follows.
I used emcee to do the MCMC work. I find this package phenomenally easy to use. It only requires a function that returns the log-probability of the desired distribution, so I defined this function:
import numpy as np
from numpy.linalg import inv

def lnprob_trunc_norm(x, mean, bounds, C):
    if np.any(x < bounds[:, 0]) or np.any(x > bounds[:, 1]):
        return -np.inf
    else:
        return -0.5 * (x - mean).dot(inv(C)).dot(x - mean)
Here, C is the covariance matrix of the multivariate normal. Then, you can run something like
S = emcee.EnsembleSampler(Nwalkers, Ndim, lnprob_trunc_norm, args = (mean, bounds, C))
pos, prob, state = S.run_mcmc(pos, Nsteps)
for given mean, bounds and C. You need an initial guess for the walkers' positions pos, which could be a ball around the mean,
pos = emcee.utils.sample_ball(mean, np.sqrt(np.diag(C)), size=Nwalkers)
or sampled from an untruncated multivariate normal,
pos = numpy.random.multivariate_normal(mean, C, size=Nwalkers)
and so on. I personally run several thousand steps first and discard them (because it's fast), then force the remaining outliers back within the bounds, and then run the production MCMC sampling.
The number of steps for convergence is up to you.
Note also that emcee easily supports basic parallelization by adding the argument threads=Nthreads to the EnsembleSampler initialization. So you can make this blazing fast.
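For reference, a hedged sketch of that burn-in-then-sample pattern, sticking to the same call style as the snippet above; Nburn and Nsteps are placeholders:
# run a burn-in phase first and discard it
pos, prob, state = S.run_mcmc(pos, Nburn)
S.reset()                      # clear the stored chain
# then run the production sampling from the burned-in positions
pos, prob, state = S.run_mcmc(pos, Nsteps)
samples = S.flatchain          # shape (Nwalkers * Nsteps, Ndim)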
I have reimplemented an algorithm which does not depend on MCMC but creates independent and identically distributed (iid) samples from the truncated multivariate normal distribution. Having iid samples can be very useful! I used to also use emcee as described in the answer by Warrick, but for convergence the number of samples needed exploded in higher dimensions, making it impractical for my use case.
The algorithm was introduced by Botev (2016) and uses an accept-reject algorithm based on minimax exponential tilting. It was originally implemented in MATLAB but reimplementing it for Python increased the performance significantly compared to running it using the MATLAB engine in Python. It also works well and is fast at higher dimensions.
The code is available at: https://github.com/brunzema/truncated-mvn-sampler.
An Example:
import numpy as np
# TruncatedMVN is the sampler class provided by the repository linked above

d = 10  # dimensions

# random mu and cov
mu = np.random.rand(d)
cov = 0.5 - np.random.rand(d ** 2).reshape((d, d))
cov = np.triu(cov)
cov += cov.T - np.diag(cov.diagonal())
cov = np.dot(cov, cov)
# constraints
lb = np.zeros_like(mu) - 1
ub = np.ones_like(mu) * np.inf
# create truncated normal and sample from it
n_samples = 100000
tmvn = TruncatedMVN(mu, cov, lb, ub)
samples = tmvn.sample(n_samples)
Plotting the first dimension of the samples gives a histogram of the truncated marginal (figure omitted).
Reference:
Botev, Z. I. (2016). The normal law under linear restrictions: simulation and estimation via minimax tilting. Journal of the Royal Statistical Society: Series B, 79(1), 125-148.
Simulating truncated multivariate normal can be tricky and usually involves some conditional sampling by MCMC.
My short answer is: you can use my code (https://github.com/ralphma1203/trun_mvnt)! It implements a Gibbs sampler algorithm that can handle general linear constraints of the form lower ≤ Dx ≤ upper, even when D is not of full rank and you have more constraints than the dimensionality.
import numpy as np
from trun_mvnt import rtmvn, rtmvt
########## Traditional problem, probably what you need... ##########
##### lower < X < upper #####
# So D = identity matrix
D = np.diag(np.ones(4))
lower = np.array([-1,-2,-3,-4])
upper = -lower
Mean = np.zeros(4)
Sigma = np.diag([1,2,3,4])
n = 10 # want 10 final samples
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin)
# Numpy array n-by-p as result!
random_sample
########## Non-full rank problem (more constraints than dimension) ##########
Mean = np.array([0,0])
Sigma = np.array([1, 0.5, 0.5, 1]).reshape((2,2)) # bivariate normal
D = np.array([1,0,0,1,1,-1]).reshape((3,2)) # non-full rank problem
lower = np.array([-2,-1,-2])
upper = np.array([2,3,5])
n = 500 # want 500 final sample
burn = 100 # burn-in first 100 iterates
thin = 1 # thinning for Gibbs
random_sample = rtmvn(n, Mean, Sigma, D, lower, upper, burn, thin) # Numpy array n-by-p as result!
A little late I guess, but for the record: you could use Hamiltonian Monte Carlo. A Matlab module named "HMC exact" exists; it shouldn't be too difficult to translate to Python.

Separate mixture of gaussians in Python

There is a result of some physical experiment, which can be represented as a histogram [i, amount_of(i)]. I suppose that the result can be estimated by a mixture of 4-6 Gaussian functions.
Is there a package in Python which takes a histogram as an input and returns the mean and variance of each Gaussian distribution in the mixture distribution?
Original data, for example (histogram figure not included here):
This is a mixture of Gaussians, and it can be estimated using an expectation-maximization approach (basically, it finds the centers and widths of the component distributions at the same time as it estimates how they are mixed together).
This is implemented in the PyMix package. Below I generate an example of a mixture of normals and use PyMix to fit a mixture model to it, including figuring out what you're interested in, which is the size of the subpopulations:
# requires numpy and PyMix (matplotlib is just for making a histogram)
import random
import numpy as np
from matplotlib import pyplot as plt
import mixture
random.seed(010713) # to make it reproducible
# create a mixture of normals:
# 1000 from N(0, 1)
# 2000 from N(6, 2)
mix = np.concatenate([np.random.normal(0, 1, [1000]),
np.random.normal(6, 2, [2000])])
# histogram:
plt.hist(mix, bins=20)
plt.savefig("mixture.pdf")
All the above code does is generate and plot the mixture. It looks like this:
Now to actually use PyMix to figure out what the percentages are:
data = mixture.DataSet()
data.fromArray(mix)
# start them off with something arbitrary (probably based on a guess from the figure)
n1 = mixture.NormalDistribution(-1,1)
n2 = mixture.NormalDistribution(1,1)
m = mixture.MixtureModel(2,[0.5,0.5], [n1,n2])
# perform expectation maximization
m.EM(data, 40, .1)
print m
The output model of this is:
G = 2
p = 1
pi =[ 0.33307859 0.66692141]
compFix = [0, 0]
Component 0:
ProductDist:
Normal: [0.0360178848449, 1.03018725918]
Component 1:
ProductDist:
Normal: [5.86848468319, 2.0158608802]
Notice it found the two normals quite correctly (approximately one N(0, 1) and one N(6, 2)). It also estimated pi, which is the fraction of samples in each of the two distributions (you mention in the comments that this is what you're most interested in). We had 1000 in the first distribution and 2000 in the second, and it gets the division almost exactly right: [0.33307859 0.66692141]. If you want this value directly, use m.pi.
A few notes:
This approach takes a vector of values, not a histogram. It should be easy to convert your data into a 1-D vector (that is, turn [(1.4, 2), (2.6, 3)] into [1.4, 1.4, 2.6, 2.6, 2.6]); see the sketch after these notes.
We had to guess the number of gaussian distributions in advance (it won't figure out a mix of 4 if you ask for a mix of 2).
We had to put in some initial estimates for the distributions. If you make even remotely reasonable guesses it should converge to the correct estimates.
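Since PyMix targets Python 2, here is a hedged alternative sketch of the same EM fit using scikit-learn's GaussianMixture; it also demonstrates the histogram-to-vector conversion from the first note (the bin values and counts there are made up for illustration):
import numpy as np
from sklearn.mixture import GaussianMixture

# note 1: turn histogram pairs [(value, count), ...] into a flat 1-D sample
hist_values = np.array([1.4, 2.6])
hist_counts = np.array([2, 3])
flat = np.repeat(hist_values, hist_counts)     # -> [1.4, 1.4, 2.6, 2.6, 2.6]

# EM fit on the `mix` sample generated earlier in this answer
gmm = GaussianMixture(n_components=2, random_state=0).fit(mix.reshape(-1, 1))
print(gmm.weights_)                            # mixing fractions (pi), ~[1/3, 2/3]
print(gmm.means_.ravel())                      # component means, ~[0, 6]
print(np.sqrt(gmm.covariances_).ravel())       # component standard deviations, ~[1, 2]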
