Suppose I have a function that returns a 3-by-3 2D array with random entries within a given bound:
import random

def random3by3Matrix(smallest_num, largest_num):
    matrix = [[0 for x in range(3)] for y in range(3)]
    for i in range(3):
        for j in range(3):
            matrix[i][j] = (random.randrange(int(smallest_num), int(largest_num) + 1)
                            if smallest_num != largest_num else smallest_num)
    return matrix

print(random3by3Matrix(-10, 10))
The code above returns something like this:
[[-6, 10, -4], [-10, -9, 8], [10, 1, 1]]
How would I write a unittest for a function like this? I thought of using a helper function:
def isEveryEntryGreaterEqual(list1, list2):
    for i in range(len(list1)):
        for j in range(len(list1[0])):
            if not (list1[i][j] <= list2[i][j]):
                return False
    return True
import unittest

class TestFunction(unittest.TestCase):
    def test_random3by3Matrix(self):
        lower_bound = [[-10 for x in range(3)] for y in range(3)]
        upper_bound = [[10 for x in range(3)] for y in range(3)]
        self.assertEqual(True, isEveryEntryGreaterEqual(lower_bound, random3by3Matrix(-10, 10)))
        self.assertEqual(True, isEveryEntryGreaterEqual(random3by3Matrix(-10, 10), upper_bound))
But is there a cleaner way to do this?
Furthermore, how would you test that all of your values are not only between the boundaries, but also distributed randomly?
Test matrix bounds
It looks like you want to test whether every single element in the matrix lies above (or below) some value, independently of where in the matrix that element sits. You can make this code shorter and more readable by extracting all the elements from the matrix and checking them in one go, instead of using the double for loop. You can flatten any nested list into a 1D array with numpy.ravel() (numpy has no module-level flatten(); flatten() only exists as a method on arrays), and then test the resulting 1D array in one go with Python's built-in all() function. This way, you avoid looping over all the elements yourself:
import numpy as np

def is_matrix_in_bounds(matrix, low, high):
    flat_list = np.ravel(matrix)  # create a 1D array
    # Each element is a boolean that is True if the entry is within bounds
    in_bounds = [low <= e <= high for e in flat_list]
    # all() returns True only if every element of in_bounds is True;
    # it returns False as soon as a single element is False
    return all(in_bounds)
class TestFunction(unittest.TestCase):
    def test_random3by3Matrix(self):
        lower_bound = -10
        upper_bound = 10
        matrix = random3by3Matrix(-10, 10)
        self.assertEqual(True, is_matrix_in_bounds(matrix, lower_bound, upper_bound))
If you will be using things like the matrix and the bounds in multiple tests, it may be beneficial to make them class attributes, so you don't have to define them in each test function.
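For instance, a minimal sketch (assuming the same random3by3Matrix and is_matrix_in_bounds as above) could create the shared fixtures once with setUpClass:

import unittest

class TestFunction(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # shared fixtures, created once for the whole test class
        cls.lower_bound = -10
        cls.upper_bound = 10
        cls.matrix = random3by3Matrix(cls.lower_bound, cls.upper_bound)

    def test_random3by3Matrix_bounds(self):
        self.assertTrue(is_matrix_in_bounds(self.matrix, self.lower_bound, self.upper_bound))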
Test matrix randomness
Testing whether the matrix entries are truly randomly distributed is a bit harder, since it involves a statistical test. The best you can do here is calculate the odds that they are indeed randomly distributed, and put a threshold on how low these odds are allowed to be. Since the values in the matrix do not depend on each other, you're in luck: you can again test them as if they were a 1D sample.
To test this, you should create a second random uniform sample, and test the goodness of fit between your matrix and that sample with a Kolmogorov-Smirnov test. This treats the two as random samples and tests how likely it is that they were drawn from the same underlying distribution, in your case a uniform distribution over the integers in the bounds. If the distributions are vastly different, the test gives a very low p-value (i.e. the odds of these samples coming from the same underlying distribution are low). If they are similar, the p-value will be high. You want a random matrix, so you want a high p-value. The usual cutoff is 0.05, which means that roughly 1 in 20 genuinely random samples will be flagged as non-random simply by chance. Python provides such a test with the scipy module. You can either pass two samples (a two-sample KS test), or pass the name of a distribution and specify its parameters (a one-sample KS test). In the latter case, the distribution name should be the name of a distribution in scipy.stats, and you pass the parameters for that distribution via the keyword argument args=().
import numpy as np
from scipy import stats

class TestFunction(unittest.TestCase):
    def test_random3by3Matrix_randomness(self):
        lower_bound = -10
        upper_bound = 10
        matrix = np.ravel(random3by3Matrix(lower_bound, upper_bound))
        # two-sample test against a freshly drawn uniform sample
        random_dist = np.random.randint(low=lower_bound, high=upper_bound + 1, size=3 * 3)
        statistic, p_value = stats.kstest(matrix, random_dist)
        # one-sample test: equivalent, but neater, since it doesn't require creating
        # a second sample (scipy.stats.randint excludes the upper bound, hence + 1)
        statistic, p_value = stats.kstest(matrix, "randint", args=(lower_bound, upper_bound + 1))
        self.assertGreater(p_value, 0.05)
Note that unittests with a random aspect will sometimes fail. Such is the nature of randomness.
see:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.randint.html#scipy.stats.randint
Is there a function in numpy/scipy that lets you sample from a multinomial distribution given a vector of small log probabilities, without losing precision? For example:
# sample element randomly from these log probabilities
l = [-900, -1680]
The naive method fails because of underflow:
import scipy
import numpy as np
# this makes a all zeroes
a = np.exp(l) / scipy.misc.logsumexp(l)
r = np.random.multinomial(1, a)
This is one attempt:
def s(l):
    m = np.max(l)
    norm = m + np.log(np.sum(np.exp(l - m)))
    p = np.exp(l - norm)
    return np.where(np.random.multinomial(1, p) == 1)[0][0]
Is this the best/fastest method, and can the np.exp() in the last step be avoided?
First of all, I believe the problem you're encountering is because you're normalizing your probabilities incorrectly. This line is incorrect:
a = np.exp(l) / scipy.misc.logsumexp(l)
You're dividing a probability by a log probability, which makes no sense. Instead you probably want
a = np.exp(l - scipy.misc.logsumexp(l))
If you do that, you find a = [1, 0] and your multinomial sampler works as expected up to floating point precision in the second probability.
A Solution for Small N: Histograms
That said, if you still need more precision and performance is not as much of a concern, one way you could make progress is by implementing a multinomial sampler from scratch, and then modifying this to work at higher precision.
NumPy's multinomial function is implemented in Cython, and essentially performs a loop over a number of binomial samples and combines them into a multinomial sample.
You can call it like this:
np.random.multinomial(10, [0.1, 0.2, 0.7])
# [0, 1, 9]
(Note that the precise output values here & below are random, and will change from call to call).
Another way you might implement a multinomial sampler is to generate N uniform random values, then compute the histogram with bins defined by the cumulative probabilities:
def multinomial(N, p):
    rand = np.random.uniform(size=N)
    p_cuml = np.cumsum(np.hstack([[0], p]))
    p_cuml /= p_cuml[-1]
    return np.histogram(rand, bins=p_cuml)[0]

multinomial(10, [0.1, 0.2, 0.7])
# [1, 1, 8]
With this method in mind, we can think about doing things to higher precision by keeping everything in log-space. The main trick is to realize that the log of uniform random deviates is equivalent to the negative of exponential random deviates, and so you can do everything above without ever leaving log space:
def multinomial_log(N, logp):
    log_rand = -np.random.exponential(size=N)
    logp_cuml = np.logaddexp.accumulate(np.hstack([[-np.inf], logp]))
    logp_cuml -= logp_cuml[-1]
    return np.histogram(log_rand, bins=logp_cuml)[0]

multinomial_log(10, np.log([0.1, 0.2, 0.7]))
# [1, 2, 7]
The resulting multinomial draws will maintain precision even for very small values in the p array.
Unfortunately, these histogram-based solutions will be much slower than the native numpy.multinomial function, so if performance is an issue you may need another approach. One option would be to adapt the Cython code mentioned above to work in log-space, using mathematical tricks similar to the ones used here.
A Solution for Large N: Poisson Approximation
The problem with the above solution is that as N grows large, it becomes very slow.
I was thinking about this and realized there's a more efficient way forward, despite np.random.multinomial failing for probabilities smaller than 1E-16 or so.
Here's an example of that failure: on a 64-bit machine, this will always give zero for the first entry because of the way the code is implemented, when in reality it should give something near 10:
np.random.multinomial(1E18, [1E-17, 1])
# array([ 0, 1000000000000000000])
If you dig into the source, you can trace this issue to the binomial function upon which the multinomial function is built. The cython code internally does something like this:
def multinomial_basic(N, p, size=None):
    results = np.array([np.random.binomial(N, pi, size) for pi in p])
    results[-1] = int(N) - results[:-1].sum(0)
    return np.rollaxis(results, 0, results.ndim)

multinomial_basic(1E18, [1E-17, 1])
# array([ 0, 1000000000000000000])
The problem is that the binomial function chokes on very small values of p – this is because the algorithm computes the value (1 - p), so the value of p is limited by floating-point precision.
So what can we do? Well, it turns out that for small values of p, the Poisson distribution is an extremely good approximation of the binomial distribution, and the implementation doesn't have these issues. So we can build a robust multinomial function based on a robust binomial sampler that switches to a Poisson sampler at small p:
def binomial_robust(N, p, size=None):
    if p < 1E-7:
        return np.random.poisson(N * p, size)
    else:
        return np.random.binomial(N, p, size)

def multinomial_robust(N, p, size=None):
    results = np.array([binomial_robust(N, pi, size) for pi in p])
    results[-1] = int(N) - results[:-1].sum(0)
    return np.rollaxis(results, 0, results.ndim)

multinomial_robust(1E18, [1E-17, 1])
# array([ 12, 999999999999999988])
The first entry is nonzero and near 10 as expected! Note that we can't use N larger than 1E18, because it will overflow the long integer.
But we can confirm that our approach works for smaller probabilities using the size parameter, and averaging over results:
p = [1E-23, 1E-22, 1E-21, 1E-20, 1]
size = int(1E6)
multinomial_robust(1E18, p, size).mean(0)
# array([ 1.70000000e-05, 9.00000000e-05, 9.76000000e-04,
# 1.00620000e-02, 1.00000000e+18])
We see that even for these very small probabilities, the multinomial values are turning up in the right proportion. The result is a very robust and very fast approximation to the multinomial distribution for small p.
How can I use Python to generate a random sparse symmetric matrix?
In MATLAB, we have the function sprandsym(size, density).
But how do I do that in Python?
If you have scipy, you could use sparse.random. The sprandsym function below generates a sparse random matrix X, takes its upper triangular half, and adds its transpose to itself to form a symmetric matrix. Since this doubles the diagonal values, the diagonals are subtracted once.
The non-zero values are normally distributed with mean 0 and standard deviation 1. The Kolmogorov-Smirnov test is used to check that the non-zero values are consistent with a draw from a normal distribution, and a histogram and QQ-plot are generated as well to visualize the distribution.
import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
import matplotlib.pyplot as plt

np.random.seed((3, 14159))

def sprandsym(n, density):
    rvs = stats.norm().rvs
    X = sparse.random(n, n, density=density, data_rvs=rvs)
    upper_X = sparse.triu(X)
    result = upper_X + upper_X.T - sparse.diags(X.diagonal())
    return result

M = sprandsym(5000, 0.01)
print(repr(M))
# <5000x5000 sparse matrix of type '<class 'numpy.float64'>'
#   with 249909 stored elements in Compressed Sparse Row format>

# check that the matrix is symmetric; the difference should have no non-zero elements
assert (M - M.T).nnz == 0

statistic, pval = stats.kstest(M.data, 'norm')
# The null hypothesis is that M.data was drawn from a normal distribution.
# A small p-value (say, below 0.05) would indicate reason to reject the null hypothesis.
# Since `pval` below is > 0.05, kstest gives no reason to reject the hypothesis
# that M.data is normally distributed.
print(statistic, pval)
# 0.0015998040114 0.544538788914

fig, ax = plt.subplots(nrows=2)
ax[0].hist(M.data, density=True, bins=50)
stats.probplot(M.data, dist='norm', plot=ax[1])
plt.show()
PS. I used
upper_X = sparse.triu(X)
result = upper_X + upper_X.T - sparse.diags(X.diagonal())
instead of
result = (X + X.T)/2.0
because I could not convince myself that the non-zero elements in (X + X.T)/2.0 have the right distribution. First, if X were dense and normally distributed with mean 0 and variance 1, i.e. N(0, 1), then (X + X.T)/2.0 would be N(0, 1/2). Certainly we could fix this by using
result = (X + X.T)/sqrt(2.0)
instead. Then result would be N(0, 1). But there is yet another problem: If X is sparse, then at nonzero locations, X + X.T would often be a normally distributed random variable plus zero. Dividing by sqrt(2.0) will squash the normal distribution closer to 0 giving you a more tightly spiked distribution. As X becomes sparser, this may be less and less like a normal distribution.
Since I didn't know what distribution (X + X.T)/sqrt(2.0) generates, I opted for copying the upper triangular half of X (thus repeating what I know to be normally distributed non-zero values).
The matrix needs to be symmetric too, which seems to be glossed over by the other answers here:
import scipy.sparse

def sparseSym(rank, density=0.01, format='coo', dtype=None, random_state=None):
    density = density / (2.0 - 1.0/rank)
    A = scipy.sparse.rand(rank, rank, density=density, format=format,
                          dtype=dtype, random_state=random_state)
    return (A + A.transpose())/2
This creates a sparse matrix, adds its transpose to it, and halves the result to make it symmetric.
It takes into account the fact that the density will increase as you add the two together, and the fact that there is no additional increase in density from the diagonal terms.
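As a quick, illustrative sanity check (assuming the sparseSym above), the result should be exactly symmetric and close to the requested density:

M = sparseSym(1000, density=0.01)
# exactly symmetric: the difference with the transpose stores no entries
assert (M - M.T).nnz == 0
# the realized density should be roughly the requested 0.01
print(M.nnz / 1000**2)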
unutbu's answer is the best one for performance and extensibility: numpy and scipy, together, have a lot of the functionality of MATLAB.
If you can't use them for whatever reason, or you're looking for a pure python solution, you could try
from random import gauss, randint

N = 100          # matrix size (example value)
num_terms = 50   # how many (i, j) positions to fill (example value)

sparse = [[0 for i in range(N)] for j in range(N)]
# alternatively, if you have numpy but not scipy:
# sparse = numpy.zeros((N, N))

for _ in range(num_terms):
    i, j = randint(0, N - 1), randint(0, N - 1)
    x = gauss(0, 1)
    sparse[i][j] = x
    sparse[j][i] = x
Although it might give you a little more control than unutbu's solution, you should expect it to be significantly slower; scipy is a dependency you probably don't want to avoid.
I want to specify the probability density function of a distribution and then pick up N random numbers from that distribution in Python. How do I go about doing that?
In general, you want to have the inverse cumulative probability density function. Once you have that, then generating the random numbers along the distribution is simple:
import random

def sample(n):
    return [icdf(random.random()) for _ in range(n)]
Or, if you use NumPy:
import numpy as np

def sample(n):
    return icdf(np.random.random(n))
In both cases icdf is the inverse cumulative distribution function which accepts a value between 0 and 1 and outputs the corresponding value from the distribution.
To illustrate the nature of icdf, we'll take a simple uniform distribution between values 10 and 12 as an example:
The probability distribution function is 0.5 between 10 and 12, zero elsewhere.
The cumulative distribution function is 0 below 10 (no samples below 10), 1 above 12 (no samples above 12), and increases linearly between those values (the integral of the PDF).
The inverse cumulative distribution function is only defined between 0 and 1. At 0 it is 10, at 1 it is 12, and it changes linearly between those values.
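As a minimal sketch, the icdf for this toy uniform distribution has the closed form implied by the description above:

def icdf(u):
    # inverse CDF of the uniform distribution on [10, 12]:
    # u = 0 maps to 10, u = 1 maps to 12, linear in between
    return 10 + 2 * u

Plugging this into either sample() above draws numbers uniformly between 10 and 12.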
Of course, the difficult part is obtaining the inverse cumulative distribution function. It really depends on your distribution: sometimes you have an analytical form, sometimes you have to resort to interpolation. Numerical methods may be useful, as numerical integration can be used to build the CDF and interpolation can be used to invert it.
This is my function to retrieve a single random number distributed according to a given probability density function. I used a Monte Carlo-like (rejection sampling) approach. Of course, n random numbers can be generated by calling this function n times.
"""
Draws a random number from given probability density function.
Parameters
----------
pdf -- the function pointer to a probability density function of form P = pdf(x)
interval -- the resulting random number is restricted to this interval
pdfmax -- the maximum of the probability density function
integers -- boolean, indicating if the result is desired as integer
max_iterations -- maximum number of 'tries' to find a combination of random numbers (rand_x, rand_y) located below the function value calc_y = pdf(rand_x).
returns a single random number according the pdf distribution.
"""
def draw_random_number_from_pdf(pdf, interval, pdfmax = 1, integers = False, max_iterations = 10000):
for i in range(max_iterations):
if integers == True:
rand_x = np.random.randint(interval[0], interval[1])
else:
rand_x = (interval[1] - interval[0]) * np.random.random(1) + interval[0] #(b - a) * random_sample() + a
rand_y = pdfmax * np.random.random(1)
calc_y = pdf(rand_x)
if(rand_y <= calc_y ):
return rand_x
raise Exception("Could not find a matching random number within pdf in " + max_iterations + " iterations.")
In my opinion this solution performs better than the others if you do not need to retrieve a very large number of random values. Another benefit is that you only need the PDF, and avoid calculating the CDF, inverse CDF, or weights.
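A usage sketch (with a hypothetical pdf, p(x) = 2x on [0, 1], purely for illustration):

# with integers=False the function returns a 1-element numpy array, hence .item()
samples = [draw_random_number_from_pdf(lambda x: 2 * x, interval=(0, 1), pdfmax=2).item()
           for _ in range(1000)]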
Let's say I want to pull random values from a linear distribution function; I'm not sure how I would do that.
Say I have a function y = 3x and I want to be able to pull a random value from that line.
This is what I've tried:
x, y = [], []
for i in range(10):
    a = random.uniform(0, 3)
    x.append(a)
    b = 3 * a
    y.append(b)
This gives me y values that are taken from this linear function (distribution, so to speak). Now, if this is correct, how would I do the same for a distribution that looks like a horizontal line?
That is, if I had the horizontal line function y = 3, how can I get random values pulled from there?
Just define your function, using a lambda or explicit definition, and then call it to get the y-value:
def func(x):
    return 3

points = []
for i in range(10):
    x = random.uniform(0, 3)
    points.append((x, func(x)))
A linear function with a slope of 0 in this case is fairly trivial.
EDIT: I think I understand the question a little more clearly now. You are looking to randomly generate a point that lies under the curve? That is quite tricky to calculate directly for an arbitrary function, and you will probably want a bound on your function (i.e. a < x < b). Supposing we have a bound, one simple method is to generate a random point in a box containing the curve, and simply discard it if it isn't under the curve. This will be perfectly random.
def linearfunc(x):
    return 3 * x

def getRandom(func, maxi, a, b):
    while True:
        x = random.uniform(a, b)
        y = random.uniform(0, maxi)
        if y < func(x):
            return (x, y)

points = [getRandom(linearfunc, 9, 0, 3) for i in range(10)]
This method requires knowing an upper bound (maxi) for the function on the specified interval, and the tighter the upper bound, the fewer sampling misses will occur.
The random module (http://docs.python.org/2/library/random.html) has functions for several fixed distributions to sample from. For example, random.gauss will sample a random point from a normal distribution with given mean and sigma values.
I'm looking for a way to extract a number N of random samples between a given interval using my own distribution as fast as possible in python. This is what I mean:
def my_dist(x):
    # Some distribution; assume c1, c2, c3 and c4 are known.
    f = c1*exp(-((x-c2)**c3)/c4)
    return f

# Draw N random samples from my distribution between given limits a, b.
N = 1000
N_rand_samples = ran_func_sample(my_dist, a, b, N)
where ran_func_sample is what I'm after and a, b are the limits from which to draw the samples. Is there anything of that sort in python?
You need to use the inverse transform sampling method to get random values distributed according to the law you want. With this method you simply apply the inverted cumulative distribution function to random numbers having a standard uniform distribution on the interval [0, 1].
Once you have found the inverted function, you get 1000 numbers distributed according to the needed distribution in this obvious way:
[inverted_function(random.random()) for x in range(1000)]
More on Inverse Transform Sampling:
http://en.wikipedia.org/wiki/Inverse_transform_sampling
Also, there is a good question on StackOverflow related to the topic:
Pythonic way to select list elements with different probability
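As a minimal sketch of the idea (using the exponential distribution, whose CDF F(x) = 1 - exp(-lambda*x) inverts in closed form; this is only an illustration, not your my_dist):

import math
import random

lam = 2.0  # example rate parameter

def inverted_function(u):
    # inverse CDF of the exponential distribution: x = -ln(1 - u) / lambda
    return -math.log(1.0 - u) / lam

samples = [inverted_function(random.random()) for x in range(1000)]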
This code implements the sampling of n-dimensional discrete probability distributions. By setting a flag on the object, it can also be used as a piecewise constant probability distribution, which can then be used to approximate arbitrary PDFs. Well, arbitrary PDFs with compact support; if you want to efficiently sample extremely long tails, a non-uniform description of the pdf would be required. But this is still efficient even for things like Airy point-spread functions (which I created it for, initially). The internal sorting of values is absolutely critical there for accuracy; the many small values in the tails should contribute substantially, but without sorting they would get drowned out in floating-point accuracy.
import numpy as np

class Distribution(object):
    """
    Draws samples from a one-dimensional probability distribution,
    by means of inversion of a discrete inversion of a cumulative density function.

    The pdf can be sorted first to prevent numerical error in the cumulative sum;
    this is set as the default. For big density functions with high contrast,
    it is absolutely necessary, and for small density functions,
    the overhead is minimal.

    A call to this distribution object returns indices into the density array.
    """
    def __init__(self, pdf, sort=True, interpolation=True, transform=lambda x: x):
        self.shape = pdf.shape
        self.pdf = pdf.ravel()
        self.sort = sort
        self.interpolation = interpolation
        self.transform = transform
        # a pdf can not be negative
        assert np.all(pdf >= 0)
        # sort the pdf by magnitude
        if self.sort:
            self.sortindex = np.argsort(self.pdf, axis=None)
            self.pdf = self.pdf[self.sortindex]
        # construct the cumulative distribution function
        self.cdf = np.cumsum(self.pdf)

    @property
    def ndim(self):
        return len(self.shape)

    @property
    def sum(self):
        """cached sum of all pdf values; the pdf need not sum to one, and is implicitly normalized"""
        return self.cdf[-1]

    def __call__(self, N):
        """draw N samples"""
        # pick numbers which are uniformly random over the cumulative distribution function
        choice = np.random.uniform(high=self.sum, size=N)
        # find the indices corresponding to this point on the CDF
        index = np.searchsorted(self.cdf, choice)
        # if necessary, map the indices back to their original ordering
        if self.sort:
            index = self.sortindex[index]
        # map back to multi-dimensional indexing
        index = np.unravel_index(index, self.shape)
        index = np.vstack(index)
        # is this a discrete or piecewise continuous distribution?
        if self.interpolation:
            index = index + np.random.uniform(size=index.shape)
        return self.transform(index)


if __name__ == '__main__':
    shape = 3, 3
    pdf = np.ones(shape)
    pdf[1] = 0
    dist = Distribution(pdf, transform=lambda i: i - 1.5)
    print(dist(10))

    import matplotlib.pyplot as pp
    pp.scatter(*dist(1000))
    pp.show()
And as a more real-world relevant example:
x = np.linspace(-100, 100, 512)
p = np.exp(-x**2)
pdf = p[:, None] * p[None, :]  # 2d gaussian
dist = Distribution(pdf, transform=lambda i: i - 256)
print(dist(1000000).mean(axis=1))  # should be in the 1/sqrt(1e6) range

import matplotlib.pyplot as pp
pp.scatter(*dist(1000))
pp.show()
Here is a rather nice way of performing inverse transform sampling with a decorator.
import numpy as np
from scipy.interpolate import interp1d

def inverse_sample_decorator(dist):
    def wrapper(pnts, x_min=-100, x_max=100, n=1e5, **kwargs):
        x = np.linspace(x_min, x_max, int(n))
        cumulative = np.cumsum(dist(x, **kwargs))
        cumulative -= cumulative.min()
        f = interp1d(cumulative/cumulative.max(), x)
        return f(np.random.random(pnts))
    return wrapper
Using this decorator on a Gaussian distribution, for example:
@inverse_sample_decorator
def gauss(x, amp=1.0, mean=0.0, std=0.2):
    return amp*np.exp(-(x-mean)**2/std**2/2.0)
You can then generate sample points from the distribution by calling the function. The keyword arguments x_min and x_max are the limits of the original distribution and can be passed as arguments to gauss along with the other keyword arguments that parameterise the distribution.
samples = gauss(5000, mean=20, std=0.8, x_min=19, x_max=21)
Alternatively, this can be done as a function that takes the distribution as an argument (as in your original question),
def inverse_sample_function(dist, pnts, x_min=-100, x_max=100, n=1e5, **kwargs):
    x = np.linspace(x_min, x_max, int(n))
    cumulative = np.cumsum(dist(x, **kwargs))
    cumulative -= cumulative.min()
    f = interp1d(cumulative/cumulative.max(), x)
    return f(np.random.random(pnts))
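For example (a sketch assuming a plain, undecorated density function, since the gauss above has already been wrapped by the decorator):

def gauss_pdf(x, amp=1.0, mean=0.0, std=0.2):
    return amp*np.exp(-(x-mean)**2/std**2/2.0)

samples = inverse_sample_function(gauss_pdf, 5000, mean=20, std=0.8,
                                  x_min=19, x_max=21)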
I was in a similar situation, but I wanted to sample from a multivariate distribution, so I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method).
import numpy as np

def metropolis_hastings(target_density, size=500000):
    burnin_size = 10000
    size += burnin_size
    x0 = np.array([[0, 0]])
    xt = x0
    samples = []
    for i in range(size):
        xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
        accept_prob = target_density(xt_candidate) / target_density(xt)
        if np.random.uniform(0, 1) < accept_prob:
            xt = xt_candidate
        samples.append(xt)
    samples = np.array(samples[burnin_size:])
    samples = np.reshape(samples, [samples.shape[0], 2])
    return samples
This function requires a function target_density which takes in a data point and computes its probability density.
For details, check out this detailed answer of mine.
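A usage sketch (with a hypothetical target: an unnormalized 2D standard normal density; only density ratios enter the acceptance probability, so normalization can be dropped):

def target_density(x):
    # x has shape (1, 2); an unnormalized density is fine here
    return np.exp(-0.5 * np.sum(x**2, axis=1))

samples = metropolis_hastings(target_density, size=5000)
print(samples.mean(axis=0))  # should be close to [0, 0]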
import numpy as np
import scipy.interpolate as interpolate

def inverse_transform_sampling(data, n_bins, n_samples):
    hist, bin_edges = np.histogram(data, bins=n_bins, density=True)
    cum_values = np.zeros(bin_edges.shape)
    cum_values[1:] = np.cumsum(hist*np.diff(bin_edges))
    inv_cdf = interpolate.interp1d(cum_values, bin_edges)
    r = np.random.rand(n_samples)
    return inv_cdf(r)
So if we pass in a data sample that has a specific distribution, the inverse_transform_sampling function will return a dataset with approximately the same distribution. The advantage here is that we can choose our own sample size by specifying it in the n_samples argument.
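A small usage sketch (the normal data here is only an illustrative stand-in for your own data):

data = np.random.normal(loc=5.0, scale=2.0, size=10000)
new_samples = inverse_transform_sampling(data, n_bins=50, n_samples=1000)
print(new_samples.mean(), new_samples.std())  # roughly 5.0 and 2.0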