I need to generate a vector of random float numbers between [0,1] such
that their sum equals 1 and that are distributed non-uniformly.
Is there any python function that generates such a vector?
Best wishes
The distribution you are probably looking for is called the Dirichlet distribution. There's no built-in function in Python for drawing random numbers from a Dirichlet distribution, but NumPy contains one:
>>> from numpy.random import dirichlet
>>> print(dirichlet([1] * n))
This will give you n numbers that sum up to 1, and the probability of each such combination will be equal.
Alternatively, if you don't have NumPy, you can make use of the fact that a random sample drawn from an n-dimensional Dirichlet distribution can be generated by drawing n independent samples from a gamma distribution with shape and scale parameters equal to 1 and then dividing the samples with the sum:
>>> from random import gammavariate
>>> def dirichlet(n):
...     samples = [gammavariate(1, 1) for _ in range(n)]
...     sum_samples = sum(samples)
...     return [x / sum_samples for x in samples]
The reason why you need a Dirichlet distribution is because if you simply draw random numbers uniformly from some interval and then divide them by the sum of them, the resulting distribution will be biased towards samples consisting of roughly equal numbers. See Luc Devroye's book for more on this topic.
There is a nicer example on the Wikipedia page for the Dirichlet distribution.
The code below generates a k-dimensional sample:
import random
params = [a1, a2, ..., ak]  # the k concentration parameters
sample = [random.gammavariate(a, 1) for a in params]
total = sum(sample)  # compute the normalising sum once
sample = [v / total for v in sample]
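Since the question asks for a non-uniform distribution, here is a minimal sketch of the same idea using NumPy's Generator API with unequal concentration parameters. The values in `alpha` are hypothetical and chosen only for illustration; larger entries pull more of the total mass towards that component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical concentration parameters; unequal values give a
# non-uniform distribution over the simplex.
alpha = [0.5, 2.0, 5.0]

# One vector of len(alpha) non-negative numbers that sum to 1.
sample = rng.dirichlet(alpha)

# A batch of 1000 such vectors, one per row.
batch = rng.dirichlet(alpha, size=1000)

print(sample, sample.sum())
print(batch.mean(axis=0))  # roughly alpha / sum(alpha)
```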
For some vector m (of length N) of numbers in R we can write
rnorm(N, mean = m, sd = 1)
and this will give a vector of length N where each element will be a sample for a normal distribution centred at the different elements of m. My question is, is it possible to do the same easily with numpy? As far as I can tell numpy.random.normal() requires the loc to be the same for all the elements. The point is that I want a random vector with different means.
Also while writing this, would it work to sample from a standard normal distribution and transform this sample? That would be easier.
One way to do this is to sample from a distribution centred at 0 and then shift the samples:
import numpy as np
m, N = np.array([1, 2, 3]), 1000
np.random.seed(42)
samples = np.random.randn(N, len(m)) + m  # broadcasting adds m to every row
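As a side note, NumPy's normal sampler also accepts an array for `loc`, so the R call can be mirrored directly. The sketch below assumes the newer Generator API (`np.random.default_rng`); the values of `m` are only an example.

```python
import numpy as np

rng = np.random.default_rng(42)
m = np.array([1.0, 2.0, 3.0])  # example means, one per element

# One draw per mean, analogous to rnorm(N, mean = m, sd = 1) in R.
one_draw = rng.normal(loc=m, scale=1.0)

# Many draws: each row is one vector whose elements have means m.
many_draws = rng.normal(loc=m, scale=1.0, size=(1000, len(m)))

print(one_draw)
print(many_draws.mean(axis=0))  # close to m
```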
I want to sample from the binomial distribution B(n,p), but with the additional constraint that the sampled value lies in the range [a,b] (instead of the usual 0 to n range). In other words, I have to sample a value from the binomial distribution given that it lies in the range [a,b]. Mathematically, I can write the pmf of this distribution (f(x)) in terms of the pmf of the binomial distribution bin(x) = [(nCx)*(p)^x*(1-p)^(n-x)] as
sum = 0
for i in range(a,b+1):
    sum += bin(i)
f(x) = bin(x)/sum
One way of sampling from this distribution is to sample a uniformly distributed number and apply the inverse of the CDF(obtained using the pmf). However, I don't think this is a good idea as the pmf calculation would easily get very time-consuming.
The values of n,x,a,b are quite large in my case and this way of computing pmf and then using a uniform random variable to generate the sample seems extremely inefficient due to the factorial terms in nCx.
What's a nice/efficient way to achieve this?
This is a way to collect all the values of bin in a pretty short time:
from scipy.special import comb
import numpy as np
def distribution(n, p=0.5):
    x = np.arange(n + 1)
    return comb(n, x, exact=False) * p ** x * (1 - p) ** (n - x)
This can be done in a quarter of a microsecond for n=1000.
Sample run:
>>> distribution(4)
array([0.0625, 0.25 , 0.375 , 0.25 , 0.0625])
You can sum specific parts of this array like so:
>>> np.sum(distribution(4)[2:4])
0.625
Remark: For n>1000 the middle values of this distribution require multiplying extremely large numbers, so a RuntimeWarning is raised.
Bugfix
You can use scipy.stats.binom equivalently:
from scipy.stats import binom
def distribution(n, p):
    return binom.pmf(np.arange(n + 1), n, p)
This does the same as the method above, and quite efficiently (n=1000000 in a third of a second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p), which calculates the cumulative sum of binom.pmf. Subtracting the (a-1)th item of this array from the bth item then gives the total probability of the range [a,b], which is the normalising sum you need.
Another way would be to use the CDF and its inverse, something like:
from scipy import stats
dist = stats.binom(100, 0.5)
# limit ourselves to [60, 100]; the lower bound is cdf(59) so that 60 itself can be drawn
lo, hi = dist.cdf([59, 100])
# draw a sample
x = dist.ppf(stats.uniform(lo, hi-lo).rvs())
This should give us values in the range. Note that, due to floating-point precision, it might occasionally give you values outside of what you want; this gets worse above the mean of the distribution.
Note that for large values you might as well use the normal approximation.
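The inverse-CDF idea can be wrapped into a small helper. This is only a sketch under stated assumptions: the function name `sample_truncated_binom` and its parameters are hypothetical, the lower bound uses cdf(a - 1) so that a itself can be drawn, and the final clip is a guard against the floating-point edge cases mentioned above rather than a fix for them.

```python
import numpy as np
from scipy import stats

def sample_truncated_binom(n, p, a, b, size, rng=None):
    """Draw `size` values from B(n, p) conditioned on a <= X <= b."""
    rng = np.random.default_rng() if rng is None else rng
    dist = stats.binom(n, p)
    lo = dist.cdf(a - 1)  # P(X < a)
    hi = dist.cdf(b)      # P(X <= b)
    # Uniform draws restricted to the CDF band that maps onto [a, b].
    u = rng.uniform(lo, hi, size=size)
    x = dist.ppf(u)
    # Guard against rounding at the edges of the band.
    return np.clip(x, a, b).astype(int)

xs = sample_truncated_binom(100, 0.5, 60, 100, size=1000,
                            rng=np.random.default_rng(0))
print(xs.min(), xs.max())
```

When [a,b] lies far out in the upper tail, both CDF values are close to 1 and precision degrades; working with the survival function instead may help in that case.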
More specifically, given a natural number d, how can I generate random vectors in R^d such that each vector x has Euclidean norm <= 1?
Generating random vectors via numpy.random.rand(1,d) is no problem, but the likelihood of such a random vector having norm <= 1 is predictably bad even for moderately small d. For example, even for d = 10 only about 0.2% of such random vectors have appropriately small norm. So that seems like a silly solution.
EDIT: Re: Walter's comment, yes, I'm looking for a uniform distribution over vectors in the unit ball in R^d.
Based on the Wolfram Mathworld article on hypersphere point picking and Nate Eldredge's answer to a similar question on math.stackexchange.com, you can generate such a vector by generating a vector of d independent Gaussian random variables and a random number U uniformly distributed over the closed interval [0, 1], then normalizing the vector to norm U^(1/d).
Based on the answer by user2357112, you need something like this:
import numpy as np
...
inv_d = 1.0 / d
for ...:
    gauss = np.random.normal(size=d)
    length = np.linalg.norm(gauss)
    if length == 0.0:
        x = gauss
    else:
        r = np.random.rand() ** inv_d
        x = np.multiply(gauss, r / length)
        # conceptually: / length followed by * r
    # do something with x
(this is my second Python program, so don't shoot at me...)
The tricks are that
the combination of d independent gaussian variables with same σ is a gaussian distribution in d dimensions, which, remarkably, has spherical symmetry,
the gaussian distribution in d dimensions can be projected onto the unit sphere by dividing by the norm, and
the uniform distribution in a d-dimensional unit ball has cumulative radial distribution r^d (which is what you need to invert)
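The three points above can be combined into a vectorised sketch that draws many vectors at once. The function name `sample_unit_ball` is hypothetical, and the zero-length case is ignored here because it occurs with probability zero.

```python
import numpy as np

def sample_unit_ball(n_vectors, d, rng=None):
    """Return an (n_vectors, d) array of points uniform in the unit ball."""
    rng = np.random.default_rng() if rng is None else rng
    # Spherically symmetric directions from independent Gaussians.
    gauss = rng.normal(size=(n_vectors, d))
    norms = np.linalg.norm(gauss, axis=1, keepdims=True)
    # Radii with CDF r^d, obtained by inverting: r = U^(1/d).
    radii = rng.random((n_vectors, 1)) ** (1.0 / d)
    return gauss / norms * radii

points = sample_unit_ball(1000, 10, rng=np.random.default_rng(0))
print(np.linalg.norm(points, axis=1).max())  # never above 1
```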
This is the Python / NumPy code I am using. Since it does not use loops, it is much faster:
import numpy as np
n_vectors = 1000
d = 2
rnd_vec = np.random.uniform(-1, 1, size=(n_vectors, d))  # the initial random vectors
unif = np.random.uniform(size=n_vectors)  # a second array of random numbers
scale_f = np.expand_dims(np.linalg.norm(rnd_vec, axis=1) / unif, axis=1)  # the scaling factors
rnd_vec = rnd_vec / scale_f  # the random vectors in R^d
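The second array of random numbers (unif) is needed as a second scaling factor because otherwise all the vectors would have Euclidean norm equal to one. Note that the result is not uniform over the unit ball: the directions come from a cube rather than from Gaussian samples, and the radius is uniform rather than U^(1/d).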
I want to specify the probability density function of a distribution and then draw N random numbers from that distribution in Python. How do I go about doing that?
In general, you want to have the inverse cumulative distribution function. Once you have that, generating random numbers from the distribution is simple:
import random
def sample(n):
    return [icdf(random.random()) for _ in range(n)]
Or, if you use NumPy:
import numpy as np
def sample(n):
    return icdf(np.random.random(n))
In both cases icdf is the inverse cumulative distribution function which accepts a value between 0 and 1 and outputs the corresponding value from the distribution.
To illustrate the nature of icdf, we'll take a simple uniform distribution between values 10 and 12 as an example:
the probability density function is 0.5 between 10 and 12, zero elsewhere
the cumulative distribution function is 0 below 10 (no samples below 10), 1 above 12 (no samples above 12) and increases linearly between the values (integral of the PDF)
the inverse cumulative distribution function is only defined between 0 and 1. At 0 it is 10, at 1 it is 12, and it changes linearly between the values
Of course, the difficult part is obtaining the inverse cumulative distribution function. It really depends on your distribution: sometimes you may have an analytical function, sometimes you may want to resort to interpolation. Numerical methods may be useful, as numerical integration can be used to create the CDF and interpolation can be used to invert it.
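The numerical route can be sketched as follows. The helper name `make_icdf`, the grid size, and the example density 2x on [0, 1] are all assumptions made for illustration; the sketch also assumes the density is strictly positive inside the interval, so that the tabulated CDF is increasing and can be inverted by interpolation.

```python
import numpy as np

def make_icdf(pdf, lo, hi, n_grid=10_001):
    """Build an approximate inverse CDF for a density given on [lo, hi]."""
    x = np.linspace(lo, hi, n_grid)
    density = pdf(x)
    # Trapezoidal rule, accumulated to get the CDF at every grid point.
    steps = (density[1:] + density[:-1]) / 2.0 * np.diff(x)
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    cdf /= cdf[-1]  # normalise so that the CDF ends at exactly 1

    def icdf(u):
        # Invert by swapping the roles of x and cdf in the interpolation.
        return np.interp(u, cdf, x)

    return icdf

# Example density 2x on [0, 1], whose exact inverse CDF is sqrt(u).
icdf = make_icdf(lambda x: 2.0 * x, 0.0, 1.0)
samples = icdf(np.random.default_rng(0).random(1000))
print(icdf(0.25))  # close to 0.5
```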
This is my function to retrieve a single random number distributed according to the given probability density function. I used a Monte-Carlo like approach. Of course n random numbers can be generated by calling this function n times.
import numpy as np

def draw_random_number_from_pdf(pdf, interval, pdfmax=1, integers=False, max_iterations=10000):
    """
    Draws a random number from a given probability density function.

    Parameters
    ----------
    pdf -- the probability density function, of the form P = pdf(x)
    interval -- the resulting random number is restricted to this interval
    pdfmax -- the maximum of the probability density function
    integers -- boolean, indicating if the result is desired as an integer
    max_iterations -- maximum number of 'tries' to find a combination of random numbers (rand_x, rand_y) located below the function value calc_y = pdf(rand_x).

    Returns a single random number distributed according to the pdf.
    """
    for i in range(max_iterations):
        if integers:
            # note: np.random.randint excludes the upper bound interval[1]
            rand_x = np.random.randint(interval[0], interval[1])
        else:
            rand_x = (interval[1] - interval[0]) * np.random.random() + interval[0]  # (b - a) * random_sample() + a
        rand_y = pdfmax * np.random.random()
        calc_y = pdf(rand_x)
        if rand_y <= calc_y:
            return rand_x
    raise Exception("Could not find a matching random number within pdf in " + str(max_iterations) + " iterations.")
In my opinion this solution is performing better than other solutions if you do not have to retrieve a very large number of random variables. Another benefit is that you only need the PDF and avoid calculating the CDF, inverse CDF or weights.
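If many samples are needed, the same accept/reject idea can be applied to whole arrays at once. The following is a minimal sketch, not the function above: the name `rejection_sample` is hypothetical, it assumes `pdf` accepts NumPy arrays, and `pdfmax` must be an upper bound on the density over the interval.

```python
import numpy as np

def rejection_sample(pdf, interval, pdfmax, size, rng=None):
    """Draw `size` samples from `pdf` on `interval` by rejection sampling."""
    rng = np.random.default_rng() if rng is None else rng
    lo, hi = interval
    accepted = np.empty(0)
    while accepted.size < size:
        # Propose points uniformly in the box [lo, hi] x [0, pdfmax].
        x = rng.uniform(lo, hi, size)
        y = rng.uniform(0.0, pdfmax, size)
        # Keep the proposals that fall under the curve.
        accepted = np.concatenate((accepted, x[y <= pdf(x)]))
    return accepted[:size]

# Example density 2x on [0, 1], whose maximum is 2 and whose mean is 2/3.
draws = rejection_sample(lambda x: 2.0 * x, (0.0, 1.0), 2.0, 1000,
                         rng=np.random.default_rng(0))
print(draws.mean())
```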
If I have an N^3 array of triplets in a numpy array, how do I do a vector sum on all of the triplets in the array? For some reason I just can't wrap my brain around the summation indices. Here is what I tried, but it doesn't seem to work:
a = np.random.random((5,5,5,3)) - 0.5
s = a.sum((0,1,2))
np.linalg.norm(s)
I would expect that as N gets large, if the sum is working correctly I should converge to 0, but I just keep getting bigger. The sum gives me a vector that is the correct shape (3x1), but obviously I must be doing something wrong. I know this should be easy, but I'm just not getting it.
Thanks in advance!
It is easier to understand your problem analytically if, instead of uniform random numbers, we use standard normal numbers; the qualitative results can then be applied to your particular case:
>>> a = np.random.normal(0, 1, size=(5, 5, 5, 3))
>>> s = a.sum(axis=(0, 1, 2))
So now each of the three items of s is the sum of 125 numbers, each drawn from a standard normal distribution. It is a well established fact that adding up two normal distributions gives you another normal distribution with mean the sum of the means, and variance the sum of the variances. So each of the three values in s will be distributed as a random sample from a normal distribution with mean 0 and standard deviation sqrt(125) = 11.18.
The fact that the variance of the distribution grows means that, even though you will see an average value of 0 for each of those numbers if you run your code many times, on any given run you are more likely to see larger offsets from 0.
Furthermore you then go and compute the norm of those three values. Squaring three standard normal distributions and adding them together gives you a chi-squared distribution. If you then take the square root, you get a chi distribution. The former is easier to deal with, and it predicts that the average value of the square of the norm of your three values will be 3 * 125. And it most certainly seems to be:
>>> mean_norm_sq = 0
>>> for n in range(1000):
...     a = np.random.normal(0, 1, size=(5, 5, 5, 3))
...     s = a.sum(axis=(0, 1, 2))
...     mean_norm_sq += np.sum(s**2)
...
>>> mean_norm_sq / 1000
374.47629802482447
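To connect this back to the original expectation, here is a small sketch (using the uniform numbers from the question and a few arbitrary values of N) that compares the norm of the sum with the norm of the mean. The sum should grow roughly like the square root of the number of terms, while the mean, which is the sum divided by N^3, is the quantity that shrinks towards 0.

```python
import numpy as np

rng = np.random.default_rng(0)
results = []
for N in (5, 10, 20, 40):
    a = rng.random((N, N, N, 3)) - 0.5
    # Vector sum and vector mean over all N^3 triplets.
    norm_of_sum = np.linalg.norm(a.sum(axis=(0, 1, 2)))
    norm_of_mean = np.linalg.norm(a.mean(axis=(0, 1, 2)))
    results.append((N, norm_of_sum, norm_of_mean))
    print(N, norm_of_sum, norm_of_mean)
```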
As the comments note, there is no reason why the squared sum should approach zero. By the description, an array of N three-dimensional vectors sounds like it should have the shape of (N,3) not (N,N,N,3), but I may be misunderstanding it. Either way, it is simple to observe what happens in the two cases:
import numpy as np
import matplotlib.pyplot as plt

avg_sum = []
sq_sum = []
N_val = 2**np.arange(15)
for N in N_val:
    A = np.random.random((N, 3)) - 0.5
    avg_sum.append(A.sum(axis=1).mean())
    sq_sum.append((A**2).sum(axis=1).mean())

plt.plot(N_val, avg_sum, label="Average sum")
plt.plot(N_val, sq_sum, label="Squared sum")
plt.legend(loc="best")
plt.show()
The average sum goes to zero as your intuition expects.