Difference between functions generating random numbers in numpy - python

I am trying to understand the difference, if any, between these functions:
numpy.random.rand()
numpy.random.random()
numpy.random.uniform()
They all seem to produce random samples from a uniform distribution. So, when called without any parameters, is there any difference?

numpy.random.uniform(low=0.0, high=1.0, size=None) - uniform samples from arbitrary range
Draw samples from a uniform distribution.
Samples are uniformly distributed over the half-open interval [low, high) (includes low, but excludes high). In other words, any value within the given interval is equally likely to be drawn by uniform.
numpy.random.random(size=None) - uniform distribution between 0 and 1
Return random floats in the half-open interval [0.0, 1.0).
Results are from the “continuous uniform” distribution over the stated interval. To sample Unif[a, b), b > a multiply the output of random_sample by (b-a) and add a:
(b - a) * random_sample() + a
numpy.random.rand(d0, d1, ..., dn) - Samples from a uniform distribution to populate an array of a given shape
Random values in a given shape.
Create an array of the given shape and populate it with random samples from a uniform distribution over [0, 1).
To answer your other question, given all default parameters all of the functions numpy.random.uniform, numpy.random.random, and numpy.random.rand are identical.

Short answer
Without parameters, the three functions are equivalent, producing a random float in the range [0.0,1.0).
Details
numpy.random.rand is a convenience function that accepts an arbitrary number of parameters as dimensions. It differs from the other numpy.random functions, and also from numpy.zeros and numpy.ones, in that those accept a shape, i.e. an N-tuple (specified as a Python list or tuple). The following two lines produce identical results (the random seed notwithstanding):
import numpy as np
x = np.random.random_sample((1,2,3)) # a single tuple as parameter
x = np.random.rand(1,2,3) # integers as parameters
numpy.random.random is an alias for numpy.random.random_sample.
numpy.random.uniform allows you to specify the limits of the distribution, with the low and high keyword parameters, instead of using the default [0.0,1.0).
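For example (a quick illustrative sketch), the first three calls below all draw from the same Unif[0, 1) distribution, and only uniform lets you change the limits directly:
import numpy as np

a = np.random.rand()      # float in [0.0, 1.0)
b = np.random.random()    # float in [0.0, 1.0)
c = np.random.uniform()   # float in [0.0, 1.0), the default limits
d = np.random.uniform(2.0, 5.0)            # float in [2.0, 5.0)
e = 3.0 * np.random.random_sample() + 2.0  # same range, via (b - a) * random_sample() + a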

Related

Given a mean and standard deviation generate random numbers based on a geometric distribution and binomial distribution

I have some code that generates numbers simulating a normal distribution. My question is: given a mean and standard deviation, how can I generate random numbers that resemble geometric and binomial distributions? I know NumPy has functions for these, but they take probabilities as arguments instead of a mean and standard deviation.
import matplotlib.pyplot as plt
import numpy as np
x = np.random.normal(10, 20, 10000)
plt.hist(x, density=True, bins=30) # density=False would make counts
plt.ylabel('Probability')
plt.xlabel('Data')
plt.show()
You can generate normals with the same mean and standard deviation as a given geometric or binomial, but you generally can't go the other way. That's because all distributions are characterized by parameters, but for most distributions the mean and standard deviation are not the defining parameters, but rather they are functions of the parameters. Let's look at the two specific distributions named in your question. In both cases we'll treat variance and standard deviation as equivalent for your interests, since the latter is the square root of the former.
Given a geometric random variable with its single parameter p, the mean and variance are 1/p and (1-p)/p², respectively. They can't be chosen as arbitrary independent values.
Similarly, a binomial random variable is fully determined by its parameters n (number of samples taken) and p (probability that a given sample is a "success"). Its mean and variance are n·p and n·p(1-p), respectively. Again, you can't just pick any old values for the mean and variance, because there may be no integer n>0 and 0≤p≤1 that produce them.
Going the other way is generally not a problem though. Given a binomial, geometric, exponential, Poisson,… with associated parameterization, you can calculate the corresponding mean and variance/standard-deviation and use those values to parameterize a normal. The exception would be some pathological distributions which don't have finite moments, such as the Cauchy distribution.
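As a quick sketch of that direction (the parameter values here are made up for illustration), you can compute the mean and standard deviation from the distribution's parameters and pass them to numpy.random.normal:
import numpy as np

# Binomial(n, p): mean = n*p, variance = n*p*(1-p)
n, p = 50, 0.3
binom_like_normal = np.random.normal(n*p, np.sqrt(n*p*(1-p)), size=10000)

# Geometric(p): mean = 1/p, variance = (1-p)/p**2
p = 0.25
geom_like_normal = np.random.normal(1/p, np.sqrt((1-p)/p**2), size=10000)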

Random numbers with user-defined continuous probability distribution

I would like to simulate something on the subject of photon-photon-interaction. In particular, there is Halpern scattering. Here is the German Wikipedia entry on it Halpern-Streuung. And there the differential cross section has an angular dependence of (3+(cos(theta))^2)^2.
I would like to have a generator of random numbers between 0 and 2*Pi that follows the density function ((3+(cos(theta))^2)^2)*(1/(99*Pi/4)). So the values around 0, Pi, and 2*Pi should occur a little more often than the values around Pi/2 and 3*Pi/2.
I have already found that there is a function on how to randomly output discrete values with user-defined probability values numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2]). I could work with that in an emergency, should there be nothing else. But actually I already want a continuous probability distribution here.
I know that even if there were a Python command where you could pass in a mathematical distribution function, it would still only produce discrete distributions of values in the end, since irrational numbers cannot be represented with finitely many 1s and 0s. But still, such a command working with a continuous function would be more elegant.
Assuming the density function you have is proportional to a probability density function (PDF), you can use the rejection sampling method: draw a point uniformly from a bounding box and accept it only if it falls below the density function, repeating until a point is accepted. It works for any bounded density function with a closed and bounded domain, as long as you know what the domain and bound are (the bound is the maximum value of f on the domain). In this case, the bound is 64/(99*math.pi) and the algorithm works as follows:
import math
import random

def sample():
    mn = 0  # Lowest value of domain
    mx = 2*math.pi  # Highest value of domain
    bound = 64/(99*math.pi)  # Upper bound of PDF value
    while True:  # Do the following until a value is returned
        # Choose an X inside the desired sampling domain.
        x = random.uniform(mn, mx)
        # Choose a Y between 0 and the maximum PDF value.
        y = random.uniform(0, bound)
        # Calculate the PDF at x.
        pdf = ((3 + math.cos(x)**2)**2) * (1/(99*math.pi/4))
        # Does (x, y) fall below the PDF?
        if y < pdf:
            # Yes, so return x.
            return x
        # No, so loop.
See also the section "Sampling from an Arbitrary Distribution" in my article on randomization.
The following demonstrates the method's correctness by estimating the probability that a returned sample is less than π/8. If the method is correct, this probability should be close to 0.0788:
print(sum(1 if sample()<math.pi/8 else 0 for _ in range(1000000))/1000000)
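You can also check the target value itself by integrating the PDF numerically; a small sketch using scipy.integrate.quad (assuming SciPy is available):
import math
from scipy.integrate import quad

pdf = lambda t: ((3 + math.cos(t)**2)**2) / (99*math.pi/4)
prob, _ = quad(pdf, 0, math.pi/8)
print(prob)  # approximately 0.0788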
I had two approaches in mind: the inverse transform sampling method and the "deletion method" (I'll just call it that). For inverse transform sampling, there is an inverse function to my distribution, but I ran into problems in several places with the math module functions because of the domain, e.g. math.sqrt(-1). You would still have to trick around with if statements here. That's why I decided to use Peter's suggestion.
If you collect values in a loop and plot them in a histogram, it also looks quite good, here with 40000 values and 100 bins. Here is the whole code for anyone who is interested:
import numpy as np
import math
import random
import matplotlib.pyplot as plt

N = 40000
bins = 100

def Deletion_method():
    x = None
    while x is None:
        mn = 0  # Lowest value of domain
        mx = 2*math.pi  # Highest value of domain
        bound = 64/(99*math.pi)  # Upper bound of PDF value
        # Choose an X inside the desired sampling domain.
        xrad = random.uniform(mn, mx)
        # Choose a Y between 0 and the maximum PDF value.
        y = random.uniform(0, bound)
        # Calculate the PDF at xrad.
        P = ((3 + math.cos(xrad)**2)**2) * (1/(99*math.pi/4))
        # Does (xrad, y) fall below the PDF?
        if y < P:
            x = xrad
    return x

Values = []
for k in range(0, N):
    Values = np.append(Values, [Deletion_method()])

plt.hist(Values, bins)
plt.show()

Sampling from a set according to unnormalized log-probabilities in NumPy

I have a 1-D np.ndarray filled with unnormalized log-probabilities that define a categorical distribution. I would like to sample an integer index from this distribution. Since many of the probabilities are small, normalizing and exponentiating the log-probabilities introduces significant numerical error, therefore I cannot use np.random.choice. Effectively, I am looking for a NumPy equivalent to TensorFlow's tf.random.categorical, which works on unnormalized log-probabilities.
If there is not a function in NumPy that achieves this directly, what is an efficient manner to implement such sampling?
In general, there are many ways to choose an integer with a custom distribution, but most of them take weights that are proportional to the given probabilities. If the weights are log probabilities instead, then a slightly different approach is needed. Perhaps the simplest algorithm for this is rejection sampling, described below and implemented in Python. In the following algorithm, the maximum log-probability is max, and there are k integers to choose from.
1. Choose a uniform random integer i in [0, k).
2. Get the log-weight corresponding to i, then generate an exponential(1) random number, call it ex.
3. If max minus ex is less than the log-weight, return i. Otherwise, go to step 1.
The time per sample is constant on average for a fixed distribution, and smallest when max is set to the true maximum log-weight; however, the expected number of iterations per sample depends greatly on the shape of the distribution. See also Keith Schwarz's discussion of the "Fair Die/Biased Coin Loaded Die" algorithm.
Now, Python code for this algorithm follows.
import random
import math

def categ(c):
    # Do a weighted choice of an item with the
    # given log-probabilities.
    cm = max(c)  # Find the maximum log-probability
    while True:
        # Choose an item at random.
        x = random.randint(0, len(c)-1)
        # Accept it with probability proportional to exp(c[x]).
        y = cm - random.expovariate(1)
        # Alternatively: y = math.log(random.random()) + cm
        if y < c[x]:
            return x
The code above generates one variate at a time and uses only Python's base modules, rather than NumPy. Another answer shows how rejection sampling can be implemented in NumPy by blocks of random variates at a time (demonstrated on a different random sampling task, though).
The so-called "Gumbel max trick", used above all in machine learning, can be used to sample from a distribution with unnormalized log probabilities. This involves—
("Gumbel") adding a separate Gumbel random variate to each log probability, namely −ln(−ln(U)) where U is a random variate greater than 0 and less than 1, then
("max") choosing the item corresponding to the highest log probability.
However, the time complexity for this algorithm is linear in the number of items.
The following code illustrates the Gumbel max trick:
import random
import math

def categ(c):
    # Do a weighted choice of an item with the given
    # log-probabilities, using the Gumbel max trick.
    return max([[c[i] - math.log(-math.log(random.random())), i]
                for i in range(len(c))])[1]
    # Or:
    # return max([[c[i] - math.log(random.expovariate(1)), i]
    #             for i in range(len(c))])[1]
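Since the question asks for a NumPy equivalent, here is a vectorized sketch of the same Gumbel max trick (the function name and signature are my own, not an official NumPy API):
import numpy as np

def categ_np(logits, n, rng=None):
    # Vectorized Gumbel max trick: add independent Gumbel(0, 1) noise
    # to each unnormalized log-probability and take the argmax per row.
    rng = np.random.default_rng() if rng is None else rng
    gumbel = rng.gumbel(size=(n, len(logits)))
    return np.argmax(np.asarray(logits) + gumbel, axis=1)
For example, categ_np([0.5, -2.0, 1.0], 5) returns five sampled indices at once.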

Alternative to np.arange or np.linspace with non-uniform intervals

I can get a uniform grid on [0, 2*pi) with numpy's np.arange(); however, I want a grid with the same number of points but with a higher density of points on a certain interval, e.g. a finer grid on [pi, 1.5*pi]. How can I achieve this? Is there a numpy function that accepts a density function and outputs a grid with that density?
I'm surprised that I can't find a similar Q&A on Stack Overflow. There are a few on doing something similar for random numbers from a discrete distribution, but not for continuous distributions and also not as a modified np.arange or np.linspace.
If you need to get an x range for plotting that has finer sampling in areas where you expect the function to fluctuate more rapidly, you can create a nonlinear function that takes inputs in the range 0 to 1 and produces outputs in the same range that proceeds nonlinearly. For example:
import numpy as np

num = 100  # number of grid points (chosen for illustration)

def f(x):
    return x**2

angles = 2*np.pi*f(np.linspace(0, 1, num, endpoint=False))
This will produce fine sampling near zero and coarse sampling near 2*pi.
For more fine-grained control of the sampling density, you can use the function below. As a bonus, it will also allow random sampling.
import numpy as np

def density_space(xs, ps, n, endpoint=False, order=1, random=False):
    """Draw samples with spacing specified by a density function.

    Copyright Han-Kwang Nienhuys (2020).
    License: any of: CC-BY, CC-BY-SA, BSD, LGPL, GPL.
    Reference: https://stackoverflow.com/a/62740029/6228891

    Parameters:
    - xs: array, ordered by increasing values.
    - ps: array, corresponding densities (not normalized).
    - n: number of output values.
    - endpoint: whether to include xs[-1] in the output.
    - order: interpolation order (1 or 2). Order 2 requires
      dense sampling and a smoothly varying density to work
      correctly.
    - random: whether to return random samples (ignoring endpoint);
      in this case, n can be a shape tuple.

    Returns:
    - array, shape (n,), with values from xs[0] to xs[-1].
    """
    from scipy.interpolate import interp1d
    from scipy.integrate import cumtrapz

    cps = cumtrapz(ps, xs, initial=0)  # cumulative density, by trapezoid rule
    cps *= (1/cps[-1])  # normalize to [0, 1]
    intfunc = interp1d(cps, xs, kind=order)  # inverse of the cumulative density
    if random:
        return intfunc(np.random.uniform(size=n))
    else:
        return intfunc(np.linspace(0, 1, n, endpoint=endpoint))
Test:
values = density_space(
[0, 100, 101, 200],
[1, 1, 2, 2],
n=12, endpoint=True)
print(np.around(values))
[ 0. 27. 54. 82. 105. 118. 132. 146. 159. 173. 186. 200.]
The cumulative density function is created using trapezoid integration, which is essentially based on linear interpolation. A higher-order integration is not safe because the input may have (near-)discontinuities, like the jump from x=100 to x=101 in the example. A discontinuity in the input results in a discontinuous first derivative in the cumulative density function (cps in the code), which will cause problems with a smooth interpolation (order 2 or above). Hence the recommendation to use order=2 only for a smooth density function - and not to use any higher orders.
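To tie this back to the question, a hypothetical usage that triples the sampling density on [pi, 1.5*pi] might look like this:
import numpy as np

xs = np.linspace(0, 2*np.pi, 1000)
ps = np.where((xs >= np.pi) & (xs <= 1.5*np.pi), 3.0, 1.0)  # 3x density on [pi, 1.5*pi]
grid = density_space(xs, ps, n=100)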

Generating random numbers with a given probability density function

I want to specify the probability density function of a distribution and then pick up N random numbers from that distribution in Python. How do I go about doing that?
In general, you want to have the inverse cumulative distribution function (inverse CDF). Once you have that, generating random numbers along the distribution is simple:
import random

def sample(n):
    return [icdf(random.random()) for _ in range(n)]
Or, if you use NumPy:
import numpy as np

def sample(n):
    return icdf(np.random.random(n))
In both cases icdf is the inverse cumulative distribution function which accepts a value between 0 and 1 and outputs the corresponding value from the distribution.
To illustrate the nature of icdf, we'll take a simple uniform distribution between values 10 and 12 as an example:
the probability density function is 0.5 between 10 and 12, zero elsewhere
the cumulative distribution function is 0 below 10 (no samples below 10), 1 above 12 (no samples above 12), and increases linearly between those values (the integral of the PDF)
the inverse cumulative distribution function is only defined between 0 and 1. At 0 it is 10, at 1 it is 12, and it changes linearly in between
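For this uniform example the inverse CDF has a simple closed form (a minimal sketch):
def icdf(u):
    # Inverse CDF of the uniform distribution on [10, 12):
    # u in [0, 1) maps linearly to [10, 12).
    return 10 + 2*u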
Of course, the difficult part is obtaining the inverse cumulative distribution function. It really depends on your distribution; sometimes you may have an analytical function, sometimes you may want to resort to interpolation. Numerical methods may be useful, as numerical integration can be used to create the CDF and interpolation can be used to invert it.
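As a sketch of that numerical route (make_icdf is a hypothetical helper, assuming the PDF can be evaluated on a NumPy array), you can tabulate the CDF with trapezoid integration and invert it with np.interp:
import numpy as np

def make_icdf(pdf, lo, hi, num=1000):
    # Tabulate the PDF, integrate to a CDF via the trapezoid rule,
    # then invert the CDF by linear interpolation.
    xs = np.linspace(lo, hi, num)
    areas = (pdf(xs[1:]) + pdf(xs[:-1])) / 2 * np.diff(xs)
    cdf = np.concatenate(([0.0], np.cumsum(areas)))
    cdf /= cdf[-1]  # normalize so the CDF ends at 1
    return lambda u: np.interp(u, cdf, xs)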
This is my function to retrieve a single random number distributed according to the given probability density function. I used a Monte-Carlo like approach. Of course n random numbers can be generated by calling this function n times.
"""
Draws a random number from given probability density function.
Parameters
----------
pdf -- the function pointer to a probability density function of form P = pdf(x)
interval -- the resulting random number is restricted to this interval
pdfmax -- the maximum of the probability density function
integers -- boolean, indicating if the result is desired as integer
max_iterations -- maximum number of 'tries' to find a combination of random numbers (rand_x, rand_y) located below the function value calc_y = pdf(rand_x).
returns a single random number according the pdf distribution.
"""
def draw_random_number_from_pdf(pdf, interval, pdfmax = 1, integers = False, max_iterations = 10000):
for i in range(max_iterations):
if integers == True:
rand_x = np.random.randint(interval[0], interval[1])
else:
rand_x = (interval[1] - interval[0]) * np.random.random(1) + interval[0] #(b - a) * random_sample() + a
rand_y = pdfmax * np.random.random(1)
calc_y = pdf(rand_x)
if(rand_y <= calc_y ):
return rand_x
raise Exception("Could not find a matching random number within pdf in " + max_iterations + " iterations.")
In my opinion this solution performs better than others if you do not need to retrieve a very large number of random variates. Another benefit is that you only need the PDF and avoid calculating the CDF, inverse CDF, or weights.
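A hypothetical usage, sampling from the half-sine density sin(x)/2 on [0, pi], whose maximum is 0.5:
import numpy as np

pdf = lambda x: np.sin(x)/2  # proper density on [0, pi] with maximum 0.5
samples = [draw_random_number_from_pdf(pdf, (0, np.pi), pdfmax=0.5)
           for _ in range(1000)]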
