The random module (http://docs.python.org/2/library/random.html) provides functions for sampling from a number of fixed distributions. For example, random.gauss will sample a random point from a normal distribution with a given mean and sigma.
I'm looking for a way to draw N random samples within a given interval from my own distribution, as fast as possible, in Python. This is what I mean:
def my_dist(x):
    # Some distribution; assume c1, c2, c3 and c4 are known.
    f = c1*exp(-((x-c2)**c3)/c4)
    return f

# Draw N random samples from my distribution between given limits a, b.
N = 1000
N_rand_samples = ran_func_sample(my_dist, a, b, N)
where ran_func_sample is what I'm after, and a, b are the limits from which to draw the samples. Is there anything of that sort in Python?
You need to use inverse transform sampling to get random values distributed according to the law you want. With this method you simply apply the inverse of the CDF
to random numbers drawn from the standard uniform distribution on the interval [0, 1].
Once you have found the inverted function, you get 1000 numbers distributed according to the needed distribution in this obvious way:
[inverted_function(random.random()) for x in range(1000)]
More on Inverse Transform Sampling:
http://en.wikipedia.org/wiki/Inverse_transform_sampling
Also, there is a good question on StackOverflow related to the topic:
Pythonic way to select list elements with different probability
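For instance, here is a minimal sketch with a distribution whose CDF inverts analytically (an exponential law, purely as an illustration; the rate value 1.5 is arbitrary):

import math
import random

# Exponential distribution with rate lam: pdf f(x) = lam * exp(-lam * x) for x >= 0.
# Its CDF is F(x) = 1 - exp(-lam * x), so the inverse is F^-1(u) = -ln(1 - u) / lam.
def inverted_function(u, lam=1.5):
    return -math.log(1.0 - u) / lam

samples = [inverted_function(random.random()) for _ in range(1000)]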
This code implements sampling of n-dimensional discrete probability distributions. By setting a flag on the object, it can also be used as a piecewise-constant probability distribution, which can then be used to approximate arbitrary pdfs. Well, arbitrary pdfs with compact support; if you want to efficiently sample extremely long tails, a non-uniform description of the pdf would be required. But this is still efficient even for things like Airy point-spread functions (which I created it for, initially). The internal sorting of values is absolutely critical there to get accuracy: the many small values in the tails should contribute substantially, but without sorting they get drowned out by floating-point precision.
import numpy as np

class Distribution(object):
    """
    Draws samples from a one-dimensional probability distribution
    by means of inversion of a discretized cumulative distribution function.

    The pdf can be sorted first to prevent numerical error in the cumulative sum;
    this is set as the default. For big density functions with high contrast
    it is absolutely necessary, and for small density functions
    the overhead is minimal.

    A call to this distribution object returns indices into the density array.
    """
    def __init__(self, pdf, sort=True, interpolation=True, transform=lambda x: x):
        self.shape = pdf.shape
        self.pdf = pdf.ravel()
        self.sort = sort
        self.interpolation = interpolation
        self.transform = transform
        # a pdf cannot be negative
        assert np.all(pdf >= 0)
        # sort the pdf by magnitude
        if self.sort:
            self.sortindex = np.argsort(self.pdf, axis=None)
            self.pdf = self.pdf[self.sortindex]
        # construct the cumulative distribution function
        self.cdf = np.cumsum(self.pdf)
    @property
    def ndim(self):
        return len(self.shape)
    @property
    def sum(self):
        """cached sum of all pdf values; the pdf need not sum to one, and is implicitly normalized"""
        return self.cdf[-1]
    def __call__(self, N):
        """draw N samples"""
        # pick numbers which are uniformly random over the cumulative distribution function
        choice = np.random.uniform(high=self.sum, size=N)
        # find the indices corresponding to this point on the CDF
        index = np.searchsorted(self.cdf, choice)
        # if necessary, map the indices back to their original ordering
        if self.sort:
            index = self.sortindex[index]
        # map back to multi-dimensional indexing
        index = np.unravel_index(index, self.shape)
        index = np.vstack(index)
        # is this a discrete or piecewise continuous distribution?
        if self.interpolation:
            index = index + np.random.uniform(size=index.shape)
        return self.transform(index)

if __name__ == '__main__':
    shape = 3, 3
    pdf = np.ones(shape)
    pdf[1] = 0
    dist = Distribution(pdf, transform=lambda i: i - 1.5)
    print dist(10)
    import matplotlib.pyplot as pp
    pp.scatter(*dist(1000))
    pp.show()
And as a more realistic, real-world example:
x = np.linspace(-100, 100, 512)
p = np.exp(-x**2)
pdf = p[:,None]*p[None,:] #2d gaussian
dist = Distribution(pdf, transform=lambda i:i-256)
print dist(1000000).mean(axis=1) #should be in the 1/sqrt(1e6) range
import matplotlib.pyplot as pp
pp.scatter(*dist(1000))
pp.show()
Here is a rather nice way of performing inverse transform sampling with a decorator.
import numpy as np
from scipy.interpolate import interp1d
def inverse_sample_decorator(dist):
    def wrapper(pnts, x_min=-100, x_max=100, n=1e5, **kwargs):
        x = np.linspace(x_min, x_max, int(n))
        cumulative = np.cumsum(dist(x, **kwargs))
        cumulative -= cumulative.min()
        f = interp1d(cumulative/cumulative.max(), x)
        return f(np.random.random(pnts))
    return wrapper
Using this decorator on a Gaussian distribution, for example:
@inverse_sample_decorator
def gauss(x, amp=1.0, mean=0.0, std=0.2):
    return amp*np.exp(-(x-mean)**2/std**2/2.0)
You can then generate sample points from the distribution by calling the function. The keyword arguments x_min and x_max are the limits of the original distribution and can be passed as arguments to gauss along with the other keyword arguments that parameterise the distribution.
samples = gauss(5000, mean=20, std=0.8, x_min=19, x_max=21)
Alternatively, this can be done as a function that takes the distribution as an argument (as in your original question),
def inverse_sample_function(dist, pnts, x_min=-100, x_max=100, n=1e5, **kwargs):
    x = np.linspace(x_min, x_max, int(n))
    cumulative = np.cumsum(dist(x, **kwargs))
    cumulative -= cumulative.min()
    f = interp1d(cumulative/cumulative.max(), x)
    return f(np.random.random(pnts))
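For example, a usage sketch of the function form (gauss_pdf here is just the same Gaussian as above, undecorated; the name is my own):

def gauss_pdf(x, amp=1.0, mean=0.0, std=0.2):
    return amp*np.exp(-(x-mean)**2/std**2/2.0)

samples = inverse_sample_function(gauss_pdf, 5000, x_min=19, x_max=21,
                                  mean=20, std=0.8)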
I was in a similar situation, but I wanted to sample from a multivariate distribution, so I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method).
import numpy as np

def metropolis_hastings(target_density, size=500000):
    burnin_size = 10000
    size += burnin_size
    x0 = np.array([[0, 0]])
    xt = x0
    samples = []
    for i in range(size):
        xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
        accept_prob = (target_density(xt_candidate))/(target_density(xt))
        if np.random.uniform(0, 1) < accept_prob:
            xt = xt_candidate
        samples.append(xt)
    samples = np.array(samples[burnin_size:])
    samples = np.reshape(samples, [samples.shape[0], 2])
    return samples
This function requires a function target_density which takes a data point and computes its probability.
For details, check out this detailed answer of mine.
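For instance, a usage sketch with a simple unnormalized 2-D Gaussian as target_density (the density itself is only an illustration):

def target_density(x):
    # unnormalized 2-D standard Gaussian; x has shape (1, 2), as in the sampler above
    return np.exp(-0.5 * np.sum(x[0]**2))

samples = metropolis_hastings(target_density, size=50000)
print(samples.mean(axis=0))  # should be close to [0, 0]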
import numpy as np
import scipy.interpolate as interpolate
def inverse_transform_sampling(data, n_bins, n_samples):
    hist, bin_edges = np.histogram(data, bins=n_bins, density=True)
    cum_values = np.zeros(bin_edges.shape)
    cum_values[1:] = np.cumsum(hist*np.diff(bin_edges))
    inv_cdf = interpolate.interp1d(cum_values, bin_edges)
    r = np.random.rand(n_samples)
    return inv_cdf(r)
So if we give our data sample, which has a specific distribution, the inverse_transform_sampling function will return a dataset with approximately the same distribution. The advantage here is that we can choose our own sample size by specifying it in the n_samples argument.
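A quick usage sketch (the normally distributed input data here is just an illustration):

data = np.random.normal(loc=5.0, scale=2.0, size=10000)
new_samples = inverse_transform_sampling(data, n_bins=50, n_samples=2000)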
I am completely new to programming. I have a density function which has two ranges. How can I draw random numbers according to this function?
The probability density function for the last return time is:
f(x) = (1/sqrt(2*pi*std**2)) * exp(-(x+24-µ2)**2 / (2*std**2)),   for 0 < x <= µ2 - 12
f(x) = (1/sqrt(2*pi*std**2)) * exp(-(x-µ2)**2 / (2*std**2)),      for µ2 - 12 < x <= 24
with std = 3.4 and µ2 = 17.6.
After searching for a couple of hours I found this recipe:
1. Get a random number between 0 and 1.
2. Calculate the CDF.
3. Calculate the inverse CDF.
4. Get the random number from the inverse CDF.
But I don't know how to implement this in Python.
You can create your own distribution using scipy.stats.rv_continuous as the base class. Given the PDF of the distribution, this class provides fast default implementations of the CDF, random number generation, SF, ISF, etc. You can implement your own distribution using something like:
import numpy as np
from numpy import exp
from scipy.stats import rv_continuous
class my_distribution_gen(rv_continuous):
    def _logpdf(self, x, mu, std):
        # code the log of your pdf function here
        result = ...  # here goes your equation
        return result
    def _pdf(self, x, mu, std):
        return exp(self._logpdf(x, mu, std))

my_distribution = my_distribution_gen(name='my_distribution')
Once you have the above class ready, you can enjoy the default implementations by calling methods like rvs, cdf, etc.
mu, std = 0, 1
rvs = my_distribution.rvs(mu, std)
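As a self-contained sketch of the same idea with a concrete, analytically normalized pdf (the cubic distribution here is only an illustration, not the questioner's piecewise density):

from scipy.stats import rv_continuous

class cubic_gen(rv_continuous):
    # pdf f(x) = 3*x**2 on [0, 1]; it integrates to 1, as rv_continuous expects
    def _pdf(self, x):
        return 3.0 * x**2

cubic = cubic_gen(a=0.0, b=1.0, name='cubic')
samples = cubic.rvs(size=1000)  # uses the default numerical inversion of the CDF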
I understand that I can generate bins for arrays with numpy using numpy.histogram or numpy.digitize, and have done so in the past. However, the analysis I need to do requires an equal number of samples in each bin, and the data is not uniformly distributed.
Say I have approximately normally distributed data in an array, A = numpy.random.randn(1000). How can I bin this data (either by creating an index or by finding the values which define the extents of each bin) so that there is an equal number of samples in each bin?
I know this can be treated as an optimization problem, and can solve it as such, i.e.:
import numpy as np
from scipy.optimize import fmin
def generate_even_bins(A, n):
    x0 = np.linspace(A.min(), A.max(), n)
    def bin_counts(x, A):
        if np.any(np.diff(x) <= 0):
            return 1e6
        else:
            binned = np.digitize(A, x)
            return np.abs(np.diff(np.bincount(binned))).sum()
    return fmin(bin_counts, x0, args=(A,))
... but is there something already available, perhaps in numpy or scipy.stats, that implements this? If not, shouldn't there be?
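(For what it's worth, here is a quantile-based sketch of the kind of helper I have in mind; plain numpy only, and the helper name is my own:)

import numpy as np

def equal_count_bins(A, n):
    # bin edges at evenly spaced quantiles give (approximately) equal counts per bin
    return np.percentile(A, np.linspace(0, 100, n + 1))

A = np.random.randn(1000)
edges = equal_count_bins(A, 10)
counts, _ = np.histogram(A, bins=edges)  # each entry is ~100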
I want to specify the probability density function of a distribution and then draw N random numbers from that distribution in Python. How do I go about doing that?
In general, you want the inverse cumulative distribution function (ICDF). Once you have that, generating random numbers along the distribution is simple:
import random
def sample(n):
    return [icdf(random.random()) for _ in range(n)]
Or, if you use NumPy:
import numpy as np
def sample(n):
    return icdf(np.random.random(n))
In both cases icdf is the inverse cumulative distribution function which accepts a value between 0 and 1 and outputs the corresponding value from the distribution.
To illustrate the nature of icdf, we'll take a simple uniform distribution between the values 10 and 12 as an example:
the probability density function is 0.5 between 10 and 12, and zero elsewhere
the cumulative distribution function is 0 below 10 (no samples below 10), 1 above 12 (no samples above 12), and increases linearly between those values (the integral of the PDF)
the inverse cumulative distribution function is only defined between 0 and 1: at 0 it is 10, at 1 it is 12, and it changes linearly between those values
Of course, the difficult part is obtaining the inverse cumulative distribution function. It really depends on your distribution: sometimes you have an analytical expression, sometimes you have to resort to interpolation. Numerical methods may be useful, as numerical integration can be used to create the CDF and interpolation can be used to invert it.
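A minimal numerical sketch of that approach (the helper name make_icdf and the grid size are my own choices; the pdf is the uniform example above):

import numpy as np
from scipy.interpolate import interp1d

def make_icdf(pdf, x_min, x_max, n=10001):
    # tabulate the pdf, integrate numerically to a CDF, then invert it by interpolation
    x = np.linspace(x_min, x_max, n)
    cdf = np.cumsum(pdf(x))
    cdf -= cdf[0]
    cdf /= cdf[-1]
    return interp1d(cdf, x)

icdf = make_icdf(lambda x: np.full_like(x, 0.5), 10.0, 12.0)
samples = icdf(np.random.random(1000))  # samples are uniform between 10 and 12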
This is my function to retrieve a single random number distributed according to a given probability density function. I used a Monte Carlo-like (rejection sampling) approach. Of course, n random numbers can be generated by calling this function n times.
"""
Draws a random number from given probability density function.
Parameters
----------
pdf -- the function pointer to a probability density function of form P = pdf(x)
interval -- the resulting random number is restricted to this interval
pdfmax -- the maximum of the probability density function
integers -- boolean, indicating if the result is desired as integer
max_iterations -- maximum number of 'tries' to find a combination of random numbers (rand_x, rand_y) located below the function value calc_y = pdf(rand_x).
returns a single random number according the pdf distribution.
"""
def draw_random_number_from_pdf(pdf, interval, pdfmax = 1, integers = False, max_iterations = 10000):
for i in range(max_iterations):
if integers == True:
rand_x = np.random.randint(interval[0], interval[1])
else:
rand_x = (interval[1] - interval[0]) * np.random.random(1) + interval[0] #(b - a) * random_sample() + a
rand_y = pdfmax * np.random.random(1)
calc_y = pdf(rand_x)
if(rand_y <= calc_y ):
return rand_x
raise Exception("Could not find a matching random number within pdf in " + max_iterations + " iterations.")
In my opinion this solution performs better than the others if you do not have to retrieve a very large number of random variables. Another benefit is that you only need the PDF and avoid calculating the CDF, the inverse CDF, or weights.
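A usage sketch, assuming a pdf like the one in the original question (for exp(-x**2) the maximum is 1, so pdfmax=1.0 is valid):

import numpy as np

def my_dist(x):
    return np.exp(-x**2)

values = [draw_random_number_from_pdf(my_dist, (-3, 3), pdfmax=1.0) for _ in range(1000)]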
I am having issues with the numerical accuracy of the scipy.optimize.curve_fit function in Python. It seems to me that I can only get ~8 digits of accuracy when I desire ~15 digits. I have some data (at this point artificially created), generated by the expression implemented as data1 in the code below,
where term 1 is ~10^-3, term 2 is ~10^-6, and term 3 is ~10^-11. In the data, I vary A randomly (it is a Gaussian error). I then try to fit this to a model (model1 in the code below),
where lambda is a constant, and I only fit alpha (it is a parameter in the model). What I would expect is a linear relationship between alpha and A, because terms 1 and 2 in the data creation are also in the model, so they should cancel perfectly.
However, what happens is that for small A (~10^-11 and below) alpha does not scale with A; that is to say, as A gets smaller and smaller, alpha levels out and remains constant.
For reference, I call the following:
op, pcov = scipy.optimize.curve_fit(model, xdata, ydata, p0=None, sigma=sig)
My first thought was that I was not using double precision, but I am pretty sure that Python automatically creates numbers in double precision. Then I thought perhaps it was an issue with the output that cuts off the digits? Anyway, I could put my code in here, but it is sort of complicated. Is there a way to ensure that the curve-fitting function preserves my digits?
Thank you so much for your help!
EDIT: Below is my code:
# Import proper packages
import numpy as np
import numpy.random as npr
import scipy as sp
import scipy.constants as spc
import scipy.optimize as spo
from matplotlib import pyplot as plt
from numpy import ndarray as nda
from decimal import *

# Declare global variables
AU = 149597871000.0
test_lambda = 20*AU
M_Sun = (1.98855*(sp.power(10.0, 30.0)))
M_Jupiter = (M_Sun/1047.3486)
test_jupiter_mass = M_Jupiter
test_sun_mass = M_Sun
rad_jup = 5.2*AU
ran = np.linspace(AU, 100*AU, num=100)
delta_a = np.power(10.0, -11.0)
chi_limit = 118.498

# Model acceleration of the spacecraft from the sun (with Yukawa term)
def model1(distance, A):
    return (spc.G)*(M_Sun/(distance**2.0))*(1 + A*(np.exp(-distance/test_lambda))) + (spc.G)*(M_Jupiter*distance)/((distance**2.0 + rad_jup**2.0)**(3.0/2.0))

# Function that creates a data point for test 1
def data1(distance, dela):
    return (spc.G)*(M_Sun/(distance**2.0) + (M_Jupiter*distance)/((distance**2.0 + rad_jup**2.0)**(3.0/2.0))) + dela

# Generates a list of 100 data sets varying by ~delta_a for test 1
def generate_data1():
    data_list = []
    for i in range(100):
        acc_lst = []
        for dist in ran:
            x = data1(dist, npr.normal(0, delta_a))
            acc_lst.append(x)
        data_list.append(acc_lst)
    return data_list

# Generates a list of standard deviations at each distance from the sun.
# Since delta_a is constant, the standard deviation of each point is constant
def generate_sig():
    sig = []
    for i in range(100):
        sig.append(delta_a)
    return sig

# Finds alpha for test 1; since we vary delta_a in test 1, we need to generate new data each time we find alpha
def find_alpha1(data_list, sig):
    alphas = []
    for data in data_list:
        op, pcov = spo.curve_fit(model1, ran, data, p0=None, sigma=sig)
        alphas.append(op[0])
    return alphas

# Tests the dependence of alpha on delta_a and plots the dependence
def test1():
    global delta_a
    global test_lambda
    test_lambda = 20*AU
    delta_a = 10.0**-20.0
    alphas = []
    delta_as = []
    for i in range(20):
        print i
        data_list = generate_data1()
        print np.array(data_list[0])
        sig = generate_sig()
        alpha = find_alpha1(data_list, sig)
        delas = []
        for alp in alpha:
            if alp < 0:
                x = 0
                plt.loglog(delta_a, abs(alp), '.' 'r')
            else:
                x = 0
                plt.loglog(delta_a, alp, '.' 'b')
        delta_a *= 10
    plt.xlabel('Delta A')
    plt.ylabel('Alpha (at Lambda = 5 AU)')
    plt.show()

def main():
    test1()

if __name__ == '__main__':
    main()
I believe this is to do with the minimisation algorithm used here, and the maximum obtainable precision.
I remember reading about it in Numerical Recipes a few years ago; I'll see if I can dig up a reference for you.
Edit:
Link to Numerical Recipes here; skip down to page 394 and then read that chapter. Note the third paragraph on page 404:
"Indulge us a final reminder that tol should generally be no smaller
than the square root of your machine’s floating-point precision."
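In scipy's case, those tolerances are the ftol/xtol arguments of the underlying leastsq routine, which curve_fit forwards as keyword arguments; a minimal sketch with a toy model (the model and data here are illustrative only):

import numpy as np
from scipy.optimize import curve_fit

def model(x, a):
    return a * x  # toy stand-in for the real model

xdata = np.linspace(1.0, 10.0, 50)
ydata = model(xdata, 2.0) + 1e-9 * np.random.randn(50)

eps = np.finfo(np.float64).eps  # machine epsilon, ~2.2e-16
# sqrt(eps) ~ 1.5e-8 is roughly the smallest tolerance worth requesting
popt, pcov = curve_fit(model, xdata, ydata, ftol=np.sqrt(eps), xtol=np.sqrt(eps))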
And the Mathematica documentation mentions that if you want accuracy you need to use a different method, and that they don't in fact use LMA unless the problem is recognised as being a sum-of-squares problem.
Given that you're just doing a one-dimensional fit, it might be a good exercise to try implementing one of the fitting algorithms they mention in that chapter.
What are you actually trying to achieve, though? From what I understand, you're essentially trying to work out the amount of random noise you've added to the curve. But that's not really what you're doing - unless I've misunderstood.
Edit2:
So after reading how you generate the data, there's an issue with the data and the model you're applying.
You're essentially trying to fit the height of a Gaussian to random numbers; you're not fitting the Gaussian to the frequency (histogram) of those numbers.
Looking at your code, and judging from what you've said, this isn't your end goal - you're just trying to get used to the optimisation method?
It would make more sense if you randomly adjusted the distance from the sun, then fit to the data and see whether you can recover the distance which generated the data set.
I am trying to use NumPy's fft function; however, when I give the function a simple Gaussian the FFT of that Gaussian is not itself a Gaussian. It's close, but it's split in half, so that each half sits at either end of the x axis.
The Gaussian function I'm calculating is
y = exp(-x^2)
Here is my code:
from cmath import *
from numpy import multiply
from numpy.fft import fft
from pylab import plot, show
""" Basically the standard range() function but with float support """
def frange (min_value, max_value, step):
value = float(min_value)
array = []
while value < float(max_value):
array.append(value)
value += float(step)
return array
N = 256.0 # number of steps
y = []
x = frange(-5, 5, 10/N)
# fill array y with values of the Gaussian function
cache = -multiply(x, x)
for i in cache: y.append(exp(i))
Y = fft(y)
# plot the fft of the gausian function
plot(x, abs(Y))
show()
The result is not quite right, because the FFT of a Gaussian function should be a Gaussian function itself...
np.fft.fft returns a result in so-called "standard order" (from the docs):
If A = fft(a, n), then A[0] contains the zero-frequency term (the mean of the signal), which is always purely real for real inputs. Then A[1:n/2] contains the positive-frequency terms, and A[n/2+1:] contains the negative-frequency terms, in order of decreasingly negative frequency.
The function np.fft.fftshift rearranges the result into the order most humans expect (and which is good for plotting):
The routine np.fft.fftshift(A) shifts transforms and their frequencies to put the zero-frequency components in the middle...
So using np.fft.fftshift:
import matplotlib.pyplot as plt
import numpy as np
N = 128
x = np.arange(-5, 5, 10./(2 * N))
y = np.exp(-x * x)
y_fft = np.fft.fftshift(np.abs(np.fft.fft(y))) / np.sqrt(len(y))
plt.plot(x,y)
plt.plot(x,y_fft)
plt.show()
Your result is not even close to a Gaussian, not even one split into two halves.
To get the result you expect, you will have to position your own Gaussian with the center at index 0, and the result will also be positioned that way. Try the following code:
from pylab import *
N = 128
x = r_[arange(0, 5, 5./N), arange(-5, 0, 5./N)]
y = exp(-x*x)
y_fft = fft(y) / sqrt(2 * N)
plot(r_[y[N:], y[:N]])
plot(r_[y_fft[N:], y_fft[:N]])
show()
The plot commands split the arrays into two halves and swap them to get a nicer picture.
It is being displayed with the center (i.e. mean) at coefficient index zero. That is why it appears that the right half is on the left, and vice versa.
EDIT: Explore the following code:
import scipy
import scipy.signal as sig
import pylab
x = sig.gaussian(2048, 10)
X = scipy.absolute(scipy.fft(x))
pylab.plot(x)
pylab.plot(X)
pylab.plot(X[range(1024, 2048)+range(0, 1024)])
The last line will plot X starting from the center of the vector, then wrap around to the beginning.
A Fourier transform implicitly repeats indefinitely, since it is a transform of a signal that implicitly repeats indefinitely. Note that when you pass y to be transformed, the x values are not supplied, so the Gaussian that is actually transformed is one centred on the median index between 0 and 256, i.e. 128.
Remember also that a translation of f(x) corresponds to a phase change of its Fourier transform.
Following on from Sven Marnach's answer, a simpler version would be this:
from pylab import *
N = 128
x = ifftshift(arange(-5,5,5./N))
y = exp(-x*x)
y_fft = fft(y) / sqrt(2 * N)
plot(fftshift(y))
plot(fftshift(y_fft))
show()
This yields a plot identical to the above one.
The key (and this seems strange to me) is that NumPy's assumed data ordering --- in both frequency and time domains --- is to have the "zero" value first. This is not what I'd expect from other implementations of FFT, such as the FFTW3 libraries in C.
This was slightly fudged in the answers from unutbu and Steve Tjoa above, because they're taking the absolute value of the FFT before plotting it, thus wiping away the phase issues resulting from not using the "standard order" in time.
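To make the phase point concrete, here is a small sketch with the same Gaussian: centred in the middle of the array, its spectrum carries an alternating-sign phase ramp, which abs() hides; moving the centre to index 0 with ifftshift removes it.

import numpy as np

N = 128
x = np.arange(-5, 5, 5./N)
y = np.exp(-x*x)

naive = np.fft.fft(y)                      # Gaussian centred mid-array
shifted = np.fft.fft(np.fft.ifftshift(y))  # Gaussian centred at index 0

print(naive.real[:6])     # alternates in sign: the phase ramp from the off-centre input
print(shifted.real[:6])   # smooth and non-negative: the expected Gaussian spectrum
print(np.abs(naive[:6]))  # essentially identical to abs(shifted[:6]); abs() hides the difference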