How do I implement the Probability density function of a Gaussian Distribution - python

I need to implement a class in Python, that represents a Univariate (for now) Normal Distribution. What I have in mind is as follows
class Norm():
    def __init__(self, mu=0, sigma_sq=1):
        self.mu = mu
        self.sigma_sq = sigma_sq
        # some initialization if necessary

    def sample(self):
        # generate a sample, where the probability of the value
        # of the sample being generated is distributed according
        # to a normal distribution with a particular mean and variance
        pass

N = Norm()
N.sample()
The generated samples should be distributed according to the normal probability density function f(x) = exp(-(x - mu)^2 / (2 * sigma_sq)) / sqrt(2 * pi * sigma_sq).
I know that scipy.stats and Numpy provide functions to do this, but I need to understand how these functions are implemented. Any help would be appreciated, thanks :)

I ended up using the advice by @sascha. I looked at both this Wikipedia article and the NumPy source and found the randomkit.c file, which implements the functions rk_gauss (the Box-Muller transform), rk_double, and rk_random (the Mersenne Twister random number generator, which simulates the uniformly distributed random variable required by the Box-Muller transform).
I then adapted the Mersenne Twister generator from here and implemented the Box-Muller transform to simulate a Gaussian (more information about the Mersenne Twister generator here).
Here is the code I ended up writing:
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

class Distribution():
    def __init__(self):
        pass

    def plot(self, number_of_samples=100000):
        # the histogram of the data (density=True replaces the removed normed=1)
        samples = [self.sample() for i in range(number_of_samples)]
        n, bins, patches = plt.hist(samples, 100, density=True, facecolor='g', alpha=0.75)
        plt.show()

    def sample(self):
        # dummy sample function (to be overridden)
        return 1

class Uniform_distribution(Distribution):
    # To get last 32 bits
    bitmask_1 = (2 ** 32) - 1
    # To get the 32nd bit
    bitmask_2 = 2 ** 31
    # To get last 31 bits
    bitmask_3 = (2 ** 31) - 1

    def __init__(self, seed):
        # Create a length 624 list to store the state of the generator
        # (kept per instance, so that generators do not share state)
        self.MT = [0 for i in range(624)]
        self.index = 0
        self.initialize_generator(seed)

    def initialize_generator(self, seed):
        "Initialize the generator from a seed"
        self.MT[0] = seed & self.bitmask_1
        for i in range(1, 624):
            # standard MT19937 initialization: f * (x ^ (x >> 30)) + i, kept to 32 bits
            self.MT[i] = (1812433253 * (self.MT[i-1] ^ (self.MT[i-1] >> 30)) + i) & self.bitmask_1

    def generate_numbers(self):
        "Generate an array of 624 untempered numbers"
        for i in range(624):
            y = (self.MT[i] & self.bitmask_2) + (self.MT[(i + 1) % 624] & self.bitmask_3)
            self.MT[i] = self.MT[(i + 397) % 624] ^ (y >> 1)
            if y % 2 != 0:
                self.MT[i] ^= 2567483615

    def sample(self):
        """
        Extract a tempered pseudorandom number based on the index-th value,
        calling generate_numbers() every 624 numbers
        """
        if self.index == 0:
            self.generate_numbers()
        y = self.MT[self.index]
        y ^= y >> 11
        y ^= (y << 7) & 2636928640
        y ^= (y << 15) & 4022730752
        y ^= y >> 18
        self.index = (self.index + 1) % 624
        # divide by 4294967296 (2**32, one more than the largest 32 bit number)
        # to normalize the output value to the range [0, 1)
        return y*1.0/4294967296

class Norm(Distribution):
    def __init__(self, mu=0, sigma_sq=1):
        self.mu = mu
        self.sigma_sq = sigma_sq
        # one generator supplies both uniforms; two generators seeded from the
        # same microsecond would produce identical streams (u == v)
        self.uniform_distribution = Uniform_distribution(datetime.now().microsecond)

    def sample(self):
        # generate a sample, where the value of the sample being generated
        # is distributed according to a normal distribution with a particular
        # mean and variance
        # 1 - sample() lies in (0, 1], which avoids log(0)
        u = 1.0 - self.uniform_distribution.sample()
        v = self.uniform_distribution.sample()
        return ((self.sigma_sq**0.5)*((-2*np.log(u))**0.5)*np.cos(2*np.pi*v)) + self.mu
This works perfectly, and generates a pretty good Gaussian
Norm().plot(10000)

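Using the Box-Muller method:
def sample(self):
    # two independent uniform draws on [0, 1)
    x = np.random.uniform(0, 1, [2])
    z = np.sqrt(-2*np.log(x[0]))*np.cos(2*np.pi*x[1])
    # scale by the standard deviation, i.e. the square root of the variance
    return z * np.sqrt(self.sigma_sq) + self.mu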
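As a quick sanity check of the Box-Muller formula used in both answers, here is a minimal standalone sketch that relies only on the standard library. The function name `box_muller_sample`, the `rng` parameter, and the use of `random.random()` as the uniform source are assumptions made for illustration; they are not part of the original answers.

```python
import math
import random

def box_muller_sample(mu=0.0, sigma_sq=1.0, rng=random):
    """Draw one normal variate with mean mu and variance sigma_sq."""
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and log() is safe
    u = 1.0 - rng.random()
    v = rng.random()
    z = math.sqrt(-2.0 * math.log(u)) * math.cos(2.0 * math.pi * v)
    # scale by the standard deviation (square root of the variance)
    return mu + math.sqrt(sigma_sq) * z

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [box_muller_sample(2.0, 9.0, rng) for _ in range(100000)]
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(mean, var)  # both should be close to the requested 2 and 9
```

Comparing the sample mean and variance against the requested parameters is a cheap way to catch the common mistake of scaling by the variance instead of the standard deviation.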

Related

Output for an implementation of a Monte Carlo numerical solver for a 1D potential well doesn't match expected output

I am trying to implement a numerical solver for a 1D harmonic well's ground state using the Metropolis algorithm and the Feynman path integral technique in Python. When I run my program, I end up with a distribution of the different points that my particle has visited; this distribution ought to match that of a particle trapped in a 1D harmonic well. It does not. I have gone through and rewritten my code; I have checked it against similar code used for the same purpose; it all looks like it should work, yet it doesn't.
In blue is the histogram of my results, with density set to True; the orange line is the function describing the expected distribution
As can be seen in the image, what I have ended up with is a distribution that isn't dissimilar to what I was expecting, but it isn't the correct distribution. The code I used (see below) is based on Lepage's (2005) work on the same topic, although I used a slightly different formula to describe the same physical system.
import numpy as np
import random
import matplotlib.pyplot as plt

time = 4 #time over which we evolve our function
steps = 7 #number of steps we take
epsilon = 3 #the pos & neg bounds of our rand variable
N_cor = 100 #the number of times we need to thermalise our function before we take a path
N_cf = 20000 #the number of paths we take

def S(x, j, t, s): #the action of our potential well
    e = t / s
    return (1/(2*e))*(x[j] - x[j - 1])**2 + ((x[j] + x[j-1])/2)**2/2

def update(x, t, s, eps):
    for j in range(0, s):
        old_x = x[j] #old x value
        old_Sj = S(x, j, t, s) #original action value
        x[j] = x[j] + random.uniform(-eps,eps) #new x value
        dS = S(x, j, t, s) - old_Sj #change in action
        if dS > 0 and np.exp(-dS) < random.uniform(0,1): #check for Metropolis alg
            x[j] = old_x
    return x

def gamma(t, s, eps, thermal_num, num_paths):
    zeros = np.zeros(s) #our initial path with s steps
    gamma_arr = np.empty(0) #our initial empty result
    for i in range(0, 10*thermal_num): #thermalisation
        zeros = update(zeros, t, s, eps)
    for j in range(0, num_paths):
        for i in range(0, thermal_num): #thermalising again
            zeros = update(zeros, t, s, eps)
        gamma_arr = np.append(gamma_arr, zeros) #add new path post thermalising
        #print(zeros)
    #print(gamma_arr)
    return gamma_arr

test = gamma(time, steps, epsilon, N_cor, N_cf)
x = np.arange(-4, 4, 0.1)
y = 1/np.sqrt(np.pi)*np.exp(-(x**2)) #expected result
plt.hist(test, bins= 500, density = True)
plt.plot(x, y)
plt.show()

Markov Chain Monte Carlo integration and infinite while loop

I'm implementing a Markov Chain Monte Carlo integrator with both the Metropolis and Barker α's for numerical integration. I've created a class called MCMCIntegrator(). Below the __init__() method are g(x), the PDF of the function I'm integrating, and the alpha method, which implements the Metropolis and Barker α's.
import numpy as np
import scipy.stats as st

class MCMCIntegrator:
    def __init__(self):
        self.size = 1000
        self.std = 0.6
        self.real_int = 0.06496359
        self.sample = None

    @staticmethod
    def g(x):
        return st.gamma.pdf(x, 1, scale=1.378008857)*np.abs(np.cos(1.10257704))

    def alpha(self, a, b, method):
        if method:
            return min(1, self.g(b) / self.g(a))
        else:
            return self.g(b) / (self.g(a) + self.g(b))
The size is the size of the sample that the class must generate, and std is the standard deviation of the normal kernel, which you will see in a few seconds. The real_int is the value of the integral from 1 to 2 of the function we're integrating. I've generated it with an R script. Now, to the problem.
    def _chain(self, method):
        """
        Markov chain heat-up with burn-in
        :param method: Metropolis or Barker alpha
        :return: np.array containing the sample
        """
        old = 0
        sample = np.zeros(int(self.size * 1.3))
        i = 0
        while i != len(sample):
            new = np.random.normal(loc=old, scale=self.std)
            new = abs(new)
            al = self.alpha(old, new, method=method)
            u = np.random.uniform()
            if al > u:
                sample[i] = new
                i += 1
                old = new
        return np.array(sample)
Below this method is an integrate() method that calculates the proportion of numbers in the [1, 2] interval:
    def integrate(self, method=None):
        """
        Integration step
        """
        sample = self._chain(method=method)
        # discarding 30% of the sample for the burn-in
        ind = int(len(sample)*0.3)
        sample = sample[ind:]
        setattr(self, "sample", sample)
        sample = [1 if 1 < v < 2 else 0 for v in sample]
        return np.mean(sample)
This is the main function:
def main():
    print("-- RESULTS --".center(20), end='\n')
    mcmc = MCMCIntegrator()
    print(f"\t{mcmc.integrate()}", end='\n')
    print(f"\t{np.abs(mcmc.integrate() - mcmc.real_int) / mcmc.real_int}")

if __name__ == "__main__":
    main()
I'm stuck in an infinite while loop, with no idea why this is happening.
While I have no prior exposure to Python and no direct explanation for the infinite loop, here are some issues in the code:
The
while i != len(sample):
loop increments i only when the uniform variate is below the acceptance probability (if al > u:). This is not how Metropolis-Hastings operates: if the uniform variate is above the acceptance probability, the current value of the chain is duplicated. This does not explain the infinite loop, though, since a proposed value should eventually be accepted.
If the target density is
st.gamma.pdf(x, 1, scale=1.378008857)*np.abs(np.cos(1.10257704))
then (i) what is the point of the second, constant term np.abs(np.cos(1.10257704)), and (ii) why the need for so many digits?
The proposal distribution
new = np.random.normal(loc=old, scale=self.std)
new = abs(new)
is a folded normal, whose density is not symmetric. Hence it should appear in the Metropolis-Hastings probability, but it may have little impact since the scale is small.
Here is my R rendering of the Python code (edited and corrected)
self.size = 1e5
self.std = 0.6
self.real_int = 0.06496359
g <- function(x){dgamma(x, shape=1, scale=1.378)}
alpha <- function(a, b, method=1){
  r <- g(b) / g(a)  # computed first so that both branches can use it
  ifelse(method, min(1, r), 1 / (1 + 1 / r))}
old = 0
smple = rep(0,self.size * 1.3)
for (i in 1:length(smple)){
  new = abs(old+self.std*rnorm(1))
  al = alpha(old, new, 0)
  old=smple[i]=ifelse(al > runif(1), new, old)
}
ind = trunc(length(smple)*0.3)
smple = smple[ind:length(smple)]
hist(smple,pro=TRUE,nclass=10*log2(self.size),col="wheat")
curve(g(x),add=TRUE,lwd=2,col="sienna")
clearly reproducing the Gamma target:
without the correction for the non-symmetric proposal. The correction would be
q <- function(a, b)
  dnorm(b-a,sd=self.std)+dnorm(-b-a,sd=self.std)
alpha <- function(a, b, method=1){
  r <- g(b) * q(b,a) / g(a) / q(a,b)  # computed first so that both branches can use it
  ifelse(method, min(1, r), 1 / (1 + 1/r))}
old = 0
smple = rep(0,self.size * 1.3)
for (i in 1:length(smple)){
  new = abs(old+self.std*rnorm(1))
  al = alpha(old, new, 3)
  old=smple[i]=ifelse(al > runif(1), new, old)
}
and makes no difference in the above picture. (The acceptance rate for the Metropolis ratio is 85%, while for Barker's, it is 48%.)
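For readers who want the corrected chain in Python rather than R, here is a minimal sketch using only the standard library. The names `target_density` and `metropolis_chain` are hypothetical, the constant factor of the original g(x) is dropped because it cancels in the ratio, and the Gamma(shape=1) target is written directly as an exponential density. Note that the two terms of q in the R code above are unchanged when a and b are swapped, so the ratio q(b,a)/q(a,b) equals 1; the sketch therefore uses the plain Metropolis ratio, which is consistent with the correction making no visible difference.

```python
import math
import random

def target_density(x, scale=1.378):
    """Gamma(shape=1) density, i.e. an exponential with the given scale."""
    return math.exp(-x / scale) / scale if x >= 0 else 0.0

def metropolis_chain(n, std=0.6, burn_in=0.3, rng=random):
    """Random-walk Metropolis with a folded normal proposal.

    Runs n * (1 + burn_in) steps and returns the last n values.
    """
    total = int(n * (1 + burn_in))
    chain = []
    old = 0.0
    for _ in range(total):
        new = abs(rng.gauss(old, std))
        accept_prob = min(1.0, target_density(new) / target_density(old))
        if rng.random() < accept_prob:
            old = new
        # the current value is recorded whether or not the move was accepted
        chain.append(old)
    return chain[total - n:]

if __name__ == "__main__":
    rng = random.Random(1)
    sample = metropolis_chain(50000, rng=rng)
    inside = sum(1 for v in sample if 1 < v < 2) / len(sample)
    exact = math.exp(-1 / 1.378) - math.exp(-2 / 1.378)
    print(inside, exact)  # estimated and exact probability of the interval (1, 2)
```

Because the loop is a plain `for` over a fixed number of steps and rejected proposals repeat the current value, it always terminates, unlike a loop that advances only on acceptance.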

How do I calculate PDF (probability density function) in Python?

I have the following code below that prints the PDF graph for a particular mean and standard deviation.
http://imgur.com/a/oVgML
Now I need to find the actual probability, of a particular value. So for example if my mean is 0, and my value is 0, my probability is 1. This is usually done by calculating the area under the curve. Similar to this:
http://homepage.divms.uiowa.edu/~mbognar/applets/normal.html
I am not sure how to approach this problem
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

def normal(power, mean, std, val):
    a = 1/(np.sqrt(2*np.pi)*std)
    diff = np.abs(np.power(val-mean, power))
    b = np.exp(-(diff)/(2*std*std))
    return a*b

pdf_array = []
array = np.arange(-2,2,0.1)
print(array)
for i in array:
    print(i)
    pdf = normal(2, 0, 0.1, i)
    print(pdf)
    pdf_array.append(pdf)
plt.plot(array, pdf_array)
plt.ylabel('some numbers')
plt.axis([-2, 2, 0, 5])
plt.show()
print()
Unless you have a reason to implement this yourself, all of these functions are available in scipy.stats.norm.
I think you are asking for the CDF; in that case, use this code:
from scipy.stats import norm
print(norm.cdf(x, mean, std))
If you want to write it from scratch:
import math

class PDF():
    def __init__(self,mu=0, sigma=1):
        self.mean = mu
        self.stdev = sigma
        self.data = []

    def calculate_mean(self):
        # true division, so that a fractional mean is not truncated
        self.mean = sum(self.data) / float(len(self.data))
        return self.mean

    def calculate_stdev(self,sample=True):
        # sample=True divides by n - 1, otherwise by n
        if sample:
            n = len(self.data)-1
        else:
            n = len(self.data)
        mean = self.mean
        sigma = 0
        for el in self.data:
            sigma += (el - mean)**2
        sigma = math.sqrt(sigma / n)
        self.stdev = sigma
        return self.stdev

    def pdf(self, x):
        return (1.0 / (self.stdev * math.sqrt(2*math.pi))) * math.exp(-0.5*((x - self.mean) / self.stdev) ** 2)
The area under a curve y = f(x) from x = a to x = b is the same as the integral of f(x)dx from x = a to x = b. Scipy has a quick easy way to do integrals. And just so you understand, the probability of finding a single point in that area cannot be one because the idea is that the total area under the curve is one (unless MAYBE it's a delta function). So you should get 0 ≤ probability of value < 1 for any particular value of interest. There may be different ways of doing it, but a conventional way is to assign confidence intervals along the x-axis like this. I would read up on Gaussian curves and normalization before continuing to code it.
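The distinction between a density value and a probability can be made concrete with a short sketch. It uses only `math.erf` from the standard library; the function names `normal_pdf`, `normal_cdf`, and `interval_probability` are chosen here for illustration and do not come from the answers above.

```python
import math

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of the normal distribution at x (not a probability)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mean=0.0, std=1.0):
    """P(X <= x), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def interval_probability(a, b, mean=0.0, std=1.0):
    """P(a < X <= b), i.e. the area under the density between a and b."""
    return normal_cdf(b, mean, std) - normal_cdf(a, mean, std)

if __name__ == "__main__":
    # the density at the mean can exceed 1 when std is small
    print(normal_pdf(0.0, 0.0, 0.1))
    # the probability of landing within one standard deviation of the mean
    print(interval_probability(-0.1, 0.1, 0.0, 0.1))
```

With mean 0 and standard deviation 0.1, the density at 0 is about 3.99, which shows why a density value cannot be read as a probability; only areas over an interval are probabilities.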

Population Monte Carlo implementation

I am trying to implement the Population Monte Carlo algorithm as described in this paper (see page 78 Fig.3) for a simple model (see function model()) with one parameter using Python. Unfortunately, the algorithm doesn't work and I can't figure out what's wrong. See my implementation below. The actual function is called abc(). All other functions can be seen as helper-functions and seem to work fine.
To check whether the algorithm works, I first generate observed data with the only parameter of the model set to param = 8. Therefore, the posterior resulting from the ABC algorithm should be centered around 8. This is not the case and I'm wondering why.
I would appreciate any help or comments.
# imports
from math import exp
from math import log
from math import sqrt
import numpy as np
import random
from scipy.stats import norm

# globals
N = 300 # sample size
N_PARTICLE = 300 # number of particles
ITERS = 5 # number of decreasing thresholds
M = 10 # number of words to remember
MEAN = 7 # prior mean of parameter
SD = 2 # prior sd of parameter

def model(param):
    recall_prob_all = 1/(1 + np.exp(M - param))
    recall_prob_one_item = np.exp(np.log(recall_prob_all) / float(M))
    return sum([1 if random.random() < recall_prob_one_item else 0 for item in range(M)])
## example
print("Output of model function: \n" + str(model(10)) + "\n")

# generate data from model
def generate(param):
    out = np.empty(N)
    for i in range(N):
        out[i] = model(param)
    return out
## example
print("Output of generate function: \n" + str(generate(10)) + "\n")

# distance function (sum of squared error)
def distance(obsData,simData):
    out = 0.0
    for i in range(len(obsData)):
        out += (obsData[i] - simData[i]) * (obsData[i] - simData[i])
    return out
## example
print("Output of distance function: \n" + str(distance([1,2,3],[4,5,6])) + "\n")

# sample new particles based on weights
def sample(particles, weights):
    return np.random.choice(particles, 1, p=weights)
## example
print("Output of sample function: \n" + str(sample([1,2,3],[0.1,0.1,0.8])) + "\n")

# perturbance function
def perturb(variance):
    return np.random.normal(0,sqrt(variance),1)[0]
## example
print("Output of perturb function: \n" + str(perturb(1)) + "\n")

# compute new weight
def computeWeight(prevWeights,prevParticles,prevVariance,currentParticle):
    denom = 0.0
    proposal = norm(currentParticle, sqrt(prevVariance))
    prior = norm(MEAN,SD)
    for i in range(len(prevParticles)):
        denom += prevWeights[i] * proposal.pdf(prevParticles[i])
    return prior.pdf(currentParticle)/denom
## example
prevWeights = [0.2,0.3,0.5]
prevParticles = [1,2,3]
prevVariance = 1
currentParticle = 2.5
print("Output of computeWeight function: \n" + str(computeWeight(prevWeights,prevParticles,prevVariance,currentParticle)) + "\n")

# normalize weights
def normalize(weights):
    return weights/np.sum(weights)
## example
print("Output of normalize function: \n" + str(normalize([3.,5.,9.])) + "\n")

# sampling from prior distribution
def rprior():
    return np.random.normal(MEAN,SD,1)[0]
## example
print("Output of rprior function: \n" + str(rprior()) + "\n")
# ABC using Population Monte Carlo sampling
def abc(obsData,eps):
    draw = 0
    Distance = 1e9
    variance = np.empty(ITERS)
    simData = np.empty(N)
    particles = np.empty([ITERS,N_PARTICLE])
    weights = np.empty([ITERS,N_PARTICLE])
    for t in range(ITERS):
        if t == 0:
            for i in range(N_PARTICLE):
                while(Distance > eps[t]):
                    draw = rprior()
                    simData = generate(draw)
                    Distance = distance(obsData,simData)
                Distance = 1e9
                particles[t][i] = draw
                weights[t][i] = 1./N_PARTICLE
            variance[t] = 2 * np.var(particles[t])
            continue
        for i in range(N_PARTICLE):
            while(Distance > eps[t]):
                draw = sample(particles[t-1],weights[t-1])
                draw += perturb(variance[t-1])
                simData = generate(draw)
                Distance = distance(obsData,simData)
            Distance = 1e9
            particles[t][i] = draw
            weights[t][i] = computeWeight(weights[t-1],particles[t-1],variance[t-1],particles[t][i])
        weights[t] = normalize(weights[t])
        variance[t] = 2 * np.var(particles[t])
    return particles[ITERS-1]

true_param = 9
obsData = generate(true_param)
eps = [15000,10000,8000,6000,3000]
posterior = abc(obsData,eps)
#print posterior
I stumbled upon this question as I was looking for pythonic implementations of PMC algorithms, since, quite coincidentally, I'm currently in the process of applying the techniques in this exact paper to my own research.
Can you post the results you're getting? My guess is that 1) you're using a poor choice of distance function (and/or similarity thresholds), or 2) you're not using enough particles. I may be wrong here (I'm not very well-versed in sample statistics), but your distance function implicitly suggests to me that the ordering of your random draws matters. I'd have to think about this more to determine whether it actually has any effect on the convergence properties (it may not), but why don't you simply use the mean or median as your sample statistic?
I ran your code with 1000 particles and a true parameter value of 8, while using the absolute difference between sample means as my distance function, for three iterations with epsilons of [0.5, 0.3, 0.1]; the peak of my estimated posterior distribution seems to be approaching 8 on each iteration, just as it should, alongside a reduction in the population variance. Note that there is still a noticeable rightward bias, but this is because of the asymmetry of your model (parameter values of 8 or less can never result in more than 8 observed successes, while all parameter values greater than 8 can, leading to a rightward skew in the distribution).
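The distance function described above can be sketched as follows. The name `mean_distance` is hypothetical, and `sse_distance` restates the sum-of-squared-errors distance from the question only for comparison. The example shows why the element-wise distance depends on the order of the draws while the distance between sample means does not.

```python
def mean_distance(obs_data, sim_data):
    """Absolute difference between the two sample means."""
    obs_mean = sum(obs_data) / float(len(obs_data))
    sim_mean = sum(sim_data) / float(len(sim_data))
    return abs(obs_mean - sim_mean)

def sse_distance(obs_data, sim_data):
    """Element-wise sum of squared errors, as in the question."""
    return sum((o - s) ** 2 for o, s in zip(obs_data, sim_data))

if __name__ == "__main__":
    obs = [1, 2, 3]
    shuffled = [3, 2, 1]  # same values in a different order
    print(sse_distance(obs, shuffled))   # nonzero although the samples agree
    print(mean_distance(obs, shuffled))  # zero, since the order is ignored
```

Any summary statistic that ignores ordering (mean, median, or a sorted comparison) would have the same property; the mean is simply the choice reported in the answer.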
Here's the plot of my results:

Exercise on calculating and plotting cumulated empirical distribution

I was trying to finish an exercise in John Stachurski's book (a textbook devoted to teaching economists how to use Python). One of the exercises is about how to calculate and plot the cumulative empirical distribution. The book provides a class called ECDF to calculate the empirical distribution function:
# Filename: ecdf.py
# Author: John Stachurski
# Date: December 2008
# Corresponds to: Listing 6.3
class ECDF:
    def __init__(self, observations):
        self.observations = observations

    def __call__(self, x):
        counter = 0.0
        for obs in self.observations:
            if obs <= x:
                counter += 1
        return counter / len(self.observations)
And the exercise reads
【Exercise 6.1.12】 Add a method to the ECDF class that uses Matplotlib to plot the empirical distribution over a specified interval. Replicate the four graphs in figure 6.3 (modulo randomness).
The figure that needs to be replicated and an illustration of the algorithm were included as images.
The following is my initial attempt
from ecdf import ECDF
import numpy as np
import matplotlib.pyplot as plt
from srs import SRS
from math import sqrt
from random import lognormvariate

# =========================
# parameters and arguments
# =========================
alpha, sigma2, s, delta = 0.3, 0.2, 0.5, 0.1
# numbers of draws
n = 1000
# length of each markov chain
t = 20
num_simu = [4,25,100,5000]
# Define F(k, z) = s k^alpha z + (1 - delta) k
F = lambda k, z: s * (k**alpha) * z + (1 - delta) * k
lognorm = lambda: lognormvariate(0, sqrt(sigma2))

# =====================
# create empirical distribution
# =====================
# different draw numbers
k = np.linspace(0,25,500)
for n in num_simu:
    for x in range(n):
        # list used to store capital stock (kt) in the last periods (t=20)
        kt = []
        solow_srs = SRS(F=F, phi=lognorm, X=1.0)
        px = solow_srs.sample_path(t)
        kt.append(px[-1])
    # generate the empirical distribution function
    F = ECDF(kt)
    prob_kt_n = [F(i) for i in k] # need to determine range
    # n refers to the n-th draw

# ==================================
# use for-loop to create subplots
# ==================================
#k = np.linspace(0,25,500)
#num_rows,num_cols = 2,2
The difficulties for me are: 1) how to store the list/array of empirical distribution results for the different numbers of draws in the given graph, and 2) how to create subplots using a for-loop. I also encountered some other small errors.
Thank you for your suggestions.
About (1), my advice is to create a dictionary (i.e. something like d = {} and then d[n] = ECDF(data) for each number n of observations).
Dunno about (2).
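Both points can be sketched together: a dictionary keyed by the number of draws, as suggested for (1), and one possible approach to (2) that loops over the axes returned by `plt.subplots`. This is only a sketch under stated assumptions: the helper names `build_ecdfs` and `plot_ecdfs` are hypothetical, the ECDF class is restated so the block is self-contained, and the data source is passed in as a generic `draw` function because the SRS class from the book is not shown here.

```python
from random import Random

class ECDF:
    """Empirical distribution function, restated from the listing above."""
    def __init__(self, observations):
        self.observations = observations

    def __call__(self, x):
        count = sum(1 for obs in self.observations if obs <= x)
        return count / float(len(self.observations))

def build_ecdfs(sample_sizes, draw):
    """Map each sample size n to an ECDF built from n calls to draw()."""
    return {n: ECDF([draw() for _ in range(n)]) for n in sample_sizes}

def plot_ecdfs(ecdfs, grid, num_rows=2, num_cols=2):
    """Plot one ECDF per subplot; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt
    fig, axes = plt.subplots(num_rows, num_cols, squeeze=False)
    # axes.flat walks the grid of subplots row by row
    for ax, n in zip(axes.flat, sorted(ecdfs)):
        ax.plot(grid, [ecdfs[n](x) for x in grid])
        ax.set_title("n = %d" % n)
    plt.show()
```

As a usage example with stand-in data, `rng = Random(0)` followed by `ecdfs = build_ecdfs([4, 25, 100, 5000], lambda: rng.lognormvariate(0, 0.2 ** 0.5))` and `plot_ecdfs(ecdfs, [25.0 * i / 499 for i in range(500)])` would draw four panels. In the exercise itself, `draw` would instead return the last element of a sample path from the Solow model.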
