I am trying to implement the Population Monte Carlo algorithm as described in this paper (see page 78 Fig.3) for a simple model (see function model()) with one parameter using Python. Unfortunately, the algorithm doesn't work and I can't figure out what's wrong. See my implementation below. The actual function is called abc(). All other functions can be seen as helper-functions and seem to work fine.
To check whether the algorithm workds, I first generate observed data with the only parameter of the model set to param = 8. Therefore, the posterior resulting from the ABC algorithm should be centered around 8. This is not the case and I'm wondering why.
I would appreciate any help or comments.
# imports
from math import exp
from math import log
from math import sqrt
import numpy as np
import random
from scipy.stats import norm
# globals
N = 300 # sample size
N_PARTICLE = 300 # number of particles
ITERS = 5 # number of decreasing thresholds
M = 10 # number of words to remember
MEAN = 7 # prior mean of parameter
SD = 2 # prior sd of parameter
def model(param):
recall_prob_all = 1/(1 + np.exp(M - param))
recall_prob_one_item = np.exp(np.log(recall_prob_all) / float(M))
return sum([1 if random.random() < recall_prob_one_item else 0 for item in range(M)])
## example
print "Output of model function: \n" + str(model(10)) + "\n"
# generate data from model
def generate(param):
out = np.empty(N)
for i in range(N):
out[i] = model(param)
return out
## example
print "Output of generate function: \n" + str(generate(10)) + "\n"
# distance function (sum of squared error)
def distance(obsData,simData):
out = 0.0
for i in range(len(obsData)):
out += (obsData[i] - simData[i]) * (obsData[i] - simData[i])
return out
## example
print "Output of distance function: \n" + str(distance([1,2,3],[4,5,6])) + "\n"
# sample new particles based on weights
def sample(particles, weights):
return np.random.choice(particles, 1, p=weights)
## example
print "Output of sample function: \n" + str(sample([1,2,3],[0.1,0.1,0.8])) + "\n"
# perturbance function
def perturb(variance):
return np.random.normal(0,sqrt(variance),1)[0]
## example
print "Output of perturb function: \n" + str(perturb(1)) + "\n"
# compute new weight
def computeWeight(prevWeights,prevParticles,prevVariance,currentParticle):
denom = 0.0
proposal = norm(currentParticle, sqrt(prevVariance))
prior = norm(MEAN,SD)
for i in range(len(prevParticles)):
denom += prevWeights[i] * proposal.pdf(prevParticles[i])
return prior.pdf(currentParticle)/denom
## example
prevWeights = [0.2,0.3,0.5]
prevParticles = [1,2,3]
prevVariance = 1
currentParticle = 2.5
print "Output of computeWeight function: \n" + str(computeWeight(prevWeights,prevParticles,prevVariance,currentParticle)) + "\n"
# normalize weights
def normalize(weights):
return weights/np.sum(weights)
## example
print "Output of normalize function: \n" + str(normalize([3.,5.,9.])) + "\n"
# sampling from prior distribution
def rprior():
return np.random.normal(MEAN,SD,1)[0]
## example
print "Output of rprior function: \n" + str(rprior()) + "\n"
# ABC using Population Monte Carlo sampling
def abc(obsData,eps):
draw = 0
Distance = 1e9
variance = np.empty(ITERS)
simData = np.empty(N)
particles = np.empty([ITERS,N_PARTICLE])
weights = np.empty([ITERS,N_PARTICLE])
for t in range(ITERS):
if t == 0:
for i in range(N_PARTICLE):
while(Distance > eps[t]):
draw = rprior()
simData = generate(draw)
Distance = distance(obsData,simData)
Distance = 1e9
particles[t][i] = draw
weights[t][i] = 1./N_PARTICLE
variance[t] = 2 * np.var(particles[t])
continue
for i in range(N_PARTICLE):
while(Distance > eps[t]):
draw = sample(particles[t-1],weights[t-1])
draw += perturb(variance[t-1])
simData = generate(draw)
Distance = distance(obsData,simData)
Distance = 1e9
particles[t][i] = draw
weights[t][i] = computeWeight(weights[t-1],particles[t-1],variance[t-1],particles[t][i])
weights[t] = normalize(weights[t])
variance[t] = 2 * np.var(particles[t])
return particles[ITERS-1]
true_param = 9
obsData = generate(true_param)
eps = [15000,10000,8000,6000,3000]
posterior = abc(obsData,eps)
#print posterior
I stumbled upon this question as I was looking for pythonic implementations of PMC algorithms, since, quite coincidentally, I'm currently in the process of applying the techniques in this exact paper to my own research.
Can you post the results you're getting? My guess is that 1) you're using a poor choice of distance function (and/or similarity thresholds), or 2) you're not using enough particles. I may be wrong here (I'm not very well-versed in sample statistics), but your distance function implicitly suggests to me that the ordering of your random draws matters. I'd have to think about this more to determine whether it actually has any effect on the convergence properties (it may not), but why don't you simply use the mean or median as your sample statistic?
I ran your code with 1000 particles and a true parameter value of 8, while using the absolute difference between sample means as my distance function, for three iterations with epsilons of [0.5, 0.3, 0.1]; the peak of my estimated posterior distribution seems to be approaching 8 just like it should on each iteration, alongside a reduction in the population variance. Note that there is still a noticeable rightward bias, but this is because of the asymmetry of your model (parameter values of 8 or less can never result in more than 8 observed successes, while all parameters values greater than 8 can, leading to a rightward skewedness in the distribution).
Here's the plot of my results:
Related
I'm implementing a Markov Chain Monte Carlo with both Metropolis and Barker's α's for numerical integration. I've created a class called MCMCIntegrator(). Below the __init__() method, are the g(x) the PDF of the function I'm integrating and alpha method, implementing the Metropolis and Barker α's.
import numpy as np
import scipy.stats as st
class MCMCIntegrator:
def __init__(self):
self.size = 1000
self.std = 0.6
self.real_int = 0.06496359
self.sample = None
#staticmethod
def g(x):
return st.gamma.pdf(x, 1, scale=1.378008857)*np.abs(np.cos(1.10257704))
def alpha(self, a, b, method):
if method:
return min(1, self.g(b) / self.g(a))
else:
return self.g(b) / (self.g(a) + self.g(b))
The size is the size of the sample that the class must generate, std is the standard deviation of the normal kernel, which you will see in a few seconds. The real_int is the value of the integral from 1 to 2 of the function we're integrating. I've generated it with a R script. Now, to the problem.
def _chain(self, method):
"""
Markov chain heat-up with burn-in
:param method: Metropolis or Barker alpha
:return: np.array containing the sample
"""
old = 0
sample = np.zeros(int(self.size * 1.3))
i = 0
while i != len(sample):
new = np.random.normal(loc=old, scale=self.std)
new = abs(new)
al = self.alpha(old, new, method=method)
u = np.random.uniform()
if al > u:
sample[i] = new
i += 1
old = new
return np.array(sample)
Below this method is an integrate() method that calculates the proportion of numbers in the [1, 2] interval:
def integrate(self, method=None):
"""
Integration step
"""
sample = self._chain(method=method)
# discarding 30% of the sample for the burn-in
ind = int(len(sample)*0.3)
sample = sample[ind:]
setattr(self, "sample", sample)
sample = [1 if 1 < v < 2 else 0 for v in sample]
return np.mean(sample)
this is the main function:
def main():
print("-- RESULTS --".center(20), end='\n')
mcmc = MCMCIntegrator()
print(f"\t{mcmc.integrate()}", end='\n')
print(f"\t{np.abs(mcmc.integrate() - mcmc.real_int) / mcmc.real_int}")
if __name__ == "__main__":
main()
I'm stuck in an infinite while loop, with no idea why this is happening.
While I have no prior exposure to Python and no direct explanation for the infinite loop, here are some problematic issues in the code:
The
while i != len(sample):
loop increments the value i only when the Uniform variate is below the acceptance probability if al > u: This is not how Metropolis-Hastings operates. If the Uniform variate is above the acceptance probability, the current value of the chain is duplicated. but this does not explain for the infinite loop since a proposed value should eventually be accepted.
If the target density is
st.gamma.pdf(x, 1, scale=1.378008857)*np.abs(np.cos(1.10257704))
then (i) what is the point of the second and constant term np.abs(np.cos(1.10257704)) and (ii) where is the need for so many digits?
The proposal distribution
new = np.random.normal(loc=old, scale=self.std)
new = abs(new)
is a folded normal, which density is not symmetric. Hence it should appear in the Metropolis-Hastings probability but may have little impact since the scale is small.
Here is my R rendering of the Python code (edited and corrected)
self.size = 1e5
self.std = 0.6
self.real_int = 0.06496359
g <- function(x){dgamma(x, shape=1, scale=1.378)}
alpha <- function(a, b, method=1)ifelse(method,
min(1, r <- g(b) / g(a)), 1 / (1 + 1 / r))
old = 0
smple = rep(0,self.size * 1.3)
for (i in 1:length(smple)){
new = abs(old+self.std*rnorm(1))
al = alpha(old, new, 0)
old=smple[i]=ifelse(al > runif(1), new, old)
}
ind = trunc(length(smple)*0.3)
smple = sample[ind:length(smple)]
hist(smple,pro=TRUE,nclass=10*log2(self.size),col="wheat")
curve(g(x),add=TRUE,lwd=2,col="sienna")
clearly reproducing the Gamma target:
without the correction for the non-symmetric proposal. The correction would be
q <- function(a, b)
dnorm(b-a,sd=self.std)+dnorm(-b-a,sd=self.std)
alpha <- function(a, b, method=1){
return(ifelse(method,
min(1, r <- g(b) * q(b,a) / g(a) / q(a,b)),
1 / (1 + 1/r)))}
old = 0
smple = rep(0,self.size * 1.3)
for (i in 1:length(smple)){
new = abs(old+self.std*rnorm(1))
al = alpha(old, new, 3)
old=smple[i]=ifelse(al > runif(1), new, old)
}
and makes no difference in the above picture. (The acceptance rate for the Metropolis ratio is 85%, while for Baker's, it is 48%.)
So, for a Monte-Carlo class, I created a uniform random number generator to simulate a normal distribution and use it for MC option pricing, and I'm going terribly wrong somewhere. It uses a simple linear congruential generator (lcg) which generates a random vector that is fed into an numerical approximation of a inverse normal distribution (beasley-springer-morrow algorithm) to generate standard normally distributed values (elaboration on the exact process can be found here).
This is my code so far.
Rng:
def lcrm_generator(num, a, seed, c):
mod = 4294967296 # 2^32
val = int(seed) #int() is to ensure seeds with a leading zero still get accepted by the program
rng = []
for i in range(num):
val = (a * val + c) % mod
rng.append(val/(mod-1))
return rng
Inverse normal approximator:
def bsm_algorithm(u):
# These are my necessary initial constants
a0 = 2.50662823884; a1 = -18.61500062529; a2 = 41.39119773534; a3 = -25.44106049637;
b0 = -8.47351093090; b1 = 23.08336743743; b2 = -21.06224101826; b3 = 3.13082909833;
c0 = 0.3374754822726147; c1 = 0.9761690190917186; c2 = 0.1607979714918209; c3 = 0.0276438810333863;
c4 = 0.0038405729373609; c5 = 0.0003951896511919; c6 = 0.0000321767881768; c7 = 0.0000002888167364;
c8 = 0.0000003960315187;
x = [0]*len(u)
for i in range(len(u)):
y = u[i] - 0.5
if abs(y) < 0.42:
r = y**2
x[i] = y*(((a3*r+a2)*r+a1)*r+a0)/((((b3*r+b2)*r+b1)*r+b0)*r+1)
else:
r = u[i]
if y > 0:
r = 1 - u[i]
r = log(-log(r))
x[i] = c0+r*(c1+r*(c2+r*(c3+r*(c4+r*(c5+r*(c6+r*(c7+r*c8)))))))
if y < 0:
x[i] = -x[i]
return x
Combining these two with the following and drawing the histogram shows the data looks correctly normal,
a=lcrm_generator(100000,301,"0",21)
b = bsm_algorithm(a)
plt.hist(b, bins=100)
plt.show()
And option pricing function:
def LookbackOP(S,K,r,sigma,intervals,sims,Call_Put=1):
## My objects that will determine the option prices.
path = [0]*intervals
values = [0]*sims
## Objects to hold the random nums used for simulation.
randuni = [0]*sims
randnorm = [0]*sims
for i in range(sims):
randuni[i] = lcrm_generator(intervals,301,i,21)
randnorm[i] = bsm_algorithm(randuni[i])
# Generating the simulation one by one.
for i in range(sims):
path[0] = 1
## Below is to generate one whole simulated path.
################## MY INCORRECT WAY ##################
for j in range(1,intervals):
path[j] = path[j-1]*exp((r - .5*sigma**2)*(1/intervals) + sqrt(1/intervals)*randnorm[i][j])
################## CORRECT BUILT-IN WAY ##################
# path[j] = path[j-1]*exp((r - .5*sigma**2)*(1/intervals) + sqrt(1/intervals)*np.random.normal(0,1))
## For each separate simulation, price the option either as a call or a put.
if Call_Put == 1:
values[i] = max(S*max(path)-K,0)
elif Call_Put == 0:
values[i] = max(K-S*min(path),0)
else:
print("Error: You inputted something other than '1 = Call', or '0 = Put'")
plt.hist(values,bins=30)
plt.show()
## To get expected return of option by takeing their avg.
option_value = np.mean(values)
print(option_value)
return option_value
In the last block of code, my error is pointed out, which seemingly can be fixed by simply using numpy's normal rng. Using one versus the other produces drastically different answers, and I'm tempted to trust numpy over myself, but my code looks normal, so where am I going wrong.
First,
a=lcrm_generator(100000,301,"0",21)
this looks weird - why do you need a character for seed? Anyway, good parameters are in the table here: https://en.wikipedia.org/wiki/Linear_congruential_generator. But problem is not LCG, I believe, but there might be systemic difference in your gaussian.
I ran code
from scipy.stats import norm
q = lcrm_generator(100000, 301, "0", 21)
g = bsm(q)
param = norm.fit(g)
print(param)
for 100 000, 1 000 000 and 10 000 000 samples, and my output is
(0.0009370998731855792, 0.9982155665317061)
(-0.0006429485100838258, 0.9996875045073682)
(-0.0007464875819397183, 1.0002711142625116)
and there is no improvement between 1 000 000 and 10 000 000 samples. Basically, you use some approximation for gaussian inverse function, and those are just artifacts of approximation, nothing to be done about it.
Numpy, I believe, is using one of the precise normal sampling methods (Box-Muller, I think)
I have an stochastic differential equation (SDE) that I am trying to solve using Milsteins method but am getting results that disagree with experiment.
The SDE is
which I have broken up into 2 first order equations:
eq1:
eq2:
Then I have used the Ito form:
So that for eq1:
and for eq2:
My python code used to attempt to solve this is like so:
# set constants from real data
Gamma0 = 4000 # defines enviromental damping
Omega0 = 75e3*2*np.pi # defines the angular frequency of the motion
eta = 0 # set eta 0 => no effect from non-linear p*q**2 term
T_0 = 300 # temperature of enviroment
k_b = scipy.constants.Boltzmann
m = 3.1e-19 # mass of oscillator
# set a and b functions for these 2 equations
def a_p(t, p, q):
return -(Gamma0 - Omega0*eta*q**2)*p
def b_p(t, p, q):
return np.sqrt(2*Gamma0*k_b*T_0/m)
def a_q(t, p, q):
return p
# generate time data
dt = 10e-11
tArray = np.arange(0, 200e-6, dt)
# initialise q and p arrays and set initial conditions to 0, 0
q0 = 0
p0 = 0
q = np.zeros_like(tArray)
p = np.zeros_like(tArray)
q[0] = q0
p[0] = p0
# generate normally distributed random numbers
dwArray = np.random.normal(0, np.sqrt(dt), len(tArray)) # independent and identically distributed normal random variables with expected value 0 and variance dt
# iterate through implementing Milstein's method (technically Euler-Maruyama since b' = 0
for n, t in enumerate(tArray[:-1]):
dw = dwArray[n]
p[n+1] = p[n] + a_p(t, p[n], q[n])*dt + b_p(t, p[n], q[n])*dw + 0
q[n+1] = q[n] + a_q(t, p[n], q[n])*dt + 0
Where in this case p is velocity and q is position.
I then get the following plots of q and p:
I expected the resulting plot of position to look something like the following, which I get from experimental data (from which the constants used in the model are determined):
Have I implemented Milstein's method correctly?
If I have, what else might be wrong my process of solving the SDE that'd causing this disagreement with the experiment?
You missed a term in the drift coefficient, note that to the right of dp there are two dt terms. Thus
def a_p(t, p, q):
return -(Gamma0 - Omega0*eta*q**2)*p - Omega0**2*q
which is actually the part that makes the oscillator into an oscillator. With that corrected the solution looks like
And no, you did not implement the Milstein method as there are no derivatives of b_p which are what distinguishes Milstein from Euler-Maruyama, the missing term is +0.5*b'(X)*b(X)*(dW**2-dt).
There is also a derivative-free version of Milsteins method as a two-stage kind-of Runge-Kutta method, documented in wikipedia or the original in arxiv.org (PDF).
The step there is (vector based, duplicate into X=[p,q], K1=[k1_p,k1_q] etc. to be close to your conventions)
S = random_choice_of ([-1,1])
K1 = a(X )*dt + b(X )*(dW - S*sqrt(dt))
Xh = X + K1
K2 = a(Xh)*dt + b(Xh)*(dW + S*sqrt(dt))
X = X + 0.5 * (K1+K2)
This is my second attempt at implementing gradient descent in one variable and it always diverges. Any ideas?
This is simple linear regression for minimizing the residual sum of squares in one variable.
def gradient_descent_wtf(xvalues, yvalues):
tolerance = 0.1
#y=mx+b
#some line to predict y values from x values
m=1.
b=1.
#a predicted y-value has value mx + b
for i in range(0,10):
#calculate y-value predictions for all x-values
predicted_yvalues = list()
for x in xvalues:
predicted_yvalues.append(m*x + b)
# predicted_yvalues holds the predicted y-values
#now calculate the residuals = y-value - predicted y-value for each point
residuals = list()
number_of_points = len(yvalues)
for n in range(0,number_of_points):
residuals.append(yvalues[n] - predicted_yvalues[n])
## calculate the residual sum of squares from the residuals, that is,
## square each residual and add them all up. we will try to minimize
## the residual sum of squares later.
residual_sum_of_squares = 0.
for r in residuals:
residual_sum_of_squares += r**2
print("RSS = %s" % residual_sum_of_squares)
##
##
##
#now make a version of the residuals which is multiplied by the x-values
residuals_times_xvalues = list()
for n in range(0,number_of_points):
residuals_times_xvalues.append(residuals[n] * xvalues[n])
#now create the sums for the residuals and for the residuals times the x-values
residuals_sum = sum(residuals)
residuals_times_xvalues_sum = sum(residuals_times_xvalues)
# now multiply the sums by a positive scalar and add each to m and b.
residuals_sum *= 0.1
residuals_times_xvalues_sum *= 0.1
b += residuals_sum
m += residuals_times_xvalues_sum
#and repeat until convergence.
#convergence occurs when ||sum vector|| < some tolerance.
# ||sum vector|| = sqrt( residuals_sum**2 + residuals_times_xvalues_sum**2 )
#check for convergence
magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
if magnitude_of_sum_vector < tolerance:
break
return (b, m)
Result:
gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 300170125.7
RSS = 4.86943013045e+11
RSS = 7.90447409339e+14
RSS = 1.28312217794e+18
RSS = 2.08287421094e+21
RSS = 3.38110045417e+24
RSS = 5.48849288217e+27
RSS = 8.90939341376e+30
RSS = 1.44624932026e+34
Out[108]:
(-3.475524066284303e+16, -2.4195981188763203e+17)
The gradients are huge -- hence you are following large vectors for long distances (0.1 times a large number is large). Find unit vectors in the appropriate direction. Something like this (with comprehensions replacing your loops):
def gradient_descent_wtf(xvalues, yvalues):
tolerance = 0.1
m=1.
b=1.
for i in range(0,10):
predicted_yvalues = [m*x+b for x in xvalues]
residuals = [y-y_hat for y,y_hat in zip(yvalues,predicted_yvalues)]
residual_sum_of_squares = sum(r**2 for r in residuals) #only needed for debugging purposes
print("RSS = %s" % residual_sum_of_squares)
residuals_times_xvalues = [r*x for r,x in zip(residuals,xvalues)]
residuals_sum = sum(residuals)
residuals_times_xvalues_sum = sum(residuals_times_xvalues)
# (residuals_sum,residual_times_xvalues_sum) is a vector which points in the negative
# gradient direction. *Find a unit vector which points in same direction*
magnitude = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
residuals_sum /= magnitude
residuals_times_xvalues_sum /= magnitude
b += residuals_sum * (0.1)
m += residuals_times_xvalues_sum * (0.1)
#check for convergence -- this needs work!
magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
if magnitude_of_sum_vector < tolerance:
break
return (b, m)
For example:
>>> gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 368732.1655050716
RSS = 367039.18363896786
RSS = 365354.0543519137
RSS = 363676.7775934381
RSS = 362007.3533123621
RSS = 360345.7814567845
RSS = 358692.061974069
RSS = 357046.1948108295
RSS = 355408.17991291644
(1.1157111313023558, 1.9932828425473605)
which is certainly much more plausible.
It isn't a trivial matter to make a numerically stable gradient-descent algorithm. You might want to consult a decent textbook in numerical analysis.
First, Your code is right.
But you should consider something about math when you do linear regression.
For example, the residual is -205.8 and your learning rate is 0.1 so you will get a huge descent step -25.8.
It's a so large step that you can't go back to the correct m and b. You have to make your step small enough.
There are two ways to make gradient descent step reasonable:
initialize a small learning rate, such as 0.001 and 0.0003.
Divide your step by the total amount of your input values.
The numpy.linalg.lstsq(a,b) function accepts an array a with size nx2 and a 1-dimensional array b which is the dependent variable.
How would I go about doing a least squares regression where the data points are presented as a 2d array generated from an image file? The array looks something like this:
[[0, 0, 0, 0, e]
[0, 0, c, d, 0]
[b, a, f, 0, 0]]
where a, b, c, d, e, f are positive integer values.
I want to fit a line to these points. Can I use np.linalg.lstsq (and if so, how) or is there something which may make more sense (and if so, how)?
Thanks very much.
once a while I saw a similar python program from
# Prac 2 for Monte Carlo methods in a nutshell
# Richard Chopping, ANU RSES and Geoscience Australia, October 2012
# Useage
# python prac_q2.py [number of bootstrap runs]
# e.g. python prac_q2.py 10000
# would execute this and perform 10 000 bootstrap runs.
# Default is 100 runs.
# sys cause I need to access the arguments the script was called with
import sys
# math cause it's handy for scalar maths
import math
# time cause I want to benchmark how long things take
import time
# numpy cause it gives us awesome array / matrix manipulation stuff
import numpy
# scipy just in case
import scipy
# scipy.stats to make life simpler statistcally speaking
import scipy.stats as stats
def main():
print "Prac 2 solution: no graphs"
true_model = numpy.array([17.0, 10.0, 1.96])
# Here's a nifty way to write out numpy arrays.
# Unlike the data table in the prac handouts, I've got time first
# and height second.
# You can mix up the order but you need to change a lot of calculations
# to deal with this change.
data = numpy.array([[1.0, 26.94],
[2.0, 33.45],
[3.0, 40.72],
[4.0, 42.32],
[5.0, 44.30],
[6.0, 47.19],
[7.0, 43.33],
[8.0, 40.13]])
# Perform the least squares regression to find the best fit solution
best_fit = regression(data)
# Nifty way to get out elements from an array
m1,m2,m3 = best_fit
print "Best fit solution:"
print "m1 is", m1, "and m2 is", m2, "and m3 is", m3
# Calculate residuals from the best fit solution
best_fit_resid = residuals(data, best_fit)
print "The residuals from the best fit solution are:"
print best_fit_resid
print ""
# Bootstrap part
# --------------
# Number of bootstraps to run. 100 is a minimum and our default number.
num_booties = 100
# If we have an argument to the python script, use this as the
# number of bootstrap runs
if len(sys.argv) > 1:
num_booties = int(sys.argv[1])
# preallocate an array to store the results.
ensemble = numpy.zeros((num_booties, 3))
print "Starting up the bootstrap routine"
# How to do timing within a Python script - here I start a stopwatch running
start_time = time.clock()
for index in range(num_booties):
# Print every 10 % so we know where we're up to in long runs
if print_progress(index, num_booties):
percent = (float(index) / float(num_booties)) * 100.0
print "Have completed", percent, "percent"
# For each iteration of the bootstrap algorithm,
# first calculate mixed up residuals...
resamp_resid = resamp_with_replace(best_fit_resid)
# ... then generate new data...
new_data = calc_new_data(data, best_fit, resamp_resid)
# ... then perform another regression to generate a new set of m1, m2, m3
bootstrap_model = regression(new_data)
ensemble[index] = (bootstrap_model[0], bootstrap_model[1], bootstrap_model[2])
# Done with the loop
# Calculate the time the run took - what's the current time, minus when we started.
loop_time = time.clock() - start_time
print ""
print "Ensemble calculated based on", num_booties, "bootstrap runs."
print "Bootstrap runs took", loop_time, "seconds."
print ""
# Stats on the ensemble time
# --------------------------
B = num_booties
# Mean is pretty simple, 1.0/B to force it to use floating points
# This gives us an array of the means of the 3 model parameters
mean = 1.0/B * numpy.sum(ensemble, axis=0)
print "Mean is ([m1 m2 m3]):", mean
# Variance
var2 = 1.0/B * numpy.sum(((ensemble - mean)**2), axis=0)
print "Variance squared is ([m1 m2 m3]):", var2
# Bias
bias = mean - best_fit
print "Bias is ([m1 m2 m3]):", bias
bias_corr = best_fit - bias
print "Bias corrected solution is ([m1 m2 m3]):", bias_corr
print "The original solution was ([m1 m2 m3]):", best_fit
print "And the true solution is ([m1 m2 m3]):", true_model
print ""
# Confidence intervals
# ---------------------
# Sort column 1 to calculate confidence intervals
# Sorting in numpy sucks.
# Need to declare what the fields are (so it knows how to sort it)
# f8 => numpy's floating point number
# Then need to delcare what we sort it on
# Here we sort on the first column, then the second, then the third.
# f0,f1,f2 field 0, then field 1, then field 2.
# Then we make sure we sort it by column (axis = 0)
# Then we take a view of that data as a float64 so it works properly
sorted_m1 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f0','f1','f2'], axis=0).view(numpy.float64)
# stats is my name for scipy.stats
# This has a wonderful function that calculates percentiles, including performing interpolation
# (important for low numbers of bootstrap runs)
m1_perc0p5 = stats.scoreatpercentile(sorted_m1,0.5)[0]
m1_perc2p5 = stats.scoreatpercentile(sorted_m1,2.5)[0]
m1_perc16 = stats.scoreatpercentile(sorted_m1,16)[0]
m1_perc84 = stats.scoreatpercentile(sorted_m1,84)[0]
m1_perc97p5 = stats.scoreatpercentile(sorted_m1,97.5)[0]
m1_perc99p5 = stats.scoreatpercentile(sorted_m1,99.5)[0]
print "m1 68% confidence interval is from", m1_perc16, "to", m1_perc84
print "m1 95% confidence interval is from", m1_perc2p5, "to", m1_perc97p5
print "m1 99% confidence interval is from", m1_perc0p5, "to", m1_perc99p5
print ""
# Now column 2, sort it...
sorted_m2 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f1','f0','f2'], axis=0).view(numpy.float64)
# ... and do stats.
m2_perc0p5 = stats.scoreatpercentile(sorted_m2,0.5)[1]
m2_perc2p5 = stats.scoreatpercentile(sorted_m2,2.5)[1]
m2_perc16 = stats.scoreatpercentile(sorted_m2,16)[1]
m2_perc84 = stats.scoreatpercentile(sorted_m2,84)[1]
m2_perc97p5 = stats.scoreatpercentile(sorted_m2,97.5)[1]
m2_perc99p5 = stats.scoreatpercentile(sorted_m2,99.5)[1]
print "m2 68% confidence interval is from", m2_perc16, "to", m2_perc84
print "m2 95% confidence interval is from", m2_perc2p5, "to", m2_perc97p5
print "m2 99% confidence interval is from", m2_perc0p5, "to", m2_perc99p5
print ""
# and finally column 3, again, sort it..
sorted_m3 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f2','f1','f0'], axis=0).view(numpy.float64)
# ... and do stats.
m3_perc0p5 = stats.scoreatpercentile(sorted_m3,0.5)[1]
m3_perc2p5 = stats.scoreatpercentile(sorted_m3,2.5)[1]
m3_perc16 = stats.scoreatpercentile(sorted_m3,16)[1]
m3_perc84 = stats.scoreatpercentile(sorted_m3,84)[1]
m3_perc97p5 = stats.scoreatpercentile(sorted_m3,97.5)[1]
m3_perc99p5 = stats.scoreatpercentile(sorted_m3,99.5)[1]
print "m3 68% confidence interval is from", m3_perc16, "to", m3_perc84
print "m3 95% confidence interval is from", m3_perc2p5, "to", m3_perc97p5
print "m3 99% confidence interval is from", m3_perc0p5, "to", m3_perc99p5
print ""
# End of the main function
#
#
# Helper functions go down here
#
#
# regression
# This takes a 2D numpy array and performs a least-squares regression
# using the formula on the practical sheet, page 3
# Stored in the top are the real values
# Returns an array of m1, m2 and m3.
def regression(data):
# While testing, just return the real values
# real_values = numpy.array([17.0, 10.0, 1.96])
# Creating the G matrix
# ---------------------
# Because I'm using numpy arrays here, we need
# to learn some notation.
# data[:,0] is the FIRST column
# Length of this = number of time samples in data
N = len(data[:,0])
# numpy.sum adds up all data in a row or column.
# Axis = 0 implies add up each column. [0] at end
# returns the sum of the first column
# This is the sum of Ti for i = 1..N
sum_Ti = numpy.sum(data, axis=0)[0]
# numpy.power takes each element of an array and raises them to a given power
# In this one call we also take the sum of the columns (as above) after they have
# been squared, and then just take the t column
sum_Ti2 = numpy.sum(numpy.power(data, 2), axis=0)[0]
# Now we need to get the cube of Ti, then sum that result
sum_Ti3 = numpy.sum(numpy.power(data, 3), axis=0)[0]
# Finally we need the quartic of Ti, then sum that result
sum_Ti4 = numpy.sum(numpy.power(data, 4), axis=0)[0]
# Now we can construct the G matrix
G = numpy.array([[N, sum_Ti, -0.5 * sum_Ti2],
[sum_Ti, sum_Ti2, -0.5 * sum_Ti3],
[-0.5 * sum_Ti2, -0.5 * sum_Ti3, 0.25 * sum_Ti4]])
# We also need to take the inverse of the G matrix
G_inv = numpy.linalg.inv(G)
# Creating the d matrix
# ---------------------
# Hello numpy.sum, my old friend...
sum_Yi = numpy.sum(data, axis=0)[1]
# numpy.prod multiplies the values in an array.
# We need to do the products along axis 1 (i.e. row by row)
# Then sum all the elements
sum_TiYi = numpy.sum(numpy.prod(data, axis=1))
# The final element we need is a bit tricky.
# We need the product as above
TiYi = numpy.prod(data, axis=1)
# Then we get tricky. * works how we need it here,
# remember that the Ti column is referenced by data[:,0] as above
Ti2Yi = TiYi * data[:,0]
# Then we sum
sum_Ti2Yi = numpy.sum(Ti2Yi)
#With all the elements, we make the d matrix
d = numpy.array([sum_Yi,
sum_TiYi,
-0.5 * sum_Ti2Yi])
# Do the linear algebra stuff
# To multiple numpy arrays in a matrix style,
# we need to use numpy.dot()
# Not the most useful notation, but there you go.
# To help out the Matlab users: http://www.scipy.org/NumPy_for_Matlab_Users
result = G_inv.dot(d)
#Return this result
return result
# residuals:
# Takes in a data array, and an array of best fit paramers
# calculates the difference between the observed and predicted data
# and returns an array
def residuals(data, best_fit):
# Extract ti from the data array
ti = data[:,0]
# We also need an array of the square of ti
ti2 = numpy.power(ti, 2)
# Extract yi
yi = data[:,1]
# Calculate residual (data minus predicted)
result = yi - best_fit[0] - (best_fit[1] * ti) + (0.5 * best_fit[2] * ti2)
return result
# resamp_with_replace:
# Perform a dataset resampling with replacement on parameter set.
# Uses numpy.random to generate the random numbers to pick the indices to look up.
# So for item 0, ... N, we look up a random index from the set and put that in
# our resampled data.
def resamp_with_replace(set):
# How many things do we need to do this for?
N = len(set)
# Preallocate our result array
result = numpy.zeros(N)
# Generate N random integers between 0 and N-1
indices = numpy.random.randint(0, N - 1, N)
# For i from the set 0...N-1 (that's what the range() command gives us),
# our result for that i is given by the index we randomly generated above
for i in range(N):
result[i] = set[indices[i]]
return result
# calc_new_data:
# Given a set of resampled residuals, use the model parameters to derive
# new data. This is used for bootstrapping the residuals.
# true_data is a numpy array of rows of ti, yi. We only need the ti column though.
# model is an array of three parameters, corresponding to m1, m2, m3.
# residuals are an array of our resudials
def calc_new_data(true_data, model, residuals):
# Extract the time information from the new data array
ti = true_data[:,0]
# Calculate new data using array maths
# This goes through and does the sums etc for each element of the array
# Nice and compact way to represent it.
y_new = residuals + model[0] + (model[1] * ti) - (0.5 * model[2] * ti**2)
# Our result needs to be an array of ti, y_new, so we need to combine them using
# the numpy.column_stack routine
result = numpy.column_stack((ti, y_new))
# Return this combined array
return result
# print_progress:
# Just a quick thing that returns true if we want to print for this index
# and false otherwise
def print_progress(index, total):
index = float(index)
total = float(total)
result = False
# Floating point maths is irritating
# We want to print at the start, every 10%, and at the end.
# This works up to index = 100,000
# Would also be lovely if Python had a switch statement
if (((index / total) * 100) <= 0.00001):
result = True
elif (((index / total) * 100) >= 9.99999) and (((index / total) * 100) <= 10.00001):
result = True
elif (((index / total) * 100) >= 19.99999) and (((index / total) * 100) <= 20.00001):
result = True
elif (((index / total) * 100) >= 29.99999) and (((index / total) * 100) <= 30.00001):
result = True
elif (((index / total) * 100) >= 39.99999) and (((index / total) * 100) <= 40.00001):
result = True
elif (((index / total) * 100) >= 49.99999) and (((index / total) * 100) <= 50.00001):
result = True
elif (((index / total) * 100) >= 59.99999) and (((index / total) * 100) <= 60.00001):
result = True
elif (((index / total) * 100) >= 69.99999) and (((index / total) * 100) <= 70.00001):
result = True
elif (((index / total) * 100) >= 79.99999) and (((index / total) * 100) <= 80.00001):
result = True
elif (((index / total) * 100) >= 89.99999) and (((index / total) * 100) <= 90.00001):
result = True
elif ((((index+1) / total) * 100) > 99.99999):
result = True
else:
result = False
return result
#
#
# End of helper functions
#
#
# So we can easily execute our script
if __name__ == "__main__":
main()
I guess you can take a look, here is link to complete information
Use sklearn instead of numpy (sklearn is derived from numpy but much better for this kind of calculation) :
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit ([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,
normalize=False)
clf.coef_
array([ 0.5, 0.5])