Least squares regression on a 2D array - Python
The numpy.linalg.lstsq(a, b) function accepts a coefficient array a (n x 2 for a straight-line fit) and a 1-dimensional array b containing the dependent variable.
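For reference, a minimal sketch of that call pattern for a straight-line fit y = m*x + c (the x and y values here are made up):

import numpy as np

a = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0]])       # n x 2: a column of x values and a column of ones
b = np.array([2.1, 3.9, 6.2])    # dependent variable
slope, intercept = np.linalg.lstsq(a, b)[0]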
How would I go about doing a least squares regression where the data points are presented as a 2d array generated from an image file? The array looks something like this:
[[0, 0, 0, 0, e],
 [0, 0, c, d, 0],
 [b, a, f, 0, 0]]
where a, b, c, d, e, f are positive integer values.
I want to fit a line to these points. Can I use np.linalg.lstsq (and if so, how), or is there something that may make more sense (and if so, how)?
Thanks very much.
A while ago I saw a similar Python program:
# Prac 2 for Monte Carlo methods in a nutshell
# Richard Chopping, ANU RSES and Geoscience Australia, October 2012
# Usage
# python prac_q2.py [number of bootstrap runs]
# e.g. python prac_q2.py 10000
# would execute this and perform 10 000 bootstrap runs.
# Default is 100 runs.
# sys cause I need to access the arguments the script was called with
import sys
# math cause it's handy for scalar maths
import math
# time cause I want to benchmark how long things take
import time
# numpy cause it gives us awesome array / matrix manipulation stuff
import numpy
# scipy just in case
import scipy
# scipy.stats to make life simpler statistically speaking
import scipy.stats as stats
def main():
print "Prac 2 solution: no graphs"
true_model = numpy.array([17.0, 10.0, 1.96])
# Here's a nifty way to write out numpy arrays.
# Unlike the data table in the prac handouts, I've got time first
# and height second.
# You can mix up the order but you need to change a lot of calculations
# to deal with this change.
data = numpy.array([[1.0, 26.94],
[2.0, 33.45],
[3.0, 40.72],
[4.0, 42.32],
[5.0, 44.30],
[6.0, 47.19],
[7.0, 43.33],
[8.0, 40.13]])
# Perform the least squares regression to find the best fit solution
best_fit = regression(data)
# Nifty way to get out elements from an array
m1,m2,m3 = best_fit
print "Best fit solution:"
print "m1 is", m1, "and m2 is", m2, "and m3 is", m3
# Calculate residuals from the best fit solution
best_fit_resid = residuals(data, best_fit)
print "The residuals from the best fit solution are:"
print best_fit_resid
print ""
# Bootstrap part
# --------------
# Number of bootstraps to run. 100 is a minimum and our default number.
num_booties = 100
# If we have an argument to the python script, use this as the
# number of bootstrap runs
if len(sys.argv) > 1:
num_booties = int(sys.argv[1])
# preallocate an array to store the results.
ensemble = numpy.zeros((num_booties, 3))
print "Starting up the bootstrap routine"
# How to do timing within a Python script - here I start a stopwatch running
start_time = time.clock()
for index in range(num_booties):
# Print every 10 % so we know where we're up to in long runs
if print_progress(index, num_booties):
percent = (float(index) / float(num_booties)) * 100.0
print "Have completed", percent, "percent"
# For each iteration of the bootstrap algorithm,
# first calculate mixed up residuals...
resamp_resid = resamp_with_replace(best_fit_resid)
# ... then generate new data...
new_data = calc_new_data(data, best_fit, resamp_resid)
# ... then perform another regression to generate a new set of m1, m2, m3
bootstrap_model = regression(new_data)
ensemble[index] = (bootstrap_model[0], bootstrap_model[1], bootstrap_model[2])
# Done with the loop
# Calculate the time the run took - what's the current time, minus when we started.
loop_time = time.clock() - start_time
print ""
print "Ensemble calculated based on", num_booties, "bootstrap runs."
print "Bootstrap runs took", loop_time, "seconds."
print ""
# Stats on the ensemble time
# --------------------------
B = num_booties
# Mean is pretty simple, 1.0/B to force it to use floating points
# This gives us an array of the means of the 3 model parameters
mean = 1.0/B * numpy.sum(ensemble, axis=0)
print "Mean is ([m1 m2 m3]):", mean
# Variance
var2 = 1.0/B * numpy.sum(((ensemble - mean)**2), axis=0)
print "Variance squared is ([m1 m2 m3]):", var2
# Bias
bias = mean - best_fit
print "Bias is ([m1 m2 m3]):", bias
bias_corr = best_fit - bias
print "Bias corrected solution is ([m1 m2 m3]):", bias_corr
print "The original solution was ([m1 m2 m3]):", best_fit
print "And the true solution is ([m1 m2 m3]):", true_model
print ""
# Confidence intervals
# ---------------------
# Sort column 1 to calculate confidence intervals
# Sorting in numpy sucks.
# Need to declare what the fields are (so it knows how to sort it)
# f8 => numpy's floating point number
# Then need to delcare what we sort it on
# Here we sort on the first column, then the second, then the third.
# f0,f1,f2 field 0, then field 1, then field 2.
# Then we make sure we sort it by column (axis = 0)
# Then we take a view of that data as a float64 so it works properly
sorted_m1 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f0','f1','f2'], axis=0).view(numpy.float64)
# stats is my name for scipy.stats
# This has a wonderful function that calculates percentiles, including performing interpolation
# (important for low numbers of bootstrap runs)
m1_perc0p5 = stats.scoreatpercentile(sorted_m1,0.5)[0]
m1_perc2p5 = stats.scoreatpercentile(sorted_m1,2.5)[0]
m1_perc16 = stats.scoreatpercentile(sorted_m1,16)[0]
m1_perc84 = stats.scoreatpercentile(sorted_m1,84)[0]
m1_perc97p5 = stats.scoreatpercentile(sorted_m1,97.5)[0]
m1_perc99p5 = stats.scoreatpercentile(sorted_m1,99.5)[0]
print "m1 68% confidence interval is from", m1_perc16, "to", m1_perc84
print "m1 95% confidence interval is from", m1_perc2p5, "to", m1_perc97p5
print "m1 99% confidence interval is from", m1_perc0p5, "to", m1_perc99p5
print ""
# Now column 2, sort it...
sorted_m2 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f1','f0','f2'], axis=0).view(numpy.float64)
# ... and do stats.
m2_perc0p5 = stats.scoreatpercentile(sorted_m2,0.5)[1]
m2_perc2p5 = stats.scoreatpercentile(sorted_m2,2.5)[1]
m2_perc16 = stats.scoreatpercentile(sorted_m2,16)[1]
m2_perc84 = stats.scoreatpercentile(sorted_m2,84)[1]
m2_perc97p5 = stats.scoreatpercentile(sorted_m2,97.5)[1]
m2_perc99p5 = stats.scoreatpercentile(sorted_m2,99.5)[1]
print "m2 68% confidence interval is from", m2_perc16, "to", m2_perc84
print "m2 95% confidence interval is from", m2_perc2p5, "to", m2_perc97p5
print "m2 99% confidence interval is from", m2_perc0p5, "to", m2_perc99p5
print ""
# and finally column 3, again, sort it..
sorted_m3 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f2','f1','f0'], axis=0).view(numpy.float64)
# ... and do stats.
m3_perc0p5 = stats.scoreatpercentile(sorted_m3,0.5)[1]
m3_perc2p5 = stats.scoreatpercentile(sorted_m3,2.5)[1]
m3_perc16 = stats.scoreatpercentile(sorted_m3,16)[1]
m3_perc84 = stats.scoreatpercentile(sorted_m3,84)[1]
m3_perc97p5 = stats.scoreatpercentile(sorted_m3,97.5)[1]
m3_perc99p5 = stats.scoreatpercentile(sorted_m3,99.5)[1]
print "m3 68% confidence interval is from", m3_perc16, "to", m3_perc84
print "m3 95% confidence interval is from", m3_perc2p5, "to", m3_perc97p5
print "m3 99% confidence interval is from", m3_perc0p5, "to", m3_perc99p5
print ""
# End of the main function
#
#
# Helper functions go down here
#
#
# regression
# This takes a 2D numpy array and performs a least-squares regression
# using the formula on the practical sheet, page 3
# Stored in the top are the real values
# Returns an array of m1, m2 and m3.
def regression(data):
    # While testing, just return the real values
    # real_values = numpy.array([17.0, 10.0, 1.96])
    # Creating the G matrix
    # ---------------------
    # Because I'm using numpy arrays here, we need
    # to learn some notation.
    # data[:,0] is the FIRST column
    # Length of this = number of time samples in data
    N = len(data[:,0])
    # numpy.sum adds up all data in a row or column.
    # Axis = 0 implies add up each column. [0] at end
    # returns the sum of the first column
    # This is the sum of Ti for i = 1..N
    sum_Ti = numpy.sum(data, axis=0)[0]
    # numpy.power takes each element of an array and raises them to a given power
    # In this one call we also take the sum of the columns (as above) after they have
    # been squared, and then just take the t column
    sum_Ti2 = numpy.sum(numpy.power(data, 2), axis=0)[0]
    # Now we need to get the cube of Ti, then sum that result
    sum_Ti3 = numpy.sum(numpy.power(data, 3), axis=0)[0]
    # Finally we need the quartic of Ti, then sum that result
    sum_Ti4 = numpy.sum(numpy.power(data, 4), axis=0)[0]
    # Now we can construct the G matrix
    G = numpy.array([[N, sum_Ti, -0.5 * sum_Ti2],
                     [sum_Ti, sum_Ti2, -0.5 * sum_Ti3],
                     [-0.5 * sum_Ti2, -0.5 * sum_Ti3, 0.25 * sum_Ti4]])
    # We also need to take the inverse of the G matrix
    G_inv = numpy.linalg.inv(G)
    # Creating the d matrix
    # ---------------------
    # Hello numpy.sum, my old friend...
    sum_Yi = numpy.sum(data, axis=0)[1]
    # numpy.prod multiplies the values in an array.
    # We need to do the products along axis 1 (i.e. row by row)
    # Then sum all the elements
    sum_TiYi = numpy.sum(numpy.prod(data, axis=1))
    # The final element we need is a bit tricky.
    # We need the product as above
    TiYi = numpy.prod(data, axis=1)
    # Then we get tricky. * works how we need it here,
    # remember that the Ti column is referenced by data[:,0] as above
    Ti2Yi = TiYi * data[:,0]
    # Then we sum
    sum_Ti2Yi = numpy.sum(Ti2Yi)
    # With all the elements, we make the d matrix
    d = numpy.array([sum_Yi,
                     sum_TiYi,
                     -0.5 * sum_Ti2Yi])
    # Do the linear algebra stuff
    # To multiply numpy arrays in a matrix style,
    # we need to use numpy.dot()
    # Not the most useful notation, but there you go.
    # To help out the Matlab users: http://www.scipy.org/NumPy_for_Matlab_Users
    result = G_inv.dot(d)
    # Return this result
    return result
# residuals:
# Takes in a data array, and an array of best fit parameters
# calculates the difference between the observed and predicted data
# and returns an array
def residuals(data, best_fit):
    # Extract ti from the data array
    ti = data[:,0]
    # We also need an array of the square of ti
    ti2 = numpy.power(ti, 2)
    # Extract yi
    yi = data[:,1]
    # Calculate residual (data minus predicted)
    result = yi - best_fit[0] - (best_fit[1] * ti) + (0.5 * best_fit[2] * ti2)
    return result
# resamp_with_replace:
# Perform a dataset resampling with replacement on parameter set.
# Uses numpy.random to generate the random numbers to pick the indices to look up.
# So for item 0, ... N, we look up a random index from the set and put that in
# our resampled data.
def resamp_with_replace(set):
    # How many things do we need to do this for?
    N = len(set)
    # Preallocate our result array
    result = numpy.zeros(N)
    # Generate N random integers between 0 and N-1 (randint's upper bound is exclusive)
    indices = numpy.random.randint(0, N, N)
    # For i from the set 0...N-1 (that's what the range() command gives us),
    # our result for that i is given by the index we randomly generated above
    for i in range(N):
        result[i] = set[indices[i]]
    return result
# calc_new_data:
# Given a set of resampled residuals, use the model parameters to derive
# new data. This is used for bootstrapping the residuals.
# true_data is a numpy array of rows of ti, yi. We only need the ti column though.
# model is an array of three parameters, corresponding to m1, m2, m3.
# residuals is an array of our resampled residuals
def calc_new_data(true_data, model, residuals):
    # Extract the time information from the new data array
    ti = true_data[:,0]
    # Calculate new data using array maths
    # This goes through and does the sums etc for each element of the array
    # Nice and compact way to represent it.
    y_new = residuals + model[0] + (model[1] * ti) - (0.5 * model[2] * ti**2)
    # Our result needs to be an array of ti, y_new, so we need to combine them using
    # the numpy.column_stack routine
    result = numpy.column_stack((ti, y_new))
    # Return this combined array
    return result
# print_progress:
# Just a quick thing that returns true if we want to print for this index
# and false otherwise
def print_progress(index, total):
    index = float(index)
    total = float(total)
    result = False
    # Floating point maths is irritating
    # We want to print at the start, every 10%, and at the end.
    # This works up to index = 100,000
    # Would also be lovely if Python had a switch statement
    if (((index / total) * 100) <= 0.00001):
        result = True
    elif (((index / total) * 100) >= 9.99999) and (((index / total) * 100) <= 10.00001):
        result = True
    elif (((index / total) * 100) >= 19.99999) and (((index / total) * 100) <= 20.00001):
        result = True
    elif (((index / total) * 100) >= 29.99999) and (((index / total) * 100) <= 30.00001):
        result = True
    elif (((index / total) * 100) >= 39.99999) and (((index / total) * 100) <= 40.00001):
        result = True
    elif (((index / total) * 100) >= 49.99999) and (((index / total) * 100) <= 50.00001):
        result = True
    elif (((index / total) * 100) >= 59.99999) and (((index / total) * 100) <= 60.00001):
        result = True
    elif (((index / total) * 100) >= 69.99999) and (((index / total) * 100) <= 70.00001):
        result = True
    elif (((index / total) * 100) >= 79.99999) and (((index / total) * 100) <= 80.00001):
        result = True
    elif (((index / total) * 100) >= 89.99999) and (((index / total) * 100) <= 90.00001):
        result = True
    elif ((((index+1) / total) * 100) > 99.99999):
        result = True
    else:
        result = False
    return result
#
#
# End of helper functions
#
#
# So we can easily execute our script
if __name__ == "__main__":
    main()
I guess you can take a look; here is a link to the complete information.
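As a side note on the regression() helper above: it builds the normal equations G m = d explicitly and then inverts G. The same least-squares solution can be obtained by handing the design matrix straight to numpy.linalg.lstsq (or by calling numpy.linalg.solve(G, d)), which avoids the explicit inverse. A rough sketch of that idea, assuming the same model y = m1 + m2*t - 0.5*m3*t**2:

import numpy

def regression_lstsq(data):
    # data is the same N x 2 array of (t, y) rows used by regression()
    t = data[:, 0]
    y = data[:, 1]
    # One column per model parameter: 1, t and -0.5*t**2
    A = numpy.column_stack((numpy.ones_like(t), t, -0.5 * t**2))
    # Least-squares solve of A m = y; equivalent to G_inv.dot(d) above
    return numpy.linalg.lstsq(A, y)[0]

Feeding the same data array through regression_lstsq should reproduce the m1, m2, m3 returned by regression(), since both solve the same least-squares problem.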
Use sklearn instead of plain numpy (sklearn builds on numpy and is much more convenient for this kind of calculation):
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
clf.coef_
# array([ 0.5,  0.5])
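Tying this back to the array in the original question: before either np.linalg.lstsq or LinearRegression can be used, the nonzero pixels have to be converted into (x, y) points. A minimal sketch of that step, assuming column index = x, row index = y, and treating every nonzero entry as one point (the numeric values stand in for a, b, c, d, e, f):

import numpy as np

img = np.array([[0, 0, 0, 0, 5],
                [0, 0, 3, 4, 0],
                [2, 1, 6, 0, 0]])        # stand-in for the image-derived array

rows, cols = np.nonzero(img)             # row/column indices of the nonzero pixels
x, y = cols, rows                        # interpret columns as x and rows as y

A = np.vstack([x, np.ones(len(x))]).T    # design matrix for y = m*x + c
m, c = np.linalg.lstsq(A, y)[0]
print(m, c)

If the pixel values should act as weights, one crude option is to duplicate each point in proportion to its intensity, e.g. np.repeat(x, img[rows, cols]) and np.repeat(y, img[rows, cols]), before building the design matrix.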
Related
How to include numbers we need in a list which is generated by some stochastic algorithm
I need to implement a stochastic algorithm that provides as output the times and the states of a dynamic system at the corresponding time points. We include randomness in defining the time points by retrieving a random number from the uniform distribution. What I want to do is find the state at the time points 0, 1, 2, ..., 24. Given the randomness of the algorithm, the time points 1, 2, 3, ..., 24 are not necessarily hit. We may include rounding to two decimal places, but even with rounding I cannot find/insert all of these time points. The question is how to change the code so as to include the numbers 1, 2, ..., 24 in the list of time points while preserving the stochasticity of the algorithm. Thanks for any suggestion.

import numpy as np
import random
import math as m

np.random.seed(seed = 5)

# Stoichiometric matrix
S = np.array([(-1, 0), (1, -1)])

# Reaction parameters
ke = 0.3; ka = 0.5
k = [ke, ka]

# Initial state vector at time t0
X1 = [200]; X2 = [0]
# We will update it for each time.
X = [X1, X2]

# Initial time is t0 = 0, which we will update.
t = [0]
# End time
tfinal = 24

# The propensity vector R concerning the last/updated value of time
def ReactionRates(k, X1, X2):
    R = np.zeros((2,1))
    R[0] = k[1] * X1[-1]
    R[1] = k[0] * X2[-1]
    return R

# We implement the Gillespie (SSA) algorithm
while True:
    # Reaction propensities/rates
    R = ReactionRates(k, X1, X2)
    propensities = R
    propensities_sum = sum(R)[0]
    if propensities_sum == 0:
        break
    # we include randomness
    u1 = np.random.uniform(0, 1)
    delta_t = (1/propensities_sum) * m.log(1/u1)
    if t[-1] + delta_t > tfinal:
        break
    t.append(t[-1] + delta_t)
    b = [0, R[0], R[1]]
    u2 = np.random.uniform(0, 1)
    # Choose j
    lambda_u2 = propensities_sum * u2
    for j in range(len(b)):
        if sum(b[0:j-1+1]) < lambda_u2 <= sum(b[1:j+1]):
            break  # out of for j
    # make j zero based
    j -= 1
    # We update the state vector
    X1.append(X1[-1] + S.T[j][0])
    X2.append(X2[-1] + S.T[j][1])

# round t values
t = [round(tt, 2) for tt in t]

print("The time steps:", t)
print("The second component of the state vector:", X2)
After playing with your model, I conclude that interpolation works fine. Basically, just append the following lines to your code:

ts = np.arange(tfinal + 1)
xs = np.interp(ts, t, X2)

and if you have matplotlib installed, you can visualize it using

import matplotlib.pyplot as plt

plt.plot(t, X2)
plt.plot(ts, xs)
plt.show()
Need help speeding up numpy code that finds number of "coincidences" between two NumPy arrays
I am looking for some help speeding up some code that I have written in NumPy. Here is the code:

def TimeChunks(timevalues, num):
    avg = len(timevalues) / float(num)
    out = []
    last = 0.0
    while last < len(timevalues):
        out.append(timevalues[int(last):int(last + avg)])
        last += avg
    return out

### chunk i can be called by out[i] ###
NumChunks = 100000
t1chunks = TimeChunks(t1, NumChunks)
t2chunks = TimeChunks(t2, NumChunks)

NumofBins = 2000
CoincAllChunks = 0
for i in range(NumChunks):
    CoincOneChunk = 0
    Hist1, something1 = np.histogram(t1chunks[i], NumofBins)
    Hist2, something2 = np.histogram(t2chunks[i], NumofBins)
    Mask1 = (Hist1 > 0)
    Mask2 = (Hist2 > 0)
    MaskCoinc = Mask1 * Mask2
    CoincOneChunk = np.sum(MaskCoinc)
    CoincAllChunks = CoincAllChunks + CoincOneChunk

Is there anything that can be done to improve this and make it more efficient for large arrays?
To explain the point of the code in a nutshell: the purpose is to find the average number of "coincidences" between two NumPy arrays representing the time values of two channels (divided by some normalisation constant). A "coincidence" occurs when there is at least one time value from each of the two channels in a certain time interval. For example:

t1 = [.4, .7, 1.1]
t2 = [0.8, .9, 1.5]

There is a coincidence in the window [0, 1] and one coincidence in the interval [1, 2].
I want to find the average number of these "coincidences" when I break my time array down into a number of equally distributed bins. So, for example, if:

t1 = [.4, .7, 1.1, 2.1, 3, 3.3]
t2 = [0.8, .9, 1.5, 2.2, 3.1, 4]

and I want 4 bins, the intervals I'll consider are [0,1], [1,2], [2,3], [3,4]. Therefore the total number of coincidences will be 4 (because there is a coincidence in each bin), and the average number of coincidences will be 4.
This code is an attempt to do this for large time arrays with very small bin sizes, and to make it work I had to break my time arrays down into smaller chunks and then loop through each of these chunks. I've tried making this as vectorized as I can, but it is still very slow... Any ideas what can be done to speed it up further? Any suggestions/hints will be appreciated. Thanks!
This is 17x faster and more correct, using a custom-made numba_histogram function that beats the generic np.histogram. Note that you are computing and comparing the histograms of the two series separately, which is not accurate for your purpose. So, in my numba_histogram function I use the same bin edges to compute the histograms of both series simultaneously.
We can still optimize it even further if you provide more precise details about the algorithm, namely the parameters and the criteria by which you decide that two intervals coincide.

import numpy as np
from numba import njit

@njit
def numba_histogram(a, b, n):
    hista, histb = np.zeros(n, dtype=np.intp), np.zeros(n, dtype=np.intp)
    a_min, a_max = min(a[0], b[0]), max(a[-1], b[-1])
    for x, y in zip(a, b):
        bin = n * (x - a_min) / (a_max - a_min)
        if x == a_max:
            hista[n - 1] += 1
        elif bin >= 0 and bin < n:
            hista[int(bin)] += 1
        bin = n * (y - a_min) / (a_max - a_min)
        if y == a_max:
            histb[n - 1] += 1
        elif bin >= 0 and bin < n:
            histb[int(bin)] += 1
    return np.sum((hista > 0) * (histb > 0))

@njit
def calc_coincidence(t1, t2):
    NumofBins = 2000
    NumChunks = 100000
    avg = len(t1) / NumChunks
    CoincAllChunks = 0
    last = 0.0
    while last < len(t1):
        t1chunks = t1[int(last):int(last + avg)]
        t2chunks = t2[int(last):int(last + avg)]
        CoincAllChunks += numba_histogram(t1chunks, t2chunks, NumofBins)
        last += avg
    return CoincAllChunks

Test with 10**8 arrays:

t1 = np.arange(10**8) + np.random.rand(10**8)
t2 = np.arange(10**8) + np.random.rand(10**8)

CoincAllChunks = calc_coincidence(t1, t2)
print(CoincAllChunks)
# 34793890   Time: 24.96140170097351 sec. (Original)
# 34734897   Time: 1.499996423721313 sec. (Optimized)
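As a plain-NumPy cross-check of the shared-bin-edges idea (a sketch only, without the numba speedup): np.histogram accepts an explicit array of bin edges, so both channels can be binned against identical intervals before the masks are compared.

import numpy as np

def coincidences_shared_bins(t1, t2, num_bins):
    # One common set of edges spanning both channels
    lo = min(np.min(t1), np.min(t2))
    hi = max(np.max(t1), np.max(t2))
    edges = np.linspace(lo, hi, num_bins + 1)
    h1, _ = np.histogram(t1, bins=edges)
    h2, _ = np.histogram(t2, bins=edges)
    # A coincidence = at least one event from each channel in the same bin
    return np.sum((h1 > 0) & (h2 > 0))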
Appending values in sub-arrays of an array in python
The goal here is to construct the one-particle distribution function of a system evolving under Brownian dynamics; one has to produce a random number drawn from a Gaussian distribution. To construct this quantity, I am thinking of running several simulations, saving the distances of each particle from the center of the 2D square at specific times in each simulation, and only at the end creating a histogram of all the values.
My problem is that during each simulation, time begins from zero and advances with a certain time step, at each of which the particles move randomly, so the distances to be saved have to be labeled correctly for their corresponding times.
My thought was to create an array that has, in each row, 5 sub-arrays, one for each time at which I want to save the distances of the particles from the center of the square. I am trying to make this work with numpy, but with no success. For each simulation, and for specific times, I create an array with all the distances and try to append it with numpy.append to the specific sub-array, but this doesn't work correctly; as I understand it, the problem is that I don't know how to index the sub-arrays properly (and for all the simulations). Beyond that, I think the approach is not the best: either I will have to abandon the idea of using numpy and figure out how to index the array properly with two indices, or figure out a way to use numpy more effectively.
So, to the point, the general question here is how I could add/append values to specific sub-arrays of an array (either pre-constructed with numpy or not and treated as a list). The alternative would be for someone to suggest a more efficient way of creating the one-particle distribution function for a Brownian motion problem, which would be really helpful.
I am adding the relevant code below. Thank you all in advance.
Code:

import random
import math
import matplotlib.pyplot as plt
import numpy as np

# def dump(particles, step, n):
#     fileoutput = open('coord.txt', 'a')
#     fileoutput.write("ITEM: TIMESTEP \n")
#     fileoutput.write("%i \n" % step)
#     fileoutput.write("ITEM: NUMBER OF ATOMS \n")
#     fileoutput.write("%i \n" % n)
#     fileoutput.write("ITEM: BOX BOUNDS \n")
#     fileoutput.write("%e %e xlo xhi \n" % (0.0, 100))
#     fileoutput.write("%e %e xlo xhi \n" % (0.0, 100))
#     fileoutput.write("%e %e xlo xhi \n" % (-0.25, 0.25))
#     fileoutput.write("ITEM: ATOMS id type x y z \n")
#     i = 0
#     while i < n:
#         x = particles[i][0]
#         y = particles[i][1]
#         #fileoutput.write("%i %i %f %f %f \n" % (i, 1, x*1e10, y*1e10, z*1e10))
#         fileoutput.write("%i %i %f %f %f \n" % (i, 1, x, y, 0))
#         i += 1
#     fileoutput.close()

num_sims = 2
N = 49
L = 10
meanz = 0
varz = 1
sigma = 1
# tau = sigma**2*ksi/(kT)
# Starting time
t_0 = 0
# Time increments
dt = 10**(-4)  # dt/tau
# Ending time
T = 10**2  # T/tau

# Produce random particles and avoid overlap:
particles = np.full((N, 2), L/2)
times = np.arange(t_0, T, dt)
check = 0

distances = np.empty([50*num_sims, 5])

for sim in range(0, num_sims):
    step = 0
    t_index = 0
    for t in times:
        r = []
        for i in range(0, N):
            z = np.random.normal(meanz, varz)
            particles[i][0] = particles[i][0] + ((2*dt*sigma**2)**(1/2))*z
            z = random.gauss(meanz, varz)
            particles[i][1] = particles[i][1] + ((2*dt*sigma**2)**(1/2))*z
        if (t%(2*(10**5)*dt) == 0):
            for j in range(0, N):
                rj = ((particles[j][0]-L/2)**2 + (particles[j][1]-L/2)**2)**(1/2)
                r.append(rj)
            distances[t_index] = np.append(distances[t_index], r)
            t_index += 1
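On the narrow question of writing into specific sub-arrays: with a pre-allocated NumPy array, the usual pattern is direct assignment with multiple indices rather than numpy.append. A small sketch with made-up dimensions (num_sims, num_save_times and N here are placeholders, not taken from the question):

import numpy as np

num_sims, num_save_times, N = 2, 5, 49
# One slot per (simulation, save time, particle)
distances = np.zeros((num_sims, num_save_times, N))

for sim in range(num_sims):
    for t_index in range(num_save_times):
        r = np.random.rand(N)              # stand-in for the computed distances
        distances[sim, t_index, :] = r     # assign into the sub-array in place

# All distances recorded at the third save time, across simulations and particles:
values_at_t2 = distances[:, 2, :].ravel()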
Learning Laplace by doing, but getting unexpected result in Python
I'm trying to learn more about the Laplace transform, so I've tried to implement the forward and inverse (Mellin's inverse formula) transforms in code (approximated using the trapezium rule). I would expect to get roughly the same information back out when doing the forward and inverse one after the other. However, the output values appear to have nothing to do with the input data.
CODE:

# Dependencies:
from math import ceil
from cmath import *
import numpy as np

# Constants
j = complex(0, 1)
e = exp(1).real

# Default Values
sigma_default = 0  # Real component. When 0, the result is the Fourier transform

# Forward Transform - Time Domain to Laplace Domain
def Laplace(data, is_inverse, sigma=sigma_default, frequency_stamps=None, time_stamps=None):
    # Resolve empty data scenario
    data = np.asarray(data)
    if data.size <= 1:
        return data

    # Add time data if missing
    if time_stamps is None:
        if is_inverse is False:
            time_stamps = np.arange(0, data.size)
        else:
            time_stamps = np.arange(0, data.size * 2)
    else:
        time_stamps = np.asarray(time_stamps).real
        if time_stamps.size is not data.size:
            time_stamps = np.arange(0, data.size)

    # Add frequency stamps if missing
    if frequency_stamps is None:
        if is_inverse is False:
            frequency_stamps = np.asarray(np.arange(0, ceil(data.size / 2))).real * 2 * pi  # Added forgotten constant
        else:
            frequency_stamps = np.asarray(np.arange(0, ceil(data.size))).real * 2 * pi  # Added forgotten constant
    else:
        frequency_stamps = np.asarray(frequency_stamps).real
    frequency_stamps = sigma + frequency_stamps * j

    # Create the vector of powers exp(1) is raised to. Also create the delta times / frequencies
    if is_inverse is False:
        power = -Get_Powers(time_stamps, frequency_stamps)
        delta = np.diff(time_stamps)
    else:
        power = Get_Powers(frequency_stamps, time_stamps)
        delta = np.diff(frequency_stamps)
    delta = np.concatenate([[np.average(delta)], delta])  # Ensure a start value is present

    # Perform a numerical approximation of the Laplace transform
    laplace = data * np.power(e, power) * delta
    # Trapezium rule => average 1st and last wrt zero
    laplace = laplace.transpose()  # Fixed bug in trapezium rule implementation
    laplace[[0, -1]] *= 0.5
    laplace = laplace.transpose()
    laplace = np.sum(laplace, 1)  # Integrate

    # If inverse function, then normalise and ensure the result is real
    if is_inverse is True:
        laplace *= 1 / (2 * pi * j)  # Scale
        laplace = laplace.real  # Ensure time series is real only

    # Return the result
    return laplace

# Used to derive the vector of powers exp(1) is to be raised to
def Get_Powers(values1, values2):
    # For forward Laplace, 1 = time, 2 = frequency
    # For inverse Laplace, 1 = frequency, 2 = time
    power = np.ones([values1.size, values2.size])
    power = (power * values2).transpose() * values1
    return power

if __name__ == "__main__":
    # a = [0, 1, 2, 3, 4, 5]
    a = np.arange(0, 10)
    b = Laplace(a, False)
    c = Laplace(b, True)
    print(np.asarray(a))
    print(c)

EXPECTED RESULT:
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
ACTUAL RESULT:
[0 1 2 3 4 5 6 7 8 9]
[162. 162. 162. 162. 162. 162. 162. 162. 162. 162.]
Any ideas where I've gone awry?
EDIT 1: Added the Laplace transform definitions I am working from.
Forward transform: F(s) = integral from 0 to infinity of f(t) * e^(-s*t) dt
Inverse transform (Mellin's formula): f(t) = 1/(2*pi*j) * integral from sigma - j*inf to sigma + j*inf of F(s) * e^(s*t) ds
Definition of s: s = sigma + j*omega, where omega is represented as frequency_stamps in my code. When sigma = 0 the system becomes the Fourier transform.
EDIT 2: Fixed two bugs. The problem still persists.
Besides the two bug fixes made in the original question, there were a further 3 bugs, which I identified via Cris Luengo's suggestion to look into the conversion from the Fourier transform to the discrete Fourier transform. A summary of all the bug fixes is below:
- Fixed a bug in how I implemented the trapezium rule.
- Scaled the frequency_stamps by 2*pi to reflect the underlying circular nature of the Laplace data.
- Rescaled the frequency_stamps again such that they only travel around the circle once (i.e. the data stays in the range 0 -> 2*pi).
- Fixed a mistake where I'd assumed that there only needed to be half as many frequency points as time points. That's wrong: there should be an equal number of both.
- Allowed the passing of the initial and final time-series points for the inverse transform, as the data otherwise gets corrupted.
Updated code:

# Dependencies:
from cmath import *
import numpy as np

# Constants
j = complex(0, 1)
e = exp(1).real

# Default Values
sigma_default = 0.0  # Real component. When 0, the result is the Fourier transform
ends_default = np.asarray([0, 0])

# Forward Transform - Time Domain to Laplace Domain
def Laplace(data, is_inverse, sigma=sigma_default, frequency_stamps=None, time_stamps=None, ends=ends_default):
    # Resolve empty data scenario
    data = np.asarray(data)
    if data.size <= 1:
        return data

    # Add time data if missing
    if time_stamps is None:
        time_stamps = np.arange(0, data.size)  # Size doesn't change between forward and inverse
    else:
        time_stamps = np.asarray(time_stamps).real
        if time_stamps.size is not data.size:
            time_stamps = np.arange(0, data.size)

    # Add frequency stamps if missing
    if frequency_stamps is None:
        frequency_stamps = np.asarray(np.arange(0.0, data.size)).real  # Size doesn't change between forward and inverse
        frequency_stamps *= 2 * pi / np.max(frequency_stamps)  # Restrict the integral range to 0 -> 2pi
    else:
        frequency_stamps = np.asarray(frequency_stamps).real
    frequency_stamps = sigma + frequency_stamps * j

    # Create the vector of powers exp(1) is raised to. Also create the delta times / frequencies
    if is_inverse is False:
        power = -Get_Powers(time_stamps, frequency_stamps)
        delta = np.diff(time_stamps)
    else:
        power = Get_Powers(frequency_stamps, time_stamps)
        delta = np.diff(frequency_stamps)
    delta = np.concatenate([[np.average(delta)], delta])  # Ensure a start value is present

    # Perform a numerical approximation of the Laplace transform
    laplace = data * np.power(e, power) * delta
    laplace = laplace.transpose()
    laplace[[0, -1]] *= 0.5  # Trapezium rule => average 1st and last wrt zero
    laplace = laplace.transpose()
    laplace = np.sum(laplace, 1)  # Integrate

    # If inverse function, then normalise and ensure the result is real
    if is_inverse is True:
        laplace *= 1 / (2 * pi * j)  # Scale
        laplace = laplace.real  # Ensure time series is real only
        # Correct for edge cases
        laplace[0] = ends[0]
        laplace[-1] = ends[-1]

    # Return the result
    return laplace

# Used to derive the vector of powers exp(1) is to be raised to
def Get_Powers(values1, values2):
    # For forward Laplace, 1 = time, 2 = frequency
    # For inverse Laplace, 1 = frequency, 2 = time
    power = np.ones([values1.size, values2.size])
    power = (power * values2).transpose() * values1
    return power

if __name__ == "__main__":
    a = np.arange(3, 13)
    b = Laplace(a, False, sigma=0.5)
    c = Laplace(b, True, sigma=0.5, ends=np.asarray([3, 12]))
    print(np.asarray(a))
    print(c)

Output:
[ 3  4  5  6  7  8  9 10 11 12]
[ 3.  4.  5.  6.  7.  8.  9. 10. 11. 12.]
Thanks for the assist!
Population Monte Carlo implementation
I am trying to implement the Population Monte Carlo algorithm as described in this paper (see page 78, Fig. 3) for a simple model (see the function model()) with one parameter, using Python. Unfortunately, the algorithm doesn't work and I can't figure out what's wrong. See my implementation below. The actual function is called abc(). All other functions can be seen as helper functions and seem to work fine.
To check whether the algorithm works, I first generate observed data with the only parameter of the model set to param = 8. Therefore, the posterior resulting from the ABC algorithm should be centered around 8. This is not the case and I'm wondering why.
I would appreciate any help or comments.

# imports
from math import exp
from math import log
from math import sqrt
import numpy as np
import random
from scipy.stats import norm

# globals
N = 300  # sample size
N_PARTICLE = 300  # number of particles
ITERS = 5  # number of decreasing thresholds
M = 10  # number of words to remember
MEAN = 7  # prior mean of parameter
SD = 2  # prior sd of parameter

def model(param):
    recall_prob_all = 1/(1 + np.exp(M - param))
    recall_prob_one_item = np.exp(np.log(recall_prob_all) / float(M))
    return sum([1 if random.random() < recall_prob_one_item else 0 for item in range(M)])

## example
print "Output of model function: \n" + str(model(10)) + "\n"

# generate data from model
def generate(param):
    out = np.empty(N)
    for i in range(N):
        out[i] = model(param)
    return out

## example
print "Output of generate function: \n" + str(generate(10)) + "\n"

# distance function (sum of squared error)
def distance(obsData, simData):
    out = 0.0
    for i in range(len(obsData)):
        out += (obsData[i] - simData[i]) * (obsData[i] - simData[i])
    return out

## example
print "Output of distance function: \n" + str(distance([1,2,3],[4,5,6])) + "\n"

# sample new particles based on weights
def sample(particles, weights):
    return np.random.choice(particles, 1, p=weights)

## example
print "Output of sample function: \n" + str(sample([1,2,3],[0.1,0.1,0.8])) + "\n"

# perturbance function
def perturb(variance):
    return np.random.normal(0, sqrt(variance), 1)[0]

## example
print "Output of perturb function: \n" + str(perturb(1)) + "\n"

# compute new weight
def computeWeight(prevWeights, prevParticles, prevVariance, currentParticle):
    denom = 0.0
    proposal = norm(currentParticle, sqrt(prevVariance))
    prior = norm(MEAN, SD)
    for i in range(len(prevParticles)):
        denom += prevWeights[i] * proposal.pdf(prevParticles[i])
    return prior.pdf(currentParticle)/denom

## example
prevWeights = [0.2, 0.3, 0.5]
prevParticles = [1, 2, 3]
prevVariance = 1
currentParticle = 2.5
print "Output of computeWeight function: \n" + str(computeWeight(prevWeights, prevParticles, prevVariance, currentParticle)) + "\n"

# normalize weights
def normalize(weights):
    return weights/np.sum(weights)

## example
print "Output of normalize function: \n" + str(normalize([3.,5.,9.])) + "\n"

# sampling from prior distribution
def rprior():
    return np.random.normal(MEAN, SD, 1)[0]

## example
print "Output of rprior function: \n" + str(rprior()) + "\n"

# ABC using Population Monte Carlo sampling
def abc(obsData, eps):
    draw = 0
    Distance = 1e9
    variance = np.empty(ITERS)
    simData = np.empty(N)
    particles = np.empty([ITERS, N_PARTICLE])
    weights = np.empty([ITERS, N_PARTICLE])

    for t in range(ITERS):
        if t == 0:
            for i in range(N_PARTICLE):
                while(Distance > eps[t]):
                    draw = rprior()
                    simData = generate(draw)
                    Distance = distance(obsData, simData)
                Distance = 1e9
                particles[t][i] = draw
                weights[t][i] = 1./N_PARTICLE
            variance[t] = 2 * np.var(particles[t])
            continue

        for i in range(N_PARTICLE):
            while(Distance > eps[t]):
                draw = sample(particles[t-1], weights[t-1])
                draw += perturb(variance[t-1])
                simData = generate(draw)
                Distance = distance(obsData, simData)
            Distance = 1e9
            particles[t][i] = draw
            weights[t][i] = computeWeight(weights[t-1], particles[t-1], variance[t-1], particles[t][i])
        weights[t] = normalize(weights[t])
        variance[t] = 2 * np.var(particles[t])

    return particles[ITERS-1]

true_param = 9
obsData = generate(true_param)
eps = [15000, 10000, 8000, 6000, 3000]
posterior = abc(obsData, eps)
#print posterior
I stumbled upon this question as I was looking for pythonic implementations of PMC algorithms, since, quite coincidentally, I'm currently in the process of applying the techniques in this exact paper to my own research.
Can you post the results you're getting? My guess is that 1) you're using a poor choice of distance function (and/or similarity thresholds), or 2) you're not using enough particles. I may be wrong here (I'm not very well versed in sample statistics), but your distance function implicitly suggests to me that the ordering of your random draws matters. I'd have to think about this more to determine whether it actually has any effect on the convergence properties (it may not), but why don't you simply use the mean or median as your sample statistic?
I ran your code with 1000 particles and a true parameter value of 8, using the absolute difference between sample means as my distance function, for three iterations with epsilons of [0.5, 0.3, 0.1]; the peak of my estimated posterior distribution approaches 8 on each iteration, just as it should, alongside a reduction in the population variance. Note that there is still a noticeable rightward bias, but this is because of the asymmetry of your model (parameter values of 8 or less can never result in more than 8 observed successes, while all parameter values greater than 8 can, leading to a rightward skewness in the distribution).
Here's the plot of my results: [plot not reproduced here]
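For what it's worth, the "use the mean as your sample statistic" suggestion corresponds to swapping the sum-of-squares distance() in the question for something like the sketch below (the epsilon thresholds then need to be rescaled accordingly, e.g. to the [0.5, 0.3, 0.1] mentioned above):

import numpy as np

def distance(obsData, simData):
    # Absolute difference between the sample means of observed and simulated data
    return abs(np.mean(obsData) - np.mean(simData))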