I want to randomly draw N = 30 (X, Y) pairs, with replacement, and do it F = 5,000 times. For each draw I want to calculate the slope and intercept of the regression line, and then plot histograms of the slopes and intercepts. Here is the code I have so far.
F = 10000
N = 30
X = sigma*(np.random.randn(F)/F)
Y = beta*X + alpha + sigma*(np.random.randn(F))
Xbar = np.mean(X)
Ybar = np.mean(Y)
numer2 = 0
denom2 = 0
for i in range(F):
    for j in range(N):
        numer2 += (X[j]-Xbar)*(Y[j]-Ybar)
        denom2 += (X[j]-Xbar)**2
slope = numer2/denom2
intercept = Ybar - slope*Xbar
plt.figure(1)
plt.hist(slope, bins=50)
plt.hist(intercept, bins=50)
plt.grid()
plt.show()
I want a slope and an intercept for each of the 5,000 draws of 30 points. I thought the double for loop would do that. Unfortunately, all I can get is one value of each. How can I fix this?
There are two errors. First, as @GreenCloakGuy pointed out, you are not storing the values of the slope and intercept. Second, you are not sampling randomly from your X and Y in the inner loop. Also, you don't need an inner loop for the calculation; NumPy array operations are vectorized:
F = 5000
N = 30
sigma = 0.5
beta = 2
alpha = 0.2
X = np.random.randn(F)
Y = beta*X + alpha + sigma*(np.random.randn(F))
Xbar = np.mean(X)
Ybar = np.mean(Y)
slopes = []
intercepts = []
for i in range(F):
    j = np.random.randint(0,F,N)
    numer2 = np.sum((X[j]-Xbar)*(Y[j]-Ybar))
    denom2 = np.sum((X[j]-Xbar)**2)
    slope = numer2/denom2
    intercept = Ybar - slope*Xbar
    slopes.append(slope)
    intercepts.append(intercept)
I'm not entirely sure what you are trying to do with your code, or where the sigma values are meant to go, but the above should give you a distribution of slopes and intercepts.
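To visualize the result you can then histogram the stored lists; a minimal plotting sketch (it assumes matplotlib.pyplot is imported as plt, and the figure layout is just one option):
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(slopes, bins=50)
ax1.set_title('slope')
ax2.hist(intercepts, bins=50)
ax2.set_title('intercept')
ax1.grid()
ax2.grid()
plt.show()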
Every time you do slope = numer2/denom2 you overwrite the previous value of slope. If you want to save all of the values, you need to store them in a collection defined outside of the loops, such as a list:
slopes = []
intercepts = []
for i in range(F):
    # reset the accumulators for each draw
    numer2 = 0
    denom2 = 0
    for j in range(N):
        numer2 += (X[j]-Xbar)*(Y[j]-Ybar)
        denom2 += (X[j]-Xbar)**2
    slope = numer2/denom2
    intercept = Ybar - slope*Xbar
    slopes.append(slope)
    intercepts.append(intercept)
...
plt.hist(slopes, bins=50)
plt.hist(intercepts, bins=50)
I have been trying to code a piece of software to bootstrap data where every data point has a different and unique uncertainty. I take this uncertainty as the standard deviation when sampling that point from a Gaussian distribution.
I run many samples; however, the bootstrapped result does not agree with the curve_fit best result (the only difference I can think of is that curve_fit takes the data points and assumes they have no uncertainty), but the two should, by definition, be identical. Any ideas why?
The code is as follows, with inputs:
def f(x, a, b):
    y = a*x + b
    return y
x (array, x data points)
y (array, y data points)
x_err (array, uncertainty in each x point)
y_err (array, uncertainty in each y point)
n_samples = 10000
conf_pct = 68 (% for a 1 sigma test)
So just for clarity, x[i], y[i], x_err[i] and y_err[i] together make up all the information associated with the i-th data point. (I did have these in a dataframe but pulled them out into arrays because I understood the processing better that way.)
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def bootstrap_fit(f, x, y, x_err, y_err, n_samples, conf_pct):
    # n_samples number of draws from each data point, and then
    # take the transpose to make n_samples number of samples (because n_samples >> len(x) )
    x_sampling = []
    y_sampling = []
    a_boot = []
    b_boot = []
    # cov_boot = [] # don't think we'll need this but just in case?
    for i, this_x in enumerate(x):
        this_x_err = x_err[i]
        this_y = y[i]
        this_y_err = y_err[i]
        this_x_samp = np.random.normal(loc=this_x, scale=this_x_err, size=n_samples)
        this_y_samp = np.random.normal(loc=this_y, scale=this_y_err, size=n_samples)
        x_sampling.append(this_x_samp)
        y_sampling.append(this_y_samp)
    # convert to np arrays and take the transpose
    x_sampling = np.array(x_sampling).T
    y_sampling = np.array(y_sampling).T
    # ok, now that we have n_samples number of datasets randomly sampled within
    # the actual errorbars of the data, let's fit each of those datasets
    # notice how this_x and this_y and i, etc, are temporary variables
    # that will get overwritten from the past loop
    for i, this_x in enumerate(x_sampling):
        this_y = y_sampling[i]
        p_opt, p_cov = curve_fit(f, this_x, this_y)
        a_boot.append(p_opt[0])
        b_boot.append(p_opt[1])
        # cov_boot.append(p_cov)
    # make these into np arrays as well
    a_boot = np.array(a_boot)
    b_boot = np.array(b_boot)
    # set up an array to use to plot the lines (because each x, y random dataset
    # actually has slightly different min and max x values, and that gets messy)
    x_fit = np.linspace(np.min(x), np.max(x), num=1000, endpoint=True)
    y_fit = []
    for i, this_a in enumerate(a_boot):
        this_b = b_boot[i]
        this_y = f(x_fit, this_a, this_b)
        y_fit.append(this_y)
    y_fit = np.array(y_fit)
    # figure out from that what percentiles we actually need to identify
    conf_lo = (100. - conf_pct)/2.
    conf_hi = 100. - conf_lo
    # set up the lists that will hold the upper and lower lines
    y_upper = []
    y_lower = []
    y_median = []
    y_difference = []
    for i, this_x in enumerate(x_fit):
        # we need to extract all the y-values for every random sample that correspond
        # to this x value. We will just take the ith array of the transpose of y_boot.
        this_y = y_fit.T[i]
        # add the percentile values to each list for this value of x
        y_lower.append(np.percentile(this_y, conf_lo))
        y_upper.append(np.percentile(this_y, conf_hi))
        y_median.append(np.percentile(this_y, 50.))
    # make them numpy arrays because sometimes matplotlib doesn't like plotting lists
    y_lower = np.array(y_lower)
    y_upper = np.array(y_upper)
    y_median = np.array(y_median)
    # finding equation for the median line
    p_opt, p_cov = curve_fit(f, x_fit, y_median)
    a = float("{:.4f}".format(p_opt[0]))
    b = float("{:.4f}".format(p_opt[1]))
    for i, this_x in enumerate(x_fit):
        this_y = y_fit.T[i]
        # point_line_distance is a user-defined helper (not shown in the question)
        spread_above = abs(point_line_distance(x, np.percentile(this_y, conf_hi), p_opt[0], p_opt[1]))
        spread_below = abs(point_line_distance(x, np.percentile(this_y, conf_lo), p_opt[0], p_opt[1]))
        orthog_distance = spread_above + spread_below
        y_difference.append(orthog_distance)
    spread = float("{:.4f}".format(np.amin(y_difference)))
    print("narrowest orthogonal point on bootstrap_2 "+str(spread))
    plt.fill_between(x_fit, y_lower, y_upper, alpha=0.4, label='Bootstrapped uncertainty at '+str(conf_pct)+'%')
    plt.plot(x_fit, y_median, label='Bootstrapped curve_fit: y = ('+str(a)+')x + ('+str(b)+')')
def CURVE_fit(x, y):
    # do a standard linear fitting to the data
    p_opt, p_cov = curve_fit(f, x, y)
    p_err = np.sqrt(np.diag(p_cov))
    # y=ax+b for a linear fit
    a, b = p_opt
    a_err, b_err = p_err
    x_plot = np.sort(x)
    # result() is a user-defined formatting helper (not shown in the question)
    plt.plot(x, a*x+b, label='best curve_fit: y = ('+result(a,a_err)+')x + ('+result(b,b_err)+')', color = 'purple', linestyle = 'dashed')
I've tried playing around with the number of samples and the data input, and I've even coded an entirely separate straight-line fit using ODR and a corresponding independent bootstrapping method (they don't agree, but that's a whole different issue), and nothing seems to reconcile these two values. Any ideas would be much appreciated.
I want to plot the least-squares regression line for X and Y on a log-log scale plot and find its coefficients. The line function is log(Y) = a*log(X) + b, or equivalently Y = 10^b * X^a. What are the coefficients a and b? How can I use polyfit in NumPy?
I use the code below, but I get this runtime error:
divide by zero encountered in log10
  X_log = np.log10(X)
X_log = np.log10(X)
Y_log = np.log10(Y)
X_mean = np.mean(X_log)
Y_mean = np.mean(Y_log)
num = 0
den = 0
for i in range(len(X)):
    num += (X_log[i] - X_mean)*(Y_log[i] - Y_mean)
    den += (X_log[i] - X_mean)**2
m = num / den
c = Y_mean - m*X_mean
print (m, c)
Y_pred = m*X_log + c
plt.plot([min(X_log), max(X_log)], [min(Y_pred), max(Y_pred)], color='red') # predicted
plt.show()
It seems like you have X-values that are zero (or negative); can you show the values you pass to X_log = np.log10(X)?
To use np.polyfit just write
coeff = np.polyfit(np.log10(x), np.log10(y), deg = 1)
coeff will now be an array [a, b] with your coefficients for a first-degree fit (hence deg = 1) to the data points (log(x), log(y)). If you want the variance of the coefficients, use
coeff, cov = np.polyfit(np.log10(x), np.log10(y), deg = 1, cov = True)
cov is now your covariance matrix.
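Putting it together, here is a short sketch of going from the fitted coefficients back to the original scale (it assumes x and y are the strictly positive data arrays, using the lower-case names from this answer):
import numpy as np
import matplotlib.pyplot as plt

# fit log10(y) = a*log10(x) + b  (x and y must be strictly positive)
a, b = np.polyfit(np.log10(x), np.log10(y), deg=1)

# equivalent power law on the original scale: y = 10^b * x^a
x_fit = np.linspace(np.min(x), np.max(x), 100)
y_fit = 10**b * x_fit**a

plt.loglog(x, y, 'o', label='data')
plt.loglog(x_fit, y_fit, 'r-',
           label='y = 10^{:.3f} * x^{:.3f}'.format(b, a))
plt.legend()
plt.show()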
I am trying to define the Archimedean spiral. When I try to define the inclination angle (incl) of the tangent vector to the orbit (i.e. tan(incl)), I get the errors
'numpy.ufunc' object does not support item assignment
and "can't assign to function call",
and the same errors when I want to calculate cos(incl) and sin(incl).
Any suggestions or help would be appreciated.
My code is:
T = 100
N = 10000
dt = float(T)/N
D = 2
DII = 10
a = 2.
v = 0.23
omega = 0.2
r0 = v/omega
t = np.linspace(0,T,N+1)
r = v*t
theta = a + r/r0
theta = omega*t
x = r * np.cos(omega*t)
y = r * np.sin(omega*t)
dxdr = np.cos(theta) - (r/r0)*np.sin(theta)
dydr = np.sin(theta) + (r/r0)*np.cos(theta)
dydx = (r0*np.sin(theta) + r*np.cos(theta))/r0*np.cos(theta) - r*np.sin(theta)
np.tan[incl] = dydx
incl = np.arctan((dydx))
### Calculate cos(incl) ,sin(incl) :
np.sin[np.incl] = np.tan(np.incl)/np.sqrt(1 + np.tan(np.incl)*2)
np.cos[incl] = 1/np.sqrt(1 + np.tan(incl)*2)
p1, = plt.plot(xx, yy)
i= 0 # this is the first value of the array
Bx = np.array([np.cos(i), -np.sin(i)])
By = np.array([np.sin(i), np.cos(i)])
n = 1000
seed(2)
finalpositions = []
for number in range(0, 10):
    x = []
    y = []
    x.append(0)
    y.append(0)
    for i in range(n):
        s = np.random.normal(0, 1, 2)
        deltaX = Bx[0]*np.sqrt(2*DII*dt)*s[0] + Bx[1]*np.sqrt(2*D*dt)*s[1]
        deltaY = By[0]*np.sqrt(2*DII*dt)*s[0] + By[1]*np.sqrt(2*D*dt)*s[1]
        x.append(x[-1] + deltaX)
        y.append(y[-1] + deltaY)
    finalpositions.append([x[-1], y[-1]])
p2, = plt.plot(finalpositions[:,0],finalpositions[:,1],'*')
plt.show()
The error message is correct: you are trying to assign to a function! I think you're trying to compute a value that represents the sin, cos or tan of a value, but that doesn't mean you need to assign to np.sin, etc. What you want is to calculate the value which represents the trig function, and then use the inverse trig function to get the angle:
## np.tan[incl]= dydx ## np.tan is a function, so you cannot index it like an array, and you should not assign to it.
incl = np.arctan((dydx)) ## this is all you need to get "incl"
### Calculate cos(incl) ,sin(incl) :
## NOTE: you already have the angle you need!! No need for a complicated formula to compute the sin or cos!
sin_incl = np.sin(incl)
cos_incl = np.cos(incl)
EDIT: One additional comment...np is a module that contains lots of numeric methods. When you calculate incl, it is not part of np! So there is no need to reference it like np.incl. Just use incl.
EDIT2: Another problem I found is this line:
dydx = (r0*np.sin(theta) + r*np.cos(theta))/r0*np.cos(theta) - r*np.sin(theta)
To calculate dydx, you're just dividing dydr by dxdr, but that's not what your code does! You need parens around the denominator like this:
dydx = (r0*np.sin(theta) + r*np.cos(theta))/(r0*np.cos(theta) - r*np.sin(theta))
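Since dxdr and dydr are already computed in the question, a slightly simpler way to write the same fix is to divide them directly (a sketch assuming those arrays are in scope):
# equivalent to the parenthesized expression above: dy/dx = (dy/dr) / (dx/dr)
dydx = dydr / dxdr

incl = np.arctan(dydx)
sin_incl = np.sin(incl)
cos_incl = np.cos(incl)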
I'm new to programming and scientific computing. Below is some code for evaluating an exponential integral over a grid. The integral is a function of radial distance from a point. I would like to sum the contributions from multiple points (with defined x, y coordinates) over the grid. I realize analytically this is simple superposition, but I'm confused about how to construct the loop that sums the contributions from the points, and about what the most efficient approach is. If anyone has any suggestions or references, it will be much appreciated. The code setting up the grid and evaluating the function is below:
from numpy import empty, sqrt
from scipy.special import expn

S=.0004
xi0 = 1.0
dx = 10.0
side = 100.0
points = 500
spacing = side/points
x1 = side/2 + dx/2
y1 = side/2
x2 = side/2 - dx/2
y2 = side/2
xi = empty([points,points],float)
for i in range(points):
    y = spacing*i
    for j in range(points):
        x = spacing*j
        r1 = sqrt((x-x1)**2+(y-y1)**2)
        r2 = sqrt((x-x2)**2+(y-y2)**2)
        u = (r1*r1*S)
        xi[i,j] = expn(1,u)
Maybe something like this would be appropriate:
x = np.linspace(0, side, points)
y = np.linspace(0, side, points)
r1 = np.sqrt((x-x1)**2 + (y-y1)**2)
r2 = np.sqrt((x-x2)**2 + (y-y2)**2)
u = (r1 * r1 * S)
xi = expn(1, u)
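The question, though, evaluates the integral on a full 2-D grid and wants the superposition of several points. A vectorized sketch of that idea (assuming the same S, side, points, x1, y1, x2, y2 as defined in the question, and scipy.special.expn for the exponential integral) could be:
import numpy as np
from scipy.special import expn

# 2-D grid with the same spacing = side/points as in the question
x = np.linspace(0, side, points, endpoint=False)
y = np.linspace(0, side, points, endpoint=False)
X, Y = np.meshgrid(x, y)

# superposition: evaluate the exponential integral for each source point and sum
sources = [(x1, y1), (x2, y2)]           # add more (x, y) tuples for more points
xi = np.zeros_like(X)
for xs, ys in sources:
    r_sq = (X - xs)**2 + (Y - ys)**2     # squared distance to this source
    xi += expn(1, r_sq * S)              # expn(1, 0) is infinite, so keep sources off grid nodes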
Main Problem: How can the scipy.signal.cwt() function be inverted?
I have seen that MATLAB has an inverse continuous wavelet transform function which will return the original form of the data from the wavelet transform, although you can filter out the slices you don't want.
MATLAB inverse cwt function
Since scipy doesn't appear to have the same function, I have been trying to figure out how to get the data back in the same form, while removing the noise and background.
How do I do this?
I tried squaring it to remove negative values, but this gives me values that are way too large and not quite right.
Here is what I have been trying:
# Compute the wavelet transform
widths = range(1,11)
cwtmatr = signal.cwt(xy['y'], signal.ricker, widths)
# Maybe we multiple by the original data? and square?
WT_to_original_data = (xy['y'] * cwtmatr)**2
And here is a fully runnable short script to show you the type of data I am trying to get and what I have so far:
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
# Make some random data with peaks and noise
def make_peaks(x):
    bkg_peaks = np.array(np.zeros(len(x)))
    desired_peaks = np.array(np.zeros(len(x)))
    # Make peaks which contain the data desired
    # (Mid range/frequency peaks)
    for i in range(0,10):
        center = x[-1] * np.random.random() - x[0]
        amp = 60 * np.random.random() + 10
        width = 10 * np.random.random() + 5
        desired_peaks += amp * np.e**(-(x-center)**2/(2*width**2))
    # Also make background peaks (not desired)
    for i in range(0,3):
        center = x[-1] * np.random.random() - x[0]
        amp = 40 * np.random.random() + 10
        width = 100 * np.random.random() + 100
        bkg_peaks += amp * np.e**(-(x-center)**2/(2*width**2))
    return bkg_peaks, desired_peaks
x = np.array(range(0, 1000))
bkg_peaks, desired_peaks = make_peaks(x)
y_noise = np.random.normal(loc=30, scale=10, size=len(x))
y = bkg_peaks + desired_peaks + y_noise
xy = np.array(list(zip(x,y)), dtype=[('x',float), ('y',float)])
# Compute the wavelet transform
# I can't figure out what the width is or does?
widths = range(1,11)
# Ricker is 2nd derivative of Gaussian
# (*close* to what *most* of the features are in my data)
# (They're actually Lorentzians and Breit-Wigner-Fano lines)
cwtmatr = signal.cwt(xy['y'], signal.ricker, widths)
# Maybe we multiple by the original data? and square?
WT = (xy['y'] * cwtmatr)**2
# plot the data and results
fig = plt.figure()
ax_raw_data = fig.add_subplot(4,3,1)
ax = {}
for i in range(0, 11):
    ax[i] = fig.add_subplot(4,3, i+2)
ax_desired_transformed_data = fig.add_subplot(4,3,12)
ax_raw_data.plot(xy['x'], xy['y'], 'g-')
for i in range(0,10):
    ax[i].plot(xy['x'], WT[i])
ax_desired_transformed_data.plot(xy['x'], desired_peaks, 'k-')
fig.tight_layout()
plt.show()
This script will output an image in which the first plot is the raw data, the middle plots are the wavelet transforms, and the last plot is what I want to get out as the processed (background and noise removed) data.
Does anyone have any suggestions? Thank you so much for the help.
I ended up finding a package called mlpy which provides an undecimated wavelet transform and its inverse: mlpy.wavelet.uwt and mlpy.wavelet.iuwt. This is the runnable script I ended up with, which may interest people trying to do noise or background removal:
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
import mlpy.wavelet as wave
# Make some random data with peaks and noise
############################################################
def gen_data():
    def make_peaks(x):
        bkg_peaks = np.array(np.zeros(len(x)))
        desired_peaks = np.array(np.zeros(len(x)))
        # Make peaks which contain the data desired
        # (Mid range/frequency peaks)
        for i in range(0,10):
            center = x[-1] * np.random.random() - x[0]
            amp = 100 * np.random.random() + 10
            width = 10 * np.random.random() + 5
            desired_peaks += amp * np.e**(-(x-center)**2/(2*width**2))
        # Also make background peaks (not desired)
        for i in range(0,3):
            center = x[-1] * np.random.random() - x[0]
            amp = 80 * np.random.random() + 10
            width = 100 * np.random.random() + 100
            bkg_peaks += amp * np.e**(-(x-center)**2/(2*width**2))
        return bkg_peaks, desired_peaks

    # make x axis
    x = np.array(range(0, 1000))
    bkg_peaks, desired_peaks = make_peaks(x)
    avg_noise_level = 30
    std_dev_noise = 10
    size = len(x)
    scattering_noise_amp = 100
    scat_center = 100
    scat_width = 15
    scat_std_dev_noise = 100
    y_scattering_noise = np.random.normal(scattering_noise_amp, scat_std_dev_noise, size) * np.e**(-(x-scat_center)**2/(2*scat_width**2))
    y_noise = np.random.normal(avg_noise_level, std_dev_noise, size) + y_scattering_noise
    y = bkg_peaks + desired_peaks + y_noise
    # list(...) keeps this working under Python 3, where zip returns an iterator
    xy = np.array(list(zip(x,y)), dtype=[('x',float), ('y',float)])
    return xy
# Random data Generated
#############################################################
xy = gen_data()
# Make 2**n amount of data
new_y, bool_y = wave.pad(xy['y'])
orig_mask = np.where(bool_y==True)
# wavelet transform parameters
levels = 8
wf = 'h'
k = 2
# Remove Noise first
# Wave transform
wt = wave.uwt(new_y, wf, k, levels)
# Matrix of the difference between each wavelet level and the original data
diff_array = np.array([(wave.iuwt(wt[i:i+1], wf, k)-new_y) for i in range(len(wt))])
# Index of the level which is most similar to original data (to obtain smoothed data)
indx = np.argmin(np.sum(diff_array**2, axis=1))
# Use the wavelet levels around this region
noise_wt = wt[indx:indx+1]
# smoothed data in 2^n length
new_y = wave.iuwt(noise_wt, wf, k)
# Background Removal
error = 10000
errdiff = 100
i = -1
iter_y_dict = {0:np.copy(new_y)}
bkg_approx_dict = {0:np.array([])}
while abs(errdiff)>=1*10**-24:
    i += 1
    # Wave transform
    wt = wave.uwt(iter_y_dict[i], wf, k, levels)
    # Assume last slice is lowest frequency (background approximation)
    bkg_wt = wt[-3:-1]
    bkg_approx_dict[i] = wave.iuwt(bkg_wt, wf, k)
    # Get the error
    errdiff = error - sum(iter_y_dict[i] - bkg_approx_dict[i])**2
    error = sum(iter_y_dict[i] - bkg_approx_dict[i])**2
    # Make every peak higher than bkg_wt
    diff = (new_y - bkg_approx_dict[i])
    peak_idxs_to_remove = np.where(diff>0.)[0]
    iter_y_dict[i+1] = np.copy(new_y)
    iter_y_dict[i+1][peak_idxs_to_remove] = np.copy(bkg_approx_dict[i])[peak_idxs_to_remove]
# new data without noise and background
new_y = new_y[orig_mask]
bkg_approx = bkg_approx_dict[len(bkg_approx_dict.keys())-1][orig_mask]
new_data = diff[orig_mask]
##############################################################
# plot the data and results
fig = plt.figure()
ax_raw_data = fig.add_subplot(121)
ax_WT = fig.add_subplot(122)
ax_raw_data.plot(xy['x'], xy['y'], 'g')
for bkg in bkg_approx_dict.values():
    ax_raw_data.plot(xy['x'], bkg[orig_mask], 'k')
ax_WT.plot(xy['x'], new_data, 'y')
fig.tight_layout()
plt.show()
And here is the output I am getting now:
As you can see, there is still a problem with the background removal (it shifts to the right after each iteration), but it is a different question which I will address here.