I have a simple time series and some code implementing a moving average:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
keras = tf.keras
def plot_series(time, series, format="-", start=0, end=None, label=None):
plt.plot(time[start:end], series[start:end], format, label=label)
plt.xlabel("Time")
plt.ylabel("Value")
if label:
plt.legend(fontsize=14)
plt.grid(True)
def trend(time, slope=0):
return slope * time
def seasonal_pattern(season_time):
"""Just an arbitrary pattern, you can change it if you wish"""
return np.where(season_time < 0.4,
np.cos(season_time * 2 * np.pi),
1 / np.exp(3 * season_time))
def seasonality(time, period, amplitude=1, phase=0):
"""Repeats the same pattern at each period"""
season_time = ((time + phase) % period) / period
return amplitude * seasonal_pattern(season_time)
def white_noise(time, noise_level=1, seed=None):
rnd = np.random.RandomState(seed)
return rnd.randn(len(time)) * noise_level
time = np.arange(4 * 365 + 1)
slope = 0.05
baseline = 10
amplitude = 40
series = baseline + trend(time, slope) + seasonality(time, period=365, amplitude=amplitude)
noise_level = 5
noise = white_noise(time, noise_level, seed=42)
series += noise
plt.figure(figsize=(10, 6))
plot_series(time, series)
plt.show()
def moving_average_forecast(series, window_size):
"""Forecasts the mean of the last few values.
If window_size=1, then this is equivalent to naive forecast"""
forecast = []
for time in range(len(series) - window_size):
forecast.append(series[time:time + window_size].mean())
return np.array(forecast)
split_time = 1000
time_train = time[:split_time]
x_train = series[:split_time]
time_valid = time[split_time:]
x_valid = series[split_time:]
moving_avg = moving_average_forecast(series, 30)[split_time - 30:]
plt.figure(figsize=(10, 6))
plot_series(time_valid, x_valid, label="Series")
plot_series(time_valid, moving_avg, label="Moving average (30 days)")
I am not getting this part:
for time in range(len(series) - window_size):
forecast.append(series[time:time + window_size].mean())
return np.array(forecast)
What I do not understand is how series[time:time + window_size] works. window_size is passed into the function and specifies how many days are used to calculate the mean, for example 5 or 30 days.
When I try something similar to illustrate this to myself, like
plot(series[time:time + 30]), it does not work.
Furthermore, I do not get how len(series) - window_size works.
Debug your code and add some print statements to see how it is responding.
Write the results down and try to analyze them.
Then step back and write similar code that reproduces the same output.
Compare:
if it is the same, congrats;
if it is not, try running both again with timers on and see which one is faster;
if your code is faster, congrats.
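For example, a quick way to add such prints is to instrument the loop on a tiny toy series so the output stays readable (a sketch, not part of the original code):
import numpy as np
series = np.arange(10, dtype=float)  # toy series: 0.0, 1.0, ..., 9.0
window_size = 3
forecast = []
for time in range(len(series) - window_size):
    window = series[time:time + window_size]
    # print the slice bounds, the window itself, and its mean
    print(f"time={time}, slice=[{time}:{time + window_size}], window={window}, mean={window.mean()}")
    forecast.append(window.mean())
print("number of forecasts:", len(forecast))  # len(series) - window_size = 7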
Seems like the function moving_average_forecast simply calculates the x-day rolling average? If that is the intention then:
The line for time in range(len(series) - window_size): gives you an index time that goes from 0 to n = len(series) - window_size - 1, so the loop produces exactly len(series) - window_size averages, one forecast for each point from index window_size to the end of the series (i.e. if you have 11 data points and want 10-day rolling averages, here N = 11 = len(series) and window_size = 10, so the loop runs once, time = [0], and you get a single forecast, which is used as the prediction for index 10).
The line series[time:time + window_size] simply indexes into your data contained in series and returns exactly window_size values, because the stop index of a Python slice is exclusive (i.e. using our example, in the first iteration time = 0 and time + window_size = 10, so series[0:10] returns an array with the first 10 data points, indices 0 through 9), and each of those windows is then averaged.
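To see the slicing on its own, here is a minimal sketch with a tiny array (not from the original post) that shows both the slice and the loop bound:
import numpy as np
series = np.array([10., 20., 30., 40., 50.])
window_size = 3
print(series[0:0 + window_size])   # [10. 20. 30.] -> mean 20.0
print(series[1:1 + window_size])   # [20. 30. 40.] -> mean 30.0
print(len(series) - window_size)   # 2 -> the loop runs for time = 0 and time = 1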
Hope that helps.
Related
I am looking for some help speeding up some code that I have written in Numpy. Here is the code:
def TimeChunks(timevalues, num):
avg = len(timevalues) / float(num)
out = []
last = 0.0
while last < len(timevalues):
out.append(timevalues[int(last):int(last + avg)])
last += avg
return out
### chunk i can be called by out[i] ###
NumChunks = 100000
t1chunks = TimeChunks(t1, NumChunks)
t2chunks = TimeChunks(t2, NumChunks)
NumofBins = 2000
CoincAllChunks = 0
for i in range(NumChunks):
CoincOneChunk = 0
Hist1, something1 = np.histogram(t1chunks[i], NumofBins)
Hist2, something2 = np.histogram(t2chunks[i], NumofBins)
Mask1 = (Hist1>0)
Mask2 = (Hist2>0)
MaskCoinc = Mask1*Mask2
CoincOneChunk = np.sum(MaskCoinc)
CoincAllChunks = CoincAllChunks + CoincOneChunk
Is there anything that can be done to improve this to make it more efficient for large arrays?
To explain the point of the code in a nutshell, the purpose of the code is simply to find the average "coincidences" between two NumPy arrays, representing time values of two channels (divided by some normalisation constant). This "coincidence" occurs when there is at least one time value in each of the two channels in a certain time interval.
For example:
t1 = [.4, .7, 1.1]
t2 = [0.8, .9, 1.5]
There is a coincidence in the window [0,1] and one coincidence in the interval [1, 2].
I want to find the average number of these "coincidences" when I break down my time array into a number of equally distributed bins. So for example if:
t1 = [.4, .7, 1.1, 2.1, 3, 3.3]
t2 = [0.8, .9, 1.5, 2.2, 3.1, 4]
And I want 4 bins, the intervals I'll consider are ([0,1], [1,2], [2,3], [3,4]). Therefore the total coincidences will be 4 (because there is a coincidence in each bin), and therefore the average coincidences will be 4.
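For reference, a direct way to count the coincidences for this small example (a sketch, assuming shared bin edges [0, 1, 2, 3, 4]) would be:
import numpy as np
t1 = np.array([0.4, 0.7, 1.1, 2.1, 3.0, 3.3])
t2 = np.array([0.8, 0.9, 1.5, 2.2, 3.1, 4.0])
edges = np.array([0., 1., 2., 3., 4.])   # 4 bins: [0,1), [1,2), [2,3), [3,4]
h1, _ = np.histogram(t1, bins=edges)
h2, _ = np.histogram(t2, bins=edges)
print(np.sum((h1 > 0) & (h2 > 0)))       # 4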
The code I posted above is an attempt to do this for large time arrays with very small bin sizes; to make it work I had to break my time arrays into smaller chunks and then for-loop through each of these chunks.
I've tried making this as vectorized as I can, but it still is very slow...
Any ideas what can be done to speed it up further?
Any suggestions/hints will be appreciated. Thanks!
This is about 17x faster, and more correct, using a custom numba_histogram function that beats the generic np.histogram. Note that your code computes and compares the histograms of the two series with separately determined bin edges, which is not accurate for your purpose. So, in my numba_histogram function I use the same bin edges to compute the histograms of both series simultaneously.
We can still optimize it even further if you provide more precise details about the algorithm, namely the parameters and the criteria on which you decide that two intervals coincide.
import numpy as np
from numba import njit
@njit
def numba_histogram(a, b, n):
hista, histb = np.zeros(n, dtype=np.intp), np.zeros(n, dtype=np.intp)
a_min, a_max = min(a[0], b[0]), max(a[-1], b[-1])
for x, y in zip(a, b):
bin = n * (x - a_min) / (a_max - a_min)
if x == a_max:
hista[n - 1] += 1
elif bin >= 0 and bin < n:
hista[int(bin)] += 1
bin = n * (y - a_min) / (a_max - a_min)
if y == a_max:
histb[n - 1] += 1
elif bin >= 0 and bin < n:
histb[int(bin)] += 1
return np.sum( (hista > 0) * (histb > 0) )
@njit
def calc_coincidence(t1, t2):
NumofBins = 2000
NumChunks = 100000
avg = len(t1) / NumChunks
CoincAllChunks = 0
last = 0.0
while last < len(t1):
t1chunks = t1[int(last):int(last + avg)]
t2chunks = t2[int(last):int(last + avg)]
CoincAllChunks += numba_histogram(t1chunks, t2chunks, NumofBins)
last += avg
return CoincAllChunks
Test with 10**8 arrays:
t1 = np.arange(10**8) + np.random.rand(10**8)
t2 = np.arange(10**8) + np.random.rand(10**8)
CoincAllChunks = calc_coincidence(t1, t2)
print( CoincAllChunks )
# 34793890 Time: 24.96140170097351 sec. (Original)
# 34734897 Time: 1.499996423721313 sec. (Optimized)
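In case it is useful, a simple harness along these lines could reproduce such timings (a sketch; note that the first call to an @njit function includes JIT compilation, so it is warmed up before timing):
import time
calc_coincidence(t1[:10**6], t2[:10**6])   # warm-up call so compilation is not timed
start = time.perf_counter()
CoincAllChunks = calc_coincidence(t1, t2)
print(CoincAllChunks, "Time:", time.perf_counter() - start, "sec.")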
I am trying to plot the relationship between period and amplitude for an undamped and undriven pendulum in the regime where the small-angle approximation breaks down; however, my code did not do what I expected...
I think I should be expecting a strictly increasing graph as shown in this video: https://www.youtube.com/watch?v=34zcw_nNFGU
Here is my code; I used the zero-crossing method to calculate the period:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp
from itertools import chain
# Second order differential equation to be solved:
# d^2 theta/dt^2 = - (g/l)*sin(theta) - q* (d theta/dt) + F*sin(omega*t)
# set g = l and omega = 2/3 rad per second
# Let y[0] = theta, y[1] = d(theta)/dt
def derivatives(t,y,q,F):
return [y[1], -np.sin(y[0])-q*y[1]+F*np.sin((2/3)*t)]
t = np.linspace(0.0, 100, 10000)
#initial conditions:theta0, omega0
theta0 = np.linspace(0.0,np.pi,100)
q = 0.0 #alpha / (mass*g), resistive term
F = 0.0 #G*np.sin(2*t/3)
value = []
amp = []
period = []
for i in range (len(theta0)):
sol = solve_ivp(derivatives, (0.0,100.0), (theta0[i], 0.0), method = 'RK45', t_eval = t,args = (q,F))
velocity = sol.y[1]
time = sol.t
zero_cross = 0
for k in range (len(velocity)-1):
if (velocity[k+1]*velocity[k]) < 0:
zero_cross += 1
value.append(k)
else:
zero_cross += 0
if zero_cross != 0:
amp.append(theta0[i])
# period calculated using the time evolved between the first and last zero-crossing detected
period.append((2*(time[value[zero_cross - 1]] - time[value[0]])) / (zero_cross -1))
plt.plot(amp,period)
plt.title('Period of oscillation of an undamped, undriven pendulum \nwith varying initial angular displacement')
plt.xlabel('Initial Displacement')
plt.ylabel('Period/s')
plt.show()
You can use the event mechanism of solve_ivp for such tasks; it is designed for exactly these "simple" situations.
def halfperiod(t,y): return y[1]
halfperiod.terminal=True # stop when root found
halfperiod.direction=1 # find sign changes from negative to positive
for i in range (1,len(theta0)): # amp==0 gives no useful result
sol = solve_ivp(derivatives, (0.0,100.0), (theta0[i], 0.0), method = 'RK45', events =(halfperiod,) )
if sol.success and len(sol.t_events[-1])>0:
period.append(2*sol.t_events[-1][0]) # the full period is twice the event time
amp.append(theta0[i])
This results in the plot
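As a cross-check (an addition, not part of the answer above): with g = l the undamped pendulum has the closed-form period T(theta0) = 4*K(m) with m = sin(theta0/2)**2, where K is the complete elliptic integral of the first kind in scipy's parameter convention, so the expected strictly increasing curve can be overlaid:
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import ellipk
theta0 = np.linspace(0.01, np.pi - 0.01, 200)   # avoid 0 and pi, where the curve is trivial or K diverges
T_exact = 4 * ellipk(np.sin(theta0 / 2) ** 2)   # scipy's ellipk takes the parameter m = k**2
plt.plot(theta0, T_exact, label="analytic period (g = l)")
plt.xlabel("Initial Displacement")
plt.ylabel("Period/s")
plt.legend()
plt.show()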
I'm trying to plot the time-evolution graph for the Ornstein-Uhlenbeck process, which is a stochastic process, and then find the probability distribution at each time step. I'm able to plot the graph for 1000 realizations of the process. Each realization has 1000 time steps, with a time-step width of .001. I used a 1000 x 1000 array to store the data: each row holds one realization, and column-wise the i-th column holds the values at the i-th time step for all 1000 realizations.
Now I want to bin the results at each time step together and then plot the probability distribution corresponding to each time step. I'm quite confused about how to do this (I tried modifying code from the IPython Cookbook, where they don't store each realization in memory).
The code that I made from the IPython Cookbook:
import numpy as np
import matplotlib.pyplot as plt
sigma = 1. # Standard deviation.
mu = 10. # Mean.
tau = .05 # Time constant.
dt = .001 # Time step.
T = 1. # Total time.
n = int(T / dt) # Number of time steps.
ntrails = 1000 # Number of Realizations.
t = np.linspace(0., T, n) # Vector of times.
sigmabis = sigma * np.sqrt(2. / tau)
sqrtdt = np.sqrt(dt)
x = np.zeros((ntrails,n)) # Vector containing all successive values of our process
for j in range (ntrails): # Euler Method
for i in range(n - 1):
x[j,i + 1] = x[j,i] + dt * (-(x[j,i] - mu) / tau) + sigmabis * sqrtdt * np.random.randn()
for k in range(ntrails): #plotting 1000 realizations
plt.plot(t, x[k])
# Time averaging of each time stamp using bin
# Really lost from this point onwards.
bins = np.linspace(-2., 15., 100)
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
for i in range(ntrails):
hist, _ = np.histogram(x[:,[i]], bins=bins)
ax.plot(hist)
Graph for 1000 realizations of the Ornstein-Uhlenbeck process:
Distribution generated from the code above:
I'm really lost with assigning the bin values and plotting the histogram using them. I want to know whether my code is correct for plotting the distributions corresponding to each time step using bins. If not, please tell me what modifications I need to make to my code.
The last for loop should iterate over n, not ntrails (they happen to have the same value here), but otherwise the code and plots look correct, apart from a few minor issues, such as that it takes 101 breaks to get 100 bins, so your code should probably read bins = np.linspace(-2., 15., 101).
Your plots could be improved a bit though. A good guiding principle is to use as little ink as necessary to communicate the point that you are trying to make. You are always trying to plot all the data, which ends up obscuring your plots. Also, you could benefit from paying more attention to colour. Colour should carry meaning, or not be used at all.
Here would be my suggestions:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False
sigma = 1. # Standard deviation.
mu = 10. # Mean.
tau = .05 # Time constant.
dt = .001 # Time step.
T = 1 # Total time.
n = int(T / dt) # Number of time steps.
ntrails = 10000 # Number of Realizations.
t = np.linspace(0., T, n) # Vector of times.
sigmabis = sigma * np.sqrt(2. / tau)
sqrtdt = np.sqrt(dt)
x = np.zeros((ntrails,n)) # Vector containing all successive values of our process
for j in range(ntrails): # Euler Method
for i in range(n - 1):
x[j,i + 1] = x[j,i] + dt * (-(x[j,i] - mu) / tau) + sigmabis * sqrtdt * np.random.randn()
fig, ax = plt.subplots()
for k in range(200): # plotting fewer realizations shows the distribution better in this case
ax.plot(t, x[k], color='k', alpha=0.02)
# Really lost from this point onwards.
bins = np.linspace(-2., 15., 101) # you need 101 breaks to get 100 bins
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
# plotting a smaller selection of time points spaced out using a log scale prevents
# the asymptotic approach to the mean from dominating the plot
for idx, i in enumerate(np.logspace(0, np.log10(n)-1, 21)):
    hist, _ = np.histogram(x[:,[int(i)]], bins=bins)
    ax.plot(hist, color=plt.cm.plasma(idx / 20))  # colour encodes the position in the sequence of time points
plt.show()
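As a quick sanity check appended to the script above (an addition, not part of the original suggestion): for this parameterization the process relaxes towards a stationary Gaussian with mean mu and standard deviation sigma, so the distribution at the final time step should roughly match those values:
final = x[:, -1]                      # all realizations at the last time step
print("sample mean:", final.mean(), " expected ~", mu)
print("sample std: ", final.std(), " expected ~", sigma)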
What I'm trying to do seems simple: I want to know exactly what frequencies there are in a .wav file at given times; i.e. "from the time n milliseconds to n + 10 milliseconds, the average frequency of the sound was x hertz". I have seen people talking about Fourier transforms and Goertzel algorithms, as well as various modules, that I can't seem to figure out how to get to do what I've described.
What I'm looking for is a solution like this pseudocode, or at least one that will do something like what the pseudocode is getting at:
import some_module_that_can_help_me_do_this as freq
file = 'output.wav'
start_time = 1000 # Start 1000 milliseconds into the file
end_time = 1010 # End 10 milliseconds thereafter
print("Average frequency = " + str(freq.average(start_time, end_time)) + " hz")
I don't come from a mathematics background, so I don't want to have to understand the implementation details.
If you'd like to detect the pitch of a sound (and it seems you do), then in terms of Python libraries your best bet is aubio. Please consult this example for the implementation.
import sys
import numpy as np
from aubio import source, pitch
win_s = 4096
hop_s = 512
samplerate = 0  # 0 lets aubio use the file's native sample rate
s = source(your_file, samplerate, hop_s)
samplerate = s.samplerate
tolerance = 0.8
pitch_o = pitch("yin", win_s, hop_s, samplerate)
pitch_o.set_unit("midi")
pitch_o.set_tolerance(tolerance)
pitches = []
confidences = []
total_frames = 0
while True:
samples, read = s()
pitch = pitch_o(samples)[0]
pitches += [pitch]
confidence = pitch_o.get_confidence()
confidences += [confidence]
total_frames += read
if read < hop_s: break
print("Average frequency = " + str(np.array(pitches).mean()) + " hz")
Be sure to check docs on pitch detection methods.
I also thought you might be interested in an estimation of the mean frequency and some other audio parameters without using any special libraries. Let's just use numpy! This should give you much better insight into how such audio features can be calculated. It's based on specprop from the seewave package. Check the docs for the meaning of the computed features.
import numpy as np
def spectral_properties(y: np.ndarray, fs: int) -> dict:
spec = np.abs(np.fft.rfft(y))
freq = np.fft.rfftfreq(len(y), d=1 / fs)
spec = np.abs(spec)
amp = spec / spec.sum()
mean = (freq * amp).sum()
sd = np.sqrt(np.sum(amp * ((freq - mean) ** 2)))
amp_cumsum = np.cumsum(amp)
median = freq[len(amp_cumsum[amp_cumsum <= 0.5]) + 1]
mode = freq[amp.argmax()]
Q25 = freq[len(amp_cumsum[amp_cumsum <= 0.25]) + 1]
Q75 = freq[len(amp_cumsum[amp_cumsum <= 0.75]) + 1]
IQR = Q75 - Q25
z = amp - amp.mean()
w = amp.std()
skew = ((z ** 3).sum() / (len(spec) - 1)) / w ** 3
kurt = ((z ** 4).sum() / (len(spec) - 1)) / w ** 4
result_d = {
'mean': mean,
'sd': sd,
'median': median,
'mode': mode,
'Q25': Q25,
'Q75': Q75,
'IQR': IQR,
'skew': skew,
'kurt': kurt
}
return result_d
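A possible way to call this on the 10 ms slice described in the question might be the following sketch (the filename and millisecond bounds are just the example values from the question):
from scipy.io import wavfile
fs, data = wavfile.read('output.wav')
if data.ndim > 1:            # keep only the first channel if the file is stereo
    data = data[:, 0]
start_ms, end_ms = 1000, 1010
segment = data[int(start_ms * fs / 1000):int(end_ms * fs / 1000)]
props = spectral_properties(segment, fs)
print(props['mean'], 'Hz (spectral centroid of the slice)')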
I felt the OP's frustration - it shouldn't be so hard to find out how to get the values of the spectrogram, instead of just seeing the spectrogram image, if someone needs to:
#!/usr/bin/env python
import librosa
import sys
import numpy as np
import matplotlib.pyplot as plt
import librosa.display
np.set_printoptions(threshold=sys.maxsize)
filename = 'filename.wav'
Fs = 44100
clip, sample_rate = librosa.load(filename, sr=Fs)
n_fft = 1024 # frame length
start = 0
hop_length=512
#commented out code to display Spectrogram
X = librosa.stft(clip, n_fft=n_fft, hop_length=hop_length)
#Xdb = librosa.amplitude_to_db(abs(X))
#plt.figure(figsize=(14, 5))
#librosa.display.specshow(Xdb, sr=Fs, x_axis='time', y_axis='hz')
#If you want to show log-scaled frequencies instead
#librosa.display.specshow(Xdb, sr=Fs, x_axis='time', y_axis='log')
#plt.colorbar()
#librosa.display.waveplot(clip, sr=Fs)
#plt.show()
#now print all values
t_samples = np.arange(clip.shape[0]) / Fs
t_frames = np.arange(X.shape[1]) * hop_length / Fs
#f_hertz = np.arange(N / 2 + 1) * Fs / N # Works only when N is even
f_hertz = np.fft.rfftfreq(n_fft, 1 / Fs) # Works also when N is odd
#example
print('Time (seconds) of last sample:', t_samples[-1])
print('Time (seconds) of last frame: ', t_frames[-1])
print('Frequency (Hz) of last bin: ', f_hertz[-1])
print('Time (seconds) :', len(t_samples))
#prints array of time frames
print('Time of frames (seconds) : ', t_frames)
#prints array of frequency bins
print('Frequency (Hz) : ', f_hertz)
print('Number of frames : ', len(t_frames))
print('Number of bins : ', len(f_hertz))
#This code is working to printout frame by frame intensity of each frequency
#on top line gives freq bins
curLine = 'Bins,'
for b in range(1, len(f_hertz)):
curLine += str(f_hertz[b]) + ','
print(curLine)
curLine = ''
for f in range(1, len(t_frames)):
curLine = str(t_frames[f]) + ','
for b in range(1, len(f_hertz)): #for each frame, we get list of bin values printed
curLine += str("%.02f" % np.abs(X[b, f])) + ','
#remove format of the float for full details if needed
#curLine += str(np.abs(X[b, f])) + ','
#print other useful info like phase of frequency bin b at frame f.
#curLine += str("%.02f" % np.angle(X[b, f])) + ','
print(curLine)
This answer is quite late, but you could try this:
(Note: I deserve very little credit for this since I got most of it from other SO posts and this great article on FFT using Python: https://realpython.com/python-scipy-fft/)
import numpy as np
from scipy.fft import *
from scipy.io import wavfile
def freq(file, start_time, end_time):
# Open the file and convert to mono
sr, data = wavfile.read(file)
if data.ndim > 1:
data = data[:, 0]
else:
pass
# Return a slice of the data from start_time to end_time
dataToRead = data[int(start_time * sr / 1000) : int(end_time * sr / 1000) + 1]
# Fourier Transform
N = len(dataToRead)
yf = rfft(dataToRead)
xf = rfftfreq(N, 1 / sr)
# Uncomment these to see the frequency spectrum as a plot
# plt.plot(xf, np.abs(yf))
# plt.show()
# Get the most dominant frequency and return it
idx = np.argmax(np.abs(yf))
freq = xf[idx]
return freq
This code can work for any .wav file, but it may be slightly off since it only returns the most dominant frequency, and also because it only uses the first channel of the audio (if not mono).
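For instance, matching the pseudocode in the question, a call could look like this (assuming an output.wav file exists; the returned value is the dominant frequency in that 10 ms slice):
file = 'output.wav'
start_time = 1000   # milliseconds
end_time = 1010     # milliseconds
print("Dominant frequency = " + str(freq(file, start_time, end_time)) + " hz")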
If you want to learn more about how the Fourier transform works, check out this video by 3blue1brown with a visual explanation: https://www.youtube.com/watch?v=spUNpyF58BY
Try something along the lines of the below; it worked for me with a sine-wave file with a frequency of 1234 Hz that I generated
from this page.
from scipy.io import wavfile
def freq(file, start_time, end_time):
sample_rate, data = wavfile.read(file)
start_point = int(sample_rate * start_time / 1000)
end_point = int(sample_rate * end_time / 1000)
length = (end_time - start_time) / 1000
counter = 0
for i in range(start_point, end_point):
if data[i] < 0 and data[i+1] > 0:
counter += 1
return counter/length
freq("sin.wav", 1000 ,2100)
1231.8181818181818
edited: cleaned up for loop a bit
The numpy.linalg.lstsq(a,b) function accepts an array a with size nx2 and a 1-dimensional array b which is the dependent variable.
How would I go about doing a least squares regression where the data points are presented as a 2d array generated from an image file? The array looks something like this:
[[0, 0, 0, 0, e]
[0, 0, c, d, 0]
[b, a, f, 0, 0]]
where a, b, c, d, e, f are positive integer values.
I want to fit a line to these points. Can I use np.linalg.lstsq (and if so, how) or is there something which may make more sense (and if so, how)?
Thanks very much.
A while ago I saw a similar Python program, from:
# Prac 2 for Monte Carlo methods in a nutshell
# Richard Chopping, ANU RSES and Geoscience Australia, October 2012
# Usage
# python prac_q2.py [number of bootstrap runs]
# e.g. python prac_q2.py 10000
# would execute this and perform 10 000 bootstrap runs.
# Default is 100 runs.
# sys cause I need to access the arguments the script was called with
import sys
# math cause it's handy for scalar maths
import math
# time cause I want to benchmark how long things take
import time
# numpy cause it gives us awesome array / matrix manipulation stuff
import numpy
# scipy just in case
import scipy
# scipy.stats to make life simpler statistically speaking
import scipy.stats as stats
def main():
print "Prac 2 solution: no graphs"
true_model = numpy.array([17.0, 10.0, 1.96])
# Here's a nifty way to write out numpy arrays.
# Unlike the data table in the prac handouts, I've got time first
# and height second.
# You can mix up the order but you need to change a lot of calculations
# to deal with this change.
data = numpy.array([[1.0, 26.94],
[2.0, 33.45],
[3.0, 40.72],
[4.0, 42.32],
[5.0, 44.30],
[6.0, 47.19],
[7.0, 43.33],
[8.0, 40.13]])
# Perform the least squares regression to find the best fit solution
best_fit = regression(data)
# Nifty way to get out elements from an array
m1,m2,m3 = best_fit
print "Best fit solution:"
print "m1 is", m1, "and m2 is", m2, "and m3 is", m3
# Calculate residuals from the best fit solution
best_fit_resid = residuals(data, best_fit)
print "The residuals from the best fit solution are:"
print best_fit_resid
print ""
# Bootstrap part
# --------------
# Number of bootstraps to run. 100 is a minimum and our default number.
num_booties = 100
# If we have an argument to the python script, use this as the
# number of bootstrap runs
if len(sys.argv) > 1:
num_booties = int(sys.argv[1])
# preallocate an array to store the results.
ensemble = numpy.zeros((num_booties, 3))
print "Starting up the bootstrap routine"
# How to do timing within a Python script - here I start a stopwatch running
start_time = time.clock()
for index in range(num_booties):
# Print every 10 % so we know where we're up to in long runs
if print_progress(index, num_booties):
percent = (float(index) / float(num_booties)) * 100.0
print "Have completed", percent, "percent"
# For each iteration of the bootstrap algorithm,
# first calculate mixed up residuals...
resamp_resid = resamp_with_replace(best_fit_resid)
# ... then generate new data...
new_data = calc_new_data(data, best_fit, resamp_resid)
# ... then perform another regression to generate a new set of m1, m2, m3
bootstrap_model = regression(new_data)
ensemble[index] = (bootstrap_model[0], bootstrap_model[1], bootstrap_model[2])
# Done with the loop
# Calculate the time the run took - what's the current time, minus when we started.
loop_time = time.clock() - start_time
print ""
print "Ensemble calculated based on", num_booties, "bootstrap runs."
print "Bootstrap runs took", loop_time, "seconds."
print ""
# Stats on the ensemble time
# --------------------------
B = num_booties
# Mean is pretty simple, 1.0/B to force it to use floating points
# This gives us an array of the means of the 3 model parameters
mean = 1.0/B * numpy.sum(ensemble, axis=0)
print "Mean is ([m1 m2 m3]):", mean
# Variance
var2 = 1.0/B * numpy.sum(((ensemble - mean)**2), axis=0)
print "Variance squared is ([m1 m2 m3]):", var2
# Bias
bias = mean - best_fit
print "Bias is ([m1 m2 m3]):", bias
bias_corr = best_fit - bias
print "Bias corrected solution is ([m1 m2 m3]):", bias_corr
print "The original solution was ([m1 m2 m3]):", best_fit
print "And the true solution is ([m1 m2 m3]):", true_model
print ""
# Confidence intervals
# ---------------------
# Sort column 1 to calculate confidence intervals
# Sorting in numpy sucks.
# Need to declare what the fields are (so it knows how to sort it)
# f8 => numpy's floating point number
# Then need to declare what we sort it on
# Here we sort on the first column, then the second, then the third.
# f0,f1,f2 field 0, then field 1, then field 2.
# Then we make sure we sort it by column (axis = 0)
# Then we take a view of that data as a float64 so it works properly
sorted_m1 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f0','f1','f2'], axis=0).view(numpy.float64)
# stats is my name for scipy.stats
# This has a wonderful function that calculates percentiles, including performing interpolation
# (important for low numbers of bootstrap runs)
m1_perc0p5 = stats.scoreatpercentile(sorted_m1,0.5)[0]
m1_perc2p5 = stats.scoreatpercentile(sorted_m1,2.5)[0]
m1_perc16 = stats.scoreatpercentile(sorted_m1,16)[0]
m1_perc84 = stats.scoreatpercentile(sorted_m1,84)[0]
m1_perc97p5 = stats.scoreatpercentile(sorted_m1,97.5)[0]
m1_perc99p5 = stats.scoreatpercentile(sorted_m1,99.5)[0]
print "m1 68% confidence interval is from", m1_perc16, "to", m1_perc84
print "m1 95% confidence interval is from", m1_perc2p5, "to", m1_perc97p5
print "m1 99% confidence interval is from", m1_perc0p5, "to", m1_perc99p5
print ""
# Now column 2, sort it...
sorted_m2 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f1','f0','f2'], axis=0).view(numpy.float64)
# ... and do stats.
m2_perc0p5 = stats.scoreatpercentile(sorted_m2,0.5)[1]
m2_perc2p5 = stats.scoreatpercentile(sorted_m2,2.5)[1]
m2_perc16 = stats.scoreatpercentile(sorted_m2,16)[1]
m2_perc84 = stats.scoreatpercentile(sorted_m2,84)[1]
m2_perc97p5 = stats.scoreatpercentile(sorted_m2,97.5)[1]
m2_perc99p5 = stats.scoreatpercentile(sorted_m2,99.5)[1]
print "m2 68% confidence interval is from", m2_perc16, "to", m2_perc84
print "m2 95% confidence interval is from", m2_perc2p5, "to", m2_perc97p5
print "m2 99% confidence interval is from", m2_perc0p5, "to", m2_perc99p5
print ""
# and finally column 3, again, sort it..
sorted_m3 = numpy.sort(ensemble.view('f8,f8,f8'), order=['f2','f1','f0'], axis=0).view(numpy.float64)
# ... and do stats.
m3_perc0p5 = stats.scoreatpercentile(sorted_m3,0.5)[1]
m3_perc2p5 = stats.scoreatpercentile(sorted_m3,2.5)[1]
m3_perc16 = stats.scoreatpercentile(sorted_m3,16)[1]
m3_perc84 = stats.scoreatpercentile(sorted_m3,84)[1]
m3_perc97p5 = stats.scoreatpercentile(sorted_m3,97.5)[1]
m3_perc99p5 = stats.scoreatpercentile(sorted_m3,99.5)[1]
print "m3 68% confidence interval is from", m3_perc16, "to", m3_perc84
print "m3 95% confidence interval is from", m3_perc2p5, "to", m3_perc97p5
print "m3 99% confidence interval is from", m3_perc0p5, "to", m3_perc99p5
print ""
# End of the main function
#
#
# Helper functions go down here
#
#
# regression
# This takes a 2D numpy array and performs a least-squares regression
# using the formula on the practical sheet, page 3
# Stored in the top are the real values
# Returns an array of m1, m2 and m3.
def regression(data):
# While testing, just return the real values
# real_values = numpy.array([17.0, 10.0, 1.96])
# Creating the G matrix
# ---------------------
# Because I'm using numpy arrays here, we need
# to learn some notation.
# data[:,0] is the FIRST column
# Length of this = number of time samples in data
N = len(data[:,0])
# numpy.sum adds up all data in a row or column.
# Axis = 0 implies add up each column. [0] at end
# returns the sum of the first column
# This is the sum of Ti for i = 1..N
sum_Ti = numpy.sum(data, axis=0)[0]
# numpy.power takes each element of an array and raises them to a given power
# In this one call we also take the sum of the columns (as above) after they have
# been squared, and then just take the t column
sum_Ti2 = numpy.sum(numpy.power(data, 2), axis=0)[0]
# Now we need to get the cube of Ti, then sum that result
sum_Ti3 = numpy.sum(numpy.power(data, 3), axis=0)[0]
# Finally we need the quartic of Ti, then sum that result
sum_Ti4 = numpy.sum(numpy.power(data, 4), axis=0)[0]
# Now we can construct the G matrix
G = numpy.array([[N, sum_Ti, -0.5 * sum_Ti2],
[sum_Ti, sum_Ti2, -0.5 * sum_Ti3],
[-0.5 * sum_Ti2, -0.5 * sum_Ti3, 0.25 * sum_Ti4]])
# We also need to take the inverse of the G matrix
G_inv = numpy.linalg.inv(G)
# Creating the d matrix
# ---------------------
# Hello numpy.sum, my old friend...
sum_Yi = numpy.sum(data, axis=0)[1]
# numpy.prod multiplies the values in an array.
# We need to do the products along axis 1 (i.e. row by row)
# Then sum all the elements
sum_TiYi = numpy.sum(numpy.prod(data, axis=1))
# The final element we need is a bit tricky.
# We need the product as above
TiYi = numpy.prod(data, axis=1)
# Then we get tricky. * works how we need it here,
# remember that the Ti column is referenced by data[:,0] as above
Ti2Yi = TiYi * data[:,0]
# Then we sum
sum_Ti2Yi = numpy.sum(Ti2Yi)
#With all the elements, we make the d matrix
d = numpy.array([sum_Yi,
sum_TiYi,
-0.5 * sum_Ti2Yi])
# Do the linear algebra stuff
# To multiply numpy arrays in a matrix style,
# we need to use numpy.dot()
# Not the most useful notation, but there you go.
# To help out the Matlab users: http://www.scipy.org/NumPy_for_Matlab_Users
result = G_inv.dot(d)
#Return this result
return result
# residuals:
# Takes in a data array, and an array of best fit parameters
# calculates the difference between the observed and predicted data
# and returns an array
def residuals(data, best_fit):
# Extract ti from the data array
ti = data[:,0]
# We also need an array of the square of ti
ti2 = numpy.power(ti, 2)
# Extract yi
yi = data[:,1]
# Calculate residual (data minus predicted)
result = yi - best_fit[0] - (best_fit[1] * ti) + (0.5 * best_fit[2] * ti2)
return result
# resamp_with_replace:
# Perform a dataset resampling with replacement on parameter set.
# Uses numpy.random to generate the random numbers to pick the indices to look up.
# So for item 0, ... N, we look up a random index from the set and put that in
# our resampled data.
def resamp_with_replace(set):
# How many things do we need to do this for?
N = len(set)
# Preallocate our result array
result = numpy.zeros(N)
    # Generate N random integers between 0 and N-1 (randint's upper bound is exclusive)
    indices = numpy.random.randint(0, N, N)
# For i from the set 0...N-1 (that's what the range() command gives us),
# our result for that i is given by the index we randomly generated above
for i in range(N):
result[i] = set[indices[i]]
return result
# calc_new_data:
# Given a set of resampled residuals, use the model parameters to derive
# new data. This is used for bootstrapping the residuals.
# true_data is a numpy array of rows of ti, yi. We only need the ti column though.
# model is an array of three parameters, corresponding to m1, m2, m3.
# residuals is an array of our residuals
def calc_new_data(true_data, model, residuals):
# Extract the time information from the new data array
ti = true_data[:,0]
# Calculate new data using array maths
# This goes through and does the sums etc for each element of the array
# Nice and compact way to represent it.
y_new = residuals + model[0] + (model[1] * ti) - (0.5 * model[2] * ti**2)
# Our result needs to be an array of ti, y_new, so we need to combine them using
# the numpy.column_stack routine
result = numpy.column_stack((ti, y_new))
# Return this combined array
return result
# print_progress:
# Just a quick thing that returns true if we want to print for this index
# and false otherwise
def print_progress(index, total):
index = float(index)
total = float(total)
result = False
# Floating point maths is irritating
# We want to print at the start, every 10%, and at the end.
# This works up to index = 100,000
# Would also be lovely if Python had a switch statement
if (((index / total) * 100) <= 0.00001):
result = True
elif (((index / total) * 100) >= 9.99999) and (((index / total) * 100) <= 10.00001):
result = True
elif (((index / total) * 100) >= 19.99999) and (((index / total) * 100) <= 20.00001):
result = True
elif (((index / total) * 100) >= 29.99999) and (((index / total) * 100) <= 30.00001):
result = True
elif (((index / total) * 100) >= 39.99999) and (((index / total) * 100) <= 40.00001):
result = True
elif (((index / total) * 100) >= 49.99999) and (((index / total) * 100) <= 50.00001):
result = True
elif (((index / total) * 100) >= 59.99999) and (((index / total) * 100) <= 60.00001):
result = True
elif (((index / total) * 100) >= 69.99999) and (((index / total) * 100) <= 70.00001):
result = True
elif (((index / total) * 100) >= 79.99999) and (((index / total) * 100) <= 80.00001):
result = True
elif (((index / total) * 100) >= 89.99999) and (((index / total) * 100) <= 90.00001):
result = True
elif ((((index+1) / total) * 100) > 99.99999):
result = True
else:
result = False
return result
#
#
# End of helper functions
#
#
# So we can easily execute our script
if __name__ == "__main__":
main()
I guess you can take a look; here is the link to the complete information.
Use sklearn instead of numpy (sklearn is built on top of numpy and is much better suited for this kind of calculation):
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
clf.coef_
# array([ 0.5,  0.5])
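If the goal is specifically the image-array case from the question, a sketch of one way to turn the 2D array into (x, y) points and fit a line with np.linalg.lstsq could look like this (the example array and the use of pixel intensities as weights are assumptions, not part of the original question):
import numpy as np
img = np.array([[0, 0, 0, 0, 5],
                [0, 0, 3, 4, 0],
                [2, 1, 6, 0, 0]])      # stand-in for the a..f values
rows, cols = np.nonzero(img)           # coordinates of the non-zero pixels
weights = np.sqrt(img[rows, cols])     # optional weighting by intensity
# Fit row = m*col + c (column index as x, row index as y)
A = np.column_stack([cols, np.ones_like(cols)]).astype(float)
m, c = np.linalg.lstsq(A * weights[:, None], rows * weights, rcond=None)[0]
print("slope:", m, "intercept:", c)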