Find phase difference between two (inharmonic) waves - python

I have two datasets listing the average voltage outputs of two assemblies of neural networks at times t, that look something like this:
A = [-80.0, -80.0, -80.0, -80.0, -80.0, -80.0, -79.58, -79.55, -79.08, -78.95, -78.77, -78.45,-77.75, -77.18, -77.08, -77.18, -77.16, -76.6, -76.34, -76.35]
B = [-80.0, -80.0, -80.0, -80.0, -80.0, -80.0, -78.74, -78.65, -78.08, -77.75, -77.31, -76.55, -75.55, -75.18, -75.34, -75.32, -75.43, -74.94, -74.7, -74.68]
When two neural assemblies are "in phase" to a reasonable extent, that means that they are interrelated. What I want to do is calculate the phase difference between A and B, preferably over the whole time of the simulation. Since two assemblies are unlikely to be totally in phase, I want to compare that phase difference to a certain threshold.
These are inharmonic oscillators and I don't know their functions, only these values, so I have no idea how to determine the phase or the respective phase difference.
I am doing this project in Python, using numpy and scipy (the two assemblies are numpy arrays).
Any suggestions would be greatly appreciated!
EDIT: Added plots
Example datafile for assembly 1
Example datafile for assembly 2
Here is a plot of what the two datasets look like:

Perhaps you are looking for the cross-correlation:
scipy.signal.correlate(A, B)
The position of the peak in the cross-correlation will be an estimate of the phase difference.
EDIT 3: Update now that I have looked at the real data files. There are two reasons that you find a phase shift of zero. First, the phase shift really is zero between your two time series. You can see this clearly if you zoom in horizontally on your matplotlib graph. Second, it is important to regularize the data first (most importantly, subtract off the mean), otherwise the effect of zero-padding at the ends of the arrays swamps the real signal in the cross-correlation. In the following example, I verify that I am finding the "true" peak by adding an artificial shift and then checking that I recover it correctly.
import numpy
from scipy.signal import correlate
# Load datasets, taking mean of 100 values in each table row
A = numpy.loadtxt("vb-sync-XReport.txt")[:,1:].mean(axis=1)
B = numpy.loadtxt("vb-sync-YReport.txt")[:,1:].mean(axis=1)
nsamples = A.size
# regularize datasets by subtracting mean and dividing by s.d.
A -= A.mean(); A /= A.std()
B -= B.mean(); B /= B.std()
# Put in an artificial time shift between the two datasets
time_shift = 20
A = numpy.roll(A, time_shift)
# Find cross-correlation
xcorr = correlate(A, B)
# delta time array to match xcorr
dt = numpy.arange(1-nsamples, nsamples)
recovered_time_shift = dt[xcorr.argmax()]
print "Added time shift: %d" % (time_shift)
print "Recovered time shift: %d" % (recovered_time_shift)
# SAMPLE OUTPUT:
# Added time shift: 20
# Recovered time shift: 20
EDIT: Here is an example of how it works with fake data.
EDIT 2: Added a graph of the example.
import numpy
from scipy.signal import square, sawtooth, correlate
from numpy import pi, random
period = 1.0 # period of oscillations (seconds)
tmax = 10.0 # length of time series (seconds)
nsamples = 1000
noise_amplitude = 0.6
phase_shift = 0.25*pi # in radians (matches the sample output below)
# construct time array
t = numpy.linspace(0.0, tmax, nsamples, endpoint=False)
# Signal A is a square wave (plus some noise)
A = square(2.0*pi*t/period) + noise_amplitude*random.normal(size=(nsamples,))
# Signal B is a phase-shifted saw wave with the same period
B = -sawtooth(phase_shift + 2.0*pi*t/period) + noise_amplitude*random.normal(size=(nsamples,))
# calculate cross correlation of the two signals
xcorr = correlate(A, B)
# The peak of the cross-correlation gives the shift between the two signals
# The xcorr array goes from -nsamples to nsamples
dt = numpy.linspace(-t[-1], t[-1], 2*nsamples-1)
recovered_time_shift = dt[xcorr.argmax()]
# force the phase shift to be in [-pi:pi]
recovered_phase_shift = 2*pi*(((0.5 + recovered_time_shift/period) % 1.0) - 0.5)
relative_error = (recovered_phase_shift - phase_shift)/(2*pi)
print "Original phase shift: %.2f pi" % (phase_shift/pi)
print "Recovered phase shift: %.2f pi" % (recovered_phase_shift/pi)
print "Relative error: %.4f" % (relative_error)
# OUTPUT:
# Original phase shift: 0.25 pi
# Recovered phase shift: 0.24 pi
# Relative error: -0.0050
# Now graph the signals and the cross-correlation
from pyx import canvas, graph, text, color, style, trafo, unit
from pyx.graph import axis, key
text.set(mode="latex")
text.preamble(r"\usepackage{txfonts}")
figwidth = 12
gkey = key.key(pos=None, hpos=0.05, vpos=0.8)
xaxis = axis.linear(title=r"Time, \(t\)")
yaxis = axis.linear(title="Signal", min=-5, max=17)
g = graph.graphxy(width=figwidth, x=xaxis, y=yaxis, key=gkey)
plotdata = [graph.data.values(x=t, y=signal+offset, title=label)
            for label, signal, offset in [
                (r"\(A(t) = \mathrm{square}(2\pi t/T)\)", A, 2.5),
                (r"\(B(t) = \mathrm{sawtooth}(\phi + 2 \pi t/T)\)", B, -2.5)]]
linestyles = [style.linestyle.solid, style.linejoin.round, style.linewidth.Thick, color.gradient.Rainbow, color.transparency(0.5)]
plotstyles = [graph.style.line(linestyles)]
g.plot(plotdata, plotstyles)
g.text(10*unit.x_pt, 0.56*figwidth, r"\textbf{Cross correlation of noisy anharmonic signals}")
g.text(10*unit.x_pt, 0.33*figwidth, r"Phase shift: input \(\phi = %.2f \,\pi\), recovered \(\phi = %.2f \,\pi\)" % (phase_shift/pi, recovered_phase_shift/pi))
xxaxis = axis.linear(title=r"Time Lag, \(\Delta t\)", min=-1.5, max=1.5)
yyaxis = axis.linear(title=r"\(A(t) \star B(t)\)")
gg = graph.graphxy(width=0.2*figwidth, x=xxaxis, y=yyaxis)
plotstyles = [graph.style.line(linestyles + [color.rgb(0.2,0.5,0.2)])]
gg.plot(graph.data.values(x=dt, y=xcorr), plotstyles)
gg.stroke(gg.xgridpath(recovered_time_shift), [style.linewidth.THIck, color.gray(0.5), color.transparency(0.7)])
ggtrafos = [trafo.translate(0.75*figwidth, 0.45*figwidth)]
g.insert(gg, ggtrafos)
g.writePDFfile("so-xcorr-pyx")
So it works pretty well, even for very noisy data and very anharmonic waves.

@deprecated's answer above is the exact answer to the question, as far as the pure-code Python solution goes. The comments were very valuable, but I feel I should add some notes for people searching for an answer in the specific context of neural networks.
When you take the average membrane potential of large assemblies of neurons, like I did, the correlation will be relatively weak. What you want to look at, primarily, is either the correlation between spike trains, the latency, or the excitability (i.e. synaptic efficacy) of the individual assemblies. These can be found relatively easily by just looking at the points where the potential exceeds a certain threshold. Scipy's correlation function will show a much more detailed picture of the interdependence between neurons or neural assemblies when you give it spike trains instead of the raw potentials. You can also take a look at Brian's statistics module, which can be found here:
http://neuralensemble.org/trac/brian/browser/trunk/brian/tools/statistics.py
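As a minimal sketch of the thresholding idea (the threshold value here is arbitrary and should be tuned to your model; A and B are assumed to be the numpy voltage arrays from the question):

import numpy as np
from scipy.signal import correlate
threshold = -55.0  # mV; an arbitrary spike-detection level, tune for your model
spikes_A = (A > threshold).astype(float)
spikes_B = (B > threshold).astype(float)
# subtract the means so zero-padding doesn't swamp the cross-correlation
xcorr_spikes = correlate(spikes_A - spikes_A.mean(), spikes_B - spikes_B.mean())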
As for phase difference, it is probably an inadequate measure, because neurons are not harmonic oscillators. If you want to take very precise measurements of phase, it's best to look at the synchronization of inharmonic oscillators. The mathematical model that describes these kinds of oscillators, and which is very useful in the context of neurons and neural networks, is the Kuramoto model. There is extensive documentation available on the Kuramoto model and integrate-and-fire synchronization, so I'll leave it at that.
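For a flavour of what that looks like in code, here is a minimal sketch (my own, not from any particular library) of the Kuramoto order parameter, which quantifies how synchronized a population of phase oscillators is:

import numpy as np
# instantaneous phases of M oscillators (hypothetical data; in practice these
# would be extracted from the assemblies, e.g. via a Hilbert transform)
phases = np.random.uniform(0, 2*np.pi, size=100)
# order parameter r: 1 for perfect synchrony, near 0 for incoherence
r = np.abs(np.mean(np.exp(1j * phases)))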

Related

One Cycle Fourier Window optimization. My code is inefficient

Good day
EDIT:
What I want: from any current/voltage waveform on a Power System (PS), I want the filtered 50 Hz (fundamental) RMS magnitudes (and effectively their angles). The current as measured contains all harmonics from 100 Hz to 1250 Hz, depending on the equipment. One cannot analyse and calculate using a wave with these harmonics: the error gets so big (depending on the equipment) that PS protection devices calculate incorrect quantities. The signal attached also has MANY other frequency components involved.
My aim: PS protection relays are special and calculate a 20 ms window in a very short time. I'm not trying to replicate this. I'm using external recording tech and testing whether what the relays see is true and whether they operate correctly. Thus I need to do what they do and keep only the 50 Hz values, without any harmonics or DC.
Important expected result: given any frequency component that MAY be in the signal, I want to see the magnitude of any given harmonic (150 Hz and 250 Hz, the 3rd and 5th harmonics of the fundamental) as well as the magnitude of the DC. This will tell me what type of PS equipment possibly injects these frequencies. It is important that I provide a frequency and the answer is a vector of that frequency only, with all other values filtered OUT.
The RMS of the fundamental differs from the RMS of the full signal: about 4000 A (50 Hz only) versus 4500 A (with the other frequencies included).
This code calculates a One Cycle Fourier value (RMS) for a given frequency, sometimes called a Fourier filter, I think? I use it for Power System 50 Hz / 0 Hz / 150 Hz analogue analysis. (The answers have been tested and are correct fundamental RMS values; see https://users.wpi.edu/~goulet/Matlab/overlap/trigfs.html.)
For a large sample the code is very slow: for 55000 data points it takes 12 seconds. For 3 voltages and 3 currents this gets to be VERY slow, and I look at hundreds of records a day.
How do I enhance it? What Python tips, tricks and libraries are there to speed up appending to my lists/arrays?
(Also feel free to rewrite or use the code.) I use the code to pick certain values out of a signal at given times (which is like reading the values from a specialized program for power system analysis).
Edited to show how I load and use the files; the code runs as pasted:
import matplotlib.pyplot as plt
import csv
import math
import numpy as np
import cmath
# FILES ATTACHED TO POST
filenamecfg = r"E:/Python_Practise/2019-10-21 13-54-38-482.CFG"
filename = r"E:/Python_Practise/2019-10-21 13-54-38-482.DAT"
t = []
IR = []
newIR=[]
with open(filenamecfg, 'r') as csvfile1:
    cfgfile = [row for row in csv.reader(csvfile1, delimiter=',')]
numberofchannels = int(np.array(cfgfile)[1][0])
scaleval = float(np.array(cfgfile)[3][5])
scalevalI = float(np.array(cfgfile)[8][5])
samplingfreq = float(np.array(cfgfile)[numberofchannels+4][0])
numsamples = int(np.array(cfgfile)[numberofchannels+4][1])
freq = float(np.array(cfgfile)[numberofchannels+2][0])
intsample = int(samplingfreq/freq)
# TODO: need to get the number of samples and the frequency and detect them automatically
# scaleval = np.array(cfgfile)[3]
print('multiplier:', scaleval)
print('SampFrq:', samplingfreq)
print('NumSamples:', numsamples)
print('Freq:', freq)

with open(filename, 'r') as csvfile:
    plots = csv.reader(csvfile, delimiter=',')
    for row in plots:
        t.append(float(row[1])/1000000)  # convert time from us to s
        IR.append(float(row[6]))
newIR = np.array(IR) * scalevalI
t = np.array(t)
def mag_and_theta_for_given_freq(f, IVsignal, Tsignal, samples):
    # samples is the sliding-window size you want to calculate for (256 in my case)
    # f in hertz; IVsignal, Tsignal are numpy arrays
    timegap = Tsignal[3] - Tsignal[2]  # sample spacing (not used below)
    pi = math.pi
    w = 2*pi*f
    Xr = []
    Xc = []
    Cplx = []
    mag = []
    theta = []
    #print("Calculating for frequency:", f)
    for i in range(len(IVsignal) - samples):
        newspan = range(i, i + samples)
        timewindow = Tsignal[newspan]
        #print("this is my time: ", timewindow)
        Sig20ms = IVsignal[newspan]
        N = len(Sig20ms)  # number of samples in the window for my current freq
        # real and imaginary part of the signal for the given frequency;
        # this filters and calculates one window's RMS (root mean square) value
        RealI = np.multiply(Sig20ms, np.cos(w*timewindow))
        ImagI = -1*np.multiply(Sig20ms, np.sin(w*timewindow))
        # calculated over the whole signal this builds the RMS vector
        # (used everywhere in power system analysis)
        Xr.append((math.sqrt(2)/N)*sum(RealI))  ### TAKES SO MUCH TIME
        Xc.append((math.sqrt(2)/N)*sum(ImagI))  ## these steps make RMS
        Cplx.append(complex(Xr[i], Xc[i]))
        mag.append(abs(Cplx[i]))
        theta.append(np.angle(Cplx[i]))  # th*180/pi can be used to get degrees if necessary
    # also, for freq 0 (DC): if the offset is negative, how do I return a negative
    # to indicate this when I'm using MAGnitude or absolute value?
    return Cplx, mag, theta  # BUT THE MAGNITUDE WILL NEVER BE zero

myZ, magn, th = mag_and_theta_for_given_freq(freq, newIR, t, intsample)

plt.plot(newIR[0:30000], 'b', linewidth=0.4)  # label='CFG has been loaded!'
plt.plot(magn[0:30000], 'r', linewidth=1)
plt.show()
The code as pasted runs smoothly given the files attached
Regards
EDIT: Please find a test csvfile and COMTRADE TEST files here:
CSV:
https://drive.google.com/open?id=18zc4Ms_MtYAeTBm7tNQTcQkTnFWQ4LUu
COMTRADE
https://drive.google.com/file/d/1j3mcBrljgerqIeJo7eiwWo9eDu_ocv9x/view?usp=sharing
https://drive.google.com/file/d/1pwYm2yj2x8sKYQUcw3dPy_a9GrqAgFtD/view?usp=sharing
Forewords
As I said in my previous comment:
Your code mainly relies on a for loop with a lot of indexation and scalar operations. You already have imported numpy, so you should take advantage of vectorization.
This answer is a start towards your solution.
Lightweight MCVE
First we create a trial signal for the MCVE:
import numpy as np
# Synthetic signal sampler: 5 s sampled at 400 Hz
fs = 400 # Hz
t = 5 # s
t = np.linspace(0, t, fs*t+1)
# Synthetic signal: amplitude 325 V at 50 Hz
A = 325 # V
f = 50 # Hz
y = A*np.sin(2*f*np.pi*t) # V
Then we can compute the RMS of this signal using the usual formulae:
# Actual definition of RMS:
yrms = np.sqrt(np.mean(y**2)) # 229.75 V
Or alternatively we can compute it using DFT (implemented as rfft in numpy.fft):
# RMS using FFT:
Y = 2*np.fft.rfft(y)/y.size
Yrms = np.sqrt(np.real(Y[0]**2 + np.sum(Y[1:]*np.conj(Y[1:]))/2)) # 229.64 V
A demonstration of why this last formula works can be found here. It is valid because Parseval's theorem implies that the Fourier transform conserves energy.
Both versions make use of vectorized functions; there is no need to split real and imaginary parts to perform the computation and then reassemble them into a complex number.
MCVE: Windowing
I suspect you want to apply this function as a window over a long-term time series whose RMS value changes over time. Then we can tackle this problem using the pandas library, which provides time series utilities.
import pandas as pd
We encapsulate the RMS function:
def rms(y):
    Y = 2*np.fft.rfft(y)/y.size
    return np.sqrt(np.real(Y[0]**2 + np.sum(Y[1:]*np.conj(Y[1:]))/2))
We generate a damped signal (variable RMS)
y = np.exp(-0.1*t)*A*np.sin(2*f*np.pi*t)
We wrap our trial signal into a dataframe to take advantage of the rolling or resample methods:
df = pd.DataFrame(y, index=t*pd.Timedelta('1s'), columns=['signal'])
A rolling RMS of your signal is:
df['rms'] = df.rolling(int(fs/f)).agg(rms)
A periodically sampled RMS is:
df['signal'].resample('1s').agg(rms)
The latter returns:
00:00:00 2.187840e+02
00:00:01 1.979639e+02
00:00:02 1.791252e+02
00:00:03 1.620792e+02
00:00:04 1.466553e+02
Signal Conditioning
Addressing your need to keep only the fundamental harmonic (50 Hz), a straightforward solution could be a linear detrend (to remove the constant offset and linear trend) followed by a Butterworth bandpass filter.
We generate a synthetic signal with other frequencies and linear trend:
y = np.exp(-0.1*t)*A*(np.sin(2*f*np.pi*t) \
+ 0.2*np.sin(8*f*np.pi*t) + 0.1*np.sin(16*f*np.pi*t)) \
+ A/20*t + A/100
Then we condition the signal:
from scipy import signal
yd = signal.detrend(y, type='linear')
filt = signal.butter(5, [40,60], btype='band', fs=fs, output='sos', analog=False)
yfilt = signal.sosfilt(filt, yd)
Graphically it leads to:
It comes down to applying the signal conditioning before the RMS computation.
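For instance, a minimal sketch of the full pipeline, reusing rms, fs, f, t and the filtered signal yfilt defined above:

import pandas as pd
# rolling RMS of the conditioned (detrended, bandpass-filtered) signal
dfc = pd.DataFrame(yfilt, index=t*pd.Timedelta('1s'), columns=['signal'])
dfc['rms'] = dfc['signal'].rolling(int(fs/f)).agg(rms)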

Convolution of two signals with constrained output measuring unit

I'm programming in Python and I have two signals, a rectangular signal f(t) and a distribution function g(t) (e.g. normal distribution, log-normal distribution etc.). Both signals are defined over time (x-axis) with a resolution of 0.001 s, and the measure of the rectangle is in pulses per second (1/s, y-axis). Both functions shall be convolved, whereby the unit of the result (y-axis) has to be pulses per second again for the subsequent calculations. The original domain of definition in which the results make sense is 0-250 1/s (or at least not much higher). Since I'm programming in Python, f(t) is just an array of ones, multiplied by a respective amplitude. I read in the literature that g(t) has to be a PDF for my sort of calculations, so I normalized it to its area.
The specific background is that g(t) is a time delay profile whose influence on the original signal f(t) shall be investigated. This means I want to find out in which way the shape of f(t) is altered and which impact that has on the subsequent calculations.
My first question is: can I convolve these two signals without any further considerations, or do I have to first transform the rectangle into a PDF as well?
My second problem is: after the convolution, what unit does the result have? Let's say I convolved the PDF of g(t) with the original f(t) ... is the resulting unit simply 1/s again? If not, how do I convert it into a signal with the unit 1/s?
The thing is, I tested both approaches: case1 f(t) * PDF(g(t)) and case2 PDF(f(t)) * PDF(g(t)). Both results are way too high to make any sense for the subsequent computations. Let me give you an example:
time resolution = 1 ms
Amplitude f(t) = 250 1/s
duration f(t) = 1.2 s
duration g(t) = 0.18 s
In case1 the resulting maximal amplitude of the convolution product would be slightly less than 250k. One solution might be to divide this result by 1000 (the time resolution, because this is the parameter to which I normalized the PDF). So in the end the amplitude would be between 0-250 again. But I don't know if this is right, since this is more or less as if I had convolved f(t) with the probability mass function of g(t) (division and multiplication by the time resolution before and after the convolution would cancel each other out) ... Whereas case2 would give me an amplitude of ~833. Here I don't know how to regain 1/s.
Does anyone have an idea, or do you need further information first?
Any help will be much appreciated!
PS: If this question doesn't fit into this stackexchange forum, then please tell me where I should ask it!
EDIT: Included code snippet for the above description.
#%% import
import numpy as np
#%% definitions
def AD(d, a):
return d*np.exp(-d/a)/a**2
#%% calculate pdf of "g(t)"
# parameters
a = 0.013
# domain of definition
t_min = 1e-12
t_max = 0.5
t_step = 0.001
t_range = np.arange(t_min,t_max,t_step)
# original + pdf
distr = AD(t_range,a)
distr_pdf = distr/(distr.sum()*t_step)
# reduce the array (optimize processing time)
for x in np.arange(0, len(distr_pdf)):
    if np.round(distr_pdf[0:x+1].sum()*t_step, 5) >= 1:
        distr_pdf = distr_pdf[0:x+1]
        break
#%% calculate rectangle "f(t)"
# parameters
length_rect = 1200
intensity = 250
# original + pdf
rect = np.ones(length_rect) * intensity
rect_pdf = rect/(rect.sum()*t_step)
#%% convolution case1
result1 = np.convolve(distr_pdf,rect)
#%% convolution case2
result2 = np.convolve(distr_pdf,rect_pdf)
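A short check of the scaling idea mentioned above (multiplying by the time step, i.e. dividing by 1000, converts the discrete sum into an approximation of the continuous convolution integral, keeping the result in 1/s):

# discrete approximation of the continuous convolution (f * g)(t):
# the sum only becomes an integral after multiplying by the sampling step
result1_scaled = np.convolve(rect, distr_pdf) * t_step
# max(result1_scaled) is now approximately 250, i.e. back in units of 1/s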

How do I scale an FFT-based cross-correlation such that its peak is equal to Pearson's rho

Description of the problem
FFT can be used to compute cross-correlation between two signals or images. To determine the delay or lag between two signals A and B, it suffices to locate the peak of:
IFFT(FFT(A)*conjugate(FFT(B)))
However, the amplitude of the peak is related to the amplitude of the frequency spectra of the individual signals. Thus to determine the Pearson correlation (rho), the amplitude of this peak must be scaled by the total energy in the two signals.
One way to do this is to normalize by the geometric mean of the individual autocorrelations. This gives a reasonable approximation of rho, especially when the delay between samples is small, but not the exact value.
I thought the reason for this error was that the Pearson correlation is only defined for the overlapping portions of the signal, whereas the normalization factor (the geometric mean of the two autocorrelation peaks) includes contributions from the non-overlapping portions. I considered two approaches for fixing this and producing an exact value for rho via FFT. In the first (called rho_exact_1 below), I trimmed the samples down to their overlapping portions and computed the normalization factor from those. In the second (called rho_exact_2 below), I computed the fraction of the measurements contained in the overlapping portion of the signals and multiplied the full-autocorrelation-normalization factor by that fraction.
Neither works! The figure below shows plots of the three approaches for calculating Pearson's rho using DFT-based cross-correlation. Only the region of the cross-correlation peak is shown. Each estimate is close to the correct value of 1.0, but not equal to it.
The code I used to perform the calculations is below. I used a simple sine wave as an example signal. I noticed that if I use a square-wave (w/ duty cycle not necessarily 50%) the approaches' errors change.
Can somebody explain what's going on?
A working example
import numpy as np
from matplotlib import pyplot as plt
# make a time vector w/ 256 points
# and a source signal
N_cycles = 10.0
N_points = 256.0
t = np.arange(0,N_cycles*np.pi,np.pi*N_cycles/N_points)
signal = np.sin(t)
use_rect = False
if use_rect:
    threshold = -0.75
    signal[np.where(signal >= threshold)] = 1.0
    signal[np.where(signal < threshold)] = -1.0
# normalize the signal (not technically
# necessary for this example, but required
# for measuring correlation of physically
# different signals)
signal = signal/signal.std()
# generate two samples of the signal
# with a temporal offset:
N = 128
offset = 5
sample_1 = signal[:N]
sample_2 = signal[offset:N+offset]
# determine the offset through cross-
# correlation
xc_num = np.abs(np.fft.ifft(np.fft.fft(sample_1)*np.fft.fft(sample_2).conjugate()))
offset_estimate = np.argmax(xc_num)
if offset_estimate > N//2:
    offset_estimate = offset_estimate - N
# for an approximate estimate of Pearson's
# correlation, we normalize by the RMS
# of individual autocorrelations:
autocorrelation_1 = np.abs(np.fft.ifft(np.fft.fft(sample_1)*np.fft.fft(sample_1).conjugate()))
autocorrelation_2 = np.abs(np.fft.ifft(np.fft.fft(sample_2)*np.fft.fft(sample_2).conjugate()))
xc_denom_approx = np.sqrt(np.max(autocorrelation_1))*np.sqrt(np.max(autocorrelation_2))
rho_approx = xc_num/xc_denom_approx
print('rho_approx', np.max(rho_approx))
# this is an approximation because we've
# included autocorrelation of the whole samples
# instead of just the overlapping portion;
# using cropped versions of the samples should
# yield the correct correlation:
sample_1_cropped = sample_1[offset:]
sample_2_cropped = sample_2[:-offset]
# these should be identical vectors:
assert np.all(sample_1_cropped==sample_2_cropped)
# compute autocorrelations of cropped samples
# and corresponding value for rho
autocorrelation_1_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_1_cropped)*np.fft.fft(sample_1_cropped).conjugate()))
autocorrelation_2_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_2_cropped)*np.fft.fft(sample_2_cropped).conjugate()))
xc_denom_exact_1 = np.sqrt(np.max(autocorrelation_1_cropped))*np.sqrt(np.max(autocorrelation_2_cropped))
rho_exact_1 = xc_num/xc_denom_exact_1
print('rho_exact_1', np.max(rho_exact_1))
# alternatively we could try to use the
# whole sample autocorrelations and just
# scale by the number of pixels used to
# compute the numerator:
scaling_factor = float(len(sample_1_cropped))/float(len(sample_1))
rho_exact_2 = xc_num/(xc_denom_approx*scaling_factor)
print('rho_exact_2', np.max(rho_exact_2))
# finally a sanity check: is rho actually 1.0
# for the two signals:
rho_corrcoef = np.corrcoef(sample_1_cropped,sample_2_cropped)[0,1]
print('rho_corrcoef', rho_corrcoef)
x = np.arange(len(rho_approx))
plt.plot(x,rho_approx,label='FFT rho_approx')
plt.plot(x,rho_exact_1,label='FFT rho_exact_1')
plt.plot(x,rho_exact_2,label='FFT rho_exact_2')
plt.plot(x,np.ones(len(x))*rho_corrcoef,'k--',label='Pearson rho')
plt.legend()
plt.ylim((.75,1.25))
plt.xlim((0,20))
plt.show()
The normalised cross correlation between two N-periodic discrete signals F and G is defined as:
r_x = ( sum_{n=0}^{N-1} F[n] * G[n+x] ) / ( ||F|| * ||G_x|| )
where G_x denotes G circularly shifted by x samples and ||.|| is the 2-norm.
Since the numerator is a dot product between two vectors (F and G_x) and the denominator is the product of the norms of these two vectors, the scalar r_x must indeed lie between -1 and +1, and it is the cosine of the angle between the vectors (see there). If the vectors F and G_x are aligned, then r_x = 1; conversely, if r_x = 1, then the vectors F and G_x are aligned, by the equality case of the Cauchy-Schwarz inequality. To ensure these properties, the vectors in the numerator must match those in the denominator.
All numerators can be computed at once by using the Discrete Fourier Transform. Indeed, that transform turns the convolution into pointwise products in the Fourier space. Here is why the different estimated normalized cross correlations are not 1 in the tests you performed.
For the first test "approx", sample_1 and sample_2 are both extracted from a periodic signal. Both are of the same length, but the length is not a multiple of the period, as it is 2.5 periods (5 pi) (figure below). As a result, since the DFT performs the correlation as if they were periodic signals, it finds that sample_1 and sample_2 are not perfectly correlated, and r_x < 1.
For the second test rho_exact_1, the convolution is performed on signals of length N=128, but the norms in the denominator are computed on truncated vectors of size N-offset=128-5. As a result, the properties of r_x are lost. In addition, it must be noticed that the proposed convolution and norms are not normalized: the computed norms and convolution product are globally proportional to the number of points of the considered vectors. As a result, the norms of the truncated vectors are slightly lower compared to the previous case and r_x increases: values larger than 1 are likely to be encountered as the offset increases.
For the third test rho_exact_2, a scaling factor is introduced to try to correct the first test: the properties of r_x are also lost, and values larger than one can be encountered, since the denominator is multiplied by a scaling factor smaller than one.
Nevertheless, the function corrcoef() of numpy actually computes an r_x equal to 1 for the truncated signals. Indeed, these signals are perfectly identical! The same result can be obtained using DFTs:
xc_num_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_1_cropped)*np.fft.fft(sample_2_cropped).conjugate()))
autocorrelation_1_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_1_cropped)*np.fft.fft(sample_1_cropped).conjugate()))
autocorrelation_2_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_2_cropped)*np.fft.fft(sample_2_cropped).conjugate()))
xc_denom_exact_11 = np.sqrt(np.max(autocorrelation_1_cropped))*np.sqrt(np.max(autocorrelation_2_cropped))
rho_exact_11 = xc_num_cropped/xc_denom_exact_11
print('rho_exact_11', np.max(rho_exact_11))
To provide the user with a significant value for r_x, you can stick to the value provided by the first test, which can be lower than one for identical periodic signals if the length of the frame is not a multiple of the period. To correct this drawback, the estimated offset can be retrieved and used to build two cropped signals of the same length. The whole correlation procedure must then be re-run to get a new value for r_x, which will not be plagued by the fact that the length of the cropped frame is not a multiple of the period.
Lastly, while the DFT is a very efficient way to compute the convolution in the numerator for all values of x at once, the denominator can be efficiently computed as 2-norms of vectors, using numpy.linalg.norm. Since argmax(r_x) for the cropped signals will likely be zero if the first correlation was successful, it could be sufficient to compute r_0 using a dot product: sample_1_cropped.dot(sample_2_cropped).
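A minimal sketch of that last suggestion, using the cropped samples from the question:

import numpy as np
# exact normalized correlation at zero lag via a dot product and 2-norms
r_0 = sample_1_cropped.dot(sample_2_cropped) / (
    np.linalg.norm(sample_1_cropped) * np.linalg.norm(sample_2_cropped))
print(r_0)  # 1.0 here, since the cropped samples are identical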

extracting phase information using numpy fft

I am trying to use a fast Fourier transform to extract the phase shift of a single sinusoidal function. I know that on paper, if we denote the transform of our function cos(w*t + phase) as T, then the phase can be read off at the signal frequency w as phase = arctan2(Im T(w), Re T(w)).
However, I am finding that while I am able to accurately capture the frequency of my cosine wave, the phase is inaccurate unless I sample at an extremely high rate. For example:
import numpy as np
import pylab as pl
num_t = 100000
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
w = 2.0*np.pi*30.0
phase = np.pi/2.0
amp = np.fft.rfft(np.cos(w*t+phase))
freqs = np.fft.rfftfreq(t.shape[-1],dt)
print(np.arctan2(amp.imag, amp.real)[30])
pl.subplot(211)
pl.plot(freqs[:60],np.sqrt(amp.real**2+amp.imag**2)[:60])
pl.subplot(212)
pl.plot(freqs[:60],(np.arctan2(amp.imag,amp.real))[:60])
pl.show()
Using num_t=100000 points I get a phase of 1.57173880459.
Using num_t=10000 points I get a phase of 1.58022110476.
Using num_t=1000 points I get a phase of 1.6650441064.
What's going wrong? Even with 1000 points I have 33 points per cycle, which should be enough to resolve it. Is there maybe a way to increase the number of computed frequency points? Is there any way to do this with a "low" number of points?
EDIT: from further experimentation it seems that I need ~1000 points per cycle in order to accurately extract a phase. Why?!
EDIT 2: further experiments indicate that accuracy is related to number of points per cycle, rather than absolute numbers. Increasing the number of sampled points per cycle makes phase more accurate, but if both signal frequency and number of sampled points are increased by the same factor, the accuracy stays the same.
Your points are not distributed equally over the interval: you have the endpoint doubled, since 0 is the same point of the periodic signal as 1. This gets less important the more points you take, obviously, but it still introduces some error. You can avoid it entirely; linspace has a flag for this, and another flag to return dt directly along with the array.
Do
t, dt = np.linspace(0, 1, num_t, endpoint=False, retstep=True)
instead of
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
then it works :)
The phase value in a result bin of an unrotated FFT is only correct if the input signal is exactly integer-periodic within the FFT length. Your test signal is not, so the FFT measures something partially related to the phase of the signal discontinuity between the end-points of the test sinusoid. A higher sample rate creates a slightly different last end-point, and thus a possibly smaller discontinuity.
If you want to decrease this FFT phase measurement error, create your test signal so that your test phase is referenced to the exact center (sample N/2) of the test vector (not the 1st sample), and then do an fftshift operation (rotate by N/2) so that there will be no signal discontinuity between the 1st and last points of your resulting FFT input vector of length N.
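A sketch of that recipe on a test signal like the one in the question (num_t, f0 and the bin index are my own choices, not from the original code):

import numpy as np
num_t = 1000
t = np.linspace(0, 1, num_t, endpoint=False)
f0 = 30.3                      # deliberately not integer-periodic in [0, 1)
phase = np.pi/2.0
# reference the phase to the exact center sample instead of the first sample
y = np.cos(2.0*np.pi*f0*(t - t[num_t//2]) + phase)
# rotate by N/2 so the reference sample becomes the first FFT input point
amp = np.fft.rfft(np.fft.fftshift(y))
k = int(round(f0))             # nearest frequency bin
print(np.arctan2(amp.imag, amp.real)[k])  # close to pi/2 despite the leakage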
This snippet of code might help:
import numpy as np
from numpy.fft import rfft, irfft

THRESHOLD = 0.01  # fraction of the peak FFT amplitude to retain

def reconstruct_ifft(data):
    """
    Take in a signal, find its FFT, retain the dominant modes
    and reconstruct the signal from those.

    Parameters
    ----------
    data : signal to run the fft/ifft on

    Returns
    -------
    reconstructed_signal : the reconstructed signal
    """
    yf = rfft(data)
    amp_yf = np.abs(yf)  # amplitudes
    yf = yf*(amp_yf > (THRESHOLD*np.amax(amp_yf)))  # zero out the weak modes
    reconstructed_signal = irfft(yf)
    return reconstructed_signal
The 0.01 is the threshold on FFT amplitudes that you want to retain. Making THRESHOLD greater (more than 1 does not make any sense) will keep fewer modes and cause a higher rms error, but ensures higher frequency selectivity.
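A hypothetical usage, with the imports and THRESHOLD from the snippet above:

t = np.linspace(0, 1, 1000, endpoint=False)
noisy = np.sin(2*np.pi*5*t) + 0.3*np.sin(2*np.pi*40*t) + 0.1*np.random.randn(1000)
clean = reconstruct_ifft(noisy)  # retains the dominant modes, discards most noise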

Plotting confidence intervals for Maximum Likelihood Estimate

I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).
My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.
Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week t the number of occasions on which you have received a book you have read is x, then I can produce a maximum likelihood estimate for the number of books in the library following https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .
Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.
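A quick check of that bookkeeping in code (using the sequence from the example above):

books = ['A', 'B', 'A', 'C', 'B', 'B', 'D']
# duplicates after each week = books received so far minus distinct titles
x = [len(books[:i+1]) - len(set(books[:i+1])) for i in range(len(books))]
print(x)  # [0, 0, 1, 1, 2, 3, 3]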
To visualise the likelihood function (assuming I have understood correctly what one is), I have written the following code, which I believe plots the likelihood function. The maximum is around 135, which is indeed the maximum likelihood estimate according to the MSE link above.
import random
import matplotlib.pyplot as plt
import numpy as np

# N is the true number of books. t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3

def numberrepeats(N, t):
    return t - len(set([random.randint(0, N) for i in range(t)]))

iters = 1000
ydata = []
for N in range(10, 500):
    sampledunk = [numberrepeats(N, t) for i in range(iters)].count(unk)
    ydata.append(sampledunk/iters)

print("MLE is", np.argmax(ydata) + 10)  # ydata[0] corresponds to N=10

xdata = range(10, 500)
print(len(xdata), len(ydata))
plt.plot(xdata, ydata)
plt.show()
The output looks like
My questions are these:
Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.
Finding the 95% confidence interval means finding the range of the x axis such that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer @mbatchkarov has given does not currently do this correctly.
There is now a mathematical answer at https://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .
Looks like you're ok on the first part, so I'll tackle your second and third points.
There are plenty of ways to fit smooth curves, with scipy.interpolate and splines, or with scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.
Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.
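For the rolling-window option, a minimal sketch applied to the ydata from your script (the window width is an arbitrary choice):

import numpy as np
n = 10  # boxcar window width
ydata_smooth = np.convolve(ydata, np.ones(n)/n, mode='same')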
As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:
import numpy as np
import matplotlib.pyplot as plt

# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3

def numberrepeats(N, t, iters):
    # one row per Monte Carlo iteration, each row holding t draws
    rand = np.random.randint(0, N, size=(iters, t))
    return t - np.array([len(set(r)) for r in rand])

iters = 1000
ydata = np.empty(500-10)
for N in range(10, 500):
    sampledunk = np.count_nonzero(numberrepeats(N, t, iters) == unk)
    ydata[N-10] = sampledunk/iters

print("MLE is", np.argmax(ydata) + 10)  # ydata[0] corresponds to N=10

xdata = range(10, 500)
print(len(xdata), len(ydata))
plt.plot(xdata, ydata)
plt.show()
It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.
A simple (numerical) way to get a confidence interval is to run your script many times and see how much your estimate varies. You can use that standard deviation to calculate the confidence interval.
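Sketched out (estimate() is a hypothetical wrapper around your whole Monte Carlo script, returning one MLE per run):

import numpy as np
estimates = np.array([estimate() for _ in range(100)])  # hypothetical estimate()
mean, sterr = estimates.mean(), estimates.std()
ci = (mean - 1.96*sterr, mean + 1.96*sterr)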
In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:
import numpy as np

t = 30
k = 3

def trial(N):
    return t - len(np.unique(np.random.randint(0, N, size=t)))

def trials(N, n_trials):
    return np.asarray([trial(N) for i in range(n_trials)])

n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])

def likelihood(results):
    L = (results == k).mean(-1)
    # boxcar filtering
    n = 10
    L = np.convolve(L, np.ones(n) / float(n), mode='same')
    return L

def max_likelihood_estimate(Ns, results):
    i = np.argmax(likelihood(results))
    return Ns[i]

def max_likelihood(Ns, results):
    # calculate mean from all trials
    mean = max_likelihood_estimate(Ns, results)
    # randomly subsample results to estimate std
    n_samples = 100
    sample_frac = 0.25
    estimates = np.zeros(n_samples)
    for i in range(n_samples):
        mask = np.random.uniform(size=results.shape[1]) < sample_frac
        estimates[i] = max_likelihood_estimate(Ns, results[:, mask])
    std = estimates.std()
    sterr = std * np.sqrt(sample_frac)  # is this mathematically sound?
    ci = (mean - 1.96*sterr, mean + 1.96*sterr)
    return mean, std, sterr, ci

mean, std, sterr, ci = max_likelihood(Ns, results)
print("Max likelihood estimate: ", mean)
print("Max likelihood 95% ci: ", ci)
There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set than in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.
Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might be able to help cancel out some of the roughness in the curves that remains after filtering, but I'm not sure on this.
Here is an answer to your first question and a pointer to a solution for the second:
import numpy as np
import matplotlib.pyplot as plt

# xdata, ydata as computed by the script in the question
plt.plot(xdata, ydata)
# calculate the (normalised) cumulative distribution function
cdf = np.cumsum(ydata)/np.sum(ydata)
# get the left and right boundaries of the interval that contains 95% of the probability mass
right = np.argmax(cdf > 0.975)
left = np.argmax(cdf > 0.025)
# indicate the confidence interval with vertical lines
plt.vlines(xdata[left], 0, ydata[left])
plt.vlines(xdata[right], 0, ydata[right])
# hatch the confidence interval
plt.fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)
This produces the following figure:
I'll try to answer question 3 when I have more time :)
