I'm trying to denoise financial time series data (second by second). I have a very long time series, but I've been working with 100,000 observations just to test how well the wavelet denoising (haar) works. It doesn't.
No matter what I do, the reconstructed signal ends up invariably almost identical to the original. Obviously, I want to preserve the original signal, but I feel like the series just simply isn't being denoised -- a financial time series whose only noise occurs in the few-second resolution? Moreover, even at the smallest time scales, the graph of the reconstructed and original graph remain almost the same.
I've tried changing the mother wavelet, the time series length, the mode in which reconstruction of the time series is done (soft vs hard) and, obviously, I've messed with the threshold value itself. I started at the recommended/standard threshold value of sqrt(2*log(len(signal))), but that did virtually nothing for me, so I gradually increased it until I got to the completely ridiculous 2*len(signal)**2 -- which should have smoothed the graph beyond recognition but did basically nothing.
WAVELET = "haar"
LEVEL = 2
signal = training_series
mean = signal.mean()
mean_series = [mean] * len(signal)
signal = [a - b for a, b in zip(signal, mean_series)]
coeffs = pywt.wavedec(signal, WAVELET, level=LEVEL)
sigma = mad(coeffs[-LEVEL])
threshold = sigma * np.sqrt(2*np.log(len(signal)))
coeffs[1:] = (pywt.threshold(i, value=threshold, mode="soft" ) for i in coeffs[1:])
reconstructed_signal = pywt.waverec(coeffs, WAVELET)
I expected that the reconstructed signal would be significantly different from the original signal (as in, smoothed out, denoised, less... identical to the original), but that wasn't the case. At the smallest of scales (think every 10 or 20 seconds on a scale of 100,000 seconds), there is some very minor smoothing that is essentially just ignoring peaks and valleys of size 0.01 (the smallest possible change), but it's almost negligible.
I expected a signal that would be, well, I don't know -- denoised? Am I doing something wrong?
Your threshold might be too high.
You should try setting it by a metric based on the detail coefficients at each level, instead of the original time trace.
Usually starting at:
threshold=np.std(coeff[i])
and going from there will at least get one started.
I had the same problem and found by steadily increasing a scale factor on the threshold helped.
I was attempting to denoise an acoustic emission signal, and only got reconstruction. By multiplying sigma by an increasing scale factor I could find out how high the thresholds needed to be to stop reproducing the signal.
import pywt
import numpy as np
import matplotlib.pyplot as plt
def madev(d, axis=None):
""" Mean absolute deviation of a signal """
return np.mean(np.absolute(d - np.mean(d, axis)), axis)
def wavelet_denoising(x, wavelet, level, s_factor):
"""
deconstructs, thresholds then reconstructs
higher thresholds = less detailed reconstruction
"""
coeff = pywt.wavedec(x, wavelet, mode="per")
sigma = (1/0.6745) * madev(coeff[-level])*s_factor
uthresh = sigma * np.sqrt(2 * np.log(len(x)))
coeff[1:] = (pywt.threshold(i, value=uthresh, mode='hard') for i in coeff[1:])
return pywt.waverec(coeff, wavelet, mode='per')
wav = 'db4'
level=1
for s_factor in np.arange(0,20, 2):
data = wavelet_denoising(signal, wav, level, s_factor)
plt.plot(data)
plt.title('scale factor = {}'.format(s_factor))
fname = 'wavelet_{}_sf_{}_n_{}'.format(wav, s_factor, len(signal))
plt.savefig(fname)
plt.show()
Related
I'm trying to do some tests before I proceed analyzing some real dataset via FFT, and I've found the following problem.
First, I create a signal as the sum of two cosines and then use rfft to to the transformation (since it has only real values):
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import rfft, rfftfreq
# Number of sample points
N = 800
# Sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N)
y = 0.5*np.cos(10*2*np.pi*x) + 0.5*np.cos(200*2*np.pi*x)
# FFT
yf = rfft(y)
xf = rfftfreq(N, T)
fig, ax = plt.subplots(1,2,figsize=(15,5))
ax[0].plot(x,y)
ax[1].plot(xf, 2.0/N*np.abs(yf))
As it can be seen from the definition of the signal, I have two oscillations with amplitude 0.5 and frequency 10 and 200. Now, I would expect the FFT spectrum to be something like two deltas at those points, but apparently increasing the frequency broadens the peaks:
From the first peak it can be infered that the amplitude is 0.5, but not for the second. I've tryied to obtain the area under the peak using np.trapz and use that as an estimate for the amplitude, but as it is close to a dirac delta it's very sensitive to the interval I choose. My problem is that I need to get the amplitude as exact as possible for my data analysis.
EDIT: As it seems to be something related with the number of points, I decided to increment (now that I can) the sample frequency. This seems to solve the problem, as it can be seen in the figure:
However, it still seems strange that for a certain number of points and sample frequency, the high frequency peaks broaden...
It is not strange , you have leakage of the frequency bins. When you discretize the signal (sampling) needed for the Fourier transfrom , frequency bins are created which are frequency intervals where the the amplitude is calculated. And each bin has wide which is given by the sample_rate / num_points . So , the less the number of bins the more difficult is to assign precise amplitudes to every frequency. Other problems in choosing the best sampling rate exist such as the shannon-nyquist theorem to prevent aliasing. https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem . But depending on the problem sometimes there some custom rates used for sampling. E.g. when dealing with audio a sampling rate of 44,100 Hz is widely used , cause is based on the limits of the human hearing. So it depends also on nature of the data you want to perform analysis as you wrote. Anyway , since this question has also theoretical value , you can also check https://dsp.stackexchange.com for some useful info.
I would comment to George's answer, but yet I cannot.
Maybe a starting point for your research are the properties of the Discrete Fourier Transform.
The signal in the time domain is actual the cosines multiplied by a box window which transforms into the frequency domain as the convolution of the deltas with the sinc function. The sinc functions will smear the spectrum.
However, I am not sure we are observing spectral leakage here, since the window fits exactly to the full period of cosines. The discretization of the bins might still play a role here.
I'm analyzing a signal sampled at 200Hz for 6-8 seconds, and the important part are the spikes, that lasts 1 second at max. Think for example to an earthquake...
I have to downsample the signal by a factor 2. I tried:
from scipy import signal
signal.decimate(mysignal, 2, ftype="fir")
signal.resample_poly(mysignal, 1, 2)
I get the same result with both the functions: the signal is resampled, but the spikes, positive and negative ones, are diminished.
I wrong the function, or I have to pass a custom FIR filter?
Note
Downsampling will always damage the signal if you hit the limits of sampling frequency with the frequency of your signal (Nyquist-Shannon sampling theorem). In your case, your spikes are similar to very high frequency signals, therefore you also need very high sampling frequency.
(Example: You have 3 points where middle one has spike. You want to downsample it to 2 points. Where to put the spike? Nowhere, because you run out of samples.)
Nevertheless, if you really want to downsample the signal and still you want to preserve (more or less accurately) particular points (in your case spikes), you can try below attitude, which 'saves' your spikes, downsample the signal and only afterwards applies the 'saved' spikes on corresponding downsampled signal positions.
Steps to do:
1) Get spikes, or in other words,local maximums(or minimums).
example: Pandas finding local max and min
2) Downsample the signal
3) With those spikes you got from 1), replace the corresponding downsampled values
(count with the fact that your signal will be damaged. You cant downsample without losing spikes that are represented by one or two points)
EDIT
Ilustrative example
This is example how to keep the spikes. Its just example, as it is now it doesnt work for negative values
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
t = np.arange(1000)/100
y = np.sin(t*2*3.14)
y[150]=5
y[655]=5
y[333]=5
y[250]=5
def downsample(factor,values):
buffer_ = deque([],maxlen=factor)
downsampled_values = []
for i,value in enumerate(values):
buffer_.appendleft(value)
if (i-1)%factor==0:
#Take max value out of buffer
# or you can take higher value if their difference is too big, otherwise just average
downsampled_values.append(max(buffer_))
return np.array(downsampled_values)
plt.plot(downsample(10,y))
plt.show()
You can, if your hardware supports it, sample at the highest possible frequency but only save a point when a minimum difference in amplitude or difference in time is reached. That way your actual datapoints are filtered on either one criterium. When nothing really changes in the signal you have your wanted sample rate and peaks are also still registered.
Let's assume data contains your sampling points at constant sampling rate. At the end of this algorithm, the list saved will contain all important [ timestamp, sample_point ] entries of your data:
DIVIDER = 5
THRESHOLD = 1000
saved = [ [0, data[0]] ]
for i in range(1, len(data)):
if( (i % DIVIDER == 0) || (abs(data[i] - data[i - 1]) > THRESHOLD) ):
saved.append([ i, data[i] ])
Instead of looking at the amplitude difference between two sampling points you could also just save all the datapoints that lie above or below a certain amplitude with minor changes in this simple piece of code.
If you are not too fussy about the aliasing can you just take the maximum of every second (Nth) sample.
def take(N, samples):
it = iter(samples)
for _ in range(len(samples)/N):
yield max(it.next() for _ in range(N))
Tensing this:
import random
random.seed(1)
a = [random.gauss(10,3) for _ in range(100)]
for c in take(5, a):
print c
As #Martin pointed out, the spikes contain high-frequency components. Transform your signal into a rotating frame of reference (i.e. add/remove a base frequency). This effectively shifts your signal by half the maximum required frequency and you can sample at half rate.
I'm programming in Python and I have two signals, a rectangular signal f(t) and a distribution function g(t) (e.g. normal distribution, log-normal distribution etc.). Both signals are defined over time (x-axis) with a resolution of 0.001 s and the measure of the rectangle is in pulses per second (1/s, y-axis). Both functions shall be convolved, whereby the unit of the result (y-axis) has to be again pulses per second for the subsequent calculations. The original domain of definition in which the results make sense is 0-250 1/s (or at least not much higher). Since I'm programming in Python, f(t) is just an array of ones, multiplied with a respective amplitude. I read in the literature, that g(t) has to be a PDF for my sort of calculations, so I normalized it to its area.
The specific background is, that g(t) is a time delay profile whose influence on the original signal f(t) shall be investigated. This means, I want to find out in which way the shape of f(t) is altered and which impact that has on the subsequent calculations.
My first question is: can I convolve these two signals without any further considerations, or do I have to transform the rectangle at first into a PDF as well?
My second problem is: after the convolution, what measure does the result have? Lets say I convolved the PDF of g(t) with the original f(t) ... is the resulting unit simply 1/s again? If not, how do I convert it into a signal with the unit 1/s?
The thing is, I tested both approaches case1 PDF(f(t)) * g(t) and case2 PDF(f(t)) * PDF(g(t)). Both results are way to high as that it would make any sense for the subsequent computations. Let me give you an example:
time resolution = 1 ms
Amplitude f(t) = 250 1/s
duration f(t) = 1.2 s
duration g(t) = 0.18 s
In case1 the resulting maximal amplitude of the convolution product would be slightly less than 250k. One solution might be to divide this result by 1000 (the time resolution, because this is the parameter to which I normalized the PDF). So in the end the amplitude would be between 0-250 again. But I don't know if this is right, since this is more or less like I would convolve f(t) with the probability mass function of g(t) (division and multiplication before and after the convolution with the time resolution would cancel each other out) ... Whereas case2 would give me an amplitude of ~833. Here I don't know how to regain 1/s.
Does anyone have an idea, or do you need at first further information?
Any help will be much appreciated!
PS: If this question doesn't fit into this stackexchange forum, then please tell me where I should ask it!
EDIT: Included code snippet for the above description.
#%% import
import numpy as np
#%% definitions
def AD(d, a):
return d*np.exp(-d/a)/a**2
#%% calculate pdf of "g(t)"
# parameters
a = 0.013
# domain of definition
t_min = 1e-12
t_max = 0.5
t_step = 0.001
t_range = np.arange(t_min,t_max,t_step)
# original + pdf
distr = AD(t_range,a)
distr_pdf = distr/(distr.sum()*t_step)
# reduce the array (optimize processing time)
for x in np.arange(0, len(distr_pdf)):
if np.round(distr_pdf[0:x+1].sum()*t_step,5) >= 1:
distr_pdf = distr_pdf[0:x+1]
break
#%% calculate rectangle "f(t)"
# parameters
length_rect = 1200
intensity = 250
# original + pdf
rect = np.ones(length_rect) * intensity
rect_pdf = rect/(rect.sum()*t_step)
#%% convolution case1
result1 = np.convolve(distr_pdf,rect)
#%% convolution case2
result2 = np.convolve(distr_pdf,rect_pdf)
Description of the problem
FFT can be used to compute cross-correlation between two signals or images. To determine the delay or lag between two signals A and B, it suffices to locate the peak of:
IFFT(FFT(A)*conjugate(FFT(B)))
However, the amplitude of the peak is related to the amplitude of the frequency spectra of the individual signals. Thus to determine the Pearson correlation (rho), the amplitude of this peak must be scaled by the total energy in the two signals.
One way to do this is to normalize by the geometric mean of the individual autocorrelations. This gives a reasonable approximation of rho, especially when the delay between samples is small, but not the exact value.
I thought the reason for this error was that the Pearson correlation is only defined for the overlapping portions of the signal, whereas the normalization factor (the geometric mean of the two autocorrelation peaks) includes contributions from the non-overlapping portions. I considered two approaches for fixing this and producing an exact value for rho via FFT. In the first (called rho_exact_1 below), I trimmed the samples down to their overlapping portions and computed the normalization factor from those. In the second (called rho_exact_2 below), I computed the fraction of the measurements contained in the overlapping portion of the signals and multiplied the full-autocorrelation-normalization factor by that fraction.
Neither works! The figure below shows plots of the three approaches for calculating Pearson's rho using DFT-based cross-correlation. Only the region of the cross-correlation peak is shown. Each estimate is close to the correct value of 1.0, but not equal to it.
The code I used to perform the calculations is below. I used a simple sine wave as an example signal. I noticed that if I use a square-wave (w/ duty cycle not necessarily 50%) the approaches' errors change.
Can somebody explain what's going on?
A working example
import numpy as np
from matplotlib import pyplot as plt
# make a time vector w/ 256 points
# and a source signal
N_cycles = 10.0
N_points = 256.0
t = np.arange(0,N_cycles*np.pi,np.pi*N_cycles/N_points)
signal = np.sin(t)
use_rect = False
if use_rect:
threshold = -0.75
signal[np.where(signal>=threshold)]=1.0
signal[np.where(signal<threshold)]=-1.0
# normalize the signal (not technically
# necessary for this example, but required
# for measuring correlation of physically
# different signals)
signal = signal/signal.std()
# generate two samples of the signal
# with a temporal offset:
N = 128
offset = 5
sample_1 = signal[:N]
sample_2 = signal[offset:N+offset]
# determine the offset through cross-
# correlation
xc_num = np.abs(np.fft.ifft(np.fft.fft(sample_1)*np.fft.fft(sample_2).conjugate()))
offset_estimate = np.argmax(xc_num)
if offset_estimate>N//2:
offset_estimate = offset_estimate - N
# for an approximate estimate of Pearson's
# correlation, we normalize by the RMS
# of individual autocorrelations:
autocorrelation_1 = np.abs(np.fft.ifft(np.fft.fft(sample_1)*np.fft.fft(sample_1).conjugate()))
autocorrelation_2 = np.abs(np.fft.ifft(np.fft.fft(sample_2)*np.fft.fft(sample_2).conjugate()))
xc_denom_approx = np.sqrt(np.max(autocorrelation_1))*np.sqrt(np.max(autocorrelation_2))
rho_approx = xc_num/xc_denom_approx
print 'rho_approx',np.max(rho_approx)
# this is an approximation because we've
# included autocorrelation of the whole samples
# instead of just the overlapping portion;
# using cropped versions of the samples should
# yield the correct correlation:
sample_1_cropped = sample_1[offset:]
sample_2_cropped = sample_2[:-offset]
# these should be identical vectors:
assert np.all(sample_1_cropped==sample_2_cropped)
# compute autocorrelations of cropped samples
# and corresponding value for rho
autocorrelation_1_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_1_cropped)*np.fft.fft(sample_1_cropped).conjugate()))
autocorrelation_2_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_2_cropped)*np.fft.fft(sample_2_cropped).conjugate()))
xc_denom_exact_1 = np.sqrt(np.max(autocorrelation_1_cropped))*np.sqrt(np.max(autocorrelation_2_cropped))
rho_exact_1 = xc_num/xc_denom_exact_1
print 'rho_exact_1',np.max(rho_exact_1)
# alternatively we could try to use the
# whole sample autocorrelations and just
# scale by the number of pixels used to
# compute the numerator:
scaling_factor = float(len(sample_1_cropped))/float(len(sample_1))
rho_exact_2 = xc_num/(xc_denom_approx*scaling_factor)
print 'rho_exact_2',np.max(rho_exact_2)
# finally a sanity check: is rho actually 1.0
# for the two signals:
rho_corrcoef = np.corrcoef(sample_1_cropped,sample_2_cropped)[0,1]
print 'rho_corrcoef',rho_corrcoef
x = np.arange(len(rho_approx))
plt.plot(x,rho_approx,label='FFT rho_approx')
plt.plot(x,rho_exact_1,label='FFT rho_exact_1')
plt.plot(x,rho_exact_2,label='FFT rho_exact_2')
plt.plot(x,np.ones(len(x))*rho_corrcoef,'k--',label='Pearson rho')
plt.legend()
plt.ylim((.75,1.25))
plt.xlim((0,20))
plt.show()
The normalised cross correlation between two N-periodic discrete signals F and G is defined as:
Since the numerator is a dot product between two vectors (F and G_x) and the denominator is the product of the norm of these two vectors, the scalar r_x must indeed lie between -1 and +1 and it is the cosinus of the angle between the vectors (See there). If the vector F and G_x are aligned, then r_x=1. If r_x=1, then the vector F and G_x are aligned due to the triangular inequality. To ensure these properties, the vectors at the numerator must match those at the denominator.
All numerators can be computed at once by using the Discrete Fourier Transform. Indeed, that transform turns the convolution into pointwise products in the Fourier space. Here is why the different estimated normalized cross correlations are not 1 in the tests you performed.
For the first test "approx", sample_1 and sample_2 are both extracted from a periodic signal. Both are of the same length, but the length is not a multiple of the period as it is 2.5 periods (5pi) (figure below). As a result, since the dft performs the correlation as if they where periodic signals, it is found that sample_1 and sample_2 are not perfectly correlated and r_x<1.
For the second test rho_exact_1, the convolution is performed on signals of length N=128, but the norms at the denominator are computed on truncated vectors of size N-offset=128-5. As a result, the properties of r_x are lost. In addition, it must be noticed that the proposed convolution and norms are not normalized: the computed norms and convolution product are globally proportionnal to the number of points of the considered vectors. As a result, the norms of the truncated vectors are slightly lower compared to the previous case and r_x increases: values larger that 1 are likely encountered as the offset increases.
For the third test rho_exact_2, a scaling factor is introduced to try to correct the first test: the properties of r_x are also lost and values larger than one can be encountered as the scaling factor is larger than one.
Nevertheless, the function corrcoef() of numpy actually computes a r_x equal to 1 for the truncated signals. Indeed, these signals are perfectly identical! The same result can be obtained using DFTs:
xc_num_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_1_cropped)*np.fft.fft(sample_2_cropped).conjugate()))
autocorrelation_1_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_1_cropped)*np.fft.fft(sample_1_cropped).conjugate()))
autocorrelation_2_cropped = np.abs(np.fft.ifft(np.fft.fft(sample_2_cropped)*np.fft.fft(sample_2_cropped).conjugate()))
xc_denom_exact_11 = np.sqrt(np.max(autocorrelation_1_cropped))*np.sqrt(np.max(autocorrelation_2_cropped))
rho_exact_11 = xc_num_cropped/xc_denom_exact_11
print 'rho_exact_11',np.max(rho_exact_11)
To provide the user with a significant value for r_x, you can stick to the value provided by the first test, which can be lower than one for identical periodic signals if the length of the frame is not a multiple of the period. To correct this drawback, the estimated offset can also be retreived and used to build two cropped signals of the same length. The whole correlation procedure must be re-run to get a new value for r_x, which will not be plaged by the fact that the length of the cropped frame is not a multiple of the period.
Lastly, if the DFT is a very efficient way to compute the convolution at the numerator for all values of x at once, the denominator can be efficiently computed as 2-norms of vector, using numpy.linalg.norm. Since the argmax(r_x) for the cropped signals will likely be zero if the first correlation was successful, it could be sufficient to compute r_0 using a dot product `sample_1_cropped.dot(sample_2_cropped).
I am trying to use a fast fourier transform to extract the phase shift of a single sinusoidal function. I know that on paper, If we denote the transform of our function as T, then we have the following relations:
However, I am finding that while I am able to accurately capture the frequency of my cosine wave, the phase is inaccurate unless I sample at an extremely high rate. For example:
import numpy as np
import pylab as pl
num_t = 100000
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
w = 2.0*np.pi*30.0
phase = np.pi/2.0
amp = np.fft.rfft(np.cos(w*t+phase))
freqs = np.fft.rfftfreq(t.shape[-1],dt)
print (np.arctan2(amp.imag,amp.real))[30]
pl.subplot(211)
pl.plot(freqs[:60],np.sqrt(amp.real**2+amp.imag**2)[:60])
pl.subplot(212)
pl.plot(freqs[:60],(np.arctan2(amp.imag,amp.real))[:60])
pl.show()
Using num=100000 points I get a phase of 1.57173880459.
Using num=10000 points I get a phase of 1.58022110476.
Using num=1000 points I get a phase of 1.6650441064.
What's going wrong? Even with 1000 points I have 33 points per cycle, which should be enough to resolve it. Is there maybe a way to increase the number of computed frequency points? Is there any way to do this with a "low" number of points?
EDIT: from further experimentation it seems that I need ~1000 points per cycle in order to accurately extract a phase. Why?!
EDIT 2: further experiments indicate that accuracy is related to number of points per cycle, rather than absolute numbers. Increasing the number of sampled points per cycle makes phase more accurate, but if both signal frequency and number of sampled points are increased by the same factor, the accuracy stays the same.
Your points are not distributed equally over the interval, you have the point at the end doubled: 0 is the same point as 1. This gets less important the more points you take, obviusly, but still gives some error. You can avoid it totally, the linspace has a flag for this. Also it has a flag to return you the dt directly along with the array.
Do
t, dt = np.linspace(0, 1, num_t, endpoint=False, retstep=True)
instead of
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
then it works :)
The phase value in the result bin of an unrotated FFT is only correct if the input signal is exactly integer periodic within the FFT length. Your test signal is not, thus the FFT measures something partially related to the phase difference of the signal discontinuity between end-points of the test sinusoid. A higher sample rate will create a slightly different last end-point from the sinusoid, and thus a possibly smaller discontinuity.
If you want to decrease this FFT phase measurement error, create your test signal so the your test phase is referenced to the exact center (sample N/2) of the test vector (not the 1st sample), and then do an fftshift operation (rotate by N/2) so that there will be no signal discontinuity between the 1st and last point in your resulting FFT input vector of length N.
This snippet of code might help:
def reconstruct_ifft(data):
"""
In this function, we take in a signal, find its fft, retain the dominant modes and reconstruct the signal from that
Parameters
----------
data : Signal to do the fft, ifft
Returns
-------
reconstructed_signal : the reconstructed signal
"""
N = data.size
yf = rfft(data)
amp_yf = np.abs(yf) #amplitude
yf = yf*(amp_yf>(THRESHOLD*np.amax(amp_yf)))
reconstructed_signal = irfft(yf)
return reconstructed_signal
The 0.01 is the threshold of amplitudes of the fft that you would want to retain. Making the THRESHOLD greater(more than 1 does not make any sense), will give
fewer modes and cause higher rms error but ensures higher frequency selectivity.
(Please adjust the TABS for the python code)