I'm analyzing a signal sampled at 200 Hz for 6-8 seconds, and the important part is the spikes, which last 1 second at most. Think, for example, of an earthquake...
I have to downsample the signal by a factor of 2. I tried:
from scipy import signal
signal.decimate(mysignal, 2, ftype="fir")
signal.resample_poly(mysignal, 1, 2)
I get the same result with both functions: the signal is resampled, but the spikes, both positive and negative, are attenuated.
Am I using the wrong function, or do I have to pass a custom FIR filter?
Note
Downsampling will always damage the signal if you hit the limits of sampling frequency relative to the frequency content of your signal (Nyquist-Shannon sampling theorem). In your case, your spikes behave like very high-frequency components, so you also need a very high sampling frequency.
(Example: You have 3 points where middle one has spike. You want to downsample it to 2 points. Where to put the spike? Nowhere, because you run out of samples.)
Nevertheless, if you really want to downsample the signal and still preserve (more or less accurately) particular points (in your case spikes), you can try the approach below, which 'saves' your spikes, downsamples the signal, and only afterwards applies the 'saved' spikes to the corresponding positions in the downsampled signal.
Steps to do:
1) Get the spikes, or in other words, the local maxima (or minima).
example: Pandas finding local max and min
2) Downsample the signal
3) With those spikes you got from 1), replace the corresponding downsampled values
(accept that your signal will be damaged: you can't downsample without losing spikes that are represented by only one or two points; a sketch of this approach follows below)
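A minimal sketch of those three steps (the spike-detection rule and its threshold are my own assumptions here, just for illustration):
import numpy as np
from scipy import signal

def downsample_keep_spikes(x, factor=2, spike_threshold=3.0):
    x = np.asarray(x, dtype=float)
    # 1) find spikes: here simply samples more than `spike_threshold`
    #    standard deviations away from the mean (an assumed rule)
    spike_idx = np.flatnonzero(np.abs(x - x.mean()) > spike_threshold * x.std())
    # 2) downsample with the usual anti-aliasing filter
    y = signal.decimate(x, factor, ftype="fir")
    # 3) write the saved spike values back at the corresponding positions
    for i in spike_idx:
        y[min(i // factor, len(y) - 1)] = x[i]
    return y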
EDIT
Illustrative example
This is an example of how to keep the spikes. It's just an example; as written it doesn't work for negative spikes (see the variant sketched after the code).
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
t = np.arange(1000)/100.0
y = np.sin(t*2*3.14)
y[150]=5
y[655]=5
y[333]=5
y[250]=5
def downsample(factor, values):
    buffer_ = deque([], maxlen=factor)
    downsampled_values = []
    for i, value in enumerate(values):
        buffer_.appendleft(value)
        if (i - 1) % factor == 0:
            # Take max value out of buffer
            # or you can take the higher value if their difference is too big, otherwise just average
            downsampled_values.append(max(buffer_))
    return np.array(downsampled_values)
plt.plot(downsample(10,y))
plt.show()
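A variant that also survives negative spikes could keep, from each block, the sample farthest from the local mean instead of the plain maximum (a sketch, not part of the original example):
def downsample_keep_extremes(factor, values):
    # keep the sample farthest from the block mean, so negative spikes
    # are preserved as well as positive ones (an assumed selection rule)
    values = np.asarray(values)
    downsampled = []
    for start in range(0, len(values) - factor + 1, factor):
        block = values[start:start + factor]
        downsampled.append(block[np.argmax(np.abs(block - block.mean()))])
    return np.array(downsampled)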
You can, if your hardware supports it, sample at the highest possible frequency but only save a point when a minimum difference in amplitude or a minimum difference in time is reached. That way your actual data points are filtered on either criterion. When nothing much changes in the signal you get your desired sample rate, and peaks are still registered.
Let's assume data contains your sampling points at constant sampling rate. At the end of this algorithm, the list saved will contain all important [ timestamp, sample_point ] entries of your data:
DIVIDER = 5
THRESHOLD = 1000
saved = [[0, data[0]]]
for i in range(1, len(data)):
    if (i % DIVIDER == 0) or (abs(data[i] - data[i - 1]) > THRESHOLD):
        saved.append([i, data[i]])
Instead of looking at the amplitude difference between two sampling points, you could also just save all the data points that lie above or below a certain amplitude, with minor changes to this simple piece of code.
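One possible way to write that variant, building on the snippet above (the AMPLITUDE_LIMIT name and its value are assumptions for illustration):
AMPLITUDE_LIMIT = 2000   # assumed cut-off; everything beyond it is always saved
saved = [[0, data[0]]]
for i in range(1, len(data)):
    # keep the regular decimated samples plus any sample with a large amplitude
    if (i % DIVIDER == 0) or (abs(data[i]) > AMPLITUDE_LIMIT):
        saved.append([i, data[i]])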
If you are not too fussy about aliasing, you can just take the maximum of every block of N (here every second) samples.
def take(N, samples):
    it = iter(samples)
    for _ in range(len(samples) // N):
        yield max(next(it) for _ in range(N))
Testing this:
import random
random.seed(1)
a = [random.gauss(10,3) for _ in range(100)]
for c in take(5, a):
    print(c)
As #Martin pointed out, the spikes contain high-frequency components. Transform your signal into a rotating frame of reference (i.e. add/remove a base frequency). This effectively shifts your signal by half the maximum required frequency and you can sample at half rate.
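A rough sketch of that idea, assuming a complex mix-down by an arbitrary base frequency f0 followed by decimation (f0, the use of decimate, and the variable mysignal are all assumptions, not spelled out above):
import numpy as np
from scipy import signal

fs = 200.0          # sampling rate from the question
f0 = 50.0           # assumed base frequency to remove before decimating
t = np.arange(len(mysignal)) / fs

# mix the signal down into a rotating frame: content at f0 moves to 0 Hz
shifted = mysignal * np.exp(-2j * np.pi * f0 * t)

# decimate the complex shifted signal by 2 (real and imaginary parts separately)
decimated = (signal.decimate(shifted.real, 2, ftype="fir")
             + 1j * signal.decimate(shifted.imag, 2, ftype="fir"))

# `decimated` is a complex baseband representation at 100 Hz that still covers
# the original 0-100 Hz band; inspect spikes via np.abs(decimated), or mix back
# up after upsampling to return to the original frame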
Related
I want to apply a Fourier transform using the fft function to my time series data to find "patterns" by extracting the dominant frequency components in the observed data, i.e. the lowest 5 dominant frequencies, in order to predict the y value (bacteria count) at the end of each time series.
I would like to preserve the smallest 5 coefficients as features, and eliminate the rest.
My code is as below:
df = pd.read_csv('/content/drive/My Drive/df.csv', sep=',')
X = df.iloc[0:2,0:10000]
dft_X = np.fft.fft(X)
print(dft_X)
print(len(dft_X))
plt.plot(dft_X)
plt.grid(True)
plt.show()
# What is the graph about(freq/amplitude)? How much data did it use?
for i in dft_X:
    m = i[np.argpartition(i, 5)[:5]]
    n = i[np.argpartition(i, range(5))[:5]]
    print(m, '\n', n)
Here is the output:
But I am not sure how to interpret this graph. To be precise,
1) Does the graph show the transformed values of the input data? I only used 2 rows of data (each row is a time series), so the data is 2x10000; why are there so many lines in the graph?
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
Parameters:
    n : int
        Window length.
    d : scalar, optional
        Sample spacing (inverse of the sampling rate). Defaults to 1.
Returns:
    f : ndarray
        Array of length n containing the sample frequencies.
How to determine n(window length) and sample spacing?
3) Why are transformed values all complex numbers?
Thanks
I'm gonna answer in reverse order of your questions
3) Why are transformed values all complex numbers?
The output of a Fourier Transform is always complex numbers. To get around this fact, you can either apply the absolute value on the output of the transform, or only plot the real part using:
plt.plot(dft_X.real)
2) To obtain frequency value, should I use np.fft.fftfreq(n, d=timestep)?
No, the "frequency values" will be visible on the output of the FFT.
1) Does the graph show the transformed values of the input data? I only used 2 rows of data(each row is a time series), thus data is 2x10000, why are there so many lines in the graph?
Your graph has so many lines because it is drawing one line per column of your data set. Apply the FFT to each row separately (or possibly just transpose your dataframe) and then you'll get proper frequency-domain plots.
Follow up
Would using the absolute value or the real part of the output as features for a later model have a different effect than using the original output?
Absolute values are easier to work with usually.
Using real part
Using absolute value
Here's the Octave code that generated this:
Fs = 4000; % Sampling rate of signal
T = 1/Fs; % Period
L = 4000; % Length of signal
t = (0:L-1)*T; % Time axis
freq = 1000; % Frequency of our sinusoid
sig = sin(freq*2*pi*t); % Fill Time-Domain with 1000 Hz sinusoid
f_sig = fft(sig); % Apply FFT
f = Fs*(0:(L/2))/L; % Frequency axis
figure
plot(f,abs(f_sig/L)(1:end/2+1)); % peak at 1 kHz
figure
plot(f,real(f_sig/L)(1:end/2+1)); % main peak at 1 kHz
In my example, you can see that the absolute value returned no noise at frequencies other than the 1 kHz sinusoid I generated, while the real part had a bigger peak at 1 kHz but also much more noise.
As for effects, I don't know what you mean by that.
Is it expected that "frequency values" are always complex numbers?
Always? No. The Fourier series represents the frequency coefficients at which a sum of sines and cosines exactly reproduces a continuous periodic function. Sines and cosines can be written in complex form through Euler's formula, and this is the most convenient way to store Fourier coefficients. The angle of a complex frequency-domain coefficient encodes the phase of that component: two sine functions of the same frequency have different complex coefficients depending on their time shift. Most libraries that provide an FFT function will, by default, store FFT coefficients as complex numbers, to facilitate phase and magnitude calculations.
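A tiny illustration (the 50 Hz test frequency is just an example): two sinusoids of the same frequency but different phase give coefficients with the same magnitude and a different angle.
import numpy as np

t = np.arange(1000) / 1000.0
for shift in (0.0, np.pi / 4):
    c = np.fft.rfft(np.sin(2 * np.pi * 50 * t + shift))[50]
    print(abs(c), np.angle(c))   # same magnitude, different angle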
Is it a convention that the FFT uses each column of the dataset when plotting a line?
I think it is an issue with matplotlib's plot, not with np.fft.
Could you please show me how to apply the FFT to each row separately?
There are many ways to go about this and I don't want to force you down one path, so I will propose the general solution of iterating over each row of your dataframe and applying the FFT to each row. Otherwise, in your case, I believe transposing your output could also work.
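A minimal sketch of both options, reusing the names from your code (X, np, plt):
# option 1: FFT of each row separately (each row is one time series)
dft_rows = np.fft.fft(X.values, axis=1)

# option 2: transpose before plotting, so each line is one series' spectrum
plt.plot(np.abs(dft_rows).T)
plt.show()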
I'm trying to denoise financial time series data (second by second). I have a very long time series, but I've been working with 100,000 observations just to test how well the wavelet denoising (haar) works. It doesn't.
No matter what I do, the reconstructed signal ends up invariably almost identical to the original. Obviously, I want to preserve the original signal, but I feel like the series just simply isn't being denoised -- a financial time series whose only noise occurs in the few-second resolution? Moreover, even at the smallest time scales, the graph of the reconstructed and original graph remain almost the same.
I've tried changing the mother wavelet, the time series length, the mode in which reconstruction of the time series is done (soft vs hard) and, obviously, I've messed with the threshold value itself. I started at the recommended/standard threshold value of sqrt(2*log(len(signal))), but that did virtually nothing for me, so I gradually increased it until I got to the completely ridiculous 2*len(signal)**2 -- which should have smoothed the graph beyond recognition but did basically nothing.
WAVELET = "haar"
LEVEL = 2
signal = training_series
mean = signal.mean()
mean_series = [mean] * len(signal)
signal = [a - b for a, b in zip(signal, mean_series)]
coeffs = pywt.wavedec(signal, WAVELET, level=LEVEL)
sigma = mad(coeffs[-LEVEL])
threshold = sigma * np.sqrt(2*np.log(len(signal)))
coeffs[1:] = (pywt.threshold(i, value=threshold, mode="soft" ) for i in coeffs[1:])
reconstructed_signal = pywt.waverec(coeffs, WAVELET)
I expected that the reconstructed signal would be significantly different from the original signal (as in, smoothed out, denoised, less... identical to the original), but that wasn't the case. At the smallest of scales (think every 10 or 20 seconds on a scale of 100,000 seconds), there is some very minor smoothing that is essentially just ignoring peaks and valleys of size 0.01 (the smallest possible change), but it's almost negligible.
I expected a signal that would be, well, I don't know -- denoised? Am I doing something wrong?
Your threshold might be too high.
You should try setting it by a metric based on the detail coefficients at each level, instead of the original time trace.
Usually starting at:
threshold=np.std(coeff[i])
and going from there will at least get one started.
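In code, that per-level starting point might look like this (a sketch built on the variables from your own snippet, nothing else assumed):
coeffs = pywt.wavedec(signal, WAVELET, level=LEVEL)
# threshold each detail level with its own standard deviation
for i in range(1, len(coeffs)):
    coeffs[i] = pywt.threshold(coeffs[i], value=np.std(coeffs[i]), mode="soft")
reconstructed_signal = pywt.waverec(coeffs, WAVELET)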
I had the same problem and found by steadily increasing a scale factor on the threshold helped.
I was attempting to denoise an acoustic emission signal, and only got reconstruction. By multiplying sigma by an increasing scale factor I could find out how high the thresholds needed to be to stop reproducing the signal.
import pywt
import numpy as np
import matplotlib.pyplot as plt
def madev(d, axis=None):
    """ Mean absolute deviation of a signal """
    return np.mean(np.absolute(d - np.mean(d, axis)), axis)

def wavelet_denoising(x, wavelet, level, s_factor):
    """
    Deconstructs, thresholds, then reconstructs.
    Higher thresholds = less detailed reconstruction.
    """
    coeff = pywt.wavedec(x, wavelet, mode="per")
    sigma = (1 / 0.6745) * madev(coeff[-level]) * s_factor
    uthresh = sigma * np.sqrt(2 * np.log(len(x)))
    coeff[1:] = (pywt.threshold(i, value=uthresh, mode='hard') for i in coeff[1:])
    return pywt.waverec(coeff, wavelet, mode='per')

wav = 'db4'
level = 1

for s_factor in np.arange(0, 20, 2):
    data = wavelet_denoising(signal, wav, level, s_factor)
    plt.plot(data)
    plt.title('scale factor = {}'.format(s_factor))
    fname = 'wavelet_{}_sf_{}_n_{}'.format(wav, s_factor, len(signal))
    plt.savefig(fname)
    plt.show()
I am trying to use a fast Fourier transform to extract the phase shift of a single sinusoidal function. I know that on paper, if we denote the transform of our function as T, the amplitude and phase follow from T at the signal's frequency: amplitude = |T| and phase = arctan2(Im T, Re T).
However, I am finding that while I am able to accurately capture the frequency of my cosine wave, the phase is inaccurate unless I sample at an extremely high rate. For example:
import numpy as np
import pylab as pl
num_t = 100000
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
w = 2.0*np.pi*30.0
phase = np.pi/2.0
amp = np.fft.rfft(np.cos(w*t+phase))
freqs = np.fft.rfftfreq(t.shape[-1],dt)
print(np.arctan2(amp.imag, amp.real)[30])
pl.subplot(211)
pl.plot(freqs[:60],np.sqrt(amp.real**2+amp.imag**2)[:60])
pl.subplot(212)
pl.plot(freqs[:60],(np.arctan2(amp.imag,amp.real))[:60])
pl.show()
Using num=100000 points I get a phase of 1.57173880459.
Using num=10000 points I get a phase of 1.58022110476.
Using num=1000 points I get a phase of 1.6650441064.
What's going wrong? Even with 1000 points I have 33 points per cycle, which should be enough to resolve it. Is there maybe a way to increase the number of computed frequency points? Is there any way to do this with a "low" number of points?
EDIT: from further experimentation it seems that I need ~1000 points per cycle in order to accurately extract a phase. Why?!
EDIT 2: further experiments indicate that accuracy is related to number of points per cycle, rather than absolute numbers. Increasing the number of sampled points per cycle makes phase more accurate, but if both signal frequency and number of sampled points are increased by the same factor, the accuracy stays the same.
Your points are not distributed equally over the interval: you have the end point doubled, since t = 0 describes the same point of the waveform as t = 1. This matters less the more points you take, obviously, but it still introduces some error. You can avoid it entirely; linspace has a flag for this, and it also has a flag to return dt directly along with the array.
Do
t, dt = np.linspace(0, 1, num_t, endpoint=False, retstep=True)
instead of
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
then it works :)
The phase value in the result bin of an unrotated FFT is only correct if the input signal is exactly integer periodic within the FFT length. Your test signal is not, thus the FFT measures something partially related to the phase difference of the signal discontinuity between end-points of the test sinusoid. A higher sample rate will create a slightly different last end-point from the sinusoid, and thus a possibly smaller discontinuity.
If you want to decrease this FFT phase measurement error, create your test signal so that your test phase is referenced to the exact center (sample N/2) of the test vector (not the 1st sample), and then do an fftshift operation (rotate by N/2) so that there is no signal discontinuity between the 1st and last points of your resulting FFT input vector of length N.
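A sketch of that recipe, using a deliberately non-integer-periodic 30.5 Hz test tone (the particular frequency, record length and bin index are example values only):
import numpy as np

N = 1000
fs = 1000.0                       # 1 second record, so 1 Hz bin spacing
dt = 1.0 / fs
f = 30.5                          # NOT an integer number of cycles per record
phase = np.pi / 2.0
n = np.arange(N)

# phase referenced to the first sample (as in the question)
sig_start = np.cos(2 * np.pi * f * n * dt + phase)

# phase referenced to the centre sample N/2, then rotated by N/2 before the FFT
sig_centred = np.cos(2 * np.pi * f * (n - N // 2) * dt + phase)
sig_rotated = np.fft.ifftshift(sig_centred)

k = 30                            # nearest bin to 30.5 Hz
for name, s in (("start-referenced", sig_start), ("centre-referenced", sig_rotated)):
    print(name, np.angle(np.fft.rfft(s)[k]))
# the centre-referenced estimate stays close to pi/2; the other is far off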
This snippet of code might help:
import numpy as np
from numpy.fft import rfft, irfft   # assumed source of rfft/irfft

THRESHOLD = 0.01   # fraction of the largest amplitude to keep (see note below)

def reconstruct_ifft(data):
    """
    In this function, we take in a signal, find its fft, retain the dominant
    modes and reconstruct the signal from that.

    Parameters
    ----------
    data : signal to do the fft, ifft

    Returns
    -------
    reconstructed_signal : the reconstructed signal
    """
    N = data.size
    yf = rfft(data)
    amp_yf = np.abs(yf)                                  # amplitude
    yf = yf * (amp_yf > (THRESHOLD * np.amax(amp_yf)))   # zero out the weak modes
    reconstructed_signal = irfft(yf)
    return reconstructed_signal
The 0.01 is the threshold on the FFT amplitudes that you want to retain. Making THRESHOLD larger (more than 1 does not make any sense) keeps fewer modes and causes a higher RMS error, but ensures higher frequency selectivity.
I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).
My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.
Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give you each week. If at week t the number of occasions on which you have received a book you have already read is x, then I can produce a maximum likelihood estimate for the number of books in the library following https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .
Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.
To visualise the likelihood function (assuming I have understood what one is correctly) I have written the following code which I believe plots the likelihood function. The maximum is around 135 which is indeed the maximum likelihood estimate according to the MSE link above.
from __future__ import division
import random
import matplotlib.pyplot as plt
import numpy as np
# N is the true number of books. t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3

def numberrepeats(N, t):
    return t - len(set([random.randint(0, N) for i in xrange(t)]))

iters = 1000
ydata = []
for N in xrange(10, 500):
    sampledunk = [numberrepeats(N, t) for i in xrange(iters)].count(unk)
    ydata.append(sampledunk / iters)

print "MLE is", np.argmax(ydata)

xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata, ydata)
plt.show()
The output looks like
My questions are these:
Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.
Finding the 95% confidence interval means finding the range of the x axis so that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer #mbatchkarov has given does not currently do this correctly.
There is now a mathematical answer at https://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .
Looks like you're ok on the first part, so I'll tackle your second and third points.
There are plenty of ways to fit smooth curves, with scipy.interpolate and splines, or with scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.
Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.
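For the rolling-window option, a minimal sketch using the xdata/ydata from your script (the window length is arbitrary):
window = 11                                   # arbitrary smoothing window
kernel = np.ones(window) / window
smoothed = np.convolve(ydata, kernel, mode='same')
plt.plot(xdata, ydata, alpha=0.4)             # raw curve
plt.plot(xdata, smoothed)                     # smoothed curve on top
plt.show()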
As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3
def numberrepeats(N, t, iters):
    rand = np.random.randint(0, N, size=(t, iters))
    # each column is one simulated run of t weekly draws, so iterate over columns
    return t - np.array([len(set(r)) for r in rand.T])

iters = 1000
ydata = np.empty(500 - 10)
for N in xrange(10, 500):
    sampledunk = np.count_nonzero(numberrepeats(N, t, iters) == unk)
    ydata[N - 10] = sampledunk / iters

print "MLE is", np.argmax(ydata)

xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata, ydata)
plt.show()
It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.
A simple (numerical) way to get a confidence interval is simply to run your script many times and see how much your estimate varies. You can then use that standard deviation to calculate the confidence interval.
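As a sketch of that (assuming you wrap the whole estimation above in a function run_once() that returns one MLE; the name is hypothetical):
estimates = np.array([run_once() for _ in range(50)])   # 50 independent re-runs
mean, std = estimates.mean(), estimates.std()
ci = (mean - 1.96 * std, mean + 1.96 * std)              # rough normal-approximation 95% CI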
In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:
import numpy as np
t = 30
k = 3
def trial(N):
    return t - len(np.unique(np.random.randint(0, N, size=t)))

def trials(N, n_trials):
    return np.asarray([trial(N) for i in xrange(n_trials)])

n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])

def likelihood(results):
    L = (results == 3).mean(-1)
    # boxcar filtering
    n = 10
    L = np.convolve(L, np.ones(n) / float(n), mode='same')
    return L

def max_likelihood_estimate(Ns, results):
    i = np.argmax(likelihood(results))
    return Ns[i]

def max_likelihood(Ns, results):
    # calculate mean from all trials
    mean = max_likelihood_estimate(Ns, results)

    # randomly subsample results to estimate std
    n_samples = 100
    sample_frac = 0.25
    estimates = np.zeros(n_samples)
    for i in xrange(n_samples):
        mask = np.random.uniform(size=results.shape[1]) < sample_frac
        estimates[i] = max_likelihood_estimate(Ns, results[:, mask])
    std = estimates.std()

    sterr = std * np.sqrt(sample_frac)  # is this mathematically sound?
    ci = (mean - 1.96 * sterr, mean + 1.96 * sterr)
    return mean, std, sterr, ci

mean, std, sterr, ci = max_likelihood(Ns, results)
print "Max likelihood estimate: ", mean
print "Max likelihood 95% ci: ", ci
There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set as in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.
Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might be able to help cancel out some of the roughness in the curves that remains after filtering, but I'm not sure on this.
Here is an answer to your first question and a pointer to a solution for the second:
plt.plot(xdata, ydata)

# calculate the cumulative distribution function
cdf = np.cumsum(ydata) / sum(ydata)

# get the left and right boundary of the interval that contains 95% of the probability mass
right = np.argmax(cdf > 0.975)
left = np.argmax(cdf > 0.025)

# indicate the confidence interval with vertical lines
plt.vlines(xdata[left], 0, ydata[left])
plt.vlines(xdata[right], 0, ydata[right])

# hatch the confidence interval
plt.fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)
This produces the following figure:
I'll try to answer question 3 when I have more time :)
Essentially I've got an Excel file with voltage in the first column and time in the second. I want to find the period of the voltage, since plotting it gives a graph of voltage on the y axis against time on the x axis that is periodic, looking similar to a sine function.
To find the frequency I have loaded my Excel file into Python, as I think this will make it easier; there may be something I've missed that would simplify this.
So far in python I have:
import xlrd
import numpy as N
import numpy.fft as F
import matplotlib.pyplot as P
wb = xlrd.open_workbook('temp7.xls') #LOADING EXCEL FILE
wb.sheet_names()
sh = wb.sheet_by_index(0)
first_column = sh.col_values(1) #VALUES FROM EXCEL
second_column = sh.col_values(2) #VALUES FROM EXCEL
Now how do I find the frequency from this?
I'm not sure how much you know about the Fourier transform, so forgive me if this is too much background.
Your signal does not have "a frequency"; rather, it can be thought of as the sum of many frequencies. The Fourier transform will tell you the weights of all the frequencies that make up your signal. Unfortunately, information may be lost when sampling from the analog (continuous-time) to the digital (discrete-time) domain. This puts a constraint on the information we can get about frequency, namely that the maximum frequency component we can determine is related to the digital sampling rate (Nyquist-Shannon criterion):
fs > 2B
Where fs is your sampling rate (samples/unit time, typically in Hz or something like it), and B is the maximum frequency of your signal. If your signal actually has frequencies higher than B they will be "aliased" to some value lower than B.
For your problem, all you have to do is this:
x = N.array(first_column)
X = F.fft(x)
Now X is the frequency-domain representation of your voltage signal. The corresponding frequency axis covers [0, fs), based on the sampling theorem. So, what is fs? You need to calculate that by looking at the number of samples you have divided by the total duration of your sampled signal (note your units here):
fs = len(second_column) / second_column[-1]
Note that this representation of your signal will also (probably) be complex, i.e. each frequency will have an associated amplitude and phase.
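If what you ultimately want is the single dominant frequency (the period is then its reciprocal), a small sketch building on the variables above, assuming your time column is in seconds:
freqs = F.fftfreq(len(x), d=1.0 / fs)    # frequency axis that matches X
half = len(x) // 2                       # positive-frequency half of the spectrum
peak = N.argmax(abs(X[1:half])) + 1      # index of the strongest bin, skipping DC
dominant_freq = freqs[peak]              # in Hz
period = 1.0 / dominant_freq             # in seconds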
Hopefully this helps, and hopefully I didn't cover a bunch of stuff you already knew.