How to compute a 95% confidence interval around a continuous signal? - python

I would like to compute and display in python a 95% CI around a continuous signal (voltage values as a function of time). This signal was recorded in the brain of 16 different subjects, and lasts 1300 ms. Sampling rate was 250 Hz (so one datapoint every 4 ms). How can I proceed?

Here's a pythonic example of continuous error bar plotting:
https://tonysyu.github.io/plotting-error-bars.html#.V1HmMPkrJhE. The example is for generic error bars, so just replace err with the confidence-interval half-width.
Assuming the voltage at each time point is approximately normally distributed across subjects, I would compute the mean and a 95% confidence interval at each sample (every 4 ms data point across the 16 subjects). With n = 16, the half-width is roughly 1.96 * (standard deviation / sqrt(16)), i.e. 1.96 standard errors of the mean; a t-based multiplier (about 2.13 for 15 degrees of freedom) is slightly more accurate with so few subjects. Note that 2 * standard deviation describes the spread of individual subjects, not the uncertainty of the mean. Store each half-width in an array, and the array and the mean trace can be fed into the example code.
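A minimal sketch of that calculation (the array name data and its placeholder contents are assumptions for illustration; substitute the real subjects-by-samples matrix):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

n_subjects, n_samples = 16, 325                       # 1300 ms at 250 Hz
data = np.random.randn(n_subjects, n_samples)         # placeholder for the recorded voltages

t_ms = np.arange(n_samples) * 4.0                     # one data point every 4 ms
mean = data.mean(axis=0)
sem = data.std(axis=0, ddof=1) / np.sqrt(n_subjects)  # standard error of the mean
half_width = stats.t.ppf(0.975, df=n_subjects - 1) * sem   # ~2.13 * SEM for 15 dof

plt.plot(t_ms, mean, label="mean across subjects")
plt.fill_between(t_ms, mean - half_width, mean + half_width, alpha=0.3, label="95% CI")
plt.xlabel("time (ms)")
plt.ylabel("voltage (µV)")
plt.legend()
plt.show()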

Related

Area under the peak of a FFT in Python

I'm trying to do some tests before I proceed to analyzing a real dataset via FFT, and I've found the following problem.
First, I create a signal as the sum of two cosines and then use rfft to do the transformation (since the signal has only real values):
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import rfft, rfftfreq
# Number of sample points
N = 800
# Sample spacing
T = 1.0 / 800.0
x = np.linspace(0.0, N*T, N)
y = 0.5*np.cos(10*2*np.pi*x) + 0.5*np.cos(200*2*np.pi*x)
# FFT
yf = rfft(y)
xf = rfftfreq(N, T)
fig, ax = plt.subplots(1,2,figsize=(15,5))
ax[0].plot(x,y)
ax[1].plot(xf, 2.0/N*np.abs(yf))
As can be seen from the definition of the signal, I have two oscillations with amplitude 0.5 and frequencies 10 and 200 Hz. I would expect the FFT spectrum to be something like two deltas at those frequencies, but apparently increasing the frequency broadens the peaks:
From the first peak it can be inferred that the amplitude is 0.5, but not from the second. I've tried to obtain the area under the peak using np.trapz and use that as an estimate of the amplitude, but since the peak is close to a Dirac delta the result is very sensitive to the integration interval I choose. My problem is that I need the amplitude to be as exact as possible for my data analysis.
EDIT: As it seems to be related to the number of points, I decided to increase the sampling frequency (now that I can). This seems to solve the problem, as can be seen in the figure:
However, it still seems strange that for a given number of points and sampling frequency, the high-frequency peaks broaden...
It is not strange; you have leakage across the frequency bins. When you discretize the signal (the sampling needed for the Fourier transform), frequency bins are created: frequency intervals over which the amplitude is calculated. Each bin has a width given by sample_rate / num_points, so the fewer bins there are, the harder it is to assign precise amplitudes to every frequency. There are other considerations in choosing the best sampling rate, such as the Shannon-Nyquist theorem to prevent aliasing: https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem. Depending on the problem, there are also customary sampling rates; e.g. when dealing with audio, a sampling rate of 44,100 Hz is widely used because it is based on the limits of human hearing. So it also depends on the nature of the data you want to analyse, as you wrote. Anyway, since this question also has theoretical value, you can check https://dsp.stackexchange.com for some useful info.
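As a quick illustration of that bin width (numbers taken from the question's example; this is just a sketch):
sample_rate = 800.0          # Hz, since T = 1/800 in the question
num_points = 800
bin_width = sample_rate / num_points
print(bin_width)             # 1.0 Hz per bin
# Recording more points at the same rate narrows the bins and sharpens the peaks:
print(sample_rate / 8000)    # 0.1 Hz per bin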
I would comment on George's answer, but I can't yet.
Maybe a starting point for your research is the properties of the Discrete Fourier Transform.
The signal in the time domain is actually the cosines multiplied by a box window, which transforms into the frequency domain as the convolution of the deltas with a sinc function. The sinc functions smear the spectrum.
However, I am not sure we are observing spectral leakage here, since the window fits exactly to the full period of cosines. The discretization of the bins might still play a role here.
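For the original amplitude question, one possible workaround (a sketch, not taken from either answer) is to sum the spectral energy over the bins that form the peak instead of integrating with np.trapz; with the 2/N scaling used in the question this recovers the cosine amplitude approximately, even when leakage smears the peak over several bins:
import numpy as np
from scipy.fft import rfft, rfftfreq

fs, N = 800.0, 800
t = np.arange(N) / fs                        # exactly N samples spaced 1/fs apart
y = 0.5*np.cos(2*np.pi*10*t) + 0.5*np.cos(2*np.pi*200*t)

spec = 2.0 / N * np.abs(rfft(y))             # same scaling as in the question
freqs = rfftfreq(N, 1/fs)

def peak_amplitude(spec, freqs, f0, half_width=5.0):
    # sum the energy of all bins within +/- half_width Hz of the peak
    mask = np.abs(freqs - f0) <= half_width
    return np.sqrt(np.sum(spec[mask] ** 2))

print(peak_amplitude(spec, freqs, 10.0))     # ~0.5
print(peak_amplitude(spec, freqs, 200.0))    # ~0.5, even if the peak is smeared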

Unsupervised learning: Anomaly detection on discrete time series

I am working on a final year project on an unlabelled dataset consisting of vibration data from multiple components inside a wind turbine.
Datasets:
I have data from 4 wind turbines, each consisting of 415 10-second intervals.
About the 10-second interval data:
Each of the 415 10-second intervals consists of vibration data for the generator, gearbox, etc. (14 features in total).
The vibration data (the 14 features) are sampled at 25.6 kHz (262144 rows in each interval).
The 10-second intervals are recorded once per day, at different times, so there is a little more than one year's worth of data.
Head of dataframe with some of the features shown:
Plan:
My current plan is to
Do a Fast Fourier Transform (FFT) from the time domain for each of the different sensors (gearbox, generator, etc.) for each of the 415 intervals. From the FFT I am able to extract frequency information to put in a dataframe (statistical data from the FFT, like spectral RMS per bin); see the sketch after this list.
Build different data sets for different components.
Add features such as wind speed, wind direction, power produced etc.
I will then build unsupervised ML models that can detect anomalies.
Unsupervised models I am considering are encoder-decoder (autoencoder) networks and clustering.
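A rough sketch of the spectral feature extraction in step 1 (the band count, array names, and random placeholder signal are illustrative assumptions, not project specifics):
import numpy as np
from scipy.fft import rfft, rfftfreq

fs = 25_600                                    # Hz, sample rate from the question

def spectral_rms_per_band(signal, fs, n_bands=10):
    # RMS of the magnitude spectrum in equal-width frequency bands
    spec = np.abs(rfft(signal)) / len(signal)
    freqs = rfftfreq(len(signal), 1 / fs)
    edges = np.linspace(0, freqs[-1], n_bands + 1)
    return np.array([
        np.sqrt(np.mean(spec[(freqs >= lo) & (freqs < hi)] ** 2))
        for lo, hi in zip(edges[:-1], edges[1:])
    ])

interval = np.random.randn(262144)             # placeholder for one sensor's 10-second interval
row = spectral_rms_per_band(interval, fs)      # one feature row per (sensor, interval)
print(row.shape)                               # (10,)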
Questions:
Does it look like I have enough data for this type of task? 415 intervals x 4 different turbines = 1660 rows and approx. 20 features.
Should the data be treated as a time series? (It is sampled for 10 seconds once a day at random times..)
What other unsupervised ML models/approaches could be good for this task?
I hope this was clearly written. Thanks in advance for any input!

preprocessing EEG dataset in python to get better accuracy

I have an EEG dataset with 8 features taken using an 8-channel EEG headset. Each row represents readings taken at 250 ms intervals. The values are all floating point, representing voltages in microvolts. If I plot individual features, I can see that they form a continuous wave. The target has 3 categories (0, 1, 2), and the target stays constant over a stretch of rows because each sample spans multiple rows. I would appreciate any guidance on how to pre-process the dataset, since using it as-is gives me very low accuracy (80%), and according to Wikipedia the P300 signal can be detected with 95% accuracy. Please note that I have almost zero knowledge of signal processing and analysing waveforms.
I did try making a 3D array where each row represented a single target and the values of each feature were a list of values that originally spanned multiple rows, but I get an error saying the estimator expected an array with at most 2 dimensions. I'm not sure this was the right approach; it didn't work anyway. (A reshaping sketch follows the sample data below.)
Here's a look at my feature set:
-1.2198,-0.32769,-1.22,2.4115,0.057031,-2.6568,7.372,-0.2789
-1.4262,-4.19,-5.6546,-7.7161,-5.4359,-9.4553,-3.6705,-5.4851
-1.3152,-6.8708,-8.5599,-14.739,-9.1808,-14.268,-11.632,-8.929
-0.53987,-7.5156,-8.9646,-16.656,-10.119,-15.791,-14.616,-9.4095
Their corresponding targets:
0
0
0
0
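A minimal sketch of the reshaping mentioned above, turning the row-per-timepoint layout into the 2D matrix scikit-learn expects (the epoch length and placeholder arrays are assumptions for illustration):
import numpy as np

raw = np.random.randn(1000, 8)              # placeholder: rows of 8-channel readings
labels = np.repeat([0, 1, 2, 0, 1], 200)    # placeholder: target value per row

epoch_len = 200                             # number of consecutive rows sharing one target
n_epochs = len(raw) // epoch_len

# scikit-learn estimators want X with at most 2 dimensions, so flatten each epoch
X = raw[:n_epochs * epoch_len].reshape(n_epochs, epoch_len * 8)
y = labels[::epoch_len][:n_epochs]          # one label per epoch
print(X.shape, y.shape)                     # (5, 1600) (5,)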

Calculating frequency present in Fast Fourier Transform

I have signal output values recorded by a software-defined radio whose center frequency was 162.550 MHz and whose sample rate was 1,000,000 samples per second. To analyse the data in the frequency domain I calculated the FFT, which was straightforward.
#Calculating FFT of signal
fourier=np.fft.fft(RadioData)
For an amplitude vs. frequency plot I also need to calculate the frequencies present in the signal, so I used NumPy's fftfreq for that.
freq=np.fft.fftfreq(fourier.shape[0])
The output was in the range [-0.5, 0.4999995]. I am confused about how to interpret this result, or alternatively how to calculate the frequencies present in the data.
When SDR samples are baseband IQ (complex, or cosine/sine pairs), the bandwidth is equal to the IQ sample rate. This is because baseband IQ samples (unlike single-channel, strictly real samples) can contain both positive and negative frequency spectrum independently: half the bandwidth above and half below an RTL-SDR's (et al.) tuned RF frequency setting (unless a frequency offset is selected).
Thus, the frequency range of the FFT of IQ data will be from Fcenter - (indicated_bandwidth/2) to almost Fcenter + (indicated_bandwidth/2), or for your example: 162.050 MHz to (a bit below) 163.050 MHz (the "bit below" depends on the FFT size). The step size, dF, will be the IQ sample rate divided by the FFT length.
(Note that the data rate in scalar samples is twice the IQ sample rate, because each IQ sample contains two components (real and imaginary, or cosine and sine mixer outputs). Thus, because each IQ sample contains more information, the information bandwidth can be greater. But SDR apps usually indicate the IQ sample rate, not the higher raw data rate.)
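A small sketch of the mapping described above, passing the sample spacing to fftfreq so the axis comes out in Hz and then offsetting by the tuned center frequency (the placeholder IQ data and variable names are assumptions):
import numpy as np
import matplotlib.pyplot as plt

fs = 1_000_000                     # IQ sample rate in Hz
f_center = 162.550e6               # tuned center frequency in Hz
RadioData = np.random.randn(4096) + 1j * np.random.randn(4096)   # placeholder IQ samples

fourier = np.fft.fft(RadioData)
baseband = np.fft.fftfreq(len(RadioData), d=1/fs)   # Hz instead of normalized [-0.5, 0.5)
rf_freq = f_center + baseband                       # ~162.05 MHz to just under 163.05 MHz

plt.plot(np.fft.fftshift(rf_freq) / 1e6, np.abs(np.fft.fftshift(fourier)))
plt.xlabel("frequency (MHz)")
plt.show()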

Algorithm to detect spike in x y graph

Imagine a real-time x, y graph where x is the quantity and y is time, with a 1-minute interval. Every minute a new value is pushed to the graph, and I want to detect whenever there is a spike in the graph.
There are 2 kinds of spike:
Sudden Spike
Gradual Spike
Is there any way to detect them?
Since spikes occur over a short distance (x2 - x1), you can take the standard deviation of a set of y values over a short range of x. If the deviation is a reasonably large value, it's a spike.
For example for 9 consecutive y values
4,4,5,10,26,10,5,4,4 standard deviation is 7.19.
4,4,5,10,100,10,5,4,4 standard deviation is 31.51.
You can start by analysing the highest values of y and their neighbours.
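A small sketch of this rolling-deviation check using the example values above (the window length and threshold are illustrative choices, not from the answer):
import numpy as np

y = np.array([4, 4, 5, 10, 26, 10, 5, 4, 4], dtype=float)
window = 9           # number of consecutive points to inspect
threshold = 5.0      # deviation above which a window is flagged

for start in range(len(y) - window + 1):
    chunk = y[start:start + window]
    if chunk.std(ddof=1) > threshold:            # ddof=1 gives the 7.19 quoted above
        spike_at = start + np.argmax(np.abs(chunk - chunk.mean()))
        print("spike around index", spike_at)    # -> index 4 (the value 26)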
You can take the first derivative of y with respect to x using numpy.diff. Get a set of clean signals and obtain a threshold from them by taking the upper limit of the derivative (the maximum deviation a clean signal showed) with plain old max(array).
Then you can subject your real-time signal to the same kind of scrutiny and check its derivative against that threshold.
You could also threshold based on the angle of the signal, but you would need a comprehensive sample size for that; the angle can be obtained with np.arctan on the derivative.
Different thresholds give you different kinds of peaks.
Adding to the suggestion above, you could also calculate the standard deviation with numpy.std(array) and then check for points more than that value above or below the mean. This would, of course, work better on the derivative, as I mentioned.
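A brief sketch of the derivative-threshold idea (the "clean" calibration values below are made up for illustration):
import numpy as np

clean = np.array([4, 4, 5, 6, 5, 5, 4, 4], dtype=float)
threshold = np.max(np.abs(np.diff(clean)))       # largest step any clean signal showed

signal = np.array([4, 4, 5, 10, 26, 10, 5, 4, 4], dtype=float)
jumps = np.abs(np.diff(signal))                  # first derivative (per 1-minute step)
spikes = np.where(jumps > threshold)[0] + 1
print(spikes)                                    # indices whose jump from the previous point is too large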
A method used in financial analysis is Bollinger Bands. This link can give you more information about them: http://sentdex.com/sentiment-analysisbig-data-and-python-tutorials-algorithmic-trading/how-to-chart-stocks-and-forex-doing-your-own-financial-charting/calculate-bollinger-bands-python-graph-matplotlib/
They are basically a moving average over a window of the time series, with bands a multiple of the rolling standard deviation above and below it. You can get a better set of thresholds from them than from a single global standard deviation.
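A minimal Bollinger-band-style sketch with pandas; the bands here are computed from the preceding points only, so a spike does not inflate its own threshold (the window length and band width k are illustrative assumptions):
import pandas as pd

y = pd.Series([4, 4, 5, 10, 26, 10, 5, 4, 4, 5, 4], dtype=float)
window, k = 4, 2

mid = y.rolling(window).mean().shift(1)          # moving average of the preceding window
band = k * y.rolling(window).std().shift(1)      # k rolling standard deviations

spikes = y[(y > mid + band) | (y < mid - band)]  # points outside the band
print(spikes)                                    # flags the jump to 26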
