Finding RMS noise in a spectrum


I have an intensity vs. velocity spectrum, and my aim is to find the RMS noise in the spectrum, excluding the channels where the peak is present.
So, after some research, I came to know that RMS noise is the same as the standard deviation of the spectrum and the signal-to-noise ratio of the signal is the average of the signal divided by the same standard deviation. Can anybody please tell me if I am wrong here?
This is how I coded it in Python:
import numpy as np
def Average(data):
    # arithmetic mean over all channels
    return sum(data) / len(data)
data = np.asarray(data)  # .std() below needs a NumPy array
average = Average(data)
print("Average of the list =", average)
standardDev = data.std()
print('The standard deviation is', standardDev)
SNR = average / standardDev
print('SNR = ', SNR)
My original data points are:
x-axis(velocity) :
[-5.99999993e+04 -4.99999993e+04 -3.99999993e+04 -2.99999993e+04
-1.99999993e+04 -9.99999934e+03 6.65010004e-04 1.00000007e+04
2.00000007e+04 3.00000007e+04 4.00000007e+04 5.00000007e+04
6.00000007e+04 7.00000007e+04 8.00000007e+04 9.00000007e+04
1.00000001e+05 1.10000001e+05 1.20000001e+05 1.30000001e+05
1.40000001e+05]
y-axis (data):
[ 0.00056511 -0.00098584 -0.00325616 -0.00101042 0.00168894 -0.00097406
-0.00134408 0.00128847 -0.00111633 -0.00151621 0.00299326 0.00916455
0.00960554 0.00317363 0.00311124 -0.00080881 0.00215932 0.00596419
-0.00192256 -0.00190138 -0.00013216]
If I want to measure the standard deviation excluding the channels where the line is present, should I exclude values from y[10] to y[14] and then calculate the standard deviation?

Yes, since you are to determine some properties of the noise, you should exclude the points that do not constitute the noise. If these are points number 10 to 14 - exclude them.
Then you compute the average of the remaining y-values (intensity). However, from your data and the fitting function, a * exp(-(x-c)**2 / w), one might infer that the theoretical value of this mean is just zero. If so, the average is only a means of validating your experiment / theory ("we've obtained almost zero, as expected"), and you should use 0 as the true average value. Then the noise level would amount to the square root of the second moment, E(Y^2).
You should compare the stddev from your code with the square root of the second moment; they should be so similar to each other that it should not matter which of them you choose as the noise value.
The part with SNR, signal to noise ratio, is wrong in your derivation. The signal is the signal, that is - it is the amplitude of the Gaussian obtained from the fit. You divide it by the noise level (either the square root of the second moment, or stddev). To my eye, you should obtain a value between 2 and about 10.
Finally, remember that this is a public forum and that some people read it and may be puzzled by the question & answer: both are based on the previous question Fitting data to a gaussian profile which should've been referred to in the question itself.
If this is a university assignment and you work on real experimental data, remember the purpose. Imagine yourself as a scientist who is to convince others that this is a real signal, say, from the Aliens, not just an erratic result of Mother Nature tossing dice at random. That's the primary purpose of the signal to noise ratio.
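The steps in this answer can be sketched roughly as follows, using the intensity values from the question. Two things here are assumptions rather than part of the original posts: the line is taken to occupy channels 10 to 14 inclusive, and the largest value inside the line stands in for the fitted Gaussian amplitude, since the fit itself lives in the earlier question.

```python
import numpy as np

# Intensity values copied from the question.
y = np.array([
     0.00056511, -0.00098584, -0.00325616, -0.00101042,  0.00168894, -0.00097406,
    -0.00134408,  0.00128847, -0.00111633, -0.00151621,  0.00299326,  0.00916455,
     0.00960554,  0.00317363,  0.00311124, -0.00080881,  0.00215932,  0.00596419,
    -0.00192256, -0.00190138, -0.00013216,
])

# Assumption: the line occupies channels 10..14 inclusive.
line_channels = np.zeros(y.size, dtype=bool)
line_channels[10:15] = True
noise = y[~line_channels]

noise_mean = noise.mean()                 # expected to be close to zero
noise_std = noise.std()                   # scatter about the measured mean
noise_rms = np.sqrt(np.mean(noise ** 2))  # sqrt of the second moment (scatter about zero)

# Assumption: the peak channel is a stand-in for the fitted Gaussian amplitude.
signal_amplitude = y[line_channels].max()
snr = signal_amplitude / noise_rms

print("noise mean =", noise_mean)
print("noise std  =", noise_std)
print("noise rms  =", noise_rms)
print("SNR        =", snr)
```

For this data the two noise estimates agree closely, and the resulting SNR falls inside the 2 to 10 range mentioned above. With a real fit, the fitted amplitude should replace the peak channel.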

Related

How to add random white noise to data

Suppose I have a column of data whose value ranges from -1.23 to +2.56. What I want is to add 10% random white noise to my data. I'm not sure how to do it in python; please help me with the code.

Add independent Gaussian (normal) randomness to your values.
Technically it need not be Gaussian. White noise is called that because it has a flat spectrum, meaning it is composed of all frequencies in equal proportions. The Wiener-Khinchin theorem shows that this is mathematically equivalent to having zero serial correlation. Many people believe that white noise must be Gaussian, but independence is sufficient to yield a flat spectrum.
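A minimal sketch of this in NumPy is below. The question does not say what "10%" is measured against, so this sketch assumes it means a noise standard deviation equal to 10% of the data's own standard deviation. The function name add_white_noise and its parameters are made up for illustration.

```python
import numpy as np

def add_white_noise(values, fraction=0.10, seed=None):
    """Return values plus independent Gaussian noise.

    Assumption: `fraction` scales the noise standard deviation
    relative to the standard deviation of the data itself.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    sigma = fraction * values.std()
    # Each sample gets its own independent draw, which is what makes the noise white.
    return values + rng.normal(0.0, sigma, size=values.shape)

# Example column spanning the range mentioned in the question.
data = np.linspace(-1.23, 2.56, 10000)
noisy = add_white_noise(data, fraction=0.10, seed=0)
```

If "10%" is instead meant relative to the data range or to each individual value, only the line that computes sigma needs to change.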

What is sigma clipping? How do you know when to apply it?

I'm reading a book on Data Science for Python and the author applies 'sigma-clipping operation' to remove outliers due to typos. However the process isn't explained at all.
What is sigma clipping? Is it only applicable for certain data (eg. in the book it's used towards birth rates in US)?
As per the text:
quartiles = np.percentile(births['births'], [25, 50, 75]) #so we find the 25th, 50th, and 75th percentiles
mu = quartiles[1] #we set mu = 50th percentile
sig = 0.74 * (quartiles[2] - quartiles[0]) #???
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
Why 0.74? Is there a proof for this?

This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
That's it, really...
The code tries to estimate sigma using the interquartile range to make it robust against outliers. 0.74 is a correction factor. Here is how to calculate it:
import scipy as sp
import scipy.stats  # makes sp.stats available
p1 = sp.stats.norm.ppf(0.25) # first quartile of standard normal distribution
p2 = sp.stats.norm.ppf(0.75) # third quartile
print(p2 - p1) # 1.3489795003921634
sig = 1 # standard deviation of the standard normal distribution
factor = sig / (p2 - p1)
print(factor) # 0.74130110925280102
In the standard normal distribution sig==1 and the interquartile range is 1.35. So 0.74 is the correction factor to turn the interquartile range into sigma. Of course, this is only true for the normal distribution.

Suppose you have a set of data. Compute its median m and its standard deviation sigma. Keep only the data that falls in the range (m-a*sigma,m+a*sigma) for some value of a, and discard everything else. This is one iteration of sigma clipping. Continue to iterate a predetermined number of times, and/or stop when the relative reduction in the value of sigma is small.
Sigma clipping is geared toward removing outliers, to allow for a more robust (i.e. resistant to outliers) estimation of, say, the mean of the distribution. So it's applicable to data where you expect to find outliers.
As for the 0.74, it comes from the interquartile range of the Gaussian distribution, as per the text.
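The iteration described in this answer could look something like the sketch below. The function name sigma_clip, the default a=3, the iteration cap and the tolerance are all illustrative choices, not taken from the book or the answer.

```python
import numpy as np

def sigma_clip(data, a=3.0, max_iter=5, tol=1e-3):
    """Iteratively keep only values inside (m - a*sigma, m + a*sigma).

    m is the median and sigma the standard deviation of the values
    that survived the previous pass.
    """
    kept = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        m = np.median(kept)
        sigma = kept.std()
        if sigma == 0:
            break  # all remaining values are identical
        survivors = kept[np.abs(kept - m) < a * sigma]
        new_sigma = survivors.std()
        kept = survivors
        # Stop once sigma has (almost) stopped shrinking.
        if (sigma - new_sigma) / sigma < tol:
            break
    return kept

series = np.array([1, 2, 3, 4, 1, 2, 3, 4, 5, 3, 4, 5, 3, 500, 1000])
clipped = sigma_clip(series)
print(clipped)
```

On this small series the first pass removes only 1000, because sigma is still inflated by both outliers. The second pass removes 500, and the third pass changes nothing, so the loop stops.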

The answers here are accurate and reasonable, but don't quite get to the heart of your question:
What is sigma clipping? Is it only applicable for certain data?
If we want to use mean (mu) and standard deviation (sigma) to figure out a threshold for ejecting extreme values in situations where we have a reason to suspect that those extreme values are mistakes (and not just very high/low values), we don't want to calculate mu/sigma using the dataset which includes these mistakes.
Sample problem: you need to compute a threshold for a temperature sensor to indicate when the temperature is "High" - but sometimes the sensor gives readings that are impossible, like "surface of the sun" high.
Imagine a series that looks like this:
thisSeries = np.array([1,2,3,4,1,2,3,4,5,3,4,5,3, 500, 1000])
Those last two values look like obvious mistakes - but if we use a typical stats function like a Normal PPF, it's going to implicitly assume that those outliers belong in the distribution, and perform its calculation accordingly:
st.norm.ppf(.975, thisSeries.mean(), thisSeries.std())
631.5029013468446
So using a two-sided 5% outlier threshold (meaning we will reject the lower and upper 2.5%), it's telling me that 500 is not an outlier. Even if I use a one-sided threshold of .95 (reject the upper 5%), it will give me 546 as the outlier limit, so again, 500 is regarded as non-outlier.
Sigma-clipping works by focusing on the inter-quartile range and using median instead of mean, so the thresholds won't be calculated under the influence of the extreme values.
thisDF = pd.DataFrame(thisSeries, columns=["value"])
intermed = "value"
factor = 5
quartiles = np.percentile(thisSeries, [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
# '@' lets DataFrame.query refer to the local variables mu and sig
queryString = '({} < @mu - {} * @sig) | ({} > @mu + {} * @sig)'.format(intermed, factor, intermed, factor)
print(mu + 5 * sig)
10.4
print(thisDF.query(queryString))
500
1000
At factor=5, both outliers are correctly isolated, and the threshold is at a reasonable 10.4 - reasonable, given that the 'clean' part of the series is [1,2,3,4,1,2,3,4,5,3,4,5,3]. ('factor' in this context is a scalar applied to the thresholds)
To answer the question, then: sigma clipping is a method of identifying outliers which is immune from the deforming effects of the outliers themselves, and though it can be used in many contexts, it excels in situations where you suspect that the extreme values are not merely high/low values that should be considered part of the dataset, but rather that they are errors.
Here's an illustration of the difference between extreme values that are part of a distribution, and extreme values that are possibly errors, or just so extreme as to deform analysis of the rest of the data.
The data above was generated synthetically, but you can see that the highest values in this set are not deforming the statistics.
Now here's a set generated the same way, but this time with some artificial outliers injected (above 40):
If I sigma-clip this, I can get back to the original histogram and statistics, and apply them usefully to the dataset.
But where sigma-clipping really shines is in real world scenarios, in which faulty data is common. Here's an example that uses real data - historical observations of my heart-rate monitor. Let's look at the histogram without sigma-clipping:
I'm a pretty chill dude, but I know for a fact that my heart rate is never zero. Sigma-clipping handles this easily, and we can now look at the real distribution of heart-rate observations:
Now, you may have some domain knowledge that would enable you to manually assert outlier thresholds or filters. This is one final nuance to why we might use sigma-clipping - in situations where data is being handled entirely by automation, or we have no domain knowledge relating to the measurement or how it's taken, then we don't have any informed basis for filter or threshold statements.
It's easy to say that a heart rate of 0 is not a valid measurement - but what about 10? What about 200? And what if heart-rate is one of thousands of different measurements we're taking. In such cases, maintaining sets of manually defined thresholds and filters would be overly cumbersome.

I think there is a small typo in the sentence "This final line is a robust estimate of the sample mean". From the derivation above, I think the final line is a robust estimate of one sigma (the standard deviation) of births, provided the data follow a normal distribution.

extracting phase information using numpy fft

I am trying to use a fast Fourier transform to extract the phase shift of a single sinusoidal function. I know that on paper, if we denote the transform of our function as T, then we have the following relations:
However, I am finding that while I am able to accurately capture the frequency of my cosine wave, the phase is inaccurate unless I sample at an extremely high rate. For example:
import numpy as np
import pylab as pl
num_t = 100000
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
w = 2.0*np.pi*30.0
phase = np.pi/2.0
amp = np.fft.rfft(np.cos(w*t+phase))
freqs = np.fft.rfftfreq(t.shape[-1],dt)
print(np.arctan2(amp.imag, amp.real)[30])
pl.subplot(211)
pl.plot(freqs[:60],np.sqrt(amp.real**2+amp.imag**2)[:60])
pl.subplot(212)
pl.plot(freqs[:60],(np.arctan2(amp.imag,amp.real))[:60])
pl.show()
Using num=100000 points I get a phase of 1.57173880459.
Using num=10000 points I get a phase of 1.58022110476.
Using num=1000 points I get a phase of 1.6650441064.
What's going wrong? Even with 1000 points I have 33 points per cycle, which should be enough to resolve it. Is there maybe a way to increase the number of computed frequency points? Is there any way to do this with a "low" number of points?
EDIT: from further experimentation it seems that I need ~1000 points per cycle in order to accurately extract a phase. Why?!
EDIT 2: further experiments indicate that accuracy is related to number of points per cycle, rather than absolute numbers. Increasing the number of sampled points per cycle makes phase more accurate, but if both signal frequency and number of sampled points are increased by the same factor, the accuracy stays the same.

Your points are not distributed equally over the interval: the end point is doubled, because for a periodic signal t = 0 is the same point as t = 1. This matters less the more points you take, obviously, but it still gives some error. You can avoid it entirely, since linspace has a flag for this. It also has a flag to return dt directly along with the array.
Do
t, dt = np.linspace(0, 1, num_t, endpoint=False, retstep=True)
instead of
t = np.linspace(0,1,num_t)
dt = 1.0/num_t
then it works :)
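As a rough check of this, the sketch below compares the two ways of building the time axis with only 1000 samples. It uses the 30 Hz frequency and pi/2 phase from the question. The helper name fft_phase is made up for illustration.

```python
import numpy as np

def fft_phase(num_t, endpoint, freq_hz=30.0, phase=np.pi / 2.0):
    """Phase of the strongest rfft bin for cos(2*pi*f*t + phase) on [0, 1]."""
    t = np.linspace(0.0, 1.0, num_t, endpoint=endpoint)
    spectrum = np.fft.rfft(np.cos(2.0 * np.pi * freq_hz * t + phase))
    k = int(np.argmax(np.abs(spectrum)))  # bin holding the 30 Hz peak
    return np.angle(spectrum[k])          # same as arctan2(imag, real)

phase_with_endpoint = fft_phase(1000, endpoint=True)
phase_without_endpoint = fft_phase(1000, endpoint=False)

print("endpoint=True :", phase_with_endpoint)
print("endpoint=False:", phase_without_endpoint)
print("expected      :", np.pi / 2.0)
```

With endpoint=False the samples cover exactly 30 whole periods, so the recovered phase matches pi/2 to within floating-point error even at this low sample count.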

The phase value in the result bin of an unrotated FFT is only correct if the input signal is exactly integer periodic within the FFT length. Your test signal is not, thus the FFT measures something partially related to the phase difference of the signal discontinuity between end-points of the test sinusoid. A higher sample rate will create a slightly different last end-point from the sinusoid, and thus a possibly smaller discontinuity.
If you want to decrease this FFT phase measurement error, create your test signal so that your test phase is referenced to the exact center (sample N/2) of the test vector (not the 1st sample), and then do an fftshift operation (rotate by N/2) so that there will be no signal discontinuity between the 1st and last point in your resulting FFT input vector of length N.
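A small sketch of the centering idea follows. The test frequency of 30.3 cycles per record is an assumption chosen so that the signal is deliberately not integer periodic. The helper name peak_phase is also made up for illustration.

```python
import numpy as np

N = 1000               # even length, so sample N/2 is the exact centre
freq_cycles = 30.3     # assumption: deliberately not a whole number of periods
true_phase = np.pi / 2.0
omega = 2.0 * np.pi * freq_cycles / N  # radians per sample

n = np.arange(N)
# Phase referenced to the first sample (the usual construction).
x_start = np.cos(omega * n + true_phase)
# Phase referenced to the centre sample N/2.
x_centre = np.cos(omega * (n - N // 2) + true_phase)

def peak_phase(signal):
    """Phase of the strongest rfft bin."""
    spectrum = np.fft.rfft(signal)
    k = int(np.argmax(np.abs(spectrum)))
    return np.angle(spectrum[k])

phase_plain = peak_phase(x_start)
# fftshift rotates by N/2, so the centre sample becomes sample 0.
phase_centred = peak_phase(np.fft.fftshift(x_centre))

print("plain  :", phase_plain)
print("centred:", phase_centred)
print("true   :", true_phase)
```

In this setup the plain estimate is off by a large fraction of a radian, while the centred and rotated version lands within a few thousandths of a radian of the true phase. The small remaining error comes from leakage of the negative-frequency component.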

This snippet of code might help:
import numpy as np
from numpy.fft import rfft, irfft
THRESHOLD = 0.01  # fraction of the largest FFT amplitude below which modes are dropped
def reconstruct_ifft(data):
    """
    In this function, we take in a signal, find its fft, retain the dominant modes and reconstruct the signal from that
    Parameters
    ----------
    data : Signal to do the fft, ifft
    Returns
    -------
    reconstructed_signal : the reconstructed signal
    """
    N = data.size
    yf = rfft(data)
    amp_yf = np.abs(yf)  # amplitude
    yf = yf * (amp_yf > (THRESHOLD * np.amax(amp_yf)))  # zero the weak modes
    reconstructed_signal = irfft(yf, n=N)  # n=N keeps the length right for odd N
    return reconstructed_signal
THRESHOLD (0.01 here) is the fraction of the largest FFT amplitude that a mode must exceed in order to be retained. Making THRESHOLD greater (more than 1 does not make any sense, since no mode would survive) will give fewer modes and cause a higher RMS error, but ensures higher frequency selectivity.

FFT in Python with Explanations

I have a WAV file which I would like to visualize in the frequency domain. Next, I would like to write a simple script that takes in a WAV file and outputs whether the energy at a certain frequency "F" exceeds a threshold "Z" (whether a certain tone has a strong presence in the WAV file). There are a bunch of code snippets online that show how to plot an FFT spectrum in Python, but I don't understand a lot of the steps.
I know that wavfile.read(myfile) returns the sampling rate (fs) and the data array (data), but when I run an FFT on it (y = numpy.fft.fft(data)), what units is y in?
To get the array of frequencies for the x-axis, some posters do this where n = len(data):
X = numpy.linspace(0.0, 1.0/(2.0*T), n//2)
and others do this:
X = (numpy.fft.fftfreq(n) * fs)[range(n//2)]
Is there a difference between these two methods and is there a good online explanation for what these operations do conceptually?
Some of the online tutorials about FFTs mention windowing, but not a lot of posters use windowing in their code snippets. I see that numpy has a numpy.hamming(N), but what should I use as the input to that method and how do I "apply" the output window to my FFT arrays?
For my threshold computation, is it correct to find the frequency in X that's closest to my desired tone/frequency and check if the corresponding element (same index) in Y has an amplitude greater than the threshold?

FFT data is in units of normalized frequency, where the first point is 0 Hz and one past the last point is fs Hz. You can create the frequency axis yourself with linspace(0.0, (1.0 - 1.0/n)*fs, n). You can also use fftfreq, but then the upper half of the components will be negative frequencies.
These are the same if n is even. You can also use rfftfreq I think. Note that this is only the "positive half" of your frequencies, which is probably what you want for audio (which is real-valued). Note that you can use rfft to just produce the positive half of the spectrum, and then get the frequencies with rfftfreq(n,1.0/fs).
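To see how these frequency axes relate, here is a small comparison. The values fs = 8000 and n = 16 are arbitrary and chosen only to keep the arrays short.

```python
import numpy as np

fs = 8000.0  # assumed sampling rate in Hz
n = 16       # assumed (even) number of samples

# Hand-built axis: 0 Hz up to one bin short of fs.
axis_linspace = np.linspace(0.0, (1.0 - 1.0 / n) * fs, n)
# fftfreq: same positive half, but the upper half is reported as negative frequencies.
axis_fftfreq = np.fft.fftfreq(n, d=1.0 / fs)
# rfftfreq: only the non-negative frequencies, matching the output of rfft.
axis_rfftfreq = np.fft.rfftfreq(n, d=1.0 / fs)

print(axis_linspace)
print(axis_fftfreq)
print(axis_rfftfreq)
```

The first n/2 entries of the hand-built axis and of fftfreq agree. rfftfreq returns n/2 + 1 entries, because it also includes the Nyquist frequency fs/2.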
Windowing will decrease sidelobe levels, at the cost of widening the mainlobe of any frequencies that are there. N is the length of your signal and you multiply your signal by the window. However, if you are looking in a long signal you might want to "chop" it up into pieces, window them, and then add the absolute values of their spectra.
"is it correct" is hard to answer. The simple approach is as you said, find the bin closest to your frequency and check its amplitude.
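Putting the windowing and the closest-bin check together might look like the sketch below. To stay self-contained it uses a synthetic 440 Hz tone instead of a WAV file. The function name tone_present, the test tone and the threshold value are all assumptions made for illustration.

```python
import numpy as np

def tone_present(data, fs, target_hz, threshold):
    """Check whether the bin closest to target_hz exceeds threshold.

    Returns (flag, bin frequency in Hz, estimated amplitude).
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    window = np.hamming(n)                 # N is the length of the signal
    spectrum = np.fft.rfft(data * window)  # multiply the signal by the window
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Scale so that a sinusoid of amplitude A sitting on a bin reads roughly A.
    amplitude = 2.0 * np.abs(spectrum) / window.sum()
    k = int(np.argmin(np.abs(freqs - target_hz)))  # bin closest to the tone
    return amplitude[k] > threshold, freqs[k], amplitude[k]

# Synthetic stand-in for wavfile.read(): one second of a 440 Hz tone.
fs = 8000
t = np.arange(fs) / fs
signal = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)

found, f_bin, amp = tone_present(signal, fs, 440.0, threshold=0.1)
missing, _, amp_1k = tone_present(signal, fs, 1000.0, threshold=0.1)
print(found, f_bin, amp)
print(missing, amp_1k)
```

For real audio, the threshold has to be chosen in the same units as the scaled amplitude. A tone that falls between two bins will read somewhat lower than its true amplitude.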

Algorithm to detect spike in x y graph

Imagine a realtime x, y graph where x is the quantity and y is time, with 1 minute interval. Every minute a new value is pushed in the graph. So I want to detect whenever there is a spike in the graph.
There are 2 kinds of spike:
Sudden Spike
Gradual Spike
Is there any way to detect them?

Since spikes occur over a short distance (x2 - x1), you can take the standard deviation of a set of y values over a short range of x. If the deviation is reasonably large, it's a spike.
For example for 9 consecutive y values
4,4,5,10,26,10,5,4,4 standard deviation is 7.19.
4,4,5,10,100,10,5,4,4 standard deviation is 31.51.
You can start by analysing the highest values of y and its neighbours.
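The two numbers above can be reproduced as follows. They correspond to the sample standard deviation (ddof=1), not NumPy's default population value. The helper window_has_spike and its threshold are illustrative only.

```python
import numpy as np

calm = np.array([4, 4, 5, 10, 26, 10, 5, 4, 4], dtype=float)
spiky = np.array([4, 4, 5, 10, 100, 10, 5, 4, 4], dtype=float)

# ddof=1 gives the sample standard deviation quoted in the answer.
std_calm = calm.std(ddof=1)
std_spiky = spiky.std(ddof=1)
print(std_calm, std_spiky)

def window_has_spike(values, threshold):
    """Flag a short window whose sample standard deviation exceeds threshold."""
    return np.std(values, ddof=1) > threshold

# Assumption: a threshold of 20 separates the two example windows.
print(window_has_spike(calm, 20.0), window_has_spike(spiky, 20.0))
```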

You can take the first derivative of y w.r.t. x using numpy.diff. Get a set of clean signals and obtain the threshold for it by obtaining the upper limit for derivative (this was the max deviation a clean signal had) using plain old max(array).
Then you can subject your real time signal to the same kind of scrutiny, check for the derivative.
Also, you could threshold it based on the angle of the signal, but you would need a comprehensive sample size for that. You can use the arctangent of the slope, arctan(dy/dx), for this.
Different thresholds give you different kinds of peaks.
In addition to the suggestion above, you could also calculate the standard deviation with numpy.std(array) and then check whether a value lies more than that far from the mean. This would, of course, be better seen using the derivative as I mentioned.
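A rough sketch of the derivative approach is below. The clean and live arrays are invented for illustration, and the sketch takes the absolute value of the derivative so that sudden drops are caught as well as sudden rises, which goes slightly beyond the plain max(array) described above.

```python
import numpy as np

def derivative_threshold(clean_signal):
    """Largest sample-to-sample change seen in a known-clean signal."""
    return np.max(np.abs(np.diff(clean_signal)))

def find_spikes(signal, threshold):
    """Indices of samples reached by a jump larger than threshold."""
    jumps = np.abs(np.diff(signal))
    # diff[i] is signal[i + 1] - signal[i], so shift by one to index the later sample.
    return np.flatnonzero(jumps > threshold) + 1

# Hypothetical data: a calm reference signal and a live signal with a spike.
clean = np.array([4, 4, 5, 6, 5, 4, 4, 5, 4], dtype=float)
live = np.array([4, 4, 5, 10, 100, 10, 5, 4, 4], dtype=float)

threshold = derivative_threshold(clean)
spike_samples = find_spikes(live, threshold)
print(threshold, spike_samples)
```

Both the rise into the spike and the fall out of it are flagged. A gradual spike would need a lower threshold or a derivative taken over a longer step.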
A method used in financial analysis is Bollinger Bands. This link can give you more information about it : http://sentdex.com/sentiment-analysisbig-data-and-python-tutorials-algorithmic-trading/how-to-chart-stocks-and-forex-doing-your-own-financial-charting/calculate-bollinger-bands-python-graph-matplotlib/
They are basically a moving average of a time series over some period, with an upper and a lower band drawn a fixed number of moving standard deviations away from it. You can get a better set of thresholds using them rather than just the standard deviation.
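One possible sketch of Bollinger-style bands with pandas is below. The window length, the factor of 2 standard deviations and the sample values are assumptions made for illustration. Each new point is compared against the bands computed from the points before it, so a spike cannot widen the band it is tested against.

```python
import pandas as pd

def bollinger_bands(values, window=20, num_std=2.0):
    """Return (lower, middle, upper) bands: moving mean -/+ num_std moving stds."""
    series = pd.Series(values, dtype=float)
    middle = series.rolling(window).mean()
    spread = series.rolling(window).std()
    return middle - num_std * spread, middle, middle + num_std * spread

# Hypothetical one-per-minute readings with a sudden spike at index 10.
values = [4, 4, 5, 4, 5, 4, 4, 5, 4, 5, 100, 4]
lower, middle, upper = bollinger_bands(values, window=5)

series = pd.Series(values, dtype=float)
# shift(1) makes each point face the bands built from the previous window only.
outside = (series > upper.shift(1)) | (series < lower.shift(1))
spike_indices = list(series.index[outside])
print(spike_indices)
```

Right after a spike the bands widen considerably, so follow-up points are unlikely to be flagged until the spike leaves the window.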
