Imagine a realtime x, y graph where x is time and y is the quantity, sampled at 1-minute intervals. Every minute a new value is pushed onto the graph, and I want to detect whenever there is a spike in the graph.
There are 2 kinds of spike:
Sudden Spike
Gradual Spike
Is there any way to detect them?
Since spikes occur over a short distance (x2 - x1), you can take the standard deviation of a set of y values over a short range of x. If the deviation is reasonably large, it's a spike.
For example, for 9 consecutive y values:
4, 4, 5, 10, 26, 10, 5, 4, 4 has a (sample) standard deviation of 7.19.
4, 4, 5, 10, 100, 10, 5, 4, 4 has a (sample) standard deviation of 31.51.
You can start by analysing the highest values of y and its neighbours.
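A minimal sketch of the windowed-deviation idea above; the window size and threshold are illustrative values you would tune for your own data:

```python
import numpy as np

def detect_spikes(y, window=9, threshold=5.0):
    """Slide a window over y and flag windows whose sample standard
    deviation exceeds `threshold`; report the index of the largest
    value inside each flagged window."""
    spikes = []
    for i in range(len(y) - window + 1):
        segment = np.asarray(y[i:i + window], dtype=float)
        if np.std(segment, ddof=1) > threshold:
            spikes.append(i + int(np.argmax(segment)))
    return sorted(set(spikes))

y = [4, 4, 5, 10, 26, 10, 5, 4, 4]
print(np.std(y, ddof=1))   # ~7.19, matching the example above
print(detect_spikes(y))    # [4] -> the spike at value 26
```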
You can take the first derivative of y w.r.t. x using numpy.diff. Get a set of clean signals and derive a threshold from them: the upper limit of the derivative (the maximum slope a clean signal ever shows), obtained with plain old max(array).
Then you can subject your real time signal to the same kind of scrutiny, check for the derivative.
Also, you could threshold on the angle of the signal (the arctangent of its slope), but you would need a comprehensive sample size to calibrate that.
Different thresholds give you different kinds of peaks.
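The derivative-based approach can be sketched like this; the "clean" calibration array here is a made-up example:

```python
import numpy as np

# Calibrate a threshold on clean signals, then flag samples reached
# via a first derivative that exceeds it.
clean = np.array([4, 4, 5, 6, 5, 4, 4], dtype=float)
threshold = np.max(np.abs(np.diff(clean)))   # steepest slope in clean data

signal = np.array([4, 4, 5, 10, 26, 10, 5, 4, 4], dtype=float)
d = np.diff(signal)                          # first derivative (unit x spacing)
spike_idx = np.where(np.abs(d) > threshold)[0] + 1
print(spike_idx)   # samples reached via an unusually steep slope
```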
Adding to the suggestions above, you could also calculate the standard deviation with numpy.std(array) and then check for values more than that amount above or below the mean. This is, of course, best combined with the derivative approach I mentioned.
A method used in financial analysis is Bollinger Bands. This link gives more information about them: http://sentdex.com/sentiment-analysisbig-data-and-python-tutorials-algorithmic-trading/how-to-chart-stocks-and-forex-doing-your-own-financial-charting/calculate-bollinger-bands-python-graph-matplotlib/
They are basically a moving average over a period of the time series, with bands placed a chosen number of standard deviations above and below it. You can get a better set of thresholds from them than from the standard deviation alone.
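A rough sketch of Bollinger Bands; the window length and the multiplier k are the usual tunables (20 and 2 are common in finance, smaller values are used here to fit the short example series):

```python
import numpy as np

def bollinger_bands(y, window=5, k=2.0):
    """Return (lower, middle, upper): the moving average over `window`
    points, plus/minus k rolling standard deviations."""
    y = np.asarray(y, dtype=float)
    ma = np.convolve(y, np.ones(window) / window, mode="valid")
    sd = np.array([np.std(y[i:i + window])
                   for i in range(len(y) - window + 1)])
    return ma - k * sd, ma, ma + k * sd

y = [4, 4, 5, 10, 26, 10, 5, 4, 4]
lower, mid, upper = bollinger_bands(y)
# a point above `upper` (or below `lower`) is a band breakout, i.e. a spike
```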
Related
I am currently gathering X-axis gyro data from a device. I need to find the differences between data sets in terms of their y values, in graphs such as this one.
I have written an algorithm that finds the min and max y value, but when the data has minor fluctuations like this one, my algorithm returns faulty answers. I have written it in Python and it is as follows:
import numpy as np

x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")
max_value = max(y)
min_value = min(y)
print(min_value)
borders = max_value - min_value
I now need to write an algorithm that will:
Determine the max and min y value and draw their borders.
If it sees minor fluctuations it will ignore them.
Would writing such an algorithm be possible, and if so, how could I go about it? Are there any libraries or reading material you would recommend? Thanks for your help.
1. Generally, maths like this should be done in pure code, with little or no help from external APIs, so that it is easier to debug the algorithm and the process.
2. Now for a possible answer:
Since you do not want any outliers (those buggy and irritating minor fluctuations), you need to calculate the standard deviation of your data.
What is the standard deviation, you might ask?
It represents how far your data is from the mean (symbol: µ), i.e. the average of your data set.
If you do not know what it is, here is a quick tutorial on the standard deviation and its partner, variance.
First, the MEAN:
It should not be that hard to calculate the mean of your x and y arrays/lists. Just loop (a for loop would be optimal) through the list, add up all the values, and divide by the length of the list itself. There you have the mean.
Second, the VARIANCE(σ squared):
If you followed the website above: to calculate the variance, loop through the x and y lists again, subtract each value from its respective mean to get the difference, square this difference, add all the squared differences up, and divide by the length of the respective list. That is the variance.
For the Standard Deviation (Symbol: σ), just take the square root of the variance.
Now the standard deviation can be used to find the real max and min (leaving out those buggy outliers).
Use this graph as an approximate reference to find where most of your values may be:
Since your data is mostly uniform, you should get a pretty accurate answer.
Do test the different cases, µ ± σ and µ ± 2σ, to find the optimum max and min.
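The three steps above (mean, variance, standard deviation) can be sketched from scratch, followed by keeping only values within 2σ of µ; the sample list reuses the spiky example from earlier in this thread:

```python
def mean(values):
    # sum all values and divide by the count
    total = 0.0
    for v in values:
        total += v
    return total / len(values)

def variance(values):
    # average of the squared differences from the mean
    mu = mean(values)
    total = 0.0
    for v in values:
        total += (v - mu) ** 2
    return total / len(values)

def std_dev(values):
    return variance(values) ** 0.5

y = [4, 4, 5, 10, 100, 10, 5, 4, 4]
mu, sigma = mean(y), std_dev(y)
filtered = [v for v in y if mu - 2 * sigma <= v <= mu + 2 * sigma]
print(filtered)   # the 100 is dropped as an outlier
```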
Edit:
Sorry, only y now:
Sorry for the horrible representation and drawing; this is roughly what it should look like graphically. Also, do experiment yourself with distances from the mean (as above: µ ± σ or µ ± 2σ) to find your best-suited max and min.
I have an intensity v/s velocity spectrum and my aim is to find the RMS noise in the spectrum excluding the channels where the peak is present.
So, after some research, I came to know that RMS noise is the same as the standard deviation of the spectrum and the signal-to-noise ratio of the signal is the average of the signal divided by the same standard deviation. Can anybody please tell me if I am wrong here?
This is how I coded it in python
def Average(data):
    return sum(data) / len(data)

average = Average(data)
print("Average of the list =", average)

standardDev = data.std()
print('The standard deviation is', standardDev)

SNR = average / standardDev
print('SNR = ', SNR)
My original data points are:
x-axis(velocity) :
[-5.99999993e+04 -4.99999993e+04 -3.99999993e+04 -2.99999993e+04
-1.99999993e+04 -9.99999934e+03 6.65010004e-04 1.00000007e+04
2.00000007e+04 3.00000007e+04 4.00000007e+04 5.00000007e+04
6.00000007e+04 7.00000007e+04 8.00000007e+04 9.00000007e+04
1.00000001e+05 1.10000001e+05 1.20000001e+05 1.30000001e+05
1.40000001e+05]
y-axis (data):
[ 0.00056511 -0.00098584 -0.00325616 -0.00101042 0.00168894 -0.00097406
-0.00134408 0.00128847 -0.00111633 -0.00151621 0.00299326 0.00916455
0.00960554 0.00317363 0.00311124 -0.00080881 0.00215932 0.00596419
-0.00192256 -0.00190138 -0.00013216]
If I want to measure the standard deviation excluding the channels where the line is present, should I exclude values from y[10] to y[14] and then calculate the standard deviation?
Yes, since you are determining properties of the noise, you should exclude the points that do not constitute noise. If those are points 10 to 14, exclude them.
Then you compute the average of the remaining y-values (intensity). However, from your data and the fitting function, a * exp(-(x-c)**2 / w), one might infer that the theoretical value of this mean is just zero. If so, the average only serves to validate your experiment/theory ("we've obtained almost zero, as expected") and you use 0 as the true average value. The noise level would then amount to the square root of the second moment, E(Y^2).
You should compare the stddev from your code with the square root of the second moment; they should be so similar that it does not matter which of them you choose as the noise value.
The part about SNR, the signal-to-noise ratio, is wrong in your derivation. The signal is the amplitude of the Gaussian obtained from the fit; you divide it by the noise level (either the square root of the second moment or the stddev). To my eye, you should obtain a value between 2 and about 10.
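A sketch of this procedure on the data from the question: exclude the peak channels (indices 10 to 14), estimate the noise from what remains, and divide the fitted amplitude by that noise level. The amplitude value below is a placeholder for whatever your Gaussian fit returns:

```python
import numpy as np

y = np.array([0.00056511, -0.00098584, -0.00325616, -0.00101042,
              0.00168894, -0.00097406, -0.00134408, 0.00128847,
              -0.00111633, -0.00151621, 0.00299326, 0.00916455,
              0.00960554, 0.00317363, 0.00311124, -0.00080881,
              0.00215932, 0.00596419, -0.00192256, -0.00190138,
              -0.00013216])

noise_channels = np.delete(y, np.arange(10, 15))    # drop channels 10..14
rms_noise = np.sqrt(np.mean(noise_channels ** 2))   # sqrt of the second moment
std_noise = np.std(noise_channels)                  # should be very similar

amplitude = 0.0096   # hypothetical: peak amplitude from your Gaussian fit
snr = amplitude / rms_noise
print(rms_noise, std_noise, snr)
```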
Finally, remember that this is a public forum and that some people who read it may be puzzled by the question and answer: both build on the previous question, Fitting data to a gaussian profile, which should have been referred to in the question itself.
If this is a university assignment and you work on real experimental data, remember the purpose. Imagine yourself as a scientist who has to convince others that this is a real signal, say, from the Aliens, and not just an erratic result of Mother Nature tossing dice at random. That is the primary purpose of the signal-to-noise ratio.
I'm reading a book on Data Science for Python and the author applies 'sigma-clipping operation' to remove outliers due to typos. However the process isn't explained at all.
What is sigma clipping? Is it only applicable to certain data (e.g. in the book it is used on birth rates in the US)?
As per the text:
quartiles = np.percentile(births['births'], [25, 50, 75]) #so we find the 25th, 50th, and 75th percentiles
mu = quartiles[1] #we set mu = 50th percentile
sig = 0.74 * (quartiles[2] - quartiles[0]) #???
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
Why 0.74? Is there a proof for this?
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
That's it, really...
The code tries to estimate sigma using the interquartile range to make it robust against outliers. 0.74 is a correction factor. Here is how to calculate it:
from scipy import stats

p1 = stats.norm.ppf(0.25)  # first quartile of the standard normal distribution
p2 = stats.norm.ppf(0.75)  # third quartile
print(p2 - p1)             # 1.3489795003921634
sig = 1                    # standard deviation of the standard normal distribution
factor = sig / (p2 - p1)
print(factor)              # 0.74130110925280102
In the standard normal distribution sig == 1 and the interquartile range is about 1.35, so 0.74 is the correction factor that turns the interquartile range into sigma. Of course, this is only exact for the normal distribution.
Suppose you have a set of data. Compute its median m and its standard deviation sigma. Keep only the data that falls in the range (m-a*sigma,m+a*sigma) for some value of a, and discard everything else. This is one iteration of sigma clipping. Continue to iterate a predetermined number of times, and/or stop when the relative reduction in the value of sigma is small.
Sigma clipping is geared toward removing outliers, to allow for a more robust (i.e. resistant to outliers) estimation of, say, the mean of the distribution. So it's applicable to data where you expect to find outliers.
As for the 0.74, it comes from the interquartile range of the Gaussian distribution, as per the text.
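The iterative procedure described above can be sketched as follows; the defaults for a, the iteration cap, and the relative-change tolerance are illustrative:

```python
import numpy as np

def sigma_clip(data, a=3.0, max_iter=10, tol=0.01):
    """Iterative sigma clipping: keep values in (m - a*sigma, m + a*sigma)
    around the median m, and repeat until nothing is removed, sigma stops
    shrinking (relative change < tol), or max_iter is reached."""
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        m, sigma = np.median(data), np.std(data)
        kept = data[(data > m - a * sigma) & (data < m + a * sigma)]
        if len(kept) == len(data):
            break                      # nothing removed this pass
        new_sigma = np.std(kept)
        data = kept
        if sigma > 0 and (sigma - new_sigma) / sigma < tol:
            break                      # sigma no longer shrinking much
    return data

raw = np.array([1, 2, 3, 4, 1, 2, 3, 4, 5, 3, 4, 5, 3, 500, 1000])
clipped = sigma_clip(raw)
print(clipped)   # the 500 and 1000 are clipped away
```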
The answers here are accurate and reasonable, but don't quite get to the heart of your question:
What is sigma clipping? Is it only applicable for certain data?
If we want to use the mean (mu) and standard deviation (sigma) to set a threshold for rejecting extreme values, in situations where we have reason to suspect those extreme values are mistakes (and not just genuinely high or low values), we don't want to calculate mu and sigma from the dataset that includes those mistakes.
Sample problem: you need to compute a threshold for a temperature sensor to indicate when the temperature is "High" - but sometimes the sensor gives readings that are impossible, like "surface of the sun" high.
Imagine a series that looks like this:
import numpy as np
import scipy.stats as st

thisSeries = np.array([1, 2, 3, 4, 1, 2, 3, 4, 5, 3, 4, 5, 3, 500, 1000])
Those last two values look like obvious mistakes - but if we use a typical stats function like a Normal PPF, it will implicitly assume that those outliers belong to the distribution and perform its calculation accordingly:
st.norm.ppf(.975, thisSeries.mean(), thisSeries.std())
631.5029013468446
So using a two-sided 5% outlier threshold (meaning we reject the lower and upper 2.5%), it tells me that 500 is not an outlier. Even with a one-sided threshold of .95 (reject the upper 5%), it gives 546 as the outlier limit, so again 500 is regarded as a non-outlier.
Sigma-clipping works by focusing on the interquartile range and using the median instead of the mean, so the thresholds are not calculated under the influence of the extreme values.
import pandas as pd

thisDF = pd.DataFrame(thisSeries, columns=["value"])
intermed = "value"
factor = 5
quartiles = np.percentile(thisSeries, [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
# pandas query() refers to Python variables with '@', not '#'
queryString = '({} < @mu - {} * @sig) | ({} > @mu + {} * @sig)'.format(intermed, factor, intermed, factor)
print(mu + 5 * sig)
10.4
print(thisDF.query(queryString))
    value
13    500
14   1000
At factor=5, both outliers are correctly isolated, and the threshold is at a reasonable 10.4 - reasonable, given that the 'clean' part of the series is [1,2,3,4,1,2,3,4,5,3,4,5,3]. ('factor' in this context is a scalar applied to the thresholds)
To answer the question, then: sigma clipping is a method of identifying outliers that is immune to the deforming effects of the outliers themselves. Though it can be used in many contexts, it excels in situations where you suspect the extreme values are not merely high or low values that belong in the dataset, but errors.
Here's an illustration of the difference between extreme values that are part of a distribution, and extreme values that are possibly errors, or just so extreme as to deform analysis of the rest of the data.
The data above was generated synthetically, but you can see that the highest values in this set are not deforming the statistics.
Now here's a set generated the same way, but this time with some artificial outliers injected (above 40):
If I sigma-clip this, I can get back to the original histogram and statistics, and apply them usefully to the dataset.
But where sigma-clipping really shines is in real world scenarios, in which faulty data is common. Here's an example that uses real data - historical observations of my heart-rate monitor. Let's look at the histogram without sigma-clipping:
I'm a pretty chill dude, but I know for a fact that my heart rate is never zero. Sigma-clipping handles this easily, and we can now look at the real distribution of heart-rate observations:
Now, you may have domain knowledge that would let you manually assert outlier thresholds or filters. This is one final reason we might use sigma-clipping: in situations where data is handled entirely by automation, or where we have no domain knowledge about the measurement or how it is taken, we have no informed basis for filter or threshold statements.
It's easy to say that a heart rate of 0 is not a valid measurement - but what about 10? What about 200? And what if heart rate is one of thousands of different measurements we're taking? In such cases, maintaining sets of manually defined thresholds and filters would be overly cumbersome.
I think there is a small typo in the sentence "this final line is a robust estimate of the sample mean". From the preceding derivation, the final line is actually a robust estimate of 1 sigma for the births data (assuming a normal distribution), not of the mean.
I have a WAV file which I would like to visualize in the frequency domain. Next, I would like to write a simple script that takes in a WAV file and outputs whether the energy at a certain frequency "F" exceeds a threshold "Z" (whether a certain tone has a strong presence in the WAV file). There are a bunch of code snippets online that show how to plot an FFT spectrum in Python, but I don't understand a lot of the steps.
I know that wavfile.read(myfile) returns the sampling rate (fs) and the data array (data), but when I run an FFT on it (y = numpy.fft.fft(data)), what units is y in?
To get the array of frequencies for the x-axis, some posters do this where n = len(data):
X = numpy.linspace(0.0, 1.0/(2.0*T), n//2)   # where T = 1.0/fs
and others do this:
X = (numpy.fft.fftfreq(n) * fs)[range(n//2)]
Is there a difference between these two methods and is there a good online explanation for what these operations do conceptually?
Some of the online tutorials about FFTs mention windowing, but not a lot of posters use windowing in their code snippets. I see that numpy has a numpy.hamming(N), but what should I use as the input to that method and how do I "apply" the output window to my FFT arrays?
For my threshold computation, is it correct to find the frequency in X that's closest to my desired tone/frequency and check if the corresponding element (same index) in Y has an amplitude greater than the threshold?
The FFT output lies on a normalized frequency grid: the first point is 0 Hz, and the point one past the last would be fs Hz. You can create the frequency axis yourself with linspace(0.0, (1.0 - 1.0/n)*fs, n). You can also use fftfreq (scaled by fs), but half of the components will come out negative.
The two methods are the same if n is even. You can also use rfftfreq, I believe. Note that this is only the "positive half" of your frequencies, which is probably what you want for audio (which is real-valued). You can use rfft to produce just the positive half of the spectrum, and then get the matching frequencies with rfftfreq(n, 1.0/fs).
Windowing decreases sidelobe levels, at the cost of widening the mainlobe of any frequencies that are present. N is the length of your signal, and you apply the window by multiplying your signal by it. However, if you are looking at a long signal, you might want to chop it up into pieces, window each piece, and then add (or average) the absolute values of their spectra.
"Is it correct?" is hard to answer in general, but yes: the simple approach is, as you said, to find the bin closest to your frequency and check its amplitude.
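A sketch of the whole pipeline: window the samples, take the real FFT, and compare the magnitude at the bin nearest the target frequency F against a threshold Z. A synthetic tone stands in for wavfile.read here, and the values of F and Z are placeholders to calibrate for your own recordings:

```python
import numpy as np

def tone_present(data, fs, F, Z):
    """Return True if the windowed FFT magnitude at the bin nearest
    frequency F (in Hz) exceeds threshold Z."""
    n = len(data)
    windowed = data * np.hamming(n)               # apply the window
    spectrum = np.abs(np.fft.rfft(windowed))      # positive-half magnitudes
    freqs = np.fft.rfftfreq(n, 1.0 / fs)          # matching frequency axis, Hz
    bin_idx = int(np.argmin(np.abs(freqs - F)))   # bin closest to F
    return bool(spectrum[bin_idx] > Z)

# synthetic stand-in for wavfile.read: one second of a 440 Hz tone
fs = 8000
t = np.arange(fs) / fs
data = np.sin(2 * np.pi * 440 * t)

print(tone_present(data, fs, F=440.0, Z=100.0))    # True: tone is there
print(tone_present(data, fs, F=1000.0, Z=100.0))   # False: no energy at 1 kHz
```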
I have a couple of images of graphs for which I'd like to synthesise the original (x,y) data. I don't have the original data, just the graphs.
Ideally I'd like to be able to approximate the shape of the curves using mathematical functions, so that I can vary the functions to produce slightly different output, and so that I can have simple reproducibility.
The first image shows a set of curves: temperature anomalies from the mean over some recent period, stretching back to 20,000 years before present. The second image shows a step function with a change at 10,000 years before present (log scale). (You'll also notice they have opposite x-axis directions).
For each of these, the eventual data I want to produce is a text file, with a temperature anomaly value for every 10 or 100 years.
Any solution is welcome.
I am not sure I fully understand your question, but what you could try is to digitize the data (windig works on Windows and engauge on Linux) and then interpolate between the points.
The trivial interpolation, which works almost all the time, is just a straight line between two consecutive points. A more sophisticated approach is a cubic spline (for instance B-splines, http://en.wikipedia.org/wiki/B-spline), which keeps the second derivative continuous.
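The interpolation step can be sketched with scipy: given digitized (x, y) points, resample every 100 years with both a straight-line and a cubic interpolant. The sample knots below are made-up temperature anomalies, not digitized from the actual graphs:

```python
import numpy as np
from scipy.interpolate import interp1d, CubicSpline

x = np.array([0, 2000, 5000, 10000, 15000, 20000], dtype=float)  # years BP
y = np.array([0.0, -0.5, -1.2, -4.0, -4.5, -4.2])                # anomaly

linear = interp1d(x, y)      # straight line between consecutive points
spline = CubicSpline(x, y)   # keeps the second derivative continuous

# resample every 100 years; either column could be written to a text file
new_x = np.arange(0, 20001, 100)
resampled = np.column_stack([new_x, spline(new_x)])
print(resampled[:3])
```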
I've decided to answer with some details about a way to generate a curve using algebra.
For a periodic-type curve you'd use either sine or cosine, and fiddle with the amplitude and frequency to match your particular situation. For example, y = A sin(2x), where A is the amplitude and the period is set by the inner function of x (the bit inside the brackets). Try this in gnuplot:
A=2
f(x) = A*sin(2*x)
set xrange[-pi:pi]
plot f(x), sin(x), cos(x)
To change the amplitude as x changes, multiply the amplitude by a power or exponential term of x:
f(x) = A*exp(0.5*x)*sin(2*x)
set xrange[-2*pi:2*pi]
plot f(x), sin(x)
# add an initial value (offset)
f(x) = 5+A*exp(0.5*x)*sin(2*x)
plot f(x), sin(x)
And so on.