What is sigma clipping? How do you know when to apply it? - python

I'm reading a book on Data Science for Python and the author applies 'sigma-clipping operation' to remove outliers due to typos. However the process isn't explained at all.
What is sigma clipping? Is it only applicable for certain data (eg. in the book it's used towards birth rates in US)?
As per the text:
quartiles = np.percentile(births['births'], [25, 50, 75]) #so we find the 25th, 50th, and 75th percentiles
mu = quartiles[1] #we set mu = 50th percentile
sig = 0.74 * (quartiles[2] - quartiles[0]) #???
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
Why 0.74? Is there a proof for this?

This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
That's it, really...
The code tries to estimate sigma using the interquartile range to make it robust against outliers. 0.74 is a correction factor. Here is how to calculate it:
import scipy as sp
import scipy.stats  # makes sp.stats available

p1 = sp.stats.norm.ppf(0.25) # first quartile of standard normal distribution
p2 = sp.stats.norm.ppf(0.75) # third quartile
print(p2 - p1) # 1.3489795003921634
sig = 1 # standard deviation of the standard normal distribution
factor = sig / (p2 - p1)
print(factor) # 0.74130110925280102
In the standard normal distribution sig==1 and the interquartile range is 1.35. So 0.74 is the correction factor to turn the interquartile range into sigma. Of course, this is only true for the normal distribution.

Suppose you have a set of data. Compute its median m and its standard deviation sigma. Keep only the data that falls in the range (m-a*sigma,m+a*sigma) for some value of a, and discard everything else. This is one iteration of sigma clipping. Continue to iterate a predetermined number of times, and/or stop when the relative reduction in the value of sigma is small.
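Here is a minimal sketch of that iteration, assuming a 1-D numpy array and a clipping factor a (the function name and defaults are just for illustration):
import numpy as np

def sigma_clip(data, a=3.0, max_iter=5, tol=0.01):
    # one iteration: keep only the points inside (m - a*sigma, m + a*sigma)
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        m = np.median(data)
        sigma = data.std()
        kept = data[(data > m - a * sigma) & (data < m + a * sigma)]
        # stop when the relative reduction in sigma is small
        if sigma == 0 or (sigma - kept.std()) / sigma < tol:
            return kept
        data = kept
    return data
For instance, sigma_clip([1, 2, 3, 4, 1, 2, 3, 4, 5, 3, 4, 5, 3, 500, 1000]) should discard the two extreme values within a couple of iterations.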
Sigma clipping is geared toward removing outliers, to allow for a more robust (i.e. resistant to outliers) estimation of, say, the mean of the distribution. So it's applicable to data where you expect to find outliers.
As for the 0.74, it comes from the interquartile range of the Gaussian distribution, as per the text.

The answers here are accurate and reasonable, but don't quite get to the heart of your question:
What is sigma clipping? Is it only applicable for certain data?
If we want to use mean (mu) and standard deviation (sigma) to figure out a threshold for ejecting extreme values in situations where we have a reason to suspect that those extreme values are mistakes (and not just very high/low values), we don't want to calculate mu/sigma using the dataset which includes these mistakes.
Sample problem: you need to compute a threshold for a temperature sensor to indicate when the temperature is "High" - but sometimes the sensor gives readings that are impossible, like "surface of the sun" high.
Imagine a series that looks like this:
import numpy as np
import pandas as pd
import scipy.stats as st

thisSeries = np.array([1, 2, 3, 4, 1, 2, 3, 4, 5, 3, 4, 5, 3, 500, 1000])
Those last two values look like obvious mistakes - but if we use a typical stats function like a Normal PPF, it's going to implicitly assume that those outliers belong in the distribution, and perform its calculation accordingly:
st.norm.ppf(.975, thisSeries.mean(), thisSeries.std())
631.5029013468446
So using a two-sided 5% outlier threshold (meaning we will reject the lower and upper 2.5%), it's telling me that 500 is not an outlier. Even if I use a one-sided threshold of .95 (reject the upper 5%), it will give me 546 as the outlier limit, so again, 500 is regarded as non-outlier.
Sigma-clipping works by focusing on the inter-quartile range and using median instead of mean, so the thresholds won't be calculated under the influence of the extreme values.
thisDF = pd.DataFrame(thisSeries, columns=["value"])
intermed = "value"
factor = 5
quartiles = np.percentile(thisSeries, [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
queryString = '({} < @mu - {} * @sig) | ({} > @mu + {} * @sig)'.format(intermed, factor, intermed, factor)
print(mu + 5 * sig)
10.4
print(thisDF.query(queryString))
    value
13    500
14   1000
At factor=5, both outliers are correctly isolated, and the threshold is at a reasonable 10.4 - reasonable, given that the 'clean' part of the series is [1,2,3,4,1,2,3,4,5,3,4,5,3]. ('factor' in this context is a scalar applied to the thresholds)
To answer the question, then: sigma clipping is a method of identifying outliers which is immune from the deforming effects of the outliers themselves, and though it can be used in many contexts, it excels in situations where you suspect that the extreme values are not merely high/low values that should be considered part of the dataset, but rather that they are errors.
Here's an illustration of the difference between extreme values that are part of a distribution, and extreme values that are possibly errors, or just so extreme as to deform analysis of the rest of the data.
The data above was generated synthetically, but you can see that the highest values in this set are not deforming the statistics.
Now here's a set generated the same way, but this time with some artificial outliers injected (above 40):
If I sigma-clip this, I can get back to the original histogram and statistics, and apply them usefully to the dataset.
But where sigma-clipping really shines is in real world scenarios, in which faulty data is common. Here's an example that uses real data - historical observations of my heart-rate monitor. Let's look at the histogram without sigma-clipping:
I'm a pretty chill dude, but I know for a fact that my heart rate is never zero. Sigma-clipping handles this easily, and we can now look at the real distribution of heart-rate observations:
Now, you may have some domain knowledge that would enable you to manually assert outlier thresholds or filters. This is one final reason we might use sigma-clipping: in situations where data is being handled entirely by automation, or where we have no domain knowledge about the measurement or how it's taken, we don't have any informed basis for filter or threshold statements.
It's easy to say that a heart rate of 0 is not a valid measurement - but what about 10? What about 200? And what if heart-rate is one of thousands of different measurements we're taking? In such cases, maintaining sets of manually defined thresholds and filters would be overly cumbersome.
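As a rough illustration of the heart-rate case above (the real monitor data isn't reproduced here, so this uses made-up readings with a few impossible zeros):
import numpy as np

rng = np.random.default_rng(0)
# hypothetical readings: mostly around 75 bpm, plus 20 impossible zeros from a flaky sensor
hr = np.concatenate([rng.normal(75, 10, 500), np.zeros(20)])

q1, med, q3 = np.percentile(hr, [25, 50, 75])
sig = 0.74 * (q3 - q1)                 # robust sigma estimate, as in the book snippet
clean = hr[np.abs(hr - med) < 5 * sig]
print(len(hr) - len(clean))            # the impossible zero readings should be the ones clipped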

I think there is a small typo in the sentence "this final line is a robust estimate of the sample mean". From the derivation above, the final line is actually a robust estimate of one sigma for the births data, assuming the distribution is normal.

Related

why alpha = 2/(span + 1) in pandas.DataFrame.ewm

In the documentation of pandas.DataFrame.ewm, it says alpha = 2/(span + 1). I don't understand why this particular formula relates alpha and span.
Is it a convention, or are there sources that explain the formula?
Is it only an assumption/setting in pandas? Could it actually take other forms, e.g. alpha = 4/(span + 3)?
I googled but found no clue on that.
Grateful if someone could help! Thx.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html#pandas-dataframe-ewm
Is it a convention, or are there sources that explain the formula?
The span parameter is intended to roughly correspond to an N-period rolling average. Here's what Wikipedia says about this:
Note that there is no "accepted" value that should be chosen for alpha, although there are some recommended values based on the application. A commonly used value for alpha is alpha = 2/(N + 1). This is because the weights of an SMA and EMA have the same "center of mass" when alpha_EMA = 2/(N_SMA + 1).
So, the Pandas project did not come up with this formula - many others have used it.
Is it only an assumption/setting in pandas? Could it actually take other forms, e.g. alpha = 4/(span + 3)?
But would other ways of calculating alpha work too? You suggest 4/(N + 3), for example.
It's possible to do this - but it means that it approximates the N-period rolling average less well. Here's a practical example.
Below is a graph of the US unemployment rate between 2005-2008 (black line). Imagine you wanted to remove noise from this time series. One way to do that is to take a rolling 12-month average of the unemployment rate (blue line). Another possible approach would be to use an exponentially weighted average. But how should alpha be chosen, in order to get approximately a 12-month average?
Here are some formulas for alpha that you could use. There are three: the one Pandas uses, the one you suggested, and one I made up.
N    Formula    Alpha
12   1/(N+1)    0.0769
12   2/(N+1)    0.1538
12   4/(N+3)    0.2666
Below is a plot of what each looks like after smoothing.
You'll see that the green line ends up being similar to the blue rolling average, but is a bit more wiggly. The yellow line has a lower alpha, so it tends to put less emphasis on new pieces of data. It stays above the rolling average until 2008, then is the slowest to update when unemployment spikes. The red line tends to follow the original time series closely - it's pretty strongly influenced by new data points.
So, which of these alpha values is best? Well, it depends. High values of alpha are good at incorporating new data. Low values of alpha are good at rejecting noise. You'll have to decide what is best for your application.
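To see the span/alpha relationship directly in pandas, here is a small sketch on synthetic data (the unemployment series isn't included here); ewm(span=N) and ewm(alpha=2/(N+1)) should produce identical results:
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(5 + np.cumsum(rng.normal(0, 0.1, 120)))  # stand-in monthly series

N = 12
rolling   = s.rolling(N).mean()                # 12-period simple moving average
ewm_span  = s.ewm(span=N).mean()               # pandas converts span to alpha = 2 / (N + 1)
ewm_alpha = s.ewm(alpha=2 / (N + 1)).mean()    # the same alpha, written out explicitly

print(np.allclose(ewm_span, ewm_alpha))        # True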

How Can I find the min and max values of y and draw a border between them in python?

I am currently gathering X gyro data of a device. I need to find the differences between readings in terms of their y values, in graphs such as this one.
I have written an algorithm that finds the min and max y value, but when there are minor fluctuations like this one, it returns faulty answers. I have written it in Python and it is as follows:
import numpy as np
x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")
max_value = max(y)
min_value = min(y)
print(min_value)
borders = max_value - min_value
I now need to write an algorithm that will:
Determine the max and min y value and draw their borders.
If it sees minor fluctuations it will ignore them.
Would writing such an algorithm be possible, and if so, how could I go about writing one? Are there any libraries or reading material you would recommend? Thanks for your help.
1. Generally, maths like this should be done in pure code, with little or no help from external APIs, so it is easier to debug the algorithm and the process.
2. Now for a possible answer:
Since you do not want any outliers (those buggy and irritating minor fluctuations), you need to calculate the standard deviation of your data.
What is the standard deviation, you might ask?
It represents how far your data is from the mean (symbol: µ, the average) of your data set.
If you do not know what it is, here is quick tutorial on the standard deviation and its partner, variance.
First, the MEAN:
It should not be that hard to calculate the mean of your x and y arrays/lists. Just loop (a for loop would be optimal) through the list, add up all the values, and divide by the length of the list itself. There you have the mean.
Second, the VARIANCE (σ squared):
To calculate the variance, loop through the x and y lists again, subtract each x and y value from its respective mean to get the difference, square this difference, add all the squared differences up, and divide by the length of the respective list. There you have the variance.
For the Standard Deviation (Symbol: σ), just take the square root of the variance.
Now the standard deviation can be used to find the real max and min (leaving out those buggy outliers).
Use a graph of the normal distribution as an approximate reference to find where most of your values may be.
Since your data is mostly uniform, you should get a pretty accurate answer.
Do test the different ranges, µ ± σ or µ ± 2σ, to find the optimum max and min.
Edit:
Sorry, only y now:
Sorry for the horrible representation and drawing; this is roughly what it should look like graphically. Also, do experiment yourself with the standard deviation from the mean (as above: µ ± σ or µ ± 2σ) to find your suited max and min.
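Putting the recipe above into code, here is a rough sketch (it assumes the same gyroDataX.txt layout as in the question, and a border of 2σ, which you should tune as suggested):
import numpy as np

x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")

# mean of y, computed with a plain loop as described above
total = 0
for value in y:
    total += value
mu = total / len(y)

# variance: average of the squared differences from the mean
sq_diff_sum = 0
for value in y:
    sq_diff_sum += (value - mu) ** 2
variance = sq_diff_sum / len(y)
std_dev = variance ** 0.5          # standard deviation

# borders at mu +/- 2*std_dev; try 1, 2 or 3 standard deviations
upper = mu + 2 * std_dev
lower = mu - 2 * std_dev

# 'real' max and min of y, ignoring points outside the borders
inside = [v for v in y if lower < v < upper]
print(min(inside), max(inside))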

Finding RMS noise in a spectrum

I have an intensity v/s velocity spectrum and my aim is to find the RMS noise in the spectrum excluding the channels where the peak is present.
So, after some research, I came to know that RMS noise is the same as the standard deviation of the spectrum and the signal-to-noise ratio of the signal is the average of the signal divided by the same standard deviation. Can anybody please tell me if I am wrong here?
This is how I coded it in python
def Average(data):
    return sum(data) / len(data)

# here 'data' holds the intensity (y-axis) values given below, as a numpy array
average = Average(data)
print("Average of the list =", average)
standardDev = data.std()
print('The standard deviation is', standardDev)
SNR = average / standardDev
print('SNR = ', SNR)
My original data points are:
x-axis(velocity) :
[-5.99999993e+04 -4.99999993e+04 -3.99999993e+04 -2.99999993e+04
-1.99999993e+04 -9.99999934e+03 6.65010004e-04 1.00000007e+04
2.00000007e+04 3.00000007e+04 4.00000007e+04 5.00000007e+04
6.00000007e+04 7.00000007e+04 8.00000007e+04 9.00000007e+04
1.00000001e+05 1.10000001e+05 1.20000001e+05 1.30000001e+05
1.40000001e+05]
y-axis (data):
[ 0.00056511 -0.00098584 -0.00325616 -0.00101042 0.00168894 -0.00097406
-0.00134408 0.00128847 -0.00111633 -0.00151621 0.00299326 0.00916455
0.00960554 0.00317363 0.00311124 -0.00080881 0.00215932 0.00596419
-0.00192256 -0.00190138 -0.00013216]
If I want to measure the standard deviation excluding the channels where the line is present, should I exclude values from y[10] to y[14] and then calculate the standard deviation?
Yes, since you are to determine some properties of the noise, you should exclude the points that do not constitute the noise. If these are points number 10 to 14 - exclude them.
Then you compute the average of the remaining y-values (intensity). However, from your data and the fitting function, a * exp(-(x-c)**2 / w), one might infer that the theoretical value of this mean is just zero. If so, the average is only a means of validating your experiment / theory ("we've obtained almost zero, as expected") and you can use 0 as the true average value. Then, the noise level would amount to the square root of the second moment, E(Y^2).
You should compare the stddev from your code with the square root of the second moment; they should be so similar to each other that it should not matter which of them you choose as the noise value.
The part with SNR, signal to noise ratio, is wrong in your derivation. The signal is the signal, that is - it is the amplitude of the Gaussian obtained from the fit. You divide it by the noise level (either the square root of the second moment, or stddev). To my eye, you should obtain a value between 2 and about 10.
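A short sketch of this advice, using the data posted above; since the fitted Gaussian parameters aren't shown here, the peak value stands in for the amplitude a:
import numpy as np

y = np.array([ 0.00056511, -0.00098584, -0.00325616, -0.00101042,  0.00168894,
              -0.00097406, -0.00134408,  0.00128847, -0.00111633, -0.00151621,
               0.00299326,  0.00916455,  0.00960554,  0.00317363,  0.00311124,
              -0.00080881,  0.00215932,  0.00596419, -0.00192256, -0.00190138,
              -0.00013216])

noise_only = np.delete(y, np.arange(10, 15))   # drop channels 10..14 (the line)

rms    = np.sqrt(np.mean(noise_only ** 2))     # square root of the second moment
stddev = noise_only.std()                      # standard deviation about the mean
print(rms, stddev)                             # these should come out very close

amplitude = y.max()                            # stand-in for the fitted Gaussian amplitude 'a'
print("SNR =", amplitude / rms)                # should land in the 2-10 range mentioned above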
Finally, remember that this is a public forum and that some people read it and may be puzzled by the question & answer: both are based on the previous question Fitting data to a gaussian profile which should've been referred to in the question itself.
If this is a university assignment and you work on real experimental data, remember the purpose. Imagine yourself as a scientist who is to convince others that this a real signal, say, from the Aliens, not just an erratic result of the Mother Nature tossing dice at random. That's the primary purpose of the signal to noise ratio.

How to do calibration accounting for resolution of the instrument

I have to calibrate a distance measuring instrument which gives capacitance as output. I am able to use numpy polyfit to find a relation and apply it to get distance. But I need to include the limit of detection, 0.0008 m, as it is the resolution of the instrument.
My data is:
cal_distance = [.1 , .4 , 1, 1.5, 2, 3]
cal_capacitance = [1971, 2336, 3083, 3720, 4335, 5604]
raw_data = [3044,3040,3039,3036,3033]
I need my distance values to be like .1008, .4008, which represent the limit of detection of the instrument.
I have used the following code:
import numpy as np

coeffs = np.polyfit(cal_capacitance, cal_distance, 1)
new_distance = []
for i in raw_data:
    d = i * coeffs[0] + coeffs[1]
    new_distance.append(d)
I have a csv file and actually used a pandas dataframe with date time index to store the raw data, but for simplicity I have given a list here.
I need to include the limits of detection in the calibration process to get it right.
Limit of detection is the accuracy of your measurement (the smallest 'step' you can resolve).
polyfit gives you a 'model' of the best fit function f of the relation
distance = f(capacitance)
You use 1 as the degree of the polynomial so you're basically fitting a line.
So, first off you need to look into the accuracy of the fit: this is returned when you pass the keyword argument full=True.
(see the docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html for more details)
You will get the residual of the fit.
Is it actually smaller than the LOD? Otherwise your limiting factor is the fitting
accuracy. In your particular case it looks like it is 0.00017021, so indeed below the 0.0008 LOD.
Second, why 'add' the LOD to the reading? Your reading is the reading. The LOD is the +/- range the distance could really be within. Adding it to the end result does not seem to make sense here.
You should instead report the final value as 'new distance' +/- LOD.
Is your raw data all measurements of the same distance? If so, you can see that the standard deviation of this measurement using the fit is 0.0029680362423331122 (numpy.std(new_distance)) and the range is 0.0087759439302268483, which is 10x over the LOD, so here your limiting factor really seems to be the measuring conditions.
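A small sketch of the fit-accuracy check and the suggested reporting style, using the data from the question (the LOD value is the 0.0008 m resolution stated there):
import numpy as np

cal_distance    = np.array([0.1, 0.4, 1.0, 1.5, 2.0, 3.0])
cal_capacitance = np.array([1971, 2336, 3083, 3720, 4335, 5604])
raw_data        = np.array([3044, 3040, 3039, 3036, 3033])

# full=True also returns the residual of the least-squares fit
coeffs, residuals, rank, sv, rcond = np.polyfit(cal_capacitance, cal_distance, 1, full=True)
print("fit residual:", residuals)

new_distance = np.polyval(coeffs, raw_data)

LOD = 0.0008  # instrument resolution, in metres
for d in new_distance:
    # report the reading together with its +/- LOD, rather than adding the LOD to it
    print("{:.4f} +/- {} m".format(d, LOD))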
Not to beat a dead horse, but LOD and precision are two completely different things. LOD is typically defined as three times the standard deviation of the noise of your instrument, which would be equivalent to the minimum capacitance (or distance, which is related to capacitance here) your instrument can detect, i.e. anything less than that is equivalent to zero (more or less). But your precision is the minimum change in capacitance that can be detected by your instrument, which may or may not be less than the LOD. Such terms (in addition to accuracy) are common sources of confusion. While you may know what you are talking about when you say LOD (and everyone else may be able to understand that you really mean precision), it would be beneficial to use the proper notation. Just a thought...

Algorithm to detect spike in x y graph

Imagine a realtime x, y graph where x is the quantity and y is time, with a 1 minute interval. Every minute a new value is pushed into the graph. So I want to detect whenever there is a spike in the graph.
There are 2 kinds of spike:
Sudden Spike
Gradual Spike
Is there any way to detect them?
Since spikes occur over a short distance (x2 - x1), you can take the standard deviation of a set of y values over a short range of x. If the deviation is a reasonably large value, it's a spike.
For example, for 9 consecutive y values:
4,4,5,10,26,10,5,4,4 standard deviation is 7.19.
4,4,5,10,100,10,5,4,4 standard deviation is 31.51.
You can start by analysing the highest values of y and its neighbours.
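A rough sketch of that windowed check (the threshold of 10 here is just an illustrative choice; pick it from your own clean data):
import numpy as np

def has_spike(window, threshold=10.0):
    # flag the window if its sample standard deviation exceeds the threshold
    return np.std(window, ddof=1) > threshold

print(has_spike([4, 4, 5, 10, 26, 10, 5, 4, 4]))    # std ~ 7.19 -> False
print(has_spike([4, 4, 5, 10, 100, 10, 5, 4, 4]))   # std ~ 31.5 -> True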
You can take the first derivative of y w.r.t. x using numpy.diff. Get a set of clean signals and obtain a threshold from them by taking the upper limit of the derivative (the maximum change a clean signal showed), using plain old max(array).
Then you can subject your real time signal to the same kind of scrutiny, check for the derivative.
Also, you could threshold it based on the angle of the signal, but you would need a comprehensive sample size for that. You can use tan(signal) for this.
Different thresholds give you different kinds of peaks.
Adding to the suggestion provided, you could also calculate the standard deviation with numpy.std(array) and then check for +/- that value from the mean. This would, of course, be better combined with the derivative as I mentioned.
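Here is a minimal sketch of the derivative idea; the threshold of 8 is hypothetical and would normally come from the largest difference seen in your clean signals:
import numpy as np

def spike_indices(y, derivative_threshold):
    # indices where the first difference jumps above the clean-signal limit
    dy = np.abs(np.diff(y))
    return np.where(dy > derivative_threshold)[0] + 1

y = [4, 4, 5, 10, 26, 10, 5, 4, 4]
print(spike_indices(y, derivative_threshold=8))   # flags the jump up to 26 and back down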
A method used in financial analysis is Bollinger Bands. This link can give you more information about it: http://sentdex.com/sentiment-analysisbig-data-and-python-tutorials-algorithmic-trading/how-to-chart-stocks-and-forex-doing-your-own-financial-charting/calculate-bollinger-bands-python-graph-matplotlib/
They are basically a moving average over a period of the time series, with bands a couple of standard deviations above and below it. You can get a better set of thresholds using them rather than just the standard deviation.
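A minimal pandas sketch of Bollinger-style bands on synthetic per-minute data (the window length and band width are hypothetical choices):
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
series = pd.Series(rng.normal(0, 0.1, 200))   # synthetic "quantity per minute" values
series.iloc[120] += 3                         # inject a sudden spike

window = 20                                   # look-back period
mid  = series.rolling(window).mean()
band = 2 * series.rolling(window).std()       # classic Bollinger width: 2 standard deviations

spikes = series[(series > mid + band) | (series < mid - band)]
print(120 in spikes.index)                    # True: the injected spike crosses the upper band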
