In the documentation of pandas.DataFrame.ewm, it says alpha = 2/(span + 1). I don't understand why such a formula exists between alpha and span.
Is it a convention, or are there any sources that explain the formula?
Is it only an assumption/setting in pandas? Could it actually take other forms, e.g. alpha = 4/(span + 3)?
I googled but found no clue on that.
Grateful if someone could help! Thanks.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html#pandas-dataframe-ewm
Is it a convention, or are there any sources that explain the formula?
The span parameter is intended to roughly correspond to an N-period rolling average. Here's what Wikipedia says about this:
Note that there is no "accepted" value that should be chosen for alpha, although there are some recommended values based on the application. A commonly used value for alpha is alpha = 2/(N + 1). This is because the weights of an SMA and EMA have the same "center of mass" when alpha_EMA = 2/(N_SMA + 1).
So, the Pandas project did not come up with this formula - many others have used it.
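To see the "center of mass" point concretely, here is a small numerical check (just a sketch; N = 12 is an arbitrary example):

import numpy as np

N = 12
alpha = 2 / (N + 1)

# Center of mass of the SMA weights: N equal weights at lags 0..N-1
sma_com = (N - 1) / 2

# Center of mass of the EMA weights, which are proportional to (1 - alpha)**k at lag k
k = np.arange(100_000)                 # long enough to approximate the infinite sum
w = (1 - alpha) ** k
ema_com = (k * w).sum() / w.sum()

print(sma_com, ema_com)                # both come out as 5.5 for N = 12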
Is it only an assumption/setting in pandas? Could it actually take other forms, e.g. alpha = 4/(span + 3)?
But would other ways of calculating alpha work too? You suggest 4/(N + 3), for example.
It's possible to do this - but it means that it approximates the N-period rolling average less well. Here's a practical example.
Below is a graph of the US unemployment rate between 2005 and 2008 (black line). Imagine you wanted to remove noise from this time series. One way I could do that is to take a rolling 12-month average of the unemployment rate (blue line). Another possible approach would be to use an exponentially weighted average. But how should alpha be chosen in order to get approximately a 12-month average?
Here are some formulas for alpha that you could use. There are three: the one Pandas uses, the one you suggested, and one I made up.
N    Formula    Alpha
12   1/(N+1)    0.0769
12   2/(N+1)    0.1538
12   4/(N+3)    0.2666
Below is a plot of what each looks like after smoothing.
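(For reference, here is a rough sketch of how a comparison like this can be produced; unrate is assumed to be a monthly pandas Series of the unemployment rate, which isn't included here.)

import matplotlib.pyplot as plt

# unrate: a monthly pandas Series of the US unemployment rate (assumed to exist)
unrate.plot(color='black', label='unemployment rate')
unrate.rolling(12).mean().plot(color='blue', label='12-month rolling average')
unrate.ewm(alpha=1/13).mean().plot(color='yellow', label='alpha = 1/(N+1)')
unrate.ewm(alpha=2/13).mean().plot(color='green', label='alpha = 2/(N+1)')
unrate.ewm(alpha=4/15).mean().plot(color='red', label='alpha = 4/(N+3)')
plt.legend()
plt.show()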
You'll see that the green line ends up being similar to the blue rolling average, but is a bit more wiggly. The yellow line has a lower alpha, so it tends to put less emphasis on new pieces of data. It stays above the rolling average until 2008, then is the slowest to update when unemployment spikes. The red line tends to follow the original time series closely - it's pretty strongly influenced by new data points.
So, which of these alpha values is best? Well, it depends. High values of alpha are good at incorporating new data. Low values of alpha are good at rejecting noise. You'll have to decide what is best for your application.
I am currently gathering X gyro data from a device. I need to find the differences between the readings in terms of their y-values in graphs such as this one.
I have written an algorithm that finds the min and max y-value, but when there are some minor fluctuations like this one, my algorithm returns faulty answers. I have written it in Python, and it is as follows:
import numpy as np

x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")

max_value = max(y)    # largest y value
min_value = min(y)    # smallest y value
print(min_value)

borders = max_value - min_value
I now need to write an algorithm that will:
Determine the max and min y-values and draw their borders.
Ignore minor fluctuations when it sees them.
Would writing such an algorithm be possible, and if so, how could I go about writing one? Are there any libraries or reading material you would recommend? Thanks for your help.
1. Generally, maths like this should be done in pure code, with little or no help from external APIs, so it's easier to debug the algorithm and the process.
2. Now for a possible answer:
Since you do not want any outliers (those buggy and irritating minor fluctuations), you need to calculate the standard deviation of your data.
What is the standard deviation, you might ask?
It represents how far your data is from the mean (symbol: µ, i.e. the average) of your data set.
If you do not know what it is, here is a quick tutorial on the standard deviation and its partner, variance.
First, the MEAN:
It should not be that hard to calculate the mean of your x and y arrays/lists. Just loop (a for loop would be optimal) through the lists, add up all the values, and divide by the length of the list itself. There you have the mean.
Second, the VARIANCE (σ squared):
If you followed the website above: to calculate the variance, loop through the x and y lists again, subtract each value from its respective mean to get the difference, square this difference, add all the squared differences up, and divide by the length of the respective list, and you have the variance.
For the Standard Deviation (Symbol: σ), just take the square root of the variance.
Now the standard deviation can be used to find the real max and min (leaving out those buggy outliers).
Use this graph as an approximate reference to find where most of your values may be:
Since your data is mostly uniform, you should get a pretty accurate answer.
Do test the different ranges, µ + σ or µ + 2σ, to find the optimum max and min.
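Here is a minimal sketch of the whole approach, assuming the same gyroDataX.txt file from your snippet; keeping only the values within µ ± 2σ is just one of the options mentioned above:

import numpy as np

x, y = np.loadtxt("gyroDataX.txt", unpack=True, delimiter=",")

# Mean: add up all the values and divide by how many there are
mean = sum(y) / len(y)

# Variance: average of the squared differences from the mean
variance = sum((value - mean) ** 2 for value in y) / len(y)

# Standard deviation: square root of the variance
std_dev = variance ** 0.5

# Keep only the values inside mean ± 2*std_dev, then take the max and min of what is left
kept = [value for value in y if mean - 2 * std_dev <= value <= mean + 2 * std_dev]
print(min(kept), max(kept))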
Edit:
Sorry, only y now:
Sorry for the horrible representation and drawing; this is roughly what it should look like graphically. Also, do experiment by yourself with the standard deviation from the mean (as above: µ + σ or µ + 2σ) to find the max and min that suit you.
I have an intensity vs. velocity spectrum, and my aim is to find the RMS noise in the spectrum excluding the channels where the peak is present.
So, after some research, I came to know that RMS noise is the same as the standard deviation of the spectrum and the signal-to-noise ratio of the signal is the average of the signal divided by the same standard deviation. Can anybody please tell me if I am wrong here?
This is how I coded it in Python:
def Average(data):
    return sum(data) / len(data)

average = Average(data)
print("Average of the list =", average)

standardDev = data.std()
print('The standard deviation is', standardDev)

SNR = average / standardDev
print('SNR = ', SNR)
My original data points are:
x-axis (velocity):
[-5.99999993e+04 -4.99999993e+04 -3.99999993e+04 -2.99999993e+04
-1.99999993e+04 -9.99999934e+03 6.65010004e-04 1.00000007e+04
2.00000007e+04 3.00000007e+04 4.00000007e+04 5.00000007e+04
6.00000007e+04 7.00000007e+04 8.00000007e+04 9.00000007e+04
1.00000001e+05 1.10000001e+05 1.20000001e+05 1.30000001e+05
1.40000001e+05]
y-axis (data):
[ 0.00056511 -0.00098584 -0.00325616 -0.00101042 0.00168894 -0.00097406
-0.00134408 0.00128847 -0.00111633 -0.00151621 0.00299326 0.00916455
0.00960554 0.00317363 0.00311124 -0.00080881 0.00215932 0.00596419
-0.00192256 -0.00190138 -0.00013216]
If I want to measure the standard deviation excluding the channels where the line is present, should I exclude values from y[10] to y[14] and then calculate the standard deviation?
Yes, since you are to determine some properties of the noise, you should exclude the points that do not constitute the noise. If these are points number 10 to 14 - exclude them.
Then you compute the average of the remaining y-values (intensity). However, from your data and the fitting function, a * exp(-(x-c)**2 / w), one might infer that the theoretical value of this mean is just zero. If so, the average is only a means of validating your experiment/theory ("we've obtained almost zero, as expected") and you can use 0 as the true average value. Then the noise level would amount to the square root of the second moment, E(Y^2).
You should compare the stddev from your code with the square root of the second moment; they should be so similar that it should not matter which of them you choose as the noise value.
The part with SNR, signal to noise ratio, is wrong in your derivation. The signal is the signal, that is - it is the amplitude of the Gaussian obtained from the fit. You divide it by the noise level (either the square root of the second moment, or stddev). To my eye, you should obtain a value between 2 and about 10.
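As a rough sketch of those computations (data here stands for the y-axis array you listed; the peak height is only a stand-in for the amplitude you would take from your Gaussian fit):

import numpy as np

y = np.asarray(data)                       # 'data' = the intensity values listed above

noise = np.concatenate([y[:10], y[15:]])   # exclude channels 10..14 (the line)

std_noise = noise.std()                    # standard deviation of the noise
rms_noise = np.sqrt(np.mean(noise ** 2))   # square root of the second moment, E(Y^2)
print(std_noise, rms_noise)                # the two should be very close

amplitude = y.max()                        # stand-in for the fitted Gaussian amplitude
print('SNR =', amplitude / rms_noise)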
Finally, remember that this is a public forum and that some people read it and may be puzzled by the question & answer: both are based on the previous question Fitting data to a gaussian profile which should've been referred to in the question itself.
If this is a university assignment and you work on real experimental data, remember the purpose. Imagine yourself as a scientist who has to convince others that this is a real signal, say, from the Aliens, not just an erratic result of Mother Nature tossing dice at random. That's the primary purpose of the signal-to-noise ratio.
In a multi-peak fit I intend to constrain the solution space for the parameters of the second peak based on the values of the first one. In particular, I want the amplitude parameter of the second peak to never be larger than the amplitude of the first one.
I've read on the lmfit website about "Using Inequality Constraints" and I have the feeling it should be possible with this approach, but I do not quite understand it well enough to make it work.
import lmfit
GaussianA = lmfit.models.GaussianModel(prefix='A_')
pars = GaussianA.make_params()
GaussianB = lmfit.models.GaussianModel(prefix='B_')
pars.update(GaussianB.make_params())
pars['B_amplitude'].set(expr = 'A_amplitude')
This locks in the amplitude of B to the amplitude of A.
However, how do I specify that the amplitude of B is at most 'A_amplitude'?
This doesn't work (but it would be awesome if it were that easy), but maybe it helps to demonstrate what I'd like to have: pars['B_amplitude'].set(1, max='A_amplitude')
The min and max values for a lmfit.Parameter are not dynamically calculated from the other variables, but must be real numerical values. That is, something like
pars['B_amplitude'].set(1,max='A_amplitude') # Nope!
will not work.
What you need to do is follow the documentation for an inequality constraint (see https://lmfit.github.io/lmfit-py/constraints.html#using-inequality-constraints). That is, you can think of
B_amplitude < A_amplitude
as
B_amplitude = A_amplitude - delta_amplitude
with delta_amplitude being some variable value that must be positive.
That can be expressed as
GaussianA = lmfit.models.GaussianModel(prefix='A_')
pars = GaussianA.make_params()
GaussianB = lmfit.models.GaussianModel(prefix='B_')
pars.update(GaussianB.make_params())
pars.add('delta_amplitude', value=0.01, min=0, vary=True)
pars['B_amplitude'].set(expr = 'A_amplitude - delta_amplitude')
Now delta_amplitude is a variable that must be positive, and B_amplitude is no longer a freely varying parameter but is constrained by the values of A_amplitude and delta_amplitude.
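Fitting then proceeds as usual with the composite model (a sketch; x and y stand for your data arrays, which are not shown here):

model = GaussianA + GaussianB          # composite model of the two peaks
result = model.fit(y, pars, x=x)       # x, y: your data arrays
print(result.fit_report())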
Do you have a plot of your data, and how noisy is it? I understand that you do two separate fits but have two peaks in your data. If your data is friendly, you might be able to fit one peak first, take its amplitude, and then fit the second one while setting limits for the amplitude. But maybe it's better to set a limit for the x position, as you are talking about two different peaks.
How I solved this in a slightly hacky way (I assume your problem is that your fit does not converge):
Find the highest peak (maximum) in the data -> x1
Cut out the data in the environment of that peak (x1 ± 2 half-power widths, depending on the distance between your peaks and their heights)
Find the highest peak (maximum) in the new, reduced data -> x2
Use a custom fit curve which is the sum of your two Gauss curves, f(x) = gauss1 + gauss2, where gauss(x, x1, width, amplitude, y_offset) = amplitude/width * e^(-(x-x1)^2/width) + y_offset (see the sketch below)
Sorry, it was years ago that I did this, and without lmfit, so I can't give you details on it.
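A rough sketch of what step 4 could look like with scipy (not my original code, which I no longer have; the starting widths, amplitudes, and offset in p0 are placeholders):

import numpy as np
from scipy.optimize import curve_fit

def gauss(x, x0, width, amplitude, y_offset):
    return amplitude / width * np.exp(-((x - x0) ** 2) / width) + y_offset

def two_gauss(x, x1, w1, a1, x2, w2, a2, y_offset):
    return gauss(x, x1, w1, a1, 0.0) + gauss(x, x2, w2, a2, y_offset)

# x, y: your data; x1, x2: the peak positions found in steps 1 and 3
p0 = [x1, 1.0, 1.0, x2, 1.0, 1.0, 0.0]          # rough starting guesses
popt, pcov = curve_fit(two_gauss, x, y, p0=p0)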
I'm reading a book on Data Science for Python, and the author applies a 'sigma-clipping operation' to remove outliers due to typos. However, the process isn't explained at all.
What is sigma clipping? Is it only applicable to certain data (e.g. in the book it's used on birth rates in the US)?
As per the text:
quartiles = np.percentile(births['births'], [25, 50, 75]) #so we find the 25th, 50th, and 75th percentiles
mu = quartiles[1] #we set mu = 50th percentile
sig = 0.74 * (quartiles[2] - quartiles[0]) #???
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
Why 0.74? Is there a proof for this?
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
That's it, really...
The code tries to estimate sigma using the interquartile range to make it robust against outliers. 0.74 is a correction factor. Here is how to calculate it:
import scipy as sp
import scipy.stats                       # make sure the stats submodule is loaded

p1 = sp.stats.norm.ppf(0.25)             # first quartile of standard normal distribution
p2 = sp.stats.norm.ppf(0.75)             # third quartile
print(p2 - p1)                           # 1.3489795003921634

sig = 1                                  # standard deviation of the standard normal distribution
factor = sig / (p2 - p1)
print(factor)                            # 0.74130110925280102
In the standard normal distribution sig==1 and the interquartile range is 1.35. So 0.74 is the correction factor to turn the interquartile range into sigma. Of course, this is only true for the normal distribution.
Suppose you have a set of data. Compute its median m and its standard deviation sigma. Keep only the data that falls in the range (m-a*sigma,m+a*sigma) for some value of a, and discard everything else. This is one iteration of sigma clipping. Continue to iterate a predetermined number of times, and/or stop when the relative reduction in the value of sigma is small.
Sigma clipping is geared toward removing outliers, to allow for a more robust (i.e. resistant to outliers) estimation of, say, the mean of the distribution. So it's applicable to data where you expect to find outliers.
As for the 0.74, it comes from the interquartile range of the Gaussian distribution, as per the text.
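A minimal sketch of that iteration (the factor a = 3 and the iteration cap are just example choices):

import numpy as np

def sigma_clip(data, a=3.0, max_iter=5):
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        m = np.median(data)
        sigma = data.std()
        kept = data[(data > m - a * sigma) & (data < m + a * sigma)]
        if len(kept) == len(data):        # nothing removed this round, so stop
            break
        data = kept
    return data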
The answers here are accurate and reasonable, but don't quite get to the heart of your question:
What is sigma clipping? Is it only applicable for certain data?
If we want to use mean (mu) and standard deviation (sigma) to figure out a threshold for ejecting extreme values in situations where we have a reason to suspect that those extreme values are mistakes (and not just very high/low values), we don't want to calculate mu/sigma using the dataset which includes these mistakes.
Sample problem: you need to compute a threshold for a temperature sensor to indicate when the temperature is "High" - but sometimes the sensor gives readings that are impossible, like "surface of the sun" high.
Imagine a series that looks like this:
import numpy as np
import scipy.stats as st

thisSeries = np.array([1, 2, 3, 4, 1, 2, 3, 4, 5, 3, 4, 5, 3, 500, 1000])
Those last two values look like obvious mistakes - but if we use a typical stats function like a Normal PPF, it's going to implicitly assume that those outliers belong in the distribution, and perform its calculation accordingly:
st.norm.ppf(.975, thisSeries.mean(), thisSeries.std())
631.5029013468446
So using a two-sided 5% outlier threshold (meaning we will reject the lower and upper 2.5%), it's telling me that 500 is not an outlier. Even if I use a one-sided threshold of .95 (reject the upper 5%), it will give me 546 as the outlier limit, so again, 500 is regarded as non-outlier.
Sigma-clipping works by focusing on the inter-quartile range and using median instead of mean, so the thresholds won't be calculated under the influence of the extreme values.
import pandas as pd

thisDF = pd.DataFrame(thisSeries, columns=["value"])

intermed = "value"
factor = 5

quartiles = np.percentile(thisSeries, [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])

queryString = '({} < @mu - {} * @sig) | ({} > @mu + {} * @sig)'.format(intermed, factor, intermed, factor)

print(mu + 5 * sig)
10.4

print(thisDF.query(queryString))
    value
13    500
14   1000
At factor=5, both outliers are correctly isolated, and the threshold is at a reasonable 10.4 - reasonable, given that the 'clean' part of the series is [1,2,3,4,1,2,3,4,5,3,4,5,3]. ('factor' in this context is a scalar applied to the thresholds)
To answer the question, then: sigma clipping is a method of identifying outliers which is immune from the deforming effects of the outliers themselves, and though it can be used in many contexts, it excels in situations where you suspect that the extreme values are not merely high/low values that should be considered part of the dataset, but rather that they are errors.
Here's an illustration of the difference between extreme values that are part of a distribution, and extreme values that are possibly errors, or just so extreme as to deform analysis of the rest of the data.
The data above was generated synthetically, but you can see that the highest values in this set are not deforming the statistics.
Now here's a set generated the same way, but this time with some artificial outliers injected (above 40):
If I sigma-clip this, I can get back to the original histogram and statistics, and apply them usefully to the dataset.
But where sigma-clipping really shines is in real world scenarios, in which faulty data is common. Here's an example that uses real data - historical observations of my heart-rate monitor. Let's look at the histogram without sigma-clipping:
I'm a pretty chill dude, but I know for a fact that my heart rate is never zero. Sigma-clipping handles this easily, and we can now look at the real distribution of heart-rate observations:
Now, you may have some domain knowledge that would enable you to manually assert outlier thresholds or filters. This is one final nuance to why we might use sigma-clipping: in situations where data is handled entirely by automation, or where we have no domain knowledge relating to the measurement or how it's taken, we don't have any informed basis for filter or threshold statements.
It's easy to say that a heart rate of 0 is not a valid measurement - but what about 10? What about 200? And what if heart-rate is one of thousands of different measurements we're taking. In such cases, maintaining sets of manually defined thresholds and filters would be overly cumbersome.
I think there is a small typo in the sentence "this final line is a robust estimate of the sample mean". From the derivation above, I think the final line is actually a robust estimate of one sigma for the births data, if a normal distribution is followed.
Generally, we calculate exponential moving averages as the following:
y_t = (1 - alpha) * y_tminus1 + alpha * x_t
where alpha is the alpha specified for the exponential moving average, y_t is the resulting moving average, and x_t is the new inputted data.
This seems to be confirmed in the methodology behind Pandas' implementation of the exponentially weighted moving average as well.
So I wrote an online algorithm for calculating the exponentially weighted moving average of a dataset:
def update_stats_exp(new, mean):
    mean = (1 - ALPHA) * mean + ALPHA * new
    return mean
However, this returns a different moving average compared to that of Pandas' implementation, called by the following two lines of code:
exponential_window = df['price'].ewm(alpha=ALPHA, min_periods=LOOKBACK,
                                     adjust=False, ignore_na=True)
df['exp_ma'] = exponential_window.mean()
In both of the above pieces of code, I kept ALPHA the same, yet they resulted in different moving averages, even though the documentation that Pandas provided on exponentially weighted windows seems to match the methodology I had in mind.
Can someone elucidate the differences between the online function I've provided for calculating moving average and Pandas' implementation for the same thing? Also, is there an easy way of formulating Pandas' implementation into an online algorithm?
Thanks so much!