I am working with time series data collected from a sensor at 5-minute intervals. Unfortunately, there are cases where the measured value (PV yield in watts) is suddenly 0 or very high, while the values before and after are correct:
My goal is to identify these 'outliers' and (in a second step) replace them with the mean of the previous and next value. I've experimented with two approaches so far, but both flag many 'outliers' that are not measurement errors, so I am looking for better approaches.
Try 1: Classic outlier detection with IQR Source
def updateOutliersIQR(group):
    # 'yield' is a reserved word in Python, so use bracket indexing instead of attribute access
    Q1 = group['yield'].quantile(0.25)
    Q3 = group['yield'].quantile(0.75)
    IQR = Q3 - Q1
    outliers = (group['yield'] < (Q1 - 1.5 * IQR)) | (group['yield'] > (Q3 + 1.5 * IQR))
    print(outliers[outliers == True])

# calling the function on a per-day level
df.groupby(df.index.date).apply(updateOutliersIQR)
Try 2: kernel density estimation Source
def updateOutliersKDE(group):
    a = 0.9
    r = group['yield'].rolling(3, min_periods=1, win_type='parzen').sum()
    n = r.max()
    outliers = (r > n * a)
    print(outliers[outliers == True])

# calling the function on a per-day level
df.groupby(df.index.date).apply(updateOutliersKDE)
Try 3: Median Filter Source
(As suggested by Jonnor)
def median_filter(num_std=3):
    def _median_filter(x):
        _median = np.median(x)
        _std = np.std(x)
        s = x[-3]  # the centre value of the 5-element window
        if (_median - num_std * _std) <= s <= (_median + num_std * _std):
            return s
        else:
            return _median
    return _median_filter

# calling the function
df['yield'].rolling(5, center=True).apply(median_filter(2), raw=True)
Edit: with Try 3, a window of 5 and a std of 3, it finally catches the massive outlier, but it also loses accuracy on the other (non-faulty) sensor measurements:
Are there any better approaches to detect the described 'outliers' or perform smoothing in timeseries data with the occasional sensor measurement issue?
Your abnormal values are abnormal in the sense that:
the values deviate a lot from the values around them
the value changes very quickly from one time step to the next
Thus what is needed is a filter that looks at a short time context to filter these out.
One of the simplest and most effective is the median filter.
filtered = df.rolling(window=5).median()  # pandas.rolling_median(df, window=5) in older pandas versions
The longer the window, the stronger the filter.
An alternative would be a low-pass filter. Though setting an appropriate cutoff frequency can be harder, and it will impose a smoothness onto the signal.
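For illustration, here is a minimal sketch of such a low-pass filter using scipy.signal, assuming a 5-minute sampling interval and an arbitrary cutoff that you would need to tune (the column name 'yield' follows the question):

from scipy import signal

fs = 1.0 / 300.0     # sampling frequency of 5-minute data, in Hz
cutoff = fs / 10.0   # example cutoff frequency; tune it for your signal
b, a = signal.butter(N=2, Wn=cutoff, btype='low', fs=fs)
filtered = signal.filtfilt(b, a, df['yield'].to_numpy())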
One can of course create more custom filters as well. For example, compute the first-order difference and reject changes higher than a certain threshold; you can plot a histogram of the differences to determine that threshold. Mark the rejected points as missing (NaN), and then impute them using the median/mean of nearby values.
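A rough sketch of that difference-based approach, assuming the column is named 'yield' as in the question and a hand-picked threshold read off the histogram:

threshold = 1000                    # watts; pick this from the histogram of differences
spikes = df['yield'].diff().abs() > threshold
cleaned = df['yield'].mask(spikes)  # mark suspect points as NaN
cleaned = cleaned.interpolate()     # for an isolated gap this equals the mean of the neighbours

Note that a single spike produces two large consecutive differences, so the point after the spike is flagged and re-interpolated as well; you may want to refine which points get marked.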
If your goal is Anomaly Detection, you can also use an Autoencoder. I would expect PV output to have a very strong daily pattern. So training it on daily sequences should work quite well (provided you have enough data). This is much more complicated than a simple filter, but has the advantage of being able to detect many other kinds of anomalies as well, not just the pattern identified here.
Related
I have a numpy array which is basically a data column from an Excel sheet. This data was acquired through a 10 Hz low-pass filter DAS, but due to some ambiguity it contains square-wave-like artifacts. The data now has to be filtered with a 0.4 Hz high-pass Butterworth filter, which I do through scipy.signal. But after applying the high-pass filter, the square-wave-like artifacts turn into spikes. When applying scipy.median to it I am not able to successfully filter the spikes. What should I try?
The following pic shows the original data.
The following pic shows the 0.4 Hz high-pass filter applied, followed by a median filter of order 3.
Even a median filter of order 51 is not useful.
If your input is always expected to have a significant outlier, I would recommend an iterative filtering approach.
Here is your data plotted along with the mean, 1-sigma, 2-sigma and 3-sigma lines:
I would start by removing everything above and below 2-sigma from the mean. Since that will tighten the distribution, I would recommend doing the iterations over and over until the size of the un-trimmed data remains the same. I would recommend increasing the threshold geometrically to avoid trimming "good" data. Finally, you can fill in the missing points with the mean of the remainder or something like that.
Here is a sample implementation, with no attempt at optimization whatsoever:
import numpy as np

data = np.loadtxt('data.txt', skiprows=1)
x = np.arange(data.size)
loop_data = data
prev_size = 0
nsigma = 2

while prev_size != loop_data.size:
    mean = loop_data.mean()
    std = loop_data.std()
    mask = (loop_data < mean + nsigma * std) & (loop_data > mean - nsigma * std)
    prev_size = loop_data.size
    loop_data = loop_data[mask]
    x = x[mask]
    # Constantly expanding sigma guarantees fast loop termination
    nsigma *= 2

# Reconstruct the mask (np.bool is deprecated, use the builtin bool)
mask = np.zeros_like(data, dtype=bool)
mask[x] = True

# This destroys the original data somewhat
data[~mask] = data[mask].mean()
This approach may not be optimal in all situations, but I have found it to be fairly robust most of the time. There are lots of tweakable parameters. You may want to change your increase factor from 2, or even go with a linear instead of geometrical increase (although I tried the latter and it really didn't work well). You could also use IQR instead of sigma, since it is more robust against outliers.
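For reference, a sketch of that IQR variant, keeping the same iterative structure but building the mask from the quartiles (the starting multiplier of 1.5 is an assumption):

loop_data = data.copy()
x = np.arange(data.size)
prev_size = 0
k = 1.5  # starting IQR multiplier

while prev_size != loop_data.size:
    q1, q3 = np.percentile(loop_data, [25, 75])
    iqr = q3 - q1
    mask = (loop_data > q1 - k * iqr) & (loop_data < q3 + k * iqr)
    prev_size = loop_data.size
    loop_data = loop_data[mask]
    x = x[mask]
    k *= 2  # widen the fence each pass, as with nsigma above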
Here is an image of the resulting dataset (with the removed portion in red and the original dotted):
Another plot of interest: here are the plots of the data showing the trimming progression and how it affects the cutoff points. The plots show the data, with cut portions in red, and the n-sigma line for the remainder. The title shows how much the sigma shrinks:
I'm reading a book on Data Science for Python and the author applies 'sigma-clipping operation' to remove outliers due to typos. However the process isn't explained at all.
What is sigma clipping? Is it only applicable for certain data (eg. in the book it's used towards birth rates in US)?
As per the text:
quartiles = np.percentile(births['births'], [25, 50, 75]) #so we find the 25th, 50th, and 75th percentiles
mu = quartiles[1] #we set mu = 50th percentile
sig = 0.74 * (quartiles[2] - quartiles[0]) #???
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
Why 0.74? Is there a proof for this?
This final line is a robust estimate of the sample mean, where the 0.74 comes
from the interquartile range of a Gaussian distribution.
That's it, really...
The code tries to estimate sigma using the interquartile range to make it robust against outliers. 0.74 is a correction factor. Here is how to calculate it:
import scipy as sp
import scipy.stats  # imported so that sp.stats is available

p1 = sp.stats.norm.ppf(0.25)  # first quartile of the standard normal distribution
p2 = sp.stats.norm.ppf(0.75)  # third quartile
print(p2 - p1)  # 1.3489795003921634

sig = 1  # standard deviation of the standard normal distribution
factor = sig / (p2 - p1)
print(factor)  # 0.74130110925280102
In the standard normal distribution sig==1 and the interquartile range is 1.35. So 0.74 is the correction factor to turn the interquartile range into sigma. Of course, this is only true for the normal distribution.
Suppose you have a set of data. Compute its median m and its standard deviation sigma. Keep only the data that falls in the range (m-a*sigma,m+a*sigma) for some value of a, and discard everything else. This is one iteration of sigma clipping. Continue to iterate a predetermined number of times, and/or stop when the relative reduction in the value of sigma is small.
Sigma clipping is geared toward removing outliers, to allow for a more robust (i.e. resistant to outliers) estimation of, say, the mean of the distribution. So it's applicable to data where you expect to find outliers.
As for the 0.74, it comes from the interquartile range of the Gaussian distribution, as per the text.
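A minimal sketch of the clipping iteration described above, for a 1-D numpy array and a fixed maximum number of passes (the defaults here are arbitrary):

import numpy as np

def sigma_clip(values, a=3.0, max_iter=5):
    clipped = np.asarray(values, dtype=float)
    for _ in range(max_iter):
        m = np.median(clipped)
        sigma = clipped.std()
        keep = (clipped > m - a * sigma) & (clipped < m + a * sigma)
        if keep.all():  # nothing left to discard
            break
        clipped = clipped[keep]
    return clipped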
The answers here are accurate and reasonable, but don't quite get to the heart of your question:
What is sigma clipping? Is it only applicable for certain data?
If we want to use mean (mu) and standard deviation (sigma) to figure out a threshold for ejecting extreme values in situations where we have a reason to suspect that those extreme values are mistakes (and not just very high/low values), we don't want to calculate mu/sigma using the dataset which includes these mistakes.
Sample problem: you need to compute a threshold for a temperature sensor to indicate when the temperature is "High" - but sometimes the sensor gives readings that are impossible, like "surface of the sun" high.
Imagine a series that looks like this:
import numpy as np
import pandas as pd
import scipy.stats as st

thisSeries = np.array([1, 2, 3, 4, 1, 2, 3, 4, 5, 3, 4, 5, 3, 500, 1000])
Those last two values look like obvious mistakes - but if we use a typical stats function like a Normal PPF, it's going to implicitly assume that those outliers belong in the distribution, and perform its calculation accordingly:
st.norm.ppf(.975, thisSeries.mean(), thisSeries.std())
631.5029013468446
So using a two-sided 5% outlier threshold (meaning we will reject the lower and upper 2.5%), it's telling me that 500 is not an outlier. Even if I use a one-sided threshold of .95 (reject the upper 5%), it will give me 546 as the outlier limit, so again, 500 is regarded as non-outlier.
Sigma-clipping works by focusing on the inter-quartile range and using median instead of mean, so the thresholds won't be calculated under the influence of the extreme values.
thisDF = pd.DataFrame(thisSeries, columns=["value"])
intermed = "value"
factor = 5
quartiles = np.percentile(thisSeries, [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
# pandas.query refers to local variables with the @ prefix
queryString = '({} < @mu - {} * @sig) | ({} > @mu + {} * @sig)'.format(intermed, factor, intermed, factor)
print(mu + 5 * sig)
10.4
print(thisDF.query(queryString))
    value
13    500
14   1000
At factor=5, both outliers are correctly isolated, and the threshold is at a reasonable 10.4 - reasonable, given that the 'clean' part of the series is [1,2,3,4,1,2,3,4,5,3,4,5,3]. ('factor' in this context is a scalar applied to the thresholds)
To answer the question, then: sigma clipping is a method of identifying outliers which is immune from the deforming effects of the outliers themselves, and though it can be used in many contexts, it excels in situations where you suspect that the extreme values are not merely high/low values that should be considered part of the dataset, but rather that they are errors.
Here's an illustration of the difference between extreme values that are part of a distribution, and extreme values that are possibly errors, or just so extreme as to deform analysis of the rest of the data.
The data above was generated synthetically, but you can see that the highest values in this set are not deforming the statistics.
Now here's a set generated the same way, but this time with some artificial outliers injected (above 40):
If I sigma-clip this, I can get back to the original histogram and statistics, and apply them usefully to the dataset.
But where sigma-clipping really shines is in real world scenarios, in which faulty data is common. Here's an example that uses real data - historical observations of my heart-rate monitor. Let's look at the histogram without sigma-clipping:
I'm a pretty chill dude, but I know for a fact that my heart rate is never zero. Sigma-clipping handles this easily, and we can now look at the real distribution of heart-rate observations:
Now, you may have some domain knowledge that would enable you to manually assert outlier thresholds or filters. This is one final nuance to why we might use sigma-clipping - in situations where data is being handled entirely by automation, or we have no domain knowledge relating to the measurement or how it's taken, then we don't have any informed basis for filter or threshold statements.
It's easy to say that a heart rate of 0 is not a valid measurement - but what about 10? What about 200? And what if heart-rate is one of thousands of different measurements we're taking. In such cases, maintaining sets of manually defined thresholds and filters would be overly cumbersome.
I think there is a small typo in the sentence "this final line is a robust estimate of the sample mean". From the derivation above, the final line is actually a robust estimate of 1 sigma for the births data, assuming the data follow a normal distribution.
Generally, we calculate exponential moving averages as the following:
y_t = (1 - alpha) * y_tminus1 + alpha * x_t
where alpha is the alpha specified for the exponential moving average, y_t is the resulting moving average, and x_t is the new inputted data.
This seems to be confirmed in the methodology behind Pandas' implementation of the exponentially weighted moving average as well.
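For concreteness, here is a tiny worked example of that recursion, seeding the average with the first observation (how the series is seeded is an assumption, and one of the places implementations can differ):

ALPHA = 0.5
prices = [10.0, 12.0, 11.0]
ema = prices[0]  # y_0 = x_0
for x in prices[1:]:
    ema = (1 - ALPHA) * ema + ALPHA * x
print(ema)  # 11.0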
So I wrote an online algorithm for calculating the exponentially weighted moving average of a dataset:
def update_stats_exp(new, mean):
    mean = (1 - ALPHA) * mean + ALPHA * new
    return mean
However, this returns a different moving average compared to that of Pandas' implementation, called by the following two lines of code:
exponential_window = df['price'].ewm(alpha=ALPHA, min_periods=LOOKBACK,
                                     adjust=False, ignore_na=True)
df['exp_ma'] = exponential_window.mean()
In both of the above pieces of code, I kept ALPHA the same, yet they resulted in different moving averages, even though the documentation that Pandas provided on exponentially weighted windows seems to match the methodology I had in mind.
Can someone elucidate the differences between the online function I've provided for calculating moving average and Pandas' implementation for the same thing? Also, is there an easy way of formulating Pandas' implementation into an online algorithm?
Thanks so much!
I have a dataset of time-series examples. I want to calculate the similarity between various time-series examples; however, I do not want to take into account differences due to scaling (i.e. I want to look at similarities in the shape of the time series, not their absolute values). To this end, I need a way of normalizing the data, that is, making all of the time-series examples fall within a certain range, e.g. [0, 100]. Can anyone tell me how this can be done in Python?
The solutions given are good for a series that is stationary (neither trending up nor down). For financial time series (or any other series with a bias), the given formula is not right: the series should first be detrended, or the scaling should be based on the latest 100-200 samples.
And if the time series doesn't come from a normal distribution (as is the case in finance), it is advisable to apply a non-linear function (a standard CDF function, for example) to compress the outliers.
The Aronson and Masters book (Statistically Sound Machine Learning for Algorithmic Trading) uses the following formula (on 200-day chunks); a sketch follows the definitions below:
V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50
Where:
X: data point
F50: median (50th percentile) of the latest 200 points
F75: 75th percentile of the latest 200 points
F25: 25th percentile of the latest 200 points
N: normal CDF
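A rough sketch of this formula for a pandas Series, using a rolling 200-sample window rather than fixed 200-day chunks (that choice, and the function name, are assumptions):

import pandas as pd
from scipy.stats import norm

def aronson_masters_norm(series, n=200):
    f50 = series.rolling(n).median()
    f75 = series.rolling(n).quantile(0.75)
    f25 = series.rolling(n).quantile(0.25)
    return pd.Series(100 * norm.cdf(0.5 * (series - f50) / (f75 - f25)) - 50,
                     index=series.index)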
Assuming that your timeseries is an array, try something like this:
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
This will confine your values between 0 and 1
Following my previous comment, here is a (not optimized) Python function that does scaling and/or normalization:
(It needs a pandas DataFrame as input, and it doesn't check that, so it raises errors if supplied with another object type. If you need to use a list or numpy.array, you need to modify it. But you could convert those objects to pandas.DataFrame() first.)
This function is slow, so it's advisable to run it just once and store the results.
from scipy.stats import norm
import pandas as pd

def get_NormArray(df, n, mode='total', linear=False):
    '''
    It computes the normalized value on the stats of n values (modes: total or scale),
    using the formulas from the book "Statistically Sound Machine Learning..."
    (Aronson and Masters), but the decision to apply a non-linear scaling is left to the user.
    It is modified to fit the data from -1 to 1 instead of -100 to 100.
    df is an input DataFrame. It returns also a DataFrame, but it could return a list.
    n defines the number of data points used to get the median and the quartiles for the normalization.
    Modes: scale: scale, without centering. total: center and scale.
    '''
    temp = []
    for i in range(len(df))[::-1]:
        if i >= n:  # there will be a traveling norm until we reach the initial n values;
                    # those values will be normalized using the last computed F50, F75 and F25
            F50 = df[i-n:i].quantile(0.5)
            F75 = df[i-n:i].quantile(0.75)
            F25 = df[i-n:i].quantile(0.25)
        if linear == True and mode == 'total':
            v = 0.5 * ((df.iloc[i] - F50) / (F75 - F25)) - 0.5
        elif linear == True and mode == 'scale':
            v = 0.25 * df.iloc[i] / (F75 - F25) - 0.5
        elif linear == False and mode == 'scale':
            v = 0.5 * norm.cdf(0.25 * df.iloc[i] / (F75 - F25)) - 0.5
        else:  # even if strange values are given, it will perform full normalization with compression as default
            v = norm.cdf(0.5 * (df.iloc[i] - F50) / (F75 - F25)) - 0.5
        temp.append(v[0])
    return pd.DataFrame(temp[::-1])
I'm not going to give the Python code, but the definition of normalizing is that for every value (data point) you calculate "(value - mean) / stdev". Your values will not fall between 0 and 1 (or 0 and 100), but I don't think that's what you want. You want to compare the variation, which is what you are left with if you do this.
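For completeness, that calculation as a one-line sketch for a numpy array:

standardized = (timeseries - timeseries.mean()) / timeseries.std()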
from sklearn import preprocessing
normalized_data = preprocessing.minmax_scale(data)
You can take a look here normalize-standardize-time-series-data-python
and
sklearn.preprocessing.minmax_scale
I am creating a system for logging data from sensors. (Just a series of numbers)
I would like to be able to put the system into a "learn" mode for a couple of days so it can see what its "normal" operational values are, so that once it is out of this mode any deviation from this behaviour past a certain point can be flagged. The data is all stored in a MySQL database.
Any suggestions on how to carry this out would be welcome, as would locations for further reading on the topic.
I would preferably like to use python for this task.
The data is temperature and humidity values every 5 minutes in a temperature-controlled area that is accessed and used during the day. This means it will have fluctuations for when it is in use and some temperature changes. But anything different from this, such as cooling or heating systems failing, needs to be detected.
Essentially what you should be looking at is density estimation: the task of determining a model of how some variables behave, so that you can look for deviations from it.
Here's some very simple example code. I've assumed that temperature and humidity have independent normal distributions on their untransformed scales:
import numpy as np
from scipy.stats import norm  # matplotlib.mlab.normpdf has been removed; scipy's norm.pdf is equivalent

class TempAndHumidityModel(object):
    def __init__(self):
        self.tempMu = 0
        self.tempSigma = 1
        self.humidityMu = 0
        self.humiditySigma = 1

    def setParams(self, tempMeasurements, humidityMeasurements, quantile):
        self.tempMu = np.mean(tempMeasurements)
        self.tempSigma = np.std(tempMeasurements)
        self.humidityMu = np.mean(humidityMeasurements)
        self.humiditySigma = np.std(humidityMeasurements)
        if not 0 < quantile <= 1:
            raise ValueError("Quantile for threshold must be between 0 and 1")
        self._thresholdDensity(quantile, tempMeasurements, humidityMeasurements)

    def _thresholdDensity(self, quantile, tempMeasurements, humidityMeasurements):
        tempDensities = norm.pdf(tempMeasurements, self.tempMu, self.tempSigma)
        humidityDensities = norm.pdf(humidityMeasurements, self.humidityMu, self.humiditySigma)
        densities = sorted(tempDensities * humidityDensities, reverse=True)
        # Here comes the massive oversimplification: just choose the
        # density value at the quantile*length position, and use this as the threshold
        self.threshold = densities[int(np.round(quantile * len(densities)))]

    def probOfObservation(self, temp, humidity):
        return norm.pdf(temp, self.tempMu, self.tempSigma) * \
               norm.pdf(humidity, self.humidityMu, self.humiditySigma)

    def isNormalMeasurement(self, temp, humidity):
        return self.probOfObservation(temp, humidity) > self.threshold

if __name__ == '__main__':
    # Create some simulated data
    temps = np.random.randn(100) * 10 + 50
    humidities = np.random.randn(100) * 2 + 10
    thm = TempAndHumidityModel()
    # going to hard code in the 95% threshold
    thm.setParams(temps, humidities, 0.95)
    # Create some new data from same dist and see how many false positives
    newTemps = np.random.randn(100) * 10 + 50
    newHumidities = np.random.randn(100) * 2 + 10
    numFalseAlarms = sum(~thm.isNormalMeasurement(t, h) for t, h in zip(newTemps, newHumidities))
    print('{} false alarms!'.format(numFalseAlarms))
    # Now create some abnormal data: mean temp drops to 20
    lowTemps = np.random.randn(100) * 10 + 20
    normalHumidities = np.random.randn(100) * 2 + 10
    numDetections = sum(~thm.isNormalMeasurement(t, h) for t, h in zip(lowTemps, normalHumidities))
    print('{} abnormal measurements flagged'.format(numDetections))
Example output:
>> 3 false alarms!
>> 77 abnormal measurements flagged
Now, I have no idea whether the assumption of normality is appropriate for your data (you may want to transform the data onto a different scale so that it is); it's probably wildly inaccurate to assume independence between temperature and humidity; and the trick that I have used to find the density value corresponding to the requested quantile of the distribution should be replaced by something that uses the inverse CDF of the distribution. However, this should give you a flavour of what to do.
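As one example of that last point (a sketch, not part of the original code): for two independent Gaussian variables the sum of squared z-scores follows a chi-squared distribution with 2 degrees of freedom, so the threshold can be computed from its inverse CDF instead of from the sorted empirical densities (the helper name below is made up):

from scipy.stats import chi2

def isNormalMeasurementExact(model, temp, humidity, quantile=0.95):
    # Mahalanobis-style distance for two independent normals
    d2 = ((temp - model.tempMu) / model.tempSigma) ** 2 + \
         ((humidity - model.humidityMu) / model.humiditySigma) ** 2
    return d2 <= chi2.ppf(quantile, df=2)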
Note additionally that there are many good non-parametric density estimators: kernel density estimators immediately spring to mind. These may be more appropriate if your data doesn't look like any standard distribution.
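A minimal sketch of that non-parametric alternative, using scipy's Gaussian KDE on the two variables jointly and flagging the lowest-density 5% of the training data (that threshold choice is arbitrary):

import numpy as np
from scipy.stats import gaussian_kde

training = np.vstack([temps, humidities])     # shape (2, n_samples)
kde = gaussian_kde(training)
threshold = np.quantile(kde(training), 0.05)  # density below which points look anomalous
is_anomalous = kde(np.vstack([newTemps, newHumidities])) < threshold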
It looks like you are attempting to perform anomaly detection but your description of your data is vague. You should start by trying to define/constrain what it means, in general, for your data to be "normal".
Is there a different "normal" for each sensor?
Is a sensor measurement somehow dependent on its previous measurement(s)?
Does "normal" change over the course of a day?
Can the "normal" measurements from a sensor be characterized by a statistical model (e.g., are the data Gaussian or log-normal)?
Once you have answered those types of questions, then you can train a classifier or anomaly detector with a batch of data from your database and use the result to assess future log output. If machine learning algorithms are applicable to your data, you might consider using scikit-learn. For statistical models, you could use the stats subpackage of SciPy. And, of course, for any kind of numerical data manipulation in python, NumPy is your friend.
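If scikit-learn does turn out to be applicable, here is a rough sketch of one possible anomaly detector; IsolationForest is just one choice, and the array names are placeholders for your "learn mode" batch and later readings:

import numpy as np
from sklearn.ensemble import IsolationForest

X_train = np.column_stack([train_temps, train_humidities])  # batch of "normal" readings
detector = IsolationForest(contamination=0.01, random_state=0).fit(X_train)

X_new = np.column_stack([new_temps, new_humidities])
flags = detector.predict(X_new)  # -1 = anomaly, 1 = normal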