I have a dataset of time-series examples. I want to calculate the similarity between various time-series examples, but I do not want to take into account differences due to scaling (i.e. I want to look at similarities in the shape of the time series, not their absolute values). To this end, I need a way of normalizing the data, that is, making all of the time-series examples fall within a certain range, e.g. [0,100]. Can anyone tell me how this can be done in Python?
The solutions given are good for a series that is neither incremental nor decremental (i.e. stationary). For a financial time series (or any other series with a bias) the formula given is not right: the series should first be detrended, or the scaling should be based on the latest 100-200 samples.
And if the time series doesn't come from a normal distribution (as is the case in finance), it is advisable to apply a non-linear function (a standard CDF function, for example) to compress the outliers.
The Aronson and Masters book (Statistically Sound Machine Learning for Algorithmic Trading) uses the following formula (on 200-day chunks):
V = 100 * N(0.5 * (X - F50) / (F75 - F25)) - 50
Where:
X: data point
F50: median of the latest 200 points
F75: 75th percentile
F25: 25th percentile
N: normal CDF
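For illustration, here is a minimal numpy/scipy sketch of that formula applied to a single chunk (the helper name and variable names are mine, not from the book); the full rolling implementation follows further down.
import numpy as np
from scipy.stats import norm

def aronson_normalize(x, chunk):
    # chunk: the latest ~200 samples; x: the data point to normalize
    F50 = np.percentile(chunk, 50)            # median of the chunk
    F25, F75 = np.percentile(chunk, [25, 75])
    return 100 * norm.cdf(0.5 * (x - F50) / (F75 - F25)) - 50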
Assuming that your timeseries is an array, try something like this:
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
This will confine your values between 0 and 1
Following my previous comment, here is a (not optimized) Python function that does scaling and/or normalization.
(It needs a pandas DataFrame as input, and it doesn't check for that, so it raises errors if supplied with another object type. If you need to use a list or numpy.array you need to modify it, but you could also convert those objects to pandas.DataFrame() first.)
This function is slow, so it's advisable to run it just once and store the results.
from scipy.stats import norm
import pandas as pd

def get_NormArray(df, n, mode='total', linear=False):
    '''
    It computes the normalized value on the stats of the previous n values (modes: total or scale)
    using the formulas from the book "Statistically Sound Machine Learning..."
    (Aronson and Masters), but the decision to apply a non-linear scaling is left to the user.
    It is modified to fit the data from -1 to 1 instead of -100 to 100.
    df is an input DataFrame; it returns a DataFrame, but it could return a list.
    n defines the number of data points used to compute the median and quartiles for the normalization.
    Modes: scale: scale without centering. total: center and scale.
    '''
    temp = []
    for i in range(len(df))[::-1]:
        if i >= n:  # there will be a travelling norm until we reach the initial n values;
                    # those values will be normalized using the last computed F50, F75 and F25
            F50 = df[i-n:i].quantile(0.5)
            F75 = df[i-n:i].quantile(0.75)
            F25 = df[i-n:i].quantile(0.25)
        if linear == True and mode == 'total':
            v = 0.5 * ((df.iloc[i] - F50) / (F75 - F25)) - 0.5
        elif linear == True and mode == 'scale':
            v = 0.25 * df.iloc[i] / (F75 - F25) - 0.5
        elif linear == False and mode == 'scale':
            v = 0.5 * norm.cdf(0.25 * df.iloc[i] / (F75 - F25)) - 0.5
        else:  # even if strange values are given, it will perform full normalization with compression as default
            v = norm.cdf(0.5 * (df.iloc[i] - F50) / (F75 - F25)) - 0.5
        temp.append(v[0])
    return pd.DataFrame(temp[::-1])
I'm not going to give the Python code, but the definition of normalizing is that for every value (data point) you calculate (value - mean) / stdev. Your values will not fall between 0 and 1 (or 0 and 100), but I don't think that's what you want: you want to compare the variation, which is what you are left with if you do this.
from sklearn import preprocessing
normalized_data = preprocessing.minmax_scale(data)
You can take a look at normalize-standardize-time-series-data-python and at sklearn.preprocessing.minmax_scale.
So I have a dataset that I want to be normalized. The dataset consists of a bunch of numbers, so I'm just going to post one line of it:
1,1,22,22,22,19,18,14,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1,0
Does anyone know how to do it? I'm not allowed to use Scikit-Learn.
Normalization takes all your values and transforms them so that they lie in between 0 and 1.
To perform this:
First find the minimum value (call it a) and the maximum value (call it b)
Take every value in your data set (call it d) and find (d-a)/(b-a).
(d-a) makes sure that the range goes from [a,b] to [0,b-a] and then dividing by (b-a) makes the range [0,1].
In Python, you would first convert your dataset to a NumPy array (a much more efficient data structure):
import numpy as np
d = np.array(your_dataset)
Then find the max and min
a = d.min()
b = d.max()
Finally you perform the operation
d = (d-a)/(b-a)
In order to normalize a dataset this way, you calculate the average (df['column_name'].mean()) and the standard deviation (df['column_name'].std()) of the column, subtract the average from every value, and divide the result by the standard deviation. Note that the result is centred around 0 rather than confined to [0, 1].
So the result would look something like this:
avg = df['column_name'].mean()
std = df['column_name'].std()
normalized = (df['column_name'] - avg) / std
I am working with timeseries data collected from a sensor at 5min intervals. Unfortunately, there are cases when the measured value (PV yield in watts) is suddenly 0 or very high. The values before and after are correct:
My goal is to identify these 'outliers' and (in a second step) replace the measured value with the mean of the previous and next values. I've experimented with two approaches so far, but they flag many 'outliers' that are not measurement errors. Hence, I am looking for better approaches.
Try 1: Classic outlier detection with IQR Source
def updateOutliersIQR(group):
    Q1 = group['yield'].quantile(0.25)
    Q3 = group['yield'].quantile(0.75)
    IQR = Q3 - Q1
    outliers = (group['yield'] < (Q1 - 1.5 * IQR)) | (group['yield'] > (Q3 + 1.5 * IQR))
    print(outliers[outliers == True])

# calling the function on a per-day level
df.groupby(df.index.date).apply(updateOutliersIQR)
Try 2: kernel density estimation Source
def updateOutliersKDE(group):
    a = 0.9
    r = group['yield'].rolling(3, min_periods=1, win_type='parzen').sum()
    n = r.max()
    outliers = (r > n * a)
    print(outliers[outliers == True])

# calling the function on a per-day level
df.groupby(df.index.date).apply(updateOutliersKDE)
Try 3: Median Filter Source
(As suggested by Jonnor)
import numpy as np

def median_filter(num_std=3):
    def _median_filter(x):
        _median = np.median(x)
        _std = np.std(x)
        s = x[-3]  # centre value of the length-5 window
        if (s >= _median - num_std * _std) and (s <= _median + num_std * _std):
            return s
        else:
            return _median
    return _median_filter

# calling the function
df['yield'].rolling(5, center=True).apply(median_filter(2), raw=True)
Edit: with try 3, a window of 5 and a std of 3, it finally catches the massive outlier, but it also loses accuracy on the other (non-faulty) sensor measurements:
Are there any better approaches to detect the described 'outliers' or perform smoothing in timeseries data with the occasional sensor measurement issue?
Your abnormal values are abnormal in the sense that:
the values deviate a lot from the values around them
the value changes very quickly from one time-step to the next
Thus what is needed is a filter that looks at a short time-context to filter these out.
One of the simplest and most effective is the median filter.
filtered = df.rolling(window=5).median()  # modern pandas replacement for the removed pandas.rolling_median
The longer the window, the stronger the filter.
An alternative would be a low-pass filter. Though setting an appropriate cutoff frequency can be harder, and it will impose a smoothness onto the signal.
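For reference, a minimal low-pass sketch with scipy.signal, assuming the same df['yield'] series from the question; the filter order and cutoff (as a fraction of the Nyquist frequency) are placeholder values you would have to tune.
from scipy import signal

# 2nd-order Butterworth low-pass; 0.1 is a placeholder normalized cutoff
b, a = signal.butter(2, 0.1, btype='low')
smoothed = signal.filtfilt(b, a, df['yield'].values)  # zero-phase filtering, assumes no NaNs in the series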
One can of course create more custom filters as well. For example, compute the first-order difference, and reject changes higher than a certain threshold. You can plot a histogram of the differences to determine a threshold. Mark these as missing (NaN), and then impute the missing using median/mean.
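A rough pandas sketch of that difference-based filter, again assuming the question's df['yield'] column; the threshold value is a placeholder you would read off the histogram.
import numpy as np

jumps = df['yield'].diff().abs()
threshold = 1000                                        # placeholder; pick it from the histogram of jumps
cleaned = df['yield'].mask(jumps > threshold, np.nan)   # mark suspect points as missing
cleaned = cleaned.interpolate()                         # or fill with a rolling median/mean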
If your goal is Anomaly Detection, you can also use an Autoencoder. I would expect PV output to have a very strong daily pattern. So training it on daily sequences should work quite well (provided you have enough data). This is much more complicated than a simple filter, but has the advantage of being able to detect many other kinds of anomalies as well, not just the pattern identified here.
I have a numpy array which is basically a data column from an Excel sheet. This data was acquired through a DAS with a low-pass 10 Hz filter, but due to some ambiguity it contains square-wave-like artifacts. The data now has to be filtered with a 0.4 Hz high-pass Butterworth filter, which I do through scipy.signal. But after applying the high-pass filter, the square-wave-like artifacts turn into spikes. When I apply a median filter to it I am not able to successfully remove the spikes. What should I try?
The following pic shows the original data.
The following pic shows the 0.4 Hz high-pass filter applied, followed by a median filter of order 3.
Even a median filter of order 51 is not useful.
If your input is always expected to have a significant outlier, I would recommend an iterative filtering approach.
Here is your data plotted along with the mean, 1-sigma, 2-sigma and 3-sigma lines:
I would start by removing everything above and below 2-sigma from the mean. Since that will tighten the distribution, I would recommend doing the iterations over and over until the size of the un-trimmed data remains the same. I would recommend increasing the threshold geometrically to avoid trimming "good" data. Finally, you can fill in the missing points with the mean of the remainder or something like that.
Here is a sample implementation, with no attempt at optimization whatsoever:
import numpy as np

data = np.loadtxt('data.txt', skiprows=1)
x = np.arange(data.size)
loop_data = data
prev_size = 0
nsigma = 2
while prev_size != loop_data.size:
    mean = loop_data.mean()
    std = loop_data.std()
    mask = (loop_data < mean + nsigma * std) & (loop_data > mean - nsigma * std)
    prev_size = loop_data.size
    loop_data = loop_data[mask]
    x = x[mask]
    # Constantly expanding sigma guarantees fast loop termination
    nsigma *= 2

# Reconstruct the mask
mask = np.zeros_like(data, dtype=bool)
mask[x] = True

# This destroys the original data somewhat
data[~mask] = data[mask].mean()
This approach may not be optimal in all situations, but I have found it to be fairly robust most of the time. There are lots of tweakable parameters. You may want to change your increase factor from 2, or even go with a linear instead of geometrical increase (although I tried the latter and it really didn't work well). You could also use IQR instead of sigma, since it is more robust against outliers.
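For instance, an IQR-based version of the mask inside the loop might look like this (my sketch, reusing nsigma as the IQR multiplier rather than a number of standard deviations):
q1, q3 = np.percentile(loop_data, [25, 75])
iqr = q3 - q1
# keep points within nsigma * IQR of the quartiles instead of nsigma standard deviations of the mean
mask = (loop_data > q1 - nsigma * iqr) & (loop_data < q3 + nsigma * iqr)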
Here is an image of the resulting dataset (with the removed portion in red and the original dotted):
Another artifact of interest: here are the plots of the data showing the trimming progression and how it affects the cutoff points. The plots show the data, with cut portions in red, and the n-sigma line for the remainder. The title shows how much the sigma shrinks:
I am trying to do dimensionality reduction using PCA function of sklearn, specifically
from sklearn.decomposition import PCA
def mypca(X,comp):
pca = PCA(n_components=comp)
pca.fit(X)
PCA(copy=True, n_components=comp, whiten=False)
Xpca = pca.fit_transform(X)
return Xpca
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
I am calling the mypca function from a loop with different values for comp. I am doing this in order to find the best value of comp for the problem I am trying to solve. But the mypca function always returns the same value for Xpca, irrespective of the value of comp.
The value it returns is only correct for the first value of comp I pass from the loop, i.e. the Xpca it returns each time is the one for comp = 10 in my case.
What should I do in order to find the best value of comp?
You use PCA to reduce the dimension.
From your code:
for n_comp in range(10,1000,20):
Xpca = mypca(X,n_comp) # X is a 2 dimensional array
print Xpca
Your input dataset X is only a 2-dimensional array; the minimum n_comp is 10, so PCA tries to find the 10 best dimensions for you. Since 10 > 2, you will always get the same answer. :)
It looks like you're trying to pass different values for the number of components and re-fit with each. A great thing about PCA is that it's actually not necessary to do this. You can fit the full number of components (even as many components as dimensions in your dataset), then simply discard the components you don't want (i.e. those with small variance). This is equivalent to re-fitting the entire model with fewer components, and saves a lot of computation.
How to do it:
# x = input data, size(<points>, <dimensions>)
# fit the full model
max_components = x.shape[1] # as many components as input dimensions
pca = PCA(n_components=max_components)
pca.fit(x)
# transform the data (contains all components)
y_all = pca.transform(x)
# keep only the top k components (with greatest variance)
k = 2
y = y_all[:, 0:k]
In terms of how to select the number of components, it depends what you want to do. One standard way of choosing the number of components k is to look at the fraction of variance explained (R^2) by each choice of k. If your data is distributed near a low-dimensional linear subspace, then when you plot R^2 vs. k, the curve will have an 'elbow' shape. The elbow will be located at the dimensionality of the subspace. It's good practice to look at this curve because it helps understand the data. Even if there's no clean elbow, it's common to choose a threshold value for R^2, e.g. to preserve 95% of the variance.
Here's how to do it (this should be done on the model with max_components components):
# Calculate fraction of variance explained
# for each choice of number of components
r2 = pca.explained_variance_ratio_.cumsum()  # cumulative fraction of variance kept by the first k components
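If you go with a variance threshold instead of an elbow, a minimal sketch for picking k from that curve (the 0.95 target is only an example):
import numpy as np

target = 0.95
k = int(np.searchsorted(r2, target)) + 1  # smallest k whose cumulative variance fraction reaches the target
y = y_all[:, :k]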
Another way you might want to proceed is to take the PCA-transformed data and feed it to a downstream algorithm (e.g. classifier/regression), then select your number of components based on the performance (e.g. using cross validation).
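A rough sketch of that route with scikit-learn's Pipeline and GridSearchCV; the logistic-regression classifier, the grid values, and the labels array are placeholders, not something given in the question.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# search over the number of PCA components using cross-validated downstream performance
pipe = Pipeline([('pca', PCA()), ('clf', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {'pca__n_components': [10, 50, 100, 200]}, cv=5)
grid.fit(x, labels)  # 'labels' is a placeholder target vector for the downstream task
print(grid.best_params_)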
Side note: Maybe just a formatting issue, but your code block in mypca() should be indented, or it won't be interpreted as part of the function.
I am creating a system for logging data from sensors. (Just a series of numbers)
I would like to be able to put the system into a "learn" mode for a couple of days so it can see what its "normal" operational values are and that once it is out of this any deviation from this behaviour past a certain point can be flagged. The data is all stored in a MySQL database.
Any suggestions on how to carry this out would be welcome, as would locations for further reading on the topic.
I would preferably like to use python for this task.
The data is temperature and humidity values every 5 minutes in a temperature-controlled area that is accessed and used during the day. This means it will have fluctuations for when it is in use, and some temperature changes. But anything different from this, such as the cooling or heating systems failing, needs to be detected.
Essentially what you should be looking at is density estimation: the task of determining a model of how some variables behave, so that you can look for deviations from it.
Here's some very simple example code. I've assumed that temperature and humidity have independent normal distributions on their untransformed scales:
import numpy as np
from scipy.stats import norm  # replaces the removed matplotlib.mlab.normpdf

class TempAndHumidityModel(object):
    def __init__(self):
        self.tempMu = 0
        self.tempSigma = 1
        self.humidityMu = 0
        self.humiditySigma = 1

    def setParams(self, tempMeasurements, humidityMeasurements, quantile):
        self.tempMu = np.mean(tempMeasurements)
        self.tempSigma = np.std(tempMeasurements)
        self.humidityMu = np.mean(humidityMeasurements)
        self.humiditySigma = np.std(humidityMeasurements)
        if not 0 < quantile <= 1:
            raise ValueError("Quantile for threshold must be between 0 and 1")
        self._thresholdDensity(quantile, tempMeasurements, humidityMeasurements)

    def _thresholdDensity(self, quantile, tempMeasurements, humidityMeasurements):
        tempDensities = np.apply_along_axis(
            lambda x: norm.pdf(x, self.tempMu, self.tempSigma), 0, tempMeasurements)
        humidityDensities = np.apply_along_axis(
            lambda x: norm.pdf(x, self.humidityMu, self.humiditySigma), 0, humidityMeasurements)
        densities = sorted(tempDensities * humidityDensities, reverse=True)
        # Here comes the massive oversimplification: just choose the
        # density value at the quantile*length position, and use this as the threshold
        self.threshold = densities[int(np.round(quantile * len(densities)))]

    def probOfObservation(self, temp, humidity):
        return norm.pdf(temp, self.tempMu, self.tempSigma) * \
               norm.pdf(humidity, self.humidityMu, self.humiditySigma)

    def isNormalMeasurement(self, temp, humidity):
        return self.probOfObservation(temp, humidity) > self.threshold


if __name__ == '__main__':
    # Create some simulated data
    temps = np.random.randn(100) * 10 + 50
    humidities = np.random.randn(100) * 2 + 10
    thm = TempAndHumidityModel()
    # going to hard code in the 95% threshold
    thm.setParams(temps, humidities, 0.95)
    # Create some new data from the same distribution and see how many false positives
    newTemps = np.random.randn(100) * 10 + 50
    newHumidities = np.random.randn(100) * 2 + 10
    numFalseAlarms = sum(not thm.isNormalMeasurement(t, h) for t, h in zip(newTemps, newHumidities))
    print('{} false alarms!'.format(numFalseAlarms))
    # Now create some abnormal data: mean temp drops to 20
    lowTemps = np.random.randn(100) * 10 + 20
    normalHumidities = np.random.randn(100) * 2 + 10
    numDetections = sum(not thm.isNormalMeasurement(t, h) for t, h in zip(lowTemps, normalHumidities))
    print('{} abnormal measurements flagged'.format(numDetections))
Example output:
>> 3 false alarms!
>> 77 abnormal measurements flagged
Now, I have no idea whether the assumption of normality is appropriate for your data (you may want to transform the data onto a different scale so that it is); it's probably wildly inaccurate to assume independence between temperature and humidity; and the trick that I have used to find the density value corresponding to the requested quantile of the distribution should be replaced by something that uses the inverse CDF of the distribution. However, this should give you a flavour of what to do.
Note additionally that there are many good non-parametric density estimators: kernel density estimators immediately spring to mind. These may be more appropriate if your data doesn't look like any standard distribution.
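For example, a minimal sketch with scipy's Gaussian kernel density estimator, reusing the simulated temps/humidities arrays from the example above; the lowest-density 5% threshold mirrors the hard-coded 95% quantile.
import numpy as np
from scipy.stats import gaussian_kde

training = np.vstack([temps, humidities])    # shape (2, n_samples)
kde = gaussian_kde(training)                 # bandwidth chosen by Scott's rule by default
threshold = np.percentile(kde(training), 5)  # flag the lowest-density 5% as abnormal
new_point = np.array([[25.0], [9.0]])        # one (temp, humidity) column vector
is_normal = kde(new_point)[0] > threshold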
It looks like you are attempting to perform anomaly detection but your description of your data is vague. You should start by trying to define/constrain what it means, in general, for your data to be "normal".
Is there a different "normal" for each sensor?
Is a sensor measurement somehow dependent on its previous measurement(s)?
Does "normal" change over the course of a day?
Can the "normal" measurements from a sensor be characterized by a statistical model (e.g., are the data Gaussian or log-normal)?
Once you have answered those types of questions, then you can train a classifier or anomaly detector with a batch of data from your database and use the result to assess future log output. If machine learning algorithms are applicable to your data, you might consider using scikit-learn. For statistical models, you could use the stats subpackage of SciPy. And, of course, for any kind of numerical data manipulation in python, NumPy is your friend.
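As an illustration of that last step, a minimal scikit-learn sketch; the feature arrays and the 5% contamination rate are assumptions about your data, not something given in the question.
import numpy as np
from sklearn.ensemble import IsolationForest

# temps / humidities: placeholder 1-D arrays of "learn mode" measurements pulled from MySQL
X = np.column_stack([temps, humidities])
detector = IsolationForest(contamination=0.05, random_state=0).fit(X)
flags = detector.predict(X)  # -1 marks measurements the model considers anomalous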