Outlier detection for time series - python

I wanted to build a very simple example of anomaly detection for time series, so I created sample data with one very obvious outlier. However, none of the methods I have tried so far detects the outlier reliably. I tried local outlier factor, isolation forest and k-nearest neighbors; from what I read, at least one of those methods should be suitable. I also tried tweaking the parameters, but that didn't really help.
What mistake am I making here? Are the methods not appropriate?
Below is a code example.
Thanks in advance!
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
t = np.linspace(0, 10, 101).reshape(-1, 1)
y_test = 0.5 + t + t**2 + 2*np.random.randn(len(t), 1)
y_test[10] = y_test[10]*7  # scale one point to create an obvious outlier
plt.figure(1)
plt.plot(t,y_test)
plt.show()
from sklearn.neighbors import LocalOutlierFactor
clf=LocalOutlierFactor(contamination='auto')
pred=clf.fit_predict(y_test)
plt.figure(2)
plt.plot(t[pred==1],y_test[pred==1],'bx')
plt.plot(t[pred==-1],y_test[pred==-1],'ro')
plt.show()
from sklearn.ensemble import IsolationForest
clf = IsolationForest(contamination='auto')  # the 'behaviour' argument was removed in newer scikit-learn
pred=clf.fit_predict(y_test)
plt.figure(3)
plt.plot(t[pred==1],y_test[pred==1],'bx')
plt.plot(t[pred==-1],y_test[pred==-1],'ro')
plt.show()
from pyod.models.knn import KNN
clf = KNN()
clf.fit(y_test)
pred=clf.predict(y_test)
plt.figure(4)
plt.plot(t[pred==0],y_test[pred==0],'bx')
plt.plot(t[pred==1],y_test[pred==1],'ro')
plt.show()

I guess it is because of two reasons:
these algorithms are not designed to handle 1-d data specifically,
and they are not designed for time-series problems.
You may need to use a time-series tool for it; see the sketch below.
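For instance, one common time-series approach, offered here only as a hedged sketch (the quadratic trend degree and the 3-sigma threshold are assumptions chosen for this toy data, not part of the original answer), is to remove the trend first and then flag points whose residuals are unusually large:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
t = np.linspace(0, 10, 101)
y = 0.5 + t + t**2 + 2*np.random.randn(len(t))
y[10] = y[10]*7  # inject one obvious outlier
# Fit a quadratic trend and compute the residuals.
coeffs = np.polyfit(t, y, deg=2)
residuals = y - np.polyval(coeffs, t)
# Flag residuals more than 3 standard deviations from the mean.
outliers = np.abs(residuals - residuals.mean()) > 3*residuals.std()
plt.plot(t[~outliers], y[~outliers], 'bx')
plt.plot(t[outliers], y[outliers], 'ro')
plt.show()
A more robust variant would use the median and MAD instead of the mean and standard deviation, since the outlier itself inflates the standard deviation.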

Related

Signal data denoising or removing artifacts

There is noise in the original dataset that I want to remove; spline interpolation or wavelet filtering can be used.
Please find the dataset sample here
t1 = 1583516027000; t2 = 1583516028000
t3 = 1583515991000; t4 = 1583515993000
u1 = d5[(d5['time'] >= t1) & (d5['time'] <= t2)]
u2 = d5[(d5['time'] >= t3) & (d5['time'] <= t4)]
t1, t2 and t3, t4 are the timestamp intervals where the noise occurred. To denoise them:
u1['ch1'].interpolate(method='spline', order=2)
This gives me an error, and in any case interpolation only fills in missing observations, not existing values.
Also, for wavelet denoising I wrote this code:
import pywt
import numpy as np
from scipy.misc import electrocardiogram
import scipy.signal as signal
import matplotlib.pyplot as plt
from skimage.restoration import denoise_wavelet
wavelet_type = 'db6'
x_denoise = denoise_wavelet(u1.iloc[:, 0], method='BayesShrink', mode='soft', wavelet_levels=3, wavelet='sym8', rescale_sigma=True)  # rescale_sigma should be a boolean, not the string 'True'
However, it does not change the results. How can I do this task? This is my first question, so I may not be explaining the problem clearly. My intention is to produce a denoised dataset using spline interpolation and wavelet denoising. The difficulty is that I cannot filter the whole dataset: only certain time intervals contain the artifacts, and filtering everything would also distort the clean data. Therefore I have to filter based on the time intervals.
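A minimal sketch of one way to restrict the filtering to an interval, assuming d5 has numeric 'time' and 'ch1' columns as above (an editorial suggestion, not from the original post): either blank out the noisy samples and let spline interpolation fill them, or wavelet-denoise only the slice and assign it back.
import numpy as np
from skimage.restoration import denoise_wavelet
mask = (d5['time'] >= t1) & (d5['time'] <= t2)
# Option 1: treat the noisy samples as missing, then spline-interpolate.
# interpolate() only fills NaN values, so blank them out first.
d5.loc[mask, 'ch1'] = np.nan
d5['ch1'] = d5['ch1'].interpolate(method='spline', order=2)
# Option 2 (use instead of option 1): denoise just the slice in place.
d5.loc[mask, 'ch1'] = denoise_wavelet(d5.loc[mask, 'ch1'].to_numpy(), method='BayesShrink', mode='soft', wavelet='sym8', wavelet_levels=3, rescale_sigma=True)
Either way, only the rows inside [t1, t2] are modified, so the rest of the dataset keeps its original values.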

Reproduce statsmodels calculation

I am running an OLS regression with statsmodels, using clustered standard errors (https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.get_robustcov_results.html):
import statsmodels.api as sm
import statsmodels.formula.api as smf
model = smf.ols(model_string, data=df1).fit(cov_type = 'cluster', cov_kwds={'groups': df1['correctedSE']})
I run a loop with different specifications and would like to visualise the coefficients of all iterations.
I extract the coefficient and standard error in every iteration with:
coe = model.params["variable_of_interest"]
se = model.bse["variable_of_interest"]
List_coefficients.append(coe)
List_standard_errors.append(se)
I create a simple dataframe out of the lists then want to visualise the coefficients in an errorbar from matplotlib:
import matplotlib
import matplotlib.pyplot as plt
ax = plt.errorbar(df["loop_rounds"], df["Coefficient"], yerr=df["Standard_e"])
However, when doing so, the confidence intervals are somehow calculated differently, and some coefficients are suddenly significant in the graph.
What would be the best way to correct for the calculation difference (e.g. extract different parameters from statsmodels, manually adjust the standard errors, or change the errorbar call)?
If manual editing is the solution, where can I find the formula for the clustered standard errors?
Thank you
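One likely source of the discrepancy, noted here as an editorial sketch rather than part of the original thread: plt.errorbar draws whiskers of exactly ±yerr, i.e. ±1 standard error in the code above, whereas the confidence intervals statsmodels reports are ±(critical value)·SE, roughly ±1.96·SE at the 95% level. Rather than reproducing the formula, you can extract the interval bounds directly from the fitted model; "variable_of_interest" and round_number are placeholders:
import matplotlib.pyplot as plt
# conf_int() uses the same clustered covariance the model was fit with.
ci = model.conf_int(alpha=0.05).loc["variable_of_interest"]
coe = model.params["variable_of_interest"]
# errorbar expects half-widths: distances below and above the point.
yerr = [[coe - ci[0]], [ci[1] - coe]]
plt.errorbar([round_number], [coe], yerr=yerr, fmt='o')
Collecting these half-widths in the loop, instead of the raw standard errors, should make the graph match the regression output.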

Rotating parallel coordinate axis-names in Pandas

When using some of the built in visualization tools in Pandas, one that is very helpful for me is the parallel_coordinates visualization. However, since I have around 18 features in the dataframe, the bottom of the parallel_coords plot gets really messy.
Therefore, I was wondering if anyone knew how to rotate the axis-names to be vertical rather than horizontal as shown here:
I did find a way to use parallel_coords in a polar set-up, creating a radar chart; while that helped make the different features visible, it doesn't quite work, since whenever the values are close to 0 the curve becomes almost impossible to see. Furthermore, the polar coordinate frame required me to stop using pandas' dataframe, which is part of what made this method so appealing.
Using plt.xticks(rotation=90) should be enough. Here is an example with the “Iris” dataset:
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates
data = pd.read_csv('iris.csv')
parallel_coordinates(data, 'Name')
plt.xticks(rotation=90)
plt.show()

Python Heatmaps (Basic and Complex)

What's the best way to do a heatmap in python (2.7)? I've found the heatmap.py module, and I was wondering if people have any advice on using it, or if there are other packages that do a good job.
I'm dealing with pretty basic data, like xy = np.random.rand(1000,2) superimposed on an image.
Although there's another thing I want to try, which is doing a heatmap that's scaled to a different heatmap. E.g., I have
attempts = np.random.rand(5000,2)
successes = np.random.rand(500,2)
And I want a heatmap of the successes relative to the density of the attempts. Is this possible?
Seaborn is a pretty widely-used library for making nice-looking plots, and has a heatmap function. Seaborn uses matplotlib under the hood.
import numpy as np
import seaborn as sns
xy = np.random.rand(1000,2)
sns.heatmap(xy, yticklabels=100)
Regarding your second question, I'm not sure what you mean. But my advice would be to create a numpy array or pandas dataframe of "successes [scaled] relative to the density of the attempts", however you mean that, and then pass that scaled array or dataframe to sns.heatmap
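For instance, under one reading of the question, a hedged sketch would be to bin both point sets on a shared grid with np.histogram2d and plot the per-bin success rate (the 20×20 grid here is an arbitrary choice):
import numpy as np
import matplotlib.pyplot as plt
attempts = np.random.rand(5000, 2)
successes = np.random.rand(500, 2)
# Count attempts and successes in the same 20x20 bins.
bins = [np.linspace(0, 1, 21), np.linspace(0, 1, 21)]
h_att, xedges, yedges = np.histogram2d(attempts[:, 0], attempts[:, 1], bins=bins)
h_suc, _, _ = np.histogram2d(successes[:, 0], successes[:, 1], bins=bins)
# Success rate per bin; leave empty bins at 0 to avoid division by zero.
rate = np.divide(h_suc, h_att, out=np.zeros_like(h_suc), where=h_att > 0)
plt.imshow(rate.T, origin='lower', extent=[0, 1, 0, 1])
plt.colorbar(label='successes / attempts')
plt.show()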
You can plot very complex heatmaps using the Python package PyComplexHeatmap: https://github.com/DingWB/PyComplexHeatmap
Examples: https://github.com/DingWB/PyComplexHeatmap/blob/main/examples.ipynb
The most basic heatmap you can get is an image plot:
import matplotlib.pyplot as plt
import numpy as np
xy = np.random.rand(100,2)
plt.imshow(xy, aspect="auto")
plt.colorbar()
plt.show()
Note that using more points than you have pixels to show the heatmap might not make too much sense.
There are of course also other methods to draw heatmaps, and you may go through the matplotlib example gallery to see which plot appeals most to you.

How to better fit seaborn violinplots?

The following code gives me a very nice violinplot (and boxplot within).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
foo = np.random.rand(100)
sns.violinplot(foo)
plt.boxplot(foo)
plt.show()
So far so good. However, when I look at foo, the variable does not contain any negative values. The seaborn plot seems misleading here. The normal matplotlib boxplot gives something closer to what I would expect.
How can I make violinplots with a better fit (not showing false negative values)?
As the comments note, this is a consequence (I'm not sure I'd call it an "artifact") of the assumptions underlying Gaussian KDE. As has been mentioned, this is somewhat unavoidable, and if your data don't meet those assumptions, you might be better off just using a boxplot, which shows only points that exist in the actual data.
However, in your response you ask about whether it could be fit "tighter", which could mean a few things.
One answer might be to change the bandwidth of the smoothing kernel. You do that with the bw argument, which is actually a scale factor; the bandwidth that will be used is bw * data.std():
data = np.random.rand(100)
sns.violinplot(y=data, bw=.1)
Another answer might be to truncate the violin at the extremes of the datapoints. The KDE will still be fit with densities that extend past the bounds of your data, but the tails will not be shown. You do that with the cut parameter, which specifies how many units of bandwidth past the extreme values the density should be drawn. To truncate, set it to 0:
sns.violinplot(y=data, cut=0)
By the way, the API for violinplot is going to change in 0.6, and I'm using the development version here, but both the bw and cut arguments exist in the current released version and behave more or less the same way.
