Unsupervised learning: Anomaly detection on discrete time series - python

I am working on a final year project on an unlabelled dataset consisting of vibration data from multiple components inside a wind turbine.
Datasets:
I have data from 4 wind turbines each consisting of 415 10-second intervals.
About the 10 second interval data:
Each of the 415 10-second intervals consists of vibration data for the generator, gearbox, etc. (14 features in total)
The vibration data (the 14 features) are sampled at 25.6 kHz (262144 rows in each interval)
The 10 seconds are recorded once every day, at different times => a little more than 1 year's worth of data
[Head of dataframe with some of the features shown]
Plan:
My current plan is to
Do a Fast Fourier Transform (FFT) from the time domain to the frequency domain for each of the different sensors (gearbox, generator, etc.) for each of the 415 intervals. From the FFT I can extract frequency information to put in a dataframe (statistical features from the FFT, like spectral RMS per bin); see the sketch after this plan.
Build different data sets for different components.
Add features such as wind speed, wind direction, power produced etc.
I will then build unsupervised ML models that can detect anomalies.
Unsupervised models I am considering are encoder-decoder networks (autoencoders) and clustering.
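A minimal sketch of the FFT feature-extraction step, assuming one interval is a NumPy array of shape (262144, 14) sampled at 25.6 kHz (the names interval, FS and N_BANDS, and the band-RMS variant of the statistic, are illustrative):

import numpy as np

FS = 25_600   # sampling rate in Hz, per the interval description
N_BANDS = 10  # illustrative number of frequency bands

def spectral_band_rms(signal, fs=FS, n_bands=N_BANDS):
    # One-sided amplitude spectrum and the frequency of each bin
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    # RMS of the spectrum inside n_bands equal-width bands up to Nyquist
    edges = np.linspace(0, freqs[-1], n_bands + 1)
    return np.array([
        np.sqrt(np.mean(spectrum[(freqs >= lo) & (freqs < hi)] ** 2))
        for lo, hi in zip(edges[:-1], edges[1:])
    ])

# One 10-second recording (placeholder data) -> one row of the feature table
interval = np.random.randn(262144, 14)
row = np.concatenate([spectral_band_rms(interval[:, ch])
                      for ch in range(interval.shape[1])])

Repeating this over the 415 intervals of each turbine would yield the per-component feature dataframes described above.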
Questions:
Does it look like I have enough data for this type of task? 415 intervals x 4 different turbines = 1660 rows and approx. 20 features
Should the data be treated as a time series? (It is sampled for 10 seconds once a day, at random times.)
What other unsupervised ML models/approaches could be good for this task?
I hope this was clearly written. Thanks in advance for any input!

Related

Get prediction from multiple rows - Decision Tree Regressor

I have a dataset of weather data that I want to use to make a prediction.
The data set consists of data from several different locations. The features in the data set are as follows:
datetime
location
rain
snow
temp_min
temp_max
clouds
pressure
humidity
wind_speed
wind_deg
weather_description
The measurements have been made at the same time in all locations, which makes it possible to distinguish between the individual measurements.
I want to use the data from all locations as input to get a prediction for a single location.
Is it possible to use several rows as input, or can the input data only consist of one row?
The DecisionTreeRegressor from scikit-learn expects a dataframe in which each output is generated from a single row. You can nevertheless move all your measurements into one row (during training and testing), as below:
rain_stn1, rain_stn2, rain_stn3, ..., snow_stn1, snow_stn2, snow_stn3, ...
rain_value#stn1, rain_value#stn2, rain_value#stn3, ...
Of course this means that there needs to be some logical relationship between the stations, such as distance. You could also create aggregate features such as rain_nearby (average of stations at <5 km distance) and rain_far (average of stations at >5 km distance), which is probably more helpful in your case.
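A minimal sketch of that reshaping with pandas, assuming a long-format dataframe with the columns from the question (station names and values here are illustrative placeholders):

import pandas as pd

# Long format: one row per (datetime, location) measurement
df = pd.DataFrame({
    "datetime": ["2020-01-01"] * 3,
    "location": ["stn1", "stn2", "stn3"],
    "rain":     [0.0, 1.2, 0.4],
    "snow":     [0.0, 0.0, 0.1],
})

# Wide format: one row per timestamp, columns rain_stn1, rain_stn2, ...
wide = df.pivot(index="datetime", columns="location", values=["rain", "snow"])
wide.columns = [f"{var}_{stn}" for var, stn in wide.columns]
print(wide)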
To get more specific answers, you need to provide more details on your use case, what you are trying to achieve, and what the dataset looks like.

Python - How do I check time series stationarity?

I have a car speed dataset from a highway. The observations are collected at 15-minute steps, which means I have 96 observations per day and 672 per week.
I have a whole month of data (2976 observations).
My goal is to predict future values using an Autoregressive AR(p) model.
Here's the distribution of my data over the month.
In addition, here's the autocorrelation plot (ACF)
The visualization of the two plots above leads me to think of a seasonal component and hence a non-stationary time series, which to me seems beyond doubt.
However, to make sure of the non-stationarity, I applied a Dickey-Fuller test to the series. Here are the results.
Results of Dickey-Fuller Test:
Test Statistic                -1.666334e+01
p-value                        1.567300e-29
#Lags Used                     3.000000e+00
Number of Observations Used    2.972000e+03
Critical Value (1%)           -3.432552e+00
Critical Value (5%)           -2.862513e+00
Critical Value (10%)          -2.567288e+00
dtype: float64
The results clearly show that the absolute value of the test statistic is greater than the (absolute) critical values; therefore we reject the null hypothesis, which means we have a stationary series!
So I'm very confused about the seasonality and stationarity of my time series.
Any help with that would be appreciated.
Thanks a lot
Actually, stationarity and seasonality are not contradictory qualities. Stationarity means the moments of the series (such as the mean and variance, for weak stationarity) are constant over time, and seasonality is a periodic component of the series that can be extracted with filters.
Seasonality and cyclical patterns are not exactly the same thing, but they are very close. You can think of the series in the images you show as containing a sum of sines and cosines that repeats itself over weekly (or monthly, yearly, ...) periods. That is independent of whether the mean of the series, or even its variance, appears constant over the period.
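To check both properties in code: a minimal sketch with statsmodels, assuming the speeds are in a pandas Series (the synthetic data below just stands in for the real series; one day = 96 observations at 15-minute steps):

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic stand-in: a daily cycle (period 96) plus noise
t = np.arange(2976)
speed = pd.Series(80 + 10 * np.sin(2 * np.pi * t / 96) + np.random.randn(2976),
                  index=pd.date_range("2020-01-01", periods=2976, freq="15min"))

# Stationarity: Augmented Dickey-Fuller test (null hypothesis = unit root)
adf_stat, p_value, *rest = adfuller(speed)
print(f"ADF statistic: {adf_stat:.2f}, p-value: {p_value:.3g}")

# Seasonality: decompose into trend, seasonal and residual components
components = seasonal_decompose(speed, period=96)
components.plot()  # requires matplotlib

A series can pass the ADF test (no unit root) and still show a clear seasonal component in the decomposition, which is exactly the situation described in the question.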

preprocessing EEG dataset in python to get better accuracy

I have an EEG dataset with 8 features, taken using an 8-channel EEG headset. Each row represents readings taken at 250 ms intervals. The values are all floating point, representing voltages in microvolts. If I plot individual features, I can see that they form a continuous wave. The target has 3 categories: 0, 1 and 2, and for a stretch of time the target doesn't change, because each sample spans multiple rows. I would appreciate any guidance on how to pre-process the dataset, since using it as-is gives me low accuracy (80%), whereas according to Wikipedia the P300 signal can be detected with 95% accuracy. Please note that I have almost zero knowledge of signal processing and analysing waveforms.
I did try making a 3D array where each row represented a single target and each feature's value was the list of values that originally spanned multiple rows, but I get an error saying the estimator expected an array with dim <= 2. I'm not sure this was the right approach; in any case it didn't work.
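Roughly, the shape problem looks like this (a sketch with made-up sizes and names):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical: 50 windows of 100 readings x 8 channels, one label per window
windows = np.random.randn(50, 100, 8)   # 3D: rejected by scikit-learn
targets = np.random.randint(0, 3, size=50)

# Estimators expect 2D (n_samples, n_features), so flatten each window
X = windows.reshape(50, -1)             # shape (50, 800)
clf = LogisticRegression(max_iter=1000).fit(X, targets)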
Here, have a look at my feature set:
-1.2198,-0.32769,-1.22,2.4115,0.057031,-2.6568,7.372,-0.2789
-1.4262,-4.19,-5.6546,-7.7161,-5.4359,-9.4553,-3.6705,-5.4851
-1.3152,-6.8708,-8.5599,-14.739,-9.1808,-14.268,-11.632,-8.929
-0.53987,-7.5156,-8.9646,-16.656,-10.119,-15.791,-14.616,-9.4095
Their corresponding targets:
0
0
0
0

How to compute a 95% confidence interval around a continuous signal?

I would like to compute and display in python a 95% CI around a continuous signal (voltage values as a function of time). This signal was recorded in the brain of 16 different subjects, and lasts 1300 ms. Sampling rate was 250 Hz (so one datapoint every 4 ms). How can I proceed?
Here's a pythonic example of continuous error bar plotting:
https://tonysyu.github.io/plotting-error-bars.html#.V1HmMPkrJhE. The example is for plotting errors, but just replace err with stdev.
Assuming that each sample is normally distributed across subjects, I would calculate 2 × the standard deviation (~95% of the distribution) for each sample (every 4 ms data point across all subjects); strictly, for a 95% confidence interval of the mean across the 16 subjects you would use 1.96 × the standard error of the mean (SD/√16). Each of these values gets stored in an array, and the array and data points can be fed into the example code.
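A minimal sketch of the whole computation, assuming the recordings sit in a (16, 325) array (16 subjects x 325 samples for 1300 ms at 250 Hz; the array here is a random placeholder):

import numpy as np
import matplotlib.pyplot as plt

n_subjects, n_samples = 16, 325                    # 1300 ms at 250 Hz
signals = np.random.randn(n_subjects, n_samples)   # placeholder for the real data
time_ms = np.arange(n_samples) * 4                 # one data point every 4 ms

mean = signals.mean(axis=0)
sem = signals.std(axis=0, ddof=1) / np.sqrt(n_subjects)  # standard error of the mean
ci95 = 1.96 * sem                                  # ~95% CI half-width per time point

plt.plot(time_ms, mean, color="k")
plt.fill_between(time_ms, mean - ci95, mean + ci95, alpha=0.3)
plt.xlabel("Time (ms)")
plt.ylabel("Voltage")
plt.show()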

Doing classification after an FFT

I have a spectrum and I computed its FFT, and I want to use this data for learning with scikit-learn. However, I don't know what to use as explanatory variables: the frequencies, the amplitudes, or the phases. It also seems there are specific methods for processing such data. If you have any ideas, thank you.
For example, here are measurements made on two species.
Measurements for species 1:
Frequency [Hz] Peak amplitude Phase [degrees]
117.122319744375 2806130.78600507 -79.781679752725
234.24463948875 1913786.60902507 17.7111789273704
351.366959233125 808519.710937228 116.444676921222
468.4892789775 122095.42475935 25.5770279979328
585.520239658112 607116.287067349 142.264887989957
702.642559402487 604818.747928879 -112.469849617122
819.764879146862 277750.38203791 -15.0000950192717
936.887198891237 118608.971696726 -74.5121366118222
1054.00951863561 344484.145698282 -6.21161038546633
1171.13183837999 327156.097365635 97.0304114077862
1288.25415812436 133294.989030519 -42.5375933954097
1405.37647786874 112216.937121264 78.5147573168857
1522.49879761311 231245.476714294 -25.4436913705878
1639.62111735749 201337.057689481 -24.3659638609968
1756.6520780381 77785.2190703514 29.0468023773855
1873.77439778247 103345.482912432 -13.8433556624336
1990.89671752685 164252.685204496 32.0091367478569
2108.01903727122 131507.600569796 3.20717282723705
2225.1413570156 62446.6053497028 17.6656168494324
2342.26367675998 92615.8137781526 -2.92386499550556
Measurements for species 2:
Frequency [Hz] Peak amplitude Phase [degrees]
117.122319744375 2786323.45338023 -78.5559125894388
234.24463948875 1915479.67743241 20.1586403367551
351.366959233125 830370.792189816 120.081294764269
468.4892789775 94486.3308071095 28.1762359863422
585.611598721875 590794.892175599 137.070646192436
702.642559402487 610017.558439343 -99.8603287979889
819.764879146862 300481.494163747 -7.0350571153689
936.887198891237 93989.1090623071 -52.6686900337389
1054.00951863561 332194.292343295 4.40278213901234
1171.13183837999 335166.932956212 92.5972261483014
1288.25415812436 154686.81104112 -64.5940556800747
1405.37647786874 91910.7647280088 82.3509804545009
1522.49879761311 223229.665336525 -64.4186985300827
1639.62111735749 211038.25587802 12.6057366375093
1756.74343710186 93456.4477333818 25.3398315513138
1873.77439778247 87937.8620001563 15.3447294063444
1990.89671752685 160213.112972346 7.41647669351739
2108.01903727122 141354.896010814 -48.4341201110724
2225.1413570156 69137.6327300227 39.9238718439715
2342.26367675998 82097.0663259956 -28.9291500313113
OP is asking how to classify this. I've explained it in comments and will break it down more here:
Each "species" measurement represents a row, or a sample. Each sample thus has 60 features (20 frequencies × 3 values).
This is a binary classification problem.
Re-cast the output of the FFT to give Freq1, Amp1, Phase1, ..., etc. as a numerical input set for a training algorithm.
Use something like a Support Vector Machine or Decision Tree Classifier out of scikit-learn and train over the dataset
Evaluate and measure accuracy
Caveats: 60 features over 1000 samples is potentially going to be quite hard to separate and liable to over-fitting, so OP needs to be careful. I haven't spent much time understanding the features themselves, but I suspect 20 of them are redundant (the frequencies always seem to be the same between samples).
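A minimal sketch of those steps, assuming each sample has already been flattened into one row of 60 values (the data below is a random placeholder):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder: n samples, each a row Freq1, Amp1, Phase1, ..., Freq20, Amp20, Phase20
X = np.random.randn(200, 60)
y = np.random.randint(0, 2, size=200)   # species label: 0 or 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Scaling matters for SVMs: the amplitudes dwarf the frequencies and phases
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

Dropping the (near-constant) frequency columns beforehand, as suggested above, would reduce the 60 features to 40.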
