Real-time anomaly detection from time series data - Python

I have a problem detecting anomalies in time series data.
I use an LSTM model to predict the value at the next time step as y_pred; the true value at the next time step is y_real, so I have er = |y_pred - y_real|. I compare er against threshold = alpha * std to flag anomalous data points. But sometimes our data is affected by admins or users; for example, the number of players of a game on Sunday will be higher than on Monday.
So, should I use another model to classify anomalous data points, or use if/else rules to classify them?

I think you are using a batch processing model (you aren't using any real-time processing frameworks or tools), so there shouldn't be any problem when you first build your model or classifier. The problem may occur a while after you build the model: after some time, your trained model is no longer valid.
I suggest some ways that may solve this problem:
- Use real-time or near-real-time processing (like Apache Spark, Flink, Storm, etc.).
- Check your data periodically for changes; if any change happens, retrain your model.
- Delete instances that you think may cause problems (the changed data may be anomalous itself), but first make sure that data is not important.
- Change your algorithm and use algorithms that are not very sensitive to such changes.
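For reference, a minimal sketch of the error-thresholding described in the question; alpha, and whether the standard deviation is computed over a rolling window or the full history, are tuning choices, not prescriptions:

```python
import numpy as np

def detect_anomalies(y_real, y_pred, alpha=3.0):
    # Prediction error between the true values and the LSTM forecasts.
    errors = np.abs(np.asarray(y_real) - np.asarray(y_pred))
    # Threshold as in the question: alpha times the std of the errors.
    threshold = alpha * errors.std()
    return errors > threshold  # boolean mask: True marks an anomaly
```

For the weekday/weekend effect, a simpler first step than a second classification model is to feed calendar features (e.g. day of week, holiday flags) into the LSTM itself, so the higher Sunday level becomes part of the expected value rather than an anomaly.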

Related

How to add future feature data to STS model - TensorFlow

I went through this case study of Structural Time Series Modeling in TensorFlow, but I couldn't find a way to add future values of features. I would like to add a holidays effect, but when I follow these steps my holidays start to repeat in the forecast period.
Below is a visualisation from the case study; you can see that temperature_effect starts from the beginning.
Is it possible to feed the model with actual future data?
Edit:
In my case, holidays started to repeat in my forecast, which does not make sense.
I have just found an issue on GitHub referring to this problem; there is a workaround for it.
There is a slight fallacy in what you are asking. As mentioned in my comment, when predicting with a model, future data does not exist because it just hasn't happened yet. For any model, it's not possible to feed data that does not exist. However, you could use an autoregressive approach, as defined in the link above, to feed 'future' data. A pseudo example would be as follows:
Model 1: STS model with inputs x_in and x_future to predict y_future.
You could stack this with a secondary helper model that predicts x_future from x_in:
Model 2: Regression model with input x_in predicting x_future.
Chaining these models will then allow your STS model to take 'future' feature elements into account. On the other hand, in your question you mention a holiday effect. You could simply add another input where you define, via some if/else case, whether the holiday effect is active or inactive. Random sampling of your holiday effect might also help. To help you with concrete code for what you want, I would need more details on your model/inputs/outputs.
In simple words, you can't work with data that doesn't exist so you either need to spoof it or get it in some other way.
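To make the two-stage idea concrete, here is a hedged sketch with plain linear models standing in for both the helper and the STS model; the variable names follow the pseudo example above and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_in = rng.normal(size=(100, 3))                    # observed features
x_future_true = x_in @ np.array([0.5, -0.2, 0.1])   # toy "future" feature
y_future = 2.0 * x_future_true + rng.normal(scale=0.1, size=100)

# Model 2: helper regression predicting the future feature from x_in.
helper = LinearRegression().fit(x_in, x_future_true)
x_future_hat = helper.predict(x_in)

# Model 1: the main model consumes x_in plus the *predicted* future feature.
main = LinearRegression().fit(np.column_stack([x_in, x_future_hat]), y_future)
```

The design point is simply that the main model never sees real future data, only the helper's forecast of it.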
The tutorial on forecasting is here: https://www.tensorflow.org/probability/examples/Structural_Time_Series_Modeling_Case_Studies_Atmospheric_CO2_and_Electricity_Demand
You only need to supply the new data and specify how many steps into the future to predict.

Best model to predict failure using time series from sensors

I'm working with a company on a project to develop ML models for predictive maintenance. The data we have is a collection of log files. In each log file we have time series from sensors (Temperature, Pressure, MotorSpeed, ...) and a variable in which we record the faults that occurred. The aim is to build a model that takes the log files (the time series) as input and predicts whether there will be a failure or not. I have some questions:
1) What is the best model capable of doing this?
2) What is the solution for dealing with imbalanced data? In fact, for some kinds of failures we don't have enough data.
I tried to construct an RNN classifier using LSTM after transforming the time series into sub-series of fixed length. The targets were 1 if there was a fault and 0 if not. The number of ones is negligible compared to the number of zeros, so the model always predicted 0. What is the solution?
Mohamed, for this problem you could actually start with traditional ML models (random forest, LightGBM, or anything of this nature). I recommend you focus on your features. For example, you mentioned Pressure and MotorSpeed: look at some window of time going back and calculate moving averages, min/max values in that window, and the standard deviation. To tackle this problem you will need a strong set of features. Take a look at the featuretools package; you can either use it directly or borrow ideas for features that can be created from time series data, as in the sketch below.
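A minimal sketch of such window features with pandas; the column names and window length are placeholders for your actual sensors and sampling rate:

```python
import pandas as pd

# Toy sensor log; replace with your real log-file data.
df = pd.DataFrame({"temperature": [70, 71, 75, 90, 88, 72],
                   "pressure":    [1.0, 1.1, 1.0, 1.4, 1.3, 1.1]})
window = 3  # assumed window length; tune to your sampling rate

features = pd.DataFrame({
    "temp_mean": df["temperature"].rolling(window).mean(),
    "temp_std":  df["temperature"].rolling(window).std(),
    "temp_min":  df["temperature"].rolling(window).min(),
    "temp_max":  df["temperature"].rolling(window).max(),
    "pres_mean": df["pressure"].rolling(window).mean(),
})
```

Back to your questions.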
1) What is the best model capable of doing this? Traditional ML methods as mentioned above. You could also use deep learning models, but I would first start with easy models. Also if you do not have a lot of data I probably would not touch RNN models.
2) What is the solution to deal with imbalanced data? You may want to oversample or undersample your data. For oversampling, look at SMOTE (in the imbalanced-learn package).
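A minimal sketch of SMOTE oversampling, with synthetic data standing in for your features and 0/1 fault labels:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # e.g. the window features above
y = np.zeros(1000, dtype=int)
y[:20] = 1                            # ~2% failures: heavily imbalanced

# Synthesize new minority-class samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))
```

An alternative worth trying first with the traditional models is class_weight='balanced', which reweights errors without synthesizing samples.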
Good luck

User Activity Prediction in Python

I am struggling with a standard ML problem.
I'm trying to build a service which predicts the next time a user will send a message on a platform. For this I'm using a historical dataset of the user's messages, structured as an array of timestamps. For example:
[2019-05-23 18:28:34.741413, 2019-05-23 18:45:39.643218, 2019-05-23 23:26:44.767524]
What is the best way of predicting the next timestamp in this series on when the user will be online?
Currently I am creating a dataframe in Python to feed into a Keras Sequential() model, but I need a y value (target) to do this.
Thanks for your ideas on how to handle this.
As a first attempt, I would predict the time duration until the next timestamp (regression, not classification). Probably even better would be to predict the logarithm of that duration instead, because it is more important to get 2 min vs 3 min right than to focus on 500 min vs 510 min.
As inputs you could use the logarithm of the time since the last timestamp, maybe a couple of the previous gaps, the logarithm of the last message's length, or some general user stats.
But ideally, you'd have the neural network predict the parameters of a probability distribution, such that it can give you an answer like "probably within the next 30 minutes, certainly not after midnight, but possibly after 7am", and then you can measure this prediction against the empirical distribution (e.g. cross-entropy loss). But this is probably a bit too involved for getting started.
If you only want to predict a single timestamp (and not a distribution) then in theory you'd have to define an appropriate loss, and make a decision about which errors are how bad for your application, and then train a model that optimizes this loss.
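To make the missing y concrete: the target can be built as the (log) gap until the next message. A minimal sketch using the timestamps from the question (pandas assumed):

```python
import numpy as np
import pandas as pd

# The timestamps from the question.
ts = pd.to_datetime(pd.Series([
    "2019-05-23 18:28:34.741413",
    "2019-05-23 18:45:39.643218",
    "2019-05-23 23:26:44.767524",
]))

gaps = ts.diff().dropna().dt.total_seconds()  # seconds between messages
log_gaps = np.log(gaps)

# Target y: log of the *next* gap; feature X: log of the previous gap.
y = log_gaps.iloc[1:].to_numpy()
X = log_gaps.iloc[:-1].to_numpy().reshape(-1, 1)
```

At prediction time, exponentiate the model's output and add it to the last observed timestamp to get the expected next-online time.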

Where does a machine learning algorithm store the result?

I think this is kind of "blasphemy" for someone who comes from the AI world, but since I come from the world where we program and get a result, and there is the concept of storing something in memory, here is my question:
Machine learning works by iterations; the more iterations there are, the better our algorithm becomes. But after those iterations, is a result stored somewhere? If I think as a programmer, when I re-run the program I must store previous results somewhere, or they will be overwritten; or I need to use an array, for example, to store my results.
For example, if I train my image recognition algorithm on a data set of cat pictures, what do I need to add to my algorithm so that, when I use it on an image library, it succeeds every time it finds a cat? What will I use, since nothing is saved for the next step?
All the videos and tutorials I have seen only draw a graph to visualize decision making, and never apply something that can be used in a future program.
For example, in this example kNN is used to detect handwritten digits, but where is the explicit value to use?
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/2_BasicModels/nearest_neighbor.py
NB: to the people voting to close or downvoting, please at least give a reason.
the more iterations there are, the better our algorithm becomes, but after those iterations, is a result stored somewhere
What you're alluding to here is the optimization part.
However to optimize a model, we first have to represent it.
For example, if I'm creating a very simple linear model to predict a house's price from its surface in square meters, I might go for this model:
price = a * surface + b
That's the representation.
Now that you have represented the model, you want to optimize it, i.e. find the params a and b that minimize the prediction error.
is a result stored somewhere?
In the above, we say that we have learned the params or weights a and b.
That's what you keep: the weights, which come from optimization (also called training), and of course the model itself.
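A minimal sketch of this learn-then-store cycle for the toy price model above, using scikit-learn and joblib; the data and file name are illustrative:

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

surface = np.array([[30.0], [50.0], [80.0], [120.0]])   # square meters
price = np.array([90_000, 150_000, 235_000, 350_000])   # toy prices

model = LinearRegression().fit(surface, price)
print(model.coef_[0], model.intercept_)   # the learned a and b

joblib.dump(model, "house_price_model.joblib")          # persist the weights
model_again = joblib.load("house_price_model.joblib")   # reuse them later
```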
I think there is some confusion. Let's clear it up.
Machine Learning models usually have parameters, and these parameters are trainable. This means a training algorithm finds the "right" values of these parameters so the model works properly for a given task.
This is the learning part. The actual parameter values are "inferred" from training data.
What you would call the result of the training process is a model. The model is represented by formulas with parameters, and these parameters must be stored. Typically when you use a ML/DL framework (like scikit-learn or Keras), the parameters are stored alongside some information about the type of model, so it can be reconstructed at runtime.
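For the deep learning case, a hedged sketch of saving and reloading a Keras model, assuming a recent TensorFlow; the architecture and file name are illustrative only:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

model.save("cat_detector.keras")   # stores weights *and* architecture
restored = tf.keras.models.load_model("cat_detector.keras")
```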

Time series anomaly detection: steps, models, tools

I will have a device which measures, for example, temperature values at specific time intervals.
I want to train a model to understand which values do not belong to the "normal" values and, if so, raise an alert. So, I want a time series anomaly detection model.
At first, I thought of using a clustering model (k-means, hierarchical). So, in the beginning, I will have many alerts; later, some clusters will be created and hopefully I will have a good model!
But, since I don't have any experience with this, I want to ask whether this approach is right, what other approaches exist, and what kind of tools I should use (either Python or R).
I have read many links and a few papers, and I can see that some people advise against using k-means.
Also, I am not sure how (or whether) to use Dynamic Time Warping clustering.
I read a paper, "Clustering of Time Series Subsequences Is Meaningless", which states that under some conditions applying clustering is meaningless.
I also saw the tsoutliers package and Twitter's anomaly detection package, but as I said, I am not sure which approach/tools I should use.
I am also trying to get anomaly detection working. I started with an ARMA model, but it produced a lot of false positives, so now I am trying a PCA-based approach: the idea is to approximate the signal using PCA components. I got this idea through this link. Hopefully this helps you; I will update with a better answer when I have implemented it fully.
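A hedged sketch of that reconstruction-error idea: project sliding windows of the series onto a few principal components and flag windows that reconstruct poorly. The window length and component count are assumptions to tune:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_anomaly_scores(series, window=24, n_components=3):
    # Turn the 1-D series into overlapping fixed-length windows.
    X = np.lib.stride_tricks.sliding_window_view(series, window)
    # Fit a low-rank PCA model and reconstruct each window from it.
    pca = PCA(n_components=n_components).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    # Windows the low-rank model reconstructs poorly are anomaly candidates.
    return np.linalg.norm(X - X_hat, axis=1)

# Toy daily-cycle signal with one injected spike.
t = np.arange(1000)
series = np.sin(2 * np.pi * t / 24)
series += 0.05 * np.random.default_rng(0).normal(size=1000)
series[500] += 3.0

scores = pca_anomaly_scores(series)
print(scores.argmax())  # the highest-scoring window covers the spike
```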
