I am struggling with a standard ML problem.
I'm trying to build a service that predicts the next time a user will send a message on a platform. For this I'm using a historical dataset of the user's messages, structured as an array of timestamps. For example:
[2019-05-23 18:28:34.741413, 2019-05-23 18:45:39.643218, 2019-05-23 23:26:44.767524]
What is the best way to predict the next timestamp in this series, i.e. when the user will next be online?
Currently I am creating a dataframe in Python to feed into a Keras Sequential() model, but I need a y value to do this.
Thanks for your ideas on how to handle this.
As a first attempt, I would predict the time duration until the next timestamp. (Regression, not classification.) Probably even better would be to predict the logarithm of that duration instead, because it is more important to get 2 min vs 3 min right than to focus on 500 min vs 510 min.
As inputs you could use the logarithm of the time since the last timestamp, maybe a couple of the previous gaps, the logarithm of the last message length, or some general user stats.
But ideally, you'd have the neural network predict the parameters of a probability distribution, such that it can give you an answer like "probably within the next 30 minutes, certainly not after midnight, but possibly after 7am", and then you can measure this prediction against the empirical distribution (e.g. cross-entropy loss). But this is probably a bit too involved for getting started.
If you only want to predict a single timestamp (and not a distribution), then in theory you'd have to define an appropriate loss, decide how costly different errors are for your application, and then train a model that optimizes this loss.
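To make the missing y value concrete, here is a minimal sketch of that setup: the target is the log of the gap until the next message, and a small Keras regressor is fit on the previous gaps. The synthetic timestamps, window size, and layer sizes below are arbitrary placeholders, not a tuned model.

```python
import numpy as np
import pandas as pd
from tensorflow import keras

# Synthetic message times stand in for your real timestamp array.
rng = np.random.default_rng(0)
start = pd.Timestamp("2019-05-23 18:28:34")
timestamps = start + pd.to_timedelta(np.cumsum(rng.exponential(3600, size=500)), unit="s")

# Gaps between consecutive messages in seconds, log-transformed: this is the y value.
gaps = np.diff(timestamps.astype("int64")) / 1e9
log_gaps = np.log1p(gaps)

# Features: the last `window` log-gaps; target: the next log-gap.
window = 5
X = np.array([log_gaps[i:i + window] for i in range(len(log_gaps) - window)])
y = log_gaps[window:]

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(window,)),
    keras.layers.Dense(1),  # predicted log-duration until the next message
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, verbose=0)

# Turn a prediction back into an actual timestamp.
pred_seconds = float(np.expm1(model.predict(X[-1:], verbose=0)[0, 0]))
next_message_time = timestamps[-1] + pd.Timedelta(seconds=pred_seconds)
print(next_message_time)
```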
I went through this case study of Structural Time Series Modeling in TensorFlow, but I couldn't find a way to add future values of features. I would like to add a holiday effect, but when I follow these steps my holiday effect starts to repeat in the forecast period.
Below is a visualisation from the case study; you can see that temperature_effect starts from the beginning.
Is it possible to feed the model with actual future data?
Edit:
In my case the holiday effect started to repeat in my forecast, which does not make sense.
I have just now found a GitHub issue referring to this problem; it describes a workaround.
There is a slight fallacy in what you are asking. As mentioned in my comment, when predicting with a model, future data does not exist because it just hasn't happened yet. For any model it's not possible to feed data that does not exist. However, you could use an autoregressive approach as described in the link above to feed 'future' data. A pseudo-example would be as follows:
Model 1: STS model with inputs x_in and x_future to predict y_future.
You could stack this with a secondary helper model that predicts x_future from x_in.
Model 2: Regression model with input x_in predicting x_future.
Concatenating these models will then allow your STS model to take 'future' feature elements into account. On the other hand, in your question you mention a holiday effect. You could simply add another input where you define via some if/else case whether a holiday effect is active or inactive. You could also try random sampling of your holiday effect; it might help. To help you with exact code or a model that does what you want, I'd need more details on your model/inputs/outputs.
In simple words, you can't work with data that doesn't exist so you either need to spoof it or get it in some other way.
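For the holiday effect specifically, here is a rough sketch of the if/else idea: build the 0/1 indicator over both the observed period and the forecast horizon, so the regression component has its true future values instead of repeating past ones. The dates and holiday list below are made up, and passing the full-length design matrix to the STS regression component is the workaround referenced in the question's edit.

```python
import numpy as np
import pandas as pd

history = pd.date_range("2019-01-01", "2019-12-31", freq="D")   # observed period
horizon = pd.date_range("2020-01-01", "2020-03-31", freq="D")   # forecast period
full_index = history.append(horizon)

holidays = pd.to_datetime(["2019-12-25", "2020-01-01"])          # hypothetical holidays

# if/else style indicator: 1.0 when the date is a holiday, else 0.0
holiday_effect = np.where(full_index.isin(holidays), 1.0, 0.0)

# One row per time step over history + horizon; a regression component
# (e.g. tfp.sts.LinearRegression) built on this full-length matrix then
# uses the actual future holiday values during forecasting.
design_matrix = holiday_effect.reshape(-1, 1).astype(np.float32)
print(design_matrix.shape)  # (len(history) + len(horizon), 1)
```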
The tutorial on forecasting is here: https://www.tensorflow.org/probability/examples/Structural_Time_Series_Modeling_Case_Studies_Atmospheric_CO2_and_Electricity_Demand
You only need to pass in the new data and set the parameter for how many steps into the future to predict.
I'm working with a company on a project to develop ML models for predictive maintenance. The data we have is a collection of log files. In each log file we have time series from sensors (Temperature, Pressure, MotorSpeed, ...) and a variable in which we record the faults that occurred. The aim is to build a model that uses the log files (the time series) as its input and predicts whether there will be a failure or not. For this I have some questions:
1) What is the best model capable of doing this?
2) What is the solution to deal with imbalanced data? In fact, for some kinds of failures we don't have enough data.
I tried to construct an RNN classifier using LSTM after transforming the time series into sub-series of a fixed length. The targets were 1 if there was a fault and 0 if not. The number of ones is negligible compared to the number of zeros. As a result, the model always predicted 0. What is the solution?
Mohamed, for this problem you could actually start with traditional ML models (random forest, LightGBM, or anything of this nature). I recommend you focus on your features. For example, you mentioned Pressure and MotorSpeed. Look at some window of time going back and calculate moving averages, min/max values in that same window, and standard deviations. To tackle this problem you will need a healthy set of features. Take a look at the featuretools package; you can either use it or get some ideas about what features can be created from time series data. Back to your questions.
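For example, window features for a few sensor columns could look like the sketch below; the synthetic data, column names, and the 60-sample window are placeholders for whatever your log files contain.

```python
import numpy as np
import pandas as pd

# Synthetic sensor readings stand in for one parsed log file.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Temperature": rng.normal(70, 5, 1000),
    "Pressure": rng.normal(30, 2, 1000),
    "MotorSpeed": rng.normal(1500, 100, 1000),
})

window = 60  # look-back window; tune to your sampling rate
for col in ["Temperature", "Pressure", "MotorSpeed"]:
    rolled = df[col].rolling(window, min_periods=1)
    df[f"{col}_mean"] = rolled.mean()   # moving average
    df[f"{col}_min"] = rolled.min()     # min in the window
    df[f"{col}_max"] = rolled.max()     # max in the window
    df[f"{col}_std"] = rolled.std()     # standard deviation in the window
```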
1) What is the best model capable of doing this? Traditional ML methods as mentioned above. You could also use deep learning models, but I would first start with easy models. Also if you do not have a lot of data I probably would not touch RNN models.
2) What is the solution to deal with imbalanced data? You may want to oversample or undersample your data. For oversampling, look at SMOTE (available in the imbalanced-learn package).
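A minimal oversampling sketch with SMOTE from imbalanced-learn might look like this; X and y below are random placeholders for your windowed features and 0/1 fault labels.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                # placeholder features
y = (rng.random(1000) < 0.03).astype(int)     # ~3% faults: heavily imbalanced

# Oversample the minority class, then fit a traditional ML model.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_res, y_res)
```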
Good luck
Using the following guide, I've made a scikit-learn regression model for doing time series forecasting. I'm able to use the model to get predictions on a set of test data where I have the timestamps as well as the independent variable data, since the model just takes those variables and gives the output labels as predictions.
However, I'm not sure how, or even if I can use this model to do out of sample predictions, where I only have a future timestamp and none of the independent variable data that goes with it. Is there some sort of recursive method where the model can use data from a test set, make a prediction, then use the prediction and the data to make the next prediction, etc.? Thanks!
Yes, but it depends on whether you want to do single-step or multi-step forecasts.
For single-step forecasts, as you describe, use the last available window of your data as input to the prediction function; this returns the forecast for the first step ahead.
For multi-step forecasts, you have three options:
Direct: Fit one regressor for each step ahead and let each fitted regressor make a prediction with the last available window,
Recursive: Use the last available window to make the first-step prediction, then use that prediction to roll the window and predict again (see the sketch after this list).
DirRec: A combination of the above strategies where, instead of rolling the window, you expand it with the previously predicted value; note, however, that this requires fitting the regressors accordingly.
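A rough sketch of the recursive strategy with a generic scikit-learn regressor; the toy series, window length, and estimator are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 300)) + rng.normal(0, 0.1, 300)  # toy series

# Turn the series into (window -> next value) training pairs.
window = 10
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
model = LinearRegression().fit(X, y)

# Recursive multi-step forecast: predict one step, append it, predict again.
history = list(series[-window:])
forecast = []
for _ in range(20):  # 20 steps ahead
    step = model.predict(np.array(history[-window:]).reshape(1, -1))[0]
    forecast.append(step)
    history.append(step)
```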
You can find more details in:
Bontempi, Gianluca, Souhaib Ben Taieb, and Yann-Aël Le Borgne. "Machine learning strategies for time series forecasting." European Business Intelligence Summer School. Springer, Berlin, Heidelberg, 2012.
Also note that you have to be careful to evaluate your model appropriately. The train and test sets are not independent in this setting, as they represent measurements of the same variable at subsequent time points, so you have to account for the potential autocorrelation.
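One common way to do that is a time-ordered split, e.g. scikit-learn's TimeSeriesSplit, where every test fold comes strictly after its training fold; a minimal sketch on the same kind of toy windowed data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 300)) + rng.normal(0, 0.1, 300)
window = 10
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

cv = TimeSeriesSplit(n_splits=5)  # test folds always follow their train folds
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")
print(scores.mean())
```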
I'm sorry for the poorly worded title, I'm just not sure how to phrase my question. Explaining my problem is probably easier. Also, I'm quite new to this, having only taken a few Udemy courses in Python and no professional experience in the field.
TL;DR: I want to create a model that will predict how many minutes a flight will be delayed, or whether it'll be cancelled.
I'm working with the 2015 Flight Delays dataset on Kaggle, and I'm combining this data with a site I found that gives historical weather data for airports. I'm trying to combine those two sets of data to build a better delay predictor.
My hypothesis is that bad weather is the primary cause of delays, but really bad weather is the primary cause of cancellations.
My first thought at attacking this is that I could create a linear model that takes ceiling, visibility, and wind as continuous independent variables, plus origin airport, destination airport, and precipitation intensity as categorical independent variables, and outputs the arrival delay as the dependent variable. However, this doesn't account for flight cancellations.
In the dataset, a cancelled flight has a NaN for its departure and arrival times, and then a one-hot encoded column for Cancelled. I could just edit the dataset and do something like making the arrival delay 999 if the Cancelled column is 1, but that screws with the linear model.
Is it possible for my model to output either a numeric (continuous) value for how many minutes a flight will be delayed, or if it'll be cancelled altogether (categorical)?
If not, can I somehow stack models? I.e., make something like a random forest model that only predicts "Cancelled" or "Not Cancelled", and feed the "Not Cancelled" flights into a linear model that predicts the number of minutes a flight will be delayed?
Slightly unrelated question: I don't think there is a linear relationship between ceiling/visibility/winds and the delay, but I do think there is maybe a logarithmic relationship. For example, there is a massive difference between a visibility of 0.5, 1.0, and 1.5 miles, but there is zero difference between a visibility of 8 and 10 miles. Same for winds and ceiling. What sort of model is available to me in scikit-learn that accounts for this?
There are a lot of questions here, but let me answer the one related to the stacked machine learning models.
The answer is yes: you can train a neural network which solves both tasks. That is called multitask learning. I only know implementations of multitask learning with deep neural networks, and it's not hard to implement. The most important part is the output layer; in this case you would have an output layer with two units: one sigmoid output unit and one ReLU output unit (if you are planning to predict only positive delays).
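A minimal sketch of that output layer with the Keras functional API; the feature count, layer sizes, loss weights, and random training data below are placeholders, not tuned values.

```python
import numpy as np
from tensorflow import keras

n_features = 6  # e.g. ceiling, visibility, wind, plus encoded airports/precipitation

inputs = keras.Input(shape=(n_features,))
hidden = keras.layers.Dense(64, activation="relu")(inputs)
hidden = keras.layers.Dense(64, activation="relu")(hidden)

# Two heads: probability of cancellation, and a non-negative delay in minutes.
cancelled = keras.layers.Dense(1, activation="sigmoid", name="cancelled")(hidden)
delay = keras.layers.Dense(1, activation="relu", name="delay")(hidden)

model = keras.Model(inputs, [cancelled, delay])
model.compile(
    optimizer="adam",
    loss={"cancelled": "binary_crossentropy", "delay": "mse"},
    loss_weights={"cancelled": 1.0, "delay": 0.01},  # balance the two losses
)

# Placeholder training data: y is a dict with one array per output head.
X = np.random.rand(256, n_features)
y = {"cancelled": np.random.randint(0, 2, size=(256, 1)),
     "delay": np.random.rand(256, 1) * 120}
model.fit(X, y, epochs=2, verbose=0)
```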
Another solution to your problem is the one you proposed: just set the delay to a big number and add a condition in your code like "if the delay is longer than n days, then the flight was cancelled".
And the last solution I see is implementing two models, e.g. two neural networks: one which determines whether the flight was cancelled and one which determines the delay time. If the first one predicts "cancelled", then don't use the second one.
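And a rough sketch of that stacked alternative, using scikit-learn estimators as stand-ins for whichever two models you choose; all data below is random placeholder data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 6))                          # placeholder weather/airport features
cancelled = (rng.random(1000) < 0.05).astype(int)  # ~5% cancellations
delay = np.where(cancelled == 1, np.nan, rng.random(1000) * 120)

# Model 1: was the flight cancelled? Model 2: delay, trained on non-cancelled flights only.
clf = RandomForestClassifier(random_state=0).fit(X, cancelled)
mask = cancelled == 0
reg = RandomForestRegressor(random_state=0).fit(X[mask], delay[mask])

def predict_flight(x):
    """Return 'cancelled' or the predicted delay in minutes for one feature row."""
    if clf.predict(x.reshape(1, -1))[0] == 1:
        return "cancelled"
    return float(reg.predict(x.reshape(1, -1))[0])

print(predict_flight(X[0]))
```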
Please feel free to ask if anything is unclear.
I have a problem with detecting anomalies in time series data.
I use an LSTM model to predict the value at the next time step as y_pred; the true value at the next time step is y_real, so I have er = |y_pred - y_real|. I compare er with threshold = alpha * std to flag anomalous data points. But sometimes our data is affected by admins or users; for example, the number of players of a game on Sunday will be higher than on Monday.
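In sketch form, the thresholding I mean looks roughly like this (placeholder arrays instead of the real LSTM output):

```python
import numpy as np

# y_pred would come from the LSTM; both arrays are placeholders here.
y_pred = np.array([10.2, 11.0, 9.8, 10.5, 30.0])
y_real = np.array([10.0, 10.9, 10.1, 10.4, 55.0])

er = np.abs(y_pred - y_real)   # per-step prediction error
alpha = 2.0
threshold = alpha * er.std()   # threshold = alpha * std
anomalies = np.where(er > threshold)[0]
print(anomalies)               # [4]
```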
So, should I use another model to classify anomalous data points, or use if/else rules to classify them?
I think you are using a batch processing model (you aren't using any real-time processing frameworks or tools), so there shouldn't be any problem when building your model or doing classification. The problem may occur a while after you build the model; after that time your trained model is no longer valid.
I suggest some ways that may solve this problem:
Use real-time or near-real-time processing (like Apache Spark, Flink, Storm, etc.).
Use some conditions to check your data periodically for changes; if a change happens, train your model again (see the sketch after this list).
Delete instances that you think may cause problems (that changed data may be an anomaly itself), but first make sure the data is not important.
Change your algorithm and use algorithms that are not very sensitive to such changes.
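For the periodic check mentioned in the list above, a rough sketch of one possible drift test; the statistics, threshold, and retraining hook are placeholders for whatever your pipeline uses.

```python
import numpy as np

def needs_retrain(train_values, recent_values, max_shift=2.0):
    """Flag drift when the recent mean moves more than `max_shift`
    training standard deviations away from the training mean."""
    baseline_mean = np.mean(train_values)
    baseline_std = np.std(train_values) + 1e-9
    shift = abs(np.mean(recent_values) - baseline_mean) / baseline_std
    return shift > max_shift

rng = np.random.default_rng(0)
train_values = rng.normal(100, 10, size=5000)   # data the model was fit on
recent_values = rng.normal(160, 10, size=200)   # e.g. a spike of Sunday players

if needs_retrain(train_values, recent_values):
    print("Distribution shifted: retrain the LSTM on the latest data.")
```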