I don't have much experience with pycaret, but I'm figuring things out slowly but surely. I'm trying to figure out how to have a model predict a value outside the data I give it, not just within it. For example, I have data ordered by date from 2020-9-30 to 2020-10-30. Once I provide the model with this data, it predicts values from 2020-9-30 to 2020-10-30, but I want it to predict 2020-10-31. I can't figure out how to make that work. If anyone knows how to solve this problem, please help.
P.S. If that was confusing, please ask me to clarify 👍 Thanks!
This is what I'm using to predict the values; asset is a 250x7 DataFrame. When the code runs, it prints a 250x2 DataFrame that is an exact copy of the asset data, but I want it to print one new value in the Label column.
import pandas as pd  # predict_model comes from the pycaret module used for training

modelpredict = predict_model(final_bestmodel, data=asset)  # asset is the 250x7 DataFrame
mpdf = pd.DataFrame(modelpredict, columns=['Date', 'Label'])
mpdf
If you are doing classification or regression, you can use predict_model(best_model, data=data_unseen) and pass the new data to the data argument. For example, if you want to predict the value for 2020-10-31, you need to provide the model with all of the variables you used in the training phase. You can find documentation here and a tutorial here.
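For example, here is a minimal sketch with pycaret's regression module (everything below is synthetic stand-in data, not the asker's asset frame; the point is that the unseen frame carries the same feature columns used in training):

import numpy as np
import pandas as pd
from pycaret.regression import setup, compare_models, finalize_model, predict_model

# synthetic stand-in for a real feature frame with a 'Label' target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
train_df = pd.DataFrame({'feat1': X[:, 0], 'feat2': X[:, 1],
                         'Label': 3 * X[:, 0] - X[:, 1]})

setup(data=train_df, target='Label', session_id=123)
best = compare_models()
final_best = finalize_model(best)

# one unseen row (e.g. the feature values for 2020-10-31) -> one new prediction
unseen_df = pd.DataFrame({'feat1': [0.5], 'feat2': [-0.2]})
print(predict_model(final_best, data=unseen_df))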
If you are doing forecasting, you can use predict_model(best_model, fh=24) and pass the number of forecast horizons to the fh argument; don't forget to finalize the model before the final prediction. It will forecast fh steps beyond the latest date in the training data. You can find documentation here and a tutorial here.
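On a recent pycaret version (3.x), a rough sketch with the time series module would look like this (the series below is a synthetic daily stand-in for a price column):

import numpy as np
import pandas as pd
from pycaret.time_series import setup, compare_models, finalize_model, predict_model

# synthetic daily series; the real input would be the price column with a date index
idx = pd.period_range('2020-01-01', periods=250, freq='D')
y = pd.Series(100 + np.cumsum(np.random.randn(len(idx))), index=idx, name='Close')

setup(data=y, fh=7, session_id=123)   # hold out the last 7 points for validation
best = compare_models()
final_best = finalize_model(best)     # refit on the full series

# forecast 1 step past the last observed date
print(predict_model(final_best, fh=1))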
I was recently trying to predict stock prices using an LSTM (seems really overdone, I know), but when I predict on data that is outside the dataset, I get a really weird graph that I don't think is correct. Am I doing something wrong?
Prediction: https://github.com/Alpheron/StockPred/blob/master/predictions/MSFT-5-Year-LSTM.ipynb
Training: https://github.com/Alpheron/StockPred/blob/master/MSFT-5-Year-LSTM.ipynb
To process the data I used a lookback window of 60 points, so when predicting on data that is outside of the dataset I would need the last 60 points as well. Am I doing something wrong with the way I am predicting?
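For reference, a rolling forecast past the end of the dataset usually looks roughly like this (a simplified sketch, not the notebook's exact code; model, scaler, and scaled_prices stand for a trained Keras LSTM, a fitted MinMaxScaler, and the scaled 1-D price array):

import numpy as np

lookback = 60
window = scaled_prices[-lookback:].reshape(1, lookback, 1)  # last 60 scaled points

future = []
for _ in range(30):                       # predict 30 steps past the dataset
    next_scaled = model.predict(window, verbose=0)[0, 0]
    future.append(next_scaled)
    # slide the window: drop the oldest point, append the new prediction
    window = np.append(window[:, 1:, :], [[[next_scaled]]], axis=1)

future_prices = scaler.inverse_transform(np.array(future).reshape(-1, 1))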
I apologize in advance if I say something nonsensical; I have zero experience in this field. Unfortunately my teacher had the brilliant idea of explaining time series forecasting in only 3 hours (and I'm a computer engineering student, not a data science student, so this is a totally new topic for me!), and I didn't understand much of it.
Context: in my project I generate a metric in Spring Boot with Micrometer (a counter that counts the number of purchases), I use Prometheus to pull this metric, and I generate a .csv file with Grafana. Then I open this file with Python and create the time series using a pandas DataFrame.
This is my time series:
[plot of the time series]
My teacher explained that the first thing to do is to apply the seasonal_decompose filter and extract the trend of the series:
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(ts, model='...', period=...)  # ts is the series above
trend = result.trend
And here are my first questions: for model, should I pass "additive" or "multiplicative"? And what period should I use? My series is aperiodic! In class we were only shown the case of a "nice" series with obvious seasonality, and my series is very different from that example.
Then my teacher said we have to apply a model. In particular, he advised applying ARIMA (or Holt-Winters), like this:
from pmdarima import auto_arima
from statsmodels.tsa.arima_model import ARIMA  # older statsmodels API; newer versions use statsmodels.tsa.arima.model.ARIMA

auto_arima(trend)  # example output: (3, 2, 5)

train_data = trend.iloc[:600]  # 90% of samples
test_data = trend.iloc[600:]   # 10% of samples

model = ARIMA(train_data, order=(3, 2, 5))  # training
results = model.fit()
start = len(train_data)
end = len(train_data) + len(test_data) - 1
predictions = results.predict(start=start, end=end, dynamic=False, typ='levels')

model = ARIMA(trend, order=(3, 2, 5))  # refit on the full series
results = model.fit()
fcast = results.predict(len(trend), len(trend) + 72, typ='levels')  # forecast 72 samples ahead
But I wonder: can I use ARIMA for my specific series, or should I use another model?
He also explained other concepts to us (stationarity, differencing, autocorrelation, etc.), but all of this was covered quickly and in a very short time, and I have a lot of confusion in my head. Having never studied machine learning (I'll do that next semester), I can't easily understand the documentation I find online.
Sorry for my ignorance on the subject, I'm a bit desperate.
No worries, time series analysis is not easy, but I suggest you review your course materials first. The ARIMA model is the basic model for time series analysis. You should follow the guidance on how to choose its key arguments, the p, d, and q orders, according to your series data.
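For example, a small hedged sketch of letting pmdarima search for the (p, d, q) order instead of choosing it by hand (ts below is a synthetic stand-in for the real series):

import numpy as np
import pandas as pd
from pmdarima import auto_arima
from statsmodels.tsa.arima.model import ARIMA

ts = pd.Series(np.random.randn(200).cumsum())       # stand-in for the real series

search = auto_arima(ts, seasonal=False, stepwise=True, suppress_warnings=True)
print(search.order)                                  # e.g. (3, 2, 5) as in the question

model = ARIMA(ts, order=search.order).fit()
print(model.forecast(steps=10))                      # 10 steps past the end of the series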
I went through this case study of Structural Time Series Modeling in TensorFlow, but I couldn't find a way to add future values of features. I would like to add a holiday effect, but when I follow these steps the holidays start to repeat in the forecast period.
Below is a visualisation from the case study; you can see that temperature_effect starts from the beginning.
Is it possible to feed the model with actual future data?
Edit: in my case the holidays started to repeat in the forecast, which does not make sense. I have just found a GitHub issue referring to this problem; there is a workaround described there.
There is a slight fallacy in what you are asking. As mentioned in my comment, when predicting with a model, future data does not exist because it just hasn't happened yet. No model can be fed data that does not exist. However, you could use an autoregressive approach, as described in the link above, to feed 'future' data. A pseudo-example would be as follows:
Model 1: STS model with inputs x_in and x_future to predict y_future.
You could stack this with a secondary helper model that predicts x_future from x_in.
Model 2: Regression model with input x_in predicting x_future.
Chaining these models will then allow your STS model to take 'future' feature elements into account. On the other hand, in your question you mention a holiday effect. You could simply add another input where you define, via some if/else case, whether the holiday effect is active or inactive. You could also use random sampling of your holiday effect, which might help. To help you with exact code for what you want, I'd need more details on your model/inputs/outputs.
In simple words, you can't work with data that doesn't exist, so you either need to spoof it or get it in some other way.
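To make the stacked-models idea concrete, here is a rough sketch (hedged: it is not the case study's code, all data in it is synthetic, and a plain sklearn regression plays the role of Model 2). The key point is that the guessed future covariate values are appended to the design matrix, so the STS LinearRegression component has rows covering the forecast period:

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from sklearn.linear_model import LinearRegression

num_obs, num_forecast = 200, 30
t = np.arange(num_obs + num_forecast)
temperature = 10 + 5 * np.sin(2 * np.pi * t / 50)          # synthetic covariate
y = (0.3 * temperature[:num_obs] + np.random.randn(num_obs) * 0.1).astype(np.float32)

# Model 2: predict the covariate's future values from a simple time trend
helper = LinearRegression().fit(t[:num_obs, None], temperature[:num_obs])
temp_future = helper.predict(t[num_obs:, None])

# design matrix covering BOTH the observed and the forecast period
design = np.concatenate([temperature[:num_obs], temp_future]).astype(np.float32)[:, None]

trend = tfp.sts.LocalLinearTrend(observed_time_series=y)
regression = tfp.sts.LinearRegression(design_matrix=design)
model = tfp.sts.Sum([trend, regression], observed_time_series=y)

surrogate = tfp.sts.build_factored_surrogate_posterior(model)
tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=model.joint_distribution(observed_time_series=y).log_prob,
    surrogate_posterior=surrogate,
    optimizer=tf.optimizers.Adam(0.1),
    num_steps=200)

forecast_dist = tfp.sts.forecast(model, observed_time_series=y,
                                 parameter_samples=surrogate.sample(50),
                                 num_steps_forecast=num_forecast)
print(forecast_dist.mean().numpy().squeeze())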
The tutorial on forecasting is here: https://www.tensorflow.org/probability/examples/Structural_Time_Series_Modeling_Case_Studies_Atmospheric_CO2_and_Electricity_Demand
You only need to pass in the new data and specify how many steps into the future you want to predict.
I've been scouring the net for something like this, but I can't quite figure it out.
Here is my data. I am trying to predict 'Close' using both the time series data from 'Close' and the time series data from 'Posts'. I've tried looking into documentation on SARIMA, auto_arima, etc., and I'm not getting anywhere. Does anyone have any idea how this could be done in Python? This is a pandas DataFrame.
import pmdarima

# 'Posts' is passed as the exogenous regressor via X
arima_model = pmdarima.auto_arima(arima_final['Close'].values, X=arima_final['Posts'].values)
# to forecast 5 steps ahead you must also supply 5 future values of 'Posts'
arima_model.predict(5, X=...)
That's the simplest way I know of to do what you're asking (use a model, presumably an ARIMA model, to predict future values of Close). The X argument is the exogenous data. Note that you'll also need to provide it with exogenous data when predicting, hence the X=... in the code above.
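For example, one hedged way to get those future exogenous values is to forecast 'Posts' with its own model and feed that forecast in as X (the data below is synthetic, not the asker's frame):

import numpy as np
import pandas as pd
import pmdarima

rng = np.random.default_rng(0)
arima_final = pd.DataFrame({'Posts': rng.poisson(20, size=120).astype(float)})
arima_final['Close'] = 100 + np.cumsum(rng.normal(size=120)) + 0.1 * arima_final['Posts']

# main model: Close with Posts as the exogenous regressor
close_model = pmdarima.auto_arima(arima_final['Close'].values,
                                  X=arima_final[['Posts']].values)

# helper model: forecast Posts 5 steps ahead so we have future exogenous values
posts_model = pmdarima.auto_arima(arima_final['Posts'].values)
future_posts = np.asarray(posts_model.predict(n_periods=5)).reshape(-1, 1)

print(close_model.predict(n_periods=5, X=future_posts))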
I'm working with a company on a project to develop ML models for predictive maintenance. The data we have is a collection of log files. In each log file we have time series from sensors (Temperature, Pressure, MotorSpeed, ...) and a variable in which we record the faults that occurred. The aim is to build a model that uses the log files (the time series) as its input and predicts whether there will be a failure or not. I have a few questions:
1) What is the best model capable of doing this?
2) What is the solution to deal with imbalanced data? In fact, for some kinds of failures we don't have enough data.
I tried to construct an RNN classifier using an LSTM after transforming the time series into sub-series of a fixed length. The targets were 1 if there was a fault and 0 if not. The number of ones is negligible compared to the number of zeros, so the model always predicted 0. What is the solution?
Mohamed, for this problem you could actually start with traditional ML models (random forest, LightGBM, or anything of this nature). I recommend you focus on your features. For example, you mentioned Pressure and MotorSpeed: look at some window of time going back and calculate moving averages, min/max values, and the standard deviation within that same window. To tackle this problem you will need a solid set of features. Take a look at the featuretools package; you can either use it directly or get ideas from it about what features can be created from time series data.
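A quick, hedged sketch of what those window features might look like with plain pandas (the log DataFrame here is made up, not your real data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
log = pd.DataFrame({
    'Temperature': 60 + rng.normal(size=500).cumsum() * 0.1,
    'Pressure': 30 + rng.normal(size=500),
})

window = 20  # how many past samples to look back over
features = pd.DataFrame(index=log.index)
for col in ['Temperature', 'Pressure']:
    roll = log[col].rolling(window)
    features[f'{col}_mean'] = roll.mean()
    features[f'{col}_min'] = roll.min()
    features[f'{col}_max'] = roll.max()
    features[f'{col}_std'] = roll.std()

features = features.dropna()  # the first window-1 rows have incomplete windows
print(features.head())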
Back to your questions. 1) What is the best model capable of doing this? Traditional ML methods, as mentioned above. You could also use deep learning models, but I would start with the simpler models first. Also, if you do not have a lot of data, I probably would not touch RNN models.
2) What is the solution to deal with imbalanced data? You may want to oversample or undersample your data. For oversampling, look at SMOTE (available in the imbalanced-learn package); a sketch is below.
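A minimal sketch of oversampling the rare fault class with SMOTE from imbalanced-learn (X and y below are synthetic stand-ins for your windowed features and fault labels):

import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) < 0.02).astype(int)     # ~2% faults, like the imbalance described

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))             # classes are balanced after resampling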
Good luck