I apologize in advance if I say something nonsensical; I have zero experience in this field. Unfortunately, my teacher had the brilliant idea of explaining time series forecasting in only 3 hours (and I'm a computer engineering student, not a data science student, so for me it is a totally new topic!), and I didn't understand much of it.
Context: in my project I generate a metric in Spring Boot with Micrometer (a counter that counts the number of purchases), I use Prometheus to pull this metric, and I generate a .csv file with Grafana. Then I open this file with Python and create the time series using a pandas DataFrame.
This is my time series:
[Image: plot of the time series]
My teacher explained that the first thing to do is to apply the seasonal_decompose filter and extract the trend of the series:
result = seasonal_decompose(ts, model='...', period=...)
trend = result.trend
And here are my first questions: for model, should I pass "additive" or "multiplicative"? And what period should I pass? My series is aperiodic! In class we were only shown the case of a "beautiful" series with particular features, but my series is very different from that example.
Then my teacher said that we have to apply a model. In particular, he advised us to apply ARIMA (or Holt-Winters), in this way:
auto_arima(trend) #example output: (3,2,5)
train_data = trend.iloc[:600] #90% of samples
test_data = trend.iloc[600:] #10% of samples
model = ARIMA(train_data, order=(3,2,5)) #training
results = model.fit()
start = len(train_data)
end = len(train_data)+len(test_data)-1
predictions = results.predict(start=start, end=end, dynamic=False, typ='levels')
model = ARIMA(trend, order=(3,2,5))
results = model.fit()
fcast = results.predict(len(trend), len(trend)+72, typ='levels') #prediction of 72 samples
But I wonder: can I use ARIMA for my specific series? Or should I use another model?
He also explained other concepts to us (stationarity, differencing, autocorrelation, etc.), but all of this was covered quickly and I am very confused. Having never studied machine learning (I'll do it next semester), I can't easily understand the documentation on the Internet.
Sorry for my ignorance on the subject, I'm a bit desperate.
No worries. Time series analysis is not easy, so I suggest you first review your course materials. ARIMA is the basic model for time series analysis. You should follow the instructions on how to choose its key arguments (p, d, and q) according to your series data.
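As a rough illustration of the d argument: d is the number of times the series is differenced before the autoregressive (p) and moving average (q) parts are fitted. The sketch below is pure Python and uses a made-up series (`counter` is an assumption, not the asker's data) to show how differencing removes a trend.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'[t] = y[t] - y[t-1]."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Hypothetical series with a quadratic trend (illustrative values only).
counter = [t * t for t in range(8)]  # 0, 1, 4, 9, 16, 25, 36, 49

once = difference(counter, d=1)   # still increasing: the trend is not gone yet
twice = difference(counter, d=2)  # constant: two differences remove the trend

print(once)
print(twice)
```

Each round of differencing shortens the series by one sample, which is one reason to keep d as small as possible.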
I don't have too much experience with pycaret but I'm figuring things out slowly but surely. I'm trying to figure out how to have a model predict a value, not in the data I give it but rather outside the data. For example, I have data that is ordered by date from 2020-9-30 to 2020-10-30. Once I provide the model with this data, it predicts values from 2020-9-30 to 2020-10-30, but I want it to predict 2020-10-31. I can't figure out how to make that work. Please can someone who knows how to solve this problem, if there is a solution, help me.
P.S If that was confusing, please ask me to clarify 👍 Thanks!
This is what I'm using to predict the values; asset is a 250x7 data frame. When the code runs, it prints a 250x2 data frame that is an exact copy of the asset data. But I want it to print one new value in the Label column.
modelpredict = predict_model(final_bestmodel, data=asset)
mpdf = pd.DataFrame(modelpredict, columns = ['Date', 'Label'])
mpdf
If you are doing classification or regression, you can use predict_model(best_model, data=data_unseen) and pass new data to the data argument. For example, if you want to predict the value on 2020-10-31, you need to provide the model with all the variables that you used in the training phase. You can find the documentation here and a tutorial here.
If you are doing forecasting, you can instead use predict_model(best_model, fh=24) and pass the forecasting horizon (the number of steps) to the fh argument; don't forget to finalize the model before the last prediction. It will forecast fh steps beyond the latest date in the training data. You can find the documentation here and a tutorial here.
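To make "fh steps beyond the latest date" concrete, here is a small standard-library sketch that lists the dates a horizon would cover. The helper `future_dates` is hypothetical and not part of pycaret. In the regression case, these are the dates for which you would have to build unseen rows yourself.

```python
from datetime import date, timedelta


def future_dates(last_date, fh):
    """Return the fh calendar days that follow last_date."""
    return [last_date + timedelta(days=i) for i in range(1, fh + 1)]


last_training_date = date(2020, 10, 30)  # last date in the question's data
horizon = future_dates(last_training_date, fh=3)

print([d.isoformat() for d in horizon])
```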
I am using Darts (https://unit8co.github.io/darts/). I want to train a model to predict how many products we will sell per day based on historical information.
We don't sell any products on the weekend, so there is no data for those days in the training data.
We also know that advertising spend of the days leading up to a day will have a big impact on how much we sell. So I want to use daily_advertising_spend as a covariates.
Input data looks like:
date,y,advertising_spend
2022-07-20 00:00:00,456,10
2022-07-21 00:00:00,514,10
2022-07-22 00:00:00,353,6
2022-07-25 00:00:00,511,28
2022-07-26 00:00:00,419,13
2022-07-27 00:00:00,439,16
Code:
from darts import TimeSeries
from darts.models import TFTModel
all_data = TimeSeries.from_csv("csvfile.csv", time_col="date", freq="D")  # time_col must match the CSV header
model = TFTModel(input_chunk_length=7, output_chunk_length=1)
model.fit(series=all_data["y"], future_covariates=all_data["advertising_spend"])
However, during training it can't compute the loss. Please see the image.
I looked into why this happens, and it is because the weekends are treated as NaN.
When I set fillna_value=0 while creating the TimeSeries, the model is able to train. However, the resulting model is not a good approximation.
What is the best way to handle this?
Creator of Darts here. I would make a few suggestions:
Start with simpler models. Not TFT, but rather linear regression or ARIMA, which both support future covariates.
Use business day frequency ("B"), not daily.
Make sure you don't have any NaN value in your time series. If you do, consider using e.g., darts.utils.missing_values.fill_missing_values().
If you use deep learning (later on), scale the values using Scaler.
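The effect of the frequency choice can be checked without any library. The sketch below uses the six rows from the question and a hypothetical helper `calendar` to count the gaps under a daily calendar and under a business-day calendar. It only illustrates why "B" avoids the weekend NaNs and does not replace the Darts utilities named above.

```python
from datetime import date, timedelta

# The y column from the question, keyed by date.
observed = {
    date(2022, 7, 20): 456,
    date(2022, 7, 21): 514,
    date(2022, 7, 22): 353,
    date(2022, 7, 25): 511,
    date(2022, 7, 26): 419,
    date(2022, 7, 27): 439,
}


def calendar(start, end, business_only):
    """List every day from start to end, optionally skipping Sat/Sun."""
    days = []
    current = start
    while current <= end:
        if not business_only or current.weekday() < 5:  # Mon=0 .. Fri=4
            days.append(current)
        current += timedelta(days=1)
    return days


start, end = min(observed), max(observed)
missing_daily = [d for d in calendar(start, end, False) if d not in observed]
missing_business = [d for d in calendar(start, end, True) if d not in observed]

print("missing with daily frequency:", missing_daily)
print("missing with business-day frequency:", missing_business)
```

With daily frequency the weekend of 23 and 24 July shows up as two missing points. With business-day frequency the series is complete, so nothing needs to be filled.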
I'm experimenting with Darts for similar modeling problems currently. What exactly do you mean by the model produced is not a good approximation?
Maybe you already did that, but generally speaking I would suggest starting with simpler models (Exponential Smoothing, ARIMA, Prophet) to get a solid baseline forecast that more complex models like TFT would have to outperform.
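Even simpler than the models named above, a naive "repeat the last value" forecast and a short moving-average forecast already give numbers to beat. The following is a minimal pure-Python sketch using the six y values from the question. The split and the window size are arbitrary assumptions, and six points are far too few for a real evaluation.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(train, horizon):
    """Repeat the last observed value for every future step."""
    return [train[-1]] * horizon


def mean_forecast(train, horizon, window=3):
    """Repeat the mean of the last `window` observations."""
    avg = sum(train[-window:]) / window
    return [avg] * horizon


sales = [456, 514, 353, 511, 419, 439]  # y column from the question
train, test = sales[:4], sales[4:]      # arbitrary split for illustration

naive_mae = mae(test, naive_forecast(train, len(test)))
mean_mae = mae(test, mean_forecast(train, len(test)))

print(f"naive MAE: {naive_mae:.1f}, moving-average MAE: {mean_mae:.1f}")
```

Any more complex model would be scored on the same held-out points with the same metric, so the comparison stays fair.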
My experience so far has been that most NN models don't quite beat the simpler statistical models when it comes to univariate timeseries. Intuitively I believe the NN models need more signal in order to really leverage their strengths - such as multiple similar target series to be trained on and/or bigger sets of covariates. I read some of Prof. Rob Hyndman's research on that topic for reference.
If you purely want to improve the TFT model given your set of data, the first thing to try would probably be tuning the hyperparameters, which can have quite an effect. Darts offers the gridsearch method for this; see here for documentation. Another option I saw in the Darts examples is Ray Tune.
About the advertising covariate: Do you have data on (planned) advertising spend for a certain amount of days into the future, or do you only have data until the present? In the latter case, you can also consider the other NN models in Darts that only use past_covariates without losing any signal in your model.
I have time series data from 2016 to 2021. How could I backcast to get the data from 2010 to 2015 using ARIMA in Python?
Could you give me some sample Python code?
Thank you very much
The only possibility I see here is to simply reverse your time series. That means the last observation becomes the first, the second-to-last becomes the second, and so on. You then have a series running from 2021 to 2016.
You can do that by:
df = df.reindex(index=df.index[::-1])
You can then train an ARIMA model on this data and predict the "next" years, from 2015 back to 2010. Remember that the first prediction will be for 2015-12-31, so you need to reverse the predictions again to get the series in order from 2010 to 2015.
Keep in mind that the ARIMA predictions will be very, very bad, since your forecasts will be based on forecasts, and so on. ARIMA is not made for predictions over such long time frames, so I guess the results will be useless anyway. It is very likely that the predictions will become a straight line after 30 or 40 steps. And you can only use the autoregressive part in such a case, since the order of the moving average model limits the number of steps you can forecast into the future.
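The reverse, forecast, reverse-again procedure can be sketched without any library. Note the substitution: a plain AR(1) fitted by least squares stands in for a full ARIMA model here, and the input series is made up. The function names are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    phi = cov / var
    return mean_y - phi * mean_x, phi


def forecast_ar1(series, c, phi, steps):
    """Iterate the AR(1) recursion; each step feeds on the previous forecast."""
    preds, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        preds.append(last)
    return preds


def backcast(series, steps):
    """Reverse the series, forecast forward, then restore chronological order."""
    reversed_series = series[::-1]
    c, phi = fit_ar1(reversed_series)
    return forecast_ar1(reversed_series, c, phi, steps)[::-1]


known = [1.0, 2.0, 4.0, 8.0, 16.0]  # made-up observations, oldest first
earlier = backcast(known, steps=3)  # estimates for the 3 periods before `known`

print(earlier)
```

Because every step reuses the previous prediction, an AR(1) forecast with |phi| < 1 moves geometrically toward a constant level. This is the flattening effect described above.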
Forecasting from a reversed time series would be the solution if you had more data.
However, only having 6 observations is problematic. Creating a forecasting (or backcasting) model requires using some of the observations to train the model and others to validate it. If you train with 4 observations then you only have 2 observations for validation. Is a model good if it forecasts those two well or did you just get lucky? Is it bad if it forecasts one observation well and the other poorly? If you increase the validation set to 3 observations, you get more confidence on whether the model is good or bad but creating the model (with only 3 observations) gets even harder than before.
Like others have stated, regardless of what machine learning model you choose, the results are likely to be poor with so little data. If you had the monthly data it might be more fruitful.
If you can't get the monthly data, since you are backcasting into the past, it might be better to estimate the values manually based on some related variables that you have data of (if any). E.g. if your timeseries is about a company's sales then maybe you could estimate based on the company's annual revenue (or company size, or something else) if you can get the historical data of that variable. This is not precise but can still be more precise than what ARIMA or similar methods would give with the data you have.
I'm working with a company on a project to develop ML models for predictive maintenance. The data we have is a collection of log files. Each log file contains time series from sensors (Temperature, Pressure, MotorSpeed, ...) and a variable in which we record the faults that occurred. The aim is to build a model that takes the log files (the time series) as input and predicts whether there will be a failure or not. I have some questions about this:
1) What is the best model capable of doing this?
2) What is the solution to deal with imbalanced data? In fact, for some kind of failures we don't have enough data.
I tried to construct an RNN classifier using LSTM after transforming the time series to sub time series of a fixed length. The targets were 1 if there was a fault and 0 if not. The number of ones compared to the number of zeros is negligible. As a result, the model always predicted 0. What is the solution?
Mohamed, for this problem you could actually start with traditional ML models (random forest, LightGBM, or anything of this nature). I recommend you focus on your features. For example, you mentioned Pressure and MotorSpeed. Look at some window of time going back, and calculate moving averages, min/max values, and the standard deviation in that same window. To tackle this problem you will need a healthy set of features. Take a look at the featuretools package: you can either use it or get ideas from it about which features can be created from time series data. Back to your questions.
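The window features mentioned above could be computed along these lines. This is a minimal standard-library sketch. The readings in `pressure` and the window size are made up, and `rolling_features` is a hypothetical helper, not part of featuretools.

```python
from statistics import mean, pstdev


def rolling_features(values, window):
    """Summarize each full look-back window of `values`, oldest window first."""
    rows = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        rows.append({
            "mean": mean(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "std": pstdev(chunk),  # population standard deviation
        })
    return rows


pressure = [10, 12, 11, 15, 30]  # hypothetical sensor readings
features = rolling_features(pressure, window=3)

for row in features:
    print(row)
```

Each row describes one window and can be joined with the fault label that follows it, giving a tabular dataset that a random forest or gradient boosting model can consume.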
1) What is the best model capable of doing this? Traditional ML methods as mentioned above. You could also use deep learning models, but I would first start with easy models. Also if you do not have a lot of data I probably would not touch RNN models.
2) What is the solution to deal with imbalanced data? You may want to oversample or undersample your data. For oversampling, look at SMOTE (available, for example, in the imbalanced-learn package).
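As an illustration of oversampling, here is a sketch of plain random oversampling: fault windows are duplicated until both classes have the same size. This is deliberately not SMOTE, which synthesizes new minority points by interpolating between neighbours. The function name and the toy data are assumptions, and the code assumes label 1 is the minority class with at least one sample.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority (label 1) samples until classes balance."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    rng.shuffle(keep)
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Toy data: 8 healthy windows and 2 fault windows (illustrative values only).
windows = [[float(i)] for i in range(10)]
targets = [0] * 8 + [1] * 2

balanced_x, balanced_y = random_oversample(windows, targets)
print(balanced_y.count(0), balanced_y.count(1))
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the real class ratio.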
Good luck
I'm working on a multivariate (100+ variables), multi-step (t1 to t30) forecasting problem where the time series has a 1-minute frequency. The problem requires forecasting one of the 100+ variables as the target.
I'm interested to know if it's possible to do it using FB Prophet's Python API. I was able to do it in a univariate fashion using only the target variable and the datetime variable. Any help and direction is appreciated. Please let me know if any further input or clarity is needed on the question.
You can add additional variables in Prophet using the add_regressor method.
For example, suppose we want to predict the variable y using the values of the additional variables add1 and add2 as well.
Let's first create a sample df:
import pandas as pd
df = pd.DataFrame(pd.date_range(start="2019-09-01", end="2019-09-30", freq='D', name='ds'))
df["y"] = range(1,31)
df["add1"] = range(101,131)
df["add2"] = range(201,231)
df.head()
          ds  y  add1  add2
0 2019-09-01  1   101   201
1 2019-09-02  2   102   202
2 2019-09-03  3   103   203
3 2019-09-04  4   104   204
4 2019-09-05  5   105   205
and split train and test:
df_train = df.loc[df["ds"]<"2019-09-21"]
df_test = df.loc[df["ds"]>="2019-09-21"]
Before training the forecaster, we can add regressors that use the additional variables. Here the argument of add_regressor is the column name of the additional variable in the training df.
from fbprophet import Prophet  # from Prophet 1.0 on, the package is named prophet: from prophet import Prophet
m = Prophet()
m.add_regressor('add1')
m.add_regressor('add2')
m.fit(df_train)
The predict method will then use the additional variables to forecast:
forecast = m.predict(df_test.drop(columns="y"))
Note that the additional variables should have values for your future (test) data. If you don't have them, you could start by predicting add1 and add2 with univariate timeseries, and then predict y with add_regressor and the predicted add1 and add2 as future values of the additional variables.
From the documentation I understand that the forecast of y for t+1 will only use the values of add1 and add2 at t+1, and not their values at t, t-1, ..., t-n as it does with y. If that is important for you, you could create new additional variables with the lags.
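Creating such lagged variables could look roughly like the following. It is a pure-Python sketch on a toy table shaped like the sample df above. The helper `add_lags` and the column naming scheme are assumptions. Each new column would then be registered with add_regressor like any other additional variable.

```python
def add_lags(rows, column, lags):
    """Return copies of rows with extra keys such as 'add1_lag1'.

    Rows that do not have enough history for the largest lag are dropped.
    """
    max_lag = max(lags)
    out = []
    for i in range(max_lag, len(rows)):
        new_row = dict(rows[i])
        for lag in lags:
            new_row[f"{column}_lag{lag}"] = rows[i - lag][column]
        out.append(new_row)
    return out


# Toy rows: y = 1..5 and add1 = 101..105, in time order.
rows = [{"y": t, "add1": 100 + t} for t in range(1, 6)]
lagged = add_lags(rows, "add1", lags=[1, 2])

for row in lagged:
    print(row)
```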
See also this notebook, with an example of using weather factors as extra regressors in a forecast of bicycle usage.
To forecast more than one dependent variable, you need to model the time series with Vector Auto Regression (VAR).
In a VAR model, each variable is a linear function of its own past values and of the past values of all the other variables.
For more information on VAR, see https://www.analyticsvidhya.com/blog/2018/09/multivariate-time-series-guide-forecasting-modeling-python-codes/
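Written out for the smallest case, the VAR description above looks as follows: two variables and one lag, i.e. a VAR(1). The symbols are introduced here for illustration only: c_i are intercepts, a_ij are coefficients, and the epsilon terms are error terms.

```latex
\begin{aligned}
y_{1,t} &= c_1 + a_{11}\, y_{1,t-1} + a_{12}\, y_{2,t-1} + \varepsilon_{1,t} \\
y_{2,t} &= c_2 + a_{21}\, y_{1,t-1} + a_{22}\, y_{2,t-1} + \varepsilon_{2,t}
\end{aligned}
```

With more lags, each equation simply gains one further term per variable and lag.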
I am confused: there seems to be no agreement on whether Prophet works in a multivariate way; see the GitHub issues here and here. Judging by some comments, queise's answer, and a nice YouTube tutorial, you can build a workaround to get multivariate functionality. See the video here: https://www.youtube.com/watch?v=XZhPO043lqU
This might be late, however if you are reading this in 2019, you can implement multivariate time series using LSTM, Keras.
You can do this with one line using the timemachines package that wraps prophet in a functional form. See prophet skaters to be precise. Here's an example of use:
from timemachines.skatertools.data import hospital_with_exog
from timemachines.skatertools.visualization.priorplot import prior_plot
import matplotlib.pyplot as plt
k = 11
y, a = hospital_with_exog(k=k, n=450, offset=True)
f = fbprophet_exogenous
err2 = prior_plot(f=f, k=k, y=y, n=450, n_plot=50)
print(err2)
plt.show()
Note that you can set k to be whatever you want. That is the number of steps ahead to use. Now be careful, because when prophet says multivariate they are really referring to variables known in advance (the a argument). It doesn't really address multivariate prediction. But you can use the facebook skater called _recursive to use prophet to predict the exogenous variables before it predicts the one you really care about.
Having said all that, I strongly advise you to read this critique of prophet and also check its position on the Elo ratings before using it in anger.
The answer to the original question is yes!
Here is a link to specific Neural prophet documentation with several examples of how to use multivariate inputs. For neuralprophet, these are referred to as 'lagged regressors'.
https://neuralprophet.com/html/lagged_covariates_energy_ercot.html
Yes, indeed, we can now apply multivariate time series forecasting.
Here is the solution:
https://medium.com/@bobrupakroy/yes-our-favorite-fbprophet-is-back-with-multivariate-forecasting-785fbe412731
VAR is a purely econometric model, but after reading a lot of the literature on forecasting, I see that VAR also suffers from not being able to capture the trend. In that case the forecast will not be good, but VAR is still a workhorse model for multivariate analysis. I think Prophet is not meant for multivariate problems; rather, use ML models like random forest, XGBoost, or neural networks. Keep in mind, though, that if you want to capture the trend, you need to check which model handles it better. Otherwise, go for deep learning.