I want to predict a company's sales. I tried an LSTM, but all the examples I found use only two variables (time and sales).
https://www.kaggle.com/freespirit08/time-series-for-beginners-with-arima
This page mentions that time series use only two variables, but I don't think that is sufficient to build a good forecast. After this, I found several 'multiple features' options, such as polynomial regression with PolynomialFeatures from sklearn, or regression trees. I haven't written a script with these algorithms yet, so I would like your recommendations on which model to use.
Thanks.
You could try Facebook's Prophet, which allows you to take into account additional regressors, or Amazon's DeepAR.
But I have also seen forecasting models in production that are based not on ARIMA-style time series but on simple linear regression with extensive feature engineering (features = store + product + historical values).
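As a rough illustration of the regression-with-feature-engineering idea, here is a minimal sketch using only numpy and pandas. The data is synthetic, and the column names (sales, promo, lag_1, lag_7) are made up for the example; a real setup would add store and product features in the same way.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for one store/product (made-up data for illustration).
rng = np.random.default_rng(0)
n = 120
promo = rng.integers(0, 2, size=n)  # an extra explanatory variable
weekly = 10 * np.sin(2 * np.pi * np.arange(n) / 7)
sales = 100 + 20 * promo + weekly + rng.normal(0, 2, n)
df = pd.DataFrame({"sales": sales, "promo": promo})

# Feature engineering: historical values (lags) next to the extra variable.
df["lag_1"] = df["sales"].shift(1)
df["lag_7"] = df["sales"].shift(7)
df = df.dropna().reset_index(drop=True)

features = ["lag_1", "lag_7", "promo"]
train, test = df.iloc[:-14], df.iloc[-14:]  # hold out the last two weeks

# Ordinary least squares with an explicit intercept column.
X_train = np.column_stack([np.ones(len(train)), train[features].to_numpy()])
coef, *_ = np.linalg.lstsq(X_train, train["sales"].to_numpy(), rcond=None)

X_test = np.column_stack([np.ones(len(test)), test[features].to_numpy()])
pred = X_test @ coef
mae = float(np.mean(np.abs(pred - test["sales"].to_numpy())))
print(f"hold-out MAE: {mae:.2f}")
```

Note that the split is by time (the last 14 rows), not random, so the model is never evaluated on days that precede its training data.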
Hope this helps.
I would recommend using Prophet, as it has certain advantages over conventional models like ARIMA:
It handles missing values well.
Tuning its parameters is much easier and more intuitive.
Traditional time series forecasting models expect data points at consistent time intervals. That is not the case with “Prophet”: the time interval need not be the same throughout.
I am using Darts (https://unit8co.github.io/darts/). I want to train a model to predict how many products we will sell per day based on historical information.
We don't sell any products on the weekend, so there is no data for those days in the training data.
We also know that advertising spend on the days leading up to a day has a big impact on how much we sell, so I want to use daily advertising spend as a covariate.
Input data looks like:
date,y,advertising_spend
2022-07-20 00:00:00,456,10
2022-07-21 00:00:00,514,10
2022-07-22 00:00:00,353,6
2022-07-25 00:00:00,511,28
2022-07-26 00:00:00,419,13
2022-07-27 00:00:00,439,16
Code:
from darts import TimeSeries
from darts.models import TFTModel

# time_col must match the CSV header, which is "date" in the sample above
all_data = TimeSeries.from_csv("csvfile.csv", time_col="date", freq="D")
model = TFTModel(input_chunk_length=7, output_chunk_length=1)
# advertising_spend is passed as a future covariate
model.fit(series=all_data["y"], future_covariates=all_data["advertising_spend"])
However, during training the loss cannot be computed. Please see image.
I looked into why this happens, and it is because the weekends are treated as NaN.
When I set fillna_value=0 when creating the TimeSeries, the model is able to train. However, the resulting model is not a good approximation.
What is the best way to handle this?
Creator of Darts here. I would make a few suggestions:
Start with simpler models. Not TFT, but rather linear regression or ARIMA, which both support future covariates.
Use business day frequency ("B"), not daily.
Make sure you don't have any NaN values in your time series. If you do, consider using e.g. darts.utils.missing_values.fill_missing_values().
If you use deep learning (later on), scale the values using Scaler.
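To illustrate the business-day suggestion, here is a minimal pandas sketch (not Darts code) built on the sample rows from the question. It shows that a daily index creates NaN rows for the weekend while a business-day index does not, and how a genuinely missing business day could be filled. The dropped "holiday" row and the choice of time interpolation are assumptions made for the example.

```python
import io
import pandas as pd

csv_text = """date,y,advertising_spend
2022-07-20 00:00:00,456,10
2022-07-21 00:00:00,514,10
2022-07-22 00:00:00,353,6
2022-07-25 00:00:00,511,28
2022-07-26 00:00:00,419,13
2022-07-27 00:00:00,439,16
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# Calendar-day frequency: Saturday 23 and Sunday 24 become all-NaN rows.
daily = df.asfreq("D")
print(int(daily["y"].isna().sum()), "NaN rows with freq='D'")

# Business-day frequency: weekends are simply not part of the index.
business = df.asfreq("B")
print(int(business["y"].isna().sum()), "NaN rows with freq='B'")

# A missing business day (simulated here by dropping one row, as if it were
# a public holiday) still shows up as NaN and has to be filled explicitly.
with_gap = df.drop(pd.Timestamp("2022-07-26")).asfreq("B")
filled = with_gap.interpolate(method="time")
print(filled)
```

In Darts, passing freq="B" when building the TimeSeries should give the same weekend-free index, and fill_missing_values() plays the role of the interpolation step.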
I'm currently experimenting with Darts for similar modeling problems. What exactly do you mean by "the model produced is not a good approximation"?
Maybe you already did that, but generally speaking I would suggest starting with simpler models (Exponential Smoothing, ARIMA, Prophet) to get a solid baseline forecast that more complex models like TFT would have to outperform.
My experience so far has been that most NN models don't quite beat the simpler statistical models when it comes to univariate time series. Intuitively, I believe the NN models need more signal in order to really leverage their strengths, such as multiple similar target series to be trained on and/or bigger sets of covariates. I read some of Prof. Rob Hyndman's research on that topic for reference.
If you purely want to improve the TFT model given your set of data, maybe the first way to do so would be to tune the hyperparameters, which can have quite an effect. Darts offers the gridsearch method for this (see the documentation). Another option I saw in the Darts examples is Ray Tune.
About the advertising covariate: do you have data on (planned) advertising spend for a certain number of days into the future, or do you only have data up to the present? In the latter case, you can also consider the other NN models in Darts that only use past_covariates, without losing any signal in your model.
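To show what a grid search does conceptually, here is a small plain-Python sketch. It is not the Darts gridsearch API: it tunes the smoothing factor of a hand-written simple exponential smoothing model with a one-step-ahead backtest and compares the result to a naive last-value baseline. The series values and the parameter grid are made up for illustration.

```python
from itertools import product

def ses_forecast(history, alpha):
    """One-step-ahead forecast from simple exponential smoothing."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def backtest_mae(series, start, forecaster):
    """MAE of one-step-ahead forecasts made for every index >= start."""
    errors = [abs(forecaster(series[:t]) - series[t])
              for t in range(start, len(series))]
    return sum(errors) / len(errors)

# Made-up series with a short repeating pattern.
series = [100 + 5 * (i % 5) + (i * 7) % 3 for i in range(40)]

# Every combination of the listed values is evaluated.
param_grid = {"alpha": [0.1, 0.3, 0.5, 0.7, 0.9]}
results = {}
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = backtest_mae(series, 20, lambda h: ses_forecast(h, **params))
    results[values] = score

best_values = min(results, key=results.get)
best_params = dict(zip(param_grid.keys(), best_values))
best_score = results[best_values]

# Baseline that any tuned model should be compared against.
naive_score = backtest_mae(series, 20, lambda h: h[-1])
print("best params:", best_params, "MAE:", round(best_score, 3))
print("naive last-value MAE:", round(naive_score, 3))
```

The same loop extends to several hyperparameters by adding more keys to param_grid; the cost grows with the product of the list lengths, which is why tools like Ray Tune sample the space instead of enumerating it.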
I have a time series that contains only traffic volume. I've tried FB Prophet and NeuralProphet. They work okay, but I would like to try something using machine learning. So far my problem is building the features. Using the classical dayofyear, month, etc. does not give me good results. I have tried using shift to get the average, minimum, and maximum of the two previous days. That works, but when I try to predict several days in advance the feature breaks down, since I can't compute the average for days that haven't happened yet. My main concern is finding a good feature that is also available in the dataframe for the future dates I want to predict. A picture of my data is included. Does anyone know how I would do this?
First of all, you have to clarify some definitions. FBProphet works the same way as any machine learning algorithm: you fit the model and then predict the output. Being an additive regression model with a piecewise linear or logistic growth curve trend, it can be considered a machine learning method that allows us to predict a continuous outcome variable.
Secondly, I think you missed the most important word that your question was about - namely: Feature engineering.
Feature engineering includes:
The process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data.
The process of transforming raw data into features that better represent the underlying problem to the predictive models, etc.
But it is very unusual to use machine learning to do feature engineering. You do feature engineering in order to improve your machine learning model. Many techniques, such as imputation, handling outliers, binning, log transform, one-hot encoding, grouping operations, feature split, and scaling, are hybrid methods using a statistical approach and/or domain knowledge.
Regarding your data, bearing in mind that the seasonality is already handled by FBProphet, I am not confident that feature engineering transformations such as adding the day of the week, adding holiday periods, etc. could really help improve performance.
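As one concrete illustration of feature engineering for a daily series, here is a pandas sketch of lag and rolling features that stay available for future dates. The idea, stated as an assumption rather than a recommendation from the posts above, is to shift the series by the forecast horizon before computing rolling statistics, so a row for day t only uses values known at least `horizon` days earlier. The traffic data, the 7-day horizon, and all column names are made up.

```python
import numpy as np
import pandas as pd

def make_features(series, horizon):
    """Calendar features plus lag/rolling features shifted by the horizon."""
    past = series.shift(horizon)  # only values known `horizon` days earlier
    feats = pd.DataFrame(index=series.index)
    feats["dayofweek"] = series.index.dayofweek
    feats["month"] = series.index.month
    feats["lag_h"] = past
    feats["roll_mean_2"] = past.rolling(2).mean()
    feats["roll_min_2"] = past.rolling(2).min()
    feats["roll_max_2"] = past.rolling(2).max()
    return feats

horizon = 7  # hypothetical: forecast up to 7 days ahead
idx = pd.date_range("2023-01-01", periods=60, freq="D")
traffic = pd.Series(1000.0 + 100 * (idx.dayofweek < 5) + np.arange(60), index=idx)

# Append the future dates with unknown (NaN) targets, then build features once.
future_idx = pd.date_range(idx[-1] + pd.Timedelta(days=1), periods=horizon, freq="D")
extended = traffic.reindex(idx.append(future_idx))
feats = make_features(extended, horizon)

train_X = feats.loc[idx].dropna()      # rows with a full feature history
train_y = traffic.loc[train_X.index]
future_X = feats.loc[future_idx]       # fully populated, no target needed
print(future_X)
```

The trade-off is that a longer horizon forces older, less informative lags; a common alternative is to train one model per horizon step.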
To conclude, it is not possible to create ex-nihilo new features that would outperform your model. Whether you process/transform your data or add external domain-knowledge dataset
Is there any way to use multiple time series to train one model and then use this model for predictions given a new time series as input? It is rather a theoretical question, but I did not know where else to post it.
It's theoretically possible; nevertheless, every time series has its own components in terms of seasonality, stationarity, and frequency (in case you are talking about mixing series).
I've seen some work that combines wavelet decomposition and deep learning for time series, using several datasets and weights to train the model. But the time series there are similar: the same metric at different times (e.g., temperature in a city in 2000-2001 and 2005-2007).
I found a library called Darts.
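For similar series like these, one simple way to train a single model is to cut every series into lag windows, pool the windows, and fit one regressor on the pooled set. The sketch below does this with numpy least squares on synthetic sinusoidal series. The per-series standardization, the 24-step window, and the data are all assumptions made for the example, not something taken from the work mentioned above.

```python
import numpy as np

def make_windows(series, n_lags):
    """Turn one series into (lag window, next value) training pairs."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = np.asarray(series[n_lags:])
    return X, y

def scale(series):
    """Standardize a series so different levels and amplitudes are comparable."""
    return (series - series.mean()) / series.std()

rng = np.random.default_rng(0)
t = np.arange(200)

def synthetic(level, amp, phase):
    """Same metric, different level/amplitude/phase (made-up data)."""
    return level + amp * np.sin(2 * np.pi * (t + phase) / 24) + rng.normal(0, 0.1, t.size)

train_series = [synthetic(10, 3, 0), synthetic(15, 2, 5), synthetic(20, 4, 11)]

# Pool the windows from all training series into one design matrix.
n_lags = 24
parts = [make_windows(scale(s), n_lags) for s in train_series]
X = np.vstack([p[0] for p in parts])
y = np.concatenate([p[1] for p in parts])
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X)), X]), y, rcond=None)

# Apply the single fitted model to a series it has never seen.
new_series = scale(synthetic(12, 2.5, 7))
X_new, y_new = make_windows(new_series, n_lags)
pred = np.column_stack([np.ones(len(X_new)), X_new]) @ coef
mae = float(np.mean(np.abs(pred - y_new)))
print(f"one-step MAE on the unseen series (scaled units): {mae:.3f}")
```

For simplicity the new series is standardized with statistics from its whole length; in a real forecast only the observed history would be available for that.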
I'm working with a company on a project to develop ML models for predictive maintenance. The data we have is a collection of log files. Each log file contains time series from sensors (Temperature, Pressure, MotorSpeed, ...) and a variable in which we record the faults that occurred. The aim is to build a model that takes the log files (the time series) as input and predicts whether there will be a failure or not. I have some questions about this:
1) What is the best model capable of doing this?
2) What is the solution to deal with imbalanced data? In fact, for some kind of failures we don't have enough data.
I tried to construct an RNN classifier using LSTM after transforming the time series to sub time series of a fixed length. The targets were 1 if there was a fault and 0 if not. The number of ones compared to the number of zeros is negligible. As a result, the model always predicted 0. What is the solution?
Mohamed, for this problem you could actually start with traditional ML models (random forest, LightGBM, or anything of this nature). I recommend you focus on your features. For example, you mentioned Pressure and MotorSpeed. Look at some window of time going back, and calculate moving averages, min/max values, and the standard deviation in that same window. To tackle this problem you will need a healthy set of features. Take a look at the featuretools package. You can either use it or take ideas from it about which features can be created from time series data. Back to your questions.
1) What is the best model capable of doing this? Traditional ML methods as mentioned above. You could also use deep learning models, but I would start with simple models first. Also, if you do not have a lot of data, I would probably not touch RNN models.
2) What is the solution to deal with imbalanced data? You may want to oversample or undersample your data. For oversampling, look at SMOTE (available, for example, in the imbalanced-learn package).
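A rough sketch of both suggestions, assuming a single log loaded as a pandas DataFrame with made-up sensor values: rolling-window features (mean, min, max, standard deviation) followed by plain random oversampling of the fault class. This is simple duplication of minority rows, not SMOTE, and the window length of 10 is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
# Made-up sensor log; real data would come from the log files.
log = pd.DataFrame({
    "Temperature": rng.normal(70, 5, n),
    "Pressure": rng.normal(30, 2, n),
    "MotorSpeed": rng.normal(1500, 100, n),
})
log["fault"] = 0
fault_rows = rng.choice(np.arange(20, n), size=10, replace=False)
log.loc[fault_rows, "fault"] = 1  # faults are rare: 10 out of 500 rows

# Window features computed from the 10 rows *before* each row (shift(1)),
# so the features never include the moment being predicted.
window = 10
sensors = ["Temperature", "Pressure", "MotorSpeed"]
roll = log[sensors].shift(1).rolling(window)
features = pd.concat(
    [
        roll.mean().add_suffix("_mean"),
        roll.min().add_suffix("_min"),
        roll.max().add_suffix("_max"),
        roll.std().add_suffix("_std"),
    ],
    axis=1,
).dropna()
labels = log.loc[features.index, "fault"].to_numpy()

# Random oversampling: repeat minority rows until both classes are equal.
minority = np.flatnonzero(labels == 1)
majority = np.flatnonzero(labels == 0)
resampled = rng.choice(minority, size=len(majority), replace=True)
keep = np.concatenate([majority, resampled])
X_balanced = features.to_numpy()[keep]
y_balanced = labels[keep]
print(X_balanced.shape, y_balanced.mean())
```

Oversampling should be applied to the training split only; evaluating on duplicated fault rows would make the scores look better than they are.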
Good luck
I'm sorry, I know this is a very basic question, but I'm still a beginner in machine learning, and determining which model best suits my problem is still confusing to me. Recently I used a linear regression model (the r2_score was very low), and a user mentioned that I could choose a model according to the shape of the curve in a plot of my data. Then I saw another coder use a random forest regressor (with an r2_score 30% better than the linear regression model), and I have no idea how he/she knew it was the better model, since he/she didn't explain it. Most sites that I read just feed the data to a few models that they think would suit the problem (for example, linear regression or a random forest regressor for a regression problem), but some sites and some people say that we first need to plot the data so we can tell which model suits it best. I really don't know which part of the data I should plot. I thought a seaborn pairplot would give me insight into the shape of the curve, but I doubt that it is the right way. What should I actually plot: only the label, only the features, or both? And how do I get from the shape of the curve to the most promising model?
This question is too general, but I will try to give an overview of how to choose a model. First of all, you should know that there is no general rule for choosing the family of models to use; it is more a matter of experimenting with different models and seeing which one gives better results. You should also know that in general your features are multi-dimensional, so plotting the data will not give you full insight into how the target depends on your features. However, to check whether a linear model is appropriate, you can start by plotting the target against each dimension of the input and looking for some kind of linear relationship. I would also recommend that you fit a linear model and check whether it is relevant from a statistical point of view (Student's t-test, Kolmogorov-Smirnov test, checking the residuals...). Note that in real-life applications it is unlikely that linear regression will be the best model unless you do a lot of feature engineering, so I would recommend trying more advanced methods (random forests, XGBoost...).
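A small numpy sketch of the residual check, on synthetic data where one feature acts linearly and the other quadratically (both relationships are assumptions built into the example). The linear fit leaves residuals that still correlate strongly with the squared feature, and adding that squared term as an engineered feature fixes the fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x1 = rng.uniform(-3, 3, n)  # acts linearly on the target
x2 = rng.uniform(-3, 3, n)  # acts quadratically on the target
y = 2.0 * x1 + x2 ** 2 + rng.normal(0, 0.3, n)

def fit_r2(X, y):
    """Least-squares fit with intercept; returns R^2 and the residuals."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1 - residuals.var() / y.var(), residuals

# Plain linear model on the raw features.
r2_linear, residuals = fit_r2(np.column_stack([x1, x2]), y)

# If the model were adequate, residuals would look like noise. Here they
# still track x2 squared, which a residual plot would show as a U shape.
leftover = np.corrcoef(residuals, x2 ** 2)[0, 1]

# Feature engineering: add the squared term and refit.
r2_engineered, _ = fit_r2(np.column_stack([x1, x2, x2 ** 2]), y)

print(f"R^2 linear: {r2_linear:.2f}, residual corr with x2^2: {leftover:.2f}")
print(f"R^2 with engineered feature: {r2_engineered:.2f}")
```

With real data the missing term is not known in advance, which is why plotting the residuals against each feature is a useful first diagnostic.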
If you are using off-the-shelf packages like sklearn, then many simple models like SVM, RF, etc, are just one-liners, so in practice, we usually try several such models at the same time.
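For example, a sketch of trying several scikit-learn regressors with the same cross-validation loop. The data is synthetic and deliberately nonlinear, and the three models with their default settings are just a starting selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic nonlinear regression problem (made-up data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.3, 300)

models = {
    "linear": LinearRegression(),
    "svr": SVR(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Same 5-fold cross-validated R^2 for every candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```

Comparing on cross-validated scores rather than on the training fit is what makes the comparison fair between a flexible model like a random forest and a rigid one like linear regression.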