How to find accuracy of ARIMA model? - python

Problem description: Prediction on CPU utilization.
Approach: Used time series algorithm.
Step 1: From Elasticsearch I collected 1000 observations and exported on Python.
Step 2: Plotted the data and checked whether data is stationary or not.
Step 3: Used log to convert the data into stationary form.
Step 4: Done DF test, ACF and PACF.
Step 5: Build ARIMA(3,0,2) model.
Step 6: Forecast.
I built an ARIMA (3,0,2) time-series model but was unable to find the accuracy of model. Is there any command through which we can check the accuracy of model in Python?
Could you please advice if my approach was correct or not and how to find accuracy of model in Python?

Approach is correct or not-
I hope you would have found out best P,Q values from ACF and PACF. There are github codes in python that will do sth like Auto Arima (automatically find best parameter), so you dont have to worry about P,q values. Basically one takes P,Q values where BIC of model is least.
Pyhton code-
There are three primary metrics used to evaluate linear models. These are: Mean absolute error (MAE), Mean squared error (MSE), or Root mean squared error (RMSE).
MAE: The easiest to understand. Represents average error
MSE: Similar to MAE but noise is exaggerated and larger errors are “punished”. It is harder to interpret than MAE as it’s not in base units, however, it is generally more popular.
RMSE: Most popular metric, similar to MSE, however, the result is square rooted to make it more interpretable as it’s in base units. It is recommended that RMSE be used as the primary metric to interpret your model.
Below, you can see how to calculate each metric. All of them require two lists as parameters, with one being your predicted values and the other being the true values-

I have been doing some research on this, unfortunately,I could not find a score function with regard to statsmodels in python. I would recommend to visit this site as recommended as an answer from an earlier post.
Also, as noted in the answer "statsmodels does have performance measures for continuous dependent variables."
Hopefully some geek would find and answer and if I find anything with regard to this, I will definitely post it to the community.

Related

Regression problem optimization using ML or DL

I have some data (data from sensors and etc.) from an energy system. consider the x-axis is temperature and the y-axis is energy consumption. Suppose we just have data and we don't have access to the mathematical formulation of the problem:
energy consumption vs temperature curve
In the above figure, it is absolutely obvious that the optimum point is 20. I want to predict the optimum point using ML or DL models. Based on the courses that I have taken I know that it's a regression supervised learning problem, however, I don't know how can I do optimization on this kind of problem.
I don't want you to write a code for this problem. I just want you to give me some hints and instructions about doing this optimization problem.
Also if you recommend any references or courses, I will welcome them to learn how to predict the optimum point of a regression supervised learning problem without knowing the mathematical formulation of the problem.
There are lots of ways that you can try when it comes to optimizing your model, for example, fine tuning your model. What you can do with fine tuning is to try different options that a model consists of and find the smallest errors or higher accuracy based on the actual and predicted data.
Using DecisionTreeRegressor model, you can try to use different split criterion, limit the minimum number of split & depth to see which give you the best predicted scores/errors. For neural network model, using keras, you can try different optimizers, try different loss functions, tune your parameters etc. and try all out as a combination of model.
As for resources, you can go Google, Youtube, and other platform to use keywords such as "fine tuning DNN model" and a lot of resources will pop up for your reference. The bottom line is that you will need to try out different models and fine tune your model until when you are satisfied with your results. The results will be based on your judgement and there is no right or wrong answers (i.e., errors are always there), it just completely up to you on how would you like to achieve your solution with handful of ML and DL models that you got. My advice to you is to spend more time on getting your hands dirty. It will be worth it in the long run. HFGL!

Regression vs Classification for a problem that could be solved by both

I have a problem that I have been treating as a classification problem. I am trying to predict whether a machine will pass or fail a particular test based on a number of input features.
What I am really interested in is actually whether a new machine is predicted to pass or fail the test. It can pass or fail the test by having certain signatures (such as speed, vibration etc) go out of range.
Therefore, I could either:
1) Treat it as a pure regression problem; try to predict the actual values of speed, vibration etc
2) Treat it as a pure classification problem; for each observation, feed in whether it passed or failed on the labels, and try to predict this in the tool I am making
3) Treat it as a pseudo problem; where I predict the actual value, and come up with some measure of how confident I am that it is a pass or fail based on distance from the threshold of pass/fail
To be clear; I am working on a real problem. I am not interested in getting a super precise prediction of a certain value, just whether a machine is predicted to pass or fail (and bonus extension; how likely that it is to be true).
I have been working with classification model as I only have a couple hundred observations and some previous research showed that this might be the best way to treat the problem. However I am wondering now whether this is the right thing to do.
What would you do!?
Many thanks.
Without having the data and running classification or regression, a comparison would be hard because of the metric you use for each family is different.
For example, comparing RMSE of a regression with F1 score (or accuracy) of a classification problem would be apple to orange comparison.
It would be ideal if you can train a good regression model (low RMSE) because that would give you information more than the original pass/fail question. From my past experiences with industrial customers,
First, train all 3 models you have mentioned and then present the outcome to your customer and let them give you more direction on which models/outputs are more meaningful for them.

Very small negative mean squared error

I've been working on a house price prediction.
After I've used standardscaler and gridsearch, the error turns out to be -66031776763.3788. Below is the code and the dataset. Could anyone tell me what is wrong with my code?
Why you think there is some problem? Looks like you run the grid search and select the best model, with MSE = 66031776763. (You use negative of this error as score which you try to maximize, so thats why your number is negative). So the RMSE is about 256966. Given that that you are predicting price, which is usually between 200k - 1M (as i can see from your data), the error seems to be feasible. It means that linear SVM which you used is not so good (RMSE 250k is relatively large), but is's not anything out of reality.

NN: outputting a probability density function instead of a single value

This might sound silly but I'm just wondering about the possibility of modifying a neural network to obtain a probability density function rather than a single value when you are trying to predict a scalar. I know that when you are trying to classify images or words you can get a probability for each class, so I'm thinking there might be a way to do something similar with a continuous value and plot it. (Similar to the posterior plot with bayesian optimisation)
Such details could be interesting when deploying a model for prediction and could provide more flexibility than a single value.
Does anyone knows a way to obtain such an output?
Thanks!
Ok So I found a solution to this issue, though it adds a lot of overhead.
Initially I thought the keras callback could be of use but despite the fact that it provided the flexibility that I wanted i.e.: train only on test data or only a subset and not for every test. It seems that callbacks are only given summary data from the logs.
So the first step what to create a custom metric that would do the same calculation as any metric with the 2 arrays ( the true value and the predicted value) and once those calculations are done, output them to a file for later use.
Then once we found a way to gather all the data for every sample, the next step was to implement a method that could give a good measure of error. I'm currently implementing a handful of methods but the most fitting one seem to be bayesian bootstraping ( user lmc2179 has a great python implementation). I also implemented ensemble methods and gaussian process as alternatives or to use as other metrics and some other bayesian methods.
I'll try to find if there are internals in keras that are set during the training and testing phases to see if I can set a trigger for my metric. The main issue with using all the data is that you obtain a lot of unreliable data points at the start since the network is not optimized. Some data filtering could be useful to remove a good amount of those points to improve the results of the error predictors.
I'll update if I find anything interesting.

xgboost predict method returns the same predicted value for all rows

I've created an xgboost classifier in Python:
train is a pandas dataframe with 100k rows and 50 features as columns.
target is a pandas series
xgb_classifier = xgb.XGBClassifier(nthread=-1, max_depth=3, silent=0,
objective='reg:linear', n_estimators=100)
xgb_classifier = xgb_classifier.fit(train, target)
predictions = xgb_classifier.predict(test)
However, after training, when I use this classifier to predict values the entire results array is the same number. Any idea why this would be happening?
Data clarification:
~50 numerical features with a numerical target
I've also tried RandomForestRegressor from sklearn with the same data and it does give realistic predictions. Perhaps a legitimate bug in the xgboost implementation?
This question has received several responses including on this thread as well as here and here.
I was having a similar issue with both XGBoost and LGBM. For me, the solution was to increase the size of the training dataset.
I was training on a local machine using a random sample (~0.5%) of a large sparse dataset (200,000 rows and 7000 columns) because I did not have enough local memory for the algorithm. It turned out that for me, the array of predicted values was just an array of the average values of the target variable. This suggests to me that the model may have been underfitting. One solution to an underfitting model is to train your model on more data, so I tried my analysis on a machine with more memory and the issue was resolved: my prediction array was no longer an array of average target values. On the other hand, the issue could simply have been that the slice of predicted values I was looking at were predicted from training data with very little information (e.g. 0's and nan's). For training data with very little information, it seems reasonable to predict the average value of the target feature.
None of the other suggested solutions I came across were helpful for me. To summarize some of the suggested solutions included:
1) check if gamma is too high
2) make sure your target labels are not included in your training dataset
3) max_depth may be too small.
One of the reasons for the same is that you're providing a high penalty through parameter gamma. Compare the mean value of your training response variable and check if the prediction is close to this. If yes then the model is restricting too much on the prediction to keep train-rmse and val-rmse as close as possible. Your prediction is the simplest with higher value of gamma. So you'll get the simplest model prediction like mean of training set as prediction or naive prediction.
Won't the max_depth =3 too smaller, try to get it bigger,the default value is 7 if i remember it correctly. and set silent to be 1, then you can monitor what's the error each epochs
You need to post a reproducible example for any real investigation. It's entirely likely that your response target is highly unbalanced and that your training data is not super predictive, thus you always (or almost always) get one class predicted. Have you looked at the predicted probabilities at all to see if there is any variance? Is it just an issue of not using the proper cut-off for classification labels?
Since you said that a RF gave reasonable predictions it would useful to see your training parameters for that. At a glance, it's curious why you're using a regression objective function in your xgboost call though -- that could easily be why you are seeing such poor performance. Trying changing your objective to: 'binary:logistic.
You should check there are no inf values in your target.
Try to increase (significantly) min_child_weight in XGBoost or min_data_in_leaf in LightGBM:
min_data_in_leaf oof_rmse
20000 0.052998
2000 0.053001
200 0.053002
20 0.053015
2 0.054261
Actually, it may be a case of overfitting masking as underfitting. It happens for instance for zero-inflated targets in case of insurance claims frequency models. One solution would be to increase the representation/coverage of rare target levels (e.g. non-zero insurance claims) in each tree leaf, by increasing the hyperparameter controlling minimum leaf size to some rather large values, such as those specified in the example above.
I just had this problem and managed to fix it. The problem was I was training on tree_method='gpu_hist' which gave all the same predictions. If I set tree_method='auto' it works properly but wayy longer runtimes. So then if I set tree_method='gpu_hist' along with base_score=0 it worked. I think base_score should be about the mean of your predicted variable.
I have tried all solutions on this page, but none worked.
As I was grouping time series, certain frequencies created gaps in data.
I solved this issue by filling all NaN's.
Probably the hyper-parameters you use cause errors. Try using default values. In my case, this problem was solved by removing subsample and min_child_weight hyper-parameters from params.

Categories

Resources