Very small negative mean squared error - python

I've been working on a house price prediction model.
After applying StandardScaler and running a grid search, the error turns out to be -66031776763.3788. Below is the code and the dataset. Could anyone tell me what is wrong with my code?

Why do you think there is a problem? It looks like you ran the grid search and selected the best model, with MSE = 66031776763. (You use the negative of this error as the score you try to maximize, which is why your number is negative.) So the RMSE is about 256966. Given that you are predicting prices, which are usually between 200k and 1M (as I can see from your data), the error seems feasible. It means that the linear SVM you used is not that good (an RMSE of 250k is relatively large), but it's not anything out of the ordinary.
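For reference, here is a minimal sketch of how that sign flip works in scikit-learn; the pipeline, parameter grid and X_train/y_train names are placeholders, not the asker's actual code:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Hypothetical setup: scale the features, then fit a linear SVM regressor
pipe = make_pipeline(StandardScaler(), LinearSVR())
param_grid = {"linearsvr__C": [0.1, 1, 10]}

# scoring="neg_mean_squared_error" is maximized, so the best score is a negative MSE
grid = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train, y_train)          # X_train / y_train assumed to exist

mse = -grid.best_score_             # flip the sign to recover the MSE
rmse = np.sqrt(mse)                 # back in price units, roughly 256966 here
print(mse, rmse)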

Related

How do I interpret my Random Forest Regression accuracy data?

I have a dataset to analyze crypto prices against Tweet sentiment and I'm using random forest regression. Are the rates I'm getting good or bad? How do I interpret them?
Your RMSE is about 100, which is not a big error compared to the average coin price of 4400. I think you can still work toward a more generalized or accurate prediction. Maybe you can also validate your model with other data.
Yet it really depends on your goal. If the aim is to do HFT, a 2% error would be very large. If your aim is to use the RF model as a baseline, I think it is a good way to start off.
Though it is a prediction task, it may be necessary to check the correlation between Tweet sentiment and crypto price first, so that you can be assured there is enough of a statistical relationship between those two variables (a correlation method for categorical vs. interval variables may be helpful).
Mean absolute error is literally the average "distance" between your prediction and the "real" value. The mean squared error is the mean of the squared distance. And as you saw in your code, the RMSE is the square root of the mean squared error.
In the case of the MAE, it's useful to "level" things. How? With a percentage or fraction: MAE/np.mean(y_test), although depending on the data you could also use np.max(y_test) or np.min(y_test).
The MSE is less forgiving since it scales quadratically, so it grows faster with every "unit of error".
As such, both the MSE and RMSE give more weight to larger errors. Normally you can compare RMSE scores between runs and improvements will be much more noticeable. I normally use RMSE as a scorer to minimize, since with MAE small deviations may just be part of the randomness in the RF.
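As a quick illustration of the "leveling" idea above (y_test and predictions are placeholder names for your true and predicted values):

import numpy as np
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, predictions)
relative_mae = mae / np.mean(y_test)    # or np.max(y_test) / np.min(y_test), depending on the data
print(f"MAE as a fraction of the mean target: {relative_mae:.2%}")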

"reg_alpha" parameter in XGBoost regressor. Is it bad to use high values?

I'm performing hyperparameter tuning with grid search and I realized that I was getting overfitting... I tried a lot of ways to reduce it, changing the "gamma", "subsample", and "max_depth" parameters, but I was still overfitting...
Then I increased the "reg_alpha" parameter's value to > 30... and then my model's overfitting dropped drastically. I know that this parameter refers to the L1 regularization term on the weights, and maybe that's what solved my problem.
I just want to know: is there any problem with using high values for reg_alpha like this?
I would appreciate your help :D
reg_alpha penalizes features that increase the cost function. In other words, it finds the features that don't increase accuracy, and this makes the prediction line smoother.
On some problems I also increase reg_alpha above 30 because it reduces both overfitting and test error.
But if it is a regression problem, the predictions will be close to the mean on the test set and the model may not catch anomalies well.
So I would say you can increase it as long as your test accuracy doesn't start to drop.
Lastly, when increasing reg_alpha, keeping max_depth small might be good practice.
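For what it's worth, a minimal sketch of searching over reg_alpha while keeping max_depth small; the grid values and the X_train/y_train names are placeholders to adapt to your data:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "reg_alpha": [0, 1, 10, 30, 50, 100],   # L1 regularization strength
    "max_depth": [3, 4],                    # keep trees shallow while alpha is large
}
grid = GridSearchCV(XGBRegressor(n_estimators=300, learning_rate=0.05),
                    param_grid, scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X_train, y_train)                  # X_train / y_train assumed to exist
print(grid.best_params_, -grid.best_score_)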

Easy statistics question with a (not so) easy answer

I'm working on a Kaggle machine learning project (https://www.kaggle.com/c/house-prices-advanced-regression-techniques) and my target variable is the "SalePrice" of a particular house.
After plotting the data I can see that my target variable doesn't follow a normal distribution and has positive skewness,
so I (kind of) normalize it by taking its log.
When I run my predictions using my regressors later, am I going to predict the log of the sale price? In that case, what should I do?
Is it okay to just do the inverse transformation, or is that mathematically or statistically wrong?
Short answer: yes, your model will predict the log of the house price, and there's nothing wrong with taking the exponential of that (mathematically speaking) to get back to the actual house price.
Forgetting statistics for a moment: if you have taken the log of your house price for your training data, your algorithm doesn't "know" that; it's just a different set of numbers, and your regressor is simply going to fit a different curve.
One practical problem is that when you take the exponential of your output, you had better be quite confident in the precision of the predicted number. A small difference in your log-predicted house prices, after being exponentiated, could make a large difference in your output. Is your model going to be able to achieve that level of precision?
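A short sketch of that round trip, assuming a generic scikit-learn regressor and placeholder X/y names; log1p/expm1 are used so zero values are handled safely:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

y_train_log = np.log1p(y_train)      # train on log(1 + SalePrice)
model = GradientBoostingRegressor().fit(X_train, y_train_log)

pred_log = model.predict(X_test)     # predictions come back in log space
pred_price = np.expm1(pred_log)      # invert: exp(pred) - 1, back to dollars

scikit-learn's TransformedTargetRegressor (in sklearn.compose) wraps the same pattern with func=np.log1p and inverse_func=np.expm1 if you prefer not to invert by hand.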

How to find accuracy of ARIMA model?

Problem description: Prediction on CPU utilization.
Approach: Used time series algorithm.
Step 1: From Elasticsearch I collected 1000 observations and exported them to Python.
Step 2: Plotted the data and checked whether it is stationary or not.
Step 3: Applied a log transform to make the data stationary.
Step 4: Performed the Dickey-Fuller (DF) test, ACF and PACF (a sketch of steps 3-4 is shown after this list).
Step 5: Built an ARIMA(3,0,2) model.
Step 6: Forecast.
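A rough sketch of what steps 3-4 could look like with statsmodels, assuming the exported observations are in a pandas Series named series (a placeholder):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

log_series = np.log(series)                    # step 3: log transform (assumes positive values)
adf_stat, p_value, *_ = adfuller(log_series)   # step 4: Dickey-Fuller test
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.3f}")
plot_acf(log_series)                           # step 4: ACF plot
plot_pacf(log_series)                          # step 4: PACF plot
plt.show()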
I built an ARIMA(3,0,2) time-series model but was unable to find the accuracy of the model. Is there a command in Python to check the model's accuracy?
Could you please advise whether my approach was correct, and how to find the model's accuracy in Python?
Whether the approach is correct-
I hope you found the best p, q values from the ACF and PACF. There are Python packages on GitHub that do something like auto-ARIMA (automatically finding the best parameters), so you don't have to worry about the p, q values. Basically, you take the p, q values for which the model's BIC is lowest.
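A hedged sketch of that BIC-based selection with statsmodels; the candidate ranges and the series name are placeholders:

import itertools
from statsmodels.tsa.arima.model import ARIMA

best_order, best_bic = None, float("inf")
for p, q in itertools.product(range(4), range(4)):     # candidate p and q values
    try:
        result = ARIMA(series, order=(p, 0, q)).fit()  # series: your (log-transformed) data
    except Exception:
        continue                                       # some orders may fail to converge
    if result.bic < best_bic:
        best_order, best_bic = (p, 0, q), result.bic
print(best_order, best_bic)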
Python code-
There are three primary metrics used to evaluate linear models: mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).
MAE: The easiest to understand. Represents the average error.
MSE: Similar to MAE, but noise is exaggerated and larger errors are "punished". It is harder to interpret than MAE as it's not in base units; however, it is generally more popular.
RMSE: The most popular metric. Similar to MSE, but the result is square-rooted to make it more interpretable, since it's in base units. It is recommended that RMSE be used as the primary metric to interpret your model.
Below, you can see how to calculate each metric. All of them take two lists as parameters, one being your predicted values and the other being the true values-
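A minimal version of that calculation with scikit-learn (y_true and y_pred are placeholders for your actual and forecast values):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}")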
I have been doing some research on this; unfortunately, I could not find a score function for statsmodels in Python. I would recommend visiting this site, as suggested in an answer to an earlier post.
Also, as noted in that answer, "statsmodels does have performance measures for continuous dependent variables."
Hopefully someone will find an answer, and if I find anything on this, I will definitely post it to the community.

xgboost predict method returns the same predicted value for all rows

I've created an xgboost classifier in Python:
train is a pandas dataframe with 100k rows and 50 features as columns.
target is a pandas series
import xgboost as xgb

xgb_classifier = xgb.XGBClassifier(nthread=-1, max_depth=3, silent=0,
                                   objective='reg:linear', n_estimators=100)
xgb_classifier = xgb_classifier.fit(train, target)
predictions = xgb_classifier.predict(test)
However, after training, when I use this classifier to predict values the entire results array is the same number. Any idea why this would be happening?
Data clarification:
~50 numerical features with a numerical target
I've also tried RandomForestRegressor from sklearn with the same data and it does give realistic predictions. Perhaps a legitimate bug in the xgboost implementation?
This question has received several responses including on this thread as well as here and here.
I was having a similar issue with both XGBoost and LGBM. For me, the solution was to increase the size of the training dataset.
I was training on a local machine using a random sample (~0.5%) of a large sparse dataset (200,000 rows and 7,000 columns) because I did not have enough local memory for the algorithm. It turned out that, for me, the array of predicted values was just an array of the average values of the target variable. This suggests the model may have been underfitting. One solution to an underfitting model is to train it on more data, so I reran my analysis on a machine with more memory and the issue was resolved: my prediction array was no longer an array of average target values. On the other hand, the issue could simply have been that the slice of predicted values I was looking at was predicted from training data with very little information (e.g. 0's and NaN's). For training data with very little information, it seems reasonable to predict the average value of the target feature.
None of the other suggested solutions I came across were helpful for me. To summarize, some of the suggested solutions included:
1) check if gamma is too high
2) make sure your target labels are not included in your training dataset
3) max_depth may be too small.
One of the reasons for this is that you're providing a high penalty through the gamma parameter. Compare the mean value of your training response variable and check whether the prediction is close to it. If it is, the model is restricting the predictions too much in order to keep train-rmse and val-rmse as close as possible. The prediction gets simpler with higher values of gamma, so you end up with the simplest model prediction, such as the mean of the training set, i.e. a naive prediction.
Isn't max_depth=3 too small? Try making it bigger; the default value is 6 if I remember correctly. Also keep silent=0 so you can monitor the error at each epoch.
You need to post a reproducible example for any real investigation. It's entirely likely that your response target is highly unbalanced and that your training data is not super predictive, thus you always (or almost always) get one class predicted. Have you looked at the predicted probabilities at all to see if there is any variance? Is it just an issue of not using the proper cut-off for classification labels?
Since you said that an RF gave reasonable predictions, it would be useful to see your training parameters for that. At a glance, it's curious that you're using a regression objective function in your xgboost call, though -- that could easily be why you are seeing such poor performance. Try changing your objective to 'binary:logistic'.
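To illustrate the two consistent setups (this is a sketch, not the asker's code; it reuses the train/target/test names from the question):

import xgboost as xgb

# If the target is a class label, use a classifier with a logistic objective:
clf = xgb.XGBClassifier(objective="binary:logistic", max_depth=6, n_estimators=100)
clf.fit(train, target)
print(clf.predict_proba(test)[:5])    # inspect the probabilities, not just the hard labels

# If the target is numeric (as clarified later in the question), use a regressor instead:
reg = xgb.XGBRegressor(objective="reg:squarederror", max_depth=6, n_estimators=100)
reg.fit(train, target)
print(reg.predict(test)[:5])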
You should check there are no inf values in your target.
Try to increase (significantly) min_child_weight in XGBoost or min_data_in_leaf in LightGBM:
min_data_in_leaf oof_rmse
20000 0.052998
2000 0.053001
200 0.053002
20 0.053015
2 0.054261
Actually, it may be a case of overfitting masquerading as underfitting. It happens, for instance, with zero-inflated targets in insurance claims frequency models. One solution is to increase the representation/coverage of rare target levels (e.g. non-zero insurance claims) in each tree leaf, by increasing the hyperparameter controlling minimum leaf size to some rather large values, such as those specified in the example above.
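A small sketch of that adjustment; the specific values are only starting points to tune:

import xgboost as xgb
import lightgbm as lgb

# XGBoost: require more hessian weight per leaf before a split is kept
xgb_model = xgb.XGBRegressor(min_child_weight=200)

# LightGBM: require more rows per leaf
# (min_child_samples is the sklearn-API name for min_data_in_leaf)
lgb_model = lgb.LGBMRegressor(min_child_samples=2000)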
I just had this problem and managed to fix it. The problem was that I was training with tree_method='gpu_hist', which gave all the same predictions. If I set tree_method='auto' it works properly, but with much longer runtimes. Then, if I set tree_method='gpu_hist' along with base_score=0, it worked. I think base_score should be about the mean of your predicted variable.
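A sketch of that workaround, with base_score set near the target mean as suggested (train/target are the names from the question):

import numpy as np
import xgboost as xgb

model = xgb.XGBRegressor(
    tree_method="gpu_hist",                # GPU histogram algorithm
    base_score=float(np.mean(target)),     # start boosting from the target mean rather than the default 0.5
)
model.fit(train, target)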
I have tried all solutions on this page, but none worked.
As I was grouping time series, certain frequencies created gaps in the data.
I solved this issue by filling in all the NaNs.
Probably the hyperparameters you're using are causing the problem. Try the default values. In my case, the issue was solved by removing the subsample and min_child_weight hyperparameters from params.
