I have a dataset for analyzing crypto prices against Tweet sentiment, and I'm using random forest regression. Are the error values I'm getting good or bad? How do I interpret them?
Your RMSE is about 100, which is not a big error compared to the average coin price of about 4400. I think you can still work on getting a more generalized or more accurate prediction; it may also help to validate your model on other data.
That said, it really depends on your goal. If the aim is high-frequency trading (HFT), a 2% error would be huge. If your aim is to use the RF model as a baseline, I think it is a good way to start off.
Even though this is a prediction task, it may be worth checking the correlation between Tweet sentiment and crypto price first, so that you can be confident there is enough of a statistical relationship between the two variables (a correlation method for categorical vs. interval variables may be helpful).
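As a quick sanity check, something like the sketch below could work (the file name and the "sentiment_score" / "price" column names are placeholders, not from your post); Spearman's rank correlation is a reasonable choice if the sentiment score is ordinal.

    # Minimal sketch: is there a measurable relationship between sentiment and price?
    import pandas as pd
    from scipy.stats import spearmanr, pearsonr

    df = pd.read_csv("crypto_tweets.csv")   # hypothetical file with one row per time period

    rho, p_rho = spearmanr(df["sentiment_score"], df["price"])  # rank-based, robust to outliers
    r, p_r = pearsonr(df["sentiment_score"], df["price"])       # linear correlation
    print(f"Spearman rho={rho:.3f} (p={p_rho:.3g}), Pearson r={r:.3f} (p={p_r:.3g})")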
Mean absolute error (MAE) is literally the average "distance" between your prediction and the "real" value. The mean squared error (MSE) is the mean of that distance squared, and, as you saw in your code, the RMSE is the square root of the mean squared error.
In the case of the MAE, it is useful to "level" things. How? As a percentage or fraction: MAE / np.mean(y_test), although depending on the data you are using you could divide by np.max(y_test) or np.min(y_test) instead.
The MSE is less forgiving because it scales quadratically, so it grows faster with every "unit" of error.
As such, both the MSE and RMSE give more weight to larger errors. You can normally compare RMSE scores between runs and improvements will be much more noticeable. I usually use RMSE as the scorer to minimize, since with MAE small deviations may just be part of the randomness of the RF.
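To make this concrete, here is a minimal sketch of all three metrics plus the "levelled" MAE, assuming y_test and y_pred are 1-D numpy arrays of true values and predictions:

    import numpy as np

    # y_test: true values, y_pred: model predictions (1-D numpy arrays)
    abs_err = np.abs(y_test - y_pred)

    mae = abs_err.mean()            # average "distance" between prediction and truth
    mse = (abs_err ** 2).mean()     # squared distances: large errors count quadratically
    rmse = np.sqrt(mse)             # back on the original scale of the target

    mae_frac = mae / np.mean(y_test)   # "levelled" MAE, e.g. ~0.02 means roughly a 2% average error
    print(mae, mse, rmse, mae_frac)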
Related
I just built a price prediction model with LSTM and the RMSE was approximately 0.12. The price range is from 0 to 3. Does it mean the model is accurate? Are there any other ways to measure LSTM's accuracy?
Thank you!
Accuracy in this sense is fairly subjective. An RMSE of 0.12 means that your LSTM is off by roughly 0.12 on average, which is a lot better than random guessing.
Usually accuracies are compared to a baseline accuracy of another (simple) algorithm, so that you can see whether the task is just very easy or your LSTM is very good.
There are definitely other ways to measure accuracy, but you should really consider whether being off by 0.12 on average is good for your task specifically, or good in comparison to other regressions.
It would be easier to interpret if you measured MAE instead of RMSE, since the L1 distance is a more "natural" metric.
If you want to estimate how good your model is, you need to build baselines. You could check, for example, how the error changes if you always predict the average, or if you always predict the last available data point.
Once you have computed that, you can calculate relative metrics, for example the MAE of your model divided by the MAE of one of those benchmarks. This builds on the idea behind the MASE metric from Rob Hyndman; I recommend you have a look at it.
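A sketch of that relative metric, assuming y_true and y_pred are 1-D numpy arrays from a time series (in the spirit of MASE, not Hyndman's exact in-sample definition):

    import numpy as np

    def relative_mae(y_true, y_pred):
        # MAE of the model divided by the MAE of a naive "repeat the last value" forecast.
        # Values below 1 mean the model beats the naive baseline.
        model_mae = np.mean(np.abs(y_true - y_pred))
        naive_mae = np.mean(np.abs(y_true[1:] - y_true[:-1]))  # forecast t with the value at t-1
        return model_mae / naive_mae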
One 'easy' way to test whether this performance is good is to calculate the RMSE of a naive forecast. A naive forecast simply predicts the value from the previous step.
So if your series has the values [0, .2, .8, 2.2, 1.1], then the next predicted value would be 1.1.
The RMSE from your LSTM should be smaller than the RMSE from your naive forecast, but even a smaller RMSE is no guarantee that your model is good.
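A minimal sketch of that comparison, using the example series above (the LSTM predictions here are made-up numbers purely for illustration):

    import numpy as np

    series = np.array([0, .2, .8, 2.2, 1.1])

    actual = series[1:]           # values to be forecast
    naive_pred = series[:-1]      # naive forecast: repeat the previous value
    naive_rmse = np.sqrt(np.mean((actual - naive_pred) ** 2))

    lstm_pred = np.array([0.1, 0.7, 1.9, 1.3])   # hypothetical LSTM output for the same steps
    lstm_rmse = np.sqrt(np.mean((actual - lstm_pred) ** 2))

    print(f"naive RMSE = {naive_rmse:.3f}, LSTM RMSE = {lstm_rmse:.3f}")
    # The LSTM should beat the naive RMSE; if it doesn't, it has learned little beyond persistence.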
My oob_score is 0.97 and the accuracy on test data is 0.97.
Is there a way to know how many samples are being used to calculate oob_score?
That would give me some more confidence in the results.
rf.oob_score_ # for oob score
rf.score(X_test_scaled, y_test) # for accuracy on test data
The short answer would be ca. 36% of your sample: https://www.researchgate.net/publication/228451484
I'm not sure about calculating the exact number (especially since the number of OOB samples really only pertains to the bootstrapped sample they belong to), but another post goes into further detail on where this 36-37% estimate really comes from: https://stats.stackexchange.com/questions/173520/random-forests-out-of-bag-sample-size
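The 36-37% figure is just the expected fraction of points left out of a bootstrap sample, which tends to 1/e. A small simulation (independent of your data) reproduces it; note that each tree has its own OOB set of roughly this size, and oob_score_ scores each training point only with the trees that never saw it.

    import numpy as np

    n = 10_000                                    # pretend training-set size
    rng = np.random.default_rng(0)

    boot = rng.integers(0, n, size=n)             # one bootstrap sample: n draws with replacement
    oob_fraction = 1 - np.unique(boot).size / n   # fraction of points never drawn

    print(oob_fraction)       # ~0.367 for this sample
    print((1 - 1 / n) ** n)   # theoretical value, tends to 1/e ≈ 0.368 as n grows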
I'm working on a Kaggle machine learning project (https://www.kaggle.com/c/house-prices-advanced-regression-techniques) and my target variable is the "SalePrice" of a particular house.
After plotting the data I can see that my target variable doesn't follow a normal distribution and has positive skewness.
So I (kind of) normalize it by taking its log.
When I run my predictions using my regressors later, am I going to predict the log of the sale price? In that case, what should I do?
Is it okay to just apply the inverse transformation, or is that mathematically or statistically wrong?
Short answer: yes, your model will predict the log of the house price, and there's nothing wrong (mathematically speaking) with taking the exponential of that to get back to the actual house price.
Forgetting statistics for a moment: if you have taken the log of the house price in your training data, your algorithm doesn't "know" that; it's just a different set of numbers, and your regressor simply fits a different curve.
One practical problem is that when you take the exponential of your output, you had better be quite confident in the precision of the predicted number. A small difference in your log-predicted house prices, once exponentiated, can become a large difference in your output. Is your model going to be able to achieve that level of precision?
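A minimal sketch of that round trip using np.log1p / np.expm1 (a common choice for skewed, non-negative prices); the data and regressor below are placeholders just to keep it runnable, not the competition setup:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Toy stand-ins for the Kaggle features and SalePrice
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.exp(12 + 0.3 * X[:, 0] + rng.normal(scale=0.1, size=200))   # skewed, price-like target

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(random_state=0)
    model.fit(X_train, np.log1p(y_train))       # train on log(1 + SalePrice)

    pred_log = model.predict(X_test)            # predictions are on the log scale
    pred_price = np.expm1(pred_log)             # invert back to the original price scale

    # Small log-scale errors get amplified: np.expm1(12.0) ≈ 162,754 while np.expm1(12.1) ≈ 179,871,
    # roughly a 10% difference in price.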
I've been working on house price prediction.
After I used StandardScaler and grid search, the error turns out to be -66031776763.3788. Below is the code and the dataset. Could anyone tell me what is wrong with my code?
Why do you think there is a problem? It looks like you ran the grid search and selected the best model, with MSE = 66031776763. (You use the negative of this error as the score you try to maximize, which is why the number is negative.) So the RMSE is about 256966. Given that you are predicting prices, which are usually between 200k and 1M (as I can see from your data), the error seems plausible. It means that the linear SVM you used is not very good (an RMSE of 250k is relatively large), but it is not out of the realm of reality.
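To make the sign convention concrete, here is a sketch (the pipeline and parameter grid are placeholders, not your actual code): scikit-learn maximizes neg_mean_squared_error, so you recover the RMSE by flipping the sign of the best score and taking the square root.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVR
    from sklearn.model_selection import GridSearchCV

    pipe = make_pipeline(StandardScaler(), LinearSVR(max_iter=10_000))
    grid = GridSearchCV(
        pipe,
        param_grid={"linearsvr__C": [0.1, 1, 10]},
        scoring="neg_mean_squared_error",   # higher (closer to 0) is better, hence the negative value
        cv=5,
    )
    grid.fit(X, y)                          # X, y: your features and sale prices

    best_mse = -grid.best_score_            # e.g. 66031776763
    best_rmse = np.sqrt(best_mse)           # e.g. ~256966, in the same units as the price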
I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
Now, I want to make them better. I know from speaking with people that my classifier is 'overfitting' the data; what I am looking for is a solid way to prove this so that the next time I write a classifier I will be able to run a test and see if I am overfitting or underfitting.
What is the best way of doing this? I am open to all suggestions!
I've spent literally weeks googling this topic and found no canonical or trusted ways to do this effectively, so any response will be appreciated. I will be putting a bounty on this question.
Edit:
Let's assume my classifier spits out a .tsv containing:
the website UID<tab>the likelihood that it is ephemeral or evergreen (0 being ephemeral, 1 being evergreen)<tab>whether the page actually is ephemeral or evergreen
The simplest way to check your classifier's "efficiency" is to perform cross-validation:
Take your data, let's call it X
Split X into K batches of equal size
For each i = 1 to K:
Train your classifier on all batches except the i'th
Test on the i'th
Return the average result
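A sketch of that procedure with scikit-learn (clf can be any classifier; KFold does the splitting):

    import numpy as np
    from sklearn.base import clone
    from sklearn.model_selection import KFold

    def cross_validated_accuracy(clf, X, y, k=5):
        # Average test accuracy over K folds for any scikit-learn classifier.
        kf = KFold(n_splits=k, shuffle=True, random_state=0)
        scores = []
        for train_idx, test_idx in kf.split(X):
            model = clone(clf)                        # fresh, untrained copy for each fold
            model.fit(X[train_idx], y[train_idx])     # train on all batches except the i'th
            scores.append(model.score(X[test_idx], y[test_idx]))   # test on the i'th
        return np.mean(scores)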
One more important aspect: if your classifier uses any parameters, constants, thresholds, etc. that are not trained but given by the user, you cannot just select the values that give the best results in the above procedure. Choosing them has to be automated somehow inside the "train your classifier on all batches except the i'th" step. In other words, you cannot use the testing data to fit any parameters of your model. Once this is done, there are four possible outcomes:
Training error is low, but much lower than the testing error - overfitting
Both errors are low - ok
Both errors are high - underfitting
Training error is high but testing error is low - error in the implementation or a very small dataset
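One way to see which of those four cases you are in is to compare training and validation scores from the same cross-validation run, for example (a sketch; clf, X, y are whatever classifier and data you are using):

    from sklearn.model_selection import cross_validate

    res = cross_validate(clf, X, y, cv=5, return_train_score=True)
    train_acc = res["train_score"].mean()
    test_acc = res["test_score"].mean()

    print(f"train={train_acc:.3f}  test={test_acc:.3f}")
    # train score much higher than test score -> overfitting
    # both high -> ok
    # both low  -> underfitting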
There are many ways that people try to handle overfitting:
Cross-validation (you might also see it referred to as x-validation)
see lejlot's post for details
choose a simpler model
linear classifiers have high bias because the model must be linear, but correspondingly lower variance in the fitted solution. This means that you wouldn't expect to see much difference in the final model across different random training samples.
Regularization is a common practice to combat overfitting.
It is generally done by adding a term to the minimization function
Typically this term is the sum of squares of the model's weights because it is easy to differentiate.
Generally there is a constant C associated with the regularization term. Tuning this constant increases or decreases the effect of regularization; a higher weight on the regularization term generally helps with overfitting. C should always be greater than or equal to zero. (Note: some training packages apply 1/C as the regularization weight; in that case, the closer C gets to zero, the stronger the regularization.) See the sketch after this list.
Regardless of the specifics, regularization works by reducing the variance of a model, biasing it toward solutions whose weights incur a small regularization penalty.
Finally, boosting is a method of training that mysteriously/magically does not overfit. Nobody is quite sure why, but it is a process of combining high-bias, low-variance simple learners into a low-bias (and typically higher-variance) model. It's pretty slick.
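Here is a sketch of the C knob mentioned under regularization above. scikit-learn's LogisticRegression uses the 1/C convention (smaller C means stronger L2 regularization); the data is synthetic just to keep the example self-contained.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for C in (0.01, 0.1, 1, 10, 100):           # smaller C = stronger penalty on the weights
        clf = LogisticRegression(C=C, max_iter=5000).fit(X_tr, y_tr)
        print(f"C={C:<6} train={clf.score(X_tr, y_tr):.3f}  test={clf.score(X_te, y_te):.3f}")

    # A large gap between train and test accuracy at high C (weak regularization) is the
    # overfitting signature; lowering C trades some training accuracy for better generalization.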