About accuracy of LSTM


I just built a price prediction model with LSTM and the RMSE was approximately 0.12. The price range is from 0 to 3. Does it mean the model is accurate? Are there any other ways to measure LSTM's accuracy?
Thank you!

Accuracy in this sense is fairly subjective. An RMSE of 0.12 means that a typical prediction from your LSTM is off by roughly 0.12 (RMSE weights large errors more heavily than a plain average does), which on a price range of 0 to 3 is a lot better than random guessing.
Usually accuracies are compared to the baseline accuracy of another (simple) algorithm, so that you can see whether the task is just very easy or your LSTM is very good.
There are definitely other ways to measure accuracy, but you should really be considering whether being off by about 0.12 is good for your task specifically, or good in comparison to other regressions.

It would be easier if you measured MAE instead of RMSE, as the L1 error is a more "natural" metric: it is simply the average absolute distance between prediction and actual value.
If you want to estimate how good your model is, you need to build baselines. You could check, for example, how the error changes if you always predicted the average, or if you always predicted the last available data point.
Once you have computed that, you can calculate relative metrics, for example the MAE of your model divided by the MAE of one of those baselines. This builds upon the idea of the MASE metric from Rob Hyndman. I recommend you have a look at it.
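A minimal sketch of the relative-MAE idea, using only the standard library. The series and the model predictions are made-up illustration values, and the names `relative_to_mean` and `relative_to_last` are hypothetical; a ratio below 1 means the model beats that baseline.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical data: a short price series and made-up model predictions
# for the last four points (the first point is only used as history).
series = [1.0, 1.2, 1.1, 1.5, 1.4]
actual = series[1:]
model_pred = [1.15, 1.15, 1.4, 1.45]

# Baseline 1: always predict the average.
# (Here the mean of the whole series is used for brevity; in practice
# it should be computed from the training part only.)
mean_pred = [sum(series) / len(series)] * len(actual)

# Baseline 2: always predict the last available data point.
last_pred = series[:-1]

# Relative metrics: model MAE divided by baseline MAE.
relative_to_mean = mae(actual, model_pred) / mae(actual, mean_pred)
relative_to_last = mae(actual, model_pred) / mae(actual, last_pred)

print(f"MAE relative to mean baseline:       {relative_to_mean:.3f}")
print(f"MAE relative to last-value baseline: {relative_to_last:.3f}")
```

Dividing by the last-value baseline is the same spirit as MASE, which scales the error by that of a naive forecast.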

One 'easy' way to test whether this performance is good is to calculate the RMSE of a naive forecast. A naive forecast simply predicts the value from the previous step.
So if your series has the values [0, .2, .8, 2.2, 1.1], then the next predicted value would be '1.1'.
The RMSE from your LSTM should be smaller than the RMSE from your naive forecast, but even a smaller RMSE is no guarantee that your model is good.
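As a rough sketch, the naive-forecast RMSE for the example series can be computed like this. The `lstm_rmse` value of 0.12 is taken from the question and is on a different series, so the comparison here is for illustration only; in practice both numbers must come from the same test data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

series = [0, 0.2, 0.8, 2.2, 1.1]

# Naive forecast: the prediction for step t is the value at step t-1,
# so the first value has no prediction and is dropped from the targets.
actual = series[1:]
naive_pred = series[:-1]

naive_rmse = rmse(actual, naive_pred)
lstm_rmse = 0.12  # figure quoted in the question, for illustration only

print(f"naive RMSE: {naive_rmse:.3f}")
print("model beats naive forecast:", lstm_rmse < naive_rmse)
```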

Related

How do I interpret my Random Forest Regression accuracy data?

I have a dataset to analyze crypto prices against Tweet sentiment and I'm using random forest regression. Are the rates I'm getting good or bad? How do I interpret them?

Your RMSE is about 100, which is small compared with the average coin price of 4400 (roughly 2%). You could still work on making the predictions more accurate or more general, for example by validating your model on other data as well.
That said, it really depends on your goal. If the aim is HFT, a 2% error would be huge. If your aim is to use the RF model as a baseline, I think it is a good starting point.
Although this is a prediction task, it may be worth first checking the correlation between Tweet sentiment and crypto price, so that you can be sure there is a sufficient statistical relationship between the two variables (a correlation method for categorical vs. interval variables may be helpful).

Mean absolute error (MAE) is literally the average "distance" between your prediction and the "real" value. The mean squared error (MSE) is the mean of the squared distances, and, as you saw from your code, the RMSE is the square root of the MSE.
With the MAE it is useful to "level" things by expressing it as a percentage or fraction: MAE/np.mean(y_test), although depending on your data you could use np.max(y_test) or np.min(y_test) instead.
The MSE is less forgiving because it scales quadratically, so it grows faster with every "unit of error".
As such, both the MSE and RMSE give more weight to larger errors. Comparing RMSE scores between runs usually makes improvements much more noticeable. I normally use RMSE as the score to minimize, since with MAE small deviations may just be part of the randomness in the RF.
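To make the difference concrete, here is a small sketch with made-up numbers (plain Python rather than NumPy, so it runs without extra packages). Both prediction sets have the same total absolute error, but one concentrates it in a single point.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical targets and two hypothetical sets of predictions.
y_test = [100.0, 100.0, 100.0, 100.0]
spread_errors = [98.0, 102.0, 98.0, 102.0]    # every prediction off by 2
one_big_error = [100.0, 100.0, 100.0, 108.0]  # one prediction off by 8

for name, pred in [("spread", spread_errors), ("one big", one_big_error)]:
    print(name, "MAE:", mae(y_test, pred), "RMSE:", rmse(y_test, pred))

# "Levelled" MAE: a fraction of the mean target value.
relative_mae = mae(y_test, spread_errors) / (sum(y_test) / len(y_test))
print("relative MAE:", relative_mae)
```

The MAE is identical for both sets, while the RMSE doubles for the set with the single large error.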

SGD optimiser graph

I just wanted to ask a quick question. I understand that val_loss and train_loss alone are insufficient to tell whether the model is overfitting. However, I wish to use them as a rough gauge by monitoring whether val_loss is increasing. As I use the SGD optimiser, I seem to get two different trends depending on the smoothing value. Which should I use? Blue is val_loss and orange is train_loss.
With smoothing = 0.999, both seem to be decreasing, but with smoothing = 0.927, val_loss seems to be increasing. Thank you for reading!
Also, when is a good time to decrease the learning rate? Is it directly before the model overfits?
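[Figure: val_loss (blue) and train_loss (orange), smoothing = 0.999]
[Figure: val_loss (blue) and train_loss (orange), smoothing = 0.927]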

In my experience with DL as applied to CNNs, overfitting is tied more to the difference in train/val accuracies/losses rather than just one or the other. In your graphs, it's clear that the difference in loss is increasing as time goes on, showing that your model does not generalize well to the dataset, and hence shows signs of overfitting. It would also help for you to track classification accuracy on train and val datasets if possible--this will show you the generalization error which acts as a similar metric but might show more visible effects.
Dropping the learning rate once the loss starts to even out and overfitting begins is a good idea; however you may find better gains for your generalization if you adjust the net's complexity to better fit the dataset first. For such overfitting, a modest decrease in complexity may help--use the difference in train/val losses and accuracies to confirm.
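A small sketch of why the two smoothing values can tell different stories, and of tracking the train/val gap directly. It assumes the smoothing is a plain exponential moving average (the plotting tool may differ in detail), and the loss values are invented for illustration.

```python
def ema(values, weight):
    """Exponential moving average; weight close to 1 means heavy smoothing."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Hypothetical per-epoch losses: training keeps improving,
# validation bottoms out and then rises again.
train_loss = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.25, 0.2]
val_loss = [1.1, 0.9, 0.8, 0.75, 0.78, 0.82, 0.88, 0.95]

# The gap between validation and training loss grows when overfitting.
gap = [v - t for t, v in zip(train_loss, val_loss)]

# Epoch with the lowest raw validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

heavy = ema(val_loss, 0.999)  # barely moves, so the upturn is hidden
light = ema(val_loss, 0.5)    # follows the raw curve and shows the upturn

print("gap per epoch:", [round(g, 2) for g in gap])
print("best epoch by raw val_loss:", best_epoch)
print("heavy smoothing still decreasing:", heavy[-1] < heavy[0])
print("light smoothing turned upward:", light[-1] > min(light))
```

With very heavy smoothing on a short run, the curve mostly reflects the earliest values, so the less smoothed curve (or the raw gap) is the more informative gauge.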

Targeting a specific metric to optimize in tensorflow

Is there any way to target a specific metric to optimize using the built-in tensorflow optimizers? If not, how can this be achieved? For example, if I want to focus only on maximizing the F-score of my classifier, is it possible to do so in tensorflow?
estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_cols,
    config=my_checkpointing_config,
    model_dir=output_dir,
    optimizer=lambda: tf.train.FtrlOptimizer(
        learning_rate=tf.train.exponential_decay(
            learning_rate=0.1,
            global_step=tf.train.get_or_create_global_step(),
            decay_steps=1000,
            decay_rate=0.96)))
I am trying to optimize my classifier specifically for a better F-score. Despite using a decaying learning_rate and 300 training steps, I am getting inconsistent results. While checking the metrics in the logs, I found the behavior of precision, recall and accuracy to be very erratic. Increasing the number of training steps brought no significant improvement. So I thought that if I could make the optimizer focus on improving the F-score as a whole, I might get better results. Hence the question. Is there something that I am missing?

In classification settings, optimizers minimize the loss, e.g. cross entropy; quantities like accuracy, F-score, precision, recall etc. are essentially business metrics, and they are not (and cannot be) directly minimized during the optimization process.
This is a question that pops up rather frequently here in SO in various disguises; here are some threads which will hopefully help you disentangle the concepts (although they refer to accuracy, precision, and recall, the argument is exactly the same for the F-score):
Loss & accuracy - Are these reasonable learning curves?
Cost function training target versus accuracy desired goal
Is there an optimizer in keras based on precision or recall instead of loss?
The bottom line, adapting one of my own (linked) answers:
Loss and metrics like accuracy or F-score are different things; roughly speaking, metrics like accuracy & F-score are what we are actually interested in from a business perspective, while the loss is the objective function that the learning algorithms (optimizers) are trying to minimize from a mathematical perspective. Even more roughly speaking, you can think of the loss as the "translation" of the business objective (accuracy, F-score etc) to the mathematical domain, a translation which is necessary in classification problems (in regression ones, usually the loss and the business objective are the same, or at least can be the same in principle, e.g. the RMSE)...

One could technically adjust the threshold parameter that distinguishes between class 1 and 0. For example, in logistic regression, if the threshold is lowered from 0.5 to 0.3, more samples are predicted as class 1, so recall would increase and precision would typically decrease, and vice versa. But as others have mentioned, this is not the same as optimizing ("minimizing") the loss function.
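A minimal sketch of threshold tuning after training, in plain Python. The labels and predicted probabilities are invented, and the candidate thresholds are arbitrary; in practice the threshold should be chosen on validation data, not on the test set.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting class 1 for scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9]

for threshold in (0.5, 0.3):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")

# Pick the candidate threshold with the best F1; the model itself is unchanged.
candidates = [0.3, 0.4, 0.5]
best_threshold = max(
    candidates, key=lambda t: f1_score(*precision_recall(y_true, scores, t))
)
print("best threshold by F1:", best_threshold)
```

The model is still trained by minimizing its loss; only the decision rule applied to its outputs is adjusted for the F-score.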

How to recognize Overfitting and underfitting in Python

I have a regression model. I wrote code for the following algorithm:
1- Create 10 random splits of training data into training and validation data. Choose the best value of alpha from the following set: {0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333}.
To choose the best alpha hyperparameter value, you have to do the following:
• For each value of hyperparameter, perform 10 random splits of training data into training and validation data as said above.
• For each value of hyperparameter, use its 10 random splits and find the average training and validation accuracy.
• On a graph, plot both the average training accuracy (in red) and average validation accuracy (in blue) w.r.t. each hyperparameter setting. Comment on this graph by identifying regions of overfitting and underfitting.
• Print the best value of alpha hyperparameter.
2- Evaluate the prediction performance on test data and report the following:
• Total number of non-zero features in the final model.
• The confusion matrix
• Precision, recall and accuracy for each class.
Finally, discuss if there is any sign of underfitting or overfitting with appropriate reasoning
I wrote this code:
from sklearn.metrics import classification_report
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(Newclassifier.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
My questions are:
1- Why does the accuracy decrease in each iteration?
2- Is my model overfitting or underfitting?
3- Does my model work correctly?

There is no official or absolute metric for deciding whether you are underfitting, overfitting, or neither. In practice:
Underfitting: your model is too simple. There will not be much difference between the training and validation sets, but the accuracy will be fairly low on both.
Overfitting: your model is too complicated. Instead of learning the underlying patterns, it memorizes your training set, so the training error keeps decreasing while the validation error starts increasing after some point.
In your case, your training and testing errors seem to move in parallel, so you don't seem to have a problem with overfitting. Your model could be underfitting, so you could try a more complex model. However, it is possible that this is as good as this algorithm can get on this particular training set. In most real problems, no algorithm can get to zero error.
As to why your error increases, I don't know how this particular algorithm works, but since it seems to rely on random methods, this seems like reasonable behavior. It goes up and down a bit, but it does not steadily increase, so it doesn't seem problematic.
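A rough sketch of the "average over 10 random splits per alpha" procedure described in the question. To keep it dependency-free it uses a stand-in model, a one-feature ridge regression with a closed-form solution, and mean squared error instead of accuracy; the data is synthetic and the helper names are hypothetical, so this is not the poster's logistic regression.

```python
import random

ALPHAS = [0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333]

def fit_ridge_1d(xs, ys, alpha):
    """Closed-form ridge slope for y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def mse(xs, ys, w):
    """Mean squared error of the line y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def average_errors(data, alpha, n_splits=10, val_fraction=0.3, seed=0):
    """Average training and validation error over random splits."""
    rng = random.Random(seed)  # same seed, so every alpha sees the same splits
    train_errors, val_errors = [], []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_val = int(len(shuffled) * val_fraction)
        val, train = shuffled[:n_val], shuffled[n_val:]
        train_x, train_y = zip(*train)
        val_x, val_y = zip(*val)
        w = fit_ridge_1d(train_x, train_y, alpha)
        train_errors.append(mse(train_x, train_y, w))
        val_errors.append(mse(val_x, val_y, w))
    return sum(train_errors) / n_splits, sum(val_errors) / n_splits

# Synthetic data: y = 2x plus a little noise.
data_rng = random.Random(42)
data = []
for _ in range(40):
    x = data_rng.uniform(0, 1)
    data.append((x, 2 * x + data_rng.gauss(0, 0.05)))

results = {alpha: average_errors(data, alpha) for alpha in ALPHAS}
best_alpha = min(ALPHAS, key=lambda a: results[a][1])

for alpha in ALPHAS:
    train_err, val_err = results[alpha]
    print(f"alpha={alpha:>6}: train={train_err:.4f} val={val_err:.4f}")
print("best alpha:", best_alpha)
```

In this toy setup, large alpha values push both errors up together, which is the underfitting region. Overfitting would instead show up as a low training error paired with a clearly higher validation error.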

How can I test my classifier for overfitting?

I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.
Now, I want to make them better. I know from speaking with people that my classifier is 'overfitting' the data; what I am looking for is a solid way to prove this so that the next time I write a classifier I will be able to run a test and see if I am overfitting or underfitting.
What is the best way of doing this? I am open to all suggestion!
I've spent literally weeks googling this topic and found no canonical or trusted ways to do this effectively, so any response will be appreciated. I will be putting a bounty on this question.
Edit:
Let's assume my clasifier spits out a .tsv containing :
the website UID<tab>the likelihood it is to be ephemeral or evergreen, 0 being ephemeral, 1 being evergreen<tab>whether the page is ephemeral or evergreen

The simplest way to check your classifier's "efficiency" is to perform cross-validation:
Take your data, let's call it X
Split X into K batches of equal size
For each i = 1 to K:
    Train your classifier on all batches but the i'th
    Test on the i'th
Return the average result
One more important aspect: if your classifier uses any parameters (constants, thresholds, etc.) that are not trained but supplied by the user, you cannot simply select the ones that give the best results in the above procedure. That selection has to be automated inside the "Train your classifier on all batches but the i'th" step. In other words, you cannot use the testing data to fit any parameters of your model. Once this is done, there are four possible outcomes:
Training error is low and much lower than testing error - overfitting
Both errors are low - ok
Both errors are high - underfitting
Training error is high but testing is low - error in implementation or very small dataset
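The cross-validation procedure above can be sketched as follows. The classifier is a deliberately trivial stand-in (a threshold halfway between the two class means on a single feature), and the data is synthetic; `train_threshold` and `error_rate` are hypothetical helper names, to be replaced by your own training and scoring code.

```python
import random

def train_threshold(samples):
    """Toy classifier: threshold halfway between the two class means."""
    zeros = [x for x, label in samples if label == 0]
    ones = [x for x, label in samples if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def error_rate(threshold, samples):
    """Fraction of samples misclassified by the rule 'x > threshold -> 1'."""
    wrong = sum(1 for x, label in samples if (1 if x > threshold else 0) != label)
    return wrong / len(samples)

def k_fold_cross_validate(data, k, seed=0):
    """Return (average training error, average testing error) over k folds."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k batches of (near) equal size
    train_errors, test_errors = [], []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_threshold(train)  # fit on all batches but the i'th
        train_errors.append(error_rate(model, train))
        test_errors.append(error_rate(model, test))  # test on the i'th
    return sum(train_errors) / k, sum(test_errors) / k

# Synthetic, well-separated data: class 0 near 0..0.9, class 1 near 1.5..2.4.
data = [(i * 0.1, 0) for i in range(10)] + [(1.5 + i * 0.1, 1) for i in range(10)]

avg_train_error, avg_test_error = k_fold_cross_validate(data, k=5)
print("average training error:", avg_train_error)
print("average testing error: ", avg_test_error)
```

On this easy data both errors are low, which is the "ok" case; a training error far below the testing error would point to overfitting.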

There are many ways that people try to handle overfitting:
Cross-validation, you might also see it mentioned as x-validation
see lejlot's post for details
choose a simpler model
Linear classifiers have high bias, because the decision boundary must be linear, but correspondingly lower variance: you would not expect the final model to change much across different random training samples.
Regularization is a common practice to combat overfitting.
It is generally done by adding a term to the minimization function
Typically this term is the sum of squares of the model's weights because it is easy to differentiate.
Generally there is a constant C associated with the regularization term. Tuning this constant increases or decreases the effect of regularization. A high weight applied to regularization generally helps with overfitting. C should always be greater than or equal to zero. (Note: some training packages apply 1/C as the regularization weight. In this case, the closer C gets to zero, the greater the weight applied to regularization.)
Regardless of the specifics, regularization works by reducing the variance of a model by biasing it toward solutions with a small regularization term (for example, small weights).
Finally, boosting is a method of training that is often surprisingly resistant to overfitting in practice, although it can still overfit, particularly on noisy data. Explanations have been proposed (for example in terms of margins), but in essence it is a process of combining simple high-bias, low-variance learners into a low-bias, higher-variance model. It's pretty slick.
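To illustrate the regularization term described above, here is a minimal sketch of a linear model trained by gradient descent with a sum-of-squares (L2) penalty. The data, learning rate, and the name `lam` for the regularization weight are assumptions made for the example.

```python
def predict(weights, row):
    """Linear model without intercept."""
    return sum(w * x for w, x in zip(weights, row))

def fit_l2(xs, ys, lam, lr=0.05, steps=2000):
    """Minimize mean squared error + lam * sum(w^2) by gradient descent."""
    weights = [0.0] * len(xs[0])
    n = len(ys)
    for _ in range(steps):
        residuals = [predict(weights, row) - y for row, y in zip(xs, ys)]
        grads = []
        for j in range(len(weights)):
            # Gradient of the data term ...
            g = 2 / n * sum(r * row[j] for r, row in zip(residuals, xs))
            # ... plus the gradient of the penalty, which is easy to
            # differentiate: d/dw (lam * w^2) = 2 * lam * w.
            g += 2 * lam * weights[j]
            grads.append(g)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

def squared_norm(weights):
    return sum(w * w for w in weights)

# Hypothetical data generated exactly by the weights [2, -1].
xs = [[1, 0], [0, 1], [1, 1], [2, 1]]
ys = [2, -1, 1, 3]

w_plain = fit_l2(xs, ys, lam=0.0)  # no regularization
w_reg = fit_l2(xs, ys, lam=1.0)    # penalty pulls the weights toward zero

print("unregularized weights:", [round(w, 3) for w in w_plain])
print("regularized weights:  ", [round(w, 3) for w in w_reg])
```

The regularized fit trades a slightly worse match to the training data for smaller weights, which is the bias-for-variance trade described above.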
