In regression problems, RMSE (Root Mean Square Error) is often used as the evaluation metric. Its square, the MSE, is also the loss function in linear regression; minimizing it is equivalent to maximum likelihood estimation when the output is assumed to follow a normal distribution.
In real-life problems, I find that MAPE (Mean Absolute Percentage Error) can be more meaningful. For example, when predicting house prices, we are more interested in the relative error, because a difference of $100k does not mean the same thing for a house priced around $100k as for one priced around $1M.
When fitting a linear regression for a house price prediction problem, I obtained the following graph:
x axis: real value of prices
y axis: relative error = (prediction-real_value) / real_value
The algorithm predicts relatively much higher prices when the real price is low
The algorithm predicts relatively lower prices when the real price is high.
What kind of transformation can we apply in order to get a model with more homogeneous relative errors?
Sure, one method of obtaining the maximum likelihood estimator is by gradient descent. In this process, the error between the predicted and actual values is determined, and the gradient of this error with respect to each of the changeable parameters of the model is found. Then, these parameters are tweaked slightly according to the calculated gradients such that the error value would be minimized. This process is repeated until the error converges to a suitably low value.
The great thing about this method is that your error or loss function has a lot of flexibility in how you define it. For instance, L2 (MSE) norm is often used, but you can also use L1 norm, smooth-L1 norm, or any other function.
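As a rough illustration of that flexibility, here is a minimal sketch of three interchangeable loss functions. The function names and the `beta` threshold for the smooth-L1 loss are my own choices for this example, not something taken from a particular library.

```python
import numpy as np

def l2_loss(pred, target):
    """Mean squared error (L2)."""
    return np.mean((pred - target) ** 2)

def l1_loss(pred, target):
    """Mean absolute error (L1)."""
    return np.mean(np.abs(pred - target))

def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it (beta is an assumed default)."""
    err = np.abs(pred - target)
    return np.mean(np.where(err < beta, 0.5 * err ** 2 / beta, err - 0.5 * beta))

pred = np.array([1.0, 2.0, 6.0])
target = np.array([1.5, 2.0, 2.0])

print(l2_loss(pred, target))         # dominated by the single large error
print(l1_loss(pred, target))         # grows only linearly with that error
print(smooth_l1_loss(pred, target))  # behaves like L1 for the large error
```

Any of these can be plugged into the same gradient descent loop; only the gradient of the loss with respect to the prediction changes.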
An error function that yields MAPE simply divides each absolute error term by the magnitude of the true value, thus yielding an error relative to the size of the value. The gradients of this error can then be calculated with respect to each parameter, and gradient descent can be carried out as before.
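A minimal sketch of that idea for a linear model, assuming all true values are non-zero. The data is synthetic, and the function names, learning rate, and iteration count are arbitrary choices for this example. Because the absolute value is not differentiable at zero, this is really subgradient descent with a fixed step, so it only settles into a neighbourhood of the optimum.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    return np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

def fit_mape(X, y, lr=0.01, n_iter=5000):
    """Fit y ~ X @ w + b by (sub)gradient descent on the MAPE."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        resid = X @ w + b - y
        # Subgradient of |resid| / |y| with respect to the prediction.
        g = np.sign(resid) / np.abs(y)
        w -= lr * (X.T @ g) / n
        b -= lr * g.mean()
    return w, b

# Synthetic data with multiplicative noise (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)
y = (3.0 * x + 5.0) * (1.0 + 0.05 * rng.standard_normal(200))
X = x[:, None]

w, b = fit_mape(X, y)
final_mape = mape(y, X @ w + b)
print(w, b, final_mape)
```

In practice a decaying learning rate, or feature scaling, would likely help the fit converge more tightly.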
Comment if there's any part of this that is unclear or needs more explanation!
I was trying to optimise a 20th-order curve fit and find the optimal coefficient values with CMA optimization. As the initial iterate, I suggested the coefficient values obtained from the numpy.polyfit function.
The objective was to reduce the error (which I defined as two error metrics). With the suggested initial values, the Nevergrad optimizer evaluated the coefficients and the error was very small.
But from the next iteration on, the optimizer suggested very large coefficient values, resulting in wrong predictions of the curve data and a high error value for all subsequent iterations. Hence Nevergrad took the lowest error, obtained with the initial coefficient values, as the best value and recommended those values as the solution of the optimization.
Please suggest why it does not propose small coefficient values in the iterations after the initial suggestion.
I have a dataset to analyze crypto prices against Tweet sentiment and I'm using random forest regression. Are the rates I'm getting good or bad? How do I interpret them?
Your RMSE is about 100, which is not large compared with the average coin price of 4400 (roughly 2%). I think you can still work on getting a more general or more accurate prediction. You could also validate your model on other data.
Still, it really depends on your goal. If the aim is HFT, a 2% error would be huge. If your aim is to use the RF model as a baseline, I think it is a good way to start.
Although this is a prediction task, it may be necessary to first check the correlation between Tweet sentiment and crypto price, so that you can be assured there is enough of a statistical relationship between those two variables (a correlation method for a categorical versus an interval variable may be helpful).
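One possible way to run such a check, assuming the sentiment has been coded as 0 (negative) and 1 (positive): the Pearson correlation between a 0/1 variable and a continuous one is the point-biserial correlation. The numbers below are made up purely to show the call.

```python
import numpy as np

# Made-up illustration data: 0 = negative tweet, 1 = positive tweet.
sentiment = np.array([0, 0, 0, 1, 1, 1])
price = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Pearson r with a binary variable equals the point-biserial correlation.
r = np.corrcoef(sentiment, price)[0, 1]
print(round(r, 4))
```

A value near zero on your real data would suggest the sentiment feature carries little linear signal about price, although a random forest can still pick up non-linear effects.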
Mean absolute error is literally the average "distance" between your prediction and the "real" value. The mean squared error is the mean of the squared distance, and, as you saw from your code, the RMSE is the square root of the mean squared error.
In the case of the MAE, it is useful to "level" things by expressing it as a percentage or fraction: MAE/np.mean(y_test). Depending on the data you are using, you could instead divide by np.max(y_test) or np.min(y_test).
The MSE is less forgiving as it scales quadratically, so this basically will grow faster for every "unit of error".
As such, both the MSE and RMSE give more weight to larger errors. You can compare RMSE scores between runs, and improvements will be much more noticeable. I normally use RMSE as the scorer to minimize because, with MAE, small deviations may just be part of the randomness in the RF.
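To make these definitions concrete, here is a small sketch with made-up numbers; `y_test` and `y_pred` are stand-ins for your own arrays.

```python
import numpy as np

y_test = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_pred - y_test
mae = np.mean(np.abs(errors))       # average absolute distance
mse = np.mean(errors ** 2)          # average squared distance
rmse = np.sqrt(mse)                 # back in the units of y
mae_fraction = mae / np.mean(y_test)  # MAE "levelled" by the mean target

print(mae, mse, rmse, mae_fraction)
```

Here the RMSE comes out larger than the MAE because the single error of 40 is weighted more heavily once squared.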
I am using scipy.optimize.curve_fit to fit a sum of Gaussian functions to histogram data taken from an experiment. I can get a decent enough looking overall fit, with an acceptable combined root-mean-square error for the fit model as a whole, but I would like to do more with the individual root-mean-square error on each derived coefficient, as it propagates to later calculations. I am using the var_matrix returned from curve_fit to extract the RMSE for each individual parameter.
Is it possible to set a maximum error allowed per coefficient during fitting, e.g. 10% of a given coefficient's value? I have initial guess parameters and bounds for each coefficient, but my concern is that I have 21 fitted coefficients, and some of them have minuscule errors while others have errors larger than the coefficient value itself. This leads me to believe the fitting is done solely based on the total error of the model, which may be "shoving" a lot of error onto a given coefficient to make things easier for the other parameters.
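For reference, the per-parameter errors described above are usually taken as the square roots of the diagonal of the covariance matrix that curve_fit returns. The sketch below uses a single Gaussian and synthetic data instead of the 21-parameter model, and the 10% threshold is only checked after the fit; I am not aware of a curve_fit option that enforces such a cap during fitting.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amp, mu, sigma):
    """Single Gaussian peak (stand-in for the sum-of-Gaussians model)."""
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 200)
y = gaussian(x, 5.0, 2.0, 0.5) + 0.05 * rng.standard_normal(x.size)

popt, pcov = curve_fit(gaussian, x, y, p0=[4.0, 1.8, 0.6])

perr = np.sqrt(np.diag(pcov))   # one-sigma uncertainty per parameter
rel_err = perr / np.abs(popt)   # uncertainty relative to the fitted value
too_uncertain = rel_err > 0.10  # flags parameters above the 10% threshold

print(popt, perr, too_uncertain)
```

Large off-diagonal entries in `pcov` indicate strongly correlated parameters, which is one common reason for individual errors exceeding the coefficient values.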
I am aware of this parameter var_smoothing and how to tune it, but I'd like an explanation from a math/stats aspect that explains what tuning it actually does - I haven't been able to find any good ones online.
A Gaussian curve can act like a "low-pass" filter, letting only the samples close to its mean "pass". In the context of Naive Bayes, assuming a Gaussian distribution essentially gives more weight to the samples closer to the distribution mean. This may or may not be appropriate, depending on whether what you want to predict follows a normal distribution.
The var_smoothing parameter artificially adds a user-controlled amount to the distribution's variance (the variance itself is estimated from the training data set). This essentially widens (or "smooths") the curve, so that samples further away from the distribution mean are given more weight.
I have looked over the Scikit-learn repository and found the following code and statement:
# If the ratio of data variance between dimensions is too small, it
# will cause numerical errors. To address this, we artificially
# boost the variance by epsilon, a small fraction of the standard
# deviation of the largest dimension.
self.epsilon_ = self.var_smoothing * np.var(X, axis=0).max()
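The effect of that line can be reproduced with plain NumPy. This is only a sketch of the arithmetic on a made-up array; in the actual estimator, as far as I understand it, the variances are computed per class before epsilon is added.

```python
import numpy as np

# Made-up data: the first feature is constant, so its variance is exactly 0.
X = np.array([[1.0, 100.0],
              [1.0, 120.0],
              [1.0, 80.0],
              [1.0, 110.0]])

var_smoothing = 1e-9  # scikit-learn's default value
epsilon = var_smoothing * np.var(X, axis=0).max()

raw_var = np.var(X, axis=0)
smoothed_var = raw_var + epsilon  # every variance is now strictly positive

# A Gaussian log-density divides by the variance, so a zero variance
# would blow up; with smoothing the result stays finite.
x_new = np.array([1.1, 105.0])
mean = X.mean(axis=0)
log_density = (-0.5 * np.log(2 * np.pi * smoothed_var)
               - (x_new - mean) ** 2 / (2 * smoothed_var))

print(epsilon, smoothed_var, log_density)
```

Raising var_smoothing makes epsilon larger, which flattens the narrowest Gaussians the most.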
In statistics, a probability density function such as the Gaussian depends on sigma^2 (the variance). Naive Bayes assumes that the features are independent of one another, so it is a better estimator the less correlated the features are.
However, in terms of computation, it is very common in machine learning that vectors with very high or very low values, or floating-point operations on them, produce errors such as "ValueError: math domain error". This extra variable may therefore serve as an adjustable safeguard in case some type of numerical error occurs.
It would be interesting to explore whether this value can also be used for further control, such as avoiding over-fitting, since epsilon_ is added to the variance (sigma^2).
I'm using python's statsmodels package to do linear regressions. Among the output of R^2, p, etc there is also "log-likelihood". In the docs this is described as "The value of the likelihood function of the fitted model." I've taken a look at the source code and don't really understand what it's doing.
Reading more about likelihood functions, I still have very fuzzy ideas of what this 'log-likelihood' value might mean or be used for. So a few questions:
Isn't the value of likelihood function, in the case of linear regression, the same as the value of the parameter (beta in this case)? It seems that way according to the following derivation leading to equation 12: http://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/13mlreg.pdf
What's the use of knowing the value of the likelihood function? Is it to compare with other regression models with the same response and a different predictor? How do practical statisticians and scientists use the log-likelihood value spit out by statsmodels?
Likelihood (and by extension log-likelihood) is one of the most important concepts in statistics. It's used for everything.
For your first point, likelihood is not the same as the value of the parameter. Likelihood is the probability density of the observed data under the model, evaluated at a given set of parameter estimates. It's calculated by taking a set of parameter estimates, calculating the probability density of each observation, and then multiplying the probability densities for all the observations together (this follows from probability theory in that P(A and B) = P(A)P(B) if A and B are independent). In practice, what this means for linear regression, and what that derivation shows, is that you take a set of parameter estimates (beta, sd), plug them into the normal pdf, and then calculate the density for each observation y at that set of parameter estimates. Then you multiply them all together. Typically, we choose to work with the log-likelihood because it's easier to calculate: instead of multiplying we can sum (log(a*b) = log(a) + log(b)), which is computationally faster and avoids numerical underflow from multiplying many small densities. Also, we tend to minimize the negative log-likelihood (instead of maximizing the positive), because optimization routines are conventionally written as minimizers.
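The calculation described above can be sketched in a few lines of NumPy. The data is synthetic, and the variance estimate divides by n (the maximum likelihood estimate). To my understanding, the resulting number is the same quantity statsmodels reports as the log-likelihood of an OLS fit, but treat that as an assumption and check it against your own output.

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.standard_normal(n)

# Least-squares estimates of beta (intercept and slope).
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = np.mean(resid ** 2)  # MLE of the error variance (divides by n)

# Log of the normal pdf for each observation, then summed.
log_density = -0.5 * np.log(2 * np.pi * sigma2) - resid ** 2 / (2 * sigma2)
log_lik = log_density.sum()

# Closed form of the same sum at the maximum likelihood estimates.
closed_form = -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2) + 1.0)

print(log_lik, closed_form)
```

The value on its own has no absolute meaning; it becomes useful when compared across models fitted to the same response.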
To answer your second point, log-likelihood is used for almost everything. It's the basic quantity that we use to find parameter estimates (Maximum Likelihood Estimates) for a huge suite of models. For simple linear regression, these estimates turn out to be the same as those for least squares, but for more complicated models least squares may not work. It's also used to calculate AIC, which can be used to compare models with the same response and different predictors (AIC penalizes the number of parameters, because adding parameters improves the in-sample fit regardless of whether they are useful).
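As an illustration of the AIC comparison, here is a sketch on synthetic data, using AIC = 2k - 2 * log-likelihood. Counting only the regression coefficients in k is a convention I chose for this example; some software also counts the error variance, which shifts every AIC by the same constant and does not change the ranking.

```python
import numpy as np

def gaussian_loglik(X, y):
    """Maximized Gaussian log-likelihood of a least-squares fit."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1.0)

def aic(X, y):
    """AIC with k = number of regression coefficients (assumed convention)."""
    k = X.shape[1]
    return 2 * k - 2 * gaussian_loglik(X, y)

# Synthetic data: y really does depend on x.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.standard_normal(n)

X_const = np.ones((n, 1))                # intercept-only model
X_lin = np.column_stack([np.ones(n), x])  # intercept + slope

aic_const = aic(X_const, y)
aic_lin = aic(X_lin, y)
print(aic_const, aic_lin)  # the lower AIC is preferred
```

The larger model always has a log-likelihood at least as high as the nested smaller one; AIC only prefers it when that gain outweighs the 2-per-parameter penalty.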