Nevegrad package optimization with suggest and ask

Nevegrad package optimization with suggest and ask - python

I was trying to optimise a curve fitting of 20th order and find out optimal value of coefficient with CMA optimization. As a initial iteration, I was suggesting coefficient values obtained by numpy.polyfit function.
The objective was to reduce the error (which was defined by me as two error metrics). With suggested initial values, nevegrad optimizer computed the values of coefficient and error was very small.
But from next iteration, optimize suggested very high value of coefficient , resulting in wrong prediction of data of curve and high error value for all next iterations. Hence nevegrad took lowest error value of first initial value of coefficient as best value and recommended that as solution of optimization.
Please suggest why it is not giving coefficient of small value from next iteration after initial suggestions

Related

How do I interpret my Random Forest Regression accuracy data?

I have a dataset to analyze crypto prices against Tweet sentiment and I'm using random forest regression. Are the rates I'm getting good or bad? How do I interpret them?

Your rmse is about 100 where the error is not big compare to average coin price 4400. I think you can work on to get more generalized or accurate prediction. Maybe you can validate your model with other data as well.
Yet it really depends on the goal you want. If the aim is to do HFT, 2% error would very huge. If your aim is to set RF model as base, I think it is a good way to start off.
Though it is a prediction task, it maybe necessary to check the correlation between Tweet and crypto price first so that you can be assured that there is enough statistical relationship between those 2 variables(correlation method for categorical vs interval variable may helpful).

Mean absolute error is literally the average "distance" between your prediction and the "real" value. The mean squared error is the mean of the distance squared. and as you saw from your code the RMSE is the square root of the mean square error.
In the case of the MAE its usefull to "level" things. how? Percentage or fraction. MAE/np.mean(y_test) but depending on the data you are using you could use np.max(y_test) or np.min(y_test).
The MSE is less forgiving as it scales quadratically, so this basically will grow faster for every "unit of error".
As such, both the MSE and RMSE give more weight to larger errors. Normally you can compare RMSE scores between runs and improvements will be much more noticeable, I normally use RMSE as a scorer to minimize since when you use MAE, small deviations may be just part of the randomness in the RF.

Curve_fit - Set maximum allowed error for a given coefficient returned during fitting for a model function

I am using scipy.curve_fit to fit a sum of gaussian functions to histogram data taken from an experiment. I can get a decent enough looking overall fit, with an acceptable combined Root-Mean-Sqaure-Error for the fit model as a whole, but I would like to do more with the individual root mean square error on each derived coefficient as it propagates to later calculations. I am using the var_matrix returned from curve_fit to extract the RMSE for each individual parameter.
Is it possible to set a maximum error allowed per coefficient during fitting e.g. 10% of a given coefficients value. I have initial guess parameters and bounds for each coefficient, but My concern is I have 21 fitted coefficients and some of them have miniscule error while others have errors larger than the coefficient value it's self which leads me to believe the fitting is done solely based on the total error of the model, which may be "shoving" a lot of error on a given coefficient to make it easier for the other parameters.

Optimize Relative Error in regression problems

When talking about regression problems, RMSE (Root Mean Square Error) is often used as the evaluation metric. And it is also used as the loss function in linear regression (what's more? it is equivalent to the Maximum Likelihood Method considering the distribution of the output follows a normal distribution).
In real-life problems, I find the MAPE (Mean absolute percentage error) can be more meaningful. For example, when prediction house prices, we are more interested in the relative error. Because a difference of 100k$ is not the same if the house is priced around 100k$ or 1M$.
When creating a linear regression for a house price prediction problem, I found this following graph
x axis: real value of prices
y axis: relative error = (prediction-real_value) / real_value
The algorithm predicts relatively much higher prices when the real price is low
The algorithm predicts relatively lower prices when the real price is high.
What kind of transformation can we do, in order to find a better algorithm that would have more homogeneous relative errors.

Sure, one method of obtaining the maximum likelihood estimator is by gradient descent. In this process, the error between the predicted and actual values is determined, and the gradient of this error with respect to each of the changeable parameters of the model is found. Then, these parameters are tweaked slightly according to the calculated gradients such that the error value would be minimized. This process is repeated until the error converges to a suitably low value.
The great thing about this method is that your error or loss function has a lot of flexibility in how you define it. For instance, L2 (MSE) norm is often used, but you can also use L1 norm, smooth-L1 norm, or any other function.
An error function that yields MAPE would simply divide each error term by the true value's magnitude, thus yeilding error relative to value size. Then, the gradients of this error can be calculated with respect to each parameter and gradient descent can be carried out as before.
Comment if there's any part of this that is unclear or needs more explanation!

ConvergenceWarning: Liblinear failed to converge, increase the number of iterations

Running the code of linear binary pattern for Adrian. This program runs but gives the following warning:
C:\Python27\lib\site-packages\sklearn\svm\base.py:922: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning
I am running python2.7 with opencv3.7, what should I do?

Normally when an optimization algorithm does not converge, it is usually because the problem is not well-conditioned, perhaps due to a poor scaling of the decision variables. There are a few things you can try.
Normalize your training data so that the problem hopefully becomes more well
conditioned, which in turn can speed up convergence. One
possibility is to scale your data to 0 mean, unit standard deviation using
Scikit-Learn's
StandardScaler
for an example. Note that you have to apply the StandardScaler fitted on the training data to the test data. Also, if you have discrete features, make sure they are transformed properly so that scaling them makes sense.
Related to 1), make sure the other arguments such as regularization
weight, C, is set appropriately. C has to be > 0. Typically one would try various values of C in a logarithmic scale (1e-5, 1e-4, 1e-3, ..., 1, 10, 100, ...) before finetuning it at finer granularity within a particular interval. These days, it probably make more sense to tune parameters using, for e.g., Bayesian Optimization using a package such as Scikit-Optimize.
Set max_iter to a larger value. The default is 1000. This should be your last resort. If the optimization process does not converge within the first 1000 iterations, having it converge by setting a larger max_iter typically masks other problems such as those described in 1) and 2). It might even indicate that you have some in appropriate features or strong correlations in the features. Debug those first before taking this easy way out.
Set dual = True if number of features > number of examples and vice versa. This solves the SVM optimization problem using the dual formulation. Thanks #Nino van Hooff for pointing this out, and #JamesKo for spotting my mistake.
Use a different solver, for e.g., the L-BFGS solver if you are using Logistic Regression. See #5ervant's answer.
Note: One should not ignore this warning.
This warning came about because
Solving the linear SVM is just solving a quadratic optimization problem. The solver is typically an iterative algorithm that keeps a running estimate of the solution (i.e., the weight and bias for the SVM).
It stops running when the solution corresponds to an objective value that is optimal for this convex optimization problem, or when it hits the maximum number of iterations set.
If the algorithm does not converge, then the current estimate of the SVM's parameters are not guaranteed to be any good, hence the predictions can also be complete garbage.
Edit
In addition, consider the comment by #Nino van Hooff and #5ervant to use the dual formulation of the SVM. This is especially important if the number of features you have, D, is more than the number of training examples N. This is what the dual formulation of the SVM is particular designed for and helps with the conditioning of the optimization problem. Credit to #5ervant for noticing and pointing this out.
Furthermore, #5ervant also pointed out the possibility of changing the solver, in particular the use of the L-BFGS solver. Credit to him (i.e., upvote his answer, not mine).
I would like to provide a quick rough explanation for those who are interested (I am :)) why this matters in this case. Second-order methods, and in particular approximate second-order method like the L-BFGS solver, will help with ill-conditioned problems because it is approximating the Hessian at each iteration and using it to scale the gradient direction. This allows it to get better convergence rate but possibly at a higher compute cost per iteration. That is, it takes fewer iterations to finish but each iteration will be slower than a typical first-order method like gradient-descent or its variants.
For e.g., a typical first-order method might update the solution at each iteration like
x(k + 1) = x(k) - alpha(k) * gradient(f(x(k)))
where alpha(k), the step size at iteration k, depends on the particular choice of algorithm or learning rate schedule.
A second order method, for e.g., Newton, will have an update equation
x(k + 1) = x(k) - alpha(k) * Hessian(x(k))^(-1) * gradient(f(x(k)))
That is, it uses the information of the local curvature encoded in the Hessian to scale the gradient accordingly. If the problem is ill-conditioned, the gradient will be pointing in less than ideal directions and the inverse Hessian scaling will help correct this.
In particular, L-BFGS mentioned in #5ervant's answer is a way to approximate the inverse of the Hessian as computing it can be an expensive operation.
However, second-order methods might converge much faster (i.e., requires fewer iterations) than first-order methods like the usual gradient-descent based solvers, which as you guys know by now sometimes fail to even converge. This can compensate for the time spent at each iteration.
In summary, if you have a well-conditioned problem, or if you can make it well-conditioned through other means such as using regularization and/or feature scaling and/or making sure you have more examples than features, you probably don't have to use a second-order method. But these days with many models optimizing non-convex problems (e.g., those in DL models), second order methods such as L-BFGS methods plays a different role there and there are evidence to suggest they can sometimes find better solutions compared to first-order methods. But that is another story.

I reached the point that I set, up to max_iter=1200000 on my LinearSVC classifier, but still the "ConvergenceWarning" was still present. I fix the issue by just setting dual=False and leaving max_iter to its default.
With LogisticRegression(solver='lbfgs') classifier, you should increase max_iter. Mine have reached max_iter=7600 before the "ConvergenceWarning" disappears when training with large dataset's features.

Explicitly specifying the max_iter resolves the warning as the default max_iter is 100. [For Logistic Regression].
logreg = LogisticRegression(max_iter=1000)

Please incre max_iter to 10000 as default value is 1000. Possibly, increasing no. of iterations will help algorithm to converge. For me it converged and solver was -'lbfgs'
log_reg = LogisticRegression(solver='lbfgs',class_weight='balanced', max_iter=10000)

Statsmodels OLS Regression: Log-likelihood, uses and interpretation

I'm using python's statsmodels package to do linear regressions. Among the output of R^2, p, etc there is also "log-likelihood". In the docs this is described as "The value of the likelihood function of the fitted model." I've taken a look at the source code and don't really understand what it's doing.
Reading more about likelihood functions, I still have very fuzzy ideas of what this 'log-likelihood' value might mean or be used for. So a few questions:
Isn't the value of likelihood function, in the case of linear regression, the same as the value of the parameter (beta in this case)? It seems that way according to the following derivation leading to equation 12: http://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/13mlreg.pdf
What's the use of knowing the value of the likelihood function? Is it to compare with other regression models with the same response and a different predictor? How do practical statisticians and scientists use the log-likelihood value spit out by statsmodels?

Likelihood (and by extension log-likelihood) is one of the most important concepts in statistics. Its used for everything.
For your first point, likelihood is not the same as the value of the parameter. Likelihood is the likelihood of the entire model given a set of parameter estimates. It's calculated by taking a set of parameter estimates, calculating the probability density for each one, and then multiplying the probability densities for all the observations together (this follows from probability theory in that P(A and B) = P(A)P(B) if A and B are independent). In practice, what this means for linear regression and what that derivation shows, is that you take a set of parameter estimates (beta, sd), plug them into the normal pdf, and then calculate the density for each observation y at that set of parameter estimates. Then, multiply them all together. Typically, we choose to work with the log-likelihood because it's easier to calculate because instead of multiplying we can sum (log(a*b) = log(a) + log(b)), which is computationally faster. Also, we tend to minimize the negative log-likelihood (instead of maximizing the positive), because optimizers sometimes work better on minimization than maximization.
To answer your second point, log-likelihood is used for almost everything. It's the basic quantity that we use to find parameter estimates (Maximum Likelihood Estimates) for a huge suite of models. For simple linear regression, these estimates turn out to be the same as those for least squares, but for more complicated models least squares may not work. It's also used to calculate AIC, which can be used to compare models with the same response and different predictors (but penalizes on parameter numbers, because more parameters = better fit regardless).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.