I've implemented gradient descent in Python to perform regularized polynomial regression with the MSE as the loss function, but on linear data (to demonstrate the role of the regularization).
So my model has the form:

$$\hat{y}(x) = w_0 + w_1 x + w_2 x^2$$

And in my loss function, R represents the regularization term, weighted by a factor lambda:

$$L(w) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}(x_i) - y_i\right)^2 + \lambda\, R(w)$$

Taking the squared L2-norm as our regularization, R(w) = sum_j w_j^2, the partial derivatives of the loss function with respect to each coefficient w_j are:

$$\frac{\partial L}{\partial w_j} = \frac{2}{N}\sum_{i=1}^{N}\left(\hat{y}(x_i) - y_i\right)x_i^{\,j} + 2\lambda w_j$$

Finally, the coefficients w_j are updated using a constant learning rate mu:

$$w_j \leftarrow w_j - \mu\,\frac{\partial L}{\partial w_j}$$
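In code, the setup above corresponds roughly to the following sketch (the data and the hyperparameter values are made up for illustration):

import numpy as np

# made-up linear data and a degree-2 design matrix with columns [1, x, x^2]
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 0.5 + 0.05 * rng.normal(size=100)
X = np.column_stack([np.ones_like(x), x, x ** 2])

w = np.zeros(3)        # [w0, w1, w2]
mu, lam = 0.1, 0.01    # constant learning rate and regularization strength (illustrative)

for _ in range(10_000):
    err = X @ w - y                                   # (y_hat - y)
    grad = (2 / len(y)) * (X.T @ err) + 2 * lam * w   # MSE gradient plus the L2 term on every w_j
    w -= mu * grad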
The problem is that I'm unable to make it converge, because the regularization penalizes both the degree-2 coefficient (w2) and the degree-1 coefficient (w1) of the polynomial, while in my case I want it to penalize only the former, since the data is linear.
Is it possible to achieve this, as both LassoCV and RidgeCV implemented in Scikit-learn are able to do it? Or is there a mistake in my equations given above?
I suspect that a constant learning rate (mu) could be problematic too; what's a simple formula to make it adaptive?
I ended up using coordinate descent, as described in this tutorial, to which I added a regularization term (L1 or L2). After a relatively large number of iterations, w2 was almost zero (and the predicted model was therefore linear).
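For reference, here is a minimal sketch of coordinate descent with an L1 (lasso-style, soft-thresholding) penalty on the non-intercept coefficients; the data and penalty strength are made up, and the tutorial's actual code may differ:

import numpy as np

def soft_threshold(a, t):
    # soft-thresholding operator used by the L1 coordinate update
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha=0.1, n_iters=1000):
    # minimizes (1/(2n)) * ||y - Xw||^2 + alpha * sum_{j>0} |w_j|;
    # column 0 of X is treated as the intercept and left unpenalized
    n, d = X.shape
    w = np.zeros(d)
    z = (X ** 2).sum(axis=0)                          # per-feature squared norms
    for _ in range(n_iters):
        for j in range(d):
            residual = y - X @ w + X[:, j] * w[j]     # residual excluding feature j
            rho = X[:, j] @ residual
            if j == 0:
                w[j] = rho / z[j]                     # intercept: plain least-squares update
            else:
                w[j] = soft_threshold(rho, n * alpha) / z[j]
    return w

# made-up linear data with degree-2 polynomial features [1, x, x^2]
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + 0.5 + 0.05 * rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x, x ** 2])
print(lasso_coordinate_descent(X, y, alpha=0.1))      # w2 is driven to (almost) zero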
I have a question: how exactly do the L1 and L2 regularization terms on weights work in the XGBoost algorithm?
As I understand it, L1 is used by LASSO and L2 by ridge regression, and L1 can shrink coefficients to 0 while L2 can't. I understand the mechanics for simple linear regression, but I have no clue how this works in tree-based models.
Furthermore, gamma is another parameter that makes the model more conservative. How should I think about the difference between the L1/L2 and gamma parameters?
I have found very little on this in the documentation:
lambda [default=1, alias: reg_lambda]
    L2 regularization term on weights. Increasing this value will make model more conservative.

alpha [default=0, alias: reg_alpha]
    L1 regularization term on weights. Increasing this value will make model more conservative.

gamma [default=0, alias: min_split_loss]
    Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
All of them range from 0 to inf.
Thanks in advance for any answer/comment!
I have found this blog post on gamma regularization. It's the best I have found in 2 months. I'm still searching for answers, but the lack of information is huge.
I am wondering about this as well. The best thing I have found is the original XGBoost paper. Section 2.1 makes it sound as if XGBoost uses regression trees as the main building block for both regression and classification. If this is correct, then alpha and lambda probably work in the same way as they do in linear regression.
Gamma controls how deep the trees will be. A large gamma means a large hurdle to adding another tree level, so a larger gamma regularizes the model by growing shallower trees. E.g., a depth-2 tree has a smaller range of predicted values than a depth-10 tree, so such a model will have lower variance.
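For concreteness, here is a minimal sketch of where the three knobs plug in when using the scikit-learn style XGBoost API (the data and the particular parameter values are made up):

import numpy as np
import xgboost as xgb

# made-up regression data, just to show where the parameters go
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=6,
    reg_alpha=1.0,   # alpha: L1 penalty on the leaf weights
    reg_lambda=1.0,  # lambda: L2 penalty on the leaf weights
    gamma=0.5,       # min_split_loss: minimum gain required before a split is made
)
model.fit(X, y)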
For regression problems, RMSE (root mean square error) is often used as the evaluation metric. It is also used as the loss function in linear regression (what's more, it is equivalent to maximum likelihood estimation when the output is assumed to follow a normal distribution).
In real-life problems, I find that MAPE (mean absolute percentage error) can be more meaningful. For example, when predicting house prices, we are more interested in the relative error, because a difference of $100k is not the same when the house is priced around $100k as when it is priced around $1M.
When creating a linear regression model for a house price prediction problem, I obtained the following graph:
x-axis: real value of the prices
y-axis: relative error = (prediction - real_value) / real_value
The algorithm predicts relatively much higher prices when the real price is low, and relatively lower prices when the real price is high.
What kind of transformation can we apply in order to obtain a model with more homogeneous relative errors?
Sure, one method of obtaining the maximum likelihood estimator is by gradient descent. In this process, the error between the predicted and actual values is determined, and the gradient of this error with respect to each of the changeable parameters of the model is found. Then, these parameters are tweaked slightly according to the calculated gradients such that the error value would be minimized. This process is repeated until the error converges to a suitably low value.
The great thing about this method is that your error or loss function has a lot of flexibility in how you define it. For instance, the L2 norm (MSE) is often used, but you can also use the L1 norm, a smooth-L1 norm, or any other suitable function.
An error function that yields MAPE would simply divide each error term by the true value's magnitude, thus yielding an error relative to the value's size. Then, the gradient of this error can be calculated with respect to each parameter and gradient descent carried out as before.
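As a rough illustration (the data, learning rate, and iteration count below are made up), a subgradient-descent loop on such a MAPE-style loss for a linear model could look like this:

import numpy as np

# made-up price-like data; design matrix columns [1, x]
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 200)
y = 50.0 * x * np.exp(0.05 * rng.normal(size=200))
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)
mu = 0.5   # constant step size; a decaying schedule would give a tighter fit

for _ in range(20_000):
    err = X @ w - y
    # subgradient of mean(|err| / |y|) with respect to w
    grad = X.T @ (np.sign(err) / np.abs(y)) / len(y)
    w -= mu * grad

print(w)   # a linear fit whose errors are weighted relative to the true values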
Comment if there's any part of this that is unclear or needs more explanation!
[Histogram of the 'Sud grenoblois / Vif PM10' predictor showing a decaying exponential distribution]
I would like to make a linear regression model.
My predictor variable, 'Sud grenoblois / Vif PM10', has a decaying exponential distribution, as you can see on the graph. As far as I know, regression assumes a normal distribution of the predictor. Should I use a transformation of the variable or another type of regression?
You can apply a logarithm transformation to reduce right skewness. When the tail is to the right of the data, common transformations include the square root, cube root, and logarithm.
I don't think you need to change your regression model.
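If you do want to try a transformation, a minimal sketch (with simulated data standing in for the PM10 series) could look like this:

import numpy as np
import statsmodels.api as sm

# simulated right-skewed predictor standing in for 'Sud grenoblois / Vif PM10'
rng = np.random.default_rng(0)
pm10 = rng.exponential(scale=20.0, size=300)
y = 2.0 * np.log1p(pm10) + rng.normal(scale=0.3, size=300)

# log1p = log(x + 1): compresses the long right tail and tolerates zeros
X = sm.add_constant(np.log1p(pm10))
print(sm.OLS(y, X).fit().params)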
I'm using Python's statsmodels package to do linear regressions. Among the outputs of R^2, p-values, etc., there is also "log-likelihood". In the docs this is described as "The value of the likelihood function of the fitted model." I've taken a look at the source code and don't really understand what it's doing.
Reading more about likelihood functions, I still have very fuzzy ideas of what this 'log-likelihood' value might mean or be used for. So a few questions:
Isn't the value of the likelihood function, in the case of linear regression, the same as the value of the parameter (beta in this case)? It seems that way according to the derivation leading to equation 12 here: http://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/13mlreg.pdf
What's the use of knowing the value of the likelihood function? Is it to compare with other regression models with the same response and a different predictor? How do practical statisticians and scientists use the log-likelihood value spit out by statsmodels?
Likelihood (and by extension log-likelihood) is one of the most important concepts in statistics. It's used for everything.
For your first point, the likelihood is not the same as the value of the parameter. The likelihood is the likelihood of the entire model given a set of parameter estimates. It's calculated by taking a set of parameter estimates, computing the probability density of each observation under those estimates, and then multiplying the densities of all the observations together (this follows from probability theory, in that P(A and B) = P(A)P(B) if A and B are independent). In practice, what this means for linear regression, and what that derivation shows, is that you take a set of parameter estimates (beta, sd), plug them into the normal pdf, and calculate the density of each observation y at that set of parameter estimates. Then you multiply them all together.

Typically we work with the log-likelihood because it's easier to compute: instead of multiplying we can sum (log(a*b) = log(a) + log(b)), which is faster and avoids numerical underflow. Also, we tend to minimize the negative log-likelihood (instead of maximizing the positive), because optimizers sometimes work better on minimization than on maximization.
To answer your second point, log-likelihood is used for almost everything. It's the basic quantity we use to find parameter estimates (maximum likelihood estimates) for a huge suite of models. For simple linear regression, these estimates turn out to be the same as those from least squares, but for more complicated models least squares may not work. Log-likelihood is also used to calculate AIC, which can be used to compare models with the same response and different predictors (it penalizes the number of parameters, because more parameters always give a better in-sample fit).
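As a quick illustration (with simulated data), you can reproduce the llf value reported by statsmodels by summing the normal log-densities of the observations at the fitted values, using the maximum-likelihood estimate of sigma:

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# simulated data for an ordinary least squares fit
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.5 * x + 2.0 + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()

# rebuild the log-likelihood by hand: normal log-density of each observation
# at its fitted value, with sigma estimated by ML (denominator n, not n - k)
resid = y - res.fittedvalues
sigma_ml = np.sqrt(np.mean(resid ** 2))
loglik = norm.logpdf(y, loc=res.fittedvalues, scale=sigma_ml).sum()

print(res.llf, loglik)   # the two values should agree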
I have implemented the Perceptron Learning Algorithm in Python as below. Even with 500,000 iterations, it still won't converge.
I have a training data matrix X with target vector Y, and a weight vector w to be optimized.
My update rule is:
import numpy as np
import random

# X: training data matrix, Y: vector of +1/-1 targets, w: weight vector to optimize
exist_mistakes = True
while exist_mistakes:
    # dot product to check for mistakes
    output = [np.sign(np.dot(X[i], w)) == Y[i] for i in range(len(X))]
    exist_mistakes = not all(output)
    if exist_mistakes:
        # find the index of a mistake (chosen randomly to avoid repeating the same index)
        n = random.randint(0, len(X) - 1)
        while output[n]:  # if the prediction is correct here, choose again
            n = random.randint(0, len(X) - 1)
        # once we have found a mistake, update
        w = w + Y[n] * X[n]
Is this wrong? Or why is it not converging even after 500,000 iterations?
Perceptrons by Minsky and Papert (in)famously demonstrated in 1969 that the perceptron learning algorithm is not guaranteed to converge for datasets that are not linearly separable.
If you're sure that your dataset is linearly separable, you might try adding a bias to each of your data vectors, as described by the question: Perceptron learning algorithm not converging to 0 -- adding a bias can help model decision boundaries that do not pass through the origin.
Alternatively, if you'd like to use a variant of the perceptron learning algorithm that is guaranteed to converge to a margin of specified width, even for datasets that are not linearly separable, have a look at the Averaged Perceptron -- PDF. The averaged perceptron is an approximation to the voted perceptron, which was introduced (as far as I know) in a nice paper by Freund and Schapire, "Large Margin Classification Using the Perceptron Algorithm" -- PDF.
Using an averaged perceptron, you make a copy of the parameter vector after each presentation of a training example during training. The final classifier uses the mean of all parameter vectors.
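A minimal sketch of that idea (assuming X already includes a bias column and y takes values in {-1, +1}; the function name and epoch count are just illustrative):

import numpy as np

def averaged_perceptron(X, y, n_epochs=10):
    # standard perceptron updates, but keep a running sum of the weight vector
    # after every presented example; the final classifier uses the average
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(n_epochs):
        for i in range(n):
            if np.sign(X[i] @ w) != y[i]:   # mistake: apply the usual update
                w = w + y[i] * X[i]
            w_sum += w                      # accumulate after *every* example
    return w_sum / (n_epochs * n)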
The basic issue is that randomly chosen points are not necessarily linearly classifiable.
However, there is a worse problem in the algorithm:
Even if you go by a good reference like Vapnik's "Statistical Learning Theory", you are not shown the biggest problem in the OP's algorithm. The issue is NOT the learning rate parameter, either. It is trivial to prove that the learning rate has no real effect on whether or not the algorithm converges: because it is simply a positive scalar, changing it (with the weights started at zero) merely rescales w, which leaves every sign(w·x) unchanged.
Imagine a set of four points to classify. The four points are orthogonal vectors of length sqrt(2). The "in class" point is (-1,1). The "out of class" points are (-1,-1), (1,1), and (1,-1). Regardless of any learning rate, the OP's original algorithm will never converge.
The reason why the original poster's algorithm fails to converge is the missing bias term (effectively the coefficient of the 0th-dimensional term), which MUST augment the other dimensional terms. Without the bias term, the perceptron is not completely defined. This is trivial to show by modeling a perceptron by hand in 1D or 2D space.
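As a quick check, augmenting each data vector with a constant 1 (so that the last weight acts as the bias) lets the perceptron converge on the four points above; a small sketch:

import numpy as np
import random

# the four points from the example: not separable by a hyperplane through
# the origin, but separable once a bias term is added
X = np.array([[-1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
Y = np.array([1.0, -1.0, -1.0, -1.0])

Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant-1 component
w = np.zeros(Xb.shape[1])                   # w[-1] now plays the role of the bias

for _ in range(1000):
    mistakes = [i for i in range(len(Xb)) if np.sign(Xb[i] @ w) != Y[i]]
    if not mistakes:
        break
    n = random.choice(mistakes)
    w = w + Y[n] * Xb[n]

print(w)   # a separating (w1, w2, bias) for these four points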
The bias term is often hand-waved in the literature as a way to "shift" the hyperplane, but that seems to be mostly because many people teach and learn perceptrons in 2D space. "Shifting" doesn't provide an adequate explanation of why the bias term is needed in high-dimensional spaces (what does it mean to "shift" a 100-dimensional hyperplane?).
One may notice that the literature proving the mean convergence time of the perceptron excludes the bias term. This is because the perceptron equation can be simplified if you assume that the perceptron will converge (see Vapnik, 1998, Wiley, p. 377). That is a big (but necessary) assumption for the proof, but one cannot adequately implement a perceptron from such an incomplete formulation.
Albert B. J. Novikoff's 1962/1963 proofs of perceptron convergence do include this zero-dimensional (bias) term.