I generate data from a simple linear model in which the X variables (dimension D) come from a multivariate normal distribution with zero covariance between variables. Only the first 10 variables have true coefficients of 1; the rest have coefficients of 0. Hence, theoretically, the ridge regression estimates should be the true coefficients divided by (1+C), where C is the penalty constant.
import numpy as np
from sklearn import linear_model

def generate_data(n):
    d = 100
    w = np.zeros(d)
    for i in range(0, 10):
        w[i] = 1.0
    trainx = np.random.normal(size=(n, d))
    e = np.random.normal(size=(n))
    trainy = np.dot(trainx, w) + e
    return trainx, trainy
Then I use:
n = 200
x,y = generate_data(n)
regr = linear_model.Ridge(alpha=4, normalize=True)
regr.fit(x, y)
print(regr.coef_[0:20])
Under normalize=True, the first 10 coefficients come out at roughly 20% (i.e. 1/(1+4)) of the true value of 1. When normalize=False, the first 10 coefficients come out around 1, which are the same results as a simple linear regression model. Moreover, since I generate the data with mean 0 and standard deviation 1, normalize=True shouldn't do anything, as the data is already "normalized". Can someone explain to me what is going on here? Thanks!
It's important to understand that normalizing and standardizing are not the same thing, and they cannot both be done at the same time: you either normalize or you standardize.
Standardizing usually refers to transforming the data so that it has mean 0 and unit (1) variance, e.g. by removing the mean and dividing by the standard deviation. In this case that would be done feature-wise (per column).
Normalizing commonly refers to transforming the data values to a range between 0 and 1, e.g. by dividing by the length (l2-norm) of the vector. But that does not mean that the mean becomes 0 and the variance 1.
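A quick column-wise illustration of the difference with plain numpy (just a sketch):
import numpy as np

X = np.random.normal(loc=5.0, scale=2.0, size=(200, 3))

X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)  # mean 0 and variance 1 per column
X_normalized = X / np.linalg.norm(X, axis=0)           # unit l2-norm per column

print(X_standardized.mean(axis=0), X_standardized.std(axis=0))  # ~0 and ~1 per column
print(np.linalg.norm(X_normalized, axis=0))  # ~1 per column, but the mean/variance are not 0/1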
After generating trainx and trainy, they are not normalized yet. Maybe print them to see for yourself.
So, when normalize=True, trainx will be normalized by subtracting the mean and dividing by the l2-norm (according to sklearn).
When normalize=False, trainx will remain as is.
If you set normalize=True, every feature column is divided by its l2-norm; in other words, the magnitude of every feature column is reduced, which forces the estimated coefficients to be larger (Xβ should stay roughly constant; the smaller X, the larger β). When the coefficients are larger, a greater L2 penalty is imposed. The objective thus puts more weight on the L2 penalty than on the linear part (Xβ), and the coefficient estimates are therefore not as accurate as with pure linear regression.
By contrast, with normalize=False, X is larger and β is smaller. Given the same alpha, the L2 penalty is marginal and the focus stays on the linear part, so the result is close to a pure linear regression.
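A quick numeric check of this argument (a sketch using the closed-form ridge solution, not sklearn's exact preprocessing):
import numpy as np

np.random.seed(0)
n, d, alpha = 200, 100, 4.0
w = np.zeros(d)
w[:10] = 1.0
X = np.random.normal(size=(n, d))
y = X @ w + np.random.normal(size=n)

def ridge(X, y, alpha):
    # beta = (X'X + alpha*I)^-1 X'y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

norms = np.linalg.norm(X, axis=0)  # column l2-norms, roughly sqrt(n)
print(ridge(X, y, alpha)[:3])  # close to 1: the penalty is marginal
print((ridge(X / norms, y, alpha) / norms)[:3])  # mapped back to the original scale: around 0.2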
I am very new to time series modeling and statsmodels and am trying to understand the AR model in statsmodels. Suppose I have a data record y of 1000 samples, and I fit an AR(1) model to y. Then I generate the in-sample prediction from this model as y_pred. I do this as
from statsmodels.tsa.ar_model import AutoReg
model = AutoReg(y,1).fit()
y_pred = model.predict()
I get the parameters of the model using model.params.
I would like to know, after estimating the model parameters, how does statsmodels calculate the in-sample predictions? For example, how is y_pred[10] calculated?
I am sorry if the question is too basic, thanks for the help.
Per Wikipedia:
The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term).
In your model example, you have one predictor: the lagged value of y. In this simple case, the .predict() method multiplies each lagged value by the estimated slope parameter for that predictor and adds the estimated intercept. Since the in-sample predictions start at the second observation, y_pred[10] is the prediction for y[11] and equals the fitted slope parameter times y[10], with the intercept estimate added.
Here is an example:
from statsmodels.tsa.ar_model import AutoReg
y = [1, 2, 3, 6, 2, 9, 1]
model = AutoReg(y,1).fit()
model.params
# array([ 5.72953737, -0.49466192])
The first value in the params array is the estimated intercept parameter and the second value is the estimated linear (slope) parameter.
y_pred = model.predict()
y_pred
# array([5.23487544, 4.74021352, 4.2455516 , 2.76156584, 4.74021352, 1.27758007])
The first value in the y_pred array is the predicted value for the second value in the y array. It is calculated as:
-0.49466192 * 1 + 5.72953737 = 5.23487544
The second value in the y_pred array is computed as:
-0.49466192 * 2 + 5.72953737 = 4.74021353
and so on...
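To check this mechanically (a small sketch; for a single lag, params is [intercept, slope] and predict() starts at the second observation):
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

y = np.array([1, 2, 3, 6, 2, 9, 1], dtype=float)
res = AutoReg(y, 1).fit()
intercept, slope = res.params
manual = intercept + slope * y[:-1]  # each prediction uses the previous observed value
print(np.allclose(manual, res.predict()))  # should print True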
I am getting some weird results when playing around with cross validation, and I would greatly appreciate any comments.
Briefly, I get a lower mean squared error (MSE) when doing least-squares regression with cross-validation (CV) than when using the "ground truth weights" that I used to generate the data.
Note, however, that I compute the MSE on noisy data (generated data + noise), so an MSE of 0 would not be expected for noise levels above 0.
Weirdly, for high noise levels, I get a lower MSE with cross-validated least squares than with the "ground truth" weights used to generate the clean data, to which I then add different levels of noise to the input (X). If instead I add Gaussian noise to the output (y), the "ground truth weights" perform better.
More details below.
Simulation of data
I generate beta from a Gaussian and X from a uniform distribution. I then compute the to-be-regressed y as y = X @ beta.
Python 3 code:
def generate_data(noise_frac):
    X = np.random.rand(ntrials, nneurons)
    X = np.random.normal(size=(ntrials, nneurons))
    beta = np.random.randn(nneurons)
    y = X @ beta
    # not very important how I generated noise here
    noise_x = np.random.multivariate_normal(mean=np.zeros(nneurons), cov=np.diag(np.random.rand(nneurons)), size=ntrials)
    X_noise = X + noise_x * noise_frac
    return X_noise, y, beta
As you can see I also add noise to X.
Regression
I then project this noised data X_noise for different values of noise onto beta:
y_hat = X_noise @ beta
And compute the MSE:
mse = np.mean((y_hat - y)**2)
As expected, MSE increases with noise (blue line in the figure).
However, I get lower MSE if I use cross validated least-squares! This is now orange line in the figure.
To do CV, I split X_noise into 100 random train and test sets. In broad terms, this is how I do CV in Python:
beta_lsq = pinv(X_train) @ y_train
y_hat_lsq = X_test @ beta_lsq
mse = np.mean((y_hat_lsq - y_test)**2)
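For reference, a self-contained version of the whole procedure looks roughly like this (the sizes, noise model and 50/50 split here are placeholders, not my exact settings):
import numpy as np
from numpy.linalg import pinv

np.random.seed(0)
ntrials, nneurons, noise_frac = 1000, 20, 1.0
X = np.random.normal(size=(ntrials, nneurons))
beta = np.random.randn(nneurons)
y = X @ beta
X_noise = X + noise_frac * np.random.normal(size=X.shape)

mse_truth = np.mean((X_noise @ beta - y) ** 2)  # using the ground-truth weights

mses_cv = []
for _ in range(100):
    idx = np.random.permutation(ntrials)
    train, test = idx[:ntrials // 2], idx[ntrials // 2:]
    beta_lsq = pinv(X_noise[train]) @ y[train]
    mses_cv.append(np.mean((X_noise[test] @ beta_lsq - y[test]) ** 2))

print(mse_truth, np.mean(mses_cv))  # the cross-validated fit tends to come out lower for large noise_frac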
On the other hand, if I add noise to y instead of X, then everything makes sense and the "ground truth weights" perform better.
Thank you very much in advance!
PS: This is a crosspost from stack overflow
The regularization parameter C in logistic regression (see http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) is used to keep the fitted function well defined and to avoid either overfitting or problems with step functions (see https://datascience.stackexchange.com/questions/10805/does-scikit-learn-use-regularization-by-default/10806).
However, regularization in logistic regression should only concern the weights for the features, not the intercept (also explained here: http://aimotion.blogspot.com/2011/11/machine-learning-with-python-logistic.html)
But it seems that sklearn.linear_model.LogisticRegression actually regularizes the intercept as well. Here is why:
1) Consider the above link carefully (https://datascience.stackexchange.com/questions/10805/does-scikit-learn-use-regularization-by-default/10806): the sigmoid is moved slightly to the left, closer to the intercept 0.
2) I tried to fit data points with a logistic curve and a manual maximum likelihood function. Including the intercept in the L2 norm gives identical results to sklearn's function.
Two questions please:
1) Did I get this wrong, is this a bug, or is there a well-justified reason for regularizing the intercept?
2) Is there a way to use sklearn and specify to regularize all parameters except the intercepts?
Thanks!
import numpy as np
from sklearn.linear_model import LogisticRegression
C = 1e1
model = LogisticRegression(C=C)
x = np.arange(100, 110)
x = x[:, np.newaxis]
y = np.array([0]*5 + [1]*5)
print(x)
print(y)
model.fit(x, y)
a = model.coef_[0][0]
b = model.intercept_[0]
b_modified = -b/a # without regularization, b_modified should be 104.5 (as for C=1e10)
print "a, b:", a, -b/a
# OUTPUT:
# [[100]
# [101]
# [102]
# [103]
# [104]
# [105]
# [106]
# [107]
# [108]
# [109]]
# [0 0 0 0 0 1 1 1 1 1]
# a, b: 0.0116744221756 100.478968664
scikit-learn's logistic regression is regularized by default.
Changing the value of the intercept_scaling parameter in sklearn.linear_model.LogisticRegression has an effect on the result similar to changing the C parameter alone.
When intercept_scaling is modified, regularization still has an impact on the estimation of the bias in logistic regression, but when this parameter's value is higher, the regularization impact on the bias is reduced. Per the official documentation:
The intercept becomes intercept_scaling * synthetic_feature_weight.
Note! the synthetic feature weight is subject to l1/l2 regularization
as all other features. To lessen the effect of regularization on
synthetic feature weight (and therefore on the intercept)
intercept_scaling has to be increased.
Hope it helps!
Thanks @Prem, this is indeed the solution:
C = 1e1
intercept_scaling=1e3 # very high numbers make it unstable in practice
model = LogisticRegression(C=C, intercept_scaling=intercept_scaling)
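For reference, a quick way to see the effect (a sketch; this assumes the liblinear solver, where the intercept is handled as a penalized synthetic feature, and the exact numbers depend on the sklearn version):
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.arange(100, 110)[:, np.newaxis]
y = np.array([0] * 5 + [1] * 5)

for scale in (1.0, 1e3):
    m = LogisticRegression(C=1e1, intercept_scaling=scale, solver='liblinear').fit(x, y)
    a, b = m.coef_[0][0], m.intercept_[0]
    print(scale, -b / a)  # the boundary -b/a should move toward 104.5 as the scaling grows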
There are standard ways of predicting proportions such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:
http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression
I cannot tell if there exists a work-around within the sklearn framework.
There exists a workaround, but it is not intrinsically within the sklearn framework.
If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:
Classifiers (such as logistic regression) deal with class labels as target variables only. As a workaround you could simply threshold your probabilities to 0/1 and interpret them as class labels, but you would lose a lot of information.
Regression models (such as linear regression) do not restrict the target variable. You can train them on proportional data, but there is no guarantee that the output on unseen data will be restricted to the 0/1 range. However, in this situation, there is a powerful work-around (below).
There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines the logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -oo to oo and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed again to result in probabilities p = 1 / (exp(-y) + 1).
import numpy as np
from sklearn.linear_model import LinearRegression
class LogitRegression(LinearRegression):

    def fit(self, x, p):
        p = np.asarray(p)
        y = np.log(p / (1 - p))
        return super().fit(x, y)

    def predict(self, x):
        y = super().predict(x)
        return 1 / (np.exp(-y) + 1)

if __name__ == '__main__':
    # generate example data
    np.random.seed(42)
    n = 100
    x = np.random.randn(n).reshape(-1, 1)
    noise = 0.1 * np.random.randn(n).reshape(-1, 1)
    p = np.tanh(x + noise) / 2 + 0.5

    model = LogitRegression()
    model.fit(x, p)
    print(model.predict([[-10], [0.0], [1]]))
    # [[ 2.06115362e-09]
    #  [ 5.00000000e-01]
    #  [ 8.80797078e-01]]
There are also numerous other alternatives. Some non-linear regression models can work naturally in the 0-1 range. For example, Random Forest regressors will never exceed the range of the target variables they were trained with: simply put probabilities in and you will get probabilities out. Neural networks with appropriate output activation functions (tanh, I guess) will also work well with probabilities, but if you want to use those there are more specialized libraries than sklearn.
[*] You could in fact plug in any linear regression model which can make the method more powerful, but then it no longer is exactly equivalent to logistic regression.
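For completeness, here is a minimal sketch of the random-forest alternative mentioned above (reusing the toy data from the example; the hyperparameters are arbitrary):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)
n = 100
x = np.random.randn(n).reshape(-1, 1)
noise = 0.1 * np.random.randn(n).reshape(-1, 1)
p = np.tanh(x + noise) / 2 + 0.5

model = RandomForestRegressor(n_estimators=100)
model.fit(x, p.ravel())
# predictions cannot leave the range of the training targets, so they stay within (0, 1)
print(model.predict([[-10], [0.0], [1]]))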
I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:
def fit(self, X, Y):
    # Convert to data frame in case X is numpy matrix
    X = pd.DataFrame(X)
    # Define a function to calculate the error given a weight vector beta and a training example xi, yi
    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0] * X.shape[0]))
    # Find dimensions of train
    m, d = X.shape
    # Initialize weights to random
    beta = self.initializeRandomWeights(d)
    beta_prev = None
    epochs = 0
    prev_error = None
    while (beta_prev is None or epochs < self.nb_epochs):
        print("## Epoch: " + str(epochs))
        indices = list(range(0, m))
        shuffle(indices)
        for i in indices:  # Pick a training example from the randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            errori = sum(beta * xi) - Y[i]  # Error[i] = sum(beta*x) - y = error of ith training example
            gradient_vector = xi * errori + self.l * beta_prev
            beta = beta_prev - self.alpha * gradient_vector
        epochs += 1
The data I'm testing this on is not normalized and my implementation always ends up with all the weights being Infinity, even though I initialize the weights vector to low values. Only when I set the learning rate alpha to a very small value ~1e-8, the algorithm ends up with valid values of the weights vector.
My understanding is that normalizing/scaling input features only helps reduce convergence time. But the algorithm should not fail to converge as a whole if the features are not normalized. Is my understanding correct?
You can check from scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient based optimization algorithms converge faster on normalized data.
Also, normalization is advantageous for regression methods.
The updates to the coefficients during each step will depend on the ranges of each feature. Also, the regularization term will be affected heavily by large feature values.
SGD may converge without data normalization, but that depends on the data at hand. Therefore, your assumption is not correct.
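A tiny illustration of the scaling point (a sketch with a hand-rolled per-sample update, not your exact class; the 1e3 factor is just an exaggerated feature scale):
import numpy as np

np.random.seed(0)
n, d, alpha, epochs = 200, 5, 0.01, 5
X = np.random.normal(size=(n, d))
w_true = np.ones(d)
y = X @ w_true + 0.1 * np.random.normal(size=n)

def sgd(X, y, alpha, epochs):
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            grad = X[i] * (X[i] @ beta - y[i])  # plain squared-error gradient, no penalty
            beta = beta - alpha * grad
    return beta

X_big = X.copy()
X_big[:, 0] *= 1e3  # one feature on a much larger scale

print(sgd(X, y, alpha, epochs))      # ends up near the true weights
print(sgd(X_big, y, alpha, epochs))  # blows up (overflow/nan) with the same learning rate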
Your assumption is not correct.
It's hard to answer this because there are so many different methods/environments, but I will try to mention some points.
Normalization
When a method is not scale-invariant (I think every linear regression is not), you really should normalize your data.
I take it that you are ignoring this only for debugging/analysis purposes.
Normalizing your data is relevant not only for convergence time; the results will differ too (think about the effect within the loss function: large values can contribute much more loss than small ones)!
Convergence
There is probably much to tell about convergence of many methods on normalized/non-normalized data, but your case is special:
SGD's convergence theory only guarantees convergence to some local minimum (= the global minimum in your convex optimization problem) for certain choices of hyper-parameters (learning rate and learning schedule/decay).
Even optimizing normalized data can fail with SGD when those parameters are bad!
This is one of the most important downsides of SGD: the dependence on hyper-parameters.
As SGD is based on gradients and step sizes, non-normalized data can have a huge effect on whether this convergence is achieved!
In order for SGD to converge in linear regression, the step size should be smaller than 2/s, where s is the largest singular value of the matrix (see the "Convergence and stability in the mean" section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter); in the case of ridge regression it should be less than 2*(1+p/s^2)/s, where p is the ridge penalty.
Normalizing the rows of the matrix (or the gradients) changes the loss function to give each sample an equal weight, and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). Depending on your data, it might require smaller step sizes or allow for larger ones; it all depends on whether the normalization increases or decreases the largest singular value of the matrix.
Note that when deciding whether or not to normalize the rows, you shouldn't think only about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or about stability in the mean, but also about how normalization changes the loss function and whether that fits your needs. Sometimes it makes sense to normalize, and sometimes (for example, when you want to give different importance to different samples, or when you think that a larger signal energy means a better SNR) it does not.