How to reproduce the behaviour of Ridge(normalize=True)? - python

This block of code:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
X = 'some_data'
y = 'some_target'
penalty = 1.5e-5
A = Ridge(normalize=True, alpha=penalty).fit(X, y)
triggers the following warning:
FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
Ridge(alpha=1.5e-05)
But that code gives me completely different coefficients, as expected, because normalisation and standardisation are different things.
B = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=penalty))
B[1].fit(B[0].fit_transform(X), y)
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 125511.75051106009)
The result still does not match if I set alpha = penalty * n_features.
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 114686.09835548172)
even though Ridge() used a slightly different normalization than I expected (quoting the docs):
the regressor X will be normalized by subtracting mean and dividing by l2-norm
So what is the proper way to use ridge regression with normalization, considering that the l2-norm can seemingly only be obtained by fitting, modifying the data, and fitting again? Nothing comes to mind in the context of sklearn's ridge regression, especially from version 1.2 onwards.
To prepare data for experimenting:
import pandas as pd
url = 'https://drive.google.com/file/d/1bu64NqQkG0YR8G2CQPkxR1EQUAJ8kCZ6/view?usp=sharing'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
data = pd.read_csv(url, index_col=0)
X = data.iloc[:,:15]
y = data['target']

The difference is that the coefficients reported with normalize=True are to be applied directly to the unscaled inputs, whereas the pipeline approach applies its coefficients to the model's inputs, which are the scaled features.
You can "normalize" (an unfortunate overloading of the word) the coefficients by multiplying/dividing by the features' standard deviation. Together with the change to penalty suggested in the future warning, I get the same outputs:
np.allclose(A.coef_, B[1].coef_ / B[0].scale_)
# True
(I've tested using sklearn.datasets.load_diabetes.)
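Putting the recipe together, here is a minimal sketch of the reproduction described above: scale the features with StandardScaler(with_mean=False), multiply alpha by n_samples as the deprecation warning suggests, and divide the fitted coefficients by the scaler's scale_ to express them on the original feature scale. It uses load_diabetes purely as stand-in data; everything else follows the answer's recipe rather than being a verified sklearn equivalence.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
penalty = 1.5e-5

# scale alpha by n_samples, as the FutureWarning recommends
B = make_pipeline(StandardScaler(with_mean=False),
                  Ridge(alpha=penalty * X.shape[0]))
B.fit(X, y)

# the pipeline's coefficients apply to the scaled features;
# divide by scale_ to get coefficients for the original features
coef_original_scale = B[-1].coef_ / B[0].scale_
# on sklearn < 1.2 this should match Ridge(normalize=True, alpha=penalty).fit(X, y).coef_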

Related

OLS fit for python with coefficient error and transformed target

There seem to be two approaches to OLS fits in Python: the sklearn one and the statsmodels one. I have a preference for the statsmodels one because it gives the error on the coefficients via the summary() function. However, I would like to use the TransformedTargetRegressor from sklearn to log my target. It would seem that I need to choose between getting the error on my fit coefficients in statsmodels and being able to transform my target in sklearn. Is there a good way to do both of these at the same time in either system?
In statsmodels it would be done like this:
import statsmodels.api as sm
X = sm.add_constant(X)
ols = sm.OLS(y, X)
ols_result = ols.fit()
print(ols_result.summary())
to return the fit with the coefficients and the errors on them.
For sklearn you can use the TransformedTargetRegressor:
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
regr = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print('Coefficients: \n', regr.regressor_.coef_)
But there is no way to get the error on the coefficients without calculating them yourself. Is there a good way to get the best of both worlds?
EDIT
I found a good example for the special case I care about here
https://web.archive.org/web/20160322085813/http://www.ats.ucla.edu/stat/mult_pkg/faq/general/log_transformed_regression.htm
Just to add a lengthy comment here, I believe that TransformedTargetRegressor does not do what you think it does. As far as I can tell, the inverse transformation function is only applied when the predict method is called. It does not express the coefficients in units of the untransformed outcome.
Example:
import pandas as pd
import statsmodels.api as sm
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn import datasets
create some sample data:
df = pd.DataFrame(datasets.load_iris().data)
df.columns = datasets.load_iris().feature_names
X = df.loc[:,['sepal length (cm)', 'sepal width (cm)']]
y = df.loc[:, 'petal width (cm)']
Sklearn first:
regr = TransformedTargetRegressor(regressor=LinearRegression(),func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print(regr.regressor_.intercept_)
for coef in regr.regressor_.coef_:
print(coef)
#-0.45867804195769357
# 0.3567583897503805
# -0.2962942997303887
Statsmodels on transformed outcome:
X = sm.add_constant(X)
ols_trans = sm.OLS(np.log1p(y), X).fit()
print(ols_trans.params)
#const -0.458678
#sepal length (cm) 0.356758
#sepal width (cm) -0.296294
#dtype: float64
You see that in both cases the coefficients are identical. That is, using the regression with TransformedTargetRegressor yields the same coefficients as statsmodels.OLS with the transformed outcome. TransformedTargetRegressor does not back-translate the coefficients into the original untransformed space. Note that the coefficients would be non-linear in the original space unless the transformation itself is linear, in which case this is trivial (adding and multiplying with constants). The discussion here points in a similar direction: back-transforming betas is infeasible in most/many cases.
What to do instead?
If interpretation is your goal, I believe the closest you can get to what you wish to achieve is to use predicted values where you vary the regressors or the coefficients. So, let me give you an example: if your goal is to say what the effect on the untransformed outcome is of a sepal-length coefficient one standard error higher, you can create the predicted values as fitted, as well as the predicted values for the 1-sigma scenario (either by tampering with the coefficient, or by tampering with the corresponding column in X).
Example:
# Toy example to add one sigma to sepal length coefficient
coeffs = ols_trans.params.copy()
coeffs['sepal length (cm)'] += 0.018 # this is one sigma
# function to predict and translate predictions back:
def get_predicted_backtransformed(coeffs, data, inv_func):
    return inv_func(data.dot(coeffs))
# get standard predicted values, backtransformed:
original = get_predicted_backtransformed(ols_trans.params, X, np.expm1)
# get counterfactual predicted values, backtransformed:
variant1 = get_predicted_backtransformed(coeffs, X, np.expm1)
Then you can say e.g. about the mean shift in the untransformed outcome:
variant1.mean()-original.mean()
#0.2523083548367202
In short, scikit-learn cannot help you in calculating coefficient standard errors. However, if you opt to use it, you can calculate the errors yourself. In the question Python scikit learn Linear Model Parameter Standard Error, @grisaitis provided a great answer explaining the main concepts behind it.
If you only want a plug-and-play function that will work with scikit-learn, you can use this:
import numpy as np

def get_coef_std_errors(reg: 'sklearn.linear_model.LinearRegression',
                        y_true: 'np.ndarray', X: 'np.ndarray'):
    """Calculate the standard errors of the coefficients of a linear regression.

    Parameters
    ----------
    reg : sklearn.linear_model.LinearRegression
        LinearRegression object which has been fitted
    y_true : np.ndarray
        array containing the target variable
    X : np.ndarray
        array containing the features used in the regression

    Returns
    -------
    beta_std
        Standard errors of the regression coefficients
    """
    y_pred = reg.predict(X)           # get predictions
    errors = y_true - y_pred          # calculate residuals
    sigma_sq_hat = np.var(errors)     # estimate the residual variance
    sigma_beta_hat = sigma_sq_hat * np.linalg.inv(X.T @ X)  # covariance of the coefficients
    return np.sqrt(np.diagonal(sigma_beta_hat))  # diagonal holds the variances
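A hypothetical usage sketch with synthetic data (the constant column is added by hand so the intercept gets a standard error too, and fit_intercept=False keeps the design matrix X consistent with the formula inside the function):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # three synthetic features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

X_const = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones
reg = LinearRegression(fit_intercept=False).fit(X_const, y)

print(get_coef_std_errors(reg, y, X_const))      # std errors for intercept + 3 slopes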

Why does LinearRegression with polynomial features with degree = 1 gives different results?

I have a dataset for regression: (X_train_scaled, y_train) and (X_val_scaled, y_val) for training and validation respectively. The inputs were scaled using StandardScaler.
I create a linear regression model using sklearn.linear_model.LinearRegression like follows:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
linear_reg = LinearRegression()
linear_reg.fit(X_train_scaled, y_train)
y_pred_train = linear_reg.predict(X_train_scaled)
y_pred_val = linear_reg.predict(X_val_scaled)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
After that I do the same but use polynomial features with degree = 1 (which are just the same as the original features but with an additional feature of ones, i.e. x^0, which I ignore).
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(1)
X_train_poly = pf.fit_transform(X_train_scaled)[:, 1:] # ignore first col
X_val_poly = pf.transform(X_val_scaled)[:, 1:] # ignore first col
linear_reg = LinearRegression()
linear_reg.fit(X_train_poly, y_train)
y_pred_train = linear_reg.predict(X_train_poly)
y_pred_val = linear_reg.predict(X_val_poly)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
However, I get different results. The first code gives me the following outputs:
r2_train 0.7409525513417043
r2_val 0.7239859358973735
whereas the second code gives this output:
r2_train 0.7410093370149977
r2_val 0.7241725658840452
Why are the outputs different although the dataset and model are the same?
To prove the datasets are the same, I tried the following code:
print(X_train_scaled.shape, X_train_poly.shape)
print(X_val_scaled.shape, X_val_poly.shape)
print((X_train_poly != X_train_scaled).sum())
print((X_val_poly != X_val_scaled).sum())
which has the output:
(802, 9) (802, 9)
(268, 9) (268, 9)
0
0
which indicates that the two datasets are identical.
Also, I use LinearRegression in both cases, which uses the OLS algorithm and has no random operations at all. So it is supposed to do the same calculations on the same data. However, I get different results.
Does anyone have an idea about the reason?
Sklearn's LinearRegression uses ordinary least squares optimization to fit the training data to a linear model, while it is not clear what Sklearn's PolynomialFeatures does internally. But based on its transform() function:
Prefer CSR over CSC for sparse input (for speed), but CSC is required
if the degree is 4 or higher. If the degree is less than 4 and the
input format is CSC, it will be converted to CSR, have its polynomial
features generated, then converted back to CSC.
(see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
Assuming the transformed features go through the same ordinary least squares optimization, you would still get essentially the same results, but with a slight difference (just like yours), because the Compressed Sparse Row (CSR) handling can perturb the float values (in other words, truncation/approximation error).
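As a quick sanity check of that claim, here is a self-contained sketch comparing the coefficients of the two code paths on synthetic data (this is only an illustration, not the question's dataset); depending on the data and the BLAS/LAPACK path, the difference is either exactly zero or at the level of floating-point noise:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = X @ rng.normal(size=9) + rng.normal(scale=0.1, size=200)

X_poly = PolynomialFeatures(1).fit_transform(X)[:, 1:]   # drop the bias column

m_plain = LinearRegression().fit(X, y)
m_poly = LinearRegression().fit(X_poly, y)

print(np.max(np.abs(m_plain.coef_ - m_poly.coef_)))      # 0.0 or ~1e-16
print(np.allclose(m_plain.coef_, m_poly.coef_))          # True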

Large Margin Classifier

I am building a classifier to maximize the margin between positively and negatively labelled points.
I am using sklearn.LinearSVC to do this. I have to find both the weights (a vector, theta) and the intercept (a scalar, theta_0). I also need to calculate the maximum margin. So, I wrote the code below.
import numpy as np
import sklearn
from sklearn.svm import LinearSVC
# training data
X_train = np.array([[0,0],[2,0],[3,0],[0,2],[2,2],[5,1],[5,2],[2,4],[4,4],[5,5]])
y_train = [-1,-1,-1,-1,-1,1,1,1,1,1]
classifier = LinearSVC(random_state = 0, C=1.0, fit_intercept= True)
classifier.fit(X_train, y_train)
theta = classifier.coef_
theta_0.intercept_
norm = np.linalg.norm(theta)
margin = 2/norm
As per my understanding, LinearSVC is the right package for this; though I see some tutorials in which people use SVC and then kernel = 'linear'.
I am not sure whether I should set the fit_intercept parameter to True. I am getting a different value for theta and theta_0 when I default it to False.
Can somebody guide me on the understanding of this parameter and also whether the margin calculation is correct? Lastly, whether LinearSVC is the right model. Thanks.
This statement is wrong:
theta_0.intercept_
I assume that it should be:
theta_0 = classifier.intercept_

How can I replace svm.SVR 'rbf' kernel in sklearn using my own RBF function?

I have developed the code below to start a project using the SVM method:
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error
housing = load_boston()
df = pd.DataFrame(np.c_[housing['data'], housing['target']],
columns= np.append(housing['feature_names'], ['target']))
features = df.columns.tolist()
label = features[-1]
features = features[:-1]
x_train = df[features].iloc[:400]
y_train = df[label].iloc[:400]
x_test = df[features].iloc[400:]
y_test = df[label].iloc[400:]
svr = svm.SVR(kernel='rbf')
svr.fit(x_train, y_train)
y_pred = svr.predict(x_test)
print(mean_absolute_error(y_pred, y_test))
Now I want to use my customized rbf kernel which is:
def my_rbf(feat, lbl):
    #feat = feat.values
    #lbl = lbl.values
    ans = np.array([])
    gamma = 0.000005
    for i in range(len(feat)):
        ans = np.append(ans, np.exp(-gamma * np.dot(feat[i]-lbl[i], feat[i]-lbl[i])))
    return ans
Then I changed to svm.SVR(kernel=my_rbf), but I get plenty of errors however I modify it. I also tried to use a simple function like np.dot(feat-lbl, feat-lbl), which worked fine in the SVR.fit method, but in svr.predict an error occurred saying that the shape of the input matrix has to be [n_samples_test, n_samples_train].
I'm stuck dealing with these errors. Can anyone help me make this code work?
The custom kernel method my_rbf you coded uses both X (features) and y (labels). You cannot evaluate this function during prediction, since you have no access to the labels. The custom kernel is flawed.
Background
The RBF kernel is defined as (from the wiki article):
K(x, x') = exp(-||x - x'||^2 / (2*sigma^2))
where x and x' are two feature (X) vectors.
Let H(X) be a function that transforms a vector X into another dimension (normally a very, very high dimension). SVM needs to calculate the dot product between all combinations of the feature vectors (i.e. all the H(X)'s). So if H(X1) . H(X2) = K(X1, X2), then K is called the kernel function, or the kernelization of H. So instead of transforming the points X1 and X2 to a very high dimension and calculating the dot product there, K calculates it directly from X1 and X2.
Conclusion
my_rbf is not a valid kernel function simply because it uses the labels (Ys). It should depend only on the feature vectors.
According to this source, the RBF function I was looking for (it takes the training features as X and the testing features as Y as inputs, and outputs a Gram matrix of shape [n_samples_X, n_samples_Y], as explained more thoroughly in the docs) is something like this:
def my_kernel(X, Y):
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.exp(-1 * np.linalg.norm(x - y)**2)
    return K
clf=SVR(kernel=my_kernel)
which gives results exactly equal to:
clf = SVR(kernel="rbf", gamma=1)
In terms of speed, it is not as efficient as the svm library's default rbf kernel. It could be useful to use Cython's static typing for the indices, and memoryviews for the numpy arrays, to speed it up a little bit.
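If the double Python loop is the bottleneck, a simpler alternative to the Cython route is to vectorize the kernel with NumPy/SciPy. A minimal sketch (my_fast_kernel is a hypothetical name, and gamma is exposed as a parameter here, which the looped version hard-codes to 1):
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.svm import SVR

def my_fast_kernel(X, Y, gamma=1.0):
    # pairwise squared Euclidean distances between rows of X and rows of Y
    sq_dists = cdist(X, Y, metric='sqeuclidean')
    return np.exp(-gamma * sq_dists)

clf = SVR(kernel=my_fast_kernel)   # same Gram matrix as the looped version with gamma=1
scikit-learn also ships sklearn.metrics.pairwise.rbf_kernel, which computes the same matrix.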

How to get comparable and reproducible results from LogisticRegressionCV and GridSearchCV

I want to score different classifiers with different parameters.
For a speedup on LogisticRegression I use LogisticRegressionCV (which is at least 2x faster) and plan to use GridSearchCV for the others.
The problem is that while it gives me equal C parameters, it does not give the same AUC ROC score.
I tried to fix many parameters like scorer, random_state, solver, max_iter, tol...
Please look at the example (the real data does not matter):
Test data and common part:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
GridSearchCV
grid = {
'C': np.power(10.0, np.arange(-10, 10))
, 'solver': ['newton-cg']
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print ('gs.best_score_:', gs.best_score_)
gs.best_score_: 0.939162082194
LogisticRegressionCV
searchCV = LogisticRegressionCV(
Cs=list(np.power(10.0, np.arange(-10, 10)))
,penalty='l2'
,scoring='roc_auc'
,cv=fold
,random_state=777
,max_iter=10000
,fit_intercept=True
,solver='newton-cg'
,tol=10
)
searchCV.fit(X, y)
print ('Max auc_roc:', searchCV.scores_[1].max())
Max auc_roc: 0.970588235294
The newton-cg solver is used just to provide a fixed value; others were tried too.
What did I forget?
P.S. In both cases I also got the warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed
warnings.warn('Line Search failed')", which I can't understand either. I'll be happy if someone also describes what it means, but I hope it is not relevant to my main question.
EDIT UPDATES
Following @joeln's comment, I added the max_iter=10000 and tol=10 parameters too. It does not change the result in any digit, but the warning disappeared.
Here is a copy of the answer by Tom on the scikit-learn issue tracker:
LogisticRegressionCV.scores_ gives the score for all the folds.
GridSearchCV.best_score_ gives the best mean score over all the folds.
To get the same result, you need to change your code:
print('Max auc_roc:', searchCV.scores_[1].max()) # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max()) # is correct
By also using the default tol=1e-4 instead of your tol=10, I get:
('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)
The (small) remaining difference might come from warm starting in LogisticRegressionCV (which is actually what makes it faster than GridSearchCV).
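To make the comparison explicit, a small follow-up sketch (it assumes searchCV has been fitted as above; scores_[1] has shape (n_folds, n_Cs) and Cs_ holds the grid of C values):
import numpy as np

mean_scores = searchCV.scores_[1].mean(axis=0)   # average over folds, one value per C
best_idx = np.argmax(mean_scores)
print('best C:', searchCV.Cs_[best_idx])
print('mean auc_roc:', mean_scores[best_idx])    # comparable to gs.best_score_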
