Backward Elimination on large datasets in python

Backward Elimination on large datasets in python - python

I took an online course where the instructor explained backward elimination using a dataset(50,5) where you eliminate the columns manually by looking at their p-values.
import statsmodels.api as sm
X = np.append(arr = np.ones((2938, 1)).astype(int), values = X, axis = 1)
X_opt = X[:, [0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
# Second Step
X_opt = X[:, [0,1,,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
# and so on
Now while practicing on on an large dataset such as (2938, 214) which I have, do I have to eliminate all the columns myself? Because that is a lot of work, or is there some sort of algorithm or way to do it.
This might be a stupid question but I am a begineer in machine learning so any help is appreciated.Thanks

What you sre trying to do is called "Recursive Feature Eliminatio ", RFE for short.
Example from sklearn.feature_selection.RFE:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, 5, step=1)
selector = selector.fit(X, y)
This would eliminate features using SVR one by one until only 5 most important are left. You could use any algorithm which provides feature_importances_ object member.
When it comes to p-values you could eliminate all greater than threshold (provided the null hypothesis is this coefficient has no meaning, e.g. is zero), but see below.
Just remember, usually coefficients weights will change as some of them are removed (as here or in RFE), so it's only an approximation dependent on many factors. You could do other preprocessing like removing correlated features or using OLS with L1 penalty which will choose only the most informative factors.

Related

Sklearn KNeighborsClassifier gives different results when using KNeighborsTransformer?

I am trying to get my head around how to use KNeighborsTransformer correctly, so I am using the Iris dataset to test it.
However, I find that when I use KNeighborsTransformer before the KNeighborsClassifier I get different results than using KNeighborsClassifier directly.
When I plot the decision boundaries, they are similar, but different.
I have given the metric and weights mode explicitly, so that cannot be the problem.
Why do I get this difference?
Does it have something to do with whether they count a point as its own nearest neighbour?
Or does it have something to do with the metric='precomputed'?
Below is the code I use to consider the two classifiers.
import numpy as np
from sklearn import neighbors, datasets
from sklearn.pipeline import make_pipeline
# import data
iris = datasets.load_iris()
# We only take the first two features.
X = iris.data[:, :2]
y = iris.target
n_neighbors = 15
knn_metric = 'minkowski'
knn_mode = 'distance'
# With estimator with KNeighborsTransformer
estimator = make_pipeline(
neighbors.KNeighborsTransformer(
n_neighbors = n_neighbors + 1, # one extra neighbor should already be computed when mode == 'distance'. But also the extra neighbour should be filtered by the following KNeighborsClassifier
metric = knn_metric,
mode = knn_mode),
neighbors.KNeighborsClassifier(
n_neighbors=n_neighbors, metric='precomputed'))
estimator.fit(X, y)
print(estimator.score(X, y)) # 0.82
# with just KNeighborsClassifier
clf = neighbors.KNeighborsClassifier(
n_neighbors,
weights = knn_mode,
metric = knn_metric)
clf.fit(X, y)
print(clf.score(X, y)) # 0.9266666666666666

Your pipeline approach uses the default uniform vote, but your direct approach uses the distance-weighted vote. Making them match (either both distance or both uniform) almost makes the behavior match; the seeming remaining difference is in tie-breaking of nearest neighbors; I'm not sure yet why the tie-breaking is happening differently in the two cases, but it's likely not such a big issue with more realistic datasets.

Getting feature importances out of an Adaboosted linear regression

I have the following code:
modelClf = AdaBoostRegressor(base_estimator=LinearRegression(), learning_rate=2, n_estimators=427, random_state=42)
modelClf.fit(X_train, y_train)
While trying to interpret and improve the results, I wanted to see the feature importances, however I get an error saying that linear regressions don't really do that kind of thing.
Alright, sounds reasonable, so I tried using .coef_ since it should work for linear regressions, but it, in place, turned out incompatible with the adaboost regressor.
Is there any way to find the feature importances or is it impossible when adaboost it used on a linear regression?

Issue12137 suggests to add support for this using the coefs_, although a choice needs to be made how to normalize negative coefficients. There's also the question of when coefficients are really good representatives of importance (you should at least scale your data first). And then there's the question of when adaptive boosting helps a linear model in the first place.
One way to do this quickly is to modify the LinearRegression class:
class MyLinReg(LinearRegression):
#property
def feature_importances_(self):
return self.coef_ # assuming one output
modelClf = AdaBoostRegressor(base_estimator=MyLinReg(), ...)

Checked with below code, there is an attribute for feature importance:
import pandas as pd
import random
from sklearn.ensemble import AdaBoostRegressor
df = pd.DataFrame({'x1':random.choices(range(0, 100), k=10), 'x2':random.choices(range(0, 100), k=10)})
df['y'] = df['x2'] * .5
X = df[['x1','x2']].values
y = df['y'].values
regr = AdaBoostRegressor(random_state=0, n_estimators=100)
regr.fit(X, y)
regr.feature_importances_
Output: You can see feature 2 is more important as Y is nothing but half of it (as the data is created in such way).

Why does LinearRegression with polynomial features with degree = 1 gives different results?

I have a dataset for regression: (X_train_scaled, y_train) and (X_val_scaled, y_val) for training and validation respectively. The inputs were scaled using StandardScaler.
I create a linear regression model using sklearn.linear_model.LinearRegression like follows:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
linear_reg = LinearRegression()
linear_reg.fit(X_train_scaled, y_train)
y_pred_train = linear_reg.predict(X_train_scaled)
y_pred_val = linear_reg.predict(X_val_scaled)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
After that I do the same but use polynomial features with degree = 1 (which are just the same as the original features but with an additional feature of ones, i.e. x^0, which I ignore).
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(1)
X_train_poly = pf.fit_transform(X_train_scaled)[:, 1:] # ignore first col
X_val_poly = pf.transform(X_val_scaled)[:, 1:] # ignore first col
linear_reg = LinearRegression()
linear_reg.fit(X_train_poly, y_train)
y_pred_train = linear_reg.predict(X_train_poly)
y_pred_val = linear_reg.predict(X_val_poly)
r2_train = r2_score(y_train, y_pred_train)
r2_val = r2_score(y_val, y_pred_val)
print('r2_train', r2_train)
print('r2_val', r2_val)
However, I get different results. The first code gives me the following outputs:
r2_train 0.7409525513417043
r2_val 0.7239859358973735
whereas the second code gives this output:
r2_train 0.7410093370149977
r2_val 0.7241725658840452
Why are the outputs different although the dataset and model are the same?
To prove the datasets are the same, I tried the following code:
print(X_train_scaled.shape, X_train_poly.shape)
print(X_val_scaled.shape, X_val_poly.shape)
print((X_train_poly != X_train_scaled).sum())
print((X_val_poly != X_val_scaled).sum())
which has the output:
(802, 9) (802, 9)
(268, 9) (268, 9)
0
0
which indicates that the two datasets are identical.
Also, I use LinearRegession in the two cases which uses OLS algorithm and has no random operations at all. So, it's supposed to do the same calculations on the same data. However, I get different results.
Does anyone have an idea about the reason?

Sklearn LinearRegression uses ordinary least squares optimization to fit train data into a linear model while it is not clear what Sklearn PolynomialFeatures use. But based on its transform() function:
Prefer CSR over CSC for sparse input (for speed), but CSC is required
if the degree is 4 or higher. If the degree is less than 4 and the
input format is CSC, it will be converted to CSR, have its polynomial
features generated, then converted back to CSC.
(see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
Assuming PolynomialFeatures uses ordinary least squares optimization, you would still have same results but with slight difference (just like yours) because Compressed Sparse Row (CSR) method would compromise float values (in other words, truncation/approximation error).

Reproducing LASSO / Logistic Regression results in R with Python using the Iris Dataset

I'm trying to reproduce the following R results in Python. In this particular case the R predictive skill is lower than the Python skill, but this is usually not the case in my experience (hence the reason for wanting to reproduce the results in Python), so please ignore that detail here.
The aim is to predict the flower species ('versicolor' 0 or 'virginica' 1). We have 100 labelled samples, each consisting of 4 flower characteristics: sepal length, sepal width, petal length, petal width. I've split the data into training (60% of data) and test sets (40% of data). 10-fold cross-validation is applied to the training set to search for the optimal lambda (the parameter that is optimized is "C" in scikit-learn).
I'm using glmnet in R with alpha set to 1 (for the LASSO penalty), and for python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with L1 penalisation). The scoring metrics used in the cross-validation are the same between both languages. However somehow the model results are different (the intercepts and coefficients found for each feature vary quite a bit).
R Code
library(glmnet)
library(datasets)
data(iris)
y <- as.numeric(iris[,5])
X <- iris[y!=1, 1:4]
y <- y[y!=1]-2
n_sample = NROW(X)
w = .6
X_train = X[0:(w * n_sample),] # (60, 4)
y_train = y[0:(w * n_sample)] # (60,)
X_test = X[((w * n_sample)+1):n_sample,] # (40, 4)
y_test = y[((w * n_sample)+1):n_sample] # (40,)
# set alpha=1 for LASSO and alpha=0 for ridge regression
# use class for logistic regression
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
nfolds = 10, alpha=1, family="binomial", type.measure="class")
best_s <- model_lambda$lambda.1se
pred <- as.numeric(predict(model_lambda, newx=as.matrix(X_test), type="class" , s=best_s))
# best lambda
print(best_s)
# 0.04136537
# fraction correct
print(sum(y_test==pred)/NROW(pred))
# 0.75
# model coefficients
print(coef(model_lambda, s=best_s))
#(Intercept) -14.680479
#Sepal.Length 0
#Sepal.Width 0
#Petal.Length 1.181747
#Petal.Width 4.592025
Python Code
from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0] # four features. Disregard one of the 3 species.
y = y[y != 0]-1 # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.
n_sample = len(X)
w = .6
X_train = X[:int(w * n_sample)] # (60, 4)
y_train = y[:int(w * n_sample)] # (60,)
X_test = X[int(w * n_sample):] # (40, 4)
y_test = y[int(w * n_sample):] # (40,)
X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring = ‘accuracy’, random_state=0)
clf.fit(X_train_transformed, y_train)
print clf.score(X_train_fit.transform(X_test), y_test) # score is 0.775
print clf.intercept_ #-1.83569557
print clf.coef_ # [ 0, 0, 0.65930981, 1.17808155] (sepal length, sepal width, petal length, petal width)
print clf.C_ # optimal lambda: 0.35938137

There are a few things that are different in the examples above:
Scale of the coefficients
glmnet (https://cran.r-project.org/web/packages/glmnet/glmnet.pdf) standardizes the data and "The coefficients are always returned on the original scale". Hence you did not scale your data before calling glmnet.
The Python code standardizes the data, then fits to that standardized data. The coefs in this case are in the standardized scale, not the original scale. This makes the coefs between the examples non-comparable.
LogisticRegressionCV by default uses stratifiedfolds. glmnet uses k-fold.
They are fitting different equations. Notice that scikit-learn logistic fits (http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) with the regularization on the logistic side. glmnet puts the regularization on the penalty.
Choosing the regularization strengths to try - glmnet defaults to 100 lambdas to try. scikit LogisticRegressionCV defaults to 10. Due to the equation scikit solves, the range is between 1e-4 and 1e4 (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV).
Tolerance is different. In some problems I have had, tightening the tolerance significantly changed the coefs.
glmnet defaults thresh to 1e-7
LogisticRegressionCV default tol to 1e-4
Even after making them the same, they may not measure the same thing. I do not know what liblinear measures. glmnet - "Each inner coordinate-descent loop continues until the maximum change in the objective after any coefficient update is less than thresh times the null deviance."
You may want to try printing the regularization paths to see if they are very similar, just stopping on a different strength. Then you can research why.
Even after changing what you can change which is not all of the above, you may not get the same coefs or results. Though you are solving the same problem in different software, how the software solves the problem may be different. We see different scales, different equations, different defaults, different solvers, etc.

The problem that you've got here is the ordering of the datasets (note I haven't checked the R code, but I'm certain this is the problem). If I run your code and then run this
print np.bincount(y_train) # [50 10]
print np.bincount(y_test) # [ 0 40]
You can see the training set is not representative of the test set. However if I make a couple of changes to your Python code then I get a test accuracy of 0.9.
from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0] # four features. Disregard one of the 3 species.
y = y[y != 0]-1 # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y,
test_size=0.4,
random_state=42,
stratify=y)
X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring = 'accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print clf.score(X_train_fit.transform(X_test), y_test) # score is 0.9
print clf.intercept_ #0.
print clf.coef_ # [ 0., 0. ,0., 0.30066888] (sepal length, sepal width, petal length, petal width)
print clf.C_ # [ 0.04641589]

I have to take umbrage with a couple of things here.
Firstly, "for python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with L1 penalisation)". That is just patently false, unless you meant to qualify that in some more definitive way. Just take a look at the descriptions of the sklearn.linear_model classes and you will see a handful that specifically mention L1. I am sure that others allow you to implement it as well, but I don't really feel like counting them.
Secondly, your method for splitting the data is less than ideal. Take a look at your input and output after the split and you will find that in your split all of the test samples have target values of 1, while the target of 1 only accounts for 1/6 of your training sample. This imbalance, which is not representative of the distribution of the targets, will cause your model to be poorly fit. For example, just using sklearn.model_selection.train_test_split out of the box and then refitting the LogisticRegressionCV classifier exactly as you had, results in an accuray of .92
Now all that being said there is a glmnet package for python and you can replicate your results using this package. There is a blog by the authors of this project that discusses some of the limitations in trying to recreate glmnet results with sklearn. Specifically:
"Scikit-Learn has a few solvers that are similar to glmnet, ElasticNetCV and LogisticRegressionCV, but they have some limitations. The first one only works for linear regression and the latter does not handle the elastic net penalty." - Bill Lattner GLMNET FOR PYTHON

How to get comparable and reproducible results from LogisticRegressionCV and GridSearchCV

I want to score different classifiers with different parameters.
For speedup on LogisticRegression I use LogisticRegressionCV (which at least 2x faster) and plan use GridSearchCV for others.
But problem while it give me equal C parameters, but not the AUC ROC scoring.
I'll try fix many parameters like scorer, random_state, solver, max_iter, tol...
Please look at example (real data have no mater):
Test data and common part:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
GridSearchCV
grid = {
'C': np.power(10.0, np.arange(-10, 10))
, 'solver': ['newton-cg']
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print ('gs.best_score_:', gs.best_score_)
gs.best_score_: 0.939162082194
LogisticRegressionCV
searchCV = LogisticRegressionCV(
Cs=list(np.power(10.0, np.arange(-10, 10)))
,penalty='l2'
,scoring='roc_auc'
,cv=fold
,random_state=777
,max_iter=10000
,fit_intercept=True
,solver='newton-cg'
,tol=10
)
searchCV.fit(X, y)
print ('Max auc_roc:', searchCV.scores_[1].max())
Max auc_roc: 0.970588235294
Solver newton-cg used just to provide fixed value, other tried too.
What I forgot?
P.S. In both cases I also got warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed
warnings.warn('Line Search failed')" which I can't understand too. I'll be happy if someone also describe what it mean, but I hope it is not relevant to my main question.
EDIT UPDATES
By #joeln comment add max_iter=10000 and tol=10 parameters too. It does not change result in any digit, but the warning disappeared.

Here is a copy of the answer by Tom on the scikit-learn issue tracker:
LogisticRegressionCV.scores_ gives the score for all the folds.
GridSearchCV.best_score_ gives the best mean score over all the folds.
To get the same result, you need to change your code:
print('Max auc_roc:', searchCV.scores_[1].max()) # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max()) # is correct
By also using the default tol=1e-4 instead of your tol=10, I get:
('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)
The (small) remaining difference might come from warm starting in LogisticRegressionCV (which is actually what makes it faster than GridSearchCV).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.