Choosing the best model via k-fold cross-validation - Python

I want to take the Iris data and choose the best logistic regression model using the GridSearchCV function.
My work so far:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
# Logistic regression
reg_log = LogisticRegression()
# Penalties
pen = ['l1', 'l2','none']
# Regularization strength (100 values from 1e-10 to 1e10)
C = np.logspace(-10, 10, 100)
# Possibilities for those parameters
parameters= dict(C=C, penalty=pen)
# choosing best model based on 5-fold cross validation
Model = GridSearchCV(reg_log, parameters, cv=5)
# Fitting best model
Best_model = Model.fit(X, y)
And I get a lot of errors. Do you know what I'm doing wrong?

Since you are searching over different regularization penalties, note what the documentation says about solver support:
The ‘newton-cg’, ‘sag’, and ‘lbfgs’ solvers support only L2
regularization with primal formulation, or no regularization. The
‘liblinear’ solver supports both L1 and L2 regularization, with a dual
formulation only for the L2 penalty. The Elastic-Net regularization is
only supported by the ‘saga’ solver.
I am not quite sure it makes sense to grid-search over penalty = 'none' together with regularization strengths, since C has no effect without a penalty. If you use the saga solver and increase the iteration limit:
import pandas as pd  # needed below to inspect cv_results_
reg_log = LogisticRegression(solver="saga", max_iter=1000)
pen = ['l1', 'l2']
C = [0.1,0.001]
parameters= dict(C=C, penalty=pen)
Model = GridSearchCV(reg_log, parameters, cv=5)
Best_model = Model.fit(X, y)
res = pd.DataFrame(Best_model.cv_results_)
res[['param_C','param_penalty','mean_test_score']]
param_C param_penalty mean_test_score
0 0.1 l1 0.753333
1 0.1 l2 0.833333
2 0.001 l1 0.333333
3 0.001 l2 0.700000
This works fine. If you still get errors with other penalization values, look at the offending parameter combinations and make sure they are not extreme values the solver cannot handle.
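If you also want the unpenalized model in the search, one option (a sketch, not tied to a specific scikit-learn version) is to pass GridSearchCV a list of parameter dictionaries, so each solver is only paired with penalties it supports:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
# Each dict is expanded separately, so solvers only see penalties they support
param_grid = [
    {"solver": ["saga"], "penalty": ["l1", "l2"], "C": np.logspace(-4, 4, 20)},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": np.logspace(-4, 4, 20)},
    {"solver": ["lbfgs"], "penalty": [None]},  # use "none" (string) on older scikit-learn
]
model = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
model.fit(X, y)
print(model.best_params_, model.best_score_)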

Reproducing Sklearn SVC within GridSearchCV's roc_auc scores manually

I would like to reproduce the results that sklearn's GridSearchCV obtains for a SelectKBest + SVC pipeline by performing the grid-search CV myself. However, my code produces different results. Here is a reproducible example:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import itertools
r = 1
X, y = make_classification(n_samples = 50, n_features = 20, weights = [3/5], random_state = r)
np.random.seed(r)
X = np.random.rand(X.shape[0], X.shape[1])
K = [1,3,5]
C = [0.1,1]
cv = StratifiedKFold(n_splits = 10)
space = dict()
space['anova__k'] = K
space['svc__C'] = C
clf = Pipeline([('anova', SelectKBest()), ('svc', SVC(probability = True, random_state = r))])
search = GridSearchCV(clf, space, scoring = 'roc_auc', cv = cv, refit = True, n_jobs = -1)
result = search.fit(X, y)
print('GridSearchCV results:')
print(result.cv_results_['mean_test_score'])
scores = []
for train_indx, test_indx in cv.split(X, y):
    X_train, y_train = X[train_indx,:], y[train_indx]
    X_test, y_test = X[test_indx,:], y[test_indx]
    scores_ = []
    for k, c in itertools.product(K, C):
        anova = SelectKBest(k = k)
        X_train_k = anova.fit_transform(X_train, y_train)
        clf = SVC(C = c, probability = True, random_state = r).fit(X_train_k, y_train)
        y_pred = clf.predict_proba(anova.transform(X_test))[:, 1]
        scores_.append(roc_auc_score(y_test, y_pred))
    scores.append(scores_)
print('Manual grid-search CV results:')
print(np.mean(np.array(scores), axis = 0))
For me, this produces the following output:
GridSearchCV results:
[0.41666667 0.4 0.4 0.4 0.21666667 0.26666667]
Manual grid-search CV results:
[0.58333333 0.6 0.53333333 0.46666667 0.48333333 0.5 ]
When using the make_classification dataset directly, the output matches. On the other hand, when X is replaced with np.random.rand values as above, the scores differ.
Is there some random process that I am not aware of underneath?
Edit: restructured my answer, since it seems you are after more of a "why?" and "how should I?" vs a "how can I?"
The Issue
The scorer you're using in GridSearchCV isn't being passed the output of predict_proba like it is in your loop version; it's being passed the output of decision_function. For SVMs, the argmax of the probabilities may differ from the argmax of the decision values, as described here:
The cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores:
- the "argmax" of the scores may not be the argmax of the probabilities
- in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.
How I would Fix It
Use SVC(probability = False, ...) in both the Pipeline/GridSearchCV approach and the loop, and use decision_function in the loop instead of predict_proba. According to the blurb above, this will also speed up your code.
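A sketch of that fix, reusing the objects already defined in the question (only the SVC arguments and the scoring line change; drop probability = True from the Pipeline's SVC as well):
scores = []
for train_indx, test_indx in cv.split(X, y):
    X_train, y_train = X[train_indx, :], y[train_indx]
    X_test, y_test = X[test_indx, :], y[test_indx]
    scores_ = []
    for k, c in itertools.product(K, C):
        anova = SelectKBest(k = k)
        X_train_k = anova.fit_transform(X_train, y_train)
        clf = SVC(C = c, probability = False, random_state = r).fit(X_train_k, y_train)
        y_score = clf.decision_function(anova.transform(X_test))  # decision values, not probabilities
        scores_.append(roc_auc_score(y_test, y_score))
    scores.append(scores_)
print(np.mean(np.array(scores), axis = 0))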
My Original, Literal Answer to Your Question
To make your loop match GridSearchCV, leaving the GridSearchCV approach alone:
y_pred = clf.decision_function(anova.transform(X_test)) # instead of predict_proba
To make GridSearchCV match your loop, leaving the loop code alone:
from sklearn.metrics import make_scorer
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True, needs_proba=True)
search = GridSearchCV(clf, space, scoring = roc_auc_scorer, cv = cv, refit = True, n_jobs = -1)
The key difference between your implementation and the way GridSearchCV operates is that GridSearchCV uses the decision_function method to compute roc_auc, whereas your implementation uses predict_proba.
Just change the following line:
y_pred = clf.decision_function(anova.transform(X_test))
You will get the same results both ways after that.
GridSearchCV results:
[0.41666667 0.4 0.4 0.4 0.21666667 0.26666667]
Manual grid-search CV results:
[0.41666667 0.4 0.4 0.4 0.21666667 0.26666667]
More explanation about the scoring in GridSearchCV here.
This inconsistency is documented in the SVC's probability parameter:
probability bool, default=False
Whether to enable probability estimates. This must be enabled prior to
calling fit, will slow down that method as it internally uses 5-fold
cross-validation, and predict_proba may be inconsistent with predict.
Read more in the User Guide.
This is probably why there is no difference when using the make_classification dataset directly: there, the 5-fold CV-based probability estimates are consistent with the decision scores because the features are drawn from a Gaussian distribution, whereas with np.random.rand() the two can end up quite different.

Model to add two numbers using machine learning

I'm trying to use machine learning to model addition, but the model always predicts the same value. Here is my code:
import numpy as np
import random
from sklearn.naive_bayes import GaussianNB
X=np.array([[0,1],[1,1],[2,1],[2,2],[2,3],[3,3],[3,4],[4,4],[4,5]])
Y=np.array([1,2,3,4,5,6,7,8,9])
clf = GaussianNB()
clf.fit(X,Y)
x=random.random()
y=random.random()
d=1
e=10000
accuracy=0
while d<e:
    d+=1
    if (clf.predict([[x, y]])) == x+y:
        accuracy+=1
    if d==e:
        print(accuracy)
In 10000 predictions, the predicted Y was never the sum of the two random values in X. What went wrong?
First of all, as pointed out in the comments, this is a regression problem, not a classification one, and GaussianNB is a classifier. Secondly, your evaluation loop is flawed: you predict on the same x and y every time, since you never regenerate the random values inside the loop.
Here's how you could go about solving this problem. You're trying to model a linear relation between the features and the target variable, i.e. you want the model to learn the mapping f(X) -> y, which in this case is a simple addition. Hence you need a linear model.
So here we could use LinearRegression. To train the regressor you could do:
import numpy as np
import pandas as pd  # used below to display predictions next to true values
from sklearn.linear_model import LinearRegression
X = np.random.randint(0,1000, (20000, 2))
y = X.sum(1)
lr = LinearRegression()
lr.fit(X,y)
And then similarly generate a test set with unseen combinations, which the regressor should hopefully be able to predict accurately:
X_test = np.random.randint(0,1000, (2000, 2))
y_test = X_test.sum(1)
If we predict with the trained model, and compare the predicted values with the original ones, we see that the model indeed perfectly maps the addition function as we expected it to:
y_pred = lr.predict(X_test)
pd.DataFrame(np.vstack([y_test, y_pred]).T, columns=['Test', 'Pred']).head(10)
Test Pred
0 1110.0 1110.0
1 557.0 557.0
2 92.0 92.0
3 1210.0 1210.0
4 1176.0 1176.0
5 1542.0 1542.0
By checking the model's coef_, we can see that the model has learnt the following optimal coefficients:
lr.coef_
# array([1., 1.])
And:
lr.intercept_
# 4.547473508864641e-13 -> 0
Which basically turns a linear regression into an addition, for instance:
X_test[0]
# array([127, 846])
So we'd have that y_pred = 0 + 1*127 + 1*846
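You can check that directly against the fitted parameters (reusing lr, X_test and y_pred from above; the exact numbers depend on the random draw):
manual = lr.intercept_ + lr.coef_ @ X_test[0]
print(manual)     # ~973.0 = 127 + 846 for the draw shown above
print(y_pred[0])  # ~973.0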
random.random() generates a random real number between 0 and 1, so x+y is essentially never an integer, while the classifier can only predict one of the integer labels it was trained on. You can use random.randint(a,b) instead:
x=random.randint(0,4)
y=random.randint(1,5)
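Putting the two answers together, a regression model plus inputs that are regenerated on every iteration, a minimal sketch of a corrected evaluation loop could look like this (rounding is needed because the regressor returns floats):
import random
import numpy as np
from sklearn.linear_model import LinearRegression
# Train a linear model on integer pairs; it learns coefficients (1, 1) and intercept ~0
X = np.random.randint(0, 10, (1000, 2))
y = X.sum(axis=1)
lr = LinearRegression().fit(X, y)
correct = 0
trials = 10000
for _ in range(trials):
    a = random.randint(0, 4)
    b = random.randint(1, 5)          # fresh values every iteration
    pred = lr.predict([[a, b]])[0]
    if round(pred) == a + b:          # predictions are floats, so round first
        correct += 1
print(correct / trials)               # expect 1.0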

Finding regression equation from support vector regression

I'm currently using Python's scikit-learn to create a support vector regression model, and I was wondering how one would go about finding the explicit regression equation of our target variable in terms of our predictors. It doesn't have to be simple or pretty, but is there a method Python has to output this (for a polynomial kernel, specifically)? I am fairly new to using SVR, and I am not certain of what to expect a regression equation to look like used in the prediction from a test observation after the regression is fit.
I've already fit an SVR model that predicts with a performance I'm happy with, and I've used GridSearchCV to tune hyper-parameters. However, I need an explicit form of my target variable in terms of the predictors for an independent optimization, and don't know how to find this equation.
from sklearn.svm import SVR
svr = SVR(kernel = 'poly', C = best_params['C'], epsilon = best_params['epsilon'], gamma = best_params['gamma'], coef0 = 0.1, shrinking = True, tol = 0.001, cache_size = 200, verbose = False, max_iter = -1)
svr.fit(x,y)
Where x is my matrix of observations, y is my vector of target values from the observations, and best_params is the output (optimal hyperparameters) found by GridSearchCV.
Does Python have any method for outputting the resulting equation of the SVR model used in predicting future target values from a set of predictors? Or is there a straightforward way of using values found by SVR to create an equation myself if I specify the kernel to be of polynomial type?
Thank you!
If you use a linear kernel, then you can output the fitted coefficients directly.
For example:
from sklearn.svm import SVR
import numpy as np
n_samples, n_features = 1000, 5
rng = np.random.RandomState(0)
coef = [1,2,3,4,5]
X = rng.randn(n_samples, n_features)
y = coef * X
y = y.sum(axis = 1) + rng.randn(n_samples)
clf = SVR(kernel = 'linear', gamma='scale', C=1.0, epsilon=0.2)
clf.fit(X, y)
clf.coef_
array([[0.97626634, 2.00013793, 2.96205576, 4.00651352, 4.95923782]])
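For a polynomial kernel there is no single coef_ vector, but the fitted model can still be written out explicitly from its support vectors and dual coefficients: f(x) = sum_i alpha_i * (gamma * <x_i, x> + coef0)^degree + b. Below is a sketch (a manual reconstruction, not a built-in export) that checks this formula against predict, with gamma fixed explicitly so all kernel parameters are known:
from sklearn.svm import SVR
import numpy as np
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2]
# Fix gamma explicitly so every kernel parameter is known
svr = SVR(kernel='poly', degree=2, gamma=0.5, coef0=1.0, C=1.0, epsilon=0.1)
svr.fit(X, y)
x_new = rng.randn(3)
# Kernel values between each support vector and the new point
k = (svr.gamma * (svr.support_vectors_ @ x_new) + svr.coef0) ** svr.degree
manual = svr.dual_coef_ @ k + svr.intercept_
print(manual)                              # reconstructed prediction
print(svr.predict(x_new.reshape(1, -1)))   # should match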

Reproducing LASSO / Logistic Regression results in R with Python using the Iris Dataset

I'm trying to reproduce the following R results in Python. In this particular case the R predictive skill is lower than the Python skill, but this is usually not the case in my experience (hence the reason for wanting to reproduce the results in Python), so please ignore that detail here.
The aim is to predict the flower species ('versicolor' 0 or 'virginica' 1). We have 100 labelled samples, each consisting of 4 flower characteristics: sepal length, sepal width, petal length, petal width. I've split the data into training (60% of data) and test sets (40% of data). 10-fold cross-validation is applied to the training set to search for the optimal lambda (the parameter that is optimized is "C" in scikit-learn).
I'm using glmnet in R with alpha set to 1 (for the LASSO penalty), and for python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with L1 penalisation). The scoring metrics used in the cross-validation are the same between both languages. However somehow the model results are different (the intercepts and coefficients found for each feature vary quite a bit).
R Code
library(glmnet)
library(datasets)
data(iris)
y <- as.numeric(iris[,5])
X <- iris[y!=1, 1:4]
y <- y[y!=1]-2
n_sample = NROW(X)
w = .6
X_train = X[0:(w * n_sample),] # (60, 4)
y_train = y[0:(w * n_sample)] # (60,)
X_test = X[((w * n_sample)+1):n_sample,] # (40, 4)
y_test = y[((w * n_sample)+1):n_sample] # (40,)
# set alpha=1 for LASSO and alpha=0 for ridge regression
# use class for logistic regression
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
nfolds = 10, alpha=1, family="binomial", type.measure="class")
best_s <- model_lambda$lambda.1se
pred <- as.numeric(predict(model_lambda, newx=as.matrix(X_test), type="class" , s=best_s))
# best lambda
print(best_s)
# 0.04136537
# fraction correct
print(sum(y_test==pred)/NROW(pred))
# 0.75
# model coefficients
print(coef(model_lambda, s=best_s))
#(Intercept) -14.680479
#Sepal.Length 0
#Sepal.Width 0
#Petal.Length 1.181747
#Petal.Width 4.592025
Python Code
from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0] # four features. Disregard one of the 3 species.
y = y[y != 0]-1 # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.
n_sample = len(X)
w = .6
X_train = X[:int(w * n_sample)] # (60, 4)
y_train = y[:int(w * n_sample)] # (60,)
X_test = X[int(w * n_sample):] # (40, 4)
y_test = y[int(w * n_sample):] # (40,)
X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring='accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print(clf.score(X_train_fit.transform(X_test), y_test)) # score is 0.775
print(clf.intercept_) # -1.83569557
print(clf.coef_) # [ 0, 0, 0.65930981, 1.17808155] (sepal length, sepal width, petal length, petal width)
print(clf.C_) # optimal lambda: 0.35938137
There are a few things that are different in the examples above:
Scale of the coefficients
glmnet (https://cran.r-project.org/web/packages/glmnet/glmnet.pdf) standardizes the data internally, and "The coefficients are always returned on the original scale"; accordingly, the R code does not scale the data before calling glmnet.
The Python code standardizes the data and then fits to that standardized data, so its coefficients are on the standardized scale, not the original scale. This makes the coefficients between the two examples non-comparable.
LogisticRegressionCV by default uses stratified folds; glmnet uses plain k-fold.
They are fitting different objective functions. Notice that scikit-learn's logistic regression (http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) puts its tuning parameter C on the log-loss term, whereas glmnet puts its tuning parameter lambda on the penalty term.
Choosing the regularization strengths to try - glmnet defaults to 100 lambdas to try. scikit LogisticRegressionCV defaults to 10. Due to the equation scikit solves, the range is between 1e-4 and 1e4 (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV).
Tolerance is different. In some problems I have had, tightening the tolerance significantly changed the coefs.
glmnet defaults thresh to 1e-7
LogisticRegressionCV defaults tol to 1e-4
Even after making them the same, they may not measure the same thing. I do not know what liblinear measures. glmnet - "Each inner coordinate-descent loop continues until the maximum change in the objective after any coefficient update is less than thresh times the null deviance."
You may want to try printing the regularization paths to see if they are very similar, just stopping on a different strength. Then you can research why.
Even after changing what you can (which is not all of the above), you may not get the same coefficients or results. Though you are solving the same problem in different software, how each piece of software solves it may differ. We see different scales, different equations, different defaults, different solvers, etc.
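If you want the regularization grids themselves to line up, a rough mapping (a sketch that ignores the standardization and tolerance differences above) is lambda_glmnet ≈ 1 / (n_samples * C_sklearn), because glmnet scales the log-likelihood by 1/N while scikit-learn scales it by C. Reusing X_train_transformed and y_train from the question:
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import KFold
n = X_train_transformed.shape[0]
lambdas = np.logspace(-4, 1, 100)       # a dense lambda grid, like glmnet's 100 values
Cs = 1.0 / (n * lambdas)                # rough equivalence: lambda ~ 1 / (n * C)
clf = LogisticRegressionCV(Cs=Cs, penalty='l1', solver='liblinear',
                           cv=KFold(n_splits=10), scoring='accuracy',
                           tol=1e-7, random_state=0)
clf.fit(X_train_transformed, y_train)
print(1.0 / (n * clf.C_))               # on roughly the same scale as glmnet's lambda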
The problem that you've got here is the ordering of the datasets (note I haven't checked the R code, but I'm certain this is the problem). If I run your code and then run this
print(np.bincount(y_train)) # [50 10]
print(np.bincount(y_test)) # [ 0 40]
You can see the training set is not representative of the test set. However if I make a couple of changes to your Python code then I get a test accuracy of 0.9.
from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0] # four features. Disregard one of the 3 species.
y = y[y != 0]-1 # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring = 'accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)
print(clf.score(X_train_fit.transform(X_test), y_test)) # score is 0.9
print(clf.intercept_) # 0.
print(clf.coef_) # [ 0., 0., 0., 0.30066888] (sepal length, sepal width, petal length, petal width)
print(clf.C_) # [ 0.04641589]
I have to take umbrage with a couple of things here.
Firstly, "for python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with L1 penalisation)". That is just patently false, unless you meant to qualify that in some more definitive way. Just take a look at the descriptions of the sklearn.linear_model classes and you will see a handful that specifically mention L1. I am sure that others allow you to implement it as well, but I don't really feel like counting them.
Secondly, your method for splitting the data is less than ideal. Take a look at your input and output after the split and you will find that in your split all of the test samples have target values of 1, while the target of 1 only accounts for 1/6 of your training sample. This imbalance, which is not representative of the distribution of the targets, will cause your model to be poorly fit. For example, just using sklearn.model_selection.train_test_split out of the box and then refitting the LogisticRegressionCV classifier exactly as you had results in an accuracy of 0.92.
Now all that being said there is a glmnet package for python and you can replicate your results using this package. There is a blog by the authors of this project that discusses some of the limitations in trying to recreate glmnet results with sklearn. Specifically:
"Scikit-Learn has a few solvers that are similar to glmnet, ElasticNetCV and LogisticRegressionCV, but they have some limitations. The first one only works for linear regression and the latter does not handle the elastic net penalty." - Bill Lattner GLMNET FOR PYTHON

l1 regularized support for Multinomial Logistic Regresion

The current sklearn LogisticRegression supports the multinomial setting but only allows for an l2 regularization since the solvers l-bfgs-b and newton-cg only support that. Andrew Ng has a paper that discusses why l2 regularization shouldn't be used with l-bfgs-b.
If I were to use sklearn's SGDClassifier with log loss and l1 penalty, would that be the same as multinomial logistic regression with l1 regularization minimized by stochastic gradient descent? If not, are there any open source python packages that support l1 regularized loss for multinomial logistic regression?
According to the SGD documentation:
For multi-class classification, a “one versus all” approach is used.
So I think SGDClassifier cannot perform multinomial logistic regression either.
You can use statsmodels.discrete.discrete_model.MNLogit, which has a method fit_regularized which supports L1 regularization.
The example below is modified from this example:
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in very old versions
iris = load_iris()
X = iris.data
y = iris.target
X = sm.add_constant(X, prepend=False) # An intercept is not included by default and should be added by the user.
X_train, X_test, y_train, y_test = train_test_split(X, y)
mlogit_mod = sm.MNLogit(y_train, X_train)
alpha = 1 * np.ones((mlogit_mod.K, mlogit_mod.J - 1)) # The regularization parameter alpha should be a scalar or have the same shape as results.params
alpha[-1, :] = 0 # Choose not to regularize the constant
mlogit_l1_res = mlogit_mod.fit_regularized(method='l1', alpha=alpha)
y_pred = np.argmax(mlogit_l1_res.predict(X_test), 1)
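To sanity-check the fit, you can score the held-out predictions and inspect the sparsity induced by the penalty (a small addition reusing the arrays above):
print(np.mean(y_pred == y_test))  # held-out accuracy of the L1-regularized fit
print(mlogit_l1_res.params)       # coefficients; the L1 penalty drives some to zero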
Admittedly, the interface of this library is not as easy to use as scikit-learn, but it provides more advanced stuff in statistics.
The package lightning has support for multinomial logit via SGD with l1 regularization.
