FutureWarning in scikit-learn Logistic Regression solver - python

I have been following a Udemy course to learn machine learning. It contains a lot of deprecated code, and now I have run into this issue:
The code:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
The warning:
C:\Users\admin\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
How can I get rid of this deprecation warning?

In scikit-learn v0.20, which you are probably using, the default solver for LogisticRegression is liblinear; from the docs:
solver : str, {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default: ‘liblinear’.
In v0.22 (the latest release at the time of writing), the default changed to lbfgs.
So, to avoid surprises, scikit-learn warns you in advance that the default will change in a future version, which lets you keep your code future-proof.
To get rid of the warning, explicitly specify a solver in your LogisticRegression definition, i.e.
classifier = LogisticRegression(random_state = 0, solver='lbfgs') # default in v0.22
or
classifier = LogisticRegression(random_state = 0, solver='liblinear') # default until v0.21
The LogisticRegression documentation lists all the available options, along with a short comment or piece of advice on each one.
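As a quick check, the sketch below fits the classifier with an explicit solver and records any warnings raised along the way. The synthetic data from make_classification and the variable names are assumptions for illustration; they are not the asker's data.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X_train / y_train (assumption, not the asker's data)
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Naming the solver explicitly means the library default never matters
    classifier = LogisticRegression(random_state=0, solver="lbfgs")
    classifier.fit(X_train, y_train)

# Keep only FutureWarnings that mention the solver
solver_warnings = [
    w for w in caught
    if issubclass(w.category, FutureWarning) and "solver" in str(w.message)
]
print(classifier.solver, len(solver_warnings))
```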

The warning message itself tells you what to do: explicitly specify which solver to use:
classifier = LogisticRegression(random_state = 0, solver='lbfgs')
(or any other solver you want to use)
For available options, see the sklearn docs.

Try using
classifier = LogisticRegression(random_state=0, solver="liblinear")
And check out the solver parameter in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Related

Parameters: { scale_pos_weight } might not be used

I'm dealing with this warning:
[20:16:09] WARNING: ../src/learner.cc:541:
Parameters: { scale_pos_weight } might not be used.
This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.
while training an XGBoost in Python.
I've been reading about it, and it seems to depend on the classification type (binary or multiclass). The thing is that I'm doing binary classification on imbalanced data (6483252 negative / 70659 positive), so I need to set that parameter to account for the imbalance during training, but I don't understand why I'm getting that warning :(
This is how I'm initializing and training the XGBoost:
param = {'n_jobs':-1, 'random_state':5, 'booster':'gbtree', 'seed':5, 'objective': 'binary:hinge', 'scale_pos_weight':ratio}
param['eval_metric'] = ['auc', 'aucpr', 'rmse', 'error']
xgb_clf =xgb.XGBClassifier(**param)
xgb_clf.fit(dtrain,y_train)
dtrain is a pandas dataframe and y_train is a pandas series with the labels (0,1).
Thanks!
There are two possible causes:
Your training set is multiclass, in which case the parameter is not valid.
An n_jobs problem in the implementation (try setting n_jobs to 0).
These are the most common problems.
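For reference, the ratio variable in the question is not defined in the snippet. A common heuristic, suggested in the XGBoost parameter documentation, is the number of negative samples divided by the number of positive samples. A minimal sketch assuming that heuristic; class_ratio is a hypothetical helper name, not an xgboost function:

```python
def class_ratio(n_negative, n_positive):
    """Return n_negative / n_positive, a common starting value for scale_pos_weight."""
    if n_positive <= 0:
        raise ValueError("need at least one positive sample")
    return n_negative / n_positive


# Class counts quoted in the question
ratio = class_ratio(6483252, 70659)
print(round(ratio, 2))  # about 91.75
```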

python sklearn get list of available hyper parameters for model

I am using Python with sklearn and would like to get a list of the available hyperparameters for a model. How can this be done? Thanks
This needs to happen before I initialize the model. When I try to use
model.get_params()
I get this
TypeError: get_params() missing 1 required positional argument: 'self'
This should do it: estimator.get_params() where estimator is the name of your model.
To use it on a model you can do the following:
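from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
params = reg.get_params()
# do something...
reg.set_params(**params)  # set_params takes keyword arguments, so unpack the dict
reg.fit(X, y)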
EDIT:
To get the model hyperparameters before you instantiate the class:
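import inspect
import sklearn.ensemble
import sklearn.linear_model
models = [sklearn.ensemble.RandomForestRegressor, sklearn.linear_model.LinearRegression]
for m in models:
    # Originally inspect.getargspec(m.__init__).args, which was removed in Python 3.11
    hyperparams = list(inspect.signature(m.__init__).parameters)
    print(hyperparams) # Do something with them here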
In sklearn, the model hyperparameters are passed in to the constructor, so we can use the inspect module to see which constructor parameters are available, and thus the hyperparameters. You may need to filter out some arguments that aren't specific to the model, such as self and n_jobs.
As of May 2021:
(Building on sudo's answer)
# To get the model hyperparameters before you instantiate the class
import inspect
import sklearn.linear_model
models = [sklearn.linear_model.LinearRegression]
for m in models:
    hyperparams = inspect.signature(m.__init__)
    print(hyperparams)
#>>> (self, *, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)
Using inspect.getargspec(m.__init__).args, as suggested by sudo in the accepted answer, generated the following warning:
DeprecationWarning: inspect.getargspec() is deprecated since Python 3.0,
use inspect.signature() or inspect.getfullargspec()
If you happen to be looking at CatBoost, try .get_all_params() instead of get_params().
estimator._get_param_names() returns a list of all available hyperparameters for a given estimator (model). It is a class method, so it can be called without instantiating the model:
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
SVR._get_param_names()
['C',
'cache_size',
'coef0',
'degree',
'epsilon',
'gamma',
'kernel',
'max_iter',
'shrinking',
'tol',
'verbose']
RandomForestRegressor._get_param_names()
['bootstrap',
'ccp_alpha',
'criterion',
'max_depth',
'max_features',
'max_leaf_nodes',
'max_samples',
'min_impurity_decrease',
'min_samples_leaf',
'min_samples_split',
'min_weight_fraction_leaf',
'n_estimators',
'n_jobs',
'oob_score',
'random_state',
'verbose',
'warm_start']
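Note that _get_param_names is a private method (leading underscore), so it may change between releases. Here is a sketch of an alternative that relies only on the standard inspect module, assuming the scikit-learn convention that every hyperparameter is an explicit __init__ argument; init_param_names is a hypothetical helper name:

```python
import inspect

from sklearn.svm import SVR


def init_param_names(estimator_cls):
    """List constructor argument names of an estimator class, without instantiating it."""
    signature = inspect.signature(estimator_cls.__init__)
    skip_kinds = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return sorted(
        name
        for name, param in signature.parameters.items()
        if name != "self" and param.kind not in skip_kinds
    )


names = init_param_names(SVR)
print(names)
```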

how to enforce Monotonic Constraints in XGBoost with ScikitLearn?

I built an XGBoost model using scikit-learn and I am pretty happy with it. As a fine-tuning step to avoid overfitting, I'd like to enforce monotonicity for some features, but this is where I start running into difficulties...
As far as I understand, there is no documentation in scikit-learn about xgboost (which I confess I am really surprised about, given that this situation has lasted for several months). The only documentation I found is directly on http://xgboost.readthedocs.io
On this website, I found out that monotonicity can be enforced using the "monotone_constraints" option.
I tried to use it in scikit-learn, but I got the error message "TypeError: __init__() got an unexpected keyword argument 'monotone_constraints'"
Do you know a way to do it?
Here is the code I wrote in python (using spyder):
grid = {'learning_rate' : 0.01, 'subsample' : 0.5, 'colsample_bytree' : 0.5,
'max_depth' : 6, 'min_child_weight' : 10, 'gamma' : 1,
'monotone_constraints' : monotonic_indexes}
#'monotone_constraints' ~ = "(1,-1)"
m07_xgm06 = xgb.XGBClassifier(n_estimators=2000, **grid)
m07_xgm06.fit(X_train_v01_oe, Label_train, early_stopping_rounds=10, eval_metric="logloss",
eval_set=[(X_test1_v01_oe, Label_test1)])
In order to do this using the xgboost sklearn API, you need to upgrade to xgboost 0.81. They fixed the ability to set parameters controlled via kwargs as part of this PR:
https://github.com/dmlc/xgboost/pull/3791
The XGBoost scikit-learn API currently (0.6a2) doesn't support monotone_constraints. You can use the native Python API instead; take a look at the example.
This code in the example can be removed:
params_constr['updater'] = "grow_monotone_colmaker,prune"
How would you expect monotone constraints to work for a general classification problem where the response might have more than 2 levels? All the examples I've seen relating to this functionality are for regression problems. If your classification response only has 2 levels, try switching to regression on an indicator variable and then choose an appropriate score threshold for classification.
This feature appears to work as of the latest xgboost / scikit-learn, provided that you use an XGBRegressor rather than an XGBClassifier and set monotone_constraints via kwargs.
The syntax is like this:
params = {
'monotone_constraints':'(-1,0,1)'
}
normalised_weighted_poisson_model = XGBRegressor(**params)
In this example, there is a negative constraint on column 1 in the training data, no constraint on column 2, and a positive constraint on column 3. It is up to you to keep track of which is which - you cannot refer to columns by name, only by position, and you must specify an entry in the constraint tuple for every column in your training data.
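Since the constraints are positional, one way to reduce bookkeeping errors is to build the string from the column order of the training data. A minimal sketch in plain Python; build_monotone_constraints and the feature names are hypothetical and not part of xgboost:

```python
def build_monotone_constraints(columns, constraints_by_name):
    """Build an xgboost-style constraint string such as "(-1,0,1)".

    columns: feature names in the same order as the training data columns.
    constraints_by_name: maps a feature name to -1, 0 or 1; missing names get 0.
    """
    unknown = set(constraints_by_name) - set(columns)
    if unknown:
        raise KeyError(f"constraints given for unknown columns: {sorted(unknown)}")
    values = [constraints_by_name.get(name, 0) for name in columns]
    if any(v not in (-1, 0, 1) for v in values):
        raise ValueError("each constraint must be -1, 0 or 1")
    return "(" + ",".join(str(v) for v in values) + ")"


columns = ["age", "region_code", "income"]  # hypothetical feature names
constraint_string = build_monotone_constraints(columns, {"age": -1, "income": 1})
print(constraint_string)  # (-1,0,1)
```

The resulting string can then be used as the value of 'monotone_constraints' in the params dict.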

Semi-supervised learning for regression by scikit-learn

Can Label Propagation be used for semi-supervised regression tasks in scikit-learn?
According to its API, the answer is YES.
http://scikit-learn.org/stable/modules/label_propagation.html
However, I got an error message when I tried to run the following code.
from sklearn import datasets
from sklearn.semi_supervised import label_propagation
import numpy as np
rng=np.random.RandomState(0)
boston = datasets.load_boston()
X=boston.data
y=boston.target
y_30=np.copy(y)
y_30[rng.rand(len(y))<0.3]=-999
label_propagation.LabelSpreading().fit(X,y_30)
It shows that "ValueError: Unknown label type: 'continuous'" in the label_propagation.LabelSpreading().fit(X,y_30) line.
How should I solve the problem? Thanks a lot.
It looks like an error in the documentation; the code itself is clearly classification only (beginning of the .fit call of the BasePropagation class):
check_classification_targets(y)
# actual graph construction (implementations should override this)
graph_matrix = self._build_graph()
# label construction
# construct a categorical distribution for classification only
classes = np.unique(y)
classes = (classes[classes != -1])
In theory you could remove the "check_classification_targets" call and use a "regression-like method", but it would not be true regression, since you would never "propagate" any value that is not encountered in the training set; you would simply treat each regression value as a class identifier. You would also be unable to use the value "-1", since it is the code for "unlabeled"...
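For contrast, here is a sketch of the classification use case that LabelSpreading is written for, with -1 marking the unlabeled samples. The iris dataset and the 30% masking rate are assumptions chosen to mirror the setup in the question:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

rng = np.random.RandomState(0)
X, y = load_iris(return_X_y=True)

# Hide roughly 30% of the class labels; -1 is the marker for "unlabeled"
y_30 = np.copy(y)
y_30[rng.rand(len(y)) < 0.3] = -1

model = LabelSpreading().fit(X, y_30)

# transduction_ holds a class label inferred for every sample, including the hidden ones
print(model.transduction_[:10])
```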

How to pass argument to scoring function in scikit-learn's LogisticRegressionCV call

Problem
I am trying to use scikit-learn's LogisticRegressionCV with roc_auc_score as the scoring metric.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
clf = LogisticRegressionCV(scoring=roc_auc_score)
But when I attempt to fit the model (clf.fit(X, y)), it throws an error.
ValueError: average has to be one of (None, 'micro', 'macro', 'weighted', 'samples')
That's cool. It's clear what's going on: roc_auc_score needs to be called with the average argument specified, per its documentation and the error above. So I tried that.
clf = LogisticRegressionCV(scoring=roc_auc_score(average='weighted'))
But it turns out that roc_auc_score can't be called with an optional argument alone, because this throws another error.
TypeError: roc_auc_score() takes at least 2 arguments (1 given)
Question
Any thoughts on how I can use roc_auc_score as the scoring metric for LogisticRegressionCV in a way that I can specify an argument for the scoring function?
I can't find an SO question on this issue or a discussion of this issue in scikit-learn's GitHub repo, but surely someone has run into this before?
You can use make_scorer, e.g.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score, make_scorer
from sklearn.datasets import make_classification
# some example data
X, y = make_classification()
# little hack to filter out Proba(y==1)
def roc_auc_score_proba(y_true, proba):
    return roc_auc_score(y_true, proba[:, 1])
# define your scorer
auc = make_scorer(roc_auc_score_proba, needs_proba=True)
# define your classifier
clf = LogisticRegressionCV(scoring=auc)
# train
clf.fit(X, y)
# have look at the scores
print(clf.scores_)
I found a way to solve this problem!
scikit-learn offers a make_scorer function in its metrics module that allows a user to create a scoring object from one of its native scoring functions with arguments specified to non-default values (see here for more information on this function from the scikit-learn docs).
So, I created a scoring object with the average argument specified.
roc_auc_weighted = sk.metrics.make_scorer(sk.metrics.roc_auc_score, average='weighted')
Then, I passed that object in the call to LogisticRegressionCV and it ran without any issues!
clf = LogisticRegressionCV(scoring=roc_auc_weighted)
A bit late (4 years later). But today you can use:
clf = LogisticRegressionCV(scoring='roc_auc')
Also, all other scoring keys can be obtained through:
from sklearn.metrics import SCORERS
print(SCORERS.keys())
