I've currently implemented a probabilistic classifier (at least I think so) for binary classes. Now I want to extend this approach to regression, and I'm trying to use it on the Boston dataset. Unfortunately, it seems like my algorithm is stuck; the code I'm currently running looks like this:
from sklearn import decomposition
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target
inputs_train, inputs_test, targets_train, targets_test = train_test_split(X, y, test_size=0.33, random_state=42)
def plotting():
    param_C = [0.01, 0.1]
    param_grid = {'C': param_C, 'kernel': ['poly', 'rbf'], 'gamma': [0.1, 0.01]}
    clf = GridSearchCV(svm.SVR(), cv=5, param_grid=param_grid)
    clf.fit(inputs_train, targets_train)
    clf = SVR(C=clf.best_params_['C'], cache_size=200, class_weight=None, coef0=0.0,
              decision_function_shape='ovr', degree=5, gamma=clf.best_params_['gamma'],
              kernel=clf.best_params_['kernel'],
              max_iter=-1, probability=True, random_state=None, shrinking=True,
              tol=0.001, verbose=False)
    clf.fit(inputs_train, targets_train)
    a = clf.predict(inputs_test[0])
    print(a)
plotting()
Can someone tell me what is wrong with this approach? It's not that I get some error message (I know, I've suppressed warnings above), it's that the code never stops running. Any suggestions are hugely appreciated.
There are several issues with your code.
To start with, what is taking forever is the first clf.fit (i.e. the grid search one), and that's why you didn't see any change when you set max_iter and tol in your second clf.fit.
Second, the clf=SVR() part will not work, because:
You have to import it; as written, SVR is not recognized
You have a bunch of illegal arguments in there (decision_function_shape, probability, random_state etc) - check the docs for the admissible SVR arguments.
Third, you don't need to explicitly fit again with the best parameters; you should simply ask for refit=True in your GridSearchCV definition and subsequently use clf.best_estimator_ for your predictions (EDIT after comment: simply clf.predict will also work).
So, moving the stuff outside of any function definition, here is a working version of your code:
from sklearn.svm import SVR
# other imports as-is
# data loading & splitting as-is
param_C = [0.01, 0.1]
param_grid = {'C': param_C, 'kernel': ['poly', 'rbf'], 'gamma': [0.1, 0.01]}
clf = GridSearchCV(SVR(degree=5, max_iter=10000), cv=5, param_grid=param_grid, refit=True)
clf.fit(inputs_train, targets_train)
a = clf.best_estimator_.predict(inputs_test[0])
# a = clf.predict(inputs_test[0]) will also work
print(a)
# [ 21.89849792]
Apart from degree, all the other admissible argument values you are using are actually the respective default values, so the only arguments you really need in your SVR definition are degree and max_iter.
You'll get a couple of warnings (not errors). After fitting:
/databricks/python/lib/python3.5/site-packages/sklearn/svm/base.py:220:
ConvergenceWarning: Solver terminated early (max_iter=10000). Consider
pre-processing your data with StandardScaler or MinMaxScaler.
and after predicting:
/databricks/python/lib/python3.5/site-packages/sklearn/utils/validation.py:395:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample. DeprecationWarning)
which already contain some advice for what to do next...
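For instance, here is a minimal sketch of acting on that advice (reusing the inputs_train / inputs_test arrays from the question; the pipeline step names are just illustrative): scale the features before the SVR, and reshape the single test sample to 2D before predicting:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaling inside the pipeline helps the SVR solver converge within max_iter
pipe = Pipeline([('scaler', StandardScaler()),
                 ('svr', SVR(degree=5, max_iter=10000))])
param_grid = {'svr__C': [0.01, 0.1],
              'svr__kernel': ['poly', 'rbf'],
              'svr__gamma': [0.1, 0.01]}
clf = GridSearchCV(pipe, param_grid=param_grid, cv=5, refit=True)
clf.fit(inputs_train, targets_train)

# Reshape the single sample to 2D to avoid the deprecation warning
a = clf.predict(inputs_test[0].reshape(1, -1))
print(a)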
Last but not least: a probabilistic classifier (i.e. one that produces probabilities instead of hard labels) is a valid thing, but a "probabilistic" regression model is not...
Tested with Python 3.5 and scikit-learn 0.18.1
Related
I am trying to determine which alpha is the best in a Ridge Regression with scoring = 'neg_mean_squared_error'.
I have an array with some values for alpha ranging from 5e09 to 5e-03:
array([5.00000000e+09, 3.78231664e+09, 2.86118383e+09, 2.16438064e+09,
1.63727458e+09, 1.23853818e+09, 9.36908711e+08, 7.08737081e+08,
5.36133611e+08, 4.05565415e+08, 3.06795364e+08, 2.32079442e+08,
1.75559587e+08, 1.32804389e+08, 1.00461650e+08, 7.59955541e+07,
5.74878498e+07, 4.34874501e+07, 3.28966612e+07, 2.48851178e+07,
1.88246790e+07, 1.42401793e+07, 1.07721735e+07, 8.14875417e+06,
6.16423370e+06, 4.66301673e+06, 3.52740116e+06, 2.66834962e+06,
2.01850863e+06, 1.52692775e+06, 1.15506485e+06, 8.73764200e+05,
6.60970574e+05, 5.00000000e+05, 3.78231664e+05, 2.86118383e+05,
2.16438064e+05, 1.63727458e+05, 1.23853818e+05, 9.36908711e+04,
7.08737081e+04, 5.36133611e+04, 4.05565415e+04, 3.06795364e+04,
2.32079442e+04, 1.75559587e+04, 1.32804389e+04, 1.00461650e+04,
7.59955541e+03, 5.74878498e+03, 4.34874501e+03, 3.28966612e+03,
2.48851178e+03, 1.88246790e+03, 1.42401793e+03, 1.07721735e+03,
8.14875417e+02, 6.16423370e+02, 4.66301673e+02, 3.52740116e+02,
2.66834962e+02, 2.01850863e+02, 1.52692775e+02, 1.15506485e+02,
8.73764200e+01, 6.60970574e+01, 5.00000000e+01, 3.78231664e+01,
2.86118383e+01, 2.16438064e+01, 1.63727458e+01, 1.23853818e+01,
9.36908711e+00, 7.08737081e+00, 5.36133611e+00, 4.05565415e+00,
3.06795364e+00, 2.32079442e+00, 1.75559587e+00, 1.32804389e+00,
1.00461650e+00, 7.59955541e-01, 5.74878498e-01, 4.34874501e-01,
3.28966612e-01, 2.48851178e-01, 1.88246790e-01, 1.42401793e-01,
1.07721735e-01, 8.14875417e-02, 6.16423370e-02, 4.66301673e-02,
3.52740116e-02, 2.66834962e-02, 2.01850863e-02, 1.52692775e-02,
1.15506485e-02, 8.73764200e-03, 6.60970574e-03, 5.00000000e-03])
Then, I used RidgeCV to try and determine which of these values would be best:
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error',
normalize = True, cv=KFold(10))
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
and I got ridgecv.alpha_ = 0.006609705742330144
However, I received a warning that normalize = True is deprecated and will be removed in version 1.2. The warning advised me to use Pipeline and StandardScaler instead. Then, following instructions of how to do a Pipeline, I did:
steps = [
    ('scalar', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=KFold(10)))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
y_pred = ridge_pipe2.predict(X_test)
ridge_pipe2.named_steps.model.alpha_
Doing this way, I got ridge_pipe2.named_steps.model.alpha_ = 1.328043891473342
For a last check, I also used GridSearchCV as follows:
steps = [
    ('scalar', StandardScaler()),
    ('model', Ridge())
]
ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
parameters = [{'model__alpha': alphas}]
grid_search = GridSearchCV(estimator=ridge_pipe,
                           param_grid=parameters,
                           scoring='neg_mean_squared_error',
                           cv=10,
                           n_jobs=-1)
grid_search = grid_search.fit(X_train, y_train)
grid_search.best_params_['model__alpha']
where I got grid_search.best_params_['model__alpha'] = 1.328043891473342 (the same as the other Pipeline approach).
Therefore, my question: why does normalizing my dataset with normalize=True or with StandardScaler() yield different best alpha values?
The corresponding warning message for ordinary Ridge makes an additional mention:
Set parameter alpha to: original_alpha * n_samples.
(I don't entirely understand why this is, but for now I'm willing to leave it. There should probably be a note along these lines added to the warning for RidgeCV.) Changing your alphas parameter in the second approach to [alph * X.shape[0] for alph in alphas] should work. The selected alpha_ will be different, but after rescaling it back with ridge_pipe2.named_steps.model.alpha_ / X.shape[0] I retrieve the same value as in the first approach (as well as the same rescaled coefficients).
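A minimal sketch of that rescaling (assuming X_train, y_train, and alphas as defined in the question; I use X_train.shape[0] as the sample count, adjust to whatever array you fit on):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

# Multiply each alpha by the number of samples so that the StandardScaler
# pipeline matches the old normalize=True behaviour
n_samples = X_train.shape[0]
scaled_alphas = [alph * n_samples for alph in alphas]

ridge_pipe2 = Pipeline([
    ('scaler', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=scaled_alphas,
                      scoring='neg_mean_squared_error',
                      cv=KFold(10)))
])
ridge_pipe2.fit(X_train, y_train)

# Divide the selected alpha back by n_samples to compare with normalize=True
print(ridge_pipe2.named_steps.model.alpha_ / n_samples)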
(I've used the dataset shared in the linked question, and added the experiment to the notebook I created there.)
You need to ensure that the same cross-validation is used and that you scale without centering the data.
When you run with normalize=True, you get this as part of the warning :
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
Regarding the cv, if you check the documentation, RidgeCV by default performs leave-one-out cross validation :
Ridge regression with built-in cross-validation.
See glossary entry for cross-validation estimator.
By default, it performs efficient Leave-One-Out Cross-Validation.
So to get the same result, we can define a cross-validation to use :
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
kf = KFold(10)
X_train, y_train = datasets.make_regression()
alphas = [0.001,0.005,0.01,0.05,0.1]
ridgecv = RidgeCV(alphas = alphas, scoring = 'neg_mean_squared_error', normalize = True, cv=KFold(10))
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
0.001
And use it in the pipeline:
steps = [
    ('scalar', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=kf))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
ridge_pipe2.named_steps.model.alpha_
0.001
I am exploring the hyperparameters of a KNN model for the MNIST dataset using RandomizedSearchCV in sklearn.
For this exploration I kept only 10000 samples, otherwise it would take forever.
My code is similar to this:
from sklearn.datasets import fetch_openml
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
mnist = fetch_openml("mnist_784");
X, y = mnist['data'], mnist['target'];
X_train_pre, X_test, y_train_pre, y_test = X[:60000], X[60000:], y[:60000], y[60000:];
shuffle_index = np.random.permutation(60000);
X_train, y_train = X_train_pre[shuffle_index], y_train_pre[shuffle_index];
del y_train_pre, X_train_pre
X_short, y_short = X_train[:10000], y_train[:10000];
param_grid2 = {"weights": ["uniform", "distance"],"n_neighbors":[3,4,5,6,7,8], };#5,10,15, 20, 25, 30, 40,50]
knn_clf_rs2 = KNeighborsClassifier(n_jobs=-1);
random_search2 = RandomizedSearchCV(knn_clf_rs2, param_grid2, cv=5,
scoring="neg_mean_squared_error",
verbose = 5, n_jobs=-1);
random_search2.fit(X_short, y_short);
To speed up, I set the parameter n_jobs=-1 both in the classifier constructor and in the random_search2 constructor, although for the moment I only need the one for the search, while the one set in the classifier constructor is unused. However, looking around, I saw that according to many people n_jobs=-1 is ineffective and actually just one processor is used. Even worse, it slows down performance due to the increased resource usage of multiprocessing. So it would actually be better to do (in my case too):
random_search2 = RandomizedSearchCV(knn_clf_rs2, param_grid2, cv=5,
scoring="neg_mean_squared_error",
verbose = 5);
First of all, is n_jobs=-1 not working because of the classifier type (KNN), because of RandomizedSearchCV, or because of the combination of the two?
Furthermore, is there any solution to this, such as a way to force multiprocessing?
I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic, predicted probabilities for positive and negative class are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set, due to the sheer number of negatives in the unbalanced out-of-year test set, see second pic)
Leaving it unbalanced (weird scoring distribution again, precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for the test and out-of-year test sets)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, as both false positives and false negatives are equally bad in my case. I would like to incorporate k-fold cross-validation and have read I should do this before upsampling: a) should I do this / is it likely to help, and b) how can I incorporate it into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)
pipeline = make_pipeline(??)
pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)
Upsampling with SMOTE and SMOTEENN: I am far from being an expert on these, but by upsampling your dataset you might amplify existing noise, which induces overfitting. This could explain why your algorithm cannot classify correctly, giving the results in the first graph.
I found a bit more info, and possibly ways to improve your results, here:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
When you downsample, you seem to encounter the same overfitting problem, as far as I understand it (at least for the target result of the previous year). It is hard to deduce the reason behind it without a view of the data, though.
Your overfitting problem might come from the number of features you use, which could add unnecessary noise. You could try to reduce the number of features and gradually increase them (using an RFE model, sketched after the link below). More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
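For instance, a minimal sketch of cross-validated recursive feature elimination (RFECV), assuming X_train, ytrain and Xtest are the arrays from your snippet and reusing your RandomForestClassifier settings:
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Recursively drop the least important features, keeping the subset
# with the best cross-validated F1 score
estimator = RandomForestClassifier(n_estimators=25, random_state=1)
selector = RFECV(estimator,
                 step=1,                         # remove one feature per iteration
                 cv=StratifiedKFold(n_splits=5),
                 scoring='f1')
selector.fit(X_train, ytrain)

print("Optimal number of features:", selector.n_features_)
X_train_reduced = selector.transform(X_train)
Xtest_reduced = selector.transform(Xtest)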
For the models, you mention Random Forest and XGBoost, but you did not mention having used simpler models. You could try a simpler model and focus on your data engineering.
If you have not tried it yet, maybe you could:
Downsample your data
Normalize all your data with a StandardScaler
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
# Define the steps of the pipeline
# (liblinear supports both the l1 and l2 penalties)
steps = [('scaler', StandardScaler()),
         ('log_reg', LogisticRegression(solver='liblinear'))]
pipeline = Pipeline(steps)

# Specify the hyperparameters
# (prefix them with the step name so GridSearchCV can route them to the right step)
parameters = {'log_reg__C': [1, 10, 100],
              'log_reg__penalty': ['l1', 'l2']}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                     random_state=42)

# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training set
cv.fit(X_train, y_train)
Anyway, for your example the pipeline could look like this (I made it with Logistic Regression, but you can swap in another ML algorithm and change the parameter grid accordingly):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

param_grid = {'C': [1, 10, 100]}
clf = LogisticRegression(solver='lbfgs', multi_class='auto')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='f1')
pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)  # shuffle so random_state takes effect
score = cross_val_score(pipeline, X, y, cv=cv)
I hope this may help you.
(edit: I added scoring="f1" in the GridSearchCV)
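As a small optional follow-up sketch: since F1 is the metric you want to optimize, you could also pass scoring='f1' to the outer cross_val_score (which otherwise uses the estimator's default scorer):
# Report the outer-loop F1 as well, since that is the metric of interest
score = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')
print(score.mean(), score.std())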
I'm using scikit-learn in Python to run some basic machine learning models. Using the built-in GridSearchCV() function, I determined the "best" parameters for different techniques, yet many of these perform worse than the defaults. I include the default parameters as an option, so I'm surprised this would happen.
For example:
from sklearn import svm, grid_search
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(verbose=1)
parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}
clf = grid_search.GridSearchCV(gbc, parameters)
t0 = time()
clf.fit(X_crossval, labels)
print "Gridsearch time:", round(time() - t0, 3), "s"
print clf.best_params_
# The output is: {'min_samples_split': 2, 'learning_rate': 0.01, 'max_depth': 2}
This is the same as the defaults, except max_depth is 3. When I use these parameters, I get an accuracy of 72%, compared to 78% from the default.
One thing I did, that I will admit is suspicious, is that I used my entire dataset for the cross validation. Then after obtaining the parameters, I ran it using the same dataset, split into 75-25 training/testing.
Is there a reason my grid search overlooked the "superior" defaults?
Running cross-validation on your entire dataset for parameter and/or feature selection can definitely cause problems when you test on the same dataset. It looks like that's at least part of the problem here. Running CV on a subset of your data for parameter optimization, and leaving a holdout set for testing, is good practice.
Assuming you're using the iris dataset (that's the dataset used in the example in your comment link), here's an example of how GridSearchCV parameter optimization is affected by first making a holdout set with train_test_split:
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
iris = datasets.load_iris()
gbc = GradientBoostingClassifier()
parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}
clf = GridSearchCV(gbc, parameters)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
# {'learning_rate': 1, 'max_depth': 2, 'min_samples_split': 2}
Now repeat the grid search using a random training subset:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.33,
                                                    random_state=42)
clf = GridSearchCV(gbc, parameters)
clf.fit(X_train, y_train)
print(clf.best_params_)
# {'learning_rate': 0.01, 'max_depth': 5, 'min_samples_split': 2}
I'm seeing much higher classification accuracy with both of these approaches, which makes me think maybe you're using different data - but the basic point about performing parameter selection while maintaining a holdout set is demonstrated here. Hope it helps.
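To round this off, a brief sketch of scoring the refit best estimator on the held-out test set (continuing from the snippet above; GridSearchCV refits on X_train by default):
# Evaluate the best model on the untouched holdout set
print(clf.score(X_test, y_test))   # mean accuracy of the refit best estimator
print(clf.best_estimator_)         # the chosen GradientBoostingClassifier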
You can also use the KFold cross-validator:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
iris = datasets.load_iris()
gbc = GradientBoostingClassifier()
parameters = {'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
              'min_samples_split': [2, 5, 10, 20],
              'max_depth': [2, 3, 5, 10]}
cv_test = KFold(n_splits=5)
clf = GridSearchCV(gbc, parameters, cv=cv_test)
clf.fit(iris.data, iris.target)
print(clf.best_params_)
I want to score different classifiers with different parameters.
For speed, on LogisticRegression I use LogisticRegressionCV (which is at least 2x faster) and plan to use GridSearchCV for the others.
The problem is that while they give me the same C parameter, they do not give the same AUC ROC score.
I tried fixing many parameters like the scorer, random_state, solver, max_iter, tol...
Please look at the example (the real data does not matter):
Test data and common part:
from sklearn import datasets
boston = datasets.load_boston()
X = boston.data
y = boston.target
y[y <= y.mean()] = 0; y[y > 0] = 1
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegressionCV
fold = KFold(len(y), n_folds=5, shuffle=True, random_state=777)
GridSearchCV
grid = {
    'C': np.power(10.0, np.arange(-10, 10)),
    'solver': ['newton-cg']
}
clf = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10)
gs = GridSearchCV(clf, grid, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print ('gs.best_score_:', gs.best_score_)
gs.best_score_: 0.939162082194
LogisticRegressionCV
searchCV = LogisticRegressionCV(
Cs=list(np.power(10.0, np.arange(-10, 10)))
,penalty='l2'
,scoring='roc_auc'
,cv=fold
,random_state=777
,max_iter=10000
,fit_intercept=True
,solver='newton-cg'
,tol=10
)
searchCV.fit(X, y)
print ('Max auc_roc:', searchCV.scores_[1].max())
Max auc_roc: 0.970588235294
The newton-cg solver is used just to provide a fixed value; others were tried too.
What did I forget?
P.S. In both cases I also got the warning "/usr/lib64/python3.4/site-packages/sklearn/utils/optimize.py:193: UserWarning: Line Search failed
warnings.warn('Line Search failed')", which I can't understand either. I'll be happy if someone also describes what it means, but I hope it is not relevant to my main question.
EDIT / UPDATE
Following @joeln's comment, I added the max_iter=10000 and tol=10 parameters too. It does not change the result by a single digit, but the warning disappeared.
Here is a copy of the answer by Tom on the scikit-learn issue tracker:
LogisticRegressionCV.scores_ gives the score for all the folds.
GridSearchCV.best_score_ gives the best mean score over all the folds.
To get the same result, you need to change your code:
print('Max auc_roc:', searchCV.scores_[1].max()) # is wrong
print('Max auc_roc:', searchCV.scores_[1].mean(axis=0).max()) # is correct
By also using the default tol=1e-4 instead of your tol=10, I get:
('gs.best_score_:', 0.939162082193857)
('Max auc_roc:', 0.93915947999923843)
The (small) remaining difference might come from warm starting in LogisticRegressionCV (which is actually what makes it faster than GridSearchCV).
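For reference, a minimal end-to-end sketch of that comparison with current module paths (sklearn.model_selection instead of the deprecated cross_validation / grid_search) and the default tol; I use load_breast_cancer as a stand-in binary dataset, so the printed numbers will differ from those above:
import numpy as np
from sklearn.datasets import load_breast_cancer   # stand-in binary target
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = load_breast_cancer(return_X_y=True)
fold = KFold(n_splits=5, shuffle=True, random_state=777)
Cs = np.power(10.0, np.arange(-10, 10))

# GridSearchCV: best mean ROC AUC over the folds
# (unscaled data may emit convergence / line-search warnings, as in the question)
gs = GridSearchCV(LogisticRegression(penalty='l2', solver='newton-cg',
                                     max_iter=10000, random_state=777),
                  {'C': Cs}, scoring='roc_auc', cv=fold)
gs.fit(X, y)
print('gs.best_score_:', gs.best_score_)

# LogisticRegressionCV: average the per-fold scores before taking the max
searchCV = LogisticRegressionCV(Cs=list(Cs), penalty='l2', scoring='roc_auc',
                                cv=fold, solver='newton-cg', max_iter=10000,
                                random_state=777)
searchCV.fit(X, y)
print('Max mean auc_roc:', searchCV.scores_[1].mean(axis=0).max())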