I am using Python 3.5 with TensorFlow 0.11 and sklearn 0.18.
I wrote a simple example to calculate the cross-validation score on the iris data using TensorFlow, with skflow as the wrapper.
import tensorflow.contrib.learn as skflow
from sklearn import datasets
from sklearn import cross_validation
iris=datasets.load_iris()
feature_columns = skflow.infer_real_valued_columns_from_input(iris.data)
classifier = skflow.DNNClassifier(hidden_units=[10, 10, 10], n_classes=3, feature_columns=feature_columns)
print(cross_validation.cross_val_score(classifier, iris.data, iris.target, cv=2, scoring = 'accuracy'))
But I got the error below. It seems that skflow is not compatible with sklearn's cross_val_score.
TypeError: Cannot clone object '' (type ): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
Is there any other way to deal with this problem?
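For what it's worth, the error comes from cross_val_score cloning the estimator for each fold, and cloning requires a get_params method. One possible workaround, sketched here under assumptions, is to run the folds by hand and build a fresh model for each fold. NoParamsClassifier below is a hypothetical stand-in for any object that offers fit and predict but no get_params (it is not the skflow API), and manual_cv_scores and make_model are illustrative names only.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold


class NoParamsClassifier:
    """Hypothetical stand-in: has fit/predict but no get_params."""

    def fit(self, X, y):
        # Store one centroid per class.
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Assign each sample to the class of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]


def manual_cv_scores(make_model, X, y, n_splits=2, seed=0):
    """Accuracy per fold, building a fresh model for every fold (no cloning)."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        model = make_model()  # new, unfitted model for this fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return scores


iris = datasets.load_iris()
scores = manual_cv_scores(NoParamsClassifier, iris.data, iris.target)
print(scores)
```

With a real skflow model, make_model would be a small function that constructs the classifier, assuming its predict returns class labels in a form accuracy_score accepts.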
We have a machine learning classifier model that we have trained with a pandas dataframe and a standard sklearn pipeline (StandardScaler, RandomForestClassifier, GridSearchCV etc). We are working on Databricks and would like to scale up this pipeline to a large dataset using the parallel computation spark offers.
What is the quickest way to convert our sklearn pipeline into something that computes in parallel? (We can easily switch between pandas and spark DFs as required.)
For context, our options seem to be:
1. Rewrite the pipeline using MLlib (time-consuming)
2. Use a sklearn-spark bridging library
On option 2, Spark-Sklearn seems to be deprecated, but Databricks instead recommends that we use joblibspark. However, this raises an exception on Databricks:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from joblibspark import register_spark
from sklearn.utils import parallel_backend
register_spark() # register spark backend
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC(gamma='auto')
clf = GridSearchCV(svr, parameters, cv=5)
with parallel_backend('spark', n_jobs=3):
    clf.fit(iris.data, iris.target)
raises
py4j.security.Py4JSecurityException: Method public int org.apache.spark.SparkContext.maxNumConcurrentTasks() is not whitelisted on class class org.apache.spark.SparkContext
According to the Databricks instructions (here and here), the necessary requirements are:
Python 3.6+
pyspark>=2.4
scikit-learn>=0.21
joblib>=0.14
I cannot reproduce your issue in a community Databricks cluster running Python 3.7.5, Spark 3.0.0, scikit-learn 0.22.1, and joblib 0.14.1:
import sys
import sklearn
import joblib
spark.version
# '3.0.0'
sys.version
# '3.7.5 (default, Nov 7 2019, 10:50:52) \n[GCC 8.3.0]'
sklearn.__version__
# '0.22.1'
joblib.__version__
# '0.14.1'
With the above settings, your code snippet runs without error and produces a classifier clf as:
GridSearchCV(cv=5, error_score=nan,
estimator=SVC(C=1.0, break_ties=False, cache_size=200,
class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3,
gamma='auto', kernel='rbf', max_iter=-1,
probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
iid='deprecated', n_jobs=None,
param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)
as does the alternative example from here:
from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark
register_spark() # register spark backend
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
giving
[0.96666667 1. 0.96666667 0.96666667 1. ]
Thanks to desertnaut for the response. That answer should be correct for a standard Spark / Databricks setup, so I have accepted it, given the wording of my question and its potential usefulness for other readers.
I am contributing a separate "answer" now that we have discovered what the issue was in our case. Databricks support advised that it was due to our using a special type of cluster (High Concurrency with credentials passthrough enabled, on AWS). grid.fit() was not whitelisted for this type of cluster, and Databricks advised that they would need to raise it with their engineering team to whitelist it.
I am new to Sklearn, and I am trying to combine KNN, Decision Tree, SVM, and Gaussian NB for BaggingClassifier.
Part of my code looks like this:
best_KNN = KNeighborsClassifier(n_neighbors=5, p=1)
best_KNN.fit(X_train, y_train)
majority_voting = VotingClassifier(estimators=[('KNN', best_KNN), ('DT', best_DT), ('SVM', best_SVM), ('gaussian', gaussian_NB)], voting='hard')
majority_voting.fit(X_train, y_train)
bagging = BaggingClassifier(base_estimator=majority_voting)
bagging.fit(X_train, y_train)
But this causes an error saying:
TypeError: Underlying estimator KNeighborsClassifier does not support sample weights.
The "bagging" part works fine if I remove KNN.
Does anyone have any idea how to solve this issue? Thank you for your time.
In BaggingClassifier you can only use base estimators that support sample weights, because it relies on the score method, which takes a sample_weight param.
You can list all the available classifiers like:
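import inspect
from sklearn.utils import all_estimators  # in older releases: sklearn.utils.testing

for name, clf in all_estimators(type_filter='classifier'):
    # Keep only classifiers whose fit() accepts sample_weight.
    if 'sample_weight' in inspect.signature(clf.fit).parameters:
        print(name)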
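To check only the four estimators from the question rather than the whole library, a small sketch along these lines might help. It assumes has_fit_parameter from sklearn.utils.validation is available in your scikit-learn version; the candidates dict and its mostly default hyperparameters are illustrative, not the asker's tuned models.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import has_fit_parameter

# Illustrative instances standing in for best_KNN, best_DT, best_SVM, gaussian_NB.
candidates = {
    'KNN': KNeighborsClassifier(n_neighbors=5, p=1),
    'DT': DecisionTreeClassifier(),
    'SVM': SVC(),
    'gaussian': GaussianNB(),
}

# True when the estimator's fit() signature has a sample_weight parameter.
supports = {name: has_fit_parameter(est, 'sample_weight')
            for name, est in candidates.items()}
print(supports)
# {'KNN': False, 'DT': True, 'SVM': True, 'gaussian': True}
```

This matches the behaviour described in the question: only KNN lacks sample_weight support, which is why removing it makes the error go away.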
I'm using the show_prediction function in the eli5 package to understand how my XGBoost classifier arrived at a prediction. For some reason I seem to be getting a regression score instead of a probability for my model.
Below is a fully reproducible example with a public dataset.
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from eli5 import show_prediction
# Load dataset
data = load_breast_cancer()
# Organize our data
label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']
# Split the data
train, test, train_labels, test_labels = train_test_split(
    features,
    labels,
    test_size=0.33,
    random_state=42
)
# Define the model
xgb_model = XGBClassifier(
    n_jobs=16,
    eval_metric='auc'
)
# Train the model
xgb_model.fit(
    train,
    train_labels
)
show_prediction(xgb_model.get_booster(), test[0], show_feature_values=True, feature_names=feature_names)
This gives me the following result. Note the score of 3.7, which is definitely not a probability.
The official eli5 documentation correctly shows a probability though.
The missing probability seems to be related to my use of xgb_model.get_booster(). Looks like the official documentation doesn't use that and passes the model as-is instead, but when I do that I get TypeError: 'str' object is not callable, so that doesn't seem to be an option.
I'm also concerned that eli5 is not explaining the prediction by traversing the xgboost trees. It appears that the "score" I'm getting is actually just a sum of all the feature contributions, like I would expect if eli5 wasn't actually traversing the tree but fitting a linear model instead. Is that true? How can I also make eli5 traverse the tree?
Fixed my own problem. According to this GitHub issue, eli5 only supports an older version of XGBoost (<=0.6). I was using XGBoost version 0.80 and eli5 version 0.8.
Posting the solution from the issue:
import eli5
from xgboost import XGBClassifier, XGBRegressor
def _check_booster_args(xgb, is_regression=None):
    # type: (Any, bool) -> Tuple[Booster, bool]
    if isinstance(xgb, eli5.xgboost.Booster):  # patch (from "xgb, Booster")
        booster = xgb
    else:
        booster = xgb.get_booster()  # patch (from "xgb.booster()" where `booster` is now a string)
        _is_regression = isinstance(xgb, XGBRegressor)
        if is_regression is not None and is_regression != _is_regression:
            raise ValueError(
                'Inconsistent is_regression={} passed. '
                'You don\'t have to pass it when using scikit-learn API'
                .format(is_regression))
        is_regression = _is_regression
    return booster, is_regression
eli5.xgboost._check_booster_args = _check_booster_args
And then replacing the last line of my question's code snippet with:
show_prediction(xgb_model, test[0], show_feature_values=True, feature_names=feature_names)
fixed my problem.
I am trying to build a GridSearchCV pipeline in sklearn using KNeighborsClassifier and SVM. So far, I have tried the following code:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
from sklearn import svm
from sklearn.svm import SVC
clf = SVC(kernel='linear')
pipeline = Pipeline([ ('knn',neigh), ('sVM', clf)]) # Code breaks here
weight_options = ['uniform','distance']
param_knn = {'weights':weight_options}
param_svc = {'kernel':('linear', 'rbf'), 'C':[1,5,10]}
grid = GridSearchCV(pipeline, param_knn, param_svc, cv=5, scoring='accuracy')
but am getting the following error:
TypeError: All intermediate steps should be transformers and implement fit and transform. 'KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')' (type <class 'sklearn.neighbors.classification.KNeighborsClassifier'>) doesn't
Can anyone please help me see where I am going wrong, and how to correct it? I think there is something wrong with the last line as well, regarding the params.
The error says that KNeighborsClassifier does not have a transform method. A Pipeline can take any number of steps, but every step except the last must be a transformer, i.e. implement both fit and transform; only the final step may be a plain estimator. KNeighborsClassifier is a classifier with fit and predict but no transform, so it cannot be used as an intermediate step. Please refer to the link below:
http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
Intermediate steps in a scikit-learn Pipeline are required to have a transform() method. You might want to try the pipeline from imblearn instead.
See for instance here: https://bsolomon1124.github.io/oversamp/
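If the intent is to compare KNN and SVM rather than chain them, one possible approach is a pipeline with a single final classifier step that the grid search swaps out. This is only a sketch under that assumption; the step name 'clf', the added StandardScaler step, and the iris data are illustrative choices, not part of the original question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Only the last step is a classifier; the scaler is a valid intermediate transformer.
pipeline = Pipeline([('scaler', StandardScaler()), ('clf', SVC())])

# One grid per candidate classifier; the 'clf' step itself is treated as a parameter.
param_grid = [
    {'clf': [KNeighborsClassifier(n_neighbors=3)],
     'clf__weights': ['uniform', 'distance']},
    {'clf': [SVC()],
     'clf__kernel': ['linear', 'rbf'],
     'clf__C': [1, 5, 10]},
]

# param_grid is a single argument: a list of dicts, not two separate dicts.
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Passing a list of dicts also addresses the last line of the question, where two separate parameter dicts were passed as positional arguments.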
I'm trying to use XGBoost and optimize the eval_metric as auc (as described here).
This works fine when using the classifier directly, but fails when I'm trying to use it as a pipeline.
What is the correct way to pass a .fit argument to the sklearn pipeline?
Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
import xgboost
import sklearn
print('sklearn version: %s' % sklearn.__version__)
print('xgboost version: %s' % xgboost.__version__)
X, y = load_iris(return_X_y=True)
# Without using the pipeline:
xgb = XGBClassifier()
xgb.fit(X, y, eval_metric='auc') # works fine
# Making a pipeline with this classifier and a scaler:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
# using the pipeline, but not optimizing for 'auc':
pipe.fit(X, y) # works fine
# however this does not work (even after correcting the underscores):
pipe.fit(X, y, classifier__eval_metric='auc') # fails
The error:
TypeError: before_fit() got an unexpected keyword argument 'classifier__eval_metric'
Regarding the version of xgboost:
xgboost.__version__ shows 0.6
pip3 freeze | grep xgboost shows xgboost==0.6a2.
The error is because you are using a single underscore between estimator name and its parameter when using in pipeline. It should be two underscores.
From the documentation of Pipeline.fit(), we see that the correct way of supplying params in fit:
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
So in your case, the correct usage is:
pipe.fit(X_train, y_train, classifier__eval_metric='auc')
(Notice two underscores between name and param)
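To make the s__p naming rule concrete, here is a rough pure-Python sketch of how such keys can be split per step. It illustrates the convention only and is not scikit-learn's actual implementation; route_fit_params is a hypothetical helper name.

```python
def route_fit_params(step_names, fit_params):
    """Group 'step__param' keyword arguments by pipeline step name."""
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        if '__' not in key:
            # A single underscore cannot be told apart from the parameter name itself.
            raise ValueError("expected 'step__param', got {!r}".format(key))
        step, param = key.split('__', 1)
        if step not in routed:
            raise ValueError('unknown step {!r}'.format(step))
        routed[step][param] = value
    return routed


routed = route_fit_params(['scaler', 'classifier'],
                          {'classifier__eval_metric': 'auc'})
print(routed)
# {'scaler': {}, 'classifier': {'eval_metric': 'auc'}}
```

Splitting on the first double underscore is what allows parameter names such as eval_metric to contain single underscores of their own.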
When the goal is to optimize, I suggest using the sklearn wrapper and GridSearchCV:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in later releases
It looks like
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
score = 'roc_auc'
pipe.fit(X, y)
param = {
    # step name and parameter are joined by two underscores
    'classifier__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # just as example
}
gsearch = GridSearchCV(estimator=pipe, param_grid=param, scoring=score)
Also you can use a technique of cross validation
gsearch.fit(X, y)
And you get the best params & the best scores
gsearch.best_params_, gsearch.best_score_