How to convert a sklearn pipeline into a pyspark pipeline?

We have a machine learning classifier model that we have trained with a pandas dataframe and a standard sklearn pipeline (StandardScaler, RandomForestClassifier, GridSearchCV etc). We are working on Databricks and would like to scale up this pipeline to a large dataset using the parallel computation spark offers.
What is the quickest way to convert our sklearn pipeline into something that computes in parallel? (We can easily switch between pandas and spark DFs as required.)
For context, our options seem to be:
1. Rewrite the pipeline using MLLib (time-consuming)
2. Use a sklearn-spark bridging library
On option 2, Spark-Sklearn seems to be deprecated, but Databricks instead recommends that we use joblibspark. However, this raises an exception on Databricks:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from joblibspark import register_spark
from sklearn.utils import parallel_backend
register_spark() # register spark backend
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC(gamma='auto')
clf = GridSearchCV(svr, parameters, cv=5)
with parallel_backend('spark', n_jobs=3):
    clf.fit(iris.data, iris.target)
raises Method public int org.apache.spark.SparkContext.maxNumConcurrentTasks() is not whitelisted on class class org.apache.spark.SparkContext

According to the Databricks instructions (here and here), the necessary requirements are:
Python 3.6+
I cannot reproduce your issue in a community Databricks cluster running Python 3.7.5, Spark 3.0.0, scikit-learn 0.22.1, and joblib 0.14.1:
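import sys
import sklearn
import joblib
spark.version  # `spark` is the SparkSession predefined in Databricks notebooks
# '3.0.0'
sys.version
# '3.7.5 (default, Nov 7 2019, 10:50:52) \n[GCC 8.3.0]'
sklearn.__version__
# '0.22.1'
joblib.__version__
# '0.14.1'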
With the above settings, your code snippet runs smoothly and indeed produces a classifier clf:
GridSearchCV(cv=5, error_score=nan,
estimator=SVC(C=1.0, break_ties=False, cache_size=200,
class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3,
gamma='auto', kernel='rbf', max_iter=-1,
probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False),
iid='deprecated', n_jobs=None,
param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)
as does the alternative example from here:
from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark
register_spark() # register spark backend
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
[0.96666667 1. 0.96666667 0.96666667 1. ]

Thanks to desertnaut for the response. That answer should be correct for a standard Spark / Databricks setup, so I have accepted it, given the wording of my question and its potential usefulness for other readers.
I am contributing a separate "answer" having discovered what the issue was in our case: Databricks support advised that it was due to our using a special type of cluster (High Concurrency with credentials passthrough enabled, on AWS). The method named in the exception was not whitelisted for this type of cluster, and Databricks advised that they would need to raise it with their engineering team to whitelist it.


How to port feature pipeline from scikit-learn V0.21 to V0.24

I am trying to port a sklearn feature pipeline trained in scikit-learn V0.21 to scikit-learn V0.24, because I do not have the original feature data to train the pipeline again. If I use new data, the feature dimension and position may be off from the following model, as I have DictVectorizer in the pipeline.
I've tried to use pickle and joblib to serialize the pipeline in V0.21 and then deserialize it in V0.24. Unfortunately, in both cases, the code raised ModuleNotFoundError: No module named 'sklearn.feature_extraction.dict_vectorizer' error when loading in V0.24.
I created the pipeline with the same code using V0.21 and V0.24 respectively. When printed, the two show some minor differences.
In V0.21
steps=[('selector', ItemSelector(key='hsd_feature_map')),
('dv1', DictVectorizer(dtype=<class 'numpy.float64'>, separator='=',
sort=True, sparse=False)),
('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=True,
use_idf=True)),
('max', MaxAbsScaler(copy=True))],
In V0.24
Pipeline(steps=[('selector', ItemSelector(key='hsd_feature_map')),
('dv1', DictVectorizer(sparse=False)),
('tfidf', TfidfTransformer(sublinear_tf=True)),
('max', MaxAbsScaler())])
I wonder if there is any way to transfer the feature pipeline or its parameters from scikit-learn V0.21 to V0.24.
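From sklearn version 0.22.X the DictVectorizer import path changed: the module sklearn.feature_extraction.dict_vectorizer became the private module sklearn.feature_extraction._dict_vectorizer, which is why the old pickle can no longer find it.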
I think you could override the DictVectorizer import according to this answer
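The idea of overriding the import can be illustrated without scikit-learn. The sketch below simulates a module rename with hypothetical module names (`demo_vectorizer` and `_demo_vectorizer`) and a hypothetical helper `alias_module`, then shows that registering the old name in `sys.modules` lets `pickle` resolve the old path again. It is an illustration of the technique under those assumptions, not a tested migration recipe for the pipeline in the question.

```python
import pickle
import sys
import types


def alias_module(old_name, new_module):
    """Make lookups of `old_name` (including pickle's) resolve to `new_module`."""
    sys.modules[old_name] = new_module


# Hypothetical class standing in for a fitted vectorizer.
class Vectorizer:
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary


# 1. "Old version": the class lives in a module called `demo_vectorizer`.
old_mod = types.ModuleType("demo_vectorizer")
Vectorizer.__module__ = "demo_vectorizer"
old_mod.Vectorizer = Vectorizer
sys.modules["demo_vectorizer"] = old_mod
payload = pickle.dumps(Vectorizer({"a": 0, "b": 1}))

# 2. "New version": the module has been renamed to `_demo_vectorizer`.
del sys.modules["demo_vectorizer"]
new_mod = types.ModuleType("_demo_vectorizer")
new_mod.Vectorizer = Vectorizer
sys.modules["_demo_vectorizer"] = new_mod

try:
    pickle.loads(payload)
    load_failed = False
except ModuleNotFoundError:
    # Same kind of failure as in the question: the old module path is gone.
    load_failed = True

# 3. Alias the old path to the renamed module, then load again.
alias_module("demo_vectorizer", new_mod)
restored = pickle.loads(payload)
print(load_failed, restored.vocabulary)
```

For the pipeline in the question, the analogous step would presumably be to import `sklearn.feature_extraction._dict_vectorizer` under V0.24 and alias it to `sklearn.feature_extraction.dict_vectorizer` before calling `pickle.load` or `joblib.load`. This assumes the attributes stored in the old pickle are still what the new classes expect, which scikit-learn does not guarantee across versions, so the loaded pipeline's output should be checked against known values.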

How to use the imbalanced library with sklearn pipeline?

I am trying to solve a text classification problem. I want to create a baseline model using MultinomialNB.
My data is highly imbalanced for a few categories, so I decided to use the imbalanced-learn library with a sklearn pipeline, following the tutorial.
The model fails with an error after I introduce the two sampling stages in the pipeline as suggested in the docs.
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from imblearn.under_sampling import (EditedNearestNeighbours,
                                     RepeatedEditedNearestNeighbours)
# Create the samplers
enn = EditedNearestNeighbours()
renn = RepeatedEditedNearestNeighbours()
pipe = make_pipeline_imb([('vect', CountVectorizer(max_features=100000,\
ngram_range= (1, 2),tokenizer=tokenize_and_stem)),\
('tfidf', TfidfTransformer(use_idf= True)),\
('enn', EditedNearestNeighbours()),\
('renn', RepeatedEditedNearestNeighbours()),\
('clf-gnb', MultinomialNB()),])
TypeError: Last step of Pipeline should implement fit. '[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
Can someone please help here? I am also open to using a different approach (Boosting/SMOTE).
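It seems that make_pipeline from imblearn does not accept a list of (name, estimator) tuples the way sklearn's Pipeline does; it takes the estimators themselves as positional arguments and names the steps automatically. From the imblearn documentation: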
*steps : list of estimators.
You should modify your code to:
pipe = make_pipeline_imb(CountVectorizer(max_features=100000,
                                         ngram_range=(1, 2), tokenizer=tokenize_and_stem),
                         TfidfTransformer(use_idf=True),
                         EditedNearestNeighbours(),
                         RepeatedEditedNearestNeighbours(),
                         MultinomialNB())
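To make the difference between the two calling conventions concrete, here is a minimal sketch using scikit-learn's own `make_pipeline` and `Pipeline`. The toy documents and labels are made up for illustration, and the samplers are left out so that the example needs only scikit-learn. As far as I know, `imblearn.pipeline` follows the same two conventions while additionally allowing samplers as steps, but treat that as an assumption to check against the imblearn docs.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline

# Toy data, invented for this example.
docs = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

# make_pipeline: estimators are positional arguments, step names are generated
# from the lowercased class names.
auto_named = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
step_names = [name for name, _ in auto_named.steps]
print(step_names)

# Pipeline: one list of (name, estimator) tuples, names chosen by hand.
hand_named = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

auto_named.fit(docs, labels)
hand_named.fit(docs, labels)
```

Passing the list of tuples to `make_pipeline` makes the whole list look like a single (last) step, which matches the "Last step of Pipeline should implement fit" error in the question. If hand-picked step names matter, for example for grid-search keys, the `Pipeline` class is the one to use.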

SKlearn pipeline using KNeighborsClassifier

I am trying to build a GridSearchCV pipeline in sklearn using KNeighborsClassifier and SVM. So far, I have tried the following code:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
from sklearn import svm
from sklearn.svm import SVC
clf = SVC(kernel='linear')
pipeline = Pipeline([ ('knn',neigh), ('sVM', clf)]) # Code breaks here
weight_options = ['uniform','distance']
param_knn = {'weights':weight_options}
param_svc = {'kernel':('linear', 'rbf'), 'C':[1,5,10]}
grid = GridSearchCV(pipeline, param_knn, param_svc, cv=5, scoring='accuracy')
but am getting the following error:
TypeError: All intermediate steps should be transformers and implement fit and transform. 'KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2,
weights='uniform')' (type <class 'sklearn.neighbors.classification.KNeighborsClassifier'>) doesn't
Can anyone please help me with what I am doing wrong, and how to correct it? I think there is something wrong with the last line as well, regarding the params.
The error says that KNeighborsClassifier has no transform method. A Pipeline can hold any number of steps, but every step except the last must be a transformer, i.e. implement fit and transform; only the final step may be a plain estimator. KNeighborsClassifier is a classifier with fit and predict but no transform, so it cannot be an intermediate step. The same would be true of SVC, which has no transform or fit_transform method either.
All intermediate steps of a scikit-learn Pipeline are required to have a transform() method. You might want to try the pipeline from imblearn instead.
See for instance here:
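If the underlying goal is to compare KNeighborsClassifier and SVC within one grid search, one possible arrangement is sketched below. It is an assumption about what the question is after, not something stated in the answers: the pipeline keeps a single final classifier step (named `clf` here, with a `StandardScaler` added in front purely for illustration), and the parameter grid is a list of dicts, each of which swaps in one classifier and tunes its parameters through the `clf__<param>` keys.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Only the last step is a classifier; the step before it is a transformer.
pipeline = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])

# One dict per candidate classifier. The "clf" key replaces the final step,
# and "clf__<param>" keys tune whichever classifier is currently in place.
param_grid = [
    {
        "clf": [KNeighborsClassifier(n_neighbors=3)],
        "clf__weights": ["uniform", "distance"],
    },
    {
        "clf": [SVC()],
        "clf__kernel": ["linear", "rbf"],
        "clf__C": [1, 5, 10],
    },
]

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

This also addresses the last line of the question's code: GridSearchCV takes one `param_grid` argument (a dict or a list of dicts), not one dict per estimator.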

How to optimize a sklearn pipeline, using XGboost, for a different `eval_metric`?

I'm trying to use XGBoost, and optimize the eval_metric as auc (as described here).
This works fine when using the classifier directly, but fails when I'm trying to use it as a pipeline.
What is the correct way to pass a .fit argument to the sklearn pipeline?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
import xgboost
import sklearn
print('sklearn version: %s' % sklearn.__version__)
print('xgboost version: %s' % xgboost.__version__)
X, y = load_iris(return_X_y=True)
# Without using the pipeline:
xgb = XGBClassifier()
xgb.fit(X, y, eval_metric='auc')  # works fine
# Making a pipeline with this classifier and a scaler:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
# using the pipeline, but not optimizing for 'auc':
pipe.fit(X, y)  # works fine
# however this does not work (even after correcting the underscores):
pipe.fit(X, y, classifier__eval_metric='auc')  # fails
The error:
TypeError: before_fit() got an unexpected keyword argument 'classifier__eval_metric'
Regarding the version of xgboost:
xgboost.__version__ shows 0.6
pip3 freeze | grep xgboost shows xgboost==0.6a2.
The error is because you are using a single underscore between estimator name and its parameter when using in pipeline. It should be two underscores.
From the documentation of Pipeline.fit(), we see the correct way of supplying params in fit:
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
So in your case, the correct usage is:
pipe.fit(X_train, y_train, classifier__eval_metric='auc')
(Notice two underscores between name and param)
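The `s__p` routing can be checked without xgboost. In the sketch below, `RecordingClassifier` is a hypothetical stand-in for XGBClassifier that only stores the `eval_metric` it receives in `fit`, and the four-sample dataset is invented. It shows that the double-underscore form reaches the named step, while the single-underscore form is rejected by the pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for XGBClassifier: it only records `eval_metric`."""

    def fit(self, X, y, eval_metric=None):
        self.classes_ = np.unique(y)
        self.eval_metric_ = eval_metric  # remember what fit() received
        return self

    def predict(self, X):
        # Always predict the first class; enough for this demonstration.
        return np.full(len(X), self.classes_[0])


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("scaler", StandardScaler()), ("classifier", RecordingClassifier())])

# Double underscore: "classifier" is the step name, "eval_metric" the fit argument.
pipe.fit(X, y, classifier__eval_metric="auc")
received = pipe.named_steps["classifier"].eval_metric_
print(received)

# Single underscore: the pipeline cannot tell which step the argument is for.
try:
    pipe.fit(X, y, classifier_eval_metric="auc")
    single_underscore_failed = False
except (TypeError, ValueError) as exc:
    single_underscore_failed = True
    print(type(exc).__name__)
```

Whether a given xgboost release still accepts `eval_metric` as a `fit` argument is a separate matter that depends on the installed version; the sketch only covers how Pipeline forwards fit parameters.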
When the goal is to optimize, I suggest using the sklearn wrapper and GridSearchCV:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
It looks like
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
score = 'roc_auc'
pipe.fit(X, y)
param = {
    'classifier__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # just as example; note the double underscore
}
gsearch = GridSearchCV(estimator=pipe, param_grid=param, scoring=score)
Also you can use a technique of cross validation
gsearch.fit(X, y)
And you get the best params & the best scores
gsearch.best_params_, gsearch.best_score_

cross_val_score fails with tensorflow(skflow)

I am using python 3.5 with tensorflow 0.11 and sklearn 0.18.
I wrote a simple example code to calculate the cross-validation score with iris data using tensorflow. I used the skflow as the wrapper.
import tensorflow.contrib.learn as skflow
from sklearn import datasets
from sklearn import cross_validation
iris = datasets.load_iris()
feature_columns = skflow.infer_real_valued_columns_from_input(iris.data)
classifier = skflow.DNNClassifier(hidden_units=[10, 10, 10], n_classes=3, feature_columns=feature_columns)
print(cross_validation.cross_val_score(classifier, iris.data, iris.target, cv=2, scoring='accuracy'))
But I got an error like below. It seems that skflow is not compatible with cross_val_score of sklearn.
TypeError: Cannot clone object '' (type ): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
Is there any other way to deal with this problem?
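The error message points at what cross_val_score needs: it clones the estimator for each fold, and cloning relies on `get_params`. One possible workaround, sketched here under assumptions, is a thin adapter class that inherits from scikit-learn's `BaseEstimator` (which derives `get_params`/`set_params` from the `__init__` arguments) and builds the foreign model inside `fit`. `CentroidModel` below is a hypothetical stand-in for the skflow classifier, written in NumPy so the example runs without TensorFlow. `SklearnWrapper` and the parameter `p` are likewise invented names.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score


class CentroidModel:
    """Hypothetical non-scikit-learn model: no get_params, its own method names."""

    def __init__(self, p=2):
        self.p = p

    def train(self, X, y):
        self.labels = np.unique(y)
        self.centroids = np.array([X[y == c].mean(axis=0) for c in self.labels])
        return self

    def infer(self, X):
        # Distance of every sample to every class centroid, then nearest label.
        d = (np.abs(X[:, None, :] - self.centroids[None, :, :]) ** self.p).sum(axis=2)
        return self.labels[d.argmin(axis=1)]


class SklearnWrapper(ClassifierMixin, BaseEstimator):
    """Adapter exposing fit/predict; BaseEstimator supplies get_params/set_params."""

    def __init__(self, p=2):
        self.p = p  # store constructor arguments unchanged so clone() can copy them

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.model_ = CentroidModel(p=self.p).train(np.asarray(X), np.asarray(y))
        return self

    def predict(self, X):
        return self.model_.infer(np.asarray(X))


iris = load_iris()

# The raw model cannot be cloned, which is the failure seen in the question.
try:
    clone(CentroidModel())
    raw_clone_failed = False
except TypeError:
    raw_clone_failed = True

scores = cross_val_score(SklearnWrapper(p=2), iris.data, iris.target, cv=2, scoring="accuracy")
print(scores)
```

The key design point is that the wrapper keeps only plain hyperparameters as attributes set in `__init__` and creates the wrapped model inside `fit`, so each cloned copy trains a fresh model per fold.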

