I'm using LogisticRegressionCV on my data in a pipeline. After fitting, I'd like to retrieve the optimal C value. How do I do this? I can't use .best_params_, since that is an attribute of GridSearchCV. I know that .C_ is the right attribute of LogisticRegressionCV, but my estimator is inside a pipeline, so accessing it directly doesn't work:
lr_cv2 = Pipeline(steps=[('preprocessor', preprocessor),
                         ('classifier', LogisticRegressionCV(solver='liblinear', cv=10,
                                                             Cs=np.logspace(-5, 8, 15)))])
lr_cv2.fit(X_train, y_train)
lr_cv2.C_
AttributeError: 'Pipeline' object has no attribute 'C_'
By using the named_steps attribute of your Pipeline instance, you can access the individual steps of your pipeline and their attributes:
print(lr_cv2.named_steps['classifier'].C_)
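Note that C_ is an array with one selected value per class, so for a binary problem you can pull out the single number directly. A small sketch using the names from the question:

# C_ has one entry per class; for a binary target it is a length-1 array
best_C = lr_cv2.named_steps['classifier'].C_[0]
print(best_C)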
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

knn = KNeighborsClassifier()  # 'knn' was presumably defined like this earlier in the notebook
pipe4 = Pipeline([('ss', StandardScaler()), ('clf', knn)])
grid2 = GridSearchCV(pipe4, {'clf': [knn, LogisticRegression()]})
grid2.fit(X_train, y_train)
pd.DataFrame(grid2.cv_results_).T
I made a kNN classifier and a logistic regression model and wanted to check which model is better using a pipeline.
As you can see in the code above, I put only the knn model into pipe4, but in the grid search both knn and logistic regression are used and I can inspect the results.
Does this mean I can add models in GridSearchCV even though I put only one model in the pipeline?
Sure. As long as the estimator given to the GridSearchCV (in your example: pipe4) supports the parameters passed to param_grid (in your example: 'clf'), you can pass any values to the estimator's parameters in the grid search (in your example: [knn, LogisticRegression()]).
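If you also want to tune each candidate model's own hyperparameters while swapping them in and out, param_grid can be a list of dicts, one per model. A rough sketch using the names from the question (the specific values are just placeholders):

param_grid = [
    # each dict is searched independently, so every model gets its own settings
    {'clf': [KNeighborsClassifier()], 'clf__n_neighbors': [3, 5, 7]},
    {'clf': [LogisticRegression(max_iter=1000)], 'clf__C': [0.1, 1.0, 10.0]},
]
grid3 = GridSearchCV(pipe4, param_grid, cv=5)
grid3.fit(X_train, y_train)
print(grid3.best_params_)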
I am tasked with a supervised learning problem on a dataset and want to create a complete Pipeline from beginning to end.
Starting with the train-test split, I wrote a custom class that wraps sklearn's train_test_split as a step in the sklearn pipeline. Its fit_transform returns the training set. Later I still want to access the test set, so I stored it as an instance variable in the custom transformer class, like this:
self.test_set = test_set
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

class train_test_splitter([...]):
    [...
    ...]

    def transform(self, X):
        train_set, test_set = train_test_split(X, test_size=0.2)
        self.test_set = test_set   # keep the test set around for later access
        return train_set

split_pipeline = Pipeline([
    ('splitter', train_test_splitter()),
])

df_train = split_pipeline.fit_transform(df)
Now I want to get the test set like this:
df_test = splitter.test_set
It's not working. How do I get at the attributes of the "splitter" instance? Where is it stored?
You can access the steps of a pipeline in a number of ways. For example,
split_pipeline['splitter'].test_set
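These are all equivalent ways of reaching the same fitted step (the integer-index form needs scikit-learn 0.21 or newer):

split_pipeline.named_steps['splitter'].test_set  # via the named_steps attribute
split_pipeline.steps[0][1].test_set              # .steps is a list of (name, estimator) tuples
split_pipeline[0].test_set                       # integer indexing (scikit-learn >= 0.21)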
That said, I don't think this is a good approach. Once you fill out the pipeline with more steps, everything will work the way you want at fit time, but when predicting or transforming on other data your transform method will still be called: it will generate a new train-test split, forget the old one, and send the new train set down the pipe to the remaining steps.
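A more conventional setup, sketched here under the assumption that the split does not need to live inside the pipeline, is to split the data once up front and keep the pipeline for preprocessing and modelling only:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# split once, outside the pipeline, so fitting/transforming never re-splits the data
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# the pipeline then holds only real preprocessing/model steps (StandardScaler is a placeholder)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
])
train_prepared = pipeline.fit_transform(df_train)
test_prepared = pipeline.transform(df_test)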
I'm using sklearn's linear SVM classifier, LinearSVC.
I don't use it directly; I wrap it with CalibratedClassifierCV to get probabilities at prediction time, like this:
model = CalibratedClassifierCV(LinearSVC(random_state=0))
After fitting the model, I tried to get the coef_ to print the top features, following this post: Visualising Top Features in Linear SVM with Scikit Learn and Matplotlib, but I got this error:
coef = classifier.coef_.ravel()
AttributeError: 'CalibratedClassifierCV' object has no attribute 'coef_'
How can I get the coef_ when the classifier is wrapped in a calibrator? I'm not wedded to this approach, so if there is another way to get the feature importances, that would also be welcome.
coef_ is not an attribute of CalibratedClassifierCV; however, it is an attribute of the base_estimator, which is a LinearSVC in your case. You can access your base estimator via calibrated_classifiers_, which is a list of the fitted models (its length depends on how many models were fit, based on your cv value). Here is sample code you can adapt to your needs.
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
model = CalibratedClassifierCV(LinearSVC(random_state=0))
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
[<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>,
<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57c18>,
<sklearn.calibration._CalibratedClassifier at 0x7f15d0aec080>]
In this case my cv is three, so I have three models built; I would simply loop through them and take an average.
coef_avg = 0
for i in model.calibrated_classifiers_:
    coef_avg = coef_avg + i.base_estimator.coef_
coef_avg = coef_avg / len(model.calibrated_classifiers_)
array([[ 0.16464871, 0.45680981, -0.77801375, -0.4170196 ],
[ 0.1238834 , -0.89117967, 0.35451826, -0.89231957],
[-0.83826029, -0.9237139 , 1.30772955, 1.67592916]])
Note: starting from sklearn version 0.24, the CalibratedClassifierCV constructor exposes an ensemble argument that, if set to False (and assuming cv is not set to "prefit"), makes CalibratedClassifierCV expose only one calibrated classifier, trained using all the training data. This means we no longer need to loop over all the calibrated_classifiers_:
model = CalibratedClassifierCV(LinearSVC(random_state=0), ensemble=False)
model.fit(iris.data, iris.target)
model.calibrated_classifiers_
# Returns a list with one element, [<sklearn.calibration._CalibratedClassifier at 0x7f15d0c57550>]
(using an example above, given by Parthasarathy)
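With ensemble=False there is only one underlying fitted LinearSVC, so the averaging loop collapses to a single lookup. Note that the inner attribute was called base_estimator in older releases and estimator in newer ones, so check which one your version exposes:

inner = model.calibrated_classifiers_[0]
coef = inner.base_estimator.coef_   # on newer scikit-learn versions: inner.estimator.coef_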
Does scikit-learn have an estimator that runs several other estimators and automatically selects the one with the best performance (e.g. according to their cross-validation scores)?
I believe there must be something like this in a class that conforms to the estimator interface, so that it can be combined in a pipeline - correct?
You can use GridSearchCV not only to choose the best estimator but also to tune its hyperparameters. For example, I'm using this to find the best text classifier:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(2, 2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SVC())
])

parameters = {'clf': [
    SVC(),
    MultinomialNB(),
    BernoulliNB(),
    MLPClassifier(max_iter=1000),
    KNeighborsClassifier(),
    SGDClassifier(max_iter=1000),
    RandomForestClassifier()
]}

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(X, y)

print("Best score", gs_clf.best_score_)
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
Example from the official docs: http://scikit-learn.org/stable/auto_examples/plot_compare_reduction.html#sphx-glr-auto-examples-plot-compare-reduction-py
You can even define your own scoring function, to define what "best" means to you:
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
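For example, a custom metric can be wrapped with make_scorer and passed straight to GridSearchCV (a sketch reusing the pipeline and parameters defined above):

from sklearn.metrics import fbeta_score, make_scorer

# optimise an F-beta score that weights recall more heavily than precision
custom_scorer = make_scorer(fbeta_score, beta=2, average='macro')
gs_clf = GridSearchCV(pipeline, parameters, scoring=custom_scorer, n_jobs=-1)
gs_clf = gs_clf.fit(X, y)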
scikit-learn itself currently doesn't have exactly what you are looking for. However, there are libraries such as TPOT and auto-sklearn with an sklearn-like interface for automatically selecting the best estimator or even constructing the whole pipeline.
I'm practicing with some text using scikit-learn.
Towards getting more familiar with GridSearch, I am starting with some example code found here:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

###############################################################################
# define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer())
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0)
}

# GridSearchCV constructed without an explicit scoring argument, as implied by the error below
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best score: %0.3f" % grid_search.best_score_)
Notice I am being very careful here, and I've only got one estimator and one parameter!
I'm finding that when I run this, I get the error:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None))]) does not.
Hummmm...why am I missing some sort of 'score' attribute?
When I check the possible parameters,
print(CountVectorizer().get_params().keys())
I don't see anything where I can score, as was implied by this answer.
The documentation says "By default, parameter search uses the score function of the estimator to evaluate a parameter setting." So why do I need to specify a score method?
Regardless, I thought I might need to explicitly pass a scoring argument, but this didn't help and gave me an error: grid_search.fit(X_train, y_train, scoring=None)
I don't understand this error!
GridSearch maximizes a score over the grid of parameters. You have to specify what kind of score to use because there are many different types of scores possible. For example, for classification problems, you could use accuracy, f1-score, etc. Usually, score type is specified by passing a string in the scoring argument (see scoring parameter). Alternatively, model classes, like SVC or RandomForestRegressor, will have a .score() method. GridSearch will call that if no scoring argument is provided. However, that may or may not be the type of score that you want to optimize. There is also an option of passing in a function as the scoring argument if you have an unusual metric that you want GridSearch to use.
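Concretely, once the pipeline ends in an actual classifier or regressor, scoring can be given as a metric name, left out to fall back on the estimator's own score method, or passed as a callable. A rough sketch:

from sklearn.metrics import f1_score, make_scorer

gs1 = GridSearchCV(pipeline, parameters, scoring='accuracy')   # built-in metric by name
gs2 = GridSearchCV(pipeline, parameters)                       # falls back to the final estimator's .score()
gs3 = GridSearchCV(pipeline, parameters,
                   scoring=make_scorer(f1_score, average='macro'))  # custom callable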
Transformers, like CountVectorizer, do not implement a score method, because they are just deterministic feature transformations. For the same reason, there aren't any scoring methods that make sense to apply to that type of object. You need a model class (or possibly a clustering algorithm) at the end of your pipeline for scoring to make sense.
Aha! I figured it out.
I wasn't understanding how the pipeline works. Sure, I could create a CountVectorizer, but why? There is no way you can get a score out of it, or basically do anything with it other than have a sparse matrix just sitting there.
I need to create a regressor (SGDRegressor) or a classifier (SGDClassifier).
I didn't realize that the pipeline will go
CV --> Regressor
or
CV --> Classifier
The pipeline does what its name implies... it pipes the objects together in series.
In other words, this works:
pipeline = Pipeline([
('vect', CountVectorizer()),
('clf', SGDRegressor())
])
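With a regressor at the end of the pipeline, the grid search from the beginning of the question now has a default score method to fall back on (a sketch, reusing the parameters grid defined earlier):

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best score: %0.3f" % grid_search.best_score_)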