I wonder if we can set up an "optional" step in sklearn.pipeline. For example, for a classification problem, I may want to try an ExtraTreesClassifier with AND without a PCA transformation ahead of it. In practice, it would be a pipeline with an extra parameter toggling the PCA step, so that I can optimize over it via GridSearchCV etc. I don't see such an implementation in the sklearn source, but is there any workaround?
Furthermore, since the possible parameter values of a following step in pipeline might depend on the parameters in a previous step (e.g., valid values of ExtraTreesClassifier.max_features depend on PCA.n_components), is it possible to specify such a conditional dependency in sklearn.pipeline and sklearn.grid_search?
Thank you!
From the docs:
Individual steps may also be replaced as parameters, and non-final
steps may be ignored by setting them to None:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
params = dict(reduce_dim=[None, PCA(5), PCA(10)],
              clf=[SVC(), LogisticRegression()],
              clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=params)
Pipeline steps cannot currently be made optional in a grid search, but as a quick workaround you could wrap the PCA class into your own OptionalPCA component with a boolean parameter to turn off PCA when requested. You might want to have a look at hyperopt to set up more complex search spaces. I think it has good sklearn integration to support this kind of pattern by default, but I cannot find the doc anymore. Maybe have a look at this talk.
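A minimal sketch of what such a wrapper could look like (the class name OptionalPCA and its apply_pca parameter are just illustrative, not part of sklearn):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA

class OptionalPCA(BaseEstimator, TransformerMixin):
    # apply_pca=False turns the step into a no-op, so a grid search
    # can toggle the PCA on and off via this boolean parameter
    def __init__(self, n_components=None, apply_pca=True):
        self.n_components = n_components
        self.apply_pca = apply_pca
    def fit(self, X, y=None):
        if self.apply_pca:
            self.pca_ = PCA(n_components=self.n_components).fit(X)
        return self
    def transform(self, X):
        return self.pca_.transform(X) if self.apply_pca else X

Because the wrapper inherits from BaseEstimator, GridSearchCV can search over apply_pca (and n_components) like any other pipeline parameter.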
For the dependent-parameters problem, GridSearchCV supports lists of parameter grids to handle this case, as demonstrated in the documentation.
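A sketch of what such a conditional grid could look like (the step names and parameter values below are assumptions for illustration): param_grid can be a list of dicts, each searched separately, so the max_features values can be tied to a particular n_components.

from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([('pca', PCA()), ('clf', ExtraTreesClassifier())])
# each dict is searched separately, so max_features stays consistent
# with the chosen n_components
param_grid = [
    {'pca__n_components': [5],  'clf__max_features': [2, 5]},
    {'pca__n_components': [10], 'clf__max_features': [2, 5, 10]},
]
grid_search = GridSearchCV(pipe, param_grid=param_grid)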
Related
I have a question about this tutorial.
The author is doing hyperparameter tuning. The first window shows different values of the hyperparameters.
Then he initializes GridSearchCV with cv=3 and scoring='roc_auc'.
Then he fits GridSearchCV and uses eval_set and eval_metric='auc'.
What is the purpose of using both cv and eval_set? Shouldn't we use just one of them? How are they used along with scoring='roc_auc' and eval_metric='auc'?
Is there a better way to do hyperparameter tuning using GridSearchCV? Please suggest or provide a link.
GridSearchCV performs cross-validation for hyperparameter tuning using only the training data. Since refit=True by default, the best fit is then validated on the eval set provided (a true test score).
You can use any metric to perform the cross-validation and the testing. However, it would be odd to use different metrics for the cv hyperparameter optimization and the testing phase, so the same metric is used. If you are wondering about the slightly different metric naming, I think it's just because xgboost is an sklearn-interface-compliant package, but it is not developed by the same people as sklearn. Both should do the same thing (area under the receiver operating characteristic curve for predictions). Take a look at the sklearn docs: auc and roc_auc.
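To make the split of responsibilities concrete, here is a rough sketch (not the tutorial's code; it uses a plain sklearn classifier and a manual hold-out set instead of xgboost's eval_set, but the idea is the same: cross-validation picks the hyperparameters, the held-out data gives an unbiased AUC):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, random_state=0)

params = {'n_estimators': [50, 100], 'max_depth': [2, 3]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    params, cv=3, scoring='roc_auc')  # cv tunes on training data only
grid.fit(X_train, y_train)                            # refit=True refits the best model

# evaluate the refit best model on the held-out "eval set"
eval_auc = roc_auc_score(y_eval, grid.predict_proba(X_eval)[:, 1])
print(grid.best_params_, eval_auc)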
I don't think there is a better way.
Is there any way to extract the best features from the data? Right now, I am using SelectKBest from sklearn.
With it, I have to specify the number (k) of best features that need to be selected.
Is there any way in which I don't have to specify the number of features to be extracted, and instead extract all the useful features?
from sklearn.feature_selection import SelectKBest, chi2
test = SelectKBest(score_func=chi2, k=4)
You can use "all" instead of a number
test = SelectKBest(score_func=chi2, k="all")
From the docs:
k : int or “all”, optional, default=10
    Number of top features to select. The “all” option bypasses selection, for use in a parameter search.
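For example, here is a rough sketch of using k="all" inside a parameter search, as the docs suggest (the data, step names, and grid values are only illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('kbest', SelectKBest(score_func=chi2)),
                 ('clf', LogisticRegression(max_iter=1000))])
# "all" lets the search compare keeping everything against keeping fewer features
param_grid = {'kbest__k': [1, 2, 3, "all"]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5).fit(X, y)
print(grid.best_params_)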
There are many ways to select features; you can find them on the wiki. I think the best feature selection method is to have a deep understanding of the features themselves, but usually we have a hard time understanding them.
Maybe you can use 5-fold cross-validation to build a feature importance ranking, and then select the important features from it, for example as sketched below.
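A rough sketch of what that could look like, assuming a random forest's feature_importances_ averaged over 5 folds (the model choice is just an illustration):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
importances = []
for train_idx, _ in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    importances.append(model.feature_importances_)
ranking = np.argsort(np.mean(importances, axis=0))[::-1]  # most important feature first
print(ranking)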
You can also use an embedded method to select them, like this:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
iris = load_iris()
# feature selection with GBDT (gradient boosting) as the base model
X_selected = SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)
It's worth noting that you cannot simply delete a feature that seems useless on its own, because it may be related to other features. So feature selection is a greedy search process, which is often time consuming.
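As one illustration of such a greedy search, sklearn's recursive feature elimination (RFE) repeatedly refits a model and drops the weakest feature; this sketch is my own example, not part of the answer above:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 for selected features; higher ranks were eliminated earlier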
Is it possible to delete or insert a step in a sklearn.pipeline.Pipeline object?
I am trying to do a grid search with or without one step in the Pipeline object, and I am wondering whether I can insert or delete a step in the pipeline. I saw in the Pipeline source code that there is a self.steps attribute holding all the steps, and that the steps can be accessed via named_steps. Before modifying it, I want to make sure I do not cause unexpected effects.
Here is an example:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('svm', SVC())]
clf = Pipeline(estimators)
clf
Is it possible to do something like steps = clf.named_steps, then insert or delete items in this list? Does this cause undesired effects on the clf object?
I see that everyone mentioned only the delete step. In case you want to also insert a step in the pipeline:
pipe.steps.append(('step name', transformer()))
pipe.steps works in the same way as lists do, so you can also insert an item into a specific location:
pipe.steps.insert(1, ('estimator', transformer()))  # insert as the second step
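For a concrete (hypothetical) illustration with the pipeline from the question and a StandardScaler as the step being added:

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('reduce_dim', PCA()), ('svm', SVC())])
pipe.steps.insert(0, ('scale', StandardScaler()))  # insert as the first step
print([name for name, _ in pipe.steps])  # ['scale', 'reduce_dim', 'svm']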
Based on rudimentary testing you can safely remove a step from a scikit-learn pipeline just like you would any list item, with a simple
clf_pipeline.steps.pop(n)
where n is the position of the individual estimator you are trying to remove.
Just chiming in because I feel like the other answers answered the question of adding steps to a pipeline really well, but didn't really cover how to delete a step from a pipeline.
Watch out with my approach though. Slicing lists in this instance is a bit weird.
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
estimators = [('reduce_dim', PCA()), ('poly', PolynomialFeatures()), ('svm', SVC())]
clf = Pipeline(estimators)
If you want to create a pipeline with just the PCA/Polynomial steps, you can slice the step list by index and pass it to Pipeline:
clf1 = Pipeline(clf.steps[0:2])
Want to just use steps 2/3?
Watch out: these slices don't always make the most sense:
clf2 = Pipeline(clf.steps[1:3])
Want to just use steps 1/3? I can't seem to do that with this approach:
clf3 = Pipeline(clf.steps[0] + clf.steps[2]) # errors
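If I read the error right, clf.steps[0] + clf.steps[2] concatenates the two (name, estimator) tuples into a single 4-element tuple, which Pipeline cannot interpret as a list of steps. Continuing the example above, one way to pick non-contiguous steps is to build the list explicitly (just a sketch):

clf3 = Pipeline([clf.steps[0], clf.steps[2]])  # explicit list of the two wanted steps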
Yes, it's possible, but you must fulfill the same requirements that Pipeline imposes at initialization: you cannot insert a predictor in any step except the last, the last step should always implement a fit method, and all previous steps should implement fit_transform. You should also call fit again after you update Pipeline.steps, because after such an update all steps (even if they were learned in previous fit calls) are invalidated.
So yes, it will work with the current codebase, but I don't think it's a good solution for your task: it makes your code more dependent on the current implementation of Pipeline. I think it's more convenient to create a new Pipeline with the modified steps, because the Pipeline will at least validate all your steps at initialization, and creating a new Pipeline is not significantly slower than modifying the steps of an existing one. As I've just said, creating a new Pipeline after each modification of the steps is safer in case someone significantly changes the implementation of Pipeline.
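A minimal sketch of that safer approach, assuming the PCA/SVC pipeline from the question and the iris data purely for illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = Pipeline([('reduce_dim', PCA()), ('svm', SVC())])
# build a fresh Pipeline without the PCA step instead of mutating clf.steps,
# then refit, since any previously fitted state is invalidated anyway
clf_no_pca = Pipeline([(name, est) for name, est in clf.steps if name != 'reduce_dim'])
clf_no_pca.fit(X, y)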
I would like to add an oversampling procedure, like SMOTE oversampling, to scikit-learn's Pipeline. But transformers only support the fit and transform methods and do not provide a way to increase the number of samples and targets.
One possible way to do this is to break the pipeline into two separate pipelines connected by SMOTE sampling.
Are there any better solutions?
Our current Pipeline does not support changing the number of samples between steps, as the Transformer.transform method does not return the y argument that would also need to be resampled. This is a known limitation of the current design. It might be fixed in a future version, but we have not started to work on that yet.
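A sketch of the two-pipeline workaround mentioned in the question, assuming the imbalanced-learn package for SMOTE (the preprocessing steps and data here are only illustrative):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

pre = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=5))])
X_pre = pre.fit_transform(X, y)

# resample between the two pipelines, since a sklearn Pipeline step cannot change n_samples
X_res, y_res = SMOTE(random_state=0).fit_resample(X_pre, y)

clf = Pipeline([('svm', SVC())])
clf.fit(X_res, y_res)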
Is there a way to input the coefficients into an SVC clf in my script, and then apply the clf.score() or clf.predict() functions for further testing?
Currently I am using joblib.dump(clf,'file.plk') to save all the information of a trained clf, but this involves disk writing/reading. It would be helpful if I could just define a clf from arrays representing the support vectors (clf.support_vectors_), weights (clf.coef_/clf.dual_coef_), and bias (clf.intercept_).
This line calls the prediction function from libsvm. It looks like this (but please take a look at the whole function _dense_predict):
libsvm.predict(
X, self.support_, self.support_vectors_, self.n_support_,
self.dual_coef_, self._intercept_,
self.probA_, self.probB_, svm_type=svm_type, kernel=kernel,
degree=self.degree, coef0=self.coef0, gamma=self._gamma,
cache_size=self.cache_size)
You can use this call and give it all the relevant information directly to obtain a raw prediction. In order to do this, you must import libsvm with from sklearn.svm import libsvm. If your initial fitted classifier is called svc, then you can obtain all the relevant information from it by replacing all the self keywords with svc and keeping the values. If svc._impl gives you "c_svc", then you set svm_type=0.
Note that at the beginning of the _dense_predict function you have X = self._compute_kernel(X). If your data is X, then you need to transform it by doing K = svc._compute_kernel(X), and call the libsvm.predict function with K as the first argument.
Scoring is independent of all this. Take a look at sklearn.metrics, where you will find e.g. accuracy_score, which is the default score in SVM.
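A trivial sketch, using the iris data purely for illustration, showing that the metric is computed from labels and predictions alone:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svc = SVC().fit(X, y)
print(accuracy_score(y, svc.predict(X)))  # equivalent to svc.score(X, y)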
This is of course a somewhat suboptimal way of doing things, but in this specific case, if it is impossible (I didn't check very hard) to set the coefficients directly, then going into the code, seeing what it does, and extracting the relevant part is surely an option.
Check out this blog post on memory usage of sklearn models using succinct tries to see if it is applicable.
If the other location does not have access to the sklearn packages, you would need to create your own score and predict functions. clf.score() and clf.predict() require clf to be an sklearn object.