I love using GridSearchCV for hyperparameter tuning of machine learning models (mostly using sklearn here).
Is there a way to pass a function/lambda as a callback that would get called after every search combination? I need this to add custom logging and even to send events in certain scenarios.
In fact, I'm looking for a pattern similar to Keras callbacks, where every callback is executed after every epoch.
Thanks for any insights
I was searching for a way to get the current parameters in my callback and found your question; hope this helps someone.
# Extra keyword arguments to grid.fit() are forwarded to the underlying
# estimator's fit(), so this works when that estimator accepts a callbacks
# argument (e.g. a wrapped Keras model).
grid = GridSearchCV(estimator=model, param_grid=param_grid, verbose=0, n_jobs=1)
grid_result = grid.fit(X_train, Y_train, callbacks=[YourCallback()])
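If the estimator is not a Keras model, one alternative hook (just a sketch, not a callback API that GridSearchCV itself offers) is a callable scorer: GridSearchCV invokes it for every parameter combination and fold, so you can put custom logging or event dispatch there. The metric and names below are illustrative.

from sklearn.metrics import accuracy_score

def logging_scorer(estimator, X_val, y_val):
    # Called once per fitted candidate per CV fold
    score = accuracy_score(y_val, estimator.predict(X_val))
    print("params:", estimator.get_params(), "score:", score)  # custom logging / event hook
    return score

grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=logging_scorer, n_jobs=1)

Note that with n_jobs > 1 the scorer runs in worker processes, so anything it logs to stdout may be interleaved or lost.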
As per the question title, I'd like to know if there's a way to specify a custom validation set for Scikit-Learn's GradientBoostingRegressor? I think the answer is no, but I figured I'd check.
The documentation states that the validation_fraction argument only accepts a float, so I'm guessing there's no direct way to create your own validation set and use that.
Does anyone know if there's a way to do this? Being able to create your own validation set is a reason why I typically use xgboost, but sometimes sklearn is better for what I need. I'd settle for at least being able to use some of the custom splitter classes in the library if I couldn't create the validation set directly.
Thank you!
EDIT
The main purpose of being able to supply the custom validation set is to use it in conjunction with the early stopping feature, which was not noted above.
Basically you can use sklearn metrics and the model's .predict method. Pseudo-code for this kind of problem would be:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = CustomSplitMethod(X, y)  # placeholder for your own split

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
prediction = model.predict(X_test)
evaluation_metric(prediction, y_test)  # for example mean_squared_error(y_test, prediction)
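This does not cover the early-stopping use case from the EDIT, though. One possible workaround (a sketch, not a built-in option of GradientBoostingRegressor) is to grow the ensemble incrementally with warm_start and monitor a validation set you chose yourself; X_val/y_val and the patience value below are assumptions.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

model = GradientBoostingRegressor(n_estimators=1, warm_start=True, random_state=0)

best_score = float("inf")
stale_rounds = 0
patience = 10  # stop after 10 stages without improvement (arbitrary choice)

for n_trees in range(1, 501):
    model.n_estimators = n_trees
    model.fit(X_train, y_train)  # warm_start keeps the trees fitted so far
    score = mean_squared_error(y_val, model.predict(X_val))
    if score < best_score:
        best_score, stale_rounds = score, 0
    else:
        stale_rounds += 1
        if stale_rounds >= patience:
            break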
I am writing a custom CycleGAN training loop following TF's documentation. I'd like to use several existing callbacks and would prefer not to re-write their logic.
I've found this SO question, whose answers unfortunately focus on EarlyStopping rather than on the broad range of callbacks.
I further found this reddit post, which suggests calling them manually. Several callbacks, however, work with an internal model object (they call self.model, which is the model they are applied to).
How can I simply use the existing callbacks in a custom training loop? I appreciate any code outlines or further recommendations!
If you want to use the existing callbacks, you have to create the callbacks you want and pass them to the callbacks argument when calling model.fit().
For example, if I want to use ModelCheckpoint and EarlyStopping:
checkpoint = keras.callbacks.ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
es = keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=1)
Create a callbacks_list
callbacks_list = [checkpoint, es]
Pass callbacks_list to the callbacks argument while training the model:
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, callbacks=callbacks_list)
Please refer to this gist for a complete code example. Thank you.
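If you do stay with a fully custom training loop, one way to reuse the built-in callbacks is the "call them manually" route mentioned in the question. Below is a rough sketch using tf.keras.callbacks.CallbackList, which also takes care of setting self.model on each callback; model, dataset, epochs and train_step (returning a dict of metrics) are assumed to come from your own loop.

import tensorflow as tf

callbacks = tf.keras.callbacks.CallbackList(
    [tf.keras.callbacks.CSVLogger("training_log.csv"),
     tf.keras.callbacks.TerminateOnNaN()],
    add_history=True,
    model=model,  # this is what populates each callback's self.model
)

logs = {}
callbacks.on_train_begin(logs)
for epoch in range(epochs):
    callbacks.on_epoch_begin(epoch, logs)
    for step, batch in enumerate(dataset):
        callbacks.on_train_batch_begin(step, logs)
        logs = train_step(batch)  # your custom training logic
        callbacks.on_train_batch_end(step, logs)
    callbacks.on_epoch_end(epoch, logs)
callbacks.on_train_end(logs)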
Is it possible to fit a scikit-learn model in parallel? Something along the lines of
model.fit(X, y, n_jobs=20)
It really depends on the model you are trying to fit. Usually it will have an n_jobs parameter you can set when you initialize the model; see the glossary entry on n_jobs. For example, random forest:
from sklearn.ensemble import RandomForestClassifier

# The individual trees are built in parallel across 10 jobs
clf = RandomForestClassifier(n_jobs=10)
If it is an ensemble method, it makes sense to parallelize because the individual models can be fit separately (see the help page for ensemble methods). LogisticRegression() also has an n_jobs option, but I honestly don't know how much this speeds up the fitting process, if that's your bottleneck. See also this post.
For other methods like elastic net, linear regression or SVM, I don't think there is a parallelization option.
I train a machine learning model with pipelines and K-fold cross validation in Python and sklearn, on a subset of all my historical data (omitting a test set), along the following lines:
pipeline = Pipeline([("combiner", PolynomialFeatures()),
                     ("dimred", PCA()),
                     ("classifier", RandomForestClassifier())])
parameters = [...]
CV = GridSearchCV(pipeline, parameters, cv=5, scoring="f1_weighted", refit=True, n_jobs=-1)
CV.fit(train_X, train_y)
So far, so good. However, at the end I want to retrain the winning hyperparameter combination of the pipeline on my full X and y, without any cross validation. How can I do this? Simply calling CV.fit(X, y) again would redo the whole cross-validated search, which is obviously unnecessary. I could also parse CV.get_params() for the best hyperparameter combination and build up the pipeline again accordingly, but that somehow seems clumsy and unprofessional...
The answer to your question is in the GridSearchCV documentation. See the Attributes section: best_estimator_ is where the best model is stored, so you can access it from there after fitting is done. You can use it by referencing CV.best_estimator_ directly, make a new reference to it, or pickle it for later using joblib, i.e.:
import joblib
joblib.dump(CV.best_estimator_, 'my_pipeline.pkl')
Later you can load your model for further work:
import joblib
my_pipeline = joblib.load('my_pipeline.pkl')
If you do not need the model but only its hyperparameters, you can access those from the best_params_ attribute, i.e.:
CV.best_params_
which is a dictionary of the best settings that you can use to construct a new pipeline.
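To retrain that winning combination on the full X and y, which is what the question ultimately asks for, one possible sketch (assuming CV has been fit as above) is to clone the best estimator and fit the clone once on all the data:

from sklearn.base import clone

# clone() copies the pipeline with its tuned hyperparameters but without any fitted state
best_pipeline = clone(CV.best_estimator_)
best_pipeline.fit(X, y)  # a single plain fit, no cross validation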
I've been attempting to use weighted samples in scikit-learn while training a Random Forest classifier. It works well when I pass the sample weights to the classifier directly, e.g. RandomForestClassifier().fit(X, y, sample_weight=weights), but when I tried a grid search to find better hyperparameters for the classifier, I hit a wall:
To pass the weights when using grid search, the usage is:
grid_search = GridSearchCV(RandomForestClassifier(), params, n_jobs=-1,
                           fit_params={"sample_weight": weights})
The problem is that the cross-validator isn't aware of the sample weights, so it doesn't resample them together with the actual data, and calling grid_search.fit(X, y) fails: the cross-validator creates subsets of X and y, sub_X and sub_y, and eventually the classifier is called with classifier.fit(sub_X, sub_y, sample_weight=weights), but now weights hasn't been resampled, so an exception is thrown.
For now I've worked around the issue by over-sampling high-weight samples before training the classifier, but it's a temporary work-around. Any suggestions on how to proceed?
I have too little reputation, so I can't comment on #xenocyon's answer. I'm using sklearn 0.18.1 and I'm also using a pipeline in the code. The solution that worked for me was:
fit_params={'classifier__sample_weight': w}, where w is the weight vector and classifier is the step name in the pipeline.
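In more recent sklearn versions the fit_params constructor argument is gone and fit parameters are passed to fit() instead, still with the step-name prefix; as far as I can tell, recent versions also slice sample-length fit parameters per CV fold, which addresses the resampling problem from the question. A sketch (step names and grid are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scaler", StandardScaler()),
                 ("classifier", RandomForestClassifier())])
param_grid = {"classifier__n_estimators": [100, 300]}

grid = GridSearchCV(pipe, param_grid, cv=5)
# The prefix routes the weights to the "classifier" step;
# w must be aligned with X and y.
grid.fit(X, y, classifier__sample_weight=w)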
Edit: the scores I see from the below don't seem quite right. This is possibly because, as mentioned above, even when weights are used in fitting they might not be getting used in scoring.
It appears that this has been fixed now. I am running sklearn version 0.15.2. My code looks something like this:
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV  # location in 0.15; now sklearn.model_selection

model = SGDRegressor()
parameters = {'alpha': [0.01, 0.001, 0.0001]}
cv = GridSearchCV(model, parameters, fit_params={'sample_weight': weights})
cv.fit(X, y)
Hope that helps (you and others who see this post).
I would suggest writing your own cross-validated parameter selection, as it is just 10-15 lines of code (especially using the KFold object from scikit-learn), while oversampling is possibly a big bottleneck.
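A rough sketch of what that manual loop could look like, assuming X, y and weights are numpy arrays and using an illustrative parameter grid; the key point is that the weights are sliced with the same indices as the data, which is exactly what the grid search was not doing here:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, ParameterGrid

best_score, best_params = -np.inf, None
for params in ParameterGrid({"n_estimators": [100, 300], "max_depth": [None, 10]}):
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        clf = RandomForestClassifier(**params)
        # Slice the weights with the same indices as the training data
        clf.fit(X[train_idx], y[train_idx], sample_weight=weights[train_idx])
        fold_scores.append(clf.score(X[test_idx], y[test_idx]))
    if np.mean(fold_scores) > best_score:
        best_score, best_params = np.mean(fold_scores), params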