I am attempting to perform cross-validation on an SGD model in PySpark. I am working with LinearRegressionWithSGD from pyspark.mllib.regression, and with ParamGridBuilder and CrossValidator, both from pyspark.ml.tuning.
After following the documentation on the Spark website, I was hoping that running this would work:
lr = LinearRegressionWithSGD()
pipeline = Pipeline(stages=[lr])
paramGrid = ParamGridBuilder()\
    .addGrid(lr.stepSize, [0.1, 0.01])\
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=10)
But LinearRegressionWithSGD() does not have the attribute stepSize (I tried others with no luck either).
I can set lr to LinearRegression, but then I am unable to use SGD for my model and cross-validate.
There is a kFold method in the Scala API, but I am not sure how to access it from PySpark.
You can use the step parameter of LinearRegressionWithSGD to define your step size, but that will not make your code work, because you are mixing incompatible libraries. Unfortunately, I do not know how to do cross-validation with the ml library using SGD optimization (I would like to know myself), but you are mixing pyspark.ml and pyspark.mllib. Specifically, you cannot use LinearRegressionWithSGD with the pyspark.ml library; you have to use pyspark.ml.regression.LinearRegression.
The good news is that you can set the solver attribute of pyspark.ml.regression.LinearRegression to 'gd'. Therefore, you can probably configure the 'gd' optimizer to run as SGD, but I am not sure where the solver documentation is or how to set the solver attributes (e.g. the batch size). The API shows the LinearRegression object calling Param(), but I am not sure whether it is using the pyspark.mllib optimizer. If anyone knows how to set the solver attributes, that could answer your question by letting you use the Pipeline, ParamGridBuilder, and CrossValidator ml classes for model selection with LinearRegression using SGD optimization for parameter tuning.
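In the meantime, here is a minimal sketch of the ml-only route (my own example, not your data): pyspark.ml.regression.LinearRegression with its default solver, so that Pipeline, ParamGridBuilder, and CrossValidator all come from the same API. A `training` DataFrame with "features" and "label" columns is assumed.

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

lr = LinearRegression(maxIter=100)          # the ml estimator, not the mllib one
pipeline = Pipeline(stages=[lr])

paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01])\
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(),
                          numFolds=10)

cvModel = crossval.fit(training)            # `training` is an assumed DataFrame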
Respectfully,
Shane
I have a question about this tutorial.
The author is doing hyperparameter tuning. The first window shows different values of the hyperparameters.
Then he initializes GridSearchCV and sets cv=3 and scoring='roc_auc'.
Then he fits GridSearchCV and uses eval_set and eval_metric='auc'.
What is the purpose of using both cv and eval_set? Shouldn't we use just one of them? How are they used along with scoring='roc_auc' and eval_metric='auc'?
Is there a better way to do hyperparameter tuning using GridSearchCV? Please suggest one or provide a link.
GridSearchCV performs cross-validation for hyperparameter tuning using only the training data. Since refit=True by default, the best fit is then validated on the eval set provided (a true test score).
You can use any metric to perform cross-validation and testing. However, it would be odd to use different metrics for the cv hyperparameter optimization and the testing phases, so the same metric is used. If you are wondering about the slightly different metric naming, I think it's just because xgboost is a sklearn-interface-compliant package but is not developed by the same team as sklearn. They should both do the same thing (area under the receiver operating characteristic curve for the predictions). Take a look at the sklearn docs: auc and roc_auc.
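As a hedged illustration of how the two pieces fit together (my own sketch, not the tutorial's code): GridSearchCV cross-validates each candidate on the training folds using scoring='roc_auc', while eval_set and eval_metric='auc' are passed through to XGBoost's fit so it can monitor a held-out set during boosting. Note that recent xgboost versions accept eval_metric in the constructor; older ones take it as a fit argument.

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"max_depth": [3, 5], "learning_rate": [0.1, 0.3]}

grid = GridSearchCV(xgb.XGBClassifier(eval_metric="auc"),   # metric monitored by xgboost itself
                    param_grid, cv=3, scoring="roc_auc")    # metric used to pick hyperparameters

# fit parameters such as eval_set are forwarded to XGBClassifier.fit
grid.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

print(grid.best_params_, grid.best_score_)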
I don't think there is a better way.
I'm performing a classification task using XGBClassifier, and I want to reuse sklearn's functionality as much as possible. In particular, I'm interested in defining a custom scorer using the fbeta_score function to get an F0.5 score.
When I run the following code:
import xgboost as xgb
from sklearn.metrics import f1_score, fbeta_score, make_scorer

clf = xgb.XGBClassifier(max_depth=5,
                        learning_rate=0.25,
                        objective='binary:logistic',
                        use_label_encoder=False,
                        eval_metric=make_scorer(fbeta_score(beta=0.5)),
                        )
I get the following error:
TypeError: fbeta_score() missing 2 required positional arguments: 'y_true' and 'y_pred'
Also, following this part of the XGBoost documentation, I simplified the case to use a predefined, ready-made metric, f1_score (eval_metric=f1_score), but XGBClassifier falls back to the default log-loss metric.
How can I implement my customised metric in the appropriate way?
If you check the documentation, you cannot create your own metric for eval_metric; you can only use the ones listed in the documentation.
But if you want to optimize for a custom metric, I think you can specify it in GridSearchCV with the scoring parameter.
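A minimal sketch of that scoring route (my own example, with an assumed param_grid and training data): build an F0.5 scorer with make_scorer, passing the function itself rather than calling it, and hand it to GridSearchCV.

import xgboost as xgb
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

f05_scorer = make_scorer(fbeta_score, beta=0.5)   # pass the function, don't call it

grid = GridSearchCV(
    xgb.XGBClassifier(objective='binary:logistic'),
    param_grid={'max_depth': [3, 5], 'learning_rate': [0.1, 0.25]},
    scoring=f05_scorer,
    cv=3,
)
grid.fit(X_train, y_train)   # X_train / y_train assumed, as in the question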
I am currently trying to create a TensorFlow DNN model with a multilabel target variable, and while my code hasn't had any problems so far, the imbalanced nature of the dataset I'm working with has caused a few problems.
As per the recommendations in Keras' documentation, I've applied an initial bias to the model. I've also tried to enable the class_weight parameter when fitting the model, and this is where I'm stuck:
https://github.com/tensorflow/tensorflow/issues/41448
There seems to be a known bug in this method, as seen in the GitHub link above, and my attempts at creating a workaround haven't been successful at all. I'd appreciate any advice on creating a workaround, because I'm at a loss myself, to be honest. I'm currently running TensorFlow 2.4.
You are using a slightly old version of TensorFlow. This worked for me in a multiclass dataset using TensorFlow 2.7 and Keras 2.7:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# one weight per class, inversely proportional to its frequency in y_train
class_weights = compute_class_weight(class_weight="balanced",
                                     classes=np.unique(y_train),
                                     y=y_train)

model.fit(
    ...
    class_weight=dict(enumerate(class_weights))   # map class index -> weight
)
The values of y_train must be integers in the range [0, NUMBER_CLASSES - 1] for this code to work correctly. You can accomplish this using LabelEncoder.
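For example, a small sketch of that encoding step (assuming y_train holds the raw labels):

from sklearn.preprocessing import LabelEncoder

y_train = LabelEncoder().fit_transform(y_train)   # maps labels to 0 .. n_classes-1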
Alternatively, you can use sample_weight instead of class_weight to accomplish the same thing (in fact, Keras internally converts class_weight to sample_weight). Here you can find the documentation about these parameters.
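A sketch of that alternative, using compute_sample_weight from scikit-learn (X_train and y_train are assumed, as above):

from sklearn.utils.class_weight import compute_sample_weight

# one weight per training example, derived from its class frequency
sample_weights = compute_sample_weight(class_weight="balanced", y=y_train)

model.fit(X_train, y_train, sample_weight=sample_weights)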
Other easy-to-implement and effective methods to combat data imbalance are oversampling and undersampling, which have a similar effect to using class_weight. You can use them in case you have problems using class_weight or sample_weight.
I have a binary classification problem that I'm trying to solve using LightGBM's train and cv APIs.
First I tuned the hyperparameters using hyperopt together with an objective function that wraps the LightGBM cv API call. Since the target classes are highly unbalanced, I used a customized focal loss function with an F1-score evaluation to find the best fit.
When I try to fit the final model using the optimized parameters, the model doesn't treat it as a binary problem and outputs continuous values at prediction time. See the attached image.
Does anyone know what I'm missing?
Jupyter notebook
I am learning XGBoost and I am using Python (3.x). I came across the XGBoost cv function. Suppose I have two models, gbt1 and gbt2, which I created using XGBClassifier. Now I was looking to use XGBoost's cv method for cross-validation. I noticed that I didn't need to specify which model I am trying to optimize here; I just need to pass the params and a DMatrix. My question is: how does XGBoost determine which model or estimator to use?
cv_df = xgb.cv(params, DTrain, num_boost_round=5, nfold=n_folds,
               early_stopping_rounds=early_stopping)
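For reference, a self-contained sketch of that kind of call on a toy dataset (my own example; the parameter values are arbitrary). As far as I know, xgb.cv builds its boosters internally from params and the DMatrix rather than taking an existing estimator such as gbt1 or gbt2.

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=0)
dtrain = xgb.DMatrix(X, label=y)                      # cv works on a DMatrix, not a fitted model

params = {"objective": "binary:logistic", "max_depth": 5, "eta": 0.25}

cv_df = xgb.cv(params, dtrain, num_boost_round=5, nfold=3,
               metrics="auc", early_stopping_rounds=3, seed=0)
print(cv_df)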