I have an assignment and it asks me to:
Improve the performance of the models from the previous stepwith
hyperparameter tuning and select a final optimal model using grid
search based on a metric (or metrics) that you choose. Choosing an
optimal model for a given task (comparing multiple regressors on a
specific domain) requires selecting performance measures, for example,
R2(coefficient of determination) and/or RMSE (root mean squared
error) to compare the model performance.
I used this code for hyperparameter tuning:
model_example = GradientBoostingRegressor()
parameters = {'learning_rate': [0.1, 1],
'max_depth': [5,10]}
model_best = GridSearchCV(model_example,
param_grid=parameters,
cv=2,scoring='r2').fit(X_train_new,y_train_new)
model_best.best_estimator_
I found the learning rate=0.1 and max_dept=5
I have chosen scoring='r3' as a performance measure but it doesn't have any effect on my model accuracy when I used this code for providing my best model:
my_best_model = GradientBoostingRegressor(learning_rate=0.1,
max_depth=5).fit(X_train_new,y_train_new)
my_best_model.score(X_train_new,y_train_new)
Do you know what's wrong with my work?
Try setting a random_state as a parameter of your GradientBoostingRegressor(). For example, GradientBoostingRegressor(random_state=1).
The model will then produce the same results on the same data. Without that parameter, there's an element of randomness that makes it difficult to compare different model fits.
Setting a random state on the train-test-split will also help with this.
Related
I am new to data science and doing a project about Kernel Density Estimation, specifically about finding the best bandwidth and kernel function to use.
I want to use Scikit Learn's KernelDensity which allows choosing the bandwidth and the kernel.
I need to use some datasets to create a KDE model which will evaluate the probability density function and somehow evaluate its performance.
The problem is, I don't have the actual PDF to compare to, so I'm not sure how to evaluate the performance. I guess I could split the data to train and test sets, but then I'm not sure how to evaluate the model's performance on the test set.
Can anyone suggest good methods in order to evaluate how good is the evaluated PDF compared to the underlying distribution?
Also, I found Scikit Learn's GridSearchCV which tries to find the best hyperparameters (such as bandwidth or kernel) using cross validation.
As far as I saw, this can be used on KernelDensity with a certain dataset without giving the actual values to compare to. Something like this:
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(KernelDensity(),
{'bandwidth': np.linspace(0.1, 1.0, 30)},
cv=20) # 20-fold cross-validation
grid.fit(x[:, None])
print grid.best_params_
So my question is, how does this grid search know to evaluate the performance of KDE in order to select the best hyperparameters if it doesn't have the actual underlying PDF to compare to?
Maybe if I understand what method is used for this, I could get an idea of how to evaluate my models.
So regarding compared to pdf without knowing. Basically, It splits the data into training and testing sets and uses cross-validation to estimate the performance of the model on unseen data. It then selects the hyperparameters that result in the highest cross-validated log-likelihood.
You can use GridSearchCV to find the optimal bandwidth and kernel for your KDE model
param_grid = {
'bandwidth': np.linspace(1e-3, 1, 30),
'kernel': ['gaussian', 'tophat', 'exponential']
}
grid = GridSearchCV(KernelDensity(), param_grid, cv=5)
grid.fit(data)
In this example penalty and regularization parameters for a logistic regression model are tested. I do not understand how to choose such parameters. For example why focus on (11,12) or (0,4)? How does this relate to the data we have?
The purpose of Grid search is to find the generalized optimal parameter.
For example why focus on (l1,l2) or (0,4)?
The penalty parameter and regularization parameter affects the classification boundary. So to find the best classification the focus is made.
How does this relate to the data we have?
These are not directly related to the data we have. The idea is, for example we have to choose 'C' (regularization) parameter, which gives gives smallest difference between the training and validation set. So that the model should be simple as well as generalized on future data.
In general,to choose the range of parameters, it is not single time attempt based on the previous iterations the range can be widen according to the model performance.
I was wondering how the final model (i.e. decision boundary) of LogisticRegressionCV in sklearn was calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
and now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0],cv=5)
clf.fit(Xdata,ylabels)
This is looking at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be a dictionary with one key with a value that is an array with shape (n_folds,1). With these five folds you can get a better idea of how the model performs.
However, I'm confused about what you get from clf.coef_ (and I'm assuming the parameters in clf.coef_ are the ones used in clf.predict). I have a few options I think it could be:
The parameters in clf.coef_ are from training the model on all the data
The parameters in clf.coef_ are from the best scoring fold
The parameters in clf.coef_ are averaged across the folds in some way.
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
GridSearchCV final model
scikit-learn LogisticRegressionCV: best coefficients
Using cross validation and AUC-ROC for a logistic regression model in sklearn
Evaluating Logistic regression with cross validation
You are mistaking between hyper-parameters and parameters. All scikit-learn estimators which have CV in the end, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV tune the hyper-parameters.
Hyper-parameters are not learnt from training on the data. They are set prior to learning assuming that they will contribute to optimal learning. More information is present here:
Hyper-parameters are parameters that are not directly learnt within
estimators. In scikit-learn they are passed as arguments to the
constructor of the estimator classes. Typical examples include C,
kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.
In case of LogisticRegression, C is a hyper-parameter which describes the inverse of regularization strength. The higher the C, the less regularization is applied on the training. Its not that C will be changed during training. It will be fixed.
Now coming to coef_. coef_ contains coefficient (also called weights) of the features, which are learnt (and updated) during the training. Now depending on the value of C (and other hyper-parameters present in contructor), these can vary during the training.
Now there is another topic on how to get the optimum initial values of coef_, so that the training is faster and better. Thats optimization. Some start with random weights between 0-1, others start with 0, etc etc. But for the scope of your question, that is not relevant. LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
Get the values of different C from constructor (In your example you passed 1.0).
For each value of C, do the cross-validation of supplied data, in which the LogisticRegression will be fit() on training data of the current fold, and scored on the test data. The scores from test data of all folds are averaged and that becomes the score of the current C. This is done for all C values you provided, and the C with the highest average score will be chosen.
Now the chosen C is set as the final C and LogisticRegression is again trained (by calling fit()) on the whole data (Xdata,ylabels here).
Thats what all the hyper-parameter tuners do, be it GridSearchCV, or LogisticRegressionCV, or LassoCV etc.
The initializing and updating of coef_ feature weights is done inside the fit() function of the algorithm which is out of scope for the hyper-parameter tuning. That optimization part is dependent on the internal optimization algorithm of the process. For example solver param in case of LogisticRegression.
Hope this makes things clear. Feel free to ask if still any doubt.
You have the parameter refit=True by default. On the docs you can read:
If set to True, the scores are averaged across all folds, and the
coefs and the C that corresponds to the best score is taken, and a
final refit is done using these parameters. Otherwise the coefs,
intercepts and C that correspond to the best scores across folds are
averaged.
So if refit=True the CV model is retrained using all the data.
When it says the final refit is done using these parameters it is talking about the C regularization parameter. So it uses the C that gives the best
average score across the K folds.
When refit=False it retrieves you the best model in cross validation.
So if you trained 5 folds, you will get the model (coeff + C + intercept), trained on 4 folds of data, which gave the best score on its fold test set.
I agree that the documetation here is not very clear but averaging C values and coefficients does not really make much sense
I just took a look at the source code. It seems for refit = True, they just selected the best hyperparameter (C and l1_ratio) and retrain the model with all the data.
for refit = False:
It seems they do average the hyperparameters, see the blow source code:
best_indices = np.argmax(scores, axis=1)
...
best_indices_C = best_indices % len(self.Cs_)
self.C_.append(np.mean(self.Cs_[best_indices_C]))
How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions?
Here's an example - we train our cv model using the code below:
cv_mod = lgb.cv(params,
d_train,
500,
nfold = 10,
early_stopping_rounds = 25,
stratified = True)
How can we use the parameters found from the best iteration of the above code to predict an output? In this case, cv_mod has no "predict" method like lightgbm.train, and the dictionary output from lightgbm.cvthrows an error when used in lightgbm.train.predict(..., pred_parameters = cv_mod).
Am I missing an important transformation step?
In general, the purpose of CV is NOT to do hyperparameter optimisation. The purpose is to evaluate performance of model-building procedure.
A basic train/test split is conceptually identical to a 1-fold CV (with a custom size of the split in contrast to the 1/K train size in the k-fold CV). The advantage of doing more splits (i.e. k>1 CV) is to get more information about the estimate of generalisation error. There is more info in a sense of getting the error + stat uncertainty. There is an excellent discussion on CrossValidated (start with the links added to the question, which cover the same question, but formulated in a different way). It covers nested cross validation and is absolutely not straightforward. But if you will wrap your head around the concept in general, this will help you in various non-trivial situations. The idea that you have to take away is: The purpose of CV is to evaluate performance of model-building procedure.
Keeping that idea in mind, how does one approach hyperparameter estimation in general (not only in LightGBM)?
You want to train a model with a set of parameters on some data and evaluate each variation of the model on an independent (validation) set. Then you intend to choose the best parameters by choosing the variant that gives the best evaluation metric of your choice.
This can be done with a simple train/test split. But evaluated performance, and thus the choice of the optimal model parameters, might be just a fluctuation on a particular split.
Thus, you can evaluate each of those models more statistically robust averaging evaluation over several train/test splits, i.e k-fold CV.
Then you can make a step further and say that you had an additional hold-out set, that was separated before hyperparameter optimisation was started. This way you can evaluate the chosen best model on that set to measure the final generalisation error. However, you can make even step further and instead of having a single test sample you can have an outer CV loop, which brings us to nested cross validation.
Technically, lightbgm.cv() allows you only to evaluate performance on a k-fold split with fixed model parameters. For hyper-parameter tuning you will need to run it in a loop providing different parameters and recoding averaged performance to choose the best parameter set. after the loop is complete. This interface is different from sklearn, which provides you with complete functionality to do hyperparameter optimisation in a CV loop. Personally, I would recommend to use the sklearn-API of lightgbm. It is just a wrapper around the native lightgbm.train() functionality, thus it is not slower. But it allows you to use the full stack of sklearn toolkit, thich makes your life MUCH easier.
If you're happy with your CV results, you just use those parameters to call the 'lightgbm.train' method. Like #pho said, CV is usually just for param tuning. You don't use the actual CV object for predictions.
You should use CV for parameter optimization.
If your model performs well on all folds use these parameters to train on the whole training set.
Then evaluate that model on the external test set.
I am performing a machine learning task wherein I am using logistic regression for topic classification.
If this is my code:
model= LogisticRegression()
model= model.fit(mat_tmp, label_tmp)
y_train_pred = model.predict(mat_tmp_test)
print(metrics.accuracy_score(label_tmp_test, y_train_pred))
Is there a way I can output what exactly is happening inside the model. Like probably a working example of what my model is doing? Like maybe displaying 2-3 documents and how they are being classified?
In order to be fully aware of what is happening in your model, you must first take some time to study the logistic regression algorithm (eg. from lecture notes or Wikipedia). As with other supervised techniques, logistic regression has hyper-parameters and parameters. Hyper-parameters basically specify how your algorithm runs, which you must provide at initialisation (ie. before it sees any data). For example, you could have prior information about the distribution of classes, which then would be a hyper-parameter. Parameters are "learnt" from your data.
Once you understand the algorithm, the interesting question will be what the parameters of your model are (recall that these are retrieved from the data). By visiting the documentation, you find in the attributes section, that this classifier has 3 parameters, which you can access by their field names.
If you are not interested in such details, but only want to assess the accuracy of your classifier, a useful technique is cross-validation. You split your labeled data into k equal sized subsets, and train your classifier using k-1 of them. Then you evaluate the trained classifier on the remaining 1 subset and calculate the accuracy (ie. what proportion of the data could be predicted properly). This method has its drawbacks, but proves to be very useful in general.