Scikit Learn KernelDensity and GridSearchCV - python

I am new to data science and doing a project about Kernel Density Estimation, specifically about finding the best bandwidth and kernel function to use.
I want to use Scikit Learn's KernelDensity which allows choosing the bandwidth and the kernel.
I need to use some datasets to create a KDE model which estimates the probability density function, and then somehow evaluate its performance.
The problem is, I don't have the actual PDF to compare to, so I'm not sure how to evaluate the performance. I guess I could split the data to train and test sets, but then I'm not sure how to evaluate the model's performance on the test set.
Can anyone suggest good methods for evaluating how close the estimated PDF is to the underlying distribution?
Also, I found Scikit Learn's GridSearchCV which tries to find the best hyperparameters (such as bandwidth or kernel) using cross validation.
As far as I saw, this can be used on KernelDensity with a certain dataset without giving the actual values to compare to. Something like this:
import numpy as np
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
from sklearn.neighbors import KernelDensity

grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.linspace(0.1, 1.0, 30)},
                    cv=20)  # 20-fold cross-validation
grid.fit(x[:, None])  # x is a 1-D array of samples
print(grid.best_params_)
So my question is, how does this grid search know how to evaluate the performance of the KDE in order to select the best hyperparameters if it doesn't have the actual underlying PDF to compare to?
Maybe if I understand what method is used for this, I could get an idea of how to evaluate my models.

Regarding how the search evaluates the fit without knowing the true PDF: GridSearchCV splits the data into training and validation folds, fits the KDE on the training fold, and scores it with KernelDensity's score method, which returns the total log-likelihood of the held-out samples under the fitted density. It then selects the hyperparameters that result in the highest cross-validated log-likelihood.
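The same idea works for evaluating your own model when the true PDF is unknown: fit the KDE on a training split and compare the log-likelihood it assigns to a held-out test split for different settings. A minimal sketch on synthetic data (the dataset and bandwidth values here are placeholders):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity

x = np.random.standard_normal(1000)          # placeholder 1-D dataset
x_train, x_test = train_test_split(x, test_size=0.3, random_state=0)

for bw in [0.1, 0.3, 1.0]:
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(x_train[:, None])
    # score() returns the total log-likelihood of the held-out samples
    print(bw, kde.score(x_test[:, None]))

The bandwidth with the highest held-out log-likelihood is the one that generalizes best, which is exactly the criterion GridSearchCV applies fold by fold.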
You can use GridSearchCV to find the optimal bandwidth and kernel for your KDE model:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

param_grid = {
    'bandwidth': np.linspace(1e-3, 1, 30),
    'kernel': ['gaussian', 'tophat', 'exponential']
}
grid = GridSearchCV(KernelDensity(), param_grid, cv=5)
grid.fit(data)  # data has shape (n_samples, n_features)
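After the search you can read off the winning hyperparameters and, if you kept a separate test set, check the log-likelihood of the refitted model on it. A hedged follow-up (data_test is assumed to be a held-out array with the same shape conventions as data):

print(grid.best_params_)            # e.g. {'bandwidth': ..., 'kernel': ...}
best_kde = grid.best_estimator_     # refit on all of `data` because refit=True by default
print(best_kde.score(data_test))    # total log-likelihood of the held-out data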

Related

xgboost and gridsearchcv in python

I have a question about this tutorial.
The author is doing hyperparameter tuning. The first window shows different values of the hyperparameters.
Then he initializes GridSearchCV and sets cv=3 and scoring='roc_auc'.
Then he fits the GridSearchCV and uses eval_set and eval_metric='auc'.
What is the purpose of using both cv and eval_set? Shouldn't we use just one of them? How are they used along with scoring='roc_auc' and eval_metric='auc'?
Is there a better way to do hyperparameter tuning with GridSearchCV? Please suggest one or provide a link.
GridSearchCV performs cross-validation for hyperparameter tuning using only the training data. Since refit=True by default, the best fit is then validated on the eval set provided (a true test score).
You can use any metric to perform the cv and the testing. However, it would be odd to use a different metric for the cv hyperparameter optimization and the testing phases, so the same metric is used. If you are wondering about the slightly different metric naming, I think it's just because xgboost is an sklearn-interface-compliant package that is not developed by the same people as sklearn. Both should do the same thing (area under the receiver operating characteristic curve for the predictions). Take a look at the sklearn docs: auc and roc_auc.
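To make the interplay concrete, here is a hedged sketch of the pattern the tutorial follows, on synthetic data (the parameter grid is arbitrary, and the exact placement of eval_metric differs between xgboost versions):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)   # placeholder data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# cv=3 and scoring='roc_auc' control how the grid search ranks hyperparameters,
# using only X_train / y_train
grid = GridSearchCV(XGBClassifier(eval_metric='auc'),
                    {'max_depth': [3, 5], 'learning_rate': [0.05, 0.1]},
                    cv=3, scoring='roc_auc')

# eval_set is forwarded to xgboost's own fit(); it only monitors AUC on the
# held-out pair (or drives early stopping if configured) and does not
# influence which hyperparameters the search picks
grid.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(grid.best_params_)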
I don't think there is a better way.

gridSearch performance measure effect

I have an assignment and it asks me to:
Improve the performance of the models from the previous step with hyperparameter tuning and select a final optimal model using grid search based on a metric (or metrics) that you choose. Choosing an optimal model for a given task (comparing multiple regressors on a specific domain) requires selecting performance measures, for example, R2 (coefficient of determination) and/or RMSE (root mean squared error) to compare the model performance.
I used this code for hyperparameter tuning:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

model_example = GradientBoostingRegressor()
parameters = {'learning_rate': [0.1, 1],
              'max_depth': [5, 10]}
model_best = GridSearchCV(model_example,
                          param_grid=parameters,
                          cv=2, scoring='r2').fit(X_train_new, y_train_new)
model_best.best_estimator_
I found learning_rate=0.1 and max_depth=5.
I have chosen scoring='r2' as a performance measure, but it doesn't seem to have any effect on my model accuracy when I use this code to build my best model:
my_best_model = GradientBoostingRegressor(learning_rate=0.1,
                                          max_depth=5).fit(X_train_new, y_train_new)
my_best_model.score(X_train_new, y_train_new)
Do you know what's wrong with my work?
Try setting a random_state as a parameter of your GradientBoostingRegressor(). For example, GradientBoostingRegressor(random_state=1).
The model will then produce the same results on the same data. Without that parameter, there's an element of randomness that makes it difficult to compare different model fits.
Setting a random state on the train-test-split will also help with this.
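A minimal sketch of the whole flow with a fixed random_state (synthetic data standing in for X_train_new / y_train_new; the parameter grid is taken from the question):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for X_train_new / y_train_new
X_train_new, y_train_new = make_regression(n_samples=200, n_features=10,
                                            noise=10, random_state=1)

parameters = {'learning_rate': [0.1, 1], 'max_depth': [5, 10]}
model_best = GridSearchCV(GradientBoostingRegressor(random_state=1),
                          param_grid=parameters,
                          cv=2, scoring='r2').fit(X_train_new, y_train_new)

# with a fixed random_state, refitting the chosen hyperparameters by hand
# reproduces essentially the same model as model_best.best_estimator_
my_best_model = GradientBoostingRegressor(random_state=1,
                                          **model_best.best_params_).fit(X_train_new, y_train_new)
print(my_best_model.score(X_train_new, y_train_new))
print(model_best.best_estimator_.score(X_train_new, y_train_new))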

Does sklearn LogisticRegressionCV use all data for final model

I was wondering how the final model (i.e. decision boundary) of LogisticRegressionCV in sklearn was calculated. So say I have some Xdata and ylabels such that
Xdata # shape of this is (n_samples,n_features)
ylabels # shape of this is (n_samples,), and it is binary
and now I run
from sklearn.linear_model import LogisticRegressionCV
clf = LogisticRegressionCV(Cs=[1.0], cv=5)
clf.fit(Xdata, ylabels)
This is looking at just one regularization parameter and 5 folds in the CV. So clf.scores_ will be a dictionary with one key with a value that is an array with shape (n_folds,1). With these five folds you can get a better idea of how the model performs.
However, I'm confused about what you get from clf.coef_ (and I'm assuming the parameters in clf.coef_ are the ones used in clf.predict). I have a few options I think it could be:
The parameters in clf.coef_ are from training the model on all the data
The parameters in clf.coef_ are from the best scoring fold
The parameters in clf.coef_ are averaged across the folds in some way.
I imagine this is a duplicate question, but for the life of me I can't find a straightforward answer online, in the sklearn documentation, or in the source code for LogisticRegressionCV. Some relevant posts I found are:
GridSearchCV final model
scikit-learn LogisticRegressionCV: best coefficients
Using cross validation and AUC-ROC for a logistic regression model in sklearn
Evaluating Logistic regression with cross validation
You are mixing up hyper-parameters and parameters. All scikit-learn estimators whose names end in CV, like LogisticRegressionCV, GridSearchCV, or RandomizedSearchCV, tune the hyper-parameters.
Hyper-parameters are not learnt from training on the data. They are set prior to learning assuming that they will contribute to optimal learning. More information is present here:
Hyper-parameters are parameters that are not directly learnt within
estimators. In scikit-learn they are passed as arguments to the
constructor of the estimator classes. Typical examples include C,
kernel and gamma for Support Vector Classifier, alpha for Lasso, etc.
In the case of LogisticRegression, C is a hyper-parameter which describes the inverse of the regularization strength. The higher the C, the less regularization is applied during training. It's not that C will be changed during training; it stays fixed.
Now coming to coef_: it contains the coefficients (also called weights) of the features, which are learnt (and updated) during training. Depending on the value of C (and the other hyper-parameters passed to the constructor), the values they converge to can vary.
There is a separate topic of how to choose good initial values of coef_ so that training is faster and better; that is optimization. Some solvers start with random weights between 0 and 1, others start at 0, and so on. But for the scope of your question that is not relevant; LogisticRegressionCV is not used for that.
This is what LogisticRegressionCV does:
Get the different values of C from the constructor (in your example you passed only 1.0).
For each value of C, do the cross-validation of supplied data, in which the LogisticRegression will be fit() on training data of the current fold, and scored on the test data. The scores from test data of all folds are averaged and that becomes the score of the current C. This is done for all C values you provided, and the C with the highest average score will be chosen.
Now the chosen C is set as the final C and LogisticRegression is again trained (by calling fit()) on the whole data (Xdata,ylabels here).
That's what all the hyper-parameter tuners do, be it GridSearchCV, LogisticRegressionCV, LassoCV, etc.
The initialization and updating of the coef_ feature weights is done inside the algorithm's fit() function, which is out of scope for the hyper-parameter tuning. That optimization part depends on the internal optimization algorithm, for example the solver parameter in the case of LogisticRegression.
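To make the final refit behaviour concrete, a minimal sketch on synthetic data (the Cs grid is arbitrary), checking that coef_ matches a plain LogisticRegression refit on all of the data with the chosen C:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf_cv = LogisticRegressionCV(Cs=[0.1, 1.0, 10.0], cv=5).fit(X, y)
print(clf_cv.C_)   # the C chosen by cross-validation (one entry per class)

# refit a plain LogisticRegression on all the data with that C; the
# coefficients should agree up to solver tolerance
clf = LogisticRegression(C=clf_cv.C_[0]).fit(X, y)
print(np.allclose(clf_cv.coef_, clf.coef_, atol=1e-3))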
Hope this makes things clear. Feel free to ask if still any doubt.
You have the parameter refit=True by default. On the docs you can read:
If set to True, the scores are averaged across all folds, and the
coefs and the C that corresponds to the best score is taken, and a
final refit is done using these parameters. Otherwise the coefs,
intercepts and C that correspond to the best scores across folds are
averaged.
So if refit=True the CV model is retrained using all the data.
When it says the final refit is done using these parameters it is talking about the C regularization parameter. So it uses the C that gives the best
average score across the K folds.
When refit=False it gives you the best model found during cross-validation.
So if you trained 5 folds, you will get the model (coeff + C + intercept), trained on 4 folds of data, which gave the best score on its fold test set.
I agree that the documentation here is not very clear, but averaging C values and coefficients does not really make much sense.
I just took a look at the source code. It seems that for refit=True they just select the best hyperparameters (C and l1_ratio) and retrain the model with all the data.
For refit=False, it seems they do average the hyperparameters; see the source code below:
best_indices = np.argmax(scores, axis=1)
...
best_indices_C = best_indices % len(self.Cs_)
self.C_.append(np.mean(self.Cs_[best_indices_C]))

Converting LinearSVC's decision function to probabilities (Scikit learn python )

I use a linear SVM from scikit-learn (LinearSVC) for a binary classification problem. I understand that LinearSVC can give me the predicted labels and the decision scores, but I want probability estimates (confidence in the label). I want to continue using LinearSVC because of speed (as compared to sklearn.svm.SVC with a linear kernel). Is it reasonable to use a logistic function to convert the decision scores to probabilities?
import sklearn.svm as suppmach
# Fit model (assuming x_train, y_train, x_test are already defined):
svmmodel = suppmach.LinearSVC(penalty='l1', C=1)
svmmodel.fit(x_train, y_train)
predicted_test = svmmodel.predict(x_test)
predicted_test_scores = svmmodel.decision_function(x_test)
I want to check if it makes sense to obtain probability estimates simply as [1 / (1 + exp(-x))], where x is the decision score.
Alternatively, are there other classifiers I could use to do this efficiently?
Thanks.
scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it allows you to add probability output to LinearSVC or any other classifier which implements the decision_function method:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

svm = LinearSVC()
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)
The user guide has a nice section on that. By default CalibratedClassifierCV + LinearSVC will get you Platt scaling, but it also provides other options (the isotonic regression method), and it is not limited to SVM classifiers.
I took a look at the APIs in the sklearn.svm.* family. The classifiers below, i.e.,
sklearn.svm.SVC
sklearn.svm.NuSVC
have a common interface that supplies a
probability: boolean, optional (default=False)
parameter to the model. If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs based on the idea of Platt scaling. The form of the transformation is similar to a logistic function, as you pointed out, but two specific constants A and B are learned in a post-processing step. Also see this stackoverflow post for more details.
I actually don't know why this post-processing is not available for LinearSVC. Otherwise, you would just call predict_proba(X) to get the probability estimate.
Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you understand the underlying algorithm of Platt scaling, you can probably write your own or contribute to the scikit-learn svm family. :) Also feel free to use the SVM variations above that support predict_proba.
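For completeness, this is roughly what the probability=True route looks like with SVC and a linear kernel (a hedged sketch on synthetic data; it is slower than LinearSVC because of the extra Platt-scaling cross-validation):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True makes libsvm fit a Platt-scaling model on top of the SVM
clf = SVC(kernel='linear', probability=True).fit(X_train, y_train)
print(clf.predict_proba(X_test)[:5])   # class probabilities for the first 5 samples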
If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. That uses the exact same training algorithm as LinearSVC, but with log-loss instead of hinge loss.
Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
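A minimal sketch of that drop-in replacement (synthetic data; the C value is a placeholder):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# a linear model trained with log-loss, so predict_proba is available directly
clf = LogisticRegression(C=1.0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)        # probability estimates
scores = clf.decision_function(X_test)   # raw scores, analogous to LinearSVC's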
Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which performs gradient descent with an SVM (hinge) loss by default. For estimating binary probabilities it uses the modified Huber loss, via
(clip(decision_function(X), -1, 1) + 1) / 2
An example would look like:
from sklearn.linear_model import SGDClassifier
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)

How does LassoCV in scikit-learn partition data?

I am performing linear regression using the Lasso method in sklearn.
According to their guidance, and that which I have seen elsewhere, instead of simply conducting cross validation on all of the training data it is advised to split it up into more traditional training set / validation set partitions.
The Lasso is thus trained on the training set, and then the hyperparameter alpha is tuned on the basis of results from cross-validation on the validation set. Finally, the accepted model is used on the test set to give a realistic view of how it will perform in reality. Separating the concerns out here is a preventative measure against overfitting.
Actual Question
Does LassoCV conform to the above protocol, or does it just somehow train the model parameters and hyperparameters on the same data and/or during the same rounds of CV?
Thanks.
If you use sklearn.model_selection.cross_val_score with a sklearn.linear_model.LassoCV object, then you are performing nested cross-validation. cross_val_score will divide your data into train and test sets according to how you specify the folds (which can be done with objects such as sklearn.model_selection.KFold). The train set will be passed to the LassoCV, which itself performs another splitting of the data in order to choose the right penalty. This, it seems, corresponds to the setting you are seeking.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LassoCV

X = np.random.randn(20, 10)
y = np.random.randn(len(X))

cv_outer = KFold(n_splits=5)
lasso = LassoCV(cv=3)  # cv=3 makes a KFold inner splitting with 3 folds
scores = cross_val_score(lasso, X, y, cv=cv_outer)
Answer: no, LassoCV will not do all the work for you, and you have to use it in conjunction with cross_val_score to obtain what you want. This is at the same time the reasonable way of implementing such objects, since we can also be interested in only fitting a hyperparameter optimized LassoCV without necessarily evaluating it directly on another set of held out data.
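If you only want the hyperparameter-optimized model itself (the last case mentioned above), fitting LassoCV directly is enough. A small sketch reusing X and y from the block above:

lasso = LassoCV(cv=3).fit(X, y)
print(lasso.alpha_)   # the penalty chosen by the inner cross-validation
print(lasso.coef_)    # coefficients from the final refit on all of X, y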
