I'm working on a learning-to-rank task; the dataset has a column thread_id, which is a group label (grouped data).
In the evaluation phase I must take these groups into account, as my scoring function works in a per-thread fashion (e.g. nDCG).
Now, if I implement nDCG with the signature scorer(estimator, X, y), I can easily pass it to GridSearchCV as the scoring function, as in the example below:
def my_nDCG(estimator, X, y):
    # group by X['thread_id']
    # compute the result
    return result

splitter = GroupShuffleSplit(...).split(X, groups=X['thread_id'])
cv = GridSearchCV(clf, param_grid, cv=splitter, scoring=my_nDCG)
GridSearchCV selects the model by calling my_nDCG().
Unfortunately, inside my_nDCG, X doesn't have the thread_id column, as it must be dropped before passing X to fit(); otherwise I'd be training the model with thread_id as a feature.
cv.fit(X.drop('thread_id', axis=1), y)
How can I do this without the terrible workaround of keeping thread_id in a global variable and merging it back with X inside my_nDCG()?
Is there any other way to use nDCG with scikit-learn? I see scikit-learn supports grouped data (e.g. GroupShuffleSplit), but proper support seems to be missing when it comes to model evaluation on grouped data.
Edit
Just noticed that GridSearchCV.fit() accepts a groups parameter; in my case it would still be X['thread_id'].
At this point I only need to read that parameter within my custom scoring function. How can I do that?
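One workaround (not built-in support; make_ndcg_scorer and compute_ndcg below are hypothetical names) is to build the scorer as a closure over the thread_id column, aligned by the DataFrame index, so no global is needed:

def make_ndcg_scorer(thread_ids):
    # thread_ids: pandas Series indexed like the original X
    def my_nDCG(estimator, X, y):
        groups = thread_ids.loc[X.index]        # re-attach the group labels
        scores = estimator.predict(X)           # or decision_function, as appropriate
        return compute_ndcg(scores, y, groups)  # user-defined per-thread nDCG
    return my_nDCG

groups = X['thread_id']
splitter = GroupShuffleSplit(n_splits=5).split(X, groups=groups)
cv = GridSearchCV(clf, param_grid, cv=splitter,
                  scoring=make_ndcg_scorer(groups))
cv.fit(X.drop('thread_id', axis=1), y)

This assumes X is a pandas DataFrame; scikit-learn slices DataFrames positionally, so the original index labels survive the CV split and can be used to look up the groups inside the scorer.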
Related
tl;dr: Is there any way to call .get_feature_names() on the fitted-and-transformed output of the previous pipeline step, in order to use it as a hyperparameter in the next step of the pipeline?
I have a Pipeline that includes fitting and transforming text data with TfidfVectorizer, and then runs a RandomForestClassifier. I want to GridSearchCV across various levels of max_features in the classifier, based on the number of features that the transformation produced from the text.
# set up pipeline
pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                             min_df=3,
                             norm='l1',
                             stop_words='english',
                             use_idf=False)),
    ('rf', RandomForestClassifier(random_state=1,
                                  criterion='entropy',
                                  n_estimators=800))
])

# set up parameter grid
params = {
    'rf__max_features': np.arange(1, len(vect.get_feature_names()), 1)
}
Instantiating returns the following error:
NameError: name 'vect' is not defined
Edit:
This is more relevant (though not shown in the sample code) if I were varying a parameter of the TfidfVectorizer such as ngram_range; one can see how that could change the number of features passed to the next step...
The parameter grid gets populated before anything in the pipeline is fitted, so you can't do this directly. You might be able to monkey-patch the grid search, like here, but I'd expect it to be substantially harder, since your second parameter depends on the results of fitting the first step.
I think the best approach, while it won't produce exactly what you're after, is to just use fractional values for max_features, i.e. a fraction of the columns coming out of the vectorizer, as in the sketch below.
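For example, a grid along these lines might work (the ngram_range values are only illustrative; RandomForestClassifier interprets a float max_features as a fraction of the number of features):

import numpy as np
from sklearn.model_selection import GridSearchCV

params = {
    'vect__ngram_range': [(1, 1), (1, 2)],          # illustrative values
    'rf__max_features': np.linspace(0.1, 1.0, 10),  # fractions, not counts
}
grid = GridSearchCV(pipe, params, cv=5)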
If you really want a score for every integer value of max_features, I think the easiest way may be to use two nested grid searches, with the inner one only instantiating its parameter space when its fit is called:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

estimator = RandomForestClassifier(
    random_state=1,
    criterion='entropy',
    n_estimators=800
)

class MySearcher(GridSearchCV):
    def fit(self, X, y):
        # By the time fit is called, X is the vectorizer's output, so the
        # number of features is known and the real grid can be built.
        m = X.shape[1]
        self.param_grid = {'max_features': np.arange(1, m, 1)}
        return super().fit(X, y)

pipe = Pipeline([
    ('vect', TfidfVectorizer(max_df=.4,
                             min_df=3,
                             norm='l1',
                             stop_words='english',
                             use_idf=False)),
    # placeholder grid; it gets overwritten inside MySearcher.fit
    ('rf', MySearcher(estimator=estimator,
                      param_grid={'fake': ['passes', 'check']}))
])
Now the search results will be awkwardly nested (best values of, say, ngram_range give you a refitted copy of pipe, whose second step will itself have a best value of max_features and a corresponding refitted random forest). Also, the data available for the inner search will be a bit smaller.
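For illustration, the nesting could be used like this sketch (the ngram_range values, X and y are placeholders):

outer = GridSearchCV(pipe, {'vect__ngram_range': [(1, 1), (1, 2)]}, cv=5)
outer.fit(X, y)

# the second step of the refitted best pipeline is itself a fitted search
inner = outer.best_estimator_.named_steps['rf']
print(inner.best_params_)   # best integer max_features for that pipeline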
According to the documentation of scikit-learn:
sklearn.model_selection.cross_val_score(estimator, X, y=None,
    groups=None, scoring=None, cv=None, n_jobs=1, verbose=0,
    fit_params=None, pre_dispatch='2*n_jobs')
X and y:
X : array-like
    The data to fit. Can be for example a list, or an array.
y : array-like, optional, default: None
    The target variable to try to predict in the case of supervised learning.
I am wondering whether [X, y] should be X_train and y_train, or the whole dataset. In some Kaggle notebooks people use the whole dataset, and in others X_train and y_train.
To my knowledge, cross-validation just evaluates the model and shows whether or not you overfit/underfit your data (it does not actually train the model). So, in my view, the more data you have, the better the performance will be, which is why I would use the whole dataset.
What do you think?
Model performance depends on the way the data is split, and sometimes the model does not have the ability to generalize.
That is why we need cross-validation.
Cross-validation is a vital step in evaluating a model. It maximizes the amount of data that is used to train the model: over the course of the procedure, the model is not only trained but also tested on all of the available data.
I am wondering whether [X,y] is X_train and y_train or [X,y] should be
the whole dataset.
[X, y] should be the whole dataset, because internally cross-validation splits the data into training data and test data.
Suppose you use cross validation with 5 folds (cv = 5).
We begin by splitting the dataset into five groups, or folds. Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set and compute the metric of interest.
Next, we hold out the second fold as our test set, fit on the remaining data, predict on the test set and compute the metric of interest.
By default, scikit-learn's cross_val_score() function uses R^2 score as the metric of choice for regression.
R^2 score is called coefficient of determination.
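A minimal sketch of the whole-dataset usage (X and y here stand for the full dataset; the estimator is just an example):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

scores = cross_val_score(LinearRegression(), X, y, cv=5)  # one R^2 per fold
print(scores.mean())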
After using sklearn.linear_model.LogisticRegression to fit a training data set, I would like to obtain the value of the cost function for the training data set and a cross validation data set.
Is it possible to have sklearn simply give me the value (at the fit minimum) of the function it minimized?
The function is stated in the documentation at http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression (depending on the regularization one has chosen). But I can't find how to get sklearn to give me the value of this function.
I would have thought this is what LogisticRegression.score does, but that simply returns the accuracy (the fraction of data points its prediction classifies correctly).
I have found sklearn.metrics.log_loss, but of course this is not the actual function being minimized.
Unfortunately there is no "nice" way to do so, but there is a private function
_logistic_loss(w, X, y, alpha, sample_weight=None) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py, thus you can call it by hand
from sklearn.linear_model.logistic import _logistic_loss
print(_logistic_loss(clf.coef_, X, y, 1 / clf.C))
where clf is your learned LogisticRegression
I used the code below to calculate the cost value.
import numpy as np

# sum of squared errors between the predictions and the true labels
cost = np.sum((reg.predict(x) - y) ** 2)
where reg is your learned LogisticRegression
I have the following suggestion.
You can write the loss function of logistic regression as a function yourself.
After you get the predicted labels for your data, you can invoke your defined function to calculate the cost values.
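As a hedged sketch of that suggestion (logreg_objective is a hypothetical helper; the formula follows the documented L2 objective 0.5 * ||w||^2 + C * sum of per-sample log-losses, for a fitted binary LogisticRegression with penalty='l2'):

import numpy as np
from sklearn.metrics import log_loss

def logreg_objective(clf, X, y):
    # unregularized part: sum (not mean) of per-sample log-losses
    total_log_loss = log_loss(y, clf.predict_proba(X), normalize=False)
    # L2 penalty on the coefficients (the intercept is not penalized)
    l2_penalty = 0.5 * np.sum(clf.coef_ ** 2)
    return l2_penalty + clf.C * total_log_loss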
I am new to scikit-learn, and have two slight issues combining data scaling and grid search.
Efficient scaler
Considering a cross-validation using KFold, I would like that each time we train the model on the K-1 folds, the data scaler (using preprocessing.StandardScaler() for instance) is fit only on those K-1 folds and then applied to the remaining fold.
My impression is that the following code will fit the scaler on the entire dataset, and therefore I would like to modify it to behave as described previously:
classifier = svm.SVC(C=1)
clf = make_pipeline(preprocessing.StandardScaler(), classifier)
tuned_parameters = [{'C': [1, 10, 100, 1000]}]
my_grid_search = GridSearchCV(clf, tuned_parameters, cv=5)
Retrieve inner scaler fitting
When refit=True, after the grid search the model is refit (using the best estimator) on the entire dataset. My understanding is that the pipeline is used again, and therefore the scaler is fit on the entire dataset. Ideally I would like to reuse that fit to scale my 'test' dataset. Is there a way to retrieve it directly from the GridSearchCV?
GridSearchCV knows nothing about the Pipeline object; it assumes that the provided estimator is atomic, in the sense that it cannot pick out one particular stage (StandardScaler, for example) and fit different stages on different data.
All GridSearchCV does is call the fit(X, y) method on the provided estimator, where X, y are some splits of the data. Thus it fits all stages on the same splits.
Try this:
best_pipeline = my_grid_search.best_estimator_
best_scaler = best_pipeline["standardscaler"]
When you wrap your transformers/estimators in a Pipeline, you have to add the step name as a prefix to each parameter name, e.g. tuned_parameters = [{'svc__C': [1, 10, 100, 1000]}]; look at these examples for more details: Concatenating multiple feature extraction methods, Pipelining: chaining a PCA and a logistic regression.
In any case, read the GridSearchCV documentation; it may help you.
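Putting the pieces together, a minimal sketch of the corrected setup might look like this (X_train, X_test, y_train are placeholders; the step names 'standardscaler' and 'svc' are the ones make_pipeline generates automatically):

from sklearn import preprocessing, svm
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC())
tuned_parameters = [{'svc__C': [1, 10, 100, 1000]}]
my_grid_search = GridSearchCV(clf, tuned_parameters, cv=5)
my_grid_search.fit(X_train, y_train)

# the refitted best pipeline applies its own scaler when predicting,
# but the fitted scaler can also be pulled out explicitly:
best_scaler = my_grid_search.best_estimator_.named_steps['standardscaler']
X_test_scaled = best_scaler.transform(X_test)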
I have an unbalanced dataset, so I have a strategy for oversampling that I only apply during training of my data. I'd like to use scikit-learn classes like GridSearchCV or cross_val_score to explore or cross-validate some parameters on my estimator (e.g. SVC). However, I see that you either pass the number of cv folds or a standard cross-validation generator.
I'd like to create a custom cv generator that gives me a stratified 5-fold split, oversamples only my training data (4 folds), and lets scikit-learn search through the grid of parameters of my estimator, scoring on the remaining fold for validation.
The cross-validation generator returns an iterable of length n_folds, each element of which is a 2-tuple of numpy 1-d arrays (train_index, test_index) containing the indices of the training and test sets for that cross-validation run.
So for 10-fold cross-validation, your custom cross-validation generator needs to contain 10 elements, each of which contains a tuple with two elements:
An array of the indices for the training subset for that run, covering 90% of your data
An array of the indices for the testing subset for that run, covering 10% of the data
I was working on a similar problem in which I created integer labels for the different folds of my data. My dataset is stored in a Pandas dataframe myDf which has the column cvLabel for the cross-validation labels. I construct the custom cross-validation generator myCViterator as follows:
myCViterator = []
for i in range(nFolds):
    trainIndices = myDf[ myDf['cvLabel']!=i ].index.values.astype(int)
    testIndices  = myDf[ myDf['cvLabel']==i ].index.values.astype(int)
    myCViterator.append((trainIndices, testIndices))
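For a usage sketch, the list of (train, test) index tuples can then be passed directly as the cv argument (myClf and myLabels are placeholders):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(myClf,
                         myDf.drop('cvLabel', axis=1),
                         myLabels,
                         cv=myCViterator)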
I had a similar problem and this quick hack is working for me:
import numpy as np
from sklearn.model_selection import StratifiedKFold

class UpsampleStratifiedKFold:
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        for rx, tx in StratifiedKFold(n_splits=self.n_splits).split(X, y):
            # positions, within the training fold, of each class
            nix = np.where(y[rx] == 0)[0]
            pix = np.where(y[rx] == 1)[0]
            # resample the minority (positive) class with replacement
            pixu = np.random.choice(pix, size=nix.shape[0], replace=True)
            ix = np.append(nix, pixu)
            rxm = rx[ix]
            yield rxm, tx

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits
This upsamples (with replacement) the minority class for a balanced (k-1)-fold training set, but leaves the k-th test set unbalanced. This appears to play well with sklearn.model_selection.GridSearchCV and other similar classes requiring a CV generator.
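A usage sketch, assuming X and y are NumPy arrays (the split() above indexes y positionally) and with a placeholder grid and estimator:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

search = GridSearchCV(SVC(), param_grid={'C': [1, 10, 100]},
                      cv=UpsampleStratifiedKFold(n_splits=5))
search.fit(X, y)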
Scikit-learn provides a workaround for this with its LabelKFold iterator:
LabelKFold is a variation of k-fold which ensures that the same label is not in both testing and training sets. This is necessary for example if you obtained data from different subjects and you want to avoid over-fitting (i.e., learning person specific features) by testing and training on different subjects.
To use this iterator in a case of oversampling, first, you can create a column in your dataframe (e.g. cv_label) which stores the index values of each row.
df['cv_label'] = df.index
Then, you can apply your oversampling, making sure you copy the cv_label column in the oversampling as well. This column will contain duplicate values for the oversampled data. You can create a separate series or list from these labels for handling later:
cv_labels = df['cv_label']
Be aware that you will need to remove this column from your dataframe before running your cross-validator/classifier.
After separating your data into features (not including cv_label) and labels, you create the LabelKFold iterator and run the cross validation function you need with it:
clf = svm.SVC(C=1)
lkf = LabelKFold(cv_labels, n_folds=5)
predicted = cross_validation.cross_val_predict(clf, features, labels, cv=lkf)
class own_custom_CrossValidator:  # like those in sklearn/model_selection/_split.py
    def __init__(self):  # coordinates, meter
        pass  # self.coordinates = coordinates  # self.meter = meter

    def split(self, X, y=None, groups=None):
        # for compatibility with cross_val_predict, cross_val_score
        for i in range(0, len(X)):
            yield tuple((np.array(list(range(0, len(X))))