Use LeaveOneGroupOut strategy on cross_val_score in sklearn

Use LeaveOneGroupOut strategy on cross_val_score in sklearn - python

I'd like to use LeaveOneGroupOut strategy to evaluate my model. According to sklearn's document, cross_val_score seems convenient.
However, the following code does not work.
import sklearn
from sklearn import datasets
iris = datasets.load_iris()
from sklearn.model_selection import cross_val_score
clf = sklearn.svm.SVC(kernel='linear', C=1)
# cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0) # => this works
cv = LeaveOneGroupOut # => this does not work
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
The error message is:
ValueError Traceback (most recent call last)
<ipython-input-40-435a3a7fa16c> in <module>()
4 from sklearn.model_selection import cross_val_score
5 clf = sklearn.svm.SVC(kernel='linear', C=1)
----> 6 scores = cross_val_score(clf, iris.data, iris.target, cv=LeaveOneGroupOut())
7 scores
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
138 train, test, verbose, None,
139 fit_params)
--> 140 for train, test in cv.split(X, y, groups))
141 return np.array(scores)[:, 0]
142
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch_one_batch(self, iterator)
601
602 with self._lock:
--> 603 tasks = BatchedCalls(itertools.islice(iterator, batch_size))
604 if len(tasks) == 0:
605 # No more tasks available in the iterator: tell caller to stop.
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, iterator_slice)
125
126 def __init__(self, iterator_slice):
--> 127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in <genexpr>(***failed resolving arguments***)
135 parallel = Parallel(n_jobs=n_jobs, verbose=verbose,
136 pre_dispatch=pre_dispatch)
--> 137 scores = parallel(delayed(_fit_and_score)(clone(estimator), X, y, scorer,
138 train, test, verbose, None,
139 fit_params)
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in split(self, X, y, groups)
88 X, y, groups = indexable(X, y, groups)
89 indices = np.arange(_num_samples(X))
---> 90 for test_index in self._iter_test_masks(X, y, groups):
91 train_index = indices[np.logical_not(test_index)]
92 test_index = indices[test_index]
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in _iter_test_masks(self, X, y, groups)
770 def _iter_test_masks(self, X, y, groups):
771 if groups is None:
--> 772 raise ValueError("The groups parameter should not be None")
773 # We make a copy of groups to avoid side-effects during iteration
774 groups = np.array(groups, copy=True)
ValueError: The groups parameter should not be None
scores

You do not define your groups parameter which is the group according to which you are going to split your data.
The error comes from cross_val_score that takes this parameter in argument : in your case it is equal to None.
Try to follow the example below :
from sklearn.model_selection import LeaveOneGroupOut
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 1, 2])
groups = np.array([1, 1, 2, 2])
lol = LeaveOneGroupOut()
You have :
[In] lol.get_n_splits(X, y, groups)
[Out] 2
Then you will be able to use :
lol.split(X, y, groups)

Related

Error in finding optimal n_estimators via xgboost

Beginner here.
I'm trying to find the best number of n_estimators using xgboost.
But, I'm getting this error.
diabetes.head() #this is a toy dataset in sklearn.datasets.
diabetes.head()
x=diabetes.drop('y',axis=1).values
y=diabetes.y.values
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=16,test_size=0.25)
import xgboost as xgb
xgbmodel=xgb.XGBRegressor(objective="reg:squarederror",eval_metric='rmse',early_stopping_rounds=10,n_estimators=1000,random_state=16)
xgbmodel.fit(x_train,y_train,eval_set=[x_test,y_test])
I think the problem lies within:
eval_set=[x_test,y_test]
P.S. I double checked that diabetes dataset from sklearn can used used for regression.I was wondering if my error lies within eval_metric method.
Full error here:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_10180/532743731.py in <module>
4 import xgboost as xgb
5 xgbmodel=xgb.XGBRegressor(objective="reg:squarederror",eval_metric='rmse',early_stopping_rounds=10,n_estimators=1000,random_state=16)
----> 6 xgbmodel.fit(x_train,y_train,eval_set=[x_test,y_test])
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\core.py in inner_f(*args, **kwargs)
530 for k, arg in zip(sig.parameters, args):
531 kwargs[k] = arg
--> 532 return f(**kwargs)
533
534 return inner_f
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
929 """
930 evals_result: TrainingCallback.EvalsLog = {}
--> 931 train_dmatrix, evals = _wrap_evaluation_matrices(
932 missing=self.missing,
933 X=X,
C:\ProgramData\Anaconda3\lib\site-packages\xgboost\sklearn.py in _wrap_evaluation_matrices(missing, X, y, group, qid, sample_weight, base_margin, feature_weights, eval_set, sample_weight_eval_set, base_margin_eval_set, eval_group, eval_qid, create_dmatrix, enable_categorical)
434
435 evals = []
--> 436 for i, (valid_X, valid_y) in enumerate(eval_set):
437 # Skip the duplicated entry.
438 if all(
ValueError: too many values to unpack (expected 2)

yes, it has to be a tuple, see below, your x_test, y_test should be enclosed in parentheses
eval_setparam = [(self.X_valid, self.y_valid)]
xg_model.fit(self.X_train, self.y_train,
eval_set = eval_setparam,
verbose=False)

Sklearn Naive Bayes with multiple features

Background
I'm struggling to implement a Naive Bayes classifier in python with sklearn across multiple features.
The features I have are:
Title - some short text
Description - some longer text
Timestamp - a float representing an hour of the day (e.g. 18.0 = 6:00PM, 11.5 = 11:30AM)
The labels/classes are categorical strings: e.g. "Class1", "Class2", "Class3"
Aim
My goal is to use the 3 features in order to construct a Naive Bayes classifier for 3 features in order to predict the class label. I specifically wish to use all of the features at the same time, i.e. not simply the description feature.
Initial Approach
I have setup some pre-processing pipelines using sklearn as follows:
from sklearn import preprocessing, naive_bayes, feature_extraction, pipeline, model_selection, compose,
text_columns = ['title', 'description']
time_columns = ['timestamp']
# get an 80-20 test-train split
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train['class'], test_size=0.2, random_state=RANDOM_STATE)
# convert the text data into vectors
text_pipeline = pipeline.Pipeline([
('vect', feature_extraction.text.CountVectorizer()),
('tfidf', feature_extraction.text.TfidfTransformer()),
])
# preprocess by scaling the data, and binning the data
time_pipeline = pipeline.Pipeline([
('scaler', preprocessing.StandardScaler()),
('bin', preprocessing.KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='quantile')),
])
# combine the pre-processors
preprocessor = compose.ColumnTransformer([
('text', text_pipeline, text_columns),
('time', time_pipeline, time_columns),
])
clf = pipeline.Pipeline([
('preprocessor', preprocessor),
('clf', naive_bayes.MultinomialNB()),
])
Here train is a pandas dataframe with the features and labels, read straight from a .csv file like this:
ID,title,description,timestamp,class
1,First Title String,"A description of the first title",13.0,Class1
2,Second Title String,"A description of the second title",17.5,Class2
Also note that I'm not setting most of the params for the transformers/classifiers, as I want to use a grid-search to find the optimum ones later on.
The problem
When I call clf.fit(X_train, y_train), I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_7500/3039541201.py in <module>
33
34 # x = pd.DataFrame(text_pipeline.fit_transform(X_train['mean_checkin_time']))
---> 35 x = clf.fit(X_train, y_train)
36 # # print the number of features
37
~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
388 """
389 fit_params_steps = self._check_fit_params(**fit_params)
--> 390 Xt = self._fit(X, y, **fit_params_steps)
391 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
392 if self._final_estimator != "passthrough":
~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
346 cloned_transformer = clone(transformer)
347 # Fit or load from cache the current transformer
--> 348 X, fitted_transformer = fit_transform_one_cached(
349 cloned_transformer,
350 X,
~/.local/lib/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
347
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
350
351 def call_and_shelve(self, *args, **kwargs):
~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
891 with _print_elapsed_time(message_clsname, message):
892 if hasattr(transformer, "fit_transform"):
--> 893 res = transformer.fit_transform(X, y, **fit_params)
894 else:
895 res = transformer.fit(X, y, **fit_params).transform(X)
~/.local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
697 self._record_output_indices(Xs)
698
--> 699 return self._hstack(list(Xs))
700
701 def transform(self, X):
~/.local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in _hstack(self, Xs)
789 else:
790 Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 791 return np.hstack(Xs)
792
793 def _sk_visual_block_(self):
<__array_function__ internals> in hstack(*args, **kwargs)
~/.local/lib/python3.9/site-packages/numpy/core/shape_base.py in hstack(tup)
344 return _nx.concatenate(arrs, 0)
345 else:
--> 346 return _nx.concatenate(arrs, 1)
347
348
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2 and the array at index 1 has size 3001
I have the following shapes for X_train and y_train:
X_train: (3001, 3)
y_train: (3001,)
Steps Taken
Individual Features
I can use the same pipelines with individual features (by altering the text_features and time_features arrays), and get a perfectly fine classifier. E.g. only using the "title" field, or only using the "timestamp". Unfortunately, these individual features are not accurate enough, so I would like to use all the features to build a more accurate classifier. The issue seems to be when I attempt to combine more than one feature.
I'm open to potentially using multiple Naive Bayes classifiers, and trying to multiply the probabilities together to get some overall probability, but I honestly have no clue how to do that, and I'm sure I'm just missing something simple here.
Dropping the Time Features
I have tried running only the text_features, i.e. "title" and "description", and I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_7500/1900884535.py in <module>
33
34 # x = pd.DataFrame(text_pipeline.fit_transform(X_train['mean_checkin_time']))
---> 35 x = clf.fit(X_train, y_train)
36 # # print the number of features
37
~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
392 if self._final_estimator != "passthrough":
393 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 394 self._final_estimator.fit(Xt, y, **fit_params_last_step)
395
396 return self
~/.local/lib/python3.9/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
661 Returns the instance itself.
662 """
--> 663 X, y = self._check_X_y(X, y)
664 _, n_features = X.shape
665
~/.local/lib/python3.9/site-packages/sklearn/naive_bayes.py in _check_X_y(self, X, y, reset)
521 def _check_X_y(self, X, y, reset=True):
522 """Validate X and y in fit methods."""
--> 523 return self._validate_data(X, y, accept_sparse="csr", reset=reset)
524
525 def _update_class_log_prior(self, class_prior=None):
~/.local/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
579 y = check_array(y, **check_y_params)
580 else:
--> 581 X, y = check_X_y(X, y, **check_params)
582 out = X, y
583
~/.local/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
979 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric)
980
--> 981 check_consistent_length(X, y)
982
983 return X, y
~/.local/lib/python3.9/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
330 uniques = np.unique(lengths)
331 if len(uniques) > 1:
--> 332 raise ValueError(
333 "Found input variables with inconsistent numbers of samples: %r"
334 % [int(l) for l in lengths]
ValueError: Found input variables with inconsistent numbers of samples: [2, 3001]
And I have the following shapes:
X_train: (3001, 2)
y_train: (3001,)
Reshaping the Labels
I have also tried reshaping y_train variable by calling it wrapped in [] like so:
# new
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train[['class']], test_size=0.2, random_state=RANDOM_STATE)
# previous
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train['class'], test_size=0.2, random_state=RANDOM_STATE)
so that the resultant shapes are:
X_train: (3001, 3)
y_train: (3001, 1)
But unfortunately this doesn't appear to fix this.
Removing Naive Bayes Classifier
When I remove the final step of the pipeline (the naivebayes.MultinomialNB()), and I remove the text_features ("timestamp" feature), then I can build a pre-processor that works just fine for the text. I.e. I can pre-process the text fields ("title", "description"), but when I add the classifier, I get the error above under "Dropping the Time Features".

When vectorizing multiple text features, you should create CountVectorizer (or TfidfVectorizer) instances for every feature:
title_pipeline = pipeline.Pipeline([
('vect', feature_extraction.text.CountVectorizer()),
('tfidf', feature_extraction.text.TfidfTransformer()),
])
description_pipeline = pipeline.Pipeline([
('vect', feature_extraction.text.CountVectorizer()),
('tfidf', feature_extraction.text.TfidfTransformer()),
])
preprocessor = compose.ColumnTransformer([
('title', title_pipeline, text_columns[0]),
('description', description_pipeline, text_columns[1]),
('time', time_pipeline, time_columns),
])
P.S. The combination of CountVectorizer and TfidfTransformer is equivalent to TfidfVectorizer. Also, you may just skip tf-idf weighting and use only CountVectorizer for MultinomialNB.

Receive a python exception when trying "LeaveOneGroupOut" from sklearn

I am new to Scikit-Learn package and am trying to use a LeaveOneGroupOut Cross-Validation for a simple classification task.
I used the following code, which I adopted based on the documentation at [this link] from the scikit-learn.org website:
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.model_selection import cross_val_score
from sklearn import svm
X = Selected_Dataset[:,:-1]
y = Selected_Labels
groups = Selected_SubjIDs
clf = svm.SVC(kernel='linear', C=1)
cv = LeaveOneGroupOut()
cv.get_n_splits(X, y, groups=groups)
cross_val_score(clf, X, y, cv=cv)
But this code generates the following exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-27b53a67db71> in <module>
14
15
---> 16 cross_val_score(clf, X, y, cv=cv)
17
18
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
340 n_jobs=n_jobs, verbose=verbose,
341 fit_params=fit_params,
--> 342 pre_dispatch=pre_dispatch)
343 return cv_results['test_score']
344
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score)
204 fit_params, return_train_score=return_train_score,
205 return_times=True)
--> 206 for train, test in cv.split(X, y, groups))
207
208 if return_train_score:
~/miniconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:
~/miniconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
618
619 with self._lock:
--> 620 tasks = BatchedCalls(itertools.islice(iterator, batch_size))
621 if len(tasks) == 0:
622 # No more tasks available in the iterator: tell caller to stop.
~/miniconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, iterator_slice)
125
126 def __init__(self, iterator_slice):
--> 127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in <genexpr>(.0)
200 pre_dispatch=pre_dispatch)
201 scores = parallel(
--> 202 delayed(_fit_and_score)(
203 clone(estimator), X, y, scorers, train, test, verbose, None,
204 fit_params, return_train_score=return_train_score,
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
93 X, y, groups = indexable(X, y, groups)
94 indices = np.arange(_num_samples(X))
---> 95 for test_index in self._iter_test_masks(X, y, groups):
96 train_index = indices[np.logical_not(test_index)]
97 test_index = indices[test_index]
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
822 def _iter_test_masks(self, X, y, groups):
823 if groups is None:
--> 824 raise ValueError("The 'groups' parameter should not be None.")
825 # We make a copy of groups to avoid side-effects during iteration
826 groups = check_array(groups, copy=True, ensure_2d=False, dtype=None)
ValueError: The 'groups' parameter should not be None.
I found these two related Bugs being reported in 2016, and 2017.
Is there any way around it?

You have to use
cross_val_score(clf, X, y, cv=cv, groups=groups)
and you can remove the get_n_splits.
Working example
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.model_selection import cross_val_score
from sklearn import svm
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import Normalizer
#load the data
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
groups = np.random.binomial(1,0.5,size=len(X))
clf = svm.SVC(kernel='linear', C=1)
cv = LeaveOneGroupOut()
cross_val_score(clf, X, y, cv=cv,groups=groups)

TypeError: 'TimeSeriesSplit' object is not iterable

I am carrying out a gridsearch for a SVR desigh which has a time series split. My code is:
from sklearn.svm import SVR
from sklearn.grid_search import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn import svm
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing as pre
X_feature = X_feature.reshape(-1, 1)
y_label = y_label.reshape(-1,1)
param = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
'C': [1, 10, 100, 1000]},
{'kernel': ['poly'], 'C': [1, 10, 100, 1000], 'degree': [1, 2, 3, 4]}]
reg = SVR(C=1)
timeseries_split = TimeSeriesSplit(n_splits=3)
clf = GridSearchCV(reg, param, cv=timeseries_split, scoring='neg_mean_squared_error')
X= pre.MinMaxScaler(feature_range=(0,1)).fit(X_feature)
scaled_X = X.transform(X_feature)
y = pre.MinMaxScaler(feature_range=(0,1)).fit(y_label)
scaled_y = y.transform(y_label)
clf.fit(scaled_X,scaled_y )
My data for scaled y is:
[0.11321139]
[0.07218848]
...
[0.64844211]
[0.4926122 ]
[0.4030334 ]]
And my data for scaled X is:
[[0.2681013 ]
[0.03454225]
[0.02062136]
...
[0.92857565]
[0.64930691]
[0.20325924]]
However, I am getting the error message
TypeError: 'TimeSeriesSplit' object is not iterable
My traeback error message is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-4403e696bf0d> in <module>()
19
20
---> 21 clf.fit(scaled_X,scaled_y )
~/anaconda3_501/lib/python3.6/site-packages/sklearn/grid_search.py in fit(self, X, y)
836
837 """
--> 838 return self._fit(X, y, ParameterGrid(self.param_grid))
839
840
~/anaconda3_501/lib/python3.6/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
572 self.fit_params, return_parameters=True,
573 error_score=self.error_score)
--> 574 for parameters in parameter_iterable
575 for train, test in cv)
576
~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:
~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
618
619 with self._lock:
--> 620 tasks = BatchedCalls(itertools.islice(iterator, batch_size))
621 if len(tasks) == 0:
622 # No more tasks available in the iterator: tell caller to stop.
~/anaconda3_501/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, iterator_slice)
125
126 def __init__(self, iterator_slice):
--> 127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
~/anaconda3_501/lib/python3.6/site-packages/sklearn/grid_search.py in <genexpr>(.0)
573 error_score=self.error_score)
574 for parameters in parameter_iterable
--> 575 for train, test in cv)
576
577 # Out is a list of triplet: score, estimator, n_test_samples
TypeError: 'TimeSeriesSplit' object is not iterable
Im not sure why this could be, I suspect this is happening when I am fitting in the last line. Help with this would be appreciated.

First thing, you are using incompatible packages. grid_search is old version which is now deprecated and does not work with model_selection.
In place of:
from sklearn.grid_search import GridSearchCV
Do this:
from sklearn.model_selection import GridSearchCV
Secondly, You only need to send TimeSeriesSplit(n_splits=3) to the cv param. Like this:
timeseries_split = TimeSeriesSplit(n_splits=3)
clf = GridSearchCV(reg, param, cv=timeseries_split, scoring='neg_mean_squared_error')
No need to call split(). It will be internally called by grid_search.

Length of generators can not be found, they don't contain complete information to find length, these maintain only current state. In your grid_search.py file line 579, it is trying to find length of generator. you need to convert them to iterators to find the length, so you can do:
n_folds = list(n_folds)
before you do:
n_folds = len(cv)
if you want to keep it as generator refer:
How to len(generator())

sklearn 0.14.1 RBM dies on NaN or Inf where there is none

I'm borrowing an idea here from the documentation to use RBMs + Logistic regression for classification.
However I'm getting an error that should not be thrown since all entries in my data matrix are numerical.
Code:
from sklearn import preprocessing, cross_validation
from scipy.ndimage import convolve
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn import linear_model, datasets, metrics
import numpy as np
# create fake dataset
data, labels = datasets.make_classification(n_samples=250000)
data = preprocessing.scale(data)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, labels, test_size=0.7, random_state=0)
# print details
print X_train.shape, X_test.shape, y_train.shape, y_test.shape
print np.max(X_train)
print np.min(X_train)
print np.mean(X_train, axis=0)
print np.std(X_train, axis=0)
if np.sum(np.isnan(X_train)) or np.sum(np.isnan(X_test)):
print "NaN found!"
if np.sum(np.isnan(y_train)) or np.sum(np.isnan(y_test)):
print "NaN found!"
if np.sum(np.isinf(X_train)) or np.sum(np.isinf(X_test)):
print "Inf found!"
if np.sum(np.isinf(y_train)) or np.sum(np.isinf(y_test)):
print "Inf found!"
# train and test
logistic = linear_model.LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
# Training RBM-Logistic Pipeline
classifier.fit(X_train, y_train)
# Training Logistic regression
logistic_classifier = linear_model.LogisticRegression(C=100.0)
logistic_classifier.fit(X_train, y_train)
print("Logistic regression using RBM features:\n%s\n" % (
metrics.classification_report(
y_test,
classifier.predict(X_test))))
Ouput:
(73517, 3) (171540, 3) (73517,) (171540,)
2.0871168057
-2.21062647188
[-0.00237028 -0.00104526 0.00330683]
[ 0.99907225 0.99977328 1.00225843]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
173 else:
174 filename = fname
--> 175 __builtin__.execfile(filename, *where)
/home/test.py in <module>()
75
76 # Training RBM-Logistic Pipeline
---> 77 classifier.fit(X_train, y_train)
78
79 # Training Logistic regression
/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
128 data, then fit the transformed data using the final estimator.
129 """
--> 130 Xt, fit_params = self._pre_transform(X, y, **fit_params)
131 self.steps[-1][-1].fit(Xt, y, **fit_params)
132 return self
/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
118 for name, transform in self.steps[:-1]:
119 if hasattr(transform, "fit_transform"):
--> 120 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
121 else:
122 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
/usr/local/lib/python2.7/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
409 else:
410 # fit method of arity 2 (supervised transformation)
--> 411 return self.fit(X, y, **fit_params).transform(X)
412
413
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in fit(self, X, y)
304
305 for batch_slice in batch_slices:
--> 306 pl_batch = self._fit(X[batch_slice], rng)
307
308 if verbose:
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in _fit(self, v_pos, rng)
245
246 if self.verbose:
--> 247 return self.score_samples(v_pos)
248
249 def score_samples(self, v):
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in score_samples(self, v)
268 fe_ = self._free_energy(v_)
269
--> 270 return v.shape[1] * logistic_sigmoid(fe_ - fe, log=True)
271
272 def fit(self, X, y=None):
/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.pyc in logistic_sigmoid(X, log, out)
498 """
499 is_1d = X.ndim == 1
--> 500 X = array2d(X, dtype=np.float)
501
502 n_samples, n_features = X.shape
/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.pyc in array2d(X, dtype, order, copy, force_all_finite)
91 X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
92 if force_all_finite:
---> 93 _assert_all_finite(X_2d)
94 if X is X_2d and copy:
95 X_2d = safe_copy(X_2d)
/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
25 if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
26 and not np.isfinite(X).all()):
---> 27 raise ValueError("Array contains NaN or infinity.")
28
29
ValueError: Array contains NaN or infinity.
There are no infs or nans in the data matrix...what could be causing this behaviour?
EDIT: Apparently I'm not the only one.

This looks like a numerical stability bug in RBMs. Can you please open a github issue with your script in it?
Edit: by the way if you are interested you can try to find the source of the issue by adding np.isfinite() checks in the inner loops of the _fit method of the BernoulliRBM class.

This issue is usually caused by two factors. Incorrect initial scaling of the data. Firstly the input data needs to be bound between 0 and 1. Remember RBM's were originally designed for binary data only. Secondly the learning rates could be too high. Defaults for RBM code are often based on the MNIST digit recognition dataset which can handle larger learning rates.
So I would trust sklearn's implementation, but not the stability of the algorithm for a new dataset based on default values that don't fit with the current dataset. Adding checks for infinity wont help you will still need to tweak the learning rates.
This is why deep learning is said to be a bit of art, you probably also need to play around with the number of gibs samples, size of minibatch and amount of momentum. Dont give up though, the rewards are mostly worth it. Further reading

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Use LeaveOneGroupOut strategy on cross_val_score in sklearn - python

Related

Error in finding optimal n_estimators via xgboost

Sklearn Naive Bayes with multiple features

Receive a python exception when trying "LeaveOneGroupOut" from sklearn

TypeError: 'TimeSeriesSplit' object is not iterable

sklearn 0.14.1 RBM dies on NaN or Inf where there is none

Categories

Resources