Sklearn Naive Bayes with multiple features - python

Background
I'm struggling to implement a Naive Bayes classifier in python with sklearn across multiple features.
The features I have are:
Title - some short text
Description - some longer text
Timestamp - a float representing an hour of the day (e.g. 18.0 = 6:00PM, 11.5 = 11:30AM)
The labels/classes are categorical strings: e.g. "Class1", "Class2", "Class3"
Aim
My goal is to use all 3 features together to build a Naive Bayes classifier that predicts the class label. I specifically wish to use all of the features at the same time, i.e. not simply the description feature.
Initial Approach
I have set up some pre-processing pipelines using sklearn as follows:
from sklearn import preprocessing, naive_bayes, feature_extraction, pipeline, model_selection, compose
text_columns = ['title', 'description']
time_columns = ['timestamp']
# get an 80-20 test-train split
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train['class'], test_size=0.2, random_state=RANDOM_STATE)
# convert the text data into vectors
text_pipeline = pipeline.Pipeline([
('vect', feature_extraction.text.CountVectorizer()),
('tfidf', feature_extraction.text.TfidfTransformer()),
])
# preprocess by scaling the data, and binning the data
time_pipeline = pipeline.Pipeline([
('scaler', preprocessing.StandardScaler()),
('bin', preprocessing.KBinsDiscretizer(n_bins=6, encode='ordinal', strategy='quantile')),
])
# combine the pre-processors
preprocessor = compose.ColumnTransformer([
('text', text_pipeline, text_columns),
('time', time_pipeline, time_columns),
])
clf = pipeline.Pipeline([
('preprocessor', preprocessor),
('clf', naive_bayes.MultinomialNB()),
])
Here train is a pandas dataframe with the features and labels, read straight from a .csv file like this:
ID,title,description,timestamp,class
1,First Title String,"A description of the first title",13.0,Class1
2,Second Title String,"A description of the second title",17.5,Class2
Also note that I'm not setting most of the params for the transformers/classifiers, as I want to use a grid-search to find the optimum ones later on.
The problem
When I call clf.fit(X_train, y_train), I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_7500/3039541201.py in <module>
33
34 # x = pd.DataFrame(text_pipeline.fit_transform(X_train['mean_checkin_time']))
---> 35 x = clf.fit(X_train, y_train)
36 # # print the number of features
37
~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
388 """
389 fit_params_steps = self._check_fit_params(**fit_params)
--> 390 Xt = self._fit(X, y, **fit_params_steps)
391 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
392 if self._final_estimator != "passthrough":
~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
346 cloned_transformer = clone(transformer)
347 # Fit or load from cache the current transformer
--> 348 X, fitted_transformer = fit_transform_one_cached(
349 cloned_transformer,
350 X,
~/.local/lib/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
347
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
350
351 def call_and_shelve(self, *args, **kwargs):
~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
891 with _print_elapsed_time(message_clsname, message):
892 if hasattr(transformer, "fit_transform"):
--> 893 res = transformer.fit_transform(X, y, **fit_params)
894 else:
895 res = transformer.fit(X, y, **fit_params).transform(X)
~/.local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
697 self._record_output_indices(Xs)
698
--> 699 return self._hstack(list(Xs))
700
701 def transform(self, X):
~/.local/lib/python3.9/site-packages/sklearn/compose/_column_transformer.py in _hstack(self, Xs)
789 else:
790 Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
--> 791 return np.hstack(Xs)
792
793 def _sk_visual_block_(self):
<__array_function__ internals> in hstack(*args, **kwargs)
~/.local/lib/python3.9/site-packages/numpy/core/shape_base.py in hstack(tup)
344 return _nx.concatenate(arrs, 0)
345 else:
--> 346 return _nx.concatenate(arrs, 1)
347
348
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 2 and the array at index 1 has size 3001
I have the following shapes for X_train and y_train:
X_train: (3001, 3)
y_train: (3001,)
Steps Taken
Individual Features
I can use the same pipelines with individual features (by altering the text_columns and time_columns lists), and get a perfectly fine classifier, e.g. using only the "title" field or only the "timestamp". Unfortunately, these individual features are not accurate enough, so I would like to use all the features to build a more accurate classifier. The issue seems to arise when I attempt to combine more than one feature.
I'm open to potentially using multiple Naive Bayes classifiers, and trying to multiply the probabilities together to get some overall probability, but I honestly have no clue how to do that, and I'm sure I'm just missing something simple here.
Dropping the Time Features
I have tried running only the text_features, i.e. "title" and "description", and I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_7500/1900884535.py in <module>
33
34 # x = pd.DataFrame(text_pipeline.fit_transform(X_train['mean_checkin_time']))
---> 35 x = clf.fit(X_train, y_train)
36 # # print the number of features
37
~/.local/lib/python3.9/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
392 if self._final_estimator != "passthrough":
393 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 394 self._final_estimator.fit(Xt, y, **fit_params_last_step)
395
396 return self
~/.local/lib/python3.9/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
661 Returns the instance itself.
662 """
--> 663 X, y = self._check_X_y(X, y)
664 _, n_features = X.shape
665
~/.local/lib/python3.9/site-packages/sklearn/naive_bayes.py in _check_X_y(self, X, y, reset)
521 def _check_X_y(self, X, y, reset=True):
522 """Validate X and y in fit methods."""
--> 523 return self._validate_data(X, y, accept_sparse="csr", reset=reset)
524
525 def _update_class_log_prior(self, class_prior=None):
~/.local/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
579 y = check_array(y, **check_y_params)
580 else:
--> 581 X, y = check_X_y(X, y, **check_params)
582 out = X, y
583
~/.local/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
979 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric)
980
--> 981 check_consistent_length(X, y)
982
983 return X, y
~/.local/lib/python3.9/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
330 uniques = np.unique(lengths)
331 if len(uniques) > 1:
--> 332 raise ValueError(
333 "Found input variables with inconsistent numbers of samples: %r"
334 % [int(l) for l in lengths]
ValueError: Found input variables with inconsistent numbers of samples: [2, 3001]
And I have the following shapes:
X_train: (3001, 2)
y_train: (3001,)
Reshaping the Labels
I have also tried reshaping y_train by selecting the label column with double brackets, like so:
# new
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train[['class']], test_size=0.2, random_state=RANDOM_STATE)
# previous
X_train, X_test, y_train, y_test = model_selection.train_test_split(train[text_columns + time_columns], train['class'], test_size=0.2, random_state=RANDOM_STATE)
so that the resultant shapes are:
X_train: (3001, 3)
y_train: (3001, 1)
But unfortunately this doesn't fix the error.
Removing Naive Bayes Classifier
When I remove the final step of the pipeline (the naive_bayes.MultinomialNB()) and also remove the time feature ("timestamp"), I can build a pre-processor that works just fine for the text. I.e. I can pre-process the text fields ("title", "description"), but when I add the classifier, I get the error above under "Dropping the Time Features".

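When vectorizing multiple text features, you should create a separate CountVectorizer (or TfidfVectorizer) instance for every feature, and select each column by its name as a plain string rather than as a list. A vectorizer expects a one-dimensional sequence of strings; given a two-column DataFrame it iterates over the column names instead of the rows, which is why the errors above report only 2 samples: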
title_pipeline = pipeline.Pipeline([
('vect', feature_extraction.text.CountVectorizer()),
('tfidf', feature_extraction.text.TfidfTransformer()),
])
description_pipeline = pipeline.Pipeline([
('vect', feature_extraction.text.CountVectorizer()),
('tfidf', feature_extraction.text.TfidfTransformer()),
])
preprocessor = compose.ColumnTransformer([
('title', title_pipeline, text_columns[0]),
('description', description_pipeline, text_columns[1]),
('time', time_pipeline, time_columns),
])
P.S. The combination of CountVectorizer and TfidfTransformer is equivalent to TfidfVectorizer. Also, you may just skip tf-idf weighting and use only CountVectorizer for MultinomialNB.
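To see why one vectorizer per column matters, here is a minimal sketch on a made-up six-row frame. The data and the n_bins=3 setting are assumptions for illustration and are not taken from the question. It first shows that a CountVectorizer handed a two-column DataFrame yields only two rows, then fits the per-column preprocessor together with MultinomialNB:

```python
import pandas as pd
from sklearn import compose, naive_bayes, pipeline, preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# Toy data standing in for the asker's CSV (hypothetical values).
train = pd.DataFrame({
    "title": ["red apple", "green pear", "fast car",
              "slow train", "sweet plum", "old bus"],
    "description": ["a fruit that is red", "a fruit that is green",
                    "a vehicle that is fast", "a vehicle that is slow",
                    "a fruit that is sweet", "a vehicle that is old"],
    "timestamp": [8.0, 9.5, 17.0, 18.5, 10.0, 19.0],
    "class": ["Class1", "Class1", "Class2", "Class2", "Class1", "Class2"],
})
X, y = train[["title", "description", "timestamp"]], train["class"]

# A 2-D frame is iterated column name by column name: 2 "documents".
wrong = CountVectorizer().fit_transform(X[["title", "description"]])
# A 1-D Series is iterated row by row: one document per sample.
right = CountVectorizer().fit_transform(X["title"])
print(wrong.shape[0], right.shape[0])

time_pipeline = pipeline.Pipeline([
    ("scaler", preprocessing.StandardScaler()),
    ("bin", preprocessing.KBinsDiscretizer(
        n_bins=3, encode="ordinal", strategy="quantile")),
])
preprocessor = compose.ColumnTransformer([
    # A string selector passes a 1-D Series to the vectorizer.
    ("title", CountVectorizer(), "title"),
    ("description", CountVectorizer(), "description"),
    # A list selector passes a 2-D frame, which the scaler needs.
    ("time", time_pipeline, ["timestamp"]),
])
clf = pipeline.Pipeline([
    ("preprocessor", preprocessor),
    ("clf", naive_bayes.MultinomialNB()),
])
clf.fit(X, y)
predictions = clf.predict(X)
```

Note the asymmetry in the column selectors: text columns are given as strings, while the numeric column stays in a list because scalers and discretizers expect 2-D input.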

Related

AttributeError while implementing FAMD with SMOTENC in an imblearn pipeline

I'm trying to implement a pipeline with FAMD, SMOTENC, and other preprocessing steps. However, it raises an error each time. If I remove FAMD from the pipeline, it works fine.
My code:
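import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
#Separate the dataset into two parts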
num_df= X_train_new.select_dtypes(include=[np.number]).columns
cat_df= X_train_new.select_dtypes(exclude=[np.number]).columns
#Create a mask for categorical features
categorical_feature_mask = X_train_new.dtypes == object
print(categorical_feature_mask)
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
#Create a pipeline to automate the preprocessing steps and SMOTENC together
num_pipe = make_pipeline(SimpleImputer(strategy='median'))
cat_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
OneHotEncoder(handle_unknown='ignore'))
transformer= make_column_transformer((num_pipe, selector(dtype_include='number')),
(cat_pipe, selector(dtype_include='object')),n_jobs=2)
#Oversampling with SMOTENC
from imblearn.over_sampling import SMOTENC
smote= SMOTENC(categorical_features=categorical_feature_mask,random_state=99)
!pip install prince
from prince import FAMD
famd=FAMD(n_components=4,random_state=99)
from imblearn.pipeline import make_pipeline as imb_pipeline
#Fit the random forest learner
rf=RandomForestClassifier(n_estimators=300,random_state=99)
pipe=imb_pipeline(transformer,smote,famd,rf)
pipe.fit(X_train_new,y_train_new)
print('Training Accuracy:%s'%pipe.score(X_train_new,y_train_new))
The error:
AttributeError Traceback (most recent call last)
<ipython-input-24-2b7ea084a318> in <module>()
3 rf=RandomForestClassifier(n_estimators=300,max_features=3,criterion='entropy',random_state=99)
4 pipe=imb_pipeline(transformer,smote,famd,rf)
----> 5 pipe.fit(X_train_new,y_train_new)
6 print('Training Accuracy:%s'%pipe.score(X_train_new,y_train_new))
6 frames
/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in fit(self, X, y, **fit_params)
235
236 """
--> 237 Xt, yt, fit_params = self._fit(X, y, **fit_params)
238 if self._final_estimator is not None:
239 self._final_estimator.fit(Xt, yt, **fit_params)
/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit(self, X, y, **fit_params)
195 Xt, fitted_transformer = fit_transform_one_cached(
196 cloned_transformer, None, Xt, yt,
--> 197 **fit_params_steps[name])
198 elif hasattr(cloned_transformer, "fit_resample"):
199 Xt, yt, fitted_transformer = fit_resample_one_cached(
/usr/local/lib/python3.7/dist-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py in _fit_transform_one(transformer, weight, X, y, **fit_params)
564 def _fit_transform_one(transformer, weight, X, y, **fit_params):
565 if hasattr(transformer, 'fit_transform'):
--> 566 res = transformer.fit_transform(X, y, **fit_params)
567 else:
568 res = transformer.fit(X, y, **fit_params).transform(X)
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
572 else:
573 # fit method of arity 2 (supervised transformation)
--> 574 return self.fit(X, y, **fit_params).transform(X)
575
576
/usr/local/lib/python3.7/dist-packages/prince/famd.py in fit(self, X, y)
27
28 # Separate numerical columns from categorical columns
---> 29 num_cols = X.select_dtypes(np.number).columns.tolist()
30 cat_cols = list(set(X.columns) - set(num_cols))
31
/usr/local/lib/python3.7/dist-packages/scipy/sparse/base.py in __getattr__(self, attr)
689 return self.getnnz()
690 else:
--> 691 raise AttributeError(attr + " not found")
692
693 def transpose(self, axes=None, copy=False):
AttributeError: select_dtypes not found
tl;dr: try adding sparse=False to your OneHotEncoder (the parameter is named sparse_output in scikit-learn 1.2 and later). Consider raising an issue with prince to handle sparse inputs.
You can see from the traceback that the problem is that FAMD.fit calls X.select_dtypes to separate categorical and numeric data. select_dtypes is a pandas function, so normally I would assume that prince is written to operate on dataframes and not on the numpy arrays that sklearn uses internally (after converting from frames if necessary). However, looking at the source, a few lines above that one they do convert from a numpy array to a dataframe. The last trace message, though, is from scipy, which hints that your X may actually be a sparse matrix. And indeed OneHotEncoder (earlier in your pipeline) outputs sparse matrices by default, and ColumnTransformer decides whether to return sparse or dense output depending on its component parts and the parameter sparse_threshold.
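The sparse-versus-dense behaviour can be checked directly. The following is a small sketch on a made-up frame (the column names "age" and "city" and the helper build are hypothetical). It uses sparse_threshold=0 on the ColumnTransformer as an alternative way to force dense output, and it leaves prince and SMOTENC out entirely:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame (hypothetical): one numeric and one high-cardinality
# categorical column, so the one-hot block is mostly zeros.
X = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 38.0, 29.0, 50.0, 45.0, 33.0, 27.0],
    "city": pd.Series(list("abcdefghij"), dtype=object),
})

num_pipe = make_pipeline(SimpleImputer(strategy="median"))
cat_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                         OneHotEncoder(handle_unknown="ignore"))


def build(sparse_threshold):
    """Build the same transformer with a given sparse_threshold."""
    return make_column_transformer(
        (num_pipe, ["age"]),
        (cat_pipe, ["city"]),
        sparse_threshold=sparse_threshold,
    )


# 0.3 is the default: a low-density result is stacked as a sparse matrix.
Xt_default = build(0.3).fit_transform(X)
# 0 always returns a dense numpy array.
Xt_dense = build(0).fit_transform(X)

print(sparse.issparse(Xt_default), type(Xt_dense).__name__)
```

A dense array can be wrapped in a DataFrame by a downstream step, whereas the sparse matrix has no select_dtypes attribute, which matches the last frame of the traceback.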

Receive a python exception when trying "LeaveOneGroupOut" from sklearn

I am new to the scikit-learn package and am trying to use LeaveOneGroupOut cross-validation for a simple classification task.
I used the following code, which I adapted from the documentation at [this link] on the scikit-learn.org website:
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.model_selection import cross_val_score
from sklearn import svm
X = Selected_Dataset[:,:-1]
y = Selected_Labels
groups = Selected_SubjIDs
clf = svm.SVC(kernel='linear', C=1)
cv = LeaveOneGroupOut()
cv.get_n_splits(X, y, groups=groups)
cross_val_score(clf, X, y, cv=cv)
But this code generates the following exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-27b53a67db71> in <module>
14
15
---> 16 cross_val_score(clf, X, y, cv=cv)
17
18
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
340 n_jobs=n_jobs, verbose=verbose,
341 fit_params=fit_params,
--> 342 pre_dispatch=pre_dispatch)
343 return cv_results['test_score']
344
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score)
204 fit_params, return_train_score=return_train_score,
205 return_times=True)
--> 206 for train, test in cv.split(X, y, groups))
207
208 if return_train_score:
~/miniconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
777 # was dispatched. In particular this covers the edge
778 # case of Parallel used with an exhausted iterator.
--> 779 while self.dispatch_one_batch(iterator):
780 self._iterating = True
781 else:
~/miniconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
618
619 with self._lock:
--> 620 tasks = BatchedCalls(itertools.islice(iterator, batch_size))
621 if len(tasks) == 0:
622 # No more tasks available in the iterator: tell caller to stop.
~/miniconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __init__(self, iterator_slice)
125
126 def __init__(self, iterator_slice):
--> 127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in <genexpr>(.0)
200 pre_dispatch=pre_dispatch)
201 scores = parallel(
--> 202 delayed(_fit_and_score)(
203 clone(estimator), X, y, scorers, train, test, verbose, None,
204 fit_params, return_train_score=return_train_score,
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
93 X, y, groups = indexable(X, y, groups)
94 indices = np.arange(_num_samples(X))
---> 95 for test_index in self._iter_test_masks(X, y, groups):
96 train_index = indices[np.logical_not(test_index)]
97 test_index = indices[test_index]
~/miniconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
822 def _iter_test_masks(self, X, y, groups):
823 if groups is None:
--> 824 raise ValueError("The 'groups' parameter should not be None.")
825 # We make a copy of groups to avoid side-effects during iteration
826 groups = check_array(groups, copy=True, ensure_2d=False, dtype=None)
ValueError: The 'groups' parameter should not be None.
I found two related bugs that were reported in 2016 and 2017.
Is there any way around it?
You have to pass the groups to cross_val_score as well:
cross_val_score(clf, X, y, cv=cv, groups=groups)
and you can remove the get_n_splits call.
Working example
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.model_selection import cross_val_score
from sklearn import svm
import numpy as np
from sklearn.datasets import load_breast_cancer
# load the data
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
# assign each sample to one of two groups at random (for illustration only)
groups = np.random.binomial(1,0.5,size=len(X))
clf = svm.SVC(kernel='linear', C=1)
cv = LeaveOneGroupOut()
cross_val_score(clf, X, y, cv=cv,groups=groups)

FeatureUnion , pipeline categorical features with tfidf features throwing error

I am trying to concatenate tf-idf features with other categorical features and perform classification on the resulting dataset. From various blogs I understand that FeatureUnion can be used to concatenate the features, which can then be fed through a pipeline to the algorithm (in my case Naive Bayes).
I have followed the code from this link - http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html
When I try to execute the code, it gives this error:
TypeError: no supported conversion for types: (dtype('O'),)
Below is the code which I am trying to execute:
class textdata():
    def transform(self, X, Y):
        return X[desc]
    def fit(self, X, Y):
        return self
class one_hot_trans():
    def transform(self, X, Y):
        X = pd.get_dummies(X, columns=obj_cols)
        return X
    def fit(self, X, Y):
        return self
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('text', textdata()),
            ('tf_idf', TfidfTransformer())
        ])),
        ('one_hot', one_hot_trans())
    ])),
    ('classifier', MultinomialNB())
])
d_train, d_test, y_train, y_test = train_test_split(data, data[target], test_size=0.2, random_state = 2018)
pipeline.fit(d_train, y_train)
Can anyone help me resolve this error?
Note: data has 9 columns in total: 1 target variable (categorical), 1 text column (on which I want to perform tf-idf), and the rest are categorical (obj_cols in the code above).
Edit:
Thanks Vivek. I did not notice that. I had mistakenly used TfidfTransformer instead of TfidfVectorizer. Even after replacing it, I am getting the error below.
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, weight, X, y, **fit_params)
579 **fit_params):
580 if hasattr(transformer, 'fit_transform'):
--> 581 res = transformer.fit_transform(X, y, **fit_params)
582 else:
583 res = transformer.fit(X, y, **fit_params).transform(X)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
745 self._update_transformer_list(transformers)
746 if any(sparse.issparse(f) for f in Xs):
--> 747 Xs = sparse.hstack(Xs).tocsr()
748 else:
749 Xs = np.hstack(Xs)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
462
463 """
--> 464 return bmat([blocks], format=format, dtype=dtype)
465
466
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
598 if dtype is None:
599 all_dtypes = [blk.dtype for blk in blocks[block_mask]]
--> 600 dtype = upcast(*all_dtypes) if all_dtypes else None
601
602 row_offsets = np.append(0, np.cumsum(brow_lengths))
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\sputils.py in upcast(*args)
50 return t
51
---> 52 raise TypeError('no supported conversion for types: %r' % (args,))
53
54
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
Edit::
I have checked the unique values of all the categorical variables (everything except the description column) and found no values in the test data that are missing from the training data. Am I doing something wrong?
for col in d_train.columns.drop(desc):
    ext = set(d_test[col].unique().tolist()) - set(d_train[col].unique().tolist())
    if ext: print ("extra columns: \n\n", ext)
Edit2::
Additional info: details of the d_train and d_test features mentioned above. Can anyone help? I am still getting a "dimension mismatch" error in the predict method.
obj cols:: ['priority', 'ticket_type', 'created_group', 'Classification', 'Component', 'ATR_OWNER_PLANT', 'created_day']
d_train cols:: Index(['priority', 'ticket_type', 'created_group', 'Description_ticket', 'Classification', 'Component', 'ATR_OWNER_PLANT', 'created_day'], dtype='object')
d_test cols:: Index(['priority', 'ticket_type', 'created_group', 'Description_ticket','Classification', 'Component', 'ATR_OWNER_PLANT', 'created_day'], dtype='object')
d_train shape:: (95080, 8)
d_test shape:: (23770, 8)
desc:: Description_ticket
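I think you are also passing the text column through one_hot_trans. get_dummies leaves it untouched as an object column, which is where the dtype('O') in the error comes from.
Try dropping it before encoding, so that one_hot_trans returns only the dummy-encoded categorical columns:
class one_hot_trans():
    def transform(self, X, Y):
        X = pd.get_dummies(X.drop(desc, axis=1), columns=obj_cols)
        return X
    def fit(self, X, Y):
        return self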
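As an alternative to the custom classes, here is a hedged sketch that swaps in ColumnTransformer with TfidfVectorizer and OneHotEncoder, which is not what the question used. pd.get_dummies builds its columns from whatever values are present in the frame it is given, so train and test can end up with different widths; an encoder fitted once on the training data keeps the width fixed. The toy tickets below and the shortened obj_cols list are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

desc = "Description_ticket"
obj_cols = ["priority", "ticket_type"]  # hypothetical, shortened subset

d_train = pd.DataFrame({
    desc: ["printer is broken", "cannot log in",
           "printer out of paper", "password reset needed"],
    "priority": ["high", "low", "low", "high"],
    "ticket_type": ["hardware", "access", "hardware", "access"],
})
y_train = ["hw", "acc", "hw", "acc"]

d_test = pd.DataFrame({
    desc: ["printer jammed", "forgot password"],
    "priority": ["medium", "high"],  # "medium" never appears in training
    "ticket_type": ["hardware", "access"],
})

features = ColumnTransformer([
    # String selector: the vectorizer receives a 1-D Series of documents.
    ("tf_idf", TfidfVectorizer(), desc),
    # Unknown categories at predict time are encoded as all zeros.
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), obj_cols),
])
model = Pipeline([
    ("features", features),
    ("classifier", MultinomialNB()),
])
model.fit(d_train, y_train)
predictions = model.predict(d_test)

# Same number of columns for train and test, by construction.
n_train_cols = model.named_steps["features"].transform(d_train).shape[1]
n_test_cols = model.named_steps["features"].transform(d_test).shape[1]
```

Because both blocks are numeric, the stacked matrix no longer contains an object column, and the feature count seen by predict matches the one seen by fit.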

Use LeaveOneGroupOut strategy on cross_val_score in sklearn

I'd like to use the LeaveOneGroupOut strategy to evaluate my model. According to sklearn's documentation, cross_val_score seems convenient.
However, the following code does not work.
import sklearn
from sklearn import datasets
iris = datasets.load_iris()
from sklearn.model_selection import cross_val_score, LeaveOneGroupOut
clf = sklearn.svm.SVC(kernel='linear', C=1)
# cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0) # => this works
cv = LeaveOneGroupOut() # => this does not work
scores = cross_val_score(clf, iris.data, iris.target, cv=cv)
The error message is:
ValueError Traceback (most recent call last)
<ipython-input-40-435a3a7fa16c> in <module>()
4 from sklearn.model_selection import cross_val_score
5 clf = sklearn.svm.SVC(kernel='linear', C=1)
----> 6 scores = cross_val_score(clf, iris.data, iris.target, cv=LeaveOneGroupOut())
7 scores
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch)
138 train, test, verbose, None,
139 fit_params)
--> 140 for train, test in cv.split(X, y, groups))
141 return np.array(scores)[:, 0]
142
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in dispatch_one_batch(self, iterator)
601
602 with self._lock:
--> 603 tasks = BatchedCalls(itertools.islice(iterator, batch_size))
604 if len(tasks) == 0:
605 # No more tasks available in the iterator: tell caller to stop.
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __init__(self, iterator_slice)
125
126 def __init__(self, iterator_slice):
--> 127 self.items = list(iterator_slice)
128 self._size = len(self.items)
129
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/model_selection/_validation.pyc in <genexpr>(***failed resolving arguments***)
135 parallel = Parallel(n_jobs=n_jobs, verbose=verbose,
136 pre_dispatch=pre_dispatch)
--> 137 scores = parallel(delayed(_fit_and_score)(clone(estimator), X, y, scorer,
138 train, test, verbose, None,
139 fit_params)
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in split(self, X, y, groups)
88 X, y, groups = indexable(X, y, groups)
89 indices = np.arange(_num_samples(X))
---> 90 for test_index in self._iter_test_masks(X, y, groups):
91 train_index = indices[np.logical_not(test_index)]
92 test_index = indices[test_index]
/Users/xxx/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/sklearn/model_selection/_split.pyc in _iter_test_masks(self, X, y, groups)
770 def _iter_test_masks(self, X, y, groups):
771 if groups is None:
--> 772 raise ValueError("The groups parameter should not be None")
773 # We make a copy of groups to avoid side-effects during iteration
774 groups = np.array(groups, copy=True)
ValueError: The groups parameter should not be None
You do not define the groups parameter, which holds the group label of each sample and determines how the data is split.
The error comes from cross_val_score, which takes groups as an argument; in your case it is left at its default, None.
Try following the example below:
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 1, 2])
groups = np.array([1, 1, 2, 2])
lol = LeaveOneGroupOut()
You have:
[In] lol.get_n_splits(X, y, groups)
[Out] 2
Then you will be able to use:
lol.split(X, y, groups)
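Applied to the iris code from the question, a minimal sketch could look like this. The dataset has no natural grouping, so the groups array below is an arbitrary assumption (five interleaved groups) made up purely to have something to pass:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

iris = datasets.load_iris()

# Hypothetical group labels: sample i belongs to group i % 5.
groups = np.arange(len(iris.target)) % 5

clf = svm.SVC(kernel="linear", C=1)
cv = LeaveOneGroupOut()

# groups is forwarded to cv.split, so each fold holds out one group.
scores = cross_val_score(clf, iris.data, iris.target, groups=groups, cv=cv)
n_splits = cv.get_n_splits(groups=groups)
print(n_splits, scores.shape)
```

One score is returned per distinct group label, so real group identifiers (subjects, sessions, sites) should be used in place of the made-up ones.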

sklearn 0.14.1 RBM dies on NaN or Inf where there is none

I'm borrowing an idea here from the documentation to use RBMs + Logistic regression for classification.
However I'm getting an error that should not be thrown since all entries in my data matrix are numerical.
Code:
from sklearn import preprocessing, cross_validation
from scipy.ndimage import convolve
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn import linear_model, datasets, metrics
import numpy as np
# create fake dataset
data, labels = datasets.make_classification(n_samples=250000)
data = preprocessing.scale(data)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, labels, test_size=0.7, random_state=0)
# print details
print X_train.shape, X_test.shape, y_train.shape, y_test.shape
print np.max(X_train)
print np.min(X_train)
print np.mean(X_train, axis=0)
print np.std(X_train, axis=0)
if np.sum(np.isnan(X_train)) or np.sum(np.isnan(X_test)):
    print "NaN found!"
if np.sum(np.isnan(y_train)) or np.sum(np.isnan(y_test)):
    print "NaN found!"
if np.sum(np.isinf(X_train)) or np.sum(np.isinf(X_test)):
    print "Inf found!"
if np.sum(np.isinf(y_train)) or np.sum(np.isinf(y_test)):
    print "Inf found!"
# train and test
logistic = linear_model.LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
# Training RBM-Logistic Pipeline
classifier.fit(X_train, y_train)
# Training Logistic regression
logistic_classifier = linear_model.LogisticRegression(C=100.0)
logistic_classifier.fit(X_train, y_train)
print("Logistic regression using RBM features:\n%s\n" % (
metrics.classification_report(
y_test,
classifier.predict(X_test))))
Output:
(73517, 3) (171540, 3) (73517,) (171540,)
2.0871168057
-2.21062647188
[-0.00237028 -0.00104526 0.00330683]
[ 0.99907225 0.99977328 1.00225843]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
173 else:
174 filename = fname
--> 175 __builtin__.execfile(filename, *where)
/home/test.py in <module>()
75
76 # Training RBM-Logistic Pipeline
---> 77 classifier.fit(X_train, y_train)
78
79 # Training Logistic regression
/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
128 data, then fit the transformed data using the final estimator.
129 """
--> 130 Xt, fit_params = self._pre_transform(X, y, **fit_params)
131 self.steps[-1][-1].fit(Xt, y, **fit_params)
132 return self
/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
118 for name, transform in self.steps[:-1]:
119 if hasattr(transform, "fit_transform"):
--> 120 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
121 else:
122 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
/usr/local/lib/python2.7/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
409 else:
410 # fit method of arity 2 (supervised transformation)
--> 411 return self.fit(X, y, **fit_params).transform(X)
412
413
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in fit(self, X, y)
304
305 for batch_slice in batch_slices:
--> 306 pl_batch = self._fit(X[batch_slice], rng)
307
308 if verbose:
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in _fit(self, v_pos, rng)
245
246 if self.verbose:
--> 247 return self.score_samples(v_pos)
248
249 def score_samples(self, v):
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in score_samples(self, v)
268 fe_ = self._free_energy(v_)
269
--> 270 return v.shape[1] * logistic_sigmoid(fe_ - fe, log=True)
271
272 def fit(self, X, y=None):
/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.pyc in logistic_sigmoid(X, log, out)
498 """
499 is_1d = X.ndim == 1
--> 500 X = array2d(X, dtype=np.float)
501
502 n_samples, n_features = X.shape
/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.pyc in array2d(X, dtype, order, copy, force_all_finite)
91 X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
92 if force_all_finite:
---> 93 _assert_all_finite(X_2d)
94 if X is X_2d and copy:
95 X_2d = safe_copy(X_2d)
/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
25 if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
26 and not np.isfinite(X).all()):
---> 27 raise ValueError("Array contains NaN or infinity.")
28
29
ValueError: Array contains NaN or infinity.
There are no infs or nans in the data matrix...what could be causing this behaviour?
EDIT: Apparently I'm not the only one.
This looks like a numerical stability bug in RBMs. Can you please open a github issue with your script in it?
Edit: by the way if you are interested you can try to find the source of the issue by adding np.isfinite() checks in the inner loops of the _fit method of the BernoulliRBM class.
This issue is usually caused by two factors. First, incorrect initial scaling of the data: the input needs to be bounded between 0 and 1. Remember that RBMs were originally designed for binary data only. Second, the learning rate could be too high. Defaults for RBM code are often based on the MNIST digit recognition dataset, which can handle larger learning rates.
So I would trust sklearn's implementation, but not the stability of the algorithm on a new dataset when the default values don't fit that dataset. Adding checks for infinity won't help; you will still need to tweak the learning rate.
This is why deep learning is said to be a bit of an art. You probably also need to play around with the number of Gibbs sampling steps, the size of the minibatch, and the amount of momentum. Don't give up though, the rewards are mostly worth it.
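The two suggestions (scale the inputs into [0, 1] and lower the learning rate) can be sketched as follows. This uses the current sklearn.model_selection module rather than the cross_validation module from the question, and the dataset size, n_components, learning_rate and n_iter values are untuned assumptions chosen only to keep the example quick:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small fake dataset (the question used 250000 samples).
data, labels = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.3, random_state=0)

classifier = Pipeline(steps=[
    # Squash every feature into [0, 1] instead of standardizing it.
    ("scale", MinMaxScaler()),
    # Smaller learning rate than the default of 0.1.
    ("rbm", BernoulliRBM(n_components=32, learning_rate=0.01,
                         n_iter=10, random_state=0)),
    ("logistic", LogisticRegression(max_iter=1000)),
])
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
weights_finite = np.isfinite(classifier.named_steps["rbm"].components_).all()
print(predictions.shape, weights_finite)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so the test data is scaled with the training minimum and maximum.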
