I'm working through Aurelien Geron's Hands-On ML textbook and have got stuck trying to train an SGDClassifier.
I'm using the MNIST handwritten numbers data and running my code in a Jupyter Notebook via Anaconda. Both my anaconda (1.7.0) and sklearn (0.20.dev0) are updated. I've pasted the code I used to load the data, select the first 60k rows, shuffle the order and convert the labels to 1 (True) for all 5's and 0 (False) for all other numbers. Both X_train and y_train_5 are numpy arrays.
I've pasted the error message I get below.
Nothing seems to be wrong with the dimensions of the data. I tried converting X_train to a sparse matrix (the suggested format for SGDClassifier) and tried various max_iter values, but got the same error message each time. Am I missing something obvious? Do I need to use a different version of sklearn? I've searched online but couldn't find any posts describing similar issues with SGDClassifier. I'd be super grateful for any kind of pointer.
Code
from six.moves import urllib
from scipy.io import loadmat
import numpy as np
from sklearn.linear_model import SGDClassifier
# Load MNIST data #
mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_path = "./mnist-original.mat"
response = urllib.request.urlopen(mnist_alternative_url)
with open(mnist_path, "wb") as f:
    content = response.read()
    f.write(content)
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}
# Assign X and y #
X, y = mnist['data'], mnist['target']
# Select first 60000 numbers #
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# Shuffle order #
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
# Convert labels to binary (5 or "not 5") #
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
# Train SGDClassifier #
sgd_clf = SGDClassifier(max_iter=5, random_state=42)
sgd_clf.fit(X_train, y_train_5)
Error Message
---------------------------------------------------------------------------
TypeError
Traceback (most recent call last)
<ipython-input-10-5a25eed28833> in <module>()
37 # Train SGDClassifier
38 sgd_clf = SGDClassifier(max_iter=5, random_state=42)
---> 39 sgd_clf.fit(X_train, y_train_5)
~\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py in fit(self, X, y, coef_init, intercept_init, sample_weight)
712 loss=self.loss, learning_rate=self.learning_rate,
713 coef_init=coef_init, intercept_init=intercept_init,
--> 714 sample_weight=sample_weight)
715
716
~\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py in _fit(self, X, y, alpha, C, loss, learning_rate, coef_init, intercept_init, sample_weight)
570
571 self._partial_fit(X, y, alpha, C, loss, learning_rate, self._max_iter,
--> 572 classes, sample_weight, coef_init, intercept_init)
573
574 if (self._tol is not None and self._tol > -np.inf
~\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py in _partial_fit(self, X, y, alpha, C, loss, learning_rate, max_iter, classes, sample_weight, coef_init, intercept_init)
529 learning_rate=learning_rate,
530 sample_weight=sample_weight,
--> 531 max_iter=max_iter)
532 else:
533 raise ValueError(
~\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py in _fit_binary(self, X, y, alpha, C, sample_weight, learning_rate, max_iter)
587 self._expanded_class_weight[1],
588 self._expanded_class_weight[0],
--> 589 sample_weight)
590
591 self.t_ += n_iter_ * X.shape[0]
~\Anaconda3\lib\site-packages\sklearn\linear_model\stochastic_gradient.py in fit_binary(est, i, X, y, alpha, C, learning_rate, max_iter, pos_weight, neg_weight, sample_weight)
419 pos_weight, neg_weight,
420 learning_rate_type, est.eta0,
--> 421 est.power_t, est.t_, intercept_decay)
422
423 else:
~\Anaconda3\lib\site-packages\sklearn\linear_model\sgd_fast.pyx in sklearn.linear_model.sgd_fast.plain_sgd()
TypeError: plain_sgd() takes at most 21 positional arguments (25 given)
It appears your version of scikit-learn is just a little outdated. Try running:
pip install -U scikit-learn
then your code will run (with some slight formatting updates):
from six.moves import urllib
from scipy.io import loadmat
import numpy as np
from sklearn.linear_model import SGDClassifier
# Load MNIST data #
mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
mnist_path = "./mnist-original.mat"
response = urllib.request.urlopen(mnist_alternative_url)
with open(mnist_path, "wb") as f:
    content = response.read()
    f.write(content)
mnist_raw = loadmat(mnist_path)
mnist = {
    "data": mnist_raw["data"].T,
    "target": mnist_raw["label"][0],
    "COL_NAMES": ["label", "data"],
    "DESCR": "mldata.org dataset: mnist-original",
}
# Assign X and y #
X, y = mnist['data'], mnist['target']
# Select first 60000 numbers #
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# Shuffle order #
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
# Convert labels to binary (5 or "not 5") #
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
# Train SGDClassifier #
sgd_clf = SGDClassifier(max_iter=5, random_state=42)
sgd_clf.fit(X_train, y_train_5)
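Before and after upgrading, it can help to confirm which scikit-learn build the notebook kernel actually sees. The following is a minimal sketch using only the standard library (Python 3.8+). The helper name `installed_version` is made up for illustration, and it is an assumption on my part that a development build (such as the 0.20.dev0 mentioned in the question) is what to look out for.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent.

    `installed_version` is a hypothetical helper, not part of scikit-learn.
    """
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("scikit-learn")
if version is None:
    print("scikit-learn is not installed in this environment")
elif "dev" in version:
    print("development build detected:", version)
else:
    print("scikit-learn release:", version)
```

Run it in the same kernel as the notebook, since a terminal may point at a different environment than Jupyter does.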
Related
When trying to tune the sklearn MLPClassifier hidden_layer_sizes hyperparameter using BayesSearchCV, I get an error: ValueError: can only convert an array of size 1 to a Python scalar.
However, when I use GridSearchCV, it works great! What am I missing?
Here is a reproducible example:
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75,
                                                    random_state=0)
# this does not work!
opt_bs = BayesSearchCV(MLPClassifier(),
                       {'learning_rate_init': Real(0.001, 0.05),
                        'solver': Categorical(["adam", 'sgd']),
                        'hidden_layer_sizes': Categorical([(10,5), (15,10,5)])},
                       n_iter=32,
                       random_state=0)
# this one does :)
opt_gs = GridSearchCV(MLPClassifier(),
                      {'learning_rate_init': [0.001, 0.05],
                       'solver': ["adam", 'sgd'],
                       'hidden_layer_sizes': [(10,5), (15,10,5)]})
# executes optimization using opt_gs or opt_bs
opt = opt_bs
res = opt.fit(X_train, y_train)
opt
Produces:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-64-78e6d29cae99> in <module>()
27 # executes optimization using opt_gs or opt_bs
28 opt = opt_bs
---> 29 res = opt.fit(X_train, y_train)
30 opt
/usr/local/lib/python3.6/dist-packages/skopt/searchcv.py in fit(self, X, y, groups, callback)
678 optim_result = self._step(
679 X, y, search_space, optimizer,
--> 680 groups=groups, n_points=n_points_adjusted
681 )
682 n_iter -= n_points
/usr/local/lib/python3.6/dist-packages/skopt/searchcv.py in _step(self, X, y, search_space, optimizer, groups, n_points)
553
554 # convert parameters to python native types
--> 555 params = [[np.array(v).item() for v in p] for p in params]
556
557 # make lists into dictionaries
/usr/local/lib/python3.6/dist-packages/skopt/searchcv.py in <listcomp>(.0)
553
554 # convert parameters to python native types
--> 555 params = [[np.array(v).item() for v in p] for p in params]
556
557 # make lists into dictionaries
/usr/local/lib/python3.6/dist-packages/skopt/searchcv.py in <listcomp>(.0)
553
554 # convert parameters to python native types
--> 555 params = [[np.array(v).item() for v in p] for p in params]
556
557 # make lists into dictionaries
ValueError: can only convert an array of size 1 to a Python scalar
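The failing line can be reproduced outside of skopt. This is a small sketch, assuming only NumPy, that applies the same `np.array(v).item()` conversion shown in the traceback to a scalar value and to one of the tuples from the search space:

```python
import numpy as np

# A scalar parameter value converts cleanly to a native Python type.
print(np.array(0.05).item())

# A tuple such as (10, 5) becomes an array of size 2, so .item() refuses it.
try:
    np.array((10, 5)).item()
except ValueError as err:
    print("ValueError:", err)
```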
Unfortunately, BayesSearchCV only accepts parameters given as Categorical, Integer, or Real values. In your case, learning_rate_init and solver are not the issue, as they are clearly defined as Real and Categorical respectively. The problem is hidden_layer_sizes: you have declared the layer sizes as Categorical values that are tuples, and BayesSearchCV is not yet equipped to handle search spaces made of tuples (refer here for more details on this). As a temporary hack, you could create your own wrapper around MLPClassifier so that the estimator's parameters are recognized properly. Please refer to the following code snippet for a sample:
from skopt import BayesSearchCV
from skopt.space import Integer
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.75,
                                                    random_state=0)
class MLPWrapper(BaseEstimator, ClassifierMixin):
    # Each layer size is exposed as its own integer parameter so that
    # BayesSearchCV can search over it.
    def __init__(self, layer1=10, layer2=10, layer3=10):
        self.layer1 = layer1
        self.layer2 = layer2
        self.layer3 = layer3
    def fit(self, X, y):
        model = MLPClassifier(
            hidden_layer_sizes=[self.layer1, self.layer2, self.layer3]
        )
        model.fit(X, y)
        self.model = model
        return self
    def predict(self, X):
        return self.model.predict(X)
    def score(self, X, y):
        return self.model.score(X, y)
opt = BayesSearchCV(
    estimator=MLPWrapper(),
    search_spaces={
        'layer1': Integer(10, 100),
        'layer2': Integer(10, 100),
        'layer3': Integer(10, 100)
    },
    n_iter=11
)
opt.fit(X_train, y_train)
opt.score(X_test,y_test)
0.9736842105263158
Note: This assumes that you build an MLP network with three layers. You can modify it as per your need. Also, it becomes slightly tricky to create a class that constructs any MLP with an arbitrary number of layers.
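One way to sidestep the fixed depth is to search over a layer count and a single shared layer width instead of one parameter per layer. The sketch below is only an illustration under that simplifying assumption (all hidden layers get the same width). The class name `FlexibleMLP` and its parameters `n_layers` and `n_units` are made up here and are not part of sklearn or skopt.

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier


class FlexibleMLP(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: depth and width are plain integers."""

    def __init__(self, n_layers=2, n_units=10, max_iter=500, random_state=0):
        self.n_layers = n_layers
        self.n_units = n_units
        self.max_iter = max_iter
        self.random_state = random_state

    def fit(self, X, y):
        # Build the tuple that MLPClassifier expects, e.g. (20, 20, 20).
        sizes = (int(self.n_units),) * int(self.n_layers)
        self.model_ = MLPClassifier(hidden_layer_sizes=sizes,
                                    max_iter=self.max_iter,
                                    random_state=self.random_state)
        self.model_.fit(X, y)
        self.classes_ = self.model_.classes_
        return self

    def predict(self, X):
        return self.model_.predict(X)


X, y = load_iris(return_X_y=True)
clf = FlexibleMLP(n_layers=3, n_units=20).fit(X, y)
print(clf.model_.hidden_layer_sizes)
print(clf.score(X, y))
```

With a wrapper like this, the search space would presumably shrink to two Integer dimensions (one for `n_layers`, one for `n_units`); that combination with BayesSearchCV is not exercised in the sketch.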
I'm currently working on a text classification problem and want to experiment with Doc2vec. I searched and found the tutorial post by #susanli2016. How can I use cross-validation for evaluation instead of a train/test split in this case?
P.S.: I couldn't really find anyone who uses CV for evaluation here, so is a train/test split better in this case?
Here is her code:
train, test = train_test_split(df, test_size=0.3, random_state=42)
import nltk
from nltk.corpus import stopwords
def tokenize_text(text):
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens
train_tagged = train.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['narrative']), tags=[r.Product]), axis=1)
test_tagged = test.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['narrative']), tags=[r.Product]), axis=1)
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, min_count=2, sample = 0, workers=cores)
model_dbow.build_vocab([x for x in tqdm(train_tagged.values)])
def vec_for_learning(model, tagged_docs):
    sents = tagged_docs.values
    targets, regressors = zip(*[(doc.tags[0], model.infer_vector(doc.words, steps=20)) for doc in sents])
    return targets, regressors
y_train, X_train = vec_for_learning(model_dbow, train_tagged)
y_test, X_test = vec_for_learning(model_dbow, test_tagged)
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
from sklearn.metrics import accuracy_score, f1_score
print('Testing accuracy %s' % accuracy_score(y_test, y_pred))
Here is my attempt:
tagged = df.apply(
    lambda r: TaggedDocument(words=tokenize_text(r['Text']), tags=[r.label]), axis=1)
doc = vec_for_learning(model_dbow, tagged)
text_clf = SVC(C=1, kernel= 'linear', max_iter= 1000, tol=0.0001, probability=True)
scores = cross_validate(text_clf, doc, df['label'],
                        cv=10, return_train_score=False)
sorted(scores.keys())
scores['test_score']
but I got the following error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-77-105f56d4a65a> in <module>()
1 scores = cross_validate(text_clf, doc, df['label'],
----> 2 cv=10, return_train_score=False)
3 sorted(scores.keys())
4
5 scores['test_score']
/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
223
224 """
--> 225 X, y, groups = indexable(X, y, groups)
226
227 cv = check_cv(cv, y, classifier=is_classifier(estimator))
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in indexable(*iterables)
258 else:
259 result.append(np.array(X))
--> 260 check_consistent_length(*result)
261 return result
262
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
233 if len(uniques) > 1:
234 raise ValueError("Found input variables with inconsistent numbers of"
--> 235 " samples: %r" % [int(l) for l in lengths])
236
237
ValueError: Found input variables with inconsistent numbers of samples: [2, 2234]
I don't know what I did wrong or how to fix it.
Since I am very new to Python and the NLP domain, I would be glad if someone could give me a hint so my work can proceed :)
Thank you in advance!
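A likely cause, judging from the sample counts [2, 2234] in the error: `vec_for_learning` returns a tuple `(targets, regressors)`, so `doc` has length 2 and is handed to `cross_validate` as if it were the feature matrix. The sketch below uses synthetic stand-in vectors to show the unpacking. The random data and the helper `fake_vec_for_learning` are assumptions for illustration only, not real Doc2Vec output.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

rng = np.random.RandomState(0)


def fake_vec_for_learning(n_docs=200, dim=20):
    """Stand-in for vec_for_learning: returns a (targets, regressors) pair."""
    targets = tuple(rng.randint(0, 2, size=n_docs))
    regressors = tuple(rng.normal(size=dim) for _ in range(n_docs))
    return targets, regressors


doc = fake_vec_for_learning()
print(len(doc))  # length of the returned pair, not the number of documents

# Unpack the pair: labels first, then the inferred vectors.
y, X = doc
X = np.array(X)
y = np.array(y)

text_clf = SVC(C=1, kernel="linear")
scores = cross_validate(text_clf, X, y, cv=10, return_train_score=False)
print(scores["test_score"])
```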
I'm using Anaconda Navigator with Python 3.6.
When I run the following code I get this error:
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from sklearn.grid_search import GridSearchCV
def fit_model(X, y):
    """ Performs grid search over the 'max_depth' parameter for a
        decision tree regressor trained on the input data [X, y]. """
    # Create cross-validation sets from the training data
    # sklearn version 0.18: ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None)
    # sklearn version 0.17: ShuffleSplit(n, n_iter=10, test_size=0.1, train_size=None, random_state=None)
    cv_sets = ShuffleSplit(X.shape[0], n_iter = 10, test_size = 0.20, random_state = 0)
    # TODO: Create a decision tree regressor object
    regressor = DecisionTreeRegressor()
    # TODO: Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth':range(1,10)}
    # TODO: Transform 'performance_metric' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(performance_metric)
    # TODO: Create the grid search cv object --> GridSearchCV()
    # Make sure to include the right parameters in the object:
    # (estimator, param_grid, scoring, cv) which have values 'regressor', 'params', 'scoring_fnc', and 'cv_sets' respectively.
    grid = GridSearchCV(regressor, params, scoring_fnc, cv=cv_sets)
    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)
    # Return the optimal model after fitting the data
    return grid.best_estimator_
# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)
# Produce the value for 'max_depth'
print ("Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']))
Here are the error messages:
ValueError Traceback (most recent call last)
<ipython-input-12-05857a84a7c5> in <module>()
1 # Fit the training data to the model using grid search
----> 2 reg = fit_model(X_train, y_train)
3
4 # Produce the value for 'max_depth'
5 print ("Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']))
<ipython-input-11-2c0c19498236> in fit_model(X, y)
26 # (estimator, param_grid, scoring, cv) which have values 'regressor', 'params', 'scoring_fnc', and 'cv_sets' respectively.
27
---> 28 grid = GridSearchCV(regressor, params, scoring_fnc, cv=cv_sets)
29
30 # Fit the grid search object to the data to compute the optimal model
~/anaconda3/lib/python3.6/site-packages/sklearn/grid_search.py in __init__(self, estimator, param_grid, scoring, fit_params, n_jobs, iid, refit, cv, verbose, pre_dispatch, error_score)
819 refit, cv, verbose, pre_dispatch, error_score)
820 self.param_grid = param_grid
--> 821 _check_param_grid(param_grid)
822
823 def fit(self, X, y=None):
~/anaconda3/lib/python3.6/site-packages/sklearn/grid_search.py in _check_param_grid(param_grid)
349 if True not in check:
350 raise ValueError("Parameter values for parameter ({0}) need "
--> 351 "to be a sequence.".format(name))
352
353 if len(v) == 0:
ValueError: Parameter values for parameter (max_depth) need to be a sequence.
grid_search.py checks this:
check = [isinstance(v, k) for k in (list, tuple, np.ndarray)]
It seems like you can't use a range, since in Python 3 a range object is none of these types. I would try this:
params = {'max_depth': np.arange(1,10)}
or without numpy:
params = {'max_depth': list(range(1, 10))}
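The difference can be checked without sklearn at all. This is a minimal sketch of the check quoted above, with `np.ndarray` left out so it runs on the standard library alone. The function name `is_accepted_sequence` is made up for illustration.

```python
# Simplified version of the type check quoted above; np.ndarray is left out
# so the sketch runs without NumPy.
def is_accepted_sequence(values):
    return isinstance(values, (list, tuple))


# In Python 3, range() returns a lazy range object rather than a list.
print(is_accepted_sequence(range(1, 10)))
# Materializing it as a list satisfies the check.
print(is_accepted_sequence(list(range(1, 10))))

params = {'max_depth': list(range(1, 10))}
print(params)
```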
I'm building a classifier using an SVM and want to perform a Grid Search to help automate finding the optimal model. Here's the code:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
X.shape # (22343, 323)
y.shape # (22343, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
tuned_parameters = [
    {
        'estimator__kernel': ['rbf'],
        'estimator__gamma': [1e-3, 1e-4],
        'estimator__C': [1, 10, 100, 1000]
    },
    {
        'estimator__kernel': ['linear'],
        'estimator__C': [1, 10, 100, 1000]
    }
]
model_to_set = OneVsRestClassifier(SVC(), n_jobs=-1)
clf = GridSearchCV(model_to_set, tuned_parameters)
clf.fit(X_train, y_train)
and I get the following error message (this isn't the whole stack trace, just the last 3 calls):
----------------------------------------------------
/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
88 X, y, groups = indexable(X, y, groups)
89 indices = np.arange(_num_samples(X))
---> 90 for test_index in self._iter_test_masks(X, y, groups):
91 train_index = indices[np.logical_not(test_index)]
92 test_index = indices[test_index]
/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
606
607 def _iter_test_masks(self, X, y=None, groups=None):
--> 608 test_folds = self._make_test_folds(X, y)
609 for i in range(self.n_splits):
610 yield test_folds == i
/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in _make_test_folds(self, X, y, groups)
593 for test_fold_indices, per_cls_splits in enumerate(zip(*per_cls_cvs)):
594 for cls, (_, test_split) in zip(unique_y, per_cls_splits):
--> 595 cls_test_folds = test_folds[y == cls]
596 # the test split can be too big because we used
597 # KFold(...).split(X[:max(c, n_splits)]) when data is not 100%
IndexError: too many indices for array
Also, when I try reshaping the arrays so that the y is (22343,) I find that the GridSearch never finishes even if I set the tuned_parameters to only default values.
And here are the versions for all of the packages if that helps:
Python: 3.5.2
scikit-learn: 0.18
pandas: 0.19.0
It seems that there is no error in your implementation.
However, as mentioned in the sklearn documentation, the "fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples". See the documentation here.
In your case, you have 22343 samples, which can lead to computational problems or memory issues. That is why your default CV takes so long. Try reducing your training set to 10000 samples or fewer.
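A rough sketch of that subsampling step, using NumPy only. The arrays below are synthetic stand-ins with the shapes given in the question (an assumption, since the real data is not shown), and the label column is flattened to shape (n,) as the question already describes doing.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-ins with the shapes from the question (assumed data).
X = rng.normal(size=(22343, 323))
y = rng.randint(0, 3, size=(22343, 1))

# Flatten the (n, 1) label column to shape (n,).
y = y.ravel()

# Draw 10000 row indices without replacement and keep only those rows.
subset = rng.choice(X.shape[0], size=10000, replace=False)
X_small, y_small = X[subset], y[subset]

print(X_small.shape, y_small.shape)
```

`X_small` and `y_small` can then be passed to the grid search in place of the full training set.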
I'm borrowing an idea here from the documentation to use RBMs + Logistic regression for classification.
However I'm getting an error that should not be thrown since all entries in my data matrix are numerical.
Code:
from sklearn import preprocessing, cross_validation
from scipy.ndimage import convolve
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn import linear_model, datasets, metrics
import numpy as np
# create fake dataset
data, labels = datasets.make_classification(n_samples=250000)
data = preprocessing.scale(data)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, labels, test_size=0.7, random_state=0)
# print details
print X_train.shape, X_test.shape, y_train.shape, y_test.shape
print np.max(X_train)
print np.min(X_train)
print np.mean(X_train, axis=0)
print np.std(X_train, axis=0)
if np.sum(np.isnan(X_train)) or np.sum(np.isnan(X_test)):
    print "NaN found!"
if np.sum(np.isnan(y_train)) or np.sum(np.isnan(y_test)):
    print "NaN found!"
if np.sum(np.isinf(X_train)) or np.sum(np.isinf(X_test)):
    print "Inf found!"
if np.sum(np.isinf(y_train)) or np.sum(np.isinf(y_test)):
    print "Inf found!"
# train and test
logistic = linear_model.LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
# Training RBM-Logistic Pipeline
classifier.fit(X_train, y_train)
# Training Logistic regression
logistic_classifier = linear_model.LogisticRegression(C=100.0)
logistic_classifier.fit(X_train, y_train)
print("Logistic regression using RBM features:\n%s\n" % (
    metrics.classification_report(
        y_test,
        classifier.predict(X_test))))
Output:
(73517, 3) (171540, 3) (73517,) (171540,)
2.0871168057
-2.21062647188
[-0.00237028 -0.00104526 0.00330683]
[ 0.99907225 0.99977328 1.00225843]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
173 else:
174 filename = fname
--> 175 __builtin__.execfile(filename, *where)
/home/test.py in <module>()
75
76 # Training RBM-Logistic Pipeline
---> 77 classifier.fit(X_train, y_train)
78
79 # Training Logistic regression
/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
128 data, then fit the transformed data using the final estimator.
129 """
--> 130 Xt, fit_params = self._pre_transform(X, y, **fit_params)
131 self.steps[-1][-1].fit(Xt, y, **fit_params)
132 return self
/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
118 for name, transform in self.steps[:-1]:
119 if hasattr(transform, "fit_transform"):
--> 120 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
121 else:
122 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
/usr/local/lib/python2.7/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
409 else:
410 # fit method of arity 2 (supervised transformation)
--> 411 return self.fit(X, y, **fit_params).transform(X)
412
413
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in fit(self, X, y)
304
305 for batch_slice in batch_slices:
--> 306 pl_batch = self._fit(X[batch_slice], rng)
307
308 if verbose:
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in _fit(self, v_pos, rng)
245
246 if self.verbose:
--> 247 return self.score_samples(v_pos)
248
249 def score_samples(self, v):
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in score_samples(self, v)
268 fe_ = self._free_energy(v_)
269
--> 270 return v.shape[1] * logistic_sigmoid(fe_ - fe, log=True)
271
272 def fit(self, X, y=None):
/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.pyc in logistic_sigmoid(X, log, out)
498 """
499 is_1d = X.ndim == 1
--> 500 X = array2d(X, dtype=np.float)
501
502 n_samples, n_features = X.shape
/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.pyc in array2d(X, dtype, order, copy, force_all_finite)
91 X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
92 if force_all_finite:
---> 93 _assert_all_finite(X_2d)
94 if X is X_2d and copy:
95 X_2d = safe_copy(X_2d)
/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
25 if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
26 and not np.isfinite(X).all()):
---> 27 raise ValueError("Array contains NaN or infinity.")
28
29
ValueError: Array contains NaN or infinity.
There are no infs or nans in the data matrix...what could be causing this behaviour?
EDIT: Apparently I'm not the only one.
This looks like a numerical stability bug in RBMs. Can you please open a github issue with your script in it?
Edit: by the way if you are interested you can try to find the source of the issue by adding np.isfinite() checks in the inner loops of the _fit method of the BernoulliRBM class.
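Such a check could look roughly like the following. `check_finite` is a hypothetical helper, not something sklearn provides, and where exactly to call it inside a local copy of `_fit` is left open here.

```python
import numpy as np


def check_finite(name, array):
    """Raise immediately if `array` holds NaN or infinity (hypothetical helper)."""
    array = np.asarray(array)
    if not np.isfinite(array).all():
        n_bad = int((~np.isfinite(array)).sum())
        raise FloatingPointError(
            "%s contains %d non-finite value(s)" % (name, n_bad))
    return array


# Example: an overflow in exp() produces inf, which the check reports.
with np.errstate(over="ignore"):
    activations = np.exp(np.array([1.0, 1000.0]))

try:
    check_finite("activations", activations)
except FloatingPointError as err:
    print(err)
```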
This issue is usually caused by two factors: incorrect initial scaling of the data, and learning rates that are too high. First, the input data needs to be bounded between 0 and 1. Remember that RBMs were originally designed for binary data only. Second, the learning rate could be too high. Defaults for RBM code are often based on the MNIST digit recognition dataset, which can handle larger learning rates.
So I would trust sklearn's implementation, but not the stability of the algorithm on a new dataset when the default values don't fit that dataset. Adding checks for infinity won't help; you will still need to tweak the learning rates.
This is why deep learning is said to be a bit of an art. You probably also need to play around with the number of Gibbs samples, the size of the minibatch and the amount of momentum. Don't give up though, the rewards are mostly worth it.
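Both suggestions can be tried in a few lines. This is a sketch under assumptions: it swaps the `preprocessing.scale` call from the question for `MinMaxScaler` so every feature lands in [0, 1], and it picks `learning_rate=0.01` as an arbitrary value below the 0.1 default. The dataset is a small synthetic one, much smaller than in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small synthetic dataset (assumed; much smaller than in the question).
X, y = make_classification(n_samples=1000, random_state=0)

classifier = Pipeline(steps=[
    # Bound every feature to [0, 1] before it reaches the RBM.
    ("scale", MinMaxScaler()),
    # learning_rate=0.01 is an assumed value, lower than the 0.1 default.
    ("rbm", BernoulliRBM(learning_rate=0.01, n_iter=10, random_state=0)),
    ("logistic", LogisticRegression(max_iter=1000)),
])
classifier.fit(X, y)

X_scaled = classifier.named_steps["scale"].transform(X)
print(X_scaled.min(), X_scaled.max())
print(classifier.score(X, y))
```

If the fit still blows up on real data, lowering the learning rate further or changing `batch_size` would be the next knobs to try.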