from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
gnb = GaussianNB()
gnb.fit(X_train, y_train)
I'm getting an AttributeError when I try to train my model with the Gaussian Naive Bayes algorithm. I tried MultinomialNB and BernoulliNB as well, but I receive the same error.
This is the error message I received:
AttributeError Traceback (most recent call last)
Cell In[290], line 2
1 #training Guassian Naive Bayes model
----> 2 gnb.fit(X_train,y_train)
3 y_pred = mnb.predict(X_test)
File ~\anaconda3\envs\NLP\lib\site-packages\sklearn\naive_bayes.py:265, in GaussianNB.fit(self, X, y, sample_weight)
242 def fit(self, X, y, sample_weight=None):
243 """Fit Gaussian Naive Bayes according to X, y.
244
245 Parameters
(...)
263 Returns the instance itself.
264 """
--> 265 self._validate_params()
266 y = self._validate_data(y=y)
267 return self._partial_fit(
268 X, y, np.unique(y), _refit=True, sample_weight=sample_weight
269 )
AttributeError: 'GaussianNB' object has no attribute '_validate_params'
Could someone kindly help me solve this?
I am trying to combine over-sampling and under-sampling using RandomUnderSampler() and SMOTE().
I am working on the loan_status dataset.
I have done the following split:
X = df.drop(['Loan_Status'], axis=1).values  # independent features
y = df['Loan_Status'].values  # dependent variable
This is what my training data's class distribution looks like.
This is the code snippet I tried to execute for class balancing:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
pipeline = make_pipeline(over,under)
x_sm,y_sm = pipeline.fit_resample(X_train,y_train)
It gave me a ValueError with the following traceback:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_64588/3438707951.py in <module>
4 pipeline = make_pipeline(over,under)
5
----> 6 x_copy,y_copy = pipeline.fit_resample(x_train_copy,y_train_copy)
~\Anaconda3\lib\site-packages\imblearn\pipeline.py in fit_resample(self, X, y, **fit_params)
351 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
352 if hasattr(last_step, "fit_resample"):
--> 353 return last_step.fit_resample(Xt, yt, **fit_params_last_step)
354
355 #if_delegate_has_method(delegate="_final_estimator")
~\Anaconda3\lib\site-packages\imblearn\base.py in fit_resample(self, X, y)
77 X, y, binarize_y = self._check_X_y(X, y)
78
---> 79 self.sampling_strategy_ = check_sampling_strategy(
80 self.sampling_strategy, y, self._sampling_type
81 )
~\Anaconda3\lib\site-packages\imblearn\utils\_validation.py in check_sampling_strategy(sampling_strategy, y, sampling_type, **kwargs)
532 return OrderedDict(
533 sorted(
--> 534 _sampling_strategy_float(sampling_strategy, y, sampling_type).items()
535 )
536 )
~\Anaconda3\lib\site-packages\imblearn\utils\_validation.py in _sampling_strategy_float(sampling_strategy, y, sampling_type)
391 ]
392 ):
--> 393 raise ValueError(
394 "The specified ratio required to generate new "
395 "sample in the majority class while trying to "
ValueError: The specified ratio required to generate new sample in the majority class while trying to remove samples. Please increase the ratio.
You have to increase the sampling_strategy for the SMOTE because ((y_train==0).sum())/((y_train==1).sum()) is higher than 0.1. It seems that your starting imbalance ratio is roughly 0.4 (by eye). Try:
over = SMOTE(sampling_strategy=0.5)
Finally, you probably want an equal final ratio (after the under-sampling), so you should set the sampling_strategy to 1.0 for the RandomUnderSampler:
under = RandomUnderSampler(sampling_strategy=1)
Try it this way, and give me feedback if you run into other problems.
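Putting those two changes together, a minimal sketch of the corrected pipeline (the 0.5 and 1.0 ratios are the values suggested above and may still need tuning against your actual class distribution) would be:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline
# Over-sample the minority class up to a 0.5 minority/majority ratio,
# then under-sample the majority class down to a 1:1 ratio.
over = SMOTE(sampling_strategy=0.5)
under = RandomUnderSampler(sampling_strategy=1.0)
pipeline = make_pipeline(over, under)
x_sm, y_sm = pipeline.fit_resample(X_train, y_train)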
I am getting this error.
I am trying to predict some data using machine learning with a regression tree model. I have a low score, so I want to select the most important features.
For this I am using sklearn's SelectKBest, but I get the following error.
How can I solve it?
Read Data
data = pd.read_csv("EquiposData.csv")
target = data.iloc[:,1:2]
datos = data.iloc[:,2:]
SelectKBest
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression  # computes the best selection
selector = SelectKBest(mutual_info_regression, k=4)
selector.fit(datos,target)
scores = selector.scores_
AttributeError Traceback (most recent call last)
<ipython-input-341-7d9675b4a1f7> in <module>()
4
5 selector = SelectKBest(mutual_info_regression, k=4)
----> 6 selector.fit(datos,target)
7 scores = selector.scores_
/usr/local/lib/python3.6/dist-packages/sklearn/feature_selection/_univariate_selection.py in fit(self, X, y)
342 self : object
343 """
--> 344 X, y = self._validate_data(X, y, accept_sparse=['csr', 'csc'],
345 multi_output=True)
346
AttributeError: 'SelectKBest' object has no attribute '_validate_data'
Goal: use brier score loss to train a random forest algorithm using GridSearchCV
Issue: The probability prediction for target "y" is the wrong dimension when using make_scorer.
After looking at this question, I am using its suggested proxy function so that GridSearchCV can be trained with brier score loss. Below is an example setup:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import brier_score_loss,make_scorer
from sklearn.ensemble import RandomForestClassifier
import numpy as np
def ProbaScoreProxy(y_true, y_probs, class_idx, proxied_func, **kwargs):
return proxied_func(y_true, y_probs[:, class_idx], **kwargs)
brier_scorer = make_scorer(ProbaScoreProxy, greater_is_better=False,
                           needs_proba=True, class_idx=1, proxied_func=brier_score_loss)
X = np.random.randn(100,2)
y = (X[:,0]>0).astype(int)
random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X,y)
probs = random_forest.predict_proba(X)
Now passing the probs and y directly to either brier_score_loss or ProbaScoreProxy will not result in an error:
ProbaScoreProxy(y,probs,1,brier_score_loss)
outputs:
0.0006
Now pass it through brier_scorer:
brier_scorer(random_forest,X,y)
output:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-28-1474bb08e572> in <module>()
----> 1 brier_scorer(random_forest,X,y)
~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/_scorer.py in __call__(self, estimator, X, y_true, sample_weight)
167 stacklevel=2)
168 return self._score(partial(_cached_call, None), estimator, X, y_true,
--> 169 sample_weight=sample_weight)
170
171 def _factory_args(self):
~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/_scorer.py in _score(self, method_caller, clf, X, y, sample_weight)
258 **self._kwargs)
259 else:
--> 260 return self._sign * self._score_func(y, y_pred, **self._kwargs)
261
262 def _factory_args(self):
<ipython-input-25-5321477444e1> in ProbaScoreProxy(y_true, y_probs, class_idx, proxied_func, **kwargs)
5
6 def ProbaScoreProxy(y_true, y_probs, class_idx, proxied_func, **kwargs):
----> 7 return proxied_func(y_true, y_probs[:, class_idx], **kwargs)
8
9 brier_scorer = make_scorer(ProbaScoreProxy, greater_is_better=False, needs_proba=True, class_idx=1, proxied_func=brier_score_loss)
IndexError: too many indices for array
So it seems like something in make_scorer changes the dimension of its probability input, but I can't see what the problem is.
Versions:
- sklearn: '0.22.2.post1'
- numpy: '1.18.1'
Note that here y is the correct dimension (1-d), and you'll find by fiddling around that it's the dimension of y_probs being passed into ProbaScoreProxy that causes the issue.
Is this just badly written code from that last question? What, ultimately, is the way to get a make_scorer object that something like GridSearchCV will accept to train an RF?
Goal: use brier score loss to train a random forest algorithm using GridSearchCV
For this goal, you can pass the string value 'neg_brier_score' directly to the GridSearchCV scoring parameter.
For example:
gc = GridSearchCV(random_forest,
param_grid={"n_estimators":[5, 10]},
scoring="neg_brier_score")
gc.fit(X, y)
print(gc.scorer_)
# make_scorer(brier_score_loss, greater_is_better=False, needs_proba=True)
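If you prefer to keep an explicit scorer object (for example, to reuse it outside GridSearchCV), a minimal equivalent sketch, assuming a binary target so the scorer already hands the positive-class probabilities to the metric as a 1-d array, would be:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import brier_score_loss, make_scorer
# Equivalent to scoring="neg_brier_score": no column indexing is needed here,
# because for a binary target the probability scorer passes a 1-d array.
brier_scorer = make_scorer(brier_score_loss, greater_is_better=False, needs_proba=True)
gc = GridSearchCV(random_forest, param_grid={"n_estimators": [5, 10]}, scoring=brier_scorer)
gc.fit(X, y)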
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
log=LogisticRegression()
print(x_train.shape)  # (5, 13)
print(x_test.shape)   # (3, 13)
print(y_train.shape)  # (5,)
print(y_test.shape)   # (3,)
log.fit(x_train,y_train)
Please see below.
I followed the code from YouTube and other internet sources, and with the above code I get the following error. Please help me out.
Error:
ValueError Traceback (most recent call last)
<ipython-input-16-86c1075a1e93> in <module>
----> 1 log.fit(x_train,y_train)
/srv/conda/lib/python3.6/site-packages/sklearn/linear_model/logistic.py in fit(self, X, y, sample_weight)
1287 X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order="C",
1288 accept_large_sparse=solver != 'liblinear')
-> 1289 check_classification_targets(y)
1290 self.classes_ = np.unique(y)
1291 n_samples, n_features = X.shape
/srv/conda/lib/python3.6/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
169 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
170 'multilabel-indicator', 'multilabel-sequences']:
--> 171 raise ValueError("Unknown label type: %r" % y_type)
172
173
ValueError: Unknown label type: 'continuous'
Logistic regression is a statistical method for predicting binary classes: the dependent (target) variable must be binary. In your case you have "continuous" targets, which is exactly what the error is complaining about (one way to make the target usable is sketched after the list below).
Types of Logistic Regression:
Binary Logistic Regression: the target variable has only two possible outcomes.
Multinomial Logistic Regression: the target variable has three or more nominal categories.
Ordinal Logistic Regression: the target variable has three or more ordinal categories (example: product rating from 1 to 5).
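If your target column really is numeric, a minimal sketch of one way to make it usable with LogisticRegression is to discretize it into classes first (the median threshold below is only an illustrative assumption; pick a cutoff that makes sense for your data, or switch to a regression model if the target is genuinely continuous):
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Turn the continuous target into two classes; the threshold is illustrative only.
y_binary = (y > np.median(y)).astype(int)
x_train, x_test, y_train, y_test = train_test_split(x, y_binary, test_size=0.3, random_state=0)
log = LogisticRegression()
log.fit(x_train, y_train)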
I'm borrowing an idea here from the documentation to use RBMs + Logistic regression for classification.
However, I'm getting an error that should not be thrown, since all entries in my data matrix are numerical.
Code:
from sklearn import preprocessing, cross_validation
from scipy.ndimage import convolve
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn import linear_model, datasets, metrics
import numpy as np
# create fake dataset
data, labels = datasets.make_classification(n_samples=250000)
data = preprocessing.scale(data)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, labels, test_size=0.7, random_state=0)
# print details
print X_train.shape, X_test.shape, y_train.shape, y_test.shape
print np.max(X_train)
print np.min(X_train)
print np.mean(X_train, axis=0)
print np.std(X_train, axis=0)
if np.sum(np.isnan(X_train)) or np.sum(np.isnan(X_test)):
print "NaN found!"
if np.sum(np.isnan(y_train)) or np.sum(np.isnan(y_test)):
print "NaN found!"
if np.sum(np.isinf(X_train)) or np.sum(np.isinf(X_test)):
print "Inf found!"
if np.sum(np.isinf(y_train)) or np.sum(np.isinf(y_test)):
print "Inf found!"
# train and test
logistic = linear_model.LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
# Training RBM-Logistic Pipeline
classifier.fit(X_train, y_train)
# Training Logistic regression
logistic_classifier = linear_model.LogisticRegression(C=100.0)
logistic_classifier.fit(X_train, y_train)
print("Logistic regression using RBM features:\n%s\n" % (
metrics.classification_report(
y_test,
classifier.predict(X_test))))
Output:
(73517, 3) (171540, 3) (73517,) (171540,)
2.0871168057
-2.21062647188
[-0.00237028 -0.00104526 0.00330683]
[ 0.99907225 0.99977328 1.00225843]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/lib/python2.7/dist-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
173 else:
174 filename = fname
--> 175 __builtin__.execfile(filename, *where)
/home/test.py in <module>()
75
76 # Training RBM-Logistic Pipeline
---> 77 classifier.fit(X_train, y_train)
78
79 # Training Logistic regression
/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
128 data, then fit the transformed data using the final estimator.
129 """
--> 130 Xt, fit_params = self._pre_transform(X, y, **fit_params)
131 self.steps[-1][-1].fit(Xt, y, **fit_params)
132 return self
/usr/local/lib/python2.7/dist-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
118 for name, transform in self.steps[:-1]:
119 if hasattr(transform, "fit_transform"):
--> 120 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
121 else:
122 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
/usr/local/lib/python2.7/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
409 else:
410 # fit method of arity 2 (supervised transformation)
--> 411 return self.fit(X, y, **fit_params).transform(X)
412
413
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in fit(self, X, y)
304
305 for batch_slice in batch_slices:
--> 306 pl_batch = self._fit(X[batch_slice], rng)
307
308 if verbose:
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in _fit(self, v_pos, rng)
245
246 if self.verbose:
--> 247 return self.score_samples(v_pos)
248
249 def score_samples(self, v):
/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/rbm.pyc in score_samples(self, v)
268 fe_ = self._free_energy(v_)
269
--> 270 return v.shape[1] * logistic_sigmoid(fe_ - fe, log=True)
271
272 def fit(self, X, y=None):
/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.pyc in logistic_sigmoid(X, log, out)
498 """
499 is_1d = X.ndim == 1
--> 500 X = array2d(X, dtype=np.float)
501
502 n_samples, n_features = X.shape
/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.pyc in array2d(X, dtype, order, copy, force_all_finite)
91 X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
92 if force_all_finite:
---> 93 _assert_all_finite(X_2d)
94 if X is X_2d and copy:
95 X_2d = safe_copy(X_2d)
/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
25 if (X.dtype.char in np.typecodes['AllFloat'] and not np.isfinite(X.sum())
26 and not np.isfinite(X).all()):
---> 27 raise ValueError("Array contains NaN or infinity.")
28
29
ValueError: Array contains NaN or infinity.
There are no infs or NaNs in the data matrix... what could be causing this behaviour?
EDIT: Apparently I'm not the only one.
This looks like a numerical stability bug in RBMs. Can you please open a github issue with your script in it?
Edit: by the way, if you are interested, you can try to find the source of the issue by adding np.isfinite() checks in the inner loops of the _fit method of the BernoulliRBM class.
This issue is usually caused by two factors: incorrect initial scaling of the data and a learning rate that is too high. First, the input data needs to be bounded between 0 and 1; remember that RBMs were originally designed for binary data only. Second, the learning rate may be too high; defaults for RBM code are often based on the MNIST digit recognition dataset, which can handle larger learning rates.
So I would trust sklearn's implementation, but not the stability of the algorithm on a new dataset with default values that don't fit it. Adding checks for infinity won't help; you will still need to tweak the learning rates.
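A minimal sketch of those two fixes, assuming the pipeline from the question (the 0.06 learning rate is only an illustrative starting point, not a recommended value):
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import BernoulliRBM
# Bound every feature between 0 and 1 instead of standardizing to zero mean.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_01 = scaler.fit_transform(X_train)
X_test_01 = scaler.transform(X_test)
# Use a learning rate smaller than the default of 0.1; tune it for your data.
rbm = BernoulliRBM(learning_rate=0.06, random_state=0, verbose=True)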
This is why deep learning is said to be a bit of an art: you probably also need to play around with the number of Gibbs samples, the size of the minibatch, and the amount of momentum. Don't give up though; the rewards are mostly worth it. Further reading