I am running into an error.
I am trying to predict some data with a regression tree model. The model has a low score, so I want to select the most important features.
For this I am using sklearn's SelectKBest, but I get the following error.
How can I solve it?
Read Data
import pandas as pd

data = pd.read_csv("EquiposData.csv")
target = data.iloc[:, 1:2]
datos = data.iloc[:, 2:]
SelectKBest
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression  # computes the best selection
selector = SelectKBest(mutual_info_regression, k=4)
selector.fit(datos,target)
scores = selector.scores_
AttributeError Traceback (most recent call last)
<ipython-input-341-7d9675b4a1f7> in <module>()
4
5 selector = SelectKBest(mutual_info_regression, k=4)
----> 6 selector.fit(datos,target)
7 scores = selector.scores_
/usr/local/lib/python3.6/dist-packages/sklearn/feature_selection/_univariate_selection.py in fit(self, X, y)
342 self : object
343 """
--> 344 X, y = self._validate_data(X, y, accept_sparse=['csr', 'csc'],
345 multi_output=True)
346
AttributeError: 'SelectKBest' object has no attribute '_validate_data'
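This AttributeError usually points to a broken or mixed scikit-learn installation: the feature_selection module being imported is newer than the installed base estimator classes (the private _validate_data helper only exists from scikit-learn 0.23 onward). A minimal sketch of the usual fix, assuming a version mismatch is the cause; reinstall first, then refit with a 1-D target:
# In a terminal (or a notebook cell prefixed with !), reinstall cleanly:
#   pip install --upgrade --force-reinstall scikit-learn
# Then restart the kernel and rerun:
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

data = pd.read_csv("EquiposData.csv")
target = data.iloc[:, 1]          # a 1-D Series, not a one-column DataFrame
datos = data.iloc[:, 2:]

selector = SelectKBest(mutual_info_regression, k=4)
selector.fit(datos, target)
scores = selector.scores_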
Related
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score
gnb = GaussianNB()
gnb.fit(X_train, y_train)
I'm getting an AttributeError when I try to train my model using the Gaussian Naive Bayes algorithm. I tried MultinomialNB and BernoulliNB as well, but I'm receiving the same error.
This is the error message I received:
AttributeError Traceback (most recent call last)
Cell In[290], line 2
1 #training Guassian Naive Bayes model
----> 2 gnb.fit(X_train,y_train)
3 y_pred = mnb.predict(X_test)
File ~\anaconda3\envs\NLP\lib\site-packages\sklearn\naive_bayes.py:265, in GaussianNB.fit(self, X, y, sample_weight)
242 def fit(self, X, y, sample_weight=None):
243 """Fit Gaussian Naive Bayes according to X, y.
244
245 Parameters
(...)
263 Returns the instance itself.
264 """
--> 265 self._validate_params()
266 y = self._validate_data(y=y)
267 return self._partial_fit(
268 X, y, np.unique(y), _refit=True, sample_weight=sample_weight
269 )
AttributeError: 'GaussianNB' object has no attribute '_validate_params'
Could someone kindly help me solve this?
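As in the question above, this usually means the scikit-learn files inside the environment are mixed between versions: naive_bayes.py comes from a release that calls the _validate_params helper (added around scikit-learn 1.2), while the installed base classes predate it. A hedged sketch of the usual remedy, run inside the affected conda environment (NLP in the traceback):
# pip uninstall scikit-learn
# pip install --upgrade scikit-learn
# then restart the kernel and retrain:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)   # note: predict with gnb here, not mnb as in the traceback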
I am trying to select the best categorical features for a classification problem with chi2 and SelectKBest. I've sorted out the categorical columns into df_cat_kbest.
I separated the features and target like this and fit them to SelectKBest:
from sklearn.feature_selection import chi2, SelectKBest
X, y = df_cat_kbest.iloc[:, :-1], df_cat_kbest.iloc[:, -1]
selector = SelectKBest(score_func=chi2, k=3).fit_transform(X, y)
When I run it, I am getting the error:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13272\2211654466.py in <module>
----> 1 selector = SelectKBest(score_func=chi2, k=3).fit_transform(X, y)
E:\Anaconda\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
853 else:
854 # fit method of arity 2 (supervised transformation)
--> 855 return self.fit(X, y, **fit_params).transform(X)
856
857
...
...
E:\Anaconda\lib\site-packages\pandas\core\generic.py in __array__(self, dtype)
1991
1992 def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
-> 1993 return np.asarray(self._values, dtype=dtype)
1994
1995 def __array_wrap__(
ValueError: could not convert string to float: 'Self_emp_not_inc'
As far as I know, I can apply chi-square to categorical columns. Here, all the features are categorical, as is the target. Then why is it saying it can't convert string to float?
Encoding the features would do the job. For example:
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.pipeline import make_pipeline

X, y = df_cat_kbest.iloc[:, :-1], df_cat_kbest.iloc[:, -1]
selector = make_pipeline(OneHotEncoder(drop='first'), SelectKBest(score_func=chi2, k=3)).fit_transform(X, y)
We have added a pre-processor: one-hot encoding. You can choose another encoding; the bottom line is that you need to transform your objects into numerical data ;)
There are other encoders in the scikit-learn-contrib package category_encoders that might fit your needs.
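If you want the chi2 scores to map one-to-one onto the original columns (one-hot encoding expands each feature into several), an ordinal encoding is a pragmatic alternative; note that chi2 treats the encoded values as counts, so this is a shortcut rather than a strict chi-square test. A small sketch, reusing df_cat_kbest from the question:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import chi2, SelectKBest

X, y = df_cat_kbest.iloc[:, :-1], df_cat_kbest.iloc[:, -1]

# Map each category to a non-negative integer so chi2 accepts it;
# one encoded column per original feature keeps the scores interpretable.
X_enc = OrdinalEncoder().fit_transform(X)

selector = SelectKBest(score_func=chi2, k=3).fit(X_enc, y)
print(selector.scores_)   # one score per original categorical column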
This is the link to the website I've been referring to: https://www.analyticsvidhya.com/blog/2020/11/create-your-own-movie-movie-recommendation-system/
This is my code to remove the sparsity:
from scipy.sparse import csr_matrix
sample = np.array([[0,0,3,0,0],[4,0,0,0,2],[0,0,0,0,1]])
sparsity = 1.0 - ( np.count_nonzero(sample) / float(sample.size) )
print(sparsity)
csr_sample = csr_matrix(sample)
print(csr_sample)
csr_data = csr_matrix(data.values.astype(np.float64))  # np.float is removed in newer NumPy
data.reset_index(inplace=True)
This is where I'm getting an error:
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
knn.fit(csr_data)  # error
This is the error that I'm getting:
ValueError Traceback (most recent call last)
<ipython-input-270-4b5ddca2edd0> in <module>()
2 from sklearn.neighbors import KNeighborsClassifier
3 knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
----> 4 knn.fit(csr_data)
5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
114 raise ValueError(
115 msg_err.format(
--> 116 type_err, msg_dtype if msg_dtype is not None else X.dtype
117 )
118 )
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
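The message means data still contains NaN values when it is converted to floats; in this tutorial the pivoted ratings table has a NaN wherever a user has not rated a movie. A minimal sketch of the usual fix, assuming data is that pivoted DataFrame: fill the missing entries before building the sparse matrix.
import numpy as np
from scipy.sparse import csr_matrix

# Replace unrated (NaN) entries with 0 before converting;
# csr_matrix then stores only the non-zero ratings.
data = data.fillna(0)
csr_data = csr_matrix(data.values.astype(np.float64))
data.reset_index(inplace=True)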
I am doing feature selection with the chi-square method in Python, but I am having trouble with the last block of code. My dataset is composed only of categorical variables: the columns are variable names, and each observation's rows contain 'yes' or 'no'.
Here is the error:
TypeError Traceback (most recent call last)
<ipython-input-14-3b4cc24c7499> in <module>
1 fs = SelectKBest(score_func=chi2, k='all')
----> 2 fs.fit(X_train, y_train)
3 X_train_fs = fs.transform(X_train)
4 X_test_fs = fs.transform(X_test)
~\Documents\Nueva carpeta\lib\site-packages\sklearn\feature_selection\univariate_selection.py in fit(self, X, y)
347
348 self._check_params(X, y)
--> 349 score_func_ret = self.score_func(X, y)
350 if isinstance(score_func_ret, (list, tuple)):
351 self.scores_, self.pvalues_ = score_func_ret
~\Documents\Nueva carpeta\lib\site-packages\sklearn\feature_selection\univariate_selection.py in chi2(X, y)
213 # numerical stability.
214 X = check_array(X, accept_sparse='csr')
--> 215 if np.any((X.data if issparse(X) else X) < 0):
216 raise ValueError("Input X must be non-negative.")
217
TypeError: '<' not supported between instances of 'numpy.ndarray' and 'int'
And here is the code that I am currently using:
# example of chi squared feature selection for categorical data
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from matplotlib import pyplot
# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:, -1]
    # format all fields as string
    X = X.astype(str)
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=chi2, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs  # This is the block where the problem arises.
Thanks in advance
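For what it's worth, the comparison X < 0 inside chi2 raises this TypeError when X still holds strings or object arrays, which usually means select_features was called with the raw arrays instead of the encoded ones. A hedged sketch of the intended call order, assuming the usual train/test split for this kind of script (the file name is a placeholder):
# load the dataset (placeholder file name)
X, y = load_dataset('data.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# encode inputs and target first
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# only then run the chi-squared selection on the encoded data
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)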
I am trying to build a model for the LasVegasTripAdvisorReviews-Dataset
using a bagging algorithm,
and I get an error (Multilabel and multi-output classification is not supported).
Can you please help me and tell me how to solve the error?
Regards.
The attachment contains a link to the Las Vegas dataset: LasVegasTripAdvisorReviews-Dataset
# Voting Ensemble for Classification
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier,GradientBoostingClassifier,AdaBoostClassifier,RandomForestClassifier
url = "h:/LasVegasTripAdvisorReviews-Dataset.csv"
names = ['User country','Nr. reviews','Nr. hotel reviews','Helpful votes','Period of stay','Traveler type','Pool','Gym','Tennis court','Spa','Casino','Free internet','Hotel name','Hotel stars','Nr. rooms','User continent','Member years','Review month','Review weekday','Score']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,:]
Y = array[:,:]
seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
# create the sub models
estimators = []
model1 = AdaBoostClassifier()
estimators.append(('AdaBoost', model1))
model2 = GradientBoostingClassifier()
estimators.append(('GradientBoosting', model2))
model3 = RandomForestClassifier()
estimators.append(('RandomForest', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())
Stacktrace:
NotImplementedError Traceback (most recent call last)
<ipython-input-9-bda887b4022f> in <module>
27 # create the ensemble model
28 ensemble = VotingClassifier(estimators)
---> 29 results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold)
30 print(results.mean())
/usr/local/lib/python3.5/dist-packages/sklearn/model_selection/_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
400 fit_params=fit_params,
401 pre_dispatch=pre_dispatch,
--> 402 error_score=error_score)
403 return cv_results['test_score']
404
...
...
NotImplementedError: Multilabel and multi-output classification is not supported.
You have the lines:
X = array[:,:]
Y = array[:,:]
Meaning that your feature matrix (X) and target vector (Y) are the same.
You need to choose only one column to be your Y.
For example, let's suppose you want your last column to be Y.
Then you should change the above lines to this:
X = array[:, :-1]
Y = array[:, -1]
This should solve the error you got. The error you have basically means: I don't support more than one column in Y.
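One more caveat, in case a newer scikit-learn is in use: since version 0.24, passing random_state to KFold without enabling shuffling raises a ValueError, so the k-fold line would also need:
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)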