Found input variables with inconsistent numbers of samples: [4, 1] [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
This is what I did; the code is below. I have the music.csv dataset.
The error is: Found input variables with inconsistent numbers of samples: [4, 1]. The full traceback follows the code.
# importing Data
import pandas as pd
music_data = pd.read_csv('music.csv')
music_data
# split into training and testing- nothing to clean
# genre = predictions
# Inputs are age and gender and output is genre
# method=drop
X = music_data.drop(columns=['genre']) # has everything but genre
# X= INPUT
Y = music_data['genre'] # only genre
# Y=OUTPUT
# now select algorithm
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier() # model
model.fit(X, Y)
prediction = model.predict([[21, 1]])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # 20% of data = testing
# first two input other output
model.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, predictions)
Then this error appears. It is a ValueError:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_28312/3992581865.py in <module>
5 model.fit(X_train, y_train)
6 from sklearn.metrics import accuracy_score
----> 7 score = accuracy_score(y_test, predictions)
c:\users\shrey\appdata\local\programs\python\python39\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
c:\users\shrey\appdata\local\programs\python\python39\lib\site-packages\sklearn\metrics\_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
200
201 # Compute accuracy for each possible representation
--> 202 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
203 check_consistent_length(y_true, y_pred, sample_weight)
204 if y_type.startswith('multilabel'):
c:\users\shrey\appdata\local\programs\python\python39\lib\site-packages\sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
81 y_pred : array or indicator matrix
82 """
---> 83 check_consistent_length(y_true, y_pred)
84 type_true = type_of_target(y_true)
85 type_pred = type_of_target(y_pred)
c:\users\shrey\appdata\local\programs\python\python39\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
317 uniques = np.unique(lengths)
318 if len(uniques) > 1:
--> 319 raise ValueError("Found input variables with inconsistent numbers of"
320 " samples: %r" % [int(l) for l in lengths])
321
ValueError: Found input variables with inconsistent numbers of samples: [4, 1]
Please help me. I don't know what's happening, but I think it has to do with this line: score = accuracy_score(y_test, predictions).

After splitting, the test data has four entries (rows), which means y_test has a length of 4.
When you predict on [[21, 1]], you are predicting on just one row, so prediction has a length of 1.
That's why you get the inconsistent number of samples error.
You can fix this by predicting on X_test:
prediction = model.predict(X_test)
If you want to predict on new data instead, you have to keep the targets (y_test) and the input features (X_test) separate, with one target per input row, and then make predictions.
For example, if the target for [21, 1] is 2:
prediction = model.predict([[21, 1]])
y_test = [2]  # note: this depends on what the corresponding target label is
score = accuracy_score(y_test, prediction)
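To make the length requirement concrete, here is a minimal sketch in plain Python. The accuracy helper is hypothetical (it is not scikit-learn's accuracy_score) and the labels are made up; it only mimics the length check that appears at the bottom of the traceback in the question.

```python
def accuracy(y_true, y_pred):
    """Hypothetical stand-in for an accuracy metric: fraction of matching labels."""
    lengths = [len(y_true), len(y_pred)]
    # Same idea as the check in the traceback: every input needs one entry per sample.
    if len(set(lengths)) > 1:
        raise ValueError(
            "Found input variables with inconsistent numbers of samples: %r" % lengths
        )
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels for a 4-row test set.
y_test = ["HipHop", "Jazz", "Dance", "Jazz"]

# A single prediction for one hand-written row: lengths 4 and 1 do not match.
single_prediction = ["HipHop"]
try:
    accuracy(y_test, single_prediction)
    message = None
except ValueError as err:
    message = str(err)

# One prediction per test row: lengths 4 and 4 match, so the score is defined.
test_predictions = ["HipHop", "Jazz", "Jazz", "Jazz"]
score = accuracy(y_test, test_predictions)

print(message)
print(score)
```

The first call fails for the same reason as in the question, while the second succeeds because there is exactly one prediction per test label.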

You need to recompute your predictions variable after the train/test split:
predictions = model.predict(X_test)

Related

LogisticRegression not iterating through combinations of features in a dataframe to find the best combination

I wrote a function to find the best combination of given dataframe features, f1 score, and auc score using LogisticRegression. The problem is that when I try to pass a list of dataframes combinations, using itertools combinations, LogisticRegression doesn't recognize each combination as its own X variable/ dataframe.
I'm starting with a dataframe of 10 feature columns and 10k rows. When I run the below code I get a "ValueError: X has 10 features, but LogisticRegression is expecting 1 features as input".
def find_best_combination(X, y):
    #initialize variables
    best_f1 = 0
    best_auc = 0
    best_variables = []
    # get all possible combinations of variables
    for i in range(1, X.shape[1]):
        for combination in combinations(X.columns, i):
            X_subset = X[list(combination)]
            logreg = LogisticRegression()
            logreg.fit(X_subset, y)
            y_pred = logreg.predict(X_subset)
            f1 = f1_score(y, y_pred)
            auc = roc_auc_score(y, logreg.predict_proba(X)[:,1])
            # evaluate performance on current combination of variables
            if f1> best_f1 and auc > best_auc:
                best_f1 = f1
                best_auc = auc
                best_variables = combination
    return best_variables, best_f1, best_auc
and the error
C:\Users\katurner\Anaconda3\lib\site-packages\sklearn\base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names unseen at fit time:
- IBE1273_01_11.0
- IBE1273_01_6.0
- IBE7808
- IBE8439_2.0
- IBE8557_7.0
- ...
warnings.warn(message, FutureWarning)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\2\ipykernel_15932\895415673.py in <module>
----> 1 best_combo = ml.find_best_combination(X,lg_y)
2 best_combo
~\Documents\Arcadia\modeling_library.py in find_best_combination(X, y)
176 # print(y_test)
177 f1 = f1_score(y, y_pred)
--> 178 auc = roc_auc_score(y, logreg.predict_proba(X)[:,1])
179 # evaluate performance on current combination of variables
180 if f1> best_f1 and auc > best_auc:
~\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py in predict_proba(self, X)
1309 )
1310 if ovr:
-> 1311 return super()._predict_proba_lr(X)
1312 else:
1313 decision = self.decision_function(X)
~\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in _predict_proba_lr(self, X)
459 multiclass is handled by normalizing that over all classes.
460 """
--> 461 prob = self.decision_function(X)
462 expit(prob, out=prob)
463 if prob.ndim == 1:
~\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
427 check_is_fitted(self)
428
--> 429 X = self._validate_data(X, accept_sparse="csr", reset=False)
430 scores = safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_
431 return scores.ravel() if scores.shape[1] == 1 else scores
~\Anaconda3\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
598
599 if not no_val_X and check_params.get("ensure_2d", True):
--> 600 self._check_n_features(X, reset=reset)
601
602 return out
~\Anaconda3\lib\site-packages\sklearn\base.py in _check_n_features(self, X, reset)
398
399 if n_features != self.n_features_in_:
--> 400 raise ValueError(
401 f"X has {n_features} features, but {self.__class__.__name__} "
402 f"is expecting {self.n_features_in_} features as input."
ValueError: X has 10 features, but LogisticRegression is expecting 1 features as input.
I'm expecting the function to return the best combination as best_variables, along with the associated best_f1 and best_auc.
I've also tried running the function with train_test_split. When I add train_test_split to the code, as shown below, the function does run but returns "[], 0, 0" for best_variables, best_f1, best_auc.
def find_best_combination(X, y):
    #initialize variables
    best_f1 = 0
    best_auc = 0
    best_variables = []
    # get all possible combinations of variables
    for i in range(1, X.shape[1]):
        for combination in combinations(X.columns, i):
            X_subset = X[list(combination)]
            X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.2, stratify=y, random_state=73)
            logreg = LogisticRegression()
            logreg.fit(X_train, y_train)
            y_pred = logreg.predict(X_test)
            f1 = f1_score(y_test, y_pred)
            auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])
            # evaluate performance on current combination of variables
            if f1> best_f1 and auc > best_auc:
                best_f1 = f1
                best_auc = auc
                best_variables = combination
    return best_variables, best_f1, best_auc
I'm not sure what's going on under the hood of train_test_split that lets the function iterate through the combinations without raising the earlier error.
I hope this explains it enough. Thanks in advance for any help.

How to generate confusion matrix?

I have a school project on deep learning face recognition. I need a confusion matrix to compute performance metrics such as accuracy and precision. I tried the following code for this. However, the y_test parameter gives an error. How can I solve this?
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(img_array, img_labels,
                                                    shuffle=True, stratify=img_labels,
                                                    test_size=0.1, random_state=42)
print('Number of training samples, height/width, and number of channels: ', x_train.shape)
print('Number of test samples, height/width, and number of channels: ', x_test.shape)
print('Number of samples and classes in training: ', y_train.shape)
print('Number of samples and classes in test: ', y_test.shape)
My code:
cm = confusion_matrix(y_test, y_pred)
print(cm)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [55], in <cell line: 1>()
----> 1 cm = confusion_matrix(y_test, y_pred)
2 print(cm)
File ~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:307, in confusion_matrix(y_true, y_pred, labels, sample_weight, normalize)
222 def confusion_matrix(
223 y_true, y_pred, *, labels=None, sample_weight=None, normalize=None
224 ):
225 """Compute confusion matrix to evaluate the accuracy of a classification.
226
227 By definition a confusion matrix :math:`C` is such that :math:`C_{i, j}`
(...)
305 (0, 2, 1, 1)
306 """
--> 307 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
308 if y_type not in ("binary", "multiclass"):
309 raise ValueError("%s is not supported" % y_type)
File ~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:93, in _check_targets(y_true, y_pred)
90 y_type = {"multiclass"}
92 if len(y_type) > 1:
---> 93 raise ValueError(
94 "Classification metrics can't handle a mix of {0} and {1} targets".format(
95 type_true, type_pred
96 )
97 )
99 # We can't have more than one value on y_type => The set is no more needed
100 y_type = y_type.pop()
ValueError: Classification metrics can't handle a mix of multilabel-indicator and continuous-multioutput targets

I know this should be a comment rather than an answer, but I am not able to add comments right now.
confusion_matrix expects both y_test and y_pred to be 1-D arrays of class labels (for example, integers).
The prediction of a TensorFlow model is usually a 2-D array in which each row is a 1-D array of class probabilities for one sample. So you need to do some preprocessing on y_pred.
I came across something similar a few weeks ago, so I am sharing a few lines of code that may be helpful.
res = np.array(res)
res = res.flatten()
res = np.round(res)
Please note that the above code is for binary classification. For multiclass classification, you can use np.argmax instead.
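As a rough illustration of that preprocessing step, the sketch below converts rows of scores into class labels and then counts the (true, predicted) pairs. It assumes, based only on the error message, that y_test is one-hot encoded and y_pred holds per-class probabilities. All data here is made up, and to_labels is a hypothetical helper written in plain Python so the example runs without dependencies; with NumPy the same conversion would be np.argmax(a, axis=1).

```python
from collections import Counter


def to_labels(rows):
    """Return the index of the largest value in each row (a plain-Python argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Made-up data: one-hot true labels and per-class probabilities for 3 classes.
y_test_onehot = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred_proba = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]]

# Both inputs become 1-D lists of integer class labels.
y_true = to_labels(y_test_onehot)
y_pred = to_labels(y_pred_proba)

# Count (true, predicted) pairs; rows are true classes, columns are predictions.
counts = Counter(zip(y_true, y_pred))
n_classes = 3
cm = [[counts[(t, p)] for p in range(n_classes)] for t in range(n_classes)]

for row in cm:
    print(row)
```

Once both arrays are 1-D label arrays like y_true and y_pred here, they are of the same target type, which is what the metric functions check before computing anything.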

Using forestci to create error bars for random forest regression algorithms

I am using a program called GALPRO, which implements a random forest regression algorithm, to predict photometric redshift estimates. I provide training and testing data: x_train (dimensions = [90,13]), x_test (dimensions = [10,13]), y_train (dimensions = [90,2]) and y_test (dimensions = [10,2]).
The code below shows how GALPRO does the random forest regression calculation:
model = RandomForestRegressor(**self.params)
model.fit(x_train, y_train)
I then make point estimate predictions using:
# Use the model to make predictions on new objects
y_pred = model.predict(x_test)
I am then trying to create error estimates using random_forest_error from the forestci package:
y_error = fci.random_forest_error(model, x_train, x_test)
However I get an error:
ValueError Traceback (most recent call last)
/tmp/ipykernel_2626600/1096083143.py in <module>
----> 1 point_estimates = model.point_estimate(save_estimates=True, make_plots=False)
2 print(point_estimates)
/scratch/wiay/lara/galpro/galpro/model.py in point_estimate(self, save_estimates, make_plots)
158 # Use the model to make predictions on new objects
159 y_pred = self.model.predict(self.x_test)
--> 160 y_error = fci.random_forest_error(self.model, self.x_train, self.x_test)
161
162 # Update class variables
~/.local/lib/python3.7/site-packages/forestci/forestci.py in random_forest_error(forest, X_train, X_test, inbag, calibrate, memory_constrained, memory_limit)
279 n_trees = forest.n_estimators
280 V_IJ = _core_computation(
--> 281 X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit
282 )
283 V_IJ_unbiased = _bias_correction(V_IJ, inbag, pred_centered, n_trees)
~/.local/lib/python3.7/site-packages/forestci/forestci.py in _core_computation(X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit, test_mode)
135 """
136 if not memory_constrained:
--> 137 return np.sum((np.dot(inbag - 1, pred_centered.T) / n_trees) ** 2, 0)
138
139 if not memory_limit:
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (90,100) and (100,10,2) not aligned: 100 (dim 1) != 10 (dim 1)
I'm not sure what this error means or why my dimensions are wrong as I am following a similar example. If anyone has any ideas please let me know!
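For readers puzzling over the message itself: np.dot contracts the last axis of its first argument with the second-to-last axis of its second argument. The sketch below reproduces only that shape rule in plain Python; dot_result_shape is a hypothetical helper, not part of NumPy or forestci. One plausible reading, which is an assumption and not confirmed by the source, is that the trailing 2 in (100, 10, 2) comes from y_train having two output columns, while the computation expects a 2-D array of per-tree predictions.

```python
def dot_result_shape(a_shape, b_shape):
    """Hypothetical helper: shape of np.dot(a, b) for a.ndim >= 1 and b.ndim >= 2.

    np.dot sums over the last axis of `a` and the second-to-last axis of `b`,
    so those two axis lengths must be equal.
    """
    if a_shape[-1] != b_shape[-2]:
        raise ValueError(
            "shapes %s and %s not aligned: %d != %d"
            % (a_shape, b_shape, a_shape[-1], b_shape[-2])
        )
    return tuple(a_shape[:-1]) + tuple(b_shape[:-2]) + tuple(b_shape[-1:])


# 2-D second argument: 100 matches 100, so the product is defined.
single_output = dot_result_shape((90, 100), (100, 10))

# 3-D second argument from the traceback: 100 is compared with 10 and fails.
try:
    dot_result_shape((90, 100), (100, 10, 2))
    multi_output_error = None
except ValueError as err:
    multi_output_error = str(err)

print(single_output)
print(multi_output_error)
```

Under that assumption, the same shapes would line up if the second array were 2-D, for instance when error estimates are computed for one output column at a time.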

Does classification report on sklearn require same length for both input x and y?

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
I'm getting an error with sklearn classification report.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-30-6a63be1ce4c8> in <module>
----> 1 classification_report(y_test, predictions)
/usr/local/lib/python3.7/site-packages/sklearn/metrics/classification.py in classification_report(y_true, y_pred, labels, target_names, sample_weight, digits, output_dict)
1522 """
1523
-> 1524 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
1525
1526 labels_given = True
/usr/local/lib/python3.7/site-packages/sklearn/metrics/classification.py in _check_targets(y_true, y_pred)
69 y_pred : array or indicator matrix
70 """
---> 71 check_consistent_length(y_true, y_pred)
72 type_true = type_of_target(y_true)
73 type_pred = type_of_target(y_pred)
/usr/local/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
233 if len(uniques) > 1:
234 raise ValueError("Found input variables with inconsistent numbers of"
--> 235 " samples: %r" % [int(l) for l in lengths])
236
237
ValueError: Found input variables with inconsistent numbers of samples: [360, 144]
This is the only thing I'm passing in, and y_test.shape is (360,) and predictions.shape is (144,).
classification_report(y_test, predictions)
Do they need to be the same length? (I'm assuming so because of that second stack trace.) If so, how can the length of X and Y be the same when you split your data? Wouldn't they always have different lengths?

It seems like there's a bit of a misunderstanding here about the stats/ML data splitting framework.
As you suspected, y_test and predictions need to be the same length; let's call it k. Why? Because we need k testing examples ((x, y) pairs) to test the model. X_test and y_test are each k entries long. (Each entry x in X_test may have several features, but it counts as one record.) For each x in X_test, we make a prediction about its label. Then, to compute a metric like classification accuracy, we compare the predicted label to the true label for each testing example.
You asked: "If so, how can the length of X and Y be the same when you split your data?"
Peek at the API of sklearn.model_selection.train_test_split. You'd call it something like this:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
What this shows is that X_test and y_test will have the same number of records in them; they always contain the same number of rows, by design. Then for each entry in X_test, you make a prediction using your model. It'll be paired with the corresponding entry in y_test, and that's how you can compute your classification score.
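The pairing described above can be sketched in plain Python. The simple_split function below is a hypothetical, much simplified stand-in for train_test_split (it is not scikit-learn's implementation), and the data and the "model" are made up purely to show that the lengths line up.

```python
import random


def simple_split(X, y, test_size=0.33, seed=42):
    """Hypothetical, simplified split: shuffle row indices once, then slice X and y alike."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up data: 10 records with two features each, and one label per record.
X = [[age, age % 2] for age in range(20, 30)]
y = [age % 2 for age in range(20, 30)]

X_train, X_test, y_train, y_test = simple_split(X, y, test_size=0.3)

# A stand-in "model": predict the label from the second feature of each test row.
predictions = [row[1] for row in X_test]

print(len(X_test), len(y_test), len(predictions))
```

Because the same row indices are used to slice both X and y, every test row keeps its own label, and producing one prediction per test row gives a predictions list that is exactly as long as y_test.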

ValueError: Unknown label type: 'unknown' while using KNN

I am new to Python and trying to run KNN, but when I run the code I get the error ValueError: Unknown label type: 'unknown'.
I have encoded all the categorical data and dropped the columns I don't need to avoid the dummy variable trap.
What else do I need to do to fix this?
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import fbeta_score
training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(x_train, y_train)
    train_pred=knn.predict(x_train)
    test_pred=knn.predict(x_test)
    training_accuracy.append(fbeta_score(y_train, train_pred, beta=1))
    test_accuracy.append(fbeta_score(y_test, test_pred, beta=1))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.savefig('knn_compare_model')
I expect a graph showing the test and training accuracy, but I get the error below:
ValueError Traceback (most recent call last)
<ipython-input-22-8a3a1f3c5c24> in <module>
11 # build the model
12 knn = KNeighborsClassifier(n_neighbors=n_neighbors)
---> 13 knn.fit(x_train, y_train)
14
15 # if accuracy of prediction on training set is high but it is low on test set: So overfitting
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\neighbors\base.py in fit(self, X, y)
903 self.outputs_2d_ = True
904
--> 905 check_classification_targets(y)
906 self.classes_ = []
907 self._y = np.empty(y.shape, dtype=np.int)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
169 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
170 'multilabel-indicator', 'multilabel-sequences']:
--> 171 raise ValueError("Unknown label type: %r" % y_type)
172
173
ValueError: Unknown label type: 'unknown'

Your y_train could be of object dtype, which can cause this error. Try adding the line
y_train = y_train.astype('int')
before
knn.fit(x_train, y_train)
Also do the same with your y_test.
