How to generate confusion matrix? - python

I have a school project on face recognition with deep learning. I need a confusion matrix to compute performance metrics such as accuracy and precision. I tried the following code, but the call that takes y_test raises an error. How can I solve this?
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(img_array, img_labels,
shuffle=True, stratify=img_labels,
test_size=0.1, random_state=42)
print('Training set: number of samples, height/width and number of channels: ', x_train.shape)
print('Test set: number of samples, height/width and number of channels: ', x_test.shape)
print('Number of samples and classes in the training set: ', y_train.shape)
print('Number of samples and classes in the test set: ', y_test.shape)
my code
cm = confusion_matrix(y_test, y_pred)
print(cm)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [55], in <cell line: 1>()
----> 1 cm = confusion_matrix(y_test, y_pred)
2 print(cm)
File ~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:307, in confusion_matrix(y_true, y_pred, labels, sample_weight, normalize)
222 def confusion_matrix(
223 y_true, y_pred, *, labels=None, sample_weight=None, normalize=None
224 ):
225 """Compute confusion matrix to evaluate the accuracy of a classification.
226
227 By definition a confusion matrix :math:`C` is such that :math:`C_{i, j}`
(...)
305 (0, 2, 1, 1)
306 """
--> 307 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
308 if y_type not in ("binary", "multiclass"):
309 raise ValueError("%s is not supported" % y_type)
File ~\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:93, in _check_targets(y_true, y_pred)
90 y_type = {"multiclass"}
92 if len(y_type) > 1:
---> 93 raise ValueError(
94 "Classification metrics can't handle a mix of {0} and {1} targets".format(
95 type_true, type_pred
96 )
97 )
99 # We can't have more than one value on y_type => The set is no more needed
100 y_type = y_type.pop()
ValueError: Classification metrics can't handle a mix of multilabel-indicator and continuous-multioutput targets

I know I should not be providing this in the answer but am not able to add comments right now.
confusion_matrix (like classification_report) expects both y_test and y_pred to be 1-D arrays of class labels.
The prediction of a TensorFlow model is usually a 2-D array in which each row holds the class probabilities for one sample, so y_pred needs some preprocessing first. The error message also shows that y_test is one-hot encoded (multilabel-indicator), so it needs the same conversion.
I came across something similar a few weeks ago; here are a few lines of code that may help.
res = np.array(res)  # res is assumed to hold the raw model predictions
res = res.flatten()
res = np.round(res)  # round each probability to 0 or 1
Please note that the above code is for binary classification. For multiclass classification, you can use np.argmax with axis=1 on both arrays.
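For the multiclass case, the conversion might look like the sketch below. The arrays are small made-up stand-ins for a one-hot y_test and for the probability rows a model would return; the names y_true_labels and y_pred_labels are only illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins: one-hot true labels and per-class scores for 3 classes.
y_test = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.6, 0.3, 0.1]])

# Collapse each row to the index of its largest entry, giving 1-D label arrays.
y_true_labels = np.argmax(y_test, axis=1)
y_pred_labels = np.argmax(y_pred, axis=1)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_labels, y_pred_labels)
print(cm)
```

Once both arrays are 1-D integer labels, the same pair can be passed to accuracy_score or classification_report as well.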

Related

Classification metrics can't handle a mix of unknown and binary targets

I am trying to evaluate my xgboost model using accuracy_score(). And I have the code:
predictions = xgb_model.predict(X_test)
accuracy = accuracy_score(y_test.to_numpy(), predictions)
The error message looks like this:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [126], in <cell line: 1>()
----> 1 accuracy = accuracy_score(y_test.to_numpy(), predictions)
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:211, in accuracy_score(y_true, y_pred, normalize, sample_weight)
145 """Accuracy classification score.
146
147 In multilabel classification, this function computes subset accuracy:
(...)
207 0.5
208 """
210 # Compute accuracy for each possible representation
--> 211 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
212 check_consistent_length(y_true, y_pred, sample_weight)
213 if y_type.startswith("multilabel"):
File ~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:93, in _check_targets(y_true, y_pred)
90 y_type = {"multiclass"}
92 if len(y_type) > 1:
---> 93 raise ValueError(
94 "Classification metrics can't handle a mix of {0} and {1} targets".format(
95 type_true, type_pred
96 )
97 )
99 # We can't have more than one value on y_type => The set is no more needed
100 y_type = y_type.pop()
ValueError: Classification metrics can't handle a mix of unknown and binary targets
The parameters look like this
y_test.to_numpy()
predictions
They are all 1-d arrays and I cannot find where is the problem.
How should I calculate the accuracy score?
Thanks!
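One possible cause, offered as an assumption since the array contents are not shown: scikit-learn can report a target as "unknown" when the array has object dtype, which often happens when a pandas column holds mixed or boxed values. The sketch below uses made-up arrays to show how the dtype and detected target type could be inspected, and how casting to int might resolve it.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-ins for y_test.to_numpy() and the model predictions.
y_true_obj = np.array([0, 1, 1, 0], dtype=object)  # assumed: object dtype
predictions = np.array([0, 1, 0, 0])

# Inspect what scikit-learn sees for each array.
print(y_true_obj.dtype, type_of_target(y_true_obj))
print(predictions.dtype, type_of_target(predictions))

# Cast the labels to a numeric dtype before scoring.
y_true_int = y_true_obj.astype(int)
accuracy = accuracy_score(y_true_int, predictions)
print(accuracy)
```

If the dtype of y_test is already numeric, this is not the cause and the values themselves would need a closer look.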

"Found input variables with inconsistent numbers of samples" Have I done something wrong during the train_test_split?

I am trying to fit a logistic regression model and run some tests, but I keep getting this error. I am not sure what I have done differently from everyone else.
from sklearn import preprocessing
X = df.iloc[:,:len(df.columns)-1]
y = df.iloc[:,len(df.columns)-1]
This is how I am separating my columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
This is my train/test split
logReg = LogisticRegression(n_jobs=-1)
logReg.fit(X_train, y_train)
y_pred = logReg.predict(X_train)
mae = mean_absolute_error(y_test, y_pred)
print("MAE:" , mae)
ValueError Traceback (most recent call last)
Cell In [112], line 1
----> 1 mae = mean_absolute_error(y_test, y_pred)
2 print("MAE:" , mae)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.py:196, in mean_absolute_error(y_true, y_pred, sample_weight, multioutput)
141 def mean_absolute_error(
142 y_true, y_pred, *, sample_weight=None, multioutput="uniform_average"
143 ):
144 """Mean absolute error regression loss.
145
146 Read more in the :ref:`User Guide <mean_absolute_error>`.
(...)
194 0.85...
195 """
--> 196 y_type, y_true, y_pred, multioutput = _check_reg_targets(
197 y_true, y_pred, multioutput
198 )
199 check_consistent_length(y_true, y_pred, sample_weight)
200 output_errors = np.average(np.abs(y_pred - y_true), weights=sample_weight, axis=0)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\metrics\_regression.py:100, in _check_reg_targets(y_true, y_pred, multioutput, dtype)
66 def _check_reg_targets(y_true, y_pred, multioutput, dtype="numeric"):
67 """Check that y_true and y_pred belong to the same regression task.
68
69 Parameters
(...)
98 correct keyword.
99 """
--> 100 check_consistent_length(y_true, y_pred)
101 y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
102 y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:387, in check_consistent_length(*arrays)
385 uniques = np.unique(lengths)
386 if len(uniques) > 1:
--> 387 raise ValueError(
388 "Found input variables with inconsistent numbers of samples: %r"
389 % [int(l) for l in lengths]
390 )
ValueError: Found input variables with inconsistent numbers of samples: [25404, 101612]
I thought it was the way I split the columns but that doesn't seem to be the issue
It works when the test size is 50/50 but no other test size works
You are comparing the predicted labels for the train set with the labels for the test set, which are of different sizes, hence the error.
Replace
y_pred = logReg.predict(X_train)
with
y_pred = logReg.predict(X_test)
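A self-contained sketch of the corrected flow, using a synthetic dataset from make_classification in place of the original df (an assumption, since the data is not shown). It also explains why a 50/50 split appeared to work: the train and test sets then happen to have the same length, so the length check passes even though the comparison is still between mismatched rows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the original DataFrame.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predict on the held-out features so y_pred lines up row by row with y_test.
y_pred = log_reg.predict(X_test)

print(len(y_test), len(y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```

accuracy_score is used here because logistic regression predicts class labels; mean_absolute_error would also run once the lengths match, but it is a regression metric.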

Using forestci to create error bars for random forest regression algorithms

I am using a program called GALPRO, which implements a random forest regression algorithm, to predict photometric redshift estimates. I input training and testing data: x_train (dimensions = [90,13]), x_test (dimensions = [10,13]), y_train (dimensions = [90,2]) and y_test (dimensions = [10,2]).
The code below shows how GALPRO does the random forest regression calculation:
model = RandomForestRegressor(**self.params)
model.fit(x_train, y_train)
I then make point estimate predictions using:
# Use the model to make predictions on new objects
y_pred = model.predict(x_test)
I am then trying to create error estimates using the forestci package random_forest_error:
y_error = fci.random_forest_error(model, x_train, x_test)
However I get an error:
ValueError Traceback (most recent call last)
/tmp/ipykernel_2626600/1096083143.py in <module>
----> 1 point_estimates = model.point_estimate(save_estimates=True, make_plots=False)
2 print(point_estimates)
/scratch/wiay/lara/galpro/galpro/model.py in point_estimate(self, save_estimates, make_plots)
158 # Use the model to make predictions on new objects
159 y_pred = self.model.predict(self.x_test)
--> 160 y_error = fci.random_forest_error(self.model, self.x_train, self.x_test)
161
162 # Update class variables
~/.local/lib/python3.7/site-packages/forestci/forestci.py in random_forest_error(forest, X_train, X_test, inbag, calibrate, memory_constrained, memory_limit)
279 n_trees = forest.n_estimators
280 V_IJ = _core_computation(
--> 281 X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit
282 )
283 V_IJ_unbiased = _bias_correction(V_IJ, inbag, pred_centered, n_trees)
~/.local/lib/python3.7/site-packages/forestci/forestci.py in _core_computation(X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit, test_mode)
135 """
136 if not memory_constrained:
--> 137 return np.sum((np.dot(inbag - 1, pred_centered.T) / n_trees) ** 2, 0)
138
139 if not memory_limit:
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (90,100) and (100,10,2) not aligned: 100 (dim 1) != 10 (dim 1)
I'm not sure what this error means or why my dimensions are wrong as I am following a similar example. If anyone has any ideas please let me know!
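Reading the traceback, the failing np.dot multiplies a (90, 100) array by one of shape (100, 10, 2): 90 training rows, 100 trees, 10 test rows, and a trailing 2 that matches the two columns of y_train. The sketch below reproduces only that shape arithmetic with NumPy. The arrays are random stand-ins with hypothetical names, not forestci's real internals. It suggests, as an assumption rather than a confirmed diagnosis, that this forestci version expects a single-output forest, so fitting one forest per target column may be a workaround.

```python
import numpy as np

n_train, n_test, n_trees, n_targets = 90, 10, 100, 2
rng = np.random.default_rng(0)

# Stand-in for the in-bag count matrix: one row per training sample, one column per tree.
inbag = rng.integers(0, 3, size=(n_train, n_trees))

# Two-target case: per-tree predictions carry an extra trailing axis.
preds_multi = rng.normal(size=(n_trees, n_test, n_targets))
try:
    np.dot(inbag - 1, preds_multi)
    multi_ok = True
except ValueError as exc:
    multi_ok = False
    print("multi-output:", exc)

# Single-target case: the shapes (90, 100) and (100, 10) align.
preds_single = rng.normal(size=(n_trees, n_test))
variance = np.sum((np.dot(inbag - 1, preds_single) / n_trees) ** 2, 0)
print("single-output result shape:", variance.shape)
```

With a 1-D target such as y_train[:, 0], the per-tree predictions have the two-dimensional shape used in the second case.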

Found input variables with inconsistent numbers of samples: [4, 1] [closed]

This is what I did. The code is below. I am using the music.csv dataset.
The error is "Found input variables with inconsistent numbers of samples: [4, 1]". The full traceback follows the code.
# importing Data
import pandas as pd
music_data = pd.read_csv('music.csv')
music_data
# split into training and testing- nothing to clean
# genre = predictions
# Inputs are age and gender and output is genre
# method=drop
X = music_data.drop(columns=['genre']) # has everything but genre
# X= INPUT
Y = music_data['genre'] # only genre
# Y=OUTPUT
# now select algorithm
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier() # model
model.fit(X, Y)
prediction = model.predict([[21, 1]])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2) # 20% of data for testing
# first two input other output
model.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, predictions)
Then this error is raised. It is a ValueError:
ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_28312/3992581865.py in <module>
5 model.fit(X_train, y_train)
6 from sklearn.metrics import accuracy_score
----> 7 score = accuracy_score(y_test, predictions)
c:\users\shrey\appdata\local\programs\python\python39\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
c:\users\shrey\appdata\local\programs\python\python39\lib\site-packages\sklearn\metrics\_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
200
201 # Compute accuracy for each possible representation
--> 202 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
203 check_consistent_length(y_true, y_pred, sample_weight)
204 if y_type.startswith('multilabel'):
c:\users\shrey\appdata\local\programs\python\python39\lib\site-packages\sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
81 y_pred : array or indicator matrix
82 """
---> 83 check_consistent_length(y_true, y_pred)
84 type_true = type_of_target(y_true)
85 type_pred = type_of_target(y_pred)
c:\users\shrey\appdata\local\programs\python\python39\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
317 uniques = np.unique(lengths)
318 if len(uniques) > 1:
--> 319 raise ValueError("Found input variables with inconsistent numbers of"
320 " samples: %r" % [int(l) for l in lengths])
321
ValueError: Found input variables with inconsistent numbers of samples: [4, 1]
Please help. I don't know what is happening, but I think it has to do with this line: score = accuracy_score(y_test, predictions).
After the split, your test data has four rows, so y_test has a length of 4.
When you predict on [[21, 1]], you are predicting for a single row, so the prediction has a length of 1.
That is why you get the inconsistent number of samples error.
You can fix this by predicting on X_test:
prediction = model.predict(X_test)
If you want to predict on new data instead, you need the true target for each new input and must pass those targets as y_test.
For example, if the target for [21, 1] is [2]:
prediction = model.predict([[21,1]])
y_test = [2] ## note this depends on what the corresponding target label is
score = accuracy_score(y_test,prediction)
You need to recompute your predictions variable after the train/test split, using X_test:
predictions = model.predict(X_test)
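Put together, the corrected workflow might look like the sketch below. The DataFrame is a made-up stand-in for music.csv with the age, gender and genre columns described in the question; its values are invented for illustration. With 18 rows and test_size=0.2, the test set has 4 rows, which matches the 4 in the error message.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented stand-in for music.csv (18 rows).
music_data = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 37,
            20, 21, 25, 26, 27, 30, 31, 34, 35],
    "gender": [1] * 9 + [0] * 9,
    "genre": (["HipHop"] * 3 + ["Jazz"] * 3 + ["Classical"] * 3
              + ["Dance"] * 3 + ["Acoustic"] * 3 + ["Classical"] * 3),
})

X = music_data.drop(columns=["genre"])  # inputs: age and gender
Y = music_data["genre"]                 # output: genre

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# One prediction per test row, so the lengths match y_test.
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
print(len(y_test), len(predictions), score)
```

Fitting only on X_train also keeps the test rows unseen, so the score reflects performance on held-out data.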

Does classification report on sklearn require same length for both input x and y?

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
I'm getting an error with sklearn classification report.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-30-6a63be1ce4c8> in <module>
----> 1 classification_report(y_test, predictions)
/usr/local/lib/python3.7/site-packages/sklearn/metrics/classification.py in classification_report(y_true, y_pred, labels, target_names, sample_weight, digits, output_dict)
1522 """
1523
-> 1524 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
1525
1526 labels_given = True
/usr/local/lib/python3.7/site-packages/sklearn/metrics/classification.py in _check_targets(y_true, y_pred)
69 y_pred : array or indicator matrix
70 """
---> 71 check_consistent_length(y_true, y_pred)
72 type_true = type_of_target(y_true)
73 type_pred = type_of_target(y_pred)
/usr/local/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
233 if len(uniques) > 1:
234 raise ValueError("Found input variables with inconsistent numbers of"
--> 235 " samples: %r" % [int(l) for l in lengths])
236
237
ValueError: Found input variables with inconsistent numbers of samples: [360, 144]
This is the only thing I'm passing in, and y_test.shape is (360,) and predictions.shape is (144,).
classification_report(y_test, predictions)
Do they need to be the same length? (I'm assuming so because of the second frame in the stack trace.) If so, how can the length of X and Y be the same when you split your data? Wouldn't they always have different lengths?
It seems like there's a bit of a misunderstanding here about the stats/ML data splitting framework.
Like you suspected, y_test and pred need to be the same length—let's call it k. Why? Because we need there to be k testing examples ((x, y) pairs) to test the model. X_test and y_test are each k entries long. (Each entry x in X_test may have several features, but it counts as one record.) For each x in X_test, we make a prediction about its label. Then, to compute a metric like classification accuracy, we compare the predicted label to the true label for each testing example.
"If so, how can the length of X and Y be the same when you split your data?"
Peek at the API of sklearn.model_selection.train_test_split. You'd call it something like this:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
What this shows is that X_test and y_test will have the same number of records in them—they'll always be the same shape, by design. Then for each entry in X_test, you make a prediction using your model. It'll be paired with the corresponding entry in y_test, and that's how you can compute your classification score.
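To see the pairing concretely, the sketch below prints the shapes produced by the split and then builds the report. It uses the iris dataset and a decision tree purely as stand-ins, since the original data and model are not shown.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: any (X, y) pair with matching row counts works the same way.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# The split is by rows: X_train pairs with y_train, X_test pairs with y_test.
print("train:", X_train.shape, y_train.shape)
print("test: ", X_test.shape, y_test.shape)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# One prediction per row of X_test, hence the same length as y_test.
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```

A length mismatch such as [360, 144] usually means the predictions came from a different array than the X_test that belongs to y_test.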
