get true labels from keras generator - python

I want to use sklearn.metrics.confusion_matrix(y_true, y_pred) to create a confusion matrix for a Keras model.
After training a model I can use predict_generator(generator) to get predictions for a test dataset, which gives me y_pred. How can I get the corresponding true labels, y_true, from a data generator?

generator.classes will give you the observed classes in sparse format (i.e. as integer labels). You probably need them in dense, one-hot encoded format. You could get that with:
import pandas as pd
# get_dummies already returns a dense one-hot frame; .to_dense() was removed in pandas 1.0
pd.get_dummies(pd.Series(generator.classes))
NOTE though: you must set the generator's shuffle attribute to False before generating the predictions and fetching the observed classes, otherwise your predictions and observations will not line up!

After creating a data generator, either your own or the built-in ImageDataGenerator, use your trained model to make predictions:
true_labels = data_generator.classes
predictions = model.predict_generator(data_generator)
sklearn's confusion matrix expects a 1-d array of labels, so you have to convert your predictions using np.argmax():
y_true = true_labels
y_pred = np.argmax(predictions, axis=1)
Then you can use those variables directly in the confusion_matrix function
cm = sklearn.metrics.confusion_matrix(y_true, y_pred)
And you can plot it using the example plot_confusion_matrix() function found here:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
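Putting the pieces together, here is a minimal end-to-end sketch; the directory path and image size are placeholder assumptions, and model is your already-trained model:
import numpy as np
import sklearn.metrics
from keras.preprocessing.image import ImageDataGenerator

# Hypothetical test-set generator; shuffle=False keeps predictions and
# labels aligned, as noted above.
data_generator = ImageDataGenerator(rescale=1. / 255).flow_from_directory(
    'data/test',              # placeholder path
    target_size=(224, 224),   # placeholder image size
    class_mode='categorical',
    shuffle=False)

predictions = model.predict_generator(data_generator)  # model.predict(data_generator) in newer Keras
y_true = data_generator.classes
y_pred = np.argmax(predictions, axis=1)
cm = sklearn.metrics.confusion_matrix(y_true, y_pred)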

ValueError: X has 2 features, but SVC is expecting 472082 features as input

I am loading a linear SVM model and then predicting new data using the stored, trained SVM model. I used TF-IDF during training, like this:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
When I apply new data, I get an error at prediction time:
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the prediction of new data:
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data with the stored SVM model without having to re-apply TF-IDF to the training data every time I make a prediction. When I use the new data for prediction, the prediction line raises the error above. Is there any way to fix this?
The problem is due to your creation of a new TfidfVectorizer by fitting it on the test dataset. As the classifier has been trained on a matrix generated by the TfidfVectorizer fitted on the training dataset, it expects the test dataset to have the exact same dimensions.
In order to do so, you need to transform your test dataset with the same vectorizer that was used during training rather than initialize a new one based on the test set.
The vectorizer fitted on the train set can be pickled and stored for later use to avoid any re-fitting at inference time.
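A minimal sketch of this fix, assuming joblib is used for persistence (the file names are placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib

# At training time: fit the vectorizer once on the training text and persist it.
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
joblib.dump(vector, 'tfidf_vectorizer.sav')   # placeholder path

# At inference time: load both artifacts; only transform, never re-fit.
vector = joblib.load('tfidf_vectorizer.sav')
Linear_SVC_classifier = joblib.load('Linear_SVC_classifier.sav')

test_data = input("Enter Data for Testing: ")
# transform expects an iterable of documents, so wrap the single string in a
# list; passing the raw string would treat each character as a document.
newly_testing_data = vector.transform([test_data])
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)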

strange behaviour of 'inverse_transform' function in sklearn.preprocessing.MinMaxScaler

I used the MinMaxScaler class in sklearn.preprocessing to normalize the attributes of some of my variables (arrays) for use in a model (linear regression). After creating and training the model,
I tested it with x_test (split using train_test_split) and stored the result in a variable (say, predicted). For evaluation purposes I want to compare my predictions with the original dataset, so I used the MinMaxScaler.inverse_transform function. That function works well when my code is in the order below:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
x_test=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
When I changed the order as in the code below, it throws an error:
ValueError: non-broadcastable output operand with shape (379,1) doesn't match the broadcast shape (379,13)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,train_size=0.75,random_state=27)
sc=MinMaxScaler(feature_range=(0,1))
x_train=sc.fit_transform(x_train)
y_train=y_train.reshape(-1,1)
y_train=sc.fit_transform(y_train)
x_test=sc.fit_transform(x_train)
Please compare the two screenshots for a better understanding of my query.
It can be seen from the linked screenshot that you use the same MinMaxScaler to fit and transform both the train and test x-data, and also the training y-data (which does not make sense).
The correct process would be:
Fit the scaler with train x-data. The fit_transform() also transforms (scales) the x_train.
sc = MinMaxScaler(feature_range=(0,1))
x_train = sc.fit_transform(x_train)
Scale also the test x-data with the same scaler. Do not fit here; just scale/transform.
x_test = sc.transform(x_test)
If you think scaling is needed also for y-data, you will have to fit another scaler for that purpose. It could also be that there is no need for scaling the y-data.
# Option A: Do not scale y-data
# (do nothing)
# Option B: Scale y-data
sc_y = MinMaxScaler(feature_range=(0,1))
y_train = sc_y.fit_transform(y_train)
After you have trained your model (lr), you can make predictions with the scaled x_test and the model:
# Option A:
predicted = lr.predict(x_test)
# Option B:
y_pred_scaled = lr.predict(x_test)
predicted = sc_y.inverse_transform(y_pred_scaled)
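Putting it all together, a minimal sketch of the corrected pipeline (assuming NumPy arrays x and y, a LinearRegression model, and Option B for the y-data):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, train_size=0.75, random_state=27)

sc = MinMaxScaler(feature_range=(0, 1))
x_train = sc.fit_transform(x_train)   # fit on the training features only
x_test = sc.transform(x_test)         # reuse the same fitted scaler

sc_y = MinMaxScaler(feature_range=(0, 1))
y_train_scaled = sc_y.fit_transform(y_train.reshape(-1, 1))

lr = LinearRegression().fit(x_train, y_train_scaled)
y_pred_scaled = lr.predict(x_test)
predicted = sc_y.inverse_transform(y_pred_scaled)   # back to the original units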

Use score() after predict() in sklearn without recalculating

Context
I use sklearn machine learning algorithms like SVR for a regression-task.
from sklearn.svm import SVR
model = SVR(kernel='poly', degree=2, epsilon=.5)
model.fit(
    features,  # NumPy array with features
    target,    # NumPy array with the target
)
Afterwards I return the score of the regression using the .score()-function.
Additionally, I need the prediction-results using .predict() for further processing.
some_data = [...] # Numpy array with some data to predict
correct_targets = [...] # Numpy array with targets according to some data
# Get R²
print("R²:", model.score(
some_data
, correct_targets
))
# Store prediction
pred = model.predict(some_data)
Question
When I run the code above, the prediction is effectively computed twice - once inside .score() and once by .predict().
However, I cannot run .score() on the saved result of .predict().
This is a bit nasty since the calculation takes some time.
Is it possible to store the prediction and apply .score() afterwards without recalculating?
If you already have the predicted values:
pred = model.predict(some_data)
and the respective ground truth correct_targets, it is straightforward to get the R² score without re-running the model, as scikit-learn has a dedicated function for this:
from sklearn.metrics import r2_score
r2_score(correct_targets, pred)
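As a quick sanity check, the two values below should match (note that r2_score takes the ground truth as its first argument):
from sklearn.metrics import r2_score

pred = model.predict(some_data)   # computed once, reused below
print("R² via score():   ", model.score(some_data, correct_targets))
print("R² via r2_score():", r2_score(correct_targets, pred))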

Improving classification by using clustering as a feature

I'm trying to improve my classification results by doing clustering and using the cluster assignments as another feature (or using them alone instead of all the other features - I'm not sure yet).
So let's say that I'm using an unsupervised algorithm, GMM:
gmm = GaussianMixture(n_components=4, random_state=RSEED)
gmm.fit(X_train)
pred_labels = gmm.predict(X_test)
I trained the model with the training data and predicted the clusters for the test data.
Now I want to use a classifier (KNN for example) and use the clustered data within it. So I tried:
#define the model and parameters
knn = KNeighborsClassifier()
parameters = {'n_neighbors': [3, 5, 7],
              'leaf_size': [1, 3, 5],
              'algorithm': ['auto', 'kd_tree'],
              'n_jobs': [-1]}
#Fit the model
model_gmm_knn = GridSearchCV(knn, param_grid=parameters)
model_gmm_knn.fit(pred_labels.reshape(-1, 1), Y_train)
model_gmm_knn.best_params_
But I'm getting:
ValueError: Found input variables with inconsistent numbers of samples: [418, 891]
The train and test sets do not have the same dimensions.
So how can I implement such approach?
Your method is not correct - you are attempting to use the cluster labels of your test data, pred_labels, as a single feature to fit a classifier with your training labels Y_train. Even in the highly coincidental case that the dimensions of these datasets were the same (hence not giving a dimension-mismatch error, as happens here), this would be conceptually wrong and would not actually make any sense.
What you actually want to do is:
Fit a GMM with your training data
Use this fitted GMM to get cluster labels for both your training and test data.
Append the cluster labels as a new feature in both datasets
Fit your classifier with this "enhanced" training data.
All in all, and assuming that your X_train and X_test are pandas dataframes, here is the procedure:
import pandas as pd
gmm.fit(X_train)
cluster_train = gmm.predict(X_train)
cluster_test = gmm.predict(X_test)
X_train['cluster_label'] = pd.Series(cluster_train, index=X_train.index)
X_test['cluster_label'] = pd.Series(cluster_test, index=X_test.index)
model_gmm_knn.fit(X_train, Y_train)
Notice that you should not fit your clustering model with your test data - only with your training data; otherwise you have data leakage similar to the one encountered when using the test set for feature selection, and your results will be both invalid and misleading.
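At prediction time the same cluster-augmented test features are used; for instance (assuming a Y_test array is available if you also want a score):
# Predict with the cluster-augmented test features
y_pred = model_gmm_knn.predict(X_test)
# Optional: mean accuracy on the test set (GridSearchCV delegates to the best estimator)
print(model_gmm_knn.score(X_test, Y_test))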

Scikit-learn f1_score for list of strings

Is there any way to compute f1_score for a list of labels as strings regardless their order?
f1_score(['a','b','c'],['a','c','b'],average='macro')
I wish this to return 1 instead of 0.33333333333
I know I could vectorize the labels, but this syntax would be far easier in my case, since I am dealing with many labels.
What you need is the f1_score for a multilabel classification task, and for that you need a 2-d matrix for y_true and y_pred of shape [n_samples, n_labels].
You are currently supplying 1-d arrays only, so the inputs will be treated as a multiclass problem, not a multilabel one.
The official documentation provides the necessary details.
And for that to be scored correctly you need to convert y_true and y_pred to label-indicator matrices, as documented here:
y_true : 1d array-like, or label indicator array / sparse matrix
y_pred : 1d array-like, or label indicator array / sparse matrix
So you need to change the code like this:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
y_true = [['a','b','c']]
y_pred = [['a','c','b']]
binarizer = MultiLabelBinarizer()
# This should be your original approach
#binarizer.fit(your actual true output consisting of all labels)
# In this case, I am considering only the given labels.
binarizer.fit(y_true)
f1_score(binarizer.transform(y_true),
         binarizer.transform(y_pred),
         average='macro')
Output: 1.0
You can have a look at examples of MultiLabelBinarizer here:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
https://stackoverflow.com/a/42392689/3374996
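The same pattern scales to multiple samples; here is a sketch assuming the full label universe is known up front (all data here is illustrative):
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score

all_labels = [['a', 'b', 'c', 'd']]   # hypothetical full label set
y_true = [['a', 'b', 'c'], ['a', 'd']]
y_pred = [['a', 'c', 'b'], ['d']]

binarizer = MultiLabelBinarizer()
binarizer.fit(all_labels)             # fit on every label you expect to see

print(f1_score(binarizer.transform(y_true),
               binarizer.transform(y_pred),
               average='macro'))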
