CNN: Generate a confusion matrix for entire test dataset - python

I'm using the following code to predict my model's output on the dataset.
correct = 0
total_predictions = []
actual_labels = []
with torch.no_grad():
    for images, labels in testloader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        actual_labels.append(labels)
        total_predictions.append(final_pred)
        final_pred = torch.FloatTensor(final_pred).to(device)
        correct += (predicted == labels).sum().item()
Now, to generate a confusion matrix for the entire dataset, I tried storing my predictions and test labels in lists and passing them to confusion_matrix in sklearn, but that fails with the following error:
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
Can someone help me calculate a confusion matrix for my entire dataset?
The following code only calculates it for the last batch:
cf = confusion_matrix(predicted.cpu(), labels.cpu())
Update-1
Using @CutePoison's template, I'm getting this:
You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
labels = {}
labels['healthy_wheat'] = 0
labels['leaf_rust'] = 1
labels['stem_rust'] = 2

def conf_mat(y_true, y_pred, columns, **kwargs):
    conf_mat = confusion_matrix(y_true, y_pred, labels=columns, **kwargs)
    df = pd.DataFrame(conf_mat, columns=columns, index=columns)
    df.columns.name = "pred"
    df.index.name = "true"
    return df

conf_mat(actual_labels, total_predictions, columns=labels, normalize="true")

I use this snippet for creating confusion matrices, which works for multiple classes:
from sklearn.metrics import confusion_matrix

def conf_mat(y_true, y_pred, columns, **kwargs):
    """
    Creates a "pretty" confusion matrix
    """
    conf_mat = confusion_matrix(y_true, y_pred, labels=columns, **kwargs)
    df = pd.DataFrame(conf_mat, columns=columns, index=columns)
    df.columns.name = "pred"
    df.index.name = "true"
    return df

conf_mat(actual_labels, final_pred, columns=np.unique(actual_labels), normalize="true")
Note, you might want to change the columns depending on how your labels are created.
Furthermore, your final_pred has to contain your class prediction and not the score, i.e. final_pred = [0, 1, 2, 0, ...] and not final_pred = [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.05, 0.05, 0.9], [0.75, 0.2, 0.05], ...].
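Putting that note together with the original loop, here is a minimal sketch (my own, assuming model, testloader and device are defined as in the question) of how to accumulate class predictions per batch and then pass plain 1-D arrays to confusion_matrix:

import torch
from sklearn.metrics import confusion_matrix

all_preds = []
all_labels = []
with torch.no_grad():
    for images, labels in testloader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        # argmax over the class dimension gives class indices, not scores
        all_preds.append(outputs.argmax(dim=1).cpu())
        all_labels.append(labels.cpu())

# concatenate the per-batch tensors into flat arrays of class indices
y_pred = torch.cat(all_preds).numpy()
y_true = torch.cat(all_labels).numpy()
cf = confusion_matrix(y_true, y_pred)

Because both arguments are now 1-D integer arrays rather than lists of per-batch tensors, the "legacy multi-label data representation" error no longer applies.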

Related

Why do predictions of multiple targets sometimes sum to 1 with sklearn RandomForestRegressor?

With a supervised learning method, we have features (inputs) and targets (outputs). If we have multi-dimensional targets that sum to 1 row-wise (e.g. [0.3, 0.4, 0.3]), why does sklearn's RandomForestRegressor seem to normalize all outputs/predictions to sum to 1 when the training data sums to 1?
It seems like somewhere in the source code of sklearn the outputs are being normalized if the training data sums to 1, but I haven't been able to find it. I've gotten to the BaseDecisionTree class, which seems to be used by random forests, but haven't been able to see any normalization going on in there. I created a gist to show how it works. When the row-wise sums of the targets don't sum to 1, the outputs of the regressor do not sum to 1. But when the row-wise sums of the targets DO sum to 1, it seems to normalize them. Here is the demonstration code from the gist:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# simulate data
# 12 rows train, 6 rows test, 5 features, 3 columns for target
features = np.random.random((12, 5))
targets = np.random.random((12, 3))
test_features = np.random.random((6, 5))
rfr = RandomForestRegressor(random_state=42)
rfr.fit(features, targets)
preds = rfr.predict(features)
print('preds sum to 1?')
print(np.allclose(preds.sum(axis=1), np.ones(12)))
# normalize targets to sum to 1
norm_targets = targets / targets.sum(axis=1, keepdims=1)
rfr.fit(features, norm_targets)
preds = rfr.predict(features)
te_preds = rfr.predict(test_features)
print('predictions all sum to 1?')
print(np.allclose(preds.sum(axis=1), np.ones(12)))
print('test predictions all sum to 1?')
print(np.allclose(te_preds.sum(axis=1), np.ones(6)))
As one last note, I tried running a comparable fit in other random forest implementations (H2O in Python, in R: rpart, Rborist, RandomForest) but didn't find another implementation that allows multiple outputs.
My guess is that there is a bug in the sklearn code which is mixing up classification and regression somehow, and the outputs are being normalized to 1 like a classification problem.
What can be misleading here is that you are only looking at the resulting sum of the output values. The reason why all predictions add up to 1 when the model is trained with the normalized labels is that it will be predicting only among the multi-output arrays it has seen. And this is happening because, with so few samples, the model is overfitting and the decision tree is de facto acting like a classifier.
In other words, looking at the example where the output is not normalised (the same applies to a DecisionTree):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

features = np.random.random((6, 5))
targets = np.random.random((6, 3))
rfr = DecisionTreeRegressor(random_state=42)
rfr.fit(features, targets)
If we now predict on a new set of random features, we will be getting predictions among the set of outputs the model has been trained on:
features2 = np.random.random((6, 5))
preds = rfr.predict(features2)
print(preds)
array([[0.0017143 , 0.05348525, 0.60877828],  # 0
       [0.05232433, 0.37249988, 0.27844562],  # 1
       [0.08177551, 0.39454957, 0.28182183],
       [0.05232433, 0.37249988, 0.27844562],
       [0.08177551, 0.39454957, 0.28182183],
       [0.80068346, 0.577799  , 0.66706668]])
print(targets)
array([[0.80068346, 0.577799  , 0.66706668],
       [0.0017143 , 0.05348525, 0.60877828],  # 0
       [0.08177551, 0.39454957, 0.28182183],
       [0.75093787, 0.29467892, 0.11253746],
       [0.87035059, 0.32162589, 0.57288903],
       [0.05232433, 0.37249988, 0.27844562]])  # 1
So logically, if all training outputs add up to 1, the same will apply to the predicted values.
If we take the intersection of the row-wise sums of the targets and the predicted values, we see that every predicted row's sum also appears among the target sums:
preds_sum = np.unique(preds.sum(1))
targets_sum = np.unique(targets.sum(1))
len(np.intersect1d(targets_sum, preds_sum)) == len(features)
# True
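To make the memorization explicit, here is a small additional check (my own sketch, not part of the answer above): with a fully grown DecisionTreeRegressor, each leaf typically ends up holding a single training sample, so every predicted row is an exact copy of one of the training target rows.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
features = rng.random_sample((6, 5))
targets = rng.random_sample((6, 3))

tree = DecisionTreeRegressor(random_state=42).fit(features, targets)
preds = tree.predict(rng.random_sample((6, 5)))

# every prediction row should appear verbatim among the training targets
target_rows = {tuple(row) for row in targets}
print(all(tuple(row) in target_rows for row in preds))  # expected: True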

How to get the clustering data (y_true, y_pred) from my code, Keras, python

I'm using a CNN with an autoencoder to cluster different types of RNA. The clusters are calculated from the compressed representations of the different RNAs. Every RNA has a label corresponding to the type of RNA; in my case there are 7 different classes. After I get the result of the clustering I would like to visualize the results and see which RNA ended up in which cluster, but right now the y_pred value does not correspond to the RNA class but to the cluster index that was initialized by k-means.
kmeans = KMeans(n_clusters=self.n_clusters, n_init=20)
self.y_pred = kmeans.fit_predict(self.encoder.predict(x))
y_pred_last = np.copy(self.y_pred)
self.model.get_layer(name='clustering').set_weights([kmeans.cluster_centers_])
print(kmeans.labels_)

self.y_pred = q.argmax(1)
if y is not None:
    acc = np.round(metrics.acc(y, self.y_pred), 5)
    nmi = np.round(metrics.nmi(y, self.y_pred), 5)
    ari = np.round(metrics.ari(y, self.y_pred), 5)
    loss = np.round(loss, 5)
    logdict = dict(iter=ite, acc=acc, nmi=nmi, ari=ari, L=loss[0], Lc=loss[1], Lr=loss[2])

optimizer = 'adam'
dcec.compile(loss=['kld', 'mse'], loss_weights=[args.gamma, 1], optimizer=optimizer)
dcec.fit(x, y=y, tol=args.tol, maxiter=args.maxiter,
         update_interval=args.update_interval,
         save_dir=args.save_dir,
         cae_weights=args.cae_weights)
y_pred = dcec.y_pred

result = list(itertools.chain(y))
with open('datapoints.csv', mode='w', newline='') as data_points:
    data_writer = csv.writer(data_points)
    data_writer.writerow(['id', 'ytrue', 'ypred'])
    truth = y
    prediction = dcec.y_pred
    for i in range(len(result)):
        data_writer.writerow([i, truth[i], prediction[i]])
My problem right now is this part: prediction = dcec.y_pred
The output shows me the correct true label but not the "correct" predicted label. It returns a value, but this value does not correspond to the RNA types.
I don't know if this is the right path. Mainly I just want to visualize the clusters and see which RNA type was rightly and wrongly classified.
You might not be using the correct function call to get the prediction from the Keras model. I believe you should be doing something like:
prediction = dcec.predict(x)
Additional details are here: https://keras.io/models/model/
I hope this helps.
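One more note, going beyond that answer: k-means cluster indices are arbitrary, so even correct predictions won't line up with the RNA class ids directly. A common trick (a hedged sketch, assuming the number of clusters equals the number of classes) is to match clusters to classes with the Hungarian algorithm before building a confusion matrix or visualization:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def map_clusters_to_labels(y_true, y_cluster):
    """Relabel cluster indices so they best match the true class ids."""
    cm = confusion_matrix(y_true, y_cluster)
    # maximize the number of matched samples over all class/cluster pairings
    row_ind, col_ind = linear_sum_assignment(-cm)
    mapping = {cluster: label for label, cluster in zip(row_ind, col_ind)}
    return np.array([mapping[c] for c in y_cluster])

# y holds the true RNA classes, dcec.y_pred the cluster indices
# y_aligned = map_clusters_to_labels(y, dcec.y_pred)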

What's the best way to standardize data in Tensorflow? [duplicate]

My training data are saved in 3 files; each file is too large to fit into memory. Each training example is two-dimensional (2805 rows and 222 columns, where the 222nd column is the label) and the values are numerical. I would like to normalize the data before feeding them into the model for training. Below is my code for the input pipeline, and
the data have not been normalized before creating the dataset. Are there functions in TensorFlow that can do the normalization for my case?
dataset = tf.data.TextLineDataset([file1, file2, file3])
# combine 2805 lines into a single example
dataset = dataset.batch(2805)

def parse_example(line_batch):
    record_defaults = [[1.0] for col in range(0, 221)]
    record_defaults.append([1])
    content = tf.decode_csv(line_batch, record_defaults=record_defaults, field_delim='\t')
    features = tf.stack(content[0:221])
    features = tf.transpose(features)
    label = content[-1][-1]
    label = tf.one_hot(indices=tf.cast(label, tf.int32), depth=2)
    return features, label

dataset = dataset.map(parse_example)
dataset = dataset.shuffle(1000)
# batch multiple examples
dataset = dataset.batch(batch_size)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
data_batch, label_batch = iterator.get_next()
There are different ways of "normalizing data". Depending on which one you have in mind, it may or may not be easy to implement in your case.
1. Fixed normalization
If you know the fixed range(s) of your values (e.g. feature #1 has values in [-5, 5], feature #2 has values in [0, 100], etc.), you could easily pre-process your feature tensor in parse_example(), e.g.:
def normalize_fixed(x, current_range, normed_range):
    current_min, current_max = tf.expand_dims(current_range[:, 0], 1), tf.expand_dims(current_range[:, 1], 1)
    normed_min, normed_max = tf.expand_dims(normed_range[:, 0], 1), tf.expand_dims(normed_range[:, 1], 1)
    x_normed = (x - current_min) / (current_max - current_min)
    x_normed = x_normed * (normed_max - normed_min) + normed_min
    return x_normed

def parse_example(line_batch,
                  fixed_range=[[-5, 5], [0, 100], ...],
                  normed_range=[[0, 1]]):
    # ...
    features = tf.transpose(features)
    features = normalize_fixed(features, fixed_range, normed_range)
    # ...
2. Per-sample normalization
If your features are supposed to have approximately the same range of values, per-sample normalization could also be considered, i.e. applying normalization using the feature moments (mean, variance) of each sample:
def normalize_with_moments(x, axes=[0, 1], epsilon=1e-8):
    mean, variance = tf.nn.moments(x, axes=axes)
    x_normed = (x - mean) / tf.sqrt(variance + epsilon)  # epsilon to avoid dividing by zero
    return x_normed

def parse_example(line_batch):
    # ...
    features = tf.transpose(features)
    features = normalize_with_moments(features)
    # ...
3. Batch normalization
You could apply the same procedure over a complete batch instead of per-sample, which may make the process more stable:
data_batch = normalize_with_moments(data_batch, axes=[1, 2])
Similarly, you could use tf.nn.batch_normalization.
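For instance, a minimal sketch of that variant (my own example, standardizing with the moments of the current batch and no learned offset/scale):

mean, variance = tf.nn.moments(data_batch, axes=[0])
data_batch = tf.nn.batch_normalization(data_batch, mean, variance,
                                       offset=None, scale=None,
                                       variance_epsilon=1e-8)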
4. Dataset normalization
Normalizing using the mean/variance computed over the whole dataset would be the trickiest, since as you mentioned it is a large, split one.
tf.data.Dataset isn't really meant for such global computation. A solution would be to use whatever tools you have to pre-compute the dataset moments, then use this information for your TF pre-processing.
As mentioned by @MiniQuark, TensorFlow has a Transform library you could use to preprocess your data. Have a look at the Get Started guide, or for instance at the tft.scale_to_z_score() method for sample normalization.
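A minimal sketch of such a preprocessing_fn (hypothetical feature names, assuming the data have already been parsed into a 'features' tensor and a 'label'):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # standardize using the mean/variance computed over the whole dataset
    return {
        'features_normalized': tft.scale_to_z_score(inputs['features']),
        'label': inputs['label'],
    }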
Expanding on benjaminplanche's answer for "#4 Dataset normalization", there is actually a pretty easy way to accomplish this.
TensorFlow's Keras provides a preprocessing Normalization layer. Since this is a layer, its intent is to be used within the model. However, you don't have to (more on that later).
The model usage is simple:
input = tf.keras.Input(shape=dataset.element_spec.shape)
norm = tf.keras.layers.experimental.preprocessing.Normalization()
norm.adapt(dataset)  # you can use dataset.take(N) if N samples are enough for it to estimate the mean & variance
layer1 = norm(input)
...
The advantage of using it in the model is that the normalization mean & variance are saved as part of the model weights. So when you load the saved model, it'll use the same values it was trained with.
As mentioned earlier, if you don't want to use Keras models, you don't have to use the layer as part of one. If you'd rather use it in your dataset pipeline, you can do that too.
norm = tf.keras.layers.experimental.preprocessing.Normalization()
norm.adapt(dataset)
dataset = dataset.map(lambda t: norm(t))
The disadvantage is that you need to save and restore those weights manually now (norm.get_weights() and norm.set_weights()). Numpy has convenient save() and load() functions you can use here.
np.save("norm_weights.npy", norm.get_weights())
norm.set_weights(np.load("norm_weights.npy", allow_pickle=True))
After defining inputs, execute the following line of code:
import tensorflow as tf

inputs = tf.keras.layers.LayerNormalization(
    axis=-1,
    center=True,
    scale=True,
    trainable=True,
    name='input_normalized',
)(inputs)
I inferred this from the TensorFlow API (which has been updated since the answers above).

H2o Python: Combining XGB Holdout Predictions

When using:
"keep_cross_validation_predictions": True
"keep_cross_validation_fold_assignment": True
in H2O's XGBoost estimator, I am not able to map these cross-validated probabilities back to the original dataset. There is one documentation example for R but not for Python (combining holdout predictions).
Any leads on how to do this in Python?
The cross-validated predictions are stored in two different places -- once as a list of length k (for k-folds) in model.cross_validation_predictions(), and another as an H2O Frame with the CV preds in the same order as the original training rows in model.cross_validation_holdout_predictions(). The latter is usually what people want (we added this later, that's why there are two versions).
Yes, unfortunately the R example to get this frame in the "Cross-validation" section of the H2O User Guide does not have a Python version (ticket to fix that). In the keep_cross_validation_predictions argument documentation, it only shows one of the two locations.
Here's an updated example using XGBoost and showing both types of CV predictions:
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("http://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# try using the `keep_cross_validation_predictions` (boolean parameter):
# first initialize your estimator, set nfolds parameter
xgb = H2OXGBoostEstimator(keep_cross_validation_predictions = True, nfolds = 5, seed = 1)
# then train your model
xgb.train(x = x, y = y, training_frame = train)
# print the cross-validation predictions as a list
xgb.cross_validation_predictions()
# print the cross-validation predictions as an H2OFrame
xgb.cross_validation_holdout_predictions()
The frame of CV predictions looks like this:
Out[57]:
  predict         p0        p1
---------  ---------  --------
        1  0.396057   0.603943
        1  0.149905   0.850095
        1  0.0407018  0.959298
        1  0.140991   0.859009
        0  0.67361    0.32639
        0  0.865698   0.134302
        1  0.12927    0.87073
        1  0.0549603  0.94504
        1  0.162544   0.837456
        1  0.105603   0.894397

[10000 rows x 3 columns]
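Since the holdout predictions come back in the same row order as the training frame, mapping them back to the original dataset is just a column bind (a short sketch building on the example above):

# attach the holdout CV predictions to the original training frame
cv_holdout = xgb.cross_validation_holdout_predictions()
train_with_cv = train.cbind(cv_holdout)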
For Python there is an example of this on GBM, and it should be exactly the same for XGB. According to that page, you should be able to do something like this:
model = H2OXGBoostEstimator(keep_cross_validation_predictions = True)
model.train(x = predictors, y = response, training_frame = train)
cv_predictions = model.cross_validation_predictions()

Tensorflow DNNclassifier getting bad results

I am trying to make a classifier that learns whether a movie review was positive or negative from its contents. I am using a couple of relevant files: a file with the total vocabulary (one word per line) across every document, two CSVs (one for the training set, one for the test set) containing the score each document got, in a specific order, and two CSVs (same as above) where each line holds the indices of the words that appear in that review, treating the vocabulary as a list. So a review like "I liked this movie" has a score line of 1 (0: dislike, 1: like) and a word line like [2, 13, 64, 33]. I use the DNNClassifier and currently have 1 feature, which is an embedding column wrapped around a categorical_column_with_identity. My code runs but it gets absolutely terrible results and I'm not sure why. Perhaps someone with more knowledge about TensorFlow could help me out. Also, I don't come on here much, but I honestly tried and couldn't find a post that directly helps me.
import tensorflow as tf
import pandas as pd
import numpy as np
import os

embedding_d = 18
label_name = ['Label']
col_name = ["Words"]
hidden_unit = [10]*5
BATCH = 50
STEP = 5000

# Ignore some warning messages about an optional compiler
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

## Function to feed into training
def train_input_fn(features, labels, batch_size):
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
    # Return the dataset.
    return dataset

## Original eval. Untouched so far. Most likely will need to be changed.
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features = dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)
    # Batch the examples
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)
    # Return the dataset.
    return dataset

## Produces dataframes for labels and features (words) using pandas
def loadData():
    train_label = pd.read_csv("aclImdb/train/yaynay.csv", names=label_name)
    test_label = pd.read_csv("aclImdb/test/yaynay.csv", names=label_name)
    train_feat = pd.read_csv("aclImdb/train/set.csv", names=col_name)
    test_feat = pd.read_csv("aclImdb/test/set.csv", names=col_name)
    train_feat[col_name] = train_feat[col_name].astype(np.int64)
    test_feat[col_name] = test_feat[col_name].astype(np.int64)
    return (train_feat, train_label), (test_feat, test_label)

## Stuff that I believe is somewhat working
# Get labels for test and training data
(train_x, train_y), (test_x, test_y) = loadData()

## Get the features for each document
train_feature = []
# Currently only one key but this could change in the future
for key in train_x.keys():
    # Create a categorical column
    idCol = tf.feature_column.categorical_column_with_identity(
        key=key,
        num_buckets=89528)
    embedding_column = tf.feature_column.embedding_column(
        categorical_column=idCol,
        dimension=embedding_d)
    train_feature.append(embedding_column)

## Create the neural network
classifier = tf.estimator.DNNClassifier(
    feature_columns=train_feature,
    # Specifies the no. of layers and no. of neurons in each layer
    hidden_units=hidden_unit,
    # Number of output classes (2 here: negative/positive)
    n_classes=2)

# Train the Model
# BATCH is the batch size, STEP is the total number of steps to take.
classifier.train(input_fn=lambda: train_input_fn(train_x, train_y, BATCH), steps=STEP)

# Evaluate the model
eval_result = classifier.evaluate(
    input_fn=lambda: eval_input_fn(test_x, test_y, BATCH), steps=STEP)
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
