I am training a keras model whose last layer is a single sigmoid unit:
output = Dense(units=1, activation='sigmoid')
I am training this model with some training data in which the expected output is always a number between 0.0 and 1.0.
I am compiling the model with mean-squared-error:
model.compile(optimizer='adam', loss='mse')
Since both the expected output and the real output are single floats between 0 and 1, I was expecting a loss between 0 and 1 as well, but when I start the training I get a loss of 3.3932, larger than 1.
Am I missing something?
Edit:
I am adding an example to show the problem:
https://drive.google.com/file/d/1fBBrgW-HlBYhG-BUARjTXn3SpWqrHHPK/view?usp=sharing
(I cannot just paste the code because I need to attach the training data)
After running python stackoverflow.py, the summary of the model will be shown, as well as the training process.
I also print the minimum and maximum values of y_true each step to verify that they are within the [0, 1] range.
There is no need to wait for the training to finish, you will see that the loss during the first few epochs is much larger than 1.
First, we can demystify mse loss - it's a normal callable function in tf.keras:
import tensorflow as tf
import numpy as np
mse = tf.keras.losses.mse
print(mse([1] * 3, [0] * 3)) # tf.Tensor(1, shape=(), dtype=int32)
Next, as the name "mean squared error" implies, it's a mean, so the size of the vectors passed to it does not change the value as long as the mean squared difference stays the same:
print(mse([1] * 10, [0] * 10)) # tf.Tensor(1, shape=(), dtype=int32)
In order for the mse to exceed 1, the average squared error must exceed 1:
print( mse(np.random.random((100,)), np.random.random((100,))) ) # tf.Tensor(0.14863832582680103, shape=(), dtype=float64)
print( mse( 10 * np.random.random((100,)), np.random.random((100,))) ) # tf.Tensor(30.51209646429651, shape=(), dtype=float64)
Lastly, sigmoid indeed guarantees that output is between 0 and 1:
sigmoid = tf.keras.activations.sigmoid
signal = 10 * np.random.random((100,))
output = sigmoid(signal)
print(f"Raw: {np.mean(signal):.2f}; Sigmoid: {np.mean(output):.2f}" ) # Raw: 5.35; Sigmoid: 0.92
What this implies is that in your code, the values of y_true are NOT all between 0 and 1.
You can verify this with np.min(y_true) and np.max(y_true).
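For example, a quick sanity check right before model.fit (y_train here is just a placeholder name for whatever array the script uses as targets):
import numpy as np
print("y_train min/max/mean:", np.min(y_train), np.max(y_train), np.mean(y_train))
assert np.min(y_train) >= 0.0 and np.max(y_train) <= 1.0, "targets fall outside [0, 1]"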
I do not have an answer for the question asked, but I am getting NaNs in my MSE loss with inputs in the range [0, 1] and a sigmoid at the output, so I thought the question was relevant.
Here are a few observations about sigmoid:
import tensorflow as tf
import numpy as np
x=tf.constant([-20, -1.0, 0.0, 1.0, 20], dtype = tf.float32)
x=tf.keras.activations.sigmoid(x)
x.numpy()
# array([2.0611537e-09, 2.6894143e-01, 5.0000000e-01, 7.3105860e-01,
# 1.0000000e+00], dtype=float32)
x=tf.constant([float('nan')]*5, dtype = tf.float32)
x=tf.keras.activations.sigmoid(x)
x.numpy()
# array([nan, nan, nan, nan, nan], dtype=float32)
x=tf.constant([np.inf]*5, dtype = tf.float32)
x=tf.keras.activations.sigmoid(x)
x.numpy()
# array([1., 1., 1., 1., 1.], dtype=float32)
So, it is possible to get NaNs out of sigmoid. Just in case someone (me, in the near future) has this doubt (again).
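One way to catch this early is to assert that batches are finite before they go through the model; a minimal sketch (the tensor here is just a made-up batch):
import tensorflow as tf
x = tf.constant([0.2, float('nan'), 0.7], dtype=tf.float32)  # hypothetical input batch
# Raises InvalidArgumentError as soon as a NaN or Inf appears,
# instead of letting it propagate silently through sigmoid and the MSE loss.
x = tf.debugging.check_numerics(x, message="non-finite value in input batch")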
I'm implementing a basic GAN based on the one in the Tensorflow documentation.
After 2 training steps, the prediction from the generator is all NaN.
I don't know why it happens, but I noticed that the gradients of the 2nd convolution layer of the discriminator are all NaN since the first step:
<tf.Tensor: shape=(5, 5, 64, 128), dtype=float32, numpy=
array([[[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan,
My loss functions:
import tensorflow as tf
from tensorflow.keras.losses import BinaryCrossentropy

loss = BinaryCrossentropy(from_logits=True)

def generator_loss(discriminator_output):
    loss_value = loss(tf.ones_like(discriminator_output), discriminator_output)
    print(f"Generator loss: {loss_value}")
    return loss_value

def discriminator_loss(outputs_for_reals, outputs_for_fakes):
    real_loss = loss(tf.ones_like(outputs_for_reals), outputs_for_reals)
    fake_loss = loss(tf.zeros_like(outputs_for_fakes), outputs_for_fakes)
    print(f"Disc loss (real/fake): {real_loss}/{fake_loss}")
    return real_loss + fake_loss
The training loop:
def training_step(self, real_images):
    noise = tf.random.normal([BATCH_SIZE, NOISE_DIM])
    with tf.GradientTape() as generator_tape, tf.GradientTape() as discriminator_tape:
        generated_images = self._generator(noise, training=True)
        if True in tf.math.is_nan(generated_images)[0, 0, :, 0]:
            print("NaN result found")
        else:
            print("OK Result")
        results_for_real = self._discriminator(real_images, training=True)
        results_for_fake = self._discriminator(generated_images, training=True)
        generator_loss = self._generator_loss(results_for_fake)
        discriminator_loss = self._discriminator_loss(results_for_real, results_for_fake)
    generator_gradients = generator_tape.gradient(generator_loss,
                                                  self._generator.trainable_variables)
    discriminator_gradients = discriminator_tape.gradient(discriminator_loss,
                                                          self._discriminator.trainable_variables)
    self._generator_optimizer.apply_gradients(
        zip(generator_gradients, self._generator.trainable_variables))
    self._discriminator_optimizer.apply_gradients(
        zip(discriminator_gradients, self._discriminator.trainable_variables))
    return generator_loss, discriminator_loss, generated_images
I build the models exactly the same way as in the documentation.
Things I tried:
Reducing the learning rate
Running the model in different training modes (training=False/training=True)
Decorating the training step with tf.function
No matter what I do, calling the generator with any input will produce exclusively NaN elements.
Example output:
2020-12-11 18:15:42.783543: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
OK Result
Generator loss: 1.5484254360198975
Disc loss (real/fake): 1.4994642734527588/0.2869953215122223
OK Result
Generator loss: 1.2899521589279175
Disc loss (real/fake): 1.3189733028411865/0.3767632842063904
Backend Qt5Agg is interactive backend. Turning interactive mode on.
NaN result found
It turned out to be a CUDA mismatch: I am using an RTX 3080, which only works with CUDA 11, but I had CUDA 10.1 installed. With 10.1 the code runs without warnings but produces garbage/NaN values.
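If anyone else hits this, a quick way to check whether the installed GPU setup matches what your TensorFlow wheel was built for (tf.sysconfig.get_build_info needs TF 2.3 or newer):
import tensorflow as tf
build = tf.sysconfig.get_build_info()  # may lack CUDA keys on CPU-only builds
print("built for CUDA:", build.get("cuda_version"))
print("built for cuDNN:", build.get("cudnn_version"))
print("visible GPUs:", tf.config.list_physical_devices("GPU"))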
I want to use the MeanIoU metric in Keras (doc link), but I don't really understand how it integrates with the Keras API. In the documentation example, the prediction and the ground truth are given as binary values, but with Keras we should get probabilities, especially because the loss is MSE...
We should have something like:
m = tf.keras.metrics.MeanIoU(num_classes=2)
m.update_state([0, 0, 1, 1], [0.3, 0.6, 0.2, 0.9])
But then the result isn't the same as with binary predictions; we get:
# <tf.Variable 'UnreadVariable' shape=(2, 2) dtype=float64, numpy=array([[2., 0.],
# [2., 0.]])>
m.result().numpy() # 0.25
So my question is: how should we use this metric if the output of the model is probabilities, whether binary or in a multi-class (one-hot) setting?
For the Accuracy there is a distinction between BinaryAccuracy and CategoricalAccuracy and they both take probabilities in y_pred. Shouldn't it be the same for MeanIoU?
I am having similar issues. Despite looking for examples online, every demonstration I found applies argmax on the model's output first.
The workaround I have for now is to subclass tf.keras.metrics.MeanIoU:
class MyMeanIOU(tf.keras.metrics.MeanIoU):
    def update_state(self, y_true, y_pred, sample_weight=None):
        return super().update_state(tf.argmax(y_true, axis=-1), tf.argmax(y_pred, axis=-1), sample_weight)
It is also possible to create your own function, but it is recommended to subclass tf.keras.metrics.Metric if you wish to benefit from the extra features such as distributed strategies.
I am still looking for cleaner solutions.
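For a single sigmoid output like in the question, the same idea works with a 0.5 threshold instead of argmax. A sketch (the class name and the threshold are my own choices, not anything from tf.keras):
import tensorflow as tf

class BinaryMeanIOU(tf.keras.metrics.MeanIoU):
    def update_state(self, y_true, y_pred, sample_weight=None):
        # Turn probabilities into hard 0/1 predictions before updating the confusion matrix.
        y_pred = tf.convert_to_tensor(y_pred)
        return super().update_state(y_true, tf.cast(y_pred > 0.5, tf.int32), sample_weight)

m = BinaryMeanIOU(num_classes=2)
m.update_state([0, 0, 1, 1], [0.3, 0.6, 0.2, 0.9])
print(m.result().numpy())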
I had the same problem, so I looked into the source code.
In TF 2.0, at the end of the update_state function, there is:
current_cm = confusion_matrix.confusion_matrix(
    y_true,
    y_pred,
    self.num_classes,
    weights=sample_weight,
    dtype=dtypes.float64)
Looking into the confusion_matrix function:
with ops.name_scope(name, 'confusion_matrix',
                    (predictions, labels, num_classes, weights)) as name:
    labels, predictions = remove_squeezable_dimensions(
        ops.convert_to_tensor(labels, name='labels'),
        ops.convert_to_tensor(predictions, name='predictions'))
    predictions = math_ops.cast(predictions, dtypes.int64)
    labels = math_ops.cast(labels, dtypes.int64)

    # Sanity checks - underflow or overflow can cause memory corruption.
    labels = control_flow_ops.with_dependencies(
        [check_ops.assert_non_negative(
            labels, message='`labels` contains negative values')],
        labels)
    predictions = control_flow_ops.with_dependencies(
        [check_ops.assert_non_negative(
            predictions, message='`predictions` contains negative values')],
        predictions)

    if num_classes is None:
        num_classes = math_ops.maximum(math_ops.reduce_max(predictions),
                                       math_ops.reduce_max(labels)) + 1
    else:
        num_classes_int64 = math_ops.cast(num_classes, dtypes.int64)
        labels = control_flow_ops.with_dependencies(
            [check_ops.assert_less(
                labels, num_classes_int64, message='`labels` out of bound')],
            labels)
        predictions = control_flow_ops.with_dependencies(
            [check_ops.assert_less(
                predictions, num_classes_int64,
                message='`predictions` out of bound')],
            predictions)

    if weights is not None:
        weights = ops.convert_to_tensor(weights, name='weights')
        predictions.get_shape().assert_is_compatible_with(weights.get_shape())
        weights = math_ops.cast(weights, dtype)

    shape = array_ops.stack([num_classes, num_classes])
    indices = array_ops.stack([labels, predictions], axis=1)
    values = (array_ops.ones_like(predictions, dtype)
              if weights is None else weights)
    cm_sparse = sparse_tensor.SparseTensor(
        indices=indices,
        values=values,
        dense_shape=math_ops.cast(shape, dtypes.int64))
    zero_matrix = array_ops.zeros(math_ops.cast(shape, dtypes.int32), dtype)

    return sparse_ops.sparse_add(zero_matrix, cm_sparse)
The trick is the math_ops.cast(predictions, dtypes.int64) call: TF casts the predictions to int64, so when you pass [0.3, 0.6, 0.2, 0.9] into that cast, it comes back as [0, 0, 0, 0].
That's why you got the confusion matrix
[[2., 0.],
[2., 0.]]
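You can see the effect of that cast directly (TF 2.x eager mode):
import tensorflow as tf
probs = tf.constant([0.3, 0.6, 0.2, 0.9])
print(tf.cast(probs, tf.int64).numpy())            # [0 0 0 0] -> everything lands in class 0
print(tf.cast(tf.round(probs), tf.int64).numpy())  # [0 1 0 1] -> what you probably intended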
We need to sample values from one tensor according to another tensor that contains probabilities. Let's say we have two tensors t1, t2 of shape (?, 3), and want to find another tensor t3 of shape (?, 1) that contains one sample from each row of t1, drawn according to the probabilities in t2.
You can do this in two steps: first sample column indices 0, 1, or 2, then replace those indices with the corresponding tensor values. This can be done with tf.random.categorical (see this question for more information about that function). Note that tf.random.categorical was added in version 1.13.1.
import tensorflow as tf

with tf.Graph().as_default(), tf.Session() as sess:
    tf.random.set_random_seed(0)
    values = tf.placeholder(tf.float32, [None, 3])
    probabilities = tf.placeholder(tf.float32, [None, 3])
    # You can change the number of samples per row (or make it a placeholder)
    num_samples = 1
    # Use log to get log-probabilities or give logit values (pre-softmax) directly
    logits = tf.log(probabilities)
    # Sample column indices
    col_idx = tf.random.categorical(logits, num_samples, dtype=tf.int32)
    # Make row indices
    num_rows = tf.shape(values)[0]
    row_idx = tf.tile(tf.expand_dims(tf.range(num_rows), 1), (1, num_samples))
    # Gather actual values
    result = tf.gather_nd(values, tf.stack([row_idx, col_idx], axis=-1))
    # Test
    print(sess.run(result, feed_dict={
        values: [[10., 20., 30.], [40., 50., 60.]],
        # Can be actual or "proportional" probabilities not summing to 1
        probabilities: [[.1, .3, .6], [0., 4., 2.]]
    }))
# [[30.]
# [60.]]
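If you are on TF 2.x with eager execution, roughly the same thing without placeholders and sessions could look like this (a sketch under that assumption, not part of the original answer):
import tensorflow as tf

tf.random.set_seed(0)

values = tf.constant([[10., 20., 30.], [40., 50., 60.]])
probabilities = tf.constant([[.1, .3, .6], [0., 4., 2.]])
num_samples = 1

logits = tf.math.log(probabilities)  # log(0) gives -inf, which categorical treats as zero probability
col_idx = tf.random.categorical(logits, num_samples, dtype=tf.int32)
row_idx = tf.tile(tf.expand_dims(tf.range(tf.shape(values)[0]), 1), (1, num_samples))
result = tf.gather_nd(values, tf.stack([row_idx, col_idx], axis=-1))
print(result)  # shape (2, 1), one sampled value per row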
So, I'm trying to learn TensorFlow, and for that I'm trying to create a classifier for something that, I think, is not so hard.
I'd like to predict if a number is odd or even.
The problem is that TensorFlow always predicts the same output. I have searched for answers over the last few days, but nothing helped me...
I saw the following answers:
- Tensorflow predicts always the same result
- TensorFlow always converging to same output for all items after training
- TensorFlow always return same result
Here's my code:
in:
df
nb y1
0 1 0
1 2 1
2 3 0
3 4 1
4 5 0
...
19 20 1
inputX = df.loc[:, ['nb']].as_matrix()
inputY = df.loc[:, ['y1']].as_matrix()
print(inputX.shape)
print(inputY.shape)
out:
(20, 1)
(20, 1)
in:
# Parameters
learning_rate = 0.00000001
training_epochs = 2000
display_step = 50
n_samples = inputY.size
x = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
y_values = tf.add(tf.matmul(x, W), b)
y = tf.nn.relu(y_values)
y_ = tf.placeholder(tf.float32, [None,1])
# Cost function: Mean squared error
cost = tf.reduce_sum(tf.pow(y_ - y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initialize variabls and tensorflow session
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
for i in range(training_epochs):
    sess.run(optimizer, feed_dict={x: inputX, y_: inputY})  # Take a gradient descent step using our inputs and labels
    # Display logs per epoch step
    if (i) % display_step == 0:
        cc = sess.run(cost, feed_dict={x: inputX, y_: inputY})
        print("Training step:", '%04d' % (i), "cost=", "{:.9f}".format(cc))  #, \"W=", sess.run(W), "b=", sess.run(b)
print ("Optimization Finished!")
training_cost = sess.run(cost, feed_dict={x: inputX, y_: inputY})
print ("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')
out:
Training step: 0000 cost= 0.250000000
Training step: 0050 cost= 0.250000000
Training step: 0100 cost= 0.250000000
...
Training step: 1800 cost= 0.250000000
Training step: 1850 cost= 0.250000000
Training step: 1900 cost= 0.250000000
Training step: 1950 cost= 0.250000000
Optimization Finished!
Training cost= 0.25 W= [[ 0.]] b= [ 0.]
in:
sess.run(y, feed_dict={x: inputX })
out:
array([[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.],
[ 0.]], dtype=float32)
I tried to play with my hyperparameters, like the learning rate or the number of training epochs.
I changed the activation function from softmax to relu.
I changed my dataframe to have more examples, but nothing happened.
I also tried random values for my weights, but nothing changed; the cost just started at a higher value.
From a quick look at the code, it looks OK to me (except maybe the part initializing the weights to zero; usually you want a small number different from zero to avoid a trivial solution), but I don't think you can fit the parity of integers with a linear regression.
The point is that you are trying to fit
x % 2
with predictions of the form
activation(x * w + b)
and there is no way to find good w and b to solve this problem.
Another way to understand this is to plot your data: the scatter plot of the parity of x is two horizontal lines of points, and the only way to fit them with a single line is a flat line (which will have a high cost anyway).
I think it would be better to change the data to start with, but if you want to address this problem, you could get some result using a sine or a cosine as the activation function.
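A minimal numpy check of the cosine idea on the integers from the question (this only shows that such an activation can represent parity, it is not a trained model):
import numpy as np

# cos(pi * x) is +1 for even integers and -1 for odd ones,
# so 0.5 - 0.5 * cos(pi * x) reproduces x % 2 exactly.
x = np.arange(1, 21)
print(np.allclose(0.5 - 0.5 * np.cos(np.pi * x), x % 2))  # True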
The main problem that I see is that you initialize the weights in the W matrix with zeros. The operation that you have in the linear layer is basically Wx + b, so the gradient with respect to x is W. If you start with zeros for W, then the gradient is 0 as well and you are not able to learn anything. Try to use random initial values, as stated on tensorflow.org:
# Create two variables.
weights = tf.Variable(tf.random_normal([784, 200], stddev=0.35),
name="weights")
biases = tf.Variable(tf.zeros([200]), name="biases")
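Applied to the shapes in the question (still TF 1.x style), that would be something like:
import tensorflow as tf

# Small random initial weights instead of zeros, same shapes as in the question.
W = tf.Variable(tf.random_normal([1, 1], stddev=0.35), name="weights")
b = tf.Variable(tf.zeros([1]), name="biases")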
First of all, I have to admit that I have never used TensorFlow, but I think you have a modelling problem here.
You are using the simplest network architecture possible (a 1-dimensional perceptron). You have two variables (w and b) which you want to learn, and your decision rule for the output looks roughly like
predict 1 if x * w + b > threshold, else 0
If you subtract b and divide by w, you get
x > (threshold - b) / w
So you are basically looking for a single threshold on x to separate odd and even numbers. No matter how you choose w and b, you will always misclassify about half of the numbers.
Although deciding whether a number is odd or even seems like a super trivial task for us humans, it is not for a single perceptron.
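A quick brute-force check on the integers from the question makes the point: no single threshold on x does much better than chance.
import numpy as np

x = np.arange(1, 21)
y = x % 2  # 1 = odd, 0 = even
best = 0.0
for t in np.arange(0.0, 21.0, 0.5):
    best = max(best,
               ((x > t).astype(int) == y).mean(),   # "odd above the threshold"
               ((x <= t).astype(int) == y).mean())  # "odd below the threshold"
print(best)  # 0.55, barely better than guessing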
I tried to build a simple MLP with an input layer (2 neurons), a hidden layer (5 neurons) and an output layer (1 neuron). I planned to train and feed it with [[0., 0.], [0., 1.], [1., 0.], [1., 1.]] to get the desired output of [0., 1., 1., 0.] (element-wise).
Unfortunately my code refuses to run. I keep getting dimensionality errors no matter what I try. Quite frustrating :/ I think I'm missing something, but I cannot figure out what is wrong.
For better readability I also uploaded the code to a pastebin: code
Any ideas?
import tensorflow as tf
#####################
# preparation stuff #
#####################
# define input and output data
input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]] # XOR input
output_data = [0., 1., 1., 0.] # XOR output
# create a placeholder for the input
# None indicates a variable batch size for the input
# one input's dimension is [1, 2]
n_input = tf.placeholder(tf.float32, shape=[None, 2])
# number of neurons in the hidden layer
hidden_nodes = 5
################
# hidden layer #
################
b_hidden = tf.Variable(0.1) # hidden layer's bias neuron
W_hidden = tf.Variable(tf.random_uniform([hidden_nodes, 2], -1.0, 1.0)) # hidden layer's weight matrix
# initialized with a uniform distribution
hidden = tf.sigmoid(tf.matmul(W_hidden, n_input) + b_hidden) # calc hidden layer's activation
################
# output layer #
################
W_output = tf.Variable(tf.random_uniform([hidden_nodes, 1], -1.0, 1.0)) # output layer's weight matrix
output = tf.sigmoid(tf.matmul(W_output, hidden)) # calc output layer's activation
############
# learning #
############
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(output, n_input) # calc cross entropy between current
# output and desired output
loss = tf.reduce_mean(cross_entropy) # mean the cross_entropy
optimizer = tf.train.GradientDescentOptimizer(0.1) # take a gradient descent for optimizing with a "stepsize" of 0.1
train = optimizer.minimize(loss) # let the optimizer train
####################
# initialize graph #
####################
init = tf.initialize_all_variables()
sess = tf.Session() # create the session and therefore the graph
sess.run(init) # initialize all variables
# train the network
for epoch in xrange(0, 201):
    sess.run(train)  # run the training operation
    if epoch % 20 == 0:
        print("step: {:>3} | W: {} | b: {}".format(epoch, sess.run(W_hidden), sess.run(b_hidden)))
EDIT: I am still getting errors :/
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)
outputs line 27 (...) ValueError: Dimensions Dimension(2) and Dimension(5) are not compatible. Altering the line to:
hidden = tf.sigmoid(tf.matmul(W_hidden, n_input) + b_hidden)
seems to be working, but then the error appears in:
output = tf.sigmoid(tf.matmul(hidden, W_output))
telling me: line 34 (...) ValueError: Dimensions Dimension(2) and Dimension(5) are not compatible
Turning the statement to:
output = tf.sigmoid(tf.matmul(W_output, hidden))
also throws an exception: line 34 (...) ValueError: Dimensions Dimension(1) and Dimension(5) are not compatible.
EDIT2: I do not really understand this. Shouldn't hidden be W_hidden x n_input.T, since dimensionally that would be (5, 2) x (2, 1)? If I transpose n_input, hidden still works (I don't even get why it works without a transpose at all). However, output keeps throwing errors, even though dimensionally that operation should be (1, 5) x (5, 1)?!
(0) It's helpful to include the error output - it's also a useful thing to look at, because it does identify exactly where you were having shape problems.
(1) The shape errors arose because you have the arguments to matmul backwards in both of your matmuls, and the tf.Variable shapes backwards as well. The general rule is that the weights for a layer with input_size inputs and output_size outputs should have shape [input_size, output_size], and the matmul should be tf.matmul(input_to_layer, weights_for_layer) (and then add the biases, which have shape [output_size]).
So with your code,
W_hidden = tf.Variable(tf.random_uniform([hidden_nodes, 2], -1.0, 1.0))
should be:
W_hidden = tf.Variable(tf.random_uniform([2, hidden_nodes], -1.0, 1.0))
and
hidden = tf.sigmoid(tf.matmul(W_hidden, n_input) + b_hidden)
should be tf.matmul(n_input, W_hidden); and
output = tf.sigmoid(tf.matmul(W_output, hidden))
should be tf.matmul(hidden, W_output)
(2) Once you've fixed those bugs, your run needs to be fed a feed_dict:
sess.run(train)
should be:
sess.run(train, feed_dict={n_input: input_data})
At least, I presume that this is what you're trying to achieve.
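Pulling those three fixes together (weight shapes, matmul order, feed_dict), the forward pass would look roughly like this, keeping everything else from the question unchanged:
import tensorflow as tf  # TF 1.x API, as in the question

n_input = tf.placeholder(tf.float32, shape=[None, 2])
hidden_nodes = 5

b_hidden = tf.Variable(0.1)
W_hidden = tf.Variable(tf.random_uniform([2, hidden_nodes], -1.0, 1.0))  # [input_size, output_size]
hidden = tf.sigmoid(tf.matmul(n_input, W_hidden) + b_hidden)             # matmul(input, weights)

W_output = tf.Variable(tf.random_uniform([hidden_nodes, 1], -1.0, 1.0))
output = tf.sigmoid(tf.matmul(hidden, W_output))

# ... loss/optimizer as in the question, then feed the input when training:
# sess.run(train, feed_dict={n_input: input_data})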