I am attempting to replicate a deep convolutional neural network from a research paper. I have implemented the architecture, but after 10 epochs my cross-entropy loss suddenly increases to infinity, as shown in the chart below. You can ignore what happens to the accuracy after the problem occurs.
Here is the GitHub repository, with a picture of the architecture.
After doing some research, I think the AdamOptimizer or ReLU might be the problem.
# Inputs: 7168 values per example; labels: one-hot over 3 classes per position
x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])

# Many convolutions and ReLUs omitted

final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)  # dropout keep probability

# weight_variable / bias_variable are the usual initializer helpers
W_final = weight_variable([7168, 7168, 3])
b_final = bias_variable([7168, 3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
EDIT
If anyone is interested, the solution was that I was basically feeding in incorrect data.
Solution: Control the solution space. This might mean using smaller datasets when training, it might mean using fewer hidden nodes, it might mean initializing your weights and biases differently. Your model is reaching a point where the loss is undefined, which might be due to the gradient being undefined or to the final_conv signal blowing up.
Why: Sometimes, no matter what you do, a numerical instability is reached. Eventually adding a machine epsilon to prevent dividing by zero (in the cross-entropy loss here) just won't help, because even then the number cannot be accurately represented by the precision you are using. (Ref: https://en.wikipedia.org/wiki/Round-off_error and https://floating-point-gui.de/basic/)
Considerations:
1) When tweaking epsilons, be consistent with your data type: use the machine epsilon of the precision you are working in. For float32 that is roughly 1.19e-7 (ref: https://en.wikipedia.org/wiki/Machine_epsilon and the "python numpy machine epsilon" question).
2) Just in case others reading this are confused: the value in the constructor for AdamOptimizer is the learning rate, but you can also set the epsilon value (ref: "How does paramater epsilon affects AdamOptimizer?" and https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer).
3) Numerical instability in tensorflow is real, and it's difficult to get around. Yes, there is tf.nn.softmax_cross_entropy_with_logits, but it is too specific (what if you don't want a softmax?). Refer to Vahid Kazemi's 'Effective Tensorflow' for an insightful explanation: https://github.com/vahidk/EffectiveTensorflow#entropy
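To make point 1) concrete, the machine epsilon for a given precision can be looked up with numpy rather than hard-coded:

import numpy as np

# Machine epsilon depends on the precision you train with
print(np.finfo(np.float32).eps)  # ~1.19e-07
print(np.finfo(np.float64).eps)  # ~2.22e-16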
That jump in your loss graph is very weird...
I would like you to focus on a few points:
If your images are not normalized between 0 and 1, then normalize them.
If you have normalized your values between -1 and 1, then use a sigmoid layer instead of softmax, because softmax squashes the values between 0 and 1.
Before using softmax, add a sigmoid layer to squash your values (highly recommended).
Another thing you can do is add dropout to every layer.
I would also suggest using gradient clipping (tf.clip_by_value or tf.clip_by_global_norm) so that your gradients do not explode; see the sketch after this list.
You can also use L2 regularization.
Also experiment with the learning rate and epsilon of AdamOptimizer.
I would also suggest using TensorBoard to keep track of the weights, so that you will know where the weights are exploding.
You can also use TensorBoard to keep track of loss and accuracy.
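A minimal sketch of the gradient-clipping suggestion above (TF1-style, assuming cross_entropy is the loss from the question; the clip norm of 5.0 is just an example value):

optimizer = tf.train.AdamOptimizer(learning_rate=1e-5)
# Clip the global norm of all gradients before applying them
grads, variables = zip(*optimizer.compute_gradients(cross_entropy))
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
train_step = optimizer.apply_gradients(zip(clipped_grads, variables))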
See the softmax formula below:

softmax(x_i) = exp(x_i) / sum_j exp(x_j)

Probably the x inside that exp(x) is becoming a very large number, because of which softmax is giving infinity and hence the loss is infinity.
Use TensorBoard heavily to debug, and print the values of the softmax so that you can figure out where you are going wrong.
One more thing I noticed: you are not using any kind of activation function after the convolution layers. I would suggest adding a leaky ReLU after every convolution layer.
Your network is a humongous network, and it is important to use leaky ReLU as the activation function so that it adds non-linearity and hence improves performance.
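For example, a one-line sketch (conv_out stands in for any of your convolution outputs; tf.nn.leaky_relu is available from TF 1.4):

# Add non-linearity after each convolution; alpha is the slope for negative inputs
activated = tf.nn.leaky_relu(conv_out, alpha=0.2)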
You may want to use a different value for epsilon in the Adam optimizer (e.g. 0.1 or 1.0). This is mentioned in the documentation:
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
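For example, reusing the learning rate and cross_entropy loss from the question:

train_step = tf.train.AdamOptimizer(
    learning_rate=1e-5, epsilon=1.0).minimize(cross_entropy)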
Related
I'm creating a Deep Neural Network for linear regression. The net has 3 hidden layers with 256 units per layer. Here is the model:
Each unit uses ReLU as its activation function. I also used early stopping to make sure it doesn't overfit.
The target is an integer, and in the training set its values go from 0 to 7860.
After the training I got the following losses:
train_MSE = 33640.5703, train_MAD = 112.6294,
val_MSE = 53932.8125, val_MAD = 138.7836,
test_MSE = 52595.9414, test_MAD= 137.2564
I've tried many different configurations of the net (different optimizers, loss functions, normalizations, regularizers...) but nothing seems to help me reduce the loss further. Even if the training error decreases, the test error never goes below MAD = 130.
Here's the behavior of my net:
My question is whether there's a way to improve my DNN to make more accurate predictions, or whether this is the best I can achieve with my dataset.
If your problem is linear by nature, meaning the real function behind your data is of the form y = a*x + b + epsilon, where the last term is just random noise, then you won't get any better than fitting the underlying function y = a*x + b. Fitting epsilon would only result in loss of generalization over new data.
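A quick numpy sketch of that point (synthetic data; a, b, and the noise level are arbitrary choices): a least-squares fit recovers a and b, but the MSE cannot go below the noise variance.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 10000)
y = 3.0 * x + 2.0 + rng.normal(0, 5.0, x.shape)  # y = a*x + b + epsilon

# Least-squares fit of y = a*x + b
a, b = np.polyfit(x, y, deg=1)
mse = np.mean((y - (a * x + b)) ** 2)
print(a, b)  # close to 3.0 and 2.0
print(mse)   # close to the noise variance, 5.0**2 = 25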
You can try many different things to improve a DNN:
Increase the number of hidden layers
Scale or normalize your data (see the sketch after this list)
Try a rectified linear unit as activation
Take more data
Change learning algorithm parameters like the learning rate
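A minimal sketch of the scaling point (the random X_train and y_train here just stand in for your real data; fit the statistics on the training set only):

import numpy as np

# Placeholder data with roughly your target range
X_train = np.random.rand(1000, 20) * 100.0
y_train = np.random.rand(1000) * 7860.0

# Standardize features using training-set statistics only
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_scaled = (X_train - mu) / sigma

# Scaling a wide-range target (0..7860 here) also helps gradient-based training;
# invert the transform when reporting predictions
y_mu, y_sigma = y_train.mean(), y_train.std()
y_train_scaled = (y_train - y_mu) / y_sigma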
I have a neural network with 30 input nodes, 1 hidden node, and 1 output node. I am training it on a dataset where the inputs are 30-dimensional vectors with entries between -1 and 1, and the targets are the 2nd entry of these vectors.
I expect the network to train and learn to output the 2nd entry of the input vector quickly, since this is as simple as decreasing the weights in the network which connect the input nodes to the hidden node to zero except the one for the 2nd entry.
However, the loss stalls quickly at approximately 0.168. I'd expect it to quickly go to zero, which is the case if the targets are just 0.
The following code showcases the problem with a randomised dataset.
import numpy as np
from tensorflow.keras import models
from tensorflow.keras import layers
import tensorflow as tf
np.random.seed(123)
dataSize = 100000
xdata = np.zeros((dataSize, 30))
ydata = np.zeros((dataSize))
for i in range(dataSize):
    vec = (np.random.rand(30) * 2) - 1
    xdata[i] = vec
    ydata[i] = vec[1]
model = models.Sequential()
model.add(layers.Dense(1, activation="relu", input_shape=(30, )))
model.add(layers.Dense(1, activation="sigmoid"))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
lossObject = tf.keras.losses.MeanSquaredError()
model.compile(optimizer=optimizer, loss=lossObject)
model.fit(xdata, ydata, epochs=200, batch_size=32)
I have tried multiple different optimizers, loss functions, batch sizes, dataset sizes and learning rates; however, the result is always the loss stalling at a relatively high value.
Why is this happening? I am not interested in responses asking why I am doing this. I am new to neural networks and I need to understand why this is happening before I can continue with my original task.
Thank you in advance.
Your targets are between -1 and 1, but a sigmoid output activation limits outputs to [0, 1], making it impossible to achieve zero loss if any targets happen to be < 0 (which is very likely with a large dataset). You could fix it by using tanh as activation instead, which maps to [-1, 1], or just using no activation in the output layer should be fine in this case. When you fix all targets to 0, this is obviously not an issue, and (almost) zero loss can be achieved.
As a general lesson: Always make sure your output activation makes sense with regards to your target data. At the very least, the value ranges should be identical -- although this might not be a sufficient condition for a good output activation.
As a second point: Having a single node with relu activation is also a bad idea. If the input to relu is < 0, the output will be 0, and the gradient will be, as well. In this case, no learning is possible and incorrect outputs for some data points may never be corrected.
It is generally not a problem if some units are 0 some of the time because the gradient can flow through other paths, but with only one unit, this will likely cripple learning as well. I would recommend that you either use more units in the hidden layer, or use a different activation function.
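Putting both points together, a minimal sketch of a fixed version of the model from the question (reusing the question's imports; the hidden-layer width of 32 is an arbitrary choice):

model = models.Sequential()
# Wider hidden layer: a few dead ReLUs can no longer cripple learning
model.add(layers.Dense(32, activation="relu", input_shape=(30,)))
# Linear output: targets in [-1, 1] lie outside sigmoid's [0, 1] range
model.add(layers.Dense(1, activation=None))
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss=tf.keras.losses.MeanSquaredError())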
I want to train my neural network (in Keras) with an additional condition on the output elements.
An example:
Minimize my loss function MSE between network output y_pred and y_true.
Additionally, ensure that the norm of y_pred is less than or equal to 1.
Without the condition, the task is straightforward.
Note: The condition is not necessarily the vector norm of y_pred.
How can I implement the additional condition/restriction in a Keras (or maybe Tensorflow) model?
In principle, tensorflow (and keras) don't allow you to add hard constraints to your model.
You have to convert your invariant (norm <= 1) into a penalty function that is added to the loss. This could look like this:
y_norm = tf.norm(y_pred)
norm_loss = tf.where(y_norm > 1, y_norm, 0.0)  # penalty is active only when the norm exceeds 1
total_loss = mse + norm_loss
Look at the docs of tf.where. If your prediction has a norm bigger than one, backpropagation tries to minimize the norm. If it is less than or equal, this part of the loss is simply 0 and no gradient is produced.
But this can be very hard to optimize: your predictions could oscillate around a norm of 1. It is also possible to add a factor, e.g. total_loss = mse + 1000 * norm_loss. Be very careful with this; it makes optimization even harder.
In the example above, the norm above one contributes linearly to the loss. This is called l1-regularization. You could also square it, which would become l2-regularization.
In your specific case, you could get creative. Why not normalize your predictions and the targets to one (just a suggestion, might be a bad idea)?
loss = mse(y_pred / tf.norm(y_pred), y_target / tf.norm(y_target))
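For completeness, here is a sketch of the penalty approach wired into a Keras custom loss (the penalty weight of 1.0 is an assumption you would tune):

import tensorflow as tf

def mse_with_norm_penalty(y_true, y_pred):
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    y_norm = tf.norm(y_pred, axis=-1)
    # Penalize only predictions whose norm exceeds 1
    norm_loss = tf.reduce_mean(tf.where(y_norm > 1.0, y_norm, tf.zeros_like(y_norm)))
    return mse + 1.0 * norm_loss

# model.compile(optimizer="adam", loss=mse_with_norm_penalty)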
I have the following function which is supposed to autoencode my data.
My data can be thought of as an image of length 100, width 2, with 2 channels: (100, 2, 2).
def construct_ae(input_shape):
    encoder_input = tf.placeholder(tf.float32, input_shape, name='x')
    with tf.variable_scope("encoder"):
        flattened = tf.layers.flatten(encoder_input)
        e_fc_1 = tf.layers.dense(flattened, units=150, activation=tf.nn.relu)
        encoded = tf.layers.dense(e_fc_1, units=75, activation=None)
    with tf.variable_scope("decoder"):
        d_fc_1 = tf.layers.dense(encoded, 150, activation=tf.nn.relu)
        d_fc_2 = tf.layers.dense(d_fc_1, 400, activation=None)
        decoded = tf.reshape(d_fc_2, input_shape)
    with tf.variable_scope('training'):
        loss = tf.losses.mean_squared_error(labels=encoder_input, predictions=decoded)
        cost = tf.reduce_mean(loss)
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)
    return optimizer
I'm running into the issue where my cost is on the order of 1.1e9, and it's not decreasing over time.
I visualized the gradients (removed the code because it would just clutter things) and I think something is wrong there, but I'm not sure.
Questions
1) Does anything in the construction of the network look incorrect?
2) Does the data need to be normalized between 0-1?
3) I hit NaNs sometimes when I try increasing the learning rate to 1. Is that indicative of anything?
4) I think I should probably use a CNN but I ran into the same issues so I thought I'd move to an FC since it's likely easier to debug.
5) I imagine I'm using the wrong loss function but I can't really find any papers regarding the right loss to use. If anyone can direct me to one I'd be very appreciative
Given that this is a plain autoencoder and not a convolutional one, you shouldn't expect good (low) error rates.
Normalizing does get you faster convergence. Given that your final layer does not have an activation function that enforces a range on the output, it shouldn't strictly be a problem. However, do try normalizing your data to [0, 1] and then using a sigmoid activation in the last decoder layer.
A very high learning rate may get you stuck in an optimization loop and/or take you too far from any local minimum, thus leading to extremely high error rates.
Most blogs (like Keras) use 'binary_crossentropy' as their loss function, but MSE isn't "wrong".
As far as the high starting error is concerned, it all depends on your parameters' initialization. A good initialization technique gets you starting errors that are not too far from a desired minimum. However, the default random or zeros-based initialization almost always leads to such scenarios.
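A minimal sketch of the normalize-to-[0, 1]-plus-sigmoid suggestion above (data is a placeholder name for your (N, 100, 2, 2) array; only the last decoder layer changes):

# Scale inputs to [0, 1] using training-set statistics
data_min, data_max = data.min(), data.max()
data_scaled = (data - data_min) / (data_max - data_min)

# ...and give the final decoder layer a matching sigmoid activation
d_fc_2 = tf.layers.dense(d_fc_1, 400, activation=tf.nn.sigmoid)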
I have written some code to implement backpropagation in a deep neural network with the logistic activation function and softmax output.
def backprop_deep(node_values, targets, weight_matrices):
    # Output-layer error (softmax + cross-entropy gradient)
    delta_nodes = node_values[-1] - targets
    delta_weights = delta_nodes.T.dot(node_values[-2])
    weight_updates = [delta_weights]
    # Walk backwards through the hidden layers; [:, :-1] drops the bias column
    for i in range(-2, -len(weight_matrices) - 1, -1):
        delta_nodes = dsigmoid(node_values[i][:, :-1]) * delta_nodes.dot(weight_matrices[i+1])[:, :-1]
        delta_weights = delta_nodes.T.dot(node_values[i-1])
        weight_updates.insert(0, delta_weights)
    return weight_updates
The code works well, but when I switched to ReLU as the activation function it stopped working. In the backprop routine I only change the derivative of the activation function:
def backprop_relu(node_values, targets, weight_matrices):
    delta_nodes = node_values[-1] - targets
    delta_weights = delta_nodes.T.dot(node_values[-2])
    weight_updates = [delta_weights]
    for i in range(-2, -len(weight_matrices) - 1, -1):
        # ReLU derivative: 1 where the node value is positive, 0 elsewhere
        delta_nodes = (node_values[i] > 0)[:, :-1] * delta_nodes.dot(weight_matrices[i+1])[:, :-1]
        delta_weights = delta_nodes.T.dot(node_values[i-1])
        weight_updates.insert(0, delta_weights)
    return weight_updates
However, the network no longer learns, and the weights quickly go to zero and stay there. I am totally stumped.
Although I have determined the source of the problem, I'm going to leave this up in case it might be of benefit to someone else.
The problem was that I did not adjust the scale of the initial weights when I changed activation functions. While logistic networks learn very well when node inputs are near zero and the logistic function is approximately linear, ReLU networks learn well for moderately large inputs to nodes. The small weight initialization used in logistic networks is therefore not necessary, and in fact harmful. The behavior I was seeing was the ReLU network ignoring the features and attempting to learn the bias of the training set exclusively.
I am currently using initial weights distributed uniformly from -.5 to .5 on the MNIST dataset, and it is learning very quickly.
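For reference, a common recipe for scaling initial weights in ReLU networks is He initialization; this numpy sketch is my addition, not part of the original fix (fan_in is the number of inputs to the layer):

import numpy as np

def he_init(fan_in, fan_out, rng=np.random.default_rng()):
    # Variance 2 / fan_in keeps ReLU activations at a usable magnitude
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W = he_init(784, 100)  # e.g. an MNIST input layer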