NaN loss when training regression network - python

I have a data matrix in "one-hot encoding" (all ones and zeros) with 260,000 rows and 35 columns. I am using Keras to train a simple neural network to predict a continuous variable. The code to make the network is the following:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping

model = Sequential()
model.add(Dense(1024, input_shape=(n_train,)))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.1))
model.add(Dense(1))
sgd = SGD(lr=0.01, nesterov=True)
#rms = RMSprop()
#model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
model.compile(loss='mean_absolute_error', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1, validation_data=(X_test, Y_test), callbacks=[EarlyStopping(monitor='val_loss', patience=4)])
However, during the training process, I see the loss decrease nicely, but during the middle of the second epoch, it goes to nan:
Train on 260000 samples, validate on 64905 samples
Epoch 1/3
260000/260000 [==============================] - 254s - loss: 16.2775 - val_loss: 13.4925
Epoch 2/3
88448/260000 [=========>....................] - ETA: 161s - loss: nan
I tried using RMSProp instead of SGD, I tried tanh instead of relu, I tried with and without dropout, all to no avail. I tried with a smaller model, i.e. with only one hidden layer, and had the same issue (it becomes nan at a different point). However, it does work with fewer features, i.e. if there are only 5 columns, and gives quite good predictions. It seems like there is some kind of overflow, but I can't imagine why; the loss is not unreasonably large at all.
Python version 2.7.11, running on a Linux machine, CPU only. I tested with the latest version of Theano and also get NaNs, so I tried going back to Theano 0.8.2 and had the same problem. The latest version of Keras has the same problem, and so does the 0.3.2 version.

Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the nans).
Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.
Here are some things you could potentially try:
Normalize your outputs by quantile normalizing or z-scoring. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5). A short sketch of this is shown right after this list.
Add regularization, either by increasing the dropout rate or adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may help here as well.
If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35) so it may help.
Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
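For reference, a minimal sketch of z-scoring the regression targets with statistics computed on the training split only (variable names Y_train, Y_test, X_test and model follow the question; this is an illustration, not the answerer's exact code):
y_mean = Y_train.mean()
y_std = Y_train.std()

# Train on z-scored targets so the network predicts values on a roughly unit scale
Y_train_scaled = (Y_train - y_mean) / y_std
Y_test_scaled = (Y_test - y_mean) / y_std

# At prediction time, undo the transformation to get back to the original units
preds = model.predict(X_test).ravel() * y_std + y_mean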

The answer by 1" is quite good. However, all of the fixes seem to address the issue indirectly rather than directly. I would recommend using gradient clipping, which will clip any gradients that are above a certain value.
In Keras you can use clipnorm=1 (see https://keras.io/optimizers/) to simply clip all gradients with a norm above 1.
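A minimal sketch of what that looks like with the optimizer from the question (clipnorm is a standard keyword argument on Keras optimizers; clipvalue works similarly):
from keras.optimizers import SGD

# Clip gradients whose norm exceeds 1 before the update step
sgd = SGD(lr=0.01, nesterov=True, clipnorm=1.)
model.compile(loss='mean_absolute_error', optimizer=sgd)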

I faced the same problem before. I searched and found this question and its answers. All of the tricks mentioned above are important for training a deep neural network. I tried them all, but still got NaN.
I also found this issue: https://github.com/fchollet/keras/issues/2134.
I quote the author's summary here:
I wanted to point this out so that it's archived for others who may experience this problem in the future. I was running into my loss function suddenly returning a nan after it got so far into the training process. I checked the relus, the optimizer, the loss function, my dropout in accordance with the relus, the size of my network and the shape of the network. I was still getting loss that eventually turned into a nan and I was getting quite frustrated.
Then it dawned on me. I may have some bad input. It turns out, one of the images that I was handing to my CNN (and doing mean normalization on) was nothing but 0's. I wasn't checking for this case when I subtracted the mean and normalized by the std deviation and thus I ended up with an exemplar matrix which was nothing but nan's. Once I fixed my normalization function, my network now trains perfectly.
I agree with the above viewpoint: the network is sensitive to its input. In my case, I used the log value of a density estimate as an input. Its absolute value could be very large, which resulted in NaN after several gradient steps. I think an input check is necessary: first, make sure the input does not include -inf or inf, or some extremely large numbers in absolute value.
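A minimal sketch of such a sanity check with NumPy (X and y are placeholders for your training arrays; the max_abs threshold is an arbitrary choice):
import numpy as np

def check_inputs(X, y, max_abs=1e6):
    # Everything should be finite: no NaN and no +/-inf
    assert np.isfinite(X).all(), "X contains NaN or inf"
    assert np.isfinite(y).all(), "y contains NaN or inf"
    # Flag suspiciously large magnitudes that can blow up the first updates
    print("max |X| =", np.abs(X).max(), " max |y| =", np.abs(y).max())
    assert np.abs(X).max() < max_abs and np.abs(y).max() < max_abs, "input values are extremely large"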

I faced the same problem when using an LSTM. The problem was that my data had some NaN values after standardization, so we should check the model's input data after standardization. You will see whether there are any NaN values with:
print(np.any(np.isnan(X_test)))
print(np.any(np.isnan(y_test)))
You can solve this by adding a small value (0.000001) to the std, like this:
import numpy as np

def standardize(train, test):
    mean = np.mean(train, axis=0)
    std = np.std(train, axis=0) + 0.000001
    X_train = (train - mean) / std
    X_test = (test - mean) / std
    return X_train, X_test

To sum up the different solutions mentioned here and from this github discussion, which would depend of course on your particular situation (a short sketch of a couple of these follows the list):
Add regularization, i.e. L1 or L2 penalties on the weights. If regularization is already in place, try a smaller L2 penalty, e.g. l2(0.001), or remove it entirely.
Try a smaller Dropout rate.
Clip the gradients to prevent their explosion. For instance in Keras you could use clipnorm=1. or clipvalue=1. as parameters for your optimizer.
Check the validity of your inputs (no NaNs, or sometimes unexpected 0s), e.g. with df.isnull().any().
Replace the optimizer with Adam, which is easier to tune. Sometimes replacing SGD with RMSprop also helps.
Use RMSProp with heavy regularization to prevent gradient explosion.
Try normalizing your data, or inspect your normalization process for any bad values introduced.
Verify that you are using the right activation function (e.g. using a softmax instead of sigmoid for multiple class classification).
Try to increase the batch size (e.g. 32 to 64 or 128) to increase the stability of your optimization.
Try decreasing your learning rate.
Check the size of your last batch which may be different from the batch size.
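A minimal sketch combining two of these suggestions, an input validity check with pandas and a compile step that swaps SGD for Adam with a reduced learning rate (variable names follow the question; treat this as an illustration):
import pandas as pd
from keras.optimizers import Adam

# Quick validity check of the inputs before training
print(pd.DataFrame(X_train).isnull().any().any())   # should print False

# Adam with a deliberately small learning rate
model.compile(loss='mean_absolute_error', optimizer=Adam(lr=1e-4))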

I faced a very similar problem, and this is how I got it to run.
The first thing you can try is changing your activation to LeakyReLU instead of using ReLU or tanh. The reason is that often, many of the nodes within your layers have an activation of zero, and backpropagation doesn't update the weights for these nodes because their gradient is also zero. This is also called the 'dying ReLU' problem (you can read more about it here: https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks).
To do this, you can import the LeakyReLU activation using:
from keras.layers.advanced_activations import LeakyReLU
and incorporate it within your layers like this:
model.add(Dense(800,input_shape=(num_inputs,)))
model.add(LeakyReLU(alpha=0.1))
Additionally, it is possible that the output feature (the continuous variable you are trying to predict) is an imbalanced data set and has too many 0s. One way to fix this issue is to use smoothing. You can do this by adding 1 to the numerator of all your values in this column and dividing each of the values in this column by 1/(average of all the values in this column)
This essentially shifts all the values from 0 to a value greater than 0 (which may still be very small). This prevents the curve from predicting 0s and minimizing the loss (eventually making it NaN). Smaller values are more greatly impacted than larger values, but on the whole, the average of the data set remains the same.

I had the same problem; I was using Keras for a multivariate regression problem. What I later realised was that some values in my dataset were NaN, and that led to a NaN loss.
I used the command:
df=df.dropna()
And it resolved my issue.

I tried every suggestion on this page and many others to no avail. We were importing csv files with pandas, then using the keras Tokenizer with text input to create vocabularies and word vector matrices. After noticing that some CSV files led to nan while others worked, we suddenly looked at the encoding of the files and realized that ascii files were NOT working with keras, leading to nan loss and an accuracy of 0.0000e+00; however, utf-8 and utf-16 files were working! Breakthrough.
If you're performing textual analysis and getting nan loss after trying these suggestions, use file -i {input} (linux) or file -I {input} (osx) to discover your file type. If you have ISO-8859-1 or us-ascii, try converting to utf-8 or utf-16le. Haven't tried the latter but I'd imagine it would work as well. Hopefully this helps someone very very frustrated!
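If you would rather handle the conversion in Python, a hedged sketch of re-encoding a file with pandas (the encoding names and file paths are examples; use whatever file -i reports for your data):
import pandas as pd

# Read with the encoding the file actually uses, then re-save as UTF-8
df = pd.read_csv("input.csv", encoding="ISO-8859-1")
df.to_csv("input_utf8.csv", encoding="utf-8", index=False)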

I was getting the loss as nan in the very first epoch, as soon as the training started. A solution as simple as removing the NaNs from the input data worked for me (df.dropna()).
I hope this helps someone encountering a similar problem.

I had the same problem with my RNN with keras LSTM layers, so I tried each solution from above. I had already scaled my data (with sklearn.preprocessing.MinMaxScaler), and there were no NaN values in my data after scaling. Solutions like using LeakyReLU or changing the learning rate didn't help.
So I decided to change the scaler from MinMaxScaler to StandardScaler. I had no NaN values and found it odd, but it worked!
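A minimal sketch of that swap with scikit-learn, fitting the scaler on the training split only (shown for a 2-D feature matrix; sequence data would need reshaping first, and the variable names are placeholders):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics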

In my case the issue was that I had copy-pasted my previous work for binary classification and used a sigmoid activation on the output layer instead of softmax (the new network was for multiclass classification).

I had a similar problem using keras. The loss turned into NaN after the second batch was fed in.
I tried to:
Use softmax as activation of output dense layer
Drop nan in the input
Normalize the input
However, that didn't work. So, then I tried to:
Decrease the learning rate
Problem solved.

Check your data for NaN values. Removing the NaN values solved the problem for me.

I had this issue when one of my training data entries contained a NaN.

I had a similar issue with my logloss, MAE and other metrics all being NAs. I looked into the data and found I had a few features with NAs in them. I imputed the NAs with approximate values and was able to solve the issue.

I had the same problem with my Keras CNN. Like others, I tried all of the above solutions: decreasing the learning rate, dropping nulls from the training data, normalizing the data, adding dropout layers, and so on,
but none of them solved the NaN problem. Then I changed the activation function in the classifier (last) layer from sigmoid to softmax. It worked!
Try changing the activation function of the last layer to softmax!

I was getting the same thing when I tried creating a bounding box regressor.
My neural net had larger layers than yours. I increased the dropout value and got suitable results.

I was getting NaN for my classification network.
Answering here as it might help someone.
I had made a blunder:
the number of classes in the training labels was 5, i.e. from 0 to 4,
but the last dense layer of the classifier had 4 nodes, which means 4 classes, and that was the issue.
Changing the number of nodes in the last layer of the network to 5 solved the issue for me.

I had a similar issue and I tried changing my activations from sigmoid to softmax and from ReLU to LeakyReLU, and the problem was resolved. So I guess as long as there is no NaN in the input to begin with, and you have tried lowering your learning rate, the viable solution is to play with your activations!

My situation:
Train Loss: nan, Train Accuracy: 0.0, Validation Loss: nan, Validation Accuracy: 0.0
Later I found out it was because my labels were 1, 2, 3, 4, not starting with 0.
So I relabeled them, using 0, 1, 2, 3 instead of 1, 2, 3, 4 as labels.
Problem solved!
Hope my answer helps!

In Keras, class labels begin from 0. If, for example, you have 7 classes, either label them from 0 through 6 and give the last dense layer (with the softmax activation function) units=7, or, if your data must be labeled from 1 through 7, set units=8 in the last dense layer.
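A minimal sketch of the first option, assuming integer labels 0 through 6 and sparse categorical cross-entropy (n_features and the hidden layer size are placeholders, not the answerer's code):
from keras.models import Sequential
from keras.layers import Dense

num_classes = 7  # labels are the integers 0..6

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(n_features,)))
model.add(Dense(num_classes, activation='softmax'))  # one output unit per class
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])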

I was getting NaN values for binary classification. Then I changed the loss function from categorical cross-entropy to binary cross-entropy, and it worked fine.
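For reference, a minimal sketch of the binary setup, a single sigmoid output paired with binary cross-entropy (n_features and the hidden layer size are placeholders):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))  # single probability output
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])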

BTW, it seems to be a dying gradient here, NOT an exploding one.
A ReLU neuron dies when its input is negative for all training instances.
Here the 'adam' optimizer helped against NaNs.
But concerning your situation: be sure you have a scaled dataset and use loss='mean_squared_error' (as opposed to yours):
model.compile(optimizer='adam', loss=keras.losses.mean_squared_error, metrics=[keras.metrics.mean_absolute_error])

This error may occur when your data has problems like:
Input contains NaN, infinity or a value too large for dtype('float64').
In these cases, you can solve it by removing the NaN values, e.g.:
df = df.dropna()
or with any of the fillna() methods.
Note: these methods work for pandas data frames only.

Remove the rows containing NaN values, or replace the NaN values with some other value from your dataset or dataframe. That will solve the error.

I got the same issue. You can successfully use Keras for regression. Converting all my data to rounded numbers solved the issue, e.g. 23.43 to 23.

I had the same problem. Examining the data, I realized that an error had occurred during data acquisition.

Related

Why does my LSTM model predict wrong values although the loss is decreasing?

I am trying to build a machine learning model which predicts a single number from a series of numbers. I am using an LSTM model with Tensorflow.
You can imagine my dataset to look something like this:
Index | x data                     | y data
0     | np.array(shape (10000,1))  | numpy.float32
1     | np.array(shape (10000,1))  | numpy.float32
2     | np.array(shape (10000,1))  | numpy.float32
...   | ...                        | ...
56    | np.array(shape (10000,1))  | numpy.float32
Easily said I just want my model to predict a number (y data) from a sequence of numbers (x data).
For example like this:
array([3.59280851, 3.60459062, 3.60459062, ...]) => 2.8989773
array([3.54752101, 3.56740332, 3.56740332, ...]) => 3.0893357
...
x and y data
From my x data I created a numpy array x_train which I want to use to train the network.
Because I am using an LSTM network, x_train should be of shape (samples, time_steps, features).
I reshaped my x_train array to be shaped like this: (57, 10000, 1), because I have 57 samples, which each are of length 10000 and contain a single number.
The y data was created similarly and is of shape (57,1) because, once again, I have 57 samples which each contain a single number as the desired y output.
Current model attempt
My model summary looks like this:
The model was compiled with model.compile(loss="mse", optimizer="adam") so my loss function is simply the mean squared error and as an optimizer I'm using Adam.
Current results
Training of the model works fine and I can see that the loss and validation loss decreases after some epochs.
The actual problem occurs when I want to predict some data y_verify from some data x_verify.
I do this after the training is finished to determine how well the model is trained.
In the following example I simply used the data I used for training to determine how well the model is trained (I know about overfitting and that verifying with the training set is not the right way of doing it, but that is not the problem I want to demonstrate right now).
In the following graph you can see the y data I provided to the model in blue.
The orange line is the result of calling model.predict(x_verify) where x_verify is of the same shape as x_train.
I also calculated the mean absolute percentage error (MAPE) of my prediction and the actual data and it came out to be around 4% which is not bad, because I only trained for 40 epochs. But this result still is not helpful at all because as you can see in the graph above the curves do not match at all.
Question:
What is going on here?
Am I using an incorrect loss function?
Why does it seem like the model tries to predict a single value for all samples rather than predicting a different value for all samples like it's supposed to be?
Ideally the prediction should be the y data which I provided so the curves should look the same (more or less).
Do you have any ideas?
Thanks! :)
After some back and forth in the comments, I'll give my best estimation to your questions:
What is going on here?
Very complex (too many layers deep) model with very little data, trained for too few epochs on non-normalized data (credit to Muhammad in his answer). The biggest issue, as far as I can tell, is the number of training epochs.
Am I using an incorrect loss function?
MSE is an appropriate loss function for a regression task.
Why does it seem like the model tries to predict a single value for all samples rather than predicting a different value for all samples like it's supposed to be? Ideally the prediction should be the y data which I provided so the curves should look the same (more or less). Do you have any ideas?
Too few training epochs is the biggest contributor, as far as I can tell.
Based on the collab notebook that Luca shared:
30 Epochs, no normalization
Way off target, flat predictions (though I can't reproduce how flat the predictions are that Luca posted)
30 Epochs, with normalization
Worse off.
2000(!) epochs, no normalization
Okay, now the predictions are at least in the ballpark
2000 epochs, with normalization
And now the model seems to be starting to figure things out, like we'd hope it should. Given, this is training on the 11 samples that were cobbled together in the notebook, so it's naturally going to overfit. We're just happy to see it learn something.
2000 epochs, normalization, different loss
Never be afraid to try out different losses, as some may be better suited than others. Not knowing the domain of this task, I'm just trying out mean_absolute_error instead of mean_squared_error.
Caution! Don't compare loss values between different losses. They're not on the same scale.
2000 epochs, normalization, larger learning rate
Okay, so it's taking a long time to learn. Can I nudge it along a little faster? Sure, up the learning rate of the optimizer, and it'll get you to where you're going faster. Here, we up it by a factor of 5.
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.005))
You could even employ a learning rate scheduler that starts big and slowly diminishes it over the course of epochs.
def scheduler(epoch, lr):
    if epoch < 400:
        return lr
    else:
        return lr * tf.math.exp(-0.01)

lrs = tf.keras.callbacks.LearningRateScheduler(scheduler)
history = model.fit(x=x_train, y=y_train, epochs=1000, callbacks=[lrs])
Hope this all helps!
From the notebook it seems you are not scaling your data. You should normalize or standardize your data before training your model.
https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/
You can add a normalization layer in Keras: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization
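A minimal sketch of that layer, adapting it to the training inputs before fitting (the LSTM size is an arbitrary placeholder, not the asker's exact model; x_train is shaped (samples, time_steps, 1) as in the question):
import tensorflow as tf

# Learn the mean/variance of the inputs from the training data only
norm = tf.keras.layers.Normalization(axis=-1)
norm.adapt(x_train)

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(x_train.shape[1], 1)),
    norm,                        # standardizes inputs on the fly
    tf.keras.layers.LSTM(32),    # hypothetical small LSTM for illustration
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")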
I just wanted to post a quick update.
First of all, this is my current result:
I am absolutely happy, that I was finally able to achieve what I wanted to. At least to some extent.
There were some steps I had to take to achieve this result:
Normalization
Training for 500-1000 epochs
Most importantly: Reducing the amount of time steps to 1000
In the end my thought of "the more data, the better" was a huge misconception. I was not able to achieve such results with 10000 time steps per sample AT ALL. So I'm glad that I just gave 1000 a shot.
Thank you all very much for your answers!
I will try to further improve my model with your suggestions :)
I think it would be helpful to change the loss to Huber loss and even change the optimizer to SGD, and then first try to find the best learning rate with a callback (learning rate schedule), because of the small dataset. Also normalize or standardize the data before training the model.
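A minimal sketch of compiling with the Huber loss as suggested (the delta value and the SGD settings are illustrative assumptions, and model is assumed to be the asker's model):
import tensorflow as tf

# Huber loss is less sensitive to outliers than plain MSE
model.compile(
    loss=tf.keras.losses.Huber(delta=1.0),
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
)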

Keras neural network predicts the same number for all inputs

I am trying to create a Keras neural network to predict the road distance between two points in a city. I am using Google Maps to get the travel distance and then train a neural network on that.
import pandas as pd

arr = []
for i in range(0, 100):
    arr.append(generateTwoPoints(55.901819, 37.344735, 55.589537, 37.832254))

df = pd.DataFrame(arr, columns=['p1Lat', 'p1Lon', 'p2Lat', 'p2Lon', 'distnaceInMeters', 'timeInSeconds'])
print(df)
Neural network architecture:
from keras.optimizers import SGD
sgd = SGD(lr=0.00000001)
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(100, input_dim=4 , activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='mse', optimizer='sgd', metrics=['mse'])
Then I divide the data into test/train sets:
Xtrain=train[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytrain=train[['distnaceInMeters']]/100000
Xtest=test[['p1Lat','p1Lon','p2Lat','p2Lon']]/100
Ytest=test[['distnaceInMeters']]/100000
Then I fit the data to the model, but the loss stays the same:
history = model.fit(Xtrain, Ytrain,
                    batch_size=1,
                    epochs=1000,
                    # We pass some validation for
                    # monitoring validation loss and metrics
                    # at the end of each epoch
                    validation_data=(Xtest, Ytest))
I later print the data:
prediction = model.predict(Xtest)
print(prediction)
print (Ytest)
But the result is the same for all the inputs:
[[0.26150784]
[0.26171574]
[0.2617755 ]
[0.2615582 ]
[0.26173398]
[0.26166356]
[0.26185763]
[0.26188275]
[0.2614446 ]
[0.2616575 ]
[0.26175532]
[0.2615183 ]
[0.2618127 ]]
distnaceInMeters
2 0.13595
6 0.27998
7 0.48849
16 0.36553
21 0.37910
22 0.40176
33 0.09173
39 0.24542
53 0.04216
55 0.38212
62 0.39972
64 0.29153
87 0.08788
I can not find the problem. What is it? I am new to machine learning.
You are making a very elementary mistake: since you are in a regression setting, you should not use a sigmoid activation for your final layer (that is used for binary classification cases); change your last layer to
model.add(Dense(1,activation='linear'))
or even
model.add(Dense(1))
since, according to the docs, if you do not specify the activation argument it defaults to linear.
Various other advice offered already in the other answer and the comments may be useful (lower LR, more layers, other optimizers e.g. Adam), and you certainly need to increase your batch size; but nothing will work with the sigmoid activation function you currently use for your last layer.
Irrelevant to the issue, but in regression settings you don't need to repeat your loss function as a metric; this
model.compile(loss='mse', optimizer='sgd')
will suffice.
It would be very useful if you could post the progression of the loss and MSE (of both the training and validation/test set) throughout the training. Even better, visualize it as per https://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/ and post the visualization here.
In the meantime, based on the facts:
1) You say the loss isn't decreasing (I'm assuming on the training set, during training, based on your compile args).
2) You say that the prediction "accuracy" on your test set is bad.
3) My experience/intuition (not an empirical assessment) tells me that your two layer dense model is a little too small to be able to capture the complexity inherent in your data. AKA your model is suffering from too high a Bias https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229
The fastest and easiest thing you can try, is to try to add both more layers and more nodes to each layer.
However, I should note that there is a lot of causal information that can affect the driving distance and driving time beyond just the distance between two coordinates, which is probably the feature that your neural network will most readily extract. For example, whether you drive on a highway or side streets, traffic lights, whether the roads twist and turn or go straight... to infer all of that just from the data, you will need enormous amounts of data (examples), in my opinion. If you could add input columns with e.g. the distance to the nearest highway from both points, you might be able to train with less data.
I would also recommend that you double-check that you are feeding as input what you think you are feeding (and its shape). Also, you should use some standardization function from sklearn, which might help the model learn faster and converge faster to a higher "accuracy".
If and when you post either more code or the training history I can help you more (and also how many training samples).
EDIT 1: Try changing the batch size to a larger number, preferably batch_size=32, if it fits in your memory. You can use a small batch size (such as 1) when working with an "info rich" input like an image, but with a very "info poor" datum like 4 floats (2 coordinates), the gradient of each batch (with batch_size=1) will point in a practically random (pseudo...) direction and not necessarily get any closer to a local minimum. Only when taking the gradient on the collective loss of a larger batch (like 32, and perhaps more) will you get a gradient that points at least approximately in the direction of the local minimum and converge to a better result. Also, I suggest that you don't mess with the learning rate manually and perhaps change to an optimizer like "adam" or "RMSProp". A short sketch of these changes follows after Edit 2.
Edit 2: @Desertnaut made an excellent point that I totally missed, a correction without which your code will not work properly. He deserves the credit, so I will not include it here; please refer to his answer. Also, don't forget to raise your batch size and not "manually mess" with your learning rate; "adam", for example, will do it for you.
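Putting the batch-size and optimizer advice from Edit 1 into a minimal sketch (variable names follow the question; treat this as an illustration rather than the answerer's code):
# Let the optimizer manage the step size and use a larger batch
model.compile(loss='mse', optimizer='adam')
history = model.fit(Xtrain, Ytrain,
                    batch_size=32,
                    epochs=1000,
                    validation_data=(Xtest, Ytest))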

Why does my TensorFlow NN model's predicted values have upper limit?

I have a neural network with three layers. I've tried using tanh and sigmoid functions for my activations and then the output layer is just a simple linear function (I'm trying to model a regression problem).
For some reason my model seems to have a hard cut off where it will never predict a value above some threshold (even though it should). What reason could there be for this?
Here is what predictions from the model look like (with sigmoid activations):
update:
With relu activation, switching from gradient descent to Adam, and adding L2 regularization... the model predicts the same value for every input...
A linear layer regressing a single value will have outputs of the form
output = bias + sum(kernel * inputs)
If inputs comes from a tanh, then -1 <= inputs <= 1, and hence
bias - sum(abs(kernel)) <= output <= bias + sum(abs(kernel))
If you want an unbounded output, consider using an unbounded activation on all intermediate layers, e.g. relu.
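A minimal sketch of what that looks like in Keras: ReLU on the hidden layers and a plain linear output, so the prediction range is not squashed (n_features and the layer sizes are arbitrary placeholders):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(n_features,)))  # unbounded hidden activations
model.add(Dense(64, activation='relu'))
model.add(Dense(1))  # linear output: no upper or lower bound on predictions
model.compile(loss='mse', optimizer='adam')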
I think your problem concerns the generalization/expressiveness of the model. Regression is a basic task; there should be no problem with the method itself, but rather with the execution. @DomJack explained how the output is restricted for a specific set of parameters, but that only happens for anomalous data. In general, during training the parameters will be tuned so that the output is predicted correctly.
So first point is about the quality of training data. Make sure you have large enough training data (and it is split randomly if you split train/test from one dataset). Also, maybe trivial, but make sure you didn't mess up input/output value in preprocessing.
Another point is about the size of the network. Make sure you use large enough hidden layer.

In Tensorflow, a loss can be a tensor of more than one dimension. What does that mean?

A scalar loss makes perfect sense to me, and that is in general what is fed as the loss function in standard NN architectures. However, there is a provision for the loss not to be scalar; for example, I passed a loss of shape ([batch_size,]) and it did not throw an error.
Does it summarize the loss as a sum/mean internally? That is not coming out very clearly to me from the source code.
Any help is highly appreciated. Thanks. :)
Imagine you are trying to classify the MNIST data set. In this case, you have 10 output neurons. Therefore, after propagating a training sample through it, you get 10 activations. These have to be compared to the desired output via a cost function for each neuron (something like cross-entropy).
What you are trying to do here, is to minimize these costs among all the neurons. In the MNIST example you have to do something like
xent = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y), name="xent")
You see that the cost is calculated across all neurons as mean (reduce_mean).
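To illustrate the point, a small sketch showing a per-example loss of shape (batch_size,) being reduced to a scalar with a mean (TensorFlow; the values are made up for the example):
import tensorflow as tf

logits = tf.constant([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])   # (batch_size=2, classes=3)
labels = tf.constant([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

per_example = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
print(per_example.shape)            # (2,) -- one loss value per example
loss = tf.reduce_mean(per_example)  # scalar loss that is actually minimized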

Issue NaN with Adam solver

I'm training networks with the Adam solver and ran into the problem that the optimization hits 'nan' at some point, while the loss seems to decrease nicely up to that point. It happens only for some specific configurations and after a couple of thousand iterations. For example, the network with batch size 5 will have the problem, while with a batch size of one it works. So I started debugging my code:
1) The first thing that came to my mind was to check the inputs when the net hits 'nan', but they look reasonable (correctly labeled ground truth and input with an okayish value range)
2) While searching I discovered tf.verify_tensor_all_finite(..) and I put that all over my code to see, which tensor first becomes 'nan'.
I could narrow down the problem to the following lines:
kernel = tf.verify_tensor_all_finite(kernel, 'kernel')
in_tensor = tf.verify_tensor_all_finite(in_tensor, 'in_tensor')
tmp_result = tf.nn.conv2d_transpose(value=in_tensor, filter=kernel, output_shape=output_shape,
                                    strides=strides, padding='SAME')
tmp_result = tf.verify_tensor_all_finite(tmp_result, 'convres')
Which throw an error, which reads:
InvalidArgumentError (see above for traceback): convres : Tensor had NaN values
[[Node: upconv_logits5_fs/VerifyFinite_2/CheckNumerics = CheckNumerics[T=DT_FLOAT, _class=["loc:#upconv_logits5_fs/conv2d_transpose"], message="convres", _device="/job:localhost/replica:0/task:0/gpu:0"](upconv_logits5_fs/conv2d_transpose)]]
[[Node: Adam/update/_2794 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_154_Adam/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Now I'm not sure what happened here.
I guess that during the forward pass everything went well, because the scalar loss didn't trigger an error and kernel & input were still valid numbers. It seems as if some Adam update node modifies the value of my upconv_logits5_fs towards nan. This transposed convolution op is the very last of my network and therefore the first one to be updated.
I'm working with a tf.nn.softmax_cross_entropy_with_logits() loss and put tf.verify_tensor_all_finite() on all its inputs and outputs, but they don't trigger errors. The only conclusion I can draw is that there might be a numerical issue with the Adam solver.
What do you think about that conclusion?
Does anybody have any idea how to proceed or what I could try?
Your help is very much appreciated.
EDIT:
I was able to solve my problem by increasing the solver's epsilon parameter from 1e-8 to 1e-4. It seemed like some of my parameters tended to have very little to zero variance, and that resulted in tf.sqrt(0.0 + epsilon), which caused numerical problems.
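For reference, a minimal sketch of raising Adam's epsilon, shown with the current tf.keras optimizer (in TF 1.x the equivalent argument lives on tf.train.AdamOptimizer):
import tensorflow as tf

# A larger epsilon keeps the update denominator sqrt(v_hat) + epsilon away from zero
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-4)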
I have run into the same issue multiple times. The reason behind this problem is the combination of softmax and cross-entropy: when you compute the gradient and divide by zero or inf, you get NaN, which then propagates through all your parameters.
A few pieces of advice to avoid this problem:
If the error starts increasing and then NaN appears afterwards: diverging due to too high a learning rate.
if NaNs appear suddenly: saturating units yielding non-differentiable gradient
NaN computation due to log(0)
NaN due to floating point issues (too high weights) or activations on the output
0/0, inf/inf, inf*weight...
solutions (a small sketch of the 'safe' log idea follows this list):
reduce learning rate
Change the Weight initialization
Use L2 norm
Safe softmax (add a small value inside log(x))
gradient clipping
In my case the learning rate solved the issue, but I'm still working to optimize it more.
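A minimal sketch of the 'safe' log idea, adding a small epsilon before taking the logarithm so log(0) can never occur (the epsilon value and helper names are illustrative choices):
import tensorflow as tf

def safe_log(x, eps=1e-8):
    # Keep the argument away from zero to avoid -inf values and NaN gradients
    return tf.math.log(x + eps)

def safe_cross_entropy(labels, probs):
    # probs: predicted probabilities, labels: one-hot targets
    return -tf.reduce_mean(tf.reduce_sum(labels * safe_log(probs), axis=-1))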
One additional step which was not included in Feras's answer and cost me a day of debugging:
Increase the precision of your variables. I had a network where a lot of variables were defined as float16. The network worked fine for all the optimizers except for Adam and Adadelta. After hours of debugging I switched to tf.float64 and it worked.
This is probably quite particular to my case but might still help someone else.
I was having my loss suddenly turn to nan without reaching particularly large values beforehand. I checked that my data wasn't corrupted, and tried playing with the learning rate, adding clipnorm, batch normalization layers and so on, without any success.
I was actually adding a random epsilon to my denominator somewhere in the model (to avoid division by 0) but didn't pay attention to its range. Changing the minval from 0 to 1e-18 got rid of the problem for me.
rand_num = Lambda(lambda input: K.random_uniform(tf.shape(input), minval = 1e-18, maxval=1e-17))(s_p)
I guess some of the randomly picked values were too small to serve their purpose and fix potential division by zero.
This is a numerical stability issue. I would suggest trying a lower learning rate to see if that resolves your problem.
