TensorFlow model fills up with NaNs during training - python

I am essentially replicating AlphaZero. It has worked for some small games, but I am trying to scale it up to a more complicated game. However, after training on 2-10 million moves my network fills up with NaNs. Unfortunately, because training isn't deterministic and the failure point falls in such a wide range, the debugger has not been very effective: with tfdbg checking for "has_inf_or_nan" it takes about 5 minutes to train 12,000 moves, so reaching the failure point would take a very long time.
At the very bottom of this post, I'll describe what the model looks like.
Here is how I am using certain things that are common sources of NaNs:
Loss Functions (Single network with 2 outputs: policy (odds of selecting a move) and value (quality of the board position for the active player)):
Note: move_result_placeholder gets filled with a batch of moves that are the output of a Monte Carlo Tree Search. Since most move positions are invalid, it is typically full of 0s with 5-10 floats that represent the odds of selecting those moves. I have an assert that verifies they all sum to 1, and when training I also assert that none of the inputs are NaN. I populate each batch by selecting uniformly at random from a collection of the last 1,000,000 (board state, move, reward) tuples, then feed the board states, moves, and rewards into the training step.
self.loss_policy = tf.losses.softmax_cross_entropy(self.move_result_placeholder, out_dense)
self.loss_value = tf.losses.mean_squared_error(self.value_result_placeholder, tf.reshape(self.out_value_layer, shape=[-1,]))
self.total_loss = self.loss_policy + self.loss_value
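For reference, the assertions mentioned in the note above look roughly like this (a sketch; the batch_* names are illustrative, not my actual variable names):

import numpy as np

# Run on the sampled batch before each training step.
assert not np.any(np.isnan(batch_states)), "NaN in board states"
assert not np.any(np.isnan(batch_rewards)), "NaN in rewards"
assert np.allclose(batch_moves.sum(axis=1), 1.0), "move probabilities must sum to 1"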
Optimizer (learning rate 1e-4):
self.train_step = tf.train.AdamOptimizer(learning_rate=self.learning_rate_placeholder).minimize(self.total_loss, name="optimizer")
Softmax:
self.out_policy_layer = tf.nn.softmax(out_dense, name="out_policy_layer")
Batch Normalization (is_training is a placeholder that is 1 when training and 0 when playing games; batch_norm_decay is 0.999):
input_bn = tf.contrib.layers.batch_norm(input_conv, center=True, scale=True, is_training=self.is_training, decay=self._config.batch_norm_decay)
Regularization (L2 on all layer weights, scale 1e-4):
initializer = tf.contrib.layers.xavier_initializer()
regularizer = None
if use_regularizer:
    regularizer = tf.contrib.layers.l2_regularizer(scale=self._config.l2_regularizer_scale)
weights = tf.get_variable(name, shape=shape, initializer=initializer, regularizer=regularizer)
MODEL DESCRIPTION:
The model is created in TensorFlow and consists of an input layer that is 4x8x3 (batch size 1024). This captures the state of the 4x8 board, how many moves have been made since a player last scored, and how many times that board state has been seen during that specific game. That feeds into a conv2d layer with a 3x3 kernel and strides=1. I then apply batch normalization, tf.contrib.layers.batch_norm(input_conv, center=True, scale=True, is_training=self.is_training, decay=self._config.batch_norm_decay), and ReLU. At the end of this input block the size is 4x8x64.
After that there are 5 residual blocks. After the residual blocks the network splits in two. The first branch is the policy output, which runs through another convolutional layer with a kernel size of 1x1 and strides of 1, followed by batch normalization and a ReLU. At this point the tensor is 4x8x2; it gets flattened, run through a dense layer, and then through a softmax to produce 256 outputs that represent the odds of picking any given move. The 256 outputs map to the 4x8 board with planes for the direction the piece is moving. So the first 4x8 plane gives the odds of selecting a piece and moving it northwest, the second the odds of selecting a piece and moving it northeast, and so on.
On the other side of the split is the value output. It runs through a convolutional layer, then gets flattened, goes through a dense layer, and finally through a tanh, so it outputs a single value that gives the quality of that board state.
Weights for all of the layers are using L2 Regularization (1e-4).
Loss is Cross Entropy for the policy side and Mean Squared Error for the value side, and I am using the Adam Optimizer.

If I were you, I would investigate the TensorFlow debugger plugin for TensorBoard. Using this tool, it is easy to trace problems through your graph.
You can step through computations in your graph, and you can also track the occurrence of NaN values as they pop up.
https://github.com/tensorflow/tensorboard/tree/master/tensorboard/plugins/debugger
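For example, in TensorFlow 1.x you can wrap your Session so the plugin can watch tensors as they are computed (a sketch; adjust the host/port to wherever TensorBoard is running with --debugger_port):

from tensorflow.python import debug as tf_debug

# Start TensorBoard with: tensorboard --logdir <logdir> --debugger_port 6064
sess = tf_debug.TensorBoardDebugWrapperSession(sess, "localhost:6064")

Then train as usual and watch for NaNs appearing in the debugger UI.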

Well, this is too broad a problem to debug directly. In general, you need to think about what can produce NaNs and approach the issue with modular disabling: disable or bypass parts of your model and see if the error disappears. Some candidates where problems might originate: batch normalization, softmax for some edge cases (e.g. an all-zero input), or a gradient explosion (try lowering the learning rate).
So, for example, turn off batch normalization and run the model to see if the error still happens. If it does, lower the learning rate a few orders of magnitude. And so on.
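If the gradient explosion theory holds, gradient clipping is another easy thing to try alongside a lower learning rate. A sketch against the TF 1.x API used in the question (the clip norm of 5.0 is an arbitrary starting point):

optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate_placeholder)
grads_and_vars = optimizer.compute_gradients(self.total_loss)
# Cap each gradient's norm so one bad batch cannot blow up the weights.
clipped = [(tf.clip_by_norm(g, 5.0), v) for g, v in grads_and_vars if g is not None]
self.train_step = optimizer.apply_gradients(clipped, name="optimizer")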

The issue was actually a dying video card. If you run into a similar problem and investigate the usual sources without success, remember to consider faulty memory on your video card.

Related

Why is Normalization causing my network to have exploding gradients in training?

I've built a network (in PyTorch) that performs well for image restoration purposes. I'm using an autoencoder with a ResNet50 encoder backbone; however, I am only using a batch size of 1. I'm experimenting with some frequency-domain processing that only allows me to process one image at a time.
I have found that my network performs reasonably well, but only if I remove all batch normalization from it. Of course batch norm is useless for a batch size of 1, so I switched over to group norm, which is designed for this purpose. However, even with group norm, my gradients explode. Training can go very well for 20-100 epochs and then it's game over. Sometimes it recovers and explodes again.
I should also say that in training, every new image fed in is given a wildly different amount of noise, so the network learns to handle random noise levels. This has been done before, but perhaps coupled with a batch size of 1 it could be problematic.
I'm scratching my head at this one and I'm wondering if anyone has suggestions. I've dialed in my learning rate and clipped the max gradients but this isn't really solving the actual issue. I can post some code but I'm not sure where to start and hoping someone could give me a theory. Any ideas? Thanks!
To answer my own question: my network was unstable in training because with a batch size of 1 the data is too different from batch to batch. Or, as the papers like to put it, too high an internal covariate shift.
Not only were my images drawn from a very large, varied dataset, but they were also rotated and flipped randomly. On top of that, a random amount of Gaussian noise between 0 and 30 was chosen for each image, so one image may have little to no noise while the next may be barely distinguishable.
In the above question I mentioned group norm - my network is complex and some of the code is adapted from other work. There were still batch norm functions hidden in my code that I missed. I removed them. I'm still not sure why BN made things worse.
Following this I reimplemented group norm with groups of size=32 and things are training much more nicely now.
In short, removing the extra BN and adding group norm helped.
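If anyone else has batch norm layers hiding in code adapted from other work, a recursive swap like this finds and replaces them (a sketch assuming 2D conv blocks, with channel counts divisible by num_groups):

import torch.nn as nn

def replace_bn_with_gn(module, num_groups=32):
    # Recursively replace every BatchNorm2d with a GroupNorm over the same
    # number of channels; each child's num_features must divide by num_groups.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(num_groups, child.num_features))
        else:
            replace_bn_with_gn(child, num_groups)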

How to force RNN output to become smoother?

I am using experimental data with a Keras LSTM to model a complicated physical system. The problem is that at certain points the output value changes drastically between two consecutive steps. All physical systems must show some continuous/smooth behavior. How can I make my output smoother? Is there some kind of layer or regularization for this?
I tried introducing L1-L2 regularization and dropout. They help, but I could not get good results. What I seek is some kind of layer that limits sudden changes in the values. By the way, I work with a rather small amount of data; I am using 2 series to train and validate, and 1 to test.
Network structure: I get similar results for 2 LSTM + 1 Dense layer or 1 LSTM + 1 Dense layer (with/without dropout layers between LSTM and Dense, and some L2 regularization).
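For concreteness, the structure looks roughly like this (a sketch; the layer sizes, input shape, and dropout rate here are illustrative, not my exact values):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.LSTM(64, return_sequences=True, input_shape=(10, 4),
                kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.2),
    layers.LSTM(32, kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.2),
    layers.Dense(1),  # one physical quantity as output
])
model.compile(optimizer="adam", loss="mse")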
The time-series data represents measurements taken at short intervals, which results in repeated values from time to time. I remove some of the repeated rows as well. (I concatenated the series and then removed rows based on duplicates in one of the inputs; I tried doing this for several inputs. As you can understand, this approach does not remove all the repeated rows. Could this be the source of the problem?)
I use sklearn.StandardScaler or sklearn.MinMaxScaler to normalize the input data, not much difference between the two.
You can see a sample result on the test data (with L2 regularization); please note the first two peaks at the start. There are around 20,000 points in the graph and these peaks occur over 3-5 points. In the training set there are some jumps as well, but they are far smoother and more spread out. Is there some way to smooth the output within the neural net, without adding external filters?

Training Neural Network with Simulated Annealing

I am trying to train a simple neural network with simulated annealing. I have programmed a neural network with an input layer of 784 input nodes (28 x 28 pixels: I am using the MNIST database to train), 1 hidden layer with 100 nodes and an output layer with 10 end nodes. I also programmed a simulated annealing algorithm that takes an input vector and minimizes a function to get the desired output vector.
Now my question is how to combine the two? I have read a couple papers but they don't specify exactly how this is done. I think the idea is as follows:
Initialize a vector of random weights (in my case the vector is of length 79,400: 78,400 weights from the input layer to the hidden layer, and 1,000 weights from the hidden layer to the output layer). Calculate the corresponding output, which will of course be incorrect, and the sum of squared errors. Then loop through the weight vector and adjust each weight slightly by adding or subtracting a small number. For each adjustment, calculate the sum of squared errors again and see which adjustment (adding or subtracting) decreased this value. Repeating this process should, in my opinion, converge to the weights that produce the desired output.
I would like to know if this approach is the way to go. If it is, this seems to be a very time consuming process and I was wondering whether there is a more efficient way to do this?
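To make the idea concrete, here is a sketch of the loop I have in mind (loss_fn would compute the sum of squared errors over the training set; the step size, temperature, and cooling rate are placeholders):

import numpy as np

def anneal(weights, loss_fn, steps=10000, t0=1.0, cooling=0.999):
    # weights is the flat vector of all 79,400 weights.
    w = weights.copy()
    current_loss = loss_fn(w)
    temp = t0
    for _ in range(steps):
        candidate = w + np.random.normal(scale=0.01, size=w.shape)
        cand_loss = loss_fn(candidate)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools (the "annealing" part).
        if cand_loss < current_loss or np.random.rand() < np.exp((current_loss - cand_loss) / temp):
            w, current_loss = candidate, cand_loss
        temp *= cooling
    return w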
Thanks in advance!

What is the key feature in MNIST Dataset that is used to classify images

I was recently learning about neural networks and came across the MNIST dataset. I understood that a sigmoid cost function is used to reduce the loss. Also, weights and biases get adjusted, and optimal weights and biases are found after training. The thing I did not understand is: on what basis are the images classified? For example, to classify whether a patient has cancer or not, data like age, location, etc. become the features. In the MNIST dataset, I did not find any of that. Am I missing something here? Please help me with this.
First of all, the network pipeline consists of 3 main parts:
Input manipulation
Parameters that affect the finding of the minimum
Parameters like your decision function in your interpretation layer (often a fully connected layer)
In contrast to a regular machine learning pipeline, where you have to extract features manually, a CNN uses filters (filters as in edge detection or Viola-Jones).
A filter runs across the image and is convolved with the pixels, producing an output.
This output is then interpreted by a neuron: if it is above a threshold it is considered valid (a step function counts 1 if valid; in the case of a sigmoid, it takes a value on the sigmoid curve).
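As a toy illustration of a filter being convolved with an image and the response being squashed by a sigmoid (the filter here is a hand-picked edge detector rather than a learned one):

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)          # stand-in for one MNIST digit
edge_filter = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])    # Sobel-style vertical-edge filter
response = convolve2d(image, edge_filter, mode="valid")
activation = 1.0 / (1.0 + np.exp(-response))  # sigmoid per output pixel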
The next steps are the same as before.
This proceeds until the interpretation layer (often a softmax). This layer interprets the computation: if the filters are well adapted to your problem you will get a good predicted label, meaning a small difference between y_guess and y_true_label.
Now you can see that for the guess of y we have multiplied the input x by many weights w and also applied functions to it. This composition can be differentiated with the chain rule from calculus.
To get better results, the effect of each individual weight on the error must be known. For that you use backpropagation, which computes the derivative of the error with respect to all the weights w. The trick is that intermediate derivatives can be reused, which is more or less what backpropagation is, and it becomes easier with matrix-vector notation.
Once you have the gradient, you can use the normal concept of minimization, walking along the steepest descent. (There are also many other gradient-based methods, like Adagrad or Adam.)
These steps repeat until convergence or until you reach the maximum number of epochs.
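Schematically (backprop() here is a hypothetical helper returning dError/dw via the chain rule; learning_rate and max_epochs are placeholders):

for epoch in range(max_epochs):
    grads = backprop(weights, training_data)
    weights = weights - learning_rate * grads  # step along the steepest descent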
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)

Neural Network Becomes Unruly with Large Layers

This is a higher-level question about the performance of a neural network. The issue I'm having is that with larger numbers of neurons per layer, the network has frequent rounds of complete stupidity. They are not consistent; it seems that the probability of general success vs failure is about 50/50 when layers get larger than 60 neurons (always 3 layers).
I tested this by teaching the same function to networks with input and hidden layers of sizes from 10 to 200. The success rate is either 0-1% or 90+%, but nothing in between. To help visualize this, I graphed it: Failures is the total count of incorrect responses on 200 data sets after 5k training iterations.
I think it's also important to note that the numbers at which the network succeeds or fails change for each run of the experiment. The only possible culprit I've come up with is local minima (but don't let this influence your answers, I'm new to this, and initial attempts to minimize the chance of local minima seem to have no effect).
So, the ultimate question is, what could cause this behavior? Why is this thing so wildly inconsistent?
The Python code is on Github and the code that generated this graph is the testHugeNetwork method in test.py (line 172). If any specific parts of the network algorithm would be helpful I'm glad to post relevant snippets.
My guess is that your network is oscillating heavily across a jagged error surface. Trying a lower learning rate might help. But first of all, there are a few things you can do to better understand what your network is doing:
plot the output error over training epochs. This will show you when in the training process things go wrong.
have a graphical representation (an image) of your weight matrices and of your outputs. Makes it much easier to spot irregularities.
A major problem with ANN training is saturation of the sigmoid function. Towards the asymptotes of both the logistic function and tanh, the derivative is close to 0; numerically it may even be exactly zero. As a result, the network will learn only very slowly or not at all. This problem occurs when the inputs to the sigmoid are too big. Here's what you can do about it:
initialize your weights proportional to the number of inputs a neuron receives. Standard literature suggests drawing them from a distribution with mean 0 and standard deviation 1/sqrt(m), where m is the number of input connections.
scale your teachers so that they lie where the network can learn the most: where the activation function is steepest, i.e. at the maximum of its first derivative. For tanh you can alternatively scale the function to f(x) = 1.7159 * tanh(2/3 * x) and keep the teachers in [-1, 1]. However, don't forget to adjust the derivative to f'(x) = 1.7159 * (2/3) * (1 - tanh^2((2/3) * x)).
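Both suggestions in code form (a sketch; m is the number of input connections per neuron, n the number of neurons in the layer):

import numpy as np

def init_weights(m, n):
    # Mean 0, standard deviation 1/sqrt(m), where m is the number of inputs.
    return np.random.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))

def scaled_tanh(x):
    # f(x) = 1.7159 * tanh(2/3 * x); teachers in [-1, 1] stay near the
    # steepest part of the activation.
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

def scaled_tanh_deriv(x):
    return 1.7159 * (2.0 / 3.0) * (1.0 - np.tanh(2.0 / 3.0 * x) ** 2)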
Let me know if you need additional clarification.
