Training Neural Network with Simulated Annealing - python

I am trying to train a simple neural network with simulated annealing. I have programmed a neural network with an input layer of 784 nodes (28 x 28 pixels; I am using the MNIST database for training), 1 hidden layer with 100 nodes, and an output layer with 10 output nodes. I also programmed a simulated annealing algorithm that takes an input vector and minimizes a function to get the desired output vector.
Now my question is how to combine the two? I have read a couple of papers, but they don't specify exactly how this is done. I think the idea is as follows:
Initialize a vector of random weights (in my case the vector is of length 79,400; 78,400 weights for the input layer to the hidden layer, and 1,000 weights for the hidden layer to the output layer). Calculate the corresponding output, which will of course be incorrect, and the sum of squared errors. Then loop through the weight vector and adjust each weight slightly by adding or subtracting a small number. For each adjustment, calculate the sum of squared errors again and see which adjustment (adding or subtracting) decreased this value. Repeating this process should, in my opinion, result in the weights that correspond to the desired output.
I would like to know if this approach is the way to go. If it is, this seems to be a very time consuming process and I was wondering whether there is a more efficient way to do this?
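For reference, this is what I understand a textbook simulated-annealing loop over the full weight vector to look like (a minimal sketch only; it assumes a helper sse(weights) that runs the forward pass and returns the sum of squared errors, which is not shown here):

import numpy as np

def anneal(sse, n_weights, n_steps=10000, t_start=1.0, t_end=1e-3, step_size=0.01):
    rng = np.random.default_rng(0)
    weights = rng.normal(scale=0.1, size=n_weights)        # random initial weight vector
    err = sse(weights)
    best, best_err = weights.copy(), err
    for i in range(n_steps):
        t = t_start * (t_end / t_start) ** (i / n_steps)   # exponential cooling schedule
        candidate = weights + rng.normal(scale=step_size, size=n_weights)  # small random perturbation
        cand_err = sse(candidate)
        # always accept improvements; accept worse candidates with probability exp(-delta / t)
        if cand_err < err or rng.random() < np.exp(-(cand_err - err) / t):
            weights, err = candidate, cand_err
            if err < best_err:
                best, best_err = weights.copy(), err
    return best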
Thanks in advance!

Related

How to force RNN output to become smoother?

I am using experimental data with a Keras LSTM to model a complicated physical system. The problem is that the output value tends to change drastically between two adjacent points in certain places. All physical systems must show some continuous/smooth behavior. How can I make my output smoother? Is there some kind of layer or regularization for this?
I tried introducing l1-l2 regularization and dropout... They help, but I could not get good results. What I seek is some kind of layer which limits sudden changes in the values. By the way, I work with a rather small amount of data; I am using 2 series to train and validate, 1 to test.
Network structure: I get similar results for 2 LSTM + 1 Dense layer or 1 LSTM + 1 Dense layer. (With/without dropout layers between LSTM and Dense, and some l2 regularization)
The time-series data represents some measurements. Measurements are taken at short intervals, resulting in repeated values from time to time. I remove some of the repeated lines as well. (I concatenated the series together and then removed duplicate rows with respect to one of the inputs. I tried doing it for several inputs, but as you can understand, I did not remove all the repeated lines with this approach. Can this be the source of the problem?)
I use sklearn.StandardScaler or sklearn.MinMaxScaler to normalize the input data, not much difference between the two.
You can see a sample result on test data, which has l2 regularization; please note the first two peaks at the start. There are around 20,000 points in the graph and these peaks occur over 3-5 points. In the training set there are some jumps as well, but they are far smoother and more spread out. Is there some way to smooth the output within the neural net, without adding external filters?
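For illustration, one idea I am considering is a custom loss that penalizes the difference between consecutive predictions (a rough sketch only, and it assumes the model outputs full sequences so that y_pred has a time axis):

import tensorflow as tf

def smooth_mse(alpha=0.1):
    # MSE plus a penalty on the squared difference between consecutive time steps
    def loss(y_true, y_pred):
        mse = tf.reduce_mean(tf.square(y_true - y_pred))
        jumps = tf.reduce_mean(tf.square(y_pred[:, 1:] - y_pred[:, :-1]))
        return mse + alpha * jumps
    return loss

# model.compile(optimizer="adam", loss=smooth_mse(alpha=0.1))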

Tensorflow: How to pass output of a fully connected network into the same network for few time steps

I have a fully connected network that takes as input a vector of N dimensions concatenated with a constant vector of M dimensions; the output of the network is an N-dimensional vector. I want to feed this output back into the fully connected network and chain it for a few iterations, and after all the iterations I want the loss gradients from the last output to propagate through each iteration. Will simple looping, forming a TensorArray of my ground truths, do the job? Also, can anyone help me make a custom RNN cell that doesn't pass on its hidden state to the following iterations?
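A rough sketch of what I mean, in TF 2.x eager style (all names and sizes below are made up; in graph-mode TF 1.x a tf.while_loop would play the role of the Python loop). A plain Python loop inside a GradientTape is enough for gradients from the final output to flow back through every iteration:

import tensorflow as tf

N, M, BATCH, STEPS = 8, 4, 32, 5          # illustrative sizes only
net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N),
])

x0 = tf.random.normal([BATCH, N])          # N-dimensional input vector
const_vec = tf.random.normal([BATCH, M])   # constant M-dimensional vector
y_true = tf.random.normal([BATCH, N])      # ground truth for the final iteration

with tf.GradientTape() as tape:
    x = x0
    for _ in range(STEPS):                                 # chain the same network for a few steps
        x = net(tf.concat([x, const_vec], axis=-1))        # previous output fed back in as input
    loss = tf.reduce_mean(tf.square(x - y_true))

grads = tape.gradient(loss, net.trainable_variables)       # gradients flow through every iteration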

Tensorflow Model fills up with NaNs during training

I am basically playing around with duplicating AlphaZero. It has worked for some small games, but I am trying to scale it up to work with a more complicated game. However, now my network, after training on 2-10 million moves, will just fill up with NaNs. Unfortunately, because it isn't deterministic and the failure point occurs over such a wide range, using the debugger has not been very effective. It takes about 5 minutes to train 12,000 moves when I have tfdbg check for "has_inf_or_nan", so the debugger is doing nothing for me because it would take a very long time to hit the error.
At the very bottom of this post, I'll describe what the model looks like.
Here is how I am using certain things that are common sources of NaNs:
Loss Functions (Single network with 2 outputs: policy (odds of selecting a move) and value (quality of the board position for the active player)):
Note: move_result_placeholder gets filled with a batch of moves that are the output of a Monte Carlo Tree Search. Since most of the move positions are invalid, it is typically full of 0s with 5-10 floats that represent the odds of selecting that move. I have an assert that verifies they all sum to 1. When running training I also have asserts that verify none of the inputs are NaN. I select uniformly at random from a collection of the last 1,000,000 (Board State, Move, Reward) tuples when populating the batch. Then I feed the board states, moves, and rewards into the training step.
self.loss_policy = tf.losses.softmax_cross_entropy(self.move_result_placeholder, out_dense)
self.loss_value = tf.losses.mean_squared_error(self.value_result_placeholder,
                                               tf.reshape(self.out_value_layer, shape=[-1,]))
self.total_loss = self.loss_policy + self.loss_value
Optimizer (learning rate 1e-4):
self.train_step = tf.train.AdamOptimizer(learning_rate=self.learning_rate_placeholder).minimize(self.total_loss, name="optimizer")
Softmax:
self.out_policy_layer = tf.nn.softmax(out_dense, name="out_policy_layer")
Batch Normalization (is_training is a placeholder that is 1 when training and 0 when playing games; batch_norm_decay is 0.999):
input_bn = tf.contrib.layers.batch_norm(input_conv, center=True, scale=True, is_training=self.is_training, decay=self._config.batch_norm_decay)
Regularization (L2 on all weights in layers, scale is 1e-4):
initializer = tf.contrib.layers.xavier_initializer()
if use_regularizer:
    regularizer = tf.contrib.layers.l2_regularizer(scale=self._config.l2_regularizer_scale)
    weights = tf.get_variable(name, shape=shape, initializer=initializer, regularizer=regularizer)
MODEL DESCRIPTION:
The model is created in TensorFlow and consists of an input layer that is 4x8x3 (batch size 1024). This captures the state of the 4x8 board, how many moves have been made since a player has scored, and how many times that board state has been seen during that specific game. That feeds into a conv2d layer with a kernel size of 3x3 and strides=1. I then apply batch normalization tf.contrib.layers.batch_norm(input_conv, center=True, scale=True, is_training=self.is_training, decay=self._config.batch_norm_decay) and ReLU. At the end of the input ReLU the size is 4x8x64.
After that there are 5 residual blocks. After the residual blocks it splits into two. The first is the policy network output, which runs through another convolutional layer with a kernel size of 1x1, strides of 1, batch normalization, and a ReLU. At this point it is 4x8x2; it gets flattened, run through a dense layer, and then through a softmax to output 256 values that represent the odds of picking any given move. The 256 outputs map to the 4x8 board with planes for the direction the piece is moving. So the first 4x8 would tell you the odds of selecting a piece and moving it northwest, the second would tell you the odds of selecting a piece and moving it northeast, and so on.
On the other side of the split is the value output. On that side it runs through a convolutional layer then it gets flattened and goes through a dense layer and finally through a TanH so it outputs a single value that tells us the quality of that board state.
Weights for all of the layers are using L2 Regularization (1e-4).
Loss is Cross Entropy for the policy side and Mean Squared Error for the value side, and I am using the Adam Optimizer.
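For orientation only, a stripped-down sketch of that layout might look roughly like the following (TF 1.x style; the batch normalization, the 5 residual blocks, and the exact hyper-parameters are omitted, and all names are illustrative):

import tensorflow as tf

board = tf.placeholder(tf.float32, [None, 4, 8, 3], name="board_state")
trunk = tf.layers.conv2d(board, filters=64, kernel_size=3, strides=1, padding="same")
trunk = tf.nn.relu(trunk)                                   # batch norm + residual blocks omitted

# policy head: 1x1 conv -> flatten -> dense -> softmax over the 256 possible moves
policy = tf.layers.conv2d(trunk, filters=2, kernel_size=1, strides=1)
policy = tf.layers.dense(tf.layers.flatten(policy), 256)
out_policy_layer = tf.nn.softmax(policy, name="out_policy_layer")

# value head: conv -> flatten -> dense -> tanh, a single score for the board position
value = tf.layers.conv2d(trunk, filters=1, kernel_size=1, strides=1)
value = tf.layers.dense(tf.layers.flatten(value), 1)
out_value_layer = tf.nn.tanh(value, name="out_value_layer")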
If I were you, I would investigate the tensorflow debugger plugin for tensorboard. You will find that using this tool, it is very easy to trace problems through your graph.
You can step through computations in your graph, and you can also track the occurrence of NaN values that pop up.
https://github.com/tensorflow/tensorboard/tree/master/tensorboard/plugins/debugger
Well, this is just too broad a problem to tackle like this. In general, you need to think about what can produce NaNs and approach the issue with modular disabling; that is, disable or bypass things in your model and see if the error disappears. Some candidates where problems might originate: batch normalization, softmax for some edge cases (all-zero input), or you might have a gradient explosion (try lowering the learning rate).
So for example, turn off batch normalization and run the model, see if the error happens. If yes, lower the learning rate a few orders of magnitude. And so on.
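Along the same lines (not something this answer spells out, just a common safeguard against the gradient explosion it mentions), the gradients can be clipped before they are applied, mirroring the optimizer snippet in the question:

optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate_placeholder)
grads_and_vars = optimizer.compute_gradients(self.total_loss)
grads, variables = zip(*grads_and_vars)
clipped, _ = tf.clip_by_global_norm(grads, 5.0)              # cap the global gradient norm
self.train_step = optimizer.apply_gradients(zip(clipped, variables), name="optimizer")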
The issue was actually a video card that was dying. If you are running into a similar problem and you investigate the usual sources without success, remember to consider a hardware issue such as failing memory on your video card.

What is the key feature in MNIST Dataset that is used to classify images

I was recently learning about neural networks and came across the MNIST data set. I understood that a sigmoid cost function is used to reduce the loss. Also, weights and biases get adjusted and optimum weights and biases are found after the training. The thing I did not understand is: on what basis are the images classified? For example, to classify whether a patient has cancer or not, data like age, location, etc., become features. In the MNIST dataset, I did not find any of that. Am I missing something here? Please help me with this.
First of all, the network pipeline consists of 3 main parts:
1. Input manipulation
2. Parameters that affect the finding of the minimum
3. Parameters like your decision function in your interpretation layer (often a fully connected layer)
In contrast to a regular machine learning pipeline, where you have to extract features manually, a CNN uses filters (filters as in edge detection or Viola-Jones).
A filter runs across the image and is convolved with the pixels, producing an output.
This output is then interpreted by a neuron: if it is above a threshold it is considered valid (a step function outputs 1 if valid; in the case of a sigmoid it takes a value on the sigmoid curve).
The next steps are the same as before.
This proceeds until the interpretation layer (often a softmax). This layer interprets your computation (if the filters are well adapted to your problem, you will get a good predicted label), which means you have a small difference between y_guess and y_true_label.
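As a toy illustration of the filtering step (my own sketch, not part of the original answer), running a small edge-detection filter over a 28x28 image and thresholding the response looks like this:

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(28, 28)                # stand-in for one MNIST digit
edge_filter = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])           # Sobel-like kernel, as used in edge detection
feature_map = convolve2d(image, edge_filter, mode="valid")   # 26x26 response of the filter
activated = np.maximum(feature_map, 0)         # ReLU-style thresholding by the "neuron"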
Now you can see that to get the guess of y we have multiplied the input x by many weights w and also applied functions to it. This can be handled with the chain rule, as in calculus.
To get better results, the effect of each single weight on the error must be known. Therefore you use backpropagation, which computes the derivative of the error with respect to all w. The trick is that you can reuse intermediate derivatives, which is more or less what backpropagation is, and it becomes easier since you can use matrix-vector notation.
Once you have your gradient, you can use the normal concept of minimization, where you walk along the steepest descent. (There are also many other gradient methods like Adagrad or Adam, etc.)
The steps repeat until convergence or until you reach the maximum number of epochs.
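A toy sketch of that minimization loop (again my own illustration, not from the answer), fitting a single weight w so that w * x approximates y:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x                               # true relationship, so the optimum is w = 2
w, lr = 0.0, 0.05
for epoch in range(100):                  # repeat until convergence or the maximum number of epochs
    error = w * x - y
    grad = 2 * np.mean(error * x)         # derivative of the mean squared error with respect to w
    w -= lr * grad                        # walk along the steepest descent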
So the answer is: THE COMPUTED WEIGHTS (FILTERS) ARE THE KEY TO DETECT NUMBERS AND DIGITS :)

How to decide activation function in neural network

I am using a feedforward, backpropagation, multilayer neural network, and I am using the sigmoid function as the activation function, which has a range of -1 to 1. But the minimum error is not going below 5.8, and I want it much lower; you can see the output after 100,000 iterations.
I think this is because my output range is above 1, and the sigmoid function's range is only -1 to 1. Can anybody suggest how I can overcome this problem, as my desired output range is 0 to 2.5? Which activation function would be best for this range?
If you are seeking to reduce output error, there are a couple of things to look at before tweaking a node's activation function.
First, do you have a bias node? Bias nodes have several implications, but - most relevant to this discussion - they allow the network output to be translated to the desired output range. As this reference states:
The use of biases in a neural network increases the capacity of the network to solve problems by allowing the hyperplanes that separate individual classes to be offset for superior positioning.
This post provides a very good discussion:
Role of Bias in Neural Networks.
This one is good, too: Why the BIAS is necessary in ANN? Should we have separate BIAS for each layer?
Second method: it often helps to normalize your inputs and outputs. As you note, your sigmoid offers a range of +/- 1. This small range can be problematic when trying to learn functions that have a range of 0 to 1000 (for example). To aid learning, it's common to scale and translate inputs to accommodate the node activation functions. In this example, one might divide the range by 500, yielding a 0 to 2 range, and then subtract 1 from this range. In this manner, the inputs have been normalized to a range of -1 to 1, which better fits the activation function. Note that network output should be denormalized: first, add +1 to the output, then multiply by 500.
In your case, you might consider scaling the target values by 0.8, then subtracting 1 from the result. You would then add 1 to the network output and multiply by 1.25 to recover the desired range. Note that this method may be easiest to accomplish, since it does not directly change your network topology like the addition of bias would.
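In code, that rescaling for the 0 to 2.5 range might look like this (a small sketch of the arithmetic described above):

def normalize_target(y):
    return y * 0.8 - 1        # 0..2.5 -> 0..2 -> -1..1, matching the activation's range

def denormalize_output(out):
    return (out + 1) * 1.25   # -1..1 -> 0..2 -> 0..2.5, recovering the desired range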
Finally, have you experimented with changing the number of hidden nodes? Although I believe the first two options are better candidates for improving performance, you might give this one a try. (Just as a point of reference, I can't recall an instance in which modifying the activation function's shape improved network response more than option 1 and 2.)
Here are some good discussion of hidden layer/node configuration:
multi-layer perceptron (MLP) architecture: criteria for choosing number of hidden layers and size of the hidden layer?
How to choose number of hidden layers and nodes in neural network?
24 inputs make your problem a high-dimensional one. Ensure that your training dataset adequately covers the input state space, and ensure that your test data and training data are drawn from similarly representative populations. (Take a look at the "cross-validation" discussions when training neural networks.)
The vanilla sigmoid function is:
import math

def sigmoid(x):
    # standard logistic sigmoid, output range 0 to 1
    return 1/(1+math.e**-x)
You could transform that to:
def mySigmoid(x):
    # same shape, but scaled so the output range becomes 0 to 2.5
    return 2.5/(1+math.e**-x)
in order to make the transformation that you want.
