I have a neural network with 30 input nodes, 1 hidden node, and 1 output node. I am training it on a dataset where the inputs are 30-dimensional vectors with entries between -1 and 1, and the targets are the 2nd entry of these vectors.
I expect the network to train and learn to output the 2nd entry of the input vector quickly, since this is as simple as decreasing the weights in the network which connect the input nodes to the hidden node to zero except the one for the 2nd entry.
However, The loss stalls quickly at approximately 0.168. I'ld expect it to quickly go to zero, which is the case if the targets are just 0.
The following code showcases the problem with a randomised dataset.
import numpy as np
from tensorflow.keras import models
from tensorflow.keras import layers
import tensorflow as tf
np.random.seed(123)
dataSize = 100000
xdata = np.zeros((dataSize, 30))
ydata = np.zeros((dataSize))
for i in range(dataSize):
vec = (np.random.rand(30) * 2) - 1
xdata[i] = vec
ydata[i] = vec[1]
model = models.Sequential()
model.add(layers.Dense(1, activation="relu", input_shape=(30, )))
model.add(layers.Dense(1, activation="sigmoid"))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
lossObject = tf.keras.losses.MeanSquaredError()
model.compile(optimizer=optimizer, loss=lossObject)
model.fit(xdata, ydata, epochs=200, batch_size=32)
I have tried multiple different optimizers, loss functions, batch sizes, dataset sizes and learning rates, however the result is always the loss stalling at a relatively high value.
Why is this happening? I am not interested in responses asking why I am doing this. I am new to neural networks and I need to understand why this is happening before I can continue with my original task.
Thank you in advance.
Your targets are between -1 and 1, but a sigmoid output activation limits outputs to [0, 1], making it impossible to achieve zero loss if any targets happen to be < 0 (which is very likely with a large dataset). You could fix it by using tanh as activation instead, which maps to [-1, 1], or just using no activation in the output layer should be fine in this case. When you fix all targets to 0, this is obviously not an issue, and (almost) zero loss can be achieved.
As a general lesson: Always make sure your output activation makes sense with regards to your target data. At the very least, the value ranges should be identical -- although this might not be a sufficient condition for a good output activation.
As a second point: Having a single node with relu activation is also a bad idea. If the input to relu is < 0, the output will be 0, and the gradient will be, as well. In this case, no learning is possible and incorrect outputs for some data points may never be corrected.
It is generally not a problem if some units are 0 some of the time because the gradient can flow through other paths, but with only one unit, this will likely cripple learning as well. I would recommend that you either use more units in the hidden layer, or use a different activation function.
Related
New to neural nets so please correct my syntax.
I'm trying to create a LSTM RNN that will predict the Fibonacci sequence. When I ran the code below, the loss remains incredibly high (around 35339663592701874176).
Why does the shape of the input have to be (batch_size, timesteps, input_dim)? In my example I have 100 data entries so that'd be my batch_size, and the Fibonacci sequence takes in 2 inputs so that'd be input_dim but what would timesteps be in this case? 1?
Shouldn't the the units of the LSTM be 1? If I'm understanding correctly, the "units" are just the amount of hidden state nodes that are in the LSTM. So in theory, each of the 2 inputs would have a "1" coefficient weight towards that hidden state after training.
Would an RNN be a suitable model for this problem? When I've looked online, most people like to use the Fibonacci sequence as an example to explain how RNN's work.
Thanks for the help!
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Create Training Data
xs = [[[1, 1]]]
ys = []
i = 0
while i < 100:
ys.append([xs[i][0][0]+xs[i][0][1]])
xs.append([[xs[i][0][1], ys[len(ys)-1][0]]])
i = i + 1
del xs[len(xs)-1]
xs = np.array(xs, dtype=float)
ys = np.array(ys, dtype=float)
# Create Model
model = keras.Sequential()
model.add(layers.LSTM(1, input_shape=(1, 2)))
model.add(layers.Dense(1))
model.compile(optimizer="adam", loss="mean_absolute_error", metrics=[ 'accuracy' ])
# Train
model.fit(xs, ys, epochs=100000)
You can't feed a NN data where some of the values are 10^21 times as large as some of the others and expect it to work, it just doesn't happen.
You're not doing anything here that actually calls for LSTM (or any RNN), you're not actually using the time dimension, and you're basically just trying to learn addition. Maybe you meant to do something different (like input digits as a sequence, or have the output run for multiple timesteps and give you several values of the sequence), but that's not what you're doing, and it's unclear what you want.
The number of units is your memory/procesing capacity. Each unit of an RNN is able to receive values from all of the units in the previous timestep. One unit alone can't do anything interesting, especially with no layer before it to preprocess the data.
I am attempting to replicate an deep convolution neural network from a research paper. I have implemented the architecture, but after 10 epochs, my cross entropy loss suddenly increases to infinity. This can be seen in the chart below. You can ignore what happens to the accuracy after the problem occurs.
Here is the github repository with a picture of the architecture
After doing some research I think using an AdamOptimizer or relu might be a problem.
x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])
#Many Convolutions and Relus omitted
final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)
W_final = weight_variable([7168,7168,3])
b_final = bias_variable([7168,3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
EDIT
If anyone is interested, the solution was that I was basically feeding in incorrect data.
Solution: Control the solution space. This might mean using smaller datasets when training, it might mean using less hidden nodes, it might mean initializing your wb differently. Your model is reaching a point where the loss is undefined, which might be due to the gradient being undefined, or the final_conv signal.
Why: Sometimes no matter what, a numerical instability is reached. Eventually adding a machine epsilon to prevent dividing by zero (cross entropy loss here) just won't help because even then the number cannot be accurately represented by the precision you are using. (Ref: https://en.wikipedia.org/wiki/Round-off_error and https://floating-point-gui.de/basic/)
Considerations:
1) When tweaking epsilons, be sure to be consistent with your data type (Use the machine epsilon of the precision you are using, in your case float32 is 1e-6 ref: https://en.wikipedia.org/wiki/Machine_epsilon and python numpy machine epsilon.
2) Just in-case others reading this are confused: The value in the constructor for Adamoptimizer is the learning rate, but you can set the epsilon value (ref: How does paramater epsilon affects AdamOptimizer? and https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer)
3) Numerical instability of tensorflow is there and its difficult to get around. Yes there is tf.nn.softmax_with_cross_entropy but this is too specific (what if you don't want a softmax?). Refer to Vahid Kazemi's 'Effective Tensorflow' for an insightful explanation: https://github.com/vahidk/EffectiveTensorflow#entropy
that jump in your loss graph is very weird...
I would like you to focus on few points :
if your images are not normalized between 0 and 1 then normalize them
if you have normalized your values between -1 and 1 then use a sigmoid layer instead of softmax because softmax squashes the values between 0 and 1
before using softmax add a sigmoid layer to squash your values (Highly Recommended)
other things you can do is add dropouts for every layer
also I would suggest you to use tf.clip so that your gradients does not explode and implode
you can also use L2 regularization
and experiment with the learning rate and epsilon of AdamOptimizer
I would also suggest you to use tensor-board to keep track of the weights so that way you will come to know where the weights are exploding
You can also use tensor-board for keeping track of loss and accuracy
See The softmax formula below:
Probably that e to power of x, the x is being a very large number because of which softmax is giving infinity and hence the loss is infinity
Heavily use tensorboard to debug and print the values of the softmax so that you can figure out where you are going wrong
One more thing I noticed you are not using any kind of activation functions after the convolution layers... I would suggest you to leaky relu after every convolution layer
Your network is a humongous network and it is important to use leaky relu as activation function so that it adds non-linearity and hence improves the performance
You may want to use a different value for epsilon in the Adam optimizer (e.g. 0.1 -- 1.0).This is mentioned in the documentation:
The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1.
Sorry I am new to RNN. I have read this post on TimeDistributed layer.
I have reshaped my data in to Keras requried [samples, time_steps, features]: [140*50*19], which means I have 140 data points, each has 50 time steps, and 19 features. My output is shaped [140*50*1]. I care more about the last data point's accuracy. This is a regression problem.
My current code is :
x = Input((None, X_train.shape[-1]) , name='input')
lstm_kwargs = { 'dropout_W': 0.25, 'return_sequences': True, 'consume_less': 'gpu'}
lstm1 = LSTM(64, name='lstm1', **lstm_kwargs)(x)
output = Dense(1, activation='relu', name='output')(lstm1)
model = Model(input=x, output=output)
sgd = SGD(lr=0.00006, momentum=0.8, decay=0, nesterov=False)
optimizer = sgd
model.compile(optimizer=optimizer, loss='mean_squared_error')
My questions are:
My case is many-to-many, so I need to use return_sequences=True? How about if I only need the last time step's prediction, it would be many-to-one. So I need to my output to be [140*1*1] and return_sequences=False?
Is there anyway to enhance my last time points accuracy if I use many-to-many? I care more about it than the other points accuracy.
I have tried to use TimeDistributed layer as
output = TimeDistributed(Dense(1, activation='relu'), name='output')(lstm1)
the performance seems to be worse than without using TimeDistributed layer. Why is this so?
I tried to use optimizer=RMSprop(lr=0.001). I thought RMSprop is supposed to stabilize the NN. But I was never able to get good result using RMSprop.
How do I choose a good lr and momentum for SGD? I have been testing on different combinations manually. Is there a cross validation method in keras?
So:
Yes - return_sequences=False makes your network to output only a last element of sequence prediction.
You could define the output slicing using the Lambda layer. Here you could find an example on how to do this. Having the output sliced you can provide the additional output where you'll feed the values of the last timestep.
From the computational point of view these two approaches are equivalent. Maybe the problem lies in randomness introduced by weight sampling.
Actually - using RMSProp as a first choice for RNNs is a rule of thumb - not a general proved law. Moreover - it is strongly adviced not to change it's parameters. So this might cause the problems. Another thing is that LSTM needs a lot of time to stabalize. Maybe you need to leave it for more epochs. Last thing - is that maybe your data could favour another activation function.
You could use a keras.sklearnWrapper.
I build a LSTM like:
lstm_cell = tf.nn.rnn_cell.LSTMCell(n_hidden, forget_bias=1.0, state_is_tuple=True, activation=tf.nn.tanh)
lstm_cell = tf.nn.rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=0.5)
lstm_cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * 3, state_is_tuple=True)
Then i train the model, and save variables.
The next time i load saved variables and skip training, it gives me a different prediction.
If i change the output_keep_prob to 1, this model can always show me the same prediction, but if the output_keep_prob is less than 1, like 0.5, this model shows me different prediction every time.
So i guess if the DropoutWrapper leads to different output?
If so, how can i solve this problem?
Thanks
Try using the seed keyword argument to DropoutWrapper(...):
lstm_cell = tf.nn.rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=0.5, seed=42)
See the docs here for DropoutWrapper.__init__
Dropout will randomly activate a subset of your net, and is used during training for regularization. Because you've hardcoded dropout as 0.5, it means every time you run the net half your nodes will be randomly silenced, thus producing a different and random result.
You can sanity check this is what's happening by setting a seed, so that the same nodes will be 'randomly' silenced by dropout each time. However, what you probably want to do is make dropout a placeholder so that you can set it to 1.0 (ie keep all the nodes) during test time, which will produce the same output for each input deterministically.
I have written some code to implement backpropagation in a deep neural network with the logistic activation function and softmax output.
def backprop_deep(node_values, targets, weight_matrices):
delta_nodes = node_values[-1] - targets
delta_weights = delta_nodes.T.dot(node_values[-2])
weight_updates = [delta_weights]
for i in xrange(-2, -len(weight_matrices)- 1, -1):
delta_nodes = dsigmoid(node_values[i][:,:-1]) * delta_nodes.dot(weight_matrices[i+1])[:,:-1]
delta_weights = delta_nodes.T.dot(node_values[i-1])
weight_updates.insert(0, delta_weights)
return weight_updates
The code works well, but when I switched to ReLU as the activation function it stopped working. In the backprop routine I only change the derivative of the activation function:
def backprop_relu(node_values, targets, weight_matrices):
delta_nodes = node_values[-1] - targets
delta_weights = delta_nodes.T.dot(node_values[-2])
weight_updates = [delta_weights]
for i in xrange(-2, -len(weight_matrices)- 1, -1):
delta_nodes = (node_values[i]>0)[:,:-1] * delta_nodes.dot(weight_matrices[i+1])[:,:-1]
delta_weights = delta_nodes.T.dot(node_values[i-1])
weight_updates.insert(0, delta_weights)
return weight_updates
However, the network no longer learns, and the weights quickly go to zero and stay there. I am totally stumped.
Although I have determined the source of the problem, I'm going to leave this up in case it might be of benefit to someone else.
The problem was that I did not adjust the scale of the initial weights when I changed activation functions. While logistic networks learn very well when node inputs are near zero and the logistic function is approximately linear, ReLU networks learn well for moderately large inputs to nodes. The small weight initialization used in logistic networks is therefore not necessary, and in fact harmful. The behavior I was seeing was the ReLU network ignoring the features and attempting to learn the bias of the training set exclusively.
I am currently using initial weights distributed uniformly from -.5 to .5 on the MNIST dataset, and it is learning very quickly.