I'm using MXNet to train an XOR neural network, but the loss doesn't go down; it always stays above 0.5.
Below is my code, with MXNet 1.1.0; Python 3.6; OS X El Capitan 10.11.6.
I tried two loss functions - squared loss and softmax loss - and neither worked.
from mxnet import ndarray as nd
from mxnet import autograd
from mxnet import gluon
import matplotlib.pyplot as plt
X = nd.array([[0,0],[0,1],[1,0],[1,1]])
y = nd.array([0,1,1,0])
batch_size = 1
dataset = gluon.data.ArrayDataset(X, y)
data_iter = gluon.data.DataLoader(dataset, batch_size, shuffle=True)
plt.scatter(X[:, 1].asnumpy(),y.asnumpy())
plt.show()
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2, activation="tanh"))
    net.add(gluon.nn.Dense(1, activation="tanh"))
net.initialize()

softmax_cross_entropy = gluon.loss.SigmoidBCELoss()  # SigmoidBinaryCrossEntropyLoss()
square_loss = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.3})

train_losses = []
for epoch in range(100):
    train_loss = 0
    for data, label in data_iter:
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        loss.backward()
        trainer.step(batch_size)
        train_loss += nd.mean(loss).asscalar()
    train_losses.append(train_loss)

plt.plot(train_losses)
plt.show()
I got this figured out somewhere else, so I'm going to post the answer here.
Basically, my original code had issues in more than one place.
1. Weight initialization. Notice that I used the default initialization
net.initialize()
which actually does
net.initialize(initializer.Uniform(scale=0.07))
Apparently these initial weights were too small, and the network could never escape that region. So the fix is
net.initialize(mx.init.Uniform(1))
After doing this, the network could converge using sigmoid/tanh as the activation with L2Loss as the loss function, and it also worked with sigmoid and SigmoidBCELoss. However, it still didn't work with tanh and SigmoidBCELoss, which is fixed by the second item below.
2. SigmoidBCELoss has to be used in one of these two configurations in the output layer.
2.1. Linear activation and SigmoidBCELoss(from_sigmoid=False);
2.2. Non-linear activation and SigmoidBCELoss(from_sigmoid=True), in which the output of the non-linear function falls into (0, 1).
In my original code, when I used SigmoidBCELoss, I was using either all sigmoid or all tanh activations. So I just needed to change the activation in the output layer from tanh to sigmoid, and the network converged. I can still keep tanh in the hidden layers.
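To make this concrete, here is a minimal sketch following the two points above (import mxnet as mx is assumed; sigmoid output with SigmoidBCELoss(from_sigmoid=True), tanh kept in the hidden layer, and the wider Uniform(1) initialization):
import mxnet as mx
from mxnet import gluon

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2, activation="tanh"))      # hidden layer can stay tanh
    net.add(gluon.nn.Dense(1, activation="sigmoid"))   # output falls into (0, 1) for BCE
net.initialize(mx.init.Uniform(1))                     # wider init than the default 0.07

loss_fn = gluon.loss.SigmoidBCELoss(from_sigmoid=True) # output layer already applies a sigmoid
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.3})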
Hope this helps!
Related
I am currently trying to train a model using tf.GradientTape, as model.fit(...) from Keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss in the custom tf.GradientTape workflow first decreases slightly, but then gets stuck and does not improve any further, no matter how many epochs I run. The chosen metric also stops changing after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero and very large values. The running loss is more stable but shows that the model is not improving.
This is all in contrast to using model.fit(...), where loss and metrics improve immediately.
My code:
def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
    x1 = Input(62)
    x2 = Input((62, 3))
    x = Embedding(30, 100, mask_zero=True)(x1)
    x = Concatenate()([x, x2])
    x = Bidirectional(LSTM(500,
                           return_sequences=True,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Bidirectional(LSTM(500,
                           return_sequences=False,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Activation('softmax')(x)
    x = Dense(1000)(x)
    x = Dense(500)(x)
    x = Dense(250)(x)
    x = Dense(1, bias_initializer='ones')(x)
    x = tf.math.abs(x)
    return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA
dat_train = tf.data.Dataset.from_generator(
    generator=lambda: <load_function()>,
    output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)
# custom training
for epoch in range(50):
    for (x1, x2), y in dat_train:
        with tf.GradientTape() as tape:
            y_pred = model((x1, x2), training=True)
            loss = model.loss(y, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer; however, I found abs to be more robust, and changing it does not change the outcome. The input x1 of the model is a sequence; x2 contains some additional features that are later concatenated to the embedded x1 sequence. For my actual approach I'm not using MSE, but the problem occurs either way.
I could provide some data; however, my dataset is quite large, so I would need to extract a portion of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly twice as long as model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know if that's normal, and I don't know how to compare them to the gradients produced by model.fit(...) (the sketch below shows how I logged them).
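For reference, gradient magnitudes can be logged inside the custom loop roughly like this (a sketch; the tf.print line is the only addition to the loop above, and tf.linalg.global_norm gives one overall number per batch):
with tf.GradientTape() as tape:
    y_pred = model((x1, x2), training=True)
    loss = model.loss(y, y_pred)
grads = tape.gradient(loss, model.trainable_variables)
tf.print("batch loss:", loss, "global grad norm:", tf.linalg.global_norm(grads))
model.optimizer.apply_gradients(zip(grads, model.trainable_variables))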
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs are shown here:
custom training
keras training
While I only used a portion of my data in the notebook, it will still take quite a long time to run. For the custom training run, the loss for each batch is simply stored in losses; it matches the behavior shown in the custom-training image.
So far, I've noticed two ways of improving the performance of the custom training:
Using custom layer initialization
Using MSE as a loss function
Using MSE instead of my own loss function actually improves the custom training performance. Still, using MSE and/or different initialization doesn't come close to the performance of Keras fit.
I have found the solution: it was a simple shape mismatch, which was somehow not picked up by any error check and occurred with both my custom loss function and MSE. Using x = Reshape(())(x) as the final layer did the trick.
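To illustrate the kind of mismatch with a tiny standalone example (not my actual data): if the labels have shape (batch,) while the model output has shape (batch, 1), Keras' MSE broadcasts the difference to (batch, batch) instead of comparing element-wise, so the loss is computed against the wrong labels without any error being raised:
import tensorflow as tf

y_true = tf.constant([1.0, 2.0, 3.0, 4.0])           # shape (4,)
y_pred = tf.constant([[1.0], [2.0], [3.0], [4.0]])   # shape (4, 1), same values

# Broadcasting gives a (4, 4) difference matrix, so the per-sample losses
# are nonzero even though every prediction matches its label:
print(tf.keras.losses.mse(y_true, y_pred))            # [3.5 1.5 1.5 3.5]

# Removing the trailing dimension (what Reshape(())(x) does inside the model) fixes it:
print(tf.keras.losses.mse(y_true, tf.squeeze(y_pred, axis=-1)))   # 0.0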
I'm trying to train an LSTM autoencoder.
It's like a seq2seq model: you feed a signal in and get a reconstructed signal sequence back. I'm using a sequence that should be quite easy to learn. The loss function and metric are MSE. The first hundred epochs went well, but after some more epochs the MSE becomes extremely high and sometimes goes to NaN. I don't know what causes this.
Can you inspect the code and give me a hint?
The sequence is normalized beforehand, so it's in the [0, 1] range; how can it produce such a high MSE?
This is the input sequence I get from training set:
sequence1 = x_train[0][:128]
looks like this:
I get the data from a public signal dataset (128 x 1).
This is the code (I modified it from the Keras blog):
# lstm autoencoder recreate sequence
from numpy import array
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.utils import plot_model
from keras import regularizers
# define input sequence. sequence1 is only a one dimensional list
# reshape sequence1 input into [samples, timesteps, features]
n_in = len(sequence1)
sequence = sequence1.reshape((1, n_in, 1))
# define model
model = Sequential()
model.add(LSTM(1024, activation='relu', input_shape=(n_in,1)))
model.add(RepeatVector(n_in))
model.add(LSTM(1024, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')
for epo in [50,100,1000,2000]:
    model.fit(sequence, sequence, epochs=epo)
The first few epochs went well; all the losses were around 0.003 or so. Then the loss suddenly jumps to some very large number and eventually goes to NaN.
You might have a problem with exploding gradients during backpropagation.
Try using the clipnorm and clipvalue parameters to control gradient clipping: https://keras.io/optimizers/
Alternatively, what learning rate are you using? I would also try reducing the learning rate by a factor of 10, 100, or 1000 to check whether you observe the same behavior.
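For example, a rough sketch of both suggestions combined (the clipping threshold and learning rate are just starting points to experiment with):
from keras import optimizers

# clipnorm rescales any gradient whose L2 norm exceeds 1.0;
# clipvalue=0.5 would instead clip each gradient element to [-0.5, 0.5]
opt = optimizers.Adam(lr=1e-4, clipnorm=1.0)   # 10x smaller than Adam's default lr of 1e-3
model.compile(optimizer=opt, loss='mse')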
'relu' is the main culprit - see here. Possible solutions:
1. Initialize weights to smaller values, e.g. keras.initializers.TruncatedNormal(mean=0.0, stddev=0.01)
2. Clip weights (at initialization, or via kernel_constraint, recurrent_constraint, ...); see the sketch after this list
3. Increase weight decay
4. Use a warmup learning rate scheme (start low, gradually increase)
5. Use 'selu' activation, which is more stable, is ReLU-like, and works better than ReLU on some tasks
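For items 1 and 2, a rough sketch (the stddev and max-norm limits are illustrative placeholders, not tuned values):
from keras.initializers import TruncatedNormal
from keras.constraints import max_norm

model.add(LSTM(1024, activation='relu', input_shape=(n_in, 1),
               kernel_initializer=TruncatedNormal(mean=0.0, stddev=0.01),
               kernel_constraint=max_norm(2.0),
               recurrent_constraint=max_norm(2.0)))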
Since your training was stable for many epochs, option 3 sounds the most promising, as it seems that your weight norms eventually grow too large and the gradients explode. In general, I suggest keeping the weight norms around 1 for 'relu'; you can monitor the L2 norms with the function below. I also recommend See RNN for inspecting layer activations & gradients.
import numpy as np

def inspect_weights_l2(model, names='lstm', axis=-1):
    def _get_l2(w, axis=-1):
        axis = axis if axis != -1 else len(w.shape) - 1
        reduction_axes = tuple([ax for ax in range(len(w.shape)) if ax != axis])
        return np.sqrt(np.sum(np.square(w), axis=reduction_axes))

    def _print_layer_l2(layer, idx, axis=-1):
        W = layer.get_weights()
        l2_all = []
        txt = "{} "
        for w in W:
            txt += "{:.4f}, {:.4f} -- "
            l2 = _get_l2(w, axis)
            l2_all.extend([l2.max(), l2.mean()])
        txt = txt.rstrip(" -- ")
        print(txt.format(idx, *l2_all))

    names = [names] if isinstance(names, str) else names
    for idx, layer in enumerate(model.layers):
        if any([name in layer.name.lower() for name in names]):
            _print_layer_l2(layer, idx, axis=axis)
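For example, calling it after each fit prints the max and mean L2 norm of every weight tensor in the LSTM layers, so you can watch the norms drift upward over training:
for epo in [50, 100, 1000, 2000]:
    model.fit(sequence, sequence, epochs=epo)
    inspect_weights_l2(model, names='lstm')   # watch for norms drifting well above 1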
I have been trying to implement a simple linear regression model using neural networks in Keras, in the hope of understanding how to work with the Keras library. Unfortunately, I am ending up with a very bad model. Here is the implementation:
from pylab import *
from keras.models import Sequential
from keras.layers import Dense
#Generate dummy data
data = linspace(1,2,100).reshape(-1,1)
y = data*5
#Define the model
def baseline_model():
    model = Sequential()
    model.add(Dense(1, activation = 'linear', input_dim = 1))
    model.compile(optimizer = 'rmsprop', loss = 'mean_squared_error', metrics = ['accuracy'])
    return model
#Use the model
regr = baseline_model()
regr.fit(data,y,epochs =200,batch_size = 32)
plot(data, regr.predict(data), 'b', data,y, 'k.')
The generated plot is as follows:
Can somebody point out the flaw in the above model definition (and how to get a better fit)?
You should increase the learning rate of the optimizer. The default learning rate of the RMSprop optimizer is 0.001, so the model takes a few hundred epochs to converge to a final solution (you have probably noticed yourself that the loss value decreases slowly, as shown in the training log). To set the learning rate, import the optimizers module:
from keras import optimizers
# ...
model.compile(optimizer=optimizers.RMSprop(lr=0.1), loss='mean_squared_error', metrics=['mae'])
Either 0.01 or 0.1 should work fine. After this modification you may not need to train the model for 200 epochs; even 5, 10 or 20 epochs may be enough.
Also note that you are performing a regression task (i.e. predicting real numbers), whereas 'accuracy' as a metric is used for classification tasks (i.e. predicting discrete labels like the category of an image). Therefore, as you can see above, I have replaced it with mae (i.e. mean absolute error), which is also much more interpretable than the value of the loss (i.e. mean squared error) used here.
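Putting it together, a minimal sketch (the learning rate and epoch count are just reasonable starting values, reusing data and y from the question):
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(1, activation='linear', input_dim=1))
model.compile(optimizer=optimizers.RMSprop(lr=0.1), loss='mean_squared_error', metrics=['mae'])
model.fit(data, y, epochs=20, batch_size=32)       # converges far faster than with the default lr
plot(data, model.predict(data), 'b', data, y, 'k.')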
The code below gives a better fit for your data.
Take a look:
from pylab import *
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
%matplotlib inline

# Generate dummy data
data = linspace(1,2,100).reshape(-1,1)
y = data*5

# Define the model
def baseline_model():
    global num_neurons
    model = Sequential()
    model.add(Dense(num_neurons, activation = 'linear', input_dim = 1))
    model.add(Dense(1 , activation = 'linear'))
    model.compile(optimizer = 'rmsprop', loss = 'mean_squared_error')
    return model

# Set num_neurons in the first dense layer (you may change it later)
num_neurons = 17

# Use the model
regr = baseline_model()
regr.fit(data, y, epochs = 200, verbose = 0)
plot(data, regr.predict(data), 'bo', data, y, 'k-')
The first plot, with num_neurons = 17, is a good fit.
But we can explore further.
Click the links below to see the plots:
Plot for num_neurons = 12
Plot for num_neurons = 17
Plot for num_neurons = 19
Plot for num_neurons = 20
You can see that as we increase the number of neurons, the model fits the data better and better.
I hope you got it.
Thank you
Interesting problem, so I plugged the dataset into a model-builder framework I wrote. The framework has two callbacks: an EarlyStopping callback on 'loss' and a TensorBoard callback for analysis. I removed the 'metrics' argument from the model compile - it is unnecessary, and should be 'mae' anyway.
@mathnoob123's model as written, with learning rate (lr) = 0.01, reached loss=1.2623e-06 in 2968 epochs,
BUT the same model with the RMSprop optimizer replaced by Adam was the most accurate, with loss=1.22e-11 in 2387 epochs.
The best compromise I found was @mathnoob123's model with lr=0.01, which reaches loss=1.5052e-4 in 230 epochs. This is better than @Kunam's model with 20 nodes in the first dense layer and lr=0.001: loss=3.1824e-4 in 226 epochs.
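For reference, swapping RMSprop for Adam is a one-line change to the compile call (the 0.01 learning rate here simply mirrors the lr used above and is an assumption, not a tuned value):
from keras import optimizers

model.compile(optimizer=optimizers.Adam(lr=0.01), loss='mean_squared_error')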
Now I'm going to look at randomizing the label dataset (y) and see what happens....
I'm starting to study neural networks, so I began programming some simple neural networks in Python with TensorFlow.
I'm trying to build one for the MNIST database.
The problem I have is that the loss doesn't decrease during training; it gets stuck at 60000, which is the number of training images.
I've realized that the predictions it makes are all zeros. Here is the code (I'm also new to this platform, so I'm sorry if something is wrong with the post):
# -*- coding: utf-8 -*-
from keras.datasets import mnist # subroutines for fetching the MNIST dataset
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from keras.utils import np_utils # utilities for one-hot encoding of ground truth values
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = np.reshape(x_train,[60000,784])
y_train = np_utils.to_categorical(y_train, 10) # One-hot encode the labels
x_test = np.reshape(x_test,[10000,784])
y_test = np_utils.to_categorical(y_test, 10) # One-hot encode the labels
input = tf.placeholder(tf.float32, name='Input')
output = tf.placeholder(tf.float32, name = 'Output')
syn0 = tf.Variable(2*tf.random_uniform([784,10],seed=1)-1, name= 'syn0')
bias = tf.Variable(2*tf.random_uniform([10],seed=1)-1, name= 'syn0')
syn0 = tf.Variable(tf.zeros([784,10]))
bias = tf.Variable(tf.zeros([10]))
init = tf.global_variables_initializer()
#model
l1 = tf.sigmoid((tf.matmul(input,syn0) + bias),name='layer1')
error = tf.square(l1-output,name='error')
loss = tf.reduce_sum(error, name='cost')
#optimizer
with tf.name_scope('trainning'):
    optimizer = tf.train.GradientDescentOptimizer(0.1)
    train = optimizer.minimize(loss)
#session
sess = tf.Session()
sess.run(init)
#trainning
for i in range (100):
    _,lossNow = sess.run([train,loss],{input: x_train,output: y_train})
    print(lossNow)
#print debug
print("Finally, the coeficients are: " , sess.run(tf.transpose(syn0)))
print()
pred = sess.run(l1,{input: x_test,output: y_test})
print("Next prediction: " , pred)
print()
print("Final Loss: ", sess.run(loss,{input: x_test,output: y_test}))
#print graph
sess.close()
After a few iterations this is what I get:
[[ 150000.]]
[[ 60000.]]
[[ 60000.]]
[[ 60000.]]
[[ 60000.]]
It seems that the loss gets stuck. I've tried changing the learning_rate and adding more layers, but I get the same result.
Hope you can help me! And thank you! :D
I think there are two problems here. The first is that you sum over all 60000 data points in your set to compute the loss instead of using mini-batches, which makes your loss function extremely steep with very flat minima. The second is that you've found a local minimum of the loss function, and because of the steepness of the function you've gotten stuck there.
One more problem is that you use sigmoid instead of softmax. If you check your predicted values, they are all zeros. With sigmoid you can get this kind of prediction, since all outputs are independent and there is no normalization as with softmax (the outputs of a softmax always sum to 1).
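A rough sketch of both fixes, reusing the variable names from the question (TF1-style graph API; the batch size and number of steps are arbitrary choices):
# softmax + cross-entropy on the raw logits instead of sigmoid + summed squared error
logits = tf.matmul(input, syn0) + bias
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=output, logits=logits))
train = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

sess.run(tf.global_variables_initializer())
batch_size = 128
for i in range(1000):
    idx = np.random.randint(0, x_train.shape[0], batch_size)   # random mini-batch
    _, lossNow = sess.run([train, loss], {input: x_train[idx], output: y_train[idx]})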
For training, did you try executing the node "l1" in session.run()? It is the one that actually does the computation. It is needed for training as well: your error and loss depend on the output of "l1", and if you do not execute it in the session, the loss will not come out correctly.
error = tf.square(l1-output,name='error')
See, in this line you are calculating the error as l1 - output; output here is the ground truth, but "l1" will not have any value unless you compute it through the graph in session.run().
Can you try the command below and check the outputs?
predictions, _,lossNow = sess.run([l1,train,loss],{input: x_train,output: y_train})
Also, instead of using the "-" sign to calculate the error (l1 - output), please use TensorFlow operators, as shown here.
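For example (tf.subtract and tf.squared_difference are built-in TensorFlow ops):
error = tf.square(tf.subtract(l1, output), name='error')
# or, equivalently:
error = tf.squared_difference(l1, output, name='error')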
I'm using a very simple NN with a normalized word2vec as input.
When I run my training (based on mini-batches), the training cost starts around 1020 and decreases to around 1000, but never below that, and my accuracy is around 50%.
Why doesn't the cost decrease? How can I verify that the weight matrix is updated at each run?
apply_weights_OP = tf.matmul(X, weights, name="apply_weights")
add_bias_OP = tf.add(apply_weights_OP, bias, name="add_bias")
activation_OP = tf.nn.sigmoid(add_bias_OP, name="activation")
cost_OP = tf.nn.l2_loss(activation_OP-yGold, name="squared_error_cost")
optimizer = tf.train.AdamOptimizer(0.001)
global_step = tf.Variable(0, name='global_step', trainable=False)
training_OP = optimizer.minimize(cost_OP, global_step=global_step)
correct_predictions_OP = tf.equal(
    tf.argmax(activation_OP, 0),
    tf.argmax(yGold, 0)
)
accuracy_OP = tf.reduce_mean(tf.cast(correct_predictions_OP, "float"))

newCost, train_accuracy, _ = sess.run(
    [cost_OP, accuracy_OP, training_OP],
    feed_dict={
        X: trainX[indice_bas: indice_haut],
        yGold: trainY[indice_bas: indice_haut]
    }
)
Thanks
Try using cross entropy instead of the L2 loss; also, there is no real point in having an activation function on your output layer.
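A minimal sketch of that change, reusing the names from the question (this assumes yGold is one-hot encoded; keep activation_OP only for computing the accuracy and feed the raw logits to the loss):
logits_OP = tf.add(tf.matmul(X, weights), bias, name="logits")
cost_OP = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=yGold, logits=logits_OP),
    name="xent_cost")
training_OP = tf.train.AdamOptimizer(0.001).minimize(cost_OP, global_step=global_step)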
The examples that ship with TensorFlow actually include a basic model that is very similar to what you are trying to do.
By the way, it might also be that the problem you are trying to learn is simply not solvable by a simple linear model (i.e. what you are doing here); try using a deeper model. Here is an example of a 2-layer multilayer perceptron.