I'm trying to train a LSTM ae.
It's like a seq2seq model, you throw a signal in to get a reconstructed signal sequence. And the I'm using a sequence which should be quite easy. The loss function and metric is MSE. The first hundred epochs went well. However after some epochs I got MSE which is super high and it goes to NaN sometimes. I don't know what causes this.
Can you inspect the code and give me a hint?
The sequence gets normalization before, so it's in a [0,1] range, how can it produce such a high MSE error?
This is the input sequence I get from training set:
sequence1 = x_train[0][:128]
looks like this:
I get the data from a public signal dataset(128*1)
This is the code: (I modify it from keras blog)
# lstm autoencoder recreate sequence
from numpy import array
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.utils import plot_model
from keras import regularizers
# define input sequence. sequence1 is only a one dimensional list
# reshape sequence1 input into [samples, timesteps, features]
n_in = len(sequence1)
sequence = sequence1.reshape((1, n_in, 1))
# define model
model = Sequential()
model.add(LSTM(1024, activation='relu', input_shape=(n_in,1)))
model.add(RepeatVector(n_in))
model.add(LSTM(1024, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')
for epo in [50,100,1000,2000]:
model.fit(sequence, sequence, epochs=epo)
The first few epochs went all well. all the losses are about 0.003X or so. Then it became big suddenly, to some very big number, the goes to NaN all the way up.
You might have a problem with exploding gradient values when doing the backpropagation.
Try using the clipnorm and clipvalue parameters to control gradient clipping: https://keras.io/optimizers/
Alternatively, what is the learning rate you are using? I would also try to reduce the learning rate by 10,100,1000 to check if you observe the same behavior.
'relu' is the main culprit - see here. Possible solutions:
Initialize weights to smaller values, e.g. keras.initializers.TruncatedNormal(mean=0.0, stddev=0.01)
Clip weights (at initialization, or via kernel_constraint, recurrent_constraint, ...)
Increase weight decay
Use a warmup learning rate scheme (start low, gradually increase)
Use 'selu' activation, which is more stable, is ReLU-like, and works better than ReLU on some tasks
Since your training went stable for many epochs, 3 sounds the most promising, as it seems that eventually your weights norm gets too large and gradients explode. Generally, I suggest keeping the weight norms around 1 for 'relu'; you can monitor the l2 norms using the function below. I also recommend See RNN for inspecting layer activations & gradients.
def inspect_weights_l2(model, names='lstm', axis=-1):
def _get_l2(w, axis=-1):
axis = axis if axis != -1 else len(w.shape) - 1
reduction_axes = tuple([ax for ax in range(len(w.shape)) if ax != axis])
return np.sqrt(np.sum(np.square(w), axis=reduction_axes))
def _print_layer_l2(layer, idx, axis=-1):
W = layer.get_weights()
l2_all = []
txt = "{} "
for w in W:
txt += "{:.4f}, {:.4f} -- "
l2 = _get_l2(w, axis)
l2_all.extend([l2.max(), l2.mean()])
txt = txt.rstrip(" -- ")
print(txt.format(idx, *l2_all))
names = [names] if isinstance(names, str) else names
for idx, layer in enumerate(model.layers):
if any([name in layer.name.lower() for name in names]):
_print_layer_l2(layer, idx, axis=axis)
Related
I am working on a project to implement CNN-LSTM sentiment analysis. Below is the code
from keras.models import Sequential
from keras import regularizers
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Conv1D , MaxPool1D , Flatten , Dropout
from keras.layers import BatchNormalization
from keras import regularizers
model7 = Sequential()
model7.add(Embedding(max_words, 40,input_length=max_len)) #The embedding layer
model7.add(Conv1D(20, 5, activation='relu', kernel_regularizer = regularizers.l2(l = 0.0001), bias_regularizer=regularizers.l2(0.01)))
model7.add(Dropout(0.5))
model7.add(Bidirectional(LSTM(20,dropout=0.5, kernel_regularizer=regularizers.l2(0.01), recurrent_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01))))
model7.add(Dense(1,activation='sigmoid'))
model7.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
checkpoint7 = ModelCheckpoint("best_model7.hdf5", monitor='val_accuracy', verbose=1,save_best_only=True, mode='auto', period=1,save_weights_only=False)
history = model7.fit(X_train_padded, y_train, epochs=10,validation_data=(X_test_padded, y_test),callbacks=[checkpoint7])
Even after adding regularizers and dropout, my model has very high validation loss and low accuracy.
Epoch 3: val_accuracy improved from 0.54517 to 0.57010, saving model to best_model7.hdf5
2188/2188 [==============================] - 290s 132ms/step - loss: 0.4241 - accuracy: 0.8301 - val_loss: 0.9713 - val_accuracy: 0.5701
My train and test data:
train: (70000, 7)
test: (30000, 7)
train['sentiment'].value_counts()
1 41044
0 28956
test['sentiment'].value_counts()
1 17591
0 12409
Can anyone please let me know how to reduce overfitting.
Since your code works, I believe that your network is failing silently by 'not learning' a lot from the data. Here's a list of some of the things you can generally check:
Is your textual data well transformed into numerical data? Is it well reprented using TF-IDF or bag of words or any other method that returns a numerical representation?
I see that you imported batch normalization but you do not apply it. Batch norm actually helps and most importantly, does the job of regularizers since each input to each layer is normalized using the mini-batch the network has seen. So maybe remove your L2 regularizations in all layers and apply a simple batch norm instead which should reduce overfitting (also, use it without the drop out since some empirical studies show that they should not be combined together)
Your embedding output is currently set to 40, that is 40 numerical elements of a text vector that may contain more than 10,000 elements. It seems a bit low. Try something more 'standard' such as 128 or 256 instead of 40.
Lastly, you set the adam optimizer with all the default parameters. However, the learning rate can have a big impact on the way your loss function is computed. As I am sure you know, the gradient step uses this learning rate to progress in its calculation of the derivatives for each neuron. the default is learning_rate=0.001. So try the following code and increase a bit the learning rate (for example 0.01 or even 0.1).
A simple example :
# define model
model = Sequential()
model.add(LSTM(32)) # or CNN
model.add(BatchNormalization())
model.add(Dense(1))
# define optimizer
optimizer = keras.optimizers.Adam(0.01)
# define loss function
loss = keras.losses.binary_crossentropy
# define metric to optimize
metric = [keras.metrics.Accuracy(name='accuracy')] # you can add more
# compile model
model.compile(optimizer=optimizer, loss=loss, metrics=metric)
Final thought: I see that you went for a combination of CNN and LSTM which has great merite. However, it is always recommended to try a simple MLP network to establish a baseline score that you may later try to beat. Does a simple MLP with 1 or 2 layers and not a lot of units produce a low accuracy score as well? If it performs better than maybe the problem is in the implementation or in the hyper parameters that you chose for the layers (or even theoretical).
I hope this answer helps and cheers!
New to neural nets so please correct my syntax.
I'm trying to create a LSTM RNN that will predict the Fibonacci sequence. When I ran the code below, the loss remains incredibly high (around 35339663592701874176).
Why does the shape of the input have to be (batch_size, timesteps, input_dim)? In my example I have 100 data entries so that'd be my batch_size, and the Fibonacci sequence takes in 2 inputs so that'd be input_dim but what would timesteps be in this case? 1?
Shouldn't the the units of the LSTM be 1? If I'm understanding correctly, the "units" are just the amount of hidden state nodes that are in the LSTM. So in theory, each of the 2 inputs would have a "1" coefficient weight towards that hidden state after training.
Would an RNN be a suitable model for this problem? When I've looked online, most people like to use the Fibonacci sequence as an example to explain how RNN's work.
Thanks for the help!
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Create Training Data
xs = [[[1, 1]]]
ys = []
i = 0
while i < 100:
ys.append([xs[i][0][0]+xs[i][0][1]])
xs.append([[xs[i][0][1], ys[len(ys)-1][0]]])
i = i + 1
del xs[len(xs)-1]
xs = np.array(xs, dtype=float)
ys = np.array(ys, dtype=float)
# Create Model
model = keras.Sequential()
model.add(layers.LSTM(1, input_shape=(1, 2)))
model.add(layers.Dense(1))
model.compile(optimizer="adam", loss="mean_absolute_error", metrics=[ 'accuracy' ])
# Train
model.fit(xs, ys, epochs=100000)
You can't feed a NN data where some of the values are 10^21 times as large as some of the others and expect it to work, it just doesn't happen.
You're not doing anything here that actually calls for LSTM (or any RNN), you're not actually using the time dimension, and you're basically just trying to learn addition. Maybe you meant to do something different (like input digits as a sequence, or have the output run for multiple timesteps and give you several values of the sequence), but that's not what you're doing, and it's unclear what you want.
The number of units is your memory/procesing capacity. Each unit of an RNN is able to receive values from all of the units in the previous timestep. One unit alone can't do anything interesting, especially with no layer before it to preprocess the data.
I'm using keras with tensorflow backend. My goal is to query the batchsize of the current batch in a custom loss function. This is needed to compute values of the custom loss functions which depend on the index of particular observations. I like to make this clearer given the minimum reproducible examples below.
(BTW: Of course I could use the batch size defined for the training procedure and plugin it's value when defining the custom loss function, but there are some reasons why this can vary, especially if epochsize % batchsize (epochsize modulo batchsize) is unequal zero, then the last batch of an epoch has different size. I didn't found a suitable approach in stackoverflow, especially e. g.
Tensor indexing in custom loss function and Tensorflow custom loss function in Keras - loop over tensor and Looping over a tensor because obviously the shape of any tensor can't be inferred when building the graph which is the case for a loss function - shape inference is only possible when evaluating given the data, which is only possible given the graph. Hence I need to tell the custom loss function to do something with particular elements along a certain dimension without knowing the length of the dimension.
(this is the same in all examples)
from keras.models import Sequential
from keras.layers import Dense, Activation
# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
example 1: nothing special without issue, no custom loss
model.compile(optimizer='rmsprop',
loss='binary_crossentropy',
metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
(Output omitted, this runs perfectily fine)
example 2: nothing special, with a fairly simple custom loss
def custom_loss(yTrue, yPred):
loss = np.abs(yTrue-yPred)
return loss
model.compile(optimizer='rmsprop',
loss=custom_loss,
metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
(Output omitted, this runs perfectily fine)
example 3: the issue
def custom_loss(yTrue, yPred):
print(yPred) # Output: Tensor("dense_2/Sigmoid:0", shape=(?, 1), dtype=float32)
n = yPred.shape[0]
for i in range(n): # TypeError: __index__ returned non-int (type NoneType)
loss = np.abs(yTrue[i]-yPred[int(i/2)])
return loss
model.compile(optimizer='rmsprop',
loss=custom_loss,
metrics=['accuracy'])
# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)
Of course the tensor has not shape info yet which can't be inferred when building the graph, only at training time. Hence for i in range(n) rises an error. Is there any way to perform this?
The traceback of the output:
-------
BTW here's my true custom loss function in case of any questions. I skipped it above for clarity and simplicity.
def neg_log_likelihood(yTrue,yPred):
yStatus = yTrue[:,0]
yTime = yTrue[:,1]
n = yTrue.shape[0]
for i in range(n):
s1 = K.greater_equal(yTime, yTime[i])
s2 = K.exp(yPred[s1])
s3 = K.sum(s2)
logsum = K.log(y3)
loss = K.sum(yStatus[i] * yPred[i] - logsum)
return loss
Here's an image of the partial negative log-likelihood of the cox proportional harzards model.
This is to clarify a question in the comments to avoid confusion. I don't think it is necessary to understand this in detail to answer the question.
As usual, don't loop. There are severe performance drawbacks and also bugs. Use only backend functions unless totally unavoidable (usually it's not unavoidable)
Solution for example 3:
So, there is a very weird thing there...
Do you really want to simply ignore half of your model's predictions? (Example 3)
Assuming this is true, just duplicate your tensor in the last dimension, flatten and discard half of it. You have the exact effect you want.
def custom_loss(true, pred):
n = K.shape(pred)[0:1]
pred = K.concatenate([pred]*2, axis=-1) #duplicate in the last axis
pred = K.flatten(pred) #flatten
pred = K.slice(pred, #take only half (= n samples)
K.constant([0], dtype="int32"),
n)
return K.abs(true - pred)
Solution for your loss function:
If you have sorted times from greater to lower, just do a cumulative sum.
Warning: If you have one time per sample, you cannot train with mini-batches!!!
batch_size = len(labels)
It makes sense to have time in an additional dimension (many times per sample), as is done in recurrent and 1D conv netoworks. Anyway, considering your example as expressed, that is shape (samples_equal_times,) for yTime:
def neg_log_likelihood(yTrue,yPred):
yStatus = yTrue[:,0]
yTime = yTrue[:,1]
n = K.shape(yTrue)[0]
#sort the times and everything else from greater to lower:
#obs, you can have the data sorted already and avoid doing it here for performance
#important, yTime will be sorted in the last dimension, make sure its (None,) in this case
# or that it's (None, time_length) in the case of many times per sample
sortedTime, sortedIndices = tf.math.top_k(yTime, n, True)
sortedStatus = K.gather(yStatus, sortedIndices)
sortedPreds = K.gather(yPred, sortedIndices)
#do the calculations
exp = K.exp(sortedPreds)
sums = K.cumsum(exp) #this will have the sum for j >= i in the loop
logsums = K.log(sums)
return K.sum(sortedStatus * sortedPreds - logsums)
I am trying to learn deep learning, I have stumbled on one exercise here
It is first warm-up exercise. I am stuck. For constant sequence of small lengths(2,3) it solves it no problem. However when I try whole sequence of 50. it stops at 50% accuracy, which is basically random guess.
According to here it is too big flat space ant cant find gradient to solve it. So i tried approach of continuously increasing length ans saving model each time (2,5,10,15,20,30,40,50).It seems it does not generalise well, as if i type bigger sequence then what I learned it on, it fails.
According to here it should be easy problem. I cant figure it out. There is used some different LSTM architecture hoverer.
And one solution here to exactly same problem says it works with Adagrad optimizer and learning rate of 0.5.
I am unsure about one bit at time, if I am feeding it right in first place. I hope I got it right.
And for variable length, i tried and failed miserably.
Code:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM
from keras.optimizers import SGD, Adagrad, Adadelta
from keras.callbacks import TensorBoard
from keras.models import load_model
import numpy as np
import time
import os.path
# building the model
def build_model():
model = Sequential()
model.add(LSTM(
32,
input_shape=(None, 1),
return_sequences=False))
model.add(Dropout(0.1))
model.add(Dense(
1))
model.add(Activation("sigmoid"))
return model
# generating random data
def generate_data(num_points, seq_length):
#seq_rand = np.random.randint(1,12)
x = np.random.randint(2, size=(num_points, seq_length, 1))
y = x.sum(axis=1) % 2
return x, y
X, y = generate_data(100000, 50)
X_test, y_test = generate_data(1000, 50)
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=0,
write_graph=True, write_images=False)
if os.path.isfile('model.h5'):
model = load_model('model.h5')
else:
model = build_model()
opti = Adagrad(lr=0.5)
model.compile(loss="mse", optimizer=opti, metrics=['binary_accuracy'])
model.fit(
X, y,
batch_size=10, callbacks=[tensorboard],
epochs=5)
score = model.evaluate(X_test, y_test,
batch_size=1,
verbose=1)
print('Test score:', score)
print('Model saved')
model.save('model.h5')
I am so confused now. Thanks for any response!
Edit: Fixed return_sequences to False typo from previous experiments.
Well, this might be a really valuable exercise about LSTM and vanishing gradient. So let's dive into it. I'd start from changing task a little bit. Let's change our dataset to:
def generate_data(num_points, seq_length):
#seq_rand = np.random.randint(1,12)
x = np.random.randint(2, size=(num_points, seq_length, 1))
y = x.cumsum(axis=1) % 2
return x, y
and model by setting return_sequences=True, changing the loss to binary_crossentropy and epochs=10. So well - if we solve this task perfectly - then we'd also solve the initial task. Well - in 10 out of 10 runs of the setup I provided I observed the following behavior - for first few epochs model saturated around 50% of accuracy - and then suddenly dropped to 99% of accuracy.
Why have this happened?
Well - in LSTM a sweet spot for parameters is a synchrony between memory cells dynamics and normal activation dynamics. Very often one should wait a lot of time in order to get such behavior. Moreover - the architecture needs to be sufficient in order to catch valuable dependencies. In a changed behavior - we are providing much more insights to a network thanks to which it could be trained faster. Still - it takes some time to find the sweet spot.
Why your network failed?
Vanishing gradient problem and problem complexity - it's completely not obvious what information network should extract if it gets only a single signal at the end of the sequence of computations. This is why it needs either supervision in the form which I provided (cumsum) or a lot of time and luck in order to finally find a sweet stop on its own.
I am trying to perform the usual classification on the MNIST database but with randomly cropped digits.
Images are cropped the following way : removed randomly first/last and/or row/column.
I would like to use a Convolutional Neural Network using Keras (and Tensorflow backend) to perform convolution and then the usual classification.
Inputs are of variable size and i can't manage to get it to work.
Here is how I cropped digits
import numpy as np
from keras.utils import to_categorical
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.images
X = np.expand_dims(X, axis=3)
X_crop = list()
for index in range(len(X)):
X_crop.append(X[index, np.random.randint(0,2):np.random.randint(7,9), np.random.randint(0,2):np.random.randint(7,9), :])
X_crop = np.array(X_crop)
y = to_categorical(digits.target)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_crop, y, train_size=0.8, test_size=0.2)
And here is the architecture of the model I want to use
from keras.layers import Dense, Dropout
from keras.layers.convolutional import Conv2D
from keras.models import Sequential
model = Sequential()
model.add(Conv2D(filters=10,
kernel_size=(3,3),
input_shape=(None, None, 1),
data_format='channels_last'))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.summary()
model.fit(X_train, y_train, epochs=100, batch_size=16, validation_data=(X_test, y_test))
Does someone have an idea on how to handle variable sized input in my neural network?
And how to perform classification?
TL/DR - go to point 4
So - before we get to the point - let's fix some problems with your network:
Your network will not work because of activation: with categorical_crossentropy you need to have a softmax activation:
model.add(Dense(10, activation='softmax'))
Vectorize spatial tensors: as Daniel mentioned - you need to, at some stage, switch your vectors from spatial (images) to vectorized (vectors). Currently - applying Dense to output from a Conv2D is equivalent to (1, 1) convolution. So basically - output from your network is spatial - not vectorized what causes dimensionality mismatch (you can check that by running your network or checking the model.summary(). In order to change that you need to use either GlobalMaxPooling2D or GlobalAveragePooling2D. E.g.:
model.add(Conv2D(filters=10,
kernel_size=(3, 3),
input_shape=(None, None, 1),
padding="same",
data_format='channels_last'))
model.add(GlobalMaxPooling2D())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
Concatenated numpy arrays need to have the same shape: if you check the shape of X_crop you'll see that it's not a spatial matrix. It's because you concatenated matrices with different shapes. Sadly - it's impossible to overcome this issue as numpy.array need to have a fixed shape.
How to make your network train on examples of different shape: The most important thing in doing this is to understand two things. First - is that in a single batch every image should have the same size. Second - is that calling fit multiple times is a bad idea - as you reset inner model states. So here is what needs to be done:
a. Write a function which crops a single batch - e.g. a get_cropped_batches_generator which given a matrix cuts a batch out of it and crops it randomly.
b. Use train_on_batch method. Here is an example code:
from six import next
batches_generator = get_cropped_batches_generator(X, batch_size=16)
losses = list()
for epoch_nb in range(nb_of_epochs):
epoch_losses = list()
for batch_nb in range(nb_of_batches):
# cropped_x has a different shape for different batches (in general)
cropped_x, cropped_y = next(batches_generator)
current_loss = model.train_on_batch(cropped_x, cropped_y)
epoch_losses.append(current_loss)
losses.append(epoch_losses.sum() / (1.0 * len(epoch_losses))
final_loss = losses.sum() / (1.0 * len(losses))
So - a few comments to code above: First, train_on_batch doesn't use nice keras progress bar. It returns a single loss value (for a given batch) - that's why I added logic to compute loss. You could use Progbar callback for that also. Second - you need to implement get_cropped_batches_generator - I haven't written a code to keep my answer a little bit more clear. You could ask another question on how to implement it. Last thing - I use six to keep compatibility between Python 2 and Python 3.
Usually, a model containing Dense layers cannot have variable size inputs, unless the outputs are also variable. But see the workaround and also the other answer using GlobalMaxPooling2D - The workaround is equivalent to GlobalAveragePooling2D. These are layers that can eliminiate the variable size before a Dense layer and suppress the spatial dimensions.
For an image classification case, you may want to resize the images outside the model.
When my images are in numpy format, I resize them like this:
from PIL import Image
im = Image.fromarray(imgNumpy)
im = im.resize(newSize,Image.LANCZOS) #you can use options other than LANCZOS as well
imgNumpy = np.asarray(im)
Why?
A convolutional layer has its weights as filters. There is a static filter size, and the same filter is applied to the image over and over.
But a dense layer has its weights based on the input. If there is 1 input, there is a set of weights. If there are 2 inputs, you've got twice as much weights. But weights must be trained, and changing the amount of weights will definitely change the result of the model.
As #Marcin commented, what I've said is true when your input shape for Dense layers has two dimensions: (batchSize,inputFeatures).
But actually keras dense layers can accept inputs with more dimensions. These additional dimensions (which come out of the convolutional layers) can vary in size. But this would make the output of these dense layers also variable in size.
Nonetheless, at the end you will need a fixed size for classification: 10 classes and that's it. For reducing the dimensions, people often use Flatten layers, and the error will appear here.
A possible fishy workaround (not tested):
At the end of the convolutional part of the model, use a lambda layer to condense all the values in a fixed size tensor, probably taking a mean of the side dimensions and keeping the channels (channels are not variable)
Suppose the last convolutional layer is:
model.add(Conv2D(filters,kernel_size,...))
#so its output shape is (None,None,None,filters) = (batchSize,side1,side2,filters)
Let's add a lambda layer to condense the spatial dimensions and keep only the filters dimension:
import keras.backend as K
def collapseSides(x):
axis=1 #if you're using the channels_last format (default)
axis=-1 #if you're using the channels_first format
#x has shape (batchSize, side1, side2, filters)
step1 = K.mean(x,axis=axis) #mean of side1
return K.mean(step1,axis=axis) #mean of side2
#this will result in a tensor shape of (batchSize,filters)
Since the amount of filters is fixed (you have kicked out the None dimensions), the dense layers should probably work:
model.add(Lambda(collapseSides,output_shape=(filters,)))
model.add(Dense.......)
.....
In order for this to possibly work, I suggest that the number of filters in the last convolutional layer be at least 10.
With this, you can make input_shape=(None,None,1)
If you're doing this, remember that you can only pass input data with a fixed size per batch. So you have to separate your entire data in smaller batches, each batch having images all of the same size. See here: Keras misinterprets training data shape