What is the problem with this SGD loss graph?

I've been trying to train an audio classification model. When I use SGD with learning_rate=0.01, momentum=0.0 and nesterov=False, I get the following loss and accuracy graphs (images not shown):
I can't figure out what causes the sudden decrease in loss at around epoch 750. I tried different learning rates, momentum values and their combinations, different batch sizes, initial layer weights, etc. to get a more reasonable curve, but no luck at all. If you have any idea what causes this, please let me know.
The code I used for this training is below:
# MFCCs Model
x = tf.keras.layers.Dense(units=512, activation="sigmoid")(mfcc_inputs)
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(units=256, activation="sigmoid")(x)
x = tf.keras.layers.Dropout(0.5)(x)
# Spectrograms Model
y = tf.keras.layers.Conv2D(32, kernel_size=(3,3), strides=(2,2))(spec_inputs)
y = tf.keras.layers.AveragePooling2D(pool_size=(2,2), strides=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("sigmoid")(y)
y = tf.keras.layers.Conv2D(64, kernel_size=(3,3), strides=(1,1), padding="same")(y)
y = tf.keras.layers.AveragePooling2D(pool_size=(2,2), strides=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("sigmoid")(y)
y = tf.keras.layers.Conv2D(64, kernel_size=(3,3), strides=(1,1), padding="same")(y)
y = tf.keras.layers.AveragePooling2D(pool_size=(2,2), strides=(2,2))(y)
y = tf.keras.layers.BatchNormalization()(y)
y = tf.keras.layers.Activation("sigmoid")(y)
y = tf.keras.layers.Flatten()(y)
y = tf.keras.layers.Dense(units=256, activation="sigmoid")(y)
y = tf.keras.layers.Dropout(0.5)(y)
# Chroma Model
t = tf.keras.layers.Dense(units=512, activation="sigmoid")(chroma_inputs)
t = tf.keras.layers.Dropout(0.5)(t)
t = tf.keras.layers.Dense(units=256, activation="sigmoid")(t)
t = tf.keras.layers.Dropout(0.5)(t)
# Merge Models
concated = tf.keras.layers.concatenate([x, y, t])
# Dense and Output Layers
z = tf.keras.layers.Dense(64, activation="sigmoid")(concated)
z = tf.keras.layers.Dropout(0.5)(z)
z = tf.keras.layers.Dense(64, activation="sigmoid")(z)
z = tf.keras.layers.Dropout(0.5)(z)
z = tf.keras.layers.Dense(1, activation="sigmoid")(z)
mdl = tf.keras.Model(inputs=[mfcc_inputs, spec_inputs, chroma_inputs], outputs=z)
mdl.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False), loss="binary_crossentropy", metrics=["accuracy"])
mdl.fit([M_train, X_train, C_train], y_train, batch_size=8, epochs=1000, validation_data=([M_val, X_val, C_val], y_val), callbacks=[tensorboard_cb])

I'm not too sure myself, but as Frightera said, sigmoid activations in hidden layers can cause trouble: they are more sensitive to weight initialization, and if the weights aren't set well, the gradients can become very small. Perhaps the model eventually works through the small sigmoid gradients and the loss finally decreases around epoch 750, but that's just my hypothesis. If ReLU doesn't work, try LeakyReLU, since it doesn't suffer from the dead-neuron effect that ReLU does.
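To make that concrete, here is a minimal sketch of the MFCC branch with LeakyReLU in place of the hidden sigmoids; the alpha value is an assumption, and the final sigmoid output stays because the loss is binary_crossentropy:

x = tf.keras.layers.Dense(units=512)(mfcc_inputs)
x = tf.keras.layers.LeakyReLU(alpha=0.1)(x)  # assumed slope; avoids dead neurons
x = tf.keras.layers.Dropout(0.5)(x)
x = tf.keras.layers.Dense(units=256)(x)
x = tf.keras.layers.LeakyReLU(alpha=0.1)(x)
x = tf.keras.layers.Dropout(0.5)(x)
# apply the same pattern to the spectrogram and chroma branches;
# keep the final Dense(1, activation="sigmoid") output unchanged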

Related

Why loss is NaN

I am learning about DNNs and transformers and I have the following problem. Running this transformer network on the IMDB dataset works just fine. But when I try to run it on another dataset that has 4 classes instead of 2, the loss goes to NaN and the accuracy quickly drops to 0. I have been struggling with this for two days, so any help would be greatly appreciated.
I've tried clipping gradients, increasing the batch size, increasing dropout, and reducing the learning rate.
embed_dim = 32 # Embedding size for each token
num_heads = 2 # Number of attention heads
ff_dim = 32 # Hidden layer size in feed forward network inside transformer
inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
# x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(4, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
opt = tf.keras.optimizers.Adam(learning_rate=3e-04,clipvalue=0.5)
model.compile(optimizer = opt, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    x_train, y_train, batch_size=128, epochs=10, validation_data=(x_val, y_val)
)
When I was running on GPU the error did not appear, which hid the cause: my labels were [1, 2, 4, 5], but with 4 output units and sparse_categorical_crossentropy the labels must be in the range 0 to 3.
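For reference, a minimal sketch of remapping arbitrary labels to the contiguous range that sparse_categorical_crossentropy expects; the array names are assumptions:

import numpy as np

raw_labels = np.array([1, 2, 4, 5, 2, 1])               # labels as found in the dataset
classes = np.unique(raw_labels)                          # [1, 2, 4, 5]
label_map = {c: i for i, c in enumerate(classes)}        # {1: 0, 2: 1, 4: 2, 5: 3}
y_train = np.array([label_map[c] for c in raw_labels])   # values now in 0..3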

How to calculate different loss for different input in keras model

My model has two inputs, and I want to calculate the loss of the two inputs separately because the loss of input 2 has to be multiplied by a weight. Then I add up these two losses as the final loss for the model. The structure is something like this (diagram omitted):
This is my model:
def final_loss(y_true, y_pred):
    loss = x_loss_value.output + y_model.output*weight
    return loss

def mymodel(input_shape):  # pooling=max or avg
    img_input1 = Input(shape=(input_shape[0], input_shape[1], input_shape[2], ))
    img_input2 = Input(shape=(input_shape[0], input_shape[1], input_shape[2], ))
    # for input1
    x = Conv2D(32, (3, 3), strides=(2, 2))(img_input1)
    x_dense = Dense(2, activation='softmax', name='predictions')(x)
    x_loss_value = my_categorical_crossentropy_layer(x)[input1_y_true, input1_y_pred]
    x_model = Model(inputs=img_input1, outputs=x_loss_value)
    # for input2
    y = Conv2D(32, (3, 3), strides=(2, 2))(img_input2)
    y_dense = Dense(2, activation='softmax', name='predictions')(y)
    y_loss_value = my_categorical_crossentropy_layer(y)[input2_y_true, input2_y_pred]
    y_model = Model(inputs=img_input2, outputs=y_loss_value)
    concat = concatenate([x_model.output, y_model.output])
    final_dense = Dense(2, activation='softmax')(concat)
    # Create model.
    model = Model(inputs=[img_input1, img_input2], outputs=final_dense)
    return model

model.compile(optimizer=optimizers.Adam(lr=1e-7), loss=final_loss, metrics=['accuracy'])
Most of the related solutions I found just customize the final loss and pass it via model.compile(loss=custom_loss).
However, I need to apply different losses to different inputs. I'm trying to use a customized layer like this to obtain the loss values for the final loss calculation:
class my_categorical_crossentropy_layer1(Layer):

    def __init__(self, **kwargs):
        self.is_placeholder = True
        super(my_categorical_crossentropy_layer1, self).__init__(**kwargs)

    def my_categorical_crossentropy_loss(self, y_true, y_pred):
        y_pred = K.constant(y_pred) if not K.is_tensor(y_pred) else y_pred
        y_true = K.cast(y_true, y_pred.dtype)
        return K.categorical_crossentropy(y_true, y_pred, from_logits=from_logits)

    def call(self, y_true, y_pred):
        loss = self.my_categorical_crossentropy_loss(y_true, y_pred)
        self.add_loss(loss, inputs=(y_true, y_pred))
        return loss
But inside the Keras model I can't figure out how to get the y_true and y_pred of the current epoch/batch for my loss layer,
so I can't add x = my_categorical_crossentropy_layer()[y_true, y_pred] to my model.
Is there any way to do this kind of variable calculation inside a Keras model?
Furthermore, can Keras access the previous epoch's training loss or validation loss during training?
I want to apply the previous epoch's training loss as the weight in the final loss.
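On that last point, Keras can expose the previous epoch's training loss through a callback. Below is a minimal sketch, assuming a non-trainable tf.Variable named loss_weight that a custom loss function would read; the names and update rule are assumptions, not code from the question:

import tensorflow as tf

loss_weight = tf.Variable(1.0, trainable=False)  # read inside a custom loss

class PrevEpochLossWeight(tf.keras.callbacks.Callback):
    # After each epoch, store that epoch's training loss so the next
    # epoch's loss computation can use it as a weight.
    def on_epoch_end(self, epoch, logs=None):
        if logs is not None and 'loss' in logs:
            loss_weight.assign(logs['loss'])

# usage: model.fit(..., callbacks=[PrevEpochLossWeight()])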
This is my proposal...
Yours is a double binary classification problem that you want to carry out using a single fit. The first thing to notice is that you need to take care of dimensionality: your input is 4D while your target is 2D one-hot encoded, so your network needs something to reduce dimensionality, for example a flatten or global pooling layer. After this, you can create a single model with two inputs and two outputs and train it with two losses. In your case the losses are weighted categorical_crossentropy. Keras lets you set the loss weights with the loss_weights parameter: to reproduce the formula loss1*1 + loss2*W, set the weights to [1, W]. You can also specify a different loss for each output, losses=[loss1, loss2, ...], which are linearly combined with the weights given in loss_weights.
Below is a working example:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, GlobalMaxPool2D, Dense
from tensorflow.keras.models import Model

input_shape = (28, 28, 3)
n_sample = 10

# create dummy data
X1 = np.random.uniform(0, 1, (n_sample,) + input_shape)  # 4d
X2 = np.random.uniform(0, 1, (n_sample,) + input_shape)  # 4d
y1 = tf.keras.utils.to_categorical(np.random.randint(0, 2, n_sample))  # 2d
y2 = tf.keras.utils.to_categorical(np.random.randint(0, 2, n_sample))  # 2d

def mymodel(input_shape, weight):
    img_input1 = Input(shape=(input_shape[0], input_shape[1], input_shape[2], ))
    img_input2 = Input(shape=(input_shape[0], input_shape[1], input_shape[2], ))
    # for input1
    x = Conv2D(32, (3, 3), strides=(2, 2))(img_input1)
    x = GlobalMaxPool2D()(x)  # pass from 4d to 2d
    x = Dense(2, activation='softmax', name='predictions1')(x)
    # for input2
    y = Conv2D(32, (3, 3), strides=(2, 2))(img_input2)
    y = GlobalMaxPool2D()(y)  # pass from 4d to 2d
    y = Dense(2, activation='softmax', name='predictions2')(y)
    # Create model
    model = Model([img_input1, img_input2], [x, y])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'], loss_weights=[1, weight])
    return model

weight = 0.3
model = mymodel(input_shape, weight)
model.summary()
model.fit([X1, X2], [y1, y2], epochs=2)
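As a variant of the same idea, the losses and weights can also be passed as dicts keyed by the output layer names defined above, again combining them as loss1*1 + loss2*weight; this compile call is a sketch, not part of the original answer:

model.compile(optimizer='adam',
              loss={'predictions1': 'categorical_crossentropy',
                    'predictions2': 'categorical_crossentropy'},
              loss_weights={'predictions1': 1.0, 'predictions2': weight},
              metrics=['accuracy'])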

CNN predicting the same class for all input data

I am trying to recreate a CNN in Keras to classify point cloud data. The CNN is described in this paper.
(Figure: network design, image omitted)
This is my current implementation:
inputs = Input(shape=(None, 3))
x = Conv1D(filters=64, kernel_size=1, activation='relu')(inputs)
x = BatchNormalization()(x)
x = Conv1D(filters=64, kernel_size=1, activation='relu')(x)
x = BatchNormalization()(x)
y = Conv1D(filters=64, kernel_size=1, activation='relu')(x)
y = BatchNormalization()(y)
y = Conv1D(filters=128, kernel_size=1, activation='relu')(y)
y = BatchNormalization()(y)
y = Conv1D(filters=2048, kernel_size=1, activation='relu')(y)
y = MaxPooling1D(1)(y)
z = keras.layers.concatenate([x, y], axis=2)
z = Conv1D(filters=512, kernel_size=1, activation='relu')(z)
z = BatchNormalization()(z)
z = Conv1D(filters=512, kernel_size=1, activation='relu')(z)
z = BatchNormalization()(z)
z = Conv1D(filters=512, kernel_size=1, activation='relu')(z)
z = BatchNormalization()(z)
z = Dense(9, activation='softmax')(z)
model = Model(inputs=inputs, outputs=z)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
The problem is that the network predicts the same class for all input data. This may be caused by a mistake in my implementation of the network, by overfitting, or by insufficient training data. Can someone spot a mistake in my implementation?
Yousefhussien, M., Kelbe, D. J., Ientilucci, E. J., & Salvaggio, C. (2017). A Fully Convolutional Network for Semantic Labeling of 3D Point Clouds. arXiv preprint arXiv:1710.01408.
The same output class typically indicates a network that has just been initialized, meaning that the trained weights were not loaded. Did this same-class behavior also happen during training? Another reason could be bad pre-processing. One more thing I noticed: the paper states a "1D-fully convolutional network", but your last layer is a Dense layer, while in the paper it is convolutional.
I believe that the mistake is not in the implementation. Most probably the problem is that you have an insufficient amount of data. Also, if a network predicts the same class for all inputs, it usually means that you lack regularization. Try adding some Dropout layers with a rate of 0.2 to 0.5 and see if the results improve.
Also, I don't think that
x = Conv1D(filters=64, kernel_size=1, activation='relu')(inputs)
x = BatchNormalization()(x)
is the same as
x = Conv1D(filters=64, kernel_size=1)(inputs)
x = BatchNormalization()(x)
x = ReLU()(x)
and I think you need the latter.
Another thing for you to try is LeakyReLU, as it usually gives better results than plain ReLU.
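For example, a minimal sketch of the suggested convolution -> batch norm -> activation ordering with LeakyReLU; the alpha value is an assumption:

x = Conv1D(filters=64, kernel_size=1)(inputs)
x = BatchNormalization()(x)
x = LeakyReLU(alpha=0.1)(x)  # assumed slope for the negative part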
The network is fixed and provides the expected predictions now. Thanks for the help!
Based on the answers I changed the following things:
The order of the activation and the batch normalization.
The last layer from a dense to a convolutional layer.
I also added the training=True parameter to the batch normalization layers.
The code of the corrected implementation:
inputs = Input(shape=(None, 3))
x = Conv1D(filters=64, kernel_size=1)(inputs)
x = BatchNormalization()(x, training=True)
x = Activation('relu')(x)
x = Conv1D(filters=64, kernel_size=1, use_bias=False)(x)
x = BatchNormalization()(x, training=True)
x = Activation('relu')(x)
y = Conv1D(filters=64, kernel_size=1)(x)
y = BatchNormalization()(y, training=True)
y = Activation('relu')(y)
y = Conv1D(filters=128, kernel_size=1)(y)
y = BatchNormalization()(y, training=True)
y = Activation('relu')(y)
y = Conv1D(filters=2048, kernel_size=1)(y)
y = BatchNormalization()(y, training=True)
y = Activation('relu')(y)
y = MaxPooling1D(1)(y)
z = keras.layers.concatenate([x, y], axis=2)
z = Conv1D(filters=512, kernel_size=1)(z)
z = BatchNormalization()(z, training=True)
z = Activation('relu')(z)
z = Conv1D(filters=512, kernel_size=1)(z)
z = BatchNormalization()(z, training=True)
z = Activation('relu')(z)
z = Conv1D(filters=512, kernel_size=1)(z)
z = BatchNormalization()(z, training=True)
z = Activation('relu')(z)
z = Conv1D(filters=2, kernel_size=1, activation='softmax')(z)
model = Model(inputs=inputs, outputs=z)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Neural network sine approximation

After spending days failing to use a neural network for Q-learning, I decided to go back to basics and do a simple function approximation to check that everything was working correctly and to see how some parameters affect the learning process.
Here is the code that I came up with:
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import random
import numpy
from sklearn.preprocessing import MinMaxScaler
regressor = Sequential()
regressor.add(Dense(units=20, activation='sigmoid', kernel_initializer='uniform', input_dim=1))
regressor.add(Dense(units=20, activation='sigmoid', kernel_initializer='uniform'))
regressor.add(Dense(units=20, activation='sigmoid', kernel_initializer='uniform'))
regressor.add(Dense(units=1))
regressor.compile(loss='mean_squared_error', optimizer='sgd')
#regressor = ExtraTreesRegressor()
N = 5000
X = numpy.empty((N,))
Y = numpy.empty((N,))

for i in range(N):
    X[i] = random.uniform(-10, 10)
X = numpy.sort(X).reshape(-1, 1)

for i in range(N):
    Y[i] = numpy.sin(X[i])
Y = Y.reshape(-1, 1)

X_scaler = MinMaxScaler()
Y_scaler = MinMaxScaler()
X = X_scaler.fit_transform(X)
Y = Y_scaler.fit_transform(Y)

regressor.fit(X, Y, epochs=2, verbose=1, batch_size=32)
#regressor.fit(X, Y.reshape(5000,))

x = numpy.mgrid[-10:10:100*1j]
x = x.reshape(-1, 1)
y = numpy.mgrid[-10:10:100*1j]
y = y.reshape(-1, 1)
x = X_scaler.fit_transform(x)

for i in range(len(x)):
    y[i] = regressor.predict(numpy.array([x[i]]))

plt.figure()
plt.plot(X_scaler.inverse_transform(x), Y_scaler.inverse_transform(y))
plt.plot(X_scaler.inverse_transform(X), Y_scaler.inverse_transform(Y))
The problem is that all my predictions are around 0 in value. As you can see, I used an ExtraTreesRegressor from sklearn (commented lines) to check that the protocol itself is correct. So what is wrong with my neural network? Why is it not working?
(The actual problem I'm trying to solve is computing the Q function for the mountain car problem with a neural network. How is that different from this function approximator?)
With these changes:
Activations to relu
Remove kernel_initializer (i.e. leave the default 'glorot_uniform')
Adam optimizer
100 epochs
i.e.
regressor = Sequential()
regressor.add(Dense(units=20, activation='relu', input_dim=1))
regressor.add(Dense(units=20, activation='relu'))
regressor.add(Dense(units=20, activation='relu'))
regressor.add(Dense(units=1))
regressor.compile(loss='mean_squared_error', optimizer='adam')
regressor.fit(X, Y, epochs=100, verbose=1, batch_size=32)
and the rest of your code unchanged, here is the result (plot omitted):
Tinker, again and again...
A more concise version of your code that works:
def data_gen():
    while True:
        x = (np.random.random([1024]) - 0.5) * 10
        y = np.sin(x)
        yield (x, y)
regressor = Sequential()
regressor.add(Dense(units=20, activation='tanh', input_dim=1))
regressor.add(Dense(units=20, activation='tanh'))
regressor.add(Dense(units=20, activation='tanh'))
regressor.add(Dense(units=1, activation='linear'))
regressor.compile(loss='mse', optimizer='adam')
regressor.fit_generator(data_gen(), epochs=3, steps_per_epoch=128)
x = (np.random.random([1024])-0.5)*10
x = np.sort(x)
y = np.sin(x)
plt.plot(x, y)
plt.plot(x, regressor.predict(x))
plt.show()
Changes made: replacing the lower-layer activations with hyperbolic tangents, replacing the static dataset with a random generator, and replacing sgd with adam. That said, there are still problems with other parts of your code that I haven't been able to locate yet (most likely your scaler and random process).
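On the scaler point, one likely issue is re-fitting the scaler on the evaluation grid. A minimal sketch of the fix, using the question's variable names:

# transform the evaluation grid with the scaler already fitted on the
# training data, instead of re-fitting it on the new values
x = X_scaler.transform(x)  # was: x = X_scaler.fit_transform(x)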
I managed to get a good approximation by changing the architecture and the training as in the following code. It's a bit of overkill, but at least I know where the problem was coming from.
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import random
import numpy
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import ExtraTreesRegressor
from keras import optimizers
regressor = Sequential()
regressor.add(Dense(units=500, activation='sigmoid', kernel_initializer='uniform', input_dim=1))
regressor.add(Dense(units=500, activation='sigmoid', kernel_initializer='uniform'))
regressor.add(Dense(units=1, activation='sigmoid'))
regressor.compile(loss='mean_squared_error', optimizer='adam')
#regressor = ExtraTreesRegressor()
N = 5000
X = numpy.empty((N,))
Y = numpy.empty((N,))

for i in range(N):
    X[i] = random.uniform(-10, 10)
X = numpy.sort(X).reshape(-1, 1)

for i in range(N):
    Y[i] = numpy.sin(X[i])
Y = Y.reshape(-1, 1)

X_scaler = MinMaxScaler()
Y_scaler = MinMaxScaler()
X = X_scaler.fit_transform(X)
Y = Y_scaler.fit_transform(Y)

regressor.fit(X, Y, epochs=50, verbose=1, batch_size=2)
#regressor.fit(X, Y.reshape(5000,))

x = numpy.mgrid[-10:10:100*1j]
x = x.reshape(-1, 1)
y = numpy.mgrid[-10:10:100*1j]
y = y.reshape(-1, 1)
x = X_scaler.fit_transform(x)

for i in range(len(x)):
    y[i] = regressor.predict(numpy.array([x[i]]))

plt.figure()
plt.plot(X_scaler.inverse_transform(x), Y_scaler.inverse_transform(y))
plt.plot(X_scaler.inverse_transform(X), Y_scaler.inverse_transform(Y))
However, I'm still baffled: I've found papers saying that they used only two hidden layers of five neurons to approximate the Q function of the mountain car problem, trained their network for only a few minutes, and got good results. I will try changing the batch size in my original problem to see what results I can get, but I'm not very optimistic.

Sudden spike in validation loss

So I am doing binary image classification on a small data set containing 250 images in each class. I am using transfer learning with ResNet50 as the base network architecture, and on top of it I've added 2 hidden layers and one final output layer. After training for 20 epochs, what I saw is that the loss suddenly increases in an early epoch, and I am unable to understand the reason behind it.
Network architecture -
image_input = Input(shape=(224, 224, 3))
model = ResNet50(input_tensor=image_input,include_top=True, weights='imagenet')
last_layer = model.get_layer('avg_pool').output
x = Flatten(name='flatten')(last_layer)
x = Dense(1000, activation='relu', name='fc1000')(x)
x = Dropout(0.5)(x)
x = Dense(200, activation='relu', name='fc200')(x)
x = Dropout(0.5)(x)
out = Dense(num_classes, activation='softmax', name='output')(x)
custom_model = Model(image_input, out)
I am using binary_crossentropy and Adam with default parameters.
(Loss and accuracy plots omitted.)
With such a small amount of data per class, there is definitely a chance of overfitting. Try increasing your dataset size, and use data augmentation if possible.
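For example, a minimal augmentation sketch with Keras' ImageDataGenerator; the parameter values and the train_images/train_labels arrays are assumptions:

from keras.preprocessing.image import ImageDataGenerator

# assumed augmentation ranges for a small 2-class image dataset
datagen = ImageDataGenerator(rotation_range=20,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=0.1,
                             horizontal_flip=True)
# in older Keras versions, use fit_generator instead of fit
custom_model.fit(datagen.flow(train_images, train_labels, batch_size=32),
                 epochs=20)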
