To test a nonlinear Sequential model with Keras, I generated some random data x1, x2, x3 and set y = a + b*x1 + c*x2^2 + d*x3^3 + e (a, b, c, d, e are constants). The loss drops very quickly, but the model's predictions are far off from the actual values. I did the same thing with a linear model using similar code and it worked correctly, so maybe my Sequential model is designed wrong. Here is my code:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras import initializers
# y = a + b*x1 + c*x2^2 + d*x3^3 + e
def gen_sequential_model():
    model = Sequential([Input(3, name='input_layer'),
                        Dense(16, activation='relu', name='hidden_layer1', kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=0.05, seed=42)),
                        Dense(16, activation='relu', name='hidden_layer2', kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=0.05, seed=42)),
                        Dense(1, activation='relu', name='output_layer', kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=0.05, seed=42)),
                        ])
    model.summary()
    model.compile(optimizer='adam', loss='mse')
    return model
def gen_linear_regression_dataset(numofsamples=500, a=3, b=5, c=7, d=9, e=11):
    np.random.seed(42)
    X = np.random.rand(numofsamples, 3)
    # y = a + b*x1 + c*x2^2 + d*x3^3 + e
    for idx in range(numofsamples):
        X[idx][1] = X[idx][1]**2
        X[idx][2] = X[idx][2]**3
    coef = np.array([b, c, d])
    bias = e
    y = a + np.matmul(X, coef.transpose()) + bias
    return X, y
def plot_loss_curve(history):
    import matplotlib.pyplot as plt
    plt.figure(figsize=(15, 10))
    plt.plot(history.history['loss'][1:])
    plt.plot(history.history['val_loss'][1:])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper right')
    plt.show()
def predict_new_sample(model, x, a=3, b=5, c=7, d=9, e=11):
    x = x.reshape(1, 3)
    y_pred = model.predict(x)[0][0]
    y_actual = a + b*x[0][0] + c*(x[0][1]**2) + d*(x[0][2]**3) + e
    print("y actual value: ", y_actual)
    print("y pred value: ", y_pred)
model = gen_sequential_model()
X,y = gen_linear_regression_dataset(numofsamples=2000)
history = model.fit(X,y,epochs = 100, verbose=2, validation_split=0.3)
plot_loss_curve(history)
predict_new_sample(model, np.array([0.7,0.5,0.5]))
Result:
...
Epoch 99/100
44/44 - 0s - loss: 1.0631e-10 - val_loss: 9.9290e-11
Epoch 100/100
44/44 - 0s - loss: 1.0335e-10 - val_loss: 9.3616e-11
y actual value: 20.375
y pred value: 25.50001
Why is my predicted value so different from the real value?
Despite the improper use of activation = 'relu' in the last layer and the use of non-recommended kernel initializations, your model works fine, and the reported metrics are true and not flukes.
The problem is not in the model; the problem is that your data generating function does not return what you intend it to return.
First, in order to see that your model indeed learns what you have asked it to learn, let's run your code as is and then use your data generating function to produce a sample:
X, y_true = gen_linear_regression_dataset(numofsamples=1)
print(X)
print(y_true)
Result:
[[0.37454012 0.90385769 0.39221343]]
[25.72962531]
So for this particular X, the true output is 25.72962531; let's now pass this X to the model using your predict_new_sample function:
predict_new_sample(model, X)
# result:
y actual value: 22.134424269890232
y pred value: 25.729633
Well, the predicted output 25.729633 is extremely close to the true one calculated above (25.72962531); the thing is, your function thinks that the true output should be 22.134424269890232, which is demonstrably not the case.
What has happened is that your gen_linear_regression_dataset function returns the data X after the squared and cubic components have already been computed, which is not what you want; you want the returned X to contain the raw features, before the squares and cubes are applied, so that your model learns how to do this itself.
So, you need to change the function as follows:
def gen_linear_regression_dataset(numofsamples=500, a=3, b=5, c=7, d=9, e=11):
    np.random.seed(42)
    X_init = np.random.rand(numofsamples, 3)  # data to be returned
    # y = a + b*x1 + c*x2^2 + d*x3^3 + e
    X = X_init.copy()  # temporary data
    for idx in range(numofsamples):
        X[idx][1] = X[idx][1]**2
        X[idx][2] = X[idx][2]**3
    coef = np.array([b, c, d])
    bias = e
    y = a + np.matmul(X, coef.transpose()) + bias
    return X_init, y
After modifying the function and re-training the model (you'll notice that the validation error ends up somewhat higher, ~ 1.3), we have
X, y_true = gen_linear_regression_dataset(numofsamples=1)
print(X)
print(y_true)
Result:
[[0.37454012 0.95071431 0.73199394]]
[25.72962531]
and
predict_new_sample(model, X)
# result:
y actual value: 25.729625308532768
y pred value: 25.443237
which is consistent. You will still not be getting perfect predictions of course, especially for unseen data (and remember that the error is now higher):
predict_new_sample(model, np.array([0.07,0.6,0.5]))
# result:
y actual value: 17.995
y pred value: 19.69147
As commented briefly above, you should really change your model to get rid of the kernel initializers (i.e. use the default, recommended ones) and use the correct activation function for your last layer:
def gen_sequential_model():
    model = Sequential([Input(3, name='input_layer'),
                        Dense(16, activation='relu', name='hidden_layer1'),
                        Dense(16, activation='relu', name='hidden_layer2'),
                        Dense(1, activation='linear', name='output_layer'),
                        ])
    model.summary()
    model.compile(optimizer='adam', loss='mse')
    return model
You'll discover that you get a better validation error and better predictions:
predict_new_sample(model, np.array([0.07,0.6,0.5]))
# result:
y actual value: 17.995
y pred value: 18.272991
Nice catch from @desertnaut.
Just to add a few things on top of @desertnaut's solution that seem to improve the results:
Scale your data (even though the features are already in [0, 1], it seems to add a little boost; a minimal scaling sketch is shown after the results below)
Add Dropout between layers
Increase the number of epochs (150-200?)
Add ReduceLROnPlateau to reduce the learning rate on plateau (give it a try)
Add more units to the layers
import tensorflow as tf

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2,
                                                 patience=5, min_lr=0.001)

def gen_sequential_model():
    model = Sequential([
        Input(3, name='input_layer'),
        Dense(64, activation='relu', name='hidden_layer1'),
        Dropout(0.5),
        Dense(64, activation='relu', name='hidden_layer2'),
        Dropout(0.5),
        Dense(1, name='output_layer')
    ])
    model.compile(optimizer='adam', loss='mse')  # compile with the same optimizer/loss as before
    return model

model = gen_sequential_model()
history = model.fit(X, y, epochs=200, verbose=2, validation_split=0.2, callbacks=[reduce_lr])
predict_new_sample(model, x=np.array([0.07, 0.6, 0.5]))
y actual value: 17.995
y pred value: 17.710054
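For the first point (scaling), a minimal sketch assuming scikit-learn's MinMaxScaler (any consistently fitted scaler would do):
from sklearn.preprocessing import MinMaxScaler

# scale the features to [0, 1] before fitting
# (strictly speaking, the scaler should be fitted on the training split only)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

model = gen_sequential_model()
history = model.fit(X_scaled, y, epochs=200, verbose=2,
                    validation_split=0.2, callbacks=[reduce_lr])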
So my question is, if I have something like:
model = Model(inputs = input, outputs = [y1,y2])
model.compile(loss = my_loss ...)
I have only seen my_loss defined as a dictionary of independent losses, with the final loss then being their sum. But in a multi-task model, can I define a single loss function that takes all the predicted/true values, so that I can, for instance, multiply them?
This is the loss I am trying to define:
def my_loss(y_true1, y_true2, y_pred1, y_pred2):
    final_loss = binary_crossentropy(y_true1, y_pred1) + y_true1 * categorical_crossentropy(y_true2, y_pred2)
    return final_loss
Usually, the parameters of a loss function are y_true and y_pred, where y_pred is either y1 or y2. But now I need both to compute the loss, so how can I define the loss function and pass all four arguments to it: y_true1, y_true2, y_pred1, y_pred2?
My current model, whose loss I want to change:
x = Input(shape=(n, ))
shared = Dense(32)(x)
sub1 = Dense(16)(shared)
sub2 = Dense(16)(shared)
y1 = Dense(1, activation='sigmoid')(sub1)
y2 = Dense(4, activation='softmax')(sub2)
model = Model(inputs=x, outputs=[y1, y2])
model.compile(loss=['binary_crossentropy', 'categorical_crossentropy'] ...) # THIS IS THE LINE I WANT TO CHANGE
Thanks!
I'm not sure if I'm understanding correctly, but I'll try.
The loss function must contain both the predicted and the actual data -- it's a way to measure the error between what your model is predicting and the true data. However, the predicted and actual data do not need to be one-dimensional. You can make y_pred a tensor that contains both y_pred1 and y_pred2. Likewise, y_true can be a tensor that contains both y_true1 and y_true2.
As far as I know, loss functions should return a single number. That's why loss functions often have a mean or a sum to add up all of the losses for individual data points.
Here's an example of mean squared error that will work for more than 1D:
import keras.backend as K

def my_loss(y_true, y_pred):
    # this example is mean squared error
    # works even if y_pred and y_true are more than 1D
    return K.mean(K.square(y_pred - y_true))
Here's another example of a loss function that I think is closer to your question (although I cannot comment on whether or not it's a good loss function):
def my_loss(y_true, y_pred):
    # calculate mean(abs(y_pred1*y_pred2 - y_true1*y_true2))
    # this will work for 2D inputs of y_pred and y_true
    return K.mean(K.abs(K.prod(y_pred, axis=1) - K.prod(y_true, axis=1)))
Update:
You can concatenate two outputs into a single tensor with keras.layers.Concatenate. That way you can still have a loss function with only two arguments.
In the model you wrote above, the y1 output shape is (None, 1) and the y2 output shape is (None, 4). Here's an example of how you could write your model so that the output is a single tensor that concatenates y1 and y2 into a shape of (None, 5):
from keras import Model
from keras.layers import Input, Dense
from keras.layers import Concatenate
input_layer = Input(shape=(n, ))
shared = Dense(32)(input_layer)
sub1 = Dense(16)(shared)
sub2 = Dense(16)(shared)
y1 = Dense(1, activation='sigmoid')(sub1)
y2 = Dense(4, activation='softmax')(sub2)
mergedOutput = Concatenate()([y1, y2])
Below, I show an example for how you could rewrite your loss function. I wasn't sure which of the 5 columns of the output to call y_true1 vs. y_true2, so I guessed that y_true1 was column 1 and y_true2 was the remaining 4 columns. The same column structure would apply to y_pred1 and y_pred2.
from keras import losses
def my_loss(y_true, y_pred):
    final_loss = (losses.binary_crossentropy(y_true[:, 0], y_pred[:, 0]) +
                  y_true[:, 0] *
                  losses.categorical_crossentropy(y_true[:, 1:], y_pred[:, 1:]))
    return final_loss
Finally, you can compile the model without any major changes from normal:
model.compile(optimizer='adam', loss=my_loss)
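One detail the above glosses over, so treat this sketch as an assumption on my part: since the model now has a single (None, 5) output, the training targets must be concatenated in the same column order before calling fit, e.g.:
import numpy as np

# hypothetical training arrays, only to illustrate the required shapes
# (n is whatever feature count you used for Input(shape=(n,)) above)
num_samples = 100
X_train = np.random.rand(num_samples, n)
y1_labels = np.random.randint(0, 2, size=(num_samples, 1)).astype('float32')  # binary target
y2_labels = np.eye(4)[np.random.randint(0, 4, size=num_samples)]              # one-hot target, 4 classes

# column 0 = binary target, columns 1-4 = one-hot target, matching my_loss above
y_train = np.hstack([y1_labels, y2_labels])  # shape (num_samples, 5)

model = Model(inputs=input_layer, outputs=mergedOutput)
model.compile(optimizer='adam', loss=my_loss)
model.fit(X_train, y_train, epochs=10, batch_size=32)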
I am doing a slight modification of a standard neural network by defining a custom loss function. The custom loss function depends not only on y_true and y_pred, but also on the training data. I implemented it using the wrapping solution described here.
Specifically, I wanted to define a custom loss function that is the standard mse plus the mse between the input and the square of y_pred:
def custom_loss(x_true):
    def loss(y_true, y_pred):
        return K.mean(K.square(y_pred - y_true) + K.square(y_true - x_true))
    return loss
Then I compile the model using
model_custom.compile(loss = custom_loss( x_true=training_data ), optimizer='adam')
fit the model using
model_custom.fit(training_data, training_label, epochs=100, batch_size = training_data.shape[0])
All of the above works fine, because the batch size is actually the number of all the training samples.
But if I set a different batch_size (e.g., 10) when I have 1000 training samples, there will be an error
Incompatible shapes: [1000] vs. [10].
It seems that Keras automatically slices the inputs of its own loss functions to the batch size, but it cannot do so for the extra x_true argument of the custom loss function.
Do you know how to solve this issue?
Thank you!
==========================================================================
* Update: the batch size issue is solved, but another issue occurred
Thank you, Ori, for the suggestion of concatenating the input and output layers! It "worked", in the sense that the code now runs with any batch size. However, the result from training the new model seems to be wrong... Below is a simplified version of the code to demonstrate the problem:
import numpy as np
import scipy.io
import keras
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Dense, Activation
from numpy.random import seed
from tensorflow import set_random_seed
def custom_loss(y_true, y_pred): # this is essentially the mean_square_error
    mse = K.mean( K.square( y_pred[:,2] - y_true ) )
    return mse
# set the seeds so that we get the same initialization across different trials
seed_numpy = 0
seed_tensorflow = 0
# generate data of x = [ y^3 y^2 ]
y = np.random.rand(5000+1000,1) * 2 # generate 5000 training and 1000 testing samples
x = np.concatenate( ( np.power(y, 3) , np.power(y, 2) ) , axis=1 )
training_data = x[0:5000:1,:]
training_label = y[0:5000:1]
testing_data = x[5000:6000:1,:]
testing_label = y[5000:6000:1]
# build the standard neural network with one hidden layer
seed(seed_numpy)
set_random_seed(seed_tensorflow)
input_standard = Input(shape=(2,)) # input
hidden_standard = Dense(10, activation='relu', input_shape=(2,))(input_standard) # hidden layer
output_standard = Dense(1, activation='linear')(hidden_standard) # output layer
model_standard = Model(inputs=[input_standard], outputs=[output_standard]) # build the model
model_standard.compile(loss='mean_squared_error', optimizer='adam') # compile the model
model_standard.fit(training_data, training_label, epochs=50, batch_size = 500) # train the model
testing_label_pred_standard = model_standard.predict(testing_data) # make prediction
# get the mean squared error
mse_standard = np.sum( np.power( testing_label_pred_standard - testing_label , 2 ) ) / 1000
# build the neural network with the custom loss
seed(seed_numpy)
set_random_seed(seed_tensorflow)
input_custom = Input(shape=(2,)) # input
hidden_custom = Dense(10, activation='relu', input_shape=(2,))(input_custom) # hidden layer
output_custom_temp = Dense(1, activation='linear')(hidden_custom) # output layer
output_custom = keras.layers.concatenate([input_custom, output_custom_temp])
model_custom = Model(inputs=[input_custom], outputs=[output_custom]) # build the model
model_custom.compile(loss = custom_loss, optimizer='adam') # compile the model
model_custom.fit(training_data, training_label, epochs=50, batch_size = 500) # train the model
testing_label_pred_custom = model_custom.predict(testing_data) # make prediction
# get the mean squared error
mse_custom = np.sum( np.power( testing_label_pred_custom[:,2:3:1] - testing_label , 2 ) ) / 1000
# compare the result
print( [ mse_standard , mse_custom ] )
Basically, I have a standard one-hidden-layer neural network, and a custom one-hidden-layer neural network whose output layer is concatenated with the input layer. For testing purposes, I did not use the concatenated input layer in the custom loss function, because I wanted to see whether the custom network can reproduce the standard neural network. Since the custom loss function is equivalent to the standard 'mean_squared_error' loss, both networks should have the same training results (I also reset the random seeds to make sure that they have the same initialization).
However, the training results are very different. It seems that the concatenation makes the training process different? Any ideas?
Thank you again for all your help!
Final update: Ori's approach of concatenating input and output layers works, and is verified by using the generator. Thanks!!
The problem is that when compiling the model, you set x_true to be a static tensor with the size of all the samples, while the inputs to Keras loss functions, y_true and y_pred, are each of size [batch_size, :].
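To see the mismatch concretely, here is a minimal standalone sketch (run with eager TensorFlow 2.x just to illustrate; the shapes are chosen to mirror the error above):
import numpy as np
import tensorflow as tf

x_true = tf.constant(np.random.rand(1000, 1), dtype=tf.float32)  # full training set, captured at compile time
y_true = tf.constant(np.random.rand(10, 1), dtype=tf.float32)    # what the loss actually receives: one batch
y_pred = tf.constant(np.random.rand(10, 1), dtype=tf.float32)

try:
    # same expression as in the custom loss; the second term cannot broadcast 10 against 1000
    tf.reduce_mean(tf.square(y_pred - y_true) + tf.square(y_true - x_true))
except tf.errors.InvalidArgumentError as err:
    print(err)  # Incompatible shapes: [10,1] vs. [1000,1]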
As I see it, there are two ways you can solve this. The first is to use a generator to create the batches, so that you have control over which indices are evaluated each time; inside the loss function you can then slice the x_true tensor to match the samples being evaluated:
def custom_loss(x_true):
    def loss(y_true, y_pred):
        x_true_samples = relevant_samples(x_true)  # slice x_true down to the current batch (helper not defined here)
        return K.mean(K.square(y_pred - y_true) + K.square(y_true - x_true_samples))
    return loss
This solution can be complicated; what I would suggest is a simpler workaround:
Concatenate the input layer with the output layer, so that your new output is of the form original_output, input.
Now you can use a new modified loss function:
def loss(y_true, y_pred):
    return K.mean(K.square(y_pred[:, :output_shape] - y_true[:, :output_shape]) +
                  K.square(y_true[:, :output_shape] - y_pred[:, output_shape:]))
Now your new loss function will take into account both the input data and the prediction.
Edit:
Note that although you set the seed, your models are not exactly the same; since you did not use a generator, you let Keras choose the batches, and for different models it might pick different samples.
As your model does not converge, different samples can lead to different results.
I added a generator to your code, to verify the samples we pick for training, now you can see both results are the same:
def custom_loss(y_true, y_pred): # this is essentially the mean_square_error
    mse = keras.losses.mean_squared_error(y_true, y_pred[:,2])
    return mse

def generator(x, y, batch_size):
    curIndex = 0
    batch_x = np.zeros((batch_size, 2))
    batch_y = np.zeros((batch_size, 1))
    while True:
        for i in range(batch_size):
            batch_x[i] = x[curIndex, :]
            batch_y[i] = y[curIndex, :]
            curIndex += 1          # advance to the next training sample
            if curIndex == 5000:   # wrap around after the last training sample
                curIndex = 0
        yield batch_x, batch_y
# set the seeds so that we get the same initialization across different trials
seed_numpy = 0
seed_tensorflow = 0
# generate data of x = [ y^3 y^2 ]
y = np.random.rand(5000+1000,1) * 2 # generate 5000 training and 1000 testing samples
x = np.concatenate( ( np.power(y, 3) , np.power(y, 2) ) , axis=1 )
training_data = x[0:5000:1,:]
training_label = y[0:5000:1]
testing_data = x[5000:6000:1,:]
testing_label = y[5000:6000:1]
batch_size = 32
# build the standard neural network with one hidden layer
seed(seed_numpy)
set_random_seed(seed_tensorflow)
input_standard = Input(shape=(2,)) # input
hidden_standard = Dense(10, activation='relu', input_shape=(2,))(input_standard) # hidden layer
output_standard = Dense(1, activation='linear')(hidden_standard) # output layer
model_standard = Model(inputs=[input_standard], outputs=[output_standard]) # build the model
model_standard.compile(loss='mse', optimizer='adam') # compile the model
#model_standard.fit(training_data, training_label, epochs=50, batch_size = 10) # train the model
model_standard.fit_generator(generator(training_data,training_label,batch_size), steps_per_epoch= 32, epochs= 100)
testing_label_pred_standard = model_standard.predict(testing_data) # make prediction
# get the mean squared error
mse_standard = np.sum( np.power( testing_label_pred_standard - testing_label , 2 ) ) / 1000
# build the neural network with the custom loss
seed(seed_numpy)
set_random_seed(seed_tensorflow)
input_custom = Input(shape=(2,)) # input
hidden_custom = Dense(10, activation='relu', input_shape=(2,))(input_custom) # hidden layer
output_custom_temp = Dense(1, activation='linear')(hidden_custom) # output layer
output_custom = keras.layers.concatenate([input_custom, output_custom_temp])
model_custom = Model(inputs=input_custom, outputs=output_custom) # build the model
model_custom.compile(loss = custom_loss, optimizer='adam') # compile the model
#model_custom.fit(training_data, training_label, epochs=50, batch_size = 10) # train the model
model_custom.fit_generator(generator(training_data,training_label,batch_size), steps_per_epoch= 32, epochs= 100)
testing_label_pred_custom = model_custom.predict(testing_data)
# get the mean squared error
mse_custom = np.sum( np.power( testing_label_pred_custom[:,2:3:1] - testing_label , 2 ) ) / 1000
# compare the result
print( [ mse_standard , mse_custom ] )
I am training a classification problem that ends in a softmax layer. I want to compute the loss as the average of the product of the prediction probability and square distances of each example's softmax output from the truth label.
In pseudocode: average(probability*distance_from_label**2)
Right now I am using the following code, which runs, but converges to outputting '0' for every instance. There is something wrong in its implementation:
bs = batch_size
l = labels
X = K.constant([[0,1,...,l-1] for y in range(bs)], shape=((bs,l)))
X = tf.add(-y_true, X)
X = tf.abs(X)
X = tf.multiply(y_pred, X)
X = tf.multiply(X, X)
return K.mean(X)
Is there a way to implement this square difference loss function and keep a softmax layer? And still to measure the actual Euclidian distance between the predicted and true labels, not just the element-wise difference in one-hot vectors?
For clarity, I have provided these examples:
Example 1:
label1 = [0, 0, 1, 0], prediction1 = [1, 0, 0, 0]
loss1 = 4 = (4 + 0 + 0 + 0) = (1*2^2 + 0*1^2 + 0*0^2 + 0*1^2)
Example 2:
label2 = [0, 1, 0, 0], prediction2 = [0.3, 0.1, 0.3, 0.3]
loss2 = 1.8 = (0.3 + 0 + 0.3 + 1.2) = (0.3*1^2 + 0.1*0^2 + 0.3*1^2 + 0.3*2^2)
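To make the target computation unambiguous, here is a small NumPy sketch that reproduces the two examples above (this is just the arithmetic, not the Keras loss I am asking about):
import numpy as np

def expected_loss(label, prediction):
    true_index = np.argmax(label)                            # index of the true class
    distances = np.abs(np.arange(len(label)) - true_index)   # index distance of every class from the truth
    return float(np.sum(prediction * distances ** 2))        # probability-weighted squared distance

print(expected_loss(np.array([0, 0, 1, 0]), np.array([1, 0, 0, 0])))           # 4.0
print(expected_loss(np.array([0, 1, 0, 0]), np.array([0.3, 0.1, 0.3, 0.3])))   # 1.8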
You can use Tensorflow (or Theano) as well as Keras Backends when designing a custom loss function. Note the tf.multiply and other functions from Tensorflow.
The following code implements this custom loss function in Keras (tested and working):
def custom_loss(y_true, y_pred):
    bs = 1  # batch size
    l = 4   # number of labels
    c = K.constant([[x for x in range(l)] for y in range(bs)], shape=((bs, l)))
    truths = tf.multiply(y_true, c)
    truths = K.sum(truths, axis=1)
    truths = K.concatenate(list(truths for i in range(l)))
    truths = K.reshape(truths, ((bs, l)))
    distances = tf.add(truths, -c)
    sqdist = tf.multiply(distances, distances)
    out = tf.multiply(y_pred, sqdist)
    out = K.sum(out, axis=1)
    return K.mean(out)
y_true = K.constant([0,0,1,0])
y_pred = K.constant([1,0,0,0])
print(custom_loss(y_true, y_pred)) # tf.Tensor(4.0, shape=(), dtype=float32)
Here is another implementation that achieves the same objective with slightly different weighting. Note: the one-hot labels need to be dtype=float32. All credit goes to my supervisor.
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
model = keras.models.Sequential()
model.add(keras.layers.Dense(5, activation="relu"))
model.add(keras.layers.Dense(4, activation="softmax"))
opt = keras.optimizers.Adam(lr=1e-3, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
def custom_loss(y_true, y_pred):
    a = tf.reshape(tf.range(tf.cast(tf.shape(y_pred)[1], dtype=tf.float32)), (1, -1))
    c = tf.tile(a, (tf.shape(y_pred)[0], 1))
    x = tf.math.multiply(y_true, c)
    x = tf.reshape(tf.reduce_sum(x, axis=1), (-1, 1))
    x = tf.tile(x, (1, tf.shape(y_pred)[1]))
    x = tf.math.pow(tf.math.subtract(x, c), 2)
    x = tf.math.multiply(x, y_pred)
    x = tf.reduce_sum(x, axis=1)
    return tf.reduce_mean(x)
yTrue= np.array([[0,0,1,0],[0,0,1,0],[0, 1, 0, 0]]).astype(np.float32)
y_true = tf.constant(yTrue)
y_pred = tf.constant([[1,0,0,0],[0,0,1,0],[0.3, 0.1, 0.3, 0.3]])
custom_loss(y_true,y_pred)
model.compile(optimizer=opt, loss=custom_loss)
model.fit(np.random.random((3,10)),yTrue, epochs=100)
I'm trying to implement sentence similarity architecture based on this work using the STS dataset. Labels are normalized similarity scores from 0 to 1 so it is assumed to be a regression model.
My problem is that the loss goes directly to NaN starting from the first epoch. What am I doing wrong?
I have already tried updating to latest keras and theano versions.
The code for my model is:
def create_lstm_nn(input_dim):
    seq = Sequential()
    # embed using a pretrained 300d embedding
    seq.add(Embedding(vocab_size, emb_dim, mask_zero=True, weights=[embedding_weights]))
    # encode via LSTM
    seq.add(LSTM(128))
    seq.add(Dropout(0.3))
    return seq
lstm_nn = create_lstm_nn(input_dim)
input_a = Input(shape=(input_dim,))
input_b = Input(shape=(input_dim,))
processed_a = lstm_nn(input_a)
processed_b = lstm_nn(input_b)
cos_distance = merge([processed_a, processed_b], mode='cos', dot_axes=1)
cos_distance = Reshape((1,))(cos_distance)
distance = Lambda(lambda x: 1-x)(cos_distance)
model = Model(input=[input_a, input_b], output=distance)
# train
rms = RMSprop()
model.compile(loss='mse', optimizer=rms)
model.fit([X1, X2], y, validation_split=0.3, batch_size=128, nb_epoch=20)
I also tried using a simple Lambda instead of the Merge layer, but it has the same result.
def cosine_distance(vests):
    x, y = vests
    x = K.l2_normalize(x, axis=-1)
    y = K.l2_normalize(y, axis=-1)
    return -K.mean(x * y, axis=-1, keepdims=True)

def cos_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0], 1)

distance = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([processed_a, processed_b])
A NaN loss is a common issue in deep learning regression. Since you are using a Siamese network, you can try the following:
check your data: does it need to be normalized?
try adding a Dense layer as the last layer of your network, but be careful when picking an activation function, e.g. relu
try another loss function, e.g. contrastive loss (a minimal sketch is shown after this list)
lower your learning rate, e.g. 0.0001
the cos merge mode does not carefully handle division by zero, which might be the cause of the NaN
It is not easy to make deep learning work perfectly.
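For the contrastive-loss suggestion above, here is a minimal sketch (the margin of 1.0 is an arbitrary choice, and it assumes binary similar/dissimilar labels rather than the continuous STS scores, so adapt as needed):
from keras import backend as K

def contrastive_loss(y_true, y_pred, margin=1.0):
    # y_pred: predicted distance between the pair; y_true: 1 for similar pairs, 0 for dissimilar
    similar_term = y_true * K.square(y_pred)
    dissimilar_term = (1.0 - y_true) * K.square(K.maximum(margin - y_pred, 0.0))
    return K.mean(similar_term + dissimilar_term)

# model.compile(loss=contrastive_loss, optimizer=rms)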
I didn't run into the NaN issue, but my loss wouldn't change. I found this info; check this out:
def cosine_distance(shapes):
    y_true, y_pred = shapes
    def l2_normalize(x, axis):
        norm = K.sqrt(K.sum(K.square(x), axis=axis, keepdims=True))
        return K.sign(x) * K.maximum(K.abs(x), K.epsilon()) / K.maximum(norm, K.epsilon())
    y_true = l2_normalize(y_true, axis=-1)
    y_pred = l2_normalize(y_pred, axis=-1)
    return K.mean(1 - K.sum((y_true * y_pred), axis=-1))