I am trying to train a multi-input (3) multi-output (4) model using Keras and I need to use a SINGLE loss function that takes in all the output predictions. 2 of these outputs are my true model outputs that I care about and have corresponding labels, while the other 2 outputs are learnable parameters from within my model that I want to use to dynamically update the loss weights for my true model outputs.
I need something like this:
model.compile(optimizer=optimizer, loss=unified_loss)
where the unified loss should have access to all my model outputs and corresponding labels. I am using tf.data.Dataset.from_tensor_slices(...) to train.
The only workaround I have found is to use a custom training loop, which allows this. But, I lose a lot of functionality and callbacks become trickier to implement.
Is there a way to solve this using the regular model.compile(...) and model.fit(...)?
Apart from a custom training loop, which is not preferred, I did try the standard approach of:
model.compile(optimizer=optimizer, loss=[loss1, loss2], loss_weights=[alpha, beta])
where I tried to make alpha and beta learnable parameters but this is not desired because I have a custom equation that is more involved than a simple weighted sum.
Add a layer to your model that concatenates the losses into a single tensor/output. Have your custom loss parse out each of the four values and run the necessary math on them. During inference, run the model without the extra layer.
The pattern of having a slightly different model for training and inference is a common one.
Here is an example of the basic idea:
import tensorflow as tf
inp1 = tf.keras.Input((1,))
inp2 = tf.keras.Input((1,))
inp3 = tf.keras.Input((1,))
inputs = tf.keras.layers.Concatenate()([inp1, inp2, inp3])
out1 = tf.keras.layers.Dense(1)(inputs)
out2 = tf.keras.layers.Dense(1)(inputs)
out3 = tf.keras.layers.Dense(1)(inputs)
out4 = tf.keras.layers.Dense(1)(inputs)
model = tf.keras.Model([inp1, inp2, inp3], [out1, out2, out3, out4])
x1 = tf.convert_to_tensor([1])
x2 = tf.convert_to_tensor([1])
x3 = tf.convert_to_tensor([1])
model((x1, x2, x3))
# stack the four outputs into one tensor so a single loss can see all of them
outs = tf.stack([out1, out2, out3, out4])
training_model = tf.keras.Model([inp1, inp2, inp3], outs)
training_model((x1, x2, x3))

def exotic_loss(y_true, y_pred):
    # split the stacked tensors back into the four individual values
    true1, true2, true3, true4 = tf.unstack(y_true)
    pred1, pred2, pred3, pred4 = tf.unstack(y_pred)
    # placeholder math; substitute your custom weighting equation here
    return true1 + true2 + true3 + true4 + pred1 + pred2 + pred3 + pred4

training_model.compile(loss=exotic_loss)
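If you train training_model with model.fit and a batched tf.data pipeline, it is usually easier to keep the batch dimension first. The following is a hedged variant of the example above, not part of the original answer: the label layout (two real labels plus two dummy columns, with the last two model outputs acting as the learnable weights) and the loss math are illustrative assumptions only.

import numpy as np

# concatenate instead of stacking, so the output shape is (batch, 4)
outs = tf.keras.layers.Concatenate(axis=1)([out1, out2, out3, out4])
training_model = tf.keras.Model([inp1, inp2, inp3], outs)

def exotic_loss(y_true, y_pred):
    pred1, pred2, w1, w2 = tf.unstack(y_pred, axis=1)    # model outputs
    true1, true2, _, _ = tf.unstack(y_true, axis=1)      # only two real labels
    # illustrative combination only; substitute the custom weighting equation
    return w1 * tf.abs(true1 - pred1) + w2 * tf.abs(true2 - pred2)

training_model.compile(optimizer="adam", loss=exotic_loss)

# dummy data, just to show the tf.data plumbing
n = 32
x1d = np.random.rand(n, 1).astype("float32")
x2d = np.random.rand(n, 1).astype("float32")
x3d = np.random.rand(n, 1).astype("float32")
labels = np.random.rand(n, 4).astype("float32")          # columns 2 and 3 are dummies

ds = tf.data.Dataset.from_tensor_slices(((x1d, x2d, x3d), labels)).batch(8)
training_model.fit(ds, epochs=1)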
I am currently trying to train a model using tf.GradientTape, as model.fit(...) from keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss in the tf.GradientTape custom workflow will first decrease slightly, but then get stuck and not improve any further, no matter how many epochs I run. The chosen metric will also not change after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero and something very large. The running loss is more stable but shows the model not improving.
This is all in contrast to using model.fit(...), where loss and metrics are improving immediately.
My code:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Concatenate, Bidirectional, LSTM, Activation, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2

def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
    x1 = Input(62)
    x2 = Input((62, 3))
    x = Embedding(30, 100, mask_zero=True)(x1)
    x = Concatenate()([x, x2])
    x = Bidirectional(LSTM(500,
                           return_sequences=True,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Bidirectional(LSTM(500,
                           return_sequences=False,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)
    x = Activation('softmax')(x)
    x = Dense(1000)(x)
    x = Dense(500)(x)
    x = Dense(250)(x)
    x = Dense(1, bias_initializer='ones')(x)
    x = tf.math.abs(x)
    return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA

dat_train = tf.data.Dataset.from_generator(
    generator=lambda: <load_function()>,
    output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)
# custom training
for epoch in range(50):
    for (x1, x2), y in dat_train:
        with tf.GradientTape() as tape:
            y_pred = model((x1, x2), training=True)
            loss = model.loss(y, y_pred)
            # note: unlike model.fit, this does not add the regularization
            # losses collected in model.losses
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer; however, I found abs to be more robust, and changing it does not change the outcome. The input x1 of the model is a sequence, and x2 holds some additional features that are later concatenated to the embedded x1 sequence. For my approach I'm not using MSE, but the behavior is the same either way.
I could provide some data, however, my dataset is quite large, so I would need to extract a bit out of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly 2x the amount of time compared to model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know if that's normal and don't know how to compare it to the gradients provided by model.fit(...).
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs are shown in two plots (not reproduced here): one for the custom training run and one for the keras training run.
While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in losses. It matches the behavior in the custom training run image.
So far, I've noticed two ways of improving the performance of the custom training:
The usage of custom layer initialization
Using MSE as a loss function
Using MSE instead of my own loss function actually improves the custom training performance. Still, neither MSE nor different initialization comes close to the performance of keras fit.
I have found the solution: it was a simple shape mismatch, which was somehow not caught by any error check and went unnoticed with both my custom loss function and MSE. Using x = Reshape(())(x) as the final layer did the trick.
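For reference, a minimal sketch of the kind of silent broadcasting such a mismatch can cause; the shapes are assumptions based on the model above, where Dense(1) produces (batch, 1) while the labels are (batch,):

import tensorflow as tf

y_true = tf.zeros((8,))       # labels from the generator: shape (batch,)
y_pred = tf.zeros((8, 1))     # Dense(1) output: shape (batch, 1)
# (batch, 1) - (batch,) broadcasts to (batch, batch), giving a wrong loss and gradients
print(((y_pred - y_true) ** 2).shape)   # (8, 8)

# a final Reshape(()) makes the model output shape (batch,), matching y_true:
# x = Reshape(())(x)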
I want to train a Siamese Network to compare vectors for similarity.
My dataset consists of pairs of vectors and a target column with "1" if they are the same and "0" otherwise (binary classification):
import pandas as pd
# Define train and test sets.
X_train_val = pd.read_csv("train.csv")
print(X_train_val.head())
y_train_val = X_train_val.pop("class")
print(y_train_val.value_counts())
# Keep 50% of X_train_val in validation set.
X_train, X_val = X_train_val[:991], X_train_val[991:]
y_train, y_val = y_train_val[:991], y_train_val[991:]
del X_train_val, y_train_val
# Split our data to 'left' and 'right' inputs (one for each side Siamese).
X_left_train, X_right_train = X_train.iloc[:, :200], X_train.iloc[:, 200:]
X_left_val, X_right_val = X_val.iloc[:, :200], X_val.iloc[:, 200:]
assert X_left_train.shape == X_right_train.shape
# Repeat for test set.
X_test = pd.read_csv("test.csv")
y_test = X_test.pop("class")
print(y_test.value_counts())
X_left_test, X_right_test = X_test.iloc[:, :200], X_test.iloc[:, 200:]
returns
v0 v1 v2 ... v397 v398 v399 class
0 0.003615 0.013794 0.030388 ... -0.093931 0.106202 0.034870 0.0
1 0.018988 0.056302 0.002915 ... -0.007905 0.100859 -0.043529 0.0
2 0.072516 0.125697 0.111230 ... -0.010007 0.064125 -0.085632 0.0
3 0.051016 0.066028 0.082519 ... 0.012677 0.043831 -0.073935 1.0
4 0.020367 0.026446 0.015681 ... 0.062367 -0.022781 -0.032091 0.0
1.0 1060
0.0 923
Name: class, dtype: int64
1.0 354
0.0 308
Name: class, dtype: int64
The rest of my script is as follows:
import keras
import keras.backend as K
from keras.layers import Dense, Dropout, Input, Lambda
from keras.models import Model
def euclidean_distance(vectors):
    """
    Find the Euclidean distance between two vectors.
    """
    x, y = vectors
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    # Epsilon is a small value that makes very little difference to the value of the denominator, but ensures that it isn't equal to exactly zero.
    return K.sqrt(K.maximum(sum_square, K.epsilon()))

def contrastive_loss(y_true, y_pred):
    """
    Distance-based loss function that tries to ensure that data samples that are semantically similar are embedded closer together.
    See:
    * https://gombru.github.io/2019/04/03/ranking_loss/
    """
    margin = 1
    return K.mean(y_true * K.square(y_pred) + (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))

def accuracy(y_true, y_pred):
    """
    Compute classification accuracy with a fixed threshold on distances.
    """
    return K.mean(K.equal(y_true, K.cast(y_pred < 0.5, y_true.dtype)))
def create_base_network(input_dim: int, dense_units: int, dropout_rate: float):
    input1 = Input(input_dim, name="encoder")
    x = input1
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    x = Dense(dense_units, activation="relu", name="Embeddings")(x)
    return Model(input1, x)

def build_siamese_model(input_dim: int):
    shared_network = create_base_network(input_dim, dense_units=128, dropout_rate=0.1)
    left_input = Input(input_dim)
    right_input = Input(input_dim)
    # Since this is a siamese nn, both sides share the same network.
    encoded_l = shared_network(left_input)
    encoded_r = shared_network(right_input)
    # The euclidean distance layer outputs a value close to 0 when two inputs are similar and 1 otherwise.
    distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
    siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
    siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
    return siamese_net
model = build_siamese_model(X_left_train.shape[1])
es_callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3, verbose=0)
history = model.fit(
    [X_left_train, X_right_train],
    y_train,
    validation_data=([X_left_val, X_right_val], y_val),
    epochs=100,
    callbacks=[es_callback],
    verbose=1,
)
I have plotted the contrastive loss vs epoch and model accuracy vs epoch:
The validation line is almost flat, which seems odd to me (overfitted?).
After changing the dropout of the shared network from 0.1 to 0.5, I get the following results:
Somehow it looks better, but yields bad predictions as well.
My questions are:
Most examples of Siamese Networks I've seen so far involve embedding layers (text pairs) and/or convolution layers (image pairs). My input pairs are the actual vector representations of some text, which is why I used Dense layers for the shared network. Is this the proper approach?
The output layer of my Siamese Network is as follows:
distance = Lambda(euclidean_distance, name="Euclidean-Distance")([encoded_l, encoded_r])
siamese_net = Model(inputs=[left_input, right_input], outputs=distance)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
but someone over the internet suggested this instead:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="sigmoid")(distance) # returns the class probability
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
I'm not sure which one to trust, nor do I fully understand the difference between them (except that the former returns the distance and the latter returns the class probability). In my experiments, I get poor results with binary_crossentropy.
EDIT:
After following @PlzBePython's suggestions, I came up with the following output layers:
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]), name="L1-Distance")([encoded_l, encoded_r])
output = Dense(1, activation="linear")(distance)
siamese_net = Model(inputs=[left_input, right_input], outputs=output)
siamese_net.compile(loss=contrastive_loss, optimizer="RMSprop", metrics=[accuracy])
Thank you for your help!
This is less of an answer and more me writing down my thoughts in the hope that they help find one.
In general, everything you do seems pretty reasonable to me.
Regarding your Questions:
1:
Embedding or feature-extraction layers are never a must, but they almost always make it easier to learn the intended task. You can think of them as providing your distance model with a comprehensive summary of a sentence instead of its raw words. This also makes your model independent of the position of a word. In your case, creating the summary/important features of a sentence and embedding similar sentences close to each other is done by the same network. Of course, this can also work, and I don't even think it's a bad approach. However, I would maybe increase the network size.
2:
In my opinion, those two loss functions are not too different. Binary cross-entropy is defined as
L = -(y * log(p) + (1 - y) * log(1 - p)),
while contrastive loss (margin = 1) is
L = y * d^2 + (1 - y) * max(1 - d, 0)^2,
where d is the predicted distance.
So you basically swap a log function for a hinge function.
The only real difference comes from the distance calculation. You were probably advised to use some kind of L1 distance because L2 distance is supposed to perform worse in higher dimensions (see for example here), and your dimensionality is 128. Personally, I would rather go with L1 in your case, but I don't think it's a dealbreaker.
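For example, a minimal sketch of swapping in an L1 (Manhattan) distance, assuming the encoded_l/encoded_r tensors from the question's model:

import keras.backend as K
from keras.layers import Lambda

def manhattan_distance(vectors):
    # sum of absolute differences instead of the squared/rooted L2 version
    x, y = vectors
    return K.sum(K.abs(x - y), axis=1, keepdims=True)

distance = Lambda(manhattan_distance, name="L1-Distance")([encoded_l, encoded_r])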
What I would try is:
increase the margin parameter; "1" always results in a pretty low loss in the false-positive case, which could slow down training in general
try out embedding into the [-inf, inf] space (change last layer embedding activation to "linear")
change "binary_crossentropy" loss into "keras.losses.BinaryCrossentropy(from_logits=True)" and last activation from "sigmoid" to "linear". This should actually not make a difference, but I've made some weird experiences with the keras binary crossentropy function and from_logits seems to help sometimes
increase parameters
Lastly, a validation accuracy of 90% actually looks pretty good to me. Keep in mind that when the validation accuracy is calculated in the first epoch, the model has already done about 60 weight updates (batch_size = 32). That means that, especially in the first epoch, a validation accuracy higher than the training accuracy (which is calculated during training) is to be expected. Also, this can sometimes create the mistaken impression that the training loss is increasing faster than the validation loss.
EDIT
I recommended "linear" in the last layer, because tensorflow recommends it ("from_logits"=True which requires value in [-inf, inf]) for Binary Crossentropy. In my experience, it converges better.
I want to create a model which can predict two outputs. I did some research and found that there's a way to do it by creating two branches (one for each output) using the functional API in TensorFlow Keras, but I have another approach in mind, which looks like this:
i.e. given an input, I first want to predict output1 and then, based on that, predict output2.
So how can this be done in TensorFlow?
Please let me know how the training will be done as well, i.e. how I'll be able to pass labels for output1 and output2 and then calculate the loss.
Thank you
You can do it with the functional API of TensorFlow. I'll write it in some sort of pseudo-code:
Inputs = your_input
x = hidden_layers()(Inputs)
Output1 = Dense()(x)
x = hidden_layers()(Output1)
Output2 = Dense()(x)
So you can separate it into two models if that is what you want:
model1 = tf.keras.models.Model(inputs=[Input], outputs=[Output1])
model2 = tf.keras.models.Model(inputs=[Input], outputs=[Output2])
Or have everything in one model:
model = tf.keras.models.Model(inputs=[Input], outputs=[Output2])
Output1_pred = model.get_layer('Output1').output
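For example, a minimal sketch of actually getting output1 predictions from the single combined model; this assumes the Output1 Dense layer was created with name='Output1', and some_input_batch is a placeholder for your data:

output1_model = tf.keras.models.Model(inputs=[Inputs],
                                      outputs=[model.get_layer('Output1').output])
output1_pred = output1_model.predict(some_input_batch)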
UPDATE:
In order to train a model with two outputs, you can separate the model into two parts and train each part separately, as follows:
model1 = tf.keras.models.Model(inputs=[Input], outputs=[Output1])
model2 = tf.keras.models.Model(inputs=[model1.get_layer('Output1').output], outputs=[Output2])

model1.compile(...)
model1.fit(...)

# freeze the first part before training the second
for layer in model1.layers:
    layer.trainable = False

model2.compile(...)
model2.fit(...)
You can actually modify the great answer by @Mohammad to compose a single model with two outputs.
Inputs = your_input
x = hidden_layers()(Inputs)
Output1 = Dense()(x)
x = hidden_layers()(Output1)
Output2 = Dense()(x)
model = tf.keras.models.Model(inputs=[Inputs], outputs=[Output1, Output2])
model.compile(loss=[loss_1, loss_2], loss_weights=[0.5, 0.5], optimizer=sgd, metrics=['accuracy'])
Of course, you can change the weights, optimiser and metrics according to your case.
Then the model has to be trained on data like (X, y1, y2) where (y1, y2) are output1 and output2 labels respectively.
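Here is a minimal runnable sketch of that composed model; the input size, layer sizes, losses and data are illustrative assumptions only:

import numpy as np
import tensorflow as tf

inputs = tf.keras.Input(shape=(10,))
x = tf.keras.layers.Dense(32, activation="relu")(inputs)
output1 = tf.keras.layers.Dense(1, name="output1")(x)                        # first prediction
x = tf.keras.layers.Dense(16, activation="relu")(output1)                    # built on output1
output2 = tf.keras.layers.Dense(3, activation="softmax", name="output2")(x)  # second prediction

model = tf.keras.Model(inputs=inputs, outputs=[output1, output2])
model.compile(
    optimizer="adam",
    loss={"output1": "mse", "output2": "sparse_categorical_crossentropy"},
    loss_weights={"output1": 0.5, "output2": 0.5},
)

# training data shaped as (X, (y1, y2))
X = np.random.rand(64, 10)
y1 = np.random.rand(64, 1)
y2 = np.random.randint(0, 3, size=(64,))
model.fit(X, {"output1": y1, "output2": y2}, epochs=2, batch_size=16)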
I want to create a custom loss function for a Keras deep learning regression model. For the custom loss function, I want to use a feature that is in the dataset but I am not using that particular feature as an input to the model.
My data looks like this:
X | Y | feature
---|-----|--------
x1 | y1 | f1
x2 | y2 | f2
The input to the model is X and I want to predict Y using the model. I want something like the following as the loss function:
def custom_loss(feature):
    def loss(y_true, y_pred):
        return root_mean_square(y_true - y_pred) + std(y_pred - feature)
    return loss
I can't use a wrapper function as above, because the feature values depend on the training and test batches and thus cannot be passed to the custom loss function at model compile time. How can I use the additional feature in the dataset to create a custom loss function?
EDIT:
I did the following based on an answer on this thread. When I make predictions using this model, does it make predictions for 'Y' or for a combination of Y and the additional feature? I want to make sure, because model.fit() takes both 'Y' and 'feature' as y to train, but model.predict() only gives one output. If the predictions are a combination of Y and the additional feature, how can I extract only Y?
def custom_loss(data, y_pred):
    y_true = data[:, 0]
    feature = data[:, 1]
    return K.mean(K.square((y_pred - y_true) + K.std(y__pred - feature)))

def create_model():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=1, activation="relu"))
    model.add(Dense(1, activation="linear"))
    return model

(train, test) = train_test_split(df, test_size=0.3, random_state=42)
model = models.create_model(train["X"].shape[1])
opt = Adam(learning_rate=1e-2, decay=1e-3/200)
model.compile(loss=custom_loss, optimizer=opt)
model.fit(train["X"], train[["Y", "feature"]], validation_data=(test["X"], test[["Y", "feature"]]), batch_size=8, epochs=90)

predY = model.predict(test["X"])  # what does the model predict here?
First, check the data structure of the y you pass to the fit function and see whether it has the same structure as in the answer from the thread you're following; if you do that exactly right, it should solve your problem.
When I make predictions using this model, does it make predictions for 'Y' or a combination of Y and the additional feature?
The model will have exactly the output shape you defined. In your case, because the model output is Dense(1, activation="linear"), it has output shape y_pred.shape == (batch_size, 1), nothing more; you can be sure of that. Print it out using tf.print(y_pred) to see for yourself.
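For example, a minimal sketch of checking the shape from inside the loss, reusing the (corrected) custom_loss from the question:

import tensorflow as tf
import tensorflow.keras.backend as K

def custom_loss(data, y_pred):
    tf.print("y_pred shape:", tf.shape(y_pred))   # prints e.g. [batch_size, 1]
    y_true = data[:, 0]
    feature = data[:, 1]
    return K.mean(K.square((y_pred - y_true) + K.std(y_pred - feature)))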
Also, I don't know if it's a typing error, but the last line of your custom_loss function should be:
return K.mean(K.square((y_pred - y_true) + K.std(y_pred - feature)))
instead of
return K.mean(K.square((y_pred - y_true) + K.std(y__pred - feature)))
You can also use .add_loss together with a simple mse loss, in the following way:
input = Input(size)
output = YourLayers(input)
model = Model(input, output)
# penalty term: standard deviation of (feature column - prediction)
model.add_loss(tf.math.reduce_std(tf.gather(input, feature_idx, axis=1) - output))
model.compile(loss='mse', optimizer=opt)
BTW, it is strange that your regularizer is a square root of variance (a standard deviation), while your loss is MSE. Maybe you would prefer them to be on the same squared scale (variance and MSE), as people usually do (consider any L2 shrinkage, e.g. ridge regression).
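A minimal sketch of the add_loss idea when the extra feature is not one of the model's predictive inputs: pass it as a separate Input that only feeds add_loss. The layer sizes and names here are illustrative assumptions, not from the answer above:

import tensorflow as tf

x_in = tf.keras.Input(shape=(1,), name="X")
feat_in = tf.keras.Input(shape=(1,), name="feature")    # consumed only by the extra loss term

h = tf.keras.layers.Dense(5, activation="relu")(x_in)
y_out = tf.keras.layers.Dense(1, activation="linear")(h)

model = tf.keras.Model([x_in, feat_in], y_out)
model.add_loss(tf.math.reduce_std(y_out - feat_in))      # extra penalty term
model.compile(loss="mse", optimizer="adam")

# both arrays go in as inputs, but only Y is the target:
# model.fit([train["X"], train["feature"]], train["Y"], ...)

At prediction time the feature input still has to be fed (e.g. zeros), or a separate inference model can be built from x_in to y_out.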
I am trying to implement a fairly simple custom loss function in Keras.
I am trying to make the network predict when an input case is bad (i.e. one on which it has no chance of predicting the correct output), along with the correct output itself. To do this, I used a loss function which allows the network to 'choose' a constant loss (8) instead of its current loss (determined by MAE):
loss = quality * output + (1-quality) * 8
where quality is the output of a sigmoid, so it lies in [0, 1].
How would I design such a loss function properly in Keras?
Specifically, in the basic case, the network gets several predictions of the output, along with metrics known or thought to correlate with prediction quality. The role of the (small) network is to use these metrics to determine the weights to apply when averaging these different predictions. This works well enough.
However, in some fraction of cases (say 5-10%) the input data is so bad that all predictors will be wrong. In that case, I want to output '?' to the user instead of a wrong answer.
My code complained about 1 array vs. 2 arrays (presumably an identical number of y_true and y_pred arrays is expected, but I don't have targets for both outputs).
model = Model(inputs=[ain, in1, in2, in3, in4, in5, x], outputs=[pred,qual])
model.compile(loss=quality_loss, optimizer='adam', metrics=['mae'])
model.fit([acc, hrmet0, hrmet1, hrmet2, hrmet3, hrmet4, hrs], ref, epochs=50, batch_size=5000, verbose=2, shuffle=True)
It seems having two outputs is causing the loss function to be called independently for each output.
ValueError: Error when checking model target: the list of Numpy arrays that you
are passing to your model is not the size the model expected. Expected to see 2
array(s), but instead got the following list of 1 arrays:
This was solved by passing a concatenated array instead.
def quality_loss(y_true, y_pred):
    qual = y_pred[:, 0]
    hr = y_pred[:, 1]
    const = 8
    return qual * mean_absolute_error(y_true, hr) + (1 - qual) * const

def my_mae(y_true, y_pred):
    return mean_absolute_error(y_true, y_pred[:, 1])

model = Model(inputs=[xin, in1, in2, in3, in4, in5, hr], outputs=concatenate([qual, pred_hr]))
model.compile(loss=quality_loss, optimizer='adam', metrics=[my_mae])
Network code:
xin = Input(shape=(1,))
in1 = Input(shape=(4,))
net1 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in1) )
in2 = Input(shape=(4,))
net2 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in2) )
in3 = Input(shape=(4,))
net3 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in3) )
in4 = Input(shape=(4,))
net4 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in4) )
in5 = Input(shape=(4,))
net5 = Dense(3,activation='tanh')( Dense(6,activation='tanh')(in5) )
smweights = Dense(5, activation='softmax')( concatenate([xin, net1, net2, net3, net4, net5]) )
qual = Dense(1, activation='sigmoid')( Dense(3, activation='tanh')( concatenate([xin, net1, net2, net3, net4, net5]) ) )
x = Input(shape=(5,))
pred = dot([x, smweights], axes=1)
This runs, but converges to loss = const and MAE > 25 (whereas a simple MAE loss here achieves 3-4 quite easily). Something is still not quite right with the loss function. Since printing the shape of y_true/y_pred inside the loss function gives (?), it's hard to track exactly what is being passed.
This issue is actually not caused by your custom loss function, but by something else: The problem arises because of how you call the fit function.
When you define the model, you give it 7 inputs and 2 outputs:
model = Model(inputs=[ain, in1, in2, in3, in4, in5, x], outputs=[pred,qual])
When you eventually call the fit function, you give it a list of 7 arrays as the input of the network but only 1 target output, called ref:
model.fit([acc, hrmet0, hrmet1, hrmet2, hrmet3, hrmet4, hrs], ref, ...)
This will not work. You have to supply the fit function with the same number of inputs and outputs as declared in the model's definition.
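For illustration only, a hedged sketch of a matching call; qual_target is a placeholder, and if the quality head has no labels of its own, the concatenated-output approach shown in the question avoids needing a second target at all:

# one target array per declared output
model.fit([acc, hrmet0, hrmet1, hrmet2, hrmet3, hrmet4, hrs],
          [ref, qual_target],
          epochs=50, batch_size=5000, verbose=2, shuffle=True)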
Edit: I think there is a conceptual problem with your approach: how are you actually planning to define the quality of your prediction? Why do you think that adding a branch of your network which is supposed to judge the quality of the network's prediction will actually help to train it? The network will converge to a local minimum of the loss function. The fancier your loss function is, the more likely it is that it will not converge to the state you actually want but to some other local, not global, minimum. You could try experimenting with different optimizers and learning rates; maybe this helps your training.