Keras historical averaging custom loss function - python

I am currently experimenting with generative adversarial networks in Keras.
As proposed in this paper, I want to use the historical averaging loss function. Meaning that I want to penalize the change of the network weights.
I am not sure how to implement it in a clever way.
I was implementing the custom loss function according to the answer to this post.
def historical_averaging_wrapper(current_weights, prev_weights):
def historical_averaging(y_true, y_pred):
diff = 0
for i in range(len(current_weights)):
diff += abs(np.sum(current_weights[i]) + np.sum(prev_weights[i]))
return K.binary_crossentropy(y_true, y_pred) + diff
return historical_averaging
The weights of the network are penalized, and the weights are changing after each batch of data.
My first idea was to update the loss function after each batch.
Roughly like this:
prev_weights = model.get_weights()
for i in range(len(data)/batch_len):
current_weights = model.get_weights()
model.compile(loss=historical_averaging_wrapper(current_weights, prev_weights), optimizer='adam')
model.fit(training_data[i*batch_size:(i+1)*batch_size], training_labels[i*batch_size:(i+1)*batch_size], epochs=1, batch_size=batch_size)
prev_weights = current_weights
Is this reasonable? That approach seems to be a bit "messy" in my opinion.
Is there another possibility to do this in a "smarter" way?
Like maybe updating the loss function in a data generator and use fit_generator()?
Thanks in advance.

Loss functions are operations on the graph using tensors.
You can define additional tensors in the loss function to hold previous values. This is an example:
import tensorflow as tf
import tensorflow.keras.backend as K
keras = tf.keras
class HistoricalAvgLoss(object):
def __init__(self, model):
# create tensors (initialized to zero) to hold the previous value of the
# weights
self.prev_weights = []
for w in model.get_weights():
self.prev_weights.append(K.variable(np.zeros(w.shape)))
def loss(self, y_true, y_pred):
err = keras.losses.mean_squared_error(y_true, y_pred)
werr = [K.mean(K.abs(c - p)) for c, p in zip(model.get_weights(), self.prev_weights)]
self.prev_weights = K.in_train_phase(
[K.update(p, c) for c, p in zip(model.get_weights(), self.prev_weights)],
self.prev_weights
)
return K.in_train_phase(err + K.sum(werr), err)
The variable prev_weights holds the previous values. Note that we added a K.update operation after the weight errors are calculated.
A sample model for testing:
model = keras.models.Sequential([
keras.layers.Input(shape=(4,)),
keras.layers.Dense(8),
keras.layers.Dense(4),
keras.layers.Dense(1),
])
loss_obj = HistoricalAvgLoss(model)
model.compile('adam', loss_obj.loss)
model.summary()
Some test data and objective function:
import numpy as np
def test_fn(x):
return x[0]*x[1] + 2.0 * x[1]**2 + x[2]/x[3] + 3.0 * x[3]
X = np.random.rand(1000, 4)
y = np.apply_along_axis(test_fn, 1, X)
hist = model.fit(X, y, validation_split=0.25, epochs=10)
The model losses decrease over time, in my test.

Related

Using a custom loss function that references some of the inputs of the NN

I'm having trouble implementing a custom loss function into a Neural Network I'm building in TensorFlow. I want use one of my features as part of the loss function, so I've tried using model.add_loss instead of giving loss a value in the model.compile function.
My data looks like this:
import tensorflow as tf
import numpy as np
from tensorflow.keras import layers
feature_df = np.array([600,9])
training, test, = feature_df[:350,:], feature_df[350:,:]
x_train = training[:,[0,1,2,3,4,5,6]]
y_train = training[:,8]
loss_inp_train = training[:,[6]]
x_test = test[:,[0,1,2,3,4,5,6]]
y_test = test[:,8]
loss_inp_test = test[:,[6]]
I want to use a custom loss function because its not necessarily the mse I'm interested in minimizing, I want to minimize the profitability of this model, which depends if y_true and y_pred fall above or below loss_inp_train
I've tried creating a loss function that looks like this
def custom_loss(y_pred, y_true,inp):
loss = 0
if (y_pred < inp):
if y_true < inp:
loss = loss + .9
else:
loss = loss - 1
else:
if y_true > inp:
loss = loss + .9
else:
loss = loss - 1
loss = loss*-1
return(loss)
And the Model
model = tf.keras.Sequential([
normalize,
layers.Dense(18),
layers.Dense(1)
])
model.add_loss(profit_loss(y_pred,y_train,loss_inp_train))
model.compile(loss = None,
optimizer = tf.optimizers.Adam())
I'm having trouble feeding the loss function the output of the model. I'm still new to TensorFlow, whenever I've accessed predicted values its after the training using model.predict, but obviously I don't have a fitted model yet. How do I reference both a feature of the training data and y_true, y_pred in a function?
Probably the best way to do this is to define a custom loss. Unfortunately I'm not sure how to handle nested if statements like you have. Probably with a combination of K.switch. I can try to give you a partial solutions taking in consideration only the presence of a single if statement. Let's take the following simplified code:
loss = 0
if (y_pred < inp):
loss = # assignment 1
else:
loss = # assignment 2
In this case the loss function could be converted into this:
def profit_loss(inp):
def loss_function(y_true, y_pred):
loss = 0
condition = K.greater(y_pred - inp, 0)
loss1 = # assignment 1 if y_pred < inp
loss2 = # assignment 2 if y_pred >= inp
loss = K.switch(condition, loss2, loss1)
return - K.sum(loss)
return loss_function
model.compile(optimizer = tf.optimizers.Adam(), loss=profit_loss(inp))
This way y_true and y_pred are automatically handled and you just have to feed the inp argument.
Hope this helps getting you closer to solving the problem.

Implementing binary cross entropy from scratch - inconsistent results in training a neural network

I'm trying to implement and train a neural network using the JAX library and its little neural network submodule, "Stax". Since this library doesn't come with an implementation of binary cross entropy, I wrote my own:
def binary_cross_entropy(y_hat, y):
bce = y * jnp.log(y_hat) + (1 - y) * jnp.log(1 - y_hat)
return jnp.mean(-bce)
I implemented a simple neural network and trained it on MNIST, and started to get suspicious of some of the results I was getting. So I implemented the same setup in Keras, and I immediately got wildly different results! The same model, trained in the same way on the same data, was getting 90% training accuracy in Keras instead of around 50% in JAX. Eventually I tracked down part of the issue to my naive implementation of cross-entropy, which is supposedly numerically unstable. Following this post and this code I found, I wrote the following new version:
def binary_cross_entropy_stable(y_hat, y):
y_hat = jnp.clip(y_hat, 0.000001, 0.9999999)
logits = jnp.log(y_hat/(1 - y_hat))
max_logit = jnp.clip(logits, 0, None)
bces = logits - logits * y + max_logit + jnp.log(jnp.exp(-max_logit) + jnp.exp(-logits - max_logit))
return jnp.mean(bces)
This works a little better. Now my JAX implementation gets up to 80% train accuracy, but that's still a lot less than the 90% Keras gets. What I want to know is what is going on? Why are my two implementations not behaving the same way?
Below, I condensed my two implementations down to a single script. In this script, I implement the same model in JAX and in Keras. I initialize both with the same weights, and train them using full-batch gradient descent for 10 steps on 1000 datapoints from MNIST, the same data for each model. JAX finishes with 80% training accuracy, while Keras finishes with 90%. Specifically, I get this output:
Initial Keras accuracy: 0.4350000023841858
Initial JAX accuracy: 0.435
Final JAX accuracy: 0.792
Final Keras accuracy: 0.9089999794960022
JAX accuracy (Keras weights): 0.909
Keras accuracy (JAX weights): 0.7919999957084656
And actually, when I vary the conditions a little (using different random initial weights or a different training set), sometimes I get back the 50% JAX accuracy and 90% Keras accuracy.
I swap the weights at the end to verify that the weights obtained from training are indeed the issue, not something to do with the actual computation of the network predictions, or the way I calculate accuracy.
The code:
import numpy as np
import jax
from jax import jit, grad
from jax.experimental import stax, optimizers
import jax.numpy as jnp
import keras
import keras.datasets.mnist
def binary_cross_entropy(y_hat, y):
bce = y * jnp.log(y_hat) + (1 - y) * jnp.log(1 - y_hat)
return jnp.mean(-bce)
def binary_cross_entropy_stable(y_hat, y):
y_hat = jnp.clip(y_hat, 0.000001, 0.9999999)
logits = jnp.log(y_hat/(1 - y_hat))
max_logit = jnp.clip(logits, 0, None)
bces = logits - logits * y + max_logit + jnp.log(jnp.exp(-max_logit) + jnp.exp(-logits - max_logit))
return jnp.mean(bces)
def binary_accuracy(y_hat, y):
return jnp.mean((y_hat >= 1/2) == (y >= 1/2))
########################################
# #
# Create dataset #
# #
########################################
input_dimension = 784
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data(path="mnist.npz")
xs = np.concatenate([x_train, x_test])
xs = xs.reshape((70000, 784))
ys = np.concatenate([y_train, y_test])
ys = (ys >= 5).astype(np.float32)
ys = ys.reshape((70000, 1))
train_xs = xs[:1000]
train_ys = ys[:1000]
########################################
# #
# Create JAX model #
# #
########################################
jax_initializer, jax_model = stax.serial(
stax.Dense(1000),
stax.Relu,
stax.Dense(1),
stax.Sigmoid
)
rng_key = jax.random.PRNGKey(0)
_, initial_jax_weights = jax_initializer(rng_key, (1, input_dimension))
########################################
# #
# Create Keras model #
# #
########################################
initial_keras_weights = [*initial_jax_weights[0], *initial_jax_weights[2]]
keras_model = keras.Sequential([
keras.layers.Dense(1000, activation="relu"),
keras.layers.Dense(1, activation="sigmoid")
])
keras_model.compile(
optimizer=keras.optimizers.SGD(learning_rate=0.01),
loss=keras.losses.binary_crossentropy,
metrics=["accuracy"]
)
keras_model.build(input_shape=(1, input_dimension))
keras_model.set_weights(initial_keras_weights)
if __name__ == "__main__":
########################################
# #
# Compare untrained models #
# #
########################################
initial_keras_predictions = keras_model.predict(train_xs, verbose=0)
initial_jax_predictions = jax_model(initial_jax_weights, train_xs)
_, keras_initial_accuracy = keras_model.evaluate(train_xs, train_ys, verbose=0)
jax_initial_accuracy = binary_accuracy(jax_model(initial_jax_weights, train_xs), train_ys)
print("Initial Keras accuracy:", keras_initial_accuracy)
print("Initial JAX accuracy:", jax_initial_accuracy)
########################################
# #
# Train JAX model #
# #
########################################
L = jit(binary_cross_entropy_stable)
gradL = jit(grad(lambda w, x, y: L(jax_model(w, x), y)))
opt_init, opt_apply, get_params = optimizers.sgd(0.01)
network_state = opt_init(initial_jax_weights)
for _ in range(10):
wT = get_params(network_state)
gradient = gradL(wT, train_xs, train_ys)
network_state = opt_apply(
0,
gradient,
network_state
)
final_jax_weights = get_params(network_state)
final_jax_training_predictions = jax_model(final_jax_weights, train_xs)
final_jax_accuracy = binary_accuracy(final_jax_training_predictions, train_ys)
print("Final JAX accuracy:", final_jax_accuracy)
########################################
# #
# Train Keras model #
# #
########################################
for _ in range(10):
keras_model.fit(
train_xs,
train_ys,
epochs=1,
batch_size=1000,
verbose=0
)
final_keras_loss, final_keras_accuracy = keras_model.evaluate(train_xs, train_ys, verbose=0)
print("Final Keras accuracy:", final_keras_accuracy)
########################################
# #
# Swap weights #
# #
########################################
final_keras_weights = keras_model.get_weights()
final_keras_weights_in_jax_format = [
(final_keras_weights[0], final_keras_weights[1]),
tuple(),
(final_keras_weights[2], final_keras_weights[3]),
tuple()
]
jax_accuracy_with_keras_weights = binary_accuracy(
jax_model(final_keras_weights_in_jax_format, train_xs),
train_ys
)
print("JAX accuracy (Keras weights):", jax_accuracy_with_keras_weights)
final_jax_weights_in_keras_format = [*final_jax_weights[0], *final_jax_weights[2]]
keras_model.set_weights(final_jax_weights_in_keras_format)
_, keras_accuracy_with_jax_weights = keras_model.evaluate(train_xs, train_ys, verbose=0)
print("Keras accuracy (JAX weights):", keras_accuracy_with_jax_weights)
Try changing the PRNG seed at line 57 to a value other than 0 to run the experiment using different initial weights.
Your binary_cross_entropy_stable function does not match the output of keras.binary_crossentropy; for example:
x = np.random.rand(10)
y = np.random.rand(10)
print(keras.losses.binary_crossentropy(x, y))
# tf.Tensor(0.8134677734043875, shape=(), dtype=float64)
print(binary_cross_entropy_stable(x, y))
# 0.9781515
That is where I would start if you're trying to exactly duplicate the model.
You can view the source of the keras loss function here: keras/losses.py#L1765-L1810, with the main part of the implementation here: keras/backend.py#L4972-L5017
One detail: it appears that with a sigmoid activation function, Keras re-uses some cached logits to compute the binary cross entropy while avoiding problematic values: keras/backend.py#L4988-L4997. I'm not sure how to easily replicate that behavior using JAX & stax.

Custom Loss function, Keras \\ ValueError: No gradients

I'm attempting to wrap my Keras neural network in a class object. I have implemented the below outside of a class setting, but I want to make this more object-friendly.
To summarize, my model calls function sequential_model which creates a sequential model. Within the compile step, I have defined my own loss function weighted_categorical_crossentropy which I want the sequential model to implement.
However, when I run the code below I get the following error: ValueError: No gradients provided for any variable:
I suspect the issue is with how I'm defining the weighted_categorical_crossentropy function for its use by sequential.
Again, I was able to get this work in a non-object oriented way. Any help will be much appreciated.
from tensorflow.keras import Sequential, backend as K
class MyNetwork():
def __init__(self, file, n_output=4, n_hidden=20, epochs=3,
dropout=0.10, batch_size=64, metrics = ['categorical_accuracy'],
optimizer = 'rmsprop', activation = 'softmax'):
[...] //Other Class attributes
def model(self):
self.model = self.sequential_model(False)
self.model.summary()
def sequential_model(self, val):
K.clear_session()
if val == False:
self.epochs = 3
regressor = Sequential()
#regressor.run_eagerly = True
regressor.add(LSTM(units = self.n_hidden, dropout=self.dropout, return_sequences = True, input_shape = (self.X.shape[1], self.X.shape[2])))
regressor.add(LSTM(units = self.n_hidden, dropout=self.dropout, return_sequences = True))
regressor.add(Dense(units = self.n_output, activation=self.activation))
self.weights = np.array([0.025,0.225,0.78,0.020])
regressor.compile(optimizer = self.optimizer, loss = self.weighted_categorical_crossentropy(self.weights), metrics = [self.metrics])
regressor.fit(self.X, self.Y*1.0,batch_size=self.batch_size, epochs=self.epochs, verbose=1, validation_data=(self.Xval, self.Yval*1.0))
return regressor
def weighted_categorical_crossentropy(self, weights):
weights = K.variable(weights)
def loss(y_true, y_pred):
y_pred /= K.sum(y_pred, axis=-1, keepdims=True)
y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
loss = y_true * K.log(y_pred) * weights
loss = -K.sum(loss, -1)
return loss
There are several problems with above code, but the most noticeable one is you don't return the loss from weighted_categorical_crossentropy. It should look more like:
def weighted_categorical_crossentropy(self, weights):
weights = K.variable(weights)
def loss(y_true, y_pred):
y_pred /= K.sum(y_pred, axis=-1, keepdims=True)
y_pred = K.clip(y_pred, K.epsilon(), 1 - K.epsilon())
loss = y_true * K.log(y_pred) * weights
loss = -K.sum(loss, -1)
return loss
return loss # Return the callable function!
The error is ValueError: No gradients provided for any variable because the loss method doesn't return anything, it returns None! If you try to fit a method with loss=None, the model will have no way of computing gradients and therefore it will throw the same exact error.
Next up is the that you are using return_sequences = True in the layer right before a non-recurrent layer. This causes the Dense layer to be called on mis-shaped data, that's appropriate only for recurrent layers. Don't use it like that.
If you have a good reason for using the return_sequences = True, then you must add Dense layer like:
model.add(keras.layers.TimeDistributed(keras.layers.Dense(...)))
This will cause the Dense layer to act on output sequence on every time step separately. This also means that your y_true must be of appropriate shape.
There could be other problems with the custom loss function that you defined, but I can not deduce the input/output shapes, so you will have to run it and add see if it works. There will probably be matrix multiplication shape mismatch.
Last but not least, think about using the Sub-classing API. Could it make any of your operations easier to write?
Thanks for reading and I'll update this answer once I have that info. Cheers.

Keras: how to implement target replication for LSTM?

Using examples from Lipton et al (2016), target replication is basically calculating the loss at each time step (except final) of the LSTM (or GRU) and averaging this loss and adding it to the main loss while training. Mathematically, it is given by -
Graphically, it can be represented as -
So how do I go about exactly implementing this in Keras? Say, I have binary classification task. Let's say my model is a simple one given below -
model.add(LSTM(50))
model.add(Dense(1))
model.compile(loss='binary_crossentropy', class_weights={0:0.5, 1:4}, optimizer=Adam(), metrics=['accuracy'])
model.fit(x_train, y_train)
I think y_train needs to be reshaped/tiled from (batch_size, 1) to (batch_size, time_step) right?
The dense layer needs TimeDistributed to be applied correctly to the LSTM after setting return_sequences=True?
How do I exactly implement the exact loss function given above? Will class_weights need to be modified?
Target replication is only during training. How to implement validation set evaluation using only the main loss?
How should I deal with zero paddings in target replication? My sequences are padded to a max_len of 15 with average length being 7. Since the target replication loss averages over all the steps, how do I make sure it doesn't use the padded words in calculating the loss? Basically, dynamically assign T the actual sequence length.
Question 1:
So, for the targets, you need it shaped as (batch_size, time_steps, 1). Just use:
y_train = np.stack([y_train]*time_steps, axis=1)
Question 2:
You're correct, but TimeDistributed is optional in Keras 2.
Question 3:
I don't know how class weights will behave, but a regular loss function should go like:
from keras.losses import binary_crossentropy
def target_replication_loss(alpha):
def inner_loss(true,pred):
losses = binary_crossentropy(true,pred)
return (alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1])
return inner_loss
model.compile(......, loss = target_replication_loss(alpha), ...)
Question 3a:
Since the above doens't work well with class weights, I created an alternative where the weights go into the loss:
def target_replication_loss(alpha, class_weights):
def get_weights(x):
b = class_weights[0]
a = class_weights[1] - b
return (a*x) + b
def inner_loss(true,pred):
#this will only work for classification with only one class 0 or 1
#and only if the target is the same for all classes
true_classes = true[:,-1,0]
weights = get_weights(true_classes)
losses = binary_crossentropy(true,pred)
return weights*((alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1]))
return inner_loss
Question 4:
To avoid complexity, I'd say you should use an additional metric in validation:
def last_step_BC(true,pred):
return binary_crossentropy(true[:,-1], pred[:,-1])
model.compile(....,
loss = target_replication_loss(alpha),
metrics=[last_step_BC])
Question 5:
This is a hard one and I'd need to research a little....
As an initial workaround, you can set the model with an input shape of (None, features), and train each sequence individually.
Working example without class_weight
def target_replication_loss(alpha):
def inner_loss(true,pred):
losses = binary_crossentropy(true,pred)
#print(K.int_shape(losses))
#print(K.int_shape(losses[:,:-1]))
#print(K.int_shape(K.mean(losses[:,:-1], axis=-1)))
#print(K.int_shape(losses[:,-1]))
return (alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1])
return inner_loss
alpha = 0.6
i1 = Input((5,2))
i2 = Input((5,2))
out = LSTM(1, activation='sigmoid', return_sequences=True)(i1)
model = Model(i1, out)
model.compile(optimizer='adam', loss = target_replication_loss(alpha))
model.fit(np.arange(30).reshape((3,5,2)), np.arange(15).reshape((3,5,1)), epochs = 200)
Working example with class weights:
def target_replication_loss(alpha, class_weights):
def get_weights(x):
b = class_weights[0]
a = class_weights[1] - b
return (a*x) + b
def inner_loss(true,pred):
#this will only work for classification with only one class 0 or 1
#and only if the target is the same for all classes
true_classes = true[:,-1,0]
weights = get_weights(true_classes)
losses = binary_crossentropy(true,pred)
print(K.int_shape(losses))
print(K.int_shape(losses[:,:-1]))
print(K.int_shape(K.mean(losses[:,:-1], axis=-1)))
print(K.int_shape(losses[:,-1]))
print(K.int_shape(weights))
return weights*((alpha*K.mean(losses[:,:-1], axis=-1)) + ((1-alpha)*losses[:,-1]))
return inner_loss
alpha = 0.6
class_weights={0: 0.5, 1:4.}
i1 = Input(batch_shape=(3,5,2))
i2 = Input((5,2))
out = LSTM(1, activation='sigmoid', return_sequences=True)(i1)
model = Model(i1, out)
model.compile(optimizer='adam', loss = target_replication_loss(alpha, class_weights))
model.fit(np.arange(30).reshape((3,5,2)), np.arange(15).reshape((3,5,1)), epochs = 200)

Sentence similarity using keras

I'm trying to implement sentence similarity architecture based on this work using the STS dataset. Labels are normalized similarity scores from 0 to 1 so it is assumed to be a regression model.
My problem is that the loss goes directly to NaN starting from the first epoch. What am I doing wrong?
I have already tried updating to latest keras and theano versions.
The code for my model is:
def create_lstm_nn(input_dim):
seq = Sequential()`
# embedd using pretrained 300d embedding
seq.add(Embedding(vocab_size, emb_dim, mask_zero=True, weights=[embedding_weights]))
# encode via LSTM
seq.add(LSTM(128))
seq.add(Dropout(0.3))
return seq
lstm_nn = create_lstm_nn(input_dim)
input_a = Input(shape=(input_dim,))
input_b = Input(shape=(input_dim,))
processed_a = lstm_nn(input_a)
processed_b = lstm_nn(input_b)
cos_distance = merge([processed_a, processed_b], mode='cos', dot_axes=1)
cos_distance = Reshape((1,))(cos_distance)
distance = Lambda(lambda x: 1-x)(cos_distance)
model = Model(input=[input_a, input_b], output=distance)
# train
rms = RMSprop()
model.compile(loss='mse', optimizer=rms)
model.fit([X1, X2], y, validation_split=0.3, batch_size=128, nb_epoch=20)
I also tried using a simple Lambda instead of the Merge layer, but it has the same result.
def cosine_distance(vests):
x, y = vests
x = K.l2_normalize(x, axis=-1)
y = K.l2_normalize(y, axis=-1)
return -K.mean(x * y, axis=-1, keepdims=True)
def cos_dist_output_shape(shapes):
shape1, shape2 = shapes
return (shape1[0],1)
distance = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([processed_a, processed_b])
The nan is a common issue in deep learning regression. Because you are using Siamese network, you can try followings:
check your data: do they need to be normalized?
try to add an Dense layer into your network as the last layer, but be careful picking up an activation function, e.g. relu
try to use another loss function, e.g. contrastive_loss
smaller your learning rate, e.g. 0.0001
cos mode does not carefully deal with division by zero, might be the cause of NaN
It is not easy to make deep learning work perfectly.
I didn't run into the nan issue, but my loss wouldn't change. I found this info
check this out
def cosine_distance(shapes):
y_true, y_pred = shapes
def l2_normalize(x, axis):
norm = K.sqrt(K.sum(K.square(x), axis=axis, keepdims=True))
return K.sign(x) * K.maximum(K.abs(x), K.epsilon()) / K.maximum(norm, K.epsilon())
y_true = l2_normalize(y_true, axis=-1)
y_pred = l2_normalize(y_pred, axis=-1)
return K.mean(1 - K.sum((y_true * y_pred), axis=-1))

Categories

Resources