ValueError- Scratch Training for Multiple Output in tf.keras

ValueError- Scratch Training for Multiple Output in tf.keras - python

I have a ML models which have 3 output. Each of them have passed into a SoftMax player. I'm trying to train the model from custom scratch with training loop in keras. My generator output are something like this:
x,y = (batch_size, h,w,c), [(batch_size, 2), (batch_size, 8), (batch_size, 10)]
As you notice, each of the 3 output has different numbers of labels. First output layer has 2 labels output, second one has 8 and last one has 10.
Here is the code snippet,
epochs = 1
optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)
loss_fn = tf.keras.losses.CategoricalCrossentropy()
train_acc_metric = tf.keras.metrics.Accuracy()
val_acc_metric = tf.keras.metrics.Accuracy()
#tf.function
def train_step(x, y):
with tf.GradientTape() as tape:
logits = model(x, training=True) # Logits for this minibatch
# Compute the loss value for this minibatch.
train_loss_value = loss_fn(y, logits)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(train_loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
# Update training metric.
train_acc_metric.update_state(y, logits)
return train_loss_value
And,
for epoch in range(epochs):
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_generator):
train_loss_value = train_step(x_batch_train, y_batch_train)
# Reset metrics at the end of each epoch
train_acc_metric.reset_states()
When I run the code using fit method, it works nicely. But approaching with custom scratch training like above throws following error message:
ValueError: Dimension 1 in both shapes must be equal, but are 8 and 10. Shapes are [64,8] and [64,10].
From merging shape 1 with other shapes.
for '{{node categorical_crossentropy/packed}} = Pack[N=3, T=DT_FLOAT, axis=0](net_2/gra/Softmax, net_2/vow/Softmax, net_2/cons/Softmax)'
with input shapes: [64,2], [64,8], [64,10].
I suspect, I am not handling properly the computation of loss in here: loss_fn(y, logits). y has 3 output, logits has also 3 output. When I pass loss_fn(y[0], logits[0], it works for the first output from rest.
However, any suggestion would be appreciated. Thanks.

Related

BERT Classifier ValueError: Target size (torch.Size([4, 1])) must be the same as input size (torch.Size([4, 2]))

I'm training a Classifier Model but it's a few days that I cannot overcame a problem!
I have the ValueError: Target size (torch.Size([4, 1])) must be the same as input size (torch.Size([4, 2])) error but actually it seems correct to me ! Indeed I used unsqueeze(1) to put them of the same size. WHat else I can try? Thank you!
class SequenceClassifier(nn.Module):
def __init__(self, n_classes):
super(SequenceClassifier, self).__init__()
self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME,return_dict=False)
self.drop = nn.Dropout(p=0.3)
self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
def forward(self, input_ids, attention_mask):
_, pooled_output = self.bert(
input_ids=input_ids,
attention_mask=attention_mask
)
output = self.drop(pooled_output)
return self.out(output)
model = SequenceClassifier(len(class_names))
model = model.to(device)
EPOCHS = 10
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=0,
num_training_steps=total_steps
)
weights=[0.5,1]
pos_weight=torch.FloatTensor(weights).to(device)
loss_fn=nn.BCEWithLogitsLoss(pos_weight=pos_weight)
def train_epoch(
model,
data_loader,
loss_fn,
optimizer,
device,
scheduler,
n_examples
):
model = model.train()
losses = []
correct_predictions = 0
for d in data_loader:
input_ids = d["input_ids"].to(device)
attention_mask = d["attention_mask"].to(device)
targets = d["targets"].to(device)
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask
)
_, preds = torch.max(outputs, dim=1)
targets = targets.unsqueeze(1)
loss = loss_fn(outputs, targets)
correct_predictions += torch.sum(preds == targets)
losses.append(loss.item())
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
return correct_predictions.double() / n_examples, np.mean(losses)
%%time
history = defaultdict(list)
best_accuracy = 0
for epoch in range(EPOCHS):
print(f'Epoch {epoch + 1}/{EPOCHS}')
print('-' * 10)
train_acc, train_loss = train_epoch(
model,
train_data_loader,
loss_fn,
optimizer,
device,
scheduler,
len(df_train)
)
print(f'Train loss {train_loss} accuracy {train_acc}')
val_acc, val_loss = eval_model(
model,
val_data_loader,
loss_fn,
device,
len(df_val)
)
print(f'Val loss {val_loss} accuracy {val_acc}')
print()
history['train_acc'].append(train_acc)
history['train_loss'].append(train_loss)
history['val_acc'].append(val_acc)
history['val_loss'].append(val_loss)
if val_acc > best_accuracy:
torch.save(model.state_dict(), 'best_model_state.bin')
best_accuracy = val_acc
ValueError: Target size (torch.Size([4, 1])) must be the same as input size (torch.Size([4, 2]))
EDIT
I have a binary classifier problem, indeed i have 2 classes encoded 0 ("bad") and 1 ("good").

In case anyone stumbles on this like I did, I'll write out an answer since there aren't a lot of google hits for this target size/input size error and the previous answer has some factual inaccuracies.
Unlike the previous answer would suggest, the real problem isn't with the loss function but with the output of the model.nn.BCEWithLogitsLoss is completely fine for multi-label and multi-class applications. Chiara updated her post saying that in fact she has a binary classification problem, but even that should not be a problem for this loss function. So why the error?
The original code has:
outputs = model(
input_ids=input_ids,
attention_mask=attention_mask
)
_, preds = torch.max(outputs, dim=1)
This means "Run the model, then create preds with the row indeces of the highest output of the model". Obviously, there is only a "index of highest" if there are multiple predicted values. Multiple output values usually means multiple input classes, so I can see why Shai though this was multi-class. But why would we get multiple outputs from a binary classifier?
As it turns out, BERT (or Huggingface anyway) for binary problems expects that n_classes is set to 2 -- setting classes to 1 puts the model in regression mode. This means that under the hood, binary problems are treated like a two-class problem, outputting predictions with the size [2, batch size] -- one column predicting the chance of it being a 1 and one for the chance of it being 0. The loss fucntion throws an error because it is supplied with only one row of one-hot encoded labels: targets = d["targets"].to(device) so the labels have dimensions [batch size] or after the unsqueeze, [1, batch size]. Either way, the dimensions don't match up.
Some loss functions can deal with this fine, but others require the exact same dimensions. To make things more frustrating, for version 1.10, nn.BCEWithLogitsLoss requires matching dimensions but later versions do not.
One solution may therefore be to update your pytorch (version 1.11 would work for example).
For me, this was not an option, so I ended up going with a different loss function. nn.CrossEntropyLoss, as suggested by Shai, indeed does the trick as it accepts any input with the same length. In other words, they had a working solution for the wrong reasons.

You are using nn.BCEWithLogitsLoss loss function. This loss function works for binary classification tasks, and expects the predictions and the targets to be of the same shape (and float data type).
This is in contrast to the multi-class CE loss function nn.CrossEntropyLoss that expects the targets to be integers pointing to the proper place in the predicted class probabilities.
Please read carefully the doc of the functions you are using and make sure you are using them correctly.

How to generate predictions from new data using trained tensorflow network?

I want to train Googles VGGish network (Hershey et al 2017) from scratch to predict classes specific to my own audio files.
For this I am using the vggish_train_demo.py script available on their github repo which uses tensorflow. I've been able to modify the script to extract melspec features from my own audio by changing the _get_examples_batch() function, and, then train the model on the output of this function. This runs to completetion and prints the loss at each epoch.
However, I've been unable to figure out how to get this trained model to generate predictions from new data. Can this be done with changes to the vggish_train_demo.py script?

For anyone who stumbles across this in the future, I wrote this script which does the job. You must save logmel specs for train and test data in the arrays: X_train, y_train, X_test, y_test. The X_train/test are arrays of the (n, 96,64) features and the y_train/test are arrays of shape (n, _NUM_CLASSES) for two classes, where n = the number of 0.96s audio segments and _NUM_CLASSES = the number of classes used.
See the function definition statement for more info and the vggish github in my original post:
### Run the network and save the predictions and accuracy at each epoch
### Train NN, output results
r"""This uses the VGGish model definition within a larger model which adds two
layers on top, and then trains this larger model.
We input log-mel spectrograms (X_train) calculated above with associated labels
(y_train), and feed the batches into the model. Once the model is trained, it
is then executed on the test log-mel spectrograms (X_test), and the accuracy is
ouput, alongside a .csv file with the predictions for each 0.96s chunk and their
true class."""
def main(X):
with tf.Graph().as_default(), tf.Session() as sess:
# Define VGGish.
embeddings = vggish_slim.define_vggish_slim(training=FLAGS.train_vggish)
# Define a shallow classification model and associated training ops on top
# of VGGish.
with tf.variable_scope('mymodel'):
# Add a fully connected layer with 100 units. Add an activation function
# to the embeddings since they are pre-activation.
num_units = 100
fc = slim.fully_connected(tf.nn.relu(embeddings), num_units)
# Add a classifier layer at the end, consisting of parallel logistic
# classifiers, one per class. This allows for multi-class tasks.
logits = slim.fully_connected(
fc, _NUM_CLASSES, activation_fn=None, scope='logits')
tf.sigmoid(logits, name='prediction')
linear_out= slim.fully_connected(
fc, _NUM_CLASSES, activation_fn=None, scope='linear_out')
logits = tf.sigmoid(linear_out, name='logits')
# Add training ops.
with tf.variable_scope('train'):
global_step = tf.train.create_global_step()
# Labels are assumed to be fed as a batch multi-hot vectors, with
# a 1 in the position of each positive class label, and 0 elsewhere.
labels_input = tf.placeholder(
tf.float32, shape=(None, _NUM_CLASSES), name='labels')
# Cross-entropy label loss.
xent = tf.nn.sigmoid_cross_entropy_with_logits(
logits=logits, labels=labels_input, name='xent')
loss = tf.reduce_mean(xent, name='loss_op')
tf.summary.scalar('loss', loss)
# We use the same optimizer and hyperparameters as used to train VGGish.
optimizer = tf.train.AdamOptimizer(
learning_rate=vggish_params.LEARNING_RATE,
epsilon=vggish_params.ADAM_EPSILON)
train_op = optimizer.minimize(loss, global_step=global_step)
# Initialize all variables in the model, and then load the pre-trained
# VGGish checkpoint.
sess.run(tf.global_variables_initializer())
vggish_slim.load_vggish_slim_checkpoint(sess, FLAGS.checkpoint)
# The training loop.
features_input = sess.graph.get_tensor_by_name(
vggish_params.INPUT_TENSOR_NAME)
accuracy_scores = []
for epoch in range(num_epochs):#FLAGS.num_batches):
epoch_loss = 0
i=0
while i < len(X_train):
start = i
end = i+batch_size
batch_x = np.array(X_train[start:end])
batch_y = np.array(y_train[start:end])
_, c = sess.run([train_op, loss], feed_dict={features_input: batch_x, labels_input: batch_y})
epoch_loss += c
i+=batch_size
#print no. of epochs and loss
print('Epoch', epoch+1, 'completed out of', num_epochs,', loss:',epoch_loss) #FLAGS.num_batches,', loss:',epoch_loss)
#If these lines are left here, it will evaluate on the test data every iteration and print accuracy
#note this adds a small computational cost
correct = tf.equal(tf.argmax(logits, 1), tf.argmax(labels_input, 1)) #This line returns the max value of each array, which we want to be the same (think the prediction/logits is value given to each class with the highest value being the best match)
accuracy = tf.reduce_mean(tf.cast(correct, 'float')) #changes correct to type: float
accuracy1 = accuracy.eval({features_input:X_test, labels_input:y_test})
accuracy_scores.append(accuracy1)
print('Accuracy:', accuracy1)#TF is smart so just knows to feed it through the model without us seeming to tell it to.
#Save predictions for test data
predictions_sigm = logits.eval(feed_dict = {features_input:X_test}) #not really _sigm, change back later
#print(predictions_sigm) #shows table of predictions, meaningless if saving at each epoch
test_preds = pd.DataFrame(predictions_sigm, columns = col_names) #converts predictions to df
true_class = np.argmax(y_test, axis = 1) #This saves the true class
test_preds['True class'] = true_class #This adds true class to the df
#Saves csv file of table of predictions for test data. NB. header will not save when using np.text for some reason
np.savetxt("/content/drive/MyDrive/..."+"Epoch_"+str(epoch+1)+"_Accuracy_"+str(accuracy1), test_preds.values, delimiter=",")
if __name__ == '__main__':
tf.app.run()
#'An exception has occurred, use %tb to see the full traceback.' error will occur, fear not, this just means its finished (perhaps as its exited the tensorflow session?)

Linear classifier with PyTorch from scratch questions

I'm trying to implement linear classifier in PyTorch, using 1 layer with tensors W and b, softmax and cross entropy loss. For each batch I have to:
Calculate logits
Transform logits to probabilities with softmax
Compute most probable classes
Calculate cross entropy between true and predicted classes
Use an optimizer to change W and b
So far I have (I have flat MNIST loaded with Scikit-learn):
# convert Numpy arrays to PyTorch tensor Variables
input_X_train = torch.from_numpy(X_train_flat).float().to(device)
input_X_val = torch.from_numpy(X_val_flat).float().to(device)
input_X_test = torch.from_numpy(X_test_flat).float().to(device)
input_y_train = torch.from_numpy(y_train).long().to(device)
input_y_val = torch.from_numpy(y_val).long().to(device)
input_y_test = torch.from_numpy(y_test).long().to(device)
# model parameters: W and b
W = torch.randn(input_dim, output_dim, device=device, dtype=dtype, requires_grad=True)
b = torch.randn(1, device=device, dtype=dtype, requires_grad=True)
BATCH_SIZE = 512
EPOCHS = 40
LEARNING_RATE = 1e-6
# create torch.optim.Adam optimizer for loss function minimization
optimizer = torch.optim.Adam([W, b], lr=LEARNING_RATE)
# create negative log loss function object for loss function evaluation
# use mean loss value from all batch samples
loss_fn = torch.nn.NLLLoss(reduction="mean")
for t in range(EPOCHS):
# logits for input_X, resulting shape should be [input_X.shape[0], 10]
logits = torch.matmul(input_X_train, W) + b
# apply torch.nn.functional.softmax (torch_F.softmax) to logits
probas = torch_f.softmax(logits, dim=1)
# apply torch.argmax to find a class index with highest probability
classes = torch.argmax(probas, dim=1)
# loss should be a scalar number: average loss over all the objects with torch.mean()
# PyTorch implements negative log loss (NLL) *without* log - you have to first compute log of
# softmax, then negative log loss, which will swap sign
# Use torch.nn.functional.log_softmax (torch_f.log_softmax) on top of input_y and logits
# It is identical to calculating cross-entropy (log and then NLL) on top of probas,
# but is more numerically friendly (read the docs).
log_probas = torch_f.log_softmax(logits, dim=1)
loss = loss_fn(log_probas, input_y_train)
# Before the backward pass, use the optimizer object to zero all of the
# gradients for the variables it will update (which are the learnable
# weights of the model). This is because by default, gradients are
# accumulated in buffers( i.e, not overwritten) whenever .backward()
# is called. Checkout docs of torch.autograd.backward for more details.
optimizer.zero_grad()
# calculate backward gradients for backpropagation
loss.backward()
# Calling the step function on an Optimizer makes an update to its parameters
optimizer.step()
For some reason, the W and b don't change. What am I doing wrong?
EDIT:
I've seen and tried in the code above e. g. this minimal working example https://discuss.pytorch.org/t/minimal-working-example-of-optim-sgd/11623/2.
EDIT 2:
Gradients W.grad are often, I think it should not be like that. Probabilities of classes are definitely right (so it's not e. g. like this example), since I've checked sum of every row and probabilities of all classes for each sample sum to 1.

Transfer learning with pretrained model by tf.GradientTape can't converge

I would like to perform transfer learning with pretrained model of keras
import tensorflow as tf
from tensorflow import keras
base_model = keras.applications.MobileNetV2(input_shape=(96, 96, 3), include_top=False, pooling='avg')
x = base_model.outputs[0]
outputs = layers.Dense(10, activation=tf.nn.softmax)(x)
model = keras.Model(inputs=base_model.inputs, outputs=outputs)
Training with keras compile/fit functions can converge
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
history = model.fit(train_data, epochs=1)
The results are: loss: 0.4402 - accuracy: 0.8548
I wanna train with tf.GradientTape, but it can't converge
optimizer = keras.optimizers.Adam()
train_loss = keras.metrics.Mean()
train_acc = keras.metrics.SparseCategoricalAccuracy()
def train_step(data, labels):
with tf.GradientTape() as gt:
pred = model(data)
loss = keras.losses.SparseCategoricalCrossentropy()(labels, pred)
grads = gt.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
train_loss(loss)
train_acc(labels, pred)
for xs, ys in train_data:
train_step(xs, ys)
print('train_loss = {:.3f}, train_acc = {:.3f}'.format(train_loss.result(), train_acc.result()))
But the results are: train_loss = 7.576, train_acc = 0.101
If I only train the last layer by setting
base_model.trainable = False
It converges and the results are: train_loss = 0.525, train_acc = 0.823
What's the problem with the codes? How should I modify it? Thanks

Try RELU as activation function. It may be Vanishing Gradient issue which occurs if you use activation function other than RELU.

Following my comment, the reason why it didn't converge is because you picked a learning rate that was too big. This causes the weight to change too much and the loss to explode. When setting base_model.trainable to False, most of the weight in the networks were fixed and the learning rate was a good fit for your last layers. Here's a picture :
As a general rule, your learning rate should always be chosen for each experiments.
Edit : Following Wilson's comment, I'm not sure this is the reason you have different results but this could be it :
When you specify your loss, your loss is computed on each element of the batch, then to get the loss of the batch, you can take the sum or the mean of the losses, depending on which one you chose, you get a different magnitude. For example, if your batch size is 64, summing the loss will yield you a 64 times bigger loss which will yield 64 times bigger gradient, so choosing sum over mean with a batch size 64 is like picking a 64 times bigger learning rate.
So maybe the reason you have different results is that by default a keras.losses wrapped in a model.compile has a different reduction method. In the same vein, if the loss is reduced by a sum method, the magnitude of the loss depends on the batch size, if you have twice the batch size, you get (on average) twice the loss, and twice the gradient and so it's like doubling the learning rate.
My advice is to check the reduction method used by the loss to be sure it's the same in both case, and if it's sum, to check that the batch size is the same. I would advise to use mean reduction in general since it's not influenced by batch size.

Change loss function dynamically during training in Keras, without recompiling other model properties like optimizer

Is it possible to set model.loss in a callback without re-compiling model.compile(...) after (since then the optimizer states are reset), and just recompiling model.loss, like for example:
class NewCallback(Callback):
def __init__(self):
super(NewCallback,self).__init__()
def on_epoch_end(self, epoch, logs={}):
self.model.loss=[loss_wrapper(t_change, current_epoch=epoch)]
self.model.compile_only_loss() # is there a version or hack of
# model.compile(...) like this?
To expand more with previous examples on stackoverflow:
To achieve a loss function which depends on the epoch number, like (as in this stackoverflow question):
def loss_wrapper(t_change, current_epoch):
def custom_loss(y_true, y_pred):
c_epoch = K.get_value(current_epoch)
if c_epoch < t_change:
# compute loss_1
else:
# compute loss_2
return custom_loss
where "current_epoch" is a Keras variable updated with a callback:
current_epoch = K.variable(0.)
model.compile(optimizer=opt, loss=loss_wrapper(5, current_epoch),
metrics=...)
class NewCallback(Callback):
def __init__(self, current_epoch):
self.current_epoch = current_epoch
def on_epoch_end(self, epoch, logs={}):
K.set_value(self.current_epoch, epoch)
One can essentially turn python code into compositions of backend functions for the loss to work as follows:
def loss_wrapper(t_change, current_epoch):
def custom_loss(y_true, y_pred):
# compute loss_1 and loss_2
bool_case_1=K.less(current_epoch,t_change)
num_case_1=K.cast(bool_case_1,"float32")
loss = (num_case_1)*loss_1 + (1-num_case_1)*loss_2
return loss
return custom_loss
it works.
I am not satisfied with these hacks, and wonder, is it possible to set model.loss in a callback without re-compiling model.compile(...) after (since then the optimizer states are reset), and just recompiling model.loss?

I hope you have found a solution to your problem by now but using tensorflow I think you can solve this by building a custom training loop (here is the doc). this will not override the loss attribute as you requested however you can probably achieve what you are looking for.
example
initializing variable
modifying the example from the documentation, with a model and dataset as such:
inputs = tf.keras.Input(shape=(784,), name="digits")
x1 = tf.keras.layers.Dense(64, activation="relu")(inputs)
x2 = tf.keras.layers.Dense(64, activation="relu")(x1)
outputs = tf.keras.layers.Dense(10, name="predictions")(x2)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
# Prepare the training dataset.
batch_size = 64
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = np.reshape(x_train, (-1, 784))
x_test = np.reshape(x_test, (-1, 784))
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)
we can define our two loss functions (the two I chose make no sense from a scientific point of view but allow us to check the code works)
# Instantiate an optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
# Instantiate a loss function.
loss_1 = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_2 = lambda y_true, y_pred: -1 * loss_1(y_true, y_pred)
training loop
we can then execute our custom training loop:
epochs = 10
for epoch in range(epochs):
print("\nStart of epoch %d" % (epoch,))
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
# Open a GradientTape to record the operations run
# during the forward pass, which enables auto-differentiation.
loss_fn = loss_1 if epoch % 2 else loss_2
with tf.GradientTape() as tape:
# Run the forward pass of the layer.
# The operations that the layer applies
# to its inputs are going to be recorded
# on the GradientTape.
logits = model(x_batch_train, training=True) # Logits for this minibatch
# Compute the loss value for this minibatch.
loss_value = loss_fn(y_batch_train, logits)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
# Log every 200 batches.
if step % 200 == 0:
print(
"Training loss (for one batch) at step %d: %.4f"
% (step, float(loss_value))
)
print("Seen so far: %s samples" % ((step + 1) * 64))
and we check the output is what we want (alternate positive and negative losses)
Start of epoch 0
Training loss (for one batch) at step 0: -96.1003
Seen so far: 64 samples
Training loss (for one batch) at step 200: -3383849.5000
Seen so far: 12864 samples
Training loss (for one batch) at step 400: -40419124.0000
Seen so far: 25664 samples
Training loss (for one batch) at step 600: -149133008.0000
Seen so far: 38464 samples
Training loss (for one batch) at step 800: -328322816.0000
Seen so far: 51264 samples
Start of epoch 1
Training loss (for one batch) at step 0: 580457984.0000
Seen so far: 64 samples
Training loss (for one batch) at step 200: 297710528.0000
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 213328544.0000
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 159328976.0000
Seen so far: 38464 samples
Training loss (for one batch) at step 800: 105737024.0000
Seen so far: 51264 samples
drawbacks and further improvments
the problem with writing custom loops as such is that you will loose the convenience of keras's fit method. I think you can manage this by defining a custom model and overriding the train_step as shown here in the documentation
If you really need to have the loss attribute of your model changed, you can set the compiled_loss attribute using a keras.engine.compile_utils.LossesContainer (here is the reference) and set model.train_function to model.make_train_function() (so that the new loss gets taken into account).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.