I'm working on a custom transformer model where the training steps method goes like this:
#simplified version of my training method. where model = myTransformerModel()
for windows in data: #step through data
l1 = model(window)
loss = torch.mean(l1)
optimizer.zero_grad()
loss.backward(retain_graph=True)
optimizer.step()
scheduler.step()
I'm trying to recreate this in TensorFlow, currently its like this:
for windows in data: #step through data
with tf.GradientTape() as tape:
l1 = model.call(window)
loss = tf.reduce_mean(l1)
train = optimizer.minimize(loss, var_list=model.trainable_variables,tape=tape)
This functions, but causes the scheduler to step with every window, which throws off the learning rate.
I have also tried this in place of the minimize line:
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
Is there a good way to make the TensorFlow model more like the PyTorch one? Is there a better way to implement my steps with the gradienttape?
Related
I am currently trying to train a model using tf.GradientTape, as model.fit(...) from keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, tf.GradientTape does not.
During training, the loss using the tf.GradientTape custom workflow will first slightly decrease, but then become stuck and not improve any further, no matter how many epochs I run. The chosen metric will also not change after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero to something very large. The running loss is more stable but shows the model not improving.
This is all in contrast to using model.fit(...), where loss and metrics are improving immediately.
My code:
def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
x1 = Input(62)
x2 = Input((62, 3))
x = Embedding(30, 100, mask_zero=True)(x1)
x = Concatenate()([x, x2])
x = Bidirectional(LSTM(500,
return_sequences=True,
kernel_regularizer=kernel_regularizer,
dropout=dropout,
recurrent_dropout=recurrent_dropout))(x)
x = Bidirectional(LSTM(500,
return_sequences=False,
kernel_regularizer=kernel_regularizer,
dropout=dropout,
recurrent_dropout=recurrent_dropout))(x)
x = Activation('softmax')(x)
x = Dense(1000)(x)
x = Dense(500)(x)
x = Dense(250)(x)
x = Dense(1, bias_initializer='ones')(x)
x = tf.math.abs(x)
return Model(inputs=[x1, x2], outputs=x)
optimizer = Adam(learning_rate=0.0001)
model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA
dat_train = tf.data.Dataset.from_generator(
generator= lambda: <load_function()>
output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)
# keras training
model.fit(dat_train, epochs=50)
# custom training
for epoch in range(50):
for (x1, x2), y in dat_train:
with tf.GradientTape() as tape:
y_pred = model((x1, x2), training=True)
loss = model.loss(y, y_pred)
grads = tape.gradient(loss, model.trainable_variables)
model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
I could use relu at the output layer, however, I found the abs to be more robust. Changing it does not change the outcome. The input x1 of the model is a sequence, x2 are some additional features, that are later concatenated to the embedded x1 sequence. For my approach, I'm not using the MSE, but it works either way.
I could provide some data, however, my dataset is quite large, so I would need to extract a bit out of it.
All in all, my problem seems to be similar to:
Keras model doesn't train when using GradientTape
Edit 1
The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model.
Additionally, some things I noticed:
The custom training takes roughly 2x the amount of time compared to model.fit(...).
The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know if that's normal and don't know how to compare it to the gradients provided by model.fit(...).
Edit 2
I've added a Google Colab notebook to reproduce the issue:
https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing
The loss and MSE for 20 epochs is shown here:
custom training
keras training
While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in losses. It matches the behavior in the custom training run image.
So far, I've noticed two ways of improving the performance of the custom training:
The usage of custom layer initialization
Using MSE as a loss function
Using the MSE, compared to my own loss function actually improves the custom training performance. Still, using MSE and/or different initialization won't come close to the performance of keras fit.
I have found the solution, it was a simple shape mismatch, which was somehow not picked up by any error check and worked both with my custom loss function and MSE. Using x = Reshape(())(x) as final layer did the trick.
I started using Pytorch and I'm currently working on a Project where I'm using a simple feed forward neural network for linear regression. The Problem is I didn't find anything in Pytorch that allows me to get the Accuracy of a linear regression Model as in Keras or in SKlearn. in keras it would be simple just by setting metrics=["accuracy"] inside the compile function. I searched in the docs and official website of Pytorch but I didn't find anything. seems that this API doesn't exist in Pytorch. I know that I can observe the loss during training or I can simply get the test loss and based on it I can know whether the loss decreased or not but I want to use that Keras Structure where I get the loss value and also an Accuracy value. the Keras way looks more clear. I also tried to implement an accuracy function using the r2_score from sklearn but it gave me wierd values:
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
def train(model, optimizer, loss_fn):
def train_step(x, y):
model.train()
optimizer.zero_grad()
out = model(x)
loss = loss_fn(out, y)
loss.backward()
optimizer.step()
return loss.item()
return train_step
def fit(epochs=100):
train_func = train(model, optimizer, criterion)
count, total = 0, 0
loss_list, accuracy_list, iters = [], [], []
for e in range(epochs):
for X, y in train_loader:
loss = train_func(X, y)
count += 1
total += len(y)
if count % 50 == 0:
print("loss= ", loss)
loss_list.append(loss)
iters.append(total)
if count % 100 == 0:
model.eval() # im not sure if we can do this in pytorch. I mean evaluating the model while training! it would be great if you tell me whether this is ok or not
out = model(X)
out = out.detach().numpy()
y = y.detach().numpy()
accuracy = r2_score(y, out) # r2_score is the scikit learn r2 score function.
print("accuracy = ", accuracy) # here i get wierd values and it doesn't get better over time, in contrast the loss decreased over time
accuracy_list.append(accuracy)
return iters, loss_list, accuracy_list
I know how to implement an Accuracy function in case of Classification Problem because it is using discrete values. that is clear to me because the implementation is easy and clear. I must only look which correct prediction did the model made and then calculate accuracy. but in this Case I have continuous values so that's why I couldn't implement the function myself and it surprised me that Pytorch don't have a built in function for this. so could someone maybe tell me how to implement this or where to find an Implementation of it?
another thing is where to use the evaluation and where to set the model in evaluation mode by calling the eval function. should I use it during training like I did in my Code or should I train and then test after training and if I test during training should I call the eval function as I did there or it will affect the training when the loop goes back to training mode? another thing I didn't find it also in Pytorch which is Cross validation. how should I implement it in pytorch if there is no API for it like in Keras?
Accuracy does not exist in regression problems.
A similar measure of "accuracy" for a regression problem might be the R-squared score.
If you are using pytorch to train your neural networks, have a look at the package
torchmetrics. You may find what you need there.
correct = 0
total = 0
with torch.no_grad():
for data in testloader:
images, labels = data
outputs = net(images)
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (
100 * correct / total))
Look here for more info: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
I would like to perform transfer learning with pretrained model of keras
import tensorflow as tf
from tensorflow import keras
base_model = keras.applications.MobileNetV2(input_shape=(96, 96, 3), include_top=False, pooling='avg')
x = base_model.outputs[0]
outputs = layers.Dense(10, activation=tf.nn.softmax)(x)
model = keras.Model(inputs=base_model.inputs, outputs=outputs)
Training with keras compile/fit functions can converge
model.compile(optimizer=keras.optimizers.Adam(), loss=keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
history = model.fit(train_data, epochs=1)
The results are: loss: 0.4402 - accuracy: 0.8548
I wanna train with tf.GradientTape, but it can't converge
optimizer = keras.optimizers.Adam()
train_loss = keras.metrics.Mean()
train_acc = keras.metrics.SparseCategoricalAccuracy()
def train_step(data, labels):
with tf.GradientTape() as gt:
pred = model(data)
loss = keras.losses.SparseCategoricalCrossentropy()(labels, pred)
grads = gt.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
train_loss(loss)
train_acc(labels, pred)
for xs, ys in train_data:
train_step(xs, ys)
print('train_loss = {:.3f}, train_acc = {:.3f}'.format(train_loss.result(), train_acc.result()))
But the results are: train_loss = 7.576, train_acc = 0.101
If I only train the last layer by setting
base_model.trainable = False
It converges and the results are: train_loss = 0.525, train_acc = 0.823
What's the problem with the codes? How should I modify it? Thanks
Try RELU as activation function. It may be Vanishing Gradient issue which occurs if you use activation function other than RELU.
Following my comment, the reason why it didn't converge is because you picked a learning rate that was too big. This causes the weight to change too much and the loss to explode. When setting base_model.trainable to False, most of the weight in the networks were fixed and the learning rate was a good fit for your last layers. Here's a picture :
As a general rule, your learning rate should always be chosen for each experiments.
Edit : Following Wilson's comment, I'm not sure this is the reason you have different results but this could be it :
When you specify your loss, your loss is computed on each element of the batch, then to get the loss of the batch, you can take the sum or the mean of the losses, depending on which one you chose, you get a different magnitude. For example, if your batch size is 64, summing the loss will yield you a 64 times bigger loss which will yield 64 times bigger gradient, so choosing sum over mean with a batch size 64 is like picking a 64 times bigger learning rate.
So maybe the reason you have different results is that by default a keras.losses wrapped in a model.compile has a different reduction method. In the same vein, if the loss is reduced by a sum method, the magnitude of the loss depends on the batch size, if you have twice the batch size, you get (on average) twice the loss, and twice the gradient and so it's like doubling the learning rate.
My advice is to check the reduction method used by the loss to be sure it's the same in both case, and if it's sum, to check that the batch size is the same. I would advise to use mean reduction in general since it's not influenced by batch size.
I'm using a very simple NN with a normalized word2vec as input.
When running my train (based on the mini batch) the train cost start around 1020 and decrease around 1000 but never less than this and my accuracy is around 50%.
Why doesn't the cost decrease ? How can I verify that the weigth matrice is updated at each run?
apply_weights_OP = tf.matmul(X, weights, name="apply_weights")
add_bias_OP = tf.add(apply_weights_OP, bias, name="add_bias")
activation_OP = tf.nn.sigmoid(add_bias_OP, name="activation")
cost_OP = tf.nn.l2_loss(activation_OP-yGold, name="squared_error_cost")
optimizer = tf.train.AdamOptimizer(0.001)
global_step = tf.Variable(0, name='global_step', trainable=False)
training_OP = optimizer.minimize(cost_OP, global_step=global_step)
correct_predictions_OP = tf.equal(
tf.argmax(activation_OP, 0),
tf.argmax(yGold, 0)
)
accuracy_OP = tf.reduce_mean(tf.cast(correct_predictions_OP, "float"))
newCost, train_accuracy, _ = sess.run(
[cost_OP, accuracy_OP, training_OP],
feed_dict={
X: trainX[indice_bas: indice_haut],
yGold: trainY[indice_bas: indice_haut]
}
)
Thanks
try using cross entropy instead of the L2 loss, also there is no real point in having an activation function on your output layer.
The examples that ship with tensorflow actually have a basic model that is very similar to what you are trying.
btw: it might also be that the problem you are trying to learn is simply not solvable by a simple linear model (i.e. what you are trying to do), try using a deeper model. Here is an example of a 2 layer deep multilayer perceptron.
I would like to train the weights of a model based on the sum of the loss value of several batches. However it seems that once you run the graph for each of the individual batches, the object that is returned is just a regular numpy array. So when you try and use an optimizer like GradientDescentOptimizer, it no longer has information about the variables that were used to calculate the sum of the losses, so it can't find the gradients of the weights that what help minimize the loss. Here's an example tensorflow script to illustrate what I'm talking about:
weights = tf.Variable(tf.ones([num_feature_values], tf.float32))
feature_values = tf.placeholder(tf.int32, shape=[num_feature_values])
labels = tf.placeholder(tf.int32, shape=[1])
loss_op = some_loss_function(weights, feature_values, labels)
with tf.Session() as sess:
for batch in batches:
feed_dict = fill_feature_values_and_labels(batch)
#Calculates loss for one batch
loss = sess.run(loss_op, feed_dict=feed_dict)
#Adds it to total loss
total_loss += loss
# Want to train weights to minimize total_loss, however this
# doesn't work because the graph has already been run.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(total_loss)
with tf.Session() as sess:
for step in xrange(num_steps):
sess.run(optimizer)
The total_loss is a numpy array and thus cannot be used in the optimizer. Does anyone know a way around the problem, where I want to use information across many batches but still need the graph intact in order to preserve the fact that the total_loss is a function of the weights?
The thing you optimize in any of the trainers must be a part of the graph, here what you train on is the actual realized result, so it won't work.
I think the way you should probably do this is to construct your input as a batch of batches e.g.
intput = tf.placeholder("float", (number_of_batches, batch_size, input_size)
Then have your target also be a 3d tensor which can be trained on.