Does anyone have an example of using a Keras model on a TPU pod?
I have a model-creating method that returns a Keras model compiled within a TPU strategy scope, as recommended by many examples on using TPUs with Keras. This works on a v3-8 but gives an error with more cores (specifically a v3-32):
with strategy.scope():
    keras_model = create_model()
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
    keras_model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
When running model.fit, it fails with the following error:
Failed copying input tensor from /job:worker/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:1/device:CPU:0 in order to run DatasetFromGraph: FetchOutputs node : not found [Op:DatasetFromGraph]
The model inputs are numpy arrays. Is a tf.data.Dataset perhaps required?
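For reference, I imagine the wrapping would look roughly like this (x_train, y_train, and batch_size are placeholder names, not my actual variables):

import tensorflow as tf

# Sketch: feed the numpy arrays through a tf.data.Dataset so the input
# pipeline can be distributed across the pod's workers.
dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000)
    .batch(batch_size, drop_remainder=True)  # fixed batch shape for TPUs
)

keras_model.fit(dataset, epochs=10)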
I am training a transformer model for a chatbot. I thought of saving checkpoints in Colab so I can reuse the trained model whenever required after training is done.
I have followed the model-saving tutorial from TensorFlow, but it keeps giving me the following error.
UnimplementedError: File system scheme '[local]' not implemented (file: 'training_1/cp.ckpt_temp/part-00000-of-00001') [Op:MultiDeviceIteratorInit]
This is my try in saving the checkpoints.
checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create checkpoint callback
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

# Fit model
model.fit(dataset, epochs=EPOCHS, callbacks=[cp_callback])
In some training runs the model gets through about 5 epochs before this error occurs, while in others it occurs within just one or two epochs. I am using a TPU to train the model.
What causes this issue and is there a way to get rid of it?
Any help will be highly appreciated.
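One note in case it helps: the '[local]' scheme error usually means the TPU workers cannot write to the VM-local file system, so checkpoints generally need to go to a Google Cloud Storage path instead. A minimal sketch, with my-bucket as a hypothetical bucket name:

checkpoint_path = "gs://my-bucket/training_1/cp.ckpt"  # GCS path the TPU workers can reach

cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

model.fit(dataset, epochs=EPOCHS, callbacks=[cp_callback])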
I am using TensorFlow version 2.0.0 and Keras version 2.3.0 to develop the model. Here's how I set the seeds before training:
seed = 1234
random.seed(seed)
np.random.seed(seed)
tf.compat.v1.random.set_random_seed(seed)
I then save the entire model as instructed here:
model.save('some_model_name.h5')
I am getting an accuracy of about 95% during training. When I load the model from a different Python session, like:
# Recreate the exact same model
new_model = load_model('some_model_name.h5', custom_objects={'SeqSelfAttention': SeqSelfAttention})
score = new_model.evaluate([x_img_train, x_txt_train], y_train, verbose=2)
print("%s: %.2f%%" % (new_model.metrics_names[1], score[1]*100))
The accuracy now is about 4%. Please note that I have batch norm and dropout layers. How can I make the predictions of my model consistent across different sessions?
Firstly, I downgraded the TensorFlow version to 1.13.1, owing to stability issues with 2.0.0.
Secondly, I had to ensure a few things before I could achieve some level of reproducibility:
Using the Adagrad optimizer instead of Adam gave me performance comparable to the training session; with Adam, every time I loaded the model I got high variance in the predictions.
Loading the architecture from JSON and then loading the weights gave different results than saving and loading the weights only; the former approach produced performance comparable to training.
Training with an explicit tf.Session, saving it, and reloading that session in a new Python session did the trick (a sketch of this follows below).
There was no variation in the results with or without dropout or batch norm.
Please note that following these steps gave me some level of consistency, although it's not 100% reproducible. If you're facing a similar issue, perhaps these insights will help.
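Roughly, what I mean by the tf.Session approach (a sketch from memory for TF 1.13; build_model() stands in for the actual architecture code):

import tensorflow as tf
from keras import backend as K

# Training: bind Keras to an explicit session so we control which graph
# the variables live in, then save everything with a tf.train.Saver.
sess = tf.Session()
K.set_session(sess)
model = build_model()                         # placeholder for the real model code
model.fit(x_train, y_train, epochs=10)
saver = tf.train.Saver()
saver.save(sess, './checkpoints/model.ckpt')

# In a new Python session: rebuild the same graph, then restore the
# saved variables into it.
sess = tf.Session()
K.set_session(sess)
model = build_model()
saver = tf.train.Saver()
saver.restore(sess, './checkpoints/model.ckpt')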
After loading the model in a new kernel instance, make sure to configure the losses and metrics again with .compile(), in the same way you did before saving.
For example:
old_model = tf.keras.Sequential([ ... ])
old_model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])
old_model.fit(train_ds, validation_data=valid_ds, epochs=3)
old_model.evaluate(test_ds)
old_model.save('some_model_name.h5')
Then in the new kernel:
from tensorflow.keras.models import load_model
new_model = load_model("some_model_name.h5")
new_model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])
new_model.evaluate(test_ds) # should be the same now
I have trained a Keras (with Tensorflow backend) model which has two outputs with a custom loss function. I need help in loading the model from disk using the custom_objects argument.
When compiling the model I have used the loss and loss_weights argument as follows:
losses = {
    'output_layer_1': custom_loss_fn,
    'output_layer_2': custom_loss_fn
}
loss_weights = {
    'output_layer_1': 1.0,
    'output_layer_2': 1.0
}
model.compile(loss=losses, loss_weights=loss_weights, optimizer=opt)
The model is training without any problems. I save the model as follows:
model.save(model_path)
The reason I haven't defined custom_loss_fn here is that it is defined inside another custom Keras layer.
My question is how to load the model persisted to disk for inference. If it were a single-output model, I would load it using custom_objects as described in this Stack Overflow question: Loading model with custom loss + keras
model = keras.models.load_model(model_path, custom_objects={'custom_loss_fn':custom_loss_fn})
But how to extend this in my case where I have two outputs with the losses and loss weights defined in a dictionary along with a custom loss function?
In other words, how should custom_objects be populated in this case where losses and loss_weights are defined as dictionaries?
I'm using Keras v2.1.6 with Tensorflow backend v1.8.0.
If you can recompile the model on the loading side, the easiest way is to save just the weights with model.save_weights(). If you want to use save_model and have custom Keras layers, be sure they implement the get_config method (see this reference).
As for the ops without gradient, I have seen this when mixing TensorFlow and Keras without properly using the keras.backend functions, but I can't help any more without the model code itself.
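For illustration, a custom layer with constructor arguments would implement get_config roughly like this (WeightedSum and alpha are made-up names, not from the question):

from keras.engine.topology import Layer  # keras.layers.Layer in newer versions

class WeightedSum(Layer):
    def __init__(self, alpha=0.5, **kwargs):
        self.alpha = alpha
        super(WeightedSum, self).__init__(**kwargs)

    def call(self, inputs):
        a, b = inputs
        return self.alpha * a + (1.0 - self.alpha) * b

    def get_config(self):
        # Include every constructor argument so load_model can rebuild the layer.
        config = super(WeightedSum, self).get_config()
        config['alpha'] = self.alpha
        return config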
I have Keras with the TensorFlow backend running on a GPU. However, since I am training an LSTM, I am training on the CPU instead.
with tf.device('/cpu:0'):
    model = Sequential()
    model.add(Bidirectional(LSTM(50, return_sequences=True), input_shape=(50, len(train_x[0][0]))))
    model.add(TimeDistributed(Dense(1, activation='sigmoid')))
    model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['acc'])
The problem I have is that when I save and load the model, the predict function for the loaded model performs very slowly. After some timed tests, I believe the loaded model is running on the GPU rather than the CPU, which makes it slow. I tried compiling the loaded model on the CPU; however, this does not speed things up:
model.save('test_model.h5')
new_model = load_model('test_model.h5')

with tf.device('/cpu:0'):
    new_model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['acc'])
Is there a way to achieve the same speeds with the loaded model as with the newly trained model? The newly trained model is almost five times faster. Thanks for your help.
Load the model with the device you want to use:
with tf.device('/cpu:0'):
    new_model = load_model('test_model.h5')
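In graph mode, device placement is decided when the ops are created, so the model has to be built or loaded inside the tf.device scope; compiling afterwards does not move ops that already exist on another device.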
I have several neural networks built with Keras that I have so far used mostly in Jupyter. I often save scikit-learn models with joblib and Keras models as JSON + HDF5 and use them in other notebooks without issue.
I made a Python Spark application that can use those serialized models in cluster mode. The joblib models work fine; however, I ran into an issue with Keras.
Here is the model used in notebook and pyspark:
def build_gru_model():
    model = Sequential()
    model.add(Embedding(max_nb_words, 128, input_length=max_sequence_length, dropout=0.2))
    model.add(GRU(128, dropout_W=0.2, dropout_U=0.2))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
Both are called the same way:
preds = model.predict_proba(data, verbose=0)
However, only in Spark do I get the error:
MissingInputError: ("An input of the graph, used to compute DimShuffle{x,x,x,x}(keras_learning_phase), was not provided and not given a value.Use the Theano flag exception_verbosity='high',for more information on this error.", keras_learning_phase)
I've done the mandatory search and found: https://github.com/fchollet/keras/issues/2430 which points to https://keras.io/getting-started/faq/
If I remove dropout from my model, it does indeed work. However, I fail to understand how to implement something that would let me keep dropout during the training phase, as described in the FAQ.
Based on the model code, how would one accomplish this?
You can try putting the following before your prediction:
import keras.backend as K
K.set_learning_phase(0)
It should set your learning phase to 0 (test time).
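In my experience it is safest to call it before the model is loaded or rebuilt, so the graph is constructed with the learning phase already fixed, e.g. (load_trained_model is a placeholder for however you deserialize the model):

import keras.backend as K

K.set_learning_phase(0)           # fix the phase before the model graph is built
model = load_trained_model()      # placeholder: json + hdf5 loading as in the question
preds = model.predict_proba(data, verbose=0)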