This question already has answers here:
Keras not training on entire dataset
(3 answers)
TensorFlow Only running on 1/32 of the Training data provided [duplicate]
(1 answer)
Closed 2 years ago.
I ran into a problem when playing with the introduction Tutorial for Tensorflow 2.0 Keras (https://www.tensorflow.org/tutorials/keras/classification).
The Problem:
There should be (and there are) 60.000 Images to fit the model. I checked this by printing out the length of train_images and train_labels.
The output when fitting the model on the other hand lets me believe that not all of the data was used as it says 1875/1875. Same for the Testing Data.
I deactivated the GPU Detection which does not have an effect on this (so it seems).
I'm using:
Python 3.8.3
Tensorflow 2.2.0
My Code:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
data = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = data.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
# preprocess the image data to have a pixel value between 0 and 1
train_images = train_images / 255.0
test_images = test_images / 255.0
model = keras.Sequential([
keras.layers.Flatten(input_shape=(28, 28)),
keras.layers.Dense(128, activation='relu'),
keras.layers.Dense(10)
])
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=10)
test_loss, test_acc = model.evaluate(test_images, test_labels, verbose=2)
print('\nTest accuracy:', test_acc)
Output:
2020-05-17 17:48:07.147033: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-05-17 17:48:10.075816: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-05-17 17:48:10.098581: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-05-17 17:48:10.105898: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-UU9P1OG
2020-05-17 17:48:10.109837: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-UU9P1OG
2020-05-17 17:48:10.113879: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-05-17 17:48:10.127711: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x14dc97288a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-17 17:48:10.132743: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Epoch 1/10
1875/1875 [==============================] - 2s 1ms/step - loss: 0.4943 - accuracy: 0.8264
Epoch 2/10
1875/1875 [==============================] - 2s 938us/step - loss: 0.3747 - accuracy: 0.8649
Epoch 3/10
1875/1875 [==============================] - 2s 929us/step - loss: 0.3403 - accuracy: 0.8762
Epoch 4/10
1875/1875 [==============================] - 2s 914us/step - loss: 0.3146 - accuracy: 0.8844
Epoch 5/10
1875/1875 [==============================] - 2s 937us/step - loss: 0.2985 - accuracy: 0.8900
Epoch 6/10
1875/1875 [==============================] - 2s 923us/step - loss: 0.2808 - accuracy: 0.8964
Epoch 7/10
1875/1875 [==============================] - 2s 939us/step - loss: 0.2702 - accuracy: 0.8998
Epoch 8/10
1875/1875 [==============================] - 2s 911us/step - loss: 0.2585 - accuracy: 0.9032
Epoch 9/10
1875/1875 [==============================] - 2s 918us/step - loss: 0.2482 - accuracy: 0.9073
Epoch 10/10
1875/1875 [==============================] - 2s 931us/step - loss: 0.2412 - accuracy: 0.9106
313/313 - 0s - loss: 0.3484 - accuracy: 0.8729
Test accuracy: 0.8729000091552734
The model is being trained with a batchsize of 32, hence there are 60,000/32 = 1875 batches.
Despite tensorflow documentation shows batch_size=None in the fit function overview, the information about this argument says:
batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32. Do not specify the batch_size if your data is in the form of datasets, generators, or keras.utils.Sequence instances (since they generate batches).
Related
So I've been following Google's official tensorflow guide and trying to build a simple neural network using Keras. But when it comes to training the model, it does not use the entire dataset (with 60000 entries) and instead uses only 1875 entries for training. Any possible fix?
import tensorflow as tf
from tensorflow import keras
import numpy as np
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images = train_images / 255.0
test_images = test_images / 255.0
class_names = ['T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot']
model = keras.Sequential([
keras.layers.Flatten(input_shape=(28, 28)),
keras.layers.Dense(128, activation='relu'),
keras.layers.Dense(10)
])
model.compile(optimizer='adam',
loss= tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=10)
Output:
Epoch 1/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3183 - accuracy: 0.8866
Epoch 2/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3169 - accuracy: 0.8873
Epoch 3/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3144 - accuracy: 0.8885
Epoch 4/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3130 - accuracy: 0.8885
Epoch 5/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3110 - accuracy: 0.8883
Epoch 6/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3090 - accuracy: 0.8888
Epoch 7/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3073 - accuracy: 0.8895
Epoch 8/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3057 - accuracy: 0.8900
Epoch 9/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3040 - accuracy: 0.8905
Epoch 10/10
1875/1875 [==============================] - 3s 2ms/step - loss: 0.3025 - accuracy: 0.8915
<tensorflow.python.keras.callbacks.History at 0x7fbe0e5aebe0>
Here's the original google colab notebook where I've been working on this: https://colab.research.google.com/drive/1NdtzXHEpiNnelcMaJeEm6zmp34JMcN38
The number 1875 shown during fitting the model is not the training samples; it is the number of batches.
model.fit includes an optional argument batch_size, which, according to the documentation:
If unspecified, batch_size will default to 32.
So, what happens here is - you fit with the default batch size of 32 (since you have not specified anything different), so the total number of batches for your data is
60000/32 = 1875
It does not train on 1875 samples.
Epoch 1/10
1875/1875 [===
1875 here is the number of steps, not samples. In fit method, there is an argument, batch_size. The default value for it is 32. So 1875*32=60000. The implementation is correct.
If you train it with batch_size=16, you will see the number of steps will be 3750 instead of 1875, since 60000/16=3750.
Just use batch_size = 1, if you want the entire 60000 data samples to be visible.
I am working on project with Keras Captcha OCR model. This model is about text Captcha recognition with CTC encoded output, apart form combining CNN and RNN.
I am trying to see the accuracy number from training output. How can I get the number of accuracy and validation accuracy?
Here is the training code form keras model:
epochs = 100
early_stopping_patience = 10
# Add early stopping
early_stopping = keras.callbacks.EarlyStopping(
monitor="val_loss", patience=early_stopping_patience, restore_best_weights=True
)
# Train the model
history = model.fit(
train_dataset,
validation_data=validation_dataset,
epochs=epochs,
callbacks=[early_stopping],
)
And this is the training output:
Epoch 1/100
59/59 [==============================] - 3s 53ms/step - loss: 21.5722 - val_loss: 16.3351
Epoch 2/100
59/59 [==============================] - 2s 27ms/step - loss: 16.3335 - val_loss: 16.3062
Epoch 3/100
59/59 [==============================] - 2s 27ms/step - loss: 16.3360 - val_loss: 16.3116
Epoch 4/100
59/59 [==============================] - 2s 27ms/step - loss: 16.3318 - val_loss: 16.3167
Epoch 5/100
Before calling model.fit just specify the metric you want to compute during training, in this case accuracy:
model.compile(optimizer= your_optimizer, loss= your_loss, metrics=['acc'])
Link to the documentation.
I am trying to train a simple convolutional network using Keras (Tensorflow 2.8.0) in Python 3.7.9 on Spyder IDE 5.2.2. The network involves Conv2D, MaxPooling2D, Flatten and Dense layers.
The model ran perfectly when I used my CPU, but training was slow. So I decided to try to run it on my GPU (GeForce GTX 1050 Ti).
I installed CUDA 11.2 and added lib, include and bin to my path. I installed CuDNN 8.1, and copied cudnn84_8.dll into the CUDA bin directory, also copied cudnn.h into CUDA header and cudnn.lib into CUDA lib.
Once I had done the above, I was able to use my GPU with Tensorflow. Tensorflow does recognize my GPU: when I run tf.config.list_physical_devices() I get:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
However, when I run the model containing Conv2D layers using my GPU, it fails as follows. This is the output I get:
2022-03-17 18:32:04.687276: I
tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow
binary is optimized with oneAPI Deep Neural Network Library (oneDNN)
to use the following CPU instructions in performance-critical
operations: AVX AVX2 To enable them in other operations, rebuild
TensorFlow with the appropriate compiler flags.
2022-03-17 18:32:05.171943: I
tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device
/job:localhost/replica:0/task:0/device:GPU:0 with 2782 MB memory: ->
device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0,
compute capability: 6.1
Epoch 1/10
This is where you would normally see the progress bars for the training (verbose=1) but nothing further appears after "Epoch 1/10". Instead, I think the kernel restarts, because it clears all saved variables and I have to re-import all packages.
If I train a model using only Dense layers, it does work using the GPU. I have checked that it is actually using the GPU in this case, by monitoring GPU usage - it does use it. And as mentioned, the Conv2D model works fine on my CPU.
So in summary, I seem to have a problem selective for Conv2D models on my GPU. Any help in understanding this would be greatly appreciated!
My code:
from matplotlib import pyplot
import tensorflow as tf
import numpy as np
import keras
from keras.datasets import cifar10
#from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dense
from keras.layers import Flatten
from tensorflow.keras.optimizers import SGD
def load_dataset():
(trainX, trainy), (testX, testy) = cifar10.load_data()
print('Train shape: X=%s, y=%s' % (trainX.shape, trainy.shape))
print('Test shape: X=%s, y=%s' % (testX.shape, testy.shape))
trainY = tf.keras.utils.to_categorical(trainy)
testY = tf.keras.utils.to_categorical(testy)
return trainX, trainY, testX, testY
def prep_pixels(train, test):
# convert from integers to floats
train_norm = train.astype('float32')
test_norm = test.astype('float32')
# normalize to range 0-1
train_norm = train_norm / 255.0
test_norm = test_norm / 255.0
# return normalized images
return train_norm, test_norm
def define_model():
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3))) #kernel_initializer='he_uniform',
model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
# compile model
opt = SGD(learning_rate=0.001, momentum=0.9)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
return model
def summarize_diagnostics(history):
# plot loss
pyplot.subplot(211)
pyplot.title('Cross Entropy Loss')
pyplot.plot(history.history['loss'], color='blue', label='train')
pyplot.plot(history.history['val_loss'], color='orange', label='test')
# plot accuracy
pyplot.subplot(212)
pyplot.title('Classification Accuracy')
pyplot.plot(history.history['accuracy'], color='blue', label='train')
pyplot.plot(history.history['val_accuracy'], color='orange', label='test')
pyplot.show()
def run_test_harness():
trainX, trainY, testX, testY = load_dataset()
trainX, testX = prep_pixels(trainX, testX)
model = define_model()
history = model.fit(trainX, trainY, epochs=10, batch_size=64, validation_data=(testX, testY))
_, acc = model.evaluate(testX, testY)
print('> %.3f' % (acc * 100.0)) #overall validation accuracy
summarize_diagnostics(history)
tf.config.list_physical_devices()
run_test_harness()
This could be possible due to less RAM available for running this code using GPU in your system.
This code works fine when I tried replicating the same in Google colab using gpu mode, However cpu mode, takes more time to run the code comparatively:
tf.config.list_physical_devices()
Output:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
To run the code:
run_test_harness()
Output:
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170500096/170498071 [==============================] - 2s 0us/step
170508288/170498071 [==============================] - 2s 0us/step
Train shape: X=(50000, 32, 32, 3), y=(50000, 1)
Test shape: X=(10000, 32, 32, 3), y=(10000, 1)
Epoch 1/10
782/782 [==============================] - 22s 10ms/step - loss: 1.9416 - accuracy: 0.3059 - val_loss: 1.7283 - val_accuracy: 0.3935
Epoch 2/10
782/782 [==============================] - 7s 9ms/step - loss: 1.6194 - accuracy: 0.4278 - val_loss: 1.5163 - val_accuracy: 0.4528
Epoch 3/10
782/782 [==============================] - 8s 10ms/step - loss: 1.4575 - accuracy: 0.4835 - val_loss: 1.3856 - val_accuracy: 0.5099
Epoch 4/10
782/782 [==============================] - 6s 8ms/step - loss: 1.3539 - accuracy: 0.5213 - val_loss: 1.3111 - val_accuracy: 0.5348
Epoch 5/10
782/782 [==============================] - 5s 7ms/step - loss: 1.2553 - accuracy: 0.5589 - val_loss: 1.2233 - val_accuracy: 0.5658
Epoch 6/10
782/782 [==============================] - 5s 6ms/step - loss: 1.1763 - accuracy: 0.5865 - val_loss: 1.1691 - val_accuracy: 0.5841
Epoch 7/10
782/782 [==============================] - 5s 6ms/step - loss: 1.1091 - accuracy: 0.6115 - val_loss: 1.1284 - val_accuracy: 0.6005
Epoch 8/10
782/782 [==============================] - 5s 6ms/step - loss: 1.0402 - accuracy: 0.6344 - val_loss: 1.0932 - val_accuracy: 0.6183
Epoch 9/10
782/782 [==============================] - 5s 6ms/step - loss: 0.9842 - accuracy: 0.6546 - val_loss: 1.0825 - val_accuracy: 0.6255
Epoch 10/10
782/782 [==============================] - 5s 6ms/step - loss: 0.9350 - accuracy: 0.6742 - val_loss: 1.0722 - val_accuracy: 0.6256
313/313 [==============================] - 1s 3ms/step - loss: 1.0722 - accuracy: 0.6256
> 62.560
Please check again in Google colab by selecting the GPU
(Runtime - Change runtime type - Hardware accelerator - GPU - Save)
and let us know if issue still persists.
To do some parameter tuning, I like to loop over some training function with Keras. However, I realized that when using tensorflow.keras.metrics.AUC() as a metric, for every training loop, an integer gets added to the auc metric name (e.g. auc_1, auc_2, ...). So actually the keras metrics are somehow stored even when coming out of the training function.
This first of all leads to the callbacks not recognizing the metric anymore and also makes me wonder if there are not other things stored like the model weights.
How can I reset the metrics and are there other things that get stored by keras that I need to reset to get a clean restart for training?
Below you can find a minimal working example:
edit: this example seems to only work with tensorflow 2.2
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.metrics import AUC
def dummy_network(input_shape):
model = keras.Sequential()
model.add(keras.layers.Dense(10,
input_shape=input_shape,
activation=tf.nn.relu,
kernel_initializer='he_normal',
kernel_regularizer=keras.regularizers.l2(l=1e-3)))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(11, activation='sigmoid'))
model.compile(optimizer='adagrad',
loss='binary_crossentropy',
metrics=[AUC()])
return model
def train():
CB_lr = tf.keras.callbacks.ReduceLROnPlateau(
monitor="val_auc",
patience=3,
verbose=1,
mode="max",
min_delta=0.0001,
min_lr=1e-6)
CB_es = tf.keras.callbacks.EarlyStopping(
monitor="val_auc",
min_delta=0.00001,
verbose=1,
patience=10,
mode="max",
restore_best_weights=True)
callbacks = [CB_lr, CB_es]
y = [np.ones((11, 1)) for _ in range(1000)]
x = [np.ones((37, 12, 1)) for _ in range(1000)]
dummy_dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size=100).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size=100).repeat()
model = dummy_network(input_shape=((37, 12, 1)))
model.fit(dummy_dataset, validation_data=val_dataset, epochs=2,
steps_per_epoch=len(x) // 100,
validation_steps=len(x) // 100, callbacks=callbacks)
for i in range(3):
print(f'\n\n **** Loop {i} **** \n\n')
train()
The output is:
**** Loop 0 ****
2020-06-16 14:37:46.621264: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f991e541f10 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-16 14:37:46.621296: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Epoch 1/2
10/10 [==============================] - 0s 44ms/step - loss: 0.1295 - auc: 0.0000e+00 - val_loss: 0.0310 - val_auc: 0.0000e+00 - lr: 0.0010
Epoch 2/2
10/10 [==============================] - 0s 10ms/step - loss: 0.0262 - auc: 0.0000e+00 - val_loss: 0.0223 - val_auc: 0.0000e+00 - lr: 0.0010
**** Loop 1 ****
Epoch 1/2
10/10 [==============================] - ETA: 0s - loss: 0.4751 - auc_1: 0.0000e+00WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_auc` which is not available. Available metrics are: loss,auc_1,val_loss,val_auc_1,lr
WARNING:tensorflow:Early stopping conditioned on metric `val_auc` which is not available. Available metrics are: loss,auc_1,val_loss,val_auc_1,lr
10/10 [==============================] - 0s 36ms/step - loss: 0.4751 - auc_1: 0.0000e+00 - val_loss: 0.3137 - val_auc_1: 0.0000e+00 - lr: 0.0010
Epoch 2/2
10/10 [==============================] - ETA: 0s - loss: 0.2617 - auc_1: 0.0000e+00WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_auc` which is not available. Available metrics are: loss,auc_1,val_loss,val_auc_1,lr
WARNING:tensorflow:Early stopping conditioned on metric `val_auc` which is not available. Available metrics are: loss,auc_1,val_loss,val_auc_1,lr
10/10 [==============================] - 0s 10ms/step - loss: 0.2617 - auc_1: 0.0000e+00 - val_loss: 0.2137 - val_auc_1: 0.0000e+00 - lr: 0.0010
**** Loop 2 ****
Epoch 1/2
10/10 [==============================] - ETA: 0s - loss: 0.1948 - auc_2: 0.0000e+00WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_auc` which is not available. Available metrics are: loss,auc_2,val_loss,val_auc_2,lr
WARNING:tensorflow:Early stopping conditioned on metric `val_auc` which is not available. Available metrics are: loss,auc_2,val_loss,val_auc_2,lr
10/10 [==============================] - 0s 34ms/step - loss: 0.1948 - auc_2: 0.0000e+00 - val_loss: 0.0517 - val_auc_2: 0.0000e+00 - lr: 0.0010
Epoch 2/2
10/10 [==============================] - ETA: 0s - loss: 0.0445 - auc_2: 0.0000e+00WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_auc` which is not available. Available metrics are: loss,auc_2,val_loss,val_auc_2,lr
WARNING:tensorflow:Early stopping conditioned on metric `val_auc` which is not available. Available metrics are: loss,auc_2,val_loss,val_auc_2,lr
10/10 [==============================] - 0s 10ms/step - loss: 0.0445 - auc_2: 0.0000e+00 - val_loss: 0.0389 - val_auc_2: 0.0000e+00 - lr: 0.0010
Your reproducible example failed in several places for me, so I changed just a few things (I'm using TF 2.1). After getting it to run, I was able to get rid of the additional metric names by specifying metrics=[AUC(name='auc')]. Here's the full (fixed) reproducible example:
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.metrics import AUC
def dummy_network(input_shape):
model = keras.Sequential()
model.add(keras.layers.Dense(10,
input_shape=input_shape,
activation=tf.nn.relu,
kernel_initializer='he_normal',
kernel_regularizer=keras.regularizers.l2(l=1e-3)))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(11, activation='softmax'))
model.compile(optimizer='adagrad',
loss='binary_crossentropy',
metrics=[AUC(name='auc')])
return model
def train():
CB_lr = tf.keras.callbacks.ReduceLROnPlateau(
monitor="val_auc",
patience=3,
verbose=1,
mode="max",
min_delta=0.0001,
min_lr=1e-6)
CB_es = tf.keras.callbacks.EarlyStopping(
monitor="val_auc",
min_delta=0.00001,
verbose=1,
patience=10,
mode="max",
restore_best_weights=True)
callbacks = [CB_lr, CB_es]
y = tf.keras.utils.to_categorical([np.random.randint(0, 11) for _ in range(1000)])
x = [np.ones((37, 12, 1)) for _ in range(1000)]
dummy_dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size=100).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size=100).repeat()
model = dummy_network(input_shape=((37, 12, 1)))
model.fit(dummy_dataset, validation_data=val_dataset, epochs=2,
steps_per_epoch=len(x) // 100,
validation_steps=len(x) // 100, callbacks=callbacks)
for i in range(3):
print(f'\n\n **** Loop {i} **** \n\n')
train()
Train for 10 steps, validate for 10 steps
Epoch 1/2
1/10 [==>...........................] - ETA: 6s - loss: 0.3426 - auc: 0.4530
7/10 [====================>.........] - ETA: 0s - loss: 0.3318 - auc: 0.4895
10/10 [==============================] - 1s 117ms/step - loss: 0.3301 -
auc: 0.4893 - val_loss: 0.3222 - val_auc: 0.5085
This happens because every loop, you created a new metric without a specified name by doing this: metrics=[AUC()]. The first iteration of the loop, TF automatically created a variable in the name space called auc, but at the second iteration of your loop, the name 'auc' was already taken, so TF named it auc_1 since you didn't specify a name. But, your callback was set to be based on auc, which is a metric that this model didn't have (it was the metric of the model from the previous loop). So, you either do name='auc' and overwrite the previous metric name, or define it outside of the loop, like this:
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.metrics import AUC
auc = AUC()
def dummy_network(input_shape):
model = keras.Sequential()
model.add(keras.layers.Dense(10,
input_shape=input_shape,
activation=tf.nn.relu,
kernel_initializer='he_normal',
kernel_regularizer=keras.regularizers.l2(l=1e-3)))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(11, activation='softmax'))
model.compile(optimizer='adagrad',
loss='binary_crossentropy',
metrics=[auc])
return model
And don't worry about keras resetting the metrics. It takes care of all that in the fit() method. If you want more flexibility and/or do it yourself, I suggest using custom training loops, and reset it yourself:
auc = tf.keras.metrics.AUC()
auc.update_state(np.random.randint(0, 2, 10), np.random.randint(0, 2, 10))
print(auc.result())
auc.reset_states()
print(auc.result())
Out[6]: <tf.Tensor: shape=(), dtype=float32, numpy=0.875> # state updated
Out[8]: <tf.Tensor: shape=(), dtype=float32, numpy=0.0> # state reset
The multi_gpu_model in tf.keras seems to be much slower than the one in keras. For the example given here, it is about 12x slower when importing from tensorflow.keras instead of keras.
import tensorflow as tf
from tensorflow.keras.applications import Xception
from tensorflow.keras.utils import multi_gpu_model
import numpy as np
num_samples = 1000
height = 224
width = 224
num_classes = 1000
# Instantiate the base model (or "template" model).
# We recommend doing this with under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
model = Xception(weights=None,
input_shape=(height, width, 3),
classes=num_classes)
# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
The only changes are
from tensorflow.keras.applications import Xception
from tensorflow.keras.utils import multi_gpu_model
instead of
from keras.applications import Xception
from keras.utils import multi_gpu_model
With tf.keras
Epoch 1/20
1000/1000 [==============================] - 78s 78ms/step - loss: 3487.2197
Epoch 2/20
1000/1000 [==============================] - 37s 37ms/step - loss: 3454.2403
Epoch 3/20
1000/1000 [==============================] - 37s 37ms/step - loss: 3453.6264
Epoch 4/20
1000/1000 [==============================] - 37s 37ms/step - loss: 3452.7994
Epoch 5/20
1000/1000 [==============================] - 37s 37ms/step - loss: 3452.3592
Importing directly from keras
Epoch 1/20
1000/1000 [==============================] - 52s 52ms/step - loss: 3486.8955
Epoch 2/20
1000/1000 [==============================] - 3s 3ms/step - loss: 3454.1935
Epoch 3/20
1000/1000 [==============================] - 3s 3ms/step - loss: 3453.5585
Epoch 4/20
1000/1000 [==============================] - 3s 3ms/step - loss: 3452.8249
Epoch 5/20
1000/1000 [==============================] - 3s 3ms/step - loss: 3452.1542
That is like a 12x speed difference from the second epoch onwards.
I am using latest keras 2.2.4 and tensorflow-gpu-1.10